* [RFC][PATCH 0/5] arch: atomic rework
@ 2014-02-06 13:48 Peter Zijlstra
  2014-02-06 13:48 ` [RFC][PATCH 1/5] ia64: Fix up smp_mb__{before,after}_clear_bit Peter Zijlstra
                   ` (6 more replies)
  0 siblings, 7 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-06 13:48 UTC (permalink / raw)
  To: linux-arch, linux-kernel
  Cc: torvalds, akpm, mingo, will.deacon, paulmck, Peter Zijlstra

Hi all,

A few too-large patches here, mostly as an RFC to see if we want to continue
with this before I sink more time into it. I hope they make it out to the lists.

This all started with me wanting to implement atomic_sub_release() for all
archs, but I got side-tracked a bit and ended up cleaning up various bits
and deleting almost 1400 lines of code.

It's been compiled on everything I have a compiler for; however, frv and
tile are missing because they're special and I was tired.
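
For reference, the generic flavour of the atomic_sub_release() I'm after is
something like the sketch below (illustration only; the real per-arch
versions would use whatever release-ordered primitives the hardware has):

	/*
	 * Generic fallback sketch: a full barrier before the op is
	 * (more than) enough for release ordering.
	 */
	#ifndef atomic_sub_release
	#define atomic_sub_release(i, v)	\
	do {					\
		smp_mb();			\
		atomic_sub((i), (v));		\
	} while (0)
	#endif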

---
 Documentation/atomic_ops.txt                      |   31 -
 Documentation/memory-barriers.txt                 |   44 -
 a/arch/arc/include/asm/barrier.h                  |   37 -
 a/arch/hexagon/include/asm/barrier.h              |   37 -
 arch/alpha/include/asm/atomic.h                   |  225 +++-----
 arch/alpha/include/asm/bitops.h                   |    3 
 arch/arc/include/asm/atomic.h                     |  198 ++-----
 arch/arc/include/asm/bitops.h                     |    5 
 arch/arm/include/asm/atomic.h                     |  301 ++++-------
 arch/arm/include/asm/barrier.h                    |    3 
 arch/arm/include/asm/bitops.h                     |    4 
 arch/arm64/include/asm/atomic.h                   |  212 +++-----
 arch/arm64/include/asm/barrier.h                  |    3 
 arch/arm64/include/asm/bitops.h                   |    9 
 arch/avr32/include/asm/atomic.h                   |   95 +--
 arch/avr32/include/asm/bitops.h                   |    9 
 arch/blackfin/include/asm/atomic.h                |    5 
 arch/blackfin/include/asm/barrier.h               |    3 
 arch/blackfin/include/asm/bitops.h                |   14 
 arch/blackfin/mach-common/smp.c                   |    2 
 arch/c6x/include/asm/bitops.h                     |    8 
 arch/cris/include/arch-v10/arch/system.h          |    2 
 arch/cris/include/asm/atomic.h                    |   66 +-
 arch/cris/include/asm/bitops.h                    |    9 
 arch/frv/include/asm/atomic.h                     |    7 
 arch/frv/include/asm/bitops.h                     |    6 
 arch/hexagon/include/asm/atomic.h                 |   74 +-
 arch/hexagon/include/asm/bitops.h                 |    4 
 arch/ia64/include/asm/atomic.h                    |  212 ++++----
 arch/ia64/include/asm/barrier.h                   |    3 
 arch/ia64/include/asm/bitops.h                    |    9 
 arch/ia64/include/uapi/asm/cmpxchg.h              |    9 
 arch/m32r/include/asm/atomic.h                    |  191 ++-----
 arch/m32r/include/asm/bitops.h                    |    6 
 arch/m32r/kernel/smp.c                            |    4 
 arch/m68k/include/asm/atomic.h                    |  130 +----
 arch/m68k/include/asm/bitops.h                    |    7 
 arch/metag/include/asm/atomic.h                   |    6 
 arch/metag/include/asm/atomic_lnkget.h            |  159 +-----
 arch/metag/include/asm/atomic_lock1.h             |  100 +--
 arch/metag/include/asm/barrier.h                  |    3 
 arch/metag/include/asm/bitops.h                   |    6 
 arch/mips/include/asm/atomic.h                    |  570 +++++++---------------
 arch/mips/include/asm/barrier.h                   |    3 
 arch/mips/include/asm/bitops.h                    |   11 
 arch/mips/kernel/irq.c                            |    4 
 arch/mn10300/include/asm/atomic.h                 |  199 +------
 arch/mn10300/include/asm/bitops.h                 |    4 
 arch/mn10300/mm/tlb-smp.c                         |    6 
 arch/openrisc/include/asm/bitops.h                |    9 
 arch/parisc/include/asm/atomic.h                  |  121 ++--
 arch/parisc/include/asm/bitops.h                  |    4 
 arch/powerpc/include/asm/atomic.h                 |  214 +++-----
 arch/powerpc/include/asm/barrier.h                |    3 
 arch/powerpc/include/asm/bitops.h                 |    6 
 arch/powerpc/kernel/crash.c                       |    2 
 arch/powerpc/kernel/misc_32.S                     |   19 
 arch/s390/include/asm/atomic.h                    |   93 ++-
 arch/s390/include/asm/barrier.h                   |    5 
 arch/s390/include/asm/bitops.h                    |    1 
 arch/s390/kernel/time.c                           |    4 
 arch/s390/kvm/diag.c                              |    2 
 arch/s390/kvm/intercept.c                         |    2 
 arch/s390/kvm/interrupt.c                         |   16 
 arch/s390/kvm/kvm-s390.c                          |   14 
 arch/s390/kvm/sigp.c                              |    6 
 arch/score/include/asm/bitops.h                   |    7 
 arch/sh/include/asm/atomic-grb.h                  |  164 +-----
 arch/sh/include/asm/atomic-irq.h                  |   88 +--
 arch/sh/include/asm/atomic-llsc.h                 |  135 +----
 arch/sh/include/asm/atomic.h                      |    6 
 arch/sh/include/asm/bitops.h                      |    7 
 arch/sparc/include/asm/atomic_32.h                |   30 -
 arch/sparc/include/asm/atomic_64.h                |   53 --
 arch/sparc/include/asm/barrier_32.h               |    1 
 arch/sparc/include/asm/barrier_64.h               |    3 
 arch/sparc/include/asm/bitops_32.h                |    4 
 arch/sparc/include/asm/bitops_64.h                |    4 
 arch/sparc/include/asm/processor.h                |    2 
 arch/sparc/kernel/smp_64.c                        |    2 
 arch/sparc/lib/atomic32.c                         |   28 -
 arch/sparc/lib/atomic_64.S                        |  167 ++----
 arch/sparc/lib/ksyms.c                            |   20 
 arch/tile/include/asm/atomic_32.h                 |   10 
 arch/tile/include/asm/atomic_64.h                 |    6 
 arch/tile/include/asm/barrier.h                   |   14 
 arch/tile/include/asm/bitops.h                    |    1 
 arch/tile/include/asm/bitops_32.h                 |    8 
 arch/tile/include/asm/bitops_64.h                 |    4 
 arch/x86/include/asm/atomic.h                     |   39 -
 arch/x86/include/asm/atomic64_32.h                |   20 
 arch/x86/include/asm/atomic64_64.h                |   22 
 arch/x86/include/asm/barrier.h                    |    4 
 arch/x86/include/asm/bitops.h                     |    6 
 arch/x86/include/asm/sync_bitops.h                |    2 
 arch/x86/kernel/apic/hw_nmi.c                     |    2 
 arch/xtensa/include/asm/atomic.h                  |  318 +++---------
 arch/xtensa/include/asm/bitops.h                  |    4 
 block/blk-iopoll.c                                |    4 
 crypto/chainiv.c                                  |    2 
 drivers/base/power/domain.c                       |    2 
 drivers/block/mtip32xx/mtip32xx.c                 |    4 
 drivers/cpuidle/coupled.c                         |    2 
 drivers/firewire/ohci.c                           |    2 
 drivers/gpu/drm/drm_irq.c                         |   10 
 drivers/gpu/drm/i915/i915_irq.c                   |    6 
 drivers/md/bcache/bcache.h                        |    2 
 drivers/md/bcache/closure.h                       |    2 
 drivers/md/dm-bufio.c                             |    8 
 drivers/md/dm-snap.c                              |    4 
 drivers/md/dm.c                                   |    2 
 drivers/md/raid5.c                                |    2 
 drivers/media/usb/dvb-usb-v2/dvb_usb_core.c       |    6 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |    6 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |   34 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c    |   26 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c |   12 
 drivers/net/ethernet/broadcom/cnic.c              |    8 
 drivers/net/ethernet/brocade/bna/bnad.c           |    6 
 drivers/net/ethernet/chelsio/cxgb/cxgb2.c         |    2 
 drivers/net/ethernet/chelsio/cxgb3/sge.c          |    6 
 drivers/net/ethernet/chelsio/cxgb4/sge.c          |    2 
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c        |    2 
 drivers/net/ethernet/intel/i40e/i40e_main.c       |    2 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    4 
 drivers/net/wireless/ti/wlcore/main.c             |    2 
 drivers/pci/xen-pcifront.c                        |    4 
 drivers/s390/scsi/zfcp_aux.c                      |    2 
 drivers/s390/scsi/zfcp_erp.c                      |   68 +-
 drivers/s390/scsi/zfcp_fc.c                       |    8 
 drivers/s390/scsi/zfcp_fsf.c                      |   30 -
 drivers/s390/scsi/zfcp_qdio.c                     |   14 
 drivers/scsi/isci/remote_device.c                 |    2 
 drivers/target/loopback/tcm_loop.c                |    4 
 drivers/target/target_core_alua.c                 |   26 -
 drivers/target/target_core_device.c               |    6 
 drivers/target/target_core_iblock.c               |    2 
 drivers/target/target_core_pr.c                   |   56 +-
 drivers/target/target_core_transport.c            |   16 
 drivers/target/target_core_ua.c                   |   10 
 drivers/tty/n_tty.c                               |    2 
 drivers/tty/serial/mxs-auart.c                    |    4 
 drivers/usb/gadget/tcm_usb_gadget.c               |    4 
 drivers/usb/serial/usb_wwan.c                     |    2 
 drivers/vhost/scsi.c                              |    2 
 drivers/w1/w1_family.c                            |    4 
 drivers/xen/xen-pciback/pciback_ops.c             |    4 
 fs/btrfs/btrfs_inode.h                            |    2 
 fs/btrfs/extent_io.c                              |    2 
 fs/btrfs/inode.c                                  |    6 
 fs/buffer.c                                       |    2 
 fs/ext4/resize.c                                  |    2 
 fs/gfs2/glock.c                                   |    8 
 fs/gfs2/glops.c                                   |    2 
 fs/gfs2/lock_dlm.c                                |    4 
 fs/gfs2/recovery.c                                |    2 
 fs/gfs2/sys.c                                     |    4 
 fs/jbd2/commit.c                                  |    6 
 fs/nfs/dir.c                                      |   12 
 fs/nfs/inode.c                                    |    2 
 fs/nfs/nfs4filelayoutdev.c                        |    4 
 fs/nfs/nfs4state.c                                |    4 
 fs/nfs/pagelist.c                                 |    6 
 fs/nfs/pnfs.c                                     |    2 
 fs/nfs/pnfs.h                                     |    2 
 fs/nfs/write.c                                    |    4 
 fs/ubifs/lpt_commit.c                             |    4 
 fs/ubifs/tnc_commit.c                             |    4 
 include/asm-generic/atomic.h                      |  176 ++----
 include/asm-generic/atomic64.h                    |   17 
 include/asm-generic/barrier.h                     |    8 
 include/asm-generic/bitops.h                      |    8 
 include/asm-generic/bitops/atomic.h               |    2 
 include/asm-generic/bitops/lock.h                 |    2 
 include/linux/atomic.h                            |   13 
 include/linux/buffer_head.h                       |    2 
 include/linux/genhd.h                             |    2 
 include/linux/interrupt.h                         |    8 
 include/linux/netdevice.h                         |    2 
 include/linux/sched.h                             |    6 
 include/linux/sunrpc/sched.h                      |    8 
 include/linux/sunrpc/xprt.h                       |    8 
 include/linux/tracehook.h                         |    2 
 include/net/ip_vs.h                               |    4 
 kernel/debug/debug_core.c                         |    4 
 kernel/futex.c                                    |    2 
 kernel/kmod.c                                     |    2 
 kernel/rcu/tree.c                                 |   22 
 kernel/rcu/tree_plugin.h                          |    8 
 kernel/sched/cpupri.c                             |    6 
 kernel/sched/wait.c                               |    2 
 lib/atomic64.c                                    |   77 +-
 mm/backing-dev.c                                  |    2 
 mm/filemap.c                                      |    4 
 net/atm/pppoatm.c                                 |    2 
 net/bluetooth/hci_event.c                         |    4 
 net/core/dev.c                                    |    8 
 net/core/link_watch.c                             |    2 
 net/ipv4/inetpeer.c                               |    2 
 net/netfilter/nf_conntrack_core.c                 |    2 
 net/rds/ib_recv.c                                 |    4 
 net/rds/iw_recv.c                                 |    4 
 net/rds/send.c                                    |    6 
 net/rds/tcp_send.c                                |    2 
 net/sunrpc/auth.c                                 |    2 
 net/sunrpc/auth_gss/auth_gss.c                    |    2 
 net/sunrpc/backchannel_rqst.c                     |    4 
 net/sunrpc/xprt.c                                 |    4 
 net/sunrpc/xprtsock.c                             |   16 
 net/unix/af_unix.c                                |    2 
 sound/pci/bt87x.c                                 |    4 
 211 files changed, 2188 insertions(+), 3563 deletions(-)




* [RFC][PATCH 1/5] ia64: Fix up smp_mb__{before,after}_clear_bit
  2014-02-06 13:48 [RFC][PATCH 0/5] arch: atomic rework Peter Zijlstra
@ 2014-02-06 13:48 ` Peter Zijlstra
  2014-02-06 13:48 ` [RFC][PATCH 2/5] arc,hexagon: Delete asm/barrier.h Peter Zijlstra
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-06 13:48 UTC (permalink / raw)
  To: linux-arch, linux-kernel
  Cc: torvalds, akpm, mingo, will.deacon, paulmck, Peter Zijlstra

[-- Attachment #1: peterz-ia64-atomics.patch --]
[-- Type: text/plain, Size: 1692 bytes --]

IA64 doesn't actually have acquire/release barriers, it's a lie!

Add a comment explaining this and fix up the bitop barriers.
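
In practice this means the bitop barriers can be demoted to plain compiler
barriers: the cmpxchg loop underlying clear_bit() does a full fence in
practice anyway (see the comment added below). E.g. a bit-lock style
release (made-up names, illustration only):

	smp_mb__before_clear_bit();	/* now just barrier() on ia64 */
	clear_bit(IN_USE, &obj->flags);	/* the .acq cmpxchg fences anyway */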

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/ia64/include/asm/bitops.h       |    7 ++-----
 arch/ia64/include/uapi/asm/cmpxchg.h |    9 +++++++++
 2 files changed, 11 insertions(+), 5 deletions(-)

--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -65,11 +65,8 @@ __set_bit (int nr, volatile void *addr)
 	*((__u32 *) addr + (nr >> 5)) |= (1 << (nr & 31));
 }
 
-/*
- * clear_bit() has "acquire" semantics.
- */
-#define smp_mb__before_clear_bit()	smp_mb()
-#define smp_mb__after_clear_bit()	do { /* skip */; } while (0)
+#define smp_mb__before_clear_bit()	barrier();
+#define smp_mb__after_clear_bit()	barrier();
 
 /**
  * clear_bit - Clears a bit in memory
--- a/arch/ia64/include/uapi/asm/cmpxchg.h
+++ b/arch/ia64/include/uapi/asm/cmpxchg.h
@@ -118,6 +118,15 @@ extern long ia64_cmpxchg_called_with_bad
 #define cmpxchg_rel(ptr, o, n)	\
 	ia64_cmpxchg(rel, (ptr), (o), (n), sizeof(*(ptr)))
 
+/*
+ * Worse still - early processor implementations actually just ignored
+ * the acquire/release and did a full fence all the time.  Unfortunately
+ * this meant a lot of badly written code that used .acq when they really
+ * wanted .rel became legacy out in the wild - so when we made a cpu
+ * that strictly did the .acq or .rel ... all that code started breaking - so
+ * we had to back-pedal and keep the "legacy" behavior of a full fence :-(
+ */
+
 /* for compatibility with other platforms: */
 #define cmpxchg(ptr, o, n)	cmpxchg_acq((ptr), (o), (n))
 #define cmpxchg64(ptr, o, n)	cmpxchg_acq((ptr), (o), (n))




* [RFC][PATCH 2/5] arc,hexagon: Delete asm/barrier.h
  2014-02-06 13:48 [RFC][PATCH 0/5] arch: atomic rework Peter Zijlstra
  2014-02-06 13:48 ` [RFC][PATCH 1/5] ia64: Fix up smp_mb__{before,after}_clear_bit Peter Zijlstra
@ 2014-02-06 13:48 ` Peter Zijlstra
  2014-02-06 13:48 ` [RFC][PATCH 3/5] arch: s/smp_mb__(before|after)_(atomic|clear)_(dec,inc,bit)/smp_mb__\1/g Peter Zijlstra
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-06 13:48 UTC (permalink / raw)
  To: linux-arch, linux-kernel
  Cc: torvalds, akpm, mingo, will.deacon, paulmck, Peter Zijlstra

[-- Attachment #1: peterz-kill-arc-hexagon-barrier.patch --]
[-- Type: text/plain, Size: 2900 bytes --]

Both already use asm-generic/barrier.h as per their
include/asm/Kbuild. Remove the stale files.
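
(For reference, the Kbuild mechanism in question is a line of the form
below in each architecture's include/asm/Kbuild, which is what makes the
local headers dead weight:)

	generic-y += barrier.h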

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/arc/include/asm/barrier.h     |   37 -------------------------------------
 arch/hexagon/include/asm/barrier.h |   37 -------------------------------------
 2 files changed, 74 deletions(-)

--- a/arch/arc/include/asm/barrier.h
+++ /dev/null
@@ -1,37 +0,0 @@
-/*
- * Copyright (C) 2004, 2007-2010, 2011-2012 Synopsys, Inc. (www.synopsys.com)
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- */
-
-#ifndef __ASM_BARRIER_H
-#define __ASM_BARRIER_H
-
-#ifndef __ASSEMBLY__
-
-/* TODO-vineetg: Need to see what this does, don't we need sync anywhere */
-#define mb() __asm__ __volatile__ ("" : : : "memory")
-#define rmb() mb()
-#define wmb() mb()
-#define set_mb(var, value)  do { var = value; mb(); } while (0)
-#define set_wmb(var, value) do { var = value; wmb(); } while (0)
-#define read_barrier_depends()  mb()
-
-/* TODO-vineetg verify the correctness of macros here */
-#ifdef CONFIG_SMP
-#define smp_mb()        mb()
-#define smp_rmb()       rmb()
-#define smp_wmb()       wmb()
-#else
-#define smp_mb()        barrier()
-#define smp_rmb()       barrier()
-#define smp_wmb()       barrier()
-#endif
-
-#define smp_read_barrier_depends()      do { } while (0)
-
-#endif
-
-#endif
--- a/arch/hexagon/include/asm/barrier.h
+++ /dev/null
@@ -1,37 +0,0 @@
-/*
- * Memory barrier definitions for the Hexagon architecture
- *
- * Copyright (c) 2010-2011, The Linux Foundation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 and
- * only version 2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
- * 02110-1301, USA.
- */
-
-#ifndef _ASM_BARRIER_H
-#define _ASM_BARRIER_H
-
-#define rmb()				barrier()
-#define read_barrier_depends()		barrier()
-#define wmb()				barrier()
-#define mb()				barrier()
-#define smp_rmb()			barrier()
-#define smp_read_barrier_depends()	barrier()
-#define smp_wmb()			barrier()
-#define smp_mb()			barrier()
-
-/*  Set a value and use a memory barrier.  Used by the scheduler somewhere.  */
-#define set_mb(var, value) \
-	do { var = value; mb(); } while (0)
-
-#endif /* _ASM_BARRIER_H */




* [RFC][PATCH 3/5] arch: s/smp_mb__(before|after)_(atomic|clear)_(dec,inc,bit)/smp_mb__\1/g
  2014-02-06 13:48 [RFC][PATCH 0/5] arch: atomic rework Peter Zijlstra
  2014-02-06 13:48 ` [RFC][PATCH 1/5] ia64: Fix up smp_mb__{before,after}_clear_bit Peter Zijlstra
  2014-02-06 13:48 ` [RFC][PATCH 2/5] arc,hexagon: Delete asm/barrier.h Peter Zijlstra
@ 2014-02-06 13:48 ` Peter Zijlstra
  2014-02-06 19:12   ` Paul E. McKenney
  2014-02-06 13:48 ` [RFC][PATCH 4/5] arch: Generic atomic.h cleanup Peter Zijlstra
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-06 13:48 UTC (permalink / raw)
  To: linux-arch, linux-kernel
  Cc: torvalds, akpm, mingo, will.deacon, paulmck, Peter Zijlstra

[-- Attachment #1: peterz-arch-smp_mb__atomic.patch --]
[-- Type: text/plain, Size: 132147 bytes --]

Because atomic ops are implemented the same way across an architecture,
the current set of extra barriers:

  smp_mb__before_atomic_inc()
  smp_mb__after_atomic_inc()
  smp_mb__before_atomic_dec()
  smp_mb__after_atomic_dec()
  smp_mb__before_clear_bit()
  smp_mb__after_clear_bit()

is both incomplete and superfluous.

It is incomplete because there are far more atomic operations that do
not return values -- such as atomic_add(), set_bit(), etc. And it is
superfluous because, on any given architecture, they all expand to the
same thing anyway.

Simplify things by reducing the triplicate set into a single set of
barriers that is valid for all void atomic ops:

  smp_mb__before_atomic()
  smp_mb__after_atomic()
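
A typical conversion then looks like this (mirroring the
Documentation/atomic_ops.txt hunk below):

	obj->dead = 1;
	smp_mb__before_atomic();	/* was: smp_mb__before_atomic_dec() */
	atomic_dec(&obj->ref_count);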

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 Documentation/atomic_ops.txt                      |   31 ++++--------
 Documentation/memory-barriers.txt                 |   44 ++++-------------
 arch/alpha/include/asm/atomic.h                   |    5 -
 arch/alpha/include/asm/bitops.h                   |    3 -
 arch/arc/include/asm/atomic.h                     |    5 -
 arch/arc/include/asm/bitops.h                     |    5 -
 arch/arm/include/asm/atomic.h                     |    5 -
 arch/arm/include/asm/barrier.h                    |    3 +
 arch/arm/include/asm/bitops.h                     |    4 -
 arch/arm64/include/asm/atomic.h                   |    5 -
 arch/arm64/include/asm/barrier.h                  |    3 +
 arch/arm64/include/asm/bitops.h                   |    9 ---
 arch/avr32/include/asm/atomic.h                   |    5 -
 arch/avr32/include/asm/bitops.h                   |    9 ---
 arch/blackfin/include/asm/barrier.h               |    3 +
 arch/blackfin/include/asm/bitops.h                |   14 -----
 arch/c6x/include/asm/bitops.h                     |    8 ---
 arch/cris/include/asm/atomic.h                    |    7 --
 arch/cris/include/asm/bitops.h                    |    9 ---
 arch/frv/include/asm/atomic.h                     |    7 --
 arch/frv/include/asm/bitops.h                     |    6 --
 arch/hexagon/include/asm/atomic.h                 |    6 --
 arch/hexagon/include/asm/bitops.h                 |    4 -
 arch/ia64/include/asm/atomic.h                    |    6 --
 arch/ia64/include/asm/barrier.h                   |    3 +
 arch/ia64/include/asm/bitops.h                    |    6 --
 arch/m32r/include/asm/atomic.h                    |    7 --
 arch/m32r/include/asm/bitops.h                    |    6 --
 arch/m68k/include/asm/atomic.h                    |    8 ---
 arch/m68k/include/asm/bitops.h                    |    7 --
 arch/metag/include/asm/atomic.h                   |    6 --
 arch/metag/include/asm/barrier.h                  |    3 +
 arch/metag/include/asm/bitops.h                   |    6 --
 arch/mips/include/asm/atomic.h                    |    9 ---
 arch/mips/include/asm/barrier.h                   |    3 +
 arch/mips/include/asm/bitops.h                    |   11 ----
 arch/mips/kernel/irq.c                            |    4 -
 arch/mn10300/include/asm/atomic.h                 |    7 --
 arch/mn10300/include/asm/bitops.h                 |    4 -
 arch/mn10300/mm/tlb-smp.c                         |    4 -
 arch/openrisc/include/asm/bitops.h                |    9 ---
 arch/parisc/include/asm/atomic.h                  |    6 --
 arch/parisc/include/asm/bitops.h                  |    4 -
 arch/powerpc/include/asm/atomic.h                 |    6 --
 arch/powerpc/include/asm/barrier.h                |    3 +
 arch/powerpc/include/asm/bitops.h                 |    6 --
 arch/powerpc/kernel/crash.c                       |    2 
 arch/s390/include/asm/atomic.h                    |    6 --
 arch/s390/include/asm/barrier.h                   |    5 +
 arch/s390/include/asm/bitops.h                    |    1 
 arch/score/include/asm/bitops.h                   |    7 --
 arch/sh/include/asm/atomic.h                      |    6 --
 arch/sh/include/asm/bitops.h                      |    7 --
 arch/sparc/include/asm/atomic_32.h                |    7 --
 arch/sparc/include/asm/atomic_64.h                |    7 --
 arch/sparc/include/asm/barrier_64.h               |    3 +
 arch/sparc/include/asm/bitops_32.h                |    4 -
 arch/sparc/include/asm/bitops_64.h                |    4 -
 arch/tile/include/asm/atomic_32.h                 |   10 ---
 arch/tile/include/asm/atomic_64.h                 |    6 --
 arch/tile/include/asm/barrier.h                   |   14 +++++
 arch/tile/include/asm/bitops.h                    |    1 
 arch/tile/include/asm/bitops_32.h                 |    8 ---
 arch/tile/include/asm/bitops_64.h                 |    4 -
 arch/x86/include/asm/atomic.h                     |    7 --
 arch/x86/include/asm/barrier.h                    |    4 +
 arch/x86/include/asm/bitops.h                     |    6 --
 arch/x86/include/asm/sync_bitops.h                |    2 
 arch/x86/kernel/apic/hw_nmi.c                     |    2 
 arch/xtensa/include/asm/atomic.h                  |    7 --
 arch/xtensa/include/asm/bitops.h                  |    4 -
 block/blk-iopoll.c                                |    4 -
 crypto/chainiv.c                                  |    2 
 drivers/base/power/domain.c                       |    2 
 drivers/block/mtip32xx/mtip32xx.c                 |    4 -
 drivers/cpuidle/coupled.c                         |    2 
 drivers/firewire/ohci.c                           |    2 
 drivers/gpu/drm/drm_irq.c                         |   10 +--
 drivers/gpu/drm/i915/i915_irq.c                   |    2 
 drivers/md/bcache/bcache.h                        |    2 
 drivers/md/bcache/closure.h                       |    2 
 drivers/md/dm-bufio.c                             |    8 +--
 drivers/md/dm-snap.c                              |    4 -
 drivers/md/dm.c                                   |    2 
 drivers/md/raid5.c                                |    2 
 drivers/media/usb/dvb-usb-v2/dvb_usb_core.c       |    6 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |    6 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |   34 ++++++-------
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c    |   26 +++++-----
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c |   12 ++--
 drivers/net/ethernet/broadcom/cnic.c              |    8 +--
 drivers/net/ethernet/brocade/bna/bnad.c           |    6 +-
 drivers/net/ethernet/chelsio/cxgb/cxgb2.c         |    2 
 drivers/net/ethernet/chelsio/cxgb3/sge.c          |    6 +-
 drivers/net/ethernet/chelsio/cxgb4/sge.c          |    2 
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c        |    2 
 drivers/net/ethernet/intel/i40e/i40e_main.c       |    2 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    4 -
 drivers/net/wireless/ti/wlcore/main.c             |    2 
 drivers/pci/xen-pcifront.c                        |    4 -
 drivers/scsi/isci/remote_device.c                 |    2 
 drivers/target/loopback/tcm_loop.c                |    4 -
 drivers/target/target_core_alua.c                 |   26 +++++-----
 drivers/target/target_core_device.c               |    6 +-
 drivers/target/target_core_iblock.c               |    2 
 drivers/target/target_core_pr.c                   |   56 +++++++++++-----------
 drivers/target/target_core_transport.c            |   16 +++---
 drivers/target/target_core_ua.c                   |   10 +--
 drivers/tty/n_tty.c                               |    2 
 drivers/tty/serial/mxs-auart.c                    |    4 -
 drivers/usb/gadget/tcm_usb_gadget.c               |    4 -
 drivers/usb/serial/usb_wwan.c                     |    2 
 drivers/vhost/scsi.c                              |    2 
 drivers/w1/w1_family.c                            |    4 -
 drivers/xen/xen-pciback/pciback_ops.c             |    4 -
 fs/btrfs/btrfs_inode.h                            |    2 
 fs/btrfs/extent_io.c                              |    2 
 fs/btrfs/inode.c                                  |    6 +-
 fs/buffer.c                                       |    2 
 fs/ext4/resize.c                                  |    2 
 fs/gfs2/glock.c                                   |    8 +--
 fs/gfs2/glops.c                                   |    2 
 fs/gfs2/lock_dlm.c                                |    4 -
 fs/gfs2/recovery.c                                |    2 
 fs/gfs2/sys.c                                     |    4 -
 fs/jbd2/commit.c                                  |    6 +-
 fs/nfs/dir.c                                      |   12 ++--
 fs/nfs/inode.c                                    |    2 
 fs/nfs/nfs4filelayoutdev.c                        |    4 -
 fs/nfs/nfs4state.c                                |    4 -
 fs/nfs/pagelist.c                                 |    6 +-
 fs/nfs/pnfs.c                                     |    2 
 fs/nfs/pnfs.h                                     |    2 
 fs/nfs/write.c                                    |    4 -
 fs/ubifs/lpt_commit.c                             |    4 -
 fs/ubifs/tnc_commit.c                             |    4 -
 include/asm-generic/atomic.h                      |    7 --
 include/asm-generic/barrier.h                     |    8 +++
 include/asm-generic/bitops.h                      |    8 ---
 include/asm-generic/bitops/atomic.h               |    2 
 include/asm-generic/bitops/lock.h                 |    2 
 include/linux/buffer_head.h                       |    2 
 include/linux/genhd.h                             |    2 
 include/linux/interrupt.h                         |    8 +--
 include/linux/netdevice.h                         |    2 
 include/linux/sched.h                             |    6 --
 include/linux/sunrpc/sched.h                      |    8 +--
 include/linux/sunrpc/xprt.h                       |    8 +--
 include/linux/tracehook.h                         |    2 
 include/net/ip_vs.h                               |    4 -
 kernel/debug/debug_core.c                         |    4 -
 kernel/futex.c                                    |    2 
 kernel/kmod.c                                     |    2 
 kernel/rcu/tree.c                                 |   22 ++++----
 kernel/rcu/tree_plugin.h                          |    8 +--
 kernel/sched/cpupri.c                             |    6 +-
 kernel/sched/wait.c                               |    2 
 mm/backing-dev.c                                  |    2 
 mm/filemap.c                                      |    4 -
 net/atm/pppoatm.c                                 |    2 
 net/bluetooth/hci_event.c                         |    4 -
 net/core/dev.c                                    |    8 +--
 net/core/link_watch.c                             |    2 
 net/ipv4/inetpeer.c                               |    2 
 net/netfilter/nf_conntrack_core.c                 |    2 
 net/rds/ib_recv.c                                 |    4 -
 net/rds/iw_recv.c                                 |    4 -
 net/rds/send.c                                    |    6 +-
 net/rds/tcp_send.c                                |    2 
 net/sunrpc/auth.c                                 |    2 
 net/sunrpc/auth_gss/auth_gss.c                    |    2 
 net/sunrpc/backchannel_rqst.c                     |    4 -
 net/sunrpc/xprt.c                                 |    4 -
 net/sunrpc/xprtsock.c                             |   16 +++---
 net/unix/af_unix.c                                |    2 
 sound/pci/bt87x.c                                 |    4 -
 176 files changed, 415 insertions(+), 642 deletions(-)

--- a/Documentation/atomic_ops.txt
+++ b/Documentation/atomic_ops.txt
@@ -285,15 +285,13 @@ If a caller requires memory barrier sema
 operation which does not return a value, a set of interfaces are
 defined which accomplish this:
 
-	void smp_mb__before_atomic_dec(void);
-	void smp_mb__after_atomic_dec(void);
-	void smp_mb__before_atomic_inc(void);
-	void smp_mb__after_atomic_inc(void);
+	void smp_mb__before_atomic(void);
+	void smp_mb__after_atomic(void);
 
-For example, smp_mb__before_atomic_dec() can be used like so:
+For example, smp_mb__before_atomic() can be used like so:
 
 	obj->dead = 1;
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_dec(&obj->ref_count);
 
 It makes sure that all memory operations preceding the atomic_dec()
@@ -302,15 +300,10 @@ operation.  In the above example, it gua
 "1" to obj->dead will be globally visible to other cpus before the
 atomic counter decrement.
 
-Without the explicit smp_mb__before_atomic_dec() call, the
+Without the explicit smp_mb__before_atomic() call, the
 implementation could legally allow the atomic counter update visible
 to other cpus before the "obj->dead = 1;" assignment.
 
-The other three interfaces listed are used to provide explicit
-ordering with respect to memory operations after an atomic_dec() call
-(smp_mb__after_atomic_dec()) and around atomic_inc() calls
-(smp_mb__{before,after}_atomic_inc()).
-
 A missing memory barrier in the cases where they are required by the
 atomic_t implementation above can have disastrous results.  Here is
 an example, which follows a pattern occurring frequently in the Linux
@@ -487,12 +480,12 @@ memory operation done by test_and_set_bi
 Which returns a boolean indicating if bit "nr" is set in the bitmask
 pointed to by "addr".
 
-If explicit memory barriers are required around clear_bit() (which
-does not return a value, and thus does not need to provide memory
-barrier semantics), two interfaces are provided:
+If explicit memory barriers are required around {set,clear}_bit() (which do
+not return a value, and thus do not need to provide memory barrier
+semantics), two interfaces are provided:
 
-	void smp_mb__before_clear_bit(void);
-	void smp_mb__after_clear_bit(void);
+	void smp_mb__before_atomic(void);
+	void smp_mb__after_atomic(void);
 
 They are used as follows, and are akin to their atomic_t operation
 brothers:
@@ -500,13 +493,13 @@ They are used as follows, and are akin t
 	/* All memory operations before this call will
 	 * be globally visible before the clear_bit().
 	 */
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit( ... );
 
 	/* The clear_bit() will be visible before all
 	 * subsequent memory operations.
 	 */
-	 smp_mb__after_clear_bit();
+	 smp_mb__after_atomic();
 
 There are two special bitops with lock barrier semantics (acquire/release,
 same as spinlocks). These operate in the same way as their non-_lock/unlock
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1553,20 +1553,21 @@ CPU from reordering them.
      insert anything more than a compiler barrier in a UP compilation.
 
 
- (*) smp_mb__before_atomic_dec();
- (*) smp_mb__after_atomic_dec();
- (*) smp_mb__before_atomic_inc();
- (*) smp_mb__after_atomic_inc();
-
-     These are for use with atomic add, subtract, increment and decrement
-     functions that don't return a value, especially when used for reference
-     counting.  These functions do not imply memory barriers.
+ (*) smp_mb__before_atomic();
+ (*) smp_mb__after_atomic();
+
+     These are for use with atomic (such as add, subtract, increment and
+     decrement) functions that don't return a value, especially when used for
+     reference counting.  These functions do not imply memory barriers.
+
+     These are also used for atomic bitop functions that do not return a
+     value (such as set_bit and clear_bit).
 
      As an example, consider a piece of code that marks an object as being dead
      and then decrements the object's reference count:
 
 	obj->dead = 1;
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_dec(&obj->ref_count);
 
      This makes sure that the death mark on the object is perceived to be set
@@ -1576,27 +1577,6 @@ CPU from reordering them.
      operations" subsection for information on where to use these.
 
 
- (*) smp_mb__before_clear_bit(void);
- (*) smp_mb__after_clear_bit(void);
-
-     These are for use similar to the atomic inc/dec barriers.  These are
-     typically used for bitwise unlocking operations, so care must be taken as
-     there are no implicit memory barriers here either.
-
-     Consider implementing an unlock operation of some nature by clearing a
-     locking bit.  The clear_bit() would then need to be barriered like this:
-
-	smp_mb__before_clear_bit();
-	clear_bit( ... );
-
-     This prevents memory operations before the clear leaking to after it.  See
-     the subsection on "Locking Functions" with reference to RELEASE operation
-     implications.
-
-     See Documentation/atomic_ops.txt for more information.  See the "Atomic
-     operations" subsection for information on where to use these.
-
-
 MMIO WRITE BARRIER
 ------------------
 
@@ -2222,11 +2202,11 @@ barriers, but might be used for implemen
 	change_bit();
 
 With these the appropriate explicit memory barrier should be used if necessary
-(smp_mb__before_clear_bit() for instance).
+(smp_mb__before_atomic() for instance).
 
 
 The following also do _not_ imply memory barriers, and so may require explicit
-memory barriers under some circumstances (smp_mb__before_atomic_dec() for
+memory barriers under some circumstances (smp_mb__before_atomic() for
 instance):
 
 	atomic_add();
--- a/arch/alpha/include/asm/atomic.h
+++ b/arch/alpha/include/asm/atomic.h
@@ -292,9 +292,4 @@ static inline long atomic64_dec_if_posit
 #define atomic_dec(v) atomic_sub(1,(v))
 #define atomic64_dec(v) atomic64_sub(1,(v))
 
-#define smp_mb__before_atomic_dec()	smp_mb()
-#define smp_mb__after_atomic_dec()	smp_mb()
-#define smp_mb__before_atomic_inc()	smp_mb()
-#define smp_mb__after_atomic_inc()	smp_mb()
-
 #endif /* _ALPHA_ATOMIC_H */
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -53,9 +53,6 @@ __set_bit(unsigned long nr, volatile voi
 	*m |= 1 << (nr & 31);
 }
 
-#define smp_mb__before_clear_bit()	smp_mb()
-#define smp_mb__after_clear_bit()	smp_mb()
-
 static inline void
 clear_bit(unsigned long nr, volatile void * addr)
 {
--- a/arch/arc/include/asm/atomic.h
+++ b/arch/arc/include/asm/atomic.h
@@ -190,11 +190,6 @@ static inline void atomic_clear_mask(uns
 
 #endif /* !CONFIG_ARC_HAS_LLSC */
 
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 /**
  * __atomic_add_unless - add unless the number is a given value
  * @v: pointer of type atomic_t
--- a/arch/arc/include/asm/bitops.h
+++ b/arch/arc/include/asm/bitops.h
@@ -19,6 +19,7 @@
 
 #include <linux/types.h>
 #include <linux/compiler.h>
+#include <asm/barrier.h>
 
 /*
  * Hardware assisted read-modify-write using ARC700 LLOCK/SCOND insns.
@@ -496,10 +497,6 @@ static inline __attribute__ ((const)) in
  */
 #define ffz(x)	__ffs(~(x))
 
-/* TODO does this affect uni-processor code */
-#define smp_mb__before_clear_bit()  barrier()
-#define smp_mb__after_clear_bit()   barrier()
-
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/fls64.h>
 #include <asm-generic/bitops/sched.h>
--- a/arch/arm/include/asm/atomic.h
+++ b/arch/arm/include/asm/atomic.h
@@ -211,11 +211,6 @@ static inline int __atomic_add_unless(at
 
 #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0)
 
-#define smp_mb__before_atomic_dec()	smp_mb()
-#define smp_mb__after_atomic_dec()	smp_mb()
-#define smp_mb__before_atomic_inc()	smp_mb()
-#define smp_mb__after_atomic_inc()	smp_mb()
-
 #ifndef CONFIG_GENERIC_ATOMIC64
 typedef struct {
 	long long counter;
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -79,5 +79,8 @@ do {									\
 
 #define set_mb(var, value)	do { var = value; smp_mb(); } while (0)
 
+#define smp_mb__before_atomic()	smp_mb()
+#define smp_mb__after_atomic()	smp_mb()
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_BARRIER_H */
--- a/arch/arm/include/asm/bitops.h
+++ b/arch/arm/include/asm/bitops.h
@@ -25,9 +25,7 @@
 
 #include <linux/compiler.h>
 #include <linux/irqflags.h>
-
-#define smp_mb__before_clear_bit()	smp_mb()
-#define smp_mb__after_clear_bit()	smp_mb()
+#include <asm/barrier.h>
 
 /*
  * These functions are the basis of our bit ops.
--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@ -154,11 +154,6 @@ static inline int __atomic_add_unless(at
 
 #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0)
 
-#define smp_mb__before_atomic_dec()	smp_mb()
-#define smp_mb__after_atomic_dec()	smp_mb()
-#define smp_mb__before_atomic_inc()	smp_mb()
-#define smp_mb__after_atomic_inc()	smp_mb()
-
 /*
  * 64-bit atomic operations.
  */
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -97,6 +97,9 @@ do {									\
 #define set_mb(var, value)	do { var = value; smp_mb(); } while (0)
 #define nop()		asm volatile("nop");
 
+#define smp_mb__before_atomic()	smp_mb()
+#define smp_mb__after_atomic()	smp_mb()
+
 #endif	/* __ASSEMBLY__ */
 
 #endif	/* __ASM_BARRIER_H */
--- a/arch/arm64/include/asm/bitops.h
+++ b/arch/arm64/include/asm/bitops.h
@@ -17,17 +17,8 @@
 #define __ASM_BITOPS_H
 
 #include <linux/compiler.h>
-
 #include <asm/barrier.h>
 
-/*
- * clear_bit may not imply a memory barrier
- */
-#ifndef smp_mb__before_clear_bit
-#define smp_mb__before_clear_bit()	smp_mb()
-#define smp_mb__after_clear_bit()	smp_mb()
-#endif
-
 #ifndef _LINUX_BITOPS_H
 #error only <linux/bitops.h> can be included directly
 #endif
--- a/arch/avr32/include/asm/atomic.h
+++ b/arch/avr32/include/asm/atomic.h
@@ -183,9 +183,4 @@ static inline int atomic_sub_if_positive
 
 #define atomic_dec_if_positive(v) atomic_sub_if_positive(1, v)
 
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 #endif /*  __ASM_AVR32_ATOMIC_H */
--- a/arch/avr32/include/asm/bitops.h
+++ b/arch/avr32/include/asm/bitops.h
@@ -13,12 +13,7 @@
 #endif
 
 #include <asm/byteorder.h>
-
-/*
- * clear_bit() doesn't provide any barrier for the compiler
- */
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
+#include <asm/barrier.h>
 
 /*
  * set_bit - Atomically set a bit in memory
@@ -67,7 +62,7 @@ static inline void set_bit(int nr, volat
  *
  * clear_bit() is atomic and may not be reordered.  However, it does
  * not contain a memory barrier, so if it is used for locking purposes,
- * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
+ * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
  * in order to ensure changes are visible on other processors.
  */
 static inline void clear_bit(int nr, volatile void * addr)
--- a/arch/blackfin/include/asm/barrier.h
+++ b/arch/blackfin/include/asm/barrier.h
@@ -27,6 +27,9 @@
 
 #endif /* !CONFIG_SMP */
 
+#define smp_mb__before_atomic()	barrier()
+#define smp_mb__after_atomic()	barrier()
+
 #include <asm-generic/barrier.h>
 
 #endif /* _BLACKFIN_BARRIER_H */
--- a/arch/blackfin/include/asm/bitops.h
+++ b/arch/blackfin/include/asm/bitops.h
@@ -27,21 +27,17 @@
 
 #include <asm-generic/bitops/ext2-atomic.h>
 
+#include <asm/barrier.h>
+
 #ifndef CONFIG_SMP
 #include <linux/irqflags.h>
-
 /*
  * clear_bit may not imply a memory barrier
  */
-#ifndef smp_mb__before_clear_bit
-#define smp_mb__before_clear_bit()	smp_mb()
-#define smp_mb__after_clear_bit()	smp_mb()
-#endif
 #include <asm-generic/bitops/atomic.h>
 #include <asm-generic/bitops/non-atomic.h>
 #else
 
-#include <asm/barrier.h>
 #include <asm/byteorder.h>	/* swab32 */
 #include <linux/linkage.h>
 
@@ -101,12 +97,6 @@ static inline int test_and_change_bit(in
 	return __raw_bit_test_toggle_asm(a, nr & 0x1f);
 }
 
-/*
- * clear_bit() doesn't provide any barrier for the compiler.
- */
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
-
 #define test_bit __skip_test_bit
 #include <asm-generic/bitops/non-atomic.h>
 #undef test_bit
--- a/arch/c6x/include/asm/bitops.h
+++ b/arch/c6x/include/asm/bitops.h
@@ -14,14 +14,8 @@
 #ifdef __KERNEL__
 
 #include <linux/bitops.h>
-
 #include <asm/byteorder.h>
-
-/*
- * clear_bit() doesn't provide any barrier for the compiler.
- */
-#define smp_mb__before_clear_bit() barrier()
-#define smp_mb__after_clear_bit()  barrier()
+#include <asm/barrier.h>
 
 /*
  * We are lucky, DSP is perfect for bitops: do it in 3 cycles
--- a/arch/cris/include/asm/atomic.h
+++ b/arch/cris/include/asm/atomic.h
@@ -6,6 +6,7 @@
 #include <linux/compiler.h>
 #include <linux/types.h>
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 #include <arch/atomic.h>
 
 /*
@@ -151,10 +152,4 @@ static inline int __atomic_add_unless(at
 	return ret;
 }
 
-/* Atomic operations are already serializing */
-#define smp_mb__before_atomic_dec()    barrier()
-#define smp_mb__after_atomic_dec()     barrier()
-#define smp_mb__before_atomic_inc()    barrier()
-#define smp_mb__after_atomic_inc()     barrier()
-
 #endif
--- a/arch/cris/include/asm/bitops.h
+++ b/arch/cris/include/asm/bitops.h
@@ -21,6 +21,7 @@
 #include <arch/bitops.h>
 #include <linux/atomic.h>
 #include <linux/compiler.h>
+#include <asm/barrier.h>
 
 /*
  * set_bit - Atomically set a bit in memory
@@ -42,7 +43,7 @@
  *
  * clear_bit() is atomic and may not be reordered.  However, it does
  * not contain a memory barrier, so if it is used for locking purposes,
- * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
+ * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
  * in order to ensure changes are visible on other processors.
  */
 
@@ -84,12 +85,6 @@ static inline int test_and_set_bit(int n
 	return retval;
 }
 
-/*
- * clear_bit() doesn't provide any barrier for the compiler.
- */
-#define smp_mb__before_clear_bit()      barrier()
-#define smp_mb__after_clear_bit()       barrier()
-
 /**
  * test_and_clear_bit - Clear a bit and return its old value
  * @nr: Bit to clear
--- a/arch/frv/include/asm/atomic.h
+++ b/arch/frv/include/asm/atomic.h
@@ -17,6 +17,7 @@
 #include <linux/types.h>
 #include <asm/spr-regs.h>
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 
 #ifdef CONFIG_SMP
 #error not SMP safe
@@ -29,12 +30,6 @@
  * We do not have SMP systems, so we don't have to deal with that.
  */
 
-/* Atomic operations are already serializing */
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 #define ATOMIC_INIT(i)		{ (i) }
 #define atomic_read(v)		(*(volatile int *)&(v)->counter)
 #define atomic_set(v, i)	(((v)->counter) = (i))
--- a/arch/frv/include/asm/bitops.h
+++ b/arch/frv/include/asm/bitops.h
@@ -25,12 +25,6 @@
 
 #include <asm-generic/bitops/ffz.h>
 
-/*
- * clear_bit() doesn't provide any barrier for the compiler.
- */
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
-
 #ifndef CONFIG_FRV_OUTOFLINE_ATOMIC_OPS
 static inline
 unsigned long atomic_test_and_ANDNOT_mask(unsigned long mask, volatile unsigned long *v)
--- a/arch/hexagon/include/asm/atomic.h
+++ b/arch/hexagon/include/asm/atomic.h
@@ -24,6 +24,7 @@
 
 #include <linux/types.h>
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 
 #define ATOMIC_INIT(i)		{ (i) }
 #define atomic_set(v, i)	((v)->counter = (i))
@@ -163,9 +164,4 @@ static inline int __atomic_add_unless(at
 #define atomic_inc_return(v) (atomic_add_return(1, v))
 #define atomic_dec_return(v) (atomic_sub_return(1, v))
 
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 #endif
--- a/arch/hexagon/include/asm/bitops.h
+++ b/arch/hexagon/include/asm/bitops.h
@@ -25,12 +25,10 @@
 #include <linux/compiler.h>
 #include <asm/byteorder.h>
 #include <asm/atomic.h>
+#include <asm/barrier.h>
 
 #ifdef __KERNEL__
 
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
-
 /*
  * The offset calculations for these are based on BITS_PER_LONG == 32
  * (i.e. I get to shift by #5-2 (32 bits per long, 4 bytes per access),
--- a/arch/ia64/include/asm/atomic.h
+++ b/arch/ia64/include/asm/atomic.h
@@ -208,10 +208,4 @@ atomic64_add_negative (__s64 i, atomic64
 #define atomic64_inc(v)			atomic64_add(1, (v))
 #define atomic64_dec(v)			atomic64_sub(1, (v))
 
-/* Atomic operations are already serializing */
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 #endif /* _ASM_IA64_ATOMIC_H */
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -55,6 +55,9 @@
 
 #endif
 
+#define smp_mb__before_atomic()	barrier()
+#define smp_mb__after_atomic()	barrier()
+
 /*
  * IA64 GCC turns volatile stores into st.rel and volatile loads into ld.acq no
  * need for asm trickery!
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -16,6 +16,7 @@
 #include <linux/compiler.h>
 #include <linux/types.h>
 #include <asm/intrinsics.h>
+#include <asm/barrier.h>
 
 /**
  * set_bit - Atomically set a bit in memory
@@ -65,9 +66,6 @@ __set_bit (int nr, volatile void *addr)
 	*((__u32 *) addr + (nr >> 5)) |= (1 << (nr & 31));
 }
 
-#define smp_mb__before_clear_bit()	barrier();
-#define smp_mb__after_clear_bit()	barrier();
-
 /**
  * clear_bit - Clears a bit in memory
  * @nr: Bit to clear
@@ -75,7 +73,7 @@ __set_bit (int nr, volatile void *addr)
  *
  * clear_bit() is atomic and may not be reordered.  However, it does
  * not contain a memory barrier, so if it is used for locking purposes,
- * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
+ * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
  * in order to ensure changes are visible on other processors.
  */
 static __inline__ void
--- a/arch/m32r/include/asm/atomic.h
+++ b/arch/m32r/include/asm/atomic.h
@@ -13,6 +13,7 @@
 #include <asm/assembler.h>
 #include <asm/cmpxchg.h>
 #include <asm/dcache_clear.h>
+#include <asm/barrier.h>
 
 /*
  * Atomic operations that C can't guarantee us.  Useful for
@@ -308,10 +309,4 @@ static __inline__ void atomic_set_mask(u
 	local_irq_restore(flags);
 }
 
-/* Atomic operations are already serializing on m32r */
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 #endif	/* _ASM_M32R_ATOMIC_H */
--- a/arch/m32r/include/asm/bitops.h
+++ b/arch/m32r/include/asm/bitops.h
@@ -21,6 +21,7 @@
 #include <asm/byteorder.h>
 #include <asm/dcache_clear.h>
 #include <asm/types.h>
+#include <asm/barrier.h>
 
 /*
  * These have to be done with inline assembly: that way the bit-setting
@@ -73,7 +74,7 @@ static __inline__ void set_bit(int nr, v
  *
  * clear_bit() is atomic and may not be reordered.  However, it does
  * not contain a memory barrier, so if it is used for locking purposes,
- * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
+ * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
  * in order to ensure changes are visible on other processors.
  */
 static __inline__ void clear_bit(int nr, volatile void * addr)
@@ -103,9 +104,6 @@ static __inline__ void clear_bit(int nr,
 	local_irq_restore(flags);
 }
 
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
-
 /**
  * change_bit - Toggle a bit in memory
  * @nr: Bit to clear
--- a/arch/m68k/include/asm/atomic.h
+++ b/arch/m68k/include/asm/atomic.h
@@ -4,6 +4,7 @@
 #include <linux/types.h>
 #include <linux/irqflags.h>
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 
 /*
  * Atomic operations that C can't guarantee us.  Useful for
@@ -209,11 +210,4 @@ static __inline__ int __atomic_add_unles
 	return c;
 }
 
-
-/* Atomic operations are already serializing */
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 #endif /* __ARCH_M68K_ATOMIC __ */
--- a/arch/m68k/include/asm/bitops.h
+++ b/arch/m68k/include/asm/bitops.h
@@ -13,6 +13,7 @@
 #endif
 
 #include <linux/compiler.h>
+#include <asm/barrier.h>
 
 /*
  *	Bit access functions vary across the ColdFire and 68k families.
@@ -67,12 +68,6 @@ static inline void bfset_mem_set_bit(int
 #define __set_bit(nr, vaddr)	set_bit(nr, vaddr)
 
 
-/*
- * clear_bit() doesn't provide any barrier for the compiler.
- */
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
-
 static inline void bclr_reg_clear_bit(int nr, volatile unsigned long *vaddr)
 {
 	char *p = (char *)vaddr + (nr ^ 31) / 8;
--- a/arch/metag/include/asm/atomic.h
+++ b/arch/metag/include/asm/atomic.h
@@ -4,6 +4,7 @@
 #include <linux/compiler.h>
 #include <linux/types.h>
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 
 #if defined(CONFIG_METAG_ATOMICITY_IRQSOFF)
 /* The simple UP case. */
@@ -39,11 +40,6 @@
 
 #define atomic_inc_not_zero(v) atomic_add_unless((v), 1, 0)
 
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 #endif
 
 #define atomic_dec_if_positive(v)       atomic_sub_if_positive(1, v)
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -97,4 +97,7 @@ do {									\
 	___p1;								\
 })
 
+#define smp_mb__before_atomic()	barrier()
+#define smp_mb__after_atomic()	barrier()
+
 #endif /* _ASM_METAG_BARRIER_H */
--- a/arch/metag/include/asm/bitops.h
+++ b/arch/metag/include/asm/bitops.h
@@ -5,12 +5,6 @@
 #include <asm/barrier.h>
 #include <asm/global_lock.h>
 
-/*
- * clear_bit() doesn't provide any barrier for the compiler.
- */
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
-
 #ifdef CONFIG_SMP
 /*
  * These functions are the basis of our bit ops.
--- a/arch/mips/include/asm/atomic.h
+++ b/arch/mips/include/asm/atomic.h
@@ -761,13 +761,4 @@ static __inline__ int atomic64_add_unles
 
 #endif /* CONFIG_64BIT */
 
-/*
- * atomic*_return operations are serializing but not the non-*_return
- * versions.
- */
-#define smp_mb__before_atomic_dec()	smp_mb__before_llsc()
-#define smp_mb__after_atomic_dec()	smp_llsc_mb()
-#define smp_mb__before_atomic_inc()	smp_mb__before_llsc()
-#define smp_mb__after_atomic_inc()	smp_llsc_mb()
-
 #endif /* _ASM_ATOMIC_H */
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -195,4 +195,7 @@ do {									\
 	___p1;								\
 })
 
+#define smp_mb__before_atomic()	smp_mb__before_llsc()
+#define smp_mb__after_atomic()	smp_llsc_mb()
+
 #endif /* __ASM_BARRIER_H */
--- a/arch/mips/include/asm/bitops.h
+++ b/arch/mips/include/asm/bitops.h
@@ -38,13 +38,6 @@
 #endif
 
 /*
- * clear_bit() doesn't provide any barrier for the compiler.
- */
-#define smp_mb__before_clear_bit()	smp_mb__before_llsc()
-#define smp_mb__after_clear_bit()	smp_llsc_mb()
-
-
-/*
  * These are the "slower" versions of the functions and are in bitops.c.
  * These functions call raw_local_irq_{save,restore}().
  */
@@ -120,7 +113,7 @@ static inline void set_bit(unsigned long
  *
  * clear_bit() is atomic and may not be reordered.  However, it does
  * not contain a memory barrier, so if it is used for locking purposes,
- * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
+ * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
  * in order to ensure changes are visible on other processors.
  */
 static inline void clear_bit(unsigned long nr, volatile unsigned long *addr)
@@ -175,7 +168,7 @@ static inline void clear_bit(unsigned lo
  */
 static inline void clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(nr, addr);
 }
 
--- a/arch/mips/kernel/irq.c
+++ b/arch/mips/kernel/irq.c
@@ -62,9 +62,9 @@ void __init alloc_legacy_irqno(void)
 
 void free_irqno(unsigned int irq)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(irq, irq_map);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 }
 
 /*
--- a/arch/mn10300/include/asm/atomic.h
+++ b/arch/mn10300/include/asm/atomic.h
@@ -13,6 +13,7 @@
 
 #include <asm/irqflags.h>
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 
 #ifndef CONFIG_SMP
 #include <asm-generic/atomic.h>
@@ -234,12 +235,6 @@ static inline void atomic_set_mask(unsig
 #endif
 }
 
-/* Atomic operations are already serializing on MN10300??? */
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 #endif /* __KERNEL__ */
 #endif /* CONFIG_SMP */
 #endif /* _ASM_ATOMIC_H */
--- a/arch/mn10300/include/asm/bitops.h
+++ b/arch/mn10300/include/asm/bitops.h
@@ -18,9 +18,7 @@
 #define __ASM_BITOPS_H
 
 #include <asm/cpu-regs.h>
-
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
+#include <asm/barrier.h>
 
 /*
  * set bit
--- a/arch/mn10300/mm/tlb-smp.c
+++ b/arch/mn10300/mm/tlb-smp.c
@@ -78,9 +78,9 @@ void smp_flush_tlb(void *unused)
 	else
 		local_flush_tlb_page(flush_mm, flush_va);
 
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	cpumask_clear_cpu(cpu_id, &flush_cpumask);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 out:
 	put_cpu();
 }
--- a/arch/openrisc/include/asm/bitops.h
+++ b/arch/openrisc/include/asm/bitops.h
@@ -27,14 +27,7 @@
 
 #include <linux/irqflags.h>
 #include <linux/compiler.h>
-
-/*
- * clear_bit may not imply a memory barrier
- */
-#ifndef smp_mb__before_clear_bit
-#define smp_mb__before_clear_bit()	smp_mb()
-#define smp_mb__after_clear_bit()	smp_mb()
-#endif
+#include <asm/barrier.h>
 
 #include <asm/bitops/__ffs.h>
 #include <asm-generic/bitops/ffz.h>
--- a/arch/parisc/include/asm/atomic.h
+++ b/arch/parisc/include/asm/atomic.h
@@ -7,6 +7,7 @@
 
 #include <linux/types.h>
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 
 /*
  * Atomic operations that C can't guarantee us.  Useful for
@@ -143,11 +144,6 @@ static __inline__ int __atomic_add_unles
 
 #define ATOMIC_INIT(i)	{ (i) }
 
-#define smp_mb__before_atomic_dec()	smp_mb()
-#define smp_mb__after_atomic_dec()	smp_mb()
-#define smp_mb__before_atomic_inc()	smp_mb()
-#define smp_mb__after_atomic_inc()	smp_mb()
-
 #ifdef CONFIG_64BIT
 
 #define ATOMIC64_INIT(i) { (i) }
--- a/arch/parisc/include/asm/bitops.h
+++ b/arch/parisc/include/asm/bitops.h
@@ -8,6 +8,7 @@
 #include <linux/compiler.h>
 #include <asm/types.h>		/* for BITS_PER_LONG/SHIFT_PER_LONG */
 #include <asm/byteorder.h>
+#include <asm/barrier.h>
 #include <linux/atomic.h>
 
 /*
@@ -19,9 +20,6 @@
 #define CHOP_SHIFTCOUNT(x) (((unsigned long) (x)) & (BITS_PER_LONG - 1))
 
 
-#define smp_mb__before_clear_bit()      smp_mb()
-#define smp_mb__after_clear_bit()       smp_mb()
-
 /* See http://marc.theaimsgroup.com/?t=108826637900003 for discussion
  * on use of volatile and __*_bit() (set/clear/change):
  *	*_bit() want use of volatile.
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -8,6 +8,7 @@
 #ifdef __KERNEL__
 #include <linux/types.h>
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 
 #define ATOMIC_INIT(i)		{ (i) }
 
@@ -270,11 +271,6 @@ static __inline__ int atomic_dec_if_posi
 }
 #define atomic_dec_if_positive atomic_dec_if_positive
 
-#define smp_mb__before_atomic_dec()     smp_mb()
-#define smp_mb__after_atomic_dec()      smp_mb()
-#define smp_mb__before_atomic_inc()     smp_mb()
-#define smp_mb__after_atomic_inc()      smp_mb()
-
 #ifdef __powerpc64__
 
 #define ATOMIC64_INIT(i)	{ (i) }
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -84,4 +84,7 @@ do {									\
 	___p1;								\
 })
 
+#define smp_mb__before_atomic()     smp_mb()
+#define smp_mb__after_atomic()      smp_mb()
+
 #endif /* _ASM_POWERPC_BARRIER_H */
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -51,11 +51,7 @@
 #define PPC_BIT(bit)		(1UL << PPC_BITLSHIFT(bit))
 #define PPC_BITMASK(bs, be)	((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))
 
-/*
- * clear_bit doesn't imply a memory barrier
- */
-#define smp_mb__before_clear_bit()	smp_mb()
-#define smp_mb__after_clear_bit()	smp_mb()
+#include <asm/barrier.h>
 
 /* Macro for generating the ***_bits() functions */
 #define DEFINE_BITOP(fn, op, prefix)		\
--- a/arch/powerpc/kernel/crash.c
+++ b/arch/powerpc/kernel/crash.c
@@ -81,7 +81,7 @@ void crash_ipi_callback(struct pt_regs *
 	}
 
 	atomic_inc(&cpus_in_crash);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 
 	/*
 	 * Starting the kdump boot.
--- a/arch/s390/include/asm/atomic.h
+++ b/arch/s390/include/asm/atomic.h
@@ -16,6 +16,7 @@
 #include <linux/compiler.h>
 #include <linux/types.h>
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 
 #define ATOMIC_INIT(i)  { (i) }
 
@@ -398,9 +399,4 @@ static inline long long atomic64_dec_if_
 #define atomic64_dec_and_test(_v)	(atomic64_sub_return(1, _v) == 0)
 #define atomic64_inc_not_zero(v)	atomic64_add_unless((v), 1, 0)
 
-#define smp_mb__before_atomic_dec()	smp_mb()
-#define smp_mb__after_atomic_dec()	smp_mb()
-#define smp_mb__before_atomic_inc()	smp_mb()
-#define smp_mb__after_atomic_inc()	smp_mb()
-
 #endif /* __ARCH_S390_ATOMIC__  */
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -27,8 +27,9 @@
 #define smp_rmb()			rmb()
 #define smp_wmb()			wmb()
 #define smp_read_barrier_depends()	read_barrier_depends()
-#define smp_mb__before_clear_bit()	smp_mb()
-#define smp_mb__after_clear_bit()	smp_mb()
+
+#define smp_mb__before_atomic()	smp_mb()
+#define smp_mb__after_atomic()	smp_mb()
 
 #define set_mb(var, value)		do { var = value; mb(); } while (0)
 
--- a/arch/s390/include/asm/bitops.h
+++ b/arch/s390/include/asm/bitops.h
@@ -47,6 +47,7 @@
 
 #include <linux/typecheck.h>
 #include <linux/compiler.h>
+#include <asm/barrier.h>
 
 #ifndef CONFIG_64BIT
 
--- a/arch/score/include/asm/bitops.h
+++ b/arch/score/include/asm/bitops.h
@@ -2,12 +2,7 @@
 #define _ASM_SCORE_BITOPS_H
 
 #include <asm/byteorder.h> /* swab32 */
-
-/*
- * clear_bit() doesn't provide any barrier for the compiler.
- */
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
+#include <asm/barrier.h>
 
 #include <asm-generic/bitops.h>
 #include <asm-generic/bitops/__fls.h>
--- a/arch/sh/include/asm/atomic.h
+++ b/arch/sh/include/asm/atomic.h
@@ -10,6 +10,7 @@
 #include <linux/compiler.h>
 #include <linux/types.h>
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 
 #define ATOMIC_INIT(i)	{ (i) }
 
@@ -62,9 +63,4 @@ static inline int __atomic_add_unless(at
 	return c;
 }
 
-#define smp_mb__before_atomic_dec()	smp_mb()
-#define smp_mb__after_atomic_dec()	smp_mb()
-#define smp_mb__before_atomic_inc()	smp_mb()
-#define smp_mb__after_atomic_inc()	smp_mb()
-
 #endif /* __ASM_SH_ATOMIC_H */
--- a/arch/sh/include/asm/bitops.h
+++ b/arch/sh/include/asm/bitops.h
@@ -9,6 +9,7 @@
 
 /* For __swab32 */
 #include <asm/byteorder.h>
+#include <asm/barrier.h>
 
 #ifdef CONFIG_GUSA_RB
 #include <asm/bitops-grb.h>
@@ -22,12 +23,6 @@
 #include <asm-generic/bitops/non-atomic.h>
 #endif
 
-/*
- * clear_bit() doesn't provide any barrier for the compiler.
- */
-#define smp_mb__before_clear_bit()	smp_mb()
-#define smp_mb__after_clear_bit()	smp_mb()
-
 #ifdef CONFIG_SUPERH32
 static inline unsigned long ffz(unsigned long word)
 {
--- a/arch/sparc/include/asm/atomic_32.h
+++ b/arch/sparc/include/asm/atomic_32.h
@@ -14,6 +14,7 @@
 #include <linux/types.h>
 
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 #include <asm-generic/atomic64.h>
 
 
@@ -52,10 +53,4 @@ extern void atomic_set(atomic_t *, int);
 #define atomic_dec_and_test(v) (atomic_dec_return(v) == 0)
 #define atomic_sub_and_test(i, v) (atomic_sub_return(i, v) == 0)
 
-/* Atomic operations are already serializing */
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 #endif /* !(__ARCH_SPARC_ATOMIC__) */
--- a/arch/sparc/include/asm/atomic_64.h
+++ b/arch/sparc/include/asm/atomic_64.h
@@ -9,6 +9,7 @@
 
 #include <linux/types.h>
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 
 #define ATOMIC_INIT(i)		{ (i) }
 #define ATOMIC64_INIT(i)	{ (i) }
@@ -108,10 +109,4 @@ static inline long atomic64_add_unless(a
 
 extern long atomic64_dec_if_positive(atomic64_t *v);
 
-/* Atomic operations are already serializing */
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 #endif /* !(__ARCH_SPARC64_ATOMIC__) */
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -68,4 +68,7 @@ do {									\
 	___p1;								\
 })
 
+#define smp_mb__before_atomic()	barrier()
+#define smp_mb__after_atomic()	barrier()
+
 #endif /* !(__SPARC64_BARRIER_H) */
--- a/arch/sparc/include/asm/bitops_32.h
+++ b/arch/sparc/include/asm/bitops_32.h
@@ -11,6 +11,7 @@
 
 #include <linux/compiler.h>
 #include <asm/byteorder.h>
+#include <asm/barrier.h>
 
 #ifdef __KERNEL__
 
@@ -90,9 +91,6 @@ static inline void change_bit(unsigned l
 
 #include <asm-generic/bitops/non-atomic.h>
 
-#define smp_mb__before_clear_bit()	do { } while(0)
-#define smp_mb__after_clear_bit()	do { } while(0)
-
 #include <asm-generic/bitops/ffz.h>
 #include <asm-generic/bitops/__ffs.h>
 #include <asm-generic/bitops/sched.h>
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -13,6 +13,7 @@
 
 #include <linux/compiler.h>
 #include <asm/byteorder.h>
+#include <asm/barrier.h>
 
 extern int test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
 extern int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
@@ -23,9 +24,6 @@ extern void change_bit(unsigned long nr,
 
 #include <asm-generic/bitops/non-atomic.h>
 
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
-
 #include <asm-generic/bitops/fls.h>
 #include <asm-generic/bitops/__fls.h>
 #include <asm-generic/bitops/fls64.h>
--- a/arch/tile/include/asm/atomic_32.h
+++ b/arch/tile/include/asm/atomic_32.h
@@ -169,16 +169,6 @@ static inline void atomic64_set(atomic64
 #define atomic64_dec_and_test(v)	(atomic64_dec_return((v)) == 0)
 #define atomic64_inc_not_zero(v)	atomic64_add_unless((v), 1LL, 0LL)
 
-/*
- * We need to barrier before modifying the word, since the _atomic_xxx()
- * routines just tns the lock and then read/modify/write of the word.
- * But after the word is updated, the routine issues an "mf" before returning,
- * and since it's a function call, we don't even need a compiler barrier.
- */
-#define smp_mb__before_atomic_dec()	smp_mb()
-#define smp_mb__before_atomic_inc()	smp_mb()
-#define smp_mb__after_atomic_dec()	do { } while (0)
-#define smp_mb__after_atomic_inc()	do { } while (0)
 
 #endif /* !__ASSEMBLY__ */
 
--- a/arch/tile/include/asm/atomic_64.h
+++ b/arch/tile/include/asm/atomic_64.h
@@ -105,12 +105,6 @@ static inline long atomic64_add_unless(a
 
 #define atomic64_inc_not_zero(v)	atomic64_add_unless((v), 1, 0)
 
-/* Atomic dec and inc don't implement barrier, so provide them if needed. */
-#define smp_mb__before_atomic_dec()	smp_mb()
-#define smp_mb__after_atomic_dec()	smp_mb()
-#define smp_mb__before_atomic_inc()	smp_mb()
-#define smp_mb__after_atomic_inc()	smp_mb()
-
 /* Define this to indicate that cmpxchg is an efficient operation. */
 #define __HAVE_ARCH_CMPXCHG
 
--- a/arch/tile/include/asm/barrier.h
+++ b/arch/tile/include/asm/barrier.h
@@ -72,6 +72,20 @@ mb_incoherent(void)
 #define mb()		fast_mb()
 #define iob()		fast_iob()
 
+#ifndef __tilegx__ /* 32 bit */
+/*
+ * We need to barrier before modifying the word, since the _atomic_xxx()
+ * routines just tns the lock and then read/modify/write of the word.
+ * But after the word is updated, the routine issues an "mf" before returning,
+ * and since it's a function call, we don't even need a compiler barrier.
+ */
+#define smp_mb__before_atomic()	smp_mb()
+#define smp_mb__after_atomic()	do { } while (0)
+#else /* 64 bit */
+#define smp_mb__before_atomic()	smp_mb()
+#define smp_mb__after_atomic()	smp_mb()
+#endif
+
 #include <asm-generic/barrier.h>
 
 #endif /* !__ASSEMBLY__ */
--- a/arch/tile/include/asm/bitops.h
+++ b/arch/tile/include/asm/bitops.h
@@ -17,6 +17,7 @@
 #define _ASM_TILE_BITOPS_H
 
 #include <linux/types.h>
+#include <asm/barrier.h>
 
 #ifndef _LINUX_BITOPS_H
 #error only <linux/bitops.h> can be included directly
--- a/arch/tile/include/asm/bitops_32.h
+++ b/arch/tile/include/asm/bitops_32.h
@@ -49,8 +49,8 @@ static inline void set_bit(unsigned nr,
  * restricted to acting on a single-word quantity.
  *
  * clear_bit() may not contain a memory barrier, so if it is used for
- * locking purposes, you should call smp_mb__before_clear_bit() and/or
- * smp_mb__after_clear_bit() to ensure changes are visible on other cpus.
+ * locking purposes, you should call smp_mb__before_atomic() and/or
+ * smp_mb__after_atomic() to ensure changes are visible on other cpus.
  */
 static inline void clear_bit(unsigned nr, volatile unsigned long *addr)
 {
@@ -121,10 +121,6 @@ static inline int test_and_change_bit(un
 	return (_atomic_xor(addr, mask) & mask) != 0;
 }
 
-/* See discussion at smp_mb__before_atomic_dec() in <asm/atomic_32.h>. */
-#define smp_mb__before_clear_bit()	smp_mb()
-#define smp_mb__after_clear_bit()	do {} while (0)
-
 #include <asm-generic/bitops/ext2-atomic.h>
 
 #endif /* _ASM_TILE_BITOPS_32_H */
--- a/arch/tile/include/asm/bitops_64.h
+++ b/arch/tile/include/asm/bitops_64.h
@@ -32,10 +32,6 @@ static inline void clear_bit(unsigned nr
 	__insn_fetchand((void *)(addr + nr / BITS_PER_LONG), ~mask);
 }
 
-#define smp_mb__before_clear_bit()	smp_mb()
-#define smp_mb__after_clear_bit()	smp_mb()
-
-
 static inline void change_bit(unsigned nr, volatile unsigned long *addr)
 {
 	unsigned long mask = (1UL << (nr % BITS_PER_LONG));
--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -7,6 +7,7 @@
 #include <asm/alternative.h>
 #include <asm/cmpxchg.h>
 #include <asm/rmwcc.h>
+#include <asm/barrier.h>
 
 /*
  * Atomic operations that C can't guarantee us.  Useful for
@@ -243,12 +244,6 @@ static inline void atomic_or_long(unsign
 		     : : "r" ((unsigned)(mask)), "m" (*(addr))	\
 		     : "memory")
 
-/* Atomic operations are already serializing on x86 */
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 #ifdef CONFIG_X86_32
 # include <asm/atomic64_32.h>
 #else
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -141,6 +141,10 @@ do {									\
 
 #endif
 
+/* Atomic operations are already serializing on x86 */
+#define smp_mb__before_atomic()	barrier()
+#define smp_mb__after_atomic()	barrier()
+
 /*
  * Stop RDTSC speculation. This is needed when you need to use RDTSC
  * (or get_cycles or vread that possibly accesses the TSC) in a defined
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -15,6 +15,7 @@
 #include <linux/compiler.h>
 #include <asm/alternative.h>
 #include <asm/rmwcc.h>
+#include <asm/barrier.h>
 
 #if BITS_PER_LONG == 32
 # define _BITOPS_LONG_SHIFT 5
@@ -102,7 +103,7 @@ static inline void __set_bit(long nr, vo
  *
  * clear_bit() is atomic and may not be reordered.  However, it does
  * not contain a memory barrier, so if it is used for locking purposes,
- * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
+ * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
  * in order to ensure changes are visible on other processors.
  */
 static __always_inline void
@@ -156,9 +157,6 @@ static inline void __clear_bit_unlock(lo
 	__clear_bit(nr, addr);
 }
 
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
-
 /**
  * __change_bit - Toggle a bit in memory
  * @nr: the bit to change
--- a/arch/x86/include/asm/sync_bitops.h
+++ b/arch/x86/include/asm/sync_bitops.h
@@ -41,7 +41,7 @@ static inline void sync_set_bit(long nr,
  *
  * sync_clear_bit() is atomic and may not be reordered.  However, it does
  * not contain a memory barrier, so if it is used for locking purposes,
- * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
+ * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
  * in order to ensure changes are visible on other processors.
  */
 static inline void sync_clear_bit(long nr, volatile unsigned long *addr)
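For reference, the locking-style use the comment above prescribes looks like
this after the conversion; a minimal sketch, flag word and bit name made up:

#include <linux/bitops.h>
#include <asm/barrier.h>

static unsigned long my_flags;		/* illustrative only */
#define MY_BUSY	0

static void my_unlock(void)
{
	smp_mb__before_atomic();	/* critical section stores first... */
	clear_bit(MY_BUSY, &my_flags);
	smp_mb__after_atomic();		/* ...clear visible before later loads */
}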
--- a/arch/x86/kernel/apic/hw_nmi.c
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -57,7 +57,7 @@ void arch_trigger_all_cpu_backtrace(void
 	}
 
 	clear_bit(0, &backtrace_flag);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 }
 
 static int __kprobes
--- a/arch/xtensa/include/asm/atomic.h
+++ b/arch/xtensa/include/asm/atomic.h
@@ -19,6 +19,7 @@
 #ifdef __KERNEL__
 #include <asm/processor.h>
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 
 #define ATOMIC_INIT(i)	{ (i) }
 
@@ -387,12 +388,6 @@ static inline void atomic_set_mask(unsig
 #endif
 }
 
-/* Atomic operations are already serializing */
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 #endif /* __KERNEL__ */
 
 #endif /* _XTENSA_ATOMIC_H */
--- a/arch/xtensa/include/asm/bitops.h
+++ b/arch/xtensa/include/asm/bitops.h
@@ -21,9 +21,7 @@
 
 #include <asm/processor.h>
 #include <asm/byteorder.h>
-
-#define smp_mb__before_clear_bit()	smp_mb()
-#define smp_mb__after_clear_bit()	smp_mb()
+#include <asm/barrier.h>
 
 #include <asm-generic/bitops/non-atomic.h>
 
--- a/block/blk-iopoll.c
+++ b/block/blk-iopoll.c
@@ -52,7 +52,7 @@ EXPORT_SYMBOL(blk_iopoll_sched);
 void __blk_iopoll_complete(struct blk_iopoll *iop)
 {
 	list_del(&iop->list);
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit_unlock(IOPOLL_F_SCHED, &iop->state);
 }
 EXPORT_SYMBOL(__blk_iopoll_complete);
@@ -164,7 +164,7 @@ EXPORT_SYMBOL(blk_iopoll_disable);
 void blk_iopoll_enable(struct blk_iopoll *iop)
 {
 	BUG_ON(!test_bit(IOPOLL_F_SCHED, &iop->state));
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit_unlock(IOPOLL_F_SCHED, &iop->state);
 }
 EXPORT_SYMBOL(blk_iopoll_enable);
--- a/crypto/chainiv.c
+++ b/crypto/chainiv.c
@@ -126,7 +126,7 @@ static int async_chainiv_schedule_work(s
 	int err = ctx->err;
 
 	if (!ctx->queue.qlen) {
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		clear_bit(CHAINIV_STATE_INUSE, &ctx->state);
 
 		if (!ctx->queue.qlen ||
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -106,7 +106,7 @@ static bool genpd_sd_counter_dec(struct
 static void genpd_sd_counter_inc(struct generic_pm_domain *genpd)
 {
 	atomic_inc(&genpd->sd_count);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 }
 
 static void genpd_acquire_lock(struct generic_pm_domain *genpd)
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -224,9 +224,9 @@ static int get_slot(struct mtip_port *po
  */
 static inline void release_slot(struct mtip_port *port, int tag)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(tag, port->allocated);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 }
 
 /*
--- a/drivers/cpuidle/coupled.c
+++ b/drivers/cpuidle/coupled.c
@@ -159,7 +159,7 @@ void cpuidle_coupled_parallel_barrier(st
 {
 	int n = dev->coupled->online_count;
 
-	smp_mb__before_atomic_inc();
+	smp_mb__before_atomic();
 	atomic_inc(a);
 
 	while (atomic_read(a) < n)
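The hunk above is the classic rendezvous pattern; reduced to a sketch (the
counter and CPU count are illustrative, not the coupled-cpuidle ones):

#include <linux/atomic.h>
#include <asm/barrier.h>
#include <asm/processor.h>		/* cpu_relax() */

static void rendezvous(atomic_t *arrived, int cpus)
{
	smp_mb__before_atomic();	/* prior work visible before checking in */
	atomic_inc(arrived);

	while (atomic_read(arrived) < cpus)
		cpu_relax();		/* spin until everybody got here */
}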
--- a/drivers/firewire/ohci.c
+++ b/drivers/firewire/ohci.c
@@ -3509,7 +3509,7 @@ static int ohci_flush_iso_completions(st
 		}
 
 		clear_bit_unlock(0, &ctx->flushing_completions);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 	}
 
 	tasklet_enable(&ctx->context.tasklet);
--- a/drivers/gpu/drm/drm_irq.c
+++ b/drivers/gpu/drm/drm_irq.c
@@ -156,7 +156,7 @@ static void vblank_disable_and_save(stru
 	 */
 	if ((vblrc > 0) && (abs64(diff_ns) > 1000000)) {
 		atomic_inc(&dev->vblank[crtc].count);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 	}
 
 	/* Invalidate all timestamps while vblank irq's are off. */
@@ -864,9 +864,9 @@ static void drm_update_vblank_count(stru
 		vblanktimestamp(dev, crtc, tslot) = t_vblank;
 	}
 
-	smp_mb__before_atomic_inc();
+	smp_mb__before_atomic();
 	atomic_add(diff, &dev->vblank[crtc].count);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 }
 
 /**
@@ -1330,9 +1330,9 @@ bool drm_handle_vblank(struct drm_device
 		/* Increment cooked vblank count. This also atomically commits
 		 * the timestamp computed above.
 		 */
-		smp_mb__before_atomic_inc();
+		smp_mb__before_atomic();
 		atomic_inc(&dev->vblank[crtc].count);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 	} else {
 		DRM_DEBUG("crtc %d: Redundant vblirq ignored. diff_ns = %d\n",
 			  crtc, (int) diff_ns);
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1996,7 +1996,7 @@ static void i915_error_work_func(struct
 			 * updates before
 			 * the counter increment.
 			 */
-			smp_mb__before_atomic_inc();
+			smp_mb__before_atomic();
 			atomic_inc(&dev_priv->gpu_error.reset_counter);
 
 			kobject_uevent_env(&dev->primary->kdev->kobj,
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -841,7 +841,7 @@ static inline bool cached_dev_get(struct
 		return false;
 
 	/* Paired with the mb in cached_dev_attach */
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 	return true;
 }
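The "paired with the mb" comment is the usual publish/consume shape; a sketch
with made-up names (cfg, live), not the actual bcache fields:

#include <linux/atomic.h>

static int cfg;
static atomic_t live = ATOMIC_INIT(0);

static void publish(void)
{
	cfg = 42;
	smp_mb();			/* cfg before 'live', pairs with below */
	atomic_set(&live, 1);
}

static bool get(void)
{
	if (!atomic_inc_not_zero(&live))
		return false;
	smp_mb__after_atomic();		/* pairs with the smp_mb() in publish() */
	return cfg == 42;		/* cannot miss the store */
}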
 
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -243,7 +243,7 @@ static inline void set_closure_fn(struct
 	cl->fn = fn;
 	cl->wq = wq;
 	/* between atomic_dec() in closure_put() */
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 }
 
 static inline void closure_queue(struct closure *cl)
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -607,9 +607,9 @@ static void write_endio(struct bio *bio,
 
 	BUG_ON(!test_bit(B_WRITING, &b->state));
 
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(B_WRITING, &b->state);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	wake_up_bit(&b->state, B_WRITING);
 }
@@ -997,9 +997,9 @@ static void read_endio(struct bio *bio,
 
 	BUG_ON(!test_bit(B_READING, &b->state));
 
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(B_READING, &b->state);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	wake_up_bit(&b->state, B_READING);
 }
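Both endio paths follow the documented wake_up_bit() protocol; in sketch form
(state word and bit invented):

#include <linux/bitops.h>
#include <linux/wait.h>

static unsigned long state;
#define MY_WRITING	0

static void io_done(void)
{
	smp_mb__before_atomic();	/* completion data before the clear */
	clear_bit(MY_WRITING, &state);
	smp_mb__after_atomic();		/* clear visible before waiters re-test */
	wake_up_bit(&state, MY_WRITING);
}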
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -642,7 +642,7 @@ static void free_pending_exception(struc
 	struct dm_snapshot *s = pe->snap;
 
 	mempool_free(pe, s->pending_pool);
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_dec(&s->pending_exceptions_count);
 }
 
@@ -783,7 +783,7 @@ static int init_hash_tables(struct dm_sn
 static void merge_shutdown(struct dm_snapshot *s)
 {
 	clear_bit_unlock(RUNNING_MERGE, &s->state_bits);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&s->state_bits, RUNNING_MERGE);
 }
 
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2451,7 +2451,7 @@ static void dm_wq_work(struct work_struc
 static void dm_queue_flush(struct mapped_device *md)
 {
 	clear_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	queue_work(md->wq, &md->work);
 }
 
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4406,7 +4406,7 @@ static void raid5_unplug(struct blk_plug
 			 * STRIPE_ON_UNPLUG_LIST clear but the stripe
 			 * is still in our list
 			 */
-			smp_mb__before_clear_bit();
+			smp_mb__before_atomic();
 			clear_bit(STRIPE_ON_UNPLUG_LIST, &sh->state);
 			/*
 			 * STRIPE_ON_RELEASE_LIST could be set here. In that
--- a/drivers/media/usb/dvb-usb-v2/dvb_usb_core.c
+++ b/drivers/media/usb/dvb-usb-v2/dvb_usb_core.c
@@ -399,7 +399,7 @@ static int dvb_usb_stop_feed(struct dvb_
 
 	/* clear 'streaming' status bit */
 	clear_bit(ADAP_STREAMING, &adap->state_bits);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&adap->state_bits, ADAP_STREAMING);
 skip_feed_stop:
 
@@ -550,7 +550,7 @@ static int dvb_usb_fe_init(struct dvb_fr
 err:
 	if (!adap->suspend_resume_active) {
 		clear_bit(ADAP_INIT, &adap->state_bits);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		wake_up_bit(&adap->state_bits, ADAP_INIT);
 	}
 
@@ -591,7 +591,7 @@ static int dvb_usb_fe_sleep(struct dvb_f
 	if (!adap->suspend_resume_active) {
 		adap->active_fe = -1;
 		clear_bit(ADAP_SLEEP, &adap->state_bits);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		wake_up_bit(&adap->state_bits, ADAP_SLEEP);
 	}
 
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -2779,7 +2779,7 @@ int bnx2x_nic_load(struct bnx2x *bp, int
 
 	case LOAD_OPEN:
 		netif_tx_start_all_queues(bp->dev);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		break;
 
 	case LOAD_DIAG:
@@ -4773,9 +4773,9 @@ void bnx2x_tx_timeout(struct net_device
 		bnx2x_panic();
 #endif
 
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	set_bit(BNX2X_SP_RTNL_TX_TIMEOUT, &bp->sp_rtnl_state);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	/* This allows the netif to be shutdown gracefully before resetting */
 	schedule_delayed_work(&bp->sp_rtnl_task, 0);
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -1837,10 +1837,10 @@ void bnx2x_sp_event(struct bnx2x_fastpat
 	/* SRIOV: reschedule any 'in_progress' operations */
 	bnx2x_iov_sp_event(bp, cid, true);
 
-	smp_mb__before_atomic_inc();
+	smp_mb__before_atomic();
 	atomic_inc(&bp->cq_spq_left);
 	/* push the change in bp->spq_left and towards the memory */
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 
 	DP(BNX2X_MSG_SP, "bp->cq_spq_left %x\n", atomic_read(&bp->cq_spq_left));
 
@@ -1855,11 +1855,11 @@ void bnx2x_sp_event(struct bnx2x_fastpat
 		 * sp_state is cleared, and this order prevents
 		 * races
 		 */
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		set_bit(BNX2X_AFEX_PENDING_VIFSET_MCP_ACK, &bp->sp_state);
 		wmb();
 		clear_bit(BNX2X_AFEX_FCOE_Q_UPDATE_PENDING, &bp->sp_state);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 
 		/* schedule the sp task as mcp ack is required */
 		bnx2x_schedule_sp_task(bp);
@@ -3878,9 +3878,9 @@ static void bnx2x_fan_failure(struct bnx
 	 * This is due to some boards consuming sufficient power when driver is
 	 * up to overheat if fan fails.
 	 */
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	set_bit(BNX2X_SP_RTNL_FAN_FAILURE, &bp->sp_rtnl_state);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	schedule_delayed_work(&bp->sp_rtnl_task, 0);
 }
 
@@ -5137,9 +5137,9 @@ static void bnx2x_after_function_update(
 		__clear_bit(RAMROD_COMP_WAIT, &queue_params.ramrod_flags);
 
 		/* mark latest Q bit */
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		set_bit(BNX2X_AFEX_FCOE_Q_UPDATE_PENDING, &bp->sp_state);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 
 		/* send Q update ramrod for FCoE Q */
 		rc = bnx2x_queue_state_change(bp, &queue_params);
@@ -5282,10 +5282,10 @@ static void bnx2x_eq_int(struct bnx2x *b
 				 * sp_rtnl task as all Queue SP operations
 				 * should run under rtnl_lock.
 				 */
-				smp_mb__before_clear_bit();
+				smp_mb__before_atomic();
 				set_bit(BNX2X_SP_RTNL_AFEX_F_UPDATE,
 					&bp->sp_rtnl_state);
-				smp_mb__after_clear_bit();
+				smp_mb__after_atomic();
 
 				schedule_delayed_work(&bp->sp_rtnl_task, 0);
 			}
@@ -5368,7 +5368,7 @@ static void bnx2x_eq_int(struct bnx2x *b
 		spqe_cnt++;
 	} /* for */
 
-	smp_mb__before_atomic_inc();
+	smp_mb__before_atomic();
 	atomic_add(spqe_cnt, &bp->eq_spq_left);
 
 	bp->eq_cons = sw_cons;
@@ -12065,9 +12065,9 @@ static void bnx2x_set_rx_mode(struct net
 	} else {
 		/* Schedule an SP task to handle rest of change */
 		DP(NETIF_MSG_IFUP, "Scheduling an Rx mode change\n");
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		set_bit(BNX2X_SP_RTNL_RX_MODE, &bp->sp_rtnl_state);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		schedule_delayed_work(&bp->sp_rtnl_task, 0);
 	}
 }
@@ -12101,10 +12101,10 @@ void bnx2x_set_rx_mode_inner(struct bnx2
 			/* configuring mcast to a vf involves sleeping (when we
 			 * wait for the pf's response).
 			 */
-			smp_mb__before_clear_bit();
+			smp_mb__before_atomic();
 			set_bit(BNX2X_SP_RTNL_VFPF_MCAST,
 				&bp->sp_rtnl_state);
-			smp_mb__after_clear_bit();
+			smp_mb__after_atomic();
 			schedule_delayed_work(&bp->sp_rtnl_task, 0);
 		}
 	}
@@ -13723,9 +13723,9 @@ static int bnx2x_drv_ctl(struct net_devi
 	case DRV_CTL_RET_L2_SPQ_CREDIT_CMD: {
 		int count = ctl->data.credit.credit_count;
 
-		smp_mb__before_atomic_inc();
+		smp_mb__before_atomic();
 		atomic_add(count, &bp->cq_spq_left);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 		break;
 	}
 	case DRV_CTL_ULP_REGISTER_CMD: {
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -258,16 +258,16 @@ static bool bnx2x_raw_check_pending(stru
 
 static void bnx2x_raw_clear_pending(struct bnx2x_raw_obj *o)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(o->state, o->pstate);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 }
 
 static void bnx2x_raw_set_pending(struct bnx2x_raw_obj *o)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	set_bit(o->state, o->pstate);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 }
 
 /**
@@ -2131,7 +2131,7 @@ static int bnx2x_set_rx_mode_e1x(struct
 
 	/* The operation is completed */
 	clear_bit(p->state, p->pstate);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	return 0;
 }
@@ -3576,16 +3576,16 @@ int bnx2x_config_mcast(struct bnx2x *bp,
 
 static void bnx2x_mcast_clear_sched(struct bnx2x_mcast_obj *o)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(o->sched_state, o->raw.pstate);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 }
 
 static void bnx2x_mcast_set_sched(struct bnx2x_mcast_obj *o)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	set_bit(o->sched_state, o->raw.pstate);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 }
 
 static bool bnx2x_mcast_check_sched(struct bnx2x_mcast_obj *o)
@@ -4210,7 +4210,7 @@ int bnx2x_queue_state_change(struct bnx2
 		if (rc) {
 			o->next_state = BNX2X_Q_STATE_MAX;
 			clear_bit(pending_bit, pending);
-			smp_mb__after_clear_bit();
+			smp_mb__after_atomic();
 			return rc;
 		}
 
@@ -4298,7 +4298,7 @@ static int bnx2x_queue_comp_cmd(struct b
 	wmb();
 
 	clear_bit(cmd, &o->pending);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	return 0;
 }
@@ -5242,7 +5242,7 @@ static inline int bnx2x_func_state_chang
 	wmb();
 
 	clear_bit(cmd, &o->pending);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	return 0;
 }
@@ -5877,7 +5877,7 @@ int bnx2x_func_state_change(struct bnx2x
 		if (rc) {
 			o->next_state = BNX2X_F_STATE_MAX;
 			clear_bit(cmd, pending);
-			smp_mb__after_clear_bit();
+			smp_mb__after_atomic();
 			return rc;
 		}
 
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -971,10 +971,10 @@ static void bnx2x_vfop_qsetup(struct bnx
 op_done:
 	case BNX2X_VFOP_QSETUP_DONE:
 		vf->cfg_flags |= VF_CFG_VLAN;
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		set_bit(BNX2X_SP_RTNL_HYPERVISOR_VLAN,
 			&bp->sp_rtnl_state);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		schedule_delayed_work(&bp->sp_rtnl_task, 0);
 		bnx2x_vfop_end(bp, vf, vfop);
 		return;
@@ -2354,9 +2354,9 @@ static
 void bnx2x_vf_handle_filters_eqe(struct bnx2x *bp,
 				 struct bnx2x_virtf *vf)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(BNX2X_FILTER_RX_MODE_PENDING, &vf->filter_state);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 }
 
 int bnx2x_iov_eq_sp_event(struct bnx2x *bp, union event_ring_elem *elem)
@@ -3737,10 +3737,10 @@ void bnx2x_timer_sriov(struct bnx2x *bp)
 
 	/* if channel is down we need to self destruct */
 	if (bp->old_bulletin.valid_bitmap & 1 << CHANNEL_DOWN) {
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		set_bit(BNX2X_SP_RTNL_VFPF_CHANNEL_DOWN,
 			&bp->sp_rtnl_state);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		schedule_delayed_work(&bp->sp_rtnl_task, 0);
 	}
 }
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -436,7 +436,7 @@ static int cnic_offld_prep(struct cnic_s
 static int cnic_close_prep(struct cnic_sock *csk)
 {
 	clear_bit(SK_F_CONNECT_START, &csk->flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	if (test_and_clear_bit(SK_F_OFFLD_COMPLETE, &csk->flags)) {
 		while (test_and_set_bit(SK_F_OFFLD_SCHED, &csk->flags))
@@ -450,7 +450,7 @@ static int cnic_close_prep(struct cnic_s
 static int cnic_abort_prep(struct cnic_sock *csk)
 {
 	clear_bit(SK_F_CONNECT_START, &csk->flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	while (test_and_set_bit(SK_F_OFFLD_SCHED, &csk->flags))
 		msleep(1);
@@ -3645,7 +3645,7 @@ static int cnic_cm_destroy(struct cnic_s
 
 	csk_hold(csk);
 	clear_bit(SK_F_INUSE, &csk->flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	while (atomic_read(&csk->ref_count) != 1)
 		msleep(1);
 	cnic_cm_cleanup(csk);
@@ -4025,7 +4025,7 @@ static void cnic_cm_process_kcqe(struct
 			 L4_KCQE_COMPLETION_STATUS_PARITY_ERROR)
 			set_bit(SK_F_HW_ERR, &csk->flags);
 
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		clear_bit(SK_F_OFFLD_SCHED, &csk->flags);
 		cnic_cm_upcall(cp, csk, opcode);
 		break;
--- a/drivers/net/ethernet/brocade/bna/bnad.c
+++ b/drivers/net/ethernet/brocade/bna/bnad.c
@@ -249,7 +249,7 @@ bnad_tx_complete(struct bnad *bnad, stru
 	if (likely(test_bit(BNAD_TXQ_TX_STARTED, &tcb->flags)))
 		bna_ib_ack(tcb->i_dbell, sent);
 
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(BNAD_TXQ_FREE_SENT, &tcb->flags);
 
 	return sent;
@@ -1125,7 +1125,7 @@ bnad_tx_cleanup(struct delayed_work *wor
 
 		bnad_txq_cleanup(bnad, tcb);
 
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		clear_bit(BNAD_TXQ_FREE_SENT, &tcb->flags);
 	}
 
@@ -2999,7 +2999,7 @@ bnad_start_xmit(struct sk_buff *skb, str
 			sent = bnad_txcmpl_process(bnad, tcb);
 			if (likely(test_bit(BNAD_TXQ_TX_STARTED, &tcb->flags)))
 				bna_ib_ack(tcb->i_dbell, sent);
-			smp_mb__before_clear_bit();
+			smp_mb__before_atomic();
 			clear_bit(BNAD_TXQ_FREE_SENT, &tcb->flags);
 		} else {
 			netif_stop_queue(netdev);
--- a/drivers/net/ethernet/chelsio/cxgb/cxgb2.c
+++ b/drivers/net/ethernet/chelsio/cxgb/cxgb2.c
@@ -281,7 +281,7 @@ static int cxgb_close(struct net_device
 	if (adapter->params.stats_update_period &&
 	    !(adapter->open_device_map & PORT_MASK)) {
 		/* Stop statistics accumulation. */
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		spin_lock(&adapter->work_lock);   /* sync with update task */
 		spin_unlock(&adapter->work_lock);
 		cancel_mac_stats_update(adapter);
--- a/drivers/net/ethernet/chelsio/cxgb3/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/sge.c
@@ -1379,7 +1379,7 @@ static inline int check_desc_avail(struc
 		struct sge_qset *qs = txq_to_qset(q, qid);
 
 		set_bit(qid, &qs->txq_stopped);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 
 		if (should_restart_tx(q) &&
 		    test_and_clear_bit(qid, &qs->txq_stopped))
@@ -1492,7 +1492,7 @@ static void restart_ctrlq(unsigned long
 
 	if (!skb_queue_empty(&q->sendq)) {
 		set_bit(TXQ_CTRL, &qs->txq_stopped);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 
 		if (should_restart_tx(q) &&
 		    test_and_clear_bit(TXQ_CTRL, &qs->txq_stopped))
@@ -1697,7 +1697,7 @@ again:	reclaim_completed_tx(adap, q, TX_
 
 		if (unlikely(q->size - q->in_use < ndesc)) {
 			set_bit(TXQ_OFLD, &qs->txq_stopped);
-			smp_mb__after_clear_bit();
+			smp_mb__after_atomic();
 
 			if (should_restart_tx(q) &&
 			    test_and_clear_bit(TXQ_OFLD, &qs->txq_stopped))
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -2003,7 +2003,7 @@ static void sge_rx_timer_cb(unsigned lon
 			struct sge_fl *fl = s->egr_map[id];
 
 			clear_bit(id, s->starving_fl);
-			smp_mb__after_clear_bit();
+			smp_mb__after_atomic();
 
 			if (fl_starving(fl)) {
 				rxq = container_of(fl, struct sge_eth_rxq, fl);
--- a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
@@ -1951,7 +1951,7 @@ static void sge_rx_timer_cb(unsigned lon
 			struct sge_fl *fl = s->egr_map[id];
 
 			clear_bit(id, s->starving_fl);
-			smp_mb__after_clear_bit();
+			smp_mb__after_atomic();
 
 			/*
 			 * Since we are accessing fl without a lock there's a
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -4576,7 +4576,7 @@ static void i40e_service_event_complete(
 	BUG_ON(!test_bit(__I40E_SERVICE_SCHED, &pf->state));
 
 	/* flush memory to make sure state is correct before next watchog */
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(__I40E_SERVICE_SCHED, &pf->state);
 }
 
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -318,7 +318,7 @@ static void ixgbe_service_event_complete
 	BUG_ON(!test_bit(__IXGBE_SERVICE_SCHED, &adapter->state));
 
 	/* flush memory to make sure state is correct before next watchdog */
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(__IXGBE_SERVICE_SCHED, &adapter->state);
 }
 
@@ -4607,7 +4607,7 @@ static void ixgbe_up_complete(struct ixg
 	if (hw->mac.ops.enable_tx_laser)
 		hw->mac.ops.enable_tx_laser(hw);
 
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(__IXGBE_DOWN, &adapter->state);
 	ixgbe_napi_enable_all(adapter);
 
--- a/drivers/net/wireless/ti/wlcore/main.c
+++ b/drivers/net/wireless/ti/wlcore/main.c
@@ -547,7 +547,7 @@ static int wlcore_irq_locked(struct wl12
 		 * wl1271_ps_elp_wakeup cannot be called concurrently.
 		 */
 		clear_bit(WL1271_FLAG_IRQ_RUNNING, &wl->flags);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 
 		ret = wlcore_fw_status(wl, wl->fw_status_1, wl->fw_status_2);
 		if (ret < 0)
--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -662,9 +662,9 @@ static void pcifront_do_aer(struct work_
 	notify_remote_via_evtchn(pdev->evtchn);
 
 	/*in case of we lost an aer request in four lines time_window*/
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(_PDEVB_op_active, &pdev->flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	schedule_pcifront_aer_op(pdev);
 
--- a/drivers/scsi/isci/remote_device.c
+++ b/drivers/scsi/isci/remote_device.c
@@ -1541,7 +1541,7 @@ void isci_remote_device_release(struct k
 	clear_bit(IDEV_STOP_PENDING, &idev->flags);
 	clear_bit(IDEV_IO_READY, &idev->flags);
 	clear_bit(IDEV_GONE, &idev->flags);
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(IDEV_ALLOCATED, &idev->flags);
 	wake_up(&ihost->eventq);
 }
--- a/drivers/target/loopback/tcm_loop.c
+++ b/drivers/target/loopback/tcm_loop.c
@@ -942,7 +942,7 @@ static int tcm_loop_port_link(
 	struct tcm_loop_hba *tl_hba = tl_tpg->tl_hba;
 
 	atomic_inc(&tl_tpg->tl_tpg_port_count);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 	/*
 	 * Add Linux/SCSI struct scsi_device by HCTL
 	 */
@@ -977,7 +977,7 @@ static void tcm_loop_port_unlink(
 	scsi_device_put(sd);
 
 	atomic_dec(&tl_tpg->tl_tpg_port_count);
-	smp_mb__after_atomic_dec();
+	smp_mb__after_atomic();
 
 	pr_debug("TCM_Loop_ConfigFS: Port Unlink Successful\n");
 }
--- a/drivers/target/target_core_alua.c
+++ b/drivers/target/target_core_alua.c
@@ -393,7 +393,7 @@ target_emulate_set_target_port_groups(st
 					continue;
 
 				atomic_inc(&tg_pt_gp->tg_pt_gp_ref_cnt);
-				smp_mb__after_atomic_inc();
+				smp_mb__after_atomic();
 
 				spin_unlock(&dev->t10_alua.tg_pt_gps_lock);
 
@@ -404,7 +404,7 @@ target_emulate_set_target_port_groups(st
 
 				spin_lock(&dev->t10_alua.tg_pt_gps_lock);
 				atomic_dec(&tg_pt_gp->tg_pt_gp_ref_cnt);
-				smp_mb__after_atomic_dec();
+				smp_mb__after_atomic();
 				break;
 			}
 			spin_unlock(&dev->t10_alua.tg_pt_gps_lock);
@@ -997,7 +997,7 @@ static void core_alua_do_transition_tg_p
 		 * TARGET PORT GROUPS command
 		 */
 		atomic_inc(&mem->tg_pt_gp_mem_ref_cnt);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 		spin_unlock(&tg_pt_gp->tg_pt_gp_lock);
 
 		spin_lock_bh(&port->sep_alua_lock);
@@ -1027,7 +1027,7 @@ static void core_alua_do_transition_tg_p
 
 		spin_lock(&tg_pt_gp->tg_pt_gp_lock);
 		atomic_dec(&mem->tg_pt_gp_mem_ref_cnt);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 	}
 	spin_unlock(&tg_pt_gp->tg_pt_gp_lock);
 	/*
@@ -1061,7 +1061,7 @@ static void core_alua_do_transition_tg_p
 		core_alua_dump_state(tg_pt_gp->tg_pt_gp_alua_pending_state));
 	spin_lock(&dev->t10_alua.tg_pt_gps_lock);
 	atomic_dec(&tg_pt_gp->tg_pt_gp_ref_cnt);
-	smp_mb__after_atomic_dec();
+	smp_mb__after_atomic();
 	spin_unlock(&dev->t10_alua.tg_pt_gps_lock);
 
 	if (tg_pt_gp->tg_pt_gp_transition_complete)
@@ -1123,7 +1123,7 @@ static int core_alua_do_transition_tg_pt
 	 */
 	spin_lock(&dev->t10_alua.tg_pt_gps_lock);
 	atomic_inc(&tg_pt_gp->tg_pt_gp_ref_cnt);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 	spin_unlock(&dev->t10_alua.tg_pt_gps_lock);
 
 	if (!explicit && tg_pt_gp->tg_pt_gp_implicit_trans_secs) {
@@ -1166,7 +1166,7 @@ int core_alua_do_port_transition(
 	spin_lock(&local_lu_gp_mem->lu_gp_mem_lock);
 	lu_gp = local_lu_gp_mem->lu_gp;
 	atomic_inc(&lu_gp->lu_gp_ref_cnt);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 	spin_unlock(&local_lu_gp_mem->lu_gp_mem_lock);
 	/*
 	 * For storage objects that are members of the 'default_lu_gp',
@@ -1183,7 +1183,7 @@ int core_alua_do_port_transition(
 		rc = core_alua_do_transition_tg_pt(l_tg_pt_gp,
 						   new_state, explicit);
 		atomic_dec(&lu_gp->lu_gp_ref_cnt);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 		return rc;
 	}
 	/*
@@ -1197,7 +1197,7 @@ int core_alua_do_port_transition(
 
 		dev = lu_gp_mem->lu_gp_mem_dev;
 		atomic_inc(&lu_gp_mem->lu_gp_mem_ref_cnt);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 		spin_unlock(&lu_gp->lu_gp_lock);
 
 		spin_lock(&dev->t10_alua.tg_pt_gps_lock);
@@ -1226,7 +1226,7 @@ int core_alua_do_port_transition(
 				tg_pt_gp->tg_pt_gp_alua_nacl = NULL;
 			}
 			atomic_inc(&tg_pt_gp->tg_pt_gp_ref_cnt);
-			smp_mb__after_atomic_inc();
+			smp_mb__after_atomic();
 			spin_unlock(&dev->t10_alua.tg_pt_gps_lock);
 			/*
 			 * core_alua_do_transition_tg_pt() will always return
@@ -1237,7 +1237,7 @@ int core_alua_do_port_transition(
 
 			spin_lock(&dev->t10_alua.tg_pt_gps_lock);
 			atomic_dec(&tg_pt_gp->tg_pt_gp_ref_cnt);
-			smp_mb__after_atomic_dec();
+			smp_mb__after_atomic();
 			if (rc)
 				break;
 		}
@@ -1245,7 +1245,7 @@ int core_alua_do_port_transition(
 
 		spin_lock(&lu_gp->lu_gp_lock);
 		atomic_dec(&lu_gp_mem->lu_gp_mem_ref_cnt);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 	}
 	spin_unlock(&lu_gp->lu_gp_lock);
 
@@ -1259,7 +1259,7 @@ int core_alua_do_port_transition(
 	}
 
 	atomic_dec(&lu_gp->lu_gp_ref_cnt);
-	smp_mb__after_atomic_dec();
+	smp_mb__after_atomic();
 	return rc;
 }
 
--- a/drivers/target/target_core_device.c
+++ b/drivers/target/target_core_device.c
@@ -225,7 +225,7 @@ struct se_dev_entry *core_get_se_deve_fr
 			continue;
 
 		atomic_inc(&deve->pr_ref_count);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 		spin_unlock_irq(&nacl->device_list_lock);
 
 		return deve;
@@ -1392,7 +1392,7 @@ int core_dev_add_initiator_node_lun_acl(
 	spin_lock(&lun->lun_acl_lock);
 	list_add_tail(&lacl->lacl_list, &lun->lun_acl_list);
 	atomic_inc(&lun->lun_acl_count);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 	spin_unlock(&lun->lun_acl_lock);
 
 	pr_debug("%s_TPG[%hu]_LUN[%u->%u] - Added %s ACL for "
@@ -1426,7 +1426,7 @@ int core_dev_del_initiator_node_lun_acl(
 	spin_lock(&lun->lun_acl_lock);
 	list_del(&lacl->lacl_list);
 	atomic_dec(&lun->lun_acl_count);
-	smp_mb__after_atomic_dec();
+	smp_mb__after_atomic();
 	spin_unlock(&lun->lun_acl_lock);
 
 	core_disable_device_list_for_node(lun, NULL, lacl->mapped_lun,
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -324,7 +324,7 @@ static void iblock_bio_done(struct bio *
 		 * Bump the ib_bio_err_cnt and release bio.
 		 */
 		atomic_inc(&ibr->ib_bio_err_cnt);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 	}
 
 	bio_put(bio);
--- a/drivers/target/target_core_pr.c
+++ b/drivers/target/target_core_pr.c
@@ -675,7 +675,7 @@ static struct t10_pr_registration *__cor
 	spin_lock(&dev->se_port_lock);
 	list_for_each_entry_safe(port, port_tmp, &dev->dev_sep_list, sep_list) {
 		atomic_inc(&port->sep_tg_pt_ref_cnt);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 		spin_unlock(&dev->se_port_lock);
 
 		spin_lock_bh(&port->sep_alua_lock);
@@ -710,7 +710,7 @@ static struct t10_pr_registration *__cor
 				continue;
 
 			atomic_inc(&deve_tmp->pr_ref_count);
-			smp_mb__after_atomic_inc();
+			smp_mb__after_atomic();
 			spin_unlock_bh(&port->sep_alua_lock);
 			/*
 			 * Grab a configfs group dependency that is released
@@ -723,9 +723,9 @@ static struct t10_pr_registration *__cor
 				pr_err("core_scsi3_lunacl_depend"
 						"_item() failed\n");
 				atomic_dec(&port->sep_tg_pt_ref_cnt);
-				smp_mb__after_atomic_dec();
+				smp_mb__after_atomic();
 				atomic_dec(&deve_tmp->pr_ref_count);
-				smp_mb__after_atomic_dec();
+				smp_mb__after_atomic();
 				goto out;
 			}
 			/*
@@ -740,9 +740,9 @@ static struct t10_pr_registration *__cor
 						sa_res_key, all_tg_pt, aptpl);
 			if (!pr_reg_atp) {
 				atomic_dec(&port->sep_tg_pt_ref_cnt);
-				smp_mb__after_atomic_dec();
+				smp_mb__after_atomic();
 				atomic_dec(&deve_tmp->pr_ref_count);
-				smp_mb__after_atomic_dec();
+				smp_mb__after_atomic();
 				core_scsi3_lunacl_undepend_item(deve_tmp);
 				goto out;
 			}
@@ -755,7 +755,7 @@ static struct t10_pr_registration *__cor
 
 		spin_lock(&dev->se_port_lock);
 		atomic_dec(&port->sep_tg_pt_ref_cnt);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 	}
 	spin_unlock(&dev->se_port_lock);
 
@@ -1110,7 +1110,7 @@ static struct t10_pr_registration *__cor
 					continue;
 			}
 			atomic_inc(&pr_reg->pr_res_holders);
-			smp_mb__after_atomic_inc();
+			smp_mb__after_atomic();
 			spin_unlock(&pr_tmpl->registration_lock);
 			return pr_reg;
 		}
@@ -1125,7 +1125,7 @@ static struct t10_pr_registration *__cor
 			continue;
 
 		atomic_inc(&pr_reg->pr_res_holders);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 		spin_unlock(&pr_tmpl->registration_lock);
 		return pr_reg;
 	}
@@ -1155,7 +1155,7 @@ static struct t10_pr_registration *core_
 static void core_scsi3_put_pr_reg(struct t10_pr_registration *pr_reg)
 {
 	atomic_dec(&pr_reg->pr_res_holders);
-	smp_mb__after_atomic_dec();
+	smp_mb__after_atomic();
 }
 
 static int core_scsi3_check_implicit_release(
@@ -1349,7 +1349,7 @@ static void core_scsi3_tpg_undepend_item
 			&tpg->tpg_group.cg_item);
 
 	atomic_dec(&tpg->tpg_pr_ref_count);
-	smp_mb__after_atomic_dec();
+	smp_mb__after_atomic();
 }
 
 static int core_scsi3_nodeacl_depend_item(struct se_node_acl *nacl)
@@ -1369,7 +1369,7 @@ static void core_scsi3_nodeacl_undepend_
 
 	if (nacl->dynamic_node_acl) {
 		atomic_dec(&nacl->acl_pr_ref_count);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 		return;
 	}
 
@@ -1377,7 +1377,7 @@ static void core_scsi3_nodeacl_undepend_
 			&nacl->acl_group.cg_item);
 
 	atomic_dec(&nacl->acl_pr_ref_count);
-	smp_mb__after_atomic_dec();
+	smp_mb__after_atomic();
 }
 
 static int core_scsi3_lunacl_depend_item(struct se_dev_entry *se_deve)
@@ -1408,7 +1408,7 @@ static void core_scsi3_lunacl_undepend_i
 	 */
 	if (!lun_acl) {
 		atomic_dec(&se_deve->pr_ref_count);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 		return;
 	}
 	nacl = lun_acl->se_lun_nacl;
@@ -1418,7 +1418,7 @@ static void core_scsi3_lunacl_undepend_i
 			&lun_acl->se_lun_group.cg_item);
 
 	atomic_dec(&se_deve->pr_ref_count);
-	smp_mb__after_atomic_dec();
+	smp_mb__after_atomic();
 }
 
 static sense_reason_t
@@ -1552,14 +1552,14 @@ core_scsi3_decode_spec_i_port(
 				continue;
 
 			atomic_inc(&tmp_tpg->tpg_pr_ref_count);
-			smp_mb__after_atomic_inc();
+			smp_mb__after_atomic();
 			spin_unlock(&dev->se_port_lock);
 
 			if (core_scsi3_tpg_depend_item(tmp_tpg)) {
 				pr_err(" core_scsi3_tpg_depend_item()"
 					" for tmp_tpg\n");
 				atomic_dec(&tmp_tpg->tpg_pr_ref_count);
-				smp_mb__after_atomic_dec();
+				smp_mb__after_atomic();
 				ret = TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
 				goto out_unmap;
 			}
@@ -1573,7 +1573,7 @@ core_scsi3_decode_spec_i_port(
 						tmp_tpg, i_str);
 			if (dest_node_acl) {
 				atomic_inc(&dest_node_acl->acl_pr_ref_count);
-				smp_mb__after_atomic_inc();
+				smp_mb__after_atomic();
 			}
 			spin_unlock_irq(&tmp_tpg->acl_node_lock);
 
@@ -1587,7 +1587,7 @@ core_scsi3_decode_spec_i_port(
 				pr_err("configfs_depend_item() failed"
 					" for dest_node_acl->acl_group\n");
 				atomic_dec(&dest_node_acl->acl_pr_ref_count);
-				smp_mb__after_atomic_dec();
+				smp_mb__after_atomic();
 				core_scsi3_tpg_undepend_item(tmp_tpg);
 				ret = TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
 				goto out_unmap;
@@ -1647,7 +1647,7 @@ core_scsi3_decode_spec_i_port(
 			pr_err("core_scsi3_lunacl_depend_item()"
 					" failed\n");
 			atomic_dec(&dest_se_deve->pr_ref_count);
-			smp_mb__after_atomic_dec();
+			smp_mb__after_atomic();
 			core_scsi3_nodeacl_undepend_item(dest_node_acl);
 			core_scsi3_tpg_undepend_item(dest_tpg);
 			ret = TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
@@ -3165,14 +3165,14 @@ core_scsi3_emulate_pro_register_and_move
 			continue;
 
 		atomic_inc(&dest_se_tpg->tpg_pr_ref_count);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 		spin_unlock(&dev->se_port_lock);
 
 		if (core_scsi3_tpg_depend_item(dest_se_tpg)) {
 			pr_err("core_scsi3_tpg_depend_item() failed"
 				" for dest_se_tpg\n");
 			atomic_dec(&dest_se_tpg->tpg_pr_ref_count);
-			smp_mb__after_atomic_dec();
+			smp_mb__after_atomic();
 			ret = TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
 			goto out_put_pr_reg;
 		}
@@ -3270,7 +3270,7 @@ core_scsi3_emulate_pro_register_and_move
 				initiator_str);
 	if (dest_node_acl) {
 		atomic_inc(&dest_node_acl->acl_pr_ref_count);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 	}
 	spin_unlock_irq(&dest_se_tpg->acl_node_lock);
 
@@ -3286,7 +3286,7 @@ core_scsi3_emulate_pro_register_and_move
 		pr_err("core_scsi3_nodeacl_depend_item() for"
 			" dest_node_acl\n");
 		atomic_dec(&dest_node_acl->acl_pr_ref_count);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 		dest_node_acl = NULL;
 		ret = TCM_INVALID_PARAMETER_LIST;
 		goto out;
@@ -3311,7 +3311,7 @@ core_scsi3_emulate_pro_register_and_move
 	if (core_scsi3_lunacl_depend_item(dest_se_deve)) {
 		pr_err("core_scsi3_lunacl_depend_item() failed\n");
 		atomic_dec(&dest_se_deve->pr_ref_count);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 		dest_se_deve = NULL;
 		ret = TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
 		goto out;
@@ -3877,7 +3877,7 @@ core_scsi3_pri_read_full_status(struct s
 		add_desc_len = 0;
 
 		atomic_inc(&pr_reg->pr_res_holders);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 		spin_unlock(&pr_tmpl->registration_lock);
 		/*
 		 * Determine expected length of $FABRIC_MOD specific
@@ -3891,7 +3891,7 @@ core_scsi3_pri_read_full_status(struct s
 				" out of buffer: %d\n", cmd->data_length);
 			spin_lock(&pr_tmpl->registration_lock);
 			atomic_dec(&pr_reg->pr_res_holders);
-			smp_mb__after_atomic_dec();
+			smp_mb__after_atomic();
 			break;
 		}
 		/*
@@ -3953,7 +3953,7 @@ core_scsi3_pri_read_full_status(struct s
 
 		spin_lock(&pr_tmpl->registration_lock);
 		atomic_dec(&pr_reg->pr_res_holders);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 		/*
 		 * Set the ADDITIONAL DESCRIPTOR LENGTH
 		 */
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -728,7 +728,7 @@ void target_qf_do_work(struct work_struc
 	list_for_each_entry_safe(cmd, cmd_tmp, &qf_cmd_list, se_qf_node) {
 		list_del(&cmd->se_qf_node);
 		atomic_dec(&dev->dev_qf_count);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 
 		pr_debug("Processing %s cmd: %p QUEUE_FULL in work queue"
 			" context: %s\n", cmd->se_tfo->get_fabric_name(), cmd,
@@ -1140,7 +1140,7 @@ transport_check_alloc_task_attr(struct s
 	 * Dormant to Active status.
 	 */
 	cmd->se_ordered_id = atomic_inc_return(&dev->dev_ordered_id);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 	pr_debug("Allocated se_ordered_id: %u for Task Attr: 0x%02x on %s\n",
 			cmd->se_ordered_id, cmd->sam_task_attr,
 			dev->transport->name);
@@ -1692,7 +1692,7 @@ static bool target_handle_task_attr(stru
 		return false;
 	case MSG_ORDERED_TAG:
 		atomic_inc(&dev->dev_ordered_sync);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 
 		pr_debug("Added ORDERED for CDB: 0x%02x to ordered list, "
 			 " se_ordered_id: %u\n",
@@ -1710,7 +1710,7 @@ static bool target_handle_task_attr(stru
 		 * For SIMPLE and UNTAGGED Task Attribute commands
 		 */
 		atomic_inc(&dev->simple_cmds);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 		break;
 	}
 
@@ -1806,7 +1806,7 @@ static void transport_complete_task_attr
 
 	if (cmd->sam_task_attr == MSG_SIMPLE_TAG) {
 		atomic_dec(&dev->simple_cmds);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 		dev->dev_cur_ordered_id++;
 		pr_debug("Incremented dev->dev_cur_ordered_id: %u for"
 			" SIMPLE: %u\n", dev->dev_cur_ordered_id,
@@ -1818,7 +1818,7 @@ static void transport_complete_task_attr
 			cmd->se_ordered_id);
 	} else if (cmd->sam_task_attr == MSG_ORDERED_TAG) {
 		atomic_dec(&dev->dev_ordered_sync);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 
 		dev->dev_cur_ordered_id++;
 		pr_debug("Incremented dev_cur_ordered_id: %u for ORDERED:"
@@ -1877,7 +1877,7 @@ static void transport_handle_queue_full(
 	spin_lock_irq(&dev->qf_cmd_lock);
 	list_add_tail(&cmd->se_qf_node, &cmd->se_dev->qf_cmd_list);
 	atomic_inc(&dev->dev_qf_count);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 	spin_unlock_irq(&cmd->se_dev->qf_cmd_lock);
 
 	schedule_work(&cmd->se_dev->qf_work_queue);
@@ -2805,7 +2805,7 @@ void transport_send_task_abort(struct se
 	if (cmd->data_direction == DMA_TO_DEVICE) {
 		if (cmd->se_tfo->write_pending_status(cmd) != 0) {
 			cmd->transport_state |= CMD_T_ABORTED;
-			smp_mb__after_atomic_inc();
+			smp_mb__after_atomic();
 			return;
 		}
 	}
--- a/drivers/target/target_core_ua.c
+++ b/drivers/target/target_core_ua.c
@@ -162,7 +162,7 @@ int core_scsi3_ua_allocate(
 		spin_unlock_irq(&nacl->device_list_lock);
 
 		atomic_inc(&deve->ua_count);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 		return 0;
 	}
 	list_add_tail(&ua->ua_nacl_list, &deve->ua_list);
@@ -175,7 +175,7 @@ int core_scsi3_ua_allocate(
 		asc, ascq);
 
 	atomic_inc(&deve->ua_count);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 	return 0;
 }
 
@@ -190,7 +190,7 @@ void core_scsi3_ua_release_all(
 		kmem_cache_free(se_ua_cache, ua);
 
 		atomic_dec(&deve->ua_count);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 	}
 	spin_unlock(&deve->ua_lock);
 }
@@ -251,7 +251,7 @@ void core_scsi3_ua_for_check_condition(
 		kmem_cache_free(se_ua_cache, ua);
 
 		atomic_dec(&deve->ua_count);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 	}
 	spin_unlock(&deve->ua_lock);
 	spin_unlock_irq(&nacl->device_list_lock);
@@ -310,7 +310,7 @@ int core_scsi3_ua_clear_for_request_sens
 		kmem_cache_free(se_ua_cache, ua);
 
 		atomic_dec(&deve->ua_count);
-		smp_mb__after_atomic_dec();
+		smp_mb__after_atomic();
 	}
 	spin_unlock(&deve->ua_lock);
 	spin_unlock_irq(&nacl->device_list_lock);
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -2042,7 +2042,7 @@ static int canon_copy_from_read_buf(stru
 
 	if (found)
 		clear_bit(eol, ldata->read_flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	ldata->read_tail += c;
 
 	if (found) {
--- a/drivers/tty/serial/mxs-auart.c
+++ b/drivers/tty/serial/mxs-auart.c
@@ -200,7 +200,7 @@ static void dma_tx_callback(void *param)
 
 	/* clear the bit used to serialize the DMA tx. */
 	clear_bit(MXS_AUART_DMA_TX_SYNC, &s->flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	/* wake up the possible processes. */
 	if (uart_circ_chars_pending(xmit) < WAKEUP_CHARS)
@@ -275,7 +275,7 @@ static void mxs_auart_tx_chars(struct mx
 			mxs_auart_dma_tx(s, i);
 		} else {
 			clear_bit(MXS_AUART_DMA_TX_SYNC, &s->flags);
-			smp_mb__after_clear_bit();
+			smp_mb__after_atomic();
 		}
 		return;
 	}
--- a/drivers/usb/gadget/tcm_usb_gadget.c
+++ b/drivers/usb/gadget/tcm_usb_gadget.c
@@ -1846,7 +1846,7 @@ static int usbg_port_link(struct se_port
 	struct usbg_tpg *tpg = container_of(se_tpg, struct usbg_tpg, se_tpg);
 
 	atomic_inc(&tpg->tpg_port_count);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 	return 0;
 }
 
@@ -1856,7 +1856,7 @@ static void usbg_port_unlink(struct se_p
 	struct usbg_tpg *tpg = container_of(se_tpg, struct usbg_tpg, se_tpg);
 
 	atomic_dec(&tpg->tpg_port_count);
-	smp_mb__after_atomic_dec();
+	smp_mb__after_atomic();
 }
 
 static int usbg_check_stop_free(struct se_cmd *se_cmd)
--- a/drivers/usb/serial/usb_wwan.c
+++ b/drivers/usb/serial/usb_wwan.c
@@ -325,7 +325,7 @@ static void usb_wwan_outdat_callback(str
 
 	for (i = 0; i < N_OUT_URB; ++i) {
 		if (portdata->out_urbs[i] == urb) {
-			smp_mb__before_clear_bit();
+			smp_mb__before_atomic();
 			clear_bit(i, &portdata->out_busy);
 			break;
 		}
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1244,7 +1244,7 @@ vhost_scsi_set_endpoint(struct vhost_scs
 			tpg->tv_tpg_vhost_count++;
 			tpg->vhost_scsi = vs;
 			vs_tpg[tpg->tport_tpgt] = tpg;
-			smp_mb__after_atomic_inc();
+			smp_mb__after_atomic();
 			match = true;
 		}
 		mutex_unlock(&tpg->tv_tpg_mutex);
--- a/drivers/w1/w1_family.c
+++ b/drivers/w1/w1_family.c
@@ -131,9 +131,9 @@ void w1_family_get(struct w1_family *f)
 
 void __w1_family_get(struct w1_family *f)
 {
-	smp_mb__before_atomic_inc();
+	smp_mb__before_atomic();
 	atomic_inc(&f->refcnt);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 }
 
 EXPORT_SYMBOL(w1_unregister_family);
--- a/drivers/xen/xen-pciback/pciback_ops.c
+++ b/drivers/xen/xen-pciback/pciback_ops.c
@@ -348,9 +348,9 @@ void xen_pcibk_do_op(struct work_struct
 	notify_remote_via_irq(pdev->evtchn_irq);
 
 	/* Mark that we're done. */
-	smp_mb__before_clear_bit(); /* /after/ clearing PCIF_active */
+	smp_mb__before_atomic(); /* /after/ clearing PCIF_active */
 	clear_bit(_PDEVF_op_active, &pdev->flags);
-	smp_mb__after_clear_bit(); /* /before/ final check for work */
+	smp_mb__after_atomic(); /* /before/ final check for work */
 
 	/* Check to see if the driver domain tried to start another request in
 	 * between clearing _XEN_PCIF_active and clearing _PDEVF_op_active.
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -279,7 +279,7 @@ static inline void btrfs_inode_block_unl
 
 static inline void btrfs_inode_resume_unlocked_dio(struct inode *inode)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(BTRFS_INODE_READDIO_NEED_LOCK,
 		  &BTRFS_I(inode)->runtime_flags);
 }
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3451,7 +3451,7 @@ static int lock_extent_buffer_for_io(str
 static void end_extent_buffer_writeback(struct extent_buffer *eb)
 {
 	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7078,7 +7078,7 @@ static void btrfs_end_dio_bio(struct bio
 		 * before atomic variable goto zero, we must make sure
 		 * dip->errors is perceived to be set.
 		 */
-		smp_mb__before_atomic_dec();
+		smp_mb__before_atomic();
 	}
 
 	/* if there are more bios still pending for this dio, just exit */
@@ -7258,7 +7258,7 @@ static int btrfs_submit_direct_hook(int
 	 * before atomic variable goto zero, we must
 	 * make sure dip->errors is perceived to be set.
 	 */
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	if (atomic_dec_and_test(&dip->pending_bios))
 		bio_io_error(dip->orig_bio);
 
@@ -7401,7 +7401,7 @@ static ssize_t btrfs_direct_IO(int rw, s
 		return 0;
 
 	atomic_inc(&inode->i_dio_count);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 
 	/*
 	 * The generic stuff only does filemap_write_and_wait_range, which isn't
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -77,7 +77,7 @@ EXPORT_SYMBOL(__lock_buffer);
 void unlock_buffer(struct buffer_head *bh)
 {
 	clear_bit_unlock(BH_Lock, &bh->b_state);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&bh->b_state, BH_Lock);
 }
 EXPORT_SYMBOL(unlock_buffer);
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -42,7 +42,7 @@ int ext4_resize_begin(struct super_block
 void ext4_resize_end(struct super_block *sb)
 {
 	clear_bit_unlock(EXT4_RESIZING, &EXT4_SB(sb)->s_resize_flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 }
 
 static ext4_group_t ext4_meta_bg_first_group(struct super_block *sb,
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -275,7 +275,7 @@ static inline int may_grant(const struct
 static void gfs2_holder_wake(struct gfs2_holder *gh)
 {
 	clear_bit(HIF_WAIT, &gh->gh_iflags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&gh->gh_iflags, HIF_WAIT);
 }
 
@@ -409,7 +409,7 @@ static void gfs2_demote_wake(struct gfs2
 {
 	gl->gl_demote_state = LM_ST_EXCLUSIVE;
 	clear_bit(GLF_DEMOTE, &gl->gl_flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&gl->gl_flags, GLF_DEMOTE);
 }
 
@@ -618,7 +618,7 @@ __acquires(&gl->gl_spin)
 
 out_sched:
 	clear_bit(GLF_LOCK, &gl->gl_flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	gl->gl_lockref.count++;
 	if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
 		gl->gl_lockref.count--;
@@ -626,7 +626,7 @@ __acquires(&gl->gl_spin)
 
 out_unlock:
 	clear_bit(GLF_LOCK, &gl->gl_flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	return;
 }
 
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -219,7 +219,7 @@ static void inode_go_sync(struct gfs2_gl
 	 * Writeback of the data mapping may cause the dirty flag to be set
 	 * so we have to clear it again here.
 	 */
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(GLF_DIRTY, &gl->gl_flags);
 }
 
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1132,7 +1132,7 @@ static void gdlm_recover_done(void *arg,
 		queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);
 
 	clear_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&ls->ls_recover_flags, DFL_DLM_RECOVERY);
 	spin_unlock(&ls->ls_recover_spin);
 }
@@ -1269,7 +1269,7 @@ static int gdlm_mount(struct gfs2_sbd *s
 
 	ls->ls_first = !!test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
 	clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
 	return 0;
 
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -587,7 +587,7 @@ void gfs2_recover_func(struct work_struc
 	gfs2_recovery_done(sdp, jd->jd_jid, LM_RD_GAVEUP);
 done:
 	clear_bit(JDF_RECOVERY, &jd->jd_flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&jd->jd_flags, JDF_RECOVERY);
 }
 
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -332,7 +332,7 @@ static ssize_t block_store(struct gfs2_s
 		set_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags);
 	else if (val == 0) {
 		clear_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		gfs2_glock_thaw(sdp);
 	} else {
 		ret = -EINVAL;
@@ -481,7 +481,7 @@ static ssize_t jid_store(struct gfs2_sbd
 		rv = jid = -EINVAL;
 	sdp->sd_lockstruct.ls_jid = jid;
 	clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
 out:
 	spin_unlock(&sdp->sd_jindex_spin);
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -43,7 +43,7 @@ static void journal_end_buffer_io_sync(s
 		clear_buffer_uptodate(bh);
 	if (orig_bh) {
 		clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		wake_up_bit(&orig_bh->b_state, BH_Shadow);
 	}
 	unlock_buffer(bh);
@@ -239,7 +239,7 @@ static int journal_submit_data_buffers(j
 		spin_lock(&journal->j_list_lock);
 		J_ASSERT(jinode->i_transaction == commit_transaction);
 		clear_bit(__JI_COMMIT_RUNNING, &jinode->i_flags);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
 	}
 	spin_unlock(&journal->j_list_lock);
@@ -277,7 +277,7 @@ static int journal_finish_inode_data_buf
 		}
 		spin_lock(&journal->j_list_lock);
 		clear_bit(__JI_COMMIT_RUNNING, &jinode->i_flags);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
 	}
 
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1985,9 +1985,9 @@ static void nfs_access_free_entry(struct
 {
 	put_rpccred(entry->cred);
 	kfree(entry);
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_long_dec(&nfs_access_nr_entries);
-	smp_mb__after_atomic_dec();
+	smp_mb__after_atomic();
 }
 
 static void nfs_access_free_list(struct list_head *head)
@@ -2035,9 +2035,9 @@ nfs_access_cache_scan(struct shrinker *s
 		else {
 remove_lru_entry:
 			list_del_init(&nfsi->access_cache_inode_lru);
-			smp_mb__before_clear_bit();
+			smp_mb__before_atomic();
 			clear_bit(NFS_INO_ACL_LRU_SET, &nfsi->flags);
-			smp_mb__after_clear_bit();
+			smp_mb__after_atomic();
 		}
 		spin_unlock(&inode->i_lock);
 	}
@@ -2185,9 +2185,9 @@ void nfs_access_add_cache(struct inode *
 	nfs_access_add_rbtree(inode, cache);
 
 	/* Update accounting */
-	smp_mb__before_atomic_inc();
+	smp_mb__before_atomic();
 	atomic_long_inc(&nfs_access_nr_entries);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 
 	/* Add inode to global LRU list */
 	if (!test_bit(NFS_INO_ACL_LRU_SET, &NFS_I(inode)->flags)) {
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1058,7 +1058,7 @@ int nfs_revalidate_mapping(struct inode
 	trace_nfs_invalidate_mapping_exit(inode, ret);
 
 	clear_bit_unlock(NFS_INO_INVALIDATING, bitlock);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(bitlock, NFS_INO_INVALIDATING);
 out:
 	return ret;
--- a/fs/nfs/nfs4filelayoutdev.c
+++ b/fs/nfs/nfs4filelayoutdev.c
@@ -789,9 +789,9 @@ static void nfs4_wait_ds_connect(struct
 
 static void nfs4_clear_ds_conn_bit(struct nfs4_pnfs_ds *ds)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(NFS4DS_CONNECTING, &ds->ds_state);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&ds->ds_state, NFS4DS_CONNECTING);
 }
 
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1145,9 +1145,9 @@ static int nfs4_run_state_manager(void *
 
 static void nfs4_clear_state_manager_bit(struct nfs_client *clp)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&clp->cl_state, NFS4CLNT_MANAGER_RUNNING);
 	rpc_wake_up(&clp->cl_rpcwaitq);
 }
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -95,7 +95,7 @@ nfs_iocounter_dec(struct nfs_io_counter
 {
 	if (atomic_dec_and_test(&c->io_count)) {
 		clear_bit(NFS_IO_INPROGRESS, &c->flags);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		wake_up_bit(&c->flags, NFS_IO_INPROGRESS);
 	}
 }
@@ -193,9 +193,9 @@ void nfs_unlock_request(struct nfs_page
 		printk(KERN_ERR "NFS: Invalid unlock attempted\n");
 		BUG();
 	}
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(PG_BUSY, &req->wb_flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&req->wb_flags, PG_BUSY);
 }
 
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1795,7 +1795,7 @@ static void pnfs_clear_layoutcommitting(
 	unsigned long *bitlock = &NFS_I(inode)->flags;
 
 	clear_bit_unlock(NFS_INO_LAYOUTCOMMITTING, bitlock);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(bitlock, NFS_INO_LAYOUTCOMMITTING);
 }
 
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -275,7 +275,7 @@ pnfs_get_lseg(struct pnfs_layout_segment
 {
 	if (lseg) {
 		atomic_inc(&lseg->pls_refcount);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 	}
 	return lseg;
 }
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -405,7 +405,7 @@ int nfs_writepages(struct address_space
 	nfs_pageio_complete(&pgio);
 
 	clear_bit_unlock(NFS_INO_FLUSHING, bitlock);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(bitlock, NFS_INO_FLUSHING);
 
 	if (err < 0)
@@ -1458,7 +1458,7 @@ static int nfs_commit_set_lock(struct nf
 static void nfs_commit_clear_lock(struct nfs_inode *nfsi)
 {
 	clear_bit(NFS_INO_COMMIT, &nfsi->flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_bit(&nfsi->flags, NFS_INO_COMMIT);
 }
 
--- a/fs/ubifs/lpt_commit.c
+++ b/fs/ubifs/lpt_commit.c
@@ -460,9 +460,9 @@ static int write_cnodes(struct ubifs_inf
 		 * important.
 		 */
 		clear_bit(DIRTY_CNODE, &cnode->flags);
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		clear_bit(COW_CNODE, &cnode->flags);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		offs += len;
 		dbg_chk_lpt_sz(c, 1, len);
 		cnode = cnode->cnext;
--- a/fs/ubifs/tnc_commit.c
+++ b/fs/ubifs/tnc_commit.c
@@ -895,9 +895,9 @@ static int write_index(struct ubifs_info
 		 * the reason for the second barrier.
 		 */
 		clear_bit(DIRTY_ZNODE, &znode->flags);
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		clear_bit(COW_ZNODE, &znode->flags);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 
 		/*
 		 * We have marked the znode as clean but have not updated the
--- a/include/asm-generic/atomic.h
+++ b/include/asm-generic/atomic.h
@@ -16,6 +16,7 @@
 #define __ASM_GENERIC_ATOMIC_H
 
 #include <asm/cmpxchg.h>
+#include <asm/barrier.h>
 
 #ifdef CONFIG_SMP
 /* Force people to define core atomics */
@@ -182,11 +183,5 @@ static inline void atomic_set_mask(unsig
 }
 #endif
 
-/* Assume that atomic operations are already serializing */
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
 #endif /* __KERNEL__ */
 #endif /* __ASM_GENERIC_ATOMIC_H */
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -62,6 +62,14 @@
 #define set_mb(var, value)  do { (var) = (value); mb(); } while (0)
 #endif
 
+#ifndef smp_mb__before_atomic
+#define smp_mb__before_atomic()	smp_mb()
+#endif
+
+#ifndef smp_mb__after_atomic
+#define smp_mb__after_atomic()	smp_mb()
+#endif
+
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
--- a/include/asm-generic/bitops.h
+++ b/include/asm-generic/bitops.h
@@ -11,14 +11,8 @@
 
 #include <linux/irqflags.h>
 #include <linux/compiler.h>
+#include <asm/barrier.h>
 
-/*
- * clear_bit may not imply a memory barrier
- */
-#ifndef smp_mb__before_clear_bit
-#define smp_mb__before_clear_bit()	smp_mb()
-#define smp_mb__after_clear_bit()	smp_mb()
-#endif
 
 #include <asm-generic/bitops/__ffs.h>
 #include <asm-generic/bitops/ffz.h>
--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -80,7 +80,7 @@ static inline void set_bit(int nr, volat
  *
  * clear_bit() is atomic and may not be reordered.  However, it does
  * not contain a memory barrier, so if it is used for locking purposes,
- * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
+ * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
  * in order to ensure changes are visible on other processors.
  */
 static inline void clear_bit(int nr, volatile unsigned long *addr)
--- a/include/asm-generic/bitops/lock.h
+++ b/include/asm-generic/bitops/lock.h
@@ -20,7 +20,7 @@
  */
 #define clear_bit_unlock(nr, addr)	\
 do {					\
-	smp_mb__before_clear_bit();	\
+	smp_mb__before_atomic();	\
 	clear_bit(nr, addr);		\
 } while (0)
 
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -278,7 +278,7 @@ static inline void get_bh(struct buffer_
 
 static inline void put_bh(struct buffer_head *bh)
 {
-        smp_mb__before_atomic_dec();
+        smp_mb__before_atomic();
         atomic_dec(&bh->b_count);
 }
 
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -649,7 +649,7 @@ static inline void hd_ref_init(struct hd
 static inline void hd_struct_get(struct hd_struct *part)
 {
 	atomic_inc(&part->ref);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 }
 
 static inline int hd_struct_try_get(struct hd_struct *part)
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -447,7 +447,7 @@ static inline int tasklet_trylock(struct
 
 static inline void tasklet_unlock(struct tasklet_struct *t)
 {
-	smp_mb__before_clear_bit(); 
+	smp_mb__before_atomic();
 	clear_bit(TASKLET_STATE_RUN, &(t)->state);
 }
 
@@ -495,7 +495,7 @@ static inline void tasklet_hi_schedule_f
 static inline void tasklet_disable_nosync(struct tasklet_struct *t)
 {
 	atomic_inc(&t->count);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 }
 
 static inline void tasklet_disable(struct tasklet_struct *t)
@@ -507,13 +507,13 @@ static inline void tasklet_disable(struc
 
 static inline void tasklet_enable(struct tasklet_struct *t)
 {
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_dec(&t->count);
 }
 
 static inline void tasklet_hi_enable(struct tasklet_struct *t)
 {
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_dec(&t->count);
 }
 
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -500,7 +500,7 @@ static inline void napi_disable(struct n
 static inline void napi_enable(struct napi_struct *n)
 {
 	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(NAPI_STATE_SCHED, &n->state);
 }
 
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2737,10 +2737,8 @@ static inline bool __must_check current_
 	/*
 	 * Polling state must be visible before we test NEED_RESCHED,
 	 * paired by resched_task()
-	 *
-	 * XXX: assumes set/clear bit are identical barrier wise.
 	 */
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	return unlikely(tif_need_resched());
 }
@@ -2758,7 +2756,7 @@ static inline bool __must_check current_
 	 * Polling state must be visible before we test NEED_RESCHED,
 	 * paired by resched_task()
 	 */
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	return unlikely(tif_need_resched());
 }
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -142,18 +142,18 @@ struct rpc_task_setup {
 				test_and_set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
 #define rpc_clear_running(t)	\
 	do { \
-		smp_mb__before_clear_bit(); \
+		smp_mb__before_atomic(); \
 		clear_bit(RPC_TASK_RUNNING, &(t)->tk_runstate); \
-		smp_mb__after_clear_bit(); \
+		smp_mb__after_atomic(); \
 	} while (0)
 
 #define RPC_IS_QUEUED(t)	test_bit(RPC_TASK_QUEUED, &(t)->tk_runstate)
 #define rpc_set_queued(t)	set_bit(RPC_TASK_QUEUED, &(t)->tk_runstate)
 #define rpc_clear_queued(t)	\
 	do { \
-		smp_mb__before_clear_bit(); \
+		smp_mb__before_atomic(); \
 		clear_bit(RPC_TASK_QUEUED, &(t)->tk_runstate); \
-		smp_mb__after_clear_bit(); \
+		smp_mb__after_atomic(); \
 	} while (0)
 
 #define RPC_IS_ACTIVATED(t)	test_bit(RPC_TASK_ACTIVE, &(t)->tk_runstate)
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -368,9 +368,9 @@ static inline int xprt_test_and_clear_co
 
 static inline void xprt_clear_connecting(struct rpc_xprt *xprt)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(XPRT_CONNECTING, &xprt->state);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 }
 
 static inline int xprt_connecting(struct rpc_xprt *xprt)
@@ -400,9 +400,9 @@ static inline void xprt_clear_bound(stru
 
 static inline void xprt_clear_binding(struct rpc_xprt *xprt)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(XPRT_BINDING, &xprt->state);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 }
 
 static inline int xprt_test_and_set_binding(struct rpc_xprt *xprt)
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -191,7 +191,7 @@ static inline void tracehook_notify_resu
 	 * pairs with task_work_add()->set_notify_resume() after
 	 * hlist_add_head(task->task_works);
 	 */
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	if (unlikely(current->task_works))
 		task_work_run();
 }
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1204,7 +1204,7 @@ static inline bool __ip_vs_conn_get(stru
 /* put back the conn without restarting its timer */
 static inline void __ip_vs_conn_put(struct ip_vs_conn *cp)
 {
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_dec(&cp->refcnt);
 }
 void ip_vs_conn_put(struct ip_vs_conn *cp);
@@ -1408,7 +1408,7 @@ static inline void ip_vs_dest_hold(struc
 
 static inline void ip_vs_dest_put(struct ip_vs_dest *dest)
 {
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_dec(&dest->refcnt);
 }
 
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -526,7 +526,7 @@ static int kgdb_cpu_enter(struct kgdb_st
 			kgdb_info[cpu].exception_state &=
 				~(DCPU_WANT_MASTER | DCPU_IS_SLAVE);
 			kgdb_info[cpu].enter_kgdb--;
-			smp_mb__before_atomic_dec();
+			smp_mb__before_atomic();
 			atomic_dec(&slaves_in_kgdb);
 			dbg_touch_watchdogs();
 			local_irq_restore(flags);
@@ -654,7 +654,7 @@ static int kgdb_cpu_enter(struct kgdb_st
 	kgdb_info[cpu].exception_state &=
 		~(DCPU_WANT_MASTER | DCPU_IS_SLAVE);
 	kgdb_info[cpu].enter_kgdb--;
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_dec(&masters_in_kgdb);
 	/* Free kgdb_active */
 	atomic_set(&kgdb_active, -1);
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -250,7 +250,7 @@ static inline void futex_get_mm(union fu
 	 * get_futex_key() implies a full barrier. This is relied upon
 	 * as full barrier (B), see the ordering comment above.
 	 */
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 }
 
 static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -498,7 +498,7 @@ int __usermodehelper_disable(enum umh_di
 static void helper_lock(void)
 {
 	atomic_inc(&running_helpers);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 }
 
 static void helper_unlock(void)
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -389,9 +389,9 @@ static void rcu_eqs_enter_common(struct
 	}
 	rcu_prepare_for_idle(smp_processor_id());
 	/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
-	smp_mb__before_atomic_inc();  /* See above. */
+	smp_mb__before_atomic();  /* See above. */
 	atomic_inc(&rdtp->dynticks);
-	smp_mb__after_atomic_inc();  /* Force ordering with next sojourn. */
+	smp_mb__after_atomic();  /* Force ordering with next sojourn. */
 	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
 
 	/*
@@ -509,10 +509,10 @@ void rcu_irq_exit(void)
 static void rcu_eqs_exit_common(struct rcu_dynticks *rdtp, long long oldval,
 			       int user)
 {
-	smp_mb__before_atomic_inc();  /* Force ordering w/previous sojourn. */
+	smp_mb__before_atomic();  /* Force ordering w/previous sojourn. */
 	atomic_inc(&rdtp->dynticks);
 	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
-	smp_mb__after_atomic_inc();  /* See above. */
+	smp_mb__after_atomic();  /* See above. */
 	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
 	rcu_cleanup_after_idle(smp_processor_id());
 	trace_rcu_dyntick(TPS("End"), oldval, rdtp->dynticks_nesting);
@@ -637,10 +637,10 @@ void rcu_nmi_enter(void)
 	    (atomic_read(&rdtp->dynticks) & 0x1))
 		return;
 	rdtp->dynticks_nmi_nesting++;
-	smp_mb__before_atomic_inc();  /* Force delay from prior write. */
+	smp_mb__before_atomic();  /* Force delay from prior write. */
 	atomic_inc(&rdtp->dynticks);
 	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
-	smp_mb__after_atomic_inc();  /* See above. */
+	smp_mb__after_atomic();  /* See above. */
 	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
 }
 
@@ -659,9 +659,9 @@ void rcu_nmi_exit(void)
 	    --rdtp->dynticks_nmi_nesting != 0)
 		return;
 	/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
-	smp_mb__before_atomic_inc();  /* See above. */
+	smp_mb__before_atomic();  /* See above. */
 	atomic_inc(&rdtp->dynticks);
-	smp_mb__after_atomic_inc();  /* Force delay to next write. */
+	smp_mb__after_atomic();  /* Force delay to next write. */
 	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
 }
 
@@ -2738,7 +2738,7 @@ void synchronize_sched_expedited(void)
 		s = atomic_long_read(&rsp->expedited_done);
 		if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) {
 			/* ensure test happens before caller kfree */
-			smp_mb__before_atomic_inc(); /* ^^^ */
+			smp_mb__before_atomic(); /* ^^^ */
 			atomic_long_inc(&rsp->expedited_workdone1);
 			return;
 		}
@@ -2756,7 +2756,7 @@ void synchronize_sched_expedited(void)
 		s = atomic_long_read(&rsp->expedited_done);
 		if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) {
 			/* ensure test happens before caller kfree */
-			smp_mb__before_atomic_inc(); /* ^^^ */
+			smp_mb__before_atomic(); /* ^^^ */
 			atomic_long_inc(&rsp->expedited_workdone2);
 			return;
 		}
@@ -2785,7 +2785,7 @@ void synchronize_sched_expedited(void)
 		s = atomic_long_read(&rsp->expedited_done);
 		if (ULONG_CMP_GE((ulong)s, (ulong)snap)) {
 			/* ensure test happens before caller kfree */
-			smp_mb__before_atomic_inc(); /* ^^^ */
+			smp_mb__before_atomic(); /* ^^^ */
 			atomic_long_inc(&rsp->expedited_done_lost);
 			break;
 		}
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2514,9 +2514,9 @@ static void rcu_sysidle_enter(struct rcu
 	/* Record start of fully idle period. */
 	j = jiffies;
 	ACCESS_ONCE(rdtp->dynticks_idle_jiffies) = j;
-	smp_mb__before_atomic_inc();
+	smp_mb__before_atomic();
 	atomic_inc(&rdtp->dynticks_idle);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 	WARN_ON_ONCE(atomic_read(&rdtp->dynticks_idle) & 0x1);
 }
 
@@ -2581,9 +2581,9 @@ static void rcu_sysidle_exit(struct rcu_
 	}
 
 	/* Record end of idle period. */
-	smp_mb__before_atomic_inc();
+	smp_mb__before_atomic();
 	atomic_inc(&rdtp->dynticks_idle);
-	smp_mb__after_atomic_inc();
+	smp_mb__after_atomic();
 	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
 
 	/*
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -165,7 +165,7 @@ void cpupri_set(struct cpupri *cp, int c
 		 * do a write memory barrier, and then update the count, to
 		 * make sure the vector is visible when count is set.
 		 */
-		smp_mb__before_atomic_inc();
+		smp_mb__before_atomic();
 		atomic_inc(&(vec)->count);
 		do_mb = 1;
 	}
@@ -185,14 +185,14 @@ void cpupri_set(struct cpupri *cp, int c
 		 * the new priority vec.
 		 */
 		if (do_mb)
-			smp_mb__after_atomic_inc();
+			smp_mb__after_atomic();
 
 		/*
 		 * When removing from the vector, we decrement the counter first
 		 * do a memory barrier and then clear the mask.
 		 */
 		atomic_dec(&(vec)->count);
-		smp_mb__after_atomic_inc();
+		smp_mb__after_atomic();
 		cpumask_clear_cpu(cpu, vec->mask);
 	}
 
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -394,7 +394,7 @@ EXPORT_SYMBOL(__wake_up_bit);
  *
  * In order for this to function properly, as it uses waitqueue_active()
  * internally, some kind of memory barrier must be done prior to calling
- * this. Typically, this will be smp_mb__after_clear_bit(), but in some
+ * this. Typically, this will be smp_mb__after_atomic(), but in some
  * cases where bitflags are manipulated non-atomically under a lock, one
  * may need to use a less regular barrier, such fs/inode.c's smp_mb(),
  * because spin_unlock() does not guarantee a memory barrier.
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -549,7 +549,7 @@ void clear_bdi_congested(struct backing_
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
 	if (test_and_clear_bit(bit, &bdi->state))
 		atomic_dec(&nr_bdi_congested[sync]);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	if (waitqueue_active(wqh))
 		wake_up(wqh);
 }
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -609,7 +609,7 @@ void unlock_page(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	clear_bit_unlock(PG_locked, &page->flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_page(page, PG_locked);
 }
 EXPORT_SYMBOL(unlock_page);
@@ -626,7 +626,7 @@ void end_page_writeback(struct page *pag
 	if (!test_clear_page_writeback(page))
 		BUG();
 
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	wake_up_page(page, PG_writeback);
 }
 EXPORT_SYMBOL(end_page_writeback);
--- a/net/atm/pppoatm.c
+++ b/net/atm/pppoatm.c
@@ -252,7 +252,7 @@ static int pppoatm_may_send(struct pppoa
 	 * we need to ensure there's a memory barrier after it. The bit
 	 * *must* be set before we do the atomic_inc() on pvcc->inflight.
 	 * There's no smp_mb__after_set_bit(), so it's this or abuse
-	 * smp_mb__after_clear_bit().
+	 * smp_mb__after_atomic().
 	 */
 	test_and_set_bit(BLOCKED, &pvcc->blocked);
 
--- a/net/bluetooth/hci_event.c
+++ b/net/bluetooth/hci_event.c
@@ -45,7 +45,7 @@ static void hci_cc_inquiry_cancel(struct
 		return;
 
 	clear_bit(HCI_INQUIRY, &hdev->flags);
-	smp_mb__after_clear_bit(); /* wake_up_bit advises about this barrier */
+	smp_mb__after_atomic(); /* wake_up_bit advises about this barrier */
 	wake_up_bit(&hdev->flags, HCI_INQUIRY);
 
 	hci_conn_check_pending(hdev);
@@ -1531,7 +1531,7 @@ static void hci_inquiry_complete_evt(str
 	if (!test_and_clear_bit(HCI_INQUIRY, &hdev->flags))
 		return;
 
-	smp_mb__after_clear_bit(); /* wake_up_bit advises about this barrier */
+	smp_mb__after_atomic(); /* wake_up_bit advises about this barrier */
 	wake_up_bit(&hdev->flags, HCI_INQUIRY);
 
 	if (!test_bit(HCI_MGMT, &hdev->dev_flags))
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1323,7 +1323,7 @@ static int __dev_close_many(struct list_
 		 * dev->stop() will invoke napi_disable() on all of it's
 		 * napi_struct instances on this device.
 		 */
-		smp_mb__after_clear_bit(); /* Commit netif_running(). */
+		smp_mb__after_atomic(); /* Commit netif_running(). */
 	}
 
 	dev_deactivate_many(head);
@@ -3343,7 +3343,7 @@ static void net_tx_action(struct softirq
 
 			root_lock = qdisc_lock(q);
 			if (spin_trylock(root_lock)) {
-				smp_mb__before_clear_bit();
+				smp_mb__before_atomic();
 				clear_bit(__QDISC_STATE_SCHED,
 					  &q->state);
 				qdisc_run(q);
@@ -3353,7 +3353,7 @@ static void net_tx_action(struct softirq
 					      &q->state)) {
 					__netif_reschedule(q);
 				} else {
-					smp_mb__before_clear_bit();
+					smp_mb__before_atomic();
 					clear_bit(__QDISC_STATE_SCHED,
 						  &q->state);
 				}
@@ -4216,7 +4216,7 @@ void __napi_complete(struct napi_struct
 	BUG_ON(n->gro_list);
 
 	list_del(&n->poll_list);
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(NAPI_STATE_SCHED, &n->state);
 }
 EXPORT_SYMBOL(__napi_complete);
--- a/net/core/link_watch.c
+++ b/net/core/link_watch.c
@@ -147,7 +147,7 @@ static void linkwatch_do_dev(struct net_
 	 * Make sure the above read is complete since it can be
 	 * rewritten as soon as we clear the bit below.
 	 */
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 
 	/* We are about to handle this device,
 	 * so new events can be accepted
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -522,7 +522,7 @@ EXPORT_SYMBOL_GPL(inet_getpeer);
 void inet_putpeer(struct inet_peer *p)
 {
 	p->dtime = (__u32)jiffies;
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_dec(&p->refcnt);
 }
 EXPORT_SYMBOL_GPL(inet_putpeer);
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -751,7 +751,7 @@ void nf_conntrack_free(struct nf_conn *c
 	nf_ct_ext_destroy(ct);
 	nf_ct_ext_free(ct);
 	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_dec(&net->ct.count);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_free);
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -598,7 +598,7 @@ static void rds_ib_set_ack(struct rds_ib
 {
 	atomic64_set(&ic->i_ack_next, seq);
 	if (ack_required) {
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
 	}
 }
@@ -606,7 +606,7 @@ static void rds_ib_set_ack(struct rds_ib
 static u64 rds_ib_get_ack(struct rds_ib_connection *ic)
 {
 	clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	return atomic64_read(&ic->i_ack_next);
 }
--- a/net/rds/iw_recv.c
+++ b/net/rds/iw_recv.c
@@ -429,7 +429,7 @@ static void rds_iw_set_ack(struct rds_iw
 {
 	atomic64_set(&ic->i_ack_next, seq);
 	if (ack_required) {
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
 	}
 }
@@ -437,7 +437,7 @@ static void rds_iw_set_ack(struct rds_iw
 static u64 rds_iw_get_ack(struct rds_iw_connection *ic)
 {
 	clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	return atomic64_read(&ic->i_ack_next);
 }
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -107,7 +107,7 @@ static int acquire_in_xmit(struct rds_co
 static void release_in_xmit(struct rds_connection *conn)
 {
 	clear_bit(RDS_IN_XMIT, &conn->c_flags);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	/*
 	 * We don't use wait_on_bit()/wake_up_bit() because our waking is in a
 	 * hot path and finding waiters is very rare.  We don't want to walk
@@ -661,7 +661,7 @@ void rds_send_drop_acked(struct rds_conn
 
 	/* order flag updates with spin locks */
 	if (!list_empty(&list))
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 
 	spin_unlock_irqrestore(&conn->c_lock, flags);
 
@@ -691,7 +691,7 @@ void rds_send_drop_to(struct rds_sock *r
 	}
 
 	/* order flag updates with the rs lock */
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	spin_unlock_irqrestore(&rs->rs_lock, flags);
 
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -93,7 +93,7 @@ int rds_tcp_xmit(struct rds_connection *
 		rm->m_ack_seq = tc->t_last_sent_nxt +
 				sizeof(struct rds_header) +
 				be32_to_cpu(rm->m_inc.i_hdr.h_len) - 1;
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		set_bit(RDS_MSG_HAS_ACK_SEQ, &rm->m_flags);
 		tc->t_last_expected_una = rm->m_ack_seq + 1;
 
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -296,7 +296,7 @@ static void
 rpcauth_unhash_cred_locked(struct rpc_cred *cred)
 {
 	hlist_del_rcu(&cred->cr_hash);
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(RPCAUTH_CRED_HASHED, &cred->cr_flags);
 }
 
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -142,7 +142,7 @@ gss_cred_set_ctx(struct rpc_cred *cred,
 	gss_get_ctx(ctx);
 	rcu_assign_pointer(gss_cred->gc_ctx, ctx);
 	set_bit(RPCAUTH_CRED_UPTODATE, &cred->cr_flags);
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(RPCAUTH_CRED_NEW, &cred->cr_flags);
 }
 
--- a/net/sunrpc/backchannel_rqst.c
+++ b/net/sunrpc/backchannel_rqst.c
@@ -257,10 +257,10 @@ void xprt_free_bc_request(struct rpc_rqs
 
 	dprintk("RPC:       free backchannel req=%p\n", req);
 
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	WARN_ON_ONCE(!test_bit(RPC_BC_PA_IN_USE, &req->rq_bc_pa_state));
 	clear_bit(RPC_BC_PA_IN_USE, &req->rq_bc_pa_state);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 
 	if (!xprt_need_to_requeue(xprt)) {
 		/*
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -230,9 +230,9 @@ static void xprt_clear_locked(struct rpc
 {
 	xprt->snd_task = NULL;
 	if (!test_bit(XPRT_CLOSE_WAIT, &xprt->state)) {
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		clear_bit(XPRT_LOCKED, &xprt->state);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 	} else
 		queue_work(rpciod_workqueue, &xprt->task_cleanup);
 }
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -889,11 +889,11 @@ static void xs_close(struct rpc_xprt *xp
 	xs_reset_transport(transport);
 	xprt->reestablish_timeout = 0;
 
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
 	clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
 	clear_bit(XPRT_CLOSING, &xprt->state);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	xprt_disconnect_done(xprt);
 }
 
@@ -1500,12 +1500,12 @@ static void xs_tcp_cancel_linger_timeout
 
 static void xs_sock_reset_connection_flags(struct rpc_xprt *xprt)
 {
-	smp_mb__before_clear_bit();
+	smp_mb__before_atomic();
 	clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
 	clear_bit(XPRT_CONNECTION_CLOSE, &xprt->state);
 	clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
 	clear_bit(XPRT_CLOSING, &xprt->state);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 }
 
 static void xs_sock_mark_closed(struct rpc_xprt *xprt)
@@ -1559,10 +1559,10 @@ static void xs_tcp_state_change(struct s
 		xprt->connect_cookie++;
 		xprt->reestablish_timeout = 0;
 		set_bit(XPRT_CLOSING, &xprt->state);
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		clear_bit(XPRT_CONNECTED, &xprt->state);
 		clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		xs_tcp_schedule_linger_timeout(xprt, xs_tcp_fin_timeout);
 		break;
 	case TCP_CLOSE_WAIT:
@@ -1581,9 +1581,9 @@ static void xs_tcp_state_change(struct s
 	case TCP_LAST_ACK:
 		set_bit(XPRT_CLOSING, &xprt->state);
 		xs_tcp_schedule_linger_timeout(xprt, xs_tcp_fin_timeout);
-		smp_mb__before_clear_bit();
+		smp_mb__before_atomic();
 		clear_bit(XPRT_CONNECTED, &xprt->state);
-		smp_mb__after_clear_bit();
+		smp_mb__after_atomic();
 		break;
 	case TCP_CLOSE:
 		xs_tcp_cancel_linger_timeout(xprt);
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1208,7 +1208,7 @@ static int unix_stream_connect(struct so
 	sk->sk_state	= TCP_ESTABLISHED;
 	sock_hold(newsk);
 
-	smp_mb__after_atomic_inc();	/* sock_hold() does an atomic_inc() */
+	smp_mb__after_atomic();	/* sock_hold() does an atomic_inc() */
 	unix_peer(sk)	= newsk;
 
 	unix_state_unlock(sk);
--- a/sound/pci/bt87x.c
+++ b/sound/pci/bt87x.c
@@ -435,7 +435,7 @@ static int snd_bt87x_pcm_open(struct snd
 
 _error:
 	clear_bit(0, &chip->opened);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	return err;
 }
 
@@ -450,7 +450,7 @@ static int snd_bt87x_close(struct snd_pc
 
 	chip->substream = NULL;
 	clear_bit(0, &chip->opened);
-	smp_mb__after_clear_bit();
+	smp_mb__after_atomic();
 	return 0;
 }
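
The bulk of the conversions above are mechanical, but the canonical case
they serve is the bit-wait handshake: an atomic RMW that implies no memory
barrier, an explicit barrier, then the waiter check. A minimal sketch of
that idiom with the new names (FOO_PENDING and the flags word are made up
for illustration):

#define FOO_PENDING	0	/* hypothetical flag bit */

static void foo_clear_pending(unsigned long *flags)
{
	/* clear_bit() is atomic but implies no memory barrier */
	clear_bit(FOO_PENDING, flags);
	/* order the bit update before the waiter check in wake_up_bit() */
	smp_mb__after_atomic();
	wake_up_bit(flags, FOO_PENDING);
}

On architectures whose atomic RMW ops already serialize, the arch can
define smp_mb__after_atomic() as a plain compiler barrier(), as the old
per-op macros did in the generic header.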
 




* [RFC][PATCH 4/5] arch: Generic atomic.h cleanup
  2014-02-06 13:48 [RFC][PATCH 0/5] arch: atomic rework Peter Zijlstra
                   ` (2 preceding siblings ...)
  2014-02-06 13:48 ` [RFC][PATCH 3/5] arch: s/smp_mb__(before|after)_(atomic|clear)_(dec,inc,bit)/smp_mb__\1/g Peter Zijlstra
@ 2014-02-06 13:48 ` Peter Zijlstra
  2014-02-06 17:49   ` Will Deacon
  2014-02-06 13:48 ` [RFC][PATCH 5/5] arch: Sanitize atomic_t bitwise ops Peter Zijlstra
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-06 13:48 UTC (permalink / raw)
  To: linux-arch, linux-kernel
  Cc: torvalds, akpm, mingo, will.deacon, paulmck, Peter Zijlstra

[-- Attachment #1: peterz-arch-atomic-and-or.patch --]
[-- Type: text/plain, Size: 106323 bytes --]

This patch should mostly be a NO-OP; it only removes some duplication.
With a bit of effort these files could be further reduced.
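
The deduplication comes from generating each operation from a single macro
template. A minimal sketch of the pattern, modeled on the irq-disabling
fallback variants in the diff below (illustrative, not any one architecture
verbatim):

#define ATOMIC_OP(op, c_op)						\
static inline void atomic_##op(int i, atomic_t *v)			\
{									\
	unsigned long flags;						\
									\
	raw_local_irq_save(flags);					\
	v->counter c_op i;						\
	raw_local_irq_restore(flags);					\
}

ATOMIC_OP(add, +=)	/* stamps out atomic_add() */
ATOMIC_OP(sub, -=)	/* stamps out atomic_sub() */

#undef ATOMIC_OP

The real templates also generate atomic_##op##_return() from the same
macro, so each architecture keeps one body (ll/sc loop, irq disable, or
spinlock) instead of four near-identical functions.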

Compile tested on everything I have a compiler for; missing are frv
and tile -- because they're special and I got tired.

XXX: split up into per-arch patches?

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/alpha/include/asm/atomic.h          |  217 ++++--------
 arch/arc/include/asm/atomic.h            |  156 +++-----
 arch/arm/include/asm/atomic.h            |  283 ++++++----------
 arch/arm64/include/asm/atomic.h          |  193 +++-------
 arch/avr32/include/asm/atomic.h          |   87 ++--
 arch/cris/include/arch-v10/arch/system.h |    2 
 arch/cris/include/asm/atomic.h           |   52 +-
 arch/hexagon/include/asm/atomic.h        |   65 +--
 arch/ia64/include/asm/atomic.h           |  182 ++++------
 arch/m32r/include/asm/atomic.h           |  140 +++----
 arch/m68k/include/asm/atomic.h           |  109 ++----
 arch/metag/include/asm/atomic_lnkget.h   |  118 ++----
 arch/metag/include/asm/atomic_lock1.h    |   73 +---
 arch/mips/include/asm/atomic.h           |  547 +++++++++----------------------
 arch/mn10300/include/asm/atomic.h        |  124 ++-----
 arch/parisc/include/asm/atomic.h         |  107 +++---
 arch/powerpc/include/asm/atomic.h        |  194 +++-------
 arch/sh/include/asm/atomic-grb.h         |  118 ++----
 arch/sh/include/asm/atomic-irq.h         |   63 +--
 arch/sh/include/asm/atomic-llsc.h        |  100 ++---
 arch/sparc/include/asm/atomic_32.h       |   19 -
 arch/sparc/include/asm/atomic_64.h       |   43 +-
 arch/sparc/include/asm/barrier_32.h      |    1 
 arch/sparc/include/asm/processor.h       |    2 
 arch/sparc/kernel/smp_64.c               |    2 
 arch/sparc/lib/atomic32.c                |   25 -
 arch/sparc/lib/atomic_64.S               |  159 +++------
 arch/sparc/lib/ksyms.c                   |   17 
 arch/xtensa/include/asm/atomic.h         |  227 ++++--------
 include/asm-generic/atomic.h             |  143 ++------
 30 files changed, 1341 insertions(+), 2227 deletions(-)

--- a/arch/alpha/include/asm/atomic.h
+++ b/arch/alpha/include/asm/atomic.h
@@ -29,145 +29,84 @@
  * branch back to restart the operation.
  */
 
-static __inline__ void atomic_add(int i, atomic_t * v)
-{
-	unsigned long temp;
-	__asm__ __volatile__(
-	"1:	ldl_l %0,%1\n"
-	"	addl %0,%2,%0\n"
-	"	stl_c %0,%1\n"
-	"	beq %0,2f\n"
-	".subsection 2\n"
-	"2:	br 1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (v->counter)
-	:"Ir" (i), "m" (v->counter));
-}
-
-static __inline__ void atomic64_add(long i, atomic64_t * v)
-{
-	unsigned long temp;
-	__asm__ __volatile__(
-	"1:	ldq_l %0,%1\n"
-	"	addq %0,%2,%0\n"
-	"	stq_c %0,%1\n"
-	"	beq %0,2f\n"
-	".subsection 2\n"
-	"2:	br 1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (v->counter)
-	:"Ir" (i), "m" (v->counter));
-}
-
-static __inline__ void atomic_sub(int i, atomic_t * v)
-{
-	unsigned long temp;
-	__asm__ __volatile__(
-	"1:	ldl_l %0,%1\n"
-	"	subl %0,%2,%0\n"
-	"	stl_c %0,%1\n"
-	"	beq %0,2f\n"
-	".subsection 2\n"
-	"2:	br 1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (v->counter)
-	:"Ir" (i), "m" (v->counter));
-}
-
-static __inline__ void atomic64_sub(long i, atomic64_t * v)
-{
-	unsigned long temp;
-	__asm__ __volatile__(
-	"1:	ldq_l %0,%1\n"
-	"	subq %0,%2,%0\n"
-	"	stq_c %0,%1\n"
-	"	beq %0,2f\n"
-	".subsection 2\n"
-	"2:	br 1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (v->counter)
-	:"Ir" (i), "m" (v->counter));
-}
-
-
-/*
- * Same as above, but return the result value
- */
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	long temp, result;
-	smp_mb();
-	__asm__ __volatile__(
-	"1:	ldl_l %0,%1\n"
-	"	addl %0,%3,%2\n"
-	"	addl %0,%3,%0\n"
-	"	stl_c %0,%1\n"
-	"	beq %0,2f\n"
-	".subsection 2\n"
-	"2:	br 1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (v->counter), "=&r" (result)
-	:"Ir" (i), "m" (v->counter) : "memory");
-	smp_mb();
-	return result;
-}
-
-static __inline__ long atomic64_add_return(long i, atomic64_t * v)
-{
-	long temp, result;
-	smp_mb();
-	__asm__ __volatile__(
-	"1:	ldq_l %0,%1\n"
-	"	addq %0,%3,%2\n"
-	"	addq %0,%3,%0\n"
-	"	stq_c %0,%1\n"
-	"	beq %0,2f\n"
-	".subsection 2\n"
-	"2:	br 1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (v->counter), "=&r" (result)
-	:"Ir" (i), "m" (v->counter) : "memory");
-	smp_mb();
-	return result;
-}
-
-static __inline__ long atomic_sub_return(int i, atomic_t * v)
-{
-	long temp, result;
-	smp_mb();
-	__asm__ __volatile__(
-	"1:	ldl_l %0,%1\n"
-	"	subl %0,%3,%2\n"
-	"	subl %0,%3,%0\n"
-	"	stl_c %0,%1\n"
-	"	beq %0,2f\n"
-	".subsection 2\n"
-	"2:	br 1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (v->counter), "=&r" (result)
-	:"Ir" (i), "m" (v->counter) : "memory");
-	smp_mb();
-	return result;
-}
-
-static __inline__ long atomic64_sub_return(long i, atomic64_t * v)
-{
-	long temp, result;
-	smp_mb();
-	__asm__ __volatile__(
-	"1:	ldq_l %0,%1\n"
-	"	subq %0,%3,%2\n"
-	"	subq %0,%3,%0\n"
-	"	stq_c %0,%1\n"
-	"	beq %0,2f\n"
-	".subsection 2\n"
-	"2:	br 1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (v->counter), "=&r" (result)
-	:"Ir" (i), "m" (v->counter) : "memory");
-	smp_mb();
-	return result;
-}
+#define ATOMIC_OP(op)							\
+static __inline__ void atomic_##op(int i, atomic_t * v)			\
+{									\
+	unsigned long temp;						\
+	__asm__ __volatile__(						\
+	"1:	ldl_l %0,%1\n"						\
+	"	" #op "l %0,%2,%0\n"					\
+	"	stl_c %0,%1\n"						\
+	"	beq %0,2f\n"						\
+	".subsection 2\n"						\
+	"2:	br 1b\n"						\
+	".previous"							\
+	:"=&r" (temp), "=m" (v->counter)				\
+	:"Ir" (i), "m" (v->counter));					\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	long temp, result;						\
+	smp_mb();							\
+	__asm__ __volatile__(						\
+	"1:	ldl_l %0,%1\n"						\
+	"	" #op "l %0,%3,%2\n"					\
+	"	" #op "l %0,%3,%0\n"					\
+	"	stl_c %0,%1\n"						\
+	"	beq %0,2f\n"						\
+	".subsection 2\n"						\
+	"2:	br 1b\n"						\
+	".previous"							\
+	:"=&r" (temp), "=m" (v->counter), "=&r" (result)		\
+	:"Ir" (i), "m" (v->counter) : "memory");			\
+	smp_mb();							\
+	return result;							\
+}
+
+#define ATOMIC64_OP(op)							\
+static __inline__ void atomic64_##op(long i, atomic64_t * v)		\
+{									\
+	unsigned long temp;						\
+	__asm__ __volatile__(						\
+	"1:	ldq_l %0,%1\n"						\
+	"	" #op "q %0,%2,%0\n"					\
+	"	stq_c %0,%1\n"						\
+	"	beq %0,2f\n"						\
+	".subsection 2\n"						\
+	"2:	br 1b\n"						\
+	".previous"							\
+	:"=&r" (temp), "=m" (v->counter)				\
+	:"Ir" (i), "m" (v->counter));					\
+}									\
+									\
+static __inline__ long atomic64_##op##_return(long i, atomic64_t * v)	\
+{									\
+	long temp, result;						\
+	smp_mb();							\
+	__asm__ __volatile__(						\
+	"1:	ldq_l %0,%1\n"						\
+	"	" #op "q %0,%3,%2\n"					\
+	"	" #op "q %0,%3,%0\n"					\
+	"	stq_c %0,%1\n"						\
+	"	beq %0,2f\n"						\
+	".subsection 2\n"						\
+	"2:	br 1b\n"						\
+	".previous"							\
+	:"=&r" (temp), "=m" (v->counter), "=&r" (result)		\
+	:"Ir" (i), "m" (v->counter) : "memory");			\
+	smp_mb();							\
+	return result;							\
+}
+
+#define ATOMIC_OPS(op)	ATOMIC_OP(op) ATOMIC64_OP(op)
+
+ATOMIC_OPS(add)
+ATOMIC_OPS(sub)
+
+#undef ATOMIC_OPS
+#undef ATOMIC64_OP
+#undef ATOMIC_OP
 
 #define atomic64_cmpxchg(v, old, new) (cmpxchg(&((v)->counter), old, new))
 #define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
--- a/arch/arc/include/asm/atomic.h
+++ b/arch/arc/include/asm/atomic.h
@@ -25,66 +25,41 @@
 
 #define atomic_set(v, i) (((v)->counter) = (i))
 
-static inline void atomic_add(int i, atomic_t *v)
-{
-	unsigned int temp;
-
-	__asm__ __volatile__(
-	"1:	llock   %0, [%1]	\n"
-	"	add     %0, %0, %2	\n"
-	"	scond   %0, [%1]	\n"
-	"	bnz     1b		\n"
-	: "=&r"(temp)	/* Early clobber, to prevent reg reuse */
-	: "r"(&v->counter), "ir"(i)
-	: "cc");
-}
-
-static inline void atomic_sub(int i, atomic_t *v)
-{
-	unsigned int temp;
-
-	__asm__ __volatile__(
-	"1:	llock   %0, [%1]	\n"
-	"	sub     %0, %0, %2	\n"
-	"	scond   %0, [%1]	\n"
-	"	bnz     1b		\n"
-	: "=&r"(temp)
-	: "r"(&v->counter), "ir"(i)
-	: "cc");
+#define ATOMIC_OP(op)							\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	unsigned int temp;						\
+									\
+	__asm__ __volatile__(						\
+	"1:	llock   %0, [%1]	\n"				\
+	"	" #op " %0, %0, %2	\n"				\
+	"	scond   %0, [%1]	\n"				\
+	"	bnz     1b		\n"				\
+	: "=&r"(temp)	/* Early clobber, to prevent reg reuse */	\
+	: "r"(&v->counter), "ir"(i)					\
+	: "cc");							\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	unsigned int temp;						\
+									\
+	__asm__ __volatile__(						\
+	"1:	llock   %0, [%1]	\n"				\
+	"	" #op " %0, %0, %2	\n"				\
+	"	scond   %0, [%1]	\n"				\
+	"	bnz     1b		\n"				\
+	: "=&r"(temp)							\
+	: "r"(&v->counter), "ir"(i)					\
+	: "cc");							\
+									\
+	return temp;							\
 }
 
-/* add and also return the new value */
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	unsigned int temp;
+ATOMIC_OP(add)
+ATOMIC_OP(sub)
 
-	__asm__ __volatile__(
-	"1:	llock   %0, [%1]	\n"
-	"	add     %0, %0, %2	\n"
-	"	scond   %0, [%1]	\n"
-	"	bnz     1b		\n"
-	: "=&r"(temp)
-	: "r"(&v->counter), "ir"(i)
-	: "cc");
-
-	return temp;
-}
-
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	unsigned int temp;
-
-	__asm__ __volatile__(
-	"1:	llock   %0, [%1]	\n"
-	"	sub     %0, %0, %2	\n"
-	"	scond   %0, [%1]	\n"
-	"	bnz     1b		\n"
-	: "=&r"(temp)
-	: "r"(&v->counter), "ir"(i)
-	: "cc");
-
-	return temp;
-}
+#undef ATOMIC_OP
 
 static inline void atomic_clear_mask(unsigned long mask, unsigned long *addr)
 {
@@ -133,51 +108,34 @@ static inline void atomic_set(atomic_t *
  * Locking would change to irq-disabling only (UP) and spinlocks (SMP)
  */
 
-static inline void atomic_add(int i, atomic_t *v)
-{
-	unsigned long flags;
-
-	atomic_ops_lock(flags);
-	v->counter += i;
-	atomic_ops_unlock(flags);
+#define ATOMIC_OP(opn, op)						\
+static inline void atomic_##opn(int i, atomic_t *v)			\
+{									\
+	unsigned long flags;						\
+									\
+	atomic_ops_lock(flags);						\
+	v->counter op i;						\
+	atomic_ops_unlock(flags);					\
+}									\
+									\
+static inline int atomic_##opn##_return(int i, atomic_t *v)		\
+{									\
+	unsigned long flags;						\
+	unsigned long temp;						\
+									\
+	atomic_ops_lock(flags);						\
+	temp = v->counter;						\
+	temp op i;							\
+	v->counter = temp;						\
+	atomic_ops_unlock(flags);					\
+									\
+	return temp;							\
 }
 
-static inline void atomic_sub(int i, atomic_t *v)
-{
-	unsigned long flags;
+ATOMIC_OP(add, +=)
+ATOMIC_OP(sub, -=)
 
-	atomic_ops_lock(flags);
-	v->counter -= i;
-	atomic_ops_unlock(flags);
-}
-
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	unsigned long flags;
-	unsigned long temp;
-
-	atomic_ops_lock(flags);
-	temp = v->counter;
-	temp += i;
-	v->counter = temp;
-	atomic_ops_unlock(flags);
-
-	return temp;
-}
-
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	unsigned long flags;
-	unsigned long temp;
-
-	atomic_ops_lock(flags);
-	temp = v->counter;
-	temp -= i;
-	v->counter = temp;
-	atomic_ops_unlock(flags);
-
-	return temp;
-}
+#undef ATOMIC_OP
 
 static inline void atomic_clear_mask(unsigned long mask, unsigned long *addr)
 {
--- a/arch/arm/include/asm/atomic.h
+++ b/arch/arm/include/asm/atomic.h
@@ -37,83 +37,51 @@
  * store exclusive to ensure that these are atomic.  We may loop
  * to ensure that the update happens.
  */
-static inline void atomic_add(int i, atomic_t *v)
-{
-	unsigned long tmp;
-	int result;
-
-	prefetchw(&v->counter);
-	__asm__ __volatile__("@ atomic_add\n"
-"1:	ldrex	%0, [%3]\n"
-"	add	%0, %0, %4\n"
-"	strex	%1, %0, [%3]\n"
-"	teq	%1, #0\n"
-"	bne	1b"
-	: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)
-	: "r" (&v->counter), "Ir" (i)
-	: "cc");
-}
-
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	unsigned long tmp;
-	int result;
-
-	smp_mb();
-
-	__asm__ __volatile__("@ atomic_add_return\n"
-"1:	ldrex	%0, [%3]\n"
-"	add	%0, %0, %4\n"
-"	strex	%1, %0, [%3]\n"
-"	teq	%1, #0\n"
-"	bne	1b"
-	: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)
-	: "r" (&v->counter), "Ir" (i)
-	: "cc");
-
-	smp_mb();
 
-	return result;
-}
-
-static inline void atomic_sub(int i, atomic_t *v)
-{
-	unsigned long tmp;
-	int result;
-
-	prefetchw(&v->counter);
-	__asm__ __volatile__("@ atomic_sub\n"
-"1:	ldrex	%0, [%3]\n"
-"	sub	%0, %0, %4\n"
-"	strex	%1, %0, [%3]\n"
-"	teq	%1, #0\n"
-"	bne	1b"
-	: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)
-	: "r" (&v->counter), "Ir" (i)
-	: "cc");
+#define ATOMIC_OP(op)							\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	unsigned long tmp;						\
+	int result;							\
+									\
+	prefetchw(&v->counter);						\
+	__asm__ __volatile__("@ atomic_" #op "\n"			\
+"1:	ldrex	%0, [%3]\n"						\
+"	" #op "	%0, %0, %4\n"						\
+"	strex	%1, %0, [%3]\n"						\
+"	teq	%1, #0\n"						\
+"	bne	1b"							\
+	: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)		\
+	: "r" (&v->counter), "Ir" (i)					\
+	: "cc");							\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	unsigned long tmp;						\
+	int result;							\
+									\
+	smp_mb();							\
+									\
+	__asm__ __volatile__("@ atomic_" #op "_return\n"		\
+"1:	ldrex	%0, [%3]\n"						\
+"	" #op "	%0, %0, %4\n"						\
+"	strex	%1, %0, [%3]\n"						\
+"	teq	%1, #0\n"						\
+"	bne	1b"							\
+	: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)		\
+	: "r" (&v->counter), "Ir" (i)					\
+	: "cc");							\
+									\
+	smp_mb();							\
+									\
+	return result;							\
 }
 
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	unsigned long tmp;
-	int result;
-
-	smp_mb();
+ATOMIC_OP(add)
+ATOMIC_OP(sub)
 
-	__asm__ __volatile__("@ atomic_sub_return\n"
-"1:	ldrex	%0, [%3]\n"
-"	sub	%0, %0, %4\n"
-"	strex	%1, %0, [%3]\n"
-"	teq	%1, #0\n"
-"	bne	1b"
-	: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)
-	: "r" (&v->counter), "Ir" (i)
-	: "cc");
-
-	smp_mb();
-
-	return result;
-}
+#undef ATOMIC_OP
 
 static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
 {
@@ -144,33 +112,33 @@ static inline int atomic_cmpxchg(atomic_
 #error SMP not supported on pre-ARMv6 CPUs
 #endif
 
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	unsigned long flags;
-	int val;
-
-	raw_local_irq_save(flags);
-	val = v->counter;
-	v->counter = val += i;
-	raw_local_irq_restore(flags);
-
-	return val;
+#define ATOMIC_OP(opn, op)						\
+static inline void atomic_##opn(int i, atomic_t *v)			\
+{									\
+	unsigned long flags;						\
+									\
+	raw_local_irq_save(flags);					\
+	v->counter op i;						\
+	raw_local_irq_restore(flags);					\
+}									\
+									\
+static inline int atomic_##opn##_return(int i, atomic_t *v)		\
+{									\
+	unsigned long flags;						\
+	int val;							\
+									\
+	raw_local_irq_save(flags);					\
+	v->counter op i;						\
+	val = v->counter;						\
+	raw_local_irq_restore(flags);					\
+									\
+	return val;							\
 }
-#define atomic_add(i, v)	(void) atomic_add_return(i, v)
 
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	unsigned long flags;
-	int val;
+ATOMIC_OP(add, +=)
+ATOMIC_OP(sub, -=)
 
-	raw_local_irq_save(flags);
-	val = v->counter;
-	v->counter = val -= i;
-	raw_local_irq_restore(flags);
-
-	return val;
-}
-#define atomic_sub(i, v)	(void) atomic_sub_return(i, v)
+#undef ATOMIC_OP
 
 static inline int atomic_cmpxchg(atomic_t *v, int old, int new)
 {
@@ -270,87 +238,52 @@ static inline void atomic64_set(atomic64
 }
 #endif
 
-static inline void atomic64_add(long long i, atomic64_t *v)
-{
-	long long result;
-	unsigned long tmp;
-
-	prefetchw(&v->counter);
-	__asm__ __volatile__("@ atomic64_add\n"
-"1:	ldrexd	%0, %H0, [%3]\n"
-"	adds	%Q0, %Q0, %Q4\n"
-"	adc	%R0, %R0, %R4\n"
-"	strexd	%1, %0, %H0, [%3]\n"
-"	teq	%1, #0\n"
-"	bne	1b"
-	: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)
-	: "r" (&v->counter), "r" (i)
-	: "cc");
+#define ATOMIC64_OP(opn, op1, op2)					\
+static inline void atomic64_##opn(long long i, atomic64_t *v)		\
+{									\
+	long long result;						\
+	unsigned long tmp;						\
+									\
+	prefetchw(&v->counter);						\
+	__asm__ __volatile__("@ atomic64_" #opn "\n"			\
+"1:	ldrexd	%0, %H0, [%3]\n"					\
+"	" #op1 " %Q0, %Q0, %Q4\n"					\
+"	" #op2 " %R0, %R0, %R4\n"					\
+"	strexd	%1, %0, %H0, [%3]\n"					\
+"	teq	%1, #0\n"						\
+"	bne	1b"							\
+	: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)		\
+	: "r" (&v->counter), "r" (i)					\
+	: "cc");							\
+}									\
+									\
+static inline long long atomic64_##opn##_return(long long i, atomic64_t *v) \
+{									\
+	long long result;						\
+	unsigned long tmp;						\
+									\
+	smp_mb();							\
+									\
+	__asm__ __volatile__("@ atomic64_" #opn "_return\n"		\
+"1:	ldrexd	%0, %H0, [%3]\n"					\
+"	" #op1 " %Q0, %Q0, %Q4\n"					\
+"	" #op2 " %R0, %R0, %R4\n"					\
+"	strexd	%1, %0, %H0, [%3]\n"					\
+"	teq	%1, #0\n"						\
+"	bne	1b"							\
+	: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)		\
+	: "r" (&v->counter), "r" (i)					\
+	: "cc");							\
+									\
+	smp_mb();							\
+									\
+	return result;							\
 }
 
-static inline long long atomic64_add_return(long long i, atomic64_t *v)
-{
-	long long result;
-	unsigned long tmp;
-
-	smp_mb();
-
-	__asm__ __volatile__("@ atomic64_add_return\n"
-"1:	ldrexd	%0, %H0, [%3]\n"
-"	adds	%Q0, %Q0, %Q4\n"
-"	adc	%R0, %R0, %R4\n"
-"	strexd	%1, %0, %H0, [%3]\n"
-"	teq	%1, #0\n"
-"	bne	1b"
-	: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)
-	: "r" (&v->counter), "r" (i)
-	: "cc");
+ATOMIC64_OP(add, adds, adc)
+ATOMIC64_OP(sub, subs, sbc)
 
-	smp_mb();
-
-	return result;
-}
-
-static inline void atomic64_sub(long long i, atomic64_t *v)
-{
-	long long result;
-	unsigned long tmp;
-
-	prefetchw(&v->counter);
-	__asm__ __volatile__("@ atomic64_sub\n"
-"1:	ldrexd	%0, %H0, [%3]\n"
-"	subs	%Q0, %Q0, %Q4\n"
-"	sbc	%R0, %R0, %R4\n"
-"	strexd	%1, %0, %H0, [%3]\n"
-"	teq	%1, #0\n"
-"	bne	1b"
-	: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)
-	: "r" (&v->counter), "r" (i)
-	: "cc");
-}
-
-static inline long long atomic64_sub_return(long long i, atomic64_t *v)
-{
-	long long result;
-	unsigned long tmp;
-
-	smp_mb();
-
-	__asm__ __volatile__("@ atomic64_sub_return\n"
-"1:	ldrexd	%0, %H0, [%3]\n"
-"	subs	%Q0, %Q0, %Q4\n"
-"	sbc	%R0, %R0, %R4\n"
-"	strexd	%1, %0, %H0, [%3]\n"
-"	teq	%1, #0\n"
-"	bne	1b"
-	: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)
-	: "r" (&v->counter), "r" (i)
-	: "cc");
-
-	smp_mb();
-
-	return result;
-}
+#undef ATOMIC64_OP
 
 static inline long long atomic64_cmpxchg(atomic64_t *ptr, long long old,
 					long long new)
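
For reference, a user-space sketch of what one ATOMIC_OP() invocation
expands to, with a pthread mutex standing in for the UP irq-disable;
all names below are illustration only, not part of this patch:

	#include <pthread.h>

	typedef struct { int counter; } atomic_t;
	static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

	#define ATOMIC_OP(opn, op)					\
	static inline void atomic_##opn(int i, atomic_t *v)		\
	{								\
		pthread_mutex_lock(&big_lock);	/* ~ irq-save */	\
		v->counter op i;					\
		pthread_mutex_unlock(&big_lock);			\
	}								\
									\
	static inline int atomic_##opn##_return(int i, atomic_t *v)	\
	{								\
		int val;						\
									\
		pthread_mutex_lock(&big_lock);				\
		v->counter op i;					\
		val = v->counter;					\
		pthread_mutex_unlock(&big_lock);			\
		return val;						\
	}

	ATOMIC_OP(add, +=)	/* emits atomic_add(), atomic_add_return() */
	ATOMIC_OP(sub, -=)	/* emits atomic_sub(), atomic_sub_return() */
	#undef ATOMIC_OP
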
--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@ -43,71 +43,45 @@
  * store exclusive to ensure that these are atomic.  We may loop
  * to ensure that the update happens.
  */
-static inline void atomic_add(int i, atomic_t *v)
-{
-	unsigned long tmp;
-	int result;
-
-	asm volatile("// atomic_add\n"
-"1:	ldxr	%w0, %2\n"
-"	add	%w0, %w0, %w3\n"
-"	stxr	%w1, %w0, %2\n"
-"	cbnz	%w1, 1b"
-	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
-	: "Ir" (i)
-	: "cc");
-}
-
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	unsigned long tmp;
-	int result;
-
-	asm volatile("// atomic_add_return\n"
-"1:	ldxr	%w0, %2\n"
-"	add	%w0, %w0, %w3\n"
-"	stlxr	%w1, %w0, %2\n"
-"	cbnz	%w1, 1b"
-	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
-	: "Ir" (i)
-	: "cc", "memory");
-
-	smp_mb();
-	return result;
-}
-
-static inline void atomic_sub(int i, atomic_t *v)
-{
-	unsigned long tmp;
-	int result;
 
-	asm volatile("// atomic_sub\n"
-"1:	ldxr	%w0, %2\n"
-"	sub	%w0, %w0, %w3\n"
-"	stxr	%w1, %w0, %2\n"
-"	cbnz	%w1, 1b"
-	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
-	: "Ir" (i)
-	: "cc");
+#define ATOMIC_OP(op)							\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	unsigned long tmp;						\
+	int result;							\
+									\
+	asm volatile("// atomic_" #op "\n"				\
+"1:	ldxr	%w0, %2\n"						\
+"	" #op "	%w0, %w0, %w3\n"					\
+"	stxr	%w1, %w0, %2\n"						\
+"	cbnz	%w1, 1b"						\
+	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
+	: "Ir" (i)							\
+	: "cc");							\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	unsigned long tmp;						\
+	int result;							\
+									\
+	asm volatile("// atomic_" #op "_return\n"			\
+"1:	ldxr	%w0, %2\n"						\
+"	" #op "	%w0, %w0, %w3\n"					\
+"	stlxr	%w1, %w0, %2\n"						\
+"	cbnz	%w1, 1b"						\
+	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
+	: "Ir" (i)							\
+	: "cc", "memory");						\
+									\
+	smp_mb();							\
+	return result;							\
 }
 
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	unsigned long tmp;
-	int result;
+ATOMIC_OP(add)
+ATOMIC_OP(sub)
 
-	asm volatile("// atomic_sub_return\n"
-"1:	ldxr	%w0, %2\n"
-"	sub	%w0, %w0, %w3\n"
-"	stlxr	%w1, %w0, %2\n"
-"	cbnz	%w1, 1b"
-	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
-	: "Ir" (i)
-	: "cc", "memory");
-
-	smp_mb();
-	return result;
-}
+#undef ATOMIC_OP
 
 static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
 {
@@ -162,71 +136,44 @@ static inline int __atomic_add_unless(at
 #define atomic64_read(v)	(*(volatile long long *)&(v)->counter)
 #define atomic64_set(v,i)	(((v)->counter) = (i))
 
-static inline void atomic64_add(u64 i, atomic64_t *v)
-{
-	long result;
-	unsigned long tmp;
-
-	asm volatile("// atomic64_add\n"
-"1:	ldxr	%0, %2\n"
-"	add	%0, %0, %3\n"
-"	stxr	%w1, %0, %2\n"
-"	cbnz	%w1, 1b"
-	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
-	: "Ir" (i)
-	: "cc");
+#define ATOMIC64_OP(op)							\
+static inline void atomic64_##op(long i, atomic64_t *v)			\
+{									\
+	long result;							\
+	unsigned long tmp;						\
+									\
+	asm volatile("// atomic64_" #op "\n"				\
+"1:	ldxr	%0, %2\n"						\
+"	" #op "	%0, %0, %3\n"						\
+"	stxr	%w1, %0, %2\n"						\
+"	cbnz	%w1, 1b"						\
+	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
+	: "Ir" (i)							\
+	: "cc");							\
+}									\
+									\
+static inline long atomic64_##op##_return(long i, atomic64_t *v)	\
+{									\
+	long result;							\
+	unsigned long tmp;						\
+									\
+	asm volatile("// atomic64_" #op "_return\n"			\
+"1:	ldxr	%0, %2\n"						\
+"	" #op "	%0, %0, %3\n"						\
+"	stlxr	%w1, %0, %2\n"						\
+"	cbnz	%w1, 1b"						\
+	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
+	: "Ir" (i)							\
+	: "cc", "memory");						\
+									\
+	smp_mb();							\
+	return result;							\
 }
 
-static inline long atomic64_add_return(long i, atomic64_t *v)
-{
-	long result;
-	unsigned long tmp;
-
-	asm volatile("// atomic64_add_return\n"
-"1:	ldxr	%0, %2\n"
-"	add	%0, %0, %3\n"
-"	stlxr	%w1, %0, %2\n"
-"	cbnz	%w1, 1b"
-	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
-	: "Ir" (i)
-	: "cc", "memory");
+ATOMIC64_OP(add)
+ATOMIC64_OP(sub)
 
-	smp_mb();
-	return result;
-}
-
-static inline void atomic64_sub(u64 i, atomic64_t *v)
-{
-	long result;
-	unsigned long tmp;
-
-	asm volatile("// atomic64_sub\n"
-"1:	ldxr	%0, %2\n"
-"	sub	%0, %0, %3\n"
-"	stxr	%w1, %0, %2\n"
-"	cbnz	%w1, 1b"
-	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
-	: "Ir" (i)
-	: "cc");
-}
-
-static inline long atomic64_sub_return(long i, atomic64_t *v)
-{
-	long result;
-	unsigned long tmp;
-
-	asm volatile("// atomic64_sub_return\n"
-"1:	ldxr	%0, %2\n"
-"	sub	%0, %0, %3\n"
-"	stlxr	%w1, %0, %2\n"
-"	cbnz	%w1, 1b"
-	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
-	: "Ir" (i)
-	: "cc", "memory");
-
-	smp_mb();
-	return result;
-}
+#undef ATOMIC64_OP
 
 static inline long atomic64_cmpxchg(atomic64_t *ptr, long old, long new)
 {
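
The ldxr/stlxr pair plus the trailing smp_mb() is what makes the
_return variants fully ordered while the void ops place no ordering
requirement at all; a rough C11 approximation of that contract (not
what the asm above compiles to):

	#include <stdatomic.h>

	/* void op: no ordering needed, a relaxed RMW is enough */
	static inline void add_sketch(int i, atomic_int *v)
	{
		atomic_fetch_add_explicit(v, i, memory_order_relaxed);
	}

	/* _return op: release store (stlxr) + full fence (smp_mb()) */
	static inline int add_return_sketch(int i, atomic_int *v)
	{
		int new = atomic_fetch_add_explicit(v, i,
					memory_order_release) + i;
		atomic_thread_fence(memory_order_seq_cst);
		return new;
	}
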
--- a/arch/avr32/include/asm/atomic.h
+++ b/arch/avr32/include/asm/atomic.h
@@ -22,58 +22,45 @@
 #define atomic_read(v)		(*(volatile int *)&(v)->counter)
 #define atomic_set(v, i)	(((v)->counter) = i)
 
-/*
- * atomic_sub_return - subtract the atomic variable
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
- *
- * Atomically subtracts @i from @v. Returns the resulting value.
- */
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	int result;
-
-	asm volatile(
-		"/* atomic_sub_return */\n"
-		"1:	ssrf	5\n"
-		"	ld.w	%0, %2\n"
-		"	sub	%0, %3\n"
-		"	stcond	%1, %0\n"
-		"	brne	1b"
-		: "=&r"(result), "=o"(v->counter)
-		: "m"(v->counter), "rKs21"(i)
-		: "cc");
-
-	return result;
+#define ATOMIC_OP(op, asm_con)					\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	int result;							\
+									\
+	asm volatile(							\
+		"/* atomic_" #op " */\n"				\
+		"1:	ssrf	5\n"					\
+		"	ld.w	%0, %2\n"				\
+		"	" #op "	%0, %3\n"				\
+		"	stcond	%1, %0\n"				\
+		"	brne	1b"					\
+		: "=&r"(result), "=o"(v->counter)			\
+		: "m"(v->counter), asm_con (i)				\
+		: "cc");						\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	int result;							\
+									\
+	asm volatile(							\
+		"/* atomic_" #op "_return */\n"				\
+		"1:	ssrf	5\n"					\
+		"	ld.w	%0, %2\n"				\
+		"	" #op "	%0, %3\n"				\
+		"	stcond	%1, %0\n"				\
+		"	brne	1b"					\
+		: "=&r"(result), "=o"(v->counter)			\
+		: "m"(v->counter), asm_con (i)				\
+		: "cc");						\
+									\
+	return result;							\
 }
 
-/*
- * atomic_add_return - add integer to atomic variable
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v. Returns the resulting value.
- */
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	int result;
-
-	if (__builtin_constant_p(i) && (i >= -1048575) && (i <= 1048576))
-		result = atomic_sub_return(-i, v);
-	else
-		asm volatile(
-			"/* atomic_add_return */\n"
-			"1:	ssrf	5\n"
-			"	ld.w	%0, %1\n"
-			"	add	%0, %3\n"
-			"	stcond	%2, %0\n"
-			"	brne	1b"
-			: "=&r"(result), "=o"(v->counter)
-			: "m"(v->counter), "r"(i)
-			: "cc", "memory");
+ATOMIC_OP(add, "r")
+ATOMIC_OP(sub, "rKs21")
 
-	return result;
-}
+#undef ATOMIC_OP
 
 /*
  * atomic_sub_unless - sub unless the number is a given value
@@ -168,8 +155,6 @@ static inline int atomic_sub_if_positive
 #define atomic_xchg(v, new)	(xchg(&((v)->counter), new))
 #define atomic_cmpxchg(v, o, n)	(cmpxchg(&((v)->counter), (o), (n)))
 
-#define atomic_sub(i, v)	(void)atomic_sub_return(i, v)
-#define atomic_add(i, v)	(void)atomic_add_return(i, v)
 #define atomic_dec(v)		atomic_sub(1, (v))
 #define atomic_inc(v)		atomic_add(1, (v))
 
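The asm constraint is a macro parameter here because, as I read the
AVR32 ISA, sub takes a 21-bit signed immediate while add is
register-only; keeping "rKs21" on the add would let GCC feed an
immediate into an instruction that cannot encode one.  Passing the
constraint through works because it is just a string literal; a
minimal compile-time demo of the idiom (x86-64 asm, purely
illustrative):

	#define OP(name, insn, con)				\
	static inline int name(int a, int b)			\
	{							\
		asm(insn " %1, %0" : "+r"(a) : con(b));		\
		return a;					\
	}

	OP(add_demo, "add", "er")	/* 'er': reg or 32-bit immediate */
	OP(sub_demo, "sub", "er")
	#undef OP
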
--- a/arch/cris/include/arch-v10/arch/system.h
+++ b/arch/cris/include/arch-v10/arch/system.h
@@ -36,8 +36,6 @@ static inline unsigned long _get_base(ch
   return 0;
 }
 
-#define nop() __asm__ __volatile__ ("nop");
-
 #define xchg(ptr,x) ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
 #define tas(ptr) (xchg((ptr),1))
 
--- a/arch/cris/include/asm/atomic.h
+++ b/arch/cris/include/asm/atomic.h
@@ -21,44 +21,32 @@
 
 /* These should be written in asm but we do it in C for now. */
 
-static inline void atomic_add(int i, volatile atomic_t *v)
-{
-	unsigned long flags;
-	cris_atomic_save(v, flags);
-	v->counter += i;
-	cris_atomic_restore(v, flags);
+#define ATOMIC_OP(opn, op)						\
+static inline void atomic_##opn(int i, volatile atomic_t *v)		\
+{									\
+	unsigned long flags;						\
+	cris_atomic_save(v, flags);					\
+	v->counter op i;						\
+	cris_atomic_restore(v, flags);					\
+}									\
+									\
+static inline int atomic_##opn##_return(int i, volatile atomic_t *v)	\
+{									\
+	unsigned long flags;						\
+	int retval;							\
+	cris_atomic_save(v, flags);					\
+	retval = (v->counter op i);					\
+	cris_atomic_restore(v, flags);					\
+	return retval;							\
 }
 
-static inline void atomic_sub(int i, volatile atomic_t *v)
-{
-	unsigned long flags;
-	cris_atomic_save(v, flags);
-	v->counter -= i;
-	cris_atomic_restore(v, flags);
-}
+ATOMIC_OP(add, +=)
+ATOMIC_OP(sub, -=)
 
-static inline int atomic_add_return(int i, volatile atomic_t *v)
-{
-	unsigned long flags;
-	int retval;
-	cris_atomic_save(v, flags);
-	retval = (v->counter += i);
-	cris_atomic_restore(v, flags);
-	return retval;
-}
+#undef ATOMIC_OP
 
 #define atomic_add_negative(a, v)	(atomic_add_return((a), (v)) < 0)
 
-static inline int atomic_sub_return(int i, volatile atomic_t *v)
-{
-	unsigned long flags;
-	int retval;
-	cris_atomic_save(v, flags);
-	retval = (v->counter -= i);
-	cris_atomic_restore(v, flags);
-	return retval;
-}
-
 static inline int atomic_sub_and_test(int i, volatile atomic_t *v)
 {
 	int retval;
--- a/arch/hexagon/include/asm/atomic.h
+++ b/arch/hexagon/include/asm/atomic.h
@@ -81,41 +81,42 @@ static inline int atomic_cmpxchg(atomic_
 	return __oldval;
 }
 
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	int output;
-
-	__asm__ __volatile__ (
-		"1:	%0 = memw_locked(%1);\n"
-		"	%0 = add(%0,%2);\n"
-		"	memw_locked(%1,P3)=%0;\n"
-		"	if !P3 jump 1b;\n"
-		: "=&r" (output)
-		: "r" (&v->counter), "r" (i)
-		: "memory", "p3"
-	);
-	return output;
-
+#define ATOMIC_OP(op)							\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	int output;							\
+									\
+	__asm__ __volatile__ (						\
+		"1:	%0 = memw_locked(%1);\n"			\
+		"	%0 = "#op "(%0,%2);\n"				\
+		"	memw_locked(%1,P3)=%0;\n"			\
+		"	if !P3 jump 1b;\n"				\
+		: "=&r" (output)					\
+		: "r" (&v->counter), "r" (i)				\
+		: "memory", "p3"					\
+	);								\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	int output;							\
+									\
+	__asm__ __volatile__ (						\
+		"1:	%0 = memw_locked(%1);\n"			\
+		"	%0 = "#op "(%0,%2);\n"				\
+		"	memw_locked(%1,P3)=%0;\n"			\
+		"	if !P3 jump 1b;\n"				\
+		: "=&r" (output)					\
+		: "r" (&v->counter), "r" (i)				\
+		: "memory", "p3"					\
+	);								\
+	return output;							\
 }
 
-#define atomic_add(i, v) atomic_add_return(i, (v))
-
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	int output;
-	__asm__ __volatile__ (
-		"1:	%0 = memw_locked(%1);\n"
-		"	%0 = sub(%0,%2);\n"
-		"	memw_locked(%1,P3)=%0\n"
-		"	if !P3 jump 1b;\n"
-		: "=&r" (output)
-		: "r" (&v->counter), "r" (i)
-		: "memory", "p3"
-	);
-	return output;
-}
+ATOMIC_OP(add)
+ATOMIC_OP(sub)
 
-#define atomic_sub(i, v) atomic_sub_return(i, (v))
+#undef ATOMIC_OP
 
 /**
  * __atomic_add_unless - add unless the number is a given value
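
The "#op" splices work because the preprocessor stringizes the
argument and adjacent string literals then concatenate into one asm
template; a runnable illustration of how the template gets assembled:

	#include <stdio.h>

	#define TEMPLATE(op)	"1:	%0 = memw_locked(%1);\n"	\
				"	%0 = " #op "(%0,%2);\n"		\
				"	memw_locked(%1,P3)=%0;\n"

	int main(void)
	{
		fputs(TEMPLATE(add), stdout);	/* ... %0 = add(%0,%2); */
		fputs(TEMPLATE(sub), stdout);	/* ... %0 = sub(%0,%2); */
		return 0;
	}
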
--- a/arch/ia64/include/asm/atomic.h
+++ b/arch/ia64/include/asm/atomic.h
@@ -26,61 +26,93 @@
 #define atomic_set(v,i)		(((v)->counter) = (i))
 #define atomic64_set(v,i)	(((v)->counter) = (i))
 
-static __inline__ int
-ia64_atomic_add (int i, atomic_t *v)
-{
-	__s32 old, new;
-	CMPXCHG_BUGCHECK_DECL
-
-	do {
-		CMPXCHG_BUGCHECK(v);
-		old = atomic_read(v);
-		new = old + i;
-	} while (ia64_cmpxchg(acq, v, old, new, sizeof(atomic_t)) != old);
-	return new;
+#define ATOMIC_OP(opn, op)						\
+static __inline__ int							\
+ia64_atomic_##opn (int i, atomic_t *v)					\
+{									\
+	__s32 old, new;							\
+	CMPXCHG_BUGCHECK_DECL						\
+									\
+	do {								\
+		CMPXCHG_BUGCHECK(v);					\
+		old = atomic_read(v);					\
+		new = old op i;						\
+	} while (ia64_cmpxchg(acq, v, old, new, sizeof(atomic_t)) != old); \
+	return new;							\
 }
 
-static __inline__ long
-ia64_atomic64_add (__s64 i, atomic64_t *v)
-{
-	__s64 old, new;
-	CMPXCHG_BUGCHECK_DECL
+ATOMIC_OP(add, +)
+ATOMIC_OP(sub, -)
 
-	do {
-		CMPXCHG_BUGCHECK(v);
-		old = atomic64_read(v);
-		new = old + i;
-	} while (ia64_cmpxchg(acq, v, old, new, sizeof(atomic64_t)) != old);
-	return new;
-}
+#undef ATOMIC_OP
 
-static __inline__ int
-ia64_atomic_sub (int i, atomic_t *v)
-{
-	__s32 old, new;
-	CMPXCHG_BUGCHECK_DECL
+#define atomic_add_return(i,v)						\
+({									\
+	int __ia64_aar_i = (i);						\
+	(__builtin_constant_p(i)					\
+	 && (   (__ia64_aar_i ==  1) || (__ia64_aar_i ==   4)		\
+	     || (__ia64_aar_i ==  8) || (__ia64_aar_i ==  16)		\
+	     || (__ia64_aar_i == -1) || (__ia64_aar_i ==  -4)		\
+	     || (__ia64_aar_i == -8) || (__ia64_aar_i == -16)))		\
+		? ia64_fetch_and_add(__ia64_aar_i, &(v)->counter)	\
+		: ia64_atomic_add(__ia64_aar_i, v);			\
+})
 
-	do {
-		CMPXCHG_BUGCHECK(v);
-		old = atomic_read(v);
-		new = old - i;
-	} while (ia64_cmpxchg(acq, v, old, new, sizeof(atomic_t)) != old);
-	return new;
+#define atomic_sub_return(i,v)						\
+({									\
+	int __ia64_asr_i = (i);						\
+	(__builtin_constant_p(i)					\
+	 && (   (__ia64_asr_i ==   1) || (__ia64_asr_i ==   4)		\
+	     || (__ia64_asr_i ==   8) || (__ia64_asr_i ==  16)		\
+	     || (__ia64_asr_i ==  -1) || (__ia64_asr_i ==  -4)		\
+	     || (__ia64_asr_i ==  -8) || (__ia64_asr_i == -16)))	\
+		? ia64_fetch_and_add(-__ia64_asr_i, &(v)->counter)	\
+		: ia64_atomic_sub(__ia64_asr_i, v);			\
+})
+
+#define ATOMIC64_OP(opn, op)						\
+static __inline__ long							\
+ia64_atomic64_##opn (__s64 i, atomic64_t *v)				\
+{									\
+	__s64 old, new;							\
+	CMPXCHG_BUGCHECK_DECL						\
+									\
+	do {								\
+		CMPXCHG_BUGCHECK(v);					\
+		old = atomic64_read(v);					\
+		new = old op i;						\
+	} while (ia64_cmpxchg(acq, v, old, new, sizeof(atomic64_t)) != old); \
+	return new;							\
 }
 
-static __inline__ long
-ia64_atomic64_sub (__s64 i, atomic64_t *v)
-{
-	__s64 old, new;
-	CMPXCHG_BUGCHECK_DECL
+ATOMIC64_OP(add, +)
+ATOMIC64_OP(sub, -)
 
-	do {
-		CMPXCHG_BUGCHECK(v);
-		old = atomic64_read(v);
-		new = old - i;
-	} while (ia64_cmpxchg(acq, v, old, new, sizeof(atomic64_t)) != old);
-	return new;
-}
+#undef ATOMIC64_OP
+
+#define atomic64_add_return(i,v)					\
+({									\
+	long __ia64_aar_i = (i);					\
+	(__builtin_constant_p(i)					\
+	 && (   (__ia64_aar_i ==  1) || (__ia64_aar_i ==   4)		\
+	     || (__ia64_aar_i ==  8) || (__ia64_aar_i ==  16)		\
+	     || (__ia64_aar_i == -1) || (__ia64_aar_i ==  -4)		\
+	     || (__ia64_aar_i == -8) || (__ia64_aar_i == -16)))		\
+		? ia64_fetch_and_add(__ia64_aar_i, &(v)->counter)	\
+		: ia64_atomic64_add(__ia64_aar_i, v);			\
+})
+
+#define atomic64_sub_return(i,v)					\
+({									\
+	long __ia64_asr_i = (i);					\
+	(__builtin_constant_p(i)					\
+	 && (   (__ia64_asr_i ==   1) || (__ia64_asr_i ==   4)		\
+	     || (__ia64_asr_i ==   8) || (__ia64_asr_i ==  16)		\
+	     || (__ia64_asr_i ==  -1) || (__ia64_asr_i ==  -4)		\
+	     || (__ia64_asr_i ==  -8) || (__ia64_asr_i == -16)))	\
+		? ia64_fetch_and_add(-__ia64_asr_i, &(v)->counter)	\
+		: ia64_atomic64_sub(__ia64_asr_i, v);			\
+})
 
 #define atomic_cmpxchg(v, old, new) (cmpxchg(&((v)->counter), old, new))
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
@@ -122,30 +154,6 @@ static __inline__ long atomic64_add_unle
 
 #define atomic64_inc_not_zero(v) atomic64_add_unless((v), 1, 0)
 
-#define atomic_add_return(i,v)						\
-({									\
-	int __ia64_aar_i = (i);						\
-	(__builtin_constant_p(i)					\
-	 && (   (__ia64_aar_i ==  1) || (__ia64_aar_i ==   4)		\
-	     || (__ia64_aar_i ==  8) || (__ia64_aar_i ==  16)		\
-	     || (__ia64_aar_i == -1) || (__ia64_aar_i ==  -4)		\
-	     || (__ia64_aar_i == -8) || (__ia64_aar_i == -16)))		\
-		? ia64_fetch_and_add(__ia64_aar_i, &(v)->counter)	\
-		: ia64_atomic_add(__ia64_aar_i, v);			\
-})
-
-#define atomic64_add_return(i,v)					\
-({									\
-	long __ia64_aar_i = (i);					\
-	(__builtin_constant_p(i)					\
-	 && (   (__ia64_aar_i ==  1) || (__ia64_aar_i ==   4)		\
-	     || (__ia64_aar_i ==  8) || (__ia64_aar_i ==  16)		\
-	     || (__ia64_aar_i == -1) || (__ia64_aar_i ==  -4)		\
-	     || (__ia64_aar_i == -8) || (__ia64_aar_i == -16)))		\
-		? ia64_fetch_and_add(__ia64_aar_i, &(v)->counter)	\
-		: ia64_atomic64_add(__ia64_aar_i, v);			\
-})
-
 /*
  * Atomically add I to V and return TRUE if the resulting value is
  * negative.
@@ -162,30 +170,6 @@ atomic64_add_negative (__s64 i, atomic64
 	return atomic64_add_return(i, v) < 0;
 }
 
-#define atomic_sub_return(i,v)						\
-({									\
-	int __ia64_asr_i = (i);						\
-	(__builtin_constant_p(i)					\
-	 && (   (__ia64_asr_i ==   1) || (__ia64_asr_i ==   4)		\
-	     || (__ia64_asr_i ==   8) || (__ia64_asr_i ==  16)		\
-	     || (__ia64_asr_i ==  -1) || (__ia64_asr_i ==  -4)		\
-	     || (__ia64_asr_i ==  -8) || (__ia64_asr_i == -16)))	\
-		? ia64_fetch_and_add(-__ia64_asr_i, &(v)->counter)	\
-		: ia64_atomic_sub(__ia64_asr_i, v);			\
-})
-
-#define atomic64_sub_return(i,v)					\
-({									\
-	long __ia64_asr_i = (i);					\
-	(__builtin_constant_p(i)					\
-	 && (   (__ia64_asr_i ==   1) || (__ia64_asr_i ==   4)		\
-	     || (__ia64_asr_i ==   8) || (__ia64_asr_i ==  16)		\
-	     || (__ia64_asr_i ==  -1) || (__ia64_asr_i ==  -4)		\
-	     || (__ia64_asr_i ==  -8) || (__ia64_asr_i == -16)))	\
-		? ia64_fetch_and_add(-__ia64_asr_i, &(v)->counter)	\
-		: ia64_atomic64_sub(__ia64_asr_i, v);			\
-})
-
 #define atomic_dec_return(v)		atomic_sub_return(1, (v))
 #define atomic_inc_return(v)		atomic_add_return(1, (v))
 #define atomic64_dec_return(v)		atomic64_sub_return(1, (v))
@@ -198,13 +182,13 @@ atomic64_add_negative (__s64 i, atomic64
 #define atomic64_dec_and_test(v)	(atomic64_sub_return(1, (v)) == 0)
 #define atomic64_inc_and_test(v)	(atomic64_add_return(1, (v)) == 0)
 
-#define atomic_add(i,v)			atomic_add_return((i), (v))
-#define atomic_sub(i,v)			atomic_sub_return((i), (v))
+#define atomic_add(i,v)			(void)atomic_add_return((i), (v))
+#define atomic_sub(i,v)			(void)atomic_sub_return((i), (v))
 #define atomic_inc(v)			atomic_add(1, (v))
 #define atomic_dec(v)			atomic_sub(1, (v))
 
-#define atomic64_add(i,v)		atomic64_add_return((i), (v))
-#define atomic64_sub(i,v)		atomic64_sub_return((i), (v))
+#define atomic64_add(i,v)		(void)atomic64_add_return((i), (v))
+#define atomic64_sub(i,v)		(void)atomic64_sub_return((i), (v))
 #define atomic64_inc(v)			atomic64_add(1, (v))
 #define atomic64_dec(v)			atomic64_sub(1, (v))
 
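The ia64 ops are the classic cmpxchg loop; a portable C11 sketch of
the same shape (an approximation, with the acquire ordering matching
ia64's cmpxchg.acq):

	#include <stdatomic.h>

	static inline int cmpxchg_loop_add_return(int i, atomic_int *v)
	{
		int old = atomic_load_explicit(v, memory_order_relaxed);

		/* on failure, compare_exchange reloads 'old' for us */
		while (!atomic_compare_exchange_weak_explicit(v, &old,
				old + i, memory_order_acquire,
				memory_order_relaxed))
			;
		return old + i;
	}
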
--- a/arch/m32r/include/asm/atomic.h
+++ b/arch/m32r/include/asm/atomic.h
@@ -39,85 +39,59 @@
  */
 #define atomic_set(v,i)	(((v)->counter) = (i))
 
-/**
- * atomic_add_return - add integer to atomic variable and return it
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v and return (@i + @v).
- */
-static __inline__ int atomic_add_return(int i, atomic_t *v)
-{
-	unsigned long flags;
-	int result;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# atomic_add_return		\n\t"
-		DCACHE_CLEAR("%0", "r4", "%1")
-		M32R_LOCK" %0, @%1;		\n\t"
-		"add	%0, %2;			\n\t"
-		M32R_UNLOCK" %0, @%1;		\n\t"
-		: "=&r" (result)
-		: "r" (&v->counter), "r" (i)
-		: "memory"
 #ifdef CONFIG_CHIP_M32700_TS1
-		, "r4"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-
-	return result;
+#define __ATOMIC_CLOBBER	, "r4", "r5"
+#else
+#define __ATOMIC_CLOBBER
+#endif
+
+#define ATOMIC_OP(op)							\
+static __inline__ void atomic_##op(int i, atomic_t *v)			\
+{									\
+	unsigned long flags;						\
+	int result;							\
+									\
+	local_irq_save(flags);						\
+	__asm__ __volatile__ (						\
+		"# atomic_" #op "		\n\t"			\
+		DCACHE_CLEAR("%0", "r4", "%1")				\
+		M32R_LOCK" %0, @%1;		\n\t"			\
+		#op " %0, %2;			\n\t"			\
+		M32R_UNLOCK" %0, @%1;		\n\t"			\
+		: "=&r" (result)					\
+		: "r" (&v->counter), "r" (i)				\
+		: "memory"						\
+		__ATOMIC_CLOBBER					\
+	);								\
+	local_irq_restore(flags);					\
+}									\
+									\
+static __inline__ int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	unsigned long flags;						\
+	int result;							\
+									\
+	local_irq_save(flags);						\
+	__asm__ __volatile__ (						\
+		"# atomic_" #op "_return	\n\t"			\
+		DCACHE_CLEAR("%0", "r4", "%1")				\
+		M32R_LOCK" %0, @%1;		\n\t"			\
+		#op " %0, %2;			\n\t"			\
+		M32R_UNLOCK" %0, @%1;		\n\t"			\
+		: "=&r" (result)					\
+		: "r" (&v->counter), "r" (i)				\
+		: "memory"						\
+		__ATOMIC_CLOBBER					\
+	);								\
+	local_irq_restore(flags);					\
+									\
+	return result;							\
 }
 
-/**
- * atomic_sub_return - subtract integer from atomic variable and return it
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
- *
- * Atomically subtracts @i from @v and return (@v - @i).
- */
-static __inline__ int atomic_sub_return(int i, atomic_t *v)
-{
-	unsigned long flags;
-	int result;
+ATOMIC_OP(add)
+ATOMIC_OP(sub)
 
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# atomic_sub_return		\n\t"
-		DCACHE_CLEAR("%0", "r4", "%1")
-		M32R_LOCK" %0, @%1;		\n\t"
-		"sub	%0, %2;			\n\t"
-		M32R_UNLOCK" %0, @%1;		\n\t"
-		: "=&r" (result)
-		: "r" (&v->counter), "r" (i)
-		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r4"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-
-	return result;
-}
-
-/**
- * atomic_add - add integer to atomic variable
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v.
- */
-#define atomic_add(i,v) ((void) atomic_add_return((i), (v)))
-
-/**
- * atomic_sub - subtract the atomic variable
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
- *
- * Atomically subtracts @i from @v.
- */
-#define atomic_sub(i,v) ((void) atomic_sub_return((i), (v)))
+#undef ATOMIC_OP
 
 /**
  * atomic_sub_and_test - subtract value from variable and test result
@@ -151,9 +125,7 @@ static __inline__ int atomic_inc_return(
 		: "=&r" (result)
 		: "r" (&v->counter)
 		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r4"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
+		__ATOMIC_CLOBBER
 	);
 	local_irq_restore(flags);
 
@@ -181,9 +153,7 @@ static __inline__ int atomic_dec_return(
 		: "=&r" (result)
 		: "r" (&v->counter)
 		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r4"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
+		__ATOMIC_CLOBBER
 	);
 	local_irq_restore(flags);
 
@@ -280,9 +250,7 @@ static __inline__ void atomic_clear_mask
 		: "=&r" (tmp)
 		: "r" (addr), "r" (~mask)
 		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r5"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
+		__ATOMIC_CLOBBER
 	);
 	local_irq_restore(flags);
 }
@@ -302,9 +270,7 @@ static __inline__ void atomic_set_mask(u
 		: "=&r" (tmp)
 		: "r" (addr), "r" (mask)
 		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r5"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
+		__ATOMIC_CLOBBER
 	);
 	local_irq_restore(flags);
 }
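
Splicing an optional clobber from one place relies on the clobber list
being nothing more than comma-separated string literals; note that the
shared macro has to cover every scratch register any of its users
needs, which is why it lists both r4 (the arithmetic ops) and r5 (the
mask ops' DCACHE_CLEAR).  A tiny x86-64 illustration of the idiom,
names hypothetical:

	#ifdef CONFIG_NEEDS_EXTRA_SCRATCH
	#define MY_CLOBBER	, "rbx"
	#else
	#define MY_CLOBBER
	#endif

	static inline int clobber_demo(int x)
	{
		asm("add $1, %0" : "+r"(x) : : "cc" MY_CLOBBER);
		return x;
	}
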
--- a/arch/m68k/include/asm/atomic.h
+++ b/arch/m68k/include/asm/atomic.h
@@ -30,16 +30,55 @@
 #define	ASM_DI	"di"
 #endif
 
-static inline void atomic_add(int i, atomic_t *v)
-{
-	__asm__ __volatile__("addl %1,%0" : "+m" (*v) : ASM_DI (i));
-}
+#ifdef CONFIG_RMW_INSNS
 
-static inline void atomic_sub(int i, atomic_t *v)
-{
-	__asm__ __volatile__("subl %1,%0" : "+m" (*v) : ASM_DI (i));
+#define ATOMIC_OP(op, cop)						\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	__asm__ __volatile__(#op "l %1,%0" : "+m" (*v) : ASM_DI (i));	\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	int t, tmp;							\
+									\
+	__asm__ __volatile__(						\
+			"1:	movel %2,%1\n"				\
+			"	" #op "l %3,%1\n"			\
+			"	casl %2,%1,%0\n"			\
+			"	jne 1b"					\
+			: "+m" (*v), "=&d" (t), "=&d" (tmp)		\
+			: "g" (i), "2" (atomic_read(v)));		\
+	return t;							\
+}
+
+#else
+
+#define ATOMIC_OP(op, cop)						\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	__asm__ __volatile__(#op "l %1,%0" : "+m" (*v) : ASM_DI (i));	\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t * v)		\
+{									\
+	unsigned long flags;						\
+	int t;								\
+									\
+	local_irq_save(flags);						\
+	t = (v->counter cop i);						\
+	local_irq_restore(flags);					\
+									\
+	return t;							\
 }
 
+#endif /* CONFIG_RMW_INSNS */
+
+ATOMIC_OP(add, +=)
+ATOMIC_OP(sub, -=)
+
+#undef ATOMIC_OP
+
 static inline void atomic_inc(atomic_t *v)
 {
 	__asm__ __volatile__("addql #1,%0" : "+m" (*v));
@@ -76,67 +115,11 @@ static inline int atomic_inc_and_test(at
 
 #ifdef CONFIG_RMW_INSNS
 
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	int t, tmp;
-
-	__asm__ __volatile__(
-			"1:	movel %2,%1\n"
-			"	addl %3,%1\n"
-			"	casl %2,%1,%0\n"
-			"	jne 1b"
-			: "+m" (*v), "=&d" (t), "=&d" (tmp)
-			: "g" (i), "2" (atomic_read(v)));
-	return t;
-}
-
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	int t, tmp;
-
-	__asm__ __volatile__(
-			"1:	movel %2,%1\n"
-			"	subl %3,%1\n"
-			"	casl %2,%1,%0\n"
-			"	jne 1b"
-			: "+m" (*v), "=&d" (t), "=&d" (tmp)
-			: "g" (i), "2" (atomic_read(v)));
-	return t;
-}
-
 #define atomic_cmpxchg(v, o, n) ((int)cmpxchg(&((v)->counter), (o), (n)))
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 
 #else /* !CONFIG_RMW_INSNS */
 
-static inline int atomic_add_return(int i, atomic_t * v)
-{
-	unsigned long flags;
-	int t;
-
-	local_irq_save(flags);
-	t = atomic_read(v);
-	t += i;
-	atomic_set(v, t);
-	local_irq_restore(flags);
-
-	return t;
-}
-
-static inline int atomic_sub_return(int i, atomic_t * v)
-{
-	unsigned long flags;
-	int t;
-
-	local_irq_save(flags);
-	t = atomic_read(v);
-	t -= i;
-	atomic_set(v, t);
-	local_irq_restore(flags);
-
-	return t;
-}
-
 static inline int atomic_cmpxchg(atomic_t *v, int old, int new)
 {
 	unsigned long flags;
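
Defining the same ATOMIC_OP() differently per config and invoking it
once keeps the two m68k variants in lock-step; a user-space sketch of
the idiom (hypothetical names, GCC builtins standing in for casl):

	#include <pthread.h>

	typedef struct { int counter; } atomic_t;

	#ifdef HAVE_RMW_INSNS
	#define ATOMIC_OP(op, cop)					\
	static inline int atomic_##op##_return(int i, atomic_t *v)	\
	{								\
		/* pastes to __atomic_add_fetch / __atomic_sub_fetch */	\
		return __atomic_##op##_fetch(&v->counter, i,		\
					     __ATOMIC_SEQ_CST);		\
	}
	#else
	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	#define ATOMIC_OP(op, cop)					\
	static inline int atomic_##op##_return(int i, atomic_t *v)	\
	{								\
		int t;							\
									\
		pthread_mutex_lock(&lock);				\
		t = (v->counter cop i);					\
		pthread_mutex_unlock(&lock);				\
		return t;						\
	}
	#endif

	ATOMIC_OP(add, +=)	/* cop only used by the locked variant */
	ATOMIC_OP(sub, -=)
	#undef ATOMIC_OP
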
--- a/arch/metag/include/asm/atomic_lnkget.h
+++ b/arch/metag/include/asm/atomic_lnkget.h
@@ -27,85 +27,51 @@ static inline int atomic_read(const atom
 	return temp;
 }
 
-static inline void atomic_add(int i, atomic_t *v)
-{
-	int temp;
-
-	asm volatile (
-		"1:	LNKGETD %0, [%1]\n"
-		"	ADD	%0, %0, %2\n"
-		"	LNKSETD [%1], %0\n"
-		"	DEFR	%0, TXSTAT\n"
-		"	ANDT	%0, %0, #HI(0x3f000000)\n"
-		"	CMPT	%0, #HI(0x02000000)\n"
-		"	BNZ	1b\n"
-		: "=&d" (temp)
-		: "da" (&v->counter), "bd" (i)
-		: "cc");
-}
-
-static inline void atomic_sub(int i, atomic_t *v)
-{
-	int temp;
-
-	asm volatile (
-		"1:	LNKGETD %0, [%1]\n"
-		"	SUB	%0, %0, %2\n"
-		"	LNKSETD [%1], %0\n"
-		"	DEFR	%0, TXSTAT\n"
-		"	ANDT	%0, %0, #HI(0x3f000000)\n"
-		"	CMPT	%0, #HI(0x02000000)\n"
-		"	BNZ 1b\n"
-		: "=&d" (temp)
-		: "da" (&v->counter), "bd" (i)
-		: "cc");
-}
-
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	int result, temp;
-
-	smp_mb();
-
-	asm volatile (
-		"1:	LNKGETD %1, [%2]\n"
-		"	ADD	%1, %1, %3\n"
-		"	LNKSETD [%2], %1\n"
-		"	DEFR	%0, TXSTAT\n"
-		"	ANDT	%0, %0, #HI(0x3f000000)\n"
-		"	CMPT	%0, #HI(0x02000000)\n"
-		"	BNZ 1b\n"
-		: "=&d" (temp), "=&da" (result)
-		: "da" (&v->counter), "bd" (i)
-		: "cc");
-
-	smp_mb();
-
-	return result;
+#define ATOMIC_OP(op)							\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	int temp;							\
+									\
+	asm volatile (							\
+		"1:	LNKGETD %0, [%1]\n"				\
+		"	" #op "	%0, %0, %2\n"				\
+		"	LNKSETD [%1], %0\n"				\
+		"	DEFR	%0, TXSTAT\n"				\
+		"	ANDT	%0, %0, #HI(0x3f000000)\n"		\
+		"	CMPT	%0, #HI(0x02000000)\n"			\
+		"	BNZ	1b\n"					\
+		: "=&d" (temp)						\
+		: "da" (&v->counter), "bd" (i)				\
+		: "cc");						\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	int result, temp;						\
+									\
+	smp_mb();							\
+									\
+	asm volatile (							\
+		"1:	LNKGETD %1, [%2]\n"				\
+		"	" #op "	%1, %1, %3\n"				\
+		"	LNKSETD [%2], %1\n"				\
+		"	DEFR	%0, TXSTAT\n"				\
+		"	ANDT	%0, %0, #HI(0x3f000000)\n"		\
+		"	CMPT	%0, #HI(0x02000000)\n"			\
+		"	BNZ 1b\n"					\
+		: "=&d" (temp), "=&da" (result)				\
+		: "da" (&v->counter), "bd" (i)				\
+		: "cc");						\
+									\
+	smp_mb();							\
+									\
+	return result;							\
 }
 
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	int result, temp;
+ATOMIC_OP(add)
+ATOMIC_OP(sub)
 
-	smp_mb();
-
-	asm volatile (
-		"1:	LNKGETD %1, [%2]\n"
-		"	SUB	%1, %1, %3\n"
-		"	LNKSETD [%2], %1\n"
-		"	DEFR	%0, TXSTAT\n"
-		"	ANDT	%0, %0, #HI(0x3f000000)\n"
-		"	CMPT	%0, #HI(0x02000000)\n"
-		"	BNZ	1b\n"
-		: "=&d" (temp), "=&da" (result)
-		: "da" (&v->counter), "bd" (i)
-		: "cc");
-
-	smp_mb();
-
-	return result;
-}
+#undef ATOMIC_OP
 
 static inline void atomic_clear_mask(unsigned int mask, atomic_t *v)
 {
--- a/arch/metag/include/asm/atomic_lock1.h
+++ b/arch/metag/include/asm/atomic_lock1.h
@@ -37,55 +37,36 @@ static inline int atomic_set(atomic_t *v
 	return i;
 }
 
-static inline void atomic_add(int i, atomic_t *v)
-{
-	unsigned long flags;
-
-	__global_lock1(flags);
-	fence();
-	v->counter += i;
-	__global_unlock1(flags);
-}
-
-static inline void atomic_sub(int i, atomic_t *v)
-{
-	unsigned long flags;
-
-	__global_lock1(flags);
-	fence();
-	v->counter -= i;
-	__global_unlock1(flags);
-}
-
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	unsigned long result;
-	unsigned long flags;
-
-	__global_lock1(flags);
-	result = v->counter;
-	result += i;
-	fence();
-	v->counter = result;
-	__global_unlock1(flags);
-
-	return result;
+#define ATOMIC_OP(opn, op)						\
+static inline void atomic_##opn(int i, atomic_t *v)			\
+{									\
+	unsigned long flags;						\
+									\
+	__global_lock1(flags);						\
+	fence();							\
+	v->counter op i;						\
+	__global_unlock1(flags);					\
+}									\
+									\
+static inline int atomic_##opn##_return(int i, atomic_t *v)		\
+{									\
+	unsigned long result;						\
+	unsigned long flags;						\
+									\
+	__global_lock1(flags);						\
+	result = v->counter;						\
+	result op i;							\
+	fence();							\
+	v->counter = result;						\
+	__global_unlock1(flags);					\
+									\
+	return result;							\
 }
 
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	unsigned long result;
-	unsigned long flags;
-
-	__global_lock1(flags);
-	result = v->counter;
-	result -= i;
-	fence();
-	v->counter = result;
-	__global_unlock1(flags);
+ATOMIC_OP(add, +=)
+ATOMIC_OP(sub, -=)
 
-	return result;
-}
+#undef ATOMIC_OP
 
 static inline void atomic_clear_mask(unsigned int mask, atomic_t *v)
 {
--- a/arch/mips/include/asm/atomic.h
+++ b/arch/mips/include/asm/atomic.h
@@ -40,195 +40,96 @@
  */
 #define atomic_set(v, i)		((v)->counter = (i))
 
-/*
- * atomic_add - add integer to atomic variable
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v.
- */
-static __inline__ void atomic_add(int i, atomic_t * v)
-{
-	if (kernel_uses_llsc && R10000_LLSC_WAR) {
-		int temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:	ll	%0, %1		# atomic_add		\n"
-		"	addu	%0, %2					\n"
-		"	sc	%0, %1					\n"
-		"	beqzl	%0, 1b					\n"
-		"	.set	mips0					\n"
-		: "=&r" (temp), "+m" (v->counter)
-		: "Ir" (i));
-	} else if (kernel_uses_llsc) {
-		int temp;
-
-		do {
-			__asm__ __volatile__(
-			"	.set	mips3				\n"
-			"	ll	%0, %1		# atomic_add	\n"
-			"	addu	%0, %2				\n"
-			"	sc	%0, %1				\n"
-			"	.set	mips0				\n"
-			: "=&r" (temp), "+m" (v->counter)
-			: "Ir" (i));
-		} while (unlikely(!temp));
-	} else {
-		unsigned long flags;
-
-		raw_local_irq_save(flags);
-		v->counter += i;
-		raw_local_irq_restore(flags);
-	}
+#define ATOMIC_OP(op, aop, cop)							\
+static __inline__ void atomic_##op(int i, atomic_t * v)				\
+{										\
+	if (kernel_uses_llsc && R10000_LLSC_WAR) {				\
+		int temp;							\
+										\
+		__asm__ __volatile__(						\
+		"	.set	mips3					\n"	\
+		"1:	ll	%0, %1		# atomic_" #op "	\n"	\
+		"	" #aop " %0, %2					\n"	\
+		"	sc	%0, %1					\n"	\
+		"	beqzl	%0, 1b					\n"	\
+		"	.set	mips0					\n"	\
+		: "=&r" (temp), "+m" (v->counter)				\
+		: "Ir" (i));							\
+	} else if (kernel_uses_llsc) {						\
+		int temp;							\
+										\
+		do {								\
+			__asm__ __volatile__(					\
+			"	.set	mips3				\n"	\
+			"	ll	%0, %1		# atomic_" #op "\n"	\
+			"	" #aop " %0, %2				\n"	\
+			"	sc	%0, %1				\n"	\
+			"	.set	mips0				\n"	\
+			: "=&r" (temp), "+m" (v->counter)			\
+			: "Ir" (i));						\
+		} while (unlikely(!temp));					\
+	} else {								\
+		unsigned long flags;						\
+										\
+		raw_local_irq_save(flags);					\
+		v->counter cop i;						\
+		raw_local_irq_restore(flags);					\
+	}									\
+}										\
+										\
+static __inline__ int atomic_##op##_return(int i, atomic_t * v)			\
+{										\
+	int result;								\
+										\
+	smp_mb__before_llsc();							\
+										\
+	if (kernel_uses_llsc && R10000_LLSC_WAR) {				\
+		int temp;							\
+										\
+		__asm__ __volatile__(						\
+		"	.set	mips3					\n"	\
+		"1:	ll	%1, %2		# atomic_" #op "_return	\n"	\
+		"	" #aop " %0, %1, %3				\n"	\
+		"	sc	%0, %2					\n"	\
+		"	beqzl	%0, 1b					\n"	\
+		"	" #aop " %0, %1, %3				\n"	\
+		"	.set	mips0					\n"	\
+		: "=&r" (result), "=&r" (temp), "+m" (v->counter)		\
+		: "Ir" (i));							\
+	} else if (kernel_uses_llsc) {						\
+		int temp;							\
+										\
+		do {								\
+			__asm__ __volatile__(					\
+			"	.set	mips3				\n"	\
+			"	ll	%1, %2	# atomic_" #op "_return	\n"	\
+			"	" #aop " %0, %1, %3			\n"	\
+			"	sc	%0, %2				\n"	\
+			"	.set	mips0				\n"	\
+			: "=&r" (result), "=&r" (temp), "+m" (v->counter)	\
+			: "Ir" (i));						\
+		} while (unlikely(!result));					\
+										\
+		result = temp; result cop i;					\
+	} else {								\
+		unsigned long flags;						\
+										\
+		raw_local_irq_save(flags);					\
+		result = v->counter;						\
+		result cop i;							\
+		v->counter = result;						\
+		raw_local_irq_restore(flags);					\
+	}									\
+										\
+	smp_llsc_mb();								\
+										\
+	return result;								\
 }
 
-/*
- * atomic_sub - subtract the atomic variable
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
- *
- * Atomically subtracts @i from @v.
- */
-static __inline__ void atomic_sub(int i, atomic_t * v)
-{
-	if (kernel_uses_llsc && R10000_LLSC_WAR) {
-		int temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:	ll	%0, %1		# atomic_sub		\n"
-		"	subu	%0, %2					\n"
-		"	sc	%0, %1					\n"
-		"	beqzl	%0, 1b					\n"
-		"	.set	mips0					\n"
-		: "=&r" (temp), "+m" (v->counter)
-		: "Ir" (i));
-	} else if (kernel_uses_llsc) {
-		int temp;
+ATOMIC_OP(add, addu, +=)
+ATOMIC_OP(sub, subu, -=)
 
-		do {
-			__asm__ __volatile__(
-			"	.set	mips3				\n"
-			"	ll	%0, %1		# atomic_sub	\n"
-			"	subu	%0, %2				\n"
-			"	sc	%0, %1				\n"
-			"	.set	mips0				\n"
-			: "=&r" (temp), "+m" (v->counter)
-			: "Ir" (i));
-		} while (unlikely(!temp));
-	} else {
-		unsigned long flags;
-
-		raw_local_irq_save(flags);
-		v->counter -= i;
-		raw_local_irq_restore(flags);
-	}
-}
-
-/*
- * Same as above, but return the result value
- */
-static __inline__ int atomic_add_return(int i, atomic_t * v)
-{
-	int result;
-
-	smp_mb__before_llsc();
-
-	if (kernel_uses_llsc && R10000_LLSC_WAR) {
-		int temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:	ll	%1, %2		# atomic_add_return	\n"
-		"	addu	%0, %1, %3				\n"
-		"	sc	%0, %2					\n"
-		"	beqzl	%0, 1b					\n"
-		"	addu	%0, %1, %3				\n"
-		"	.set	mips0					\n"
-		: "=&r" (result), "=&r" (temp), "+m" (v->counter)
-		: "Ir" (i));
-	} else if (kernel_uses_llsc) {
-		int temp;
-
-		do {
-			__asm__ __volatile__(
-			"	.set	mips3				\n"
-			"	ll	%1, %2	# atomic_add_return	\n"
-			"	addu	%0, %1, %3			\n"
-			"	sc	%0, %2				\n"
-			"	.set	mips0				\n"
-			: "=&r" (result), "=&r" (temp), "+m" (v->counter)
-			: "Ir" (i));
-		} while (unlikely(!result));
-
-		result = temp + i;
-	} else {
-		unsigned long flags;
-
-		raw_local_irq_save(flags);
-		result = v->counter;
-		result += i;
-		v->counter = result;
-		raw_local_irq_restore(flags);
-	}
-
-	smp_llsc_mb();
-
-	return result;
-}
-
-static __inline__ int atomic_sub_return(int i, atomic_t * v)
-{
-	int result;
-
-	smp_mb__before_llsc();
-
-	if (kernel_uses_llsc && R10000_LLSC_WAR) {
-		int temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:	ll	%1, %2		# atomic_sub_return	\n"
-		"	subu	%0, %1, %3				\n"
-		"	sc	%0, %2					\n"
-		"	beqzl	%0, 1b					\n"
-		"	subu	%0, %1, %3				\n"
-		"	.set	mips0					\n"
-		: "=&r" (result), "=&r" (temp), "=m" (v->counter)
-		: "Ir" (i), "m" (v->counter)
-		: "memory");
-
-		result = temp - i;
-	} else if (kernel_uses_llsc) {
-		int temp;
-
-		do {
-			__asm__ __volatile__(
-			"	.set	mips3				\n"
-			"	ll	%1, %2	# atomic_sub_return	\n"
-			"	subu	%0, %1, %3			\n"
-			"	sc	%0, %2				\n"
-			"	.set	mips0				\n"
-			: "=&r" (result), "=&r" (temp), "+m" (v->counter)
-			: "Ir" (i));
-		} while (unlikely(!result));
-
-		result = temp - i;
-	} else {
-		unsigned long flags;
-
-		raw_local_irq_save(flags);
-		result = v->counter;
-		result -= i;
-		v->counter = result;
-		raw_local_irq_restore(flags);
-	}
-
-	smp_llsc_mb();
-
-	return result;
-}
+#undef ATOMIC_OP
 
 /*
  * atomic_sub_if_positive - conditionally subtract integer from atomic variable
@@ -407,195 +308,97 @@ static __inline__ int __atomic_add_unles
  */
 #define atomic64_set(v, i)	((v)->counter = (i))
 
-/*
- * atomic64_add - add integer to atomic variable
- * @i: integer value to add
- * @v: pointer of type atomic64_t
- *
- * Atomically adds @i to @v.
- */
-static __inline__ void atomic64_add(long i, atomic64_t * v)
-{
-	if (kernel_uses_llsc && R10000_LLSC_WAR) {
-		long temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:	lld	%0, %1		# atomic64_add		\n"
-		"	daddu	%0, %2					\n"
-		"	scd	%0, %1					\n"
-		"	beqzl	%0, 1b					\n"
-		"	.set	mips0					\n"
-		: "=&r" (temp), "+m" (v->counter)
-		: "Ir" (i));
-	} else if (kernel_uses_llsc) {
-		long temp;
-
-		do {
-			__asm__ __volatile__(
-			"	.set	mips3				\n"
-			"	lld	%0, %1		# atomic64_add	\n"
-			"	daddu	%0, %2				\n"
-			"	scd	%0, %1				\n"
-			"	.set	mips0				\n"
-			: "=&r" (temp), "+m" (v->counter)
-			: "Ir" (i));
-		} while (unlikely(!temp));
-	} else {
-		unsigned long flags;
-
-		raw_local_irq_save(flags);
-		v->counter += i;
-		raw_local_irq_restore(flags);
-	}
-}
-
-/*
- * atomic64_sub - subtract the atomic variable
- * @i: integer value to subtract
- * @v: pointer of type atomic64_t
- *
- * Atomically subtracts @i from @v.
- */
-static __inline__ void atomic64_sub(long i, atomic64_t * v)
-{
-	if (kernel_uses_llsc && R10000_LLSC_WAR) {
-		long temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:	lld	%0, %1		# atomic64_sub		\n"
-		"	dsubu	%0, %2					\n"
-		"	scd	%0, %1					\n"
-		"	beqzl	%0, 1b					\n"
-		"	.set	mips0					\n"
-		: "=&r" (temp), "+m" (v->counter)
-		: "Ir" (i));
-	} else if (kernel_uses_llsc) {
-		long temp;
-
-		do {
-			__asm__ __volatile__(
-			"	.set	mips3				\n"
-			"	lld	%0, %1		# atomic64_sub	\n"
-			"	dsubu	%0, %2				\n"
-			"	scd	%0, %1				\n"
-			"	.set	mips0				\n"
-			: "=&r" (temp), "+m" (v->counter)
-			: "Ir" (i));
-		} while (unlikely(!temp));
-	} else {
-		unsigned long flags;
-
-		raw_local_irq_save(flags);
-		v->counter -= i;
-		raw_local_irq_restore(flags);
-	}
-}
-
-/*
- * Same as above, but return the result value
- */
-static __inline__ long atomic64_add_return(long i, atomic64_t * v)
-{
-	long result;
-
-	smp_mb__before_llsc();
-
-	if (kernel_uses_llsc && R10000_LLSC_WAR) {
-		long temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:	lld	%1, %2		# atomic64_add_return	\n"
-		"	daddu	%0, %1, %3				\n"
-		"	scd	%0, %2					\n"
-		"	beqzl	%0, 1b					\n"
-		"	daddu	%0, %1, %3				\n"
-		"	.set	mips0					\n"
-		: "=&r" (result), "=&r" (temp), "+m" (v->counter)
-		: "Ir" (i));
-	} else if (kernel_uses_llsc) {
-		long temp;
-
-		do {
-			__asm__ __volatile__(
-			"	.set	mips3				\n"
-			"	lld	%1, %2	# atomic64_add_return	\n"
-			"	daddu	%0, %1, %3			\n"
-			"	scd	%0, %2				\n"
-			"	.set	mips0				\n"
-			: "=&r" (result), "=&r" (temp), "=m" (v->counter)
-			: "Ir" (i), "m" (v->counter)
-			: "memory");
-		} while (unlikely(!result));
-
-		result = temp + i;
-	} else {
-		unsigned long flags;
-
-		raw_local_irq_save(flags);
-		result = v->counter;
-		result += i;
-		v->counter = result;
-		raw_local_irq_restore(flags);
-	}
-
-	smp_llsc_mb();
-
-	return result;
+#define ATOMIC64_OP(op, aop, cop)						\
+static __inline__ void atomic64_##op(long i, atomic64_t * v)			\
+{										\
+	if (kernel_uses_llsc && R10000_LLSC_WAR) {				\
+		long temp;							\
+										\
+		__asm__ __volatile__(						\
+		"	.set	mips3					\n"	\
+		"1:	lld	%0, %1		# atomic64_" #op "	\n"	\
+		"	" #aop " %0, %2					\n"	\
+		"	scd	%0, %1					\n"	\
+		"	beqzl	%0, 1b					\n"	\
+		"	.set	mips0					\n"	\
+		: "=&r" (temp), "+m" (v->counter)				\
+		: "Ir" (i));							\
+	} else if (kernel_uses_llsc) {						\
+		long temp;							\
+										\
+		do {								\
+			__asm__ __volatile__(					\
+			"	.set	mips3				\n"	\
+			"	lld	%0, %1		# atomic64_" #op "\n"	\
+			"	" #aop " %0, %2				\n"	\
+			"	scd	%0, %1				\n"	\
+			"	.set	mips0				\n"	\
+			: "=&r" (temp), "+m" (v->counter)			\
+			: "Ir" (i));						\
+		} while (unlikely(!temp));					\
+	} else {								\
+		unsigned long flags;						\
+										\
+		raw_local_irq_save(flags);					\
+		v->counter cop i;						\
+		raw_local_irq_restore(flags);					\
+	}									\
+}										\
+										\
+static __inline__ long atomic64_##op##_return(long i, atomic64_t * v)		\
+{										\
+	long result;								\
+										\
+	smp_mb__before_llsc();							\
+										\
+	if (kernel_uses_llsc && R10000_LLSC_WAR) {				\
+		long temp;							\
+										\
+		__asm__ __volatile__(						\
+		"	.set	mips3					\n"	\
+		"1:	lld	%1, %2		# atomic64_" #op "_return\n"	\
+		"	" #aop " %0, %1, %3				\n"	\
+		"	scd	%0, %2					\n"	\
+		"	beqzl	%0, 1b					\n"	\
+		"	" #aop " %0, %1, %3				\n"	\
+		"	.set	mips0					\n"	\
+		: "=&r" (result), "=&r" (temp), "+m" (v->counter)		\
+		: "Ir" (i));							\
+	} else if (kernel_uses_llsc) {						\
+		long temp;							\
+										\
+		do {								\
+			__asm__ __volatile__(					\
+			"	.set	mips3				\n"	\
+			"	lld	%1, %2	# atomic64_" #op "_return\n"	\
+			"	" #aop " %0, %1, %3			\n"	\
+			"	scd	%0, %2				\n"	\
+			"	.set	mips0				\n"	\
+			: "=&r" (result), "=&r" (temp), "=m" (v->counter)	\
+			: "Ir" (i), "m" (v->counter)				\
+			: "memory");						\
+		} while (unlikely(!result));					\
+										\
+		result = temp; result cop i;					\
+	} else {								\
+		unsigned long flags;						\
+										\
+		raw_local_irq_save(flags);					\
+		result = v->counter;						\
+		result cop i;							\
+		v->counter = result;						\
+		raw_local_irq_restore(flags);					\
+	}									\
+										\
+	smp_llsc_mb();								\
+										\
+	return result;								\
 }
 
-static __inline__ long atomic64_sub_return(long i, atomic64_t * v)
-{
-	long result;
-
-	smp_mb__before_llsc();
-
-	if (kernel_uses_llsc && R10000_LLSC_WAR) {
-		long temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:	lld	%1, %2		# atomic64_sub_return	\n"
-		"	dsubu	%0, %1, %3				\n"
-		"	scd	%0, %2					\n"
-		"	beqzl	%0, 1b					\n"
-		"	dsubu	%0, %1, %3				\n"
-		"	.set	mips0					\n"
-		: "=&r" (result), "=&r" (temp), "=m" (v->counter)
-		: "Ir" (i), "m" (v->counter)
-		: "memory");
-	} else if (kernel_uses_llsc) {
-		long temp;
-
-		do {
-			__asm__ __volatile__(
-			"	.set	mips3				\n"
-			"	lld	%1, %2	# atomic64_sub_return	\n"
-			"	dsubu	%0, %1, %3			\n"
-			"	scd	%0, %2				\n"
-			"	.set	mips0				\n"
-			: "=&r" (result), "=&r" (temp), "=m" (v->counter)
-			: "Ir" (i), "m" (v->counter)
-			: "memory");
-		} while (unlikely(!result));
+ATOMIC64_OP(add, daddu, +=)
+ATOMIC64_OP(sub, dsubu, -=)
 
-		result = temp - i;
-	} else {
-		unsigned long flags;
-
-		raw_local_irq_save(flags);
-		result = v->counter;
-		result -= i;
-		v->counter = result;
-		raw_local_irq_restore(flags);
-	}
-
-	smp_llsc_mb();
-
-	return result;
-}
+#undef ATOMIC64_OP
 
 /*
  * atomic64_sub_if_positive - conditionally subtract integer from atomic variable
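
The smp_mb__before_llsc()/smp_llsc_mb() placement is what makes the
_return variants fully ordered no matter which of the three
implementations gets compiled in; approximately, in C11 terms:

	#include <stdatomic.h>

	static inline int sub_return_sketch(int i, atomic_int *v)
	{
		atomic_thread_fence(memory_order_seq_cst); /* before_llsc */
		int old = atomic_fetch_sub_explicit(v, i,
					memory_order_relaxed);
		atomic_thread_fence(memory_order_seq_cst); /* llsc_mb */
		return old - i;
	}
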
--- a/arch/mn10300/include/asm/atomic.h
+++ b/arch/mn10300/include/asm/atomic.h
@@ -33,7 +33,6 @@
  * @v: pointer of type atomic_t
  *
- * Atomically reads the value of @v.  Note that the guaranteed
- * useful range of an atomic_t is only 24 bits.
+ * Atomically reads the value of @v.
  */
 #define atomic_read(v)	(ACCESS_ONCE((v)->counter))
 
@@ -43,102 +42,57 @@
  * @i: required value
  *
- * Atomically sets the value of @v to @i.  Note that the guaranteed
- * useful range of an atomic_t is only 24 bits.
+ * Atomically sets the value of @v to @i.
  */
 #define atomic_set(v, i) (((v)->counter) = (i))
 
-/**
- * atomic_add_return - add integer to atomic variable
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v and returns the result
- * Note that the guaranteed useful range of an atomic_t is only 24 bits.
- */
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	int retval;
-#ifdef CONFIG_SMP
-	int status;
-
-	asm volatile(
-		"1:	mov	%4,(_AAR,%3)	\n"
-		"	mov	(_ADR,%3),%1	\n"
-		"	add	%5,%1		\n"
-		"	mov	%1,(_ADR,%3)	\n"
-		"	mov	(_ADR,%3),%0	\n"	/* flush */
-		"	mov	(_ASR,%3),%0	\n"
-		"	or	%0,%0		\n"
-		"	bne	1b		\n"
-		: "=&r"(status), "=&r"(retval), "=m"(v->counter)
-		: "a"(ATOMIC_OPS_BASE_ADDR), "r"(&v->counter), "r"(i)
-		: "memory", "cc");
-
-#else
-	unsigned long flags;
-
-	flags = arch_local_cli_save();
-	retval = v->counter;
-	retval += i;
-	v->counter = retval;
-	arch_local_irq_restore(flags);
-#endif
+#define ATOMIC_OP(op)							\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	int retval, status;						\
+									\
+	asm volatile(							\
+		"1:	mov	%4,(_AAR,%3)	\n"			\
+		"	mov	(_ADR,%3),%1	\n"			\
+		"	" #op "	%5,%1		\n"			\
+		"	mov	%1,(_ADR,%3)	\n"			\
+		"	mov	(_ADR,%3),%0	\n"	/* flush */	\
+		"	mov	(_ASR,%3),%0	\n"			\
+		"	or	%0,%0		\n"			\
+		"	bne	1b		\n"			\
+		: "=&r"(status), "=&r"(retval), "=m"(v->counter)	\
+		: "a"(ATOMIC_OPS_BASE_ADDR), "r"(&v->counter), "r"(i)	\
+		: "memory", "cc");					\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	int retval, status;						\
+									\
+	asm volatile(							\
+		"1:	mov	%4,(_AAR,%3)	\n"			\
+		"	mov	(_ADR,%3),%1	\n"			\
+		"	" #op "	%5,%1		\n"			\
+		"	mov	%1,(_ADR,%3)	\n"			\
+		"	mov	(_ADR,%3),%0	\n"	/* flush */	\
+		"	mov	(_ASR,%3),%0	\n"			\
+		"	or	%0,%0		\n"			\
+		"	bne	1b		\n"			\
+		: "=&r"(status), "=&r"(retval), "=m"(v->counter)	\
+		: "a"(ATOMIC_OPS_BASE_ADDR), "r"(&v->counter), "r"(i)	\
+		: "memory", "cc");					\
-	return retval;
-}
+	return retval;							\
+}
 
-/**
- * atomic_sub_return - subtract integer from atomic variable
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
- *
- * Atomically subtracts @i from @v and returns the result
- * Note that the guaranteed useful range of an atomic_t is only 24 bits.
- */
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	int retval;
-#ifdef CONFIG_SMP
-	int status;
-
-	asm volatile(
-		"1:	mov	%4,(_AAR,%3)	\n"
-		"	mov	(_ADR,%3),%1	\n"
-		"	sub	%5,%1		\n"
-		"	mov	%1,(_ADR,%3)	\n"
-		"	mov	(_ADR,%3),%0	\n"	/* flush */
-		"	mov	(_ASR,%3),%0	\n"
-		"	or	%0,%0		\n"
-		"	bne	1b		\n"
-		: "=&r"(status), "=&r"(retval), "=m"(v->counter)
-		: "a"(ATOMIC_OPS_BASE_ADDR), "r"(&v->counter), "r"(i)
-		: "memory", "cc");
-
-#else
-	unsigned long flags;
-	flags = arch_local_cli_save();
-	retval = v->counter;
-	retval -= i;
-	v->counter = retval;
-	arch_local_irq_restore(flags);
-#endif
-	return retval;
-}
+ATOMIC_OP(add)
+ATOMIC_OP(sub)
+
+#undef ATOMIC_OP
 
 static inline int atomic_add_negative(int i, atomic_t *v)
 {
 	return atomic_add_return(i, v) < 0;
 }
 
-static inline void atomic_add(int i, atomic_t *v)
-{
-	atomic_add_return(i, v);
-}
-
-static inline void atomic_sub(int i, atomic_t *v)
-{
-	atomic_sub_return(i, v);
-}
-
 static inline void atomic_inc(atomic_t *v)
 {
 	atomic_add_return(1, v);
--- a/arch/parisc/include/asm/atomic.h
+++ b/arch/parisc/include/asm/atomic.h
@@ -55,24 +55,7 @@ extern arch_spinlock_t __atomic_hash[ATO
  * are atomic, so a reader never sees inconsistent values.
  */
 
-/* It's possible to reduce all atomic operations to either
- * __atomic_add_return, atomic_set and atomic_read (the latter
- * is there only for consistency).
- */
-
-static __inline__ int __atomic_add_return(int i, atomic_t *v)
-{
-	int ret;
-	unsigned long flags;
-	_atomic_spin_lock_irqsave(v, flags);
-
-	ret = (v->counter += i);
-
-	_atomic_spin_unlock_irqrestore(v, flags);
-	return ret;
-}
-
-static __inline__ void atomic_set(atomic_t *v, int i) 
+static __inline__ void atomic_set(atomic_t *v, int i)
 {
 	unsigned long flags;
 	_atomic_spin_lock_irqsave(v, flags);
@@ -115,16 +98,38 @@ static __inline__ int __atomic_add_unles
 	return c;
 }
 
+#define ATOMIC_OP(op, cop)						\
+static __inline__ void atomic_##op(int i, atomic_t *v)			\
+{									\
+	unsigned long flags;						\
+									\
+	_atomic_spin_lock_irqsave(v, flags);				\
+	v->counter cop i;						\
+	_atomic_spin_unlock_irqrestore(v, flags);			\
+}									\
+									\
+static __inline__ int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	unsigned long flags;						\
+	int ret;							\
+									\
+	_atomic_spin_lock_irqsave(v, flags);				\
+	ret = (v->counter cop i);					\
+	_atomic_spin_unlock_irqrestore(v, flags);			\
+									\
+	return ret;							\
+}
+
+ATOMIC_OP(add, +=)
+ATOMIC_OP(sub, -=)
 
-#define atomic_add(i,v)	((void)(__atomic_add_return(        (i),(v))))
-#define atomic_sub(i,v)	((void)(__atomic_add_return(-((int) (i)),(v))))
-#define atomic_inc(v)	((void)(__atomic_add_return(   1,(v))))
-#define atomic_dec(v)	((void)(__atomic_add_return(  -1,(v))))
-
-#define atomic_add_return(i,v)	(__atomic_add_return( (i),(v)))
-#define atomic_sub_return(i,v)	(__atomic_add_return(-(i),(v)))
-#define atomic_inc_return(v)	(__atomic_add_return(   1,(v)))
-#define atomic_dec_return(v)	(__atomic_add_return(  -1,(v)))
+#undef ATOMIC_OP
+
+#define atomic_inc(v)	(atomic_add(   1,(v)))
+#define atomic_dec(v)	(atomic_add(  -1,(v)))
+
+#define atomic_inc_return(v)	(atomic_add_return(   1,(v)))
+#define atomic_dec_return(v)	(atomic_add_return(  -1,(v)))
 
 #define atomic_add_negative(a, v)	(atomic_add_return((a), (v)) < 0)
 
@@ -148,18 +153,32 @@ static __inline__ int __atomic_add_unles
 
 #define ATOMIC64_INIT(i) { (i) }
 
-static __inline__ s64
-__atomic64_add_return(s64 i, atomic64_t *v)
-{
-	s64 ret;
-	unsigned long flags;
-	_atomic_spin_lock_irqsave(v, flags);
+#define ATOMIC64_OP(op, cop)						\
+static __inline__ void atomic64_##op(s64 i, atomic64_t *v)		\
+{									\
+	unsigned long flags;						\
+									\
+	_atomic_spin_lock_irqsave(v, flags);				\
+	v->counter cop i;						\
+	_atomic_spin_unlock_irqrestore(v, flags);			\
+}									\
+									\
+static __inline__ s64 atomic64_##op##_return(s64 i, atomic64_t *v)	\
+{									\
+	unsigned long flags;						\
+	s64 ret;							\
+									\
+	_atomic_spin_lock_irqsave(v, flags);				\
+	ret = (v->counter cop i);					\
+	_atomic_spin_unlock_irqrestore(v, flags);			\
+									\
+	return ret;							\
+}
 
-	ret = (v->counter += i);
+ATOMIC64_OP(add, +=)
+ATOMIC64_OP(sub, -=)
 
-	_atomic_spin_unlock_irqrestore(v, flags);
-	return ret;
-}
+#undef ATOMIC64_OP
 
 static __inline__ void
 atomic64_set(atomic64_t *v, s64 i)
@@ -178,15 +197,11 @@ atomic64_read(const atomic64_t *v)
 	return (*(volatile long *)&(v)->counter);
 }
 
-#define atomic64_add(i,v)	((void)(__atomic64_add_return( ((s64)(i)),(v))))
-#define atomic64_sub(i,v)	((void)(__atomic64_add_return(-((s64)(i)),(v))))
-#define atomic64_inc(v)		((void)(__atomic64_add_return(   1,(v))))
-#define atomic64_dec(v)		((void)(__atomic64_add_return(  -1,(v))))
-
-#define atomic64_add_return(i,v)	(__atomic64_add_return( ((s64)(i)),(v)))
-#define atomic64_sub_return(i,v)	(__atomic64_add_return(-((s64)(i)),(v)))
-#define atomic64_inc_return(v)		(__atomic64_add_return(   1,(v)))
-#define atomic64_dec_return(v)		(__atomic64_add_return(  -1,(v)))
+#define atomic64_inc(v)		(atomic64_add(   1,(v)))
+#define atomic64_dec(v)		(atomic64_add(  -1,(v)))
+
+#define atomic64_inc_return(v)		(atomic64_add_return(   1,(v)))
+#define atomic64_dec_return(v)		(atomic64_add_return(  -1,(v)))
 
 #define atomic64_add_negative(a, v)	(atomic64_add_return((a), (v)) < 0)
 
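_atomic_spin_lock_irqsave() picks its lock by hashing the atomic's
address, so unrelated counters rarely contend; a user-space sketch of
the scheme (hash shift and table size made up for illustration):

	#include <pthread.h>
	#include <stdint.h>

	#define ATOMIC_HASH_SIZE 16

	/* pthread_mutex_init() each slot once at startup */
	static pthread_mutex_t atomic_hash[ATOMIC_HASH_SIZE];

	static pthread_mutex_t *lock_for(const void *addr)
	{
		/* drop low bits (same line), then index the table */
		return &atomic_hash[((uintptr_t)addr >> 4) %
					ATOMIC_HASH_SIZE];
	}

	typedef struct { int counter; } atomic_t;

	static inline int add_return_sketch(int i, atomic_t *v)
	{
		pthread_mutex_t *l = lock_for(v);
		int ret;

		pthread_mutex_lock(l);
		ret = (v->counter += i);
		pthread_mutex_unlock(l);
		return ret;
	}
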
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -26,76 +26,48 @@ static __inline__ void atomic_set(atomic
 	__asm__ __volatile__("stw%U0%X0 %1,%0" : "=m"(v->counter) : "r"(i));
 }
 
-static __inline__ void atomic_add(int a, atomic_t *v)
-{
-	int t;
-
-	__asm__ __volatile__(
-"1:	lwarx	%0,0,%3		# atomic_add\n\
-	add	%0,%2,%0\n"
-	PPC405_ERR77(0,%3)
-"	stwcx.	%0,0,%3 \n\
-	bne-	1b"
-	: "=&r" (t), "+m" (v->counter)
-	: "r" (a), "r" (&v->counter)
-	: "cc");
+#define ATOMIC_OP(op, aop)						\
+static __inline__ void atomic_##op(int a, atomic_t *v)			\
+{									\
+	int t;								\
+									\
+	__asm__ __volatile__(						\
+"1:	lwarx	%0,0,%3		# atomic_" #op "\n"			\
+	#aop " %0,%2,%0\n"						\
+	PPC405_ERR77(0,%3)						\
+"	stwcx.	%0,0,%3 \n"						\
+"	bne-	1b\n"							\
+	: "=&r" (t), "+m" (v->counter)					\
+	: "r" (a), "r" (&v->counter)					\
+	: "cc");							\
+}									\
+									\
+static __inline__ int atomic_##op##_return(int a, atomic_t *v)		\
+{									\
+	int t;								\
+									\
+	__asm__ __volatile__(						\
+	PPC_ATOMIC_ENTRY_BARRIER					\
+"1:	lwarx	%0,0,%2		# atomic_" #op "_return\n"		\
+	#aop " %0,%1,%0\n"						\
+	PPC405_ERR77(0,%2)						\
+"	stwcx.	%0,0,%2 \n"						\
+"	bne-	1b\n"							\
+	PPC_ATOMIC_EXIT_BARRIER						\
+	: "=&r" (t)							\
+	: "r" (a), "r" (&v->counter)					\
+	: "cc", "memory");						\
+									\
+	return t;							\
 }
 
-static __inline__ int atomic_add_return(int a, atomic_t *v)
-{
-	int t;
-
-	__asm__ __volatile__(
-	PPC_ATOMIC_ENTRY_BARRIER
-"1:	lwarx	%0,0,%2		# atomic_add_return\n\
-	add	%0,%1,%0\n"
-	PPC405_ERR77(0,%2)
-"	stwcx.	%0,0,%2 \n\
-	bne-	1b"
-	PPC_ATOMIC_EXIT_BARRIER
-	: "=&r" (t)
-	: "r" (a), "r" (&v->counter)
-	: "cc", "memory");
+ATOMIC_OP(add, add)
+ATOMIC_OP(sub, subf)
 
-	return t;
-}
+#undef ATOMIC_OP
 
 #define atomic_add_negative(a, v)	(atomic_add_return((a), (v)) < 0)
 
-static __inline__ void atomic_sub(int a, atomic_t *v)
-{
-	int t;
-
-	__asm__ __volatile__(
-"1:	lwarx	%0,0,%3		# atomic_sub\n\
-	subf	%0,%2,%0\n"
-	PPC405_ERR77(0,%3)
-"	stwcx.	%0,0,%3 \n\
-	bne-	1b"
-	: "=&r" (t), "+m" (v->counter)
-	: "r" (a), "r" (&v->counter)
-	: "cc");
-}
-
-static __inline__ int atomic_sub_return(int a, atomic_t *v)
-{
-	int t;
-
-	__asm__ __volatile__(
-	PPC_ATOMIC_ENTRY_BARRIER
-"1:	lwarx	%0,0,%2		# atomic_sub_return\n\
-	subf	%0,%1,%0\n"
-	PPC405_ERR77(0,%2)
-"	stwcx.	%0,0,%2 \n\
-	bne-	1b"
-	PPC_ATOMIC_EXIT_BARRIER
-	: "=&r" (t)
-	: "r" (a), "r" (&v->counter)
-	: "cc", "memory");
-
-	return t;
-}
-
 static __inline__ void atomic_inc(atomic_t *v)
 {
 	int t;
@@ -289,72 +261,46 @@ static __inline__ void atomic64_set(atom
 	__asm__ __volatile__("std%U0%X0 %1,%0" : "=m"(v->counter) : "r"(i));
 }
 
-static __inline__ void atomic64_add(long a, atomic64_t *v)
-{
-	long t;
-
-	__asm__ __volatile__(
-"1:	ldarx	%0,0,%3		# atomic64_add\n\
-	add	%0,%2,%0\n\
-	stdcx.	%0,0,%3 \n\
-	bne-	1b"
-	: "=&r" (t), "+m" (v->counter)
-	: "r" (a), "r" (&v->counter)
-	: "cc");
+#define ATOMIC64_OP(op, aop)						\
+static __inline__ void atomic64_##op(long a, atomic64_t *v)		\
+{									\
+	long t;								\
+									\
+	__asm__ __volatile__(						\
+"1:	ldarx	%0,0,%3		# atomic64_" #op "\n"			\
+	#aop " %0,%2,%0\n"						\
+"	stdcx.	%0,0,%3 \n"						\
+"	bne-	1b\n"							\
+	: "=&r" (t), "+m" (v->counter)					\
+	: "r" (a), "r" (&v->counter)					\
+	: "cc");							\
+}									\
+									\
+static __inline__ long atomic64_##op##_return(long a, atomic64_t *v)	\
+{									\
+	long t;								\
+									\
+	__asm__ __volatile__(						\
+	PPC_ATOMIC_ENTRY_BARRIER					\
+"1:	ldarx	%0,0,%2		# atomic64_" #op "_return\n"		\
+	#aop " %0,%1,%0\n"						\
+"	stdcx.	%0,0,%2 \n"						\
+"	bne-	1b\n"							\
+	PPC_ATOMIC_EXIT_BARRIER						\
+	: "=&r" (t)							\
+	: "r" (a), "r" (&v->counter)					\
+	: "cc", "memory");						\
+									\
+	return t;							\
 }
 
-static __inline__ long atomic64_add_return(long a, atomic64_t *v)
-{
-	long t;
-
-	__asm__ __volatile__(
-	PPC_ATOMIC_ENTRY_BARRIER
-"1:	ldarx	%0,0,%2		# atomic64_add_return\n\
-	add	%0,%1,%0\n\
-	stdcx.	%0,0,%2 \n\
-	bne-	1b"
-	PPC_ATOMIC_EXIT_BARRIER
-	: "=&r" (t)
-	: "r" (a), "r" (&v->counter)
-	: "cc", "memory");
+ATOMIC64_OP(add, add)
+ATOMIC64_OP(sub, subf)
 
-	return t;
-}
+#undef ATOMIC64_OP
 
 #define atomic64_add_negative(a, v)	(atomic64_add_return((a), (v)) < 0)
 
-static __inline__ void atomic64_sub(long a, atomic64_t *v)
-{
-	long t;
-
-	__asm__ __volatile__(
-"1:	ldarx	%0,0,%3		# atomic64_sub\n\
-	subf	%0,%2,%0\n\
-	stdcx.	%0,0,%3 \n\
-	bne-	1b"
-	: "=&r" (t), "+m" (v->counter)
-	: "r" (a), "r" (&v->counter)
-	: "cc");
-}
-
-static __inline__ long atomic64_sub_return(long a, atomic64_t *v)
-{
-	long t;
-
-	__asm__ __volatile__(
-	PPC_ATOMIC_ENTRY_BARRIER
-"1:	ldarx	%0,0,%2		# atomic64_sub_return\n\
-	subf	%0,%1,%0\n\
-	stdcx.	%0,0,%2 \n\
-	bne-	1b"
-	PPC_ATOMIC_EXIT_BARRIER
-	: "=&r" (t)
-	: "r" (a), "r" (&v->counter)
-	: "cc", "memory");
-
-	return t;
-}
-
 static __inline__ void atomic64_inc(atomic64_t *v)
 {
 	long t;
--- a/arch/sh/include/asm/atomic-grb.h
+++ b/arch/sh/include/asm/atomic-grb.h
@@ -1,85 +1,51 @@
 #ifndef __ASM_SH_ATOMIC_GRB_H
 #define __ASM_SH_ATOMIC_GRB_H
 
-static inline void atomic_add(int i, atomic_t *v)
-{
-	int tmp;
-
-	__asm__ __volatile__ (
-		"   .align 2              \n\t"
-		"   mova    1f,   r0      \n\t" /* r0 = end point */
-		"   mov    r15,   r1      \n\t" /* r1 = saved sp */
-		"   mov    #-6,   r15     \n\t" /* LOGIN: r15 = size */
-		"   mov.l  @%1,   %0      \n\t" /* load  old value */
-		"   add     %2,   %0      \n\t" /* add */
-		"   mov.l   %0,   @%1     \n\t" /* store new value */
-		"1: mov     r1,   r15     \n\t" /* LOGOUT */
-		: "=&r" (tmp),
-		  "+r"  (v)
-		: "r"   (i)
-		: "memory" , "r0", "r1");
-}
-
-static inline void atomic_sub(int i, atomic_t *v)
-{
-	int tmp;
-
-	__asm__ __volatile__ (
-		"   .align 2              \n\t"
-		"   mova    1f,   r0      \n\t" /* r0 = end point */
-		"   mov     r15,  r1      \n\t" /* r1 = saved sp */
-		"   mov    #-6,   r15     \n\t" /* LOGIN: r15 = size */
-		"   mov.l  @%1,   %0      \n\t" /* load  old value */
-		"   sub     %2,   %0      \n\t" /* sub */
-		"   mov.l   %0,   @%1     \n\t" /* store new value */
-		"1: mov     r1,   r15     \n\t" /* LOGOUT */
-		: "=&r" (tmp),
-		  "+r"  (v)
-		: "r"   (i)
-		: "memory" , "r0", "r1");
-}
-
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	int tmp;
-
-	__asm__ __volatile__ (
-		"   .align 2              \n\t"
-		"   mova    1f,   r0      \n\t" /* r0 = end point */
-		"   mov    r15,   r1      \n\t" /* r1 = saved sp */
-		"   mov    #-6,   r15     \n\t" /* LOGIN: r15 = size */
-		"   mov.l  @%1,   %0      \n\t" /* load  old value */
-		"   add     %2,   %0      \n\t" /* add */
-		"   mov.l   %0,   @%1     \n\t" /* store new value */
-		"1: mov     r1,   r15     \n\t" /* LOGOUT */
-		: "=&r" (tmp),
-		  "+r"  (v)
-		: "r"   (i)
-		: "memory" , "r0", "r1");
-
-	return tmp;
+#define ATOMIC_OP(op)							\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	int tmp;							\
+									\
+	__asm__ __volatile__ (						\
+		"   .align 2              \n\t"				\
+		"   mova    1f,   r0      \n\t" /* r0 = end point */	\
+		"   mov    r15,   r1      \n\t" /* r1 = saved sp */	\
+		"   mov    #-6,   r15     \n\t" /* LOGIN: r15 = size */	\
+		"   mov.l  @%1,   %0      \n\t" /* load  old value */	\
+		" " #op "   %2,   %0      \n\t" /* $op */		\
+		"   mov.l   %0,   @%1     \n\t" /* store new value */	\
+		"1: mov     r1,   r15     \n\t" /* LOGOUT */		\
+		: "=&r" (tmp),						\
+		  "+r"  (v)						\
+		: "r"   (i)						\
+		: "memory" , "r0", "r1");				\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	int tmp;							\
+									\
+	__asm__ __volatile__ (						\
+		"   .align 2              \n\t"				\
+		"   mova    1f,   r0      \n\t" /* r0 = end point */	\
+		"   mov    r15,   r1      \n\t" /* r1 = saved sp */	\
+		"   mov    #-6,   r15     \n\t" /* LOGIN: r15 = size */	\
+		"   mov.l  @%1,   %0      \n\t" /* load  old value */	\
+		" " #op "   %2,   %0      \n\t" /* $op */		\
+		"   mov.l   %0,   @%1     \n\t" /* store new value */	\
+		"1: mov     r1,   r15     \n\t" /* LOGOUT */		\
+		: "=&r" (tmp),						\
+		  "+r"  (v)						\
+		: "r"   (i)						\
+		: "memory" , "r0", "r1");				\
+									\
+	return tmp;							\
 }
 
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	int tmp;
-
-	__asm__ __volatile__ (
-		"   .align 2              \n\t"
-		"   mova    1f,   r0      \n\t" /* r0 = end point */
-		"   mov    r15,   r1      \n\t" /* r1 = saved sp */
-		"   mov    #-6,   r15     \n\t" /* LOGIN: r15 = size */
-		"   mov.l  @%1,   %0      \n\t" /* load  old value */
-		"   sub     %2,   %0      \n\t" /* sub */
-		"   mov.l   %0,   @%1     \n\t" /* store new value */
-		"1: mov     r1,   r15     \n\t" /* LOGOUT */
-		: "=&r" (tmp),
-		  "+r"  (v)
-		: "r"   (i)
-		: "memory", "r0", "r1");
+ATOMIC_OP(add)
+ATOMIC_OP(sub)
 
-	return tmp;
-}
+#undef ATOMIC_OP
 
 static inline void atomic_clear_mask(unsigned int mask, atomic_t *v)
 {
--- a/arch/sh/include/asm/atomic-irq.h
+++ b/arch/sh/include/asm/atomic-irq.h
@@ -8,49 +8,34 @@
  * forward to code at the end of this object's .text section, then
  * branch back to restart the operation.
  */
-static inline void atomic_add(int i, atomic_t *v)
-{
-	unsigned long flags;
 
-	raw_local_irq_save(flags);
-	v->counter += i;
-	raw_local_irq_restore(flags);
+#define ATOMIC_OP(op, cop)						\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	unsigned long flags;						\
+									\
+	raw_local_irq_save(flags);					\
+	v->counter cop i;						\
+	raw_local_irq_restore(flags);					\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	unsigned long temp, flags;					\
+									\
+	raw_local_irq_save(flags);					\
+	temp = v->counter;						\
+	temp cop i;							\
+	v->counter = temp;						\
+	raw_local_irq_restore(flags);					\
+									\
+	return temp;							\
 }
 
-static inline void atomic_sub(int i, atomic_t *v)
-{
-	unsigned long flags;
-
-	raw_local_irq_save(flags);
-	v->counter -= i;
-	raw_local_irq_restore(flags);
-}
-
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	unsigned long temp, flags;
-
-	raw_local_irq_save(flags);
-	temp = v->counter;
-	temp += i;
-	v->counter = temp;
-	raw_local_irq_restore(flags);
+ATOMIC_OP(add, +=)
+ATOMIC_OP(sub, -=)
 
-	return temp;
-}
-
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	unsigned long temp, flags;
-
-	raw_local_irq_save(flags);
-	temp = v->counter;
-	temp -= i;
-	v->counter = temp;
-	raw_local_irq_restore(flags);
-
-	return temp;
-}
+#undef ATOMIC_OP
 
 static inline void atomic_clear_mask(unsigned int mask, atomic_t *v)
 {
--- a/arch/sh/include/asm/atomic-llsc.h
+++ b/arch/sh/include/asm/atomic-llsc.h
@@ -2,39 +2,6 @@
 #define __ASM_SH_ATOMIC_LLSC_H
 
 /*
- * To get proper branch prediction for the main line, we must branch
- * forward to code at the end of this object's .text section, then
- * branch back to restart the operation.
- */
-static inline void atomic_add(int i, atomic_t *v)
-{
-	unsigned long tmp;
-
-	__asm__ __volatile__ (
-"1:	movli.l @%2, %0		! atomic_add	\n"
-"	add	%1, %0				\n"
-"	movco.l	%0, @%2				\n"
-"	bf	1b				\n"
-	: "=&z" (tmp)
-	: "r" (i), "r" (&v->counter)
-	: "t");
-}
-
-static inline void atomic_sub(int i, atomic_t *v)
-{
-	unsigned long tmp;
-
-	__asm__ __volatile__ (
-"1:	movli.l @%2, %0		! atomic_sub	\n"
-"	sub	%1, %0				\n"
-"	movco.l	%0, @%2				\n"
-"	bf	1b				\n"
-	: "=&z" (tmp)
-	: "r" (i), "r" (&v->counter)
-	: "t");
-}
-
-/*
  * SH-4A note:
  *
  * We basically get atomic_xxx_return() for free compared with
@@ -42,39 +9,48 @@ static inline void atomic_sub(int i, ato
  * encoding, so the retval is automatically set without having to
  * do any special work.
  */
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	unsigned long temp;
-
-	__asm__ __volatile__ (
-"1:	movli.l @%2, %0		! atomic_add_return	\n"
-"	add	%1, %0					\n"
-"	movco.l	%0, @%2					\n"
-"	bf	1b					\n"
-"	synco						\n"
-	: "=&z" (temp)
-	: "r" (i), "r" (&v->counter)
-	: "t");
+/*
+ * To get proper branch prediction for the main line, we must branch
+ * forward to code at the end of this object's .text section, then
+ * branch back to restart the operation.
+ */
 
-	return temp;
+#define ATOMIC_OP(op)							\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	unsigned long tmp;						\
+									\
+	__asm__ __volatile__ (						\
+"1:	movli.l @%2, %0		! atomic_" #op "\n"			\
+"	" #op "	%1, %0				\n"			\
+"	movco.l	%0, @%2				\n"			\
+"	bf	1b				\n"			\
+	: "=&z" (tmp)							\
+	: "r" (i), "r" (&v->counter)					\
+	: "t");								\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	unsigned long temp;						\
+									\
+	__asm__ __volatile__ (						\
+"1:	movli.l @%2, %0		! atomic_" #op "_return	\n"		\
+"	" #op "	%1, %0					\n"		\
+"	movco.l	%0, @%2					\n"		\
+"	bf	1b					\n"		\
+"	synco						\n"		\
+	: "=&z" (temp)							\
+	: "r" (i), "r" (&v->counter)					\
+	: "t");								\
+									\
+	return temp;							\
 }
 
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	unsigned long temp;
+ATOMIC_OP(add)
+ATOMIC_OP(sub)
 
-	__asm__ __volatile__ (
-"1:	movli.l @%2, %0		! atomic_sub_return	\n"
-"	sub	%1, %0					\n"
-"	movco.l	%0, @%2					\n"
-"	bf	1b					\n"
-"	synco						\n"
-	: "=&z" (temp)
-	: "r" (i), "r" (&v->counter)
-	: "t");
-
-	return temp;
-}
+#undef ATOMIC_OP
 
 static inline void atomic_clear_mask(unsigned int mask, atomic_t *v)
 {
--- a/arch/sparc/include/asm/atomic_32.h
+++ b/arch/sparc/include/asm/atomic_32.h
@@ -20,7 +20,7 @@
 
 #define ATOMIC_INIT(i)  { (i) }
 
-extern int __atomic_add_return(int, atomic_t *);
+extern int atomic_add_return(int, atomic_t *);
 extern int atomic_cmpxchg(atomic_t *, int, int);
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 extern int __atomic_add_unless(atomic_t *, int, int);
@@ -28,15 +28,14 @@ extern void atomic_set(atomic_t *, int);
 
 #define atomic_read(v)          (*(volatile int *)&(v)->counter)
 
-#define atomic_add(i, v)	((void)__atomic_add_return( (int)(i), (v)))
-#define atomic_sub(i, v)	((void)__atomic_add_return(-(int)(i), (v)))
-#define atomic_inc(v)		((void)__atomic_add_return(        1, (v)))
-#define atomic_dec(v)		((void)__atomic_add_return(       -1, (v)))
-
-#define atomic_add_return(i, v)	(__atomic_add_return( (int)(i), (v)))
-#define atomic_sub_return(i, v)	(__atomic_add_return(-(int)(i), (v)))
-#define atomic_inc_return(v)	(__atomic_add_return(        1, (v)))
-#define atomic_dec_return(v)	(__atomic_add_return(       -1, (v)))
+#define atomic_add(i, v)	((void)atomic_add_return( (int)(i), (v)))
+#define atomic_sub(i, v)	((void)atomic_add_return(-(int)(i), (v)))
+#define atomic_inc(v)		((void)atomic_add_return(        1, (v)))
+#define atomic_dec(v)		((void)atomic_add_return(       -1, (v)))
+
+#define atomic_sub_return(i, v)	(atomic_add_return(-(int)(i), (v)))
+#define atomic_inc_return(v)	(atomic_add_return(        1, (v)))
+#define atomic_dec_return(v)	(atomic_add_return(       -1, (v)))
 
 #define atomic_add_negative(a, v)	(atomic_add_return((a), (v)) < 0)
 
--- a/arch/sparc/include/asm/atomic_64.h
+++ b/arch/sparc/include/asm/atomic_64.h
@@ -20,27 +20,22 @@
 #define atomic_set(v, i)	(((v)->counter) = i)
 #define atomic64_set(v, i)	(((v)->counter) = i)
 
-extern void atomic_add(int, atomic_t *);
-extern void atomic64_add(long, atomic64_t *);
-extern void atomic_sub(int, atomic_t *);
-extern void atomic64_sub(long, atomic64_t *);
-
-extern int atomic_add_ret(int, atomic_t *);
-extern long atomic64_add_ret(long, atomic64_t *);
-extern int atomic_sub_ret(int, atomic_t *);
-extern long atomic64_sub_ret(long, atomic64_t *);
-
-#define atomic_dec_return(v) atomic_sub_ret(1, v)
-#define atomic64_dec_return(v) atomic64_sub_ret(1, v)
+#define ATOMIC_OP(op)							\
+extern void atomic_##op(int, atomic_t *);				\
+extern void atomic64_##op(long, atomic64_t *);				\
+extern int atomic_##op##_return(int, atomic_t *);			\
+extern long atomic64_##op##_return(long, atomic64_t *);
+
+ATOMIC_OP(add)
+ATOMIC_OP(sub)
 
-#define atomic_inc_return(v) atomic_add_ret(1, v)
-#define atomic64_inc_return(v) atomic64_add_ret(1, v)
+#undef ATOMIC_OP
 
-#define atomic_sub_return(i, v) atomic_sub_ret(i, v)
-#define atomic64_sub_return(i, v) atomic64_sub_ret(i, v)
+#define atomic_dec_return(v)   atomic_sub_return(1, v)
+#define atomic64_dec_return(v) atomic64_sub_return(1, v)
 
-#define atomic_add_return(i, v) atomic_add_ret(i, v)
-#define atomic64_add_return(i, v) atomic64_add_ret(i, v)
+#define atomic_inc_return(v)   atomic_add_return(1, v)
+#define atomic64_inc_return(v) atomic64_add_return(1, v)
 
 /*
  * atomic_inc_and_test - increment and test
@@ -53,11 +48,11 @@ extern long atomic64_sub_ret(long, atomi
 #define atomic_inc_and_test(v) (atomic_inc_return(v) == 0)
 #define atomic64_inc_and_test(v) (atomic64_inc_return(v) == 0)
 
-#define atomic_sub_and_test(i, v) (atomic_sub_ret(i, v) == 0)
-#define atomic64_sub_and_test(i, v) (atomic64_sub_ret(i, v) == 0)
+#define atomic_sub_and_test(i, v) (atomic_sub_return(i, v) == 0)
+#define atomic64_sub_and_test(i, v) (atomic64_sub_return(i, v) == 0)
 
-#define atomic_dec_and_test(v) (atomic_sub_ret(1, v) == 0)
-#define atomic64_dec_and_test(v) (atomic64_sub_ret(1, v) == 0)
+#define atomic_dec_and_test(v) (atomic_sub_return(1, v) == 0)
+#define atomic64_dec_and_test(v) (atomic64_sub_return(1, v) == 0)
 
 #define atomic_inc(v) atomic_add(1, v)
 #define atomic64_inc(v) atomic64_add(1, v)
@@ -65,8 +60,8 @@ extern long atomic64_sub_ret(long, atomi
 #define atomic_dec(v) atomic_sub(1, v)
 #define atomic64_dec(v) atomic64_sub(1, v)
 
-#define atomic_add_negative(i, v) (atomic_add_ret(i, v) < 0)
-#define atomic64_add_negative(i, v) (atomic64_add_ret(i, v) < 0)
+#define atomic_add_negative(i, v) (atomic_add_return(i, v) < 0)
+#define atomic64_add_negative(i, v) (atomic64_add_return(i, v) < 0)
 
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
--- a/arch/sparc/include/asm/barrier_32.h
+++ b/arch/sparc/include/asm/barrier_32.h
@@ -1,7 +1,6 @@
 #ifndef __SPARC_BARRIER_H
 #define __SPARC_BARRIER_H
 
-#include <asm/processor.h> /* for nop() */
 #include <asm-generic/barrier.h>
 
 #endif /* !(__SPARC_BARRIER_H) */
--- a/arch/sparc/include/asm/processor.h
+++ b/arch/sparc/include/asm/processor.h
@@ -6,6 +6,4 @@
 #include <asm/processor_32.h>
 #endif
 
-#define nop() 		__asm__ __volatile__ ("nop")
-
 #endif
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1150,7 +1150,7 @@ static unsigned long penguins_are_doing_
 
 void smp_capture(void)
 {
-	int result = atomic_add_ret(1, &smp_capture_depth);
+	int result = atomic_add_return(1, &smp_capture_depth);
 
 	if (result == 1) {
 		int ncpus = num_online_cpus();
--- a/arch/sparc/lib/atomic32.c
+++ b/arch/sparc/lib/atomic32.c
@@ -27,18 +27,23 @@ static DEFINE_SPINLOCK(dummy);
 
 #endif /* SMP */
 
-int __atomic_add_return(int i, atomic_t *v)
-{
-	int ret;
-	unsigned long flags;
-	spin_lock_irqsave(ATOMIC_HASH(v), flags);
+#define ATOMIC_OP(op, cop)						\
+int atomic_##op##_return(int i, atomic_t *v)				\
+{									\
+	int ret;							\
+	unsigned long flags;						\
+	spin_lock_irqsave(ATOMIC_HASH(v), flags);			\
+									\
+	ret = (v->counter cop i);					\
+									\
+	spin_unlock_irqrestore(ATOMIC_HASH(v), flags);			\
+	return ret;							\
+}									\
+EXPORT_SYMBOL(atomic_##op##_return);
 
-	ret = (v->counter += i);
+ATOMIC_OP(add, +=)
 
-	spin_unlock_irqrestore(ATOMIC_HASH(v), flags);
-	return ret;
-}
-EXPORT_SYMBOL(__atomic_add_return);
+#undef ATOMIC_OP
 
 int atomic_cmpxchg(atomic_t *v, int old, int new)
 {
--- a/arch/sparc/lib/atomic_64.S
+++ b/arch/sparc/lib/atomic_64.S
@@ -14,109 +14,66 @@
 	 * memory barriers, and a second which returns
 	 * a value and does the barriers.
 	 */
-ENTRY(atomic_add) /* %o0 = increment, %o1 = atomic_ptr */
-	BACKOFF_SETUP(%o2)
-1:	lduw	[%o1], %g1
-	add	%g1, %o0, %g7
-	cas	[%o1], %g1, %g7
-	cmp	%g1, %g7
-	bne,pn	%icc, BACKOFF_LABEL(2f, 1b)
-	 nop
-	retl
-	 nop
-2:	BACKOFF_SPIN(%o2, %o3, 1b)
-ENDPROC(atomic_add)
-
-ENTRY(atomic_sub) /* %o0 = decrement, %o1 = atomic_ptr */
-	BACKOFF_SETUP(%o2)
-1:	lduw	[%o1], %g1
-	sub	%g1, %o0, %g7
-	cas	[%o1], %g1, %g7
-	cmp	%g1, %g7
-	bne,pn	%icc, BACKOFF_LABEL(2f, 1b)
-	 nop
-	retl
-	 nop
-2:	BACKOFF_SPIN(%o2, %o3, 1b)
-ENDPROC(atomic_sub)
 
-ENTRY(atomic_add_ret) /* %o0 = increment, %o1 = atomic_ptr */
-	BACKOFF_SETUP(%o2)
-1:	lduw	[%o1], %g1
-	add	%g1, %o0, %g7
-	cas	[%o1], %g1, %g7
-	cmp	%g1, %g7
-	bne,pn	%icc, BACKOFF_LABEL(2f, 1b)
-	 add	%g1, %o0, %g1
-	retl
-	 sra	%g1, 0, %o0
-2:	BACKOFF_SPIN(%o2, %o3, 1b)
-ENDPROC(atomic_add_ret)
+#define ATOMIC_OP(op)							\
+ENTRY(atomic_##op) /* %o0 = increment, %o1 = atomic_ptr */		\
+	BACKOFF_SETUP(%o2);						\
+1:	lduw	[%o1], %g1;						\
+	op	%g1, %o0, %g7;						\
+	cas	[%o1], %g1, %g7;					\
+	cmp	%g1, %g7;						\
+	bne,pn	%icc, BACKOFF_LABEL(2f, 1b);				\
+	 nop;								\
+	retl;								\
+	 nop;								\
+2:	BACKOFF_SPIN(%o2, %o3, 1b);					\
+ENDPROC(atomic_##op);							\
+									\
+ENTRY(atomic_##op##_return) /* %o0 = increment, %o1 = atomic_ptr */	\
+	BACKOFF_SETUP(%o2);						\
+1:	lduw	[%o1], %g1;						\
+	op	%g1, %o0, %g7;						\
+	cas	[%o1], %g1, %g7;					\
+	cmp	%g1, %g7;						\
+	bne,pn	%icc, BACKOFF_LABEL(2f, 1b);				\
+	 op	%g1, %o0, %g1;						\
+	retl;								\
+	 sra	%g1, 0, %o0;						\
+2:	BACKOFF_SPIN(%o2, %o3, 1b);					\
+ENDPROC(atomic_##op##_return);
+
+ATOMIC_OP(add)
+ATOMIC_OP(sub)
+
+#define ATOMIC64_OP(op)							\
+ENTRY(atomic64_##op) /* %o0 = increment, %o1 = atomic_ptr */		\
+	BACKOFF_SETUP(%o2);						\
+1:	ldx	[%o1], %g1;						\
+	op	%g1, %o0, %g7;						\
+	casx	[%o1], %g1, %g7;					\
+	cmp	%g1, %g7;						\
+	bne,pn	%xcc, BACKOFF_LABEL(2f, 1b);				\
+	 nop;								\
+	retl;								\
+	 nop;								\
+2:	BACKOFF_SPIN(%o2, %o3, 1b);					\
+ENDPROC(atomic64_##op);							\
+									\
+ENTRY(atomic64_##op##_return) /* %o0 = increment, %o1 = atomic_ptr */	\
+	BACKOFF_SETUP(%o2);						\
+1:	ldx	[%o1], %g1;						\
+	op	%g1, %o0, %g7;						\
+	casx	[%o1], %g1, %g7;					\
+	cmp	%g1, %g7;						\
+	bne,pn	%xcc, BACKOFF_LABEL(2f, 1b);				\
+	 nop;								\
+	retl;								\
+	 op	%g1, %o0, %o0;						\
+2:	BACKOFF_SPIN(%o2, %o3, 1b);					\
+ENDPROC(atomic64_##op##_return);
 
-ENTRY(atomic_sub_ret) /* %o0 = decrement, %o1 = atomic_ptr */
-	BACKOFF_SETUP(%o2)
-1:	lduw	[%o1], %g1
-	sub	%g1, %o0, %g7
-	cas	[%o1], %g1, %g7
-	cmp	%g1, %g7
-	bne,pn	%icc, BACKOFF_LABEL(2f, 1b)
-	 sub	%g1, %o0, %g1
-	retl
-	 sra	%g1, 0, %o0
-2:	BACKOFF_SPIN(%o2, %o3, 1b)
-ENDPROC(atomic_sub_ret)
-
-ENTRY(atomic64_add) /* %o0 = increment, %o1 = atomic_ptr */
-	BACKOFF_SETUP(%o2)
-1:	ldx	[%o1], %g1
-	add	%g1, %o0, %g7
-	casx	[%o1], %g1, %g7
-	cmp	%g1, %g7
-	bne,pn	%xcc, BACKOFF_LABEL(2f, 1b)
-	 nop
-	retl
-	 nop
-2:	BACKOFF_SPIN(%o2, %o3, 1b)
-ENDPROC(atomic64_add)
-
-ENTRY(atomic64_sub) /* %o0 = decrement, %o1 = atomic_ptr */
-	BACKOFF_SETUP(%o2)
-1:	ldx	[%o1], %g1
-	sub	%g1, %o0, %g7
-	casx	[%o1], %g1, %g7
-	cmp	%g1, %g7
-	bne,pn	%xcc, BACKOFF_LABEL(2f, 1b)
-	 nop
-	retl
-	 nop
-2:	BACKOFF_SPIN(%o2, %o3, 1b)
-ENDPROC(atomic64_sub)
-
-ENTRY(atomic64_add_ret) /* %o0 = increment, %o1 = atomic_ptr */
-	BACKOFF_SETUP(%o2)
-1:	ldx	[%o1], %g1
-	add	%g1, %o0, %g7
-	casx	[%o1], %g1, %g7
-	cmp	%g1, %g7
-	bne,pn	%xcc, BACKOFF_LABEL(2f, 1b)
-	 nop
-	retl
-	 add	%g1, %o0, %o0
-2:	BACKOFF_SPIN(%o2, %o3, 1b)
-ENDPROC(atomic64_add_ret)
-
-ENTRY(atomic64_sub_ret) /* %o0 = decrement, %o1 = atomic_ptr */
-	BACKOFF_SETUP(%o2)
-1:	ldx	[%o1], %g1
-	sub	%g1, %o0, %g7
-	casx	[%o1], %g1, %g7
-	cmp	%g1, %g7
-	bne,pn	%xcc, BACKOFF_LABEL(2f, 1b)
-	 nop
-	retl
-	 sub	%g1, %o0, %o0
-2:	BACKOFF_SPIN(%o2, %o3, 1b)
-ENDPROC(atomic64_sub_ret)
+ATOMIC64_OP(add)
+ATOMIC64_OP(sub)
 
 ENTRY(atomic64_dec_if_positive) /* %o0 = atomic_ptr */
 	BACKOFF_SETUP(%o2)
--- a/arch/sparc/lib/ksyms.c
+++ b/arch/sparc/lib/ksyms.c
@@ -99,14 +99,15 @@ EXPORT_SYMBOL(___copy_in_user);
 EXPORT_SYMBOL(__clear_user);
 
 /* Atomic counter implementation. */
-EXPORT_SYMBOL(atomic_add);
-EXPORT_SYMBOL(atomic_add_ret);
-EXPORT_SYMBOL(atomic_sub);
-EXPORT_SYMBOL(atomic_sub_ret);
-EXPORT_SYMBOL(atomic64_add);
-EXPORT_SYMBOL(atomic64_add_ret);
-EXPORT_SYMBOL(atomic64_sub);
-EXPORT_SYMBOL(atomic64_sub_ret);
+#define ATOMIC_OP(op)							\
+EXPORT_SYMBOL(atomic_##op);						\
+EXPORT_SYMBOL(atomic_##op##_return);					\
+EXPORT_SYMBOL(atomic64_##op);						\
+EXPORT_SYMBOL(atomic64_##op##_return);
+
+ATOMIC_OP(add)
+ATOMIC_OP(sub)
+
 EXPORT_SYMBOL(atomic64_dec_if_positive);
 
 /* Atomic bit operations. */
--- a/arch/xtensa/include/asm/atomic.h
+++ b/arch/xtensa/include/asm/atomic.h
@@ -58,165 +58,90 @@
  */
 #define atomic_set(v,i)		((v)->counter = (i))
 
-/**
- * atomic_add - add integer to atomic variable
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v.
- */
-static inline void atomic_add(int i, atomic_t * v)
-{
 #if XCHAL_HAVE_S32C1I
-	unsigned long tmp;
-	int result;
-
-	__asm__ __volatile__(
-			"1:     l32i    %1, %3, 0\n"
-			"       wsr     %1, scompare1\n"
-			"       add     %0, %1, %2\n"
-			"       s32c1i  %0, %3, 0\n"
-			"       bne     %0, %1, 1b\n"
-			: "=&a" (result), "=&a" (tmp)
-			: "a" (i), "a" (v)
-			: "memory"
-			);
-#else
-	unsigned int vval;
-
-	__asm__ __volatile__(
-			"       rsil    a15, "__stringify(LOCKLEVEL)"\n"
-			"       l32i    %0, %2, 0\n"
-			"       add     %0, %0, %1\n"
-			"       s32i    %0, %2, 0\n"
-			"       wsr     a15, ps\n"
-			"       rsync\n"
-			: "=&a" (vval)
-			: "a" (i), "a" (v)
-			: "a15", "memory"
-			);
-#endif
+#define ATOMIC_OP(op)							\
+static inline void atomic_##op(int i, atomic_t * v)			\
+{									\
+	unsigned long tmp;						\
+	int result;							\
+									\
+	__asm__ __volatile__(						\
+			"1:     l32i    %1, %3, 0\n"			\
+			"       wsr     %1, scompare1\n"		\
+			"       " #op " %0, %1, %2\n"			\
+			"       s32c1i  %0, %3, 0\n"			\
+			"       bne     %0, %1, 1b\n"			\
+			: "=&a" (result), "=&a" (tmp)			\
+			: "a" (i), "a" (v)				\
+			: "memory"					\
+			);						\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t * v)		\
+{									\
+	unsigned long tmp;						\
+	int result;							\
+									\
+	__asm__ __volatile__(						\
+			"1:     l32i    %1, %3, 0\n"			\
+			"       wsr     %1, scompare1\n"		\
+			"       " #op " %0, %1, %2\n"			\
+			"       s32c1i  %0, %3, 0\n"			\
+			"       bne     %0, %1, 1b\n"			\
+			"       " #op " %0, %0, %2\n"			\
+			: "=&a" (result), "=&a" (tmp)			\
+			: "a" (i), "a" (v)				\
+			: "memory"					\
+			);						\
+									\
+	return result;							\
 }
 
-/**
- * atomic_sub - subtract the atomic variable
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
- *
- * Atomically subtracts @i from @v.
- */
-static inline void atomic_sub(int i, atomic_t *v)
-{
-#if XCHAL_HAVE_S32C1I
-	unsigned long tmp;
-	int result;
+#else /* XCHAL_HAVE_S32C1I */
 
-	__asm__ __volatile__(
-			"1:     l32i    %1, %3, 0\n"
-			"       wsr     %1, scompare1\n"
-			"       sub     %0, %1, %2\n"
-			"       s32c1i  %0, %3, 0\n"
-			"       bne     %0, %1, 1b\n"
-			: "=&a" (result), "=&a" (tmp)
-			: "a" (i), "a" (v)
-			: "memory"
-			);
-#else
-	unsigned int vval;
-
-	__asm__ __volatile__(
-			"       rsil    a15, "__stringify(LOCKLEVEL)"\n"
-			"       l32i    %0, %2, 0\n"
-			"       sub     %0, %0, %1\n"
-			"       s32i    %0, %2, 0\n"
-			"       wsr     a15, ps\n"
-			"       rsync\n"
-			: "=&a" (vval)
-			: "a" (i), "a" (v)
-			: "a15", "memory"
-			);
-#endif
+#define ATOMIC_OP(op)							\
+static inline void atomic_##op(int i, atomic_t * v)			\
+{									\
+	unsigned int vval;						\
+									\
+	__asm__ __volatile__(						\
+			"       rsil    a15, "__stringify(LOCKLEVEL)"\n"\
+			"       l32i    %0, %2, 0\n"			\
+			"       " #op " %0, %0, %1\n"			\
+			"       s32i    %0, %2, 0\n"			\
+			"       wsr     a15, ps\n"			\
+			"       rsync\n"				\
+			: "=&a" (vval)					\
+			: "a" (i), "a" (v)				\
+			: "a15", "memory"				\
+			);						\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t * v)		\
+{									\
+	unsigned int vval;						\
+									\
+	__asm__ __volatile__(						\
+			"       rsil    a15,"__stringify(LOCKLEVEL)"\n"	\
+			"       l32i    %0, %2, 0\n"			\
+			"       " #op " %0, %0, %1\n"			\
+			"       s32i    %0, %2, 0\n"			\
+			"       wsr     a15, ps\n"			\
+			"       rsync\n"				\
+			: "=&a" (vval)					\
+			: "a" (i), "a" (v)				\
+			: "a15", "memory"				\
+			);						\
+									\
+	return vval;							\
 }
 
-/*
- * We use atomic_{add|sub}_return to define other functions.
- */
+#endif /* XCHAL_HAVE_S32C1I */
 
-static inline int atomic_add_return(int i, atomic_t * v)
-{
-#if XCHAL_HAVE_S32C1I
-	unsigned long tmp;
-	int result;
+ATOMIC_OP(add)
+ATOMIC_OP(sub)
 
-	__asm__ __volatile__(
-			"1:     l32i    %1, %3, 0\n"
-			"       wsr     %1, scompare1\n"
-			"       add     %0, %1, %2\n"
-			"       s32c1i  %0, %3, 0\n"
-			"       bne     %0, %1, 1b\n"
-			"       add     %0, %0, %2\n"
-			: "=&a" (result), "=&a" (tmp)
-			: "a" (i), "a" (v)
-			: "memory"
-			);
-
-	return result;
-#else
-	unsigned int vval;
-
-	__asm__ __volatile__(
-			"       rsil    a15,"__stringify(LOCKLEVEL)"\n"
-			"       l32i    %0, %2, 0\n"
-			"       add     %0, %0, %1\n"
-			"       s32i    %0, %2, 0\n"
-			"       wsr     a15, ps\n"
-			"       rsync\n"
-			: "=&a" (vval)
-			: "a" (i), "a" (v)
-			: "a15", "memory"
-			);
-
-	return vval;
-#endif
-}
-
-static inline int atomic_sub_return(int i, atomic_t * v)
-{
-#if XCHAL_HAVE_S32C1I
-	unsigned long tmp;
-	int result;
-
-	__asm__ __volatile__(
-			"1:     l32i    %1, %3, 0\n"
-			"       wsr     %1, scompare1\n"
-			"       sub     %0, %1, %2\n"
-			"       s32c1i  %0, %3, 0\n"
-			"       bne     %0, %1, 1b\n"
-			"       sub     %0, %0, %2\n"
-			: "=&a" (result), "=&a" (tmp)
-			: "a" (i), "a" (v)
-			: "memory"
-			);
-
-	return result;
-#else
-	unsigned int vval;
-
-	__asm__ __volatile__(
-			"       rsil    a15,"__stringify(LOCKLEVEL)"\n"
-			"       l32i    %0, %2, 0\n"
-			"       sub     %0, %0, %1\n"
-			"       s32i    %0, %2, 0\n"
-			"       wsr     a15, ps\n"
-			"       rsync\n"
-			: "=&a" (vval)
-			: "a" (i), "a" (v)
-			: "a15", "memory"
-			);
-
-	return vval;
-#endif
-}
+#undef ATOMIC_OP
 
 /**
  * atomic_sub_and_test - subtract value from variable and test result
--- a/include/asm-generic/atomic.h
+++ b/include/asm-generic/atomic.h
@@ -19,13 +19,60 @@
 #include <asm/barrier.h>
 
 #ifdef CONFIG_SMP
-/* Force people to define core atomics */
-# if !defined(atomic_add_return) || !defined(atomic_sub_return) || \
-     !defined(atomic_clear_mask) || !defined(atomic_set_mask)
-#  error "SMP requires a little arch-specific magic"
-# endif
+
+/* we can build all atomic primitives from cmpxchg */
+
+#define GEN_ATOMIC_OP(op, cop)						\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	int c, old;							\
+									\
+	c = v->counter;							\
+	while ((old = cmpxchg(&v->counter, c, c cop i)) != c)		\
+		c = old;						\
+									\
+	return c cop i;							\
+}
+
+#else
+
+#include <linux/irqflags.h>
+
+#define GEN_ATOMIC_OP(op, cop)						\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	unsigned long flags;						\
+	int ret;							\
+									\
+	raw_local_irq_save(flags);					\
+	ret = (v->counter = v->counter cop i);				\
+	raw_local_irq_restore(flags);					\
+									\
+	return ret;							\
+}
+
+#endif /* CONFIG_SMP */
+
+#ifndef atomic_add_return
+GEN_ATOMIC_OP(add, +)
+#endif
+
+#ifndef atomic_sub_return
+GEN_ATOMIC_OP(sub, -)
+#endif
+
+#ifndef atomic_clear_mask
+GEN_ATOMIC_OP(and, &)
+#define atomic_clear_mask(i, v) (void)atomic_and_return(~(i), (v))
+#endif
+
+#ifndef atomic_set_mask
+GEN_ATOMIC_OP(or, |)
+#define atomic_set_mask(i, v)	(void)atomic_or_return((i), (v))
 #endif
 
+#undef GEN_ATOMIC_OP
+
 /*
  * Atomic operations that C can't guarantee us.  Useful for
  * resource counting etc..
@@ -33,8 +80,6 @@
 
 #define ATOMIC_INIT(i)	{ (i) }
 
-#ifdef __KERNEL__
-
 /**
  * atomic_read - read atomic variable
  * @v: pointer of type atomic_t
@@ -56,52 +101,6 @@
 
 #include <linux/irqflags.h>
 
-/**
- * atomic_add_return - add integer to atomic variable
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v and returns the result
- */
-#ifndef atomic_add_return
-static inline int atomic_add_return(int i, atomic_t *v)
-{
-	unsigned long flags;
-	int temp;
-
-	raw_local_irq_save(flags); /* Don't trace it in an irqsoff handler */
-	temp = v->counter;
-	temp += i;
-	v->counter = temp;
-	raw_local_irq_restore(flags);
-
-	return temp;
-}
-#endif
-
-/**
- * atomic_sub_return - subtract integer from atomic variable
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
- *
- * Atomically subtracts @i from @v and returns the result
- */
-#ifndef atomic_sub_return
-static inline int atomic_sub_return(int i, atomic_t *v)
-{
-	unsigned long flags;
-	int temp;
-
-	raw_local_irq_save(flags); /* Don't trace it in an irqsoff handler */
-	temp = v->counter;
-	temp -= i;
-	v->counter = temp;
-	raw_local_irq_restore(flags);
-
-	return temp;
-}
-#endif
-
 static inline int atomic_add_negative(int i, atomic_t *v)
 {
 	return atomic_add_return(i, v) < 0;
@@ -146,42 +145,4 @@ static inline int __atomic_add_unless(at
   return c;
 }
 
-/**
- * atomic_clear_mask - Atomically clear bits in atomic variable
- * @mask: Mask of the bits to be cleared
- * @v: pointer of type atomic_t
- *
- * Atomically clears the bits set in @mask from @v
- */
-#ifndef atomic_clear_mask
-static inline void atomic_clear_mask(unsigned long mask, atomic_t *v)
-{
-	unsigned long flags;
-
-	mask = ~mask;
-	raw_local_irq_save(flags); /* Don't trace it in a irqsoff handler */
-	v->counter &= mask;
-	raw_local_irq_restore(flags);
-}
-#endif
-
-/**
- * atomic_set_mask - Atomically set bits in atomic variable
- * @mask: Mask of the bits to be set
- * @v: pointer of type atomic_t
- *
- * Atomically sets the bits set in @mask in @v
- */
-#ifndef atomic_set_mask
-static inline void atomic_set_mask(unsigned int mask, atomic_t *v)
-{
-	unsigned long flags;
-
-	raw_local_irq_save(flags); /* Don't trace it in a irqsoff handler */
-	v->counter |= mask;
-	raw_local_irq_restore(flags);
-}
-#endif
-
-#endif /* __KERNEL__ */
 #endif /* __ASM_GENERIC_ATOMIC_H */
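
For reference, the SMP fallback above is the plain cmpxchg loop; expanding
GEN_ATOMIC_OP(add, +) by hand gives roughly the following (a sketch, assuming
the arch supplies a working cmpxchg()):

static inline int atomic_add_return(int i, atomic_t *v)
{
	int c, old;

	c = v->counter;
	/* retry with the fresh value until our update wins the race */
	while ((old = cmpxchg(&v->counter, c, c + i)) != c)
		c = old;

	return c + i;	/* the value we installed */
}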



^ permalink raw reply	[flat|nested] 299+ messages in thread

* [RFC][PATCH 5/5] arch: Sanitize atomic_t bitwise ops
  2014-02-06 13:48 [RFC][PATCH 0/5] arch: atomic rework Peter Zijlstra
                   ` (3 preceding siblings ...)
  2014-02-06 13:48 ` [RFC][PATCH 4/5] arch: Generic atomic.h cleanup Peter Zijlstra
@ 2014-02-06 13:48 ` Peter Zijlstra
  2014-02-06 14:43   ` Geert Uytterhoeven
  2014-02-06 16:53   ` Linus Torvalds
  2014-02-06 18:25 ` [RFC][PATCH 0/5] arch: atomic rework David Howells
       [not found] ` <52F93B7C.2090304@tilera.com>
  6 siblings, 2 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-06 13:48 UTC (permalink / raw)
  To: linux-arch, linux-kernel
  Cc: torvalds, akpm, mingo, will.deacon, paulmck, Peter Zijlstra

[-- Attachment #1: peterz-arch-atomic-kill-mask.patch --]
[-- Type: text/plain, Size: 63577 bytes --]

Many archs have atomic_{set,clear}_mask() but not all. Remove these
and provide a comprehensive set of bitops:

  atomic{,64}_{and,or,xor}{,_return}()

for everybody.
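
For callers the conversion is mechanical; a sketch of the mapping, mirroring
the conversions done below:

	atomic_set_mask(mask, &v);	/* v |=  mask */
	atomic_clear_mask(mask, &v);	/* v &= ~mask */

becomes:

	atomic_or(mask, &v);
	atomic_and(~mask, &v);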

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/alpha/include/asm/atomic.h        |    3 +
 arch/arc/include/asm/atomic.h          |   29 ++---------
 arch/arm/include/asm/atomic.h          |   23 ++++----
 arch/arm64/include/asm/atomic.h        |    8 ++-
 arch/avr32/include/asm/atomic.h        |    3 +
 arch/blackfin/include/asm/atomic.h     |    5 +
 arch/blackfin/mach-common/smp.c        |    2 
 arch/cris/include/asm/atomic.h         |    3 +
 arch/hexagon/include/asm/atomic.h      |    3 +
 arch/ia64/include/asm/atomic.h         |   32 ++++++++++--
 arch/m32r/include/asm/atomic.h         |   44 +---------------
 arch/m32r/kernel/smp.c                 |    4 -
 arch/m68k/include/asm/atomic.h         |   13 +---
 arch/metag/include/asm/atomic_lnkget.h |   37 +-------------
 arch/metag/include/asm/atomic_lock1.h  |   23 +-------
 arch/mips/include/asm/atomic.h         |    6 ++
 arch/mn10300/include/asm/atomic.h      |   70 +-------------------------
 arch/mn10300/mm/tlb-smp.c              |    2 
 arch/parisc/include/asm/atomic.h       |    6 ++
 arch/powerpc/include/asm/atomic.h      |    8 ++-
 arch/powerpc/kernel/misc_32.S          |   19 -------
 arch/s390/include/asm/atomic.h         |   87 ++++++++++++++++++++-------------
 arch/s390/kernel/time.c                |    4 -
 arch/s390/kvm/diag.c                   |    2 
 arch/s390/kvm/intercept.c              |    2 
 arch/s390/kvm/interrupt.c              |   16 +++---
 arch/s390/kvm/kvm-s390.c               |   14 ++---
 arch/s390/kvm/sigp.c                   |    6 +-
 arch/sh/include/asm/atomic-grb.h       |   42 +--------------
 arch/sh/include/asm/atomic-irq.h       |   21 +------
 arch/sh/include/asm/atomic-llsc.h      |   31 +----------
 arch/sparc/include/asm/atomic_32.h     |    4 +
 arch/sparc/include/asm/atomic_64.h     |    3 +
 arch/sparc/lib/atomic32.c              |    5 +
 arch/sparc/lib/atomic_64.S             |    6 ++
 arch/sparc/lib/ksyms.c                 |    3 +
 arch/x86/include/asm/atomic.h          |   32 ++++++++----
 arch/x86/include/asm/atomic64_32.h     |   20 +++++++
 arch/x86/include/asm/atomic64_64.h     |   22 ++++++++
 arch/xtensa/include/asm/atomic.h       |   72 +--------------------------
 drivers/gpu/drm/i915/i915_irq.c        |    4 -
 drivers/s390/scsi/zfcp_aux.c           |    2 
 drivers/s390/scsi/zfcp_erp.c           |   68 ++++++++++++-------------
 drivers/s390/scsi/zfcp_fc.c            |    8 +--
 drivers/s390/scsi/zfcp_fsf.c           |   30 +++++------
 drivers/s390/scsi/zfcp_qdio.c          |   14 ++---
 include/asm-generic/atomic.h           |   34 +++++++-----
 include/asm-generic/atomic64.h         |   17 ++++--
 include/linux/atomic.h                 |   13 ----
 lib/atomic64.c                         |   77 +++++++++++------------------
 50 files changed, 404 insertions(+), 598 deletions(-)
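
Each arch instantiates one ATOMIC_OP()-style macro per operation, generating
both the void and the value-returning form. On an IRQ-disabling arch,
ATOMIC_OP(and, &=) comes out as roughly (illustrative sketch, following the
sh atomic-irq.h flavour):

static inline void atomic_and(int i, atomic_t *v)
{
	unsigned long flags;

	raw_local_irq_save(flags);	/* serialize against interrupts */
	v->counter &= i;
	raw_local_irq_restore(flags);
}

static inline int atomic_and_return(int i, atomic_t *v)
{
	unsigned long temp, flags;

	raw_local_irq_save(flags);
	temp = v->counter;
	temp &= i;
	v->counter = temp;
	raw_local_irq_restore(flags);

	return temp;
}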

--- a/arch/alpha/include/asm/atomic.h
+++ b/arch/alpha/include/asm/atomic.h
@@ -103,6 +103,9 @@ static __inline__ long atomic64_##op##_r
 
 ATOMIC_OPS(add)
 ATOMIC_OPS(sub)
+ATOMIC_OPS(and)
+ATOMIC_OPS(or)
+ATOMIC_OPS(xor)
 
 #undef ATOMIC_OPS
 #undef ATOMIC64_OP
--- a/arch/arc/include/asm/atomic.h
+++ b/arch/arc/include/asm/atomic.h
@@ -58,23 +58,12 @@ static inline int atomic_##op##_return(i
 
 ATOMIC_OP(add)
 ATOMIC_OP(sub)
+ATOMIC_OP(and)
+ATOMIC_OP(or)
+ATOMIC_OP(xor)
 
 #undef ATOMIC_OP
 
-static inline void atomic_clear_mask(unsigned long mask, unsigned long *addr)
-{
-	unsigned int temp;
-
-	__asm__ __volatile__(
-	"1:	llock   %0, [%1]	\n"
-	"	bic     %0, %0, %2	\n"
-	"	scond   %0, [%1]	\n"
-	"	bnz     1b		\n"
-	: "=&r"(temp)
-	: "r"(addr), "ir"(mask)
-	: "cc");
-}
-
 #else	/* !CONFIG_ARC_HAS_LLSC */
 
 #ifndef CONFIG_SMP
@@ -134,18 +123,12 @@ static inline int atomic_##opn##_return(
 
 ATOMIC_OP(add, +=)
 ATOMIC_OP(sub, -=)
+ATOMIC_OP(and, &=)
+ATOMIC_OP(or, |=)
+ATOMIC_OP(xor, ^=)
 
 #undef ATOMIC_OP
 
-static inline void atomic_clear_mask(unsigned long mask, unsigned long *addr)
-{
-	unsigned long flags;
-
-	atomic_ops_lock(flags);
-	*addr &= ~mask;
-	atomic_ops_unlock(flags);
-}
-
 #endif /* !CONFIG_ARC_HAS_LLSC */
 
 /**
--- a/arch/arm/include/asm/atomic.h
+++ b/arch/arm/include/asm/atomic.h
@@ -38,7 +38,7 @@
  * to ensure that the update happens.
  */
 
-#define ATOMIC_OP(op)							\
+#define ATOMIC_OP(op, cop)						\
 static inline void atomic_##op(int i, atomic_t *v)			\
 {									\
 	unsigned long tmp;						\
@@ -78,11 +78,6 @@ static inline int atomic_##op##_return(i
 	return result;							\
 }
 
-ATOMIC_OP(add)
-ATOMIC_OP(sub)
-
-#undef ATOMIC_OP
-
 static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
 {
 	int oldval;
@@ -135,11 +130,6 @@ static inline int atomic_##opn##_return(
 	return val;							\
 }
 
-ATOMIC_OP(add, +=)
-ATOMIC_OP(sub, -=)
-
-#undef ATOMIC_OP
-
 static inline int atomic_cmpxchg(atomic_t *v, int old, int new)
 {
 	int ret;
@@ -156,6 +146,14 @@ static inline int atomic_cmpxchg(atomic_
 
 #endif /* __LINUX_ARM_ARCH__ */
 
+ATOMIC_OP(add, +=)
+ATOMIC_OP(sub, -=)
+ATOMIC_OP(and, &=)
+ATOMIC_OP(or, |=)
+ATOMIC_OP(xor, ^=)
+
+#undef ATOMIC_OP
+
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 
 static inline int __atomic_add_unless(atomic_t *v, int a, int u)
@@ -282,6 +280,9 @@ static inline long long atomic64_##opn##
 
 ATOMIC64_OP(add, adds, adc)
 ATOMIC64_OP(sub, subs, sbc)
+ATOMIC64_OP(and, and, and)
+ATOMIC64_OP(or, orr, orr)
+ATOMIC64_OP(xor, eor, eor)
 
 #undef ATOMIC64_OP
 
--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@ -80,6 +80,9 @@ static inline int atomic_##op##_return(i
 
 ATOMIC_OP(add)
 ATOMIC_OP(sub)
+ATOMIC_OP(and)
+ATOMIC_OP(or)
+ATOMIC_OP(xor)
 
 #undef ATOMIC_OP
 
@@ -136,7 +139,7 @@ static inline int __atomic_add_unless(at
 #define atomic64_read(v)	(*(volatile long long *)&(v)->counter)
 #define atomic64_set(v,i)	(((v)->counter) = (i))
 
 #define ATOMIC64_OP(op)							\
 static inline void atomic64_##op(long i, atomic64_t *v)			\
 {									\
 	long result;							\
@@ -172,6 +175,9 @@ static inline long atomic64_##op##_retur
 
 ATOMIC64_OP(add)
 ATOMIC64_OP(sub)
+ATOMIC64_OP(and)
+ATOMIC64_OP(or)
+ATOMIC64_OP(xor)
 
 #undef ATOMIC64_OP
 
--- a/arch/avr32/include/asm/atomic.h
+++ b/arch/avr32/include/asm/atomic.h
@@ -59,6 +59,9 @@ static inline int atomic_##op##_return(i
 
 ATOMIC_OP(add)
 ATOMIC_OP(sub)
+ATOMIC_OP(and)
+ATOMIC_OP(or)
+ATOMIC_OP(xor)
 
 #undef ATOMIC_OP
 
--- a/arch/blackfin/include/asm/atomic.h
+++ b/arch/blackfin/include/asm/atomic.h
@@ -27,8 +27,9 @@ asmlinkage int __raw_atomic_test_asm(con
 #define atomic_add_return(i, v) __raw_atomic_update_asm(&(v)->counter, i)
 #define atomic_sub_return(i, v) __raw_atomic_update_asm(&(v)->counter, -(i))
 
-#define atomic_clear_mask(m, v) __raw_atomic_clear_asm(&(v)->counter, m)
-#define atomic_set_mask(m, v)   __raw_atomic_set_asm(&(v)->counter, m)
+#define atomic_and_return(m, v) __raw_atomic_clear_asm(&(v)->counter, ~(m))
+#define atomic_or_return(m, v)  __raw_atomic_set_asm(&(v)->counter, m)
+#define atomic_xor_return(m, v) __raw_atomic_xor_asm(&(v)->counter, m)
 
 #endif
 
--- a/arch/blackfin/mach-common/smp.c
+++ b/arch/blackfin/mach-common/smp.c
@@ -195,7 +195,7 @@ void send_ipi(const struct cpumask *cpum
 	local_irq_save(flags);
 	for_each_cpu(cpu, cpumask) {
 		bfin_ipi_data = &per_cpu(bfin_ipi, cpu);
-		atomic_set_mask((1 << msg), &bfin_ipi_data->bits);
+		atomic_or((1 << msg), &bfin_ipi_data->bits);
 		atomic_inc(&bfin_ipi_data->count);
 	}
 	local_irq_restore(flags);
--- a/arch/cris/include/asm/atomic.h
+++ b/arch/cris/include/asm/atomic.h
@@ -42,6 +42,9 @@ static inline int atomic_##opn##_return(
 
 ATOMIC_OP(add, +=)
 ATOMIC_OP(sub, -=)
+ATOMIC_OP(and, &=)
+ATOMIC_OP(or, |=)
+ATOMIC_OP(xor, ^=)
 
 #undef ATOMIC_OP
 
--- a/arch/hexagon/include/asm/atomic.h
+++ b/arch/hexagon/include/asm/atomic.h
@@ -115,6 +115,9 @@ static inline int atomic_##op##_return(i
 
 ATOMIC_OP(add)
 ATOMIC_OP(sub)
+ATOMIC_OP(and)
+ATOMIC_OP(or)
+ATOMIC_OP(xor)
 
 #undef ATOMIC_OP
 
--- a/arch/ia64/include/asm/atomic.h
+++ b/arch/ia64/include/asm/atomic.h
@@ -44,8 +44,6 @@ ia64_atomic_##opn (int i, atomic_t *v)
 ATOMIC_OP(add, +)
 ATOMIC_OP(sub, -)
 
-#undef ATOMIC_OP
-
 #define atomic_add_return(i,v)						\
 ({									\
 	int __ia64_aar_i = (i);						\
@@ -70,6 +68,20 @@ ATOMIC_OP(sub, -)
 		: ia64_atomic_sub(__ia64_asr_i, v);			\
 })
 
+ATOMIC_OP(and, &)
+ATOMIC_OP(or, |)
+ATOMIC_OP(xor, ^)
+
+#define atomic_and_return(i,v)	ia64_atomic_and(i,v)
+#define atomic_or_return(i,v)	ia64_atomic_or(i,v)
+#define atomic_xor_return(i,v)	ia64_atomic_xor(i,v)
+
+#define atomic_and(i,v)	(void)ia64_atomic_and(i,v)
+#define atomic_or(i,v)	(void)ia64_atomic_or(i,v)
+#define atomic_xor(i,v)	(void)ia64_atomic_xor(i,v)
+
+#undef ATOMIC_OP
+
 #define ATOMIC64_OP(opn, op)						\
 static __inline__ long							\
 ia64_atomic64_##opn (__s64 i, atomic64_t *v)				\
@@ -88,8 +100,6 @@ ia64_atomic64_##opn (__s64 i, atomic64_t
 ATOMIC64_OP(add, +)
 ATOMIC64_OP(sub, -)
 
-#undef ATOMIC64_OP
-
 #define atomic64_add_return(i,v)					\
 ({									\
 	long __ia64_aar_i = (i);					\
@@ -114,6 +124,20 @@ ATOMIC64_OP(sub, -)
 		: ia64_atomic64_sub(__ia64_asr_i, v);			\
 })
 
+ATOMIC64_OP(and, &)
+ATOMIC64_OP(or, |)
+ATOMIC64_OP(xor, ^)
+
+#define atomic64_and_return(i,v)	ia64_atomic64_and(i,v)
+#define atomic64_or_return(i,v)		ia64_atomic64_or(i,v)
+#define atomic64_xor_return(i,v)	ia64_atomic64_xor(i,v)
+
+#define atomic64_and(i,v)	(void)ia64_atomic64_and(i,v)
+#define atomic64_or(i,v)	(void)ia64_atomic64_or(i,v)
+#define atomic64_xor(i,v)	(void)ia64_atomic64_xor(i,v)
+
+#undef ATOMIC64_OP
+
 #define atomic_cmpxchg(v, old, new) (cmpxchg(&((v)->counter), old, new))
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 
--- a/arch/m32r/include/asm/atomic.h
+++ b/arch/m32r/include/asm/atomic.h
@@ -90,6 +90,9 @@ static __inline__ int atomic_##op##_retu
 
 ATOMIC_OP(add)
 ATOMIC_OP(sub)
+ATOMIC_OP(and)
+ATOMIC_OP(or)
+ATOMIC_OP(xor)
 
 #undef ATOMIC_OP
 
@@ -234,45 +237,4 @@ static __inline__ int __atomic_add_unles
 	return c;
 }
 
-
-static __inline__ void atomic_clear_mask(unsigned long  mask, atomic_t *addr)
-{
-	unsigned long flags;
-	unsigned long tmp;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# atomic_clear_mask		\n\t"
-		DCACHE_CLEAR("%0", "r5", "%1")
-		M32R_LOCK" %0, @%1;		\n\t"
-		"and	%0, %2;			\n\t"
-		M32R_UNLOCK" %0, @%1;		\n\t"
-		: "=&r" (tmp)
-		: "r" (addr), "r" (~mask)
-		: "memory"
-		__ATOMIC_CLOBBER
-	);
-	local_irq_restore(flags);
-}
-
-static __inline__ void atomic_set_mask(unsigned long  mask, atomic_t *addr)
-{
-	unsigned long flags;
-	unsigned long tmp;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# atomic_set_mask		\n\t"
-		DCACHE_CLEAR("%0", "r5", "%1")
-		M32R_LOCK" %0, @%1;		\n\t"
-		"or	%0, %2;			\n\t"
-		M32R_UNLOCK" %0, @%1;		\n\t"
-		: "=&r" (tmp)
-		: "r" (addr), "r" (mask)
-		: "memory"
-		__ATOMIC_CLOBBER
-	);
-	local_irq_restore(flags);
-}
-
 #endif	/* _ASM_M32R_ATOMIC_H */
--- a/arch/m32r/kernel/smp.c
+++ b/arch/m32r/kernel/smp.c
@@ -156,7 +156,7 @@ void smp_flush_cache_all(void)
 	cpumask_clear_cpu(smp_processor_id(), &cpumask);
 	spin_lock(&flushcache_lock);
 	mask=cpumask_bits(&cpumask);
-	atomic_set_mask(*mask, (atomic_t *)&flushcache_cpumask);
+	atomic_or(*mask, (atomic_t *)&flushcache_cpumask);
 	send_IPI_mask(&cpumask, INVALIDATE_CACHE_IPI, 0);
 	_flush_cache_copyback_all();
 	while (flushcache_cpumask)
@@ -407,7 +407,7 @@ static void flush_tlb_others(cpumask_t c
 	flush_vma = vma;
 	flush_va = va;
 	mask=cpumask_bits(&cpumask);
-	atomic_set_mask(*mask, (atomic_t *)&flush_cpumask);
+	atomic_or(*mask, (atomic_t *)&flush_cpumask);
 
 	/*
 	 * We have to send the IPI only to
--- a/arch/m68k/include/asm/atomic.h
+++ b/arch/m68k/include/asm/atomic.h
@@ -76,6 +76,9 @@ static inline int atomic_##op##_return(i
 
 ATOMIC_OP(add, +=)
 ATOMIC_OP(sub, -=)
+ATOMIC_OP(and, &=)
+ATOMIC_OP(or, |=)
+ATOMIC_OP(xor, ^=)
 
 #undef ATOMIC_OP
 
@@ -168,16 +171,6 @@ static inline int atomic_add_negative(in
 	return c != 0;
 }
 
-static inline void atomic_clear_mask(unsigned long mask, unsigned long *v)
-{
-	__asm__ __volatile__("andl %1,%0" : "+m" (*v) : ASM_DI (~(mask)));
-}
-
-static inline void atomic_set_mask(unsigned long mask, unsigned long *v)
-{
-	__asm__ __volatile__("orl %1,%0" : "+m" (*v) : ASM_DI (mask));
-}
-
 static __inline__ int __atomic_add_unless(atomic_t *v, int a, int u)
 {
 	int c, old;
--- a/arch/metag/include/asm/atomic_lnkget.h
+++ b/arch/metag/include/asm/atomic_lnkget.h
@@ -70,43 +70,12 @@ static inline int atomic_##op##_return(i
 
 ATOMIC_OP(add)
 ATOMIC_OP(sub)
+ATOMIC_OP(and)
+ATOMIC_OP(or)
+ATOMIC_OP(xor)
 
 #undef ATOMIC_OP
 
-static inline void atomic_clear_mask(unsigned int mask, atomic_t *v)
-{
-	int temp;
-
-	asm volatile (
-		"1:	LNKGETD %0, [%1]\n"
-		"	AND	%0, %0, %2\n"
-		"	LNKSETD	[%1] %0\n"
-		"	DEFR	%0, TXSTAT\n"
-		"	ANDT	%0, %0, #HI(0x3f000000)\n"
-		"	CMPT	%0, #HI(0x02000000)\n"
-		"	BNZ	1b\n"
-		: "=&d" (temp)
-		: "da" (&v->counter), "bd" (~mask)
-		: "cc");
-}
-
-static inline void atomic_set_mask(unsigned int mask, atomic_t *v)
-{
-	int temp;
-
-	asm volatile (
-		"1:	LNKGETD %0, [%1]\n"
-		"	OR	%0, %0, %2\n"
-		"	LNKSETD	[%1], %0\n"
-		"	DEFR	%0, TXSTAT\n"
-		"	ANDT	%0, %0, #HI(0x3f000000)\n"
-		"	CMPT	%0, #HI(0x02000000)\n"
-		"	BNZ	1b\n"
-		: "=&d" (temp)
-		: "da" (&v->counter), "bd" (mask)
-		: "cc");
-}
-
 static inline int atomic_cmpxchg(atomic_t *v, int old, int new)
 {
 	int result, temp;
--- a/arch/metag/include/asm/atomic_lock1.h
+++ b/arch/metag/include/asm/atomic_lock1.h
@@ -65,29 +65,12 @@ static inline int atomic_##opn##_return(
 
 ATOMIC_OP(add, +=)
 ATOMIC_OP(sub, -=)
+ATOMIC_OP(and, &=)
+ATOMIC_OP(or, |=)
+ATOMIC_OP(xor, ^=)
 
 #undef ATOMIC_OP
 
-static inline void atomic_clear_mask(unsigned int mask, atomic_t *v)
-{
-	unsigned long flags;
-
-	__global_lock1(flags);
-	fence();
-	v->counter &= ~mask;
-	__global_unlock1(flags);
-}
-
-static inline void atomic_set_mask(unsigned int mask, atomic_t *v)
-{
-	unsigned long flags;
-
-	__global_lock1(flags);
-	fence();
-	v->counter |= mask;
-	__global_unlock1(flags);
-}
-
 static inline int atomic_cmpxchg(atomic_t *v, int old, int new)
 {
 	int ret;
--- a/arch/mips/include/asm/atomic.h
+++ b/arch/mips/include/asm/atomic.h
@@ -128,6 +128,9 @@ static __inline__ int atomic_##op##_retu
 
 ATOMIC_OP(add, addu, +=)
 ATOMIC_OP(sub, subu, -=)
+ATOMIC_OP(and, and, &=)
+ATOMIC_OP(or, or, |=)
+ATOMIC_OP(xor, xor, ^=)
 
 #undef ATOMIC_OP
 
@@ -397,6 +400,9 @@ static __inline__ long atomic64_##op##_r
 
 ATOMIC64_OP(add, daddu, +=)
 ATOMIC64_OP(sub, dsubu, -=)
+ATOMIC64_OP(and, and, &=)
+ATOMIC64_OP(or, or, |=)
+ATOMIC64_OP(xor, xor, ^=)
 
 #undef ATOMIC64_OP
 
--- a/arch/mn10300/include/asm/atomic.h
+++ b/arch/mn10300/include/asm/atomic.h
@@ -85,6 +85,9 @@ static inline int atomic_##op##_return(i
 
 ATOMIC_OP(add)
 ATOMIC_OP(sub)
+ATOMIC_OP(and)
+ATOMIC_OP(or)
+ATOMIC_OP(xor)
 
 #undef ATOMIC_OP
 
@@ -122,73 +125,6 @@ static inline void atomic_dec(atomic_t *
 #define atomic_xchg(ptr, v)		(xchg(&(ptr)->counter, (v)))
 #define atomic_cmpxchg(v, old, new)	(cmpxchg(&((v)->counter), (old), (new)))
 
-/**
- * atomic_clear_mask - Atomically clear bits in memory
- * @mask: Mask of the bits to be cleared
- * @v: pointer to word in memory
- *
- * Atomically clears the bits set in mask from the memory word specified.
- */
-static inline void atomic_clear_mask(unsigned long mask, unsigned long *addr)
-{
-#ifdef CONFIG_SMP
-	int status;
-
-	asm volatile(
-		"1:	mov	%3,(_AAR,%2)	\n"
-		"	mov	(_ADR,%2),%0	\n"
-		"	and	%4,%0		\n"
-		"	mov	%0,(_ADR,%2)	\n"
-		"	mov	(_ADR,%2),%0	\n"	/* flush */
-		"	mov	(_ASR,%2),%0	\n"
-		"	or	%0,%0		\n"
-		"	bne	1b		\n"
-		: "=&r"(status), "=m"(*addr)
-		: "a"(ATOMIC_OPS_BASE_ADDR), "r"(addr), "r"(~mask)
-		: "memory", "cc");
-#else
-	unsigned long flags;
-
-	mask = ~mask;
-	flags = arch_local_cli_save();
-	*addr &= mask;
-	arch_local_irq_restore(flags);
-#endif
-}
-
-/**
- * atomic_set_mask - Atomically set bits in memory
- * @mask: Mask of the bits to be set
- * @v: pointer to word in memory
- *
- * Atomically sets the bits set in mask from the memory word specified.
- */
-static inline void atomic_set_mask(unsigned long mask, unsigned long *addr)
-{
-#ifdef CONFIG_SMP
-	int status;
-
-	asm volatile(
-		"1:	mov	%3,(_AAR,%2)	\n"
-		"	mov	(_ADR,%2),%0	\n"
-		"	or	%4,%0		\n"
-		"	mov	%0,(_ADR,%2)	\n"
-		"	mov	(_ADR,%2),%0	\n"	/* flush */
-		"	mov	(_ASR,%2),%0	\n"
-		"	or	%0,%0		\n"
-		"	bne	1b		\n"
-		: "=&r"(status), "=m"(*addr)
-		: "a"(ATOMIC_OPS_BASE_ADDR), "r"(addr), "r"(mask)
-		: "memory", "cc");
-#else
-	unsigned long flags;
-
-	flags = arch_local_cli_save();
-	*addr |= mask;
-	arch_local_irq_restore(flags);
-#endif
-}
-
 #endif /* __KERNEL__ */
 #endif /* CONFIG_SMP */
 #endif /* _ASM_ATOMIC_H */
--- a/arch/mn10300/mm/tlb-smp.c
+++ b/arch/mn10300/mm/tlb-smp.c
@@ -119,7 +119,7 @@ static void flush_tlb_others(cpumask_t c
 	flush_mm = mm;
 	flush_va = va;
 #if NR_CPUS <= BITS_PER_LONG
-	atomic_set_mask(cpumask.bits[0], &flush_cpumask.bits[0]);
+	atomic_or(cpumask.bits[0], &flush_cpumask.bits[0]);
 #else
 #error Not supported.
 #endif
--- a/arch/parisc/include/asm/atomic.h
+++ b/arch/parisc/include/asm/atomic.h
@@ -122,6 +122,9 @@ static __inline__ int atomic_##op##_retu
 
 ATOMIC_OP(add, +=)
 ATOMIC_OP(sub, -=)
+ATOMIC_OP(and, &=)
+ATOMIC_OP(or, |=)
+ATOMIC_OP(xor, ^=)
 
 #undef ATOMIC_OP
 
@@ -177,6 +180,9 @@ static __inline__ s64 atomic64_##op##_re
 
 ATOMIC64_OP(add, +=)
 ATOMIC64_OP(sub, -=)
+ATOMIC64_OP(and, &=)
+ATOMIC64_OP(or, |=)
+ATOMIC64_OP(xor, ^=)
 
 #undef ATOMIC64_OP
 
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -63,6 +63,9 @@ static __inline__ int atomic_##op##_retu
 
 ATOMIC_OP(add, add)
 ATOMIC_OP(sub, subf)
+ATOMIC_OP(and, and)
+ATOMIC_OP(or, or)
+ATOMIC_OP(xor, xor)
 
 #undef ATOMIC_OP
 
@@ -295,7 +298,10 @@ static __inline__ long atomic64_##op##_r
 }
 
 ATOMIC64_OP(add, add)
 ATOMIC64_OP(sub, subf)
+ATOMIC64_OP(and, and)
+ATOMIC64_OP(or, or)
+ATOMIC64_OP(xor, xor)
 
 #undef ATOMIC64_OP
 
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -593,25 +593,6 @@ _GLOBAL(copy_page)
 	b	2b
 
 /*
- * void atomic_clear_mask(atomic_t mask, atomic_t *addr)
- * void atomic_set_mask(atomic_t mask, atomic_t *addr);
- */
-_GLOBAL(atomic_clear_mask)
-10:	lwarx	r5,0,r4
-	andc	r5,r5,r3
-	PPC405_ERR77(0,r4)
-	stwcx.	r5,0,r4
-	bne-	10b
-	blr
-_GLOBAL(atomic_set_mask)
-10:	lwarx	r5,0,r4
-	or	r5,r5,r3
-	PPC405_ERR77(0,r4)
-	stwcx.	r5,0,r4
-	bne-	10b
-	blr
-
-/*
  * Extended precision shifts.
  *
  * Updated to be valid for shift counts from 0 to 63 inclusive.
--- a/arch/s390/include/asm/atomic.h
+++ b/arch/s390/include/asm/atomic.h
@@ -25,6 +25,7 @@
 #define __ATOMIC_OR	"lao"
 #define __ATOMIC_AND	"lan"
 #define __ATOMIC_ADD	"laa"
+#define __ATOMIC_XOR	"lax"
 
 #define __ATOMIC_LOOP(ptr, op_val, op_string)				\
 ({									\
@@ -44,6 +45,7 @@
 #define __ATOMIC_OR	"or"
 #define __ATOMIC_AND	"nr"
 #define __ATOMIC_ADD	"ar"
+#define __ATOMIC_XOR	"xr"
 
 #define __ATOMIC_LOOP(ptr, op_val, op_string)				\
 ({									\
@@ -114,15 +116,22 @@ static inline void atomic_add(int i, ato
 #define atomic_dec_return(_v)		atomic_sub_return(1, _v)
 #define atomic_dec_and_test(_v)		(atomic_sub_return(1, _v) == 0)
 
-static inline void atomic_clear_mask(unsigned int mask, atomic_t *v)
-{
-	__ATOMIC_LOOP(v, ~mask, __ATOMIC_AND);
+#define ATOMIC_OP(op, OP, cop)					\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	__ATOMIC_LOOP(v, i, __ATOMIC_##OP);				\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	return __ATOMIC_LOOP(v, i, __ATOMIC_##OP) cop i;		\
 }
 
-static inline void atomic_set_mask(unsigned int mask, atomic_t *v)
-{
-	__ATOMIC_LOOP(v, mask, __ATOMIC_OR);
-}
+ATOMIC_OP(and, AND, &)
+ATOMIC_OP(or, OR, |)
+ATOMIC_OP(xor, XOR, ^)
+
+#undef ATOMIC_OP
 
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 
@@ -163,6 +172,7 @@ static inline int __atomic_add_unless(at
 #define __ATOMIC64_OR	"laog"
 #define __ATOMIC64_AND	"lang"
 #define __ATOMIC64_ADD	"laag"
+#define __ATOMIC64_XOR	"laxg"
 
 #define __ATOMIC64_LOOP(ptr, op_val, op_string)				\
 ({									\
@@ -182,6 +192,7 @@ static inline int __atomic_add_unless(at
 #define __ATOMIC64_OR	"ogr"
 #define __ATOMIC64_AND	"ngr"
 #define __ATOMIC64_ADD	"agr"
+#define __ATOMIC64_XOR	"xgr"
 
 #define __ATOMIC64_LOOP(ptr, op_val, op_string)				\
 ({									\
@@ -224,16 +235,6 @@ static inline long long atomic64_add_ret
 	return __ATOMIC64_LOOP(v, i, __ATOMIC64_ADD) + i;
 }
 
-static inline void atomic64_clear_mask(unsigned long mask, atomic64_t *v)
-{
-	__ATOMIC64_LOOP(v, ~mask, __ATOMIC64_AND);
-}
-
-static inline void atomic64_set_mask(unsigned long mask, atomic64_t *v)
-{
-	__ATOMIC64_LOOP(v, mask, __ATOMIC64_OR);
-}
-
 #define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
 
 static inline long long atomic64_cmpxchg(atomic64_t *v,
@@ -247,6 +248,22 @@ static inline long long atomic64_cmpxchg
 	return old;
 }
 
+#define ATOMIC64_OP(op, OP, cop)					\
+static inline void atomic64_##op(long i, atomic64_t *v)			\
+{									\
+	__ATOMIC64_LOOP(v, i, __ATOMIC64_##OP);				\
+}									\
+									\
+static inline long atomic64_##op##_return(long i, atomic64_t *v)	\
+{									\
+	return __ATOMIC64_LOOP(v, i, __ATOMIC64_##OP) cop i;		\
+}
+
+ATOMIC64_OP(and, AND, &)
+ATOMIC64_OP(or, OR, |)
+ATOMIC64_OP(xor, XOR, ^)
+
+#undef ATOMIC64_OP
 #undef __ATOMIC64_LOOP
 
 #else /* CONFIG_64BIT */
@@ -315,25 +332,29 @@ static inline long long atomic64_add_ret
 	return new;
 }
 
-static inline void atomic64_set_mask(unsigned long long mask, atomic64_t *v)
-{
-	long long old, new;
-
-	do {
-		old = atomic64_read(v);
-		new = old | mask;
-	} while (atomic64_cmpxchg(v, old, new) != old);
+#define ATOMIC64_OP(op, OP, cop)					\
+static inline void atomic64_##op(long long i, atomic64_t *v)	\
+{									\
+	long long c, old;						\
+	c = atomic64_read(v);						\
+	while ((old = atomic64_cmpxchg(v, c, c cop i)) != c)		\
+		c = old;						\
+}									\
+									\
+static inline long long atomic64_##op##_return(long long i, atomic64_t *v)\
+{									\
+	long long c, old;						\
+	c = atomic64_read(v);						\
+	while ((old = atomic64_cmpxchg(v, c, c cop i)) != c)		\
+		c = old;						\
+	return c cop i;						\
 }
 
-static inline void atomic64_clear_mask(unsigned long long mask, atomic64_t *v)
-{
-	long long old, new;
+ATOMIC64_OP(and, AND, &)
+ATOMIC64_OP(or, OR, |)
+ATOMIC64_OP(xor, XOR, ^)
 
-	do {
-		old = atomic64_read(v);
-		new = old & mask;
-	} while (atomic64_cmpxchg(v, old, new) != old);
-}
+#undef ATOMIC64_OP
 
 #endif /* CONFIG_64BIT */
 
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -367,7 +367,7 @@ static void disable_sync_clock(void *dum
 	 * increase the "sequence" counter to avoid the race of an
 	 * etr event and the complete recovery against get_sync_clock.
 	 */
-	atomic_clear_mask(0x80000000, sw_ptr);
+	atomic_and(~0x80000000, sw_ptr);
 	atomic_inc(sw_ptr);
 }
 
@@ -378,7 +378,7 @@ static void disable_sync_clock(void *dum
 static void enable_sync_clock(void)
 {
 	atomic_t *sw_ptr = &__get_cpu_var(clock_sync_word);
-	atomic_set_mask(0x80000000, sw_ptr);
+	atomic_or(0x80000000, sw_ptr);
 }
 
 /*
--- a/arch/s390/kvm/diag.c
+++ b/arch/s390/kvm/diag.c
@@ -94,7 +94,7 @@ static int __diag_ipl_functions(struct k
 		return -EOPNOTSUPP;
 	}
 
-	atomic_set_mask(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
+	atomic_or(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
 	vcpu->run->s390_reset_flags |= KVM_S390_RESET_SUBSYSTEM;
 	vcpu->run->s390_reset_flags |= KVM_S390_RESET_IPL;
 	vcpu->run->s390_reset_flags |= KVM_S390_RESET_CPU_INIT;
--- a/arch/s390/kvm/intercept.c
+++ b/arch/s390/kvm/intercept.c
@@ -63,7 +63,7 @@ static int handle_stop(struct kvm_vcpu *
 	trace_kvm_s390_stop_request(vcpu->arch.local_int.action_bits);
 
 	if (vcpu->arch.local_int.action_bits & ACTION_STOP_ON_STOP) {
-		atomic_set_mask(CPUSTAT_STOPPED,
+		atomic_or(CPUSTAT_STOPPED,
 				&vcpu->arch.sie_block->cpuflags);
 		vcpu->arch.local_int.action_bits &= ~ACTION_STOP_ON_STOP;
 		VCPU_EVENT(vcpu, 3, "%s", "cpu stopped");
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -118,21 +118,21 @@ static int __interrupt_is_deliverable(st
 static void __set_cpu_idle(struct kvm_vcpu *vcpu)
 {
 	BUG_ON(vcpu->vcpu_id > KVM_MAX_VCPUS - 1);
-	atomic_set_mask(CPUSTAT_WAIT, &vcpu->arch.sie_block->cpuflags);
+	atomic_or(CPUSTAT_WAIT, &vcpu->arch.sie_block->cpuflags);
 	set_bit(vcpu->vcpu_id, vcpu->arch.local_int.float_int->idle_mask);
 }
 
 static void __unset_cpu_idle(struct kvm_vcpu *vcpu)
 {
 	BUG_ON(vcpu->vcpu_id > KVM_MAX_VCPUS - 1);
-	atomic_clear_mask(CPUSTAT_WAIT, &vcpu->arch.sie_block->cpuflags);
+	atomic_and(~CPUSTAT_WAIT, &vcpu->arch.sie_block->cpuflags);
 	clear_bit(vcpu->vcpu_id, vcpu->arch.local_int.float_int->idle_mask);
 }
 
 static void __reset_intercept_indicators(struct kvm_vcpu *vcpu)
 {
-	atomic_clear_mask(CPUSTAT_ECALL_PEND |
-		CPUSTAT_IO_INT | CPUSTAT_EXT_INT | CPUSTAT_STOP_INT,
+	atomic_and(~(CPUSTAT_ECALL_PEND |
+		CPUSTAT_IO_INT | CPUSTAT_EXT_INT | CPUSTAT_STOP_INT),
 		&vcpu->arch.sie_block->cpuflags);
 	vcpu->arch.sie_block->lctl = 0x0000;
 	vcpu->arch.sie_block->ictl &= ~ICTL_LPSW;
@@ -140,7 +140,7 @@ static void __reset_intercept_indicators
 
 static void __set_cpuflag(struct kvm_vcpu *vcpu, u32 flag)
 {
-	atomic_set_mask(flag, &vcpu->arch.sie_block->cpuflags);
+	atomic_or(flag, &vcpu->arch.sie_block->cpuflags);
 }
 
 static void __set_intercept_indicator(struct kvm_vcpu *vcpu,
@@ -269,7 +269,7 @@ static void __do_deliver_interrupt(struc
 		rc |= copy_from_guest(vcpu, &vcpu->arch.sie_block->gpsw,
 				      offsetof(struct _lowcore, restart_psw),
 				      sizeof(psw_t));
-		atomic_clear_mask(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
+		atomic_and(~CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
 		break;
 	case KVM_S390_PROGRAM_INT:
 		VCPU_EVENT(vcpu, 4, "interrupt: pgm check code:%x, ilc:%x",
@@ -748,7 +748,7 @@ int kvm_s390_inject_vm(struct kvm *kvm,
 	}
 	li = fi->local_int[sigcpu];
 	spin_lock_bh(&li->lock);
-	atomic_set_mask(CPUSTAT_EXT_INT, li->cpuflags);
+	atomic_or(CPUSTAT_EXT_INT, li->cpuflags);
 	if (waitqueue_active(li->wq))
 		wake_up_interruptible(li->wq);
 	spin_unlock_bh(&li->lock);
@@ -834,7 +834,7 @@ int kvm_s390_inject_vcpu(struct kvm_vcpu
 	atomic_set(&li->active, 1);
 	if (inti->type == KVM_S390_SIGP_STOP)
 		li->action_bits |= ACTION_STOP_ON_STOP;
-	atomic_set_mask(CPUSTAT_EXT_INT, li->cpuflags);
+	atomic_or(CPUSTAT_EXT_INT, li->cpuflags);
 	if (waitqueue_active(&vcpu->wq))
 		wake_up_interruptible(&vcpu->wq);
 	spin_unlock_bh(&li->lock);
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -350,12 +350,12 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
 	restore_fp_regs(vcpu->arch.guest_fpregs.fprs);
 	restore_access_regs(vcpu->run->s.regs.acrs);
 	gmap_enable(vcpu->arch.gmap);
-	atomic_set_mask(CPUSTAT_RUNNING, &vcpu->arch.sie_block->cpuflags);
+	atomic_or(CPUSTAT_RUNNING, &vcpu->arch.sie_block->cpuflags);
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
-	atomic_clear_mask(CPUSTAT_RUNNING, &vcpu->arch.sie_block->cpuflags);
+	atomic_and(~CPUSTAT_RUNNING, &vcpu->arch.sie_block->cpuflags);
 	gmap_disable(vcpu->arch.gmap);
 	save_fp_ctl(&vcpu->arch.guest_fpregs.fpc);
 	save_fp_regs(vcpu->arch.guest_fpregs.fprs);
@@ -380,7 +380,7 @@ static void kvm_s390_vcpu_initial_reset(
 	vcpu->arch.guest_fpregs.fpc = 0;
 	asm volatile("lfpc %0" : : "Q" (vcpu->arch.guest_fpregs.fpc));
 	vcpu->arch.sie_block->gbea = 1;
-	atomic_set_mask(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
+	atomic_or(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
 }
 
 int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
@@ -482,12 +482,12 @@ int kvm_arch_vcpu_runnable(struct kvm_vc
 
 void s390_vcpu_block(struct kvm_vcpu *vcpu)
 {
-	atomic_set_mask(PROG_BLOCK_SIE, &vcpu->arch.sie_block->prog20);
+	atomic_or(PROG_BLOCK_SIE, &vcpu->arch.sie_block->prog20);
 }
 
 void s390_vcpu_unblock(struct kvm_vcpu *vcpu)
 {
-	atomic_clear_mask(PROG_BLOCK_SIE, &vcpu->arch.sie_block->prog20);
+	atomic_and(~PROG_BLOCK_SIE, &vcpu->arch.sie_block->prog20);
 }
 
 /*
@@ -496,7 +496,7 @@ void s390_vcpu_unblock(struct kvm_vcpu *
  * return immediately. */
 void exit_sie(struct kvm_vcpu *vcpu)
 {
-	atomic_set_mask(CPUSTAT_STOP_INT, &vcpu->arch.sie_block->cpuflags);
+	atomic_or(CPUSTAT_STOP_INT, &vcpu->arch.sie_block->cpuflags);
 	while (vcpu->arch.sie_block->prog0c & PROG_IN_SIE)
 		cpu_relax();
 }
@@ -804,7 +804,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_v
 	if (vcpu->sigset_active)
 		sigprocmask(SIG_SETMASK, &vcpu->sigset, &sigsaved);
 
-	atomic_clear_mask(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
+	atomic_and(~CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
 
 	BUG_ON(vcpu->kvm->arch.float_int.local_int[vcpu->vcpu_id] == NULL);
 
--- a/arch/s390/kvm/sigp.c
+++ b/arch/s390/kvm/sigp.c
@@ -78,7 +78,7 @@ static int __sigp_emergency(struct kvm_v
 	spin_lock_bh(&li->lock);
 	list_add_tail(&inti->list, &li->list);
 	atomic_set(&li->active, 1);
-	atomic_set_mask(CPUSTAT_EXT_INT, li->cpuflags);
+	atomic_or(CPUSTAT_EXT_INT, li->cpuflags);
 	if (waitqueue_active(li->wq))
 		wake_up_interruptible(li->wq);
 	spin_unlock_bh(&li->lock);
@@ -147,7 +147,7 @@ static int __sigp_external_call(struct k
 	spin_lock_bh(&li->lock);
 	list_add_tail(&inti->list, &li->list);
 	atomic_set(&li->active, 1);
-	atomic_set_mask(CPUSTAT_EXT_INT, li->cpuflags);
+	atomic_or(CPUSTAT_EXT_INT, li->cpuflags);
 	if (waitqueue_active(li->wq))
 		wake_up_interruptible(li->wq);
 	spin_unlock_bh(&li->lock);
@@ -177,7 +177,7 @@ static int __inject_sigp_stop(struct kvm
 	}
 	list_add_tail(&inti->list, &li->list);
 	atomic_set(&li->active, 1);
-	atomic_set_mask(CPUSTAT_STOP_INT, li->cpuflags);
+	atomic_or(CPUSTAT_STOP_INT, li->cpuflags);
 	li->action_bits |= action;
 	if (waitqueue_active(li->wq))
 		wake_up_interruptible(li->wq);
--- a/arch/sh/include/asm/atomic-grb.h
+++ b/arch/sh/include/asm/atomic-grb.h
@@ -44,46 +44,10 @@ static inline int atomic_##op##_return(i
 
 ATOMIC_OP(add)
 ATOMIC_OP(sub)
+ATOMIC_OP(and)
+ATOMIC_OP(or)
+ATOMIC_OP(xor)
 
 #undef ATOMIC_OP
 
-static inline void atomic_clear_mask(unsigned int mask, atomic_t *v)
-{
-	int tmp;
-	unsigned int _mask = ~mask;
-
-	__asm__ __volatile__ (
-		"   .align 2              \n\t"
-		"   mova    1f,   r0      \n\t" /* r0 = end point */
-		"   mov    r15,   r1      \n\t" /* r1 = saved sp */
-		"   mov    #-6,   r15     \n\t" /* LOGIN: r15 = size */
-		"   mov.l  @%1,   %0      \n\t" /* load  old value */
-		"   and     %2,   %0      \n\t" /* add */
-		"   mov.l   %0,   @%1     \n\t" /* store new value */
-		"1: mov     r1,   r15     \n\t" /* LOGOUT */
-		: "=&r" (tmp),
-		  "+r"  (v)
-		: "r"   (_mask)
-		: "memory" , "r0", "r1");
-}
-
-static inline void atomic_set_mask(unsigned int mask, atomic_t *v)
-{
-	int tmp;
-
-	__asm__ __volatile__ (
-		"   .align 2              \n\t"
-		"   mova    1f,   r0      \n\t" /* r0 = end point */
-		"   mov    r15,   r1      \n\t" /* r1 = saved sp */
-		"   mov    #-6,   r15     \n\t" /* LOGIN: r15 = size */
-		"   mov.l  @%1,   %0      \n\t" /* load  old value */
-		"   or      %2,   %0      \n\t" /* or */
-		"   mov.l   %0,   @%1     \n\t" /* store new value */
-		"1: mov     r1,   r15     \n\t" /* LOGOUT */
-		: "=&r" (tmp),
-		  "+r"  (v)
-		: "r"   (mask)
-		: "memory" , "r0", "r1");
-}
-
 #endif /* __ASM_SH_ATOMIC_GRB_H */
--- a/arch/sh/include/asm/atomic-irq.h
+++ b/arch/sh/include/asm/atomic-irq.h
@@ -34,25 +34,10 @@ static inline int atomic_##op##_return(i
 
 ATOMIC_OP(add, +=)
 ATOMIC_OP(sub, -=)
+ATOMIC_OP(and, &=)
+ATOMIC_OP(or, |=)
+ATOMIC_OP(xor, ^=)
 
 #undef ATOMIC_OP
 
-static inline void atomic_clear_mask(unsigned int mask, atomic_t *v)
-{
-	unsigned long flags;
-
-	raw_local_irq_save(flags);
-	v->counter &= ~mask;
-	raw_local_irq_restore(flags);
-}
-
-static inline void atomic_set_mask(unsigned int mask, atomic_t *v)
-{
-	unsigned long flags;
-
-	raw_local_irq_save(flags);
-	v->counter |= mask;
-	raw_local_irq_restore(flags);
-}
-
 #endif /* __ASM_SH_ATOMIC_IRQ_H */
--- a/arch/sh/include/asm/atomic-llsc.h
+++ b/arch/sh/include/asm/atomic-llsc.h
@@ -49,35 +49,10 @@ static inline int atomic_##op##_return(i
 
 ATOMIC_OP(add)
 ATOMIC_OP(sub)
+ATOMIC_OP(and)
+ATOMIC_OP(or)
+ATOMIC_OP(xor)
 
 #undef ATOMIC_OP
 
-static inline void atomic_clear_mask(unsigned int mask, atomic_t *v)
-{
-	unsigned long tmp;
-
-	__asm__ __volatile__ (
-"1:	movli.l @%2, %0		! atomic_clear_mask	\n"
-"	and	%1, %0					\n"
-"	movco.l	%0, @%2					\n"
-"	bf	1b					\n"
-	: "=&z" (tmp)
-	: "r" (~mask), "r" (&v->counter)
-	: "t");
-}
-
-static inline void atomic_set_mask(unsigned int mask, atomic_t *v)
-{
-	unsigned long tmp;
-
-	__asm__ __volatile__ (
-"1:	movli.l @%2, %0		! atomic_set_mask	\n"
-"	or	%1, %0					\n"
-"	movco.l	%0, @%2					\n"
-"	bf	1b					\n"
-	: "=&z" (tmp)
-	: "r" (mask), "r" (&v->counter)
-	: "t");
-}
-
 #endif /* __ASM_SH_ATOMIC_LLSC_H */
--- a/arch/sparc/include/asm/atomic_32.h
+++ b/arch/sparc/include/asm/atomic_32.h
@@ -39,6 +39,10 @@ extern void atomic_set(atomic_t *, int);
 
 #define atomic_add_negative(a, v)	(atomic_add_return((a), (v)) < 0)
 
+#define atomic_and(i, v)	((void)atomic_and_return(i, v))
+#define atomic_or(i, v)		((void)atomic_or_return(i, v))
+#define atomic_xor(i, v)	((void)atomic_xor_return(i, v))
+
 /*
  * atomic_inc_and_test - increment and test
  * @v: pointer of type atomic_t
--- a/arch/sparc/include/asm/atomic_64.h
+++ b/arch/sparc/include/asm/atomic_64.h
@@ -28,6 +28,9 @@ extern long atomic64_##op##_return(long,
 
 ATOMIC_OP(add)
 ATOMIC_OP(sub)
+ATOMIC_OP(and)
+ATOMIC_OP(or)
+ATOMIC_OP(xor)
 
 #undef ATOMIC_OP
 
--- a/arch/sparc/lib/atomic32.c
+++ b/arch/sparc/lib/atomic32.c
@@ -42,8 +42,9 @@ int atomic_##op##_return(int i, atomic_t
 EXPORT_SYMBOL(atomic_##op##_return);
 
 ATOMIC_OP(add, +=)
-
-#undef ATOMIC_OP
+ATOMIC_OP(and, &=)
+ATOMIC_OP(or, |=)
+ATOMIC_OP(xor, ^=)
 
 int atomic_cmpxchg(atomic_t *v, int old, int new)
 {
--- a/arch/sparc/lib/atomic_64.S
+++ b/arch/sparc/lib/atomic_64.S
@@ -44,6 +44,9 @@ ENDPROC(atomic_##op##_return);
 
 ATOMIC_OP(add)
 ATOMIC_OP(sub)
+ATOMIC_OP(and)
+ATOMIC_OP(or)
+ATOMIC_OP(xor)
 
 #define ATOMIC64_OP(op)							\
 ENTRY(atomic64_##op) /* %o0 = increment, %o1 = atomic_ptr */		\
@@ -74,6 +77,9 @@ ENDPROC(atomic64_##op##_return);
 
 ATOMIC64_OP(add)
 ATOMIC64_OP(sub)
+ATOMIC64_OP(and)
+ATOMIC64_OP(or)
+ATOMIC64_OP(xor)
 
 ENTRY(atomic64_dec_if_positive) /* %o0 = atomic_ptr */
 	BACKOFF_SETUP(%o2)
--- a/arch/sparc/lib/ksyms.c
+++ b/arch/sparc/lib/ksyms.c
@@ -107,6 +107,9 @@ EXPORT_SYMBOL(atomic64_##op##_return);
 
 ATOMIC_OP(add)
 ATOMIC_OP(sub)
+ATOMIC_OP(and)
+ATOMIC_OP(or)
+ATOMIC_OP(xor)
 
 EXPORT_SYMBOL(atomic64_dec_if_positive);
 
--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -182,6 +182,28 @@ static inline int atomic_xchg(atomic_t *
 	return xchg(&v->counter, new);
 }
 
+#define ATOMIC_OP(op, cop)						\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	asm volatile(LOCK_PREFIX #op"l %1,%0"				\
+			: "+m" (v->counter)				\
+			: "ir" (i));					\
+}									\
+									\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
+{									\
+	int old, c = atomic_read(v);					\
+	while ((old = atomic_cmpxchg(v, c, c cop i)) != c)		\
+		c = old;						\
+	return c cop i;							\
+}
+
+ATOMIC_OP(and, &)
+ATOMIC_OP(or, |)
+ATOMIC_OP(xor, ^)
+
+#undef ATOMIC_OP
+
 /**
  * __atomic_add_unless - add unless the number is already a given value
  * @v: pointer of type atomic_t
@@ -234,16 +256,6 @@ static inline void atomic_or_long(unsign
 }
 #endif
 
-/* These are x86-specific, used by some header files */
-#define atomic_clear_mask(mask, addr)				\
-	asm volatile(LOCK_PREFIX "andl %0,%1"			\
-		     : : "r" (~(mask)), "m" (*(addr)) : "memory")
-
-#define atomic_set_mask(mask, addr)				\
-	asm volatile(LOCK_PREFIX "orl %0,%1"			\
-		     : : "r" ((unsigned)(mask)), "m" (*(addr))	\
-		     : "memory")
-
 #ifdef CONFIG_X86_32
 # include <asm/atomic64_32.h>
 #else
--- a/arch/x86/include/asm/atomic64_32.h
+++ b/arch/x86/include/asm/atomic64_32.h
@@ -313,4 +313,24 @@ static inline long long atomic64_dec_if_
 #undef alternative_atomic64
 #undef __alternative_atomic64
 
+#define ATOMIC_OP(op, cop)						\
+static inline long long atomic64_##op##_return(long long i, atomic64_t *v)\
+{									\
+	long long old, c = 0;						\
+	while ((old = atomic64_cmpxchg(v, c, c cop i)) != c)		\
+		c = old;						\
+	return c cop i;							\
+}									\
+									\
+static inline void atomic64_##op(long long i, atomic64_t *v)		\
+{									\
+	atomic64_##op##_return(i, v);					\
+}
+
+ATOMIC_OP(and, &)
+ATOMIC_OP(or, |)
+ATOMIC_OP(xor, ^)
+
+#undef ATOMIC_OP
+
 #endif /* _ASM_X86_ATOMIC64_32_H */
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -220,4 +220,26 @@ static inline long atomic64_dec_if_posit
 	return dec;
 }
 
+#define ATOMIC64_OP(op, cop)						\
+static inline void atomic64_##op(long i, atomic64_t *v)			\
+{									\
+	asm volatile(LOCK_PREFIX #op"q %1,%0"				\
+			: "+m" (v->counter)				\
+			: "ir" (i));					\
+}									\
+									\
+static inline long atomic64_##op##_return(long i, atomic64_t *v)	\
+{									\
+	long old, c = atomic64_read(v);					\
+	while ((old = atomic64_cmpxchg(v, c, c cop i)) != c)		\
+		c = old;						\
+	return c cop i;							\
+}
+
+ATOMIC64_OP(and, &)
+ATOMIC64_OP(or, |)
+ATOMIC64_OP(xor, ^)
+
+#undef ATOMIC64_OP
+
 #endif /* _ASM_X86_ATOMIC64_64_H */
--- a/arch/xtensa/include/asm/atomic.h
+++ b/arch/xtensa/include/asm/atomic.h
@@ -140,6 +140,9 @@ static inline int atomic_##op##_return(i
 
 ATOMIC_OP(add)
 ATOMIC_OP(sub)
+ATOMIC_OP(and)
+ATOMIC_OP(or)
+ATOMIC_OP(xor)
 
 #undef ATOMIC_OP
 
@@ -244,75 +247,6 @@ static __inline__ int __atomic_add_unles
 	return c;
 }
 
-
-static inline void atomic_clear_mask(unsigned int mask, atomic_t *v)
-{
-#if XCHAL_HAVE_S32C1I
-	unsigned long tmp;
-	int result;
-
-	__asm__ __volatile__(
-			"1:     l32i    %1, %3, 0\n"
-			"       wsr     %1, scompare1\n"
-			"       and     %0, %1, %2\n"
-			"       s32c1i  %0, %3, 0\n"
-			"       bne     %0, %1, 1b\n"
-			: "=&a" (result), "=&a" (tmp)
-			: "a" (~mask), "a" (v)
-			: "memory"
-			);
-#else
-	unsigned int all_f = -1;
-	unsigned int vval;
-
-	__asm__ __volatile__(
-			"       rsil    a15,"__stringify(LOCKLEVEL)"\n"
-			"       l32i    %0, %2, 0\n"
-			"       xor     %1, %4, %3\n"
-			"       and     %0, %0, %4\n"
-			"       s32i    %0, %2, 0\n"
-			"       wsr     a15, ps\n"
-			"       rsync\n"
-			: "=&a" (vval), "=a" (mask)
-			: "a" (v), "a" (all_f), "1" (mask)
-			: "a15", "memory"
-			);
-#endif
-}
-
-static inline void atomic_set_mask(unsigned int mask, atomic_t *v)
-{
-#if XCHAL_HAVE_S32C1I
-	unsigned long tmp;
-	int result;
-
-	__asm__ __volatile__(
-			"1:     l32i    %1, %3, 0\n"
-			"       wsr     %1, scompare1\n"
-			"       or      %0, %1, %2\n"
-			"       s32c1i  %0, %3, 0\n"
-			"       bne     %0, %1, 1b\n"
-			: "=&a" (result), "=&a" (tmp)
-			: "a" (mask), "a" (v)
-			: "memory"
-			);
-#else
-	unsigned int vval;
-
-	__asm__ __volatile__(
-			"       rsil    a15,"__stringify(LOCKLEVEL)"\n"
-			"       l32i    %0, %2, 0\n"
-			"       or      %0, %0, %1\n"
-			"       s32i    %0, %2, 0\n"
-			"       wsr     a15, ps\n"
-			"       rsync\n"
-			: "=&a" (vval)
-			: "a" (mask), "a" (v)
-			: "a15", "memory"
-			);
-#endif
-}
-
 #endif /* __KERNEL__ */
 
 #endif /* _XTENSA_ATOMIC_H */
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2002,7 +2002,7 @@ static void i915_error_work_func(struct
 			kobject_uevent_env(&dev->primary->kdev->kobj,
 					   KOBJ_CHANGE, reset_done_event);
 		} else {
-			atomic_set_mask(I915_WEDGED, &error->reset_counter);
+			atomic_or(I915_WEDGED, &error->reset_counter);
 		}
 
 		/*
@@ -2123,7 +2123,7 @@ void i915_handle_error(struct drm_device
 	i915_report_and_clear_eir(dev);
 
 	if (wedged) {
-		atomic_set_mask(I915_RESET_IN_PROGRESS_FLAG,
+		atomic_or(I915_RESET_IN_PROGRESS_FLAG,
 				&dev_priv->gpu_error.reset_counter);
 
 		/*
--- a/drivers/s390/scsi/zfcp_aux.c
+++ b/drivers/s390/scsi/zfcp_aux.c
@@ -527,7 +527,7 @@ struct zfcp_port *zfcp_port_enqueue(stru
 	list_add_tail(&port->list, &adapter->port_list);
 	write_unlock_irq(&adapter->port_list_lock);
 
-	atomic_set_mask(status | ZFCP_STATUS_COMMON_RUNNING, &port->status);
+	atomic_or(status | ZFCP_STATUS_COMMON_RUNNING, &port->status);
 
 	return port;
 
--- a/drivers/s390/scsi/zfcp_erp.c
+++ b/drivers/s390/scsi/zfcp_erp.c
@@ -190,7 +190,7 @@ static struct zfcp_erp_action *zfcp_erp_
 		if (!(act_status & ZFCP_STATUS_ERP_NO_REF))
 			if (scsi_device_get(sdev))
 				return NULL;
-		atomic_set_mask(ZFCP_STATUS_COMMON_ERP_INUSE,
+		atomic_or(ZFCP_STATUS_COMMON_ERP_INUSE,
 				&zfcp_sdev->status);
 		erp_action = &zfcp_sdev->erp_action;
 		memset(erp_action, 0, sizeof(struct zfcp_erp_action));
@@ -206,7 +206,7 @@ static struct zfcp_erp_action *zfcp_erp_
 		if (!get_device(&port->dev))
 			return NULL;
 		zfcp_erp_action_dismiss_port(port);
-		atomic_set_mask(ZFCP_STATUS_COMMON_ERP_INUSE, &port->status);
+		atomic_or(ZFCP_STATUS_COMMON_ERP_INUSE, &port->status);
 		erp_action = &port->erp_action;
 		memset(erp_action, 0, sizeof(struct zfcp_erp_action));
 		erp_action->port = port;
@@ -217,7 +217,7 @@ static struct zfcp_erp_action *zfcp_erp_
 	case ZFCP_ERP_ACTION_REOPEN_ADAPTER:
 		kref_get(&adapter->ref);
 		zfcp_erp_action_dismiss_adapter(adapter);
-		atomic_set_mask(ZFCP_STATUS_COMMON_ERP_INUSE, &adapter->status);
+		atomic_or(ZFCP_STATUS_COMMON_ERP_INUSE, &adapter->status);
 		erp_action = &adapter->erp_action;
 		memset(erp_action, 0, sizeof(struct zfcp_erp_action));
 		if (!(atomic_read(&adapter->status) &
@@ -254,7 +254,7 @@ static int zfcp_erp_action_enqueue(int w
 	act = zfcp_erp_setup_act(need, act_status, adapter, port, sdev);
 	if (!act)
 		goto out;
-	atomic_set_mask(ZFCP_STATUS_ADAPTER_ERP_PENDING, &adapter->status);
+	atomic_or(ZFCP_STATUS_ADAPTER_ERP_PENDING, &adapter->status);
 	++adapter->erp_total_count;
 	list_add_tail(&act->list, &adapter->erp_ready_head);
 	wake_up(&adapter->erp_ready_wq);
@@ -486,14 +486,14 @@ static void zfcp_erp_adapter_unblock(str
 {
 	if (status_change_set(ZFCP_STATUS_COMMON_UNBLOCKED, &adapter->status))
 		zfcp_dbf_rec_run("eraubl1", &adapter->erp_action);
-	atomic_set_mask(ZFCP_STATUS_COMMON_UNBLOCKED, &adapter->status);
+	atomic_or(ZFCP_STATUS_COMMON_UNBLOCKED, &adapter->status);
 }
 
 static void zfcp_erp_port_unblock(struct zfcp_port *port)
 {
 	if (status_change_set(ZFCP_STATUS_COMMON_UNBLOCKED, &port->status))
 		zfcp_dbf_rec_run("erpubl1", &port->erp_action);
-	atomic_set_mask(ZFCP_STATUS_COMMON_UNBLOCKED, &port->status);
+	atomic_or(ZFCP_STATUS_COMMON_UNBLOCKED, &port->status);
 }
 
 static void zfcp_erp_lun_unblock(struct scsi_device *sdev)
@@ -502,7 +502,7 @@ static void zfcp_erp_lun_unblock(struct
 
 	if (status_change_set(ZFCP_STATUS_COMMON_UNBLOCKED, &zfcp_sdev->status))
 		zfcp_dbf_rec_run("erlubl1", &sdev_to_zfcp(sdev)->erp_action);
-	atomic_set_mask(ZFCP_STATUS_COMMON_UNBLOCKED, &zfcp_sdev->status);
+	atomic_or(ZFCP_STATUS_COMMON_UNBLOCKED, &zfcp_sdev->status);
 }
 
 static void zfcp_erp_action_to_running(struct zfcp_erp_action *erp_action)
@@ -642,7 +642,7 @@ static void zfcp_erp_wakeup(struct zfcp_
 	read_lock_irqsave(&adapter->erp_lock, flags);
 	if (list_empty(&adapter->erp_ready_head) &&
 	    list_empty(&adapter->erp_running_head)) {
-			atomic_clear_mask(ZFCP_STATUS_ADAPTER_ERP_PENDING,
+			atomic_and(~ZFCP_STATUS_ADAPTER_ERP_PENDING,
 					  &adapter->status);
 			wake_up(&adapter->erp_done_wqh);
 	}
@@ -665,16 +665,16 @@ static int zfcp_erp_adapter_strat_fsf_xc
 	int sleep = 1;
 	struct zfcp_adapter *adapter = erp_action->adapter;
 
-	atomic_clear_mask(ZFCP_STATUS_ADAPTER_XCONFIG_OK, &adapter->status);
+	atomic_and(~ZFCP_STATUS_ADAPTER_XCONFIG_OK, &adapter->status);
 
 	for (retries = 7; retries; retries--) {
-		atomic_clear_mask(ZFCP_STATUS_ADAPTER_HOST_CON_INIT,
+		atomic_and(~ZFCP_STATUS_ADAPTER_HOST_CON_INIT,
 				  &adapter->status);
 		write_lock_irq(&adapter->erp_lock);
 		zfcp_erp_action_to_running(erp_action);
 		write_unlock_irq(&adapter->erp_lock);
 		if (zfcp_fsf_exchange_config_data(erp_action)) {
-			atomic_clear_mask(ZFCP_STATUS_ADAPTER_HOST_CON_INIT,
+			atomic_and(~ZFCP_STATUS_ADAPTER_HOST_CON_INIT,
 					  &adapter->status);
 			return ZFCP_ERP_FAILED;
 		}
@@ -692,7 +692,7 @@ static int zfcp_erp_adapter_strat_fsf_xc
 		sleep *= 2;
 	}
 
-	atomic_clear_mask(ZFCP_STATUS_ADAPTER_HOST_CON_INIT,
+	atomic_and(~ZFCP_STATUS_ADAPTER_HOST_CON_INIT,
 			  &adapter->status);
 
 	if (!(atomic_read(&adapter->status) & ZFCP_STATUS_ADAPTER_XCONFIG_OK))
@@ -764,8 +764,8 @@ static void zfcp_erp_adapter_strategy_cl
 	/* all ports and LUNs are closed */
 	zfcp_erp_clear_adapter_status(adapter, ZFCP_STATUS_COMMON_OPEN);
 
-	atomic_clear_mask(ZFCP_STATUS_ADAPTER_XCONFIG_OK |
-			  ZFCP_STATUS_ADAPTER_LINK_UNPLUGGED, &adapter->status);
+	atomic_and(~(ZFCP_STATUS_ADAPTER_XCONFIG_OK |
+			  ZFCP_STATUS_ADAPTER_LINK_UNPLUGGED), &adapter->status);
 }
 
 static int zfcp_erp_adapter_strategy_open(struct zfcp_erp_action *act)
@@ -773,8 +773,8 @@ static int zfcp_erp_adapter_strategy_ope
 	struct zfcp_adapter *adapter = act->adapter;
 
 	if (zfcp_qdio_open(adapter->qdio)) {
-		atomic_clear_mask(ZFCP_STATUS_ADAPTER_XCONFIG_OK |
-				  ZFCP_STATUS_ADAPTER_LINK_UNPLUGGED,
+		atomic_and(~(ZFCP_STATUS_ADAPTER_XCONFIG_OK |
+				  ZFCP_STATUS_ADAPTER_LINK_UNPLUGGED),
 				  &adapter->status);
 		return ZFCP_ERP_FAILED;
 	}
@@ -784,7 +784,7 @@ static int zfcp_erp_adapter_strategy_ope
 		return ZFCP_ERP_FAILED;
 	}
 
-	atomic_set_mask(ZFCP_STATUS_COMMON_OPEN, &adapter->status);
+	atomic_or(ZFCP_STATUS_COMMON_OPEN, &adapter->status);
 
 	return ZFCP_ERP_SUCCEEDED;
 }
@@ -823,7 +823,7 @@ static int zfcp_erp_port_forced_strategy
 
 static void zfcp_erp_port_strategy_clearstati(struct zfcp_port *port)
 {
-	atomic_clear_mask(ZFCP_STATUS_COMMON_ACCESS_DENIED, &port->status);
+	atomic_and(~ZFCP_STATUS_COMMON_ACCESS_DENIED, &port->status);
 }
 
 static int zfcp_erp_port_forced_strategy(struct zfcp_erp_action *erp_action)
@@ -955,7 +955,7 @@ static void zfcp_erp_lun_strategy_clears
 {
 	struct zfcp_scsi_dev *zfcp_sdev = sdev_to_zfcp(sdev);
 
-	atomic_clear_mask(ZFCP_STATUS_COMMON_ACCESS_DENIED,
+	atomic_and(~ZFCP_STATUS_COMMON_ACCESS_DENIED,
 			  &zfcp_sdev->status);
 }
 
@@ -1194,18 +1194,18 @@ static void zfcp_erp_action_dequeue(stru
 	switch (erp_action->action) {
 	case ZFCP_ERP_ACTION_REOPEN_LUN:
 		zfcp_sdev = sdev_to_zfcp(erp_action->sdev);
-		atomic_clear_mask(ZFCP_STATUS_COMMON_ERP_INUSE,
+		atomic_and(~ZFCP_STATUS_COMMON_ERP_INUSE,
 				  &zfcp_sdev->status);
 		break;
 
 	case ZFCP_ERP_ACTION_REOPEN_PORT_FORCED:
 	case ZFCP_ERP_ACTION_REOPEN_PORT:
-		atomic_clear_mask(ZFCP_STATUS_COMMON_ERP_INUSE,
+		atomic_and(~ZFCP_STATUS_COMMON_ERP_INUSE,
 				  &erp_action->port->status);
 		break;
 
 	case ZFCP_ERP_ACTION_REOPEN_ADAPTER:
-		atomic_clear_mask(ZFCP_STATUS_COMMON_ERP_INUSE,
+		atomic_and(~ZFCP_STATUS_COMMON_ERP_INUSE,
 				  &erp_action->adapter->status);
 		break;
 	}
@@ -1429,19 +1429,19 @@ void zfcp_erp_set_adapter_status(struct
 	unsigned long flags;
 	u32 common_mask = mask & ZFCP_COMMON_FLAGS;
 
-	atomic_set_mask(mask, &adapter->status);
+	atomic_or(mask, &adapter->status);
 
 	if (!common_mask)
 		return;
 
 	read_lock_irqsave(&adapter->port_list_lock, flags);
 	list_for_each_entry(port, &adapter->port_list, list)
-		atomic_set_mask(common_mask, &port->status);
+		atomic_or(common_mask, &port->status);
 	read_unlock_irqrestore(&adapter->port_list_lock, flags);
 
 	spin_lock_irqsave(adapter->scsi_host->host_lock, flags);
 	__shost_for_each_device(sdev, adapter->scsi_host)
-		atomic_set_mask(common_mask, &sdev_to_zfcp(sdev)->status);
+		atomic_or(common_mask, &sdev_to_zfcp(sdev)->status);
 	spin_unlock_irqrestore(adapter->scsi_host->host_lock, flags);
 }
 
@@ -1460,7 +1460,7 @@ void zfcp_erp_clear_adapter_status(struc
 	u32 common_mask = mask & ZFCP_COMMON_FLAGS;
 	u32 clear_counter = mask & ZFCP_STATUS_COMMON_ERP_FAILED;
 
-	atomic_clear_mask(mask, &adapter->status);
+	atomic_and(~mask, &adapter->status);
 
 	if (!common_mask)
 		return;
@@ -1470,7 +1470,7 @@ void zfcp_erp_clear_adapter_status(struc
 
 	read_lock_irqsave(&adapter->port_list_lock, flags);
 	list_for_each_entry(port, &adapter->port_list, list) {
-		atomic_clear_mask(common_mask, &port->status);
+		atomic_and(~common_mask, &port->status);
 		if (clear_counter)
 			atomic_set(&port->erp_counter, 0);
 	}
@@ -1478,7 +1478,7 @@ void zfcp_erp_clear_adapter_status(struc
 
 	spin_lock_irqsave(adapter->scsi_host->host_lock, flags);
 	__shost_for_each_device(sdev, adapter->scsi_host) {
-		atomic_clear_mask(common_mask, &sdev_to_zfcp(sdev)->status);
+		atomic_and(~common_mask, &sdev_to_zfcp(sdev)->status);
 		if (clear_counter)
 			atomic_set(&sdev_to_zfcp(sdev)->erp_counter, 0);
 	}
@@ -1498,7 +1498,7 @@ void zfcp_erp_set_port_status(struct zfc
 	u32 common_mask = mask & ZFCP_COMMON_FLAGS;
 	unsigned long flags;
 
-	atomic_set_mask(mask, &port->status);
+	atomic_or(mask, &port->status);
 
 	if (!common_mask)
 		return;
@@ -1506,7 +1506,7 @@ void zfcp_erp_set_port_status(struct zfc
 	spin_lock_irqsave(port->adapter->scsi_host->host_lock, flags);
 	__shost_for_each_device(sdev, port->adapter->scsi_host)
 		if (sdev_to_zfcp(sdev)->port == port)
-			atomic_set_mask(common_mask,
+			atomic_or(common_mask,
 					&sdev_to_zfcp(sdev)->status);
 	spin_unlock_irqrestore(port->adapter->scsi_host->host_lock, flags);
 }
@@ -1525,7 +1525,7 @@ void zfcp_erp_clear_port_status(struct z
 	u32 clear_counter = mask & ZFCP_STATUS_COMMON_ERP_FAILED;
 	unsigned long flags;
 
-	atomic_clear_mask(mask, &port->status);
+	atomic_and(~mask, &port->status);
 
 	if (!common_mask)
 		return;
@@ -1536,7 +1536,7 @@ void zfcp_erp_clear_port_status(struct z
 	spin_lock_irqsave(port->adapter->scsi_host->host_lock, flags);
 	__shost_for_each_device(sdev, port->adapter->scsi_host)
 		if (sdev_to_zfcp(sdev)->port == port) {
-			atomic_clear_mask(common_mask,
+			atomic_and(~common_mask,
 					  &sdev_to_zfcp(sdev)->status);
 			if (clear_counter)
 				atomic_set(&sdev_to_zfcp(sdev)->erp_counter, 0);
@@ -1553,7 +1553,7 @@ void zfcp_erp_set_lun_status(struct scsi
 {
 	struct zfcp_scsi_dev *zfcp_sdev = sdev_to_zfcp(sdev);
 
-	atomic_set_mask(mask, &zfcp_sdev->status);
+	atomic_or(mask, &zfcp_sdev->status);
 }
 
 /**
@@ -1565,7 +1565,7 @@ void zfcp_erp_clear_lun_status(struct sc
 {
 	struct zfcp_scsi_dev *zfcp_sdev = sdev_to_zfcp(sdev);
 
-	atomic_clear_mask(mask, &zfcp_sdev->status);
+	atomic_and(~mask, &zfcp_sdev->status);
 
 	if (mask & ZFCP_STATUS_COMMON_ERP_FAILED)
 		atomic_set(&zfcp_sdev->erp_counter, 0);
--- a/drivers/s390/scsi/zfcp_fc.c
+++ b/drivers/s390/scsi/zfcp_fc.c
@@ -465,7 +465,7 @@ static void zfcp_fc_adisc_handler(void *
 	/* port is good, unblock rport without going through erp */
 	zfcp_scsi_schedule_rport_register(port);
  out:
-	atomic_clear_mask(ZFCP_STATUS_PORT_LINK_TEST, &port->status);
+	atomic_and(~ZFCP_STATUS_PORT_LINK_TEST, &port->status);
 	put_device(&port->dev);
 	kmem_cache_free(zfcp_fc_req_cache, fc_req);
 }
@@ -521,14 +521,14 @@ void zfcp_fc_link_test_work(struct work_
 	if (atomic_read(&port->status) & ZFCP_STATUS_PORT_LINK_TEST)
 		goto out;
 
-	atomic_set_mask(ZFCP_STATUS_PORT_LINK_TEST, &port->status);
+	atomic_or(ZFCP_STATUS_PORT_LINK_TEST, &port->status);
 
 	retval = zfcp_fc_adisc(port);
 	if (retval == 0)
 		return;
 
 	/* send of ADISC was not possible */
-	atomic_clear_mask(ZFCP_STATUS_PORT_LINK_TEST, &port->status);
+	atomic_and(~ZFCP_STATUS_PORT_LINK_TEST, &port->status);
 	zfcp_erp_port_forced_reopen(port, 0, "fcltwk1");
 
 out:
@@ -597,7 +597,7 @@ static void zfcp_fc_validate_port(struct
 	if (!(atomic_read(&port->status) & ZFCP_STATUS_COMMON_NOESC))
 		return;
 
-	atomic_clear_mask(ZFCP_STATUS_COMMON_NOESC, &port->status);
+	atomic_and(~ZFCP_STATUS_COMMON_NOESC, &port->status);
 
 	if ((port->supported_classes != 0) ||
 	    !list_empty(&port->unit_list))
--- a/drivers/s390/scsi/zfcp_fsf.c
+++ b/drivers/s390/scsi/zfcp_fsf.c
@@ -114,7 +114,7 @@ static void zfcp_fsf_link_down_info_eval
 	if (atomic_read(&adapter->status) & ZFCP_STATUS_ADAPTER_LINK_UNPLUGGED)
 		return;
 
-	atomic_set_mask(ZFCP_STATUS_ADAPTER_LINK_UNPLUGGED, &adapter->status);
+	atomic_or(ZFCP_STATUS_ADAPTER_LINK_UNPLUGGED, &adapter->status);
 
 	zfcp_scsi_schedule_rports_block(adapter);
 
@@ -345,7 +345,7 @@ static void zfcp_fsf_protstatus_eval(str
 		zfcp_erp_adapter_shutdown(adapter, 0, "fspse_3");
 		break;
 	case FSF_PROT_HOST_CONNECTION_INITIALIZING:
-		atomic_set_mask(ZFCP_STATUS_ADAPTER_HOST_CON_INIT,
+		atomic_or(ZFCP_STATUS_ADAPTER_HOST_CON_INIT,
 				&adapter->status);
 		break;
 	case FSF_PROT_DUPLICATE_REQUEST_ID:
@@ -554,7 +554,7 @@ static void zfcp_fsf_exchange_config_dat
 			zfcp_erp_adapter_shutdown(adapter, 0, "fsecdh1");
 			return;
 		}
-		atomic_set_mask(ZFCP_STATUS_ADAPTER_XCONFIG_OK,
+		atomic_or(ZFCP_STATUS_ADAPTER_XCONFIG_OK,
 				&adapter->status);
 		break;
 	case FSF_EXCHANGE_CONFIG_DATA_INCOMPLETE:
@@ -567,7 +567,7 @@ static void zfcp_fsf_exchange_config_dat
 
 		/* avoids adapter shutdown to be able to recognize
 		 * events such as LINK UP */
-		atomic_set_mask(ZFCP_STATUS_ADAPTER_XCONFIG_OK,
+		atomic_or(ZFCP_STATUS_ADAPTER_XCONFIG_OK,
 				&adapter->status);
 		zfcp_fsf_link_down_info_eval(req,
 			&qtcb->header.fsf_status_qual.link_down_info);
@@ -1394,10 +1394,10 @@ static void zfcp_fsf_open_port_handler(s
 		break;
 	case FSF_GOOD:
 		port->handle = header->port_handle;
-		atomic_set_mask(ZFCP_STATUS_COMMON_OPEN |
+		atomic_or(ZFCP_STATUS_COMMON_OPEN |
 				ZFCP_STATUS_PORT_PHYS_OPEN, &port->status);
-		atomic_clear_mask(ZFCP_STATUS_COMMON_ACCESS_DENIED |
-		                  ZFCP_STATUS_COMMON_ACCESS_BOXED,
+		atomic_and(~(ZFCP_STATUS_COMMON_ACCESS_DENIED |
+		                  ZFCP_STATUS_COMMON_ACCESS_BOXED),
 		                  &port->status);
 		/* check whether D_ID has changed during open */
 		/*
@@ -1678,10 +1678,10 @@ static void zfcp_fsf_close_physical_port
 	case FSF_PORT_BOXED:
 		/* can't use generic zfcp_erp_modify_port_status because
 		 * ZFCP_STATUS_COMMON_OPEN must not be reset for the port */
-		atomic_clear_mask(ZFCP_STATUS_PORT_PHYS_OPEN, &port->status);
+		atomic_and(~ZFCP_STATUS_PORT_PHYS_OPEN, &port->status);
 		shost_for_each_device(sdev, port->adapter->scsi_host)
 			if (sdev_to_zfcp(sdev)->port == port)
-				atomic_clear_mask(ZFCP_STATUS_COMMON_OPEN,
+				atomic_and(~ZFCP_STATUS_COMMON_OPEN,
 						  &sdev_to_zfcp(sdev)->status);
 		zfcp_erp_set_port_status(port, ZFCP_STATUS_COMMON_ACCESS_BOXED);
 		zfcp_erp_port_reopen(port, ZFCP_STATUS_COMMON_ERP_FAILED,
@@ -1701,10 +1701,10 @@ static void zfcp_fsf_close_physical_port
 		/* can't use generic zfcp_erp_modify_port_status because
 		 * ZFCP_STATUS_COMMON_OPEN must not be reset for the port
 		 */
-		atomic_clear_mask(ZFCP_STATUS_PORT_PHYS_OPEN, &port->status);
+		atomic_and(~ZFCP_STATUS_PORT_PHYS_OPEN, &port->status);
 		shost_for_each_device(sdev, port->adapter->scsi_host)
 			if (sdev_to_zfcp(sdev)->port == port)
-				atomic_clear_mask(ZFCP_STATUS_COMMON_OPEN,
+				atomic_and(~ZFCP_STATUS_COMMON_OPEN,
 						  &sdev_to_zfcp(sdev)->status);
 		break;
 	}
@@ -1767,8 +1767,8 @@ static void zfcp_fsf_open_lun_handler(st
 
 	zfcp_sdev = sdev_to_zfcp(sdev);
 
-	atomic_clear_mask(ZFCP_STATUS_COMMON_ACCESS_DENIED |
-			  ZFCP_STATUS_COMMON_ACCESS_BOXED,
+	atomic_and(~(ZFCP_STATUS_COMMON_ACCESS_DENIED |
+			  ZFCP_STATUS_COMMON_ACCESS_BOXED),
 			  &zfcp_sdev->status);
 
 	switch (header->fsf_status) {
@@ -1823,7 +1823,7 @@ static void zfcp_fsf_open_lun_handler(st
 
 	case FSF_GOOD:
 		zfcp_sdev->lun_handle = header->lun_handle;
-		atomic_set_mask(ZFCP_STATUS_COMMON_OPEN, &zfcp_sdev->status);
+		atomic_or(ZFCP_STATUS_COMMON_OPEN, &zfcp_sdev->status);
 		break;
 	}
 }
@@ -1914,7 +1914,7 @@ static void zfcp_fsf_close_lun_handler(s
 		}
 		break;
 	case FSF_GOOD:
-		atomic_clear_mask(ZFCP_STATUS_COMMON_OPEN, &zfcp_sdev->status);
+		atomic_and(~ZFCP_STATUS_COMMON_OPEN, &zfcp_sdev->status);
 		break;
 	}
 }
--- a/drivers/s390/scsi/zfcp_qdio.c
+++ b/drivers/s390/scsi/zfcp_qdio.c
@@ -351,7 +351,7 @@ void zfcp_qdio_close(struct zfcp_qdio *q
 
 	/* clear QDIOUP flag, thus do_QDIO is not called during qdio_shutdown */
 	spin_lock_irq(&qdio->req_q_lock);
-	atomic_clear_mask(ZFCP_STATUS_ADAPTER_QDIOUP, &adapter->status);
+	atomic_and(~ZFCP_STATUS_ADAPTER_QDIOUP, &adapter->status);
 	spin_unlock_irq(&qdio->req_q_lock);
 
 	wake_up(&qdio->req_q_wq);
@@ -386,7 +386,7 @@ int zfcp_qdio_open(struct zfcp_qdio *qdi
 	if (atomic_read(&adapter->status) & ZFCP_STATUS_ADAPTER_QDIOUP)
 		return -EIO;
 
-	atomic_clear_mask(ZFCP_STATUS_ADAPTER_SIOSL_ISSUED,
+	atomic_and(~ZFCP_STATUS_ADAPTER_SIOSL_ISSUED,
 			  &qdio->adapter->status);
 
 	zfcp_qdio_setup_init_data(&init_data, qdio);
@@ -398,14 +398,14 @@ int zfcp_qdio_open(struct zfcp_qdio *qdi
 		goto failed_qdio;
 
 	if (ssqd.qdioac2 & CHSC_AC2_DATA_DIV_ENABLED)
-		atomic_set_mask(ZFCP_STATUS_ADAPTER_DATA_DIV_ENABLED,
+		atomic_or(ZFCP_STATUS_ADAPTER_DATA_DIV_ENABLED,
 				&qdio->adapter->status);
 
 	if (ssqd.qdioac2 & CHSC_AC2_MULTI_BUFFER_ENABLED) {
-		atomic_set_mask(ZFCP_STATUS_ADAPTER_MB_ACT, &adapter->status);
+		atomic_or(ZFCP_STATUS_ADAPTER_MB_ACT, &adapter->status);
 		qdio->max_sbale_per_sbal = QDIO_MAX_ELEMENTS_PER_BUFFER;
 	} else {
-		atomic_clear_mask(ZFCP_STATUS_ADAPTER_MB_ACT, &adapter->status);
+		atomic_and(~ZFCP_STATUS_ADAPTER_MB_ACT, &adapter->status);
 		qdio->max_sbale_per_sbal = QDIO_MAX_ELEMENTS_PER_BUFFER - 1;
 	}
 
@@ -429,7 +429,7 @@ int zfcp_qdio_open(struct zfcp_qdio *qdi
 	/* set index of first available SBALS / number of available SBALS */
 	qdio->req_q_idx = 0;
 	atomic_set(&qdio->req_q_free, QDIO_MAX_BUFFERS_PER_Q);
-	atomic_set_mask(ZFCP_STATUS_ADAPTER_QDIOUP, &qdio->adapter->status);
+	atomic_or(ZFCP_STATUS_ADAPTER_QDIOUP, &qdio->adapter->status);
 
 	if (adapter->scsi_host) {
 		adapter->scsi_host->sg_tablesize = qdio->max_sbale_per_req;
@@ -506,6 +506,6 @@ void zfcp_qdio_siosl(struct zfcp_adapter
 
 	rc = ccw_device_siosl(adapter->ccw_device);
 	if (!rc)
-		atomic_set_mask(ZFCP_STATUS_ADAPTER_SIOSL_ISSUED,
+		atomic_or(ZFCP_STATUS_ADAPTER_SIOSL_ISSUED,
 				&adapter->status);
 }
--- a/include/asm-generic/atomic.h
+++ b/include/asm-generic/atomic.h
@@ -61,16 +61,32 @@ GEN_ATOMIC_OP(add, +)
 GEN_ATOMIC_OP(sub, -)
 #endif
 
-#ifndef atomic_clear_mask
+#ifndef atomic_and_return
 GEN_ATOMIC_OP(and, &)
-#define atomic_clear_mask(i, v) (void)atomic_and_return(~(i), (v))
 #endif
 
-#ifndef atomic_set_mask
+#ifndef atomic_or_return
 GEN_ATOMIC_OP(or, |)
-#define atomic_set_mask(i, v)	(void)atomic_or_return((i), (v))
 #endif
 
+#ifndef atomic_xor_return
+GEN_ATOMIC_OP(xor, ^)
+#endif
+
+#undef GEN_ATOMIC_OP
+
+#define GEN_ATOMIC_OP(op)						\
+static inline void atomic_##op(int i, atomic_t *v)			\
+{									\
+	atomic_##op##_return(i, v);					\
+}
+
+GEN_ATOMIC_OP(add)
+GEN_ATOMIC_OP(sub)
+GEN_ATOMIC_OP(and)
+GEN_ATOMIC_OP(or)
+GEN_ATOMIC_OP(xor)
+
 #undef GEN_ATOMIC_OP
 
 /*
@@ -106,16 +122,6 @@ static inline int atomic_add_negative(in
 	return atomic_add_return(i, v) < 0;
 }
 
-static inline void atomic_add(int i, atomic_t *v)
-{
-	atomic_add_return(i, v);
-}
-
-static inline void atomic_sub(int i, atomic_t *v)
-{
-	atomic_sub_return(i, v);
-}
-
 static inline void atomic_inc(atomic_t *v)
 {
 	atomic_add_return(1, v);
--- a/include/asm-generic/atomic64.h
+++ b/include/asm-generic/atomic64.h
@@ -20,10 +20,19 @@ typedef struct {
 
 extern long long atomic64_read(const atomic64_t *v);
 extern void	 atomic64_set(atomic64_t *v, long long i);
-extern void	 atomic64_add(long long a, atomic64_t *v);
-extern long long atomic64_add_return(long long a, atomic64_t *v);
-extern void	 atomic64_sub(long long a, atomic64_t *v);
-extern long long atomic64_sub_return(long long a, atomic64_t *v);
+
+#define ATOMIC64_OP(op)							\
+extern void	 atomic64_##op(long long a, atomic64_t *v);		\
+extern long long atomic64_##op##_return(long long a, atomic64_t *v);
+
+ATOMIC64_OP(add)
+ATOMIC64_OP(sub)
+ATOMIC64_OP(and)
+ATOMIC64_OP(or )
+ATOMIC64_OP(xor)
+
+#undef ATOMIC64_OP
+
 extern long long atomic64_dec_if_positive(atomic64_t *v);
 extern long long atomic64_cmpxchg(atomic64_t *v, long long o, long long n);
 extern long long atomic64_xchg(atomic64_t *v, long long new);
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -111,19 +111,6 @@ static inline int atomic_dec_if_positive
 }
 #endif
 
-#ifndef CONFIG_ARCH_HAS_ATOMIC_OR
-static inline void atomic_or(int i, atomic_t *v)
-{
-	int old;
-	int new;
-
-	do {
-		old = atomic_read(v);
-		new = old | i;
-	} while (atomic_cmpxchg(v, old, new) != old);
-}
-#endif /* #ifndef CONFIG_ARCH_HAS_ATOMIC_OR */
-
 #include <asm-generic/atomic-long.h>
 #ifdef CONFIG_GENERIC_ATOMIC64
 #include <asm-generic/atomic64.h>
--- a/lib/atomic64.c
+++ b/lib/atomic64.c
@@ -70,53 +70,38 @@ void atomic64_set(atomic64_t *v, long lo
 }
 EXPORT_SYMBOL(atomic64_set);
 
-void atomic64_add(long long a, atomic64_t *v)
-{
-	unsigned long flags;
-	raw_spinlock_t *lock = lock_addr(v);
-
-	raw_spin_lock_irqsave(lock, flags);
-	v->counter += a;
-	raw_spin_unlock_irqrestore(lock, flags);
-}
-EXPORT_SYMBOL(atomic64_add);
-
-long long atomic64_add_return(long long a, atomic64_t *v)
-{
-	unsigned long flags;
-	raw_spinlock_t *lock = lock_addr(v);
-	long long val;
-
-	raw_spin_lock_irqsave(lock, flags);
-	val = v->counter += a;
-	raw_spin_unlock_irqrestore(lock, flags);
-	return val;
-}
-EXPORT_SYMBOL(atomic64_add_return);
+#define ATOMIC64_OP(op, cop)						\
+void atomic64_##op(long long a, atomic64_t *v)				\
+{									\
+	unsigned long flags;						\
+	raw_spinlock_t *lock = lock_addr(v);				\
+									\
+	raw_spin_lock_irqsave(lock, flags);				\
+	v->counter cop a;						\
+	raw_spin_unlock_irqrestore(lock, flags);			\
+}									\
+EXPORT_SYMBOL(atomic64_##op);						\
+									\
+long long atomic64_##op##_return(long long a, atomic64_t *v)		\
+{									\
+	unsigned long flags;						\
+	raw_spinlock_t *lock = lock_addr(v);				\
+	long long val;							\
+									\
+	raw_spin_lock_irqsave(lock, flags);				\
+	val = (v->counter cop a);					\
+	raw_spin_unlock_irqrestore(lock, flags);			\
+	return val;							\
+}									\
+EXPORT_SYMBOL(atomic64_##op##_return);
+
+ATOMIC64_OP(add, +=)
+ATOMIC64_OP(sub, -=)
+ATOMIC64_OP(and, &=)
+ATOMIC64_OP(or , |=)
+ATOMIC64_OP(xor, ^=)
 
-void atomic64_sub(long long a, atomic64_t *v)
-{
-	unsigned long flags;
-	raw_spinlock_t *lock = lock_addr(v);
-
-	raw_spin_lock_irqsave(lock, flags);
-	v->counter -= a;
-	raw_spin_unlock_irqrestore(lock, flags);
-}
-EXPORT_SYMBOL(atomic64_sub);
-
-long long atomic64_sub_return(long long a, atomic64_t *v)
-{
-	unsigned long flags;
-	raw_spinlock_t *lock = lock_addr(v);
-	long long val;
-
-	raw_spin_lock_irqsave(lock, flags);
-	val = v->counter -= a;
-	raw_spin_unlock_irqrestore(lock, flags);
-	return val;
-}
-EXPORT_SYMBOL(atomic64_sub_return);
+#undef ATOMIC64_OP
 
 long long atomic64_dec_if_positive(atomic64_t *v)
 {
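
(For reference, not part of the patch: hand-expanding the x86
ATOMIC_OP(and, &) instance above gives roughly the below -- a single
locked instruction for the void op, a cmpxchg loop for the _return
variant.)

static inline void atomic_and(int i, atomic_t *v)
{
	asm volatile(LOCK_PREFIX "andl %1,%0"
			: "+m" (v->counter)
			: "ir" (i));
}

static inline int atomic_and_return(int i, atomic_t *v)
{
	int old, c = atomic_read(v);

	while ((old = atomic_cmpxchg(v, c, c & i)) != c)
		c = old;
	return c & i;
}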



^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 5/5] arch: Sanitize atomic_t bitwise ops
  2014-02-06 13:48 ` [RFC][PATCH 5/5] arch: Sanitize atomic_t bitwise ops Peter Zijlstra
@ 2014-02-06 14:43   ` Geert Uytterhoeven
  2014-02-06 16:14     ` Peter Zijlstra
  2014-02-06 16:53   ` Linus Torvalds
  1 sibling, 1 reply; 299+ messages in thread
From: Geert Uytterhoeven @ 2014-02-06 14:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-Arch, linux-kernel, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Will Deacon, Paul McKenney

On Thu, Feb 6, 2014 at 2:48 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> --- a/arch/m68k/include/asm/atomic.h
> +++ b/arch/m68k/include/asm/atomic.h
> @@ -76,6 +76,9 @@ static inline int atomic_##op##_return(i
>
>  ATOMIC_OP(add, +=)
>  ATOMIC_OP(sub, -=)
> +ATOMIC_OP(and, &=)
> +ATOMIC_OP(or , |=)
> +ATOMIC_OP(xor, ^=)

This last one ain't gonna fly. Given

static inline void atomic_##op(int i, atomic_t *v)                     \
{                                                                      \
       __asm__ __volatile__(#op "l %1,%0" : "+m" (*v) : ASM_DI (i));   \
}

from the previous patch, it'll generate the "xorl" mnemonic, which
IIRC does not exist. It's called "eorl".

Verified with m68k-linux-gnu-as:

Error: Unknown operator -- statement `xorl %d0,%d1' ignored

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 5/5] arch: Sanitize atomic_t bitwise ops
  2014-02-06 14:43   ` Geert Uytterhoeven
@ 2014-02-06 16:14     ` Peter Zijlstra
  0 siblings, 0 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-06 16:14 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Linux-Arch, linux-kernel, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Will Deacon, Paul McKenney

On Thu, Feb 06, 2014 at 03:43:19PM +0100, Geert Uytterhoeven wrote:
> On Thu, Feb 6, 2014 at 2:48 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > --- a/arch/m68k/include/asm/atomic.h
> > +++ b/arch/m68k/include/asm/atomic.h
> > @@ -76,6 +76,9 @@ static inline int atomic_##op##_return(i
> >
> >  ATOMIC_OP(add, +=)
> >  ATOMIC_OP(sub, -=)
> > +ATOMIC_OP(and, &=)
> > +ATOMIC_OP(or , |=)
> > +ATOMIC_OP(xor, ^=)
> 
> This last one ain't gonna fly. Given
> 
> static inline void atomic_##op(int i, atomic_t *v)                     \
> {                                                                      \
>        __asm__ __volatile__(#op "l %1,%0" : "+m" (*v) : ASM_DI (i));   \
> }
> 
> from the previous patch, it'll generate the "xorl" mnemonic, which
> IIRC does not exist. It's called "eorl".

Some instruction sets I googled, others I didn't; guess what I did (or
rather didn't do) for m68k :-)

ok I'll make that:

#define ATOMIC_OP(op, aop, cop)

ATOMIC_OP(xor, eor, ^=)

I had to do similar things on a few other archs as well; asm instruction
names are odd sometimes.
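
Fully spelled out, the void op would then be something like this
(untested sketch; cop only feeds the _return variant, elided here):

/* aop is the asm mnemonic, which may differ from the C-level name */
#define ATOMIC_OP(op, aop, cop)						\
static inline void atomic_##op(int i, atomic_t *v)			\
{									\
	__asm__ __volatile__(#aop "l %1,%0" : "+m" (*v) : ASM_DI (i));	\
}

ATOMIC_OP(add, add, +=)
ATOMIC_OP(sub, sub, -=)
ATOMIC_OP(and, and, &=)
ATOMIC_OP(or, or, |=)
ATOMIC_OP(xor, eor, ^=)		/* m68k spells it eor */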

Thanks!

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 5/5] arch: Sanitize atomic_t bitwise ops
  2014-02-06 13:48 ` [RFC][PATCH 5/5] arch: Sanitize atomic_t bitwise ops Peter Zijlstra
  2014-02-06 14:43   ` Geert Uytterhoeven
@ 2014-02-06 16:53   ` Linus Torvalds
  2014-02-06 17:52     ` Peter Zijlstra
  1 sibling, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-06 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, Linux Kernel Mailing List, Andrew Morton,
	Ingo Molnar, Will Deacon, Paul McKenney

On Thu, Feb 6, 2014 at 5:48 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> Many archs have atomic_{set,clear}_mask() but not all. Remove these
> and provide a comprehensive set of bitops:
>
>   atomic{,64}_{and,or,xor}{,_return}()

Who uses these, and why?

The "_return()" versions of atomic ops are noticeably slower and more
complex on common architectures (i.e. x86), and apparently there is no
use of them since they didn't exist.

So why add them? Just to encourage people to do bad things?

                  Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 4/5] arch: Generic atomic.h cleanup
  2014-02-06 13:48 ` [RFC][PATCH 4/5] arch: Generic atomic.h cleanup Peter Zijlstra
@ 2014-02-06 17:49   ` Will Deacon
  0 siblings, 0 replies; 299+ messages in thread
From: Will Deacon @ 2014-02-06 17:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-arch, linux-kernel, torvalds, akpm, mingo, paulmck

On Thu, Feb 06, 2014 at 01:48:29PM +0000, Peter Zijlstra wrote:
> XXX: split up in per arch patches?

Yes please -- I have a bunch of changes in this area pending for both arm
and arm64.

Cheers,

Will

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 5/5] arch: Sanitize atomic_t bitwise ops
  2014-02-06 16:53   ` Linus Torvalds
@ 2014-02-06 17:52     ` Peter Zijlstra
  2014-02-06 17:56       ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-06 17:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Linux Kernel Mailing List, Andrew Morton,
	Ingo Molnar, Will Deacon, Paul McKenney

On Thu, Feb 06, 2014 at 08:53:02AM -0800, Linus Torvalds wrote:
> On Thu, Feb 6, 2014 at 5:48 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Many archs have atomic_{set,clear}_mask() but not all. Remove these
> > and provide a comprehensive set of bitops:
> >
> >   atomic{,64}_{and,or,xor}{,_return}()
> 
> Who uses these, and why?

s390 is stuffed with atomic_{set,clear}_mask usage, no clue what for.

There are 2 atomic_set_mask() users in drm/i915; again, I haven't looked
at whether they make sense.

Various archs use atomic_set_mask() for their tlb flush mask, which
arguably should be done using the bitmap functions we have.

> The "_return()" versions of atomic ops are noticeably slower and more
> complex on common architectures (i.e. x86), and apparently there is no
> use of them since they didn't exist.
> 
> So why add them? Just to encourage people to do bad things?

Fair enough, they're mostly an accident of how I implemented the macro
generation magic. I suppose I can change that and avoid generating the
_return thingies.

I don't particularly care for this patch too much; but I do dislike the
atomic_{set,clear}_mask() things -- mostly because they don't really fit
the normal atomic_t functions and because some archs have them and
others do not.

And atomic_{set,clear}_mask() can be used for setting/clearing multiple
bits as opposed to {set,clear}_bit, which only does a single bit.



^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 5/5] arch: Sanitize atomic_t bitwise ops
  2014-02-06 17:52     ` Peter Zijlstra
@ 2014-02-06 17:56       ` Linus Torvalds
  2014-02-06 18:09         ` Peter Zijlstra
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-06 17:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, Linux Kernel Mailing List, Andrew Morton,
	Ingo Molnar, Will Deacon, Paul McKenney

On Thu, Feb 6, 2014 at 9:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> s390 is stuffed with atomic_{set,clear}_mask usage, no clue whatfor.

It's not the set/clear_mask ones I would worry about.

It's the "_return" variants. As far as I can tell, there are exactly
ZERO users of that stuff, and they are BAD BAD BAD.

On x86, those things would cause a cmpxchg loop, and the bad
read-for-shared-before-acquire cacheline pattern.

So why indirectly encourage people to add users for a bad operation?

                        Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 5/5] arch: Sanitize atomic_t bitwise ops
  2014-02-06 17:56       ` Linus Torvalds
@ 2014-02-06 18:09         ` Peter Zijlstra
  0 siblings, 0 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-06 18:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Linux Kernel Mailing List, Andrew Morton,
	Ingo Molnar, Will Deacon, Paul McKenney

On Thu, Feb 06, 2014 at 09:56:54AM -0800, Linus Torvalds wrote:
> On Thu, Feb 6, 2014 at 9:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> It's the "_return" variants. As far as I can tell, there are exactly
> ZERO users of that stuff, and they are BAD BAD BAD.
> 
> On x86, those things would cause a cmpxchg loop, and the bad
> read-for-shared-before-acquire cacheline pattern.
> 
> So why indirectly encourage people to add users for a bad operation?

OK, I'll not generate the _return() variants.
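
Something like the below for the bitwise ops then (untested sketch, x86
shown):

#define ATOMIC_OP(op)							\
static inline void atomic_##op(int i, atomic_t *v)			\
{									\
	asm volatile(LOCK_PREFIX #op"l %1,%0"				\
			: "+m" (v->counter)				\
			: "ir" (i));					\
}

/* void only; no atomic_{and,or,xor}_return() generated */
ATOMIC_OP(and)
ATOMIC_OP(or)
ATOMIC_OP(xor)

#undef ATOMIC_OP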

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 13:48 [RFC][PATCH 0/5] arch: atomic rework Peter Zijlstra
                   ` (4 preceding siblings ...)
  2014-02-06 13:48 ` [RFC][PATCH 5/5] arch: Sanitize atomic_t bitwise ops Peter Zijlstra
@ 2014-02-06 18:25 ` David Howells
  2014-02-06 18:30   ` Peter Zijlstra
                     ` (3 more replies)
       [not found] ` <52F93B7C.2090304@tilera.com>
  6 siblings, 4 replies; 299+ messages in thread
From: David Howells @ 2014-02-06 18:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dhowells, linux-arch, linux-kernel, torvalds, akpm, mingo,
	will.deacon, paulmck, ramana.radhakrishnan


Is it worth considering a move towards using C11 atomics and barriers and
compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
these.

One thing I'm not sure of, though, is how well gcc's atomics will cope with
interrupt handlers touching atomics on CPUs without suitable atomic
instructions - that said, userspace does have to deal with signals getting
underfoot, but then userspace can't normally disable interrupts.
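
For the simple cases the mapping looks straightforward, e.g. something
like this, assuming gcc >= 4.7's __atomic builtins (sketch only):

static inline void atomic_add(int i, atomic_t *v)
{
	__atomic_fetch_add(&v->counter, i, __ATOMIC_RELAXED);
}

static inline int atomic_add_return(int i, atomic_t *v)
{
	/* kernel _return ops imply a full barrier; seq_cst is the
	 * nearest C11 equivalent */
	return __atomic_add_fetch(&v->counter, i, __ATOMIC_SEQ_CST);
}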

David

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 18:25 ` [RFC][PATCH 0/5] arch: atomic rework David Howells
@ 2014-02-06 18:30   ` Peter Zijlstra
  2014-02-06 18:42   ` Paul E. McKenney
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-06 18:30 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, linux-kernel, torvalds, akpm, mingo, will.deacon,
	paulmck, ramana.radhakrishnan

On Thu, Feb 06, 2014 at 06:25:49PM +0000, David Howells wrote:
> 
> Is it worth considering a move towards using C11 atomics and barriers and
> compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> these.
> 
> One thing I'm not sure of, though, is how well gcc's atomics will cope with
> interrupt handlers touching atomics on CPUs without suitable atomic
> instructions - that said, userspace does have to deal with signals getting
> underfoot, but then userspace can't normally disable interrupts.

I can do an asm-generic/atomic_c11.h if people want.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 18:25 ` [RFC][PATCH 0/5] arch: atomic rework David Howells
  2014-02-06 18:30   ` Peter Zijlstra
@ 2014-02-06 18:42   ` Paul E. McKenney
  2014-02-06 18:55   ` Ramana Radhakrishnan
  2014-02-06 19:21   ` Linus Torvalds
  3 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-06 18:42 UTC (permalink / raw)
  To: David Howells
  Cc: Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo,
	will.deacon, ramana.radhakrishnan

On Thu, Feb 06, 2014 at 06:25:49PM +0000, David Howells wrote:
> 
> Is it worth considering a move towards using C11 atomics and barriers and
> compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> these.

Makes sense to me!

> One thing I'm not sure of, though, is how well gcc's atomics will cope with
> interrupt handlers touching atomics on CPUs without suitable atomic
> instructions - that said, userspace does have to deal with signals getting
> underfoot, but then userspace can't normally disable interrupts.

Perhaps make the C11 definitions so that any arch can override any
specific definition?
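
Something like the usual #ifndef override pattern, perhaps (sketch):

/* The arch header defines what it wants; the generic C11-based
 * header only fills in the gaps. */
#ifndef atomic_add_return
static inline int atomic_add_return(int i, atomic_t *v)
{
	return __atomic_add_fetch(&v->counter, i, __ATOMIC_SEQ_CST);
}
#define atomic_add_return atomic_add_return
#endif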

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 18:25 ` [RFC][PATCH 0/5] arch: atomic rework David Howells
  2014-02-06 18:30   ` Peter Zijlstra
  2014-02-06 18:42   ` Paul E. McKenney
@ 2014-02-06 18:55   ` Ramana Radhakrishnan
  2014-02-06 18:59     ` Will Deacon
  2014-02-06 19:21   ` Linus Torvalds
  3 siblings, 1 reply; 299+ messages in thread
From: Ramana Radhakrishnan @ 2014-02-06 18:55 UTC (permalink / raw)
  To: David Howells
  Cc: Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo,
	Will Deacon, paulmck, gcc

On 02/06/14 18:25, David Howells wrote:
>
> Is it worth considering a move towards using C11 atomics and barriers and
> compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> these.


It sounds interesting to me, if we can make it work properly and 
reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in.

>
> One thing I'm not sure of, though, is how well gcc's atomics will cope with
> interrupt handlers touching atomics on CPUs without suitable atomic
> instructions - that said, userspace does have to deal with signals getting
> underfoot, but then userspace can't normally disable interrupts.
>
> David
>

Ramana


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 18:55   ` Ramana Radhakrishnan
@ 2014-02-06 18:59     ` Will Deacon
  2014-02-06 19:27       ` Paul E. McKenney
  2014-02-06 21:09       ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Will Deacon @ 2014-02-06 18:59 UTC (permalink / raw)
  To: Ramana Radhakrishnan
  Cc: David Howells, Peter Zijlstra, linux-arch, linux-kernel,
	torvalds, akpm, mingo, paulmck, gcc

On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote:
> On 02/06/14 18:25, David Howells wrote:
> >
> > Is it worth considering a move towards using C11 atomics and barriers and
> > compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> > these.
> 
> 
> It sounds interesting to me, if we can make it work properly and 
> reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in.

Given my (albeit limited) experience playing with the C11 spec and GCC, I
really think this is a bad idea for the kernel. It seems that nobody really
agrees on exactly how the C11 atomics map to real architectural
instructions on anything but the trivial architectures. For example, should
the following code fire the assert?


extern atomic<int> foo, bar, baz;

void thread1(void)
{
	foo.store(42, memory_order_relaxed);
	bar.fetch_add(1, memory_order_seq_cst);
	baz.store(42, memory_order_relaxed);
}

void thread2(void)
{
	while (baz.load(memory_order_seq_cst) != 42) {
		/* do nothing */
	}

	assert(foo.load(memory_order_seq_cst) == 42);
}


To answer that question, you need to go and look at the definitions of
synchronises-with, happens-before, dependency_ordered_before and a whole
pile of vaguely written waffle to realise that you don't know. Certainly,
the code that arm64 GCC currently spits out would allow the assertion to fire
on some microarchitectures.

There are also so many ways to blow your head off it's untrue. For example,
cmpxchg takes a separate memory model parameter for failure and success, but
then there are restrictions on the sets you can use for each. It's not hard
to find well-known memory-ordering experts shouting "Just use
memory_order_seq_cst for everything, it's too hard otherwise". Then there's
the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
atm and optimises all of the data dependencies away) as well as the definition
of "data races", which seem to be used as an excuse to miscompile a program
at the earliest opportunity.
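
(To pick on the cmpxchg restrictions: C11 says the failure order may be
neither memory_order_release nor memory_order_acq_rel and may not be
stronger than the success order, so e.g. the following sketch is
invalid:)

	atomic_int x;
	int expected = 0;

	/* invalid: the failure order (seq_cst) is stronger than the
	 * success order (relaxed) */
	atomic_compare_exchange_strong_explicit(&x, &expected, 1,
						memory_order_relaxed,
						memory_order_seq_cst);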

Trying to introduce system concepts (writes to devices, interrupts,
non-coherent agents) into this mess is going to be an uphill battle IMHO. I'd
just rather stick to the semantics we have and the asm volatile barriers.

That's not to say there's no room for improvement in what we have
in the kernel. Certainly, I'd welcome allowing more relaxed operations on
architectures that support them, but it needs to be something that at least
the different architecture maintainers can understand how to implement
efficiently behind an uncomplicated interface. I don't think that interface is
C11.

Just my thoughts on the matter...

Will

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 3/5] arch: s/smp_mb__(before|after)_(atomic|clear)_(dec,inc,bit)/smp_mb__\1/g
  2014-02-06 13:48 ` [RFC][PATCH 3/5] arch: s/smp_mb__(before|after)_(atomic|clear)_(dec,inc,bit)/smp_mb__\1/g Peter Zijlstra
@ 2014-02-06 19:12   ` Paul E. McKenney
  2014-02-07  9:52     ` Will Deacon
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-06 19:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, linux-kernel, torvalds, akpm, mingo, will.deacon

On Thu, Feb 06, 2014 at 02:48:28PM +0100, Peter Zijlstra wrote:
> Because atomic ops are implemented the same way across an architecture,
> the current set of extra barriers:
> 
>   smp_mb__before_atomic_inc()
>   smp_mb__after_atomic_inc()
>   smp_mb__before_atomic_dec()
>   smp_mb__after_atomic_dec()
>   smp_mb__before_clear_bit()
>   smp_mb__after_clear_bit()
> 
> is both incomplete and superfluous.
> 
> It is incomplete because there are far more atomic operations that do
> not return values -- such as atomic_add(), set_bit() etc. And it is
> superfluous because they're all the same anyway.
> 
> Simplify things by reducing the triplicate set into a single set of
> barriers that is valid for all void atomic ops:
> 
>   smp_mb__before_atomic()
>   smp_mb__after_atomic()
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>

I very much like the API shrinkage.  The RCU changes are good, and
I believe that the rest is OK too, though my eyes were going a bit
buggy towards the end...

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
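
(For a sense of the shape of the conversion before diving into the diff --
a representative call site, distilled from the mtip32xx hunk quoted below:)

	/* before */
	smp_mb__before_clear_bit();
	clear_bit(tag, port->allocated);
	smp_mb__after_clear_bit();

	/* after */
	smp_mb__before_atomic();
	clear_bit(tag, port->allocated);
	smp_mb__after_atomic();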

> ---
>  Documentation/atomic_ops.txt                      |   31 ++++--------
>  Documentation/memory-barriers.txt                 |   44 ++++-------------
>  arch/alpha/include/asm/atomic.h                   |    5 -
>  arch/alpha/include/asm/bitops.h                   |    3 -
>  arch/arc/include/asm/atomic.h                     |    5 -
>  arch/arc/include/asm/bitops.h                     |    5 -
>  arch/arm/include/asm/atomic.h                     |    5 -
>  arch/arm/include/asm/barrier.h                    |    3 +
>  arch/arm/include/asm/bitops.h                     |    4 -
>  arch/arm64/include/asm/atomic.h                   |    5 -
>  arch/arm64/include/asm/barrier.h                  |    3 +
>  arch/arm64/include/asm/bitops.h                   |    9 ---
>  arch/avr32/include/asm/atomic.h                   |    5 -
>  arch/avr32/include/asm/bitops.h                   |    9 ---
>  arch/blackfin/include/asm/barrier.h               |    3 +
>  arch/blackfin/include/asm/bitops.h                |   14 -----
>  arch/c6x/include/asm/bitops.h                     |    8 ---
>  arch/cris/include/asm/atomic.h                    |    7 --
>  arch/cris/include/asm/bitops.h                    |    9 ---
>  arch/frv/include/asm/atomic.h                     |    7 --
>  arch/frv/include/asm/bitops.h                     |    6 --
>  arch/hexagon/include/asm/atomic.h                 |    6 --
>  arch/hexagon/include/asm/bitops.h                 |    4 -
>  arch/ia64/include/asm/atomic.h                    |    6 --
>  arch/ia64/include/asm/barrier.h                   |    3 +
>  arch/ia64/include/asm/bitops.h                    |    6 --
>  arch/m32r/include/asm/atomic.h                    |    7 --
>  arch/m32r/include/asm/bitops.h                    |    6 --
>  arch/m68k/include/asm/atomic.h                    |    8 ---
>  arch/m68k/include/asm/bitops.h                    |    7 --
>  arch/metag/include/asm/atomic.h                   |    6 --
>  arch/metag/include/asm/barrier.h                  |    3 +
>  arch/metag/include/asm/bitops.h                   |    6 --
>  arch/mips/include/asm/atomic.h                    |    9 ---
>  arch/mips/include/asm/barrier.h                   |    3 +
>  arch/mips/include/asm/bitops.h                    |   11 ----
>  arch/mips/kernel/irq.c                            |    4 -
>  arch/mn10300/include/asm/atomic.h                 |    7 --
>  arch/mn10300/include/asm/bitops.h                 |    4 -
>  arch/mn10300/mm/tlb-smp.c                         |    4 -
>  arch/openrisc/include/asm/bitops.h                |    9 ---
>  arch/parisc/include/asm/atomic.h                  |    6 --
>  arch/parisc/include/asm/bitops.h                  |    4 -
>  arch/powerpc/include/asm/atomic.h                 |    6 --
>  arch/powerpc/include/asm/barrier.h                |    3 +
>  arch/powerpc/include/asm/bitops.h                 |    6 --
>  arch/powerpc/kernel/crash.c                       |    2 
>  arch/s390/include/asm/atomic.h                    |    6 --
>  arch/s390/include/asm/barrier.h                   |    5 +
>  arch/s390/include/asm/bitops.h                    |    1 
>  arch/score/include/asm/bitops.h                   |    7 --
>  arch/sh/include/asm/atomic.h                      |    6 --
>  arch/sh/include/asm/bitops.h                      |    7 --
>  arch/sparc/include/asm/atomic_32.h                |    7 --
>  arch/sparc/include/asm/atomic_64.h                |    7 --
>  arch/sparc/include/asm/barrier_64.h               |    3 +
>  arch/sparc/include/asm/bitops_32.h                |    4 -
>  arch/sparc/include/asm/bitops_64.h                |    4 -
>  arch/tile/include/asm/atomic_32.h                 |   10 ---
>  arch/tile/include/asm/atomic_64.h                 |    6 --
>  arch/tile/include/asm/barrier.h                   |   14 +++++
>  arch/tile/include/asm/bitops.h                    |    1 
>  arch/tile/include/asm/bitops_32.h                 |    8 ---
>  arch/tile/include/asm/bitops_64.h                 |    4 -
>  arch/x86/include/asm/atomic.h                     |    7 --
>  arch/x86/include/asm/barrier.h                    |    4 +
>  arch/x86/include/asm/bitops.h                     |    6 --
>  arch/x86/include/asm/sync_bitops.h                |    2 
>  arch/x86/kernel/apic/hw_nmi.c                     |    2 
>  arch/xtensa/include/asm/atomic.h                  |    7 --
>  arch/xtensa/include/asm/bitops.h                  |    4 -
>  block/blk-iopoll.c                                |    4 -
>  crypto/chainiv.c                                  |    2 
>  drivers/base/power/domain.c                       |    2 
>  drivers/block/mtip32xx/mtip32xx.c                 |    4 -
>  drivers/cpuidle/coupled.c                         |    2 
>  drivers/firewire/ohci.c                           |    2 
>  drivers/gpu/drm/drm_irq.c                         |   10 +--
>  drivers/gpu/drm/i915/i915_irq.c                   |    2 
>  drivers/md/bcache/bcache.h                        |    2 
>  drivers/md/bcache/closure.h                       |    2 
>  drivers/md/dm-bufio.c                             |    8 +--
>  drivers/md/dm-snap.c                              |    4 -
>  drivers/md/dm.c                                   |    2 
>  drivers/md/raid5.c                                |    2 
>  drivers/media/usb/dvb-usb-v2/dvb_usb_core.c       |    6 +-
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |    6 +-
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |   34 ++++++-------
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c    |   26 +++++-----
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c |   12 ++--
>  drivers/net/ethernet/broadcom/cnic.c              |    8 +--
>  drivers/net/ethernet/brocade/bna/bnad.c           |    6 +-
>  drivers/net/ethernet/chelsio/cxgb/cxgb2.c         |    2 
>  drivers/net/ethernet/chelsio/cxgb3/sge.c          |    6 +-
>  drivers/net/ethernet/chelsio/cxgb4/sge.c          |    2 
>  drivers/net/ethernet/chelsio/cxgb4vf/sge.c        |    2 
>  drivers/net/ethernet/intel/i40e/i40e_main.c       |    2 
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    4 -
>  drivers/net/wireless/ti/wlcore/main.c             |    2 
>  drivers/pci/xen-pcifront.c                        |    4 -
>  drivers/scsi/isci/remote_device.c                 |    2 
>  drivers/target/loopback/tcm_loop.c                |    4 -
>  drivers/target/target_core_alua.c                 |   26 +++++-----
>  drivers/target/target_core_device.c               |    6 +-
>  drivers/target/target_core_iblock.c               |    2 
>  drivers/target/target_core_pr.c                   |   56 +++++++++++-----------
>  drivers/target/target_core_transport.c            |   16 +++---
>  drivers/target/target_core_ua.c                   |   10 +--
>  drivers/tty/n_tty.c                               |    2 
>  drivers/tty/serial/mxs-auart.c                    |    4 -
>  drivers/usb/gadget/tcm_usb_gadget.c               |    4 -
>  drivers/usb/serial/usb_wwan.c                     |    2 
>  drivers/vhost/scsi.c                              |    2 
>  drivers/w1/w1_family.c                            |    4 -
>  drivers/xen/xen-pciback/pciback_ops.c             |    4 -
>  fs/btrfs/btrfs_inode.h                            |    2 
>  fs/btrfs/extent_io.c                              |    2 
>  fs/btrfs/inode.c                                  |    6 +-
>  fs/buffer.c                                       |    2 
>  fs/ext4/resize.c                                  |    2 
>  fs/gfs2/glock.c                                   |    8 +--
>  fs/gfs2/glops.c                                   |    2 
>  fs/gfs2/lock_dlm.c                                |    4 -
>  fs/gfs2/recovery.c                                |    2 
>  fs/gfs2/sys.c                                     |    4 -
>  fs/jbd2/commit.c                                  |    6 +-
>  fs/nfs/dir.c                                      |   12 ++--
>  fs/nfs/inode.c                                    |    2 
>  fs/nfs/nfs4filelayoutdev.c                        |    4 -
>  fs/nfs/nfs4state.c                                |    4 -
>  fs/nfs/pagelist.c                                 |    6 +-
>  fs/nfs/pnfs.c                                     |    2 
>  fs/nfs/pnfs.h                                     |    2 
>  fs/nfs/write.c                                    |    4 -
>  fs/ubifs/lpt_commit.c                             |    4 -
>  fs/ubifs/tnc_commit.c                             |    4 -
>  include/asm-generic/atomic.h                      |    7 --
>  include/asm-generic/barrier.h                     |    8 +++
>  include/asm-generic/bitops.h                      |    8 ---
>  include/asm-generic/bitops/atomic.h               |    2 
>  include/asm-generic/bitops/lock.h                 |    2 
>  include/linux/buffer_head.h                       |    2 
>  include/linux/genhd.h                             |    2 
>  include/linux/interrupt.h                         |    8 +--
>  include/linux/netdevice.h                         |    2 
>  include/linux/sched.h                             |    6 --
>  include/linux/sunrpc/sched.h                      |    8 +--
>  include/linux/sunrpc/xprt.h                       |    8 +--
>  include/linux/tracehook.h                         |    2 
>  include/net/ip_vs.h                               |    4 -
>  kernel/debug/debug_core.c                         |    4 -
>  kernel/futex.c                                    |    2 
>  kernel/kmod.c                                     |    2 
>  kernel/rcu/tree.c                                 |   22 ++++----
>  kernel/rcu/tree_plugin.h                          |    8 +--
>  kernel/sched/cpupri.c                             |    6 +-
>  kernel/sched/wait.c                               |    2 
>  mm/backing-dev.c                                  |    2 
>  mm/filemap.c                                      |    4 -
>  net/atm/pppoatm.c                                 |    2 
>  net/bluetooth/hci_event.c                         |    4 -
>  net/core/dev.c                                    |    8 +--
>  net/core/link_watch.c                             |    2 
>  net/ipv4/inetpeer.c                               |    2 
>  net/netfilter/nf_conntrack_core.c                 |    2 
>  net/rds/ib_recv.c                                 |    4 -
>  net/rds/iw_recv.c                                 |    4 -
>  net/rds/send.c                                    |    6 +-
>  net/rds/tcp_send.c                                |    2 
>  net/sunrpc/auth.c                                 |    2 
>  net/sunrpc/auth_gss/auth_gss.c                    |    2 
>  net/sunrpc/backchannel_rqst.c                     |    4 -
>  net/sunrpc/xprt.c                                 |    4 -
>  net/sunrpc/xprtsock.c                             |   16 +++---
>  net/unix/af_unix.c                                |    2 
>  sound/pci/bt87x.c                                 |    4 -
>  176 files changed, 415 insertions(+), 642 deletions(-)
> 
> --- a/Documentation/atomic_ops.txt
> +++ b/Documentation/atomic_ops.txt
> @@ -285,15 +285,13 @@ If a caller requires memory barrier sema
>  operation which does not return a value, a set of interfaces are
>  defined which accomplish this:
> 
> -	void smp_mb__before_atomic_dec(void);
> -	void smp_mb__after_atomic_dec(void);
> -	void smp_mb__before_atomic_inc(void);
> -	void smp_mb__after_atomic_inc(void);
> +	void smp_mb__before_atomic(void);
> +	void smp_mb__after_atomic(void);
> 
> -For example, smp_mb__before_atomic_dec() can be used like so:
> +For example, smp_mb__before_atomic() can be used like so:
> 
>  	obj->dead = 1;
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  	atomic_dec(&obj->ref_count);
> 
>  It makes sure that all memory operations preceding the atomic_dec()
> @@ -302,15 +300,10 @@ operation.  In the above example, it gua
>  "1" to obj->dead will be globally visible to other cpus before the
>  atomic counter decrement.
> 
> -Without the explicit smp_mb__before_atomic_dec() call, the
> +Without the explicit smp_mb__before_atomic() call, the
>  implementation could legally allow the atomic counter update visible
>  to other cpus before the "obj->dead = 1;" assignment.
> 
> -The other three interfaces listed are used to provide explicit
> -ordering with respect to memory operations after an atomic_dec() call
> -(smp_mb__after_atomic_dec()) and around atomic_inc() calls
> -(smp_mb__{before,after}_atomic_inc()).
> -
>  A missing memory barrier in the cases where they are required by the
>  atomic_t implementation above can have disastrous results.  Here is
>  an example, which follows a pattern occurring frequently in the Linux
> @@ -487,12 +480,12 @@ memory operation done by test_and_set_bi
>  Which returns a boolean indicating if bit "nr" is set in the bitmask
>  pointed to by "addr".
> 
> -If explicit memory barriers are required around clear_bit() (which
> -does not return a value, and thus does not need to provide memory
> -barrier semantics), two interfaces are provided:
> +If explicit memory barriers are required around {set,clear}_bit() (which do
> +not return a value, and thus do not need to provide memory barrier
> +semantics), two interfaces are provided:
> 
> -	void smp_mb__before_clear_bit(void);
> -	void smp_mb__after_clear_bit(void);
> +	void smp_mb__before_atomic(void);
> +	void smp_mb__after_atomic(void);
> 
>  They are used as follows, and are akin to their atomic_t operation
>  brothers:
> @@ -500,13 +493,13 @@ They are used as follows, and are akin t
>  	/* All memory operations before this call will
>  	 * be globally visible before the clear_bit().
>  	 */
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit( ... );
> 
>  	/* The clear_bit() will be visible before all
>  	 * subsequent memory operations.
>  	 */
> -	 smp_mb__after_clear_bit();
> +	 smp_mb__after_atomic();
> 
>  There are two special bitops with lock barrier semantics (acquire/release,
>  same as spinlocks). These operate in the same way as their non-_lock/unlock
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1553,20 +1553,21 @@ CPU from reordering them.
>       insert anything more than a compiler barrier in a UP compilation.
> 
> 
> - (*) smp_mb__before_atomic_dec();
> - (*) smp_mb__after_atomic_dec();
> - (*) smp_mb__before_atomic_inc();
> - (*) smp_mb__after_atomic_inc();
> -
> -     These are for use with atomic add, subtract, increment and decrement
> -     functions that don't return a value, especially when used for reference
> -     counting.  These functions do not imply memory barriers.
> + (*) smp_mb__before_atomic();
> + (*) smp_mb__after_atomic();
> +
> +     These are for use with atomic functions (such as add, subtract,
> +     increment and decrement) that don't return a value, especially when
> +     used for reference counting.  These functions do not imply memory barriers.
> +
> +     These are also used for atomic bitop functions that do not return a
> +     value (such as set_bit and clear_bit).
> 
>       As an example, consider a piece of code that marks an object as being dead
>       and then decrements the object's reference count:
> 
>  	obj->dead = 1;
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  	atomic_dec(&obj->ref_count);
> 
>       This makes sure that the death mark on the object is perceived to be set
> @@ -1576,27 +1577,6 @@ CPU from reordering them.
>       operations" subsection for information on where to use these.
> 
> 
> - (*) smp_mb__before_clear_bit(void);
> - (*) smp_mb__after_clear_bit(void);
> -
> -     These are for use similar to the atomic inc/dec barriers.  These are
> -     typically used for bitwise unlocking operations, so care must be taken as
> -     there are no implicit memory barriers here either.
> -
> -     Consider implementing an unlock operation of some nature by clearing a
> -     locking bit.  The clear_bit() would then need to be barriered like this:
> -
> -	smp_mb__before_clear_bit();
> -	clear_bit( ... );
> -
> -     This prevents memory operations before the clear leaking to after it.  See
> -     the subsection on "Locking Functions" with reference to RELEASE operation
> -     implications.
> -
> -     See Documentation/atomic_ops.txt for more information.  See the "Atomic
> -     operations" subsection for information on where to use these.
> -
> -
>  MMIO WRITE BARRIER
>  ------------------
> 
> @@ -2222,11 +2202,11 @@ barriers, but might be used for implemen
>  	change_bit();
> 
>  With these the appropriate explicit memory barrier should be used if necessary
> -(smp_mb__before_clear_bit() for instance).
> +(smp_mb__before_atomic() for instance).
> 
> 
>  The following also do _not_ imply memory barriers, and so may require explicit
> -memory barriers under some circumstances (smp_mb__before_atomic_dec() for
> +memory barriers under some circumstances (smp_mb__before_atomic() for
>  instance):
> 
>  	atomic_add();
> --- a/arch/alpha/include/asm/atomic.h
> +++ b/arch/alpha/include/asm/atomic.h
> @@ -292,9 +292,4 @@ static inline long atomic64_dec_if_posit
>  #define atomic_dec(v) atomic_sub(1,(v))
>  #define atomic64_dec(v) atomic64_sub(1,(v))
> 
> -#define smp_mb__before_atomic_dec()	smp_mb()
> -#define smp_mb__after_atomic_dec()	smp_mb()
> -#define smp_mb__before_atomic_inc()	smp_mb()
> -#define smp_mb__after_atomic_inc()	smp_mb()
> -
>  #endif /* _ALPHA_ATOMIC_H */
> --- a/arch/alpha/include/asm/bitops.h
> +++ b/arch/alpha/include/asm/bitops.h
> @@ -53,9 +53,6 @@ __set_bit(unsigned long nr, volatile voi
>  	*m |= 1 << (nr & 31);
>  }
> 
> -#define smp_mb__before_clear_bit()	smp_mb()
> -#define smp_mb__after_clear_bit()	smp_mb()
> -
>  static inline void
>  clear_bit(unsigned long nr, volatile void * addr)
>  {
> --- a/arch/arc/include/asm/atomic.h
> +++ b/arch/arc/include/asm/atomic.h
> @@ -190,11 +190,6 @@ static inline void atomic_clear_mask(uns
> 
>  #endif /* !CONFIG_ARC_HAS_LLSC */
> 
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  /**
>   * __atomic_add_unless - add unless the number is a given value
>   * @v: pointer of type atomic_t
> --- a/arch/arc/include/asm/bitops.h
> +++ b/arch/arc/include/asm/bitops.h
> @@ -19,6 +19,7 @@
> 
>  #include <linux/types.h>
>  #include <linux/compiler.h>
> +#include <asm/barrier.h>
> 
>  /*
>   * Hardware assisted read-modify-write using ARC700 LLOCK/SCOND insns.
> @@ -496,10 +497,6 @@ static inline __attribute__ ((const)) in
>   */
>  #define ffz(x)	__ffs(~(x))
> 
> -/* TODO does this affect uni-processor code */
> -#define smp_mb__before_clear_bit()  barrier()
> -#define smp_mb__after_clear_bit()   barrier()
> -
>  #include <asm-generic/bitops/hweight.h>
>  #include <asm-generic/bitops/fls64.h>
>  #include <asm-generic/bitops/sched.h>
> --- a/arch/arm/include/asm/atomic.h
> +++ b/arch/arm/include/asm/atomic.h
> @@ -211,11 +211,6 @@ static inline int __atomic_add_unless(at
> 
>  #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0)
> 
> -#define smp_mb__before_atomic_dec()	smp_mb()
> -#define smp_mb__after_atomic_dec()	smp_mb()
> -#define smp_mb__before_atomic_inc()	smp_mb()
> -#define smp_mb__after_atomic_inc()	smp_mb()
> -
>  #ifndef CONFIG_GENERIC_ATOMIC64
>  typedef struct {
>  	long long counter;
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -79,5 +79,8 @@ do {									\
> 
>  #define set_mb(var, value)	do { var = value; smp_mb(); } while (0)
> 
> +#define smp_mb__before_atomic()	smp_mb()
> +#define smp_mb__after_atomic()	smp_mb()
> +
>  #endif /* !__ASSEMBLY__ */
>  #endif /* __ASM_BARRIER_H */
> --- a/arch/arm/include/asm/bitops.h
> +++ b/arch/arm/include/asm/bitops.h
> @@ -25,9 +25,7 @@
> 
>  #include <linux/compiler.h>
>  #include <linux/irqflags.h>
> -
> -#define smp_mb__before_clear_bit()	smp_mb()
> -#define smp_mb__after_clear_bit()	smp_mb()
> +#include <asm/barrier.h>
> 
>  /*
>   * These functions are the basis of our bit ops.
> --- a/arch/arm64/include/asm/atomic.h
> +++ b/arch/arm64/include/asm/atomic.h
> @@ -154,11 +154,6 @@ static inline int __atomic_add_unless(at
> 
>  #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0)
> 
> -#define smp_mb__before_atomic_dec()	smp_mb()
> -#define smp_mb__after_atomic_dec()	smp_mb()
> -#define smp_mb__before_atomic_inc()	smp_mb()
> -#define smp_mb__after_atomic_inc()	smp_mb()
> -
>  /*
>   * 64-bit atomic operations.
>   */
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -97,6 +97,9 @@ do {									\
>  #define set_mb(var, value)	do { var = value; smp_mb(); } while (0)
>  #define nop()		asm volatile("nop");
> 
> +#define smp_mb__before_atomic()	smp_mb()
> +#define smp_mb__after_atomic()	smp_mb()
> +
>  #endif	/* __ASSEMBLY__ */
> 
>  #endif	/* __ASM_BARRIER_H */
> --- a/arch/arm64/include/asm/bitops.h
> +++ b/arch/arm64/include/asm/bitops.h
> @@ -17,17 +17,8 @@
>  #define __ASM_BITOPS_H
> 
>  #include <linux/compiler.h>
> -
>  #include <asm/barrier.h>
> 
> -/*
> - * clear_bit may not imply a memory barrier
> - */
> -#ifndef smp_mb__before_clear_bit
> -#define smp_mb__before_clear_bit()	smp_mb()
> -#define smp_mb__after_clear_bit()	smp_mb()
> -#endif
> -
>  #ifndef _LINUX_BITOPS_H
>  #error only <linux/bitops.h> can be included directly
>  #endif
> --- a/arch/avr32/include/asm/atomic.h
> +++ b/arch/avr32/include/asm/atomic.h
> @@ -183,9 +183,4 @@ static inline int atomic_sub_if_positive
> 
>  #define atomic_dec_if_positive(v) atomic_sub_if_positive(1, v)
> 
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  #endif /*  __ASM_AVR32_ATOMIC_H */
> --- a/arch/avr32/include/asm/bitops.h
> +++ b/arch/avr32/include/asm/bitops.h
> @@ -13,12 +13,7 @@
>  #endif
> 
>  #include <asm/byteorder.h>
> -
> -/*
> - * clear_bit() doesn't provide any barrier for the compiler
> - */
> -#define smp_mb__before_clear_bit()	barrier()
> -#define smp_mb__after_clear_bit()	barrier()
> +#include <asm/barrier.h>
> 
>  /*
>   * set_bit - Atomically set a bit in memory
> @@ -67,7 +62,7 @@ static inline void set_bit(int nr, volat
>   *
>   * clear_bit() is atomic and may not be reordered.  However, it does
>   * not contain a memory barrier, so if it is used for locking purposes,
> - * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
> + * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
>   * in order to ensure changes are visible on other processors.
>   */
>  static inline void clear_bit(int nr, volatile void * addr)
> --- a/arch/blackfin/include/asm/barrier.h
> +++ b/arch/blackfin/include/asm/barrier.h
> @@ -27,6 +27,9 @@
> 
>  #endif /* !CONFIG_SMP */
> 
> +#define smp_mb__before_atomic()	barrier()
> +#define smp_mb__after_atomic()	barrier()
> +
>  #include <asm-generic/barrier.h>
> 
>  #endif /* _BLACKFIN_BARRIER_H */
> --- a/arch/blackfin/include/asm/bitops.h
> +++ b/arch/blackfin/include/asm/bitops.h
> @@ -27,21 +27,17 @@
> 
>  #include <asm-generic/bitops/ext2-atomic.h>
> 
> +#include <asm/barrier.h>
> +
>  #ifndef CONFIG_SMP
>  #include <linux/irqflags.h>
> -
>  /*
>   * clear_bit may not imply a memory barrier
>   */
> -#ifndef smp_mb__before_clear_bit
> -#define smp_mb__before_clear_bit()	smp_mb()
> -#define smp_mb__after_clear_bit()	smp_mb()
> -#endif
>  #include <asm-generic/bitops/atomic.h>
>  #include <asm-generic/bitops/non-atomic.h>
>  #else
> 
> -#include <asm/barrier.h>
>  #include <asm/byteorder.h>	/* swab32 */
>  #include <linux/linkage.h>
> 
> @@ -101,12 +97,6 @@ static inline int test_and_change_bit(in
>  	return __raw_bit_test_toggle_asm(a, nr & 0x1f);
>  }
> 
> -/*
> - * clear_bit() doesn't provide any barrier for the compiler.
> - */
> -#define smp_mb__before_clear_bit()	barrier()
> -#define smp_mb__after_clear_bit()	barrier()
> -
>  #define test_bit __skip_test_bit
>  #include <asm-generic/bitops/non-atomic.h>
>  #undef test_bit
> --- a/arch/c6x/include/asm/bitops.h
> +++ b/arch/c6x/include/asm/bitops.h
> @@ -14,14 +14,8 @@
>  #ifdef __KERNEL__
> 
>  #include <linux/bitops.h>
> -
>  #include <asm/byteorder.h>
> -
> -/*
> - * clear_bit() doesn't provide any barrier for the compiler.
> - */
> -#define smp_mb__before_clear_bit() barrier()
> -#define smp_mb__after_clear_bit()  barrier()
> +#include <asm/barrier.h>
> 
>  /*
>   * We are lucky, DSP is perfect for bitops: do it in 3 cycles
> --- a/arch/cris/include/asm/atomic.h
> +++ b/arch/cris/include/asm/atomic.h
> @@ -6,6 +6,7 @@
>  #include <linux/compiler.h>
>  #include <linux/types.h>
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
>  #include <arch/atomic.h>
> 
>  /*
> @@ -151,10 +152,4 @@ static inline int __atomic_add_unless(at
>  	return ret;
>  }
> 
> -/* Atomic operations are already serializing */
> -#define smp_mb__before_atomic_dec()    barrier()
> -#define smp_mb__after_atomic_dec()     barrier()
> -#define smp_mb__before_atomic_inc()    barrier()
> -#define smp_mb__after_atomic_inc()     barrier()
> -
>  #endif
> --- a/arch/cris/include/asm/bitops.h
> +++ b/arch/cris/include/asm/bitops.h
> @@ -21,6 +21,7 @@
>  #include <arch/bitops.h>
>  #include <linux/atomic.h>
>  #include <linux/compiler.h>
> +#include <asm/barrier.h>
> 
>  /*
>   * set_bit - Atomically set a bit in memory
> @@ -42,7 +43,7 @@
>   *
>   * clear_bit() is atomic and may not be reordered.  However, it does
>   * not contain a memory barrier, so if it is used for locking purposes,
> - * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
> + * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
>   * in order to ensure changes are visible on other processors.
>   */
> 
> @@ -84,12 +85,6 @@ static inline int test_and_set_bit(int n
>  	return retval;
>  }
> 
> -/*
> - * clear_bit() doesn't provide any barrier for the compiler.
> - */
> -#define smp_mb__before_clear_bit()      barrier()
> -#define smp_mb__after_clear_bit()       barrier()
> -
>  /**
>   * test_and_clear_bit - Clear a bit and return its old value
>   * @nr: Bit to clear
> --- a/arch/frv/include/asm/atomic.h
> +++ b/arch/frv/include/asm/atomic.h
> @@ -17,6 +17,7 @@
>  #include <linux/types.h>
>  #include <asm/spr-regs.h>
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> 
>  #ifdef CONFIG_SMP
>  #error not SMP safe
> @@ -29,12 +30,6 @@
>   * We do not have SMP systems, so we don't have to deal with that.
>   */
> 
> -/* Atomic operations are already serializing */
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  #define ATOMIC_INIT(i)		{ (i) }
>  #define atomic_read(v)		(*(volatile int *)&(v)->counter)
>  #define atomic_set(v, i)	(((v)->counter) = (i))
> --- a/arch/frv/include/asm/bitops.h
> +++ b/arch/frv/include/asm/bitops.h
> @@ -25,12 +25,6 @@
> 
>  #include <asm-generic/bitops/ffz.h>
> 
> -/*
> - * clear_bit() doesn't provide any barrier for the compiler.
> - */
> -#define smp_mb__before_clear_bit()	barrier()
> -#define smp_mb__after_clear_bit()	barrier()
> -
>  #ifndef CONFIG_FRV_OUTOFLINE_ATOMIC_OPS
>  static inline
>  unsigned long atomic_test_and_ANDNOT_mask(unsigned long mask, volatile unsigned long *v)
> --- a/arch/hexagon/include/asm/atomic.h
> +++ b/arch/hexagon/include/asm/atomic.h
> @@ -24,6 +24,7 @@
> 
>  #include <linux/types.h>
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> 
>  #define ATOMIC_INIT(i)		{ (i) }
>  #define atomic_set(v, i)	((v)->counter = (i))
> @@ -163,9 +164,4 @@ static inline int __atomic_add_unless(at
>  #define atomic_inc_return(v) (atomic_add_return(1, v))
>  #define atomic_dec_return(v) (atomic_sub_return(1, v))
> 
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  #endif
> --- a/arch/hexagon/include/asm/bitops.h
> +++ b/arch/hexagon/include/asm/bitops.h
> @@ -25,12 +25,10 @@
>  #include <linux/compiler.h>
>  #include <asm/byteorder.h>
>  #include <asm/atomic.h>
> +#include <asm/barrier.h>
> 
>  #ifdef __KERNEL__
> 
> -#define smp_mb__before_clear_bit()	barrier()
> -#define smp_mb__after_clear_bit()	barrier()
> -
>  /*
>   * The offset calculations for these are based on BITS_PER_LONG == 32
>   * (i.e. I get to shift by #5-2 (32 bits per long, 4 bytes per access),
> --- a/arch/ia64/include/asm/atomic.h
> +++ b/arch/ia64/include/asm/atomic.h
> @@ -208,10 +208,4 @@ atomic64_add_negative (__s64 i, atomic64
>  #define atomic64_inc(v)			atomic64_add(1, (v))
>  #define atomic64_dec(v)			atomic64_sub(1, (v))
> 
> -/* Atomic operations are already serializing */
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  #endif /* _ASM_IA64_ATOMIC_H */
> --- a/arch/ia64/include/asm/barrier.h
> +++ b/arch/ia64/include/asm/barrier.h
> @@ -55,6 +55,9 @@
> 
>  #endif
> 
> +#define smp_mb__before_atomic()	barrier()
> +#define smp_mb__after_atomic()	barrier()
> +
>  /*
>   * IA64 GCC turns volatile stores into st.rel and volatile loads into ld.acq no
>   * need for asm trickery!
> --- a/arch/ia64/include/asm/bitops.h
> +++ b/arch/ia64/include/asm/bitops.h
> @@ -16,6 +16,7 @@
>  #include <linux/compiler.h>
>  #include <linux/types.h>
>  #include <asm/intrinsics.h>
> +#include <asm/barrier.h>
> 
>  /**
>   * set_bit - Atomically set a bit in memory
> @@ -65,9 +66,6 @@ __set_bit (int nr, volatile void *addr)
>  	*((__u32 *) addr + (nr >> 5)) |= (1 << (nr & 31));
>  }
> 
> -#define smp_mb__before_clear_bit()	barrier();
> -#define smp_mb__after_clear_bit()	barrier();
> -
>  /**
>   * clear_bit - Clears a bit in memory
>   * @nr: Bit to clear
> @@ -75,7 +73,7 @@ __set_bit (int nr, volatile void *addr)
>   *
>   * clear_bit() is atomic and may not be reordered.  However, it does
>   * not contain a memory barrier, so if it is used for locking purposes,
> - * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
> + * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
>   * in order to ensure changes are visible on other processors.
>   */
>  static __inline__ void
> --- a/arch/m32r/include/asm/atomic.h
> +++ b/arch/m32r/include/asm/atomic.h
> @@ -13,6 +13,7 @@
>  #include <asm/assembler.h>
>  #include <asm/cmpxchg.h>
>  #include <asm/dcache_clear.h>
> +#include <asm/barrier.h>
> 
>  /*
>   * Atomic operations that C can't guarantee us.  Useful for
> @@ -308,10 +309,4 @@ static __inline__ void atomic_set_mask(u
>  	local_irq_restore(flags);
>  }
> 
> -/* Atomic operations are already serializing on m32r */
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  #endif	/* _ASM_M32R_ATOMIC_H */
> --- a/arch/m32r/include/asm/bitops.h
> +++ b/arch/m32r/include/asm/bitops.h
> @@ -21,6 +21,7 @@
>  #include <asm/byteorder.h>
>  #include <asm/dcache_clear.h>
>  #include <asm/types.h>
> +#include <asm/barrier.h>
> 
>  /*
>   * These have to be done with inline assembly: that way the bit-setting
> @@ -73,7 +74,7 @@ static __inline__ void set_bit(int nr, v
>   *
>   * clear_bit() is atomic and may not be reordered.  However, it does
>   * not contain a memory barrier, so if it is used for locking purposes,
> - * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
> + * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
>   * in order to ensure changes are visible on other processors.
>   */
>  static __inline__ void clear_bit(int nr, volatile void * addr)
> @@ -103,9 +104,6 @@ static __inline__ void clear_bit(int nr,
>  	local_irq_restore(flags);
>  }
> 
> -#define smp_mb__before_clear_bit()	barrier()
> -#define smp_mb__after_clear_bit()	barrier()
> -
>  /**
>   * change_bit - Toggle a bit in memory
>   * @nr: Bit to clear
> --- a/arch/m68k/include/asm/atomic.h
> +++ b/arch/m68k/include/asm/atomic.h
> @@ -4,6 +4,7 @@
>  #include <linux/types.h>
>  #include <linux/irqflags.h>
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> 
>  /*
>   * Atomic operations that C can't guarantee us.  Useful for
> @@ -209,11 +210,4 @@ static __inline__ int __atomic_add_unles
>  	return c;
>  }
> 
> -
> -/* Atomic operations are already serializing */
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  #endif /* __ARCH_M68K_ATOMIC __ */
> --- a/arch/m68k/include/asm/bitops.h
> +++ b/arch/m68k/include/asm/bitops.h
> @@ -13,6 +13,7 @@
>  #endif
> 
>  #include <linux/compiler.h>
> +#include <asm/barrier.h>
> 
>  /*
>   *	Bit access functions vary across the ColdFire and 68k families.
> @@ -67,12 +68,6 @@ static inline void bfset_mem_set_bit(int
>  #define __set_bit(nr, vaddr)	set_bit(nr, vaddr)
> 
> 
> -/*
> - * clear_bit() doesn't provide any barrier for the compiler.
> - */
> -#define smp_mb__before_clear_bit()	barrier()
> -#define smp_mb__after_clear_bit()	barrier()
> -
>  static inline void bclr_reg_clear_bit(int nr, volatile unsigned long *vaddr)
>  {
>  	char *p = (char *)vaddr + (nr ^ 31) / 8;
> --- a/arch/metag/include/asm/atomic.h
> +++ b/arch/metag/include/asm/atomic.h
> @@ -4,6 +4,7 @@
>  #include <linux/compiler.h>
>  #include <linux/types.h>
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> 
>  #if defined(CONFIG_METAG_ATOMICITY_IRQSOFF)
>  /* The simple UP case. */
> @@ -39,11 +40,6 @@
> 
>  #define atomic_inc_not_zero(v) atomic_add_unless((v), 1, 0)
> 
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  #endif
> 
>  #define atomic_dec_if_positive(v)       atomic_sub_if_positive(1, v)
> --- a/arch/metag/include/asm/barrier.h
> +++ b/arch/metag/include/asm/barrier.h
> @@ -97,4 +97,7 @@ do {									\
>  	___p1;								\
>  })
> 
> +#define smp_mb__before_atomic()	barrier()
> +#define smp_mb__after_atomic()	barrier()
> +
>  #endif /* _ASM_METAG_BARRIER_H */
> --- a/arch/metag/include/asm/bitops.h
> +++ b/arch/metag/include/asm/bitops.h
> @@ -5,12 +5,6 @@
>  #include <asm/barrier.h>
>  #include <asm/global_lock.h>
> 
> -/*
> - * clear_bit() doesn't provide any barrier for the compiler.
> - */
> -#define smp_mb__before_clear_bit()	barrier()
> -#define smp_mb__after_clear_bit()	barrier()
> -
>  #ifdef CONFIG_SMP
>  /*
>   * These functions are the basis of our bit ops.
> --- a/arch/mips/include/asm/atomic.h
> +++ b/arch/mips/include/asm/atomic.h
> @@ -761,13 +761,4 @@ static __inline__ int atomic64_add_unles
> 
>  #endif /* CONFIG_64BIT */
> 
> -/*
> - * atomic*_return operations are serializing but not the non-*_return
> - * versions.
> - */
> -#define smp_mb__before_atomic_dec()	smp_mb__before_llsc()
> -#define smp_mb__after_atomic_dec()	smp_llsc_mb()
> -#define smp_mb__before_atomic_inc()	smp_mb__before_llsc()
> -#define smp_mb__after_atomic_inc()	smp_llsc_mb()
> -
>  #endif /* _ASM_ATOMIC_H */
> --- a/arch/mips/include/asm/barrier.h
> +++ b/arch/mips/include/asm/barrier.h
> @@ -195,4 +195,7 @@ do {									\
>  	___p1;								\
>  })
> 
> +#define smp_mb__before_atomic()	smp_mb__before_llsc()
> +#define smp_mb__after_atomic()	smp_llsc_mb()
> +
>  #endif /* __ASM_BARRIER_H */
> --- a/arch/mips/include/asm/bitops.h
> +++ b/arch/mips/include/asm/bitops.h
> @@ -38,13 +38,6 @@
>  #endif
> 
>  /*
> - * clear_bit() doesn't provide any barrier for the compiler.
> - */
> -#define smp_mb__before_clear_bit()	smp_mb__before_llsc()
> -#define smp_mb__after_clear_bit()	smp_llsc_mb()
> -
> -
> -/*
>   * These are the "slower" versions of the functions and are in bitops.c.
>   * These functions call raw_local_irq_{save,restore}().
>   */
> @@ -120,7 +113,7 @@ static inline void set_bit(unsigned long
>   *
>   * clear_bit() is atomic and may not be reordered.  However, it does
>   * not contain a memory barrier, so if it is used for locking purposes,
> - * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
> + * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
>   * in order to ensure changes are visible on other processors.
>   */
>  static inline void clear_bit(unsigned long nr, volatile unsigned long *addr)
> @@ -175,7 +168,7 @@ static inline void clear_bit(unsigned lo
>   */
>  static inline void clear_bit_unlock(unsigned long nr, volatile unsigned long *addr)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(nr, addr);
>  }
> 
> --- a/arch/mips/kernel/irq.c
> +++ b/arch/mips/kernel/irq.c
> @@ -62,9 +62,9 @@ void __init alloc_legacy_irqno(void)
> 
>  void free_irqno(unsigned int irq)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(irq, irq_map);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  }
> 
>  /*
> --- a/arch/mn10300/include/asm/atomic.h
> +++ b/arch/mn10300/include/asm/atomic.h
> @@ -13,6 +13,7 @@
> 
>  #include <asm/irqflags.h>
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> 
>  #ifndef CONFIG_SMP
>  #include <asm-generic/atomic.h>
> @@ -234,12 +235,6 @@ static inline void atomic_set_mask(unsig
>  #endif
>  }
> 
> -/* Atomic operations are already serializing on MN10300??? */
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  #endif /* __KERNEL__ */
>  #endif /* CONFIG_SMP */
>  #endif /* _ASM_ATOMIC_H */
> --- a/arch/mn10300/include/asm/bitops.h
> +++ b/arch/mn10300/include/asm/bitops.h
> @@ -18,9 +18,7 @@
>  #define __ASM_BITOPS_H
> 
>  #include <asm/cpu-regs.h>
> -
> -#define smp_mb__before_clear_bit()	barrier()
> -#define smp_mb__after_clear_bit()	barrier()
> +#include <asm/barrier.h>
> 
>  /*
>   * set bit
> --- a/arch/mn10300/mm/tlb-smp.c
> +++ b/arch/mn10300/mm/tlb-smp.c
> @@ -78,9 +78,9 @@ void smp_flush_tlb(void *unused)
>  	else
>  		local_flush_tlb_page(flush_mm, flush_va);
> 
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	cpumask_clear_cpu(cpu_id, &flush_cpumask);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  out:
>  	put_cpu();
>  }
> --- a/arch/openrisc/include/asm/bitops.h
> +++ b/arch/openrisc/include/asm/bitops.h
> @@ -27,14 +27,7 @@
> 
>  #include <linux/irqflags.h>
>  #include <linux/compiler.h>
> -
> -/*
> - * clear_bit may not imply a memory barrier
> - */
> -#ifndef smp_mb__before_clear_bit
> -#define smp_mb__before_clear_bit()	smp_mb()
> -#define smp_mb__after_clear_bit()	smp_mb()
> -#endif
> +#include <asm/barrier.h>
> 
>  #include <asm/bitops/__ffs.h>
>  #include <asm-generic/bitops/ffz.h>
> --- a/arch/parisc/include/asm/atomic.h
> +++ b/arch/parisc/include/asm/atomic.h
> @@ -7,6 +7,7 @@
> 
>  #include <linux/types.h>
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> 
>  /*
>   * Atomic operations that C can't guarantee us.  Useful for
> @@ -143,11 +144,6 @@ static __inline__ int __atomic_add_unles
> 
>  #define ATOMIC_INIT(i)	{ (i) }
> 
> -#define smp_mb__before_atomic_dec()	smp_mb()
> -#define smp_mb__after_atomic_dec()	smp_mb()
> -#define smp_mb__before_atomic_inc()	smp_mb()
> -#define smp_mb__after_atomic_inc()	smp_mb()
> -
>  #ifdef CONFIG_64BIT
> 
>  #define ATOMIC64_INIT(i) { (i) }
> --- a/arch/parisc/include/asm/bitops.h
> +++ b/arch/parisc/include/asm/bitops.h
> @@ -8,6 +8,7 @@
>  #include <linux/compiler.h>
>  #include <asm/types.h>		/* for BITS_PER_LONG/SHIFT_PER_LONG */
>  #include <asm/byteorder.h>
> +#include <asm/barrier.h>
>  #include <linux/atomic.h>
> 
>  /*
> @@ -19,9 +20,6 @@
>  #define CHOP_SHIFTCOUNT(x) (((unsigned long) (x)) & (BITS_PER_LONG - 1))
> 
> 
> -#define smp_mb__before_clear_bit()      smp_mb()
> -#define smp_mb__after_clear_bit()       smp_mb()
> -
>  /* See http://marc.theaimsgroup.com/?t=108826637900003 for discussion
>   * on use of volatile and __*_bit() (set/clear/change):
>   *	*_bit() want use of volatile.
> --- a/arch/powerpc/include/asm/atomic.h
> +++ b/arch/powerpc/include/asm/atomic.h
> @@ -8,6 +8,7 @@
>  #ifdef __KERNEL__
>  #include <linux/types.h>
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> 
>  #define ATOMIC_INIT(i)		{ (i) }
> 
> @@ -270,11 +271,6 @@ static __inline__ int atomic_dec_if_posi
>  }
>  #define atomic_dec_if_positive atomic_dec_if_positive
> 
> -#define smp_mb__before_atomic_dec()     smp_mb()
> -#define smp_mb__after_atomic_dec()      smp_mb()
> -#define smp_mb__before_atomic_inc()     smp_mb()
> -#define smp_mb__after_atomic_inc()      smp_mb()
> -
>  #ifdef __powerpc64__
> 
>  #define ATOMIC64_INIT(i)	{ (i) }
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -84,4 +84,7 @@ do {									\
>  	___p1;								\
>  })
> 
> +#define smp_mb__before_atomic()     smp_mb()
> +#define smp_mb__after_atomic()      smp_mb()
> +
>  #endif /* _ASM_POWERPC_BARRIER_H */
> --- a/arch/powerpc/include/asm/bitops.h
> +++ b/arch/powerpc/include/asm/bitops.h
> @@ -51,11 +51,7 @@
>  #define PPC_BIT(bit)		(1UL << PPC_BITLSHIFT(bit))
>  #define PPC_BITMASK(bs, be)	((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))
> 
> -/*
> - * clear_bit doesn't imply a memory barrier
> - */
> -#define smp_mb__before_clear_bit()	smp_mb()
> -#define smp_mb__after_clear_bit()	smp_mb()
> +#include <asm/barrier.h>
> 
>  /* Macro for generating the ***_bits() functions */
>  #define DEFINE_BITOP(fn, op, prefix)		\
> --- a/arch/powerpc/kernel/crash.c
> +++ b/arch/powerpc/kernel/crash.c
> @@ -81,7 +81,7 @@ void crash_ipi_callback(struct pt_regs *
>  	}
> 
>  	atomic_inc(&cpus_in_crash);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
> 
>  	/*
>  	 * Starting the kdump boot.
> --- a/arch/s390/include/asm/atomic.h
> +++ b/arch/s390/include/asm/atomic.h
> @@ -16,6 +16,7 @@
>  #include <linux/compiler.h>
>  #include <linux/types.h>
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> 
>  #define ATOMIC_INIT(i)  { (i) }
> 
> @@ -398,9 +399,4 @@ static inline long long atomic64_dec_if_
>  #define atomic64_dec_and_test(_v)	(atomic64_sub_return(1, _v) == 0)
>  #define atomic64_inc_not_zero(v)	atomic64_add_unless((v), 1, 0)
> 
> -#define smp_mb__before_atomic_dec()	smp_mb()
> -#define smp_mb__after_atomic_dec()	smp_mb()
> -#define smp_mb__before_atomic_inc()	smp_mb()
> -#define smp_mb__after_atomic_inc()	smp_mb()
> -
>  #endif /* __ARCH_S390_ATOMIC__  */
> --- a/arch/s390/include/asm/barrier.h
> +++ b/arch/s390/include/asm/barrier.h
> @@ -27,8 +27,9 @@
>  #define smp_rmb()			rmb()
>  #define smp_wmb()			wmb()
>  #define smp_read_barrier_depends()	read_barrier_depends()
> -#define smp_mb__before_clear_bit()	smp_mb()
> -#define smp_mb__after_clear_bit()	smp_mb()
> +
> +#define smp_mb__before_atomic()	smp_mb()
> +#define smp_mb__after_atomic()	smp_mb()
> 
>  #define set_mb(var, value)		do { var = value; mb(); } while (0)
> 
> --- a/arch/s390/include/asm/bitops.h
> +++ b/arch/s390/include/asm/bitops.h
> @@ -47,6 +47,7 @@
> 
>  #include <linux/typecheck.h>
>  #include <linux/compiler.h>
> +#include <asm/barrier.h>
> 
>  #ifndef CONFIG_64BIT
> 
> --- a/arch/score/include/asm/bitops.h
> +++ b/arch/score/include/asm/bitops.h
> @@ -2,12 +2,7 @@
>  #define _ASM_SCORE_BITOPS_H
> 
>  #include <asm/byteorder.h> /* swab32 */
> -
> -/*
> - * clear_bit() doesn't provide any barrier for the compiler.
> - */
> -#define smp_mb__before_clear_bit()	barrier()
> -#define smp_mb__after_clear_bit()	barrier()
> +#include <asm/barrier.h>
> 
>  #include <asm-generic/bitops.h>
>  #include <asm-generic/bitops/__fls.h>
> --- a/arch/sh/include/asm/atomic.h
> +++ b/arch/sh/include/asm/atomic.h
> @@ -10,6 +10,7 @@
>  #include <linux/compiler.h>
>  #include <linux/types.h>
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> 
>  #define ATOMIC_INIT(i)	{ (i) }
> 
> @@ -62,9 +63,4 @@ static inline int __atomic_add_unless(at
>  	return c;
>  }
> 
> -#define smp_mb__before_atomic_dec()	smp_mb()
> -#define smp_mb__after_atomic_dec()	smp_mb()
> -#define smp_mb__before_atomic_inc()	smp_mb()
> -#define smp_mb__after_atomic_inc()	smp_mb()
> -
>  #endif /* __ASM_SH_ATOMIC_H */
> --- a/arch/sh/include/asm/bitops.h
> +++ b/arch/sh/include/asm/bitops.h
> @@ -9,6 +9,7 @@
> 
>  /* For __swab32 */
>  #include <asm/byteorder.h>
> +#include <asm/barrier.h>
> 
>  #ifdef CONFIG_GUSA_RB
>  #include <asm/bitops-grb.h>
> @@ -22,12 +23,6 @@
>  #include <asm-generic/bitops/non-atomic.h>
>  #endif
> 
> -/*
> - * clear_bit() doesn't provide any barrier for the compiler.
> - */
> -#define smp_mb__before_clear_bit()	smp_mb()
> -#define smp_mb__after_clear_bit()	smp_mb()
> -
>  #ifdef CONFIG_SUPERH32
>  static inline unsigned long ffz(unsigned long word)
>  {
> --- a/arch/sparc/include/asm/atomic_32.h
> +++ b/arch/sparc/include/asm/atomic_32.h
> @@ -14,6 +14,7 @@
>  #include <linux/types.h>
> 
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
>  #include <asm-generic/atomic64.h>
> 
> 
> @@ -52,10 +53,4 @@ extern void atomic_set(atomic_t *, int);
>  #define atomic_dec_and_test(v) (atomic_dec_return(v) == 0)
>  #define atomic_sub_and_test(i, v) (atomic_sub_return(i, v) == 0)
> 
> -/* Atomic operations are already serializing */
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  #endif /* !(__ARCH_SPARC_ATOMIC__) */
> --- a/arch/sparc/include/asm/atomic_64.h
> +++ b/arch/sparc/include/asm/atomic_64.h
> @@ -9,6 +9,7 @@
> 
>  #include <linux/types.h>
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> 
>  #define ATOMIC_INIT(i)		{ (i) }
>  #define ATOMIC64_INIT(i)	{ (i) }
> @@ -108,10 +109,4 @@ static inline long atomic64_add_unless(a
> 
>  extern long atomic64_dec_if_positive(atomic64_t *v);
> 
> -/* Atomic operations are already serializing */
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  #endif /* !(__ARCH_SPARC64_ATOMIC__) */
> --- a/arch/sparc/include/asm/barrier_64.h
> +++ b/arch/sparc/include/asm/barrier_64.h
> @@ -68,4 +68,7 @@ do {									\
>  	___p1;								\
>  })
> 
> +#define smp_mb__before_atomic()	barrier()
> +#define smp_mb__after_atomic()	barrier()
> +
>  #endif /* !(__SPARC64_BARRIER_H) */
> --- a/arch/sparc/include/asm/bitops_32.h
> +++ b/arch/sparc/include/asm/bitops_32.h
> @@ -11,6 +11,7 @@
> 
>  #include <linux/compiler.h>
>  #include <asm/byteorder.h>
> +#include <asm/barrier.h>
> 
>  #ifdef __KERNEL__
> 
> @@ -90,9 +91,6 @@ static inline void change_bit(unsigned l
> 
>  #include <asm-generic/bitops/non-atomic.h>
> 
> -#define smp_mb__before_clear_bit()	do { } while(0)
> -#define smp_mb__after_clear_bit()	do { } while(0)
> -
>  #include <asm-generic/bitops/ffz.h>
>  #include <asm-generic/bitops/__ffs.h>
>  #include <asm-generic/bitops/sched.h>
> --- a/arch/sparc/include/asm/bitops_64.h
> +++ b/arch/sparc/include/asm/bitops_64.h
> @@ -13,6 +13,7 @@
> 
>  #include <linux/compiler.h>
>  #include <asm/byteorder.h>
> +#include <asm/barrier.h>
> 
>  extern int test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
>  extern int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
> @@ -23,9 +24,6 @@ extern void change_bit(unsigned long nr,
> 
>  #include <asm-generic/bitops/non-atomic.h>
> 
> -#define smp_mb__before_clear_bit()	barrier()
> -#define smp_mb__after_clear_bit()	barrier()
> -
>  #include <asm-generic/bitops/fls.h>
>  #include <asm-generic/bitops/__fls.h>
>  #include <asm-generic/bitops/fls64.h>
> --- a/arch/tile/include/asm/atomic_32.h
> +++ b/arch/tile/include/asm/atomic_32.h
> @@ -169,16 +169,6 @@ static inline void atomic64_set(atomic64
>  #define atomic64_dec_and_test(v)	(atomic64_dec_return((v)) == 0)
>  #define atomic64_inc_not_zero(v)	atomic64_add_unless((v), 1LL, 0LL)
> 
> -/*
> - * We need to barrier before modifying the word, since the _atomic_xxx()
> - * routines just tns the lock and then read/modify/write of the word.
> - * But after the word is updated, the routine issues an "mf" before returning,
> - * and since it's a function call, we don't even need a compiler barrier.
> - */
> -#define smp_mb__before_atomic_dec()	smp_mb()
> -#define smp_mb__before_atomic_inc()	smp_mb()
> -#define smp_mb__after_atomic_dec()	do { } while (0)
> -#define smp_mb__after_atomic_inc()	do { } while (0)
> 
>  #endif /* !__ASSEMBLY__ */
> 
> --- a/arch/tile/include/asm/atomic_64.h
> +++ b/arch/tile/include/asm/atomic_64.h
> @@ -105,12 +105,6 @@ static inline long atomic64_add_unless(a
> 
>  #define atomic64_inc_not_zero(v)	atomic64_add_unless((v), 1, 0)
> 
> -/* Atomic dec and inc don't implement barrier, so provide them if needed. */
> -#define smp_mb__before_atomic_dec()	smp_mb()
> -#define smp_mb__after_atomic_dec()	smp_mb()
> -#define smp_mb__before_atomic_inc()	smp_mb()
> -#define smp_mb__after_atomic_inc()	smp_mb()
> -
>  /* Define this to indicate that cmpxchg is an efficient operation. */
>  #define __HAVE_ARCH_CMPXCHG
> 
> --- a/arch/tile/include/asm/barrier.h
> +++ b/arch/tile/include/asm/barrier.h
> @@ -72,6 +72,20 @@ mb_incoherent(void)
>  #define mb()		fast_mb()
>  #define iob()		fast_iob()
> 
> +#ifndef __tilegx__ /* 32 bit */
> +/*
> + * We need to barrier before modifying the word, since the _atomic_xxx()
> + * routines just tns the lock and then read/modify/write of the word.
> + * But after the word is updated, the routine issues an "mf" before returning,
> + * and since it's a function call, we don't even need a compiler barrier.
> + */
> +#define smp_mb__before_atomic()	smp_mb()
> +#define smp_mb__after_atomic()	do { } while (0)
> +#else /* 64 bit */
> +#define smp_mb__before_atomic()	smp_mb()
> +#define smp_mb__after_atomic()	smp_mb()
> +#endif
> +
>  #include <asm-generic/barrier.h>
> 
>  #endif /* !__ASSEMBLY__ */
> --- a/arch/tile/include/asm/bitops.h
> +++ b/arch/tile/include/asm/bitops.h
> @@ -17,6 +17,7 @@
>  #define _ASM_TILE_BITOPS_H
> 
>  #include <linux/types.h>
> +#include <asm/barrier.h>
> 
>  #ifndef _LINUX_BITOPS_H
>  #error only <linux/bitops.h> can be included directly
> --- a/arch/tile/include/asm/bitops_32.h
> +++ b/arch/tile/include/asm/bitops_32.h
> @@ -49,8 +49,8 @@ static inline void set_bit(unsigned nr,
>   * restricted to acting on a single-word quantity.
>   *
>   * clear_bit() may not contain a memory barrier, so if it is used for
> - * locking purposes, you should call smp_mb__before_clear_bit() and/or
> - * smp_mb__after_clear_bit() to ensure changes are visible on other cpus.
> + * locking purposes, you should call smp_mb__before_atomic() and/or
> + * smp_mb__after_atomic() to ensure changes are visible on other cpus.
>   */
>  static inline void clear_bit(unsigned nr, volatile unsigned long *addr)
>  {
> @@ -121,10 +121,6 @@ static inline int test_and_change_bit(un
>  	return (_atomic_xor(addr, mask) & mask) != 0;
>  }
> 
> -/* See discussion at smp_mb__before_atomic_dec() in <asm/atomic_32.h>. */
> -#define smp_mb__before_clear_bit()	smp_mb()
> -#define smp_mb__after_clear_bit()	do {} while (0)
> -
>  #include <asm-generic/bitops/ext2-atomic.h>
> 
>  #endif /* _ASM_TILE_BITOPS_32_H */
> --- a/arch/tile/include/asm/bitops_64.h
> +++ b/arch/tile/include/asm/bitops_64.h
> @@ -32,10 +32,6 @@ static inline void clear_bit(unsigned nr
>  	__insn_fetchand((void *)(addr + nr / BITS_PER_LONG), ~mask);
>  }
> 
> -#define smp_mb__before_clear_bit()	smp_mb()
> -#define smp_mb__after_clear_bit()	smp_mb()
> -
> -
>  static inline void change_bit(unsigned nr, volatile unsigned long *addr)
>  {
>  	unsigned long mask = (1UL << (nr % BITS_PER_LONG));
> --- a/arch/x86/include/asm/atomic.h
> +++ b/arch/x86/include/asm/atomic.h
> @@ -7,6 +7,7 @@
>  #include <asm/alternative.h>
>  #include <asm/cmpxchg.h>
>  #include <asm/rmwcc.h>
> +#include <asm/barrier.h>
> 
>  /*
>   * Atomic operations that C can't guarantee us.  Useful for
> @@ -243,12 +244,6 @@ static inline void atomic_or_long(unsign
>  		     : : "r" ((unsigned)(mask)), "m" (*(addr))	\
>  		     : "memory")
> 
> -/* Atomic operations are already serializing on x86 */
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  #ifdef CONFIG_X86_32
>  # include <asm/atomic64_32.h>
>  #else
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -141,6 +141,10 @@ do {									\
> 
>  #endif
> 
> +/* Atomic operations are already serializing on x86 */
> +#define smp_mb__before_atomic()	barrier()
> +#define smp_mb__after_atomic()	barrier()
> +
>  /*
>   * Stop RDTSC speculation. This is needed when you need to use RDTSC
>   * (or get_cycles or vread that possibly accesses the TSC) in a defined
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -15,6 +15,7 @@
>  #include <linux/compiler.h>
>  #include <asm/alternative.h>
>  #include <asm/rmwcc.h>
> +#include <asm/barrier.h>
> 
>  #if BITS_PER_LONG == 32
>  # define _BITOPS_LONG_SHIFT 5
> @@ -102,7 +103,7 @@ static inline void __set_bit(long nr, vo
>   *
>   * clear_bit() is atomic and may not be reordered.  However, it does
>   * not contain a memory barrier, so if it is used for locking purposes,
> - * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
> + * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
>   * in order to ensure changes are visible on other processors.
>   */
>  static __always_inline void
> @@ -156,9 +157,6 @@ static inline void __clear_bit_unlock(lo
>  	__clear_bit(nr, addr);
>  }
> 
> -#define smp_mb__before_clear_bit()	barrier()
> -#define smp_mb__after_clear_bit()	barrier()
> -
>  /**
>   * __change_bit - Toggle a bit in memory
>   * @nr: the bit to change
> --- a/arch/x86/include/asm/sync_bitops.h
> +++ b/arch/x86/include/asm/sync_bitops.h
> @@ -41,7 +41,7 @@ static inline void sync_set_bit(long nr,
>   *
>   * sync_clear_bit() is atomic and may not be reordered.  However, it does
>   * not contain a memory barrier, so if it is used for locking purposes,
> - * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
> + * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
>   * in order to ensure changes are visible on other processors.
>   */
>  static inline void sync_clear_bit(long nr, volatile unsigned long *addr)
> --- a/arch/x86/kernel/apic/hw_nmi.c
> +++ b/arch/x86/kernel/apic/hw_nmi.c
> @@ -57,7 +57,7 @@ void arch_trigger_all_cpu_backtrace(void
>  	}
> 
>  	clear_bit(0, &backtrace_flag);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  }
> 
>  static int __kprobes
> --- a/arch/xtensa/include/asm/atomic.h
> +++ b/arch/xtensa/include/asm/atomic.h
> @@ -19,6 +19,7 @@
>  #ifdef __KERNEL__
>  #include <asm/processor.h>
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> 
>  #define ATOMIC_INIT(i)	{ (i) }
> 
> @@ -387,12 +388,6 @@ static inline void atomic_set_mask(unsig
>  #endif
>  }
> 
> -/* Atomic operations are already serializing */
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  #endif /* __KERNEL__ */
> 
>  #endif /* _XTENSA_ATOMIC_H */
> --- a/arch/xtensa/include/asm/bitops.h
> +++ b/arch/xtensa/include/asm/bitops.h
> @@ -21,9 +21,7 @@
> 
>  #include <asm/processor.h>
>  #include <asm/byteorder.h>
> -
> -#define smp_mb__before_clear_bit()	smp_mb()
> -#define smp_mb__after_clear_bit()	smp_mb()
> +#include <asm/barrier.h>
> 
>  #include <asm-generic/bitops/non-atomic.h>
> 
> --- a/block/blk-iopoll.c
> +++ b/block/blk-iopoll.c
> @@ -52,7 +52,7 @@ EXPORT_SYMBOL(blk_iopoll_sched);
>  void __blk_iopoll_complete(struct blk_iopoll *iop)
>  {
>  	list_del(&iop->list);
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit_unlock(IOPOLL_F_SCHED, &iop->state);
>  }
>  EXPORT_SYMBOL(__blk_iopoll_complete);
> @@ -164,7 +164,7 @@ EXPORT_SYMBOL(blk_iopoll_disable);
>  void blk_iopoll_enable(struct blk_iopoll *iop)
>  {
>  	BUG_ON(!test_bit(IOPOLL_F_SCHED, &iop->state));
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit_unlock(IOPOLL_F_SCHED, &iop->state);
>  }
>  EXPORT_SYMBOL(blk_iopoll_enable);
> --- a/crypto/chainiv.c
> +++ b/crypto/chainiv.c
> @@ -126,7 +126,7 @@ static int async_chainiv_schedule_work(s
>  	int err = ctx->err;
> 
>  	if (!ctx->queue.qlen) {
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		clear_bit(CHAINIV_STATE_INUSE, &ctx->state);
> 
>  		if (!ctx->queue.qlen ||
> --- a/drivers/base/power/domain.c
> +++ b/drivers/base/power/domain.c
> @@ -106,7 +106,7 @@ static bool genpd_sd_counter_dec(struct
>  static void genpd_sd_counter_inc(struct generic_pm_domain *genpd)
>  {
>  	atomic_inc(&genpd->sd_count);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  }
> 
>  static void genpd_acquire_lock(struct generic_pm_domain *genpd)
> --- a/drivers/block/mtip32xx/mtip32xx.c
> +++ b/drivers/block/mtip32xx/mtip32xx.c
> @@ -224,9 +224,9 @@ static int get_slot(struct mtip_port *po
>   */
>  static inline void release_slot(struct mtip_port *port, int tag)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(tag, port->allocated);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  }
> 
>  /*
> --- a/drivers/cpuidle/coupled.c
> +++ b/drivers/cpuidle/coupled.c
> @@ -159,7 +159,7 @@ void cpuidle_coupled_parallel_barrier(st
>  {
>  	int n = dev->coupled->online_count;
> 
> -	smp_mb__before_atomic_inc();
> +	smp_mb__before_atomic();
>  	atomic_inc(a);
> 
>  	while (atomic_read(a) < n)
> --- a/drivers/firewire/ohci.c
> +++ b/drivers/firewire/ohci.c
> @@ -3509,7 +3509,7 @@ static int ohci_flush_iso_completions(st
>  		}
> 
>  		clear_bit_unlock(0, &ctx->flushing_completions);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  	}
> 
>  	tasklet_enable(&ctx->context.tasklet);
> --- a/drivers/gpu/drm/drm_irq.c
> +++ b/drivers/gpu/drm/drm_irq.c
> @@ -156,7 +156,7 @@ static void vblank_disable_and_save(stru
>  	 */
>  	if ((vblrc > 0) && (abs64(diff_ns) > 1000000)) {
>  		atomic_inc(&dev->vblank[crtc].count);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  	}
> 
>  	/* Invalidate all timestamps while vblank irq's are off. */
> @@ -864,9 +864,9 @@ static void drm_update_vblank_count(stru
>  		vblanktimestamp(dev, crtc, tslot) = t_vblank;
>  	}
> 
> -	smp_mb__before_atomic_inc();
> +	smp_mb__before_atomic();
>  	atomic_add(diff, &dev->vblank[crtc].count);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  }
> 
>  /**
> @@ -1330,9 +1330,9 @@ bool drm_handle_vblank(struct drm_device
>  		/* Increment cooked vblank count. This also atomically commits
>  		 * the timestamp computed above.
>  		 */
> -		smp_mb__before_atomic_inc();
> +		smp_mb__before_atomic();
>  		atomic_inc(&dev->vblank[crtc].count);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  	} else {
>  		DRM_DEBUG("crtc %d: Redundant vblirq ignored. diff_ns = %d\n",
>  			  crtc, (int) diff_ns);
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1996,7 +1996,7 @@ static void i915_error_work_func(struct
>  			 * updates before
>  			 * the counter increment.
>  			 */
> -			smp_mb__before_atomic_inc();
> +			smp_mb__before_atomic();
>  			atomic_inc(&dev_priv->gpu_error.reset_counter);
> 
>  			kobject_uevent_env(&dev->primary->kdev->kobj,
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -841,7 +841,7 @@ static inline bool cached_dev_get(struct
>  		return false;
> 
>  	/* Paired with the mb in cached_dev_attach */
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  	return true;
>  }
> 
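The "Paired with the mb" comment is the acquire half of a publish
pattern: the attach side does its initialisation stores and a full
smp_mb() before making the device reachable, and the get side needs a
barrier after the increment so its later loads cannot float above it.
Roughly like this (the get half is the hunk above; the attach half is
paraphrased, so treat the details as an assumption):

	/* attach side (paraphrased) */
	dc->disk.c = c;			/* init stores		    */
	smp_mb();			/* ... made visible before  */
	/* ... dc is published/reachable here */

	/* get side */
	if (!atomic_inc_not_zero(&dc->count))
		return false;
	smp_mb__after_atomic();		/* loads stay after the inc */
	return true;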
> --- a/drivers/md/bcache/closure.h
> +++ b/drivers/md/bcache/closure.h
> @@ -243,7 +243,7 @@ static inline void set_closure_fn(struct
>  	cl->fn = fn;
>  	cl->wq = wq;
>  	/* between atomic_dec() in closure_put() */
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  }
> 
>  static inline void closure_queue(struct closure *cl)
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -607,9 +607,9 @@ static void write_endio(struct bio *bio,
> 
>  	BUG_ON(!test_bit(B_WRITING, &b->state));
> 
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(B_WRITING, &b->state);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	wake_up_bit(&b->state, B_WRITING);
>  }
> @@ -997,9 +997,9 @@ static void read_endio(struct bio *bio,
> 
>  	BUG_ON(!test_bit(B_READING, &b->state));
> 
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(B_READING, &b->state);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	wake_up_bit(&b->state, B_READING);
>  }
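This is the canonical waker half of the bit-wait pattern, and the
reason both barriers must be full ones: the waiter tests the bit
without holding any lock.  Both halves side by side (B_WRITING as in
the hunk; the action callback is a hypothetical stand-in for the
schedule-based helper the real code passes):

	/* waiter's action callback -- hypothetical */
	static int bit_wait_io(void *word)
	{
		io_schedule();
		return 0;
	}

	/* waker: completion path */
	smp_mb__before_atomic();	   /* order the result stores  */
	clear_bit(B_WRITING, &b->state);
	smp_mb__after_atomic();		   /* clear visible before ... */
	wake_up_bit(&b->state, B_WRITING); /* ... the waitqueue check  */

	/* waiter */
	wait_on_bit(&b->state, B_WRITING, bit_wait_io, TASK_UNINTERRUPTIBLE);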
> --- a/drivers/md/dm-snap.c
> +++ b/drivers/md/dm-snap.c
> @@ -642,7 +642,7 @@ static void free_pending_exception(struc
>  	struct dm_snapshot *s = pe->snap;
> 
>  	mempool_free(pe, s->pending_pool);
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  	atomic_dec(&s->pending_exceptions_count);
>  }
> 
> @@ -783,7 +783,7 @@ static int init_hash_tables(struct dm_sn
>  static void merge_shutdown(struct dm_snapshot *s)
>  {
>  	clear_bit_unlock(RUNNING_MERGE, &s->state_bits);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&s->state_bits, RUNNING_MERGE);
>  }
> 
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -2451,7 +2451,7 @@ static void dm_wq_work(struct work_struc
>  static void dm_queue_flush(struct mapped_device *md)
>  {
>  	clear_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	queue_work(md->wq, &md->work);
>  }
> 
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -4406,7 +4406,7 @@ static void raid5_unplug(struct blk_plug
>  			 * STRIPE_ON_UNPLUG_LIST clear but the stripe
>  			 * is still in our list
>  			 */
> -			smp_mb__before_clear_bit();
> +			smp_mb__before_atomic();
>  			clear_bit(STRIPE_ON_UNPLUG_LIST, &sh->state);
>  			/*
>  			 * STRIPE_ON_RELEASE_LIST could be set here. In that
> --- a/drivers/media/usb/dvb-usb-v2/dvb_usb_core.c
> +++ b/drivers/media/usb/dvb-usb-v2/dvb_usb_core.c
> @@ -399,7 +399,7 @@ static int dvb_usb_stop_feed(struct dvb_
> 
>  	/* clear 'streaming' status bit */
>  	clear_bit(ADAP_STREAMING, &adap->state_bits);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&adap->state_bits, ADAP_STREAMING);
>  skip_feed_stop:
> 
> @@ -550,7 +550,7 @@ static int dvb_usb_fe_init(struct dvb_fr
>  err:
>  	if (!adap->suspend_resume_active) {
>  		clear_bit(ADAP_INIT, &adap->state_bits);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		wake_up_bit(&adap->state_bits, ADAP_INIT);
>  	}
> 
> @@ -591,7 +591,7 @@ static int dvb_usb_fe_sleep(struct dvb_f
>  	if (!adap->suspend_resume_active) {
>  		adap->active_fe = -1;
>  		clear_bit(ADAP_SLEEP, &adap->state_bits);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		wake_up_bit(&adap->state_bits, ADAP_SLEEP);
>  	}
> 
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> @@ -2779,7 +2779,7 @@ int bnx2x_nic_load(struct bnx2x *bp, int
> 
>  	case LOAD_OPEN:
>  		netif_tx_start_all_queues(bp->dev);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		break;
> 
>  	case LOAD_DIAG:
> @@ -4773,9 +4773,9 @@ void bnx2x_tx_timeout(struct net_device
>  		bnx2x_panic();
>  #endif
> 
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	set_bit(BNX2X_SP_RTNL_TX_TIMEOUT, &bp->sp_rtnl_state);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	/* This allows the netif to be shutdown gracefully before resetting */
>  	schedule_delayed_work(&bp->sp_rtnl_task, 0);
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> @@ -1837,10 +1837,10 @@ void bnx2x_sp_event(struct bnx2x_fastpat
>  	/* SRIOV: reschedule any 'in_progress' operations */
>  	bnx2x_iov_sp_event(bp, cid, true);
> 
> -	smp_mb__before_atomic_inc();
> +	smp_mb__before_atomic();
>  	atomic_inc(&bp->cq_spq_left);
>  	/* push the change in bp->cq_spq_left out to memory */
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
> 
>  	DP(BNX2X_MSG_SP, "bp->cq_spq_left %x\n", atomic_read(&bp->cq_spq_left));
> 
> @@ -1855,11 +1855,11 @@ void bnx2x_sp_event(struct bnx2x_fastpat
>  		 * sp_state is cleared, and this order prevents
>  		 * races
>  		 */
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		set_bit(BNX2X_AFEX_PENDING_VIFSET_MCP_ACK, &bp->sp_state);
>  		wmb();
>  		clear_bit(BNX2X_AFEX_FCOE_Q_UPDATE_PENDING, &bp->sp_state);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
> 
>  		/* schedule the sp task as mcp ack is required */
>  		bnx2x_schedule_sp_task(bp);
> @@ -3878,9 +3878,9 @@ static void bnx2x_fan_failure(struct bnx
>  	 * This is because some boards draw enough power while the driver is
>  	 * up to overheat if the fan fails.
>  	 */
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	set_bit(BNX2X_SP_RTNL_FAN_FAILURE, &bp->sp_rtnl_state);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	schedule_delayed_work(&bp->sp_rtnl_task, 0);
>  }
> 
> @@ -5137,9 +5137,9 @@ static void bnx2x_after_function_update(
>  		__clear_bit(RAMROD_COMP_WAIT, &queue_params.ramrod_flags);
> 
>  		/* mark latest Q bit */
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		set_bit(BNX2X_AFEX_FCOE_Q_UPDATE_PENDING, &bp->sp_state);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
> 
>  		/* send Q update ramrod for FCoE Q */
>  		rc = bnx2x_queue_state_change(bp, &queue_params);
> @@ -5282,10 +5282,10 @@ static void bnx2x_eq_int(struct bnx2x *b
>  				 * sp_rtnl task as all Queue SP operations
>  				 * should run under rtnl_lock.
>  				 */
> -				smp_mb__before_clear_bit();
> +				smp_mb__before_atomic();
>  				set_bit(BNX2X_SP_RTNL_AFEX_F_UPDATE,
>  					&bp->sp_rtnl_state);
> -				smp_mb__after_clear_bit();
> +				smp_mb__after_atomic();
> 
>  				schedule_delayed_work(&bp->sp_rtnl_task, 0);
>  			}
> @@ -5368,7 +5368,7 @@ static void bnx2x_eq_int(struct bnx2x *b
>  		spqe_cnt++;
>  	} /* for */
> 
> -	smp_mb__before_atomic_inc();
> +	smp_mb__before_atomic();
>  	atomic_add(spqe_cnt, &bp->eq_spq_left);
> 
>  	bp->eq_cons = sw_cons;
> @@ -12065,9 +12065,9 @@ static void bnx2x_set_rx_mode(struct net
>  	} else {
>  		/* Schedule an SP task to handle rest of change */
>  		DP(NETIF_MSG_IFUP, "Scheduling an Rx mode change\n");
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		set_bit(BNX2X_SP_RTNL_RX_MODE, &bp->sp_rtnl_state);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		schedule_delayed_work(&bp->sp_rtnl_task, 0);
>  	}
>  }
> @@ -12101,10 +12101,10 @@ void bnx2x_set_rx_mode_inner(struct bnx2
>  			/* configuring mcast to a vf involves sleeping (when we
>  			 * wait for the pf's response).
>  			 */
> -			smp_mb__before_clear_bit();
> +			smp_mb__before_atomic();
>  			set_bit(BNX2X_SP_RTNL_VFPF_MCAST,
>  				&bp->sp_rtnl_state);
> -			smp_mb__after_clear_bit();
> +			smp_mb__after_atomic();
>  			schedule_delayed_work(&bp->sp_rtnl_task, 0);
>  		}
>  	}
> @@ -13723,9 +13723,9 @@ static int bnx2x_drv_ctl(struct net_devi
>  	case DRV_CTL_RET_L2_SPQ_CREDIT_CMD: {
>  		int count = ctl->data.credit.credit_count;
> 
> -		smp_mb__before_atomic_inc();
> +		smp_mb__before_atomic();
>  		atomic_add(count, &bp->cq_spq_left);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  		break;
>  	}
>  	case DRV_CTL_ULP_REGISTER_CMD: {
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
> @@ -258,16 +258,16 @@ static bool bnx2x_raw_check_pending(stru
> 
>  static void bnx2x_raw_clear_pending(struct bnx2x_raw_obj *o)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(o->state, o->pstate);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  }
> 
>  static void bnx2x_raw_set_pending(struct bnx2x_raw_obj *o)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	set_bit(o->state, o->pstate);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  }
> 
>  /**
> @@ -2131,7 +2131,7 @@ static int bnx2x_set_rx_mode_e1x(struct
> 
>  	/* The operation is completed */
>  	clear_bit(p->state, p->pstate);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	return 0;
>  }
> @@ -3576,16 +3576,16 @@ int bnx2x_config_mcast(struct bnx2x *bp,
> 
>  static void bnx2x_mcast_clear_sched(struct bnx2x_mcast_obj *o)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(o->sched_state, o->raw.pstate);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  }
> 
>  static void bnx2x_mcast_set_sched(struct bnx2x_mcast_obj *o)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	set_bit(o->sched_state, o->raw.pstate);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  }
> 
>  static bool bnx2x_mcast_check_sched(struct bnx2x_mcast_obj *o)
> @@ -4210,7 +4210,7 @@ int bnx2x_queue_state_change(struct bnx2
>  		if (rc) {
>  			o->next_state = BNX2X_Q_STATE_MAX;
>  			clear_bit(pending_bit, pending);
> -			smp_mb__after_clear_bit();
> +			smp_mb__after_atomic();
>  			return rc;
>  		}
> 
> @@ -4298,7 +4298,7 @@ static int bnx2x_queue_comp_cmd(struct b
>  	wmb();
> 
>  	clear_bit(cmd, &o->pending);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	return 0;
>  }
> @@ -5242,7 +5242,7 @@ static inline int bnx2x_func_state_chang
>  	wmb();
> 
>  	clear_bit(cmd, &o->pending);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	return 0;
>  }
> @@ -5877,7 +5877,7 @@ int bnx2x_func_state_change(struct bnx2x
>  		if (rc) {
>  			o->next_state = BNX2X_F_STATE_MAX;
>  			clear_bit(cmd, pending);
> -			smp_mb__after_clear_bit();
> +			smp_mb__after_atomic();
>  			return rc;
>  		}
> 
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> @@ -971,10 +971,10 @@ static void bnx2x_vfop_qsetup(struct bnx
>  op_done:
>  	case BNX2X_VFOP_QSETUP_DONE:
>  		vf->cfg_flags |= VF_CFG_VLAN;
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		set_bit(BNX2X_SP_RTNL_HYPERVISOR_VLAN,
>  			&bp->sp_rtnl_state);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		schedule_delayed_work(&bp->sp_rtnl_task, 0);
>  		bnx2x_vfop_end(bp, vf, vfop);
>  		return;
> @@ -2354,9 +2354,9 @@ static
>  void bnx2x_vf_handle_filters_eqe(struct bnx2x *bp,
>  				 struct bnx2x_virtf *vf)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(BNX2X_FILTER_RX_MODE_PENDING, &vf->filter_state);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  }
> 
>  int bnx2x_iov_eq_sp_event(struct bnx2x *bp, union event_ring_elem *elem)
> @@ -3737,10 +3737,10 @@ void bnx2x_timer_sriov(struct bnx2x *bp)
> 
>  	/* if channel is down we need to self destruct */
>  	if (bp->old_bulletin.valid_bitmap & 1 << CHANNEL_DOWN) {
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		set_bit(BNX2X_SP_RTNL_VFPF_CHANNEL_DOWN,
>  			&bp->sp_rtnl_state);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		schedule_delayed_work(&bp->sp_rtnl_task, 0);
>  	}
>  }
> --- a/drivers/net/ethernet/broadcom/cnic.c
> +++ b/drivers/net/ethernet/broadcom/cnic.c
> @@ -436,7 +436,7 @@ static int cnic_offld_prep(struct cnic_s
>  static int cnic_close_prep(struct cnic_sock *csk)
>  {
>  	clear_bit(SK_F_CONNECT_START, &csk->flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	if (test_and_clear_bit(SK_F_OFFLD_COMPLETE, &csk->flags)) {
>  		while (test_and_set_bit(SK_F_OFFLD_SCHED, &csk->flags))
> @@ -450,7 +450,7 @@ static int cnic_close_prep(struct cnic_s
>  static int cnic_abort_prep(struct cnic_sock *csk)
>  {
>  	clear_bit(SK_F_CONNECT_START, &csk->flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	while (test_and_set_bit(SK_F_OFFLD_SCHED, &csk->flags))
>  		msleep(1);
> @@ -3645,7 +3645,7 @@ static int cnic_cm_destroy(struct cnic_s
> 
>  	csk_hold(csk);
>  	clear_bit(SK_F_INUSE, &csk->flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	while (atomic_read(&csk->ref_count) != 1)
>  		msleep(1);
>  	cnic_cm_cleanup(csk);
> @@ -4025,7 +4025,7 @@ static void cnic_cm_process_kcqe(struct
>  			 L4_KCQE_COMPLETION_STATUS_PARITY_ERROR)
>  			set_bit(SK_F_HW_ERR, &csk->flags);
> 
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		clear_bit(SK_F_OFFLD_SCHED, &csk->flags);
>  		cnic_cm_upcall(cp, csk, opcode);
>  		break;
> --- a/drivers/net/ethernet/brocade/bna/bnad.c
> +++ b/drivers/net/ethernet/brocade/bna/bnad.c
> @@ -249,7 +249,7 @@ bnad_tx_complete(struct bnad *bnad, stru
>  	if (likely(test_bit(BNAD_TXQ_TX_STARTED, &tcb->flags)))
>  		bna_ib_ack(tcb->i_dbell, sent);
> 
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(BNAD_TXQ_FREE_SENT, &tcb->flags);
> 
>  	return sent;
> @@ -1125,7 +1125,7 @@ bnad_tx_cleanup(struct delayed_work *wor
> 
>  		bnad_txq_cleanup(bnad, tcb);
> 
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		clear_bit(BNAD_TXQ_FREE_SENT, &tcb->flags);
>  	}
> 
> @@ -2999,7 +2999,7 @@ bnad_start_xmit(struct sk_buff *skb, str
>  			sent = bnad_txcmpl_process(bnad, tcb);
>  			if (likely(test_bit(BNAD_TXQ_TX_STARTED, &tcb->flags)))
>  				bna_ib_ack(tcb->i_dbell, sent);
> -			smp_mb__before_clear_bit();
> +			smp_mb__before_atomic();
>  			clear_bit(BNAD_TXQ_FREE_SENT, &tcb->flags);
>  		} else {
>  			netif_stop_queue(netdev);
> --- a/drivers/net/ethernet/chelsio/cxgb/cxgb2.c
> +++ b/drivers/net/ethernet/chelsio/cxgb/cxgb2.c
> @@ -281,7 +281,7 @@ static int cxgb_close(struct net_device
>  	if (adapter->params.stats_update_period &&
>  	    !(adapter->open_device_map & PORT_MASK)) {
>  		/* Stop statistics accumulation. */
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		spin_lock(&adapter->work_lock);   /* sync with update task */
>  		spin_unlock(&adapter->work_lock);
>  		cancel_mac_stats_update(adapter);
> --- a/drivers/net/ethernet/chelsio/cxgb3/sge.c
> +++ b/drivers/net/ethernet/chelsio/cxgb3/sge.c
> @@ -1379,7 +1379,7 @@ static inline int check_desc_avail(struc
>  		struct sge_qset *qs = txq_to_qset(q, qid);
> 
>  		set_bit(qid, &qs->txq_stopped);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
> 
>  		if (should_restart_tx(q) &&
>  		    test_and_clear_bit(qid, &qs->txq_stopped))
> @@ -1492,7 +1492,7 @@ static void restart_ctrlq(unsigned long
> 
>  	if (!skb_queue_empty(&q->sendq)) {
>  		set_bit(TXQ_CTRL, &qs->txq_stopped);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
> 
>  		if (should_restart_tx(q) &&
>  		    test_and_clear_bit(TXQ_CTRL, &qs->txq_stopped))
> @@ -1697,7 +1697,7 @@ again:	reclaim_completed_tx(adap, q, TX_
> 
>  		if (unlikely(q->size - q->in_use < ndesc)) {
>  			set_bit(TXQ_OFLD, &qs->txq_stopped);
> -			smp_mb__after_clear_bit();
> +			smp_mb__after_atomic();
> 
>  			if (should_restart_tx(q) &&
>  			    test_and_clear_bit(TXQ_OFLD, &qs->txq_stopped))
> --- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
> +++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
> @@ -2003,7 +2003,7 @@ static void sge_rx_timer_cb(unsigned lon
>  			struct sge_fl *fl = s->egr_map[id];
> 
>  			clear_bit(id, s->starving_fl);
> -			smp_mb__after_clear_bit();
> +			smp_mb__after_atomic();
> 
>  			if (fl_starving(fl)) {
>  				rxq = container_of(fl, struct sge_eth_rxq, fl);
> --- a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
> +++ b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
> @@ -1951,7 +1951,7 @@ static void sge_rx_timer_cb(unsigned lon
>  			struct sge_fl *fl = s->egr_map[id];
> 
>  			clear_bit(id, s->starving_fl);
> -			smp_mb__after_clear_bit();
> +			smp_mb__after_atomic();
> 
>  			/*
>  			 * Since we are accessing fl without a lock there's a
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -4576,7 +4576,7 @@ static void i40e_service_event_complete(
>  	BUG_ON(!test_bit(__I40E_SERVICE_SCHED, &pf->state));
> 
>  	/* flush memory to make sure state is correct before next watchdog */
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(__I40E_SERVICE_SCHED, &pf->state);
>  }
> 
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -318,7 +318,7 @@ static void ixgbe_service_event_complete
>  	BUG_ON(!test_bit(__IXGBE_SERVICE_SCHED, &adapter->state));
> 
>  	/* flush memory to make sure state is correct before next watchdog */
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(__IXGBE_SERVICE_SCHED, &adapter->state);
>  }
> 
> @@ -4607,7 +4607,7 @@ static void ixgbe_up_complete(struct ixg
>  	if (hw->mac.ops.enable_tx_laser)
>  		hw->mac.ops.enable_tx_laser(hw);
> 
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(__IXGBE_DOWN, &adapter->state);
>  	ixgbe_napi_enable_all(adapter);
> 
> --- a/drivers/net/wireless/ti/wlcore/main.c
> +++ b/drivers/net/wireless/ti/wlcore/main.c
> @@ -547,7 +547,7 @@ static int wlcore_irq_locked(struct wl12
>  		 * wl1271_ps_elp_wakeup cannot be called concurrently.
>  		 */
>  		clear_bit(WL1271_FLAG_IRQ_RUNNING, &wl->flags);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
> 
>  		ret = wlcore_fw_status(wl, wl->fw_status_1, wl->fw_status_2);
>  		if (ret < 0)
> --- a/drivers/pci/xen-pcifront.c
> +++ b/drivers/pci/xen-pcifront.c
> @@ -662,9 +662,9 @@ static void pcifront_do_aer(struct work_
>  	notify_remote_via_evtchn(pdev->evtchn);
> 
>  	/* in case we lost an AER request in the four-line time window */
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(_PDEVB_op_active, &pdev->flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	schedule_pcifront_aer_op(pdev);
> 
> --- a/drivers/scsi/isci/remote_device.c
> +++ b/drivers/scsi/isci/remote_device.c
> @@ -1541,7 +1541,7 @@ void isci_remote_device_release(struct k
>  	clear_bit(IDEV_STOP_PENDING, &idev->flags);
>  	clear_bit(IDEV_IO_READY, &idev->flags);
>  	clear_bit(IDEV_GONE, &idev->flags);
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(IDEV_ALLOCATED, &idev->flags);
>  	wake_up(&ihost->eventq);
>  }
> --- a/drivers/target/loopback/tcm_loop.c
> +++ b/drivers/target/loopback/tcm_loop.c
> @@ -942,7 +942,7 @@ static int tcm_loop_port_link(
>  	struct tcm_loop_hba *tl_hba = tl_tpg->tl_hba;
> 
>  	atomic_inc(&tl_tpg->tl_tpg_port_count);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  	/*
>  	 * Add Linux/SCSI struct scsi_device by HCTL
>  	 */
> @@ -977,7 +977,7 @@ static void tcm_loop_port_unlink(
>  	scsi_device_put(sd);
> 
>  	atomic_dec(&tl_tpg->tl_tpg_port_count);
> -	smp_mb__after_atomic_dec();
> +	smp_mb__after_atomic();
> 
>  	pr_debug("TCM_Loop_ConfigFS: Port Unlink Successful\n");
>  }
> --- a/drivers/target/target_core_alua.c
> +++ b/drivers/target/target_core_alua.c
> @@ -393,7 +393,7 @@ target_emulate_set_target_port_groups(st
>  					continue;
> 
>  				atomic_inc(&tg_pt_gp->tg_pt_gp_ref_cnt);
> -				smp_mb__after_atomic_inc();
> +				smp_mb__after_atomic();
> 
>  				spin_unlock(&dev->t10_alua.tg_pt_gps_lock);
> 
> @@ -404,7 +404,7 @@ target_emulate_set_target_port_groups(st
> 
>  				spin_lock(&dev->t10_alua.tg_pt_gps_lock);
>  				atomic_dec(&tg_pt_gp->tg_pt_gp_ref_cnt);
> -				smp_mb__after_atomic_dec();
> +				smp_mb__after_atomic();
>  				break;
>  			}
>  			spin_unlock(&dev->t10_alua.tg_pt_gps_lock);
> @@ -997,7 +997,7 @@ static void core_alua_do_transition_tg_p
>  		 * TARGET PORT GROUPS command
>  		 */
>  		atomic_inc(&mem->tg_pt_gp_mem_ref_cnt);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  		spin_unlock(&tg_pt_gp->tg_pt_gp_lock);
> 
>  		spin_lock_bh(&port->sep_alua_lock);
> @@ -1027,7 +1027,7 @@ static void core_alua_do_transition_tg_p
> 
>  		spin_lock(&tg_pt_gp->tg_pt_gp_lock);
>  		atomic_dec(&mem->tg_pt_gp_mem_ref_cnt);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
>  	}
>  	spin_unlock(&tg_pt_gp->tg_pt_gp_lock);
>  	/*
> @@ -1061,7 +1061,7 @@ static void core_alua_do_transition_tg_p
>  		core_alua_dump_state(tg_pt_gp->tg_pt_gp_alua_pending_state));
>  	spin_lock(&dev->t10_alua.tg_pt_gps_lock);
>  	atomic_dec(&tg_pt_gp->tg_pt_gp_ref_cnt);
> -	smp_mb__after_atomic_dec();
> +	smp_mb__after_atomic();
>  	spin_unlock(&dev->t10_alua.tg_pt_gps_lock);
> 
>  	if (tg_pt_gp->tg_pt_gp_transition_complete)
> @@ -1123,7 +1123,7 @@ static int core_alua_do_transition_tg_pt
>  	 */
>  	spin_lock(&dev->t10_alua.tg_pt_gps_lock);
>  	atomic_inc(&tg_pt_gp->tg_pt_gp_ref_cnt);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  	spin_unlock(&dev->t10_alua.tg_pt_gps_lock);
> 
>  	if (!explicit && tg_pt_gp->tg_pt_gp_implicit_trans_secs) {
> @@ -1166,7 +1166,7 @@ int core_alua_do_port_transition(
>  	spin_lock(&local_lu_gp_mem->lu_gp_mem_lock);
>  	lu_gp = local_lu_gp_mem->lu_gp;
>  	atomic_inc(&lu_gp->lu_gp_ref_cnt);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  	spin_unlock(&local_lu_gp_mem->lu_gp_mem_lock);
>  	/*
>  	 * For storage objects that are members of the 'default_lu_gp',
> @@ -1183,7 +1183,7 @@ int core_alua_do_port_transition(
>  		rc = core_alua_do_transition_tg_pt(l_tg_pt_gp,
>  						   new_state, explicit);
>  		atomic_dec(&lu_gp->lu_gp_ref_cnt);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
>  		return rc;
>  	}
>  	/*
> @@ -1197,7 +1197,7 @@ int core_alua_do_port_transition(
> 
>  		dev = lu_gp_mem->lu_gp_mem_dev;
>  		atomic_inc(&lu_gp_mem->lu_gp_mem_ref_cnt);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  		spin_unlock(&lu_gp->lu_gp_lock);
> 
>  		spin_lock(&dev->t10_alua.tg_pt_gps_lock);
> @@ -1226,7 +1226,7 @@ int core_alua_do_port_transition(
>  				tg_pt_gp->tg_pt_gp_alua_nacl = NULL;
>  			}
>  			atomic_inc(&tg_pt_gp->tg_pt_gp_ref_cnt);
> -			smp_mb__after_atomic_inc();
> +			smp_mb__after_atomic();
>  			spin_unlock(&dev->t10_alua.tg_pt_gps_lock);
>  			/*
>  			 * core_alua_do_transition_tg_pt() will always return
> @@ -1237,7 +1237,7 @@ int core_alua_do_port_transition(
> 
>  			spin_lock(&dev->t10_alua.tg_pt_gps_lock);
>  			atomic_dec(&tg_pt_gp->tg_pt_gp_ref_cnt);
> -			smp_mb__after_atomic_dec();
> +			smp_mb__after_atomic();
>  			if (rc)
>  				break;
>  		}
> @@ -1245,7 +1245,7 @@ int core_alua_do_port_transition(
> 
>  		spin_lock(&lu_gp->lu_gp_lock);
>  		atomic_dec(&lu_gp_mem->lu_gp_mem_ref_cnt);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
>  	}
>  	spin_unlock(&lu_gp->lu_gp_lock);
> 
> @@ -1259,7 +1259,7 @@ int core_alua_do_port_transition(
>  	}
> 
>  	atomic_dec(&lu_gp->lu_gp_ref_cnt);
> -	smp_mb__after_atomic_dec();
> +	smp_mb__after_atomic();
>  	return rc;
>  }
> 
> --- a/drivers/target/target_core_device.c
> +++ b/drivers/target/target_core_device.c
> @@ -225,7 +225,7 @@ struct se_dev_entry *core_get_se_deve_fr
>  			continue;
> 
>  		atomic_inc(&deve->pr_ref_count);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  		spin_unlock_irq(&nacl->device_list_lock);
> 
>  		return deve;
> @@ -1392,7 +1392,7 @@ int core_dev_add_initiator_node_lun_acl(
>  	spin_lock(&lun->lun_acl_lock);
>  	list_add_tail(&lacl->lacl_list, &lun->lun_acl_list);
>  	atomic_inc(&lun->lun_acl_count);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  	spin_unlock(&lun->lun_acl_lock);
> 
>  	pr_debug("%s_TPG[%hu]_LUN[%u->%u] - Added %s ACL for "
> @@ -1426,7 +1426,7 @@ int core_dev_del_initiator_node_lun_acl(
>  	spin_lock(&lun->lun_acl_lock);
>  	list_del(&lacl->lacl_list);
>  	atomic_dec(&lun->lun_acl_count);
> -	smp_mb__after_atomic_dec();
> +	smp_mb__after_atomic();
>  	spin_unlock(&lun->lun_acl_lock);
> 
>  	core_disable_device_list_for_node(lun, NULL, lacl->mapped_lun,
> --- a/drivers/target/target_core_iblock.c
> +++ b/drivers/target/target_core_iblock.c
> @@ -324,7 +324,7 @@ static void iblock_bio_done(struct bio *
>  		 * Bump the ib_bio_err_cnt and release bio.
>  		 */
>  		atomic_inc(&ibr->ib_bio_err_cnt);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  	}
> 
>  	bio_put(bio);
> --- a/drivers/target/target_core_pr.c
> +++ b/drivers/target/target_core_pr.c
> @@ -675,7 +675,7 @@ static struct t10_pr_registration *__cor
>  	spin_lock(&dev->se_port_lock);
>  	list_for_each_entry_safe(port, port_tmp, &dev->dev_sep_list, sep_list) {
>  		atomic_inc(&port->sep_tg_pt_ref_cnt);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  		spin_unlock(&dev->se_port_lock);
> 
>  		spin_lock_bh(&port->sep_alua_lock);
> @@ -710,7 +710,7 @@ static struct t10_pr_registration *__cor
>  				continue;
> 
>  			atomic_inc(&deve_tmp->pr_ref_count);
> -			smp_mb__after_atomic_inc();
> +			smp_mb__after_atomic();
>  			spin_unlock_bh(&port->sep_alua_lock);
>  			/*
>  			 * Grab a configfs group dependency that is released
> @@ -723,9 +723,9 @@ static struct t10_pr_registration *__cor
>  				pr_err("core_scsi3_lunacl_depend"
>  						"_item() failed\n");
>  				atomic_dec(&port->sep_tg_pt_ref_cnt);
> -				smp_mb__after_atomic_dec();
> +				smp_mb__after_atomic();
>  				atomic_dec(&deve_tmp->pr_ref_count);
> -				smp_mb__after_atomic_dec();
> +				smp_mb__after_atomic();
>  				goto out;
>  			}
>  			/*
> @@ -740,9 +740,9 @@ static struct t10_pr_registration *__cor
>  						sa_res_key, all_tg_pt, aptpl);
>  			if (!pr_reg_atp) {
>  				atomic_dec(&port->sep_tg_pt_ref_cnt);
> -				smp_mb__after_atomic_dec();
> +				smp_mb__after_atomic();
>  				atomic_dec(&deve_tmp->pr_ref_count);
> -				smp_mb__after_atomic_dec();
> +				smp_mb__after_atomic();
>  				core_scsi3_lunacl_undepend_item(deve_tmp);
>  				goto out;
>  			}
> @@ -755,7 +755,7 @@ static struct t10_pr_registration *__cor
> 
>  		spin_lock(&dev->se_port_lock);
>  		atomic_dec(&port->sep_tg_pt_ref_cnt);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
>  	}
>  	spin_unlock(&dev->se_port_lock);
> 
> @@ -1110,7 +1110,7 @@ static struct t10_pr_registration *__cor
>  					continue;
>  			}
>  			atomic_inc(&pr_reg->pr_res_holders);
> -			smp_mb__after_atomic_inc();
> +			smp_mb__after_atomic();
>  			spin_unlock(&pr_tmpl->registration_lock);
>  			return pr_reg;
>  		}
> @@ -1125,7 +1125,7 @@ static struct t10_pr_registration *__cor
>  			continue;
> 
>  		atomic_inc(&pr_reg->pr_res_holders);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  		spin_unlock(&pr_tmpl->registration_lock);
>  		return pr_reg;
>  	}
> @@ -1155,7 +1155,7 @@ static struct t10_pr_registration *core_
>  static void core_scsi3_put_pr_reg(struct t10_pr_registration *pr_reg)
>  {
>  	atomic_dec(&pr_reg->pr_res_holders);
> -	smp_mb__after_atomic_dec();
> +	smp_mb__after_atomic();
>  }
> 
>  static int core_scsi3_check_implicit_release(
> @@ -1349,7 +1349,7 @@ static void core_scsi3_tpg_undepend_item
>  			&tpg->tpg_group.cg_item);
> 
>  	atomic_dec(&tpg->tpg_pr_ref_count);
> -	smp_mb__after_atomic_dec();
> +	smp_mb__after_atomic();
>  }
> 
>  static int core_scsi3_nodeacl_depend_item(struct se_node_acl *nacl)
> @@ -1369,7 +1369,7 @@ static void core_scsi3_nodeacl_undepend_
> 
>  	if (nacl->dynamic_node_acl) {
>  		atomic_dec(&nacl->acl_pr_ref_count);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
>  		return;
>  	}
> 
> @@ -1377,7 +1377,7 @@ static void core_scsi3_nodeacl_undepend_
>  			&nacl->acl_group.cg_item);
> 
>  	atomic_dec(&nacl->acl_pr_ref_count);
> -	smp_mb__after_atomic_dec();
> +	smp_mb__after_atomic();
>  }
> 
>  static int core_scsi3_lunacl_depend_item(struct se_dev_entry *se_deve)
> @@ -1408,7 +1408,7 @@ static void core_scsi3_lunacl_undepend_i
>  	 */
>  	if (!lun_acl) {
>  		atomic_dec(&se_deve->pr_ref_count);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
>  		return;
>  	}
>  	nacl = lun_acl->se_lun_nacl;
> @@ -1418,7 +1418,7 @@ static void core_scsi3_lunacl_undepend_i
>  			&lun_acl->se_lun_group.cg_item);
> 
>  	atomic_dec(&se_deve->pr_ref_count);
> -	smp_mb__after_atomic_dec();
> +	smp_mb__after_atomic();
>  }
> 
>  static sense_reason_t
> @@ -1552,14 +1552,14 @@ core_scsi3_decode_spec_i_port(
>  				continue;
> 
>  			atomic_inc(&tmp_tpg->tpg_pr_ref_count);
> -			smp_mb__after_atomic_inc();
> +			smp_mb__after_atomic();
>  			spin_unlock(&dev->se_port_lock);
> 
>  			if (core_scsi3_tpg_depend_item(tmp_tpg)) {
>  				pr_err(" core_scsi3_tpg_depend_item()"
>  					" for tmp_tpg\n");
>  				atomic_dec(&tmp_tpg->tpg_pr_ref_count);
> -				smp_mb__after_atomic_dec();
> +				smp_mb__after_atomic();
>  				ret = TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
>  				goto out_unmap;
>  			}
> @@ -1573,7 +1573,7 @@ core_scsi3_decode_spec_i_port(
>  						tmp_tpg, i_str);
>  			if (dest_node_acl) {
>  				atomic_inc(&dest_node_acl->acl_pr_ref_count);
> -				smp_mb__after_atomic_inc();
> +				smp_mb__after_atomic();
>  			}
>  			spin_unlock_irq(&tmp_tpg->acl_node_lock);
> 
> @@ -1587,7 +1587,7 @@ core_scsi3_decode_spec_i_port(
>  				pr_err("configfs_depend_item() failed"
>  					" for dest_node_acl->acl_group\n");
>  				atomic_dec(&dest_node_acl->acl_pr_ref_count);
> -				smp_mb__after_atomic_dec();
> +				smp_mb__after_atomic();
>  				core_scsi3_tpg_undepend_item(tmp_tpg);
>  				ret = TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
>  				goto out_unmap;
> @@ -1647,7 +1647,7 @@ core_scsi3_decode_spec_i_port(
>  			pr_err("core_scsi3_lunacl_depend_item()"
>  					" failed\n");
>  			atomic_dec(&dest_se_deve->pr_ref_count);
> -			smp_mb__after_atomic_dec();
> +			smp_mb__after_atomic();
>  			core_scsi3_nodeacl_undepend_item(dest_node_acl);
>  			core_scsi3_tpg_undepend_item(dest_tpg);
>  			ret = TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
> @@ -3165,14 +3165,14 @@ core_scsi3_emulate_pro_register_and_move
>  			continue;
> 
>  		atomic_inc(&dest_se_tpg->tpg_pr_ref_count);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  		spin_unlock(&dev->se_port_lock);
> 
>  		if (core_scsi3_tpg_depend_item(dest_se_tpg)) {
>  			pr_err("core_scsi3_tpg_depend_item() failed"
>  				" for dest_se_tpg\n");
>  			atomic_dec(&dest_se_tpg->tpg_pr_ref_count);
> -			smp_mb__after_atomic_dec();
> +			smp_mb__after_atomic();
>  			ret = TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
>  			goto out_put_pr_reg;
>  		}
> @@ -3270,7 +3270,7 @@ core_scsi3_emulate_pro_register_and_move
>  				initiator_str);
>  	if (dest_node_acl) {
>  		atomic_inc(&dest_node_acl->acl_pr_ref_count);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  	}
>  	spin_unlock_irq(&dest_se_tpg->acl_node_lock);
> 
> @@ -3286,7 +3286,7 @@ core_scsi3_emulate_pro_register_and_move
>  		pr_err("core_scsi3_nodeacl_depend_item() for"
>  			" dest_node_acl\n");
>  		atomic_dec(&dest_node_acl->acl_pr_ref_count);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
>  		dest_node_acl = NULL;
>  		ret = TCM_INVALID_PARAMETER_LIST;
>  		goto out;
> @@ -3311,7 +3311,7 @@ core_scsi3_emulate_pro_register_and_move
>  	if (core_scsi3_lunacl_depend_item(dest_se_deve)) {
>  		pr_err("core_scsi3_lunacl_depend_item() failed\n");
>  		atomic_dec(&dest_se_deve->pr_ref_count);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
>  		dest_se_deve = NULL;
>  		ret = TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
>  		goto out;
> @@ -3877,7 +3877,7 @@ core_scsi3_pri_read_full_status(struct s
>  		add_desc_len = 0;
> 
>  		atomic_inc(&pr_reg->pr_res_holders);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  		spin_unlock(&pr_tmpl->registration_lock);
>  		/*
>  		 * Determine expected length of $FABRIC_MOD specific
> @@ -3891,7 +3891,7 @@ core_scsi3_pri_read_full_status(struct s
>  				" out of buffer: %d\n", cmd->data_length);
>  			spin_lock(&pr_tmpl->registration_lock);
>  			atomic_dec(&pr_reg->pr_res_holders);
> -			smp_mb__after_atomic_dec();
> +			smp_mb__after_atomic();
>  			break;
>  		}
>  		/*
> @@ -3953,7 +3953,7 @@ core_scsi3_pri_read_full_status(struct s
> 
>  		spin_lock(&pr_tmpl->registration_lock);
>  		atomic_dec(&pr_reg->pr_res_holders);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
>  		/*
>  		 * Set the ADDITIONAL DESCRIPTOR LENGTH
>  		 */
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -728,7 +728,7 @@ void target_qf_do_work(struct work_struc
>  	list_for_each_entry_safe(cmd, cmd_tmp, &qf_cmd_list, se_qf_node) {
>  		list_del(&cmd->se_qf_node);
>  		atomic_dec(&dev->dev_qf_count);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
> 
>  		pr_debug("Processing %s cmd: %p QUEUE_FULL in work queue"
>  			" context: %s\n", cmd->se_tfo->get_fabric_name(), cmd,
> @@ -1140,7 +1140,7 @@ transport_check_alloc_task_attr(struct s
>  	 * Dormant to Active status.
>  	 */
>  	cmd->se_ordered_id = atomic_inc_return(&dev->dev_ordered_id);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  	pr_debug("Allocated se_ordered_id: %u for Task Attr: 0x%02x on %s\n",
>  			cmd->se_ordered_id, cmd->sam_task_attr,
>  			dev->transport->name);
> @@ -1692,7 +1692,7 @@ static bool target_handle_task_attr(stru
>  		return false;
>  	case MSG_ORDERED_TAG:
>  		atomic_inc(&dev->dev_ordered_sync);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
> 
>  		pr_debug("Added ORDERED for CDB: 0x%02x to ordered list, "
>  			 " se_ordered_id: %u\n",
> @@ -1710,7 +1710,7 @@ static bool target_handle_task_attr(stru
>  		 * For SIMPLE and UNTAGGED Task Attribute commands
>  		 */
>  		atomic_inc(&dev->simple_cmds);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  		break;
>  	}
> 
> @@ -1806,7 +1806,7 @@ static void transport_complete_task_attr
> 
>  	if (cmd->sam_task_attr == MSG_SIMPLE_TAG) {
>  		atomic_dec(&dev->simple_cmds);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
>  		dev->dev_cur_ordered_id++;
>  		pr_debug("Incremented dev->dev_cur_ordered_id: %u for"
>  			" SIMPLE: %u\n", dev->dev_cur_ordered_id,
> @@ -1818,7 +1818,7 @@ static void transport_complete_task_attr
>  			cmd->se_ordered_id);
>  	} else if (cmd->sam_task_attr == MSG_ORDERED_TAG) {
>  		atomic_dec(&dev->dev_ordered_sync);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
> 
>  		dev->dev_cur_ordered_id++;
>  		pr_debug("Incremented dev_cur_ordered_id: %u for ORDERED:"
> @@ -1877,7 +1877,7 @@ static void transport_handle_queue_full(
>  	spin_lock_irq(&dev->qf_cmd_lock);
>  	list_add_tail(&cmd->se_qf_node, &cmd->se_dev->qf_cmd_list);
>  	atomic_inc(&dev->dev_qf_count);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  	spin_unlock_irq(&cmd->se_dev->qf_cmd_lock);
> 
>  	schedule_work(&cmd->se_dev->qf_work_queue);
> @@ -2805,7 +2805,7 @@ void transport_send_task_abort(struct se
>  	if (cmd->data_direction == DMA_TO_DEVICE) {
>  		if (cmd->se_tfo->write_pending_status(cmd) != 0) {
>  			cmd->transport_state |= CMD_T_ABORTED;
> -			smp_mb__after_atomic_inc();
> +			smp_mb__after_atomic();
>  			return;
>  		}
>  	}
> --- a/drivers/target/target_core_ua.c
> +++ b/drivers/target/target_core_ua.c
> @@ -162,7 +162,7 @@ int core_scsi3_ua_allocate(
>  		spin_unlock_irq(&nacl->device_list_lock);
> 
>  		atomic_inc(&deve->ua_count);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  		return 0;
>  	}
>  	list_add_tail(&ua->ua_nacl_list, &deve->ua_list);
> @@ -175,7 +175,7 @@ int core_scsi3_ua_allocate(
>  		asc, ascq);
> 
>  	atomic_inc(&deve->ua_count);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  	return 0;
>  }
> 
> @@ -190,7 +190,7 @@ void core_scsi3_ua_release_all(
>  		kmem_cache_free(se_ua_cache, ua);
> 
>  		atomic_dec(&deve->ua_count);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
>  	}
>  	spin_unlock(&deve->ua_lock);
>  }
> @@ -251,7 +251,7 @@ void core_scsi3_ua_for_check_condition(
>  		kmem_cache_free(se_ua_cache, ua);
> 
>  		atomic_dec(&deve->ua_count);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
>  	}
>  	spin_unlock(&deve->ua_lock);
>  	spin_unlock_irq(&nacl->device_list_lock);
> @@ -310,7 +310,7 @@ int core_scsi3_ua_clear_for_request_sens
>  		kmem_cache_free(se_ua_cache, ua);
> 
>  		atomic_dec(&deve->ua_count);
> -		smp_mb__after_atomic_dec();
> +		smp_mb__after_atomic();
>  	}
>  	spin_unlock(&deve->ua_lock);
>  	spin_unlock_irq(&nacl->device_list_lock);
> --- a/drivers/tty/n_tty.c
> +++ b/drivers/tty/n_tty.c
> @@ -2042,7 +2042,7 @@ static int canon_copy_from_read_buf(stru
> 
>  	if (found)
>  		clear_bit(eol, ldata->read_flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	ldata->read_tail += c;
> 
>  	if (found) {
> --- a/drivers/tty/serial/mxs-auart.c
> +++ b/drivers/tty/serial/mxs-auart.c
> @@ -200,7 +200,7 @@ static void dma_tx_callback(void *param)
> 
>  	/* clear the bit used to serialize the DMA tx. */
>  	clear_bit(MXS_AUART_DMA_TX_SYNC, &s->flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	/* wake up the possible processes. */
>  	if (uart_circ_chars_pending(xmit) < WAKEUP_CHARS)
> @@ -275,7 +275,7 @@ static void mxs_auart_tx_chars(struct mx
>  			mxs_auart_dma_tx(s, i);
>  		} else {
>  			clear_bit(MXS_AUART_DMA_TX_SYNC, &s->flags);
> -			smp_mb__after_clear_bit();
> +			smp_mb__after_atomic();
>  		}
>  		return;
>  	}
> --- a/drivers/usb/gadget/tcm_usb_gadget.c
> +++ b/drivers/usb/gadget/tcm_usb_gadget.c
> @@ -1846,7 +1846,7 @@ static int usbg_port_link(struct se_port
>  	struct usbg_tpg *tpg = container_of(se_tpg, struct usbg_tpg, se_tpg);
> 
>  	atomic_inc(&tpg->tpg_port_count);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  	return 0;
>  }
> 
> @@ -1856,7 +1856,7 @@ static void usbg_port_unlink(struct se_p
>  	struct usbg_tpg *tpg = container_of(se_tpg, struct usbg_tpg, se_tpg);
> 
>  	atomic_dec(&tpg->tpg_port_count);
> -	smp_mb__after_atomic_dec();
> +	smp_mb__after_atomic();
>  }
> 
>  static int usbg_check_stop_free(struct se_cmd *se_cmd)
> --- a/drivers/usb/serial/usb_wwan.c
> +++ b/drivers/usb/serial/usb_wwan.c
> @@ -325,7 +325,7 @@ static void usb_wwan_outdat_callback(str
> 
>  	for (i = 0; i < N_OUT_URB; ++i) {
>  		if (portdata->out_urbs[i] == urb) {
> -			smp_mb__before_clear_bit();
> +			smp_mb__before_atomic();
>  			clear_bit(i, &portdata->out_busy);
>  			break;
>  		}
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1244,7 +1244,7 @@ vhost_scsi_set_endpoint(struct vhost_scs
>  			tpg->tv_tpg_vhost_count++;
>  			tpg->vhost_scsi = vs;
>  			vs_tpg[tpg->tport_tpgt] = tpg;
> -			smp_mb__after_atomic_inc();
> +			smp_mb__after_atomic();
>  			match = true;
>  		}
>  		mutex_unlock(&tpg->tv_tpg_mutex);
> --- a/drivers/w1/w1_family.c
> +++ b/drivers/w1/w1_family.c
> @@ -131,9 +131,9 @@ void w1_family_get(struct w1_family *f)
> 
>  void __w1_family_get(struct w1_family *f)
>  {
> -	smp_mb__before_atomic_inc();
> +	smp_mb__before_atomic();
>  	atomic_inc(&f->refcnt);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  }
> 
>  EXPORT_SYMBOL(w1_unregister_family);
> --- a/drivers/xen/xen-pciback/pciback_ops.c
> +++ b/drivers/xen/xen-pciback/pciback_ops.c
> @@ -348,9 +348,9 @@ void xen_pcibk_do_op(struct work_struct
>  	notify_remote_via_irq(pdev->evtchn_irq);
> 
>  	/* Mark that we're done. */
> -	smp_mb__before_clear_bit(); /* /after/ clearing PCIF_active */
> +	smp_mb__before_atomic(); /* /after/ clearing PCIF_active */
>  	clear_bit(_PDEVF_op_active, &pdev->flags);
> -	smp_mb__after_clear_bit(); /* /before/ final check for work */
> +	smp_mb__after_atomic(); /* /before/ final check for work */
> 
>  	/* Check to see if the driver domain tried to start another request in
>  	 * between clearing _XEN_PCIF_active and clearing _PDEVF_op_active.
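And this one is the store-buffer (Dekker) case, which is why a plain
compiler barrier would not do on anything weakly ordered: the
completion path clears its active flag and then re-checks for queued
work, the submission path queues work and then checks the active flag,
and the full barriers guarantee at least one side sees the other's
store.  Condensed sketch (flag names from the hunk; schedule_op() and
sh_info_flags are hypothetical stand-ins):

	/* completion path (this function) */
	smp_mb__before_atomic();	/* /after/ clearing PCIF_active */
	clear_bit(_PDEVF_op_active, &pdev->flags);
	smp_mb__after_atomic();		/* /before/ the re-check	*/
	if (test_bit(_XEN_PCIF_active, &sh_info_flags))
		schedule_op(pdev);	/* a request raced in, go again */

	/* submission path (the other CPU) */
	set_bit(_XEN_PCIF_active, &sh_info_flags);
	smp_mb();
	if (!test_bit(_PDEVF_op_active, &pdev->flags))
		schedule_op(pdev);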
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -279,7 +279,7 @@ static inline void btrfs_inode_block_unl
> 
>  static inline void btrfs_inode_resume_unlocked_dio(struct inode *inode)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(BTRFS_INODE_READDIO_NEED_LOCK,
>  		  &BTRFS_I(inode)->runtime_flags);
>  }
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3451,7 +3451,7 @@ static int lock_extent_buffer_for_io(str
>  static void end_extent_buffer_writeback(struct extent_buffer *eb)
>  {
>  	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
>  }
> 
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7078,7 +7078,7 @@ static void btrfs_end_dio_bio(struct bio
>  		 * before the atomic variable goes to zero, we must make sure
>  		 * dip->errors is perceived to be set.
>  		 */
> -		smp_mb__before_atomic_dec();
> +		smp_mb__before_atomic();
>  	}
> 
>  	/* if there are more bios still pending for this dio, just exit */
> @@ -7258,7 +7258,7 @@ static int btrfs_submit_direct_hook(int
>  	 * before the atomic variable goes to zero, we must
>  	 * make sure dip->errors is perceived to be set.
>  	 */
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  	if (atomic_dec_and_test(&dip->pending_bios))
>  		bio_io_error(dip->orig_bio);
> 
> @@ -7401,7 +7401,7 @@ static ssize_t btrfs_direct_IO(int rw, s
>  		return 0;
> 
>  	atomic_inc(&inode->i_dio_count);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
> 
>  	/*
>  	 * The generic stuff only does filemap_write_and_wait_range, which isn't
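Opposite direction from the refcount-get case earlier: here the error
store must be globally visible before the decrement that may take
pending_bios to zero, because whoever observes zero acts on
dip->errors without any further synchronisation.  Condensed (a sketch;
the real code records the error slightly differently):

	dip->errors = 1;		/* record the failure	      */
	smp_mb__before_atomic();	/* visible before the dec ... */
	if (atomic_dec_and_test(&dip->pending_bios))
		bio_io_error(dip->orig_bio);	/* ... and its reader */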
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -77,7 +77,7 @@ EXPORT_SYMBOL(__lock_buffer);
>  void unlock_buffer(struct buffer_head *bh)
>  {
>  	clear_bit_unlock(BH_Lock, &bh->b_state);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&bh->b_state, BH_Lock);
>  }
>  EXPORT_SYMBOL(unlock_buffer);
> --- a/fs/ext4/resize.c
> +++ b/fs/ext4/resize.c
> @@ -42,7 +42,7 @@ int ext4_resize_begin(struct super_block
>  void ext4_resize_end(struct super_block *sb)
>  {
>  	clear_bit_unlock(EXT4_RESIZING, &EXT4_SB(sb)->s_resize_flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  }
> 
>  static ext4_group_t ext4_meta_bg_first_group(struct super_block *sb,
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -275,7 +275,7 @@ static inline int may_grant(const struct
>  static void gfs2_holder_wake(struct gfs2_holder *gh)
>  {
>  	clear_bit(HIF_WAIT, &gh->gh_iflags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&gh->gh_iflags, HIF_WAIT);
>  }
> 
> @@ -409,7 +409,7 @@ static void gfs2_demote_wake(struct gfs2
>  {
>  	gl->gl_demote_state = LM_ST_EXCLUSIVE;
>  	clear_bit(GLF_DEMOTE, &gl->gl_flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&gl->gl_flags, GLF_DEMOTE);
>  }
> 
> @@ -618,7 +618,7 @@ __acquires(&gl->gl_spin)
> 
>  out_sched:
>  	clear_bit(GLF_LOCK, &gl->gl_flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	gl->gl_lockref.count++;
>  	if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
>  		gl->gl_lockref.count--;
> @@ -626,7 +626,7 @@ __acquires(&gl->gl_spin)
> 
>  out_unlock:
>  	clear_bit(GLF_LOCK, &gl->gl_flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	return;
>  }
> 
> --- a/fs/gfs2/glops.c
> +++ b/fs/gfs2/glops.c
> @@ -219,7 +219,7 @@ static void inode_go_sync(struct gfs2_gl
>  	 * Writeback of the data mapping may cause the dirty flag to be set
>  	 * so we have to clear it again here.
>  	 */
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(GLF_DIRTY, &gl->gl_flags);
>  }
> 
> --- a/fs/gfs2/lock_dlm.c
> +++ b/fs/gfs2/lock_dlm.c
> @@ -1132,7 +1132,7 @@ static void gdlm_recover_done(void *arg,
>  		queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);
> 
>  	clear_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&ls->ls_recover_flags, DFL_DLM_RECOVERY);
>  	spin_unlock(&ls->ls_recover_spin);
>  }
> @@ -1269,7 +1269,7 @@ static int gdlm_mount(struct gfs2_sbd *s
> 
>  	ls->ls_first = !!test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
>  	clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
>  	return 0;
> 
> --- a/fs/gfs2/recovery.c
> +++ b/fs/gfs2/recovery.c
> @@ -587,7 +587,7 @@ void gfs2_recover_func(struct work_struc
>  	gfs2_recovery_done(sdp, jd->jd_jid, LM_RD_GAVEUP);
>  done:
>  	clear_bit(JDF_RECOVERY, &jd->jd_flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&jd->jd_flags, JDF_RECOVERY);
>  }
> 
> --- a/fs/gfs2/sys.c
> +++ b/fs/gfs2/sys.c
> @@ -332,7 +332,7 @@ static ssize_t block_store(struct gfs2_s
>  		set_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags);
>  	else if (val == 0) {
>  		clear_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		gfs2_glock_thaw(sdp);
>  	} else {
>  		ret = -EINVAL;
> @@ -481,7 +481,7 @@ static ssize_t jid_store(struct gfs2_sbd
>  		rv = jid = -EINVAL;
>  	sdp->sd_lockstruct.ls_jid = jid;
>  	clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
>  out:
>  	spin_unlock(&sdp->sd_jindex_spin);
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -43,7 +43,7 @@ static void journal_end_buffer_io_sync(s
>  		clear_buffer_uptodate(bh);
>  	if (orig_bh) {
>  		clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		wake_up_bit(&orig_bh->b_state, BH_Shadow);
>  	}
>  	unlock_buffer(bh);
> @@ -239,7 +239,7 @@ static int journal_submit_data_buffers(j
>  		spin_lock(&journal->j_list_lock);
>  		J_ASSERT(jinode->i_transaction == commit_transaction);
>  		clear_bit(__JI_COMMIT_RUNNING, &jinode->i_flags);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
>  	}
>  	spin_unlock(&journal->j_list_lock);
> @@ -277,7 +277,7 @@ static int journal_finish_inode_data_buf
>  		}
>  		spin_lock(&journal->j_list_lock);
>  		clear_bit(__JI_COMMIT_RUNNING, &jinode->i_flags);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
>  	}
> 
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1985,9 +1985,9 @@ static void nfs_access_free_entry(struct
>  {
>  	put_rpccred(entry->cred);
>  	kfree(entry);
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  	atomic_long_dec(&nfs_access_nr_entries);
> -	smp_mb__after_atomic_dec();
> +	smp_mb__after_atomic();
>  }
> 
>  static void nfs_access_free_list(struct list_head *head)
> @@ -2035,9 +2035,9 @@ nfs_access_cache_scan(struct shrinker *s
>  		else {
>  remove_lru_entry:
>  			list_del_init(&nfsi->access_cache_inode_lru);
> -			smp_mb__before_clear_bit();
> +			smp_mb__before_atomic();
>  			clear_bit(NFS_INO_ACL_LRU_SET, &nfsi->flags);
> -			smp_mb__after_clear_bit();
> +			smp_mb__after_atomic();
>  		}
>  		spin_unlock(&inode->i_lock);
>  	}
> @@ -2185,9 +2185,9 @@ void nfs_access_add_cache(struct inode *
>  	nfs_access_add_rbtree(inode, cache);
> 
>  	/* Update accounting */
> -	smp_mb__before_atomic_inc();
> +	smp_mb__before_atomic();
>  	atomic_long_inc(&nfs_access_nr_entries);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
> 
>  	/* Add inode to global LRU list */
>  	if (!test_bit(NFS_INO_ACL_LRU_SET, &NFS_I(inode)->flags)) {
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -1058,7 +1058,7 @@ int nfs_revalidate_mapping(struct inode
>  	trace_nfs_invalidate_mapping_exit(inode, ret);
> 
>  	clear_bit_unlock(NFS_INO_INVALIDATING, bitlock);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(bitlock, NFS_INO_INVALIDATING);
>  out:
>  	return ret;
> --- a/fs/nfs/nfs4filelayoutdev.c
> +++ b/fs/nfs/nfs4filelayoutdev.c
> @@ -789,9 +789,9 @@ static void nfs4_wait_ds_connect(struct
> 
>  static void nfs4_clear_ds_conn_bit(struct nfs4_pnfs_ds *ds)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(NFS4DS_CONNECTING, &ds->ds_state);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&ds->ds_state, NFS4DS_CONNECTING);
>  }
> 
> --- a/fs/nfs/nfs4state.c
> +++ b/fs/nfs/nfs4state.c
> @@ -1145,9 +1145,9 @@ static int nfs4_run_state_manager(void *
> 
>  static void nfs4_clear_state_manager_bit(struct nfs_client *clp)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&clp->cl_state, NFS4CLNT_MANAGER_RUNNING);
>  	rpc_wake_up(&clp->cl_rpcwaitq);
>  }
> --- a/fs/nfs/pagelist.c
> +++ b/fs/nfs/pagelist.c
> @@ -95,7 +95,7 @@ nfs_iocounter_dec(struct nfs_io_counter
>  {
>  	if (atomic_dec_and_test(&c->io_count)) {
>  		clear_bit(NFS_IO_INPROGRESS, &c->flags);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		wake_up_bit(&c->flags, NFS_IO_INPROGRESS);
>  	}
>  }
> @@ -193,9 +193,9 @@ void nfs_unlock_request(struct nfs_page
>  		printk(KERN_ERR "NFS: Invalid unlock attempted\n");
>  		BUG();
>  	}
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(PG_BUSY, &req->wb_flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&req->wb_flags, PG_BUSY);
>  }
> 
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1795,7 +1795,7 @@ static void pnfs_clear_layoutcommitting(
>  	unsigned long *bitlock = &NFS_I(inode)->flags;
> 
>  	clear_bit_unlock(NFS_INO_LAYOUTCOMMITTING, bitlock);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(bitlock, NFS_INO_LAYOUTCOMMITTING);
>  }
> 
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -275,7 +275,7 @@ pnfs_get_lseg(struct pnfs_layout_segment
>  {
>  	if (lseg) {
>  		atomic_inc(&lseg->pls_refcount);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  	}
>  	return lseg;
>  }
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -405,7 +405,7 @@ int nfs_writepages(struct address_space
>  	nfs_pageio_complete(&pgio);
> 
>  	clear_bit_unlock(NFS_INO_FLUSHING, bitlock);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(bitlock, NFS_INO_FLUSHING);
> 
>  	if (err < 0)
> @@ -1458,7 +1458,7 @@ static int nfs_commit_set_lock(struct nf
>  static void nfs_commit_clear_lock(struct nfs_inode *nfsi)
>  {
>  	clear_bit(NFS_INO_COMMIT, &nfsi->flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_bit(&nfsi->flags, NFS_INO_COMMIT);
>  }
> 
> --- a/fs/ubifs/lpt_commit.c
> +++ b/fs/ubifs/lpt_commit.c
> @@ -460,9 +460,9 @@ static int write_cnodes(struct ubifs_inf
>  		 * important.
>  		 */
>  		clear_bit(DIRTY_CNODE, &cnode->flags);
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		clear_bit(COW_CNODE, &cnode->flags);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		offs += len;
>  		dbg_chk_lpt_sz(c, 1, len);
>  		cnode = cnode->cnext;
> --- a/fs/ubifs/tnc_commit.c
> +++ b/fs/ubifs/tnc_commit.c
> @@ -895,9 +895,9 @@ static int write_index(struct ubifs_info
>  		 * the reason for the second barrier.
>  		 */
>  		clear_bit(DIRTY_ZNODE, &znode->flags);
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		clear_bit(COW_ZNODE, &znode->flags);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
> 
>  		/*
>  		 * We have marked the znode as clean but have not updated the
> --- a/include/asm-generic/atomic.h
> +++ b/include/asm-generic/atomic.h
> @@ -16,6 +16,7 @@
>  #define __ASM_GENERIC_ATOMIC_H
> 
>  #include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> 
>  #ifdef CONFIG_SMP
>  /* Force people to define core atomics */
> @@ -182,11 +183,5 @@ static inline void atomic_set_mask(unsig
>  }
>  #endif
> 
> -/* Assume that atomic operations are already serializing */
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
>  #endif /* __KERNEL__ */
>  #endif /* __ASM_GENERIC_ATOMIC_H */
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -62,6 +62,14 @@
>  #define set_mb(var, value)  do { (var) = (value); mb(); } while (0)
>  #endif
> 
> +#ifndef smp_mb__before_atomic
> +#define smp_mb__before_atomic()	smp_mb()
> +#endif
> +
> +#ifndef smp_mb__after_atomic
> +#define smp_mb__after_atomic()	smp_mb()
> +#endif
> +
>  #define smp_store_release(p, v)						\
>  do {									\
>  	compiletime_assert_atomic_type(*p);				\
> --- a/include/asm-generic/bitops.h
> +++ b/include/asm-generic/bitops.h
> @@ -11,14 +11,8 @@
> 
>  #include <linux/irqflags.h>
>  #include <linux/compiler.h>
> +#include <asm/barrier.h>
> 
> -/*
> - * clear_bit may not imply a memory barrier
> - */
> -#ifndef smp_mb__before_clear_bit
> -#define smp_mb__before_clear_bit()	smp_mb()
> -#define smp_mb__after_clear_bit()	smp_mb()
> -#endif
> 
>  #include <asm-generic/bitops/__ffs.h>
>  #include <asm-generic/bitops/ffz.h>
> --- a/include/asm-generic/bitops/atomic.h
> +++ b/include/asm-generic/bitops/atomic.h
> @@ -80,7 +80,7 @@ static inline void set_bit(int nr, volat
>   *
>   * clear_bit() is atomic and may not be reordered.  However, it does
>   * not contain a memory barrier, so if it is used for locking purposes,
> - * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
> + * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
>   * in order to ensure changes are visible on other processors.
>   */
>  static inline void clear_bit(int nr, volatile unsigned long *addr)
> --- a/include/asm-generic/bitops/lock.h
> +++ b/include/asm-generic/bitops/lock.h
> @@ -20,7 +20,7 @@
>   */
>  #define clear_bit_unlock(nr, addr)	\
>  do {					\
> -	smp_mb__before_clear_bit();	\
> +	smp_mb__before_atomic();	\
>  	clear_bit(nr, addr);		\
>  } while (0)
> 
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -278,7 +278,7 @@ static inline void get_bh(struct buffer_
> 
>  static inline void put_bh(struct buffer_head *bh)
>  {
> -        smp_mb__before_atomic_dec();
> +        smp_mb__before_atomic();
>          atomic_dec(&bh->b_count);
>  }
> 
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -649,7 +649,7 @@ static inline void hd_ref_init(struct hd
>  static inline void hd_struct_get(struct hd_struct *part)
>  {
>  	atomic_inc(&part->ref);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  }
> 
>  static inline int hd_struct_try_get(struct hd_struct *part)
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -447,7 +447,7 @@ static inline int tasklet_trylock(struct
> 
>  static inline void tasklet_unlock(struct tasklet_struct *t)
>  {
> -	smp_mb__before_clear_bit(); 
> +	smp_mb__before_atomic();
>  	clear_bit(TASKLET_STATE_RUN, &(t)->state);
>  }
> 
> @@ -495,7 +495,7 @@ static inline void tasklet_hi_schedule_f
>  static inline void tasklet_disable_nosync(struct tasklet_struct *t)
>  {
>  	atomic_inc(&t->count);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  }
> 
>  static inline void tasklet_disable(struct tasklet_struct *t)
> @@ -507,13 +507,13 @@ static inline void tasklet_disable(struc
> 
>  static inline void tasklet_enable(struct tasklet_struct *t)
>  {
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  	atomic_dec(&t->count);
>  }
> 
>  static inline void tasklet_hi_enable(struct tasklet_struct *t)
>  {
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  	atomic_dec(&t->count);
>  }
> 
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -500,7 +500,7 @@ static inline void napi_disable(struct n
>  static inline void napi_enable(struct napi_struct *n)
>  {
>  	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(NAPI_STATE_SCHED, &n->state);
>  }
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2737,10 +2737,8 @@ static inline bool __must_check current_
>  	/*
>  	 * Polling state must be visible before we test NEED_RESCHED,
>  	 * paired by resched_task()
> -	 *
> -	 * XXX: assumes set/clear bit are identical barrier wise.
>  	 */
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	return unlikely(tif_need_resched());
>  }
> @@ -2758,7 +2756,7 @@ static inline bool __must_check current_
>  	 * Polling state must be visible before we test NEED_RESCHED,
>  	 * paired by resched_task()
>  	 */
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	return unlikely(tif_need_resched());
>  }
> --- a/include/linux/sunrpc/sched.h
> +++ b/include/linux/sunrpc/sched.h
> @@ -142,18 +142,18 @@ struct rpc_task_setup {
>  				test_and_set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
>  #define rpc_clear_running(t)	\
>  	do { \
> -		smp_mb__before_clear_bit(); \
> +		smp_mb__before_atomic(); \
>  		clear_bit(RPC_TASK_RUNNING, &(t)->tk_runstate); \
> -		smp_mb__after_clear_bit(); \
> +		smp_mb__after_atomic(); \
>  	} while (0)
> 
>  #define RPC_IS_QUEUED(t)	test_bit(RPC_TASK_QUEUED, &(t)->tk_runstate)
>  #define rpc_set_queued(t)	set_bit(RPC_TASK_QUEUED, &(t)->tk_runstate)
>  #define rpc_clear_queued(t)	\
>  	do { \
> -		smp_mb__before_clear_bit(); \
> +		smp_mb__before_atomic(); \
>  		clear_bit(RPC_TASK_QUEUED, &(t)->tk_runstate); \
> -		smp_mb__after_clear_bit(); \
> +		smp_mb__after_atomic(); \
>  	} while (0)
> 
>  #define RPC_IS_ACTIVATED(t)	test_bit(RPC_TASK_ACTIVE, &(t)->tk_runstate)
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -368,9 +368,9 @@ static inline int xprt_test_and_clear_co
> 
>  static inline void xprt_clear_connecting(struct rpc_xprt *xprt)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(XPRT_CONNECTING, &xprt->state);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  }
> 
>  static inline int xprt_connecting(struct rpc_xprt *xprt)
> @@ -400,9 +400,9 @@ static inline void xprt_clear_bound(stru
> 
>  static inline void xprt_clear_binding(struct rpc_xprt *xprt)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(XPRT_BINDING, &xprt->state);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  }
> 
>  static inline int xprt_test_and_set_binding(struct rpc_xprt *xprt)
> --- a/include/linux/tracehook.h
> +++ b/include/linux/tracehook.h
> @@ -191,7 +191,7 @@ static inline void tracehook_notify_resu
>  	 * pairs with task_work_add()->set_notify_resume() after
>  	 * hlist_add_head(task->task_works);
>  	 */
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	if (unlikely(current->task_works))
>  		task_work_run();
>  }
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -1204,7 +1204,7 @@ static inline bool __ip_vs_conn_get(stru
>  /* put back the conn without restarting its timer */
>  static inline void __ip_vs_conn_put(struct ip_vs_conn *cp)
>  {
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  	atomic_dec(&cp->refcnt);
>  }
>  void ip_vs_conn_put(struct ip_vs_conn *cp);
> @@ -1408,7 +1408,7 @@ static inline void ip_vs_dest_hold(struc
> 
>  static inline void ip_vs_dest_put(struct ip_vs_dest *dest)
>  {
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  	atomic_dec(&dest->refcnt);
>  }
> 
> --- a/kernel/debug/debug_core.c
> +++ b/kernel/debug/debug_core.c
> @@ -526,7 +526,7 @@ static int kgdb_cpu_enter(struct kgdb_st
>  			kgdb_info[cpu].exception_state &=
>  				~(DCPU_WANT_MASTER | DCPU_IS_SLAVE);
>  			kgdb_info[cpu].enter_kgdb--;
> -			smp_mb__before_atomic_dec();
> +			smp_mb__before_atomic();
>  			atomic_dec(&slaves_in_kgdb);
>  			dbg_touch_watchdogs();
>  			local_irq_restore(flags);
> @@ -654,7 +654,7 @@ static int kgdb_cpu_enter(struct kgdb_st
>  	kgdb_info[cpu].exception_state &=
>  		~(DCPU_WANT_MASTER | DCPU_IS_SLAVE);
>  	kgdb_info[cpu].enter_kgdb--;
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  	atomic_dec(&masters_in_kgdb);
>  	/* Free kgdb_active */
>  	atomic_set(&kgdb_active, -1);
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -250,7 +250,7 @@ static inline void futex_get_mm(union fu
>  	 * get_futex_key() implies a full barrier. This is relied upon
>  	 * as full barrier (B), see the ordering comment above.
>  	 */
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  }
> 
>  static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
> --- a/kernel/kmod.c
> +++ b/kernel/kmod.c
> @@ -498,7 +498,7 @@ int __usermodehelper_disable(enum umh_di
>  static void helper_lock(void)
>  {
>  	atomic_inc(&running_helpers);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  }
> 
>  static void helper_unlock(void)
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -389,9 +389,9 @@ static void rcu_eqs_enter_common(struct
>  	}
>  	rcu_prepare_for_idle(smp_processor_id());
>  	/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
> -	smp_mb__before_atomic_inc();  /* See above. */
> +	smp_mb__before_atomic();  /* See above. */
>  	atomic_inc(&rdtp->dynticks);
> -	smp_mb__after_atomic_inc();  /* Force ordering with next sojourn. */
> +	smp_mb__after_atomic();  /* Force ordering with next sojourn. */
>  	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> 
>  	/*
> @@ -509,10 +509,10 @@ void rcu_irq_exit(void)
>  static void rcu_eqs_exit_common(struct rcu_dynticks *rdtp, long long oldval,
>  			       int user)
>  {
> -	smp_mb__before_atomic_inc();  /* Force ordering w/previous sojourn. */
> +	smp_mb__before_atomic();  /* Force ordering w/previous sojourn. */
>  	atomic_inc(&rdtp->dynticks);
>  	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> -	smp_mb__after_atomic_inc();  /* See above. */
> +	smp_mb__after_atomic();  /* See above. */
>  	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>  	rcu_cleanup_after_idle(smp_processor_id());
>  	trace_rcu_dyntick(TPS("End"), oldval, rdtp->dynticks_nesting);
> @@ -637,10 +637,10 @@ void rcu_nmi_enter(void)
>  	    (atomic_read(&rdtp->dynticks) & 0x1))
>  		return;
>  	rdtp->dynticks_nmi_nesting++;
> -	smp_mb__before_atomic_inc();  /* Force delay from prior write. */
> +	smp_mb__before_atomic();  /* Force delay from prior write. */
>  	atomic_inc(&rdtp->dynticks);
>  	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> -	smp_mb__after_atomic_inc();  /* See above. */
> +	smp_mb__after_atomic();  /* See above. */
>  	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>  }
> 
> @@ -659,9 +659,9 @@ void rcu_nmi_exit(void)
>  	    --rdtp->dynticks_nmi_nesting != 0)
>  		return;
>  	/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
> -	smp_mb__before_atomic_inc();  /* See above. */
> +	smp_mb__before_atomic();  /* See above. */
>  	atomic_inc(&rdtp->dynticks);
> -	smp_mb__after_atomic_inc();  /* Force delay to next write. */
> +	smp_mb__after_atomic();  /* Force delay to next write. */
>  	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
>  }
> 
> @@ -2738,7 +2738,7 @@ void synchronize_sched_expedited(void)
>  		s = atomic_long_read(&rsp->expedited_done);
>  		if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) {
>  			/* ensure test happens before caller kfree */
> -			smp_mb__before_atomic_inc(); /* ^^^ */
> +			smp_mb__before_atomic(); /* ^^^ */
>  			atomic_long_inc(&rsp->expedited_workdone1);
>  			return;
>  		}
> @@ -2756,7 +2756,7 @@ void synchronize_sched_expedited(void)
>  		s = atomic_long_read(&rsp->expedited_done);
>  		if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) {
>  			/* ensure test happens before caller kfree */
> -			smp_mb__before_atomic_inc(); /* ^^^ */
> +			smp_mb__before_atomic(); /* ^^^ */
>  			atomic_long_inc(&rsp->expedited_workdone2);
>  			return;
>  		}
> @@ -2785,7 +2785,7 @@ void synchronize_sched_expedited(void)
>  		s = atomic_long_read(&rsp->expedited_done);
>  		if (ULONG_CMP_GE((ulong)s, (ulong)snap)) {
>  			/* ensure test happens before caller kfree */
> -			smp_mb__before_atomic_inc(); /* ^^^ */
> +			smp_mb__before_atomic(); /* ^^^ */
>  			atomic_long_inc(&rsp->expedited_done_lost);
>  			break;
>  		}
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2514,9 +2514,9 @@ static void rcu_sysidle_enter(struct rcu
>  	/* Record start of fully idle period. */
>  	j = jiffies;
>  	ACCESS_ONCE(rdtp->dynticks_idle_jiffies) = j;
> -	smp_mb__before_atomic_inc();
> +	smp_mb__before_atomic();
>  	atomic_inc(&rdtp->dynticks_idle);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  	WARN_ON_ONCE(atomic_read(&rdtp->dynticks_idle) & 0x1);
>  }
> 
> @@ -2581,9 +2581,9 @@ static void rcu_sysidle_exit(struct rcu_
>  	}
> 
>  	/* Record end of idle period. */
> -	smp_mb__before_atomic_inc();
> +	smp_mb__before_atomic();
>  	atomic_inc(&rdtp->dynticks_idle);
> -	smp_mb__after_atomic_inc();
> +	smp_mb__after_atomic();
>  	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
> 
>  	/*
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -165,7 +165,7 @@ void cpupri_set(struct cpupri *cp, int c
>  		 * do a write memory barrier, and then update the count, to
>  		 * make sure the vector is visible when count is set.
>  		 */
> -		smp_mb__before_atomic_inc();
> +		smp_mb__before_atomic();
>  		atomic_inc(&(vec)->count);
>  		do_mb = 1;
>  	}
> @@ -185,14 +185,14 @@ void cpupri_set(struct cpupri *cp, int c
>  		 * the new priority vec.
>  		 */
>  		if (do_mb)
> -			smp_mb__after_atomic_inc();
> +			smp_mb__after_atomic();
> 
>  		/*
>  		 * When removing from the vector, we decrement the counter first
>  		 * do a memory barrier and then clear the mask.
>  		 */
>  		atomic_dec(&(vec)->count);
> -		smp_mb__after_atomic_inc();
> +		smp_mb__after_atomic();
>  		cpumask_clear_cpu(cpu, vec->mask);
>  	}
> 
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -394,7 +394,7 @@ EXPORT_SYMBOL(__wake_up_bit);
>   *
>   * In order for this to function properly, as it uses waitqueue_active()
>   * internally, some kind of memory barrier must be done prior to calling
> - * this. Typically, this will be smp_mb__after_clear_bit(), but in some
> + * this. Typically, this will be smp_mb__after_atomic(), but in some
>   * cases where bitflags are manipulated non-atomically under a lock, one
>   * may need to use a less regular barrier, such fs/inode.c's smp_mb(),
>   * because spin_unlock() does not guarantee a memory barrier.
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -549,7 +549,7 @@ void clear_bdi_congested(struct backing_
>  	bit = sync ? BDI_sync_congested : BDI_async_congested;
>  	if (test_and_clear_bit(bit, &bdi->state))
>  		atomic_dec(&nr_bdi_congested[sync]);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	if (waitqueue_active(wqh))
>  		wake_up(wqh);
>  }
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -609,7 +609,7 @@ void unlock_page(struct page *page)
>  {
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	clear_bit_unlock(PG_locked, &page->flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_page(page, PG_locked);
>  }
>  EXPORT_SYMBOL(unlock_page);
> @@ -626,7 +626,7 @@ void end_page_writeback(struct page *pag
>  	if (!test_clear_page_writeback(page))
>  		BUG();
> 
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	wake_up_page(page, PG_writeback);
>  }
>  EXPORT_SYMBOL(end_page_writeback);
> --- a/net/atm/pppoatm.c
> +++ b/net/atm/pppoatm.c
> @@ -252,7 +252,7 @@ static int pppoatm_may_send(struct pppoa
>  	 * we need to ensure there's a memory barrier after it. The bit
>  	 * *must* be set before we do the atomic_inc() on pvcc->inflight.
>  	 * There's no smp_mb__after_set_bit(), so it's this or abuse
> -	 * smp_mb__after_clear_bit().
> +	 * smp_mb__after_atomic().
>  	 */
>  	test_and_set_bit(BLOCKED, &pvcc->blocked);
> 
> --- a/net/bluetooth/hci_event.c
> +++ b/net/bluetooth/hci_event.c
> @@ -45,7 +45,7 @@ static void hci_cc_inquiry_cancel(struct
>  		return;
> 
>  	clear_bit(HCI_INQUIRY, &hdev->flags);
> -	smp_mb__after_clear_bit(); /* wake_up_bit advises about this barrier */
> +	smp_mb__after_atomic(); /* wake_up_bit advises about this barrier */
>  	wake_up_bit(&hdev->flags, HCI_INQUIRY);
> 
>  	hci_conn_check_pending(hdev);
> @@ -1531,7 +1531,7 @@ static void hci_inquiry_complete_evt(str
>  	if (!test_and_clear_bit(HCI_INQUIRY, &hdev->flags))
>  		return;
> 
> -	smp_mb__after_clear_bit(); /* wake_up_bit advises about this barrier */
> +	smp_mb__after_atomic(); /* wake_up_bit advises about this barrier */
>  	wake_up_bit(&hdev->flags, HCI_INQUIRY);
> 
>  	if (!test_bit(HCI_MGMT, &hdev->dev_flags))
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1323,7 +1323,7 @@ static int __dev_close_many(struct list_
>  		 * dev->stop() will invoke napi_disable() on all of it's
>  		 * napi_struct instances on this device.
>  		 */
> -		smp_mb__after_clear_bit(); /* Commit netif_running(). */
> +		smp_mb__after_atomic(); /* Commit netif_running(). */
>  	}
> 
>  	dev_deactivate_many(head);
> @@ -3343,7 +3343,7 @@ static void net_tx_action(struct softirq
> 
>  			root_lock = qdisc_lock(q);
>  			if (spin_trylock(root_lock)) {
> -				smp_mb__before_clear_bit();
> +				smp_mb__before_atomic();
>  				clear_bit(__QDISC_STATE_SCHED,
>  					  &q->state);
>  				qdisc_run(q);
> @@ -3353,7 +3353,7 @@ static void net_tx_action(struct softirq
>  					      &q->state)) {
>  					__netif_reschedule(q);
>  				} else {
> -					smp_mb__before_clear_bit();
> +					smp_mb__before_atomic();
>  					clear_bit(__QDISC_STATE_SCHED,
>  						  &q->state);
>  				}
> @@ -4216,7 +4216,7 @@ void __napi_complete(struct napi_struct
>  	BUG_ON(n->gro_list);
> 
>  	list_del(&n->poll_list);
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(NAPI_STATE_SCHED, &n->state);
>  }
>  EXPORT_SYMBOL(__napi_complete);
> --- a/net/core/link_watch.c
> +++ b/net/core/link_watch.c
> @@ -147,7 +147,7 @@ static void linkwatch_do_dev(struct net_
>  	 * Make sure the above read is complete since it can be
>  	 * rewritten as soon as we clear the bit below.
>  	 */
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
> 
>  	/* We are about to handle this device,
>  	 * so new events can be accepted
> --- a/net/ipv4/inetpeer.c
> +++ b/net/ipv4/inetpeer.c
> @@ -522,7 +522,7 @@ EXPORT_SYMBOL_GPL(inet_getpeer);
>  void inet_putpeer(struct inet_peer *p)
>  {
>  	p->dtime = (__u32)jiffies;
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  	atomic_dec(&p->refcnt);
>  }
>  EXPORT_SYMBOL_GPL(inet_putpeer);
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -751,7 +751,7 @@ void nf_conntrack_free(struct nf_conn *c
>  	nf_ct_ext_destroy(ct);
>  	nf_ct_ext_free(ct);
>  	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  	atomic_dec(&net->ct.count);
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_free);
> --- a/net/rds/ib_recv.c
> +++ b/net/rds/ib_recv.c
> @@ -598,7 +598,7 @@ static void rds_ib_set_ack(struct rds_ib
>  {
>  	atomic64_set(&ic->i_ack_next, seq);
>  	if (ack_required) {
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
>  	}
>  }
> @@ -606,7 +606,7 @@ static void rds_ib_set_ack(struct rds_ib
>  static u64 rds_ib_get_ack(struct rds_ib_connection *ic)
>  {
>  	clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	return atomic64_read(&ic->i_ack_next);
>  }
> --- a/net/rds/iw_recv.c
> +++ b/net/rds/iw_recv.c
> @@ -429,7 +429,7 @@ static void rds_iw_set_ack(struct rds_iw
>  {
>  	atomic64_set(&ic->i_ack_next, seq);
>  	if (ack_required) {
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
>  	}
>  }
> @@ -437,7 +437,7 @@ static void rds_iw_set_ack(struct rds_iw
>  static u64 rds_iw_get_ack(struct rds_iw_connection *ic)
>  {
>  	clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	return atomic64_read(&ic->i_ack_next);
>  }
> --- a/net/rds/send.c
> +++ b/net/rds/send.c
> @@ -107,7 +107,7 @@ static int acquire_in_xmit(struct rds_co
>  static void release_in_xmit(struct rds_connection *conn)
>  {
>  	clear_bit(RDS_IN_XMIT, &conn->c_flags);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	/*
>  	 * We don't use wait_on_bit()/wake_up_bit() because our waking is in a
>  	 * hot path and finding waiters is very rare.  We don't want to walk
> @@ -661,7 +661,7 @@ void rds_send_drop_acked(struct rds_conn
> 
>  	/* order flag updates with spin locks */
>  	if (!list_empty(&list))
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
> 
>  	spin_unlock_irqrestore(&conn->c_lock, flags);
> 
> @@ -691,7 +691,7 @@ void rds_send_drop_to(struct rds_sock *r
>  	}
> 
>  	/* order flag updates with the rs lock */
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	spin_unlock_irqrestore(&rs->rs_lock, flags);
> 
> --- a/net/rds/tcp_send.c
> +++ b/net/rds/tcp_send.c
> @@ -93,7 +93,7 @@ int rds_tcp_xmit(struct rds_connection *
>  		rm->m_ack_seq = tc->t_last_sent_nxt +
>  				sizeof(struct rds_header) +
>  				be32_to_cpu(rm->m_inc.i_hdr.h_len) - 1;
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		set_bit(RDS_MSG_HAS_ACK_SEQ, &rm->m_flags);
>  		tc->t_last_expected_una = rm->m_ack_seq + 1;
> 
> --- a/net/sunrpc/auth.c
> +++ b/net/sunrpc/auth.c
> @@ -296,7 +296,7 @@ static void
>  rpcauth_unhash_cred_locked(struct rpc_cred *cred)
>  {
>  	hlist_del_rcu(&cred->cr_hash);
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(RPCAUTH_CRED_HASHED, &cred->cr_flags);
>  }
> 
> --- a/net/sunrpc/auth_gss/auth_gss.c
> +++ b/net/sunrpc/auth_gss/auth_gss.c
> @@ -142,7 +142,7 @@ gss_cred_set_ctx(struct rpc_cred *cred,
>  	gss_get_ctx(ctx);
>  	rcu_assign_pointer(gss_cred->gc_ctx, ctx);
>  	set_bit(RPCAUTH_CRED_UPTODATE, &cred->cr_flags);
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(RPCAUTH_CRED_NEW, &cred->cr_flags);
>  }
> 
> --- a/net/sunrpc/backchannel_rqst.c
> +++ b/net/sunrpc/backchannel_rqst.c
> @@ -257,10 +257,10 @@ void xprt_free_bc_request(struct rpc_rqs
> 
>  	dprintk("RPC:       free backchannel req=%p\n", req);
> 
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	WARN_ON_ONCE(!test_bit(RPC_BC_PA_IN_USE, &req->rq_bc_pa_state));
>  	clear_bit(RPC_BC_PA_IN_USE, &req->rq_bc_pa_state);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
> 
>  	if (!xprt_need_to_requeue(xprt)) {
>  		/*
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -230,9 +230,9 @@ static void xprt_clear_locked(struct rpc
>  {
>  	xprt->snd_task = NULL;
>  	if (!test_bit(XPRT_CLOSE_WAIT, &xprt->state)) {
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		clear_bit(XPRT_LOCKED, &xprt->state);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  	} else
>  		queue_work(rpciod_workqueue, &xprt->task_cleanup);
>  }
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -889,11 +889,11 @@ static void xs_close(struct rpc_xprt *xp
>  	xs_reset_transport(transport);
>  	xprt->reestablish_timeout = 0;
> 
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
>  	clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
>  	clear_bit(XPRT_CLOSING, &xprt->state);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	xprt_disconnect_done(xprt);
>  }
> 
> @@ -1500,12 +1500,12 @@ static void xs_tcp_cancel_linger_timeout
> 
>  static void xs_sock_reset_connection_flags(struct rpc_xprt *xprt)
>  {
> -	smp_mb__before_clear_bit();
> +	smp_mb__before_atomic();
>  	clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
>  	clear_bit(XPRT_CONNECTION_CLOSE, &xprt->state);
>  	clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
>  	clear_bit(XPRT_CLOSING, &xprt->state);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  }
> 
>  static void xs_sock_mark_closed(struct rpc_xprt *xprt)
> @@ -1559,10 +1559,10 @@ static void xs_tcp_state_change(struct s
>  		xprt->connect_cookie++;
>  		xprt->reestablish_timeout = 0;
>  		set_bit(XPRT_CLOSING, &xprt->state);
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		clear_bit(XPRT_CONNECTED, &xprt->state);
>  		clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		xs_tcp_schedule_linger_timeout(xprt, xs_tcp_fin_timeout);
>  		break;
>  	case TCP_CLOSE_WAIT:
> @@ -1581,9 +1581,9 @@ static void xs_tcp_state_change(struct s
>  	case TCP_LAST_ACK:
>  		set_bit(XPRT_CLOSING, &xprt->state);
>  		xs_tcp_schedule_linger_timeout(xprt, xs_tcp_fin_timeout);
> -		smp_mb__before_clear_bit();
> +		smp_mb__before_atomic();
>  		clear_bit(XPRT_CONNECTED, &xprt->state);
> -		smp_mb__after_clear_bit();
> +		smp_mb__after_atomic();
>  		break;
>  	case TCP_CLOSE:
>  		xs_tcp_cancel_linger_timeout(xprt);
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -1208,7 +1208,7 @@ static int unix_stream_connect(struct so
>  	sk->sk_state	= TCP_ESTABLISHED;
>  	sock_hold(newsk);
> 
> -	smp_mb__after_atomic_inc();	/* sock_hold() does an atomic_inc() */
> +	smp_mb__after_atomic();	/* sock_hold() does an atomic_inc() */
>  	unix_peer(sk)	= newsk;
> 
>  	unix_state_unlock(sk);
> --- a/sound/pci/bt87x.c
> +++ b/sound/pci/bt87x.c
> @@ -435,7 +435,7 @@ static int snd_bt87x_pcm_open(struct snd
> 
>  _error:
>  	clear_bit(0, &chip->opened);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	return err;
>  }
> 
> @@ -450,7 +450,7 @@ static int snd_bt87x_close(struct snd_pc
> 
>  	chip->substream = NULL;
>  	clear_bit(0, &chip->opened);
> -	smp_mb__after_clear_bit();
> +	smp_mb__after_atomic();
>  	return 0;
>  }
> 
> 
> 



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 18:25 ` [RFC][PATCH 0/5] arch: atomic rework David Howells
                     ` (2 preceding siblings ...)
  2014-02-06 18:55   ` Ramana Radhakrishnan
@ 2014-02-06 19:21   ` Linus Torvalds
  3 siblings, 0 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-06 19:21 UTC (permalink / raw)
  To: David Howells
  Cc: Peter Zijlstra, linux-arch, Linux Kernel Mailing List,
	Andrew Morton, Ingo Molnar, Will Deacon, Paul McKenney,
	ramana.radhakrishnan

On Thu, Feb 6, 2014 at 10:25 AM, David Howells <dhowells@redhat.com> wrote:
>
> Is it worth considering a move towards using C11 atomics and barriers and
> compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> these.

I think that's a bad idea as a generic implementation, but it's quite
possibly worth doing on an architecture-by-architecture basis.

The thing is, gcc builtins sometimes suck. Sometimes it's because
certain architectures want particular gcc versions that don't do a
good job, sometimes it's because the architecture doesn't lend itself
to doing what we want done and we end up better off with some tweaked
thing, sometimes it's because the language feature is just badly
designed. I'm not convinced that the C11 people got things right.

But in specific cases, maybe the architecture wants to use the
builtin. And the gcc version checks are likely to be
architecture-specific for a longish while. So some particular
architectures might want to use them, but I really doubt it makes
sense to make it the default with arch overrides.

                Linus


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 18:59     ` Will Deacon
@ 2014-02-06 19:27       ` Paul E. McKenney
  2014-02-06 21:17         ` Torvald Riegel
  2014-02-06 21:09       ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-06 19:27 UTC (permalink / raw)
  To: Will Deacon
  Cc: Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch,
	linux-kernel, torvalds, akpm, mingo, gcc

On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote:
> On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote:
> > On 02/06/14 18:25, David Howells wrote:
> > >
> > > Is it worth considering a move towards using C11 atomics and barriers and
> > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> > > these.
> > 
> > 
> > It sounds interesting to me, if we can make it work properly and 
> > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in.
> 
> Given my (albeit limited) experience playing with the C11 spec and GCC, I
> really think this is a bad idea for the kernel. It seems that nobody really
> agrees on exactly how the C11 atomics map to real architectural
> instructions on anything but the trivial architectures. For example, should
> the following code fire the assert?
> 
> 
> extern atomic<int> foo, bar, baz;
> 
> void thread1(void)
> {
> 	foo.store(42, memory_order_relaxed);
> 	bar.fetch_add(1, memory_order_seq_cst);
> 	baz.store(42, memory_order_relaxed);
> }
> 
> void thread2(void)
> {
> 	while (baz.load(memory_order_seq_cst) != 42) {
> 		/* do nothing */
> 	}
> 
> 	assert(foo.load(memory_order_seq_cst) == 42);
> }
> 
> 
> To answer that question, you need to go and look at the definitions of
> synchronises-with, happens-before, dependency_ordered_before and a whole
> pile of vaguely written waffle to realise that you don't know. Certainly,
> the code that arm64 GCC currently spits out would allow the assertion to fire
> on some microarchitectures.

Yep!  I believe that a memory_order_seq_cst fence in combination with the
fetch_add() would do the trick on many architectures, however.  All of
this is one reason that any C11 definitions need to be individually
overridable by individual architectures.
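
For example, the mapping Will is worried about might be strengthened
along these lines (just a sketch, assuming C11 <stdatomic.h>; the
_c11 suffix is mine and this is not the kernel's actual
implementation):

	#include <stdatomic.h>

	static inline int atomic_add_return_c11(int i, atomic_int *v)
	{
		/* seq_cst RMW plus a trailing seq_cst fence to
		 * approximate the kernel's full-barrier semantics */
		int ret = atomic_fetch_add_explicit(v, i,
					memory_order_seq_cst) + i;
		atomic_thread_fence(memory_order_seq_cst);
		return ret;
	}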

> There are also so many ways to blow your head off it's untrue. For example,
> cmpxchg takes a separate memory model parameter for failure and success, but
> then there are restrictions on the sets you can use for each. It's not hard
> to find well-known memory-ordering experts shouting "Just use
> memory_order_seq_cst for everything, it's too hard otherwise". Then there's
> the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> atm and optimises all of the data dependencies away) as well as the definition
> of "data races", which seem to be used as an excuse to miscompile a program
> at the earliest opportunity.

Trust me, rcu_dereference() is not going to be defined in terms of
memory_order_consume until the compilers implement it both correctly and
efficiently.  They are not there yet, and there is currently no shortage
of compiler writers who would prefer to ignore memory_order_consume.
And rcu_dereference() will need per-arch overrides for some time during
any transition to memory_order_consume.
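
For reference, the current rcu_dereference() boils down to (roughly,
with the sparse and debug checks stripped out) a volatile load
followed by smp_read_barrier_depends():

	#define rcu_dereference(p) \
	({ \
		typeof(p) _p1 = ACCESS_ONCE(p); \
		smp_read_barrier_depends(); \
		(_p1); \
	})

A memory_order_consume-based version would have to be at least as
cheap as that on Power and ARM before it could take over.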

> Trying to introduce system concepts (writes to devices, interrupts,
> non-coherent agents) into this mess is going to be an uphill battle IMHO. I'd
> just rather stick to the semantics we have and the asm volatile barriers.

And barrier() isn't going to go away any time soon, either.  And
ACCESS_ONCE() needs to keep volatile semantics until there is some
memory_order_whatever that prevents loads and stores from being coalesced.
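
(For reference, both of those are tiny: ACCESS_ONCE() is just a
volatile cast and barrier() a compiler-only fence:

	#define ACCESS_ONCE(x)	(*(volatile typeof(x) *)&(x))
	#define barrier()	__asm__ __volatile__("" : : : "memory")

so whatever C11 offers as a replacement has to give at least those
guarantees.)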

> That's not to say there's no room for improvement in what we have
> in the kernel. Certainly, I'd welcome allowing more relaxed operations on
> architectures that support them, but it needs to be something that at least
> the different architecture maintainers can understand how to implement
> efficiently behind an uncomplicated interface. I don't think that interface is
> C11.
> 
> Just my thoughts on the matter...

C11 does not provide a good interface for the Linux kernel, nor was it
intended to do so.  It might provide good implementations for some of
the atomic ops for some architectures.  This could reduce the amount of
assembly written for new architectures, and could potentially allow the
compiler to do a better job of optimizing (scary thought!).  But for this
to work, that architecture's Linux-kernel maintainer and gcc maintainer
would need to be working together.

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 18:59     ` Will Deacon
  2014-02-06 19:27       ` Paul E. McKenney
@ 2014-02-06 21:09       ` Torvald Riegel
  2014-02-06 21:55         ` Paul E. McKenney
                           ` (2 more replies)
  1 sibling, 3 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-06 21:09 UTC (permalink / raw)
  To: Will Deacon
  Cc: Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch,
	linux-kernel, torvalds, akpm, mingo, paulmck, gcc

On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote:
> On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote:
> > On 02/06/14 18:25, David Howells wrote:
> > >
> > > Is it worth considering a move towards using C11 atomics and barriers and
> > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> > > these.
> > 
> > 
> > It sounds interesting to me, if we can make it work properly and 
> > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in.
> 
> Given my (albeit limited) experience playing with the C11 spec and GCC, I
> really think this is a bad idea for the kernel.

I'm not going to comment on what's best for the kernel (simply because I
don't work on it), but I disagree with several of your statements.

> It seems that nobody really
> agrees on exactly how the C11 atomics map to real architectural
> instructions on anything but the trivial architectures.

There's certainly different ways to implement the memory model and those
have to be specified elsewhere, but I don't see how this differs much
from other things specified in the ABI(s) for each architecture.

> For example, should
> the following code fire the assert?

I don't see how your example (which is about what the language requires
or not) relates to the statement about the mapping above?

> 
> extern atomic<int> foo, bar, baz;
> 
> void thread1(void)
> {
> 	foo.store(42, memory_order_relaxed);
> 	bar.fetch_add(1, memory_order_seq_cst);
> 	baz.store(42, memory_order_relaxed);
> }
> 
> void thread2(void)
> {
> 	while (baz.load(memory_order_seq_cst) != 42) {
> 		/* do nothing */
> 	}
> 
> 	assert(foo.load(memory_order_seq_cst) == 42);
> }
> 

It's a good example.  My first gut feeling was that the assertion should
never fire, but that was wrong because (as I seem to usually forget) the
seq-cst total order is just a constraint but doesn't itself contribute
to synchronizes-with -- but this is different for seq-cst fences.

> To answer that question, you need to go and look at the definitions of
> synchronises-with, happens-before, dependency_ordered_before and a whole
> pile of vaguely written waffle to realise that you don't know.

Are you familiar with the formalization of the C11/C++11 model by Batty
et al.?
http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
http://www.cl.cam.ac.uk/~mjb220/n3132.pdf

They also have a nice tool that can run condensed examples and show you
all allowed (and forbidden) executions (it runs in the browser, so is
slow for larger examples), including nice annotated graphs for those:
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

It requires somewhat special syntax, but the following, which should be
equivalent to your example above, runs just fine:

int main() {
  atomic_int foo = 0; 
  atomic_int bar = 0; 
  atomic_int baz = 0; 
  {{{ {
        foo.store(42, memory_order_relaxed);
        bar.store(1, memory_order_seq_cst);
        baz.store(42, memory_order_relaxed);
      }
  ||| {
        r1=baz.load(memory_order_seq_cst).readsvalue(42);
        r2=foo.load(memory_order_seq_cst).readsvalue(0);
      }
  }}};
  return 0; }

That yields 3 consistent executions for me, and likewise if the last
readsvalue() is using 42 as argument.

If you add a "fence(memory_order_seq_cst);" after the store to foo, the
program can't observe != 42 for foo anymore, because the seq-cst fence
is adding a synchronizes-with edge via the baz reads-from.
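
That is, with the first thread changed to the following (the fence is
the only new line, using the tool's fence() spelling):

  {
    foo.store(42, memory_order_relaxed);
    fence(memory_order_seq_cst);
    bar.store(1, memory_order_seq_cst);
    baz.store(42, memory_order_relaxed);
  }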

I think this is a really neat tool, and very helpful to answer such
questions as in your example.

> Certainly,
> the code that arm64 GCC currently spits out would allow the assertion to fire
> on some microarchitectures.
> 
> There are also so many ways to blow your head off it's untrue. For example,
> cmpxchg takes a separate memory model parameter for failure and success, but
> then there are restrictions on the sets you can use for each.

That's in there for the architectures without a single-instruction
CAS/cmpxchg, I believe.

> It's not hard
> to find well-known memory-ordering experts shouting "Just use
> memory_order_seq_cst for everything, it's too hard otherwise".

Everyone I've heard saying this meant this as advice to people new to
synchronization or just dealing infrequently with it.  The advice is the
simple and safe fallback, and I don't think it's meant as an
acknowledgment that the model itself would be too hard.  If the
language's memory model is supposed to represent weak HW memory models
to at least some extent, there's only so much you can do in terms of
keeping it simple.  If all architectures had x86-like models, the
language's model would certainly be simpler... :)

> Then there's
> the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> atm and optimises all of the data dependencies away)

AFAIK consume memory order was added to model Power/ARM-specific
behavior.  I agree that the way the standard specifies how dependencies
are to be preserved is kind of vague (as far as I understand it).  See
GCC PR 59448.

> as well as the definition
> of "data races", which seem to be used as an excuse to miscompile a program
> at the earliest opportunity.

No.  The purpose of this is to *not disallow* every optimization on
non-synchronizing code.  Due to the assumption of data-race-free
programs, the compiler can assume a sequential code sequence when no
atomics are involved (and thus, keep applying optimizations for
sequential code).

Or is there something particular that you dislike about the
specification of data races?

> Trying to introduce system concepts (writes to devices, interrupts,
> non-coherent agents) into this mess is going to be an uphill battle IMHO.

That might very well be true.

OTOH, if you would need to model this uniformly across different
architectures (ie, so that there is a intra-kernel-portable abstraction
for those system concepts), you might as well try doing this by
extending the C11/C++11 model.  Maybe that will not be successful or not
really a good fit, though, but at least then it's clear why that's the
case.

> I'd
> just rather stick to the semantics we have and the asm volatile barriers.
> 
> That's not to say I don't there's no room for improvement in what we have
> in the kernel. Certainly, I'd welcome allowing more relaxed operations on
> architectures that support them, but it needs to be something that at least
> the different architecture maintainers can understand how to implement
> efficiently behind an uncomplicated interface. I don't think that interface is
> C11.

IMHO, one thing worth considering is that for C/C++, the C11/C++11 is
the only memory model that has widespread support.  So, even though it's
a fairly weak memory model (unless you go for the "only seq-cst"
beginners advice) and thus comes with a higher complexity, this model is
what likely most people will be familiar with over time.  Deviating from
the "standard" model can have valid reasons, but it also has a cost in
that new contributors are more likely to be familiar with the "standard"
model.

Note that I won't claim that the C11/C++11 model is perfect -- there are
a few rough edges there (e.g., the forward progress guarantees are (or
used to be) a little coarse for my taste), and consume vs. dependencies
worries me as well.  But, IMHO, overall it's the best C/C++ language
model we have.


Torvald




* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 19:27       ` Paul E. McKenney
@ 2014-02-06 21:17         ` Torvald Riegel
  2014-02-06 22:11           ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-06 21:17 UTC (permalink / raw)
  To: paulmck
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote:
> > On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote:
> > > On 02/06/14 18:25, David Howells wrote:
> > > >
> > > > Is it worth considering a move towards using C11 atomics and barriers and
> > > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> > > > these.
> > > 
> > > 
> > > It sounds interesting to me, if we can make it work properly and 
> > > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in.
> > 
> > Given my (albeit limited) experience playing with the C11 spec and GCC, I
> > really think this is a bad idea for the kernel. It seems that nobody really
> > agrees on exactly how the C11 atomics map to real architectural
> > instructions on anything but the trivial architectures. For example, should
> > the following code fire the assert?
> > 
> > 
> > extern atomic<int> foo, bar, baz;
> > 
> > void thread1(void)
> > {
> > 	foo.store(42, memory_order_relaxed);
> > 	bar.fetch_add(1, memory_order_seq_cst);
> > 	baz.store(42, memory_order_relaxed);
> > }
> > 
> > void thread2(void)
> > {
> > 	while (baz.load(memory_order_seq_cst) != 42) {
> > 		/* do nothing */
> > 	}
> > 
> > 	assert(foo.load(memory_order_seq_cst) == 42);
> > }
> > 
> > 
> > To answer that question, you need to go and look at the definitions of
> > synchronises-with, happens-before, dependency_ordered_before and a whole
> > pile of vaguely written waffle to realise that you don't know. Certainly,
> > the code that arm64 GCC currently spits out would allow the assertion to fire
> > on some microarchitectures.
> 
> Yep!  I believe that a memory_order_seq_cst fence in combination with the
> fetch_add() would do the trick on many architectures, however.  All of
> this is one reason that any C11 definitions need to be individually
> overridable by individual architectures.

"Overridable" in which sense?  Do you want to change the semantics on
the language level in the sense of altering the memory model, or rather
use a different implementation under the hood to, for example, fix
deficiencies in the compilers?

> > There are also so many ways to blow your head off it's untrue. For example,
> > cmpxchg takes a separate memory model parameter for failure and success, but
> > then there are restrictions on the sets you can use for each. It's not hard
> > to find well-known memory-ordering experts shouting "Just use
> > memory_order_seq_cst for everything, it's too hard otherwise". Then there's
> > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > atm and optimises all of the data dependencies away) as well as the definition
> > of "data races", which seem to be used as an excuse to miscompile a program
> > at the earliest opportunity.
> 
> Trust me, rcu_dereference() is not going to be defined in terms of
> memory_order_consume until the compilers implement it both correctly and
> efficiently.  They are not there yet, and there is currently no shortage
> of compiler writers who would prefer to ignore memory_order_consume.

Do you have any input on
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
language standard's definition of dependencies?

> And rcu_dereference() will need per-arch overrides for some time during
> any transition to memory_order_consume.
> 
> > Trying to introduce system concepts (writes to devices, interrupts,
> > non-coherent agents) into this mess is going to be an uphill battle IMHO. I'd
> > just rather stick to the semantics we have and the asm volatile barriers.
> 
> And barrier() isn't going to go away any time soon, either.  And
> ACCESS_ONCE() needs to keep volatile semantics until there is some
> memory_order_whatever that prevents loads and stores from being coalesced.

I'd be happy to discuss something like this in ISO C++ SG1 (or has this
been discussed in the past already?).  But it needs to have a paper I
suppose.

Will you be in Issaquah for the C++ meeting next week?




* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 21:09       ` Torvald Riegel
@ 2014-02-06 21:55         ` Paul E. McKenney
  2014-02-06 22:58           ` Torvald Riegel
  2014-02-06 22:13         ` Joseph S. Myers
  2014-02-07 12:01         ` Will Deacon
  2 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-06 21:55 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote:
> > On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote:
> > > On 02/06/14 18:25, David Howells wrote:
> > > >
> > > > Is it worth considering a move towards using C11 atomics and barriers and
> > > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> > > > these.
> > > 
> > > 
> > > It sounds interesting to me, if we can make it work properly and 
> > > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in.
> > 
> > Given my (albeit limited) experience playing with the C11 spec and GCC, I
> > really think this is a bad idea for the kernel.
> 
> I'm not going to comment on what's best for the kernel (simply because I
> don't work on it), but I disagree with several of your statements.
> 
> > It seems that nobody really
> > agrees on exactly how the C11 atomics map to real architectural
> > instructions on anything but the trivial architectures.
> 
> There's certainly different ways to implement the memory model and those
> have to be specified elsewhere, but I don't see how this differs much
> from other things specified in the ABI(s) for each architecture.
> 
> > For example, should
> > the following code fire the assert?
> 
> I don't see how your example (which is about what the language requires
> or not) relates to the statement about the mapping above?
> 
> > 
> > extern atomic<int> foo, bar, baz;
> > 
> > void thread1(void)
> > {
> > 	foo.store(42, memory_order_relaxed);
> > 	bar.fetch_add(1, memory_order_seq_cst);
> > 	baz.store(42, memory_order_relaxed);
> > }
> > 
> > void thread2(void)
> > {
> > 	while (baz.load(memory_order_seq_cst) != 42) {
> > 		/* do nothing */
> > 	}
> > 
> > 	assert(foo.load(memory_order_seq_cst) == 42);
> > }
> > 
> 
> It's a good example.  My first gut feeling was that the assertion should
> never fire, but that was wrong because (as I seem to usually forget) the
> seq-cst total order is just a constraint but doesn't itself contribute
> to synchronizes-with -- but this is different for seq-cst fences.

From what I can see, Will's point is that mapping the Linux kernel's
atomic_add_return() primitive into fetch_add() does not work because
atomic_add_return()'s ordering properties require that the assert()
never fire.

Augmenting the fetch_add() with a seq_cst fence would work on many
architectures, but not for all similar examples.  The reason is that
the C11 seq_cst fence is deliberately weak compared to ARM's dmb or
Power's sync.  To your point, I believe that it would make the above
example work, but there are some IRIW-like examples that would fail
according to the standard (though a number of specific implementations
would in fact work correctly).

> > To answer that question, you need to go and look at the definitions of
> > synchronises-with, happens-before, dependency_ordered_before and a whole
> > pile of vaguely written waffle to realise that you don't know.
> 
> Are you familiar with the formalization of the C11/C++11 model by Batty
> et al.?
> http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> 
> They also have a nice tool that can run condensed examples and show you
> all allowed (and forbidden) executions (it runs in the browser, so is
> slow for larger examples), including nice annotated graphs for those:
> http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> 
> It requires somewhat special syntax, but the following, which should be
> equivalent to your example above, runs just fine:
> 
> int main() {
>   atomic_int foo = 0; 
>   atomic_int bar = 0; 
>   atomic_int baz = 0; 
>   {{{ {
>         foo.store(42, memory_order_relaxed);
>         bar.store(1, memory_order_seq_cst);
>         baz.store(42, memory_order_relaxed);
>       }
>   ||| {
>         r1=baz.load(memory_order_seq_cst).readsvalue(42);
>         r2=foo.load(memory_order_seq_cst).readsvalue(0);
>       }
>   }}};
>   return 0; }
> 
> That yields 3 consistent executions for me, and likewise if the last
> readsvalue() is using 42 as argument.
> 
> If you add a "fence(memory_order_seq_cst);" after the store to foo, the
> program can't observe != 42 for foo anymore, because the seq-cst fence
> is adding a synchronizes-with edge via the baz reads-from.
> 
> I think this is a really neat tool, and very helpful to answer such
> questions as in your example.

Hmmm...  The tool doesn't seem to like fetch_add().  But let's assume that
your substitution of store() for fetch_add() is correct.  Then this shows
that we cannot substitute fetch_add() for atomic_add_return().

Adding atomic_thread_fence(memory_order_seq_cst) after the bar.store
gives me "192 executions; no consistent", so perhaps there is hope for
augmenting the fetch_add() with a fence.  Except, as noted above, for
any number of IRIW-like examples such as the following:

int main() {
  atomic_int x = 0; atomic_int y = 0;
  {{{ x.store(1, memory_order_release);
  ||| y.store(1, memory_order_release);
  ||| { r1=x.load(memory_order_relaxed).readsvalue(1);
        atomic_thread_fence(memory_order_seq_cst);
        r2=y.load(memory_order_relaxed).readsvalue(0); }
  ||| { r3=y.load(memory_order_relaxed).readsvalue(1);
        atomic_thread_fence(memory_order_seq_cst);
        r4=x.load(memory_order_relaxed).readsvalue(0); }
  }}};
  return 0; }

Adding a seq_cst store to a new variable z between each pair of reads
seems to choke cppmem:

int main() {
  atomic_int x = 0; atomic_int y = 0; atomic_int z = 0;
  {{{ x.store(1, memory_order_release);
  ||| y.store(1, memory_order_release);
  ||| { r1=x.load(memory_order_relaxed).readsvalue(1);
        z.store(1, memory_order_seq_cst);
        atomic_thread_fence(memory_order_seq_cst);
        r2=y.load(memory_order_relaxed).readsvalue(0); }
  ||| { r3=y.load(memory_order_relaxed).readsvalue(1);
        z.store(1, memory_order_seq_cst);
        atomic_thread_fence(memory_order_seq_cst);
        r4=x.load(memory_order_relaxed).readsvalue(0); }
  }}};
  return 0; }

Ah, it did eventually finish with "576 executions; 6 consistent, all
race free".  So this is an example where C11 has a hard time modeling
the Linux kernel's atomic_add_return().  Therefore, use of C11 atomics
to implement Linux kernel atomic operations requires knowledge of the
underlying architecture and the compiler's implementation, as was noted
earlier in this thread.

> > Certainly,
> > the code that arm64 GCC currently spits out would allow the assertion to fire
> > on some microarchitectures.
> > 
> > There are also so many ways to blow your head off it's untrue. For example,
> > cmpxchg takes a separate memory model parameter for failure and success, but
> > then there are restrictions on the sets you can use for each.
> 
> That's in there for the architectures without a single-instruction
> CAS/cmpxchg, I believe.

Yep.  The Linux kernel currently requires the rough equivalent of
memory_order_seq_cst for both paths, but there is some chance that the
failure-path requirement might be weakened.
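
In C11 terms that would look something like this (a sketch only,
assuming <stdatomic.h>; the _c11 name is mine, not an actual kernel
interface):

	#include <stdatomic.h>

	static inline int cmpxchg_c11(atomic_int *v, int old, int new)
	{
		atomic_compare_exchange_strong_explicit(v, &old, new,
				memory_order_seq_cst,	/* success */
				memory_order_seq_cst);	/* failure */
		/* old now holds the value actually observed, as with
		 * the kernel's cmpxchg() */
		return old;
	}

where the failure-path argument is the one that might eventually be
relaxed.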

> > It's not hard
> > to find well-known memory-ordering experts shouting "Just use
> > memory_order_seq_cst for everything, it's too hard otherwise".
> 
> Everyone I've heard saying this meant this as advice to people new to
> synchronization or just dealing infrequently with it.  The advice is the
> simple and safe fallback, and I don't think it's meant as an
> acknowledgment that the model itself would be too hard.  If the
> language's memory model is supposed to represent weak HW memory models
> to at least some extent, there's only so much you can do in terms of
> keeping it simple.  If all architectures had x86-like models, the
> language's model would certainly be simpler... :)

That is said a lot, but there was a recent Linux-kernel example that
turned out to be quite hard to prove for x86.  ;-)

> > Then there's
> > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > atm and optimises all of the data dependencies away)
> 
> AFAIK consume memory order was added to model Power/ARM-specific
> behavior.  I agree that the way the standard specifies how dependencies
> are to be preserved is kind of vague (as far as I understand it).  See
> GCC PR 59448.

This one? http://gcc.gnu.org/ml/gcc-bugs/2013-12/msg01083.html

That does indeed look to match what Will was calling out as a problem.

> > as well as the definition
> > of "data races", which seem to be used as an excuse to miscompile a program
> > at the earliest opportunity.
> 
> No.  The purpose of this is to *not disallow* every optimization on
> non-synchronizing code.  Due to the assumption of data-race-free
> programs, the compiler can assume a sequential code sequence when no
> atomics are involved (and thus, keep applying optimizations for
> sequential code).
> 
> Or is there something particular that you dislike about the
> specification of data races?

Cut Will a break, Torvald!  ;-)

> > Trying to introduce system concepts (writes to devices, interrupts,
> > non-coherent agents) into this mess is going to be an uphill battle IMHO.
> 
> That might very well be true.
> 
> OTOH, if you would need to model this uniformly across different
> architectures (ie, so that there is a intra-kernel-portable abstraction
> for those system concepts), you might as well try doing this by
> extending the C11/C++11 model.  Maybe that will not be successful or not
> really a good fit, though, but at least then it's clear why that's the
> case.

I would guess that Linux-kernel use of C11 atomics will be selected or not
on an architecture-specific basis for the foreseeable future.

> > I'd
> > just rather stick to the semantics we have and the asm volatile barriers.
> > 
> > That's not to say there's no room for improvement in what we have
> > in the kernel. Certainly, I'd welcome allowing more relaxed operations on
> > architectures that support them, but it needs to be something that at least
> > the different architecture maintainers can understand how to implement
> > efficiently behind an uncomplicated interface. I don't think that interface is
> > C11.
> 
> IMHO, one thing worth considering is that for C/C++, the C11/C++11 is
> the only memory model that has widespread support.  So, even though it's
> a fairly weak memory model (unless you go for the "only seq-cst"
> beginners advice) and thus comes with a higher complexity, this model is
> what likely most people will be familiar with over time.  Deviating from
> the "standard" model can have valid reasons, but it also has a cost in
> that new contributors are more likely to be familiar with the "standard"
> model.
> 
> Note that I won't claim that the C11/C++11 model is perfect -- there are
> a few rough edges there (e.g., the forward progress guarantees are (or
> used to be) a little coarse for my taste), and consume vs. dependencies
> worries me as well.  But, IMHO, overall it's the best C/C++ language
> model we have.

I could be wrong, but I strongly suspect that in the near term,
any memory-model migration of the 15M+ LoC Linux-kernel code base
will be incremental in nature.  Especially if the C/C++ committee
insists on strengthening memory_order_relaxed.  :-/

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 21:17         ` Torvald Riegel
@ 2014-02-06 22:11           ` Paul E. McKenney
  2014-02-06 23:44             ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-06 22:11 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote:
> > > On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote:
> > > > On 02/06/14 18:25, David Howells wrote:
> > > > >
> > > > > Is it worth considering a move towards using C11 atomics and barriers and
> > > > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> > > > > these.
> > > > 
> > > > 
> > > > It sounds interesting to me, if we can make it work properly and 
> > > > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in.
> > > 
> > > Given my (albeit limited) experience playing with the C11 spec and GCC, I
> > > really think this is a bad idea for the kernel. It seems that nobody really
> > > agrees on exactly how the C11 atomics map to real architectural
> > > instructions on anything but the trivial architectures. For example, should
> > > the following code fire the assert?
> > > 
> > > 
> > > extern atomic<int> foo, bar, baz;
> > > 
> > > void thread1(void)
> > > {
> > > 	foo.store(42, memory_order_relaxed);
> > > 	bar.fetch_add(1, memory_order_seq_cst);
> > > 	baz.store(42, memory_order_relaxed);
> > > }
> > > 
> > > void thread2(void)
> > > {
> > > 	while (baz.load(memory_order_seq_cst) != 42) {
> > > 		/* do nothing */
> > > 	}
> > > 
> > > 	assert(foo.load(memory_order_seq_cst) == 42);
> > > }
> > > 
> > > 
> > > To answer that question, you need to go and look at the definitions of
> > > synchronises-with, happens-before, dependency_ordered_before and a whole
> > > pile of vaguely written waffle to realise that you don't know. Certainly,
> > > the code that arm64 GCC currently spits out would allow the assertion to fire
> > > on some microarchitectures.
> > 
> > Yep!  I believe that a memory_order_seq_cst fence in combination with the
> > fetch_add() would do the trick on many architectures, however.  All of
> > this is one reason that any C11 definitions need to be individually
> > overridable by individual architectures.
> 
> "Overridable" in which sense?  Do you want to change the semantics on
> the language level in the sense of altering the memory model, or rather
> use a different implementation under the hood to, for example, fix
> deficiencies in the compilers?

We need the architecture maintainer to be able to select either an
assembly-language implementation or a C11-atomics implementation for any
given Linux-kernel operation.  For example, a given architecture might
be able to use fetch_add(1, memory_order_relaxed) for atomic_inc() but
assembly for atomic_add_return().  This is because atomic_inc() is not
required to have any particular ordering properties, while as discussed
previously, atomic_add_return() requires tighter ordering than the C11
standard provides.
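
To make that concrete, here is a rough sketch (using GCC's __atomic
builtins; illustrative only, not actual kernel code) of what such a
mixed selection might look like:

typedef struct { int counter; } atomic_t;

static inline void atomic_inc(atomic_t *v)
{
	/* atomic_inc() needs no ordering guarantees, so relaxed suffices. */
	__atomic_fetch_add(&v->counter, 1, __ATOMIC_RELAXED);
}

static inline int atomic_add_return(int i, atomic_t *v)
{
	/*
	 * Candidate C11-style mapping only: bracketing with seq_cst
	 * fences is the obvious approximation, but as discussed in
	 * this thread it is not guaranteed to match the kernel's
	 * full-barrier semantics on all architectures, which is why
	 * the architecture maintainer must be able to override this
	 * with assembly.
	 */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
	int ret = __atomic_fetch_add(&v->counter, i, __ATOMIC_SEQ_CST) + i;
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
	return ret;
}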

> > > There are also so many ways to blow your head off it's untrue. For example,
> > > cmpxchg takes a separate memory model parameter for failure and success, but
> > > then there are restrictions on the sets you can use for each. It's not hard
> > > to find well-known memory-ordering experts shouting "Just use
> > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's
> > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > > atm and optimises all of the data dependencies away) as well as the definition
> > > of "data races", which seem to be used as an excuse to miscompile a program
> > > at the earliest opportunity.
> > 
> > Trust me, rcu_dereference() is not going to be defined in terms of
> > memory_order_consume until the compilers implement it both correctly and
> > efficiently.  They are not there yet, and there is currently no shortage
> > of compiler writers who would prefer to ignore memory_order_consume.
> 
> Do you have any input on
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> language standard's definition of dependencies?

Let's see...  1.10p9 says that a dependency must be carried unless:

— B is an invocation of any specialization of std::kill_dependency (29.3), or
— A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator,
or
— A is the left operand of a conditional (?:, see 5.16) operator, or
— A is the left operand of the built-in comma (,) operator (5.18);

So the use of "flag" before the "?" is ignored.  But the "flag - flag"
after the "?" will carry a dependency, so the code fragment in 59448
needs to do the ordering rather than just optimizing "flag - flag" out
of existence.  One way to do that on both ARM and Power is to actually
emit code for "flag - flag", but there are a number of other ways to
make that work.
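
As a minimal illustration (variable names assumed; this is not the
exact PR 59448 test case):

#include <stdatomic.h>

atomic_int flag;
int data[2];

int reader(void)
{
	int f = atomic_load_explicit(&flag, memory_order_consume);

	/*
	 * Per 1.10p9, "f" as the left operand of "?" does not carry a
	 * dependency, but "f - f" in the selected arm does, so the
	 * compiler must preserve the ordering that dependency provides
	 * rather than just folding the expression to zero.
	 */
	return f ? data[f - f] : 0;
}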

BTW, there is some discussion on 1.10p9's handling of && and ||, and
that clause is likely to change.  And yes, I am behind on analyzing
usage in the Linux kernel to find out if Linux cares...

> > And rcu_dereference() will need per-arch overrides for some time during
> > any transition to memory_order_consume.
> > 
> > > Trying to introduce system concepts (writes to devices, interrupts,
> > > non-coherent agents) into this mess is going to be an uphill battle IMHO. I'd
> > > just rather stick to the semantics we have and the asm volatile barriers.
> > 
> > And barrier() isn't going to go away any time soon, either.  And
> > ACCESS_ONCE() needs to keep volatile semantics until there is some
> > memory_order_whatever that prevents loads and stores from being coalesced.
> 
> I'd be happy to discuss something like this in ISO C++ SG1 (or has this
> been discussed in the past already?).  But it needs to have a paper I
> suppose.

The current position of the usual suspects other than me is that this
falls into the category of forward-progress guarantees, which are
considered (again, by the usual suspects other than me) to be out
of scope.

> Will you be in Issaquah for the C++ meeting next week?

Weather permitting, I will be there!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 21:09       ` Torvald Riegel
  2014-02-06 21:55         ` Paul E. McKenney
@ 2014-02-06 22:13         ` Joseph S. Myers
  2014-02-06 23:25           ` Torvald Riegel
  2014-02-07 12:01         ` Will Deacon
  2 siblings, 1 reply; 299+ messages in thread
From: Joseph S. Myers @ 2014-02-06 22:13 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, paulmck, gcc

On Thu, 6 Feb 2014, Torvald Riegel wrote:

> > It seems that nobody really
> > agrees on exactly how the C11 atomics map to real architectural
> > instructions on anything but the trivial architectures.
> 
> There's certainly different ways to implement the memory model and those
> have to be specified elsewhere, but I don't see how this differs much
> from other things specified in the ABI(s) for each architecture.

It is not clear to me that there is any good consensus understanding of 
how to specify this in an ABI, or how, given an ABI, to determine whether 
an implementation is valid.

For ABIs not considering atomics / concurrency, it's well understood, for 
example, that the ABI specifies observable conditions at function call 
boundaries ("these arguments are in these registers", "the stack pointer 
has this alignment", "on return from the function, the values of these 
registers are unchanged from the values they had on entry").  It may 
sometimes specify things at other times (e.g. presence or absence of a red 
zone - whether memory beyond the stack pointer may be overwritten on an 
interrupt).  But if it gives a code sequence, it's clear this is just an 
example rather than a requirement for particular code to be generated - 
any code sequence suffices if it meets the observable conditions at the 
points where code generated by one implementation may pass control to code 
generated by another implementation.

When atomics are involved, you no longer have a limited set of 
well-defined points where control may pass from code generated by one 
implementation to code generated by another - the code generated by the 
two may be running concurrently.

We know of certain cases 
<http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html> where there are 
choices of the mapping of atomic operations to particular instructions.  
But I'm not sure there's much evidence that these are the only ABI issues 
arising from concurrency - that there aren't any other ways in which an 
implementation may transform code, consistent with the as-if rule of ISO 
C, that may run into incompatibilities of different choices.  And even if 
those are the only issues, it's not clear there are well-understood ways 
to define the mapping from the C11 memory model to the architecture's 
model, which provide a good way to reason about whether a particular 
choice of instructions is valid according to the mapping.

> Are you familiar with the formalization of the C11/C++11 model by Batty
> et al.?
> http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> http://www.cl.cam.ac.uk/~mjb220/n3132.pdf

These discuss, as well as the model itself, proving the validity of a 
particular choice of x86 instructions.  I imagine that might be a starting 
point towards an understanding of how to describe the relevant issues in 
an ABI, and how to determine whether a choice of instructions is 
consistent with what an ABI says.  But I don't get the impression this is 
yet at the level where people not doing research in the area can routinely 
read and write such ABIs and work out whether a code sequence is 
consistent with them.

(If an ABI says "use instruction X", then you can't use a more efficient 
X' added by a new version of the instruction set.  But it can't 
necessarily be as loose as saying "use X and Y, or other instructions that 
achieve the same semantics when the other thread is using X or Y", because it might 
be the case that Y' interoperates with X, X' interoperates with Y, but X' 
and Y' don't interoperate with each other.  I'd envisage something more 
like mapping not to instructions, but to concepts within the 
architecture's own memory model - but that requires the architecture's 
memory model to be as well defined, and probably formalized, as the C11 
one.)

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 21:55         ` Paul E. McKenney
@ 2014-02-06 22:58           ` Torvald Riegel
  2014-02-07  4:06             ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-06 22:58 UTC (permalink / raw)
  To: paulmck
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote:
> On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote:
> > > On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote:
> > > > On 02/06/14 18:25, David Howells wrote:
> > > > >
> > > > > Is it worth considering a move towards using C11 atomics and barriers and
> > > > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> > > > > these.
> > > > 
> > > > 
> > > > It sounds interesting to me, if we can make it work properly and 
> > > > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in.
> > > 
> > > Given my (albeit limited) experience playing with the C11 spec and GCC, I
> > > really think this is a bad idea for the kernel.
> > 
> > I'm not going to comment on what's best for the kernel (simply because I
> > don't work on it), but I disagree with several of your statements.
> > 
> > > It seems that nobody really
> > > agrees on exactly how the C11 atomics map to real architectural
> > > instructions on anything but the trivial architectures.
> > 
> > There's certainly different ways to implement the memory model and those
> > have to be specified elsewhere, but I don't see how this differs much
> > from other things specified in the ABI(s) for each architecture.
> > 
> > > For example, should
> > > the following code fire the assert?
> > 
> > I don't see how your example (which is about what the language requires
> > or not) relates to the statement about the mapping above?
> > 
> > > 
> > > extern atomic<int> foo, bar, baz;
> > > 
> > > void thread1(void)
> > > {
> > > 	foo.store(42, memory_order_relaxed);
> > > 	bar.fetch_add(1, memory_order_seq_cst);
> > > 	baz.store(42, memory_order_relaxed);
> > > }
> > > 
> > > void thread2(void)
> > > {
> > > 	while (baz.load(memory_order_seq_cst) != 42) {
> > > 		/* do nothing */
> > > 	}
> > > 
> > > 	assert(foo.load(memory_order_seq_cst) == 42);
> > > }
> > > 
> > 
> > It's a good example.  My first gut feeling was that the assertion should
> > never fire, but that was wrong because (as I seem to usually forget) the
> > seq-cst total order is just a constraint but doesn't itself contribute
> > to synchronizes-with -- but this is different for seq-cst fences.
> 
> From what I can see, Will's point is that mapping the Linux kernel's
> atomic_add_return() primitive into fetch_add() does not work because
> atomic_add_return()'s ordering properties require that the assert()
> never fire.
> 
> Augmenting the fetch_add() with a seq_cst fence would work on many
> architectures, but not for all similar examples.  The reason is that
> the C11 seq_cst fence is deliberately weak compared to ARM's dmb or
> Power's sync.  To your point, I believe that it would make the above
> example work, but there are some IRIW-like examples that would fail
> according to the standard (though a number of specific implementations
> would in fact work correctly).

Thanks for the background.  I don't read LKML, and it wasn't obvious
from reading just the part of the thread posted to gcc@ that achieving
these semantics is the goal.

> > > To answer that question, you need to go and look at the definitions of
> > > synchronises-with, happens-before, dependency_ordered_before and a whole
> > > pile of vaguely written waffle to realise that you don't know.
> > 
> > Are you familiar with the formalization of the C11/C++11 model by Batty
> > et al.?
> > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > 
> > They also have a nice tool that can run condensed examples and show you
> > all allowed (and forbidden) executions (it runs in the browser, so is
> > slow for larger examples), including nice annotated graphs for those:
> > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> > 
> > It requires somewhat special syntax, but the following, which should be
> > equivalent to your example above, runs just fine:
> > 
> > int main() {
> >   atomic_int foo = 0; 
> >   atomic_int bar = 0; 
> >   atomic_int baz = 0; 
> >   {{{ {
> >         foo.store(42, memory_order_relaxed);
> >         bar.store(1, memory_order_seq_cst);
> >         baz.store(42, memory_order_relaxed);
> >       }
> >   ||| {
> >         r1=baz.load(memory_order_seq_cst).readsvalue(42);
> >         r2=foo.load(memory_order_seq_cst).readsvalue(0);
> >       }
> >   }}};
> >   return 0; }
> > 
> > That yields 3 consistent executions for me, and likewise if the last
> > readsvalue() is using 42 as argument.
> > 
> > If you add a "fence(memory_order_seq_cst);" after the store to foo, the
> > program can't observe != 42 for foo anymore, because the seq-cst fence
> > is adding a synchronizes-with edge via the baz reads-from.
> > 
> > I think this is a really neat tool, and very helpful to answer such
> > questions as in your example.
> 
> Hmmm...  The tool doesn't seem to like fetch_add().  But let's assume that
> your substitution of store() for fetch_add() is correct.  Then this shows
> that we cannot substitute fetch_add() for atomic_add_return().

It should be in this example, I believe.

The tool also supports CAS:
cas_strong_explicit(bar,0,1,memory_order_seq_cst,memory_order_seq_cst);
cas_strong(bar,0,1);
cas_weak likewise...

With that change, I get a few more executions but still the 3 consistent
ones for either outcome.
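
Spelled out (assuming the CAS simply replaces the bar store in the
earlier test), the modified cppmem input would be:

int main() {
  atomic_int foo = 0;
  atomic_int bar = 0;
  atomic_int baz = 0;
  {{{ {
        foo.store(42, memory_order_relaxed);
        cas_strong_explicit(bar,0,1,memory_order_seq_cst,memory_order_seq_cst);
        baz.store(42, memory_order_relaxed);
      }
  ||| {
        r1=baz.load(memory_order_seq_cst).readsvalue(42);
        r2=foo.load(memory_order_seq_cst).readsvalue(0);
      }
  }}};
  return 0; }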

> > > It's not hard
> > > to find well-known memory-ordering experts shouting "Just use
> > > memory_model_seq_cst for everything, it's too hard otherwise".
> > 
> > Everyone I've heard saying this meant this as advice to people new to
> > synchronization or just dealing infrequently with it.  The advice is the
> > simple and safe fallback, and I don't think it's meant as an
> > acknowledgment that the model itself would be too hard.  If the
> > language's memory model is supposed to represent weak HW memory models
> > to at least some extent, there's only so much you can do in terms of
> > keeping it simple.  If all architectures had x86-like models, the
> > language's model would certainly be simpler... :)
> 
> That is said a lot, but there was a recent Linux-kernel example that
> turned out to be quite hard to prove for x86.  ;-)
> 
> > > Then there's
> > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > > atm and optimises all of the data dependencies away)
> > 
> > AFAIK consume memory order was added to model Power/ARM-specific
> > behavior.  I agree that the way the standard specifies how dependencies
> > are to be preserved is kind of vague (as far as I understand it).  See
> > GCC PR 59448.
> 
> This one? http://gcc.gnu.org/ml/gcc-bugs/2013-12/msg01083.html

Yes, this bug, and I'd like to get feedback on the implementability of
this: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448#c10

> That does indeed look to match what Will was calling out as a problem.
> 
> > > as well as the definition
> > > of "data races", which seem to be used as an excuse to miscompile a program
> > > at the earliest opportunity.
> > 
> > No.  The purpose of this is to *not disallow* every optimization on
> > non-synchronizing code.  Due to the assumption of data-race-free
> > programs, the compiler can assume a sequential code sequence when no
> > atomics are involved (and thus, keep applying optimizations for
> > sequential code).
> > 
> > Or is there something particular that you dislike about the
> > specification of data races?
> 
> Cut Will a break, Torvald!  ;-)

Sorry if my comment came across too harsh.  I've observed a couple of
times that people forget about the compiler's role in this, and it might
not be obvious, so I wanted to point out just that.



^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 22:13         ` Joseph S. Myers
@ 2014-02-06 23:25           ` Torvald Riegel
  2014-02-06 23:33             ` Joseph S. Myers
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-06 23:25 UTC (permalink / raw)
  To: Joseph S. Myers
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, paulmck, gcc

On Thu, 2014-02-06 at 22:13 +0000, Joseph S. Myers wrote:
> On Thu, 6 Feb 2014, Torvald Riegel wrote:
> 
> > > It seems that nobody really
> > > agrees on exactly how the C11 atomics map to real architectural
> > > instructions on anything but the trivial architectures.
> > 
> > There's certainly different ways to implement the memory model and those
> > have to be specified elsewhere, but I don't see how this differs much
> > from other things specified in the ABI(s) for each architecture.
> 
> It is not clear to me that there is any good consensus understanding of 
> how to specify this in an ABI, or how, given an ABI, to determine whether 
> an implementation is valid.
> 
> For ABIs not considering atomics / concurrency, it's well understood, for 
> example, that the ABI specifies observable conditions at function call 
> boundaries ("these arguments are in these registers", "the stack pointer 
> has this alignment", "on return from the function, the values of these 
> registers are unchanged from the values they had on entry").  It may 
> sometimes specify things at other times (e.g. presence or absence of a red 
> zone - whether memory beyond the stack pointer may be overwritten on an 
> interrupt).  But if it gives a code sequence, it's clear this is just an 
> example rather than a requirement for particular code to be generated - 
> any code sequence suffices if it meets the observable conditions at the 
> points where code generated by one implementation may pass control to code 
> generated by another implementation.
> 
> When atomics are involved, you no longer have a limited set of 
> well-defined points where control may pass from code generated by one 
> implementation to code generated by another - the code generated by the 
> two may be running concurrently.

Agreed.

> We know of certain cases 
> <http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html> where there are 
> choices of the mapping of atomic operations to particular instructions.  
> But I'm not sure there's much evidence that these are the only ABI issues 
> arising from concurrency - that there aren't any other ways in which an 
> implementation may transform code, consistent with the as-if rule of ISO 
> C, that may run into incompatibilities of different choices.

I can't think of other incompatibilities with high likelihood --
provided we ignore consume memory order and the handling of dependencies
(see below).  But I would doubt there is a high risk of such because (1)
the data race definition should hopefully not cause subtle
incompatibilities and (2) there is a clear "hand-off point" from the
compiler to a specific instruction representing an atomic access.  For
example, if we have a release store, and we agree on the instructions
used for that, then compilers will have to ensure happens-before for
anything before the release store; for example, as long as stores
sequenced-before the release store are performed, it doesn't matter in
which order that happens.  Subsequently, an acquire-load somewhere else
can pick this sequence of events up just by using the agreed-upon
acquire-load; like with the stores, it can order subsequent loads in any
way it sees fit, including different optimizations.  That's obviously
not a formal proof, though.  But it seems likely to me, at least :)
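
A minimal sketch of that hand-off in plain C11 (nothing here is
specific to any one compiler):

#include <stdatomic.h>

int payload;		/* ordinary, non-atomic data */
atomic_int ready;

void producer(void)
{
	payload = 42;	/* stores before the release may be reordered
			 * among themselves... */
	atomic_store_explicit(&ready, 1, memory_order_release);
			/* ...but none may move past the release store */
}

int consumer(void)
{
	if (atomic_load_explicit(&ready, memory_order_acquire) == 1)
		return payload;	/* guaranteed to read 42 */
	return -1;
}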

I'm more concerned about consume and dependencies because as far as I
understand the standard's requirements, dependencies need to be tracked
across function calls.  Thus, we might have several compilers involved
in that, and we can't just "condense" things to happens-before; what
matters instead is how, and whether, we keep dependencies intact.
Because of that, I'm
wondering whether this is actually implementable practically.  (See
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448#c10)

> And even if 
> those are the only issues, it's not clear there are well-understood ways 
> to define the mapping from the C11 memory model to the architecture's 
> model, which provide a good way to reason about whether a particular 
> choice of instructions is valid according to the mapping.

I think that if we have different options, there needs to be agreement
on which to choose across the compilers, at the very least.  I don't
quite know how this looks for GCC and LLVM, for example.

> > Are you familiar with the formalization of the C11/C++11 model by Batty
> > et al.?
> > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> 
> These discuss, as well as the model itself, proving the validity of a 
> particular choice of x86 instructions.  I imagine that might be a starting 
> point towards an understanding of how to describe the relevant issues in 
> an ABI, and how to determine whether a choice of instructions is 
> consistent with what an ABI says.  But I don't get the impression this is 
> yet at the level where people not doing research in the area can routinely 
> read and write such ABIs and work out whether a code sequence is 
> consistent with them.

It's certainly complicated stuff (IMHO).  On the positive side, it's
just a few compilers, the kernel, etc. that have to deal with this, if
most programmers use the languages' memory model.

> (If an ABI says "use instruction X", then you can't use a more efficient 
> X' added by a new version of the instruction set.  But it can't 
> necessarily be as loose as saying "use X and Y, or other instructions that 
> achieve the same semantics when the other thread is using X or Y", because it might 
> be the case that Y' interoperates with X, X' interoperates with Y, but X' 
> and Y' don't interoperate with each other.  I'd envisage something more 
> like mapping not to instructions, but to concepts within the 
> architecture's own memory model - but that requires the architecture's 
> memory model to be as well defined, and probably formalized, as the C11 
> one.)

Yes.  The same group of researchers have also worked on formalizing the
Power model, and use this as base for a proof of the correctness of the
proposed mappings: http://www.cl.cam.ac.uk/~pes20/cppppc/

The formal approach to all this might be a complex task, but it is more
confidence-inspiring than making guesses about one standard's prose vs.
another specification's prose.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 23:25           ` Torvald Riegel
@ 2014-02-06 23:33             ` Joseph S. Myers
  0 siblings, 0 replies; 299+ messages in thread
From: Joseph S. Myers @ 2014-02-06 23:33 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, paulmck, gcc

On Fri, 7 Feb 2014, Torvald Riegel wrote:

> I think that if we have different options, there needs to be agreement
> on which to choose across the compilers, at the very least.  I don't
> quite know how this looks for GCC and LLVM, for example.

I'm not sure we even necessarily get compatibility for the alignment of 
_Atomic types yet (and no ABI document I've seen discusses that issue).
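
(As a quick illustration of that issue: whether the two values printed
by the following agree across implementations for the same target is
precisely the kind of thing an ABI would need to pin down.)

#include <stdio.h>

int main(void)
{
	printf("%zu %zu\n",
	       _Alignof(long long), _Alignof(_Atomic long long));
	return 0;
}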

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 22:11           ` Paul E. McKenney
@ 2014-02-06 23:44             ` Torvald Riegel
  2014-02-07  4:20               ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-06 23:44 UTC (permalink / raw)
  To: paulmck
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote:
> > > > There are also so many ways to blow your head off it's untrue. For example,
> > > > cmpxchg takes a separate memory model parameter for failure and success, but
> > > > then there are restrictions on the sets you can use for each. It's not hard
> > > > to find well-known memory-ordering experts shouting "Just use
> > > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's
> > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > > > atm and optimises all of the data dependencies away) as well as the definition
> > > > of "data races", which seem to be used as an excuse to miscompile a program
> > > > at the earliest opportunity.
> > > 
> > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > memory_order_consume until the compilers implement it both correctly and
> > > efficiently.  They are not there yet, and there is currently no shortage
> > > of compiler writers who would prefer to ignore memory_order_consume.
> > 
> > Do you have any input on
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > language standard's definition of dependencies?
> 
> Let's see...  1.10p9 says that a dependency must be carried unless:
> 
> — B is an invocation of any specialization of std::kill_dependency (29.3), or
> — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator,
> or
> — A is the left operand of a conditional (?:, see 5.16) operator, or
> — A is the left operand of the built-in comma (,) operator (5.18);
> 
> So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> after the "?" will carry a dependency, so the code fragment in 59448
> needs to do the ordering rather than just optimizing "flag - flag" out
> of existence.  One way to do that on both ARM and Power is to actually
> emit code for "flag - flag", but there are a number of other ways to
> make that work.

And that's what would concern me, considering that these requirements
seem to be able to creep out easily.  Also, whereas the other atomics
just constrain compilers wrt. reordering across atomic accesses or
changes to the atomic accesses themselves, the dependencies are new
requirements on pieces of otherwise non-synchronizing code.  The latter
seems far more involved to me.

> BTW, there is some discussion on 1.10p9's handling of && and ||, and
> that clause is likely to change.  And yes, I am behind on analyzing
> usage in the Linux kernel to find out if Linux cares...

Do you have any pointers to these discussions (e.g., LWG issues)?

> > > And rcu_dereference() will need per-arch overrides for some time during
> > > any transition to memory_order_consume.
> > > 
> > > > Trying to introduce system concepts (writes to devices, interrupts,
> > > > non-coherent agents) into this mess is going to be an uphill battle IMHO. I'd
> > > > just rather stick to the semantics we have and the asm volatile barriers.
> > > 
> > > And barrier() isn't going to go away any time soon, either.  And
> > > ACCESS_ONCE() needs to keep volatile semantics until there is some
> > > memory_order_whatever that prevents loads and stores from being coalesced.
> > 
> > I'd be happy to discuss something like this in ISO C++ SG1 (or has this
> > been discussed in the past already?).  But it needs to have a paper I
> > suppose.
> 
> The current position of the usual suspects other than me is that this
> falls into the category of forward-progress guarantees, which are
> considered (again, by the usual suspects other than me) to be out
> of scope.

But I think we need to better describe forward progress, even though
that might be tricky.  We made at least some progress on
http://cplusplus.github.io/LWG/lwg-active.html#2159 in Chicago, even
though we can't constrain the OS schedulers too much, and for lock-free
we're in this weird position that on most general-purpose schedulers and
machines, obstruction-free algorithms are likely to work just fine like
lock-free, most of the time, in practice...

We also need to discuss forward progress guarantees for any
parallelism/concurrency abstractions, I believe:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3874.pdf

Hopefully we'll get some more acceptance of this being in scope...

> > Will you be in Issaquah for the C++ meeting next week?
> 
> Weather permitting, I will be there!

Great, maybe we can find some time in SG1 to discuss this then.  Even if
the standard doesn't want to include it, SG1 should be a good forum to
understand everyone's concerns around that, with the hope that this
would help potential non-standard extensions to still be checked by the
same folks that did the rest of the memory model.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 22:58           ` Torvald Riegel
@ 2014-02-07  4:06             ` Paul E. McKenney
  2014-02-07  9:13               ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-07  4:06 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Thu, Feb 06, 2014 at 11:58:22PM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote:
> > On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote:
> > > On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote:
> > > > On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote:
> > > > > On 02/06/14 18:25, David Howells wrote:
> > > > > >
> > > > > > Is it worth considering a move towards using C11 atomics and barriers and
> > > > > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> > > > > > these.
> > > > > 
> > > > > 
> > > > > It sounds interesting to me, if we can make it work properly and 
> > > > > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in.
> > > > 
> > > > Given my (albeit limited) experience playing with the C11 spec and GCC, I
> > > > really think this is a bad idea for the kernel.
> > > 
> > > I'm not going to comment on what's best for the kernel (simply because I
> > > don't work on it), but I disagree with several of your statements.
> > > 
> > > > It seems that nobody really
> > > > agrees on exactly how the C11 atomics map to real architectural
> > > > instructions on anything but the trivial architectures.
> > > 
> > > There's certainly different ways to implement the memory model and those
> > > have to be specified elsewhere, but I don't see how this differs much
> > > from other things specified in the ABI(s) for each architecture.
> > > 
> > > > For example, should
> > > > the following code fire the assert?
> > > 
> > > I don't see how your example (which is about what the language requires
> > > or not) relates to the statement about the mapping above?
> > > 
> > > > 
> > > > extern atomic<int> foo, bar, baz;
> > > > 
> > > > void thread1(void)
> > > > {
> > > > 	foo.store(42, memory_order_relaxed);
> > > > 	bar.fetch_add(1, memory_order_seq_cst);
> > > > 	baz.store(42, memory_order_relaxed);
> > > > }
> > > > 
> > > > void thread2(void)
> > > > {
> > > > 	while (baz.load(memory_order_seq_cst) != 42) {
> > > > 		/* do nothing */
> > > > 	}
> > > > 
> > > > 	assert(foo.load(memory_order_seq_cst) == 42);
> > > > }
> > > > 
> > > 
> > > It's a good example.  My first gut feeling was that the assertion should
> > > never fire, but that was wrong because (as I seem to usually forget) the
> > > seq-cst total order is just a constraint but doesn't itself contribute
> > > to synchronizes-with -- but this is different for seq-cst fences.
> > 
> > From what I can see, Will's point is that mapping the Linux kernel's
> > atomic_add_return() primitive into fetch_add() does not work because
> > atomic_add_return()'s ordering properties require that the assert()
> > never fire.
> > 
> > Augmenting the fetch_add() with a seq_cst fence would work on many
> > architectures, but not for all similar examples.  The reason is that
> > the C11 seq_cst fence is deliberately weak compared to ARM's dmb or
> > Power's sync.  To your point, I believe that it would make the above
> > example work, but there are some IRIW-like examples that would fail
> > according to the standard (though a number of specific implementations
> > would in fact work correctly).
> 
> Thanks for the background.  I don't read LKML, and it wasn't obvious
> from reading just the part of the thread posted to gcc@ that achieving
> these semantics is the goal.

Lots of history on both sides of this one, I am afraid!  ;-)

> > > > To answer that question, you need to go and look at the definitions of
> > > > synchronises-with, happens-before, dependency_ordered_before and a whole
> > > > pile of vaguely written waffle to realise that you don't know.
> > > 
> > > Are you familiar with the formalization of the C11/C++11 model by Batty
> > > et al.?
> > > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > 
> > > They also have a nice tool that can run condensed examples and show you
> > > all allowed (and forbidden) executions (it runs in the browser, so is
> > > slow for larger examples), including nice annotated graphs for those:
> > > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> > > 
> > > It requires somewhat special syntax, but the following, which should be
> > > equivalent to your example above, runs just fine:
> > > 
> > > int main() {
> > >   atomic_int foo = 0; 
> > >   atomic_int bar = 0; 
> > >   atomic_int baz = 0; 
> > >   {{{ {
> > >         foo.store(42, memory_order_relaxed);
> > >         bar.store(1, memory_order_seq_cst);
> > >         baz.store(42, memory_order_relaxed);
> > >       }
> > >   ||| {
> > >         r1=baz.load(memory_order_seq_cst).readsvalue(42);
> > >         r2=foo.load(memory_order_seq_cst).readsvalue(0);
> > >       }
> > >   }}};
> > >   return 0; }
> > > 
> > > That yields 3 consistent executions for me, and likewise if the last
> > > readsvalue() is using 42 as argument.
> > > 
> > > If you add a "fence(memory_order_seq_cst);" after the store to foo, the
> > > program can't observe != 42 for foo anymore, because the seq-cst fence
> > > is adding a synchronizes-with edge via the baz reads-from.
> > > 
> > > I think this is a really neat tool, and very helpful to answer such
> > > questions as in your example.
> > 
> > Hmmm...  The tool doesn't seem to like fetch_add().  But let's assume that
> > your substitution of store() for fetch_add() is correct.  Then this shows
> > that we cannot substitute fetch_add() for atomic_add_return().
> 
> It should be in this example, I believe.

You lost me on this one.

> The tool also supports CAS:
> cas_strong_explicit(bar,0,1,memory_order_seq_cst,memory_order_seq_cst);
> cas_strong(bar,0,1);
> cas_weak likewise...
> 
> With that change, I get a few more executions but still the 3 consistent
> ones for either outcome.

Good point, thank you for the tip!

> > > > It's not hard
> > > > to find well-known memory-ordering experts shouting "Just use
> > > > memory_model_seq_cst for everything, it's too hard otherwise".
> > > 
> > > Everyone I've heard saying this meant this as advice to people new to
> > > synchronization or just dealing infrequently with it.  The advice is the
> > > simple and safe fallback, and I don't think it's meant as an
> > > acknowledgment that the model itself would be too hard.  If the
> > > language's memory model is supposed to represent weak HW memory models
> > > to at least some extent, there's only so much you can do in terms of
> > > keeping it simple.  If all architectures had x86-like models, the
> > > language's model would certainly be simpler... :)
> > 
> > That is said a lot, but there was a recent Linux-kernel example that
> > turned out to be quite hard to prove for x86.  ;-)
> > 
> > > > Then there's
> > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > > > atm and optimises all of the data dependencies away)
> > > 
> > > AFAIK consume memory order was added to model Power/ARM-specific
> > > behavior.  I agree that the way the standard specifies how dependencies
> > > are to be preserved is kind of vague (as far as I understand it).  See
> > > GCC PR 59448.
> > 
> > This one? http://gcc.gnu.org/ml/gcc-bugs/2013-12/msg01083.html
> 
> Yes, this bug, and I'd like to get feedback on the implementability of
> this: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448#c10

Agreed that the load and store need to be atomic.  From a Linux-kernel
perspective, perhaps some day ACCESS_ONCE() can be a volatile atomic
operation.  This will require a bit of churn because things like:

	ACCESS_ONCE(x)++;

have no obvious C11 counterpart that I know of, at least not one that can
be reasonably implemented within the confines of a C-preprocessor macro.
But this can probably be overcome at some point.  (And yes, the usage
of ACCESS_ONCE() has grown way beyond what I originally had in mind, but
such is life.)
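
For reference, the kernel's current ACCESS_ONCE() is just a volatile
cast, which is what makes the ++ usage work syntactically; a C11-style
replacement would need an explicit load/modify/store.  A sketch (the
ATOMIC_ONCE_INC name is made up for illustration):

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

/*
 * Note that, like the volatile version, this is not an atomic
 * read-modify-write, and (as noted earlier in the thread) relaxed
 * atomics may still legally be coalesced, which is why ACCESS_ONCE()
 * keeps volatile semantics for now.
 */
#define ATOMIC_ONCE_INC(x)						\
	__atomic_store_n(&(x),						\
			 __atomic_load_n(&(x), __ATOMIC_RELAXED) + 1,	\
			 __ATOMIC_RELAXED)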

> > That does indeed look to match what Will was calling out as a problem.
> > 
> > > > as well as the definition
> > > > of "data races", which seem to be used as an excuse to miscompile a program
> > > > at the earliest opportunity.
> > > 
> > > No.  The purpose of this is to *not disallow* every optimization on
> > > non-synchronizing code.  Due to the assumption of data-race-free
> > > programs, the compiler can assume a sequential code sequence when no
> > > atomics are involved (and thus, keep applying optimizations for
> > > sequential code).
> > > 
> > > Or is there something particular that you dislike about the
> > > specification of data races?
> > 
> > Cut Will a break, Torvald!  ;-)
> 
> Sorry if my comment came across too harsh.  I've observed a couple of
> times that people forget about the compiler's role in this, and it might
> not be obvious, so I wanted to point out just that.

As far as I know, no need to apologize.

The problem we are having is that Linux-kernel hackers need to make their
code work with whatever is available from the compiler.  The compiler is
written to a standard that, even with C11, is insufficient for the needs
of kernel hackers: MMIO, subtle differences in memory-ordering semantics,
memory fences that are too weak to implement smp_mb() in the general case,
and so on.  So kernel hackers need to use non-standard extensions,
inline assembly, and sometimes even compiler implementation details.

So it is only natural to expect the occasional bout of frustration on
the part of the kernel hacker.  For the compiler writer's part, I am
sure that having to deal with odd cases that are outside of the standard
is also sometimes frustrating.  Almost like both the kernel hackers and
the compiler writers were living in the real world or something.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 23:44             ` Torvald Riegel
@ 2014-02-07  4:20               ` Paul E. McKenney
  2014-02-07  7:44                 ` Peter Zijlstra
  2014-02-10  0:06                 ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-07  4:20 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote:
> > > > > There are also so many ways to blow your head off it's untrue. For example,
> > > > > cmpxchg takes a separate memory model parameter for failure and success, but
> > > > > then there are restrictions on the sets you can use for each. It's not hard
> > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's
> > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > > > > atm and optimises all of the data dependencies away) as well as the definition
> > > > > of "data races", which seem to be used as an excuse to miscompile a program
> > > > > at the earliest opportunity.
> > > > 
> > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > memory_order_consume until the compilers implement it both correctly and
> > > > efficiently.  They are not there yet, and there is currently no shortage
> > > > of compiler writers who would prefer to ignore memory_order_consume.
> > > 
> > > Do you have any input on
> > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > > language standard's definition of dependencies?
> > 
> > Let's see...  1.10p9 says that a dependency must be carried unless:
> > 
> > — B is an invocation of any specialization of std::kill_dependency (29.3), or
> > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator,
> > or
> > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > — A is the left operand of the built-in comma (,) operator (5.18);
> > 
> > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > after the "?" will carry a dependency, so the code fragment in 59448
> > needs to do the ordering rather than just optimizing "flag - flag" out
> > of existence.  One way to do that on both ARM and Power is to actually
> > emit code for "flag - flag", but there are a number of other ways to
> > make that work.
> 
> And that's what would concern me, considering that these requirements
> seem to be able to creep out easily.  Also, whereas the other atomics
> just constrain compilers wrt. reordering across atomic accesses or
> changes to the atomic accesses themselves, the dependencies are new
> requirements on pieces of otherwise non-synchronizing code.  The latter
> seems far more involved to me.

Well, the wording of 1.10p9 is pretty explicit on this point.
There are only a few exceptions to the rule that dependencies from
memory_order_consume loads must be tracked.  And to your point about
requirements being placed on pieces of otherwise non-synchronizing code,
we already have that with plain old load acquire and store release --
both of these put ordering constraints that affect the surrounding
non-synchronizing code.

This issue got a lot of discussion, and the compromise is that
dependencies cannot leak into or out of functions unless the relevant
parameters or return values are annotated with [[carries_dependency]].
This means that the compiler can see all the places where dependencies
must be tracked.  This is described in 7.6.4.  If a dependency chain
headed by a memory_order_consume load goes into or out of a function
without the aid of the [[carries_dependency]] attribute, the compiler
needs to do something else to enforce ordering, e.g., emit a memory
barrier.

From a Linux-kernel viewpoint, this is a bit ugly, as it requires
annotations and use of kill_dependency, but it was the best I could do
at the time.  If things go as they usually do, there will be some other
reason why those are needed...
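
A sketch of how that looks at the source level (illustrative C++ only;
the kernel obviously cannot use attributes like this today):

#include <atomic>

struct node { int payload; };
std::atomic<node *> head;

// The attribute tells the compiler that the dependency from the
// consume load is carried out through the return value, so no
// fallback barrier is needed at the call boundary (7.6.4).
[[carries_dependency]] node *get(std::atomic<node *> &h)
{
	return h.load(std::memory_order_consume);
}

int use()
{
	node *n = get(head);
	return n->payload;	// dependent access, ordered via consume
}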

> > BTW, there is some discussion on 1.10p9's handling of && and ||, and
> > that clause is likely to change.  And yes, I am behind on analyzing
> > usage in the Linux kernel to find out if Linux cares...
> 
> Do you have any pointers to these discussions (e.g., LWG issues)?

Nope, just a bare email thread.  I would guess that it will come up
next week.

The question is whether dependencies should be carried through && or ||
at all, and if so how.  My current guess is that && and || should not
carry dependencies.

> > > > And rcu_dereference() will need per-arch overrides for some time during
> > > > any transition to memory_order_consume.
> > > > 
> > > > > Trying to introduce system concepts (writes to devices, interrupts,
> > > > > non-coherent agents) into this mess is going to be an uphill battle IMHO. I'd
> > > > > just rather stick to the semantics we have and the asm volatile barriers.
> > > > 
> > > > And barrier() isn't going to go away any time soon, either.  And
> > > > ACCESS_ONCE() needs to keep volatile semantics until there is some
> > > > memory_order_whatever that prevents loads and stores from being coalesced.
> > > 
> > > I'd be happy to discuss something like this in ISO C++ SG1 (or has this
> > > been discussed in the past already?).  But it needs to have a paper I
> > > suppose.
> > 
> > The current position of the usual suspects other than me is that this
> > falls into the category of forward-progress guarantees, which are
> > considered (again, by the usual suspects other than me) to be out
> > of scope.
> 
> But I think we need to better describe forward progress, even though
> that might be tricky.  We made at least some progress on
> http://cplusplus.github.io/LWG/lwg-active.html#2159 in Chicago, even
> though we can't constrain the OS schedulers too much, and for lock-free
> we're in this weird position that on most general-purpose schedulers and
> machines, obstruction-free algorithms are likely to work just fine like
> lock-free, most of the time, in practice...

Yep, there is a draft paper by Alistarh et al. making this point.
They could go quite a bit further.  With a reasonably short set of
additional constraints, you can get bounded execution times out of
locking as well.  They were not amused when I suggested this, Bjoern
Brandenburg's dissertation notwithstanding.  ;-)

> We also need to discuss forward progress guarantees for any
> parallelism/concurrency abstractions, I believe:
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3874.pdf
> 
> Hopefully we'll get some more acceptance of this being in scope...

That would be good.  Just in case C11 is to be applicable to real-time
software.

> > > Will you be in Issaquah for the C++ meeting next week?
> > 
> > Weather permitting, I will be there!
> 
> Great, maybe we can find some time in SG1 to discuss this then.  Even if
> the standard doesn't want to include it, SG1 should be a good forum to
> understand everyone's concerns around that, with the hope that this
> would help potential non-standard extensions to still be checked by the
> same folks that did the rest of the memory model.

Sounds good!  Hopefully some discussion of out-of-thin-air values as well.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07  4:20               ` Paul E. McKenney
@ 2014-02-07  7:44                 ` Peter Zijlstra
  2014-02-07 16:50                   ` Paul E. McKenney
  2014-02-10  0:06                 ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-07  7:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Torvald Riegel, Will Deacon, Ramana Radhakrishnan, David Howells,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> Hopefully some discussion of out-of-thin-air values as well.

Yes, absolutely shoot store speculation in the head already. Then drive
a wooden stake through its heart.

C11/C++11 should not be allowed to claim itself a memory model until that
is sorted.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07  4:06             ` Paul E. McKenney
@ 2014-02-07  9:13               ` Torvald Riegel
  2014-02-07 16:44                 ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-07  9:13 UTC (permalink / raw)
  To: paulmck
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Thu, 2014-02-06 at 20:06 -0800, Paul E. McKenney wrote:
> On Thu, Feb 06, 2014 at 11:58:22PM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote:
> > > > On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote:
> > > > > To answer that question, you need to go and look at the definitions of
> > > > > synchronises-with, happens-before, dependency_ordered_before and a whole
> > > > > pile of vaguely written waffle to realise that you don't know.
> > > > 
> > > > Are you familiar with the formalization of the C11/C++11 model by Batty
> > > > et al.?
> > > > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > > 
> > > > They also have a nice tool that can run condensed examples and show you
> > > > all allowed (and forbidden) executions (it runs in the browser, so is
> > > > slow for larger examples), including nice annotated graphs for those:
> > > > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> > > > 
> > > > It requires somewhat special syntax, but the following, which should be
> > > > equivalent to your example above, runs just fine:
> > > > 
> > > > int main() {
> > > >   atomic_int foo = 0; 
> > > >   atomic_int bar = 0; 
> > > >   atomic_int baz = 0; 
> > > >   {{{ {
> > > >         foo.store(42, memory_order_relaxed);
> > > >         bar.store(1, memory_order_seq_cst);
> > > >         baz.store(42, memory_order_relaxed);
> > > >       }
> > > >   ||| {
> > > >         r1=baz.load(memory_order_seq_cst).readsvalue(42);
> > > >         r2=foo.load(memory_order_seq_cst).readsvalue(0);
> > > >       }
> > > >   }}};
> > > >   return 0; }
> > > > 
> > > > That yields 3 consistent executions for me, and likewise if the last
> > > > readsvalue() is using 42 as argument.
> > > > 
> > > > If you add a "fence(memory_order_seq_cst);" after the store to foo, the
> > > > program can't observe != 42 for foo anymore, because the seq-cst fence
> > > > is adding a synchronizes-with edge via the baz reads-from.
> > > > 
> > > > I think this is a really neat tool, and very helpful to answer such
> > > > questions as in your example.
> > > 
> > > Hmmm...  The tool doesn't seem to like fetch_add().  But let's assume that
> > > your substitution of store() for fetch_add() is correct.  Then this shows
> > > that we cannot substitute fetch_add() for atomic_add_return().
> > 
> > It should be in this example, I believe.
> 
> You lost me on this one.

I mean that in this example, substituting fetch_add() with store()
should not change meaning, given that what the fetch_add reads-from
seems irrelevant.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 3/5] arch: s/smp_mb__(before|after)_(atomic|clear)_(dec,inc,bit)/smp_mb__\1/g
  2014-02-06 19:12   ` Paul E. McKenney
@ 2014-02-07  9:52     ` Will Deacon
  0 siblings, 0 replies; 299+ messages in thread
From: Will Deacon @ 2014-02-07  9:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo

On Thu, Feb 06, 2014 at 07:12:04PM +0000, Paul E. McKenney wrote:
> On Thu, Feb 06, 2014 at 02:48:28PM +0100, Peter Zijlstra wrote:
> > Because atomic ops are implemented the same across an architecture,
> > the current incomplete set of extra barriers:
> >
> >   smp_mb__before_atomic_inc()
> >   smp_mb__after_atomic_inc()
> >   smp_mb__before_atomic_dec()
> >   smp_mb__after_atomic_dec()
> >   smp_mb__before_clear_bit()
> >   smp_mb__after_clear_bit()
> >
> > is both incomplete and superfluous.
> >
> > It is incomplete because there are far more atomic operations that do
> > not return values -- such as atomic_add(), set_bit() etc. And it is
> > superfluous because they're all the same anyway.
> >
> > Simplify things by reducing the triplicate set into a single set of
> > barriers that is valid for all void atomic ops:
> >
> >   smp_mb__before_atomic()
> >   smp_mb__after_atomic()
> >
> > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> 
> I very much like the API shrinkage.  The RCU changes are good, and
> I believe that the rest is OK too, though my eyes were going a bit
> buggy towards the end...
> 
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

... and the arm[64] parts look fine to me.

  Acked-by: Will Deacon <will.deacon@arm.com>

Usual comment about upcoming conflicts :)

Will

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 21:09       ` Torvald Riegel
  2014-02-06 21:55         ` Paul E. McKenney
  2014-02-06 22:13         ` Joseph S. Myers
@ 2014-02-07 12:01         ` Will Deacon
  2014-02-07 16:47           ` Paul E. McKenney
  2 siblings, 1 reply; 299+ messages in thread
From: Will Deacon @ 2014-02-07 12:01 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch,
	linux-kernel, torvalds, akpm, mingo, paulmck, gcc

Hello Torvald,

It looks like Paul clarified most of the points I was trying to make
(thanks Paul!), so I won't go back over them here.

On Thu, Feb 06, 2014 at 09:09:25PM +0000, Torvald Riegel wrote:
> Are you familiar with the formalization of the C11/C++11 model by Batty
> et al.?
> http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> 
> They also have a nice tool that can run condensed examples and show you
> all allowed (and forbidden) executions (it runs in the browser, so is
> slow for larger examples), including nice annotated graphs for those:
> http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

Thanks for the link, that's incredibly helpful. I've used ppcmem and armmem
in the past, but I didn't realise they have a version for C++11 too.
Actually, the armmem backend doesn't implement our atomic instructions or
the acquire/release accessors, so it's not been as useful as it could be.
I should probably try to learn OCaml...

> IMHO, one thing worth considering is that for C/C++, C11/C++11 is the
> only memory model that has widespread support.  So, even though it's
> a fairly weak memory model (unless you go for the "only seq-cst"
> beginners' advice) and thus comes with a higher complexity, this model is
> what likely most people will be familiar with over time.  Deviating from
> the "standard" model can have valid reasons, but it also has a cost in
> that new contributors are more likely to be familiar with the "standard"
> model.

Indeed, I wasn't trying to write off the C11 memory model as something we
can never use in the kernel. I just don't think the current situation is
anywhere close to usable for a project such as Linux. If a greater
understanding of the memory model does eventually manifest amongst C/C++
developers (by which I mean, the beginners' advice is really treated as
such and there is a widespread intuition about ordering guarantees, as
opposed to the need to use formal tools), then surely the tools and libraries
will stabilise and provide uniform semantics across the 25+ architectures
that Linux currently supports. If *that* happens, this discussion is certainly
worth having again.

Will


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07  9:13               ` Torvald Riegel
@ 2014-02-07 16:44                 ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-07 16:44 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Fri, Feb 07, 2014 at 10:13:40AM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-06 at 20:06 -0800, Paul E. McKenney wrote:
> > On Thu, Feb 06, 2014 at 11:58:22PM +0100, Torvald Riegel wrote:
> > > On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote:
> > > > On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote:
> > > > > On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote:
> > > > > > To answer that question, you need to go and look at the definitions of
> > > > > > synchronises-with, happens-before, dependency_ordered_before and a whole
> > > > > > pile of vaguely written waffle to realise that you don't know.
> > > > > 
> > > > > Are you familiar with the formalization of the C11/C++11 model by Batty
> > > > > et al.?
> > > > > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > > > 
> > > > > They also have a nice tool that can run condensed examples and show you
> > > > > all allowed (and forbidden) executions (it runs in the browser, so is
> > > > > slow for larger examples), including nice annotated graphs for those:
> > > > > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> > > > > 
> > > > > It requires somewhat special syntax, but the following, which should be
> > > > > equivalent to your example above, runs just fine:
> > > > > 
> > > > > int main() {
> > > > >   atomic_int foo = 0; 
> > > > >   atomic_int bar = 0; 
> > > > >   atomic_int baz = 0; 
> > > > >   {{{ {
> > > > >         foo.store(42, memory_order_relaxed);
> > > > >         bar.store(1, memory_order_seq_cst);
> > > > >         baz.store(42, memory_order_relaxed);
> > > > >       }
> > > > >   ||| {
> > > > >         r1=baz.load(memory_order_seq_cst).readsvalue(42);
> > > > >         r2=foo.load(memory_order_seq_cst).readsvalue(0);
> > > > >       }
> > > > >   }}};
> > > > >   return 0; }
> > > > > 
> > > > > That yields 3 consistent executions for me, and likewise if the last
> > > > > readsvalue() uses 42 as its argument.
> > > > > 
> > > > > If you add a "fence(memory_order_seq_cst);" after the store to foo, the
> > > > > program can't observe != 42 for foo anymore, because the seq-cst fence
> > > > > is adding a synchronizes-with edge via the baz reads-from.
> > > > > 
> > > > > I think this is a really neat tool, and very helpful to answer such
> > > > > questions as in your example.
> > > > 
> > > > Hmmm...  The tool doesn't seem to like fetch_add().  But let's assume that
> > > > your substitution of store() for fetch_add() is correct.  Then this shows
> > > > that we cannot substitute fetch_add() for atomic_add_return().
> > > 
> > > It should be in this example, I believe.
> > 
> > You lost me on this one.
> 
> I mean that in this example, substituting fetch_add() with store()
> should not change meaning, given that what the fetch_add reads-from
> seems irrelevant.

Got it.  Agreed, though your other suggestion of substituting CAS is
more convincing.  ;-)

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 12:01         ` Will Deacon
@ 2014-02-07 16:47           ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-07 16:47 UTC (permalink / raw)
  To: Will Deacon
  Cc: Torvald Riegel, Ramana Radhakrishnan, David Howells,
	Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo,
	gcc

On Fri, Feb 07, 2014 at 12:01:25PM +0000, Will Deacon wrote:
> Hello Torvald,
> 
> It looks like Paul clarified most of the points I was trying to make
> (thanks Paul!), so I won't go back over them here.
> 
> On Thu, Feb 06, 2014 at 09:09:25PM +0000, Torvald Riegel wrote:
> > Are you familiar with the formalization of the C11/C++11 model by Batty
> > et al.?
> > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > 
> > They also have a nice tool that can run condensed examples and show you
> > all allowed (and forbidden) executions (it runs in the browser, so is
> > slow for larger examples), including nice annotated graphs for those:
> > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> 
> Thanks for the link, that's incredibly helpful. I've used ppcmem and armmem
> in the past, but I didn't realise they have a version for C++11 too.
> Actually, the armmem backend doesn't implement our atomic instructions or
> the acquire/release accessors, so it's not been as useful as it could be.
> I should probably try to learn OCaml...

That would be very cool!

Another option would be to recruit a grad student to take on that project
for Peter Sewell.  He might already have one, for all I know.

> > IMHO, one thing worth considering is that for C/C++, C11/C++11 is the
> > only memory model that has widespread support.  So, even though it's
> > a fairly weak memory model (unless you go for the "only seq-cst"
> > beginners' advice) and thus comes with a higher complexity, this model is
> > what likely most people will be familiar with over time.  Deviating from
> > the "standard" model can have valid reasons, but it also has a cost in
> > that new contributors are more likely to be familiar with the "standard"
> > model.
> 
> Indeed, I wasn't trying to write off the C11 memory model as something we
> can never use in the kernel. I just don't think the current situation is
> anywhere close to usable for a project such as Linux. If a greater
> understanding of the memory model does eventually manifest amongst C/C++
> developers (by which I mean, the beginners' advice is really treated as
> such and there is a widespread intuition about ordering guarantees, as
> opposed to the need to use formal tools), then surely the tools and libraries
> will stabilise and provide uniform semantics across the 25+ architectures
> that Linux currently supports. If *that* happens, this discussion is certainly
> worth having again.

And it is likely to be worthwhile even before then on a piecemeal
basis, where architecture maintainers pick and choose which primitives
stay in inline assembly and which the compiler can deal with properly.
For example, I bet that atomic_inc() can be implemented just fine by C11
in the very near future.  However, atomic_add_return() is another story.
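
As a sketch of that piecemeal approach, assuming the kernel's atomic_t
were backed by a C11 atomic_int (which no architecture does today):

	static inline void atomic_inc(atomic_int *v)
	{
		/* void RMW: no ordering required, so relaxed suffices */
		atomic_fetch_add_explicit(v, 1, memory_order_relaxed);
	}

atomic_add_return(), by contrast, must imply a full memory barrier before
and after the operation, which no single C11 memory_order maps onto
exactly.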

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07  7:44                 ` Peter Zijlstra
@ 2014-02-07 16:50                   ` Paul E. McKenney
  2014-02-07 16:55                     ` Will Deacon
  2014-02-07 18:44                     ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-07 16:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Torvald Riegel, Will Deacon, Ramana Radhakrishnan, David Howells,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > Hopefully some discussion of out-of-thin-air values as well.
> 
> Yes, absolutely shoot store speculation in the head already. Then drive
> a wooden stake through its heart.
> 
> C11/C++11 should not be allowed to claim itself a memory model until that
> is sorted.

There actually is a proposal being put forward, but it might not make ARM
and Power people happy because it involves adding a compare, a branch,
and an ISB/isync after every relaxed load...  Me, I agree with you,
much preferring the no-store-speculation approach.

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 16:50                   ` Paul E. McKenney
@ 2014-02-07 16:55                     ` Will Deacon
  2014-02-07 17:06                       ` Peter Zijlstra
  2014-02-07 18:02                       ` Paul E. McKenney
  2014-02-07 18:44                     ` Torvald Riegel
  1 sibling, 2 replies; 299+ messages in thread
From: Will Deacon @ 2014-02-07 16:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Torvald Riegel, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo,
	gcc

Hi Paul,

On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > Hopefully some discussion of out-of-thin-air values as well.
> > 
> > Yes, absolutely shoot store speculation in the head already. Then drive
> > a wooden stake through its heart.
> > 
> > C11/C++11 should not be allowed to claim itself a memory model until that
> > is sorted.
> 
> There actually is a proposal being put forward, but it might not make ARM
> and Power people happy because it involves adding a compare, a branch,
> and an ISB/isync after every relaxed load...  Me, I agree with you,
> much preferring the no-store-speculation approach.

Can you elaborate a bit on this please? We don't permit speculative stores
in the ARM architecture, so it seems counter-intuitive that GCC needs to
emit any additional instructions to prevent that from happening.

Stores can, of course, be observed out-of-order but that's a lot more
reasonable :)

Will


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 16:55                     ` Will Deacon
@ 2014-02-07 17:06                       ` Peter Zijlstra
  2014-02-07 17:13                         ` Will Deacon
                                           ` (2 more replies)
  2014-02-07 18:02                       ` Paul E. McKenney
  1 sibling, 3 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-07 17:06 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, Torvald Riegel, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo,
	gcc

On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote:
> Hi Paul,
> 
> On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > Hopefully some discussion of out-of-thin-air values as well.
> > > 
> > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > a wooden stake through its heart.
> > > 
> > > C11/C++11 should not be allowed to claim itself a memory model until that
> > > is sorted.
> > 
> > There actually is a proposal being put forward, but it might not make ARM
> > and Power people happy because it involves adding a compare, a branch,
> > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > much preferring the no-store-speculation approach.
> 
> Can you elaborate a bit on this please? We don't permit speculative stores
> in the ARM architecture, so it seems counter-intuitive that GCC needs to
> emit any additional instructions to prevent that from happening.
> 
> Stores can, of course, be observed out-of-order but that's a lot more
> reasonable :)

This is more about the compiler speculating on stores; imagine:

  if (x)
  	y = 1;
  else
  	y = 2;

The compiler is allowed to change that into:

  y = 2;
  if (x)
  	y = 1;

Which is of course a big problem when you want to rely on the ordering.

There are further problems where things like memset() can write outside
the specified address range. Examples are memset() using single
instructions to wipe entire cachelines and then 'restoring' the tail
bytes.

While valid for single-threaded code, it's a complete disaster for
concurrent code.

There's more, but it all boils down to doing stores that you don't
expect in a 'sane' concurrent environment and/or that don't respect the
control flow.
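
A sketch of how such a transformation becomes visible to concurrent code
(ACCESS_ONCE() in the kernel idiom of the time; assume x is nonzero and
y starts at 0):

	/* another CPU, concurrently with the transformed code above */
	while (ACCESS_ONCE(y) == 0)
		cpu_relax();
	r = ACCESS_ONCE(y);
	/*
	 * With the original code, r can only be 1 here.  With the
	 * speculated store, r == 2 can be observed transiently.
	 */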




* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 17:06                       ` Peter Zijlstra
@ 2014-02-07 17:13                         ` Will Deacon
  2014-02-07 17:20                           ` Peter Zijlstra
  2014-02-07 18:03                           ` Paul E. McKenney
  2014-02-07 17:46                         ` Joseph S. Myers
  2014-02-07 18:43                         ` Torvald Riegel
  2 siblings, 2 replies; 299+ messages in thread
From: Will Deacon @ 2014-02-07 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Torvald Riegel, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo,
	gcc

On Fri, Feb 07, 2014 at 05:06:54PM +0000, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote:
> > Hi Paul,
> > 
> > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > 
> > > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > > a wooden stake through its heart.
> > > > 
> > > > C11/C++11 should not be allowed to claim itself a memory model until that
> > > > is sorted.
> > > 
> > > There actually is a proposal being put forward, but it might not make ARM
> > > and Power people happy because it involves adding a compare, a branch,
> > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > much preferring the no-store-speculation approach.
> > 
> > Can you elaborate a bit on this please? We don't permit speculative stores
> > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > emit any additional instructions to prevent that from happening.
> > 
> > Stores can, of course, be observed out-of-order but that's a lot more
> > reasonable :)
> 
> This is more about the compiler speculating on stores; imagine:
> 
>   if (x)
>   	y = 1;
>   else
>   	y = 2;
> 
> The compiler is allowed to change that into:
> 
>   y = 2;
>   if (x)
>   	y = 1;
> 
> Which is of course a big problem when you want to rely on the ordering.

Understood, but that doesn't explain why Paul wants to add ISB/isync
instructions which affect the *CPU* rather than the compiler!

Will


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 17:13                         ` Will Deacon
@ 2014-02-07 17:20                           ` Peter Zijlstra
  2014-02-07 18:03                           ` Paul E. McKenney
  1 sibling, 0 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-07 17:20 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, Torvald Riegel, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo,
	gcc

On Fri, Feb 07, 2014 at 05:13:36PM +0000, Will Deacon wrote:
> Understood, but that doesn't explain why Paul wants to add ISB/isync
> instructions which affect the *CPU* rather than the compiler!

I doubt Paul wants it, but yeah, I'm curious about that proposal as
well; it sounds like someone took a big toke from the bong again, which
seems a favourite pastime amongst committees.


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 17:06                       ` Peter Zijlstra
  2014-02-07 17:13                         ` Will Deacon
@ 2014-02-07 17:46                         ` Joseph S. Myers
  2014-02-07 18:43                         ` Torvald Riegel
  2 siblings, 0 replies; 299+ messages in thread
From: Joseph S. Myers @ 2014-02-07 17:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Paul E. McKenney, Torvald Riegel,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	torvalds, akpm, mingo, gcc

On Fri, 7 Feb 2014, Peter Zijlstra wrote:

> There's further problems where things like memset() can write outside
> the specified address range. Examples are memset() using single
> instructions to wipe entire cachelines and then 'restoring' the tail
> bit.

If memset (or any C library function) modifies bytes it's not permitted to 
modify in the abstract machine, that's a simple bug and should be reported 
as usual.  We've made GCC follow that part of the memory model by default 
(so a store to a non-bit-field structure field doesn't do a 
read-modify-write to a word containing another field, for example) and I 
think it's pretty obvious that glibc should do so as well.

(Of course, memset is not an atomic operation, and you need to allow for 
that if you use it on an _Atomic object - which is I think valid, unless 
the object is also volatile, but perhaps ill-advised.)
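
A sketch of the field-granularity rule described above (hypothetical
struct; each field is written by a different thread):

	struct s {
		char a;		/* written only by thread 1 */
		char b;		/* written only by thread 2 */
	};

	void t1(struct s *p)
	{
		/*
		 * Must be a byte store: widening it to a word-sized
		 * read-modify-write would invent a write to p->b, a data
		 * race the abstract machine does not contain.
		 */
		p->a = 1;
	}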

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 16:55                     ` Will Deacon
  2014-02-07 17:06                       ` Peter Zijlstra
@ 2014-02-07 18:02                       ` Paul E. McKenney
  2014-02-10  0:27                         ` Torvald Riegel
  2014-02-10 11:48                         ` Peter Zijlstra
  1 sibling, 2 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-07 18:02 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Torvald Riegel, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo,
	gcc

On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote:
> Hi Paul,
> 
> On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > Hopefully some discussion of out-of-thin-air values as well.
> > > 
> > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > a wooden stake through its heart.
> > > 
> > > C11/C++11 should not be allowed to claim itself a memory model until that
> > > is sorted.
> > 
> > There actually is a proposal being put forward, but it might not make ARM
> > and Power people happy because it involves adding a compare, a branch,
> > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > much preferring the no-store-speculation approach.
> 
> Can you elaborate a bit on this please? We don't permit speculative stores
> in the ARM architecture, so it seems counter-intuitive that GCC needs to
> emit any additional instructions to prevent that from happening.

Requiring a compare/branch/ISB after each relaxed load enables a simple(r)
proof that out-of-thin-air values cannot be observed in the face of any
compiler optimization that refrains from reordering a prior relaxed load
with a subsequent relaxed store.
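
To make that concrete, a sketch of what such a per-load sequence might
look like on 32-bit ARM (hypothetical code generation for the proposal;
no compiler emits this today):

	static inline int relaxed_load_ctrl_isb(int *p)
	{
		int v;

		asm volatile(
			"ldr	%0, [%1]\n\t"	/* the relaxed load */
			"cmp	%0, %0\n\t"	/* compare on the loaded value */
			"beq	1f\n"		/* branch the CPU cannot resolve early */
			"1:\n\t"
			"isb"			/* ISB (isync on Power) */
			: "=&r" (v) : "r" (p) : "cc", "memory");
		return v;
	}

The load-to-branch dependency plus the ISB is what keeps any subsequent
store from becoming visible before the load's value is fixed.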

> Stores can, of course, be observed out-of-order but that's a lot more
> reasonable :)

So let me try an example.  I am sure that Torvald Riegel will jump in
with any needed corrections or amplifications:

Initial state: x == y == 0

T1:	r1 = atomic_load_explicit(&x, memory_order_relaxed);
	atomic_store_explicit(&y, r1, memory_order_relaxed);

T2:	r2 = atomic_load_explicit(&y, memory_order_relaxed);
	atomic_store_explicit(&x, r2, memory_order_relaxed);

One would intuitively expect r1 == r2 == 0 as the only possible outcome.
But suppose that the compiler used specialization optimizations, as it
would if there were a function that has a very lightweight implementation
for some values and a very heavyweight one for others.  In particular,
suppose that the lightweight implementation was for the value 42.
Then the compiler might do something like the following:

Initial state: x == y == 0

T1:	r1 = atomic_load_explicit(&x, memory_order_relaxed);
	if (r1 == 42)
		atomic_store_explicit(&y, 42, memory_order_relaxed);
	else
		atomic_store_explicit(&y, r1, memory_order_relaxed);

T2:	r2 = atomic_load_explicit(&y, memory_order_relaxed);
	atomic_store_explicit(&x, r2, memory_order_relaxed);

Suddenly we have an explicit constant 42 showing up.  Of course, if
the compiler carefully avoided speculative stores (as both Peter and
I believe that it should if its code generation is to be regarded as
anything other than an act of vandalism, the words in the standard
notwithstanding), there would be no problem.  But currently, a number
of compiler writers see absolutely nothing wrong with transforming
the optimized-for-42 version above into something like this:

Initial state: x == y == 0

T1:	r1 = atomic_load_explicit(&x, memory_order_relaxed);
	atomic_store_explicit(&y, 42, memory_order_relaxed);
	if (r1 != 42)
		atomic_store_explicit(&y, r1, memory_order_relaxed);

T2:	r2 = atomic_load_explicit(&y, memory_order_relaxed);
	atomic_store_explicit(&x, r2, memory_order_relaxed);

And then it is a short and uncontroversial step to the following:

Initial state: x == y == 0

T1:	atomic_store_explicit(&y, 42, memory_order_relaxed);
	r1 = atomic_load_explicit(&x, memory_order_relaxed);
	if (r1 != 42)
		atomic_store_explicit(&y, r1, memory_order_relaxed);

T2:	r2 = atomic_load_explicit(&y, memory_order_relaxed);
	atomic_store_explicit(&x, r2, memory_order_relaxed);

This can of course result in r1 == r2 == 42, even though the constant
42 never appeared in the original code.  This is one way to generate
an out-of-thin-air value.
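
For concreteness, one interleaving that produces that outcome under the
final transformation (a sketch):

	T1: atomic_store_explicit(&y, 42, memory_order_relaxed);  /* y == 42 */
	T2: r2 = atomic_load_explicit(&y, memory_order_relaxed);  /* r2 == 42 */
	T2: atomic_store_explicit(&x, r2, memory_order_relaxed);  /* x == 42 */
	T1: r1 = atomic_load_explicit(&x, memory_order_relaxed);  /* r1 == 42 */
	T1: /* (r1 != 42) is false, so the corrective store never runs */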

As near as I can tell, compiler writers hate the idea of prohibiting
speculative-store optimizations because it requires them to introduce
both control and data dependency tracking into their compilers.  Many of
them seem to hate dependency tracking with a purple passion.  At least,
such a hatred would go a long way towards explaining the incomplete
and high-overhead implementations of memory_order_consume, the long
and successful use of idioms based on the memory_order_consume pattern
notwithstanding [*].  ;-)

That said, the Java guys are talking about introducing something
vaguely resembling memory_order_consume (and thus resembling the
rcu_assign_pointer() and rcu_dereference() portions of RCU) to solve Java
out-of-thin-air issues involving initialization, so perhaps there is hope.

						Thanx, Paul

[*] http://queue.acm.org/detail.cfm?id=2488549
    http://www.rdrop.com/users/paulmck/RCU/rclockpdcsproof.pdf



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 17:13                         ` Will Deacon
  2014-02-07 17:20                           ` Peter Zijlstra
@ 2014-02-07 18:03                           ` Paul E. McKenney
  1 sibling, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-07 18:03 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Torvald Riegel, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo,
	gcc

On Fri, Feb 07, 2014 at 05:13:36PM +0000, Will Deacon wrote:
> On Fri, Feb 07, 2014 at 05:06:54PM +0000, Peter Zijlstra wrote:
> > On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote:
> > > Hi Paul,
> > > 
> > > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote:
> > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > > 
> > > > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > > > a wooden stake through its heart.
> > > > > 
> > > > > C11/C++11 should not be allowed to claim itself a memory model until that
> > > > > is sorted.
> > > > 
> > > > There actually is a proposal being put forward, but it might not make ARM
> > > > and Power people happy because it involves adding a compare, a branch,
> > > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > > much preferring the no-store-speculation approach.
> > > 
> > > Can you elaborate a bit on this please? We don't permit speculative stores
> > > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > > emit any additional instructions to prevent that from happening.
> > > 
> > > Stores can, of course, be observed out-of-order but that's a lot more
> > > reasonable :)
> > 
> > This is more about the compiler speculating on stores; imagine:
> > 
> >   if (x)
> >   	y = 1;
> >   else
> >   	y = 2;
> > 
> > The compiler is allowed to change that into:
> > 
> >   y = 2;
> >   if (x)
> >   	y = 1;
> > 
> > Which is of course a big problem when you want to rely on the ordering.
> 
> Understood, but that doesn't explain why Paul wants to add ISB/isync
> instructions which affect the *CPU* rather than the compiler!

Hey!!!  -I- don't want to add those instructions!  Others do.
Unfortunately, lots of others.

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 17:06                       ` Peter Zijlstra
  2014-02-07 17:13                         ` Will Deacon
  2014-02-07 17:46                         ` Joseph S. Myers
@ 2014-02-07 18:43                         ` Torvald Riegel
  2 siblings, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-07 18:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Paul E. McKenney, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo,
	gcc

On Fri, 2014-02-07 at 18:06 +0100, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote:
> > Hi Paul,
> > 
> > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > 
> > > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > > a wooden stake through its heart.
> > > > 
> > > > C11/C++11 should not be allowed to claim itself a memory model until that
> > > > is sorted.
> > > 
> > > There actually is a proposal being put forward, but it might not make ARM
> > > and Power people happy because it involves adding a compare, a branch,
> > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > much preferring the no-store-speculation approach.
> > 
> > Can you elaborate a bit on this please? We don't permit speculative stores
> > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > emit any additional instructions to prevent that from happening.
> > 
> > Stores can, of course, be observed out-of-order but that's a lot more
> > reasonable :)
> 
> This is more about the compiler speculating on stores; imagine:
> 
>   if (x)
>   	y = 1;
>   else
>   	y = 2;
> 
> The compiler is allowed to change that into:
> 
>   y = 2;
>   if (x)
>   	y = 1;

If you write the example like that, this is indeed allowed because it's
all sequential code (and there are no volatiles in there, at least you
didn't show them :).  A store to y would happen in either case.  You
cannot observe the difference between both examples in a data-race-free
program.

Are there supposed to be atomic/non-sequential accesses in there?  If
so, please update the example.

> Which is of course a big problem when you want to rely on the ordering.
> 
> There are further problems where things like memset() can write outside
> the specified address range. Examples are memset() using single
> instructions to wipe entire cachelines and then 'restoring' the tail
> bytes.

As Joseph said, this would be a bug IMO.

> While valid for single-threaded code, it's a complete disaster for
> concurrent code.
> 
> There's more, but it all boils down to doing stores that you don't
> expect in a 'sane' concurrent environment and/or that don't respect the
> control flow.

A few of those got fixed already, because they violated the memory
model's requirements.  If you have further examples that are valid code
in the C11/C++11 model, please report them.




* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 16:50                   ` Paul E. McKenney
  2014-02-07 16:55                     ` Will Deacon
@ 2014-02-07 18:44                     ` Torvald Riegel
  1 sibling, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-07 18:44 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Will Deacon, Ramana Radhakrishnan, David Howells,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Fri, 2014-02-07 at 08:50 -0800, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > Hopefully some discussion of out-of-thin-air values as well.
> > 
> > Yes, absolutely shoot store speculation in the head already. Then drive
> > a wooden stake through its heart.
> > 
> > C11/C++11 should not be allowed to claim itself a memory model until that
> > is sorted.
> 
> There actually is a proposal being put forward, but it might not make ARM
> and Power people happy because it involves adding a compare, a branch,
> and an ISB/isync after every relaxed load...  Me, I agree with you,
> much preferring the no-store-speculation approach.

My vague recollection is that everyone agrees that out-of-thin-air
values shouldn't be allowed, but that it's surprisingly complex to
actually specify this properly.

However, the example that Peter posted further down in the thread seems
to be unrelated to out-of-thin-air.




* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07  4:20               ` Paul E. McKenney
  2014-02-07  7:44                 ` Peter Zijlstra
@ 2014-02-10  0:06                 ` Torvald Riegel
  2014-02-10  3:51                   ` Paul E. McKenney
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-10  0:06 UTC (permalink / raw)
  To: paulmck
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote:
> > > > > > There are also so many ways to blow your head off it's untrue. For example,
> > > > > > cmpxchg takes a separate memory model parameter for failure and success, but
> > > > > > then there are restrictions on the sets you can use for each. It's not hard
> > > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > > memory_order_seq_cst for everything, it's too hard otherwise". Then there's
> > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > > > > > atm and optimises all of the data dependencies away) as well as the definition
> > > > > > of "data races", which seem to be used as an excuse to miscompile a program
> > > > > > at the earliest opportunity.
> > > > > 
> > > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > > memory_order_consume until the compilers implement it both correctly and
> > > > > efficiently.  They are not there yet, and there is currently no shortage
> > > > > of compiler writers who would prefer to ignore memory_order_consume.
> > > > 
> > > > Do you have any input on
> > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > > > language standard's definition of dependencies?
> > > 
> > > Let's see...  1.10p9 says that a dependency must be carried unless:
> > > 
> > > — B is an invocation of any specialization of std::kill_dependency (29.3), or
> > > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator,
> > > or
> > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > 
> > > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > > after the "?" will carry a dependency, so the code fragment in 59448
> > > needs to do the ordering rather than just optimizing "flag - flag" out
> > > of existence.  One way to do that on both ARM and Power is to actually
> > > emit code for "flag - flag", but there are a number of other ways to
> > > make that work.
> > 
> > And that's what would concern me, considering that these requirements
> > seem to be able to creep out easily.  Also, whereas the other atomics
> > just constrain compilers wrt. reordering across atomic accesses or
> > changes to the atomic accesses themselves, the dependencies are new
> > requirements on pieces of otherwise non-synchronizing code.  The latter
> > seems far more involved to me.
> 
> Well, the wording of 1.10p9 is pretty explicit on this point.
> There are only a few exceptions to the rule that dependencies from
> memory_order_consume loads must be tracked.  And to your point about
> requirements being placed on pieces of otherwise non-synchronizing code,
> we already have that with plain old load acquire and store release --
> both of these put ordering constraints that affect the surrounding
> non-synchronizing code.

I think there's a significant difference.  With acquire/release or more
general memory orders, it's true that we can't reorder _across_ the
atomic access.  However, we can reorder and optimize without additional
constraints as long as we do not cross the atomic access.  This is not
the case with consume memory order, as the (p + flag - flag) example
shows.
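
For readers without the PR at hand, a sketch of the pattern at issue
(names made up; the structure follows the 1.10p9 discussion above):

	/*
	 * Per 1.10p9, 'flag' as the condition of ?: carries no dependency,
	 * but 'flag - flag' in the selected arm does.  The compiler may
	 * therefore not simply fold the offset to 0 unless it enforces the
	 * ordering some other way (e.g., by emitting a fence).
	 */
	int flag = atomic_load_explicit(&gflag, memory_order_consume);
	int v = *(p + (flag ? flag - flag : 0));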

> This issue got a lot of discussion, and the compromise is that
> dependencies cannot leak into or out of functions unless the relevant
> parameters or return values are annotated with [[carries_dependency]].
> This means that the compiler can see all the places where dependencies
> must be tracked.  This is described in 7.6.4.

I wasn't aware of 7.6.4 (but 1.10 doesn't refer to it as the additional
constraint that it is, so I guess at least that should be fixed).
Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic
difference, at least if one assumes that normal optimization of
sequential code is the default, and that maintaining things such as
(flag - flag) is not; if optimizing away (flag - flag) would require the
insertion of fences unless the carries_dependency attribute is present,
then that would be bad, I think.

IMHO, the dependency constructs (carries_dependency, kill_dependency)
seem backwards to me.  They assume that the compiler preserves all
those dependencies, including (flag - flag), by default, which prohibits
meaningful optimizations.  Instead, I guess there should be a construct
for explicitly exploiting the dependencies (e.g., a
preserve_dependencies call, whose argument would not be fully optimized).
Code that exploits dependencies will be special and isn't the common
case, so it can require additional annotations.
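
To make the suggestion concrete, a sketch of such a construct;
preserve_dependencies() is purely hypothetical and exists in no compiler
or standard:

	int flag = atomic_load_explicit(&gflag, memory_order_consume);
	/* only this value keeps its dependency chain... */
	int *q = preserve_dependencies(p + flag - flag);
	int v = *q;	/* ...so this read is ordered after the consume load */
	/* elsewhere, (flag - flag) could be folded to 0 as usual */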

> If a dependency chain
> headed by a memory_order_consume load goes into or out of a function
> without the aid of the [[carries_dependency]] attribute, the compiler
> needs to do something else to enforce ordering, e.g., emit a memory
> barrier.

I agree that this is a way to see it.  But I can't see how this will
motivate compiler implementers to not just emit a stronger barrier right
away.

> From a Linux-kernel viewpoint, this is a bit ugly, as it requires
> annotations and use of kill_dependency, but it was the best I could do
> at the time.  If things go as they usually do, there will be some other
> reason why those are needed...

Did you consider something along the lines of the "preserve_dependencies"
call?  If so, why did you go for kill_dependency?




* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 18:02                       ` Paul E. McKenney
@ 2014-02-10  0:27                         ` Torvald Riegel
  2014-02-10  0:56                           ` Linus Torvalds
                                             ` (4 more replies)
  2014-02-10 11:48                         ` Peter Zijlstra
  1 sibling, 5 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-10  0:27 UTC (permalink / raw)
  To: paulmck
  Cc: Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote:
> > Hi Paul,
> > 
> > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > 
> > > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > > a wooden stake through its heart.
> > > > 
> > > > C11/C++11 should not be allowed to claim itself a memory model until that
> > > > is sorted.
> > > 
> > > There actually is a proposal being put forward, but it might not make ARM
> > > and Power people happy because it involves adding a compare, a branch,
> > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > much preferring the no-store-speculation approach.
> > 
> > Can you elaborate a bit on this please? We don't permit speculative stores
> > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > emit any additional instructions to prevent that from happening.
> 
> Requiring a compare/branch/ISB after each relaxed load enables a simple(r)
> proof that out-of-thin-air values cannot be observed in the face of any
> compiler optimization that refrains from reordering a prior relaxed load
> with a subsequent relaxed store.
> 
> > Stores can, of course, be observed out-of-order but that's a lot more
> > reasonable :)
> 
> So let me try an example.  I am sure that Torvald Riegel will jump in
> with any needed corrections or amplifications:
> 
> Initial state: x == y == 0
> 
> T1:	r1 = atomic_load_explicit(&x, memory_order_relaxed);
> 	atomic_store_explicit(&y, r1, memory_order_relaxed);
> 
> T2:	r2 = atomic_load_explicit(&y, memory_order_relaxed);
> 	atomic_store_explicit(&x, r2, memory_order_relaxed);
> 
> One would intuitively expect r1 == r2 == 0 as the only possible outcome.
> But suppose that the compiler used specialization optimizations, as it
> would if there were a function that has a very lightweight implementation
> for some values and a very heavyweight one for others.  In particular,
> suppose that the lightweight implementation was for the value 42.
> Then the compiler might do something like the following:
> 
> Initial state: x == y == 0
> 
> T1:	r1 = atomic_load_explicit(&x, memory_order_relaxed);
> 	if (r1 == 42)
> 		atomic_store_explicit(&y, 42, memory_order_relaxed);
> 	else
> 		atomic_store_explicit(&y, r1, memory_order_relaxed);
> 
> T2:	r2 = atomic_load_explicit(&y, memory_order_relaxed);
> 	atomic_store_explicit(&x, r2, memory_order_relaxed);
> 
> Suddenly we have an explicit constant 42 showing up.  Of course, if
> the compiler carefully avoided speculative stores (as both Peter and
> I believe that it should if its code generation is to be regarded as
> anything other than an act of vandalism, the words in the standard
> notwithstanding), there would be no problem.  But currently, a number
> of compiler writers see absolutely nothing wrong with transforming
> the optimized-for-42 version above into something like this:
> 
> Initial state: x == y == 0
> 
> T1:	r1 = atomic_load_explicit(&x, memory_order_relaxed);
> 	atomic_store_explicit(&y, 42, memory_order_relaxed);
> 	if (r1 != 42)
> 		atomic_store_explicit(&y, r1, memory_order_relaxed);
> 
> T2:	r2 = atomic_load_explicit(&y, memory_order_relaxed);
> 	atomic_store_explicit(&x, r2, memory_order_relaxed);

Intuitively, this is wrong because it lets the program take a step
the abstract machine wouldn't do.  This is different to the sequential
code that Peter posted because it uses atomics, and thus one can't
easily assume that the difference is not observable.

For this to be correct, the compiler would actually have to prove that
the speculative store is "as-if correct", which in turn would mean that
it needs to be aware of all potential observers, and check whether those
observers aren't actually affected by the speculative store.

I would guess that the compilers you have in mind don't really do that.
If they do, then I don't see why this should be okay, unless you think
out-of-thin-air values are something good (which I wouldn't agree with).

> And then it is a short and uncontroversial step to the following:
> 
> Initial state: x == y == 0
> 
> T1:	atomic_store_explicit(&y, 42, memory_order_relaxed);
> 	r1 = atomic_load_explicit(&x, memory_order_relaxed);
> 	if (r1 != 42)
> 		atomic_store_explicit(&y, r1, memory_order_relaxed);
> 
> T2:	r2 = atomic_load_explicit(&y, memory_order_relaxed);
> 	atomic_store_explicit(&x, r2, memory_order_relaxed);
> 
> This can of course result in r1 == r2 == 42, even though the constant
> 42 never appeared in the original code.  This is one way to generate
> an out-of-thin-air value.
> 
> As near as I can tell, compiler writers hate the idea of prohibiting
> speculative-store optimizations because it requires them to introduce
> both control and data dependency tracking into their compilers.

I wouldn't characterize the situation like this (although I can't speak
for others, obviously).  IMHO, it's perfectly fine on sequential /
non-synchronizing code, because we know the difference isn't observable
by a correct program.  For synchronizing code, compilers just shouldn't
do it, or they would have to truly prove that speculation is harmless.
That will be hard, so I think it should just be avoided.

Synchronization code will likely have been tuned anyway (especially if
it uses relaxed MO), so I don't see a large need for trying to optimize
using speculative atomic stores.

Thus, I think there's an easy and practical solution.

> Many of
> them seem to hate dependency tracking with a purple passion.  At least,
> such a hatred would go a long way towards explaining the incomplete
> and high-overhead implementations of memory_order_consume, the long
> and successful use of idioms based on the memory_order_consume pattern
> notwithstanding [*].  ;-)

I still think that's different because it blurs the difference between
sequential code and synchronizing code (i.e., atomic accesses).  With
consume MO, the simple solution above doesn't work anymore, because
suddenly synchronizing code does affect optimizations in sequential
code, even if the optimization wouldn't reorder across the synchronizing
code (which would be clearly "visible" to the implementation of the
optimization).




* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10  0:27                         ` Torvald Riegel
@ 2014-02-10  0:56                           ` Linus Torvalds
  2014-02-10  1:16                             ` Torvald Riegel
  2014-02-10  3:21                           ` Paul E. McKenney
                                             ` (3 subsequent siblings)
  4 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-10  0:56 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote:
>
> I wouldn't characterize the situation like this (although I can't speak
> for others, obviously).  IMHO, it's perfectly fine on sequential /
> non-synchronizing code, because we know the difference isn't observable
> by a correct program.

What BS is that? If you use an "atomic_store_explicit()", by
definition you're either

 (a) f*cking insane
 (b) not doing sequential non-synchronizing code

and a compiler that assumes that the programmer is insane may actually
be correct more often than not, but it's still a shit compiler.
Agreed?

So I don't see how any sane person can say that speculative writes are
ok. They are clearly not ok.

Speculative stores are a bad idea in general. They are completely
invalid for anything that says "atomic". This is not even worth
discussing.

                Linus


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10  0:56                           ` Linus Torvalds
@ 2014-02-10  1:16                             ` Torvald Riegel
  2014-02-10  1:24                               ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-10  1:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sun, 2014-02-09 at 16:56 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > I wouldn't characterize the situation like this (although I can't speak
> > for others, obviously).  IMHO, it's perfectly fine on sequential /
> > non-synchronizing code, because we know the difference isn't observable
> > by a correct program.
> 
> What BS is that? If you use an "atomic_store_explicit()", by
> definition you're either
> 
>  (a) f*cking insane
>  (b) not doing sequential non-synchronizing code
> 
> and a compiler that assumes that the programmer is insane may actually
> be correct more often than not, but it's still a shit compiler.
> Agreed?

Due to all the expletives, I can't really understand what you are
saying.  Nor does what I guess it might mean seem to relate to what I
said.

(a) seems to say that you don't like requiring programmers to mark
atomic accesses specially.  Is that the case?

If so, I have to disagree.  If you're writing concurrent code, marking
the atomic accesses is the least of your problems.  Instead, for the
concurrent code I've written, it rather improved readability; it made
code more verbose, but in turn it made the synchronization easier to
see.

Besides this question of taste (and I don't care what the Kernel style
guides are), there is a real technical argument here, though:  Requiring
all synchronizing memory accesses to be annotated makes the compiler
aware of what is sequential code and what is not.  Without it, one can't
exploit data-race-freedom.  So unless we want to slow down optimization
of sequential code, we need to tell the compiler what is what.  If every
access is potentially synchronizing, then this doesn't just prevent
speculative stores.
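
A sketch of the kind of optimization that data-race-freedom licenses in
unannotated code (x is a plain int here):

	int x;	/* plain, unannotated */

	long sum_x(int n)
	{
		long sum = 0;
		int i;

		/*
		 * A data-race-free program cannot observe a concurrent
		 * change to x inside this loop, so the compiler may hoist
		 * the load of x out of the loop and keep it in a register.
		 */
		for (i = 0; i < n; i++)
			sum += x;
		return sum;
	}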

(b) seems to say that if there is a specially annotated atomic access
in the code, then this isn't sequential/non-synchronizing code anymore.
I don't agree with that, obviously.

> So I don't see how any sane person can say that speculative writes are
> ok. They are clearly not ok.

We are discussing programs written against the C11/C++11 memory model
here.  At least that's the part that got forwarded to gcc@gcc.gnu.org,
and was the subject of the nearest emails in the thread.

This memory model requires programs to be data-race-free.  Thus, we can
optimize accordingly.  If you don't like that, then this isn't C11/C++11
anymore.  Which would be fine, but then complain about that
specifically.

> Speculative stores are a bad idea in general.

I disagree in the context of C11/C++11 programs.  At least from a
correctness point of view.

> They are completely
> invalid for anything that says "atomic".

I agree, as you will see when you read the other emails I posted as part
of this thread.  But those two things are separate.



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10  1:16                             ` Torvald Riegel
@ 2014-02-10  1:24                               ` Linus Torvalds
  2014-02-10  1:46                                 ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-10  1:24 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel <triegel@redhat.com> wrote:
>
> (a) seems to say that you don't like requiring programmers to mark
> atomic accesses specially.  Is that the case?

In Paul's example, they were marked specially.

And you seemed to argue that Paul's example could possibly return
anything but 0/0.

If you really think that, I hope to God that you have nothing to do
with the C standard or any actual compiler I ever use.

Because such a standard or compiler would be shit. It's sadly not too uncommon.

             Linus


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10  1:24                               ` Linus Torvalds
@ 2014-02-10  1:46                                 ` Torvald Riegel
  2014-02-10  2:04                                   ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-10  1:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sun, 2014-02-09 at 17:24 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > (a) seems to say that you don't like requiring programmers to mark
> > atomic accesses specially.  Is that the case?
> 
> In Paul's example, they were marked specially.
> 
> And you seemed to argue that Paul's example could possibly return
> anything but 0/0.

Just read my reply to Paul again.  Here's an excerpt:

> Initial state: x == y == 0
> 
> T1:   r1 = atomic_load_explicit(&x, memory_order_relaxed);
>       atomic_store_explicit(&y, 42, memory_order_relaxed);
>       if (r1 != 42)
>               atomic_store_explicit(&y, r1, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(&y, memory_order_relaxed);
>       atomic_store_explicit(&x, r2, memory_order_relaxed);

Intuitively, this is wrong because it lets the program take a step
the abstract machine wouldn't do.  This is different to the sequential
code that Peter posted because it uses atomics, and thus one can't
easily assume that the difference is not observable.


IOW, I wrote that such a compiler transformation would be wrong in my
opinion.  Thus, it should *not* return 42.
If you still see how what I wrote could be misunderstood, please let me
know because I really don't see it.

Yes, I don't do so by swearing or calling others insane, and I try to see
the reasoning that those compilers' authors might have had, even if I
don't agree with it.  In my personal opinion, that's a good thing.

> If you really think that, I hope to God that you have nothing to do
> with the C standard or any actual compiler I ever use.
> 
> Because such a standard or compiler would be shit. It's sadly not too uncommon.

Thanks for the kind words.  Perhaps writing that was quicker than
reading what I actually wrote.




* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10  1:46                                 ` Torvald Riegel
@ 2014-02-10  2:04                                   ` Linus Torvalds
  0 siblings, 0 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-10  2:04 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sun, Feb 9, 2014 at 5:46 PM, Torvald Riegel <triegel@redhat.com> wrote:
>
> IOW, I wrote that such a compiler transformation would be wrong in my
> opinion.  Thus, it should *not* return 42.

Ahh, I am happy to have misunderstood. The "intuitively" threw me,
because I thought that was building up to a "but", and misread the
rest.

I then react strongly, because I've seen so much total crap (the
type-based C aliasing rules topping my list) etc. coming out of
standards groups because it allows them to generate wrong code that
goes faster, that I just assume compiler people are out to do stupid
things in the name of "..but the standard allows it".

              Linus


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10  0:27                         ` Torvald Riegel
  2014-02-10  0:56                           ` Linus Torvalds
@ 2014-02-10  3:21                           ` Paul E. McKenney
  2014-02-10  3:45                           ` Paul E. McKenney
                                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-10  3:21 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote:
> On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote:
> > > Hi Paul,
> > > 
> > > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote:
> > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > > 
> > > > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > > > a wooden stake through its heart.
> > > > > 
> > > > > C11/C++11 should not be allowed to claim itself a memory model until that
> > > > > is sorted.
> > > > 
> > > > There actually is a proposal being put forward, but it might not make ARM
> > > > and Power people happy because it involves adding a compare, a branch,
> > > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > > much preferring the no-store-speculation approach.
> > > 
> > > Can you elaborate a bit on this please? We don't permit speculative stores
> > > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > > emit any additional instructions to prevent that from happening.
> > 
> > Requiring a compare/branch/ISB after each relaxed load enables a simple(r)
> > proof that out-of-thin-air values cannot be observed in the face of any
> > compiler optimization that refrains from reordering a prior relaxed load
> > with a subsequent relaxed store.
> > 
> > > Stores can, of course, be observed out-of-order but that's a lot more
> > > reasonable :)
> > 
> > So let me try an example.  I am sure that Torvald Riegel will jump in
> > with any needed corrections or amplifications:
> > 
> > Initial state: x == y == 0
> > 
> > T1:	r1 = atomic_load_explicit(x, memory_order_relaxed);
> > 	atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2:	r2 = atomic_load_explicit(y, memory_order_relaxed);
> > 	atomic_store_explicit(r2, x, memory_order_relaxed);
> > 
> > One would intuitively expect r1 == r2 == 0 as the only possible outcome.
> > But suppose that the compiler used specialization optimizations, as it
> > would if there were a function that has a very lightweight implementation
> > for some values and a very heavyweight one for others.  In particular,
> > suppose that the lightweight implementation was for the value 42.
> > Then the compiler might do something like the following:
> > 
> > Initial state: x == y == 0
> > 
> > T1:	r1 = atomic_load_explicit(x, memory_order_relaxed);
> > 	if (r1 == 42)
> > 		atomic_store_explicit(42, y, memory_order_relaxed);
> > 	else
> > 		atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2:	r2 = atomic_load_explicit(y, memory_order_relaxed);
> > 	atomic_store_explicit(r2, x, memory_order_relaxed);
> > 
> > Suddenly we have an explicit constant 42 showing up.  Of course, if
> > the compiler carefully avoided speculative stores (as both Peter and
> > I believe that it should if its code generation is to be regarded as
> > anything other than an act of vandalism, the words in the standard
> > notwithstanding), there would be no problem.  But currently, a number
> > of compiler writers see absolutely nothing wrong with transforming
> > the optimized-for-42 version above into something like this:
> > 
> > Initial state: x == y == 0
> > 
> > T1:	r1 = atomic_load_explicit(x, memory_order_relaxed);
> > 	atomic_store_explicit(42, y, memory_order_relaxed);
> > 	if (r1 != 42)
> > 		atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2:	r2 = atomic_load_explicit(y, memory_order_relaxed);
> > 	atomic_store_explicit(r2, x, memory_order_relaxed);
> 
> Intuitively, this is wrong because this lets the program take a step
> the abstract machine wouldn't do.  This is different to the sequential
> code that Peter posted because it uses atomics, and thus one can't
> easily assume that the difference is not observable.
> 
> For this to be correct, the compiler would actually have to prove that
> the speculative store is "as-if correct", which in turn would mean that
> it needs to be aware of all potential observers, and check whether those
> observers aren't actually affected by the speculative store.
> 
> I would guess that the compilers you have in mind don't really do that.
> If they do, then I don't see why this should be okay, unless you think
> out-of-thin-air values are something good (which I wouldn't agree with).

OK, we agree that pulling the atomic store to y out of its "if" statement
is a bad thing.  Very good!  Now we just have to convince others on
the committee.  ;-)
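
To make this concrete for anyone who wants to experiment, here is a
minimal sketch of the original litmus test as actual C11.  This is a
hypothetical harness: it assumes an implementation providing
<threads.h>, and note that standard atomic_store_explicit() takes the
object pointer first and the value second, the reverse of the
shorthand used in this thread:

	#include <stdatomic.h>
	#include <stdio.h>
	#include <threads.h>

	static atomic_int x, y;		/* both initially 0 */
	static int r1, r2;

	static int t1(void *arg)
	{
		r1 = atomic_load_explicit(&x, memory_order_relaxed);
		atomic_store_explicit(&y, r1, memory_order_relaxed);
		return 0;
	}

	static int t2(void *arg)
	{
		r2 = atomic_load_explicit(&y, memory_order_relaxed);
		atomic_store_explicit(&x, r2, memory_order_relaxed);
		return 0;
	}

	int main(void)
	{
		thrd_t a, b;

		thrd_create(&a, t1, NULL);
		thrd_create(&b, t2, NULL);
		thrd_join(a, NULL);
		thrd_join(b, NULL);
		/* The abstract machine allows only r1 == r2 == 0; a compiler
		 * that speculates the store can manufacture r1 == r2 == 42. */
		printf("r1=%d r2=%d\n", r1, r2);
		return 0;
	}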

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10  0:27                         ` Torvald Riegel
  2014-02-10  0:56                           ` Linus Torvalds
  2014-02-10  3:21                           ` Paul E. McKenney
@ 2014-02-10  3:45                           ` Paul E. McKenney
  2014-02-10 11:46                           ` Peter Zijlstra
  2014-02-10 19:09                           ` Linus Torvalds
  4 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-10  3:45 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote:
> On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote:

[ . . . ]

> > And then it is a short and uncontroversial step to the following:
> > 
> > Initial state: x == y == 0
> > 
> > T1:	atomic_store_explicit(42, y, memory_order_relaxed);
> > 	r1 = atomic_load_explicit(x, memory_order_relaxed);
> > 	if (r1 != 42)
> > 		atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2:	r2 = atomic_load_explicit(y, memory_order_relaxed);
> > 	atomic_store_explicit(r2, x, memory_order_relaxed);
> > 
> > This can of course result in r1 == r2 == 42, even though the constant
> > 42 never appeared in the original code.  This is one way to generate
> > an out-of-thin-air value.
> > 
> > As near as I can tell, compiler writers hate the idea of prohibiting
> > speculative-store optimizations because it requires them to introduce
> > both control and data dependency tracking into their compilers.
> 
> I wouldn't characterize the situation like this (although I can't speak
> for others, obviously).  IMHO, it's perfectly fine on sequential /
> non-synchronizing code, because we know the difference isn't observable
> by a correct program.  For synchronizing code, compilers just shouldn't
> do it, or they would have to truly prove that speculation is harmless.
> That will be hard, so I think it should just be avoided.
> 
> Synchronization code will likely have been tuned anyway (especially if
> it uses relaxed MO), so I don't see a large need for trying to optimize
> using speculative atomic stores.
> 
> Thus, I think there's an easy and practical solution.

I like this approach, but there has been resistance to it in the past.
Definitely worth a good try, though!

> > Many of
> > them seem to hate dependency tracking with a purple passion.  At least,
> > such a hatred would go a long way towards explaining the incomplete
> > and high-overhead implementations of memory_order_consume, the long
> > and successful use of idioms based on the memory_order_consume pattern
> > notwithstanding [*].  ;-)
> 
> I still think that's different because it blurs the difference between
> sequential code and synchronizing code (ie, atomic accesses).  With
> consume MO, the simple solution above doesn't work anymore, because
> suddenly synchronizing code does affect optimizations in sequential
> code, even if that wouldn't reorder across the synchronizing code (which
> would be clearly "visible" to the implementation of the optimization).

I understand that memory_order_consume is a bit harder on compiler
writers than the other memory orders, but it is also pretty valuable.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10  0:06                 ` Torvald Riegel
@ 2014-02-10  3:51                   ` Paul E. McKenney
  2014-02-12  5:13                     ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-10  3:51 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote:
> > > > > > > There are also so many ways to blow your head off it's untrue. For example,
> > > > > > > cmpxchg takes a separate memory model parameter for failure and success, but
> > > > > > > then there are restrictions on the sets you can use for each. It's not hard
> > > > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > > > > memory_order_seq_cst for everything, it's too hard otherwise". Then there's
> > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > > > > > > atm and optimises all of the data dependencies away) as well as the definition
> > > > > > > of "data races", which seem to be used as an excuse to miscompile a program
> > > > > > > at the earliest opportunity.
> > > > > > 
> > > > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > > > memory_order_consume until the compilers implement it both correctly and
> > > > > > efficiently.  They are not there yet, and there is currently no shortage
> > > > > > of compiler writers who would prefer to ignore memory_order_consume.
> > > > > 
> > > > > Do you have any input on
> > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > > > > language standard's definition of dependencies?
> > > > 
> > > > Let's see...  1.10p9 says that a dependency must be carried unless:
> > > > 
> > > > — B is an invocation of any specialization of std::kill_dependency (29.3), or
> > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator,
> > > > or
> > > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > > 
> > > > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > > > after the "?" will carry a dependency, so the code fragment in 59448
> > > > needs to do the ordering rather than just optimizing "flag - flag" out
> > > > of existence.  One way to do that on both ARM and Power is to actually
> > > > emit code for "flag - flag", but there are a number of other ways to
> > > > make that work.
> > > 
> > > And that's what would concern me, considering that these requirements
> > > seem to be able to creep out easily.  Also, whereas the other atomics
> > > just constrain compilers wrt. reordering across atomic accesses or
> > > changes to the atomic accesses themselves, the dependencies are new
> > > requirements on pieces of otherwise non-synchronizing code.  The latter
> > > seems far more involved to me.
> > 
> > Well, the wording of 1.10p9 is pretty explicit on this point.
> > There are only a few exceptions to the rule that dependencies from
> > memory_order_consume loads must be tracked.  And to your point about
> > requirements being placed on pieces of otherwise non-synchronizing code,
> > we already have that with plain old load acquire and store release --
> > both of these put ordering constraints that affect the surrounding
> > non-synchronizing code.
> 
> I think there's a significant difference.  With acquire/release or more
> general memory orders, it's true that we can't order _across_ the atomic
> access.  However, we can reorder and optimize without additional
> constraints if we do not reorder.  This is not the case with consume
> memory order, as the (p + flag - flag) example shows.

Agreed, memory_order_consume does introduce additional restrictions.
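
To make the (p + flag - flag) example concrete, here is a sketch with
hypothetical declarations (see gcc bugzilla 59448 for the actual test
case):

	static _Atomic(int *) gp;

	int *p, flag, r;

	p = atomic_load_explicit(&gp, memory_order_consume);
	flag = *p;
	/* "flag - flag" is always zero, yet under 1.10p9 it still carries
	 * the dependency from the consume load, so the access below must
	 * stay ordered after the load of gp (or a fence must be emitted)
	 * rather than being reduced to a plain "*p". */
	r = *(p + flag - flag);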

> > This issue got a lot of discussion, and the compromise is that
> > dependencies cannot leak into or out of functions unless the relevant
> > parameters or return values are annotated with [[carries_dependency]].
> > This means that the compiler can see all the places where dependencies
> > must be tracked.  This is described in 7.6.4.
> 
> I wasn't aware of 7.6.4 (but it isn't referred to as an additional
> constraint--what it is--in 1.10, so I guess at least that should be
> fixed).
> Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic
> difference, at least if one is assuming that normal optimization of
> sequential code is the default, and that maintaining things such as
> (flag-flag) is not; if optimizing away (flag-flag) would require the
> insertion of fences unless there is the carries_dependency attribute,
> then this would be bad, I think.

No, the attribute does not make a semantic difference.  If a dependency
flows into a function without [[carries_dependency]], the implementation
is within its rights to emit an acquire barrier or similar.

> IMHO, the dependencies construct (carries_dependency, kill_dependency)
> seem to be backwards to me.  They assume that the compiler preserves all
> those dependencies including (flag-flag) by default, which prohibits
> meaningful optimizations.  Instead, I guess there should be a construct
> for explicitly exploiting the dependencies (e.g., a
> preserve_dependencies call, whose argument will not be optimized fully).
> Exploiting dependencies will be special code and isn't the common case,
> so it can require additional annotations.

If you are compiling a function that has no [[carries_dependency]]
attributes on its arguments and return value, and none on any of the
functions that it calls, and contains no memory_order_consume loads,
then you can break dependencies all you like within that function.

That said, I am of course open to discussing alternative approaches.
Especially those that ease the migration of the existing code in the
Linux kernel that relies on dependencies.  ;-)

> > If a dependency chain
> > headed by a memory_order_consume load goes into or out of a function
> > without the aid of the [[carries_dependency]] attribute, the compiler
> > needs to do something else to enforce ordering, e.g., emit a memory
> > barrier.
> 
> I agree that this is a way to see it.  But I can't see how this will
> motivate compiler implementers to not just emit a stronger barrier right
> away.

That certainly has been the most common approach.

> > From a Linux-kernel viewpoint, this is a bit ugly, as it requires
> > annotations and use of kill_dependency, but it was the best I could do
> > at the time.  If things go as they usually do, there will be some other
> > reason why those are needed...
> 
> Did you consider something along the "preserve_dependencies" call?  If
> so, why did you go for kill_dependency?

Could you please give more detail on what a "preserve_dependencies" call
would do and where it would be used?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10  0:27                         ` Torvald Riegel
                                             ` (2 preceding siblings ...)
  2014-02-10  3:45                           ` Paul E. McKenney
@ 2014-02-10 11:46                           ` Peter Zijlstra
  2014-02-10 19:09                           ` Linus Torvalds
  4 siblings, 0 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-10 11:46 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: paulmck, Will Deacon, Ramana Radhakrishnan, David Howells,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote:
> > Initial state: x == y == 0
> > 
> > T1:	r1 = atomic_load_explicit(x, memory_order_relaxed);
> > 	atomic_store_explicit(42, y, memory_order_relaxed);
> > 	if (r1 != 42)
> > 		atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2:	r2 = atomic_load_explicit(y, memory_order_relaxed);
> > 	atomic_store_explicit(r2, x, memory_order_relaxed);
> 
> Intuitively, this is wrong because this lets the program take a step
> the abstract machine wouldn't do.  This is different to the sequential
> code that Peter posted because it uses atomics, and thus one can't
> easily assume that the difference is not observable.

Yeah, my bad for not being familiar with the atrocious crap C11 made of
atomics :/


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 18:02                       ` Paul E. McKenney
  2014-02-10  0:27                         ` Torvald Riegel
@ 2014-02-10 11:48                         ` Peter Zijlstra
  2014-02-10 11:49                           ` Will Deacon
  1 sibling, 1 reply; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-10 11:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Will Deacon, Torvald Riegel, Ramana Radhakrishnan, David Howells,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> As near as I can tell, compiler writers hate the idea of prohibiting
> speculative-store optimizations because it requires them to introduce
> both control and data dependency tracking into their compilers.  Many of
> them seem to hate dependency tracking with a purple passion.  At least,
> such a hatred would go a long way towards explaining the incomplete
> and high-overhead implementations of memory_order_consume, the long
> and successful use of idioms based on the memory_order_consume pattern
> notwithstanding [*].  ;-)

Just tell them that because the hardware provides control dependencies
we actually use and rely on them.

Not that I expect they care too much what we do, given the current state
of things.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10 11:48                         ` Peter Zijlstra
@ 2014-02-10 11:49                           ` Will Deacon
  2014-02-10 12:05                             ` Peter Zijlstra
  2014-02-10 15:04                             ` Paul E. McKenney
  0 siblings, 2 replies; 299+ messages in thread
From: Will Deacon @ 2014-02-10 11:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Torvald Riegel, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo,
	gcc

On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > As near as I can tell, compiler writers hate the idea of prohibiting
> > speculative-store optimizations because it requires them to introduce
> > both control and data dependency tracking into their compilers.  Many of
> > them seem to hate dependency tracking with a purple passion.  At least,
> > such a hatred would go a long way towards explaining the incomplete
> > and high-overhead implementations of memory_order_consume, the long
> > and successful use of idioms based on the memory_order_consume pattern
> > notwithstanding [*].  ;-)
> 
> Just tell them that because the hardware provides control dependencies
> we actually use and rely on them.

s/control/address/ ?

Will

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10 11:49                           ` Will Deacon
@ 2014-02-10 12:05                             ` Peter Zijlstra
  2014-02-10 15:04                             ` Paul E. McKenney
  1 sibling, 0 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-10 12:05 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, Torvald Riegel, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo,
	gcc

On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote:
> On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote:
> > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > speculative-store optimizations because it requires them to introduce
> > > both control and data dependency tracking into their compilers.  Many of
> > > them seem to hate dependency tracking with a purple passion.  At least,
> > > such a hatred would go a long way towards explaining the incomplete
> > > and high-overhead implementations of memory_order_consume, the long
> > > and successful use of idioms based on the memory_order_consume pattern
> > > notwithstanding [*].  ;-)
> > 
> > Just tell them that because the hardware provides control dependencies
> > we actually use and rely on them.
> 
> s/control/address/ ?

Nope, control.

Since stores cannot be speculated and thus require linear control flow
history, we can use that to order LOAD -> STORE when the LOAD is required
for the control flow decision and the STORE depends on the control flow
path.
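
In kernel terms the pattern is roughly the following sketch, using the
ACCESS_ONCE() idiom (variable names are made up):

	q = ACCESS_ONCE(a);
	if (q) {
		/* The store cannot be speculated and is control-dependent
		 * on the load of a, so the hardware orders LOAD -> STORE
		 * here without any explicit barrier. */
		ACCESS_ONCE(b) = 1;
	}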

Also see commit 18c03c61444a211237f3d4782353cb38dba795df to
Documentation/memory-barriers.txt

---
commit c7f2e3cd6c1f4932ccc4135d050eae3f7c7aef63
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon Nov 25 11:49:10 2013 +0100

    perf: Optimize ring-buffer write by depending on control dependencies
    
    Remove a full barrier from the ring-buffer write path by relying on
    a control dependency to order a LOAD -> STORE scenario.
    
    Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Link: http://lkml.kernel.org/n/tip-8alv40z6ikk57jzbaobnxrjl@git.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index e8b168af135b..146a5792b1d2 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -61,19 +61,20 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 	 *
 	 *   kernel				user
 	 *
-	 *   READ ->data_tail			READ ->data_head
-	 *   smp_mb()	(A)			smp_rmb()	(C)
-	 *   WRITE $data			READ $data
-	 *   smp_wmb()	(B)			smp_mb()	(D)
-	 *   STORE ->data_head			WRITE ->data_tail
+	 *   if (LOAD ->data_tail) {		LOAD ->data_head
+	 *			(A)		smp_rmb()	(C)
+	 *	STORE $data			LOAD $data
+	 *	smp_wmb()	(B)		smp_mb()	(D)
+	 *	STORE ->data_head		STORE ->data_tail
+	 *   }
 	 *
 	 * Where A pairs with D, and B pairs with C.
 	 *
-	 * I don't think A needs to be a full barrier because we won't in fact
-	 * write data until we see the store from userspace. So we simply don't
-	 * issue the data WRITE until we observe it. Be conservative for now.
+	 * In our case (A) is a control dependency that separates the load of
+	 * the ->data_tail and the stores of $data. In case ->data_tail
+	 * indicates there is no room in the buffer to store $data we do not.
 	 *
-	 * OTOH, D needs to be a full barrier since it separates the data READ
+	 * D needs to be a full barrier since it separates the data READ
 	 * from the tail WRITE.
 	 *
 	 * For B a WMB is sufficient since it separates two WRITEs, and for C
@@ -81,7 +82,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 	 *
 	 * See perf_output_begin().
 	 */
-	smp_wmb();
+	smp_wmb(); /* B, matches C */
 	rb->user_page->data_head = head;
 
 	/*
@@ -144,17 +145,26 @@ int perf_output_begin(struct perf_output_handle *handle,
 		if (!rb->overwrite &&
 		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
 			goto fail;
+
+		/*
+		 * The above forms a control dependency barrier separating the
+		 * @tail load above from the data stores below. Since the @tail
+		 * load is required to compute the branch to fail below.
+		 *
+		 * A, matches D; the full memory barrier userspace SHOULD issue
+		 * after reading the data and before storing the new tail
+		 * position.
+		 *
+		 * See perf_output_put_handle().
+		 */
+
 		head += size;
 	} while (local_cmpxchg(&rb->head, offset, head) != offset);
 
 	/*
-	 * Separate the userpage->tail read from the data stores below.
-	 * Matches the MB userspace SHOULD issue after reading the data
-	 * and before storing the new tail position.
-	 *
-	 * See perf_output_put_handle().
+	 * We rely on the implied barrier() by local_cmpxchg() to ensure
+	 * none of the data stores below can be lifted up by the compiler.
 	 */
-	smp_mb();
 
 	if (unlikely(head - local_read(&rb->wakeup) > rb->watermark))
 		local_add(rb->watermark, &rb->wakeup);

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10 11:49                           ` Will Deacon
  2014-02-10 12:05                             ` Peter Zijlstra
@ 2014-02-10 15:04                             ` Paul E. McKenney
  2014-02-10 16:22                               ` Will Deacon
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-10 15:04 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Torvald Riegel, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo,
	gcc

On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote:
> On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote:
> > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > speculative-store optimizations because it requires them to introduce
> > > both control and data dependency tracking into their compilers.  Many of
> > > them seem to hate dependency tracking with a purple passion.  At least,
> > > such a hatred would go a long way towards explaining the incomplete
> > > and high-overhead implementations of memory_order_consume, the long
> > > and successful use of idioms based on the memory_order_consume pattern
> > > notwithstanding [*].  ;-)
> > 
> > Just tell them that because the hardware provides control dependencies
> > we actually use and rely on them.
> 
> s/control/address/ ?

Both are important, but as Peter's reply noted, it was control
dependencies under discussion.  Data dependencies (which include the
ARM/PowerPC notion of address dependencies) are called out by the standard
already, but control dependencies are not.  I am not all that satisfied
by current implementations of data dependencies, admittedly.  Should
be an interesting discussion.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10 15:04                             ` Paul E. McKenney
@ 2014-02-10 16:22                               ` Will Deacon
  0 siblings, 0 replies; 299+ messages in thread
From: Will Deacon @ 2014-02-10 16:22 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Torvald Riegel, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo,
	gcc

On Mon, Feb 10, 2014 at 03:04:43PM +0000, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote:
> > On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote:
> > > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > > speculative-store optimizations because it requires them to introduce
> > > > both control and data dependency tracking into their compilers.  Many of
> > > > them seem to hate dependency tracking with a purple passion.  At least,
> > > > such a hatred would go a long way towards explaining the incomplete
> > > > and high-overhead implementations of memory_order_consume, the long
> > > > and successful use of idioms based on the memory_order_consume pattern
> > > > notwithstanding [*].  ;-)
> > > 
> > > Just tell them that because the hardware provides control dependencies
> > > we actually use and rely on them.
> > 
> > s/control/address/ ?
> 
> Both are important, but as Peter's reply noted, it was control
> dependencies under discussion.  Data dependencies (which include the
> ARM/PowerPC notion of address dependencies) are called out by the standard
> already, but control dependencies are not.  I am not all that satisfied
> by current implementations of data dependencies, admittedly.  Should
> be an interesting discussion.  ;-)

Ok, but since you can't use control dependencies to order LOAD -> LOAD, it's
a pretty big ask of the compiler to make use of them for things like
consume, where a data dependency will suffice for any combination of
accesses.
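
A sketch of the distinction, with hypothetical kernel-style variables:

	/* Address (data) dependency: orders LOAD -> LOAD, because the
	 * second load's address is computed from the first load's value. */
	p = rcu_dereference(gp);
	r1 = p->val;

	/* Control dependency: orders LOAD -> STORE only.  A load inside
	 * the branch is NOT ordered by the branch alone. */
	f = ACCESS_ONCE(flag);
	if (f)
		ACCESS_ONCE(b) = 1;	/* ordered after the load of flag */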

Will

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10  0:27                         ` Torvald Riegel
                                             ` (3 preceding siblings ...)
  2014-02-10 11:46                           ` Peter Zijlstra
@ 2014-02-10 19:09                           ` Linus Torvalds
  2014-02-11 15:59                             ` Paul E. McKenney
  2014-02-12  5:39                             ` Torvald Riegel
  4 siblings, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-10 19:09 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote:
>
> Intuitively, this is wrong because this lets the program take a step
> the abstract machine wouldn't do.  This is different to the sequential
> code that Peter posted because it uses atomics, and thus one can't
> easily assume that the difference is not observable.

Btw, what is the definition of "observable" for the atomics?

Because I'm hoping that it's not the same as for volatiles, where
"observable" is about the virtual machine itself, and as such volatile
accesses cannot be combined or optimized at all.

Now, I claim that atomic accesses cannot be done speculatively for
writes, and not re-done for reads (because the value could change),
but *combining* them would be possible and good.

For example, we often have multiple independent atomic accesses that
could certainly be combined: testing the individual bits of an atomic
value with helper functions, causing things like "load atomic, test
bit, load same atomic, test another bit". The two atomic loads could
be done as a single load without possibly changing semantics on a real
machine, but if "visibility" is defined in the same way it is for
"volatile", that wouldn't be a valid transformation. Right now we use
"volatile" semantics for these kinds of things, and they really can
hurt.
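
A sketch of that combining (test_bit_atomic() and do_something() are
hypothetical helpers):

	_Atomic int v;
	int tmp;

	/* As written: each helper performs its own atomic load of v. */
	if (test_bit_atomic(&v, 0) && test_bit_atomic(&v, 1))
		do_something();

	/* Combined: a single load, with the same observable behavior on
	 * any real machine. */
	tmp = atomic_load_explicit(&v, memory_order_relaxed);
	if ((tmp & 1) && (tmp & 2))
		do_something();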

Same goes for multiple writes (possibly due to setting bits):
combining multiple accesses into a single one is generally fine, it's
*adding* write accesses speculatively that is broken by design..

At the same time, you can't combine atomic loads or stores infinitely
- "visibility" on a real machine definitely is about timeliness.
Removing all but the last write when there are multiple consecutive
writes is generally fine, even if you unroll a loop to generate those
writes. But if what remains is a loop, it might be a busy-loop
basically waiting for something, so it would be wrong ("untimely") to
hoist a store in a loop entirely past the end of the loop, or hoist a
load in a loop to before the loop.
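
For example (a sketch), hoisting the load out of this busy-wait loop is
exactly the kind of "untimely" transformation that must stay invalid:

	static _Atomic int ready;
	int tmp;

	/* Valid: every iteration re-loads ready, so the loop terminates
	 * once another thread stores a nonzero value. */
	while (!atomic_load_explicit(&ready, memory_order_relaxed))
		;

	/* Invalid "infinite merging": loading once before the loop turns
	 * this into a potentially infinite loop. */
	tmp = atomic_load_explicit(&ready, memory_order_relaxed);
	while (!tmp)
		;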

Does the standard allow for that kind of behavior?

              Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
       [not found]   ` <20140210205719.GY5002@laptop.programming.kicks-ass.net>
@ 2014-02-10 21:08     ` Chris Metcalf
  2014-02-10 21:14       ` Peter Zijlstra
  0 siblings, 1 reply; 299+ messages in thread
From: Chris Metcalf @ 2014-02-10 21:08 UTC (permalink / raw)
  To: Peter Zijlstra, Linux Kernel Mailing List

(+LKML again)

On 2/10/2014 3:57 PM, Peter Zijlstra wrote:
> On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote:
>> On 2/6/2014 8:52 AM, Peter Zijlstra wrote:
>>> Its been compiled on everything I have a compiler for, however frv and
>>> tile are missing because they're special and I was tired.
>> So what's the specialness on tile?
> It's not doing the atomic work in ASM but uses magic builtins or such.
>
> I got the list of magic funcs for tilegx, but didn't look into the 32bit
> chips.

Oh, I see.  The <asm/atomic.h> files on tile are already reasonably well-factored.

It's possible you could do better, but I think not by too much, other than possibly
by using <asm-generic/atomic.h> for some of the common idioms like "subtraction
is addition with a negative second argument", etc., which hasn't been done elsewhere.
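
For reference, that idiom sketched in asm-generic style (assuming only
atomic_add() and atomic_add_return() are architecture-specific):

	static inline void atomic_sub(int i, atomic_t *v)
	{
		atomic_add(-i, v);
	}

	static inline int atomic_sub_return(int i, atomic_t *v)
	{
		return atomic_add_return(-i, v);
	}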

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10 21:08     ` Chris Metcalf
@ 2014-02-10 21:14       ` Peter Zijlstra
  0 siblings, 0 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-10 21:14 UTC (permalink / raw)
  To: Chris Metcalf; +Cc: Linux Kernel Mailing List

On Mon, Feb 10, 2014 at 04:08:11PM -0500, Chris Metcalf wrote:
> (+LKML again)
> 
> On 2/10/2014 3:57 PM, Peter Zijlstra wrote:
> > On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote:
> >> On 2/6/2014 8:52 AM, Peter Zijlstra wrote:
> >>> Its been compiled on everything I have a compiler for, however frv and
> >>> tile are missing because they're special and I was tired.
> >> So what's the specialness on tile?
> > It's not doing the atomic work in ASM but uses magic builtins or such.
> >
> > I got the list of magic funcs for tilegx, but didn't look into the 32bit
> > chips.
> 
> Oh, I see.  The <asm/atomic.h> files on tile are already reasonably well-factored.
> 
> It's possible you could do better, but I think not by too much, other than possibly
> by using <asm-generic/atomic.h> for some of the common idioms like "subtraction
> is addition with a negative second argument", etc., which hasn't been done elsewhere.

The last patch 5/5 adds a few atomic ops; I could of course use
cmpxchg() loops for everything, but I found tilegx actually has fetch_or
and fetch_and to implement atomic_or() / atomic_and().

It doesn't have fetch_xor() from what I've been told so atomic_xor()
will have to become a cmpxchg() loop.
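
Roughly like this sketch of the cmpxchg() loop:

	static inline void atomic_xor(int i, atomic_t *v)
	{
		int old;

		/* Retry until nobody else modified v between the read and
		 * the compare-and-exchange. */
		do {
			old = atomic_read(v);
		} while (atomic_cmpxchg(v, old, old ^ i) != old);
	}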

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10 19:09                           ` Linus Torvalds
@ 2014-02-11 15:59                             ` Paul E. McKenney
  2014-02-12  6:06                               ` Torvald Riegel
  2014-02-12  5:39                             ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-11 15:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > Intuitively, this is wrong because this lets the program take a step
> > the abstract machine wouldn't do.  This is different to the sequential
> > code that Peter posted because it uses atomics, and thus one can't
> > easily assume that the difference is not observable.
> 
> Btw, what is the definition of "observable" for the atomics?
> 
> Because I'm hoping that it's not the same as for volatiles, where
> "observable" is about the virtual machine itself, and as such volatile
> accesses cannot be combined or optimized at all.
> 
> Now, I claim that atomic accesses cannot be done speculatively for
> writes, and not re-done for reads (because the value could change),
> but *combining* them would be possible and good.
> 
> For example, we often have multiple independent atomic accesses that
> could certainly be combined: testing the individual bits of an atomic
> value with helper functions, causing things like "load atomic, test
> bit, load same atomic, test another bit". The two atomic loads could
> be done as a single load without possibly changing semantics on a real
> machine, but if "visibility" is defined in the same way it is for
> "volatile", that wouldn't be a valid transformation. Right now we use
> "volatile" semantics for these kinds of things, and they really can
> hurt.
> 
> Same goes for multiple writes (possibly due to setting bits):
> combining multiple accesses into a single one is generally fine, it's
> *adding* write accesses speculatively that is broken by design..
> 
> At the same time, you can't combine atomic loads or stores infinitely
> - "visibility" on a real machine definitely is about timeliness.
> Removing all but the last write when there are multiple consecutive
> writes is generally fine, even if you unroll a loop to generate those
> writes. But if what remains is a loop, it might be a busy-loop
> basically waiting for something, so it would be wrong ("untimely") to
> hoist a store in a loop entirely past the end of the loop, or hoist a
> load in a loop to before the loop.
> 
> Does the standard allow for that kind of behavior?

You asked!  ;-)

So the current standard allows merging of both loads and stores, unless of
course ordering constraints prevent the merging.  Volatile semantics may be
used to prevent this merging, if desired, for example, for real-time code.
Infinite merging is intended to be prohibited, but I am not certain that
the current wording is bullet-proof (1.10p24 and 1.10p25).

The only prohibition against speculative stores that I can see is in a
non-normative note, and it can be argued to apply only to things that are
not atomics (1.10p22).  I don't see any prohibition against reordering
a store to precede a load preceding a conditional branch -- which would
not be speculative if the branch was know to be taken and the load
hit in the store buffer.  In a system where stores could be reordered,
some other CPU might perceive the store as happening before the load
that controlled the conditional branch.  This needs to be addressed.

Why this hole?  At the time, the current formalizations of popular
CPU architectures did not exist, and it was not clear that all popular
hardware avoided speculative stores.  

There is also fun with "out of thin air" values, which everyone agrees
should be prohibited, but where there is not agreement on how to prohibit
them in a mathematically constructive manner.  The current draft contains
a clause simply stating that out-of-thin-air values are prohibited,
which doesn't help someone constructing tools to analyze C++ code.
One proposal requires that subsequent atomic stores never be reordered
before prior atomic loads, which requires useless ordering code to be
emitted on ARM and PowerPC (you may have seen Will Deacon's and Peter
Zijlstra's reaction to this proposal a few days ago).  Note that Itanium
already pays this price in order to provide full single-variable cache
coherence.  This out-of-thin-air discussion is also happening in the
Java community in preparation for a new rev of the Java memory model.

There will also be some discussions on memory_order_consume, which is
intended to (eventually) implement rcu_dereference().  The compiler
writers don't like tracking dependencies, but there may be some ways
of constraining optimizations so that they preserve the common dependencies,
while providing some syntax to force preservation of dependencies that
would normally be optimized out.  One example of this is where you have an
RCU-protected array that might sometimes contain only a single element.
In the single-element case, the compiler knows a priori which element
will be used, and will therefore optimize the dependency away, so that
the reader might see pre-initialization state.  But this is rare, so
if added syntax needs to be added in this case, I believe we should be
OK with it.  (If syntax is needed for plain old dereferences, it is
thumbs down all the way as far as I am concerned.  Ditto for things
like stripping the bottom bits off of a decorated pointer.)
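
A sketch of that single-element case (hypothetical code):

	static int arr[1];	/* this build has a one-element array */
	static _Atomic int idx;

	int i, r;

	i = atomic_load_explicit(&idx, memory_order_consume);
	/* The compiler can prove that i == 0 in any valid execution, so it
	 * may rewrite arr[i] as arr[0], dropping the dependency on the
	 * consume load -- and with it the ordering on Alpha/ARM/PowerPC. */
	r = arr[i];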

No doubt other memory-model issues will come up, but those are the ones
I know about at the moment.

As I said to begin with, hey, you asked!

That said, I would very much appreciate any thoughts or suggestions on
handling these issues.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10  3:51                   ` Paul E. McKenney
@ 2014-02-12  5:13                     ` Torvald Riegel
  2014-02-12 18:26                       ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-12  5:13 UTC (permalink / raw)
  To: paulmck
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Sun, 2014-02-09 at 19:51 -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote:
> > > > > > > > There are also so many ways to blow your head off it's untrue. For example,
> > > > > > > > cmpxchg takes a separate memory model parameter for failure and success, but
> > > > > > > > then there are restrictions on the sets you can use for each. It's not hard
> > > > > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > > > > > memory_order_seq_cst for everything, it's too hard otherwise". Then there's
> > > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > > > > > > > atm and optimises all of the data dependencies away) as well as the definition
> > > > > > > > of "data races", which seem to be used as an excuse to miscompile a program
> > > > > > > > at the earliest opportunity.
> > > > > > > 
> > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > > > > memory_order_consume until the compilers implement it both correctly and
> > > > > > > efficiently.  They are not there yet, and there is currently no shortage
> > > > > > > of compiler writers who would prefer to ignore memory_order_consume.
> > > > > > 
> > > > > > Do you have any input on
> > > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > > > > > language standard's definition of dependencies?
> > > > > 
> > > > > Let's see...  1.10p9 says that a dependency must be carried unless:
> > > > > 
> > > > > — B is an invocation of any specialization of std::kill_dependency (29.3), or
> > > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator,
> > > > > or
> > > > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > > > 
> > > > > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > > > > after the "?" will carry a dependency, so the code fragment in 59448
> > > > > needs to do the ordering rather than just optimizing "flag - flag" out
> > > > > of existence.  One way to do that on both ARM and Power is to actually
> > > > > emit code for "flag - flag", but there are a number of other ways to
> > > > > make that work.
> > > > 
> > > > And that's what would concern me, considering that these requirements
> > > > seem to be able to creep out easily.  Also, whereas the other atomics
> > > > just constrain compilers wrt. reordering across atomic accesses or
> > > > changes to the atomic accesses themselves, the dependencies are new
> > > > requirements on pieces of otherwise non-synchronizing code.  The latter
> > > > seems far more involved to me.
> > > 
> > > Well, the wording of 1.10p9 is pretty explicit on this point.
> > > There are only a few exceptions to the rule that dependencies from
> > > memory_order_consume loads must be tracked.  And to your point about
> > > requirements being placed on pieces of otherwise non-synchronizing code,
> > > we already have that with plain old load acquire and store release --
> > > both of these put ordering constraints that affect the surrounding
> > > non-synchronizing code.
> > 
> > I think there's a significant difference.  With acquire/release or more
> > general memory orders, it's true that we can't order _across_ the atomic
> > access.  However, we can reorder and optimize without additional
> > constraints if we do not reorder.  This is not the case with consume
> > memory order, as the (p + flag - flag) example shows.
> 
> Agreed, memory_order_consume does introduce additional restrictions.
> 
> > > This issue got a lot of discussion, and the compromise is that
> > > dependencies cannot leak into or out of functions unless the relevant
> > > parameters or return values are annotated with [[carries_dependency]].
> > > This means that the compiler can see all the places where dependencies
> > > must be tracked.  This is described in 7.6.4.
> > 
> > I wasn't aware of 7.6.4 (but it isn't referred to as an additional
> > constraint--what it is--in 1.10, so I guess at least that should be
> > fixed).
> > Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic
> > difference, at least if one is assuming that normal optimization of
> > sequential code is the default, and that maintaining things such as
> > (flag-flag) is not; if optimizing away (flag-flag) would require the
> > insertion of fences unless there is the carries_dependency attribute,
> > then this would be bad, I think.
> 
> No, the attribute does not make a semantic difference.  If a dependency
> flows into a function without [[carries_dependency]], the implementation
> is within its rights to emit an acquire barrier or similar.

So you can't just ignore the attribute when generating code -- you would
at the very least have to do this consistently across all the compilers
compiling parts of your code.  Which tells me that it does make a
semantic difference.  But I know that what should or should not be an
attribute is controversial.

> > IMHO, the dependencies construct (carries_dependency, kill_dependency)
> > seem to be backwards to me.  They assume that the compiler preserves all
> > those dependencies including (flag-flag) by default, which prohibits
> > meaningful optimizations.  Instead, I guess there should be a construct
> > for explicitly exploiting the dependencies (e.g., a
> > preserve_dependencies call, whose argument will not be optimized fully).
> > Exploiting dependencies will be special code and isn't the common case,
> > so it can require additional annotations.
> 
> If you are compiling a function that has no [[carries_dependency]]
> attributes on its arguments and return value, and none on any of the
> functions that it calls, and contains no memory_order_consume loads,
> then you can break dependencies all you like within that function.
> 
> That said, I am of course open to discussing alternative approaches.
> Especially those that ease the migration of the existing code in the
> Linux kernel that relies on dependencies.  ;-)
> 
> > > If a dependency chain
> > > headed by a memory_order_consume load goes into or out of a function
> > > without the aid of the [[carries_dependency]] attribute, the compiler
> > > needs to do something else to enforce ordering, e.g., emit a memory
> > > barrier.
> > 
> > I agree that this is a way to see it.  But I can't see how this will
> > motivate compiler implementers to not just emit a stronger barrier right
> > away.
> 
> That certainly has been the most common approach.
> 
> > > From a Linux-kernel viewpoint, this is a bit ugly, as it requires
> > > annotations and use of kill_dependency, but it was the best I could do
> > > at the time.  If things go as they usually do, there will be some other
> > > reason why those are needed...
> > 
> > Did you consider something along the "preserve_dependencies" call?  If
> > so, why did you go for kill_dependency?
> 
> Could you please give more detail on what a "preserve_dependencies" call
> would do and where it would be used?

Demarcating regions of code (or particular expressions) that require the
preservation of source-code-level dependencies (eg, "p + flag - flag"),
and which thus constrain the optimizations allowed on normal code.

What we have right now is a "blacklist" of things that kill
dependencies, which still requires compilers to not touch things like
"flag - flag".  Doing so isn't useful in most code, so it would be
better to have a "whitelist" of things in which dependencies are
strictly preserved.  The list of relevant expressions found in the
current RCU uses might be one such list.  Another would be to require
programmers to explicitly annotate the expressions where those
dependencies should be specially preserved, with something like a
"preserve_dependencies(p + flag - flag)" call.

I agree that this might be harder when dependency tracking is scattered
throughout a larger region of code, as you pointed out today.  But I'm
just looking for a setting that is more likely to see good support by
compilers.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10 19:09                           ` Linus Torvalds
  2014-02-11 15:59                             ` Paul E. McKenney
@ 2014-02-12  5:39                             ` Torvald Riegel
  2014-02-12 18:07                               ` Paul E. McKenney
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-12  5:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-10 at 11:09 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > Intuitively, this is wrong because this lets the program take a step
> > the abstract machine wouldn't do.  This is different to the sequential
> > code that Peter posted because it uses atomics, and thus one can't
> > easily assume that the difference is not observable.
> 
> Btw, what is the definition of "observable" for the atomics?
> 
> Because I'm hoping that it's not the same as for volatiles, where
> "observable" is about the virtual machine itself, and as such volatile
> accesses cannot be combined or optimized at all.

No, atomics aren't an observable behavior of the abstract machine
(unless they are volatile).  See 1.8p8 (citing the C++ standard).

> Now, I claim that atomic accesses cannot be done speculatively for
> writes, and not re-done for reads (because the value could change),

Agreed, unless the compiler can prove that this doesn't make a
difference in the program at hand and the accesses aren't volatile
atomics.  In general, that will be hard and thus won't happen often, I
suppose, but if correctly proved it would fall under the as-if rule I think.

> but *combining* them would be possible and good.

Agreed.

> For example, we often have multiple independent atomic accesses that
> could certainly be combined: testing the individual bits of an atomic
> value with helper functions, causing things like "load atomic, test
> bit, load same atomic, test another bit". The two atomic loads could
> be done as a single load without possibly changing semantics on a real
> machine, but if "visibility" is defined in the same way it is for
> "volatile", that wouldn't be a valid transformation. Right now we use
> "volatile" semantics for these kinds of things, and they really can
> hurt.

Agreed.  In your example, the compiler would have to prove that the
abstract machine would always be able to run the two loads atomically
(ie, as one load) without running into impossible/disallowed behavior of
the program.  But if there's no loop or branch or such in-between, this
should be straightforward because any hardware oddity or similar could
merge those loads and it wouldn't be disallowed by the standard
(considering that we're talking about a finite number of loads), so the
compiler would be allowed to do it as well.

> Same goes for multiple writes (possibly due to setting bits):
> combining multiple accesses into a single one is generally fine, it's
> *adding* write accesses speculatively that is broken by design..

Agreed.  As Paul points out, this being correct assumes that there are
no other ordering guarantees or memory accesses "interfering", but if
the stores are to the same memory location and adjacent to each other in
the program, then I don't see a reason why they wouldn't be combinable.
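
For instance, in this sketch (v is an atomic_int; tmp, BIT0, and BIT1
are hypothetical):

	/* Two adjacent relaxed stores to the same location... */
	atomic_store_explicit(&v, tmp | BIT0, memory_order_relaxed);
	atomic_store_explicit(&v, tmp | BIT0 | BIT1, memory_order_relaxed);

	/* ...may be combined into just the final store, since no other
	 * accesses or ordering constraints intervene: */
	atomic_store_explicit(&v, tmp | BIT0 | BIT1, memory_order_relaxed);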

> At the same time, you can't combine atomic loads or stores infinitely
> - "visibility" on a real machine definitely is about timeliness.
> Removing all but the last write when there are multiple consecutive
> writes is generally fine, even if you unroll a loop to generate those
> writes. But if what remains is a loop, it might be a busy-loop
> basically waiting for something, so it would be wrong ("untimely") to
> hoist a store in a loop entirely past the end of the loop, or hoist a
> load in a loop to before the loop.

Agreed.  That's what 1.10p24 and 1.10p25 are meant to specify for loads,
although those might not be bullet-proof as Paul points out.  Forward
progress is rather vaguely specified in the standard, but at least parts
of the committee (and people in ISO C++ SG1, in particular) are working
on trying to improve this.

> Does the standard allow for that kind of behavior?

I think the standard requires (or intends to require) the behavior that
you (and I) seem to prefer in these examples.



^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-11 15:59                             ` Paul E. McKenney
@ 2014-02-12  6:06                               ` Torvald Riegel
  2014-02-12  9:19                                 ` Peter Zijlstra
  2014-02-12 17:39                                 ` Paul E. McKenney
  0 siblings, 2 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-12  6:06 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, 2014-02-11 at 07:59 -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > >
> > > Intuitively, this is wrong because this lets the program take a step
> > > the abstract machine wouldn't do.  This is different to the sequential
> > > code that Peter posted because it uses atomics, and thus one can't
> > > easily assume that the difference is not observable.
> > 
> > Btw, what is the definition of "observable" for the atomics?
> > 
> > Because I'm hoping that it's not the same as for volatiles, where
> > "observable" is about the virtual machine itself, and as such volatile
> > accesses cannot be combined or optimized at all.
> > 
> > Now, I claim that atomic accesses cannot be done speculatively for
> > writes, and not re-done for reads (because the value could change),
> > but *combining* them would be possible and good.
> > 
> > For example, we often have multiple independent atomic accesses that
> > could certainly be combined: testing the individual bits of an atomic
> > value with helper functions, causing things like "load atomic, test
> > bit, load same atomic, test another bit". The two atomic loads could
> > be done as a single load without possibly changing semantics on a real
> > machine, but if "visibility" is defined in the same way it is for
> > "volatile", that wouldn't be a valid transformation. Right now we use
> > "volatile" semantics for these kinds of things, and they really can
> > hurt.
> > 
> > Same goes for multiple writes (possibly due to setting bits):
> > combining multiple accesses into a single one is generally fine, it's
> > *adding* write accesses speculatively that is broken by design..
> > 
> > At the same time, you can't combine atomic loads or stores infinitely
> > - "visibility" on a real machine definitely is about timeliness.
> > Removing all but the last write when there are multiple consecutive
> > writes is generally fine, even if you unroll a loop to generate those
> > writes. But if what remains is a loop, it might be a busy-loop
> > basically waiting for something, so it would be wrong ("untimely") to
> > hoist a store in a loop entirely past the end of the loop, or hoist a
> > load in a loop to before the loop.
> > 
> > Does the standard allow for that kind of behavior?
> 
> You asked!  ;-)
> 
> So the current standard allows merging of both loads and stores, unless of
> course ordering constraints prevent the merging.  Volatile semantics may be
> used to prevent this merging, if desired, for example, for real-time code.

Agreed.

> Infinite merging is intended to be prohibited, but I am not certain that
> the current wording is bullet-proof (1.10p24 and 1.10p25).

Yeah, maybe not.  But it at least seems to rather clearly indicate the
intent ;)

> The only prohibition against speculative stores that I can see is in a
> non-normative note, and it can be argued to apply only to things that are
> not atomics (1.10p22).

I think this one is specifically about speculative stores that would
affect memory locations that the abstract machine would not write to,
and that might be observable or create data races.  While a compiler
could potentially prove that such stores aren't leading to a difference
in the behavior of the program (e.g., by proving that there are no
observers anywhere and this isn't overlapping with any volatile
locations), I think that this is hard in general and most compilers will
just not do such things.  In GCC, bugs in that category were fixed after
researchers doing fuzz-testing found them (IIRC, speculative stores by
loops).
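
A minimal sketch of the kind of transformation at issue (hypothetical
non-atomic "x" that may be shared with another thread):

	int x;	/* potentially shared */

	void g(int a)
	{
		if (a)
			x = 1;
	}

	/* must not be compiled as if it were: */

	void g_broken(int a)
	{
		int tmp = x;

		x = 1;	/* speculative store the abstract machine
			 * might never perform */
		if (!a)
			x = tmp;
	}

The second form stores to x even when the abstract machine would not,
which can race with a concurrent reader of x.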

> I don't see any prohibition against reordering
> a store to precede a load preceding a conditional branch -- which would
> not be speculative if the branch was known to be taken and the load
> hit in the store buffer.  In a system where stores could be reordered,
> some other CPU might perceive the store as happening before the load
> that controlled the conditional branch.  This needs to be addressed.

I don't know the specifics of your example, but from how I understand
it, I don't see a problem if the compiler can prove that the store will
always happen.

To be more specific, if the compiler can prove that the store will
happen anyway, and the region of code can be assumed to always run
atomically (e.g., there's no loop or such in there), then it is known
that we have one atomic region of code that will always perform the
store, so we might as well do the stuff in the region in some order.

Now, if any of the memory accesses are atomic, then the whole region of
code containing those accesses is often not atomic because other threads
might observe intermediate results in a data-race-free way.

(I know that this isn't a very precise formulation, but I hope it brings
my line of reasoning across.)
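
A small sketch of that observability (hypothetical atomics "a" and "b",
both initially zero, stored to by one thread):

	atomic_store_explicit(&a, 1, memory_order_relaxed);
	/* another thread may legally observe a == 1 && b == 0 here,
	 * without any data race */
	atomic_store_explicit(&b, 1, memory_order_relaxed);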

> Why this hole?  At the time, the current formalizations of popular
> CPU architectures did not exist, and it was not clear that all popular
> hardware avoided speculative stores.  

I'm not quite sure which hole you see there.  Can you elaborate?



^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-12  6:06                               ` Torvald Riegel
@ 2014-02-12  9:19                                 ` Peter Zijlstra
  2014-02-12 17:42                                   ` Paul E. McKenney
  2014-02-14  5:07                                   ` Torvald Riegel
  2014-02-12 17:39                                 ` Paul E. McKenney
  1 sibling, 2 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-12  9:19 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: paulmck, Linus Torvalds, Will Deacon, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

> I don't know the specifics of your example, but from how I understand
> it, I don't see a problem if the compiler can prove that the store will
> always happen.
> 
> To be more specific, if the compiler can prove that the store will
> happen anyway, and the region of code can be assumed to always run
> atomically (e.g., there's no loop or such in there), then it is known
> that we have one atomic region of code that will always perform the
> store, so we might as well do the stuff in the region in some order.
> 
> Now, if any of the memory accesses are atomic, then the whole region of
> code containing those accesses is often not atomic because other threads
> might observe intermediate results in a data-race-free way.
> 
> (I know that this isn't a very precise formulation, but I hope it brings
> my line of reasoning across.)

So given something like:

	if (x)
		y = 3;

assuming both x and y are atomic (so don't gimme crap for not knowing
the C11 atomic incantations); and you can prove x is always true; you
don't see a problem with not emitting the conditional?

Avoiding the conditional changes the result; see that control dependency
email from earlier. In the above example the load of X and the store to
Y are strictly ordered, due to control dependencies. Not emitting the
condition and maybe not even emitting the load completely wrecks this.

It's therefore an invalid optimization to take out the conditional or
speculate the store, since it takes out the dependency.
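
For reference, the fragment spelled with C11 atomics might look like
this (a sketch; x and y assumed to be _Atomic ints):

	if (atomic_load_explicit(&x, memory_order_relaxed))
		atomic_store_explicit(&y, 3, memory_order_relaxed);

The point being that the conditional branch between the load and the
store is what provides the load-to-store ordering on weakly ordered
architectures such as ARM and Power.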

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-12  6:06                               ` Torvald Riegel
  2014-02-12  9:19                                 ` Peter Zijlstra
@ 2014-02-12 17:39                                 ` Paul E. McKenney
  1 sibling, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-12 17:39 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Feb 11, 2014 at 10:06:34PM -0800, Torvald Riegel wrote:
> On Tue, 2014-02-11 at 07:59 -0800, Paul E. McKenney wrote:
> > On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> > > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > > >
> > > > Intuitively, this is wrong because this lets the program take a step
> > > > the abstract machine wouldn't do.  This is different to the sequential
> > > > code that Peter posted because it uses atomics, and thus one can't
> > > > easily assume that the difference is not observable.
> > > 
> > > Btw, what is the definition of "observable" for the atomics?
> > > 
> > > Because I'm hoping that it's not the same as for volatiles, where
> > > "observable" is about the virtual machine itself, and as such volatile
> > > accesses cannot be combined or optimized at all.
> > > 
> > > Now, I claim that atomic accesses cannot be done speculatively for
> > > writes, and not re-done for reads (because the value could change),
> > > but *combining* them would be possible and good.
> > > 
> > > For example, we often have multiple independent atomic accesses that
> > > could certainly be combined: testing the individual bits of an atomic
> > > value with helper functions, causing things like "load atomic, test
> > > bit, load same atomic, test another bit". The two atomic loads could
> > > be done as a single load without possibly changing semantics on a real
> > > machine, but if "visibility" is defined in the same way it is for
> > > "volatile", that wouldn't be a valid transformation. Right now we use
> > > "volatile" semantics for these kinds of things, and they really can
> > > hurt.
> > > 
> > > Same goes for multiple writes (possibly due to setting bits):
> > > combining multiple accesses into a single one is generally fine, it's
> > > *adding* write accesses speculatively that is broken by design..
> > > 
> > > At the same time, you can't combine atomic loads or stores infinitely
> > > - "visibility" on a real machine definitely is about timeliness.
> > > Removing all but the last write when there are multiple consecutive
> > > writes is generally fine, even if you unroll a loop to generate those
> > > writes. But if what remains is a loop, it might be a busy-loop
> > > basically waiting for something, so it would be wrong ("untimely") to
> > > hoist a store in a loop entirely past the end of the loop, or hoist a
> > > load in a loop to before the loop.
> > > 
> > > Does the standard allow for that kind of behavior?
> > 
> > You asked!  ;-)
> > 
> > So the current standard allows merging of both loads and stores, unless of
> > course ordering constraints prevent the merging.  Volatile semantics may be
> > used to prevent this merging, if desired, for example, for real-time code.
> 
> Agreed.
> 
> > Infinite merging is intended to be prohibited, but I am not certain that
> > the current wording is bullet-proof (1.10p24 and 1.10p25).
> 
> Yeah, maybe not.  But it at least seems to rather clearly indicate the
> intent ;)

That is my hope.  ;-)

> > The only prohibition against speculative stores that I can see is in a
> > non-normative note, and it can be argued to apply only to things that are
> > not atomics (1.10p22).
> 
> I think this one is specifically about speculative stores that would
> affect memory locations that the abstract machine would not write to,
> and that might be observable or create data races.  While a compiler
> could potentially prove that such stores aren't leading to a difference
> in the behavior of the program (e.g., by proving that there are no
> observers anywhere and this isn't overlapping with any volatile
> locations), I think that this is hard in general and most compilers will
> just not do such things.  In GCC, bugs in that category were fixed after
> researchers doing fuzz-testing found them (IIRC, speculative stores by
> loops).

And that is my fear.  ;-)

> > I don't see any prohibition against reordering
> > a store to precede a load preceding a conditional branch -- which would
> > not be speculative if the branch was known to be taken and the load
> > hit in the store buffer.  In a system where stores could be reordered,
> > some other CPU might perceive the store as happening before the load
> > that controlled the conditional branch.  This needs to be addressed.
> 
> I don't know the specifics of your example, but from how I understand
> it, I don't see a problem if the compiler can prove that the store will
> always happen.

The current Documentation/memory-barriers.txt formulation requires
that both the load and the store have volatile semantics.  Does
that help?

> To be more specific, if the compiler can prove that the store will
> happen anyway, and the region of code can be assumed to always run
> atomically (e.g., there's no loop or such in there), then it is known
> that we have one atomic region of code that will always perform the
> store, so we might as well do the stuff in the region in some order.

And it would be very hard to write a program that proved that the
store had been reordered prior to the load in this case.

> Now, if any of the memory accesses are atomic, then the whole region of
> code containing those accesses is often not atomic because other threads
> might observe intermediate results in a data-race-free way.
> 
> (I know that this isn't a very precise formulation, but I hope it brings
> my line of reasoning across.)
> 
> > Why this hole?  At the time, the current formalizations of popular
> > CPU architectures did not exist, and it was not clear that all popular
> > hardware avoided speculative stores.  
> 
> I'm not quite sure which hole you see there.  Can you elaborate?

Here is one attempt, based on Peter's example later in this thread:

	#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

	atomic_int x, y;  /* Default initialization to zero. */
	int r1;

	void T0(void)
	{
		if (atomic_load_explicit(&ACCESS_ONCE(x), memory_order_relaxed))
			atomic_store_explicit(&ACCESS_ONCE(y), 1, memory_order_relaxed);
	}

	void T1(void)
	{
		r1 = atomic_load_explicit(&y, memory_order_seq_cst);
		/* Might also need an atomic_thread_fence() here... */
		atomic_store_explicit(&x, 1, memory_order_seq_cst);
	}

	assert(r1 == 0);

Peter and I would like the assertion to never trigger in this case,
but the current C11 does not seem to guarantee this.  I believe that
the volatile casts forbid the compiler from deciding to omit T0's "if"
even in cases where it could prove that x was always zero.  And given
the store to x, it should not be able to prove x constant here, right?

Again, I believe that current C11 does not guarantee that the assertion
will never trigger, but it would be good if one of the successors to
C11 did make that guarantee.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-12  9:19                                 ` Peter Zijlstra
@ 2014-02-12 17:42                                   ` Paul E. McKenney
  2014-02-12 18:12                                     ` Peter Zijlstra
  2014-02-14  5:07                                   ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-12 17:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Torvald Riegel, Linus Torvalds, Will Deacon,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Feb 12, 2014 at 10:19:07AM +0100, Peter Zijlstra wrote:
> > I don't know the specifics of your example, but from how I understand
> > it, I don't see a problem if the compiler can prove that the store will
> > always happen.
> > 
> > To be more specific, if the compiler can prove that the store will
> > happen anyway, and the region of code can be assumed to always run
> > atomically (e.g., there's no loop or such in there), then it is known
> > that we have one atomic region of code that will always perform the
> > store, so we might as well do the stuff in the region in some order.
> > 
> > Now, if any of the memory accesses are atomic, then the whole region of
> > code containing those accesses is often not atomic because other threads
> > might observe intermediate results in a data-race-free way.
> > 
> > (I know that this isn't a very precise formulation, but I hope it brings
> > my line of reasoning across.)
> 
> So given something like:
> 
> 	if (x)
> 		y = 3;
> 
> assuming both x and y are atomic (so don't gimme crap for not knowing
> the C11 atomic incantations); and you can prove x is always true; you
> don't see a problem with not emitting the conditional?

You need volatile semantics to force the compiler to ignore any proofs
it might otherwise attempt to construct.  Hence all the ACCESS_ONCE()
calls in my email to Torvald.  (Hopefully I translated your example
reasonably.)

							Thanx, Paul

> Avoiding the conditional changes the result; see that control dependency
> email from earlier. In the above example the load of X and the store to
> Y are strictly ordered, due to control dependencies. Not emitting the
> condition and maybe not even emitting the load completely wrecks this.
> 
> It's therefore an invalid optimization to take out the conditional or
> speculate the store, since it takes out the dependency.
> 


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-12  5:39                             ` Torvald Riegel
@ 2014-02-12 18:07                               ` Paul E. McKenney
  2014-02-12 20:22                                 ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-12 18:07 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Feb 11, 2014 at 09:39:24PM -0800, Torvald Riegel wrote:
> On Mon, 2014-02-10 at 11:09 -0800, Linus Torvalds wrote:
> > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > >
> > > Intuitively, this is wrong because this lets the program take a step
> > > the abstract machine wouldn't do.  This is different to the sequential
> > > code that Peter posted because it uses atomics, and thus one can't
> > > easily assume that the difference is not observable.
> > 
> > Btw, what is the definition of "observable" for the atomics?
> > 
> > Because I'm hoping that it's not the same as for volatiles, where
> > "observable" is about the virtual machine itself, and as such volatile
> > accesses cannot be combined or optimized at all.
> 
> No, atomics aren't an observable behavior of the abstract machine
> (unless they are volatile).  See 1.8p8 (citing the C++ standard).

Us Linux-kernel hackers will often need to use volatile semantics in
combination with C11 atomics in most cases.  The C11 atomics do cover
some of the reasons we currently use ACCESS_ONCE(), but not all of them --
in particular, it allows load/store merging.
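
A small sketch of the difference (hypothetical shared "x"; plain in
the first pair, _Atomic in the second):

	a = ACCESS_ONCE(x);
	b = ACCESS_ONCE(x);	/* volatile: both loads must be emitted */

	a = atomic_load_explicit(&x, memory_order_relaxed);
	b = atomic_load_explicit(&x, memory_order_relaxed);
				/* relaxed: may be merged into one load */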

> > Now, I claim that atomic accesses cannot be done speculatively for
> > writes, and not re-done for reads (because the value could change),
> 
> Agreed, unless the compiler can prove that this doesn't make a
> difference in the program at hand and it's not volatile atomics.  In
> general, that will be hard and thus won't happen often I suppose, but if
> correctly proved it would fall under the as-if rule I think.
> 
> > but *combining* them would be possible and good.
> 
> Agreed.

In some cases, agreed.  But many uses in the Linux kernel will need
volatile semantics in combination with C11 atomics.  Which is OK, for
the foreseeable future, anyway.

> > For example, we often have multiple independent atomic accesses that
> > could certainly be combined: testing the individual bits of an atomic
> > value with helper functions, causing things like "load atomic, test
> > bit, load same atomic, test another bit". The two atomic loads could
> > be done as a single load without possibly changing semantics on a real
> > machine, but if "visibility" is defined in the same way it is for
> > "volatile", that wouldn't be a valid transformation. Right now we use
> > "volatile" semantics for these kinds of things, and they really can
> > hurt.
> 
> Agreed.  In your example, the compiler would have to prove that the
> abstract machine would always be able to run the two loads atomically
> (ie, as one load) without running into impossible/disallowed behavior of
> the program.  But if there's no loop or branch or such in-between, this
> should be straight-forward because any hardware oddity or similar could
> merge those loads and it wouldn't be disallowed by the standard
> (considering that we're talking about a finite number of loads), so the
> compiler would be allowed to do it as well.

As long as they are not marked volatile, agreed.

							Thanx, Paul

> > Same goes for multiple writes (possibly due to setting bits):
> > combining multiple accesses into a single one is generally fine, it's
> > *adding* write accesses speculatively that is broken by design..
> 
> Agreed.  As Paul points out, this being correct assumes that there are
> no other ordering guarantees or memory accesses "interfering", but if
> the stores are to the same memory location and adjacent to each other in
> the program, then I don't see a reason why they wouldn't be combinable.
> 
> > At the same time, you can't combine atomic loads or stores infinitely
> > - "visibility" on a real machine definitely is about timeliness.
> > Removing all but the last write when there are multiple consecutive
> > writes is generally fine, even if you unroll a loop to generate those
> > writes. But if what remains is a loop, it might be a busy-loop
> > basically waiting for something, so it would be wrong ("untimely") to
> > hoist a store in a loop entirely past the end of the loop, or hoist a
> > load in a loop to before the loop.
> 
> Agreed.  That's what 1.10p24 and 1.10p25 are meant to specify for loads,
> although those might not be bullet-proof as Paul points out.  Forward
> progress is rather vaguely specified in the standard, but at least parts
> of the committee (and people in ISO C++ SG1, in particular) are working
> on trying to improve this.
> 
> > Does the standard allow for that kind of behavior?
> 
> I think the standard requires (or intends to require) the behavior that
> you (and I) seem to prefer in these examples.
> 
> 


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-12 17:42                                   ` Paul E. McKenney
@ 2014-02-12 18:12                                     ` Peter Zijlstra
  2014-02-17 18:18                                       ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-12 18:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Torvald Riegel, Linus Torvalds, Will Deacon,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
> You need volatile semantics to force the compiler to ignore any proofs
> it might otherwise attempt to construct.  Hence all the ACCESS_ONCE()
> calls in my email to Torvald.  (Hopefully I translated your example
> reasonably.)

My brain gave out for today, but it did appear to have the right
structure.

I would prefer it if C11 did not require the volatile casts. It should
simply _never_ speculate with atomic writes, volatile or not.




^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-12  5:13                     ` Torvald Riegel
@ 2014-02-12 18:26                       ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-12 18:26 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Tue, Feb 11, 2014 at 09:13:34PM -0800, Torvald Riegel wrote:
> On Sun, 2014-02-09 at 19:51 -0800, Paul E. McKenney wrote:
> > On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote:
> > > On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> > > > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > > > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote:
> > > > > > > > > There are also so many ways to blow your head off it's untrue. For example,
> > > > > > > > > cmpxchg takes a separate memory model parameter for failure and success, but
> > > > > > > > > then there are restrictions on the sets you can use for each. It's not hard
> > > > > > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's
> > > > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > > > > > > > > atm and optimises all of the data dependencies away) as well as the definition
> > > > > > > > > of "data races", which seem to be used as an excuse to miscompile a program
> > > > > > > > > at the earliest opportunity.
> > > > > > > > 
> > > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > > > > > memory_order_consume until the compilers implement it both correctly and
> > > > > > > > efficiently.  They are not there yet, and there is currently no shortage
> > > > > > > > of compiler writers who would prefer to ignore memory_order_consume.
> > > > > > > 
> > > > > > > Do you have any input on
> > > > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > > > > > > language standard's definition of dependencies?
> > > > > > 
> > > > > > Let's see...  1.10p9 says that a dependency must be carried unless:
> > > > > > 
> > > > > > — B is an invocation of any specialization of std::kill_dependency (29.3), or
> > > > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator,
> > > > > > or
> > > > > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > > > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > > > > 
> > > > > > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > > > > > after the "?" will carry a dependency, so the code fragment in 59448
> > > > > > needs to do the ordering rather than just optimizing "flag - flag" out
> > > > > > of existence.  One way to do that on both ARM and Power is to actually
> > > > > > emit code for "flag - flag", but there are a number of other ways to
> > > > > > make that work.
> > > > > 
> > > > > And that's what would concern me, considering that these requirements
> > > > > seem to be able to creep out easily.  Also, whereas the other atomics
> > > > > just constrain compilers wrt. reordering across atomic accesses or
> > > > > changes to the atomic accesses themselves, the dependencies are new
> > > > > requirements on pieces of otherwise non-synchronizing code.  The latter
> > > > > seems far more involved to me.
> > > > 
> > > > Well, the wording of 1.10p9 is pretty explicit on this point.
> > > > There are only a few exceptions to the rule that dependencies from
> > > > memory_order_consume loads must be tracked.  And to your point about
> > > > requirements being placed on pieces of otherwise non-synchronizing code,
> > > > we already have that with plain old load acquire and store release --
> > > > both of these put ordering constraints that affect the surrounding
> > > > non-synchronizing code.
> > > 
> > > I think there's a significant difference.  With acquire/release or more
> > > general memory orders, it's true that we can't order _across_ the atomic
> > > access.  However, we can reorder and optimize without additional
> > > constraints if we do not reorder.  This is not the case with consume
> > > memory order, as the (p + flag - flag) example shows.
> > 
> > Agreed, memory_order_consume does introduce additional restrictions.
> > 
> > > > This issue got a lot of discussion, and the compromise is that
> > > > dependencies cannot leak into or out of functions unless the relevant
> > > > parameters or return values are annotated with [[carries_dependency]].
> > > > This means that the compiler can see all the places where dependencies
> > > > must be tracked.  This is described in 7.6.4.
> > > 
> > > I wasn't aware of 7.6.4 (but it isn't referred to as an additional
> > > constraint--what it is--in 1.10, so I guess at least that should be
> > > fixed).
> > > Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic
> > > difference, at least if one is assuming that normal optimization of
> > > sequential code is the default, and that maintaining things such as
> > > (flag-flag) is not; if optimizing away (flag-flag) would require the
> > > insertion of fences unless there is the carries_dependency attribute,
> > > then this would be bad I think.
> > 
> > No, the attribute does not make a semantic difference.  If a dependency
> > flows into a function without [[carries_dependency]], the implementation
> > is within its right to emit an acquire barrier or similar.
> 
> So you can't just ignore the attribute when generating code -- you would
> at the very least have to do this consistently across all the compilers
> compiling parts of your code.  Which tells me that it does make a
> semantic difference.  But I know that what should or should not be an
> attribute is controversial.

Actually, I believe that you -can- ignore the attribute when generating
code.  In that case, you do the same thing as you would if there were
no attribute, namely emit whatever barrier is required to render
irrelevant any dependency breaking that might occur in the called
function.

> > > IMHO, the dependencies construct (carries_dependency, kill_dependency)
> > > seem to be backwards to me.  They assume that the compiler preserves all
> > > those dependencies including (flag-flag) by default, which prohibits
> > > meaningful optimizations.  Instead, I guess there should be a construct
> > > for explicitly exploiting the dependencies (e.g., a
> > > preserve_dependencies call, whose argument will not be optimized fully).
> > > Exploiting dependencies will be special code and isn't the common case,
> > > so it can require additional annotations.
> > 
> > If you are compiling a function that has no [[carries_dependency]]
> > attributes on it arguments and return value, and none on any of the
> > functions that it calls, and contains no memory_order_consume loads,
> > then you can break dependencies all you like within that function.
> > 
> > That said, I am of course open to discussing alternative approaches.
> > Especially those that ease the migration of the existing code in the
> > Linux kernel that rely on dependencies.  ;-)
> > 
> > > > If a dependency chain
> > > > headed by a memory_order_consume load goes into or out of a function
> > > > without the aid of the [[carries_dependency]] attribute, the compiler
> > > > needs to do something else to enforce ordering, e.g., emit a memory
> > > > barrier.
> > > 
> > > I agree that this is a way to see it.  But I can't see how this will
> > > motivate compiler implementers to not just emit a stronger barrier right
> > > away.
> > 
> > That certainly has been the most common approach.
> > 
> > > > From a Linux-kernel viewpoint, this is a bit ugly, as it requires
> > > > annotations and use of kill_dependency, but it was the best I could do
> > > > at the time.  If things go as they usually do, there will be some other
> > > > reason why those are needed...
> > > 
> > > Did you consider something along the "preserve_dependencies" call?  If
> > > so, why did you go for kill_dependency?
> > 
> > Could you please give more detail on what a "preserve_dependencies" call
> > would do and where it would be used?
> 
> Demarcating regions of code (or particular expressions) that require the
> preservation of source-code-level dependencies (eg, "p + flag - flag"),
> and which thus constraints optimizations allowed on normal code.
> 
> What we have right now is a "blacklist" of things that kill
> dependencies, which still requires compilers to not touch things like
> "flag - flag".  Doing so isn't useful in most code, so it would be
> better to have a "whitelist" of things in which dependencies are
> strictly preserved.  The list of relevant expressions found in the
> current RCU uses might be one such list.  Another would be to require
> programmers to explicitly annotate the expressions where those
> dependencies should be specially preserved, with something like a
> "preserve_dependencies(p + flag - flag)" call.
> 
> I agree that this might be harder when dependency tracking is scattered
> throughout a larger region of code, as you pointed out today.  But I'm
> just looking for a setting that is more likely to see good support by
> compilers.

The 3.13 version of the Linux kernel contains almost 1300 instances
of the rcu_dereference() family of macros, including wrappers.  I have
thus far gone through about 300 of them by hand and another 200 via
scripts.  Here are the preliminary results:

Common operations on pointers returned from rcu_dereference():

->	To dereference, can be chained, assuming entire linked structure
	published with single rcu_assign_pointer().  The & prefix operator
	is applied to dereferenced pointer, as are many other things,
	including rcu_dereference() itself.

[]	To index array referenced by RCU-protected index,
	and to specify array indexed by something else.

& infix	To strip low-order bits, but clearly the mask must be non-zero.
	Should include infix "|" as well, but only if the mask has some
	zero bits.

+ infix	To index RCU-protected array.  (Should include infix "-" too.)

=	To assign to a temporary variable.

()	To invoke an RCU-protected pointer to a function.

cast	To emulate subtypes.

?:	To substitute defaults; however, dependencies need to be carried only
	through the middle and right-hand arguments.

For completeness, unary "*" and "&" should also be included, as these
are sometimes used in macros that apply casts.  One could argue that
symmetry would require that "|" be treated the same as infix "&",
excluding the case where the other operand is all ones, but I don't
feel strongly about "|".
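
A short sketch pulling several of these together (hypothetical
structures, not taken from actual kernel code):

	struct foo *p;

	p = rcu_dereference(gp);	/* heads the dependency chain */
	a = p->val;			/* ->  dereference */
	b = p->arr[i];			/* []  array indexing */
	q = (struct bar *)p->sub;	/* cast to emulate a subtype */
	fn = p->fn;			/* =   assign to a temporary */
	fn();				/* ()  invoke RCU-protected
					 *     function pointer */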

Please note that I am not aware of any reasonable compiler optimization
that would break dependency chains in these cases, at least not in the
case where the original memory_order_consume load had volatile semantics,
as rcu_dereference() and friends in fact do have.

Many dependency chains are short and contained, but there are quite a few
large ones, particularly in the networking code.  Here is one moderately
ornate example dependency chain:

o	The arp_process() function calls __in_dev_get_rcu(), which
	returns an RCU-protected pointer.

	Then arp_process() invokes the following macros and functions:
	
		IN_DEV_ROUTE_LOCALNET() -> ipv4_devconf_get()
		arp_ignore() -- which calls:
			IN_DEV_ARP_IGNORE() -> ipv4_devconf_get()
			inet_confirm_addr() -- which calls:
				dev_net() -- which calls:
					read_pnet()
		IN_DEV_ARPFILTER() -> ipv4_devconf_get()
		IN_DEV_CONF_GET() -> ipv4_devconf_get()
		arp_fwd_proxy() -- which calls:
			IN_DEV_PROXY_ARP() -> ipv4_devconf_get()
			IN_DEV_MEDIUM_ID() -> ipv4_devconf_get()
		arp_fwd_pvlan() -- which calls IN_DEV_PROXY_ARP_PVLAN(),
			which eventually maps to ipv4_devconf_get()
		pneigh_enqueue()

This sort of example is one reason why I would not look fondly on any
suggestion that required decorating the operators called out above.  That
said, I would be OK with dropping the non-volatile memory_order_consume
load from C11 and C++11 -- I don't see how memory_order_consume is useful
unless it includes volatile semantics.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-12 18:07                               ` Paul E. McKenney
@ 2014-02-12 20:22                                 ` Linus Torvalds
  2014-02-13  0:23                                   ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-12 20:22 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Us Linux-kernel hackers will often need to use volatile semantics in
> combination with C11 atomics in most cases.  The C11 atomics do cover
> some of the reasons we currently use ACCESS_ONCE(), but not all of them --
> in particular, it allows load/store merging.

I really disagree with the "will need to use volatile".

We should never need to use volatile (outside of whatever MMIO we do
using C) if C11 defines atomics correctly.

Allowing load/store merging is *fine*. All sane CPU's do that anyway -
it's called a cache - and there's no actual reason to think that
"ACCESS_ONCE()" has to mean our current "volatile".

Now, it's possible that the C standards simply get atomics _wrong_, so
that they create visible semantics that are different from what a CPU
cache already does, but that's a plain bug in the standard if so.

But merging loads and stores is fine. And I *guarantee* it is fine,
exactly because CPU's already do it, so claiming that the compiler
couldn't do it is just insanity.

Now, there are things that are *not* fine, like speculative stores
that could be visible to other threads. Those are *bugs* (either in
the compiler or in the standard), and anybody who claims otherwise is
not worth discussing with.

But I really really disagree with the "we might have to use
'volatile'". Because if we *ever* have to use 'volatile' with the
standard C atomic types, then we're just better off ignoring the
atomic types entirely, because they are obviously broken shit - and
we're better off doing it ourselves the way we have forever.

Seriously. This is not even hyperbole. It really is as simple as that.

                  Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-12 20:22                                 ` Linus Torvalds
@ 2014-02-13  0:23                                   ` Paul E. McKenney
  2014-02-13 20:03                                     ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-13  0:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Us Linux-kernel hackers will often need to use volatile semantics in
> > combination with C11 atomics in most cases.  The C11 atomics do cover
> > some of the reasons we currently use ACCESS_ONCE(), but not all of them --
> > in particular, it allows load/store merging.
> 
> I really disagree with the "will need to use volatile".
> 
> We should never need to use volatile (outside of whatever MMIO we do
> using C) if C11 defines atomics correctly.
> 
> Allowing load/store merging is *fine*. All sane CPU's do that anyway -
> it's called a cache - and there's no actual reason to think that
> "ACCESS_ONCE()" has to mean our current "volatile".
> 
> Now, it's possible that the C standards simply get atomics _wrong_, so
> that they create visible semantics that are different from what a CPU
> cache already does, but that's a plain bug in the standard if so.
> 
> But merging loads and stores is fine. And I *guarantee* it is fine,
> exactly because CPU's already do it, so claiming that the compiler
> couldn't do it is just insanity.

Agreed, both CPUs and compilers can merge loads and stores.  But CPUs
normally get their stores pushed through the store buffer in reasonable
time, and CPUs also use things like invalidations to ensure that a
store is seen in reasonable time by readers.  Compilers don't always
have these two properties, so we do need to be more careful of load
and store merging by compilers.

> Now, there are things that are *not* fine, like speculative stores
> that could be visible to other threads. Those are *bugs* (either in
> the compiler or in the standard), and anybody who claims otherwise is
> not worth discussing with.

And as near as I can tell, volatile semantics are required in C11 to
avoid speculative stores.  I might be wrong about this, and hope that
I am wrong.  But I am currently not seeing it in the current standard.
(Though I expect that most compilers would avoid speculating stores,
especially in the near term.)

> But I really really disagree with the "we might have to use
> 'volatile'". Because if we *ever* have to use 'volatile' with the
> standard C atomic types, then we're just better off ignoring the
> atomic types entirely, because they are obviously broken shit - and
> we're better off doing it ourselves the way we have forever.
> 
> Seriously. This is not even hyperbole. It really is as simple as that.

Agreed, if we are talking about replacing ACCESS_ONCE() with C11
relaxed atomics any time soon.  But someone porting Linux to a
new CPU architecture might use a carefully chosen subset of C11
atomics to implement some of the Linux atomic operations, especially
non-value-returning atomics such as atomic_inc().
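
For instance, such a port might sketch atomic_inc() along these lines
(hypothetical code, assuming the port declares the counter _Atomic):

	#include <stdatomic.h>

	typedef struct { _Atomic int counter; } atomic_t;

	static inline void atomic_inc(atomic_t *v)
	{
		/* non-value-returning, so no ordering is implied */
		atomic_fetch_add_explicit(&v->counter, 1,
					  memory_order_relaxed);
	}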

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-13  0:23                                   ` Paul E. McKenney
@ 2014-02-13 20:03                                     ` Torvald Riegel
  2014-02-14  2:01                                       ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-13 20:03 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote:
> On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote:
> > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> > >
> > > Us Linux-kernel hackers will often need to use volatile semantics in
> > > combination with C11 atomics in most cases.  The C11 atomics do cover
> > > some of the reasons we currently use ACCESS_ONCE(), but not all of them --
> > > in particular, it allows load/store merging.
> > 
> > I really disagree with the "will need to use volatile".
> > 
> > We should never need to use volatile (outside of whatever MMIO we do
> > using C) if C11 defines atomics correctly.
> > 
> > Allowing load/store merging is *fine*. All sane CPU's do that anyway -
> > it's called a cache - and there's no actual reason to think that
> > "ACCESS_ONCE()" has to mean our current "volatile".
> > 
> > Now, it's possible that the C standards simply get atomics _wrong_, so
> > that they create visible semantics that are different from what a CPU
> > cache already does, but that's a plain bug in the standard if so.
> > 
> > But merging loads and stores is fine. And I *guarantee* it is fine,
> > exactly because CPU's already do it, so claiming that the compiler
> > couldn't do it is just insanity.
> 
> Agreed, both CPUs and compilers can merge loads and stores.  But CPUs
> normally get their stores pushed through the store buffer in reasonable
> time, and CPUs also use things like invalidations to ensure that a
> store is seen in reasonable time by readers.  Compilers don't always
> have these two properties, so we do need to be more careful of load
> and store merging by compilers.

The standard's _wording_ is a little vague about forward-progress
guarantees, but I believe the vast majority of the people involved do
want compilers to not prevent forward progress.  There is of course a
difference whether a compiler establishes _eventual_ forward progress in
the sense of after 10 years or forward progress in a small bounded
interval of time, but this is a QoI issue, and good compilers won't want
to introduce unnecessary latencies.  I believe that it is fine if the
standard merely talks about eventual forward progress.

> > Now, there are things that are *not* fine, like speculative stores
> > that could be visible to other threads. Those are *bugs* (either in
> > the compiler or in the standard), and anybody who claims otherwise is
> > not worth discussing with.
> 
> And as near as I can tell, volatile semantics are required in C11 to
> avoid speculative stores.  I might be wrong about this, and hope that
> I am wrong.  But I am currently not seeing it in the current standard.
> (Though I expect that most compilers would avoid speculating stores,
> especially in the near term.)

This really depends on how we define speculative stores.  The memory
model is absolutely clear that programs have to behave as if executed by
the virtual machine, and that rules out speculative stores to volatiles
and other locations.  Under certain circumstances, there will be
"speculative" stores in the sense that they will happen at different
times than if you had a trivial implementation of the abstract machine.
But to be allowed to do that, the compiler has to prove that such a
transformation still fulfills the as-if rule.
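
A small sketch of a case that such a proof makes permissible
(hypothetical function-local "x" whose address never escapes):

	void h(int a)
	{
		int x;	/* local; no other thread can observe it */

		if (a)
			x = 1;
		else
			x = 2;
		use(x);	/* use() is a hypothetical consumer */
	}

Here the compiler may move or duplicate the stores to x freely, because
the as-if rule is trivially satisfied for a location no other thread
can observe.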

IOW, the abstract machine is what currently defines disallowed
speculative stores.  If you want to put *further* constraints on what
implementations are allowed to do, I suppose it is best to talk about
those and see how we can add rules that allow programmers to express
those constraints.  For example, control dependencies might be such a
case.  I don't have a specific suggestion -- maybe the control
dependencies are best tackled similar to consume dependencies (even
though we don't have a good solution for those yet).  But using
volatile accesses for that seems to be a big hammer, or even the wrong
one.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-13 20:03                                     ` Torvald Riegel
@ 2014-02-14  2:01                                       ` Paul E. McKenney
  2014-02-14  4:43                                         ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-14  2:01 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 13, 2014 at 12:03:57PM -0800, Torvald Riegel wrote:
> On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote:
> > On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote:
> > > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
> > > <paulmck@linux.vnet.ibm.com> wrote:
> > > >
> > > > Us Linux-kernel hackers will often need to use volatile semantics in
> > > > combination with C11 atomics in most cases.  The C11 atomics do cover
> > > > some of the reasons we currently use ACCESS_ONCE(), but not all of them --
> > > > in particular, it allows load/store merging.
> > > 
> > > I really disagree with the "will need to use volatile".
> > > 
> > > We should never need to use volatile (outside of whatever MMIO we do
> > > using C) if C11 defines atomics correctly.
> > > 
> > > Allowing load/store merging is *fine*. All sane CPU's do that anyway -
> > > it's called a cache - and there's no actual reason to think that
> > > "ACCESS_ONCE()" has to mean our current "volatile".
> > > 
> > > Now, it's possible that the C standards simply get atomics _wrong_, so
> > > that they create visible semantics that are different from what a CPU
> > > cache already does, but that's a plain bug in the standard if so.
> > > 
> > > But merging loads and stores is fine. And I *guarantee* it is fine,
> > > exactly because CPU's already do it, so claiming that the compiler
> > > couldn't do it is just insanity.
> > 
> > Agreed, both CPUs and compilers can merge loads and stores.  But CPUs
> > normally get their stores pushed through the store buffer in reasonable
> > time, and CPUs also use things like invalidations to ensure that a
> > store is seen in reasonable time by readers.  Compilers don't always
> > have these two properties, so we do need to be more careful of load
> > and store merging by compilers.
> 
> The standard's _wording_ is a little vague about forward-progress
> guarantees, but I believe the vast majority of the people involved do
> want compilers to not prevent forward progress.  There is of course a
> difference whether a compiler establishes _eventual_ forward progress in
> the sense of after 10 years or forward progress in a small bounded
> interval of time, but this is a QoI issue, and good compilers won't want
> to introduce unnecessary latencies.  I believe that it is fine if the
> standard merely talks about eventual forward progress.

The compiler will need to earn my trust on this one.  ;-)

> > > Now, there are things that are *not* fine, like speculative stores
> > > that could be visible to other threads. Those are *bugs* (either in
> > > the compiler or in the standard), and anybody who claims otherwise is
> > > not worth discussing with.
> > 
> > And as near as I can tell, volatile semantics are required in C11 to
> > avoid speculative stores.  I might be wrong about this, and hope that
> > I am wrong.  But I am currently not seeing it in the current standard.
> > (Though I expect that most compilers would avoid speculating stores,
> > especially in the near term.)
> 
> This really depends on how we define speculative stores.  The memory
> model is absolutely clear that programs have to behave as if executed by
> the virtual machine, and that rules out speculative stores to volatiles
> and other locations.  Under certain circumstances, there will be
> "speculative" stores in the sense that they will happen at different
> times than if you had a trivial implementation of the abstract machine.
> But to be allowed to do that, the compiler has to prove that such a
> transformation still fulfills the as-if rule.

Agreed, although the as-if rule would ignore control dependencies, since
these are not yet part of the standard (as you in fact note below).
I nevertheless consider myself at least somewhat reassured that current
C11 won't speculate stores.  My remaining concerns involve the compiler
proving to itself that a given branch is always taken, thus motivating
it to optimize the branch away -- though this is more properly a
control-dependency concern.

> IOW, the abstract machine is what currently defines disallowed
> speculative stores.  If you want to put *further* constraints on what
> implementations are allowed to do, I suppose it is best to talk about
> those and see how we can add rules that allow programmers to express
> those constraints.  For example, control dependencies might be such a
> case.  I don't have a specific suggestion -- maybe the control
> dependencies are best tackled similar to consume dependencies (even
> though we don't have a good solution for those yet).  But using
> volatile accesses for that seems to be a big hammer, or even the wrong
> one.

In current compilers, the two hammers we have are volatile and barrier().
But yes, it would be good to have something more focused.  One option
would be to propose memory_order_control loads to see how loudly the
committee screams.  One use case might be as follows:

	if (atomic_load(x, memory_order_control))
		atomic_store(y, memory_order_relaxed);

This could also be written:

	r1 = atomic_load(x, memory_order_control);
	if (r1)
		atomic_store(y, memory_order_relaxed);

A branch depending on the memory_order_control load could not be optimized
out, though I suppose that the compiler could substitute a memory-barrier
instruction for the branch.  Seems like it would take a very large number
of branches to equal the overhead of the memory barrier, though.

Another option would be to flag the conditional expression, prohibiting
the compiler from optimizing out any conditional branches.  Perhaps
something like this:

	r1 = atomic_load(x, memory_order_control);
	if (control_dependency(r1))
		atomic_store(y, memory_order_relaxed);

Other thoughts?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-14  2:01                                       ` Paul E. McKenney
@ 2014-02-14  4:43                                         ` Torvald Riegel
  2014-02-14 17:29                                           ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-14  4:43 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote:
> On Thu, Feb 13, 2014 at 12:03:57PM -0800, Torvald Riegel wrote:
> > On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote:
> > > On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote:
> > > > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
> > > > <paulmck@linux.vnet.ibm.com> wrote:
> > > > >
> > > > > Us Linux-kernel hackers will often need to use volatile semantics in
> > > > > combination with C11 atomics in most cases.  The C11 atomics do cover
> > > > > some of the reasons we currently use ACCESS_ONCE(), but not all of them --
> > > > > in particular, it allows load/store merging.
> > > > 
> > > > I really disagree with the "will need to use volatile".
> > > > 
> > > > We should never need to use volatile (outside of whatever MMIO we do
> > > > using C) if C11 defines atomics correctly.
> > > > 
> > > > Allowing load/store merging is *fine*. All sane CPU's do that anyway -
> > > > it's called a cache - and there's no actual reason to think that
> > > > "ACCESS_ONCE()" has to mean our current "volatile".
> > > > 
> > > > Now, it's possible that the C standards simply get atomics _wrong_, so
> > > > that they create visible semantics that are different from what a CPU
> > > > cache already does, but that's a plain bug in the standard if so.
> > > > 
> > > > But merging loads and stores is fine. And I *guarantee* it is fine,
> > > > exactly because CPU's already do it, so claiming that the compiler
> > > > couldn't do it is just insanity.
> > > 
> > > Agreed, both CPUs and compilers can merge loads and stores.  But CPUs
> > > normally get their stores pushed through the store buffer in reasonable
> > > time, and CPUs also use things like invalidations to ensure that a
> > > store is seen in reasonable time by readers.  Compilers don't always
> > > have these two properties, so we do need to be more careful of load
> > > and store merging by compilers.
> > 
> > The standard's _wording_ is a little vague about forward-progress
> > guarantees, but I believe the vast majority of the people involved do
> > want compilers to not prevent forward progress.  There is of course a
> > difference whether a compiler establishes _eventual_ forward progress in
> > the sense of after 10 years or forward progress in a small bounded
> > interval of time, but this is a QoI issue, and good compilers won't want
> > to introduce unnecessary latencies.  I believe that it is fine if the
> > standard merely talks about eventual forward progress.
> 
> The compiler will need to earn my trust on this one.  ;-)
> 
> > > > Now, there are things that are *not* fine, like speculative stores
> > > > that could be visible to other threads. Those are *bugs* (either in
> > > > the compiler or in the standard), and anybody who claims otherwise is
> > > > not worth discussing with.
> > > 
> > > And as near as I can tell, volatile semantics are required in C11 to
> > > avoid speculative stores.  I might be wrong about this, and hope that
> > > I am wrong.  But I am currently not seeing it in the current standard.
> > > (Though I expect that most compilers would avoid speculating stores,
> > > especially in the near term.)
> > 
> > This really depends on how we define speculative stores.  The memory
> > model is absolutely clear that programs have to behave as if executed by
> > the virtual machine, and that rules out speculative stores to volatiles
> > and other locations.  Under certain circumstances, there will be
> > "speculative" stores in the sense that they will happen at different
> > times than if you had a trivial implementation of the abstract machine.
> > But to be allowed to do that, the compiler has to prove that such a
> > transformation still fulfills the as-if rule.
> 
> Agreed, although the as-if rule would ignore control dependencies, since
> these are not yet part of the standard (as you in fact note below).
> I nevertheless consider myself at least somewhat reassured that current
> C11 won't speculate stores.  My remaining concerns involve the compiler
> proving to itself that a given branch is always taken, thus motivating
> it to optimize the branch away -- though this is more properly a
> control-dependency concern.
> 
> > IOW, the abstract machine is what currently defines disallowed
> > speculative stores.  If you want to put *further* constraints on what
> > implementations are allowed to do, I suppose it is best to talk about
> > those and see how we can add rules that allow programmers to express
> > those constraints.  For example, control dependencies might be such a
> > case.  I don't have a specific suggestion -- maybe the control
> > dependencies are best tackled similar to consume dependencies (even
> > though we don't have a good solution for those yet).  But using
> > volatile accesses for that seems to be a big hammer, or even the wrong
> > one.
> 
> In current compilers, the two hammers we have are volatile and barrier().
> But yes, it would be good to have something more focused.  One option
> would be to propose memory_order_control loads to see how loudly the
> committee screams.  One use case might be as follows:
> 
> 	if (atomic_load(x, memory_order_control))
> 		atomic_store(y, memory_order_relaxed);
> 
> This could also be written:
> 
> 	r1 = atomic_load(x, memory_order_control);
> 	if (r1)
> 		atomic_store(y, 1, memory_order_relaxed);
> 
> A branch depending on the memory_order_control load could not be optimized
> out, though I suppose that the compiler could substitute a memory-barrier
> instruction for the branch.  Seems like it would take a very large number
> of branches to equal the overhead of the memory barrier, though.
> 
> Another option would be to flag the conditional expression, prohibiting
> the compiler from optimizing out any conditional branches.  Perhaps
> something like this:
> 
> 	r1 = atomic_load(x, memory_order_control);
> 	if (control_dependency(r1))
> 		atomic_store(y, 1, memory_order_relaxed);

That's the one I had in mind and talked to you about earlier today.  My
gut feeling is that this is preferable to the other because it "marks"
the if-statement, so the compiler knows exactly which branches matter.
I'm not sure one would need the other memory order for that, if indeed
all you want is relaxed -> branch -> relaxed.  But maybe there are
corner cases (see the weaker-than-relaxed discussion in SG1 today).
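
To make that concrete, here is a rough sketch of the contrast (note
that memory_order_control and control_dependency() are hypothetical
extensions discussed in this thread, not part of C11, and x and y are
assumed to be suitable atomic objects):

	/* What the kernel effectively relies on today: the hardware
	 * orders the store after the load because of the branch, but
	 * nothing tells the compiler that this particular branch
	 * matters. */
	if (ACCESS_ONCE(x))
		ACCESS_ONCE(y) = 1;

	/* With the marking, the compiler knows it must preserve a
	 * conditional branch between the load and the store: */
	r1 = atomic_load(x, memory_order_control);
	if (control_dependency(r1))
		atomic_store(y, 1, memory_order_relaxed);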



^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-12  9:19                                 ` Peter Zijlstra
  2014-02-12 17:42                                   ` Paul E. McKenney
@ 2014-02-14  5:07                                   ` Torvald Riegel
  2014-02-14  9:50                                     ` Peter Zijlstra
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-14  5:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, Linus Torvalds, Will Deacon, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Wed, 2014-02-12 at 10:19 +0100, Peter Zijlstra wrote:
> > I don't know the specifics of your example, but from how I understand
> > it, I don't see a problem if the compiler can prove that the store will
> > always happen.
> > 
> > To be more specific, if the compiler can prove that the store will
> > happen anyway, and the region of code can be assumed to always run
> > atomically (e.g., there's no loop or such in there), then it is known
> > that we have one atomic region of code that will always perform the
> > store, so we might as well do the stuff in the region in some order.
> > 
> > Now, if any of the memory accesses are atomic, then the whole region of
> > code containing those accesses is often not atomic because other threads
> > might observe intermediate results in a data-race-free way.
> > 
> > (I know that this isn't a very precise formulation, but I hope it brings
> > my line of reasoning across.)
> 
> So given something like:
> 
> 	if (x)
> 		y = 3;
> 
> assuming both x and y are atomic (so don't gimme crap for not knowing
> the C11 atomic incantations); and you can prove x is always true; you
> don't see a problem with not emitting the conditional?

That depends on what your goal is.  It would be correct as far as the
standard is specified; this makes sense if all you want is indeed a
program that does what the abstract machine might do, and produces the
same output / side effects.

If you're trying to preserve the branch in the code emitted / executed
by the implementation, then it would not be correct.  But those branches
aren't specified as being part of the observable side effects.  In the
common case, this makes sense because it enables optimizations that are
useful; this line of reasoning also allows the compiler to merge some
atomic accesses in the way that Linus would like to see it.

> Avoiding the conditional changes the result; see that control dependency
> email from earlier.

It does not, with regard to how the standard defines "result".

> In the above example the load of X and the store to
> Y are strictly ordered, due to control dependencies. Not emitting the
> condition and maybe not even emitting the load completely wrecks this.

I think you're trying to solve this backwards.  You are looking at this
with an implicit wishlist of what the compiler should do (or how you
want to use the hardware), but this is not a viable specification that
one can write a compiler against.

We do need clear rules for what the compiler is allowed to do or not
(e.g., a memory model that models multi-threaded executions).  Otherwise
it's all hand-waving, and we're getting nowhere.  Thus, the way to
approach this is to propose a feature or change to the standard, make
sure that this is consistent and has no unintended side effects for
other aspects of compilation or other code, and then ask the compiler to
implement it.  IOW, we need a patch for where this all starts: in the
rules and requirements for compilation.

Paul and I are at the C++ meeting currently, and we had sessions in
which the concurrency study group talked about memory model issues like
dependency tracking and memory_order_consume.  Paul shared uses of
atomics (or likewise) in the kernel, and we discussed how the memory
model currently handles various cases and why, how one could express
other requirements consistently, and what is actually implementable in
practice.  I can't speak for Paul, but I thought those discussions were
productive.

> It's therefore an invalid optimization to take out the conditional or
> speculate the store, since it takes out the dependency.




^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-14  5:07                                   ` Torvald Riegel
@ 2014-02-14  9:50                                     ` Peter Zijlstra
  2014-02-14 19:19                                       ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-14  9:50 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: paulmck, Linus Torvalds, Will Deacon, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, Feb 13, 2014 at 09:07:55PM -0800, Torvald Riegel wrote:
> That depends on what your goal is.

A compiler that we don't need to fight in order to generate sane code
would be nice. But as Linus said; we can continue to ignore you lot and
go on as we've done.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-14  4:43                                         ` Torvald Riegel
@ 2014-02-14 17:29                                           ` Paul E. McKenney
  2014-02-14 19:21                                             ` Torvald Riegel
  2014-02-14 19:50                                             ` Linus Torvalds
  0 siblings, 2 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-14 17:29 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 13, 2014 at 08:43:01PM -0800, Torvald Riegel wrote:
> On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote:

[ . . . ]

> > Another option would be to flag the conditional expression, prohibiting
> > the compiler from optimizing out any conditional branches.  Perhaps
> > something like this:
> > 
> > 	r1 = atomic_load(x, memory_order_control);
> > 	if (control_dependency(r1))
> > 		atomic_store(y, 1, memory_order_relaxed);
> 
> That's the one I had in mind and talked to you about earlier today.  My
> gut feeling is that this is preferable to the other because it "marks"
> the if-statement, so the compiler knows exactly which branches matter.
> I'm not sure one would need the other memory order for that, if indeed
> all you want is relaxed -> branch -> relaxed.  But maybe there are
> corner cases (see the weaker-than-relaxed discussion in SG1 today).

Linus, Peter, any objections to marking places where we are relying on
ordering from control dependencies against later stores?  This approach
seems to me to have significant documentation benefits.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-14  9:50                                     ` Peter Zijlstra
@ 2014-02-14 19:19                                       ` Torvald Riegel
  0 siblings, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-14 19:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, Linus Torvalds, Will Deacon, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Fri, 2014-02-14 at 10:50 +0100, Peter Zijlstra wrote:
> On Thu, Feb 13, 2014 at 09:07:55PM -0800, Torvald Riegel wrote:
> > That depends on what your goal is.

First, I don't know why you quoted that, but without the context,
quoting it doesn't make sense.  Let me repeat the point.  The standard
is the rule set for the compiler.  Period.  The compiler does not just
serve the semantics that you might have in your head.  It does have to
do something meaningful for all of its users.  Thus, the goal for the
compiler is to properly compile programs in the language as specified.

If there is a deficiency in the standard (bug or missing feature) -- and
thus the specification, we need to have a patch for the standard that
fixes this deficiency.  If you think that this is the case, that's where
you fix it.

If your goal is to do wishful thinking, imagine some kind of semantics
in your head, and then assume that magically, implementations will do
just that, then that's bound to fail.

> A compiler that we don't need to fight in order to generate sane code
> would be nice. But as Linus said; we can continue to ignore you lot and
> go on as we've done.

I don't see why it's so hard to understand that you need to specify
semantics, and the place (or at least the base) for that is the
standard.  Aren't you guys the ones replying "send a patch"? :)

This isn't any different.  If you're uncomfortable working with the
standard, then say so, and reach out to people that aren't.

You can surely ignore the specification of the language(s) that you are
depending on.  But that won't help you.  If you want a change, get
involved.  (Oh, and claiming that the other side doesn't get it doesn't
count as getting involved.)

There's no fight between people here.  It's just a technical problem
that we have to solve in the right way.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-14 17:29                                           ` Paul E. McKenney
@ 2014-02-14 19:21                                             ` Torvald Riegel
  2014-02-14 19:50                                             ` Linus Torvalds
  1 sibling, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-14 19:21 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Fri, 2014-02-14 at 09:29 -0800, Paul E. McKenney wrote:
> On Thu, Feb 13, 2014 at 08:43:01PM -0800, Torvald Riegel wrote:
> > On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
> > > Another option would be to flag the conditional expression, prohibiting
> > > the compiler from optimizing out any conditional branches.  Perhaps
> > > something like this:
> > > 
> > > 	r1 = atomic_load(x, memory_order_control);
> > > 	if (control_dependency(r1))
> > > 		atomic_store(y, 1, memory_order_relaxed);
> > 
> > That's the one I had in mind and talked to you about earlier today.  My
> > gut feeling is that this is preferable to the other because it "marks"
> > the if-statement, so the compiler knows exactly which branches matter.
> > I'm not sure one would need the other memory order for that, if indeed
> > all you want is relaxed -> branch -> relaxed.  But maybe there are
> > corner cases (see the weaker-than-relaxed discussion in SG1 today).
> 
> Linus, Peter, any objections to marking places where we are relying on
> ordering from control dependencies against later stores?  This approach
> seems to me to have significant documentation benefits.

Let me note that, at least as far as I'm concerned, that's just a quick idea.
At least I haven't looked at (1) how to properly specify the semantics
of this, (2) whether it has any bad effects on unrelated code, (3) and
whether there are pitfalls for compiler implementations.  It looks not
too bad at first glance, though.



^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-14 17:29                                           ` Paul E. McKenney
  2014-02-14 19:21                                             ` Torvald Riegel
@ 2014-02-14 19:50                                             ` Linus Torvalds
  2014-02-14 20:02                                               ` Linus Torvalds
  2014-02-15 17:30                                               ` Torvald Riegel
  1 sibling, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-14 19:50 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Fri, Feb 14, 2014 at 9:29 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Linus, Peter, any objections to marking places where we are relying on
> ordering from control dependencies against later stores?  This approach
> seems to me to have significant documentation benefits.

Quite frankly, I think it's stupid, and the "documentation" is not a
benefit, it's just wrong.

How would you figure out whether your added "documentation" holds true
for particular branches but not others?

How could you *ever* trust a compiler that makes the dependency meaningful?

Again, let's keep this simple and sane:

 - if a compiler ever generates code where an atomic store movement is
"visible" in any way, then that compiler is broken shit.

I don't understand why you even argue this. Seriously, Paul, you seem
to *want* to think that "broken shit" is acceptable, and that we
should then add magic markers to say "now you need to *not* be broken
shit".

Here's a magic marker for you: DON'T USE THAT BROKEN COMPILER.

And if a compiler can *prove* that whatever code movement it does
cannot make a difference, then let it do so. No amount of
"documentation" should matter.

Seriously, this whole discussion has been completely moronic. I don't
understand why you even bring shit like this up:

> >     r1 = atomic_load(x, memory_order_control);
> >     if (control_dependency(r1))
> >             atomic_store(y, 1, memory_order_relaxed);

I mean, really? Anybody who writes code like that, or any compiler
where that "control_dependency()" marker makes any difference
what-so-ever for code generation should just be retroactively aborted.

There is absolutely *zero* reason for that "control_dependency()"
crap. If you ever find a reason for it, it is either because the
compiler is buggy, or because the standard is so shit that we should
never *ever* use the atomics.

Seriously. This thread has devolved into some kind of "just what kind
of idiotic compiler cesspool crap could we accept". Get away from that
f*cking mindset. We don't accept *any* crap.

Why are we still discussing this idiocy? It's irrelevant. If the
standard really allows random store speculation, the standard doesn't
matter, and sane people shouldn't waste their time arguing about it.

            Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-14 19:50                                             ` Linus Torvalds
@ 2014-02-14 20:02                                               ` Linus Torvalds
  2014-02-15  2:08                                                 ` Paul E. McKenney
  2014-02-15 17:45                                                 ` Torvald Riegel
  2014-02-15 17:30                                               ` Torvald Riegel
  1 sibling, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-14 20:02 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Why are we still discussing this idiocy? It's irrelevant. If the
> standard really allows random store speculation, the standard doesn't
> matter, and sane people shouldn't waste their time arguing about it.

Btw, the other part of this coin is that our manual types (using
volatile and various architecture-specific stuff) and our manual
barriers and inline asm accesses are generally *fine*.

The C11 stuff doesn't buy us anything. The argument that "new
architectures might want to use it" is pure and utter bollocks, since
unless the standard gets the thing *right*, nobody sane would ever use
it for some new architecture, when the sane thing to do is to just
fill in the normal barriers and inline asms.

So I'm very very serious: either the compiler and the standard gets
things right, or we don't use it. There is no middle ground where "we
might use it for one or two architectures and add random hints".
That's just stupid.

The only "middle ground" is about which compiler version we end up
trusting _if_ it turns out that the compiler and standard do get
things right. From Torvald's explanations (once I don't mis-read them
;), my take-away so far has actually been that the standard *does* get
things right, but I do know from over-long personal experience that
compiler people sometimes want to be legalistic and twist the
documentation to the breaking point, at which point we just go "we'd
be crazy to use that".

See our use of "-fno-strict-aliasing", for example. The C standard
aliasing rules are a mistake, stupid, and wrong, and gcc uses those
stupid type-based alias rules even when statically *proving* the
aliasing gives the opposite result. End result: we turn the shit off.

Exact same deal wrt atomics. We are *not* going to add crazy "this
here is a control dependency" crap. There's a test, the compiler
*sees* the control dependency for chrissake, and it still generates
crap, we turn that broken "optimization" off. It really is that
simple.

            Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-14 20:02                                               ` Linus Torvalds
@ 2014-02-15  2:08                                                 ` Paul E. McKenney
  2014-02-15  2:44                                                   ` Linus Torvalds
  2014-02-15 17:45                                                 ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-15  2:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Fri, Feb 14, 2014 at 12:02:23PM -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Why are we still discussing this idiocy? It's irrelevant. If the
> > standard really allows random store speculation, the standard doesn't
> > matter, and sane people shouldn't waste their time arguing about it.
> 
> Btw, the other part of this coin is that our manual types (using
> volatile and various architecture-specific stuff) and our manual
> barriers and inline asm accesses are generally *fine*.
> 
> The C11 stuff doesn't buy us anything. The argument that "new
> architectures might want to use it" is pure and utter bollocks, since
> unless the standard gets the thing *right*, nobody sane would ever use
> it for some new architecture, when the sane thing to do is to just
> fill in the normal barriers and inline asms.
> 
> So I'm very very serious: either the compiler and the standard gets
> things right, or we don't use it. There is no middle ground where "we
> might use it for one or two architectures and add random hints".
> That's just stupid.
> 
> The only "middle ground" is about which compiler version we end up
> trusting _if_ it turns out that the compiler and standard do get
> things right. From Torvald's explanations (once I don't mis-read them
> ;), my take-away so far has actually been that the standard *does* get
> things right, but I do know from over-long personal experience that
> compiler people sometimes want to be legalistic and twist the
> documentation to the breaking point, at which point we just go "we'd
> be crazy to use that".
> 
> See our use of "-fno-strict-aliasing", for example. The C standard
> aliasing rules are a mistake, stupid, and wrong, and gcc uses those
> stupid type-based alias rules even when statically *proving* the
> aliasing gives the opposite result. End result: we turn the shit off.
> 
> Exact same deal wrt atomics. We are *not* going to add crazy "this
> here is a control dependency" crap. There's a test, the compiler
> *sees* the control dependency for chrissake, and it still generates
> crap, we turn that broken "optimization" off. It really is that
> simple.

From what I can see at the moment, the standard -generally- avoids
speculative stores, but there are a few corner cases where it might
allow them.  I will be working with the committee to see exactly what the
situation is.  Might be that I am confused and that everything really
is OK, might be that I am right but the corner cases are things that
no sane kernel developer would do anyway, it might be that the standard
needs a bit of repair, or it might be that the corner cases are somehow
inherent and problematic (but I hope not!).  I will let you know what
I find, but it will probably be a few months.

In the meantime, agreed, we keep doing what we have been doing.  And
maybe in the long term as well, for that matter.

One way of looking at the discussion between Torvald and myself would be
as a seller (Torvald) and a buyer (me) haggling over the fine print in
a proposed contract (the standard).  Whether that makes you feel better
or worse about the situation I cannot say.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-15  2:08                                                 ` Paul E. McKenney
@ 2014-02-15  2:44                                                   ` Linus Torvalds
  2014-02-15  2:48                                                     ` Linus Torvalds
  2014-02-15 18:07                                                     ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-15  2:44 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> One way of looking at the discussion between Torvald and myself would be
> as a seller (Torvald) and a buyer (me) haggling over the fine print in
> a proposed contract (the standard).  Whether that makes you feel better
> or worse about the situation I cannot say.  ;-)

Oh, I'm perfectly fine with that. But we're not desperate to buy, and
I actually think the C11 people are - or at least should be - *way*
more desperate to sell their model to us than we are to take it.

Which means that as a buyer you should say "this is what we want, if
you don't give us this, we'll just walk away". Not try to see how much
we can pay for it. Because there is very little upside for us, and
_unless_ the C11 standard gets things right it's just extra complexity
for us, coupled with new compiler fragility and years of code
generation bugs.

Why would we want that extra complexity and inevitable compiler bugs?
If we then have to fight compiler writers that point to the standard
and say "..but look, the standard says we can do this", then at that
point it went from "extra complexity and compiler bugs" to a whole
'nother level of frustration and pain.

So just walk away unless the C11 standard gives us exactly what we
want. Not "something kind of like what we'd use". EXACTLY. Because I'm
not in the least interested in fighting compiler people that have a
crappy standard they can point to. Been there, done that, got the
T-shirt and learnt my lesson.

And the thing is, I suspect that the Linux kernel is the most complete
- and most serious - user of true atomics that the C11 people can sell
their solution to.

If we don't buy it, they have no serious user. Sure, they'll have lots
of random other one-off users for their atomics, where each user wants
one particular thing, but I suspect that we'll have the only really
unified portable code base that handles pretty much *all* the serious
odd cases that the C11 atomics can actually talk about to each other.

Oh, they'll push things through with or without us, and it will be a
collection of random stuff, where they tried to please everybody, with
particularly compiler/architecture people who have no f*cking clue
about how their stuff is used pushing to make it easy/efficient for
their particular compiler/architecture.

But we have real optimized uses of pretty much all relevant cases that
people actually care about.

We can walk away from them, and not really lose anything but a small
convenience (and it's a convenience *only* if the standard gets things
right).

And conversely, the C11 people can walk away from us too. But if they
can't make us happy (and by "make us happy", I really mean no stupid
games on our part) I personally think they'll have a stronger
standard, and a real use case, and real arguments. I'm assuming they
want that.

That's why I complain when you talk about things like marking control
dependencies explicitly. That's *us* bending over backwards. And as a
buyer, we have absolutely zero reason to do that.

Tell the C11 people: "no speculative writes". Full stop. End of story.
Because we're not buying anything else.

Similarly, if we need to mark atomics "volatile", then now the C11
atomics are no longer even a "small convenience", now they are just
extra complexity wrt what we already have. So just make it clear that
if the C11 standard needs to mark atomics volatile in order to get
non-speculative and non-reloading behavior, then the C11 atomics are
useless to us, and we're not buying.

Remember: a compiler can *always* do "as if" optimizations - if a
compiler writer can prove that the end result acts 100% the same using
an optimized sequence, then they can do whatever the hell they want.
That's not the issue. But if we can *ever* see semantic impact of
speculative writes, the compiler is buggy, and the compiler writers
need to be aware that it is buggy. No ifs, buts, maybes about it.

So I'm perfectly fine with you seeing yourself as a buyer. But I want
you to be a really *picky* and anal buyer - one that knows he has the
upper hand, and can walk away with no downside.

                    Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-15  2:44                                                   ` Linus Torvalds
@ 2014-02-15  2:48                                                     ` Linus Torvalds
  2014-02-15  6:35                                                       ` Paul E. McKenney
  2014-02-15 18:07                                                     ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-15  2:48 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And conversely, the C11 people can walk away from us too. But if they
> can't make us happy (and by "make us happy", I really mean no stupid
> games on our part) I personally think they'll have a stronger
> standard, and a real use case, and real arguments. I'm assuming they
> want that.

I should have somebody who proof-reads my emails before I send them out.

I obviously meant "if they *can* make us happy" (not "can't").

            Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-15  2:48                                                     ` Linus Torvalds
@ 2014-02-15  6:35                                                       ` Paul E. McKenney
  2014-02-15  6:58                                                         ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-15  6:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Fri, Feb 14, 2014 at 06:48:02PM -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And conversely, the C11 people can walk away from us too. But if they
> > can't make us happy (and by "make us happy", I really mean no stupid
> > games on our part) I personally think they'll have a stronger
> > standard, and a real use case, and real arguments. I'm assuming they
> > want that.
> 
> I should have somebody who proof-reads my emails before I send them out.
> 
> I obviously meant "if they *can* make us happy" (not "can't").

Understood.  My next step is to take a more detailed look at the piece
of the standard that should support RCU.  Depending on how that turns
out, I might look at other parts of the standard vs. Linux's atomics
and memory-ordering needs.  Should be interesting.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-15  6:35                                                       ` Paul E. McKenney
@ 2014-02-15  6:58                                                         ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-15  6:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Fri, Feb 14, 2014 at 10:35:44PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 14, 2014 at 06:48:02PM -0800, Linus Torvalds wrote:
> > On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > And conversely, the C11 people can walk away from us too. But if they
> > > can't make us happy (and by "make us happy", I really mean no stupid
> > > games on our part) I personally think they'll have a stronger
> > > standard, and a real use case, and real arguments. I'm assuming they
> > > want that.
> > 
> > I should have somebody who proof-reads my emails before I send them out.
> > 
> > I obviously meant "if they *can* make us happy" (not "can't").
> 
> Understood.  My next step is to take a more detailed look at the piece
> of the standard that should support RCU.  Depending on how that turns
> out, I might look at other parts of the standard vs. Linux's atomics
> and memory-ordering needs.  Should be interesting.  ;-)

And perhaps a better way to represent the roles is that I am not the
buyer, but rather the purchasing agent for the -potential- buyer.  -You-
are of course the potential buyer.

If I were to see myself as the buyer, then I must confess that the
concerns you implicitly expressed in your prior email would be all too
well-founded!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-14 19:50                                             ` Linus Torvalds
  2014-02-14 20:02                                               ` Linus Torvalds
@ 2014-02-15 17:30                                               ` Torvald Riegel
  2014-02-15 19:15                                                 ` Linus Torvalds
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-15 17:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Fri, 2014-02-14 at 11:50 -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 9:29 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Linus, Peter, any objections to marking places where we are relying on
> > ordering from control dependencies against later stores?  This approach
> > seems to me to have significant documentation benefits.
> 
> Quite frankly, I think it's stupid, and the "documentation" is not a
> benefit, it's just wrong.

I think the example is easy to misunderstand, because the context isn't
clear.  Therefore, let me first try to clarify the background.

(1) The abstract machine does not write speculatively.
(2) Emitting a branch instruction and executing a branch at runtime is
not part of the specified behavior of the abstract machine.  Of course,
the abstract machine performs conditional execution, but that just
specifies the output / side effects that it must produce (e.g., volatile
stores) -- not with which hardware instructions it is producing this.
(3) A compiled program must produce the same output as if executed by
the abstract machine.

Thus, we need to be careful what "speculative store" is meant to refer
to.  A few examples:

if (atomic_load(&x, mo_relaxed) == 1)
  atomic_store(&y, 3, mo_relaxed);

Here, the requirement is that in terms of program logic, y is assigned 3
if x equals 1.  It's not specified how an implementation does that.
* If the compiler can prove that x is always 1, then it can remove the
branch.  This is because of (2).  Because of the proof, (1) is not
violated.
* If the compiler can prove that the store to y is never observed or
does not change the program's output, the store can be removed.
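
For instance, given a proof that x is always 1 at this point, a valid
as-if transformation of the example is simply (a sketch):

atomic_store(&y, 3, mo_relaxed);

The branch is gone, but because the proof shows the abstract machine
would always have taken it, no speculative store has been introduced.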

if (atomic_load(&x, mo_relaxed) == 1)
  { atomic_store(&y, 3, mo_relaxed); other_a(); }
else
  { atomic_store(&y, 3, mo_relaxed); other_b(); }

Here, y will be assigned to regardless of the value of x.
* The compiler can hoist the store out of the two branches.  This is
because the store and the branch instruction aren't observable outcomes
of the abstract machine.
* The compiler can even move the store to y before the load from x
(unless this affects the logical program order of this thread in some way).
This is because the load/store are ordered by sequenced-before
(intra-thread), but mo_relaxed allows the hardware to reorder, so the
compiler can do it as well (IOW, other threads can't expect a particular
order).
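
For illustration, one valid transformed form of this example is (a
sketch):

atomic_store(&y, 3, mo_relaxed);
if (atomic_load(&x, mo_relaxed) == 1)
  other_a();
else
  other_b();

The store has been hoisted out of both branches; as noted, it could
even be performed before the load.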

if (atomic_load(&x, mo_acquire) == 1)
  atomic_store(&y, 3, mo_relaxed);

This is similar to the first case, but with stronger memory order.
* If the compiler proves that x is always 1, then it does so by showing
that the load will always be able to read from a particular store (or
several of them) that (all) assigned 1 to x -- as specified by the
abstract machine and taking the forward progress guarantees into
account.  In general, it still has to establish the synchronized-with
edge if any of those stores used mo_release (or other fences resulting
in the same situation), so it can't just get rid of the acquire "fence"
in this case.  (There are probably situations in which this can be done,
but I can't characterize them easily at the moment.)


These examples all rely on the abstract machine as specified in the
current standard.  In contrast, the example that Paul (and Peter, I
assume) were looking at is not currently modeled by the standard.
AFAIU, they want to exploit that control dependencies, when encountered
in binary code, can result in the hardware giving certain ordering
guarantees.

This is vaguely similar to mo_consume which is about data dependencies.
mo_consume is, partially due to how it's specified, pretty hard to
implement for compilers in a way that actually exploits and preserves
data dependencies and not just substitutes mo_consume for a stronger
memory order.

Part of this problem is that the standard takes an opt-out approach
regarding the code that should track dependencies (e.g., certain
operators are specified as not preserving them), instead of cleanly
carving out meaningful operators where one can track dependencies
without obstructing generally useful compiler optimizations (i.e.,
"opt-in").  This leads to cases such as that in "*(p + f - f)", the
compiler either has to keep f - f or emit a stronger fence if f is
originating from a mo_consume load.  Furthermore, dependencies are
supposed to be tracked across any load and store, so the compiler needs
to do points-to analysis if it wants to optimize this as much as possible.

Paul and I have been thinking about alternatives, and one of them was
doing the opt-in by demarcating code that needs explicit dependency
tracking because it wants to exploit mo_consume.
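
As a minimal sketch of that opt-out problem (gf and p are made-up
names for illustration; assume gf is an _Atomic int and p a valid
pointer):

int f = atomic_load_explicit(&gf, memory_order_consume);
int v = *(p + f - f);  /* formally still carries a dependency on f */

Even though f - f is always zero, the dereference formally depends on
the mo_consume load, so the compiler must either keep the dead f - f
computation or give up and emit a stronger fence.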

Back to HW control dependencies, this led to the idea of marking the
"control dependencies" in the source code (ie, on the abstract machine
level), that need to be preserved in the generated binary code, even if
they have no semantic meaning on the abstract machine level.  So, this
is something extra that isn't modeled in the standard currently, because
of (1) and (2) above.

(Note that it's clearly possible that I misunderstand the goals of
Paul/Peter.  But then this would just indicate that working on precise
specifications does help :)

> How would you figure out whether your added "documentation" holds true
> for particular branches but not others?
> 
> How could you *ever* trust a compiler that makes the dependency meaningful?

Does the above clarify the situation?  If not, can you perhaps rephrase
any remaining questions?

> Again, let's keep this simple and sane:
> 
>  - if a compiler ever generates code where an atomic store movement is
> "visible" in any way, then that compiler is broken shit.

Unless volatile, the store is not part of the "visible" output of the
abstract machine, and thus an implementation "detail".  In turn, any
correct store movement must not affect the output of the program, so the
implementation detail remains invisible.

> I don't understand why you even argue this. Seriously, Paul, you seem
> to *want* to think that "broken shit" is acceptable, and that we
> should then add magic markers to say "now you need to *not* be broken
> shit".
> 
> Here's a magic marker for you: DON'T USE THAT BROKEN COMPILER.
> 
> And if a compiler can *prove* that whatever code movement it does
> cannot make a difference, then let it do so. No amount of
> "documentation" should matter.

Enabling that is certainly a goal of how the standard specifies all
this.  I'll let you sort out whether you want to exploit the control
dependency thing :)

> Seriously, this whole discussion has been completely moronic. I don't
> understand why you even bring shit like this up:
> 
> > >     r1 = atomic_load(x, memory_order_control);
> > >     if (control_dependency(r1))
> > >             atomic_store(y, 1, memory_order_relaxed);
> 
> I mean, really? Anybody who writes code like that, or any compiler
> where that "control_dependency()" marker makes any difference
> what-so-ever for code generation should just be retroactively aborted.

It doesn't make a difference in the standard as specified (well, there's
no control_dependency :).  I hope the background above clarifies the
discussed extension idea this originated from.

> There is absolutely *zero* reason for that "control_dependency()"
> crap. If you ever find a reason for it, it is either because the
> compiler is buggy, or because the standard is so shit that we should
> never *ever* use the atomics.
> 
> Seriously. This thread has devolved into some kind of "just what kind
> of idiotic compiler cesspool crap could we accept". Get away from that
> f*cking mindset. We don't accept *any* crap.
> 
> Why are we still discussing this idiocy? It's irrelevant. If the
> standard really allows random store speculation, the standard doesn't
> matter, and sane people shouldn't waste their time arguing about it.

It disallows it if this changes program semantics as specified by the
abstract machine.  Does that answer your concerns?  (Or, IOW, do you
still wonder whether it's crap? ;)


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-14 20:02                                               ` Linus Torvalds
  2014-02-15  2:08                                                 ` Paul E. McKenney
@ 2014-02-15 17:45                                                 ` Torvald Riegel
  2014-02-15 18:49                                                   ` Linus Torvalds
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-15 17:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Fri, 2014-02-14 at 12:02 -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Why are we still discussing this idiocy? It's irrelevant. If the
> > standard really allows random store speculation, the standard doesn't
> > matter, and sane people shouldn't waste their time arguing about it.
> 
> Btw, the other part of this coin is that our manual types (using
> volatile and various architecture-specific stuff) and our manual
> barriers and inline asm accesses are generally *fine*.

AFAICT, it does work for you, but hasn't been exactly pain-free.

I think a major benefit of C11's memory model is that it gives a
*precise* specification for how a compiler is allowed to optimize.
There is a formalization of the model, which allows things like the
cppmem tool by the Cambridge group.  It also allows meaningful fuzz
testing: http://www.di.ens.fr/~zappa/projects/cmmtest/ ; this did reveal
several GCC compiler bugs.

I also think that reasoning about this model is easier than reasoning
about how lots of different, concrete compiler optimizations would
interact.

> The C11 stuff doesn't buy us anything. The argument that "new
> architectures might want to use it" is pure and utter bollocks, since
> unless the standard gets the thing *right*, nobody sane would ever use
> it for some new architecture, when the sane thing to do is to just
> fill in the normal barriers and inline asms.
> 
> So I'm very very serious: either the compiler and the standard gets
> things right, or we don't use it. There is no middle ground where "we
> might use it for one or two architectures and add random hints".
> That's just stupid.
> 
> The only "middle ground" is about which compiler version we end up
> trusting _if_ it turns out that the compiler and standard do get
> things right. From Torvald's explanations (once I don't mis-read them
> ;), my take-away so far has actually been that the standard *does* get
> things right, but I do know from over-long personal experience that
> compiler people sometimes want to be legalistic and twist the
> documentation to the breaking point, at which point we just go "we'd
> be crazy to use that".

I agree that compilers want to optimize, and sometimes there's probably
a little too much emphasis on applying an optimization vs. not
surprising users.  But we have to draw a line (e.g., what is undefined
behavior and what is not), because we need this to actually be able to
optimize.

Therefore, we need to get the rules into a shape that both allows
optimizations and isn't full of surprising corner cases.  The rules are
the standard, so it's the standard we have to get right.  According to
my experience, a lot of thought goes into how to design the standard's
language and library so that they are intuitive yet efficient.

If you see issues in the standard, please bring them up.  Either report
the defects directly and get involved yourself, or reach out to somebody
that is participating in the standards process.

The standard certainly isn't perfect, so there is room to contribute.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-15  2:44                                                   ` Linus Torvalds
  2014-02-15  2:48                                                     ` Linus Torvalds
@ 2014-02-15 18:07                                                     ` Torvald Riegel
  2014-02-17 18:59                                                       ` Joseph S. Myers
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-15 18:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Fri, 2014-02-14 at 18:44 -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > One way of looking at the discussion between Torvald and myself would be
> > as a seller (Torvald) and a buyer (me) haggling over the fine print in
> > a proposed contract (the standard).  Whether that makes you feel better
> > or worse about the situation I cannot say.  ;-)
> 
> Oh, I'm perfectly fine with that. But we're not desperate to buy, and
> I actually think the C11 people are - or at least should be - *way*
> more desperate to sell their model to us than we are to take it.
> 
> Which means that as a buyer you should say "this is what we want, if
> you don't give us this, we'll just walk away". Not try to see how much
> we can pay for it. Because there is very little upside for us, and
> _unless_ the C11 standard gets things right it's just extra complexity
> for us, coupled with new compiler fragility and years of code
> generation bugs.

I think there is an upside to you, mainly in that it allows compiler
testing tools, potentially verification tools for atomics, and tools
like cppmem that show allowed executions of code.

I agree that the Linux community has been working well without this, and
it's big enough to make running it's own show viable.  This will be
different for smaller projects, though.

> Why would we want that extra complexity and inevitable compiler bugs?
> If we then have to fight compiler writers that point to the standard
> and say "..but look, the standard says we can do this", then at that
> point it went from "extra complexity and compiler bugs" to a whole
> 'nother level of frustration and pain.

I see your point, but the flip side of the coin is that if you get the
standard to say what you want, then you can tell the compiler writers to
look at the standard.  Or show them bugs revealed by fuzz testing and
such.

> So just walk away unless the C11 standard gives us exactly what we
> want. Not "something kind of like what we'd use". EXACTLY. Because I'm
> not in the least interested in fighting compiler people that have a
> crappy standard they can point to. Been there, done that, got the
> T-shirt and learnt my lesson.
> 
> And the thing is, I suspect that the Linux kernel is the most complete
> - and most serious - user of true atomics that the C11 people can sell
> their solution to.

I agree, but there are likely also other big projects that could make
use of C11 atomics on the userspace side (e.g., certain databases, ...).

> If we don't buy it, they have no serious user.

I disagree with that.  That obviously depends on one's definition of
"serious", but if you combine all C/C++ programs that use low-level
atomics, then this is serious use as well.  There's lots of
shared-memory synchronization in userspace as well.

> Sure, they'll have lots
> of random other one-off users for their atomics, where each user wants
> one particular thing, but I suspect that we'll have the only really
> unified portable code base

glibc is a counterexample that comes to mind, although it's a smaller
code base.  (It's currently not using C11 atomics, but transitioning
there makes sense, and is something I want to get to eventually.)

> that handles pretty much *all* the serious
> odd cases that the C11 atomics can actually talk about to each other.

You certainly have lots of odd cases, but I would disagree with the
assumption that only the Linux kernel will do full "testing" of the
implementations.  If you have plenty of userspace programs using the
atomics, that's a pretty big test suite, and one that should help bring
the compilers up to speed.  So that might be a benefit even to the Linux
kernel if it would use the C11 atomics.

> Oh, they'll push things through with or without us, and it will be a
> collection of random stuff, where they tried to please everybody, with
> particularly compiler/architecture people who have no f*cking clue
> about how their stuff is used pushing to make it easy/efficient for
> their particular compiler/architecture.

I'll ignore this... :)

> But we have real optimized uses of pretty much all relevant cases that
> people actually care about.

You certainly cover a lot of cases.  Finding out whether you cover all
that "people care about" would require you to actually ask all people,
which I'm sure you've done ;)

> We can walk away from them, and not really lose anything but a small
> convenience (and it's a convenience *only* if the standard gets things
> right).
> 
> And conversely, the C11 people can walk away from us too. But if they
> can't make us happy (and by "make us happy", I really mean no stupid
> games on our part) I personally think they'll have a stronger
> standard, and a real use case, and real arguments. I'm assuming they
> want that.

I agree.

> That's why I complain when you talk about things like marking control
> dependencies explicitly. That's *us* bending over backwards. And as a
> buyer, we have absolutely zero reason to do that.

As I understood the situation, it was rather like the buyer trying to
wrestle in an additional feature without having a proper specification
of what the feature should actually do.  That's why we've been talking
about what it should actually do, and how ...

> Tell the C11 people: "no speculative writes". Full stop. End of story.
> Because we're not buying anything else.
> 
> Similarly, if we need to mark atomics "volatile", then now the C11
> atomics are no longer even a "small convenience", now they are just
> extra complexity wrt what we already have. So just make it clear that
> if the C11 standard needs to mark atomics volatile in order to get
> non-speculative and non-reloading behavior, then the C11 atomics are
> useless to us, and we're not buying.

I hope what I wrote previously removes those concerns.

> Remember: a compiler can *always* do "as if" optimizations - if a
> compiler writer can prove that the end result acts 100% the same using
> an optimized sequence, then they can do whatever the hell they want.
> That's not the issue.

Good to know that we agree on that.

> But if we can *ever* see semantic impact of
> speculative writes, the compiler is buggy, and the compiler writers
> need to be aware that it is buggy. No ifs, buts, maybes about it.

Agreed.

> So I'm perfectly fine with you seeing yourself as a buyer. But I want
> you to be a really *picky* and anal buyer - one that knows he has the
> upper hand, and can walk away with no downside.

FWIW, I don't see myself as being the seller.  Instead, I hope we can
improve the situation for everyone involved, whether that involves C11
or something else.  I see a potential mutual benefit, and I want to try
to exploit it.  Regarding the distributions I'm concerned about, the
Linux kernel and GCC are very much in the same boat.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-15 17:45                                                 ` Torvald Riegel
@ 2014-02-15 18:49                                                   ` Linus Torvalds
  2014-02-17 19:55                                                     ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-15 18:49 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel <triegel@redhat.com> wrote:
>
> I think a major benefit of C11's memory model is that it gives a
> *precise* specification for how a compiler is allowed to optimize.

Clearly it does *not*. This whole discussion is proof of that. It's
not at all clear, and the standard apparently is at least debatably
allowing things that shouldn't be allowed. It's also a whole lot more
complicated than "volatile", so the likelihood of a compiler writer
actually getting it right - even if the standard does - is lower.
They've gotten "volatile" wrong too, after all (particularly in C++).

           Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-15 17:30                                               ` Torvald Riegel
@ 2014-02-15 19:15                                                 ` Linus Torvalds
  2014-02-17 22:09                                                   ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-15 19:15 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sat, Feb 15, 2014 at 9:30 AM, Torvald Riegel <triegel@redhat.com> wrote:
>
> I think the example is easy to misunderstand, because the context isn't
> clear.  Therefore, let me first try to clarify the background.
>
> (1) The abstract machine does not write speculatively.
> (2) Emitting a branch instruction and executing a branch at runtime is
> not part of the specified behavior of the abstract machine.  Of course,
> the abstract machine performs conditional execution, but that just
> specifies the output / side effects that it must produce (e.g., volatile
> stores) -- not with which hardware instructions it is producing this.
> (3) A compiled program must produce the same output as if executed by
> the abstract machine.

Ok, I'm fine with that.

> Thus, we need to be careful what "speculative store" is meant to refer
> to.  A few examples:
>
> if (atomic_load(&x, mo_relaxed) == 1)
>   atomic_store(&y, 3, mo_relaxed);

No, please don't use this idiotic example. It is wrong.

The fact is, if a compiler generates anything but the obvious sequence
(read/cmp/branch/store - where branch/store might obviously be done
with some other machine conditional like a predicate), the compiler is
wrong.

Anybody who argues anything else is wrong, or confused, or confusing.

Instead, argue about *other* sequences where the compiler can do something.

For example, this sequence:

   atomic_store(&x, a, mo_relaxed);
   b = atomic_load(&x, mo_relaxed);

can validly be transformed to

   atomic_store(&x, a, mo_relaxed);
   b = (typeof(x)) a;

and I think everybody agrees about that. In fact, that optimization
can be done even for mo_strict.

But even that "obvious" optimization has subtle cases. What if the
store is relaxed, but the load is strict? You can't do the
optimization without a lot of though, because dropping the strict load
would drop an ordering point. So even the "store followed by exact
same load" case has subtle issues.

With similar caveats, it is perfectly valid to merge two consecutive
loads, and to merge two consecutive stores.

Now that means that the sequence

    atomic_store(&x, 1, mo_relaxed);
    if (atomic_load(&x, mo_relaxed) == 1)
        atomic_store(&y, 3, mo_relaxed);

can first be optimized to

    atomic_store(&x, 1, mo_relaxed);
    if (1 == 1)
        atomic_store(&y, 3, mo_relaxed);

and then you get the end result that you wanted in the first place
(including the ability to re-order the two stores due to the relaxed
ordering, assuming they can be proven to not alias - and please don't
use the idiotic type-based aliasing rules).

Bringing up your first example is pure and utter confusion. Don't do
it. Instead, show what are obvious and valid transformations, and then
you can bring up these kinds of combinations as "look, this is
obviously also correct".

Now, the reason I want to make it clear that the code example you
point to is a crap example is that because it doesn't talk about the
*real* issue, it also misses a lot of really important details.

For example, when you say "if the compiler can prove that the
conditional is always true" then YOU ARE TOTALLY WRONG.

So why do I say you are wrong, after I just gave you an example of how
it happens? Because my example went back to the *real* issue, and
there are actual real semantically meaningful details with doing
things like load merging.

To give an example, let's rewrite things a bit more to use an extra variable:

    atomic_store(&x, 1, mo_relaxed);
    a = atomic_load(&x, mo_relaxed);
    if (a == 1)
        atomic_store(&y, 3, mo_relaxed);

which looks exactly the same. If you now say "if you can prove the
conditional is always true, you can make the store unconditional", YOU
ARE WRONG.

Why?

This sequence:

    atomic_store(&x, 1, mo_relaxed);
    a = atomic_load(&x, mo_relaxed);
    atomic_store(&y, 3, mo_relaxed);

is actually - and very seriously - buggy.

Why? Because you have effectively split the atomic_load into two loads
- one for the value of 'a', and one for your 'proof' that the store is
unconditional.
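
To make that concrete, imagine a hypothetical second thread running

    /* thread 2 */
    atomic_store(&x, 2, mo_relaxed);

concurrently. With the split load, thread 1 can end up with a == 2
*and* 3 stored to y. In the original program, y == 3 implies the load
saw 1, i.e. a == 1, so that outcome was impossible.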

Maybe you go "Nobody sane would do that", and you'd be right. But
compilers aren't "sane" (and compiler _writers_ I have some serious
doubts about), and how you describe the problem really does affect the
solution.

My description ("you can combine two subsequent atomic accesses, if
you are careful about still having the same ordering points") doesn't
have the confusion. It makes it clear that no, you can't speculate
writes, but yes, obviously you can combine certain things (think
"write buffers" and "memory caches"), and that may make you able to
remove the conditional. Which is very different from speculating
writes.

But my description also makes it clear that the transformation above
was buggy, but that rewriting it as

    a = 1;
    atomic_store(&y, 3, mo_relaxed);
    atomic_store(&x, a, mo_relaxed);

is fine (doing the re-ordering here just to make a point).

So I suspect we actually agree, I just think that the "example" that
has been floating around, and the explanation for it, has been
actively bad and misleading.

             Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-12 18:12                                     ` Peter Zijlstra
@ 2014-02-17 18:18                                       ` Paul E. McKenney
  2014-02-17 20:39                                         ` Richard Biener
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-17 18:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Torvald Riegel, Linus Torvalds, Will Deacon,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Feb 12, 2014 at 07:12:05PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
> > You need volatile semantics to force the compiler to ignore any proofs
> > it might otherwise attempt to construct.  Hence all the ACCESS_ONCE()
> > calls in my email to Torvald.  (Hopefully I translated your example
> > reasonably.)
> 
> My brain gave out for today; but it did appear to have the right
> structure.

I can relate.  ;-)

> I would prefer it if C11 would not require the volatile casts. It should
> simply _never_ speculate with atomic writes, volatile or not.

I agree with not needing volatiles to prevent speculated writes.  However,
they will sometimes be needed to prevent excessive load/store combining.
The compiler doesn't have the runtime feedback mechanisms that the
hardware has, and thus will need help from the developer from time
to time.
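
For reference, the "help" in question is just a volatile cast.  The
kernel's definition, plus a hypothetical use:

	/* include/linux/compiler.h */
	#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

	while (!ACCESS_ONCE(stop))   /* forces a real load every iteration */
		do_work();

The volatile access defeats whatever "proof" the compiler might
otherwise construct about the value of stop.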

Or maybe the Linux kernel simply waits to transition to C11 relaxed atomics
until the compiler has learned to be sufficiently conservative in its
load-store combining decisions.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-15 18:07                                                     ` Torvald Riegel
@ 2014-02-17 18:59                                                       ` Joseph S. Myers
  2014-02-17 19:19                                                         ` Will Deacon
  2014-02-17 19:41                                                         ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Joseph S. Myers @ 2014-02-17 18:59 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Paul McKenney, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sat, 15 Feb 2014, Torvald Riegel wrote:

> glibc is a counterexample that comes to mind, although it's a smaller
> code base.  (It's currently not using C11 atomics, but transitioning
> there makes sense, and is something I want to get to eventually.)

glibc is using C11 atomics (GCC builtins rather than _Atomic / 
<stdatomic.h>, but using __atomic_* with explicitly specified memory model 
rather than the older __sync_*) on AArch64, plus in certain cases on ARM 
and MIPS.
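
Illustratively (both builtin families are real GCC builtins; the 
variables are hypothetical):

    /* old family: barrier semantics are implicit (documented full barrier) */
    __sync_fetch_and_add(&counter, 1);

    /* new family: the memory model is spelled out at each call site */
    __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
    v = __atomic_load_n(&counter, __ATOMIC_ACQUIRE);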

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 18:59                                                       ` Joseph S. Myers
@ 2014-02-17 19:19                                                         ` Will Deacon
  2014-02-17 19:41                                                         ` Torvald Riegel
  1 sibling, 0 replies; 299+ messages in thread
From: Will Deacon @ 2014-02-17 19:19 UTC (permalink / raw)
  To: Joseph S. Myers
  Cc: Torvald Riegel, Linus Torvalds, Paul McKenney, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 17, 2014 at 06:59:31PM +0000, Joseph S. Myers wrote:
> On Sat, 15 Feb 2014, Torvald Riegel wrote:
> 
> > glibc is a counterexample that comes to mind, although it's a smaller
> > code base.  (It's currently not using C11 atomics, but transitioning
> > there makes sense, and is something I want to get to eventually.)
> 
> glibc is using C11 atomics (GCC builtins rather than _Atomic / 
> <stdatomic.h>, but using __atomic_* with explicitly specified memory model 
> rather than the older __sync_*) on AArch64, plus in certain cases on ARM 
> and MIPS.

Hmm, actually that results in a change in behaviour for the __sync_*
primitives on AArch64. The documentation for those states that:

  `In most cases, these built-in functions are considered a full barrier. That
  is, no memory operand is moved across the operation, either forward or
  backward. Further, instructions are issued as necessary to prevent the
  processor from speculating loads across the operation and from queuing stores
  after the operation.'

which is stronger than simply mapping them to memory_model_seq_cst, which
seems to be what the AArch64 compiler is doing (so you get acquire + release
instead of a full fence).
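
To make the contrast concrete (a sketch of the concern, not verified
output of any particular compiler version):

    /* documented: full barrier, nothing moves across it either way */
    __sync_fetch_and_add(&x, 1);

    /* but if it is simply lowered the same as */
    __atomic_fetch_add(&x, 1, __ATOMIC_SEQ_CST);
    /* then AArch64 can use a load-acquire/store-release pair with no
       trailing full fence, which is weaker than the quoted guarantee */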

Will

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 18:59                                                       ` Joseph S. Myers
  2014-02-17 19:19                                                         ` Will Deacon
@ 2014-02-17 19:41                                                         ` Torvald Riegel
  2014-02-17 23:12                                                           ` Joseph S. Myers
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-17 19:41 UTC (permalink / raw)
  To: Joseph S. Myers
  Cc: Linus Torvalds, Paul McKenney, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, 2014-02-17 at 18:59 +0000, Joseph S. Myers wrote:
> On Sat, 15 Feb 2014, Torvald Riegel wrote:
> 
> > glibc is a counterexample that comes to mind, although it's a smaller
> > code base.  (It's currently not using C11 atomics, but transitioning
> > there makes sense, and is something I want to get to eventually.)
> 
> glibc is using C11 atomics (GCC builtins rather than _Atomic / 
> <stdatomic.h>, but using __atomic_* with explicitly specified memory model 
> rather than the older __sync_*) on AArch64, plus in certain cases on ARM 
> and MIPS.

I think the major steps remaining are moving the other architectures
over, and rechecking concurrent code (e.g., for the code that I have
seen, it was either asm variants (e.g., on x86), or built before C11; ARM
pthread_once was lacking memory barriers (see the "pthread_once
unification" patches I posted)).  We should also move towards using
relaxed-MO atomic loads instead of plain loads.
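
A sketch of that last step (hypothetical flag, not actual glibc code):

    /* today: a plain load that races with concurrent writers */
    if (initialized) return;

    /* target: the same read made explicit as a relaxed atomic load */
    if (atomic_load_explicit(&initialized, memory_order_relaxed)) return;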




^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-15 18:49                                                   ` Linus Torvalds
@ 2014-02-17 19:55                                                     ` Torvald Riegel
  2014-02-17 20:18                                                       ` Linus Torvalds
  2014-02-17 20:23                                                       ` Paul E. McKenney
  0 siblings, 2 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-17 19:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sat, 2014-02-15 at 10:49 -0800, Linus Torvalds wrote:
> On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > I think a major benefit of C11's memory model is that it gives a
> > *precise* specification for how a compiler is allowed to optimize.
> 
> Clearly it does *not*. This whole discussion is proof of that. It's
> not at all clear,

It might not be an easy-to-understand specification, but as far as I'm
aware it is precise.  The Cambridge group's formalization certainly is
precise.  From that, one can derive (together with the usual rules for
as-if etc.) what a compiler is allowed to do (assuming that the standard
is indeed precise).  My replies in this discussion have been based on
reasoning about the standard, and not secret knowledge (with the
exception of no-out-of-thin-air, which is required in the standard's
prose but not yet formalized).

I agree that I'm using the formalization as a kind of placeholder for
the standard's prose (which isn't all that easy to follow for me
either), but I guess there's no way around an ISO standard using prose.

If you see a case in which the standard isn't precise, please bring it
up or open a C++ CWG issue for it.

> and the standard apparently is at least debatably
> allowing things that shouldn't be allowed.

Which example do you have in mind here?  Haven't we resolved all the
debated examples, or did I miss any?

> It's also a whole lot more
> complicated than "volatile", so the likelihood of a compiler writer
> actually getting it right - even if the standard does - is lower.

It's not easy, that's for sure, but none of the high-performance
alternatives are easy either.  There are testing tools out there based
on the formalization of the model, and we've found bugs with them.

And the alternative of using something not specified by the standard is
even worse, I think, because then you have to guess what a compiler
might do, without having any constraints; IOW, one is resorting to "no
sane compiler would do that", and that doesn't seem to be very robust
either.



^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 19:55                                                     ` Torvald Riegel
@ 2014-02-17 20:18                                                       ` Linus Torvalds
  2014-02-17 21:21                                                         ` Torvald Riegel
                                                                           ` (2 more replies)
  2014-02-17 20:23                                                       ` Paul E. McKenney
  1 sibling, 3 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-17 20:18 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel <triegel@redhat.com> wrote:
>
> Which example do you have in mind here?  Haven't we resolved all the
> debated examples, or did I miss any?

Well, Paul seems to still think that the standard possibly allows
speculative writes or possibly value speculation in ways that break
the hardware-guaranteed orderings.

And personally, I can't read standards paperwork. It is invariably
written in some basically impossible-to-understand lawyeristic mode,
and then it is read by people (compiler writers) that intentionally
try to mis-use the words and do language-lawyering ("that depends on
what the meaning of 'is' is"). The whole "lvalue vs rvalue expression
vs 'what is a volatile access'" thing for C++ was/is a great example
of that.

So quite frankly, as a result I refuse to have anything to do with the
process directly.

                 Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 19:55                                                     ` Torvald Riegel
  2014-02-17 20:18                                                       ` Linus Torvalds
@ 2014-02-17 20:23                                                       ` Paul E. McKenney
  2014-02-17 21:05                                                         ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-17 20:23 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 17, 2014 at 08:55:47PM +0100, Torvald Riegel wrote:
> On Sat, 2014-02-15 at 10:49 -0800, Linus Torvalds wrote:
> > On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > >
> > > I think a major benefit of C11's memory model is that it gives a
> > > *precise* specification for how a compiler is allowed to optimize.
> > 
> > Clearly it does *not*. This whole discussion is proof of that. It's
> > not at all clear,
> 
> It might not be an easy-to-understand specification, but as far as I'm
> aware it is precise.  The Cambridge group's formalization certainly is
> precise.  From that, one can derive (together with the usual rules for
> as-if etc.) what a compiler is allowed to do (assuming that the standard
> is indeed precise).  My replies in this discussion have been based on
> reasoning about the standard, and not secret knowledge (with the
> exception of no-out-of-thin-air, which is required in the standard's
> prose but not yet formalized).
> 
> I agree that I'm using the formalization as a kind of placeholder for
> the standard's prose (which isn't all that easy to follow for me
> either), but I guess there's no way around an ISO standard using prose.
> 
> If you see a case in which the standard isn't precise, please bring it
> up or open a C++ CWG issue for it.

I suggest that I go through the Linux kernel's requirements for atomics
and memory barriers and see how they map to C11 atomics.  With that done,
we would have very specific examples to go over.  Without that done, the
discussion won't converge very well.

Seem reasonable?

							Thanx, Paul

> > and the standard apparently is at least debatably
> > allowing things that shouldn't be allowed.
> 
> Which example do you have in mind here?  Haven't we resolved all the
> debated examples, or did I miss any?
> 
> > It's also a whole lot more
> > complicated than "volatile", so the likelihood of a compiler writer
> > actually getting it right - even if the standard does - is lower.
> 
> It's not easy, that's for sure, but none of the high-performance
> alternatives are easy either.  There are testing tools out there based
> on the formalization of the model, and we've found bugs with them.
> 
> And the alternative of using something not specified by the standard is
> even worse, I think, because then you have to guess what a compiler
> might do, without having any constraints; IOW, one is resorting to "no
> sane compiler would do that", and that doesn't seem to be very robust
> either.
> 
> 


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 18:18                                       ` Paul E. McKenney
@ 2014-02-17 20:39                                         ` Richard Biener
  2014-02-17 22:14                                           ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Richard Biener @ 2014-02-17 20:39 UTC (permalink / raw)
  To: paulmck, Paul E. McKenney, Peter Zijlstra
  Cc: Torvald Riegel, Linus Torvalds, Will Deacon,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On February 17, 2014 7:18:15 PM GMT+01:00, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
>On Wed, Feb 12, 2014 at 07:12:05PM +0100, Peter Zijlstra wrote:
>> On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
>> > You need volatile semantics to force the compiler to ignore any
>> > proofs it might otherwise attempt to construct.  Hence all the
>> > ACCESS_ONCE() calls in my email to Torvald.  (Hopefully I translated
>> > your example reasonably.)
>> 
>> My brain gave out for today; but it did appear to have the right
>> structure.
>
>I can relate.  ;-)
>
>> I would prefer it if C11 would not require the volatile casts. It
>> should simply _never_ speculate with atomic writes, volatile or not.
>
>I agree with not needing volatiles to prevent speculated writes.
>However, they will sometimes be needed to prevent excessive load/store
>combining.  The compiler doesn't have the runtime feedback mechanisms
>that the hardware has, and thus will need help from the developer from
>time to time.
>
>Or maybe the Linux kernel simply waits to transition to C11 relaxed
>atomics until the compiler has learned to be sufficiently conservative
>in its load-store combining decisions.

Sounds backwards. Currently the compiler does nothing to the atomics. I'm sure we'll eventually add something. But if testing coverage is zero outside then surely things get worse, not better with time.

Richard.

>							Thanx, Paul



^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 20:23                                                       ` Paul E. McKenney
@ 2014-02-17 21:05                                                         ` Torvald Riegel
  0 siblings, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-17 21:05 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, 2014-02-17 at 12:23 -0800, Paul E. McKenney wrote:
> On Mon, Feb 17, 2014 at 08:55:47PM +0100, Torvald Riegel wrote:
> > On Sat, 2014-02-15 at 10:49 -0800, Linus Torvalds wrote:
> > > On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > > >
> > > > I think a major benefit of C11's memory model is that it gives a
> > > > *precise* specification for how a compiler is allowed to optimize.
> > > 
> > > Clearly it does *not*. This whole discussion is proof of that. It's
> > > not at all clear,
> > 
> > It might not be an easy-to-understand specification, but as far as I'm
> > aware it is precise.  The Cambridge group's formalization certainly is
> > precise.  From that, one can derive (together with the usual rules for
> > as-if etc.) what a compiler is allowed to do (assuming that the standard
> > is indeed precise).  My replies in this discussion have been based on
> > reasoning about the standard, and not secret knowledge (with the
> > exception of no-out-of-thin-air, which is required in the standard's
> > prose but not yet formalized).
> > 
> > I agree that I'm using the formalization as a kind of placeholder for
> > the standard's prose (which isn't all that easy to follow for me
> > either), but I guess there's no way around an ISO standard using prose.
> > 
> > If you see a case in which the standard isn't precise, please bring it
> > up or open a C++ CWG issue for it.
> 
> I suggest that I go through the Linux kernel's requirements for atomics
> and memory barriers and see how they map to C11 atomics.  With that done,
> we would have very specific examples to go over.  Without that done, the
> discussion won't converge very well.
> 
> Seem reasonable?

Sounds good!


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 20:18                                                       ` Linus Torvalds
@ 2014-02-17 21:21                                                         ` Torvald Riegel
  2014-02-17 22:02                                                           ` Linus Torvalds
  2014-02-17 23:10                                                         ` Alec Teal
  2014-02-18  3:00                                                         ` Paul E. McKenney
  2 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-17 21:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-17 at 12:18 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > Which example do you have in mind here?  Haven't we resolved all the
> > debated examples, or did I miss any?
> 
> Well, Paul seems to still think that the standard possibly allows
> speculative writes or possibly value speculation in ways that break
> the hardware-guaranteed orderings.

That's true, I just didn't see any specific examples so far.

> And personally, I can't read standards paperwork. It is invariably
> written in some basically impossible-to-understand lawyeristic mode,

Yeah, it's not the most intuitive form for things like the memory model.

> and then it is read by people (compiler writers) that intentionally
> try to mis-use the words and do language-lawyering ("that depends on
> what the meaning of 'is' is").

That assumption about people working on compilers is a little too broad,
don't you think?

I think that it is important to stick to a specification, in the same
way that one wouldn't expect a program with undefined behavior make any
sense of it, magically, in cases where stuff is undefined.

However, that of course doesn't include trying to exploit weasel-wording
(BTW, both users and compiler writers try to do it).  IMHO,
weasel-wording in a standard is a problem in itself even if not
exploited, and often it indicates that there is a real issue.  There
might be reasons to have weasel-wording (e.g., because there's no known
better way to express it, as in the case of the not really precise
no-out-of-thin-air rule today), but nonetheless those aren't ideal.

> The whole "lvalue vs rvalue expression
> vs 'what is a volatile access'" thing for C++ was/is a great example
> of that.

I'm not aware of the details of this.

> So quite frankly, as a result I refuse to have anything to do with the
> process directly.

That's unfortunate.  Then please work with somebody that isn't
uncomfortable with participating directly in the process.  But be
warned, it may very well be a person working on compilers :)

Have you looked at the formalization of the model by Batty et al.?  The
overview of this is prose, but the formalized model itself is all formal
relations and logic.  So there should be no language-lawyering issues
with that form.  (For me, the formalized model is much easier to reason
about.)


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 21:21                                                         ` Torvald Riegel
@ 2014-02-17 22:02                                                           ` Linus Torvalds
  2014-02-17 22:25                                                             ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-17 22:02 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 17, 2014 at 1:21 PM, Torvald Riegel <triegel@redhat.com> wrote:
> On Mon, 2014-02-17 at 12:18 -0800, Linus Torvalds wrote:
>> and then it is read by people (compiler writers) that intentionally
>> try to mis-use the words and do language-lawyering ("that depends on
>> what the meaning of 'is' is").
>
> That assumption about people working on compilers is a little too broad,
> don't you think?

Let's just say that *some* are that way, and those are the ones that I
end up butting heads with.

The sane ones I never have to argue with - point them at a bug, and
they just say "yup, bug". The insane ones say "we don't need to fix
that, because if you read this copy of the standards that have been
translated to chinese and back, it clearly says that this is
acceptable".

>> The whole "lvalue vs rvalue expression
>> vs 'what is a volatile access'" thing for C++ was/is a great example
>> of that.
>
> I'm not aware of the details of this.

The argument was that an lvalue doesn't actually "access" the memory
(an rvalue does), so this:

   volatile int *p = ...;

   *p;

doesn't need to generate a load from memory, because "*p" is still an
lvalue (since you could assign things to it).

This isn't an issue in C, because in C, expression statements are
always rvalues, but C++ changed that. The people involved with the C++
standards have generally been totally clueless about their subtle
changes.

I may have misstated something, but basically some C++ people tried
very hard to make "volatile" useless.

We had other issues too. Like C compiler people who felt that the
type-based aliasing should always override anything else, even if the
variable accessed (through different types) was statically clearly
aliasing and used the exact same pointer. That made it impossible to
do a syntactically clean model of "this aliases", since the _only_
exception to the type-based aliasing rule was to generate a union for
every possible access pairing.

We turned off type-based aliasing (as I've mentioned before, I think
it's a fundamentally broken feature to begin with, and a horrible
horrible hack that adds no value for anybody but the HPC people).
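
A minimal illustration of the rule in question (hypothetical snippet;
the kernel rules all of this out by building with -fno-strict-aliasing):

    long f(void)
    {
        long l = 0;
        int *ip = (int *)&l;  /* same object, different declared type */

        *ip = 1;              /* type-based rules let the compiler assume
                                 this store cannot alias l ...           */
        return l;             /* ... so it may return the stale 0 here   */
    }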

Gcc eventually ended up having some sane syntax for overriding it, but
by then I was too disgusted with the people involved to even care.

               Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-15 19:15                                                 ` Linus Torvalds
@ 2014-02-17 22:09                                                   ` Torvald Riegel
  2014-02-17 22:32                                                     ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-17 22:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sat, 2014-02-15 at 11:15 -0800, Linus Torvalds wrote:
> On Sat, Feb 15, 2014 at 9:30 AM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > I think the example is easy to misunderstand, because the context isn't
> > clear.  Therefore, let me first try to clarify the background.
> >
> > (1) The abstract machine does not write speculatively.
> > (2) Emitting a branch instruction and executing a branch at runtime is
> > not part of the specified behavior of the abstract machine.  Of course,
> > the abstract machine performs conditional execution, but that just
> > specifies the output / side effects that it must produce (e.g., volatile
> > stores) -- not with which hardware instructions it is producing this.
> > (3) A compiled program must produce the same output as if executed by
> > the abstract machine.
> 
> Ok, I'm fine with that.
> 
> > Thus, we need to be careful what "speculative store" is meant to refer
> > to.  A few examples:
> >
> > if (atomic_load(&x, mo_relaxed) == 1)
> >   atomic_store(&y, 3, mo_relaxed);
> 
> No, please don't use this idiotic example. It is wrong.

It won't be useful in practice in a lot of cases, but that doesn't mean
it's wrong.  It's clearly not illegal code.  It also serves a purpose: a
simple example to reason about a few aspects of the memory model.

> The fact is, if a compiler generates anything but the obvious sequence
> (read/cmp/branch/store - where branch/store might obviously be done
> with some other machine conditional like a predicate), the compiler is
> wrong.

Why?  I've reasoned why (1) to (3) above allow, in certain cases (i.e.,
when the first load always returns 1), for the branch (or other machine
conditional) to not be emitted.  So please either poke holes into this
reasoning, or clarify that you don't in fact, contrary to what you wrote
above, agree with (1) to (3).

> Anybody who argues anything else is wrong, or confused, or confusing.

I appreciate your opinion, and maybe I'm just one of the three things
above (my vote is on "confusing").  But without you saying why, I can't
see what the misunderstanding here is.

> Instead, argue about *other* sequences where the compiler can do something.

I'd prefer if we could clarify the misunderstanding for the simple case
first that doesn't involve stronger ordering requirements in the form of
non-relaxed MOs.

> For example, this sequence:
> 
>    atomic_store(&x, a, mo_relaxed);
>    b = atomic_load(&x, mo_relaxed);
> 
> can validly be transformed to
> 
>    atomic_store(&x, a, mo_relaxed);
>    b = (typeof(x)) a;
> 
> and I think everybody agrees about that. In fact, that optimization
> can be done even for mo_strict.

Yes.

> But even that "obvious" optimization has subtle cases. What if the
> store is relaxed, but the load is strict? You can't do the
> optimization without a lot of thought, because dropping the strict load
> would drop an ordering point. So even the "store followed by exact
> same load" case has subtle issues.

Yes, if a compiler wants to optimize that, it has to give it more
thought.  My gut feeling is that either the store should get the
stronger ordering, or the accesses should be merged.  But I'd have to
think more about that one (which I can do on request).

> With similar caveats, it is perfectly valid to merge two consecutive
> loads, and to merge two consecutive stores.
> 
> Now that means that the sequence
> 
>     atomic_store(&x, 1, mo_relaxed);
>     if (atomic_load(&x, mo_relaxed) == 1)
>         atomic_store(&y, 3, mo_relaxed);
> 
> can first be optimized to
> 
>     atomic_store(&x, 1, mo_relaxed);
>     if (1 == 1)
>         atomic_store(&y, 3, mo_relaxed);
> 
> and then you get the end result that you wanted in the first place
> (including the ability to re-order the two stores due to the relaxed
> ordering, assuming they can be proven to not alias - and please don't
> use the idiotic type-based aliasing rules).
> 
> Bringing up your first example is pure and utter confusion.

Sorry if it was confusing.  But then maybe we need to talk about it
more, because it shouldn't be confusing if we agree on what the memory
model allows and what not.  I had originally picked the example because
it was related to the example Paul/Peter brought up.

> Don't do
> it. Instead, show what the obvious and valid transformations are, and then
> you can bring up these kinds of combinations as "look, this is
> obviously also correct".

I have my doubts whether the best way to reason about the memory model
is by thinking about specific compiler transformations.  YMMV,
obviously.

The -- kind of vague -- reason is that the allowed transformations will
be more complicated to reason about than the allowed output of a
concurrent program when understanding the memory model (i.e., ordering and
interleaving of memory accesses, etc.).  However, I can see that when
trying to optimize with a hardware memory model in mind, this might look
appealing.

What the compiler will do is exploiting knowledge about all possible
executions.  For example, if it knows that x is always 1, it will do the
transform.  The user would need to consider that case anyway, but he/she
can do without thinking about all potentially allowed compiler
optimizations.

> Now, the reason I want to make it clear that the code example you
> point to is a crap example is that because it doesn't talk about the
> *real* issue, it also misses a lot of really important details.
> 
> For example, when you say "if the compiler can prove that the
> conditional is always true" then YOU ARE TOTALLY WRONG.

What are you trying to say?  That a compiler can't prove that this is
the case in any program?  Or that it won't be likely it can?  Or what
else?

> So why do I say you are wrong, after I just gave you an example of how
> it happens? Because my example went back to the *real* issue, and
> there are actual real semantically meaningful details with doing
> things like load merging.
> 
> To give an example, let's rewrite things a bit more to use an extra variable:
> 
>     atomic_store(&x, 1, mo_relaxed);
>     a = atomic_load(&x, mo_relaxed);
>     if (a == 1)
>         atomic_store(&y, 3, mo_relaxed);
> 
> which looks exactly the same.

I'm confused.  Is this a new example?

Or is it supposed to be a compiler's rewrite of the following?

    atomic_store(&x, 1, mo_relaxed);
    if (atomic_load(&x, mo_relaxed) == 1)
       atomic_store(&y, 3, mo_relaxed);

If that's supposed to be the case, this is the same.

If it is supposed to be a rewrite of the original example with *just*
two stores, this is obviously an incorrect transformation (unless it can
be proven that no other thread writes to x).  The compiler can't make
stuff conditional that wasn't conditional before.  That would change the
allowed behavior of the abstract machine.

> If you now say "if you can prove the
> conditional is always true, you can make the store unconditional", YOU
> ARE WRONG.
> 
> Why?
> 
> This sequence:
> 
>     atomic_store(&x, 1, mo_relaxed);
>     a = atomic_load(&x, mo_relaxed);
>     atomic_store(&y, 3, mo_relaxed);
> 
> is actually - and very seriously - buggy.
> 
> Why? Because you have effectively split the atomic_load into two loads
> - one for the value of 'a', and one for your 'proof' that the store is
> unconditional.

I can't follow that, because it isn't clear to me which code sequences
are meant to belong together, and which transformations the compiler is
supposed to make.  If you would clarify that, then I can reply to this
part.

> Maybe you go "Nobody sane would do that", and you'd be right. But
> compilers aren't "sane" (and compiler _writers_ I have some serious
> doubts about), and how you describe the problem really does affect the
> solution.

To me, that sounds like a strong argument for a strong and precise
specification, and formal methods.  If we have that, stuff can be
resolved without having to rely on "sane" people, as long as they can
follow logic.  Because this removes the problem from "hopefully all of
us want the same" (which is sometimes called "sane", from both sides of
a disagreement) to "just stick to the specification".

> My description ("you can combine two subsequent atomic accesses, if
> you are careful about still having the same ordering points") doesn't
> have the confusion.

I'd much rather take the memory model's specification.  One example is
that it separates safety/correctness guarantees from liveness/progress
guarantees.  "the same ordering points" can be understood as not being
able to merge subsequent loads to the same memory location.  Both
progress guarantees and correctness guarantees can affect this, but they
have different consequences.  For example, even if it may be perfectly fine
to assume that two subsequent loads to the same memory location may
always execute atomically (and thus join them), you don't want to do
this to all such loads in a spin-wait loop (or you'll likely prevent
forward progress).
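
For example (sketch, in the notation used earlier in the thread):
hoisting and merging the per-iteration loads of

    while (atomic_load(&x, mo_relaxed) != 1)
        ;  /* spin */

into a single load before the loop would be fine for any finite number
of merged iterations, but the unbounded merge turns a terminating wait
into an infinite loop.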

In the end, we need to define "are careful about" and similar clauses.
Eventually, after defining everything, we'll end up with a memory model
that is specified in as much detail as C11's is.

> It makes it clear that no, you can't speculate
> writes, but yes, obviously you can combine certain things (think
> "write buffers" and "memory caches"), and that may make you able to
> remove the conditional. Which is very different from speculating
> writes.

It's equally fine to reason about executions and allowed interleavings
on the level of the source program (ie, the memory model and abstract
machine).  For example, combining certain things is essentially assuming
the atomic execution of those things; it will be clear from the program,
without thinking about write buffers and caches, whether that's possible or
not in the source program.  If it is, the compiler can do it too.  The
compiler must not lose anything (e.g., memory orders) when doing that; if
it wants to remove memory orders, it must show that this doesn't result
in executions not possible before.  Also, it must not prevent executions
that ensure forward progress (e.g., a spin-wait loop had better not
be atomic wrt. everything else, or you'll block forever).

> But my description also makes it clear that the transformation above
> was buggy, but that rewriting it as
> 
>     a = 1;
>     atomic_store(&y, 3, mo_relaxed);
>     atomic_store(&x, a, mo_relaxed);
> 
> is fine (doing the re-ordering here just to make a point).

Oh, so you were assuming that a is actually used afterwards?

> So I suspect we actually agree, I just think that the "example" that
> has been floating around, and the explanation for it, has been
> actively bad and misleading.

I do hope we agree.  I'm still confused why you think the branch needs
to be emitted in the first example I brought up, but other than that, we
seem to be on the same page.  I'm hopeful we can clarify the other
points if we keep talking about different examples.



^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 20:39                                         ` Richard Biener
@ 2014-02-17 22:14                                           ` Paul E. McKenney
  2014-02-17 22:27                                             ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-17 22:14 UTC (permalink / raw)
  To: Richard Biener
  Cc: Peter Zijlstra, Torvald Riegel, Linus Torvalds, Will Deacon,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 17, 2014 at 09:39:54PM +0100, Richard Biener wrote:
> On February 17, 2014 7:18:15 PM GMT+01:00, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >On Wed, Feb 12, 2014 at 07:12:05PM +0100, Peter Zijlstra wrote:
> >> On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
> >> > You need volatile semantics to force the compiler to ignore any
> >> > proofs it might otherwise attempt to construct.  Hence all the
> >> > ACCESS_ONCE() calls in my email to Torvald.  (Hopefully I translated
> >> > your example reasonably.)
> >> 
> >> My brain gave out for today; but it did appear to have the right
> >> structure.
> >
> >I can relate.  ;-)
> >
> >> I would prefer it if C11 would not require the volatile casts. It
> >> should simply _never_ speculate with atomic writes, volatile or not.
> >
> >I agree with not needing volatiles to prevent speculated writes.
> >However, they will sometimes be needed to prevent excessive load/store
> >combining.  The compiler doesn't have the runtime feedback mechanisms
> >that the hardware has, and thus will need help from the developer from
> >time to time.
> >
> >Or maybe the Linux kernel simply waits to transition to C11 relaxed
> >atomics until the compiler has learned to be sufficiently conservative
> >in its load-store combining decisions.
> 
> Sounds backwards. Currently the compiler does nothing to the atomics. I'm sure we'll eventually add something. But if testing coverage is zero outside then surely things get worse, not better with time.

Perhaps we solve this chicken-and-egg problem by creating a test suite?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 22:02                                                           ` Linus Torvalds
@ 2014-02-17 22:25                                                             ` Torvald Riegel
  2014-02-17 22:47                                                               ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-17 22:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-17 at 14:02 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 1:21 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > On Mon, 2014-02-17 at 12:18 -0800, Linus Torvalds wrote:
> >> and then it is read by people (compiler writers) that intentionally
> >> try to mis-use the words and do language-lawyering ("that depends on
> >> what the meaning of 'is' is").
> >
> > That assumption about people working on compilers is a little too broad,
> > don't you think?
> 
> Let's just say that *some* are that way, and those are the ones that I
> end up butting heads with.
> 
> The sane ones I never have to argue with - point them at a bug, and
> they just say "yup, bug". The insane ones say "we don't need to fix
> that, because if you read this copy of the standards that have been
> translated to chinese and back, it clearly says that this is
> acceptable".
> 
> >> The whole "lvalue vs rvalue expression
> >> vs 'what is a volatile access'" thing for C++ was/is a great example
> >> of that.
> >
> > I'm not aware of the details of this.
> 
> The argument was that an lvalue doesn't actually "access" the memory
> (an rvalue does), so this:
> 
>    volatile int *p = ...;
> 
>    *p;
> 
> doesn't need to generate a load from memory, because "*p" is still an
> lvalue (since you could assign things to it).
> 
> This isn't an issue in C, because in C, expression statements are
> always rvalues, but C++ changed that.

Huhh.  I can see the problems that this creates in terms of C/C++
compatibility.

> The people involved with the C++
> standards have generally been totally clueless about their subtle
> changes.

This isn't a fair characterization.  There are many people that do care,
and certainly not all are clueless.  But it's a limited set of people,
bugs happen, and not all of them will have the same goals.

I think one way to prevent such problems in the future could be to have
someone in the kernel community volunteer to look through standard
revisions before they are published.  The standard needs to be fixed,
because compilers need to conform to the standard (e.g., a compiler's
extension "fixing" the above wouldn't be conforming anymore because it
emits more volatile reads than specified).

Or maybe those of us working on the standard need to flag potential
changes of interest to the kernel folks.  But that may be less reliable
than someone from the kernel side looking at them; I don't know.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 22:14                                           ` Paul E. McKenney
@ 2014-02-17 22:27                                             ` Torvald Riegel
  0 siblings, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-17 22:27 UTC (permalink / raw)
  To: paulmck
  Cc: Richard Biener, Peter Zijlstra, Linus Torvalds, Will Deacon,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, 2014-02-17 at 14:14 -0800, Paul E. McKenney wrote:
> On Mon, Feb 17, 2014 at 09:39:54PM +0100, Richard Biener wrote:
> > On February 17, 2014 7:18:15 PM GMT+01:00, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > >On Wed, Feb 12, 2014 at 07:12:05PM +0100, Peter Zijlstra wrote:
> > >> On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
> > >> > You need volatile semantics to force the compiler to ignore any
> > >> > proofs it might otherwise attempt to construct.  Hence all the
> > >> > ACCESS_ONCE() calls in my email to Torvald.  (Hopefully I translated
> > >> > your example reasonably.)
> > >> 
> > >> My brain gave out for today; but it did appear to have the right
> > >> structure.
> > >
> > >I can relate.  ;-)
> > >
> > >> I would prefer it if C11 would not require the volatile casts. It
> > >> should simply _never_ speculate with atomic writes, volatile or not.
> > >
> > >I agree with not needing volatiles to prevent speculated writes.
> > >However, they will sometimes be needed to prevent excessive load/store
> > >combining.  The compiler doesn't have the runtime feedback mechanisms
> > >that the hardware has, and thus will need help from the developer from
> > >time to time.
> > >
> > >Or maybe the Linux kernel simply waits to transition to C11 relaxed
> > >atomics until the compiler has learned to be sufficiently conservative
> > >in its load-store combining decisions.
> > 
> > Sounds backwards. Currently the compiler does nothing to the atomics. I'm sure we'll eventually add something. But if testing coverage is zero outside then surely things get worse, not better with time.
> 
> Perhaps we solve this chicken-and-egg problem by creating a test suite?

Perhaps.  The test suite might also be a good set of examples showing
which cases we expect to be optimized in a certain way, and which not.
I suppose the uses of (the equivalent of) atomics in the kernel would be
a good start.
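
For instance (a sketch in C11 syntax; which cases belong in which bucket
is exactly what the suite would pin down):

    /* expected to be combinable: back-to-back relaxed loads */
    a = atomic_load_explicit(&x, memory_order_relaxed);
    b = atomic_load_explicit(&x, memory_order_relaxed);

    /* must not be collapsed into one hoisted load: a spin-wait */
    while (!atomic_load_explicit(&flag, memory_order_relaxed))
        ;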


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 22:09                                                   ` Torvald Riegel
@ 2014-02-17 22:32                                                     ` Linus Torvalds
  2014-02-17 23:17                                                       ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-17 22:32 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 17, 2014 at 2:09 PM, Torvald Riegel <triegel@redhat.com> wrote:
> On Sat, 2014-02-15 at 11:15 -0800, Linus Torvalds wrote:
>> >
>> > if (atomic_load(&x, mo_relaxed) == 1)
>> >   atomic_store(&y, 3, mo_relaxed);
>>
>> No, please don't use this idiotic example. It is wrong.
>
> It won't be useful in practice in a lot of cases, but that doesn't mean
> it's wrong.  It's clearly not illegal code.  It also serves a purpose: a
> simple example to reason about a few aspects of the memory model.

It's not illegal code, but if you claim that you can make that store
unconditional, it's a pointless and wrong example.

>> The fact is, if a compiler generates anything but the obvious sequence
>> (read/cmp/branch/store - where branch/store might obviously be done
>> with some other machine conditional like a predicate), the compiler is
>> wrong.
>
> Why?  I've reasoned why (1) to (3) above allow in certain cases (i.e.,
> the first load always returning 1) for the branch (or other machine
> conditional) to not be emitted.  So please either poke holes into this
> reasoning, or clarify that you don't in fact, contrary to what you wrote
> above, agree with (1) to (3).

The thing is, the first load DOES NOT RETURN 1. It returns whatever
that memory location contains. End of story.

Stop claiming it "can return 1". It *never* returns 1 unless you do
the load and *verify* it, or unless the load itself can be made to go
away. And with the code sequence given, that just doesn't happen. END
OF STORY.

So your argument is *shit*. Why do you continue to argue it?

I told you how that load can go away, and you agreed. But IT CANNOT GO
AWAY any other way. You cannot claim "the compiler knows". The
compiler doesn't know. It's that simple.

>> So why do I say you are wrong, after I just gave you an example of how
>> it happens? Because my example went back to the *real* issue, and
>> there are actual real semantically meaningful details with doing
>> things like load merging.
>>
>> To give an example, let's rewrite things a bit more to use an extra variable:
>>
>>     atomic_store(&x, 1, mo_relaxed);
>>     a = atomic_load(&x, mo_relaxed);
>>     if (a == 1)
>>         atomic_store(&y, 3, mo_relaxed);
>>
>> which looks exactly the same.
>
> I'm confused.  Is this a new example?

That is a new example. The important part is that it has left a
"trace" for the programmer: because 'a' contains the value, the
programmer can now look at the value later and say "oh, we know we did
a store iff a was 1"

>> This sequence:
>>
>>     atomic_store(&x, 1, mo_relaxed);
>>     a = atomic_load(&x, mo_relaxed);
>>     atomic_store(&y, 3, mo_relaxed);
>>
>> is actually - and very seriously - buggy.
>>
>> Why? Because you have effectively split the atomic_load into two loads
>> - one for the value of 'a', and one for your 'proof' that the store is
>> unconditional.
>
> I can't follow that, because it isn't clear to me which code sequences
> are meant to belong together, and which transformations the compiler is
> supposed to make.  If you would clarify that, then I can reply to this
> part.

Basically, if the compiler allows the condition of "I wrote 3 to
y, but the programmer later sees 'a' with a value other than 1" then
the compiler is one buggy pile of shit. It fundamentally broke the
whole concept of atomic accesses. Basically the "atomic" access to 'x'
turned into two different accesses: the one that "proved" that x had
the value 1 (and caused the value 3 to be written), and the other load
that then writes that other value into 'a'.

It's really not that complicated.

And this is why descriptions like this should ABSOLUTELY NOT BE
WRITTEN as "if the compiler can prove that 'x' had the value 1, it can
remove the branch". Because that IS NOT SUFFICIENT. That was not a
valid transformation of the atomic load.

The only valid transformation was the one I stated, namely to remove
the load entirely and replace it with the value written earlier in the
same execution context.

Really, why is this so hard to understand?

               Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 22:25                                                             ` Torvald Riegel
@ 2014-02-17 22:47                                                               ` Linus Torvalds
  2014-02-17 23:41                                                                 ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-17 22:47 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 17, 2014 at 2:25 PM, Torvald Riegel <triegel@redhat.com> wrote:
> On Mon, 2014-02-17 at 14:02 -0800, Linus Torvalds wrote:
>>
>> The argument was that an lvalue doesn't actually "access" the memory
>> (an rvalue does), so this:
>>
>>    volatile int *p = ...;
>>
>>    *p;
>>
>> doesn't need to generate a load from memory, because "*p" is still an
>> lvalue (since you could assign things to it).
>>
>> This isn't an issue in C, because in C, expression statements are
>> always rvalues, but C++ changed that.
>
> Huhh.  I can see the problems that this creates in terms of C/C++
> compatibility.

That's not the biggest problem.

The biggest problem is that you have compiler writers that don't care
about sane *use* of the features they write a compiler for, they just
care about the standard.

So they don't care about C vs C++ compatibility. Even more
importantly, they don't care about the *user* that uses only C++ and
the fact that their reading of the standard results in *meaningless*
behavior. They point to the standard and say "that's what the standard
says, suck it", and silently generate code (or in this case, avoid
generating code) that makes no sense.

So it's not about C++ being incompatible with C, it's about C++ having
insane and bad semantics unless you just admit that "oh, ok, I need to
not just read the standard, I also need to use my brain, and admit
that a C++ statement expression needs to act as if it is an "access"
wrt volatile variables".

In other words, as a compiler person, you do need to read more than
the paper of the standard. You also need to take into account what is
reasonable behavior even when the standard could possibly be read some
other way. And some compiler people don't.

The "volatile access in statement expression" did get resolved,
sanely, at least in gcc. I think gcc warns about some remaining cases.

Btw, afaik, C++11 actually clarifies the standard to require the
reads, because everybody *knew* that not requiring the read was insane
and meaningless behavior, and clearly against the intent of
"volatile".

But that didn't stop compiler writers from saying "hey, the standard
allows my insane and meaningless behavior, so I'll implement it and
not consider it a bug".

                    Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 20:18                                                       ` Linus Torvalds
  2014-02-17 21:21                                                         ` Torvald Riegel
@ 2014-02-17 23:10                                                         ` Alec Teal
  2014-02-18  0:05                                                           ` Linus Torvalds
  2014-02-18  3:00                                                         ` Paul E. McKenney
  2 siblings, 1 reply; 299+ messages in thread
From: Alec Teal @ 2014-02-17 23:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Paul McKenney, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On 17/02/14 20:18, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel<triegel@redhat.com>  wrote:
>> Which example do you have in mind here?  Haven't we resolved all the
>> debated examples, or did I miss any?
> Well, Paul seems to still think that the standard possibly allows
> speculative writes or possibly value speculation in ways that break
> the hardware-guaranteed orderings.
>
> And personally, I can't read standards paperwork. It is invariably
Can't => Don't - evidently.
> written in some basically impossible-to-understand lawyeristic mode,
You mean "unambiguous" - try reading a patent (Apple have 1000s of 
trivial ones, I tried reading one once thinking "how could they have 
phrased it so this got approved", their technique was to make the reader 
want to start cutting themselves to prove they wern't numb to everything)
> and then it is read by people (compiler writers) that intentionally
> try to mis-use the words and do language-lawyering ("that depends on
> what the meaning of 'is' is"). The whole "lvalue vs rvalue expression
> vs 'what is a volatile access'" thing for C++ was/is a great example
> of that.
I'm not going to teach you what rvalues and lvalues are, but! 
http://lmgtfy.com/?q=what+are+rvalues might help.
>
> So quite frankly, as a result I refuse to have anything to do with the
> process directly.
Is this goodbye?
>
>                   Linus
That aside, what is the problem? If the compiler has created code 
that has different program states than what would be created without 
optimisation, please file a bug report and/or send something to the 
mailing list USING A CIVIL TONE; there's no need for swear-words and 
profanities all the time - use them when you want to emphasise 
something. Additionally, if you are always angry, start calling that 
state "normal" and reserve such words for when you are outraged.

There are so many emails from you bitching about stuff that I've lost
track of what you're bitching about, you bitch that much. Like this
standards stuff above (notice I said stuff, not "crap" or "shit").

What exactly is your problem? If the compiler is doing something the
standard does not permit, or optimising something wrongly (read: "puts
the program in a different state than if the optimisation was not
applied"), that is REALLY serious and you are right to report it; but
whining like a n00b on Stack Overflow when a question gets closed is
not helping.

I tried reading back through the emails (I dismissed them previously),
but there's just so much ranting, and rants about the standard too (I
would trash this if I deemed the effort required to delete it less than
the storage the message takes up). Standardised behaviour is VERY
important.

So start again, what is the serious problem, have you got any code that 
would let me replicate it, what is your version of GCC?

Oh and lastly! Optimisations are not as casual as "oh, we could do this
and it'd work better"; unlike kernel work or any other software that is
being improved, it is very formal (and rightfully so). I seriously
recommend you read at least the first 40 pages of a book called
"Compiler Design, Analysis and Transformation"; it's not about the
parsing phases or anything, but it develops a good introduction and
later a good foundation for exploring the field further. Compilers do
not operate on what I call "A-level logic", and to show what I mean I
use the shovel-to-the-face of real analysis: "of course 1/x tends
towards 0, it's not gonna be 5!!" = A-level logic. "Let epsilon > 0 be
given, then there exists an N..." = formal proof. So when one says "the
compiler can prove", it's not some silly thing powered by A-level
logic, it is the implementation of something that can be proven to be
correct (in the sense of the program states mentioned before).

So yeah, calm down and explain - no lashing out at standards bodies, 
what is the problem?

Alec

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 19:41                                                         ` Torvald Riegel
@ 2014-02-17 23:12                                                           ` Joseph S. Myers
  0 siblings, 0 replies; 299+ messages in thread
From: Joseph S. Myers @ 2014-02-17 23:12 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Paul McKenney, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, 17 Feb 2014, Torvald Riegel wrote:

> On Mon, 2014-02-17 at 18:59 +0000, Joseph S. Myers wrote:
> > On Sat, 15 Feb 2014, Torvald Riegel wrote:
> > 
> > > glibc is a counterexample that comes to mind, although it's a smaller
> > > code base.  (It's currently not using C11 atomics, but transitioning
> > > there makes sense, and some thing I want to get to eventually.)
> > 
> > glibc is using C11 atomics (GCC builtins rather than _Atomic / 
> > <stdatomic.h>, but using __atomic_* with explicitly specified memory model 
> > rather than the older __sync_*) on AArch64, plus in certain cases on ARM 
> > and MIPS.
> 
> I think the major steps remaining are moving the other architectures
> over, and rechecking concurrent code (e.g., for the code that I have

I don't think we'll be ready to require GCC >= 4.7 to build glibc for 
another year or two, although probably we could move the requirement up 
from 4.4 to 4.6.  (And some platforms only had the C11 atomics optimized 
later than 4.7.)

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 22:32                                                     ` Linus Torvalds
@ 2014-02-17 23:17                                                       ` Torvald Riegel
  2014-02-18  0:09                                                         ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-17 23:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-17 at 14:32 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 2:09 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > On Sat, 2014-02-15 at 11:15 -0800, Linus Torvalds wrote:
> >> >
> >> > if (atomic_load(&x, mo_relaxed) == 1)
> >> >   atomic_store(&y, 3, mo_relaxed);
> >>
> >> No, please don't use this idiotic example. It is wrong.
> >
> > It won't be useful in practice in a lot of cases, but that doesn't mean
> > it's wrong.  It's clearly not illegal code.  It also serves a purpose: a
> > simple example to reason about a few aspects of the memory model.
> 
> It's not illegal code, but if you claim that you can make that store
> unconditional, it's a pointless and wrong example.
> 
> >> The fact is, if a compiler generates anything but the obvious sequence
> >> (read/cmp/branch/store - where branch/store might obviously be done
> >> with some other machine conditional like a predicate), the compiler is
> >> wrong.
> >
> > Why?  I've reasoned why (1) to (3) above allow in certain cases (i.e.,
> > the first load always returning 1) for the branch (or other machine
> > conditional) to not be emitted.  So please either poke holes into this
> > reasoning, or clarify that you don't in fact, contrary to what you wrote
> > above, agree with (1) to (3).
> 
> The thing is, the first load DOES NOT RETURN 1. It returns whatever
> that memory location contains. End of story.

The memory location is just an abstraction for state, if it's not
volatile.

> Stop claiming it "can return 1".. It *never* returns 1 unless you do
> the load and *verify* it, or unless the load itself can be made to go
> away. And with the code sequence given, that just doesn't happen. END
> OF STORY.

void foo()
{
  atomic<int> x = 1;
  if (atomic_load(&x, mo_relaxed) == 1)
    atomic_store(&y, 3, mo_relaxed);
}

This is a counter example to your claim, and yes, the compiler has proof
that x is 1.  It's deliberately simple, but I can replace this with
other more advanced situations.  For example, if x comes out of malloc
(or, on the kernel side, something else that returns non-aliasing
memory) and hasn't provably escaped to other threads yet.

I haven't posted this full example, but I've *clearly* said that *if*
the compiler can prove that the load would always return 1, it can
remove it.  And it's simple to see why that's the case: If this holds,
then in all allowed executions it would load from a known store, the
relaxed_mo gives no further ordering guarantees, so we can just take the
value, and we're good.
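
Spelled out, under that reasoning foo() above may be compiled as if it
had been written as (a sketch; y is assumed to be an atomic variable in
scope):

void foo()
{
  atomic_store(&y, 3, mo_relaxed);  /* load of x and branch folded
                                       away: x is provably 1 */
}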

> So your argument is *shit*. Why do you continue to argue it?

Maybe because it isn't?  Maybe you should try to at least trust that my
intentions are good, even if you distrust my ability to reason.

> I told you how that load can go away, and you agreed. But IT CANNOT GO
> AWAY any other way. You cannot claim "the compiler knows". The
> compiler doesn't know. It's that simple.

Oh yes it can.  Because of the same rules that allow you to perform the
other transformations.  Please try to see the similarities here.  You
previously said you don't want to mix volatile semantics and atomics.
This is something that's being applied in this example.

> >> So why do I say you are wrong, after I just gave you an example of how
> >> it happens? Because my example went back to the *real* issue, and
> >> there are actual real semantically meaningful details with doing
> >> things like load merging.
> >>
> >> To give an example, let's rewrite things a bit more to use an extra variable:
> >>
> >>     atomic_store(&x, 1, mo_relaxed);
> >>     a = atomic_load(&x, mo_relaxed);
> >>     if (a == 1)
> >>         atomic_store(&y, 3, mo_relaxed);
> >>
> >> which looks exactly the same.
> >
> > I'm confused.  Is this a new example?
> 
> That is a new example. The important part is that it has left a
> "trace" for the programmer: because 'a' contains the value, the
> programmer can now look at the value later and say "oh, we know we did
> a store iff a was 1"
> 
> >> This sequence:
> >>
> >>     atomic_store(&x, 1, mo_relaxed);
> >>     a = atomic_load(&x, mo_relaxed);
> >>     atomic_store(&y, 3, mo_relaxed);
> >>
> >> is actually - and very seriously - buggy.
> >>
> >> Why? Because you have effectively split the atomic_load into two loads
> >> - one for the value of 'a', and one for your 'proof' that the store is
> >> unconditional.
> >
> > I can't follow that, because it isn't clear to me which code sequences
> > are meant to belong together, and which transformations the compiler is
> > supposed to make.  If you would clarify that, then I can reply to this
> > part.
> 
> Basically, if the compiler allows the condition of "I wrote 3 to the
> y, but the programmer sees 'a' has a value other than 1 later" then
> the compiler is one buggy pile of shit. It fundamentally broke the
> whole concept of atomic accesses. Basically the "atomic" access to 'x'
> turned into two different accesses: the one that "proved" that x had
> the value 1 (and caused the value 3 to be written), and the other load
> that then wrote that other value into 'a'.
> 
> It's really not that complicated.

Yes, that's not complicated, but I assumed this to be obvious and wasn't
aware that this is contentious.

> And this is why descriptions like this should ABSOLUTELY NOT BE
> WRITTEN as "if the compiler can prove that 'x' had the value 1, it can
> remove the branch". Because that IS NOT SUFFICIENT. That was not a
> valid transformation of the atomic load.

Now I see where the confusion was.  Sorry if I didn't point this out
explicitly, but having proved that x has the value 1, the first thing a
compiler would naturally do is replace *the load* by 1, and
*afterwards* remove the branch because it sees 1 == 1.

Nonetheless, to be picky about it, keeping the load is correct if the
proof that x will always have the value 1 is correct (it might prevent
some optimizations though; in foo() above, keeping the load would also
prevent removing the variable on the stack).  In a correct compiler,
this will of course lead to the memory location actually existing and
having the value 1 in a compiled program.
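
That is, the picky-but-correct variant keeps the load (a sketch):

void foo()
{
  atomic<int> x = 1;
  atomic_load(&x, mo_relaxed);      /* load kept; result known to be 1 */
  atomic_store(&y, 3, mo_relaxed);  /* branch folded away: 1 == 1 */
}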

> The only valid transformation was the one I stated, namely to remove
> the load entirely and replace it with the value written earlier in the
> same execution context.

No, your transformation is similar but has a different reasoning behind
it.  What the compiler (easily) proves in your example is that *this
thread* is always allowed to observe its prior store to x.  That's a
different assumption than that x will always have the value 1 when this
code sequence is executed.  Therefore, the results for the compilation
are also slightly different.

> Really, why is so hard to understand?

It's not hard to understand, we've just been talking past each other.
(But that's something we both participated in, I'd like to point out.)

I think it also shows that reasoning about executions starting with what
the compiler and HW can do to the code is more complex than reasoning
about allowed executions of the abstract machine.  If we had used the
latter, and you had formulated the proof the compiler does about the
executions, we might have been able to see the misunderstanding earlier.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 22:47                                                               ` Linus Torvalds
@ 2014-02-17 23:41                                                                 ` Torvald Riegel
  2014-02-18  0:18                                                                   ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-17 23:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-17 at 14:47 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 2:25 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > On Mon, 2014-02-17 at 14:02 -0800, Linus Torvalds wrote:
> >>
> >> The argument was that an lvalue doesn't actually "access" the memory
> >> (an rvalue does), so this:
> >>
> >>    volatile int *p = ...;
> >>
> >>    *p;
> >>
> >> doesn't need to generate a load from memory, because "*p" is still an
> >> lvalue (since you could assign things to it).
> >>
> >> This isn't an issue in C, because in C, expression statements are
> >> always rvalues, but C++ changed that.
> >
> > Huhh.  I can see the problems that this creates in terms of C/C++
> > compatibility.
> 
> That's not the biggest problem.
> 
> The biggest problem is that you have compiler writers that don't care
> about sane *use* of the features they write a compiler for, they just
> care about the standard.
> 
> So they don't care about C vs C++ compatibility. Even more
> importantly, they don't care about the *user* that uses only C++ and
> the fact that their reading of the standard results in *meaningless*
> behavior. They point to the standard and say "that's what the standard
> says, suck it", and silently generate code (or in this case, avoid
> generating code) that makes no sense.

There's an underlying problem here that's independent from the actual
instance that you're worried about here: "no sense" is ultimately a
matter of taste/objectives/priorities as long as the respective
specification is logically consistent.

If you want to be independent of your sanity being different from other
people's sanity (e.g., compiler writers), you need to make sure that the
specification is precise and says what you want.  IOW, think about the
specification being the program, and the people being computers; you
better want a well-defined program in this case.

> So it's not about C++ being incompatible with C, it's about C++ having
> insane and bad semantics unless you just admit that "oh, ok, I need to
> not just read the standard, I also need to use my brain, and admit
> that a C++ statement expression needs to act as if it is an "access"
> wrt volatile variables".

1) I agree that (IMO) a good standard strives to be easy to
understand.

2) In practice, there is a trade-off between "Easy to understand" and
actually producing a specification.  A standard is not a tutorial.  And
that's for good reason, because (a) there might be more than one way to
teach something and that should be allowed, and (b) the standard
should carry the full precision but still be compact enough to be
manageable.

3) Implementations can try to be nice to users by helping them avoid
error-prone corner cases and such.  A warning for common problems is
such a case.  But an implementation has to draw a line somewhere,
demarcating cases where it fully exploits what the standard says (e.g.,
to allow optimizations) from cases where it is more conservative and
does what the standard allows but in a potentially more intuitive way.
That's
especially the case if it's being asked to produce high-performance
code.

4) There will be arguments for where the line actually is, simply
because different users will have different goals.

5) The way to reduce 4) is to either make the standard more specific, or
to provide better user documentation.  If the standard has strict
requirements, then there will be less misunderstanding.

6) To achieve 5), one way is to get involved in the standards process.



^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 23:10                                                         ` Alec Teal
@ 2014-02-18  0:05                                                           ` Linus Torvalds
  2014-02-18 15:31                                                             ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-18  0:05 UTC (permalink / raw)
  To: Alec Teal
  Cc: Torvald Riegel, Paul McKenney, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 17, 2014 at 3:10 PM, Alec Teal <a.teal@warwick.ac.uk> wrote:
>
> You mean "unambiguous" - try reading a patent (Apple have 1000s of trivial
> ones, I tried reading one once thinking "how could they have phrased it so
> this got approved", their technique was to make the reader want to start
> cutting themselves to prove they wern't numb to everything)

Oh, I agree, patent language is worse.

> I'm not going to teach you what rvalues and lvalues are, but!

I know what lvalues and rvalues are. I *understand* the thinking that
goes on behind the "let's not do the access, because it's not an
rvalue, so there is no 'access' to the object".

I understand it from a technical perspective.

I don't understand the compiler writer that uses a *technicality* to
argue against generating sane code that is obviously what the user
actually asked for.

See the difference?

> So start again, what is the serious problem, have you got any code that
> would let me replicate it, what is your version of GCC?

The volatile problem is long fixed. The people who argued for the
"legalistically correct", but insane behavior lost (and as mentioned,
I think C++11 actually fixed the legalistic reading too).

I'm bringing it up because I've had too many cases where compiler
writers pointed to standard and said "that is ambiguous or undefined,
so we can do whatever the hell we want, regardless of whether that's
sensible, or regardless of whether there is a sensible way to get the
behavior you want or not".


> Oh and lastly! Optimisations are not as casual as "oh, we could do this and
> it'd work better"; unlike kernel work or any other software that is being
> improved, it is very formal (and rightfully so)

Alec, I know compilers. I don't do code generation (quite frankly,
register allocation and instruction choice is when I give up), but I
did actually write my own for static analysis, including turning
things into SSA etc.

No, I'm not a "compiler person", but I actually do know enough that I
understand what goes on.

And exactly because I know enough, I would *really* like atomics to be
well-defined, and have very clear - and *local* - rules about how they
can be combined and optimized.

None of this "if you can prove that the read has value X" stuff. And
things like value speculation should simply not be allowed, because
that actually breaks the dependency chain that the CPU architects give
guarantees for. Instead, make the rules be very clear, and very
simple, like my suggestion. You can never remove a load because you
can "prove" it has some value, but you can combine two consecutive
atomic accesses.

For example, CPU people actually do tend to give guarantees for
certain things, like stores that are causally related being visible in
a particular order. If the compiler starts doing value speculation on
atomic accesses, you are quite possibly breaking things like that.
It's just not a good idea. Don't do it. Write the standard so that it
clearly is disallowed.

Because you may think that a C standard is machine-independent, but
that isn't really the case. The people who write code still write code
for a particular machine. Our code works (in the general case) on
different byte orderings, different register sizes, different memory
ordering models. But in each *instance* we still end up actually
coding for each machine.

So the rules for atomics should be simple and *specific* enough that
when you write code for a particular architecture, you can take the
architecture memory ordering *and* the C atomics orderings into
account, and do the right thing for that architecture.

And that very much means that doing things like value speculation MUST
NOT HAPPEN. See? Even if you can prove that your code is "equivalent",
it isn't.

So for example, let's say that you have a pointer, and you have some
reason to believe that the pointer has a particular value. So you
rewrite following the pointer from this:

  value = ptr->value;

into

  value = speculated->value;
  tmp = ptr;
  if (unlikely(tmp != speculated))
    value = tmp->value;

and maybe you can now make the critical code-path for the speculated
case go faster (since now there is no data dependency for the
speculated case, and the actual pointer chasing load is now no longer
in the critical path), and you made things faster because your
profiling showed that the speculated case was true 99% of the time.
Wonderful, right? And clearly, the code "provably" does the same
thing.

EXCEPT THAT IS NOT TRUE AT ALL.

It very much does not do the same thing at all, and by doing value
speculation and "proving" something was true, the only thing you did
was to make incorrect code run faster. Because now the causally
related load of value from the pointer isn't actually causally related
at all, and you broke the memory ordering.

This is why I don't like it when I see Torvald talk about "proving"
things. It's bullshit. You can "prove" pretty much anything, and in
the process lose sight of the bigger issue, namely that there is code
that depends on

When it comes to atomic accesses, you don't play those kinds of games,
exactly because the ordering of the accesses matter in ways that are
not really sanely describable at a source code level. The *only* kinds
of games you play are like the ones I described - combining accesses
under very strict local rules.

And the strict local rules really should be of the type "a store
followed by a load to the same location with the same memory ordering
constraints can be combined". Never *ever* of the kind "if you can
prove X".

I hope my example made it clear *why* I react so strongly when Torvald
starts talking about "if you can prove the value is 1".

                     Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 23:17                                                       ` Torvald Riegel
@ 2014-02-18  0:09                                                         ` Linus Torvalds
  2014-02-18 15:46                                                           ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-18  0:09 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 17, 2014 at 3:17 PM, Torvald Riegel <triegel@redhat.com> wrote:
> On Mon, 2014-02-17 at 14:32 -0800,
>
>> Stop claiming it "can return 1".. It *never* returns 1 unless you do
>> the load and *verify* it, or unless the load itself can be made to go
>> away. And with the code sequence given, that just doesn't happen. END
>> OF STORY.
>
> void foo()
> {
>   atomic<int> x = 1;
>   if (atomic_load(&x, mo_relaxed) == 1)
>     atomic_store(&y, 3, mo_relaxed);
> }

This is the very example I gave, where the real issue is not that "you
prove that load returns 1", you instead say "store followed by a load
can be combined".

I (in another email I just wrote) tried to show why the "prove
something is true" approach is a very dangerous model.  Seriously, it's
pure crap. It's broken.

If the C standard defines atomics in terms of "provable equivalence",
it's broken. Exactly because on a *virtual* machine you can prove
things that are not actually true in a *real* machine. I have the
example of value speculation changing the memory ordering model of the
actual machine.

See?

                Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 23:41                                                                 ` Torvald Riegel
@ 2014-02-18  0:18                                                                   ` Linus Torvalds
  2014-02-18  1:26                                                                     ` Paul E. McKenney
  2014-02-18 15:38                                                                     ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-18  0:18 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel <triegel@redhat.com> wrote:
>
> There's an underlying problem here that's independent from the actual
> instance that you're worried about here: "no sense" is ultimately a
> matter of taste/objectives/priorities as long as the respective
> specification is logically consistent.

Yes. But I don't think it's "independent".

Exactly *because* some people will read standards without applying
"does the resulting code generation actually make sense for the
programmer that wrote the code", the standard has to be pretty clear.

The standard often *isn't* pretty clear. It wasn't clear enough when
it came to "volatile", and yet that was a *much* simpler concept than
atomic accesses and memory ordering.

And most of the time it's not a big deal. But because the C standard
generally tries to be very portable, and cover different machines,
there tends to be a mindset that anything inherently unportable is
"undefined" or "implementation defined", and then the compiler writer
is basically given free rein to do anything they want (with
"implementation defined" at least requiring that it is reliably the
same thing).

And when it comes to memory ordering, *everything* is basically
non-portable, because different CPU's very much have different rules.
I worry that that means that the standard then takes the stance that
"well, compiler re-ordering is no worse than CPU re-ordering, so we
let the compiler do anything". And then we have to either add
"volatile" to make sure the compiler doesn't do that, or use an overly
strict memory model at the compiler level that makes it all pointless.

So I really really hope that the standard doesn't give compiler
writers free hands to do anything that they can prove is "equivalent"
in the virtual C machine model. That's not how you get reliable
results.

               Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18  0:18                                                                   ` Linus Torvalds
@ 2014-02-18  1:26                                                                     ` Paul E. McKenney
  2014-02-18 15:38                                                                     ` Torvald Riegel
  1 sibling, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18  1:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 17, 2014 at 04:18:52PM -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > There's an underlying problem here that's independent from the actual
> > instance that you're worried about here: "no sense" is ultimately a
> > matter of taste/objectives/priorities as long as the respective
> > specification is logically consistent.
> 
> Yes. But I don't think it's "independent".
> 
> Exactly *because* some people will read standards without applying
> "does the resulting code generation actually make sense for the
> programmer that wrote the code", the standard has to be pretty clear.
> 
> The standard often *isn't* pretty clear. It wasn't clear enough when
> it came to "volatile", and yet that was a *much* simpler concept than
> atomic accesses and memory ordering.
> 
> And most of the time it's not a big deal. But because the C standard
> generally tries to be very portable, and cover different machines,
> there tends to be a mindset that anything inherently unportable is
> "undefined" or "implementation defined", and then the compiler writer
> is basically given free rein to do anything they want (with
> "implementation defined" at least requiring that it is reliably the
> same thing).
> 
> And when it comes to memory ordering, *everything* is basically
> non-portable, because different CPU's very much have different rules.
> I worry that that means that the standard then takes the stance that
> "well, compiler re-ordering is no worse than CPU re-ordering, so we
> let the compiler do anything". And then we have to either add
> "volatile" to make sure the compiler doesn't do that, or use an overly
> strict memory model at the compiler level that makes it all pointless.

For whatever it is worth, this line of reasoning has been one reason why
I have been objecting strenuously every time someone on the committee
suggests eliminating "volatile" from the standard.

							Thanx, Paul

> So I really really hope that the standard doesn't give compiler
> writers free hands to do anything that they can prove is "equivalent"
> in the virtual C machine model. That's not how you get reliable
> results.
> 
>                Linus
> 


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 20:18                                                       ` Linus Torvalds
  2014-02-17 21:21                                                         ` Torvald Riegel
  2014-02-17 23:10                                                         ` Alec Teal
@ 2014-02-18  3:00                                                         ` Paul E. McKenney
  2014-02-18  3:24                                                           ` Linus Torvalds
  2014-02-18 15:56                                                           ` Torvald Riegel
  2 siblings, 2 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18  3:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > Which example do you have in mind here?  Haven't we resolved all the
> > debated examples, or did I miss any?
> 
> Well, Paul seems to still think that the standard possibly allows
> speculative writes or possibly value speculation in ways that break
> the hardware-guaranteed orderings.

It is not that I know of any specific problems, but rather that I
know I haven't looked under all the rocks.  Plus my impression from
my few years on the committee is that the standard will be pushed to
the limit when it comes time to add optimizations.

One example that I learned about last week uses the branch-prediction
hardware to validate value speculation.  And no, I am not at all a fan
of value speculation, in case you were curious.  However, it is still
an educational example.

This is where you start:

	p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
	do_something(p->a, p->b, p->c);
	p->d = 1;

Then you leverage branch-prediction hardware as follows:

	p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
	if (p == GUESS) {
		do_something(GUESS->a, GUESS->b, GUESS->c);
		GUESS->d = 1;
	} else {
		do_something(p->a, p->b, p->c);
		p->d = 1;
	}

The CPU's branch-prediction hardware squashes speculation in the case where
the guess was wrong, and this prevents the speculative store to ->d from
ever being visible.  However, the then-clause breaks dependencies, which
means that the loads -could- be speculated, so that do_something() gets
passed pre-initialization values.

Now, I hope and expect that the wording in the standard about dependency
ordering prohibits this sort of thing.  But I do not yet know for certain.

And yes, I am being paranoid.  But not unnecessarily paranoid.  ;-)

							Thanx, Paul

> And personally, I can't read standards paperwork. It is invariably
> written in some basically impossible-to-understand lawyeristic mode,
> and then it is read by people (compiler writers) that intentionally
> try to mis-use the words and do language-lawyering ("that depends on
> what the meaning of 'is' is"). The whole "lvalue vs rvalue expression
> vs 'what is a volatile access'" thing for C++ was/is a great example
> of that.
> 
> So quite frankly, as a result I refuse to have anything to do with the
> process directly.
> 
>                  Linus
> 


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18  3:00                                                         ` Paul E. McKenney
@ 2014-02-18  3:24                                                           ` Linus Torvalds
  2014-02-18  3:42                                                             ` Linus Torvalds
  2014-02-18  5:01                                                             ` Paul E. McKenney
  2014-02-18 15:56                                                           ` Torvald Riegel
  1 sibling, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-18  3:24 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 17, 2014 at 7:00 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> One example that I learned about last week uses the branch-prediction
> hardware to validate value speculation.  And no, I am not at all a fan
> of value speculation, in case you were curious.

Heh. See the example I used in my reply to Alec Teal. It basically
broke the same dependency the same way.

Yes, value speculation of reads is simply wrong, the same way
speculative writes are simply wrong. The dependency chain matters, and
is meaningful, and breaking it is actively bad.

As far as I can tell, the intent is that you can't do value
speculation (except perhaps for the "relaxed", which quite frankly
sounds largely useless). But then I do get very very nervous when
people talk about "proving" certain values.

                Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18  3:24                                                           ` Linus Torvalds
@ 2014-02-18  3:42                                                             ` Linus Torvalds
  2014-02-18  5:22                                                               ` Paul E. McKenney
  2014-02-18 16:17                                                               ` Torvald Riegel
  2014-02-18  5:01                                                             ` Paul E. McKenney
  1 sibling, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-18  3:42 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 17, 2014 at 7:24 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> As far as I can tell, the intent is that you can't do value
> speculation (except perhaps for the "relaxed", which quite frankly
> sounds largely useless).

Hmm. The language I see for "consume" is not obvious:

  "Consume operation: no reads in the current thread dependent on the
value currently loaded can be reordered before this load"

and it could make a compiler writer say that value speculation is
still valid, if you do it like this (with "ptr" being the atomic
variable):

  value = ptr->value;

into

  tmp = ptr;
  value = speculated.value;
  if (unlikely(tmp != &speculated))
    value = tmp->value;

which is still bogus. The load of "ptr" does happen before the load of
"value = speculated->value" in the instruction stream, but it would
still result in the CPU possibly moving the value read before the
pointer read at least on ARM and power.

So if you're a compiler person, you think you followed the letter of
the spec - as far as *you* were concerned, no load dependent on the
value of the atomic load moved to before the atomic load. You go home,
happy, knowing you've done your job. Never mind that you generated
code that doesn't actually work.

I dread having to explain to the compiler person that he may be right
in some theoretical virtual machine, but the code is subtly broken and
nobody will ever understand why (and likely not be able to create a
test-case showing the breakage).

But maybe the full standard makes it clear that "reordered before this
load" actually means on the real hardware, not just in the generated
instruction stream. Reading it with understanding of the *intent* and
understanding all the different memory models that requirement should
be obvious (on alpha, you need an "rmb" instruction after the load),
but ...
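
In kernel terms, the pattern that has to come out of such a load on
alpha is what rcu_dereference() already generates; a sketch:

	q = ACCESS_ONCE(gp);
	smp_read_barrier_depends();	/* the "rmb"; compiles to nothing
					   on non-alpha machines */
	v = q->a;			/* dependent load, must not be
					   satisfied before the pointer load */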

                    Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18  3:24                                                           ` Linus Torvalds
  2014-02-18  3:42                                                             ` Linus Torvalds
@ 2014-02-18  5:01                                                             ` Paul E. McKenney
  1 sibling, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18  5:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 17, 2014 at 07:24:56PM -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 7:00 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > One example that I learned about last week uses the branch-prediction
> > hardware to validate value speculation.  And no, I am not at all a fan
> > of value speculation, in case you were curious.
> 
> Heh. See the example I used in my reply to Alec Teal. It basically
> broke the same dependency the same way.

;-)

> Yes, value speculation of reads is simply wrong, the same way
> speculative writes are simply wrong. The dependency chain matters, and
> is meaningful, and breaking it is actively bad.
> 
> As far as I can tell, the intent is that you can't do value
> speculation (except perhaps for the "relaxed", which quite frankly
> sounds largely useless). But then I do get very very nervous when
> people talk about "proving" certain values.

That was certainly my intent, but as you might have noticed in the
discussion earlier in this thread, the intent can get lost pretty
quickly.  ;-)

The HPC guys appear to be the most interested in breaking dependencies.
Their software doesn't rely on dependencies, and from their viewpoint
anything that has any chance of leaving an FP unit of any type idle is
a very bad thing.  But there are probably other benchmarks for which
breaking dependencies gives a few percent performance boost.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18  3:42                                                             ` Linus Torvalds
@ 2014-02-18  5:22                                                               ` Paul E. McKenney
  2014-02-18 16:17                                                               ` Torvald Riegel
  1 sibling, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18  5:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 17, 2014 at 07:42:42PM -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 7:24 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > As far as I can tell, the intent is that you can't do value
> > speculation (except perhaps for the "relaxed", which quite frankly
> > sounds largely useless).
> 
> Hmm. The language I see for "consume" is not obvious:
> 
>   "Consume operation: no reads in the current thread dependent on the
> value currently loaded can be reordered before this load"
> 
> and it could make a compiler writer say that value speculation is
> still valid, if you do it like this (with "ptr" being the atomic
> variable):
> 
>   value = ptr->value;
> 
> into
> 
>   tmp = ptr;
>   value = speculated.value;
>   if (unlikely(tmp != &speculated))
>     value = tmp->value;
> 
> which is still bogus. The load of "ptr" does happen before the load of
> "value = speculated->value" in the instruction stream, but it would
> still result in the CPU possibly moving the value read before the
> pointer read at least on ARM and power.
> 
> So if you're a compiler person, you think you followed the letter of
> the spec - as far as *you* were concerned, no load dependent on the
> value of the atomic load moved to before the atomic load. You go home,
> happy, knowing you've done your job. Never mind that you generated
> code that doesn't actually work.

Agreed, that would be bad.  But please see below.

> I dread having to explain to the compiler person that he may be right
> in some theoretical virtual machine, but the code is subtly broken and
> nobody will ever understand why (and likely not be able to create a
> test-case showing the breakage).

If things go as they usually do, such explanations will be required
a time or two.

> But maybe the full standard makes it clear that "reordered before this
> load" actually means on the real hardware, not just in the generated
> instruction stream. Reading it with understanding of the *intent* and
> understanding all the different memory models, that requirement should
> be obvious (on alpha, you need an "rmb" instruction after the load),
> but ...

The key point with memory_order_consume is that it must be paired with
some sort of store-release, a category that includes stores tagged
with memory_order_release (surprise!), memory_order_acq_rel, and
memory_order_seq_cst.  This pairing is analogous to the memory-barrier
pairing in the Linux kernel.

So you have something like this for the rcu_assign_pointer() side:

	p = kmalloc(...);
	if (unlikely(!p))
		return -ENOMEM;
	p->a = 1;
	p->b = 2;
	p->c = 3;
	/* The following would be buried within rcu_assign_pointer(). */
	atomic_store_explicit(&gp, p, memory_order_release);

And something like this for the rcu_dereference() side:

	/* The following would be buried within rcu_dereference(). */
	q = atomic_load_explicit(&gp, memory_order_consume);
	do_something_with(q->a);

So, let's look at the C11 draft, section 5.1.2.4 "Multi-threaded
executions and data races".

5.1.2.4p14 says that the atomic_load_explicit() carries a dependency to
the argument of do_something_with().

5.1.2.4p15 says that the atomic_store_explicit() is dependency-ordered
before the atomic_load_explicit().

5.1.2.4p15 also says that the atomic_store_explicit() is
dependency-ordered before the argument of do_something_with().  This is
because if A is dependency-ordered before X and X carries a dependency
to B, then A is dependency-ordered before B.

5.1.2.4p16 says that the atomic_store_explicit() inter-thread happens
before the argument of do_something_with().

The assignment to p->a is sequenced before the atomic_store_explicit().

Therefore, combining these last two, the assignment to p->a happens
before the argument of do_something_with(), and that means that
do_something_with() had better see the "1" assigned to p->a or some
later value.

But as far as I know, compiler writers currently take the approach of
treating memory_order_consume as if it were memory_order_acquire.
Which certainly works, as long as ARM and PowerPC people don't mind
an extra memory barrier out of each rcu_dereference().
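
That is, today the rcu_dereference() side above effectively gets
compiled as if it read (a sketch):

	/* consume silently promoted to acquire by current compilers */
	q = atomic_load_explicit(&gp, memory_order_acquire);
	do_something_with(q->a);	/* ARM/PowerPC: a barrier now sits
					   between the two loads */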

Which is one thing that compiler writers are permitted to do according
to the standard -- substitute a memory-barrier instruction for any
given dependency...

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18  0:05                                                           ` Linus Torvalds
@ 2014-02-18 15:31                                                             ` Torvald Riegel
  2014-02-18 16:49                                                               ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-18 15:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alec Teal, Paul McKenney, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> And exactly because I know enough, I would *really* like atomics to be
> well-defined, and have very clear - and *local* - rules about how they
> can be combined and optimized.

"Local"?

> None of this "if you can prove that the read has value X" stuff. And
> things like value speculation should simply not be allowed, because
> that actually breaks the dependency chain that the CPU architects give
> guarantees for. Instead, make the rules be very clear, and very
> simple, like my suggestion. You can never remove a load because you
> can "prove" it has some value, but you can combine two consecutive
> atomic accesses/

Sorry, but the rules *are* very clear.  I *really* suggest looking at
the formalization by Batty et al.  And in these rules, proving that a
read will always return value X has a well-defined meaning, and you can
use it.  That simply follows from how the model is built.

What you seem to want just isn't covered by the model as it is today --
you can't infer from that that the model itself would be wrong.  The
dependency chains aren't modeled in the way you envision it (except in
what consume_mo tries, but that seems to be hard to implement); they are
there on the level of the program logic as modeled by the abstract
machine and the respective execution/output rules, but they are not
built to represent those specific ordering guarantees the HW gives you.

I would also be cautious claiming that the rules you suggested would be
very clear and very simple.  I haven't seen a memory model spec from you
that would be viable as the standard model for C/C++, nor have I seen
proof that this would actually be easier to understand for programmers
in general.

> For example, CPU people actually do tend to give guarantees for
> certain things, like stores that are causally related being visible in
> a particular order.

Yes, but that's not part of the model so far.  If you want to exploit
this, please make a suggestion for how to extend the model to cover
this.

You certainly expect compilers to actually optimize code, which usually
removes or reorders stuff.  Now, we could say that we just don't want
that for atomic accesses, but even this isn't as clear-cut: How do we
actually distinguish (1) a dependency that the programmer intends to be
transformed into a dependency in the generated code from (2) a
dependency that is just program logic?

IOW, consider the consume_mo example "*(p + flag - flag)", where flag is
coming from a consume_mo load: Can you give a complete set of rules (for
a full program) for when the compiler is allowed to optimize out
flag-flag and when not?  (And it should be practically implementable in
a compiler, and not prevent optimizations where people expect them.)
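
For reference, that example has this shape (hypothetical names):

  int flag = atomic_load_explicit(&gflag, memory_order_consume);
  int v = *(p + flag - flag);  /* flag - flag is always 0, but the
                                  syntactic dependency on flag is what
                                  is supposed to order the load of *p */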

> If the compiler starts doing value speculation on
> atomic accesses, you are quite possibly breaking things like that.
> It's just not a good idea. Don't do it. Write the standard so that it
> clearly is disallowed.

You never want value speculation for anything possibly originating from
an atomic load?  Or what are the concrete rules you have in mind?

> Because you may think that a C standard is machine-independent, but
> that isn't really the case. The people who write code still write code
> for a particular machine. Our code works (in the general case) on
> different byte orderings, different register sizes, different memory
> ordering models. But in each *instance* we still end up actually
> coding for each machine.

That's how *you* do it.  I'm sure you are aware that for lots of
programmers, having a machine-dependent standard is not helpful at all.
As the standard is written, code is portable.  It can certainly be
beneficial for programmers to optimize for different machines, but the
core of the memory model will always remain portable, because that's
what arguably most programmers need.  That doesn't prevent
machine-dependent extensions, but those then should be well integrated
with the rest of the model.

> So the rules for atomics should be simple and *specific* enough that
> when you write code for a particular architecture, you can take the
> architecture memory ordering *and* the C atomics orderings into
> account, and do the right thing for that architecture.

That would be an extension, but not viable as a *general* requirement as
far as the standard is concerned.

> And that very much means that doing things like value speculation MUST
> NOT HAPPEN. See? Even if you can prove that your code is "equivalent",
> it isn't.

I'm kind of puzzled why you keep stating generalizations of assertions
that clearly aren't well-founded (i.e., which might be true for the kernel
or your wishes for how the world should be, but aren't true in general
or for the objectives of others).

It's not helpful to just pretend other people's opinions or worlds
didn't exist in discussions such as this one.  It's clear that the
standard has to consider all users of C/C++.  The kernel is a big user
of it, but also just that.

Why don't you just say that you do not *want* value speculation to
happen, because it does X and you'd like to exploit HW guarantee Y etc.?
That would be constructive, and actually correct in contrast to those
other broad claims.

(FWIW: Personally, I don't care whether you say something in a nice tone
or in some other way WITH BIG LETTERS.  I scan both for logically
consistent reasoning, and filter out the rest.  Therefore, avoiding
excessive amounts of the rest would make the discussion more efficient
for me, and I'd appreciate if everyone could work towards making it
efficient.  It's a tricky topic we're discussing, so I'd like to use my
brain for the actual technical bits and not for all the noise.)

> So for example, let's say that you have a pointer, and you have some
> reason to believe that the pointer has a particular value. So you
> rewrite following the pointer from this:
> 
> >   value = ptr->value;
> 
> into
> 
>   value = speculated->value;
>   tmp = ptr;
>   if (unlikely(tmp != speculated))
>     value = tmp->value;
> 
> and maybe you can now make the critical code-path for the speculated
> case go faster (since now there is no data dependency for the
> speculated case, and the actual pointer chasing load is now no longer
> in the critical path), and you made things faster because your
> profiling showed that the speculated case was true 99% of the time.
> Wonderful, right? And clearly, the code "provably" does the same
> thing.

Please try to see that any proof is based on the language semantics as
specified.  Thus, you have to distinguish between (1) disagreeing with
the proof being correct and (2) disliking the language semantics.

I believe you mean the latter (let me know if you actually mean (1)); in
that case, it's the language semantics you want to be different, and we
need to discuss this.

> EXCEPT THAT IS NOT TRUE AT ALL.
> 
> It very much does not do the same thing at all, and by doing value
> speculation and "proving" something was true, the only thing you did
> was to make incorrect code run faster. Because now the causally
> related load of value from the pointer isn't actually causally related
> at all, and you broke the memory ordering.

That's not how the language is specified, sorry.

> This is why I don't like it when I see Torvald talk about "proving"
> things. It's bullshit. You can "prove" pretty much anything, and in
> the process lose sight of the bigger issue, namely that there is code
> that depends on

Depends on what?

We really need to keep separate things separate here.  We started the
discussion with the topic of what behavior the memory model *as
specified today* allows.  In that model, the proofs are correct.

This model does not let programmers exploit the control dependencies and
such that you have in mind.  It does let programmers write correct code,
but this code might be somewhat less efficient (e.g., because it would
use memory_order_acquire instead of a weaker MO and control
dependencies).  If you want to discuss that, fine; we'd then need to see
how we can specify this properly and offer to programmers, and how to
extend the specification of the language semantics accordingly.

That Paul is looking through the uses of synchronization and how they
could (or could not) map to the memory model will be good input to this
discussion.

> When it comes to atomic accesses, you don't play those kinds of games,
> exactly because the ordering of the accesses matter in ways that are
> not really sanely describable at a source code level.

They obviously need to be specifiable for a language that wants to
provide portable atomics.  For machine-dependent extensions, this could
take machine properties into account, of course.  But even those need to
be modeled properly, at the source code level.  How else is a C/C++
compiler to make sense of it?

> The *only* kinds
> of games you play are like the ones I described - combining accesses
> under very strict local rules.
> 
> And the strict local rules really should be of the type "a store
> followed by a load to the same location with the same memory ordering
> constraints can be combined". Never *ever* of the kind "if you can
> prove X".

I doubt that's a better way to put it.  For example, you haven't defined
"followed" -- if you'd do that in a complete way, you'd inevitably come
to situations where something is checked for being provably correct.
What you do, in fact, is relying on the proof that the adjacent
store/load pair could always be executed atomically by the machine,
without preventing forward progress.

> I hope my example made it clear *why* I react so strongly when Torvald
> starts talking about "if you can prove the value is 1".

It's getting clearer to me, but as far as I can understand, from my
perspective, you're objecting to a straw man while not requesting the
thing you actually want.

It seems you're coming from a bottom-up perspective, seeing C code as
something defined like high-level assembler, which still has
machine-dependent parts and such.  The standard defines the language
semantics kind of the other way around, top-down from an abstract
machine, with memory_orders that at least to some extent allow one to
exploit different HW mechanisms with different costs for enforcing
ordering.

I can see why you have this perspective, but the standard needs the
top-down approach to suit all its users.  I believe we should continue
discussing how we can bridge the gap between both views.  None of those
views is inherently wrong, they are just different.  No strong opinion
will change that, but I'm still hopeful that constructive discussion can
help bridge the gap.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18  0:18                                                                   ` Linus Torvalds
  2014-02-18  1:26                                                                     ` Paul E. McKenney
@ 2014-02-18 15:38                                                                     ` Torvald Riegel
  2014-02-18 16:55                                                                       ` Paul E. McKenney
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-18 15:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > There's an underlying problem here that's independent from the actual
> > instance that you're worried about here: "no sense" is ultimately a
> > matter of taste/objectives/priorities as long as the respective
> > specification is logically consistent.
> 
> Yes. But I don't think it's "independent".
> 
> Exactly *because* some people will read standards without applying
> "does the resulting code generation actually make sense for the
> programmer that wrote the code", the standard has to be pretty clear.
> 
> The standard often *isn't* pretty clear. It wasn't clear enough when
> it came to "volatile", and yet that was a *much* simpler concept than
> atomic accesses and memory ordering.
> 
> And most of the time it's not a big deal. But because the C standard
> generally tries to be very portable, and cover different machines,
> there tends to be a mindset that anything inherently unportable is
> "undefined" or "implementation defined", and then the compiler writer
> is basically given free reign to do anything they want (with
> "implementation defined" at least requiring that it is reliably the
> same thing).

Yes, that's how it works in general.  And this makes sense, because all
optimizations rely on it.  Also, you can't keep something consistent
(eg, between compilers) if it isn't specified.  So if we want stricter
rules, those need to be specified somewhere.

> And when it comes to memory ordering, *everything* is basically
> non-portable, because different CPU's very much have different rules.

Well, the current set of memory orders (and the memory model as a whole)
is portable, even though it might not allow one to exploit all hardware
properties, and thus might perform sub-optimally in some cases.

> I worry that that means that the standard then takes the stance that
> "well, compiler re-ordering is no worse than CPU re-ordering, so we
> let the compiler do anything". And then we have to either add
> "volatile" to make sure the compiler doesn't do that, or use an overly
> strict memory model at the compiler level that makes it all pointless.

Using "volatile" is not a good option, I think, because synchronization
between threads should be orthogonal to observable output of the
abstract machine.

The current memory model might not allow one to exploit all hardware
properties, I agree.

But then why don't we work on extending it to do so?  We need to
specify the behavior we want anyway, and this can't be independent of
the language semantics, so it has to be conceptually integrated with
the standard.

> So I really really hope that the standard doesn't give compiler
> writers free hands to do anything that they can prove is "equivalent"
> in the virtual C machine model.

It does, but it also doesn't mean this can't be extended.  So let's
focus on whether we can find an extension.

> That's not how you get reliable
> results.

In this general form, that's obviously a false claim.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18  0:09                                                         ` Linus Torvalds
@ 2014-02-18 15:46                                                           ` Torvald Riegel
  0 siblings, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-18 15:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-17 at 16:09 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 3:17 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > On Mon, 2014-02-17 at 14:32 -0800,
> >
> >> Stop claiming it "can return 1".. It *never* returns 1 unless you do
> >> the load and *verify* it, or unless the load itself can be made to go
> >> away. And with the code sequence given, that just doesn't happen. END
> >> OF STORY.
> >
> > void foo()
> > {
> >   atomic<int> x = 1;
> >   if (atomic_load(&x, mo_relaxed) == 1)
> >     atomic_store(&y, 3, mo_relaxed);
> > }
> 
> This is the very example I gave, where the real issue is not that "you
> prove that load returns 1", you instead say "store followed by a load
> can be combined".
> 
> I (in another email I just wrote) tried to show why the "prove
> something is true" is a very dangerous model.  Seriously, it's pure
> crap. It's broken.

I don't see anything dangerous in the example above with the language
semantics as specified: it's a well-defined situation, given the rules
of the language.  I replied to the other email you wrote with my
viewpoint on why the above is useful, how it compares to what you seem
to want, and where I think we need to start to bridge the gap.
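
For illustration, a sketch (in the same pseudo-C++ notation as the
example above) of the transformation that this well-defined semantics
permits: x is function-local and never escapes, so no other thread can
access it, the load must read the 1 stored at initialization, and the
whole function may legally become:

    void foo()
    {
      // x provably reads 1, so the branch folds away:
      atomic_store(&y, 3, mo_relaxed);
    }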

> If the C standard defines atomics in terms of "provable equivalence",
> it's broken. Exactly because on a *virtual* machine you can prove
> things that are not actually true in a *real* machine.

For the control dependencies you have in mind, it's actually the other
way around.  You expect the real machine's properties in a program
whose semantics only give you the virtual machine's properties.
Anything you prove on the virtual machine will be true on the real
machine (in a correct implementation) -- but you can't expect to get
real-machine properties in a language that's based on the virtual
machine.

> I have the
> example of value speculation changing the memory ordering model of the
> actual machine.

This example does not hold for the language as specified.  It holds for
a modified language that you have in mind, but for that one I've only
seen pretty rough rules so far.  Please see my other reply.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18  3:00                                                         ` Paul E. McKenney
  2014-02-18  3:24                                                           ` Linus Torvalds
@ 2014-02-18 15:56                                                           ` Torvald Riegel
  2014-02-18 16:51                                                             ` Paul E. McKenney
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-18 15:56 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, 2014-02-17 at 19:00 -0800, Paul E. McKenney wrote:
> On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote:
> > On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > >
> > > Which example do you have in mind here?  Haven't we resolved all the
> > > debated examples, or did I miss any?
> > 
> > Well, Paul seems to still think that the standard possibly allows
> > speculative writes or possibly value speculation in ways that break
> > the hardware-guaranteed orderings.
> 
> It is not that I know of any specific problems, but rather that I
> know I haven't looked under all the rocks.  Plus my impression from
> my few years on the committee is that the standard will be pushed to
> the limit when it comes time to add optimizations.
> 
> One example that I learned about last week uses the branch-prediction
> hardware to validate value speculation.  And no, I am not at all a fan
> of value speculation, in case you were curious.  However, it is still
> an educational example.
> 
> This is where you start:
> 
> 	p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
> 	do_something(p->a, p->b, p->c);
> 	p->d = 1;

I assume that's the source code.

> Then you leverage branch-prediction hardware as follows:
> 
> 	p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
> 	if (p == GUESS) {
> 		do_something(GUESS->a, GUESS->b, GUESS->c);
> 		GUESS->d = 1;
> 	} else {
> 		do_something(p->a, p->b, p->c);
> 		p->d = 1;
> 	}

I assume that this is a potential transformation by a compiler.

> The CPU's branch-prediction hardware squashes speculation in the case where
> the guess was wrong, and this prevents the speculative store to ->d from
> ever being visible.  However, the then-clause breaks dependencies, which
> means that the loads -could- be speculated, so that do_something() gets
> passed pre-initialization values.
> 
> Now, I hope and expect that the wording in the standard about dependency
> ordering prohibits this sort of thing.  But I do not yet know for certain.

The transformation would be incorrect.  p->a in the source code carries
a dependency, and as you say, the transformed code wouldn't have that
dependency any more.  The transformed code would thus lose ordering
constraints that it has in the virtual machine, so in the absence of
other proofs of correctness based on properties not shown in the
example, the transformed code would not result in the same behavior as
allowed by the abstract machine.

If the transformation were actually made by a programmer, then it
wouldn't do the same as the first example, because mo_consume doesn't
work through the if statement.
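
As a concrete illustration (a sketch in C11 syntax; the struct and
variable names are assumptions, not code from this thread): mo_consume
ordering is carried only through data dependencies on the loaded
value, and the standard even provides an explicit way to end such a
chain:

    #include <stdatomic.h>

    struct node { int a; int b; };
    _Atomic(struct node *) gp;

    void reader(void)
    {
        struct node *p =
            atomic_load_explicit(&gp, memory_order_consume);
        int a = p->a;                  /* carries a dependency: ordered
                                          after the load of gp */
        int b = kill_dependency(p)->b; /* chain explicitly ended: no
                                          ordering guarantee */
        (void)a; (void)b;
    }

A branch like "if (p == GUESS)" ends the chain for the accesses through
GUESS in just the same way, which is why the transformation above loses
the ordering.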

Are there other specific concerns that you have regarding this example?


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18  3:42                                                             ` Linus Torvalds
  2014-02-18  5:22                                                               ` Paul E. McKenney
@ 2014-02-18 16:17                                                               ` Torvald Riegel
  2014-02-18 17:44                                                                 ` Linus Torvalds
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-18 16:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-17 at 19:42 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 7:24 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > As far as I can tell, the intent is that you can't do value
> > speculation (except perhaps for the "relaxed", which quite frankly
> > sounds largely useless).
> 
> Hmm. The language I see for "consume" is not obvious:
> 
>   "Consume operation: no reads in the current thread dependent on the
> value currently loaded can be reordered before this load"

I can't remember seeing that language in the standard (ie, C or C++).
Where is this from?

> and it could make a compiler writer say that value speculation is
> still valid, if you do it like this (with "ptr" being the atomic
> variable):
> 
>   value = ptr->val;

I assume the load from ptr has mo_consume ordering?

> into
> 
>   tmp = ptr;
>   value = speculated.value;
>   if (unlikely(tmp != &speculated))
>     value = tmp->value;
> 
> which is still bogus. The load of "ptr" does happen before the load of
> "value = speculated->value" in the instruction stream, but it would
> still result in the CPU possibly moving the value read before the
> pointer read at least on ARM and power.

And surprise, in the C/C++ model the load from ptr is sequenced-before
the load from speculated, but there's no ordering constraint on the
reads-from relation for the value load if you use mo_consume on the ptr
load.  Thus, the transformed code has fewer ordering constraints than
the original code, and we arrive at the same outcome.
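
Spelled out in C11 syntax (a sketch; the types and names are
assumptions), the two forms compare as follows:

    #include <stdatomic.h>

    struct obj { int val; };
    _Atomic(struct obj *) ptr;
    struct obj speculated;

    int original(void)
    {
        struct obj *tmp =
            atomic_load_explicit(&ptr, memory_order_consume);
        return tmp->val;    /* data dependency on tmp: ordered */
    }

    int transformed(void)
    {
        struct obj *tmp =
            atomic_load_explicit(&ptr, memory_order_consume);
        int value = speculated.val;  /* no data dependency on tmp, so
                                        mo_consume constrains nothing
                                        about this load */
        if (tmp != &speculated)
            value = tmp->val;
        return value;
    }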

> So if you're a compiler person, you think you followed the letter of
> the spec - as far as *you* were concerned, no load dependent on the
> value of the atomic load moved to before the atomic load.

No.  Because the wobbly sentence you cited(?) above is not what the
standard says.

Would you please stop making claims about what compiler writers would do
or not if you seemingly aren't even familiar with the model that
compiler writers would use to reason about transformations?  Seriously?

> You go home,
> happy, knowing you've done your job. Never mind that you generated
> code that doesn't actually work.
> 
> I dread having to explain to the compiler person that he may be right
> in some theoretical virtual machine, but the code is subtly broken and
> nobody will ever understand why (and likely not be able to create a
> test-case showing the breakage).
> 
> But maybe the full standard makes it clear that "reordered before this
> load" actually means on the real hardware, not just in the generated
> instruction stream.

Do you think everyone else is stupid?  If there's an ordering constraint
in the virtual machine, it better be present when executing in the real
machine unless it provably cannot result in different output as
specified by the language's semantics.

> Reading it with understanding of the *intent* and
> understanding all the different memory models that requirement should
> be obvious (on alpha, you need an "rmb" instruction after the load),
> but ...

The standard is clear on what's required.  I strongly suggest reading
the formalization of the memory model by Batty et al.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 15:31                                                             ` Torvald Riegel
@ 2014-02-18 16:49                                                               ` Linus Torvalds
  2014-02-18 17:16                                                                 ` Paul E. McKenney
  2014-02-18 21:21                                                                 ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-18 16:49 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Alec Teal, Paul McKenney, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote:
> On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
>> And exactly because I know enough, I would *really* like atomics to be
>> well-defined, and have very clear - and *local* - rules about how they
>> can be combined and optimized.
>
> "Local"?

Yes.

So I think that one of the big advantages of atomics over volatile is
that they *can* be optimized, and as such I'm not at all against
trying to generate much better code than for volatile accesses.

But at the same time, that can go too far. For example, one of the
things we'd want to use atomics for is page table accesses, where it
is very important that we don't generate multiple accesses to the
values, because parts of the values can be changed *by hardware* (ie
accessed and dirty bits).

So imagine that you have some clever global optimizer that sees that
the program never ever actually sets the dirty bit at all in any
thread, and then uses that kind of non-local knowledge to make
optimization decisions. THAT WOULD BE BAD.

Do you see what I'm aiming for? Any optimization that tries to prove
anything from more than local state is by definition broken, because
it assumes that everything is described by the program.

But *local* optimizations are fine, as long as they follow the obvious
rule of not actually making changes that are semantically visible.

(In practice, I'd be impressed as hell for that particular example,
and we actually do end up setting the dirty bit by hand in some
situations, so the example is slightly made up, but there are other
cases that might be more realistic in that sometimes we do things that
are hidden from the compiler - in assembly etc - and the compiler
might *think* it knows what is going on, but it doesn't actually see
all accesses).
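
(To pin down the page-table example a bit, here is a self-contained
sketch of the single-access requirement; the type and bit value are
simplified assumptions, and in the kernel this is what the
ACCESS_ONCE() idiom is for:

    typedef unsigned long pte_t;
    #define PTE_DIRTY 0x40UL           /* made-up bit for illustration */

    /* Read the PTE exactly once; the MMU may set accessed/dirty bits
     * concurrently, so re-reading could observe a different value. */
    static inline pte_t pte_read_once(pte_t *ptep)
    {
        return *(volatile pte_t *)ptep;
    }

    int pte_is_dirty(pte_t *ptep)
    {
        pte_t pte = pte_read_once(ptep);   /* single access */
        return (pte & PTE_DIRTY) != 0;
    }

A compiler that re-read *ptep for each use, or that "proved" from
whole-program analysis that nothing ever stores PTE_DIRTY, would break
exactly this.)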

> Sorry, but the rules *are* very clear.  I *really* suggest to look at
> the formalization by Batty et al.  And in these rules, proving that a
> read will always return value X has a well-defined meaning, and you can
> use it.  That simply follows from how the model is built.

What "model"?

That's the thing. I have tried to figure out whether the model is some
abstract C model, or a model based on the actual hardware that the
compiler is compiling for, and whether the model is one that assumes
the compiler has complete knowledge of the system (see the example
above).

And it seems to be a mixture of it all. The definitions of the various
orderings obviously very much imply that the compiler has to insert
the proper barriers or sequence markers for that architecture, but
then there is no apparent way to depend on any *other* architecture
ordering guarantees. Like our knowledge that all architectures (with
the exception of alpha, which really doesn't end up being something we
worry about any more) end up having the load dependency ordering
guarantee.

> What you seem to want just isn't covered by the model as it is today --
> you can't infer from that that the model itself would be wrong.  The
> dependency chains aren't modeled in the way you envision it (except in
> what consume_mo tries, but that seems to be hard to implement); they are
> there on the level of the program logic as modeled by the abstract
> machine and the respective execution/output rules, but they are not
> built to represent those specific ordering guarantees the HW gives you.

So this is a problem. It basically means that we cannot do the kinds
of things we do now, which very much involve knowing what the memory
ordering of a particular machine is, and combining that knowledge with
our knowledge of code generation.

Now, *most* of what we do is protected by locking and is all fine. But
we do have a few rather subtle places in RCU and in the scheduler
where we depend on the actual dependency chain.

In *practice*, I seriously doubt any reasonable compiler can actually
make a mess of it. The kinds of optimizations that would actually
defeat the dependency chain are simply not realistic. And I suspect
that will end up being what we rely on - there being no actual sane
sequence that a compiler would ever do, even if we wouldn't have
guarantees for some of it.

And I suspect I can live with that. We _have_ lived with that for the
longest time, after all. We very much do things that aren't covered by
any existing C standard, and just basically use tricks to coax the
compiler into never generating code that doesn't work (with our inline
asm barriers etc being a prime example).

> I would also be cautious claiming that the rules you suggested would be
> very clear and very simple.  I haven't seen a memory model spec from you
> that would be viable as the standard model for C/C++, nor have I seen
> proof that this would actually be easier to understand for programmers
> in general.

So personally, if I were to write the spec, I would have taken a
completely different approach from what the standard obviously does.

I'd have taken the approach of specifying the required semantics of
each atomic op (ie the memory operations would end up having to be
annotated with their ordering constraints), and then said that the
compiler can generate any code that is equivalent to that
_on_the_target_machine_.

Why? Just to avoid the whole "ok, which set of rules applies now" problem.

>> For example, CPU people actually do tend to give guarantees for
>> certain things, like stores that are causally related being visible in
>> a particular order.
>
> Yes, but that's not part of the model so far.  If you want to exploit
> this, please make a suggestion for how to extend the model to cover
> this.

See above. This is exactly why I detest the C "model" thing. Now you
need ways to describe those CPU guarantees, because if you can't
describe them, you can't express them in the model.

I would *much* have preferred the C standard to say that you have to
generate code that is guaranteed to act the same way - on that machine
- as the "naive direct unoptimized translation".

IOW, I would *revel* in the fact that different machines are
different, and basically just describe the "stupid" code generation.
You'd get the guaranteed semantic baseline, but you'd *also* be able
to know that whatever architecture guarantees you have would remain.
Without having to describe those architecture issues.

It would be *so* nice if the C standard had done that for pretty much
everything that is "implementation dependent". Not just atomics.

[ will look at the rest of your email later ]

                 Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 15:56                                                           ` Torvald Riegel
@ 2014-02-18 16:51                                                             ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18 16:51 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Feb 18, 2014 at 04:56:40PM +0100, Torvald Riegel wrote:
> On Mon, 2014-02-17 at 19:00 -0800, Paul E. McKenney wrote:
> > On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote:
> > > On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > > >
> > > > Which example do you have in mind here?  Haven't we resolved all the
> > > > debated examples, or did I miss any?
> > > 
> > > Well, Paul seems to still think that the standard possibly allows
> > > speculative writes or possibly value speculation in ways that break
> > > the hardware-guaranteed orderings.
> > 
> > It is not that I know of any specific problems, but rather that I
> > know I haven't looked under all the rocks.  Plus my impression from
> > my few years on the committee is that the standard will be pushed to
> > the limit when it comes time to add optimizations.
> > 
> > One example that I learned about last week uses the branch-prediction
> > hardware to validate value speculation.  And no, I am not at all a fan
> > of value speculation, in case you were curious.  However, it is still
> > an educational example.
> > 
> > This is where you start:
> > 
> > 	p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
> > 	do_something(p->a, p->b, p->c);
> > 	p->d = 1;
> 
> I assume that's the source code.

Yep!

> > Then you leverage branch-prediction hardware as follows:
> > 
> > 	p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
> > 	if (p == GUESS) {
> > 		do_something(GUESS->a, GUESS->b, GUESS->c);
> > 		GUESS->d = 1;
> > 	} else {
> > 		do_something(p->a, p->b, p->c);
> > 		p->d = 1;
> > 	}
> 
> I assume that this is a potential transformation by a compiler.

Again, yep!

> > The CPU's branch-prediction hardware squashes speculation in the case where
> > the guess was wrong, and this prevents the speculative store to ->d from
> > ever being visible.  However, the then-clause breaks dependencies, which
> > means that the loads -could- be speculated, so that do_something() gets
> > passed pre-initialization values.
> > 
> > Now, I hope and expect that the wording in the standard about dependency
> > ordering prohibits this sort of thing.  But I do not yet know for certain.
> 
> The transformation would be incorrect.  p->a in the source code carries
> a dependency, and as you say, the transformed code wouldn't have that
> dependency any more.  So the transformed code would loose ordering
> constraints that it has in the virtual machine, so in the absence of
> other proofs of correctness based on properties not shown in the
> example, the transformed code would not result in the same behavior as
> allowed by the abstract machine.

Glad that you agree!  ;-)

> If the transformation would actually be by a programmer, then this
> wouldn't do the same as the first example because mo_consume doesn't
> work through the if statement.

Agreed.

> Are there other specified concerns that you have regarding this example?

Nope.  Just generalized paranoia.  (But just because I am paranoid
doesn't mean that there isn't a bug lurking somewhere in the standard,
the compiler, the kernel, or my own head!)

I will likely have more once I start mapping Linux kernel atomics to the
C11 standard.  One more paper past N3934 comes first, though.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 15:38                                                                     ` Torvald Riegel
@ 2014-02-18 16:55                                                                       ` Paul E. McKenney
  2014-02-18 19:57                                                                         ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18 16:55 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Feb 18, 2014 at 04:38:40PM +0100, Torvald Riegel wrote:
> On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote:
> > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > >
> > > There's an underlying problem here that's independent from the actual
> > > instance that you're worried about here: "no sense" is a ultimately a
> > > matter of taste/objectives/priorities as long as the respective
> > > specification is logically consistent.
> > 
> > Yes. But I don't think it's "independent".
> > 
> > Exactly *because* some people will read standards without applying
> > "does the resulting code generation actually make sense for the
> > programmer that wrote the code", the standard has to be pretty clear.
> > 
> > The standard often *isn't* pretty clear. It wasn't clear enough when
> > it came to "volatile", and yet that was a *much* simpler concept than
> > atomic accesses and memory ordering.
> > 
> > And most of the time it's not a big deal. But because the C standard
> > generally tries to be very portable, and cover different machines,
> > there tends to be a mindset that anything inherently unportable is
> > "undefined" or "implementation defined", and then the compiler writer
> > is basically given free reign to do anything they want (with
> > "implementation defined" at least requiring that it is reliably the
> > same thing).
> 
> Yes, that's how it works in general.  And this makes sense, because all
> optimizations rely on that.  Second, you can't keep something consistent
> (eg, between compilers) if it isn't specified.  So if we want stricter
> rules, those need to be specified somewhere.
> 
> > And when it comes to memory ordering, *everything* is basically
> > non-portable, because different CPU's very much have different rules.
> 
> Well, the current set of memory orders (and the memory model as a whole)
> is portable, even though it might not allow to exploit all hardware
> properties, and thus might perform sub-optimally in some cases.
> 
> > I worry that that means that the standard then takes the stance that
> > "well, compiler re-ordering is no worse than CPU re-ordering, so we
> > let the compiler do anything". And then we have to either add
> > "volatile" to make sure the compiler doesn't do that, or use an overly
> > strict memory model at the compiler level that makes it all pointless.
> 
> Using "volatile" is not a good option, I think, because synchronization
> between threads should be orthogonal to observable output of the
> abstract machine.

Are you thinking of "volatile" -instead- of atomics?  My belief is that
given the current standard there will be times that we need to use
"volatile" -in- -addition- to atomics.

> The current memory model might not allow to exploit all hardware
> properties, I agree.
> 
> But then why don't we work on how to extend it to do so?  We need to
> specify the behavior we want anyway, and this can't be independent of
> the language semantics, so it has to be conceptually integrated with the
> standard anyway.
> 
> > So I really really hope that the standard doesn't give compiler
> > writers free hands to do anything that they can prove is "equivalent"
> > in the virtual C machine model.
> 
> It does, but it also doesn't mean this can't be extended.  So let's
> focus on whether we can find an extension.
> 
> > That's not how you get reliable
> > results.
> 
> In this general form, that's obviously a false claim.

These two sentences starkly illustrate the difference in perspective
between you two.  You are talking past each other.  Not sure how to fix
this at the moment, but what else is new?  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 16:49                                                               ` Linus Torvalds
@ 2014-02-18 17:16                                                                 ` Paul E. McKenney
  2014-02-18 18:23                                                                   ` Peter Sewell
  2014-02-18 21:40                                                                   ` Torvald Riegel
  2014-02-18 21:21                                                                 ` Torvald Riegel
  1 sibling, 2 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18 17:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Alec Teal, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> >> And exactly because I know enough, I would *really* like atomics to be
> >> well-defined, and have very clear - and *local* - rules about how they
> >> can be combined and optimized.
> >
> > "Local"?
> 
> Yes.
> 
> So I think that one of the big advantages of atomics over volatile is
> that they *can* be optimized, and as such I'm not at all against
> trying to generate much better code than for volatile accesses.
> 
> But at the same time, that can go too far. For example, one of the
> things we'd want to use atomics for is page table accesses, where it
> is very important that we don't generate multiple accesses to the
> values, because parts of the values can be change *by*hardware* (ie
> accessed and dirty bits).
> 
> So imagine that you have some clever global optimizer that sees that
> the program never ever actually sets the dirty bit at all in any
> thread, and then uses that kind of non-local knowledge to make
> optimization decisions. THAT WOULD BE BAD.

Might as well list other reasons why value proofs via whole-program
analysis are unreliable for the Linux kernel:

1.	As Linus said, changes from hardware.

2.	Assembly code that is not visible to the compiler.
	Inline asms will -normally- let the compiler know what
	memory they change, but some just use the "memory" tag.
	Worse yet, I suspect that most compilers don't look all
	that carefully at .S files.

	Any number of other programs contain assembly files.

3.	Kernel modules that have not yet been written.  Now, the
	compiler could refrain from trying to prove anything about
	an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
	is currently no way to communicate this information to the
	compiler other than marking the variable "volatile".

	Other programs have similar issues, e.g., via dlopen().

4.	Some drivers allow user-mode code to mmap() some of their
	state.  Any changes undertaken by the user-mode code would
	be invisible to the compiler.

5.	JITed code produced based on BPF: https://lwn.net/Articles/437981/

And probably other stuff as well.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 16:17                                                               ` Torvald Riegel
@ 2014-02-18 17:44                                                                 ` Linus Torvalds
  2014-02-18 19:40                                                                   ` Paul E. McKenney
  2014-02-18 19:47                                                                   ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-18 17:44 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel <triegel@redhat.com> wrote:
>>
>>   "Consume operation: no reads in the current thread dependent on the
>> value currently loaded can be reordered before this load"
>
> I can't remember seeing that language in the standard (ie, C or C++).
> Where is this from?

That's just for googling for explanations. I do have some old standard
draft, but that doesn't have any concise definitions anywhere that I
could find.

>> and it could make a compiler writer say that value speculation is
>> still valid, if you do it like this (with "ptr" being the atomic
>> variable):
>>
>>   value = ptr->val;
>
> I assume the load from ptr has mo_consume ordering?

Yes.

>> into
>>
>>   tmp = ptr;
>>   value = speculated.value;
>>   if (unlikely(tmp != &speculated))
>>     value = tmp->value;
>>
>> which is still bogus. The load of "ptr" does happen before the load of
>> "value = speculated->value" in the instruction stream, but it would
>> still result in the CPU possibly moving the value read before the
>> pointer read at least on ARM and power.
>
> And surprise, in the C/C++ model the load from ptr is sequenced-before
> the load from speculated, but there's no ordering constraint on the
> reads-from relation for the value load if you use mo_consume on the ptr
> load.  Thus, the transformed code has less ordering constraints than the
> original code, and we arrive at the same outcome.

Ok, good.

> The standard is clear on what's required.  I strongly suggest reading
> the formalization of the memory model by Batty et al.

Can you point to it? Because I can find a draft standard, and it sure
as hell does *not* contain any clarity of the model. It has a *lot* of
verbiage, but it's pretty much impossible to actually understand, even
for somebody who really understands memory ordering.

             Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 17:16                                                                 ` Paul E. McKenney
@ 2014-02-18 18:23                                                                   ` Peter Sewell
  2014-02-18 19:00                                                                     ` Linus Torvalds
  2014-02-18 19:42                                                                     ` Paul E. McKenney
  2014-02-18 21:40                                                                   ` Torvald Riegel
  1 sibling, 2 replies; 299+ messages in thread
From: Peter Sewell @ 2014-02-18 18:23 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Linus Torvalds, Torvald Riegel, Alec Teal, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On 18 February 2014 17:16, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
>> On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote:
>> > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
>> >> And exactly because I know enough, I would *really* like atomics to be
>> >> well-defined, and have very clear - and *local* - rules about how they
>> >> can be combined and optimized.
>> >
>> > "Local"?
>>
>> Yes.
>>
>> So I think that one of the big advantages of atomics over volatile is
>> that they *can* be optimized, and as such I'm not at all against
>> trying to generate much better code than for volatile accesses.
>>
>> But at the same time, that can go too far. For example, one of the
>> things we'd want to use atomics for is page table accesses, where it
>> is very important that we don't generate multiple accesses to the
>> values, because parts of the values can be change *by*hardware* (ie
>> accessed and dirty bits).
>>
>> So imagine that you have some clever global optimizer that sees that
>> the program never ever actually sets the dirty bit at all in any
>> thread, and then uses that kind of non-local knowledge to make
>> optimization decisions. THAT WOULD BE BAD.
>
> Might as well list other reasons why value proofs via whole-program
> analysis are unreliable for the Linux kernel:
>
> 1.      As Linus said, changes from hardware.
>
> 2.      Assembly code that is not visible to the compiler.
>         Inline asms will -normally- let the compiler know what
>         memory they change, but some just use the "memory" tag.
>         Worse yet, I suspect that most compilers don't look all
>         that carefully at .S files.
>
>         Any number of other programs contain assembly files.
>
> 3.      Kernel modules that have not yet been written.  Now, the
>         compiler could refrain from trying to prove anything about
>         an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
>         is currently no way to communicate this information to the
>         compiler other than marking the variable "volatile".
>
>         Other programs have similar issues, e.g., via dlopen().
>
> 4.      Some drivers allow user-mode code to mmap() some of their
>         state.  Any changes undertaken by the user-mode code would
>         be invisible to the compiler.
>
> 5.      JITed code produced based on BPF: https://lwn.net/Articles/437981/
>
> And probably other stuff as well.

interesting list.  So are you saying that value-range-analysis and
such-like (I say glibly, without really knowing what "such-like"
refers to here) are fundamentally incompatible with
the kernel code, or can you think of some way to tell the compiler a
bound on the footprint of the "unseen" changes in each of those cases?

Peter


>                                                         Thanx, Paul
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 18:23                                                                   ` Peter Sewell
@ 2014-02-18 19:00                                                                     ` Linus Torvalds
  2014-02-18 19:42                                                                     ` Paul E. McKenney
  1 sibling, 0 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-18 19:00 UTC (permalink / raw)
  To: Peter.Sewell
  Cc: Paul McKenney, Torvald Riegel, Alec Teal, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Tue, Feb 18, 2014 at 10:23 AM, Peter Sewell
<Peter.Sewell@cl.cam.ac.uk> wrote:
>
> interesting list.  So are you saying that value-range-analysis and
> such-like (I say glibly, without really knowing what "such-like"
> refers to here) are fundamentally incompatible with
> the kernel code

No, it's fine to do things like value-range analysis; it's just that
you have to do it at a local scope, and not think that you can see
every single write to some variable.

And as Paul points out, this is actually generally true even outside
of kernels. We may be pretty unique in having some things that are
literally done by hardware (eg the page table updates), but we're
certainly not unique in having loadable modules or code that the
compiler doesn't know about by virtue of being compiled separately
(assembly files, JITted, whatever).

And we are actually perfectly fine with compiler barriers. One of our
most common barriers is simply this:

   #define barrier() __asm__ __volatile__("": : :"memory")

which basically tells the compiler "something changes memory in ways
you don't understand, so you cannot assume anything about memory
contents".

Obviously, a compiler is still perfectly fine optimizing things like
local variables etc that haven't had their address taken across such a
barrier (regardless of whether those local variables might be spilled
to the stack frame etc), but equally obviously we'd require that this
kind of thing makes sure that any atomic writes have been finalized by
the time the barrier happens.  (It does *not* imply a memory ordering,
just a compiler memory access barrier - we have separate things for
ordering requirements, and those obviously often generate actual
barrier instructions.)

And we're likely always going to have things like this, but if C11
atomics end up working well for us, we might have *fewer* of them.

              Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 17:44                                                                 ` Linus Torvalds
@ 2014-02-18 19:40                                                                   ` Paul E. McKenney
  2014-02-18 19:47                                                                   ` Torvald Riegel
  1 sibling, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18 19:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Feb 18, 2014 at 09:44:48AM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel <triegel@redhat.com> wrote:

[ . . . ]

> > The standard is clear on what's required.  I strongly suggest reading
> > the formalization of the memory model by Batty et al.
> 
> Can you point to it? Because I can find a draft standard, and it sure
> as hell does *not* contain any clarity of the model. It has a *lot* of
> verbiage, but it's pretty much impossible to actually understand, even
> for somebody who really understands memory ordering.

I suspect he is thinking of the following:

"Mathematizing C++ Concurrency." Mark Batty, Scott Owens, Susmit Sarkar,
Peter Sewell, and Tjark Weber.
https://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf

Even if you don't like the math, it contains some very good examples.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 18:23                                                                   ` Peter Sewell
  2014-02-18 19:00                                                                     ` Linus Torvalds
@ 2014-02-18 19:42                                                                     ` Paul E. McKenney
  1 sibling, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18 19:42 UTC (permalink / raw)
  To: Peter Sewell
  Cc: Linus Torvalds, Torvald Riegel, Alec Teal, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Tue, Feb 18, 2014 at 06:23:47PM +0000, Peter Sewell wrote:
> On 18 February 2014 17:16, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> >> On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote:
> >> > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> >> >> And exactly because I know enough, I would *really* like atomics to be
> >> >> well-defined, and have very clear - and *local* - rules about how they
> >> >> can be combined and optimized.
> >> >
> >> > "Local"?
> >>
> >> Yes.
> >>
> >> So I think that one of the big advantages of atomics over volatile is
> >> that they *can* be optimized, and as such I'm not at all against
> >> trying to generate much better code than for volatile accesses.
> >>
> >> But at the same time, that can go too far. For example, one of the
> >> things we'd want to use atomics for is page table accesses, where it
> >> is very important that we don't generate multiple accesses to the
> >> values, because parts of the values can be change *by*hardware* (ie
> >> accessed and dirty bits).
> >>
> >> So imagine that you have some clever global optimizer that sees that
> >> the program never ever actually sets the dirty bit at all in any
> >> thread, and then uses that kind of non-local knowledge to make
> >> optimization decisions. THAT WOULD BE BAD.
> >
> > Might as well list other reasons why value proofs via whole-program
> > analysis are unreliable for the Linux kernel:
> >
> > 1.      As Linus said, changes from hardware.
> >
> > 2.      Assembly code that is not visible to the compiler.
> >         Inline asms will -normally- let the compiler know what
> >         memory they change, but some just use the "memory" tag.
> >         Worse yet, I suspect that most compilers don't look all
> >         that carefully at .S files.
> >
> >         Any number of other programs contain assembly files.
> >
> > 3.      Kernel modules that have not yet been written.  Now, the
> >         compiler could refrain from trying to prove anything about
> >         an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> >         is currently no way to communicate this information to the
> >         compiler other than marking the variable "volatile".
> >
> >         Other programs have similar issues, e.g., via dlopen().
> >
> > 4.      Some drivers allow user-mode code to mmap() some of their
> >         state.  Any changes undertaken by the user-mode code would
> >         be invisible to the compiler.
> >
> > 5.      JITed code produced based on BPF: https://lwn.net/Articles/437981/
> >
> > And probably other stuff as well.
> 
> interesting list.  So are you saying that value-range-analysis and
> such-like (I say glibly, without really knowing what "such-like"
> refers to here) are fundamentally incompatible with
> the kernel code, or can you think of some way to tell the compiler a
> bound on the footprint of the "unseen" changes in each of those cases?

Other than the "volatile" keyword, no.

Well, I suppose you could also create a function that changed the
variables in question, then arrange to never call it, but in such a way
that the compiler could not prove that it was never called.  But ouch!
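
Something along these lines, with made-up names:

    extern int never_set;     /* defined in some other translation unit */
    int dirty_bit;

    static void clobber(void) /* never actually called at run time */
    {
        dirty_bit = 1;
    }

    void foo(void)
    {
        if (never_set)        /* always zero in practice, but the
                                 compiler cannot prove that */
            clobber();
    }

Because never_set lives in another translation unit, a compiler not
doing whole-program analysis cannot prove clobber() dead, and thus
cannot prove that dirty_bit is never stored to.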

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 17:44                                                                 ` Linus Torvalds
  2014-02-18 19:40                                                                   ` Paul E. McKenney
@ 2014-02-18 19:47                                                                   ` Torvald Riegel
  2014-02-20  0:53                                                                     ` Linus Torvalds
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-18 19:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > The standard is clear on what's required.  I strongly suggest reading
> > the formalization of the memory model by Batty et al.
> 
> Can you point to it? Because I can find a draft standard, and it sure
> as hell does *not* contain any clarity of the model. It has a *lot* of
> verbiage, but it's pretty much impossible to actually understand, even
> for somebody who really understands memory ordering.

http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
This has an explanation of the model up front, and then the detailed
formulae in Section 6.  This is from 2010, and there might have been
smaller changes since then, but I'm not aware of any bigger ones.

The cppmem tool is based on this, so if you want to play around with a
few code examples, it's pretty nice because it shows you all allowed
executions, including graphs like in the paper (see elsewhere in this
thread for CAS syntax):
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 16:55                                                                       ` Paul E. McKenney
@ 2014-02-18 19:57                                                                         ` Torvald Riegel
  0 siblings, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-18 19:57 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, 2014-02-18 at 08:55 -0800, Paul E. McKenney wrote:
> On Tue, Feb 18, 2014 at 04:38:40PM +0100, Torvald Riegel wrote:
> > On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote:
> > > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > > >
> > > > There's an underlying problem here that's independent from the actual
> > > > instance that you're worried about here: "no sense" is a ultimately a
> > > > matter of taste/objectives/priorities as long as the respective
> > > > specification is logically consistent.
> > > 
> > > Yes. But I don't think it's "independent".
> > > 
> > > Exactly *because* some people will read standards without applying
> > > "does the resulting code generation actually make sense for the
> > > programmer that wrote the code", the standard has to be pretty clear.
> > > 
> > > The standard often *isn't* pretty clear. It wasn't clear enough when
> > > it came to "volatile", and yet that was a *much* simpler concept than
> > > atomic accesses and memory ordering.
> > > 
> > > And most of the time it's not a big deal. But because the C standard
> > > generally tries to be very portable, and cover different machines,
> > > there tends to be a mindset that anything inherently unportable is
> > > "undefined" or "implementation defined", and then the compiler writer
> > > is basically given free reign to do anything they want (with
> > > "implementation defined" at least requiring that it is reliably the
> > > same thing).
> > 
> > Yes, that's how it works in general.  And this makes sense, because all
> > optimizations rely on that.  Second, you can't keep something consistent
> > (eg, between compilers) if it isn't specified.  So if we want stricter
> > rules, those need to be specified somewhere.
> > 
> > > And when it comes to memory ordering, *everything* is basically
> > > non-portable, because different CPU's very much have different rules.
> > 
> > Well, the current set of memory orders (and the memory model as a whole)
> > is portable, even though it might not allow to exploit all hardware
> > properties, and thus might perform sub-optimally in some cases.
> > 
> > > I worry that that means that the standard then takes the stance that
> > > "well, compiler re-ordering is no worse than CPU re-ordering, so we
> > > let the compiler do anything". And then we have to either add
> > > "volatile" to make sure the compiler doesn't do that, or use an overly
> > > strict memory model at the compiler level that makes it all pointless.
> > 
> > Using "volatile" is not a good option, I think, because synchronization
> > between threads should be orthogonal to observable output of the
> > abstract machine.
> 
> Are you thinking of "volatile" -instead- of atomics?  My belief is that
> given the current standard there will be times that we need to use
> "volatile" -in- -addition- to atomics.

No, I was talking about having to use "volatile" in addition to atomics
to get synchronization properties for the atomics.  ISTM that this would
be unfortunate because the objective of "volatile" (ie, declaring what
is output of the abstract machine) is orthogonal to synchronization
between threads.  So if we want to preserve control dependencies and get
ordering guarantees through them, I'd rather not do it through hacks
based on volatile.

For example, if we have a macro or helper function that contains
synchronization code, but the compiler can prove that the macro/function
is used just on data only accessible to a single thread, then the
"volatile" notation on it would prevent optimizations.

> > The current memory model might not allow to exploit all hardware
> > properties, I agree.
> > 
> > But then why don't we work on how to extend it to do so?  We need to
> > specify the behavior we want anyway, and this can't be independent of
> > the language semantics, so it has to be conceptually integrated with the
> > standard anyway.
> > 
> > > So I really really hope that the standard doesn't give compiler
> > > writers free hands to do anything that they can prove is "equivalent"
> > > in the virtual C machine model.
> > 
> > It does, but it also doesn't mean this can't be extended.  So let's
> > focus on whether we can find an extension.
> > 
> > > That's not how you get reliable
> > > results.
> > 
> > In this general form, that's obviously a false claim.
> 
> These two sentences starkly illustrate the difference in perspective
> between you two.  You are talking past each other.  Not sure how to fix
> this at the moment, but what else is new?  ;-)

Well, we're still talking, and we are making little steps in clarifying
things, so there is progress :)


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 16:49                                                               ` Linus Torvalds
  2014-02-18 17:16                                                                 ` Paul E. McKenney
@ 2014-02-18 21:21                                                                 ` Torvald Riegel
  2014-02-18 21:40                                                                   ` Peter Zijlstra
                                                                                     ` (2 more replies)
  1 sibling, 3 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-18 21:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alec Teal, Paul McKenney, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, 2014-02-18 at 08:49 -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> >> And exactly because I know enough, I would *really* like atomics to be
> >> well-defined, and have very clear - and *local* - rules about how they
> >> can be combined and optimized.
> >
> > "Local"?
> 
> Yes.
> 
> So I think that one of the big advantages of atomics over volatile is
> that they *can* be optimized, and as such I'm not at all against
> trying to generate much better code than for volatile accesses.
> 
> But at the same time, that can go too far. For example, one of the
> things we'd want to use atomics for is page table accesses, where it
> is very important that we don't generate multiple accesses to the
> values, because parts of the values can be change *by*hardware* (ie
> accessed and dirty bits).
> 
> So imagine that you have some clever global optimizer that sees that
> the program never ever actually sets the dirty bit at all in any
> thread, and then uses that kind of non-local knowledge to make
> optimization decisions. THAT WOULD BE BAD.
> 
> Do you see what I'm aiming for?

Yes, I do.  But that seems to be "volatile" territory.  It crosses the
boundaries of the abstract machine, and thus is input/output.  Which
fraction of your atomic accesses can read values produced by hardware?
I would still suppose that lots of synchronization is not affected by
this.

Do you perhaps want a weaker form of volatile?  That is, one that, for
example, allows combining of two adjacent loads of the dirty bits, but
will make sure that this is treated as if there is some imaginary
external thread that it cannot analyze and that may write?

I'm trying to be careful here in distinguishing between volatile and
synchronization because I believe those are orthogonal, and this is a
good thing because it allows for more-than-conservatively optimized
code.

> Any optimization that tries to prove
> anything from more than local state is by definition broken, because
> it assumes that everything is described by the program.

Well, that's how atomics that aren't volatile are defined in the
standard.  I can see that you want something else too, but that doesn't
mean that the other thing is broken.

> But *local* optimizations are fine, as long as they follow the obvious
> rule of not actually making changes that are semantically visible.

If we assume that there is this imaginary thread called hardware that
can write/read to/from such weak-volatile atomics, I believe this should
restrict optimizations sufficiently even in the model as specified in
the standard.  For example, it would prevent a compiler from proving
that there is no access by another thread to a variable, so it would
prevent the cases in our discussion that you didn't want to get
optimized.  Yet, I believe the model itself could stay unmodified.

Thus, with this "weak volatile", we could let programmers request
precisely the semantics that they want, without using volatile too much
and without preventing optimizations more than necessary.  Thoughts?
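
A purely hypothetical sketch of how such a request could look at the
source level (no such attribute exists today; this is only the shape of
the idea):

    /* Hypothetical "weak volatile" atomic: the compiler must assume an
     * unanalyzable external writer exists (so it can never prove "no
     * other thread writes this"), yet adjacent accesses may still be
     * merged, unlike with full volatile semantics. */
    _Atomic unsigned long pte_word __attribute__((external_writer));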

> (In practice, I'd be impressed as hell for that particular example,
> and we actually do end up setting the dirty bit by hand in some
> situations, so the example is slightly made up, but there are other
> cases that might be more realistic in that sometimes we do things that
> are hidden from the compiler - in assembly etc - and the compiler
> might *think* it knows what is going on, but it doesn't actually see
> all accesses).

If something is visible to assembly, then the compiler should see this
(or at least be aware that it can't prove anything about such data).  Or
am I missing anything?

> > Sorry, but the rules *are* very clear.  I *really* suggest to look at
> > the formalization by Batty et al.  And in these rules, proving that a
> > read will always return value X has a well-defined meaning, and you can
> > use it.  That simply follows from how the model is built.
> 
> What "model"?

The C/C++ memory model was what I was referring to.

> That's the thing. I have tried to figure out whether the model is some
> abstract C model, or a model based on the actual hardware that the
> compiler is compiling for, and whether the model is one that assumes
> the compiler has complete knowledge of the system (see the example
> above).

It's a model as specified in the standard.  It's not parametrized by the
hardware the program will eventually run on (ignoring
implementation-defined behavior, timing, etc.).  The compiler has
complete knowledge of the system except for "volatiles" and things
coming in from the external world or escaping to it (e.g., stuff
escaping into asm statements).

> And it seems to be a mixture of it all. The definitions of the various
> orderings obviously very much imply that the compiler has to insert
> the proper barriers or sequence markers for that architecture, but
> then there is no apparent way to depend on any *other* architecture
> ordering guarantees. Like our knowledge that all architectures (with
> the exception of alpha, which really doesn't end up being something we
> worry about any more) end up having the load dependency ordering
> guarantee.

That is true, things like those other ordering guarantees aren't part of
the model currently.  But they might perhaps make good extensions.

> > What you seem to want just isn't covered by the model as it is today --
> > you can't infer from that that the model itself would be wrong.  The
> > dependency chains aren't modeled in the way you envision it (except in
> > what consume_mo tries, but that seems to be hard to implement); they are
> > there on the level of the program logic as modeled by the abstract
> > machine and the respective execution/output rules, but they are not
> > built to represent those specific ordering guarantees the HW gives you.
> 
> So this is a problem. It basically means that we cannot do the kinds
> of things we do now, which very much involve knowing what the memory
> ordering of a particular machine is, and combining that knowledge with
> our knowledge of code generation.

Yes, I agree that the standard is lacking a "feature" in this case (for
lack of a better word -- I wouldn't call it a bug).  I do hope that this
discussion can lead us to a better understanding of this, and how we
might perhaps add it as an extension to the C/C++ model while still
keeping all the benefits of the model.

> Now, *most* of what we do is protected by locking and is all fine. But
> we do have a few rather subtle places in RCU and in the scheduler
> where we depend on the actual dependency chain.
> 
> In *practice*, I seriously doubt any reasonable compiler can actually
> make a mess of it. The kinds of optimizations that would actually
> defeat the dependency chain are simply not realistic. And I suspect
> that will end up being what we rely on - there being no actual sane
> sequence that a compiler would ever do, even if we wouldn't have
> guarantees for some of it.

Yes, maybe.  I would prefer if we could put hard requirements for
compilers about this in the standard (so that you get a more stable
target to work against), but that might be hard to do in this case.

> And I suspect  I can live with that. We _have_ lived with that for the
> longest time, after all. We very much do things that aren't covered by
> any existing C standard, and just basically use tricks to coax the
> compiler into never generating code that doesn't work (with our inline
> asm barriers etc being a prime example).
> 
> > I would also be cautious claiming that the rules you suggested would be
> > very clear and very simple.  I haven't seen a memory model spec from you
> > that would be viable as the standard model for C/C++, nor have I seen
> > proof that this would actually be easier to understand for programmers
> > in general.
> 
> So personally, if I were to write the spec, I would have taken a
> completely different approach from what the standard obviously does.
> 
> I'd have taken the approach of specifying the required semantics of each
> atomic op (ie the memory operations would end up having to be
> annotated with their ordering constraints), and then said that the
> compiler can generate any code that is equivalent to that
> _on_the_target_machine_.
> 
> Why? Just to avoid the whole "ok, which set of rules applies now" problem.
> 
> >> For example, CPU people actually do tend to give guarantees for
> >> certain things, like stores that are causally related being visible in
> >> a particular order.
> >
> > Yes, but that's not part of the model so far.  If you want to exploit
> > this, please make a suggestion for how to extend the model to cover
> > this.
> 
> See above. This is exactly why I detest the C "model" thing. Now you
> need ways to describe those CPU guarantees, because if you can't
> describe them, you can't express them in the model.

That's true, it has to be modeled.  The downside of your approach would
be that it's not portable code anymore, which would be a big downside
for the majority of the C/C++ programmers I guess.

> I would *much* have preferred the C standard to say that you have to
> generate code that is guaranteed to act the same way - on that machine
> - as the "naive direct unoptimized translation".

I can see the advantages that has when using C as something like a
high-level assembler.  But as the discussion of the difficulty of
tracking data dependencies indicates (including tracking through
non-synchronizing code), this behavior might spread throughout the
program and is hard to contain.  For many programs, especially on the
userspace side or wherever people can't or don't want to perform all
optimizations themselves, the optimizations are a real benefit.  For
example, generic or modular programming becomes so much more practical
if a compiler can optimize such code.  So if we want both, it's hard to
draw the line.

> IOW, I would *revel* in the fact that different machines are
> different, and basically just describe the "stupid" code generation.
> You'd get the guaranteed semantic baseline, but you'd *also* be able
> to know that whatever architecture guarantees you have would remain.
> Without having to describe those architecture issues.

Would you be okay with this preventing lots of optimizations a compiler
otherwise could do?  Because AFAICT, this spreads into non-synchronizing
code via the dependency-tracking, for example.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 17:16                                                                 ` Paul E. McKenney
  2014-02-18 18:23                                                                   ` Peter Sewell
@ 2014-02-18 21:40                                                                   ` Torvald Riegel
  2014-02-18 21:52                                                                     ` Peter Zijlstra
  2014-02-18 22:58                                                                     ` Paul E. McKenney
  1 sibling, 2 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-18 21:40 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Alec Teal, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > >> And exactly because I know enough, I would *really* like atomics to be
> > >> well-defined, and have very clear - and *local* - rules about how they
> > >> can be combined and optimized.
> > >
> > > "Local"?
> > 
> > Yes.
> > 
> > So I think that one of the big advantages of atomics over volatile is
> > that they *can* be optimized, and as such I'm not at all against
> > trying to generate much better code than for volatile accesses.
> > 
> > But at the same time, that can go too far. For example, one of the
> > things we'd want to use atomics for is page table accesses, where it
> > is very important that we don't generate multiple accesses to the
> > values, because parts of the values can be changed *by*hardware* (ie
> > accessed and dirty bits).
> > 
> > So imagine that you have some clever global optimizer that sees that
> > the program never ever actually sets the dirty bit at all in any
> > thread, and then uses that kind of non-local knowledge to make
> > optimization decisions. THAT WOULD BE BAD.
> 
> Might as well list other reasons why value proofs via whole-program
> analysis are unreliable for the Linux kernel:
> 
> 1.	As Linus said, changes from hardware.

This is what volatile is for, right?  (Or the weak-volatile idea I
mentioned).

Compilers won't be able to prove something about the values of such
variables, if marked (weak-)volatile.

> 2.	Assembly code that is not visible to the compiler.
> 	Inline asms will -normally- let the compiler know what
> 	memory they change, but some just use the "memory" tag.
> 	Worse yet, I suspect that most compilers don't look all
> 	that carefully at .S files.
> 
> 	Any number of other programs contain assembly files.

Are the annotations of changed memory really a problem?  If the "memory"
tag exists, isn't that supposed to mean all memory?
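
For reference, the "memory" tag here is the clobber in an extended
inline asm; the kernel's barrier() is essentially an empty asm with
exactly that clobber:

    /* Compiler barrier: the "memory" clobber tells the compiler that
     * any memory may have changed, so values cached in registers must
     * be discarded across this point. */
    #define barrier() __asm__ __volatile__("" : : : "memory")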

To make a proof about a program for location X, the compiler has to
analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
the compiler will simply not be able to prove a thing (except maybe due
to the data-race-free requirement for non-atomics).  The attempt to
prove something isn't unreliable, simply because a correct compiler
won't claim to be able to "prove" something.

One thing that could corrupt this is if the program addresses objects
other than through the mechanisms defined in the language.  For example,
if one thread lays out a data structure at a constant fixed memory
address, and another one then uses the fixed memory address to get
access to the object with a cast (e.g., (void*)0x123).

> 3.	Kernel modules that have not yet been written.  Now, the
> 	compiler could refrain from trying to prove anything about
> 	an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> 	is currently no way to communicate this information to the
> 	compiler other than marking the variable "volatile".

Even if the variable is just externally accessible, then the compiler
knows that it can't do whole-program analysis about it.

It is true that whole-program analysis will not be applicable in this
case, but it will not be unreliable.  I think that's an important
difference.

> 	Other programs have similar issues, e.g., via dlopen().
> 
> 4.	Some drivers allow user-mode code to mmap() some of their
> 	state.  Any changes undertaken by the user-mode code would
> 	be invisible to the compiler.

A good point, but a compiler that doesn't try to (incorrectly) assume
something about the semantics of mmap will simply see that the mmap'ed
data will escape to stuff it can't analyze, so it will not be able to
make a proof.

This is different from, for example, malloc(), which is guaranteed to
return "fresh" nonaliasing memory.

> 5.	JITed code produced based on BPF: https://lwn.net/Articles/437981/

This might be special, or not, depending on how the JITed code gets
access to data.  If this is via fixed addresses (e.g., (void*)0x123),
then see above.  If this is through function calls that the compiler
can't analyze, then this is like 4.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 21:21                                                                 ` Torvald Riegel
@ 2014-02-18 21:40                                                                   ` Peter Zijlstra
  2014-02-18 21:47                                                                     ` Torvald Riegel
  2014-02-18 21:47                                                                   ` Peter Zijlstra
  2014-02-18 22:14                                                                   ` Linus Torvalds
  2 siblings, 1 reply; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-18 21:40 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Alec Teal, Paul McKenney, Will Deacon,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> Well, that's how atomics that aren't volatile are defined in the
> standard.  I can see that you want something else too, but that doesn't
> mean that the other thing is broken.

Well that other thing depends on being able to see the entire program at
compile time. PaulMck already listed various ways in which this is
not feasible even for normal userspace code.

In particular; DSOs and JITs were mentioned.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 21:21                                                                 ` Torvald Riegel
  2014-02-18 21:40                                                                   ` Peter Zijlstra
@ 2014-02-18 21:47                                                                   ` Peter Zijlstra
  2014-02-19 11:07                                                                     ` Torvald Riegel
  2014-02-18 22:14                                                                   ` Linus Torvalds
  2 siblings, 1 reply; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-18 21:47 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Alec Teal, Paul McKenney, Will Deacon,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> Yes, I do.  But that seems to be "volatile" territory.  It crosses the
> boundaries of the abstract machine, and thus is input/output.  Which
> fraction of your atomic accesses can read values produced by hardware?
> I would still suppose that lots of synchronization is not affected by
> this.

It's not only hardware; also the kernel/user boundary has this same
problem. We cannot a priori say what userspace will do; in fact, because
we're a general purpose OS, we must assume it will willfully try its
bestest to wreck whatever assumptions we make about its behaviour.

We also have loadable modules -- much like regular userspace DSOs -- so
there too we cannot say what will or will not happen.

We also have JITs that generate code on demand.

And I'm absolutely sure (with the exception of the JITs, it's not an area
I've worked on) that we have atomic usage across all those boundaries.

I must agree with Linus: global-state-driven optimizations are
crack-brained, esp. for atomics. We simply cannot know all state at compile
time. The best we can hope for are local optimizations.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 21:40                                                                   ` Peter Zijlstra
@ 2014-02-18 21:47                                                                     ` Torvald Riegel
  2014-02-19 15:23                                                                       ` David Lang
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-18 21:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Alec Teal, Paul McKenney, Will Deacon,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> > Well, that's how atomics that aren't volatile are defined in the
> > standard.  I can see that you want something else too, but that doesn't
> > mean that the other thing is broken.
> 
> Well that other thing depends on being able to see the entire program at
> compile time. PaulMck already listed various ways in which this is
> not feasible even for normal userspace code.
> 
> In particular; DSOs and JITs were mentioned.

No it doesn't depend on whole-program analysis being possible.  Because
if it isn't, then a correct compiler will just not do certain
optimizations simply because it can't prove properties required for the
optimization to hold.  With the exception of access to objects via magic
numbers (e.g., fixed and known addresses (see my reply to Paul), which
are outside of the semantics specified in the standard), I don't see a
correctness problem here.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 21:40                                                                   ` Torvald Riegel
@ 2014-02-18 21:52                                                                     ` Peter Zijlstra
  2014-02-19  9:52                                                                       ` Torvald Riegel
  2014-02-18 22:58                                                                     ` Paul E. McKenney
  1 sibling, 1 reply; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-18 21:52 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: paulmck, Linus Torvalds, Alec Teal, Will Deacon,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

> > 4.	Some drivers allow user-mode code to mmap() some of their
> > 	state.  Any changes undertaken by the user-mode code would
> > 	be invisible to the compiler.
> 
> A good point, but a compiler that doesn't try to (incorrectly) assume
> something about the semantics of mmap will simply see that the mmap'ed
> data will escape to stuff it can't analyze, so it will not be able to
> make a proof.
> 
> This is different from, for example, malloc(), which is guaranteed to
> return "fresh" nonaliasing memory.

The kernel side of this is different... it looks like 'normal' memory, we
just happen to allow it to end up in userspace too.

But on that point; how do you tell the compiler the difference between
malloc() and mmap()? Is that some function attribute?

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 21:21                                                                 ` Torvald Riegel
  2014-02-18 21:40                                                                   ` Peter Zijlstra
  2014-02-18 21:47                                                                   ` Peter Zijlstra
@ 2014-02-18 22:14                                                                   ` Linus Torvalds
  2014-02-19 14:40                                                                     ` Torvald Riegel
  2 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-18 22:14 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Alec Teal, Paul McKenney, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Feb 18, 2014 at 1:21 PM, Torvald Riegel <triegel@redhat.com> wrote:
>>
>> So imagine that you have some clever global optimizer that sees that
>> the program never ever actually sets the dirty bit at all in any
>> thread, and then uses that kind of non-local knowledge to make
>> optimization decisions. THAT WOULD BE BAD.
>>
>> Do you see what I'm aiming for?
>
> Yes, I do.  But that seems to be "volatile" territory.  It crosses the
> boundaries of the abstract machine, and thus is input/output.  Which
> fraction of your atomic accesses can read values produced by hardware?
> I would still suppose that lots of synchronization is not affected by
> this.

The "hardware can change things" case is indeed pretty rare.

But quite frankly, even when it isn't hardware, as far as the compiler
is concerned you have the exact same issue - you have TLB faults
happening on other CPU's that do the same thing asynchronously using
software TLB fault handlers. So *semantically*, it really doesn't make
any difference what-so-ever if it's a software TLB handler on another
CPU, a microcoded TLB fault, or an actual hardware path.

So if the answer for all of the above is "use volatile", then I think
that means that the C11 atomics are badly designed.

The whole *point* of atomic accesses is that stuff like above should
"JustWork(tm)"

> Do you perhaps want a weaker form of volatile?  That is, one that, for
> example, allows combining of two adjacent loads of the dirty bits, but
> will make sure that this is treated as if there is some imaginary
> external thread that it cannot analyze and that may write?

Yes, that's basically what I would want. And it is what I would expect
an atomic to be. Right now we tend to use "ACCESS_ONCE()", which is a
bit of a misnomer, because technically we really generally want
"ACCESS_AT_MOST_ONCE()" (but "once" is what we get, because we use
volatile, and it is a hell of a lot simpler to write ;^).
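
For reference, the kernel's ACCESS_ONCE() of this era is just a
volatile-qualified access:

    /* include/linux/compiler.h: force exactly one access by going
     * through a volatile-qualified lvalue. */
    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))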

So we obviously use "volatile" for this currently, and generally the
semantics we really want are:

 - the load or store is done as a single access ("atomic")

 - the compiler must not try to re-materialize the value by reloading
it from memory (this is the "at most once" part)

and quite frankly, "volatile" is a big hammer for this. In practice it
tends to work pretty well, though, because in _most_ cases, there
really is just the single access, so there isn't anything that it
could be combined with, and the biggest issue is often just the
correctness of not re-materializing the value.

And I agree - memory ordering is a totally separate issue, and in fact
we largely tend to consider it entirely separate. For cases where we
have ordering constraints, we either handle those with special
accessors (ie "atomic-modify-and-test" helpers tend to have some
serialization guarantees built in), or we add explicit fencing.

But semantically, C11 atomic accessors *should* generally have the
correct behavior for our uses.

If we have to add "volatile", that makes atomics basically useless. We
already *have* the volatile semantics, if atomics need it, that just
means that atomics have zero upside for us.

>> But *local* optimizations are fine, as long as they follow the obvious
>> rule of not actually making changes that are semantically visible.
>
> If we assume that there is this imaginary thread called hardware that
> can write/read to/from such weak-volatile atomics, I believe this should
> restrict optimizations sufficiently even in the model as specified in
> the standard.

Well, what about *real* threads that do this, but that aren't
analyzable by the C compiler because they are written in another
language entirely (inline asm, asm, perl, INTERCAL, microcode,
PAL-code, whatever?)

I really don't think that "hardware" is necessary for this to happen.
What is done by hardware on x86, for example, is done by PAL-code
(loaded at boot-time) on alpha, and done by hand-tuned assembler fault
handlers on Sparc. The *effect* is the same: it's not visible to the
compiler. There is no way in hell that the compiler can understand the
hand-tuned Sparc TLB fault handler, even if it parsed it.

>> IOW, I would *revel* in the fact that different machines are
>> different, and basically just describe the "stupid" code generation.
>> You'd get the guaranteed semantic baseline, but you'd *also* be able
>> to know that whatever architecture guarantees you have would remain.
>> Without having to describe those architecture issues.
>
> Would you be okay with this preventing lots of optimizations a compiler
> otherwise could do?  Because AFAICT, this spreads into non-synchronizing
> code via the dependency-tracking, for example.

Actually, it probably wouldn't really hurt code generation.

The thing is, you really have three cases:

 - architectures that have weak memory ordering and fairly stupid
hardware (power, some day maybe ARM)

 - architectures that expose a fairly strong memory ordering, but
reorders aggressively in hardware by tracking cacheline lifetimes
until instruction retirement (x86)

 - slow hardware (current ARM, crap Atom hardware, and other embedded devices
that don't really do either)

and let's just agree that anybody who does high-performance work
doesn't really care about the third case, ok?

The compiler really wouldn't be that constrained in ordering on a
weakly ordered machine: sure, you'll have to add "lwsync" on powerpc
pretty much every time you have an ordered atomic read, but then you
could optimize them away for certain cases (like "oh, it was a
'consume' read, and the only consumer is already data-dependent, so
now we can remove the lwsync"), so you'd end up with pretty much the
code you looked for anyway.
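
A sketch of that consume case in C11 terms (struct foo and gp are
made-up names):

    #include <stdatomic.h>

    struct foo { int data; };
    static _Atomic(struct foo *) gp;

    int reader(void)
    {
        struct foo *p = atomic_load_explicit(&gp, memory_order_consume);
        /* The dereference carries a data dependency on the load, so the
         * hardware (Alpha aside) already orders it; that is the case
         * where the lwsync could be optimized away. */
        return p ? p->data : 0;
    }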

And the HPC people who use the "relaxed" model wouldn't see the extra
memory barriers anyway.

And on a non-weakly ordered machine like x86, there are going to be
hardly any barriers in the first place, and the hardware will do the
right thing wrt ordering, so it's not like the compiler has gotten
prevented from any big optimization wins.

So I really think a simpler model that tied to the hardware directly -
which you need to do *anyway* since the memory ordering constraints
really are hw-dependent - would have been preferable. The people who
don't want to take advantage of hardware-specific guarantees would
never know the difference (they'll rely purely on pairing
acquire/release and fences properly - the way they *already* have to
in the current C11 model), and the people who _do_ care about
particular hw guarantees would get them by definition.

         Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 21:40                                                                   ` Torvald Riegel
  2014-02-18 21:52                                                                     ` Peter Zijlstra
@ 2014-02-18 22:58                                                                     ` Paul E. McKenney
  2014-02-19 10:59                                                                       ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18 22:58 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Alec Teal, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote:
> 
> On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > > >> And exactly because I know enough, I would *really* like atomics to be
> > > >> well-defined, and have very clear - and *local* - rules about how they
> > > >> can be combined and optimized.
> > > >
> > > > "Local"?
> > > 
> > > Yes.
> > > 
> > > So I think that one of the big advantages of atomics over volatile is
> > > that they *can* be optimized, and as such I'm not at all against
> > > trying to generate much better code than for volatile accesses.
> > > 
> > > But at the same time, that can go too far. For example, one of the
> > > things we'd want to use atomics for is page table accesses, where it
> > > is very important that we don't generate multiple accesses to the
> > > values, because parts of the values can be changed *by*hardware* (ie
> > > accessed and dirty bits).
> > > 
> > > So imagine that you have some clever global optimizer that sees that
> > > the program never ever actually sets the dirty bit at all in any
> > > thread, and then uses that kind of non-local knowledge to make
> > > optimization decisions. THAT WOULD BE BAD.
> > 
> > Might as well list other reasons why value proofs via whole-program
> > analysis are unreliable for the Linux kernel:
> > 
> > 1.	As Linus said, changes from hardware.
> 
> This is what volatile is for, right?  (Or the weak-volatile idea I
> mentioned).
> 
> Compilers won't be able to prove something about the values of such
> variables, if marked (weak-)volatile.

Yep.

> > 2.	Assembly code that is not visible to the compiler.
> > 	Inline asms will -normally- let the compiler know what
> > 	memory they change, but some just use the "memory" tag.
> > 	Worse yet, I suspect that most compilers don't look all
> > 	that carefully at .S files.
> > 
> > 	Any number of other programs contain assembly files.
> 
> Are the annotations of changed memory really a problem?  If the "memory"
> tag exists, isn't that supposed to mean all memory?
> 
> To make a proof about a program for location X, the compiler has to
> analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
> the compiler will simply not be able to prove a thing (except maybe due
> to the data-race-free requirement for non-atomics).  The attempt to
> prove something isn't unreliable, simply because a correct compiler
> won't claim to be able to "prove" something.

I am indeed less worried about inline assembler than I am about files
full of assembly.  Or files full of other languages.

> One thing that could corrupt this is if the program addresses objects
> other than through the mechanisms defined in the language.  For example,
> if one thread lays out a data structure at a constant fixed memory
> address, and another one then uses the fixed memory address to get
> access to the object with a cast (e.g., (void*)0x123).

Or if the program uses gcc linker scripts to get the same effect.

> > 3.	Kernel modules that have not yet been written.  Now, the
> > 	compiler could refrain from trying to prove anything about
> > 	an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> > 	is currently no way to communicate this information to the
> > 	compiler other than marking the variable "volatile".
> 
> Even if the variable is just externally accessible, then the compiler
> knows that it can't do whole-program analysis about it.
> 
> It is true that whole-program analysis will not be applicable in this
> case, but it will not be unreliable.  I think that's an important
> difference.

Let me make sure that I understand what you are saying.  If my program has
"extern int foo;", the compiler will refrain from doing whole-program
analysis involving "foo"?  Or to ask it another way, when you say
"whole-program analysis", are you restricting that analysis to the
current translation unit?

If so, I was probably not the only person thinking that you instead meant
analysis across all translation units linked into the program.  ;-)

> > 	Other programs have similar issues, e.g., via dlopen().
> > 
> > 4.	Some drivers allow user-mode code to mmap() some of their
> > 	state.  Any changes undertaken by the user-mode code would
> > 	be invisible to the compiler.
> 
> A good point, but a compiler that doesn't try to (incorrectly) assume
> something about the semantics of mmap will simply see that the mmap'ed
> data will escape to stuff it can't analyze, so it will not be able to
> make a proof.
> 
> This is different from, for example, malloc(), which is guaranteed to
> return "fresh" nonaliasing memory.

As Peter noted, this is the other end of mmap().  The -user- code sees
that there is an mmap(), but the kernel code invokes functions that
poke values into hardware registers (or into in-memory page tables)
that, as a side effect, cause some of the kernel's memory to be
accessible to some user program.

Presumably the kernel code needs to do something to account for the
possibility of usermode access whenever it accesses that memory.
Volatile casts, volatile storage class on the declarations, barrier()
calls, whatever.

I echo Peter's question about how one tags functions like mmap().

I will also remember this for the next time someone on the committee
discounts "volatile".  ;-)

> > 5.	JITed code produced based on BPF: https://lwn.net/Articles/437981/
> 
> This might be special, or not, depending on how the JITed code gets
> access to data.  If this is via fixed addresses (e.g., (void*)0x123),
> then see above.  If this is through function calls that the compiler
> can't analyze, then this is like 4.

It could well be via the kernel reading its own symbol table, sort of
a poor-person's reflection facility.  I guess that would be for all
intents and purposes equivalent to your (void*)0x123.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 21:52                                                                     ` Peter Zijlstra
@ 2014-02-19  9:52                                                                       ` Torvald Riegel
  0 siblings, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-19  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, Linus Torvalds, Alec Teal, Will Deacon,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, 2014-02-18 at 22:52 +0100, Peter Zijlstra wrote:
> > > 4.	Some drivers allow user-mode code to mmap() some of their
> > > 	state.  Any changes undertaken by the user-mode code would
> > > 	be invisible to the compiler.
> > 
> > A good point, but a compiler that doesn't try to (incorrectly) assume
> > something about the semantics of mmap will simply see that the mmap'ed
> > data will escape to stuff it can't analyze, so it will not be able to
> > make a proof.
> > 
> > This is different from, for example, malloc(), which is guaranteed to
> > return "fresh" nonaliasing memory.
> 
> The kernel side of this is different.. it looks like 'normal' memory, we
> just happen to allow it to end up in userspace too.
> 
> But on that point; how do you tell the compiler the difference between
> malloc() and mmap()? Is that some function attribute?

Yes:

malloc
        The malloc attribute is used to tell the compiler that a
        function may be treated as if any non-NULL pointer it returns
        cannot alias any other pointer valid when the function returns
        and that the memory has undefined content. This often improves
        optimization. Standard functions with this property include
        malloc and calloc. realloc-like functions do not have this
        property as the memory pointed to does not have undefined
        content.

I'm not quite sure whether GCC assumes malloc() to be indeed C's malloc
even if the function attribute isn't used, and/or whether that is
different for freestanding environments.
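
Concretely, applying the quoted attribute to one's own allocator looks
like this (my_alloc is a hypothetical name):

    #include <stddef.h>

    /* Promise to GCC: any non-NULL return aliases nothing else live at
     * the call and points to undefined content, like malloc(). */
    void *my_alloc(size_t n) __attribute__((malloc));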


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 22:58                                                                     ` Paul E. McKenney
@ 2014-02-19 10:59                                                                       ` Torvald Riegel
  2014-02-19 15:14                                                                         ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-19 10:59 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Alec Teal, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote:
> On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote:
> > 
> > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > > > >> And exactly because I know enough, I would *really* like atomics to be
> > > > >> well-defined, and have very clear - and *local* - rules about how they
> > > > >> can be combined and optimized.
> > > > >
> > > > > "Local"?
> > > > 
> > > > Yes.
> > > > 
> > > > So I think that one of the big advantages of atomics over volatile is
> > > > that they *can* be optimized, and as such I'm not at all against
> > > > trying to generate much better code than for volatile accesses.
> > > > 
> > > > But at the same time, that can go too far. For example, one of the
> > > > things we'd want to use atomics for is page table accesses, where it
> > > > is very important that we don't generate multiple accesses to the
> > > > values, because parts of the values can be changed *by*hardware* (ie
> > > > accessed and dirty bits).
> > > > 
> > > > So imagine that you have some clever global optimizer that sees that
> > > > the program never ever actually sets the dirty bit at all in any
> > > > thread, and then uses that kind of non-local knowledge to make
> > > > optimization decisions. THAT WOULD BE BAD.
> > > 
> > > Might as well list other reasons why value proofs via whole-program
> > > analysis are unreliable for the Linux kernel:
> > > 
> > > 1.	As Linus said, changes from hardware.
> > 
> > This is what volatile is for, right?  (Or the weak-volatile idea I
> > mentioned).
> > 
> > Compilers won't be able to prove something about the values of such
> > variables, if marked (weak-)volatile.
> 
> Yep.
> 
> > > 2.	Assembly code that is not visible to the compiler.
> > > 	Inline asms will -normally- let the compiler know what
> > > 	memory they change, but some just use the "memory" tag.
> > > 	Worse yet, I suspect that most compilers don't look all
> > > 	that carefully at .S files.
> > > 
> > > 	Any number of other programs contain assembly files.
> > 
> > Are the annotations of changed memory really a problem?  If the "memory"
> > tag exists, isn't that supposed to mean all memory?
> > 
> > To make a proof about a program for location X, the compiler has to
> > analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
> > the compiler will simply not be able to prove a thing (except maybe due
> > to the data-race-free requirement for non-atomics).  The attempt to
> > prove something isn't unreliable, simply because a correct compiler
> > won't claim to be able to "prove" something.
> 
> I am indeed less worried about inline assembler than I am about files
> full of assembly.  Or files full of other languages.
> 
> > One thing that could corrupt this is if the program addresses objects
> > other than through the mechanisms defined in the language.  For example,
> > if one thread lays out a data structure at a constant fixed memory
> > address, and another one then uses the fixed memory address to get
> > access to the object with a cast (e.g., (void*)0x123).
> 
> Or if the program uses gcc linker scripts to get the same effect.
> 
> > > 3.	Kernel modules that have not yet been written.  Now, the
> > > 	compiler could refrain from trying to prove anything about
> > > 	an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> > > 	is currently no way to communicate this information to the
> > > 	compiler other than marking the variable "volatile".
> > 
> > Even if the variable is just externally accessible, then the compiler
> > knows that it can't do whole-program analysis about it.
> > 
> > It is true that whole-program analysis will not be applicable in this
> > case, but it will not be unreliable.  I think that's an important
> > difference.
> 
> Let me make sure that I understand what you are saying.  If my program has
> "extern int foo;", the compiler will refrain from doing whole-program
> analysis involving "foo"?

Yes.  If it can't be sure to actually have the whole program available,
it can't do whole-program analysis, right?  Things like the linker
scripts you mention or other stuff outside of the language semantics
complicate this somewhat, and maybe some compilers assume too much.
There's also the point that data-race-freedom is required for
non-atomics even if those are shared with non-C-code.

But except for those corner cases, a compiler sees whether something escapes
and becomes visible/accessible to other entities.

> Or to ask it another way, when you say
> "whole-program analysis", are you restricting that analysis to the
> current translation unit?

No.  I mean, you can do analysis of the current translation unit, but
that will do just that; if the variable, for example, is accessible
outside of this translation unit, the compiler can't make a
whole-program proof about it, and thus can't do certain optimizations.

> If so, I was probably not the only person thinking that you instead meant
> analysis across all translation units linked into the program.  ;-)

That's roughly what I meant, but not just including translation units
but truly all parts of the program, including non-C program parts.  IOW,
literally the whole program :)

That's why I said that if you indeed do *whole program* analysis, then
things should be fine (modulo corner cases such as linker scripts, later
binary rewriting of code produced by the compiler, etc.).  Many of the
things you worried about *prevent* whole-program analysis, which means
that they do not make it any less reliable.  Does that clarify my line
of thought?

> > > 	Other programs have similar issues, e.g., via dlopen().
> > > 
> > > 4.	Some drivers allow user-mode code to mmap() some of their
> > > 	state.  Any changes undertaken by the user-mode code would
> > > 	be invisible to the compiler.
> > 
> > A good point, but a compiler that doesn't try to (incorrectly) assume
> > something about the semantics of mmap will simply see that the mmap'ed
> > data will escape to stuff it can't analyze, so it will not be able to
> > make a proof.
> > 
> > This is different from, for example, malloc(), which is guaranteed to
> > return "fresh" nonaliasing memory.
> 
> As Peter noted, this is the other end of mmap().  The -user- code sees
> that there is an mmap(), but the kernel code invokes functions that
> poke values into hardware registers (or into in-memory page tables)
> that, as a side effect, cause some of the kernel's memory to be
> accessible to some user program.
> 
> Presumably the kernel code needs to do something to account for the
> possibility of usermode access whenever it accesses that memory.
> Volatile casts, volatile storage class on the declarations, barrier()
> calls, whatever.

In this case, there should be another option besides volatile:  If
userspace code is using the C11 memory model as well and lock-free
atomics to synchronize, then this should have well-defined semantics
without using volatile.

On both sides, the compiler will see that mmap() (or similar) is called,
so that means the data escapes to something unknown, which could create
threads and so on.  So first, it can't do whole-program analysis for
this state anymore, and has to assume that other C11 threads are
accessing this memory.  Next, lock-free atomics are specified to be
"address-free", meaning that they must work independent of where in
memory the atomics are mapped (see C++ (e.g., N3690) 29.4p3; that's a
"should" and non-normative, but essential IMO).  Thus, this then boils
down to just a simple case of synchronization.  (Of course, the rest of
the ABI has to match too for the data exchange to work.)
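
A sketch of that boundary case: two parties map the same page and
synchronize through a lock-free atomic in it (fd is an assumed shared
file descriptor; error handling omitted):

    #include <stdatomic.h>
    #include <sys/mman.h>

    void publish(int fd)
    {
        atomic_int *flag = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
        /* "Address-free": this works even though the other side maps
         * the same page at a different virtual address. */
        atomic_store_explicit(flag, 1, memory_order_release);
    }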

> I echo Peter's question about how one tags functions like mmap().
> 
> I will also remember this for the next time someone on the committee
> discounts "volatile".  ;-)
> 
> > > 5.	JITed code produced based on BPF: https://lwn.net/Articles/437981/
> > 
> > This might be special, or not, depending on how the JITed code gets
> > access to data.  If this is via fixed addresses (e.g., (void*)0x123),
> > then see above.  If this is through function calls that the compiler
> > can't analyze, then this is like 4.
> 
> It could well be via the kernel reading its own symbol table, sort of
> a poor-person's reflection facility.  I guess that would be for all
> intents and purposes equivalent to your (void*)0x123.

If it is replacing code generated by the compiler, then yes.  If the JIT
is just filling in functions that had been declared but not yet defined,
then the compiler will have seen the data escape through the
function interfaces, and should be aware that there is other stuff.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 21:47                                                                   ` Peter Zijlstra
@ 2014-02-19 11:07                                                                     ` Torvald Riegel
  2014-02-19 11:42                                                                       ` Peter Zijlstra
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-19 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Alec Teal, Paul McKenney, Will Deacon,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, 2014-02-18 at 22:47 +0100, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> > Yes, I do.  But that seems to be "volatile" territory.  It crosses the
> > boundaries of the abstract machine, and thus is input/output.  Which
> > fraction of your atomic accesses can read values produced by hardware?
> > I would still suppose that lots of synchronization is not affected by
> > this.
> 
> It's not only hardware; also the kernel/user boundary has this same
> problem. We cannot a priori say what userspace will do; in fact, because
> we're a general purpose OS, we must assume it will willfully try its
> bestest to wreck whatever assumptions we make about its behaviour.

That's a good note, and I think a distinct case from those below,
because here you're saying that you can't assume that the userspace code
follows the C11 semantics ...

> We also have loadable modules -- much like regular userspace DSOs -- so
> there too we cannot say what will or will not happen.
> 
> We also have JITs that generate code on demand.

... whereas for those, you might assume that the other code follows C11
semantics and the same ABI, which makes this just a normal case already
handled as (see my other replies nearby in this thread).

> And I'm absolutely sure (with the exception of the JITs, it's not an area
> I've worked on) that we have atomic usage across all those boundaries.

That would be fine as long as all involved parties use the same memory
model and ABI to implement it.

(Of course, I'm assuming here that the compiler is aware of sharing with
other entities, which is always the case except in those corner cases
like accesses to (void*)0x123 magically aliasing with something else).


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-19 11:07                                                                     ` Torvald Riegel
@ 2014-02-19 11:42                                                                       ` Peter Zijlstra
  0 siblings, 0 replies; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-19 11:42 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Alec Teal, Paul McKenney, Will Deacon,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Feb 19, 2014 at 12:07:02PM +0100, Torvald Riegel wrote:
> > It's not only hardware; also the kernel/user boundary has this same
> > problem. We cannot a priori say what userspace will do; in fact, because
> > we're a general purpose OS, we must assume it will willfully try its
> > bestest to wreck whatever assumptions we make about its behaviour.
> 
> That's a good note, and I think a distinct case from those below,
> because here you're saying that you can't assume that the userspace code
> follows the C11 semantics ...

Right; we can malfunction in those cases though; as long as the
malfunctioning happens on the userspace side. That is, whatever
userspace does should not cause the kernel to crash, but userspace
crashing itself, or getting crap data or whatever is its own damn fault
for not following expected behaviour.

To stay on topic; if the kernel/user interface requires memory ordering
and userspace explicitly omits the barriers all malfunctioning should be
on the user. For instance it might lose a forward-progress guarantee or
data integrity guarantees.

Specifically, given a kernel/user lockless producer/consumer buffer, if
the user side allows the tail write to happen before its data reads are
complete, the kernel might overwrite the data it's still reading.
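
A kernel-style sketch of that consumer-side ordering, with a
hypothetical ring layout; memcpy() and the then-new smp_store_release()
are assumed to come from the usual kernel headers, and the release store
keeps the tail update after the data read:

    /* Hypothetical kernel/user shared ring.  The consumer must finish
     * reading an entry before publishing the new tail, or the producer
     * may reuse the slot while it is still being read. */
    struct ring { unsigned long head, tail, mask; char data[]; };

    void consume_one(struct ring *r, char *dst, unsigned long entry_size)
    {
        unsigned long tail = r->tail;
        memcpy(dst, &r->data[(tail & r->mask) * entry_size],
               entry_size);                     /* read the entry...  */
        smp_store_release(&r->tail, tail + 1);  /* ...then publish it */
    }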

Or in case of futexes, if the user side doesn't use the appropriate
operations, its lock state gets corrupted, but only userspace should suffer.

But yes, this does require some care and consideration from our side.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 22:14                                                                   ` Linus Torvalds
@ 2014-02-19 14:40                                                                     ` Torvald Riegel
  2014-02-19 19:49                                                                       ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-19 14:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alec Teal, Paul McKenney, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, 2014-02-18 at 14:14 -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 1:21 PM, Torvald Riegel <triegel@redhat.com> wrote:
> >>
> >> So imagine that you have some clever global optimizer that sees that
> >> the program never ever actually sets the dirty bit at all in any
> >> thread, and then uses that kind of non-local knowledge to make
> >> optimization decisions. THAT WOULD BE BAD.
> >>
> >> Do you see what I'm aiming for?
> >
> > Yes, I do.  But that seems to be "volatile" territory.  It crosses the
> > boundaries of the abstract machine, and thus is input/output.  Which
> > fraction of your atomic accesses can read values produced by hardware?
> > I would still suppose that lots of synchronization is not affected by
> > this.
> 
> The "hardware can change things" case is indeed pretty rare.
> 
> But quite frankly, even when it isn't hardware, as far as the compiler
> is concerned you have the exact same issue - you have TLB faults
> happening on other CPU's that do the same thing asynchronously using
> software TLB fault handlers. So *semantically*, it really doesn't make
> any difference what-so-ever if it's a software TLB handler on another
> CPU, a microcoded TLB fault, or an actual hardware path.

I think there are a few semantic differences:

* If a SW handler uses the C11 memory model, it will synchronize like
any other thread.  HW might do something else entirely, including
synchronizing differently, not using atomic accesses, etc.  (At least
those are the constraints I had in mind.)

* If we can treat any interrupt handler like Just Another Thread, then
the next question is whether the compiler will be aware that there is
another thread.  I think that in practice it will be: You'll set up the
handler in some way by calling a function the compiler can't analyze, so
the compiler will know that stuff accessible to the handler (e.g.,
global variables) will potentially be accessed by other threads. 

* Similarly, if the C code is called from some external thing, it also
has to assume the presence of other threads.  (Perhaps this is what the
compiler has to assume in a freestanding implementation anyway...)

However, accessibility will be different for, say, stack variables that
haven't been shared with other functions yet; those are arguably not
reachable by other things, at least not through mechanisms defined by
the C standard.  So optimizing these should be possible with the
assumption that there is no other thread (at least as default -- I'm not
saying that this is the only reasonable semantics).

> So if the answer for all of the above is "use volatile", then I think
> that means that the C11 atomics are badly designed.
> 
> The whole *point* of atomic accesses is that stuff like above should
> "JustWork(tm)"

I think that it should in the majority of cases.  If the other thing
potentially accessing the data can do as much as a valid C11 thread can
do, the synchronization itself will work just fine.  In most cases,
except the (void*)0x123 example (or linker scripts etc.), the compiler
is aware when data is made visible to other threads or to other
non-analyzable functions that may spawn other threads (or when it is
just a plain global variable accessible to other (potentially .S)
translation units).

> > Do you perhaps want a weaker form of volatile?  That is, one that, for
> > example, allows combining of two adjacent loads of the dirty bits, but
> > will make sure that this is treated as if there is some imaginary
> > external thread that it cannot analyze and that may write?
> 
> Yes, that's basically what I would want. And it is what I would expect
> an atomic to be. Right now we tend to use "ACCESS_ONCE()", which is a
> bit of a misnomer, because technically we really generally want
> "ACCESS_AT_MOST_ONCE()" (but "once" is what we get, because we use
> volatile, and is a hell of a lot simpler to write ;^).
> 
> So we obviously use "volatile" for this currently, and generally the
> semantics we really want are:
> 
>  - the load or store is done as a single access ("atomic")
> 
>  - the compiler must not try to re-materialize the value by reloading
> it from memory (this is the "at most once" part)

In the presence of other threads performing operations unknown to the
compiler, that's what you should get even if the compiler is trying to
optimize C11 atomics.  The first requirement is clear, and the "at most
once" follows from another thread potentially writing to the variable.

The only difference I can see right now is that a compiler may be able
to *prove* that it doesn't matter whether it reloaded the value or not.
But this seems to me very hard to prove, and likely to require
whole-program analysis (which won't be possible because we don't know
what other threads are doing).  I would guess that this isn't a problem
in practice.  I just wanted to note it because it theoretically does
have a different semantics than plain volatiles.

> and quite frankly, "volatile" is a big hammer for this. In practice it
> tends to work pretty well, though, because in _most_ cases, there
> really is just the single access, so there isn't anything that it
> could be combined with, and the biggest issue is often just the
> correctness of not re-materializing the value.
> 
> And I agree - memory ordering is a totally separate issue, and in fact
> we largely tend to consider it entirely separate. For cases where we
> have ordering constraints, we either handle those with special
> accessors (ie "atomic-modify-and-test" helpers tend to have some
> serialization guarantees built in), or we add explicit fencing.

Good.

> But semantically, C11 atomic accessors *should* generally have the
> correct behavior for our uses.
> 
> If we have to add "volatile", that makes atomics basically useless. We
> already *have* the volatile semantics, if atomics need it, that just
> means that atomics have zero upside for us.

I agree, but I don't think it's necessary.  Atomics should have the
right semantics for you, provided the compiler is aware that there are
other unknown threads accessing the same data.

> >> But *local* optimizations are fine, as long as they follow the obvious
> >> rule of not actually making changes that are semantically visible.
> >
> > If we assume that there is this imaginary thread called hardware that
> > can write/read to/from such weak-volatile atomics, I believe this should
> > restrict optimizations sufficiently even in the model as specified in
> > the standard.
> 
> Well, what about *real* threads that do this, but that aren't
> analyzable by the C compiler because they are written in another
> language entirely (inline asm, asm, perl, INTERCAL, microcode,
> PAL-code, whatever?)
> 
> I really don't think that "hardware" is necessary for this to happen.
> What is done by hardware on x86, for example, is done by PAL-code
> (loaded at boot-time) on alpha, and done by hand-tuned assembler fault
> handlers on Sparc. The *effect* is the same: it's not visible to the
> compiler. There is no way in hell that the compiler can understand the
> hand-tuned Sparc TLB fault handler, even if it parsed it.

I agree.  Let me rephrase it.

If all those other threads written in whichever way use the same memory
model and ABI for synchronization (e.g., choice of HW barriers for a
certain memory_order), it doesn't matter whether it's a hardware thread,
microcode, whatever.  In this case, C11 atomics should be fine.
(We have this in userspace already, because correct compilers will have
to assume that the code generated by them has to properly synchronize
with other code generated by different compilers.)

If the other threads use a different model, access memory entirely
differently, etc, then we might be back to "volatile" because we don't
know anything, and the very strict rules about execution steps of the
abstract machine (ie, no as-if rule) are probably the safest thing to
do.

If you agree with this categorization, then I believe we just need to
look at whether a compiler is naturally aware of a variable being shared
with potentially other threads that follow C11 synchronization semantics
but are written in other languages and generally not analyzable:
* Maybe that's the case anyway when compiling for a freestanding
implementation.
* In a lot of cases, the compiler will know, because data escapes to
non-C / non-analyzable functions, or is global and accessible to other
translation units.
* Maybe we need some additional mechanism to mark those corner cases
where it isn't known (e.g., because of (void*)0x123 fixed-address
accesses, or other non-C-semantics issues; see the sketch below).  That
should be a clearer mechanism than weak-volatile; maybe a
shared_with_other_threads attribute.  But my current gut feeling is
that we wouldn't need that often, if ever.
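
To make that corner case concrete, a hypothetical sketch of such a
fixed-address access (the address and names are made up; the compiler
has no C-level object to connect this access to):

    #define SHARED_FLAG ((volatile int *)0x123)  /* magic, non-C address */

    void set_flag(void)
    {
        *SHARED_FLAG = 1;  /* invisible to whole-program analysis */
    }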

Sounds better?


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-19 10:59                                                                       ` Torvald Riegel
@ 2014-02-19 15:14                                                                         ` Paul E. McKenney
  2014-02-19 17:55                                                                           ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-19 15:14 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Alec Teal, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote:
> On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote:
> > On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote:
> > > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> > > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > > > > >> And exactly because I know enough, I would *really* like atomics to be
> > > > > >> well-defined, and have very clear - and *local* - rules about how they
> > > > > >> can be combined and optimized.
> > > > > >
> > > > > > "Local"?
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > So I think that one of the big advantages of atomics over volatile is
> > > > > that they *can* be optimized, and as such I'm not at all against
> > > > > trying to generate much better code than for volatile accesses.
> > > > > 
> > > > > But at the same time, that can go too far. For example, one of the
> > > > > things we'd want to use atomics for is page table accesses, where it
> > > > > is very important that we don't generate multiple accesses to the
> > > > > values, because parts of the values can be changed *by hardware* (ie
> > > > > accessed and dirty bits).
> > > > > 
> > > > > So imagine that you have some clever global optimizer that sees that
> > > > > the program never ever actually sets the dirty bit at all in any
> > > > > thread, and then uses that kind of non-local knowledge to make
> > > > > optimization decisions. THAT WOULD BE BAD.
> > > > 
> > > > Might as well list other reasons why value proofs via whole-program
> > > > analysis are unreliable for the Linux kernel:
> > > > 
> > > > 1.	As Linus said, changes from hardware.
> > > 
> > > This is what volatile is for, right?  (Or the weak-volatile idea I
> > > mentioned).
> > > 
> > > Compilers won't be able to prove something about the values of such
> > > variables, if marked (weak-)volatile.
> > 
> > Yep.
> > 
> > > > 2.	Assembly code that is not visible to the compiler.
> > > > 	Inline asms will -normally- let the compiler know what
> > > > 	memory they change, but some just use the "memory" tag.
> > > > 	Worse yet, I suspect that most compilers don't look all
> > > > 	that carefully at .S files.
> > > > 
> > > > 	Any number of other programs contain assembly files.
> > > 
> > > Are the annotations of changed memory really a problem?  If the "memory"
> > > tag exists, isn't that supposed to mean all memory?
> > > 
> > > To make a proof about a program for location X, the compiler has to
> > > analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
> > > the compiler will simply not be able to prove a thing (except maybe due
> > > to the data-race-free requirement for non-atomics).  The attempt to
> > > prove something isn't unreliable, simply because a correct compiler
> > > won't claim to be able to "prove" something.
> > 
> > I am indeed less worried about inline assembler than I am about files
> > full of assembly.  Or files full of other languages.
> > 
> > > One thing that could corrupt this is if the program addresses objects
> > > other than through the mechanisms defined in the language.  For example,
> > > if one thread lays out a data structure at a constant fixed memory
> > > address, and another one then uses the fixed memory address to get
> > > access to the object with a cast (e.g., (void*)0x123).
> > 
> > Or if the program uses gcc linker scripts to get the same effect.
> > 
> > > > 3.	Kernel modules that have not yet been written.  Now, the
> > > > 	compiler could refrain from trying to prove anything about
> > > > 	an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> > > > 	is currently no way to communicate this information to the
> > > > 	compiler other than marking the variable "volatile".
> > > 
> > > Even if the variable is just externally accessible, then the compiler
> > > knows that it can't do whole-program analysis about it.
> > > 
> > > It is true that whole-program analysis will not be applicable in this
> > > case, but it will not be unreliable.  I think that's an important
> > > difference.
> > 
> > Let me make sure that I understand what you are saying.  If my program has
> > "extern int foo;", the compiler will refrain from doing whole-program
> > analysis involving "foo"?
> 
> Yes.  If it can't be sure to actually have the whole program available,
> it can't do whole-program analysis, right?  Things like the linker
> scripts you mention or other stuff outside of the language semantics
> complicates this somewhat, and maybe some compilers assume too much.
> There's also the point that data-race-freedom is required for
> non-atomics even if those are shared with non-C-code.
> 
> But except those corner cases, a compiler sees whether something escapes
> and becomes visible/accessible to other entities.

The traditional response to "except those corner cases" is of course
"Murphy was an optimist".  ;-)

That said, point taken -- you expect that the compiler will always be
told of anything that would limit its ability to reason about the
whole program.

> > Or to ask it another way, when you say
> > "whole-program analysis", are you restricting that analysis to the
> > current translation unit?
> 
> No.  I mean, you can do analysis of the current translation unit, but
> that will do just that; if the variable, for example, is accessible
> outside of this translation unit, the compiler can't make a
> whole-program proof about it, and thus can't do certain optimizations.

I had to read this several times to find an interpretation that might
make sense.  That interpretation is "The compiler will do whole-program
analysis only on those variables that it believes are accessed only by
the current translation unit."  Is that what you meant?

> > If so, I was probably not the only person thinking that you instead meant
> > analysis across all translation units linked into the program.  ;-)
> 
> That's roughly what I meant, but not just including translation units
> but truly all parts of the program, including non-C program parts.  IOW,
> literally the whole program :)
> 
> That's why I said that if you indeed do *whole program* analysis, then
> things should be fine (modulo corner cases such as linker scripts, later
> binary rewriting of code produced by the compiler, etc.).  Many of the
> things you worried about *prevent* whole-program analysis, which means
> that they do not make it any less reliable.  Does that clarify my line
> of thought?

If my interpretation above is correct, yes.  It appears that you are much
more confident than many kernel folks that the compiler will be informed of
everything that might limit its omniscience.

> > > > 	Other programs have similar issues, e.g., via dlopen().
> > > > 
> > > > 4.	Some drivers allow user-mode code to mmap() some of their
> > > > 	state.  Any changes undertaken by the user-mode code would
> > > > 	be invisible to the compiler.
> > > 
> > > A good point, but a compiler that doesn't try to (incorrectly) assume
> > > something about the semantics of mmap will simply see that the mmap'ed
> > > data will escape to stuff it can't analyze, so it will not be able to
> > > make a proof.
> > > 
> > > This is different from, for example, malloc(), which is guaranteed to
> > > return "fresh" nonaliasing memory.
> > 
> > As Peter noted, this is the other end of mmap().  The -user- code sees
> > that there is an mmap(), but the kernel code invokes functions that
> > poke values into hardware registers (or into in-memory page tables)
> > that, as a side effect, cause some of the kernel's memory to be
> > accessible to some user program.
> > 
> > Presumably the kernel code needs to do something to account for the
> > possibility of usermode access whenever it accesses that memory.
> > Volatile casts, volatile storage class on the declarations, barrier()
> > calls, whatever.
> 
> In this case, there should be another option except volatile:  If
> userspace code is using the C11 memory model as well and lock-free
> atomics to synchronize, then this should have well-defined semantics
> without using volatile.

For user-mode programs that have not yet been written, this could be
a reasonable approach.  For existing user-mode binaries, C11 won't
help, which leaves things like volatile and assembly (including the
"memory" qualifier as used in barrier() macro).

> On both sides, the compiler will see that mmap() (or similar) is called,
> so that means the data escapes to something unknown, which could create
> threads and so on.  So first, it can't do whole-program analysis for
> this state anymore, and has to assume that other C11 threads are
> accessing this memory.  Next, lock-free atomics are specified to be
> "address-free", meaning that they must work independent of where in
> memory the atomics are mapped (see C++ (e.g., N3690) 29.4p3; that's a
> "should" and non-normative, but essential IMO).  Thus, this then boils
> down to just a simple case of synchronization.  (Of course, the rest of
> the ABI has to match too for the data exchange to work.)

The compiler will see mmap() on the user side, but not on the kernel
side.  On the kernel side, something special is required.

Agree that "address-free" would be nice as "shall" rather than "should".

> > I echo Peter's question about how one tags functions like mmap().
> > 
> > I will also remember this for the next time someone on the committee
> > discounts "volatile".  ;-)
> > 
> > > > 5.	JITed code produced based on BPF: https://lwn.net/Articles/437981/
> > > 
> > > This might be special, or not, depending on how the JITed code gets
> > > access to data.  If this is via fixed addresses (e.g., (void*)0x123),
> > > then see above.  If this is through function calls that the compiler
> > > can't analyze, then this is like 4.
> > 
> > It could well be via the kernel reading its own symbol table, sort of
> > a poor-person's reflection facility.  I guess that would be for all
> > intents and purposes equivalent to your (void*)0x123.
> 
> If it is replacing code generated by the compiler, then yes.  If the JIT
> is just filling in functions that had been undefined yet declared
> before, then the compiler will have seen the data escape through the
> function interfaces, and should be aware that there is other stuff.

So one other concern would then be things like ftrace, kprobes,
ksplice, and so on.  These rewrite the kernel binary at runtime, though
in very limited ways.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 21:47                                                                     ` Torvald Riegel
@ 2014-02-19 15:23                                                                       ` David Lang
  2014-02-19 18:11                                                                         ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: David Lang @ 2014-02-19 15:23 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Peter Zijlstra, Linus Torvalds, Alec Teal, Paul McKenney,
	Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Tue, 18 Feb 2014, Torvald Riegel wrote:

> On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote:
>> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
>>> Well, that's how atomics that aren't volatile are defined in the
>>> standard.  I can see that you want something else too, but that doesn't
>>> mean that the other thing is broken.
>>
>> Well that other thing depends on being able to see the entire program at
>> compile time. PaulMck already listed various ways in which this is
>> not feasible even for normal userspace code.
>>
>> In particular; DSOs and JITs were mentioned.
>
> No it doesn't depend on whole-program analysis being possible.  Because
> if it isn't, then a correct compiler will just not do certain
> optimizations simply because it can't prove properties required for the
> optimization to hold.  With the exception of access to objects via magic
> numbers (e.g., fixed and known addresses (see my reply to Paul), which
> are outside of the semantics specified in the standard), I don't see a
> correctness problem here.

Are you really sure that the compiler can figure out every possible thing that a 
loadable module or JITed code can access? That seems like a pretty strong claim.

David Lang

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-19 15:14                                                                         ` Paul E. McKenney
@ 2014-02-19 17:55                                                                           ` Torvald Riegel
  2014-02-19 22:12                                                                             ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-19 17:55 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Alec Teal, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, 2014-02-19 at 07:14 -0800, Paul E. McKenney wrote:
> On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote:
> > On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote:
> > > > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> > > > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > > > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > > > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > > > > > >> And exactly because I know enough, I would *really* like atomics to be
> > > > > > >> well-defined, and have very clear - and *local* - rules about how they
> > > > > > >> can be combined and optimized.
> > > > > > >
> > > > > > > "Local"?
> > > > > > 
> > > > > > Yes.
> > > > > > 
> > > > > > So I think that one of the big advantages of atomics over volatile is
> > > > > > that they *can* be optimized, and as such I'm not at all against
> > > > > > trying to generate much better code than for volatile accesses.
> > > > > > 
> > > > > > But at the same time, that can go too far. For example, one of the
> > > > > > things we'd want to use atomics for is page table accesses, where it
> > > > > > is very important that we don't generate multiple accesses to the
> > > > > > values, because parts of the values can be changed *by hardware* (ie
> > > > > > accessed and dirty bits).
> > > > > > 
> > > > > > So imagine that you have some clever global optimizer that sees that
> > > > > > the program never ever actually sets the dirty bit at all in any
> > > > > > thread, and then uses that kind of non-local knowledge to make
> > > > > > optimization decisions. THAT WOULD BE BAD.
> > > > > 
> > > > > Might as well list other reasons why value proofs via whole-program
> > > > > analysis are unreliable for the Linux kernel:
> > > > > 
> > > > > 1.	As Linus said, changes from hardware.
> > > > 
> > > > This is what volatile is for, right?  (Or the weak-volatile idea I
> > > > mentioned).
> > > > 
> > > > Compilers won't be able to prove something about the values of such
> > > > variables, if marked (weak-)volatile.
> > > 
> > > Yep.
> > > 
> > > > > 2.	Assembly code that is not visible to the compiler.
> > > > > 	Inline asms will -normally- let the compiler know what
> > > > > 	memory they change, but some just use the "memory" tag.
> > > > > 	Worse yet, I suspect that most compilers don't look all
> > > > > 	that carefully at .S files.
> > > > > 
> > > > > 	Any number of other programs contain assembly files.
> > > > 
> > > > Are the annotations of changed memory really a problem?  If the "memory"
> > > > tag exists, isn't that supposed to mean all memory?
> > > > 
> > > > To make a proof about a program for location X, the compiler has to
> > > > analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
> > > > the compiler will simply not be able to prove a thing (except maybe due
> > > > to the data-race-free requirement for non-atomics).  The attempt to
> > > > prove something isn't unreliable, simply because a correct compiler
> > > > won't claim to be able to "prove" something.
> > > 
> > > I am indeed less worried about inline assembler than I am about files
> > > full of assembly.  Or files full of other languages.
> > > 
> > > > One thing that could corrupt this is if the program addresses objects
> > > > other than through the mechanisms defined in the language.  For example,
> > > > if one thread lays out a data structure at a constant fixed memory
> > > > address, and another one then uses the fixed memory address to get
> > > > access to the object with a cast (e.g., (void*)0x123).
> > > 
> > > Or if the program uses gcc linker scripts to get the same effect.
> > > 
> > > > > 3.	Kernel modules that have not yet been written.  Now, the
> > > > > 	compiler could refrain from trying to prove anything about
> > > > > 	an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> > > > > 	is currently no way to communicate this information to the
> > > > > 	compiler other than marking the variable "volatile".
> > > > 
> > > > Even if the variable is just externally accessible, then the compiler
> > > > knows that it can't do whole-program analysis about it.
> > > > 
> > > > It is true that whole-program analysis will not be applicable in this
> > > > case, but it will not be unreliable.  I think that's an important
> > > > difference.
> > > 
> > > Let me make sure that I understand what you are saying.  If my program has
> > > "extern int foo;", the compiler will refrain from doing whole-program
> > > analysis involving "foo"?
> > 
> > Yes.  If it can't be sure to actually have the whole program available,
> > it can't do whole-program analysis, right?  Things like the linker
> > scripts you mention or other stuff outside of the language semantics
> > complicates this somewhat, and maybe some compilers assume too much.
> > There's also the point that data-race-freedom is required for
> > non-atomics even if those are shared with non-C-code.
> > 
> > But except those corner cases, a compiler sees whether something escapes
> > and becomes visible/accessible to other entities.
> 
> The traditional response to "except those corner cases" is of course
> "Murphy was an optimist".  ;-)
> 
> That said, point taken -- you expect that the compiler will always be
> told of anything that would limit its ability to reason about the
> whole program.
> 
> > > Or to ask it another way, when you say
> > > "whole-program analysis", are you restricting that analysis to the
> > > current translation unit?
> > 
> > No.  I mean, you can do analysis of the current translation unit, but
> > that will do just that; if the variable, for example, is accessible
> > outside of this translation unit, the compiler can't make a
> > whole-program proof about it, and thus can't do certain optimizations.
> 
> I had to read this several times to find an interpretation that might
> make sense.  That interpretation is "The compiler will do whole-program
> analysis only on those variables that it believes are accessed only by
> the current translation unit."  Is that what you meant?

Yes.  For a pure-C program (ie, one that's perfectly specified by just
the C standard), this will be the case; IOW, I'm not aware of any corner
case in such a setting.  But the kernel is doing more than what the C
standard covers, so we'll have to check those things.

> > > If so, I was probably not the only person thinking that you instead meant
> > > analysis across all translation units linked into the program.  ;-)
> > 
> > That's roughly what I meant, but not just including translation units
> > but truly all parts of the program, including non-C program parts.  IOW,
> > literally the whole program :)
> > 
> > That's why I said that if you indeed do *whole program* analysis, then
> > things should be fine (modulo corner cases such as linker scripts, later
> > binary rewriting of code produced by the compiler, etc.).  Many of the
> > things you worried about *prevent* whole-program analysis, which means
> > that they do not make it any less reliable.  Does that clarify my line
> > of thought?
> 
> If my interpretation above is correct, yes.  It appears that you are much
> more confident than many kernel folks that the compiler will be informed of
> everything that might limit its omniscience.

Maybe.  Nonetheless, if it's just a matter of letting the compiler know
that there is Other Stuff it isn't aware of, and once we do so all is
good because the memory model handles this just fine, then that makes
me more optimistic than if the model itself were insufficient.

> > > > > 	Other programs have similar issues, e.g., via dlopen().
> > > > > 
> > > > > 4.	Some drivers allow user-mode code to mmap() some of their
> > > > > 	state.  Any changes undertaken by the user-mode code would
> > > > > 	be invisible to the compiler.
> > > > 
> > > > A good point, but a compiler that doesn't try to (incorrectly) assume
> > > > something about the semantics of mmap will simply see that the mmap'ed
> > > > data will escape to stuff it can't analyze, so it will not be able to
> > > > make a proof.
> > > > 
> > > > This is different from, for example, malloc(), which is guaranteed to
> > > > return "fresh" nonaliasing memory.
> > > 
> > > As Peter noted, this is the other end of mmap().  The -user- code sees
> > > that there is an mmap(), but the kernel code invokes functions that
> > > poke values into hardware registers (or into in-memory page tables)
> > > that, as a side effect, cause some of the kernel's memory to be
> > > accessible to some user program.
> > > 
> > > Presumably the kernel code needs to do something to account for the
> > > possibility of usermode access whenever it accesses that memory.
> > > Volatile casts, volatile storage class on the declarations, barrier()
> > > calls, whatever.
> > 
> > In this case, there should be another option except volatile:  If
> > userspace code is using the C11 memory model as well and lock-free
> > atomics to synchronize, then this should have well-defined semantics
> > without using volatile.
> 
> For user-mode programs that have not yet been written, this could be
> a reasonable approach.  For existing user-mode binaries, C11 won't
> help, which leaves things like volatile and assembly (including the
> "memory" qualifier as used in barrier() macro).

I agree.  I also agree that there is the possibility of
malicious/incorrect userspace code, and that this shouldn't endanger the
kernel.

> > On both sides, the compiler will see that mmap() (or similar) is called,
> > so that means the data escapes to something unknown, which could create
> > threads and so on.  So first, it can't do whole-program analysis for
> > this state anymore, and has to assume that other C11 threads are
> > accessing this memory.  Next, lock-free atomics are specified to be
> > "address-free", meaning that they must work independent of where in
> > memory the atomics are mapped (see C++ (e.g., N3690) 29.4p3; that's a
> > "should" and non-normative, but essential IMO).  Thus, this then boils
> > down to just a simple case of synchronization.  (Of course, the rest of
> > the ABI has to match too for the data exchange to work.)
> 
> The compiler will see mmap() on the user side, but not on the kernel
> side.  On the kernel side, something special is required.

Maybe -- you'll certainly know better :)

But maybe it's not that hard: for example, if, in the current code, the
memory is made available to userspace by calling some function with an
asm implementation that the compiler can't analyze, then this should be
sufficient.

> Agree that "address-free" would be nice as "shall" rather than "should".
> 
> > > I echo Peter's question about how one tags functions like mmap().
> > > 
> > > I will also remember this for the next time someone on the committee
> > > discounts "volatile".  ;-)
> > > 
> > > > > 5.	JITed code produced based on BPF: https://lwn.net/Articles/437981/
> > > > 
> > > > This might be special, or not, depending on how the JITed code gets
> > > > access to data.  If this is via fixed addresses (e.g., (void*)0x123),
> > > > then see above.  If this is through function calls that the compiler
> > > > can't analyze, then this is like 4.
> > > 
> > > It could well be via the kernel reading its own symbol table, sort of
> > > a poor-person's reflection facility.  I guess that would be for all
> > > intents and purposes equivalent to your (void*)0x123.
> > 
> > If it is replacing code generated by the compiler, then yes.  If the JIT
> > is just filling in functions that had been undefined yet declared
> > before, then the compiler will have seen the data escape through the
> > function interfaces, and should be aware that there is other stuff.
> 
> So one other concern would then be things like ftrace, kprobes,
> ksplice, and so on.  These rewrite the kernel binary at runtime, though
> in very limited ways.

Yes.  Nonetheless, I wouldn't see a problem if they, say, rewrite with
C11-compatible code (and same ABI) at function granularity (and when
the function itself isn't executing concurrently) -- this seems to be
similar to just having another compiler compile this particular
function.



^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-19 15:23                                                                       ` David Lang
@ 2014-02-19 18:11                                                                         ` Torvald Riegel
  0 siblings, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-19 18:11 UTC (permalink / raw)
  To: David Lang
  Cc: Peter Zijlstra, Linus Torvalds, Alec Teal, Paul McKenney,
	Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Wed, 2014-02-19 at 07:23 -0800, David Lang wrote:
> On Tue, 18 Feb 2014, Torvald Riegel wrote:
> 
> > On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote:
> >> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> >>> Well, that's how atomics that aren't volatile are defined in the
> >>> standard.  I can see that you want something else too, but that doesn't
> >>> mean that the other thing is broken.
> >>
> >> Well that other thing depends on being able to see the entire program at
> >> compile time. PaulMck already listed various ways in which this is
> >> not feasible even for normal userspace code.
> >>
> >> In particular; DSOs and JITs were mentioned.
> >
> > No it doesn't depend on whole-program analysis being possible.  Because
> > if it isn't, then a correct compiler will just not do certain
> > optimizations simply because it can't prove properties required for the
> > optimization to hold.  With the exception of access to objects via magic
> > numbers (e.g., fixed and known addresses (see my reply to Paul), which
> > are outside of the semantics specified in the standard), I don't see a
> > correctness problem here.
> 
> Are you really sure that the compiler can figure out every possible thing that a 
> loadable module or JITed code can access? That seems like a pretty strong claim.

If the other code can be produced by a C translation unit that is valid
to be linked with the rest of the program, then I'm pretty sure the
compiler has a well-defined notion of whether it does or does not see
all other potential accesses.  IOW, if the C compiler is dealing with C
semantics and mechanisms only (including the C mechanisms for sharing
with non-C code!), then it will know what to do.

If you're playing tricks behind the C compiler's back using
implementation-defined stuff outside of the C specification, then
there's nothing the compiler really can do.  For example, if you're
trying to access a variable on a function's stack from some other
function, you better know how the register allocator of the compiler
operates.  In contrast, if you let this function simply export the
address of the variable to some external place, all will be fine.
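
A minimal sketch of that escape (publish() is a hypothetical external
function the compiler cannot analyze):

    extern void publish(int *p);  /* opaque to the compiler */

    void f(void)
    {
        int v = 0;    /* purely local so far: freely optimizable */
        publish(&v);  /* the address escapes here... */
        v = 1;        /* ...so the compiler can no longer prove that
                       * nothing else observes v */
    }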

The documentation of GCC's -fwhole-program and -flto might also be
interesting for you.  GCC wouldn't need to have -fwhole-program if it
weren't conservative by default (correctly so).


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-19 14:40                                                                     ` Torvald Riegel
@ 2014-02-19 19:49                                                                       ` Linus Torvalds
  0 siblings, 0 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-19 19:49 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Alec Teal, Paul McKenney, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Feb 19, 2014 at 6:40 AM, Torvald Riegel <triegel@redhat.com> wrote:
>
> If all those other threads written in whichever way use the same memory
> model and ABI for synchronization (e.g., choice of HW barriers for a
> certain memory_order), it doesn't matter whether it's a hardware thread,
> microcode, whatever.  In this case, C11 atomics should be fine.
> (We have this in userspace already, because correct compilers will have
> to assume that the code generated by them has to properly synchronize
> with other code generated by different compilers.)
>
> If the other threads use a different model, access memory entirely
> differently, etc, then we might be back to "volatile" because we don't
> know anything, and the very strict rules about execution steps of the
> abstract machine (ie, no as-if rule) are probably the safest thing to
> do.

Oh, I don't even care about architectures that don't have real hardware atomics.

So if there's a software protocol for atomics, all bets are off. The
compiler almost certainly has to do atomics with function calls
anyway, and we'll just plug in our own. And frankly, nobody will ever
care, because those architectures aren't relevant, and never will be.

Sure, there are some ancient Sparc platforms that only support a
single-byte "ldstub" and there are some embedded chips that don't
really do SMP, but have some pseudo-smp with special separate locking.
Really, nobody cares. The C standard has those crazy lock-free atomic
tests, and talks about address-free, but generally we require both
lock-free and address-free in the kernel, because otherwise it's just
too painful to do interrupt-safe locking, or do atomics in user-space
(for futexes).

So if your worry is just about software protocols for CPU's that
aren't actually designed for modern SMP, that's pretty much a complete
non-issue.

                Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-19 17:55                                                                           ` Torvald Riegel
@ 2014-02-19 22:12                                                                             ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-19 22:12 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Alec Teal, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Feb 19, 2014 at 06:55:51PM +0100, Torvald Riegel wrote:
> On Wed, 2014-02-19 at 07:14 -0800, Paul E. McKenney wrote:
> > On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote:

[ . . . ]

> > > On both sides, the compiler will see that mmap() (or similar) is called,
> > > so that means the data escapes to something unknown, which could create
> > > threads and so on.  So first, it can't do whole-program analysis for
> > > this state anymore, and has to assume that other C11 threads are
> > > accessing this memory.  Next, lock-free atomics are specified to be
> > > "address-free", meaning that they must work independent of where in
> > > memory the atomics are mapped (see C++ (e.g., N3690) 29.4p3; that's a
> > > "should" and non-normative, but essential IMO).  Thus, this then boils
> > > down to just a simple case of synchronization.  (Of course, the rest of
> > > the ABI has to match too for the data exchange to work.)
> > 
> > The compiler will see mmap() on the user side, but not on the kernel
> > side.  On the kernel side, something special is required.
> 
> Maybe -- you'll certainly know better :)
> 
> But maybe it's not that hard: for example, if, in the current code, the
> memory is made available to userspace by calling some function with an
> asm implementation that the compiler can't analyze, then this should be
> sufficient.

The kernel code would need to explicitly tell the compiler what portions
of the kernel address space were covered by this.  I would not want the
compiler to have to work it out based on observing interactions with the
page tables.  ;-)

> > Agree that "address-free" would be nice as "shall" rather than "should".
> > 
> > > > I echo Peter's question about how one tags functions like mmap().
> > > > 
> > > > I will also remember this for the next time someone on the committee
> > > > discounts "volatile".  ;-)
> > > > 
> > > > > > 5.	JITed code produced based on BPF: https://lwn.net/Articles/437981/
> > > > > 
> > > > > This might be special, or not, depending on how the JITed code gets
> > > > > access to data.  If this is via fixed addresses (e.g., (void*)0x123),
> > > > > then see above.  If this is through function calls that the compiler
> > > > > can't analyze, then this is like 4.
> > > > 
> > > > It could well be via the kernel reading its own symbol table, sort of
> > > > a poor-person's reflection facility.  I guess that would be for all
> > > > intents and purposes equivalent to your (void*)0x123.
> > > 
> > > If it is replacing code generated by the compiler, then yes.  If the JIT
> > > is just filling in functions that had been undefined yet declared
> > > before, then the compiler will have seen the data escape through the
> > > function interfaces, and should be aware that there is other stuff.
> > 
> > So one other concern would then be things like ftrace, kprobes,
> > ksplice, and so on.  These rewrite the kernel binary at runtime, though
> > in very limited ways.
> 
> Yes.  Nonetheless, I wouldn't see a problem if they, say, rewrite with
> C11-compatible code (and same ABI) at function granularity (and when
> the function itself isn't executing concurrently) -- this seems to be
> similar to just having another compiler compile this particular
> function.

Well, they aren't using C11-compatible code yet.  They do patch within
functions.  And in some cases, they make staged sequences of changes to
allow the patching to happen concurrently with other CPUs executing the
code being patched.  Not sure that any of the latter is actually in the
kernel at the moment, but it has at least been prototyped and discussed.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 19:47                                                                   ` Torvald Riegel
@ 2014-02-20  0:53                                                                     ` Linus Torvalds
  2014-02-20  4:01                                                                       ` Paul E. McKenney
  2014-02-20 17:14                                                                       ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-20  0:53 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel <triegel@redhat.com> wrote:
> On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
>>
>> Can you point to it? Because I can find a draft standard, and it sure
>> as hell does *not* contain any clarity of the model. It has a *lot* of
>> verbiage, but it's pretty much impossible to actually understand, even
>> for somebody who really understands memory ordering.
>
> http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> This has an explanation of the model up front, and then the detailed
> formulae in Section 6.  This is from 2010, and there might have been
> smaller changes since then, but I'm not aware of any bigger ones.

Ahh, this is different from what others pointed at. Same people,
similar name, but not the same paper.

I will read this version too, but from reading the other one and the
standard in parallel and trying to make sense of it, it seems that I
may have originally misunderstood part of the whole control dependency
chain.

The fact that the left side of "? :", "&&" and "||" breaks data
dependencies made me originally think that the standard tried very
hard to break any control dependencies. Which I felt was insane, when
some of the examples literally were about testing the value of an
atomic read. The data dependency matters quite a bit. The
fact that the other "Mathematical" paper then very much talked about
consume only in the sense of following a pointer made me think so even
more.

But reading it some more, I now think that the whole "data dependency"
logic (which is where the special left-hand side rule of the ternary
and logical operators comes in) is basically an exception to the rule
that sequence points end up being also meaningful for ordering (ok, so
C11 seems to have renamed "sequence points" to "sequenced before").

So while an expression like

    atomic_read(p, consume) ? a : b;

doesn't have a data dependency from the atomic read that forces
serialization, writing

   if (atomic_read(p, consume))
      a;
   else
      b;

the standard *does* imply that the atomic read is "happens-before" wrt
"a", and I'm hoping that there is no question that the control
dependency still acts as an ordering point.
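
In standard C11 spelling (expanding the atomic_read() shorthand above,
with a() and b() as stand-in statements), the two forms would be
roughly:

    #include <stdatomic.h>

    extern _Atomic int p;
    extern void a(void), b(void);

    void example(void)
    {
        /* ternary form: the LHS of "?:" is excluded from carrying
         * a data dependency */
        atomic_load_explicit(&p, memory_order_consume) ? a() : b();

        /* if form: the load is sequenced before "a" and "b" */
        if (atomic_load_explicit(&p, memory_order_consume))
            a();
        else
            b();
    }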

THAT was one of my big confusions, the discussion about control
dependencies and the fact that the logical ops broke the data
dependency made me believe that the standard tried to actively avoid
the whole issue with "control dependencies can break ordering
dependencies on some CPU's due to branch prediction and memory
re-ordering by the CPU".

But after all the reading, I'm starting to think that that was never
actually the implication at all, and the "logical ops breaks the data
dependency rule" is simply an exception to the sequence point rule.
All other sequence points still do exist, and do imply an ordering
that matters for "consume".

Am I now reading it right?

So the clarification is basically to the statement that the "if
(consume(p)) a" version *would* have an ordering guarantee between the
read of "p" and "a", but the "consume(p) ? a : b" would *not* have
such an ordering guarantee. Yes?

                       Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  0:53                                                                     ` Linus Torvalds
@ 2014-02-20  4:01                                                                       ` Paul E. McKenney
  2014-02-20  4:43                                                                         ` Linus Torvalds
  2014-02-20 17:26                                                                         ` Torvald Riegel
  2014-02-20 17:14                                                                       ` Torvald Riegel
  1 sibling, 2 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-20  4:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> >>
> >> Can you point to it? Because I can find a draft standard, and it sure
> >> as hell does *not* contain any clarity of the model. It has a *lot* of
> >> verbiage, but it's pretty much impossible to actually understand, even
> >> for somebody who really understands memory ordering.
> >
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > This has an explanation of the model up front, and then the detailed
> > formulae in Section 6.  This is from 2010, and there might have been
> > smaller changes since then, but I'm not aware of any bigger ones.
> 
> Ahh, this is different from what others pointed at. Same people,
> similar name, but not the same paper.
> 
> I will read this version too, but from reading the other one and the
> standard in parallel and trying to make sense of it, it seems that I
> may have originally misunderstood part of the whole control dependency
> chain.
> 
> The fact that the left side of "? :", "&&" and "||" breaks data
> dependencies made me originally think that the standard tried very
> hard to break any control dependencies. Which I felt was insane, when
> some of the examples literally were about testing the value of an
> atomic read. The data dependency matters quite a bit. The
> fact that the other "Mathematical" paper then very much talked about
> consume only in the sense of following a pointer made me think so even
> more.
> 
> But reading it some more, I now think that the whole "data dependency"
> logic (which is where the special left-hand side rule of the ternary
> and logical operators comes in) is basically an exception to the rule
> that sequence points end up being also meaningful for ordering (ok, so
> C11 seems to have renamed "sequence points" to "sequenced before").
> 
> So while an expression like
> 
>     atomic_read(p, consume) ? a : b;
> 
> doesn't have a data dependency from the atomic read that forces
> serialization, writing
> 
>    if (atomic_read(p, consume))
>       a;
>    else
>       b;
> 
> the standard *does* imply that the atomic read is "happens-before" wrt
> "a", and I'm hoping that there is no question that the control
> dependency still acts as an ordering point.

The control dependency should order subsequent stores, at least assuming
that "a" and "b" don't start off with identical stores that the compiler
could pull out of the "if" and merge.  The same might also be true for ?:
for all I know.  (But see below)

That said, in this case, you could substitute relaxed for consume and get
the same effect.  The return value from atomic_read() gets absorbed into
the "if" condition, so there is no dependency-ordered-before relationship,
so nothing for consume to do.
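
To make the store-merging caveat above concrete, a hypothetical sketch
(flag, x, do_a(), and do_b() are made up):

    #include <stdatomic.h>

    extern atomic_int flag;
    extern int x;
    extern void do_a(void), do_b(void);

    void example(void)
    {
        if (atomic_load_explicit(&flag, memory_order_relaxed)) {
            x = 1;  /* identical store in both arms... */
            do_a();
        } else {
            x = 1;  /* ...so the compiler may merge the two stores and
                     * hoist them above the branch, losing the control
                     * dependency for that store */
            do_b();
        }
    }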

One caution...  The happens-before relationship requires you to trace a
full path between the two operations of interest.  This is illustrated
by the following example, with both x and y initially zero:

T1:	atomic_store_explicit(&x, 1, memory_order_relaxed);
	r1 = atomic_load_explicit(&y, memory_order_relaxed);

T2:	atomic_store_explicit(&y, 1, memory_order_relaxed);
	r2 = atomic_load_explicit(&x, memory_order_relaxed);

There is a happens-before relationship between T1's load and store,
and another happens-before relationship between T2's load and store,
but there is no happens-before relationship from T1 to T2, and none
in the other direction, either.  And you don't get to assume any
ordering based on reasoning about these two disjoint happens-before
relationships.

So it is quite possible for r1==0&&r2==0 after both threads complete.

Which should be no surprise: This misordering can happen even on x86,
which would need a full smp_mb() to prevent it.
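
A self-contained way to experiment with this example, assuming a C11
<threads.h> implementation is available:

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    atomic_int x, y;
    int r1, r2;

    int t1(void *arg)
    {
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return 0;
    }

    int t2(void *arg)
    {
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return 0;
    }

    int main(void)
    {
        thrd_t a, b;

        thrd_create(&a, t1, NULL);
        thrd_create(&b, t2, NULL);
        thrd_join(a, NULL);
        thrd_join(b, NULL);
        /* r1 == 0 && r2 == 0 is allowed; a single run rarely shows
         * it, so loop many iterations to actually observe it. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }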

> THAT was one of my big confusions, the discussion about control
> dependencies and the fact that the logical ops broke the data
> dependency made me believe that the standard tried to actively avoid
> the whole issue with "control dependencies can break ordering
> dependencies on some CPU's due to branch prediction and memory
> re-ordering by the CPU".
> 
> But after all the reading, I'm starting to think that that was never
> actually the implication at all, and the "logical ops breaks the data
> dependency rule" is simply an exception to the sequence point rule.
> All other sequence points still do exist, and do imply an ordering
> that matters for "consume".
> 
> Am I now reading it right?

As long as there is an unbroken chain of -data- dependencies from the
consume to the later access in question, and as long as that chain
doesn't go through the excluded operations, yes.

> So the clarification is basically to the statement that the "if
> (consume(p)) a" version *would* have an ordering guarantee between the
> read of "p" and "a", but the "consume(p) ? a : b" would *not* have
> such an ordering guarantee. Yes?

Neither has a data-dependency guarantee, because there is no data
dependency from the load to either "a" or "b".  After all, the value
loaded got absorbed into the "if" condition.  However, according to
discussions earlier in this thread, the "if" variant would have a
control-dependency ordering guarantee for any stores in "a" and "b"
(but not loads!).  The ?: form might also have a control-dependency
guarantee for any stores in "a" and "b" (again, not loads).

Why my uncertainty?

Well, the standard does not talk explicitly about control dependencies.
They currently appear to be a side effect of other requirements in the
standard, for example, the prohibition against doing stores to atomics
if those stores wouldn't happen in an unoptimized naive compilation
of the program.  Even then, you have to take this in combination with
ordering guarantees of all the hardware that Linux currently runs on to
get to the control dependency.

I would feel way better if the standard explicitly called out ordering
based on control dependencies, but that is something for Torvald Riegel
and me to hash out.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  4:01                                                                       ` Paul E. McKenney
@ 2014-02-20  4:43                                                                         ` Linus Torvalds
  2014-02-20  8:30                                                                           ` Paul E. McKenney
  2014-02-20 17:49                                                                           ` Torvald Riegel
  2014-02-20 17:26                                                                         ` Torvald Riegel
  1 sibling, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-20  4:43 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Feb 19, 2014 at 8:01 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> The control dependency should order subsequent stores, at least assuming
> that "a" and "b" don't start off with identical stores that the compiler
> could pull out of the "if" and merge.  The same might also be true for ?:
> for all I know.  (But see below)

Stores I don't worry about so much because

 (a) you can't sanely move stores up in a compiler anyway
 (b) no sane CPU moves stores up, since they aren't on the critical path

so a read->cmp->store is actually really hard to make anything sane
re-order. I'm sure it can be done, and I'm sure it's stupid as hell.

But that "it's hard to screw up" is *not* true for a load->cmp->load.

So lets make this really simple: if you have a consume->cmp->read, is
the ordering of the two reads guaranteed?

> As long as there is an unbroken chain of -data- dependencies from the
> consume to the later access in question, and as long as that chain
> doesn't go through the excluded operations, yes.

So let's make it *really* specific, and make it real code doing a real
operation, that is actually realistic and reasonable in a threaded
environment, and may even be in some critical code.

The issue is the read-side ordering guarantee for 'a' and 'b', for this case:

 - Initial state:

   a = b = 0;

 - Thread 1 ("consumer"):

    if (atomic_read(&a, consume))
         return b;
    /* not yet initialized */
    return -1;

 - Thread 2 ("initializer"):

     b = some_value_lets_say_42;
     /* We are now ready to party */
     atomic_write(&a, 1, release);
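
In standard C11 spelling, the above sketch would be roughly:

    #include <stdatomic.h>

    atomic_int a;  /* initially 0 */
    int b;         /* initially 0 */

    int consumer(void)
    {
        if (atomic_load_explicit(&a, memory_order_consume))
            return b;
        return -1;  /* not yet initialized */
    }

    void initializer(void)
    {
        b = 42;  /* some_value_lets_say_42 */
        atomic_store_explicit(&a, 1, memory_order_release);
    }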

and quite frankly, if there is no ordering guarantee between the read
of "a" and the read of "b" in the consumer thread, then the C atomics
standard is broken.

Put another way: I claim that if "thread 1" ever sees a return value
other than -1 or 42, then the whole definition of atomics is broken.

Question 2: and what changes if the atomic_read() is turned into an
acquire, and why? Does it start working?

> Neither has a data-dependency guarantee, because there is no data
> dependency from the load to either "a" or "b".  After all, the value
> loaded got absorbed into the "if" condition.  However, according to
> discussions earlier in this thread, the "if" variant would have a
> control-dependency ordering guarantee for any stores in "a" and "b"
> (but not loads!).

So exactly what part of the standard allows the loads to be
re-ordered, and why? Quite frankly, I'd think that any sane person
will agree that the above code snippet is realistic, and that my
requirement that thread 1 sees either -1 or 42 is valid.

And if the C standards body has said that control dependencies break
the read ordering, then I really think that the C standards committee
has screwed up.

If the consumer of an atomic load isn't a pointer chasing operation,
then the consume should be defined to be the same as acquire. None of
this "conditionals break consumers". No, conditionals on the
dependency path should turn consumers into acquire, because otherwise
the "consume" load is dangerous as hell.

And if the definition of acquire doesn't include the control
dependency either, then the C atomic memory model is just completely
and utterly broken, since the above *trivial* and clearly useful
example is broken.

I really think the above example is pretty damn black-and-white.
Either it works, or the standard isn't worth wiping your ass with.

          Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  4:43                                                                         ` Linus Torvalds
@ 2014-02-20  8:30                                                                           ` Paul E. McKenney
  2014-02-20  9:20                                                                             ` Paul E. McKenney
                                                                                               ` (2 more replies)
  2014-02-20 17:49                                                                           ` Torvald Riegel
  1 sibling, 3 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-20  8:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Feb 19, 2014 at 08:43:14PM -0800, Linus Torvalds wrote:
> On Wed, Feb 19, 2014 at 8:01 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > The control dependency should order subsequent stores, at least assuming
> > that "a" and "b" don't start off with identical stores that the compiler
> > could pull out of the "if" and merge.  The same might also be true for ?:
> > for all I know.  (But see below)
> 
> Stores I don't worry about so much because
> 
>  (a) you can't sanely move stores up in a compiler anyway
>  (b) no sane CPU moves stores up, since they aren't on the critical path
> 
> so a read->cmp->store is actually really hard to make anything sane
> re-order. I'm sure it can be done, and I'm sure it's stupid as hell.
> 
> But that "it's hard to screw up" is *not* true for a load->cmp->load.
> 
> So let's make this really simple: if you have a consume->cmp->read, is
> the ordering of the two reads guaranteed?

Not as far as I know.  Also, as far as I know, there is no difference
between consume and relaxed in the consume->cmp->read case.

> > As long as there is an unbroken chain of -data- dependencies from the
> > consume to the later access in question, and as long as that chain
> > doesn't go through the excluded operations, yes.
> 
> So let's make it *really* specific, and make it real code doing a real
> operation, that is actually realistic and reasonable in a threaded
> environment, and may even be in some critical code.
> 
> The issue is the read-side ordering guarantee for 'a' and 'b', for this case:
> 
>  - Initial state:
> 
>    a = b = 0;
> 
>  - Thread 1 ("consumer"):
> 
>     if (atomic_read(&a, consume))
>          return b;
>     /* not yet initialized */
>     return -1;
> 
>  - Thread 2 ("initializer"):
> 
>      b = some_value_lets_say_42;
>      /* We are now ready to party */
>      atomic_write(&a, 1, release);
> 
> and quite frankly, if there is no ordering guarantee between the read
> of "a" and the read of "b" in the consumer thread, then the C atomics
> standard is broken.
> 
> Put another way: I claim that if "thread 1" ever sees a return value
> other than -1 or 42, then the whole definition of atomics is broken.

The above example can have a return value of 0 if translated
straightforwardly into either ARM or Power, right?  Both of these can
speculate a read into a conditional, and both can translate a consume
load into a plain load if data dependencies remain unbroken.

So, if you make one of two changes to your example, then I will agree
with you.  The first change is to have a real data dependency between
the read of "a" and the second read:

 - Initial state:

   a = &c; b = 0; c = 0;

 - Thread 1 ("consumer"):

    if (atomic_read(&a, consume))
         return *a;
    /* not yet initialized */
    return -1;

 - Thread 2 ("initializer"):

     b = some_value_lets_say_42;
     /* We are now ready to party */
     atomic_write(&a, &b, release);

The second change is to make the "consume" be an acquire:

 - Initial state:

   a = b = 0;

 - Thread 1 ("consumer"):

    if (atomic_read(&a, acquire))
         return b;
    /* not yet initialized */
    return -1;

 - Thread 2 ("initializer"):

     b = some_value_lets_say_42;
     /* We are now ready to party */
     atomic_write(&a, 1, release);

In theory, you could also change the "return" to a store, but the example
gets a bit complicated and as far as I can tell you get into the state
where the standard does not explicitly support it, even though I have
a hard time imagining an actual implementation that fails to support it.
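
For concreteness, here is roughly what that first (data-dependency)
variant looks like in actual C11 <stdatomic.h> syntax -- an
illustrative sketch only, with the test written against &c so that
the dereference, not the comparison, is what carries the dependency:

    #include <stdatomic.h>

    int c = 0, b = 0;
    _Atomic(int *) a = &c;      /* initially points at the dummy */

    int consumer(void)          /* Thread 1 */
    {
        int *p = atomic_load_explicit(&a, memory_order_consume);

        if (p != &c)
            return *p;          /* load carries a data dependency on p */
        return -1;              /* not yet initialized */
    }

    void initializer(void)      /* Thread 2 */
    {
        b = 42;
        atomic_store_explicit(&a, &b, memory_order_release);
    }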

> Question 2: and what changes if the atomic_read() is turned into an
> acquire, and why? Does it start working?

Yep, that is the second change above.

> > Neither has a data-dependency guarantee, because there is no data
> > dependency from the load to either "a" or "b".  After all, the value
> > loaded got absorbed into the "if" condition.  However, according to
> > discussions earlier in this thread, the "if" variant would have a
> > control-dependency ordering guarantee for any stores in "a" and "b"
> > (but not loads!).
> 
> So exactly what part of the standard allows the loads to be
> re-ordered, and why? Quite frankly, I'd think that any sane person
> will agree that the above code snippet is realistic, and that my
> requirement that thread 1 sees either -1 or 42 is valid.

Unless I am really confused, weakly ordered systems would be just fine
returning zero in your original example.  In the case of ARM, I believe
you need either a data dependency, an ISB after the branch, or a DMB
instruction to force the ordering you want.  In the case of PowerPC, I
believe that you need either a data dependency, an isync after the branch,
an lwsync, or a sync.  I would not expect a C compiler to generate code
with any of these precautions in place.
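
For illustration, one portable way to get the ordering in the
conditional case is a relaxed load plus an explicit acquire fence --
a sketch only, assuming "a" is declared _Atomic int; the fence is
what typically becomes the DMB or lwsync:

    if (atomic_load_explicit(&a, memory_order_relaxed)) {
        /* Orders the load of "a" before all later memory accesses,
           and pays the barrier cost only on this path. */
        atomic_thread_fence(memory_order_acquire);
        return b;
    }
    return -1;

Per the fence rules, the acquire fence synchronizes with the release
store once the relaxed load has read the stored value.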

> And if the C standards body has said that control dependencies break
> the read ordering, then I really think that the C standards committee
> has screwed up.

The control dependencies don't break the read ordering, but rather, they
are insufficient to preserve the read ordering.

> If the consumer of an atomic load isn't a pointer chasing operation,
> then the consume should be defined to be the same as acquire. None of
> this "conditionals break consumers". No, conditionals on the
> dependency path should turn consumers into acquire, because otherwise
> the "consume" load is dangerous as hell.

Well, all the compilers currently convert consume to acquire, so you have
your wish there.  Of course, that also means that they generate actual
unneeded memory-barrier instructions, which seems extremely sub-optimal
to me.

> And if the definition of acquire doesn't include the control
> dependency either, then the C atomic memory model is just completely
> and utterly broken, since the above *trivial* and clearly useful
> example is broken.

The definition of acquire makes the ordering happen whether or not
there is a control dependency.

> I really think the above example is pretty damn black-and-white.
> Either it works, or the standard isn't worth wiping your ass with.

Seems to me that your gripe is with ARM's and PowerPC's weak memory
ordering rather than the standard.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  8:30                                                                           ` Paul E. McKenney
@ 2014-02-20  9:20                                                                             ` Paul E. McKenney
  2014-02-20 17:01                                                                             ` Linus Torvalds
  2014-02-20 17:54                                                                             ` Torvald Riegel
  2 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-20  9:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 20, 2014 at 12:30:32AM -0800, Paul E. McKenney wrote:
> On Wed, Feb 19, 2014 at 08:43:14PM -0800, Linus Torvalds wrote:

[ . . . ]

> So, if you make one of two changes to your example, then I will agree
> with you.  The first change is to have a real data dependency between
> the read of "a" and the second read:
> 
>  - Initial state:
> 
>    a = &c; b = 0; c = 0;
> 
>  - Thread 1 ("consumer"):
> 
>     if (atomic_read(&a, consume))

And the above should be "if (atomic_read(&a, consume) != &c)".  Sigh!!!

							Thanx, Paul

>          return *a;
>     /* not yet initialized */
>     return -1;
> 
>  - Thread 2 ("initializer"):
> 
>      b = some_value_lets_say_42;
>      /* We are now ready to party */
>      atomic_write(&a, &b, release);


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  8:30                                                                           ` Paul E. McKenney
  2014-02-20  9:20                                                                             ` Paul E. McKenney
@ 2014-02-20 17:01                                                                             ` Linus Torvalds
  2014-02-20 18:11                                                                               ` Paul E. McKenney
                                                                                                 ` (2 more replies)
  2014-02-20 17:54                                                                             ` Torvald Riegel
  2 siblings, 3 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-20 17:01 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 20, 2014 at 12:30 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>>
>> So let's make this really simple: if you have a consume->cmp->read, is
>> the ordering of the two reads guaranteed?
>
> Not as far as I know.  Also, as far as I know, there is no difference
> between consume and relaxed in the consume->cmp->read case.

Ok, quite frankly, I think that means that "consume" is misdesigned.

> The above example can have a return value of 0 if translated
> straightforwardly into either ARM or Power, right?

Correct. And I think that is too subtle. It's dangerous, it makes code
that *looks* correct work incorrectly, and it actually happens to work
on x86 since x86 doesn't have crap-for-brains memory ordering
semantics.

> So, if you make one of two changes to your example, then I will agree
> with you.

No. We're not playing games here. I'm fed up with complex examples
that make no sense.

Nobody sane writes code that does that pointer comparison, and it is
entirely immaterial what the compiler can do behind our backs. The C
standard semantics need to make sense to the *user* (ie programmer),
not to a CPU and not to a compiler. The CPU and compiler are "tools".
They don't matter. Their only job is to make the code *work*, dammit.

So no idiotic made-up examples that involve code that nobody will ever
write and that have subtle issues.

So the starting point is that (same example as before, but with even
clearer naming):

Initialization state:
  initialized = 0;
  value = 0;

Consumer:

    return atomic_read(&initialized, consume) ? value : -1;

Writer:
    value = 42;
    atomic_write(&initialized, 1, release);

and because the C memory ordering standard is written in such a way
that this is subtly buggy (and can return 0, which is *not* logically
a valid value), then I think the C memory ordering standard is broken.

That "consumer" memory ordering is dangerous as hell, and it is
dangerous FOR NO GOOD REASON.

The trivial "fix" to the standard would be to get rid of all the
"carries a dependency" crap, and just say that *anything* that depends
on it is ordered wrt it.

That just means that on alpha, "consume" implies an unconditional read
barrier (well, unless the value is never used and is loaded just
because it is also volatile), on x86, "consume" is the same as
"acquire" which is just a plain load with ordering guarantees, and on
ARM or power you can still avoid the extra synchronization *if* the
value is used just for computation and for following pointers, but if
the value is used for a comparison, there needs to be a
synchronization barrier.

Notice? Getting rid of the stupid "carries-dependency" crap from the
standard actually
 (a) simplifies the standard
 (b) means that the above obvious example *works*
 (c) does not in *any* way make for any less efficient code generation
for the cases that "consume" works correctly for in the current
mis-designed standard.
 (d) is actually a hell of a lot easier to explain to a compiler
writer, and I can guarantee that it is simpler to implement too.

Why do I claim (d) "it is simpler to implement" - because on ARM/power
you can implement it *exactly* as a special "acquire", with just a
trivial peep-hole special case that follows the use chain of the
acquire op to the consume, and then just drop the acquire bit if the
only use is that compute-to-load chain.

In fact, realistically, the *only* thing you need to actually care
about for the intended use case of "consume" is the question "is the
consuming load immediately consumed as an address (with offset) of a
memory operation?". So you don't even need to follow any complicated
computation chain in a compiler - the only case that matters for the
barrier removal optimization is the "oh, I can see that it is only
used as an address to a dereference".
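
As a very rough sketch of that peephole -- every type and helper here
is a made-up compiler-IR stand-in, not any real GCC or LLVM interface:

    /* Demote an acquire load to a plain load when its result is only
       ever consumed as the address (plus constant offset) of a later
       memory access. */
    static bool only_feeds_addresses(struct ir_value *v)
    {
        for (struct ir_use *u = v->uses; u; u = u->next) {
            struct ir_insn *i = u->user;

            if (insn_is_memop(i) && u == insn_address_use(i))
                continue;               /* used as an address: fine */
            if (insn_is_add_const(i) && only_feeds_addresses(i->result))
                continue;               /* offset arithmetic: keep walking */
            return false;               /* compare, branch, etc.: bail */
        }
        return true;
    }

    if (load->order == ORDER_ACQUIRE && only_feeds_addresses(load->result))
        load->order = ORDER_RELAXED;    /* dependency ordering suffices */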

Seriously. The current standard is broken. It's broken because it
mis-compiles code that on the face of it looks logical and works, it's
broken because it's overly complex, and it's broken because the
complexity doesn't even *buy* you anything. All this complexity for no
reason. When much simpler wording and implementation actually WORKS
BETTER.

                  Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  0:53                                                                     ` Linus Torvalds
  2014-02-20  4:01                                                                       ` Paul E. McKenney
@ 2014-02-20 17:14                                                                       ` Torvald Riegel
  2014-02-20 17:34                                                                         ` Linus Torvalds
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-20 17:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc,
	Mark Batty, Peter Sewell

On Wed, 2014-02-19 at 16:53 -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> >>
> >> Can you point to it? Because I can find a draft standard, and it sure
> >> as hell does *not* contain any clarity of the model. It has a *lot* of
> >> verbiage, but it's pretty much impossible to actually understand, even
> >> for somebody who really understands memory ordering.
> >
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > This has an explanation of the model up front, and then the detailed
> > formulae in Section 6.  This is from 2010, and there might have been
> > smaller changes since then, but I'm not aware of any bigger ones.
> 
> Ahh, this is different from what others pointed at. Same people,
> similar name, but not the same paper.
> 
> I will read this version too, but from reading the other one and the
> standard in parallel and trying to make sense of it, it seems that I
> may have originally misunderstood part of the whole control dependency
> chain.
> 
> The fact that the left side of "? :", "&&" and "||" breaks data
> dependencies made me originally think that the standard tried very
> hard to break any control dependencies.

Hmm... (I'm not quite sure how to express this properly, so bear with
me.)

The control dependencies as you seem to think of them aren't there *as is*,
and removing "? :", "&&", and "||" from what's called
Carries-a-dependency-to in n3132 is basically removing control
dependences in expressions from that relation (ie, so it's just data
dependences).

However, control dependences in the program are not ignored, because
they control what steps the abstract machine would do.  If we have
candidate executions with a particular choice for the reads-from
relation, they must make sense in terms of program logic (which does
include conditionals and such), or a candidate execution will not be
consistent.  The difference is in how reads-from is constrained by
different memory orders.

I hope what I mean gets clearer below when looking at the example...

Mark Batty / Peter Sewell:  Do you have a better explanation?  (And
please correct anything wrong in what I wrote ;)

> Which I felt was insane, when
> then some of the examples literally were about the testing of the
> value of an atomic read. The data dependency matters quite a bit. The
> fact that the other "Mathematical" paper then very much talked about
> consume only in the sense of following a pointer made me think so even
> more.
> 
> But reading it some more, I now think that the whole "data dependency"
> logic (which is where the special left-hand side rule of the ternary
> and logical operators come in) are basically an exception to the rule
> that sequence points end up being also meaningful for ordering (ok, so
> C11 seems to have renamed "sequence points" to "sequenced before").

I'm not quite sure what you mean, but I think that this isn't quite the
case.  mo_consume and mo_acquire combine with sequenced-before
differently.  Have a look at the definition of
inter-thread-happens-before (6.15 in n3132): for synchronizes-with, the
composition with sequenced-before is used (ie, what you'd get with
mo_acquire), whereas for mo_consume, we just have
dependency_ordered_before (so, data dependencies only).

> So while an expression like
> 
>     atomic_read(p, consume) ? a : b;
> 
> doesn't have a data dependency from the atomic read that forces
> serialization, writing
> 
>    if (atomic_read(p, consume))
>       a;
>    else
>       b;
> 
> the standard *does* imply that the atomic read is "happens-before" wrt
> "a", and I'm hoping that there is no question that the control
> dependency still acts as an ordering point.

I don't think it does for mo_consume (but does for mo_acquire).  Both
examples do basically the same in the abstract machine, and I think it
would be surprising if rewriting the code into something seemingly
equivalent made a semantic difference.  AFAICT, "? :" etc. is
removed to allow the standard to define data dependencies as being based
on expressions that take the mo_consume value as input, without also
pulling in control dependencies.

To look at this example in the cppmem tool, we can transform your
example above into:

int main() {
  atomic_int x = 0; 
  int y = 0;
  {{{ { y = 1; x.store(1, memory_order_release); }
  ||| { if (x.load(memory_order_consume)) r2 = y; }
  }}};
  return 0; }

This has 2 threads, one is doing the store to y (the data) and the
release operation, the other is using consume to try to read the data.

This yields 2 consistent executions (we can read 2 values of x), but the
execution in which we read y has a data race.

I'm not quite sure how the tool parses/models data dependencies in
expressions, but the program (ie, your example) doesn't carry a
dependency from the mo_consume load to the load from y (as defined in
C++ 1.10p9 and n3132 6.13).  Thus, the mo_release store is
dependency-ordered-before the mo_consume load ("dob" in the tool's
output graph), but that doesn't extend to the load from y.  Therefore,
the load is unordered wrt. the first thread's "y = 1", and because
they are nonatomic, we get a data race (labeled "dr").  (The labels on
the edges are easier to see in the "dot" Display Layout, even though one
has to then figure out which vertex belongs to which thread.)

If you enable it with the display options, the tool will show a control
dependency, but it seems that this is currently unused (see 6.20 in n3132).

If we change mo_consume to mo_acquire in the example, the race goes
away.  (This is visible in the inter-thread-happens-before relation,
which can be shown by "removing suppress_edge ithb." under "edit display
options" (but this makes the graph kind of messy)).

BTW, a pretty similar way to expressing this cppmem code above is using
readsvalue() to constrain executions:

int main() {
  atomic_int x = 0; 
  int y = 0;
  {{{ { y = 1; x.store(1, memory_order_release); }
  ||| { r1 = x.load(memory_order_consume).readsvalue(1); r2 = y; }
  }}};
  return 0; }

> THAT was one of my big confusions, the discussion about control
> dependencies and the fact that the logical ops broke the data
> dependency made me believe that the standard tried to actively avoid
> the whole issue with "control dependencies can break ordering
> dependencies on some CPU's due to branch prediction and memory
> re-ordering by the CPU".
> 
> But after all the reading, I'm starting to think that that was never
> actually the implication at all, and the "logical ops breaks the data
> dependency rule" is simply an exception to the sequence point rule.
> All other sequence points still do exist, and do imply an ordering
> that matters for "consume"
> 
> Am I now reading it right?
> 
> So the clarification is basically to the statement that the "if
> (consume(p)) a" version *would* have an ordering guarantee between the
> read of "p" and "a", but the "consume(p) ? a : b" would *not* have
> such an ordering guarantee. Yes?

Not as I understand it.  If my reply above wasn't clear, let me know and
I'll try to rephrase it into something that is.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  4:01                                                                       ` Paul E. McKenney
  2014-02-20  4:43                                                                         ` Linus Torvalds
@ 2014-02-20 17:26                                                                         ` Torvald Riegel
  2014-02-20 18:18                                                                           ` Paul E. McKenney
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-20 17:26 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote:
> On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote:
> > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> > >>
> > >> Can you point to it? Because I can find a draft standard, and it sure
> > >> as hell does *not* contain any clarity of the model. It has a *lot* of
> > >> verbiage, but it's pretty much impossible to actually understand, even
> > >> for somebody who really understands memory ordering.
> > >
> > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > This has an explanation of the model up front, and then the detailed
> > > formulae in Section 6.  This is from 2010, and there might have been
> > > smaller changes since then, but I'm not aware of any bigger ones.
> > 
> > Ahh, this is different from what others pointed at. Same people,
> > similar name, but not the same paper.
> > 
> > I will read this version too, but from reading the other one and the
> > standard in parallel and trying to make sense of it, it seems that I
> > may have originally misunderstood part of the whole control dependency
> > chain.
> > 
> > The fact that the left side of "? :", "&&" and "||" breaks data
> > dependencies made me originally think that the standard tried very
> > hard to break any control dependencies. Which I felt was insane, when
> > then some of the examples literally were about the testing of the
> > value of an atomic read. The data dependency matters quite a bit. The
> > fact that the other "Mathematical" paper then very much talked about
> > consume only in the sense of following a pointer made me think so even
> > more.
> > 
> > But reading it some more, I now think that the whole "data dependency"
> > logic (which is where the special left-hand side rule of the ternary
> > and logical operators come in) are basically an exception to the rule
> > that sequence points end up being also meaningful for ordering (ok, so
> > C11 seems to have renamed "sequence points" to "sequenced before").
> > 
> > So while an expression like
> > 
> >     atomic_read(p, consume) ? a : b;
> > 
> > doesn't have a data dependency from the atomic read that forces
> > serialization, writing
> > 
> >    if (atomic_read(p, consume))
> >       a;
> >    else
> >       b;
> > 
> > the standard *does* imply that the atomic read is "happens-before" wrt
> > "a", and I'm hoping that there is no question that the control
> > dependency still acts as an ordering point.
> 
> The control dependency should order subsequent stores, at least assuming
> that "a" and "b" don't start off with identical stores that the compiler
> could pull out of the "if" and merge.  The same might also be true for ?:
> for all I know.  (But see below)

I don't think this is quite true.  I agree that a conditional store will
not be executed speculatively (note that if it happens in both the
then and the else branch, it's not conditional); so, the store in
"a;" (assuming it would be a store) won't happen unless the thread can
really observe a true value for p.  However, this is *this thread's*
view of the world, but not guaranteed to constrain how any other thread
sees the state.  mo_consume does not contribute to
inter-thread-happens-before in the same way that mo_acquire does (which
*does* put a constraint on i-t-h-b, and thus enforces a global
constraint that all threads have to respect).

Is it clear which distinction I'm trying to show here?

> That said, in this case, you could substitute relaxed for consume and get
> the same effect.  The return value from atomic_read() gets absorbed into
> the "if" condition, so there is no dependency-ordered-before relationship,
> so nothing for consume to do.
> 
> One caution...  The happens-before relationship requires you to trace a
> full path between the two operations of interest.  This is illustrated
> by the following example, with both x and y initially zero:
> 
> T1:	atomic_store_explicit(&x, 1, memory_order_relaxed);
> 	r1 = atomic_load_explicit(&y, memory_order_relaxed);
> 
> T2:	atomic_store_explicit(&y, 1, memory_order_relaxed);
> 	r2 = atomic_load_explicit(&x, memory_order_relaxed);
> 
> There is a happens-before relationship between T1's load and store,
> and another happens-before relationship between T2's load and store,
> but there is no happens-before relationship from T1 to T2, and none
> in the other direction, either.  And you don't get to assume any
> ordering based on reasoning about these two disjoint happens-before
> relationships.
> 
> > So it is quite possible for r1==0&&r2==0 after both threads complete.
> 
> Which should be no surprise: This misordering can happen even on x86,
> which would need a full smp_mb() to prevent it.
> 
> > THAT was one of my big confusions, the discussion about control
> > dependencies and the fact that the logical ops broke the data
> > dependency made me believe that the standard tried to actively avoid
> > the whole issue with "control dependencies can break ordering
> > dependencies on some CPU's due to branch prediction and memory
> > re-ordering by the CPU".
> > 
> > But after all the reading, I'm starting to think that that was never
> > actually the implication at all, and the "logical ops breaks the data
> > dependency rule" is simply an exception to the sequence point rule.
> > All other sequence points still do exist, and do imply an ordering
> > that matters for "consume"
> > 
> > Am I now reading it right?
> 
> As long as there is an unbroken chain of -data- dependencies from the
> consume to the later access in question, and as long as that chain
> doesn't go through the excluded operations, yes.
> 
> > So the clarification is basically to the statement that the "if
> > (consume(p)) a" version *would* have an ordering guarantee between the
> > read of "p" and "a", but the "consume(p) ? a : b" would *not* have
> > such an ordering guarantee. Yes?
> 
> Neither has a data-dependency guarantee, because there is no data
> dependency from the load to either "a" or "b".  After all, the value
> loaded got absorbed into the "if" condition.

Agreed.

> However, according to
> discussions earlier in this thread, the "if" variant would have a
> control-dependency ordering guarantee for any stores in "a" and "b"
> (but not loads!).  The ?: form might also have a control-dependency
> guarantee for any stores in "a" and "b" (again, not loads).

Don't quite agree; see above for my opinion on this.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 17:14                                                                       ` Torvald Riegel
@ 2014-02-20 17:34                                                                         ` Linus Torvalds
  2014-02-20 18:12                                                                           ` Torvald Riegel
  2014-02-20 18:26                                                                           ` Paul E. McKenney
  0 siblings, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-20 17:34 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc,
	Mark Batty, Peter Sewell

On Thu, Feb 20, 2014 at 9:14 AM, Torvald Riegel <triegel@redhat.com> wrote:
>>
>> So the clarification is basically to the statement that the "if
>> (consume(p)) a" version *would* have an ordering guarantee between the
>> read of "p" and "a", but the "consume(p) ? a : b" would *not* have
>> such an ordering guarantee. Yes?
>
> Not as I understand it.  If my reply above wasn't clear, let me know and
> I'll try to rephrase it into something that is.

Yeah, so you and Paul agree. And as I mentioned in the email that
crossed with yours, I think that means that the standard is overly
complex, hard to understand, fragile, and all *pointlessly* so.

Btw, there are many ways that "use a consume as an input to a
conditional" can happen. In particular, even if the result is actually
*used* like a pointer as far as the programmer is concerned, tricks
like pointer compression etc can well mean that the "pointer" is
actually at least partly implemented using conditionals, so that some
paths end up being only dependent through a comparison of the pointer
value.

So I very much did *not* want to complicate the "litmus test" code
snippet when Paul tried to make it more complex, but I do think that
there are cases where code that "looks" like pure pointer chasing
actually is not for some cases, and then can become basically that
litmus test for some path.

Just to give you an example: in simple list handling it is not at all
unusual to have a special node that is discovered by comparing the
address, not by just loading from the pointer and following the list
itself. Examples of that would be a HEAD node in a doubly linked list
(Linux uses this concept quite widely, it's our default list
implementation), or it could be a place-marker ("cursor" entry) in the
list for safe traversal in the presence of concurrent deletes etc.
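
A minimal sketch of that pattern (simplified list, and the consume
load written in the same pseudo-API as the earlier examples):

    struct node { struct node *next; int payload; };
    struct node head;           /* sentinel: compared against, never
                                   dereferenced for a payload */

    int first_payload(void)
    {
        struct node *p = atomic_read(&head.next, consume);

        if (p == &head)         /* found by address comparison,  */
            return -1;          /* not by following the pointer  */
        return p->payload;      /* the pointer-chasing path      */
    }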

And obviously there is the already much earlier mentioned
compiler-induced compare, due to value speculation, that can basically
create such sequences even where they did not originally exist in the
source code itself.

So even if you work with "pointer dereferences", and thus match that
particular consume pattern, I really don't see why anybody would think
that "hey, we can ignore any control dependencies" is a good idea.
It's a *TERRIBLE* idea.

And as mentioned, it's a terrible idea with no upsides. It doesn't
help compiler optimizations for the case it's *intended* to help,
since those optimizations can still be done without the horribly
broken semantics. It doesn't help compiler writers, it just confuses
them.

And it sure as hell doesn't help users.

                  Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  4:43                                                                         ` Linus Torvalds
  2014-02-20  8:30                                                                           ` Paul E. McKenney
@ 2014-02-20 17:49                                                                           ` Torvald Riegel
  2014-02-20 18:25                                                                             ` Linus Torvalds
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-20 17:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Wed, 2014-02-19 at 20:43 -0800, Linus Torvalds wrote:

[Paul has already answered many of your questions, and my reply to your
previous email should also answer some.]

> If the consumer of an atomic load isn't a pointer chasing operation,
> then the consume should be defined to be the same as acquire. None of
> this "conditionals break consumers". No, conditionals on the
> dependency path should turn consumers into acquire, because otherwise
> the "consume" load is dangerous as hell.

Yes, mo_consume is more tricky than mo_acquire.

However, that has an advantage because you can avoid getting stronger
barriers if you don't need them (ie, you can avoid the "auto-upgrade to
acquire" you seem to have in mind).

The auto-upgrade would be a possible semantics, I agree.

Another option may be to let an implementation optimize the HW barriers
that it uses for mo_acquire.  That is, if the compiler sees that (1) the
result of an mo_acquire load is used on certain code paths *only* for
consumers that carry the dependency *and* (2) there are no other
operations on that code path that can possibly rely on the mo_acquire
ordering guarantees, then the compiler can use a weaker HW barrier on
archs such as PowerPC or ARM.

That is similar to the rules for mo_consume you seem to have in mind,
but makes it a compiler optimization on mo_acquire.  However, the
compiler has to be conservative here, so having a mo_consume that is
trickier to use but doesn't ever silently introduce stronger HW barriers
seems to be useful (modulo the difficulties regarding how it's
currently specified in the standard).

> And if the definition of acquire doesn't include the control
> dependency either, then the C atomic memory model is just completely
> and utterly broken, since the above *trivial* and clearly useful
> example is broken.

In terms of the model, if you establish a synchronizes-with using a
reads-from that has (or crosses) a release/acquire memory-order pair,
then this synchronizes-with will also order other operations that it's
sequenced-before (see the composition of synchronizes-with and
sequenced-before in the inter-thread-happens-before definition in n3132
6.15).

So yes, mo_acquire does take the logical "control dependencies" /
sequenced-before into account.
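
Spelled out informally for the simple initialized/value example (using
the n3132 6.15 composition; W and R denote the writer's and the
consumer's accesses):

    W(value = 42)       sequenced-before    W_rel(initialized = 1)
    W_rel(initialized)  synchronizes-with   R_acq(initialized)
    R_acq(initialized)  sequenced-before    R(value)

    => W(value = 42)  inter-thread-happens-before  R(value)

With mo_consume, the middle edge is only dependency-ordered-before,
which does not compose with sequenced-before on the right, so the
chain never reaches R(value).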


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  8:30                                                                           ` Paul E. McKenney
  2014-02-20  9:20                                                                             ` Paul E. McKenney
  2014-02-20 17:01                                                                             ` Linus Torvalds
@ 2014-02-20 17:54                                                                             ` Torvald Riegel
  2014-02-20 18:11                                                                               ` Paul E. McKenney
  2 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-20 17:54 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, 2014-02-20 at 00:30 -0800, Paul E. McKenney wrote:
> Well, all the compilers currently convert consume to acquire, so you have
> your wish there.  Of course, that also means that they generate actual
> unneeded memory-barrier instructions, which seems extremely sub-optimal
> to me.

GCC doesn't currently, but it also doesn't seem to track the
dependencies, but that's a bug:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 17:01                                                                             ` Linus Torvalds
@ 2014-02-20 18:11                                                                               ` Paul E. McKenney
  2014-02-20 18:32                                                                                 ` Linus Torvalds
  2014-02-20 18:44                                                                                 ` Torvald Riegel
  2014-02-20 18:23                                                                               ` Torvald Riegel
       [not found]                                                                               ` <CAHWkzRQZ8+gOGMFNyTKjFNzpUv6d_J1G9KL0x_iCa=YCgvEojQ@mail.gmail.com>
  2 siblings, 2 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-20 18:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 20, 2014 at 09:01:06AM -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 12:30 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >>
> >> So let's make this really simple: if you have a consume->cmp->read, is
> >> the ordering of the two reads guaranteed?
> >
> > Not as far as I know.  Also, as far as I know, there is no difference
> > between consume and relaxed in the consume->cmp->read case.
> 
> Ok, quite frankly, I think that means that "consume" is misdesigned.
> 
> > The above example can have a return value of 0 if translated
> > straightforwardly into either ARM or Power, right?
> 
> Correct. And I think that is too subtle. It's dangerous, it makes code
> that *looks* correct work incorrectly, and it actually happens to work
> on x86 since x86 doesn't have crap-for-brains memory ordering
> semantics.
> 
> > So, if you make one of two changes to your example, then I will agree
> > with you.
> 
> No. We're not playing games here. I'm fed up with complex examples
> that make no sense.

Hey, your original example didn't do what you wanted given the current
standard.  Those two modified examples are no more complex than your
original, and are the closest approximations that I can come up with
right off-hand that provide the result you wanted.

> Nobody sane writes code that does that pointer comparison, and it is
> entirely immaterial what the compiler can do behind our backs. The C
> standard semantics need to make sense to the *user* (ie programmer),
> not to a CPU and not to a compiler. The CPU and compiler are "tools".
> They don't matter. Their only job is to make the code *work*, dammit.
> 
> So no idiotic made-up examples that involve code that nobody will ever
> write and that have subtle issues.

There are places in the Linux kernel that do both pointer comparisons
and dereferences.  Something like the following:

	p = rcu_dereference(gp);
	if (!p)
		p = &default_structure;
	do_something_with(p->a, p->b);

In the typical case where default_structure was initialized early on,
no ordering is needed in the !p case.
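
In C11 terms, approximating rcu_dereference() with a consume load (an
illustrative sketch; the declarations are invented for the example):

    #include <stdatomic.h>

    struct foo { int a; int b; };
    extern struct foo default_structure;        /* initialized early on */
    extern _Atomic(struct foo *) gp;
    extern void do_something_with(int a, int b);

    void reader(void)
    {
        struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

        if (!p)
            p = &default_structure;     /* no ordering needed here */
        do_something_with(p->a, p->b);  /* the non-NULL case is ordered
                                           by the data dependency via p */
    }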

> So the starting point is that (same example as before, but with even
> clearer naming):
> 
> Initialization state:
>   initialized = 0;
>   value = 0;
> 
> Consumer:
> 
>     return atomic_read(&initialized, consume) ? value : -1;
> 
> Writer:
>     value = 42;
>     atomic_write(&initialized, 1, release);
> 
> and because the C memory ordering standard is written in such a way
> that this is subtly buggy (and can return 0, which is *not* logically
> a valid value), then I think the C memory ordering standard is broken.

You really need that "consume" to be "acquire".

> That "consumer" memory ordering is dangerous as hell, and it is
> dangerous FOR NO GOOD REASON.
> 
> The trivial "fix" to the standard would be to get rid of all the
> "carries a dependency" crap, and just say that *anything* that depends
> on it is ordered wrt it.
> 
> That just means that on alpha, "consume" implies an unconditional read
> barrier (well, unless the value is never used and is loaded just
> because it is also volatile), on x86, "consume" is the same as
> "acquire" which is just a plain load with ordering guarantees, and on
> ARM or power you can still avoid the extra synchronization *if* the
> value is used just for computation and for following pointers, but if
> the value is used for a comparison, there needs to be a
> synchronization barrier.

Except in the default-value case, where no barrier is required.

People really do comparisons on the pointers that they get back from
rcu_dereference(), either to check for NULL (as above) or to check for
a specific pointer.  Ordering is not required in those cases.  The
cases that do require ordering from memory_order_consume values being
used in an "if" statement should instead be memory_order_acquire.

> Notice? Getting rid of the stupid "carries-dependency" crap from the
> standard actually
>  (a) simplifies the standard
>  (b) means that the above obvious example *works*
>  (c) does not in *any* way make for any less efficient code generation
> for the cases that "consume" works correctly for in the current
> mis-designed standard.
>  (d) is actually a hell of a lot easier to explain to a compiler
> writer, and I can guarantee that it is simpler to implement too.
> 
> Why do I claim (d) "it is simpler to implement" - because on ARM/power
> you can implement it *exactly* as a special "acquire", with just a
> trivial peep-hole special case that follows the use chain of the
> acquire op to the consume, and then just drops the acquire bit if the
> only use is that compute-to-load chain.

I don't believe that it is quite that simple.

But yes, the compiler guys would be extremely happy to simply drop
memory_order_consume from the standard, as it is the memory order
that they most love to hate.

Getting them to agree to any sort of peep-hole optimization semantics
for memory_order_consume is likely problematic.  

> In fact, realistically, the *only* thing you need to actually care
> about for the intended use case of "consume" is the question "is the
> consuming load immediately consumed as an address (with offset) of a
> memory operation?". So you don't even need to follow any complicated
> computation chain in a compiler - the only case that matters for the
> barrier removal optimization is the "oh, I can see that it is only
> used as an address to a dereference".
> 
> Seriously. The current standard is broken. It's broken because it
> mis-compiles code that on the face of it looks logical and works, it's
> broken because it's overly complex, and it's broken because the
> complexity doesn't even *buy* you anything. All this complexity for no
> reason. When much simpler wording and implementation actually WORKS
> BETTER.

I disagree.  The use cases you claim are memory_order_consume breakage
are really bugs.  You should use memory_order_acquire in those cases.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 17:54                                                                             ` Torvald Riegel
@ 2014-02-20 18:11                                                                               ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-20 18:11 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 20, 2014 at 06:54:06PM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-20 at 00:30 -0800, Paul E. McKenney wrote:
> > Well, all the compilers currently convert consume to acquire, so you have
> > your wish there.  Of course, that also means that they generate actual
> > unneeded memory-barrier instructions, which seems extremely sub-optimal
> > to me.
> 
> GCC doesn't currently, but it also doesn't seem to track the
> dependencies, but that's a bug:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448

Ah, cool!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 17:34                                                                         ` Linus Torvalds
@ 2014-02-20 18:12                                                                           ` Torvald Riegel
  2014-02-20 18:26                                                                           ` Paul E. McKenney
  1 sibling, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-20 18:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc,
	Mark Batty, Peter Sewell

On Thu, 2014-02-20 at 09:34 -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 9:14 AM, Torvald Riegel <triegel@redhat.com> wrote:
> >>
> >> So the clarification is basically to the statement that the "if
> >> (consume(p)) a" version *would* have an ordering guarantee between the
> >> read of "p" and "a", but the "consume(p) ? a : b" would *not* have
> >> such an ordering guarantee. Yes?
> >
> > Not as I understand it.  If my reply above wasn't clear, let me know and
> > I'll try to rephrase it into something that is.
> 
> Yeah, so you and Paul agree. And as I mentioned in the email that
> crossed with yours, I think that means that the standard is overly
> complex, hard to understand, fragile, and all *pointlessly* so.

Let's step back a little here and distinguish different things:

1) AFAICT, mo_acquire provides all the ordering guarantees you want.
Thus, I suggest focusing on mo_acquire for now, especially when reading
the model.  Are you fine with the mo_acquire semantics?

2) mo_acquire is independent of carries-a-dependency and similar
definitions/relations, so you can ignore all those to reason about
mo_acquire.

3) Do you have concerns over the runtime costs of barriers with
mo_acquire semantics?  If so, that's a valid discussion point, and we
can certainly dig deeper into this topic to see how we can possibly use
weaker HW barriers by exploiting things the compiler sees and can
potentially preserve (e.g., control dependencies).  There might be some
stuff the compiler can do without needing further input from the
programmer.

4) mo_consume is kind of the difficult special-case variant of
mo_acquire.  We should discuss it (including whether it's helpful)
separately from the memory model, because it's not essential.

> Btw, there are many ways that "use a consume as an input to a
> conditional" can happen. In particular, even if the result is actually
> *used* like a pointer as far as the programmer is concerned, tricks
> like pointer compression etc can well mean that the "pointer" is
> actually at least partly implemented using conditionals, so that some
> paths end up being only dependent through a comparison of the pointer
> value.

AFAIU, this is similar to my concerns about how a compiler can
reasonably implement the ordering guarantees: The mo_consume value may
be used like a pointer in source code, but how this looks in the
generated code may be different after reasonable transformations (including
transformations to control dependencies, so the compiler would have to
avoid those).  Or did I misunderstand?

> So I very much did *not* want to complicate the "litmus test" code
> snippet when Paul tried to make it more complex, but I do think that
> there are cases where code that "looks" like pure pointer chasing
> actually is not for some cases, and then can become basically that
> litmus test for some path.
> 
> Just to give you an example: in simple list handling it is not at all
> unusual to have a special node that is discovered by comparing the
> address, not by just loading from the pointer and following the list
> itself. Examples of that would be a HEAD node in a doubly linked list
> (Linux uses this concept quite widely, it's our default list
> implementation), or it could be a place-marker ("cursor" entry) in the
> list for safe traversal in the presence of concurrent deletes etc.
> 
> And obviously there is the already much earlier mentioned
> compiler-induced compare, due to value speculation, that can basically
> create such sequences even where they did not originally exist in the
> source code itself.
> 
> So even if you work with "pointer dereferences", and thus match that
> particular consume pattern, I really don't see why anybody would think
> that "hey, we can ignore any control dependencies" is a good idea.
> It's a *TERRIBLE* idea.
> 
> And as mentioned, it's a terrible idea with no upsides. It doesn't
> help compiler optimizations for the case it's *intended* to help,
> since those optimizations can still be done without the horribly
> broken semantics. It doesn't help compiler writers, it just confuses
> them.

I'm worried about how compilers can implement mo_consume without
prohibiting lots of optimizations on the code.  Your thoughts seem to
point in a similar direction.

I think we should continue by discussing mo_acquire first.  It has
easier semantics and allows a relatively simple implementation in
compilers (although there might be not-quite-so-simple optimizations).  

It's unfortunate we started the discussion with the tricky special case
first; maybe that's what contributed to the confusion.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 17:26                                                                         ` Torvald Riegel
@ 2014-02-20 18:18                                                                           ` Paul E. McKenney
  2014-02-22 18:30                                                                             ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-20 18:18 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 20, 2014 at 06:26:08PM +0100, Torvald Riegel wrote:
> On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote:
> > On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote:
> > > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> > > >>
> > > >> Can you point to it? Because I can find a draft standard, and it sure
> > > >> as hell does *not* contain any clarity of the model. It has a *lot* of
> > > >> verbiage, but it's pretty much impossible to actually understand, even
> > > >> for somebody who really understands memory ordering.
> > > >
> > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > > This has an explanation of the model up front, and then the detailed
> > > > formulae in Section 6.  This is from 2010, and there might have been
> > > > smaller changes since then, but I'm not aware of any bigger ones.
> > > 
> > > Ahh, this is different from what others pointed at. Same people,
> > > similar name, but not the same paper.
> > > 
> > > I will read this version too, but from reading the other one and the
> > > standard in parallel and trying to make sense of it, it seems that I
> > > may have originally misunderstood part of the whole control dependency
> > > chain.
> > > 
> > > The fact that the left side of "? :", "&&" and "||" breaks data
> > > dependencies made me originally think that the standard tried very
> > > hard to break any control dependencies. Which I felt was insane, when
> > > then some of the examples literally were about the testing of the
> > > value of an atomic read. The data dependency matters quite a bit. The
> > > fact that the other "Mathematical" paper then very much talked about
> > > consume only in the sense of following a pointer made me think so even
> > > more.
> > > 
> > > But reading it some more, I now think that the whole "data dependency"
> > > logic (which is where the special left-hand side rule of the ternary
> > > and logical operators come in) are basically an exception to the rule
> > > that sequence points end up being also meaningful for ordering (ok, so
> > > C11 seems to have renamed "sequence points" to "sequenced before").
> > > 
> > > So while an expression like
> > > 
> > >     atomic_read(p, consume) ? a : b;
> > > 
> > > doesn't have a data dependency from the atomic read that forces
> > > serialization, writing
> > > 
> > >    if (atomic_read(p, consume))
> > >       a;
> > >    else
> > >       b;
> > > 
> > > the standard *does* imply that the atomic read is "happens-before" wrt
> > > "a", and I'm hoping that there is no question that the control
> > > dependency still acts as an ordering point.
> > 
> > The control dependency should order subsequent stores, at least assuming
> > that "a" and "b" don't start off with identical stores that the compiler
> > could pull out of the "if" and merge.  The same might also be true for ?:
> > for all I know.  (But see below)
> 
> I don't think this is quite true.  I agree that a conditional store will
> not be executed speculatively (note that if it would happen in both the
> then and the else branch, it's not conditional); so, the store in
> "a;" (assuming it would be a store) won't happen unless the thread can
> really observe a true value for p.  However, this is *this thread's*
> view of the world, but not guaranteed to constrain how any other thread
> sees the state.  mo_consume does not contribute to
> inter-thread-happens-before in the same way that mo_acquire does (which
> *does* put a constraint on i-t-h-b, and thus enforces a global
> constraint that all threads have to respect).
> 
> Is it clear which distinction I'm trying to show here?

If you are saying that the control dependencies are a result of a
combination of the standard and the properties of the hardware that
Linux runs on, I am with you.  (As opposed to control dependencies being
a result solely of the standard.)

This was a deliberate decision in 2007 or so.  At that time, the
documentation on CPU memory orderings was pretty crude, and it was
not clear that all relevant hardware respected control dependencies.
Back then, if you wanted an authoritative answer even to a fairly simple
memory-ordering question, you had to find a hardware architect, and you
probably waited weeks or even months for the answer.  Thanks to lots
of work from the Cambridge guys at about the time that the standard was
finalized, we have a much better picture of what the hardware does.

> > That said, in this case, you could substitute relaxed for consume and get
> > the same effect.  The return value from atomic_read() gets absorbed into
> > the "if" condition, so there is no dependency-ordered-before relationship,
> > so nothing for consume to do.
> > 
> > One caution...  The happens-before relationship requires you to trace a
> > full path between the two operations of interest.  This is illustrated
> > by the following example, with both x and y initially zero:
> > 
> > T1:	atomic_store_explicit(&x, 1, memory_order_relaxed);
> > 	r1 = atomic_load_explicit(&y, memory_order_relaxed);
> > 
> > T2:	atomic_store_explicit(&y, 1, memory_order_relaxed);
> > 	r2 = atomic_load_explicit(&x, memory_order_relaxed);
> > 
> > There is a happens-before relationship between T1's load and store,
> > and another happens-before relationship between T2's load and store,
> > but there is no happens-before relationship from T1 to T2, and none
> > in the other direction, either.  And you don't get to assume any
> > ordering based on reasoning about these two disjoint happens-before
> > relationships.
> > 
> > So it is quite possible for r1==0&&r2==0 after both threads complete.
> > 
> > Which should be no surprise: This misordering can happen even on x86,
> > which would need a full smp_mb() to prevent it.
> > 
> > > THAT was one of my big confusions, the discussion about control
> > > dependencies and the fact that the logical ops broke the data
> > > dependency made me believe that the standard tried to actively avoid
> > > the whole issue with "control dependencies can break ordering
> > > dependencies on some CPU's due to branch prediction and memory
> > > re-ordering by the CPU".
> > > 
> > > But after all the reading, I'm starting to think that that was never
> > > actually the implication at all, and the "logical ops breaks the data
> > > dependency rule" is simply an exception to the sequence point rule.
> > > All other sequence points still do exist, and do imply an ordering
> > > that matters for "consume"
> > > 
> > > Am I now reading it right?
> > 
> > As long as there is an unbroken chain of -data- dependencies from the
> > consume to the later access in question, and as long as that chain
> > doesn't go through the excluded operations, yes.
> > 
> > > So the clarification is basically to the statement that the "if
> > > (consume(p)) a" version *would* have an ordering guarantee between the
> > > read of "p" and "a", but the "consume(p) ? a : b" would *not* have
> > > such an ordering guarantee. Yes?
> > 
> > Neither has a data-dependency guarantee, because there is no data
> > dependency from the load to either "a" or "b".  After all, the value
> > loaded got absorbed into the "if" condition.
> 
> Agreed.
> 
> > However, according to
> > discussions earlier in this thread, the "if" variant would have a
> > control-dependency ordering guarantee for any stores in "a" and "b"
> > (but not loads!).  The ?: form might also have a control-dependency
> > guarantee for any stores in "a" and "b" (again, not loads).
> 
> Don't quite agree; see above for my opinion on this.

And see above for my best guess at what your opinion is based on.  If
my guess is correct, we might even be in agreement.

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 17:01                                                                             ` Linus Torvalds
  2014-02-20 18:11                                                                               ` Paul E. McKenney
@ 2014-02-20 18:23                                                                               ` Torvald Riegel
       [not found]                                                                               ` <CAHWkzRQZ8+gOGMFNyTKjFNzpUv6d_J1G9KL0x_iCa=YCgvEojQ@mail.gmail.com>
  2 siblings, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-20 18:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, 2014-02-20 at 09:01 -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 12:30 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >>
> >> So let's make this really simple: if you have a consume->cmp->read, is
> >> the ordering of the two reads guaranteed?
> >
> > Not as far as I know.  Also, as far as I know, there is no difference
> > between consume and relaxed in the consume->cmp->read case.
> 
> Ok, quite frankly, I think that means that "consume" is misdesigned.
> 
> > The above example can have a return value of 0 if translated
> > straightforwardly into either ARM or Power, right?
> 
> Correct. And I think that is too subtle. It's dangerous, it makes code
> that *looks* correct work incorrectly, and it actually happens to work
> on x86 since x86 doesn't have crap-for-brains memory ordering
> semantics.
> 
> > So, if you make one of two changes to your example, then I will agree
> > with you.
> 
> No. We're not playing games here. I'm fed up with complex examples
> that make no sense.

Note that Paul's second suggestion for a change was to just use
mo_acquire;  that's a simple change, and the easiest option, so it
should be just fine.
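
Concretely, with the (made-up) atomic_read() wrapper from the example,
the acquire version is just:

    return atomic_read(&initialized, acquire) ? value : -1;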

> Nobody sane writes code that does that pointer comparison, and it is
> entirely immaterial what the compiler can do behind our backs. The C
> standard semantics need to make sense to the *user* (ie programmer),
> not to a CPU and not to a compiler. The CPU and compiler are "tools".
> They don't matter. Their only job is to make the code *work*, dammit.
> 
> So no idiotic made-up examples that involve code that nobody will ever
> write and that have subtle issues.
> 
> So the starting point is that (same example as before, but with even
> clearer naming):
> 
> Initialization state:
>   initialized = 0;
>   value = 0;
> 
> Consumer:
> 
>     return atomic_read(&initialized, consume) ? value : -1;
> 
> Writer:
>     value = 42;
>     atomic_write(&initialized, 1, release);
> 
> and because the C memory ordering standard is written in such a way
> that this is subtly buggy (and can return 0, which is *not* logically
> a valid value), then I think the C memory ordering standard is broken.
> 
> That "consumer" memory ordering is dangerous as hell, and it is
> dangerous FOR NO GOOD REASON.
> 
> The trivial "fix" to the standard would be to get rid of all the
> "carries a dependency" crap, and just say that *anything* that depends
> on it is ordered wrt it.
> 
> That just means that on alpha, "consume" implies an unconditional read
> barrier (well, unless the value is never used and is loaded just
> because it is also volatile); on x86, "consume" is the same as
> "acquire", which is just a plain load with ordering guarantees; and on
> ARM or power you can still avoid the extra synchronization *if* the
> value is used just for computation and for following pointers, but if
> the value is used for a comparison, there needs to be a
> synchronization barrier.
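> 
> In code terms (a sketch, using a hypothetical consume_load() helper):
> 
>     p = consume_load(&gp);
>     x = p->a;                 /* address dependency: no barrier needed */
> 
>     if (consume_load(&flag))  /* the value only feeds a branch, so on  */
>         x = shared_data;      /* ARM or power a barrier is needed here */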
> 
> Notice? Getting rid of the stupid "carries-dependency" crap from the
> standard actually
>  (a) simplifies the standard

Agreed, although it's easy to ignore the parts related to mo_consume, I
think.

>  (b) means that the above obvious example *works*
>  (c) does not in *any* way make for any less efficient code generation
> for the cases that "consume" works correctly for in the current
> mis-designed standard.
>  (d) is actually a hell of a lot easier to explain to a compiler
> writer, and I can guarantee that it is simpler to implement too.

mo_acquire is certainly easier to implement in a compiler.

> Why do I claim (d) "it is simpler to implement" - because on ARM/power
> you can implement it *exactly* as a special "acquire", with just a
> trivial peep-hole special case that follows the use chain of the
> acquire op to the consume, and then just drop the acquire bit if the
> only use is that compute-to-load chain.

That's similar to the way I saw it and described in my reply to your
other email (before getting to this email here).  It seems that this
indeed might be doable transparently in the compiler, without requiring
a special mo_acquire variant visible to programmers.

> In fact, realistically, the *only* thing you need to actually care
> about for the intended use case of "consume" is the question "is the
> consuming load immediately consumed as an address (with offset) of a
> memory operation. So you don't even need to follow any complicated
> computation chain in a compiler - the only case that matters for the
> barrier removal optimization is the "oh, I can see that it is only
> used as an address to a dereference".

To make this mo_acquire optimization apply often, a compiler might have
to try to filter out accesses that don't synchronize (e.g., so that an
access to a non-shared temporary variable doesn't prevent the
optimization).
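
For instance (a made-up sketch, again with a hypothetical
consume_load()):

    int *p = consume_load(&gp);
    int *tmp = p;             /* copy through a non-shared local       */
    int v = *tmp;             /* still effectively an address-only use */

The store to the temporary "tmp" must not make the compiler conclude
that p has a non-address use and so fall back to the full acquire.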

> Seriously. The current standard is broken.

Please, let's be precise in such statements, so that everyone actually
knows what's meant.  The rest of the memory model can be perfectly fine
even if you think that mo_consume isn't useful at all.  I think your
opinion about mo_consume is clear now (and I have concerns about it too,
FWIW).  If you see issues about all the other parts of the memory model
(or the standard), please state these separately.



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 17:49                                                                           ` Torvald Riegel
@ 2014-02-20 18:25                                                                             ` Linus Torvalds
  2014-02-20 19:02                                                                               ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-20 18:25 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, Feb 20, 2014 at 9:49 AM, Torvald Riegel <triegel@redhat.com> wrote:
>
> Yes, mo_consume is more tricky than mo_acquire.
>
> However, that has an advantage because you can avoid getting stronger
> barriers if you don't need them (ie, you can avoid the "auto-update to
> acquire" you seem to have in mind).

Oh, I agree about that part - I very much understand the reason for
"consume", and I can see how it is more relaxed than "acquire" under
many circumstances.

I just think that you actually *do* want to have "consume" even for
flag values, exactly *because* it is potentially cheaper than acquire.

In fact, I'd argue that making consume reliable in the face of control
dependencies is actually a *good* thing. It may not matter for
something like x86, where consume and acquire end up with the same
simple load, but even there it might relax instruction scheduling a
bit, since a 'consume' would have a barrier just to the *users* of the
value loaded, while 'acquire' would still have a scheduling barrier to
any subsequent operations.

So I claim that for a sequence like my example, where the reader
basically does something like

   load_atomic(&initialized, consume) ? value : -1;

the "consume" version can actually generate better code than "acquire"
- if "consume" is specified the way *I* specified it.

The way the C standard specifies it, the above code is *buggy*.
> Agreed? It's really, really subtly buggy, and I think that bug is not
> only a real danger, it is also genuinely hard to understand why.
> The bug only makes sense to people who understand how memory ordering
> and branch prediction interact.

The way *I* suggested "consume" be implemented, the above not only
works and is sensible, it actually generates possibly better code than
forcing the programmer to use the (illogical) "acquire" operation.

Why? Let me give you another - completely realistic, even if obviously
a bit made up - example:

   int return_expensive_system_value(void)
   {
        static atomic_t initialized;
        static int calculated;

        if (atomic_read(&initialized, mo_consume))
            return calculated;

        //let's say that this code opens /proc/cpuinfo and counts
number of CPU's or whatever ...
        calculated = read_value_from_system_files();
        atomic_write(&initialized, 1, mo_release);
        return calculated;
    }

and let's all agree that this is a somewhat realistic example, and we
can imagine why/how somebody would write code like this. It's
basically a very common lazy initialization pattern, you'll find this
in libraries, in kernels, in application code yadda yadda. No
argument?

Now, let's assume that it turns out that this value ends up being
really performance-critical, so the programmer makes the fast-path an
inline function, tells the compiler that "initialized" read is likely,
and generally wants the compiler to optimize it to hell and back.
Still sounds reasonable and realistic?

In other words, the *expected* code sequence for this is (on x86,
which doesn't need any barriers):

    cmpl $0, initialized
    je unlikely_out_of_line_case
    movl calculated, eax

and on ARM/power you'd see a 'sync' instruction or whatever.

> So far 'acquire' and 'consume' have exactly the same code generation on
> power or x86, so your argument can be: "Ok, so let's just use the
inconvenient and hard-to-understand 'consume' semantics that the
current standard has, and tell the programmer that he should use
'acquire' and not worry his little head about the difference because
he will never understand it anyway".

> Sure, that would be an inconvenience for programmers, but hey,
they're programming in C or C++, so they are *expected* to be manly
men or womanly women, and a little illogical inconvenience never hurt
anybody. After all, compared to the aliasing rules, that "use acquire,
not consume" rule is positively *simple*, no?

Are we all in agreement so far?

But no, the "consume" thing can actually generate better code. Trivial example:

    int my_thread_value;
    extern int magic_system_multiplier;

    my_thread_value = return_expensive_system_value();
    my_thread_value *= magic_system_multiplier;

and in the "acquire" model, the "acquire" itself means that the load
from magic_system_multiplier is now constrained by the acquire memory
ordering on "initialized".

While in my *sane* model, where you can consume things even if they
then result in control dependencies, there will still eventually be a
"sync" instruction on powerpc (because you really need one between the
load of 'initialized' and the load of 'calculated'), but the compiler
would be free to schedule the load of 'magic_system_multiplier'
earlier.

So as far as I can tell, we want the 'consume' memory ordering to
honor *all* dependencies, because
 - it's simpler
 - it's more logical
 - it's less error-prone
 - and it allows better code generation

Hmm?

              Linus


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 17:34                                                                         ` Linus Torvalds
  2014-02-20 18:12                                                                           ` Torvald Riegel
@ 2014-02-20 18:26                                                                           ` Paul E. McKenney
  1 sibling, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-20 18:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc, Mark Batty, Peter Sewell

On Thu, Feb 20, 2014 at 09:34:57AM -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 9:14 AM, Torvald Riegel <triegel@redhat.com> wrote:
> >>
> >> So the clarification is basically to the statement that the "if
> >> (consume(p)) a" version *would* have an ordering guarantee between the
> >> read of "p" and "a", but the "consume(p) ? a : b" would *not* have
> >> such an ordering guarantee. Yes?
> >
> > Not as I understand it.  If my reply above wasn't clear, let me know and
> > I'll try to rephrase it into something that is.
> 
> Yeah, so you and Paul agree. And as I mentioned in the email that
> crossed with yours, I think that means that the standard is overly
> complex, hard to understand, fragile, and all *pointlessly* so.
> 
> Btw, there are many ways that "use a consume as an input to a
> conditional" can happen. In particular, even if the result is actually
> *used* like a pointer as far as the programmer is concerned, tricks
> like pointer compression etc can well mean that the "pointer" is
> actually at least partly implemented using conditionals, so that some
> paths end up being only dependent through a comparison of the pointer
> value.
> 
> So I very much did *not* want to complicate the "litmus test" code
> snippet when Paul tried to make it more complex, but I do think that
> there are cases where code that "looks" like pure pointer chasing
> actually is not, and can then become basically that litmus test for
> some path.
> 
> Just to give you an example: in simple list handling it is not at all
> unusual to have a special node that is discovered by comparing the
> address, not by just loading from the pointer and following the list
> itself. Examples of that would be a HEAD node in a doubly linked list
> (Linux uses this concept quite widely, it's our default list
> implementation), or it could be a place-marker ("cursor" entry) in the
> list for safe traversal in the presence of concurrent deletes etc.
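> 
> A sketch of that pattern (with a hypothetical consume_load()):
> 
>     struct list_head *p = consume_load(&head->next);
>     if (p == head)          /* comparison against an address, */
>         return NULL;        /* not a dereference of p         */
>     return list_entry(p, struct foo, list);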

Yep, good example.  And also an example where the comparison does not
need to create any particular ordering.

> And obviously there is the already much earlier mentioned
> compiler-induced compare, due to value speculation, that can basically
> create such sequences even where they did not originally exist in the
> source code itself.
> 
> So even if you work with "pointer dereferences", and thus match that
> particular consume pattern, I really don't see why anybody would think
> that "hey, we can ignore any control dependencies" is a good idea.
> It's a *TERRIBLE* idea.
> 
> And as mentioned, it's a terrible idea with no upsides. It doesn't
> help compiler optimizations for the case it's *intended* to help,
> since those optimizations can still be done without the horribly
> broken semantics. It doesn't help compiler writers, it just confuses
> them.
> 
> And it sure as hell doesn't help users.

Yep.  And in 2014 we know a lot more about the hardware, so could make a
reasonable proposal.  In contrast, back in 2007 and 2008 memory ordering
was much more of a dark art, and proposals for control dependencies
therefore didn't get very far.

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 18:11                                                                               ` Paul E. McKenney
@ 2014-02-20 18:32                                                                                 ` Linus Torvalds
  2014-02-20 18:53                                                                                   ` Torvald Riegel
  2014-02-20 18:56                                                                                   ` Paul E. McKenney
  2014-02-20 18:44                                                                                 ` Torvald Riegel
  1 sibling, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-20 18:32 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> You really need that "consume" to be "acquire".

So I think we now all agree that that is what the standard is saying.

And I'm saying that that is wrong, that the standard is badly written,
and should be fixed.

Because before the standard is fixed, I claim that "consume" is
unusable. We cannot trust it. End of story.

The fact that apparently gcc is currently buggy because it got the
dependency calculations *wrong* just reinforces my point.

The gcc bug Torvald pointed at is exactly because the current C
standard is illogical unreadable CRAP. I can guarantee that what
happened is:

 - the compiler saw that the result of the read was used as the left
hand expression of the ternary "? :" operator

 - as a result, the compiler decided that there's no dependency

 - the compiler didn't think about the dependency that comes from the
result of the load *also* being used as the middle part of the ternary
expression, because it had optimized it away, despite the standard not
talking about that at all.

 - so the compiler never saw the dependency that the standard talks about

BECAUSE THE STANDARD LANGUAGE IS PURE AND UTTER SHIT.

My suggested language never had any of these problems, because *my*
suggested semantics are clear, logical, and don't have these kinds of
idiotic pit-falls.

Solution: Fix the f*cking C standard. No excuses, no explanations.
Just get it fixed.

                      Linus


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 18:11                                                                               ` Paul E. McKenney
  2014-02-20 18:32                                                                                 ` Linus Torvalds
@ 2014-02-20 18:44                                                                                 ` Torvald Riegel
  2014-02-20 18:56                                                                                   ` Paul E. McKenney
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-20 18:44 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, 2014-02-20 at 10:11 -0800, Paul E. McKenney wrote:
> But yes, the compiler guys would be extremely happy to simply drop
> memory_order_consume from the standard, as it is the memory order
> that they most love to hate.
> 
> Getting them to agree to any sort of peep-hole optimization semantics
> for memory_order_consume is likely problematic.  

I wouldn't be so pessimistic about that.  If the transformations can be
shown to be always correct in terms of the semantics specified in the
standard, and if the performance win is sufficiently large, why not?  Of
course, somebody has to volunteer to actually implement it :)



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 18:32                                                                                 ` Linus Torvalds
@ 2014-02-20 18:53                                                                                   ` Torvald Riegel
  2014-02-20 19:09                                                                                     ` Linus Torvalds
  2014-02-20 18:56                                                                                   ` Paul E. McKenney
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-20 18:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, 2014-02-20 at 10:32 -0800, Linus Torvalds wrote: 
> On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > You really need that "consume" to be "acquire".
> 
> So I think we now all agree that that is what the standard is saying.

Huh?

The standard says that there are two separate things (among many more):
mo_acquire and mo_consume.  They both influence happens-before in
different (and independent!) ways.

What Paul is saying is that *you* should have used *acquire* in that
example.

> And I'm saying that that is wrong, that the standard is badly written,
> and should be fixed.
> 
> Because before the standard is fixed, I claim that "consume" is
> unusable. We cannot trust it. End of story.

Then we still have all the rest.  Let's just ignore mo_consume for now,
and look at mo_acquire, I suggest.

> The fact that apparently gcc is currently buggy because it got the
> dependency calculations *wrong* just reinforces my point.

Well, I'm pretty sure nobody actually worked on trying to preserve the
dependencies at all.  IOW, I suspect this fell through the cracks.  We
can ask the person working on this if you really want to know.

> The gcc bug Torvald pointed at is exactly because the current C
> standard is illogical unreadable CRAP.

It's obviously logically consistent to the extent that it can be
represented by a formal specification such as the one by the Cambridge
group.  Makes sense, or not?

> I can guarantee that what
> happened is:
> 
>  - the compiler saw that the result of the read was used as the left
> hand expression of the ternary "? :" operator
> 
>  - as a result, the compiler decided that there's no dependency
> 
>  - the compiler didn't think about the dependency that comes from the
> result of the load *also* being used as the middle part of the ternary
> expression, because it had optimized it away, despite the standard not
> talking about that at all.
> 
>  - so the compiler never saw the dependency that the standard talks about
> 
> BECAUSE THE STANDARD LANGUAGE IS PURE AND UTTER SHIT.

Please, be specific.  Right now you're saying that all of it is useless.
Which is arguably not true.

> My suggested language never had any of these problems, because *my*
> suggested semantics are clear, logical, and don't have these kinds of
> idiotic pit-falls.

Have you looked at and understood the semantics of the memory model
(e.g. in the formalized form) with mo_consume and related being ignored
(ie, just ignore 6.13 and 6.14 in n3132)?




* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 18:32                                                                                 ` Linus Torvalds
  2014-02-20 18:53                                                                                   ` Torvald Riegel
@ 2014-02-20 18:56                                                                                   ` Paul E. McKenney
  2014-02-20 19:45                                                                                     ` Linus Torvalds
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-20 18:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 20, 2014 at 10:32:51AM -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > You really need that "consume" to be "acquire".
> 
> So I think we now all agree that that is what the standard is saying.
> 
> And I'm saying that that is wrong, that the standard is badly written,
> and should be fixed.
> 
> Because before the standard is fixed, I claim that "consume" is
> unusable. We cannot trust it. End of story.

We get exactly those same issues with control dependencies.

The example gcc breakage was something like this:

	i = atomic_load(idx, memory_order_consume);
	x = array[0 + i - i];

Then gcc optimized this to:

	i = atomic_load(idx, memory_order_consume);
	x = array[0];

This same issue would hit control dependencies.  You are free to argue
that this is the fault of ARM and PowerPC memory ordering, but the fact
remains that your suggested change has -exactly- the same vulnerability
as memory_order_consume currently has.
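
A control-dependency analog (hypothetical sketch) has the same problem:

	i = atomic_load(idx, memory_order_consume);
	if (i == i)		/* provably true, so the compiler can drop */
		x = array[0];	/* the branch, and with it the ordering    */
				/* that the control dependency provided    */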

> The fact that apparently gcc is currently buggy because it got the
> dependency calculations *wrong* just reinforces my point.
> 
> The gcc bug Torvald pointed at is exactly because the current C
> standard is illogical unreadable CRAP. I can guarantee that what
> happened is:
> 
>  - the compiler saw that the result of the read was used as the left
> hand expression of the ternary "? :" operator
> 
>  - as a result, the compiler decided that there's no dependency
> 
>  - the compiler didn't think about the dependency that comes from the
> result of the load *also* being used as the middle part of the ternary
> expression, because it had optimized it away, despite the standard not
> talking about that at all.
> 
>  - so the compiler never saw the dependency that the standard talks about

No, the dependency was in a cancelling arithmetic expression as shown
above, so that gcc optimized the dependency away.  Then the ordering
was lost on AARCH64.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448

> BECAUSE THE STANDARD LANGUAGE IS PURE AND UTTER SHIT.
> 
> My suggested language never had any of these problems, because *my*
> suggested semantics are clear, logical, and don't have these kinds of
> idiotic pit-falls.
> 
> Solution: Fix the f*cking C standard. No excuses, no explanations.
> Just get it fixed.

I agree that the standard needs help, but your suggested fix has the
same problems as shown in the bugzilla.

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 18:44                                                                                 ` Torvald Riegel
@ 2014-02-20 18:56                                                                                   ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-20 18:56 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 20, 2014 at 07:44:32PM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-20 at 10:11 -0800, Paul E. McKenney wrote:
> > But yes, the compiler guys would be extremely happy to simply drop
> > memory_order_consume from the standard, as it is the memory order
> > that they most love to hate.
> > 
> > Getting them to agree to any sort of peep-hole optimization semantics
> > for memory_order_consume is likely problematic.  
> 
> I wouldn't be so pessimistic about that.  If the transformations can be
> shown to be always correct in terms of the semantics specified in the
> standard, and if the performance win is sufficiently large, why not?  Of
> course, somebody has to volunteer to actually implement it :)

I guess that there is only one way to find out.  ;-)

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 18:25                                                                             ` Linus Torvalds
@ 2014-02-20 19:02                                                                               ` Linus Torvalds
  2014-02-20 19:06                                                                                 ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-20 19:02 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, Feb 20, 2014 at 10:25 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> While in my *sane* model, where you can consume things even if they
> then result in control dependencies, there will still eventually be a
> "sync" instruction on powerpc (because you really need one between the
> load of 'initialized' and the load of 'calculated'), but the compiler
> would be free to schedule the load of 'magic_system_multiplier'
> earlier.

Actually, "consume" is more interesting than that. Looking at the
bugzilla entry Torvald pointed at, it has the trick to always turn any
"consume" dependency into an address data dependency.

So another reason why you *want* to allow "consume" + "control
dependency" is that it opens up the window for many more interesting
and relevant optimizations than "acquire" does.

Again, let's take that "trivial" expression:

     return atomic_read(&initialized, consume) ? value : -1;

and the compiler can actually turn this into an interesting address
data dependency, basically using address arithmetic to turn it into
something like

    static const int neg_one = -1;  /* compiler-created read-only "-1" */
    return *(&value + (&neg_one - &value)*!atomic_read(&initialized, consume));

Of course, the *programmer* could have done that himself, but the
above is actually a *pessimization* on x86 or other strongly ordered
machines, so doing it at a source code level is actually a bad idea
(not to mention that it's horribly unreadable).

I could easily have gotten the address generation trick above wrong
(see my comment about "horribly unreadable" and no sane person doing
this at a source level), but that "complex" expression is not
necessarily at all complex for a compiler. If "value" is a static
variable, the compiler could create another read-only static variable
that contains that "-1" value, and the difference in addresses would
be a link-time constant, so it would not necessarily be all that ugly
from a code generation standpoint.

There are other ways to turn it into an address dependency, so the
whole "consume as an input to conditionals" really does seem to have
several optimization advantages (over "acquire").

Again, the way I'd expect a compiler writer to actually *do* this is
to just default to "ac

               Linus


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 19:02                                                                               ` Linus Torvalds
@ 2014-02-20 19:06                                                                                 ` Linus Torvalds
  0 siblings, 0 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-20 19:06 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, Feb 20, 2014 at 11:02 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Again, the way I'd expect a compiler writer to actually *do* this is
> to just default to "ac

Oops, pressed send by mistake too early.

I was almost done:

I'd expect a compiler to just default to "acquire" semantics, but then
have a few "obvious peephole" optimizations for cases that it
encounters and where it is easy to replace the synchronization point
with just an address dependency.
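
As a sketch (not real compiler output), on powerpc the default
expansion of the consume load would be something like

    ld      r3, 0(r_gp)     # load the shared pointer
    lwsync                  # barrier supplied by the "acquire" default
    ld      r4, 0(r3)       # dependent load

and the peephole notices that the only use of r3 is as the address of
the second load, drops the lwsync, and lets the hardware's
address-dependency ordering do the work:

    ld      r3, 0(r_gp)
    ld      r4, 0(r3)       # address dependency orders the two loads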

             Linus


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 18:53                                                                                   ` Torvald Riegel
@ 2014-02-20 19:09                                                                                     ` Linus Torvalds
  2014-02-22 18:53                                                                                       ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-20 19:09 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, Feb 20, 2014 at 10:53 AM, Torvald Riegel <triegel@redhat.com> wrote:
> On Thu, 2014-02-20 at 10:32 -0800, Linus Torvalds wrote:
>> On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> >
>> > You really need that "consume" to be "acquire".
>>
>> So I think we now all agree that that is what the standard is saying.
>
> Huh?
>
> The standard says that there are two separate things (among many more):
> mo_acquire and mo_consume.  They both influence happens-before in
> different (and independent!) ways.
>
> What Paul is saying is that *you* should have used *acquire* in that
> example.

I understand.

And I disagree. I think the standard is wrong, and what I *should* be
doing is point out the fact very loudly, and just tell people to NEVER
EVER use "consume" as long as it's not reliable and has insane
semantics.

So what I "should do" is to not accept any C11 atomics use in the
kernel. Because with the "acquire", it generates worse code than what
we already have, and with the "consume" it's shit.

See?

                  Linus


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 18:56                                                                                   ` Paul E. McKenney
@ 2014-02-20 19:45                                                                                     ` Linus Torvalds
  2014-02-20 22:10                                                                                       ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-20 19:45 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 20, 2014 at 10:56 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> The example gcc breakage was something like this:
>
>         i = atomic_load(idx, memory_order_consume);
>         x = array[0 + i - i];
>
> Then gcc optimized this to:
>
>         i = atomic_load(idx, memory_order_consume);
>         x = array[0];
>
> This same issue would hit control dependencies.  You are free to argue
> that this is the fault of ARM and PowerPC memory ordering, but the fact
> remains that your suggested change has -exactly- the same vulnerability
> as memory_order_consume currently has.

No it does not, for two reasons, first the legalistic (and bad) reason:

As I actually described it, the "consume" becomes an "acquire" by
default. If it's not used as an address to the dependent load, then
it's an acquire. The use "going away" in no way makes the acquire go
away in my simplistic model.

So the compiler would actually translate that to a load-with-acquire,
not be able to remove the acquire, and we have end of story. The
actual code generation would be that "ld + sync + ld" on powerpc, or
"ld.acq" on ARM.

Now, the reason I claim that reason was "legalistic and bad" is that
it's actually a cop-out, and if you had made the example be something
like this:

   p = atomic_load(&ptr, memory_order_consume);
   x = array[0 + p - p];
   y = p->val;

then yes, I actually think that the order of loads of 'x' and 'p' are
not enforced by the "consume". The only case that is clear is the
order of 'y' and 'p', because that is the only one that really *USES*
the value.

The "use" of "+p-p" is syntactic bullshit. It's very obvious to even a
slightly developmentally challenged hedgehog that "+p-p" doesn't have
any actual *semantic* meaning, it's purely syntactic.

And the syntactic meaning is meaningless and doesn't matter. Because I
would just get rid of the whole "dependency chain" language
ALTOGETHER.

So in fact, in my world, I would consider your example to be a
non-issue. In my world, there _is_ no "dependency chain" at a
syntactic level. In my SANE world, none of that insane crap language
exists. That language is made-up and tied to syntax exactly because it
*cannot* be tied to semantics.

In my sane world, "consume" has a much simpler meaning, and has no
legalistic syntactic meaning: only real use matters. If the value can
be optimized away, then so can the barrier, and so can the whole load.
The value isn't "consumed", so it has no meaning.

So if you write

   i = atomic_load(idx, memory_order_consume);
   x = array[0+i-i];

then in my world that "+i-i" is meaningless. It's semantic fluff, and
while my naive explanation would have left it as an acquire (because
it cannot be peep-holed away), I actually do believe that the compiler
should be obviously allowed to optimize the load away entirely since
it's meaningless, and if no use of 'i' remains, then it has no
consumer, and so there is no dependency.

Put another way: "consume" is not about getting a lock, it's about
getting a *value*. Only the dependency on the *value* matters, and if
the value is optimized away, there is no dependency.

And the value itself does not have any semantics. There's nothing
"volatile" about the use of the value that would mean that the
compiler cannot re-order it or remove it entirely. There's no barrier
"carried around" by the value per se. The barrier is between the load
and use. That's the *point* of "consume" after all.

The whole "chain of dependency" language is pointless. It's wrong.
It's complicated, it is illogical, and it causes subtle problems
exactly because it got tied to the language *syntax* rather than to
any logical use.

Don't try to re-introduce the whole issue. It was a mistake for the C
standard to talk about dependencies in the first place, exactly
because it results in these idiotic legalistic practices.

You do realize that that whole "*(q+flag-flag)" example in the
bugzilla comes from the fact that the programmer tried to *fight* the
fact that the C standard got the control dependency wrong?

In other words, the *deepest* reason for that bugzilla is that the
programmer tried to force the logical dependency by rewriting it as a
(fake, and easily optimizable) data dependency.

In *my* world, the stupid data-vs-control dependency thing goes away,
the test of the value itself is a use of it, and "*p ? *q : 0" just
does the right thing, there's no reason to do that "q+flag-flag" thing
in the first place, and if you do, the compiler *should* just ignore
your little games.

              Linus


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 19:45                                                                                     ` Linus Torvalds
@ 2014-02-20 22:10                                                                                       ` Paul E. McKenney
  2014-02-20 22:52                                                                                         ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-20 22:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 20, 2014 at 11:45:29AM -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 10:56 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > The example gcc breakage was something like this:
> >
> >         i = atomic_load(idx, memory_order_consume);
> >         x = array[0 + i - i];
> >
> > Then gcc optimized this to:
> >
> >         i = atomic_load(idx, memory_order_consume);
> >         x = array[0];
> >
> > This same issue would hit control dependencies.  You are free to argue
> > that this is the fault of ARM and PowerPC memory ordering, but the fact
> > remains that your suggested change has -exactly- the same vulnerability
> > as memory_order_consume currently has.
> 
> No it does not, for two reasons, first the legalistic (and bad) reason:
> 
> As I actually described it, the "consume" becomes an "acquire" by
> default. If it's not used as an address to the dependent load, then
> it's an acquire. The use "going away" in no way makes the acquire go
> away in my simplistic model.
> 
> So the compiler would actually translate that to a load-with-acquire,
> not be able to remove the acquire, and we have end of story. The
> actual code generation would be that "ld + sync + ld" on powerpc, or
> "ld.acq" on ARM.
> 
> Now, the reason I claim that reason was "legalistic and bad" is that
> it's actually a cop-out, and if you had made the example be something
> like this:
> 
>    p = atomic_load(&ptr, memory_order_consume);
>    x = array[0 + p - p];
>    y = p->val;
> 
> then yes, I actually think that the order of loads of 'x' and 'p' are
> not enforced by the "consume". The only case that is clear is the
> order of 'y' and 'p', because that is the only one that really *USES*
> the value.
> 
> The "use" of "+p-p" is syntactic bullshit. It's very obvious to even a
> slightly developmentally challenged hedgehog that "+p-p" doesn't have
> any actual *semantic* meaning, it's purely syntactic.
> 
> And the syntactic meaning is meaningless and doesn't matter. Because I
> would just get rid of the whole "dependency chain" language
> ALTOGETHER.
> 
> So in fact, in my world, I would consider your example to be a
> non-issue. In my world, there _is_ no "dependency chain" at a
> syntactic level. In my SANE world, none of that insane crap language
> exists. That language is made-up and tied to syntax exactly because it
> *cannot* be tied to semantics.
> 
> In my sane world, "consume" has a much simpler meaning, and has no
> legalistic syntactic meaning: only real use matters. If the value can
> be optimized away, then so can the barrier, and so can the whole load.
> The value isn't "consumed", so it has no meaning.
> 
> So if you write
> 
>    i = atomic_load(idx, memory_order_consume);
>    x = array[0+i-i];
> 
> then in my world that "+i-i" is meaningless. It's semantic fluff, and
> while my naive explanation would have left it as an acquire (because
> it cannot be peep-holed away), I actually do believe that the compiler
> should be obviously allowed to optimize the load away entirely since
> it's meaningless, and if no use of 'i' remains, then it has no
> consumer, and so there is no dependency.
> 
> Put another way: "consume" is not about getting a lock, it's about
> getting a *value*. Only the dependency on the *value* matters, and if
> the value is optimized away, there is no dependency.
> 
> And the value itself does not have any semantics. There's nothing
> "volatile" about the use of the value that would mean that the
> compiler cannot re-order it or remove it entirely. There's no barrier
> "carried around" by the value per se. The barrier is between the load
> and use. That's the *point* of "consume" after all.
> 
> The whole "chain of dependency" language is pointless. It's wrong.
> It's complicated, it is illogical, and it causes subtle problems
> exactly because it got tied to the language *syntax* rather than to
> any logical use.
> 
> Don't try to re-introduce the whole issue. It was a mistake for the C
> standard to talk about dependencies in the first place, exactly
> because it results in these idiotic legalistic practices.
> 
> You do realize that that whole "*(q+flag-flag)" example in the
> bugzilla comes from the fact that the programmer tried to *fight* the
> fact that the C standard got the control dependency wrong?
> 
> In other words, the *deepest* reason for that bugzilla is that the
> programmer tried to force the logical dependency by rewriting it as a
> (fake, and easily optimizable) data dependency.
> 
> In *my* world, the stupid data-vs-control dependency thing goes away,
> the test of the value itself is a use of it, and "*p ? *q : 0" just
> does the right thing, there's no reason to do that "q+flag-flag" thing
> in the first place, and if you do, the compiler *should* just ignore
> your little games.

Linus, given that you are calling me out for pushing "legalistic and bad"
things, "syntactic bullshit", and playing "little games", I am forced
to conclude that you have never attended any sort of standards-committee
meeting.  ;-)

That said, I am fine with pushing control/data dependencies with this
general approach.  There will be complications, but there always are
and they can be dealt with as they come up.

FWIW, the last time I tried excluding things like "f-f", "x%1", "y*0" and
so on, I got a lot of pushback.  The reason I didn't argue too much back
then (2007 or some such) was that I figured the kernel code wouldn't do
things like that anyway, so it didn't matter.
However, that was more than five years ago, so worth another try.

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 22:10                                                                                       ` Paul E. McKenney
@ 2014-02-20 22:52                                                                                         ` Linus Torvalds
  2014-02-21 18:35                                                                                           ` Michael Matz
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-20 22:52 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 20, 2014 at 2:10 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Linus, given that you are calling me out for pushing "legalistic and bad"
> things, "syntactic bullshit", and playing "little games", I am forced
> to conclude that you have never attended any sort of standards-committee
> meeting.  ;-)

Heh. I have heard people wax poetic about the pleasures of standards
committee meetings.

Enough that I haven't really ever had the slightest urge to participate ;)

> FWIW, the last time I tried excluding things like "f-f", "x%1", "y*0" and
> so on, I got a lot of pushback.  The reason I didn't argue too much back
> (2007 or some such) then was that my view at the time was that I figured
> the kernel code wouldn't do things like that anyway, so it didn't matter.

Well.. I'd really hope we would never do that. That said, we have
certainly used disgusting things that we knew would disable certain
optimizations in the compiler before, so I wouldn't put it *entirely*
past us to do things like that, but I'd argue that we'd do so in order
to confuse the compiler into doing what we want, not in order to argue
that it's a good thing.

I mean, right now we have at least *one* active ugly work-around for a
compiler bug (the magic empty inline asm that works around the "asm
goto" bug in gcc). So it's not like I would claim that we don't do
disgusting things when we need to.

But I'm pretty sure that any compiler guy must *hate* that current odd
dependency-generation part, and if I was a gcc person, seeing that
bugzilla entry Torvald pointed at, I would personally want to
dismember somebody with a rusty spoon..

So I suspect there are a number of people who would be *more* than
happy with a change to those odd dependency rules.

                Linus


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 22:52                                                                                         ` Linus Torvalds
@ 2014-02-21 18:35                                                                                           ` Michael Matz
  2014-02-21 19:13                                                                                             ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Michael Matz @ 2014-02-21 18:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

Hi,

On Thu, 20 Feb 2014, Linus Torvalds wrote:

> But I'm pretty sure that any compiler guy must *hate* that current odd
> dependency-generation part, and if I was a gcc person, seeing that
> bugzilla entry Torvald pointed at, I would personally want to
> dismember somebody with a rusty spoon..

Yes.  Defining dependency chains in the way the standard currently seems 
to do must come from people not writing compilers.  There's simply no 
sensible way to implement it without being really conservative, because 
the depchains can contain arbitrary constructs including stores, 
loads and function calls but must still be observed.  

And with conservative I mean "everything is a source of a dependency, and 
hence can't be removed, reordered or otherwise fiddled with", and that 
includes code sequences where no atomic objects are anywhere in sight [1].
In the light of that the only realistic way (meaning to not have to 
disable optimization everywhere) to implement consume as currently 
specified is to map it to acquire.  At which point it becomes pointless.

> So I suspect there are a number of people who would be *more* than
> happy with a change to those odd dependency rules.

I can't say much about your actual discussion related to semantics of 
atomics, not my turf.  But the "carries a dependency" relation is not 
usefully implementable.


Ciao,
Michael.
[1] Simple example of what type of transformations would be disallowed:

int getzero (int i) { return i - i; }

Should be optimizable to "return 0;", right?  Not with carries a 
dependency in place:

int jeez (int idx) {
  int i = atomic_load(idx, memory_order_consume); // A
  int j = getzero (i);                            // B
  return array[j];                                // C
}

As I read "carries a dependency" there's a dependency from A to C. 
Now suppose we would optimize getzero in the obvious way, then inline, and 
boom, dependency gone.  So we wouldn't be able to optimize any function 
when we don't control all its users, for fear that it _might_ be used in 
some dependency chain where it then matters that we possibly removed some 
chain elements due to the transformation.  We would have to retain 'i-i' 
before inlining, and if the function then is inlined into a context where 
depchains don't matter, could _then_ optimize it to zero.  But that's 
insane, especially considering that it's hard to detect if a given context 
doesn't care for depchains, after all the depchain relation is constructed 
exactly so that it bleeds into nearly everywhere.  So we would most of 
the time have to assume that the ultimate context will be depchain-aware 
and therefore disable many transformations.

There'd be one solution to the above, we would have to invent some special 
operands and markers that explicitely model "carries-a-dep", ala this:

int getzero (int i) {
  #RETURN.dep = i.dep
  return 0;
}

int jeez (int idx) {
  # i.dep = idx.dep
  int i = atomic_load(idx, memory_order_consume); // A
  # j.dep = i.dep
  int j = getzero (i);                            // B
  # RETURN.dep = j.dep + array.dep
  return array[j];                                // C
}

Then inlining getzero would merely add another "# j.dep = i.dep" relation, 
so depchains are still there but the value optimization can happen before 
inlining.  Having to do something like that I'd find disgusting, and 
rather rewrite consume into acquire :)  Or make the depchain relation 
somehow realistically implementable.


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-21 18:35                                                                                           ` Michael Matz
@ 2014-02-21 19:13                                                                                             ` Paul E. McKenney
  2014-02-21 22:10                                                                                               ` Joseph S. Myers
                                                                                                                 ` (2 more replies)
  0 siblings, 3 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-21 19:13 UTC (permalink / raw)
  To: Michael Matz
  Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote:
> Hi,
> 
> On Thu, 20 Feb 2014, Linus Torvalds wrote:
> 
> > But I'm pretty sure that any compiler guy must *hate* that current odd
> > dependency-generation part, and if I was a gcc person, seeing that
> > bugzilla entry Torvald pointed at, I would personally want to
> > dismember somebody with a rusty spoon..
> 
> Yes.  Defining dependency chains in the way the standard currently seems 
> to do must come from people not writing compilers.  There's simply no 
> sensible way to implement it without being really conservative, because 
> the depchains can contain arbitrary constructs including stores, 
> loads and function calls but must still be observed.  
> 
> And with conservative I mean "everything is a source of a dependency, and 
> hence can't be removed, reordered or otherwise fiddled with", and that 
> includes code sequences where no atomic objects are anywhere in sight [1].
> In the light of that the only realistic way (meaning to not have to 
> disable optimization everywhere) to implement consume as currently 
> specified is to map it to acquire.  At which point it becomes pointless.

No, only memory_order_consume loads and [[carries_dependency]]
function arguments are sources of dependency chains.

> > So I suspect there are a number of people who would be *more* than
> > happy with a change to those odd dependency rules.
> 
> I can't say much about your actual discussion related to semantics of 
> atomics, not my turf.  But the "carries a dependency" relation is not 
> usefully implementable.
> 
> 
> Ciao,
> Michael.
> [1] Simple example of what type of transformations would be disallowed:
> 
> int getzero (int i) { return i - i; }

This needs to be as follows:

[[carries_dependency]] int getzero(int i [[carries_dependency]])
{
	return i - i;
}

Otherwise dependencies won't get carried through it.

> Should be optimizable to "return 0;", right?  Not with carries a 
> dependency in place:
> 
> int jeez (int idx) {
>   int i = atomic_load(idx, memory_order_consume); // A
>   int j = getzero (i);                            // B
>   return array[j];                                // C
> }
> 
> As I read "carries a dependency" there's a dependency from A to C. 
> Now suppose we would optimize getzero in the obvious way, then inline, and 
> boom, dependency gone.  So we wouldn't be able to optimize any function 
> when we don't control all its users, for fear that it _might_ be used in 
> some dependency chain where it then matters that we possibly removed some 
> chain elements due to the transformation.  We would have to retain 'i-i' 
> before inlining, and if the function then is inlined into a context where 
> depchains don't matter, could _then_ optimize it to zero. But that's 
> insane, especially considering that it's hard to detect if a given context 
> doesn't care for depchains, after all the depchain relation is constructed 
> exactly so that it bleeds into nearly everywhere.  So we would most of 
> the time have to assume that the ultimate context will be depchain-aware 
> and therefore disable many transformations.

Any function that does not contain a memory_order_consume load and that
doesn't have any arguments marked [[carries_dependency]] can be optimized
just as before.
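
For concreteness, a sketch of the contrast (using the C++11-style
attribute spelling from the example above):

	/* No consume load, no [[carries_dependency]] arguments: the
	   compiler may fold this to "return 0;" exactly as before. */
	int getzero_plain(int i)
	{
		return i - i;
	}

	/* Annotated variant: the dependency must be carried through,
	   so the "i - i" cancellation can no longer simply vanish. */
	[[carries_dependency]] int getzero_dep(int i [[carries_dependency]])
	{
		return i - i;
	}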

> There'd be one solution to the above, we would have to invent some special 
> operands and markers that explicitly model "carries-a-dep", a la this:
> 
> int getzero (int i) {
>   #RETURN.dep = i.dep
>   return 0;
> }

The above is already handled by the [[carries_dependency]] attribute,
see above.

> int jeez (int idx) {
>   # i.dep = idx.dep
>   int i = atomic_load(idx, memory_order_consume); // A
>   # j.dep = i.dep
>   int j = getzero (i);                            // B
>   # RETURN.dep = j.dep + array.dep
>   return array[j];                                // C
> }
> 
> Then inlining getzero would merely add another "# j.dep = i.dep" relation, 
> so depchains are still there but the value optimization can happen before 
> inlining.  Having to do something like that I'd find disgusting, and 
> rather rewrite consume into acquire :)  Or make the depchain relation 
> somehow realistically implementable.

I was actually OK with arithmetic cancellation breaking the dependency
chains.  Others on the committee felt otherwise, and I figured that
(1) I wouldn't be writing that kind of function anyway and (2) they
knew more about writing compilers than I.  I would still be OK saying
that things like "i-i", "i*0", "i%1", "i&0", "i|~0" and so on just
break the dependency chain.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
       [not found]                                                                               ` <CAHWkzRQZ8+gOGMFNyTKjFNzpUv6d_J1G9KL0x_iCa=YCgvEojQ@mail.gmail.com>
@ 2014-02-21 19:16                                                                                 ` Linus Torvalds
  2014-02-21 19:41                                                                                   ` Linus Torvalds
       [not found]                                                                                   ` <CAHWkzRSO82jU-9dtTEjHaW2FeLcEqdZXxp5Q8cmVTTT9uhZQYw@mail.gmail.com>
       [not found]                                                                                 ` <CAHWkzRRxqhH+DnuQHu9bM4ywGBen3oqtT8W4Xqt1CFAHy2WQRg@mail.gmail.com>
  1 sibling, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-21 19:16 UTC (permalink / raw)
  To: p796231 .
  Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc, Mark Batty

On Fri, Feb 21, 2014 at 10:25 AM, Peter Sewell
<Peter.Sewell@cl.cam.ac.uk> wrote:
>
> If one thinks this is too fragile, then simply using memory_order_acquire
> and paying the resulting barrier cost (and perhaps hoping that compilers
> will eventually be able to optimise some cases of those barriers to
> hardware-level dependencies) is the obvious alternative.

No, the obvious alternative is to do what we do already, and just do
it by hand. Using acquire is *worse* than what we have now.

Maybe for some other users, the thing falls out differently.

> Many concurrent things will "accidentally" work on x86 - consume is not
> special in that respect.

No. But if you have something that is mis-designed, easy to get wrong,
and untestable, any sane programmer will go "that's bad".

> There are two difficulties with this, if I understand correctly what you're
> proposing.
>
> The first is knowing where to stop.

No.

Read my suggestion. Knowing where to stop is *trivial*.

Either the dependency is immediate and obvious, or you treat it like an acquire.

Seriously. Any compiler that doesn't turn the dependency chain into
SSA or something pretty much equivalent is pretty much a joke. Agreed?

So we can pretty much assume that the compiler will have some
intermediate representation as part of optimization that is basically
SSA.

So what you do is,

 - build the SSA by doing all the normal parsing and possible
tree-level optimizations you already do even before getting to the SSA
stage

 - do all the normal optimizations/simplifications/cse/etc that you do
normally on SSA

 - add *one* new rule to your SSA simplification that goes something like this:

   * when you see a load op that is marked with a "consume" barrier,
just follow the usage chain that comes from that.
   * if you hit a normal arithmetic op, just follow the result chain of that
   * if you hit a memory operation address use, stop and say "looks good"
   * if you hit anything else (including a copy/phi/whatever), abort
   * if nothing aborted as part of the walk, you can now just remove
the "consume" barrier.

You can fancy it up and try to follow more cases, but realistically
the only case that really matters is the "consume" being fed directly
into one or more loads, with possibly an offset calculation in
between. There are certainly more cases where you could *try* to remove the
barrier, but the thing is, it's never incorrect to not remove it, so
any time you get bored or hit any complication at all, just do the
"abort" part.

I *guarantee* that if you describe this to a compiler writer, he will
tell you that my scheme is about a billion times simpler than the
current standard wording. Especially after you've pointed him to that
gcc bugzilla entry and explained to him how the current standard
cares about those kinds of made-up syntactic chains that he likely
removed quite early, possibly even as he was generating the semantic
tree.

Try it. I dare you. So if you want to talk about "difficulties", the
current C standard loses.

> The second is the proposal in later mails to use some notion of  "semantic"
> dependency instead of this syntactic one.

Bah.

The C standard does that all over. It's called "as-if". The C standard
talks about how the compiler can do pretty much whatever it likes, as
long as the end result acts the same in the virtual C machine.

So claiming that making "semantics" meaningful is somehow complex is
bogus. People do that all the time. If you make it clear that the
dependency chain is through the *value*, not syntax, and that the
value can be optimized all the usual ways, it's quite clear what the
end result is. Any operation that actually meaningfully uses the value
is serialized with the load, and if there is no meaningful use that
would affect the end result in the virtual machine, then there is no
barrier.
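
Concretely, a sketch in the pseudo-syntax used elsewhere in this thread:

    p = atomic_read(&pp, consume);
    x = p->a;         /* value meaningfully used: ordered after the load */
    y = *(q + p - p); /* value cancels out: no semantic use, no barrier */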

Why would this be any different, especially since it's easy to
understand both for a human and a compiler?

               Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
       [not found]                                                                                 ` <CAHWkzRRxqhH+DnuQHu9bM4ywGBen3oqtT8W4Xqt1CFAHy2WQRg@mail.gmail.com>
@ 2014-02-21 19:24                                                                                   ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-21 19:24 UTC (permalink / raw)
  To: Peter Sewell
  Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc, Mark Batty

On Fri, Feb 21, 2014 at 06:28:05PM +0000, Peter Sewell wrote:
On 20 February 2014 17:01, Linus Torvalds <torvalds@linux-foundation.org> wrote:

[ . . . ]

> > > So, if you make one of two changes to your example, then I will agree
> > > with you.
> >
> > No. We're not playing games here. I'm fed up with complex examples
> > that make no sense.
> >
> > Nobody sane writes code that does that pointer comparison, and it is
> > entirely immaterial what the compiler can do behind our backs. The C
> > standard semantics need to make sense to the *user* (ie programmer),
> > not to a CPU and not to a compiler. The CPU and compiler are "tools".
> > They don't matter. Their only job is to make the code *work*, dammit.
> >
> > So no idiotic made-up examples that involve code that nobody will ever
> > write and that have subtle issues.
> >
> > So the starting point is that (same example as before, but with even
> > clearer naming):
> >
> > Initialization state:
> >   initialized = 0;
> >   value = 0;
> >
> > Consumer:
> >
> >     return atomic_read(&initialized, consume) ? value : -1;
> >
> > Writer:
> >     value = 42;
> >     atomic_write(&initialized, 1, release);
> >
> > and because the C memory ordering standard is written in such a way
> > that this is subtly buggy (and can return 0, which is *not* logically
> > a valid value), then I think the C memory ordering standard is broken.
> >
> > That "consumer" memory ordering is dangerous as hell, and it is
> > dangerous FOR NO GOOD REASON.
> >
> > The trivial "fix" to the standard would be to get rid of all the
> > "carries a dependency" crap, and just say that *anything* that depends
> > on it is ordered wrt it.
> >
> 
> There are two difficulties with this, if I understand correctly what
> you're proposing.
> 
> The first is knowing where to stop.  If one includes all data and
> control dependencies, pretty much all the continuation of execution
> would depend on the consume read, so the compiler would eventually
> have to give up and insert a gratuitous barrier.  One might imagine
> somehow annotating the return_expensive_system_value() you have below
> to say "stop dependency tracking at the return" (thereby perhaps
> enabling the compiler to optimise the barrier that you'd need in h/w
> to order the Linus-consume-read of initialised and the non-atomic read
> of calculated, replacing it by a compiler-introduced artificial
> dependency), and indeed that's roughly what the standard's
> kill_dependency does for consumes.

One way to tell the compiler where to stop would be to place markers
in the source code saying where dependencies stop.  These markers
could be provided by the definitions of the current rcu_read_unlock()
tags in the Linux kernel (and elsewhere, for that matter).  These would
be overridden by [[carries_dependency]] tags on function arguments and
return values, which is needed to handle the possibility of nested RCU
read-side critical sections.
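
For what it's worth, C11 already has one such end-of-chain marker in
kill_dependency(); a minimal sketch of the idea (hypothetical function
and array names, standard C11 spelling):

	#include <stdatomic.h>

	extern int array[];

	int consumer(_Atomic int *idx)
	{
		int i = atomic_load_explicit(idx, memory_order_consume);

		/* kill_dependency() ends the dependency chain here; an
		   rcu_read_unlock()-style marker could be defined the
		   same way. */
		return array[kill_dependency(i)];
	}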

> The second is the proposal in later mails to use some notion of
> "semantic" dependency instead of this syntactic one.  That's maybe
> attractive at first sight, but rather hard to define in a decent way
> in general.  To define whether the consume load can "really" affect
> some subsequent value, you need to know about the set of all possible
> executions of the program - which is exactly what we have to define.
> 
> For syntactic dependencies, in contrast, you can at least tell whether
> they exist by examining the source code you have in front of you.  The
> fact that artificial dependencies like (&x + y-y) are significant is
> (I guess) basically incidental at the C level - sometimes things like
> this are the best idiom to enforce ordering at the assembly level, but
> in C I imagine they won't normally arise.  If they do, it might be
> nicer to have a more informative syntax, eg (&x + dependency(y)).

This was in fact one of the arguments put forward in favor of carrying
dependencies through things like "y-y" back in the 2007-8 timeframe.

Can't say that I am much of a fan of manually tagging all dependencies:
Machines are much better at that sort of thing than are humans.
But just out of curiosity, did you instead mean (&x + dependency(y-y))
or some such?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-21 19:16                                                                                 ` Linus Torvalds
@ 2014-02-21 19:41                                                                                   ` Linus Torvalds
  2014-02-21 19:48                                                                                     ` Peter Sewell
       [not found]                                                                                   ` <CAHWkzRSO82jU-9dtTEjHaW2FeLcEqdZXxp5Q8cmVTTT9uhZQYw@mail.gmail.com>
  1 sibling, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-21 19:41 UTC (permalink / raw)
  To: p796231 .
  Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc, Mark Batty

On Fri, Feb 21, 2014 at 11:16 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Why would this be any different, especially since it's easy to
> understand both for a human and a compiler?

Btw, the actual data path may actually be semantically meaningful even
at a processor level.

For example, let's look at that gcc bugzilla that got mentioned
earlier, and let's assume that gcc is fixed to follow the "arithmetic
is always meaningful, even if it is only syntactic" rule to the letter.
So we have that gcc bugzilla use-case:

   flag ? *(q + flag - flag) : 0;

and let's say that the fixed compiler now generates the code with the
data dependency that is actually suggested in that bugzilla entry:

        and     w2, w2, #0
        ldr     w0, [x1, w2]

ie the CPU actually sees that address data dependency. Now everything
is fine, right?

Wrong.

It is actually quite possible that the CPU sees the "and with zero"
and *breaks the dependencies on the incoming value*.

Modern CPUs literally do things like that. Seriously. Maybe not that
particular one, but you'll sometimes find that the CPU - in the
instruction decoding phase (ie very early in the pipeline) - notices
certain patterns that generate constants, and actually drop the data
dependency on the "incoming" registers.

On x86, generating zero using "xor" on the register with itself is one
such known sequence.

Can you guarantee that powerpc doesn't do the same for "and r,r,#0"?
Or what if the compiler generated the much more obvious

    sub w2,w2,w2

for that "+flag-flag"? Are you really 100% sure that the CPU won't
notice that that is just a way to generate a zero, and doesn't depend
on the incoming values?

Because I'm not. I know CPU designers that do exactly this.

So I would actually and seriously argue that the whole C standard
attempt to use a syntactic data dependency as a determination of
whether two things are serialized is wrong, and that you actually
*want* to have the compiler optimize away false data dependencies.

Because people playing tricks with "+flag-flag" and thinking that that
somehow generates a data dependency - that's *wrong*. It's not just
the compiler that decides "that's obviously nonsense, I'll optimize it
away". The CPU itself can do it.

So my "actual semantic dependency" model is seriously more likely to
be *correct*. Not just at a compiler level.

Btw, any tricks like that, I would also take a second look at the
assembler and the linker. Many assemblers do some trivial
optimizations too. Are you sure that "and     w2, w2, #0" really ends
up being encoded as an "and"? Maybe the assembler says "I can do that
as a 'mov w2,#0' instead"? Who knows? Even power and ARM have their
variable-sized encodings (there are some "compressed executable"
embedded power processors, and there is obviously Thumb2, and many
assemblers end up trying to use equivalent "small" instructions).

So the whole "fake data dependency" thing is just dangerous on so many levels.

MUCH more dangerous than my "actual real dependency" model.

                   Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-21 19:41                                                                                   ` Linus Torvalds
@ 2014-02-21 19:48                                                                                     ` Peter Sewell
  0 siblings, 0 replies; 299+ messages in thread
From: Peter Sewell @ 2014-02-21 19:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc, Mark Batty

On 21 February 2014 19:41, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, Feb 21, 2014 at 11:16 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Why would this be any different, especially since it's easy to
>> understand both for a human and a compiler?
>
> Btw, the actual data path may actually be semantically meaningful even
> at a processor level.
>
> For example, let's look at that gcc bugzilla that got mentioned
> earlier, and let's assume that gcc is fixed to follow the "arithmetic
> is always meaningful, even if it is only syntactic" rule to the letter.
> So we have that gcc bugzilla use-case:
>
>    flag ? *(q + flag - flag) : 0;
>
> and let's say that the fixed compiler now generates the code with the
> data dependency that is actually suggested in that bugzilla entry:
>
>         and     w2, w2, #0
>         ldr     w0, [x1, w2]
>
> ie the CPU actually sees that address data dependency. Now everything
> is fine, right?
>
> Wrong.
>
> It is actually quite possible that the CPU sees the "and with zero"
> and *breaks the dependencies on the incoming value*.

For reference: the Power and ARM architectures explicitly guarantee
not to do this, the architects are quite clear about it, and we've
tested (some cases) rather thoroughly.
I can't speak about other architectures.

> Modern CPUs literally do things like that. Seriously. Maybe not that
> particular one, but you'll sometimes find that the CPU - in the
> instruction decoding phase (ie very early in the pipeline) - notices
> certain patterns that generate constants, and actually drop the data
> dependency on the "incoming" registers.
>
> On x86, generating zero using "xor" on the register with itself is one
> such known sequence.
>
> Can you guarantee that powerpc doesn't do the same for "and r,r,#0"?
> Or what if the compiler generated the much more obvious
>
>     sub w2,w2,w2
>
> for that "+flag-flag"? Are you really 100% sure that the CPU won't
> notice that that is just a way to generate a zero, and doesn't depend
> on the incoming values?
>
> Because I'm not. I know CPU designers that do exactly this.
>
> So I would actually and seriously argue that the whole C standard
> attempt to use a syntactic data dependency as a determination of
> whether two things are serialized is wrong, and that you actually
> *want* to have the compiler optimize away false data dependencies.
>
> Because people playing tricks with "+flag-flag" and thinking that that
> somehow generates a data dependency - that's *wrong*. It's not just
> the compiler that decides "that's obviously nonsense, I'll optimize it
> away". The CPU itself can do it.
>
> So my "actual semantic dependency" model is seriously more likely to
> be *correct*. Not just at a compiler level.
>
> Btw, any tricks like that, I would also take a second look at the
> assembler and the linker. Many assemblers do some trivial
> optimizations too.

That's certainly something worth checking.

> Are you sure that "and     w2, w2, #0" really ends
> up being encoded as an "and"? Maybe the assembler says "I can do that
> as a 'mov w2,#0' instead"? Who knows? Even power and ARM have their
> variable-sized encodings (there are some "compressed executable"
> embedded power processors, and there is obviously Thumb2, and many
> assemblers end up trying to use equivalent "small" instructions).
>
> So the whole "fake data dependency" thing is just dangerous on so many levels.
>
> MUCH more dangerous than my "actual real dependency" model.
>
>                    Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
       [not found]                                                                                   ` <CAHWkzRSO82jU-9dtTEjHaW2FeLcEqdZXxp5Q8cmVTTT9uhZQYw@mail.gmail.com>
@ 2014-02-21 20:22                                                                                     ` Linus Torvalds
  0 siblings, 0 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-21 20:22 UTC (permalink / raw)
  To: p796231 .
  Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc, Mark Batty

On Fri, Feb 21, 2014 at 11:43 AM, Peter Sewell
<Peter.Sewell@cl.cam.ac.uk> wrote:
>
> You have to track dependencies through other assignments, e.g. simple x=y

That is all visible in the SSA form. Variable assignment has been
converted to some use of the SSA node that generated the value. The
use might be a phi node or a cast op, or maybe it's just a memory
store op, but the whole point of SSA is that there is one single node
that creates the data (in this case that would be the "load" op with
the associated consume barrier - that barrier might be part of the
load op itself, or it might be implemented as a separate SSA node that
consumes the result of the load that generates a new pseudo), and the
uses of the result are all visible from that.

And yes, there might be a lot of users. But any complex case you just
punt on - and the difference here is that since "punt" means "leave
the barrier in place", it's never a correctness issue.

So yeah, it could be somewhat expensive, although you can always bound
that expense by just punting. But the dependencies in SSA form are no
more complex than the dependencies the C standard talks about now, and
in SSA form they are at least really easy to follow.  So if they are
complex and expensive in SSA form, I'd expect them to be *worse* in
the current "depends-on" syntax form.

                 Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-21 19:13                                                                                             ` Paul E. McKenney
@ 2014-02-21 22:10                                                                                               ` Joseph S. Myers
  2014-02-21 22:37                                                                                                 ` Paul E. McKenney
  2014-02-26 13:09                                                                                                 ` Torvald Riegel
  2014-02-24 13:55                                                                                               ` Michael Matz
  2014-02-26 13:04                                                                                               ` Torvald Riegel
  2 siblings, 2 replies; 299+ messages in thread
From: Joseph S. Myers @ 2014-02-21 22:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michael Matz, Linus Torvalds, Torvald Riegel, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Fri, 21 Feb 2014, Paul E. McKenney wrote:

> This needs to be as follows:
> 
> [[carries_dependency]] int getzero(int i [[carries_dependency]])
> {
> 	return i - i;
> }
> 
> Otherwise dependencies won't get carried through it.

C11 doesn't have attributes at all (and no specification regarding calls 
and dependencies that I can see).  And the way I read the C++11 
specification of carries_dependency is that specifying carries_dependency 
is purely about increasing optimization of the caller: that if it isn't 
specified, then the caller doesn't know what dependencies might be 
carried.  "Note: The carries_dependency attribute does not change the 
meaning of the program, but may result in generation of more efficient 
code. - end note".

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-21 22:10                                                                                               ` Joseph S. Myers
@ 2014-02-21 22:37                                                                                                 ` Paul E. McKenney
  2014-02-26 13:09                                                                                                 ` Torvald Riegel
  1 sibling, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-21 22:37 UTC (permalink / raw)
  To: Joseph S. Myers
  Cc: Michael Matz, Linus Torvalds, Torvald Riegel, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Fri, Feb 21, 2014 at 10:10:54PM +0000, Joseph S. Myers wrote:
> On Fri, 21 Feb 2014, Paul E. McKenney wrote:
> 
> > This needs to be as follows:
> > 
> > [[carries_dependency]] int getzero(int i [[carries_dependency]])
> > {
> > 	return i - i;
> > }
> > 
> > Otherwise dependencies won't get carried through it.
> 
> C11 doesn't have attributes at all (and no specification regarding calls 
> and dependencies that I can see).  And the way I read the C++11 
> specification of carries_dependency is that specifying carries_dependency 
> is purely about increasing optimization of the caller: that if it isn't 
> specified, then the caller doesn't know what dependencies might be 
> carried.  "Note: The carries_dependency attribute does not change the 
> meaning of the program, but may result in generation of more efficient 
> code. - end note".

Good point -- I am so used to them being in gcc that I missed that.

In which case, it seems to me that straight C11 is within its rights
to emit a memory barrier just before control passes into a function
that either it can't see or that it chose to apply dependency-breaking
optimizations to.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 18:18                                                                           ` Paul E. McKenney
@ 2014-02-22 18:30                                                                             ` Torvald Riegel
  2014-02-22 20:17                                                                               ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-22 18:30 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, 2014-02-20 at 10:18 -0800, Paul E. McKenney wrote:
> On Thu, Feb 20, 2014 at 06:26:08PM +0100, Torvald Riegel wrote:
> > 
> > On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote:
> > > On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote:
> > > > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > > > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> > > > >>
> > > > >> Can you point to it? Because I can find a draft standard, and it sure
> > > > >> as hell does *not* contain any clarity of the model. It has a *lot* of
> > > > >> verbiage, but it's pretty much impossible to actually understand, even
> > > > >> for somebody who really understands memory ordering.
> > > > >
> > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > > > This has an explanation of the model up front, and then the detailed
> > > > > formulae in Section 6.  This is from 2010, and there might have been
> > > > > smaller changes since then, but I'm not aware of any bigger ones.
> > > > 
> > > > Ahh, this is different from what others pointed at. Same people,
> > > > similar name, but not the same paper.
> > > > 
> > > > I will read this version too, but from reading the other one and the
> > > > standard in parallel and trying to make sense of it, it seems that I
> > > > may have originally misunderstood part of the whole control dependency
> > > > chain.
> > > > 
> > > > The fact that the left side of "? :", "&&" and "||" breaks data
> > > > dependencies made me originally think that the standard tried very
> > > > hard to break any control dependencies. Which I felt was insane, when
> > > > then some of the examples literally were about the testing of the
> > > > value of an atomic read. The data dependency matters quite a bit. The
> > > > fact that the other "Mathematical" paper then very much talked about
> > > > consume only in the sense of following a pointer made me think so even
> > > > more.
> > > > 
> > > > But reading it some more, I now think that the whole "data dependency"
> > > > logic (which is where the special left-hand side rule of the ternary
> > > > and logical operators come in) are basically an exception to the rule
> > > > that sequence points end up being also meaningful for ordering (ok, so
> > > > C11 seems to have renamed "sequence points" to "sequenced before").
> > > > 
> > > > So while an expression like
> > > > 
> > > >     atomic_read(p, consume) ? a : b;
> > > > 
> > > > doesn't have a data dependency from the atomic read that forces
> > > > serialization, writing
> > > > 
> > > >    if (atomic_read(p, consume))
> > > >       a;
> > > >    else
> > > >       b;
> > > > 
> > > > the standard *does* imply that the atomic read is "happens-before" wrt
> > > > "a", and I'm hoping that there is no question that the control
> > > > dependency still acts as an ordering point.
> > > 
> > > The control dependency should order subsequent stores, at least assuming
> > > that "a" and "b" don't start off with identical stores that the compiler
> > > could pull out of the "if" and merge.  The same might also be true for ?:
> > > for all I know.  (But see below)
> > 
> > I don't think this is quite true.  I agree that a conditional store will
> > not be executed speculatively (note that if it would happen in both the
> > then and the else branch, it's not conditional); so, the store in
> > "a;" (assuming it would be a store) won't happen unless the thread can
> > really observe a true value for p.  However, this is *this thread's*
> > view of the world, but not guaranteed to constrain how any other thread
> > sees the state.  mo_consume does not contribute to
> > inter-thread-happens-before in the same way that mo_acquire does (which
> > *does* put a constraint on i-t-h-b, and thus enforces a global
> > constraint that all threads have to respect).
> > 
> > Is it clear which distinction I'm trying to show here?
> 
> If you are saying that the control dependencies are a result of a
> combination of the standard and the properties of the hardware that
> Linux runs on, I am with you.  (As opposed to control dependencies being
> a result solely of the standard.)

I'm not quite sure I understand what you mean :)  Do you mean the
control dependencies in the binary code, or the logical "control
dependencies" in source programs?

> This was a deliberate decision in 2007 or so.  At that time, the
> documentation on CPU memory orderings were pretty crude, and it was
> not clear that all relevant hardware respected control dependencies.
> Back then, if you wanted an authoritative answer even to a fairly simple
> memory-ordering question, you had to find a hardware architect, and you
> probably waited weeks or even months for the answer.  Thanks to lots
> of work from the Cambridge guys at about the time that the standard was
> finalized, we have a much better picture of what the hardware does.

But this part I understand.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 19:09                                                                                     ` Linus Torvalds
@ 2014-02-22 18:53                                                                                       ` Torvald Riegel
  2014-02-22 21:53                                                                                         ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-22 18:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, 2014-02-20 at 11:09 -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 10:53 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > On Thu, 2014-02-20 at 10:32 -0800, Linus Torvalds wrote:
> >> On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney
> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> >
> >> > You really need that "consume" to be "acquire".
> >>
> >> So I think we now all agree that that is what the standard is saying.
> >
> > Huh?
> >
> > The standard says that there are two separate things (among many more):
> > mo_acquire and mo_consume.  They both influence happens-before in
> > different (and independent!) ways.
> >
> > What Paul is saying is that *you* should have used *acquire* in that
> > example.
> 
> I understand.
> 
> And I disagree. I think the standard is wrong, and what I *should* be
> doing is point out the fact very loudly, and just tell people to NEVER
> EVER use "consume" as long as it's not reliable and has insane
> semantics.

Stating (1) that "the standard is wrong" and (2) that you think
mo_consume semantics are not good are two different things.  Making bold
statements without a proper context isn't helpful in making this
discussion constructive.  It's simply not efficient if I (or anybody
else reading this) has to wonder whether you actually mean what you said
(even if, read literally, it is arguably not consistent with the
arguments brought up in the discussion) or whether those statements just
have to be interpreted in some other way.

> So what I "should do" is to not accept any C11 atomics use in the
> kernel.

You're obviously free to do that.

> Because with the "acquire", it generates worse code than what
> we already have,

I would argue that this is still under debate.  At least I haven't seen
a definition of what you want that is complete and based on the standard
(e.g., an example of what a compiler might do in a specific case isn't a
definition).  From what I've seen, it's not inconceivable that what you
want is just an optimized acquire.

I'll bring this question up again elsewhere in the thread (where it
hopefully fits better).


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-22 18:30                                                                             ` Torvald Riegel
@ 2014-02-22 20:17                                                                               ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-22 20:17 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sat, Feb 22, 2014 at 07:30:37PM +0100, Torvald Riegel wrote:
> 
> On Thu, 2014-02-20 at 10:18 -0800, Paul E. McKenney wrote:
> > On Thu, Feb 20, 2014 at 06:26:08PM +0100, Torvald Riegel wrote:
> > > 
> > > On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote:
> > > > On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote:
> > > > > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > > > > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> > > > > >>
> > > > > >> Can you point to it? Because I can find a draft standard, and it sure
> > > > > >> as hell does *not* contain any clarity of the model. It has a *lot* of
> > > > > >> verbiage, but it's pretty much impossible to actually understand, even
> > > > > >> for somebody who really understands memory ordering.
> > > > > >
> > > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > > > > This has an explanation of the model up front, and then the detailed
> > > > > > formulae in Section 6.  This is from 2010, and there might have been
> > > > > > smaller changes since then, but I'm not aware of any bigger ones.
> > > > > 
> > > > > Ahh, this is different from what others pointed at. Same people,
> > > > > similar name, but not the same paper.
> > > > > 
> > > > > I will read this version too, but from reading the other one and the
> > > > > standard in parallel and trying to make sense of it, it seems that I
> > > > > may have originally misunderstood part of the whole control dependency
> > > > > chain.
> > > > > 
> > > > > The fact that the left side of "? :", "&&" and "||" breaks data
> > > > > dependencies made me originally think that the standard tried very
> > > > > hard to break any control dependencies. Which I felt was insane, when
> > > > > then some of the examples literally were about the testing of the
> > > > > value of an atomic read. The data dependency matters quite a bit. The
> > > > > fact that the other "Mathematical" paper then very much talked about
> > > > > consume only in the sense of following a pointer made me think so even
> > > > > more.
> > > > > 
> > > > > But reading it some more, I now think that the whole "data dependency"
> > > > > logic (which is where the special left-hand side rule of the ternary
> > > > > and logical operators come in) are basically an exception to the rule
> > > > > that sequence points end up being also meaningful for ordering (ok, so
> > > > > C11 seems to have renamed "sequence points" to "sequenced before").
> > > > > 
> > > > > So while an expression like
> > > > > 
> > > > >     atomic_read(p, consume) ? a : b;
> > > > > 
> > > > > doesn't have a data dependency from the atomic read that forces
> > > > > serialization, writing
> > > > > 
> > > > >    if (atomic_read(p, consume))
> > > > >       a;
> > > > >    else
> > > > >       b;
> > > > > 
> > > > > the standard *does* imply that the atomic read is "happens-before" wrt
> > > > > "a", and I'm hoping that there is no question that the control
> > > > > dependency still acts as an ordering point.
> > > > 
> > > > The control dependency should order subsequent stores, at least assuming
> > > > that "a" and "b" don't start off with identical stores that the compiler
> > > > could pull out of the "if" and merge.  The same might also be true for ?:
> > > > for all I know.  (But see below)
> > > 
> > > I don't think this is quite true.  I agree that a conditional store will
> > > not be executed speculatively (note that if it would happen in both the
> > > then and the else branch, it's not conditional); so, the store in
> > > "a;" (assuming it would be a store) won't happen unless the thread can
> > > really observe a true value for p.  However, this is *this thread's*
> > > view of the world, but not guaranteed to constrain how any other thread
> > > sees the state.  mo_consume does not contribute to
> > > inter-thread-happens-before in the same way that mo_acquire does (which
> > > *does* put a constraint on i-t-h-b, and thus enforces a global
> > > constraint that all threads have to respect).
> > > 
> > > Is it clear which distinction I'm trying to show here?
> > 
> > If you are saying that the control dependencies are a result of a
> > combination of the standard and the properties of the hardware that
> > Linux runs on, I am with you.  (As opposed to control dependencies being
> > a result solely of the standard.)
> 
> I'm not quite sure I understand what you mean :)  Do you mean the
> control dependencies in the binary code, or the logical "control
> dependencies" in source programs?

At present, the intersection of those two sets, but only including those
control dependencies beginning with a memory_order_consume load or a
[[carries_dependency]] function argument or return value.

Or something like that.  ;-)

> > This was a deliberate decision in 2007 or so.  At that time, the
> > documentation on CPU memory orderings were pretty crude, and it was
> > not clear that all relevant hardware respected control dependencies.
> > Back then, if you wanted an authoritative answer even to a fairly simple
> > memory-ordering question, you had to find a hardware architect, and you
> > probably waited weeks or even months for the answer.  Thanks to lots
> > of work from the Cambridge guys at about the time that the standard was
> > finalized, we have a much better picture of what the hardware does.
> 
> But this part I understand.

;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-22 18:53                                                                                       ` Torvald Riegel
@ 2014-02-22 21:53                                                                                         ` Linus Torvalds
  2014-02-23  0:39                                                                                           ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-22 21:53 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sat, Feb 22, 2014 at 10:53 AM, Torvald Riegel <triegel@redhat.com> wrote:
>
> Stating (1) that "the standard is wrong" and (2) that you think
> mo_consume semantics are not good are two different things.

I do agree. They are two independent things.

I think the standard is wrong, because it's overly complex, hard to
understand, and nigh unimplementable. As shown by the bugzilla
example, "carries a dependency" encompasses things that are *not* just
synchronizing through a pointer, and as a result it's
actually very complicated, since they could have been optimized away,
or done in non-local code that wasn't even aware of the dependency
carrying.

That said, I'm reconsidering my suggested stricter semantics, because
for RCU we actually do want to test the resulting pointer against NULL
_without_ any implied serialization.

So I still feel that the standard as written is fragile and confusing
(and the bugzilla entry pretty much proves that it is also practically
unimplementable as written), but strengthening the serialization may
be the wrong thing.

Within the kernel, the RCU use for this is literally purely about
loading a pointer, and doing either:

 - testing its value against NULL (without any implied synchronization at all)

 - using it as a pointer to an object, and expecting that any accesses
to that object are ordered wrt the consuming load.

So I actually have a suggested *very* different model that people
might find more acceptable.

How about saying that the result of an "atomic_read(&a, mo_consume)" is
required to be a _restricted_ pointer type, and that the consume
ordering guarantees the ordering between that atomic read and the
accesses to the object that the pointer points to.

No "carries a dependency", no nothing.

Now, there's two things to note in there:

 - the "restricted pointer" part means that the compiler does not need
to worry about serialization to that object through other possible
pointers - we have basically promised that the *only* pointer to that
object comes from the mo_consume. So that part makes it clear that the
"consume" ordering really only is valid wrt that particular pointer
load.

 - the "to the object that the pointer points to" makes it clear that
you can't use the pointer to generate arbitrary other values and claim
to serialize that way.

IOW, with those alternate semantics, that gcc bugzilla example is
utterly bogus, and a compiler can ignore it, because while it tries to
synchronize through the "dependency chain" created with that "p-i+i"
expression, that is completely irrelevant when you use the above rules
instead.

In the bugzilla example, the object that "*(p-i+i)" accesses isn't
actually the object pointed to by the pointer, so no serialization is
implied. And if it actually *were* to be the same object, because "p"
happens to have the same value as "i", then the "restrict" part of the
rule pops up and the compiler can again say that there is no ordering
guarantee, since the programmer lied to it and used a restricted
pointer that aliased with another one.

So the above suggestion basically tightens the semantics of "consume"
in a totally different way - it doesn't make it serialize more, in
fact it weakens the serialization guarantees a lot, but it weakens
them in a way that makes the semantics a lot simpler and clearer.
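
For concreteness, a hedged sketch of those proposed semantics in C11
spelling (the types and function name here are hypothetical):

	struct obj { int a; };

	int reader(_Atomic(struct obj *) *gp)
	{
		/* Promise: this is the *only* pointer to the object,
		   and only accesses through it to that object are
		   ordered after the consume load. */
		struct obj *restrict p =
			atomic_load_explicit(gp, memory_order_consume);

		if (!p)
			return -1;	/* NULL test: no ordering implied */
		return p->a;		/* ordered wrt the consuming load */
	}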

                 Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-22 21:53                                                                                         ` Linus Torvalds
@ 2014-02-23  0:39                                                                                           ` Paul E. McKenney
  2014-02-23  3:50                                                                                             ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-23  0:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sat, Feb 22, 2014 at 01:53:30PM -0800, Linus Torvalds wrote:
> On Sat, Feb 22, 2014 at 10:53 AM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > Stating (1) that "the standard is wrong" and (2) that you think
> > mo_consume semantics are not good are two different things.
> 
> I do agree. They are two independent things.
> 
> I think the standard is wrong, because it's overly complex, hard to
> understand, and nigh unimplementable. As shown by the bugzilla
> example, "carries a dependency" encompasses things that are *not* just
> synchronizing through a pointer, and as a result it's
> actually very complicated, since they could have been optimized away,
> or done in non-local code that wasn't even aware of the dependency
> carrying.
> 
> That said, I'm reconsidering my suggested stricter semantics, because
> for RCU we actually do want to test the resulting pointer against NULL
> _without_ any implied serialization.
> 
> So I still feel that the standard as written is fragile and confusing
> (and the bugzilla entry pretty much proves that it is also practically
> unimplementable as written), but strengthening the serialization may
> be the wrong thing.
> 
> Within the kernel, the RCU use for this is literally purely about
> loading a pointer, and doing either:
> 
>  - testing its value against NULL (without any implied synchronization at all)
> 
>  - using it as a pointer to an object, and expecting that any accesses
> to that object are ordered wrt the consuming load.

Agreed, by far the most frequent use is "->" to dereference and assignment
to store into a local variable.  The other operations where the kernel
expects ordering to be maintained are:

o	Bitwise "&" to strip off low-order bits.  The FIB tree does
	this, for example in fib_table_lookup() in net/ipv4/fib_trie.c.
	The low-order bit is used to distinguish internal nodes from
	leaves -- nodes and leaves are different types of structures.
	(There are a few others.)

o	Uses "?:" to substitute defaults in case of NULL pointers,
	but ordering must be maintained in the non-default case.
	Most, perhaps all, of these could be converted to "if" should
	"?:" prove problematic.

o	Addition and subtraction to adjust both pointers to and indexes
	into RCU-protected arrays.  There are not that many indexes,
	and they could be converted to pointers, but the addition and
	subtraction look necessary in some cases.

o	Array indexing.  The value from rcu_dereference() is used both
	before and inside the "[]", interestingly enough.

o	Casts along with unary "&" and "*".

That said, I did not see any code that depended on ordering through
the function-call "()", boolean complement "!", comparison (only "=="
and "!="), logical operators ("&&" and "||"), and the "*", "/", and "%"
arithmetic operators.
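
Purely as illustration -- hypothetical code, not the actual fib_trie
source -- the kinds of uses listed above look like:

	struct node { unsigned long flags; int val; };
	struct node __rcu *gp;		/* published via rcu_assign_pointer() */
	struct node default_node;

	int dependency_uses(int n)
	{
		struct node *p = rcu_dereference(gp);

		/* bitwise "&" stripping a low-order tag bit (fib_trie-style): */
		p = (struct node *)((unsigned long)p & ~1UL);
		/* "?:" substituting a default on NULL, ordered otherwise: */
		p = p ? p : &default_node;
		/* pointer addition and array indexing off the same value: */
		return (p + n)->val + p[n].val;
	}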

> So I actually have a suggested *very* different model that people
> might find more acceptable.
> 
> How about saying that the result of an "atomic_read(&a, mo_consume)" is
> required to be a _restricted_ pointer type, and that the consume
> ordering guarantees the ordering between that atomic read and the
> accesses to the object that the pointer points to.
> 
> No "carries a dependency", no nothing.

In the case of arrays, the object that the pointer points to is
considered to be the full array, right?

> Now, there's two things to note in there:
> 
>  - the "restricted pointer" part means that the compiler does not need
> to worry about serialization to that object through other possible
> pointers - we have basically promised that the *only* pointer to that
> object comes from the mo_consume. So that part makes it clear that the
> "consume" ordering really only is valid wrt that particular pointer
> load.

That could work, though there are some cases where a multi-linked
structure is made visible using a single rcu_assign_pointer(), and
rcu_dereference() is used only for the pointer leading to that
multi-linked structure, not for the pointers among the elements
making up that structure.  One way to handle this would be to
require rcu_dereference() to be used within the structure as well
as upon first traversal to the structure.

>  - the "to the object that the pointer points to" makes it clear that
> you can't use the pointer to generate arbitrary other values and claim
> to serialize that way.
> 
> IOW, with those alternate semantics, that gcc bugzilla example is
> utterly bogus, and a compiler can ignore it, because while it tries to
> synchronize through the "dependency chain" created with that "p-i+i"
> expression, that is completely irrelevant when you use the above rules
> instead.
> 
> In the bugzilla example, the object that "*(p-i+i)" accesses isn't
> actually the object pointed to by the pointer, so no serialization is
> implied. And if it actually *were* to be the same object, because "p"
> happens to have the same value as "i", then the "restrict" part of the
> rule pops up and the compiler can again say that there is no ordering
> guarantee, since the programmer lied to it and used a restricted
> pointer that aliased with another one.
> 
> So the above suggestion basically tightens the semantics of "consume"
> in a totally different way - it doesn't make it serialize more, in
> fact it weakens the serialization guarantees a lot, but it weakens
> them in a way that makes the semantics a lot simpler and clearer.

It does look simpler and does look like it handles a large fraction of
the Linux-kernel uses.

But now it is time for some bullshit syntax for the RCU-protected arrays
in the Linux kernel:

	p = atomic_load_explicit(gp, memory_order_consume);
	r1 = *p; /* Ordering maintained. */
	r2 = p[5]; /* Ordering maintained? */
	r3 = p + 5; /* Ordering maintained? */
	n = get_an_index();
	r4 = p[n]; /* Ordering maintained? */

If the answer to the three questions is "no", then perhaps some special
function takes care of accesses to RCU-protected arrays.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-23  0:39                                                                                           ` Paul E. McKenney
@ 2014-02-23  3:50                                                                                             ` Linus Torvalds
  2014-02-23  6:34                                                                                               ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-23  3:50 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sat, Feb 22, 2014 at 4:39 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Agreed, by far the most frequent use is "->" to dereference and assignment
> to store into a local variable.  The other operations where the kernel
> expects ordering to be maintained are:
>
> o       Bitwise "&" to strip off low-order bits.  The FIB tree does
>         this, for example in fib_table_lookup() in net/ipv4/fib_trie.c.
>         The low-order bit is used to distinguish internal nodes from
>         leaves -- nodes and leaves are different types of structures.
>         (There are a few others.)

Note that this is very much outside the scope of the C standard,
regardless of 'consume' or not.

We'll always do things outside the standard, so I wouldn't worry.

> o       Uses "?:" to substitute defaults in case of NULL pointers,
>         but ordering must be maintained in the non-default case.
>         Most, perhaps all, of these could be converted to "if" should
>         "?:" prove problematic.

Note that this doesn't actually affect the restrict/ordering rule in
theory: "?:" isn't special according to those rules. The rules are
fairly simple: we guarantee ordering only to the object that the
pointer points to, and even that guarantee goes out the window if
there is some *other* way to reach the object.

?: is not really relevant, except in the sense that *any* expression
that ends up pointing to outside the object will lose the ordering
guarantee. ?: can be one such expression, but so can "p-p" or anything
like that.

And in *practice*, the only thing that needs to be sure to generate
special code is alpha, and there you'd just add the "rmb" after the
load. That is sufficient to fulfill the guarantees.

On ARM and powerpc, the compiler obviously has to guarantee that it
doesn't do value-speculation on the result, but again, that never
really had anything to do with the whole "carries a dependency", it is
really all about the fact that in order to guarantee the ordering, the
compiler mustn't generate that magical aliased pointer value. But if
the aliased pointer value comes from the *source* code, all bets are
off.

Now, even on alpha, the compiler can obviously move that "rmb" around.
For example, if there is a conditional after the
"atomic_read(mo_consume)", and the compiler can tell that the pointer
that got read by mo_consume is dead along one branch, then the
compiler can move the "rmb" to only exist in the other branch. Why?
Because we inherently guarantee only the order to any accesses to the
object the pointer pointed to, and that the pointer that got loaded is
the *only* way to get to that object (in this context), so if the
value is dead, then so is the ordering.

In fact, even if the value is *not* dead, but it is NULL, the compiler
can validly say "the NULL pointer cannot point to any object, so I
don't have to guarantee any serialization". So code like this (writing
alpha assembly, since in practice only alpha will ever care):

    ptr = atomic_read(pp, mo_consume);
    if (ptr) {
       ... do something with ptr ..
    }
    return ptr;

can validly be translated to:

    ldq $1,0($2)
    beq $1,branch-over
    rmb
    .. the do-something code using register $1 ..

because the compiler knows that a NULL pointer cannot be dereferenced,
so it can decide to put the rmb in the non-NULL path - even though the
pointer value is still *live* in the other branch (well, the liveness
of a constant value is somewhat debatable, but you get the idea), and
may be used by the caller (but since it is NULL, the "use" cannot
include accessing any object, only really testing).

So note how this is actually very different from the "carries
dependency" rule. It's simpler, and it allows much more natural
optimizations.

> o       Addition and subtraction to adjust both pointers to and indexes
>         into RCU-protected arrays.  There are not that many indexes,
>         and they could be converted to pointers, but the addition and
>         subtraction look necessary in some cases.

Addition and subtraction is fine, as long as they stay within the same
object/array.

And realistically, people violate the whole C pointer "same object"
rule all the time. Any time you implement a raw memory allocator,
you'll violate the C standard and you *will* basically be depending on
architecture-specific behavior. So everybody knows that the C "pointer
arithmetic has to stay within the object" is really a fairly made-up
but convenient shorthand for "sure, we know you'll do tricks on
pointer values, but they won't be portable and you may have to take
particular machine representations into account".

> o       Array indexing.  The value from rcu_dereference() is used both
>         before and inside the "[]", interestingly enough.

Well, in the C sense, or in the actual "integer index" sense? Because
technically, a[b] is nothing but *(a+b), so "inside" the "[]" is
strictly speaking meaningless. Inside and outside are just syntactic
sugar.

That said, I agree that the way I phrased things really limits things
to *just* pointers, and if you want to do RCU on integer values (that
get turned into pointers some other way), that would be outside the
spec.

But in practice, the code generation will *work* for non-pointers too
(the exception might be if the compiler actually does the above "NULL
is special, so I know I don't need to order wrt it". Exactly the way
that people do arithmetic operations on pointers that aren't really
covered by the standard (ie arithmetic 'and' to align them etc), the
code still *works*. It may be outside the standard, but everybody does
it, and one of the reasons C is so powerful is exactly that you *can*
do things like that. They won't be portable, and you have to know what
you are doing, but they don't stop working just because they aren't
covered by the standard.

>>  - the "restricted pointer" part means that the compiler does not need
>> to worry about serialization to that object through other possible
>> pointers - we have basically promised that the *only* pointer to that
>> object comes from the mo_consume. So that part makes it clear that the
>> "consume" ordering really only is valid wrt that particular pointer
>> load.
>
> That could work, though there are some cases where a multi-linked
> structure is made visible using a single rcu_assign_pointer(), and
> rcu_dereference() is used only for the pointer leading to that
> multi-linked structure, not for the pointers among the elements
> making up that structure.  One way to handle this would be to
> require rcu_dereference() to be used within the structure as well
> as upon first traversal to the structure.

I don't see that you would need to protect anything but the first
read, so I think you need rcu_dereference() only on the initial
pointer access.

On architectures where following a pointer is sufficient ordering,
we're fine. On alpha (and, if anybody ever makes that horrid mistake
again), the 'rmb' after the first access will guarantee that all the
later accesses will see the new list.

So basically, 'consume' will end up inserting a barrier (if necessary)
before following the pointer for the first time, but once that barrier
has been inserted, we are now fully ordered wrt the releasing store,
so we're done. No need to order *all* the accesses (even off the same
base pointer), only the first one.
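
A sketch of that claim in C11 form (the list type and "head" are made up):

    #include <stdatomic.h>

    struct node { struct node *next; int data; };
    _Atomic(struct node *) head;

    int sum_list(void)
    {
        /* Only this first load needs consume: the release store that
           published 'head' already ordered all the initializing
           stores to the entire list before it. */
        struct node *n = atomic_load_explicit(&head, memory_order_consume);
        int sum = 0;

        while (n) {
            sum += n->data;          /* plain loads suffice from here on */
            n = n->next;
        }
        return sum;
    }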

> But now it is time for some bullshit syntax for the RCU-protected arrays
> in the Linux kernel:
>
>         p = atomic_load_explicit(gp, memory_order_consume);
>         r1 = *p; /* Ordering maintained. */
>         r2 = p[5]; /* Ordering maintained? */

Oh yes. It's in the same object. If it wasn't, then "p[5]" itself
would be meaningless and outside the scope of the standard to begin
with.

>         r3 = p + 5; /* Ordering maintained? */
>         n = get_an_index();
>         r4 = p[n]; /* Ordering maintained? */

Yes. By definition. And again, for the very same reason. You would
violate *other* parts of the C standard before you violated the
suggested rule.

Also, remember: a lot of this is about "legalistic guarantees". In
practice, just the fact that a data chain *exists* is guarantee
enough, so from a code generation standpoint, NONE OF THIS MATTERS.

The exception is alpha, where a "mo_consume" load basically needs to
be generated with a "rmb" following it (see above how you can relax
that a _bit_, but not much). Trivial.
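
On alpha that amounts to something like the sketch below; "mb" is the
instruction the kernel's rmb() expands to on alpha (this is an
illustration, not actual kernel code):

    #include <stdatomic.h>

    struct obj;

    #define alpha_rmb() __asm__ __volatile__("mb" ::: "memory")

    struct obj *consume_load(_Atomic(struct obj *) *pp)
    {
        struct obj *p = atomic_load_explicit(pp, memory_order_relaxed);
        alpha_rmb();   /* order the load above against all later
                          dependent loads through 'p' */
        return p;
    }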

So code generation is actually *trivial*. Ordering is maintained
*automatically*, with no real effort on the side of the compiler.

The only issues really are:

 - the current C standard language implies *too much* ordering. The
whole "carries dependency" is broken. The practical example - even
without using "?:" or any conditionals or any function calls - is just
that

        ptr = atomic_read(pp, mo_consume);
        value = array[ptr-ptr];

  and there really isn't any sane ordering there. But the current
standard clearly says there is. The current standard is wrong.

 - my trick of saying that it is only ordered wrt accesses to the
object the pointer points to gets rid of that whole bogus false
dependency.

 - the "restricted pointer" thing is just legalistic crap, and has no
actual meaning for code generation, it is _literally_ just that if the
programmer does value speculation by hand:

        ptr = atomic_read(pp, mo_consume);
        value = object.val;
        if (ptr != &object)
                value = ptr->val;

    then in this situation the compiler does *not* need to worry about
the ordering to the "object.val" access, because it was gotten through
an alias.

So that "restrict" thing is really just a way to say "the ordering
only exists through that single pointer". That's not really what
restrict was _designed_ for, but to quote the standard:

 "An object that is accessed through a restrict-qualified pointer has a
special association
  with that pointer. This association, defined in 6.7.3.1 below,
requires that all accesses to
  that object use, directly or indirectly, the value of that particular pointer"

that's not the *definition* of "restrict", and quite frankly, the
actual language that talks about "restrict" really talks about being
able to know that nobody *modifies* the value through another pointer,
so I'm clearly stretching things a bit. But anybody reading the above
quote from the standard hopefully agrees that I'm not stretching it
all that much.

                 Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-23  3:50                                                                                             ` Linus Torvalds
@ 2014-02-23  6:34                                                                                               ` Paul E. McKenney
  2014-02-23 19:31                                                                                                 ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-23  6:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sat, Feb 22, 2014 at 07:50:35PM -0800, Linus Torvalds wrote:
> On Sat, Feb 22, 2014 at 4:39 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Agreed, by far the most frequent use is "->" to dereference and assignment
> > to store into a local variable.  The other operations where the kernel
> > expects ordering to be maintained are:
> >
> > o       Bitwise "&" to strip off low-order bits.  The FIB tree does
> >         this, for example in fib_table_lookup() in net/ipv4/fib_trie.c.
> >         The low-order bit is used to distinguish internal nodes from
> >         leaves -- nodes and leaves are different types of structures.
> >         (There are a few others.)
> 
> Note that this is very much outside the scope of the C standard,
> regardless of 'consume' or not.
> 
> We'll always do things outside the standard, so I wouldn't worry.

Fair enough.  Let me construct some examples to try to make sure that we
aren't using the same words for two different meanings:

	struct foo restrict *p;

	p = atomic_load_explicit(&gp, memory_order_consume);
	p = (struct foo *)((uintptr_t)p & ~(uintptr_t)0x3);
			/* OK because p has restrict property. */
	return p->a;	/* Ordered WRT above load. */

Of course, the compiler would have to work pretty hard to break this
ordering, so I am with you in not worrying too much about this one.

> > o       Uses "?:" to substitute defaults in case of NULL pointers,
> >         but ordering must be maintained in the non-default case.
> >         Most, perhaps all, of these could be converted to "if" should
> >         "?:" prove problematic.
> 
> Note that this doesn't actually affect the restrict/ordering rule in
> theory: "?:" isn't special according to those rules. The rules are
> fairly simple: we guarantee ordering only to the object that the
> pointer points to, and even that guarantee goes out the window if
> there is some *other* way to reach the object.
> 
> ?: is not really relevant, except in the sense that *any* expression
> that ends up pointing to outside the object will lose the ordering
> guarantee. ?: can be one such expression, but so can "p-p" or anything
> like that.
> 
> And in *practice*, the only thing that needs to be sure to generate
> special code is alpha, and there you'd just add the "rmb" after the
> load. That is sufficient to fulfill the guarantees.

OK, seems reasonable.  Reworking the earlier example:

	struct foo restrict *p;

	p = atomic_load_explicit(&gp, memory_order_consume);
	p = p ? p : &default_foo;
	return p->a;	/* Ordered WRT above load if p non-NULL. */

And the ordering makes sense only in the non-NULL case anyway,
so this should be fine.

> On ARM and powerpc, the compiler obviously has to guarantee that it
> doesn't do value-speculation on the result, but again, that never
> really had anything to do with the whole "carries a dependency", it is
> really all about the fact that in order to guarantee the ordering, the
> compiler mustn't generate that magical aliased pointer value. But if
> the aliased pointer value comes from the *source* code, all bets are
> off.

Agreed, compiler-based value speculation is dangerous in any case.
(Though the branch-predictor-based trick seems like it should be safe
on TSO systems like x86, s390, etc.)

> Now, even on alpha, the compiler can obviously move that "rmb" around.
> For example, if there is a conditional after the
> "atomic_read(mo_consume)", and the compiler can tell that the pointer
> that got read by mo_consume is dead along one branch, then the
> compiler can move the "rmb" to only exist in the other branch. Why?
> Because we inherently guarantee only the order to any accesses to the
> object the pointer pointed to, and that the pointer that got loaded is
> the *only* way to get to that object (in this context), so if the
> value is dead, then so is the ordering.

Yep!

> In fact, even if the value is *not* dead, but it is NULL, the compiler
> can validly say "the NULL pointer cannot point to any object, so I
> don't have to guarantee any serialization". So code like this (writing
> alpha assembly, since in practice only alpha will ever care):
> 
>     ptr = atomic_read(pp, mo_consume);
>     if (ptr) {
>        ... do something with ptr ..
>     }
>     return ptr;
> 
> can validly be translated to:
> 
>     ldq $1,0($2)
>     beq $1,branch-over
>     rmb
>     .. the do-something code using register $1 ..
> 
> because the compiler knows that a NULL pointer cannot be dereferenced,
> so it can decide to put the rmb in the non-NULL path - even though the
> pointer value is still *live* in the other branch (well, the liveness
> of a constant value is somewhat debatable, but you get the idea), and
> may be used by the caller (but since it is NULL, the "use" can not
> include accessing any object, only really testing)

Agreed.

> So note how this is actually very different from the "carries
> dependency" rule. It's simpler, and it allows much more natural
> optimizations.

I do have a couple more questions about it, but please see below.

> > o       Addition and subtraction to adjust both pointers to and indexes
> >         into RCU-protected arrays.  There are not that many indexes,
> >         and they could be converted to pointers, but the addition and
> >         subtraction looks necessary in some cases.
> 
> Addition and subtraction is fine, as long as they stay within the same
> object/array.
> 
> And realistically, people violate the whole C pointer "same object"
> rule all the time. Any time you implement a raw memory allocator,
> you'll violate the C standard and you *will* basically be depending on
> architecture-specific behavior. So everybody knows that the C "pointer
> arithmetic has to stay within the object" is really a fairly made-up
> but convenient shorthand for "sure, we know you'll do tricks on
> pointer values, but they won't be portable and you may have to take
> particular machine representations into account".

Adding and subtracting integers to/from a RCU-protected pointer makes
sense to me.

Adding and subtracting integers to/from an RCU-protected integer makes
sense in many practical cases, but I worry about the compiler figuring
out that the RCU-protected integer cancelled with some other integer.
I am beginning to suspect that the few uses of RCU-protected array
indexes should be converted to pointer form.
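
The worry, as a compilable sketch (names made up); the folding shown is
one any optimizing compiler is entitled to do:

    #include <stdatomic.h>

    _Atomic int gidx;
    int arr[64];

    int reader(int k)
    {
        int idx = atomic_load_explicit(&gidx, memory_order_consume);

        /* The compiler may fold "idx - idx + k" down to plain "k",
           after which the access no longer depends on the consume
           load in any way. */
        return arr[idx - idx + k];
    }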

I don't feel all that good about subtractions involving an RCU-protected
pointer and another pointer, again due to the possibility of arithmetic
optimizations cancelling everything.  But it is true that doing so is
perfectly safe in a number of situations.  For example, the following
should work, and might even be useful:

	p = atomic_load_explicit(&gp, memory_order_consume);
	q = gq + (p - gp_base);
	do_something_with(q->a);  /* No ordering, but corresponding element. */

But at that point, the RCU-protected array index seems nicer.  Give or
take my nervousness about arithmetic optimizations.

> > o       Array indexing.  The value from rcu_dereference() is used both
> >         before and inside the "[]", interestingly enough.
> 
> Well, in the C sense, or in the actual "integer index" sense? Because
> technically, a[b] is nothing but *(a+b), so "inside" the "[]" is
> strictly speaking meaningless. Inside and outside are just syntactic
> sugar.

Yep, understood that a[b] is identical to b[a] in classic C.  Just getting
nervous about RCU-protected integer indexes interacting with compiler
optimizations.  Perhaps needlessly so, but...

> That said, I agree that the way I phrased things really limits things
> to *just* pointers, and if you want to do RCU on integer values (that
> get turned into pointers some other way), that would be outside the
> spec.

That was why I asked.  ;-)

> But in practice, the code generation will *work* for non-pointers too
> (the exception might be if the compiler actually does the above "NULL
> is special, so I know I don't need to order wrt it". Exactly the way
> that people do arithmetic operations on pointers that aren't really
> covered by the standard (ie arithmetic 'and' to align them etc), the
> code still *works*. It may be outside the standard, but everybody does
> it, and one of the reasons C is so powerful is exactly that you *can*
> do things like that. They won't be portable, and you have to know what
> you are doing, but they don't stop working just because they aren't
> covered by the standard.

Yep, again modulo my nervousness about arithmetic optimizations on
RCU-protected integers.

> >>  - the "restricted pointer" part means that the compiler does not need
> >> to worry about serialization to that object through other possible
> >> pointers - we have basically promised that the *only* pointer to that
> >> object comes from the mo_consume. So that part makes it clear that the
> >> "consume" ordering really only is valid wrt that particular pointer
> >> load.
> >
> > That could work, though there are some cases where a multi-linked
> > structure is made visible using a single rcu_assign_pointer(), and
> > rcu_dereference() is used only for the pointer leading to that
> > multi-linked structure, not for the pointers among the elements
> > making up that structure.  One way to handle this would be to
> > require rcu_dereference() to be used within the structure as well
> > as upon first traversal to the structure.
> 
> I don't see that you would need to protect anything but the first
> read, so I think you need rcu_dereference() only on the initial
> pointer access.

Let me try an example:

	struct bar {
		int a;
		unsigned b;
	};

	struct foo {
		struct bar *next;
		char c;
	};

	struct bar bar1;
	struct foo foo1;
	
	struct foo *foop;

T1:	bar1.a = -1;
	bar1.b = 2;
	foo1.next = &bar1;
	foo1.c = '*';
	atomic_store_explicit(&foop, &foo1, memory_order_release);

T2:	struct foo restrict *p;
	struct bar *q;

	p = atomic_load_explicit(&foop, memory_order_consume);
	if (p == NULL)
		return -EDEALWITHIT;
	q = p->next;  /* Ordered with above load. */
	do_something(q->b); /* Non-decorated pointer, ordered? */

Yes, any reasonable code generation of the above will produce the ordering
on all systems that Linux runs on, understood.  But by your rules, if
we don't do something special for pointer q, how do we avoid either
(1) losing the ordering, (2) being back in the dependency-tracing business,
or (3) having to rely on the kindness of compiler writers?

> On architectures where following a pointer is sufficient ordering,
> we're fine. On alpha (and, if anybody ever makes that horrid mistake
> again), the 'rmb' after the first access will guarantee that all the
> later accesses will see the new list.
> 
> So basically, 'consume' will end up inserting a barrier (if necessary)
> before following the pointer for the first time, but once that barrier
> has been inserted, we are now fully ordered wrt the releasing store,
> so we're done. No need to order *all* the accesses (even off the same
> base pointer), only the first one.

I hear what you are saying, but it is sounding like #3 in my list above.  ;-)

Which is OK, except from the legalistic guarantees viewpoint you mention
below.  It might also allow compiler writers to do crazy optimizations
in general, but not on these restricted pointers.  This might improve
performance without risking messing up critical orderings.

> > But now it is time for some bullshit syntax for the RCU-protected arrays
> > in the Linux kernel:
> >
> >         p = atomic_load_explicit(gp, memory_order_consume);
> >         r1 = *p; /* Ordering maintained. */
> >         r2 = p[5]; /* Ordering maintained? */
> 
> Oh yes. It's in the same object. If it wasn't, then "p[5]" itself
> would be meaningless and outside the scope of the standard to begin
> with.
> 
> >         r3 = p + 5; /* Ordering maintained? */
> >         n = get_an_index();
> >         r4 = p[n]; /* Ordering maintained? */
> 
> Yes. By definition. And again, for the very same reason. You would
> violate *other* parts of the C standard before you violated the
> suggested rule.

Good!

> Also, remember: a lot of this is about "legalistic guarantees". In
> practice, just the fact that a data chain *exists* is guarantee
> enough, so from a code generation standpoint, NONE OF THIS MATTERS.

Completely understood.

> The exception is alpha, where a "mo_consume" load basically needs to
> be generated with a "rmb" following it (see above how you can relax
> that a _bit_, but not much). Trivial.
> 
> So code generation is actually *trivial*. Ordering is maintained
> *automatically*, with no real effort on the side of the compiler.
> 
> The only issues really are:
> 
>  - the current C standard language implies *too much* ordering. The
> whole "carries dependency" is broken. The practical example - even
> without using "?:" or any conditionals or any function calls - is just
> that
> 
>         ptr = atomic_read(pp, mo_consume);
>         value = array[ptr-ptr];
> 
>   and there really isn't any sane ordering there. But the current
> standard clearly says there is. The current standard is wrong.

Yep, not my idea, though I must take the blame for going along with it.

>  - my trick of saying that it is only ordered wrt accesses to the
> object the pointer points to gets rid of that whole bogus false
> dependency.

So I am still nervous about pointer "q" in my example above.

>  - the "restricted pointer" thing is just legalistic crap, and has no
> actual meaning for code generation, it is _literally_ just that if the
> programmer does value speculation by hand:
> 
>         ptr = atomic_read(pp, mo_consume);
>         value = object.val;
>         if (ptr != &object)
>                 value = ptr->val;
> 
>     then in this situation the compiler does *not* need to worry about
> the ordering to the "object.val" access, because it was gotten through
> an alias.

And hopefully to tell the compiler to go easy on crazy optimizations
in case where ptr->a or some such is really used.

> So that "restrict" thing is really just a way to say "the ordering
> only exists through that single pointer". That's not really what
> restrict was _designed_ for, but to quote the standard:
> 
>  "An object that is accessed through a restrict-qualified pointer has a
>   special association with that pointer. This association, defined in
>   6.7.3.1 below, requires that all accesses to that object use, directly
>   or indirectly, the value of that particular pointer"
> 
> that's not the *definition* of "restrict", and quite frankly, the
> actual language that talks about "restrict" really talks about being
> able to know that nobody *modifies* the value through another pointer,
> so I'm clearly stretching things a bit. But anybody reading the above
> quote from the standard hopefully agrees that I'm not stretching it
> all that much.

Well, based on my experience, the committee would pick some other
spelling anyway...  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-23  6:34                                                                                               ` Paul E. McKenney
@ 2014-02-23 19:31                                                                                                 ` Linus Torvalds
  2014-02-24  1:16                                                                                                   ` Paul E. McKenney
  2014-02-24 15:57                                                                                                   ` Linus Torvalds
  0 siblings, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-23 19:31 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sat, Feb 22, 2014 at 10:34 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Adding and subtracting integers to/from a RCU-protected pointer makes
> sense to me.

Ack. And that's normal "access to an object" behavior anyway.

> Adding and subtracting integers to/from an RCU-protected integer makes
> sense in many practical cases, but I worry about the compiler figuring
> out that the RCU-protected integer cancelled with some other integer.

I suspect that in practice, all the same normal rules apply - assuming
that there aren't any "aliasing" integers, and that the compiler
doesn't do any value prediction, the end result *should* be exactly
the same as for the pointer arithmetic. So even if the standard were
to talk about just pointers, I suspect that in practice, there really
isn't anything that can go wrong.

> I don't feel all that good about subtractions involving an RCU-protected
> pointer and another pointer, again due to the possibility of arithmetic
> optimizations cancelling everything.

Actually, pointer subtraction is a pretty useful operation, even
without the "gotcha" case of "p-p" just to force a fake dependency.
Getting an index, or indeed just getting an offset within a structure,
is valid and common, and people will and should do it. It doesn't
really matter for my suggested language: it should be perfectly fine
to do something like

   ptr = atomic_read(pp, mo_consume);
   index = ptr - array_base;
   .. pass off 'index' to some function, which then re-generates the
ptr using it ..

and the compiler will have no trouble generating code, and the
suggested "ordered wrt that object" guarantee is that the eventual
ordering is between the pointer load and the use in the function,
regardless of how it got there (ie turning it into an index and back
is perfectly fine).

So both from a legalistic language wording standpoint, and from a
compiler code generation standpoint, this is a non-issue.
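
A compilable sketch of that index round-trip (names made up):

    #include <stdatomic.h>
    #include <stddef.h>

    struct obj { int val; };
    struct obj array_base[16];
    _Atomic(struct obj *) pp;

    static int use_index(ptrdiff_t index)
    {
        /* The pointer is regenerated from the index; under the
           suggested rule this access is still ordered against the
           consume load, however the value travelled. */
        return array_base[index].val;
    }

    int reader(void)
    {
        struct obj *ptr = atomic_load_explicit(&pp, memory_order_consume);
        ptrdiff_t index = ptr - array_base;

        return use_index(index);
    }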

Now, if you actually lose information (ie some chain there drops
enough data from the pointer that it cannot be recovered, partially or
fully), and then "regenerate" the object value by faking it, and still
end up accessing the right data, but without actually going through
the pointer that you loaded, that falls under the
"restricted" heading, and you must clearly at least partially have
used other information. In which case the standard wording wouldn't
guarantee anything at all.
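
And a sketch of the kind of "regeneration" that falls outside the
guarantee (again, made-up names):

    #include <stdatomic.h>

    struct obj { int val; };
    struct obj the_obj;                 /* also reachable directly */
    _Atomic(struct obj *) pp;

    int reader(void)
    {
        struct obj *p = atomic_load_explicit(&pp, memory_order_consume);

        /* Drop the loaded value and fake the address back up from
           other information: the access below may well hit the right
           object, but no ordering is guaranteed for it. */
        struct obj *q = &the_obj;
        (void)p;

        return q->val;
    }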

>> I don't see that you would need to protect anything but the first
>> read, so I think you need rcu_dereference() only on the initial
>> pointer access.
>
> Let me try an example:
>
>         struct bar {
>                 int a;
>                 unsigned b;
>         };
>
>         struct foo {
>                 struct bar *next;
>                 char c;
>         };
>
>         struct bar bar1;
>         struct foo foo1;
>
>         struct foo *foop;
>
> T1:     bar1.a = -1;
>         bar1.b = 2;
>         foo1.next = &bar1;
>         foo1.c = '*';
>         atomic_store_explicit(&foop, &foo1, memory_order_release);

So here, the standard requires that the store with release is an
ordering to all preceding writes. So *all* writes to bar and foo are
ordered, despite the fact that the pointer just points to foo.

> T2:     struct foo restrict *p;
>         struct bar *q;
>
>         p = atomic_load_explicit(&foop, memory_order_consume);
>         if (p == NULL)
>                 return -EDEALWITHIT;
>         q = p->next;  /* Ordered with above load. */
>         do_something(q->b); /* Non-decorated pointer, ordered? */

So the theory is that a compiler *could* do some value speculation now
on "q", and with value speculation move the actual load of "q->p" up
to before "foop" was even loaded.

So in practice, I think we agree that this doesn't really affect
compiler writers (because they'd have to do a whole lot extra code to
break it intentionally, considering that they can't do
value-prediction on 'p'), and we just need to make sure to close the
hole in the language to make this safe, right?

Let me think about it some more, but my gut feel is that just tweaking
the definition of what "ordered" means is sufficient.

So to go back to the suggested ordering rules (ignoring the "restrict"
part, which is just to clarify that ordering through other means to
get to the object doesn't matter), I suggested:

 "the consume ordering guarantees the ordering between that
  atomic read and the accesses to the object that the pointer
  points to"

and I think the solution is to just say that this ordering acts as a
fence. It doesn't say exactly *where* the fence is, but it says that
there is *some* fence between the load of the pointer and any/all
accesses to the object through that pointer.

So with that definition, the memory accesses that are dependent on 'q'
will obviously be ordered. Now, they will *not* be ordered wrt the
load of q itself, but they will be ordered wrt the load from 'foop' -
because we've made it clear that there is a fence *somewhere* between
that atomic_load and the load of 'q'.

So that handles the ordering guarantee you worry about.

Now, the other worry is that this "fence" language would impose a
*stricter* ordering, and that by saying there is a fence, we'd now
constrain code generation on architectures like ARM and power, agreed?
And we do not want to do that, other than make it clear that the whole
"break the dependency chain through value speculation" cannot break it
past the load from 'foop'. Are we in agreement?

So we do *not* want that fence to limit re-ordering of independent
memory operations. All agreed?

But let's look at any independent chain, especially around that
consuming load from '&foop'. Say we have some other memory access
(adding them for visualization into your example):

        p = atomic_load_explicit(&foop, memory_order_consume);
        if (p == NULL)
                return -EDEALWITHIT;
        q = p->next;  /* Ordered with above load. */
+       a = *b; *c = d;
        do_something(q->b); /* Non-decorated pointer, ordered? */

and we are looking to make sure that those memory accesses can still
move around freely.

I'm claiming that they can, even with the "fence" language, because

 (a) we've said 'q' is restricted, so there is no aliasing between q
and the pointers b/c. So the compiler is free to move those accesses
around the "q = p->next" access.

 (b) once you've moved them to *before* the "q = p->next" access, the
fence no longer constrains you, because the guarantee is that there is
a fence *somewhere* between the consuming load and the accesses to
that object, but it could be *at* the access, so you're done.

So I think the "we guarantee there is an ordering *somewhere* between
the consuming load and the first ordered access" model actually gives
us the semantics we need.
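
Spelled out on the running example, one legal code generation under that
reading looks like this sketch (-1 stands in for -EDEALWITHIT; the fence
"placement" is conceptual, not an actual instruction on most targets):

    #include <stdatomic.h>

    struct bar { int a; unsigned b; };
    struct foo { struct bar *next; char c; };

    _Atomic(struct foo *) foop;
    extern int *b, *c, d;
    extern void do_something(unsigned);

    int t2(void)
    {
        struct foo *p = atomic_load_explicit(&foop, memory_order_consume);
        int a;

        if (p == NULL)
            return -1;
        a = *b; *c = d;           /* independent: free to move around */
        struct bar *q = p->next;  /* the promised fence may sit right
                                     here, *at* the first access ...  */
        do_something(q->b);       /* ... which still orders this load
                                     against the load from 'foop'     */
        return a;
    }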

(Note that you cannot necessarily move the accesses through the
pointers b/c past the consuming load, since the only non-aliasing
guarantee is about the pointer 'p', not about 'foop'. But you may use
other arguments to move them past that too)

But it's possible that the above argument doesn't really work. It is a
very off-the-cuff "my intuition says this should work" model.

            Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-23 19:31                                                                                                 ` Linus Torvalds
@ 2014-02-24  1:16                                                                                                   ` Paul E. McKenney
  2014-02-24  1:35                                                                                                     ` Linus Torvalds
  2014-02-24 15:57                                                                                                   ` Linus Torvalds
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-24  1:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sun, Feb 23, 2014 at 11:31:25AM -0800, Linus Torvalds wrote:
> On Sat, Feb 22, 2014 at 10:34 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Adding and subtracting integers to/from a RCU-protected pointer makes
> > sense to me.
> 
> Ack. And that's normal "access to an object" behavior anyway.

And covers things like container_of() as well as array indexing and
field selection, so good.
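
For reference, the kernel's container_of() is essentially the sketch
below (types made up); the offsetof() subtraction stays within the
enclosing object, so the suggested rule keeps the ordering intact:

    #include <stddef.h>

    #define container_of(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

    struct list_head { struct list_head *next; };
    struct item { int key; struct list_head link; };

    int key_of(struct list_head *lh)   /* lh reached via a consume load */
    {
        return container_of(lh, struct item, link)->key;
    }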

> > Adding and subtracting integers to/from an RCU-protected integer makes
> > sense in many practical cases, but I worry about the compiler figuring
> > out that the RCU-protected integer cancelled with some other integer.
> 
> I suspect that in practice, all the same normal rules apply - assuming
> that there aren't any "aliasing" integers, and that the compiler
> doesn't do any value prediction, the end result *should* be exactly
> the same as for the pointer arithmetic. So even if the standard were
> to talk about just pointers, I suspect that in practice, there really
> isn't anything that can go wrong.

Yep, in practice, in the absence of the "i-i" code, it should work.  As long
as the integer returned from the memory_order_consume load is always
kept in an integer tagged as "restrict", I believe that it should be
possible to make the standard guarantee it as well.

> > I don't feel all that good about subtractions involving an RCU-protected
> > pointer and another pointer, again due to the possibility of arithmetic
> > optimizations cancelling everything.
> 
> Actually, pointer subtraction is a pretty useful operation, even
> without the "gotcha" case of "p-p" just to force a fake dependency.
> Getting an index, or indeed just getting an offset within a structure,
> is valid and common, and people will and should do it. It doesn't
> really matter for my suggested language: it should be perfectly fine
> to do something like
> 
>    ptr = atomic_read(pp, mo_consume);
>    index = ptr - array_base;
>    .. pass off 'index' to some function, which then re-generates the
> ptr using it ..

I believe that the devil is in the details of the regeneration of the
pointer.  Yet another example below.

> and the compiler will have no trouble generating code, and the
> suggested "ordered wrt that object" guarantee is that the eventual
> ordering is between the pointer load and the use in the function,
> regardless of how it got there (ie turning it into an index and back
> is perfectly fine).
> 
> So both from a legalistic language wording standpoint, and from a
> compiler code generation standpoint, this is a non-issue.
> 
> Now, if you actually lose information (ie some chain there drops
> enough data from the pointer that it cannot be recovered, partially or
> fully), and then "regenerate" the object value by faking it, and still
> end up accessing the right data, but without actually going through
> the pointer that you loaded, that falls under the
> "restricted" heading, and you must clearly at least partially have
> used other information. In which case the standard wording wouldn't
> guarantee anything at all.

I agree in general, but believe that the pointer regeneration needs
to be done with care.  If you are adding the difference back into
the original variable ptr read from the RCU-protected pointer, I have
no problem with it.

My concern is with more elaborate cases.  For example:

	struct foo g1[N];
	struct bar g2[N];
	struct foo restrict *p1;
	struct bar *p2;
	int index;
	_Atomic(struct foo *) gp1;  /* assumed: RCU-protected pointer into g1 */

	p1 = atomic_load_explicit(&gp1, memory_order_consume);
	index = (p1 - g1) / 2;  /* Getting this into the BS syntax realm. */
	p2 = g2 + index;

The standard as currently worded would force ordering between the read
into p1 and the read into p2.  But it does so via dependency chains,
so this is starting to feel to me like we are getting back into the
business of tracing dependency chains.

Of course, one can argue that parallel arrays are not all that useful
in the Linux kernel, and one can argue that dividing indexes by two is
beyond the pale.  (Although heaps in dense arrays really do this sort of
index arithmetic, I am having some trouble imagining why RCU protection
would be useful in that case.  Yes, I could make up something about
replacing one heap locklessly with another, but...)

I might well be in my usual overly paranoid mode, but this is the sort
of thing that makes me a bit nervous.  I would rather either have it
bulletproof safe, or to say "don't do that".

Hmmm...  I don't recall seeing a use case for subtraction involving
an RCU-protected pointer and another pointer when I looked through
all the rcu_dereference() invocations in the Linux kernel.  Is there
a use case that I am missing?

> >> I don't see that you would need to protect anything but the first
> >> read, so I think you need rcu_dereference() only on the initial
> >> pointer access.
> >
> > Let me try an example:
> >
> >         struct bar {
> >                 int a;
> >                 unsigned b;
> >         };
> >
> >         struct foo {
> >                 struct bar *next;
> >                 char c;
> >         };
> >
> >         struct bar bar1;
> >         struct foo foo1;
> >
> >         struct foo *foop;
> >
> > T1:     bar1.a = -1;
> >         bar1.b = 2;
> >         foo1.next = &bar1;
> >         foo1.c = '*';
> >         atomic_store_explicit(&foop, &foo1, memory_order_release);
> 
> So here, the standard requires that the store with release is an
> ordering to all preceding writes. So *all* writes to bar and foo are
> ordered, despite the fact that the pointer just points to foo.

As long as T2 does some load from foop that extends T2's happens-before
chain, agreed.

> > T2:     struct foo restrict *p;
> >         struct bar *q;
> >
> >         p = atomic_load_explicit(&foop, memory_order_consume);
> >         if (p == NULL)
> >                 return -EDEALWITHIT;
> >         q = p->next;  /* Ordered with above load. */
> >         do_something(q->b); /* Non-decorated pointer, ordered? */
> 
> So the theory is that a compiler *could* do some value speculation now
> on "q", and with value speculation move the actual load of "q->p" up
> to before "foop" was even loaded.

That would be one concern.  The other concern is that the only reason I
believe that this is safe is because I trace the dependency chain from
"p" through "p->next" and through "q->b".  And because the more clearly
the standard states the rules the Linux kernel relies on, the easier it
is to get compiler writers to pay attention to bug reports.

It might be that adding "restrict" to the definition of "q" would address
my concerns.

> So in practice, I think we agree that this doesn't really affect
> compiler writers (because they'd have to do a whole lot extra code to
> break it intentionally, considering that they can't do
> value-prediction on 'p'), and we just need to make sure to close the
> hole in the language to make this safe, right?

Yep.  Unless the compiler is doing some "interesting" optimizations on
q, this should in practice be safe.  After all, we are relying on this
right now.  ;-)

> Let me think about it some more, but my gut feel is that just tweaking
> the definition of what "ordered" means is sufficient.
> 
> So to go back to the suggested ordering rules (ignoring the "restrict"
> part, which is just to clarify that ordering through other means to
> get to the object doesn't matter), I suggested:
> 
>  "the consume ordering guarantees the ordering between that
>   atomic read and the accesses to the object that the pointer
>   points to"
> 
> and I think the solution is to just say that this ordering acts as a
> fence. It doesn't say exactly *where* the fence is, but it says that
> there is *some* fence between the load of the pointer and any/all
> accesses to the object through that pointer.
> 
> So with that definition, the memory accesses that are dependent on 'q'
> will obviously be ordered. Now, they will *not* be ordered wrt the
> load of q itself, but they will be ordered wrt the load from 'foop' -
> because we've made it clear that there is a fence *somewhere* between
> that atomic_load and the load of 'q'.
> 
> So that handles the ordering guarantee you worry about.
> 
> Now, the other worry is that this "fence" language would impose a
> *stricter* ordering, and that by saying there is a fence, we'd now
> constrain code generation on architectures like ARM and power, agreed?
> And we do not want to do that, other than make it clear that the whole
> "break the dependency chain through value speculation" cannot break it
> past the load from 'foop'. Are we in agreement?
> 
> So we do *not* want that fence to limit re-ordering of independent
> memory operations. All agreed?

Yep!

> But let's look at any independent chain, especially around that
> consuming load from '&foop'. Say we have some other memory access
> (adding them for visualization into your example):
> 
>         p = atomic_load_explicit(&foop, memory_order_consume);
>         if (p == NULL)
>                 return -EDEALWITHIT;
>         q = p->next;  /* Ordered with above load. */
> +       a = *b; *c = d;
>         do_something(q->b); /* Non-decorated pointer, ordered? */
> 
> and we are looking to make sure that those memory accesses can still
> move around freely.
> 
> I'm claiming that they can, even with the "fence" language, because
> 
>  (a) we've said 'q' is restricted, so there is no aliasing between q
> and the pointers b/c. So the compiler is free to move those accesses
> around the "q = p->next" access.

Ah, if I understand you, very good!

My example intentionally left "q" -not- restricted.  I have to think about
it more, but I believe that I am OK requiring that "q" be restricted
in order to have the ordering guarantees.  In other words, leaving "q"
unrestricted would with extremely high probability get you the ordering
guarantees in practice, but would not be absolutely guaranteed according
to the standard.  Changing the above example to mark "q" restricted
gets you the guarantee both in practice and according to the standard.

Is that where you are going with this?

>  (b) once you've moved them to *before* the "q = p->next" access, the
> fence no longer constrains you, because the guarantee is that there is
> a fence *somewhere* between the consuming load and the accesses to
> that object, but it could be *at* the access, so you're done.

Plus the pointer "p" is restricted, so the reasoning above should
also allow moving the assignments to b/c to before the assignment to
"p", right?

> So I think the "we guarantee there is an ordering *somewhere* between
> the consuming load and the first ordered access" model actually gives
> us the semantics we need.
> 
> (Note that you cannot necessarily move the accesses through the
> pointers b/c past the consuming load, since the only non-aliasing
> guarantee is about the pointer 'p', not about 'foop'. But you may use
> other arguments to move them past that too)
> 
> But it's possible that the above argument doesn't really work. It is a
> very off-the-cuff "my intuition says this should work" model.

Yes, we definitely will need to run it past Peter Sewell and Mark Batty
once we have it fully nailed down to see if it survives formalization.  ;-)

But expanding the example one step further:

+	char ch;

        p = atomic_load_explicit(&foop, memory_order_consume);
        if (p == NULL)
                return -EDEALWITHIT;
        q = p->next;  /* Ordered with above load. */
	ch = p->c; /* Ordered with above load. */
        a = *b; *c = d;
        do_something(q->b); /* Non-decorated pointer, ordered? */
+	r1 = somearray[ch]; /* "ch" is not restricted, no guarantee. */

Again, given current compilers and hardware, the load into r1 would be
ordered, and the current unloved standard would guarantee that as well
(at great pain to compiler writers, I hasten to add).

But I would be OK with this guarantee not being ironclad as far as a new
version of the standard is concerned.  If you needed the standard to guarantee
this ordering, you could write

	char restrict ch;

instead of just plain "char ch".

Does that fit into the approach you are thinking of?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24  1:16                                                                                                   ` Paul E. McKenney
@ 2014-02-24  1:35                                                                                                     ` Linus Torvalds
  2014-02-24  4:59                                                                                                       ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-24  1:35 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sun, Feb 23, 2014 at 5:16 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>>
>>  (a) we've said 'q' is restricted, so there is no aliasing between q
>> and the pointers b/c. So the compiler is free to move those accesses
>> around the "q = p->next" access.
>
> Ah, if I understand you, very good!
>
> My example intentionally left "q" -not- restricted.

No, I 100% agree with that. "q" is *not* restricted. But "p" is, since
it came from that consuming load.

But "q = p->next" is ordered by how something can alias "p->next", not by 'q'!

There is no need to restrict anything but 'p' for all of this to work.

Btw, it's also worth pointing out that I do *not* in any way expect
people to actually write the "restrict" keyword anywhere. So no need
to change source code.

What you have is a situation where the pointer coming out of the
memory_order_consume is restricted. But if you assign it to a
non-restricted pointer, that's *fine*. That's perfectly normal C
behavior. The "restrict" concept is not something that the programmer
needs to worry about or ever even notice, it's basically just a
promise to the compiler that "if somebody has another pointer lying
around, accesses through that other pointer do not require ordering".

So it sounds like you believe that the programmer would mark things
"restrict", and I did not mean that at all.

             Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24  1:35                                                                                                     ` Linus Torvalds
@ 2014-02-24  4:59                                                                                                       ` Paul E. McKenney
  2014-02-24  5:25                                                                                                         ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-24  4:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sun, Feb 23, 2014 at 05:35:28PM -0800, Linus Torvalds wrote:
> On Sun, Feb 23, 2014 at 5:16 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >>
> >>  (a) we've said 'q' is restricted, so there is no aliasing between q
> >> and the pointers b/c. So the compiler is free to move those accesses
> >> around the "q = p->next" access.
> >
> > Ah, if I understand you, very good!
> >
> > My example intentionally left "q" -not- restricted.
> 
> No, I 100% agree with that. "q" is *not* restricted. But "p" is, since
> it came from that consuming load.
> 
> But "q = p->next" is ordered by how something can alias "p->next", not by 'q'!
> 
> There is no need to restrict anything but 'p' for all of this to work.

I cannot say I understand this last sentence right now from the viewpoint
of the standard, but suspending disbelief for the moment...

(And yes, given current compilers and CPUs, I agree that this should
all work in practice.  My concern is the legality, not the reality.)

> Btw, it's also worth pointing out that I do *not* in any way expect
> people to actually write the "restrict" keyword anywhere. So no need
> to change source code.

Understood -- in this variant, you are taking the marking from the
fact that there was an assignment from a memory_order_consume load
rather than from a keyword on the assigned-to variable's declaration.

> What you have is a situation where the pointer coming out of the
> memory_order_consume is restricted. But if you assign it to a
> non-restricted pointer, that's *fine*. That's perfectly normal C
> behavior. The "restrict" concept is not something that the programmer
> needs to worry about or ever even notice, it's basically just a
> promise to the compiler that "if somebody has another pointer lying
> around, accesses through that other pointer do not require ordering".
> 
> So it sounds like you believe that the programmer would mark things
> "restrict", and I did not mean that at all.

Indeed I did believe that.

I must confess that I was looking for an easy way to express in
standardese -exactly- where the ordering guarantee did and did
not propagate.

The thing is that the vast majority of the Linux-kernel RCU code is more
than happy with the guarantee only applying to fetches via the pointer
returned from the memory_order_consume load.  There are relatively few
places where groups of structures are made visible to RCU readers via
a single rcu_assign_pointer().  I guess I need to actually count them.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24  4:59                                                                                                       ` Paul E. McKenney
@ 2014-02-24  5:25                                                                                                         ` Linus Torvalds
  0 siblings, 0 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-24  5:25 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sun, Feb 23, 2014 at 8:59 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Sun, Feb 23, 2014 at 05:35:28PM -0800, Linus Torvalds wrote:
>>
>> But "q = p->next" is ordered by how something can alias "p->next", not by 'q'!
>>
>> There is no need to restrict anything but 'p' for all of this to work.
>
> I cannot say I understand this last sentence right now from the viewpoint
> of the standard, but suspending disbelief for the moment...

So 'p' is what comes from that consuming load that returns a
'restrict' pointer. That doesn't affect 'q' in any way.

But the act of initializing 'q' by dereferencing p (in "p->next") is -
by virtue of the restrict - something that the compiler can see cannot
alias with anything else, so the compiler could re-order other memory
accesses freely around it, if you see what I mean.

Modulo all the *other* ordering guarantees, of course. So other
atomics and volatiles etc may have their own rules, quite apart from
any aliasing issues.

> Understood -- in this variant, you are taking the marking from the
> fact that there was an assignment from a memory_order_consume load
> rather than from a keyword on the assigned-to variable's declaration.

Yes, and to me, it's really just a legalistic trick to make it clear
that any *other* pointer that  happens to point to the same object
cannot be dereferenced within scope of the result of the
atomic_read(mo_consume), at least not if you expect to get the memory
ordering semantics. You can do it, but then you violate the guarantee
of the restrict, and you get what you get - a potential non-ordering.

So if somebody just immediately assigns the value to a normal
(non-restrict) pointer nothing *really* changes. It's just there to
describe the guarantees.

              Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-21 19:13                                                                                             ` Paul E. McKenney
  2014-02-21 22:10                                                                                               ` Joseph S. Myers
@ 2014-02-24 13:55                                                                                               ` Michael Matz
  2014-02-24 17:40                                                                                                 ` Paul E. McKenney
  2014-02-26 13:04                                                                                               ` Torvald Riegel
  2 siblings, 1 reply; 299+ messages in thread
From: Michael Matz @ 2014-02-24 13:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

Hi,

On Fri, 21 Feb 2014, Paul E. McKenney wrote:

> > And with conservative I mean "everything is a source of a dependency, and 
> > hence can't be removed, reordered or otherwise fiddled with", and that 
> > includes code sequences where no atomic objects are anywhere in sight [1].
> > In the light of that the only realistic way (meaning to not have to 
> > disable optimization everywhere) to implement consume as currently 
> > specified is to map it to acquire.  At which point it becomes pointless.
> 
> No, only memory_order_consume loads and [[carries_dependency]]
> function arguments are sources of dependency chains.

I don't see [[carries_dependency]] in the C11 final draft (yeah, should 
get a real copy, I know, but let's assume it's the same language as the 
standard).  Therefore, yes, only consume loads are sources of 
dependencies.  The problem with the definition of the "carries a 
dependency" relation is not the sources, but rather where it stops.  
It's transitively closed over "value of evaluation A is used as operand in 
evaluation B", with very few exceptions as per 5.1.2.4#14.  Evaluations 
can contain function calls, so if there's _any_ chance that an operand of 
an evaluation might even indirectly use something resulting from a consume 
load then that evaluation must be compiled in a way to not break 
dependency chains.

I don't see a way to generally assume that e.g. the value of a function
argument cannot possibly result from a consume load; therefore the compiler
must assume that all function arguments _can_ result from such loads, and
so must disable all depchain-breaking optimizations (of which there are many).

> > [1] Simple example of what type of transformations would be disallowed:
> > 
> > int getzero (int i) { return i - i; }
> 
> This needs to be as follows:
> 
> [[carries_dependency]] int getzero(int i [[carries_dependency]])
> {
> 	return i - i;
> }
> 
> Otherwise dependencies won't get carried through it.

So, with the above do you agree that in absence of any other magic (see
below) the compiler is not allowed to transform my initial getzero() 
(without the carries_dependency markers) implementation into "return 0;" 
because of the C11 rules for "carries-a-dependency"?

If so, do you then also agree that the specification of "carries a 
dependency" is somewhat, err, shall we say, overbroad?

> > depchains don't matter, could _then_ optimize it to zero.  But that's
> > insane, especially considering that it's hard to detect if a given context 
> > doesn't care for depchains, after all the depchain relation is constructed 
> > exactly so that it bleeds into nearly everywhere.  So we would most of 
> > the time have to assume that the ultimate context will be depchain-aware 
> > and therefore disable many transformations.
> 
> Any function that does not contain a memory_order_consume load and that 
> doesn't have any arguments marked [[carries_dependency]] can be 
> optimized just as before.

And as no such marker exists, we must conservatively assume that it's
on _all_ parameters, so I'll stand by my claim.

> > Then inlining getzero would merely add another "# j.dep = i.dep" 
> > relation, so depchains are still there but the value optimization can 
> > happen before inlining.  Having to do something like that I'd find 
> > disgusting, and rather rewrite consume into acquire :)  Or make the 
> > depchain relation somehow realistically implementable.
> 
> I was actually OK with arithmetic cancellation breaking the dependency 
> chains.  Others on the committee felt otherwise, and I figured that (1) 
> I wouldn't be writing that kind of function anyway and (2) they knew 
> more about writing compilers than I.  I would still be OK saying that 
> things like "i-i", "i*0", "i%1", "i&0", "i|~0" and so on just break the 
> dependency chain.

Exactly.  I can see the problem that people had with that, though.  There 
are very many ways to write concealed zeros (or generally neutral elements
of the function in question).  My getzero() function is one (it could e.g.
be an assembler implementation).  The allowance to break dependency chains
would have to apply to such cancellation as well, and so couldn't simply
itemize all cases in which cancellation is allowed.  Rather it would have
had to argue about something like "value dependency", ala "evaluation B 
depends on A, if there exist at least two different values A1 and A2 
(results from A), for which evaluation B (with otherwise same operands) 
yields different values B1 and B2".

Alas, it doesn't, except if you want to understand the term "the value of
A is used as an operand of B" in that way.  Even then you'd still have the
second case of the depchain definition, via intermediate (not even atomic)
memory stores and loads that make two evaluations ordered per
carries-a-dependency.

And even that understanding of "is used" wouldn't be enough, because there 
are cases where the cancellation happens in steps, and where it interacts 
with the third clause (transitiveness):  Assume this:

  a = something()  // evaluation A
  b = 1 - a        // evaluation B
  c = a - 1 + b    // evaluation C

Now, clearly B depends on A.  Also C depends on B (because with otherwise 
same operands changing just B also changes C), and because of transitiveness C 
then also depends on A.  But equally clearly C was just an elaborate way to 
write "0", and so depends on nothing.  The problem was of course that A 
and B weren't independent when determining the dependencies of C.  But 
allowing cancellation to break dependency chains would have to allow for 
these cases as well.

So, now, that leaves us basically with depchains forcing us to disable 
many useful transformations, or finding some other magic.  One would be to 
just regard all consume loads as acquire loads and be done (and 
effectively remove the ill-advised "carries a dependency" relation from 
consideration).

You say downthread that it'd also be possible to just emit barriers before 
all function calls (I say "all" because the compiler will generally 
have applied some transformation that broke depchains if they existed).  
That seems to me to be a bigger hammer than just ignoring depchains and 
emitting acquires instead of consumes (because the latter changes only exactly 
where atomics are used, while the former seems to me to have unbounded effect).

So, am I still missing something, or is my understanding of the 
carries-a-dependency relation correct and my conclusions merely too 
pessimistic?


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-23 19:31                                                                                                 ` Linus Torvalds
  2014-02-24  1:16                                                                                                   ` Paul E. McKenney
@ 2014-02-24 15:57                                                                                                   ` Linus Torvalds
  2014-02-24 16:27                                                                                                     ` Richard Biener
  2014-02-24 17:21                                                                                                     ` Paul E. McKenney
  1 sibling, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-24 15:57 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sun, Feb 23, 2014 at 11:31 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Let me think about it some more, but my gut feel is that just tweaking
> the definition of what "ordered" means is sufficient.
>
> So to go back to the suggested ordering rules (ignoring the "restrict"
> part, which is just to clarify that ordering through other means to
> get to the object doesn't matter), I suggested:
>
>  "the consume ordering guarantees the ordering between that
>   atomic read and the accesses to the object that the pointer
>   points to"
>
> and I think the solution is to just say that this ordering acts as a
> fence. It doesn't say exactly *where* the fence is, but it says that
> there is *some* fence between the load of the pointer and any/all
> accesses to the object through that pointer.

I'm wrong. That doesn't work. At all. There is no ordering except
through the pointer chain.

So I think saying just that, and nothing else (no magic fences, no
nothing) is the right thing:

 "the consume ordering guarantees the ordering between that
  atomic read and the accesses to the object that the pointer
  points to directly or indirectly through a chain of pointers"

The thing is, anything but a chain of pointers (and maybe relaxing it
to "indexes in tables" in addition to pointers) doesn't really work.

The current standard tries to break it at "obvious" points that can
lose the data dependency (either by turning it into a control
dependency, or by just dropping the value, like the left-hand side of
a comma-expression), but the fact is, it's broken.

It's broken not just because the value can be lost other ways (ie the
"p-p" example), it's broken because the value can be turned into a
control dependency so many other ways too.

Compilers regularly turn arithmetic ops with logical comparisons into
branches. So an expression like "a = !!ptr" carries a dependency in
the current C standard, but it's entirely possible that a compiler
ends up turning it into a compare-and-branch rather than a
compare-and-set-conditional, depending on just exactly how "a" ends up
being used. That's true even on an architecture like ARM that has a
lot of conditional instructions (there are way less if you compile for
Thumb, for example, but compilers also do things like "if there are
more than N predicated instructions I'll just turn it into a
branch-over instead").

So I think the C standard needs to just explicitly say that you can
walk a chain of pointers (with that possible "indexes in arrays"
extension), and nothing more.

                    Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 15:57                                                                                                   ` Linus Torvalds
@ 2014-02-24 16:27                                                                                                     ` Richard Biener
  2014-02-24 16:37                                                                                                       ` Linus Torvalds
  2014-02-24 17:21                                                                                                     ` Paul E. McKenney
  1 sibling, 1 reply; 299+ messages in thread
From: Richard Biener @ 2014-02-24 16:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 4:57 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sun, Feb 23, 2014 at 11:31 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Let me think about it some more, but my gut feel is that just tweaking
>> the definition of what "ordered" means is sufficient.
>>
>> So to go back to the suggested ordering rules (ignoring the "restrict"
>> part, which is just to clarify that ordering through other means to
>> get to the object doesn't matter), I suggested:
>>
>>  "the consume ordering guarantees the ordering between that
>>   atomic read and the accesses to the object that the pointer
>>   points to"
>>
>> and I think the solution is to just say that this ordering acts as a
>> fence. It doesn't say exactly *where* the fence is, but it says that
>> there is *some* fence between the load of the pointer and any/all
>> accesses to the object through that pointer.
>
> I'm wrong. That doesn't work. At all. There is no ordering except
> through the pointer chain.
>
> So I think saying just that, and nothing else (no magic fences, no
> nothing) is the right thing:
>
>  "the consume ordering guarantees the ordering between that
>   atomic read and the accesses to the object that the pointer
>   points to directly or indirectly through a chain of pointers"

To me that reads like

  int i;
  int *q = &i;
  int **p = &q;

  atomic_XXX (p, CONSUME);

orders against accesses '*p', '**p', '*q' and 'i'.  Thus it seems they
want to say that it orders against aliased storage - but then go further
and include "indirectly through a chain of pointers"?!  Thus an
atomic read of an int * orders against any 'int' memory operation but
not against 'float' memory operations?

Eh ...

Just jumping in to throw in my weird-2-cents.

Richard.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 16:27                                                                                                     ` Richard Biener
@ 2014-02-24 16:37                                                                                                       ` Linus Torvalds
  2014-02-24 16:40                                                                                                         ` Linus Torvalds
  2014-02-24 16:55                                                                                                         ` Michael Matz
  0 siblings, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-24 16:37 UTC (permalink / raw)
  To: Richard Biener
  Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 8:27 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
>
> To me that reads like
>
>   int i;
>   int *q = &i;
>   int **p = &q;
>
>   atomic_XXX (p, CONSUME);
>
> orders against accesses '*p', '**p', '*q' and 'i'.  Thus it seems they
> want to say that it orders against aliased storage - but then go further
> and include "indirectly through a chain of pointers"?!  Thus an
> atomic read of an int * orders against any 'int' memory operation but
> not against 'float' memory operations?

No, it's not about type at all, and the "chain of pointers" can be
much more complex than that, since the "int *" can point to within an
object that contains other things than just that "int" (the "int" can
be part of a structure that then has pointers to other structures
etc).

So in your example,

    ptr = atomic_read(p, CONSUME);

would indeed order against the subsequent access of the chain through
*that* pointer (the whole "restrict" thing that I left out as a
separate thing, which was probably a mistake), but certainly not
against any integer pointer, and certainly not against any aliasing
pointer chains.

So yes, the atomic_read() would be ordered wrt '*ptr' (getting 'q')
_and_ '**ptr' (getting 'i'), but nothing else - including just the
aliasing access of dereferencing 'i' directly.

            Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 16:37                                                                                                       ` Linus Torvalds
@ 2014-02-24 16:40                                                                                                         ` Linus Torvalds
  2014-02-24 16:55                                                                                                         ` Michael Matz
  1 sibling, 0 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-24 16:40 UTC (permalink / raw)
  To: Richard Biener
  Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 8:37 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So yes, the atomic_read() would be ordered wrt '*ptr' (getting 'q')
> _and_ '**ptr' (getting 'i'), but nothing else - including just the
> aliasing access of dereferencing 'i' directly.

Btw, what CPU architects and memory ordering guys tend to do in
documentation is give a number of "litmus test" pseudo-code sequences
to show the effects and intent of the language.

I think giving those kinds of litmus tests for both "this is ordered"
and "this is not ordered" cases like the above is would be a great
clarification. Partly because the language is going to be somewhat
legalistic and thus hard to wrap your mind around, and partly to
really hit home the *intent* of the language, which I think is
actually fairly clear to both compiler writers and to programmers.
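
For example (a sketch, not from any standard; "gp", "global" and the
field names are made-up):

    /* Ordered: accesses that walk a pointer chain from the load. */
    p = atomic_load_explicit(&gp, memory_order_consume);
    a = p->val;           /* ordered after the load of gp */
    b = p->next->val;     /* still ordered: chain of pointers */

    /* Not ordered: the loaded value is cancelled away. */
    q = p - p + &global;  /* q carries no information from p... */
    c = q->val;           /* ...so this is NOT ordered wrt the load */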

               Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 16:37                                                                                                       ` Linus Torvalds
  2014-02-24 16:40                                                                                                         ` Linus Torvalds
@ 2014-02-24 16:55                                                                                                         ` Michael Matz
  2014-02-24 17:28                                                                                                           ` Paul E. McKenney
  2014-02-24 17:38                                                                                                           ` Linus Torvalds
  1 sibling, 2 replies; 299+ messages in thread
From: Michael Matz @ 2014-02-24 16:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Richard Biener, Paul McKenney, Torvald Riegel, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

Hi,

On Mon, 24 Feb 2014, Linus Torvalds wrote:

> > To me that reads like
> >
> >   int i;
> >   int *q = &i;
> >   int **p = &q;
> >
> >   atomic_XXX (p, CONSUME);
> >
> > orders against accesses '*p', '**p', '*q' and 'i'.  Thus it seems they
> > want to say that it orders against aliased storage - but then go further
> > and include "indirectly through a chain of pointers"?!  Thus an
> > atomic read of an int * orders against any 'int' memory operation but
> > not against 'float' memory operations?
> 
> No, it's not about type at all, and the "chain of pointers" can be
> much more complex than that, since the "int *" can point to within an
> object that contains other things than just that "int" (the "int" can
> be part of a structure that then has pointers to other structures
> etc).

So, let me try to poke holes into your definition or increase my 
understanding :) .  You said "chain of pointers"(dereferences I assume), 
e.g. if p is result of consume load, then access to 
p->here->there->next->prev->stuff is supposed to be ordered with that load 
(or only when that last load/store itself is also an atomic load or 
store?).

So, what happens if the pointer deref chain is partly hidden in some 
functions:

A * adjustptr (B *ptr) { return &ptr->here->there->next; }
B * p = atomic_XXX (&somewhere, consume);
adjustptr(p)->prev->stuff = bla;

As far as I understood you, this whole ptrderef chain business would be 
only an optimization opportunity, right?  So if the compiler can't be sure 
how p is actually used (as in my function-using case, assume adjustptr is 
defined in another unit), then the consume load would simply be 
transformed into an acquire (or whatever, with some barrier I mean)?  Only 
_if_ the compiler sees all obvious uses of p (indirectly through pointer 
derefs) can it, yeah, do what with the consume load?


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 15:57                                                                                                   ` Linus Torvalds
  2014-02-24 16:27                                                                                                     ` Richard Biener
@ 2014-02-24 17:21                                                                                                     ` Paul E. McKenney
  2014-02-24 18:14                                                                                                       ` Linus Torvalds
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-24 17:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 07:57:24AM -0800, Linus Torvalds wrote:
> On Sun, Feb 23, 2014 at 11:31 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Let me think about it some more, but my gut feel is that just tweaking
> > the definition of what "ordered" means is sufficient.
> >
> > So to go back to the suggested ordering rules (ignoring the "restrict"
> > part, which is just to clarify that ordering through other means to
> > get to the object doesn't matter), I suggested:
> >
> >  "the consume ordering guarantees the ordering between that
> >   atomic read and the accesses to the object that the pointer
> >   points to"
> >
> > and I think the solution is to just say that this ordering acts as a
> > fence. It doesn't say exactly *where* the fence is, but it says that
> > there is *some* fence between the load of the pointer and any/all
> > accesses to the object through that pointer.
> 
> I'm wrong. That doesn't work. At all. There is no ordering except
> through the pointer chain.
> 
> So I think saying just that, and nothing else (no magic fences, no
> nothing) is the right thing:
> 
>  "the consume ordering guarantees the ordering between that
>   atomic read and the accesses to the object that the pointer
>   points to directly or indirectly through a chain of pointers"
> 
> The thing is, anything but a chain of pointers (and maybe relaxing it
> to "indexes in tables" in addition to pointers) doesn't really work.
> 
> The current standard tries to break it at "obvious" points that can
> lose the data dependency (either by turning it into a control
> dependency, or by just dropping the value, like the left-hand side of
> a comma-expression), but the fact is, it's broken.
> 
> It's broken not just because the value can be lost other ways (ie the
> "p-p" example), it's broken because the value can be turned into a
> control dependency so many other ways too.
> 
> Compilers regularly turn arithmetic ops with logical comparisons into
> branches. So an expression like "a = !!ptr" carries a dependency in
> the current C standard, but it's entirely possible that a compiler
> ends up turning it into a compare-and-branch rather than a
> compare-and-set-conditional, depending on just exactly how "a" ends up
> being used. That's true even on an architecture like ARM that has a
> lot of conditional instructions (there are way less if you compile for
> Thumb, for example, but compilers also do things like "if there are
> more than N predicated instructions I'll just turn it into a
> branch-over instead").
> 
> So I think the C standard needs to just explicitly say that you can
> walk a chain of pointers (with that possible "indexes in arrays"
> extension), and nothing more.

I am comfortable with this.  My desire for also marking the later
pointers does not make sense without some automated way of validating
them, which I don't immediately see a way to do.

So let me try laying out the details.  Sticking with pointers for the
moment, if we reach agreement on these, I will try expanding to integers.

1.	A pointer value obtained from a memory_order_consume load is part
	of a pointer chain.  I am calling the pointer itself a "chained
	pointer" for the moment.
	
2.	Note that it is the value that qualifies as being chained, not
	the variable.  For example, a given pointer variable might hold
	a chained pointer at one point in the code, then a non-chained
	pointer later.  Therefore, "q = p", where "q" is a pointer and
	"p" is a chained pointer, results in "q" containing a chained
	pointer.

3.	Adding or subtracting an integer to/from a chained pointer
	results in another chained pointer in that same pointer chain.

4.	Bitwise operators ("&", "|", "^", and I suppose also "~")
	applied to a chained pointer and an integer results in another
	chained pointer in that same pointer chain.

5.	Consider a sequence as follows: dereference operator (unary "*",
	"[]", "->") optionally followed by a series of direct selection
	operators ("."), finally (and unconditionally) followed by
	a unary "&" operator.  Applying such a sequence to a chained
	pointer results in another chained pointer in the same chain.

	Given a chained pointer "p", examples include "&p[3]",
	"&p->a.b.c.d.e.f.g", and "&*p".

6.	The expression "p->f", where "p" is a chained pointer and "f"
	is a pointer, results in a chained pointer.

	FWIW, this means that pointer chains can overlap as in this
	example:
	
		p = atomic_load_explicit(&gp, memory_order_consume);
		q = atomic_load_explicit(&p->ap, memory_order_consume);
		x = q->a;

	This should be fine, I don't see any problems with this.

7.	Applying a pointer cast to a chained pointer results in a
	chained pointer.

8.	Applying any of the following operators to a chained pointer
	results in something that is not a chained pointer:
	"()", sizeof, "!", "*", "/", "%", ">>", "<<", "<", ">", "<=",
	">=", "==", "!=", "&&", and "||".

9.	The effect of the compound assignment operators "+=", "-=",
	and so on is the same as the equivalent expression using
	simple assignment.

10.	In a "?:" operator, if the selected one of the rightmost two
	values is a chained pointer, then the result is also a
	chained pointer.

11.	In a "," operator, if the rightmost value is a chained pointer,
	then the result is also a chained pointer.

12.	A memory_order_consume load carries a dependency to any
	dereference operator (unary "*", "[]", and "->") in the
	resulting pointer chain.

I think that covers everything...
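
For example, pulling several of these rules together (a sketch; the
variable and field names are made up):

	p = atomic_load_explicit(&gp, memory_order_consume); /* #1 */
	q = p;           /* #2: the value is chained, not the variable */
	r = &q->a.b;     /* #5: still chained                          */
	s = r->next;     /* #6: chained ("next" is a pointer field)    */
	x = s->data;     /* #12: this dereference is ordered after the */
	                 /*      memory_order_consume load             */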

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 16:55                                                                                                         ` Michael Matz
@ 2014-02-24 17:28                                                                                                           ` Paul E. McKenney
  2014-02-24 17:57                                                                                                             ` Paul E. McKenney
  2014-02-26 17:39                                                                                                             ` Torvald Riegel
  2014-02-24 17:38                                                                                                           ` Linus Torvalds
  1 sibling, 2 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-24 17:28 UTC (permalink / raw)
  To: Michael Matz
  Cc: Linus Torvalds, Richard Biener, Torvald Riegel, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote:
> Hi,
> 
> On Mon, 24 Feb 2014, Linus Torvalds wrote:
> 
> > > To me that reads like
> > >
> > >   int i;
> > >   int *q = &i;
> > >   int **p = &q;
> > >
> > >   atomic_XXX (p, CONSUME);
> > >
> > > orders against accesses '*p', '**p', '*q' and 'i'.  Thus it seems they
> > > want to say that it orders against aliased storage - but then go further
> > > and include "indirectly through a chain of pointers"?!  Thus an
> > > atomic read of an int * orders against any 'int' memory operation but
> > > not against 'float' memory operations?
> > 
> > No, it's not about type at all, and the "chain of pointers" can be
> > much more complex than that, since the "int *" can point to within an
> > object that contains other things than just that "int" (the "int" can
> > be part of a structure that then has pointers to other structures
> > etc).
> 
> So, let me try to poke holes into your definition or increase my 
> understanding :) .  You said "chain of pointers"(dereferences I assume), 
> e.g. if p is result of consume load, then access to 
> p->here->there->next->prev->stuff is supposed to be ordered with that load 
> (or only when that last load/store itself is also an atomic load or 
> store?).
> 
> So, what happens if the pointer deref chain is partly hidden in some 
> functions:
> 
> A * adjustptr (B *ptr) { return &ptr->here->there->next; }
> B * p = atomic_XXX (&somewhere, consume);
> adjustptr(p)->prev->stuff = bla;
> 
> As far as I understood you, this whole ptrderef chain business would be 
> only an optimization opportunity, right?  So if the compiler can't be sure 
> how p is actually used (as in my function-using case, assume adjustptr is 
> defined in another unit), then the consume load would simply be 
> transformed into an acquire (or whatever, with some barrier I mean)?  Only 
> _if_ the compiler sees all obvious uses of p (indirectly through pointer 
> derefs) can it, yeah, do what with the consume load?

Good point, I left that out of my list.  Adding it:

13.	By default, pointer chains do not propagate into or out of functions.
	In implementations having attributes, a [[carries_dependency]]
	may be used to mark a function argument or return as passing
	a pointer chain into or out of that function.

	If a function does not contain memory_order_consume loads and
	also does not contain [[carries_dependency]] attributes, then
	that function may be compiled using any desired dependency-breaking
	optimizations.

	The ordering effects are implementation defined when a given
	pointer chain passes into or out of a function through a parameter
	or return not marked with a [[carries_dependency]] attribute.

Note that this last paragraph differs from the current standard, which
would require ordering regardless.
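
For example, using the C++11 attribute syntax (a sketch; the function
and field names are made up):

	[[carries_dependency]] struct foo *get_next(struct foo *p [[carries_dependency]])
	{
		return p->next;	/* chain enters via p, exits via the return */
	}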

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 16:55                                                                                                         ` Michael Matz
  2014-02-24 17:28                                                                                                           ` Paul E. McKenney
@ 2014-02-24 17:38                                                                                                           ` Linus Torvalds
  2014-02-24 18:12                                                                                                             ` Paul E. McKenney
  2014-02-26 17:34                                                                                                             ` Torvald Riegel
  1 sibling, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-24 17:38 UTC (permalink / raw)
  To: Michael Matz
  Cc: Richard Biener, Paul McKenney, Torvald Riegel, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Mon, Feb 24, 2014 at 8:55 AM, Michael Matz <matz@suse.de> wrote:
>
> So, let me try to poke holes into your definition or increase my
> understanding :) .  You said "chain of pointers"(dereferences I assume),
> e.g. if p is result of consume load, then access to
> p->here->there->next->prev->stuff is supposed to be ordered with that load
> (or only when that last load/store itself is also an atomic load or
> store?).

It's supposed to be ordered wrt the first load (the consuming one), yes.

> So, what happens if the pointer deref chain is partly hidden in some
> functions:

No problem.

The thing is, the ordering is actually handled by the CPU in all
relevant cases.  So the compiler doesn't actually need to *do*
anything. All this legalistic stuff is just to describe the semantics
and the guarantees.

The problem is two cases:

 (a) alpha (which doesn't really order any accesses at all, not even
dependent loads), but for a compiler alpha is actually trivial: just
add a "rmb" instruction after the load, and you can't really do
anything else (there's a few optimizations you can do wrt the rmb, but
they are very specific and simple).

So (a) is a "problem", but the solution is actually really simple, and
gives very *strong* guarantees: on alpha, a "consume" ends up being
basically the same as a read barrier after the load, with only very
minimal room for optimization.

 (b) ARM and powerpc and similar architectures, that guarantee the
data dependency as long as it is an *actual* data dependency, and
never becomes a control dependency.

On ARM and powerpc, control dependencies do *not* order accesses (the
reasons boil down to essentially: branch prediction breaks the
dependency, and instructions that come after the branch can be happily
executed before the branch). But it's almost impossible to describe
that in the standard, since compilers can (and very much do) turn a
control dependency into a data dependency and vice versa.

So the current standard tries to describe that "control vs data"
dependency, and tries to limit it to a data dependency. It fails. It
fails for multiple reasons - it doesn't allow for trivial
optimizations that just remove the data dependency, and it also
doesn't allow for various trivial cases where the compiler *does* turn
the data dependency into a control dependency.

So I really really think that the current C standard language is
broken. Unfixably broken.

I'm trying to change the "syntactic data dependency" that the current
standard uses into something that is clearer and correct.

The "chain of pointers" thing is still obviously a data dependency,
but by limiting it to pointers, it simplifies the language, clarifies
the meaning, avoids all syntactic tricks (ie "p-p" is clearly a
syntactic dependency on "p", but does *not* involve in any way
following the pointer) and makes it basically impossible for the
compiler to break the dependency without doing value prediction, and
since value prediction has to be disallowed anyway, that's a feature,
not a bug.

                      Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 13:55                                                                                               ` Michael Matz
@ 2014-02-24 17:40                                                                                                 ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-24 17:40 UTC (permalink / raw)
  To: Michael Matz
  Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 02:55:07PM +0100, Michael Matz wrote:
> Hi,
> 
> On Fri, 21 Feb 2014, Paul E. McKenney wrote:
> 
> > > And with conservative I mean "everything is a source of a dependency, and 
> > > hence can't be removed, reordered or otherwise fiddled with", and that 
> > > includes code sequences where no atomic objects are anywhere in sight [1].
> > > In the light of that the only realistic way (meaning to not have to 
> > > disable optimization everywhere) to implement consume as currently 
> > > specified is to map it to acquire.  At which point it becomes pointless.
> > 
> > No, only memory_order_consume loads and [[carries_dependency]]
> > function arguments are sources of dependency chains.
> 
> I don't see [[carries_dependency]] in the C11 final draft (yeah, should 
> get a real copy, I know, but let's assume it's the same language as the 
> standard).  Therefore, yes, only consume loads are sources of 
> dependencies.  The problem with the definition of the "carries a 
> dependency" relation is not the sources, but rather where it stops.  
> It's transitively closed over "value of evaluation A is used as operand in 
> evaluation B", with very few exceptions as per 5.1.2.4#14.  Evaluations 
> can contain function calls, so if there's _any_ chance that an operand of 
> an evaluation might even indirectly use something resulting from a consume 
> load then that evaluation must be compiled in a way to not break 
> dependency chains.
> 
> I don't see a way to generally assume that e.g. the value of a function 
> argument cannot possibly result from a consume load, therefore the compiler 
> must assume that all function arguments _can_ result from such loads, and 
> so must disable all depchain-breaking optimizations (of which there are many).
> 
> > > [1] Simple example of what type of transformations would be disallowed:
> > > 
> > > int getzero (int i) { return i - i; }
> > 
> > This needs to be as follows:
> > 
> > [[carries_dependency]] int getzero(int i [[carries_dependency]])
> > {
> > 	return i - i;
> > }
> > 
> > Otherwise dependencies won't get carried through it.
> 
> So, with the above do you agree that in the absence of any other magic (see 
> below) the compiler is not allowed to transform my initial getzero() 
> (without the carries_dependency markers) implementation into "return 0;" 
> because of the C11 rules for "carries-a-dependency"?
> 
> If so, do you then also agree that the specification of "carries a 
> dependency" is somewhat, err, shall we say, overbroad?

From what I can see, overbroad.  The problem is that the C++11 standard
defines how carries-dependency interacts with function calls and returns
in 7.6.4, which describes the [[carries_dependency]] attribute.  For example,
7.6.4p6 says:

	Function g’s second parameter has a carries_dependency
	attribute, but its first parameter does not. Therefore, function
	h’s first call to g carries a dependency into g, but its second
	call does not. The implementation might need to insert a fence
	prior to the second call to g.

When C11 declined to take attributes, they also left out the part saying
how carries-dependency interacts with functions.  :-/

Might be fixed by now, checking up on it.

One could argue that the bit about emitting fence instructions at
function calls and returns is implied by the as-if rule even without
this wording, but...

> > > depchains don't matter, could _then_ optimize it to zero.  But that's 
> > > insane, especially considering that it's hard to detect if a given context 
> > > doesn't care for depchains, after all the depchain relation is constructed 
> > > exactly so that it bleeds into nearly everywhere.  So we would most of 
> > > the time have to assume that the ultimate context will be depchain-aware 
> > > and therefore disable many transformations.
> > 
> > Any function that does not contain a memory_order_consume load and that 
> > doesn't have any arguments marked [[carries_dependency]] can be 
> > optimized just as before.
> 
> And as no such marker exists, we must conservatively assume that it's 
> on _all_ parameters, so I'll stand by my claim.

Or that you have to emit a fence instruction when a dependency chain
enters or leaves a function in cases where all callers/callees are not
visible to the compiler.

My preference is that the ordering properties of a carries-dependency
chain are implementation defined at the point that it enters or leaves
a function without the marker, but others strongly disagreed.  ;-)

> > > Then inlining getzero would merely add another "# j.dep = i.dep" 
> > > relation, so depchains are still there but the value optimization can 
> > > happen before inlining.  Having to do something like that I'd find 
> > > disgusting, and rather rewrite consume into acquire :)  Or make the 
> > > depchain relation somehow realistically implementable.
> > 
> > I was actually OK with arithmetic cancellation breaking the dependency 
> > chains.  Others on the committee felt otherwise, and I figured that (1) 
> > I wouldn't be writing that kind of function anyway and (2) they knew 
> > more about writing compilers than I.  I would still be OK saying that 
> > things like "i-i", "i*0", "i%1", "i&0", "i|~0" and so on just break the 
> > dependency chain.
> 
> Exactly.  I can see the problem that people had with that, though.  There 
> are very many ways to write concealed zeros (or generally neutral elements 
> of the function in question).  My getzero() function is one (it could e.g. 
> be an assembler implementation).  The allowance to break dependency chains 
> would have to apply to such cancellation as well, and so one can't simply 
> itemize all cases in which cancellation is allowed.  Rather it would have 
> had to argue about something like "value dependency", a la "evaluation B 
> depends on A, if there exist at least two different values A1 and A2 
> (results from A), for which evaluation B (with otherwise same operands) 
> yields different values B1 and B2".

And that was in fact one of the arguments used against me.  ;-)

> Alas, it doesn't, except if you want to understand the term "the value of 
> A is used as an operand of B" in that way.  Even then you'd still have the 
> second case of the depchain definition, via intermediate not even atomic 
> memory stores and loads to make two evaluations be ordered per 
> carries-a-dependency.
> 
> And even that understanding of "is used" wouldn't be enough, because there 
> are cases where the cancellation happens in steps, and where it interacts 
> with the third clause (transitiveness):  Assume this:
> 
>   a = something()  // evaluation A
>   b = 1 - a        // evaluation B
>   c = a - 1 + b    // evaluation C
> 
> Now, clearly B depends on A.  Also C depends on B (because with otherwise 
> same operands changing just B also changes C), and because of transitiveness C 
> then also depends on A.  But equally clearly C was just an elaborate way to 
> write "0", and so depends on nothing.  The problem was of course that A 
> and B weren't independent when determining the dependencies of C.  But 
> allowing cancellation to break dependency chains would have to allow for 
> these cases as well.
> 
> So, now, that leaves us basically with depchains forcing us to disable 
> many useful transformations, or finding some other magic.  One would be to 
> just regard all consume loads as acquire loads and be done (and 
> effectively remove the ill-advised "carries a dependency" relation from 
> consideration).
> 
> You say downthread that it'd also be possible to just emit barriers before 
> all function calls (I say "all" because the compiler will generally 
> have applied some transformation that broke depchains if they existed).  
> That seems to me to be a bigger hammer than just ignoring depchains and 
> emitting acquires instead of consumes (because the latter changes only exactly 
> where atomics are used, while the former seems to me to have unbounded effect).

Yep, converting the consume to an acquire is a valid alternative to
emitting a memory-barrier instruction prior to entering/exiting the
function in question.

> So, am I still missing something, or is my understanding of the 
> carries-a-dependency relation correct and my conclusions merely too 
> pessimistic?

Given the definition as it is, I believe you understand it.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 17:28                                                                                                           ` Paul E. McKenney
@ 2014-02-24 17:57                                                                                                             ` Paul E. McKenney
  2014-02-26 17:39                                                                                                             ` Torvald Riegel
  1 sibling, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-24 17:57 UTC (permalink / raw)
  To: Michael Matz
  Cc: Linus Torvalds, Richard Biener, Torvald Riegel, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Mon, Feb 24, 2014 at 09:28:56AM -0800, Paul E. McKenney wrote:
> On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote:
> > Hi,
> > 
> > On Mon, 24 Feb 2014, Linus Torvalds wrote:
> > 
> > > > To me that reads like
> > > >
> > > >   int i;
> > > >   int *q = &i;
> > > >   int **p = &q;
> > > >
> > > >   atomic_XXX (p, CONSUME);
> > > >
> > > > orders against accesses '*p', '**p', '*q' and 'i'.  Thus it seems they
> > > > want to say that it orders against aliased storage - but then go further
> > > > and include "indirectly through a chain of pointers"?!  Thus an
> > > > atomic read of an int * orders against any 'int' memory operation but
> > > > not against 'float' memory operations?
> > > 
> > > No, it's not about type at all, and the "chain of pointers" can be
> > > much more complex than that, since the "int *" can point to within an
> > > object that contains other things than just that "int" (the "int" can
> > > be part of a structure that then has pointers to other structures
> > > etc).
> > 
> > So, let me try to poke holes into your definition or increase my 
> > understanding :) .  You said "chain of pointers"(dereferences I assume), 
> > e.g. if p is result of consume load, then access to 
> > p->here->there->next->prev->stuff is supposed to be ordered with that load 
> > (or only when that last load/store itself is also an atomic load or 
> > store?).
> > 
> > So, what happens if the pointer deref chain is partly hidden in some 
> > functions:
> > 
> > A * adjustptr (B *ptr) { return &ptr->here->there->next; }
> > B * p = atomic_XXX (&somewhere, consume);
> > adjustptr(p)->prev->stuff = bla;
> > 
> > As far as I understood you, this whole ptrderef chain business would be 
> > only an optimization opportunity, right?  So if the compiler can't be sure 
> > how p is actually used (as in my function-using case, assume adjustptr is 
> > defined in another unit), then the consume load would simply be 
> > transformed into an acquire (or whatever, with some barrier I mean)?  Only 
> > _if_ the compiler sees all obvious uses of p (indirectly through pointer 
> > derefs) can it, yeah, do what with the consume load?
> 
> Good point, I left that out of my list.  Adding it:
> 
> 13.	By default, pointer chains do not propagate into or out of functions.
> 	In implementations having attributes, a [[carries_dependency]]
> 	may be used to mark a function argument or return as passing
> 	a pointer chain into or out of that function.
> 
> 	If a function does not contain memory_order_consume loads and
> 	also does not contain [[carries_dependency]] attributes, then
> 	that function may be compiled using any desired dependency-breaking
> 	optimizations.
> 
> 	The ordering effects are implementation defined when a given
> 	pointer chain passes into or out of a function through a parameter
> 	or return not marked with a [[carries_dependency]] attribute.
> 
> Note that this last paragraph differs from the current standard, which
> would require ordering regardless.

And there is also kill_dependency(), which needs to be added to the list
in #8 of operators that take a chained pointer and return something that
is not a chained pointer.
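
For example (a sketch; "gp", "idx", and "table" are made-up names):

	p = atomic_load_explicit(&gp, memory_order_consume);
	i = p->idx;              /* carries a dependency from the load */
	j = kill_dependency(i);  /* j does not carry that dependency   */
	x = table[j];            /* so this access is not ordered by   */
	                         /* the memory_order_consume load      */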

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 17:38                                                                                                           ` Linus Torvalds
@ 2014-02-24 18:12                                                                                                             ` Paul E. McKenney
  2014-02-26 17:34                                                                                                             ` Torvald Riegel
  1 sibling, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-24 18:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Matz, Richard Biener, Torvald Riegel, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Mon, Feb 24, 2014 at 09:38:46AM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 8:55 AM, Michael Matz <matz@suse.de> wrote:
> >
> > So, let me try to poke holes into your definition or increase my
> > understanding :) .  You said "chain of pointers"(dereferences I assume),
> > e.g. if p is result of consume load, then access to
> > p->here->there->next->prev->stuff is supposed to be ordered with that load
> > (or only when that last load/store itself is also an atomic load or
> > store?).
> 
> It's supposed to be ordered wrt the first load (the consuming one), yes.
> 
> > So, what happens if the pointer deref chain is partly hidden in some
> > functions:
> 
> No problem.
> 
> The thing is, the ordering is actually handled by the CPU in all
> relevant cases.  So the compiler doesn't actually need to *do*
> anything. All this legalistic stuff is just to describe the semantics
> and the guarantees.
> 
> The problem is two cases:
> 
>  (a) alpha (which doesn't really order any accesses at all, not even
> dependent loads), but for a compiler alpha is actually trivial: just
> add a "rmb" instruction after the load, and you can't really do
> anything else (there's a few optimizations you can do wrt the rmb, but
> they are very specific and simple).
> 
> So (a) is a "problem", but the solution is actually really simple, and
> gives very *strong* guarantees: on alpha, a "consume" ends up being
> basically the same as a read barrier after the load, with only very
> minimal room for optimization.
> 
>  (b) ARM and powerpc and similar architectures, that guarantee the
> data dependency as long as it is an *actual* data dependency, and
> never becomes a control dependency.
> 
> On ARM and powerpc, control dependencies do *not* order accesses (the
> reasons boil down to essentially: branch prediction breaks the
> dependency, and instructions that come after the branch can be happily
> executed before the branch). But it's almost impossible to describe
> that in the standard, since compilers can (and very much do) turn a
> control dependency into a data dependency and vice versa.
> 
> So the current standard tries to describe that "control vs data"
> dependency, and tries to limit it to a data dependency. It fails. It
> fails for multiple reasons - it doesn't allow for trivial
> optimizations that just remove the data dependency, and it also
> doesn't allow for various trivial cases where the compiler *does* turn
> the data dependency into a control dependency.
> 
> So I really really think that the current C standard language is
> broken. Unfixably broken.
> 
> I'm trying to change the "syntactic data dependency" that the current
> standard uses into something that is clearer and correct.
> 
> The "chain of pointers" thing is still obviously a data dependency,
> but by limiting it to pointers, it simplifies the language, clarifies
> the meaning, avoids all syntactic tricks (ie "p-p" is clearly a
> syntactic dependency on "p", but does *not* involve in any way
> following the pointer) and makes it basically impossible for the
> compiler to break the dependency without doing value prediction, and
> since value prediction has to be disallowed anyway, that's a feature,
> not a bug.

OK, good point, please ignore my added thirteenth item in the list.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 17:21                                                                                                     ` Paul E. McKenney
@ 2014-02-24 18:14                                                                                                       ` Linus Torvalds
  2014-02-24 18:53                                                                                                         ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-24 18:14 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 9:21 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> 4.      Bitwise operators ("&", "|", "^", and I suppose also "~")
>         applied to a chained pointer and an integer results in another
>         chained pointer in that same pointer chain.

No. You cannot define it this way. Taking the value of a pointer and
doing a bitwise operation that throws away all the bits (or even
*most* of the bits) results in the compiler easily being able to turn
the "chain" into a non-chain.

The obvious example being "val & 0", but things like "val & 1" are in
practice also something that compilers easily turn into control
dependencies instead of data dependencies.

So you can talk about things like "aligning the pointer value to
object boundaries" etc, but it really cannot and *must* not be about
the syntactic operations.

The same goes for "adding and subtracting an integer". The *syntax*
doesn't matter. It's about remaining information. Doing "p-(int)p" or
"p+(-(int)p)" doesn't leave any information despite being "subtracting
and adding an integer" at a syntactic level.

Syntax is meaningless. Really.
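
To make that concrete (a sketch, with "p" being the consume-loaded
pointer and uintptr_t from <stdint.h>):

    uintptr_t u = (uintptr_t)p;

    a = u & 0;    /* always zero: no information survives      */
    b = u - u;    /* ditto, despite "subtracting an integer"   */
    c = u & 1;    /* one bit left: trivially becomes a branch, */
                  /* i.e. a control dependency                 */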

> 8.      Applying any of the following operators to a chained pointer
>         results in something that is not a chained pointer:
>         "()", sizeof, "!", "*", "/", "%", ">>", "<<", "<", ">", "<=",
>         ">=", "==", "!=", "&&", and "||".

Parentheses? I'm assuming that you mean calling through the chained pointer.

Also, I think all of /, * and % are perfectly fine, and might be used
for that "aligning the pointer" operation that is fine.

             Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 18:14                                                                                                       ` Linus Torvalds
@ 2014-02-24 18:53                                                                                                         ` Paul E. McKenney
  2014-02-24 19:54                                                                                                           ` Linus Torvalds
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-24 18:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 10:14:01AM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 9:21 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > 4.      Bitwise operators ("&", "|", "^", and I suppose also "~")
> >         applied to a chained pointer and an integer results in another
> >         chained pointer in that same pointer chain.
> 
> No. You cannot define it this way. Taking the value of a pointer and
> doing a bitwise operation that throws away all the bits (or even
> *most* of the bits) results in the compiler easily being able to turn
> the "chain" into a non-chain.
> 
> The obvious example being "val & 0", but things like "val & 1" are in
> practice also something that compilers easily turn into control
> dependencies instead of data dependencies.

Indeed, most of the bits need to remain for this to work.

> So you can talk about things like "aligning the pointer value to
> object boundaries" etc, but it really cannot and *must* not be about
> the syntactic operations.
> 
> The same goes for "adding and subtracting an integer". The *syntax*
> doesn't matter. It's about remaining information. Doing "p-(int)p" or
> "p+(-(int)p)" doesn't leave any information despite being "subtracting
> and adding an integer" at a syntactic level.
> 
> Syntax is meaningless. Really.

Good points.  How about the following replacements?

3.	Adding or subtracting an integer to/from a chained pointer
	results in another chained pointer in that same pointer chain.
	The results of addition and subtraction operations that cancel
	the chained pointer's value (for example, "p-(long)p" where "p"
	is a pointer to char) are implementation defined.

4.	Bitwise operators ("&", "|", "^", and I suppose also "~")
	applied to a chained pointer and an integer for the purposes
	of alignment and pointer translation results in another
	chained pointer in that same pointer chain.  Other uses
	of bitwise operators on chained pointers (for example,
	"p|~0") are implementation defined.

> > 8.      Applying any of the following operators to a chained pointer
> >         results in something that is not a chained pointer:
> >         "()", sizeof, "!", "*", "/", "%", ">>", "<<", "<", ">", "<=",
> >         ">=", "==", "!=", "&&", and "||".
> 
> Parenthesis? I'm assuming that you mean calling through the chained pointer.

Yes, good point.  Of course, parentheses for grouping just pass the
value through without affecting the chained-ness.

> Also, I think all of /, * and % are perfectly fine, and might be used
> for that "aligning the pointer" operation that is fine.

Something like this?

	char *p;

	p = p - (unsigned long)p % 8;

I was thinking of this as subtraction -- the "p" gets moduloed by 8,
which loses the chained-pointer designation.  But that is OK because
that designation gets folded back in by the subtraction.  Am I missing
a use case?

That leaves things like this one:

	p = (p / 8) * 8;

I cannot think of any other legitimate use for "/" and "*".
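
(Written out as literal C, the division needs an integer cast first --
a sketch, with uintptr_t from <stdint.h>:

	p = (char *)(((uintptr_t)p / 8) * 8);	/* align p down to 8 */

but the intent is the same.)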

Here is an updated #8 and a new 8a:

8.	Applying any of the following operators to a chained pointer
	results in something that is not a chained pointer: function call
	"()", sizeof, "!", "%", ">>", "<<", "<", ">", "<=", ">=", "==",
	"!=", "&&", "||", and "kill_dependency()".

8a.	Dividing a chained pointer by an integer and multiplying it
	by that same integer (for example, to align that pointer) results
	in a chained pointer in that same pointer chain.  The ordering
	effects of other uses of infix "*" and "/" on chained pointers
	are implementation defined.

Does that capture it?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 18:53                                                                                                         ` Paul E. McKenney
@ 2014-02-24 19:54                                                                                                           ` Linus Torvalds
  2014-02-24 22:37                                                                                                             ` Paul E. McKenney
  2014-02-27 15:37                                                                                                             ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-24 19:54 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Good points.  How about the following replacements?
>
> 3.      Adding or subtracting an integer to/from a chained pointer
>         results in another chained pointer in that same pointer chain.
>         The results of addition and subtraction operations that cancel
>         the chained pointer's value (for example, "p-(long)p" where "p"
>         is a pointer to char) are implementation defined.
>
> 4.      Bitwise operators ("&", "|", "^", and I suppose also "~")
>         applied to a chained pointer and an integer for the purposes
>         of alignment and pointer translation results in another
>         chained pointer in that same pointer chain.  Other uses
>         of bitwise operators on chained pointers (for example,
>         "p|~0") are implementation defined.

Quite frankly, I think all of this language that is about the actual
operations is irrelevant and wrong.

It's not going to help compiler writers, and it sure isn't going to
help users that read this.

Why not just talk about "value chains" and that any operations that
restrict the value range severely end up breaking the chain. There is
no point in listing the operations individually, because every single
operation *can* restrict things. Listing individual operations and
depdendencies is just fundamentally wrong.

For example, let's look at this obvious case:

   int q,*p = atomic_read(&pp, consume);
   .. nothing modifies 'p' ..
   q = *p;

and there are literally *zero* operations that modify the value, so
obviously the two operations are ordered, right?

Wrong.

What if the "nothing modifies 'p'" part looks like this:

    if (p != &myvariable)
        return;

and now any sane compiler will happily optimize "q = *p" into "q =
myvariable", and we're all done - nothing invalid was ever

So my earlier suggestion tried to avoid this by having the restrict
thing, so the above wouldn't work.

But your (and the current C standards) attempt to define this with
some kind of syntactic dependency carrying chain will _inevitably_ get
this wrong, and/or be too horribly complex to actually be useful.

Seriously, don't do it. I claim that all your attempts to do this
crazy syntactic "these operations maintain the chained pointers" is
broken. The fact that you have changed "carries a dependency" to
"chained pointer" changes NOTHING.

So just give it up. It's a fundamentally broken model. It's *wrong*,
but even more importantly, it's not even *useful*, since it ends up
being too complicated for a compiler writer or a programmer to
understand.

I really really really think you need to do this at a higher
conceptual level, get away from all these idiotic "these operations
maintain the chain" crap. Because there *is* no such list.

Quite frankly, any standards text that has that
"[[carries_dependency]]" or "[[kill_dependency]]" or whatever
attribute is broken. It's broken because the whole concept is TOTALLY
ALIEN to the compiler writer or the programmer. It makes no sense.
It's purely legalistic language that has zero reason to exist. It's
non-intuitive for everybody.

And *any* language that talks about the individual operations only
encourages people to play legalistic games that actually defeat the
whole purpose (namely that all relevant CPU's are going to implement
that consume ordering guarantee natively, with no extra code
generation rules AT ALL). So any time you talk about some random
detail of some operation, somebody is going to come up with a "trick"
that defeats things. So don't do it. There is absolutely ZERO
difference between any of the arithmetic operations, be they bitwise,
additive, multiplicative, shifts, whatever.

The *only* thing that matters for all of them is whether they are
"value-preserving", or whether they drop so much information that the
compiler might decide to use a control dependency instead. That's true
for every single one of them.

Similarly, actual true control dependencies that limit the problem
space sufficiently that the actual pointer value no longer has
significant information in it (see the above example) are also things
that remove information to the point that only a control dependency
remains. Even when the value itself is not modified in any way at all.

              Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 19:54                                                                                                           ` Linus Torvalds
@ 2014-02-24 22:37                                                                                                             ` Paul E. McKenney
  2014-02-24 23:35                                                                                                               ` Linus Torvalds
  2014-02-27 15:37                                                                                                             ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-24 22:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 11:54:46AM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Good points.  How about the following replacements?
> >
> > 3.      Adding or subtracting an integer to/from a chained pointer
> >         results in another chained pointer in that same pointer chain.
> >         The results of addition and subtraction operations that cancel
> >         the chained pointer's value (for example, "p-(long)p" where "p"
> >         is a pointer to char) are implementation defined.
> >
> > 4.      Bitwise operators ("&", "|", "^", and I suppose also "~")
> >         applied to a chained pointer and an integer for the purposes
> >         of alignment and pointer translation result in another
> >         chained pointer in that same pointer chain.  Other uses
> >         of bitwise operators on chained pointers (for example,
> >         "p|~0") are implementation defined.
> 
> Quite frankly, I think all of this language that is about the actual
> operations is irrelevant and wrong.
> 
> It's not going to help compiler writers, and it sure isn't going to
> help users that read this.
> 
> Why not just talk about "value chains" and that any operations that
> restrict the value range severely end up breaking the chain. There is
> no point in listing the operations individually, because every single
> operation *can* restrict things. Listing individual operations and
> dependencies is just fundamentally wrong.
> 
> For example, let's look at this obvious case:
> 
>    int q,*p = atomic_read(&pp, consume);
>    .. nothing modifies 'p' ..
>    q = *p;
> 
> and there are literally *zero* operations that modify the value
> chain, so obviously the two operations are ordered, right?
> 
> Wrong.
> 
> What if the "nothing modifies 'p'" part looks like this:
> 
>     if (p != &myvariable)
>         return;
> 
> and now any sane compiler will happily optimize "q = *p" into "q =
> myvariable", and we're all done - nothing invalid was ever

Yes, the compiler could do that.  But it would still be required to
carry a dependency from the memory_order_consume read to the "*p",
which it could do by compiling "q = *p" rather than "q = myvariable"
on the one hand or by emitting a memory-barrier instruction on the other.

This was the point of #12:

12.	A memory_order_consume load carries a dependency to any
	dereference operator (unary "*", "[]", and "->") in the
	resulting pointer chain.
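
As a sketch of what #12 is intended to require (illustrative names):

    #include <stdatomic.h>

    struct foo { int a; };
    static _Atomic(struct foo *) gp;

    int reader(void)
    {
        struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

        /*
         * Per #12, the consume load carries a dependency to this "->",
         * so the load of p->a must remain ordered after the load of
         * gp -- either by preserving the address dependency or by
         * emitting a barrier.
         */
        return p->a;
    }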

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 22:37                                                                                                             ` Paul E. McKenney
@ 2014-02-24 23:35                                                                                                               ` Linus Torvalds
  2014-02-25  6:00                                                                                                                 ` Paul E. McKenney
  2014-02-25  6:05                                                                                                                 ` Linus Torvalds
  0 siblings, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-24 23:35 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 2:37 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>>
>> What if the "nothing modifies 'p'" part looks like this:
>>
>>     if (p != &myvariable)
>>         return;
>>
>> and now any sane compiler will happily optimize "q = *p" into "q =
>> myvariable", and we're all done - nothing invalid was ever
>
> Yes, the compiler could do that.  But it would still be required to
> carry a dependency from the memory_order_consume read to the "*p",

But that's *BS*. You didn't actually listen to the main issue.

Paul, why do you insist on this carries-a-dependency crap?

It's broken. If you don't believe me, then believe the compiler person
who already piped up and told you so.

The "carries a dependency" model is broken. Get over it.

No sane compiler will ever distinguish two different registers that
have the same value from each other. No sane compiler will ever say
"ok, register r1 has the exact same value as register r2, but r2
carries the dependency, so I need to make sure to pass r2 to that
function or use it as a base pointer".

And nobody sane should *expect* a compiler to distinguish two
registers with the same value that way.

So the whole model is broken.

I gave an alternate model (the "restrict"), and you didn't seem to
understand the really fundamental difference. It's not a language
difference, it's a conceptual difference.

In the broken "carries a dependency" model, you have to fight all those
aliases that can have the same value, and it is not a fight you can
win. We've had the "p-p" examples, we've had the "p&0" examples, but
the fact is, that "p==&myvariable" example IS EXACTLY THE SAME THING.

All three of those things: "p-p", "p&0", and "p==&myvariable" mean
that any compiler worth its salt now knows that "p" carries no
information, and will optimize it away.

So please stop arguing against that. Whenever you argue against that
simple fact, you are arguing against sane compilers.

So *accept* the fact that some operations (and I guarantee that there
are more of those than you can think of, and you can create them with
various tricks using pretty much *any* feature in the C language)
essentially take the data information away. And just accept the fact
that then the ordering goes away too.

So give up on "carries a dependency". Because there will be cases
where that dependency *isn't* carried.

The language of the standard needs to get *away* from the broken
model, because otherwise the standard is broken.

I suggest we instead talk about "litmus tests" and why certain code
sequences are ordered, and others are not.

So the code sequence I already mentioned is *not* ordered:

Litmus test 1:

    p = atomic_read(pp, consume);
    if (p == &variable)
        return p->val;

   is *NOT* ordered, because the compiler can trivially turn this into
"return variable.val", and break the data dependency.

   This is true *regardless* of any "carries a dependency" language,
because that language is insane, and doesn't work when the different
pieces here may be in different compilation units.

BUT:

Litmus test 2:

    p = atomic_read(pp, consume);
    if (p != &variable)
        return p->val;

  *IS* ordered, because while it looks syntactically a lot like
"Litmus test 1", there is no sane way a compiler can use the knowledge
that "p is not a pointer to a particular location" to break the data
dependency.
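
Side by side in C11 spelling (a sketch; "pp", "variable" and "val" are
placeholders):

    #include <stdatomic.h>

    struct obj { int val; };
    extern struct obj variable;
    extern _Atomic(struct obj *) pp;

    int litmus1(void)
    {
        struct obj *p = atomic_load_explicit(&pp, memory_order_consume);

        if (p == &variable)
            return p->val;  /* NOT ordered: may become variable.val */
        return -1;
    }

    int litmus2(void)
    {
        struct obj *p = atomic_load_explicit(&pp, memory_order_consume);

        if (p != &variable)
            return p->val;  /* ordered: knowing what p is NOT gives
                               the compiler nothing to substitute */
        return -1;
    }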

There is no way in hell that any sane "carries a dependency" model can
get the simple tests above right.

So give up on it already. "Carries a dependency" cannot work. It's a
bad model. You're trying to describe the problem using the wrong
tools.

Note that my "restrict+pointer to object" language actually got this
*right*. The "restrict" part made Litmus test 1 not ordered, because
that "p == &variable" success case means that the pointer wasn't
restricted, so the pre-requisite for ordering didn't exist.

See? The "carries a dependency" is a broken model for this, but there
are _other_ models that can work.

You tried to rewrite my model into "carries a dependency". That
*CANNOT* work. It's like trying to rewrite quantum physics into the
Greek model of the four elements. They are not compatible models, and
one of them can be shown to not work.

              Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 23:35                                                                                                               ` Linus Torvalds
@ 2014-02-25  6:00                                                                                                                 ` Paul E. McKenney
  2014-02-26  1:47                                                                                                                   ` Linus Torvalds
  2014-02-25  6:05                                                                                                                 ` Linus Torvalds
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-25  6:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 03:35:04PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 2:37 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >>
> >> What if the "nothing modifies 'p'" part looks like this:
> >>
> >>     if (p != &myvariable)
> >>         return;
> >>
> >> and now any sane compiler will happily optimize "q = *p" into "q =
> >> myvariable", and we're all done - nothing invalid was ever
> >
> > Yes, the compiler could do that.  But it would still be required to
> > carry a dependency from the memory_order_consume read to the "*p",
> 
> But that's *BS*. You didn't actually listen to the main issue.
> 
> Paul, why do you insist on this carries-a-dependency crap?

Sigh.  Read on...

> It's broken. If you don't believe me, then believe the compiler person
> who already piped up and told you so.
> 
> The "carries a dependency" model is broken. Get over it.
> 
> No sane compiler will ever distinguish two different registers that
> have the same value from each other. No sane compiler will ever say
> "ok, register r1 has the exact same value as register r2, but r2
> carries the dependency, so I need to make sure to pass r2 to that
> function or use it as a base pointer".
> 
> And nobody sane should *expect* a compiler to distinguish two
> registers with the same value that way.
> 
> So the whole model is broken.
> 
> I gave an alternate model (the "restrict"), and you didn't seem to
> understand the really fundamental difference. It's not a language
> difference, it's a conceptual difference.
> 
> In the broken "carries a dependency" model, you have to fight all those
> aliases that can have the same value, and it is not a fight you can
> win. We've had the "p-p" examples, we've had the "p&0" examples, but
> the fact is, that "p==&myvariable" example IS EXACTLY THE SAME THING.
> 
> All three of those things: "p-p", "p&0", and "p==&myvariable" mean
> that any compiler worth its salt now knows that "p" carries no
> information, and will optimize it away.
> 
> So please stop arguing against that. Whenever you argue against that
> simple fact, you are arguing against sane compilers.

So let me see if I understand your reasoning.  My best guess is that it
goes something like this:

1.	The Linux kernel contains code that passes pointers from
	rcu_dereference() through external functions.
	
2.	Code in the Linux kernel expects the normal RCU ordering
	guarantees to be in effect even when external functions are
	involved.

3.	When compiling one of these external functions, the C compiler
	has no way of knowing about these RCU ordering guarantees.

4.	The C compiler might therefore apply any and all optimizations
	to these external functions.

5.	This in turn implies that the only way to prohibit any given
	optimization from being applied to the results obtained from
	rcu_dereference() is to prohibit that optimization globally.

6.	We have to be very careful what optimizations are globally
	prohibited, because a poor choice could result in unacceptable
	performance degradation.

7.	Therefore, the only operations that can be counted on to
	maintain the needed RCU orderings are those where the compiler
	really doesn't have any choice, in other words, where any
	reasonable way of computing the result will necessarily maintain
	the needed ordering.

Did I get this right, or am I confused?

> So *accept* the fact that some operations (and I guarantee that there
> are more of those than you can think of, and you can create them with
> various tricks using pretty much *any* feature in the C language)
> essentially take the data information away. And just accept the fact
> that then the ordering goes away too.

Actually, the fact that there are more potential optimizations than I can
think of is a big reason for my insistence on the carries-a-dependency
crap.  My lack of optimization omniscience makes me very nervous about
relying on there never ever being a reasonable way of computing a given
result without preserving the ordering.

> So give up on "carries a dependency". Because there will be cases
> where that dependency *isn't* carried.
> 
> The language of the standard needs to get *away* from the broken
> model, because otherwise the standard is broken.
> 
> I suggest we instead talk about "litmus tests" and why certain code
> sequences are ordered, and others are not.

OK...

> So the code sequence I already mentioned is *not* ordered:
> 
> Litmus test 1:
> 
>     p = atomic_read(pp, consume);
>     if (p == &variable)
>         return p->val;
> 
>    is *NOT* ordered, because the compiler can trivially turn this into
> "return variable.val", and break the data dependency.

Right, given your model, the compiler is free to produce code that
doesn't order the load from pp against the load from p->val.

>    This is true *regardless* of any "carries a dependency" language,
> because that language is insane, and doesn't work when the different
> pieces here may be in different compilation units.

Indeed, it won't work across different compilation units unless
the compiler is told about it, which is of course the whole point of
[[carries_dependency]].  Understood, though, the Linux kernel currently
does not have anything that could reasonably automatically generate those
[[carries_dependency]] attributes.  (Or are there other reasons why you
believe [[carries_dependency]] is problematic?)

> BUT:
> 
> Litmus test 2:
> 
>     p = atomic_read(pp, consume);
>     if (p != &variable)
>         return p->val;
> 
>   *IS* ordered, because while it looks syntactically a lot like
> "Litmus test 1", there is no sane way a compiler can use the knowledge
> that "p is not a pointer to a particular location" to break the data
> dependency.
> 
> There is no way in hell that any sane "carries a dependency" model can
> get the simple tests above right.
> 
> So give up on it already. "Carries a dependency" cannot work. It's a
> bad model. You're trying to describe the problem using the wrong
> tools.
> 
> Note that my "restrict+pointer to object" language actually got this
> *right*. The "restrict" part made Litmus test 1 not ordered, because
> that "p == &variable" success case means that the pointer wasn't
> restricted, so the pre-requisite for ordering didn't exist.
> 
> See? The "carries a dependency" is a broken model for this, but there
> are _other_ models that can work.
> 
> You tried to rewrite my model into "carries a dependency". That
> *CANNOT* work. It's like trying to rewrite quantum physics into the
> Greek model of the four elements. They are not compatible models, and
> one of them can be shown to not work.

Of course, I cannot resist putting forward a third litmus test:

	static struct foo variable1;
	static struct foo variable2;
	static struct foo *pp = &variable1;

T1:	initialize_foo(&variable2);
	atomic_store_explicit(&pp, &variable2, memory_order_release);
	/* The above is the only store to pp in this translation unit,
	 * and the address of pp is not exported in any way.
	 */

T2:	if (p == &variable1)
		return p->val1; /* Must be variable1.val1. */
	else
		return p->val2; /* Must be variable2.val2. */

My guess is that your approach would not provide ordering in this
case, either.  Or am I missing something?

						Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 23:35                                                                                                               ` Linus Torvalds
  2014-02-25  6:00                                                                                                                 ` Paul E. McKenney
@ 2014-02-25  6:05                                                                                                                 ` Linus Torvalds
  2014-02-26  0:15                                                                                                                   ` Paul E. McKenney
  1 sibling, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-25  6:05 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Litmus test 1:
>
>     p = atomic_read(pp, consume);
>     if (p == &variable)
>         return p->val;
>
>    is *NOT* ordered

Btw, don't get me wrong. I don't _like_ it not being ordered, and I
actually did spend some time thinking about my earlier proposal on
strengthening the 'consume' ordering.

I have for the last several years been 100% convinced that the Intel
memory ordering is the right thing, and that people who like weak
memory ordering are wrong and should try to avoid reproducing if at
all possible. But given that we have memory orderings like power and
ARM, I don't actually see a sane way to get a good strong ordering.
You can teach compilers about cases like the above when they actually
see all the code and they could poison the value chain etc. But it
would be fairly painful, and once you cross object files (or even just
functions in the same compilation unit, for that matter), it goes from
painful to just "ridiculously not worth it".

So I think the C semantics should mirror what the hardware gives us -
and do so even in the face of reasonable optimizations - not try to do
something else that requires compilers to treat "consume" very
differently.

If people made me king of the world, I'd outlaw weak memory ordering.
You can re-order as much as you want in hardware with speculation etc,
but you should always *check* your speculation and make it *look* like
you did everything in order. Which is pretty much the intel memory
ordering (ignoring the write buffering).

               Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-25  6:05                                                                                                                 ` Linus Torvalds
@ 2014-02-26  0:15                                                                                                                   ` Paul E. McKenney
  2014-02-26  3:32                                                                                                                     ` Jeff Law
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-26  0:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 10:05:52PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Litmus test 1:
> >
> >     p = atomic_read(pp, consume);
> >     if (p == &variable)
> >         return p->val;
> >
> >    is *NOT* ordered
> 
> Btw, don't get me wrong. I don't _like_ it not being ordered, and I
> actually did spend some time thinking about my earlier proposal on
> strengthening the 'consume' ordering.

Understood.

> I have for the last several years been 100% convinced that the Intel
> memory ordering is the right thing, and that people who like weak
> memory ordering are wrong and should try to avoid reproducing if at
> all possible. But given that we have memory orderings like power and
> ARM, I don't actually see a sane way to get a good strong ordering.
> You can teach compilers about cases like the above when they actually
> see all the code and they could poison the value chain etc. But it
> would be fairly painful, and once you cross object files (or even just
> functions in the same compilation unit, for that matter), it goes from
> painful to just "ridiculously not worth it".

And I have indeed seen a post or two from you favoring stronger memory
ordering over the past few years.  ;-)

> So I think the C semantics should mirror what the hardware gives us -
> and do so even in the face of reasonable optimizations - not try to do
> something else that requires compilers to treat "consume" very
> differently.

I am sure that a great many people would jump for joy at the chance to
drop any and all RCU-related verbiage from the C11 and C++11 standards.
(I know, you aren't necessarily advocating this, but given what you
say above, I cannot think of any verbiage that would remain.)

The thing that makes me very nervous is how much the definition of
"reasonable optimization" has changed.  For example, before the
2.6.10 Linux kernel, we didn't even apply volatile semantics to
fetches of RCU-protected pointers -- and as far as I know, never
needed to.  But since then, there have been several cases where the
compiler happily hoisted a normal load out of a surprisingly large loop.
Hardware advances can come into play as well.  For example, my very first
RCU work back in the early 90s was on a parallel system whose CPUs had
no branch-prediction hardware (80386 or 80486, I don't remember which).
Now people talk about compilers using branch prediction hardware to
implement value-speculation optimizations.  Five or ten years from now,
who knows what crazy optimizations might be considered to be completely
reasonable?
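
The hoisting hazard looks roughly like this (a sketch; ACCESS_ONCE()
is the kernel's volatile-cast idiom, and typeof is a GCC extension):

    extern int need_to_stop;

    void wait_for_stop(void)
    {
        /*
         * Nothing in the loop can change need_to_stop as far as the
         * compiler can tell, so it may hoist the load out of the loop
         * and spin forever:
         */
        while (!need_to_stop)
            continue;
    }

    /* The volatile cast forces a fresh load on every iteration: */
    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

    void wait_for_stop_fixed(void)
    {
        while (!ACCESS_ONCE(need_to_stop))
            continue;
    }

The same hazard applies to loops that repeatedly fetch an RCU-protected
pointer, which is why those fetches gained volatile semantics.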

Are ARM and Power really the bad boys here?  Or are they instead playing
the role of the canary in the coal mine?

> If people made me king of the world, I'd outlaw weak memory ordering.
> You can re-order as much as you want in hardware with speculation etc,
> but you should always *check* your speculation and make it *look* like
> you did everything in order. Which is pretty much the intel memory
> ordering (ignoring the write buffering).

Speaking as someone who got whacked over the head with DEC Alpha when
first presenting RCU to the Digital UNIX folks long ago, I do have some
sympathy with this line of thought.  But as you say, it is not the world
we currently live in.

Of course, in the final analysis, your kernel, your call.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-25  6:00                                                                                                                 ` Paul E. McKenney
@ 2014-02-26  1:47                                                                                                                   ` Linus Torvalds
  2014-02-26  5:12                                                                                                                     ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-26  1:47 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> So let me see if I understand your reasoning.  My best guess is that it
> goes something like this:
>
> 1.      The Linux kernel contains code that passes pointers from
>         rcu_dereference() through external functions.

No, actually, it's not so much Linux-specific at all.

I'm actually thinking about what I'd do as a compiler writer, and as a
defender of the "C is a high-level assembler" concept.

I love C. I'm a huge fan. I think it's a great language, and I think
it's a great language not because of some theoretical issues, but
because it is the only language around that actually maps fairly well
to what machines really do.

And it's a *simple* language. Sure, it's not quite as simple as it
used to be, but look at how thin the "K&R book" is. Which pretty much
describes it - still.

That's the real strength of C, and why it's the only language serious
people use for system programming.  Ignore C++ for a while (Jesus
Xavier Christ, I've had to do C++ programming for subsurface), and
just think about what makes _C_ a good language.

I can look at C code, and I can understand what the code generation
is, and what it will really *do*. And I think that's important.
Abstractions that hide what the compiler will actually generate are
bad abstractions.

And ok, so this is obviously Linux-specific in that it's generally
only Linux where I really care about the code generation, but I do
think it's a bigger issue too.

So I want C features to *map* to the hardware features they implement.
The abstractions should match each other, not fight each other.

> Actually, the fact that there are more potential optimizations than I can
> think of is a big reason for my insistence on the carries-a-dependency
> crap.  My lack of optimization omniscience makes me very nervous about
> relying on there never ever being a reasonable way of computing a given
> result without preserving the ordering.

But if I can give two clear examples that are basically identical from
a syntactic standpoint, and one clearly can be trivially optimized to
the point where the ordering guarantee goes away, and the other
cannot, and you cannot describe the difference, then I think your
description is seriously lacking.

And I do *not* think the C language should be defined by how it can be
described. Leave that to things like Haskell or LISP, where the goal
is some kind of completeness of the language that is about the
language, not about the machines it will run on.

>> So the code sequence I already mentioned is *not* ordered:
>>
>> Litmus test 1:
>>
>>     p = atomic_read(pp, consume);
>>     if (p == &variable)
>>         return p->val;
>>
>>    is *NOT* ordered, because the compiler can trivially turn this into
>> "return variable.val", and break the data dependency.
>
> Right, given your model, the compiler is free to produce code that
> doesn't order the load from pp against the load from p->val.

Yes. Note also that that is what existing compilers would actually do.

And they'd do it "by mistake": they'd load the address of the variable
into a register, and then compare the two registers, and then end up
using _one_ of the registers as the base pointer for the "p->val"
access, but I can almost *guarantee* that there are going to be
sequences where some compiler will choose one register over the other
based on some random detail.

So my model isn't just a "model", it also happens to describe reality.

> Indeed, it won't work across different compilation units unless
> the compiler is told about it, which is of course the whole point of
> [[carries_dependency]].  Understood, though, the Linux kernel currently
> does not have anything that could reasonably automatically generate those
> [[carries_dependency]] attributes.  (Or are there other reasons why you
> believe [[carries_dependency]] is problematic?)

So I think carries_dependency is problematic because:

 - it's not actually in C11 afaik

 - it requires the programmer to solve the problem of the standard not
matching the hardware.

 - I think it's just insanely ugly, *especially* if it's actually
meant to work so that the current carries-a-dependency works even for
insane expressions like "a-a".

in practice, it's one of those things where I guess nobody actually
would ever use it.

> Of course, I cannot resist putting forward a third litmus test:
>
>         static struct foo variable1;
>         static struct foo variable2;
>         static struct foo *pp = &variable1;
>
> T1:     initialize_foo(&variable2);
>         atomic_store_explicit(&pp, &variable2, memory_order_release);
>         /* The above is the only store to pp in this translation unit,
>          * and the address of pp is not exported in any way.
>          */
>
> T2:     if (p == &variable1)
>                 return p->val1; /* Must be variable1.val1. */
>         else
>                 return p->val2; /* Must be variable2.val2. */
>
> My guess is that your approach would not provide ordering in this
> case, either.  Or am I missing something?

I actually agree.

If you write insane code to "trick" the compiler into generating
optimizations that break the dependency, then you get what you
deserve.

Now, realistically, I doubt a compiler will notice, but if it does,
I'd go "well, that's your own fault for writing code that makes no
sense".

Basically, the above uses a pointer as a boolean flag.  The compiler
noticed it was really a boolean flag, and "consume" doesn't work on
boolean flags. Tough.

              Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-26  0:15                                                                                                                   ` Paul E. McKenney
@ 2014-02-26  3:32                                                                                                                     ` Jeff Law
  2014-02-26  5:23                                                                                                                       ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Jeff Law @ 2014-02-26  3:32 UTC (permalink / raw)
  To: paulmck, Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On 02/25/14 17:15, Paul E. McKenney wrote:
>> I have for the last several years been 100% convinced that the Intel
>> memory ordering is the right thing, and that people who like weak
>> memory ordering are wrong and should try to avoid reproducing if at
>> all possible. But given that we have memory orderings like power and
>> ARM, I don't actually see a sane way to get a good strong ordering.
>> You can teach compilers about cases like the above when they actually
>> see all the code and they could poison the value chain etc. But it
>> would be fairly painful, and once you cross object files (or even just
>> functions in the same compilation unit, for that matter), it goes from
>> painful to just "ridiculously not worth it".
>
> And I have indeed seen a post or two from you favoring stronger memory
> ordering over the past few years.  ;-)
I couldn't agree more.

>
> Are ARM and Power really the bad boys here?  Or are they instead playing
> the role of the canary in the coal mine?
That's a question I've been struggling with recently as well.  I suspect 
they (arm, power) are going to be the outliers rather than the canary. 
While the weaker model may give them some advantages WRT scalability, I 
don't think it'll ultimately be enough to overcome the difficulty in 
writing correct low level code for them.

Regardless, they're here and we have to deal with them.


Jeff

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-26  1:47                                                                                                                   ` Linus Torvalds
@ 2014-02-26  5:12                                                                                                                     ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-26  5:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Feb 25, 2014 at 05:47:03PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > So let me see if I understand your reasoning.  My best guess is that it
> > goes something like this:
> >
> > 1.      The Linux kernel contains code that passes pointers from
> >         rcu_dereference() through external functions.
> 
> No, actually, it's not so much Linux-specific at all.
> 
> I'm actually thinking about what I'd do as a compiler writer, and as a
> defender of the "C is a high-level assembler" concept.
> 
> I love C. I'm a huge fan. I think it's a great language, and I think
> it's a great language not because of some theoretical issues, but
> because it is the only language around that actually maps fairly well
> to what machines really do.
> 
> And it's a *simple* language. Sure, it's not quite as simple as it
> used to be, but look at how thin the "K&R book" is. Which pretty much
> describes it - still.
> 
> That's the real strength of C, and why it's the only language serious
> people use for system programming.  Ignore C++ for a while (Jesus
> Xavier Christ, I've had to do C++ programming for subsurface), and
> just think about what makes _C_ a good language.

The last time I used C++ for a project was in 1990.  It was a lot smaller
then.

> I can look at C code, and I can understand what the code generation
> is, and what it will really *do*. And I think that's important.
> Abstractions that hide what the compiler will actually generate are
> bad abstractions.
> 
> And ok, so this is obviously Linux-specific in that it's generally
> only Linux where I really care about the code generation, but I do
> think it's a bigger issue too.
> 
> So I want C features to *map* to the hardware features they implement.
> The abstractions should match each other, not fight each other.

OK...

> > Actually, the fact that there are more potential optimizations than I can
> > think of is a big reason for my insistence on the carries-a-dependency
> > crap.  My lack of optimization omniscience makes me very nervous about
> > relying on there never ever being a reasonable way of computing a given
> > result without preserving the ordering.
> 
> But if I can give two clear examples that are basically identical from
> a syntactic standpoint, and one clearly can be trivially optimized to
> the point where the ordering guarantee goes away, and the other
> cannot, and you cannot describe the difference, then I think your
> description is seriously lacking.

In my defense, my plan was to constrain the compiler to retain the
ordering guarantee in either case.  Yes, I did notice that you find
that unacceptable.

> And I do *not* think the C language should be defined by how it can be
> described. Leave that to things like Haskell or LISP, where the goal
> is some kind of completeness of the language that is about the
> language, not about the machines it will run on.

I am with you up to the point that the fancy optimizers start kicking
in.  I don't know how to describe what the optimizers are and are not
permitted to do strictly in terms of the underlying hardware.

> >> So the code sequence I already mentioned is *not* ordered:
> >>
> >> Litmus test 1:
> >>
> >>     p = atomic_read(pp, consume);
> >>     if (p == &variable)
> >>         return p->val;
> >>
> >>    is *NOT* ordered, because the compiler can trivially turn this into
> >> "return variable.val", and break the data dependency.
> >
> > Right, given your model, the compiler is free to produce code that
> > doesn't order the load from pp against the load from p->val.
> 
> Yes. Note also that that is what existing compilers would actually do.
> 
> And they'd do it "by mistake": they'd load the address of the variable
> into a register, and then compare the two registers, and then end up
> using _one_ of the registers as the base pointer for the "p->val"
> access, but I can almost *guarantee* that there are going to be
> sequences where some compiler will choose one register over the other
> based on some random detail.
> 
> So my model isn't just a "model", it also happens to describe reality.

Sounds to me like your model -is- reality.  I believe that it is useful
to constrain reality from time to time, but understand that you vehemently
disagree.

> > Indeed, it won't work across different compilation units unless
> > the compiler is told about it, which is of course the whole point of
> > [[carries_dependency]].  Understood, though, the Linux kernel currently
> > does not have anything that could reasonably automatically generate those
> > [[carries_dependency]] attributes.  (Or are there other reasons why you
> > believe [[carries_dependency]] is problematic?)
> 
> So I think carries_dependency is problematic because:
> 
>  - it's not actually in C11 afaik

Indeed it is not, but I bet that gcc will implement it like it does the
other attributes that are not part of C11.

>  - it requires the programmer to solve the problem of the standard not
> matching the hardware.

The programmer in this instance being the compiler writer?

>  - I think it's just insanely ugly, *especially* if it's actually
> meant to work so that the current carries-a-dependency works even for
> insane expressions like "a-a".

And left to myself, I would prune down the carries-a-dependency trees
to reflect what is actually used, excluding "a-a", comparison operators,
and so on, allowing the developer to know when ordering is preserved and
when it is not, without having to fully understand all the optimizations
that might ever be used.

Yes, I understand that you hate that thought as well.

> in practice, it's one of those things where I guess nobody actually
> would ever use it.

Well, I believe that this thread has demonstrated that -you- won't ever
use it.  ;-)

> > Of course, I cannot resist putting forward a third litmus test:
> >
> >         static struct foo variable1;
> >         static struct foo variable2;
> >         static struct foo *pp = &variable1;
> >
> > T1:     initialize_foo(&variable2);
> >         atomic_store_explicit(&pp, &variable2, memory_order_release);
> >         /* The above is the only store to pp in this translation unit,
> >          * and the address of pp is not exported in any way.
> >          */
> >
> > T2:     if (p == &variable1)
> >                 return p->val1; /* Must be variable1.val1. */
> >         else
> >                 return p->val2; /* Must be variable2.val2. */
> >
> > My guess is that your approach would not provide ordering in this
> > case, either.  Or am I missing something?
> 
> I actually agree.
> 
> If you write insane code to "trick" the compiler into generating
> optimizations that break the dependency, then you get what you
> deserve.
> 
> Now, realistically, I doubt a compiler will notice, but if it does,
> I'd go "well, that's your own fault for writing code that makes no
> sense".
> 
> Basically, the above uses a pointer as a boolean flag.  The compiler
> noticed it was really a boolean flag, and "consume" doesn't work on
> boolean flags. Tough.

So the places in the Linux kernel that currently compare the value
returned from rcu_dereference() against an address of a variable need
to do that comparison in an external function where the compiler cannot
see it?  Or inline assembly, I suppose.
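
For example (a sketch; ptr_eq() is a hypothetical helper whose body
lives in a separate translation unit, so that -- at least without
link-time optimization -- the compiler cannot see the comparison;
gp, default_foo, and do_default() are likewise illustrative):

    /* helper.c */
    int ptr_eq(const void *a, const void *b)
    {
        return a == b;
    }

    /* caller.c */
    int ptr_eq(const void *a, const void *b);  /* body not visible */

    p = rcu_dereference(gp);
    if (ptr_eq(p, &default_foo))
        do_default();  /* compiler cannot substitute &default_foo for p */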

There are probably also "interesting" situations involving structures
reached by multiple pointers, some RCU-protected and some not...

Ouch.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-26  3:32                                                                                                                     ` Jeff Law
@ 2014-02-26  5:23                                                                                                                       ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-26  5:23 UTC (permalink / raw)
  To: Jeff Law
  Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Feb 25, 2014 at 08:32:38PM -0700, Jeff Law wrote:
> On 02/25/14 17:15, Paul E. McKenney wrote:
> >>I have for the last several years been 100% convinced that the Intel
> >>memory ordering is the right thing, and that people who like weak
> >>memory ordering are wrong and should try to avoid reproducing if at
> >>all possible. But given that we have memory orderings like power and
> >>ARM, I don't actually see a sane way to get a good strong ordering.
> >>You can teach compilers about cases like the above when they actually
> >>see all the code and they could poison the value chain etc. But it
> >>would be fairly painful, and once you cross object files (or even just
> >>functions in the same compilation unit, for that matter), it goes from
> >>painful to just "ridiculously not worth it".
> >
> >And I have indeed seen a post or two from you favoring stronger memory
> >ordering over the past few years.  ;-)
> I couldn't agree more.
> 
> >
> >Are ARM and Power really the bad boys here?  Or are they instead playing
> >the role of the canary in the coal mine?
> That's a question I've been struggling with recently as well.  I
> suspect they (arm, power) are going to be the outliers rather than
> the canary. While the weaker model may give them some advantages WRT
> scalability, I don't think it'll ultimately be enough to overcome
> the difficulty in writing correct low level code for them.
> 
> Regardless, they're here and we have to deal with them.

Agreed...

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-21 19:13                                                                                             ` Paul E. McKenney
  2014-02-21 22:10                                                                                               ` Joseph S. Myers
  2014-02-24 13:55                                                                                               ` Michael Matz
@ 2014-02-26 13:04                                                                                               ` Torvald Riegel
  2014-02-26 18:27                                                                                                 ` Paul E. McKenney
  2 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-26 13:04 UTC (permalink / raw)
  To: paulmck
  Cc: Michael Matz, Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote:
> On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote:
> > Hi,
> > 
> > On Thu, 20 Feb 2014, Linus Torvalds wrote:
> > 
> > > But I'm pretty sure that any compiler guy must *hate* that current odd
> > > dependency-generation part, and if I was a gcc person, seeing that
> > > bugzilla entry Torvald pointed at, I would personally want to
> > > dismember somebody with a rusty spoon..
> > 
> > Yes.  Defining dependency chains in the way the standard currently seems 
> > to do must come from people not writing compilers.  There's simply no 
> > sensible way to implement it without being really conservative, because 
> > the depchains can contain arbitrary constructs including stores, 
> > loads and function calls but must still be observed.  
> > 
> > And with conservative I mean "everything is a source of a dependency, and 
> > hence can't be removed, reordered or otherwise fiddled with", and that 
> > includes code sequences where no atomic objects are anywhere in sight [1].
> > In the light of that the only realistic way (meaning to not have to 
> > disable optimization everywhere) to implement consume as currently 
> > specified is to map it to acquire.  At which point it becomes pointless.
> 
> No, only memory_order_consume loads and [[carries_dependency]]
> function arguments are sources of dependency chains.

However, that is, given how the standard specifies things, just one of
the possible ways for how an implementation can handle this.  Treating
[[carries_dependency]] as a "necessary" annotation to make exploiting
mo_consume work in practice is possible, but it's not required by the
standard.

Also, dependencies are specified to flow through loads and stores
(restricted to scalar objects and bitfields), so any load that might
load from a dependency-carrying store can also be a source (and that
doesn't seem to be restricted by [[carries_dependency]]).
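
A sketch of that load/store case (illustrative names):

    #include <stdatomic.h>

    struct foo { int val; };
    static struct foo *slot;         /* ordinary non-atomic scalar */
    static _Atomic(struct foo *) gp;

    int reader(void)
    {
        struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

        slot = p;               /* dependency flows into the store...  */
        struct foo *q = slot;   /* ...and, per the standard, back out  */
        return q->val;          /* of any load that reads that value   */
    }

And because the compiler often cannot prove which loads might read from
such a store, this is part of what makes a conservative implementation
so expensive.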


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-21 22:10                                                                                               ` Joseph S. Myers
  2014-02-21 22:37                                                                                                 ` Paul E. McKenney
@ 2014-02-26 13:09                                                                                                 ` Torvald Riegel
  2014-02-26 18:43                                                                                                   ` Joseph S. Myers
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-02-26 13:09 UTC (permalink / raw)
  To: Joseph S. Myers
  Cc: Paul E. McKenney, Michael Matz, Linus Torvalds, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Fri, 2014-02-21 at 22:10 +0000, Joseph S. Myers wrote:
> On Fri, 21 Feb 2014, Paul E. McKenney wrote:
> 
> > This needs to be as follows:
> > 
> > [[carries_dependency]] int getzero(int i [[carries_dependency]])
> > {
> > 	return i - i;
> > }
> > 
> > Otherwise dependencies won't get carried through it.
> 
> C11 doesn't have attributes at all (and no specification regarding calls 
> and dependencies that I can see).  And the way I read the C++11 
> specification of carries_dependency is that specifying carries_dependency 
> is purely about increasing optimization of the caller: that if it isn't 
> specified, then the caller doesn't know what dependencies might be 
> carried.  "Note: The carries_dependency attribute does not change the 
> meaning of the program, but may result in generation of more efficient 
> code. - end note".

I think that this last sentence can be kind of misleading, especially
when looking at it from an implementation point of view.  How
dependencies are handled (ie, preserving the syntactic dependencies vs.
emitting barriers) must be part of the ABI, or things like
[[carries_dependency]] won't work as expected (or lead to inefficient
code).  Thus, in practice, all compiler vendors on a platform would have
to agree to a particular handling, which might end up in selecting the
easy-but-conservative implementation option (ie, always emitting
mo_acquire when the source uses mo_consume).
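
Concretely, the conservative option means that these two loads get
compiled identically (a sketch):

    struct foo *a = atomic_load_explicit(&gp, memory_order_consume);
    struct foo *b = atomic_load_explicit(&gp, memory_order_acquire);
    /*
     * Under the easy-but-conservative choice, both are emitted with
     * acquire semantics (e.g. a load followed by a barrier on Power
     * and ARM), and mo_consume buys nothing over mo_acquire.
     */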


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 17:38                                                                                                           ` Linus Torvalds
  2014-02-24 18:12                                                                                                             ` Paul E. McKenney
@ 2014-02-26 17:34                                                                                                             ` Torvald Riegel
  1 sibling, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-26 17:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Matz, Richard Biener, Paul McKenney, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-24 at 09:38 -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 8:55 AM, Michael Matz <matz@suse.de> wrote:
> >
> > So, let me try to poke holes into your definition or increase my
> > understanding :) .  You said "chain of pointers" (dereferences I assume),
> > e.g. if p is result of consume load, then access to
> > p->here->there->next->prev->stuff is supposed to be ordered with that load
> > (or only when that last load/store itself is also an atomic load or
> > store?).
> 
> It's supposed to be ordered wrt the first load (the consuming one), yes.
> 
> > So, what happens if the pointer deref chain is partly hidden in some
> > functions:
> 
> No problem.
> 
> The thing is, the ordering is actually handled by the CPU in all
> relevant cases.  So the compiler doesn't actually need to *do*
> anything. All this legalistic stuff is just to describe the semantics
> and the guarantees.
> 
> The problem is two cases:
> 
>  (a) alpha (which doesn't really order any accesses at all, not even
> dependent loads), but for a compiler alpha is actually trivial: just
> add a "rmb" instruction after the load, and you can't really do
> anything else (there's a few optimizations you can do wrt the rmb, but
> they are very specific and simple).
> 
> So (a) is a "problem", but the solution is actually really simple, and
> gives very *strong* guarantees: on alpha, a "consume" ends up being
> basically the same as a read barrier after the load, with only very
> minimal room for optimization.
> 
>  (b) ARM and powerpc and similar architectures, that guarantee the
> data dependency as long as it is an *actual* data dependency, and
> never becomes a control dependency.
> 
> On ARM and powerpc, control dependencies do *not* order accesses (the
> reasons boil down to essentially: branch prediction breaks the
> dependency, and instructions that come after the branch can be happily
> executed before the branch). But it's almost impossible to describe
> that in the standard, since compilers can (and very much do) turn a
> control dependency into a data dependency and vice versa.
> 
> So the current standard tries to describe that "control vs data"
> dependency, and tries to limit it to a data dependency. It fails. It
> fails for multiple reasons - it doesn't allow for trivial
> optimizations that just remove the data dependency, and it also
> doesn't allow for various trivial cases where the compiler *does* turn
> the data dependency into a control dependency.
> 
> So I really really think that the current C standard language is
> broken. Unfixably broken.
> 
> I'm trying to change the "syntactic data dependency" that the current
> standard uses into something that is clearer and correct.
> 
> The "chain of pointers" thing is still obviously a data dependency,
> but by limiting it to pointers, it simplifies the language, clarifies
> the meaning, avoids all syntactic tricks (ie "p-p" is clearly a
> syntactic dependency on "p", but does *not* involve in any way
> following the pointer) and makes it basically impossible for the
> compiler to break the dependency without doing value prediction, and
> since value prediction has to be disallowed anyway, that's a feature,
> not a bug.

AFAIU, Michael is wondering about how we can separate non-synchronizing
code (ie, in this case, not taking part in any "chain of pointers" used
with mo_consume loads) from code that does.  If we cannot, then we
prevent value prediction *everywhere*, unless the compiler can prove
that the code is never part of such a chain (which is hard due to alias
analysis being hard, etc.).
(We can probably argue to which extent value prediction is necessary for
generation of efficient code, but it obviously does work in
non-synchronizing code (or even with acquire barriers with some care) --
so forbidding it entirely might be bad.)
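
Written as the source-level transformation it amounts to, value
speculation looks like this (a sketch; likely_target is whatever the
predictor guesses, and p comes from a consume load):

    /* Original: */
    int v = p->val;

    /* After value speculation: the data dependency on p has become a
     * control dependency, which ARM and Power do not order: */
    int v;
    if (p == &likely_target)
        v = likely_target.val;   /* may be loaded before p is known */
    else
        v = p->val;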


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 17:28                                                                                                           ` Paul E. McKenney
  2014-02-24 17:57                                                                                                             ` Paul E. McKenney
@ 2014-02-26 17:39                                                                                                             ` Torvald Riegel
  1 sibling, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-26 17:39 UTC (permalink / raw)
  To: paulmck
  Cc: Michael Matz, Linus Torvalds, Richard Biener, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-24 at 09:28 -0800, Paul E. McKenney wrote:
> On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote:
> > Hi,
> > 
> > On Mon, 24 Feb 2014, Linus Torvalds wrote:
> > 
> > > > To me that reads like
> > > >
> > > >   int i;
> > > >   int *q = &i;
> > > >   int **p = &q;
> > > >
> > > >   atomic_XXX (p, CONSUME);
> > > >
> > > > orders against accesses '*p', '**p', '*q' and 'i'.  Thus it seems they
> > > > want to say that it orders against aliased storage - but then go further
> > > > and include "indirectly through a chain of pointers"?!  Thus an
> > > > atomic read of a int * orders against any 'int' memory operation but
> > > > not against 'float' memory operations?
> > > 
> > > No, it's not about type at all, and the "chain of pointers" can be
> > > much more complex than that, since the "int *" can point to within an
> > > object that contains other things than just that "int" (the "int" can
> > > be part of a structure that then has pointers to other structures
> > > etc).
> > 
> > So, let me try to poke holes into your definition or increase my 
> > understanding :) .  You said "chain of pointers" (dereferences I assume),
> > e.g. if p is result of consume load, then access to 
> > p->here->there->next->prev->stuff is supposed to be ordered with that load 
> > (or only when that last load/store itself is also an atomic load or 
> > store?).
> > 
> > So, what happens if the pointer deref chain is partly hidden in some 
> > functions:
> > 
> > A * adjustptr (B *ptr) { return &ptr->here->there->next; }
> > B * p = atomic_XXX (&somewhere, consume);
> > adjustptr(p)->prev->stuff = bla;
> > 
> > As far as I understood you, this whole ptrderef chain business would be 
> > only an optimization opportunity, right?  So if the compiler can't be sure 
> > how p is actually used (as in my function-using case, assume adjustptr is 
> > defined in another unit), then the consume load would simply be 
> > transformed into an acquire (or whatever, with some barrier I mean)?  Only 
> > _if_ the compiler sees all obvious uses of p (indirectly through pointer 
> > derefs) can it, yeah, do what with the consume load?
> 
> Good point, I left that out of my list.  Adding it:
> 
> 13.	By default, pointer chains do not propagate into or out of functions.
> 	In implementations having attributes, a [[carries_dependency]]
> 	may be used to mark a function argument or return as passing
> 	a pointer chain into or out of that function.
> 
> 	If a function does not contain memory_order_consume loads and
> 	also does not contain [[carries_dependency]] attributes, then
> 	that function may be compiled using any desired dependency-breaking
> 	optimizations.
> 
> 	The ordering effects are implementation defined when a given
> 	pointer chain passes into or out of a function through a parameter
> 	or return not marked with a [[carries_dependency]] attribute.
> 
> Note that this last paragraph differs from the current standard, which
> would require ordering regardless.

I would prefer if we could get rid of [[carries_dependency]] as well;
currently, it's a hint whose effectiveness really depends on how the
particular implementation handles this attribute.  If we still need
something like it in the future, it would be good if it had a clearer
use and performance effects.
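
For reference, under Paul's item 13 above, Michael's adjustptr example
would have to be annotated roughly as follows to keep the chain intact
across the call (C++11 attribute syntax, as in Paul's earlier example;
just a sketch):

  [[carries_dependency]] A *adjustptr(B *ptr [[carries_dependency]])
  {
    return &ptr->here->there->next;
  }

Without the annotations, item 13 makes the resulting ordering
implementation defined once the chain passes through adjustptr().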


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-26 13:04                                                                                               ` Torvald Riegel
@ 2014-02-26 18:27                                                                                                 ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-26 18:27 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Michael Matz, Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Feb 26, 2014 at 02:04:30PM +0100, Torvald Riegel wrote:
> 
> On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote:
> > > Hi,
> > > 
> > > On Thu, 20 Feb 2014, Linus Torvalds wrote:
> > > 
> > > > But I'm pretty sure that any compiler guy must *hate* that current odd
> > > > dependency-generation part, and if I was a gcc person, seeing that
> > > > bugzilla entry Torvald pointed at, I would personally want to
> > > > dismember somebody with a rusty spoon..
> > > 
> > > Yes.  Defining dependency chains in the way the standard currently seems 
> > > to do must come from people not writing compilers.  There's simply no 
> > > sensible way to implement it without being really conservative, because 
> > > the depchains can contain arbitrary constructs including stores, 
> > > loads and function calls but must still be observed.  
> > > 
> > > And by conservative I mean "everything is a source of a dependency, and 
> > > hence can't be removed, reordered or otherwise fiddled with", and that 
> > > includes code sequences where no atomic objects are anywhere in sight [1].
> > > In the light of that the only realistic way (meaning to not have to 
> > > disable optimization everywhere) to implement consume as currently 
> > > specified is to map it to acquire.  At which point it becomes pointless.
> > 
> > No, only memory_order_consume loads and [[carries_dependency]]
> > function arguments are sources of dependency chains.
> 
> However, that is, given how the standard specifies things, just one of
> the possible ways for how an implementation can handle this.  Treating
> [[carries_dependency]] as a "necessary" annotation to make exploiting
> mo_consume work in practice is possible, but it's not required by the
> standard.
> 
> Also, dependencies are specified to flow through loads and stores
> (restricted to scalar objects and bitfields), so any load that might
> load from a dependency-carrying store can also be a source (and that
> doesn't seem to be restricted by [[carries_dependency]]).

OK, this last is clearly totally unacceptable.  :-/

Leaving aside the option of dropping the whole thing for the moment,
the only thing that suggests itself is having all dependencies die at
a specific point in the code, corresponding to the rcu_read_unlock().
But as far as I can see, that absolutely requires "necessary" parameter
and return marking in order to correctly handle nested RCU read-side
critical sections in different functions.
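
Roughly, in kernel terms, the idea would be (just a sketch):

	rcu_read_lock();
	p = rcu_dereference(gp);
	r1 = p->a;		/* dependency ordering guaranteed */
	rcu_read_unlock();	/* all dependency chains would die here */
	r2 = p->a;		/* no longer ordered -- and illegal RCU
				   usage anyway, since the grace period
				   may have ended */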

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-26 13:09                                                                                                 ` Torvald Riegel
@ 2014-02-26 18:43                                                                                                   ` Joseph S. Myers
  2014-02-27  0:52                                                                                                     ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Joseph S. Myers @ 2014-02-26 18:43 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul E. McKenney, Michael Matz, Linus Torvalds, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Wed, 26 Feb 2014, Torvald Riegel wrote:

> On Fri, 2014-02-21 at 22:10 +0000, Joseph S. Myers wrote:
> > On Fri, 21 Feb 2014, Paul E. McKenney wrote:
> > 
> > > This needs to be as follows:
> > > 
> > > [[carries_dependency]] int getzero(int i [[carries_dependency]])
> > > {
> > > 	return i - i;
> > > }
> > > 
> > > Otherwise dependencies won't get carried through it.
> > 
> > C11 doesn't have attributes at all (and no specification regarding calls 
> > and dependencies that I can see).  And the way I read the C++11 
> > specification of carries_dependency is that specifying carries_dependency 
> > is purely about increasing optimization of the caller: that if it isn't 
> > specified, then the caller doesn't know what dependencies might be 
> > carried.  "Note: The carries_dependency attribute does not change the 
> > meaning of the program, but may result in generation of more efficient 
> > code. - end note".
> 
> I think that this last sentence can be kind of misleading, especially
> when looking at it from an implementation point of view.  How
> dependencies are handled (ie, preserving the syntactic dependencies vs.
> emitting barriers) must be part of the ABI, or things like
> [[carries_dependency]] won't work as expected (or lead to inefficient
> code).  Thus, in practice, all compiler vendors on a platform would have
> to agree to a particular handling, which might end up in selecting the
> easy-but-conservative implementation option (ie, always emitting
> mo_acquire when the source uses mo_consume).

Regardless of the ABI, my point is that if a program is valid, it is also 
valid when all uses of [[carries_dependency]] are removed.  If a function 
doesn't use [[carries_dependency]], that means "dependencies may or may 
not be carried through this function".  If a function uses 
[[carries_dependency]], that means that certain dependencies *are* carried 
through the function (and the ABI should then specify what this means the 
caller can rely on, in terms of the architecture's memory model).  (This 
may or may not be useful, but it's how I understand C++11.)
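
Concretely (my reading, sketched as declarations, not normative wording):

  int *f(int *p);   /* no attribute: the caller must assume dependencies
                       may or may not be carried through f() */

  [[carries_dependency]] int *g(int *p [[carries_dependency]]);
                    /* attribute: the ABI defines which dependencies the
                       caller may rely on across the call to g() */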

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-26 18:43                                                                                                   ` Joseph S. Myers
@ 2014-02-27  0:52                                                                                                     ` Torvald Riegel
  0 siblings, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-27  0:52 UTC (permalink / raw)
  To: Joseph S. Myers
  Cc: Paul E. McKenney, Michael Matz, Linus Torvalds, Will Deacon,
	Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Wed, 2014-02-26 at 18:43 +0000, Joseph S. Myers wrote:
> On Wed, 26 Feb 2014, Torvald Riegel wrote:
> 
> > On Fri, 2014-02-21 at 22:10 +0000, Joseph S. Myers wrote:
> > > On Fri, 21 Feb 2014, Paul E. McKenney wrote:
> > > 
> > > > This needs to be as follows:
> > > > 
> > > > [[carries_dependency]] int getzero(int i [[carries_dependency]])
> > > > {
> > > > 	return i - i;
> > > > }
> > > > 
> > > > Otherwise dependencies won't get carried through it.
> > > 
> > > C11 doesn't have attributes at all (and no specification regarding calls 
> > > and dependencies that I can see).  And the way I read the C++11 
> > > specification of carries_dependency is that specifying carries_dependency 
> > > is purely about increasing optimization of the caller: that if it isn't 
> > > specified, then the caller doesn't know what dependencies might be 
> > > carried.  "Note: The carries_dependency attribute does not change the 
> > > meaning of the program, but may result in generation of more efficient 
> > > code. - end note".
> > 
> > I think that this last sentence can be kind of misleading, especially
> > when looking at it from an implementation point of view.  How
> > dependencies are handled (ie, preserving the syntactic dependencies vs.
> > emitting barriers) must be part of the ABI, or things like
> > [[carries_dependency]] won't work as expected (or lead to inefficient
> > code).  Thus, in practice, all compiler vendors on a platform would have
> > to agree to a particular handling, which might end up in selecting the
> > easy-but-conservative implementation option (ie, always emitting
> > mo_acquire when the source uses mo_consume).
> 
> Regardless of the ABI, my point is that if a program is valid, it is also 
> valid when all uses of [[carries_dependency]] are removed.  If a function 
> doesn't use [[carries_dependency]], that means "dependencies may or may 
> not be carried through this function".  If a function uses 
> [[carries_dependency]], that means that certain dependencies *are* carried 
> through the function (and the ABI should then specify what this means the 
> caller can rely on, in terms of the architecture's memory model).  (This 
> may or may not be useful, but it's how I understand C++11.)

I agree.  What I tried to point out is that it's not the case that an
*implementation* can just ignore [[carries_dependency]].  So from an
implementation perspective, the attribute does have semantics.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 19:54                                                                                                           ` Linus Torvalds
  2014-02-24 22:37                                                                                                             ` Paul E. McKenney
@ 2014-02-27 15:37                                                                                                             ` Torvald Riegel
  2014-02-27 17:01                                                                                                               ` Linus Torvalds
  2014-02-27 17:50                                                                                                               ` Paul E. McKenney
  1 sibling, 2 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-27 15:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Good points.  How about the following replacements?
> >
> > 3.      Adding or subtracting an integer to/from a chained pointer
> >         results in another chained pointer in that same pointer chain.
> >         The results of addition and subtraction operations that cancel
> >         the chained pointer's value (for example, "p-(long)p" where "p"
> >         is a pointer to char) are implementation defined.
> >
> > 4.      Bitwise operators ("&", "|", "^", and I suppose also "~")
> >         applied to a chained pointer and an integer for the purposes
> >         of alignment and pointer translation results in another
> >         of alignment and pointer translation result in another
> >         of bitwise operators on chained pointers (for example,
> >         "p|~0") are implementation defined.
> 
> Quite frankly, I think all of this language that is about the actual
> operations is irrelevant and wrong.
> 
> It's not going to help compiler writers, and it sure isn't going to
> help users that read this.
> 
> Why not just talk about "value chains" and that any operations that
> restrict the value range severely end up breaking the chain. There is
> no point in listing the operations individually, because every single
> operation *can* restrict things. Listing individual operations and
> dependencies is just fundamentally wrong.

[...]

> The *only* thing that matters for all of them is whether they are
> "value-preserving", or whether they drop so much information that the
> compiler might decide to use a control dependency instead. That's true
> for every single one of them.
> 
> Similarly, actual true control dependencies that limit the problem
> space sufficiently that the actual pointer value no longer has
> significant information in it (see the above example) are also things
> that remove information to the point that only a control dependency
> remains. Even when the value itself is not modified in any way at all.

I agree that just considering syntactic properties of the program seems
to be insufficient.  Making it instead depend on whether there is a
"semantic" dependency due to a value being "necessary" to compute a
result seems better.  However, whether a value is "necessary" might not
be obvious, and I understand Paul's argument that he does not want to
have to reason about all potential compiler optimizations.  Thus, I
believe we need to specify when a value is "necessary".

I have a suggestion for a somewhat different formulation of the feature
that you seem to have in mind, which I'll discuss below.  Excuse the
verbosity of the following, but I'd rather like to avoid
misunderstandings than save a few words.


What we'd like to capture is that a value originating from a mo_consume
load is "necessary" for a computation (e.g., it "cannot" be replaced
with value predictions and/or control dependencies); if that's the case
in the program, we can reasonably assume that a compiler implementation
will transform this into a data dependency, which will then lead to
ordering guarantees by the HW.

However, we need to specify when a value is "necessary".  We could say
that this is implementation-defined, and use a set of litmus tests
(e.g., like those discussed in the thread) to roughly carve out what a
programmer could expect.  This may even be practical for a project like
the Linux kernel that follows strict project-internal rules and pays a
lot of attention to what the particular implementations of compilers
expected to compile the kernel are doing.  However, I think this
approach would be too vague for the standard and for many other
programs/projects.


One way to understand "necessary" would be to say that if a mo_consume
load can result in more than V different values, then the actual value
is "unknown", and thus "necessary" to compute anything based on it.
(But this is flawed, as discussed below.)

However, how big should V be?  If it's larger than 1, atomic bool cannot
be used with mo_consume, which seems weird.
If V is 1, then Linus' litmus tests work (but Paul's doesn't; see
below), but the compiler must not try to predict more than one value.
This holds for any choice of V, so there always is an *additional*
constraint on code generation for operations that are meant to take part
in such "value dependencies".  The bigger V might be, the less likely it
should be for this to actually constrain a particular compiler's
optimizations (e.g., while it might be meaningful to use value
prediction for two or three values, it's probably not for 1000s).
Nonetheless, if we don't want to predict the future, we need to specify
V.  Given that we always have some constraint for code generation
anyway, and given that V > 1 might be an arbitrary-looking constraint
and disallows use on atomic bool, I believe V should be 1.

Furthermore, there is a problem in saying "a load can result in more
than one value" because in a deterministic program/operation, it will
result in exactly one value.  Paul's (counter-)litmus test is a similar
example: the compiler saw all stores to a particular variable, so it was
able to show that a particular pointer variable actually just has two
possible values (ie, it's like a bool).
How do we avoid this problem?  And do we actually want to require
programmers to reason about this at all (ie, issues like Paul's
example)?  The only solution that I currently see is to specify the
allowed scope of the knowledge about the values a load can result in.

Based on these thoughts, we could specify the new mo_consume guarantees
roughly as follows:

        An evaluation E (in an execution) has a value dependency to an
        atomic and mo_consume load L (in an execution) iff:
        * L's type holds more than one value (ruling out constants
        etc.),
        * L is sequenced-before E,
        * L's result is used by the abstract machine to compute E,
        * E is value-dependency-preserving code (defined below), and
        * at the time of execution of E, L can possibly have returned at
        least two different values under the assumption that L itself
        could have returned any value allowed by L's type.
        
        If a memory access A's targeted memory location has a value
        dependency on a mo_consume load L, and an action X
        inter-thread-happens-before L, then X happens-before A.

While this attempt at a definition certainly needs work, I hope that at
least the first 3 requirements for value dependencies are clear.

The fifth requirement tries to capture both Linus' "value chains"
concept as well as the point that we need to specify the scope of
knowledge about values.
Regarding the latter, we make a fresh start at each mo_consume load (ie,
we assume we know nothing -- L could have returned any possible value);
I believe this is easier to reason about than other scopes like function
granularities (what happens on inlining?), or translation units.  It
should also be simple to implement for compilers, and would hopefully
not constrain optimization too much.
The rest of the requirement then forbids "breaking value chains". For
example (for now, let's ignore the fourth requirement in the examples):

  int *p = atomic_load(&pp, mo_consume);
  if (p == &myvariable)
    q = *p;  // (1)

When (1) is executed, the load can only have returned the value
&myvariable (otherwise, the branch wouldn't be executed), so this
evaluation of p is not value-dependent on the mo_consume load anymore.

OTOH, the following does work (ie, uses a value dependency for
ordering), because int* can have more than two values:

  int *p = atomic_load(&pp, mo_consume);
  if (p != &myvariable)
    q = *p;  // (1)

But it would not work using bool:

  bool b = atomic_load(&foo, mo_consume);
  if (b != false)
    q = arr[b];  // only one value ("true") is left for b

Paul's litmus test would work, because we guarantee to the programmer
that it can assume that the mo_consume load would return any value
allowed by the type; effectively, this forbids the compiler analysis
Paul thought about:

        static struct foo variable1;
        static struct foo variable2;
        static struct foo *pp = &variable1;

T1:     initialize_foo(&variable2);
        atomic_store_explicit(&pp, &variable2, memory_order_release);
        /* The above is the only store to pp in this translation unit,
         * and the address of pp is not exported in any way.
         */

T2:     int *p = atomic_load(&pp, mo_consume);
        if (p == &variable1)
                return p->val1; /* Must be variable1.val1.
                                   No value dependency. */
        else
                return p->val2; /* Will read variable2.val2, but value
                                   dependency is guaranteed. */


The fourth requirement (value-dependency-preserving code) might look as
if I'm re-introducing the problematic carries-a-dependency again, but
that's *not* the case.  We do need to demarcate
value-dependency-carrying code from other code so that the compiler
knows when it can do value prediction and similar transformations (see
above).  However, note that this fourth requirement is necessary but not
sufficient for a value dependency -- the fifth requirement (ie, that
there must be a "value chain") always has the last word, so to speak.
IOW, this does not try to define dependencies based on syntax.

I suggest to use data types to demarcate code that is actually meant to
preserve value dependencies.  The current standard tries to do it on the
granularity of functions (ie, using [[carries_dependency]]), and I
believe we agree that this isn't great.  For example, it's too
constraining for large functions, it requires annotations, calls from
non-annotated to annotated functions might result in further barriers,
function-pointers are problematic (can't put [[carries_dependency]] on a
function pointer type?), etc.

What I have in mind is roughly the following (totally made-up syntax --
suggestions for how to do this properly are very welcome):
* Have a type modifier (eg, like restrict), that specifies that
operations on data of this type are preserving value dependencies:
  int value_dep_preserving *foo;
  bool value_dep_preserving bar;
* mo_consume loads return value_dep_preserving types.
* One can cast from value_dep_preserving to the normal type, but not
vice-versa.
* Operations that take a value_dep_preserving-typed operand will usually
produce a value_dep_preserving type.  One can argue whether "?:" should
be an exception to this, at least if only the left/first operand is
value_dep_preserving.  Unlike in the current standard, this is *not*
critical here because in the end, it's all down to the fifth requirement
(see above).
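
Putting those rules together (same totally made-up syntax):

  int value_dep_preserving *p = atomic_load(&pp, mo_consume);
  x = *p;                     /* dereference preserves the dependency */
  int *q = (int *)p;          /* OK: casts value_dep_preserving away --
                                 and the ordering guarantee with it */
  p = (int value_dep_preserving *)q;  /* not allowed: no cast back */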

This has a couple of drawbacks:
* The programmer needs to use the types.
* It extends the type system (but [[carries_dependency]] effectively
extends the type system as well, depending on how it's implemented).

This has a few advantages over function-granularity or just hoping that
the compiler gets it right by automatic analysis:
* The programmer can use it for exactly the variables that need
value-dep-preservation.  This won't slow down other code.
* If an mo_consume load's result is used directly (or in C++ with auto),
no special type annotations / extra code are required.
* The compiler knows exactly which code has constraints on code
generation (e.g., no value prediction), due to the types.  No points-to
analysis necessary.
* Programmers could even use special macros that only load from
value_dep_preserving-typed addresses (based on type checks); this would
obviously not catch all errors, but many.  For example:
  int *foo, *bar;
  int value_dep_preserving *p = atomic_load(&pp, mo_consume);
  x = VALUE_DEP_DEREF(p != NULL ? &foo : &bar);  // Error!
  if (p == &myvariable)
    q = VALUE_DEP_DEREF(p);  // Wouldn't be guaranteed to be caught, but
                             // compiler might be able to emit warning.
* In C++, we might be able to provide this with just a special template:
std::value_dep_preserving<int *> p;
Suitably defined operators for this template should be able to take care
of the rest, and the implementation might use implementation-defined
mechanisms internally to constrain code generation.


What do you think?

Is this meaningful regarding what current hardware offers, or will it do
(or might do in the future) value prediction on its own?


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 15:37                                                                                                             ` Torvald Riegel
@ 2014-02-27 17:01                                                                                                               ` Linus Torvalds
  2014-02-27 19:06                                                                                                                 ` Paul E. McKenney
  2014-03-03 15:36                                                                                                                 ` Torvald Riegel
  2014-02-27 17:50                                                                                                               ` Paul E. McKenney
  1 sibling, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-27 17:01 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel <triegel@redhat.com> wrote:
>
> I agree that just considering syntactic properties of the program seems
> to be insufficient.  Making it instead depend on whether there is a
> "semantic" dependency due to a value being "necessary" to compute a
> result seems better.  However, whether a value is "necessary" might not
> be obvious, and I understand Paul's argument that he does not want to
> have to reason about all potential compiler optimizations.  Thus, I
> believe we need to specify when a value is "necessary".

I suspect it's hard to really strictly define, but at the same time I
actually think that compiler writers (and users, for that matter) have
little problem understanding the concept and intent.

I do think that listing operations might be useful to give good
examples of what is a "necessary" value, and - perhaps more
importantly - what can break the value from being "necessary".
Especially the gotchas.

> I have a suggestion for a somewhat different formulation of the feature
> that you seem to have in mind, which I'll discuss below.  Excuse the
> verbosity of the following, but I'd rather like to avoid
> misunderstandings than save a few words.

Ok, I'm going to cut most of the verbiage since it's long and I'm not
commenting on most of it.

But

> Based on these thoughts, we could specify the new mo_consume guarantees
> roughly as follows:
>
>         An evaluation E (in an execution) has a value dependency to an
>         atomic and mo_consume load L (in an execution) iff:
>         * L's type holds more than one value (ruling out constants
>         etc.),
>         * L is sequenced-before E,
>         * L's result is used by the abstract machine to compute E,
>         * E is value-dependency-preserving code (defined below), and
>         * at the time of execution of E, L can possibly have returned at
>         least two different values under the assumption that L itself
>         could have returned any value allowed by L's type.
>
>         If a memory access A's targeted memory location has a value
>         dependency on a mo_consume load L, and an action X
>         inter-thread-happens-before L, then X happens-before A.

I think this mostly works.

> Regarding the latter, we make a fresh start at each mo_consume load (ie,
> we assume we know nothing -- L could have returned any possible value);
> I believe this is easier to reason about than other scopes like function
> granularities (what happens on inlining?), or translation units.  It
> should also be simple to implement for compilers, and would hopefully
> not constrain optimization too much.
>
> [...]
>
> Paul's litmus test would work, because we guarantee to the programmer
> that it can assume that the mo_consume load would return any value
> allowed by the type; effectively, this forbids the compiler analysis
> Paul thought about:

So realistically, since with the new wording we can ignore the silly
cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
cases ("if (p == &variable) .. use p"), and you would forbid the
"global value range optimization case" that Paul bright up, what
remains would seem to be just really subtle compiler transformations
of data dependencies to control dependencies.

And the only such thing I can think of is basically compiler-initiated
value-prediction, presumably directed by PGO (since now if the value
prediction is in the source code, it's considered to break the value
chain).
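
For concreteness, the ruled-out transformation would look something like
this ("likely_target" being a made-up PGO-predicted value):

  /* source */
  p = atomic_load_explicit(&pp, memory_order_consume);
  x = p->val;

  /* compiler-initiated value prediction -- must not happen here, since
     the load of x would then be ordered only by a control dependency
     that the CPU is free to speculate past */
  p = atomic_load_explicit(&pp, memory_order_consume);
  if (p == &likely_target)
    x = likely_target.val;
  else
    x = p->val;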

The good thing is that, afaik, value-prediction is largely not used in
real life. There are lots of papers on it, but I don't think
anybody actually does it (although I can easily see some
specint-specific optimization pattern that is built up around it).

And even value prediction is actually fine, as long as the compiler
can see the memory *source* of the value prediction (and it isn't a
mo_consume). So it really ends up limiting your value prediction in
very simple ways: you cannot do it to function arguments if they are
registers. But you can still do value prediction on values you loaded
from memory, if you can actually *see* that memory op.

Of course, on more strongly ordered CPUs, even that "register
argument" limitation goes away.

So I agree that there is basically no real optimization constraint.
Value-prediction is of dubious value to begin with, and the actual
constraint on its use, if some compiler writer really wants to do it, is
not onerous.

> What I have in mind is roughly the following (totally made-up syntax --
> suggestions for how to do this properly are very welcome):
> * Have a type modifier (eg, like restrict), that specifies that
> operations on data of this type are preserving value dependencies:

So I'm not violently opposed, but I think the upsides are not great.
Note that my earlier suggestion to use "restrict" wasn't because I
believed the annotation itself would be visible, but basically just as
a legalistic promise to the compiler that *if* it found an alias, then
it didn't need to worry about ordering. So to me, that type modifier
was about conceptual guarantees, not about actual value chains.

Anyway, the reason I don't believe any type modifier (and
"[[carries_dependency]]" is basically just that) is worth it is simply
that it adds a real burden on the programmer, without actually giving
the programmer any real upside:

Within a single function, the compiler already sees that mo_consume
source, and so doing a type-based restriction doesn't really help. The
information is already there, without any burden on the programmer.

And across functions, the compiler has already - by definition -
mostly lost sight of all the things it could use to reduce the value
space. Even Paul's example doesn't really work if the use of the
"mo_consume" value has been passed to another function, because inside
a separate function, the compiler couldn't see that the value it uses
comes from only two possible values.

And as mentioned, even *if* the compiler wants to do value prediction
that turns a data dependency into a control dependency, the limitation
to say "no, you can't do it unless you saw where the value got loaded"
really isn't that onerous.

I bet that if you ask actual production compiler people (as opposed to
perhaps academia), none of them actually really believe in value
prediction to begin with.

> What do you think?
>
> Is this meaningful regarding what current hardware offers, or will it do
> (or might do in the future) value prediction on its own?

I can pretty much guarantee that when/if hardware does value
prediction on its own, it will do so without exposing it as breaking
the data dependency.

The thing is, a CPU is actually *much* better situated at doing
speculative memory accesses, because a CPU already has all the
infrastructure to do speculation in general.

And for a CPU, once you do value speculation, guaranteeing the memory
ordering is *trivial*: all you need to do is to track the "speculated"
memory instruction until you check the value (which you obviously have
to do anyway, otherwise you're not doing value _prediction_, you're
just doing "value wild guessing" ;^), and when you check the value you
also check that the cacheline hasn't been evicted out-of-order.

This is all stuff that CPU people already do. If you have
transactional memory, you already have all the resources to do this.
Or, even without transactional memory, if like Intel you have a memory
model that says "loads are done in order" but you actually wildly
speculate loads and just check before retiring instructions that the
cachelines didn't get evicted out of order, you already have all the
hardware to do value prediction *without* making it visible in the
memory order.

This, btw, is one reason why people who think that compilers should be
overly smart and do fancy tricks are incompetent. People who thought
that Itanium was a great idea ("Let's put the complexity in the
compiler, and make a simple CPU") are simply objectively *wrong*.
People who think that value prediction by a compiler is a good idea
are not people you should really care about.

                         Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 15:37                                                                                                             ` Torvald Riegel
  2014-02-27 17:01                                                                                                               ` Linus Torvalds
@ 2014-02-27 17:50                                                                                                               ` Paul E. McKenney
  2014-02-27 19:22                                                                                                                 ` Paul E. McKenney
                                                                                                                                   ` (2 more replies)
  1 sibling, 3 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-27 17:50 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
> 
> On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> > >
> > > Good points.  How about the following replacements?
> > >
> > > 3.      Adding or subtracting an integer to/from a chained pointer
> > >         results in another chained pointer in that same pointer chain.
> > >         The results of addition and subtraction operations that cancel
> > >         the chained pointer's value (for example, "p-(long)p" where "p"
> > >         is a pointer to char) are implementation defined.
> > >
> > > 4.      Bitwise operators ("&", "|", "^", and I suppose also "~")
> > >         applied to a chained pointer and an integer for the purposes
> > >         of alignment and pointer translation results in another
> > >         of alignment and pointer translation result in another
> > >         of bitwise operators on chained pointers (for example,
> > >         "p|~0") are implementation defined.
> > 
> > Quite frankly, I think all of this language that is about the actual
> > operations is irrelevant and wrong.
> > 
> > It's not going to help compiler writers, and it sure isn't going to
> > help users that read this.
> > 
> > Why not just talk about "value chains" and that any operations that
> > restrict the value range severely end up breaking the chain. There is
> > no point in listing the operations individually, because every single
> > operation *can* restrict things. Listing individual operations and
> > dependencies is just fundamentally wrong.
> 
> [...]
> 
> > The *only* thing that matters for all of them is whether they are
> > "value-preserving", or whether they drop so much information that the
> > compiler might decide to use a control dependency instead. That's true
> > for every single one of them.
> > 
> > Similarly, actual true control dependencies that limit the problem
> > space sufficiently that the actual pointer value no longer has
> > significant information in it (see the above example) are also things
> > that remove information to the point that only a control dependency
> > remains. Even when the value itself is not modified in any way at all.
> 
> I agree that just considering syntactic properties of the program seems
> to be insufficient.  Making it instead depend on whether there is a
> "semantic" dependency due to a value being "necessary" to compute a
> result seems better.  However, whether a value is "necessary" might not
> be obvious, and I understand Paul's argument that he does not want to
> have to reason about all potential compiler optimizations.  Thus, I
> believe we need to specify when a value is "necessary".
> 
> I have a suggestion for a somewhat different formulation of the feature
> that you seem to have in mind, which I'll discuss below.  Excuse the
> verbosity of the following, but I'd rather like to avoid
> misunderstandings than save a few words.

Thank you very much for putting this forward!  I must confess that I was
stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
is quite clearly way bogus.

One possible saving grace:  From discussions at the standards committee
meeting a few weeks ago, there is some chance that the committee will
be willing to do a rip-and-replace on the current memory_order_consume
wording, without provisions for backwards compatibility with the current
bogosity.

> What we'd like to capture is that a value originating from a mo_consume
> load is "necessary" for a computation (e.g., it "cannot" be replaced
> with value predictions and/or control dependencies); if that's the case
> in the program, we can reasonably assume that a compiler implementation
> will transform this into a data dependency, which will then lead to
> ordering guarantees by the HW.
> 
> However, we need to specify when a value is "necessary".  We could say
> that this is implementation-defined, and use a set of litmus tests
> (e.g., like those discussed in the thread) to roughly carve out what a
> programmer could expect.  This may even be practical for a project like
> the Linux kernel that follows strict project-internal rules and pays a
> lot of attention to what the particular implementations of compilers
> expected to compile the kernel are doing.  However, I think this
> approach would be too vague for the standard and for many other
> programs/projects.

I agree that a number of other projects would have more need for this than
might the kernel.  Please understand that this is in no way denigrating
the intelligence of other projects' members.  It is just that many of
them have only recently started seriously thinking about concurrency.
In contrast, the Linux kernel community has been doing concurrency since
the mid-1990s.  Projects with less experience with concurrency will
probably need more help, from the compiler and from elsewhere as well.

Your proposal looks quite promising at first glance.  But rather than
try and comment on it immediately, I am going to take a number of uses of
RCU from the Linux kernel and apply your proposal to them, then respond
with the results.

Fair enough?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 17:01                                                                                                               ` Linus Torvalds
@ 2014-02-27 19:06                                                                                                                 ` Paul E. McKenney
  2014-02-27 19:47                                                                                                                   ` Linus Torvalds
  2014-03-03 15:36                                                                                                                 ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-27 19:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 27, 2014 at 09:01:40AM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > I agree that just considering syntactic properties of the program seems
> > to be insufficient.  Making it instead depend on whether there is a
> > "semantic" dependency due to a value being "necessary" to compute a
> > result seems better.  However, whether a value is "necessary" might not
> > be obvious, and I understand Paul's argument that he does not want to
> > have to reason about all potential compiler optimizations.  Thus, I
> > believe we need to specify when a value is "necessary".
> 
> I suspect it's hard to really strictly define, but at the same time I
> actually think that compiler writers (and users, for that matter) have
> little problem understanding the concept and intent.
> 
> I do think that listing operations might be useful to give good
> examples of what is a "necessary" value, and - perhaps more
> importantly - what can break the value from being "necessary".
> Especially the gotchas.
> 
> > I have a suggestion for a somewhat different formulation of the feature
> > that you seem to have in mind, which I'll discuss below.  Excuse the
> > verbosity of the following, but I'd rather like to avoid
> > misunderstandings than save a few words.
> 
> Ok, I'm going to cut most of the verbiage since it's long and I'm not
> commenting on most of it.
> 
> But
> 
> > Based on these thoughts, we could specify the new mo_consume guarantees
> > roughly as follows:
> >
> >         An evaluation E (in an execution) has a value dependency to an
> >         atomic and mo_consume load L (in an execution) iff:
> >         * L's type holds more than one value (ruling out constants
> >         etc.),
> >         * L is sequenced-before E,
> >         * L's result is used by the abstract machine to compute E,
> >         * E is value-dependency-preserving code (defined below), and
> >         * at the time of execution of E, L can possibly have returned at
> >         least two different values under the assumption that L itself
> >         could have returned any value allowed by L's type.
> >
> >         If a memory access A's targeted memory location has a value
> >         dependency on a mo_consume load L, and an action X
> >         inter-thread-happens-before L, then X happens-before A.
> 
> I think this mostly works.
> 
> > Regarding the latter, we make a fresh start at each mo_consume load (ie,
> > we assume we know nothing -- L could have returned any possible value);
> > I believe this is easier to reason about than other scopes like function
> > granularities (what happens on inlining?), or translation units.  It
> > should also be simple to implement for compilers, and would hopefully
> > not constrain optimization too much.
> >
> > [...]
> >
> > Paul's litmus test would work, because we guarantee to the programmer
> > that it can assume that the mo_consume load would return any value
> > allowed by the type; effectively, this forbids the compiler analysis
> > Paul thought about:
> 
> So realistically, since with the new wording we can ignore the silly
> cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
> cases ("if (p == &variable) .. use p"), and you would forbid the
> "global value range optimization case" that Paul bright up, what
> remains would seem to be just really subtle compiler transformations
> of data dependencies to control dependencies.

FWIW, I am looking through the kernel for instances of your first
"if (p == &variable) .. use p" limus test.  All the ones I have found
thus far are OK for one of the following reasons:

1.	The comparison was against NULL, so you don't get to dereference
	the pointer anyway.  About 80% are in this category.

2.	The comparison was against another pointer, but there were no
	dereferences afterwards.  Here is an example of what these
	can look like:

		list_for_each_entry_rcu(p, &head, next)
			if (p == &variable)
				return; /* "p" goes out of scope. */

3.	The comparison was against another RCU-protected pointer,
	where that other pointer was properly fetched using one
	of the RCU primitives.  Here it doesn't matter which pointer
	you use.  At least as long as the rcu_assign_pointer() for
	that other pointer happened after the last update to the
	pointed-to structure.
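
	Concretely, #3 is patterns like the following (a sketch; gp1 and
	gp2 are made-up RCU-protected globals):

		p = rcu_dereference(gp1);
		q = rcu_dereference(gp2);
		if (p == q)
			do_something(p->a);	/* the compiler may substitute
						   q for p here, but q carries
						   a dependency as well */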

I am a bit nervous about #3.  Any thoughts on it?

Some other reasons why it would be OK to dereference after a comparison:

4.	The pointed-to data is constant: (a) It was initialized at
	boot time, (b) the update-side lock is held, (c) we are
	running in a kthread and the data was initialized before the
	kthread was created, (d) we are running in a module, and
	the data was initialized during or before module-init time
	for that module.  And many more besides, involving pretty
	much every kernel primitive that makes something run later.

5.	All subsequent dereferences are stores, so that a control
	dependency is in effect.
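
	For example (sketch):

		p = rcu_dereference(gp);
		if (p == &known_struct)
			p->a = 1;	/* even if the compiler rewrites this
					   as known_struct.a = 1, the store is
					   still ordered by the control
					   dependency, because CPUs do not
					   commit speculative stores */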

Thoughts?

FWIW, no arguments with the following.

							Thanx, Paul

> And the only such thing I can think of is basically compiler-initiated
> value-prediction, presumably directed by PGO (since now if the value
> prediction is in the source code, it's considered to break the value
> chain).
> 
> The good thing is that, afaik, value-prediction is largely not used in
> real life. There are lots of papers on it, but I don't think
> anybody actually does it (although I can easily see some
> specint-specific optimization pattern that is built up around it).
> 
> And even value prediction is actually fine, as long as the compiler
> can see the memory *source* of the value prediction (and it isn't a
> mo_consume). So it really ends up limiting your value prediction in
> very simple ways: you cannot do it to function arguments if they are
> registers. But you can still do value prediction on values you loaded
> from memory, if you can actually *see* that memory op.
> 
> Of course, on more strongly ordered CPUs, even that "register
> argument" limitation goes away.
> 
> So I agree that there is basically no real optimization constraint.
> Value-prediction is of dubious value to begin with, and the actual
> constraint on its use, if some compiler writer really wants to do it, is
> not onerous.
> 
> > What I have in mind is roughly the following (totally made-up syntax --
> > suggestions for how to do this properly are very welcome):
> > * Have a type modifier (eg, like restrict), that specifies that
> > operations on data of this type are preserving value dependencies:
> 
> So I'm not violently opposed, but I think the upsides are not great.
> Note that my earlier suggestion to use "restrict" wasn't because I
> believed the annotation itself would be visible, but basically just as
> a legalistic promise to the compiler that *if* it found an alias, then
> it didn't need to worry about ordering. So to me, that type modifier
> was about conceptual guarantees, not about actual value chains.
> 
> Anyway, the reason I don't believe any type modifier (and
> "[[carries_dependency]]" is basically just that) is worth it is simply
> that it adds a real burden on the programmer, without actually giving
> the programmer any real upside:
> 
> Within a single function, the compiler already sees that mo_consume
> source, and so doing a type-based restriction doesn't really help. The
> information is already there, without any burden on the programmer.
> 
> And across functions, the compiler has already - by definition -
> mostly lost sight of all the things it could use to reduce the value
> space. Even Paul's example doesn't really work if the use of the
> "mo_consume" value has been passed to another function, because inside
> a separate function, the compiler couldn't see that the value it uses
> comes from only two possible values.
> 
> And as mentioned, even *if* the compiler wants to do value prediction
> that turns a data dependency into a control dependency, the limitation
> to say "no, you can't do it unless you saw where the value got loaded"
> really isn't that onerous.
> 
> I bet that if you ask actual production compiler people (as opposed to
> perhaps academia), none of them actually really believe in value
> prediction to begin with.
> 
> > What do you think?
> >
> > Is this meaningful regarding what current hardware offers, or will it do
> > (or might do in the future) value prediction on its own?
> 
> I can pretty much guarantee that when/if hardware does value
> prediction on its own, it will do so without exposing it as breaking
> the data dependency.
> 
> The thing is, a CPU is actually *much* better situated at doing
> speculative memory accesses, because a CPU already has all the
> infrastructure to do speculation in general.
> 
> And for a CPU, once you do value speculation, guaranteeing the memory
> ordering is *trivial*: all you need to do is to track the "speculated"
> memory instruction until you check the value (which you obviously have
> to do anyway, otherwise you're not doing value _prediction_, you're
> just doing "value wild guessing" ;^), and when you check the value you
> also check that the cacheline hasn't been evicted out-of-order.
> 
> This is all stuff that CPU people already do. If you have
> transactional memory, you already have all the resources to do this.
> Or, even without transactional memory, if like Intel you have a memory
> model that says "loads are done in order" but you actually wildly
> speculate loads and just check before retiring instructions that the
> cachelines didn't get evicted out of order, you already have all the
> hardware to do value prediction *without* making it visible in the
> memory order.
> 
> This, btw, is one reason why people who think that compilers should be
> overly smart and do fancy tricks are incompetent. People who thought
> that Itanium was a great idea ("Let's put the complexity in the
> compiler, and make a simple CPU") are simply objectively *wrong*.
> People who think that value prediction by a compiler is a good idea
> are not people you should really care about.
> 
>                          Linus
> 


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 17:50                                                                                                               ` Paul E. McKenney
@ 2014-02-27 19:22                                                                                                                 ` Paul E. McKenney
  2014-02-28  1:02                                                                                                                 ` Paul E. McKenney
  2014-03-03 19:01                                                                                                                 ` Torvald Riegel
  2 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-27 19:22 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
> > 
> > On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> > > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> > > <paulmck@linux.vnet.ibm.com> wrote:
> > > >
> > > > Good points.  How about the following replacements?
> > > >
> > > > 3.      Adding or subtracting an integer to/from a chained pointer
> > > >         results in another chained pointer in that same pointer chain.
> > > >         The results of addition and subtraction operations that cancel
> > > >         the chained pointer's value (for example, "p-(long)p" where "p"
> > > >         is a pointer to char) are implementation defined.
> > > >
> > > > 4.      Bitwise operators ("&", "|", "^", and I suppose also "~")
> > > >         applied to a chained pointer and an integer for the purposes
> > > >         of alignment and pointer translation results in another
> > > >         of alignment and pointer translation result in another
> > > >         of bitwise operators on chained pointers (for example,
> > > >         "p|~0") are implementation defined.
> > > 
> > > Quite frankly, I think all of this language that is about the actual
> > > operations is irrelevant and wrong.
> > > 
> > > It's not going to help compiler writers, and it sure isn't going to
> > > help users that read this.
> > > 
> > > Why not just talk about "value chains" and that any operations that
> > > restrict the value range severely end up breaking the chain. There is
> > > no point in listing the operations individually, because every single
> > > operation *can* restrict things. Listing individual operations and
> > > dependencies is just fundamentally wrong.
> > 
> > [...]
> > 
> > > The *only* thing that matters for all of them is whether they are
> > > "value-preserving", or whether they drop so much information that the
> > > compiler might decide to use a control dependency instead. That's true
> > > for every single one of them.
> > > 
> > > Similarly, actual true control dependencies that limit the problem
> > > space sufficiently that the actual pointer value no longer has
> > > significant information in it (see the above example) are also things
> > > that remove information to the point that only a control dependency
> > > remains. Even when the value itself is not modified in any way at all.
> > 
> > I agree that just considering syntactic properties of the program seems
> > to be insufficient.  Making it instead depend on whether there is a
> > "semantic" dependency due to a value being "necessary" to compute a
> > result seems better.  However, whether a value is "necessary" might not
> > be obvious, and I understand Paul's argument that he does not want to
> > have to reason about all potential compiler optimizations.  Thus, I
> > believe we need to specify when a value is "necessary".
> > 
> > I have a suggestion for a somewhat different formulation of the feature
> > that you seem to have in mind, which I'll discuss below.  Excuse the
> > verbosity of the following, but I'd rather like to avoid
> > misunderstandings than save a few words.
> 
> Thank you very much for putting this forward!  I must confess that I was
> stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
> is quite clearly way bogus.
> 
> One possible saving grace:  From discussions at the standards committee
> meeting a few weeks ago, there is some chance that the committee will
> be willing to do a rip-and-replace on the current memory_order_consume
> wording, without provisions for backwards compatibility with the current
> bogosity.
> 
> > What we'd like to capture is that a value originating from a mo_consume
> > load is "necessary" for a computation (e.g., it "cannot" be replaced
> > with value predictions and/or control dependencies); if that's the case
> > in the program, we can reasonably assume that a compiler implementation
> > will transform this into a data dependency, which will then lead to
> > ordering guarantees by the HW.
> > 
> > However, we need to specify when a value is "necessary".  We could say
> > that this is implementation-defined, and use a set of litmus tests
> > (e.g., like those discussed in the thread) to roughly carve out what a
> > programmer could expect.  This may even be practical for a project like
> > the Linux kernel that follows strict project-internal rules and pays a
> > lot of attention to what the particular implementations of compilers
> > expected to compile the kernel are doing.  However, I think this
> > approach would be too vague for the standard and for many other
> > programs/projects.
> 
> I agree that a number of other projects would have more need for this than
> might the kernel.  Please understand that this is in no way denigrating
> the intelligence of other projects' members.  It is just that many of
> them have only recently started seriously thinking about concurrency.
> In contrast, the Linux kernel community has been doing concurrency since
> the mid-1990s.  Projects with less experience with concurrency will
> probably need more help, from the compiler and from elsewhere as well.

I should hasten to add that it is not just concurrency.  After all, part
of the reason I got into trouble with memory_order_consume is that my
mid-to-late 70s experience with compilers is not so useful in 2014.  ;-)

							Thanx, Paul

> Your proposal looks quite promising at first glance.  But rather than
> try and comment on it immediately, I am going to take a number of uses of
> RCU from the Linux kernel and apply your proposal to them, then respond
> with the results.
> 
> Fair enough?
> 
> 							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 19:06                                                                                                                 ` Paul E. McKenney
@ 2014-02-27 19:47                                                                                                                   ` Linus Torvalds
  2014-02-27 20:53                                                                                                                     ` Paul E. McKenney
  2014-03-03 18:59                                                                                                                     ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Linus Torvalds @ 2014-02-27 19:47 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> 3.      The comparison was against another RCU-protected pointer,
>         where that other pointer was properly fetched using one
>         of the RCU primitives.  Here it doesn't matter which pointer
>         you use.  At least as long as the rcu_assign_pointer() for
>         that other pointer happened after the last update to the
>         pointed-to structure.
>
> I am a bit nervous about #3.  Any thoughts on it?

I think that it might be worth pointing out as an example, and saying
that code like

   p = atomic_read(consume);
   X;
   q = atomic_read(consume);
   Y;
   if (p == q)
        data = p->val;

then the access of "p->val" is constrained to be data-dependent on
*either* p or q, but you can't really tell which, since the compiler
can decide that the values are interchangeable.

I cannot for the life of me come up with a situation where this would
matter, though. If "X" contains a fence, then that fence will be a
stronger ordering than anything the consume through "p" would
guarantee anyway. And if "X" does *not* contain a fence, then the
atomic reads of p and q are unordered *anyway*, so then whether the
ordering to the access through "p" is through p or q is kind of
irrelevant. No?

                     Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 19:47                                                                                                                   ` Linus Torvalds
@ 2014-02-27 20:53                                                                                                                     ` Paul E. McKenney
  2014-03-01  0:50                                                                                                                       ` Paul E. McKenney
  2014-03-03 18:59                                                                                                                     ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-27 20:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > 3.      The comparison was against another RCU-protected pointer,
> >         where that other pointer was properly fetched using one
> >         of the RCU primitives.  Here it doesn't matter which pointer
> >         you use.  At least as long as the rcu_assign_pointer() for
> >         that other pointer happened after the last update to the
> >         pointed-to structure.
> >
> > I am a bit nervous about #3.  Any thoughts on it?
> 
> I think that it might be worth pointing out as an example, and saying
> that code like
> 
>    p = atomic_read(consume);
>    X;
>    q = atomic_read(consume);
>    Y;
>    if (p == q)
>         data = p->val;
> 
> then the access of "p->val" is constrained to be data-dependent on
> *either* p or q, but you can't really tell which, since the compiler
> can decide that the values are interchangeable.
> 
> I cannot for the life of me come up with a situation where this would
> matter, though. If "X" contains a fence, then that fence will be a
> stronger ordering than anything the consume through "p" would
> guarantee anyway. And if "X" does *not* contain a fence, then the
> atomic reads of p and q are unordered *anyway*, so then whether the
> ordering to the access through "p" is through p or q is kind of
> irrelevant. No?

I can make a contrived litmus test for it, but you are right, the only
time you can see it happen is when X has no barriers, in which case
you don't have any ordering anyway -- both the compiler and the CPU can
reorder the loads into p and q, and the read from p->val can, as you say,
come from either pointer.

For whatever it is worth, here is the litmus test:

T1:	p = kmalloc(...);
	if (p == NULL)
		deal_with_it();
	p->a = 42;  /* Each field in its own cache line. */
	p->b = 43;
	p->c = 44;
	atomic_store_explicit(&gp1, p, memory_order_release);
	p->b = 143;
	p->c = 144;
	atomic_store_explicit(&gp2, p, memory_order_release);

T2:	p = atomic_load_explicit(&gp2, memory_order_consume);
	r1 = p->b;  /* Guaranteed to get 143. */
	q = atomic_load_explicit(&gp1, memory_order_consume);
	if (p == q) {
		/* The compiler decides that q->c is the same as p->c. */
		r2 = p->c; /* Could get 44 on a weakly ordered system. */
	}

The loads from gp1 and gp2 are, as you say, unordered, so you get what
you get.

And publishing a structure via one RCU-protected pointer, updating it,
then publishing it via another pointer seems to me to be asking for
trouble anyway.  If you really want to do something like that and still
see consistency across all the fields in the structure, please put a lock
in the structure and use it to guard updates and accesses to those fields.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 17:50                                                                                                               ` Paul E. McKenney
  2014-02-27 19:22                                                                                                                 ` Paul E. McKenney
@ 2014-02-28  1:02                                                                                                                 ` Paul E. McKenney
  2014-03-03 19:01                                                                                                                 ` Torvald Riegel
  2 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-28  1:02 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
> > 
> > On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> > > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> > > <paulmck@linux.vnet.ibm.com> wrote:
> > > >
> > > > Good points.  How about the following replacements?
> > > >
> > > > 3.      Adding or subtracting an integer to/from a chained pointer
> > > >         results in another chained pointer in that same pointer chain.
> > > >         The results of addition and subtraction operations that cancel
> > > >         the chained pointer's value (for example, "p-(long)p" where "p"
> > > >         is a pointer to char) are implementation defined.
> > > >
> > > > 4.      Bitwise operators ("&", "|", "^", and I suppose also "~")
> > > >         applied to a chained pointer and an integer for the purposes
> > > >         of alignment and pointer translation results in another
> > > >         chained pointer in that same pointer chain.  Other uses
> > > >         of bitwise operators on chained pointers (for example,
> > > >         "p|~0") are implementation defined.
> > > 
> > > Quite frankly, I think all of this language that is about the actual
> > > operations is irrelevant and wrong.
> > > 
> > > It's not going to help compiler writers, and it sure isn't going to
> > > help users that read this.
> > > 
> > > Why not just talk about "value chains" and that any operations that
> > > restrict the value range severely end up breaking the chain. There is
> > > no point in listing the operations individually, because every single
> > > operation *can* restrict things. Listing individual operations and
> > > dependencies is just fundamentally wrong.
> > 
> > [...]
> > 
> > > The *only* thing that matters for all of them is whether they are
> > > "value-preserving", or whether they drop so much information that the
> > > compiler might decide to use a control dependency instead. That's true
> > > for every single one of them.
> > > 
> > > Similarly, actual true control dependencies that limit the problem
> > > space sufficiently that the actual pointer value no longer has
> > > significant information in it (see the above example) are also things
> > > that remove information to the point that only a control dependency
> > > remains. Even when the value itself is not modified in any way at all.
> > 
> > I agree that just considering syntactic properties of the program seems
> > to be insufficient.  Making it instead depend on whether there is a
> > "semantic" dependency due to a value being "necessary" to compute a
> > result seems better.  However, whether a value is "necessary" might not
> > be obvious, and I understand Paul's argument that he does not want to
> > have to reason about all potential compiler optimizations.  Thus, I
> > believe we need to specify when a value is "necessary".
> > 
> > I have a suggestion for a somewhat different formulation of the feature
> > that you seem to have in mind, which I'll discuss below.  Excuse the
> > verbosity of the following, but I'd rather like to avoid
> > misunderstandings than save a few words.
> 
> Thank you very much for putting this forward!  I must confess that I was
> stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
> is quite clearly way bogus.
> 
> One possible saving grace:  From discussions at the standards committee
> meeting a few weeks ago, there is some chance that the committee will
> be willing to do a rip-and-replace on the current memory_order_consume
> wording, without provisions for backwards compatibility with the current
> bogosity.
> 
> > What we'd like to capture is that a value originating from a mo_consume
> > load is "necessary" for a computation (e.g., it "cannot" be replaced
> > with value predictions and/or control dependencies); if that's the case
> > in the program, we can reasonably assume that a compiler implementation
> > will transform this into a data dependency, which will then lead to
> > ordering guarantees by the HW.
> > 
> > However, we need to specify when a value is "necessary".  We could say
> > that this is implementation-defined, and use a set of litmus tests
> > (e.g., like those discussed in the thread) to roughly carve out what a
> > programmer could expect.  This may even be practical for a project like
> > the Linux kernel that follows strict project-internal rules and pays a
> > lot of attention to what the particular implementations of compilers
> > expected to compile the kernel are doing.  However, I think this
> > approach would be too vague for the standard and for many other
> > programs/projects.
> 
> I agree that a number of other projects would have more need for this than
> might the kernel.  Please understand that this is in no way denigrating
> the intelligence of other projects' members.  It is just that many of
> them have only recently started seriously thinking about concurrency.
> In contrast, the Linux kernel community has been doing concurrency since
> the mid-1990s.  Projects with less experience with concurrency will
> probably need more help, from the compiler and from elsewhere as well.
> 
> Your proposal looks quite promising at first glance.  But rather than
> try and comment on it immediately, I am going to take a number of uses of
> RCU from the Linux kernel and apply your proposal to them, then respond
> with the results.

And here is an initial set of six selected randomly from the Linux kernel
(assuming you trust awk's random-number generator).  This is of course a
tiny subset of what is in the kernel, but should be a good set to start
with.  Looks like a reasonable start to me, though I would not expect the
kernel to convert over wholesale any time soon.  Which is OK; there are
userspace projects using RCU.

Thoughts?  Am I understanding your proposal at all?  ;-)

							Thanx, Paul

------------------------------------------------------------------------

value_dep_preserving usage examples.

/* The following is approximate -- need sparse and maybe other checking. */
#define rcu_dereference(x) atomic_load_explicit(&(x), memory_order_consume)

1.	mm/vmalloc.c __purge_vmap_area_lazy()

	This requires only two small changes.  I am not sure that the
	second change is necessary.  My guess is that it is, on the theory
	that passing a non-value_dep_preserving variable in through a
	value_dep_preserving argument is guaranteed OK, give or take
	unnecessary suppression of optimizations, but that passing a
	value_dep_preserving pointer in through a non-value_dep_preserving
	argument could be a serious bug.
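
	As a minimal sketch of that dangerous direction (the callee
	name below is invented purely for illustration):

		void takes_plain(struct vmap_area *p); /* non-value_dep_preserving */

		takes_plain(va);			/* dependency silently dropped */
		takes_plain((struct vmap_area *)va);	/* explicit, greppable strip */

	The explicit cast makes the loss of dependency ordering visible
	in the source, which is the same idiom the second CHANGE below
	uses for __free_vmap_area().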

	That said, the Linux kernel convention is that once you leave
	the outermost rcu_read_lock(), there is no more value dependency
	preservation.  Implementing this convention would remove the
	need for the cast, for whatever that is worth.

	static void __purge_vmap_area_lazy(...)
	{
		static DEFINE_SPINLOCK(purge_lock);
		LIST_HEAD(valist);
		struct vmap_area value_dep_preserving *va; /* CHANGE */
		struct vmap_area *n_va;
		int nr = 0;

		...

		rcu_read_lock();
		list_for_each_entry_rcu(va, &vmap_area_list, list) {
			if (va->flags & VM_LAZY_FREE) {
				if (va->va_start < *start)
					*start = va->va_start;
				if (va->va_end > *end)
					*end = va->va_end;
				nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
				list_add_tail(&va->purge_list, &valist);
				va->flags |= VM_LAZY_FREEING;
				va->flags &= ~VM_LAZY_FREE;
			}
		}
		rcu_read_unlock();

		...

		if (nr) {
			spin_lock(&vmap_area_lock);
			list_for_each_entry_safe(va, n_va, &valist, purge_list)
				__free_vmap_area((struct vmap_area *)va); /* CHANGE */
			spin_unlock(&vmap_area_lock);
		}
		...
	}

2.	net/core/sock.c sock_def_wakeup()

	static void sock_def_wakeup(struct sock *sk)
	{
		struct socket_wq value_dep_preserving *wq; /* CHANGE */

		rcu_read_lock();
		wq = rcu_dereference(sk->sk_wq);
		if (wq_has_sleeper(wq))
			wake_up_interruptible_all(&((struct socket_wq *)wq)->wait); /* CHANGE */
		rcu_read_unlock();
	}

	This calls wq_has_sleeper():

	static inline bool
	wq_has_sleeper(struct socket_wq value_dep_preserving *wq) /* CHANGE */
	{
		/* We need to be sure we are in sync with the
		 * add_wait_queue modifications to the wait queue.
		 *
		 * This memory barrier is paired in the sock_poll_wait.
		 */
		smp_mb();
		return wq && waitqueue_active(&wq->wait);
	}

	Although wq_has_sleeper() has a full memory barrier and therefore
	does not need value_dep_preserving, it seems to be called from
	within a lot of RCU read-side critical sections, so it is likely
	a bit nicer to decorate its argument than to apply casts to all
	calls.
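
	For comparison, the cast-at-each-call-site alternative would
	look like this (sketch only):

		if (wq_has_sleeper((struct socket_wq *)wq))
			wake_up_interruptible_all(...);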

	Is "&wq->wait" also value_dep_preserving?  The call to
	wake_up_interruptible_all() doesn't need it to be because it
	acquires a spinlock within the structure, and the ensuing barriers
	and atomic instructions enforce all the ordering that is required.

	The call waitqueue_active() is preceded by a full barrier, so
	it too has all the ordering it needs.

	So this example is slightly nicer if, given a value_dep_preserving
	pointer "p", "&p->f" is non-value_dep_preserving.  But there
	will likely be other examples that strongly prefer otherwise,
	and the pair of casts required here is not that horrible.
	Plus this approach matches our discussion earlier in this thread.

	Might be nice to have an intrinsic that takes a value_dep_preserving
	pointer and returns a non-value_dep_preserving pointer with the
	same value.
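
	A sketch of what such an intrinsic might look like, assuming a
	simple cast-based macro (the name is invented here; C11's
	kill_dependency() plays the analogous role for
	memory_order_consume):

		/* Return p's value as a plain pointer to "type",
		 * explicitly ending dependency tracking at this point. */
		#define value_dep_strip(type, p) ((type *)(p))

	With something like this, the casts above would become
	greppable, self-documenting operations.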

3.	net/ipv4/tcp_cong.c tcp_init_congestion_control()

	This example requires only one change.	I am assuming
	that if "p" is value_dep_preserving, then "p->a" is -not-
	value_dep_preserving unless the struct field "a" has been
	declared value_dep_preserving.	This means that there is no
	need to change the "try_module_get(ca->owner)", which is a nice
	improvement over the current memory_order_consume work.

	I believe that this approach will do much to trim what would
	otherwise be over-large value_dep_preserving regions.

	void tcp_init_congestion_control(struct sock *sk)
	{
		struct inet_connection_sock *icsk = inet_csk(sk);
		struct tcp_congestion_ops value_dep_preserving *ca; /* CHANGE */

		if (icsk->icsk_ca_ops == &tcp_init_congestion_ops) {
			rcu_read_lock();
			list_for_each_entry_rcu(ca, &tcp_cong_list, list) {
				if (try_module_get(ca->owner)) {
					icsk->icsk_ca_ops = ca;
					break;
				}

				/* fallback to next available */
			}
			rcu_read_unlock();
		}

		if (icsk->icsk_ca_ops->init)
			icsk->icsk_ca_ops->init(sk);
	}
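
	To illustrate the field-declaration assumption above, a struct
	with an explicitly decorated field might look as follows (the
	struct and field names are invented):

		struct cfg {
			struct cfg_part value_dep_preserving *part; /* chased under RCU */
			struct module *owner;	/* plain data */
		};

	Given a value_dep_preserving pointer "p", "p->part" would stay
	in the pointer chain, while "p->owner" (like "ca->owner" above)
	would not.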

4.	drivers/target/tcm_fc/tfc_sess.c ft_sess_get()

	This one brings up an interesting point.  I am guessing that
	the value_dep_preserving qualifier should be silently stripped when
	the value is passed in through __VA_ARGS__, as in pr_debug() below.
	
	Thoughts?

	static struct ft_sess *ft_sess_get(struct fc_lport *lport, u32 port_id)
	{
		struct ft_tport value_dep_preserving *tport; /* CHANGE */
		struct hlist_head *head;
		struct ft_sess value_dep_preserving *sess; /* CHANGE */

		rcu_read_lock();
		tport = rcu_dereference(lport->prov[FC_TYPE_FCP]);
		if (!tport)
			goto out;

		head = &tport->hash[ft_sess_hash(port_id)];
		hlist_for_each_entry_rcu(sess, head, hash) {
			if (sess->port_id == port_id) {
				kref_get((struct kref *)&sess->kref); /* CHANGE */
				rcu_read_unlock();
				pr_debug("port_id %x found %p\n", port_id, sess);
				return sess;
			}
		}
	out:
		rcu_read_unlock();
		pr_debug("port_id %x not found\n", port_id);
		return NULL;
	}
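
	If the qualifier were -not- silently stripped by __VA_ARGS__,
	the first pr_debug() above would presumably need an explicit
	cast in the style of the kref_get() line, for example:

		pr_debug("port_id %x found %p\n", port_id,
			 (struct ft_sess *)sess); /* CHANGE */

	Silent stripping seems reasonable here because merely printing
	the pointer's value involves no dependency-ordered dereference.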

5.	net/netlabel/netlabel_unlabeled.c netlbl_unlhsh_search_iface()

	This one is interesting -- you have to go up a function-call
	level to find the rcu_read_lock.

	netlbl_unlhsh_add() calls netlbl_unlhsh_search_iface().

	There is no need for value_dep_preserving to flow into
	netlbl_unlhsh_search_iface(), but its return value appears
	to need to be value_dep_preserving.

	static struct netlbl_unlhsh_iface value_dep_preserving *
	netlbl_unlhsh_search_iface(int ifindex) /* CHANGE */
	{
		u32 bkt;
		struct list_head *bkt_list;
		struct netlbl_unlhsh_iface value_dep_preserving *iter; /* CHANGE */

		bkt = netlbl_unlhsh_hash(ifindex);
		bkt_list = &netlbl_unlhsh_rcu_deref(netlbl_unlhsh)->tbl[bkt];
		list_for_each_entry_rcu(iter, bkt_list, list)
			if (iter->valid && iter->ifindex == ifindex)
				return iter;

		return NULL;
	}
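
	As a sketch of the caller side (illustrative only, not the
	real netlbl_unlhsh_add() body), the qualified return type
	propagates into the caller's local variable:

		struct netlbl_unlhsh_iface value_dep_preserving *iface; /* CHANGE */

		rcu_read_lock();
		iface = netlbl_unlhsh_search_iface(ifindex);
		if (iface != NULL)
			do_something_with(iface->ifindex); /* hypothetical use */
		rcu_read_unlock();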

6.	drivers/md/linear.c linear_congested()

	It is possible that this code relies on dependencies flowing into
	bdev_get_queue(), but to keep this effort finite in duration, I
	am relying on the __rcu markers -- and some of the structures
	in this subsystem do have __rcu.  The ->rdev and ->bdev fields
	don't, so this one is pretty straightforward.

	static int linear_congested(void *data, int bits)
	{
		struct mddev *mddev = data;
		struct linear_conf value_dep_preserving *conf; /* CHANGE */
		int i, ret = 0;

		if (mddev_congested(mddev, bits))
			return 1;

		rcu_read_lock();
		conf = rcu_dereference(mddev->private);

		for (i = 0; i < mddev->raid_disks && !ret ; i++) {
			struct request_queue *q = bdev_get_queue(conf->disks[i].rdev->bdev);
			ret |= bdi_congested(&q->backing_dev_info, bits);
		}

		rcu_read_unlock();
		return ret;
	}


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 20:53                                                                                                                     ` Paul E. McKenney
@ 2014-03-01  0:50                                                                                                                       ` Paul E. McKenney
  2014-03-01 10:06                                                                                                                         ` Peter Sewell
  2014-03-03 18:55                                                                                                                         ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-03-01  0:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> > >
> > > 3.      The comparison was against another RCU-protected pointer,
> > >         where that other pointer was properly fetched using one
> > >         of the RCU primitives.  Here it doesn't matter which pointer
> > >         you use.  At least as long as the rcu_assign_pointer() for
> > >         that other pointer happened after the last update to the
> > >         pointed-to structure.
> > >
> > > I am a bit nervous about #3.  Any thoughts on it?
> > 
> > I think that it might be worth pointing out as an example, and saying
> > that code like
> > 
> >    p = atomic_read(consume);
> >    X;
> >    q = atomic_read(consume);
> >    Y;
> >    if (p == q)
> >         data = p->val;
> > 
> > then the access of "p->val" is constrained to be data-dependent on
> > *either* p or q, but you can't really tell which, since the compiler
> > can decide that the values are interchangeable.
> > 
> > I cannot for the life of me come up with a situation where this would
> > matter, though. If "X" contains a fence, then that fence will be a
> > stronger ordering than anything the consume through "p" would
> > guarantee anyway. And if "X" does *not* contain a fence, then the
> > atomic reads of p and q are unordered *anyway*, so then whether the
> > ordering to the access through "p" is through p or q is kind of
> > irrelevant. No?
> 
> I can make a contrived litmus test for it, but you are right, the only
> time you can see it happen is when X has no barriers, in which case
> you don't have any ordering anyway -- both the compiler and the CPU can
> reorder the loads into p and q, and the read from p->val can, as you say,
> come from either pointer.
> 
> For whatever it is worth, here is the litmus test:
> 
> T1:	p = kmalloc(...);
> 	if (p == NULL)
> 		deal_with_it();
> 	p->a = 42;  /* Each field in its own cache line. */
> 	p->b = 43;
> 	p->c = 44;
> 	atomic_store_explicit(&gp1, p, memory_order_release);
> 	p->b = 143;
> 	p->c = 144;
> 	atomic_store_explicit(&gp2, p, memory_order_release);
> 
> T2:	p = atomic_load_explicit(&gp2, memory_order_consume);
> 	r1 = p->b;  /* Guaranteed to get 143. */
> 	q = atomic_load_explicit(&gp1, memory_order_consume);
> 	if (p == q) {
> 		/* The compiler decides that q->c is the same as p->c. */
> 		r2 = p->c; /* Could get 44 on a weakly ordered system. */
> 	}
> 
> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> you get.
> 
> And publishing a structure via one RCU-protected pointer, updating it,
> then publishing it via another pointer seems to me to be asking for
> trouble anyway.  If you really want to do something like that and still
> see consistency across all the fields in the structure, please put a lock
> in the structure and use it to guard updates and accesses to those fields.

And here is a patch documenting the restrictions for the current Linux
kernel.  The rules change a bit due to rcu_dereference() acting a bit
differently than atomic_load_explicit(&p, memory_order_consume).

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

documentation: Record rcu_dereference() value mishandling

Recent LKML discussions (see http://lwn.net/Articles/586838/ and
http://lwn.net/Articles/588300/ for the LWN writeups) brought out
some ways of misusing the return value from rcu_dereference() that
are not necessarily completely intuitive.  This commit therefore
documents what can and cannot safely be done with these values.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
index fa57139f50bf..f773a264ae02 100644
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -12,6 +12,8 @@ lockdep-splat.txt
 	- RCU Lockdep splats explained.
 NMI-RCU.txt
 	- Using RCU to Protect Dynamic NMI Handlers
+rcu_dereference.txt
+	- Proper care and feeding of return values from rcu_dereference()
 rcubarrier.txt
 	- RCU and Unloadable Modules
 rculist_nulls.txt
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 9d10d1db16a5..877947130ebe 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome!
 			http://www.openvms.compaq.com/wizard/wiz_2637.html
 
 		The rcu_dereference() primitive is also an excellent
-		documentation aid, letting the person reading the code
-		know exactly which pointers are protected by RCU.
+		documentation aid, letting the person reading the
+		code know exactly which pointers are protected by RCU.
 		Please note that compilers can also reorder code, and
 		they are becoming increasingly aggressive about doing
-		just that.  The rcu_dereference() primitive therefore
-		also prevents destructive compiler optimizations.
+		just that.  The rcu_dereference() primitive therefore also
+		prevents destructive compiler optimizations.  However,
+		with a bit of devious creativity, it is possible to
+		mishandle the return value from rcu_dereference().
+		Please see rcu_dereference.txt in this directory for
+		more information.
 
 		The rcu_dereference() primitive is used by the
 		various "_rcu()" list-traversal primitives, such
diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt
new file mode 100644
index 000000000000..6e72cd8622df
--- /dev/null
+++ b/Documentation/RCU/rcu_dereference.txt
@@ -0,0 +1,365 @@
+PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference()
+
+Most of the time, you can use values from rcu_dereference() or one of
+the similar primitives without worries.  Dereferencing (prefix "*"),
+field selection ("->"), assignment ("="), address-of ("&"), addition and
+subtraction of constants, and casts all work quite naturally and safely.
+
+It is nevertheless possible to get into trouble with other operations.
+Follow these rules to keep your RCU code working properly:
+
+o	You must use one of the rcu_dereference() family of primitives
+	to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU
+	will complain.  Worse yet, your code can see random memory-corruption
+	bugs due to games that compilers and DEC Alpha can play.
+	Without one of the rcu_dereference() primitives, compilers
+	can reload the value, and won't your code have fun with two
+	different values for a single pointer!  Without rcu_dereference(),
+	DEC Alpha can load a pointer, dereference that pointer, and
+	return data preceding initialization that preceded the store of
+	the pointer.
+
+	In addition, the volatile cast in rcu_dereference() prevents the
+	compiler from deducing the resulting pointer value.  Please see
+	the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH"
+	for an example where the compiler can in fact deduce the exact
+	value of the pointer, and thus cause misordering.
+
+o	Do not use single-element RCU-protected arrays.  The compiler
+	is within its rights to assume that the value of an index into
+	such an array must necessarily evaluate to zero.  The compiler
+	could then substitute the constant zero for the computation, so
+	that the array index no longer depended on the value returned
+	by rcu_dereference().  If the array index no longer depends
+	on rcu_dereference(), then both the compiler and the CPU
+	are within their rights to order the array access before the
+	rcu_dereference(), which can cause the array access to return
+	garbage.
+
+o	Avoid cancellation when using the "+" and "-" infix arithmetic
+	operators.  For example, for a given variable "x", avoid
+	"(x-x)".  There are similar arithmetic pitfalls from other
+	arithmetic operators, such as "(x*0)", "(x/(x+1))" or "(x%1)".
+	The compiler is within its rights to substitute zero for all of
+	these expressions, so that subsequent accesses no longer depend
+	on the rcu_dereference(), again possibly resulting in bugs due
+	to misordering.
+
+	Of course, if "p" is a pointer from rcu_dereference(), and "a"
+	and "b" are integers that happen to be equal, the expression
+	"p+a-b" is safe because its value still necessarily depends on
+	the rcu_dereference(), thus maintaining proper ordering.
+
+o	Avoid all-zero operands to the bitwise "&" operator, and
+	similarly avoid all-ones operands to the bitwise "|" operator.
+	If the compiler is able to deduce the value of such operands,
+	it is within its rights to substitute the corresponding constant
+	for the bitwise operation.  Once again, this causes subsequent
+	accesses to no longer depend on the rcu_dereference(), causing
+	bugs due to misordering.
+
+	Please note that single-bit operands to bitwise "&" can also
+	be dangerous.  At this point, the compiler knows that the
+	resulting value can only take on one of two possible values.
+	Therefore, a very small amount of additional information will
+	allow the compiler to deduce the exact value, which again can
+	result in misordering.
+
+o	If you are using RCU to protect JITed functions, so that the
+	"()" function-invocation operator is applied to a value obtained
+	(directly or indirectly) from rcu_dereference(), you may need to
+	interact directly with the hardware to flush instruction caches.
+	This issue arises on some systems when a newly JITed function is
+	using the same memory that was used by an earlier JITed function.
+
+o	Do not use the results from the boolean "&&" and "||" when
+	dereferencing.	For example, the following (rather improbable)
+	code is buggy:
+
+		int a[2];
+		int index;
+		int force_zero_index = 1;
+
+		...
+
+		r1 = rcu_dereference(i1);
+		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
+
+	The reason this is buggy is that "&&" and "||" are often compiled
+	using branches.  While weak-memory machines such as ARM or PowerPC
+	do order stores after such branches, they can speculate loads,
+	which can result in misordering bugs.
+
+o	Do not use the results from relational operators ("==", "!=",
+	">", ">=", "<", or "<=") when dereferencing.  For example,
+	the following (quite strange) code is buggy:
+
+		int a[2];
+		int index;
+		int flip_index = 0;
+
+		...
+
+		r1 = rcu_dereference(i1);
+		r2 = a[r1 != flip_index];  /* BUGGY!!! */
+
+	As before, the reason this is buggy is that relational operators
+	are often compiled using branches.  And as before, although
+	weak-memory machines such as ARM or PowerPC do order stores
+	after such branches, they can speculate loads, which can again
+	result in misordering bugs.
+
+o	Be very careful about comparing pointers obtained from
+	rcu_dereference() against non-NULL values.  As Linus Torvalds
+	explained, if the two pointers are equal, the compiler could
+	substitute the pointer you are comparing against for the pointer
+	obtained from rcu_dereference().  For example:
+
+		p = rcu_dereference(gp);
+		if (p == &default_struct)
+			do_default(p->a);
+
+	Because the compiler now knows that the value of "p" is exactly
+	the address of the variable "default_struct", it is free to
+	transform this code into the following:
+
+		p = rcu_dereference(gp);
+		if (p == &default_struct)
+			do_default(default_struct.a);
+
+	On ARM and Power hardware, the load from "default_struct.a"
+	can now be speculated, such that it might happen before the
+	rcu_dereference().  This could result in bugs due to misordering.
+
+	However, comparisons are OK in the following cases:
+
+	o	The comparison was against the NULL pointer.  If the
+		compiler knows that the pointer is NULL, you had better
+		not be dereferencing it anyway.  If the comparison is
+		non-equal, the compiler is none the wiser.  Therefore,
+		it is safe to compare pointers from rcu_dereference()
+		against NULL pointers.
+
+	o	The pointer is never dereferenced after being compared.
+		Since there are no subsequent dereferences, the compiler
+		cannot use anything it learned from the comparison
+		to reorder the non-existent subsequent dereferences.
+		This sort of comparison occurs frequently when scanning
+		RCU-protected circular linked lists.
+
+	o	The comparison is against a pointer that
+		references memory that was initialized "a long time ago."
+		The reason this is safe is that even if misordering
+		occurs, the misordering will not affect the accesses
+		that follow the comparison.  So exactly how long ago is
+		"a long time ago"?  Here are some possibilities:
+
+		o	Compile time.
+
+		o	Boot time.
+
+		o	Module-init time for module code.
+
+		o	Prior to kthread creation for kthread code.
+
+		o	During some prior acquisition of the lock that
+			we now hold.
+
+		o	Before mod_timer() time for a timer handler.
+
+		There are many other possibilities involving the Linux
+		kernel's wide array of primitives that cause code to
+		be invoked at a later time.
+
+	o	The pointer being compared against also came from
+		rcu_dereference().  In this case, both pointers depend
+		on one rcu_dereference() or another, so you get proper
+		ordering either way.
+
+		That said, this situation can make certain RCU usage
+		bugs more likely to happen.  Which can be a good thing,
+		at least if they happen during testing.  An example
+		of such an RCU usage bug is shown in the section titled
+		"EXAMPLE OF AMPLIFIED RCU-USAGE BUG".
+
+	o	All of the accesses following the comparison are stores,
+		so that a control dependency preserves the needed ordering.
+		That said, it is easy to get control dependencies wrong.
+		Please see the "CONTROL DEPENDENCIES" section of
+		Documentation/memory-barriers.txt for more details.
+
+	o	The pointers compared not-equal -and- the compiler does
+		not have enough information to deduce the value of the
+		pointer.  Note that the volatile cast in rcu_dereference()
+		will normally prevent the compiler from knowing too much.
+
+o	Disable any value-speculation optimizations that your compiler
+	might provide, especially if you are making use of feedback-based
+	optimizations that take data collected from prior runs.  Such
+	value-speculation optimizations reorder operations by design.
+
+	There is one exception to this rule:  Value-speculation
+	optimizations that leverage the branch-prediction hardware are
+	safe on strongly ordered systems (such as x86), but not on weakly
+	ordered systems (such as ARM or Power).  Choose your compiler
+	command-line options wisely!
+
+
+EXAMPLE OF AMPLIFIED RCU-USAGE BUG
+
+Because updaters can run concurrently with RCU readers, RCU readers can
+see stale and/or inconsistent values.  If RCU readers need fresh or
+consistent values, which they sometimes do, they need to take proper
+precautions.  To see this, consider the following code fragment:
+
+	struct foo {
+		int a;
+		int b;
+		int c;
+	};
+	struct foo *gp1;
+	struct foo *gp2;
+
+	void updater(void)
+	{
+		struct foo *p;
+
+		p = kmalloc(...);
+		if (p == NULL)
+			deal_with_it();
+		p->a = 42;  /* Each field in its own cache line. */
+		p->b = 43;
+		p->c = 44;
+		rcu_assign_pointer(gp1, p);
+		p->b = 143;
+		p->c = 144;
+		rcu_assign_pointer(gp2, p);
+	}
+
+	void reader(void)
+	{
+		struct foo *p;
+		struct foo *q;
+		int r1, r2;
+
+		p = rcu_dereference(gp2);
+		r1 = p->b;  /* Guaranteed to get 143. */
+		q = rcu_dereference(gp1);
+		if (p == q) {
+			/* The compiler decides that q->c is the same as p->c. */
+			r2 = p->c; /* Could get 44 on a weakly ordered system. */
+		}
+	}
+
+You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible,
+but you should not be.  After all, the updater might have been invoked
+a second time between the time reader() loaded into "r1" and the time
+that it loaded into "r2".  The fact that this same result can occur due
+to some reordering from the compiler and CPUs is beside the point.
+
+But suppose that the reader needs a consistent view?
+
+Then one approach is to use locking, for example, as follows:
+
+	struct foo {
+		int a;
+		int b;
+		int c;
+		spinlock_t lock;
+	};
+	struct foo *gp1;
+	struct foo *gp2;
+
+	void updater(void)
+	{
+		struct foo *p;
+
+		p = kmalloc(...);
+		if (p == NULL)
+			deal_with_it();
+		spin_lock(&p->lock);
+		p->a = 42;  /* Each field in its own cache line. */
+		p->b = 43;
+		p->c = 44;
+		spin_unlock(&p->lock);
+		rcu_assign_pointer(gp1, p);
+		spin_lock(&p->lock);
+		p->b = 143;
+		p->c = 144;
+		spin_unlock(&p->lock);
+		rcu_assign_pointer(gp2, p);
+	}
+
+	void reader(void)
+	{
+		struct foo *p;
+		struct foo *q;
+		int r1, r2;
+
+		p = rcu_dereference(gp2);
+		spin_lock(&p->lock);
+		r1 = p->b;  /* Guaranteed to get 143. */
+		q = rcu_dereference(gp1);
+		if (p == q) {
+			/* The compiler decides that q->c is the same as p->c. */
+			r2 = p->c; /* Could get 44 on a weakly ordered system. */
+		}
+		spin_unlock(&p->lock);
+	}
+
+As always, use the right tool for the job!
+
+
+EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH
+
+If a pointer obtained from rcu_dereference() compares not-equal to some
+other pointer, the compiler normally has no clue what the value of the
+first pointer might be.  This lack of knowledge prevents the compiler
+from carrying out optimizations that otherwise might destroy the ordering
+guarantees that RCU depends on.  And the volatile cast in rcu_dereference()
+should prevent the compiler from guessing the value.
+
+But without rcu_dereference(), the compiler knows more than you might
+expect.  Consider the following code fragment:
+
+	struct foo {
+		int a;
+		int b;
+	};
+	static struct foo variable1;
+	static struct foo variable2;
+	static struct foo *gp = &variable1;
+
+	void updater(void)
+	{
+		initialize_foo(&variable2);
+		rcu_assign_pointer(gp, &variable2);
+		/*
+		 * The above is the only store to gp in this translation unit,
+		 * and the address of gp is not exported in any way.
+		 */
+	}
+
+	int reader(void)
+	{
+		struct foo *p;
+
+		p = gp;
+		barrier();
+		if (p == &variable1)
+			return p->a; /* Must be variable1.a. */
+		else
+			return p->b; /* Must be variable2.b. */
+	}
+
+Because the compiler can see all stores to "gp", it knows that the only
+possible values of "gp" are "variable1" on the one hand and "variable2"
+on the other.  The comparison in reader() therefore tells the compiler
+the exact value of "p" even in the not-equals case.  This allows the
+compiler to make the return values independent of the load from "gp",
+in turn destroying the ordering between this load and the loads of the
+return values.  This can result in "p->b" returning pre-initialization
+garbage values.
+
+In short, rcu_dereference() is -not- optional when you are going to
+dereference the resulting pointer.


^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-01  0:50                                                                                                                       ` Paul E. McKenney
@ 2014-03-01 10:06                                                                                                                         ` Peter Sewell
  2014-03-01 14:03                                                                                                                           ` Paul E. McKenney
  2014-03-03 18:55                                                                                                                         ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Peter Sewell @ 2014-03-01 10:06 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

Hi Paul,

On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> > <paulmck@linux.vnet.ibm.com> wrote:
>> > >
>> > > 3.      The comparison was against another RCU-protected pointer,
>> > >         where that other pointer was properly fetched using one
>> > >         of the RCU primitives.  Here it doesn't matter which pointer
>> > >         you use.  At least as long as the rcu_assign_pointer() for
>> > >         that other pointer happened after the last update to the
>> > >         pointed-to structure.
>> > >
>> > > I am a bit nervous about #3.  Any thoughts on it?
>> >
>> > I think that it might be worth pointing out as an example, and saying
>> > that code like
>> >
>> >    p = atomic_read(consume);
>> >    X;
>> >    q = atomic_read(consume);
>> >    Y;
>> >    if (p == q)
>> >         data = p->val;
>> >
>> > then the access of "p->val" is constrained to be data-dependent on
>> > *either* p or q, but you can't really tell which, since the compiler
>> > can decide that the values are interchangeable.
>> >
>> > I cannot for the life of me come up with a situation where this would
>> > matter, though. If "X" contains a fence, then that fence will be a
>> > stronger ordering than anything the consume through "p" would
>> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> > atomic reads of p and q are unordered *anyway*, so then whether the
>> > ordering to the access through "p" is through p or q is kind of
>> > irrelevant. No?
>>
>> I can make a contrived litmus test for it, but you are right, the only
>> time you can see it happen is when X has no barriers, in which case
>> you don't have any ordering anyway -- both the compiler and the CPU can
>> reorder the loads into p and q, and the read from p->val can, as you say,
>> come from either pointer.
>>
>> For whatever it is worth, here is the litmus test:
>>
>> T1:   p = kmalloc(...);
>>       if (p == NULL)
>>               deal_with_it();
>>       p->a = 42;  /* Each field in its own cache line. */
>>       p->b = 43;
>>       p->c = 44;
>>       atomic_store_explicit(&gp1, p, memory_order_release);
>>       p->b = 143;
>>       p->c = 144;
>>       atomic_store_explicit(&gp2, p, memory_order_release);
>>
>> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>>       r1 = p->b;  /* Guaranteed to get 143. */
>>       q = atomic_load_explicit(&gp1, memory_order_consume);
>>       if (p == q) {
>>               /* The compiler decides that q->c is the same as p->c. */
>>               r2 = p->c; /* Could get 44 on a weakly ordered system. */
>>       }
>>
>> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> you get.
>>
>> And publishing a structure via one RCU-protected pointer, updating it,
>> then publishing it via another pointer seems to me to be asking for
>> trouble anyway.  If you really want to do something like that and still
>> see consistency across all the fields in the structure, please put a lock
>> in the structure and use it to guard updates and accesses to those fields.
>
> And here is a patch documenting the restrictions for the current Linux
> kernel.  The rules change a bit due to rcu_dereference() acting a bit
> differently than atomic_load_explicit(&p, memory_order_consume).
>
> Thoughts?

That might serve as informal documentation for Linux kernel
programmers about the bounds on the optimisations that you expect
compilers to do for common-case RCU code - and I guess that's what you
intend it to be for.   But I don't see how one can make it precise
enough to serve as a language definition, so that compiler people
could confidently say "yes, we respect that", which I guess is what
you really need.  As a useful criterion, we should aim for something
precise enough that in a verified-compiler context you can
mathematically prove that the compiler will satisfy it  (even though
that won't happen anytime soon for GCC), and that analysis tool
authors can actually know what they're working with.   All this stuff
about "you should avoid cancellation", and "avoid masking with just a
small number of bits" is just too vague.

The basic problem is that the compiler may be doing sophisticated
reasoning with a bunch of non-local knowledge that it's deduced from
the code, neither of which is well understood, and here we have to
identify some envelope, expressive enough for RCU idioms, in which
that reasoning doesn't allow data/address dependencies to be removed
(and hence the hardware guarantee about them will be maintained at the
source level).

The C11 syntactic notion of dependency, whatever its faults, was at
least precise, could be reasoned about locally (just looking at the
syntactic code in question), and did do that.  The fact that current
compilers do optimisations that remove dependencies and will likely
have many bugs at present is besides the point - this was surely
intended as a *new* constraint on what they are allowed to do.  The
interesting question is really whether the compiler writers think that
they *could* implement it in a reasonable way - I'd like to hear
Torvald and his colleagues' opinion on that.

What you're doing above seems to be basically a very cut-down version
of that, but with a fuzzy boundary.   If you want it to be precise,
maybe it needs to be much simpler (which might force you into ruling
out some current code idioms).

best,
Peter



>                                                         Thanx, Paul
>
> ------------------------------------------------------------------------
>
> documentation: Record rcu_dereference() value mishandling
>
> Recent LKML discussions (see http://lwn.net/Articles/586838/ and
> http://lwn.net/Articles/588300/ for the LWN writeups) brought out
> some ways of misusing the return value from rcu_dereference() that
> are not necessarily completely intuitive.  This commit therefore
> documents what can and cannot safely be done with these values.
>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
> index fa57139f50bf..f773a264ae02 100644
> --- a/Documentation/RCU/00-INDEX
> +++ b/Documentation/RCU/00-INDEX
> @@ -12,6 +12,8 @@ lockdep-splat.txt
>         - RCU Lockdep splats explained.
>  NMI-RCU.txt
>         - Using RCU to Protect Dynamic NMI Handlers
> +rcu_dereference.txt
> +       - Proper care and feeding of return values from rcu_dereference()
>  rcubarrier.txt
>         - RCU and Unloadable Modules
>  rculist_nulls.txt
> diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
> index 9d10d1db16a5..877947130ebe 100644
> --- a/Documentation/RCU/checklist.txt
> +++ b/Documentation/RCU/checklist.txt
> @@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome!
>                         http://www.openvms.compaq.com/wizard/wiz_2637.html
>
>                 The rcu_dereference() primitive is also an excellent
> -               documentation aid, letting the person reading the code
> -               know exactly which pointers are protected by RCU.
> +               documentation aid, letting the person reading the
> +               code know exactly which pointers are protected by RCU.
>                 Please note that compilers can also reorder code, and
>                 they are becoming increasingly aggressive about doing
> -               just that.  The rcu_dereference() primitive therefore
> -               also prevents destructive compiler optimizations.
> +               just that.  The rcu_dereference() primitive therefore also
> +               prevents destructive compiler optimizations.  However,
> +               with a bit of devious creativity, it is possible to
> +               mishandle the return value from rcu_dereference().
> +               Please see rcu_dereference.txt in this directory for
> +               more information.
>
>                 The rcu_dereference() primitive is used by the
>                 various "_rcu()" list-traversal primitives, such
> diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt
> new file mode 100644
> index 000000000000..6e72cd8622df
> --- /dev/null
> +++ b/Documentation/RCU/rcu_dereference.txt
> @@ -0,0 +1,365 @@
> +PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference()
> +
> +Most of the time, you can use values from rcu_dereference() or one of
> +the similar primitives without worries.  Dereferencing (prefix "*"),
> +field selection ("->"), assignment ("="), address-of ("&"), addition and
> +subtraction of constants, and casts all work quite naturally and safely.
> +
> +It is nevertheless possible to get into trouble with other operations.
> +Follow these rules to keep your RCU code working properly:
> +
> +o      You must use one of the rcu_dereference() family of primitives
> +       to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU
> +       will complain.  Worse yet, your code can see random memory-corruption
> +       bugs due to games that compilers and DEC Alpha can play.
> +       Without one of the rcu_dereference() primitives, compilers
> +       can reload the value, and won't your code have fun with two
> +       different values for a single pointer!  Without rcu_dereference(),
> +       DEC Alpha can load a pointer, dereference that pointer, and
> +       return data preceding initialization that preceded the store of
> +       the pointer.
> +
> +       In addition, the volatile cast in rcu_dereference() prevents the
> +       compiler from deducing the resulting pointer value.  Please see
> +       the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH"
> +       for an example where the compiler can in fact deduce the exact
> +       value of the pointer, and thus cause misordering.
> +
> +o      Do not use single-element RCU-protected arrays.  The compiler
> +       is within its rights to assume that the value of an index into
> +       such an array must necessarily evaluate to zero.  The compiler
> +       could then substitute the constant zero for the computation, so
> +       that the array index no longer depended on the value returned
> +       by rcu_dereference().  If the array index no longer depends
> +       on rcu_dereference(), then both the compiler and the CPU
> +       are within their rights to order the array access before the
> +       rcu_dereference(), which can cause the array access to return
> +       garbage.
> +
> +o      Avoid cancellation when using the "+" and "-" infix arithmetic
> +       operators.  For example, for a given variable "x", avoid
> +       "(x-x)".  There are similar arithmetic pitfalls from other
> +       arithmetic operators, such as "(x*0)", "(x/(x+1))" or "(x%1)".
> +       The compiler is within its rights to substitute zero for all of
> +       these expressions, so that subsequent accesses no longer depend
> +       on the rcu_dereference(), again possibly resulting in bugs due
> +       to misordering.
> +
> +       Of course, if "p" is a pointer from rcu_dereference(), and "a"
> +       and "b" are integers that happen to be equal, the expression
> +       "p+a-b" is safe because its value still necessarily depends on
> +       the rcu_dereference(), thus maintaining proper ordering.
> +
> +o      Avoid all-zero operands to the bitwise "&" operator, and
> +       similarly avoid all-ones operands to the bitwise "|" operator.
> +       If the compiler is able to deduce the value of such operands,
> +       it is within its rights to substitute the corresponding constant
> +       for the bitwise operation.  Once again, this causes subsequent
> +       accesses to no longer depend on the rcu_dereference(), causing
> +       bugs due to misordering.
> +
> +       Please note that single-bit operands to bitwise "&" can also
> +       be dangerous.  At this point, the compiler knows that the
> +       resulting value can only take on one of two possible values.
> +       Therefore, a very small amount of additional information will
> +       allow the compiler to deduce the exact value, which again can
> +       result in misordering.
> +
> +o      If you are using RCU to protect JITed functions, so that the
> +       "()" function-invocation operator is applied to a value obtained
> +       (directly or indirectly) from rcu_dereference(), you may need to
> +       interact directly with the hardware to flush instruction caches.
> +       This issue arises on some systems when a newly JITed function is
> +       using the same memory that was used by an earlier JITed function.
> +
> +o      Do not use the results from the boolean "&&" and "||" when
> +       dereferencing.  For example, the following (rather improbable)
> +       code is buggy:
> +
> +               int a[2];
> +               int index;
> +               int force_zero_index = 1;
> +
> +               ...
> +
> +               r1 = rcu_dereference(i1);
> +               r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> +
> +       The reason this is buggy is that "&&" and "||" are often compiled
> +       using branches.  While weak-memory machines such as ARM or PowerPC
> +       do order stores after such branches, they can speculate loads,
> +       which can result in misordering bugs.
> +
> +o      Do not use the results from relational operators ("==", "!=",
> +       ">", ">=", "<", or "<=") when dereferencing.  For example,
> +       the following (quite strange) code is buggy:
> +
> +               int a[2];
> +               int index;
> +               int flip_index = 0;
> +
> +               ...
> +
> +               r1 = rcu_dereference(i1);
> +               r2 = a[r1 != flip_index];  /* BUGGY!!! */
> +
> +       As before, the reason this is buggy is that relational operators
> +       are often compiled using branches.  And as before, while
> +       weak-memory machines such as ARM or PowerPC do order stores
> +       after such branches, they can speculate loads, which can again
> +       result in misordering bugs.
> +
> +o      Be very careful about comparing pointers obtained from
> +       rcu_dereference() against non-NULL values.  As Linus Torvalds
> +       explained, if the two pointers are equal, the compiler could
> +       substitute the pointer you are comparing against for the pointer
> +       obtained from rcu_dereference().  For example:
> +
> +               p = rcu_dereference(gp);
> +               if (p == &default_struct)
> +                       do_default(p->a);
> +
> +       Because the compiler now knows that the value of "p" is exactly
> +       the address of the variable "default_struct", it is free to
> +       transform this code into the following:
> +
> +               p = rcu_dereference(gp);
> +               if (p == &default_struct)
> +                       do_default(default_struct.a);
> +
> +       On ARM and Power hardware, the load from "default_struct.a"
> +       can now be speculated, such that it might happen before the
> +       rcu_dereference().  This could result in bugs due to misordering.
> +
> +       However, comparisons are OK in the following cases:
> +
> +       o       The comparison was against the NULL pointer.  If the
> +               compiler knows that the pointer is NULL, you had better
> +               not be dereferencing it anyway.  If the comparison is
> +               non-equal, the compiler is none the wiser.  Therefore,
> +               it is safe to compare pointers from rcu_dereference()
> +               against NULL pointers.
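> +
> +               For example, this common pattern is safe (a sketch,
> +               with "gp" as in the examples above):
> +
> +                       p = rcu_dereference(gp);
> +                       if (p == NULL)
> +                               return;    /* Nothing is dereferenced. */
> +                       r1 = p->a;  /* Safe: "p != NULL" does not
> +                                      reveal the value of "p". */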
> +
> +       o       The pointer is never dereferenced after being compared.
> +               Since there are no subsequent dereferences, the compiler
> +               cannot use anything it learned from the comparison
> +               to reorder the non-existent subsequent dereferences.
> +               This sort of comparison occurs frequently when scanning
> +               RCU-protected circular linked lists.
> +
> +       o       The comparison is against a pointer that
> +               references memory that was initialized "a long time ago."
> +               The reason this is safe is that even if misordering
> +               occurs, the misordering will not affect the accesses
> +               that follow the comparison.  So exactly how long ago is
> +               "a long time ago"?  Here are some possibilities:
> +
> +               o       Compile time.
> +
> +               o       Boot time.
> +
> +               o       Module-init time for module code.
> +
> +               o       Prior to kthread creation for kthread code.
> +
> +               o       During some prior acquisition of the lock that
> +                       we now hold.
> +
> +               o       Before mod_timer() time for a timer handler.
> +
> +               There are many other possibilities involving the Linux
> +               kernel's wide array of primitives that cause code to
> +               be invoked at a later time.
> +
> +       o       The pointer being compared against also came from
> +               rcu_dereference().  In this case, both pointers depend
> +               on one rcu_dereference() or another, so you get proper
> +               ordering either way.
> +
> +               That said, this situation can make certain RCU usage
> +               bugs more likely to happen.  Which can be a good thing,
> +               at least if they happen during testing.  An example
> +               of such an RCU usage bug is shown in the section titled
> +               "EXAMPLE OF AMPLIFIED RCU-USAGE BUG".
> +
> +       o       All of the accesses following the comparison are stores,
> +               so that a control dependency preserves the needed ordering.
> +               That said, it is easy to get control dependencies wrong.
> +               Please see the "CONTROL DEPENDENCIES" section of
> +               Documentation/memory-barriers.txt for more details.
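> +
> +               For example, a sketch of the stores-only case (see
> +               memory-barriers.txt for the ways this can go wrong):
> +
> +                       p = rcu_dereference(gp);
> +                       if (p == &default_struct)
> +                               p->a = 42;  /* Store only: the control
> +                                              dependency preserves the
> +                                              needed ordering. */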
> +
> +       o       The pointers compared not-equal -and- the compiler does
> +               not have enough information to deduce the value of the
> +               pointer.  Note that the volatile cast in rcu_dereference()
> +               will normally prevent the compiler from knowing too much.
> +
> +o      Disable any value-speculation optimizations that your compiler
> +       might provide, especially if you are making use of feedback-based
> +       optimizations that take data collected from prior runs.  Such
> +       value-speculation optimizations reorder operations by design.
> +
> +       There is one exception to this rule:  Value-speculation
> +       optimizations that leverage the branch-prediction hardware are
> +       safe on strongly ordered systems (such as x86), but not on weakly
> +       ordered systems (such as ARM or Power).  Choose your compiler
> +       command-line options wisely!
> +
> +
> +EXAMPLE OF AMPLIFIED RCU-USAGE BUG
> +
> +Because updaters can run concurrently with RCU readers, RCU readers can
> +see stale and/or inconsistent values.  If RCU readers need fresh or
> +consistent values, which they sometimes do, they need to take proper
> +precautions.  To see this, consider the following code fragment:
> +
> +       struct foo {
> +               int a;
> +               int b;
> +               int c;
> +       };
> +       struct foo *gp1;
> +       struct foo *gp2;
> +
> +       void updater(void)
> +       {
> +               struct foo *p;
> +
> +               p = kmalloc(...);
> +               if (p == NULL)
> +                       deal_with_it();
> +               p->a = 42;  /* Each field in its own cache line. */
> +               p->b = 43;
> +               p->c = 44;
> +               rcu_assign_pointer(gp1, p);
> +               p->b = 143;
> +               p->c = 144;
> +               rcu_assign_pointer(gp2, p);
> +       }
> +
> +       void reader(void)
> +       {
> +               struct foo *p;
> +               struct foo *q;
> +               int r1, r2;
> +
> +               p = rcu_dereference(gp2);
> +               r1 = p->b;  /* Guaranteed to get 143. */
> +               q = rcu_dereference(gp1);
> +               if (p == q) {
> +                       /* The compiler decides that q->c is the same as p->c. */
> +                       r2 = p->c; /* Could get 44 on a weakly ordered system. */
> +               }
> +       }
> +
> +You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible,
> +but you should not be.  After all, the updater might have been invoked
> +a second time between the time reader() loaded into "r1" and the time
> +that it loaded into "r2".  The fact that this same result can occur due
> +to some reordering from the compiler and CPUs is beside the point.
> +
> +But suppose that the reader needs a consistent view?
> +
> +Then one approach is to use locking, for example, as follows:
> +
> +       struct foo {
> +               int a;
> +               int b;
> +               int c;
> +               spinlock_t lock;
> +       };
> +       struct foo *gp1;
> +       struct foo *gp2;
> +
> +       void updater(void)
> +       {
> +               struct foo *p;
> +
> +               p = kmalloc(...);
> +               if (p == NULL)
> +                       deal_with_it();
> +               spin_lock(&p->lock);
> +               p->a = 42;  /* Each field in its own cache line. */
> +               p->b = 43;
> +               p->c = 44;
> +               spin_unlock(&p->lock);
> +               rcu_assign_pointer(gp1, p);
> +               spin_lock(&p->lock);
> +               p->b = 143;
> +               p->c = 144;
> +               spin_unlock(&p->lock);
> +               rcu_assign_pointer(gp2, p);
> +       }
> +
> +       void reader(void)
> +       {
> +               struct foo *p;
> +               struct foo *q;
> +               int r1, r2;
> +
> +               p = rcu_dereference(gp2);
> +               spin_lock(&p->lock);
> +               r1 = p->b;  /* Guaranteed to get 143. */
> +               q = rcu_dereference(gp1);
> +               if (p == q) {
> +                       /* The compiler decides that q->c is the same as p->c. */
> +                       r2 = p->c; /* Locking guarantees r2 == 144. */
> +               }
> +               spin_unlock(&p->lock);
> +       }
> +
> +As always, use the right tool for the job!
> +
> +
> +EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH
> +
> +If a pointer obtained from rcu_dereference() compares not-equal to some
> +other pointer, the compiler normally has no clue what the value of the
> +first pointer might be.  This lack of knowledge prevents the compiler
> +from carrying out optimizations that otherwise might destroy the ordering
> +guarantees that RCU depends on.  And the volatile cast in rcu_dereference()
> +should prevent the compiler from guessing the value.
> +
> +But without rcu_dereference(), the compiler knows more than you might
> +expect.  Consider the following code fragment:
> +
> +       struct foo {
> +               int a;
> +               int b;
> +       };
> +       static struct foo variable1;
> +       static struct foo variable2;
> +       static struct foo *gp = &variable1;
> +
> +       void updater(void)
> +       {
> +               initialize_foo(&variable2);
> +               rcu_assign_pointer(gp, &variable2);
> +               /*
> +                * The above is the only store to gp in this translation unit,
> +                * and the address of gp is not exported in any way.
> +                */
> +       }
> +
> +       int reader(void)
> +       {
> +               struct foo *p;
> +
> +               p = gp;
> +               barrier();
> +               if (p == &variable1)
> +                       return p->a; /* Must be variable1.a. */
> +               else
> +                       return p->b; /* Must be variable2.b. */
> +       }
> +
> +Because the compiler can see all stores to "gp", it knows that the only
> +possible values of "gp" are "variable1" on the one hand and "variable2"
> +on the other.  The comparison in reader() therefore tells the compiler
> +the exact value of "p" even in the not-equals case.  This allows the
> +compiler to make the return values independent of the load from "gp",
> +in turn destroying the ordering between this load and the loads of the
> +return values.  This can result in "p->b" returning pre-initialization
> +garbage values.
> +
> +In short, rcu_dereference() is -not- optional when you are going to
> +dereference the resulting pointer.
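> +
> +For example, a sketch of reader() repaired (assuming "gp" is still
> +published with rcu_assign_pointer() as in updater() above):
> +
> +       int reader(void)
> +       {
> +               struct foo *p;
> +
> +               p = rcu_dereference(gp);
> +               if (p == &variable1)
> +                       return p->a; /* Safe: initialized long ago. */
> +               else
> +                       return p->b; /* Ordered after the load of gp. */
> +       }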
>


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-01 10:06                                                                                                                         ` Peter Sewell
@ 2014-03-01 14:03                                                                                                                           ` Paul E. McKenney
  2014-03-02 10:05                                                                                                                             ` Peter Sewell
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-03-01 14:03 UTC (permalink / raw)
  To: Peter Sewell
  Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> Hi Paul,
> 
> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> > <paulmck@linux.vnet.ibm.com> wrote:
> >> > >
> >> > > 3.      The comparison was against another RCU-protected pointer,
> >> > >         where that other pointer was properly fetched using one
> >> > >         of the RCU primitives.  Here it doesn't matter which pointer
> >> > >         you use.  At least as long as the rcu_assign_pointer() for
> >> > >         that other pointer happened after the last update to the
> >> > >         pointed-to structure.
> >> > >
> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >
> >> > I think that it might be worth pointing out as an example, and saying
> >> > that code like
> >> >
> >> >    p = atomic_read(consume);
> >> >    X;
> >> >    q = atomic_read(consume);
> >> >    Y;
> >> >    if (p == q)
> >> >         data = p->val;
> >> >
> >> > then the access of "p->val" is constrained to be data-dependent on
> >> > *either* p or q, but you can't really tell which, since the compiler
> >> > can decide that the values are interchangeable.
> >> >
> >> > I cannot for the life of me come up with a situation where this would
> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> > stronger ordering than anything the consume through "p" would
> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> > ordering to the access through "p" is through p or q is kind of
> >> > irrelevant. No?
> >>
> >> I can make a contrived litmus test for it, but you are right, the only
> >> time you can see it happen is when X has no barriers, in which case
> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> reorder the loads into p and q, and the read from p->val can, as you say,
> >> come from either pointer.
> >>
> >> For whatever it is worth, here is the litmus test:
> >>
> >> T1:   p = kmalloc(...);
> >>       if (p == NULL)
> >>               deal_with_it();
> >>       p->a = 42;  /* Each field in its own cache line. */
> >>       p->b = 43;
> >>       p->c = 44;
> >>       atomic_store_explicit(&gp1, p, memory_order_release);
> >>       p->b = 143;
> >>       p->c = 144;
> >>       atomic_store_explicit(&gp2, p, memory_order_release);
> >>
> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >>       r1 = p->b;  /* Guaranteed to get 143. */
> >>       q = atomic_load_explicit(&gp1, memory_order_consume);
> >>       if (p == q) {
> >>               /* The compiler decides that q->c is the same as p->c. */
> >>               r2 = p->c; /* Could get 44 on a weakly ordered system. */
> >>       }
> >>
> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> you get.
> >>
> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> then publishing it via another pointer seems to me to be asking for
> >> trouble anyway.  If you really want to do something like that and still
> >> see consistency across all the fields in the structure, please put a lock
> >> in the structure and use it to guard updates and accesses to those fields.
> >
> > And here is a patch documenting the restrictions for the current Linux
> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> > differently than atomic_load_explicit(&p, memory_order_consume).
> >
> > Thoughts?
> 
> That might serve as informal documentation for linux kernel
> programmers about the bounds on the optimisations that you expect
> compilers to do for common-case RCU code - and I guess that's what you
> intend it to be for.   But I don't see how one can make it precise
> enough to serve as a language definition, so that compiler people
> could confidently say "yes, we respect that", which I guess is what
> you really need.  As a useful criterion, we should aim for something
> precise enough that in a verified-compiler context you can
> mathematically prove that the compiler will satisfy it  (even though
> that won't happen anytime soon for GCC), and that analysis tool
> authors can actually know what they're working with.   All this stuff
> about "you should avoid cancellation", and "avoid masking with just a
> small number of bits" is just too vague.

Understood, and yes, this is intended to document current compiler
behavior for the Linux kernel community.  It would not make sense to show
it to the C11 or C++11 communities, except perhaps as an informational
piece on current practice.

> The basic problem is that the compiler may be doing sophisticated
> reasoning with a bunch of non-local knowledge that it's deduced from
> the code, neither of which are well-understood, and here we have to
> identify some envelope, expressive enough for RCU idioms, in which
> that reasoning doesn't allow data/address dependencies to be removed
> (and hence the hardware guarantee about them will be maintained at the
> source level).
> 
> The C11 syntactic notion of dependency, whatever its faults, was at
> least precise, could be reasoned about locally (just looking at the
> syntactic code in question), and did do that.  The fact that current
> compilers do optimisations that remove dependencies and will likely
> > have many bugs at present is beside the point - this was surely
> intended as a *new* constraint on what they are allowed to do.  The
> interesting question is really whether the compiler writers think that
> they *could* implement it in a reasonable way - I'd like to hear
> Torvald and his colleagues' opinion on that.
> 
> What you're doing above seems to be basically a very cut-down version
> of that, but with a fuzzy boundary.   If you want it to be precise,
> maybe it needs to be much simpler (which might force you into ruling
> out some current code idioms).

I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806)
can be developed to serve this purpose.

							Thanx, Paul

> best,
> Peter
> 
> 
> 
> >                                                         Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > documentation: Record rcu_dereference() value mishandling
> >
> > Recent LKML discussions (see http://lwn.net/Articles/586838/ and
> > http://lwn.net/Articles/588300/ for the LWN writeups) brought out
> > some ways of misusing the return value from rcu_dereference() that
> > are not necessarily completely intuitive.  This commit therefore
> > documents what can and cannot safely be done with these values.
> >
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >
> > diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
> > index fa57139f50bf..f773a264ae02 100644
> > --- a/Documentation/RCU/00-INDEX
> > +++ b/Documentation/RCU/00-INDEX
> > @@ -12,6 +12,8 @@ lockdep-splat.txt
> >         - RCU Lockdep splats explained.
> >  NMI-RCU.txt
> >         - Using RCU to Protect Dynamic NMI Handlers
> > +rcu_dereference.txt
> > +       - Proper care and feeding of return values from rcu_dereference()
> >  rcubarrier.txt
> >         - RCU and Unloadable Modules
> >  rculist_nulls.txt
> > diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
> > index 9d10d1db16a5..877947130ebe 100644
> > --- a/Documentation/RCU/checklist.txt
> > +++ b/Documentation/RCU/checklist.txt
> > @@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome!
> >                         http://www.openvms.compaq.com/wizard/wiz_2637.html
> >
> >                 The rcu_dereference() primitive is also an excellent
> > -               documentation aid, letting the person reading the code
> > -               know exactly which pointers are protected by RCU.
> > +               documentation aid, letting the person reading the
> > +               code know exactly which pointers are protected by RCU.
> >                 Please note that compilers can also reorder code, and
> >                 they are becoming increasingly aggressive about doing
> > -               just that.  The rcu_dereference() primitive therefore
> > -               also prevents destructive compiler optimizations.
> > +               just that.  The rcu_dereference() primitive therefore also
> > +               prevents destructive compiler optimizations.  However,
> > +               with a bit of devious creativity, it is possible to
> > +               mishandle the return value from rcu_dereference().
> > +               Please see rcu_dereference.txt in this directory for
> > +               more information.
> >
> >                 The rcu_dereference() primitive is used by the
> >                 various "_rcu()" list-traversal primitives, such
> > diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt
> > new file mode 100644
> > index 000000000000..6e72cd8622df
> > --- /dev/null
> > +++ b/Documentation/RCU/rcu_dereference.txt
> > @@ -0,0 +1,365 @@
> > +PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference()
> > +
> > +Most of the time, you can use values from rcu_dereference() or one of
> > +the similar primitives without worries.  Dereferencing (prefix "*"),
> > +field selection ("->"), assignment ("="), address-of ("&"), addition and
> > +subtraction of constants, and casts all work quite naturally and safely.
> > +
> > +It is nevertheless possible to get into trouble with other operations.
> > +Follow these rules to keep your RCU code working properly:
> > +
> > +o      You must use one of the rcu_dereference() family of primitives
> > +       to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU
> > +       will complain.  Worse yet, your code can see random memory-corruption
> > +       bugs due to games that compilers and DEC Alpha can play.
> > +       Without one of the rcu_dereference() primitives, compilers
> > +       can reload the value, and won't your code have fun with two
> > +       different values for a single pointer!  Without rcu_dereference(),
> > +       DEC Alpha can load a pointer, dereference that pointer, and
> > +       return data preceding initialization that preceded the store of
> > +       the pointer.
> > +
> > +       In addition, the volatile cast in rcu_dereference() prevents the
> > +       compiler from deducing the resulting pointer value.  Please see
> > +       the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH"
> > +       for an example where the compiler can in fact deduce the exact
> > +       value of the pointer, and thus cause misordering.
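> > +
> > +       For example, a sketch of the reload hazard (with a
> > +       hypothetical RCU-protected pointer "gp"):
> > +
> > +               p = gp;  /* BUGGY: no rcu_dereference(). */
> > +               if (p != NULL)
> > +                       do_something(p->a);  /* The compiler may
> > +                                               reload gp here, and
> > +                                               gp may now be NULL. */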
> > +
> > +o      Do not use single-element RCU-protected arrays.  The compiler
> > +       is within its rights to assume that the value of an index into
> > +       such an array must necessarily evaluate to zero.  The compiler
> > +       could then substitute the constant zero for the computation, so
> > +       that the array index no longer depended on the value returned
> > +       by rcu_dereference().  If the array index no longer depends
> > +       on rcu_dereference(), then both the compiler and the CPU
> > +       are within their rights to order the array access before the
> > +       rcu_dereference(), which can cause the array access to return
> > +       garbage.
> > +
> > +o      Avoid cancellation when using the "+" and "-" infix arithmetic
> > +       operators.  For example, for a given variable "x", avoid
> > +       "(x-x)".  There are similar arithmetic pitfalls from other
> > +       arithmetic operators, such as "(x*0)", "(x/(x+1))", or "(x%1)".
> > +       The compiler is within its rights to substitute zero for all of
> > +       these expressions, so that subsequent accesses no longer depend
> > +       on the rcu_dereference(), again possibly resulting in bugs due
> > +       to misordering.
> > +
> > +       Of course, if "p" is a pointer from rcu_dereference(), and "a"
> > +       and "b" are integers that happen to be equal, the expression
> > +       "p+a-b" is safe because its value still necessarily depends on
> > +       the rcu_dereference(), thus maintaining proper ordering.
> > +
> > +o      Avoid all-zero operands to the bitwise "&" operator, and
> > +       similarly avoid all-ones operands to the bitwise "|" operator.
> > +       If the compiler is able to deduce the value of such operands,
> > +       it is within its rights to substitute the corresponding constant
> > +       for the bitwise operation.  Once again, this causes subsequent
> > +       accesses to no longer depend on the rcu_dereference(), causing
> > +       bugs due to misordering.
> > +
> > +       Please note that single-bit operands to bitwise "&" can also
> > +       be dangerous.  At this point, the compiler knows that the
> > +       resulting value can only take on one of two possible values.
> > +       Therefore, a very small amount of additional information will
> > +       allow the compiler to deduce the exact value, which again can
> > +       result in misordering.
> > +
> > +o      If you are using RCU to protect JITed functions, so that the
> > +       "()" function-invocation operator is applied to a value obtained
> > +       (directly or indirectly) from rcu_dereference(), you may need to
> > +       interact directly with the hardware to flush instruction caches.
> > +       This issue arises on some systems when a newly JITed function is
> > +       using the same memory that was used by an earlier JITed function.
> > +
> > +o      Do not use the results from the boolean "&&" and "||" when
> > +       dereferencing.  For example, the following (rather improbable)
> > +       code is buggy:
> > +
> > +               int a[2];
> > +               int index;
> > +               int force_zero_index = 1;
> > +
> > +               ...
> > +
> > +               r1 = rcu_dereference(i1);
> > +               r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > +
> > +       The reason this is buggy is that "&&" and "||" are often compiled
> > +       using branches.  While weak-memory machines such as ARM or PowerPC
> > +       do order stores after such branches, they can speculate loads,
> > +       which can result in misordering bugs.
> > +
> > +o      Do not use the results from relational operators ("==", "!=",
> > +       ">", ">=", "<", or "<=") when dereferencing.  For example,
> > +       the following (quite strange) code is buggy:
> > +
> > +               int a[2];
> > +               int index;
> > +               int flip_index = 0;
> > +
> > +               ...
> > +
> > +               r1 = rcu_dereference(i1);
> > +               r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > +
> > +       As before, the reason this is buggy is that relational operators
> > +       are often compiled using branches.  And as before, while
> > +       weak-memory machines such as ARM or PowerPC do order stores
> > +       after such branches, they can speculate loads, which can again
> > +       result in misordering bugs.
> > +
> > +o      Be very careful about comparing pointers obtained from
> > +       rcu_dereference() against non-NULL values.  As Linus Torvalds
> > +       explained, if the two pointers are equal, the compiler could
> > +       substitute the pointer you are comparing against for the pointer
> > +       obtained from rcu_dereference().  For example:
> > +
> > +               p = rcu_dereference(gp);
> > +               if (p == &default_struct)
> > +                       do_default(p->a);
> > +
> > +       Because the compiler now knows that the value of "p" is exactly
> > +       the address of the variable "default_struct", it is free to
> > +       transform this code into the following:
> > +
> > +               p = rcu_dereference(gp);
> > +               if (p == &default_struct)
> > +                       do_default(default_struct.a);
> > +
> > +       On ARM and Power hardware, the load from "default_struct.a"
> > +       can now be speculated, such that it might happen before the
> > +       rcu_dereference().  This could result in bugs due to misordering.
> > +
> > +       However, comparisons are OK in the following cases:
> > +
> > +       o       The comparison was against the NULL pointer.  If the
> > +               compiler knows that the pointer is NULL, you had better
> > +               not be dereferencing it anyway.  If the comparison is
> > +               non-equal, the compiler is none the wiser.  Therefore,
> > +               it is safe to compare pointers from rcu_dereference()
> > +               against NULL pointers.
> > +
> > +       o       The pointer is never dereferenced after being compared.
> > +               Since there are no subsequent dereferences, the compiler
> > +               cannot use anything it learned from the comparison
> > +               to reorder the non-existent subsequent dereferences.
> > +               This sort of comparison occurs frequently when scanning
> > +               RCU-protected circular linked lists.
> > +
> > +       o       The comparison is against a pointer that
> > +               references memory that was initialized "a long time ago."
> > +               The reason this is safe is that even if misordering
> > +               occurs, the misordering will not affect the accesses
> > +               that follow the comparison.  So exactly how long ago is
> > +               "a long time ago"?  Here are some possibilities:
> > +
> > +               o       Compile time.
> > +
> > +               o       Boot time.
> > +
> > +               o       Module-init time for module code.
> > +
> > +               o       Prior to kthread creation for kthread code.
> > +
> > +               o       During some prior acquisition of the lock that
> > +                       we now hold.
> > +
> > +               o       Before mod_timer() time for a timer handler.
> > +
> > +               There are many other possibilities involving the Linux
> > +               kernel's wide array of primitives that cause code to
> > +               be invoked at a later time.
> > +
> > +       o       The pointer being compared against also came from
> > +               rcu_dereference().  In this case, both pointers depend
> > +               on one rcu_dereference() or another, so you get proper
> > +               ordering either way.
> > +
> > +               That said, this situation can make certain RCU usage
> > +               bugs more likely to happen.  Which can be a good thing,
> > +               at least if they happen during testing.  An example
> > +               of such an RCU usage bug is shown in the section titled
> > +               "EXAMPLE OF AMPLIFIED RCU-USAGE BUG".
> > +
> > +       o       All of the accesses following the comparison are stores,
> > +               so that a control dependency preserves the needed ordering.
> > +               That said, it is easy to get control dependencies wrong.
> > +               Please see the "CONTROL DEPENDENCIES" section of
> > +               Documentation/memory-barriers.txt for more details.
> > +
> > +       o       The pointers compared not-equal -and- the compiler does
> > +               not have enough information to deduce the value of the
> > +               pointer.  Note that the volatile cast in rcu_dereference()
> > +               will normally prevent the compiler from knowing too much.
> > +
> > +o      Disable any value-speculation optimizations that your compiler
> > +       might provide, especially if you are making use of feedback-based
> > +       optimizations that take data collected from prior runs.  Such
> > +       value-speculation optimizations reorder operations by design.
> > +
> > +       There is one exception to this rule:  Value-speculation
> > +       optimizations that leverage the branch-prediction hardware are
> > +       safe on strongly ordered systems (such as x86), but not on weakly
> > +       ordered systems (such as ARM or Power).  Choose your compiler
> > +       command-line options wisely!
> > +
> > +
> > +EXAMPLE OF AMPLIFIED RCU-USAGE BUG
> > +
> > +Because updaters can run concurrently with RCU readers, RCU readers can
> > +see stale and/or inconsistent values.  If RCU readers need fresh or
> > +consistent values, which they sometimes do, they need to take proper
> > +precautions.  To see this, consider the following code fragment:
> > +
> > +       struct foo {
> > +               int a;
> > +               int b;
> > +               int c;
> > +       };
> > +       struct foo *gp1;
> > +       struct foo *gp2;
> > +
> > +       void updater(void)
> > +       {
> > +               struct foo *p;
> > +
> > +               p = kmalloc(...);
> > +               if (p == NULL)
> > +                       deal_with_it();
> > +               p->a = 42;  /* Each field in its own cache line. */
> > +               p->b = 43;
> > +               p->c = 44;
> > +               rcu_assign_pointer(gp1, p);
> > +               p->b = 143;
> > +               p->c = 144;
> > +               rcu_assign_pointer(gp2, p);
> > +       }
> > +
> > +       void reader(void)
> > +       {
> > +               struct foo *p;
> > +               struct foo *q;
> > +               int r1, r2;
> > +
> > +               p = rcu_dereference(gp2);
> > +               r1 = p->b;  /* Guaranteed to get 143. */
> > +               q = rcu_dereference(gp1);
> > +               if (p == q) {
> > +                       /* The compiler decides that q->c is the same as p->c. */
> > +                       r2 = p->c; /* Could get 44 on a weakly ordered system. */
> > +               }
> > +       }
> > +
> > +You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible,
> > +but you should not be.  After all, the updater might have been invoked
> > +a second time between the time reader() loaded into "r1" and the time
> > +that it loaded into "r2".  The fact that this same result can occur due
> > +to some reordering from the compiler and CPUs is beside the point.
> > +
> > +But suppose that the reader needs a consistent view?
> > +
> > +Then one approach is to use locking, for example, as follows:
> > +
> > +       struct foo {
> > +               int a;
> > +               int b;
> > +               int c;
> > +               spinlock_t lock;
> > +       };
> > +       struct foo *gp1;
> > +       struct foo *gp2;
> > +
> > +       void updater(void)
> > +       {
> > +               struct foo *p;
> > +
> > +               p = kmalloc(...);
> > +               if (p == NULL)
> > +                       deal_with_it();
> > +               spin_lock(&p->lock);
> > +               p->a = 42;  /* Each field in its own cache line. */
> > +               p->b = 43;
> > +               p->c = 44;
> > +               spin_unlock(&p->lock);
> > +               rcu_assign_pointer(gp1, p);
> > +               spin_lock(&p->lock);
> > +               p->b = 143;
> > +               p->c = 144;
> > +               spin_unlock(&p->lock);
> > +               rcu_assign_pointer(gp2, p);
> > +       }
> > +
> > +       void reader(void)
> > +       {
> > +               struct foo *p;
> > +               struct foo *q;
> > +               int r1, r2;
> > +
> > +               p = rcu_dereference(gp2);
> > +               spin_lock(&p->lock);
> > +               r1 = p->b;  /* Guaranteed to get 143. */
> > +               q = rcu_dereference(gp1);
> > +               if (p == q) {
> > +                       /* The compiler decides that q->c is the same as p->c. */
> > +                       r2 = p->c; /* Locking guarantees r2 == 144. */
> > +               }
> > +               spin_unlock(&p->lock);
> > +       }
> > +
> > +As always, use the right tool for the job!
> > +
> > +
> > +EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH
> > +
> > +If a pointer obtained from rcu_dereference() compares not-equal to some
> > +other pointer, the compiler normally has no clue what the value of the
> > +first pointer might be.  This lack of knowledge prevents the compiler
> > +from carrying out optimizations that otherwise might destroy the ordering
> > +guarantees that RCU depends on.  And the volatile cast in rcu_dereference()
> > +should prevent the compiler from guessing the value.
> > +
> > +But without rcu_dereference(), the compiler knows more than you might
> > +expect.  Consider the following code fragment:
> > +
> > +       struct foo {
> > +               int a;
> > +               int b;
> > +       };
> > +       static struct foo variable1;
> > +       static struct foo variable2;
> > +       static struct foo *gp = &variable1;
> > +
> > +       void updater(void)
> > +       {
> > +               initialize_foo(&variable2);
> > +               rcu_assign_pointer(gp, &variable2);
> > +               /*
> > +                * The above is the only store to gp in this translation unit,
> > +                * and the address of gp is not exported in any way.
> > +                */
> > +       }
> > +
> > +       int reader(void)
> > +       {
> > +               struct foo *p;
> > +
> > +               p = gp;
> > +               barrier();
> > +               if (p == &variable1)
> > +                       return p->a; /* Must be variable1.a. */
> > +               else
> > +                       return p->b; /* Must be variable2.b. */
> > +       }
> > +
> > +Because the compiler can see all stores to "gp", it knows that the only
> > +possible values of "gp" are "variable1" on the one hand and "variable2"
> > +on the other.  The comparison in reader() therefore tells the compiler
> > +the exact value of "p" even in the not-equals case.  This allows the
> > +compiler to make the return values independent of the load from "gp",
> > +in turn destroying the ordering between this load and the loads of the
> > +return values.  This can result in "p->b" returning pre-initialization
> > +garbage values.
> > +
> > +In short, rcu_dereference() is -not- optional when you are going to
> > +dereference the resulting pointer.
> >
> 



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-01 14:03                                                                                                                           ` Paul E. McKenney
@ 2014-03-02 10:05                                                                                                                             ` Peter Sewell
  2014-03-02 23:20                                                                                                                               ` Paul E. McKenney
  2014-03-03 20:44                                                                                                                               ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Peter Sewell @ 2014-03-02 10:05 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
>> Hi Paul,
>>
>> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >> > <paulmck@linux.vnet.ibm.com> wrote:
>> >> > >
>> >> > > 3.      The comparison was against another RCU-protected pointer,
>> >> > >         where that other pointer was properly fetched using one
>> >> > >         of the RCU primitives.  Here it doesn't matter which pointer
>> >> > >         you use.  At least as long as the rcu_assign_pointer() for
>> >> > >         that other pointer happened after the last update to the
>> >> > >         pointed-to structure.
>> >> > >
>> >> > > I am a bit nervous about #3.  Any thoughts on it?
>> >> >
>> >> > I think that it might be worth pointing out as an example, and saying
>> >> > that code like
>> >> >
>> >> >    p = atomic_read(consume);
>> >> >    X;
>> >> >    q = atomic_read(consume);
>> >> >    Y;
>> >> >    if (p == q)
>> >> >         data = p->val;
>> >> >
>> >> > then the access of "p->val" is constrained to be data-dependent on
>> >> > *either* p or q, but you can't really tell which, since the compiler
>> >> > can decide that the values are interchangeable.
>> >> >
>> >> > I cannot for the life of me come up with a situation where this would
>> >> > matter, though. If "X" contains a fence, then that fence will be a
>> >> > stronger ordering than anything the consume through "p" would
>> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> >> > atomic reads of p and q are unordered *anyway*, so then whether the
>> >> > ordering to the access through "p" is through p or q is kind of
>> >> > irrelevant. No?
>> >>
>> >> I can make a contrived litmus test for it, but you are right, the only
>> >> time you can see it happen is when X has no barriers, in which case
>> >> you don't have any ordering anyway -- both the compiler and the CPU can
>> >> reorder the loads into p and q, and the read from p->val can, as you say,
>> >> come from either pointer.
>> >>
>> >> For whatever it is worth, here is the litmus test:
>> >>
>> >> T1:   p = kmalloc(...);
>> >>       if (p == NULL)
>> >>               deal_with_it();
>> >>       p->a = 42;  /* Each field in its own cache line. */
>> >>       p->b = 43;
>> >>       p->c = 44;
>> >>       atomic_store_explicit(&gp1, p, memory_order_release);
>> >>       p->b = 143;
>> >>       p->c = 144;
>> >>       atomic_store_explicit(&gp2, p, memory_order_release);
>> >>
>> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>> >>       r1 = p->b;  /* Guaranteed to get 143. */
>> >>       q = atomic_load_explicit(&gp1, memory_order_consume);
>> >>       if (p == q) {
>> >>               /* The compiler decides that q->c is the same as p->c. */
>> >>               r2 = p->c; /* Could get 44 on a weakly ordered system. */
>> >>       }
>> >>
>> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> >> you get.
>> >>
>> >> And publishing a structure via one RCU-protected pointer, updating it,
>> >> then publishing it via another pointer seems to me to be asking for
>> >> trouble anyway.  If you really want to do something like that and still
>> >> see consistency across all the fields in the structure, please put a lock
>> >> in the structure and use it to guard updates and accesses to those fields.
>> >
>> > And here is a patch documenting the restrictions for the current Linux
>> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
>> > differently than atomic_load_explicit(&p, memory_order_consume).
>> >
>> > Thoughts?
>>
>> That might serve as informal documentation for linux kernel
>> programmers about the bounds on the optimisations that you expect
>> compilers to do for common-case RCU code - and I guess that's what you
>> intend it to be for.   But I don't see how one can make it precise
>> enough to serve as a language definition, so that compiler people
>> could confidently say "yes, we respect that", which I guess is what
>> you really need.  As a useful criterion, we should aim for something
>> precise enough that in a verified-compiler context you can
>> mathematically prove that the compiler will satisfy it  (even though
>> that won't happen anytime soon for GCC), and that analysis tool
>> authors can actually know what they're working with.   All this stuff
>> about "you should avoid cancellation", and "avoid masking with just a
>> small number of bits" is just too vague.
>
> Understood, and yes, this is intended to document current compiler
> behavior for the Linux kernel community.  It would not make sense to show
> it to the C11 or C++11 communities, except perhaps as an informational
> piece on current practice.
>
>> The basic problem is that the compiler may be doing sophisticated
>> reasoning with a bunch of non-local knowledge that it's deduced from
>> the code, neither of which are well-understood, and here we have to
>> identify some envelope, expressive enough for RCU idioms, in which
>> that reasoning doesn't allow data/address dependencies to be removed
>> (and hence the hardware guarantee about them will be maintained at the
>> source level).
>>
>> The C11 syntactic notion of dependency, whatever its faults, was at
>> least precise, could be reasoned about locally (just looking at the
>> syntactic code in question), and did do that.  The fact that current
>> compilers do optimisations that remove dependencies and will likely
>> have many bugs at present is beside the point - this was surely
>> intended as a *new* constraint on what they are allowed to do.  The
>> interesting question is really whether the compiler writers think that
>> they *could* implement it in a reasonable way - I'd like to hear
>> Torvald and his colleagues' opinion on that.
>>
>> What you're doing above seems to be basically a very cut-down version
>> of that, but with a fuzzy boundary.   If you want it to be precise,
>> maybe it needs to be much simpler (which might force you into ruling
>> out some current code idioms).
>
> I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806)
> can be developed to serve this purpose.

(I missed that mail when it first came past, sorry)

That's also going to be tricky, I'm afraid.  The key condition there is:

"* at the time of execution of E, L       [PS: I assume that L is a
typo and should be E]
        can possibly have returned at
        least two different values under the assumption that L itself
        could have returned any value allowed by L's type."

First, the evaluation of E might be nondeterministic  - e.g., for an
artificial example, if it's just a nondeterministic value obtained
from the result of a race on SC atomics.  The above doesn't
distinguish between that (which doesn't have a real dependency on L)
and that XOR'd with L (which does).  And it does so in the wrong
direction: it'll say that the former has a dependency on L.
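
As a concrete sketch of those two cases (hypothetical names: "x" is
the consume-loaded atomic, "racy" a raced-on SC atomic):

        l  = atomic_load_explicit(&x, memory_order_consume);    /* L */
        r  = atomic_load_explicit(&racy, memory_order_seq_cst);
        e1 = r;      /* no real dependency on l, yet the criterion
                        above classifies it as depending on L */
        e2 = r ^ l;  /* really does depend on l */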

Second, it involves reasoning about counterfactual executions.  That
doesn't necessarily make it wrong, per se, but probably makes it hard
to work with.  For example, suppose that in all the actual
whole-program executions, a runtime occurrence of L only ever returns
one particular value (perhaps because of some simple #define'd
configuration), and that the code used in the evaluation of E depends
on some invariant which is related to that configuration.  The
hypothetical execution used above in which a different value is used
is one in which the code is being run in a situation with broken
invariants.  Then there will be technical difficulties in using the
definition:
I don't see how one would persuade oneself that a compiler always
satisfies it, because these hypothetical executions are far removed
from what it's actually working on.

(Aside: The notion of a thread "observing" another thread's load,
dating back a long time and adopted in the Power and ARM architecture
texts, relies on counterfactual executions in a broadly similar way;
we're happy to have escaped that now :-)

Peter




>                                                         Thanx, Paul
>
>> best,
>> Peter
>>
>>
>>
>> >                                                         Thanx, Paul
>> >
>> > ------------------------------------------------------------------------
>> >
>> > documentation: Record rcu_dereference() value mishandling
>> >
>> > Recent LKML discussions (see http://lwn.net/Articles/586838/ and
>> > http://lwn.net/Articles/588300/ for the LWN writeups) brought out
>> > some ways of misusing the return value from rcu_dereference() that
>> > are not necessarily completely intuitive.  This commit therefore
>> > documents what can and cannot safely be done with these values.
>> >
>> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> >
>> > diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
>> > index fa57139f50bf..f773a264ae02 100644
>> > --- a/Documentation/RCU/00-INDEX
>> > +++ b/Documentation/RCU/00-INDEX
>> > @@ -12,6 +12,8 @@ lockdep-splat.txt
>> >         - RCU Lockdep splats explained.
>> >  NMI-RCU.txt
>> >         - Using RCU to Protect Dynamic NMI Handlers
>> > +rcu_dereference.txt
>> > +       - Proper care and feeding of return values from rcu_dereference()
>> >  rcubarrier.txt
>> >         - RCU and Unloadable Modules
>> >  rculist_nulls.txt
>> > diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
>> > index 9d10d1db16a5..877947130ebe 100644
>> > --- a/Documentation/RCU/checklist.txt
>> > +++ b/Documentation/RCU/checklist.txt
>> > @@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome!
>> >                         http://www.openvms.compaq.com/wizard/wiz_2637.html
>> >
>> >                 The rcu_dereference() primitive is also an excellent
>> > -               documentation aid, letting the person reading the code
>> > -               know exactly which pointers are protected by RCU.
>> > +               documentation aid, letting the person reading the
>> > +               code know exactly which pointers are protected by RCU.
>> >                 Please note that compilers can also reorder code, and
>> >                 they are becoming increasingly aggressive about doing
>> > -               just that.  The rcu_dereference() primitive therefore
>> > -               also prevents destructive compiler optimizations.
>> > +               just that.  The rcu_dereference() primitive therefore also
>> > +               prevents destructive compiler optimizations.  However,
>> > +               with a bit of devious creativity, it is possible to
>> > +               mishandle the return value from rcu_dereference().
>> > +               Please see rcu_dereference.txt in this directory for
>> > +               more information.
>> >
>> >                 The rcu_dereference() primitive is used by the
>> >                 various "_rcu()" list-traversal primitives, such
>> > diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt
>> > new file mode 100644
>> > index 000000000000..6e72cd8622df
>> > --- /dev/null
>> > +++ b/Documentation/RCU/rcu_dereference.txt
>> > @@ -0,0 +1,365 @@
>> > +PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference()
>> > +
>> > +Most of the time, you can use values from rcu_dereference() or one of
>> > +the similar primitives without worries.  Dereferencing (prefix "*"),
>> > +field selection ("->"), assignment ("="), address-of ("&"), addition and
>> > +subtraction of constants, and casts all work quite naturally and safely.
>> > +
>> > +It is nevertheless possible to get into trouble with other operations.
>> > +Follow these rules to keep your RCU code working properly:
>> > +
>> > +o      You must use one of the rcu_dereference() family of primitives
>> > +       to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU
>> > +       will complain.  Worse yet, your code can see random memory-corruption
>> > +       bugs due to games that compilers and DEC Alpha can play.
>> > +       Without one of the rcu_dereference() primitives, compilers
>> > +       can reload the value, and won't your code have fun with two
>> > +       different values for a single pointer!  Without rcu_dereference(),
>> > +       DEC Alpha can load a pointer, dereference that pointer, and
>> > +       return data preceding initialization that preceded the store of
>> > +       the pointer.
>> > +
>> > +       In addition, the volatile cast in rcu_dereference() prevents the
>> > +       compiler from deducing the resulting pointer value.  Please see
>> > +       the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH"
>> > +       for an example where the compiler can in fact deduce the exact
>> > +       value of the pointer, and thus cause misordering.
>> > +
>> > +o      Do not use single-element RCU-protected arrays.  The compiler
>> > +       is within its rights to assume that the value of an index into
>> > +       such an array must necessarily evaluate to zero.  The compiler
>> > +       could then substitute the constant zero for the computation, so
>> > +       that the array index no longer depended on the value returned
>> > +       by rcu_dereference().  If the array index no longer depends
>> > +       on rcu_dereference(), then both the compiler and the CPU
>> > +       are within their rights to order the array access before the
>> > +       rcu_dereference(), which can cause the array access to return
>> > +       garbage.
>> > +
>> > +o      Avoid cancellation when using the "+" and "-" infix arithmetic
>> > +       operators.  For example, for a given variable "x", avoid
>> > +       "(x-x)".  There are similar arithmetic pitfalls from other
>> > +       arithmetic operators, such as "(x*0)", "(x/(x+1))", or "(x%1)".
>> > +       The compiler is within its rights to substitute zero for all of
>> > +       these expressions, so that subsequent accesses no longer depend
>> > +       on the rcu_dereference(), again possibly resulting in bugs due
>> > +       to misordering.
>> > +
>> > +       Of course, if "p" is a pointer from rcu_dereference(), and "a"
>> > +       and "b" are integers that happen to be equal, the expression
>> > +       "p+a-b" is safe because its value still necessarily depends on
>> > +       the rcu_dereference(), thus maintaining proper ordering.
>> > +
>> > +o      Avoid all-zero operands to the bitwise "&" operator, and
>> > +       similarly avoid all-ones operands to the bitwise "|" operator.
>> > +       If the compiler is able to deduce the value of such operands,
>> > +       it is within its rights to substitute the corresponding constant
>> > +       for the bitwise operation.  Once again, this causes subsequent
>> > +       accesses to no longer depend on the rcu_dereference(), causing
>> > +       bugs due to misordering.
>> > +
>> > +       Please note that single-bit operands to bitwise "&" can also
>> > +       be dangerous.  At this point, the compiler knows that the
>> > +       resulting value can only take on one of two possible values.
>> > +       Therefore, a very small amount of additional information will
>> > +       allow the compiler to deduce the exact value, which again can
>> > +       result in misordering.
>> > +
>> > +o      If you are using RCU to protect JITed functions, so that the
>> > +       "()" function-invocation operator is applied to a value obtained
>> > +       (directly or indirectly) from rcu_dereference(), you may need to
>> > +       interact directly with the hardware to flush instruction caches.
>> > +       This issue arises on some systems when a newly JITed function is
>> > +       using the same memory that was used by an earlier JITed function.
>> > +
>> > +o      Do not use the results from the boolean "&&" and "||" when
>> > +       dereferencing.  For example, the following (rather improbable)
>> > +       code is buggy:
>> > +
>> > +               int a[2];
>> > +               int index;
>> > +               int force_zero_index = 1;
>> > +
>> > +               ...
>> > +
>> > +               r1 = rcu_dereference(i1);
>> > +               r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
>> > +
>> > +       The reason this is buggy is that "&&" and "||" are often compiled
>> > +       using branches.  While weak-memory machines such as ARM or PowerPC
>> > +       do order stores after such branches, they can speculate loads,
>> > +       which can result in misordering bugs.
>> > +
>> > +o      Do not use the results from relational operators ("==", "!=",
>> > +       ">", ">=", "<", or "<=") when dereferencing.  For example,
>> > +       the following (quite strange) code is buggy:
>> > +
>> > +               int a[2];
>> > +               int index;
>> > +               int flip_index = 0;
>> > +
>> > +               ...
>> > +
>> > +               r1 = rcu_dereference(i1);
>> > +               r2 = a[r1 != flip_index];  /* BUGGY!!! */
>> > +
>> > +       As before, the reason this is buggy is that relational operators
>> > +       are often compiled using branches.  And as before, although
>> > +       weak-memory machines such as ARM or PowerPC do order stores
>> > +       after such branches, they can speculate loads, which can again
>> > +       result in misordering bugs.
>> > +
>> > +o      Be very careful about comparing pointers obtained from
>> > +       rcu_dereference() against non-NULL values.  As Linus Torvalds
>> > +       explained, if the two pointers are equal, the compiler could
>> > +       substitute the pointer you are comparing against for the pointer
>> > +       obtained from rcu_dereference().  For example:
>> > +
>> > +               p = rcu_dereference(gp);
>> > +               if (p == &default_struct)
>> > +                       do_default(p->a);
>> > +
>> > +       Because the compiler now knows that the value of "p" is exactly
>> > +       the address of the variable "default_struct", it is free to
>> > +       transform this code into the following:
>> > +
>> > +               p = rcu_dereference(gp);
>> > +               if (p == &default_struct)
>> > +                       do_default(default_struct.a);
>> > +
>> > +       On ARM and Power hardware, the load from "default_struct.a"
>> > +       can now be speculated, such that it might happen before the
>> > +       rcu_dereference().  This could result in bugs due to misordering.
>> > +
>> > +       However, comparisons are OK in the following cases:
>> > +
>> > +       o       The comparison was against the NULL pointer.  If the
>> > +               compiler knows that the pointer is NULL, you had better
>> > +               not be dereferencing it anyway.  If the comparison is
>> > +               non-equal, the compiler is none the wiser.  Therefore,
>> > +               it is safe to compare pointers from rcu_dereference()
>> > +               against NULL pointers.
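>> > +
>> > +               For example, given a (hypothetical) do_something(),
>> > +               the following is perfectly safe:
>> > +
>> > +                       p = rcu_dereference(gp);
>> > +                       if (p == NULL)
>> > +                               return;
>> > +                       do_something(p->a);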
>> > +
>> > +       o       The pointer is never dereferenced after being compared.
>> > +               Since there are no subsequent dereferences, the compiler
>> > +               cannot use anything it learned from the comparison
>> > +               to reorder the non-existent subsequent dereferences.
>> > +               This sort of comparison occurs frequently when scanning
>> > +               RCU-protected circular linked lists.
>> > +
>> > +       o       The comparison is against a pointer that
>> > +               references memory that was initialized "a long time ago."
>> > +               The reason this is safe is that even if misordering
>> > +               occurs, the misordering will not affect the accesses
>> > +               that follow the comparison.  So exactly how long ago is
>> > +               "a long time ago"?  Here are some possibilities:
>> > +
>> > +               o       Compile time.
>> > +
>> > +               o       Boot time.
>> > +
>> > +               o       Module-init time for module code.
>> > +
>> > +               o       Prior to kthread creation for kthread code.
>> > +
>> > +               o       During some prior acquisition of the lock that
>> > +                       we now hold.
>> > +
>> > +               o       Before mod_timer() time for a timer handler.
>> > +
>> > +               There are many other possibilities involving the Linux
>> > +               kernel's wide array of primitives that cause code to
>> > +               be invoked at a later time.
>> > +
>> > +       o       The pointer being compared against also came from
>> > +               rcu_dereference().  In this case, both pointers depend
>> > +               on one rcu_dereference() or another, so you get proper
>> > +               ordering either way.
>> > +
>> > +               That said, this situation can make certain RCU usage
>> > +               bugs more likely to happen.  Which can be a good thing,
>> > +               at least if they happen during testing.  An example
>> > +               of such an RCU usage bug is shown in the section titled
>> > +               "EXAMPLE OF AMPLIFIED RCU-USAGE BUG".
>> > +
>> > +       o       All of the accesses following the comparison are stores,
>> > +               so that a control dependency preserves the needed ordering.
>> > +               That said, it is easy to get control dependencies wrong.
>> > +               Please see the "CONTROL DEPENDENCIES" section of
>> > +               Documentation/memory-barriers.txt for more details.
>> > +
>> > +       o       The pointers compared not-equal -and- the compiler does
>> > +               not have enough information to deduce the value of the
>> > +               pointer.  Note that the volatile cast in rcu_dereference()
>> > +               will normally prevent the compiler from knowing too much.
>> > +
>> > +o      Disable any value-speculation optimizations that your compiler
>> > +       might provide, especially if you are making use of feedback-based
>> > +       optimizations that take data collected from prior runs.  Such
>> > +       value-speculation optimizations reorder operations by design.
>> > +
>> > +       There is one exception to this rule:  Value-speculation
>> > +       optimizations that leverage the branch-prediction hardware are
>> > +       safe on strongly ordered systems (such as x86), but not on weakly
>> > +       ordered systems (such as ARM or Power).  Choose your compiler
>> > +       command-line options wisely!
>> > +
>> > +
>> > +EXAMPLE OF AMPLIFIED RCU-USAGE BUG
>> > +
>> > +Because updaters can run concurrently with RCU readers, RCU readers can
>> > +see stale and/or inconsistent values.  If RCU readers need fresh or
>> > +consistent values, which they sometimes do, they need to take proper
>> > +precautions.  To see this, consider the following code fragment:
>> > +
>> > +       struct foo {
>> > +               int a;
>> > +               int b;
>> > +               int c;
>> > +       };
>> > +       struct foo *gp1;
>> > +       struct foo *gp2;
>> > +
>> > +       void updater(void)
>> > +       {
>> > +               struct foo *p;
>> > +
>> > +               p = kmalloc(...);
>> > +               if (p == NULL)
>> > +                       deal_with_it();
>> > +               p->a = 42;  /* Each field in its own cache line. */
>> > +               p->b = 43;
>> > +               p->c = 44;
>> > +               rcu_assign_pointer(gp1, p);
>> > +               p->b = 143;
>> > +               p->c = 144;
>> > +               rcu_assign_pointer(gp2, p);
>> > +       }
>> > +
>> > +       void reader(void)
>> > +       {
>> > +               struct foo *p;
>> > +               struct foo *q;
>> > +               int r1, r2;
>> > +
>> > +               p = rcu_dereference(gp2);
>> > +               r1 = p->b;  /* Guaranteed to get 143. */
>> > +               q = rcu_dereference(gp1);
>> > +               if (p == q) {
>> > +                       /* The compiler decides that q->c is the same as p->c. */
>> > +                       r2 = p->c; /* Could get 44 on a weakly ordered system. */
>> > +               }
>> > +       }
>> > +
>> > +You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible,
>> > +but you should not be.  After all, the updater might have been invoked
>> > +a second time between the time reader() loaded into "r1" and the time
>> > +that it loaded into "r2".  The fact that this same result can occur due
>> > +to some reordering from the compiler and CPUs is beside the point.
>> > +
>> > +But suppose that the reader needs a consistent view?
>> > +
>> > +Then one approach is to use locking, for example, as follows:
>> > +
>> > +       struct foo {
>> > +               int a;
>> > +               int b;
>> > +               int c;
>> > +               spinlock_t lock;
>> > +       };
>> > +       struct foo *gp1;
>> > +       struct foo *gp2;
>> > +
>> > +       void updater(void)
>> > +       {
>> > +               struct foo *p;
>> > +
>> > +               p = kmalloc(...);
>> > +               if (p == NULL)
>> > +                       deal_with_it();
>> > +               spin_lock(&p->lock);
>> > +               p->a = 42;  /* Each field in its own cache line. */
>> > +               p->b = 43;
>> > +               p->c = 44;
>> > +               spin_unlock(&p->lock);
>> > +               rcu_assign_pointer(gp1, p);
>> > +               spin_lock(&p->lock);
>> > +               p->b = 143;
>> > +               p->c = 144;
>> > +               spin_unlock(&p->lock);
>> > +               rcu_assign_pointer(gp2, p);
>> > +       }
>> > +
>> > +       void reader(void)
>> > +       {
>> > +               struct foo *p;
>> > +               struct foo *q;
>> > +               int r1, r2;
>> > +
>> > +               p = rcu_dereference(gp2);
>> > +               spin_lock(&p->lock);
>> > +               r1 = p->b;  /* Guaranteed to get 143. */
>> > +               q = rcu_dereference(gp1);
>> > +               if (p == q) {
>> > +                       /* The compiler decides that q->c is the same as p->c. */
>> > +                       r2 = p->c; /* Could get 44 on a weakly ordered system. */
>> > +               }
>> > +               spin_unlock(&p->lock);
>> > +       }
>> > +
>> > +As always, use the right tool for the job!
>> > +
>> > +
>> > +EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH
>> > +
>> > +If a pointer obtained from rcu_dereference() compares not-equal to some
>> > +other pointer, the compiler normally has no clue what the value of the
>> > +first pointer might be.  This lack of knowledge prevents the compiler
>> > +from carrying out optimizations that otherwise might destroy the ordering
>> > +guarantees that RCU depends on.  And the volatile cast in rcu_dereference()
>> > +should prevent the compiler from guessing the value.
>> > +
>> > +But without rcu_dereference(), the compiler knows more than you might
>> > +expect.  Consider the following code fragment:
>> > +
>> > +       struct foo {
>> > +               int a;
>> > +               int b;
>> > +       };
>> > +       static struct foo variable1;
>> > +       static struct foo variable2;
>> > +       static struct foo *gp = &variable1;
>> > +
>> > +       void updater(void)
>> > +       {
>> > +               initialize_foo(&variable2);
>> > +               rcu_assign_pointer(gp, &variable2);
>> > +               /*
>> > +                * The above is the only store to gp in this translation unit,
>> > +                * and the address of gp is not exported in any way.
>> > +                */
>> > +       }
>> > +
>> > +       int reader(void)
>> > +       {
>> > +               struct foo *p;
>> > +
>> > +               p = gp;
>> > +               barrier();
>> > +               if (p == &variable1)
>> > +                       return p->a; /* Must be variable1.a. */
>> > +               else
>> > +                       return p->b; /* Must be variable2.b. */
>> > +       }
>> > +
>> > +Because the compiler can see all stores to "gp", it knows that the only
>> > +possible values of "gp" are "variable1" on the one hand and "variable2"
>> > +on the other.  The comparison in reader() therefore tells the compiler
>> > +the exact value of "p" even in the not-equals case.  This allows the
>> > +compiler to make the return values independent of the load from "gp",
>> > +in turn destroying the ordering between this load and the loads of the
>> > +return values.  This can result in "p->b" returning pre-initialization
>> > +garbage values.
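>> > +
>> > +In contrast, a reader() that uses rcu_dereference() might look as
>> > +follows.  Here the volatile cast should keep the compiler from deducing
>> > +the exact value of "p", so that the loads of the return values remain
>> > +ordered after the load of the RCU-protected pointer:
>> > +
>> > +       int reader(void)
>> > +       {
>> > +               struct foo *p;
>> > +
>> > +               p = rcu_dereference(gp);
>> > +               if (p == &variable1)
>> > +                       return p->a;
>> > +               else
>> > +                       return p->b;
>> > +       }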
>> > +
>> > +In short, rcu_dereference() is -not- optional when you are going to
>> > +dereference the resulting pointer.
>> >
>>
>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-02 10:05                                                                                                                             ` Peter Sewell
@ 2014-03-02 23:20                                                                                                                               ` Paul E. McKenney
  2014-03-02 23:44                                                                                                                                 ` Peter Sewell
  2014-03-03 20:44                                                                                                                               ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-03-02 23:20 UTC (permalink / raw)
  To: Peter Sewell
  Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
> On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> Hi Paul,
> >>
> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> > <paulmck@linux.vnet.ibm.com> wrote:
> >> >> > >
> >> >> > > 3.      The comparison was against another RCU-protected pointer,
> >> >> > >         where that other pointer was properly fetched using one
> >> >> > >         of the RCU primitives.  Here it doesn't matter which pointer
> >> >> > >         you use.  At least as long as the rcu_assign_pointer() for
> >> >> > >         that other pointer happened after the last update to the
> >> >> > >         pointed-to structure.
> >> >> > >
> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >
> >> >> > I think that it might be worth pointing out as an example, and saying
> >> >> > that code like
> >> >> >
> >> >> >    p = atomic_read(consume);
> >> >> >    X;
> >> >> >    q = atomic_read(consume);
> >> >> >    Y;
> >> >> >    if (p == q)
> >> >> >         data = p->val;
> >> >> >
> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> > *either* p or q, but you can't really tell which, since the compiler
> >> >> > can decide that the values are interchangeable.
> >> >> >
> >> >> > I cannot for the life of me come up with a situation where this would
> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> > stronger ordering than anything the consume through "p" would
> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> > irrelevant. No?
> >> >>
> >> >> I can make a contrived litmus test for it, but you are right, the only
> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> >> reorder the loads into p and q, and the read from p->val can, as you say,
> >> >> come from either pointer.
> >> >>
> >> >> For whatever it is worth, here is the litmus test:
> >> >>
> >> >> T1:   p = kmalloc(...);
> >> >>       if (p == NULL)
> >> >>               deal_with_it();
> >> >>       p->a = 42;  /* Each field in its own cache line. */
> >> >>       p->b = 43;
> >> >>       p->c = 44;
> >> >>       atomic_store_explicit(&gp1, p, memory_order_release);
> >> >>       p->b = 143;
> >> >>       p->c = 144;
> >> >>       atomic_store_explicit(&gp2, p, memory_order_release);
> >> >>
> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >>       r1 = p->b;  /* Guaranteed to get 143. */
> >> >>       q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >>       if (p == q) {
> >> >>               /* The compiler decides that q->c is the same as p->c. */
> >> >>               r2 = p->c; /* Could get 44 on a weakly ordered system. */
> >> >>       }
> >> >>
> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> >> you get.
> >> >>
> >> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> trouble anyway.  If you really want to do something like that and still
> >> >> see consistency across all the fields in the structure, please put a lock
> >> >> in the structure and use it to guard updates and accesses to those fields.
> >> >
> >> > And here is a patch documenting the restrictions for the current Linux
> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >
> >> > Thoughts?
> >>
> >> That might serve as informal documentation for linux kernel
> >> programmers about the bounds on the optimisations that you expect
> >> compilers to do for common-case RCU code - and I guess that's what you
> >> intend it to be for.   But I don't see how one can make it precise
> >> enough to serve as a language definition, so that compiler people
> >> could confidently say "yes, we respect that", which I guess is what
> >> you really need.  As a useful criterion, we should aim for something
> >> precise enough that in a verified-compiler context you can
> >> mathematically prove that the compiler will satisfy it  (even though
> >> that won't happen anytime soon for GCC), and that analysis tool
> >> authors can actually know what they're working with.   All this stuff
> >> about "you should avoid cancellation", and "avoid masking with just a
> >> small number of bits" is just too vague.
> >
> > Understood, and yes, this is intended to document current compiler
> > behavior for the Linux kernel community.  It would not make sense to show
> > it to the C11 or C++11 communities, except perhaps as an informational
> > piece on current practice.
> >
> >> The basic problem is that the compiler may be doing sophisticated
> >> reasoning with a bunch of non-local knowledge that it's deduced from
> >> the code, neither of which are well-understood, and here we have to
> >> identify some envelope, expressive enough for RCU idioms, in which
> >> that reasoning doesn't allow data/address dependencies to be removed
> >> (and hence the hardware guarantee about them will be maintained at the
> >> source level).
> >>
> >> The C11 syntactic notion of dependency, whatever its faults, was at
> >> least precise, could be reasoned about locally (just looking at the
> >> syntactic code in question), and did do that.  The fact that current
> >> compilers do optimisations that remove dependencies and will likely
> >> have many bugs at present is besides the point - this was surely
> >> intended as a *new* constraint on what they are allowed to do.  The
> >> interesting question is really whether the compiler writers think that
> >> they *could* implement it in a reasonable way - I'd like to hear
> >> Torvald and his colleagues' opinion on that.
> >>
> >> What you're doing above seems to be basically a very cut-down version
> >> of that, but with a fuzzy boundary.   If you want it to be precise,
> >> maybe it needs to be much simpler (which might force you into ruling
> >> out some current code idioms).
> >
> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806)
> > can be developed to serve this purpose.
> 
> (I missed that mail when it first came past, sorry)

No worries!

> That's also going to be tricky, I'm afraid.  The key condition there is:
> 
> "* at the time of execution of E, L       [PS: I assume that L is a
> typo and should be E]

I believe it really is "L".  As I understand it (and Torvald will correct
me if I am wrong), the idea is that the implementation is prohibited
from guessing the value of "L" -- it must assume that any value from
L's type might be returned, regardless of what it might otherwise know.

However, after L's value is loaded, the implementation -is- permitted
to learn constraints on this value based on "if" statements and the
like between the load from "L" and the execution of "E".
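
For example, in a fragment like the following (hypothetical, with
"foo1", "foo2", and do_it() made up for illustration), the compiler
may exploit the "if", but may not simply invent a value for "p":

	p = atomic_load_explicit(&gp, memory_order_consume);  /* L */
	if (p == &foo1 || p == &foo2)
		do_it(p);  /* E: may now assume &foo1 or &foo2. */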

Does that help?

>         can possibly have returned at
>         least two different values under the assumption that L itself
>         could have returned any value allowed by L's type."
> 
> First, the evaluation of E might be nondeterministic  - e.g., for an
> artificial example, if it's just a nondeterministic value obtained
> from the result of a race on SC atomics.  The above doesn't
> distinguish between that (which doesn't have a real dependency on L)
> and that XOR'd with L (which does).  And it does so in the wrong
> direction: it'll say there the former has a dependency on L.

Right, it is only any dependency that E has on L that would be
constrained.  If E also depends on other quantities obtained some
other way than a memory_order_consume load into a value_dep_preserving
variable, then as I understand it, the compiler is within its rights
to optimize these other quantities to within an inch of their lives.

It is quite possible that E depends on L only sometimes.  For example:

	p = atomic_load_explicit(&gp, memory_order_consume);
	p = random() & 0x8 ? p : &default_structure;
	E(p);

My guess is that in this case, the ordering would be guaranteed only
for those executions where there is a value dependency.  In my naive
view, this should be no different than something like this:

	if (random() & 0x10)
		p = atomic_load_explicit(&gp, memory_order_acquire);
	else
		p = &default_structure;
	E(p);

Or am I missing your point?

> Second, it involves reasoning about counterfactual executions.  That
> doesn't necessarily make it wrong, per se, but probably makes it hard
> to work with.  For example, suppose that in all the actual
> whole-program executions, a runtime occurrence of L only ever returns
> one particular value (perhaps because of some simple #define'd
> configuration), and that the code used in the evaluation of E depends
> on some invariant which is related to that configuration.  The
> hypothetical execution used above in which a different value is used
> is one in which the code is being run in a situation with broken invariants.
>    Then there will be technical difficulties in using the definition:
> I don't see how one would persuade oneself that a compiler always
> satisfies it, because these hypothetical executions are far removed
> from what it's actually working on.

The developer answer would be something like "all it really means is that
the implementation is required to actually emit the memory_order_consume
load and actually use the value," which is probably not much comfort
to someone trying to model it.  Maybe there is a better way of wording
this constraint so as to avoid the counterfactuals?
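
To make the difficulty concrete, consider a (contrived) fragment in
which every actual execution loads the same pointer:

	/* Suppose the only store to gp anywhere is of &the_foo. */
	p = atomic_load_explicit(&gp, memory_order_consume);  /* L */
	r = p->a;                                             /* E */

The hypothetical executions in which L returns something other than
&the_foo never correspond to anything the compiler actually sees,
which I believe is your concern.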

> (Aside: The notion of a thread "observing" another thread's load,
> dating back a long time and adopted in the Power and ARM architecture
> texts, relies on counterfactual executions in a broadly similar way;
> we're happy to have escaped that now :-)

Here is hoping that there is a way to escape it in this case as well.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-02 23:20                                                                                                                               ` Paul E. McKenney
@ 2014-03-02 23:44                                                                                                                                 ` Peter Sewell
  2014-03-03  4:25                                                                                                                                   ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Peter Sewell @ 2014-03-02 23:44 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On 2 March 2014 23:20, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
>> On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
>> >> Hi Paul,
>> >>
>> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >> >> > <paulmck@linux.vnet.ibm.com> wrote:
>> >> >> > >
>> >> >> > > 3.      The comparison was against another RCU-protected pointer,
>> >> >> > >         where that other pointer was properly fetched using one
>> >> >> > >         of the RCU primitives.  Here it doesn't matter which pointer
>> >> >> > >         you use.  At least as long as the rcu_assign_pointer() for
>> >> >> > >         that other pointer happened after the last update to the
>> >> >> > >         pointed-to structure.
>> >> >> > >
>> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
>> >> >> >
>> >> >> > I think that it might be worth pointing out as an example, and saying
>> >> >> > that code like
>> >> >> >
>> >> >> >    p = atomic_read(consume);
>> >> >> >    X;
>> >> >> >    q = atomic_read(consume);
>> >> >> >    Y;
>> >> >> >    if (p == q)
>> >> >> >         data = p->val;
>> >> >> >
>> >> >> > then the access of "p->val" is constrained to be data-dependent on
>> >> >> > *either* p or q, but you can't really tell which, since the compiler
>> >> >> > can decide that the values are interchangeable.
>> >> >> >
>> >> >> > I cannot for the life of me come up with a situation where this would
>> >> >> > matter, though. If "X" contains a fence, then that fence will be a
>> >> >> > stronger ordering than anything the consume through "p" would
>> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
>> >> >> > ordering to the access through "p" is through p or q is kind of
>> >> >> > irrelevant. No?
>> >> >>
>> >> >> I can make a contrived litmus test for it, but you are right, the only
>> >> >> time you can see it happen is when X has no barriers, in which case
>> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
>> >> >> reorder the loads into p and q, and the read from p->val can, as you say,
>> >> >> come from either pointer.
>> >> >>
>> >> >> For whatever it is worth, here is the litmus test:
>> >> >>
>> >> >> T1:   p = kmalloc(...);
>> >> >>       if (p == NULL)
>> >> >>               deal_with_it();
>> >> >>       p->a = 42;  /* Each field in its own cache line. */
>> >> >>       p->b = 43;
>> >> >>       p->c = 44;
>> >> >>       atomic_store_explicit(&gp1, p, memory_order_release);
>> >> >>       p->b = 143;
>> >> >>       p->c = 144;
>> >> >>       atomic_store_explicit(&gp2, p, memory_order_release);
>> >> >>
>> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>> >> >>       r1 = p->b;  /* Guaranteed to get 143. */
>> >> >>       q = atomic_load_explicit(&gp1, memory_order_consume);
>> >> >>       if (p == q) {
>> >> >>               /* The compiler decides that q->c is the same as p->c. */
>> >> >>               r2 = p->c; /* Could get 44 on a weakly ordered system. */
>> >> >>       }
>> >> >>
>> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> >> >> you get.
>> >> >>
>> >> >> And publishing a structure via one RCU-protected pointer, updating it,
>> >> >> then publishing it via another pointer seems to me to be asking for
>> >> >> trouble anyway.  If you really want to do something like that and still
>> >> >> see consistency across all the fields in the structure, please put a lock
>> >> >> in the structure and use it to guard updates and accesses to those fields.
>> >> >
>> >> > And here is a patch documenting the restrictions for the current Linux
>> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
>> >> > differently than atomic_load_explicit(&p, memory_order_consume).
>> >> >
>> >> > Thoughts?
>> >>
>> >> That might serve as informal documentation for linux kernel
>> >> programmers about the bounds on the optimisations that you expect
>> >> compilers to do for common-case RCU code - and I guess that's what you
>> >> intend it to be for.   But I don't see how one can make it precise
>> >> enough to serve as a language definition, so that compiler people
>> >> could confidently say "yes, we respect that", which I guess is what
>> >> you really need.  As a useful criterion, we should aim for something
>> >> precise enough that in a verified-compiler context you can
>> >> mathematically prove that the compiler will satisfy it  (even though
>> >> that won't happen anytime soon for GCC), and that analysis tool
>> >> authors can actually know what they're working with.   All this stuff
>> >> about "you should avoid cancellation", and "avoid masking with just a
>> >> small number of bits" is just too vague.
>> >
>> > Understood, and yes, this is intended to document current compiler
>> > behavior for the Linux kernel community.  It would not make sense to show
>> > it to the C11 or C++11 communities, except perhaps as an informational
>> > piece on current practice.
>> >
>> >> The basic problem is that the compiler may be doing sophisticated
>> >> reasoning with a bunch of non-local knowledge that it's deduced from
>> >> the code, neither of which are well-understood, and here we have to
>> >> identify some envelope, expressive enough for RCU idioms, in which
>> >> that reasoning doesn't allow data/address dependencies to be removed
>> >> (and hence the hardware guarantee about them will be maintained at the
>> >> source level).
>> >>
>> >> The C11 syntactic notion of dependency, whatever its faults, was at
>> >> least precise, could be reasoned about locally (just looking at the
>> >> syntactic code in question), and did do that.  The fact that current
>> >> compilers do optimisations that remove dependencies and will likely
>> >> have many bugs at present is besides the point - this was surely
>> >> intended as a *new* constraint on what they are allowed to do.  The
>> >> interesting question is really whether the compiler writers think that
>> >> they *could* implement it in a reasonable way - I'd like to hear
>> >> Torvald and his colleagues' opinion on that.
>> >>
>> >> What you're doing above seems to be basically a very cut-down version
>> >> of that, but with a fuzzy boundary.   If you want it to be precise,
>> >> maybe it needs to be much simpler (which might force you into ruling
>> >> out some current code idioms).
>> >
>> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806)
>> > can be developed to serve this purpose.
>>
>> (I missed that mail when it first came past, sorry)
>
> No worries!
>
>> That's also going to be tricky, I'm afraid.  The key condition there is:
>>
>> "* at the time of execution of E, L       [PS: I assume that L is a
>> typo and should be E]
>
> I believe it really is "L".  As I understand it (and Torvald will correct
> me if I am wrong), the idea is that the implementation is prohibited
> from guessing the value of "L" -- it must assume that any value from
> L's type might be returned, regardless of what it might otherwise know.
>
> However, after L's value is loaded, the implementation -is- permitted
> to learn constraints on this value based on "if" statements and the
> like between the load from "L" and the execution of "E".
>
> Does that help?

Not sure (i.e., not really :-).  I thought Torvald wanted to say that
"E  really-depends on L if there exist two different values that (just
according to typing) might be read for L that give rise to two
different values for E".

>>         can possibly have returned at
>>         least two different values under the assumption that L itself
>>         could have returned any value allowed by L's type."
>>
>> First, the evaluation of E might be nondeterministic  - e.g., for an
>> artificial example, if it's just a nondeterministic value obtained
>> from the result of a race on SC atomics.  The above doesn't
>> distinguish between that (which doesn't have a real dependency on L)
>> and that XOR'd with L (which does).  And it does so in the wrong
>> direction: it'll say there the former has a dependency on L.
>
> Right, it is only any dependency that E has on L that would be
> constrained.  If E also depends on other quantities obtained some
> other way than a memory_order_consume load into a value_dep_preserving
> variable, then as I understand it, the compiler is within its rights
> to optimize these other quantities to within an inch of their lives.
>
> It is quite possible that E depends on L only sometimes.  For example:
>
>         p = atomic_load_explicit(&gp, memory_order_consume);
>         p = random() & 0x8 ? p : &default_structure;
>         E(p);
>
> My guess is that in this case, the ordering would be guaranteed only
> for those executions where there is a value dependency.  In my naive
> view, this should be no different than something like this:
>
>         if (random() & 0x10)
>                 p = atomic_load_explicit(&gp, memory_order_acquire);
>         else
>                 p = &default_structure;
>         E(p);

all this is fine, but...

> Or am I missing your point?

...if the idea was to identify "real dependencies" as cases where two
values of E are possible based on different values of L, then if two
values of E are possible *just anyway* (e.g. because of
nondeterminism), the definition gets confused.
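
To make that concrete, consider a sketch along these lines (contrived;
"l" holds the value returned by L, and "racy" is accessed via a race
on SC atomics):

	r = atomic_load_explicit(&racy, memory_order_seq_cst);
	e1 = r;      /* two values possible regardless of l */
	e2 = r ^ l;  /* really does depend on l */

The condition will then say that e1 depends on L, even though it does
not.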


>> Second, it involves reasoning about counterfactual executions.  That
>> doesn't necessarily make it wrong, per se, but probably makes it hard
>> to work with.  For example, suppose that in all the actual
>> whole-program executions, a runtime occurrence of L only ever returns
>> one particular value (perhaps because of some simple #define'd
>> configuration), and that the code used in the evaluation of E depends
>> on some invariant which is related to that configuration.  The
>> hypothetical execution used above in which a different value is used
>> is one in which the code is being run in a situation with broken invariants.
>>    Then there will be technical difficulties in using the definition:
>> I don't see how one would persuade oneself that a compiler always
>> satisfies it, because these hypothetical executions are far removed
>> from what it's actually working on.
>
> The developer answer would be something like "all it really means is that
> the implementation is required to actually emit the memory_order_consume
> load and actually use the value," which is probably not much comfort
> to someone trying to model it.  Maybe there is a better way of wording
> this constraint so as to avoid the counterfactuals?

maybe.  I don't have one right now, though.

>> (Aside: The notion of a thread "observing" another thread's load,
>> dating back a long time and adopted in the Power and ARM architecture
>> texts, relies on counterfactual executions in a broadly similar way;
>> we're happy to have escaped that now :-)
>
> Here is hoping that there is a way to escape it in this case as well.  ;-)


ta,
Peter


>                                                         Thanx, Paul
>
>> Peter
>>
>>
>>
>>
>> >                                                         Thanx, Paul
>> >
>> >> best,
>> >> Peter
>> >>
>> >>
>> >>
>> >> >                                                         Thanx, Paul
>> >> >
>> >> > ------------------------------------------------------------------------
>> >> >
>> >> > documentation: Record rcu_dereference() value mishandling
>> >> >
>> >> > Recent LKML discussions (see http://lwn.net/Articles/586838/ and
>> >> > http://lwn.net/Articles/588300/ for the LWN writeups) brought out
>> >> > some ways of misusing the return value from rcu_dereference() that
>> >> > are not necessarily completely intuitive.  This commit therefore
>> >> > documents what can and cannot safely be done with these values.
>> >> >
>> >> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> >> >
>> >> > diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
>> >> > index fa57139f50bf..f773a264ae02 100644
>> >> > --- a/Documentation/RCU/00-INDEX
>> >> > +++ b/Documentation/RCU/00-INDEX
>> >> > @@ -12,6 +12,8 @@ lockdep-splat.txt
>> >> >         - RCU Lockdep splats explained.
>> >> >  NMI-RCU.txt
>> >> >         - Using RCU to Protect Dynamic NMI Handlers
>> >> > +rcu_dereference.txt
>> >> > +       - Proper care and feeding of return values from rcu_dereference()
>> >> >  rcubarrier.txt
>> >> >         - RCU and Unloadable Modules
>> >> >  rculist_nulls.txt
>> >> > diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
>> >> > index 9d10d1db16a5..877947130ebe 100644
>> >> > --- a/Documentation/RCU/checklist.txt
>> >> > +++ b/Documentation/RCU/checklist.txt
>> >> > @@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome!
>> >> >                         http://www.openvms.compaq.com/wizard/wiz_2637.html
>> >> >
>> >> >                 The rcu_dereference() primitive is also an excellent
>> >> > -               documentation aid, letting the person reading the code
>> >> > -               know exactly which pointers are protected by RCU.
>> >> > +               documentation aid, letting the person reading the
>> >> > +               code know exactly which pointers are protected by RCU.
>> >> >                 Please note that compilers can also reorder code, and
>> >> >                 they are becoming increasingly aggressive about doing
>> >> > -               just that.  The rcu_dereference() primitive therefore
>> >> > -               also prevents destructive compiler optimizations.
>> >> > +               just that.  The rcu_dereference() primitive therefore also
>> >> > +               prevents destructive compiler optimizations.  However,
>> >> > +               with a bit of devious creativity, it is possible to
>> >> > +               mishandle the return value from rcu_dereference().
>> >> > +               Please see rcu_dereference.txt in this directory for
>> >> > +               more information.
>> >> >
>> >> >                 The rcu_dereference() primitive is used by the
>> >> >                 various "_rcu()" list-traversal primitives, such
>> >> > diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt
>> >> > new file mode 100644
>> >> > index 000000000000..6e72cd8622df
>> >> > --- /dev/null
>> >> > +++ b/Documentation/RCU/rcu_dereference.txt
>> >> > @@ -0,0 +1,365 @@
>> >> > +PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference()
>> >> > +
>> >> > +Most of the time, you can use values from rcu_dereference() or one of
>> >> > +the similar primitives without worries.  Dereferencing (prefix "*"),
>> >> > +field selection ("->"), assignment ("="), address-of ("&"), addition and
>> >> > +subtraction of constants, and casts all work quite naturally and safely.
>> >> > +
>> >> > +It is nevertheless possible to get into trouble with other operations.
>> >> > +Follow these rules to keep your RCU code working properly:
>> >> > +
>> >> > +o      You must use one of the rcu_dereference() family of primitives
>> >> > +       to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU
>> >> > +       will complain.  Worse yet, your code can see random memory-corruption
>> >> > +       bugs due to games that compilers and DEC Alpha can play.
>> >> > +       Without one of the rcu_dereference() primitives, compilers
>> >> > +       can reload the value, and won't your code have fun with two
>> >> > +       different values for a single pointer!  Without rcu_dereference(),
>> >> > +       DEC Alpha can load a pointer, dereference that pointer, and
>> >> > +       return data preceding initialization that preceded the store of
>> >> > +       the pointer.
>> >> > +
>> >> > +       In addition, the volatile cast in rcu_dereference() prevents the
>> >> > +       compiler from deducing the resulting pointer value.  Please see
>> >> > +       the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH"
>> >> > +       for an example where the compiler can in fact deduce the exact
>> >> > +       value of the pointer, and thus cause misordering.
>> >> > +
>> >> > +o      Do not use single-element RCU-protected arrays.  The compiler
>> >> > +       is within its right to assume that the value of an index into
>> >> > +       such an array must necessarily evaluate to zero.  The compiler
>> >> > +       could then substitute the constant zero for the computation, so
>> >> > +       that the array index no longer depended on the value returned
>> >> > +       by rcu_dereference().  If the array index no longer depends
>> >> > +       on rcu_dereference(), then both the compiler and the CPU
>> >> > +       are within their rights to order the array access before the
>> >> > +       rcu_dereference(), which can cause the array access to return
>> >> > +       garbage.
>> >> > +
>> >> > +o      Avoid cancellation when using the "+" and "-" infix arithmetic
>> >> > +       operators.  For example, for a given variable "x", avoid
>> >> > +       "(x-x)".  There are similar arithmetic pitfalls from other
>> >> > +       arithmetic operators, such as "(x*0)", "(x/(x+1))" or "(x%1)".
>> >> > +       The compiler is within its rights to substitute zero for all of
>> >> > +       these expressions, so that subsequent accesses no longer depend
>> >> > +       on the rcu_dereference(), again possibly resulting in bugs due
>> >> > +       to misordering.
>> >> > +
>> >> > +       Of course, if "p" is a pointer from rcu_dereference(), and "a"
>> >> > +       and "b" are integers that happen to be equal, the expression
>> >> > +       "p+a-b" is safe because its value still necessarily depends on
>> >> > +       the rcu_dereference(), thus maintaining proper ordering.
>> >> > +       (An illustrative fragment appears just after this list.)
>> >> > +
>> >> > +o      Avoid all-zero operands to the bitwise "&" operator, and
>> >> > +       similarly avoid all-ones operands to the bitwise "|" operator.
>> >> > +       If the compiler is able to deduce the value of such operands,
>> >> > +       it is within its rights to substitute the corresponding constant
>> >> > +       for the bitwise operation.  Once again, this causes subsequent
>> >> > +       accesses to no longer depend on the rcu_dereference(), causing
>> >> > +       bugs due to misordering.
>> >> > +
>> >> > +       Please note that single-bit operands to bitwise "&" can also
>> >> > +       be dangerous.  At this point, the compiler knows that the
>> >> > +       resulting value can only take on one of two possible values.
>> >> > +       Therefore, a very small amount of additional information will
>> >> > +       allow the compiler to deduce the exact value, which again can
>> >> > +       result in misordering.  (See the illustrative fragment just
>> >> > +       after this list.)
>> >> > +
>> >> > +o      If you are using RCU to protect JITed functions, so that the
>> >> > +       "()" function-invocation operator is applied to a value obtained
>> >> > +       (directly or indirectly) from rcu_dereference(), you may need to
>> >> > +       interact directly with the hardware to flush instruction caches.
>> >> > +       This issue arises on some systems when a newly JITed function is
>> >> > +       using the same memory that was used by an earlier JITed function.
>> >> > +
>> >> > +o      Do not use the results from the boolean "&&" and "||" when
>> >> > +       dereferencing.  For example, the following (rather improbable)
>> >> > +       code is buggy:
>> >> > +
>> >> > +               int a[2];
>> >> > +               int index;
>> >> > +               int force_zero_index = 1;
>> >> > +
>> >> > +               ...
>> >> > +
>> >> > +               r1 = rcu_dereference(i1);
>> >> > +               r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
>> >> > +
>> >> > +       The reason this is buggy is that "&&" and "||" are often compiled
>> >> > +       using branches.  While weak-memory machines such as ARM or PowerPC
>> >> > +       do order stores after such branches, they can speculate loads,
>> >> > +       which can result in misordering bugs.
>> >> > +
>> >> > +o      Do not use the results from relational operators ("==", "!=",
>> >> > +       ">", ">=", "<", or "<=") when dereferencing.  For example,
>> >> > +       the following (quite strange) code is buggy:
>> >> > +
>> >> > +               int a[2];
>> >> > +               int index;
>> >> > +               int flip_index = 0;
>> >> > +
>> >> > +               ...
>> >> > +
>> >> > +               r1 = rcu_dereference(i1);
>> >> > +               r2 = a[r1 != flip_index];  /* BUGGY!!! */
>> >> > +
>> >> > +       As before, the reason this is buggy is that relational operators
>> >> > +       are often compiled using branches.  And as before, although
>> >> > +       weak-memory machines such as ARM or PowerPC do order stores
>> >> > +       after such branches, they can speculate loads, which can again
>> >> > +       result in misordering bugs.
>> >> > +
>> >> > +o      Be very careful about comparing pointers obtained from
>> >> > +       rcu_dereference() against non-NULL values.  As Linus Torvalds
>> >> > +       explained, if the two pointers are equal, the compiler could
>> >> > +       substitute the pointer you are comparing against for the pointer
>> >> > +       obtained from rcu_dereference().  For example:
>> >> > +
>> >> > +               p = rcu_dereference(gp);
>> >> > +               if (p == &default_struct)
>> >> > +                       do_default(p->a);
>> >> > +
>> >> > +       Because the compiler now knows that the value of "p" is exactly
>> >> > +       the address of the variable "default_struct", it is free to
>> >> > +       transform this code into the following:
>> >> > +
>> >> > +               p = rcu_dereference(gp);
>> >> > +               if (p == &default_struct)
>> >> > +                       do_default(default_struct.a);
>> >> > +
>> >> > +       On ARM and Power hardware, the load from "default_struct.a"
>> >> > +       can now be speculated, such that it might happen before the
>> >> > +       rcu_dereference().  This could result in bugs due to misordering.
>> >> > +
>> >> > +       However, comparisons are OK in the following cases:
>> >> > +
>> >> > +       o       The comparison was against the NULL pointer.  If the
>> >> > +               compiler knows that the pointer is NULL, you had better
>> >> > +               not be dereferencing it anyway.  If the comparison is
>> >> > +               non-equal, the compiler is none the wiser.  Therefore,
>> >> > +               it is safe to compare pointers from rcu_dereference()
>> >> > +               against NULL pointers.
>> >> > +
>> >> > +       o       The pointer is never dereferenced after being compared.
>> >> > +               Since there are no subsequent dereferences, the compiler
>> >> > +               cannot use anything it learned from the comparison
>> >> > +               to reorder the non-existent subsequent dereferences.
>> >> > +               This sort of comparison occurs frequently when scanning
>> >> > +               RCU-protected circular linked lists.
>> >> > +
>> >> > +       o       The comparison is against a pointer that
>> >> > +               references memory that was initialized "a long time ago."
>> >> > +               The reason this is safe is that even if misordering
>> >> > +               occurs, the misordering will not affect the accesses
>> >> > +               that follow the comparison.  So exactly how long ago is
>> >> > +               "a long time ago"?  Here are some possibilities:
>> >> > +
>> >> > +               o       Compile time.
>> >> > +
>> >> > +               o       Boot time.
>> >> > +
>> >> > +               o       Module-init time for module code.
>> >> > +
>> >> > +               o       Prior to kthread creation for kthread code.
>> >> > +
>> >> > +               o       During some prior acquisition of the lock that
>> >> > +                       we now hold.
>> >> > +
>> >> > +               o       Before mod_timer() time for a timer handler.
>> >> > +
>> >> > +               There are many other possibilities involving the Linux
>> >> > +               kernel's wide array of primitives that cause code to
>> >> > +               be invoked at a later time.
>> >> > +
>> >> > +       o       The pointer being compared against also came from
>> >> > +               rcu_dereference().  In this case, both pointers depend
>> >> > +               on one rcu_dereference() or another, so you get proper
>> >> > +               ordering either way.
>> >> > +
>> >> > +               That said, this situation can make certain RCU usage
>> >> > +               bugs more likely to happen.  Which can be a good thing,
>> >> > +               at least if they happen during testing.  An example
>> >> > +               of such an RCU usage bug is shown in the section titled
>> >> > +               "EXAMPLE OF AMPLIFIED RCU-USAGE BUG".
>> >> > +
>> >> > +       o       All of the accesses following the comparison are stores,
>> >> > +               so that a control dependency preserves the needed ordering.
>> >> > +               That said, it is easy to get control dependencies wrong.
>> >> > +               Please see the "CONTROL DEPENDENCIES" section of
>> >> > +               Documentation/memory-barriers.txt for more details.
>> >> > +
>> >> > +       o       The pointers compared not-equal -and- the compiler does
>> >> > +               not have enough information to deduce the value of the
>> >> > +               pointer.  Note that the volatile cast in rcu_dereference()
>> >> > +               will normally prevent the compiler from knowing too much.
>> >> > +
>> >> > +o      Disable any value-speculation optimizations that your compiler
>> >> > +       might provide, especially if you are making use of feedback-based
>> >> > +       optimizations that take data collected from prior runs.  Such
>> >> > +       value-speculation optimizations reorder operations by design.
>> >> > +
>> >> > +       There is one exception to this rule:  Value-speculation
>> >> > +       optimizations that leverage the branch-prediction hardware are
>> >> > +       safe on strongly ordered systems (such as x86), but not on weakly
>> >> > +       ordered systems (such as ARM or Power).  Choose your compiler
>> >> > +       command-line options wisely!
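>> >> > +
>> >> > +To make the cancellation and masking hazards above concrete, here
>> >> > +is a purely illustrative fragment ("gp" as elsewhere in this
>> >> > +document, "b" some global array):
>> >> > +
>> >> > +       p = rcu_dereference(gp);
>> >> > +       r1 = b[p->a - p->a];  /* BUGGY: index folds to zero. */
>> >> > +       r2 = b[p->a & 0];     /* BUGGY: index folds to zero. */
>> >> > +
>> >> > +In both cases, the compiler is within its rights to substitute the
>> >> > +constant zero for the index, after which the accesses to "b" no
>> >> > +longer depend on the rcu_dereference() and can therefore be
>> >> > +reordered before it.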
>> >> > +
>> >> > +
>> >> > +EXAMPLE OF AMPLIFIED RCU-USAGE BUG
>> >> > +
>> >> > +Because updaters can run concurrently with RCU readers, RCU readers can
>> >> > +see stale and/or inconsistent values.  If RCU readers need fresh or
>> >> > +consistent values, which they sometimes do, they need to take proper
>> >> > +precautions.  To see this, consider the following code fragment:
>> >> > +
>> >> > +       struct foo {
>> >> > +               int a;
>> >> > +               int b;
>> >> > +               int c;
>> >> > +       };
>> >> > +       struct foo *gp1;
>> >> > +       struct foo *gp2;
>> >> > +
>> >> > +       void updater(void)
>> >> > +       {
>> >> > +               struct foo *p;
>> >> > +
>> >> > +               p = kmalloc(...);
>> >> > +               if (p == NULL)
>> >> > +                       deal_with_it();
>> >> > +               p->a = 42;  /* Each field in its own cache line. */
>> >> > +               p->b = 43;
>> >> > +               p->c = 44;
>> >> > +               rcu_assign_pointer(gp1, p);
>> >> > +               p->b = 143;
>> >> > +               p->c = 144;
>> >> > +               rcu_assign_pointer(gp2, p);
>> >> > +       }
>> >> > +
>> >> > +       void reader(void)
>> >> > +       {
>> >> > +               struct foo *p;
>> >> > +               struct foo *q;
>> >> > +               int r1, r2;
>> >> > +
>> >> > +               p = rcu_dereference(gp2);
>> >> > +               r1 = p->b;  /* Guaranteed to get 143. */
>> >> > +               q = rcu_dereference(gp1);
>> >> > +               if (p == q) {
>> >> > +                       /* The compiler decides that q->c is the same as p->c. */
>> >> > +                       r2 = p->c; /* Could get 44 on weakly ordered system. */
>> >> > +               }
>> >> > +       }
>> >> > +
>> >> > +You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible,
>> >> > +but you should not be.  After all, the updater might have been invoked
>> >> > +a second time between the time reader() loaded into "r1" and the time
>> >> > +that it loaded into "r2".  The fact that this same result can occur due
>> >> > +to some reordering from the compiler and CPUs is beside the point.
>> >> > +
>> >> > +But suppose that the reader needs a consistent view?
>> >> > +
>> >> > +Then one approach is to use locking, for example, as follows:
>> >> > +
>> >> > +       struct foo {
>> >> > +               int a;
>> >> > +               int b;
>> >> > +               int c;
>> >> > +               spinlock_t lock;
>> >> > +       };
>> >> > +       struct foo *gp1;
>> >> > +       struct foo *gp2;
>> >> > +
>> >> > +       void updater(void)
>> >> > +       {
>> >> > +               struct foo *p;
>> >> > +
>> >> > +               p = kmalloc(...);
>> >> > +               if (p == NULL)
>> >> > +                       deal_with_it();
>> >> > +               spin_lock(&p->lock);
>> >> > +               p->a = 42;  /* Each field in its own cache line. */
>> >> > +               p->b = 43;
>> >> > +               p->c = 44;
>> >> > +               spin_unlock(&p->lock);
>> >> > +               rcu_assign_pointer(gp1, p);
>> >> > +               spin_lock(&p->lock);
>> >> > +               p->b = 143;
>> >> > +               p->c = 144;
>> >> > +               spin_unlock(&p->lock);
>> >> > +               rcu_assign_pointer(gp2, p);
>> >> > +       }
>> >> > +
>> >> > +       void reader(void)
>> >> > +       {
>> >> > +               struct foo *p;
>> >> > +               struct foo *q;
>> >> > +               int r1, r2;
>> >> > +
>> >> > +               p = rcu_dereference(gp2);
>> >> > +               spin_lock(&p->lock);
>> >> > +               r1 = p->b;  /* Guaranteed to get 143. */
>> >> > +               q = rcu_dereference(gp1);
>> >> > +               if (p == q) {
>> >> > +                       /* The compiler decides that q->c is the same as p->c. */
>> >> > +                       r2 = p->c; /* Could get 44 on weakly ordered system. */
>> >> > +               }
>> >> > +               spin_unlock(&p->lock);
>> >> > +       }
>> >> > +
>> >> > +As always, use the right tool for the job!
>> >> > +
>> >> > +
>> >> > +EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH
>> >> > +
>> >> > +If a pointer obtained from rcu_dereference() compares not-equal to some
>> >> > +other pointer, the compiler normally has no clue what the value of the
>> >> > +first pointer might be.  This lack of knowledge prevents the compiler
>> >> > +from carrying out optimizations that otherwise might destroy the ordering
>> >> > +guarantees that RCU depends on.  And the volatile cast in rcu_dereference()
>> >> > +should prevent the compiler from guessing the value.
>> >> > +
>> >> > +But without rcu_dereference(), the compiler knows more than you might
>> >> > +expect.  Consider the following code fragment:
>> >> > +
>> >> > +       struct foo {
>> >> > +               int a;
>> >> > +               int b;
>> >> > +       };
>> >> > +       static struct foo variable1;
>> >> > +       static struct foo variable2;
>> >> > +       static struct foo *gp = &variable1;
>> >> > +
>> >> > +       void updater(void)
>> >> > +       {
>> >> > +               initialize_foo(&variable2);
>> >> > +               rcu_assign_pointer(gp, &variable2);
>> >> > +               /*
>> >> > +                * The above is the only store to gp in this translation unit,
>> >> > +                * and the address of gp is not exported in any way.
>> >> > +                */
>> >> > +       }
>> >> > +
>> >> > +       int reader(void)
>> >> > +       {
>> >> > +               struct foo *p;
>> >> > +
>> >> > +               p = gp;
>> >> > +               barrier();
>> >> > +               if (p == &variable1)
>> >> > +                       return p->a; /* Must be variable1.a. */
>> >> > +               else
>> >> > +                       return p->b; /* Must be variable2.b. */
>> >> > +       }
>> >> > +
>> >> > +Because the compiler can see all stores to "gp", it knows that the only
>> >> > +possible values of "gp" are "variable1" on the one hand and "variable2"
>> >> > +on the other.  The comparison in reader() therefore tells the compiler
>> >> > +the exact value of "p" even in the not-equals case.  This allows the
>> >> > +compiler to make the return values independent of the load from "gp",
>> >> > +in turn destroying the ordering between this load and the loads of the
>> >> > +return values.  This can result in "p->b" returning pre-initialization
>> >> > +garbage values.
>> >> > +
>> >> > +In short, rcu_dereference() is -not- optional when you are going to
>> >> > +dereference the resulting pointer.
>> >> >
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> >> > the body of a message to majordomo@vger.kernel.org
>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> > Please read the FAQ at  http://www.tux.org/lkml/
>> >>
>> >
>>
>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-02 23:44                                                                                                                                 ` Peter Sewell
@ 2014-03-03  4:25                                                                                                                                   ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-03-03  4:25 UTC (permalink / raw)
  To: Peter Sewell
  Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sun, Mar 02, 2014 at 11:44:52PM +0000, Peter Sewell wrote:
> On 2 March 2014 23:20, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
> >> On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> >> Hi Paul,
> >> >>
> >> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> >> > <paulmck@linux.vnet.ibm.com> wrote:
> >> >> >> > >
> >> >> >> > > 3.      The comparison was against another RCU-protected pointer,
> >> >> >> > >         where that other pointer was properly fetched using one
> >> >> >> > >         of the RCU primitives.  Here it doesn't matter which pointer
> >> >> >> > >         you use.  At least as long as the rcu_assign_pointer() for
> >> >> >> > >         that other pointer happened after the last update to the
> >> >> >> > >         pointed-to structure.
> >> >> >> > >
> >> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >> >
> >> >> >> > I think that it might be worth pointing out as an example, and saying
> >> >> >> > that code like
> >> >> >> >
> >> >> >> >    p = atomic_read(consume);
> >> >> >> >    X;
> >> >> >> >    q = atomic_read(consume);
> >> >> >> >    Y;
> >> >> >> >    if (p == q)
> >> >> >> >         data = p->val;
> >> >> >> >
> >> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> >> > *either* p or q, but you can't really tell which, since the compiler
> >> >> >> > can decide that the values are interchangeable.
> >> >> >> >
> >> >> >> > I cannot for the life of me come up with a situation where this would
> >> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> >> > stronger ordering than anything the consume through "p" would
> >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> >> > irrelevant. No?
> >> >> >>
> >> >> >> I can make a contrived litmus test for it, but you are right, the only
> >> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> >> >> reorder the loads into p and q, and the read from p->val can, as you say,
> >> >> >> come from either pointer.
> >> >> >>
> >> >> For whatever it is worth, here is the litmus test:
> >> >> >>
> >> >> >> T1:   p = kmalloc(...);
> >> >> >>       if (p == NULL)
> >> >> >>               deal_with_it();
> >> >> >>       p->a = 42;  /* Each field in its own cache line. */
> >> >> >>       p->b = 43;
> >> >> >>       p->c = 44;
> >> >> >>       atomic_store_explicit(&gp1, p, memory_order_release);
> >> >> >>       p->b = 143;
> >> >> >>       p->c = 144;
> >> >> >>       atomic_store_explicit(&gp2, p, memory_order_release);
> >> >> >>
> >> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >> >>       r1 = p->b;  /* Guaranteed to get 143. */
> >> >> >>       q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >> >>       if (p == q) {
> >> >> >>               /* The compiler decides that q->c is the same as p->c. */
> >> >> >>               r2 = p->c; /* Could get 44 on weakly ordered system. */
> >> >> >>       }
> >> >> >>
> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> >> >> you get.
> >> >> >>
> >> >> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> >> trouble anyway.  If you really want to do something like that and still
> >> >> >> see consistency across all the fields in the structure, please put a lock
> >> >> >> in the structure and use it to guard updates and accesses to those fields.
> >> >> >
> >> >> > And here is a patch documenting the restrictions for the current Linux
> >> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >> >
> >> >> > Thoughts?
> >> >>
> >> >> That might serve as informal documentation for linux kernel
> >> >> programmers about the bounds on the optimisations that you expect
> >> >> compilers to do for common-case RCU code - and I guess that's what you
> >> >> intend it to be for.   But I don't see how one can make it precise
> >> >> enough to serve as a language definition, so that compiler people
> >> >> could confidently say "yes, we respect that", which I guess is what
> >> >> you really need.  As a useful criterion, we should aim for something
> >> >> precise enough that in a verified-compiler context you can
> >> >> mathematically prove that the compiler will satisfy it  (even though
> >> >> that won't happen anytime soon for GCC), and that analysis tool
> >> >> authors can actually know what they're working with.   All this stuff
> >> >> about "you should avoid cancellation", and "avoid masking with just a
> >> >> small number of bits" is just too vague.
> >> >
> >> > Understood, and yes, this is intended to document current compiler
> >> > behavior for the Linux kernel community.  It would not make sense to show
> >> > it to the C11 or C++11 communities, except perhaps as an informational
> >> > piece on current practice.
> >> >
> >> >> The basic problem is that the compiler may be doing sophisticated
> >> >> reasoning with a bunch of non-local knowledge that it's deduced from
> >> >> the code, neither of which are well-understood, and here we have to
> >> >> identify some envelope, expressive enough for RCU idioms, in which
> >> >> that reasoning doesn't allow data/address dependencies to be removed
> >> >> (and hence the hardware guarantee about them will be maintained at the
> >> >> source level).
> >> >>
> >> >> The C11 syntactic notion of dependency, whatever its faults, was at
> >> >> least precise, could be reasoned about locally (just looking at the
> >> >> syntactic code in question), and did do that.  The fact that current
> >> >> compilers do optimisations that remove dependencies and will likely
> >> >> have many bugs at present is beside the point - this was surely
> >> >> intended as a *new* constraint on what they are allowed to do.  The
> >> >> interesting question is really whether the compiler writers think that
> >> >> they *could* implement it in a reasonable way - I'd like to hear
> >> >> Torvald and his colleagues' opinion on that.
> >> >>
> >> >> What you're doing above seems to be basically a very cut-down version
> >> >> of that, but with a fuzzy boundary.   If you want it to be precise,
> >> >> maybe it needs to be much simpler (which might force you into ruling
> >> >> out some current code idioms).
> >> >
> >> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806)
> >> > can be developed to serve this purpose.
> >>
> >> (I missed that mail when it first came past, sorry)
> >
> > No worries!
> >
> >> That's also going to be tricky, I'm afraid.  The key condition there is:
> >>
> >> "* at the time of execution of E, L       [PS: I assume that L is a
> >> typo and should be E]
> >
> > I believe it really is "L".  As I understand it (and Torvald will correct
> > me if I am wrong), the idea is that the implementation is prohibited
> > from guessing the value of "L" -- it must assume that any value from
> > L's type might be returned, regardless of what it might otherwise know.
> >
> > However, after L's value is loaded, the implementation -is- permitted
> > to learn constraints on this value based on "if" statements and the
> > like between the load from "L" and the execution of "E".
> >
> > Does that help?
> 
> Not sure (i.e., not really :-).  I thought Torvald wanted to say that
> "E  really-depends on L if there exist two different values that (just
> according to typing) might be read for L that give rise to two
> different values for E".

My interpretation was that for there to be a value dependency from L to
E, L must have the possibility of taking on at least two values at the
point in the code where E resides, but that the computation of E might
well result in only one possible value.

I guess we need Torvald to tell us which he meant.  ;-)
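
To make the difference concrete, here is a (hypothetical) fragment on
which the two readings seem to disagree:

        p = atomic_load_explicit(&gp, memory_order_consume);
        r = p - p;  /* always zero, whatever the load returned */
        E(r);

The load into "p" could have returned any number of different values,
so my interpretation says E's argument still carries a value
dependency; your reading of Torvald's definition says it does not,
since that argument can only ever be zero.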

> >>         can possibly have returned at
> >>         least two different values under the assumption that L itself
> >>         could have returned any value allowed by L's type."
> >>
> >> First, the evaluation of E might be nondeterministic  - e.g., for an
> >> artificial example, if it's just a nondeterministic value obtained
> >> from the result of a race on SC atomics.  The above doesn't
> >> distinguish between that (which doesn't have a real dependency on L)
> >> and that XOR'd with L (which does).  And it does so in the wrong
> >> direction: it'll say that the former has a dependency on L.
> >
> > Right, it is only any dependency that E has on L that would be
> > constrained.  If E also depends on other quantities obtained some
> > other way than a memory_order_consume load into a value_dep_preserving
> > variable, then as I understand it, the compiler is within its rights
> > to optimize these other quantities to within an inch of their lives.
> >
> > It is quite possible that E depends on L only sometimes.  For example:
> >
> >         p = atomic_load_explicit(&gp, memory_order_consume);
> >         p = random() & 0x8 ? p : &default_structure;
> >         E(p);
> >
> > My guess is that in this case, the ordering would be guaranteed only
> > for those executions where there is a value dependency.  In my naive
> > view, this should be no different than something like this:
> >
> >         if (random() & 0x10)
> >                 p = atomic_load_explicit(&gp, memory_order_acquire);
> >         else
> >                 p = &default_structure;
> >         E(p);
> 
> all this is fine, but...
> 
> > Or am I missing your point?
> 
> ...if the idea was to identify "real dependencies" as cases where two
> values of E are possible based on different values of L, then if two
> values of E are possible *just anyway* (e.g. because of
> nondeterminism), the definition gets confused.

Understood.  Does this confusion persist in the case where it is only
L that is required to have the possibility of taking on two or more
values?
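
For example (another made-up fragment; "f" is an SC atomic raced on
by some other thread):

        L = atomic_load_explicit(&gx, memory_order_consume);
        E = (L - L) + atomic_load_explicit(&f, memory_order_seq_cst);

Here "L" can take on two or more values, and "E" can also evaluate to
two or more values, but none of E's variation comes from L.  Or is
this the sort of case you are concerned about?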

> >> Second, it involves reasoning about counterfactual executions.  That
> >> doesn't necessarily make it wrong, per se, but probably makes it hard
> >> to work with.  For example, suppose that in all the actual
> >> whole-program executions, a runtime occurrence of L only ever returns
> >> one particular value (perhaps because of some simple #define'd
> >> configuration), and that the code used in the evaluation of E depends
> >> on some invariant which is related to that configuration.  The
> >> hypothetical execution used above in which a different value is used
> >> is one in which the code is being run in a situation with broken invariants.
> >>    Then there will be technical difficulties in using the definition:
> >> I don't see how one would persuade oneself that a compiler always
> >> satisfies it, because these hypothetical executions are far removed
> >> from what it's actually working on.
> >
> > The developer answer would be something like "all it really means is that
> > the implementation is required to actually emit the memory_order_consume
> > load and actually use the value," which is probably not much comfort
> > to someone trying to model it.  Maybe there is a better way of wording
> > this constraint so as to avoid the counterfactuals?
> 
> maybe.  I don't have one right now, though.
> 
> >> (Aside: The notion of a thread "observing" another thread's load,
> >> dating back a long time and adopted in the Power and ARM architecture
> >> texts, relies on counterfactual executions in a broadly similar way;
> >> we're happy to have escaped that now :-)
> >
> > Here is hoping that there is a way to escape it in this case as well.  ;-)
> 
> 
> ta,
> Peter

                                                        Thanx, Paul

> >> Peter
> >>
> >>
> >>
> >>
> >> >                                                         Thanx, Paul
> >> >
> >> >> best,
> >> >> Peter
> >> >>
> >> >>
> >> >>
> >> >> >                                                         Thanx, Paul
> >> >> >
> >> >> > ------------------------------------------------------------------------
> >> >> >
> >> >> > documentation: Record rcu_dereference() value mishandling
> >> >> >
> >> >> > Recent LKML discussings (see http://lwn.net/Articles/586838/ and
> >> >> > http://lwn.net/Articles/588300/ for the LWN writeups) brought out
> >> >> > some ways of misusing the return value from rcu_dereference() that
> >> >> > are not necessarily completely intuitive.  This commit therefore
> >> >> > documents what can and cannot safely be done with these values.
> >> >> >
> >> >> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> >> >
> >> >> > diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
> >> >> > index fa57139f50bf..f773a264ae02 100644
> >> >> > --- a/Documentation/RCU/00-INDEX
> >> >> > +++ b/Documentation/RCU/00-INDEX
> >> >> > @@ -12,6 +12,8 @@ lockdep-splat.txt
> >> >> >         - RCU Lockdep splats explained.
> >> >> >  NMI-RCU.txt
> >> >> >         - Using RCU to Protect Dynamic NMI Handlers
> >> >> > +rcu_dereference.txt
> >> >> > +       - Proper care and feeding of return values from rcu_dereference()
> >> >> >  rcubarrier.txt
> >> >> >         - RCU and Unloadable Modules
> >> >> >  rculist_nulls.txt
> >> >> > diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
> >> >> > index 9d10d1db16a5..877947130ebe 100644
> >> >> > --- a/Documentation/RCU/checklist.txt
> >> >> > +++ b/Documentation/RCU/checklist.txt
> >> >> > @@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome!
> >> >> >                         http://www.openvms.compaq.com/wizard/wiz_2637.html
> >> >> >
> >> >> >                 The rcu_dereference() primitive is also an excellent
> >> >> > -               documentation aid, letting the person reading the code
> >> >> > -               know exactly which pointers are protected by RCU.
> >> >> > +               documentation aid, letting the person reading the
> >> >> > +               code know exactly which pointers are protected by RCU.
> >> >> >                 Please note that compilers can also reorder code, and
> >> >> >                 they are becoming increasingly aggressive about doing
> >> >> > -               just that.  The rcu_dereference() primitive therefore
> >> >> > -               also prevents destructive compiler optimizations.
> >> >> > +               just that.  The rcu_dereference() primitive therefore also
> >> >> > +               prevents destructive compiler optimizations.  However,
> >> >> > +               with a bit of devious creativity, it is possible to
> >> >> > +               mishandle the return value from rcu_dereference().
> >> >> > +               Please see rcu_dereference.txt in this directory for
> >> >> > +               more information.
> >> >> >
> >> >> >                 The rcu_dereference() primitive is used by the
> >> >> >                 various "_rcu()" list-traversal primitives, such
> >> >> > diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt
> >> >> > new file mode 100644
> >> >> > index 000000000000..6e72cd8622df
> >> >> > --- /dev/null
> >> >> > +++ b/Documentation/RCU/rcu_dereference.txt
> >> >> > @@ -0,0 +1,365 @@
> >> >> > +PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference()
> >> >> > +
> >> >> > +Most of the time, you can use values from rcu_dereference() or one of
> >> >> > +the similar primitives without worries.  Dereferencing (prefix "*"),
> >> >> > +field selection ("->"), assignment ("="), address-of ("&"), addition and
> >> >> > +subtraction of constants, and casts all work quite naturally and safely.
> >> >> > +
> >> >> > +It is nevertheless possible to get into trouble with other operations.
> >> >> > +Follow these rules to keep your RCU code working properly:
> >> >> > +
> >> >> > +o      You must use one of the rcu_dereference() family of primitives
> >> >> > +       to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU
> >> >> > +       will complain.  Worse yet, your code can see random memory-corruption
> >> >> > +       bugs due to games that compilers and DEC Alpha can play.
> >> >> > +       Without one of the rcu_dereference() primitives, compilers
> >> >> > +       can reload the value, and won't your code have fun with two
> >> >> > +       different values for a single pointer!  Without rcu_dereference(),
> >> >> > +       DEC Alpha can load a pointer, dereference that pointer, and
> >> >> > +       return data preceding initialization that preceded the store of
> >> >> > +       the pointer.
> >> >> > +
> >> >> > +       In addition, the volatile cast in rcu_dereference() prevents the
> >> >> > +       compiler from deducing the resulting pointer value.  Please see
> >> >> > +       the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH"
> >> >> > +       for an example where the compiler can in fact deduce the exact
> >> >> > +       value of the pointer, and thus cause misordering.
> >> >> > +
> >> >> > +o      Do not use single-element RCU-protected arrays.  The compiler
> >> >> > +       is within its right to assume that the value of an index into
> >> >> > +       such an array must necessarily evaluate to zero.  The compiler
> >> >> > +       could then substitute the constant zero for the computation, so
> >> >> > +       that the array index no longer depended on the value returned
> >> >> > +       by rcu_dereference().  If the array index no longer depends
> >> >> > +       on rcu_dereference(), then both the compiler and the CPU
> >> >> > +       are within their rights to order the array access before the
> >> >> > +       rcu_dereference(), which can cause the array access to return
> >> >> > +       garbage.
> >> >> > +
> >> >> > +o      Avoid cancellation when using the "+" and "-" infix arithmetic
> >> >> > +       operators.  For example, for a given variable "x", avoid
> >> >> > +       "(x-x)".  There are similar arithmetic pitfalls from other
> >> >> > +       arithmetic operatiors, such as "(x*0)", "(x/(x+1))" or "(x%1)".
> >> >> > +       The compiler is within its rights to substitute zero for all of
> >> >> > +       these expressions, so that subsequent accesses no longer depend
> >> >> > +       on the rcu_dereference(), again possibly resulting in bugs due
> >> >> > +       to misordering.
> >> >> > +
> >> >> > +       Of course, if "p" is a pointer from rcu_dereference(), and "a"
> >> >> > +       and "b" are integers that happen to be equal, the expression
> >> >> > +       "p+a-b" is safe because its value still necessarily depends on
> >> >> > +       the rcu_dereference(), thus maintaining proper ordering.
> >> >> > +
> >> >> > +o      Avoid all-zero operands to the bitwise "&" operator, and
> >> >> > +       similarly avoid all-ones operands to the bitwise "|" operator.
> >> >> > +       If the compiler is able to deduce the value of such operands,
> >> >> > +       it is within its rights to substitute the corresponding constant
> >> >> > +       for the bitwise operation.  Once again, this causes subsequent
> >> >> > +       accesses to no longer depend on the rcu_dereference(), causing
> >> >> > +       bugs due to misordering.
> >> >> > +
> >> >> > +       Please note that single-bit operands to bitwise "&" can also
> >> >> > +       be dangerous.  At this point, the compiler knows that the
> >> >> > +       resulting value can only take on one of two possible values.
> >> >> > +       Therefore, a very small amount of additional information will
> >> >> > +       allow the compiler to deduce the exact value, which again can
> >> >> > +       result in misordering.
> >> >> > +
> >> >> > +o      If you are using RCU to protect JITed functions, so that the
> >> >> > +       "()" function-invocation operator is applied to a value obtained
> >> >> > +       (directly or indirectly) from rcu_dereference(), you may need to
> >> >> > +       interact directly with the hardware to flush instruction caches.
> >> >> > +       This issue arises on some systems when a newly JITed function is
> >> >> > +       using the same memory that was used by an earlier JITed function.
> >> >> > +
> >> >> > +o      Do not use the results from the boolean "&&" and "||" when
> >> >> > +       dereferencing.  For example, the following (rather improbable)
> >> >> > +       code is buggy:
> >> >> > +
> >> >> > +               int a[2];
> >> >> > +               int index;
> >> >> > +               int force_zero_index = 1;
> >> >> > +
> >> >> > +               ...
> >> >> > +
> >> >> > +               r1 = rcu_dereference(i1)
> >> >> > +               r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> >> >> > +
> >> >> > +       The reason this is buggy is that "&&" and "||" are often compiled
> >> >> > +       using branches.  While weak-memory machines such as ARM or PowerPC
> >> >> > +       do order stores after such branches, they can speculate loads,
> >> >> > +       which can result in misordering bugs.
> >> >> > +
> >> >> > +o      Do not use the results from relational operators ("==", "!=",
> >> >> > +       ">", ">=", "<", or "<=") when dereferencing.  For example,
> >> >> > +       the following (quite strange) code is buggy:
> >> >> > +
> >> >> > +               int a[2];
> >> >> > +               int index;
> >> >> > +               int flip_index = 0;
> >> >> > +
> >> >> > +               ...
> >> >> > +
> >> >> > +               r1 = rcu_dereference(i1)
> >> >> > +               r2 = a[r1 != flip_index];  /* BUGGY!!! */
> >> >> > +
> >> >> > +       As before, the reason this is buggy is that relational operators
> >> >> > +       are often compiled using branches.  And as before, although
> >> >> > +       weak-memory machines such as ARM or PowerPC do order stores
> >> >> > +       after such branches, but can speculate loads, which can again
> >> >> > +       result in misordering bugs.
> >> >> > +
> >> >> > +o      Be very careful about comparing pointers obtained from
> >> >> > +       rcu_dereference() against non-NULL values.  As Linus Torvalds
> >> >> > +       explained, if the two pointers are equal, the compiler could
> >> >> > +       substitute the pointer you are comparing against for the pointer
> >> >> > +       obtained from rcu_dereference().  For example:
> >> >> > +
> >> >> > +               p = rcu_dereference(gp);
> >> >> > +               if (p == &default_struct)
> >> >> > +                       do_default(p->a);
> >> >> > +
> >> >> > +       Because the compiler now knows that the value of "p" is exactly
> >> >> > +       the address of the variable "default_struct", it is free to
> >> >> > +       transform this code into the following:
> >> >> > +
> >> >> > +               p = rcu_dereference(gp);
> >> >> > +               if (p == &default_struct)
> >> >> > +                       do_default(default_struct.a);
> >> >> > +
> >> >> > +       On ARM and Power hardware, the load from "default_struct.a"
> >> >> > +       can now be speculated, such that it might happen before the
> >> >> > +       rcu_dereference().  This could result in bugs due to misordering.
> >> >> > +
> >> >> > +       However, comparisons are OK in the following cases:
> >> >> > +
> >> >> > +       o       The comparison was against the NULL pointer.  If the
> >> >> > +               compiler knows that the pointer is NULL, you had better
> >> >> > +               not be dereferencing it anyway.  If the comparison is
> >> >> > +               non-equal, the compiler is none the wiser.  Therefore,
> >> >> > +               it is safe to compare pointers from rcu_dereference()
> >> >> > +               against NULL pointers.
> >> >> > +
> >> >> > +       o       The pointer is never dereferenced after being compared.
> >> >> > +               Since there are no subsequent dereferences, the compiler
> >> >> > +               cannot use anything it learned from the comparison
> >> >> > +               to reorder the non-existent subsequent dereferences.
> >> >> > +               This sort of comparison occurs frequently when scanning
> >> >> > +               RCU-protected circular linked lists.
> >> >> > +
> >> >> > +       o       The comparison is against a pointer pointer that
> >> >> > +               references memory that was initialized "a long time ago."
> >> >> > +               The reason this is safe is that even if misordering
> >> >> > +               occurs, the misordering will not affect the accesses
> >> >> > +               that follow the comparison.  So exactly how long ago is
> >> >> > +               "a long time ago"?  Here are some possibilities:
> >> >> > +
> >> >> > +               o       Compile time.
> >> >> > +
> >> >> > +               o       Boot time.
> >> >> > +
> >> >> > +               o       Module-init time for module code.
> >> >> > +
> >> >> > +               o       Prior to kthread creation for kthread code.
> >> >> > +
> >> >> > +               o       During some prior acquisition of the lock that
> >> >> > +                       we now hold.
> >> >> > +
> >> >> > +               o       Before mod_timer() time for a timer handler.
> >> >> > +
> >> >> > +               There are many other possibilities involving the Linux
> >> >> > +               kernel's wide array of primitives that cause code to
> >> >> > +               be invoked at a later time.
> >> >> > +
> >> >> > +       o       The pointer being compared against also came from
> >> >> > +               rcu_dereference().  In this case, both pointers depend
> >> >> > +               on one rcu_dereference() or another, so you get proper
> >> >> > +               ordering either way.
> >> >> > +
> >> >> > +               That said, this situation can make certain RCU usage
> >> >> > +               bugs more likely to happen.  Which can be a good thing,
> >> >> > +               at least if they happen during testing.  An example
> >> >> > +               of such an RCU usage bug is shown in the section titled
> >> >> > +               "EXAMPLE OF AMPLIFIED RCU-USAGE BUG".
> >> >> > +
> >> >> > +       o       All of the accesses following the comparison are stores,
> >> >> > +               so that a control dependency preserves the needed ordering.
> >> >> > +               That said, it is easy to get control dependencies wrong.
> >> >> > +               Please see the "CONTROL DEPENDENCIES" section of
> >> >> > +               Documentation/memory-barriers.txt for more details.
> >> >> > +
> >> >> > +       o       The pointers compared not-equal -and- the compiler does
> >> >> > +               not have enough information to deduce the value of the
> >> >> > +               pointer.  Note that the volatile cast in rcu_dereference()
> >> >> > +               will normally prevent the compiler from knowing too much.
> >> >> > +
> >> >> > +o      Disable any value-speculation optimizations that your compiler
> >> >> > +       might provide, especially if you are making use of feedback-based
> >> >> > +       optimizations that take data collected from prior runs.  Such
> >> >> > +       value-speculation optimizations reorder operations by design.
> >> >> > +
> >> >> > +       There is one exception to this rule:  Value-speculation
> >> >> > +       optimizations that leverage the branch-prediction hardware are
> >> >> > +       safe on strongly ordered systems (such as x86), but not on weakly
> >> >> > +       ordered systems (such as ARM or Power).  Choose your compiler
> >> >> > +       command-line options wisely!
> >> >> > +
> >> >> > +
> >> >> > +EXAMPLE OF AMPLIFIED RCU-USAGE BUG
> >> >> > +
> >> >> > +Because updaters can run concurrently with RCU readers, RCU readers can
> >> >> > +see stale and/or inconsistent values.  If RCU readers need fresh or
> >> >> > +consistent values, which they sometimes do, they need to take proper
> >> >> > +precautions.  To see this, consider the following code fragment:
> >> >> > +
> >> >> > +       struct foo {
> >> >> > +               int a;
> >> >> > +               int b;
> >> >> > +               int c;
> >> >> > +       };
> >> >> > +       struct foo *gp1;
> >> >> > +       struct foo *gp2;
> >> >> > +
> >> >> > +       void updater(void)
> >> >> > +       {
> >> >> > +               struct foo *p;
> >> >> > +
> >> >> > +               p = kmalloc(...);
> >> >> > +               if (p == NULL)
> >> >> > +                       deal_with_it();
> >> >> > +               p->a = 42;  /* Each field in its own cache line. */
> >> >> > +               p->b = 43;
> >> >> > +               p->c = 44;
> >> >> > +               rcu_assign_pointer(gp1, p);
> >> >> > +               p->b = 143;
> >> >> > +               p->c = 144;
> >> >> > +               rcu_assign_pointer(gp2, p);
> >> >> > +       }
> >> >> > +
> >> >> > +       void reader(void)
> >> >> > +       {
> >> >> > +               struct foo *p;
> >> >> > +               struct foo *q;
> >> >> > +               int r1, r2;
> >> >> > +
> >> >> > +               p = rcu_dereference(gp2);
> >> >> > +               r1 = p->b;  /* Guaranteed to get 143. */
> >> >> > +               q = rcu_dereference(gp1);
> >> >> > +               if (p == q) {
> >> >> > +                       /* The compiler decides that q->c is same as p->c. */
> >> >> > +                       r2 = p->c; /* Could get 44 on weakly order system. */
> >> >> > +               }
> >> >> > +       }
> >> >> > +
> >> >> > +You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible,
> >> >> > +but you should not be.  After all, the updater might have been invoked
> >> >> > +a second time between the time reader() loaded into "r1" and the time
> >> >> > +that it loaded into "r2".  The fact that this same result can occur due
> >> >> > +to some reordering from the compiler and CPUs is beside the point.
> >> >> > +
> >> >> > +But suppose that the reader needs a consistent view?
> >> >> > +
> >> >> > +Then one approach is to use locking, for example, as follows:
> >> >> > +
> >> >> > +       struct foo {
> >> >> > +               int a;
> >> >> > +               int b;
> >> >> > +               int c;
> >> >> > +               spinlock_t lock;
> >> >> > +       };
> >> >> > +       struct foo *gp1;
> >> >> > +       struct foo *gp2;
> >> >> > +
> >> >> > +       void updater(void)
> >> >> > +       {
> >> >> > +               struct foo *p;
> >> >> > +
> >> >> > +               p = kmalloc(...);
> >> >> > +               if (p == NULL)
> >> >> > +                       deal_with_it();
> >> >> > +               spin_lock(&p->lock);
> >> >> > +               p->a = 42;  /* Each field in its own cache line. */
> >> >> > +               p->b = 43;
> >> >> > +               p->c = 44;
> >> >> > +               spin_unlock(&p->lock);
> >> >> > +               rcu_assign_pointer(gp1, p);
> >> >> > +               spin_lock(&p->lock);
> >> >> > +               p->b = 143;
> >> >> > +               p->c = 144;
> >> >> > +               spin_unlock(&p->lock);
> >> >> > +               rcu_assign_pointer(gp2, p);
> >> >> > +       }
> >> >> > +
> >> >> > +       void reader(void)
> >> >> > +       {
> >> >> > +               struct foo *p;
> >> >> > +               struct foo *q;
> >> >> > +               int r1, r2;
> >> >> > +
> >> >> > +               p = rcu_dereference(gp2);
> >> >> > +               spin_lock(&p->lock);
> >> >> > +               r1 = p->b;  /* Guaranteed to get 143. */
> >> >> > +               q = rcu_dereference(gp1);
> >> >> > +               if (p == q) {
> >> >> > +                       /* The compiler decides that q->c is same as p->c. */
> >> >> > +                       r2 = p->c; /* Could get 44 on weakly order system. */
> >> >> > +               }
> >> >> > +               spin_unlock(&p->lock);
> >> >> > +       }
> >> >> > +
> >> >> > +As always, use the right tool for the job!
> >> >> > +
> >> >> > +
> >> >> > +EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH
> >> >> > +
> >> >> > +If a pointer obtained from rcu_dereference() compares not-equal to some
> >> >> > +other pointer, the compiler normally has no clue what the value of the
> >> >> > +first pointer might be.  This lack of knowledge prevents the compiler
> >> >> > +from carrying out optimizations that otherwise might destroy the ordering
> >> >> > +guarantees that RCU depends on.  And the volatile cast in rcu_dereference()
> >> >> > +should prevent the compiler from guessing the value.
> >> >> > +
> >> >> > +But without rcu_dereference(), the compiler knows more than you might
> >> >> > +expect.  Consider the following code fragment:
> >> >> > +
> >> >> > +       struct foo {
> >> >> > +               int a;
> >> >> > +               int b;
> >> >> > +       };
> >> >> > +       static struct foo variable1;
> >> >> > +       static struct foo variable2;
> >> >> > +       static struct foo *gp = &variable1;
> >> >> > +
> >> >> > +       void updater(void)
> >> >> > +       {
> >> >> > +               initialize_foo(&variable2);
> >> >> > +               rcu_assign_pointer(gp, &variable2);
> >> >> > +               /*
> >> >> > +                * The above is the only store to gp in this translation unit,
> >> >> > +                * and the address of gp is not exported in any way.
> >> >> > +                */
> >> >> > +       }
> >> >> > +
> >> >> > +       int reader(void)
> >> >> > +       {
> >> >> > +               struct foo *p;
> >> >> > +
> >> >> > +               p = gp;
> >> >> > +               barrier();
> >> >> > +               if (p == &variable1)
> >> >> > +                       return p->a; /* Must be variable1.a. */
> >> >> > +               else
> >> >> > +                       return p->b; /* Must be variable2.b. */
> >> >> > +       }
> >> >> > +
> >> >> > +Because the compiler can see all stores to "gp", it knows that the only
> >> >> > +possible values of "gp" are "variable1" on the one hand and "variable2"
> >> >> > +on the other.  The comparison in reader() therefore tells the compiler
> >> >> > +the exact value of "p" even in the not-equals case.  This allows the
> >> >> > +compiler to make the return values independent of the load from "gp",
> >> >> > +in turn destroying the ordering between this load and the loads of the
> >> >> > +return values.  This can result in "p->b" returning pre-initialization
> >> >> > +garbage values.
> >> >> > +
> >> >> > +In short, rcu_dereference() is -not- optional when you are going to
> >> >> > +dereference the resulting pointer.
> >> >> >
> >> >>
> >> >
> >>
> >
> 



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 17:01                                                                                                               ` Linus Torvalds
  2014-02-27 19:06                                                                                                                 ` Paul E. McKenney
@ 2014-03-03 15:36                                                                                                                 ` Torvald Riegel
  1 sibling, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-03-03 15:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc,
	Peter Sewell

On Thu, 2014-02-27 at 09:01 -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel <triegel@redhat.com> wrote:
> > Regarding the latter, we make a fresh start at each mo_consume load (ie,
> > we assume we know nothing -- L could have returned any possible value);
> > I believe this is easier to reason about than other scopes like function
> > granularities (what happens on inlining?), or translation units.  It
> > should also be simple to implement for compilers, and would hopefully
> > not constrain optimization too much.
> >
> > [...]
> >
> > Paul's litmus test would work, because we guarantee to the programmer
> > that it can assume that the mo_consume load would return any value
> > allowed by the type; effectively, this forbids the compiler analysis
> > Paul thought about:
> 
> So realistically, since with the new wording we can ignore the silly
> cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
> cases ("if (p == &variable) .. use p"), and you would forbid the
> "global value range optimization case" that Paul bright up, what
> remains would seem to be just really subtle compiler transformations
> of data dependencies to control dependencies.
> 
> And the only such thing I can think of is basically compiler-initiated
> value-prediction, presumably directed by PGO (since now if the value
> prediction is in the source code, it's considered to break the value
> chain).

The other example that comes to mind would be feedback-directed JIT
compilation.  I don't think that's widely used today, and it might never
be for the kernel -- but *in the standard*, we at least have to consider
what the future might bring.

> The good thing is that afaik, value-prediction is largely not used in
> real life. There are lots of papers on it, but I don't think
> anybody actually does it (although I can easily see some
> specint-specific optimization pattern that is built up around it).
> 
> And even value prediction is actually fine, as long as the compiler
> can see the memory *source* of the value prediction (and it isn't a
> mo_consume). So it really ends up limiting your value prediction in
> very simple ways: you cannot do it to function arguments if they are
> registers. But you can still do value prediction on values you loaded
> from memory, if you can actually *see* that memory op.

I think one would need to show that the source is *not even indirectly*
a mo_consume load.  With the wording I proposed, value dependencies
don't break when storing to / loading from memory locations.

Thus, if a compiler ends up at a memory load after walking SSA, it needs
to prove that the load cannot read a value that (1) was produced by a
store sequenced-before the load and (2) might carry a value dependency
(e.g., by being a mo_consume load) that the value prediction in question
would break.  This, in general, requires alias analysis.
Deciding whether a prediction would break a value dependency has to
consider what later stages in a compiler would be doing, including LTO
or further rounds of inlining/optimizations.  OTOH, if the compiler can
treat an mo_consume load as returning all possible values (eg, by
ignoring all knowledge about it), then it can certainly do so with other
memory loads too.
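
To illustrate with a rough sketch (gsrc, tmp, q, and x are all made up
here, and this assumes my proposed wording rather than any particular
implementation):

  p = atomic_load_explicit(&gsrc, mo_consume);
  tmp = p;       /* store the loaded value to ordinary memory */
  /* ... arbitrary other code, possibly in another function ... */
  q = tmp;       /* read it back through a plain load */
  x = q->value;  /* still carries the value dependency from the
                    mo_consume load, so value-predicting q here
                    would break it */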

So, I think that the constraints due to value dependencies can matter in
practice.  However, the impact on optimizations of
non-mo_consume-related code is hard to estimate -- I don't see a huge
amount of impact right now, but I also wouldn't want to predict that
this can't change in the future.

> Of course, on more strongly ordered CPU's, even that "register
> argument" limitation goes away.
> 
> So I agree that there is basically no real optimization constraint.
> Value-prediction is of dubious value to begin with, and the actual
> constraint on its use if some compiler writer really wants to is not
> onerous.
> 
> > What I have in mind is roughly the following (totally made-up syntax --
> > suggestions for how to do this properly are very welcome):
> > * Have a type modifier (eg, like restrict), that specifies that
> > operations on data of this type are preserving value dependencies:
> 
> So I'm not violently opposed, but I think the upsides are not great.
> Note that my earlier suggestion to use "restrict" wasn't because I
> believed the annotation itself would be visible, but basically just as
> a legalistic promise to the compiler that *if* it found an alias, then
> it didn't need to worry about ordering. So to me, that type modifier
> was about conceptual guarantees, not about actual value chains.
> 
> Anyway, the reason I don't believe any type modifier (and
> "[[carries_dependency]]" is basically just that) is worth it is simply
> that it adds a real burden on the programmer, without actually giving
> the programmer any real upside:
> 
> Within a single function, the compiler already sees that mo_consume
> source, and so doing a type-based restriction doesn't really help. The
> information is already there, without any burden on the programmer.

I think it's not just a question of whether we're talking about a single
function or across functions, but to what extent other code can detect
whether it might have to consider value dependencies.  The store/load
case above is an example that complicates the detection for a compiler.

In cases in which the mo_consume load is used directly, we don't need to
use any annotations on the type:
  int val = atomic_load_explicit(ptr, mo_consume)->value;

However, if we need to use the load's result more than once (which I
think will happen often), then we do need the type annotation:
  s value_dep_preserving *ptr = atomic_load_explicit(&source, mo_consume);
  if (ptr != 0) {
    int val = ptr->value;
  }

If we want to avoid the annotation in this case, and still want to avoid
the store/load vs. alias analysis problem mentioned above, we'd need to
require that ptr isn't a variable that's visible to other code not
related to this mo_consume load.  But I believe that such a requirement
would be awkward, and also hard to specify.
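
For illustration, a rough cross-function sketch (made-up syntax as
before, and use() is a hypothetical helper):

  int use(s value_dep_preserving *q)
  {
    return q->value;  /* vdp parameter type: the callee promises to
                         preserve the value dependency */
  }

  ...
  s value_dep_preserving *ptr = atomic_load_explicit(&source, mo_consume);
  if (ptr != 0)
    val = use(ptr);   /* the annotation keeps the dependency visible
                         across the call, without alias analysis */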

I hope that Paul's look at rcu_dereference() usage could provide some
indication of how much annotation overhead there actually would be for a
programmer.

> 
> And across functions, the compiler has already - by definition -
> mostly lost sight of all the things it could use to reduce the value
> space.

I don't think that I agree here.  

Assume we have two separate functions bar and foo, and one temporary
variable t of a type int012 that holds values 0,1,2 (excuse the somewhat
artificial example):

int012 t;
int arr[20];

int bar(int a)
{
  bar_stuff(a);  // compiler knows this is a no-op with arguments 0 or 1
                 // and this will *never* touch t nor arr
  return a;
}

int foo(int a)
{
  foo_stuff(a); // compiler knows this is a no-op with arguments 1 or 2
                // and this will *never* touch t nor arr
  return a;
}

int main() {
  t = atomic_load_explicit(&source, mo_consume);
  int x = arr[bar(foo(t))];  // value-dependent?
}

If a compiler looks at foo() and bar() separately, I think it might want
to optimize them to the following:

int bar_opt(int a)
{
  if (a != 2) return a;
  bar_stuff(a);
  return a;
}

int foo_opt(int a)
{
  if (a != 0) return a;
  foo_stuff(a);
  return a;
}

I think that those could be valuable optimizations for general-purpose
code.

What happens if the compiler does LTO afterwards and combines the foo
and bar calls?

int bar_opt_foo_opt(int a)
{
  if (a == 1) return a;
  if (a == 0) foo_stuff(a);
  else bar_stuff(a);
  return a;
}

This still looks like a good thing to do for general-purpose code, and
it doesn't do any value prediction.

If we inline this into main, it becomes kind of difficult for the
compiler because it cannot just weave in bar_opt_foo_opt, or it might
get:

  t = atomic_load_explicit(&source, mo_consume);
  if (t == 1) goto access;
  if (t == 0) foo_stuff(t);
  else bar_stuff(t);
access:
  x = arr[t];  // value-dependent?

Would this still be value-dependent for the hardware, or would the
branch prediction interfere?

Even if this would still be okay from the hardware POV, other compiler
transformations now need to pay attention to where the value comes from.
In particular, we can't specialize this into the following (which
doesn't predict any values):

  t = atomic_load_explicit(&source, mo_consume);
  if (t == 1) x = arr[1];
  else {
    if (t == 0) foo_stuff(t);
    else bar_stuff(t);
    x = arr[t];
  }

We could argue that this wouldn't be allowed because t is coming from an
mo_consume load, but then we also need to say that this can in fact
affect compiler transformations other than just value prediction.

At least in this example, introducing the value_dep_preserving type
modifier would have made this easier because it would have allowed the
compiler to avoid the initial value prediction.

However, I think that we might get a few interesting issues even with
value_dep_preserving, so this needs further investigation.  Nonetheless,
my gut feeling is that having value_dep_preserving makes this all a lot
easier because if unsure, the compiler can just use a bigger hammer for
value_dep_preserving without having to worry about preventing
optimizations on unrelated code.

> Even Paul's example doesn't really work if the use of the
> "mo_consume" value has been passed to another function, because inside
> a separate function, the compiler couldn't see that the value it uses
> comes from only two possible values.
> 
> And as mentioned, even *if* the compiler wants to do value prediction
> that turns a data dependency into a control dependency, the limitation
> to say "no, you can't do it unless you saw where the value got loaded"
> really isn't that onerous.
> 
> I bet that if you ask actual production compiler people (as opposed to
> perhaps academia), none of them actually really believe in value
> prediction to begin with.

What about keeping whether we really need value_dep_preserving as an
open question for now, and trying to get more feedback from compiler
implementers on what the consequences of not having it would be?  This
should help us assess the current implementation perspective.  We would
still need to extrapolate what future compilers might want to do,
though; thus, deciding to not use it will remain somewhat risky for
future optimizations.




* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-01  0:50                                                                                                                       ` Paul E. McKenney
  2014-03-01 10:06                                                                                                                         ` Peter Sewell
@ 2014-03-03 18:55                                                                                                                         ` Torvald Riegel
  2014-03-03 19:20                                                                                                                           ` Paul E. McKenney
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-03-03 18:55 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> +o	Do not use the results from the boolean "&&" and "||" when
> +	dereferencing.	For example, the following (rather improbable)
> +	code is buggy:
> +
> +		int a[2];
> +		int index;
> +		int force_zero_index = 1;
> +
> +		...
> +
> +		r1 = rcu_dereference(i1);
> +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> +
> +	The reason this is buggy is that "&&" and "||" are often compiled
> +	using branches.  While weak-memory machines such as ARM or PowerPC
> +	do order stores after such branches, they can speculate loads,
> +	which can result in misordering bugs.
> +
> +o	Do not use the results from relational operators ("==", "!=",
> +	">", ">=", "<", or "<=") when dereferencing.  For example,
> +	the following (quite strange) code is buggy:
> +
> +		int a[2];
> +		int index;
> +		int flip_index = 0;
> +
> +		...
> +
> +		r1 = rcu_dereference(i1);
> +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> +
> +	As before, the reason this is buggy is that relational operators
> +	are often compiled using branches.  And as before, although
> +	weak-memory machines such as ARM or PowerPC do order stores
> +	after such branches, they can speculate loads, which can again
> +	result in misordering bugs.

Those two would be allowed by the wording I have recently proposed,
AFAICS.  r1 != flip_index would result in two possible values (unless
there are further constraints due to the type of r1 and the values that
flip_index can have).

I don't think the wording is flawed.  We could raise the requirement
from having more than one value left for r1 to having more than N values
left (with N > 1), but the fundamental problem remains: a compiler
could try to generate a (big) switch statement.
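
For example (made up, assuming the constraints leave the four values
0..3 for r1), such a switch would turn the address dependency into
control dependencies:

	int a[4];

	r1 = rcu_dereference(i1);    /* r1 known to be in {0,1,2,3} */
	switch (r1) {
	case 0: r2 = a[0]; break;    /* every arm uses a constant index, */
	case 1: r2 = a[1]; break;    /* so no address dependency remains */
	case 2: r2 = a[2]; break;
	case 3: r2 = a[3]; break;
	}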

Instead, I think that this indicates that the value_dep_preserving type
modifier would be useful: It would tell the compiler that it shouldn't
transform this into a branch in this case, yet allow that optimization
for all other code.
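
Roughly, with the made-up modifier syntax:

	int value_dep_preserving r1 = rcu_dereference(i1);

	/* The modifier would oblige the compiler to keep the comparison
	   result as data feeding the index (eg, via a flag-setting
	   instruction) instead of compiling it into a branch: */
	r2 = a[r1 != flip_index];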



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 19:47                                                                                                                   ` Linus Torvalds
  2014-02-27 20:53                                                                                                                     ` Paul E. McKenney
@ 2014-03-03 18:59                                                                                                                     ` Torvald Riegel
  1 sibling, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-03-03 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, 2014-02-27 at 11:47 -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > 3.      The comparison was against another RCU-protected pointer,
> >         where that other pointer was properly fetched using one
> >         of the RCU primitives.  Here it doesn't matter which pointer
> >         you use.  At least as long as the rcu_assign_pointer() for
> >         that other pointer happened after the last update to the
> >         pointed-to structure.
> >
> > I am a bit nervous about #3.  Any thoughts on it?
> 
> I think that it might be worth pointing out as an example, and saying
> that code like
> 
>    p = atomic_read(consume);
>    X;
>    q = atomic_read(consume);
>    Y;
>    if (p == q)
>         data = p->val;
> 
> then the access of "p->val" is constrained to be data-dependent on
> *either* p or q, but you can't really tell which, since the compiler
> can decide that the values are interchangeable.

The wording I proposed would make the p dereference have a value
dependency unless X and Y somehow restrict p and q.  The reasoning
is that if the atomic loads can potentially return more than one value,
then even if we find out that two such loads returned the same value, we
still don't know what that exact value was.
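
Concretely, sticking with the pseudocode from the quoted snippet, and
assuming the stores could have published either of two pointers A and B:

   p = atomic_read(consume);   /* could have returned A or B */
   X;
   q = atomic_read(consume);   /* could have returned A or B */
   Y;
   if (p == q)
        data = p->val;  /* p == q only tells us the values are equal;
                           the common value could still be either A or
                           B, so the dereference keeps its value
                           dependency */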



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 17:50                                                                                                               ` Paul E. McKenney
  2014-02-27 19:22                                                                                                                 ` Paul E. McKenney
  2014-02-28  1:02                                                                                                                 ` Paul E. McKenney
@ 2014-03-03 19:01                                                                                                                 ` Torvald Riegel
  2 siblings, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-03-03 19:01 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Thu, 2014-02-27 at 09:50 -0800, Paul E. McKenney wrote:
> Your proposal looks quite promising at first glance.  But rather than
> try and comment on it immediately, I am going to take a number of uses of
> RCU from the Linux kernel and apply your proposal to them, then respond
> with the results.
> 
> Fair enough?

Sure.  Thanks for doing the cross-check!



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-03 18:55                                                                                                                         ` Torvald Riegel
@ 2014-03-03 19:20                                                                                                                           ` Paul E. McKenney
  2014-03-03 20:46                                                                                                                             ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-03-03 19:20 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> 
> On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > +o	Do not use the results from the boolean "&&" and "||" when
> > +	dereferencing.	For example, the following (rather improbable)
> > +	code is buggy:
> > +
> > +		int a[2];
> > +		int index;
> > +		int force_zero_index = 1;
> > +
> > +		...
> > +
> > +		r1 = rcu_dereference(i1);
> > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > +
> > +	The reason this is buggy is that "&&" and "||" are often compiled
> > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > +	do order stores after such branches, they can speculate loads,
> > +	which can result in misordering bugs.
> > +
> > +o	Do not use the results from relational operators ("==", "!=",
> > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > +	the following (quite strange) code is buggy:
> > +
> > +		int a[2];
> > +		int index;
> > +		int flip_index = 0;
> > +
> > +		...
> > +
> > +		r1 = rcu_dereference(i1);
> > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > +
> > +	As before, the reason this is buggy is that relational operators
> > +	are often compiled using branches.  And as before, although
> > +	weak-memory machines such as ARM or PowerPC do order stores
> > +	after such branches, they can speculate loads, which can again
> > +	result in misordering bugs.
> 
> Those two would be allowed by the wording I have recently proposed,
> AFAICS.  r1 != flip_index would result in two possible values (unless
> there are further constraints due to the type of r1 and the values that
> flip_index can have).

And I am OK with the value_dep_preserving type providing more/better
guarantees than we get by default from current compilers.

One question, though.  Suppose that the code did not want a value
dependency to be tracked through a comparison operator.  What does
the developer do in that case?  (The reason I ask is that I have
not yet found a use case in the Linux kernel that expects a value
dependency to be tracked through a comparison.)

> I don't think the wording is flawed.  We could raise the requirement of
> having more than one value left for r1 to having more than N with N > 1
> values left, but the fundamental problem remains in that a compiler
> could try to generate a (big) switch statement.
> 
> Instead, I think that this indicates that the value_dep_preserving type
> modifier would be useful: It would tell the compiler that it shouldn't
> transform this into a branch in this case, yet allow that optimization
> for all other code.

Understood!

BTW, my current task is generating examples using the value_dep_preserving
type for RCU-protected array indexes.
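
For example, something like this (again making up the vdp syntax, and
reusing i1 and a[] from the documentation patch above):

	int value_dep_preserving r1;

	r1 = rcu_dereference(i1);  /* RCU-protected array index */
	r2 = a[r1];		   /* vdp preserves the dependency from
				      the index load to this access */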

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-02 10:05                                                                                                                             ` Peter Sewell
  2014-03-02 23:20                                                                                                                               ` Paul E. McKenney
@ 2014-03-03 20:44                                                                                                                               ` Torvald Riegel
  2014-03-04 22:11                                                                                                                                 ` Peter Sewell
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-03-03 20:44 UTC (permalink / raw)
  To: Peter.Sewell
  Cc: Paul McKenney, Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
> On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> Hi Paul,
> >>
> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> > <paulmck@linux.vnet.ibm.com> wrote:
> >> >> > >
> >> >> > > 3.      The comparison was against another RCU-protected pointer,
> >> >> > >         where that other pointer was properly fetched using one
> >> >> > >         of the RCU primitives.  Here it doesn't matter which pointer
> >> >> > >         you use.  At least as long as the rcu_assign_pointer() for
> >> >> > >         that other pointer happened after the last update to the
> >> >> > >         pointed-to structure.
> >> >> > >
> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >
> >> >> > I think that it might be worth pointing out as an example, and saying
> >> >> > that code like
> >> >> >
> >> >> >    p = atomic_read(consume);
> >> >> >    X;
> >> >> >    q = atomic_read(consume);
> >> >> >    Y;
> >> >> >    if (p == q)
> >> >> >         data = p->val;
> >> >> >
> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> > *either* p or q, but you can't really tell which, since the compiler
> >> >> > can decide that the values are interchangeable.
> >> >> >
> >> >> > I cannot for the life of me come up with a situation where this would
> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> > stronger ordering than anything the consume through "p" would
> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> > irrelevant. No?
> >> >>
> >> >> I can make a contrived litmus test for it, but you are right, the only
> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> >> reorder the loads into p and q, and the read from p->val can, as you say,
> >> >> come from either pointer.
> >> >>
> >> >> For whatever it is worth, here is the litmus test:
> >> >>
> >> >> T1:   p = kmalloc(...);
> >> >>       if (p == NULL)
> >> >>               deal_with_it();
> >> >>       p->a = 42;  /* Each field in its own cache line. */
> >> >>       p->b = 43;
> >> >>       p->c = 44;
> >> >>       atomic_store_explicit(&gp1, p, memory_order_release);
> >> >>       p->b = 143;
> >> >>       p->c = 144;
> >> >>       atomic_store_explicit(&gp2, p, memory_order_release);
> >> >>
> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >>       r1 = p->b;  /* Guaranteed to get 143. */
> >> >>       q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >>       if (p == q) {
> >> >>               /* The compiler decides that q->c is the same as p->c. */
> >> >>               r2 = p->c; /* Could get 44 on a weakly ordered system. */
> >> >>       }
> >> >>
> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> >> you get.
> >> >>
> >> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> trouble anyway.  If you really want to do something like that and still
> >> >> see consistency across all the fields in the structure, please put a lock
> >> >> in the structure and use it to guard updates and accesses to those fields.
> >> >
> >> > And here is a patch documenting the restrictions for the current Linux
> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >
> >> > Thoughts?
> >>
> >> That might serve as informal documentation for linux kernel
> >> programmers about the bounds on the optimisations that you expect
> >> compilers to do for common-case RCU code - and I guess that's what you
> >> intend it to be for.   But I don't see how one can make it precise
> >> enough to serve as a language definition, so that compiler people
> >> could confidently say "yes, we respect that", which I guess is what
> >> you really need.  As a useful criterion, we should aim for something
> >> precise enough that in a verified-compiler context you can
> >> mathematically prove that the compiler will satisfy it  (even though
> >> that won't happen anytime soon for GCC), and that analysis tool
> >> authors can actually know what they're working with.   All this stuff
> >> about "you should avoid cancellation", and "avoid masking with just a
> >> small number of bits" is just too vague.
> >
> > Understood, and yes, this is intended to document current compiler
> > behavior for the Linux kernel community.  It would not make sense to show
> > it to the C11 or C++11 communities, except perhaps as an informational
> > piece on current practice.
> >
> >> The basic problem is that the compiler may be doing sophisticated
> >> reasoning with a bunch of non-local knowledge that it's deduced from
> >> the code, neither of which are well-understood, and here we have to
> >> identify some envelope, expressive enough for RCU idioms, in which
> >> that reasoning doesn't allow data/address dependencies to be removed
> >> (and hence the hardware guarantee about them will be maintained at the
> >> source level).
> >>
> >> The C11 syntactic notion of dependency, whatever its faults, was at
> >> least precise, could be reasoned about locally (just looking at the
> >> syntactic code in question), and did do that.  The fact that current
> >> compilers do optimisations that remove dependencies and will likely
> >> have many bugs at present is besides the point - this was surely
> >> intended as a *new* constraint on what they are allowed to do.  The
> >> interesting question is really whether the compiler writers think that
> >> they *could* implement it in a reasonable way - I'd like to hear
> >> Torvald and his colleagues' opinion on that.
> >>
> >> What you're doing above seems to be basically a very cut-down version
> >> of that, but with a fuzzy boundary.   If you want it to be precise,
> >> maybe it needs to be much simpler (which might force you into ruling
> >> out some current code idioms).
> >
> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806)
> > can be developed to serve this purpose.
> 
> (I missed that mail when it first came past, sorry)
> 
> That's also going to be tricky, I'm afraid.  The key condition there is:
> 
> "* at the time of execution of E, L       [PS: I assume that L is a
> typo and should be E]

No, L was intended. (But see below.)

>         can possibly have returned at
>         least two different values under the assumption that L itself
>         could have returned any value allowed by L's type."
> 
> First, the evaluation of E might be nondeterministic  - e.g., for an
> artificial example, if it's just a nondeterministic value obtained
> from the result of a race on SC atomics.  The above doesn't
> distinguish between that (which doesn't have a real dependency on L)
> and that XOR'd with L (which does).  And it does so in the wrong
> direction: it'll say that the former has a dependency on L.

I'm not quite sure I understand the examples you want to point out
(could you add brief code snippets, perhaps?) -- but the informal
definition I proposed also says that E must have used L in some way to
make its computation.  That's pretty vague, and the way I phrased the
requirement is probably not optimal.

So let me expand a bit on the background first.  What I tried to capture
with the rules is that an evaluation (ie, E) really uses the value
returned by L, and not just, for example, a constant value that can be
inferred from whatever was executed between L and E.  This (is intended
to) prevent value prediction and such by programmers.  The compiler in
turn knows what a program that still wants to rely on the ordering from
the consume is allowed to do.  Basing all of this on the values seemed
to be helpful because basing it purely on syntax didn't seem
implementable; for the latter, we'd need to disallow cases such as x-x
or require compilers to include artificial dependencies if a program
indeed does value speculation.  IOW, it didn't seem possible to draw a
clear distinction between a data dependency defined purely based on
syntax and control dependencies.

Another way to perhaps phrase the requirement might be to construct it
inductively, so that "E uses the value returned by L" becomes clearer.
However, I currently don't really know how to do that without any holes.

For example, let's say that an atomic mo_consume load has a value
dependency on itself; the dependency has an associated set of values,
which is equal to all values allowed by the type (but see also below).
Then, an evaluation A that uses B as an operand has a value dependency on B
if the associated set of values can still have more than one element.
Whether that's the case depends on the operation, and the other
operands, including their values at the time of execution.
However, it's not just results of evaluations that effectively constrain
the set of values.  Conditional execution does as well, because the
condition being true might establish a constraint on L, which might
remove any dependency when L (or a result of a computation involving L)
is used.  So I'm not sure the inductive approach would work.
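
A small made-up example of the conditional-execution problem:

  t = atomic_load_explicit(&src, mo_consume);
  if (t == 42) {
    /* The branch condition constrains t to exactly one value here, so
       this use of t no longer carries a real dependency: the compiler
       may rewrite arr[t] as arr[42]. */
    x = arr[t];
  }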

You said you thought I might have wanted to say that:
"E  really-depends on L if there exist two different values that (just
according to typing) might be read for L that give rise to two
different values for E".

Which should follow from the above, or at least should not conflict with
it.  My "definition" tried to capture that the program must not
establish so many constraints between E and L that the value of L is
clear in the sense of having exactly one value.  If we dereference such
an E whose execution in fact constrains L to one value, there is no real
dependency anymore.
However, by itself, this doesn't cover that E must use L in its
computation -- which your formulation does, I think.

Another way might be to try to define which constraints established by
an execution (based on the executed code's semantics) actually remove
value dependencies on L.

Do you have any suggestions for how to define this (ie, true value
dependencies) in a better way?

> Second, it involves reasoning about counterfactual executions.  That
> doesn't necessarily make it wrong, per se, but probably makes it hard
> to work with.  For example, suppose that in all the actual
> whole-program executions, a runtime occurrence of L only ever returns
> one particular value (perhaps because of some simple #define'd
> configuration)

Right, or just because it's a deterministic program.

This is a problem for the definition of value dependencies, AFAICT,
because where do you put the line for what the compiler is allowed to
speculate about?  When does the program indeed "reveal" the value of a
load, and when does it not?  Is a compiler to do out-of-band tracking
for certain memory locations, and thus predict the value?

Adding the assumption that L, at least conceptually, might return any
value seemed to be a simple way to remove that uncertainty regarding
what the compiler might know about.

> , and that the code used in the evaluation of E depends
> on some invariant which is related to that configuration.

A related issue that I've been thinking about is whether things like
divide-by-zero (ie, operations that give rise to undefined behavior)
should be considered as constraints on the values or not.  I guess they
should, but then this seems closer what you mention, invariants
established by the whole program.

> The
> hypothetical execution used above in which a different value is used
> is one in which the code is being run in a situation with broken invariants.
>    Then there will be technical difficulties in using the definition:
> I don't see how one would persuade oneself that a compiler always
> satisfies it, because these hypothetical executions are far removed
> from what it's actually working on.

I think the above is easy to implement for a compiler.  At the
mo_consume load, you simply forget any information you have about this
value / memory location; for operations on value_dep_preserving types,
you ignore any out-of-band information that might originate from code
run before the load (IOW, you take a fresh start on each load in terms
of analysis info).  Having the requirement that the program indeed needs
to have a value dependency should allow the compiler to run normal
optimizations on the code.

I'd really appreciate any feedback or alternative suggestions for how to
formulate the requirements more precisely.  I think many of us kind of
agree about the general approach we'd like to try, but specifying this
precisely is another step.



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-03 19:20                                                                                                                           ` Paul E. McKenney
@ 2014-03-03 20:46                                                                                                                             ` Torvald Riegel
  2014-03-04 19:00                                                                                                                               ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-03-03 20:46 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > 
> > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > +o	Do not use the results from the boolean "&&" and "||" when
> > > +	dereferencing.	For example, the following (rather improbable)
> > > +	code is buggy:
> > > +
> > > +		int a[2];
> > > +		int index;
> > > +		int force_zero_index = 1;
> > > +
> > > +		...
> > > +
> > > +		r1 = rcu_dereference(i1);
> > > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > +
> > > +	The reason this is buggy is that "&&" and "||" are often compiled
> > > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > > +	do order stores after such branches, they can speculate loads,
> > > +	which can result in misordering bugs.
> > > +
> > > +o	Do not use the results from relational operators ("==", "!=",
> > > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > > +	the following (quite strange) code is buggy:
> > > +
> > > +		int a[2];
> > > +		int index;
> > > +		int flip_index = 0;
> > > +
> > > +		...
> > > +
> > > +		r1 = rcu_dereference(i1);
> > > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > +
> > > +	As before, the reason this is buggy is that relational operators
> > > +	are often compiled using branches.  And as before, although
> > > +	weak-memory machines such as ARM or PowerPC do order stores
> > > +	after such branches, they can speculate loads, which can again
> > > +	result in misordering bugs.
> > 
> > Those two would be allowed by the wording I have recently proposed,
> > AFAICS.  r1 != flip_index would result in two possible values (unless
> > there are further constraints due to the type of r1 and the values that
> > flip_index can have).
> 
> And I am OK with the value_dep_preserving type providing more/better
> guarantees than we get by default from current compilers.
> 
> One question, though.  Suppose that the code did not want a value
> dependency to be tracked through a comparison operator.  What does
> the developer do in that case?  (The reason I ask is that I have
> not yet found a use case in the Linux kernel that expects a value
> dependency to be tracked through a comparison.)

Hmm.  I suppose use an explicit cast to non-vdp before or after the
comparison?
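
For example (made-up syntax and names again):

  struct foo value_dep_preserving *p =
    atomic_load_explicit(&gp, mo_consume);
  struct foo *q = (struct foo *)p;  /* the cast strips the vdp qualifier */
  if (q == cached_ptr)              /* so no value dependency is tracked */
    do_something();                 /* through this comparison */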



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-03 20:46                                                                                                                             ` Torvald Riegel
@ 2014-03-04 19:00                                                                                                                               ` Paul E. McKenney
  2014-03-04 21:35                                                                                                                                 ` Paul E. McKenney
  2014-03-05 16:26                                                                                                                                 ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-03-04 19:00 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> 
> On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > 
> > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > +o	Do not use the results from the boolean "&&" and "||" when
> > > > +	dereferencing.	For example, the following (rather improbable)
> > > > +	code is buggy:
> > > > +
> > > > +		int a[2];
> > > > +		int index;
> > > > +		int force_zero_index = 1;
> > > > +
> > > > +		...
> > > > +
> > > > +		r1 = rcu_dereference(i1);
> > > > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > +
> > > > +	The reason this is buggy is that "&&" and "||" are often compiled
> > > > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > > > +	do order stores after such branches, they can speculate loads,
> > > > +	which can result in misordering bugs.
> > > > +
> > > > +o	Do not use the results from relational operators ("==", "!=",
> > > > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > +	the following (quite strange) code is buggy:
> > > > +
> > > > +		int a[2];
> > > > +		int index;
> > > > +		int flip_index = 0;
> > > > +
> > > > +		...
> > > > +
> > > > +		r1 = rcu_dereference(i1);
> > > > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > +
> > > > +	As before, the reason this is buggy is that relational operators
> > > > +	are often compiled using branches.  And as before, although
> > > > +	weak-memory machines such as ARM or PowerPC do order stores
> > > > +	after such branches, they can speculate loads, which can again
> > > > +	result in misordering bugs.
> > > 
> > > Those two would be allowed by the wording I have recently proposed,
> > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > there are further constraints due to the type of r1 and the values that
> > > flip_index can have).
> > 
> > And I am OK with the value_dep_preserving type providing more/better
> > guarantees than we get by default from current compilers.
> > 
> > One question, though.  Suppose that the code did not want a value
> > dependency to be tracked through a comparison operator.  What does
> > the developer do in that case?  (The reason I ask is that I have
> > not yet found a use case in the Linux kernel that expects a value
> > dependency to be tracked through a comparison.)
> 
> Hmm.  I suppose use an explicit cast to non-vdp before or after the
> comparison?

That should work well assuming that things like "if", "while", and "?:"
conditions are happy to take a vdp.  This assumes that p->a only returns
vdp if field "a" is declared vdp, otherwise we have vdps running wild
through the program.  ;-)
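
For example, making up the syntax for a vdp-declared field:

	struct foo {
		int a;					/* plain field */
		struct bar value_dep_preserving *next;	/* vdp field */
	};

	struct foo value_dep_preserving *p;

	p = atomic_load_explicit(&gp, memory_order_consume);
	r1 = p->a;	/* plain int, so no vdp runs wild from here */
	q = p->next;	/* declared vdp, so the dependency continues */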

The other thing that can happen is that a vdp can get handed off to
another synchronization mechanism, for example, to reference counting:

	p = atomic_load_explicit(&gp, memory_order_consume);
	if (do_something_with(p->a)) {
		/* fast path protected by RCU. */
		return 0;
	}
	if (atomic_inc_not_zero(&p->refcnt)) {
		/* slow path protected by reference counting. */
		return do_something_else_with((struct foo *)p);  /* CHANGE */
	}
	/* Needed slow path, but raced with deletion. */
	return -EAGAIN;

I am guessing that the cast ends the vdp.  Is that the case?

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-04 19:00                                                                                                                               ` Paul E. McKenney
@ 2014-03-04 21:35                                                                                                                                 ` Paul E. McKenney
  2014-03-05 16:54                                                                                                                                   ` Torvald Riegel
  2014-03-05 16:26                                                                                                                                 ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-03-04 21:35 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > 
> > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > +o	Do not use the results from the boolean "&&" and "||" when
> > > > > +	dereferencing.	For example, the following (rather improbable)
> > > > > +	code is buggy:
> > > > > +
> > > > > +		int a[2];
> > > > > +		int index;
> > > > > +		int force_zero_index = 1;
> > > > > +
> > > > > +		...
> > > > > +
> > > > > +		r1 = rcu_dereference(i1);
> > > > > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > +
> > > > > +	The reason this is buggy is that "&&" and "||" are often compiled
> > > > > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > +	do order stores after such branches, they can speculate loads,
> > > > > +	which can result in misordering bugs.
> > > > > +
> > > > > +o	Do not use the results from relational operators ("==", "!=",
> > > > > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > +	the following (quite strange) code is buggy:
> > > > > +
> > > > > +		int a[2];
> > > > > +		int index;
> > > > > +		int flip_index = 0;
> > > > > +
> > > > > +		...
> > > > > +
> > > > > +		r1 = rcu_dereference(i1);
> > > > > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > +
> > > > > +	As before, the reason this is buggy is that relational operators
> > > > > +	are often compiled using branches.  And as before, although
> > > > > +	weak-memory machines such as ARM or PowerPC do order stores
> > > > > +	after such branches, they can speculate loads, which can again
> > > > > +	result in misordering bugs.
> > > > 
> > > > Those two would be allowed by the wording I have recently proposed,
> > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > there are further constraints due to the type of r1 and the values that
> > > > flip_index can have).
> > > 
> > > And I am OK with the value_dep_preserving type providing more/better
> > > guarantees than we get by default from current compilers.
> > > 
> > > One question, though.  Suppose that the code did not want a value
> > > dependency to be tracked through a comparison operator.  What does
> > > the developer do in that case?  (The reason I ask is that I have
> > > not yet found a use case in the Linux kernel that expects a value
> > > dependency to be tracked through a comparison.)
> > 
> > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > comparison?
> 
> That should work well assuming that things like "if", "while", and "?:"
> conditions are happy to take a vdp.  This assumes that p->a only returns
> vdp if field "a" is declared vdp, otherwise we have vdps running wild
> through the program.  ;-)
> 
> The other thing that can happen is that a vdp can get handed off to
> another synchronization mechanism, for example, to reference counting:
> 
> 	p = atomic_load_explicit(&gp, memory_order_consume);
> 	if (do_something_with(p->a)) {
> 		/* fast path protected by RCU. */
> 		return 0;
> 	}
> 	if (atomic_inc_not_zero(&p->refcnt)) {
> 		/* slow path protected by reference counting. */
> 		return do_something_else_with((struct foo *)p);  /* CHANGE */
> 	}
> 	/* Needed slow path, but raced with deletion. */
> 	return -EAGAIN;
> 
> I am guessing that the cast ends the vdp.  Is that the case?

And here is a more elaborate example from the Linux kernel:

	struct md_rdev value_dep_preserving *rdev;  /* CHANGE */

	rdev = rcu_dereference(conf->mirrors[disk].rdev);
	if (r1_bio->bios[disk] == IO_BLOCKED
	    || rdev == NULL
	    || test_bit(Unmerged, &rdev->flags)
	    || test_bit(Faulty, &rdev->flags))
		continue;

The fact that the "rdev == NULL" returns vdp does not force the "||"
operators to be evaluated arithmetically because the entire expression
is an "if" condition, correct?

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-03 20:44                                                                                                                               ` Torvald Riegel
@ 2014-03-04 22:11                                                                                                                                 ` Peter Sewell
  2014-03-05 17:15                                                                                                                                   ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Peter Sewell @ 2014-03-04 22:11 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On 3 March 2014 20:44, Torvald Riegel <triegel@redhat.com> wrote:
> On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
>> On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
>> >> Hi Paul,
>> >>
>> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >> >> > <paulmck@linux.vnet.ibm.com> wrote:
>> >> >> > >
>> >> >> > > 3.      The comparison was against another RCU-protected pointer,
>> >> >> > >         where that other pointer was properly fetched using one
>> >> >> > >         of the RCU primitives.  Here it doesn't matter which pointer
>> >> >> > >         you use.  At least as long as the rcu_assign_pointer() for
>> >> >> > >         that other pointer happened after the last update to the
>> >> >> > >         pointed-to structure.
>> >> >> > >
>> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
>> >> >> >
>> >> >> > I think that it might be worth pointing out as an example, and saying
>> >> >> > that code like
>> >> >> >
>> >> >> >    p = atomic_read(consume);
>> >> >> >    X;
>> >> >> >    q = atomic_read(consume);
>> >> >> >    Y;
>> >> >> >    if (p == q)
>> >> >> >         data = p->val;
>> >> >> >
>> >> >> > then the access of "p->val" is constrained to be data-dependent on
>> >> >> > *either* p or q, but you can't really tell which, since the compiler
>> >> >> > can decide that the values are interchangeable.
>> >> >> >
>> >> >> > I cannot for the life of me come up with a situation where this would
>> >> >> > matter, though. If "X" contains a fence, then that fence will be a
>> >> >> > stronger ordering than anything the consume through "p" would
>> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
>> >> >> > ordering to the access through "p" is through p or q is kind of
>> >> >> > irrelevant. No?
>> >> >>
>> >> >> I can make a contrived litmus test for it, but you are right, the only
>> >> >> time you can see it happen is when X has no barriers, in which case
>> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
>> >> >> reorder the loads into p and q, and the read from p->val can, as you say,
>> >> >> come from either pointer.
>> >> >>
>> >> >> For whatever it is worth, here is the litmus test:
>> >> >>
>> >> >> T1:   p = kmalloc(...);
>> >> >>       if (p == NULL)
>> >> >>               deal_with_it();
>> >> >>       p->a = 42;  /* Each field in its own cache line. */
>> >> >>       p->b = 43;
>> >> >>       p->c = 44;
>> >> >>       atomic_store_explicit(&gp1, p, memory_order_release);
>> >> >>       p->b = 143;
>> >> >>       p->c = 144;
>> >> >>       atomic_store_explicit(&gp2, p, memory_order_release);
>> >> >>
>> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>> >> >>       r1 = p->b;  /* Guaranteed to get 143. */
>> >> >>       q = atomic_load_explicit(&gp1, memory_order_consume);
>> >> >>       if (p == q) {
>> >> >>               /* The compiler decides that q->c is the same as p->c. */
>> >> >>               r2 = p->c; /* Could get 44 on a weakly ordered system. */
>> >> >>       }
>> >> >>
>> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> >> >> you get.
>> >> >>
>> >> >> And publishing a structure via one RCU-protected pointer, updating it,
>> >> >> then publishing it via another pointer seems to me to be asking for
>> >> >> trouble anyway.  If you really want to do something like that and still
>> >> >> see consistency across all the fields in the structure, please put a lock
>> >> >> in the structure and use it to guard updates and accesses to those fields.
>> >> >
>> >> > And here is a patch documenting the restrictions for the current Linux
>> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
>> >> > differently than atomic_load_explicit(&p, memory_order_consume).
>> >> >
>> >> > Thoughts?
>> >>
>> >> That might serve as informal documentation for linux kernel
>> >> programmers about the bounds on the optimisations that you expect
>> >> compilers to do for common-case RCU code - and I guess that's what you
>> >> intend it to be for.   But I don't see how one can make it precise
>> >> enough to serve as a language definition, so that compiler people
>> >> could confidently say "yes, we respect that", which I guess is what
>> >> you really need.  As a useful criterion, we should aim for something
>> >> precise enough that in a verified-compiler context you can
>> >> mathematically prove that the compiler will satisfy it  (even though
>> >> that won't happen anytime soon for GCC), and that analysis tool
>> >> authors can actually know what they're working with.   All this stuff
>> >> about "you should avoid cancellation", and "avoid masking with just a
>> >> small number of bits" is just too vague.
>> >
>> > Understood, and yes, this is intended to document current compiler
>> > behavior for the Linux kernel community.  It would not make sense to show
>> > it to the C11 or C++11 communities, except perhaps as an informational
>> > piece on current practice.
>> >
>> >> The basic problem is that the compiler may be doing sophisticated
>> >> reasoning with a bunch of non-local knowledge that it's deduced from
>> >> the code, neither of which are well-understood, and here we have to
>> >> identify some envelope, expressive enough for RCU idioms, in which
>> >> that reasoning doesn't allow data/address dependencies to be removed
>> >> (and hence the hardware guarantee about them will be maintained at the
>> >> source level).
>> >>
>> >> The C11 syntactic notion of dependency, whatever its faults, was at
>> >> least precise, could be reasoned about locally (just looking at the
>> >> syntactic code in question), and did do that.  The fact that current
>> >> compilers do optimisations that remove dependencies and will likely
>> >> have many bugs at present is beside the point - this was surely
>> >> intended as a *new* constraint on what they are allowed to do.  The
>> >> interesting question is really whether the compiler writers think that
>> >> they *could* implement it in a reasonable way - I'd like to hear
>> >> Torvald and his colleagues' opinion on that.
>> >>
>> >> What you're doing above seems to be basically a very cut-down version
>> >> of that, but with a fuzzy boundary.   If you want it to be precise,
>> >> maybe it needs to be much simpler (which might force you into ruling
>> >> out some current code idioms).
>> >
>> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806)
>> > can be developed to serve this purpose.
>>
>> (I missed that mail when it first came past, sorry)
>>
>> That's also going to be tricky, I'm afraid.  The key condition there is:
>>
>> "* at the time of execution of E, L       [PS: I assume that L is a
>> typo and should be E]
>
> No, L was intended. (But see below.)

then I misunderstood...

>>         can possibly have returned at
>>         least two different values under the assumption that L itself
>>         could have returned any value allowed by L's type."
>>
>> First, the evaluation of E might be nondeterministic  - e.g., for an
>> artificial example, if it's just a nondeterministic value obtained
>> from the result of a race on SC atomics.  The above doesn't
>> distinguish between that (which doesn't have a real dependency on L)
>> and that XOR'd with L (which does).  And it does so in the wrong
>> direction: it'll say that the former has a dependency on L.
>
> I'm not quite sure I understand the examples you want to point out
> (could you add brief code snippets, perhaps?) -- but the informal
> definition I proposed also says that E must have used L in some way to
> make its computation.  That's pretty vague, and the way I phrased the
> requirement is probably not optimal.

...so this example isn't so relevant, but it's maybe interesting
anyway. In free-wheeling pseudocode:

x = read_consume(L)     // a memory_order_consume read of a 1-bit value from L
y = nondet()            // some code that nondeterministically produces
                        // another 1-bit value, eg by spawning two threads
                        // that each do an SC-atomic write to some other
                        // location, returning 0 or 1 depending on which wins

then evaluate either just

z=y

or

z = x XOR y

In my misinterpretation of what you wrote, your definition would say
there's a dependency from the load of L to the evaluation of y, even
though there isn't.
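
To make this concrete, here is roughly the same example in C11 atomics
(a sketch only; all names, and the toy nondet() built from C11 threads,
are mine and purely illustrative):

	#include <stdatomic.h>
	#include <stdint.h>
	#include <threads.h>

	static _Atomic(int) L;       /* the consume-loaded 1-bit location */
	static _Atomic(int) winner;  /* racy SC location: the nondeterminism */

	static int racer(void *arg)
	{
		atomic_store(&winner, (int)(intptr_t)arg);  /* SC store */
		return 0;
	}

	static int nondet(void)  /* returns 0 or 1, nondeterministically */
	{
		thrd_t a, b;
		thrd_create(&a, racer, (void *)0);
		thrd_create(&b, racer, (void *)1);
		thrd_join(a, NULL);
		thrd_join(b, NULL);
		return atomic_load(&winner);  /* whichever store landed last */
	}

	void example(void)
	{
		int x = atomic_load_explicit(&L, memory_order_consume);
		int y = nondet();  /* never uses x: no dependency on L */
		int z1 = y;        /* likewise: no dependency on L */
		int z2 = x ^ y;    /* uses x's value: does depend on L */
		(void)z1; (void)z2;
	}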


> So let me expand a bit on the background first.  What I tried to capture
> with the rules is that an evaluation (ie, E), really uses the value
> returned by L, and not just a, for example, constant value that can be
> inferred from whatever was executed between L and E.  This (is intended
> to) prevent value prediction and such by programmers.  The compiler in
> turn knows what a real program is allowed to do that still wants to rely
> on the ordering by the consume.  Basing all of this on the values seemed
> to be helpful because basing it on *purely* syntax didn't seem
> implementable; for the latter, we'd need to disallow cases such as x-x
> or require compilers to include artificial dependencies if a program
> indeed does value speculation.  IOW, it didn't seem possible to draw a
> clear distinction between a data dependency defined purely based on
> syntax and control dependencies.
>
> Another way to perhaps phrase the requirement might be to construct it
> inductively, so that "E uses the value returned by L" becomes clearer.
> However, I currently don't really know how to do that without any holes.
>
> For example, let's say that an atomic mo_consume load has a value
> dependency to itself; the dependency has an associated set of values,
> which is equal to all values allowed by the type (but see also below).
> Then, an evaluation A that uses B as operand has a value dependency on B
> if the associated set of values can still have more than one element.
> Whether that's the case depends on the operation, and the other
> operands, including their values at the time of execution.
> However, it's not just results of evaluations that effectively constrain
> the set of values.  Conditional execution does as well, because the
> condition being true might establish a constraint on L, which might
> remove any dependency when L (or a result of a computation involving L)
> is used.  So I'm not sure the inductive approach would work.
>
> You said you thought I might have wanted to say that:
> "E  really-depends on L if there exist two different values that (just
> according to typing) might be read for L that give rise to two
> different values for E".
>
> Which should follow from the above, or at least should not conflict with
> it.  My "definition" tried to capture that the program must not
> establish so many constraints between E and L that the value of L is
> clear in the sense of having exactly one value.  If we dereference such
> an E whose execution in fact constrains L to one value, there is no real
> dependency anymore.
> However, by itself, this doesn't cover that E must use L in its
> computation -- which your formulation does, I think.

unfortunately not - that's what the example above shows.

> Another way might be to try to define which constraints established by
> an execution (based on the executed code's semantics) actually remove
> value dependencies on L.
>
> Do you have any suggestions for how to define this (ie, true value
> dependencies) in a better way?

not at the moment.  I need to think some more about your vdps, though.
 As I understand it, you're agreeing with the general intent of the
original C11 design that some new mechanism to tell the compiler not
to optimise in certain places is required, but differing in the way
you identify those?


>> Second, it involves reasoning about counterfactual executions.  That
>> doesn't necessarily make it wrong, per se, but probably makes it hard
>> to work with.  For example, suppose that in all the actual
>> whole-program executions, a runtime occurrence of L only ever returns
>> one particular value (perhaps because of some simple #define'd
>> configuration)
>
> Right, or just because it's a deterministic program.
>
> This is a problem for the definition of value dependencies, AFAICT,
> because where do you put the line for what the compiler is allowed to
> speculate about?  When does the program indeed "reveal" the value of a
> load, and when does it not?  Is a compiler to do out-of-band tracking
> for certain memory locations, and thus predict the value?
>
> Adding the assumption that L, at least conceptually, might return any
> value seemed to be a simple way to remove that uncertainty regarding
> what the compiler might know about.

...but it brings in a lot of baggage.

>> , and that the code used in the evaluation of E depends
>> on some invariant which is related to that configuration.
>
> A related issue that I've been thinking about is whether things like
> divide-by-zero (ie, operations that give rise to undefined behavior)
> should be considered as constraints on the values or not.  I guess they
> should, but then this seems closer to what you mention, invariants
> established by the whole program.
>
>> The
>> hypothetical execution used above in which a different value is used
>> is one in which the code is being run in a situation with broken invariants.
>>    Then there will be technical difficulties in using the definition:
>> I don't see how one would persuade oneself that a compiler always
>> satisfies it, because these hypothetical executions are far removed
>> from what it's actually working on.
>
> I think the above is easy to implement for a compiler.  At the
> mo_consume load, you simply forget any information you have about this
> value / memory location; for operations on value_dep_preserving types,

are all the subexpressions likewise forced to be of vdp types?

> you ignore any out-of-band information that might originate from code
> run before the load (IOW, you take a fresh start on each load in terms
> of analysis info).  Having the requirement that the program indeed needs
> to have a value dependency should allow the compiler to run normal
> optimizations on the code.
>
> I'd really appreciate any feedback or alternative suggestions for how to
> formulate the requirements more precisely.  I think many of us kind of
> agree about the general approach we'd like to try, but specifying this
> precisely is another step.
>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-04 19:00                                                                                                                               ` Paul E. McKenney
  2014-03-04 21:35                                                                                                                                 ` Paul E. McKenney
@ 2014-03-05 16:26                                                                                                                                 ` Torvald Riegel
  2014-03-05 18:01                                                                                                                                   ` Paul E. McKenney
  1 sibling, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-03-05 16:26 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > 
> > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > +o	Do not use the results from the boolean "&&" and "||" when
> > > > > +	dereferencing.	For example, the following (rather improbable)
> > > > > +	code is buggy:
> > > > > +
> > > > > +		int a[2];
> > > > > +		int index;
> > > > > +		int force_zero_index = 1;
> > > > > +
> > > > > +		...
> > > > > +
> > > > > +		r1 = rcu_dereference(i1);
> > > > > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > +
> > > > > +	The reason this is buggy is that "&&" and "||" are often compiled
> > > > > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > +	do order stores after such branches, they can speculate loads,
> > > > > +	which can result in misordering bugs.
> > > > > +
> > > > > +o	Do not use the results from relational operators ("==", "!=",
> > > > > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > +	the following (quite strange) code is buggy:
> > > > > +
> > > > > +		int a[2];
> > > > > +		int index;
> > > > > +		int flip_index = 0;
> > > > > +
> > > > > +		...
> > > > > +
> > > > > +		r1 = rcu_dereference(i1);
> > > > > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > +
> > > > > +	As before, the reason this is buggy is that relational operators
> > > > > +	are often compiled using branches.  And as before, although
> > > > > +	weak-memory machines such as ARM or PowerPC do order stores
> > > > > +	after such branches, they can speculate loads, which can again
> > > > > +	result in misordering bugs.
> > > > 
> > > > Those two would be allowed by the wording I have recently proposed,
> > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > there are further constraints due to the type of r1 and the values that
> > > > flip_index can have).
> > > 
> > > And I am OK with the value_dep_preserving type providing more/better
> > > guarantees than we get by default from current compilers.
> > > 
> > > One question, though.  Suppose that the code did not want a value
> > > dependency to be tracked through a comparison operator.  What does
> > > the developer do in that case?  (The reason I ask is that I have
> > > not yet found a use case in the Linux kernel that expects a value
> > > dependency to be tracked through a comparison.)
> > 
> > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > comparison?
> 
> That should work well assuming that things like "if", "while", and "?:"
> conditions are happy to take a vdp.

I currently don't see a reason why that should be disallowed.  If we
have allowed an implicit conversion to non-vdp, I believe that should
follow.  ?: could be somewhat special, in that the type depends on the
2nd and 3rd operand.  Thus, "vdp x = non-vdp ? vdp : vdp;" should be
allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be
disallowed if we don't provide for implicit casts from non-vdp to vdp.
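
A tiny sketch of what I mean (the value_dep_preserving annotation and
all of the names here are hypothetical, purely for illustration):

	struct foo value_dep_preserving *p = rcu_dereference(gp);
	struct foo *plain = get_plain_foo();  /* non-vdp */
	int c = some_condition();             /* non-vdp */

	struct foo value_dep_preserving *x1 = c ? p : p;      /* OK: both
	                                         legs are vdp */
	struct foo value_dep_preserving *x2 = c ? plain : p;  /* needs an
	                                         implicit non-vdp -> vdp
	                                         cast for the 2nd operand */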

> This assumes that p->a only returns
> vdp if field "a" is declared vdp, otherwise we have vdps running wild
> through the program.  ;-)

That's a good question.  For the scheme I had in mind, I'm not concerned
about vdps running wild because one needs to assign to explicitly
vdp-typed variables (or function arguments, etc.) to let vdp extend to
beyond single expressions.

Nonetheless, I think it's a good question how -> should behave if the
field is not vdp; in particular, should vdp->non_vdp be automatically
vdp?  One concern might be that we know something about non-vdp -- OTOH,
we shouldn't be able to do so because we (assume to) don't know anything
about the vdp pointer, so we can't infer anything about what it
points to.

> The other thing that can happen is that a vdp can get handed off to
> another synchronization mechanism, for example, to reference counting:
> 
> 	p = atomic_load_explicit(&gp, memory_order_consume);
> 	if (do_something_with(p->a)) {
> 		/* fast path protected by RCU. */
> 		return 0;
> 	}
> 	if (atomic_inc_not_zero(&p->refcnt)) {

Is the argument to atomic_inc_not_zero vdp or non-vdp?

> 		/* slow path protected by reference counting. */
> 		return do_something_else_with((struct foo *)p);  /* CHANGE */
> 	}
> 	/* Needed slow path, but raced with deletion. */
> 	return -EAGAIN;
> 
> I am guessing that the cast ends the vdp.  Is that the case?

That would end it, yes.  The other way this could happen is that the
argument of do_something_else_with() would be specified to be non-vdp.
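
That is, roughly (hypothetical annotation again):

	long do_something_else_with(struct foo *q);  /* non-vdp parameter */

	return do_something_else_with(p);  /* passing the vdp-typed p to a
	                                      non-vdp parameter ends the
	                                      dependency chain, just as the
	                                      explicit cast does */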


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-04 21:35                                                                                                                                 ` Paul E. McKenney
@ 2014-03-05 16:54                                                                                                                                   ` Torvald Riegel
  2014-03-05 18:15                                                                                                                                     ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-03-05 16:54 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > 
> > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > 
> > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > +o	Do not use the results from the boolean "&&" and "||" when
> > > > > > +	dereferencing.	For example, the following (rather improbable)
> > > > > > +	code is buggy:
> > > > > > +
> > > > > > +		int a[2];
> > > > > > +		int index;
> > > > > > +		int force_zero_index = 1;
> > > > > > +
> > > > > > +		...
> > > > > > +
> > > > > > +		r1 = rcu_dereference(i1);
> > > > > > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +	The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > +	do order stores after such branches, they can speculate loads,
> > > > > > +	which can result in misordering bugs.
> > > > > > +
> > > > > > +o	Do not use the results from relational operators ("==", "!=",
> > > > > > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > +	the following (quite strange) code is buggy:
> > > > > > +
> > > > > > +		int a[2];
> > > > > > +		int index;
> > > > > > +		int flip_index = 0;
> > > > > > +
> > > > > > +		...
> > > > > > +
> > > > > > +		r1 = rcu_dereference(i1);
> > > > > > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +	As before, the reason this is buggy is that relational operators
> > > > > > +	are often compiled using branches.  And as before, although
> > > > > > +	weak-memory machines such as ARM or PowerPC do order stores
> > > > > > +	after such branches, they can speculate loads, which can again
> > > > > > +	result in misordering bugs.
> > > > > 
> > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > there are further constraints due to the type of r1 and the values that
> > > > > flip_index can have).
> > > > 
> > > > And I am OK with the value_dep_preserving type providing more/better
> > > > guarantees than we get by default from current compilers.
> > > > 
> > > > One question, though.  Suppose that the code did not want a value
> > > > dependency to be tracked through a comparison operator.  What does
> > > > the developer do in that case?  (The reason I ask is that I have
> > > > not yet found a use case in the Linux kernel that expects a value
> > > > dependency to be tracked through a comparison.)
> > > 
> > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > comparison?
> > 
> > That should work well assuming that things like "if", "while", and "?:"
> > conditions are happy to take a vdp.  This assumes that p->a only returns
> > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > through the program.  ;-)
> > 
> > The other thing that can happen is that a vdp can get handed off to
> > another synchronization mechanism, for example, to reference counting:
> > 
> > 	p = atomic_load_explicit(&gp, memory_order_consume);
> > 	if (do_something_with(p->a)) {
> > 		/* fast path protected by RCU. */
> > 		return 0;
> > 	}
> > 	if (atomic_inc_not_zero(&p->refcnt)) {
> > 		/* slow path protected by reference counting. */
> > 		return do_something_else_with((struct foo *)p);  /* CHANGE */
> > 	}
> > 	/* Needed slow path, but raced with deletion. */
> > 	return -EAGAIN;
> > 
> > I am guessing that the cast ends the vdp.  Is that the case?
> 
> And here is a more elaborate example from the Linux kernel:
> 
> 	struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
> 
> 	rdev = rcu_dereference(conf->mirrors[disk].rdev);
> 	if (r1_bio->bios[disk] == IO_BLOCKED
> 	    || rdev == NULL
> 	    || test_bit(Unmerged, &rdev->flags)
> 	    || test_bit(Faulty, &rdev->flags))
> 		continue;
> 
> The fact that the "rdev == NULL" returns vdp does not force the "||"
> operators to be evaluated arithmetically because the entire function
> is an "if" condition, correct?

That's a good question, and one that as far as I understand currently,
essentially boils down to whether we want to have tight restrictions on
which operations are still vdp.

If we look at the different combinations, then it seems we can't decide
on whether we have a value-dependency just due to a vdp type:
* non-vdp || vdp:  vdp iff non-vdp == false
* vdp || non-vdp:  vdp iff non-vdp == false?
* vdp || vdp: always vdp? (and dependency on both?)

I'm not sure it makes sense to try to not make all of those
vdp-by-default.  The first and second case show that it's dependent on
the specific execution anyway, and thus is already covered by the
requirement that the value must still matter.   The vdp type is just a
way to prevent inappropriate compiler optimizations; it's not critical
for correctness if we make more stuff vdp, yet it may prevent some
optimizations in the affected expression.
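
For concreteness, reusing the names from the documentation patch (and
the hypothetical vdp annotation):

	int a[2];
	int value_dep_preserving r1 = rcu_dereference(i1);
	int stop = some_plain_flag();  /* non-vdp */

	r2 = a[stop || r1];
	/* stop != 0: the index is 1 no matter what r1 is, so this
	 * execution carries no value dependency on r1;
	 * stop == 0: the index is (r1 != 0), which does depend on r1. */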

If the compiler knows that some vdp-typed evaluation will not have a
value-dependency anyway, then it can just optimize this evaluation like
non-vdp code.

I guess not much would change for the code you posted, because we
already have to evaluate || operands in order, I believe (e.g., don't
access rdev->flags before doing the rdev == NULL check, modulo as-if).
Do I understand your question correctly?



^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-04 22:11                                                                                                                                 ` Peter Sewell
@ 2014-03-05 17:15                                                                                                                                   ` Torvald Riegel
  2014-03-05 18:37                                                                                                                                     ` Peter Sewell
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-03-05 17:15 UTC (permalink / raw)
  To: Peter.Sewell
  Cc: Paul McKenney, Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Tue, 2014-03-04 at 22:11 +0000, Peter Sewell wrote:
> On 3 March 2014 20:44, Torvald Riegel <triegel@redhat.com> wrote:
> > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
> >> On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> >> Hi Paul,
> >> >>
> >> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> >> > <paulmck@linux.vnet.ibm.com> wrote:
> >> >> >> > >
> >> >> >> > > 3.      The comparison was against another RCU-protected pointer,
> >> >> >> > >         where that other pointer was properly fetched using one
> >> >> >> > >         of the RCU primitives.  Here it doesn't matter which pointer
> >> >> >> > >         you use.  At least as long as the rcu_assign_pointer() for
> >> >> >> > >         that other pointer happened after the last update to the
> >> >> >> > >         pointed-to structure.
> >> >> >> > >
> >> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >> >
> >> >> >> > I think that it might be worth pointing out as an example, and saying
> >> >> >> > that code like
> >> >> >> >
> >> >> >> >    p = atomic_read(consume);
> >> >> >> >    X;
> >> >> >> >    q = atomic_read(consume);
> >> >> >> >    Y;
> >> >> >> >    if (p == q)
> >> >> >> >         data = p->val;
> >> >> >> >
> >> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> >> > *either* p or q, but you can't really tell which, since the compiler
> >> >> >> > can decide that the values are interchangeable.
> >> >> >> >
> >> >> >> > I cannot for the life of me come up with a situation where this would
> >> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> >> > stronger ordering than anything the consume through "p" would
> >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> >> > irrelevant. No?
> >> >> >>
> >> >> >> I can make a contrived litmus test for it, but you are right, the only
> >> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> >> >> reorder the loads into p and q, and the read from p->val can, as you say,
> >> >> >> come from either pointer.
> >> >> >>
> >> >> >> For whatever it is worth, here is the litmus test:
> >> >> >>
> >> >> >> T1:   p = kmalloc(...);
> >> >> >>       if (p == NULL)
> >> >> >>               deal_with_it();
> >> >> >>       p->a = 42;  /* Each field in its own cache line. */
> >> >> >>       p->b = 43;
> >> >> >>       p->c = 44;
> >> >> >>       atomic_store_explicit(&gp1, p, memory_order_release);
> >> >> >>       p->b = 143;
> >> >> >>       p->c = 144;
> >> >> >>       atomic_store_explicit(&gp2, p, memory_order_release);
> >> >> >>
> >> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >> >>       r1 = p->b;  /* Guaranteed to get 143. */
> >> >> >>       q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >> >>       if (p == q) {
> >> >> >>               /* The compiler decides that q->c is same as p->c. */
> >> >> >>       r2 = p->c; /* Could get 44 on weakly ordered system. */
> >> >> >>       }
> >> >> >>
> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> >> >> you get.
> >> >> >>
> >> >> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> >> trouble anyway.  If you really want to do something like that and still
> >> >> >> see consistency across all the fields in the structure, please put a lock
> >> >> >> in the structure and use it to guard updates and accesses to those fields.
> >> >> >
> >> >> > And here is a patch documenting the restrictions for the current Linux
> >> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >> >
> >> >> > Thoughts?
> >> >>
> >> >> That might serve as informal documentation for linux kernel
> >> >> programmers about the bounds on the optimisations that you expect
> >> >> compilers to do for common-case RCU code - and I guess that's what you
> >> >> intend it to be for.   But I don't see how one can make it precise
> >> >> enough to serve as a language definition, so that compiler people
> >> >> could confidently say "yes, we respect that", which I guess is what
> >> >> you really need.  As a useful criterion, we should aim for something
> >> >> precise enough that in a verified-compiler context you can
> >> >> mathematically prove that the compiler will satisfy it  (even though
> >> >> that won't happen anytime soon for GCC), and that analysis tool
> >> >> authors can actually know what they're working with.   All this stuff
> >> >> about "you should avoid cancellation", and "avoid masking with just a
> >> >> small number of bits" is just too vague.
> >> >
> >> > Understood, and yes, this is intended to document current compiler
> >> > behavior for the Linux kernel community.  It would not make sense to show
> >> > it to the C11 or C++11 communities, except perhaps as an informational
> >> > piece on current practice.
> >> >
> >> >> The basic problem is that the compiler may be doing sophisticated
> >> >> reasoning with a bunch of non-local knowledge that it's deduced from
> >> >> the code, neither of which are well-understood, and here we have to
> >> >> identify some envelope, expressive enough for RCU idioms, in which
> >> >> that reasoning doesn't allow data/address dependencies to be removed
> >> >> (and hence the hardware guarantee about them will be maintained at the
> >> >> source level).
> >> >>
> >> >> The C11 syntactic notion of dependency, whatever its faults, was at
> >> >> least precise, could be reasoned about locally (just looking at the
> >> >> syntactic code in question), and did do that.  The fact that current
> >> >> compilers do optimisations that remove dependencies and will likely
> >> have many bugs at present is beside the point - this was surely
> >> >> intended as a *new* constraint on what they are allowed to do.  The
> >> >> interesting question is really whether the compiler writers think that
> >> >> they *could* implement it in a reasonable way - I'd like to hear
> >> >> Torvald and his colleagues' opinion on that.
> >> >>
> >> >> What you're doing above seems to be basically a very cut-down version
> >> >> of that, but with a fuzzy boundary.   If you want it to be precise,
> >> >> maybe it needs to be much simpler (which might force you into ruling
> >> >> out some current code idioms).
> >> >
> >> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806)
> >> > can be developed to serve this purpose.
> >>
> >> (I missed that mail when it first came past, sorry)
> >>
> >> That's also going to be tricky, I'm afraid.  The key condition there is:
> >>
> >> "* at the time of execution of E, L       [PS: I assume that L is a
> >> typo and should be E]
> >
> > No, L was intended. (But see below.)
> 
> then I misunderstood...
> 
> >>         can possibly have returned at
> >>         least two different values under the assumption that L itself
> >>         could have returned any value allowed by L's type."
> >>
> >> First, the evaluation of E might be nondeterministic  - e.g., for an
> >> artificial example, if it's just a nondeterministic value obtained
> >> from the result of a race on SC atomics.  The above doesn't
> >> distinguish between that (which doesn't have a real dependency on L)
> >> and that XOR'd with L (which does).  And it does so in the wrong
> >> direction: it'll say that the former has a dependency on L.
> >
> > I'm not quite sure I understand the examples you want to point out
> > (could you add brief code snippets, perhaps?) -- but the informal
> > definition I proposed also says that E must have used L in some way to
> > make its computation.  That's pretty vague, and the way I phrased the
> > requirement is probably not optimal.
> 
> ...so this example isn't so relevant, but it's maybe interesting
> anyway. In free-wheeling pseudocode:
> 
> x = read_consume(L)     // a memory_order_consume read of a 1-bit value from L
> y = nondet()            // some code that nondeterministically produces
>                         // another 1-bit value, eg by spawning two threads
>                         // that each do an SC-atomic write to some other
>                         // location, returning 0 or 1 depending on which wins
> 
> then evaluate either just
> 
> z=y
> 
> or
> 
> z = x XOR y
> 
> In my misinterpretation of what you wrote, your definition would say
> there's a dependency from the load of L to the evaluation of y, even
> though there isn't.

The evaluation "z = y" wouldn't depend on L, assuming nondet()
doesn't depend on L (e.g., by using x).  It doesn't use x's value.  (And
I don't have a precise definition for this, unfortunately, but I hope
the intent is clear.)

"x XOR y" would depend on L, because this does take x into account, and
x's value remains relevant irrespective of which value y has.  y would
be non-vdp, and the compiler could have specialized the code into two
branches for both 1 and 0 values of y; but it would still need x to
compute z in the last evaluation.
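
Spelled out, a specialization on the non-vdp y that the compiler could
legitimately perform:

	/* original */
	z = x ^ y;

	/* valid specialization on the 1-bit, non-vdp y: */
	if (y == 0)
		z = x;      /* still uses x's value... */
	else
		z = x ^ 1;  /* ...as does this, so the dependency on L
		               survives in both branches */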

> 
> > So let me expand a bit on the background first.  What I tried to capture
> > with the rules is that an evaluation (ie, E), really uses the value
> > returned by L, and not just a, for example, constant value that can be
> > inferred from whatever was executed between L and E.  This (is intended
> > to) prevent value prediction and such by programmers.  The compiler in
> > turn knows what a real program is allowed to do that still wants to rely
> > on the ordering by the consume.  Basing all of this on the values seemed
> > to be helpful because basing it on *purely* syntax didn't seem
> > implementable; for the latter, we'd need to disallow cases such as x-x
> > or require compilers to include artificial dependencies if a program
> > indeed does value speculation.  IOW, it didn't seem possible to draw a
> > clear distinction between a data dependency defined purely based on
> > syntax and control dependencies.
> >
> > Another way to perhaps phrase the requirement might be to construct it
> > inductively, so that "E uses the value returned by L" becomes clearer.
> > However, I currently don't really know how to do that without any holes.
> >
> > For example, let's say that an atomic mo_consume load has a value
> > dependency to itself; the dependency has an associated set of values,
> > which is equal to all values allowed by the type (but see also below).
> > Then, an evaluation A that uses B as operand has a value dependency on B
> > if the associated set of values can still have more than one element.
> > Whether that's the case depends on the operation, and the other
> > operands, including their values at the time of execution.
> > However, it's not just results of evaluations that effectively constrain
> > the set of values.  Conditional execution does as well, because the
> > condition being true might establish a constraint on L, which might
> > remove any dependency when L (or a result of a computation involving L)
> > is used.  So I'm not sure the inductive approach would work.
> >
> > You said you thought I might have wanted to say that:
> > "E  really-depends on L if there exist two different values that (just
> > according to typing) might be read for L that give rise to two
> > different values for E".
> >
> > Which should follow from the above, or at least should not conflict with
> > it.  My "definition" tried to capture that the program must not
> > establish so many constraints between E and L that the value of L is
> > clear in the sense of having exactly one value.  If we dereference such
> > an E whose execution in fact constrains L to one value, there is no real
> > dependency anymore.
> > However, by itself, this doesn't cover that E must use L in its
> > computation -- which your formulation does, I think.
> 
> unfortunately not - that's what the example above shows.

Now I think I understand.

> > Another way might be to try to define which constraints established by
> > an execution (based on the executed code's semantics) actually remove
> > value dependencies on L.
> >
> > Do you have any suggestions for how to define this (ie, true value
> > dependencies) in a better way?
> 
> not at the moment.  I need to think some more about your vdps, though.
>  As I understand it, you're agreeing with the general intent of the
> original C11 design that some new mechanism to tell the compiler not
> to optimise in certain places is required, but differing in the way
> you identify those?

Yes, we still need such a mechanism because we need to prevent the
compiler from doing value prediction or something similar that uses
control dependencies to avoid data dependencies.  For example, it could
use a large switch statement to add code specialized for each possible
value returned from an mo_consume load.  Thus, at least for the
standard, we need some mechanism that prevents certain compiler
optimizations.
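
To illustrate the kind of transformation that has to be ruled out
(plain C; the variable names are mine):

	int idx = atomic_load_explicit(&gi, memory_order_consume);
	r = a[idx];  /* data dependency: the a[idx] load is ordered
	                after the load from gi */

	/* Invalid for consume: specializing on a known-1-bit idx turns
	 * the data dependency into control dependencies, so the array
	 * loads may be speculated ahead of the consume load: */
	if (idx == 0)
		r = a[0];
	else
		r = a[1];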

Where it differs from C11 is that we're not trying to use just such a
mechanism built on syntax to try to define when we actually have a real
value dependency.

> 
> >> Second, it involves reasoning about counterfactual executions.  That
> >> doesn't necessarily make it wrong, per se, but probably makes it hard
> >> to work with.  For example, suppose that in all the actual
> >> whole-program executions, a runtime occurrence of L only ever returns
> >> one particular value (perhaps because of some simple #define'd
> >> configuration)
> >
> > Right, or just because it's a deterministic program.
> >
> > This is a problem for the definition of value dependencies, AFAICT,
> > because where do you put the line for what the compiler is allowed to
> > speculate about?  When does the program indeed "reveal" the value of a
> > load, and when does it not?  Is a compiler to do out-of-band tracking
> > for certain memory locations, and thus predict the value?
> >
> > Adding the assumption that L, at least conceptually, might return any
> > value seemed to be a simple way to remove that uncertainty regarding
> > what the compiler might know about.
> 
> ...but it brings in a lot of baggage.

Which baggage do you mean?  In terms of verification, or definitions in
the standard, or effects on compilers?

> >> , and that the code used in the evaluation of E depends
> >> on some invariant which is related to that configuration.
> >
> > A related issue that I've been thinking about is whether things like
> > divide-by-zero (ie, operations that give rise to undefined behavior)
> > should be considered as constraints on the values or not.  I guess they
> > should, but then this seems closer to what you mention, invariants
> > established by the whole program.
> >
> >> The
> >> hypothetical execution used above in which a different value is used
> >> is one in which the code is being run in a situation with broken invariants.
> >>    Then there will be technical difficulties in using the definition:
> >> I don't see how one would persuade oneself that a compiler always
> >> satisfies it, because these hypothetical executions are far removed
> >> from what it's actually working on.
> >
> > I think the above is easy to implement for a compiler.  At the
> > mo_consume load, you simply forget any information you have about this
> > value / memory location; for operations on value_dep_preserving types,
> 
> are all the subexpressions likewise forced to be of vdp types?

I'd like to avoid that if possible.  Things like "vdp-pointer + int"
should Just Work in the sense of being still vdp-typed without requiring
an explicit cast for the int operand.  Operators ||, &&, ?: might
justify more fine-grained rules, but I guess it's better to just make
more stuff vdp if that's easier.  vdp is not a sufficient condition for
there being a value-dependency anyway, and the few optimizations it
prevents on expressions containing some vdp-typed operands probably
don't matter much.  Also, as I mentioned in the reply to Paul, I believe
that if the compiler can prove that there's no value-dependency, it can
ignore the vdp-annotation (IOW, as-if is still possible).
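
That is, something like this should need no annotation or cast on the
int operand (hypothetical syntax as before):

	struct foo value_dep_preserving *p = rcu_dereference(gp);
	int i = compute_plain_index();               /* non-vdp */

	struct foo value_dep_preserving *q = p + i;  /* still vdp */
	int v = q->val;                              /* dependency-ordered
	                                                load */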


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-05 16:26                                                                                                                                 ` Torvald Riegel
@ 2014-03-05 18:01                                                                                                                                   ` Paul E. McKenney
  2014-03-07 17:45                                                                                                                                     ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-03-05 18:01 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
> 
> On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > 
> > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > 
> > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > +o	Do not use the results from the boolean "&&" and "||" when
> > > > > > +	dereferencing.	For example, the following (rather improbable)
> > > > > > +	code is buggy:
> > > > > > +
> > > > > > +		int a[2];
> > > > > > +		int index;
> > > > > > +		int force_zero_index = 1;
> > > > > > +
> > > > > > +		...
> > > > > > +
> > > > > > +		r1 = rcu_dereference(i1);
> > > > > > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +	The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > +	do order stores after such branches, they can speculate loads,
> > > > > > +	which can result in misordering bugs.
> > > > > > +
> > > > > > +o	Do not use the results from relational operators ("==", "!=",
> > > > > > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > +	the following (quite strange) code is buggy:
> > > > > > +
> > > > > > +		int a[2];
> > > > > > +		int index;
> > > > > > +		int flip_index = 0;
> > > > > > +
> > > > > > +		...
> > > > > > +
> > > > > > +		r1 = rcu_dereference(i1);
> > > > > > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +	As before, the reason this is buggy is that relational operators
> > > > > > +	are often compiled using branches.  And as before, although
> > > > > > +	weak-memory machines such as ARM or PowerPC do order stores
> > > > > > +	after such branches, they can speculate loads, which can again
> > > > > > +	result in misordering bugs.
> > > > > 
> > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > there are further constraints due to the type of r1 and the values that
> > > > > flip_index can have).
> > > > 
> > > > And I am OK with the value_dep_preserving type providing more/better
> > > > guarantees than we get by default from current compilers.
> > > > 
> > > > One question, though.  Suppose that the code did not want a value
> > > > dependency to be tracked through a comparison operator.  What does
> > > > the developer do in that case?  (The reason I ask is that I have
> > > > not yet found a use case in the Linux kernel that expects a value
> > > > dependency to be tracked through a comparison.)
> > > 
> > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > comparison?
> > 
> > That should work well assuming that things like "if", "while", and "?:"
> > conditions are happy to take a vdp.
> 
> I currently don't see a reason why that should be disallowed.  If we
> have allowed an implicit conversion to non-vdp, I believe that should
> follow.

I am a bit nervous about a silent implicit conversion from vdp to
non-vdp in the general case.  However, when the result is being used by
a conditional, the silent implicit conversion makes a lot of sense.
Is that distinction something that the compiler can handle easily?

On the other hand, silent implicit conversion from non-vdp to vdp
is very useful for common code that can be invoked both by RCU
readers and by updaters.
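
In other words, something like the following (hypothetical annotation
again):

	struct foo value_dep_preserving *p = rcu_dereference(gp);

	if (p == NULL)      /* comparison consumed by a conditional: a
	                       silent vdp -> non-vdp conversion is fine */
		return -ENOENT;

	struct foo *q = p;  /* but this silent vdp -> non-vdp assignment
	                       is the case I am nervous about */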

>          ?: could be somewhat special, in that the type depends on the
> 2nd and 3rd operand.  Thus, "vdp x = non-vdp ? vdp : vdp;" should be
> allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be
> disallowed if we don't provide for implicit casts from non-vdp to vdp.

Actually, from the Linux-kernel code that I am seeing, we want to be able
to silently convert from non-vdp to vdp in order to permit common code
that is invoked from both RCU readers (vdp) and updaters (often non-vdp).
This common code must be compiled conservatively to allow vdp, but should
be just fine with non-vdp.

Going through the combinations...

 0.	vdp x = vdp ? vdp : vdp;		 /* OK, matches. */
 1.	vdp x = vdp ? vdp : non-vdp;		 /* Silent conversion. */
 2.	vdp x = vdp ? non-vdp : vdp;		 /* Silent conversion. */
 3.	vdp x = vdp ? non-vdp : non-vdp;	 /* Silent conversion. */
 4.	vdp x = non-vdp ? vdp : vdp;		 /* OK, matches. */
 5.	vdp x = non-vdp ? vdp : non-vdp;	 /* Silent conversion. */
 6.	vdp x = non-vdp ? non-vdp : vdp;	 /* Silent conversion. */
 7.	vdp x = non-vdp ? non-vdp : non-vdp;	 /* Silent conversion. */
 8.	non-vdp x = vdp ? vdp : vdp;		 /* Warning unless condition. */
 9.	non-vdp x = vdp ? vdp : non-vdp;	 /* Warning unless condition. */
10.	non-vdp x = vdp ? non-vdp : vdp;	 /* Warning unless condition. */
11.	non-vdp x = vdp ? non-vdp : non-vdp;	 /* OK, matches. */
12.	non-vdp x = non-vdp ? vdp : vdp;	 /* Warning unless condition. */
13.	non-vdp x = non-vdp ? vdp : non-vdp;	 /* Warning unless condition. */
14.	non-vdp x = non-vdp ? non-vdp : vdp;	 /* Warning unless condition. */
15.	non-vdp x = non-vdp ? non-vdp : non-vdp; /* OK, matches. */

0, 4, 11, and 15 are OK because both legs of the ?: match the variable
being assigned to.  1, 2, 3, 5, 6, and 7 are implicit silent conversions
from non-vdp to vdp, which is always safe and is useful for common code.
8, 9, 10, 12, 13, and 14 are mismatches: A vdp quantity is being assigned
to a non-vdp variable, which could potentially be passed to a vdp-oblivious
function.  However, 8, 9, 10, 12, 13, and 14 are OK if the result is
consumed by a conditional.  That said, I would not complain if something
like the following kicked out a warning:

	struct foo value_dep_preserving *p;
	struct foo *q;

	p = rcu_dereference(gp);
	q = f() ? p : p + 1;
	if (q < THE_LIMIT)
		do_something();
	else
		do_something_else(p);

The warning could be avoided by marking q value_dep_preserving or by
eliminating q entirely:

	struct foo value_dep_preserving *p;

	p = rcu_dereference(gp);
	if ((f() ? p : p + 1) < THE_LIMIT)
		do_something();
	else
		do_something_else(p);

Or, for that matter, by using a cast:

	struct foo value_dep_preserving *p;
	struct foo *q;

	p = rcu_dereference(gp);
	q = (struct foo *)(f() ? p : p + 1);
	if (q < THE_LIMIT)
		do_something();
	else
		do_something_else(p);

Does that make sense?

> > This assumes that p->a only returns
> > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > through the program.  ;-)
> 
> That's a good question.  For the scheme I had in mind, I'm not concerned
> about vdps running wild because one needs to assign to explicitly
> vdp-typed variables (or function arguments, etc.) to let vdp extend to
> beyond single expressions.
> 
> Nonetheless, I think it's a good question how -> should behave if the
> field is not vdp; in particular, should vdp->non_vdp be automatically
> vdp?  One concern might be that we know something about non-vdp -- OTOH,
> we shouldn't be able to do so because we (assume to) don't know anything
> about the vdp pointer, so we can't infer anything about what it
> points to.

In almost all the cases I am seeing in the Linux kernel, p->f wants to
be non-vdp.  A common case is that "f" is an integer that is used in
later computation, but where the ordering is needed only when fetching
p->f, not during later use of the resulting integer.

So it is looking like p->f should be vdp only if field "f" is declared vdp.
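
The common pattern looks something like this (field and variable names
invented for illustration):

	struct foo {
		int f;  /* plain int field, not declared vdp */
	};
	struct foo value_dep_preserving *p;
	int len;

	p = rcu_dereference(gp);
	len = p->f;           /* this fetch must be ordered after the
	                         rcu_dereference()... */
	len = (len + 7) & ~7; /* ...but later arithmetic on the plain
	                         int needs no ordering at all */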

> > The other thing that can happen is that a vdp can get handed off to
> > another synchronization mechanism, for example, to reference counting:
> > 
> > 	p = atomic_load_explicit(&gp, memory_order_consume);
> > 	if (do_something_with(p->a)) {
> > 		/* fast path protected by RCU. */
> > 		return 0;
> > 	}
> > > 	if (atomic_inc_not_zero(&p->refcnt)) {
> 
> Is the argument to atomic_inc_not_zero vdp or non-vdp?

The argument to atomic_inc_not_zero() is non-vdp, and because it is an
atomic operation, it would not make sense to mark it vdp.  This results
in a bit of a dilemma: I am finding code that wants "&p->f" to be vdp
if "p" is vdp, and I am finding other code (like the above) that wants
"&p->f" to be non-vdp always.

The approaches I can think of at the moment include:

1.	If "p" is vdp, make "&p->f" be vdp, but don't complain about
	subsequent assignments to non-vdp variables.  Sounds like quite
	a mess in the compiler.

2.	Propagate value_dep_preserving tags throughout the kernel.
	Sounds like a good recipe for a Linux-kernel revolt against
	this proposal.

3.	Require explicit casts to avoid warnings:

	if (atomic_inc_not_zero((struct foo *)&p->refcnt)) {

	This would not be as bad as #2, but would still require
	a fair amount of markup.

4.	Use something like kill_dependency().  This has strengths
	and weaknesses similar to #3, but has the advantage of
	being useful in type-generic macros.

5.	Either #3 or #4 above, but have a command-line flag that
	shuts off the warnings.  That way, people who want the
	diagnostics can enable them in their own code, and people
	who don't can disable them.

#5 looks like the way to go to me.  So "&p->f" has the same vdp-ness
as "p", so that assigning it to a non-vdp variable, passing it via a
non-vdp argument, or returning it via a non-vdp return value will
cause a warning.  However, that warning can be easily shut off on a
file-by-file basis.

Seem reasonable?
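
For example, #4 might look like the following; kill_dependency() is the
existing C11 macro from <stdatomic.h>, here assumed to be extended to
also strip the vdp qualifier:

	p = atomic_load_explicit(&gp, memory_order_consume);
	...
	if (atomic_inc_not_zero(kill_dependency(&p->refcnt))) {
		/* Slow path: ordering now comes from the atomic
		 * operation itself rather than from the consume
		 * dependency, and no vdp warning is issued. */
	}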

> > 		/* slow path protected by reference counting. */
> > 		return do_something_else_with((struct foo *)p);  /* CHANGE */
> > 	}
> > 	/* Needed slow path, but raced with deletion. */
> > 	return -EAGAIN;
> > 
> > I am guessing that the cast ends the vdp.  Is that the case?
> 
> That would end it, yes.  The other way this could happen is that the
> argument of do_something_else_with() would be specified to be non-vdp.

Agreed.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-05 16:54                                                                                                                                   ` Torvald Riegel
@ 2014-03-05 18:15                                                                                                                                     ` Paul E. McKenney
  2014-03-07 18:33                                                                                                                                       ` Torvald Riegel
  0 siblings, 1 reply; 299+ messages in thread
From: Paul E. McKenney @ 2014-03-05 18:15 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
> On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > 
> > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > +o	Do not use the results from the boolean "&&" and "||" when
> > > > > > > +	dereferencing.	For example, the following (rather improbable)
> > > > > > > +	code is buggy:
> > > > > > > +
> > > > > > > +		int a[2];
> > > > > > > +		int index;
> > > > > > > +		int force_zero_index = 1;
> > > > > > > +
> > > > > > > +		...
> > > > > > > +
> > > > > > > +		r1 = rcu_dereference(i1);
> > > > > > > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > +	The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > > +	do order stores after such branches, they can speculate loads,
> > > > > > > +	which can result in misordering bugs.
> > > > > > > +
> > > > > > > +o	Do not use the results from relational operators ("==", "!=",
> > > > > > > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > +	the following (quite strange) code is buggy:
> > > > > > > +
> > > > > > > +		int a[2];
> > > > > > > +		int index;
> > > > > > > +		int flip_index = 0;
> > > > > > > +
> > > > > > > +		...
> > > > > > > +
> > > > > > > +		r1 = rcu_dereference(i1);
> > > > > > > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > +	As before, the reason this is buggy is that relational operators
> > > > > > > +	are often compiled using branches.  And as before, although
> > > > > > > +	weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > +	after such branches, they can speculate loads, which can again
> > > > > > > +	result in misordering bugs.
> > > > > > 
> > > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > > there are further constraints due to the type of r1 and the values that
> > > > > > flip_index can have).
> > > > > 
> > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > guarantees than we get by default from current compilers.
> > > > > 
> > > > > One question, though.  Suppose that the code did not want a value
> > > > > dependency to be tracked through a comparison operator.  What does
> > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > dependency to be tracked through a comparison.)
> > > > 
> > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > comparison?
> > > 
> > > That should work well assuming that things like "if", "while", and "?:"
> > > conditions are happy to take a vdp.  This assumes that p->a only returns
> > > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > > through the program.  ;-)
> > > 
> > > The other thing that can happen is that a vdp can get handed off to
> > > another synchronization mechanism, for example, to reference counting:
> > > 
> > > 	p = atomic_load_explicit(&gp, memory_order_consume);
> > > 	if (do_something_with(p->a)) {
> > > 		/* fast path protected by RCU. */
> > > 		return 0;
> > > 	}
> > > 	if (atomic_inc_not_zero(&p->refcnt)) {
> > > 		/* slow path protected by reference counting. */
> > > 		return do_something_else_with((struct foo *)p);  /* CHANGE */
> > > 	}
> > > 	/* Needed slow path, but raced with deletion. */
> > > 	return -EAGAIN;
> > > 
> > > I am guessing that the cast ends the vdp.  Is that the case?
> > 
> > And here is a more elaborate example from the Linux kernel:
> > 
> > 	struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
> > 
> > 	rdev = rcu_dereference(conf->mirrors[disk].rdev);
> > 	if (r1_bio->bios[disk] == IO_BLOCKED
> > 	    || rdev == NULL
> > 	    || test_bit(Unmerged, &rdev->flags)
> > 	    || test_bit(Faulty, &rdev->flags))
> > 		continue;
> > 
> > The fact that the "rdev == NULL" returns vdp does not force the "||"
> > operators to be evaluated arithmetically because the entire expression
> > is an "if" condition, correct?
> 
> That's a good question, and one that as far as I understand currently,
> essentially boils down to whether we want to have tight restrictions on
> which operations are still vdp.
> 
> If we look at the different combinations, then it seems we can't decide
> on whether we have a value-dependency just due to a vdp type:
> * non-vdp || vdp:  vdp iff non-vdp == false
> * vdp || non-vdp:  vdp iff non-vdp == false?
> * vdp || vdp: always vdp? (and dependency on both?)
> 
> I'm not sure it makes sense to try to not make all of those
> vdp-by-default.  The first and second case show that it's dependent on
> the specific execution anyway, and thus is already covered by the
> requirement that the value must still matter.   The vdp type is just a
> way to prevent inappropriate compiler optimizations; it's not critical
> for correctness if we make more stuff vdp, yet it may prevent some
> optimizations in the affected expression.
> 
> If the compiler knows that some vdp-typed evaluation will not have a
> value-dependency anyway, then it can just optimize this evaluation like
> non-vdp code.
> 
> I guess not much would change for the code you posted, because we
> already have to evaluate || operands in order, I believe (e.g., don't
> access rdev->flags before doing the rdev == NULL check, modulo as-if).
> Do I understand your question correctly?

Let me give an example for the other side:

	struct foo value_dep_preserving *p;
	struct foo value_dep_preserving *q;

	p = rcu_dereference(gp);
	q = rcu_dereference(gq);
	return myarray[p || q];  /* Linux kernel doesn't do this. */

If we wanted this to work (and I am not at all convinced that we do),
the compiler would have to force a data dependency through the "||".

But I would be just as happy to instead just say that boolean logical
operators ("||" and "&&") never return vdp values.  Ditto for the
relational operators ("==", "!=", ">", ">=", "<", and "<=").  No one
seems to rely on value dependencies via these operators, after all,
and preserving value dependencies through them seems to require that
the compiler generate odd code.
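
For concreteness, a plausible lowering of the earlier
"r2 = a[r1 && force_zero_index]" example might look like the following
(the temporary "idx" is purely illustrative); the branch is exactly what
breaks the address dependency on weakly ordered hardware:

	int idx;

	if (r1 == 0)
		idx = 0;	/* reached without using r1's value as data */
	else
		idx = (force_zero_index != 0);
	r2 = a[idx];		/* CPU may speculate this load past the branch */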

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-05 17:15                                                                                                                                   ` Torvald Riegel
@ 2014-03-05 18:37                                                                                                                                     ` Peter Sewell
  0 siblings, 0 replies; 299+ messages in thread
From: Peter Sewell @ 2014-03-05 18:37 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Paul McKenney, Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On 5 March 2014 17:15, Torvald Riegel <triegel@redhat.com> wrote:
> On Tue, 2014-03-04 at 22:11 +0000, Peter Sewell wrote:
>> On 3 March 2014 20:44, Torvald Riegel <triegel@redhat.com> wrote:
>> > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
>> >> On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>> >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
>> >> >> Hi Paul,
>> >> >>
>> >> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>> >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >> >> >> > <paulmck@linux.vnet.ibm.com> wrote:
>> >> >> >> > >
>> >> >> >> > > 3.      The comparison was against another RCU-protected pointer,
>> >> >> >> > >         where that other pointer was properly fetched using one
>> >> >> >> > >         of the RCU primitives.  Here it doesn't matter which pointer
>> >> >> >> > >         you use.  At least as long as the rcu_assign_pointer() for
>> >> >> >> > >         that other pointer happened after the last update to the
>> >> >> >> > >         pointed-to structure.
>> >> >> >> > >
>> >> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
>> >> >> >> >
>> >> >> >> > I think that it might be worth pointing out as an example, and saying
>> >> >> >> > that code like
>> >> >> >> >
>> >> >> >> >    p = atomic_read(consume);
>> >> >> >> >    X;
>> >> >> >> >    q = atomic_read(consume);
>> >> >> >> >    Y;
>> >> >> >> >    if (p == q)
>> >> >> >> >         data = p->val;
>> >> >> >> >
>> >> >> >> > then the access of "p->val" is constrained to be data-dependent on
>> >> >> >> > *either* p or q, but you can't really tell which, since the compiler
>> >> >> >> > can decide that the values are interchangeable.
>> >> >> >> >
>> >> >> >> > I cannot for the life of me come up with a situation where this would
>> >> >> >> > matter, though. If "X" contains a fence, then that fence will be a
>> >> >> >> > stronger ordering than anything the consume through "p" would
>> >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
>> >> >> >> > ordering to the access through "p" is through p or q is kind of
>> >> >> >> > irrelevant. No?
>> >> >> >>
>> >> >> >> I can make a contrived litmus test for it, but you are right, the only
>> >> >> >> time you can see it happen is when X has no barriers, in which case
>> >> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
>> >> >> >> reorder the loads into p and q, and the read from p->val can, as you say,
>> >> >> >> come from either pointer.
>> >> >> >>
>> >> >> >> For whatever it is worth, here is the litmus test:
>> >> >> >>
>> >> >> >> T1:   p = kmalloc(...);
>> >> >> >>       if (p == NULL)
>> >> >> >>               deal_with_it();
>> >> >> >>       p->a = 42;  /* Each field in its own cache line. */
>> >> >> >>       p->b = 43;
>> >> >> >>       p->c = 44;
>> >> >> >>       atomic_store_explicit(&gp1, p, memory_order_release);
>> >> >> >>       p->b = 143;
>> >> >> >>       p->c = 144;
>> >> >> >>       atomic_store_explicit(&gp2, p, memory_order_release);
>> >> >> >>
>> >> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>> >> >> >>       r1 = p->b;  /* Guaranteed to get 143. */
>> >> >> >>       q = atomic_load_explicit(&gp1, memory_order_consume);
>> >> >> >>       if (p == q) {
>> >> >> >>               /* The compiler decides that q->c is same as p->c. */
>> >> >> >>               r2 = p->c; /* Could get 44 on weakly ordered system. */
>> >> >> >>       }
>> >> >> >>
>> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> >> >> >> you get.
>> >> >> >>
>> >> >> >> And publishing a structure via one RCU-protected pointer, updating it,
>> >> >> >> then publishing it via another pointer seems to me to be asking for
>> >> >> >> trouble anyway.  If you really want to do something like that and still
>> >> >> >> see consistency across all the fields in the structure, please put a lock
>> >> >> >> in the structure and use it to guard updates and accesses to those fields.
>> >> >> >
>> >> >> > And here is a patch documenting the restrictions for the current Linux
>> >> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
>> >> >> > differently than atomic_load_explicit(&p, memory_order_consume).
>> >> >> >
>> >> >> > Thoughts?
>> >> >>
>> >> >> That might serve as informal documentation for linux kernel
>> >> >> programmers about the bounds on the optimisations that you expect
>> >> >> compilers to do for common-case RCU code - and I guess that's what you
>> >> >> intend it to be for.   But I don't see how one can make it precise
>> >> >> enough to serve as a language definition, so that compiler people
>> >> >> could confidently say "yes, we respect that", which I guess is what
>> >> >> you really need.  As a useful criterion, we should aim for something
>> >> >> precise enough that in a verified-compiler context you can
>> >> >> mathematically prove that the compiler will satisfy it  (even though
>> >> >> that won't happen anytime soon for GCC), and that analysis tool
>> >> >> authors can actually know what they're working with.   All this stuff
>> >> >> about "you should avoid cancellation", and "avoid masking with just a
>> >> >> small number of bits" is just too vague.
>> >> >
>> >> > Understood, and yes, this is intended to document current compiler
>> >> > behavior for the Linux kernel community.  It would not make sense to show
>> >> > it to the C11 or C++11 communities, except perhaps as an informational
>> >> > piece on current practice.
>> >> >
>> >> >> The basic problem is that the compiler may be doing sophisticated
>> >> >> reasoning with a bunch of non-local knowledge that it's deduced from
>> >> >> the code, neither of which are well-understood, and here we have to
>> >> >> identify some envelope, expressive enough for RCU idioms, in which
>> >> >> that reasoning doesn't allow data/address dependencies to be removed
>> >> >> (and hence the hardware guarantee about them will be maintained at the
>> >> >> source level).
>> >> >>
>> >> >> The C11 syntactic notion of dependency, whatever its faults, was at
>> >> >> least precise, could be reasoned about locally (just looking at the
>> >> >> syntactic code in question), and did do that.  The fact that current
>> >> >> compilers do optimisations that remove dependencies and will likely
>> >> >> have many bugs at present is besides the point - this was surely
>> >> >> intended as a *new* constraint on what they are allowed to do.  The
>> >> >> interesting question is really whether the compiler writers think that
>> >> >> they *could* implement it in a reasonable way - I'd like to hear
>> >> >> Torvald and his colleagues' opinion on that.
>> >> >>
>> >> >> What you're doing above seems to be basically a very cut-down version
>> >> >> of that, but with a fuzzy boundary.   If you want it to be precise,
>> >> >> maybe it needs to be much simpler (which might force you into ruling
>> >> >> out some current code idioms).
>> >> >
>> >> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806)
>> >> > can be developed to serve this purpose.
>> >>
>> >> (I missed that mail when it first came past, sorry)
>> >>
>> >> That's also going to be tricky, I'm afraid.  The key condition there is:
>> >>
>> >> "* at the time of execution of E, L       [PS: I assume that L is a
>> >> typo and should be E]
>> >
>> > No, L was intended. (But see below.)
>>
>> then I misunderstood...
>>
>> >>         can possibly have returned at
>> >>         least two different values under the assumption that L itself
>> >>         could have returned any value allowed by L's type."
>> >>
>> >> First, the evaluation of E might be nondeterministic  - e.g., for an
>> >> artificial example, if it's just a nondeterministic value obtained
>> >> from the result of a race on SC atomics.  The above doesn't
>> >> distinguish between that (which doesn't have a real dependency on L)
>> >> and that XOR'd with L (which does).  And it does so in the wrong
>> >> direction: it'll say that the former has a dependency on L.
>> >
>> > I'm not quite sure I understand the examples you want to point out
>> > (could you add brief code snippets, perhaps?) -- but the informal
>> > definition I proposed also says that E must have used L in some way to
>> > make its computation.  That's pretty vague, and the way I phrased the
>> > requirement is probably not optimal.
>>
>> ...so this example isn't so relevant, but it's maybe interesting
>> anyway. In free-wheeling pseudocode:
>>
>> x = read_consume(L)  // a memory_order_consume read of a 1-bit value from L
>> y = nondet()         // some code that nondeterministically produces
>>                      // another 1-bit value, eg by spawning two threads
>>                      // that each do an SC-atomic write to some other
>>                      // location, returning 0 or 1 depending on which wins
>>
>> then evaluate either just
>>
>> z=y
>>
>> or
>>
>> z = x XOR y
>>
>> In my misinterpretation of what you wrote, your definition would say
>> there's a dependency from the load of L to the evaluation of y, even
>> though there isn't.
>
> The evaluations "z==y" or "z=y" wouldn't depend on L, assuming nondet()
> doesn't depend on L (e.g., by using x).  They don't use x's value.  (And
> I don't have a precise definition for this, unfortunately, but I hope
> the intent is clear.)
>
> "x XOR y" would depend on L, because this does take x into account, and
> x's value remains relevant irrespective of which value y has.  y would
> be non-vdp, and the compiler could have specialized the code into two
> branches for both 1 and 0 values of y; but it would still need x to
> compute z in the last evaluation.
>
>>
>> > So let me expand a bit on the background first.  What I tried to capture
>> > with the rules is that an evaluation (ie, E) really uses the value
>> > returned by L, and not just, for example, a constant value that can be
>> > inferred from whatever was executed between L and E.  This is intended
>> > to prevent value prediction and such by programmers.  The compiler in
>> > turn knows what a real program is allowed to do that still wants to rely
>> > on the ordering by the consume.  Basing all of this on the values seemed
>> > to be helpful because basing it on *purely* syntax didn't seem
>> > implementable; for the latter, we'd need to disallow cases such as x-x
>> > or require compilers to include artificial dependencies if a program
>> > indeed does value speculation.  IOW, it didn't seem possible to draw a
>> > clear distinction between a data dependency defined purely based on
>> > syntax and control dependencies.
>> >
>> > Another way to perhaps phrase the requirement might be to construct it
>> > inductively, so that "E uses the value returned by L" becomes clearer.
>> > However, I currently don't really know how to do that without any holes.
>> >
>> > For example, let's say that an atomic mo_consume load has a value
>> > dependency to itself; the dependency has an associated set of values,
>> > which is equal to all values allowed by the type (but see also below).
>> > Then, an evaluation A that uses B as operand has a value dependency on B
>> > if the associated set of values can still have more than one element.
>> > Whether that's the case depends on the operation, and the other
>> > operands, including their values at the time of execution.
>> > However, it's not just results of evaluations that effectively constrain
>> > the set of values.  Conditional execution does as well, because the
>> > condition being true might establish a constraint on L, which might
>> > remove any dependency when L (or a result of a computation involving L)
>> > is used.  So I'm not sure the inductive approach would work.
>> >
>> > You said you thought I might have wanted to say that:
>> > "E  really-depends on L if there exist two different values that (just
>> > according to typing) might be read for L that give rise to two
>> > different values for E".
>> >
>> > Which should follow from the above, or at least should not conflict with
>> > it.  My "definition" tried to capture that the program must not
>> > establish so many constraints between E and L that the value of L is
>> > clear in the sense of having exactly one value.  If we dereference such
>> > an E whose execution in fact constrains L to one value, there is no real
>> > dependency anymore.
>> > However, by itself, this doesn't cover that E must use L in its
>> > computation -- which your formulation does, I think.
>>
>> unfortunately not - that's what the example above shows.
>
> Now I think I understand.
>
>> > Another way might be to try to define which constraints established by
>> > an execution (based on the executed code's semantics) actually remove
>> > value dependencies on L.
>> >
>> > Do you have any suggestions for how to define this (ie, true value
>> > dependencies) in a better way?
>>
>> not at the moment.  I need to think some more about your vdps, though.
>>  As I understand it, you're agreeing with the general intent of the
>> original C11 design that some new mechanism to tell the compiler not
>> to optimise in certain places in required, but differing in the way
>> you identify those?
>
> Yes, we still need such a mechanism because we need to prevent the
> compiler from doing value prediction or something similar that uses
> control dependencies to avoid data dependencies.  For example, it could
> use a large switch statement to add code specialized for each possible
> value returned from an mo_consume load.  Thus, at least for the
> standard, we need some mechanism that prevents certain compiler
> optimizations.
>
> Where it differs from C11 is that we're not trying to use just such a
> mechanism built on syntax to try to define when we actually have a real
> value dependency.
>
>>
>> >> Second, it involves reasoning about counterfactual executions.  That
>> >> doesn't necessarily make it wrong, per se, but probably makes it hard
>> >> to work with.  For example, suppose that in all the actual
>> >> whole-program executions, a runtime occurrence of L only ever returns
>> >> one particular value (perhaps because of some simple #define'd
>> >> configuration)
>> >
>> > Right, or just because it's a deterministic program.
>> >
>> > This is a problem for the definition of value dependencies, AFAICT,
>> > because where do you put the line for what the compiler is allowed to
>> > speculate about?  When does the program indeed "reveal" the value of a
>> > load, and when does it not?  Is a compiler to do out-of-band tracking
>> > for certain memory locations, and thus predict the value?
>> >
>> > Adding the assumption that L, at least conceptually, might return any
>> > value seemed to be a simple way to remove that uncertainty regarding
>> > what the compiler might know about.
>>
>> ...but it brings in a lot of baggage.
>
> Which baggage do you mean?  In terms of verification, or definitions in
> the standard, or effects on compilers?

At least the first two.  I'm not in a position to comment on the last,
but I'd hate to try to do compiler verification based on a semantics
that involves substantial quantification over counterfactual
hypothetical executions.

>> >> , and that the code used in the evaluation of E depends
>> >> on some invariant which is related to that configuration.
>> >
>> > A related issue that I've been thinking about is whether things like
>> > divide-by-zero (ie, operations that give rise to undefined behavior)
>> > should be considered as constraints on the values or not.  I guess they
>> > should, but then this seems closer to what you mention, invariants
>> > established by the whole program.
>> >
>> >> The
>> >> hypothetical execution used above in which a different value is used
>> >> is one in which the code is being run in a situation with broken invariants.
>> >>    Then there will be technical difficulties in using the definition:
>> >> I don't see how one would persuade oneself that a compiler always
>> >> satisfies it, because these hypothetical executions are far removed
>> >> from what it's actually working on.
>> >
>> > I think the above is easy to implement for a compiler.  At the
>> > mo_consume load, you simply forget any information you have about this
>> > value / memory location; for operations on value_dep_preserving types,
>>
>> are all the subexpressions likewise forced to be of vdp types?
>
> I'd like to avoid that if possible.  Things like "vdp-pointer + int"
> should Just Work in the sense of being still vdp-typed without requiring
> an explicit cast for the int operand.  Operators ||, &&, ?: might
> justify more fine-grained rules, but I guess it's better to just make
> more stuff vdp if that's easier.  vdp is not a sufficient condition for
> there being a value-dependency anyway, and the few optimizations it
> prevents on expressions containing some vdp-typed operands probably
> don't matter much.  Also, as I mentioned in the reply to Paul, I believe
> that if the compiler can prove that there's no value-dependency, it can
> ignore the vdp-annotation (IOW, as-if is still possible).
>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-05 18:01                                                                                                                                   ` Paul E. McKenney
@ 2014-03-07 17:45                                                                                                                                     ` Torvald Riegel
  2014-03-07 19:02                                                                                                                                       ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-03-07 17:45 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote:
> On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
> > 
> > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > 
> > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > +o	Do not use the results from the boolean "&&" and "||" when
> > > > > > > +	dereferencing.	For example, the following (rather improbable)
> > > > > > > +	code is buggy:
> > > > > > > +
> > > > > > > +		int a[2];
> > > > > > > +		int index;
> > > > > > > +		int force_zero_index = 1;
> > > > > > > +
> > > > > > > +		...
> > > > > > > +
> > > > > > > +		r1 = rcu_dereference(i1);
> > > > > > > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > +	The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > > +	do order stores after such branches, they can speculate loads,
> > > > > > > +	which can result in misordering bugs.
> > > > > > > +
> > > > > > > +o	Do not use the results from relational operators ("==", "!=",
> > > > > > > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > +	the following (quite strange) code is buggy:
> > > > > > > +
> > > > > > > +		int a[2];
> > > > > > > +		int index;
> > > > > > > +		int flip_index = 0;
> > > > > > > +
> > > > > > > +		...
> > > > > > > +
> > > > > > > +		r1 = rcu_dereference(i1);
> > > > > > > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > +	As before, the reason this is buggy is that relational operators
> > > > > > > +	are often compiled using branches.  And as before, although
> > > > > > > +	weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > +	after such branches, they can speculate loads, which can again
> > > > > > > +	result in misordering bugs.
> > > > > > 
> > > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > > there are further constraints due to the type of r1 and the values that
> > > > > > flip_index can have).
> > > > > 
> > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > guarantees than we get by default from current compilers.
> > > > > 
> > > > > One question, though.  Suppose that the code did not want a value
> > > > > dependency to be tracked through a comparison operator.  What does
> > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > dependency to be tracked through a comparison.)
> > > > 
> > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > comparison?
> > > 
> > > That should work well assuming that things like "if", "while", and "?:"
> > > conditions are happy to take a vdp.
> > 
> > I currently don't see a reason why that should be disallowed.  If we
> > have allowed an implicit conversion to non-vdp, I believe that should
> > follow.
> 
> I am a bit nervous about a silent implicit conversion from vdp to
> non-vdp in the general case.

Why are you nervous about it?

> However, when the result is being used by
> a conditional, the silent implicit conversion makes a lot of sense.
> Is that distinction something that the compiler can handle easily?

I think so. I'm not a language lawyer, but we have other such
conversions in the standard (e.g., int to boolean, between int and
float) and I currently don't see a fundamental difference to those.  But
we'll have to ask the language folks (or SG1 or LEWG) to really verify
that.

> On the other hand, silent implicit conversion from non-vdp to vdp
> is very useful for common code that can be invoked both by RCU
> readers and by updaters.

I'd be more nervous about that because then there are fewer obstacles to
one programmer expecting a vdp to indicate a dependency vs. another
programmer putting non-vdp into vdp.

For this case of common code (which I agree is a valid concern), would
it be a lot of programmer overhead to add explicit casts from non-vdp to
vdp?  Would C11 generics help with that, similarly to how C++ template
functions would?
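
Roughly, a sketch of what I have in mind (value_dep_preserving does not
exist yet, so a wrapper struct stands in for a vdp pointer here, and all
the names are made up):

	struct foo { int f; };
	struct foo_vdp { struct foo *p; };	/* stand-in for a vdp pointer */

	static inline struct foo_vdp to_vdp_v(struct foo_vdp v) { return v; }
	static inline struct foo_vdp to_vdp_p(struct foo *p)
	{
		return (struct foo_vdp){ p };	/* explicit non-vdp -> vdp */
	}

	#define as_vdp(x) _Generic((x),		\
		struct foo_vdp: to_vdp_v,	\
		struct foo *: to_vdp_p)(x)

	/* Conservative common code takes the vdp flavor: */
	static int common_check(struct foo_vdp v) { return v.p->f == 0; }

	/* usage: common_check(as_vdp(reader_ptr));   -- already vdp
	 *        common_check(as_vdp(updater_ptr));  -- plain pointer converted */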

Nonetheless, in the end this is just trading off convenient use against
different ways to catch different but simple errors.

> >          ?: could be somewhat special, in that the type depends on the
> > 2nd and 3rd operand.  Thus, "vdp x = non-vdp ? vdp : vdp;" should be
> > allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be
> > disallowed if we don't provide for implicit casts from non-vdp to vdp.
> 
> Actually, from the Linux-kernel code that I am seeing, we want to be able
> to silently convert from non-vdp to vdp in order to permit common code
> that is invoked from both RCU readers (vdp) and updaters (often non-vdp).
> This common code must be compiled conservatively to allow vdp, but should
> be just fine with non-vdp.
> 
> Going through the combinations...
> 
>  0.	vdp x = vdp ? vdp : vdp;		 /* OK, matches. */
>  1.	vdp x = vdp ? vdp : non-vdp;		 /* Silent conversion. */
>  2.	vdp x = vdp ? non-vdp : vdp;		 /* Silent conversion. */
>  3.	vdp x = vdp ? non-vdp : non-vdp;	 /* Silent conversion. */
>  4.	vdp x = non-vdp ? vdp : vdp;		 /* OK, matches. */
>  5.	vdp x = non-vdp ? vdp : non-vdp;	 /* Silent conversion. */
>  6.	vdp x = non-vdp ? non-vdp : vdp;	 /* Silent conversion. */
>  7.	vdp x = non-vdp ? non-vdp : non-vdp;	 /* Silent conversion. */
>  8.	non-vdp x = vdp ? vdp : vdp;		 /* Warning unless condition. */
>  9.	non-vdp x = vdp ? vdp : non-vdp;	 /* Warning unless condition. */
> 10.	non-vdp x = vdp ? non-vdp : vdp;	 /* Warning unless condition. */
> 11.	non-vdp x = vdp ? non-vdp : non-vdp;	 /* OK, matches. */
> 12.	non-vdp x = non-vdp ? vdp : vdp;	 /* Warning unless condition. */
> 13.	non-vdp x = non-vdp ? vdp : non-vdp;	 /* Warning unless condition. */
> 14.	non-vdp x = non-vdp ? non-vdp : vdp;	 /* Warning unless condition. */
> 15.	non-vdp x = non-vdp ? non-vdp : non-vdp; /* OK, matches. */
> 
> 0, 4, 11, and 15 are OK because both legs of the ?: match the variable
> being assigned to.  1, 2, 3, 5, 6, and 7 are implicit silent conversions
> from non-vdp to vdp, which is always safe and is useful for common code.

Note that some of those can in fact be vdp depending on operands, but
don't necessarily carry an actual value dependency.  So, from a type
system perspective, I would guess that those expressions would be vdp by
default (except 7 and 3).

> 8, 9, 10, 12, 13, and 14 are mismatches: A vdp quantity is being assigned
> to a non-vdp variable, which could potentially be passed to a vdp-oblivious
> function.

I agree that there is a mismatch, but I'm not sure we want to warn on
silent conversion from vdp to non-vdp instead of just doing a silent
conversion.  Otherwise, we'll have to add casts whenever we send a vdp
to something that doesn't want to make use of the value dependency
(e.g., printf, an if statement, ...).  What would be the programmer
overhead for the latter?
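
For example (sketch only; the vdp type is still hypothetical, and
kill_dependency() is the C11 macro):

	struct foo value_dep_preserving *p = rcu_dereference(gp);

	printf("%d\n", kill_dependency(p)->f);	/* explicit kill needed here... */
	if (kill_dependency(p) != NULL)		/* ...and here, for every such use */
		do_something();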

> However, 8, 9, 10, 12, 13, and 14 are OK if the result is
> consumed by a conditional.  That said, I would not complain if something
> like the following kicked out a warning:
> 
> 	struct foo value_dep_preserving *p;
> 	struct foo *q;
> 
> 	p = rcu_dereference(gp);
> 	q = f() ? p : p + 1;

You'd like to see the warning here, right?

> 	if (q < THE_LIMIT)
> 		do_something();
> 	else
> 		do_something_else(p);
> 
> The warning could be avoided by marking q value_dep_preserving or by
> eliminating q entirely:
> 
> 	struct foo value_dep_preserving *p;
> 
> 	p = rcu_dereference(gp);
> 	if ((f() ? p : p + 1) < THE_LIMIT)
> 		do_something();
> 	else
> 		do_something_else(p);
> 
> Or, for that matter, by using a cast:
> 
> 	struct foo value_dep_preserving *p;
> 	struct foo *q;
> 
> 	p = rcu_dereference(gp);
> 	q = (struct foo *)(f() ? p : p + 1);

So the cast would be like kill_dependency()?

> 	if (q < THE_LIMIT)
> 		do_something();
> 	else
> 		do_something_else(p);
> 
> Does that make sense?

I think I understand which scheme you have in mind.  I just don't have a
strong preference between your approach (AFAIU, roughly, to expect
programmers to kill vdp and to warn on silent kills otherwise) and what
I had in mind (to allow silent transitions to non-vdp but to provide a
helper function/macro that raises an error if the access is not vdp (so
one can "request" to get a vdp for memory accesses where this matters)).

Did I understand your approach correctly?

Right now, I can't confidently say that one would be better than the
other.  I think we need to get feedback for both.

> > > This assumes that p->a only returns
> > > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > > through the program.  ;-)
> > 
> > That's a good question.  For the scheme I had in mind, I'm not concerned
> > about vdps running wild because one needs to assign to explicitly
> > vdp-typed variables (or function arguments, etc.) to let vdp extend to
> > beyond single expressions.
> > 
> > Nonetheless, I think it's a good question how -> should behave if the
> > field is not vdp; in particular, should vdp->non_vdp be automatically
> > vdp?  One concern might be that we know something about non-vdp -- OTOH,
> > we shouldn't be able to do so because we (assume to) don't know anything
> > about the vdp pointer, so we can't infer something about something it
> > points to.
> 
> In almost all the cases I am seeing in the Linux kernel, p->f wants to
> be non-vdp.  A common case is that "f" is an integer that is used in
> later computation, but where the ordering is needed only when fetching
> p->f, not during later use of the resulting integer.

Modulo perhaps a few minor missed optimizations in this expression, if
we had implicit/silent conversion to non-vdp, this should just work fine
I believe even if we say that vdp->non_vdp is vdp by default.

> So it is looking like p->f should be vdp only if field "f" is declared vdp.

I can see that if the silent conversion to non-vdp is not allowed, then
this approach might lead to fewer casts / kill_dependency().

> > > The other thing that can happen is that a vdp can get handed off to
> > > another synchronization mechanism, for example, to reference counting:
> > > 
> > > 	p = atomic_load_explicit(&gp, memory_order_consume);
> > > 	if (do_something_with(p->a)) {
> > > 		/* fast path protected by RCU. */
> > > 		return 0;
> > > 	}
> > > 	if (atomic_inc_not_zero(&p->refcnt)) {
> > 
> > Is the argument to atomic_inc_not_zero vdp or non-vdp?
> 
> The argument to atomic_inc_not_zero() is non-vdp, and because it is an
> atomic operation, it would not make sense to mark it vdp.

Why?  If it would be an atomic mo_relaxed load, for example, then vdp
would possibly make sense, or not?  For atomic RMW ops, I also don't see
why we'd always want non-vdp as operands.

> This results
> in a bit of a dilemma: I am finding code that wants "&p->f" to be vdp
> if "p" is vdp, and I am finding other code (like the above) that wants
> "&p->f" to be non-vdp always.
> 
> The approaches I can think of at the moment include:
> 
> 1.	If "p" is vdp, make "&p->f" be vdp, but don't complain about
> 	subsequent assignments to non-vdp variables.  Sounds like quite
> 	a mess in the compiler.

Why?  On the assignment, there will need to be an implicit conversion.
But assigning floats to integers or integer to boolean seems quite
similar, or not?

> 2.	Propagate value_dep_preserving tags throughout the kernel.
> 	Sounds like a good recipe for a Linux-kernel revolt against
> 	this proposal.
> 
> 3.	Require explicit casts to avoid warnings:
> 
> 	if atomic_inc_not_zero((struct foo *)&p->refcnt) {
> 
> 	This would not be as bad as #2, but would still require
> 	a fair amount of markup.

3a.
We could also allow implicit conversion from non-vdp to vdp (but
default to keeping vdp operands vdp, conservatively, so that dependency
chains in expressions aren't broken accidentally).
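
A rough sketch of what 3a would look like in common code (the qualifier
is hypothetical, and fields_ok() is a made-up helper):

	static int fields_ok(struct foo value_dep_preserving *p)
	{
		return p->a != 0;	/* readers keep their dependency chain */
	}

	/* reader:  fields_ok(rcu_dereference(gp));	vdp argument, matches */
	/* updater: fields_ok(gp_held);			non-vdp converts implicitly */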

> 4.	Use something like kill_dependency().  This has strengths
> 	and weaknesses similar to #3, but has the advantage of
> 	being useful in type-generic macros.
> 
> 5.	Either #3 or #4 above, but have a command-line flag that
> 	shuts off the warnings.  That way, people who want the
> 	diagnostics can enable them in their own code, and people
> 	who don't can disable them.

If we have implicit conversions, then having warnings for when (some of)
those happen sounds like a good idea.

> #5 looks like the way to go to me.  So "&p->f" has the same vdp-ness
> as "p", so that assigning it to a non-vdp variable, passing it via a
> non-vdp argument, or returning it via a non-vdp return value will
> cause a warning.  However, that warning can be easily shut off on a
> file-by-file basis.
> 
> Seem reasonable?

Yes, except that I think that we need to describe this differently (ie,
more like how the standard handles other implicit conversions).
Whether the warning should be on by default (ie, opt-out) or should be
part of -Wall can be separately discussed, I believe.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-05 18:15                                                                                                                                     ` Paul E. McKenney
@ 2014-03-07 18:33                                                                                                                                       ` Torvald Riegel
  2014-03-07 19:11                                                                                                                                         ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: Torvald Riegel @ 2014-03-07 18:33 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote:
> On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
> > On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> > > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > > 
> > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > > 
> > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > > +o	Do not use the results from the boolean "&&" and "||" when
> > > > > > > > +	dereferencing.	For example, the following (rather improbable)
> > > > > > > > +	code is buggy:
> > > > > > > > +
> > > > > > > > +		int a[2];
> > > > > > > > +		int index;
> > > > > > > > +		int force_zero_index = 1;
> > > > > > > > +
> > > > > > > > +		...
> > > > > > > > +
> > > > > > > > +		r1 = rcu_dereference(i1);
> > > > > > > > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +	The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > > > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > > > +	do order stores after such branches, they can speculate loads,
> > > > > > > > +	which can result in misordering bugs.
> > > > > > > > +
> > > > > > > > +o	Do not use the results from relational operators ("==", "!=",
> > > > > > > > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > > +	the following (quite strange) code is buggy:
> > > > > > > > +
> > > > > > > > +		int a[2];
> > > > > > > > +		int index;
> > > > > > > > +		int flip_index = 0;
> > > > > > > > +
> > > > > > > > +		...
> > > > > > > > +
> > > > > > > > +		r1 = rcu_dereference(i1);
> > > > > > > > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +	As before, the reason this is buggy is that relational operators
> > > > > > > > +	are often compiled using branches.  And as before, although
> > > > > > > > +	weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > > +	after such branches, they can speculate loads, which can again
> > > > > > > > +	result in misordering bugs.
> > > > > > > 
> > > > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > > > there are further constraints due to the type of r1 and the values that
> > > > > > > flip_index can have).
> > > > > > 
> > > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > > guarantees than we get by default from current compilers.
> > > > > > 
> > > > > > One question, though.  Suppose that the code did not want a value
> > > > > > dependency to be tracked through a comparison operator.  What does
> > > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > > dependency to be tracked through a comparison.)
> > > > > 
> > > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > > comparison?
> > > > 
> > > > That should work well assuming that things like "if", "while", and "?:"
> > > > conditions are happy to take a vdp.  This assumes that p->a only returns
> > > > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > > > through the program.  ;-)
> > > > 
> > > > The other thing that can happen is that a vdp can get handed off to
> > > > another synchronization mechanism, for example, to reference counting:
> > > > 
> > > > 	p = atomic_load_explicit(&gp, memory_order_consume);
> > > > 	if (do_something_with(p->a)) {
> > > > 		/* fast path protected by RCU. */
> > > > 		return 0;
> > > > 	}
> > > > 	if (atomic_inc_not_zero(&p->refcnt)) {
> > > > 		/* slow path protected by reference counting. */
> > > > 		return do_something_else_with((struct foo *)p);  /* CHANGE */
> > > > 	}
> > > > 	/* Needed slow path, but raced with deletion. */
> > > > 	return -EAGAIN;
> > > > 
> > > > I am guessing that the cast ends the vdp.  Is that the case?
> > > 
> > > And here is a more elaborate example from the Linux kernel:
> > > 
> > > 	struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
> > > 
> > > 	rdev = rcu_dereference(conf->mirrors[disk].rdev);
> > > 	if (r1_bio->bios[disk] == IO_BLOCKED
> > > 	    || rdev == NULL
> > > 	    || test_bit(Unmerged, &rdev->flags)
> > > 	    || test_bit(Faulty, &rdev->flags))
> > > 		continue;
> > > 
> > > The fact that the "rdev == NULL" returns vdp does not force the "||"
> > > operators to be evaluated arithmetically because the entire expression
> > > is an "if" condition, correct?
> > 
> > That's a good question, and one that as far as I understand currently,
> > essentially boils down to whether we want to have tight restrictions on
> > which operations are still vdp.
> > 
> > If we look at the different combinations, then it seems we can't decide
> > on whether we have a value-dependency just due to a vdp type:
> > * non-vdp || vdp:  vdp iff non-vdp == false
> > * vdp || non-vdp:  vdp iff non-vdp == false?
> > * vdp || vdp: always vdp? (and dependency on both?)
> > 
> > I'm not sure it makes sense to try to not make all of those
> > vdp-by-default.  The first and second case show that it's dependent on
> > the specific execution anyway, and thus is already covered by the
> > requirement that the value must still matter.   The vdp type is just a
> > way to prevent inappropriate compiler optimizations; it's not critical
> > for correctness if we make more stuff vdp, yet it may prevent some
> > optimizations in the affected expression.
> > 
> > If the compiler knows that some vdp-typed evaluation will not have a
> > value-dependency anyway, then it can just optimize this evaluation like
> > non-vdp code.
> > 
> > I guess not much would change for the code you posted, because we
> > already have to evaluate || operands in order, I believe (e.g., don't
> > access rdev->flags before doing the rdev == NULL check, modulo as-if).
> > Do I understand your question correctly?
> 
> Let me give an example for the other side:
> 
> 	struct foo value_dep_preserving *p;
> 	struct foo value_dep_preserving *q;
> 
> 	p = rcu_dereference(gp);
> 	q = rcu_dereference(gq);
> 	return myarray[p || q];  /* Linux kernel doesn't do this. */
> 
> If we wanted this to work (and I am not at all convinced that we do),
> the compiler would have to force a data dependency through the "||".

Yes.

> But I would be just as happy to instead just say that boolean logical
> operators ("||" and "&&") never return vdp values.

I think those aren't actually the problem (or if they were, we'd need to
think about & and | on 1-bit integers or bitfields as well), but ...

> Ditto for the
> relational operators ("==", "!=", ">", ">=", "<", and "<=").  No one
> seems to rely on value dependencies via these operators, after all,
> and preserving value dependencies through them seems to require that
> the compiler generate odd code.

... that any conversion from vdp to bool requires specialized handling
by the compiler.  That happens on implicit conversion (as in "p || q")
and in the operators you mentioned.

I don't see a reason why conversion to bool (or any other operator
returning bool and taking vdp as operand) should be *always* non-vdp.
But it seems it would be easier to misuse than other operators.
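
To make the & and | remark above concrete (assuming r1 is known to be 0
or 1, as in Paul's earlier examples, and that the compiler cannot prove
the result constant):

	r2 = a[r1 & force_zero_index];	/* bitwise AND: pure arithmetic, so the
					   address still depends on r1's value */
	r2 = a[r1 && force_zero_index];	/* logical AND: typically a branch, so
					   the dependency can be broken */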


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-07 17:45                                                                                                                                     ` Torvald Riegel
@ 2014-03-07 19:02                                                                                                                                       ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-03-07 19:02 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Fri, Mar 07, 2014 at 06:45:57PM +0100, Torvald Riegel wrote:
> 
> On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote:
> > On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
> > > 
> > > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > > 
> > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > > 
> > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > > +o	Do not use the results from the boolean "&&" and "||" when
> > > > > > > > +	dereferencing.	For example, the following (rather improbable)
> > > > > > > > +	code is buggy:
> > > > > > > > +
> > > > > > > > +		int a[2];
> > > > > > > > +		int index;
> > > > > > > > +		int force_zero_index = 1;
> > > > > > > > +
> > > > > > > > +		...
> > > > > > > > +
> > > > > > > > +		r1 = rcu_dereference(i1);
> > > > > > > > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +	The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > > > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > > > +	do order stores after such branches, they can speculate loads,
> > > > > > > > +	which can result in misordering bugs.
> > > > > > > > +
> > > > > > > > +o	Do not use the results from relational operators ("==", "!=",
> > > > > > > > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > > +	the following (quite strange) code is buggy:
> > > > > > > > +
> > > > > > > > +		int a[2];
> > > > > > > > +		int index;
> > > > > > > > +		int flip_index = 0;
> > > > > > > > +
> > > > > > > > +		...
> > > > > > > > +
> > > > > > > > +		r1 = rcu_dereference(i1);
> > > > > > > > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +	As before, the reason this is buggy is that relational operators
> > > > > > > > +	are often compiled using branches.  And as before, although
> > > > > > > > +	weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > > +	after such branches, they can speculate loads, which can again
> > > > > > > > +	result in misordering bugs.
> > > > > > > 
> > > > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > > > there are further constraints due to the type of r1 and the values that
> > > > > > > flip_index can have).
> > > > > > 
> > > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > > guarantees than we get by default from current compilers.
> > > > > > 
> > > > > > One question, though.  Suppose that the code did not want a value
> > > > > > dependency to be tracked through a comparison operator.  What does
> > > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > > dependency to be tracked through a comparison.)
> > > > > 
> > > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > > comparison?
> > > > 
> > > > That should work well assuming that things like "if", "while", and "?:"
> > > > conditions are happy to take a vdp.
> > > 
> > > I currently don't see a reason why that should be disallowed.  If we
> > > have allowed an implicit conversion to non-vdp, I believe that should
> > > follow.
> > 
> > I am a bit nervous about a silent implicit conversion from vdp to
> > non-vdp in the general case.
> 
> Why are you nervous about it?

If someone expects the vdp to propagate into some function that might
be compiled with aggressive optimizations that break this expectation,
it would be good for that someone to know about it.

Ah!  I am assuming that the compiler is -not- emitting memory barriers
at vdp-to-non-vdp transitions.  In that case, warnings are even more
important -- without the warnings, it is a real pain chasing these
unnecessary memory barriers out of the code.

So we are -not- in the business of emitting memory barriers on
vdp-to-non-vdp transitions, right?

> > However, when the result is being used by
> > a conditional, the silent implicit conversion makes a lot of sense.
> > Is that distinction something that the compiler can handle easily?
> 
> I think so. I'm not a language lawyer, but we have other such
> conversions in the standard (e.g., int to boolean, between int and
> float) and I currently don't see a fundamental difference to those.  But
> we'll have to ask the language folks (or SG1 or LEWG) to really verify
> that.

Understood!

> > On the other hand, silent implicit conversion from non-vdp to vdp
> > is very useful for common code that can be invoked both by RCU
> > readers and by updaters.
> 
> I'd be more nervous about that because then there are fewer obstacles to
> one programmer expecting a vdp to indicate a dependency vs. another
> programmer putting non-vdp into vdp.

Well, that is the concern either way.  But the usual reason for putting
non-vdp into vdp is because the update-side code holds the lock, so
that nothing can change.  In this case, there is no harm in passing
the non-vdp pointer to a vdp function.

In contrast, on the read side, there is nothing preventing the underlying
data from changing at any time.  So if you have a read-side vdp, passing
it to a non-vdp function could result in an ordering bug.
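
A sketch of the two cases (value_dep_preserving still hypothetical;
common_helper(), plain_helper(), and mylock are made-up names):

	struct foo *p;				/* updater's plain pointer */
	struct foo value_dep_preserving *q;	/* reader's vdp pointer */

	/* Updater: lock held, so ordering does not rely on the dependency. */
	spin_lock(&mylock);
	p = rcu_dereference_protected(gp, lockdep_is_held(&mylock));
	common_helper(p);		/* non-vdp into a vdp parameter: harmless */
	spin_unlock(&mylock);

	/* Reader: ordering relies entirely on the dependency chain. */
	rcu_read_lock();
	q = rcu_dereference(gp);	/* vdp */
	plain_helper(q);		/* vdp into non-vdp: ordering may be lost */
	rcu_read_unlock();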

And given the choice, what I would want would be to be warned of a
vdp-to-non-vdp transition within an RCU read-side critical section,
but not outside of an RCU read-side critical section.  Not sure whether
that is practical from a compiler viewpoint, though.  (Would need to tell
the compiler about rcu_read_unlock(), which is easy from my perspective,
but there are interactions with vdp-marked parameters and return values.)

> For this case of common code (which I agree is a valid concern), would
> it be a lot of programmer overhead to add explicit casts from non-vdp to
> vdp?  Would C11 generics help with that, similarly to how C++ template
> functions would?

For common code, it depends on other decisions.  For example, in the
case where "p" being vdp implies that "p->f" is vdp even when "f" is
declared non-vdp, there would be an intolerable number of casts.

I am not sure to what extent generics or templates would help.  The use
of gcc type-generic macros would be much easier with some primitive that
could strip vdp from a given type.

> Nonetheless, in the end this is just trading off convenient use against
> different ways to catch different but simple errors.

Yep.  Perhaps best just to make two separate command-line flags, one to
enable vdp-to-non-vdp warnings and the other to enable non-vdp-to-vdp
warnings.  That would allow each project to adapt the compiler to their
coding standards and expectations.

> > >          ?: could be somewhat special, in that the type depends on the
> > > 2nd and 3rd operand.  Thus, "vdp x = non-vdp ? vdp : vdp;" should be
> > > allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be
> > > disallowed if we don't provide for implicit casts from non-vdp to vdp.
> > 
> > Actually, from the Linux-kernel code that I am seeing, we want to be able
> > to silently convert from non-vdp to vdp in order to permit common code
> > that is invoked from both RCU readers (vdp) and updaters (often non-vdp).
> > This common code must be compiled conservatively to allow vdp, but should
> > be just fine with non-vdp.
> > 
> > Going through the combinations...
> > 
> >  0.	vdp x = vdp ? vdp : vdp;		 /* OK, matches. */
> >  1.	vdp x = vdp ? vdp : non-vdp;		 /* Silent conversion. */
> >  2.	vdp x = vdp ? non-vdp : vdp;		 /* Silent conversion. */
> >  3.	vdp x = vdp ? non-vdp : non-vdp;	 /* Silent conversion. */
> >  4.	vdp x = non-vdp ? vdp : vdp;		 /* OK, matches. */
> >  5.	vdp x = non-vdp ? vdp : non-vdp;	 /* Silent conversion. */
> >  6.	vdp x = non-vdp ? non-vdp : vdp;	 /* Silent conversion. */
> >  7.	vdp x = non-vdp ? non-vdp : non-vdp;	 /* Silent conversion. */
> >  8.	non-vdp x = vdp ? vdp : vdp;		 /* Warning unless condition. */
> >  9.	non-vdp x = vdp ? vdp : non-vdp;	 /* Warning unless condition. */
> > 10.	non-vdp x = vdp ? non-vdp : vdp;	 /* Warning unless condition. */
> > 11.	non-vdp x = vdp ? non-vdp : non-vdp;	 /* OK, matches. */
> > 12.	non-vdp x = non-vdp ? vdp : vdp;	 /* Warning unless condition. */
> > 13.	non-vdp x = non-vdp ? vdp : non-vdp;	 /* Warning unless condition. */
> > 14.	non-vdp x = non-vdp ? non-vdp : vdp;	 /* Warning unless condition. */
> > 15.	non-vdp x = non-vdp ? non-vdp : non-vdp; /* OK, matches. */
> > 
> > 0, 4, 11, and 15 are OK because both legs of the ?: match the variable
> > being assigned to.  1, 2, 3, 5, 6, and 7 are implicit silent conversions
> > from non-vdp to vdp, which is always safe and is useful for common code.
> 
> Note that some of those can in fact be vdp depending on operands, but
> don't necessarily carry an actual value dependency.  So, from a type
> system perspective, I would guess that those expressions would be vdp by
> default (except 7 and 3).
> 
> > 8, 9, 10, 12, 13, and 14 are mismatches: A vdp quantity is being assigned
> > to a non-vdp variable, which could potentially be passed to a vdp-oblivious
> > function.
> 
> I agree that there is a mismatch, but I'm not sure we want to warn on
> silent conversion from vdp to non-vdp instead of just doing a silent
> conversion.  Otherwise, we'll have to add casts whenever we send a vdp
> to something that doesn't want to make use of the value dependency
> (e.g., printf, an if statement, ...).  What would be the programmer
> overhead for the latter?

You have convinced me that different projects will want to have different
types of warnings, so that there should be a pair of compiler command-line
flags, one to enable vdp-to-non-vdp warnings and another to enable
non-vdp-to-vdp warnings.

> > However, 8, 9, 10, 12, 13, and 14 are OK if the result is
> > consumed by a conditional.  That said, I would not complain if something
> > like the following kicked out a warning:
> > 
> > 	struct foo value_dep_preserving *p;
> > 	struct foo *q;
> > 
> > 	p = rcu_dereference(gp);
> > 	q = f() ? p : p + 1;
> 
> You'd like to see the warning here, right?

Yes, if enabled and inside an RCU read-side critical section.

> > 	if (q < THE_LIMIT)
> > 		do_something();
> > 	else
> > 		do_something_else(p);
> > 
> > The warning could be avoided by marking q value_dep_preserving or by
> > eliminating q entirely:
> > 
> > 	struct foo value_dep_preserving *p;
> > 
> > 	p = rcu_dereference(gp);
> > 	if ((f() ? p : p + 1) < THE_LIMIT)
> > 		do_something();
> > 	else
> > 		do_something_else(p);
> > 
> > Or, for that matter, by using a cast:
> > 
> > 	struct foo value_dep_preserving *p;
> > 	struct foo *q;
> > 
> > 	p = rcu_dereference(gp);
> > 	q = (struct foo *)(f() ? p : p + 1);
> 
> So the cast would be like kill_dependency()?

In the sense that it tells the compiler to stop worrying about the
value dependency.

> > 	if (q < THE_LIMIT)
> > 		do_something();
> > 	else
> > 		do_something_else(p);
> > 
> > Does that make sense?
> 
> I think I understand which scheme you have in mind.  I just don't have a
> strong preference between your approach (AFAIU, roughly, to expect
> programmers to kill vdp and to warn on silent kills otherwise) and what
> I had in mind (to allow silent transitions to non-vdp but to provide a
> helper function/macro that raises an error if the access is not vdp (so
> one can "request" to get a vdp for memory accesses where this matters)).

Ah, I see where you were coming from for your non-vdp-to-vdp warning, as
that would catch the case where your function/macro might get the wrong
answer.  Hmmm...

> Did I understand your approach correctly?

I believe you did.

> Right now, I can't confidently say that one would be better than the
> other.  I think we need to get feedback for both.

Makes sense.

> > > > This assumes that p->a only returns
> > > > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > > > through the program.  ;-)
> > > 
> > > That's a good question.  For the scheme I had in mind, I'm not concerned
> > > about vdps running wild because one needs to assign to explicitly
> > > vdp-typed variables (or function arguments, etc.) to let vdp extend to
> > > beyond single expressions.
> > > 
> > > Nonetheless, I think it's a good question how -> should behave if the
> > > field is not vdp; in particular, should vdp->non_vdp be automatically
> > > vdp?  One concern might be that we know something about non-vdp -- OTOH,
> > > we shouldn't be able to do so because we (by assumption) don't know
> > > anything about the vdp pointer, so we can't infer anything about
> > > something it points to.
> > 
> > In almost all the cases I am seeing in the Linux kernel, p->f wants to
> > be non-vdp.  A common case is that "f" is an integer that is used in
> > later computation, but where the ordering is needed only when fetching
> > p->f, not during later use of the resulting integer.
> 
> Modulo perhaps a few minor missed optimizations in this expression, if
> we had implicit/silent conversion to non-vdp, this should work just
> fine, I believe, even if we say that vdp->non_vdp is vdp by default.
> 
> > So it is looking like p->f should be vdp only if field "f" is declared vdp.
> 
> I can see that if the silent conversion to non-vdp is not allowed, then
> this approach might lead to fewer casts / kill_dependency().

Agreed in both cases.

> > > > The other thing that can happen is that a vdp can get handed off to
> > > > another synchronization mechanism, for example, to reference counting:
> > > > 
> > > > 	p = atomic_load_explicit(&gp, memory_order_consume);
> > > > 	if (do_something_with(p->a)) {
> > > > 		/* fast path protected by RCU. */
> > > > 		return 0;
> > > > 	}
> > > > 	if (atomic_inc_not_zero(&p->refcnt)) {
> > > 
> > > Is the argument to atomic_inc_not_zero() vdp or non-vdp?
> > 
> > The argument to atomic_inc_not_zero() is non-vdp, and because it is an
> > atomic operation, it would not make sense to mark it vdp.
> 
> Why?  If it were an atomic mo_relaxed load, for example, then vdp
> would possibly make sense, or not?  For atomic RMW ops, I also don't see
> why we'd always want non-vdp as operands.

I should have said that it is a value-returning read-modify-write atomic
operation, which translates to something stronger than memory_order_seq_cst
in C11.  The reason it is stronger is that it provides more ordering
guarantees to surrounding relaxed (ACCESS_ONCE()) operations.

For example, given x and y both initially zero:

T1:	ACCESS_ONCE(x) = 1;
	if (atomic_inc_not_zero(&p->refcnt))
		ACCESS_ONCE(y) = 1;

T2:	if (ACCESS_ONCE(y)) {
		if (atomic_inc_not_zero(&q->other_refcnt))
			BUG_ON(!ACCESS_ONCE(x));
	}

In the Linux kernel, the BUG_ON() cannot trigger.  In C11, it could.

So, to answer your question, if atomic_inc_not_zero() mapped to a
memory_order_relaxed operation, then yes, you might need vdp.  But given
that it maps to stronger-than-memory_order_seq_cst, you do not.
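
For reference, here is a rough C11 sketch of the semantics (an
illustration only, not the kernel's implementation, with a made-up
name, and deliberately ignoring the stronger-than-seq_cst ordering
discussed above):

	#include <stdatomic.h>
	#include <stdbool.h>

	static bool inc_not_zero_sketch(atomic_int *v)
	{
		int old = atomic_load_explicit(v, memory_order_relaxed);

		do {
			if (old == 0)
				return false;	/* count already dead */
		} while (!atomic_compare_exchange_weak_explicit(
				v, &old, old + 1,
				memory_order_seq_cst, memory_order_relaxed));
		return true;	/* count was nonzero, reference taken */
	}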

> > This results
> > in a bit of a dilemma: I am finding code that wants "&p->f" to be vdp
> > if "p" is vdp, and I am finding other code (like the above) that wants
> > "&p->f" to be non-vdp always.
> > 
> > The approaches I can think of at the moment include:
> > 
> > 1.	If "p" is vdp, make "&p->f" be vdp, but don't complain about
> > 	subsequent assignments to non-vdp variables.  Sounds like quite
> > 	a mess in the compiler.
> 
> Why?  On the assignment, there will need to be an implicit conversion.
> But assigning floats to integers or integer to boolean seems quite
> similar, or not?

In your scheme, you would never complain about vdp-to-non-vdp assignments,
so no problem.  If it is also not a problem in my approach, so much the
better!  ;-)

> > 2.	Propagate value_dep_preserving tags throughout the kernel.
> > 	Sounds like a good recipe for a Linux-kernel revolt against
> > 	this proposal.
> > 
> > 3.	Require explicit casts to avoid warnings:
> > 
> > 	if (atomic_inc_not_zero((struct foo *)&p->refcnt)) {
> > 
> > 	This would not be as bad as #2, but would still require
> > 	a fair amount of markup.
> 
> 3a.
> We could also allow implicit conversion from non-vdp to vdp (but a
> default to go from vdp to vdp, conservatively, so that dependency chains
> in expressions aren't broken accidentally).

Right, this is your proposal.

> > 4.	Use something like kill_dependency().  This has strengths
> > 	and weaknesses similar to #3, but has the advantage of
> > 	being useful in type-generic macros.
> > 
> > 5.	Either #3 or #4 above, but have a command-line flag that
> > 	shuts off the warnings.  That way, people who want the
> > 	diagnostics can enable them in their own code, and people
> > 	who don't can disable them.
> 
> If we have implicit conversions, then having warnings for when (some of)
> those happen sounds like a good idea.

Very good.

> > #5 looks like the way to go to me.  So "&p->f" has the same vdp-ness
> > as "p", so that assigning it to a non-vdp variable, passing it via a
> > non-vdp argument, or returning it via a non-vdp return value will
> > cause a warning.  However, that warning can be easily shut off on a
> > file-by-file basis.
> > 
> > Seem reasonable?
> 
> Yes, except that I think that we need to describe this differently (ie,
> more like how the standard handles other implicit conversions).
> Whether the warning should be on by default (ie, opt-out) or should be
> part of -Wall can be separately discussed, I believe.

Right, the standardese would be quite different.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-07 18:33                                                                                                                                       ` Torvald Riegel
@ 2014-03-07 19:11                                                                                                                                         ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-03-07 19:11 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

On Fri, Mar 07, 2014 at 07:33:25PM +0100, Torvald Riegel wrote:
> On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote:
> > On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
> > > On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> > > > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > > > 
> > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > > > 
> > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > > > +o	Do not use the results from the boolean "&&" and "||" when
> > > > > > > > > +	dereferencing.	For example, the following (rather improbable)
> > > > > > > > > +	code is buggy:
> > > > > > > > > +
> > > > > > > > > +		int a[2];
> > > > > > > > > +		int index;
> > > > > > > > > +		int force_zero_index = 1;
> > > > > > > > > +
> > > > > > > > > +		...
> > > > > > > > > +
> > > > > > > > > +		r1 = rcu_dereference(i1)
> > > > > > > > > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > > > +
> > > > > > > > > +	The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > > > > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > > > > +	do order stores after such branches, they can speculate loads,
> > > > > > > > > +	which can result in misordering bugs.
> > > > > > > > > +
> > > > > > > > > +o	Do not use the results from relational operators ("==", "!=",
> > > > > > > > > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > > > +	the following (quite strange) code is buggy:
> > > > > > > > > +
> > > > > > > > > +		int a[2];
> > > > > > > > > +		int index;
> > > > > > > > > +		int flip_index = 0;
> > > > > > > > > +
> > > > > > > > > +		...
> > > > > > > > > +
> > > > > > > > > +		r1 = rcu_dereference(i1)
> > > > > > > > > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > > > +
> > > > > > > > > +	As before, the reason this is buggy is that relational operators
> > > > > > > > > +	are often compiled using branches.  And as before, although
> > > > > > > > > +	weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > > > +	after such branches, but can speculate loads, which can again
> > > > > > > > > +	result in misordering bugs.
> > > > > > > > 
> > > > > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > > > > there are further constraints due to the type of r1 and the values that
> > > > > > > > flip_index can have).
> > > > > > > 
> > > > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > > > guarantees than we get by default from current compilers.
> > > > > > > 
> > > > > > > One question, though.  Suppose that the code did not want a value
> > > > > > > dependency to be tracked through a comparison operator.  What does
> > > > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > > > dependency to be tracked through a comparison.)
> > > > > > 
> > > > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > > > comparison?
> > > > > 
> > > > > That should work well assuming that things like "if", "while", and "?:"
> > > > > conditions are happy to take a vdp.  This assumes that p->a only returns
> > > > > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > > > > through the program.  ;-)
> > > > > 
> > > > > The other thing that can happen is that a vdp can get handed off to
> > > > > another synchronization mechanism, for example, to reference counting:
> > > > > 
> > > > > 	p = atomic_load_explicit(&gp, memory_order_consume);
> > > > > 	if (do_something_with(p->a)) {
> > > > > 		/* fast path protected by RCU. */
> > > > > 		return 0;
> > > > > 	}
> > > > > 	if (atomic_inc_not_zero(&p->refcnt)) {
> > > > > 		/* slow path protected by reference counting. */
> > > > > 		return do_something_else_with((struct foo *)p);  /* CHANGE */
> > > > > 	}
> > > > > 	/* Needed slow path, but raced with deletion. */
> > > > > 	return -EAGAIN;
> > > > > 
> > > > > I am guessing that the cast ends the vdp.  Is that the case?
> > > > 
> > > > And here is a more elaborate example from the Linux kernel:
> > > > 
> > > > 	struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
> > > > 
> > > > 	rdev = rcu_dereference(conf->mirrors[disk].rdev);
> > > > 	if (r1_bio->bios[disk] == IO_BLOCKED
> > > > 	    || rdev == NULL
> > > > 	    || test_bit(Unmerged, &rdev->flags)
> > > > 	    || test_bit(Faulty, &rdev->flags))
> > > > 		continue;
> > > > 
> > > > The fact that the "rdev == NULL" returns vdp does not force the "||"
> > > > operators to be evaluated arithmetically because the entire function
> > > > is an "if" condition, correct?
> > > 
> > > That's a good question, and one that as far as I understand currently,
> > > essentially boils down to whether we want to have tight restrictions on
> > > which operations are still vdp.
> > > 
> > > If we look at the different combinations, then it seems we can't decide
> > > on whether we have a value-dependency just due to a vdp type:
> > > * non-vdp || vdp:  vdp iff non-vdp == false
> > > * vdp || non-vdp:  vdp iff non-vdp == false?
> > > * vdp || vdp: always vdp? (and dependency on both?)
> > > 
> > > I'm not sure it makes sense to try to avoid making all of those
> > > vdp-by-default.  The first and second case show that it's dependent on
> > > the specific execution anyway, and thus is already covered by the
> > > requirement that the value must still matter.   The vdp type is just a
> > > way to prevent inappropriate compiler optimizations; it's not critical
> > > for correctness if we make more stuff vdp, yet it may prevent some
> > > optimizations in the affected expression.
> > > 
> > > If the compiler knows that some vdp-typed evaluation will not have a
> > > value-dependency anyway, then it can just optimize this evaluation like
> > > non-vdp code.
> > > 
> > > I guess not much would change for the code you posted, because we
> > > already have to evaluate || operands in order, I believe (e.g., don't
> > > access rdev->flags before doing the rdev == NULL check, modulo as-if).
> > > Do I understand your question correctly?
> > 
> > Let me give an example for the other side:
> > 
> > 	struct foo value_dep_preserving *p;
> > 	struct foo value_dep_preserving *q;
> > 
> > 	p = rcu_dereference(gp);
> > 	q = rcu_dereference(gq);
> > 	return myarray[p || q];  /* Linux kernel doesn't do this. */
> > 
> > If we wanted this to work (and I am not at all convinced that we do),
> > the compiler would have to force a data dependency through the "||".
> 
> Yes.
> 
> > But I would be just as happy to instead just say that boolean logical
> > operators ("||" and "&&") never return vdp values.
> 
> I think those aren't actually the problem (or if they were, we'd need to
> think about & and | on 1-bit integers or bitfields as well), but ...
> 
> > Ditto for the
> > relational operators ("==", "!=", ">", ">=", "<", and "<=").  No one
> > seems to rely on value dependencies via these operators, after all,
> > and preserving value dependencies through them seems to require that
> > the compiler generate odd code.
> 
> ... that any conversion from vdp to bool requires specialized handling
> by the compiler.  That happens on implicit conversion (as in "p || q")
> and in the operators you mentioned.
> 
> I don't see a reason why conversion to bool (or any other operator
> returning bool and taking vdp as operand) should be *always* non-vdp.
> But it seems it would be easier to misuse than other operators.

Well, we have more than 1,000 things in the Linux kernel that head up
what would be vdps, and none of them need a relational operator or a
boolean non-bitwise operator to produce a vdp.

Admittedly, there is some danger in extrapolating, but we are
extrapolating from a reasonably large sample.

That said, I don't have a problem with these operators producing a vdp as
long as the quality of the code emitted by the compiler doesn't suffer in
the common case where they are not needed, and you are the expert on that.
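
To make the common case concrete, here is a sketch of the dominant
kernel pattern (gp, do_something(), and the types are placeholders):

	struct foo *p;

	rcu_read_lock();
	p = rcu_dereference(gp);
	if (p)				/* value feeds a condition only */
		do_something(p->a);	/* dereference carries the dependency */
	rcu_read_unlock();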

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-26  3:06 George Spelvin
@ 2014-02-26  5:22 ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-26  5:22 UTC (permalink / raw)
  To: George Spelvin
  Cc: akpm, dhowells, gcc, linux-arch, linux-kernel, mingo, peterz,
	Ramana.Radhakrishnan, torvalds, triegel, will.deacon

On Tue, Feb 25, 2014 at 10:06:53PM -0500, George Spelvin wrote:
> <paulmck@linux.vnet.ibm.com> wrote:
> > <torvalds@linux-foundation.org> wrote:
> >> I have for the last several years been 100% convinced that the Intel
> >> memory ordering is the right thing, and that people who like weak
> >> memory ordering are wrong and should try to avoid reproducing if at
> >> all possible.
> >
> > Are ARM and Power really the bad boys here?  Or are they instead playing
> > the role of the canary in the coal mine?
> 
> To paraphrase some older threads, I think Linus's argument is that
> weak memory ordering is like branch delay slots: a way to make a simple
> implementation simpler, but ends up being no help to a more aggressive
> implementation.
> 
> Branch delay slots give a one-cycle bonus to in-order cores, but
> once you go superscalar and add branch prediction, they stop helping,
> and once you go full out of order, they're just an annoyance.
> 
> Likewise, I can see the point that weak ordering can help make a simple
> cache interface simpler, but once you start doing speculative loads,
> you've already bought and paid for all the hardware you need to do
> stronger coherency.
> 
> Another thing that requires all the strong-coherency machinery is
> a high-performance implementation of the various memory barrier and
> synchronization operations.  Yes, a low-performance (drain the pipeline)
> implementation is tolerable if the instructions aren't used frequently,
> but once you're really trying, it doesn't save complexity.
> 
> Once you're there, strong coherency doesn't actually cost you any
> time outside of critical synchronization code, and it both simplifies
> and speeds up the tricky synchronization software.
> 
> 
> So PPC and ARM's weak ordering are not the direction the future is going.
> Rather, weak ordering is something that's only useful in a limited
> technology window, which is rapidly passing.

That does indeed appear to be Intel's story.  Might well be correct.
Time will tell.

> If you can find someone in IBM who's worked on the Z series cache
> coherency (extremely strong ordering), they probably have some useful
> insights.  The big question is if strong ordering, once you've accepted
> the implementation complexity and area, actually costs anything in
> execution time.  If there's an unavoidable cost which weak ordering saves,
> that's significant.

There has been a lot of ink spilled on this argument.  ;-)

PPC has much larger CPU counts than does the mainframe.  On the other
hand, there are large x86 systems.  Some claim that there are differences
in latency due to the different approaches, and there could be a long
argument about whether all this is inherent in the memory ordering or
whether it is due to implementation issues.

I don't claim to know the answer.  I do know that ARM and PPC are
here now, and that I need to deal with them.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
@ 2014-02-26  3:06 George Spelvin
  2014-02-26  5:22 ` Paul E. McKenney
  0 siblings, 1 reply; 299+ messages in thread
From: George Spelvin @ 2014-02-26  3:06 UTC (permalink / raw)
  To: paulmck
  Cc: akpm, dhowells, gcc, linux, linux-arch, linux-kernel, mingo,
	peterz, Ramana.Radhakrishnan, torvalds, triegel, will.deacon

<paulmck@linux.vnet.ibm.com> wrote:
> <torvalds@linux-foundation.org> wrote:
>> I have for the last several years been 100% convinced that the Intel
>> memory ordering is the right thing, and that people who like weak
>> memory ordering are wrong and should try to avoid reproducing if at
>> all possible.
>
> Are ARM and Power really the bad boys here?  Or are they instead playing
> the role of the canary in the coal mine?

To paraphrase some older threads, I think Linus's argument is that
weak memory ordering is like branch delay slots: a way to make a simple
implementation simpler, but ends up being no help to a more aggressive
implementation.

Branch delay slots give a one-cycle bonus to in-order cores, but
once you go superscalar and add branch prediction, they stop helping,
and once you go full out of order, they're just an annoyance.

Likewise, I can see the point that weak ordering can help make a simple
cache interface simpler, but once you start doing speculative loads,
you've already bought and paid for all the hardware you need to do
stronger coherency.

Another thing that requires all the strong-coherency machinery is
a high-performance implementation of the various memory barrier and
synchronization operations.  Yes, a low-performance (drain the pipeline)
implementation is tolerable if the instructions aren't used frequently,
but once you're really trying, it doesn't save complexity.

Once you're there, strong coherency doesn't actually cost you any
time outside of critical synchronization code, and it both simplifies
and speeds up the tricky synchronization software.


So PPC and ARM's weak ordering are not the direction the future is going.
Rather, weak ordering is something that's only useful in a limited
technology window, which is rapidly passing.

If you can find someone in IBM who's worked on the Z series cache
coherency (extremely strong ordering), they probably have some useful
insights.  The big question is if strong ordering, once you've accepted
the implementation complexity and area, actually costs anything in
execution time.  If there's an unavoidable cost which weak ordering saves,
that's significant.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 23:48   ` Peter Sewell
@ 2014-02-19  9:46     ` Torvald Riegel
  0 siblings, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-19  9:46 UTC (permalink / raw)
  To: Peter.Sewell
  Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Peter Zijlstra,
	Linus Torvalds, Will Deacon, ramana.radhakrishnan, David Howells,
	linux-arch, Linux Kernel Mailing List, Andrew Morton,
	Ingo Molnar, gcc

On Tue, 2014-02-18 at 23:48 +0000, Peter Sewell wrote:
> On 18 February 2014 20:43, Torvald Riegel <triegel@redhat.com> wrote:
> > On Tue, 2014-02-18 at 12:12 +0000, Peter Sewell wrote:
> >> Several of you have said that the standard and compiler should not
> >> permit speculative writes of atomics, or (effectively) that the
> >> compiler should preserve dependencies.  In simple examples it's easy
> >> to see what that means, but in general it's not so clear what the
> >> language should guarantee, because dependencies may go via non-atomic
> >> code in other compilation units, and we have to consider the extent to
> >> which it's desirable to limit optimisation there.
> >
> > [...]
> >
> >> 2) otherwise, the language definition should prohibit it but the
> >>    compiler would have to preserve dependencies even in compilation
> >>    units that have no mention of atomics.  It's unclear what the
> >>    (runtime and compiler development) cost of that would be in
> >>    practice - perhaps Torvald could comment?
> >
> > If I'm reading the standard correctly, it requires that data
> > dependencies are preserved through loads and stores, including nonatomic
> > ones.  That sounds convenient because it allows programmers to use
> > temporary storage.
> 
> The standard only needs this for consume chains,

That's right, and the runtime cost / implementation problems of
mo_consume were what I was making statements about.  Sorry if that wasn't
clear.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 20:43 ` Torvald Riegel
  2014-02-18 21:29   ` Paul E. McKenney
@ 2014-02-18 23:48   ` Peter Sewell
  2014-02-19  9:46     ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Peter Sewell @ 2014-02-18 23:48 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Peter Zijlstra,
	Linus Torvalds, Will Deacon, ramana.radhakrishnan, David Howells,
	linux-arch, Linux Kernel Mailing List, Andrew Morton,
	Ingo Molnar, gcc

On 18 February 2014 20:43, Torvald Riegel <triegel@redhat.com> wrote:
> On Tue, 2014-02-18 at 12:12 +0000, Peter Sewell wrote:
>> Several of you have said that the standard and compiler should not
>> permit speculative writes of atomics, or (effectively) that the
>> compiler should preserve dependencies.  In simple examples it's easy
>> to see what that means, but in general it's not so clear what the
>> language should guarantee, because dependencies may go via non-atomic
>> code in other compilation units, and we have to consider the extent to
>> which it's desirable to limit optimisation there.
>
> [...]
>
>> 2) otherwise, the language definition should prohibit it but the
>>    compiler would have to preserve dependencies even in compilation
>>    units that have no mention of atomics.  It's unclear what the
>>    (runtime and compiler development) cost of that would be in
>>    practice - perhaps Torvald could comment?
>
> If I'm reading the standard correctly, it requires that data
> dependencies are preserved through loads and stores, including nonatomic
> ones.  That sounds convenient because it allows programmers to use
> temporary storage.

The standard only needs this for consume chains, but if one wanted to
get rid of thin-air values by requiring implementations to respect all
(reads-from union dependency) cycles, AFAICS we'd need it pretty much
everywhere.   I don't myself think that's likely to be a realistic
proposal, but it does keep coming up, and it'd be very interesting to
know the actual cost on some credible workload.

> However, what happens if a dependency "arrives" at a store for which the
> alias set isn't completely known?  Then we either have to add a barrier
> to enforce the ordering at this point, or we have to assume that all
> other potentially aliasing memory locations would also have to start
> carrying dependencies (which might be in other functions in other
> compilation units).  Neither option is good.  The first might introduce
> barriers in places in which they might not be required (or the
> programmer has to use kill_dependency() quite often to avoid all these).
> The second is bad because points-to analysis is hard, so in practice the
> points-to set will not be precisely known for a lot of pointers.  So
> this might not just creep into other functions via calls of
> [[carries_dependency]] functions, but also through normal loads and
> stores, likely prohibiting many optimizations.
>
> Furthermore, the dependency tracking can currently only be
> "disabled/enabled" on a function granularity (via
> [[carries_dependency]]).  Thus, if we have big functions, then
> dependency tracking may slow down a lot of code in the big function.  If
> we have small functions, there's a lot of attributes to be added.
>
> If a function may only carry a dependency but doesn't necessarily (eg,
> depending on input parameters), then the programmer has to make a
> trade-off whether he/she wants to benefit from mo_consume but slow down
> other calls due to additional barriers (ie, when this function is called
> from non-[[carries_dependency]] functions), or vice versa.  (IOW,
> because of the function granularity, other code's performance is
> affected.)
>
> If a compiler wants to implement dependency tracking just for a few
> constructs (e.g., operators -> + ...) and use barriers otherwise, then
> this decision must be compatible with how all this is handled in other
> compilation units.  Thus, compiler optimizations effectively become part
> of the ABI, which doesn't seem right.
>
> I hope these examples illustrate my concerns about the implementability
> in practice of this.  It's also why I've suggested to move from an
> opt-out approach as in the current standard (ie, with kill_dependency())
> to an opt-in approach for conservative dependency tracking (e.g., with a
> preserve_dependencies(exp) call, where exp will not be optimized in a
> way that removes any dependencies).  This wouldn't help with many
> optimizations being prevented, but it should at least help programmers
> contain the problem to smaller regions of code.
>
> I'm not aware of any implementation that tries to track dependencies, so
> I can't give any real performance numbers.  This could perhaps be
> simulated, but I'm not sure whether a realistic case would be made
> without at least supporting [[carries_dependency]] properly in the
> compiler, which would be some work.
>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 20:43 ` Torvald Riegel
@ 2014-02-18 21:29   ` Paul E. McKenney
  2014-02-18 23:48   ` Peter Sewell
  1 sibling, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18 21:29 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Peter.Sewell, mark.batty@cl.cam.ac.uk, peterz, torvalds,
	Will Deacon, Ramana.Radhakrishnan, dhowells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Tue, Feb 18, 2014 at 09:43:31PM +0100, Torvald Riegel wrote:
> 
> On Tue, 2014-02-18 at 12:12 +0000, Peter Sewell wrote:
> > Several of you have said that the standard and compiler should not
> > permit speculative writes of atomics, or (effectively) that the
> > compiler should preserve dependencies.  In simple examples it's easy
> > to see what that means, but in general it's not so clear what the
> > language should guarantee, because dependencies may go via non-atomic
> > code in other compilation units, and we have to consider the extent to
> > which it's desirable to limit optimisation there.
> 
> [...]
> 
> > 2) otherwise, the language definition should prohibit it but the
> >    compiler would have to preserve dependencies even in compilation
> >    units that have no mention of atomics.  It's unclear what the
> >    (runtime and compiler development) cost of that would be in
> >    practice - perhaps Torvald could comment?
> 
> If I'm reading the standard correctly, it requires that data
> dependencies are preserved through loads and stores, including nonatomic
> ones.  That sounds convenient because it allows programmers to use
> temporary storage.
> 
> However, what happens if a dependency "arrives" at a store for which the
> alias set isn't completely known?  Then we either have to add a barrier
> to enforce the ordering at this point, or we have to assume that all
> other potentially aliasing memory locations would also have to start
> carrying dependencies (which might be in other functions in other
> compilation units).  Neither option is good.  The first might introduce
> barriers in places in which they might not be required (or the
> programmer has to use kill_dependency() quite often to avoid all these).
> The second is bad because points-to analysis is hard, so in practice the
> points-to set will not be precisely known for a lot of pointers.  So
> this might not just creep into other functions via calls of
> [[carries_dependency]] functions, but also through normal loads and
> stores, likely prohibiting many optimizations.

I cannot immediately think of a situation where a store carrying a
dependency into a non-trivially aliased object wouldn't be a usage
error, so perhaps emitting a barrier and a diagnostic at that point
is best.

> Furthermore, the dependency tracking can currently only be
> "disabled/enabled" on a function granularity (via
> [[carries_dependency]]).  Thus, if we have big functions, then
> dependency tracking may slow down a lot of code in the big function.  If
> we have small functions, there's a lot of attributes to be added.
> 
> If a function may only carry a dependency but doesn't necessarily (eg,
> depending on input parameters), then the programmer has to make a
> > trade-off whether he/she wants to benefit from mo_consume but slow down
> other calls due to additional barriers (ie, when this function is called
> from non-[[carries_dependency]] functions), or vice versa.  (IOW,
> because of the function granularity, other code's performance is
> affected.)
> 
> If a compiler wants to implement dependency tracking just for a few
> constructs (e.g., operators -> + ...) and use barriers otherwise, then
> this decision must be compatible with how all this is handled in other
> compilation units.  Thus, compiler optimizations effectively become part
> of the ABI, which doesn't seem right.
> 
> I hope these examples illustrate my concerns about the implementability
> in practice of this.  It's also why I've suggested to move from an
> opt-out approach as in the current standard (ie, with kill_dependency())
> to an opt-in approach for conservative dependency tracking (e.g., with a
> preserve_dependencies(exp) call, where exp will not be optimized in a
> way that removes any dependencies).  This wouldn't help with many
> optimizations being prevented, but it should at least help programmers
> contain the problem to smaller regions of code.
> 
> I'm not aware of any implementation that tries to track dependencies, so
> I can't give any real performance numbers.  This could perhaps be
> simulated, but I'm not sure whether a realistic case would be made
> without at least supporting [[carries_dependency]] properly in the
> compiler, which would be some work.

Another approach would be to use start-tracking/stop-tracking directives
that could be buried into rcu_read_lock() and rcu_read_unlock().  There
are issues with nesting and conditional use of rcu_read_lock() and
rcu_read_unlock(), but it does give you nicer granularity properties.
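
In terms of usage, the granularity would look something like this (the
tracking directives themselves being hypothetical and hidden inside the
RCU API):

	rcu_read_lock();		/* would also start dependency tracking */
	p = rcu_dereference(gp);
	do_something_with(p->a);
	rcu_read_unlock();		/* would also stop dependency tracking */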

							Thanx, Paul


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 18:21   ` Peter Sewell
  2014-02-18 18:49     ` Linus Torvalds
@ 2014-02-18 20:46     ` Torvald Riegel
  1 sibling, 0 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-18 20:46 UTC (permalink / raw)
  To: Peter.Sewell
  Cc: Linus Torvalds, mark.batty@cl.cam.ac.uk, Paul McKenney,
	Peter Zijlstra, Will Deacon, Ramana Radhakrishnan, David Howells,
	linux-arch, Linux Kernel Mailing List, Andrew Morton,
	Ingo Molnar, gcc

On Tue, 2014-02-18 at 18:21 +0000, Peter Sewell wrote:
> On 18 February 2014 17:38, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote:
> >>
> >> For example, suppose we have, in one compilation unit:
> >>
> >>     void f(int ra, int*rb) {
> >>       if (ra==42)
> >>         *rb=42;
> >>       else
> >>         *rb=42;
> >>     }
> >
> > So this is a great example, and in general I really like your page at:
> >
> >> For more context, this example is taken from a summary of the thin-air
> >> problem by Mark Batty and myself,
> >> <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with
> >> dependencies via other compilation units was AFAIK first pointed out
> >> by Hans Boehm.
> >
> > and the reason I like your page is that it really talks about the
> > problem by pointing to the "unoptimized" code, and what hardware would
> > do.
> 
> Thanks.  It's certainly necessary to separately understand what compiler
> optimisation and the hardware might do, to get anywhere here.   But...
> 
> > As mentioned, I think that's actually the *correct* way to think about
> > the problem space, because it allows the programmer to take hardware
> > characteristics into account, without having to try to "describe" them
> > at a source level.
> 
> ...to be clear, I am ultimately after a decent source-level description of what
> programmers can depend on, and we (Mark and I) view that page as
> identifying constraints on what that description can say.  There are too
> many compiler optimisations for people to reason directly in terms of
> the set of all transformations that they do, so we need some more
> concise and comprehensible envelope identifying what is allowed,
> as an interface between compiler writers and users.  AIUI that's basically
> what Torvald is arguing.

Yes, that's one reason.  Another one is that a programmer who actually
wants to use atomics in a machine-independent / portable way also does
not want to reason about how all those transformations might interact
with the machine's memory model.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 12:12 Peter Sewell
                   ` (2 preceding siblings ...)
  2014-02-18 17:38 ` Linus Torvalds
@ 2014-02-18 20:43 ` Torvald Riegel
  2014-02-18 21:29   ` Paul E. McKenney
  2014-02-18 23:48   ` Peter Sewell
  3 siblings, 2 replies; 299+ messages in thread
From: Torvald Riegel @ 2014-02-18 20:43 UTC (permalink / raw)
  To: Peter.Sewell
  Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, peterz, torvalds,
	Will Deacon, Ramana.Radhakrishnan, dhowells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Tue, 2014-02-18 at 12:12 +0000, Peter Sewell wrote:
> Several of you have said that the standard and compiler should not
> permit speculative writes of atomics, or (effectively) that the
> compiler should preserve dependencies.  In simple examples it's easy
> to see what that means, but in general it's not so clear what the
> language should guarantee, because dependencies may go via non-atomic
> code in other compilation units, and we have to consider the extent to
> which it's desirable to limit optimisation there.

[...]

> 2) otherwise, the language definition should prohibit it but the
>    compiler would have to preserve dependencies even in compilation
>    units that have no mention of atomics.  It's unclear what the
>    (runtime and compiler development) cost of that would be in
>    practice - perhaps Torvald could comment?

If I'm reading the standard correctly, it requires that data
dependencies are preserved through loads and stores, including nonatomic
ones.  That sounds convenient because it allows programmers to use
temporary storage.
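
For example (a minimal sketch, with illustrative names):

	#include <stdatomic.h>

	struct foo { int a; };
	_Atomic(struct foo *) gp;

	int via_tmp(void)
	{
		struct foo *tmp;	/* ordinary, non-atomic storage */
		struct foo *p;

		p = atomic_load_explicit(&gp, memory_order_consume);
		tmp = p;	/* non-atomic store carries the dependency in */
		p = tmp;	/* non-atomic load carries it back out */
		return p->a;	/* still dependency-ordered after the consume */
	}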

However, what happens if a dependency "arrives" at a store for which the
alias set isn't completely known?  Then we either have to add a barrier
to enforce the ordering at this point, or we have to assume that all
other potentially aliasing memory locations would also have to start
carrying dependencies (which might be in other functions in other
compilation units).  Neither option is good.  The first might introduce
barriers in places in which they might not be required (or the
programmer has to use kill_dependency() quite often to avoid all these).
The second is bad because points-to analysis is hard, so in practice the
points-to set will not be precisely known for a lot of pointers.  So
this might not just creep into other functions via calls of
[[carries_dependency]] functions, but also through normal loads and
stores, likely prohibiting many optimizations.
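
A sketch of the problematic shape (ga and sink() are made-up names, and
there is no [[carries_dependency]] annotation anywhere):

	#include <stdatomic.h>

	atomic_int ga;

	void sink(int *ptr)	/* ptr's alias set is not fully known */
	{
		int v = atomic_load_explicit(&ga, memory_order_consume);
		*ptr = v;	/* a dependency arrives at this store */
	}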

Furthermore, the dependency tracking can currently only be
"disabled/enabled" on a function granularity (via
[[carries_dependency]]).  Thus, if we have big functions, then
dependency tracking may slow down a lot of code in the big function.  If
we have small functions, there's a lot of attributes to be added.

If a function may only carry a dependency but doesn't necessarily (eg,
depending on input parameters), then the programmer has to make a
trade-off whether he/she wants to benefit from mo_consume but slow down
other calls due to additional barriers (ie, when this function is called
from non-[[carries_dependency]] functions), or vice versa.  (IOW,
because of the function granularity, other code's performance is
affected.)

If a compiler wants to implement dependency tracking just for a few
constructs (e.g., operators -> + ...) and use barriers otherwise, then
this decision must be compatible with how all this is handled in other
compilation units.  Thus, compiler optimizations effectively become part
of the ABI, which doesn't seem right.

I hope these examples illustrate my concerns about the implementability
in practice of this.  It's also why I've suggested to move from an
opt-out approach as in the current standard (ie, with kill_dependency())
to an opt-in approach for conservative dependency tracking (e.g., with a
preserve_dependencies(exp) call, where exp will not be optimized in a
way that removes any dependencies).  This wouldn't help with many
optimizations being prevented, but it should at least help programmers
contain the problem to smaller regions of code.
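
In terms of the suggested opt-in marker, usage might look like this
(preserve_dependencies() being the proposed call from above, not
anything that exists today):

	p = atomic_load_explicit(&gp, memory_order_consume);
	/* The expression below must not be optimized in a way that
	 * removes any of its dependencies: */
	a = preserve_dependencies(p->a);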

I'm not aware of any implementation that tries to track dependencies, so
I can't give any real performance numbers.  This could perhaps be
simulated, but I'm not sure whether a realistic case would be made
without at least supporting [[carries_dependency]] properly in the
compiler, which would be some work.


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 18:49     ` Linus Torvalds
@ 2014-02-18 19:47       ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18 19:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter.Sewell, mark.batty@cl.cam.ac.uk, Peter Zijlstra,
	Torvald Riegel, Will Deacon, Ramana Radhakrishnan, David Howells,
	linux-arch, Linux Kernel Mailing List, Andrew Morton,
	Ingo Molnar, gcc

On Tue, Feb 18, 2014 at 10:49:27AM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 10:21 AM, Peter Sewell
> <Peter.Sewell@cl.cam.ac.uk> wrote:
> >
> > This is a bit more subtle, because (on ARM and POWER) removing the
> > dependency and conditional branch is actually in general *not* equivalent
> > in the hardware, in a concurrent context.
> 
> So I agree, but I think that's a generic issue with non-local memory
> ordering, and is not at all specific to the optimization wrt that
> "x?42:42" expression.
> 
> If you have a value that you loaded with a non-relaxed load, and you
> pass that value off to a non-local function that you don't know what
> it does, in my opinion that implies that the compiler had better add
> the necessary serialization to say "whatever that other function does,
> we guarantee the semantics of the load".
> 
> So on ppc, if you do a load with "consume" or "acquire" and then call
> another function without having had something in the caller that
> serializes the load, you'd better add the lwsync or whatever before
> the call. Exactly because the function call itself otherwise basically
> breaks the visibility into ordering. You've basically turned a
> load-with-ordering-guarantees into just an integer that you passed off
> to something that doesn't know about the ordering guarantees - and you
> need that "lwsync" in order to still guarantee the ordering.
> 
> Tough titties. That's what a CPU with weak memory ordering semantics
> gets in order to have sufficient memory ordering.

And that is in fact what C11 compilers are supposed to do if the function
doesn't have the [[carries_dependency]] attribute on the corresponding
argument or return of the non-local function.  If the function is marked
with [[carries_dependency]], then the compiler has the information needed
in both compilations to make things work correctly.

							Thanx, Paul

> And I don't think it's actually a problem in practice. If you are
> doing loads with ordered semantics, you're not going to pass the
> result off willy-nilly to random functions (or you really *do* require
> the ordering, because the load that did the "acquire" was actually for
> a lock!)
> 
> So I really think that the "local optimization" is correct regardless.
> 
>                    Linus
> 


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 18:21   ` Peter Sewell
@ 2014-02-18 18:49     ` Linus Torvalds
  2014-02-18 19:47       ` Paul E. McKenney
  2014-02-18 20:46     ` Torvald Riegel
  1 sibling, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-18 18:49 UTC (permalink / raw)
  To: Peter.Sewell
  Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Peter Zijlstra,
	Torvald Riegel, Will Deacon, Ramana Radhakrishnan, David Howells,
	linux-arch, Linux Kernel Mailing List, Andrew Morton,
	Ingo Molnar, gcc

On Tue, Feb 18, 2014 at 10:21 AM, Peter Sewell
<Peter.Sewell@cl.cam.ac.uk> wrote:
>
> This is a bit more subtle, because (on ARM and POWER) removing the
> dependency and conditional branch is actually in general *not* equivalent
> in the hardware, in a concurrent context.

So I agree, but I think that's a generic issue with non-local memory
ordering, and is not at all specific to the optimization wrt that
"x?42:42" expression.

If you have a value that you loaded with a non-relaxed load, and you
pass that value off to a non-local function that you don't know what
it does, in my opinion that implies that the compiler had better add
the necessary serialization to say "whatever that other function does,
we guarantee the semantics of the load".

So on ppc, if you do a load with "consume" or "acquire" and then call
another function without having had something in the caller that
serializes the load, you'd better add the lwsync or whatever before
the call. Exactly because the function call itself otherwise basically
breaks the visibility into ordering. You've basically turned a
load-with-ordering-guarantees into just an integer that you passed off
to something that doesn't know about the ordering guarantees - and you
need that "lwsync" in order to still guarantee the ordering.

Tough titties. That's what a CPU with weak memory ordering semantics
gets in order to have sufficient memory ordering.

And I don't think it's actually a problem in practice. If you are
doing loads with ordered semantics, you're not going to pass the
result off willy-nilly to random functions (or you really *do* require
the ordering, because the load that did the "acquire" was actually for
a lock!)

So I really think that the "local optimization" is correct regardless.

                   Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 17:38 ` Linus Torvalds
@ 2014-02-18 18:21   ` Peter Sewell
  2014-02-18 18:49     ` Linus Torvalds
  2014-02-18 20:46     ` Torvald Riegel
  0 siblings, 2 replies; 299+ messages in thread
From: Peter Sewell @ 2014-02-18 18:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Peter Zijlstra,
	Torvald Riegel, Will Deacon, Ramana Radhakrishnan, David Howells,
	linux-arch, Linux Kernel Mailing List, Andrew Morton,
	Ingo Molnar, gcc

On 18 February 2014 17:38, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote:
>>
>> For example, suppose we have, in one compilation unit:
>>
>>     void f(int ra, int*rb) {
>>       if (ra==42)
>>         *rb=42;
>>       else
>>         *rb=42;
>>     }
>
> So this is a great example, and in general I really like your page at:
>
>> For more context, this example is taken from a summary of the thin-air
>> problem by Mark Batty and myself,
>> <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with
>> dependencies via other compilation units was AFAIK first pointed out
>> by Hans Boehm.
>
> and the reason I like your page is that it really talks about the
> problem by pointing to the "unoptimized" code, and what hardware would
> do.

Thanks.  It's certainly necessary to separately understand what compiler
optimisation and the hardware might do, to get anywhere here.   But...

> As mentioned, I think that's actually the *correct* way to think about
> the problem space, because it allows the programmer to take hardware
> characteristics into account, without having to try to "describe" them
> at a source level.

...to be clear, I am ultimately after a decent source-level description of what
programmers can depend on, and we (Mark and I) view that page as
identifying constraints on what that description can say.  There are too
many compiler optimisations for people to reason directly in terms of
the set of all transformations that they do, so we need some more
concise and comprehensible envelope identifying what is allowed,
as an interface between compiler writers and users.  AIUI that's basically
what Torvald is arguing.

The C11 spec in its current form is not yet fully up to that task, for one thing
because it doesn't attempt to cover all the h/w interactions that you and
Paul list, but that is where we're trying to go with our formalisation work.

> As to your example of
>
>    if (ra)
>        atomic_write(rb, A);
>    else
>        atomic_write(rb, B);
>
> I really think that it is ok to combine that into
>
>     atomic_write(rb, ra ? A:B);
>
> (by virtue of "exact same behavior on actual hardware"), and then the
> only remaining question is whether the "ra?A:B" can be optimized to
> remove the conditional if A==B as in your example where both are "42".
> Agreed?

y

> Now, I would argue that the "naive" translation of that is
> unambiguous, and since "ra" is not volatile or magic in any way, then
> "ra?42:42" can obviously be optimized into just 42 - by the exact same
> rule that says "the compiler can do any transformation that is
> equivalent in the hardware". The compiler can *locally* decide that
> that is the right thing to do, and any programmer that complains about
> that decision is just crazy.
>
> So my "local machine behavior equivalency" rule means that that
> function can be optimized into a single "store 42 atomically into rb".

This is a bit more subtle, because (on ARM and POWER) removing the
dependency and conditional branch is actually in general *not* equivalent
in the hardware, in a concurrent context.

That notwithstanding, I tend to agree that preventing that optimisation for
non-atomics would be prohibitively costly (though I'd like real data).

It's tempting then to permit more-or-less any optimisation for thread-local
accesses but rule out value-range analysis and suchlike for shared-memory
accesses.  Whether that would be viable from a compiler point of view, I don't
know.   In C, one will at best only be able to get an approximate analysis
of which is which, just for a start.


> Now, if it's *not* compiled locally, and is instead implemented as a
> macro (or inline function), there are obviously situations where "ra ?
> A : B" ends up having to do other things. In particular, ra may be
> volatile or an atomic read that has ordering semantics, and then that
> expression doesn't become just "42", but that's a separate issue. It's
> not all that dissimilar to "function calls are sequence points",
> though, and obviously if the source of "ra" has semantic meaning, you
> have to honor that semantic meaning.

separate point, indeed

Peter

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 12:12 Peter Sewell
  2014-02-18 12:53 ` Peter Zijlstra
  2014-02-18 14:56 ` Paul E. McKenney
@ 2014-02-18 17:38 ` Linus Torvalds
  2014-02-18 18:21   ` Peter Sewell
  2014-02-18 20:43 ` Torvald Riegel
  3 siblings, 1 reply; 299+ messages in thread
From: Linus Torvalds @ 2014-02-18 17:38 UTC (permalink / raw)
  To: Peter.Sewell
  Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Peter Zijlstra,
	Torvald Riegel, Will Deacon, Ramana Radhakrishnan, David Howells,
	linux-arch, Linux Kernel Mailing List, Andrew Morton,
	Ingo Molnar, gcc

On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote:
>
> For example, suppose we have, in one compilation unit:
>
>     void f(int ra, int*rb) {
>       if (ra==42)
>         *rb=42;
>       else
>         *rb=42;
>     }

So this is a great example, and in general I really like your page at:

> For more context, this example is taken from a summary of the thin-air
> problem by Mark Batty and myself,
> <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with
> dependencies via other compilation units was AFAIK first pointed out
> by Hans Boehm.

and the reason I like your page is that it really talks about the
problem by pointing to the "unoptimized" code, and what hardware would
do.

As mentioned, I think that's actually the *correct* way to think about
the problem space, because it allows the programmer to take hardware
characteristics into account, without having to try to "describe" them
at a source level.

As to your example of

   if (ra)
       atomic_write(rb, A);
   else
       atomic_write(rb, B);

I really think that it is ok to combine that into

    atomic_write(rb, ra ? A:B);

(by virtue of "exact same behavior on actual hardware"), and then the
only remaining question is whether the "ra?A:B" can be optimized to
remove the conditional if A==B as in your example where both are "42".
Agreed?

Now, I would argue that the "naive" translation of that is
unambiguous, and since "ra" is not volatile or magic in any way, then
"ra?42:42" can obviously be optimized into just 42 - by the exact same
rule that says "the compiler can do any transformation that is
equivalent in the hardware". The compiler can *locally* decide that
that is the right thing to do, and any programmer that complains about
that decision is just crazy.

So my "local machine behavior equivalency" rule means that that
function can be optimized into a single "store 42 atomically into rb".
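
Concretely, a sketch of what the compiler may then emit for the
function (the name is illustrative, and atomic_write() is the notation
from the example above):

	void f_optimized(int ra, int *rb)
	{
		(void)ra;		/* the value is no longer inspected */
		atomic_write(rb, 42);	/* one unconditional store */
	}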

Now, if it's *not* compiled locally, and is instead implemented as a
macro (or inline function), there are obviously situations where "ra ?
A : B" ends up having to do other things. In particular, ra may be
volatile or an atomic read that has ordering semantics, and then that
expression doesn't become just "42", but that's a separate issue. It's
not all that dissimilar to "function calls are sequence points",
though, and obviously if the source of "ra" has semantic meaning, you
have to honor that semantic meaning.

Agreed?

                  Linus

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 15:16   ` Mark Batty
@ 2014-02-18 17:17     ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18 17:17 UTC (permalink / raw)
  To: Mark Batty
  Cc: Peter Sewell, peterz, Torvald Riegel, torvalds, Will Deacon,
	Ramana.Radhakrishnan, dhowells, linux-arch, linux-kernel, akpm,
	mingo, gcc

On Tue, Feb 18, 2014 at 03:16:33PM +0000, Mark Batty wrote:
> Hi Paul,
> 
> Thanks for the document. I'm looking forward to reading the bits about
> dependency chains in Linux.

And I am looking forward to your thoughts on those bits!

> > One point of confusion for me...  Example 4 says "language must allow".
> > Shouldn't that be "language is permitted to allow"?
> 
> When we say "allow", we mean that the optimised execution should be
> allowed by the specification, and, implicitly, the unoptimised
> execution should remain allowed too. We want to be concrete about what
> the language specification allows, and that's why we say "must". It is
> not to disallow the unoptimised execution.

OK, got it!

							Thanx, Paul

> > Seems like an
> > implementation is always within its rights to avoid an optimization if
> > its implementation prevents it from safely detecting the opportunity
> > for that optimization.
> 
> That's right.
> 
> - Mark
> 
> 
> > Or am I missing something here?
> >
> >                                                         Thanx, Paul
> >
> 


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 15:33   ` Peter Sewell
@ 2014-02-18 16:47     ` Paul E. McKenney
  0 siblings, 0 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18 16:47 UTC (permalink / raw)
  To: Peter Sewell
  Cc: mark.batty@cl.cam.ac.uk, peterz, Torvald Riegel, torvalds,
	Will Deacon, Ramana.Radhakrishnan, dhowells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Tue, Feb 18, 2014 at 03:33:35PM +0000, Peter Sewell wrote:
> Hi Paul,
> 
> On 18 February 2014 14:56, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > On Tue, Feb 18, 2014 at 12:12:06PM +0000, Peter Sewell wrote:
> >> Several of you have said that the standard and compiler should not
> >> permit speculative writes of atomics, or (effectively) that the
> >> compiler should preserve dependencies.  In simple examples it's easy
> >> to see what that means, but in general it's not so clear what the
> >> language should guarantee, because dependencies may go via non-atomic
> >> code in other compilation units, and we have to consider the extent to
> >> which it's desirable to limit optimisation there.
> >>
> >> For example, suppose we have, in one compilation unit:
> >>
> >>     void f(int ra, int*rb) {
> >>       if (ra==42)
> >>         *rb=42;
> >>       else
> >>         *rb=42;
> >>     }
> >
> > Hello, Peter!
> >
> > Nice example!
> >
> > The relevant portion of Documentation/memory-barriers.txt in my -rcu tree
> > says the following about the control dependency in the above construct:
> >
> > ------------------------------------------------------------------------
> >
> >         q = ACCESS_ONCE(a);
> >         if (q) {
> >                 barrier();
> >                 ACCESS_ONCE(b) = p;
> >                 do_something();
> >         } else {
> >                 barrier();
> >                 ACCESS_ONCE(b) = p;
> >                 do_something_else();
> >         }
> >
> > The initial ACCESS_ONCE() is required to prevent the compiler from
> > proving the value of 'a', and the pair of barrier() invocations are
> > required to prevent the compiler from pulling the two identical stores
> > to 'b' out from the legs of the "if" statement.
> 
> thanks
> 
> > ------------------------------------------------------------------------
> >
> > So yes, current compilers need significant help if it is necessary to
> > maintain dependencies in that sort of code.
> >
> > Similar examples came up in the data-dependency discussions in the
> > standards committee, which led to the [[carries_dependency]] attribute for
> > C11 and C++11.  Of course, current compilers don't have this attribute,
> > and the current Linux kernel code doesn't have any other marking for
> > data dependencies passing across function boundaries.  (Maybe some time
> > as an assist for detecting pointer leaks out of RCU read-side critical
> > sections, but efforts along those lines are a bit stalled at the moment.)
> >
> > More on data dependencies below...
> >
> >> and in another compilation unit the bodies of two threads:
> >>
> >>     // Thread 0
> >>     r1 = x;
> >>     f(r1,&r2);
> >>     y = r2;
> >>
> >>     // Thread 1
> >>     r3 = y;
> >>     f(r3,&r4);
> >>     x = r4;
> >>
> >> where accesses to x and y are annotated C11 atomic
> >> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
> >> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.
> >>
> >> (Of course, this is an artificial example, to make the point below as
> >> simply as possible - in real code the branches of the conditional
> >> might not be syntactically identical, just equivalent after macro
> >> expansion and other optimisation.)
> >>
> >> In the source program there's a dependency from the read of x to the
> >> write of y in Thread 0, and from the read of y to the write of x on
> >> Thread 1.  Dependency-respecting compilation would preserve those and
> >> the ARM and POWER architectures both respect them, so the reads of x
> >> and y could not give 42.
> >>
> >> But a compiler might well optimise the (non-atomic) body of f() to
> >> just *rb=42, making the threads effectively
> >>
> >>     // Thread 0
> >>     r1 = x;
> >>     y = 42;
> >>
> >>     // Thread 1
> >>     r3 = y;
> >>     x = 42;
> >>
> >> (GCC does this at O1, O2, and O3) and the ARM and POWER architectures
> >> permit those two reads to see 42. That is moreover actually observable
> >> on current ARM hardware.
> >
> > I do agree that this could happen on current compilers and hardware.
> >
> > But as Peter Zijlstra noted in this thread, this optimization
> > is to a control dependency, not a data dependency.
> 
> Indeed.  In principle (again as Hans has observed) a compiler might
> well convert between the two, e.g. if operating on single-bit values,
> or where value-range analysis has shown that a variable can only
> contain one of a small set of values.   I don't know whether that
> happens in practice?  Then there are also cases where a compiler is
> very likely to remove data/address dependencies, e.g. if some constant C
> is #define'd to be 0 then an array access indexed by x * C will have
> the dependency on x removed.  The runtime and compiler development
> costs of preventing that are also unclear to me.
> 
> Given that, whether it's reasonable to treat control and data
> dependencies differently seems to be an open question.

Here is another (admittedly fanciful and probably buggy) implementation
of f() that relies on data dependencies (according to C11 and C++11),
but which could not be relied on to preserve those data dependencies
given current pre-C11 compilers:

	int arr[2] = { 42, 43 };
	int *bigarr;
	_Atomic int gidx;	/* index published from Thread 1 to Thread 0 */

	int f(int ra)
	{
		return arr[ra != 42];
	}

	// Thread 0
	r1 = atomic_load_explicit(&gidx, memory_order_consume);
	r2 = bigarr[f(r1)];

	// Thread 1
	r3 = random() % BIGARR_SIZE;
	bigarr[r3] = some_integer();
	atomic_store_explicit(&gidx, r3, memory_order_release);

	// Main program
	bigarr = kmalloc(BIGARR_SIZE * sizeof(*bigarr), ...);
	// Note: bigarr currently contains pre-initialization garbage
	// Spawn threads 0 and 1

Many compilers would be happy to convert f() into something like the
following:

	int f(int ra)
	{
		if (ra == 42)
			return arr[0];
		else
			return arr[1];
	}

And many would argue that this is a perfectly reasonable conversion.

However, this breaks the data dependency, and allows Thread 0's load
from bigarr[] to be speculated, so that r2 might end up containing
pre-initialization garbage.  This is why the consume.2014.02.16c.pdf
document advises against attempting to carry dependencies through
relational operators and booleans (&& and ||) when using current compilers
(hmmm...  I need to state that advice more strongly).  And again,
this is one of the reasons for the [[carries_dependency]] attribute in
C11 -- to signal the compiler to be careful in a given function.
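
For instance (a sketch only -- "gp", "default_val", and the field name
are made up): a comparison against a known pointer lets the compiler
substitute that known value, and the substituted value carries no
dependency:

	#include <stdatomic.h>

	struct foo { int a; };
	extern struct foo default_val;
	extern struct foo *_Atomic gp;

	int reader(void)
	{
		struct foo *p, *q;

		p = atomic_load_explicit(&gp, memory_order_consume);
		if (p == &default_val)
			q = &default_val; /* compiler may substitute the constant... */
		else
			q = p;
		return q->a;	/* ...and on that path this load carries
				   no dependency on the consume load */
	}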

Again, the bigarr[] example is fanciful.  It is intended to illustrate a data
dependency that could be broken given current compilers and hardware.
It is -not- intended as an example of good code for the Linux kernel,
much the opposite, in fact.

That said, I would very much welcome a more realistic example.

> >> So as far as we can see, either:
> >>
> >> 1) if you can accept the latter behaviour (if the Linux codebase does
> >>    not rely on its absence), the language definition should permit it,
> >>    and current compiler optimisations can be used,
> >>
> >> or
> >>
> >> 2) otherwise, the language definition should prohibit it but the
> >>    compiler would have to preserve dependencies even in compilation
> >>    units that have no mention of atomics.  It's unclear what the
> >>    (runtime and compiler development) cost of that would be in
> >>    practice - perhaps Torvald could comment?
> >
> > For current compilers, we have to rely on coding conventions within
> > the Linux kernel in combination with non-standard extensions to gcc
> > and specified compiler flags to disable undesirable behavior.  I have a
> > start on specifying this in a document I am preparing for the standards
> > committee, a very early draft of which may be found here:
> >
> > http://www2.rdrop.com/users/paulmck/scalability/paper/consume.2014.02.16c.pdf
> >
> > Section 3 shows the results of a manual scan through the Linux kernel's
> > dependency chains, and Section 4.1 lists a probably incomplete (and no
> > doubt erroneous) list of coding standards required to make dependency
> > chains work on current compilers.  Any comments and suggestions are more
> > than welcome!
> 
> Thanks, that's very interesting (especially the non-local dependency chains).
> 
> At a first glance, the "4.1 Rules for C-Language RCU Users" seem
> pretty fragile - they're basically trying to guess the limits of
> compiler optimisation smartness.

Agreed, but that is the world we currently must live in, given pre-C11
compilers and the tepid implementations of memory_order_consume in
the current C11 implementations that I am aware of.  Since the
Linux kernel must live in this world for some time to come, I might as
well document the limitations, fragile though they might be.

> >> For more context, this example is taken from a summary of the thin-air
> >> problem by Mark Batty and myself,
> >> <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with
> >> dependencies via other compilation units was AFAIK first pointed out
> >> by Hans Boehm.
> >
> > Nice document!
> >
> > One point of confusion for me...  Example 4 says "language must allow".
> > Shouldn't that be "language is permitted to allow"?  Seems like an
> > implementation is always within its rights to avoid an optimization if
> > its implementation prevents it from safely detecting the opportunity
> > for that optimization.  Or am I missing something here?
> 
> We're saying that the language definition must allow it, not that any
> particular implementation must be able to exhibit it.

Ah, got it.  You had me worried there for a bit!  ;-)

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 12:53 ` Peter Zijlstra
@ 2014-02-18 16:08   ` Peter Sewell
  0 siblings, 0 replies; 299+ messages in thread
From: Peter Sewell @ 2014-02-18 16:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Torvald Riegel, torvalds,
	Will Deacon, ramana.radhakrishnan, dhowells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On 18 February 2014 12:53, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Feb 18, 2014 at 12:12:06PM +0000, Peter Sewell wrote:
>> Several of you have said that the standard and compiler should not
>> permit speculative writes of atomics, or (effectively) that the
>> compiler should preserve dependencies.
>
> The example below only deals with control dependencies; so I'll limit
> myself to that.

Data/address dependencies are, if anything, even less clear - see a
paragraph on that in my reply to Paul.

>> In simple examples it's easy
>> to see what that means, but in general it's not so clear what the
>> language should guarantee, because dependencies may go via non-atomic
>> code in other compilation units, and we have to consider the extent to
>> which it's desirable to limit optimisation there.
>>
>> For example, suppose we have, in one compilation unit:
>>
>>     void f(int ra, int*rb) {
>>       if (ra==42)
>>         *rb=42;
>>       else
>>         *rb=42;
>>     }
>>
>> and in another compilation unit the bodies of two threads:
>>
>>     // Thread 0
>>     r1 = x;
>>     f(r1,&r2);
>>     y = r2;
>>
>>     // Thread 1
>>     r3 = y;
>>     f(r3,&r4);
>>     x = r4;
>>
>> where accesses to x and y are annotated C11 atomic
>> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
>> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.
>
> So I'm intuitively ok with this; however, I would expect something like:
>
>   void f(_Atomic int ra, _Atomic int *rb);
>
> to preserve dependencies and not make the conditional go away, simply
> because in that case, in:
>
>   if (ra == 42)
>
> the 'ra' usage can be seen as an atomic load.
>
>> So as far as we can see, either:
>>
>> 1) if you can accept the latter behaviour (if the Linux codebase does
>>    not rely on its absence), the language definition should permit it,
>>    and current compiler optimisations can be used,
>
> Currently there's exactly 1 site in the Linux kernel that relies on
> control dependencies as far as I know -- the one I put in.

ok, thanks

> And it's
> limited to a single function, so no cross translation unit funnies
> there.

One can imagine a language definition that treats code that lies
entirely within a single compilation unit specially (e.g. if it's
somehow annotated as relying on dependencies).  But I imagine it would
be pretty unappealing to work with.
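
Purely hypothetical syntax, just for flavour -- no such attribute
exists in any compiler I know of:

    /* hypothetical annotation: dependencies through this function are
       relied upon and must be preserved by the compiler */
    __attribute__((preserves_dependencies))
    void f(int ra, int *rb);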

> Of course, nobody is going to tell me when or where they'll put in the
> next one, since it's now documented as accepted practice.

Is that not fragile?

> However, PaulMck and our RCU usage very much do cross all sorts of TU
> boundaries, but those are data dependencies.

yes - though again see my reply to Paul's note

thanks,
Peter

> ~ Peter


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 14:56 ` Paul E. McKenney
  2014-02-18 15:16   ` Mark Batty
@ 2014-02-18 15:33   ` Peter Sewell
  2014-02-18 16:47     ` Paul E. McKenney
  1 sibling, 1 reply; 299+ messages in thread
From: Peter Sewell @ 2014-02-18 15:33 UTC (permalink / raw)
  To: Paul McKenney
  Cc: mark.batty@cl.cam.ac.uk, peterz, Torvald Riegel, torvalds,
	Will Deacon, Ramana.Radhakrishnan, dhowells, linux-arch,
	linux-kernel, akpm, mingo, gcc

Hi Paul,

On 18 February 2014 14:56, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Tue, Feb 18, 2014 at 12:12:06PM +0000, Peter Sewell wrote:
>> Several of you have said that the standard and compiler should not
>> permit speculative writes of atomics, or (effectively) that the
>> compiler should preserve dependencies.  In simple examples it's easy
>> to see what that means, but in general it's not so clear what the
>> language should guarantee, because dependencies may go via non-atomic
>> code in other compilation units, and we have to consider the extent to
>> which it's desirable to limit optimisation there.
>>
>> For example, suppose we have, in one compilation unit:
>>
>>     void f(int ra, int*rb) {
>>       if (ra==42)
>>         *rb=42;
>>       else
>>         *rb=42;
>>     }
>
> Hello, Peter!
>
> Nice example!
>
> The relevant portion of Documentation/memory-barriers.txt in my -rcu tree
> says the following about the control dependency in the above construct:
>
> ------------------------------------------------------------------------
>
>         q = ACCESS_ONCE(a);
>         if (q) {
>                 barrier();
>                 ACCESS_ONCE(b) = p;
>                 do_something();
>         } else {
>                 barrier();
>                 ACCESS_ONCE(b) = p;
>                 do_something_else();
>         }
>
> The initial ACCESS_ONCE() is required to prevent the compiler from
> proving the value of 'a', and the pair of barrier() invocations are
> required to prevent the compiler from pulling the two identical stores
> to 'b' out from the legs of the "if" statement.

thanks

> ------------------------------------------------------------------------
>
> So yes, current compilers need significant help if it is necessary to
> maintain dependencies in that sort of code.
>
> Similar examples came up in the data-dependency discussions in the
> standards committee, which led to the [[carries_dependency]] attribute for
> C11 and C++11.  Of course, current compilers don't have this attribute,
> and the current Linux kernel code doesn't have any other marking for
> data dependencies passing across function boundaries.  (Maybe some time
> as an assist for detecting pointer leaks out of RCU read-side critical
> sections, but efforts along those lines are a bit stalled at the moment.)
>
> More on data dependencies below...
>
>> and in another compilation unit the bodies of two threads:
>>
>>     // Thread 0
>>     r1 = x;
>>     f(r1,&r2);
>>     y = r2;
>>
>>     // Thread 1
>>     r3 = y;
>>     f(r3,&r4);
>>     x = r4;
>>
>> where accesses to x and y are annotated C11 atomic
>> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
>> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.
>>
>> (Of course, this is an artificial example, to make the point below as
>> simply as possible - in real code the branches of the conditional
>> might not be syntactically identical, just equivalent after macro
>> expansion and other optimisation.)
>>
>> In the source program there's a dependency from the read of x to the
>> write of y in Thread 0, and from the read of y to the write of x on
>> Thread 1.  Dependency-respecting compilation would preserve those and
>> the ARM and POWER architectures both respect them, so the reads of x
>> and y could not give 42.
>>
>> But a compiler might well optimise the (non-atomic) body of f() to
>> just *rb=42, making the threads effectively
>>
>>     // Thread 0
>>     r1 = x;
>>     y = 42;
>>
>>     // Thread 1
>>     r3 = y;
>>     x = 42;
>>
>> (GCC does this at O1, O2, and O3) and the ARM and POWER architectures
>> permit those two reads to see 42. That is moreover actually observable
>> on current ARM hardware.
>
> I do agree that this could happen on current compilers and hardware.
>
> But as Peter Zijlstra noted in this thread, this optimization
> is to a control dependency, not a data dependency.

Indeed.  In principle (again as Hans has observed) a compiler might
well convert between the two, e.g. if operating on single-bit values,
or where value-range analysis has shown that a variable can only
contain one of a small set of values.   I don't know whether that
happens in practice?  Then there are also cases where a compiler is
very likely to remove data/address dependencies, e.g. if some constant C
is #define'd to be 0 then an array access indexed by x * C will have
the dependency on x removed.  The runtime and compiler development
costs of preventing that are also unclear to me.
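
Concretely, the #define case would be something like (names
illustrative):

    #define C 0

    int g(int *arr, int x)
    {
        return arr[x * C];  /* folds to arr[0]: the address no longer
                               depends on x at all */
    }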

Given that, whether it's reasonable to treat control and data
dependencies differently seems to be an open question.

>> So as far as we can see, either:
>>
>> 1) if you can accept the latter behaviour (if the Linux codebase does
>>    not rely on its absence), the language definition should permit it,
>>    and current compiler optimisations can be used,
>>
>> or
>>
>> 2) otherwise, the language definition should prohibit it but the
>>    compiler would have to preserve dependencies even in compilation
>>    units that have no mention of atomics.  It's unclear what the
>>    (runtime and compiler development) cost of that would be in
>>    practice - perhaps Torvald could comment?
>
> For current compilers, we have to rely on coding conventions within
> the Linux kernel in combination with non-standard extensions to gcc
> and specified compiler flags to disable undesirable behavior.  I have a
> start on specifying this in a document I am preparing for the standards
> committee, a very early draft of which may be found here:
>
> http://www2.rdrop.com/users/paulmck/scalability/paper/consume.2014.02.16c.pdf
>
> Section 3 shows the results of a manual scan through the Linux kernel's
> dependency chains, and Section 4.1 lists a probably incomplete (and no
> doubt erroneous) list of coding standards required to make dependency
> chains work on current compilers.  Any comments and suggestions are more
> than welcome!

Thanks, that's very interesting (especially the non-local dependency chains).

At a first glance, the "4.1 Rules for C-Language RCU Users" seem
pretty fragile - they're basically trying to guess the limits of
compiler optimisation smartness.

>> For more context, this example is taken from a summary of the thin-air
>> problem by Mark Batty and myself,
>> <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with
>> dependencies via other compilation units was AFAIK first pointed out
>> by Hans Boehm.
>
> Nice document!
>
> One point of confusion for me...  Example 4 says "language must allow".
> Shouldn't that be "language is permitted to allow"?  Seems like an
> implementation is always within its rights to avoid an optimization if
> its implementation prevents it from safely detecting the opportunity
> for that optimization.  Or am I missing something here?

We're saying that the language definition must allow it, not that any
particular implementation must be able to exhibit it.

best,
Peter



>                                                         Thanx, Paul
>


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 14:56 ` Paul E. McKenney
@ 2014-02-18 15:16   ` Mark Batty
  2014-02-18 17:17     ` Paul E. McKenney
  2014-02-18 15:33   ` Peter Sewell
  1 sibling, 1 reply; 299+ messages in thread
From: Mark Batty @ 2014-02-18 15:16 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Sewell, peterz, Torvald Riegel, torvalds, Will Deacon,
	Ramana.Radhakrishnan, dhowells, linux-arch, linux-kernel, akpm,
	mingo, gcc

Hi Paul,

Thanks for the document. I'm looking forward to reading the bits about
dependency chains in Linux.

> One point of confusion for me...  Example 4 says "language must allow".
> Shouldn't that be "language is permitted to allow"?

When we say "allow", we mean that the optimised execution should be
allowed by the specification, and, implicitly, the unoptimised
execution should remain allowed too. We want to be concrete about what
the language specification allows, and that's why we say "must"; it is
not meant to disallow the unoptimised execution.

> Seems like an
> implementation is always within its rights to avoid an optimization if
> its implementation prevents it from safely detecting the opportunity
> for that optimization.

That's right.

- Mark


> Or am I missing something here?
>
>                                                         Thanx, Paul
>


* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 12:12 Peter Sewell
  2014-02-18 12:53 ` Peter Zijlstra
@ 2014-02-18 14:56 ` Paul E. McKenney
  2014-02-18 15:16   ` Mark Batty
  2014-02-18 15:33   ` Peter Sewell
  2014-02-18 17:38 ` Linus Torvalds
  2014-02-18 20:43 ` Torvald Riegel
  3 siblings, 2 replies; 299+ messages in thread
From: Paul E. McKenney @ 2014-02-18 14:56 UTC (permalink / raw)
  To: Peter Sewell
  Cc: mark.batty@cl.cam.ac.uk, peterz, Torvald Riegel, torvalds,
	Will Deacon, Ramana.Radhakrishnan, dhowells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Tue, Feb 18, 2014 at 12:12:06PM +0000, Peter Sewell wrote:
> Several of you have said that the standard and compiler should not
> permit speculative writes of atomics, or (effectively) that the
> compiler should preserve dependencies.  In simple examples it's easy
> to see what that means, but in general it's not so clear what the
> language should guarantee, because dependencies may go via non-atomic
> code in other compilation units, and we have to consider the extent to
> which it's desirable to limit optimisation there.
> 
> For example, suppose we have, in one compilation unit:
> 
>     void f(int ra, int*rb) {
>       if (ra==42)
>         *rb=42;
>       else
>         *rb=42;
>     }

Hello, Peter!

Nice example!

The relevant portion of Documentation/memory-barriers.txt in my -rcu tree
says the following about the control dependency in the above construct:

------------------------------------------------------------------------

	q = ACCESS_ONCE(a);
	if (q) {
		barrier();
		ACCESS_ONCE(b) = p;
		do_something();
	} else {
		barrier();
		ACCESS_ONCE(b) = p;
		do_something_else();
	}

The initial ACCESS_ONCE() is required to prevent the compiler from
proving the value of 'a', and the pair of barrier() invocations are
required to prevent the compiler from pulling the two identical stores
to 'b' out from the legs of the "if" statement.

------------------------------------------------------------------------
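
(For reference, in kernels of this era ACCESS_ONCE() and barrier()
expand to roughly the following:)

	#define ACCESS_ONCE(x)	(*(volatile typeof(x) *)&(x))
	#define barrier()	__asm__ __volatile__("" : : : "memory")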

So yes, current compilers need significant help if it is necessary to
maintain dependencies in that sort of code.

Similar examples came up in the data-dependency discussions in the
standards committee, which led to the [[carries_dependency]] attribute for
C11 and C++11.  Of course, current compilers don't have this attribute,
and the current Linux kernel code doesn't have any other marking for
data dependencies passing across function boundaries.  (Maybe some time
as an assist for detecting pointer leaks out of RCU read-side critical
sections, but efforts along those lines are a bit stalled at the moment.)

More on data dependencies below...

> and in another compilation unit the bodies of two threads:
> 
>     // Thread 0
>     r1 = x;
>     f(r1,&r2);
>     y = r2;
> 
>     // Thread 1
>     r3 = y;
>     f(r3,&r4);
>     x = r4;
> 
> where accesses to x and y are annotated C11 atomic
> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.
> 
> (Of course, this is an artificial example, to make the point below as
> simply as possible - in real code the branches of the conditional
> might not be syntactically identical, just equivalent after macro
> expansion and other optimisation.)
> 
> In the source program there's a dependency from the read of x to the
> write of y in Thread 0, and from the read of y to the write of x on
> Thread 1.  Dependency-respecting compilation would preserve those and
> the ARM and POWER architectures both respect them, so the reads of x
> and y could not give 42.
> 
> But a compiler might well optimise the (non-atomic) body of f() to
> just *rb=42, making the threads effectively
> 
>     // Thread 0
>     r1 = x;
>     y = 42;
> 
>     // Thread 1
>     r3 = y;
>     x = 42;
> 
> (GCC does this at O1, O2, and O3) and the ARM and POWER architectures
> permit those two reads to see 42. That is moreover actually observable
> on current ARM hardware.

I do agree that this could happen on current compilers and hardware.

But as Peter Zijlstra noted in this thread, this optimization
is to a control dependency, not a data dependency.

> So as far as we can see, either:
> 
> 1) if you can accept the latter behaviour (if the Linux codebase does
>    not rely on its absence), the language definition should permit it,
>    and current compiler optimisations can be used,
> 
> or
> 
> 2) otherwise, the language definition should prohibit it but the
>    compiler would have to preserve dependencies even in compilation
>    units that have no mention of atomics.  It's unclear what the
>    (runtime and compiler development) cost of that would be in
>    practice - perhaps Torvald could comment?

For current compilers, we have to rely on coding conventions within
the Linux kernel in combination with non-standard extensions to gcc
and specified compiler flags to disable undesirable behavior.  I have a
start on specifying this in a document I am preparing for the standards
committee, a very early draft of which may be found here:

http://www2.rdrop.com/users/paulmck/scalability/paper/consume.2014.02.16c.pdf

Section 3 shows the results of a manual scan through the Linux kernel's
dependency chains, and Section 4.1 lists a probably incomplete (and no
doubt erroneous) list of coding standards required to make dependency
chains work on current compilers.  Any comments and suggestions are more
than welcome!

> For more context, this example is taken from a summary of the thin-air
> problem by Mark Batty and myself,
> <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with
> dependencies via other compilation units was AFAIK first pointed out
> by Hans Boehm.

Nice document!

One point of confusion for me...  Example 4 says "language must allow".
Shouldn't that be "language is permitted to allow"?  Seems like an
implementation is always within its rights to avoid an optimization if
its implementation prevents it from safely detecting the opportunity
for that optimization.  Or am I missing something here?

							Thanx, Paul



* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 12:12 Peter Sewell
@ 2014-02-18 12:53 ` Peter Zijlstra
  2014-02-18 16:08   ` Peter Sewell
  2014-02-18 14:56 ` Paul E. McKenney
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 299+ messages in thread
From: Peter Zijlstra @ 2014-02-18 12:53 UTC (permalink / raw)
  To: Peter Sewell
  Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Torvald Riegel, torvalds,
	Will Deacon, Ramana.Radhakrishnan, dhowells, linux-arch,
	linux-kernel, akpm, mingo, gcc

On Tue, Feb 18, 2014 at 12:12:06PM +0000, Peter Sewell wrote:
> Several of you have said that the standard and compiler should not
> permit speculative writes of atomics, or (effectively) that the
> compiler should preserve dependencies.

The example below only deals with control dependencies; so I'll limit
myself to that.

> In simple examples it's easy
> to see what that means, but in general it's not so clear what the
> language should guarantee, because dependencies may go via non-atomic
> code in other compilation units, and we have to consider the extent to
> which it's desirable to limit optimisation there.
> 
> For example, suppose we have, in one compilation unit:
> 
>     void f(int ra, int*rb) {
>       if (ra==42)
>         *rb=42;
>       else
>         *rb=42;
>     }
> 
> and in another compilation unit the bodies of two threads:
> 
>     // Thread 0
>     r1 = x;
>     f(r1,&r2);
>     y = r2;
> 
>     // Thread 1
>     r3 = y;
>     f(r3,&r4);
>     x = r4;
> 
> where accesses to x and y are annotated C11 atomic
> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.

So I'm intuitively ok with this; however, I would expect something like:

  void f(_Atomic int ra, _Atomic int *rb);

to preserve dependencies and not make the conditional go away, simply
because in that case, in:

  if (ra == 42)

the 'ra' usage can be seen as an atomic load.
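
Spelled out, that expectation is something like the following sketch;
with _Atomic, the test and both stores are atomic accesses in the
abstract machine, which (on this reading) the compiler could not
simply fold away:

  #include <stdatomic.h>

  void f(_Atomic int ra, _Atomic int *rb)
  {
          if (ra == 42)     /* an atomic load of ra */
                  *rb = 42; /* an atomic (seq_cst) store */
          else
                  *rb = 42;
  }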

> So as far as we can see, either:
> 
> 1) if you can accept the latter behaviour (if the Linux codebase does
>    not rely on its absence), the language definition should permit it,
>    and current compiler optimisations can be used,

Currently there's exactly 1 site in the Linux kernel that relies on
control dependencies as far as I know -- the one I put in. And it's
limited to a single function, so no cross translation unit funnies
there.

Of course, nobody is going to tell me when or where they'll put in the
next one, since it's now documented as accepted practice.

However, PaulMck and our RCU usage very much do cross all sorts of TU
boundaries, but those are data dependencies.

~ Peter


* Re: [RFC][PATCH 0/5] arch: atomic rework
@ 2014-02-18 12:12 Peter Sewell
  2014-02-18 12:53 ` Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 299+ messages in thread
From: Peter Sewell @ 2014-02-18 12:12 UTC (permalink / raw)
  To: Peter Sewell, mark.batty@cl.cam.ac.uk, Paul McKenney, peterz,
	Torvald Riegel, torvalds, Will Deacon, Ramana.Radhakrishnan,
	dhowells, linux-arch, linux-kernel, akpm, mingo, gcc

Several of you have said that the standard and compiler should not
permit speculative writes of atomics, or (effectively) that the
compiler should preserve dependencies.  In simple examples it's easy
to see what that means, but in general it's not so clear what the
language should guarantee, because dependencies may go via non-atomic
code in other compilation units, and we have to consider the extent to
which it's desirable to limit optimisation there.

For example, suppose we have, in one compilation unit:

    void f(int ra, int*rb) {
      if (ra==42)
        *rb=42;
      else
        *rb=42;
    }

and in another compilation unit the bodies of two threads:

    // Thread 0
    r1 = x;
    f(r1,&r2);
    y = r2;

    // Thread 1
    r3 = y;
    f(r3,&r4);
    x = r4;

where accesses to x and y are annotated C11 atomic
memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.

(Of course, this is an artificial example, to make the point below as
simply as possible - in real code the branches of the conditional
might not be syntactically identical, just equivalent after macro
expansion and other optimisation.)

In the source program there's a dependency from the read of x to the
write of y in Thread 0, and from the read of y to the write of x on
Thread 1.  Dependency-respecting compilation would preserve those and
the ARM and POWER architectures both respect them, so the reads of x
and y could not give 42.

But a compiler might well optimise the (non-atomic) body of f() to
just *rb=42, making the threads effectively

    // Thread 0
    r1 = x;
    y = 42;

    // Thread 1
    r3 = y;
    x = 42;

(GCC does this at O1, O2, and O3) and the ARM and POWER architectures
permit those two reads to see 42. That is moreover actually observable
on current ARM hardware.
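
(Rendered as C11 atomics, the optimised program is the classic
load-buffering shape, in which the outcome in question is allowed:)

    #include <stdatomic.h>

    _Atomic int x, y;
    int r1, r3;

    void thread0(void) {
        r1 = atomic_load_explicit(&x, memory_order_relaxed);
        atomic_store_explicit(&y, 42, memory_order_relaxed);
    }

    void thread1(void) {
        r3 = atomic_load_explicit(&y, memory_order_relaxed);
        atomic_store_explicit(&x, 42, memory_order_relaxed);
    }

    /* C11 permits r1 == 42 && r3 == 42 here, and ARM and POWER can
       exhibit it once the dependency is gone. */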

So as far as we can see, either:

1) if you can accept the latter behaviour (if the Linux codebase does
   not rely on its absence), the language definition should permit it,
   and current compiler optimisations can be used,

or

2) otherwise, the language definition should prohibit it but the
   compiler would have to preserve dependencies even in compilation
   units that have no mention of atomics.  It's unclear what the
   (runtime and compiler development) cost of that would be in
   practice - perhaps Torvald could comment?


For more context, this example is taken from a summary of the thin-air
problem by Mark Batty and myself,
<www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with
dependencies via other compilation units was AFAIK first pointed out
by Hans Boehm.

Peter


end of thread

Thread overview: 299+ messages
2014-02-06 13:48 [RFC][PATCH 0/5] arch: atomic rework Peter Zijlstra
2014-02-06 13:48 ` [RFC][PATCH 1/5] ia64: Fix up smp_mb__{before,after}_clear_bit Peter Zijlstra
2014-02-06 13:48 ` [RFC][PATCH 2/5] arc,hexagon: Delete asm/barrier.h Peter Zijlstra
2014-02-06 13:48 ` [RFC][PATCH 3/5] arch: s/smp_mb__(before|after)_(atomic|clear)_(dec,inc,bit)/smp_mb__\1/g Peter Zijlstra
2014-02-06 19:12   ` Paul E. McKenney
2014-02-07  9:52     ` Will Deacon
2014-02-06 13:48 ` [RFC][PATCH 4/5] arch: Generic atomic.h cleanup Peter Zijlstra
2014-02-06 17:49   ` Will Deacon
2014-02-06 13:48 ` [RFC][PATCH 5/5] arch: Sanitize atomic_t bitwise ops Peter Zijlstra
2014-02-06 14:43   ` Geert Uytterhoeven
2014-02-06 16:14     ` Peter Zijlstra
2014-02-06 16:53   ` Linus Torvalds
2014-02-06 17:52     ` Peter Zijlstra
2014-02-06 17:56       ` Linus Torvalds
2014-02-06 18:09         ` Peter Zijlstra
2014-02-06 18:25 ` [RFC][PATCH 0/5] arch: atomic rework David Howells
2014-02-06 18:30   ` Peter Zijlstra
2014-02-06 18:42   ` Paul E. McKenney
2014-02-06 18:55   ` Ramana Radhakrishnan
2014-02-06 18:59     ` Will Deacon
2014-02-06 19:27       ` Paul E. McKenney
2014-02-06 21:17         ` Torvald Riegel
2014-02-06 22:11           ` Paul E. McKenney
2014-02-06 23:44             ` Torvald Riegel
2014-02-07  4:20               ` Paul E. McKenney
2014-02-07  7:44                 ` Peter Zijlstra
2014-02-07 16:50                   ` Paul E. McKenney
2014-02-07 16:55                     ` Will Deacon
2014-02-07 17:06                       ` Peter Zijlstra
2014-02-07 17:13                         ` Will Deacon
2014-02-07 17:20                           ` Peter Zijlstra
2014-02-07 18:03                           ` Paul E. McKenney
2014-02-07 17:46                         ` Joseph S. Myers
2014-02-07 18:43                         ` Torvald Riegel
2014-02-07 18:02                       ` Paul E. McKenney
2014-02-10  0:27                         ` Torvald Riegel
2014-02-10  0:56                           ` Linus Torvalds
2014-02-10  1:16                             ` Torvald Riegel
2014-02-10  1:24                               ` Linus Torvalds
2014-02-10  1:46                                 ` Torvald Riegel
2014-02-10  2:04                                   ` Linus Torvalds
2014-02-10  3:21                           ` Paul E. McKenney
2014-02-10  3:45                           ` Paul E. McKenney
2014-02-10 11:46                           ` Peter Zijlstra
2014-02-10 19:09                           ` Linus Torvalds
2014-02-11 15:59                             ` Paul E. McKenney
2014-02-12  6:06                               ` Torvald Riegel
2014-02-12  9:19                                 ` Peter Zijlstra
2014-02-12 17:42                                   ` Paul E. McKenney
2014-02-12 18:12                                     ` Peter Zijlstra
2014-02-17 18:18                                       ` Paul E. McKenney
2014-02-17 20:39                                         ` Richard Biener
2014-02-17 22:14                                           ` Paul E. McKenney
2014-02-17 22:27                                             ` Torvald Riegel
2014-02-14  5:07                                   ` Torvald Riegel
2014-02-14  9:50                                     ` Peter Zijlstra
2014-02-14 19:19                                       ` Torvald Riegel
2014-02-12 17:39                                 ` Paul E. McKenney
2014-02-12  5:39                             ` Torvald Riegel
2014-02-12 18:07                               ` Paul E. McKenney
2014-02-12 20:22                                 ` Linus Torvalds
2014-02-13  0:23                                   ` Paul E. McKenney
2014-02-13 20:03                                     ` Torvald Riegel
2014-02-14  2:01                                       ` Paul E. McKenney
2014-02-14  4:43                                         ` Torvald Riegel
2014-02-14 17:29                                           ` Paul E. McKenney
2014-02-14 19:21                                             ` Torvald Riegel
2014-02-14 19:50                                             ` Linus Torvalds
2014-02-14 20:02                                               ` Linus Torvalds
2014-02-15  2:08                                                 ` Paul E. McKenney
2014-02-15  2:44                                                   ` Linus Torvalds
2014-02-15  2:48                                                     ` Linus Torvalds
2014-02-15  6:35                                                       ` Paul E. McKenney
2014-02-15  6:58                                                         ` Paul E. McKenney
2014-02-15 18:07                                                     ` Torvald Riegel
2014-02-17 18:59                                                       ` Joseph S. Myers
2014-02-17 19:19                                                         ` Will Deacon
2014-02-17 19:41                                                         ` Torvald Riegel
2014-02-17 23:12                                                           ` Joseph S. Myers
2014-02-15 17:45                                                 ` Torvald Riegel
2014-02-15 18:49                                                   ` Linus Torvalds
2014-02-17 19:55                                                     ` Torvald Riegel
2014-02-17 20:18                                                       ` Linus Torvalds
2014-02-17 21:21                                                         ` Torvald Riegel
2014-02-17 22:02                                                           ` Linus Torvalds
2014-02-17 22:25                                                             ` Torvald Riegel
2014-02-17 22:47                                                               ` Linus Torvalds
2014-02-17 23:41                                                                 ` Torvald Riegel
2014-02-18  0:18                                                                   ` Linus Torvalds
2014-02-18  1:26                                                                     ` Paul E. McKenney
2014-02-18 15:38                                                                     ` Torvald Riegel
2014-02-18 16:55                                                                       ` Paul E. McKenney
2014-02-18 19:57                                                                         ` Torvald Riegel
2014-02-17 23:10                                                         ` Alec Teal
2014-02-18  0:05                                                           ` Linus Torvalds
2014-02-18 15:31                                                             ` Torvald Riegel
2014-02-18 16:49                                                               ` Linus Torvalds
2014-02-18 17:16                                                                 ` Paul E. McKenney
2014-02-18 18:23                                                                   ` Peter Sewell
2014-02-18 19:00                                                                     ` Linus Torvalds
2014-02-18 19:42                                                                     ` Paul E. McKenney
2014-02-18 21:40                                                                   ` Torvald Riegel
2014-02-18 21:52                                                                     ` Peter Zijlstra
2014-02-19  9:52                                                                       ` Torvald Riegel
2014-02-18 22:58                                                                     ` Paul E. McKenney
2014-02-19 10:59                                                                       ` Torvald Riegel
2014-02-19 15:14                                                                         ` Paul E. McKenney
2014-02-19 17:55                                                                           ` Torvald Riegel
2014-02-19 22:12                                                                             ` Paul E. McKenney
2014-02-18 21:21                                                                 ` Torvald Riegel
2014-02-18 21:40                                                                   ` Peter Zijlstra
2014-02-18 21:47                                                                     ` Torvald Riegel
2014-02-19 15:23                                                                       ` David Lang
2014-02-19 18:11                                                                         ` Torvald Riegel
2014-02-18 21:47                                                                   ` Peter Zijlstra
2014-02-19 11:07                                                                     ` Torvald Riegel
2014-02-19 11:42                                                                       ` Peter Zijlstra
2014-02-18 22:14                                                                   ` Linus Torvalds
2014-02-19 14:40                                                                     ` Torvald Riegel
2014-02-19 19:49                                                                       ` Linus Torvalds
2014-02-18  3:00                                                         ` Paul E. McKenney
2014-02-18  3:24                                                           ` Linus Torvalds
2014-02-18  3:42                                                             ` Linus Torvalds
2014-02-18  5:22                                                               ` Paul E. McKenney
2014-02-18 16:17                                                               ` Torvald Riegel
2014-02-18 17:44                                                                 ` Linus Torvalds
2014-02-18 19:40                                                                   ` Paul E. McKenney
2014-02-18 19:47                                                                   ` Torvald Riegel
2014-02-20  0:53                                                                     ` Linus Torvalds
2014-02-20  4:01                                                                       ` Paul E. McKenney
2014-02-20  4:43                                                                         ` Linus Torvalds
2014-02-20  8:30                                                                           ` Paul E. McKenney
2014-02-20  9:20                                                                             ` Paul E. McKenney
2014-02-20 17:01                                                                             ` Linus Torvalds
2014-02-20 18:11                                                                               ` Paul E. McKenney
2014-02-20 18:32                                                                                 ` Linus Torvalds
2014-02-20 18:53                                                                                   ` Torvald Riegel
2014-02-20 19:09                                                                                     ` Linus Torvalds
2014-02-22 18:53                                                                                       ` Torvald Riegel
2014-02-22 21:53                                                                                         ` Linus Torvalds
2014-02-23  0:39                                                                                           ` Paul E. McKenney
2014-02-23  3:50                                                                                             ` Linus Torvalds
2014-02-23  6:34                                                                                               ` Paul E. McKenney
2014-02-23 19:31                                                                                                 ` Linus Torvalds
2014-02-24  1:16                                                                                                   ` Paul E. McKenney
2014-02-24  1:35                                                                                                     ` Linus Torvalds
2014-02-24  4:59                                                                                                       ` Paul E. McKenney
2014-02-24  5:25                                                                                                         ` Linus Torvalds
2014-02-24 15:57                                                                                                   ` Linus Torvalds
2014-02-24 16:27                                                                                                     ` Richard Biener
2014-02-24 16:37                                                                                                       ` Linus Torvalds
2014-02-24 16:40                                                                                                         ` Linus Torvalds
2014-02-24 16:55                                                                                                         ` Michael Matz
2014-02-24 17:28                                                                                                           ` Paul E. McKenney
2014-02-24 17:57                                                                                                             ` Paul E. McKenney
2014-02-26 17:39                                                                                                             ` Torvald Riegel
2014-02-24 17:38                                                                                                           ` Linus Torvalds
2014-02-24 18:12                                                                                                             ` Paul E. McKenney
2014-02-26 17:34                                                                                                             ` Torvald Riegel
2014-02-24 17:21                                                                                                     ` Paul E. McKenney
2014-02-24 18:14                                                                                                       ` Linus Torvalds
2014-02-24 18:53                                                                                                         ` Paul E. McKenney
2014-02-24 19:54                                                                                                           ` Linus Torvalds
2014-02-24 22:37                                                                                                             ` Paul E. McKenney
2014-02-24 23:35                                                                                                               ` Linus Torvalds
2014-02-25  6:00                                                                                                                 ` Paul E. McKenney
2014-02-26  1:47                                                                                                                   ` Linus Torvalds
2014-02-26  5:12                                                                                                                     ` Paul E. McKenney
2014-02-25  6:05                                                                                                                 ` Linus Torvalds
2014-02-26  0:15                                                                                                                   ` Paul E. McKenney
2014-02-26  3:32                                                                                                                     ` Jeff Law
2014-02-26  5:23                                                                                                                       ` Paul E. McKenney
2014-02-27 15:37                                                                                                             ` Torvald Riegel
2014-02-27 17:01                                                                                                               ` Linus Torvalds
2014-02-27 19:06                                                                                                                 ` Paul E. McKenney
2014-02-27 19:47                                                                                                                   ` Linus Torvalds
2014-02-27 20:53                                                                                                                     ` Paul E. McKenney
2014-03-01  0:50                                                                                                                       ` Paul E. McKenney
2014-03-01 10:06                                                                                                                         ` Peter Sewell
2014-03-01 14:03                                                                                                                           ` Paul E. McKenney
2014-03-02 10:05                                                                                                                             ` Peter Sewell
2014-03-02 23:20                                                                                                                               ` Paul E. McKenney
2014-03-02 23:44                                                                                                                                 ` Peter Sewell
2014-03-03  4:25                                                                                                                                   ` Paul E. McKenney
2014-03-03 20:44                                                                                                                               ` Torvald Riegel
2014-03-04 22:11                                                                                                                                 ` Peter Sewell
2014-03-05 17:15                                                                                                                                   ` Torvald Riegel
2014-03-05 18:37                                                                                                                                     ` Peter Sewell
2014-03-03 18:55                                                                                                                         ` Torvald Riegel
2014-03-03 19:20                                                                                                                           ` Paul E. McKenney
2014-03-03 20:46                                                                                                                             ` Torvald Riegel
2014-03-04 19:00                                                                                                                               ` Paul E. McKenney
2014-03-04 21:35                                                                                                                                 ` Paul E. McKenney
2014-03-05 16:54                                                                                                                                   ` Torvald Riegel
2014-03-05 18:15                                                                                                                                     ` Paul E. McKenney
2014-03-07 18:33                                                                                                                                       ` Torvald Riegel
2014-03-07 19:11                                                                                                                                         ` Paul E. McKenney
2014-03-05 16:26                                                                                                                                 ` Torvald Riegel
2014-03-05 18:01                                                                                                                                   ` Paul E. McKenney
2014-03-07 17:45                                                                                                                                     ` Torvald Riegel
2014-03-07 19:02                                                                                                                                       ` Paul E. McKenney
2014-03-03 18:59                                                                                                                     ` Torvald Riegel
2014-03-03 15:36                                                                                                                 ` Torvald Riegel
2014-02-27 17:50                                                                                                               ` Paul E. McKenney
2014-02-27 19:22                                                                                                                 ` Paul E. McKenney
2014-02-28  1:02                                                                                                                 ` Paul E. McKenney
2014-03-03 19:01                                                                                                                 ` Torvald Riegel
2014-02-20 18:56                                                                                   ` Paul E. McKenney
2014-02-20 19:45                                                                                     ` Linus Torvalds
2014-02-20 22:10                                                                                       ` Paul E. McKenney
2014-02-20 22:52                                                                                         ` Linus Torvalds
2014-02-21 18:35                                                                                           ` Michael Matz
2014-02-21 19:13                                                                                             ` Paul E. McKenney
2014-02-21 22:10                                                                                               ` Joseph S. Myers
2014-02-21 22:37                                                                                                 ` Paul E. McKenney
2014-02-26 13:09                                                                                                 ` Torvald Riegel
2014-02-26 18:43                                                                                                   ` Joseph S. Myers
2014-02-27  0:52                                                                                                     ` Torvald Riegel
2014-02-24 13:55                                                                                               ` Michael Matz
2014-02-24 17:40                                                                                                 ` Paul E. McKenney
2014-02-26 13:04                                                                                               ` Torvald Riegel
2014-02-26 18:27                                                                                                 ` Paul E. McKenney
2014-02-20 18:44                                                                                 ` Torvald Riegel
2014-02-20 18:56                                                                                   ` Paul E. McKenney
2014-02-20 18:23                                                                               ` Torvald Riegel
     [not found]                                                                               ` <CAHWkzRQZ8+gOGMFNyTKjFNzpUv6d_J1G9KL0x_iCa=YCgvEojQ@mail.gmail.com>
2014-02-21 19:16                                                                                 ` Linus Torvalds
2014-02-21 19:41                                                                                   ` Linus Torvalds
2014-02-21 19:48                                                                                     ` Peter Sewell
     [not found]                                                                                   ` <CAHWkzRSO82jU-9dtTEjHaW2FeLcEqdZXxp5Q8cmVTTT9uhZQYw@mail.gmail.com>
2014-02-21 20:22                                                                                     ` Linus Torvalds
     [not found]                                                                                 ` <CAHWkzRRxqhH+DnuQHu9bM4ywGBen3oqtT8W4Xqt1CFAHy2WQRg@mail.gmail.com>
2014-02-21 19:24                                                                                   ` Paul E. McKenney
2014-02-20 17:54                                                                             ` Torvald Riegel
2014-02-20 18:11                                                                               ` Paul E. McKenney
2014-02-20 17:49                                                                           ` Torvald Riegel
2014-02-20 18:25                                                                             ` Linus Torvalds
2014-02-20 19:02                                                                               ` Linus Torvalds
2014-02-20 19:06                                                                                 ` Linus Torvalds
2014-02-20 17:26                                                                         ` Torvald Riegel
2014-02-20 18:18                                                                           ` Paul E. McKenney
2014-02-22 18:30                                                                             ` Torvald Riegel
2014-02-22 20:17                                                                               ` Paul E. McKenney
2014-02-20 17:14                                                                       ` Torvald Riegel
2014-02-20 17:34                                                                         ` Linus Torvalds
2014-02-20 18:12                                                                           ` Torvald Riegel
2014-02-20 18:26                                                                           ` Paul E. McKenney
2014-02-18  5:01                                                             ` Paul E. McKenney
2014-02-18 15:56                                                           ` Torvald Riegel
2014-02-18 16:51                                                             ` Paul E. McKenney
2014-02-17 20:23                                                       ` Paul E. McKenney
2014-02-17 21:05                                                         ` Torvald Riegel
2014-02-15 17:30                                               ` Torvald Riegel
2014-02-15 19:15                                                 ` Linus Torvalds
2014-02-17 22:09                                                   ` Torvald Riegel
2014-02-17 22:32                                                     ` Linus Torvalds
2014-02-17 23:17                                                       ` Torvald Riegel
2014-02-18  0:09                                                         ` Linus Torvalds
2014-02-18 15:46                                                           ` Torvald Riegel
2014-02-10 11:48                         ` Peter Zijlstra
2014-02-10 11:49                           ` Will Deacon
2014-02-10 12:05                             ` Peter Zijlstra
2014-02-10 15:04                             ` Paul E. McKenney
2014-02-10 16:22                               ` Will Deacon
2014-02-07 18:44                     ` Torvald Riegel
2014-02-10  0:06                 ` Torvald Riegel
2014-02-10  3:51                   ` Paul E. McKenney
2014-02-12  5:13                     ` Torvald Riegel
2014-02-12 18:26                       ` Paul E. McKenney
2014-02-06 21:09       ` Torvald Riegel
2014-02-06 21:55         ` Paul E. McKenney
2014-02-06 22:58           ` Torvald Riegel
2014-02-07  4:06             ` Paul E. McKenney
2014-02-07  9:13               ` Torvald Riegel
2014-02-07 16:44                 ` Paul E. McKenney
2014-02-06 22:13         ` Joseph S. Myers
2014-02-06 23:25           ` Torvald Riegel
2014-02-06 23:33             ` Joseph S. Myers
2014-02-07 12:01         ` Will Deacon
2014-02-07 16:47           ` Paul E. McKenney
2014-02-06 19:21   ` Linus Torvalds
     [not found] ` <52F93B7C.2090304@tilera.com>
     [not found]   ` <20140210205719.GY5002@laptop.programming.kicks-ass.net>
2014-02-10 21:08     ` Chris Metcalf
2014-02-10 21:14       ` Peter Zijlstra
2014-02-18 12:12 Peter Sewell
2014-02-18 12:53 ` Peter Zijlstra
2014-02-18 16:08   ` Peter Sewell
2014-02-18 14:56 ` Paul E. McKenney
2014-02-18 15:16   ` Mark Batty
2014-02-18 17:17     ` Paul E. McKenney
2014-02-18 15:33   ` Peter Sewell
2014-02-18 16:47     ` Paul E. McKenney
2014-02-18 17:38 ` Linus Torvalds
2014-02-18 18:21   ` Peter Sewell
2014-02-18 18:49     ` Linus Torvalds
2014-02-18 19:47       ` Paul E. McKenney
2014-02-18 20:46     ` Torvald Riegel
2014-02-18 20:43 ` Torvald Riegel
2014-02-18 21:29   ` Paul E. McKenney
2014-02-18 23:48   ` Peter Sewell
2014-02-19  9:46     ` Torvald Riegel
2014-02-26  3:06 George Spelvin
2014-02-26  5:22 ` Paul E. McKenney
