* [PATCH 0/3] sys_membarrier (x86, generic)
@ 2015-07-10 20:58 Mathieu Desnoyers
  2015-07-10 20:58 ` [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86) Mathieu Desnoyers
                   ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-07-10 20:58 UTC
  To: Andrew Morton; +Cc: linux-kernel, linux-api, Mathieu Desnoyers

Hi Andrew,

Here is a repost of sys_membarrier, rebased on top of Linus commit
c4b5fd3fb2058b650447372472ad24e2a989f9f6 without any change since the
last v19 post other than proceeding to further testing. When merging
with other system calls, system call number conflicts should be quite
straightforward to handle; there is nothing special there.

Please consider pulling it into your tree in preparation for the
following merge window.

Thanks!

Mathieu

Mathieu Desnoyers (2):
  sys_membarrier(): system-wide memory barrier (generic, x86)
  selftests: enhance membarrier syscall test

Pranith Kumar (1):
  selftests: add membarrier syscall test

 MAINTAINERS                                        |   8 ++
 arch/x86/entry/syscalls/syscall_32.tbl             |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl             |   1 +
 include/linux/syscalls.h                           |   2 +
 include/uapi/asm-generic/unistd.h                  |   4 +-
 include/uapi/linux/Kbuild                          |   1 +
 include/uapi/linux/membarrier.h                    |  53 +++++++++
 init/Kconfig                                       |  12 ++
 kernel/Makefile                                    |   1 +
 kernel/membarrier.c                                |  66 +++++++++++
 kernel/sys_ni.c                                    |   3 +
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/membarrier/.gitignore      |   1 +
 tools/testing/selftests/membarrier/Makefile        |  11 ++
 .../testing/selftests/membarrier/membarrier_test.c | 121 +++++++++++++++++++++
 15 files changed, 285 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/membarrier.h
 create mode 100644 kernel/membarrier.c
 create mode 100644 tools/testing/selftests/membarrier/.gitignore
 create mode 100644 tools/testing/selftests/membarrier/Makefile
 create mode 100644 tools/testing/selftests/membarrier/membarrier_test.c

-- 
2.1.4



* [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86)
  2015-07-10 20:58 [PATCH 0/3] sys_membarrier (x86, generic) Mathieu Desnoyers
@ 2015-07-10 20:58 ` Mathieu Desnoyers
  2015-12-04 15:44   ` Michael Kerrisk (man-pages)
  2015-07-10 20:58 ` [PATCH 2/3] selftests: add membarrier syscall test Mathieu Desnoyers
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-07-10 20:58 UTC
  To: Andrew Morton
  Cc: linux-kernel, linux-api, Mathieu Desnoyers, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Thomas Gleixner,
	Peter Zijlstra, David Howells, Pranith Kumar, Michael Kerrisk

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system. It is
implemented by calling synchronize_sched(). It can be used to distribute
the cost of user-space memory barriers asymmetrically by transforming
pairs of memory barriers into pairs consisting of sys_membarrier() and a
compiler barrier. For synchronization primitives that distinguish
between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
read-side can be accelerated significantly by moving the bulk of the
memory barrier overhead to the write-side.

The existing applications of which I am aware that would be improved by this
system call are as follows:

* Through Userspace RCU library (http://urcu.so)
  - DNS server (Knot DNS) https://www.knot-dns.cz/
  - Network sniffer (http://netsniff-ng.org/)
  - Distributed object storage (https://sheepdog.github.io/sheepdog/)
  - User-space tracing (http://lttng.org)
  - Network storage system (https://www.gluster.org/)
  - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
  - Financial software (https://lkml.org/lkml/2015/3/23/189)

Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking. Especially in the case of RCU used
by libraries, sys_membarrier can speed up the read-side by moving the
bulk of the memory barrier cost to synchronize_rcu().

* Direct users of sys_membarrier
  - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

Microsoft core dotnet GC developers are planning to use the mprotect()
side-effect of issuing memory barriers through IPIs as a way to implement
Windows FlushProcessWriteBuffers() on Linux. They are referring to
sys_membarrier in their github thread, specifically stating that
sys_membarrier() is what they are looking for.

This implementation is based on kernel v4.1-rc8.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pair:

Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().
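
To make scenario 2 concrete, here is a minimal user-space sketch of the
pairing in C. It is an illustration only and not part of this patch: it
assumes C11 atomics, uses atomic_signal_fence() as a stand-in for the
compiler barrier "barrier()", and reaches the system call through
syscall(2). All names with an ex_ or EX_ prefix are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#define EX_MEMBARRIER_CMD_SHARED	(1 << 0)	/* value from this patch */

static atomic_int ex_x, ex_y;	/* shared between Thread A and Thread B */

/* Thin wrapper around syscall(2). Returns -1 with errno set on failure. */
static int ex_membarrier(int cmd, int flags)
{
#ifdef __NR_membarrier
	return (int) syscall(__NR_membarrier, cmd, flags);
#else
	(void) cmd;
	(void) flags;
	errno = ENOSYS;		/* headers do not know the system call */
	return -1;
#endif
}

/* Thread B (frequent): store x, compiler barrier, load y. */
static int ex_fast_side(void)
{
	atomic_store_explicit(&ex_x, 1, memory_order_relaxed);
	atomic_signal_fence(memory_order_seq_cst);	/* barrier() */
	return atomic_load_explicit(&ex_y, memory_order_relaxed);
}

/*
 * Thread A (infrequent): store y, sys_membarrier(), load x.
 * Returns 0 and fills *seen_x on success. Returns -1 if the system call
 * failed, in which case no ordering guarantee was obtained.
 */
static int ex_slow_side(int *seen_x)
{
	assert(seen_x != NULL);
	atomic_store_explicit(&ex_y, 1, memory_order_relaxed);
	/* Keep the compiler from moving accesses across the system call. */
	atomic_signal_fence(memory_order_seq_cst);
	if (ex_membarrier(EX_MEMBARRIER_CMD_SHARED, 0) != 0)
		return -1;
	atomic_signal_fence(memory_order_seq_cst);
	*seen_x = atomic_load_explicit(&ex_x, memory_order_relaxed);
	return 0;
}
```

When the two sides run concurrently and the system call succeeds, the
intent of the pairing is that at least one side observes the other's
store: ex_fast_side() returning 0 together with *seen_x == 0 should not
happen. If ex_slow_side() returns -1, the fast side must use a real
memory barrier instead.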

* Benchmarks

On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)

1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
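
The per-call figure is simply the elapsed time divided by the number of
calls: 33 s / 1000 calls = 0.033 s, i.e. 33 milliseconds per call. A
measurement of this kind could be sketched as below. This is not the
benchmark program used for the number above; it assumes
clock_gettime(CLOCK_MONOTONIC) as the time source and does not create the
7 busy-looping threads. Names with an ex_ or EX_ prefix are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define EX_MEMBARRIER_CMD_SHARED	(1 << 0)	/* value from this patch */

/* Thin wrapper around syscall(2). Returns -1 with errno set on failure. */
static int ex_membarrier(int cmd, int flags)
{
#ifdef __NR_membarrier
	return (int) syscall(__NR_membarrier, cmd, flags);
#else
	(void) cmd;
	(void) flags;
	errno = ENOSYS;		/* headers do not know the system call */
	return -1;
#endif
}

/*
 * Average wall-clock seconds per MEMBARRIER_CMD_SHARED call over `calls`
 * invocations, or -1.0 if any call fails.
 */
static double ex_avg_seconds_per_call(int calls)
{
	struct timespec start, end;
	double elapsed;
	int i;

	assert(calls > 0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < calls; i++) {
		if (ex_membarrier(EX_MEMBARRIER_CMD_SHARED, 0) != 0)
			return -1.0;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	elapsed = (double) (end.tv_sec - start.tv_sec) +
		  (double) (end.tv_nsec - start.tv_nsec) / 1e9;
	return elapsed / calls;
}
```

Calling ex_avg_seconds_per_call(1000) on an otherwise loaded machine would
give a number comparable in spirit to the one quoted above; the result
depends heavily on the kernel version and on system load.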

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every thread of
the process (as we currently do), we diminish the number of unnecessary
wake-ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such a barrier anyway, because it is
implied by the scheduler context switches.

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 2202847 writes
signal-based scheme:          9830061167 reads,    6700 writes
sys_membarrier:               9952759104 reads,     425 writes
sys_membarrier (dyn. check):  7970328887 reads,     425 writes

The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
sys_membarrier slightly outperforms the signal-based scheme. However,
this non-expedited sys_membarrier implementation has a much slower grace
period than signal and memory barrier schemes.
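
The "dyn. check" variant can be pictured with the following sketch. It is
a simplified illustration under stated assumptions, not the liburcu code:
it assumes C11 atomics, a single start-up probe that runs before any
reader or writer thread is created, and a fallback to full memory barriers
on both sides when the system call is unavailable. Names with an ex_ or
EX_ prefix are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define EX_MEMBARRIER_CMD_QUERY		0		/* values from this patch */
#define EX_MEMBARRIER_CMD_SHARED	(1 << 0)

static int ex_probed;		/* ex_init() has run */
static int ex_has_membarrier;	/* written once, before threads start */

/* Thin wrapper around syscall(2). Returns -1 with errno set on failure. */
static int ex_membarrier(int cmd, int flags)
{
#ifdef __NR_membarrier
	return (int) syscall(__NR_membarrier, cmd, flags);
#else
	(void) cmd;
	(void) flags;
	errno = ENOSYS;		/* headers do not know the system call */
	return -1;
#endif
}

/* Probe once at start-up: query the mask, then try the command itself. */
static int ex_init(void)
{
	int mask = ex_membarrier(EX_MEMBARRIER_CMD_QUERY, 0);

	ex_has_membarrier = mask >= 0 &&
			    (mask & EX_MEMBARRIER_CMD_SHARED) &&
			    ex_membarrier(EX_MEMBARRIER_CMD_SHARED, 0) == 0;
	ex_probed = 1;
	return ex_has_membarrier;
}

/* Read side: the branch below is the cost of the dynamic check. */
static inline void ex_reader_barrier(void)
{
	if (ex_has_membarrier)
		atomic_signal_fence(memory_order_seq_cst);	/* compiler barrier */
	else
		atomic_thread_fence(memory_order_seq_cst);	/* full barrier */
}

/* Write side: pays for the barriers the readers no longer execute. */
static inline void ex_writer_barrier(void)
{
	assert(ex_probed);	/* ex_init() must run first */
	if (!ex_has_membarrier) {
		atomic_thread_fence(memory_order_seq_cst);
		return;
	}
	atomic_signal_fence(memory_order_seq_cst);
	if (ex_membarrier(EX_MEMBARRIER_CMD_SHARED, 0) != 0)
		abort();	/* readers already skipped their barriers */
	atomic_signal_fence(memory_order_seq_cst);
}
```

The probe relies on the guarantee that, for a given command with flags set
to 0, the system call keeps returning the same value until reboot; without
that guarantee the cached flag could not be trusted by the read side.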

Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application.

An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.

This patch adds the system call to x86 and to asm-generic.

[1] http://urcu.so

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Nicholas Miell <nmiell@comcast.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: David Howells <dhowells@redhat.com>
CC: Pranith Kumar <bobby.prani@gmail.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: linux-api@vger.kernel.org

---

membarrier(2) man page:
--------------- snip -------------------
MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)

NAME
       membarrier - issue memory barriers on a set of threads

SYNOPSIS
       #include <linux/membarrier.h>

       int membarrier(int cmd, int flags);

DESCRIPTION
       The cmd argument is one of the following:

       MEMBARRIER_CMD_QUERY
              Query  the  set  of  supported commands. It returns a bitmask of
              supported commands.

       MEMBARRIER_CMD_SHARED
              Execute a memory barrier on all threads running on  the  system.
              Upon  return from system call, the caller thread is ensured that
              all running threads have passed through a state where all memory
              accesses  to  user-space  addresses  match program order between
              entry to and return from the system  call  (non-running  threads
              are de facto in such a state). This covers threads from all pro‐
              cesses running on the system.  This command returns 0.

       The flags argument needs to be 0. For future extensions.

       All memory accesses performed  in  program  order  from  each  targeted
       thread are guaranteed to be ordered with respect to sys_membarrier(). If
       we use the semantic "barrier()" to represent a compiler barrier forcing
       memory  accesses  to  be performed in program order across the barrier,
       and smp_mb() to represent explicit memory barriers forcing full  memory
       ordering  across  the barrier, we have the following ordering table for
       each pair of barrier(), sys_membarrier() and smp_mb():

       The pair ordering is detailed as (O: ordered, X: not ordered):

                              barrier()   smp_mb() sys_membarrier()
              barrier()          X           X            O
              smp_mb()           X           O            O
              sys_membarrier()   O           O            O

RETURN VALUE
       On success, MEMBARRIER_CMD_QUERY returns a bitmask of supported
       commands and MEMBARRIER_CMD_SHARED returns zero.  On error, -1 is
       returned, and errno is set appropriately. For a given command, with
       flags argument set to 0, this system call is guaranteed to always
       return the same value until reboot.

ERRORS
       ENOSYS System call is not implemented.

       EINVAL Invalid arguments.

Linux                             2015-04-15                     MEMBARRIER(2)
--------------- snip -------------------
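
As an illustration of the interface described in the man page above, a
caller could wrap and query the system call roughly as follows. This is a
sketch, not part of the patch: it goes through syscall(2) as the selftest
in this series does, and it redefines the command values locally under a
hypothetical EX_ prefix so that it builds without the new uapi header. The
helper names are assumptions made for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Values taken from the enum membarrier_cmd added by this patch, redefined
 * under a hypothetical EX_ prefix so the sketch builds even if the
 * installed kernel headers predate the system call.
 */
enum {
	EX_MEMBARRIER_CMD_QUERY  = 0,
	EX_MEMBARRIER_CMD_SHARED = (1 << 0),
};

/* Thin wrapper around syscall(2). Returns -1 with errno set on failure. */
static int ex_membarrier(int cmd, int flags)
{
#ifdef __NR_membarrier
	return (int) syscall(__NR_membarrier, cmd, flags);
#else
	(void) cmd;
	(void) flags;
	errno = ENOSYS;		/* headers do not know the system call */
	return -1;
#endif
}

/*
 * Non-query commands are single bits, so support is a simple mask test.
 * `mask` is the return value of MEMBARRIER_CMD_QUERY (negative on error).
 */
static int ex_cmd_supported(int mask, int cmd)
{
	assert(cmd > 0 && (cmd & (cmd - 1)) == 0);	/* exactly one bit set */
	return mask >= 0 && (mask & cmd) != 0;
}
```

A program would typically evaluate
ex_cmd_supported(ex_membarrier(EX_MEMBARRIER_CMD_QUERY, 0),
EX_MEMBARRIER_CMD_SHARED) once at start-up and cache the answer. A
non-zero flags argument or an unknown command is expected to fail with
EINVAL, and a kernel without the system call with ENOSYS.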

Changes since v18:
- Add unlikely() check to flags,
- Describe current users in changelog.

Changes since v17:
- Update commit message.

Changes since v16:
- Update documentation.
- Add man page to changelog.
- Build sys_membarrier on !CONFIG_SMP. It allows userspace applications
  to not care about the number of processors on the system.  Based on
  recommendations from Stephen Hemminger and Steven Rostedt.
- Check that flags argument is 0, update documentation to require it.

Changes since v15:
- Add flags argument in addition to cmd.
- Update documentation.

Changes since v14:
- Take care of Thomas Gleixner's comments.

Changes since v13:
- Move to kernel/membarrier.c.
- Remove MEMBARRIER_PRIVATE flag.
- Add MAINTAINERS file entry.

Changes since v12:
- Remove _FLAG suffix from uapi flags.
- Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
- Remove EXPEDITED mode. Only implement non-expedited for now, until
  reading the cpu_curr()->mm can be done without holding the CPU's rq
  lock.

Changes since v11:
- 5 years have passed.
- Rebase on v3.19 kernel.
- Add futex-alike PRIVATE vs SHARED semantic: private for per-process
  barriers, non-private for memory mappings shared between processes.
- Simplify user API.
- Code refactoring.

Changes since v10:
- Apply Randy's comments.
- Rebase on 2.6.34-rc4 -tip.

Changes since v9:
- Clean up #ifdef CONFIG_SMP.

Changes since v8:
- Go back to rq spin locks taken by sys_membarrier() rather than adding
  memory barriers to the scheduler. It implies a potential RoS
  (reduction of service) if sys_membarrier() is executed in a busy-loop
  by a user, but nothing more than what is already possible with other
  existing system calls, and it saves memory barriers in the scheduler
  fast path.
- re-add the memory barrier comments to x86 switch_mm() as an example to
  other architectures.
- Update documentation of the memory barriers in sys_membarrier and
  switch_mm().
- Append execution scenarios to the changelog showing the purpose of
  each memory barrier.

Changes since v7:
- Move spinlock-mb and scheduler related changes to separate patches.
- Add support for sys_membarrier on x86_32.
- Only x86 32/64 system calls are reserved in this patch. It is planned
  to incrementally reserve syscall IDs on other architectures as these
  are tested.

Changes since v6:
- Remove some unlikely() not so unlikely.
- Add the proper scheduler memory barriers needed to only use the RCU
  read lock in sys_membarrier rather than take each runqueue spinlock:
- Move memory barriers from per-architecture switch_mm() to schedule()
  and finish_lock_switch(), where they clearly document that all data
  protected by the rq lock is guaranteed to have memory barriers issued
  between the scheduler update and the task execution. Replacing the
  spin lock acquire/release barriers with these memory barriers implies
  either no overhead (x86 spinlock atomic instruction already implies a
  full mb) or some hopefully small overhead caused by the upgrade of the
  spinlock acquire/release barriers to more heavyweight smp_mb().
- The "generic" version of spinlock-mb.h declares both a mapping to
  standard spinlocks and full memory barriers. Each architecture can
  specialize this header following their own need and declare
  CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
- Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
  implementations on a wide range of architectures would be welcome.

Changes since v5:
- Plan ahead for extensibility by introducing mandatory/optional masks
  to the "flags" system call parameter. Past experience with accept4(),
  signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
  inotify_init1() indicates that this is the kind of thing we want to
  plan for. Return -EINVAL if the mandatory flags received are unknown.
- Create include/linux/membarrier.h to define these flags.
- Add MEMBARRIER_QUERY optional flag.

Changes since v4:
- Add "int expedited" parameter, use synchronize_sched() in the
  non-expedited case. Thanks to Lai Jiangshan for making us consider
  seriously using synchronize_sched() to provide the low-overhead
  membarrier scheme.
- Check num_online_cpus() == 1, quickly return without doing anything.

Changes since v3a:
- Confirm that each CPU indeed runs the current task's ->mm before
  sending an IPI. Ensures that we do not disturb RT tasks in the
  presence of lazy TLB shootdown.
- Document memory barriers needed in switch_mm().
- Surround helper functions with #ifdef CONFIG_SMP.

Changes since v2:
- simply send-to-many to the mm_cpumask. It contains the list of
  processors we have to IPI to (which use the mm), and this mask is
  updated atomically.

Changes since v1:
- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptive IPI scheme (single vs many IPI with threshold).
- Issue smp_mb() at the beginning and end of the system call.
---
 MAINTAINERS                            |  8 +++++
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/syscalls.h               |  2 ++
 include/uapi/asm-generic/unistd.h      |  4 ++-
 include/uapi/linux/Kbuild              |  1 +
 include/uapi/linux/membarrier.h        | 53 +++++++++++++++++++++++++++
 init/Kconfig                           | 12 +++++++
 kernel/Makefile                        |  1 +
 kernel/membarrier.c                    | 66 ++++++++++++++++++++++++++++++++++
 kernel/sys_ni.c                        |  3 ++
 11 files changed, 151 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/membarrier.h
 create mode 100644 kernel/membarrier.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 0d70760..b560da6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6642,6 +6642,14 @@ W:	http://www.mellanox.com
 Q:	http://patchwork.ozlabs.org/project/netdev/list/
 F:	drivers/net/ethernet/mellanox/mlx4/en_*
 
+MEMBARRIER SUPPORT
+M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	kernel/membarrier.c
+F:	include/uapi/linux/membarrier.h
+
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
 W:	http://www.linux-mm.org
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ef8187f..e63ad61 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	membarrier		sys_membarrier
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 9ef32d5..87f3cd6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	common	membarrier		sys_membarrier
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b45c45b..d4ab99b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -884,4 +884,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
 			const char __user *const __user *argv,
 			const char __user *const __user *envp, int flags);
 
+asmlinkage long sys_membarrier(int cmd, int flags);
+
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..8da542a 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
 __SYSCALL(__NR_bpf, sys_bpf)
 #define __NR_execveat 281
 __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_membarrier 282
+__SYSCALL(__NR_membarrier, sys_membarrier)
 
 #undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283
 
 /*
  * All syscalls below here should go away really,
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 1ff9942..e6f229a 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -251,6 +251,7 @@ header-y += mdio.h
 header-y += media.h
 header-y += media-bus-format.h
 header-y += mei.h
+header-y += membarrier.h
 header-y += memfd.h
 header-y += mempolicy.h
 header-y += meye.h
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
new file mode 100644
index 0000000..e0b108b
--- /dev/null
+++ b/include/uapi/linux/membarrier.h
@@ -0,0 +1,53 @@
+#ifndef _UAPI_LINUX_MEMBARRIER_H
+#define _UAPI_LINUX_MEMBARRIER_H
+
+/*
+ * linux/membarrier.h
+ *
+ * membarrier system call API
+ *
+ * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/**
+ * enum membarrier_cmd - membarrier system call command
+ * @MEMBARRIER_CMD_QUERY:   Query the set of supported commands. It returns
+ *                          a bitmask of valid commands.
+ * @MEMBARRIER_CMD_SHARED:  Execute a memory barrier on all running threads.
+ *                          Upon return from system call, the caller thread
+ *                          is ensured that all running threads have passed
+ *                          through a state where all memory accesses to
+ *                          user-space addresses match program order between
+ *                          entry to and return from the system call
+ *                          (non-running threads are de facto in such a
+ *                          state). This covers threads from all processes
+ *                          running on the system. This command returns 0.
+ *
+ * Command to be passed to the membarrier system call. The commands need to
+ * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
+ * the value 0.
+ */
+enum membarrier_cmd {
+	MEMBARRIER_CMD_QUERY = 0,
+	MEMBARRIER_CMD_SHARED = (1 << 0),
+};
+
+#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/init/Kconfig b/init/Kconfig
index af09b4f..4bba60f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1577,6 +1577,18 @@ config PCI_QUIRKS
 	  bugs/quirks. Disable this only if your target machine is
 	  unaffected by PCI quirks.
 
+config MEMBARRIER
+	bool "Enable membarrier() system call" if EXPERT
+	default y
+	help
+	  Enable the membarrier() system call that allows issuing memory
+	  barriers across all running threads, which can be used to distribute
+	  the cost of user-space memory barriers asymmetrically by transforming
+	  pairs of memory barriers into pairs consisting of membarrier() and a
+	  compiler barrier.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 43c4c92..92a481b 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/membarrier.c b/kernel/membarrier.c
new file mode 100644
index 0000000..536c727
--- /dev/null
+++ b/kernel/membarrier.c
@@ -0,0 +1,66 @@
+/*
+ * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * membarrier system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/membarrier.h>
+
+/*
+ * Bitmask made from a "or" of all commands within enum membarrier_cmd,
+ * except MEMBARRIER_CMD_QUERY.
+ */
+#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
+
+/**
+ * sys_membarrier - issue memory barriers on a set of threads
+ * @cmd:   Takes command values defined in enum membarrier_cmd.
+ * @flags: Currently needs to be 0. For future extensions.
+ *
+ * If this system call is not implemented, -ENOSYS is returned. If the
+ * command specified does not exist, or if the command argument is invalid,
+ * this system call returns -EINVAL. For a given command, with flags argument
+ * set to 0, this system call is guaranteed to always return the same value
+ * until reboot.
+ *
+ * All memory accesses performed in program order from each targeted thread
+ * is guaranteed to be ordered with respect to sys_membarrier(). If we use
+ * the semantic "barrier()" to represent a compiler barrier forcing memory
+ * accesses to be performed in program order across the barrier, and
+ * smp_mb() to represent explicit memory barriers forcing full memory
+ * ordering across the barrier, we have the following ordering table for
+ * each pair of barrier(), sys_membarrier() and smp_mb():
+ *
+ * The pair ordering is detailed as (O: ordered, X: not ordered):
+ *
+ *                        barrier()   smp_mb() sys_membarrier()
+ *        barrier()          X           X            O
+ *        smp_mb()           X           O            O
+ *        sys_membarrier()   O           O            O
+ */
+SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
+{
+	if (unlikely(flags))
+		return -EINVAL;
+	switch (cmd) {
+	case MEMBARRIER_CMD_QUERY:
+		return MEMBARRIER_CMD_BITMASK;
+	case MEMBARRIER_CMD_SHARED:
+		if (num_online_cpus() > 1)
+			synchronize_sched();
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7995ef5..eb4fde0 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -243,3 +243,6 @@ cond_syscall(sys_bpf);
 
 /* execveat */
 cond_syscall(sys_execveat);
+
+/* membarrier */
+cond_syscall(sys_membarrier);
-- 
2.1.4



* [PATCH 2/3] selftests: add membarrier syscall test
  2015-07-10 20:58 [PATCH 0/3] sys_membarrier (x86, generic) Mathieu Desnoyers
  2015-07-10 20:58 ` [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86) Mathieu Desnoyers
@ 2015-07-10 20:58 ` Mathieu Desnoyers
  2015-08-31  6:54     ` Michael Ellerman
  2015-07-10 20:58 ` [PATCH 3/3] selftests: enhance " Mathieu Desnoyers
  2015-10-05 23:21   ` Rusty Russell
  3 siblings, 1 reply; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-07-10 20:58 UTC
  To: Andrew Morton
  Cc: linux-kernel, linux-api, Pranith Kumar, Michael Ellerman,
	Mathieu Desnoyers

From: Pranith Kumar <bobby.prani@gmail.com>

This patch adds a self test for the membarrier system call.

CC: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 tools/testing/selftests/Makefile                   |  1 +
 tools/testing/selftests/membarrier/.gitignore      |  1 +
 tools/testing/selftests/membarrier/Makefile        | 11 ++++
 .../testing/selftests/membarrier/membarrier_test.c | 71 ++++++++++++++++++++++
 4 files changed, 84 insertions(+)
 create mode 100644 tools/testing/selftests/membarrier/.gitignore
 create mode 100644 tools/testing/selftests/membarrier/Makefile
 create mode 100644 tools/testing/selftests/membarrier/membarrier_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 24ae9e8..df577a4 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -6,6 +6,7 @@ TARGETS += firmware
 TARGETS += ftrace
 TARGETS += futex
 TARGETS += kcmp
+TARGETS += membarrier
 TARGETS += memfd
 TARGETS += memory-hotplug
 TARGETS += mount
diff --git a/tools/testing/selftests/membarrier/.gitignore b/tools/testing/selftests/membarrier/.gitignore
new file mode 100644
index 0000000..020c44f4
--- /dev/null
+++ b/tools/testing/selftests/membarrier/.gitignore
@@ -0,0 +1 @@
+membarrier_test
diff --git a/tools/testing/selftests/membarrier/Makefile b/tools/testing/selftests/membarrier/Makefile
new file mode 100644
index 0000000..877a503
--- /dev/null
+++ b/tools/testing/selftests/membarrier/Makefile
@@ -0,0 +1,11 @@
+CFLAGS += -g -I../../../../usr/include/
+
+all:
+	$(CC) $(CFLAGS) membarrier_test.c -o membarrier_test
+
+TEST_PROGS := membarrier_test
+
+include ../lib.mk
+
+clean:
+	$(RM) membarrier_test
diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
new file mode 100644
index 0000000..3c9f217
--- /dev/null
+++ b/tools/testing/selftests/membarrier/membarrier_test.c
@@ -0,0 +1,71 @@
+#define _GNU_SOURCE
+#define __EXPORTED_HEADERS__
+
+#include <linux/membarrier.h>
+#include <asm-generic/unistd.h>
+#include <sys/syscall.h>
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+
+#include "../kselftest.h"
+
+static int sys_membarrier(int cmd, int flags)
+{
+	return syscall(__NR_membarrier, cmd, flags);
+}
+
+static void test_membarrier_fail(void)
+{
+	int cmd = -1, flags = 0;
+
+	if (sys_membarrier(cmd, flags) != -1) {
+		printf("membarrier: Should fail but passed\n");
+		ksft_exit_fail();
+	}
+}
+
+static void test_membarrier_success(void)
+{
+	int flags = 0;
+
+	if (sys_membarrier(MEMBARRIER_CMD_SHARED, flags) != 0) {
+		printf("membarrier: Executing MEMBARRIER failed, %s\n",
+				strerror(errno));
+		ksft_exit_fail();
+	}
+
+	printf("membarrier: MEMBARRIER_CMD_SHARED success\n");
+}
+
+static void test_membarrier(void)
+{
+	test_membarrier_fail();
+	test_membarrier_success();
+}
+
+static int test_membarrier_exists(void)
+{
+	int flags = 0;
+
+	if (sys_membarrier(MEMBARRIER_CMD_QUERY, flags))
+		return 0;
+
+	return 1;
+}
+
+int main(int argc, char **argv)
+{
+	printf("membarrier: MEMBARRIER_CMD_QUERY ");
+	if (test_membarrier_exists()) {
+		printf("syscall implemented\n");
+		test_membarrier();
+	} else {
+		printf("syscall not implemented!\n");
+		return ksft_exit_fail();
+	}
+
+	printf("membarrier: tests done!\n");
+
+	return ksft_exit_pass();
+}
-- 
2.1.4


* [PATCH 3/3] selftests: enhance membarrier syscall test
  2015-07-10 20:58 [PATCH 0/3] sys_membarrier (x86, generic) Mathieu Desnoyers
  2015-07-10 20:58 ` [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86) Mathieu Desnoyers
  2015-07-10 20:58 ` [PATCH 2/3] selftests: add membarrier syscall test Mathieu Desnoyers
@ 2015-07-10 20:58 ` Mathieu Desnoyers
  2015-10-05 23:21   ` Rusty Russell
  3 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-07-10 20:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-api, Mathieu Desnoyers, Michael Ellerman,
	Pranith Kumar

Update the membarrier syscall self-test to match the membarrier
interface. Extend coverage of the interface. Treat ENOSYS as a "SKIP"
result: a kernel built with CONFIG_MEMBARRIER=n is a valid
configuration, but it does not allow testing the system call.

CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 .../testing/selftests/membarrier/membarrier_test.c | 100 +++++++++++++++------
 1 file changed, 75 insertions(+), 25 deletions(-)
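
For readers following the status handling, the decision made in
test_membarrier_query() below can be sketched as a pure function. This
is an illustration only and not part of the patch: classify_query() and
its parameters are hypothetical names, and required_cmds stands in for
MEMBARRIER_CMD_SHARED.

```c
#include <errno.h>

/* Mirrors enum test_membarrier_status from the patch. */
enum test_membarrier_status {
	TEST_MEMBARRIER_PASS = 0,
	TEST_MEMBARRIER_FAIL,
	TEST_MEMBARRIER_SKIP,
};

/*
 * Hypothetical helper (not in the patch): map the outcome of a
 * MEMBARRIER_CMD_QUERY call to a test status.
 *   ret           - value returned by the syscall wrapper
 *   err           - errno observed right after the call
 *   required_cmds - bitmask of commands the test needs
 */
enum test_membarrier_status classify_query(int ret, int err, int required_cmds)
{
	if (ret < 0) {
		/* CONFIG_MEMBARRIER=n is a valid configuration: skip. */
		if (err == ENOSYS)
			return TEST_MEMBARRIER_SKIP;
		/* EINVAL or any other error means the query itself failed. */
		return TEST_MEMBARRIER_FAIL;
	}
	/* The query succeeded; every required command bit must be set. */
	if ((ret & required_cmds) != required_cmds)
		return TEST_MEMBARRIER_FAIL;
	return TEST_MEMBARRIER_PASS;
}
```

With ret = -1 and err = ENOSYS this yields TEST_MEMBARRIER_SKIP, which
main() in the patch turns into ksft_exit_skip().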

diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
index 3c9f217..dde3125 100644
--- a/tools/testing/selftests/membarrier/membarrier_test.c
+++ b/tools/testing/selftests/membarrier/membarrier_test.c
@@ -10,62 +10,112 @@
 
 #include "../kselftest.h"
 
+enum test_membarrier_status {
+	TEST_MEMBARRIER_PASS = 0,
+	TEST_MEMBARRIER_FAIL,
+	TEST_MEMBARRIER_SKIP,
+};
+
 static int sys_membarrier(int cmd, int flags)
 {
 	return syscall(__NR_membarrier, cmd, flags);
 }
 
-static void test_membarrier_fail(void)
+static enum test_membarrier_status test_membarrier_cmd_fail(void)
 {
 	int cmd = -1, flags = 0;
 
 	if (sys_membarrier(cmd, flags) != -1) {
-		printf("membarrier: Should fail but passed\n");
-		ksft_exit_fail();
+		printf("membarrier: Wrong command should fail but passed.\n");
+		return TEST_MEMBARRIER_FAIL;
+	}
+	return TEST_MEMBARRIER_PASS;
+}
+
+static enum test_membarrier_status test_membarrier_flags_fail(void)
+{
+	int cmd = MEMBARRIER_CMD_QUERY, flags = 1;
+
+	if (sys_membarrier(cmd, flags) != -1) {
+		printf("membarrier: Wrong flags should fail but passed.\n");
+		return TEST_MEMBARRIER_FAIL;
 	}
+	return TEST_MEMBARRIER_PASS;
 }
 
-static void test_membarrier_success(void)
+static enum test_membarrier_status test_membarrier_success(void)
 {
-	int flags = 0;
+	int cmd = MEMBARRIER_CMD_SHARED, flags = 0;
 
-	if (sys_membarrier(MEMBARRIER_CMD_SHARED, flags) != 0) {
-		printf("membarrier: Executing MEMBARRIER failed, %s\n",
+	if (sys_membarrier(cmd, flags) != 0) {
+		printf("membarrier: Executing MEMBARRIER_CMD_SHARED failed. %s.\n",
 				strerror(errno));
-		ksft_exit_fail();
+		return TEST_MEMBARRIER_FAIL;
 	}
 
-	printf("membarrier: MEMBARRIER_CMD_SHARED success\n");
+	printf("membarrier: MEMBARRIER_CMD_SHARED success.\n");
+	return TEST_MEMBARRIER_PASS;
 }
 
-static void test_membarrier(void)
+static enum test_membarrier_status test_membarrier(void)
 {
-	test_membarrier_fail();
-	test_membarrier_success();
+	enum test_membarrier_status status;
+
+	status = test_membarrier_cmd_fail();
+	if (status)
+		return status;
+	status = test_membarrier_flags_fail();
+	if (status)
+		return status;
+	status = test_membarrier_success();
+	if (status)
+		return status;
+	return TEST_MEMBARRIER_PASS;
 }
 
-static int test_membarrier_exists(void)
+static enum test_membarrier_status test_membarrier_query(void)
 {
-	int flags = 0;
-
-	if (sys_membarrier(MEMBARRIER_CMD_QUERY, flags))
-		return 0;
+	int flags = 0, ret;
 
-	return 1;
+	printf("membarrier MEMBARRIER_CMD_QUERY ");
+	ret = sys_membarrier(MEMBARRIER_CMD_QUERY, flags);
+	if (ret < 0) {
+		printf("failed. %s.\n", strerror(errno));
+		switch (errno) {
+		case ENOSYS:
+			/*
+			 * It is valid to build a kernel with
+			 * CONFIG_MEMBARRIER=n. However, this skips the tests.
+			 */
+			return TEST_MEMBARRIER_SKIP;
+		case EINVAL:
+		default:
+			return TEST_MEMBARRIER_FAIL;
+		}
+	}
+	if (!(ret & MEMBARRIER_CMD_SHARED)) {
+		printf("command MEMBARRIER_CMD_SHARED is not supported.\n");
+		return TEST_MEMBARRIER_FAIL;
+	}
+	printf("syscall available.\n");
+	return TEST_MEMBARRIER_PASS;
 }
 
 int main(int argc, char **argv)
 {
-	printf("membarrier: MEMBARRIER_CMD_QUERY ");
-	if (test_membarrier_exists()) {
-		printf("syscall implemented\n");
-		test_membarrier();
-	} else {
-		printf("syscall not implemented!\n");
+	switch (test_membarrier_query()) {
+	case TEST_MEMBARRIER_FAIL:
 		return ksft_exit_fail();
+	case TEST_MEMBARRIER_SKIP:
+		return ksft_exit_skip();
+	}
+	switch (test_membarrier()) {
+	case TEST_MEMBARRIER_FAIL:
+		return ksft_exit_fail();
+	case TEST_MEMBARRIER_SKIP:
+		return ksft_exit_skip();
 	}
 
 	printf("membarrier: tests done!\n");
-
 	return ksft_exit_pass();
 }
-- 
2.1.4


* Re: [PATCH 2/3] selftests: add membarrier syscall test
@ 2015-08-31  6:54     ` Michael Ellerman
  0 siblings, 0 replies; 35+ messages in thread
From: Michael Ellerman @ 2015-08-31  6:54 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Andrew Morton, linux-kernel, linux-api, Pranith Kumar

On Fri, 2015-07-10 at 16:58 -0400, Mathieu Desnoyers wrote:
> From: Pranith Kumar <bobby.prani@gmail.com>
> 
> This patch adds a self test for the membarrier system call.
> 
> CC: Michael Ellerman <mpe@ellerman.id.au>

Sorry, I only just saw this due to some overzealous filtering on my end.


> diff --git a/tools/testing/selftests/membarrier/Makefile b/tools/testing/selftests/membarrier/Makefile
> new file mode 100644
> index 0000000..877a503
> --- /dev/null
> +++ b/tools/testing/selftests/membarrier/Makefile
> @@ -0,0 +1,11 @@
> +CFLAGS += -g -I../../../../usr/include/
> +
> +all:
> +	$(CC) $(CFLAGS) membarrier_test.c -o membarrier_test
>
> +TEST_PROGS := membarrier_test

You don't need to specify the rule; the implicit one will do exactly the same,
so you can just do:

TEST_PROGS := membarrier_test

all: $(TEST_PROGS)

> diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
> new file mode 100644
> index 0000000..3c9f217
> --- /dev/null
> +++ b/tools/testing/selftests/membarrier/membarrier_test.c
> @@ -0,0 +1,71 @@
> +#define _GNU_SOURCE
> +#define __EXPORTED_HEADERS__

Why are you exporting that?

I suspect to try and get around the "Attempt to use kernel headers from user space" warning.

But you're correctly building against the installed headers, not the kernel
headers, so you don't need to do that.

> +
> +#include <linux/membarrier.h>
> +#include <asm-generic/unistd.h>

This should just be <unistd.h>
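
As a rough sketch of what the include list could look like with that
change, assuming the exported headers have been installed so that
<linux/membarrier.h> is found, and assuming <sys/syscall.h> provides
__NR_membarrier (on glibc it pulls the number in from the kernel's
<asm/unistd.h>). The helper name membarrier_query() is made up for
illustration:

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>	/* MEMBARRIER_CMD_* from the exported headers */
#include <sys/syscall.h>	/* __NR_membarrier */
#include <unistd.h>		/* syscall() */

/* Wrapper as in the patch, minus the static qualifier. */
int sys_membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/*
 * Hypothetical helper: returns the bitmask of supported commands,
 * or -1 with errno set (ENOSYS when CONFIG_MEMBARRIER=n).
 */
int membarrier_query(void)
{
	return sys_membarrier(MEMBARRIER_CMD_QUERY, 0);
}
```

Neither __EXPORTED_HEADERS__ nor <asm-generic/unistd.h> appears in this
variant.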


cheers



* Re: [PATCH 2/3] selftests: add membarrier syscall test
@ 2015-09-01 17:11       ` Mathieu Desnoyers
  0 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-09-01 17:11 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: Andrew Morton, linux-kernel, linux-api, Pranith Kumar

----- On Aug 31, 2015, at 2:54 AM, Michael Ellerman mpe@ellerman.id.au wrote:

> On Fri, 2015-07-10 at 16:58 -0400, Mathieu Desnoyers wrote:
>> From: Pranith Kumar <bobby.prani@gmail.com>
>> 
>> This patch adds a self test for the membarrier system call.
>> 
>> CC: Michael Ellerman <mpe@ellerman.id.au>
> 
> Sorry, I only just saw this due to some overzealous filtering on my end.
> 
> 
>> diff --git a/tools/testing/selftests/membarrier/Makefile
>> b/tools/testing/selftests/membarrier/Makefile
>> new file mode 100644
>> index 0000000..877a503
>> --- /dev/null
>> +++ b/tools/testing/selftests/membarrier/Makefile
>> @@ -0,0 +1,11 @@
>> +CFLAGS += -g -I../../../../usr/include/
>> +
>> +all:
>> +	$(CC) $(CFLAGS) membarrier_test.c -o membarrier_test
>>
>> +TEST_PROGS := membarrier_test
> 
> You don't need to specify the rule; the implicit one will do exactly the same,
> so you can just do:
> 
> TEST_PROGS := membarrier_test
> 
> all: $(TEST_PROGS)
> 
>> diff --git a/tools/testing/selftests/membarrier/membarrier_test.c
>> b/tools/testing/selftests/membarrier/membarrier_test.c
>> new file mode 100644
>> index 0000000..3c9f217
>> --- /dev/null
>> +++ b/tools/testing/selftests/membarrier/membarrier_test.c
>> @@ -0,0 +1,71 @@
>> +#define _GNU_SOURCE
>> +#define __EXPORTED_HEADERS__
> 
> Why are you exporting that?
> 
> I suspect to try and get around the "Attempt to use kernel headers from user
> space" warning.
> 
> But you're correctly building against the installed headers, not the kernel
> headers, so you don't need to do that.

Just to make sure I understand: should we expect that
everyone will issue "make headers_install" on their system
before doing a make kselftest ?

I see that a few selftests (e.g. memfd) are adding the
source tree include paths to the compiler include paths,
which I guess is to ensure that the kselftest will
work even if the system headers are not up to date.

Thanks,

Mathieu

> 
>> +
>> +#include <linux/membarrier.h>
>> +#include <asm-generic/unistd.h>
> 
> This should just be <unistd.h>
> 
> 
> cheers

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [PATCH 2/3] selftests: add membarrier syscall test
@ 2015-09-01 18:32         ` Andy Lutomirski
  0 siblings, 0 replies; 35+ messages in thread
From: Andy Lutomirski @ 2015-09-01 18:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Michael Ellerman, Andrew Morton, linux-kernel, linux-api, Pranith Kumar

On Tue, Sep 1, 2015 at 10:11 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> ----- On Aug 31, 2015, at 2:54 AM, Michael Ellerman mpe@ellerman.id.au wrote:
>
>> On Fri, 2015-07-10 at 16:58 -0400, Mathieu Desnoyers wrote:
>>> From: Pranith Kumar <bobby.prani@gmail.com>
>>>
>>> This patch adds a self test for the membarrier system call.
>>>
>>> CC: Michael Ellerman <mpe@ellerman.id.au>
>>
>> Sorry, I only just saw this due to some overzealous filtering on my end.
>>
>>
>>> diff --git a/tools/testing/selftests/membarrier/Makefile
>>> b/tools/testing/selftests/membarrier/Makefile
>>> new file mode 100644
>>> index 0000000..877a503
>>> --- /dev/null
>>> +++ b/tools/testing/selftests/membarrier/Makefile
>>> @@ -0,0 +1,11 @@
>>> +CFLAGS += -g -I../../../../usr/include/
>>> +
>>> +all:
>>> +    $(CC) $(CFLAGS) membarrier_test.c -o membarrier_test
>>>
>>> +TEST_PROGS := membarrier_test
>>
>> You don't need to specify the rule; the implicit one will do exactly the same,
>> so you can just do:
>>
>> TEST_PROGS := membarrier_test
>>
>> all: $(TEST_PROGS)
>>
>>> diff --git a/tools/testing/selftests/membarrier/membarrier_test.c
>>> b/tools/testing/selftests/membarrier/membarrier_test.c
>>> new file mode 100644
>>> index 0000000..3c9f217
>>> --- /dev/null
>>> +++ b/tools/testing/selftests/membarrier/membarrier_test.c
>>> @@ -0,0 +1,71 @@
>>> +#define _GNU_SOURCE
>>> +#define __EXPORTED_HEADERS__
>>
>> Why are you exporting that?
>>
>> I suspect to try and get around the "Attempt to use kernel headers from user
>> space" warning.
>>
>> But you're correctly building against the installed headers, not the kernel
>> headers, so you don't need to do that.
>
> Just to make sure I understand: should we expect that
> everyone will issue "make headers_install" on their system
> before doing a make kselftest ?
>
> I see that a few selftests (e.g. memfd) are adding the
> source tree include paths to the compiler include paths,
> which I guess is to ensure that the kselftest will
> work even if the system headers are not up to date.

It would be really nice if there were a clean way for selftests to
include the kernel headers.  Perhaps make should build the exportable
headers somewhere as a dependency of kselftests.

--Andy

* Re: [PATCH 2/3] selftests: add membarrier syscall test
@ 2015-09-03  9:24         ` Michael Ellerman
  0 siblings, 0 replies; 35+ messages in thread
From: Michael Ellerman @ 2015-09-03  9:24 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Andrew Morton, linux-kernel, linux-api, Pranith Kumar

On Tue, 2015-09-01 at 17:11 +0000, Mathieu Desnoyers wrote:
> ----- On Aug 31, 2015, at 2:54 AM, Michael Ellerman mpe@ellerman.id.au wrote:
> > On Fri, 2015-07-10 at 16:58 -0400, Mathieu Desnoyers wrote:
> >> diff --git a/tools/testing/selftests/membarrier/membarrier_test.c
> >> b/tools/testing/selftests/membarrier/membarrier_test.c
> >> new file mode 100644
> >> index 0000000..3c9f217
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/membarrier/membarrier_test.c
> >> @@ -0,0 +1,71 @@
> >> +#define _GNU_SOURCE
> >> +#define __EXPORTED_HEADERS__
> > 
> > Why are you exporting that?
> > 
> > I suspect to try and get around the "Attempt to use kernel headers from user
> > space" warning.
> > 
> > But you're correctly building against the installed headers, not the kernel
> > headers, so you don't need to do that.
> 
> Just to make sure I understand: should we expect that
> everyone will issue "make headers_install" on their system
> before doing a make kselftest ?

Usually yes, but not always.

They might be deliberately building the selftests against their installed
headers. That's their choice.

In a case like this the test will fail to build, which is fine IMHO.

> I see that a few selftests (e.g. memfd) are adding the
> source tree include paths to the compiler include paths,
> which I guess is to ensure that the kselftest will
> work even if the system headers are not up to date.

Yeah they should be fixed to not do that. The unexported kernel headers are not
designed to be used from userspace. It works sometimes on some arches,
depending on the exact headers that get included etc. But it's wrong™.

cheers



* Re: [PATCH 2/3] selftests: add membarrier syscall test
  2015-09-01 18:32         ` Andy Lutomirski
  (?)
@ 2015-09-03  9:33         ` Michael Ellerman
  2015-09-03 15:47             ` Mathieu Desnoyers
  -1 siblings, 1 reply; 35+ messages in thread
From: Michael Ellerman @ 2015-09-03  9:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathieu Desnoyers, Andrew Morton, linux-kernel, linux-api, Pranith Kumar

On Tue, 2015-09-01 at 11:32 -0700, Andy Lutomirski wrote:
> On Tue, Sep 1, 2015 at 10:11 AM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> > Just to make sure I understand: should we expect that
> > everyone will issue "make headers_install" on their system
> > before doing a make kselftest ?
> >
> > I see that a few selftests (e.g. memfd) are adding the
> > source tree include paths to the compiler include paths,
> > which I guess is to ensure that the kselftest will
> > work even if the system headers are not up to date.
> 
> It would be really nice if there were a clean way for selftests to
> include the kernel headers.  

What's wrong with make headers_install?

Or do you mean when writing the tests? That we could fix by adding the
../../../../usr/include path to CFLAGS in lib.mk. And fixing all the tests that
overwrite CFLAGS to append to CFLAGS.

> Perhaps make should build the exportable headers somewhere as a dependency of
> kselftests.

Yeah the top-level kselftest target could do that I think.

Folks who don't want the headers installed can just run the selftests Makefile
directly.

Does this work for you?

diff --git a/Makefile b/Makefile
index c361593..c8841d3 100644
--- a/Makefile
+++ b/Makefile
@@ -1080,7 +1080,7 @@ headers_check: headers_install
 # Kernel selftest
 
 PHONY += kselftest
-kselftest:
+kselftest: headers_install
        $(Q)$(MAKE) -C tools/testing/selftests run_tests
 
 # ---------------------------------------------------------------------------

cheers



* Re: [PATCH 2/3] selftests: add membarrier syscall test
@ 2015-09-03 15:47             ` Mathieu Desnoyers
  0 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-09-03 15:47 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Andy Lutomirski, Andrew Morton, linux-kernel, linux-api, Pranith Kumar

----- On Sep 3, 2015, at 5:33 AM, Michael Ellerman mpe@ellerman.id.au wrote:

> On Tue, 2015-09-01 at 11:32 -0700, Andy Lutomirski wrote:
>> On Tue, Sep 1, 2015 at 10:11 AM, Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com> wrote:
>> > Just to make sure I understand: should we expect that
>> > everyone will issue "make headers_install" on their system
>> > before doing a make kselftest ?
>> >
>> > I see that a few selftests (e.g. memfd) are adding the
>> > source tree include paths to the compiler include paths,
>> > which I guess is to ensure that the kselftest will
>> > work even if the system headers are not up to date.
>> 
>> It would be really nice if there were a clean way for selftests to
>> include the kernel headers.
> 
> What's wrong with make headers_install?
> 
> Or do you mean when writing the tests? That we could fix by adding the
> ../../../../usr/include path to CFLAGS in lib.mk. And fixing all the tests that
> overwrite CFLAGS to append to CFLAGS.
> 
>> Perhaps make should build the exportable headers somewhere as a dependency of
>> kselftests.
> 
> Yeah the top-level kselftest target could do that I think.
> 
> Folks who don't want the headers installed can just run the selftests Makefile
> directly.
> 
> Does this work for you?
> 
> diff --git a/Makefile b/Makefile
> index c361593..c8841d3 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1080,7 +1080,7 @@ headers_check: headers_install
> # Kernel selftest
> 
> PHONY += kselftest
> -kselftest:
> +kselftest: headers_install
>        $(Q)$(MAKE) -C tools/testing/selftests run_tests

My personal experience is that make headers_install
does not necessarily play well with the distribution
header file hierarchy, which requires some tweaks
to be done by the users (e.g. asm vs x86_64-linux-gnu).
Also, headers_install typically expects an INSTALL_HDR_PATH.
It would be interesting if we could install the kernel
headers into a specific location that kselftest then
re-uses, so that running the tests without much manual
configuration does not require overwriting the
distribution header files.

Thoughts ?

Thanks,

Mathieu

> 
> # ---------------------------------------------------------------------------
> 
> cheers

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [PATCH 2/3] selftests: add membarrier syscall test
@ 2015-09-04  3:36               ` Michael Ellerman
  0 siblings, 0 replies; 35+ messages in thread
From: Michael Ellerman @ 2015-09-04  3:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andy Lutomirski, Andrew Morton, linux-kernel, linux-api, Pranith Kumar

On Thu, 2015-09-03 at 15:47 +0000, Mathieu Desnoyers wrote:
> ----- On Sep 3, 2015, at 5:33 AM, Michael Ellerman mpe@ellerman.id.au wrote:
> 
> > On Tue, 2015-09-01 at 11:32 -0700, Andy Lutomirski wrote:
> >> On Tue, Sep 1, 2015 at 10:11 AM, Mathieu Desnoyers
> >> <mathieu.desnoyers@efficios.com> wrote:
> >> > Just to make sure I understand: should we expect that
> >> > everyone will issue "make headers_install" on their system
> >> > before doing a make kselftest ?
> >> >
> >> > I see that a few selftests (e.g. memfd) are adding the
> >> > source tree include paths to the compiler include paths,
> >> > which I guess is to ensure that the kselftest will
> >> > work even if the system headers are not up to date.
> >> 
> >> It would be really nice if there were a clean way for selftests to
> >> include the kernel headers.
> > 
> > What's wrong with make headers_install?
> > 
> > Or do you mean when writing the tests? That we could fix by adding the
> > ../../../../usr/include path to CFLAGS in lib.mk. And fixing all the tests that
> > overwrite CFLAGS to append to CFLAGS.
> > 
> >> Perhaps make should build the exportable headers somewhere as a dependency of
> >> kselftests.
> > 
> > Yeah the top-level kselftest target could do that I think.
> > 
> > Folks who don't want the headers installed can just run the selftests Makefile
> > directly.
> > 
> > Does this work for you?
> > 
> > diff --git a/Makefile b/Makefile
> > index c361593..c8841d3 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -1080,7 +1080,7 @@ headers_check: headers_install
> > # Kernel selftest
> > 
> > PHONY += kselftest
> > -kselftest:
> > +kselftest: headers_install
> >        $(Q)$(MAKE) -C tools/testing/selftests run_tests
> 
> My personal experience is that make headers_install does not necessarily play
> well with the distribution header file hierarchy, which requires some tweaks
> to be done by the users (e.g. asm vs x86_64-linux-gnu).

OK, I've never had issues. What exactly are you doing and how is it going wrong?

> Also, headers_install typically expects a INSTALL_HDR_PATH. 

You can specify it, but the default is just usr/, i.e. in the kernel directory,
which is what I was proposing. (Actually it's $(objtree)/usr.)
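
For illustration only (this helper is not part of the kernel tree or of this
thread): a minimal shell sketch of the default described above. The function
name selftest_hdr_dir is hypothetical; INSTALL_HDR_PATH and the include/
subdirectory beneath it are the ones headers_install uses.

```shell
#!/bin/sh
# Hypothetical helper (not a kernel script): print the include directory
# that "make headers_install" populates.
# INSTALL_HDR_PATH overrides the default of <objtree>/usr; the headers end
# up in the include/ subdirectory of that path.
selftest_hdr_dir() {
    objtree="${1:-.}"
    printf '%s/include\n' "${INSTALL_HDR_PATH:-$objtree/usr}"
}

# Example use from the top of a kernel tree (commands shown as comments):
#   make headers_install
#   gcc -g -I"$(selftest_hdr_dir .)" membarrier_test.c -o membarrier_test
```

With no override this resolves to usr/include under the object tree, so the
distribution headers in /usr/include are never touched.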

> It would be interesting if we could install the kernel headers into a
> specific location that is then re-used by kselftest, so using it without too
> much manual configuration does not require to overwrite the distribution
> header files to run tests.

I think we can do that now, ie:

  $ ls /usr/include/linux/membarrier.h
  ls: cannot access /usr/include/linux/membarrier.h: No such file or directory

  $ cd linux-next
  $ make mrproper
  $ make headers_install
  ...
  $ ls usr/include/linux/membarrier.h
  usr/include/linux/membarrier.h
  $ make -C tools/testing/selftests TARGETS=membarrier
  make: Entering directory '/home/michael/work/topics/selftests/linux-next/tools/testing/selftests'
  for TARGET in membarrier; do \
  	make -C $TARGET; \
  done;
  make[1]: Entering directory '/home/michael/work/topics/selftests/linux-next/tools/testing/selftests/membarrier'
  gcc -g -I../../../../usr/include/ membarrier_test.c -o membarrier_test
  make[1]: Leaving directory '/home/michael/work/topics/selftests/linux-next/tools/testing/selftests/membarrier'
  make: Leaving directory '/home/michael/work/topics/selftests/linux-next/tools/testing/selftests'

  $ ./tools/testing/selftests/membarrier/membarrier_test
  membarrier MEMBARRIER_CMD_QUERY failed. Function not implemented.
  $


So that seems to be working for me. Are you using a different workflow, or
am I just missing something?

I guess it probably doesn't work if you're using O=.. ?

cheers



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] selftests: add membarrier syscall test
  2015-09-04  3:36               ` Michael Ellerman
  (?)
@ 2015-09-07 16:01               ` Mathieu Desnoyers
  2015-09-08  4:19                   ` Michael Ellerman
  -1 siblings, 1 reply; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-09-07 16:01 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Andy Lutomirski, Andrew Morton, linux-kernel, linux-api, Pranith Kumar

[-- Attachment #1: Type: text/plain, Size: 5583 bytes --]

----- On Sep 3, 2015, at 11:36 PM, Michael Ellerman mpe@ellerman.id.au wrote:

> On Thu, 2015-09-03 at 15:47 +0000, Mathieu Desnoyers wrote:
>> ----- On Sep 3, 2015, at 5:33 AM, Michael Ellerman mpe@ellerman.id.au wrote:
>> 
>> > On Tue, 2015-09-01 at 11:32 -0700, Andy Lutomirski wrote:
>> >> On Tue, Sep 1, 2015 at 10:11 AM, Mathieu Desnoyers
>> >> <mathieu.desnoyers@efficios.com> wrote:
>> >> > Just to make sure I understand: should we expect that
>> >> > everyone will issue "make headers_install" on their system
>> >> > before doing a make kselftest ?
>> >> >
>> >> > I see that a few selftests (e.g. memfd) are adding the
>> >> > source tree include paths to the compiler include paths,
>> >> > which I guess is to ensure that the kselftest will
>> >> > work even if the system headers are not up to date.
>> >> 
>> >> It would be really nice if there were a clean way for selftests to
>> >> include the kernel headers.
>> > 
>> > What's wrong with make headers_install?
>> > 
>> > Or do you mean when writing the tests? That we could fix by adding the
>> > ../../../../usr/include path to CFLAGS in lib.mk. And fixing all the tests that
>> > overwrite CFLAGS to append to CFLAGS.
>> > 
>> >> Perhaps make should build the exportable headers somewhere as a dependency of
>> >> kselftests.
>> > 
>> > Yeah the top-level kselftest target could do that I think.
>> > 
>> > Folks who don't want the headers installed can just run the selftests Makefile
>> > directly.
>> > 
>> > Does this work for you?
>> > 
>> > diff --git a/Makefile b/Makefile
>> > index c361593..c8841d3 100644
>> > --- a/Makefile
>> > +++ b/Makefile
>> > @@ -1080,7 +1080,7 @@ headers_check: headers_install
>> > # Kernel selftest
>> > 
>> > PHONY += kselftest
>> > -kselftest:
>> > +kselftest: headers_install
>> >        $(Q)$(MAKE) -C tools/testing/selftests run_tests
>> 
>> My personal experience is that make headers_install does not necessarily play
>> well with the distribution header file hierarchy, which requires some tweaks
>> to be done by the users (e.g. asm vs x86_64-linux-gnu).
> 
> OK, I've never had issues. What exactly are you doing and how is it going wrong?

After some investigation, I noticed the following:

1) I first ran make headers_install as root, which installed the
headers within my build tree. I later tried it again as user, and
it failed due to permission issues (my bad). This is where I tried
to install it into my system rather than under my build directory,
which caused a mess.

2) Since make kselftest should be run as root (according to make
help), this means that all the output files generated by the build
are owned by root. It leads to permissions issues when trying to
rebuild the tests as user afterward. Perhaps we could introduce a
distinction between make kselftest_build and make kselftest_run ?
The former could be executed as user, and the latter as root.
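
A small sketch of how one might spot the leftover root-owned files described
in point 2 before rebuilding as a regular user. The helper name foreign_owned
is hypothetical and does not come from the kernel tree; it relies only on
find and id.

```shell
#!/bin/sh
# Hypothetical helper: list files under a directory that are not owned by
# the current user, e.g. build outputs left behind by an earlier run as root.
foreign_owned() {
    find "${1:-.}" ! -user "$(id -u)" -print
}

# Example: check the selftests directory before rebuilding as a regular user.
#   foreign_owned tools/testing/selftests
# Anything listed needs to be cleaned up (or chown'ed back) as root first.
```

An empty result means the tree can be rebuilt without permission errors.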

> 
>> Also, headers_install typically expects a INSTALL_HDR_PATH.
> 
> You can specify it, but the default is just usr/, ie. in the kernel directory,
> that is what I was proposing. (Actually it's $(objtree)/usr).

OK, trying it out.

> 
>> It would be interesting if we could install the kernel headers into a
>> specific location that is then re-used by kselftest, so using it without too
>> much manual configuration does not require to overwrite the distribution
>> header files to run tests.
> 
> I think we can do that now, ie:
> 
>  $ ls /usr/include/linux/membarrier.h
>  ls: cannot access /usr/include/linux/membarrier.h: No such file or directory
> 
>  $ cd linux-next
>  $ make mrproper
>  $ make headers_install
>  ...
>  $ ls usr/include/linux/membarrier.h
>  usr/include/linux/membarrier.h
>  $ make -C tools/testing/selftests TARGETS=membarrier
>  make: Entering directory
>  '/home/michael/work/topics/selftests/linux-next/tools/testing/selftests'
>  for TARGET in membarrier; do \
>  	make -C $TARGET; \
>  done;
>  make[1]: Entering directory
>  '/home/michael/work/topics/selftests/linux-next/tools/testing/selftests/membarrier'
>  gcc -g -I../../../../usr/include/ membarrier_test.c -o membarrier_test
>  make[1]: Leaving directory
>  '/home/michael/work/topics/selftests/linux-next/tools/testing/selftests/membarrier'
>  make: Leaving directory
>  '/home/michael/work/topics/selftests/linux-next/tools/testing/selftests'
> 
>  $ ./tools/testing/selftests/membarrier/membarrier_test
>  membarrier MEMBARRIER_CMD_QUERY failed. Function not implemented.
>  $
> 
> 
> So that seems to be working for me. Are you doing some different work flow, or
> am I just missing something?

When doing make headers_install, it indeed installs
membarrier.h where we expect it under the build output
dir:

$ ls usr/include/linux/membarrier.h 
usr/include/linux/membarrier.h

However, the build fails if I issue:

$ make -C tools/testing/selftests TARGETS=membarrier
make: Entering directory `/home/efficios/git/linux-next/tools/testing/selftests'
for TARGET in membarrier; do \
		make -C $TARGET; \
	done;
make[1]: Entering directory `/home/efficios/git/linux-next/tools/testing/selftests/membarrier'
gcc     membarrier_test.c   -o membarrier_test
membarrier_test.c:2:30: fatal error: linux/membarrier.h: No such file or directory
 #include <linux/membarrier.h>

This is after applying the modifications you requested
(see patch attached). Perhaps I did something wrong ?


> 
> I guess it probably doesn't work if you're using O=.. ?

I'm not using anything special here. My src tree is my obj
output directory.

Thanks,

Mathieu

> 
> cheers

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

[-- Attachment #2: 0001-to-send-Cleanup-membarrier-selftest-header-inclusion.patch --]
[-- Type: text/x-patch; name=0001-to-send-Cleanup-membarrier-selftest-header-inclusion.patch, Size: 1831 bytes --]

From 352465bd42737de4b97ce2072f0b36b2499dbb9e Mon Sep 17 00:00:00 2001
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date: Tue, 1 Sep 2015 13:42:52 -0400
Subject: [PATCH] Cleanup: membarrier selftest header inclusion

The kselftest can depend on having installed kernel headers. Therefore,
there is no need to point to the local kernel source for headers.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: linux-api@vger.kernel.org
CC: Pranith Kumar <bobby.prani@gmail.com>
CC: Shuah Khan <shuahkh@osg.samsung.com>
---
 tools/testing/selftests/membarrier/Makefile          | 9 +++------
 tools/testing/selftests/membarrier/membarrier_test.c | 5 +----
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/membarrier/Makefile b/tools/testing/selftests/membarrier/Makefile
index 877a503..0e2a5b7 100644
--- a/tools/testing/selftests/membarrier/Makefile
+++ b/tools/testing/selftests/membarrier/Makefile
@@ -1,11 +1,8 @@
-CFLAGS += -g -I../../../../usr/include/
-
-all:
-	$(CC) $(CFLAGS) membarrier_test.c -o membarrier_test
-
 TEST_PROGS := membarrier_test
 
+all: $(TEST_PROGS)
+
 include ../lib.mk
 
 clean:
-	$(RM) membarrier_test
+	$(RM) $(TEST_PROGS)
diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
index dde3125..535f0fe 100644
--- a/tools/testing/selftests/membarrier/membarrier_test.c
+++ b/tools/testing/selftests/membarrier/membarrier_test.c
@@ -1,9 +1,6 @@
 #define _GNU_SOURCE
-#define __EXPORTED_HEADERS__
-
 #include <linux/membarrier.h>
-#include <asm-generic/unistd.h>
-#include <sys/syscall.h>
+#include <syscall.h>
 #include <stdio.h>
 #include <errno.h>
 #include <string.h>
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] selftests: add membarrier syscall test
@ 2015-09-08  4:19                   ` Michael Ellerman
  0 siblings, 0 replies; 35+ messages in thread
From: Michael Ellerman @ 2015-09-08  4:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andy Lutomirski, Andrew Morton, linux-kernel, linux-api, Pranith Kumar

On Mon, 2015-09-07 at 16:01 +0000, Mathieu Desnoyers wrote:
> ----- On Sep 3, 2015, at 11:36 PM, Michael Ellerman mpe@ellerman.id.au wrote:
> > On Thu, 2015-09-03 at 15:47 +0000, Mathieu Desnoyers wrote:
> >> 
> >> My personal experience is that make headers_install does not necessarily play
> >> well with the distribution header file hierarchy, which requires some tweaks
> >> to be done by the users (e.g. asm vs x86_64-linux-gnu).
> > 
> > OK, I've never had issues. What exactly are you doing and how is it going wrong?
> 
> After some investigation, I noticed the following:
> 
> 1) I first ran make headers_install as root, which installed the
> headers within my build tree. I later tried it again as user, and
> it failed due to permission issues (my bad). This is where I tried
> to install it into my system rather than under my build directory,
> which caused a mess.

Yeah OK that's a good point about root.

I tend to build as a regular user and then copy the installed tests to another
machine where I run them as root.

> 2) Since make kselftest should be run as root (according to make
> help), 

Well some of the tests only work when run as root. IMHO we should support
running as many tests as possible as non-root, but some of them obviously
require root.

So you can run them as non-root, but to get maximum coverage you need to run
them as root.

> this means that all the output files generated by the build
> are owned by root. It leads to permissions issues when trying to
> rebuild the tests as user afterward. Perhaps we could introduce a
> distinction between make kselftest_build and make kselftest_run ?
> The former could be executed as user, and the latter as root.

Right. Personally I don't use the kselftest target at all, I just cd down to
tools/testing/selftests and run make there.

If it was up to me the kselftest target would go away, because it's only caused
us trouble so far.

But given it's there we should try to make it work as well as possible. So yeah
splitting it into build and run would make sense, that way you could do:

$ make headers_install
$ make kselftest_build
$ sudo make kselftest_run

And that would hopefully do the right thing.

Would that improve the workflow for you?

> > So that seems to be working for me. Are you doing some different work flow, or
> > am I just missing something?
> 
> When doing make headers_install, it indeed installs
> membarrier.h where we expect it under the build output
> dir:
> 
> $ ls usr/include/linux/membarrier.h 
> usr/include/linux/membarrier.h
> 
> However, if I issue 
> 
> $ make -C tools/testing/selftests TARGETS=membarrier
> make: Entering directory `/home/efficios/git/linux-next/tools/testing/selftests'
> for TARGET in membarrier; do \
> 		make -C $TARGET; \
> 	done;
> make[1]: Entering directory `/home/efficios/git/linux-next/tools/testing/selftests/membarrier'
> gcc     membarrier_test.c   -o membarrier_test
> membarrier_test.c:2:30: fatal error: linux/membarrier.h: No such file or directory
>  #include <linux/membarrier.h>
> 
> This is after applying the modifications you requested
> (see patch attached). Perhaps I did something wrong ?

Yeah sorry, you still need the -I line:

CFLAGS += -I../../../../usr/include/


We /should/ add that to lib.mk so it's inherited by everyone, but we haven't
yet.

So I think if you put that back the instructions I gave you will work?

cheers



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] selftests: add membarrier syscall test
@ 2015-09-08 14:02                     ` Mathieu Desnoyers
  0 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-09-08 14:02 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Andy Lutomirski, Andrew Morton, linux-kernel, linux-api,
	Pranith Kumar, Jonathan Rajotte

----- On Sep 8, 2015, at 12:19 AM, Michael Ellerman mpe@ellerman.id.au wrote:

> On Mon, 2015-09-07 at 16:01 +0000, Mathieu Desnoyers wrote:
>> ----- On Sep 3, 2015, at 11:36 PM, Michael Ellerman mpe@ellerman.id.au wrote:
>> > On Thu, 2015-09-03 at 15:47 +0000, Mathieu Desnoyers wrote:
>> >> 
>> >> My personal experience is that make headers_install does not necessarily play
>> >> well with the distribution header file hierarchy, which requires some tweaks
>> >> to be done by the users (e.g. asm vs x86_64-linux-gnu).
>> > 
>> > OK, I've never had issues. What exactly are you doing and how is it going wrong?
>> 
>> After some investigation, I noticed the following:
>> 
>> 1) I first ran make headers_install as root, which installed the
>> headers within my build tree. I later tried it again as user, and
>> it failed due to permission issues (my bad). This is where I tried
>> to install it into my system rather than under my build directory,
>> which caused a mess.
> 
> Yeah OK that's a good point about root.
> 
> I tend to build as a regular user and then copy the installed tests to another
> machine where I run them as root.
> 
>> 2) Since make kselftest should be run as root (according to make
>> help),
> 
> Well some of the tests only work when run as root. IMHO we should support
> running as many tests as possible as non-root, but some of them obviously
> require root.
> 
> So you can run them as non-root, but to get maximum coverage you need to run
> them as root.

Works for me. We do something similar in lttng-tools. We use "tap"
(https://testanything.org/) for tests, and explicitly skip all tests that
require root if we detect that we don't run as root. I notice that many
selftests format their own output. The nice part about standardizing on
something like tap is that it simplifies automated parsing of the test
output.
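
As a rough sketch of the skip-if-not-root idea (this is not lttng-tools code;
the function names are made up for illustration), a TAP-style runner in shell
could look like this. The uid is passed in explicitly so the behaviour is
easy to exercise for either kind of user.

```shell
#!/bin/sh
# Hypothetical TAP-style helpers; names are illustrative only.
# tap_result NUM DESC NEEDS_ROOT UID CMD...
#   Prints one TAP result line. Tests that need root are reported with a
#   SKIP directive when UID is not 0 instead of being run.
tap_result() {
    num=$1; desc=$2; needs_root=$3; uid=$4
    shift 4
    if [ "$needs_root" = yes ] && [ "$uid" -ne 0 ]; then
        echo "ok $num - $desc # SKIP requires root"
    elif "$@" >/dev/null 2>&1; then
        echo "ok $num - $desc"
    else
        echo "not ok $num - $desc"
    fi
}

# run_plan [UID]: print the plan line, then three sample results.
# "true" and "false" stand in for real test programs.
run_plan() {
    uid=${1:-$(id -u)}
    echo "1..3"
    tap_result 1 "runs as any user" no "$uid" true
    tap_result 2 "needs root" yes "$uid" true
    tap_result 3 "failing check" no "$uid" false
}
```

Calling run_plan with no argument uses the current uid; a harness only has to
parse the "ok" / "not ok" lines and the "# SKIP" directive.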

> 
>> this means that all the output files generated by the build
>> are owned by root. It leads to permissions issues when trying to
>> rebuild the tests as user afterward. Perhaps we could introduce a
>> distinction between make kselftest_build and make kselftest_run ?
>> The former could be executed as user, and the latter as root.
> 
> Right. Personally I don't use the kselftest target at all, I just cd down to
> tools/testing/selftests and run make there.
> 
> If it was up to me the kselftest target would go away, because it's only caused
> us trouble so far.
> 
> But given it's there we should try to make it work as well as possible. So yeah
> splitting it into build and run would make sense, that way you could do:
> 
> $ make headers_install
> $ make kselftest_build
> $ sudo make kselftest_run
> 
> And that would hopefully do the right thing.
> 
> Would that improve the workflow for you?

Yes. Although I'm wondering why the kernel should be different from many
other projects out there. Why not simply:

- Add a kselftest_build dependency to the kernel build, so tests are always built,
  and warnings that arise from modifying anything related to installed headers
  will trigger for everyone,
- Make kselftest_build depend on headers_install (installing into the obj tree),
- Optionally add a "make check" alias to "make kselftest".

This way, running the tests becomes as simple as:

make
sudo make check

Documentation is key here: make sure to update Documentation/kselftest.txt to
document where the selftests look for the kernel headers (not in the system
include directories, but under usr/ in the obj tree). This is the missing
documentation bit that confused me the most.

> 
>> > So that seems to be working for me. Are you doing some different work flow, or
>> > am I just missing something?
>> 
>> When doing make headers_install, it indeed installs
>> membarrier.h where we expect it under the build output
>> dir:
>> 
>> $ ls usr/include/linux/membarrier.h
>> usr/include/linux/membarrier.h
>> 
>> However, if I issue
>> 
>> $ make -C tools/testing/selftests TARGETS=membarrier
>> make: Entering directory `/home/efficios/git/linux-next/tools/testing/selftests'
>> for TARGET in membarrier; do \
>> 		make -C $TARGET; \
>> 	done;
>> make[1]: Entering directory
>> `/home/efficios/git/linux-next/tools/testing/selftests/membarrier'
>> gcc     membarrier_test.c   -o membarrier_test
>> membarrier_test.c:2:30: fatal error: linux/membarrier.h: No such file or
>> directory
>>  #include <linux/membarrier.h>
>> 
>> This is after applying the modifications you requested
>> (see patch attached). Perhaps I did something wrong ?
> 
> Yeah sorry, you still need the -I line:
> 
> CFLAGS += -I../../../../usr/include/
> 
> 
> We /should/ add that to lib.mk so it's inherited by everyone, but we haven't
> yet.

Yep, this would be a good start.

> 
> So I think if you put that back the instructions I gave you will work?

Yes, it does, thanks!

Mathieu

> 
> cheers

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] selftests: add membarrier syscall test
@ 2015-09-08 14:02                     ` Mathieu Desnoyers
  0 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-09-08 14:02 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Andy Lutomirski, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Pranith Kumar,
	Jonathan Rajotte

----- On Sep 8, 2015, at 12:19 AM, Michael Ellerman mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org wrote:

> On Mon, 2015-09-07 at 16:01 +0000, Mathieu Desnoyers wrote:
>> ----- On Sep 3, 2015, at 11:36 PM, Michael Ellerman mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org wrote:
>> > On Thu, 2015-09-03 at 15:47 +0000, Mathieu Desnoyers wrote:
>> >> 
>> >> My personal experience is that make headers_install does not necessarily play
>> >> well with the distribution header file hierarchy, which requires some tweaks
>> >> to be done by the users (e.g. asm vs x86_64-linux-gnu).
>> > 
>> > OK, I've never had issues. What exactly are you doing and how is it going wrong?
>> 
>> After some investigation, I noticed the following:
>> 
>> 1) I first ran make headers_install as root, which installed the
>> headers within my build tree. I later tried it again as user, and
>> it failed due to permission issues (my bad). This is where I tried
>> to install it into my system rather than under my build directory,
>> which caused a mess.
> 
> Yeah OK that's a good point about root.
> 
> I tend to build as a regular user and then copy the installed tests to another
> machine where I run them as root.
> 
>> 2) Since make kselftest should be run as root (according to make
>> help),
> 
> Well some of the tests only work when run as root. IMHO we should support
> running as many tests as possible as non-root, but some of them obviously
> require root.
> 
> So you can run them as non-root, but to get maximum coverage you need to run
> them as root.

Works for me. We do something similar in lttng-tools. We use "tap"
(https://testanything.org/) for tests, and explicitly skip all tests that
require root if we detect that we don't run as root. I notice that many
selftests format their own output. The nice part about standardizing on
something like tap is that it simplifies automated parsing of the test
output.

> 
>> this means that all the output files generated by the build
>> are owned by root. It leads to permissions issues when trying to
>> rebuild the tests as user afterward. Perhaps we could introduce a
>> distinction between make kselftest_build and make kselftest_run ?
>> The former could be executed as user, and the latter as root.
> 
> Right. Personally I don't use the kselftest target at all, I just cd down to
> tools/testing/selftests and run make there.
> 
> If it was up to me the kselftest target would go away, because it's only caused
> us trouble so far.
> 
> But given it's there we should try to make it work as well as possible. So yeah
> splitting it into build and run would make sense, that way you could do:
> 
> $ make headers_install
> $ make kselftest_build
> $ sudo make kselftest_run
> 
> And that would hopefully do the right thing.
> 
> Would that improve the workflow for you?

Yes. Although I'm wondering why the kernel should be different from many
other projects out there. Why not simply:

- Add a kselftest_build dependency to the kernel build, so tests are always built,
  and warnings that arise from modifying anything related to installed headers
  will trigger for everyone,
- Add a dependency on headers_install into the obj tree to kselftest_build,
- Optionally add a "make check" alias to "make kselftest".

This way, running the tests becomes as simple as:

make
sudo make check

Documentation is key here: make sure to update Documentation/kselftest.txt to
document where the selftests look for their kernel headers (not the
system-wide headers, but those installed under usr/ in the obj tree). This is
the missing documentation bit that confused me the most.

> 
>> > So that seems to be working for me. Are you doing some different work flow, or
>> > am I just missing something?
>> 
>> When doing make headers_install, it indeed installs
>> membarrier.h where we expect it under the build output
>> dir:
>> 
>> $ ls usr/include/linux/membarrier.h
>> usr/include/linux/membarrier.h
>> 
>> However, if I issue
>> 
>> $ make -C tools/testing/selftests TARGETS=membarrier
>> make: Entering directory `/home/efficios/git/linux-next/tools/testing/selftests'
>> for TARGET in membarrier; do \
>> 		make -C $TARGET; \
>> 	done;
>> make[1]: Entering directory
>> `/home/efficios/git/linux-next/tools/testing/selftests/membarrier'
>> gcc     membarrier_test.c   -o membarrier_test
>> membarrier_test.c:2:30: fatal error: linux/membarrier.h: No such file or
>> directory
>>  #include <linux/membarrier.h>
>> 
>> This is after applying the modifications you requested
>> (see patch attached). Perhaps I did something wrong ?
> 
> Yeah sorry, you still need the -I line:
> 
> CFLAGS += -I../../../../usr/include/
> 
> 
> We /should/ add that to lib.mk so it's inherited by everyone, but we haven't
> yet.

Yep, this would be a good start.

> 
> So I think if you put that back the instructions I gave you will work?

Yes, it does, thanks!

Mathieu

> 
> cheers

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/3] sys_membarrier (x86, generic)
@ 2015-10-05 23:21   ` Rusty Russell
  0 siblings, 0 replies; 35+ messages in thread
From: Rusty Russell @ 2015-10-05 23:21 UTC (permalink / raw)
  To: Mathieu Desnoyers, Andrew Morton
  Cc: linux-kernel, linux-api, Mathieu Desnoyers

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> Hi Andrew,
>
> Here is a repost of sys_membarrier, rebased on top of Linus commit
> c4b5fd3fb2058b650447372472ad24e2a989f9f6 without any change since the
> last v19 post other than proceeding to further testing. When merging
> with other system calls, system call number conflicts should be quite
> straightforward to handle, there is nothing special there.

Hi Mathieu,

        Great to see this go in!  One small note: it talks about
threads, but membarrier as currently implemented would cover any shared
memory.  If you plan to optimize in future, that might not be the case:
we'd want an address argument for those cases?

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/3] sys_membarrier (x86, generic)
@ 2015-10-06  2:17     ` Mathieu Desnoyers
  0 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-10-06  2:17 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Andrew Morton, linux-kernel, linux-api

----- On Oct 5, 2015, at 7:21 PM, Rusty Russell rusty@ozlabs.org wrote:

> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
>> Hi Andrew,
>>
>> Here is a repost of sys_membarrier, rebased on top of Linus commit
>> c4b5fd3fb2058b650447372472ad24e2a989f9f6 without any change since the
>> last v19 post other than proceeding to further testing. When merging
>> with other system calls, system call number conflicts should be quite
>> straightforward to handle, there is nothing special there.
> 
> Hi Mathieu,
> 
>        Great to see this go in!  One small note: it talks about
> threads, but membarrier as currently implemented would cover any shared
> memory.  If you plan to optimize in future, that might not be the case:
> we'd want an address argument for those cases?

Hi Rusty,

Indeed, the current membarrier implementation only supports
the MEMBARRIER_CMD_SHARED flag, which works even with shared
memory across processes. If we ever want to optimize that for
single-process, multi-threaded cases, we would have to add
a new flag (e.g. MEMBARRIER_CMD_PRIVATE). This is quite
similar to what already exists in the futex system call.
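
For illustration, here is a minimal user-space sketch of how a caller
might probe for the shared command and then issue it. It assumes the
uapi header from this series is installed and that the C library
provides no wrapper; membarrier_call() and issue_shared_barrier() are
hypothetical names, not part of the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/membarrier.h>   /* MEMBARRIER_CMD_QUERY, MEMBARRIER_CMD_SHARED */
#include <sys/syscall.h>        /* __NR_membarrier */
#include <unistd.h>             /* syscall() */

/* Hypothetical wrapper: the C library may not provide one. */
static int membarrier_call(int cmd, int flags)
{
	assert(flags == 0);   /* reserved for future extensions */
	return (int)syscall(__NR_membarrier, cmd, flags);
}

/*
 * Returns 1 if a system-wide barrier was issued, 0 if the command is
 * not available on this kernel, -1 (errno set) on any other failure.
 */
static int issue_shared_barrier(void)
{
	int mask = membarrier_call(MEMBARRIER_CMD_QUERY, 0);

	if (mask < 0)
		return errno == ENOSYS ? 0 : -1;
	if (!(mask & MEMBARRIER_CMD_SHARED))
		return 0;
	return membarrier_call(MEMBARRIER_CMD_SHARED, 0) == 0 ? 1 : -1;
}
```

A caller that gets 0 back would have to keep using explicit memory
barriers on both sides.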

I'm not sure I fully understand where the address argument
you are describing would be useful. So far, I see two
main use-cases: we either interact with memory that is
local to a single process, or with memory shared across
processes.

We could indeed think about sending a membarrier to all
processes using a specific shared memory area (hence the
possible need for an address argument). This could eventually
be supported by adding a specific flag for this (e.g.
MEMBARRIER_CMD_SHM), which would indicate that an extra
parameter is provided (an address).

Thoughts ?

Thanks for the feedback!

Mathieu

> 
> Cheers,
> Rusty.

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/3] sys_membarrier (x86, generic)
@ 2015-10-08  6:22       ` Rusty Russell
  0 siblings, 0 replies; 35+ messages in thread
From: Rusty Russell @ 2015-10-08  6:22 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Andrew Morton, linux-kernel, linux-api

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> ----- On Oct 5, 2015, at 7:21 PM, Rusty Russell rusty@ozlabs.org wrote:
>
>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
>>> Hi Andrew,
>>>
>>> Here is a repost of sys_membarrier, rebased on top of Linus commit
>>> c4b5fd3fb2058b650447372472ad24e2a989f9f6 without any change since the
>>> last v19 post other than proceeding to further testing. When merging
>>> with other system calls, system call number conflicts should be quite
>>> straightforward to handle, there is nothing special there.
>> 
>> Hi Mathieu,
>> 
>>        Great to see this go in!  One small note: it talks about
>> threads, but membarrier as currently implemented would cover any shared
>> memory.  If you plan to optimize in future, that might not be the case:
>> we'd want an address argument for those cases?
>
> Hi Rusty,
>
> Indeed, the current membarrier implementation only supports
> the MEMBARRIER_CMD_SHARED flag, which works even with shared
> memory across processes. If we ever want to optimize that for
> single-process, multi-threaded cases, we would have to add
> a new flag (e.g. MEMBARRIER_CMD_PRIVATE). This is quite
> similar to what already exists in the futex system call.
>
> I'm not sure I fully understand where the address argument
> you are describing would be useful. So far, I see two
> main use-cases: we either interact with memory that is
> local to a single process, or with memory shared across
> processes.
>
> We could indeed think about sending a membarrier to all
> processes using a specific shared memory area (hence the
> possible need for an address argument). This could eventually
> be supported by adding a specific flag for this (e.g.
> MEMBARRIER_CMD_SHM), which would indicate that an extra
> parameter is provided (an address).

That's exactly what I was thinking; eg. it can be optimized in the case
where nothing else with the memory mapped is running.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86)
  2015-07-10 20:58 ` [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86) Mathieu Desnoyers
@ 2015-12-04 15:44   ` Michael Kerrisk (man-pages)
  2015-12-05  8:48       ` Mathieu Desnoyers
  0 siblings, 1 reply; 35+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-12-04 15:44 UTC (permalink / raw)
  To: Mathieu Desnoyers, Andrew Morton
  Cc: mtk.manpages, linux-kernel, linux-api, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Thomas Gleixner,
	Peter Zijlstra, David Howells, Pranith Kumar

Hi Mathieu,

In the patch below you have a man page type of text. Is that
just plain text, or do you have some groff source somewhere?

Thanks,

Michael


On 07/10/2015 10:58 PM, Mathieu Desnoyers wrote:
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads running on the system. It is
> implemented by calling synchronize_sched(). It can be used to distribute
> the cost of user-space memory barriers asymmetrically by transforming
> pairs of memory barriers into pairs consisting of sys_membarrier() and a
> compiler barrier. For synchronization primitives that distinguish
> between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
> read-side can be accelerated significantly by moving the bulk of the
> memory barrier overhead to the write-side.
> 
> The existing applications of which I am aware that would be improved by this
> system call are as follows:
> 
> * Through Userspace RCU library (http://urcu.so)
>   - DNS server (Knot DNS) https://www.knot-dns.cz/
>   - Network sniffer (http://netsniff-ng.org/)
>   - Distributed object storage (https://sheepdog.github.io/sheepdog/)
>   - User-space tracing (http://lttng.org)
>   - Network storage system (https://www.gluster.org/)
>   - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
>   - Financial software (https://lkml.org/lkml/2015/3/23/189)
> 
> Those projects use RCU in userspace to increase read-side speed and
> scalability compared to locking. Especially in the case of RCU used
> by libraries, sys_membarrier can speed up the read-side by moving the
> bulk of the memory barrier cost to synchronize_rcu().
> 
> * Direct users of sys_membarrier
>   - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
> 
> Microsoft core dotnet GC developers are planning to use the mprotect()
> side-effect of issuing memory barriers through IPIs as a way to implement
> Windows FlushProcessWriteBuffers() on Linux. They are referring to
> sys_membarrier in their github thread, specifically stating that
> sys_membarrier() is what they are looking for.
> 
> This implementation is based on kernel v4.1-rc8.
> 
> To explain the benefit of this scheme, let's introduce two example threads:
> 
> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> Thread B (frequent, e.g. executing liburcu
> rcu_read_lock()/rcu_read_unlock())
> 
> In a scheme where all smp_mb() in thread A are ordering memory accesses
> with respect to smp_mb() present in Thread B, we can change each
> smp_mb() within Thread A into calls to sys_membarrier() and each
> smp_mb() within Thread B into compiler barriers "barrier()".
> 
> Before the change, we had, for each smp_mb() pair:
> 
> Thread A                    Thread B
> previous mem accesses       previous mem accesses
> smp_mb()                    smp_mb()
> following mem accesses      following mem accesses
> 
> After the change, these pairs become:
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> sys_membarrier()            barrier()
> follow mem accesses         follow mem accesses
> 
> As we can see, there are two possible scenarios: either Thread B memory
> accesses do not happen concurrently with Thread A accesses (1), or they
> do (2).
> 
> 1) Non-concurrent Thread A vs Thread B accesses:
> 
> Thread A                    Thread B
> prev mem accesses
> sys_membarrier()
> follow mem accesses
>                             prev mem accesses
>                             barrier()
>                             follow mem accesses
> 
> In this case, thread B accesses will be weakly ordered. This is OK,
> because at that point, thread A is not particularly interested in
> ordering them with respect to its own accesses.
> 
> 2) Concurrent Thread A vs Thread B accesses
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> sys_membarrier()            barrier()
> follow mem accesses         follow mem accesses
> 
> In this case, thread B accesses, which are ensured to be in program
> order thanks to the compiler barrier, will be "upgraded" to full
> smp_mb() by synchronize_sched().
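
The pairing described above can be sketched in user space roughly as
follows. This is only an illustration under stated assumptions: C11
atomics stand in for the memory accesses, atomic_signal_fence() plays
the role of barrier(), and flag_a, flag_b, fast_side(), slow_side() and
demo() are hypothetical names. The intent is that, when the two sides
race, at least one of them observes the other's store.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/membarrier.h>   /* MEMBARRIER_CMD_SHARED */
#include <stdatomic.h>
#include <sys/syscall.h>        /* __NR_membarrier */
#include <unistd.h>             /* syscall() */

static atomic_int flag_a, flag_b;   /* hypothetical shared variables */

/* Thread B, frequent path: program order only, no hardware fence. */
static int fast_side(void)
{
	atomic_store_explicit(&flag_b, 1, memory_order_relaxed);
	atomic_signal_fence(memory_order_seq_cst);   /* barrier() */
	return atomic_load_explicit(&flag_a, memory_order_relaxed);
}

/* Thread A, infrequent path: pays for the ordering of both sides. */
static int slow_side(void)
{
	atomic_store_explicit(&flag_a, 1, memory_order_relaxed);
	if (syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0) != 0)
		return -1;   /* unavailable: both sides need smp_mb() instead */
	return atomic_load_explicit(&flag_b, memory_order_relaxed);
}

/* Single-threaded smoke test; real users run the two sides concurrently. */
static void demo(void)
{
	int seen_by_a = slow_side();   /* B has not stored yet: 0, or -1 */
	int seen_by_b = fast_side();   /* A's earlier store is visible: 1 */

	assert(seen_by_a == 0 || seen_by_a == -1);
	assert(seen_by_b == 1);
}
```

If slow_side() returns -1, the compiler barrier in fast_side() is not
sufficient on its own and both sides would have to fall back to full
memory barriers.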
> 
> * Benchmarks
> 
> On Intel Xeon E5405 (8 cores)
> (one thread is calling sys_membarrier, the other 7 threads are busy
> looping)
> 
> 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
> 
> * User-space user of this system call: Userspace RCU library
> 
> Both the signal-based and the sys_membarrier userspace RCU schemes
> permit us to remove the memory barrier from the userspace RCU
> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> accelerating them. These memory barriers are replaced by compiler
> barriers on the read-side, and all matching memory barriers on the
> write-side are turned into an invocation of a memory barrier on all
> active threads in the process. By letting the kernel perform this
> synchronization rather than dumbly sending a signal to every thread of the
> process (as we currently do), we diminish the number of unnecessary wake
> ups and only issue the memory barriers on active threads. Non-running
> threads do not need to execute such barriers anyway, because these are
> implied by the scheduler context switches.
> 
> Results in liburcu:
> 
> Operations in 10s, 6 readers, 2 writers:
> 
> memory barriers in reader:    1701557485 reads, 2202847 writes
> signal-based scheme:          9830061167 reads,    6700 writes
> sys_membarrier:               9952759104 reads,     425 writes
> sys_membarrier (dyn. check):  7970328887 reads,     425 writes
> 
> The dynamic sys_membarrier availability check adds some overhead to
> the read-side compared to the signal-based scheme, but besides that,
> sys_membarrier slightly outperforms the signal-based scheme. However,
> this non-expedited sys_membarrier implementation has a much slower grace
> period than signal and memory barrier schemes.
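
A dynamic availability check of the kind measured in the "dyn. check"
row might look like the sketch below. It is not liburcu's actual code:
has_membarrier, membarrier_probe(), read_side_barrier(),
write_side_barrier() and example() are hypothetical names, and the
probe is assumed to run once before any reader thread exists. The
branch in read_side_barrier() is where the extra read-side cost would
come from.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/membarrier.h>   /* MEMBARRIER_CMD_QUERY, MEMBARRIER_CMD_SHARED */
#include <stdatomic.h>
#include <sys/syscall.h>        /* __NR_membarrier */
#include <unistd.h>             /* syscall() */

/* 0 until probed, so readers default to the always-correct full fence. */
static int has_membarrier;

/* Assumption: called once at start-up, before any reader thread runs. */
static void membarrier_probe(void)
{
	long mask = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0);

	has_membarrier = mask >= 0 && (mask & MEMBARRIER_CMD_SHARED);
}

/* Read side: the test of has_membarrier is the "dyn. check" overhead. */
static inline void read_side_barrier(void)
{
	if (has_membarrier)
		atomic_signal_fence(memory_order_seq_cst);   /* barrier() */
	else
		atomic_thread_fence(memory_order_seq_cst);   /* full fence */
}

/* Write side: must use whichever scheme matches the readers. */
static int write_side_barrier(void)
{
	if (has_membarrier)
		return (int)syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0);
	atomic_thread_fence(memory_order_seq_cst);
	return 0;
}

/* Usage sketch: probe, then pair one read-side and one write-side barrier. */
static int example(void)
{
	membarrier_probe();
	assert(has_membarrier == 0 || has_membarrier == 1);
	read_side_barrier();
	return write_side_barrier();
}
```

Defaulting has_membarrier to 0 keeps readers correct, at the cost of a
full fence, if the probe never runs.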
> 
> Besides diminishing the number of wake-ups, one major advantage of the
> membarrier system call over the signal-based scheme is that it does not
> need to reserve a signal. This plays much more nicely with libraries,
> and with processes injected into for tracing purposes, for which we
> cannot expect that signals will be unused by the application.
> 
> An expedited version of this system call can be added later on to speed
> up the grace period. Its implementation will likely depend on reading
> the cpu_curr()->mm without holding each CPU's rq lock.
> 
> This patch adds the system call to x86 and to asm-generic.
> 
> [1] http://urcu.so
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Nicholas Miell <nmiell@comcast.net>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
> CC: Lai Jiangshan <laijs@cn.fujitsu.com>
> CC: Stephen Hemminger <stephen@networkplumber.org>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: David Howells <dhowells@redhat.com>
> CC: Pranith Kumar <bobby.prani@gmail.com>
> CC: Michael Kerrisk <mtk.manpages@gmail.com>
> CC: linux-api@vger.kernel.org
> 
> ---
> 
> membarrier(2) man page:
> --------------- snip -------------------
> MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
> 
> NAME
>        membarrier - issue memory barriers on a set of threads
> 
> SYNOPSIS
>        #include <linux/membarrier.h>
> 
>        int membarrier(int cmd, int flags);
> 
> DESCRIPTION
>        The cmd argument is one of the following:
> 
>        MEMBARRIER_CMD_QUERY
>               Query  the  set  of  supported commands. It returns a bitmask of
>               supported commands.
> 
>        MEMBARRIER_CMD_SHARED
>               Execute a memory barrier on all threads running on  the  system.
>               Upon  return from system call, the caller thread is ensured that
>               all running threads have passed through a state where all memory
>               accesses  to  user-space  addresses  match program order between
>               entry to and return from the system  call  (non-running  threads
>               are de facto in such a state). This covers threads from  all
>               processes running on the system.  This command returns 0.
> 
>        The flags argument is reserved for future extensions and must be 0.
> 
>        All memory accesses performed  in  program  order  from  each  targeted
>        thread are guaranteed to be ordered with respect to sys_membarrier(). If
>        we use the semantic "barrier()" to represent a compiler barrier forcing
>        memory  accesses  to  be performed in program order across the barrier,
>        and smp_mb() to represent explicit memory barriers forcing full  memory
>        ordering  across  the barrier, we have the following ordering table for
>        each pair of barrier(), sys_membarrier() and smp_mb():
> 
>        The pair ordering is detailed as (O: ordered, X: not ordered):
> 
>                               barrier()   smp_mb() sys_membarrier()
>               barrier()          X           X            O
>               smp_mb()           X           O            O
>               sys_membarrier()   O           O            O
> 
> RETURN VALUE
>        On success, these system calls return zero.  On error, -1 is  returned,
>        and errno is set appropriately. For a given command, with flags
>        argument set to 0, this system call is guaranteed to always return the
>        same value until reboot.
> 
> ERRORS
>        ENOSYS System call is not implemented.
> 
>        EINVAL Invalid arguments.
> 
> Linux                             2015-04-15                     MEMBARRIER(2)
> --------------- snip -------------------
> 
> Changes since v18:
> - Add unlikely() check to flags,
> - Describe current users in changelog.
> 
> Changes since v17:
> - Update commit message.
> 
> Changes since v16:
> - Update documentation.
> - Add man page to changelog.
> - Build sys_membarrier on !CONFIG_SMP. It allows userspace applications
>   to not care about the number of processors on the system.  Based on
>   recommendations from Stephen Hemminger and Steven Rostedt.
> - Check that flags argument is 0, update documentation to require it.
> 
> Changes since v15:
> - Add flags argument in addition to cmd.
> - Update documentation.
> 
> Changes since v14:
> - Take care of Thomas Gleixner's comments.
> 
> Changes since v13:
> - Move to kernel/membarrier.c.
> - Remove MEMBARRIER_PRIVATE flag.
> - Add MAINTAINERS file entry.
> 
> Changes since v12:
> - Remove _FLAG suffix from uapi flags.
> - Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
> - Remove EXPEDITED mode. Only implement non-expedited for now, until
>   reading the cpu_curr()->mm can be done without holding the CPU's rq
>   lock.
> 
> Changes since v11:
> - 5 years have passed.
> - Rebase on v3.19 kernel.
> - Add futex-alike PRIVATE vs SHARED semantic: private for per-process
>   barriers, non-private for memory mappings shared between processes.
> - Simplify user API.
> - Code refactoring.
> 
> Changes since v10:
> - Apply Randy's comments.
> - Rebase on 2.6.34-rc4 -tip.
> 
> Changes since v9:
> - Clean up #ifdef CONFIG_SMP.
> 
> Changes since v8:
> - Go back to rq spin locks taken by sys_membarrier() rather than adding
>   memory barriers to the scheduler. It implies a potential RoS
>   (reduction of service) if sys_membarrier() is executed in a busy-loop
>   by a user, but nothing more than what is already possible with other
>   existing system calls, but saves memory barriers in the scheduler fast
>   path.
> - re-add the memory barrier comments to x86 switch_mm() as an example to
>   other architectures.
> - Update documentation of the memory barriers in sys_membarrier and
>   switch_mm().
> - Append execution scenarios to the changelog showing the purpose of
>   each memory barrier.
> 
> Changes since v7:
> - Move spinlock-mb and scheduler related changes to separate patches.
> - Add support for sys_membarrier on x86_32.
> - Only x86 32/64 system calls are reserved in this patch. It is planned
>   to incrementally reserve syscall IDs on other architectures as these
>   are tested.
> 
> Changes since v6:
> - Remove some unlikely() not so unlikely.
> - Add the proper scheduler memory barriers needed to only use the RCU
>   read lock in sys_membarrier rather than take each runqueue spinlock:
> - Move memory barriers from per-architecture switch_mm() to schedule()
>   and finish_lock_switch(), where they clearly document that all data
>   protected by the rq lock is guaranteed to have memory barriers issued
>   between the scheduler update and the task execution. Replacing the
>   spin lock acquire/release barriers with these memory barriers implies
>   either no overhead (x86 spinlock atomic instruction already implies a
>   full mb) or some hopefully small overhead caused by the upgrade of the
>   spinlock acquire/release barriers to more heavyweight smp_mb().
> - The "generic" version of spinlock-mb.h declares both a mapping to
>   standard spinlocks and full memory barriers. Each architecture can
>   specialize this header following their own need and declare
>   CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
> - Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
>   implementations on a wide range of architecture would be welcome.
> 
> Changes since v5:
> - Plan ahead for extensibility by introducing mandatory/optional masks
>   to the "flags" system call parameter. Past experience with accept4(),
>   signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
>   inotify_init1() indicates that this is the kind of thing we want to
>   plan for. Return -EINVAL if the mandatory flags received are unknown.
> - Create include/linux/membarrier.h to define these flags.
> - Add MEMBARRIER_QUERY optional flag.
> 
> Changes since v4:
> - Add "int expedited" parameter, use synchronize_sched() in the
>   non-expedited case. Thanks to Lai Jiangshan for making us consider
>   seriously using synchronize_sched() to provide the low-overhead
>   membarrier scheme.
> - Check num_online_cpus() == 1, quickly return without doing anything.
> 
> Changes since v3a:
> - Confirm that each CPU indeed runs the current task's ->mm before
>   sending an IPI. Ensures that we do not disturb RT tasks in the
>   presence of lazy TLB shootdown.
> - Document memory barriers needed in switch_mm().
> - Surround helper functions with #ifdef CONFIG_SMP.
> 
> Changes since v2:
> - simply send-to-many to the mm_cpumask. It contains the list of
>   processors we have to IPI to (which use the mm), and this mask is
>   updated atomically.
> 
> Changes since v1:
> - Only perform the IPI in CONFIG_SMP.
> - Only perform the IPI if the process has more than one thread.
> - Only send IPIs to CPUs involved with threads belonging to our process.
> - Adaptive IPI scheme (single vs many IPI with threshold).
> - Issue smp_mb() at the beginning and end of the system call.
> ---
>  MAINTAINERS                            |  8 +++++
>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  include/linux/syscalls.h               |  2 ++
>  include/uapi/asm-generic/unistd.h      |  4 ++-
>  include/uapi/linux/Kbuild              |  1 +
>  include/uapi/linux/membarrier.h        | 53 +++++++++++++++++++++++++++
>  init/Kconfig                           | 12 +++++++
>  kernel/Makefile                        |  1 +
>  kernel/membarrier.c                    | 66 ++++++++++++++++++++++++++++++++++
>  kernel/sys_ni.c                        |  3 ++
>  11 files changed, 151 insertions(+), 1 deletion(-)
>  create mode 100644 include/uapi/linux/membarrier.h
>  create mode 100644 kernel/membarrier.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 0d70760..b560da6 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6642,6 +6642,14 @@ W:	http://www.mellanox.com
>  Q:	http://patchwork.ozlabs.org/project/netdev/list/
>  F:	drivers/net/ethernet/mellanox/mlx4/en_*
>  
> +MEMBARRIER SUPPORT
> +M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> +M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> +L:	linux-kernel@vger.kernel.org
> +S:	Supported
> +F:	kernel/membarrier.c
> +F:	include/uapi/linux/membarrier.h
> +
>  MEMORY MANAGEMENT
>  L:	linux-mm@kvack.org
>  W:	http://www.linux-mm.org
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index ef8187f..e63ad61 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -365,3 +365,4 @@
>  356	i386	memfd_create		sys_memfd_create
>  357	i386	bpf			sys_bpf
>  358	i386	execveat		sys_execveat			stub32_execveat
> +359	i386	membarrier		sys_membarrier
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 9ef32d5..87f3cd6 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -329,6 +329,7 @@
>  320	common	kexec_file_load		sys_kexec_file_load
>  321	common	bpf			sys_bpf
>  322	64	execveat		stub_execveat
> +323	common	membarrier		sys_membarrier
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index b45c45b..d4ab99b 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -884,4 +884,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
>  			const char __user *const __user *argv,
>  			const char __user *const __user *envp, int flags);
>  
> +asmlinkage long sys_membarrier(int cmd, int flags);
> +
>  #endif
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index e016bd9..8da542a 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
>  __SYSCALL(__NR_bpf, sys_bpf)
>  #define __NR_execveat 281
>  __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
> +#define __NR_membarrier 282
> +__SYSCALL(__NR_membarrier, sys_membarrier)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 282
> +#define __NR_syscalls 283
>  
>  /*
>   * All syscalls below here should go away really,
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index 1ff9942..e6f229a 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -251,6 +251,7 @@ header-y += mdio.h
>  header-y += media.h
>  header-y += media-bus-format.h
>  header-y += mei.h
> +header-y += membarrier.h
>  header-y += memfd.h
>  header-y += mempolicy.h
>  header-y += meye.h
> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> new file mode 100644
> index 0000000..e0b108b
> --- /dev/null
> +++ b/include/uapi/linux/membarrier.h
> @@ -0,0 +1,53 @@
> +#ifndef _UAPI_LINUX_MEMBARRIER_H
> +#define _UAPI_LINUX_MEMBARRIER_H
> +
> +/*
> + * linux/membarrier.h
> + *
> + * membarrier system call API
> + *
> + * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +/**
> + * enum membarrier_cmd - membarrier system call command
> + * @MEMBARRIER_CMD_QUERY:   Query the set of supported commands. It returns
> + *                          a bitmask of valid commands.
> + * @MEMBARRIER_CMD_SHARED:  Execute a memory barrier on all running threads.
> + *                          Upon return from system call, the caller thread
> + *                          is ensured that all running threads have passed
> + *                          through a state where all memory accesses to
> + *                          user-space addresses match program order between
> + *                          entry to and return from the system call
> + *                          (non-running threads are de facto in such a
> + *                          state). This covers threads from all processes
> + *                          running on the system. This command returns 0.
> + *
> + * Command to be passed to the membarrier system call. The commands need to
> + * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
> + * the value 0.
> + */
> +enum membarrier_cmd {
> +	MEMBARRIER_CMD_QUERY = 0,
> +	MEMBARRIER_CMD_SHARED = (1 << 0),
> +};
> +
> +#endif /* _UAPI_LINUX_MEMBARRIER_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index af09b4f..4bba60f 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1577,6 +1577,18 @@ config PCI_QUIRKS
>  	  bugs/quirks. Disable this only if your target machine is
>  	  unaffected by PCI quirks.
>  
> +config MEMBARRIER
> +	bool "Enable membarrier() system call" if EXPERT
> +	default y
> +	help
> +	  Enable the membarrier() system call that allows issuing memory
> +	  barriers across all running threads, which can be used to distribute
> +	  the cost of user-space memory barriers asymmetrically by transforming
> +	  pairs of memory barriers into pairs consisting of membarrier() and a
> +	  compiler barrier.
> +
> +	  If unsure, say Y.
> +
>  config EMBEDDED
>  	bool "Embedded system"
>  	option allnoconfig_y
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 43c4c92..92a481b 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>  obj-$(CONFIG_TORTURE_TEST) += torture.o
> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
>  
>  $(obj)/configs.o: $(obj)/config_data.h
>  
> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
> new file mode 100644
> index 0000000..536c727
> --- /dev/null
> +++ b/kernel/membarrier.c
> @@ -0,0 +1,66 @@
> +/*
> + * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> + *
> + * membarrier system call
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/membarrier.h>
> +
> +/*
> + * Bitmask made from a "or" of all commands within enum membarrier_cmd,
> + * except MEMBARRIER_CMD_QUERY.
> + */
> +#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
> +
> +/**
> + * sys_membarrier - issue memory barriers on a set of threads
> + * @cmd:   Takes command values defined in enum membarrier_cmd.
> + * @flags: Currently needs to be 0. For future extensions.
> + *
> + * If this system call is not implemented, -ENOSYS is returned. If the
> + * command specified does not exist, or if the command argument is invalid,
> + * this system call returns -EINVAL. For a given command, with flags argument
> + * set to 0, this system call is guaranteed to always return the same value
> + * until reboot.
> + *
> + * All memory accesses performed in program order from each targeted thread
> + * is guaranteed to be ordered with respect to sys_membarrier(). If we use
> + * the semantic "barrier()" to represent a compiler barrier forcing memory
> + * accesses to be performed in program order across the barrier, and
> + * smp_mb() to represent explicit memory barriers forcing full memory
> + * ordering across the barrier, we have the following ordering table for
> + * each pair of barrier(), sys_membarrier() and smp_mb():
> + *
> + * The pair ordering is detailed as (O: ordered, X: not ordered):
> + *
> + *                        barrier()   smp_mb() sys_membarrier()
> + *        barrier()          X           X            O
> + *        smp_mb()           X           O            O
> + *        sys_membarrier()   O           O            O
> + */
> +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
> +{
> +	if (unlikely(flags))
> +		return -EINVAL;
> +	switch (cmd) {
> +	case MEMBARRIER_CMD_QUERY:
> +		return MEMBARRIER_CMD_BITMASK;
> +	case MEMBARRIER_CMD_SHARED:
> +		if (num_online_cpus() > 1)
> +			synchronize_sched();
> +		return 0;
> +	default:
> +		return -EINVAL;
> +	}
> +}
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 7995ef5..eb4fde0 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -243,3 +243,6 @@ cond_syscall(sys_bpf);
>  
>  /* execveat */
>  cond_syscall(sys_execveat);
> +
> +/* membarrier */
> +cond_syscall(sys_membarrier);
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86)
@ 2015-12-05  8:48       ` Mathieu Desnoyers
  0 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-12-05  8:48 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Andrew Morton, linux-kernel, linux-api, KOSAKI Motohiro, rostedt,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, One Thousand Gnomes,
	Lai Jiangshan, Stephen Hemminger, Thomas Gleixner,
	Peter Zijlstra, David Howells, Pranith Kumar

[-- Attachment #1: Type: text/plain, Size: 28123 bytes --]

Hi Michael,

Please find the groff source of the membarrier man page attached. I
merged back into this groff source some changes that had initially gone
only into the plain-text version in the changelog.

Please let me know if you find any issue with it.

Mathieu

----- On Dec 4, 2015, at 4:44 PM, Michael Kerrisk mtk.manpages@gmail.com wrote:

> Hi Mathieu,
> 
> In the patch below you have a man page type of text. Is that
> just plain text, or do you have some groff source somewhere?
> 
> Thanks,
> 
> Michael
> 
> 
> On 07/10/2015 10:58 PM, Mathieu Desnoyers wrote:
>> Here is an implementation of a new system call, sys_membarrier(), which
>> executes a memory barrier on all threads running on the system. It is
>> implemented by calling synchronize_sched(). It can be used to distribute
>> the cost of user-space memory barriers asymmetrically by transforming
>> pairs of memory barriers into pairs consisting of sys_membarrier() and a
>> compiler barrier. For synchronization primitives that distinguish
>> between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
>> read-side can be accelerated significantly by moving the bulk of the
>> memory barrier overhead to the write-side.
>> 
>> The existing applications of which I am aware that would be improved by this
>> system call are as follows:
>> 
>> * Through Userspace RCU library (http://urcu.so)
>>   - DNS server (Knot DNS) https://www.knot-dns.cz/
>>   - Network sniffer (http://netsniff-ng.org/)
>>   - Distributed object storage (https://sheepdog.github.io/sheepdog/)
>>   - User-space tracing (http://lttng.org)
>>   - Network storage system (https://www.gluster.org/)
>>   - Virtual routers
>>   (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
>>   - Financial software (https://lkml.org/lkml/2015/3/23/189)
>> 
>> Those projects use RCU in userspace to increase read-side speed and
>> scalability compared to locking. Especially in the case of RCU used
>> by libraries, sys_membarrier can speed up the read-side by moving the
>> bulk of the memory barrier cost to synchronize_rcu().
>> 
>> * Direct users of sys_membarrier
>>   - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
>> 
>> Microsoft core dotnet GC developers are planning to use the mprotect()
>> side-effect of issuing memory barriers through IPIs as a way to implement
>> Windows FlushProcessWriteBuffers() on Linux. They are referring to
>> sys_membarrier in their github thread, specifically stating that
>> sys_membarrier() is what they are looking for.
>> 
>> This implementation is based on kernel v4.1-rc8.
>> 
>> To explain the benefit of this scheme, let's introduce two example threads:
>> 
>> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
>> Thread B (frequent, e.g. executing liburcu
>> rcu_read_lock()/rcu_read_unlock())
>> 
>> In a scheme where all smp_mb() in thread A are ordering memory accesses
>> with respect to smp_mb() present in Thread B, we can change each
>> smp_mb() within Thread A into calls to sys_membarrier() and each
>> smp_mb() within Thread B into compiler barriers "barrier()".
>> 
>> Before the change, we had, for each smp_mb() pair:
>> 
>> Thread A                    Thread B
>> previous mem accesses       previous mem accesses
>> smp_mb()                    smp_mb()
>> following mem accesses      following mem accesses
>> 
>> After the change, these pairs become:
>> 
>> Thread A                    Thread B
>> prev mem accesses           prev mem accesses
>> sys_membarrier()            barrier()
>> follow mem accesses         follow mem accesses
>> 
>> As we can see, there are two possible scenarios: either Thread B memory
>> accesses do not happen concurrently with Thread A accesses (1), or they
>> do (2).
>> 
>> 1) Non-concurrent Thread A vs Thread B accesses:
>> 
>> Thread A                    Thread B
>> prev mem accesses
>> sys_membarrier()
>> follow mem accesses
>>                             prev mem accesses
>>                             barrier()
>>                             follow mem accesses
>> 
>> In this case, thread B accesses will be weakly ordered. This is OK,
>> because at that point, thread A is not particularly interested in
>> ordering them with respect to its own accesses.
>> 
>> 2) Concurrent Thread A vs Thread B accesses
>> 
>> Thread A                    Thread B
>> prev mem accesses           prev mem accesses
>> sys_membarrier()            barrier()
>> follow mem accesses         follow mem accesses
>> 
>> In this case, thread B accesses, which are ensured to be in program
>> order thanks to the compiler barrier, will be "upgraded" to full
>> smp_mb() by synchronize_sched().
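
To make the two scenarios above concrete, here is a minimal user-space
sketch of the asymmetric pairing, including a run-time availability
check. It is not taken from liburcu. The names membarrier_call(),
membarrier_init(), fast_side_barrier() and slow_side_barrier() are
hypothetical, the command values are copied from enum membarrier_cmd in
this patch, and GCC or Clang on Linux is assumed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Values copied from enum membarrier_cmd in the patch. */
enum { CMD_QUERY = 0, CMD_SHARED = 1 << 0 };

/* Compiler-only barrier and full hardware barrier (GCC/Clang). */
#define barrier()	__asm__ __volatile__("" ::: "memory")
#define smp_mb()	__sync_synchronize()

/* Set once by membarrier_init(), before other threads are started. */
static int has_membarrier;

/* Hypothetical raw wrapper; libc is not assumed to provide one. */
static long membarrier_call(int cmd, int flags)
{
	assert(flags == 0);	/* the patch requires flags to be 0 */
#ifdef __NR_membarrier
	return syscall(__NR_membarrier, cmd, flags);
#else
	(void)cmd;
	errno = ENOSYS;
	return -1;
#endif
}

static void membarrier_init(void)
{
	long mask = membarrier_call(CMD_QUERY, 0);

	has_membarrier = (mask > 0 && (mask & CMD_SHARED)) ? 1 : 0;
}

/* Thread B (frequent): compiler barrier only when the kernel helps. */
static inline void fast_side_barrier(void)
{
	if (has_membarrier)
		barrier();
	else
		smp_mb();
}

/* Thread A (infrequent): pays the full cost. Returns 0 on success. */
static int slow_side_barrier(void)
{
	if (!has_membarrier) {
		smp_mb();
		return 0;
	}
	return membarrier_call(CMD_SHARED, 0) == 0 ? 0 : -1;
}
```

Both sides must agree on has_membarrier: if the system call is missing,
the fast side has to keep its real smp_mb(), since a compiler barrier
alone is only sufficient when paired with sys_membarrier() on the slow
side.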
>> 
>> * Benchmarks
>> 
>> On Intel Xeon E5405 (8 cores)
>> (one thread is calling sys_membarrier, the other 7 threads are busy
>> looping)
>> 
>> 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
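
For anyone wanting to repeat this kind of measurement, a rough timing
sketch follows. It is not the harness used for the number above, which
is not included in this posting. The helper names ms_per_call() and
time_membarrier_shared() are hypothetical, and the command value is
copied from enum membarrier_cmd in this patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Value copied from enum membarrier_cmd in the patch. */
enum { CMD_SHARED = 1 << 0 };

/* Convert a total duration into milliseconds per call. */
static double ms_per_call(double total_seconds, long calls)
{
	assert(calls > 0);
	return total_seconds * 1000.0 / (double)calls;
}

/* Hypothetical raw wrapper; libc is not assumed to provide one. */
static long membarrier_call(int cmd, int flags)
{
#ifdef __NR_membarrier
	return syscall(__NR_membarrier, cmd, flags);
#else
	(void)cmd;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Time `calls` invocations of MEMBARRIER_CMD_SHARED.
 * Returns milliseconds per call, or -1.0 if any call fails
 * (for instance because the system call is unavailable).
 */
static double time_membarrier_shared(long calls)
{
	struct timespec t0, t1;
	double secs;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < calls; i++) {
		if (membarrier_call(CMD_SHARED, 0) != 0)
			return -1.0;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	return ms_per_call(secs, calls);
}
```

With the figures quoted above, ms_per_call(33.0, 1000) gives 33.0. To
reproduce the setup described, the remaining CPUs would have to be kept
busy by spinning threads, which this sketch leaves out.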
>> 
>> * User-space user of this system call: Userspace RCU library
>> 
>> Both the signal-based and the sys_membarrier userspace RCU schemes
>> permit us to remove the memory barrier from the userspace RCU
>> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
>> accelerating them. These memory barriers are replaced by compiler
>> barriers on the read-side, and all matching memory barriers on the
>> write-side are turned into an invocation of a memory barrier on all
>> active threads in the process. By letting the kernel perform this
>> synchronization rather than dumbly sending a signal to every thread of
>> the process (as we currently do), we diminish the number of unnecessary
>> wake ups and only issue the memory barriers on active threads.
>> Non-running threads do not need to execute such a barrier anyway,
>> because it is implied by the scheduler context switches.
>> 
>> Results in liburcu:
>> 
>> Operations in 10s, 6 readers, 2 writers:
>> 
>> memory barriers in reader:    1701557485 reads, 2202847 writes
>> signal-based scheme:          9830061167 reads,    6700 writes
>> sys_membarrier:               9952759104 reads,     425 writes
>> sys_membarrier (dyn. check):  7970328887 reads,     425 writes
>> 
>> The dynamic sys_membarrier availability check adds some overhead to
>> the read-side compared to the signal-based scheme, but besides that,
>> sys_membarrier slightly outperforms the signal-based scheme. However,
>> this non-expedited sys_membarrier implementation has a much slower grace
>> period than signal and memory barrier schemes.
>> 
>> Besides diminishing the number of wake-ups, one major advantage of the
>> membarrier system call over the signal-based scheme is that it does not
>> need to reserve a signal. This plays much more nicely with libraries,
>> and with processes that a tracer is injected into, where we cannot
>> expect any given signal to be left unused by the application.
>> 
>> An expedited version of this system call can be added later on to speed
>> up the grace period. Its implementation will likely depend on reading
>> the cpu_curr()->mm without holding each CPU's rq lock.
>> 
>> This patch adds the system call to x86 and to asm-generic.
>> 
>> [1] http://urcu.so
>> 
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
>> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> CC: Steven Rostedt <rostedt@goodmis.org>
>> CC: Nicholas Miell <nmiell@comcast.net>
>> CC: Linus Torvalds <torvalds@linux-foundation.org>
>> CC: Ingo Molnar <mingo@redhat.com>
>> CC: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
>> CC: Lai Jiangshan <laijs@cn.fujitsu.com>
>> CC: Stephen Hemminger <stephen@networkplumber.org>
>> CC: Andrew Morton <akpm@linux-foundation.org>
>> CC: Thomas Gleixner <tglx@linutronix.de>
>> CC: Peter Zijlstra <peterz@infradead.org>
>> CC: David Howells <dhowells@redhat.com>
>> CC: Pranith Kumar <bobby.prani@gmail.com>
>> CC: Michael Kerrisk <mtk.manpages@gmail.com>
>> CC: linux-api@vger.kernel.org
>> 
>> ---
>> 
>> membarrier(2) man page:
>> --------------- snip -------------------
>> MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
>> 
>> NAME
>>        membarrier - issue memory barriers on a set of threads
>> 
>> SYNOPSIS
>>        #include <linux/membarrier.h>
>> 
>>        int membarrier(int cmd, int flags);
>> 
>> DESCRIPTION
>>        The cmd argument is one of the following:
>> 
>>        MEMBARRIER_CMD_QUERY
>>               Query  the  set  of  supported commands. It returns a bitmask of
>>               supported commands.
>> 
>>        MEMBARRIER_CMD_SHARED
>>               Execute a memory barrier on all threads running on  the  system.
>>               Upon  return from system call, the caller thread is ensured that
>>               all running threads have passed through a state where all memory
>>               accesses  to  user-space  addresses  match program order between
>>               entry to and return from the system  call  (non-running  threads
>>               are de facto in such a state). This covers threads from all pro‐
>>               cesses running on the system.  This command returns 0.
>> 
>>        The flags argument needs to be 0. For future extensions.
>> 
>>        All memory accesses performed  in  program  order  from  each  targeted
>>        thread are guaranteed to be ordered with respect to sys_membarrier(). If
>>        we use the semantic "barrier()" to represent a compiler barrier forcing
>>        memory  accesses  to  be performed in program order across the barrier,
>>        and smp_mb() to represent explicit memory barriers forcing full  memory
>>        ordering  across  the barrier, we have the following ordering table for
>>        each pair of barrier(), sys_membarrier() and smp_mb():
>> 
>>        The pair ordering is detailed as (O: ordered, X: not ordered):
>> 
>>                               barrier()   smp_mb() sys_membarrier()
>>               barrier()          X           X            O
>>               smp_mb()           X           O            O
>>               sys_membarrier()   O           O            O
>> 
>> RETURN VALUE
>>        On success, MEMBARRIER_CMD_QUERY returns a bitmask of supported
>>        commands and MEMBARRIER_CMD_SHARED returns zero.  On error, -1 is
>>        returned, and errno is set appropriately. For a given command, with
>>        flags argument set to 0, this system call is guaranteed to always
>>        return the same value until reboot.
>> 
>> ERRORS
>>        ENOSYS System call is not implemented.
>> 
>>        EINVAL Invalid arguments.
>> 
>> Linux                             2015-04-15                     MEMBARRIER(2)
>> --------------- snip -------------------
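
As an illustration of the interface described in the man page text
above, here is a small sketch of a caller that queries the supported
commands before issuing MEMBARRIER_CMD_SHARED. The helper names
membarrier_call(), membarrier_supported() and membarrier_shared() are
hypothetical, and the command values are copied from enum
membarrier_cmd in this patch. Caching the query result relies on the
stated guarantee that a given command returns the same value until
reboot.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Values copied from enum membarrier_cmd in the patch. */
enum { CMD_QUERY = 0, CMD_SHARED = 1 << 0 };

/* Hypothetical raw wrapper; libc is not assumed to provide one. */
static long membarrier_call(int cmd, int flags)
{
#ifdef __NR_membarrier
	return syscall(__NR_membarrier, cmd, flags);
#else
	(void)cmd;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Returns 1 if `cmd` is present in the QUERY bitmask, 0 otherwise.
 * The mask is cached after the first call (not thread-safe as
 * written; call it once before starting other threads).
 */
static int membarrier_supported(int cmd)
{
	static long mask = -2;	/* -2: not queried yet */

	/* Every command other than QUERY is a single bit. */
	assert(cmd > 0 && (cmd & (cmd - 1)) == 0);
	if (mask == -2)
		mask = membarrier_call(CMD_QUERY, 0);
	return mask > 0 && (mask & cmd) != 0;
}

/* Returns 0 on success, or a negative errno value on failure. */
static int membarrier_shared(void)
{
	if (!membarrier_supported(CMD_SHARED))
		return -ENOSYS;
	if (membarrier_call(CMD_SHARED, 0) != 0)
		return -errno;
	return 0;
}
```

A command value that is not defined makes the raw call fail with -1,
with errno set to EINVAL on a kernel that implements the system call,
or ENOSYS on one that does not.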
>> 
>> Changes since v18:
>> - Add unlikely() check to flags,
>> - Describe current users in changelog.
>> 
>> Changes since v17:
>> - Update commit message.
>> 
>> Changes since v16:
>> - Update documentation.
>> - Add man page to changelog.
>> - Build sys_membarrier on !CONFIG_SMP. It allows userspace applications
>>   to not care about the number of processors on the system.  Based on
>>   recommendations from Stephen Hemminger and Steven Rostedt.
>> - Check that flags argument is 0, update documentation to require it.
>> 
>> Changes since v15:
>> - Add flags argument in addition to cmd.
>> - Update documentation.
>> 
>> Changes since v14:
>> - Take care of Thomas Gleixner's comments.
>> 
>> Changes since v13:
>> - Move to kernel/membarrier.c.
>> - Remove MEMBARRIER_PRIVATE flag.
>> - Add MAINTAINERS file entry.
>> 
>> Changes since v12:
>> - Remove _FLAG suffix from uapi flags.
>> - Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
>> - Remove EXPEDITED mode. Only implement non-expedited for now, until
>>   reading the cpu_curr()->mm can be done without holding the CPU's rq
>>   lock.
>> 
>> Changes since v11:
>> - 5 years have passed.
>> - Rebase on v3.19 kernel.
>> - Add futex-alike PRIVATE vs SHARED semantic: private for per-process
>>   barriers, non-private for memory mappings shared between processes.
>> - Simplify user API.
>> - Code refactoring.
>> 
>> Changes since v10:
>> - Apply Randy's comments.
>> - Rebase on 2.6.34-rc4 -tip.
>> 
>> Changes since v9:
>> - Clean up #ifdef CONFIG_SMP.
>> 
>> Changes since v8:
>> - Go back to rq spin locks taken by sys_membarrier() rather than adding
>>   memory barriers to the scheduler. This implies a potential RoS
>>   (reduction of service) if sys_membarrier() is executed in a busy-loop
>>   by a user, though nothing more than what is already possible with
>>   other existing system calls, and it saves memory barriers in the
>>   scheduler fast path.
>> - re-add the memory barrier comments to x86 switch_mm() as an example to
>>   other architectures.
>> - Update documentation of the memory barriers in sys_membarrier and
>>   switch_mm().
>> - Append execution scenarios to the changelog showing the purpose of
>>   each memory barrier.
>> 
>> Changes since v7:
>> - Move spinlock-mb and scheduler related changes to separate patches.
>> - Add support for sys_membarrier on x86_32.
>> - Only x86 32/64 system calls are reserved in this patch. It is planned
>>   to incrementally reserve syscall IDs on other architectures as these
>>   are tested.
>> 
>> Changes since v6:
>> - Remove some unlikely() annotations that were not so unlikely.
>> - Add the proper scheduler memory barriers needed to only use the RCU
>>   read lock in sys_membarrier rather than take each runqueue spinlock:
>> - Move memory barriers from per-architecture switch_mm() to schedule()
>>   and finish_lock_switch(), where they clearly document that all data
>>   protected by the rq lock is guaranteed to have memory barriers issued
>>   between the scheduler update and the task execution. Replacing the
>>   spin lock acquire/release barriers with these memory barriers implies
>>   either no overhead (x86 spinlock atomic instruction already implies a
>>   full mb) or some hopefully small overhead caused by the upgrade of the
>>   spinlock acquire/release barriers to more heavyweight smp_mb().
>> - The "generic" version of spinlock-mb.h declares both a mapping to
>>   standard spinlocks and full memory barriers. Each architecture can
>>   specialize this header following their own need and declare
>>   CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
>> - Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
>>   implementations on a wide range of architectures would be welcome.
>> 
>> Changes since v5:
>> - Plan ahead for extensibility by introducing mandatory/optional masks
>>   to the "flags" system call parameter. Past experience with accept4(),
>>   signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
>>   inotify_init1() indicates that this is the kind of thing we want to
>>   plan for. Return -EINVAL if the mandatory flags received are unknown.
>> - Create include/linux/membarrier.h to define these flags.
>> - Add MEMBARRIER_QUERY optional flag.
>> 
>> Changes since v4:
>> - Add "int expedited" parameter, use synchronize_sched() in the
>>   non-expedited case. Thanks to Lai Jiangshan for making us consider
>>   seriously using synchronize_sched() to provide the low-overhead
>>   membarrier scheme.
>> - Check num_online_cpus() == 1, quickly return without doing anything.
>> 
>> Changes since v3a:
>> - Confirm that each CPU indeed runs the current task's ->mm before
>>   sending an IPI. Ensures that we do not disturb RT tasks in the
>>   presence of lazy TLB shootdown.
>> - Document memory barriers needed in switch_mm().
>> - Surround helper functions with #ifdef CONFIG_SMP.
>> 
>> Changes since v2:
>> - simply send-to-many to the mm_cpumask. It contains the list of
>>   processors we have to IPI to (which use the mm), and this mask is
>>   updated atomically.
>> 
>> Changes since v1:
>> - Only perform the IPI in CONFIG_SMP.
>> - Only perform the IPI if the process has more than one thread.
>> - Only send IPIs to CPUs involved with threads belonging to our process.
>> - Adaptive IPI scheme (single vs many IPI with threshold).
>> - Issue smp_mb() at the beginning and end of the system call.
>> ---
>>  MAINTAINERS                            |  8 +++++
>>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>>  include/linux/syscalls.h               |  2 ++
>>  include/uapi/asm-generic/unistd.h      |  4 ++-
>>  include/uapi/linux/Kbuild              |  1 +
>>  include/uapi/linux/membarrier.h        | 53 +++++++++++++++++++++++++++
>>  init/Kconfig                           | 12 +++++++
>>  kernel/Makefile                        |  1 +
>>  kernel/membarrier.c                    | 66 ++++++++++++++++++++++++++++++++++
>>  kernel/sys_ni.c                        |  3 ++
>>  11 files changed, 151 insertions(+), 1 deletion(-)
>>  create mode 100644 include/uapi/linux/membarrier.h
>>  create mode 100644 kernel/membarrier.c
>> 
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 0d70760..b560da6 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -6642,6 +6642,14 @@ W:	http://www.mellanox.com
>>  Q:	http://patchwork.ozlabs.org/project/netdev/list/
>>  F:	drivers/net/ethernet/mellanox/mlx4/en_*
>>  
>> +MEMBARRIER SUPPORT
>> +M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> +M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
>> +L:	linux-kernel@vger.kernel.org
>> +S:	Supported
>> +F:	kernel/membarrier.c
>> +F:	include/uapi/linux/membarrier.h
>> +
>>  MEMORY MANAGEMENT
>>  L:	linux-mm@kvack.org
>>  W:	http://www.linux-mm.org
>> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
>> index ef8187f..e63ad61 100644
>> --- a/arch/x86/entry/syscalls/syscall_32.tbl
>> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
>> @@ -365,3 +365,4 @@
>>  356	i386	memfd_create		sys_memfd_create
>>  357	i386	bpf			sys_bpf
>>  358	i386	execveat		sys_execveat			stub32_execveat
>> +359	i386	membarrier		sys_membarrier
>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
>> index 9ef32d5..87f3cd6 100644
>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>> @@ -329,6 +329,7 @@
>>  320	common	kexec_file_load		sys_kexec_file_load
>>  321	common	bpf			sys_bpf
>>  322	64	execveat		stub_execveat
>> +323	common	membarrier		sys_membarrier
>>  
>>  #
>>  # x32-specific system call numbers start at 512 to avoid cache impact
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index b45c45b..d4ab99b 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -884,4 +884,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
>>  			const char __user *const __user *argv,
>>  			const char __user *const __user *envp, int flags);
>>  
>> +asmlinkage long sys_membarrier(int cmd, int flags);
>> +
>>  #endif
>> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
>> index e016bd9..8da542a 100644
>> --- a/include/uapi/asm-generic/unistd.h
>> +++ b/include/uapi/asm-generic/unistd.h
>> @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
>>  __SYSCALL(__NR_bpf, sys_bpf)
>>  #define __NR_execveat 281
>>  __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>> +#define __NR_membarrier 282
>> +__SYSCALL(__NR_membarrier, sys_membarrier)
>>  
>>  #undef __NR_syscalls
>> -#define __NR_syscalls 282
>> +#define __NR_syscalls 283
>>  
>>  /*
>>   * All syscalls below here should go away really,
>> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
>> index 1ff9942..e6f229a 100644
>> --- a/include/uapi/linux/Kbuild
>> +++ b/include/uapi/linux/Kbuild
>> @@ -251,6 +251,7 @@ header-y += mdio.h
>>  header-y += media.h
>>  header-y += media-bus-format.h
>>  header-y += mei.h
>> +header-y += membarrier.h
>>  header-y += memfd.h
>>  header-y += mempolicy.h
>>  header-y += meye.h
>> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
>> new file mode 100644
>> index 0000000..e0b108b
>> --- /dev/null
>> +++ b/include/uapi/linux/membarrier.h
>> @@ -0,0 +1,53 @@
>> +#ifndef _UAPI_LINUX_MEMBARRIER_H
>> +#define _UAPI_LINUX_MEMBARRIER_H
>> +
>> +/*
>> + * linux/membarrier.h
>> + *
>> + * membarrier system call API
>> + *
>> + * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>> + * of this software and associated documentation files (the "Software"), to deal
>> + * in the Software without restriction, including without limitation the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>> + * copies of the Software, and to permit persons to whom the Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
>> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
>> + * SOFTWARE.
>> + */
>> +
>> +/**
>> + * enum membarrier_cmd - membarrier system call command
>> + * @MEMBARRIER_CMD_QUERY:   Query the set of supported commands. It returns
>> + *                          a bitmask of valid commands.
>> + * @MEMBARRIER_CMD_SHARED:  Execute a memory barrier on all running threads.
>> + *                          Upon return from system call, the caller thread
>> + *                          is ensured that all running threads have passed
>> + *                          through a state where all memory accesses to
>> + *                          user-space addresses match program order between
>> + *                          entry to and return from the system call
>> + *                          (non-running threads are de facto in such a
>> + *                          state). This covers threads from all processes
>> + *                          running on the system. This command returns 0.
>> + *
>> + * Command to be passed to the membarrier system call. The commands need to
>> + * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>> + * the value 0.
>> + */
>> +enum membarrier_cmd {
>> +	MEMBARRIER_CMD_QUERY = 0,
>> +	MEMBARRIER_CMD_SHARED = (1 << 0),
>> +};
>> +
>> +#endif /* _UAPI_LINUX_MEMBARRIER_H */
>> diff --git a/init/Kconfig b/init/Kconfig
>> index af09b4f..4bba60f 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1577,6 +1577,18 @@ config PCI_QUIRKS
>>  	  bugs/quirks. Disable this only if your target machine is
>>  	  unaffected by PCI quirks.
>>  
>> +config MEMBARRIER
>> +	bool "Enable membarrier() system call" if EXPERT
>> +	default y
>> +	help
>> +	  Enable the membarrier() system call that allows issuing memory
>> +	  barriers across all running threads, which can be used to distribute
>> +	  the cost of user-space memory barriers asymmetrically by transforming
>> +	  pairs of memory barriers into pairs consisting of membarrier() and a
>> +	  compiler barrier.
>> +
>> +	  If unsure, say Y.
>> +
>>  config EMBEDDED
>>  	bool "Embedded system"
>>  	option allnoconfig_y
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index 43c4c92..92a481b 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
>>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>>  obj-$(CONFIG_TORTURE_TEST) += torture.o
>> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
>>  
>>  $(obj)/configs.o: $(obj)/config_data.h
>>  
>> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
>> new file mode 100644
>> index 0000000..536c727
>> --- /dev/null
>> +++ b/kernel/membarrier.c
>> @@ -0,0 +1,66 @@
>> +/*
>> + * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> + *
>> + * membarrier system call
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include <linux/syscalls.h>
>> +#include <linux/membarrier.h>
>> +
>> +/*
>> + * Bitmask made from an "or" of all commands within enum membarrier_cmd,
>> + * except MEMBARRIER_CMD_QUERY.
>> + */
>> +#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
>> +
>> +/**
>> + * sys_membarrier - issue memory barriers on a set of threads
>> + * @cmd:   Takes command values defined in enum membarrier_cmd.
>> + * @flags: Currently needs to be 0. For future extensions.
>> + *
>> + * If this system call is not implemented, -ENOSYS is returned. If the
>> + * command specified does not exist, or if the command argument is invalid,
>> + * this system call returns -EINVAL. For a given command, with flags argument
>> + * set to 0, this system call is guaranteed to always return the same value
>> + * until reboot.
>> + *
>> + * All memory accesses performed in program order from each targeted thread
>> + * are guaranteed to be ordered with respect to sys_membarrier(). If we use
>> + * the semantic "barrier()" to represent a compiler barrier forcing memory
>> + * accesses to be performed in program order across the barrier, and
>> + * smp_mb() to represent explicit memory barriers forcing full memory
>> + * ordering across the barrier, we have the following ordering table for
>> + * each pair of barrier(), sys_membarrier() and smp_mb():
>> + *
>> + * The pair ordering is detailed as (O: ordered, X: not ordered):
>> + *
>> + *                        barrier()   smp_mb() sys_membarrier()
>> + *        barrier()          X           X            O
>> + *        smp_mb()           X           O            O
>> + *        sys_membarrier()   O           O            O
>> + */
>> +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>> +{
>> +	if (unlikely(flags))
>> +		return -EINVAL;
>> +	switch (cmd) {
>> +	case MEMBARRIER_CMD_QUERY:
>> +		return MEMBARRIER_CMD_BITMASK;
>> +	case MEMBARRIER_CMD_SHARED:
>> +		if (num_online_cpus() > 1)
>> +			synchronize_sched();
>> +		return 0;
>> +	default:
>> +		return -EINVAL;
>> +	}
>> +}
>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>> index 7995ef5..eb4fde0 100644
>> --- a/kernel/sys_ni.c
>> +++ b/kernel/sys_ni.c
>> @@ -243,3 +243,6 @@ cond_syscall(sys_bpf);
>>  
>>  /* execveat */
>>  cond_syscall(sys_execveat);
>> +
>> +/* membarrier */
>> +cond_syscall(sys_membarrier);
>> 
> 
> 
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: membarrier.2 --]
[-- Type: text/troff; name=membarrier.2, Size: 3189 bytes --]

.\" Copyright 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date.  The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein.  The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.TH MEMBARRIER 2 2015-04-15 "Linux" "Linux Programmer's Manual"
.SH NAME
membarrier \- issue memory barriers on a set of threads
.SH SYNOPSIS
.B #include <linux/membarrier.h>
.sp
.BI "int membarrier(int " cmd ", int " flags ");"
.sp
.SH DESCRIPTION
The
.I cmd
argument is one of the following:

.TP
.B MEMBARRIER_CMD_QUERY
Query the set of supported commands. It returns a bitmask of supported
commands.
.TP
.B MEMBARRIER_CMD_SHARED
Execute a memory barrier on all threads running on the system. Upon
return from system call, the caller thread is ensured that all running
threads have passed through a state where all memory accesses to
user-space addresses match program order between entry to and return
from the system call (non-running threads are de facto in such a
state). This covers threads from all processes running on the system.
This command returns 0.

.PP
The
.I flags
argument must be 0. It is reserved for future extensions.

.PP
All memory accesses performed in program order from each targeted thread
are guaranteed to be ordered with respect to sys_membarrier(). If we use
the semantic "barrier()" to represent a compiler barrier forcing memory
accesses to be performed in program order across the barrier, and
smp_mb() to represent explicit memory barriers forcing full memory
ordering across the barrier, we have the following ordering table for
each pair of barrier(), sys_membarrier() and smp_mb():

The pair ordering is detailed as (O: ordered, X: not ordered):

.nf
                       barrier()   smp_mb() sys_membarrier()
       barrier()          X           X            O
       smp_mb()           X           O            O
       sys_membarrier()   O           O            O
.fi

.SH RETURN VALUE
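On success, the
.B MEMBARRIER_CMD_QUERY
command returns a bitmask of the supported commands, and the
.B MEMBARRIER_CMD_SHARED
command returns zero.  On error, \-1 is returned,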
and
.I errno
is set appropriately.
For a given command, with flags argument set to 0, this system call is
guaranteed to always return the same value until reboot.
.SH ERRORS
.TP
.B ENOSYS
System call is not implemented.
.TP
.B EINVAL
Invalid arguments.
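
As a rough illustration of the interface described in the man page above,
here is a minimal user-space sketch. It assumes the C library provides no
membarrier() wrapper and therefore goes through syscall(2) with
__NR_membarrier. The helper names membarrier_syscall(),
membarrier_shared_supported() and membarrier_shared() are hypothetical and
exist only for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

/* Hypothetical thin wrapper around the raw system call. */
int membarrier_syscall(int cmd, int flags)
{
	return (int)syscall(__NR_membarrier, cmd, flags);
}

/*
 * Hypothetical helper: returns 1 if MEMBARRIER_CMD_SHARED is reported by
 * MEMBARRIER_CMD_QUERY, 0 if the system call or the command is unavailable.
 */
int membarrier_shared_supported(void)
{
	int mask = membarrier_syscall(MEMBARRIER_CMD_QUERY, 0);

	if (mask < 0)
		return 0;	/* e.g. ENOSYS on a kernel without membarrier */
	return (mask & MEMBARRIER_CMD_SHARED) != 0;
}

/*
 * Hypothetical helper: issues the system-wide barrier.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int membarrier_shared(void)
{
	if (!membarrier_shared_supported()) {
		errno = ENOSYS;
		return -1;
	}
	return membarrier_syscall(MEMBARRIER_CMD_SHARED, 0);
}
```

A caller would typically query once at startup and keep the result, since
the return value for a given command is documented as stable until reboot.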

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86)
@ 2015-12-05  8:48       ` Mathieu Desnoyers
  0 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-12-05  8:48 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api,
	KOSAKI Motohiro, rostedt, Nicholas Miell, Linus Torvalds,
	Ingo Molnar, One Thousand Gnomes, Lai Jiangshan,
	Stephen Hemminger, Thomas Gleixner, Peter Zijlstra,
	David Howells, Pranith Kumar

[-- Attachment #1: Type: text/plain, Size: 28765 bytes --]

Hi Michael,

Please find the membarrier man page groff file attached. I re-integrated
into this groff source some changes that initially went only into the
changelog text version.

Please let me know if you find any issue with it.

Mathieu

----- On Dec 4, 2015, at 4:44 PM, Michael Kerrisk mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:

> Hi Mathieu,
> 
> In the patch below you have a man page type of text. Is that
> just plain text, or do you have some groff source somewhere?
> 
> Thanks,
> 
> Michael
> 
> 
> On 07/10/2015 10:58 PM, Mathieu Desnoyers wrote:
>> Here is an implementation of a new system call, sys_membarrier(), which
>> executes a memory barrier on all threads running on the system. It is
>> implemented by calling synchronize_sched(). It can be used to distribute
>> the cost of user-space memory barriers asymmetrically by transforming
>> pairs of memory barriers into pairs consisting of sys_membarrier() and a
>> compiler barrier. For synchronization primitives that distinguish
>> between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
>> read-side can be accelerated significantly by moving the bulk of the
>> memory barrier overhead to the write-side.
>> 
>> The existing applications of which I am aware that would be improved by this
>> system call are as follows:
>> 
>> * Through Userspace RCU library (http://urcu.so)
>>   - DNS server (Knot DNS) https://www.knot-dns.cz/
>>   - Network sniffer (http://netsniff-ng.org/)
>>   - Distributed object storage (https://sheepdog.github.io/sheepdog/)
>>   - User-space tracing (http://lttng.org)
>>   - Network storage system (https://www.gluster.org/)
>>   - Virtual routers
>>   (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
>>   - Financial software (https://lkml.org/lkml/2015/3/23/189)
>> 
>> Those projects use RCU in userspace to increase read-side speed and
>> scalability compared to locking. Especially in the case of RCU used
>> by libraries, sys_membarrier can speed up the read-side by moving the
>> bulk of the memory barrier cost to synchronize_rcu().
>> 
>> * Direct users of sys_membarrier
>>   - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
>> 
>> Microsoft core dotnet GC developers are planning to use the mprotect()
>> side-effect of issuing memory barriers through IPIs as a way to implement
>> Windows FlushProcessWriteBuffers() on Linux. They are referring to
>> sys_membarrier in their github thread, specifically stating that
>> sys_membarrier() is what they are looking for.
>> 
>> This implementation is based on kernel v4.1-rc8.
>> 
>> To explain the benefit of this scheme, let's introduce two example threads:
>> 
>> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
>> Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
>> 
>> In a scheme where all smp_mb() in thread A are ordering memory accesses
>> with respect to smp_mb() present in Thread B, we can change each
>> smp_mb() within Thread A into calls to sys_membarrier() and each
>> smp_mb() within Thread B into compiler barriers "barrier()".
>> 
>> Before the change, we had, for each smp_mb() pair:
>> 
>> Thread A                    Thread B
>> previous mem accesses       previous mem accesses
>> smp_mb()                    smp_mb()
>> following mem accesses      following mem accesses
>> 
>> After the change, these pairs become:
>> 
>> Thread A                    Thread B
>> prev mem accesses           prev mem accesses
>> sys_membarrier()            barrier()
>> follow mem accesses         follow mem accesses
>> 
>> As we can see, there are two possible scenarios: either Thread B memory
>> accesses do not happen concurrently with Thread A accesses (1), or they
>> do (2).
>> 
>> 1) Non-concurrent Thread A vs Thread B accesses:
>> 
>> Thread A                    Thread B
>> prev mem accesses
>> sys_membarrier()
>> follow mem accesses
>>                             prev mem accesses
>>                             barrier()
>>                             follow mem accesses
>> 
>> In this case, thread B accesses will be weakly ordered. This is OK,
>> because at that point, thread A is not particularly interested in
>> ordering them with respect to its own accesses.
>> 
>> 2) Concurrent Thread A vs Thread B accesses
>> 
>> Thread A                    Thread B
>> prev mem accesses           prev mem accesses
>> sys_membarrier()            barrier()
>> follow mem accesses         follow mem accesses
>> 
>> In this case, thread B accesses, which are ensured to be in program
>> order thanks to the compiler barrier, will be "upgraded" to full
>> smp_mb() by synchronize_sched().
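
The Thread A / Thread B pairing above can be sketched in user space roughly
as follows. This is only an illustration under stated assumptions, not the
liburcu implementation: the function names barrier_init(),
fast_side_barrier() and slow_side_barrier() are hypothetical, C11
atomic_signal_fence() stands in for barrier(), and atomic_thread_fence()
stands in for smp_mb(). The C11 memory model does not formally describe the
pairing of a compiler barrier with membarrier(); the ordering relies on the
kernel-side guarantee described in this changelog.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

/* Nonzero once barrier_init() has seen MEMBARRIER_CMD_SHARED in the query mask. */
static int use_membarrier;

/*
 * Call once, before any thread uses the barriers below.
 * Returns 1 if the asymmetric scheme is in use, 0 for the smp_mb() fallback.
 */
int barrier_init(void)
{
	long mask = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0);

	use_membarrier = mask > 0 && (mask & MEMBARRIER_CMD_SHARED);
	return use_membarrier;
}

/* Thread B (frequent path): a compiler barrier is enough when Thread A uses membarrier. */
void fast_side_barrier(void)
{
	if (use_membarrier)
		atomic_signal_fence(memory_order_seq_cst);	/* barrier() */
	else
		atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
}

/* Thread A (infrequent path): carries the barrier cost for all running threads. */
void slow_side_barrier(void)
{
	if (use_membarrier) {
		if (syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0))
			abort();	/* not expected once QUERY reported the command */
	} else {
		atomic_thread_fence(memory_order_seq_cst);
	}
}
```

Both sides must agree on the mode, which is why the fallback keeps a full
fence on the fast side as well. The extra branch on use_membarrier is the
kind of dynamic availability check whose read-side cost shows up in the
"dyn. check" benchmark row.
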
>> 
>> * Benchmarks
>> 
>> On Intel Xeon E5405 (8 cores)
>> (one thread is calling sys_membarrier, the other 7 threads are busy
>> looping)
>> 
>> 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
>> 
>> * User-space user of this system call: Userspace RCU library
>> 
>> Both the signal-based and the sys_membarrier userspace RCU schemes
>> permit us to remove the memory barrier from the userspace RCU
>> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
>> accelerating them. These memory barriers are replaced by compiler
>> barriers on the read-side, and all matching memory barriers on the
>> write-side are turned into an invocation of a memory barrier on all
>> active threads in the process. By letting the kernel perform this
>> synchronization rather than dumbly sending a signal to every thread of
>> the process (as we currently do), we diminish the number of unnecessary
>> wake-ups and only issue the memory barriers on active threads. Non-running
>> threads do not need to execute such barriers anyway, because these are
>> implied by the scheduler context switches.
>> 
>> Results in liburcu:
>> 
>> Operations in 10s, 6 readers, 2 writers:
>> 
>> memory barriers in reader:    1701557485 reads, 2202847 writes
>> signal-based scheme:          9830061167 reads,    6700 writes
>> sys_membarrier:               9952759104 reads,     425 writes
>> sys_membarrier (dyn. check):  7970328887 reads,     425 writes
>> 
>> The dynamic sys_membarrier availability check adds some overhead to
>> the read-side compared to the signal-based scheme, but besides that,
>> sys_membarrier slightly outperforms the signal-based scheme. However,
>> this non-expedited sys_membarrier implementation has a much slower grace
>> period than signal and memory barrier schemes.
>> 
>> Besides diminishing the number of wake-ups, one major advantage of the
>> membarrier system call over the signal-based scheme is that it does not
>> need to reserve a signal. This plays much more nicely with libraries,
>> and with processes injected into for tracing purposes, for which we
>> cannot expect that signals will be unused by the application.
>> 
>> An expedited version of this system call can be added later on to speed
>> up the grace period. Its implementation will likely depend on reading
>> the cpu_curr()->mm without holding each CPU's rq lock.
>> 
>> This patch adds the system call to x86 and to asm-generic.
>> 
>> [1] http://urcu.so
>> 
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
>> Reviewed-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
>> Reviewed-by: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
>> CC: KOSAKI Motohiro <kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
>> CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
>> CC: Nicholas Miell <nmiell-Wuw85uim5zDR7s880joybQ@public.gmane.org>
>> CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>> CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> CC: Alan Cox <gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org>
>> CC: Lai Jiangshan <laijs-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> CC: Stephen Hemminger <stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org>
>> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>> CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
>> CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
>> CC: David Howells <dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> CC: Pranith Kumar <bobby.prani-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> CC: Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> 
>> ---
>> 
>> membarrier(2) man page:
>> --------------- snip -------------------
>> MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
>> 
>> NAME
>>        membarrier - issue memory barriers on a set of threads
>> 
>> SYNOPSIS
>>        #include <linux/membarrier.h>
>> 
>>        int membarrier(int cmd, int flags);
>> 
>> DESCRIPTION
>>        The cmd argument is one of the following:
>> 
>>        MEMBARRIER_CMD_QUERY
>>               Query  the  set  of  supported commands. It returns a bitmask of
>>               supported commands.
>> 
>>        MEMBARRIER_CMD_SHARED
>>               Execute a memory barrier on all threads running on  the  system.
>>               Upon  return from system call, the caller thread is ensured that
>>               all running threads have passed through a state where all memory
>>               accesses  to  user-space  addresses  match program order between
>>               entry to and return from the system  call  (non-running  threads
>>               are de facto in such a state). This covers threads from all pro‐
>>               cesses running on the system.  This command returns 0.
>> 
>>        The flags argument needs to be 0. For future extensions.
>> 
>>        All memory accesses performed  in  program  order  from  each  targeted
>>        thread are guaranteed to be ordered with respect to sys_membarrier(). If
>>        we use the semantic "barrier()" to represent a compiler barrier forcing
>>        memory  accesses  to  be performed in program order across the barrier,
>>        and smp_mb() to represent explicit memory barriers forcing full  memory
>>        ordering  across  the barrier, we have the following ordering table for
>>        each pair of barrier(), sys_membarrier() and smp_mb():
>> 
>>        The pair ordering is detailed as (O: ordered, X: not ordered):
>> 
>>                               barrier()   smp_mb() sys_membarrier()
>>               barrier()          X           X            O
>>               smp_mb()           X           O            O
>>               sys_membarrier()   O           O            O
>> 
>> RETURN VALUE
>>        On success, MEMBARRIER_CMD_QUERY returns a bitmask of the supported
>>        commands and MEMBARRIER_CMD_SHARED returns zero.  On error, -1 is
>>        returned, and errno is set appropriately. For a given command, with
>>        flags argument set to 0, this system call is guaranteed to always
>>        return the same value until reboot.
>> 
>> ERRORS
>>        ENOSYS System call is not implemented.
>> 
>>        EINVAL Invalid arguments.
>> 
>> Linux                             2015-04-15                     MEMBARRIER(2)
>> --------------- snip -------------------
>> 
>> Changes since v18:
>> - Add unlikely() check to flags,
>> - Describe current users in changelog.
>> 
>> Changes since v17:
>> - Update commit message.
>> 
>> Changes since v16:
>> - Update documentation.
>> - Add man page to changelog.
>> - Build sys_membarrier on !CONFIG_SMP. It allows userspace applications
>>   to not care about the number of processors on the system.  Based on
>>   recommendations from Stephen Hemminger and Steven Rostedt.
>> - Check that flags argument is 0, update documentation to require it.
>> 
>> Changes since v15:
>> - Add flags argument in addition to cmd.
>> - Update documentation.
>> 
>> Changes since v14:
>> - Take care of Thomas Gleixner's comments.
>> 
>> Changes since v13:
>> - Move to kernel/membarrier.c.
>> - Remove MEMBARRIER_PRIVATE flag.
>> - Add MAINTAINERS file entry.
>> 
>> Changes since v12:
>> - Remove _FLAG suffix from uapi flags.
>> - Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
>> - Remove EXPEDITED mode. Only implement non-expedited for now, until
>>   reading the cpu_curr()->mm can be done without holding the CPU's rq
>>   lock.
>> 
>> Changes since v11:
>> - 5 years have passed.
>> - Rebase on v3.19 kernel.
>> - Add futex-alike PRIVATE vs SHARED semantic: private for per-process
>>   barriers, non-private for memory mappings shared between processes.
>> - Simplify user API.
>> - Code refactoring.
>> 
>> Changes since v10:
>> - Apply Randy's comments.
>> - Rebase on 2.6.34-rc4 -tip.
>> 
>> Changes since v9:
>> - Clean up #ifdef CONFIG_SMP.
>> 
>> Changes since v8:
>> - Go back to rq spin locks taken by sys_membarrier() rather than adding
>>   memory barriers to the scheduler. It implies a potential RoS
>>   (reduction of service) if sys_membarrier() is executed in a busy-loop
>>   by a user, but nothing more than what is already possible with other
>>   existing system calls, but saves memory barriers in the scheduler fast
>>   path.
>> - re-add the memory barrier comments to x86 switch_mm() as an example to
>>   other architectures.
>> - Update documentation of the memory barriers in sys_membarrier and
>>   switch_mm().
>> - Append execution scenarios to the changelog showing the purpose of
>>   each memory barrier.
>> 
>> Changes since v7:
>> - Move spinlock-mb and scheduler related changes to separate patches.
>> - Add support for sys_membarrier on x86_32.
>> - Only x86 32/64 system calls are reserved in this patch. It is planned
>>   to incrementally reserve syscall IDs on other architectures as these
>>   are tested.
>> 
>> Changes since v6:
>> - Remove some unlikely() not so unlikely.
>> - Add the proper scheduler memory barriers needed to only use the RCU
>>   read lock in sys_membarrier rather than take each runqueue spinlock:
>> - Move memory barriers from per-architecture switch_mm() to schedule()
>>   and finish_lock_switch(), where they clearly document that all data
>>   protected by the rq lock is guaranteed to have memory barriers issued
>>   between the scheduler update and the task execution. Replacing the
>>   spin lock acquire/release barriers with these memory barriers implies
>>   either no overhead (x86 spinlock atomic instruction already implies a
>>   full mb) or some hopefully small overhead caused by the upgrade of the
>>   spinlock acquire/release barriers to more heavyweight smp_mb().
>> - The "generic" version of spinlock-mb.h declares both a mapping to
>>   standard spinlocks and full memory barriers. Each architecture can
>>   specialize this header following their own need and declare
>>   CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
>> - Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
>>   implementations on a wide range of architectures would be welcome.
>> 
>> Changes since v5:
>> - Plan ahead for extensibility by introducing mandatory/optional masks
>>   to the "flags" system call parameter. Past experience with accept4(),
>>   signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
>>   inotify_init1() indicates that this is the kind of thing we want to
>>   plan for. Return -EINVAL if the mandatory flags received are unknown.
>> - Create include/linux/membarrier.h to define these flags.
>> - Add MEMBARRIER_QUERY optional flag.
>> 
>> Changes since v4:
>> - Add "int expedited" parameter, use synchronize_sched() in the
>>   non-expedited case. Thanks to Lai Jiangshan for making us consider
>>   seriously using synchronize_sched() to provide the low-overhead
>>   membarrier scheme.
>> - Check num_online_cpus() == 1, quickly return without doing anything.
>> 
>> Changes since v3a:
>> - Confirm that each CPU indeed runs the current task's ->mm before
>>   sending an IPI. Ensures that we do not disturb RT tasks in the
>>   presence of lazy TLB shootdown.
>> - Document memory barriers needed in switch_mm().
>> - Surround helper functions with #ifdef CONFIG_SMP.
>> 
>> Changes since v2:
>> - simply send-to-many to the mm_cpumask. It contains the list of
>>   processors we have to IPI to (which use the mm), and this mask is
>>   updated atomically.
>> 
>> Changes since v1:
>> - Only perform the IPI in CONFIG_SMP.
>> - Only perform the IPI if the process has more than one thread.
>> - Only send IPIs to CPUs involved with threads belonging to our process.
>> - Adaptive IPI scheme (single vs many IPI with threshold).
>> - Issue smp_mb() at the beginning and end of the system call.
>> ---
>>  MAINTAINERS                            |  8 +++++
>>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>>  include/linux/syscalls.h               |  2 ++
>>  include/uapi/asm-generic/unistd.h      |  4 ++-
>>  include/uapi/linux/Kbuild              |  1 +
>>  include/uapi/linux/membarrier.h        | 53 +++++++++++++++++++++++++++
>>  init/Kconfig                           | 12 +++++++
>>  kernel/Makefile                        |  1 +
>>  kernel/membarrier.c                    | 66 ++++++++++++++++++++++++++++++++++
>>  kernel/sys_ni.c                        |  3 ++
>>  11 files changed, 151 insertions(+), 1 deletion(-)
>>  create mode 100644 include/uapi/linux/membarrier.h
>>  create mode 100644 kernel/membarrier.c
>> 
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 0d70760..b560da6 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -6642,6 +6642,14 @@ W:	http://www.mellanox.com
>>  Q:	http://patchwork.ozlabs.org/project/netdev/list/
>>  F:	drivers/net/ethernet/mellanox/mlx4/en_*
>>  
>> +MEMBARRIER SUPPORT
>> +M:	Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
>> +M:	"Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
>> +L:	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> +S:	Supported
>> +F:	kernel/membarrier.c
>> +F:	include/uapi/linux/membarrier.h
>> +
>>  MEMORY MANAGEMENT
>>  L:	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
>>  W:	http://www.linux-mm.org
>> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
>> index ef8187f..e63ad61 100644
>> --- a/arch/x86/entry/syscalls/syscall_32.tbl
>> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
>> @@ -365,3 +365,4 @@
>>  356	i386	memfd_create		sys_memfd_create
>>  357	i386	bpf			sys_bpf
>>  358	i386	execveat		sys_execveat			stub32_execveat
>> +359	i386	membarrier		sys_membarrier
>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
>> index 9ef32d5..87f3cd6 100644
>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>> @@ -329,6 +329,7 @@
>>  320	common	kexec_file_load		sys_kexec_file_load
>>  321	common	bpf			sys_bpf
>>  322	64	execveat		stub_execveat
>> +323	common	membarrier		sys_membarrier
>>  
>>  #
>>  # x32-specific system call numbers start at 512 to avoid cache impact
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index b45c45b..d4ab99b 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -884,4 +884,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
>>  			const char __user *const __user *argv,
>>  			const char __user *const __user *envp, int flags);
>>  
>> +asmlinkage long sys_membarrier(int cmd, int flags);
>> +
>>  #endif
>> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
>> index e016bd9..8da542a 100644
>> --- a/include/uapi/asm-generic/unistd.h
>> +++ b/include/uapi/asm-generic/unistd.h
>> @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
>>  __SYSCALL(__NR_bpf, sys_bpf)
>>  #define __NR_execveat 281
>>  __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>> +#define __NR_membarrier 282
>> +__SYSCALL(__NR_membarrier, sys_membarrier)
>>  
>>  #undef __NR_syscalls
>> -#define __NR_syscalls 282
>> +#define __NR_syscalls 283
>>  
>>  /*
>>   * All syscalls below here should go away really,
>> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
>> index 1ff9942..e6f229a 100644
>> --- a/include/uapi/linux/Kbuild
>> +++ b/include/uapi/linux/Kbuild
>> @@ -251,6 +251,7 @@ header-y += mdio.h
>>  header-y += media.h
>>  header-y += media-bus-format.h
>>  header-y += mei.h
>> +header-y += membarrier.h
>>  header-y += memfd.h
>>  header-y += mempolicy.h
>>  header-y += meye.h
>> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
>> new file mode 100644
>> index 0000000..e0b108b
>> --- /dev/null
>> +++ b/include/uapi/linux/membarrier.h
>> @@ -0,0 +1,53 @@
>> +#ifndef _UAPI_LINUX_MEMBARRIER_H
>> +#define _UAPI_LINUX_MEMBARRIER_H
>> +
>> +/*
>> + * linux/membarrier.h
>> + *
>> + * membarrier system call API
>> + *
>> + * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>> + * of this software and associated documentation files (the "Software"), to deal
>> + * in the Software without restriction, including without limitation the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>> + * copies of the Software, and to permit persons to whom the Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
>> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
>> + * SOFTWARE.
>> + */
>> +
>> +/**
>> + * enum membarrier_cmd - membarrier system call command
>> + * @MEMBARRIER_CMD_QUERY:   Query the set of supported commands. It returns
>> + *                          a bitmask of valid commands.
>> + * @MEMBARRIER_CMD_SHARED:  Execute a memory barrier on all running threads.
>> + *                          Upon return from system call, the caller thread
>> + *                          is ensured that all running threads have passed
>> + *                          through a state where all memory accesses to
>> + *                          user-space addresses match program order between
>> + *                          entry to and return from the system call
>> + *                          (non-running threads are de facto in such a
>> + *                          state). This covers threads from all processes
>> + *                          running on the system. This command returns 0.
>> + *
>> + * Command to be passed to the membarrier system call. The commands need to
>> + * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>> + * the value 0.
>> + */
>> +enum membarrier_cmd {
>> +	MEMBARRIER_CMD_QUERY = 0,
>> +	MEMBARRIER_CMD_SHARED = (1 << 0),
>> +};
>> +
>> +#endif /* _UAPI_LINUX_MEMBARRIER_H */
>> diff --git a/init/Kconfig b/init/Kconfig
>> index af09b4f..4bba60f 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1577,6 +1577,18 @@ config PCI_QUIRKS
>>  	  bugs/quirks. Disable this only if your target machine is
>>  	  unaffected by PCI quirks.
>>  
>> +config MEMBARRIER
>> +	bool "Enable membarrier() system call" if EXPERT
>> +	default y
>> +	help
>> +	  Enable the membarrier() system call that allows issuing memory
>> +	  barriers across all running threads, which can be used to distribute
>> +	  the cost of user-space memory barriers asymmetrically by transforming
>> +	  pairs of memory barriers into pairs consisting of membarrier() and a
>> +	  compiler barrier.
>> +
>> +	  If unsure, say Y.
>> +
>>  config EMBEDDED
>>  	bool "Embedded system"
>>  	option allnoconfig_y
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index 43c4c92..92a481b 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
>>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>>  obj-$(CONFIG_TORTURE_TEST) += torture.o
>> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
>>  
>>  $(obj)/configs.o: $(obj)/config_data.h
>>  
>> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
>> new file mode 100644
>> index 0000000..536c727
>> --- /dev/null
>> +++ b/kernel/membarrier.c
>> @@ -0,0 +1,66 @@
>> +/*
>> + * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> + *
>> + * membarrier system call
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include <linux/syscalls.h>
>> +#include <linux/membarrier.h>
>> +
>> +/*
>> + * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>> + * except MEMBARRIER_CMD_QUERY.
>> + */
>> +#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
>> +
>> +/**
>> + * sys_membarrier - issue memory barriers on a set of threads
>> + * @cmd:   Takes command values defined in enum membarrier_cmd.
>> + * @flags: Currently needs to be 0. For future extensions.
>> + *
>> + * If this system call is not implemented, -ENOSYS is returned. If the
>> + * command specified does not exist, or if the command argument is invalid,
>> + * this system call returns -EINVAL. For a given command, with flags argument
>> + * set to 0, this system call is guaranteed to always return the same value
>> + * until reboot.
>> + *
>> + * All memory accesses performed in program order from each targeted thread
>> + * is guaranteed to be ordered with respect to sys_membarrier(). If we use
>> + * the semantic "barrier()" to represent a compiler barrier forcing memory
>> + * accesses to be performed in program order across the barrier, and
>> + * smp_mb() to represent explicit memory barriers forcing full memory
>> + * ordering across the barrier, we have the following ordering table for
>> + * each pair of barrier(), sys_membarrier() and smp_mb():
>> + *
>> + * The pair ordering is detailed as (O: ordered, X: not ordered):
>> + *
>> + *                        barrier()   smp_mb() sys_membarrier()
>> + *        barrier()          X           X            O
>> + *        smp_mb()           X           O            O
>> + *        sys_membarrier()   O           O            O
>> + */
>> +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>> +{
>> +	if (unlikely(flags))
>> +		return -EINVAL;
>> +	switch (cmd) {
>> +	case MEMBARRIER_CMD_QUERY:
>> +		return MEMBARRIER_CMD_BITMASK;
>> +	case MEMBARRIER_CMD_SHARED:
>> +		if (num_online_cpus() > 1)
>> +			synchronize_sched();
>> +		return 0;
>> +	default:
>> +		return -EINVAL;
>> +	}
>> +}
>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>> index 7995ef5..eb4fde0 100644
>> --- a/kernel/sys_ni.c
>> +++ b/kernel/sys_ni.c
>> @@ -243,3 +243,6 @@ cond_syscall(sys_bpf);
>>  
>>  /* execveat */
>>  cond_syscall(sys_execveat);
>> +
>> +/* membarrier */
>> +cond_syscall(sys_membarrier);
>> 
> 
> 
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

[-- Attachment #2: membarrier.2 --]
[-- Type: text/troff; name=membarrier.2, Size: 3189 bytes --]

.\" Copyright 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date.  The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein.  The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.TH MEMBARRIER 2 2015-04-15 "Linux" "Linux Programmer's Manual"
.SH NAME
membarrier \- issue memory barriers on a set of threads
.SH SYNOPSIS
.B #include <linux/membarrier.h>
.sp
.BI "int membarrier(int " cmd ", int " flags ");
.sp
.SH DESCRIPTION
The
.I cmd
argument is one of the following:

.TP
.B MEMBARRIER_CMD_QUERY
Query the set of supported commands. It returns a bitmask of supported
commands.
.TP
.B MEMBARRIER_CMD_SHARED
Execute a memory barrier on all threads running on the system. Upon
return from system call, the caller thread is ensured that all running
threads have passed through a state where all memory accesses to
user-space addresses match program order between entry to and return
from the system call (non-running threads are de facto in such a
state). This covers threads from all processes running on the system.
This command returns 0.

.PP
The
.I flags
argument is currently unused.

.PP
All memory accesses performed in program order from each targeted thread
is guaranteed to be ordered with respect to sys_membarrier(). If we use
the semantic "barrier()" to represent a compiler barrier forcing memory
accesses to be performed in program order across the barrier, and
smp_mb() to represent explicit memory barriers forcing full memory
ordering across the barrier, we have the following ordering table for
each pair of barrier(), sys_membarrier() and smp_mb():

The pair ordering is detailed as (O: ordered, X: not ordered):

                       barrier()   smp_mb() sys_membarrier()
       barrier()          X           X            O
       smp_mb()           X           O            O
       sys_membarrier()   O           O            O

.SH RETURN VALUE
On success, these system calls return zero.  On error, \-1 is returned,
and
.I errno
is set appropriately.
For a given command, with flags argument set to 0, this system call is
guaranteed to always return the same value until reboot.
.SH ERRORS
.TP
.B ENOSYS
System call is not implemented.
.TP
.B EINVAL
Invalid arguments.
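
A minimal user-space sketch of the interface described in the page above. It assumes the installed kernel headers provide <linux/membarrier.h> and __NR_membarrier, and that no libc wrapper is available, so the call goes through syscall(2). The helper names do_membarrier() and membarrier_shared_supported() are hypothetical and exist only for illustration.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical thin wrapper around the raw system call.
 * Returns the kernel's result on success, or -1 with errno set
 * (ENOSYS if the call is not implemented, EINVAL for bad arguments).
 */
static int do_membarrier(int cmd, int flags)
{
	return (int)syscall(__NR_membarrier, cmd, flags);
}

/*
 * Hypothetical helper: returns 1 if MEMBARRIER_CMD_SHARED is advertised
 * in the bitmask returned by MEMBARRIER_CMD_QUERY, 0 otherwise
 * (including when the system call itself is unavailable).
 */
static int membarrier_shared_supported(void)
{
	int mask = do_membarrier(MEMBARRIER_CMD_QUERY, 0);

	if (mask < 0)
		return 0;
	return (mask & MEMBARRIER_CMD_SHARED) != 0;
}
```

A writer would typically call membarrier_shared_supported() once at startup. If it returns 1, do_membarrier(MEMBARRIER_CMD_SHARED, 0) can stand in for the write-side memory barrier while readers keep only a compiler barrier. If it returns 0, both sides need to keep their real memory barriers.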


* Re: [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86)
@ 2015-12-11 18:05         ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 35+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-12-11 18:05 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: mtk.manpages, Andrew Morton, linux-kernel, linux-api,
	KOSAKI Motohiro, rostedt, Nicholas Miell, Linus Torvalds,
	Ingo Molnar, One Thousand Gnomes, Lai Jiangshan,
	Stephen Hemminger, Thomas Gleixner, Peter Zijlstra,
	David Howells, Pranith Kumar

Hi Mathieu,

On 12/05/2015 09:48 AM, Mathieu Desnoyers wrote:
> Hi Michael,
> 
> Please find the membarrier man groff file attached. I re-integrated
> some changes that went in initially only in the changelog text version
> back onto this groff source.
> 
> Please let me know if you find any issue with it.

Thanks for the page, but there are a few issues. Could you please
submit a new version as an inline patch, and see what can be
done w.r.t. the following points (see man-pages(7) for some
background on some of these points):

* Start DESCRIPTION off with a paragraph explaining what this system
  call is about and why one would use it.

* Page needs VERSIONS, CONFORMING TO, and SEE ALSO sections.

* Is it possible to add a small EXAMPLE?

* In a NOTES section, it might be helpful to briefly explain the following
  concepts:  memory barrier and program order.

Some comments on individual pieces below:

> .TH MEMBARRIER 2 2015-04-15 "Linux" "Linux Programmer's Manual"
> .SH NAME
> membarrier \- issue memory barriers on a set of threads
> .SH SYNOPSIS
> .B #include <linux/membarrier.h>
> .sp
> .BI "int membarrier(int " cmd ", int " flags ");
> .sp
> .SH DESCRIPTION
> The
> .I cmd
> argument is one of the following:
> 
> .TP
> .B MEMBARRIER_CMD_QUERY
> Query the set of supported commands. It returns a bitmask of supported
> commands.

Not clear here. Does this mean that the 'cmd' argument is a bit mask,
rather than an enumeration? I think that needs to be spelled out.
Also, the text should mention that the returned bitmask excludes
MEMBARRIER_CMD_QUERY. (Why, actually?)

> .TP
> .B MEMBARRIER_CMD_SHARED
> Execute a memory barrier on all threads running on the system. 

All threads on the system?

> Upon
> return from system call, the caller thread is ensured that all running
> threads have passed through a state where all memory accesses to
> user-space addresses match program order between entry to and return
> from the system call (non-running threads are de facto in such a
> state). This covers threads from all processes running on the system.
> This command returns 0.
> 
> .PP
> The
> .I flags
> argument is currently unused.
> 
> .PP
> All memory accesses performed in program order from each targeted thread

What is a "targeted thread"? Some rewording is needed here.

> is guaranteed to be ordered with respect to sys_membarrier(). If we use
> the semantic "barrier()" to represent a compiler barrier forcing memory
> accesses to be performed in program order across the barrier, and
> smp_mb() to represent explicit memory barriers forcing full memory
> ordering across the barrier, we have the following ordering table for
> each pair of barrier(), sys_membarrier() and smp_mb():
> 
> The pair ordering is detailed as (O: ordered, X: not ordered):
> 
>                        barrier()   smp_mb() sys_membarrier()
>        barrier()          X           X            O
>        smp_mb()           X           O            O
>        sys_membarrier()   O           O            O
> 
> .SH RETURN VALUE
> On success, these system calls return zero.  

This sentence seems out of place. We have one system call.
And the different operations described above return
nonzero values on success.

> On error, \-1 is returned,
> and
> .I errno
> is set appropriately.
> For a given command, with flags argument set to 0, this system call is
> guaranteed to always return the same value until reboot.

I don't understand the intent of the last sentence. What idea are you
trying to convey?

> .SH ERRORS
> .TP
> .B ENOSYS
> System call is not implemented.
> .TP
> .B EINVAL
> Invalid arguments.

Would be clearer to say here: "cmd is invalid or flags is nonzero"

Thanks,

Michael


> ----- On Dec 4, 2015, at 4:44 PM, Michael Kerrisk mtk.manpages@gmail.com wrote:
> 
>> Hi Mathieu,
>>
>> In the patch below you have a man page type of text. Is that
>> just plain text, or do you have some groff source somewhere?
>>
>> Thanks,
>>
>> Michael
>>
>>
>> On 07/10/2015 10:58 PM, Mathieu Desnoyers wrote:
>>> Here is an implementation of a new system call, sys_membarrier(), which
>>> executes a memory barrier on all threads running on the system. It is
>>> implemented by calling synchronize_sched(). It can be used to distribute
>>> the cost of user-space memory barriers asymmetrically by transforming
>>> pairs of memory barriers into pairs consisting of sys_membarrier() and a
>>> compiler barrier. For synchronization primitives that distinguish
>>> between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
>>> read-side can be accelerated significantly by moving the bulk of the
>>> memory barrier overhead to the write-side.
>>>
>>> The existing applications of which I am aware that would be improved by this
>>> system call are as follows:
>>>
>>> * Through Userspace RCU library (http://urcu.so)
>>>   - DNS server (Knot DNS) https://www.knot-dns.cz/
>>>   - Network sniffer (http://netsniff-ng.org/)
>>>   - Distributed object storage (https://sheepdog.github.io/sheepdog/)
>>>   - User-space tracing (http://lttng.org)
>>>   - Network storage system (https://www.gluster.org/)
>>>   - Virtual routers
>>>   (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
>>>   - Financial software (https://lkml.org/lkml/2015/3/23/189)
>>>
>>> Those projects use RCU in userspace to increase read-side speed and
>>> scalability compared to locking. Especially in the case of RCU used
>>> by libraries, sys_membarrier can speed up the read-side by moving the
>>> bulk of the memory barrier cost to synchronize_rcu().
>>>
>>> * Direct users of sys_membarrier
>>>   - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
>>>
>>> Microsoft core dotnet GC developers are planning to use the mprotect()
>>> side-effect of issuing memory barriers through IPIs as a way to implement
>>> Windows FlushProcessWriteBuffers() on Linux. They are referring to
>>> sys_membarrier in their github thread, specifically stating that
>>> sys_membarrier() is what they are looking for.
>>>
>>> This implementation is based on kernel v4.1-rc8.
>>>
>>> To explain the benefit of this scheme, let's introduce two example threads:
>>>
>>> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
>>> Thread B (frequent, e.g. executing liburcu
>>> rcu_read_lock()/rcu_read_unlock())
>>>
>>> In a scheme where all smp_mb() in thread A are ordering memory accesses
>>> with respect to smp_mb() present in Thread B, we can change each
>>> smp_mb() within Thread A into calls to sys_membarrier() and each
>>> smp_mb() within Thread B into compiler barriers "barrier()".
>>>
>>> Before the change, we had, for each smp_mb() pairs:
>>>
>>> Thread A                    Thread B
>>> previous mem accesses       previous mem accesses
>>> smp_mb()                    smp_mb()
>>> following mem accesses      following mem accesses
>>>
>>> After the change, these pairs become:
>>>
>>> Thread A                    Thread B
>>> prev mem accesses           prev mem accesses
>>> sys_membarrier()            barrier()
>>> follow mem accesses         follow mem accesses
>>>
>>> As we can see, there are two possible scenarios: either Thread B memory
>>> accesses do not happen concurrently with Thread A accesses (1), or they
>>> do (2).
>>>
>>> 1) Non-concurrent Thread A vs Thread B accesses:
>>>
>>> Thread A                    Thread B
>>> prev mem accesses
>>> sys_membarrier()
>>> follow mem accesses
>>>                             prev mem accesses
>>>                             barrier()
>>>                             follow mem accesses
>>>
>>> In this case, thread B accesses will be weakly ordered. This is OK,
>>> because at that point, thread A is not particularly interested in
>>> ordering them with respect to its own accesses.
>>>
>>> 2) Concurrent Thread A vs Thread B accesses
>>>
>>> Thread A                    Thread B
>>> prev mem accesses           prev mem accesses
>>> sys_membarrier()            barrier()
>>> follow mem accesses         follow mem accesses
>>>
>>> In this case, thread B accesses, which are ensured to be in program
>>> order thanks to the compiler barrier, will be "upgraded" to full
>>> smp_mb() by synchronize_sched().
>>>
>>> * Benchmarks
>>>
>>> On Intel Xeon E5405 (8 cores)
>>> (one thread is calling sys_membarrier, the other 7 threads are busy
>>> looping)
>>>
>>> 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
>>>
>>> * User-space user of this system call: Userspace RCU library
>>>
>>> Both the signal-based and the sys_membarrier userspace RCU schemes
>>> permit us to remove the memory barrier from the userspace RCU
>>> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
>>> accelerating them. These memory barriers are replaced by compiler
>>> barriers on the read-side, and all matching memory barriers on the
>>> write-side are turned into an invocation of a memory barrier on all
>>> active threads in the process. By letting the kernel perform this
>>> synchronization rather than dumbly sending a signal to every process
>>> threads (as we currently do), we diminish the number of unnecessary wake
>>> ups and only issue the memory barriers on active threads. Non-running
>>> threads do not need to execute such barrier anyway, because these are
>>> implied by the scheduler context switches.
>>>
>>> Results in liburcu:
>>>
>>> Operations in 10s, 6 readers, 2 writers:
>>>
>>> memory barriers in reader:    1701557485 reads, 2202847 writes
>>> signal-based scheme:          9830061167 reads,    6700 writes
>>> sys_membarrier:               9952759104 reads,     425 writes
>>> sys_membarrier (dyn. check):  7970328887 reads,     425 writes
>>>
>>> The dynamic sys_membarrier availability check adds some overhead to
>>> the read-side compared to the signal-based scheme, but besides that,
>>> sys_membarrier slightly outperforms the signal-based scheme. However,
>>> this non-expedited sys_membarrier implementation has a much slower grace
>>> period than signal and memory barrier schemes.
>>>
>>> Besides diminishing the number of wake-ups, one major advantage of the
>>> membarrier system call over the signal-based scheme is that it does not
>>> need to reserve a signal. This plays much more nicely with libraries,
>>> and with processes injected into for tracing purposes, for which we
>>> cannot expect that signals will be unused by the application.
>>>
>>> An expedited version of this system call can be added later on to speed
>>> up the grace period. Its implementation will likely depend on reading
>>> the cpu_curr()->mm without holding each CPU's rq lock.
>>>
>>> This patch adds the system call to x86 and to asm-generic.
>>>
>>> [1] http://urcu.so
>>>
>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>>> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
>>> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>> CC: Steven Rostedt <rostedt@goodmis.org>
>>> CC: Nicholas Miell <nmiell@comcast.net>
>>> CC: Linus Torvalds <torvalds@linux-foundation.org>
>>> CC: Ingo Molnar <mingo@redhat.com>
>>> CC: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
>>> CC: Lai Jiangshan <laijs@cn.fujitsu.com>
>>> CC: Stephen Hemminger <stephen@networkplumber.org>
>>> CC: Andrew Morton <akpm@linux-foundation.org>
>>> CC: Thomas Gleixner <tglx@linutronix.de>
>>> CC: Peter Zijlstra <peterz@infradead.org>
>>> CC: David Howells <dhowells@redhat.com>
>>> CC: Pranith Kumar <bobby.prani@gmail.com>
>>> CC: Michael Kerrisk <mtk.manpages@gmail.com>
>>> CC: linux-api@vger.kernel.org
>>>
>>> ---
>>>
>>> membarrier(2) man page:
>>> --------------- snip -------------------
>>> MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
>>>
>>> NAME
>>>        membarrier - issue memory barriers on a set of threads
>>>
>>> SYNOPSIS
>>>        #include <linux/membarrier.h>
>>>
>>>        int membarrier(int cmd, int flags);
>>>
>>> DESCRIPTION
>>>        The cmd argument is one of the following:
>>>
>>>        MEMBARRIER_CMD_QUERY
>>>               Query  the  set  of  supported commands. It returns a bitmask of
>>>               supported commands.
>>>
>>>        MEMBARRIER_CMD_SHARED
>>>               Execute a memory barrier on all threads running on  the  system.
>>>               Upon  return from system call, the caller thread is ensured that
>>>               all running threads have passed through a state where all memory
>>>               accesses  to  user-space  addresses  match program order between
>>>               entry to and return from the system  call  (non-running  threads
>>>               are de facto in such a state). This covers threads from all pro‐
>>>               cesses running on the system.  This command returns 0.
>>>
>>>        The flags argument needs to be 0. For future extensions.
>>>
>>>        All memory accesses performed  in  program  order  from  each  targeted
>>>        thread is guaranteed to be ordered with respect to sys_membarrier(). If
>>>        we use the semantic "barrier()" to represent a compiler barrier forcing
>>>        memory  accesses  to  be performed in program order across the barrier,
>>>        and smp_mb() to represent explicit memory barriers forcing full  memory
>>>        ordering  across  the barrier, we have the following ordering table for
>>>        each pair of barrier(), sys_membarrier() and smp_mb():
>>>
>>>        The pair ordering is detailed as (O: ordered, X: not ordered):
>>>
>>>                               barrier()   smp_mb() sys_membarrier()
>>>               barrier()          X           X            O
>>>               smp_mb()           X           O            O
>>>               sys_membarrier()   O           O            O
>>>
>>> RETURN VALUE
>>>        On success, these system calls return zero.  On error, -1 is  returned,
>>>        and errno is set appropriately. For a given command, with flags
>>>        argument set to 0, this system call is guaranteed to always return the
>>>        same value until reboot.
>>>
>>> ERRORS
>>>        ENOSYS System call is not implemented.
>>>
>>>        EINVAL Invalid arguments.
>>>
>>> Linux                             2015-04-15                     MEMBARRIER(2)
>>> --------------- snip -------------------
>>>
>>> Changes since v18:
>>> - Add unlikely() check to flags,
>>> - Describe current users in changelog.
>>>
>>> Changes since v17:
>>> - Update commit message.
>>>
>>> Changes since v16:
>>> - Update documentation.
>>> - Add man page to changelog.
>>> - Build sys_membarrier on !CONFIG_SMP. It allows userspace applications
>>>   to not care about the number of processors on the system.  Based on
>>>   recommendations from Stephen Hemminger and Steven Rostedt.
>>> - Check that flags argument is 0, update documentation to require it.
>>>
>>> Changes since v15:
>>> - Add flags argument in addition to cmd.
>>> - Update documentation.
>>>
>>> Changes since v14:
>>> - Take care of Thomas Gleixner's comments.
>>>
>>> Changes since v13:
>>> - Move to kernel/membarrier.c.
>>> - Remove MEMBARRIER_PRIVATE flag.
>>> - Add MAINTAINERS file entry.
>>>
>>> Changes since v12:
>>> - Remove _FLAG suffix from uapi flags.
>>> - Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
>>> - Remove EXPEDITED mode. Only implement non-expedited for now, until
>>>   reading the cpu_curr()->mm can be done without holding the CPU's rq
>>>   lock.
>>>
>>> Changes since v11:
>>> - 5 years have passed.
>>> - Rebase on v3.19 kernel.
>>> - Add futex-alike PRIVATE vs SHARED semantic: private for per-process
>>>   barriers, non-private for memory mappings shared between processes.
>>> - Simplify user API.
>>> - Code refactoring.
>>>
>>> Changes since v10:
>>> - Apply Randy's comments.
>>> - Rebase on 2.6.34-rc4 -tip.
>>>
>>> Changes since v9:
>>> - Clean up #ifdef CONFIG_SMP.
>>>
>>> Changes since v8:
>>> - Go back to rq spin locks taken by sys_membarrier() rather than adding
>>>   memory barriers to the scheduler. It implies a potential RoS
>>>   (reduction of service) if sys_membarrier() is executed in a busy-loop
>>>   by a user, but nothing more than what is already possible with other
>>>   existing system calls, but saves memory barriers in the scheduler fast
>>>   path.
>>> - re-add the memory barrier comments to x86 switch_mm() as an example to
>>>   other architectures.
>>> - Update documentation of the memory barriers in sys_membarrier and
>>>   switch_mm().
>>> - Append execution scenarios to the changelog showing the purpose of
>>>   each memory barrier.
>>>
>>> Changes since v7:
>>> - Move spinlock-mb and scheduler related changes to separate patches.
>>> - Add support for sys_membarrier on x86_32.
>>> - Only x86 32/64 system calls are reserved in this patch. It is planned
>>>   to incrementally reserve syscall IDs on other architectures as these
>>>   are tested.
>>>
>>> Changes since v6:
>>> - Remove some unlikely() not so unlikely.
>>> - Add the proper scheduler memory barriers needed to only use the RCU
>>>   read lock in sys_membarrier rather than take each runqueue spinlock:
>>> - Move memory barriers from per-architecture switch_mm() to schedule()
>>>   and finish_lock_switch(), where they clearly document that all data
>>>   protected by the rq lock is guaranteed to have memory barriers issued
>>>   between the scheduler update and the task execution. Replacing the
>>>   spin lock acquire/release barriers with these memory barriers imply
>>>   either no overhead (x86 spinlock atomic instruction already implies a
>>>   full mb) or some hopefully small overhead caused by the upgrade of the
>>>   spinlock acquire/release barriers to more heavyweight smp_mb().
>>> - The "generic" version of spinlock-mb.h declares both a mapping to
>>>   standard spinlocks and full memory barriers. Each architecture can
>>>   specialize this header following their own need and declare
>>>   CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
>>> - Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
>>>   implementations on a wide range of architecture would be welcome.
>>>
>>> Changes since v5:
>>> - Plan ahead for extensibility by introducing mandatory/optional masks
>>>   to the "flags" system call parameter. Past experience with accept4(),
>>>   signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
>>>   inotify_init1() indicates that this is the kind of thing we want to
>>>   plan for. Return -EINVAL if the mandatory flags received are unknown.
>>> - Create include/linux/membarrier.h to define these flags.
>>> - Add MEMBARRIER_QUERY optional flag.
>>>
>>> Changes since v4:
>>> - Add "int expedited" parameter, use synchronize_sched() in the
>>>   non-expedited case. Thanks to Lai Jiangshan for making us consider
>>>   seriously using synchronize_sched() to provide the low-overhead
>>>   membarrier scheme.
>>> - Check num_online_cpus() == 1, quickly return without doing nothing.
>>>
>>> Changes since v3a:
>>> - Confirm that each CPU indeed runs the current task's ->mm before
>>>   sending an IPI. Ensures that we do not disturb RT tasks in the
>>>   presence of lazy TLB shootdown.
>>> - Document memory barriers needed in switch_mm().
>>> - Surround helper functions with #ifdef CONFIG_SMP.
>>>
>>> Changes since v2:
>>> - simply send-to-many to the mm_cpumask. It contains the list of
>>>   processors we have to IPI to (which use the mm), and this mask is
>>>   updated atomically.
>>>
>>> Changes since v1:
>>> - Only perform the IPI in CONFIG_SMP.
>>> - Only perform the IPI if the process has more than one thread.
>>> - Only send IPIs to CPUs involved with threads belonging to our process.
>>> - Adaptative IPI scheme (single vs many IPI with threshold).
>>> - Issue smp_mb() at the beginning and end of the system call.
>>> ---
>>>  MAINTAINERS                            |  8 +++++
>>>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>>>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>>>  include/linux/syscalls.h               |  2 ++
>>>  include/uapi/asm-generic/unistd.h      |  4 ++-
>>>  include/uapi/linux/Kbuild              |  1 +
>>>  include/uapi/linux/membarrier.h        | 53 +++++++++++++++++++++++++++
>>>  init/Kconfig                           | 12 +++++++
>>>  kernel/Makefile                        |  1 +
>>>  kernel/membarrier.c                    | 66 ++++++++++++++++++++++++++++++++++
>>>  kernel/sys_ni.c                        |  3 ++
>>>  11 files changed, 151 insertions(+), 1 deletion(-)
>>>  create mode 100644 include/uapi/linux/membarrier.h
>>>  create mode 100644 kernel/membarrier.c
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index 0d70760..b560da6 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -6642,6 +6642,14 @@ W:	http://www.mellanox.com
>>>  Q:	http://patchwork.ozlabs.org/project/netdev/list/
>>>  F:	drivers/net/ethernet/mellanox/mlx4/en_*
>>>  
>>> +MEMBARRIER SUPPORT
>>> +M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>> +M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
>>> +L:	linux-kernel@vger.kernel.org
>>> +S:	Supported
>>> +F:	kernel/membarrier.c
>>> +F:	include/uapi/linux/membarrier.h
>>> +
>>>  MEMORY MANAGEMENT
>>>  L:	linux-mm@kvack.org
>>>  W:	http://www.linux-mm.org
>>> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl
>>> b/arch/x86/entry/syscalls/syscall_32.tbl
>>> index ef8187f..e63ad61 100644
>>> --- a/arch/x86/entry/syscalls/syscall_32.tbl
>>> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
>>> @@ -365,3 +365,4 @@
>>>  356	i386	memfd_create		sys_memfd_create
>>>  357	i386	bpf			sys_bpf
>>>  358	i386	execveat		sys_execveat			stub32_execveat
>>> +359	i386	membarrier		sys_membarrier
>>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
>>> b/arch/x86/entry/syscalls/syscall_64.tbl
>>> index 9ef32d5..87f3cd6 100644
>>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>>> @@ -329,6 +329,7 @@
>>>  320	common	kexec_file_load		sys_kexec_file_load
>>>  321	common	bpf			sys_bpf
>>>  322	64	execveat		stub_execveat
>>> +323	common	membarrier		sys_membarrier
>>>  
>>>  #
>>>  # x32-specific system call numbers start at 512 to avoid cache impact
>>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>>> index b45c45b..d4ab99b 100644
>>> --- a/include/linux/syscalls.h
>>> +++ b/include/linux/syscalls.h
>>> @@ -884,4 +884,6 @@ asmlinkage long sys_execveat(int dfd, const char __user
>>> *filename,
>>>  			const char __user *const __user *argv,
>>>  			const char __user *const __user *envp, int flags);
>>>  
>>> +asmlinkage long sys_membarrier(int cmd, int flags);
>>> +
>>>  #endif
>>> diff --git a/include/uapi/asm-generic/unistd.h
>>> b/include/uapi/asm-generic/unistd.h
>>> index e016bd9..8da542a 100644
>>> --- a/include/uapi/asm-generic/unistd.h
>>> +++ b/include/uapi/asm-generic/unistd.h
>>> @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
>>>  __SYSCALL(__NR_bpf, sys_bpf)
>>>  #define __NR_execveat 281
>>>  __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>>> +#define __NR_membarrier 282
>>> +__SYSCALL(__NR_membarrier, sys_membarrier)
>>>  
>>>  #undef __NR_syscalls
>>> -#define __NR_syscalls 282
>>> +#define __NR_syscalls 283
>>>  
>>>  /*
>>>   * All syscalls below here should go away really,
>>> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
>>> index 1ff9942..e6f229a 100644
>>> --- a/include/uapi/linux/Kbuild
>>> +++ b/include/uapi/linux/Kbuild
>>> @@ -251,6 +251,7 @@ header-y += mdio.h
>>>  header-y += media.h
>>>  header-y += media-bus-format.h
>>>  header-y += mei.h
>>> +header-y += membarrier.h
>>>  header-y += memfd.h
>>>  header-y += mempolicy.h
>>>  header-y += meye.h
>>> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
>>> new file mode 100644
>>> index 0000000..e0b108b
>>> --- /dev/null
>>> +++ b/include/uapi/linux/membarrier.h
>>> @@ -0,0 +1,53 @@
>>> +#ifndef _UAPI_LINUX_MEMBARRIER_H
>>> +#define _UAPI_LINUX_MEMBARRIER_H
>>> +
>>> +/*
>>> + * linux/membarrier.h
>>> + *
>>> + * membarrier system call API
>>> + *
>>> + * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>> + *
>>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>>> + * of this software and associated documentation files (the "Software"), to
>>> deal
>>> + * in the Software without restriction, including without limitation the rights
>>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>>> + * copies of the Software, and to permit persons to whom the Software is
>>> + * furnished to do so, subject to the following conditions:
>>> + *
>>> + * The above copyright notice and this permission notice shall be included in
>>> + * all copies or substantial portions of the Software.
>>> + *
>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
>>> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
>>> FROM,
>>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>>> THE
>>> + * SOFTWARE.
>>> + */
>>> +
>>> +/**
>>> + * enum membarrier_cmd - membarrier system call command
>>> + * @MEMBARRIER_CMD_QUERY:   Query the set of supported commands. It returns
>>> + *                          a bitmask of valid commands.
>>> + * @MEMBARRIER_CMD_SHARED:  Execute a memory barrier on all running threads.
>>> + *                          Upon return from system call, the caller thread
>>> + *                          is ensured that all running threads have passed
>>> + *                          through a state where all memory accesses to
>>> + *                          user-space addresses match program order between
>>> + *                          entry to and return from the system call
>>> + *                          (non-running threads are de facto in such a
>>> + *                          state). This covers threads from all processes
>>> + *                          running on the system. This command returns 0.
>>> + *
>>> + * Command to be passed to the membarrier system call. The commands need to
>>> + * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>>> + * the value 0.
>>> + */
>>> +enum membarrier_cmd {
>>> +	MEMBARRIER_CMD_QUERY = 0,
>>> +	MEMBARRIER_CMD_SHARED = (1 << 0),
>>> +};
>>> +
>>> +#endif /* _UAPI_LINUX_MEMBARRIER_H */
>>> diff --git a/init/Kconfig b/init/Kconfig
>>> index af09b4f..4bba60f 100644
>>> --- a/init/Kconfig
>>> +++ b/init/Kconfig
>>> @@ -1577,6 +1577,18 @@ config PCI_QUIRKS
>>>  	  bugs/quirks. Disable this only if your target machine is
>>>  	  unaffected by PCI quirks.
>>>  
>>> +config MEMBARRIER
>>> +	bool "Enable membarrier() system call" if EXPERT
>>> +	default y
>>> +	help
>>> +	  Enable the membarrier() system call that allows issuing memory
>>> +	  barriers across all running threads, which can be used to distribute
>>> +	  the cost of user-space memory barriers asymmetrically by transforming
>>> +	  pairs of memory barriers into pairs consisting of membarrier() and a
>>> +	  compiler barrier.
>>> +
>>> +	  If unsure, say Y.
>>> +
>>>  config EMBEDDED
>>>  	bool "Embedded system"
>>>  	option allnoconfig_y
>>> diff --git a/kernel/Makefile b/kernel/Makefile
>>> index 43c4c92..92a481b 100644
>>> --- a/kernel/Makefile
>>> +++ b/kernel/Makefile
>>> @@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
>>>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>>>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>>>  obj-$(CONFIG_TORTURE_TEST) += torture.o
>>> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
>>>  
>>>  $(obj)/configs.o: $(obj)/config_data.h
>>>  
>>> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
>>> new file mode 100644
>>> index 0000000..536c727
>>> --- /dev/null
>>> +++ b/kernel/membarrier.c
>>> @@ -0,0 +1,66 @@
>>> +/*
>>> + * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>> + *
>>> + * membarrier system call
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License as published by
>>> + * the Free Software Foundation; either version 2 of the License, or
>>> + * (at your option) any later version.
>>> + *
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>> + * GNU General Public License for more details.
>>> + */
>>> +
>>> +#include <linux/syscalls.h>
>>> +#include <linux/membarrier.h>
>>> +
>>> +/*
>>> + * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>>> + * except MEMBARRIER_CMD_QUERY.
>>> + */
>>> +#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
>>> +
>>> +/**
>>> + * sys_membarrier - issue memory barriers on a set of threads
>>> + * @cmd:   Takes command values defined in enum membarrier_cmd.
>>> + * @flags: Currently needs to be 0. For future extensions.
>>> + *
>>> + * If this system call is not implemented, -ENOSYS is returned. If the
>>> + * command specified does not exist, or if the command argument is invalid,
>>> + * this system call returns -EINVAL. For a given command, with flags argument
>>> + * set to 0, this system call is guaranteed to always return the same value
>>> + * until reboot.
>>> + *
>>> + * All memory accesses performed in program order from each targeted thread
>>> + * is guaranteed to be ordered with respect to sys_membarrier(). If we use
>>> + * the semantic "barrier()" to represent a compiler barrier forcing memory
>>> + * accesses to be performed in program order across the barrier, and
>>> + * smp_mb() to represent explicit memory barriers forcing full memory
>>> + * ordering across the barrier, we have the following ordering table for
>>> + * each pair of barrier(), sys_membarrier() and smp_mb():
>>> + *
>>> + * The pair ordering is detailed as (O: ordered, X: not ordered):
>>> + *
>>> + *                        barrier()   smp_mb() sys_membarrier()
>>> + *        barrier()          X           X            O
>>> + *        smp_mb()           X           O            O
>>> + *        sys_membarrier()   O           O            O
>>> + */
>>> +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>>> +{
>>> +	if (unlikely(flags))
>>> +		return -EINVAL;
>>> +	switch (cmd) {
>>> +	case MEMBARRIER_CMD_QUERY:
>>> +		return MEMBARRIER_CMD_BITMASK;
>>> +	case MEMBARRIER_CMD_SHARED:
>>> +		if (num_online_cpus() > 1)
>>> +			synchronize_sched();
>>> +		return 0;
>>> +	default:
>>> +		return -EINVAL;
>>> +	}
>>> +}
>>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>>> index 7995ef5..eb4fde0 100644
>>> --- a/kernel/sys_ni.c
>>> +++ b/kernel/sys_ni.c
>>> @@ -243,3 +243,6 @@ cond_syscall(sys_bpf);
>>>  
>>>  /* execveat */
>>>  cond_syscall(sys_execveat);
>>> +
>>> +/* membarrier */
>>> +cond_syscall(sys_membarrier);
>>>
>>
>>
>> --
>> Michael Kerrisk
>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>> Linux/UNIX System Programming Training: http://man7.org/training/
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86)
@ 2015-12-11 18:05         ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 35+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-12-11 18:05 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, KOSAKI Motohiro,
	rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	One Thousand Gnomes, Lai Jiangshan, Stephen Hemminger,
	Thomas Gleixner, Peter Zijlstra, David Howells, Pranith Kumar

Hi Mathieu,

On 12/05/2015 09:48 AM, Mathieu Desnoyers wrote:
> Hi Michael,
> 
> Please find the membarrier man groff file attached. I re-integrated
> some changes that went in initially only in the changelog text version
> back onto this groff source.
> 
> Please let me know if you find any issue with it.

Thanks for the page, but there are a few issues. Could you please 
submit a new version as an inline patch, and see what can be
done w.r.t. the following points (see man-pages(7) for some
background on some of these points):

* Start DESCRIPTION off with a paragraph explaining what this system
  call is about and why one would use it.

* Page needs VERSIONS, CONFORMING TO, and SEE ALSO sections.

* Is it possible to add a small EXAMPLE?

* In a NOTES section, it might be helpful to briefly explain the following
  concepts:  memory barrier and program order.
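
As a possible starting point for such an EXAMPLE, here is a rough sketch
of the asymmetric pattern described in the patch changelog: the frequent
path uses only a compiler barrier, and the infrequent path calls
membarrier(). Everything here is an assumption for illustration: the names
membarrier_call(), init_membarrier(), fast_path() and slow_path() are
hypothetical, no libc wrapper is assumed, the constants are copied from the
quoted header, the GCC/Clang builtins are my choice, and the fallback to
full fences when the call is unavailable is an addition of mine. The intent
is that the two paths, run concurrently, cannot both read 0.

```c
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values copied from the quoted include/uapi/linux/membarrier.h. */
enum { CMD_QUERY = 0, CMD_SHARED = 1 << 0 };

static volatile int a, b;
static int have_membarrier;	/* set once, before any thread uses the paths */

/* Hypothetical wrapper around the raw system call. */
static int membarrier_call(int cmd, int flags)
{
#ifdef SYS_membarrier
	return (int) syscall(SYS_membarrier, cmd, flags);
#else
	(void) cmd;
	(void) flags;
	errno = ENOSYS;		/* headers predate the system call */
	return -1;
#endif
}

/* Decide once whether the asymmetric scheme can be used. */
void init_membarrier(void)
{
	int mask = membarrier_call(CMD_QUERY, 0);

	have_membarrier = mask >= 0 && (mask & CMD_SHARED);
}

/* Frequent side: store, barrier, load. */
void fast_path(int *read_b)
{
	a = 1;
	if (have_membarrier)
		__asm__ __volatile__("" : : : "memory");  /* compiler barrier */
	else
		__atomic_thread_fence(__ATOMIC_SEQ_CST);   /* full barrier */
	*read_b = b;
}

/* Infrequent side: pays for the barrier on behalf of the fast path. */
void slow_path(int *read_a)
{
	b = 1;
	if (have_membarrier) {
		int ret = membarrier_call(CMD_SHARED, 0);

		/* Continuing after a failure would break the pairing. */
		assert(ret == 0);
		(void) ret;
	} else {
		__atomic_thread_fence(__ATOMIC_SEQ_CST);
	}
	*read_a = a;
}
```

init_membarrier() has to run before any thread enters either path, since
both sides must agree on which kind of barrier is in use.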

Some comments on individual pieces below:

> .TH MEMBARRIER 2 2015-04-15 "Linux" "Linux Programmer's Manual"
> .SH NAME
> membarrier \- issue memory barriers on a set of threads
> .SH SYNOPSIS
> .B #include <linux/membarrier.h>
> .sp
> .BI "int membarrier(int " cmd ", int " flags ");
> .sp
> .SH DESCRIPTION
> The
> .I cmd
> argument is one of the following:
> 
> .TP
> .B MEMBARRIER_CMD_QUERY
> Query the set of supported commands. It returns a bitmask of supported
> commands.

Not clear here. Does this mean that the 'cmd' argument is a bit mask,
rather than an enumeration? I think that needs to be spelled out.
Also, the text should mention that the returned bitmask excludes
MEMBARRIER_CMD_QUERY. (Why, actually?)
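
To illustrate the question: going by the quoted header, 'cmd' is an
enumeration in which every value other than QUERY occupies a single bit,
so the QUERY result can be tested with a bitwise AND. A minimal sketch
under that reading follows. membarrier_call() and cmd_is_supported() are
hypothetical names, no libc wrapper is assumed, and the constants are
copied from the quoted header so that the fragment builds without
<linux/membarrier.h>.

```c
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values copied from the quoted include/uapi/linux/membarrier.h. */
enum { CMD_QUERY = 0, CMD_SHARED = 1 << 0 };

/* Hypothetical wrapper around the raw system call. */
int membarrier_call(int cmd, int flags)
{
#ifdef SYS_membarrier
	return (int) syscall(SYS_membarrier, cmd, flags);
#else
	(void) cmd;
	(void) flags;
	errno = ENOSYS;		/* headers predate the system call */
	return -1;
#endif
}

/*
 * Hypothetical helper: is the single-bit command 'cmd' present in the
 * value returned by CMD_QUERY? A negative 'mask' (failed query) means
 * that nothing is supported.
 */
int cmd_is_supported(int mask, int cmd)
{
	/* QUERY has the value 0, so it has no bit of its own in the mask. */
	assert(cmd != CMD_QUERY);
	return mask >= 0 && (mask & cmd) == cmd;
}
```

A caller would do something like
cmd_is_supported(membarrier_call(CMD_QUERY, 0), CMD_SHARED) once at
startup and pick its barrier scheme based on the result.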

> .TP
> .B MEMBARRIER_CMD_SHARED
> Execute a memory barrier on all threads running on the system. 

All threads on the system?

> Upon
> return from system call, the caller thread is ensured that all running
> threads have passed through a state where all memory accesses to
> user-space addresses match program order between entry to and return
> from the system call (non-running threads are de facto in such a
> state). This covers threads from all processes running on the system.
> This command returns 0.
> 
> .PP
> The
> .I flags
> argument is currently unused.
> 
> .PP
> All memory accesses performed in program order from each targeted thread

What is a "targeted thread"? Some rewording is needed here.

> is guaranteed to be ordered with respect to sys_membarrier(). If we use
> the semantic "barrier()" to represent a compiler barrier forcing memory
> accesses to be performed in program order across the barrier, and
> smp_mb() to represent explicit memory barriers forcing full memory
> ordering across the barrier, we have the following ordering table for
> each pair of barrier(), sys_membarrier() and smp_mb():
> 
> The pair ordering is detailed as (O: ordered, X: not ordered):
> 
>                        barrier()   smp_mb() sys_membarrier()
>        barrier()          X           X            O
>        smp_mb()           X           O            O
>        sys_membarrier()   O           O            O
> 
> .SH RETURN VALUE
> On success, these system calls return zero.  

This sentence seems out of place. We have one system call.
And the different operations described above return
nonzero values on success.

> On error, \-1 is returned,
> and
> .I errno
> is set appropriately.
> For a given command, with flags argument set to 0, this system call is
> guaranteed to always return the same value until reboot.

I don't understand the intent of the last sentence. What idea are you
trying to convey?

> .SH ERRORS
> .TP
> .B ENOSYS
> System call is not implemented.
> .TP
> .B EINVAL
> Invalid arguments.

Would be clearer to say here: "cmd is invalid or flags is nonzero"
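
Going by the quoted kernel/membarrier.c, those are the only two EINVAL
cases. A user-space model of that argument checking, as a sketch only:
model_membarrier() and model_examples() are hypothetical names, the model
returns negative errno values the way the kernel function does (the
syscall(2) wrapper would turn them into -1 plus errno), and the barrier
itself is left out.

```c
#include <assert.h>
#include <errno.h>

/* Values copied from the quoted include/uapi/linux/membarrier.h. */
enum { CMD_QUERY = 0, CMD_SHARED = 1 << 0 };

/* OR of all commands except CMD_QUERY, as in the quoted kernel source. */
#define CMD_BITMASK	(CMD_SHARED)

/* Hypothetical model of the checks in the quoted sys_membarrier(). */
int model_membarrier(int cmd, int flags)
{
	if (flags)
		return -EINVAL;		/* flags is nonzero */
	switch (cmd) {
	case CMD_QUERY:
		return CMD_BITMASK;	/* success with a nonzero result */
	case CMD_SHARED:
		return 0;		/* barrier omitted in this model */
	default:
		return -EINVAL;		/* cmd is invalid */
	}
}

/* The outcomes a RETURN VALUE / ERRORS section would need to cover. */
void model_examples(void)
{
	assert(model_membarrier(CMD_QUERY, 0) == CMD_SHARED);
	assert(model_membarrier(CMD_SHARED, 0) == 0);
	assert(model_membarrier(CMD_SHARED, 1) == -EINVAL);
	assert(model_membarrier(1 << 5, 0) == -EINVAL);
}
```

Note that flags is checked first, so a nonzero flags value fails even for
CMD_QUERY.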

Thanks,

Michael


> ----- On Dec 4, 2015, at 4:44 PM, Michael Kerrisk mtk.manpages@gmail.com wrote:
> 
>> Hi Mathieu,
>>
>> In the patch below you have a man page type of text. Is that
>> just plain text, or do you have some groff source somewhere?
>>
>> Thanks,
>>
>> Michael
>>
>>
>> On 07/10/2015 10:58 PM, Mathieu Desnoyers wrote:
>>> Here is an implementation of a new system call, sys_membarrier(), which
>>> executes a memory barrier on all threads running on the system. It is
>>> implemented by calling synchronize_sched(). It can be used to distribute
>>> the cost of user-space memory barriers asymmetrically by transforming
>>> pairs of memory barriers into pairs consisting of sys_membarrier() and a
>>> compiler barrier. For synchronization primitives that distinguish
>>> between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
>>> read-side can be accelerated significantly by moving the bulk of the
>>> memory barrier overhead to the write-side.
>>>
>>> The existing applications of which I am aware that would be improved by this
>>> system call are as follows:
>>>
>>> * Through Userspace RCU library (http://urcu.so)
>>>   - DNS server (Knot DNS) https://www.knot-dns.cz/
>>>   - Network sniffer (http://netsniff-ng.org/)
>>>   - Distributed object storage (https://sheepdog.github.io/sheepdog/)
>>>   - User-space tracing (http://lttng.org)
>>>   - Network storage system (https://www.gluster.org/)
>>>   - Virtual routers
>>>   (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
>>>   - Financial software (https://lkml.org/lkml/2015/3/23/189)
>>>
>>> Those projects use RCU in userspace to increase read-side speed and
>>> scalability compared to locking. Especially in the case of RCU used
>>> by libraries, sys_membarrier can speed up the read-side by moving the
>>> bulk of the memory barrier cost to synchronize_rcu().
>>>
>>> * Direct users of sys_membarrier
>>>   - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
>>>
>>> Microsoft core dotnet GC developers are planning to use the mprotect()
>>> side-effect of issuing memory barriers through IPIs as a way to implement
>>> Windows FlushProcessWriteBuffers() on Linux. They are referring to
>>> sys_membarrier in their github thread, specifically stating that
>>> sys_membarrier() is what they are looking for.
>>>
>>> This implementation is based on kernel v4.1-rc8.
>>>
>>> To explain the benefit of this scheme, let's introduce two example threads:
>>>
>>> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
>>> Thread B (frequent, e.g. executing liburcu
>>> rcu_read_lock()/rcu_read_unlock())
>>>
>>> In a scheme where all smp_mb() in thread A are ordering memory accesses
>>> with respect to smp_mb() present in Thread B, we can change each
>>> smp_mb() within Thread A into calls to sys_membarrier() and each
>>> smp_mb() within Thread B into compiler barriers "barrier()".
>>>
>>> Before the change, we had, for each smp_mb() pairs:
>>>
>>> Thread A                    Thread B
>>> previous mem accesses       previous mem accesses
>>> smp_mb()                    smp_mb()
>>> following mem accesses      following mem accesses
>>>
>>> After the change, these pairs become:
>>>
>>> Thread A                    Thread B
>>> prev mem accesses           prev mem accesses
>>> sys_membarrier()            barrier()
>>> follow mem accesses         follow mem accesses
>>>
>>> As we can see, there are two possible scenarios: either Thread B memory
>>> accesses do not happen concurrently with Thread A accesses (1), or they
>>> do (2).
>>>
>>> 1) Non-concurrent Thread A vs Thread B accesses:
>>>
>>> Thread A                    Thread B
>>> prev mem accesses
>>> sys_membarrier()
>>> follow mem accesses
>>>                             prev mem accesses
>>>                             barrier()
>>>                             follow mem accesses
>>>
>>> In this case, thread B accesses will be weakly ordered. This is OK,
>>> because at that point, thread A is not particularly interested in
>>> ordering them with respect to its own accesses.
>>>
>>> 2) Concurrent Thread A vs Thread B accesses
>>>
>>> Thread A                    Thread B
>>> prev mem accesses           prev mem accesses
>>> sys_membarrier()            barrier()
>>> follow mem accesses         follow mem accesses
>>>
>>> In this case, thread B accesses, which are ensured to be in program
>>> order thanks to the compiler barrier, will be "upgraded" to full
>>> smp_mb() by synchronize_sched().
>>>
>>> * Benchmarks
>>>
>>> On Intel Xeon E5405 (8 cores)
>>> (one thread is calling sys_membarrier, the other 7 threads are busy
>>> looping)
>>>
>>> 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
>>>
>>> * User-space user of this system call: Userspace RCU library
>>>
>>> Both the signal-based and the sys_membarrier userspace RCU schemes
>>> permit us to remove the memory barrier from the userspace RCU
>>> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
>>> accelerating them. These memory barriers are replaced by compiler
>>> barriers on the read-side, and all matching memory barriers on the
>>> write-side are turned into an invocation of a memory barrier on all
>>> active threads in the process. By letting the kernel perform this
>>> synchronization rather than dumbly sending a signal to every process
>>> threads (as we currently do), we diminish the number of unnecessary wake
>>> ups and only issue the memory barriers on active threads. Non-running
>>> threads do not need to execute such barrier anyway, because these are
>>> implied by the scheduler context switches.
>>>
>>> Results in liburcu:
>>>
>>> Operations in 10s, 6 readers, 2 writers:
>>>
>>> memory barriers in reader:    1701557485 reads, 2202847 writes
>>> signal-based scheme:          9830061167 reads,    6700 writes
>>> sys_membarrier:               9952759104 reads,     425 writes
>>> sys_membarrier (dyn. check):  7970328887 reads,     425 writes
>>>
>>> The dynamic sys_membarrier availability check adds some overhead to
>>> the read-side compared to the signal-based scheme, but besides that,
>>> sys_membarrier slightly outperforms the signal-based scheme. However,
>>> this non-expedited sys_membarrier implementation has a much slower grace
>>> period than signal and memory barrier schemes.
>>>
>>> Besides diminishing the number of wake-ups, one major advantage of the
>>> membarrier system call over the signal-based scheme is that it does not
>>> need to reserve a signal. This plays much more nicely with libraries,
>>> and with processes injected into for tracing purposes, for which we
>>> cannot expect that signals will be unused by the application.
>>>
>>> An expedited version of this system call can be added later on to speed
>>> up the grace period. Its implementation will likely depend on reading
>>> the cpu_curr()->mm without holding each CPU's rq lock.
>>>
>>> This patch adds the system call to x86 and to asm-generic.
>>>
>>> [1] http://urcu.so
>>>
>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
>>> Reviewed-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
>>> Reviewed-by: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
>>> CC: KOSAKI Motohiro <kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
>>> CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
>>> CC: Nicholas Miell <nmiell-Wuw85uim5zDR7s880joybQ@public.gmane.org>
>>> CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>>> CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> CC: Alan Cox <gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org>
>>> CC: Lai Jiangshan <laijs-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>>> CC: Stephen Hemminger <stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org>
>>> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>>> CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
>>> CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
>>> CC: David Howells <dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> CC: Pranith Kumar <bobby.prani-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>>> CC: Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>>> CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>
>>> ---
>>>
>>> membarrier(2) man page:
>>> --------------- snip -------------------
>>> MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
>>>
>>> NAME
>>>        membarrier - issue memory barriers on a set of threads
>>>
>>> SYNOPSIS
>>>        #include <linux/membarrier.h>
>>>
>>>        int membarrier(int cmd, int flags);
>>>
>>> DESCRIPTION
>>>        The cmd argument is one of the following:
>>>
>>>        MEMBARRIER_CMD_QUERY
>>>               Query  the  set  of  supported commands. It returns a bitmask of
>>>               supported commands.
>>>
>>>        MEMBARRIER_CMD_SHARED
>>>               Execute a memory barrier on all threads running on  the  system.
>>>               Upon  return from system call, the caller thread is ensured that
>>>               all running threads have passed through a state where all memory
>>>               accesses  to  user-space  addresses  match program order between
>>>               entry to and return from the system  call  (non-running  threads
>>>               are de facto in such a state). This covers threads from all pro‐
>>>               cesses running on the system.  This command returns 0.
>>>
>>>        The flags argument needs to be 0. For future extensions.
>>>
>>>        All memory accesses performed  in  program  order  from  each  targeted
>>>        thread is guaranteed to be ordered with respect to sys_membarrier(). If
>>>        we use the semantic "barrier()" to represent a compiler barrier forcing
>>>        memory  accesses  to  be performed in program order across the barrier,
>>>        and smp_mb() to represent explicit memory barriers forcing full  memory
>>>        ordering  across  the barrier, we have the following ordering table for
>>>        each pair of barrier(), sys_membarrier() and smp_mb():
>>>
>>>        The pair ordering is detailed as (O: ordered, X: not ordered):
>>>
>>>                               barrier()   smp_mb() sys_membarrier()
>>>               barrier()          X           X            O
>>>               smp_mb()           X           O            O
>>>               sys_membarrier()   O           O            O
>>>
>>> RETURN VALUE
>>>        On success, these system calls return zero.  On error, -1 is  returned,
>>>        and errno is set appropriately. For a given command, with flags
>>>        argument set to 0, this system call is guaranteed to always return the
>>>        same value until reboot.
>>>
>>> ERRORS
>>>        ENOSYS System call is not implemented.
>>>
>>>        EINVAL Invalid arguments.
>>>
>>> Linux                             2015-04-15                     MEMBARRIER(2)
>>> --------------- snip -------------------
>>>
>>> Changes since v18:
>>> - Add unlikely() check to flags,
>>> - Describe current users in changelog.
>>>
>>> Changes since v17:
>>> - Update commit message.
>>>
>>> Changes since v16:
>>> - Update documentation.
>>> - Add man page to changelog.
>>> - Build sys_membarrier on !CONFIG_SMP. It allows userspace applications
>>>   to not care about the number of processors on the system.  Based on
>>>   recommendations from Stephen Hemminger and Steven Rostedt.
>>> - Check that flags argument is 0, update documentation to require it.
>>>
>>> Changes since v15:
>>> - Add flags argument in addition to cmd.
>>> - Update documentation.
>>>
>>> Changes since v14:
>>> - Take care of Thomas Gleixner's comments.
>>>
>>> Changes since v13:
>>> - Move to kernel/membarrier.c.
>>> - Remove MEMBARRIER_PRIVATE flag.
>>> - Add MAINTAINERS file entry.
>>>
>>> Changes since v12:
>>> - Remove _FLAG suffix from uapi flags.
>>> - Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
>>> - Remove EXPEDITED mode. Only implement non-expedited for now, until
>>>   reading the cpu_curr()->mm can be done without holding the CPU's rq
>>>   lock.
>>>
>>> Changes since v11:
>>> - 5 years have passed.
>>> - Rebase on v3.19 kernel.
>>> - Add futex-alike PRIVATE vs SHARED semantic: private for per-process
>>>   barriers, non-private for memory mappings shared between processes.
>>> - Simplify user API.
>>> - Code refactoring.
>>>
>>> Changes since v10:
>>> - Apply Randy's comments.
>>> - Rebase on 2.6.34-rc4 -tip.
>>>
>>> Changes since v9:
>>> - Clean up #ifdef CONFIG_SMP.
>>>
>>> Changes since v8:
>>> - Go back to rq spin locks taken by sys_membarrier() rather than adding
>>>   memory barriers to the scheduler. It implies a potential RoS
>>>   (reduction of service) if sys_membarrier() is executed in a busy-loop
>>>   by a user, but nothing more than what is already possible with other
>>>   existing system calls, but saves memory barriers in the scheduler fast
>>>   path.
>>> - re-add the memory barrier comments to x86 switch_mm() as an example to
>>>   other architectures.
>>> - Update documentation of the memory barriers in sys_membarrier and
>>>   switch_mm().
>>> - Append execution scenarios to the changelog showing the purpose of
>>>   each memory barrier.
>>>
>>> Changes since v7:
>>> - Move spinlock-mb and scheduler related changes to separate patches.
>>> - Add support for sys_membarrier on x86_32.
>>> - Only x86 32/64 system calls are reserved in this patch. It is planned
>>>   to incrementally reserve syscall IDs on other architectures as these
>>>   are tested.
>>>
>>> Changes since v6:
>>> - Remove some unlikely() not so unlikely.
>>> - Add the proper scheduler memory barriers needed to only use the RCU
>>>   read lock in sys_membarrier rather than take each runqueue spinlock:
>>> - Move memory barriers from per-architecture switch_mm() to schedule()
>>>   and finish_lock_switch(), where they clearly document that all data
>>>   protected by the rq lock is guaranteed to have memory barriers issued
>>>   between the scheduler update and the task execution. Replacing the
>>>   spin lock acquire/release barriers with these memory barriers imply
>>>   either no overhead (x86 spinlock atomic instruction already implies a
>>>   full mb) or some hopefully small overhead caused by the upgrade of the
>>>   spinlock acquire/release barriers to more heavyweight smp_mb().
>>> - The "generic" version of spinlock-mb.h declares both a mapping to
>>>   standard spinlocks and full memory barriers. Each architecture can
>>>   specialize this header following their own need and declare
>>>   CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
>>> - Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
>>>   implementations on a wide range of architecture would be welcome.
>>>
>>> Changes since v5:
>>> - Plan ahead for extensibility by introducing mandatory/optional masks
>>>   to the "flags" system call parameter. Past experience with accept4(),
>>>   signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
>>>   inotify_init1() indicates that this is the kind of thing we want to
>>>   plan for. Return -EINVAL if the mandatory flags received are unknown.
>>> - Create include/linux/membarrier.h to define these flags.
>>> - Add MEMBARRIER_QUERY optional flag.
>>>
>>> Changes since v4:
>>> - Add "int expedited" parameter, use synchronize_sched() in the
>>>   non-expedited case. Thanks to Lai Jiangshan for making us consider
>>>   seriously using synchronize_sched() to provide the low-overhead
>>>   membarrier scheme.
>>> - Check num_online_cpus() == 1, quickly return without doing nothing.
>>>
>>> Changes since v3a:
>>> - Confirm that each CPU indeed runs the current task's ->mm before
>>>   sending an IPI. Ensures that we do not disturb RT tasks in the
>>>   presence of lazy TLB shootdown.
>>> - Document memory barriers needed in switch_mm().
>>> - Surround helper functions with #ifdef CONFIG_SMP.
>>>
>>> Changes since v2:
>>> - simply send-to-many to the mm_cpumask. It contains the list of
>>>   processors we have to IPI to (which use the mm), and this mask is
>>>   updated atomically.
>>>
>>> Changes since v1:
>>> - Only perform the IPI in CONFIG_SMP.
>>> - Only perform the IPI if the process has more than one thread.
>>> - Only send IPIs to CPUs involved with threads belonging to our process.
>>> - Adaptative IPI scheme (single vs many IPI with threshold).
>>> - Issue smp_mb() at the beginning and end of the system call.
>>> ---
>>>  MAINTAINERS                            |  8 +++++
>>>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>>>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>>>  include/linux/syscalls.h               |  2 ++
>>>  include/uapi/asm-generic/unistd.h      |  4 ++-
>>>  include/uapi/linux/Kbuild              |  1 +
>>>  include/uapi/linux/membarrier.h        | 53 +++++++++++++++++++++++++++
>>>  init/Kconfig                           | 12 +++++++
>>>  kernel/Makefile                        |  1 +
>>>  kernel/membarrier.c                    | 66 ++++++++++++++++++++++++++++++++++
>>>  kernel/sys_ni.c                        |  3 ++
>>>  11 files changed, 151 insertions(+), 1 deletion(-)
>>>  create mode 100644 include/uapi/linux/membarrier.h
>>>  create mode 100644 kernel/membarrier.c
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index 0d70760..b560da6 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -6642,6 +6642,14 @@ W:	http://www.mellanox.com
>>>  Q:	http://patchwork.ozlabs.org/project/netdev/list/
>>>  F:	drivers/net/ethernet/mellanox/mlx4/en_*
>>>  
>>> +MEMBARRIER SUPPORT
>>> +M:	Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
>>> +M:	"Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
>>> +L:	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> +S:	Supported
>>> +F:	kernel/membarrier.c
>>> +F:	include/uapi/linux/membarrier.h
>>> +
>>>  MEMORY MANAGEMENT
>>>  L:	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
>>>  W:	http://www.linux-mm.org
>>> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl
>>> b/arch/x86/entry/syscalls/syscall_32.tbl
>>> index ef8187f..e63ad61 100644
>>> --- a/arch/x86/entry/syscalls/syscall_32.tbl
>>> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
>>> @@ -365,3 +365,4 @@
>>>  356	i386	memfd_create		sys_memfd_create
>>>  357	i386	bpf			sys_bpf
>>>  358	i386	execveat		sys_execveat			stub32_execveat
>>> +359	i386	membarrier		sys_membarrier
>>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
>>> b/arch/x86/entry/syscalls/syscall_64.tbl
>>> index 9ef32d5..87f3cd6 100644
>>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>>> @@ -329,6 +329,7 @@
>>>  320	common	kexec_file_load		sys_kexec_file_load
>>>  321	common	bpf			sys_bpf
>>>  322	64	execveat		stub_execveat
>>> +323	common	membarrier		sys_membarrier
>>>  
>>>  #
>>>  # x32-specific system call numbers start at 512 to avoid cache impact
>>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>>> index b45c45b..d4ab99b 100644
>>> --- a/include/linux/syscalls.h
>>> +++ b/include/linux/syscalls.h
>>> @@ -884,4 +884,6 @@ asmlinkage long sys_execveat(int dfd, const char __user
>>> *filename,
>>>  			const char __user *const __user *argv,
>>>  			const char __user *const __user *envp, int flags);
>>>  
>>> +asmlinkage long sys_membarrier(int cmd, int flags);
>>> +
>>>  #endif
>>> diff --git a/include/uapi/asm-generic/unistd.h
>>> b/include/uapi/asm-generic/unistd.h
>>> index e016bd9..8da542a 100644
>>> --- a/include/uapi/asm-generic/unistd.h
>>> +++ b/include/uapi/asm-generic/unistd.h
>>> @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
>>>  __SYSCALL(__NR_bpf, sys_bpf)
>>>  #define __NR_execveat 281
>>>  __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>>> +#define __NR_membarrier 282
>>> +__SYSCALL(__NR_membarrier, sys_membarrier)
>>>  
>>>  #undef __NR_syscalls
>>> -#define __NR_syscalls 282
>>> +#define __NR_syscalls 283
>>>  
>>>  /*
>>>   * All syscalls below here should go away really,
>>> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
>>> index 1ff9942..e6f229a 100644
>>> --- a/include/uapi/linux/Kbuild
>>> +++ b/include/uapi/linux/Kbuild
>>> @@ -251,6 +251,7 @@ header-y += mdio.h
>>>  header-y += media.h
>>>  header-y += media-bus-format.h
>>>  header-y += mei.h
>>> +header-y += membarrier.h
>>>  header-y += memfd.h
>>>  header-y += mempolicy.h
>>>  header-y += meye.h
>>> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
>>> new file mode 100644
>>> index 0000000..e0b108b
>>> --- /dev/null
>>> +++ b/include/uapi/linux/membarrier.h
>>> @@ -0,0 +1,53 @@
>>> +#ifndef _UAPI_LINUX_MEMBARRIER_H
>>> +#define _UAPI_LINUX_MEMBARRIER_H
>>> +
>>> +/*
>>> + * linux/membarrier.h
>>> + *
>>> + * membarrier system call API
>>> + *
>>> + * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>> + *
>>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>>> + * of this software and associated documentation files (the "Software"), to deal
>>> + * in the Software without restriction, including without limitation the rights
>>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>>> + * copies of the Software, and to permit persons to whom the Software is
>>> + * furnished to do so, subject to the following conditions:
>>> + *
>>> + * The above copyright notice and this permission notice shall be included in
>>> + * all copies or substantial portions of the Software.
>>> + *
>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
>>> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
>>> + * SOFTWARE.
>>> + */
>>> +
>>> +/**
>>> + * enum membarrier_cmd - membarrier system call command
>>> + * @MEMBARRIER_CMD_QUERY:   Query the set of supported commands. It returns
>>> + *                          a bitmask of valid commands.
>>> + * @MEMBARRIER_CMD_SHARED:  Execute a memory barrier on all running threads.
>>> + *                          Upon return from system call, the caller thread
>>> + *                          is ensured that all running threads have passed
>>> + *                          through a state where all memory accesses to
>>> + *                          user-space addresses match program order between
>>> + *                          entry to and return from the system call
>>> + *                          (non-running threads are de facto in such a
>>> + *                          state). This covers threads from all processes
>>> + *                          running on the system. This command returns 0.
>>> + *
>>> + * Command to be passed to the membarrier system call. The commands need to
>>> + * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>>> + * the value 0.
>>> + */
>>> +enum membarrier_cmd {
>>> +	MEMBARRIER_CMD_QUERY = 0,
>>> +	MEMBARRIER_CMD_SHARED = (1 << 0),
>>> +};
>>> +
>>> +#endif /* _UAPI_LINUX_MEMBARRIER_H */
>>> diff --git a/init/Kconfig b/init/Kconfig
>>> index af09b4f..4bba60f 100644
>>> --- a/init/Kconfig
>>> +++ b/init/Kconfig
>>> @@ -1577,6 +1577,18 @@ config PCI_QUIRKS
>>>  	  bugs/quirks. Disable this only if your target machine is
>>>  	  unaffected by PCI quirks.
>>>  
>>> +config MEMBARRIER
>>> +	bool "Enable membarrier() system call" if EXPERT
>>> +	default y
>>> +	help
>>> +	  Enable the membarrier() system call that allows issuing memory
>>> +	  barriers across all running threads, which can be used to distribute
>>> +	  the cost of user-space memory barriers asymmetrically by transforming
>>> +	  pairs of memory barriers into pairs consisting of membarrier() and a
>>> +	  compiler barrier.
>>> +
>>> +	  If unsure, say Y.
>>> +
>>>  config EMBEDDED
>>>  	bool "Embedded system"
>>>  	option allnoconfig_y
>>> diff --git a/kernel/Makefile b/kernel/Makefile
>>> index 43c4c92..92a481b 100644
>>> --- a/kernel/Makefile
>>> +++ b/kernel/Makefile
>>> @@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
>>>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>>>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>>>  obj-$(CONFIG_TORTURE_TEST) += torture.o
>>> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
>>>  
>>>  $(obj)/configs.o: $(obj)/config_data.h
>>>  
>>> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
>>> new file mode 100644
>>> index 0000000..536c727
>>> --- /dev/null
>>> +++ b/kernel/membarrier.c
>>> @@ -0,0 +1,66 @@
>>> +/*
>>> + * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>> + *
>>> + * membarrier system call
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License as published by
>>> + * the Free Software Foundation; either version 2 of the License, or
>>> + * (at your option) any later version.
>>> + *
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>> + * GNU General Public License for more details.
>>> + */
>>> +
>>> +#include <linux/syscalls.h>
>>> +#include <linux/membarrier.h>
>>> +
>>> +/*
>>> + * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>>> + * except MEMBARRIER_CMD_QUERY.
>>> + */
>>> +#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
>>> +
>>> +/**
>>> + * sys_membarrier - issue memory barriers on a set of threads
>>> + * @cmd:   Takes command values defined in enum membarrier_cmd.
>>> + * @flags: Currently needs to be 0. For future extensions.
>>> + *
>>> + * If this system call is not implemented, -ENOSYS is returned. If the
>>> + * command specified does not exist, or if the command argument is invalid,
>>> + * this system call returns -EINVAL. For a given command, with flags argument
>>> + * set to 0, this system call is guaranteed to always return the same value
>>> + * until reboot.
>>> + *
>>> + * All memory accesses performed in program order from each targeted thread
>>> + * is guaranteed to be ordered with respect to sys_membarrier(). If we use
>>> + * the semantic "barrier()" to represent a compiler barrier forcing memory
>>> + * accesses to be performed in program order across the barrier, and
>>> + * smp_mb() to represent explicit memory barriers forcing full memory
>>> + * ordering across the barrier, we have the following ordering table for
>>> + * each pair of barrier(), sys_membarrier() and smp_mb():
>>> + *
>>> + * The pair ordering is detailed as (O: ordered, X: not ordered):
>>> + *
>>> + *                        barrier()   smp_mb() sys_membarrier()
>>> + *        barrier()          X           X            O
>>> + *        smp_mb()           X           O            O
>>> + *        sys_membarrier()   O           O            O
>>> + */
>>> +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>>> +{
>>> +	if (unlikely(flags))
>>> +		return -EINVAL;
>>> +	switch (cmd) {
>>> +	case MEMBARRIER_CMD_QUERY:
>>> +		return MEMBARRIER_CMD_BITMASK;
>>> +	case MEMBARRIER_CMD_SHARED:
>>> +		if (num_online_cpus() > 1)
>>> +			synchronize_sched();
>>> +		return 0;
>>> +	default:
>>> +		return -EINVAL;
>>> +	}
>>> +}
>>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>>> index 7995ef5..eb4fde0 100644
>>> --- a/kernel/sys_ni.c
>>> +++ b/kernel/sys_ni.c
>>> @@ -243,3 +243,6 @@ cond_syscall(sys_bpf);
>>>  
>>>  /* execveat */
>>>  cond_syscall(sys_execveat);
>>> +
>>> +/* membarrier */
>>> +cond_syscall(sys_membarrier);
>>>


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86)
@ 2015-12-13 11:44           ` Mathieu Desnoyers
  0 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-12-13 11:44 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Andrew Morton, linux-kernel, linux-api, KOSAKI Motohiro, rostedt,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, One Thousand Gnomes,
	Lai Jiangshan, Stephen Hemminger, Thomas Gleixner,
	Peter Zijlstra, David Howells, Pranith Kumar

----- On Dec 11, 2015, at 1:05 PM, Michael Kerrisk mtk.manpages@gmail.com wrote:

> Hi Matthew,
> 
> On 12/05/2015 09:48 AM, Mathieu Desnoyers wrote:
>> Hi Michael,
>> 
>> Please find the membarrier man groff file attached. I re-integrated
>> some changes that went in initially only in the changelog text version
>> back onto this groff source.
>> 
>> Please let me know if you find any issue with it.
> 
> Thanks for the page, but there's a few issues. Could you please
> submit a new version as an inline patch, and see what can be
> done w.r.t. the following points (see man-pages(7) for some
> background on some of these points):
> 
> * Start DESCRIPTION off with a paragraph explaining what this system
>  call is about and why one would use it.
> 
> * Page needs VERSIONS, CONFORMING TO, and SEE ALSO sections.
> 
> * Is it possible to add a small EXAMPLE?
> 
> * In a NOTES section, it might be helpful to briefly explain the following
>  concepts:  memory barrier and program order.

Sure, I'll prepare a new version that covers these points,
as a patch against the man-pages project. Some questions
below.

> 
> Some comments on individual pieces below:
> 
>> .TH MEMBARRIER 2 2015-04-15 "Linux" "Linux Programmer's Manual"
>> .SH NAME
>> membarrier \- issue memory barriers on a set of threads
>> .SH SYNOPSIS
>> .B #include <linux/membarrier.h>
>> .sp
>> .BI "int membarrier(int " cmd ", int " flags ");
>> .sp
>> .SH DESCRIPTION
>> The
>> .I cmd
>> argument is one of the following:
>> 
>> .TP
>> .B MEMBARRIER_CMD_QUERY
>> Query the set of supported commands. It returns a bitmask of supported
>> commands.
> 
> Not clear here. Does this mean that the 'cmd' argument is a bit mask,
> rather than an enumeration? I think that needs to be spelled out.
> Also, the text should mention that the returned bitmask excludes
> MEMBARRIER_CMD_QUERY. (Why, actually?)

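The 'cmd' argument is an enumeration, but every command other than
MEMBARRIER_CMD_QUERY is assigned a single bit (one-hot), so that the
set of supported commands can be returned as a bitmask.
MEMBARRIER_CMD_QUERY is the special case: its value is 0, so it can
never appear in the mask of supported commands. I'll describe this in
the man page.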
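To make the one-hot convention concrete, here is a minimal sketch of
how a caller could interpret the value returned by the query command.
The helper names membarrier_call() and cmd_supported() are made up for
this example, and the two constants are copied from enum membarrier_cmd
in the patch; a real program would include <linux/membarrier.h> and
would typically go through syscall(2), since no libc wrapper is assumed
here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values copied from enum membarrier_cmd in the patch. */
enum { CMD_QUERY = 0, CMD_SHARED = 1 << 0 };

/* Hypothetical thin wrapper: returns -1 with errno set on failure. */
static long membarrier_call(int cmd, int flags)
{
#ifdef __NR_membarrier
	return syscall(__NR_membarrier, cmd, flags);
#else
	/* Headers too old to know the syscall number. */
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Hypothetical helper: is the single command "cmd" present in the mask
 * returned by CMD_QUERY? CMD_QUERY itself is 0, so it can never be found
 * in the mask; whether it works is known from the query call succeeding.
 */
static int cmd_supported(long mask, int cmd)
{
	if (mask < 0 || cmd <= 0)
		return 0;
	if (cmd & (cmd - 1))	/* more than one bit set: not a command */
		return 0;
	return (mask & cmd) != 0;
}
```

A caller would do something like
"long mask = membarrier_call(CMD_QUERY, 0);" once, and then test
"cmd_supported(mask, CMD_SHARED)" before relying on the shared command.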

> 
>> .TP
>> .B MEMBARRIER_CMD_SHARED
>> Execute a memory barrier on all threads running on the system.
> 
> All threads on the system?
> 
>> Upon
>> return from system call, the caller thread is ensured that all running
>> threads have passed through a state where all memory accesses to
>> user-space addresses match program order between entry to and return
>> from the system call (non-running threads are de facto in such a
>> state). This covers threads from all processes running on the system.

I can reword this entire paragraph like this:

Ensure that all threads from all processes on the system pass through a
state where all memory accesses to user-space addresses match program
order between entry to and return from the membarrier system call.
All threads on the system are targeted by this command.
This command returns 0.



>> This command returns 0.
>> 
>> .PP
>> The
>> .I flags
>> argument is currently unused.
>> 
>> .PP
>> All memory accesses performed in program order from each targeted thread
> 
> What is a "targeted thread"? Some rewording is needed here.

I added a sentence to the command description above which clarifies
the notion of "targeted threads".

> 
>> is guaranteed to be ordered with respect to sys_membarrier(). If we use
>> the semantic "barrier()" to represent a compiler barrier forcing memory
>> accesses to be performed in program order across the barrier, and
>> smp_mb() to represent explicit memory barriers forcing full memory
>> ordering across the barrier, we have the following ordering table for
>> each pair of barrier(), sys_membarrier() and smp_mb():
>> 
>> The pair ordering is detailed as (O: ordered, X: not ordered):
>> 
>>                        barrier()   smp_mb() sys_membarrier()
>>        barrier()          X           X            O
>>        smp_mb()           X           O            O
>>        sys_membarrier()   O           O            O
>> 
>> .SH RETURN VALUE
>> On success, these system calls return zero.
> 
> This sentence seems out of place. We have one system call.
> And the different operations described above return
> nonzero values on success.

Updated with:

On success, this system call returns zero.  On error, \-1 is returned,
and
.I errno
is set appropriately.


> 
>> On error, \-1 is returned,
>> and
>> .I errno
>> is set appropriately.
>> For a given command, with flags argument set to 0, this system call is
>> guaranteed to always return the same value until reboot.
> 
> I don't understand the intent of the last sentence. What idea are you
> trying to convey?

What I'm trying to say is that an executable that invokes
the membarrier system call with a query command, or executes
some other command, only needs to check the return value and
errno for errors the first time it invokes the system call, and can
assume that the kernel will always return that same value
and errno for following calls. I'm limiting this guarantee to flags
arg == 0, since we may want to break this guarantee in the
future when specific flags are set.

This makes it easy for applications and libraries to check what
is available within a constructor, and then not have to do
error handling in the rest of the application.
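As an illustration of that constructor pattern, here is a rough sketch,
not taken from liburcu or from the patch. The names has_sys_membarrier,
probe_membarrier(), write_side_barrier() and read_side_barrier() are
invented for the example, the constants are copied from enum
membarrier_cmd, and __attribute__((constructor)) is a GCC/Clang
extension. It assumes the dynamic-check scheme mentioned in the
changelog: if the command is unavailable, both sides fall back to real
memory barriers.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Values copied from enum membarrier_cmd in the patch. */
#define CMD_QUERY	0
#define CMD_SHARED	(1 << 0)

/* Hypothetical flag: nonzero if MEMBARRIER_CMD_SHARED can be used. */
static int has_sys_membarrier;

/*
 * Query the kernel. With flags == 0 the answer is expected to stay
 * the same until reboot, so asking once is enough.
 */
static int probe_membarrier(void)
{
#ifdef __NR_membarrier
	long mask = syscall(__NR_membarrier, CMD_QUERY, 0);

	return mask >= 0 && (mask & CMD_SHARED) != 0;
#else
	return 0;	/* headers too old to know the syscall number */
#endif
}

/* Runs before main(); the only place where errors are examined. */
__attribute__((constructor))
static void membarrier_init(void)
{
	has_sys_membarrier = probe_membarrier();
}

/* Write side (infrequent): pays for the barrier on behalf of readers. */
static void write_side_barrier(void)
{
#ifdef __NR_membarrier
	if (has_sys_membarrier) {
		/* Return value ignored: support was established at init. */
		(void) syscall(__NR_membarrier, CMD_SHARED, 0);
		return;
	}
#endif
	__sync_synchronize();	/* fallback: full memory barrier */
}

/* Read side (frequent): compiler barrier only when the syscall is usable. */
static inline void read_side_barrier(void)
{
	if (has_sys_membarrier)
		__asm__ __volatile__("" ::: "memory");
	else
		__sync_synchronize();
}
```

The test of has_sys_membarrier on the read side is the kind of dynamic
availability check whose overhead shows up in the "dyn. check" benchmark
line of the changelog.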

Not sure how to best express this though.

> 
>> .SH ERRORS
>> .TP
>> .B ENOSYS
>> System call is not implemented.
>> .TP
>> .B EINVAL
>> Invalid arguments.
> 
> Would be clearer to say here: "cmd is invalid or flags is nonzero"

Will do.

Thanks!

Mathieu

> 
> Thanks,
> 
> Michael
> 
> 
>> ----- On Dec 4, 2015, at 4:44 PM, Michael Kerrisk mtk.manpages@gmail.com wrote:
>> 
>>> Hi Mathieu,
>>>
>>> In the patch below you have a man page type of text. Is that
>>> just plain text, or do you have some groff source somewhere?
>>>
>>> Thanks,
>>>
>>> Michael
>>>
>>>
>>> On 07/10/2015 10:58 PM, Mathieu Desnoyers wrote:
>>>> Here is an implementation of a new system call, sys_membarrier(), which
>>>> executes a memory barrier on all threads running on the system. It is
>>>> implemented by calling synchronize_sched(). It can be used to distribute
>>>> the cost of user-space memory barriers asymmetrically by transforming
>>>> pairs of memory barriers into pairs consisting of sys_membarrier() and a
>>>> compiler barrier. For synchronization primitives that distinguish
>>>> between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
>>>> read-side can be accelerated significantly by moving the bulk of the
>>>> memory barrier overhead to the write-side.
>>>>
>>>> The existing applications of which I am aware that would be improved by this
>>>> system call are as follows:
>>>>
>>>> * Through Userspace RCU library (http://urcu.so)
>>>>   - DNS server (Knot DNS) https://www.knot-dns.cz/
>>>>   - Network sniffer (http://netsniff-ng.org/)
>>>>   - Distributed object storage (https://sheepdog.github.io/sheepdog/)
>>>>   - User-space tracing (http://lttng.org)
>>>>   - Network storage system (https://www.gluster.org/)
>>>>   - Virtual routers
>>>>   (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
>>>>   - Financial software (https://lkml.org/lkml/2015/3/23/189)
>>>>
>>>> Those projects use RCU in userspace to increase read-side speed and
>>>> scalability compared to locking. Especially in the case of RCU used
>>>> by libraries, sys_membarrier can speed up the read-side by moving the
>>>> bulk of the memory barrier cost to synchronize_rcu().
>>>>
>>>> * Direct users of sys_membarrier
>>>>   - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
>>>>
>>>> Microsoft core dotnet GC developers are planning to use the mprotect()
>>>> side-effect of issuing memory barriers through IPIs as a way to implement
>>>> Windows FlushProcessWriteBuffers() on Linux. They are referring to
>>>> sys_membarrier in their github thread, specifically stating that
>>>> sys_membarrier() is what they are looking for.
>>>>
>>>> This implementation is based on kernel v4.1-rc8.
>>>>
>>>> To explain the benefit of this scheme, let's introduce two example threads:
>>>>
>>>> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
>>>> Thread B (frequent, e.g. executing liburcu
>>>> rcu_read_lock()/rcu_read_unlock())
>>>>
>>>> In a scheme where all smp_mb() in thread A are ordering memory accesses
>>>> with respect to smp_mb() present in Thread B, we can change each
>>>> smp_mb() within Thread A into calls to sys_membarrier() and each
>>>> smp_mb() within Thread B into compiler barriers "barrier()".
>>>>
>>>> Before the change, we had, for each smp_mb() pairs:
>>>>
>>>> Thread A                    Thread B
>>>> previous mem accesses       previous mem accesses
>>>> smp_mb()                    smp_mb()
>>>> following mem accesses      following mem accesses
>>>>
>>>> After the change, these pairs become:
>>>>
>>>> Thread A                    Thread B
>>>> prev mem accesses           prev mem accesses
>>>> sys_membarrier()            barrier()
>>>> follow mem accesses         follow mem accesses
>>>>
>>>> As we can see, there are two possible scenarios: either Thread B memory
>>>> accesses do not happen concurrently with Thread A accesses (1), or they
>>>> do (2).
>>>>
>>>> 1) Non-concurrent Thread A vs Thread B accesses:
>>>>
>>>> Thread A                    Thread B
>>>> prev mem accesses
>>>> sys_membarrier()
>>>> follow mem accesses
>>>>                             prev mem accesses
>>>>                             barrier()
>>>>                             follow mem accesses
>>>>
>>>> In this case, thread B accesses will be weakly ordered. This is OK,
>>>> because at that point, thread A is not particularly interested in
>>>> ordering them with respect to its own accesses.
>>>>
>>>> 2) Concurrent Thread A vs Thread B accesses
>>>>
>>>> Thread A                    Thread B
>>>> prev mem accesses           prev mem accesses
>>>> sys_membarrier()            barrier()
>>>> follow mem accesses         follow mem accesses
>>>>
>>>> In this case, thread B accesses, which are ensured to be in program
>>>> order thanks to the compiler barrier, will be "upgraded" to full
>>>> smp_mb() by synchronize_sched().
>>>>
>>>> * Benchmarks
>>>>
>>>> On Intel Xeon E5405 (8 cores)
>>>> (one thread is calling sys_membarrier, the other 7 threads are busy
>>>> looping)
>>>>
>>>> 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
>>>>
>>>> * User-space user of this system call: Userspace RCU library
>>>>
>>>> Both the signal-based and the sys_membarrier userspace RCU schemes
>>>> permit us to remove the memory barrier from the userspace RCU
>>>> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
>>>> accelerating them. These memory barriers are replaced by compiler
>>>> barriers on the read-side, and all matching memory barriers on the
>>>> write-side are turned into an invocation of a memory barrier on all
>>>> active threads in the process. By letting the kernel perform this
>>>> synchronization rather than dumbly sending a signal to every thread of
>>>> the process (as we currently do), we diminish the number of unnecessary wake
>>>> ups and only issue the memory barriers on active threads. Non-running
>>>> threads do not need to execute such barrier anyway, because these are
>>>> implied by the scheduler context switches.
>>>>
>>>> Results in liburcu:
>>>>
>>>> Operations in 10s, 6 readers, 2 writers:
>>>>
>>>> memory barriers in reader:    1701557485 reads, 2202847 writes
>>>> signal-based scheme:          9830061167 reads,    6700 writes
>>>> sys_membarrier:               9952759104 reads,     425 writes
>>>> sys_membarrier (dyn. check):  7970328887 reads,     425 writes
>>>>
>>>> The dynamic sys_membarrier availability check adds some overhead to
>>>> the read-side compared to the signal-based scheme, but besides that,
>>>> sys_membarrier slightly outperforms the signal-based scheme. However,
>>>> this non-expedited sys_membarrier implementation has a much slower grace
>>>> period than signal and memory barrier schemes.
>>>>
>>>> Besides diminishing the number of wake-ups, one major advantage of the
>>>> membarrier system call over the signal-based scheme is that it does not
>>>> need to reserve a signal. This plays much more nicely with libraries,
>>>> and with processes injected into for tracing purposes, for which we
>>>> cannot expect that signals will be unused by the application.
>>>>
>>>> An expedited version of this system call can be added later on to speed
>>>> up the grace period. Its implementation will likely depend on reading
>>>> the cpu_curr()->mm without holding each CPU's rq lock.
>>>>
>>>> This patch adds the system call to x86 and to asm-generic.
>>>>
>>>> [1] http://urcu.so
>>>>
>>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>>>> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
>>>> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>> CC: Steven Rostedt <rostedt@goodmis.org>
>>>> CC: Nicholas Miell <nmiell@comcast.net>
>>>> CC: Linus Torvalds <torvalds@linux-foundation.org>
>>>> CC: Ingo Molnar <mingo@redhat.com>
>>>> CC: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
>>>> CC: Lai Jiangshan <laijs@cn.fujitsu.com>
>>>> CC: Stephen Hemminger <stephen@networkplumber.org>
>>>> CC: Andrew Morton <akpm@linux-foundation.org>
>>>> CC: Thomas Gleixner <tglx@linutronix.de>
>>>> CC: Peter Zijlstra <peterz@infradead.org>
>>>> CC: David Howells <dhowells@redhat.com>
>>>> CC: Pranith Kumar <bobby.prani@gmail.com>
>>>> CC: Michael Kerrisk <mtk.manpages@gmail.com>
>>>> CC: linux-api@vger.kernel.org
>>>>
>>>> ---
>>>>
>>>> membarrier(2) man page:
>>>> --------------- snip -------------------
>>>> MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
>>>>
>>>> NAME
>>>>        membarrier - issue memory barriers on a set of threads
>>>>
>>>> SYNOPSIS
>>>>        #include <linux/membarrier.h>
>>>>
>>>>        int membarrier(int cmd, int flags);
>>>>
>>>> DESCRIPTION
>>>>        The cmd argument is one of the following:
>>>>
>>>>        MEMBARRIER_CMD_QUERY
>>>>               Query  the  set  of  supported commands. It returns a bitmask of
>>>>               supported commands.
>>>>
>>>>        MEMBARRIER_CMD_SHARED
>>>>               Execute a memory barrier on all threads running on  the  system.
>>>>               Upon  return from system call, the caller thread is ensured that
>>>>               all running threads have passed through a state where all memory
>>>>               accesses  to  user-space  addresses  match program order between
>>>>               entry to and return from the system  call  (non-running  threads
>>>>               are de facto in such a state). This covers threads from all pro‐
>>>>               cesses running on the system.  This command returns 0.
>>>>
>>>>        The flags argument needs to be 0. For future extensions.
>>>>
>>>>        All memory accesses performed  in  program  order  from  each  targeted
>>>>        thread is guaranteed to be ordered with respect to sys_membarrier(). If
>>>>        we use the semantic "barrier()" to represent a compiler barrier forcing
>>>>        memory  accesses  to  be performed in program order across the barrier,
>>>>        and smp_mb() to represent explicit memory barriers forcing full  memory
>>>>        ordering  across  the barrier, we have the following ordering table for
>>>>        each pair of barrier(), sys_membarrier() and smp_mb():
>>>>
>>>>        The pair ordering is detailed as (O: ordered, X: not ordered):
>>>>
>>>>                               barrier()   smp_mb() sys_membarrier()
>>>>               barrier()          X           X            O
>>>>               smp_mb()           X           O            O
>>>>               sys_membarrier()   O           O            O
>>>>
>>>> RETURN VALUE
>>>>        On success, these system calls return zero.  On error, -1 is  returned,
>>>>        and errno is set appropriately. For a given command, with flags
>>>>        argument set to 0, this system call is guaranteed to always return the
>>>>        same value until reboot.
>>>>
>>>> ERRORS
>>>>        ENOSYS System call is not implemented.
>>>>
>>>>        EINVAL Invalid arguments.
>>>>
>>>> Linux                             2015-04-15                     MEMBARRIER(2)
>>>> --------------- snip -------------------
>>>>
>>>> Changes since v18:
>>>> - Add unlikely() check to flags,
>>>> - Describe current users in changelog.
>>>>
>>>> Changes since v17:
>>>> - Update commit message.
>>>>
>>>> Changes since v16:
>>>> - Update documentation.
>>>> - Add man page to changelog.
>>>> - Build sys_membarrier on !CONFIG_SMP. It allows userspace applications
>>>>   to not care about the number of processors on the system.  Based on
>>>>   recommendations from Stephen Hemminger and Steven Rostedt.
>>>> - Check that flags argument is 0, update documentation to require it.
>>>>
>>>> Changes since v15:
>>>> - Add flags argument in addition to cmd.
>>>> - Update documentation.
>>>>
>>>> Changes since v14:
>>>> - Take care of Thomas Gleixner's comments.
>>>>
>>>> Changes since v13:
>>>> - Move to kernel/membarrier.c.
>>>> - Remove MEMBARRIER_PRIVATE flag.
>>>> - Add MAINTAINERS file entry.
>>>>
>>>> Changes since v12:
>>>> - Remove _FLAG suffix from uapi flags.
>>>> - Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
>>>> - Remove EXPEDITED mode. Only implement non-expedited for now, until
>>>>   reading the cpu_curr()->mm can be done without holding the CPU's rq
>>>>   lock.
>>>>
>>>> Changes since v11:
>>>> - 5 years have passed.
>>>> - Rebase on v3.19 kernel.
>>>> - Add futex-alike PRIVATE vs SHARED semantic: private for per-process
>>>>   barriers, non-private for memory mappings shared between processes.
>>>> - Simplify user API.
>>>> - Code refactoring.
>>>>
>>>> Changes since v10:
>>>> - Apply Randy's comments.
>>>> - Rebase on 2.6.34-rc4 -tip.
>>>>
>>>> Changes since v9:
>>>> - Clean up #ifdef CONFIG_SMP.
>>>>
>>>> Changes since v8:
>>>> - Go back to rq spin locks taken by sys_membarrier() rather than adding
>>>>   memory barriers to the scheduler. It implies a potential RoS
>>>>   (reduction of service) if sys_membarrier() is executed in a busy-loop
>>>>   by a user, but nothing more than what is already possible with other
>>>>   existing system calls, but saves memory barriers in the scheduler fast
>>>>   path.
>>>> - re-add the memory barrier comments to x86 switch_mm() as an example to
>>>>   other architectures.
>>>> - Update documentation of the memory barriers in sys_membarrier and
>>>>   switch_mm().
>>>> - Append execution scenarios to the changelog showing the purpose of
>>>>   each memory barrier.
>>>>
>>>> Changes since v7:
>>>> - Move spinlock-mb and scheduler related changes to separate patches.
>>>> - Add support for sys_membarrier on x86_32.
>>>> - Only x86 32/64 system calls are reserved in this patch. It is planned
>>>>   to incrementally reserve syscall IDs on other architectures as these
>>>>   are tested.
>>>>
>>>> Changes since v6:
>>>> - Remove some unlikely() not so unlikely.
>>>> - Add the proper scheduler memory barriers needed to only use the RCU
>>>>   read lock in sys_membarrier rather than take each runqueue spinlock:
>>>> - Move memory barriers from per-architecture switch_mm() to schedule()
>>>>   and finish_lock_switch(), where they clearly document that all data
>>>>   protected by the rq lock is guaranteed to have memory barriers issued
>>>>   between the scheduler update and the task execution. Replacing the
>>>>   spin lock acquire/release barriers with these memory barriers imply
>>>>   either no overhead (x86 spinlock atomic instruction already implies a
>>>>   full mb) or some hopefully small overhead caused by the upgrade of the
>>>>   spinlock acquire/release barriers to more heavyweight smp_mb().
>>>> - The "generic" version of spinlock-mb.h declares both a mapping to
>>>>   standard spinlocks and full memory barriers. Each architecture can
>>>>   specialize this header following their own need and declare
>>>>   CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
>>>> - Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
>>>>   implementations on a wide range of architecture would be welcome.
>>>>
>>>> Changes since v5:
>>>> - Plan ahead for extensibility by introducing mandatory/optional masks
>>>>   to the "flags" system call parameter. Past experience with accept4(),
>>>>   signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
>>>>   inotify_init1() indicates that this is the kind of thing we want to
>>>>   plan for. Return -EINVAL if the mandatory flags received are unknown.
>>>> - Create include/linux/membarrier.h to define these flags.
>>>> - Add MEMBARRIER_QUERY optional flag.
>>>>
>>>> Changes since v4:
>>>> - Add "int expedited" parameter, use synchronize_sched() in the
>>>>   non-expedited case. Thanks to Lai Jiangshan for making us consider
>>>>   seriously using synchronize_sched() to provide the low-overhead
>>>>   membarrier scheme.
>>>> - Check num_online_cpus() == 1, quickly return without doing anything.
>>>>
>>>> Changes since v3a:
>>>> - Confirm that each CPU indeed runs the current task's ->mm before
>>>>   sending an IPI. Ensures that we do not disturb RT tasks in the
>>>>   presence of lazy TLB shootdown.
>>>> - Document memory barriers needed in switch_mm().
>>>> - Surround helper functions with #ifdef CONFIG_SMP.
>>>>
>>>> Changes since v2:
>>>> - simply send-to-many to the mm_cpumask. It contains the list of
>>>>   processors we have to IPI to (which use the mm), and this mask is
>>>>   updated atomically.
>>>>
>>>> Changes since v1:
>>>> - Only perform the IPI in CONFIG_SMP.
>>>> - Only perform the IPI if the process has more than one thread.
>>>> - Only send IPIs to CPUs involved with threads belonging to our process.
>>>> - Adaptative IPI scheme (single vs many IPI with threshold).
>>>> - Issue smp_mb() at the beginning and end of the system call.
>>>> ---
>>>>  MAINTAINERS                            |  8 +++++
>>>>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>>>>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>>>>  include/linux/syscalls.h               |  2 ++
>>>>  include/uapi/asm-generic/unistd.h      |  4 ++-
>>>>  include/uapi/linux/Kbuild              |  1 +
>>>>  include/uapi/linux/membarrier.h        | 53 +++++++++++++++++++++++++++
>>>>  init/Kconfig                           | 12 +++++++
>>>>  kernel/Makefile                        |  1 +
>>>>  kernel/membarrier.c                    | 66 ++++++++++++++++++++++++++++++++++
>>>>  kernel/sys_ni.c                        |  3 ++
>>>>  11 files changed, 151 insertions(+), 1 deletion(-)
>>>>  create mode 100644 include/uapi/linux/membarrier.h
>>>>  create mode 100644 kernel/membarrier.c
>>>>
>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> index 0d70760..b560da6 100644
>>>> --- a/MAINTAINERS
>>>> +++ b/MAINTAINERS
>>>> @@ -6642,6 +6642,14 @@ W:	http://www.mellanox.com
>>>>  Q:	http://patchwork.ozlabs.org/project/netdev/list/
>>>>  F:	drivers/net/ethernet/mellanox/mlx4/en_*
>>>>  
>>>> +MEMBARRIER SUPPORT
>>>> +M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>> +M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
>>>> +L:	linux-kernel@vger.kernel.org
>>>> +S:	Supported
>>>> +F:	kernel/membarrier.c
>>>> +F:	include/uapi/linux/membarrier.h
>>>> +
>>>>  MEMORY MANAGEMENT
>>>>  L:	linux-mm@kvack.org
>>>>  W:	http://www.linux-mm.org
>>>> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl
>>>> b/arch/x86/entry/syscalls/syscall_32.tbl
>>>> index ef8187f..e63ad61 100644
>>>> --- a/arch/x86/entry/syscalls/syscall_32.tbl
>>>> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
>>>> @@ -365,3 +365,4 @@
>>>>  356	i386	memfd_create		sys_memfd_create
>>>>  357	i386	bpf			sys_bpf
>>>>  358	i386	execveat		sys_execveat			stub32_execveat
>>>> +359	i386	membarrier		sys_membarrier
>>>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
>>>> b/arch/x86/entry/syscalls/syscall_64.tbl
>>>> index 9ef32d5..87f3cd6 100644
>>>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>>>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>>>> @@ -329,6 +329,7 @@
>>>>  320	common	kexec_file_load		sys_kexec_file_load
>>>>  321	common	bpf			sys_bpf
>>>>  322	64	execveat		stub_execveat
>>>> +323	common	membarrier		sys_membarrier
>>>>  
>>>>  #
>>>>  # x32-specific system call numbers start at 512 to avoid cache impact
>>>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>>>> index b45c45b..d4ab99b 100644
>>>> --- a/include/linux/syscalls.h
>>>> +++ b/include/linux/syscalls.h
>>>> @@ -884,4 +884,6 @@ asmlinkage long sys_execveat(int dfd, const char __user
>>>> *filename,
>>>>  			const char __user *const __user *argv,
>>>>  			const char __user *const __user *envp, int flags);
>>>>  
>>>> +asmlinkage long sys_membarrier(int cmd, int flags);
>>>> +
>>>>  #endif
>>>> diff --git a/include/uapi/asm-generic/unistd.h
>>>> b/include/uapi/asm-generic/unistd.h
>>>> index e016bd9..8da542a 100644
>>>> --- a/include/uapi/asm-generic/unistd.h
>>>> +++ b/include/uapi/asm-generic/unistd.h
>>>> @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
>>>>  __SYSCALL(__NR_bpf, sys_bpf)
>>>>  #define __NR_execveat 281
>>>>  __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>>>> +#define __NR_membarrier 282
>>>> +__SYSCALL(__NR_membarrier, sys_membarrier)
>>>>  
>>>>  #undef __NR_syscalls
>>>> -#define __NR_syscalls 282
>>>> +#define __NR_syscalls 283
>>>>  
>>>>  /*
>>>>   * All syscalls below here should go away really,
>>>> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
>>>> index 1ff9942..e6f229a 100644
>>>> --- a/include/uapi/linux/Kbuild
>>>> +++ b/include/uapi/linux/Kbuild
>>>> @@ -251,6 +251,7 @@ header-y += mdio.h
>>>>  header-y += media.h
>>>>  header-y += media-bus-format.h
>>>>  header-y += mei.h
>>>> +header-y += membarrier.h
>>>>  header-y += memfd.h
>>>>  header-y += mempolicy.h
>>>>  header-y += meye.h
>>>> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
>>>> new file mode 100644
>>>> index 0000000..e0b108b
>>>> --- /dev/null
>>>> +++ b/include/uapi/linux/membarrier.h
>>>> @@ -0,0 +1,53 @@
>>>> +#ifndef _UAPI_LINUX_MEMBARRIER_H
>>>> +#define _UAPI_LINUX_MEMBARRIER_H
>>>> +
>>>> +/*
>>>> + * linux/membarrier.h
>>>> + *
>>>> + * membarrier system call API
>>>> + *
>>>> + * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>> + *
>>>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>>>> + * of this software and associated documentation files (the "Software"), to
>>>> deal
>>>> + * in the Software without restriction, including without limitation the rights
>>>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>>>> + * copies of the Software, and to permit persons to whom the Software is
>>>> + * furnished to do so, subject to the following conditions:
>>>> + *
>>>> + * The above copyright notice and this permission notice shall be included in
>>>> + * all copies or substantial portions of the Software.
>>>> + *
>>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
>>>> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
>>>> FROM,
>>>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>>>> THE
>>>> + * SOFTWARE.
>>>> + */
>>>> +
>>>> +/**
>>>> + * enum membarrier_cmd - membarrier system call command
>>>> + * @MEMBARRIER_CMD_QUERY:   Query the set of supported commands. It returns
>>>> + *                          a bitmask of valid commands.
>>>> + * @MEMBARRIER_CMD_SHARED:  Execute a memory barrier on all running threads.
>>>> + *                          Upon return from system call, the caller thread
>>>> + *                          is ensured that all running threads have passed
>>>> + *                          through a state where all memory accesses to
>>>> + *                          user-space addresses match program order between
>>>> + *                          entry to and return from the system call
>>>> + *                          (non-running threads are de facto in such a
>>>> + *                          state). This covers threads from all processes
>>>> + *                          running on the system. This command returns 0.
>>>> + *
>>>> + * Command to be passed to the membarrier system call. The commands need to
>>>> + * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>>>> + * the value 0.
>>>> + */
>>>> +enum membarrier_cmd {
>>>> +	MEMBARRIER_CMD_QUERY = 0,
>>>> +	MEMBARRIER_CMD_SHARED = (1 << 0),
>>>> +};
>>>> +
>>>> +#endif /* _UAPI_LINUX_MEMBARRIER_H */
>>>> diff --git a/init/Kconfig b/init/Kconfig
>>>> index af09b4f..4bba60f 100644
>>>> --- a/init/Kconfig
>>>> +++ b/init/Kconfig
>>>> @@ -1577,6 +1577,18 @@ config PCI_QUIRKS
>>>>  	  bugs/quirks. Disable this only if your target machine is
>>>>  	  unaffected by PCI quirks.
>>>>  
>>>> +config MEMBARRIER
>>>> +	bool "Enable membarrier() system call" if EXPERT
>>>> +	default y
>>>> +	help
>>>> +	  Enable the membarrier() system call that allows issuing memory
>>>> +	  barriers across all running threads, which can be used to distribute
>>>> +	  the cost of user-space memory barriers asymmetrically by transforming
>>>> +	  pairs of memory barriers into pairs consisting of membarrier() and a
>>>> +	  compiler barrier.
>>>> +
>>>> +	  If unsure, say Y.
>>>> +
>>>>  config EMBEDDED
>>>>  	bool "Embedded system"
>>>>  	option allnoconfig_y
>>>> diff --git a/kernel/Makefile b/kernel/Makefile
>>>> index 43c4c92..92a481b 100644
>>>> --- a/kernel/Makefile
>>>> +++ b/kernel/Makefile
>>>> @@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
>>>>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>>>>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>>>>  obj-$(CONFIG_TORTURE_TEST) += torture.o
>>>> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
>>>>  
>>>>  $(obj)/configs.o: $(obj)/config_data.h
>>>>  
>>>> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
>>>> new file mode 100644
>>>> index 0000000..536c727
>>>> --- /dev/null
>>>> +++ b/kernel/membarrier.c
>>>> @@ -0,0 +1,66 @@
>>>> +/*
>>>> + * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>> + *
>>>> + * membarrier system call
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or modify
>>>> + * it under the terms of the GNU General Public License as published by
>>>> + * the Free Software Foundation; either version 2 of the License, or
>>>> + * (at your option) any later version.
>>>> + *
>>>> + * This program is distributed in the hope that it will be useful,
>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>> + * GNU General Public License for more details.
>>>> + */
>>>> +
>>>> +#include <linux/syscalls.h>
>>>> +#include <linux/membarrier.h>
>>>> +
>>>> +/*
>>>> + * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>>>> + * except MEMBARRIER_CMD_QUERY.
>>>> + */
>>>> +#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
>>>> +
>>>> +/**
>>>> + * sys_membarrier - issue memory barriers on a set of threads
>>>> + * @cmd:   Takes command values defined in enum membarrier_cmd.
>>>> + * @flags: Currently needs to be 0. For future extensions.
>>>> + *
>>>> + * If this system call is not implemented, -ENOSYS is returned. If the
>>>> + * command specified does not exist, or if the command argument is invalid,
>>>> + * this system call returns -EINVAL. For a given command, with flags argument
>>>> + * set to 0, this system call is guaranteed to always return the same value
>>>> + * until reboot.
>>>> + *
>>>> + * All memory accesses performed in program order from each targeted thread
>>>> + * is guaranteed to be ordered with respect to sys_membarrier(). If we use
>>>> + * the semantic "barrier()" to represent a compiler barrier forcing memory
>>>> + * accesses to be performed in program order across the barrier, and
>>>> + * smp_mb() to represent explicit memory barriers forcing full memory
>>>> + * ordering across the barrier, we have the following ordering table for
>>>> + * each pair of barrier(), sys_membarrier() and smp_mb():
>>>> + *
>>>> + * The pair ordering is detailed as (O: ordered, X: not ordered):
>>>> + *
>>>> + *                        barrier()   smp_mb() sys_membarrier()
>>>> + *        barrier()          X           X            O
>>>> + *        smp_mb()           X           O            O
>>>> + *        sys_membarrier()   O           O            O
>>>> + */
>>>> +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>>>> +{
>>>> +	if (unlikely(flags))
>>>> +		return -EINVAL;
>>>> +	switch (cmd) {
>>>> +	case MEMBARRIER_CMD_QUERY:
>>>> +		return MEMBARRIER_CMD_BITMASK;
>>>> +	case MEMBARRIER_CMD_SHARED:
>>>> +		if (num_online_cpus() > 1)
>>>> +			synchronize_sched();
>>>> +		return 0;
>>>> +	default:
>>>> +		return -EINVAL;
>>>> +	}
>>>> +}
>>>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>>>> index 7995ef5..eb4fde0 100644
>>>> --- a/kernel/sys_ni.c
>>>> +++ b/kernel/sys_ni.c
>>>> @@ -243,3 +243,6 @@ cond_syscall(sys_bpf);
>>>>  
>>>>  /* execveat */
>>>>  cond_syscall(sys_execveat);
>>>> +
>>>> +/* membarrier */
>>>> +cond_syscall(sys_membarrier);
>>>>
>>>
>>>
>>> --
>>> Michael Kerrisk
>>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>>> Linux/UNIX System Programming Training: http://man7.org/training/
>> 
> 
> 
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86)
@ 2015-12-13 11:44           ` Mathieu Desnoyers
  0 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2015-12-13 11:44 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Andrew Morton, linux-kernel, linux-api,
	KOSAKI Motohiro, rostedt, Nicholas Miell, Linus Torvalds,
	Ingo Molnar, One Thousand Gnomes, Lai Jiangshan,
	Stephen Hemminger, Thomas Gleixner, Peter Zijlstra,
	David Howells, Pranith Kumar

----- On Dec 11, 2015, at 1:05 PM, Michael Kerrisk mtk.manpages@gmail.com wrote:

> Hi Mathieu,
> 
> On 12/05/2015 09:48 AM, Mathieu Desnoyers wrote:
>> Hi Michael,
>> 
>> Please find the membarrier man groff file attached. I re-integrated
>> some changes that went in initially only in the changelog text version
>> back onto this groff source.
>> 
>> Please let me know if you find any issue with it.
> 
> Thanks for the page, but there's a few issues. Could you please
> submit a new version as an inline patch, and see what can be
> done w.r.t. the following points (see man-pages(7) for some
> background on some of these points):
> 
> * Start DESCRIPTION off with a paragraph explaining what this system
>  call is about and why one would use it.
> 
> * Page needs VERSIONS, CONFORMING TO, and SEE ALSO sections.
> 
> * Is it possible to add a small EXAMPLE?
> 
> * In a NOTES section, it might be helpful to briefly explain the following
>  concepts:  memory barrier and program order.

Sure, I'll prepare a new version including these, as a patch
against the man-pages project. Some questions and answers
are inline below.

> 
> Some comments on individual pieces below:
> 
>> .TH MEMBARRIER 2 2015-04-15 "Linux" "Linux Programmer's Manual"
>> .SH NAME
>> membarrier \- issue memory barriers on a set of threads
>> .SH SYNOPSIS
>> .B #include <linux/membarrier.h>
>> .sp
>> .BI "int membarrier(int " cmd ", int " flags ");
>> .sp
>> .SH DESCRIPTION
>> The
>> .I cmd
>> argument is one of the following:
>> 
>> .TP
>> .B MEMBARRIER_CMD_QUERY
>> Query the set of supported commands. It returns a bitmask of supported
>> commands.
> 
> Not clear here. Does this mean that the 'cmd' argument is a bit mask,
> rather than an enumeration? I think that needs to be spelled out.
> Also, the text should mention that the returned bitmask excludes
> MEMBARRIER_CMD_QUERY. (Why, actually?)

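The 'cmd' argument is an enumeration, not a mask: it takes exactly one
command per call. Each command other than MEMBARRIER_CMD_QUERY is
assigned a single bit, so that the query can return the supported
commands OR'ed together as a bitmask. MEMBARRIER_CMD_QUERY is the
special case: its value is 0, so it can never be part of the mask of
supported commands. I'll describe this in the manpage.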
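
To make the query semantics concrete, here is a minimal sketch of how a
caller might test for a command. It assumes the UAPI header from this
patch is installed and that the C library provides no membarrier()
wrapper, so it goes through syscall(2). The helper names do_membarrier()
and membarrier_cmd_supported() are made up for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

/* Hypothetical thin wrapper around the raw system call. */
static int do_membarrier(int cmd, int flags)
{
	return (int) syscall(__NR_membarrier, cmd, flags);
}

/*
 * Hypothetical helper: returns 1 if cmd is supported, 0 if it is not,
 * and -1 if the query itself failed (errno is set, e.g. ENOSYS).
 * cmd must be a single-bit command, never MEMBARRIER_CMD_QUERY (0).
 */
static int membarrier_cmd_supported(int cmd)
{
	int mask;

	/* Exactly one bit set: rules out 0 and OR'ed combinations. */
	assert(cmd > 0 && (cmd & (cmd - 1)) == 0);

	mask = do_membarrier(MEMBARRIER_CMD_QUERY, 0);
	if (mask < 0)
		return -1;
	return (mask & cmd) ? 1 : 0;
}
```

A caller would typically invoke
membarrier_cmd_supported(MEMBARRIER_CMD_SHARED) once and treat both 0
and -1 as "not available".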

> 
>> .TP
>> .B MEMBARRIER_CMD_SHARED
>> Execute a memory barrier on all threads running on the system.
> 
> All threads on the system?
> 
>> Upon
>> return from system call, the caller thread is ensured that all running
>> threads have passed through a state where all memory accesses to
>> user-space addresses match program order between entry to and return
>> from the system call (non-running threads are de facto in such a
>> state). This covers threads from all processes running on the system.

I can reword this entire paragraph like this:

Ensure that all threads from all processes on the system pass through a
state where all memory accesses to user-space addresses match program
order between entry to and return from the membarrier system call.
All threads on the system are targeted by this command.
This command returns 0.



>> This command returns 0.
>> 
>> .PP
>> The
>> .I flags
>> argument is currently unused.
>> 
>> .PP
>> All memory accesses performed in program order from each targeted thread
> 
> What is a "targeted thread"? Some rewording is needed here.

I added a sentence to the command description above which clarifies
the notion of "targeted threads".

> 
>> is guaranteed to be ordered with respect to sys_membarrier(). If we use
>> the semantic "barrier()" to represent a compiler barrier forcing memory
>> accesses to be performed in program order across the barrier, and
>> smp_mb() to represent explicit memory barriers forcing full memory
>> ordering across the barrier, we have the following ordering table for
>> each pair of barrier(), sys_membarrier() and smp_mb():
>> 
>> The pair ordering is detailed as (O: ordered, X: not ordered):
>> 
>>                        barrier()   smp_mb() sys_membarrier()
>>        barrier()          X           X            O
>>        smp_mb()           X           O            O
>>        sys_membarrier()   O           O            O
>> 
>> .SH RETURN VALUE
>> On success, these system calls return zero.
> 
> This sentence seems out of place. We have one system call.
> And the different operations described above return
> nonzero values on success.

Updated with:

On success, this system call returns zero.  On error, \-1 is returned,
and
.I errno
is set appropriately.


> 
>> On error, \-1 is returned,
>> and
>> .I errno
>> is set appropriately.
>> For a given command, with flags argument set to 0, this system call is
>> guaranteed to always return the same value until reboot.
> 
> I don't understand the intent of the last sentence. What idea are you
> trying to convey?

What I'm trying to say is that an executable that invokes
the membarrier system call with a query command, or executes
some other command, only needs to check the return value and
errno for errors the first time it invokes the system call, and can
assume that the kernel will always return that same value
and errno for following calls. I'm limiting this guarantee to flags
arg == 0, since we may want to break this guarantee in the
future when specific flags are set.

This makes it easy for applications and libraries to check what
is available once, within a constructor, and then skip error
handling for membarrier in the rest of the application.

Not sure how to best express this though.
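
As an illustration of that usage pattern, here is a rough sketch, not
taken from liburcu or any other existing code. It assumes a GCC/Clang
toolchain (constructor attribute, __sync_synchronize()) and the UAPI
header from this patch. All function and variable names are
hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

/* Set once before main() runs, only read afterwards. */
static int has_sys_membarrier;

/* Hypothetical thin wrapper around the raw system call. */
static int do_membarrier(int cmd, int flags)
{
	return (int) syscall(__NR_membarrier, cmd, flags);
}

/* Probe the kernel a single time, at program or library load. */
static void __attribute__((constructor)) membarrier_probe(void)
{
	int mask = do_membarrier(MEMBARRIER_CMD_QUERY, 0);

	has_sys_membarrier = mask >= 0 && (mask & MEMBARRIER_CMD_SHARED);
}

/* Frequent side: compiler barrier only when membarrier is usable. */
static inline void reader_barrier(void)
{
	if (has_sys_membarrier)
		__asm__ __volatile__("" : : : "memory");
	else
		__sync_synchronize();	/* fall back to a full barrier */
}

/* Infrequent side: pays the cost on behalf of the readers. */
static inline void writer_barrier(void)
{
	if (has_sys_membarrier) {
		int ret = do_membarrier(MEMBARRIER_CMD_SHARED, 0);

		/*
		 * No error path: this relies on the "same value until
		 * reboot" guarantee for a command the probe reported
		 * as supported.
		 */
		assert(ret == 0);
		(void) ret;
	} else {
		__sync_synchronize();
	}
}
```

The only place that looks at errors is membarrier_probe(). Both barrier
helpers branch on the cached flag, which corresponds to the "dyn. check"
variant in the liburcu numbers from the changelog.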

> 
>> .SH ERRORS
>> .TP
>> .B ENOSYS
>> System call is not implemented.
>> .TP
>> .B EINVAL
>> Invalid arguments.
> 
> Would be clearer to say here: "cmd is invalid or flags is nonzero"

Will do.

Thanks!

Mathieu

> 
> Thanks,
> 
> Michael
> 
> 
>> ----- On Dec 4, 2015, at 4:44 PM, Michael Kerrisk mtk.manpages@gmail.com wrote:
>> 
>>> Hi Mathieu,
>>>
>>> In the patch below you have a man page type of text. Is that
>>> just plain text, or do you have some groff source somewhere?
>>>
>>> Thanks,
>>>
>>> Michael
>>>
>>>
>>> On 07/10/2015 10:58 PM, Mathieu Desnoyers wrote:
>>>> Here is an implementation of a new system call, sys_membarrier(), which
>>>> executes a memory barrier on all threads running on the system. It is
>>>> implemented by calling synchronize_sched(). It can be used to distribute
>>>> the cost of user-space memory barriers asymmetrically by transforming
>>>> pairs of memory barriers into pairs consisting of sys_membarrier() and a
>>>> compiler barrier. For synchronization primitives that distinguish
>>>> between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
>>>> read-side can be accelerated significantly by moving the bulk of the
>>>> memory barrier overhead to the write-side.
>>>>
>>>> The existing applications of which I am aware that would be improved by this
>>>> system call are as follows:
>>>>
>>>> * Through Userspace RCU library (http://urcu.so)
>>>>   - DNS server (Knot DNS) https://www.knot-dns.cz/
>>>>   - Network sniffer (http://netsniff-ng.org/)
>>>>   - Distributed object storage (https://sheepdog.github.io/sheepdog/)
>>>>   - User-space tracing (http://lttng.org)
>>>>   - Network storage system (https://www.gluster.org/)
>>>>   - Virtual routers
>>>>   (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
>>>>   - Financial software (https://lkml.org/lkml/2015/3/23/189)
>>>>
>>>> Those projects use RCU in userspace to increase read-side speed and
>>>> scalability compared to locking. Especially in the case of RCU used
>>>> by libraries, sys_membarrier can speed up the read-side by moving the
>>>> bulk of the memory barrier cost to synchronize_rcu().
>>>>
>>>> * Direct users of sys_membarrier
>>>>   - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
>>>>
>>>> Microsoft core dotnet GC developers are planning to use the mprotect()
>>>> side-effect of issuing memory barriers through IPIs as a way to implement
>>>> Windows FlushProcessWriteBuffers() on Linux. They are referring to
>>>> sys_membarrier in their github thread, specifically stating that
>>>> sys_membarrier() is what they are looking for.
>>>>
>>>> This implementation is based on kernel v4.1-rc8.
>>>>
>>>> To explain the benefit of this scheme, let's introduce two example threads:
>>>>
>>>> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
>>>> Thread B (frequent, e.g. executing liburcu
>>>> rcu_read_lock()/rcu_read_unlock())
>>>>
>>>> In a scheme where all smp_mb() in thread A are ordering memory accesses
>>>> with respect to smp_mb() present in Thread B, we can change each
>>>> smp_mb() within Thread A into calls to sys_membarrier() and each
>>>> smp_mb() within Thread B into compiler barriers "barrier()".
>>>>
>>>> Before the change, we had, for each smp_mb() pairs:
>>>>
>>>> Thread A                    Thread B
>>>> previous mem accesses       previous mem accesses
>>>> smp_mb()                    smp_mb()
>>>> following mem accesses      following mem accesses
>>>>
>>>> After the change, these pairs become:
>>>>
>>>> Thread A                    Thread B
>>>> prev mem accesses           prev mem accesses
>>>> sys_membarrier()            barrier()
>>>> follow mem accesses         follow mem accesses
>>>>
>>>> As we can see, there are two possible scenarios: either Thread B memory
>>>> accesses do not happen concurrently with Thread A accesses (1), or they
>>>> do (2).
>>>>
>>>> 1) Non-concurrent Thread A vs Thread B accesses:
>>>>
>>>> Thread A                    Thread B
>>>> prev mem accesses
>>>> sys_membarrier()
>>>> follow mem accesses
>>>>                             prev mem accesses
>>>>                             barrier()
>>>>                             follow mem accesses
>>>>
>>>> In this case, thread B accesses will be weakly ordered. This is OK,
>>>> because at that point, thread A is not particularly interested in
>>>> ordering them with respect to its own accesses.
>>>>
>>>> 2) Concurrent Thread A vs Thread B accesses
>>>>
>>>> Thread A                    Thread B
>>>> prev mem accesses           prev mem accesses
>>>> sys_membarrier()            barrier()
>>>> follow mem accesses         follow mem accesses
>>>>
>>>> In this case, thread B accesses, which are ensured to be in program
>>>> order thanks to the compiler barrier, will be "upgraded" to full
>>>> smp_mb() by synchronize_sched().
>>>>
>>>> * Benchmarks
>>>>
>>>> On Intel Xeon E5405 (8 cores)
>>>> (one thread is calling sys_membarrier, the other 7 threads are busy
>>>> looping)
>>>>
>>>> 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
>>>>
>>>> * User-space user of this system call: Userspace RCU library
>>>>
>>>> Both the signal-based and the sys_membarrier userspace RCU schemes
>>>> permit us to remove the memory barrier from the userspace RCU
>>>> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
>>>> accelerating them. These memory barriers are replaced by compiler
>>>> barriers on the read-side, and all matching memory barriers on the
>>>> write-side are turned into an invocation of a memory barrier on all
>>>> active threads in the process. By letting the kernel perform this
>>>> synchronization rather than dumbly sending a signal to every process
>>>> threads (as we currently do), we diminish the number of unnecessary wake
>>>> ups and only issue the memory barriers on active threads. Non-running
>>>> threads do not need to execute such barrier anyway, because these are
>>>> implied by the scheduler context switches.
>>>>
>>>> Results in liburcu:
>>>>
>>>> Operations in 10s, 6 readers, 2 writers:
>>>>
>>>> memory barriers in reader:    1701557485 reads, 2202847 writes
>>>> signal-based scheme:          9830061167 reads,    6700 writes
>>>> sys_membarrier:               9952759104 reads,     425 writes
>>>> sys_membarrier (dyn. check):  7970328887 reads,     425 writes
>>>>
>>>> The dynamic sys_membarrier availability check adds some overhead to
>>>> the read-side compared to the signal-based scheme, but besides that,
>>>> sys_membarrier slightly outperforms the signal-based scheme. However,
>>>> this non-expedited sys_membarrier implementation has a much slower grace
>>>> period than signal and memory barrier schemes.
>>>>
>>>> Besides diminishing the number of wake-ups, one major advantage of the
>>>> membarrier system call over the signal-based scheme is that it does not
>>>> need to reserve a signal. This plays much more nicely with libraries,
>>>> and with processes injected into for tracing purposes, for which we
>>>> cannot expect that signals will be unused by the application.
>>>>
>>>> An expedited version of this system call can be added later on to speed
>>>> up the grace period. Its implementation will likely depend on reading
>>>> the cpu_curr()->mm without holding each CPU's rq lock.
>>>>
>>>> This patch adds the system call to x86 and to asm-generic.
>>>>
>>>> [1] http://urcu.so
>>>>
>>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
>>>> Reviewed-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
>>>> Reviewed-by: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
>>>> CC: KOSAKI Motohiro <kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
>>>> CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
>>>> CC: Nicholas Miell <nmiell-Wuw85uim5zDR7s880joybQ@public.gmane.org>
>>>> CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>>>> CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>>> CC: Alan Cox <gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org>
>>>> CC: Lai Jiangshan <laijs-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>>>> CC: Stephen Hemminger <stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org>
>>>> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>>>> CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
>>>> CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
>>>> CC: David Howells <dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>>> CC: Pranith Kumar <bobby.prani-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>>>> CC: Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>>>> CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>
>>>> ---
>>>>
>>>> membarrier(2) man page:
>>>> --------------- snip -------------------
>>>> MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
>>>>
>>>> NAME
>>>>        membarrier - issue memory barriers on a set of threads
>>>>
>>>> SYNOPSIS
>>>>        #include <linux/membarrier.h>
>>>>
>>>>        int membarrier(int cmd, int flags);
>>>>
>>>> DESCRIPTION
>>>>        The cmd argument is one of the following:
>>>>
>>>>        MEMBARRIER_CMD_QUERY
>>>>               Query  the  set  of  supported commands. It returns a bitmask of
>>>>               supported commands.
>>>>
>>>>        MEMBARRIER_CMD_SHARED
>>>>               Execute a memory barrier on all threads running on  the  system.
>>>>               Upon  return from system call, the caller thread is ensured that
>>>>               all running threads have passed through a state where all memory
>>>>               accesses  to  user-space  addresses  match program order between
>>>>               entry to and return from the system  call  (non-running  threads
>>>>               are de facto in such a state). This covers threads from all pro‐
>>>>               cesses running on the system.  This command returns 0.
>>>>
>>>>        The flags argument needs to be 0. For future extensions.
>>>>
>>>>        All memory accesses performed  in  program  order  from  each  targeted
>>>>        thread is guaranteed to be ordered with respect to sys_membarrier(). If
>>>>        we use the semantic "barrier()" to represent a compiler barrier forcing
>>>>        memory  accesses  to  be performed in program order across the barrier,
>>>>        and smp_mb() to represent explicit memory barriers forcing full  memory
>>>>        ordering  across  the barrier, we have the following ordering table for
>>>>        each pair of barrier(), sys_membarrier() and smp_mb():
>>>>
>>>>        The pair ordering is detailed as (O: ordered, X: not ordered):
>>>>
>>>>                               barrier()   smp_mb() sys_membarrier()
>>>>               barrier()          X           X            O
>>>>               smp_mb()           X           O            O
>>>>               sys_membarrier()   O           O            O
>>>>
>>>> RETURN VALUE
>>>>        On success, these system calls return zero.  On error, -1 is  returned,
>>>>        and errno is set appropriately. For a given command, with flags
>>>>        argument set to 0, this system call is guaranteed to always return the
>>>>        same value until reboot.
>>>>
>>>> ERRORS
>>>>        ENOSYS System call is not implemented.
>>>>
>>>>        EINVAL Invalid arguments.
>>>>
>>>> Linux                             2015-04-15                     MEMBARRIER(2)
>>>> --------------- snip -------------------
>>>>
>>>> Changes since v18:
>>>> - Add unlikely() check to flags,
>>>> - Describe current users in changelog.
>>>>
>>>> Changes since v17:
>>>> - Update commit message.
>>>>
>>>> Changes since v16:
>>>> - Update documentation.
>>>> - Add man page to changelog.
>>>> - Build sys_membarrier on !CONFIG_SMP. It allows userspace applications
>>>>   to not care about the number of processors on the system.  Based on
>>>>   recommendations from Stephen Hemminger and Steven Rostedt.
>>>> - Check that flags argument is 0, update documentation to require it.
>>>>
>>>> Changes since v15:
>>>> - Add flags argument in addition to cmd.
>>>> - Update documentation.
>>>>
>>>> Changes since v14:
>>>> - Take care of Thomas Gleixner's comments.
>>>>
>>>> Changes since v13:
>>>> - Move to kernel/membarrier.c.
>>>> - Remove MEMBARRIER_PRIVATE flag.
>>>> - Add MAINTAINERS file entry.
>>>>
>>>> Changes since v12:
>>>> - Remove _FLAG suffix from uapi flags.
>>>> - Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
>>>> - Remove EXPEDITED mode. Only implement non-expedited for now, until
>>>>   reading the cpu_curr()->mm can be done without holding the CPU's rq
>>>>   lock.
>>>>
>>>> Changes since v11:
>>>> - 5 years have passed.
>>>> - Rebase on v3.19 kernel.
>>>> - Add futex-alike PRIVATE vs SHARED semantic: private for per-process
>>>>   barriers, non-private for memory mappings shared between processes.
>>>> - Simplify user API.
>>>> - Code refactoring.
>>>>
>>>> Changes since v10:
>>>> - Apply Randy's comments.
>>>> - Rebase on 2.6.34-rc4 -tip.
>>>>
>>>> Changes since v9:
>>>> - Clean up #ifdef CONFIG_SMP.
>>>>
>>>> Changes since v8:
>>>> - Go back to rq spin locks taken by sys_membarrier() rather than adding
>>>>   memory barriers to the scheduler. It implies a potential RoS
>>>>   (reduction of service) if sys_membarrier() is executed in a busy-loop
>>>>   by a user, but nothing more than what is already possible with other
>>>>   existing system calls, but saves memory barriers in the scheduler fast
>>>>   path.
>>>> - re-add the memory barrier comments to x86 switch_mm() as an example to
>>>>   other architectures.
>>>> - Update documentation of the memory barriers in sys_membarrier and
>>>>   switch_mm().
>>>> - Append execution scenarios to the changelog showing the purpose of
>>>>   each memory barrier.
>>>>
>>>> Changes since v7:
>>>> - Move spinlock-mb and scheduler related changes to separate patches.
>>>> - Add support for sys_membarrier on x86_32.
>>>> - Only x86 32/64 system calls are reserved in this patch. It is planned
>>>>   to incrementally reserve syscall IDs on other architectures as these
>>>>   are tested.
>>>>
>>>> Changes since v6:
>>>> - Remove some unlikely() not so unlikely.
>>>> - Add the proper scheduler memory barriers needed to only use the RCU
>>>>   read lock in sys_membarrier rather than take each runqueue spinlock:
>>>> - Move memory barriers from per-architecture switch_mm() to schedule()
>>>>   and finish_lock_switch(), where they clearly document that all data
>>>>   protected by the rq lock is guaranteed to have memory barriers issued
>>>>   between the scheduler update and the task execution. Replacing the
>>>>   spin lock acquire/release barriers with these memory barriers imply
>>>>   either no overhead (x86 spinlock atomic instruction already implies a
>>>>   full mb) or some hopefully small overhead caused by the upgrade of the
>>>>   spinlock acquire/release barriers to more heavyweight smp_mb().
>>>> - The "generic" version of spinlock-mb.h declares both a mapping to
>>>>   standard spinlocks and full memory barriers. Each architecture can
>>>>   specialize this header following their own need and declare
>>>>   CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
>>>> - Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
>>>>   implementations on a wide range of architectures would be welcome.
>>>>
>>>> Changes since v5:
>>>> - Plan ahead for extensibility by introducing mandatory/optional masks
>>>>   to the "flags" system call parameter. Past experience with accept4(),
>>>>   signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
>>>>   inotify_init1() indicates that this is the kind of thing we want to
>>>>   plan for. Return -EINVAL if the mandatory flags received are unknown.
>>>> - Create include/linux/membarrier.h to define these flags.
>>>> - Add MEMBARRIER_QUERY optional flag.
>>>>
>>>> Changes since v4:
>>>> - Add "int expedited" parameter, use synchronize_sched() in the
>>>>   non-expedited case. Thanks to Lai Jiangshan for making us consider
>>>>   seriously using synchronize_sched() to provide the low-overhead
>>>>   membarrier scheme.
>>>> - Check num_online_cpus() == 1, quickly return without doing anything.
>>>>
>>>> Changes since v3a:
>>>> - Confirm that each CPU indeed runs the current task's ->mm before
>>>>   sending an IPI. Ensures that we do not disturb RT tasks in the
>>>>   presence of lazy TLB shootdown.
>>>> - Document memory barriers needed in switch_mm().
>>>> - Surround helper functions with #ifdef CONFIG_SMP.
>>>>
>>>> Changes since v2:
>>>> - Simply send-to-many to the mm_cpumask. It contains the list of
>>>>   processors we have to IPI to (which use the mm), and this mask is
>>>>   updated atomically.
>>>>
>>>> Changes since v1:
>>>> - Only perform the IPI in CONFIG_SMP.
>>>> - Only perform the IPI if the process has more than one thread.
>>>> - Only send IPIs to CPUs involved with threads belonging to our process.
>>>> - Adaptive IPI scheme (single vs many IPI with threshold).
>>>> - Issue smp_mb() at the beginning and end of the system call.
>>>> ---
>>>>  MAINTAINERS                            |  8 +++++
>>>>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>>>>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>>>>  include/linux/syscalls.h               |  2 ++
>>>>  include/uapi/asm-generic/unistd.h      |  4 ++-
>>>>  include/uapi/linux/Kbuild              |  1 +
>>>>  include/uapi/linux/membarrier.h        | 53 +++++++++++++++++++++++++++
>>>>  init/Kconfig                           | 12 +++++++
>>>>  kernel/Makefile                        |  1 +
>>>>  kernel/membarrier.c                    | 66 ++++++++++++++++++++++++++++++++++
>>>>  kernel/sys_ni.c                        |  3 ++
>>>>  11 files changed, 151 insertions(+), 1 deletion(-)
>>>>  create mode 100644 include/uapi/linux/membarrier.h
>>>>  create mode 100644 kernel/membarrier.c
>>>>
>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> index 0d70760..b560da6 100644
>>>> --- a/MAINTAINERS
>>>> +++ b/MAINTAINERS
>>>> @@ -6642,6 +6642,14 @@ W:	http://www.mellanox.com
>>>>  Q:	http://patchwork.ozlabs.org/project/netdev/list/
>>>>  F:	drivers/net/ethernet/mellanox/mlx4/en_*
>>>>  
>>>> +MEMBARRIER SUPPORT
>>>> +M:	Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
>>>> +M:	"Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
>>>> +L:	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> +S:	Supported
>>>> +F:	kernel/membarrier.c
>>>> +F:	include/uapi/linux/membarrier.h
>>>> +
>>>>  MEMORY MANAGEMENT
>>>>  L:	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
>>>>  W:	http://www.linux-mm.org
>>>> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
>>>> index ef8187f..e63ad61 100644
>>>> --- a/arch/x86/entry/syscalls/syscall_32.tbl
>>>> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
>>>> @@ -365,3 +365,4 @@
>>>>  356	i386	memfd_create		sys_memfd_create
>>>>  357	i386	bpf			sys_bpf
>>>>  358	i386	execveat		sys_execveat			stub32_execveat
>>>> +359	i386	membarrier		sys_membarrier
>>>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
>>>> index 9ef32d5..87f3cd6 100644
>>>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>>>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>>>> @@ -329,6 +329,7 @@
>>>>  320	common	kexec_file_load		sys_kexec_file_load
>>>>  321	common	bpf			sys_bpf
>>>>  322	64	execveat		stub_execveat
>>>> +323	common	membarrier		sys_membarrier
>>>>  
>>>>  #
>>>>  # x32-specific system call numbers start at 512 to avoid cache impact
>>>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>>>> index b45c45b..d4ab99b 100644
>>>> --- a/include/linux/syscalls.h
>>>> +++ b/include/linux/syscalls.h
>>>> @@ -884,4 +884,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
>>>>  			const char __user *const __user *argv,
>>>>  			const char __user *const __user *envp, int flags);
>>>>  
>>>> +asmlinkage long sys_membarrier(int cmd, int flags);
>>>> +
>>>>  #endif
>>>> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
>>>> index e016bd9..8da542a 100644
>>>> --- a/include/uapi/asm-generic/unistd.h
>>>> +++ b/include/uapi/asm-generic/unistd.h
>>>> @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
>>>>  __SYSCALL(__NR_bpf, sys_bpf)
>>>>  #define __NR_execveat 281
>>>>  __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>>>> +#define __NR_membarrier 282
>>>> +__SYSCALL(__NR_membarrier, sys_membarrier)
>>>>  
>>>>  #undef __NR_syscalls
>>>> -#define __NR_syscalls 282
>>>> +#define __NR_syscalls 283
>>>>  
>>>>  /*
>>>>   * All syscalls below here should go away really,
>>>> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
>>>> index 1ff9942..e6f229a 100644
>>>> --- a/include/uapi/linux/Kbuild
>>>> +++ b/include/uapi/linux/Kbuild
>>>> @@ -251,6 +251,7 @@ header-y += mdio.h
>>>>  header-y += media.h
>>>>  header-y += media-bus-format.h
>>>>  header-y += mei.h
>>>> +header-y += membarrier.h
>>>>  header-y += memfd.h
>>>>  header-y += mempolicy.h
>>>>  header-y += meye.h
>>>> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
>>>> new file mode 100644
>>>> index 0000000..e0b108b
>>>> --- /dev/null
>>>> +++ b/include/uapi/linux/membarrier.h
>>>> @@ -0,0 +1,53 @@
>>>> +#ifndef _UAPI_LINUX_MEMBARRIER_H
>>>> +#define _UAPI_LINUX_MEMBARRIER_H
>>>> +
>>>> +/*
>>>> + * linux/membarrier.h
>>>> + *
>>>> + * membarrier system call API
>>>> + *
>>>> + * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>> + *
>>>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>>>> + * of this software and associated documentation files (the "Software"), to deal
>>>> + * in the Software without restriction, including without limitation the rights
>>>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>>>> + * copies of the Software, and to permit persons to whom the Software is
>>>> + * furnished to do so, subject to the following conditions:
>>>> + *
>>>> + * The above copyright notice and this permission notice shall be included in
>>>> + * all copies or substantial portions of the Software.
>>>> + *
>>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
>>>> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>>>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
>>>> + * SOFTWARE.
>>>> + */
>>>> +
>>>> +/**
>>>> + * enum membarrier_cmd - membarrier system call command
>>>> + * @MEMBARRIER_CMD_QUERY:   Query the set of supported commands. It returns
>>>> + *                          a bitmask of valid commands.
>>>> + * @MEMBARRIER_CMD_SHARED:  Execute a memory barrier on all running threads.
>>>> + *                          Upon return from system call, the caller thread
>>>> + *                          is ensured that all running threads have passed
>>>> + *                          through a state where all memory accesses to
>>>> + *                          user-space addresses match program order between
>>>> + *                          entry to and return from the system call
>>>> + *                          (non-running threads are de facto in such a
>>>> + *                          state). This covers threads from all processes
>>>> + *                          running on the system. This command returns 0.
>>>> + *
>>>> + * Command to be passed to the membarrier system call. The commands need to
>>>> + * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>>>> + * the value 0.
>>>> + */
>>>> +enum membarrier_cmd {
>>>> +	MEMBARRIER_CMD_QUERY = 0,
>>>> +	MEMBARRIER_CMD_SHARED = (1 << 0),
>>>> +};
>>>> +
>>>> +#endif /* _UAPI_LINUX_MEMBARRIER_H */
>>>> diff --git a/init/Kconfig b/init/Kconfig
>>>> index af09b4f..4bba60f 100644
>>>> --- a/init/Kconfig
>>>> +++ b/init/Kconfig
>>>> @@ -1577,6 +1577,18 @@ config PCI_QUIRKS
>>>>  	  bugs/quirks. Disable this only if your target machine is
>>>>  	  unaffected by PCI quirks.
>>>>  
>>>> +config MEMBARRIER
>>>> +	bool "Enable membarrier() system call" if EXPERT
>>>> +	default y
>>>> +	help
>>>> +	  Enable the membarrier() system call that allows issuing memory
>>>> +	  barriers across all running threads, which can be used to distribute
>>>> +	  the cost of user-space memory barriers asymmetrically by transforming
>>>> +	  pairs of memory barriers into pairs consisting of membarrier() and a
>>>> +	  compiler barrier.
>>>> +
>>>> +	  If unsure, say Y.
>>>> +
>>>>  config EMBEDDED
>>>>  	bool "Embedded system"
>>>>  	option allnoconfig_y
>>>> diff --git a/kernel/Makefile b/kernel/Makefile
>>>> index 43c4c92..92a481b 100644
>>>> --- a/kernel/Makefile
>>>> +++ b/kernel/Makefile
>>>> @@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
>>>>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>>>>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>>>>  obj-$(CONFIG_TORTURE_TEST) += torture.o
>>>> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
>>>>  
>>>>  $(obj)/configs.o: $(obj)/config_data.h
>>>>  
>>>> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
>>>> new file mode 100644
>>>> index 0000000..536c727
>>>> --- /dev/null
>>>> +++ b/kernel/membarrier.c
>>>> @@ -0,0 +1,66 @@
>>>> +/*
>>>> + * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>> + *
>>>> + * membarrier system call
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or modify
>>>> + * it under the terms of the GNU General Public License as published by
>>>> + * the Free Software Foundation; either version 2 of the License, or
>>>> + * (at your option) any later version.
>>>> + *
>>>> + * This program is distributed in the hope that it will be useful,
>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>> + * GNU General Public License for more details.
>>>> + */
>>>> +
>>>> +#include <linux/syscalls.h>
>>>> +#include <linux/membarrier.h>
>>>> +
>>>> +/*
>>>> + * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>>>> + * except MEMBARRIER_CMD_QUERY.
>>>> + */
>>>> +#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
>>>> +
>>>> +/**
>>>> + * sys_membarrier - issue memory barriers on a set of threads
>>>> + * @cmd:   Takes command values defined in enum membarrier_cmd.
>>>> + * @flags: Currently needs to be 0. For future extensions.
>>>> + *
>>>> + * If this system call is not implemented, -ENOSYS is returned. If the
>>>> + * command specified does not exist, or if the command argument is invalid,
>>>> + * this system call returns -EINVAL. For a given command, with flags argument
>>>> + * set to 0, this system call is guaranteed to always return the same value
>>>> + * until reboot.
>>>> + *
>>>> + * All memory accesses performed in program order from each targeted thread
>>>> + * are guaranteed to be ordered with respect to sys_membarrier(). If we use
>>>> + * the semantic "barrier()" to represent a compiler barrier forcing memory
>>>> + * accesses to be performed in program order across the barrier, and
>>>> + * smp_mb() to represent explicit memory barriers forcing full memory
>>>> + * ordering across the barrier, we have the following ordering table for
>>>> + * each pair of barrier(), sys_membarrier() and smp_mb():
>>>> + *
>>>> + * The pair ordering is detailed as (O: ordered, X: not ordered):
>>>> + *
>>>> + *                        barrier()   smp_mb() sys_membarrier()
>>>> + *        barrier()          X           X            O
>>>> + *        smp_mb()           X           O            O
>>>> + *        sys_membarrier()   O           O            O
>>>> + */
>>>> +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>>>> +{
>>>> +	if (unlikely(flags))
>>>> +		return -EINVAL;
>>>> +	switch (cmd) {
>>>> +	case MEMBARRIER_CMD_QUERY:
>>>> +		return MEMBARRIER_CMD_BITMASK;
>>>> +	case MEMBARRIER_CMD_SHARED:
>>>> +		if (num_online_cpus() > 1)
>>>> +			synchronize_sched();
>>>> +		return 0;
>>>> +	default:
>>>> +		return -EINVAL;
>>>> +	}
>>>> +}
>>>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>>>> index 7995ef5..eb4fde0 100644
>>>> --- a/kernel/sys_ni.c
>>>> +++ b/kernel/sys_ni.c
>>>> @@ -243,3 +243,6 @@ cond_syscall(sys_bpf);
>>>>  
>>>>  /* execveat */
>>>>  cond_syscall(sys_execveat);
>>>> +
>>>> +/* membarrier */
>>>> +cond_syscall(sys_membarrier);
>>>>
>>>
>>>
>>> --
>>> Michael Kerrisk
>>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>>> Linux/UNIX System Programming Training: http://man7.org/training/
>> 
> 
> 
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
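
For readers following the quoted patch, here is a minimal user-space sketch of how the two commands might be driven. It is not part of the patch. It assumes the C library provides no membarrier() wrapper, so it calls syscall(2) directly. The names membarrier_call(), mask_has_cmd(), try_shared_barrier(), CMD_QUERY and CMD_SHARED are hypothetical; the two command values are copied from enum membarrier_cmd above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local copies of the values in enum membarrier_cmd from the quoted patch. */
enum {
	CMD_QUERY  = 0,
	CMD_SHARED = 1 << 0,
};

/*
 * Hypothetical wrapper: returns the raw syscall result, or -1 with errno
 * set. Reports ENOSYS when the build headers do not define __NR_membarrier.
 */
static int membarrier_call(int cmd, int flags)
{
#ifdef __NR_membarrier
	return (int)syscall(__NR_membarrier, cmd, flags);
#else
	(void)cmd;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Returns 1 if the single-bit command cmd is present in a bitmask returned
 * by CMD_QUERY, 0 otherwise. A negative mask (failed query) never matches.
 */
static int mask_has_cmd(int mask, int cmd)
{
	return mask >= 0 && cmd != 0 && (mask & cmd) == cmd;
}

/*
 * Query first, then issue the system-wide barrier only if it is advertised.
 * Returns 0 when the barrier was executed, -1 when it is unavailable or
 * the call failed.
 */
static int try_shared_barrier(void)
{
	int mask = membarrier_call(CMD_QUERY, 0);

	if (!mask_has_cmd(mask, CMD_SHARED))
		return -1;
	return membarrier_call(CMD_SHARED, 0) == 0 ? 0 : -1;
}
```

A caller that gets -1 from try_shared_barrier() would presumably keep explicit memory barriers on both sides of its algorithm, rather than pairing membarrier() on the slow path with a compiler barrier on the fast path as the Kconfig help text describes.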

end of thread, other threads:[~2015-12-13 11:44 UTC | newest]

Thread overview: 35+ messages
2015-07-10 20:58 [PATCH 0/3] sys_membarrier (x86, generic) Mathieu Desnoyers
2015-07-10 20:58 ` [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86) Mathieu Desnoyers
2015-12-04 15:44   ` Michael Kerrisk (man-pages)
2015-12-05  8:48     ` Mathieu Desnoyers
2015-12-05  8:48       ` Mathieu Desnoyers
2015-12-11 18:05       ` Michael Kerrisk (man-pages)
2015-12-11 18:05         ` Michael Kerrisk (man-pages)
2015-12-13 11:44         ` Mathieu Desnoyers
2015-12-13 11:44           ` Mathieu Desnoyers
2015-07-10 20:58 ` [PATCH 2/3] selftests: add membarrier syscall test Mathieu Desnoyers
2015-08-31  6:54   ` Michael Ellerman
2015-08-31  6:54     ` Michael Ellerman
2015-09-01 17:11     ` Mathieu Desnoyers
2015-09-01 17:11       ` Mathieu Desnoyers
2015-09-01 18:32       ` Andy Lutomirski
2015-09-01 18:32         ` Andy Lutomirski
2015-09-03  9:33         ` Michael Ellerman
2015-09-03 15:47           ` Mathieu Desnoyers
2015-09-03 15:47             ` Mathieu Desnoyers
2015-09-04  3:36             ` Michael Ellerman
2015-09-04  3:36               ` Michael Ellerman
2015-09-07 16:01               ` Mathieu Desnoyers
2015-09-08  4:19                 ` Michael Ellerman
2015-09-08  4:19                   ` Michael Ellerman
2015-09-08 14:02                   ` Mathieu Desnoyers
2015-09-08 14:02                     ` Mathieu Desnoyers
2015-09-03  9:24       ` Michael Ellerman
2015-09-03  9:24         ` Michael Ellerman
2015-07-10 20:58 ` [PATCH 3/3] selftests: enhance " Mathieu Desnoyers
2015-10-05 23:21 ` [PATCH 0/3] sys_membarrier (x86, generic) Rusty Russell
2015-10-05 23:21   ` Rusty Russell
2015-10-06  2:17   ` Mathieu Desnoyers
2015-10-06  2:17     ` Mathieu Desnoyers
2015-10-08  6:22     ` Rusty Russell
2015-10-08  6:22       ` Rusty Russell
