* [PATCH v14 for 4.1] sys_membarrier(): system-wide memory barrier (x86)
@ 2015-03-25 19:03 Mathieu Desnoyers
  2015-03-27  9:21 ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2015-03-25 19:03 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: linux-kernel, Mathieu Desnoyers, Josh Triplett, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, David Howells

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system. It is
implemented by calling synchronize_sched(). It can be used to distribute
the cost of user-space memory barriers asymmetrically by transforming
pairs of memory barriers into pairs consisting of sys_membarrier() and a
compiler barrier. For synchronization primitives that distinguish
between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
read-side can be accelerated significantly by moving the bulk of the
memory barrier overhead to the write-side.

It is based on kernel v3.19, and also applies fine to master. I am
submitting it to get official acked-by/reviewed-by after the prior
rounds of changes, planning submission for inclusion into the upcoming
4.1 merge window.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (infrequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pair:

Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().
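As a sketch (not part of the patch), the asymmetric pairing above can be
illustrated in user-space C. The fallback path is an assumption: on kernels
without this patch, __NR_membarrier is undefined or syscall() fails, and the
sketch falls back to a regular full barrier, which is roughly what liburcu
does in practice.

```c
#include <unistd.h>
#include <sys/syscall.h>

#define barrier() __asm__ __volatile__("" ::: "memory")

int data, flag;

/* Thread A's "heavy" side: try sys_membarrier(); fall back to an
 * ordinary full memory barrier on kernels without the system call
 * (__NR_membarrier is only defined once the patch is merged). */
static void heavy_mb(void)
{
#ifdef __NR_membarrier
	if (syscall(__NR_membarrier, 0) == 0)
		return;
#endif
	__sync_synchronize();	/* fallback smp_mb() */
}

/* Thread B's "light" side: a compiler barrier suffices, because the
 * peer's heavy_mb() upgrades it to a full barrier system-wide. */
static void light_mb(void)
{
	barrier();
}

/* Thread B pattern: previous accesses, light_mb(), following accesses. */
void reader_side(void)
{
	data = 42;
	light_mb();
	flag = 1;
}

/* Thread A pattern: previous accesses, heavy_mb(), following accesses. */
void writer_side(void)
{
	heavy_mb();
}
```

The point of the asymmetry is that reader_side() pays only a compiler
barrier, while writer_side() absorbs the full synchronization cost.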

* Benchmarks

On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)

1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than naively sending a signal to every thread in
the process (as we currently do), we reduce the number of unnecessary
wake-ups and only issue the memory barriers on active threads.
Non-running threads do not need to execute such a barrier anyway,
because it is implied by the scheduler's context switches.
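A simplified sketch of what the memory-barrier-free read-side looks like
(the names are illustrative, not liburcu's actual code, and the per-thread
state is modeled as a plain global): the smp_mb() that a barrier-based
flavor would issue is downgraded to a compiler barrier, and the writer
compensates by invoking a memory barrier on all active threads.

```c
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Per-thread (TLS) in real liburcu; a plain global suffices here. */
unsigned long reader_nesting;

/* Read-side with smp_mb() downgraded to a compiler barrier; real
 * liburcu also snapshots a grace-period counter at the outermost
 * nesting level, omitted here for brevity. */
void rcu_read_lock_sketch(void)
{
	reader_nesting++;
	barrier();		/* formerly smp_mb() */
}

void rcu_read_unlock_sketch(void)
{
	barrier();		/* formerly smp_mb() */
	reader_nesting--;
}
```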

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 3129842 writes
signal-based scheme:          9825306874 reads,    5386 writes
sys_membarrier:               7992076602 reads,     220 writes

The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
the sys_membarrier scheme comes close to the read-side performance of
the signal-based scheme. However, this non-expedited sys_membarrier
implementation has a much slower grace period than the signal-based
and memory-barrier schemes.

An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.

This patch only adds the system call to x86.

[1] http://urcu.so

Changes since v13:
- Move to kernel/membarrier.c.
- Remove MEMBARRIER_PRIVATE flag.
- Add MAINTAINERS file entry.

Changes since v12:
- Remove _FLAG suffix from uapi flags.
- Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
- Remove EXPEDITED mode. Only implement non-expedited for now, until
  reading the cpu_curr()->mm can be done without holding the CPU's rq
  lock.

Changes since v11:
- 5 years have passed.
- Rebase on v3.19 kernel.
- Add futex-like PRIVATE vs SHARED semantics: private for per-process
  barriers, non-private for memory mappings shared between processes.
- Simplify user API.
- Code refactoring.

Changes since v10:
- Apply Randy's comments.
- Rebase on 2.6.34-rc4 -tip.

Changes since v9:
- Clean up #ifdef CONFIG_SMP.

Changes since v8:
- Go back to rq spin locks taken by sys_membarrier() rather than adding
  memory barriers to the scheduler. This implies a potential RoS
  (reduction of service) if sys_membarrier() is executed in a busy loop
  by a user, but nothing more than what is already possible with other
  existing system calls; it also saves memory barriers in the scheduler
  fast path.
- re-add the memory barrier comments to x86 switch_mm() as an example to
  other architectures.
- Update documentation of the memory barriers in sys_membarrier and
  switch_mm().
- Append execution scenarios to the changelog showing the purpose of
  each memory barrier.

Changes since v7:
- Move spinlock-mb and scheduler related changes to separate patches.
- Add support for sys_membarrier on x86_32.
- Only x86 32/64 system calls are reserved in this patch. It is planned
  to incrementally reserve syscall IDs on other architectures as these
  are tested.

Changes since v6:
- Remove some unlikely() that were not so unlikely.
- Add the proper scheduler memory barriers needed to only use the RCU
  read lock in sys_membarrier rather than take each runqueue spinlock:
- Move memory barriers from per-architecture switch_mm() to schedule()
  and finish_lock_switch(), where they clearly document that all data
  protected by the rq lock is guaranteed to have memory barriers issued
  between the scheduler update and the task execution. Replacing the
  spin lock acquire/release barriers with these memory barriers implies
  either no overhead (x86 spinlock atomic instruction already implies a
  full mb) or some hopefully small overhead caused by the upgrade of the
  spinlock acquire/release barriers to more heavyweight smp_mb().
- The "generic" version of spinlock-mb.h declares both a mapping to
  standard spinlocks and full memory barriers. Each architecture can
  specialize this header following their own need and declare
  CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
- Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
  implementations on a wide range of architectures would be welcome.

Changes since v5:
- Plan ahead for extensibility by introducing mandatory/optional masks
  to the "flags" system call parameter. Past experience with accept4(),
  signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
  inotify_init1() indicates that this is the kind of thing we want to
  plan for. Return -EINVAL if the mandatory flags received are unknown.
- Create include/linux/membarrier.h to define these flags.
- Add MEMBARRIER_QUERY optional flag.
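The mandatory-flags check described above can be mirrored in a few lines.
This is a user-space replica for illustration only (the constant matches the
patch's uapi header; the function name is suffixed to mark it as a model):

```c
#include <errno.h>

#define MEMBARRIER_QUERY (1 << 31)

/* User-space replica of the kernel-side check: reject any flag bit
 * outside the recognized set, exactly as the patch does. */
int membarrier_validate_flags_model(int flags)
{
	if (flags & ~MEMBARRIER_QUERY)
		return -EINVAL;
	return 0;
}
```

A process can therefore issue sys_membarrier(MEMBARRIER_QUERY | wanted_flags)
once at startup: 0 means the flags are supported, -EINVAL means they are not,
and -ENOSYS means the kernel predates the system call.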

Changes since v4:
- Add "int expedited" parameter, use synchronize_sched() in the
  non-expedited case. Thanks to Lai Jiangshan for making us consider
  seriously using synchronize_sched() to provide the low-overhead
  membarrier scheme.
- Check num_online_cpus() == 1, and quickly return without doing
  anything.

Changes since v3a:
- Confirm that each CPU indeed runs the current task's ->mm before
  sending an IPI. Ensures that we do not disturb RT tasks in the
  presence of lazy TLB shootdown.
- Document memory barriers needed in switch_mm().
- Surround helper functions with #ifdef CONFIG_SMP.

Changes since v2:
- Simply send-to-many IPIs to the mm_cpumask. It contains the list of
  processors we have to IPI (those using the mm), and this mask is
  updated atomically.

Changes since v1:
- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptive IPI scheme (single vs many IPI with threshold).
- Issue smp_mb() at the beginning and end of the system call.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Nicholas Miell <nmiell@comcast.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: David Howells <dhowells@redhat.com>
---
 MAINTAINERS                       |    8 ++++
 arch/x86/syscalls/syscall_32.tbl  |    1 +
 arch/x86/syscalls/syscall_64.tbl  |    1 +
 include/linux/syscalls.h          |    2 +
 include/uapi/asm-generic/unistd.h |    4 +-
 include/uapi/linux/Kbuild         |    1 +
 include/uapi/linux/membarrier.h   |   57 ++++++++++++++++++++++++++++
 init/Kconfig                      |   12 ++++++
 kernel/Makefile                   |    1 +
 kernel/membarrier.c               |   75 +++++++++++++++++++++++++++++++++++++
 kernel/sys_ni.c                   |    3 +
 11 files changed, 164 insertions(+), 1 deletions(-)
 create mode 100644 include/uapi/linux/membarrier.h
 create mode 100644 kernel/membarrier.c

diff --git a/MAINTAINERS b/MAINTAINERS
index d66a97d..7fbb698 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6206,6 +6206,14 @@ W:	http://www.mellanox.com
 Q:	http://patchwork.ozlabs.org/project/netdev/list/
 F:	drivers/net/ethernet/mellanox/mlx4/en_*
 
+MEMBARRIER SUPPORT
+M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	kernel/membarrier.c
+F:	include/uapi/linux/membarrier.h
+
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
 W:	http://www.linux-mm.org
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..439415f 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	membarrier		sys_membarrier
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..823130d 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	common	membarrier		sys_membarrier
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 85893d7..058ec0a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -882,4 +882,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
 			const char __user *const __user *argv,
 			const char __user *const __user *envp, int flags);
 
+asmlinkage long sys_membarrier(int flags);
+
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..8da542a 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
 __SYSCALL(__NR_bpf, sys_bpf)
 #define __NR_execveat 281
 __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_membarrier 282
+__SYSCALL(__NR_membarrier, sys_membarrier)
 
 #undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283
 
 /*
  * All syscalls below here should go away really,
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 00b10002..c5b0dbf 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -248,6 +248,7 @@ header-y += mdio.h
 header-y += media.h
 header-y += media-bus-format.h
 header-y += mei.h
+header-y += membarrier.h
 header-y += memfd.h
 header-y += mempolicy.h
 header-y += meye.h
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
new file mode 100644
index 0000000..b6f8f40
--- /dev/null
+++ b/include/uapi/linux/membarrier.h
@@ -0,0 +1,57 @@
+#ifndef _UAPI_LINUX_MEMBARRIER_H
+#define _UAPI_LINUX_MEMBARRIER_H
+
+/*
+ * linux/membarrier.h
+ *
+ * membarrier system call API
+ *
+ * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/*
+ * All memory accesses performed in program order from each thread on
+ * the system are guaranteed to be ordered with respect to sys_membarrier().
+ * If we use the semantic "barrier()" to represent a compiler barrier
+ * forcing memory accesses to be performed in program order across the
+ * barrier, and smp_mb() to represent explicit memory barriers forcing
+ * full memory ordering across the barrier, we have the following
+ * ordering table for each pair of barrier(), sys_membarrier() and
+ * smp_mb() :
+ *
+ * The pair ordering is detailed as (O: ordered, X: not ordered):
+ *
+ *                        barrier()   smp_mb() sys_membarrier()
+ *        barrier()          X           X            O
+ *        smp_mb()           X           O            O
+ *        sys_membarrier()   O           O            O
+ */
+
+/* System call membarrier "flags" argument. */
+enum {
+	/*
+	 * Query whether the rest of the specified flags are supported,
+	 * without performing synchronization.
+	 */
+	MEMBARRIER_QUERY = (1 << 31),
+};
+
+#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/init/Kconfig b/init/Kconfig
index 9afb971..2452f3c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1568,6 +1568,18 @@ config PCI_QUIRKS
 	  bugs/quirks. Disable this only if your target machine is
 	  unaffected by PCI quirks.
 
+config MEMBARRIER
+	bool "Enable membarrier() system call" if EXPERT
+	default y
+	help
+	  Enable the membarrier() system call that allows issuing memory
+	  barriers across all running threads, which can be used to distribute
+	  the cost of user-space memory barriers asymmetrically by transforming
+	  pairs of memory barriers into pairs consisting of membarrier() and a
+	  compiler barrier.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index a59481a..b572ced 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -95,6 +95,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/membarrier.c b/kernel/membarrier.c
new file mode 100644
index 0000000..3077e94
--- /dev/null
+++ b/kernel/membarrier.c
@@ -0,0 +1,75 @@
+/*
+ * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * membarrier system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/membarrier.h>
+
+static int membarrier_validate_flags(int flags)
+{
+	/* Check for unrecognized flags. */
+	if (flags & ~MEMBARRIER_QUERY)
+		return -EINVAL;
+	return 0;
+}
+
+#ifdef CONFIG_SMP
+
+/*
+ * sys_membarrier - issue memory barrier on all running threads
+ * @flags: MEMBARRIER_QUERY:
+ *             Query whether the rest of the specified flags are supported,
+ *             without performing synchronization.
+ *
+ * return values: Returns -EINVAL if the flags are incorrect. Testing
+ * for kernel sys_membarrier support can be done by checking for -ENOSYS
+ * return value.  Return value of 0 indicates success. For a given set
+ * of flags on a given kernel, this system call will always return the
+ * same value. It is therefore correct to check the return value only
+ * once during a process lifetime, setting MEMBARRIER_QUERY to only
+ * check if the flags are supported, without performing any
+ * synchronization.
+ *
+ * This system call executes a memory barrier on all running threads.
+ * Upon completion, the caller thread is ensured that all running
+ * threads have passed through a state where all memory accesses to
+ * user-space addresses match program order. (non-running threads are de
+ * facto in such a state.)
+ *
+ * On uniprocessor systems, this system call simply returns 0 after
+ * validating the arguments, so user-space knows it is implemented.
+ */
+SYSCALL_DEFINE1(membarrier, int, flags)
+{
+	int retval;
+
+	retval = membarrier_validate_flags(flags);
+	if (retval)
+		goto end;
+	if (unlikely(flags & MEMBARRIER_QUERY) || num_online_cpus() == 1)
+		goto end;
+	synchronize_sched();
+end:
+	return retval;
+}
+
+#else /* !CONFIG_SMP */
+
+SYSCALL_DEFINE1(membarrier, int, flags)
+{
+	return membarrier_validate_flags(flags);
+}
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..5913b84 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -229,3 +229,6 @@ cond_syscall(sys_bpf);
 
 /* execveat */
 cond_syscall(sys_execveat);
+
+/* membarrier */
+cond_syscall(sys_membarrier);
-- 
1.7.7.3



* Re: [PATCH v14 for 4.1] sys_membarrier(): system-wide memory barrier (x86)
  2015-03-25 19:03 [PATCH v14 for 4.1] sys_membarrier(): system-wide memory barrier (x86) Mathieu Desnoyers
@ 2015-03-27  9:21 ` Paul E. McKenney
  0 siblings, 0 replies; 11+ messages in thread
From: Paul E. McKenney @ 2015-03-27  9:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Josh Triplett, KOSAKI Motohiro, Steven Rostedt,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, David Howells

On Wed, Mar 25, 2015 at 03:03:47PM -0400, Mathieu Desnoyers wrote:
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads running on the system. It is
> implemented by calling synchronize_sched(). It can be used to distribute
> the cost of user-space memory barriers asymmetrically by transforming
> pairs of memory barriers into pairs consisting of sys_membarrier() and a
> compiler barrier. For synchronization primitives that distinguish
> between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
> read-side can be accelerated significantly by moving the bulk of the
> memory barrier overhead to the write-side.

[ . . . ]

> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> CC: Josh Triplett <josh@joshtriplett.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Nicholas Miell <nmiell@comcast.net>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
> CC: Lai Jiangshan <laijs@cn.fujitsu.com>
> CC: Stephen Hemminger <stephen@networkplumber.org>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: David Howells <dhowells@redhat.com>

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

[ . . . ]


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v14 for 4.1] sys_membarrier(): system-wide memory barrier (x86)
  2015-04-14 20:17         ` Mathieu Desnoyers
@ 2015-04-14 21:05           ` Thomas Gleixner
  0 siblings, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2015-04-14 21:05 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, linux-kernel, Josh Triplett, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Peter Zijlstra,
	David Howells

On Tue, 14 Apr 2015, Mathieu Desnoyers wrote:
> If we go for a single active flag at a time, I would call that
> "cmd" rather than "flags". Each command would be a power
> of two. Only one cmd could be passed as argument (no "or" mask).
> QUERY would return a mask of the supported commands.

That's what I told you, except that I named it opmode :)

Thanks,

	tglx



* Re: [PATCH v14 for 4.1] sys_membarrier(): system-wide memory barrier (x86)
  2015-04-14 19:35       ` Thomas Gleixner
@ 2015-04-14 20:17         ` Mathieu Desnoyers
  2015-04-14 21:05           ` Thomas Gleixner
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2015-04-14 20:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, linux-kernel, Josh Triplett, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Peter Zijlstra,
	David Howells

----- Original Message -----
> On Tue, 14 Apr 2015, Mathieu Desnoyers wrote:
> > Thinking about it a bit more, one reason for doing the QUERY along
> > with the exact set of flags queried allows us to do more than just
> > returning which flags are supported: it allows us to tell userspace
> > whether the combination of flags used is valid or not.
> > 
> > For instance, if we add a MEMBARRIER_PRIVATE flag in a future release
> > to issue memory barriers only to other threads from the same process,
> > and we add a MEMBARRIER_EXPEDITED which uses IPIs to issue those
> > barriers, we could very well have a situation where using
> > 
> >   EXPEDITED | PRIVATE  would be valid (only sending IPIs to CPUs
> >   running threads from the same process)
> > 
> > but
> > 
> >   EXPEDITED alone would be invalid (-EINVAL), until we figure out
> >   how to expedite memory barriers to all processors without impacting
> >   other processes, if at all possible.
> > 
> > Using QUERY with an empty set of flags could however return the set of
> > flags supported, which could be a nice feature. Anyway, I think
> > the "0" flag should be the basic always correct configuration that
> > is always supported, otherwise we'd have -ENOSYS. Therefore, querying
> > whether the empty set of flags is supported has little value, other
> > than checking for -ENOSYS.
> > 
> > So considering the above, the typical use of this query method from
> > library initialization would be:
> > 
> > int supported_flags = sys_membarrier(MEMBARRIER_QUERY);
> > 
> > ... check for -ENOSYS ....
> > ... check whether the flags we need are supported ...
> > 
> > if (sys_membarrier(MEMBARRIER_QUERY | flag1 | flag2))
> >   goto error;
> > 
> > then we are guaranteed that using sys_membarrier(flag1 | flag2)
> > will always succeed within the application, without needing to
> > handle errors every time it is used. This property is useful
> > to implement a synchronize_rcu() that returns "void" and simplify
> > error handling within the application.
> 
> So how many of these "flags" are you planning to implement and how
> many valid combinations are going to exist?
> 
> I doubt it's more than a dozen. So I prefer explicit operation modes
> for the valid ones rather than having a random pile of "flags".

I don't expect many, so indeed your approach would allow
listing the valid flags, and using them as "one-hot".

If we go for a single active flag at a time, I would call that
"cmd" rather than "flags". Each command would be a power
of two. Only one cmd could be passed as argument (no "or" mask).
QUERY would return a mask of the supported commands.

Thoughts?

Thanks,

Mathieu

> 
> Thanks,
> 
> 	tglx
> 
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH v14 for 4.1] sys_membarrier(): system-wide memory barrier (x86)
  2015-04-14 19:22     ` Mathieu Desnoyers
@ 2015-04-14 19:35       ` Thomas Gleixner
  2015-04-14 20:17         ` Mathieu Desnoyers
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2015-04-14 19:35 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, linux-kernel, Josh Triplett, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Peter Zijlstra,
	David Howells

On Tue, 14 Apr 2015, Mathieu Desnoyers wrote:
> Thinking about it a bit more, one reason for doing the QUERY along
> with the exact set of flags queried allows us to do more than just
> returning which flags are supported: it allows us to tell userspace
> whether the combination of flags used is valid or not.
> 
> For instance, if we add a MEMBARRIER_PRIVATE flag in a future release
> to issue memory barriers only to other threads from the same process,
> and we add a MEMBARRIER_EXPEDITED which uses IPIs to issue those
> barriers, we could very well have a situation where using
> 
>   EXPEDITED | PRIVATE  would be valid (only sending IPIs to CPUs
>   running threads from the same process)
> 
> but
> 
>   EXPEDITED alone would be invalid (-EINVAL), until we figure out
>   how to expedite memory barriers to all processors without impacting
>   other processes, if at all possible.
> 
> Using QUERY with an empty set of flags could however return the set of
> flags supported, which could be a nice feature. Anyway, I think
> the "0" flag should be the basic always correct configuration that
> is always supported, otherwise we'd have -ENOSYS. Therefore, querying
> whether the empty set of flags is supported has little value, other
> than checking for -ENOSYS.
> 
> So considering the above, the typical use of this query method from
> library initialization would be:
> 
> int supported_flags = sys_membarrier(MEMBARRIER_QUERY);
> 
> ... check for -ENOSYS ....
> ... check whether the flags we need are supported ...
> 
> if (sys_membarrier(MEMBARRIER_QUERY | flag1 | flag2))
>   goto error;
> 
> then we are guaranteed that using sys_membarrier(flag1 | flag2)
> will always succeed within the application, without needing to
> handle errors every time it is used. This property is useful
> to implement a synchronize_rcu() that returns "void" and simplify
> error handling within the application.

So how many of these "flags" are you planning to implement and how
many valid combinations are going to exist?

I doubt it's more than a dozen. So I prefer explicit operation modes
for the valid ones rather than having a random pile of "flags".

Thanks,

	tglx




* Re: [PATCH v14 for 4.1] sys_membarrier(): system-wide memory barrier (x86)
  2015-04-14 19:02   ` Mathieu Desnoyers
  2015-04-14 19:22     ` Mathieu Desnoyers
@ 2015-04-14 19:31     ` Thomas Gleixner
  1 sibling, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2015-04-14 19:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, linux-kernel, Josh Triplett, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Peter Zijlstra,
	David Howells

On Tue, 14 Apr 2015, Mathieu Desnoyers wrote:
> So the question would be: should we introduce this syscall
> in different patches for each architecture, or should
> we add them all in one go ? There is nothing fundamentally
> x86-specific to the implementation of this system call.

We can add that in one go for the generic ones. I was merely pointing
out the inconsistency of your changelog.
 
> > 
> > > +/* System call membarrier "flags" argument. */
> > > +enum {
> > > +	/*
> > > +	 * Query whether the rest of the specified flags are supported,
> > > +	 * without performing synchronization.
> > > +	 */
> > 
> > Docbook has support for enums.
> 
> OK, how about the following ?
> 
> /**
>  * enum membarrier_flags - membarrier system call "flags" argument bitmask.
>  *
>  * Bitmask of flags to be passed to the membarrier system call. The "flags"
>  * parameter can be either 0 or many of those flags combined with a bitwise
>  * "or".

So now add the explanation for the implemented flags.

   * MEMBARRIER_QUERY:	   Insert blurb ....

> Since this type is used as a system call ABI, I'm worried that a compiler
> implementation may compile a user-space program with an enum representation
> of a "char", whereas the kernel would expect an integer, thus causing an
> ABI issue, where userland would fail to clear the high-order bits of the
> register, and a more recent kernel would care about those bits (if we add
> new flags in future kernel releases).

Sigh, no. The compiler CANNOT ever truncate the enum:

 "The choice of type is implementation-defined,108) but shall be
  capable of representing the values of all the members of the
  enumeration."

So if you define a value with 1<<31 the compiler cannot choose to use
a char.

> > Why is this named flags and not given a descriptive name? If I
> > understand your changelog correctly you want to implement other
> > synchronization modes than the current synchronize_sched. So mode
> > might be a proper name.
> 
> If we look at other system calls like open(), "flags" describes a
> feature configuration, whereas "mode" is used to specify the
> permissions to use when creating the file. What we want to specify
> here is more a configuration of the syscall, which fits better with
> the existing semantic of open() "flags".

Then name it opmode or opcode, because that's what you want.

flags is usually used for features, but you don't invoke a
sys_membarrier feature. You invoke an operation mode.

flags as the only argument is a clear indicator that we do not know
what that system call should look like in the future, so we have the
opaque flag field to implement random and inconsistent crap.

> > Why is MEMBARRIER_QUERY not a proper operation mode and simply returns
> > the supported modes instead of doing it backwards and asking whether
> > a specific value is supported?
> 
> That's an interesting way to do it. We'd have to be careful not to
> conflict with return values like -ENOSYS then. We could define
> MEMBARRIER_QUERY as (1 << 31), which effectively reserves all
> signed negative numbers for the QUERY flag, and then the return value
> of the QUERY flag could be either a negative value (e.g. -ENOSYS)
> or the set of flags supported (bits 0 to 30).

Makes sense.

> > 
> > > + * On uniprocessor systems, this system call simply returns 0 after
> > > + * validating the arguments, so user-space knows it is implemented.
> > 
> > And the exact point of knowing this is?
> 
> I guess the alternative approach would be to return -ENOSYS, and let
> userspace discover that the kernel only supports uniprocessor systems
> by other means?

It's fine to have that, but the explanation sucks.
 
> I really prefer that the "default" of passing "0" does
> the obviously correct behavior for the system call:
> issuing a memory barrier over the entire system. Otherwise,
> making "0" a no-op could introduce hard-to-find races
> if people misuse the system call. 

People should stay away from stuff they do not understand. Whether 0
is a valid opcode or not does not matter. They will screw up anyway.

> Moreover, we need to reserve negative return values for errors like
> -ENOSYS, so we can use the high-order bit for the QUERY flag.

Agreed.
 
> In summary, how about the following:
> 
> - "0" is the default, obviously correct flags parameter,
>   which does a memory barrier on all processors.

Then make 0 an explicit value in the enum with a proper explanation
for it. It's a valid argument and wants a proper documentation.

> - 1 << 31 is the QUERY flag, which returns a bitmask of
>   all flags supported.
> - We keep negative return values reserved for errors
>   (e.g. -ENOSYS), even for the QUERY flag.

Agreed.

Thanks,

	tglx


* Re: [PATCH v14 for 4.1] sys_membarrier(): system-wide memory barrier (x86)
  2015-04-14 19:02   ` Mathieu Desnoyers
@ 2015-04-14 19:22     ` Mathieu Desnoyers
  2015-04-14 19:35       ` Thomas Gleixner
  2015-04-14 19:31     ` Thomas Gleixner
  1 sibling, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2015-04-14 19:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, linux-kernel, Josh Triplett, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Peter Zijlstra,
	David Howells

----- Original Message -----
> ----- Original Message -----
> > On Mon, 13 Apr 2015, Mathieu Desnoyers wrote:
> > 
> > > [ Andrew, can you take this for the 4.1 merge window ? ]
> > 
> > You probably mean 4.2, right?
> 
> It can wait for 4.2 if you think it still needs some work,
> of course. I was submitting it for 4.1 because I thought
> the discussion had settled down in the prior rounds.
> 
> > 
> > This fails the basic test for exposure in linux-next, adds syscalls
> > without the explicit ack of any x86 maintainer and exposes a user
> > space ABI with a magic undocumented flags argument.
> 
> I guess for the undocumented part we need to add a manpage,
> right ?
> 
> > 
> > > This patch only adds the system call to x86.
> > 
> > So the changes to
> >  
> > >  include/uapi/asm-generic/unistd.h |    4 +-
> > 
> > are just cosmetic, right?
> 
> Hrm, right, those are not. I did not get the full impact
> of adding it into this generic header. This means arc, arm64,
> c6x, hexagon, metag, nios2, openrisc, score, tile,
> unicore32 are also getting this system call.
> 
> So the question would be: should we introduce this syscall
> in different patches for each architecture, or should
> we add them all in one go ? There is nothing fundamentally
> x86-specific to the implementation of this system call.
> 
> > 
> > > +/* System call membarrier "flags" argument. */
> > > +enum {
> > > +	/*
> > > +	 * Query whether the rest of the specified flags are supported,
> > > +	 * without performing synchronization.
> > > +	 */
> > 
> > Docbook has support for enums.
> 
> OK, how about the following ?
> 
> /**
>  * enum membarrier_flags - membarrier system call "flags" argument bitmask.
>  *
>  * Bitmask of flags to be passed to the membarrier system call. The "flags"
>  * parameter can be either 0 or many of those flags combined with a bitwise
>  * "or".
>  */
> 
> > 
> > > +	MEMBARRIER_QUERY = (1 << 31),
> > > +};
> > 
> > Why is this an anonymous enum?
> 
> Because I used an "int" type for the syscall flags argument, I did
> not need to name it. Giving it a name does indeed help make the
> documentation clearer anyway, so I'm tempted to give it a name.
> 
> > 
> > > +#ifdef CONFIG_SMP
> > 
> > So documentation is SMP only, right?
> 
> Yeah, those embedded folks still doing UP don't need documentation. ;-)
> 
> I'll fix it.
> 
> > 
> > > +/*
> > 
> > Docbook comments start with "/**"
> 
> Fixed.
> 
> > 
> > > + * sys_membarrier - issue memory barrier on all running threads
> > > + * @flags: MEMBARRIER_QUERY:
> > > + *             Query whether the rest of the specified flags are
> > > supported,
> > > + *             without performing synchronization.
> > 
> >       @flags:	Explain what this is for, not what a particular implemented
> >       		value is used for. The values should be proper documented
> > 		in the enum
> > 
> > Why is this an int and not a named enum?
> 
> In my reading of C99:
> 
> ISO/IEC 9899:TC2
> 6.7.2.2 Enumeration specifiers
> 
> "4 Each enumerated type shall be compatible with char, a signed integer type,
> or an
> unsigned integer type. The choice of type is implementation-defined,108) but
> shall be
> capable of representing the values of all the members of the enumeration. The
> enumerated type is incomplete until after the } that terminates the list of
> enumerator
> declarations."
> 
> Since this type is used as a system call ABI, I'm worried that a compiler
> implementation may compile a user-space program with an enum representation
> of a "char", whereas the kernel would expect an integer, thus causing an
> ABI issue, where userland would fail to clear the high-order bits of the
> register, and a more recent kernel would care about those bits (if we add
> new flags in future kernel releases).
> 
> > 
> > Why is this named flags and not given a descriptive name? If I
> > understand your changelog correctly you want to implement other
> > synchronization modes than the current synchronize_sched. So mode
> > might be a proper name.
> 
> If we look at other system calls like open(), "flags" describes a
> feature configuration, whereas "mode" is used to specify the
> permissions to use when creating the file. What we want to specify
> here is more a configuration of the syscall, which fits better with
> the existing semantic of open() "flags".
> 
> > 
> > Why is MEMBARRIER_QUERY not a proper operation mode and simply returns
> > the supported modes instead of doing it backwards and asking whether
> > a specific value is supported?
> 
> That's an interesting way to do it. We'd have to be careful not to
> conflict with return values like -ENOSYS then. We could define
> MEMBARRIER_QUERY as (1 << 31), which effectively reserves all
> signed negative numbers for the QUERY flag, and then the return value
> of the QUERY flag could be either a negative value (e.g. -ENOSYS)
> or the set of flags supported (bits 0 to 30).

Thinking about it a bit more, one reason for doing the QUERY along
with the exact set of flags queried allows us to do more than just
returning which flags are supported: it allows us to tell userspace
whether the combination of flags used is valid or not.

For instance, if we add a MEMBARRIER_PRIVATE flag in a future release
to issue memory barriers only to other threads from the same process,
and we add a MEMBARRIER_EXPEDITED which uses IPIs to issue those
barriers, we could very well have a situation where using

  EXPEDITED | PRIVATE  would be valid (only sending IPIs to CPUs
  running threads from the same process)

but

  EXPEDITED alone would be invalid (-EINVAL), until we figure out
  how to expedite memory barriers to all processors without impacting
  other processes, if at all possible.

Using QUERY with an empty set of flags could however return the set of
flags supported, which could be a nice feature. Anyway, I think
the "0" flag should be the basic always correct configuration that
is always supported, otherwise we'd have -ENOSYS. Therefore, querying
whether the empty set of flags is supported has little value, other
than checking for -ENOSYS.

So considering the above, the typical use of this query method from
library initialization would be:

int supported_flags = sys_membarrier(MEMBARRIER_QUERY);

... check for -ENOSYS ....
... check whether the flags we need are supported ...

if (sys_membarrier(MEMBARRIER_QUERY | flag1 | flag2))
  goto error;

then we are guaranteed that using sys_membarrier(flag1 | flag2)
will always succeed within the application, without needing to
handle errors every time it is used. This property is useful
to implement a synchronize_rcu() that returns "void" and simplify
error handling within the application.

Thoughts?

Thanks,

Mathieu


> 
> > 
> > > + * On uniprocessor systems, this system call simply returns 0 after
> > > + * validating the arguments, so user-space knows it is implemented.
> > 
> > And the exact point of knowing this is?
> 
> I guess the alternative approach would be to return -ENOSYS, and let
> userspace discover that the kernel only supports uniprocessor systems
> by other means?
> 
> > 
> > > + */
> > > +SYSCALL_DEFINE1(membarrier, int, flags)
> > > +{
> > > +	int retval;
> > > +
> > > +	retval = membarrier_validate_flags(flags);
> > > +	if (retval)
> > > +		goto end;
> > > +	if (unlikely(flags & MEMBARRIER_QUERY) || num_online_cpus() == 1)
> > > +		goto end;
> > 
> > So why not doing the obvious?
> > 
> > enum modes {
> >      QUERY     = 0,
> >      FULL_SYNC = 1 <<  0,
> > };
> > 
> > #define IMPLEMENTED_MODES	(FULL_SYNC)
> > 
> > 	switch (mode) {
> > 	case FULL_SYNC:
> >        		synchronize_sched_on_smp();
> > 		return 0;
> > 	case QUERY:
> > 		return IMPLEMENTED_MODES;
> > 	}
> > 	return -EINVAL;
> > 
> > And later on you do
> > 
> >  enum modes {
> >      QUERY       = 0,
> >      FULL_SYNC   = 1 <<  0,
> > +    MAGIC_SYNC  = 1 <<  1,
> >  };
> > 
> > -#define IMPLEMENTED_MODES	(FULL_SYNC)
> > +#define IMPLEMENTED_MODES	(FULL_SYNC | MAGIC_SYNC)
> > 
> > All you need to make sure is that any mode is a power of 2.
> 
> I really prefer that the "default" of passing "0" does
> the obviously correct behavior for the system call:
> issuing a memory barrier over the entire system. Otherwise,
> making "0" a no-op could introduce hard-to-find races
> if people misuse the system call. Moreover, we need to
> reserve negative return values for errors like -ENOSYS, so
> we can use the high-order bit for the QUERY flag.
> 
> In summary, how about the following:
> 
> - "0" is the default, obviously correct flags parameter,
>   which does a memory barrier on all processors.
> - 1 << 31 is the QUERY flag, which returns a bitmask of
>   all flags supported.
> - We keep negative return values reserved for errors
>   (e.g. -ENOSYS), even for the QUERY flag.
> 
> Thanks for the feedback!
> 
> Mathieu
> 
> 
> 
> > 
> > Hmm?
> > 
> > Thanks,
> > 
> > 	tglx
> > 
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH v14 for 4.1] sys_membarrier(): system-wide memory barrier (x86)
  2015-04-13 21:49 ` Thomas Gleixner
  2015-04-13 22:45   ` Thomas Gleixner
@ 2015-04-14 19:02   ` Mathieu Desnoyers
  2015-04-14 19:22     ` Mathieu Desnoyers
  2015-04-14 19:31     ` Thomas Gleixner
  1 sibling, 2 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2015-04-14 19:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, linux-kernel, Josh Triplett, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Peter Zijlstra,
	David Howells

----- Original Message -----
> On Mon, 13 Apr 2015, Mathieu Desnoyers wrote:
> 
> > [ Andrew, can you take this for the 4.1 merge window ? ]
> 
> You probably mean 4.2, right?

It can wait for 4.2 if you think it still needs some work,
of course. I was submitting it for 4.1 because I thought
the discussion had settled down in the prior rounds.

> 
> This fails the basic test for exposure in linux-next, adds syscalls
> without the explicit ack of any x86 maintainer and exposes a user
> space ABI with a magic undocumented flags argument.

I guess for the undocumented part we need to add a manpage,
right ?

> 
> > This patch only adds the system call to x86.
> 
> So the changes to
>  
> >  include/uapi/asm-generic/unistd.h |    4 +-
> 
> are just cosmetic, right?

Hrm, right, those are not. I did not get the full impact
of adding it into this generic header. This means arc, arm64,
c6x, hexagon, metag, nios2, openrisc, score, tile,
unicore32 are also getting this system call.

So the question would be: should we introduce this syscall
in different patches for each architecture, or should
we add them all in one go ? There is nothing fundamentally
x86-specific to the implementation of this system call.

> 
> > +/* System call membarrier "flags" argument. */
> > +enum {
> > +	/*
> > +	 * Query whether the rest of the specified flags are supported,
> > +	 * without performing synchronization.
> > +	 */
> 
> Docbook has support for enums.

OK, how about the following ?

/**
 * enum membarrier_flags - membarrier system call "flags" argument bitmask.
 *
 * Bitmask of flags to be passed to the membarrier system call. The "flags"
 * parameter can be either 0 or many of those flags combined with a bitwise
 * "or".
 */

> 
> > +	MEMBARRIER_QUERY = (1 << 31),
> > +};
> 
> Why is this an anonymous enum?

Because I used an "int" type for the syscall flags argument, I did
not need to name it. Giving it a name does indeed help make the
documentation clearer anyway, so I'm tempted to give it a name.

> 
> > +#ifdef CONFIG_SMP
> 
> So documentation is SMP only, right?

Yeah, those embedded folks still doing UP don't need documentation. ;-)

I'll fix it.

> 
> > +/*
> 
> Docbook comments start with "/**"

Fixed.

> 
> > + * sys_membarrier - issue memory barrier on all running threads
> > + * @flags: MEMBARRIER_QUERY:
> > + *             Query whether the rest of the specified flags are
> > supported,
> > + *             without performing synchronization.
> 
>       @flags:	Explain what this is for, not what a particular implemented
>       		value is used for. The values should be proper documented
> 		in the enum
> 
> Why is this an int and not a named enum?

In my reading of C99:

ISO/IEC 9899:TC2
6.7.2.2 Enumeration specifiers

"4 Each enumerated type shall be compatible with char, a signed integer type, or an
unsigned integer type. The choice of type is implementation-defined,108) but shall be
capable of representing the values of all the members of the enumeration. The
enumerated type is incomplete until after the } that terminates the list of enumerator
declarations."

Since this type is used as a system call ABI, I'm worried that a compiler
implementation may compile a user-space program with an enum representation
of a "char", whereas the kernel would expect an integer, thus causing an
ABI issue, where userland would fail to clear the high-order bits of the
register, and a more recent kernel would care about those bits (if we add
new flags in future kernel releases).

> 
> Why is this named flags and not given a descriptive name? If I
> understand your changelog correctly you want to implement other
> synchronization modes than the current synchronize_sched. So mode
> might be a proper name.

If we look at other system calls like open(), "flags" describes a
feature configuration, whereas "mode" is used to specify the
permissions to use when creating the file. What we want to specify
here is more a configuration of the syscall, which fits better with
the existing semantic of open() "flags".

> 
> Why is MEMBARRIER_QUERY not a proper operation mode and simply returns
> the supported modes instead of doing it backwards and asking whether
> a specific value is supported?

That's an interesting way to do it. We'd have to be careful not to
conflict with return values like -ENOSYS then. We could define
MEMBARRIER_QUERY as (1 << 31), which effectively reserves all
signed negative numbers for the QUERY flag, and then the return value
of the QUERY flag could be either a negative value (e.g. -ENOSYS)
or the set of flags supported (bits 0 to 30).

> 
> > + * On uniprocessor systems, this system call simply returns 0 after
> > + * validating the arguments, so user-space knows it is implemented.
> 
> And the exact point of knowing this is?

I guess the alternative approach would be to return -ENOSYS, and let
userspace discover that the kernel only supports uniprocessor systems
by other means?

> 
> > + */
> > +SYSCALL_DEFINE1(membarrier, int, flags)
> > +{
> > +	int retval;
> > +
> > +	retval = membarrier_validate_flags(flags);
> > +	if (retval)
> > +		goto end;
> > +	if (unlikely(flags & MEMBARRIER_QUERY) || num_online_cpus() == 1)
> > +		goto end;
> 
> So why not doing the obvious?
> 
> enum modes {
>      QUERY     = 0,
>      FULL_SYNC = 1 <<  0,
> };
> 
> #define IMPLEMENTED_MODES	(FULL_SYNC)
> 
> 	switch (mode) {
> 	case FULL_SYNC:
>        		synchronize_sched_on_smp();
> 		return 0;
> 	case QUERY:
> 		return IMPLEMENTED_MODES;
> 	}
> 	return -EINVAL;
> 
> And later on you do
> 
>  enum modes {
>      QUERY       = 0,
>      FULL_SYNC   = 1 <<  0,
> +    MAGIC_SYNC  = 1 <<  1,
>  };
> 
> -#define IMPLEMENTED_MODES	(FULL_SYNC)
> +#define IMPLEMENTED_MODES	(FULL_SYNC | MAGIC_SYNC)
> 
> All you need to make sure is that any mode is a power of 2.

I really prefer that the "default" of passing "0" does
the obviously correct behavior for the system call:
issuing a memory barrier over the entire system. Otherwise,
making "0" a no-op could introduce hard-to-find races
if people misuse the system call. Moreover, we need to
reserve negative return values for errors like -ENOSYS, so
we can use the high-order bit for the QUERY flag.

In summary, how about the following:

- "0" is the default, obviously correct flags parameter,
  which does a memory barrier on all processors.
- 1 << 31 is the QUERY flag, which returns a bitmask of
  all flags supported.
- We keep negative return values reserved for errors
  (e.g. -ENOSYS), even for the QUERY flag.

Thanks for the feedback!

Mathieu



> 
> Hmm?
> 
> Thanks,
> 
> 	tglx
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH v14 for 4.1] sys_membarrier(): system-wide memory barrier (x86)
  2015-04-13 21:49 ` Thomas Gleixner
@ 2015-04-13 22:45   ` Thomas Gleixner
  2015-04-14 19:02   ` Mathieu Desnoyers
  1 sibling, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2015-04-13 22:45 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, linux-kernel, Josh Triplett, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Peter Zijlstra,
	David Howells

On Mon, 13 Apr 2015, Thomas Gleixner wrote:
> On Mon, 13 Apr 2015, Mathieu Desnoyers wrote:
> 
> > [ Andrew, can you take this for the 4.1 merge window ? ]
> 
> You probably mean 4.2, right?
> 
> This fails the basic test for exposure in linux-next, adds syscalls
> without the explicit ack of any x86 maintainer and exposes a user
> space ABI with a magic undocumented flags argument.

Just for completeness sake. It also lacks a man page and the review of
the manpage maintainer.

Thanks,

	tglx


* Re: [PATCH v14 for 4.1] sys_membarrier(): system-wide memory barrier (x86)
  2015-04-13 19:10 Mathieu Desnoyers
@ 2015-04-13 21:49 ` Thomas Gleixner
  2015-04-13 22:45   ` Thomas Gleixner
  2015-04-14 19:02   ` Mathieu Desnoyers
  0 siblings, 2 replies; 11+ messages in thread
From: Thomas Gleixner @ 2015-04-13 21:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, linux-kernel, Josh Triplett, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Peter Zijlstra,
	David Howells

On Mon, 13 Apr 2015, Mathieu Desnoyers wrote:

> [ Andrew, can you take this for the 4.1 merge window ? ]

You probably mean 4.2, right?

This fails the basic test for exposure in linux-next, adds syscalls
without the explicit ack of any x86 maintainer and exposes a user
space ABI with a magic undocumented flags argument.

> This patch only adds the system call to x86.

So the changes to 
 
>  include/uapi/asm-generic/unistd.h |    4 +-

are just cosmetic, right?

> +/* System call membarrier "flags" argument. */
> +enum {
> +	/*
> +	 * Query whether the rest of the specified flags are supported,
> +	 * without performing synchronization.
> +	 */

Docbook has support for enums.

> +	MEMBARRIER_QUERY = (1 << 31),
> +};

Why is this an anonymous enum?

> +#ifdef CONFIG_SMP

So documentation is SMP only, right?

> +/*

Docbook comments start with "/**"

> + * sys_membarrier - issue memory barrier on all running threads
> + * @flags: MEMBARRIER_QUERY:
> + *             Query whether the rest of the specified flags are supported,
> + *             without performing synchronization.

      @flags:	Explain what this is for, not what a particular implemented
      		value is used for. The values should be proper documented
		in the enum

Why is this an int and not a named enum?

Why is this named flags and not given a descriptive name? If I
understand your changelog correctly you want to implement other
synchronization modes than the current synchronize_sched. So mode
might be a proper name.

Why is MEMBARRIER_QUERY not a proper operation mode and simply returns
the supported modes instead of doing it backwards and asking whether
a specific value is supported?

> + * On uniprocessor systems, this system call simply returns 0 after
> + * validating the arguments, so user-space knows it is implemented.

And the exact point of knowing this is?

> + */
> +SYSCALL_DEFINE1(membarrier, int, flags)
> +{
> +	int retval;
> +
> +	retval = membarrier_validate_flags(flags);
> +	if (retval)
> +		goto end;
> +	if (unlikely(flags & MEMBARRIER_QUERY) || num_online_cpus() == 1)
> +		goto end;

So why not doing the obvious?

enum modes {
     QUERY     = 0,
     FULL_SYNC = 1 <<  0,
};

#define IMPLEMENTED_MODES	(FULL_SYNC)

	switch (mode) {
	case FULL_SYNC:
       		synchronize_sched_on_smp();
		return 0;
	case QUERY:
		return IMPLEMENTED_MODES;
	}
	return -EINVAL;

And later on you do

 enum modes {
     QUERY       = 0,
     FULL_SYNC   = 1 <<  0,
+    MAGIC_SYNC  = 1 <<  1,
 };

-#define IMPLEMENTED_MODES	(FULL_SYNC)
+#define IMPLEMENTED_MODES	(FULL_SYNC | MAGIC_SYNC)

All you need to make sure is that any mode is a power of 2.

Hmm?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v14 for 4.1] sys_membarrier(): system-wide memory barrier (x86)
@ 2015-04-13 19:10 Mathieu Desnoyers
  2015-04-13 21:49 ` Thomas Gleixner
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2015-04-13 19:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Mathieu Desnoyers, Josh Triplett, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Thomas Gleixner,
	Peter Zijlstra, David Howells

[ Andrew, can you take this for the 4.1 merge window ? ]

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system. It is
implemented by calling synchronize_sched(). It can be used to distribute
the cost of user-space memory barriers asymmetrically by transforming
pairs of memory barriers into pairs consisting of sys_membarrier() and a
compiler barrier. For synchronization primitives that distinguish
between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
read-side can be accelerated significantly by moving the bulk of the
memory barrier overhead to the write-side.

It is based on kernel v4.0.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pairs:

Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().

* Benchmarks

On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)

1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every one of the
process's threads (as we currently do), we diminish the number of
unnecessary wake-ups and only issue the memory barriers on active
threads. Non-running threads do not need to execute such a barrier
anyway, because it is already implied by the scheduler's context
switches.

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 3129842 writes
signal-based scheme:          9825306874 reads,    5386 writes
sys_membarrier:               7992076602 reads,     220 writes

The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
the read-side performance is close to that of the signal-based scheme.
However, this non-expedited sys_membarrier implementation has a much
slower grace period than the signal-based and memory-barrier schemes.

An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.

This patch only adds the system call to x86.

[1] http://urcu.so

Changes since v13:
- Move to kernel/membarrier.c.
- Remove MEMBARRIER_PRIVATE flag.
- Add MAINTAINERS file entry.

Changes since v12:
- Remove _FLAG suffix from uapi flags.
- Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
- Remove EXPEDITED mode. Only implement non-expedited for now, until
  reading the cpu_curr()->mm can be done without holding the CPU's rq
  lock.

Changes since v11:
- 5 years have passed.
- Rebase on v3.19 kernel.
- Add futex-alike PRIVATE vs SHARED semantic: private for per-process
  barriers, non-private for memory mappings shared between processes.
- Simplify user API.
- Code refactoring.

Changes since v10:
- Apply Randy's comments.
- Rebase on 2.6.34-rc4 -tip.

Changes since v9:
- Clean up #ifdef CONFIG_SMP.

Changes since v8:
- Go back to rq spin locks taken by sys_membarrier() rather than adding
  memory barriers to the scheduler. This implies a potential RoS
  (reduction of service) if sys_membarrier() is executed in a busy loop
  by a user, but nothing more than what is already possible with other
  existing system calls, and it saves memory barriers in the scheduler
  fast path.
- re-add the memory barrier comments to x86 switch_mm() as an example to
  other architectures.
- Update documentation of the memory barriers in sys_membarrier and
  switch_mm().
- Append execution scenarios to the changelog showing the purpose of
  each memory barrier.

Changes since v7:
- Move spinlock-mb and scheduler related changes to separate patches.
- Add support for sys_membarrier on x86_32.
- Only x86 32/64 system calls are reserved in this patch. It is planned
  to incrementally reserve syscall IDs on other architectures as these
  are tested.

Changes since v6:
- Remove some unlikely() not so unlikely.
- Add the proper scheduler memory barriers needed to only use the RCU
  read lock in sys_membarrier rather than take each runqueue spinlock:
- Move memory barriers from per-architecture switch_mm() to schedule()
  and finish_lock_switch(), where they clearly document that all data
  protected by the rq lock is guaranteed to have memory barriers issued
  between the scheduler update and the task execution. Replacing the
  spin lock acquire/release barriers with these memory barriers implies
  either no overhead (x86 spinlock atomic instruction already implies a
  full mb) or some hopefully small overhead caused by the upgrade of the
  spinlock acquire/release barriers to more heavyweight smp_mb().
- The "generic" version of spinlock-mb.h declares both a mapping to
  standard spinlocks and full memory barriers. Each architecture can
  specialize this header following their own need and declare
  CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
- Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
  implementations on a wide range of architecture would be welcome.

Changes since v5:
- Plan ahead for extensibility by introducing mandatory/optional masks
  to the "flags" system call parameter. Past experience with accept4(),
  signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
  inotify_init1() indicates that this is the kind of thing we want to
  plan for. Return -EINVAL if the mandatory flags received are unknown.
- Create include/linux/membarrier.h to define these flags.
- Add MEMBARRIER_QUERY optional flag.

Changes since v4:
- Add "int expedited" parameter, use synchronize_sched() in the
  non-expedited case. Thanks to Lai Jiangshan for making us consider
  seriously using synchronize_sched() to provide the low-overhead
  membarrier scheme.
- Check num_online_cpus() == 1; if so, return quickly without doing
  anything.

Changes since v3a:
- Confirm that each CPU indeed runs the current task's ->mm before
  sending an IPI. Ensures that we do not disturb RT tasks in the
  presence of lazy TLB shootdown.
- Document memory barriers needed in switch_mm().
- Surround helper functions with #ifdef CONFIG_SMP.

Changes since v2:
- simply send-to-many to the mm_cpumask. It contains the list of
  processors we have to IPI to (which use the mm), and this mask is
  updated atomically.

Changes since v1:
- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptative IPI scheme (single vs many IPI with threshold).
- Issue smp_mb() at the beginning and end of the system call.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Nicholas Miell <nmiell@comcast.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: David Howells <dhowells@redhat.com>
---
 MAINTAINERS                       |    8 ++++
 arch/x86/syscalls/syscall_32.tbl  |    1 +
 arch/x86/syscalls/syscall_64.tbl  |    1 +
 include/linux/syscalls.h          |    2 +
 include/uapi/asm-generic/unistd.h |    4 +-
 include/uapi/linux/Kbuild         |    1 +
 include/uapi/linux/membarrier.h   |   57 ++++++++++++++++++++++++++++
 init/Kconfig                      |   12 ++++++
 kernel/Makefile                   |    1 +
 kernel/membarrier.c               |   75 +++++++++++++++++++++++++++++++++++++
 kernel/sys_ni.c                   |    3 +
 11 files changed, 164 insertions(+), 1 deletions(-)
 create mode 100644 include/uapi/linux/membarrier.h
 create mode 100644 kernel/membarrier.c

diff --git a/MAINTAINERS b/MAINTAINERS
index efbcb50..a94cc7b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6328,6 +6328,14 @@ W:	http://www.mellanox.com
 Q:	http://patchwork.ozlabs.org/project/netdev/list/
 F:	drivers/net/ethernet/mellanox/mlx4/en_*
 
+MEMBARRIER SUPPORT
+M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	kernel/membarrier.c
+F:	include/uapi/linux/membarrier.h
+
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
 W:	http://www.linux-mm.org
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..439415f 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	membarrier		sys_membarrier
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..823130d 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	common	membarrier		sys_membarrier
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..9702fc8 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -884,4 +884,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
 			const char __user *const __user *argv,
 			const char __user *const __user *envp, int flags);
 
+asmlinkage long sys_membarrier(int flags);
+
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..8da542a 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
 __SYSCALL(__NR_bpf, sys_bpf)
 #define __NR_execveat 281
 __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_membarrier 282
+__SYSCALL(__NR_membarrier, sys_membarrier)
 
 #undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283
 
 /*
  * All syscalls below here should go away really,
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 68ceb97..4684a8d 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -249,6 +249,7 @@ header-y += mdio.h
 header-y += media.h
 header-y += media-bus-format.h
 header-y += mei.h
+header-y += membarrier.h
 header-y += memfd.h
 header-y += mempolicy.h
 header-y += meye.h
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
new file mode 100644
index 0000000..b6f8f40
--- /dev/null
+++ b/include/uapi/linux/membarrier.h
@@ -0,0 +1,57 @@
+#ifndef _UAPI_LINUX_MEMBARRIER_H
+#define _UAPI_LINUX_MEMBARRIER_H
+
+/*
+ * linux/membarrier.h
+ *
+ * membarrier system call API
+ *
+ * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/*
+ * All memory accesses performed in program order from each thread on
+ * the system are guaranteed to be ordered with respect to sys_membarrier().
+ * If we use the semantic "barrier()" to represent a compiler barrier
+ * forcing memory accesses to be performed in program order across the
+ * barrier, and smp_mb() to represent explicit memory barriers forcing
+ * full memory ordering across the barrier, we have the following
+ * ordering table for each pair of barrier(), sys_membarrier() and
+ * smp_mb() :
+ *
+ * The pair ordering is detailed as (O: ordered, X: not ordered):
+ *
+ *                        barrier()   smp_mb() sys_membarrier()
+ *        barrier()          X           X            O
+ *        smp_mb()           X           O            O
+ *        sys_membarrier()   O           O            O
+ */
+
+/* System call membarrier "flags" argument. */
+enum {
+	/*
+	 * Query whether the rest of the specified flags are supported,
+	 * without performing synchronization.
+	 */
+	MEMBARRIER_QUERY = (1 << 31),
+};
+
+#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..3bfad2a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1559,6 +1559,18 @@ config PCI_QUIRKS
 	  bugs/quirks. Disable this only if your target machine is
 	  unaffected by PCI quirks.
 
+config MEMBARRIER
+	bool "Enable membarrier() system call" if EXPERT
+	default y
+	help
+	  Enable the membarrier() system call that allows issuing memory
+	  barriers across all running threads, which can be used to distribute
+	  the cost of user-space memory barriers asymmetrically by transforming
+	  pairs of memory barriers into pairs consisting of membarrier() and a
+	  compiler barrier.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..1cafa11 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -96,6 +96,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/membarrier.c b/kernel/membarrier.c
new file mode 100644
index 0000000..3077e94
--- /dev/null
+++ b/kernel/membarrier.c
@@ -0,0 +1,75 @@
+/*
+ * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * membarrier system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/membarrier.h>
+
+static int membarrier_validate_flags(int flags)
+{
+	/* Check for unrecognized flags. */
+	if (flags & ~MEMBARRIER_QUERY)
+		return -EINVAL;
+	return 0;
+}
+
+#ifdef CONFIG_SMP
+
+/*
+ * sys_membarrier - issue memory barrier on all running threads
+ * @flags: MEMBARRIER_QUERY:
+ *             Query whether the rest of the specified flags are supported,
+ *             without performing synchronization.
+ *
+ * return values: Returns -EINVAL if the flags are incorrect. Testing
+ * for kernel sys_membarrier support can be done by checking for -ENOSYS
+ * return value.  Return value of 0 indicates success. For a given set
+ * of flags on a given kernel, this system call will always return the
+ * same value. It is therefore correct to check the return value only
+ * once during a process lifetime, setting MEMBARRIER_QUERY to only
+ * check if the flags are supported, without performing any
+ * synchronization.
+ *
+ * This system call executes a memory barrier on all running threads.
+ * Upon completion, the caller thread is ensured that all running
+ * threads have passed through a state where all memory accesses to
+ * user-space addresses match program order. (non-running threads are de
+ * facto in such a state.)
+ *
+ * On uniprocessor systems, this system call simply returns 0 after
+ * validating the arguments, so user-space knows it is implemented.
+ */
+SYSCALL_DEFINE1(membarrier, int, flags)
+{
+	int retval;
+
+	retval = membarrier_validate_flags(flags);
+	if (retval)
+		goto end;
+	if (unlikely(flags & MEMBARRIER_QUERY) || num_online_cpus() == 1)
+		goto end;
+	synchronize_sched();
+end:
+	return retval;
+}
+
+#else /* !CONFIG_SMP */
+
+SYSCALL_DEFINE1(membarrier, int, flags)
+{
+	return membarrier_validate_flags(flags);
+}
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..5913b84 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -229,3 +229,6 @@ cond_syscall(sys_bpf);
 
 /* execveat */
 cond_syscall(sys_execveat);
+
+/* membarrier */
+cond_syscall(sys_membarrier);
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-04-14 21:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-25 19:03 [PATCH v14 for 4.1] sys_membarrier(): system-wide memory barrier (x86) Mathieu Desnoyers
2015-03-27  9:21 ` Paul E. McKenney
2015-04-13 19:10 Mathieu Desnoyers
2015-04-13 21:49 ` Thomas Gleixner
2015-04-13 22:45   ` Thomas Gleixner
2015-04-14 19:02   ` Mathieu Desnoyers
2015-04-14 19:22     ` Mathieu Desnoyers
2015-04-14 19:35       ` Thomas Gleixner
2015-04-14 20:17         ` Mathieu Desnoyers
2015-04-14 21:05           ` Thomas Gleixner
2015-04-14 19:31     ` Thomas Gleixner
