All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
@ 2015-03-15 19:24 Mathieu Desnoyers
  2015-03-15 22:05 ` Paul E. McKenney
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2015-03-15 19:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mathieu Desnoyers, KOSAKI Motohiro, Steven Rostedt,
	Paul E. McKenney, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Andrew Morton,
	Josh Triplett, Thomas Gleixner, Peter Zijlstra, David Howells,
	Nick Piggin

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on either all running threads of the current
process (MEMBARRIER_PRIVATE_FLAG) or calls synchronize_sched() to issue
a memory barrier on all threads running on the system. It can be used to
distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of
sys_membarrier() and a compiler barrier. For synchronization primitives
that distinguish between read-side and write-side (e.g. userspace RCU,
rwlocks), the read-side can be accelerated significantly by moving the
bulk of the memory barrier overhead to the write-side.

The first user of this system call is the "liburcu" Userspace RCU
implementation [1]. It aims at greatly simplifying and enhancing the
current implementation, which uses a scheme similar to the
sys_membarrier(), but based on signals sent to each reader thread.
Liburcu is currently packaged in all major Linux distributions.

One user of this library is the LTTng-UST (Userspace Tracing) library
[2]. The impact of two additional memory barriers on the LTTng tracer
fast path has been benchmarked to 35ns on our reference Intel Core Xeon
2.0 GHz, which adds up to about 25% performance degradation.

This patch mostly sits in kernel/sched.c (it needs to access struct rq).
It is based on kernel v3.19, and also applies fine to master. I am
submitting it as RFC.

Alternative approach: signals. A signal-based alternative proposed by
Ingo would lead to important scheduler modifications, which involves
adding context-switch in/out overhead (calling user-space upon scheduler
switch in and out). In addition to the overhead issue, I am also
reluctant to base a synchronization primitive on the signals, which, to
quote Linus, are "already one of our more "exciting" layers out there",
which does not give me the warm feeling of rock-solidness that's usually
expected from synchronization primitives.

Changes since v11:
- 5 years have passed.
- Rebase on v3.19 kernel.
- Add futex-alike PRIVATE vs SHARED semantic: private for per-process
  barriers, non-private for memory mappings shared between processes.
- Simplify user API.
- Code refactoring.

Changes since v10:
- Apply Randy's comments.
- Rebase on 2.6.34-rc4 -tip.

Changes since v9:
- Clean up #ifdef CONFIG_SMP.

Changes since v8:
- Go back to rq spin locks taken by sys_membarrier() rather than adding
  memory barriers to the scheduler. It implies a potential RoS
  (reduction of service) if sys_membarrier() is executed in a busy-loop
  by a user, but nothing more than what is already possible with other
  existing system calls, but saves memory barriers in the scheduler fast
  path.
- re-add the memory barrier comments to x86 switch_mm() as an example to
  other architectures.
- Update documentation of the memory barriers in sys_membarrier and
  switch_mm().
- Append execution scenarios to the changelog showing the purpose of
  each memory barrier.

Changes since v7:
- Move spinlock-mb and scheduler related changes to separate patches.
- Add support for sys_membarrier on x86_32.
- Only x86 32/64 system calls are reserved in this patch. It is planned
  to incrementally reserve syscall IDs on other architectures as these
  are tested.

Changes since v6:
- Remove some unlikely() not so unlikely.
- Add the proper scheduler memory barriers needed to only use the RCU
  read lock in sys_membarrier rather than take each runqueue spinlock:
- Move memory barriers from per-architecture switch_mm() to schedule()
  and finish_lock_switch(), where they clearly document that all data
  protected by the rq lock is guaranteed to have memory barriers issued
  between the scheduler update and the task execution. Replacing the
  spin lock acquire/release barriers with these memory barriers imply
  either no overhead (x86 spinlock atomic instruction already implies a
  full mb) or some hopefully small overhead caused by the upgrade of the
  spinlock acquire/release barriers to more heavyweight smp_mb().
- The "generic" version of spinlock-mb.h declares both a mapping to
  standard spinlocks and full memory barriers. Each architecture can
  specialize this header following their own need and declare
  CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
- Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
  implementations on a wide range of architecture would be welcome.

Changes since v5:
- Plan ahead for extensibility by introducing mandatory/optional masks
  to the "flags" system call parameter. Past experience with accept4(),
  signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
  inotify_init1() indicates that this is the kind of thing we want to
  plan for. Return -EINVAL if the mandatory flags received are unknown.
- Create include/linux/membarrier.h to define these flags.
- Add MEMBARRIER_QUERY optional flag.

Changes since v4:
- Add "int expedited" parameter, use synchronize_sched() in the
  non-expedited case. Thanks to Lai Jiangshan for making us consider
  seriously using synchronize_sched() to provide the low-overhead
  membarrier scheme.
- Check num_online_cpus() == 1, quickly return without doing nothing.

Changes since v3a:
- Confirm that each CPU indeed runs the current task's ->mm before
  sending an IPI. Ensures that we do not disturb RT tasks in the
  presence of lazy TLB shootdown.
- Document memory barriers needed in switch_mm().
- Surround helper functions with #ifdef CONFIG_SMP.

Changes since v2:
- simply send-to-many to the mm_cpumask. It contains the list of
  processors we have to IPI to (which use the mm), and this mask is
  updated atomically.

Changes since v1:
- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptative IPI scheme (single vs many IPI with threshold).
- Issue smp_mb() at the beginning and end of the system call.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pairs:

Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by to the IPIs executing memory barriers on each active
system threads. Each non-running process threads are intrinsically
serialized by the scheduler.

* Benchmarks

On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)

- expedited

  10,000,000 sys_membarrier calls in 43s = 4.3 microseconds/call.

- non-expedited

  1000 sys_membarrier calls in 33s = 33 milliseconds/call.

Expedited is 7600 times faster than non-expedited.

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 3129842 writes
signal-based scheme:          9825306874 reads,    5386 writes
sys_membarrier expedited:     6637539697 reads,  852129 writes
sys_membarrier non-expedited: 7992076602 reads,     220 writes

The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
with the expedited scheme, we can see that we are close to the read-side
performance of the signal-based scheme and also at 1/4 of the
memory-barrier write-side performance. We have a write-side speedup of
158:1 over the signal-based scheme by using the sys_membarrier system
call. This allows a 3.9:1 read-side speedup over the pre-existing memory
barrier scheme.

The non-expedited scheme adds indeed a much lower overhead on the
read-side both because we do not send IPIs and because we perform less
updates, which in turn generates less cache-line exchanges. The
write-side latency becomes even higher than with the signal-based
scheme. The advantage of sys_membarrier() over signal-based scheme is
that it does not require to wake up all the process threads.

The non-expedited sys_membarrier scheme can be useful to a userspace RCU
flavor that encompass all processes on a system, which may share memory
mappings.

* More information about memory barriers in:

- sys_membarrier()
- membarrier_ipi()
- switch_mm()
- issued with ->mm update while the rq lock is held

The goal of these memory barriers is to ensure that all memory accesses
to user-space addresses performed by every processor which execute
threads belonging to the current process are observed to be in program
order at least once between the two memory barriers surrounding
sys_membarrier().

If we were to simply broadcast an IPI to all processors between the two
smp_mb() in sys_membarrier(), membarrier_ipi() would execute on each
processor, and waiting for these handlers to complete execution
guarantees that each running processor passed through a state where
user-space memory address accesses were in program order.

However, this "big hammer" approach does not please the real-time
concerned people. This would let a non RT task disturb real-time tasks
by sending useless IPIs to processors not concerned by the memory of the
current process.

This is why we iterate on the mm_cpumask, which is a superset of the
processors concerned by the process memory map and check each processor
->mm with the rq lock held to confirm that the processor is indeed
running a thread concerned with our mm (and not just part of the
mm_cpumask due to lazy TLB shootdown).

User-space memory address accesses must be in program order when
mm_cpumask is set or cleared. (more details in the x86 switch_mm()
comments).

The verification, for each cpu part of the mm_cpumask, that the rq ->mm
is indeed part of the current ->mm needs to be done with the rq lock
held. This ensures that each time a rq ->mm is modified, a memory
barrier (typically implied by the change of memory mapping) is also
issued. These ->mm update and memory barrier are made atomic by the rq
spinlock.

The execution scenario (1) shows the behavior of the sys_membarrier()
system call executed on Thread A while Thread B executes memory accesses
that need to be ordered. Thread B is running. Memory accesses in Thread
B are in program order (e.g. separated by a compiler barrier()).

1) Thread B running, ordering ensured by the membarrier_ipi():

  Thread A                               Thread B
-------------------------------------------------------------------------
  prev accesses to userspace addr.       prev accesses to userspace addr.
  sys_membarrier
    smp_mb
    IPI  ------------------------------> membarrier_ipi()
                                         smp_mb
                                         return
    smp_mb
  following accesses to userspace addr.  following accesses to userspace
                                         addr.

The execution scenarios (2-3-4-5) show the same setup as (1), but Thread
B is not running while sys_membarrier() is called. Thanks to the memory
barriers implied by load_cr3 in switch_mm(), Thread B user-space address
memory accesses are already in program order when sys_membarrier finds
out that either the mm_cpumask does not contain Thread B CPU or that
that CPU's ->mm is not running the current process mm.

2) Context switch in, showing rq spin lock synchronization:

  Thread A                               Thread B
-------------------------------------------------------------------------
                                         <prev accesses to userspace addr.
                                         saved on stack>
  prev accesses to userspace addr.
  sys_membarrier
    smp_mb
      for each cpu in mm_cpumask
        <Thread B CPU is present e.g. due
         to lazy TLB shootdown>
        spin lock cpu rq
        mm = cpu rq mm
        spin unlock cpu rq
                                         context switch in
                                         <spin lock cpu rq by other thread>
                                         load_cr3 (or equiv. mem. barrier)
                                         spin unlock cpu rq
                                         following accesses to userspace
                                         addr.
        if (mm == current rq mm)
          <false>
    smp_mb
  following accesses to userspace addr.
Here, the important point is that Thread B have passed through a point
where all its userspace memory address accesses were in program order
between the two smp_mb() in sys_membarrier.

3) Context switch out, showing rq spin lock synchronization:

  Thread A                               Thread B
-------------------------------------------------------------------------
  prev accesses to userspace addr.
                                         prev accesses to userspace addr.
  sys_membarrier
    smp_mb
      for each cpu in mm_cpumask
                                         context switch out
                                         spin lock cpu rq
                                         load_cr3 (or equiv. mem. barrier)
                                         <spin unlock cpu rq by other thread>
                                         <following accesses to userspace
                                         addr. will happen when rescheduled>
        spin lock cpu rq
        mm = cpu rq mm
        spin unlock cpu rq
        if (mm == current rq mm)
          <false>
    smp_mb
  following accesses to userspace addr.
Same as (2): the important point is that Thread B have passed through a
point where all its userspace memory address accesses were in program
order between the two smp_mb() in sys_membarrier.

4) Context switch in, showing mm_cpumask synchronization:

  Thread A                               Thread B
-------------------------------------------------------------------------
                                         <prev accesses to userspace addr.
                                         saved on stack>
  prev accesses to userspace addr.
  sys_membarrier
    smp_mb
      for each cpu in mm_cpumask
        <Thread B CPU not in mask>
                                         context switch in
                                         set cpu bit in mm_cpumask
                                         load_cr3 (or equiv. mem. barrier)
                                         following accesses to userspace
                                         addr.
    smp_mb
  following accesses to userspace addr.
Same as 2-3: Thread B is passing through a point where userspace memory
address accesses are in program order between the two smp_mb() in
sys_membarrier().

5) Context switch out, showing mm_cpumask synchronization:

  Thread A                               Thread B
-------------------------------------------------------------------------
  prev accesses to userspace addr.
                                         prev accesses to userspace addr.
  sys_membarrier
    smp_mb
                                         context switch out
                                         load_cr3 (or equiv. mem. barrier)
                                         clear cpu bit in mm_cpumask
                                         <following accesses to userspace
                                         addr. will happen when rescheduled>
      for each cpu in mm_cpumask
        <Thread B CPU not in mask>
    smp_mb
  following accesses to userspace addr.
Same as 2-3-4: Thread B is passing through a point where userspace
memory address accesses are in program order between the two smp_mb() in
sys_membarrier().

This patch only adds the system calls to x86. See the sys_membarrier()
comments for memory barriers requirement in switch_mm() to port to other
architectures.

[1] http://urcu.so
[2] http://lttng.org

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Nicholas Miell <nmiell@comcast.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: David Howells <dhowells@redhat.com>
CC: Nick Piggin <npiggin@kernel.dk>
---
 arch/x86/include/asm/mmu_context.h |   17 +++
 arch/x86/syscalls/syscall_32.tbl   |    1 +
 arch/x86/syscalls/syscall_64.tbl   |    1 +
 include/linux/syscalls.h           |    2 +
 include/uapi/asm-generic/unistd.h  |    4 +-
 include/uapi/linux/Kbuild          |    1 +
 include/uapi/linux/membarrier.h    |   75 +++++++++++++
 kernel/sched/core.c                |  208 ++++++++++++++++++++++++++++++++++++
 kernel/sys_ni.c                    |    3 +
 9 files changed, 311 insertions(+), 1 deletions(-)
 create mode 100644 include/uapi/linux/membarrier.h

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 4b75d59..a30a63d 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -45,6 +45,16 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 #endif
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
+		/*
+		 * smp_mb() between mm_cpumask set and following memory
+		 * accesses to user-space addresses is required by
+		 * sys_membarrier(). A smp_mb() is also needed between
+		 * prior memory accesses and mm_cpumask clear. This
+		 * ensures that all user-space address memory accesses
+		 * performed by the current thread are in program order
+		 * when the mm_cpumask is set. Implied by load_cr3.
+		 */
+
 		/* Re-load page tables */
 		load_cr3(next->pgd);
 		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
@@ -82,6 +92,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 			 * We were in lazy tlb mode and leave_mm disabled
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
+			 *
+			 * smp_mb() between mm_cpumask set and memory accesses
+			 * to user-space addresses is required by
+			 * sys_membarrier(). This ensures that all user-space
+			 * address memory accesses performed by the current
+			 * thread are in program order when the mm_cpumask is
+			 * set. Implied by load_cr3.
 			 */
 			load_cr3(next->pgd);
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..439415f 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	membarrier		sys_membarrier
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..823130d 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	common	membarrier		sys_membarrier
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 85893d7..058ec0a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -882,4 +882,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
 			const char __user *const __user *argv,
 			const char __user *const __user *envp, int flags);
 
+asmlinkage long sys_membarrier(int flags);
+
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..8da542a 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
 __SYSCALL(__NR_bpf, sys_bpf)
 #define __NR_execveat 281
 __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_membarrier 282
+__SYSCALL(__NR_membarrier, sys_membarrier)
 
 #undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283
 
 /*
  * All syscalls below here should go away really,
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 00b10002..c5b0dbf 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -248,6 +248,7 @@ header-y += mdio.h
 header-y += media.h
 header-y += media-bus-format.h
 header-y += mei.h
+header-y += membarrier.h
 header-y += memfd.h
 header-y += mempolicy.h
 header-y += meye.h
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
new file mode 100644
index 0000000..928b0d5a
--- /dev/null
+++ b/include/uapi/linux/membarrier.h
@@ -0,0 +1,75 @@
+#ifndef _UAPI_LINUX_MEMBARRIER_H
+#define _UAPI_LINUX_MEMBARRIER_H
+
+/*
+ * linux/membarrier.h
+ *
+ * membarrier system call API
+ *
+ * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/*
+ * All memory accesses performed in program order from each targeted thread is
+ * guaranteed to be ordered with respect to sys_membarrier(). If we use the
+ * semantic "barrier()" to represent a compiler barrier forcing memory accesses
+ * to be performed in program order across the barrier, and smp_mb() to
+ * represent explicit memory barriers forcing full memory ordering across the
+ * barrier, we have the following ordering table for each pair of barrier(),
+ * sys_membarrier() and smp_mb() :
+ *
+ * The pair ordering is detailed as (O: ordered, X: not ordered):
+ *
+ *                        barrier()   smp_mb() sys_membarrier()
+ *        barrier()          X           X            O
+ *        smp_mb()           X           O            O
+ *        sys_membarrier()   O           O            O
+ *
+ * If the private flag is set, only running threads belonging to the same
+ * process are targeted. Else, all running threads, including those belonging to
+ * other processes are targeted.
+ */
+
+/* System call membarrier "flags" argument. */
+enum {
+	/*
+	 * Private flag set: only synchronize across a single process. If this
+	 * flag is not set, it means "shared": synchronize across multiple
+	 * processes.  The shared mode is useful for shared memory mappings
+	 * across processes.
+	 */
+	MEMBARRIER_PRIVATE_FLAG = (1 << 0),
+
+	/*
+	 * Expedited flag set: adds some overhead, fast execution (few
+	 * microseconds).  If this flag is not set, it means "delayed": low
+	 * overhead, but slow execution (few milliseconds).
+	 */
+	MEMBARRIER_EXPEDITED_FLAG = (1 << 1),
+
+	/*
+	 * Query whether the rest of the specified flags are supported, without
+	 * performing synchronization.
+	 */
+	MEMBARRIER_QUERY_FLAG = (1 << 31),
+};
+
+#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5eab11d..8b33728 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/binfmts.h>
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
+#include <linux/membarrier.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -8402,3 +8403,210 @@ void dump_cpu_task(int cpu)
 	pr_info("Task dump for CPU %d:\n", cpu);
 	sched_show_task(cpu_curr(cpu));
 }
+
+#ifdef CONFIG_SMP
+
+/*
+ * Execute a memory barrier on all active threads from the current process
+ * on SMP systems. In order to keep this code independent of implementation
+ * details of IPI handlers, do not rely on implicit barriers within IPI handler
+ * execution. This is not the bulk of the overhead anyway, so let's stay on the
+ * safe side.
+ */
+static void membarrier_ipi(void *unused)
+{
+	/* Order memory accesses with respects to sys_membarrier caller. */
+	smp_mb();
+}
+
+/*
+ * Handle out-of-memory by sending per-cpu IPIs instead.
+ */
+static void membarrier_fallback(void)
+{
+	struct mm_struct *mm;
+	int cpu;
+
+	for_each_cpu(cpu, mm_cpumask(current->mm)) {
+		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+		mm = cpu_curr(cpu)->mm;
+		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+		if (current->mm == mm)
+			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
+	}
+}
+
+static int membarrier_validate_flags(int flags)
+{
+	/* Check for unrecognized flag. */
+	if (flags & ~(MEMBARRIER_PRIVATE_FLAG | MEMBARRIER_EXPEDITED_FLAG
+			| MEMBARRIER_QUERY_FLAG))
+		return -EINVAL;
+	/* Check for unsupported flag combination. */
+	if ((flags & MEMBARRIER_EXPEDITED_FLAG)
+			&& !(flags & MEMBARRIER_PRIVATE_FLAG))
+		return -EINVAL;
+	return 0;
+}
+
+static void membarrier_expedited(void)
+{
+	struct mm_struct *mm;
+	cpumask_var_t tmpmask;
+	int cpu;
+
+	/*
+	 * Memory barrier on the caller thread between previous memory accesses
+	 * to user-space addresses and sending memory-barrier IPIs. Orders all
+	 * user-space address memory accesses prior to sys_membarrier() before
+	 * mm_cpumask read and membarrier_ipi executions. This barrier is paired
+	 * with memory barriers in:
+	 * - membarrier_ipi() (for each running threads of the current process)
+	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
+	 *                accesses to user-space addresses)
+	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
+	 *   A memory barrier is issued each time ->mm is changed while the rq
+	 *   lock is held.
+	 */
+	smp_mb();
+	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
+		membarrier_fallback();
+		goto out;
+	}
+	cpumask_copy(tmpmask, mm_cpumask(current->mm));
+	preempt_disable();
+	cpumask_clear_cpu(smp_processor_id(), tmpmask);
+	for_each_cpu(cpu, tmpmask) {
+		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+		mm = cpu_curr(cpu)->mm;
+		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+		if (current->mm != mm)
+			cpumask_clear_cpu(cpu, tmpmask);
+	}
+	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
+	preempt_enable();
+	free_cpumask_var(tmpmask);
+out:
+	/*
+	 * Memory barrier on the caller thread between sending & waiting for
+	 * memory-barrier IPIs and following memory accesses to user-space
+	 * addresses. Orders mm_cpumask read and membarrier_ipi executions
+	 * before all user-space address memory accesses following
+	 * sys_membarrier(). This barrier is paired with memory barriers in:
+	 * - membarrier_ipi() (for each running threads of the current process)
+	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
+	 *                accesses to user-space addresses)
+	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
+	 *   A memory barrier is issued each time ->mm is changed while the rq
+	 *   lock is held.
+	 */
+	smp_mb();
+}
+
+/*
+ * sys_membarrier - issue memory barrier on target threads
+ * @flags: MEMBARRIER_PRIVATE_FLAG:
+ *             Private flag set: only synchronize across a single process. If
+ *             this flag is not set, it means "shared": synchronize across
+ *             multiple processes. The shared mode is useful for shared memory
+ *             mappings across processes.
+ *         MEMBARRIER_EXPEDITED_FLAG:
+ *             Expedited flag set: adds some overhead, fast execution (few
+ *             microseconds). If this flag is not set, it means "delayed": low
+ *             overhead, but slow execution (few milliseconds).
+ *         MEMBARRIER_QUERY_FLAG:
+ *             Query whether the rest of the specified flags are supported,
+ *             without performing synchronization.
+ *
+ * return values: Returns -EINVAL if the flags are incorrect. Testing for kernel
+ * sys_membarrier support can be done by checking for -ENOSYS return value.
+ * Return value of 0 indicates success. For a given set of flags on a given
+ * kernel, this system call will always return the same value. It is therefore
+ * correct to check the return value only once during a process lifetime,
+ * setting MEMBARRIER_QUERY_FLAG to only check if the flags are supported,
+ * without performing any synchronization.
+ *
+ * This system call executes a memory barrier on all targeted threads.
+ * If the private flag is set, only running threads belonging to the same
+ * process are targeted. Else, all running threads, including those belonging to
+ * other processes are targeted. Upon completion, the caller thread is ensured
+ * that all targeted running threads have passed through a state where all
+ * memory accesses to user-space addresses match program order. (non-running
+ * threads are de facto in such a state.)
+ *
+ * Using the non-expedited mode is recommended for applications which can
+ * afford leaving the caller thread waiting for a few milliseconds. A good
+ * example would be a thread dedicated to execute RCU callbacks, which waits
+ * for callbacks to enqueue most of the time anyway.
+ *
+ * The expedited mode is recommended whenever the application needs to have
+ * control returning to the caller thread as quickly as possible. An example
+ * of such application would be one which uses the same thread to perform
+ * data structure updates and issue the RCU synchronization.
+ *
+ * It is perfectly safe to call both expedited and non-expedited
+ * sys_membarrier() in a process.
+ *
+ * The combination of expedited mode (MEMBARRIER_EXPEDITED_FLAG) and non-private
+ * (shared) (~MEMBARRIER_PRIVATE_FLAG) flags is currently unimplemented. Using
+ * this combination returns -EINVAL.
+ *
+ * mm_cpumask is used as an approximation of the processors which run threads
+ * belonging to the current process. It is a superset of the cpumask to which we
+ * must send IPIs, mainly due to lazy TLB shootdown. Therefore, for each CPU in
+ * the mm_cpumask, we check each runqueue with the rq lock held to make sure our
+ * ->mm is indeed running on them. The rq lock ensures that a memory barrier is
+ * issued each time the rq current task is changed. This reduces the risk of
+ * disturbing a RT task by sending unnecessary IPIs. There is still a slight
+ * chance to disturb an unrelated task, because we do not lock the runqueues
+ * while sending IPIs, but the real-time effect of this heavy locking would be
+ * worse than the comparatively small disruption of an IPI.
+ *
+ * RED PEN: before assigning a system call number for sys_membarrier() to an
+ * architecture, we must ensure that switch_mm issues full memory barriers
+ * (or a synchronizing instruction having the same effect) between:
+ * - memory accesses to user-space addresses and clear mm_cpumask.
+ * - set mm_cpumask and memory accesses to user-space addresses.
+ *
+ * The reason why these memory barriers are required is that mm_cpumask updates,
+ * as well as iteration on the mm_cpumask, offer no ordering guarantees.
+ * These added memory barriers ensure that any thread modifying the mm_cpumask
+ * is in a state where all memory accesses to user-space addresses are
+ * guaranteed to be in program order.
+ *
+ * In some case adding a comment to this effect will suffice, in others we
+ * will need to add explicit memory barriers.  These barriers are required to
+ * ensure we do not _miss_ a CPU that need to receive an IPI, which would be a
+ * bug.
+ *
+ * On uniprocessor systems, this system call simply returns 0 after validating
+ * the arguments, so user-space knows it is implemented.
+ */
+SYSCALL_DEFINE1(membarrier, int, flags)
+{
+	int retval;
+
+	retval = membarrier_validate_flags(flags);
+	if (retval)
+		goto end;
+	if (unlikely((flags & MEMBARRIER_QUERY_FLAG)
+		     || ((flags & MEMBARRIER_PRIVATE_FLAG)
+				&& thread_group_empty(current)))
+		     || num_online_cpus() == 1)
+		goto end;
+	if (flags & MEMBARRIER_EXPEDITED_FLAG)
+		membarrier_expedited();
+	else
+		synchronize_sched();
+end:
+	return retval;
+}
+
+#else /* !CONFIG_SMP */
+
+SYSCALL_DEFINE1(membarrier, int, flags)
+{
+	return membarrier_validate_args(flags);
+}
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..5913b84 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -229,3 +229,6 @@ cond_syscall(sys_bpf);
 
 /* execveat */
 cond_syscall(sys_execveat);
+
+/* membarrier */
+cond_syscall(sys_membarrier);
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-15 19:24 [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12) Mathieu Desnoyers
@ 2015-03-15 22:05 ` Paul E. McKenney
  2015-03-16  3:25 ` Josh Triplett
  2015-03-16 14:19 ` Peter Zijlstra
  2 siblings, 0 replies; 34+ messages in thread
From: Paul E. McKenney @ 2015-03-15 22:05 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Nicholas Miell,
	Linus Torvalds, Ingo Molnar, Alan Cox, Lai Jiangshan,
	Stephen Hemminger, Andrew Morton, Josh Triplett, Thomas Gleixner,
	Peter Zijlstra, David Howells, Nick Piggin

On Sun, Mar 15, 2015 at 03:24:19PM -0400, Mathieu Desnoyers wrote:
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on either all running threads of the current
> process (MEMBARRIER_PRIVATE_FLAG) or calls synchronize_sched() to issue
> a memory barrier on all threads running on the system. It can be used to
> distribute the cost of user-space memory barriers asymmetrically by
> transforming pairs of memory barriers into pairs consisting of
> sys_membarrier() and a compiler barrier. For synchronization primitives
> that distinguish between read-side and write-side (e.g. userspace RCU,
> rwlocks), the read-side can be accelerated significantly by moving the
> bulk of the memory barrier overhead to the write-side.

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> The first user of this system call is the "liburcu" Userspace RCU
> implementation [1]. It aims at greatly simplifying and enhancing the
> current implementation, which uses a scheme similar to the
> sys_membarrier(), but based on signals sent to each reader thread.
> Liburcu is currently packaged in all major Linux distributions.
> 
> One user of this library is the LTTng-UST (Userspace Tracing) library
> [2]. The impact of two additional memory barriers on the LTTng tracer
> fast path has been benchmarked to 35ns on our reference Intel Core Xeon
> 2.0 GHz, which adds up to about 25% performance degradation.
> 
> This patch mostly sits in kernel/sched.c (it needs to access struct rq).
> It is based on kernel v3.19, and also applies fine to master. I am
> submitting it as RFC.
> 
> Alternative approach: signals. A signal-based alternative proposed by
> Ingo would lead to important scheduler modifications, which involves
> adding context-switch in/out overhead (calling user-space upon scheduler
> switch in and out). In addition to the overhead issue, I am also
> reluctant to base a synchronization primitive on the signals, which, to
> quote Linus, are "already one of our more "exciting" layers out there",
> which does not give me the warm feeling of rock-solidness that's usually
> expected from synchronization primitives.
> 
> Changes since v11:
> - 5 years have passed.
> - Rebase on v3.19 kernel.
> - Add futex-alike PRIVATE vs SHARED semantic: private for per-process
>   barriers, non-private for memory mappings shared between processes.
> - Simplify user API.
> - Code refactoring.
> 
> Changes since v10:
> - Apply Randy's comments.
> - Rebase on 2.6.34-rc4 -tip.
> 
> Changes since v9:
> - Clean up #ifdef CONFIG_SMP.
> 
> Changes since v8:
> - Go back to rq spin locks taken by sys_membarrier() rather than adding
>   memory barriers to the scheduler. It implies a potential RoS
>   (reduction of service) if sys_membarrier() is executed in a busy-loop
>   by a user, but nothing more than what is already possible with other
>   existing system calls, but saves memory barriers in the scheduler fast
>   path.
> - re-add the memory barrier comments to x86 switch_mm() as an example to
>   other architectures.
> - Update documentation of the memory barriers in sys_membarrier and
>   switch_mm().
> - Append execution scenarios to the changelog showing the purpose of
>   each memory barrier.
> 
> Changes since v7:
> - Move spinlock-mb and scheduler related changes to separate patches.
> - Add support for sys_membarrier on x86_32.
> - Only x86 32/64 system calls are reserved in this patch. It is planned
>   to incrementally reserve syscall IDs on other architectures as these
>   are tested.
> 
> Changes since v6:
> - Remove some unlikely() not so unlikely.
> - Add the proper scheduler memory barriers needed to only use the RCU
>   read lock in sys_membarrier rather than take each runqueue spinlock:
> - Move memory barriers from per-architecture switch_mm() to schedule()
>   and finish_lock_switch(), where they clearly document that all data
>   protected by the rq lock is guaranteed to have memory barriers issued
>   between the scheduler update and the task execution. Replacing the
>   spin lock acquire/release barriers with these memory barriers imply
>   either no overhead (x86 spinlock atomic instruction already implies a
>   full mb) or some hopefully small overhead caused by the upgrade of the
>   spinlock acquire/release barriers to more heavyweight smp_mb().
> - The "generic" version of spinlock-mb.h declares both a mapping to
>   standard spinlocks and full memory barriers. Each architecture can
>   specialize this header following their own need and declare
>   CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
> - Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
>   implementations on a wide range of architecture would be welcome.
> 
> Changes since v5:
> - Plan ahead for extensibility by introducing mandatory/optional masks
>   to the "flags" system call parameter. Past experience with accept4(),
>   signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
>   inotify_init1() indicates that this is the kind of thing we want to
>   plan for. Return -EINVAL if the mandatory flags received are unknown.
> - Create include/linux/membarrier.h to define these flags.
> - Add MEMBARRIER_QUERY optional flag.
> 
> Changes since v4:
> - Add "int expedited" parameter, use synchronize_sched() in the
>   non-expedited case. Thanks to Lai Jiangshan for making us consider
>   seriously using synchronize_sched() to provide the low-overhead
>   membarrier scheme.
> - Check num_online_cpus() == 1, quickly return without doing nothing.
> 
> Changes since v3a:
> - Confirm that each CPU indeed runs the current task's ->mm before
>   sending an IPI. Ensures that we do not disturb RT tasks in the
>   presence of lazy TLB shootdown.
> - Document memory barriers needed in switch_mm().
> - Surround helper functions with #ifdef CONFIG_SMP.
> 
> Changes since v2:
> - simply send-to-many to the mm_cpumask. It contains the list of
>   processors we have to IPI to (which use the mm), and this mask is
>   updated atomically.
> 
> Changes since v1:
> - Only perform the IPI in CONFIG_SMP.
> - Only perform the IPI if the process has more than one thread.
> - Only send IPIs to CPUs involved with threads belonging to our process.
> - Adaptative IPI scheme (single vs many IPI with threshold).
> - Issue smp_mb() at the beginning and end of the system call.
> 
> To explain the benefit of this scheme, let's introduce two example threads:
> 
> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> Thread B (frequent, e.g. executing liburcu
> rcu_read_lock()/rcu_read_unlock())
> 
> In a scheme where all smp_mb() in thread A are ordering memory accesses
> with respect to smp_mb() present in Thread B, we can change each
> smp_mb() within Thread A into calls to sys_membarrier() and each
> smp_mb() within Thread B into compiler barriers "barrier()".
> 
> Before the change, we had, for each smp_mb() pairs:
> 
> Thread A                    Thread B
> previous mem accesses       previous mem accesses
> smp_mb()                    smp_mb()
> following mem accesses      following mem accesses
> 
> After the change, these pairs become:
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> sys_membarrier()            barrier()
> follow mem accesses         follow mem accesses
> 
> As we can see, there are two possible scenarios: either Thread B memory
> accesses do not happen concurrently with Thread A accesses (1), or they
> do (2).
> 
> 1) Non-concurrent Thread A vs Thread B accesses:
> 
> Thread A                    Thread B
> prev mem accesses
> sys_membarrier()
> follow mem accesses
>                             prev mem accesses
>                             barrier()
>                             follow mem accesses
> 
> In this case, thread B accesses will be weakly ordered. This is OK,
> because at that point, thread A is not particularly interested in
> ordering them with respect to its own accesses.
> 
> 2) Concurrent Thread A vs Thread B accesses
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> sys_membarrier()            barrier()
> follow mem accesses         follow mem accesses
> 
> In this case, thread B accesses, which are ensured to be in program
> order thanks to the compiler barrier, will be "upgraded" to full
> smp_mb() by to the IPIs executing memory barriers on each active
> system threads. Each non-running process threads are intrinsically
> serialized by the scheduler.
> 
> * Benchmarks
> 
> On Intel Xeon E5405 (8 cores)
> (one thread is calling sys_membarrier, the other 7 threads are busy
> looping)
> 
> - expedited
> 
>   10,000,000 sys_membarrier calls in 43s = 4.3 microseconds/call.
> 
> - non-expedited
> 
>   1000 sys_membarrier calls in 33s = 33 milliseconds/call.
> 
> Expedited is 7600 times faster than non-expedited.
> 
> * User-space user of this system call: Userspace RCU library
> 
> Both the signal-based and the sys_membarrier userspace RCU schemes
> permit us to remove the memory barrier from the userspace RCU
> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> accelerating them. These memory barriers are replaced by compiler
> barriers on the read-side, and all matching memory barriers on the
> write-side are turned into an invocation of a memory barrier on all
> active threads in the process. By letting the kernel perform this
> synchronization rather than dumbly sending a signal to every process
> threads (as we currently do), we diminish the number of unnecessary wake
> ups and only issue the memory barriers on active threads. Non-running
> threads do not need to execute such barrier anyway, because these are
> implied by the scheduler context switches.
> 
> Results in liburcu:
> 
> Operations in 10s, 6 readers, 2 writers:
> 
> memory barriers in reader:    1701557485 reads, 3129842 writes
> signal-based scheme:          9825306874 reads,    5386 writes
> sys_membarrier expedited:     6637539697 reads,  852129 writes
> sys_membarrier non-expedited: 7992076602 reads,     220 writes
> 
> The dynamic sys_membarrier availability check adds some overhead to
> the read-side compared to the signal-based scheme, but besides that,
> with the expedited scheme, we can see that we are close to the read-side
> performance of the signal-based scheme and also at 1/4 of the
> memory-barrier write-side performance. We have a write-side speedup of
> 158:1 over the signal-based scheme by using the sys_membarrier system
> call. This allows a 3.9:1 read-side speedup over the pre-existing memory
> barrier scheme.
> 
> The non-expedited scheme adds indeed a much lower overhead on the
> read-side both because we do not send IPIs and because we perform less
> updates, which in turn generates less cache-line exchanges. The
> write-side latency becomes even higher than with the signal-based
> scheme. The advantage of sys_membarrier() over signal-based scheme is
> that it does not require to wake up all the process threads.
> 
> The non-expedited sys_membarrier scheme can be useful to a userspace RCU
> flavor that encompass all processes on a system, which may share memory
> mappings.
> 
> * More information about memory barriers in:
> 
> - sys_membarrier()
> - membarrier_ipi()
> - switch_mm()
> - issued with ->mm update while the rq lock is held
> 
> The goal of these memory barriers is to ensure that all memory accesses
> to user-space addresses performed by every processor which execute
> threads belonging to the current process are observed to be in program
> order at least once between the two memory barriers surrounding
> sys_membarrier().
> 
> If we were to simply broadcast an IPI to all processors between the two
> smp_mb() in sys_membarrier(), membarrier_ipi() would execute on each
> processor, and waiting for these handlers to complete execution
> guarantees that each running processor passed through a state where
> user-space memory address accesses were in program order.
> 
> However, this "big hammer" approach does not please the real-time
> concerned people. This would let a non RT task disturb real-time tasks
> by sending useless IPIs to processors not concerned by the memory of the
> current process.
> 
> This is why we iterate on the mm_cpumask, which is a superset of the
> processors concerned by the process memory map and check each processor
> ->mm with the rq lock held to confirm that the processor is indeed
> running a thread concerned with our mm (and not just part of the
> mm_cpumask due to lazy TLB shootdown).
> 
> User-space memory address accesses must be in program order when
> mm_cpumask is set or cleared. (more details in the x86 switch_mm()
> comments).
> 
> The verification, for each cpu part of the mm_cpumask, that the rq ->mm
> is indeed part of the current ->mm needs to be done with the rq lock
> held. This ensures that each time a rq ->mm is modified, a memory
> barrier (typically implied by the change of memory mapping) is also
> issued. These ->mm update and memory barrier are made atomic by the rq
> spinlock.
> 
> The execution scenario (1) shows the behavior of the sys_membarrier()
> system call executed on Thread A while Thread B executes memory accesses
> that need to be ordered. Thread B is running. Memory accesses in Thread
> B are in program order (e.g. separated by a compiler barrier()).
> 
> 1) Thread B running, ordering ensured by the membarrier_ipi():
> 
>   Thread A                               Thread B
> -------------------------------------------------------------------------
>   prev accesses to userspace addr.       prev accesses to userspace addr.
>   sys_membarrier
>     smp_mb
>     IPI  ------------------------------> membarrier_ipi()
>                                          smp_mb
>                                          return
>     smp_mb
>   following accesses to userspace addr.  following accesses to userspace
>                                          addr.
> 
> The execution scenarios (2-3-4-5) show the same setup as (1), but Thread
> B is not running while sys_membarrier() is called. Thanks to the memory
> barriers implied by load_cr3 in switch_mm(), Thread B user-space address
> memory accesses are already in program order when sys_membarrier finds
> out that either the mm_cpumask does not contain Thread B CPU or that
> that CPU's ->mm is not running the current process mm.
> 
> 2) Context switch in, showing rq spin lock synchronization:
> 
>   Thread A                               Thread B
> -------------------------------------------------------------------------
>                                          <prev accesses to userspace addr.
>                                          saved on stack>
>   prev accesses to userspace addr.
>   sys_membarrier
>     smp_mb
>       for each cpu in mm_cpumask
>         <Thread B CPU is present e.g. due
>          to lazy TLB shootdown>
>         spin lock cpu rq
>         mm = cpu rq mm
>         spin unlock cpu rq
>                                          context switch in
>                                          <spin lock cpu rq by other thread>
>                                          load_cr3 (or equiv. mem. barrier)
>                                          spin unlock cpu rq
>                                          following accesses to userspace
>                                          addr.
>         if (mm == current rq mm)
>           <false>
>     smp_mb
>   following accesses to userspace addr.
> Here, the important point is that Thread B have passed through a point
> where all its userspace memory address accesses were in program order
> between the two smp_mb() in sys_membarrier.
> 
> 3) Context switch out, showing rq spin lock synchronization:
> 
>   Thread A                               Thread B
> -------------------------------------------------------------------------
>   prev accesses to userspace addr.
>                                          prev accesses to userspace addr.
>   sys_membarrier
>     smp_mb
>       for each cpu in mm_cpumask
>                                          context switch out
>                                          spin lock cpu rq
>                                          load_cr3 (or equiv. mem. barrier)
>                                          <spin unlock cpu rq by other thread>
>                                          <following accesses to userspace
>                                          addr. will happen when rescheduled>
>         spin lock cpu rq
>         mm = cpu rq mm
>         spin unlock cpu rq
>         if (mm == current rq mm)
>           <false>
>     smp_mb
>   following accesses to userspace addr.
> Same as (2): the important point is that Thread B have passed through a
> point where all its userspace memory address accesses were in program
> order between the two smp_mb() in sys_membarrier.
> 
> 4) Context switch in, showing mm_cpumask synchronization:
> 
>   Thread A                               Thread B
> -------------------------------------------------------------------------
>                                          <prev accesses to userspace addr.
>                                          saved on stack>
>   prev accesses to userspace addr.
>   sys_membarrier
>     smp_mb
>       for each cpu in mm_cpumask
>         <Thread B CPU not in mask>
>                                          context switch in
>                                          set cpu bit in mm_cpumask
>                                          load_cr3 (or equiv. mem. barrier)
>                                          following accesses to userspace
>                                          addr.
>     smp_mb
>   following accesses to userspace addr.
> Same as 2-3: Thread B is passing through a point where userspace memory
> address accesses are in program order between the two smp_mb() in
> sys_membarrier().
> 
> 5) Context switch out, showing mm_cpumask synchronization:
> 
>   Thread A                               Thread B
> -------------------------------------------------------------------------
>   prev accesses to userspace addr.
>                                          prev accesses to userspace addr.
>   sys_membarrier
>     smp_mb
>                                          context switch out
>                                          load_cr3 (or equiv. mem. barrier)
>                                          clear cpu bit in mm_cpumask
>                                          <following accesses to userspace
>                                          addr. will happen when rescheduled>
>       for each cpu in mm_cpumask
>         <Thread B CPU not in mask>
>     smp_mb
>   following accesses to userspace addr.
> Same as 2-3-4: Thread B is passing through a point where userspace
> memory address accesses are in program order between the two smp_mb() in
> sys_membarrier().
> 
> This patch only adds the system calls to x86. See the sys_membarrier()
> comments for memory barriers requirement in switch_mm() to port to other
> architectures.
> 
> [1] http://urcu.so
> [2] http://lttng.org
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> CC: Nicholas Miell <nmiell@comcast.net>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
> CC: Lai Jiangshan <laijs@cn.fujitsu.com>
> CC: Stephen Hemminger <stephen@networkplumber.org>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Josh Triplett <josh@joshtriplett.org>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: David Howells <dhowells@redhat.com>
> CC: Nick Piggin <npiggin@kernel.dk>
> ---
>  arch/x86/include/asm/mmu_context.h |   17 +++
>  arch/x86/syscalls/syscall_32.tbl   |    1 +
>  arch/x86/syscalls/syscall_64.tbl   |    1 +
>  include/linux/syscalls.h           |    2 +
>  include/uapi/asm-generic/unistd.h  |    4 +-
>  include/uapi/linux/Kbuild          |    1 +
>  include/uapi/linux/membarrier.h    |   75 +++++++++++++
>  kernel/sched/core.c                |  208 ++++++++++++++++++++++++++++++++++++
>  kernel/sys_ni.c                    |    3 +
>  9 files changed, 311 insertions(+), 1 deletions(-)
>  create mode 100644 include/uapi/linux/membarrier.h
> 
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 4b75d59..a30a63d 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -45,6 +45,16 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  #endif
>  		cpumask_set_cpu(cpu, mm_cpumask(next));
> 
> +		/*
> +		 * smp_mb() between mm_cpumask set and following memory
> +		 * accesses to user-space addresses is required by
> +		 * sys_membarrier(). A smp_mb() is also needed between
> +		 * prior memory accesses and mm_cpumask clear. This
> +		 * ensures that all user-space address memory accesses
> +		 * performed by the current thread are in program order
> +		 * when the mm_cpumask is set. Implied by load_cr3.
> +		 */
> +
>  		/* Re-load page tables */
>  		load_cr3(next->pgd);
>  		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> @@ -82,6 +92,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  			 * We were in lazy tlb mode and leave_mm disabled
>  			 * tlb flush IPI delivery. We must reload CR3
>  			 * to make sure to use no freed page tables.
> +			 *
> +			 * smp_mb() between mm_cpumask set and memory accesses
> +			 * to user-space addresses is required by
> +			 * sys_membarrier(). This ensures that all user-space
> +			 * address memory accesses performed by the current
> +			 * thread are in program order when the mm_cpumask is
> +			 * set. Implied by load_cr3.
>  			 */
>  			load_cr3(next->pgd);
>  			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index b3560ec..439415f 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -365,3 +365,4 @@
>  356	i386	memfd_create		sys_memfd_create
>  357	i386	bpf			sys_bpf
>  358	i386	execveat		sys_execveat			stub32_execveat
> +359	i386	membarrier		sys_membarrier
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 8d656fb..823130d 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -329,6 +329,7 @@
>  320	common	kexec_file_load		sys_kexec_file_load
>  321	common	bpf			sys_bpf
>  322	64	execveat		stub_execveat
> +323	common	membarrier		sys_membarrier
> 
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 85893d7..058ec0a 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -882,4 +882,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
>  			const char __user *const __user *argv,
>  			const char __user *const __user *envp, int flags);
> 
> +asmlinkage long sys_membarrier(int flags);
> +
>  #endif
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index e016bd9..8da542a 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
>  __SYSCALL(__NR_bpf, sys_bpf)
>  #define __NR_execveat 281
>  __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
> +#define __NR_membarrier 282
> +__SYSCALL(__NR_membarrier, sys_membarrier)
> 
>  #undef __NR_syscalls
> -#define __NR_syscalls 282
> +#define __NR_syscalls 283
> 
>  /*
>   * All syscalls below here should go away really,
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index 00b10002..c5b0dbf 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -248,6 +248,7 @@ header-y += mdio.h
>  header-y += media.h
>  header-y += media-bus-format.h
>  header-y += mei.h
> +header-y += membarrier.h
>  header-y += memfd.h
>  header-y += mempolicy.h
>  header-y += meye.h
> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> new file mode 100644
> index 0000000..928b0d5a
> --- /dev/null
> +++ b/include/uapi/linux/membarrier.h
> @@ -0,0 +1,75 @@
> +#ifndef _UAPI_LINUX_MEMBARRIER_H
> +#define _UAPI_LINUX_MEMBARRIER_H
> +
> +/*
> + * linux/membarrier.h
> + *
> + * membarrier system call API
> + *
> + * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +/*
> + * All memory accesses performed in program order from each targeted thread is
> + * guaranteed to be ordered with respect to sys_membarrier(). If we use the
> + * semantic "barrier()" to represent a compiler barrier forcing memory accesses
> + * to be performed in program order across the barrier, and smp_mb() to
> + * represent explicit memory barriers forcing full memory ordering across the
> + * barrier, we have the following ordering table for each pair of barrier(),
> + * sys_membarrier() and smp_mb() :
> + *
> + * The pair ordering is detailed as (O: ordered, X: not ordered):
> + *
> + *                        barrier()   smp_mb() sys_membarrier()
> + *        barrier()          X           X            O
> + *        smp_mb()           X           O            O
> + *        sys_membarrier()   O           O            O
> + *
> + * If the private flag is set, only running threads belonging to the same
> + * process are targeted. Else, all running threads, including those belonging to
> + * other processes are targeted.
> + */
> +
> +/* System call membarrier "flags" argument. */
> +enum {
> +	/*
> +	 * Private flag set: only synchronize across a single process. If this
> +	 * flag is not set, it means "shared": synchronize across multiple
> +	 * processes.  The shared mode is useful for shared memory mappings
> +	 * across processes.
> +	 */
> +	MEMBARRIER_PRIVATE_FLAG = (1 << 0),
> +
> +	/*
> +	 * Expedited flag set: adds some overhead, fast execution (few
> +	 * microseconds).  If this flag is not set, it means "delayed": low
> +	 * overhead, but slow execution (few milliseconds).
> +	 */
> +	MEMBARRIER_EXPEDITED_FLAG = (1 << 1),
> +
> +	/*
> +	 * Query whether the rest of the specified flags are supported, without
> +	 * performing synchronization.
> +	 */
> +	MEMBARRIER_QUERY_FLAG = (1 << 31),
> +};
> +
> +#endif /* _UAPI_LINUX_MEMBARRIER_H */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5eab11d..8b33728 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -74,6 +74,7 @@
>  #include <linux/binfmts.h>
>  #include <linux/context_tracking.h>
>  #include <linux/compiler.h>
> +#include <linux/membarrier.h>
> 
>  #include <asm/switch_to.h>
>  #include <asm/tlb.h>
> @@ -8402,3 +8403,210 @@ void dump_cpu_task(int cpu)
>  	pr_info("Task dump for CPU %d:\n", cpu);
>  	sched_show_task(cpu_curr(cpu));
>  }
> +
> +#ifdef CONFIG_SMP
> +
> +/*
> + * Execute a memory barrier on all active threads from the current process
> + * on SMP systems. In order to keep this code independent of implementation
> + * details of IPI handlers, do not rely on implicit barriers within IPI handler
> + * execution. This is not the bulk of the overhead anyway, so let's stay on the
> + * safe side.
> + */
> +static void membarrier_ipi(void *unused)
> +{
> +	/* Order memory accesses with respects to sys_membarrier caller. */
> +	smp_mb();
> +}
> +
> +/*
> + * Handle out-of-memory by sending per-cpu IPIs instead.
> + */
> +static void membarrier_fallback(void)
> +{
> +	struct mm_struct *mm;
> +	int cpu;
> +
> +	for_each_cpu(cpu, mm_cpumask(current->mm)) {
> +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> +		mm = cpu_curr(cpu)->mm;
> +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> +		if (current->mm == mm)
> +			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> +	}
> +}
> +
> +static int membarrier_validate_flags(int flags)
> +{
> +	/* Check for unrecognized flag. */
> +	if (flags & ~(MEMBARRIER_PRIVATE_FLAG | MEMBARRIER_EXPEDITED_FLAG
> +			| MEMBARRIER_QUERY_FLAG))
> +		return -EINVAL;
> +	/* Check for unsupported flag combination. */
> +	if ((flags & MEMBARRIER_EXPEDITED_FLAG)
> +			&& !(flags & MEMBARRIER_PRIVATE_FLAG))
> +		return -EINVAL;
> +	return 0;
> +}
> +
> +static void membarrier_expedited(void)
> +{
> +	struct mm_struct *mm;
> +	cpumask_var_t tmpmask;
> +	int cpu;
> +
> +	/*
> +	 * Memory barrier on the caller thread between previous memory accesses
> +	 * to user-space addresses and sending memory-barrier IPIs. Orders all
> +	 * user-space address memory accesses prior to sys_membarrier() before
> +	 * mm_cpumask read and membarrier_ipi executions. This barrier is paired
> +	 * with memory barriers in:
> +	 * - membarrier_ipi() (for each running threads of the current process)
> +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> +	 *                accesses to user-space addresses)
> +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> +	 *   A memory barrier is issued each time ->mm is changed while the rq
> +	 *   lock is held.
> +	 */
> +	smp_mb();
> +	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> +		membarrier_fallback();
> +		goto out;
> +	}
> +	cpumask_copy(tmpmask, mm_cpumask(current->mm));
> +	preempt_disable();
> +	cpumask_clear_cpu(smp_processor_id(), tmpmask);
> +	for_each_cpu(cpu, tmpmask) {
> +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> +		mm = cpu_curr(cpu)->mm;
> +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> +		if (current->mm != mm)
> +			cpumask_clear_cpu(cpu, tmpmask);
> +	}
> +	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> +	preempt_enable();
> +	free_cpumask_var(tmpmask);
> +out:
> +	/*
> +	 * Memory barrier on the caller thread between sending & waiting for
> +	 * memory-barrier IPIs and following memory accesses to user-space
> +	 * addresses. Orders mm_cpumask read and membarrier_ipi executions
> +	 * before all user-space address memory accesses following
> +	 * sys_membarrier(). This barrier is paired with memory barriers in:
> +	 * - membarrier_ipi() (for each running threads of the current process)
> +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> +	 *                accesses to user-space addresses)
> +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> +	 *   A memory barrier is issued each time ->mm is changed while the rq
> +	 *   lock is held.
> +	 */
> +	smp_mb();
> +}
> +
> +/*
> + * sys_membarrier - issue memory barrier on target threads
> + * @flags: MEMBARRIER_PRIVATE_FLAG:
> + *             Private flag set: only synchronize across a single process. If
> + *             this flag is not set, it means "shared": synchronize across
> + *             multiple processes. The shared mode is useful for shared memory
> + *             mappings across processes.
> + *         MEMBARRIER_EXPEDITED_FLAG:
> + *             Expedited flag set: adds some overhead, fast execution (few
> + *             microseconds). If this flag is not set, it means "delayed": low
> + *             overhead, but slow execution (few milliseconds).
> + *         MEMBARRIER_QUERY_FLAG:
> + *             Query whether the rest of the specified flags are supported,
> + *             without performing synchronization.
> + *
> + * return values: Returns -EINVAL if the flags are incorrect. Testing for kernel
> + * sys_membarrier support can be done by checking for -ENOSYS return value.
> + * Return value of 0 indicates success. For a given set of flags on a given
> + * kernel, this system call will always return the same value. It is therefore
> + * correct to check the return value only once during a process lifetime,
> + * setting MEMBARRIER_QUERY_FLAG to only check if the flags are supported,
> + * without performing any synchronization.
> + *
> + * This system call executes a memory barrier on all targeted threads.
> + * If the private flag is set, only running threads belonging to the same
> + * process are targeted. Else, all running threads, including those belonging to
> + * other processes are targeted. Upon completion, the caller thread is ensured
> + * that all targeted running threads have passed through a state where all
> + * memory accesses to user-space addresses match program order. (non-running
> + * threads are de facto in such a state.)
> + *
> + * Using the non-expedited mode is recommended for applications which can
> + * afford leaving the caller thread waiting for a few milliseconds. A good
> + * example would be a thread dedicated to execute RCU callbacks, which waits
> + * for callbacks to enqueue most of the time anyway.
> + *
> + * The expedited mode is recommended whenever the application needs to have
> + * control returning to the caller thread as quickly as possible. An example
> + * of such application would be one which uses the same thread to perform
> + * data structure updates and issue the RCU synchronization.
> + *
> + * It is perfectly safe to call both expedited and non-expedited
> + * sys_membarrier() in a process.
> + *
> + * The combination of expedited mode (MEMBARRIER_EXPEDITED_FLAG) and non-private
> + * (shared) (~MEMBARRIER_PRIVATE_FLAG) flags is currently unimplemented. Using
> + * this combination returns -EINVAL.
> + *
> + * mm_cpumask is used as an approximation of the processors which run threads
> + * belonging to the current process. It is a superset of the cpumask to which we
> + * must send IPIs, mainly due to lazy TLB shootdown. Therefore, for each CPU in
> + * the mm_cpumask, we check each runqueue with the rq lock held to make sure our
> + * ->mm is indeed running on them. The rq lock ensures that a memory barrier is
> + * issued each time the rq current task is changed. This reduces the risk of
> + * disturbing a RT task by sending unnecessary IPIs. There is still a slight
> + * chance to disturb an unrelated task, because we do not lock the runqueues
> + * while sending IPIs, but the real-time effect of this heavy locking would be
> + * worse than the comparatively small disruption of an IPI.
> + *
> + * RED PEN: before assigning a system call number for sys_membarrier() to an
> + * architecture, we must ensure that switch_mm issues full memory barriers
> + * (or a synchronizing instruction having the same effect) between:
> + * - memory accesses to user-space addresses and clear mm_cpumask.
> + * - set mm_cpumask and memory accesses to user-space addresses.
> + *
> + * The reason why these memory barriers are required is that mm_cpumask updates,
> + * as well as iteration on the mm_cpumask, offer no ordering guarantees.
> + * These added memory barriers ensure that any thread modifying the mm_cpumask
> + * is in a state where all memory accesses to user-space addresses are
> + * guaranteed to be in program order.
> + *
> + * In some case adding a comment to this effect will suffice, in others we
> + * will need to add explicit memory barriers.  These barriers are required to
> + * ensure we do not _miss_ a CPU that need to receive an IPI, which would be a
> + * bug.
> + *
> + * On uniprocessor systems, this system call simply returns 0 after validating
> + * the arguments, so user-space knows it is implemented.
> + */
> +SYSCALL_DEFINE1(membarrier, int, flags)
> +{
> +	int retval;
> +
> +	retval = membarrier_validate_flags(flags);
> +	if (retval)
> +		goto end;
> +	if (unlikely((flags & MEMBARRIER_QUERY_FLAG)
> +		     || ((flags & MEMBARRIER_PRIVATE_FLAG)
> +				&& thread_group_empty(current)))
> +		     || num_online_cpus() == 1)
> +		goto end;
> +	if (flags & MEMBARRIER_EXPEDITED_FLAG)
> +		membarrier_expedited();
> +	else
> +		synchronize_sched();
> +end:
> +	return retval;
> +}
> +
> +#else /* !CONFIG_SMP */
> +
> +SYSCALL_DEFINE1(membarrier, int, flags)
> +{
> +	return membarrier_validate_args(flags);
> +}
> +
> +#endif /* CONFIG_SMP */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 5adcb0a..5913b84 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -229,3 +229,6 @@ cond_syscall(sys_bpf);
> 
>  /* execveat */
>  cond_syscall(sys_execveat);
> +
> +/* membarrier */
> +cond_syscall(sys_membarrier);
> -- 
> 1.7.7.3
> 
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-15 19:24 [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12) Mathieu Desnoyers
  2015-03-15 22:05 ` Paul E. McKenney
@ 2015-03-16  3:25 ` Josh Triplett
  2015-03-16 13:00   ` Mathieu Desnoyers
  2015-03-16 14:19 ` Peter Zijlstra
  2 siblings, 1 reply; 34+ messages in thread
From: Josh Triplett @ 2015-03-16  3:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, David Howells, Nick Piggin

On Sun, Mar 15, 2015 at 03:24:19PM -0400, Mathieu Desnoyers wrote:
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on either all running threads of the current
> process (MEMBARRIER_PRIVATE_FLAG) or calls synchronize_sched() to issue
> a memory barrier on all threads running on the system. It can be used to
> distribute the cost of user-space memory barriers asymmetrically by
> transforming pairs of memory barriers into pairs consisting of
> sys_membarrier() and a compiler barrier. For synchronization primitives
> that distinguish between read-side and write-side (e.g. userspace RCU,
> rwlocks), the read-side can be accelerated significantly by moving the
> bulk of the memory barrier overhead to the write-side.

>From a quick review, this seems quite reasonable (as it did 5 years
ago).

One request: Could you please add a config option (default y) in the
EXPERT menu to disable this?  You actually seem to already have it
marked as a cond_syscall.

Also, a very minor nit: flags in kernel APIs aren't typically named with
a _FLAG suffix.

With the syscall made optional, and with or without that naming nit
fixed:
Reviewed-by: Josh Triplett <josh@joshtriplett.org>

- Josh Triplett

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-16  3:25 ` Josh Triplett
@ 2015-03-16 13:00   ` Mathieu Desnoyers
  0 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2015-03-16 13:00 UTC (permalink / raw)
  To: Josh Triplett
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, David Howells, Nick Piggin

----- Original Message -----
> From: "Josh Triplett" <josh@joshtriplett.org>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>, "Steven Rostedt"
> <rostedt@goodmis.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Nicholas Miell" <nmiell@comcast.net>,
> "Linus Torvalds" <torvalds@linux-foundation.org>, "Ingo Molnar" <mingo@redhat.com>, "Alan Cox"
> <gnomes@lxorguk.ukuu.org.uk>, "Lai Jiangshan" <laijs@cn.fujitsu.com>, "Stephen Hemminger"
> <stephen@networkplumber.org>, "Andrew Morton" <akpm@linux-foundation.org>, "Thomas Gleixner" <tglx@linutronix.de>,
> "Peter Zijlstra" <peterz@infradead.org>, "David Howells" <dhowells@redhat.com>, "Nick Piggin" <npiggin@kernel.dk>
> Sent: Sunday, March 15, 2015 11:25:50 PM
> Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
> 
> On Sun, Mar 15, 2015 at 03:24:19PM -0400, Mathieu Desnoyers wrote:
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on either all running threads of the current
> > process (MEMBARRIER_PRIVATE_FLAG) or calls synchronize_sched() to issue
> > a memory barrier on all threads running on the system. It can be used to
> > distribute the cost of user-space memory barriers asymmetrically by
> > transforming pairs of memory barriers into pairs consisting of
> > sys_membarrier() and a compiler barrier. For synchronization primitives
> > that distinguish between read-side and write-side (e.g. userspace RCU,
> > rwlocks), the read-side can be accelerated significantly by moving the
> > bulk of the memory barrier overhead to the write-side.
> 
> From a quick review, this seems quite reasonable (as it did 5 years
> ago).
> 
> One request: Could you please add a config option (default y) in the
> EXPERT menu to disable this?  You actually seem to already have it
> marked as a cond_syscall.

Sure, done.

> 
> Also, a very minor nit: flags in kernel APIs aren't typically named with
> a _FLAG suffix.

OK, good point.

> 
> With the syscall made optional, and with or without that naming nit
> fixed:
> Reviewed-by: Josh Triplett <josh@joshtriplett.org>

Thanks!

Mathieu

> 
> - Josh Triplett
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-15 19:24 [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12) Mathieu Desnoyers
  2015-03-15 22:05 ` Paul E. McKenney
  2015-03-16  3:25 ` Josh Triplett
@ 2015-03-16 14:19 ` Peter Zijlstra
  2015-03-16 14:24   ` Steven Rostedt
  2015-03-16 15:43   ` Mathieu Desnoyers
  2 siblings, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2015-03-16 14:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

On Sun, Mar 15, 2015 at 03:24:19PM -0400, Mathieu Desnoyers wrote:

TL;DR

> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -45,6 +45,16 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  #endif
>  		cpumask_set_cpu(cpu, mm_cpumask(next));
>  
> +		/*
> +		 * smp_mb() between mm_cpumask set and following memory
> +		 * accesses to user-space addresses is required by
> +		 * sys_membarrier(). A smp_mb() is also needed between
> +		 * prior memory accesses and mm_cpumask clear. This
> +		 * ensures that all user-space address memory accesses
> +		 * performed by the current thread are in program order
> +		 * when the mm_cpumask is set. Implied by load_cr3.
> +		 */
> +
>  		/* Re-load page tables */
>  		load_cr3(next->pgd);
>  		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> @@ -82,6 +92,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  			 * We were in lazy tlb mode and leave_mm disabled
>  			 * tlb flush IPI delivery. We must reload CR3
>  			 * to make sure to use no freed page tables.
> +			 *
> +			 * smp_mb() between mm_cpumask set and memory accesses
> +			 * to user-space addresses is required by
> +			 * sys_membarrier(). This ensures that all user-space
> +			 * address memory accesses performed by the current
> +			 * thread are in program order when the mm_cpumask is
> +			 * set. Implied by load_cr3.
>  			 */
>  			load_cr3(next->pgd);
>  			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);


In both cases the cpumask_set_cpu() will also imply a MB.

> +enum {
> +	/*
> +	 * Private flag set: only synchronize across a single process. If this
> +	 * flag is not set, it means "shared": synchronize across multiple
> +	 * processes.  The shared mode is useful for shared memory mappings
> +	 * across processes.
> +	 */
> +	MEMBARRIER_PRIVATE_FLAG = (1 << 0),
> +
> +	/*
> +	 * Expedited flag set: adds some overhead, fast execution (few
> +	 * microseconds).  If this flag is not set, it means "delayed": low
> +	 * overhead, but slow execution (few milliseconds).
> +	 */
> +	MEMBARRIER_EXPEDITED_FLAG = (1 << 1),


I suppose this is an unprivileged syscall; so what do we do about:

	for (;;)
		sys_membar(EXPEDITED);

Which would spray the entire system with IPIs at break neck speed.

> +static void membarrier_ipi(void *unused)
> +{
> +	/* Order memory accesses with respects to sys_membarrier caller. */
> +	smp_mb();
> +}
> +
> +/*
> + * Handle out-of-memory by sending per-cpu IPIs instead.
> + */
> +static void membarrier_fallback(void)
> +{
> +	struct mm_struct *mm;
> +	int cpu;
> +
> +	for_each_cpu(cpu, mm_cpumask(current->mm)) {
> +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> +		mm = cpu_curr(cpu)->mm;
> +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> +		if (current->mm == mm)
> +			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> +	}
> +}

> +static void membarrier_expedited(void)
> +{
> +	struct mm_struct *mm;
> +	cpumask_var_t tmpmask;
> +	int cpu;
> +
> +	/*
> +	 * Memory barrier on the caller thread between previous memory accesses
> +	 * to user-space addresses and sending memory-barrier IPIs. Orders all
> +	 * user-space address memory accesses prior to sys_membarrier() before
> +	 * mm_cpumask read and membarrier_ipi executions. This barrier is paired
> +	 * with memory barriers in:
> +	 * - membarrier_ipi() (for each running threads of the current process)
> +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> +	 *                accesses to user-space addresses)
> +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> +	 *   A memory barrier is issued each time ->mm is changed while the rq
> +	 *   lock is held.
> +	 */
> +	smp_mb();
> +	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> +		membarrier_fallback();
> +		goto out;
> +	}
> +	cpumask_copy(tmpmask, mm_cpumask(current->mm));
> +	preempt_disable();
> +	cpumask_clear_cpu(smp_processor_id(), tmpmask);
> +	for_each_cpu(cpu, tmpmask) {
> +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> +		mm = cpu_curr(cpu)->mm;
> +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> +		if (current->mm != mm)
> +			cpumask_clear_cpu(cpu, tmpmask);
> +	}
> +	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> +	preempt_enable();
> +	free_cpumask_var(tmpmask);
> +out:
> +	/*
> +	 * Memory barrier on the caller thread between sending & waiting for
> +	 * memory-barrier IPIs and following memory accesses to user-space
> +	 * addresses. Orders mm_cpumask read and membarrier_ipi executions
> +	 * before all user-space address memory accesses following
> +	 * sys_membarrier(). This barrier is paired with memory barriers in:
> +	 * - membarrier_ipi() (for each running threads of the current process)
> +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> +	 *                accesses to user-space addresses)
> +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> +	 *   A memory barrier is issued each time ->mm is changed while the rq
> +	 *   lock is held.
> +	 */
> +	smp_mb();
> +}

Did you just write:

bool membar_cpu_is_mm(int cpu, void *info)
{
	struct mm_struct *mm = info;
	struct rq *rq = cpu_rq(cpu);
	bool ret;

	raw_spin_lock_irq(&rq->lock);
	ret = rq->curr->mm == mm;
	raw_spin_unlock_irq(&rq->lock);

	return ret;
}

	on_each_cpu_cond(membar_cpu_is_mm, membar_ipi, mm, 1, GFP_NOWAIT);



On which; I absolutely hate that rq->lock thing in there. What is
'wrong' with doing a lockless compare there? Other than not actually
being able to deref rq->curr of course, but we need to fix that anyhow.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-16 14:19 ` Peter Zijlstra
@ 2015-03-16 14:24   ` Steven Rostedt
  2015-03-16 15:49     ` Mathieu Desnoyers
  2015-03-16 15:49     ` Paul E. McKenney
  2015-03-16 15:43   ` Mathieu Desnoyers
  1 sibling, 2 replies; 34+ messages in thread
From: Steven Rostedt @ 2015-03-16 14:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, linux-kernel, KOSAKI Motohiro,
	Paul E. McKenney, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Andrew Morton,
	Josh Triplett, Thomas Gleixner, David Howells, Nick Piggin

On Mon, 16 Mar 2015 15:19:39 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> 
> I suppose this is an unprivileged syscall; so what do we do about:
> 
> 	for (;;)
> 		sys_membar(EXPEDITED);
> 
> Which would spray the entire system with IPIs at break neck speed.

Perhaps it should be rate limited. Have parameters (controlled via
sysctl) that will only allow so many of these per ms. If it exceeds it,
then the call will end up being a schedule_timeout() till it is allowed
to continue. Thus, the above will spit out a few hundred IPIs, then
sleep for a millisecond, and then spit out another hundred IPIs and
sleep again.

That would prevent any DoS attacks.

-- Steve

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-16 14:19 ` Peter Zijlstra
  2015-03-16 14:24   ` Steven Rostedt
@ 2015-03-16 15:43   ` Mathieu Desnoyers
  2015-03-16 15:57     ` Mathieu Desnoyers
                       ` (3 more replies)
  1 sibling, 4 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2015-03-16 15:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

----- Original Message -----
> From: "Peter Zijlstra" <peterz@infradead.org>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>, "Steven Rostedt"
> <rostedt@goodmis.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Nicholas Miell" <nmiell@comcast.net>,
> "Linus Torvalds" <torvalds@linux-foundation.org>, "Ingo Molnar" <mingo@redhat.com>, "Alan Cox"
> <gnomes@lxorguk.ukuu.org.uk>, "Lai Jiangshan" <laijs@cn.fujitsu.com>, "Stephen Hemminger"
> <stephen@networkplumber.org>, "Andrew Morton" <akpm@linux-foundation.org>, "Josh Triplett" <josh@joshtriplett.org>,
> "Thomas Gleixner" <tglx@linutronix.de>, "David Howells" <dhowells@redhat.com>, "Nick Piggin" <npiggin@kernel.dk>
> Sent: Monday, March 16, 2015 10:19:39 AM
> Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
> 
> On Sun, Mar 15, 2015 at 03:24:19PM -0400, Mathieu Desnoyers wrote:
> 
> TL;DR
> 
> > --- a/arch/x86/include/asm/mmu_context.h
> > +++ b/arch/x86/include/asm/mmu_context.h
> > @@ -45,6 +45,16 @@ static inline void switch_mm(struct mm_struct *prev,
> > struct mm_struct *next,
> >  #endif
> >  		cpumask_set_cpu(cpu, mm_cpumask(next));
> >  
> > +		/*
> > +		 * smp_mb() between mm_cpumask set and following memory
> > +		 * accesses to user-space addresses is required by
> > +		 * sys_membarrier(). A smp_mb() is also needed between
> > +		 * prior memory accesses and mm_cpumask clear. This
> > +		 * ensures that all user-space address memory accesses
> > +		 * performed by the current thread are in program order
> > +		 * when the mm_cpumask is set. Implied by load_cr3.
> > +		 */
> > +
> >  		/* Re-load page tables */
> >  		load_cr3(next->pgd);
> >  		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> > @@ -82,6 +92,13 @@ static inline void switch_mm(struct mm_struct *prev,
> > struct mm_struct *next,
> >  			 * We were in lazy tlb mode and leave_mm disabled
> >  			 * tlb flush IPI delivery. We must reload CR3
> >  			 * to make sure to use no freed page tables.
> > +			 *
> > +			 * smp_mb() between mm_cpumask set and memory accesses
> > +			 * to user-space addresses is required by
> > +			 * sys_membarrier(). This ensures that all user-space
> > +			 * address memory accesses performed by the current
> > +			 * thread are in program order when the mm_cpumask is
> > +			 * set. Implied by load_cr3.
> >  			 */
> >  			load_cr3(next->pgd);
> >  			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> 
> 
> In both cases the cpumask_set_cpu() will also imply a MB.

I'm probably missing what exactly in cpumask_set_cpu()
implies this guarantee. cpumask_set_cpu() uses set_bit().
On x86, set_bit is indeed implemented with a lock-prefixed
orb or bts. However, the comment above set_bit() states:

 * Note: there are no guarantees that this function will not be reordered
 * on non x86 architectures, so if you are writing portable code,
 * make sure not to rely on its reordering guarantees.

And it states nothing about memory barriers. Typically,
atomic ops that imply memory barriers always return
something (xchg, cmpxchg, add_return). Ops like atomic_add
do not imply barriers.

> 
> > +enum {
> > +	/*
> > +	 * Private flag set: only synchronize across a single process. If this
> > +	 * flag is not set, it means "shared": synchronize across multiple
> > +	 * processes.  The shared mode is useful for shared memory mappings
> > +	 * across processes.
> > +	 */
> > +	MEMBARRIER_PRIVATE_FLAG = (1 << 0),
> > +
> > +	/*
> > +	 * Expedited flag set: adds some overhead, fast execution (few
> > +	 * microseconds).  If this flag is not set, it means "delayed": low
> > +	 * overhead, but slow execution (few milliseconds).
> > +	 */
> > +	MEMBARRIER_EXPEDITED_FLAG = (1 << 1),
> 
> 
> I suppose this is an unprivileged syscall; so what do we do about:
> 
> 	for (;;)
> 		sys_membar(EXPEDITED);
> 
> Which would spray the entire system with IPIs at break neck speed.

Currently, combining EXPEDITED with non-PRIVATE returns -EINVAL.
Therefore, if someone cares about issuing barriers on the entire
system, the only option is to use non-EXPEDITED, which rely on
synchronize_rcu().

The only way to invoke expedited barriers in a loop is:

for (;;)
        sys_membarrier(MEMBARRIER_EXPEDITED | MEMBARRIER_PRIVATE);

Which will only send IPIs to the CPU running threads from the same
process.

> 
> > +static void membarrier_ipi(void *unused)
> > +{
> > +	/* Order memory accesses with respects to sys_membarrier caller. */
> > +	smp_mb();
> > +}
> > +
> > +/*
> > + * Handle out-of-memory by sending per-cpu IPIs instead.
> > + */
> > +static void membarrier_fallback(void)
> > +{
> > +	struct mm_struct *mm;
> > +	int cpu;
> > +
> > +	for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > +		mm = cpu_curr(cpu)->mm;
> > +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > +		if (current->mm == mm)
> > +			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > +	}
> > +}
> 
> > +static void membarrier_expedited(void)
> > +{
> > +	struct mm_struct *mm;
> > +	cpumask_var_t tmpmask;
> > +	int cpu;
> > +
> > +	/*
> > +	 * Memory barrier on the caller thread between previous memory accesses
> > +	 * to user-space addresses and sending memory-barrier IPIs. Orders all
> > +	 * user-space address memory accesses prior to sys_membarrier() before
> > +	 * mm_cpumask read and membarrier_ipi executions. This barrier is paired
> > +	 * with memory barriers in:
> > +	 * - membarrier_ipi() (for each running threads of the current process)
> > +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > +	 *                accesses to user-space addresses)
> > +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > +	 *   A memory barrier is issued each time ->mm is changed while the rq
> > +	 *   lock is held.
> > +	 */
> > +	smp_mb();
> > +	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> > +		membarrier_fallback();
> > +		goto out;
> > +	}
> > +	cpumask_copy(tmpmask, mm_cpumask(current->mm));
> > +	preempt_disable();
> > +	cpumask_clear_cpu(smp_processor_id(), tmpmask);
> > +	for_each_cpu(cpu, tmpmask) {
> > +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > +		mm = cpu_curr(cpu)->mm;
> > +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > +		if (current->mm != mm)
> > +			cpumask_clear_cpu(cpu, tmpmask);
> > +	}
> > +	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> > +	preempt_enable();
> > +	free_cpumask_var(tmpmask);
> > +out:
> > +	/*
> > +	 * Memory barrier on the caller thread between sending & waiting for
> > +	 * memory-barrier IPIs and following memory accesses to user-space
> > +	 * addresses. Orders mm_cpumask read and membarrier_ipi executions
> > +	 * before all user-space address memory accesses following
> > +	 * sys_membarrier(). This barrier is paired with memory barriers in:
> > +	 * - membarrier_ipi() (for each running threads of the current process)
> > +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > +	 *                accesses to user-space addresses)
> > +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > +	 *   A memory barrier is issued each time ->mm is changed while the rq
> > +	 *   lock is held.
> > +	 */
> > +	smp_mb();
> > +}
> 
> Did you just write:
> 
> bool membar_cpu_is_mm(int cpu, void *info)
> {
> 	struct mm_struct *mm = info;
> 	struct rq *rq = cpu_rq(cpu);
> 	bool ret;
> 
> 	raw_spin_lock_irq(&rq->lock);
> 	ret = rq->curr->mm == mm;
> 	raw_spin_unlock_irq(&rq->lock);
> 
> 	return ret;
> }
> 
> 	on_each_cpu_cond(membar_cpu_is_mm, membar_ipi, mm, 1, GFP_NOWAIT);
> 

It is very similar indeed! The main difference is that my implementation
was starting from a copy of mm_cpumask(current->mm) and clearing the CPUs
for which TLB shootdown is simply pending (confirmed by taking the rq lock
and checking cpu_curr(cpu)->mm against current->mm).

Now that you mention this, I think we don't really need to use
mm_cpumask(current->mm) at all. Just iterating on each cpu, taking
the rq lock, and comparing the mm should be enough. This would
remove the need to rely on having extra memory barriers around
set/clear of the mm cpumask.

The main reason why I did not use on_each_cpu_cond() was that it did
not exist back in 2010. ;-)

> On which; I absolutely hate that rq->lock thing in there. What is
> 'wrong' with doing a lockless compare there? Other than not actually
> being able to deref rq->curr of course, but we need to fix that anyhow.

If we can make sure rq->curr deref could be done without holding the rq
lock, then I think all we would need is to ensure that updates to rq->curr
are surrounded by memory barriers. Therefore, we would have the following:

* When a thread is scheduled out, a memory barrier would be issued before
  rq->curr is updated to the next thread task_struct.

* Before a thread is scheduled in, a memory barrier needs to be issued
  after rq->curr is updated to the incoming thread.

In order to be able to dereference rq->curr->mm without holding the
rq->lock, do you envision we should protect task reclaim with RCU-sched ?

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-16 14:24   ` Steven Rostedt
@ 2015-03-16 15:49     ` Mathieu Desnoyers
  2015-03-16 15:49     ` Paul E. McKenney
  1 sibling, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2015-03-16 15:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, linux-kernel, KOSAKI Motohiro, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

----- Original Message -----
> From: "Steven Rostedt" <rostedt@goodmis.org>
> To: "Peter Zijlstra" <peterz@infradead.org>
> Cc: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>, linux-kernel@vger.kernel.org, "KOSAKI Motohiro"
> <kosaki.motohiro@jp.fujitsu.com>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Nicholas Miell"
> <nmiell@comcast.net>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Ingo Molnar" <mingo@redhat.com>, "Alan Cox"
> <gnomes@lxorguk.ukuu.org.uk>, "Lai Jiangshan" <laijs@cn.fujitsu.com>, "Stephen Hemminger"
> <stephen@networkplumber.org>, "Andrew Morton" <akpm@linux-foundation.org>, "Josh Triplett" <josh@joshtriplett.org>,
> "Thomas Gleixner" <tglx@linutronix.de>, "David Howells" <dhowells@redhat.com>, "Nick Piggin" <npiggin@kernel.dk>
> Sent: Monday, March 16, 2015 10:24:30 AM
> Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
> 
> On Mon, 16 Mar 2015 15:19:39 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > 
> > I suppose this is an unprivileged syscall; so what do we do about:
> > 
> > 	for (;;)
> > 		sys_membar(EXPEDITED);
> > 
> > Which would spray the entire system with IPIs at break neck speed.
> 
> Perhaps it should be rate limited. Have parameters (controlled via
> sysctl) that will only allow so many of these per ms. If it exceeds it,
> then the call will end up being a schedule_timeout() till it is allowed
> to continue. Thus, the above will spit out a few hundred IPIs, then
> sleep for a millisecond, and then spit out another hundred IPIs and
> sleep again.
> 
> That would prevent any DoS attacks.

As I pointed out in my other email, EXPEDITED | ~PRIVATE currently
returns -EINVAL. The only way to do a system-wide barrier with
this membarrier implementation is to use synchronize_sched() (~EXPEDITED),
which I strongly doubt would perform a DoS.

If we eventually care about a EXPEDITED | ~PRIVATE implementation,
then I agree that rate limiting might be a good way to do it. I would
be a bit uncomfortable sending IPIs to _all_ CPUs though, even with
rate limiting. But perhaps it's a non-issue ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-16 14:24   ` Steven Rostedt
  2015-03-16 15:49     ` Mathieu Desnoyers
@ 2015-03-16 15:49     ` Paul E. McKenney
  2015-03-16 16:12       ` Steven Rostedt
  1 sibling, 1 reply; 34+ messages in thread
From: Paul E. McKenney @ 2015-03-16 15:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Mathieu Desnoyers, linux-kernel, KOSAKI Motohiro,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

On Mon, Mar 16, 2015 at 10:24:30AM -0400, Steven Rostedt wrote:
> On Mon, 16 Mar 2015 15:19:39 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > 
> > I suppose this is an unprivileged syscall; so what do we do about:
> > 
> > 	for (;;)
> > 		sys_membar(EXPEDITED);
> > 
> > Which would spray the entire system with IPIs at break neck speed.
> 
> Perhaps it should be rate limited. Have parameters (controlled via
> sysctl) that will only allow so many of these per ms. If it exceeds it,
> then the call will end up being a schedule_timeout() till it is allowed
> to continue. Thus, the above will spit out a few hundred IPIs, then
> sleep for a millisecond, and then spit out another hundred IPIs and
> sleep again.
> 
> That would prevent any DoS attacks.

But this would only qualify as a DoS if MEMBARRIER_EXPEDITED_FLAG and
!MEMBARRIER_PRIVATE_FLAG.  Otherwise, the user's process is only DoSing
itself, which is that user's problem, not anyone else's.  And it looks
like the current patch refuses to implement this DoS case, unless I am
really confused about the code in membarrier_expedited().  And in fact
membarrier_validate_flags() checks for this DoS case and returns -EINVAL.

So I do not believe that this syscall permits that type of DoS.

What am I missing here?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-16 15:43   ` Mathieu Desnoyers
@ 2015-03-16 15:57     ` Mathieu Desnoyers
  2015-03-16 17:13     ` Peter Zijlstra
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2015-03-16 15:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

----- Original Message -----
> From: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> To: "Peter Zijlstra" <peterz@infradead.org>
> Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>, "Steven Rostedt"
> <rostedt@goodmis.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Nicholas Miell" <nmiell@comcast.net>,
> "Linus Torvalds" <torvalds@linux-foundation.org>, "Ingo Molnar" <mingo@redhat.com>, "Alan Cox"
> <gnomes@lxorguk.ukuu.org.uk>, "Lai Jiangshan" <laijs@cn.fujitsu.com>, "Stephen Hemminger"
> <stephen@networkplumber.org>, "Andrew Morton" <akpm@linux-foundation.org>, "Josh Triplett" <josh@joshtriplett.org>,
> "Thomas Gleixner" <tglx@linutronix.de>, "David Howells" <dhowells@redhat.com>, "Nick Piggin" <npiggin@kernel.dk>
> Sent: Monday, March 16, 2015 11:43:56 AM
> Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
> 
> ----- Original Message -----
> > From: "Peter Zijlstra" <peterz@infradead.org>
> > To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> > Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro"
> > <kosaki.motohiro@jp.fujitsu.com>, "Steven Rostedt"
> > <rostedt@goodmis.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
> > "Nicholas Miell" <nmiell@comcast.net>,
> > "Linus Torvalds" <torvalds@linux-foundation.org>, "Ingo Molnar"
> > <mingo@redhat.com>, "Alan Cox"
> > <gnomes@lxorguk.ukuu.org.uk>, "Lai Jiangshan" <laijs@cn.fujitsu.com>,
> > "Stephen Hemminger"
> > <stephen@networkplumber.org>, "Andrew Morton" <akpm@linux-foundation.org>,
> > "Josh Triplett" <josh@joshtriplett.org>,
> > "Thomas Gleixner" <tglx@linutronix.de>, "David Howells"
> > <dhowells@redhat.com>, "Nick Piggin" <npiggin@kernel.dk>
> > Sent: Monday, March 16, 2015 10:19:39 AM
> > Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory
> > barrier (x86) (v12)
> > 
> > On Sun, Mar 15, 2015 at 03:24:19PM -0400, Mathieu Desnoyers wrote:
> > 
> > TL;DR
> > 
> > > --- a/arch/x86/include/asm/mmu_context.h
> > > +++ b/arch/x86/include/asm/mmu_context.h
> > > @@ -45,6 +45,16 @@ static inline void switch_mm(struct mm_struct *prev,
> > > struct mm_struct *next,
> > >  #endif
> > >  		cpumask_set_cpu(cpu, mm_cpumask(next));
> > >  
> > > +		/*
> > > +		 * smp_mb() between mm_cpumask set and following memory
> > > +		 * accesses to user-space addresses is required by
> > > +		 * sys_membarrier(). A smp_mb() is also needed between
> > > +		 * prior memory accesses and mm_cpumask clear. This
> > > +		 * ensures that all user-space address memory accesses
> > > +		 * performed by the current thread are in program order
> > > +		 * when the mm_cpumask is set. Implied by load_cr3.
> > > +		 */
> > > +
> > >  		/* Re-load page tables */
> > >  		load_cr3(next->pgd);
> > >  		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> > > @@ -82,6 +92,13 @@ static inline void switch_mm(struct mm_struct *prev,
> > > struct mm_struct *next,
> > >  			 * We were in lazy tlb mode and leave_mm disabled
> > >  			 * tlb flush IPI delivery. We must reload CR3
> > >  			 * to make sure to use no freed page tables.
> > > +			 *
> > > +			 * smp_mb() between mm_cpumask set and memory accesses
> > > +			 * to user-space addresses is required by
> > > +			 * sys_membarrier(). This ensures that all user-space
> > > +			 * address memory accesses performed by the current
> > > +			 * thread are in program order when the mm_cpumask is
> > > +			 * set. Implied by load_cr3.
> > >  			 */
> > >  			load_cr3(next->pgd);
> > >  			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> > 
> > 
> > In both cases the cpumask_set_cpu() will also imply a MB.
> 
> I'm probably missing what exactly in cpumask_set_cpu()
> implies this guarantee. cpumask_set_cpu() uses set_bit().
> On x86, set_bit is indeed implemented with a lock-prefixed
> orb or bts. However, the comment above set_bit() states:
> 
>  * Note: there are no guarantees that this function will not be reordered
>  * on non x86 architectures, so if you are writing portable code,
>  * make sure not to rely on its reordering guarantees.
> 
> And it states nothing about memory barriers. Typically,
> atomic ops that imply memory barriers always return
> something (xchg, cmpxchg, add_return). Ops like atomic_add
> do not imply barriers.
> 
> > 
> > > +enum {
> > > +	/*
> > > +	 * Private flag set: only synchronize across a single process. If this
> > > +	 * flag is not set, it means "shared": synchronize across multiple
> > > +	 * processes.  The shared mode is useful for shared memory mappings
> > > +	 * across processes.
> > > +	 */
> > > +	MEMBARRIER_PRIVATE_FLAG = (1 << 0),
> > > +
> > > +	/*
> > > +	 * Expedited flag set: adds some overhead, fast execution (few
> > > +	 * microseconds).  If this flag is not set, it means "delayed": low
> > > +	 * overhead, but slow execution (few milliseconds).
> > > +	 */
> > > +	MEMBARRIER_EXPEDITED_FLAG = (1 << 1),
> > 
> > 
> > I suppose this is an unprivileged syscall; so what do we do about:
> > 
> > 	for (;;)
> > 		sys_membar(EXPEDITED);
> > 
> > Which would spray the entire system with IPIs at break neck speed.
> 
> Currently, combining EXPEDITED with non-PRIVATE returns -EINVAL.
> Therefore, if someone cares about issuing barriers on the entire
> system, the only option is to use non-EXPEDITED, which rely on
> synchronize_rcu().

Sorry, I meant "synchronize_sched()".

Thanks,

Mathieu

> 
> The only way to invoke expedited barriers in a loop is:
> 
> for (;;)
>         sys_membarrier(MEMBARRIER_EXPEDITED | MEMBARRIER_PRIVATE);
> 
> Which will only send IPIs to the CPU running threads from the same
> process.
> 
> > 
> > > +static void membarrier_ipi(void *unused)
> > > +{
> > > +	/* Order memory accesses with respects to sys_membarrier caller. */
> > > +	smp_mb();
> > > +}
> > > +
> > > +/*
> > > + * Handle out-of-memory by sending per-cpu IPIs instead.
> > > + */
> > > +static void membarrier_fallback(void)
> > > +{
> > > +	struct mm_struct *mm;
> > > +	int cpu;
> > > +
> > > +	for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > > +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > > +		mm = cpu_curr(cpu)->mm;
> > > +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > +		if (current->mm == mm)
> > > +			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > > +	}
> > > +}
> > 
> > > +static void membarrier_expedited(void)
> > > +{
> > > +	struct mm_struct *mm;
> > > +	cpumask_var_t tmpmask;
> > > +	int cpu;
> > > +
> > > +	/*
> > > +	 * Memory barrier on the caller thread between previous memory accesses
> > > +	 * to user-space addresses and sending memory-barrier IPIs. Orders all
> > > +	 * user-space address memory accesses prior to sys_membarrier() before
> > > +	 * mm_cpumask read and membarrier_ipi executions. This barrier is
> > > paired
> > > +	 * with memory barriers in:
> > > +	 * - membarrier_ipi() (for each running threads of the current process)
> > > +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > > +	 *                accesses to user-space addresses)
> > > +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > > +	 *   A memory barrier is issued each time ->mm is changed while the rq
> > > +	 *   lock is held.
> > > +	 */
> > > +	smp_mb();
> > > +	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> > > +		membarrier_fallback();
> > > +		goto out;
> > > +	}
> > > +	cpumask_copy(tmpmask, mm_cpumask(current->mm));
> > > +	preempt_disable();
> > > +	cpumask_clear_cpu(smp_processor_id(), tmpmask);
> > > +	for_each_cpu(cpu, tmpmask) {
> > > +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > > +		mm = cpu_curr(cpu)->mm;
> > > +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > +		if (current->mm != mm)
> > > +			cpumask_clear_cpu(cpu, tmpmask);
> > > +	}
> > > +	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> > > +	preempt_enable();
> > > +	free_cpumask_var(tmpmask);
> > > +out:
> > > +	/*
> > > +	 * Memory barrier on the caller thread between sending & waiting for
> > > +	 * memory-barrier IPIs and following memory accesses to user-space
> > > +	 * addresses. Orders mm_cpumask read and membarrier_ipi executions
> > > +	 * before all user-space address memory accesses following
> > > +	 * sys_membarrier(). This barrier is paired with memory barriers in:
> > > +	 * - membarrier_ipi() (for each running threads of the current process)
> > > +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > > +	 *                accesses to user-space addresses)
> > > +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > > +	 *   A memory barrier is issued each time ->mm is changed while the rq
> > > +	 *   lock is held.
> > > +	 */
> > > +	smp_mb();
> > > +}
> > 
> > Did you just write:
> > 
> > bool membar_cpu_is_mm(int cpu, void *info)
> > {
> > 	struct mm_struct *mm = info;
> > 	struct rq *rq = cpu_rq(cpu);
> > 	bool ret;
> > 
> > 	raw_spin_lock_irq(&rq->lock);
> > 	ret = rq->curr->mm == mm;
> > 	raw_spin_unlock_irq(&rq->lock);
> > 
> > 	return ret;
> > }
> > 
> > 	on_each_cpu_cond(membar_cpu_is_mm, membar_ipi, mm, 1, GFP_NOWAIT);
> > 
> 
> It is very similar indeed! The main difference is that my implementation
> was starting from a copy of mm_cpumask(current->mm) and clearing the CPUs
> for which TLB shootdown is simply pending (confirmed by taking the rq lock
> and checking cpu_curr(cpu)->mm against current->mm).
> 
> Now that you mention this, I think we don't really need to use
> mm_cpumask(current->mm) at all. Just iterating on each cpu, taking
> the rq lock, and comparing the mm should be enough. This would
> remove the need to rely on having extra memory barriers around
> set/clear of the mm cpumask.
> 
> The main reason why I did not use on_each_cpu_cond() was that it did
> not exist back in 2010. ;-)
> 
> > On which; I absolutely hate that rq->lock thing in there. What is
> > 'wrong' with doing a lockless compare there? Other than not actually
> > being able to deref rq->curr of course, but we need to fix that anyhow.
> 
> If we can make sure rq->curr deref could be done without holding the rq
> lock, then I think all we would need is to ensure that updates to rq->curr
> are surrounded by memory barriers. Therefore, we would have the following:
> 
> * When a thread is scheduled out, a memory barrier would be issued before
>   rq->curr is updated to the next thread task_struct.
> 
> * Before a thread is scheduled in, a memory barrier needs to be issued
>   after rq->curr is updated to the incoming thread.
> 
> In order to be able to dereference rq->curr->mm without holding the
> rq->lock, do you envision we should protect task reclaim with RCU-sched ?
> 
> Thanks!
> 
> Mathieu
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-16 15:49     ` Paul E. McKenney
@ 2015-03-16 16:12       ` Steven Rostedt
  0 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2015-03-16 16:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Mathieu Desnoyers, linux-kernel, KOSAKI Motohiro,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

On Mon, 16 Mar 2015 08:49:40 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Mon, Mar 16, 2015 at 10:24:30AM -0400, Steven Rostedt wrote:
> > On Mon, 16 Mar 2015 15:19:39 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > 
> > > I suppose this is an unprivileged syscall; so what do we do about:
> > > 
> > > 	for (;;)
> > > 		sys_membar(EXPEDITED);
> > > 
> > > Which would spray the entire system with IPIs at break neck speed.
> > 
> > Perhaps it should be rate limited. Have parameters (controlled via
> > sysctl) that will only allow so many of these per ms. If it exceeds it,
> > then the call will end up being a schedule_timeout() till it is allowed
> > to continue. Thus, the above will spit out a few hundred IPIs, then
> > sleep for a millisecond, and then spit out another hundred IPIs and
> > sleep again.
> > 
> > That would prevent any DoS attacks.
> 
> But this would only qualify as a DoS if MEMBARRIER_EXPEDITED_FLAG and
> !MEMBARRIER_PRIVATE_FLAG.  Otherwise, the user's process is only DoSing
> itself, which is that user's problem, not anyone else's.  And it looks
> like the current patch refuses to implement this DoS case, unless I am
> really confused about the code in membarrier_expedited().  And in fact
> membarrier_validate_flags() checks for this DoS case and returns -EINVAL.
> 
> So I do not believe that this syscall permits that type of DoS.
> 
> What am I missing here?
>

That I wasn't replying about the patch but only to Peter's comment,
which made it appear that sys_membar() would spew IPIs over the entire
system. If this is not a case, then there's no need for rate limiting.

-- Steve

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-16 15:43   ` Mathieu Desnoyers
  2015-03-16 15:57     ` Mathieu Desnoyers
@ 2015-03-16 17:13     ` Peter Zijlstra
  2015-03-16 17:21     ` Peter Zijlstra
  2015-03-16 17:24     ` Peter Zijlstra
  3 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2015-03-16 17:13 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

On Mon, Mar 16, 2015 at 03:43:56PM +0000, Mathieu Desnoyers wrote:
> > > +++ b/arch/x86/include/asm/mmu_context.h

> > In both cases the cpumask_set_cpu() will also imply a MB.
> 
> I'm probably missing what exactly in cpumask_set_cpu()
> implies this guarantee. cpumask_set_cpu() uses set_bit().
> On x86, set_bit is indeed implemented with a lock-prefixed
> orb or bts. However, the comment above set_bit() states:

But its very much x86 specific code you're patching, so the LOCK prefix
is sufficient ;-)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-16 15:43   ` Mathieu Desnoyers
  2015-03-16 15:57     ` Mathieu Desnoyers
  2015-03-16 17:13     ` Peter Zijlstra
@ 2015-03-16 17:21     ` Peter Zijlstra
  2015-03-16 18:53       ` Mathieu Desnoyers
  2015-03-16 17:24     ` Peter Zijlstra
  3 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2015-03-16 17:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

On Mon, Mar 16, 2015 at 03:43:56PM +0000, Mathieu Desnoyers wrote:
> > On which; I absolutely hate that rq->lock thing in there. What is
> > 'wrong' with doing a lockless compare there? Other than not actually
> > being able to deref rq->curr of course, but we need to fix that anyhow.
> 
> If we can make sure rq->curr deref could be done without holding the rq
> lock, then I think all we would need is to ensure that updates to rq->curr
> are surrounded by memory barriers. Therefore, we would have the following:
> 
> * When a thread is scheduled out, a memory barrier would be issued before
>   rq->curr is updated to the next thread task_struct.
> 
> * Before a thread is scheduled in, a memory barrier needs to be issued
>   after rq->curr is updated to the incoming thread.

I'm not entirely awake atm but I'm not seeing why it would need to be
that strict; I think the current single MB on task switch is sufficient
because if we're in the middle of schedule, userspace isn't actually
running.

So from the point of userspace the task switch is atomic. Therefore even
if we do not get a barrier before setting ->curr, the expedited thing
missing us doesn't matter as userspace cannot observe the difference.

> In order to be able to dereference rq->curr->mm without holding the
> rq->lock, do you envision we should protect task reclaim with RCU-sched ?

A recent discussion had Linus suggest SLAB_DESTROY_BY_RCU, although I
think Oleg did mention it would still be 'interesting'. I've not yet had
time to really think about that.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-16 15:43   ` Mathieu Desnoyers
                       ` (2 preceding siblings ...)
  2015-03-16 17:21     ` Peter Zijlstra
@ 2015-03-16 17:24     ` Peter Zijlstra
  3 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2015-03-16 17:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

On Mon, Mar 16, 2015 at 03:43:56PM +0000, Mathieu Desnoyers wrote:
> > > +enum {
> > > +	/*
> > > +	 * Private flag set: only synchronize across a single process. If this
> > > +	 * flag is not set, it means "shared": synchronize across multiple
> > > +	 * processes.  The shared mode is useful for shared memory mappings
> > > +	 * across processes.
> > > +	 */
> > > +	MEMBARRIER_PRIVATE_FLAG = (1 << 0),
> > > +
> > > +	/*
> > > +	 * Expedited flag set: adds some overhead, fast execution (few
> > > +	 * microseconds).  If this flag is not set, it means "delayed": low
> > > +	 * overhead, but slow execution (few milliseconds).
> > > +	 */
> > > +	MEMBARRIER_EXPEDITED_FLAG = (1 << 1),
> > 
> > 
> > I suppose this is an unprivileged syscall; so what do we do about:
> > 
> > 	for (;;)
> > 		sys_membar(EXPEDITED);
> > 
> > Which would spray the entire system with IPIs at break neck speed.
> 
> Currently, combining EXPEDITED with non-PRIVATE returns -EINVAL.

Ah, tl;dr that bit. This patch really had _too_ much verbiage.

> Therefore, if someone cares about issuing barriers on the entire
> system, the only option is to use non-EXPEDITED, which rely on
> synchronize_rcu().
> 
> The only way to invoke expedited barriers in a loop is:
> 
> for (;;)
>         sys_membarrier(MEMBARRIER_EXPEDITED | MEMBARRIER_PRIVATE);
> 
> Which will only send IPIs to the CPU running threads from the same
> process.

That is indeed less of a problem.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-16 17:21     ` Peter Zijlstra
@ 2015-03-16 18:53       ` Mathieu Desnoyers
  2015-03-16 20:54         ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2015-03-16 18:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

----- Original Message -----
> From: "Peter Zijlstra" <peterz@infradead.org>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>, "Steven Rostedt"
> <rostedt@goodmis.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Nicholas Miell" <nmiell@comcast.net>,
> "Linus Torvalds" <torvalds@linux-foundation.org>, "Ingo Molnar" <mingo@redhat.com>, "Alan Cox"
> <gnomes@lxorguk.ukuu.org.uk>, "Lai Jiangshan" <laijs@cn.fujitsu.com>, "Stephen Hemminger"
> <stephen@networkplumber.org>, "Andrew Morton" <akpm@linux-foundation.org>, "Josh Triplett" <josh@joshtriplett.org>,
> "Thomas Gleixner" <tglx@linutronix.de>, "David Howells" <dhowells@redhat.com>, "Nick Piggin" <npiggin@kernel.dk>
> Sent: Monday, March 16, 2015 1:21:04 PM
> Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
> 
> On Mon, Mar 16, 2015 at 03:43:56PM +0000, Mathieu Desnoyers wrote:
> > > On which; I absolutely hate that rq->lock thing in there. What is
> > > 'wrong' with doing a lockless compare there? Other than not actually
> > > being able to deref rq->curr of course, but we need to fix that anyhow.
> > 
> > If we can make sure rq->curr deref could be done without holding the rq
> > lock, then I think all we would need is to ensure that updates to rq->curr
> > are surrounded by memory barriers. Therefore, we would have the following:
> > 
> > * When a thread is scheduled out, a memory barrier would be issued before
> >   rq->curr is updated to the next thread task_struct.
> > 
> > * Before a thread is scheduled in, a memory barrier needs to be issued
> >   after rq->curr is updated to the incoming thread.
> 
> I'm not entirely awake atm but I'm not seeing why it would need to be
> that strict; I think the current single MB on task switch is sufficient
> because if we're in the middle of schedule, userspace isn't actually
> running.
> 
> So from the point of userspace the task switch is atomic. Therefore even
> if we do not get a barrier before setting ->curr, the expedited thing
> missing us doesn't matter as userspace cannot observe the difference.

AFAIU, atomicity is not what matters here. It's more about memory ordering.
What is guaranteeing that upon entry in kernel-space, all prior memory
accesses (loads and stores) are ordered prior to following loads/stores ?

The same applies when returning to user-space: what is guaranteeing that all
prior loads/stores are ordered before the user-space loads/stores performed
after returning to user-space ?

> 
> > In order to be able to dereference rq->curr->mm without holding the
> > rq->lock, do you envision we should protect task reclaim with RCU-sched ?
> 
> A recent discussion had Linus suggest SLAB_DESTROY_BY_RCU, although I
> think Oleg did mention it would still be 'interesting'. I've not yet had
> time to really think about that.

This might be an "interesting" modification. :) This could perhaps come
as an optimization later on ?

By the way, I now remember why we start from the mm_cpumask, and then
double-check the mm: using the mm_cpumask serves as an approximation
of the CPUs we need to double-check. Therefore, rather than grabbing
the rq lock for all CPUs, we only need to grab it for CPUs that are
in the mm_cpumask.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-16 18:53       ` Mathieu Desnoyers
@ 2015-03-16 20:54         ` Peter Zijlstra
  2015-03-17  1:45           ` Mathieu Desnoyers
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2015-03-16 20:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

On Mon, Mar 16, 2015 at 06:53:35PM +0000, Mathieu Desnoyers wrote:
> > I'm not entirely awake atm but I'm not seeing why it would need to be
> > that strict; I think the current single MB on task switch is sufficient
> > because if we're in the middle of schedule, userspace isn't actually
> > running.
> > 
> > So from the point of userspace the task switch is atomic. Therefore even
> > if we do not get a barrier before setting ->curr, the expedited thing
> > missing us doesn't matter as userspace cannot observe the difference.
> 
> AFAIU, atomicity is not what matters here. It's more about memory ordering.
> What is guaranteeing that upon entry in kernel-space, all prior memory
> accesses (loads and stores) are ordered prior to following loads/stores ?
> 
> The same applies when returning to user-space: what is guaranteeing that all
> prior loads/stores are ordered before the user-space loads/stores performed
> after returning to user-space ?

You're still one step ahead of me; why does this matter?

Or put it another way; what can go wrong? By virtue of being in
schedule() both tasks (prev and next) get an affective MB from the task
switch.

So even if we see the 'wrong' rq->curr, that CPU will still observe the
MB by the time it gets to userspace.

All of this is really only about userspace load/store ordering and the
context switch already very much needs to guarantee userspace program
order in the face of context switches.

> > > In order to be able to dereference rq->curr->mm without holding the
> > > rq->lock, do you envision we should protect task reclaim with RCU-sched ?
> > 
> > A recent discussion had Linus suggest SLAB_DESTROY_BY_RCU, although I
> > think Oleg did mention it would still be 'interesting'. I've not yet had
> > time to really think about that.
> 
> This might be an "interesting" modification. :) This could perhaps come
> as an optimization later on ?

Not really, again, take this for (;;) sys_membar(EXPEDITED) that'll
generate horrendous rq lock contention, with or without the PRIVATE
thing it'll pound a number of rq locks real bad.

Typical scheduler syscalls only affect a single rq lock at a time -- the
one the task is on. This one potentially pounds all of them.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-16 20:54         ` Peter Zijlstra
@ 2015-03-17  1:45           ` Mathieu Desnoyers
  2015-03-17  2:26             ` Steven Rostedt
  2015-03-17  6:30             ` Peter Zijlstra
  0 siblings, 2 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2015-03-17  1:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

----- Original Message -----
> From: "Peter Zijlstra" <peterz@infradead.org>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>, "Steven Rostedt"
> <rostedt@goodmis.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Nicholas Miell" <nmiell@comcast.net>,
> "Linus Torvalds" <torvalds@linux-foundation.org>, "Ingo Molnar" <mingo@redhat.com>, "Alan Cox"
> <gnomes@lxorguk.ukuu.org.uk>, "Lai Jiangshan" <laijs@cn.fujitsu.com>, "Stephen Hemminger"
> <stephen@networkplumber.org>, "Andrew Morton" <akpm@linux-foundation.org>, "Josh Triplett" <josh@joshtriplett.org>,
> "Thomas Gleixner" <tglx@linutronix.de>, "David Howells" <dhowells@redhat.com>, "Nick Piggin" <npiggin@kernel.dk>
> Sent: Monday, March 16, 2015 4:54:35 PM
> Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
> 
> On Mon, Mar 16, 2015 at 06:53:35PM +0000, Mathieu Desnoyers wrote:
> > > I'm not entirely awake atm but I'm not seeing why it would need to be
> > > that strict; I think the current single MB on task switch is sufficient
> > > because if we're in the middle of schedule, userspace isn't actually
> > > running.
> > > 
> > > So from the point of userspace the task switch is atomic. Therefore even
> > > if we do not get a barrier before setting ->curr, the expedited thing
> > > missing us doesn't matter as userspace cannot observe the difference.
> > 
> > AFAIU, atomicity is not what matters here. It's more about memory ordering.
> > What is guaranteeing that upon entry in kernel-space, all prior memory
> > accesses (loads and stores) are ordered prior to following loads/stores ?
> > 
> > The same applies when returning to user-space: what is guaranteeing that
> > all
> > prior loads/stores are ordered before the user-space loads/stores performed
> > after returning to user-space ?
> 
> You're still one step ahead of me; why does this matter?
> 
> Or put it another way; what can go wrong? By virtue of being in
> schedule() both tasks (prev and next) get an affective MB from the task
> switch.
> 
> So even if we see the 'wrong' rq->curr, that CPU will still observe the
> MB by the time it gets to userspace.
> 
> All of this is really only about userspace load/store ordering and the
> context switch already very much needs to guarantee userspace program
> order in the face of context switches.

Let's go through a memory ordering scenario to highlight my reasoning
there.

Let's consider the following memory barrier scenario performed in
user-space on an architecture with very relaxed ordering. PowerPC comes
to mind.

https://lwn.net/Articles/573436/
scenario 12:

CPU 0                   CPU 1
CAO(x) = 1;             r3 = CAO(y);
cmm_smp_wmb();          cmm_smp_rmb();
CAO(y) = 1;             r4 = CAO(x);

BUG_ON(r3 == 1 && r4 == 0)


We tweak it to use sys_membarrier on CPU 1, and a simple compiler
barrier() on CPU 0:

CPU 0                   CPU 1
CAO(x) = 1;             r3 = CAO(y);
barrier();              sys_membarrier();
CAO(y) = 1;             r4 = CAO(x);

BUG_ON(r3 == 1 && r4 == 0)

Now if CPU 1 executes sys_membarrier while CPU 0 is preempted after both
stores, we have:

CPU 0                           CPU 1
CAO(x) = 1;
  [1st store is slow to
   reach other cores]
CAO(y) = 1;
  [2nd store reaches other
   cores more quickly]
[preempted]
                                r3 = CAO(y)
                                  (may see y = 1)
                                sys_membarrier()
Scheduler changes rq->curr.
                                skips CPU 0, because rq->curr has
                                  been updated.
                                [return to userspace]
                                r4 = CAO(x)
                                  (may see x = 0)
                                BUG_ON(r3 == 1 && r4 == 0) -> fails.
load_cr3, with implied
  memory barrier, comes
  after CPU 1 has read "x".

The only way to make this scenario work is if a memory barrier is added
before updating rq->curr. (we could also do a similar scenario for the
needed barrier after store to rq->curr).

> 
> > > > In order to be able to dereference rq->curr->mm without holding the
> > > > rq->lock, do you envision we should protect task reclaim with RCU-sched
> > > > ?
> > > 
> > > A recent discussion had Linus suggest SLAB_DESTROY_BY_RCU, although I
> > > think Oleg did mention it would still be 'interesting'. I've not yet had
> > > time to really think about that.
> > 
> > This might be an "interesting" modification. :) This could perhaps come
> > as an optimization later on ?
> 
> Not really, again, take this for (;;) sys_membar(EXPEDITED) that'll
> generate horrendous rq lock contention, with or without the PRIVATE
> thing it'll pound a number of rq locks real bad.
> 
> Typical scheduler syscalls only affect a single rq lock at a time -- the
> one the task is on. This one potentially pounds all of them.

Would you see it as acceptable if we start by implementing
only the non-expedited sys_membarrier() ? Then we can add
the expedited-private implementation after rq->curr becomes
available through RCU.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17  1:45           ` Mathieu Desnoyers
@ 2015-03-17  2:26             ` Steven Rostedt
  2015-03-17  6:40               ` Peter Zijlstra
  2015-03-17 12:46               ` Mathieu Desnoyers
  2015-03-17  6:30             ` Peter Zijlstra
  1 sibling, 2 replies; 34+ messages in thread
From: Steven Rostedt @ 2015-03-17  2:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, KOSAKI Motohiro, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells


[ Removed npiggen@kernel.dk as I keep getting bounces from that addr ]

On Tue, 17 Mar 2015 01:45:25 +0000 (UTC)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- Original Message -----
> > From: "Peter Zijlstra" <peterz@infradead.org>
> > To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> > Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>, "Steven Rostedt"
> > <rostedt@goodmis.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Nicholas Miell" <nmiell@comcast.net>,
> > "Linus Torvalds" <torvalds@linux-foundation.org>, "Ingo Molnar" <mingo@redhat.com>, "Alan Cox"
> > <gnomes@lxorguk.ukuu.org.uk>, "Lai Jiangshan" <laijs@cn.fujitsu.com>, "Stephen Hemminger"
> > <stephen@networkplumber.org>, "Andrew Morton" <akpm@linux-foundation.org>, "Josh Triplett" <josh@joshtriplett.org>,
> > "Thomas Gleixner" <tglx@linutronix.de>, "David Howells" <dhowells@redhat.com>, "Nick Piggin" <npiggin@kernel.dk>
> > Sent: Monday, March 16, 2015 4:54:35 PM
> > Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)

Can you please fix your mail client to not include the entire header in
your replies please.

> Let's consider the following memory barrier scenario performed in
> user-space on an architecture with very relaxed ordering. PowerPC comes
> to mind.
> 
> https://lwn.net/Articles/573436/
> scenario 12:
> 
> CPU 0                   CPU 1
> CAO(x) = 1;             r3 = CAO(y);
> cmm_smp_wmb();          cmm_smp_rmb();
> CAO(y) = 1;             r4 = CAO(x);
> 
> BUG_ON(r3 == 1 && r4 == 0)
> 
> 
> We tweak it to use sys_membarrier on CPU 1, and a simple compiler
> barrier() on CPU 0:
> 
> CPU 0                   CPU 1
> CAO(x) = 1;             r3 = CAO(y);
> barrier();              sys_membarrier();
> CAO(y) = 1;             r4 = CAO(x);
> 
> BUG_ON(r3 == 1 && r4 == 0)
> 
> Now if CPU 1 executes sys_membarrier while CPU 0 is preempted after both
> stores, we have:
> 
> CPU 0                           CPU 1
> CAO(x) = 1;
>   [1st store is slow to
>    reach other cores]
> CAO(y) = 1;
>   [2nd store reaches other
>    cores more quickly]
> [preempted]
>                                 r3 = CAO(y)
>                                   (may see y = 1)
>                                 sys_membarrier()
> Scheduler changes rq->curr.
>                                 skips CPU 0, because rq->curr has
>                                   been updated.
>                                 [return to userspace]
>                                 r4 = CAO(x)
>                                   (may see x = 0)
>                                 BUG_ON(r3 == 1 && r4 == 0) -> fails.
> load_cr3, with implied
>   memory barrier, comes
>   after CPU 1 has read "x".
> 
> The only way to make this scenario work is if a memory barrier is added
> before updating rq->curr. (we could also do a similar scenario for the
> needed barrier after store to rq->curr).

Hmm, I wonder if anything were to break if rq->curr was updated after
the context_switch() call?

Would that help?

	this_cpu_write(saved_next, next);
	rq = context_switch(rq, prev, next);
	rq->curr = this_cpu_read(saved_next);

As I recently found out that this_cpu_read/write() is not that nice on
all architectures, something else may need to be updated. Or we can add
a temp variable on the rq.

	rq->saved_next = next;
	rq = context_switch(rq, prev, next);
	rq->curr = rq->saved_next;

-- Steve


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17  1:45           ` Mathieu Desnoyers
  2015-03-17  2:26             ` Steven Rostedt
@ 2015-03-17  6:30             ` Peter Zijlstra
  2015-03-17 11:56               ` Paul E. McKenney
  2015-03-17 13:13               ` Mathieu Desnoyers
  1 sibling, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2015-03-17  6:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

On Tue, Mar 17, 2015 at 01:45:25AM +0000, Mathieu Desnoyers wrote:
> Let's go through a memory ordering scenario to highlight my reasoning
> there.
> 
> Let's consider the following memory barrier scenario performed in
> user-space on an architecture with very relaxed ordering. PowerPC comes
> to mind.
> 
> https://lwn.net/Articles/573436/
> scenario 12:
> 
> CPU 0                   CPU 1
> CAO(x) = 1;             r3 = CAO(y);
> cmm_smp_wmb();          cmm_smp_rmb();
> CAO(y) = 1;             r4 = CAO(x);
> 
> BUG_ON(r3 == 1 && r4 == 0)

WTF is CAO() ? and that ridiculous cmm_ prefix on the barriers.

> We tweak it to use sys_membarrier on CPU 1, and a simple compiler
> barrier() on CPU 0:
> 
> CPU 0                   CPU 1
> CAO(x) = 1;             r3 = CAO(y);
> barrier();              sys_membarrier();
> CAO(y) = 1;             r4 = CAO(x);
> 
> BUG_ON(r3 == 1 && r4 == 0)

That hardly seems like a valid substitution; barrier() is not a valid
replacement of a memory barrier is it? Esp not on PPC.

> Now if CPU 1 executes sys_membarrier while CPU 0 is preempted after both
> stores, we have:
> 
> CPU 0                           CPU 1
> CAO(x) = 1;
>   [1st store is slow to
>    reach other cores]
> CAO(y) = 1;
>   [2nd store reaches other
>    cores more quickly]
> [preempted]
>                                 r3 = CAO(y)
>                                   (may see y = 1)
>                                 sys_membarrier()
> Scheduler changes rq->curr.
>                                 skips CPU 0, because rq->curr has
>                                   been updated.
>                                 [return to userspace]
>                                 r4 = CAO(x)
>                                   (may see x = 0)
>                                 BUG_ON(r3 == 1 && r4 == 0) -> fails.
> load_cr3, with implied
>   memory barrier, comes
>   after CPU 1 has read "x".
> 
> The only way to make this scenario work is if a memory barrier is added
> before updating rq->curr. (we could also do a similar scenario for the
> needed barrier after store to rq->curr).

Hmmm.. like that. Light begins to dawn.

So I think in this case we're good with the smp_mb__before_spinlock() we
have; but do note its not a full MB even though the name says so.

Its basically: WMB + ACQUIRE, which theoretically can leak a read in,
but nobody sane _delays_ reads, you want to speculate reads, not
postpone.

Also, it lacks the transitive property.

> Would you see it as acceptable if we start by implementing
> only the non-expedited sys_membarrier() ?

Sure.

> Then we can add
> the expedited-private implementation after rq->curr becomes
> available through RCU.

Yeah, or not at all; I'm still trying to get Paul to remove the
expedited nonsense from the kernel RCU bits; and now you want it in
userspace too :/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17  2:26             ` Steven Rostedt
@ 2015-03-17  6:40               ` Peter Zijlstra
  2015-03-17 11:44                 ` Paul E. McKenney
  2015-03-17 12:46               ` Mathieu Desnoyers
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2015-03-17  6:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, linux-kernel, KOSAKI Motohiro,
	Paul E. McKenney, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Andrew Morton,
	Josh Triplett, Thomas Gleixner, David Howells

On Mon, Mar 16, 2015 at 10:26:11PM -0400, Steven Rostedt wrote:
> As I recently found out that this_cpu_read/write() is not that nice on
> all architectures,

In fact, they only really work well on x86. Aargh64 seems to have a semi
usable version, but mostly its quite horrible indeed.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17  6:40               ` Peter Zijlstra
@ 2015-03-17 11:44                 ` Paul E. McKenney
  2015-03-17 14:10                   ` Steven Rostedt
  0 siblings, 1 reply; 34+ messages in thread
From: Paul E. McKenney @ 2015-03-17 11:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Mathieu Desnoyers, linux-kernel, KOSAKI Motohiro,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells

On Tue, Mar 17, 2015 at 07:40:56AM +0100, Peter Zijlstra wrote:
> On Mon, Mar 16, 2015 at 10:26:11PM -0400, Steven Rostedt wrote:
> > As I recently found out that this_cpu_read/write() is not that nice on
> > all architectures,
> 
> In fact, they only really work well on x86. Aargh64 seems to have a semi
> usable version, but mostly its quite horrible indeed.

Yeah, if you are in a preempt-disabled region, __this_cpu_read() and
__this_cpu_write() generate much better code on most platforms.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17  6:30             ` Peter Zijlstra
@ 2015-03-17 11:56               ` Paul E. McKenney
  2015-03-17 12:01                 ` Paul E. McKenney
  2015-03-17 13:13               ` Mathieu Desnoyers
  1 sibling, 1 reply; 34+ messages in thread
From: Paul E. McKenney @ 2015-03-17 11:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

On Tue, Mar 17, 2015 at 07:30:59AM +0100, Peter Zijlstra wrote:
> On Tue, Mar 17, 2015 at 01:45:25AM +0000, Mathieu Desnoyers wrote:

[ . . . ]

> > Then we can add
> > the expedited-private implementation after rq->curr becomes
> > available through RCU.
> 
> Yeah, or not at all; I'm still trying to get Paul to remove the
> expedited nonsense from the kernel RCU bits; and now you want it in
> userspace too :/

Well, the expedited RCU primitives got added for some use cases, and
those use cases are still with us.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17 11:56               ` Paul E. McKenney
@ 2015-03-17 12:01                 ` Paul E. McKenney
  0 siblings, 0 replies; 34+ messages in thread
From: Paul E. McKenney @ 2015-03-17 12:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Nick Piggin

On Tue, Mar 17, 2015 at 04:56:08AM -0700, Paul E. McKenney wrote:
> On Tue, Mar 17, 2015 at 07:30:59AM +0100, Peter Zijlstra wrote:
> > On Tue, Mar 17, 2015 at 01:45:25AM +0000, Mathieu Desnoyers wrote:
> 
> [ . . . ]
> 
> > > Then we can add
> > > the expedited-private implementation after rq->curr becomes
> > > available through RCU.
> > 
> > Yeah, or not at all; I'm still trying to get Paul to remove the
> > expedited nonsense from the kernel RCU bits; and now you want it in
> > userspace too :/
> 
> Well, the expedited RCU primitives got added for some use cases, and
> those use cases are still with us.  ;-)

And, as I was meaning to say -before- I hit send...  I still have some
changes in mind to make the expedited RCU primitives at least somewhat
more friendly, for example, moving from try_stop_cpus() to something
like on_each_cpu_mask().

							Thanx, Paul


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17  2:26             ` Steven Rostedt
  2015-03-17  6:40               ` Peter Zijlstra
@ 2015-03-17 12:46               ` Mathieu Desnoyers
  2015-03-18  1:06                 ` Steven Rostedt
  1 sibling, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2015-03-17 12:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, linux-kernel, KOSAKI Motohiro, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells

----- Original Message -----
> 
> [ Removed npiggen@kernel.dk as I keep getting bounces from that addr ]

Yep, me too. However this his the address that shows up in the
MAINTAINERS file. Weird.

> 
> On Tue, 17 Mar 2015 01:45:25 +0000 (UTC)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
[...]
> 
> Can you please fix your mail client to not include the entire header in
> your replies please.

Done, thanks for pointing it out!

> 
> > Let's consider the following memory barrier scenario performed in
> > user-space on an architecture with very relaxed ordering. PowerPC comes
> > to mind.
> > 
> > https://lwn.net/Articles/573436/
> > scenario 12:
> > 
> > CPU 0                   CPU 1
> > CAO(x) = 1;             r3 = CAO(y);
> > cmm_smp_wmb();          cmm_smp_rmb();
> > CAO(y) = 1;             r4 = CAO(x);
> > 
> > BUG_ON(r3 == 1 && r4 == 0)
> > 
> > 
> > We tweak it to use sys_membarrier on CPU 1, and a simple compiler
> > barrier() on CPU 0:
> > 
> > CPU 0                   CPU 1
> > CAO(x) = 1;             r3 = CAO(y);
> > barrier();              sys_membarrier();
> > CAO(y) = 1;             r4 = CAO(x);
> > 
> > BUG_ON(r3 == 1 && r4 == 0)
> > 
> > Now if CPU 1 executes sys_membarrier while CPU 0 is preempted after both
> > stores, we have:
> > 
> > CPU 0                           CPU 1
> > CAO(x) = 1;
> >   [1st store is slow to
> >    reach other cores]
> > CAO(y) = 1;
> >   [2nd store reaches other
> >    cores more quickly]
> > [preempted]
> >                                 r3 = CAO(y)
> >                                   (may see y = 1)
> >                                 sys_membarrier()
> > Scheduler changes rq->curr.
> >                                 skips CPU 0, because rq->curr has
> >                                   been updated.
> >                                 [return to userspace]
> >                                 r4 = CAO(x)
> >                                   (may see x = 0)
> >                                 BUG_ON(r3 == 1 && r4 == 0) -> fails.
> > load_cr3, with implied
> >   memory barrier, comes
> >   after CPU 1 has read "x".
> > 
> > The only way to make this scenario work is if a memory barrier is added
> > before updating rq->curr. (we could also do a similar scenario for the
> > needed barrier after store to rq->curr).
> 
> Hmm, I wonder if anything were to break if rq->curr was updated after
> the context_switch() call?
> 
> Would that help?
> 
> 	this_cpu_write(saved_next, next);
> 	rq = context_switch(rq, prev, next);
> 	rq->curr = this_cpu_read(saved_next);

Assuming there is a full memory barrier (e.g. load_cr3) within
context_switch, it would help for ordering memory accesses that
are performed prior to the preemption, but not for memory accesses
to be performed immediately after return to userspace from preemption.

Thanks,

Mathieu

> 
> As I recently found out that this_cpu_read/write() is not that nice on
> all architectures, something else may need to be updated. Or we can add
> a temp variable on the rq.
> 
> 	rq->saved_next = next;
> 	rq = context_switch(rq, prev, next);
> 	rq->curr = rq->saved_next;
> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17  6:30             ` Peter Zijlstra
  2015-03-17 11:56               ` Paul E. McKenney
@ 2015-03-17 13:13               ` Mathieu Desnoyers
  2015-03-17 16:36                 ` Mathieu Desnoyers
  2015-03-17 16:37                 ` Peter Zijlstra
  1 sibling, 2 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2015-03-17 13:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells

----- Original Message -----
> On Tue, Mar 17, 2015 at 01:45:25AM +0000, Mathieu Desnoyers wrote:
> > Let's go through a memory ordering scenario to highlight my reasoning
> > there.
> > 
> > Let's consider the following memory barrier scenario performed in
> > user-space on an architecture with very relaxed ordering. PowerPC comes
> > to mind.
> > 
> > https://lwn.net/Articles/573436/
> > scenario 12:
> > 
> > CPU 0                   CPU 1
> > CAO(x) = 1;             r3 = CAO(y);
> > cmm_smp_wmb();          cmm_smp_rmb();
> > CAO(y) = 1;             r4 = CAO(x);
> > 
> > BUG_ON(r3 == 1 && r4 == 0)
> 
> WTF is CAO() ? and that ridiculous cmm_ prefix on the barriers.

In the LWN article, for the sake of conciseness, CAO is an alias
to the Linux kernel ACCESS_ONCE(). It maps to CMM_LOAD_SHARED(),
CMM_STORE_SHARED() in userspace RCU.

The cmm_ prefix means "Concurrent Memory Model", which is a prefix
used for memory access/ordering related headers in userspace RCU.
Because it is a userspace library, we need those prefixes, else we
end up clashing with pre-existing apps.

> 
> > We tweak it to use sys_membarrier on CPU 1, and a simple compiler
> > barrier() on CPU 0:
> > 
> > CPU 0                   CPU 1
> > CAO(x) = 1;             r3 = CAO(y);
> > barrier();              sys_membarrier();
> > CAO(y) = 1;             r4 = CAO(x);
> > 
> > BUG_ON(r3 == 1 && r4 == 0)
> 
> That hardly seems like a valid substitution; barrier() is not a valid
> replacement of a memory barrier is it? Esp not on PPC.

That's the whole point of sys_membarrier. Quoting the td;dr changelog:

"It can be used to distribute the cost of user-space memory barriers
asymmetrically by transforming pairs of memory barriers into pairs
consisting of sys_membarrier() and a compiler barrier. For
synchronization primitives that distinguish between read-side and
write-side (e.g. userspace RCU, rwlocks), the read-side can be
accelerated significantly by moving the bulk of the memory barrier
overhead to the write-side."

So basically, for a given memory barrier pair, we can turn one barrier
into a sys_membarrier, which allows us to turn the other barrier into
a simple compiler barrier. Therefore, whenever the thread issuing
sys_membarrier actually cares about ordering of the matching barrier,
sys_membarrier() forces all other threads to issue a memory barrier,
which punctually promotes program order to an actual memory barrier
on all targeted threads.

> 
> > Now if CPU 1 executes sys_membarrier while CPU 0 is preempted after both
> > stores, we have:
> > 
> > CPU 0                           CPU 1
> > CAO(x) = 1;
> >   [1st store is slow to
> >    reach other cores]
> > CAO(y) = 1;
> >   [2nd store reaches other
> >    cores more quickly]
> > [preempted]
> >                                 r3 = CAO(y)
> >                                   (may see y = 1)
> >                                 sys_membarrier()
> > Scheduler changes rq->curr.
> >                                 skips CPU 0, because rq->curr has
> >                                   been updated.
> >                                 [return to userspace]
> >                                 r4 = CAO(x)
> >                                   (may see x = 0)
> >                                 BUG_ON(r3 == 1 && r4 == 0) -> fails.
> > load_cr3, with implied
> >   memory barrier, comes
> >   after CPU 1 has read "x".
> > 
> > The only way to make this scenario work is if a memory barrier is added
> > before updating rq->curr. (we could also do a similar scenario for the
> > needed barrier after store to rq->curr).
> 
> Hmmm.. like that. Light begins to dawn.
> 
> So I think in this case we're good with the smp_mb__before_spinlock() we
> have; but do note its not a full MB even though the name says so.
> 
> Its basically: WMB + ACQUIRE, which theoretically can leak a read in,
> but nobody sane _delays_ reads, you want to speculate reads, not
> postpone.

If I believe the memory ordering table at
https://en.wikipedia.org/wiki/Memory_ordering , there appears
to be quite a few architectures that can reorder loads after loads,
and loads after stores: Alpha, ARMv7, PA-RISC, SPARC RMO, x86 oostore
and ia64. There may be subtle details that would allow us to
do without the barriers in specific situations, but for that I'd
very much like to hear what Paul has to say.

> 
> Also, it lacks the transitive property.

The lack of transitive property would likely be an issue
if we want to make this generally useful.

> 
> > Would you see it as acceptable if we start by implementing
> > only the non-expedited sys_membarrier() ?
> 
> Sure.
> 
> > Then we can add
> > the expedited-private implementation after rq->curr becomes
> > available through RCU.
> 
> Yeah, or not at all; I'm still trying to get Paul to remove the
> expedited nonsense from the kernel RCU bits; and now you want it in
> userspace too :/

The non-expedited case makes sense when we batch RCU work
with call_rcu. However, some users require to use synchronize_rcu()
directly after modifying their data structure. Therefore, the
latency associated with sys_membarrier() then becomes important,
hence the interest for an expedited scheme.

I agree that we should try to find a way to implement it with
low disturbance on the CPU's rq locks. I'd be certainly
OK with starting with just the non-expedited scheme, and add
the expedited scheme later on. This is why we have the flags
anyway.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17 11:44                 ` Paul E. McKenney
@ 2015-03-17 14:10                   ` Steven Rostedt
  2015-03-17 16:35                     ` Paul E. McKenney
  0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2015-03-17 14:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Mathieu Desnoyers, linux-kernel, KOSAKI Motohiro,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Christoph Lameter

On Tue, 17 Mar 2015 04:44:11 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Tue, Mar 17, 2015 at 07:40:56AM +0100, Peter Zijlstra wrote:
> > On Mon, Mar 16, 2015 at 10:26:11PM -0400, Steven Rostedt wrote:
> > > As I recently found out that this_cpu_read/write() is not that nice on
> > > all architectures,
> > 
> > In fact, they only really work well on x86. Aargh64 seems to have a semi
> > usable version, but mostly its quite horrible indeed.
> 
> Yeah, if you are in a preempt-disabled region, __this_cpu_read() and
> __this_cpu_write() generate much better code on most platforms.
> 

I was just having this discussion with Christoph. I originally switched
to using this_cpu_ptr(), but I'll try __this_cpu* instead. I think
Christoph recommended that too.

-- Steve

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17 14:10                   ` Steven Rostedt
@ 2015-03-17 16:35                     ` Paul E. McKenney
  0 siblings, 0 replies; 34+ messages in thread
From: Paul E. McKenney @ 2015-03-17 16:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Mathieu Desnoyers, linux-kernel, KOSAKI Motohiro,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells, Christoph Lameter

On Tue, Mar 17, 2015 at 10:10:09AM -0400, Steven Rostedt wrote:
> On Tue, 17 Mar 2015 04:44:11 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Tue, Mar 17, 2015 at 07:40:56AM +0100, Peter Zijlstra wrote:
> > > On Mon, Mar 16, 2015 at 10:26:11PM -0400, Steven Rostedt wrote:
> > > > As I recently found out that this_cpu_read/write() is not that nice on
> > > > all architectures,
> > > 
> > > In fact, they only really work well on x86. Aargh64 seems to have a semi
> > > usable version, but mostly its quite horrible indeed.
> > 
> > Yeah, if you are in a preempt-disabled region, __this_cpu_read() and
> > __this_cpu_write() generate much better code on most platforms.
> 
> I was just having this discussion with Christoph. I originally switched
> to using this_cpu_ptr(), but I'll try __this_cpu* instead. I think
> Christoph recommended that too.

Sounds good to me!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17 13:13               ` Mathieu Desnoyers
@ 2015-03-17 16:36                 ` Mathieu Desnoyers
  2015-03-17 16:48                   ` Paul E. McKenney
  2015-03-17 17:55                   ` josh
  2015-03-17 16:37                 ` Peter Zijlstra
  1 sibling, 2 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2015-03-17 16:36 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells

> > On Tue, Mar 17, 2015 at 01:45:25AM +0000, Mathieu Desnoyers wrote:
[...]
> > 
> > > Would you see it as acceptable if we start by implementing
> > > only the non-expedited sys_membarrier() ?
> > 
> > Sure.
> > 
> > > Then we can add
> > > the expedited-private implementation after rq->curr becomes
> > > available through RCU.
> > 
> > Yeah, or not at all; I'm still trying to get Paul to remove the
> > expedited nonsense from the kernel RCU bits; and now you want it in
> > userspace too :/
> 
> The non-expedited case makes sense when we batch RCU work
> with call_rcu. However, some users require to use synchronize_rcu()
> directly after modifying their data structure. Therefore, the
> latency associated with sys_membarrier() then becomes important,
> hence the interest for an expedited scheme.
> 
> I agree that we should try to find a way to implement it with
> low disturbance on the CPU's rq locks. I'd be certainly
> OK with starting with just the non-expedited scheme, and add
> the expedited scheme later on. This is why we have the flags
> anyway.

Paul, I'm currently reworking the patch to keep only the
non-expedited scheme. I don't need to touch the scheduler
internals anymore, so should I move the sys_membarrier
system call implementation into kernel/rcu/update.c ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17 13:13               ` Mathieu Desnoyers
  2015-03-17 16:36                 ` Mathieu Desnoyers
@ 2015-03-17 16:37                 ` Peter Zijlstra
  2015-03-17 16:49                   ` Paul E. McKenney
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2015-03-17 16:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, KOSAKI Motohiro, Steven Rostedt, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells

On Tue, Mar 17, 2015 at 01:13:36PM +0000, Mathieu Desnoyers wrote:
> > Its basically: WMB + ACQUIRE, which theoretically can leak a read in,
> > but nobody sane _delays_ reads, you want to speculate reads, not
> > postpone.
> 
> If I believe the memory ordering table at
> https://en.wikipedia.org/wiki/Memory_ordering , there appears
> to be quite a few architectures that can reorder loads after loads,
> and loads after stores: Alpha, ARMv7, PA-RISC, SPARC RMO, x86 oostore
> and ia64. There may be subtle details that would allow us to
> do without the barriers in specific situations, but for that I'd
> very much like to hear what Paul has to say.

So I was starting to write that you can get load after load by one
speculating more than the other, but I suppose you can delay loads just
fine too.

Imagine getting a cache miss on a load, the OoO engine can then continue
execution until it hits a hard dependency, so you're effectively
delaying the load.

So yeah, if we want to be able to replace smp_rmb() with a
barrier+sys_membar() we need to promote the smp_mb__before_spinlock() to
smp_mb__after_unlock_lock() or so, that would only penalize PPC a bit.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17 16:36                 ` Mathieu Desnoyers
@ 2015-03-17 16:48                   ` Paul E. McKenney
  2015-03-17 17:55                   ` josh
  1 sibling, 0 replies; 34+ messages in thread
From: Paul E. McKenney @ 2015-03-17 16:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells

On Tue, Mar 17, 2015 at 04:36:27PM +0000, Mathieu Desnoyers wrote:
> > > On Tue, Mar 17, 2015 at 01:45:25AM +0000, Mathieu Desnoyers wrote:
> [...]
> > > 
> > > > Would you see it as acceptable if we start by implementing
> > > > only the non-expedited sys_membarrier() ?
> > > 
> > > Sure.
> > > 
> > > > Then we can add
> > > > the expedited-private implementation after rq->curr becomes
> > > > available through RCU.
> > > 
> > > Yeah, or not at all; I'm still trying to get Paul to remove the
> > > expedited nonsense from the kernel RCU bits; and now you want it in
> > > userspace too :/
> > 
> > The non-expedited case makes sense when we batch RCU work
> > with call_rcu. However, some users require to use synchronize_rcu()
> > directly after modifying their data structure. Therefore, the
> > latency associated with sys_membarrier() then becomes important,
> > hence the interest for an expedited scheme.
> > 
> > I agree that we should try to find a way to implement it with
> > low disturbance on the CPU's rq locks. I'd be certainly
> > OK with starting with just the non-expedited scheme, and add
> > the expedited scheme later on. This is why we have the flags
> > anyway.
> 
> Paul, I'm currently reworking the patch to keep only the
> non-expedited scheme. I don't need to touch the scheduler
> internals anymore, so should I move the sys_membarrier
> system call implementation into kernel/rcu/update.c ?

Can't see why that wouldn't work.  ;-)

						Thanx, Paul


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17 16:37                 ` Peter Zijlstra
@ 2015-03-17 16:49                   ` Paul E. McKenney
  2015-03-17 17:00                     ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Paul E. McKenney @ 2015-03-17 16:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells

On Tue, Mar 17, 2015 at 05:37:56PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 17, 2015 at 01:13:36PM +0000, Mathieu Desnoyers wrote:
> > > Its basically: WMB + ACQUIRE, which theoretically can leak a read in,
> > > but nobody sane _delays_ reads, you want to speculate reads, not
> > > postpone.
> > 
> > If I believe the memory ordering table at
> > https://en.wikipedia.org/wiki/Memory_ordering , there appears
> > to be quite a few architectures that can reorder loads after loads,
> > and loads after stores: Alpha, ARMv7, PA-RISC, SPARC RMO, x86 oostore
> > and ia64. There may be subtle details that would allow us to
> > do without the barriers in specific situations, but for that I'd
> > very much like to hear what Paul has to say.
> 
> So I was starting to write that you can get load after load by one
> speculating more than the other, but I suppose you can delay loads just
> fine too.
> 
> Imagine getting a cache miss on a load, the OoO engine can then continue
> execution until it hits a hard dependency, so you're effectively
> delaying the load.
> 
> So yeah, if we want to be able to replace smp_rmb() with a
> barrier+sys_membar() we need to promote the smp_mb__before_spinlock() to
> smp_mb__after_unlock_lock() or so, that would only penalize PPC a bit.

Agreed, though if Mathieu is dropping the expedited version for the
moment, this should not be required yet, right?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17 16:49                   ` Paul E. McKenney
@ 2015-03-17 17:00                     ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2015-03-17 17:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mathieu Desnoyers, linux-kernel, KOSAKI Motohiro, Steven Rostedt,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells

On Tue, Mar 17, 2015 at 09:49:40AM -0700, Paul E. McKenney wrote:
> On Tue, Mar 17, 2015 at 05:37:56PM +0100, Peter Zijlstra wrote:
> > On Tue, Mar 17, 2015 at 01:13:36PM +0000, Mathieu Desnoyers wrote:
> > > > Its basically: WMB + ACQUIRE, which theoretically can leak a read in,
> > > > but nobody sane _delays_ reads, you want to speculate reads, not
> > > > postpone.
> > > 
> > > If I believe the memory ordering table at
> > > https://en.wikipedia.org/wiki/Memory_ordering , there appears
> > > to be quite a few architectures that can reorder loads after loads,
> > > and loads after stores: Alpha, ARMv7, PA-RISC, SPARC RMO, x86 oostore
> > > and ia64. There may be subtle details that would allow us to
> > > do without the barriers in specific situations, but for that I'd
> > > very much like to hear what Paul has to say.
> > 
> > So I was starting to write that you can get load after load by one
> > speculating more than the other, but I suppose you can delay loads just
> > fine too.
> > 
> > Imagine getting a cache miss on a load, the OoO engine can then continue
> > execution until it hits a hard dependency, so you're effectively
> > delaying the load.
> > 
> > So yeah, if we want to be able to replace smp_rmb() with a
> > barrier+sys_membar() we need to promote the smp_mb__before_spinlock() to
> > smp_mb__after_unlock_lock() or so, that would only penalize PPC a bit.
> 
> Agreed, though if Mathieu is dropping the expedited version for the
> moment, this should not be required yet, right?

Indeed so.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17 16:36                 ` Mathieu Desnoyers
  2015-03-17 16:48                   ` Paul E. McKenney
@ 2015-03-17 17:55                   ` josh
  1 sibling, 0 replies; 34+ messages in thread
From: josh @ 2015-03-17 17:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Peter Zijlstra, linux-kernel, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Linus Torvalds, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Andrew Morton,
	Thomas Gleixner, David Howells

On Tue, Mar 17, 2015 at 04:36:27PM +0000, Mathieu Desnoyers wrote:
> > > On Tue, Mar 17, 2015 at 01:45:25AM +0000, Mathieu Desnoyers wrote:
> [...]
> > > 
> > > > Would you see it as acceptable if we start by implementing
> > > > only the non-expedited sys_membarrier() ?
> > > 
> > > Sure.
> > > 
> > > > Then we can add
> > > > the expedited-private implementation after rq->curr becomes
> > > > available through RCU.
> > > 
> > > Yeah, or not at all; I'm still trying to get Paul to remove the
> > > expedited nonsense from the kernel RCU bits; and now you want it in
> > > userspace too :/
> > 
> > The non-expedited case makes sense when we batch RCU work
> > with call_rcu. However, some users require to use synchronize_rcu()
> > directly after modifying their data structure. Therefore, the
> > latency associated with sys_membarrier() then becomes important,
> > hence the interest for an expedited scheme.
> > 
> > I agree that we should try to find a way to implement it with
> > low disturbance on the CPU's rq locks. I'd be certainly
> > OK with starting with just the non-expedited scheme, and add
> > the expedited scheme later on. This is why we have the flags
> > anyway.
> 
> Paul, I'm currently reworking the patch to keep only the
> non-expedited scheme. I don't need to touch the scheduler
> internals anymore, so should I move the sys_membarrier
> system call implementation into kernel/rcu/update.c ?

If you don't need access to scheduler internals, I'd suggest putting it
in its own file (something like kernel/membarrier.c), so that you can
use obj-$(CONFIG_MEMBARRIER) in a Makefile to enable/disable it rather
than an #ifdef in a .c file.

- Josh Triplett

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
  2015-03-17 12:46               ` Mathieu Desnoyers
@ 2015-03-18  1:06                 ` Steven Rostedt
  0 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2015-03-18  1:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, KOSAKI Motohiro, Paul E. McKenney,
	Nicholas Miell, Linus Torvalds, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Andrew Morton, Josh Triplett,
	Thomas Gleixner, David Howells

On Tue, 17 Mar 2015 12:46:41 +0000 (UTC)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:


> > Would that help?
> > 
> > 	this_cpu_write(saved_next, next);
> > 	rq = context_switch(rq, prev, next);
> > 	rq->curr = this_cpu_read(saved_next);
> 
> Assuming there is a full memory barrier (e.g. load_cr3) within
> context_switch, it would help for ordering memory accesses that
> are performed prior to the preemption, but not for memory accesses
> to be performed immediately after return to userspace from preemption.

Hmm, I was thinking that there was a spin_unlock(rq->lock) after that,
but it appears that context_switch() does the unlock. If there was an
spin_unlock() after this code, then it could work. There's always
setting the rq->curr in the context_switch() call itself.

-- Steve

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2015-03-18  1:05 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-15 19:24 [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12) Mathieu Desnoyers
2015-03-15 22:05 ` Paul E. McKenney
2015-03-16  3:25 ` Josh Triplett
2015-03-16 13:00   ` Mathieu Desnoyers
2015-03-16 14:19 ` Peter Zijlstra
2015-03-16 14:24   ` Steven Rostedt
2015-03-16 15:49     ` Mathieu Desnoyers
2015-03-16 15:49     ` Paul E. McKenney
2015-03-16 16:12       ` Steven Rostedt
2015-03-16 15:43   ` Mathieu Desnoyers
2015-03-16 15:57     ` Mathieu Desnoyers
2015-03-16 17:13     ` Peter Zijlstra
2015-03-16 17:21     ` Peter Zijlstra
2015-03-16 18:53       ` Mathieu Desnoyers
2015-03-16 20:54         ` Peter Zijlstra
2015-03-17  1:45           ` Mathieu Desnoyers
2015-03-17  2:26             ` Steven Rostedt
2015-03-17  6:40               ` Peter Zijlstra
2015-03-17 11:44                 ` Paul E. McKenney
2015-03-17 14:10                   ` Steven Rostedt
2015-03-17 16:35                     ` Paul E. McKenney
2015-03-17 12:46               ` Mathieu Desnoyers
2015-03-18  1:06                 ` Steven Rostedt
2015-03-17  6:30             ` Peter Zijlstra
2015-03-17 11:56               ` Paul E. McKenney
2015-03-17 12:01                 ` Paul E. McKenney
2015-03-17 13:13               ` Mathieu Desnoyers
2015-03-17 16:36                 ` Mathieu Desnoyers
2015-03-17 16:48                   ` Paul E. McKenney
2015-03-17 17:55                   ` josh
2015-03-17 16:37                 ` Peter Zijlstra
2015-03-17 16:49                   ` Paul E. McKenney
2015-03-17 17:00                     ` Peter Zijlstra
2015-03-16 17:24     ` Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.