* [RFC PATCH v7 0/7] Restartable sequences system call
From: Mathieu Desnoyers @ 2016-07-21 21:14 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng, Mathieu Desnoyers

Hi,

This is mostly a re-write of Paul Turner and Andrew Hunter's restartable
critical sections (percpu atomics), which brings the following main
benefits over Paul Turner's prior version (v2):

- The ABI is now architecture-agnostic, and it requires fewer
  instructions on the user-space fast path,
- Ported to ARM 32, in addition to covering x86 32/64. Adding support
  for new architectures is now trivial,
- Progress is ensured by a purely userspace fall-back to locking
  when single-stepped by a debugger.

This is v7, as it derives from my prior getcpu cache and thread local
ABI patchsets. You will find benchmark results in the changelog of
patch 1/7.

Feedback is welcome!

Thanks,

Mathieu


Mathieu Desnoyers (7):
  Restartable sequences system call
  tracing: instrument restartable sequences
  Restartable sequences: ARM 32 architecture support
  Restartable sequences: wire up ARM 32 system call
  Restartable sequences: x86 32/64 architecture support
  Restartable sequences: wire up x86 32/64 system call
  Restartable sequences: self-tests

 MAINTAINERS                                        |   7 +
 arch/Kconfig                                       |   7 +
 arch/arm/Kconfig                                   |   1 +
 arch/arm/include/uapi/asm/unistd.h                 |   1 +
 arch/arm/kernel/calls.S                            |   1 +
 arch/arm/kernel/signal.c                           |   7 +
 arch/x86/Kconfig                                   |   1 +
 arch/x86/entry/common.c                            |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl             |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl             |   1 +
 arch/x86/kernel/signal.c                           |   6 +
 fs/exec.c                                          |   1 +
 include/linux/sched.h                              |  68 ++
 include/trace/events/rseq.h                        |  60 ++
 include/uapi/linux/Kbuild                          |   1 +
 include/uapi/linux/rseq.h                          |  85 +++
 init/Kconfig                                       |  13 +
 kernel/Makefile                                    |   1 +
 kernel/fork.c                                      |   2 +
 kernel/rseq.c                                      | 243 +++++++
 kernel/sched/core.c                                |   1 +
 kernel/sys_ni.c                                    |   3 +
 tools/testing/selftests/rseq/.gitignore            |   3 +
 tools/testing/selftests/rseq/Makefile              |  13 +
 .../testing/selftests/rseq/basic_percpu_ops_test.c | 279 ++++++++
 tools/testing/selftests/rseq/basic_test.c          | 106 +++
 tools/testing/selftests/rseq/param_test.c          | 707 +++++++++++++++++++++
 tools/testing/selftests/rseq/rseq.c                | 200 ++++++
 tools/testing/selftests/rseq/rseq.h                | 449 +++++++++++++
 29 files changed, 2269 insertions(+)
 create mode 100644 include/trace/events/rseq.h
 create mode 100644 include/uapi/linux/rseq.h
 create mode 100644 kernel/rseq.c
 create mode 100644 tools/testing/selftests/rseq/.gitignore
 create mode 100644 tools/testing/selftests/rseq/Makefile
 create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c
 create mode 100644 tools/testing/selftests/rseq/basic_test.c
 create mode 100644 tools/testing/selftests/rseq/param_test.c
 create mode 100644 tools/testing/selftests/rseq/rseq.c
 create mode 100644 tools/testing/selftests/rseq/rseq.h

-- 
2.1.4


* [RFC PATCH v7 1/7] Restartable sequences system call
From: Mathieu Desnoyers @ 2016-07-21 21:14 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng, Mathieu Desnoyers

Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.

* Restartable sequences (per-cpu atomics)

The restartable critical sections (percpu atomics) work was started by
Paul Turner and Andrew Hunter. It lets the kernel handle restarts of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitate porting to other
architectures and speed up the user-space fast path. A locking-based
fall-back, implemented purely in user-space, is proposed here to deal
with debugger single-stepping. This fallback interacts with rseq_start()
and rseq_finish(), which force retries in response to concurrent
lock-based activity.
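
For illustration, here is a minimal sketch of a per-cpu counter
increment built on top of such helpers. The helper and type names used
below (struct rseq_lock, struct rseq_state, rseq_start(),
rseq_cpu_at_start(), rseq_finish()) are assumptions modeled on the
user-space library added in patch 7/7, not its verbatim API; see
tools/testing/selftests/rseq/rseq.h for the authoritative interface.

  #define _GNU_SOURCE
  #include <sched.h>        /* CPU_SETSIZE */
  #include <stdint.h>

  /* Illustrative sketch only; helper names are assumed, see above. */
  static struct rseq_lock counter_lock;   /* user-space fallback lock */
  static intptr_t counter[CPU_SETSIZE];   /* one counter per possible CPU */

  static void percpu_counter_inc(void)
  {
          struct rseq_state start;
          int cpu;

          do {
                  /* Snapshot the event_counter and the current cpu_id. */
                  start = rseq_start(&counter_lock);
                  cpu = rseq_cpu_at_start(start);
                  /*
                   * rseq_finish() performs the single committing store
                   * only if the thread was neither preempted, migrated,
                   * nor interrupted by a signal since rseq_start();
                   * otherwise it fails and the loop retries. After
                   * repeated aborts (e.g. under single-stepping), the
                   * library falls back to taking counter_lock so that
                   * progress is still guaranteed.
                   */
          } while (!rseq_finish(&counter_lock, &counter[cpu],
                          counter[cpu] + 1, start));
  }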

Here are benchmarks of counter increment in various scenarios compared
to restartable sequences:

ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck

                      Counter increment speed (ns/increment)
                             1 thread    2 threads
global increment (baseline)      6           N/A
percpu rseq increment           50            52
percpu rseq spinlock            94            94
global atomic increment         48            74 (__sync_add_and_fetch_4)
global atomic CAS               50           172 (__sync_val_compare_and_swap_4)
global pthread mutex           148           862

ARMv7 Processor rev 10 (v7l)
Machine model: Wandboard

                      Counter increment speed (ns/increment)
                             1 thread    4 threads
global increment (baseline)      7           N/A
percpu rseq increment           50            50
percpu rseq spinlock            82            84
global atomic increment         44           262 (__sync_add_and_fetch_4)
global atomic CAS               46           316 (__sync_val_compare_and_swap_4)
global pthread mutex           146          1400

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:

                      Counter increment speed (ns/increment)
                              1 thread           8 threads
global increment (baseline)      3.0                N/A
percpu rseq increment            3.6                3.8
percpu rseq spinlock             5.6                6.2
global LOCK; inc                 8.0              166.4
global LOCK; cmpxchg            13.4              435.2
global pthread mutex            25.2             1363.6

* Reading the current CPU number

Reading the current CPU number on which the caller thread is running is
sped up by keeping the current CPU number up to date within the cpu_id
field of the memory area registered by the thread. This is done by
having scheduler migration set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
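
As an illustration of the resulting fast path (a sketch which assumes
the thread has already registered its struct rseq area with the rseq()
system call, as shown in the man page example further below), reading
the CPU number becomes a single volatile load with a sched_getcpu()
fallback:

  static __thread volatile struct rseq rseq_state; /* registered via rseq() */

  static inline int32_t read_current_cpu(void)
  {
          /* Volatile read keeps the compiler from tearing the load. */
          int32_t cpu = rseq_state.u.e.cpu_id;

          /* cpu_id stays negative until the kernel updates it; fall back. */
          return cpu >= 0 ? cpu : sched_getcpu();
  }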

Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over the existing mechanisms for reading
the current CPU number, with the following benefits over alternative
approaches:

- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
  executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from inline
  assembly, which makes it a useful building block for restartable
  sequences.
- The approach of reading the cpu id through memory mapping shared
  between kernel and user-space is portable (e.g. ARM), which is not the
  case for the lsl-based x86 vdso.

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.

Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop):                                    8.4 ns
- Read CPU from rseq cpu_id:                               16.7 ns
- Read CPU from rseq cpu_id (lazy register):               19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
- getcpu system call:                                     234.9 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop):                                    0.8 ns
- Read CPU from rseq cpu_id:                                0.8 ns
- Read CPU from rseq cpu_id (lazy register):                0.8 ns
- Read using gs segment selector:                           0.8 ns
- "lsl" inline assembly:                                   13.0 ns
- glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
- getcpu system call:                                      53.9 ns

- Speed

Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:

Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.

* CONFIG_RSEQ=n

avg.:      41.37 s
std.dev.:   0.36 s

* CONFIG_RSEQ=y

avg.:      40.46 s
std.dev.:   0.33 s

- Size

On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
2855 bytes, and the data size increase of vmlinux is 1024 bytes.

* CONFIG_RSEQ=n

   text	   data	    bss	    dec	    hex	filename
9964559	4256280	 962560	15183399	 e7ae27	vmlinux.norseq

* CONFIG_RSEQ=y

   text	   data	    bss	    dec	    hex	filename
9967414	4257304	 962560	15187278	 e7bd4e	vmlinux.rseq

[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---

Changes since v1:
- Return -1, errno=EINVAL if cpu_cache pointer is not aligned on
  sizeof(int32_t).
- Update man page to describe the pointer alignment requirements and
  update atomicity guarantees.
- Add MAINTAINERS file GETCPU_CACHE entry.
- Remove dynamic memory allocation: go back to having a single
  getcpu_cache entry per thread. Update documentation accordingly.
- Rebased on Linux 4.4.

Changes since v2:
- Introduce a "cmd" argument, along with an enum with GETCPU_CACHE_GET
  and GETCPU_CACHE_SET. Introduce a uapi header linux/getcpu_cache.h
  defining this enumeration.
- Split resume notifier architecture implementation from the system call
  wire up in the following arch-specific patches.
- Man pages updates.
- Handle 32-bit compat pointers.
- Simplify handling of getcpu_cache GETCPU_CACHE_SET compiler barrier:
  set the current cpu cache pointer before doing the cache update, and
  set it back to NULL if the update fails. Setting it back to NULL on
  error ensures that no resume notifier will trigger a SIGSEGV if a
  migration happened concurrently.

Changes since v3:
- Fix __user annotations in compat code,
- Update memory ordering comments.
- Rebased on kernel v4.5-rc5.

Changes since v4:
- Inline getcpu_cache_fork, getcpu_cache_execve, and getcpu_cache_exit.
- Add new line between if() and switch() to improve readability.
- Added sched switch benchmarks (hackbench) and size overhead comparison
  to change log.

Changes since v5:
- Rename "getcpu_cache" to "thread_local_abi", allowing to extend
  this system call to cover future features such as restartable critical
  sections. Generalizing this system call ensures that we can add
  features similar to the cpu_id field within the same cache-line
  without having to track one pointer per feature within the task
  struct.
- Add a tlabi_nr parameter to the system call, thus allowing the ABI
  to be extended beyond the initial 64-byte structure by registering
  structures with tlabi_nr greater than 0. The initial ABI structure is
  associated with tlabi_nr 0.
- Rebased on kernel v4.5.

Changes since v6:
- Integrate "restartable sequences" v2 patchset from Paul Turner.
- Add handling of single-stepping purely in user-space, with a
  fallback to locking after 2 rseq failures to ensure progress, and
  expose a __rseq_table section to debuggers so they know where
  to put breakpoints when dealing with rseq assembly blocks which
  can be aborted at any point.
- make the code and ABI generic: porting the kernel implementation
  simply requires wiring up the signal handler and return-to-user-space
  hooks, and allocating the syscall number.
- extend testing with a fully configurable test program. See
  param_spinlock_test -h for details.
- handling of rseq ENOSYS in user-space, also with a fallback
  to locking.
- modify Paul Turner's rseq ABI to only require a single TLS store on
  the user-space fast-path, removing the need to populate two additional
  registers. This is made possible by introducing struct rseq_cs into
  the ABI to describe a critical section start_ip, post_commit_ip, and
  abort_ip.
- Rebased on kernel v4.7-rc7.

Man page associated:

RSEQ(2)                Linux Programmer's Manual               RSEQ(2)

NAME
       rseq - Restartable sequences and cpu number cache

SYNOPSIS
       #include <linux/rseq.h>

       int rseq(struct rseq * rseq, int flags);

DESCRIPTION
       The  rseq()  ABI  accelerates  user-space operations on per-cpu
       data by defining a shared data structure ABI between each user-
       space thread and the kernel.

       The  rseq argument is a pointer to the thread-local rseq struc‐
       ture to be shared between kernel and user-space.  A  NULL  rseq
       value  can  be used to check whether rseq is registered for the
       current thread.

       The layout of struct rseq is as follows:

       Structure alignment
              This structure needs to be aligned on  multiples  of  64
              bytes.

       Structure size
              This structure has a fixed size of 128 bytes.

       Fields

           cpu_id
              Cache  of  the CPU number on which the calling thread is
              running.

           event_counter
              Restartable sequences event_counter field.

           rseq_cs
              Restartable sequences rseq_cs field. Points to a  struct
              rseq_cs.

       The layout of struct rseq_cs is as follows:

       Structure alignment
              This  structure  needs  to be aligned on multiples of 64
              bytes.

       Structure size
              This structure has a fixed size of 192 bytes.

       Fields

           start_ip
              Instruction pointer address of the first instruction  of
              the sequence of consecutive assembly instructions.

           post_commit_ip
              Instruction  pointer  address after the last instruction
              of the sequence of consecutive assembly instructions.

           abort_ip
              Instruction pointer address where to move the  execution
              flow  in  case  of  abort of the sequence of consecutive
              assembly instructions.

       The flags argument is currently unused and must be specified as
       0.

       Typically, a library or application will keep the rseq structure
       in a thread-local storage variable, or in another memory area
       belonging to each thread. It is recommended to perform volatile
       reads of the thread-local cache to prevent the compiler from
       tearing the loads. An alternative approach is to read each field
       from inline assembly.

       Each thread is responsible for registering its rseq  structure.
       Only  one  rseq structure address can be registered per thread.
       Once set, the rseq address is idempotent for a given thread.

       In a typical usage scenario, the thread  registering  the  rseq
       structure  will  be  performing  loads  and stores from/to that
       structure. It is however also allowed to  read  that  structure
       from  other  threads.   The rseq field updates performed by the
       kernel provide single-copy atomicity semantics, which guarantee
       that  other  threads performing single-copy atomic reads of the
       cpu number cache will always observe a consistent value.

       Memory registered as rseq structure should never be deallocated
       before  the  thread which registered it exits: specifically, it
       should not be freed, and the library containing the  registered
       thread-local  storage  should  not be dlclose'd. Violating this
       constraint may cause a SIGSEGV signal to be  delivered  to  the
       thread.

       Unregistration of the associated rseq structure is implicitly
       performed when a thread or process exits.

RETURN VALUE
       A return  value  of  0  indicates  success.  On  error,  -1  is
       returned, and errno is set appropriately.

ERRORS
       EINVAL Either  flags  is  non-zero, or rseq contains an address
              which is not appropriately aligned.

       ENOSYS The rseq() system call is not implemented by  this  ker‐
              nel.

       EFAULT rseq is an invalid address.

       EBUSY  The rseq argument contains a non-NULL address which dif‐
              fers from the memory  location  already  registered  for
              this thread.

       ENOENT The  rseq  argument  is  NULL, but no memory location is
              currently registered for this thread.

VERSIONS
       The rseq() system call was added in Linux 4.X (TODO).

CONFORMING TO
       rseq() is Linux-specific.

EXAMPLE
       The following code uses  the  rseq()  system  call  to  keep  a
       thread-local  storage  variable up to date with the current CPU
       number, with a fallback on sched_getcpu(3) if the cache is  not
       available.  For simplicity of the example, it is done in main(),
       but multithreaded programs would need to invoke rseq() from each
       program thread.

           #define _GNU_SOURCE
           #include <stdlib.h>
           #include <stdio.h>
           #include <unistd.h>
           #include <stdint.h>
           #include <sched.h>
           #include <stddef.h>
           #include <errno.h>
           #include <string.h>
           #include <sys/syscall.h>
           #include <linux/rseq.h>

           static __thread volatile struct rseq rseq_state = {
               .u.e.cpu_id = -1,
           };

           static int
           sys_rseq(volatile struct rseq *rseq_abi, int flags)
           {
               return syscall(__NR_rseq, rseq_abi, flags);
           }

           static int32_t
           rseq_current_cpu_raw(void)
           {
               return rseq_state.u.e.cpu_id;
           }

           static int32_t
           rseq_current_cpu(void)
           {
               int32_t cpu;

               cpu = rseq_current_cpu_raw();
               if (cpu < 0)
                   cpu = sched_getcpu();
               return cpu;
           }

           static int
           rseq_init_current_thread(void)
           {
               int rc;

               rc = sys_rseq(&rseq_state, 0);
               if (rc) {
                   fprintf(stderr, "Error: sys_rseq(...) failed(%d): %s\n",
                       errno, strerror(errno));
                   return -1;
               }
               return 0;
           }

           int
           main(int argc, char **argv)
           {
               if (rseq_init_current_thread()) {
                   fprintf(stderr,
                       "Unable to initialize restartable sequences.\n");
                   fprintf(stderr, "Using sched_getcpu() as fallback.\n");
               }
               printf("Current CPU number: %d\n", rseq_current_cpu());

               exit(EXIT_SUCCESS);
           }
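
       The example program can be built with, for instance:

           cc -o rseq-example rseq-example.c

       where rseq-example.c is an arbitrary file name for  the  source
       above, assuming kernel headers providing <linux/rseq.h> and the
       __NR_rseq definition are installed.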

SEE ALSO
       sched_getcpu(3)

Linux                         2016-07-19                       RSEQ(2)
---
 MAINTAINERS               |   7 ++
 arch/Kconfig              |   7 ++
 fs/exec.c                 |   1 +
 include/linux/sched.h     |  68 ++++++++++++++
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/rseq.h |  85 +++++++++++++++++
 init/Kconfig              |  13 +++
 kernel/Makefile           |   1 +
 kernel/fork.c             |   2 +
 kernel/rseq.c             | 231 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c       |   1 +
 kernel/sys_ni.c           |   3 +
 12 files changed, 420 insertions(+)
 create mode 100644 include/uapi/linux/rseq.h
 create mode 100644 kernel/rseq.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1209323..daef027 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5085,6 +5085,13 @@ M:	Joe Perches <joe@perches.com>
 S:	Maintained
 F:	scripts/get_maintainer.pl
 
+RESTARTABLE SEQUENCES SUPPORT
+M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	kernel/rseq.c
+F:	include/uapi/linux/rseq.h
+
 GFS2 FILE SYSTEM
 M:	Steven Whitehouse <swhiteho@redhat.com>
 M:	Bob Peterson <rpeterso@redhat.com>
diff --git a/arch/Kconfig b/arch/Kconfig
index 1599629..2c23e26 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -242,6 +242,13 @@ config HAVE_REGS_AND_STACK_ACCESS_API
 	  declared in asm/ptrace.h
 	  For example the kprobes-based event tracer needs this API.
 
+config HAVE_RSEQ
+	bool
+	depends on HAVE_REGS_AND_STACK_ACCESS_API
+	help
+	  This symbol should be selected by an architecture if it
+	  supports an implementation of restartable sequences.
+
 config HAVE_CLK
 	bool
 	help
diff --git a/fs/exec.c b/fs/exec.c
index 887c1c9..e912d87 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1707,6 +1707,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	rseq_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current);
 	free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 253538f..5c4b900 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -59,6 +59,7 @@ struct sched_param {
 #include <linux/gfp.h>
 #include <linux/magic.h>
 #include <linux/cgroup-defs.h>
+#include <linux/rseq.h>
 
 #include <asm/processor.h>
 
@@ -1918,6 +1919,10 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_RSEQ
+	struct rseq __user *rseq;
+	uint32_t rseq_event_counter;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -3387,4 +3392,67 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
 void cpufreq_remove_update_util_hook(int cpu);
 #endif /* CONFIG_CPU_FREQ */
 
+#ifdef CONFIG_RSEQ
+static inline void rseq_set_notify_resume(struct task_struct *t)
+{
+	if (t->rseq)
+		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+void __rseq_handle_notify_resume(struct pt_regs *regs);
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+{
+	if (current->rseq)
+		__rseq_handle_notify_resume(regs);
+}
+/*
+ * If the parent process has a registered restartable sequences area, the
+ * child inherits it. Only applies when forking a process, not a thread.
+ * In case the parent calls fork() in the middle of a restartable
+ * sequence, set the resume notifier to force the child to retry.
+ */
+static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
+{
+	if (clone_flags & CLONE_THREAD) {
+		t->rseq = NULL;
+		t->rseq_event_counter = 0;
+	} else {
+		t->rseq = current->rseq;
+		t->rseq_event_counter = current->rseq_event_counter;
+		rseq_set_notify_resume(t);
+	}
+}
+static inline void rseq_execve(struct task_struct *t)
+{
+	t->rseq = NULL;
+	t->rseq_event_counter = 0;
+}
+static inline void rseq_sched_out(struct task_struct *t)
+{
+	rseq_set_notify_resume(t);
+}
+static inline void rseq_signal_deliver(struct pt_regs *regs)
+{
+	rseq_handle_notify_resume(regs);
+}
+#else
+static inline void rseq_set_notify_resume(struct task_struct *t)
+{
+}
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+{
+}
+static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
+{
+}
+static inline void rseq_execve(struct task_struct *t)
+{
+}
+static inline void rseq_sched_out(struct task_struct *t)
+{
+}
+static inline void rseq_signal_deliver(struct pt_regs *regs)
+{
+}
+#endif
+
 #endif
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 8bdae34..2e64fb8 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -403,6 +403,7 @@ header-y += tcp_metrics.h
 header-y += telephony.h
 header-y += termios.h
 header-y += thermal.h
+header-y += rseq.h
 header-y += time.h
 header-y += times.h
 header-y += timex.h
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
new file mode 100644
index 0000000..3e79fa9
--- /dev/null
+++ b/include/uapi/linux/rseq.h
@@ -0,0 +1,85 @@
+#ifndef _UAPI_LINUX_RSEQ_H
+#define _UAPI_LINUX_RSEQ_H
+
+/*
+ * linux/rseq.h
+ *
+ * Restartable sequences system call API
+ *
+ * Copyright (c) 2015-2016 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifdef __KERNEL__
+# include <linux/types.h>
+#else	/* #ifdef __KERNEL__ */
+# include <stdint.h>
+#endif	/* #else #ifdef __KERNEL__ */
+
+#include <asm/byteorder.h>
+
+#ifdef __LP64__
+# define RSEQ_FIELD_u32_u64(field)	uint64_t field
+#elif defined(__BYTE_ORDER) ? \
+	__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+# define RSEQ_FIELD_u32_u64(field)	uint32_t _padding ## field, field
+#else
+# define RSEQ_FIELD_u32_u64(field)	uint32_t field, _padding ## field
+#endif
+
+struct rseq_cs {
+	RSEQ_FIELD_u32_u64(start_ip);
+	RSEQ_FIELD_u32_u64(post_commit_ip);
+	RSEQ_FIELD_u32_u64(abort_ip);
+} __attribute__((aligned(sizeof(uint64_t))));
+
+struct rseq {
+	union {
+		struct {
+			/*
+			 * Restartable sequences cpu_id field.
+			 * Updated by the kernel, and read by user-space with
+			 * single-copy atomicity semantics. Aligned on 32-bit.
+			 * Negative values are reserved for user-space.
+			 */
+			int32_t cpu_id;
+			/*
+			 * Restartable sequences event_counter field.
+			 * Updated by the kernel, and read by user-space with
+			 * single-copy atomicity semantics. Aligned on 32-bit.
+			 */
+			uint32_t event_counter;
+		} e;
+		/*
+		 * On architectures with 64-bit aligned reads, both cpu_id and
+		 * event_counter can be read with single-copy atomicity
+		 * semantics.
+		 */
+		uint64_t v;
+	} u;
+	/*
+	 * Restartable sequences rseq_cs field.
+	 * Updated by user-space, read by the kernel with
+	 * single-copy atomicity semantics. Aligned on 64-bit.
+	 */
+	RSEQ_FIELD_u32_u64(rseq_cs);
+} __attribute__((aligned(sizeof(uint64_t))));
+
+#endif /* _UAPI_LINUX_RSEQ_H */
diff --git a/init/Kconfig b/init/Kconfig
index c02d897..545b7ed 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1653,6 +1653,19 @@ config MEMBARRIER
 
 	  If unsure, say Y.
 
+config RSEQ
+	bool "Enable rseq() system call" if EXPERT
+	default y
+	depends on HAVE_RSEQ
+	help
+	  Enable the restartable sequences system call. It provides a
+	  user-space cache for the current CPU number value, which
+	  speeds up getting the current CPU number from user-space,
+	  as well as an ABI to speed up user-space operations on
+	  per-CPU data.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index e2ec54e..4c6d8b5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -112,6 +112,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_RSEQ) += rseq.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 4a7ec0c..cc7756b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1591,6 +1591,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 */
 	copy_seccomp(p);
 
+	rseq_fork(p, clone_flags);
+
 	/*
 	 * Process group and session signals need to be delivered to just the
 	 * parent before the fork or both the parent and the child after the
diff --git a/kernel/rseq.c b/kernel/rseq.c
new file mode 100644
index 0000000..e1c847b
--- /dev/null
+++ b/kernel/rseq.c
@@ -0,0 +1,231 @@
+/*
+ * Restartable sequences system call
+ *
+ * Restartable sequences are a lightweight interface that allows
+ * user-level code to be executed atomically relative to scheduler
+ * preemption and signal delivery. Typically used for implementing
+ * per-cpu operations.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) 2015, Google, Inc.,
+ * Paul Turner <pjt@google.com> and Andrew Hunter <ahh@google.com>
+ * Copyright (C) 2015-2016, EfficiOS Inc.,
+ * Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/compat.h>
+#include <linux/rseq.h>
+#include <asm/ptrace.h>
+
+/*
+ * Each restartable sequence assembly block defines a "struct rseq_cs"
+ * structure which describes the post_commit_ip address, and the
+ * abort_ip address where the kernel should move the thread instruction
+ * pointer if a rseq critical section assembly block is preempted or if
+ * a signal is delivered on top of a rseq critical section assembly
+ * block. It also contains a start_ip, which is the address of the start
+ * of the rseq assembly block, which is useful to debuggers.
+ *
+ * The algorithm for a restartable sequence assembly block is as
+ * follows:
+ *
+ * rseq_start()
+ *
+ *   0. Userspace loads the current event counter value from the
+ *      event_counter field of the registered struct rseq TLS area,
+ *
+ * rseq_finish()
+ *
+ *   Steps [1]-[3] (inclusive) need to be a sequence of instructions in
+ *   userspace that can handle being moved to the abort_ip between any
+ *   of those instructions.
+ *
+ *   The abort_ip address needs to be equal or above the post_commit_ip.
+ *   Step [4] and the failure code step [F1] need to be at addresses
+ *   equal or above the post_commit_ip.
+ *
+ *   1.  Userspace stores the address of the struct rseq_cs rseq
+ *       assembly block descriptor into the rseq_cs field of the
+ *       registered struct rseq TLS area.
+ *
+ *   2.  Userspace tests to see whether the current event counter values
+ *       match those loaded at [0]. Manually jumping to [F1] in case of
+ *       a mismatch.
+ *
+ *       Note that if we are preempted or interrupted by a signal
+ *       after [1] and before post_commit_ip, then the kernel also
+ *       performs the comparison performed in [2], and conditionally
+ *       clears rseq_cs, then jumps us to abort_ip.
+ *
+ *   3.  Userspace critical section final instruction before
+ *       post_commit_ip is the commit. The critical section is
+ *       self-terminating.
+ *       [post_commit_ip]
+ *
+ *   4.  Userspace clears the rseq_cs field of the struct rseq
+ *       TLS area.
+ *
+ *   5.  Return true.
+ *
+ *   On failure at [2]:
+ *
+ *   F1. Userspace clears the rseq_cs field of the struct rseq
+ *       TLS area. Followed by step [F2].
+ *
+ *       [abort_ip]
+ *   F2. Return false.
+ */
+
+static int rseq_increment_event_counter(struct task_struct *t)
+{
+	if (__put_user(++t->rseq_event_counter,
+			&t->rseq->u.e.event_counter))
+		return -1;
+	return 0;
+}
+
+static int rseq_get_rseq_cs(struct task_struct *t,
+		void __user **post_commit_ip,
+		void __user **abort_ip)
+{
+	unsigned long ptr;
+	struct rseq_cs __user *rseq_cs;
+
+	if (__get_user(ptr, &t->rseq->rseq_cs))
+		return -1;
+	if (!ptr)
+		return 0;
+#ifdef CONFIG_COMPAT
+	if (in_compat_syscall()) {
+		rseq_cs = compat_ptr((compat_uptr_t)ptr);
+		if (get_user(ptr, &rseq_cs->post_commit_ip))
+			return -1;
+		*post_commit_ip = compat_ptr((compat_uptr_t)ptr);
+		if (get_user(ptr, &rseq_cs->abort_ip))
+			return -1;
+		*abort_ip = compat_ptr((compat_uptr_t)ptr);
+		return 0;
+	}
+#endif
+	rseq_cs = (struct rseq_cs __user *)ptr;
+	if (get_user(ptr, &rseq_cs->post_commit_ip))
+		return -1;
+	*post_commit_ip = (void __user *)ptr;
+	if (get_user(ptr, &rseq_cs->abort_ip))
+		return -1;
+	*abort_ip = (void __user *)ptr;
+	return 0;
+}
+
+static int rseq_ip_fixup(struct pt_regs *regs)
+{
+	struct task_struct *t = current;
+	void __user *post_commit_ip = NULL;
+	void __user *abort_ip = NULL;
+
+	if (rseq_get_rseq_cs(t, &post_commit_ip, &abort_ip))
+		return -1;
+
+	/* Handle potentially being within a critical section. */
+	if ((void __user *)instruction_pointer(regs) < post_commit_ip) {
+		/*
+		 * We need to clear rseq_cs upon entry into a signal
+		 * handler nested on top of a rseq assembly block, so
+		 * the signal handler will not be fixed up if itself
+		 * interrupted by a nested signal handler or preempted.
+		 */
+		if (clear_user(&t->rseq->rseq_cs,
+				sizeof(t->rseq->rseq_cs)))
+			return -1;
+
+		/*
+		 * We set this after potentially failing in
+		 * clear_user so that the signal arrives at the
+		 * faulting rip.
+		 */
+		instruction_pointer_set(regs, (unsigned long)abort_ip);
+	}
+	return 0;
+}
+
+/*
+ * This resume handler should always be executed between any of:
+ * - preemption,
+ * - signal delivery,
+ * and return to user-space.
+ */
+void __rseq_handle_notify_resume(struct pt_regs *regs)
+{
+	struct task_struct *t = current;
+
+	if (unlikely(t->flags & PF_EXITING))
+		return;
+	if (!access_ok(VERIFY_WRITE, t->rseq, sizeof(*t->rseq)))
+		goto error;
+	if (__put_user(raw_smp_processor_id(), &t->rseq->u.e.cpu_id))
+		goto error;
+	if (rseq_increment_event_counter(t))
+		goto error;
+	if (rseq_ip_fixup(regs))
+		goto error;
+	return;
+
+error:
+	force_sig(SIGSEGV, t);
+}
+
+/*
+ * sys_rseq - setup restartable sequences for caller thread.
+ */
+SYSCALL_DEFINE2(rseq, struct rseq __user *, rseq, int, flags)
+{
+	if (unlikely(flags))
+		return -EINVAL;
+	if (!rseq) {
+		if (!current->rseq)
+			return -ENOENT;
+		return 0;
+	}
+
+	if (current->rseq) {
+		/*
+		 * If rseq is already registered, check whether
+		 * the provided address differs from the prior
+		 * one.
+		 */
+		if (current->rseq != rseq)
+			return -EBUSY;
+	} else {
+		/*
+		 * If there was no rseq previously registered,
+		 * we need to ensure the provided rseq is
+		 * properly aligned and valid.
+		 */
+		if (!IS_ALIGNED((unsigned long)rseq, sizeof(uint64_t)))
+			return -EINVAL;
+		if (!access_ok(VERIFY_WRITE, rseq, sizeof(*rseq)))
+			return -EFAULT;
+		current->rseq = rseq;
+		/*
+		 * If rseq was previously inactive, and has just
+		 * been registered, ensure the cpu_id and
+		 * event_counter fields are updated before
+		 * returning to user-space.
+		 */
+		rseq_set_notify_resume(current);
+	}
+
+	return 0;
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51d7105..fbef0c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2664,6 +2664,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 {
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
+	rseq_sched_out(prev);
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 2c5e3a8..c653f78 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -250,3 +250,6 @@ cond_syscall(sys_execveat);
 
 /* membarrier */
 cond_syscall(sys_membarrier);
+
+/* restartable sequence */
+cond_syscall(sys_rseq);
-- 
2.1.4


* [RFC PATCH v7 2/7] tracing: instrument restartable sequences
From: Mathieu Desnoyers @ 2016-07-21 21:14 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng, Mathieu Desnoyers

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 include/trace/events/rseq.h | 60 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/rseq.c               | 18 +++++++++++---
 2 files changed, 75 insertions(+), 3 deletions(-)
 create mode 100644 include/trace/events/rseq.h

diff --git a/include/trace/events/rseq.h b/include/trace/events/rseq.h
new file mode 100644
index 0000000..83fd31e
--- /dev/null
+++ b/include/trace/events/rseq.h
@@ -0,0 +1,60 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rseq
+
+#if !defined(_TRACE_RSEQ_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_RSEQ_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(rseq_inc,
+
+	TP_PROTO(uint32_t event_counter, int ret),
+
+	TP_ARGS(event_counter, ret),
+
+	TP_STRUCT__entry(
+		__field(uint32_t, event_counter)
+		__field(int, ret)
+	),
+
+	TP_fast_assign(
+		__entry->event_counter = event_counter;
+		__entry->ret = ret;
+	),
+
+	TP_printk("event_counter=%u ret=%d",
+		__entry->event_counter, __entry->ret)
+);
+
+TRACE_EVENT(rseq_ip_fixup,
+
+	TP_PROTO(void __user *regs_ip, void __user *post_commit_ip,
+		void __user *abort_ip, uint32_t kevcount, int ret),
+
+	TP_ARGS(regs_ip, post_commit_ip, abort_ip, kevcount, ret),
+
+	TP_STRUCT__entry(
+		__field(void __user *, regs_ip)
+		__field(void __user *, post_commit_ip)
+		__field(void __user *, abort_ip)
+		__field(uint32_t, kevcount)
+		__field(int, ret)
+	),
+
+	TP_fast_assign(
+		__entry->regs_ip = regs_ip;
+		__entry->post_commit_ip = post_commit_ip;
+		__entry->abort_ip = abort_ip;
+		__entry->kevcount = kevcount;
+		__entry->ret = ret;
+	),
+
+	TP_printk("regs_ip=%p post_commit_ip=%p abort_ip=%p kevcount=%u ret=%d",
+		__entry->regs_ip, __entry->post_commit_ip, __entry->abort_ip,
+		__entry->kevcount, __entry->ret)
+);
+
+#endif /* _TRACE_RSEQ_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/rseq.c b/kernel/rseq.c
index e1c847b..cab326a 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -29,6 +29,9 @@
 #include <linux/rseq.h>
 #include <asm/ptrace.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/rseq.h>
+
 /*
  * Each restartable sequence assembly block defines a "struct rseq_cs"
  * structure which describes the post_commit_ip address, and the
@@ -90,8 +93,12 @@
 
 static int rseq_increment_event_counter(struct task_struct *t)
 {
-	if (__put_user(++t->rseq_event_counter,
-			&t->rseq->u.e.event_counter))
+	int ret;
+
+	ret = __put_user(++t->rseq_event_counter,
+			&t->rseq->u.e.event_counter);
+	trace_rseq_inc(t->rseq_event_counter, ret);
+	if (ret)
 		return -1;
 	return 0;
 }
@@ -134,8 +141,13 @@ static int rseq_ip_fixup(struct pt_regs *regs)
 	struct task_struct *t = current;
 	void __user *post_commit_ip = NULL;
 	void __user *abort_ip = NULL;
+	int ret;
 
-	if (rseq_get_rseq_cs(t, &post_commit_ip, &abort_ip))
+	ret = rseq_get_rseq_cs(t, &post_commit_ip, &abort_ip);
+	trace_rseq_ip_fixup((void __user *)instruction_pointer(regs),
+		post_commit_ip, abort_ip, t->rseq_event_counter,
+		ret);
+	if (ret)
 		return -1;
 
 	/* Handle potentially being within a critical section. */
-- 
2.1.4


* [RFC PATCH v7 3/7] Restartable sequences: ARM 32 architecture support
From: Mathieu Desnoyers @ 2016-07-21 21:14 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng, Mathieu Desnoyers

Call the rseq_handle_notify_resume() function on return to
userspace if TIF_NOTIFY_RESUME thread flag is set.

Increment the event counter and perform fixup on the pre-signal frame
when a signal is delivered on top of a restartable sequence critical
section.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/arm/Kconfig         | 1 +
 arch/arm/kernel/signal.c | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 90542db..636e14b 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -75,6 +75,7 @@ config ARM
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_RCU_TABLE_FREE if (SMP && ARM_LPAE)
 	select HAVE_REGS_AND_STACK_ACCESS_API
+	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UID16
 	select HAVE_VIRT_CPU_ACCOUNTING_GEN
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index 7b8f214..907da02 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -475,6 +475,12 @@ static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 	int ret;
 
 	/*
+	 * Increment event counter and perform fixup for the pre-signal
+	 * frame.
+	 */
+	rseq_signal_deliver(regs);
+
+	/*
 	 * Set up the stack frame
 	 */
 	if (ksig->ka.sa.sa_flags & SA_SIGINFO)
@@ -594,6 +600,7 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 			} else {
 				clear_thread_flag(TIF_NOTIFY_RESUME);
 				tracehook_notify_resume(regs);
+				rseq_handle_notify_resume(regs);
 			}
 		}
 		local_irq_disable();
-- 
2.1.4


* [RFC PATCH v7 4/7] Restartable sequences: wire up ARM 32 system call
From: Mathieu Desnoyers @ 2016-07-21 21:14 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng, Mathieu Desnoyers

Wire up the rseq system call on 32-bit ARM.

This provides an ABI improving the speed of a user-space getcpu
operation on ARM by skipping the getcpu system call on the fast path, as
well as improving the speed of user-space operations on per-cpu data
compared to using load-linked/store-conditional.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/arm/include/uapi/asm/unistd.h | 1 +
 arch/arm/kernel/calls.S            | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index 2cb9dc7..8f61c79 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -420,6 +420,7 @@
 #define __NR_copy_file_range		(__NR_SYSCALL_BASE+391)
 #define __NR_preadv2			(__NR_SYSCALL_BASE+392)
 #define __NR_pwritev2			(__NR_SYSCALL_BASE+393)
+#define __NR_rseq			(__NR_SYSCALL_BASE+394)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 703fa0f..0865c04 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -403,6 +403,7 @@
 		CALL(sys_copy_file_range)
 		CALL(sys_preadv2)
 		CALL(sys_pwritev2)
+		CALL(sys_rseq)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
-- 
2.1.4


* [RFC PATCH v7 5/7] Restartable sequences: x86 32/64 architecture support
From: Mathieu Desnoyers @ 2016-07-21 21:14 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng, Mathieu Desnoyers

Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.

Increment the event counter and perform fixup on the pre-signal frame
when a signal is delivered on top of a restartable sequence critical
section.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/x86/Kconfig         | 1 +
 arch/x86/entry/common.c  | 1 +
 arch/x86/kernel/signal.c | 6 ++++++
 3 files changed, 8 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d9a94da..1db7b06 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -140,6 +140,7 @@ config X86
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_REGS_AND_STACK_ACCESS_API
+	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UID16			if X86_32 || IA32_EMULATION
 	select HAVE_UNSTABLE_SCHED_CLOCK
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ec138e5..3877dba 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -231,6 +231,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_NOTIFY_RESUME) {
 			clear_thread_flag(TIF_NOTIFY_RESUME);
 			tracehook_notify_resume(regs);
+			rseq_handle_notify_resume(regs);
 		}
 
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 22cc2f9..0f4da5a 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -683,6 +683,12 @@ setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 	sigset_t *set = sigmask_to_save();
 	compat_sigset_t *cset = (compat_sigset_t *) set;
 
+	/*
+	 * Increment event counter and perform fixup for the pre-signal
+	 * frame.
+	 */
+	rseq_signal_deliver(regs);
+
 	/* Set up the stack frame */
 	if (is_ia32_frame()) {
 		if (ksig->ka.sa.sa_flags & SA_SIGINFO)
-- 
2.1.4


* [RFC PATCH v7 6/7] Restartable sequences: wire up x86 32/64 system call
From: Mathieu Desnoyers @ 2016-07-21 21:14 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng, Mathieu Desnoyers

Wire up the rseq system call on x86 32/64.

This provides an ABI improving the speed of a user-space getcpu
operation on x86 by removing the need to perform a function call, "lsl"
instruction, or system call on the fast path, as well as improving the
speed of user-space operations on per-cpu data.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cddd17..15fb98c 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -386,3 +386,4 @@
 377	i386	copy_file_range		sys_copy_file_range
 378	i386	preadv2			sys_preadv2			compat_sys_preadv2
 379	i386	pwritev2		sys_pwritev2			compat_sys_pwritev2
+380	i386	rseq			sys_rseq
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 555263e..c7f3c7e 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -335,6 +335,7 @@
 326	common	copy_file_range		sys_copy_file_range
 327	64	preadv2			sys_preadv2
 328	64	pwritev2		sys_pwritev2
+329	common	rseq			sys_rseq
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-07-21 21:14 [RFC PATCH v7 0/7] Restartable sequences system call Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2016-07-21 21:14 ` [RFC PATCH v7 6/7] Restartable sequences: wire up x86 32/64 system call Mathieu Desnoyers
@ 2016-07-21 21:14 ` Mathieu Desnoyers
       [not found]   ` <CO1PR15MB09822FC140F84DCEEF2004CDDD0B0@CO1PR15MB0982.namprd15.prod.outlook.com>
  6 siblings, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-07-21 21:14 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng, Mathieu Desnoyers

Implements two basic tests of RSEQ functionality, and one more
exhaustive parameterizable test.

The first, "basic_test", only asserts that RSEQ works moderately
correctly, e.g. that:
- The CPUID pointer works
- Code infinitely looping within a critical section will eventually be
  interrupted.
- Critical sections are interrupted by signals.

"basic_percpu_ops_test" is a slightly more "realistic" variant,
implementing a few simple per-cpu operations and testing their
correctness.

"param_test" is a parametrizable restartable sequences test. See
the "--help" output for usage.
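
For instance, the per-cpu linked list test can be run with 8 threads,
100000 repetitions per thread and yield injection with something like:

	./param_test -T l -t 8 -r 100000 -y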

As part of those tests, a helper library "rseq" implements a user-space
API around restartable sequences. It takes care of ensuring progress in
case of debugger single-stepping with a fall-back to locking, and
exposes the instruction pointer addresses where the rseq assembly blocks
begin and end, as well as the associated abort instruction pointer, in
the __rseq_table section. This section allows debuggers to know where
to place breakpoints when single-stepping through assembly blocks which
may be aborted at any point by the kernel.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 tools/testing/selftests/rseq/.gitignore            |   3 +
 tools/testing/selftests/rseq/Makefile              |  13 +
 .../testing/selftests/rseq/basic_percpu_ops_test.c | 279 ++++++++
 tools/testing/selftests/rseq/basic_test.c          | 106 +++
 tools/testing/selftests/rseq/param_test.c          | 707 +++++++++++++++++++++
 tools/testing/selftests/rseq/rseq.c                | 200 ++++++
 tools/testing/selftests/rseq/rseq.h                | 449 +++++++++++++
 7 files changed, 1757 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/.gitignore
 create mode 100644 tools/testing/selftests/rseq/Makefile
 create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c
 create mode 100644 tools/testing/selftests/rseq/basic_test.c
 create mode 100644 tools/testing/selftests/rseq/param_test.c
 create mode 100644 tools/testing/selftests/rseq/rseq.c
 create mode 100644 tools/testing/selftests/rseq/rseq.h

diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
new file mode 100644
index 0000000..2596e26
--- /dev/null
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -0,0 +1,3 @@
+basic_percpu_ops_test
+basic_test
+param_test
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
new file mode 100644
index 0000000..3d1ad8e
--- /dev/null
+++ b/tools/testing/selftests/rseq/Makefile
@@ -0,0 +1,13 @@
+CFLAGS += -O2 -Wall -g -I../../../../usr/include/
+LDFLAGS += -lpthread
+
+TESTS = basic_test basic_percpu_ops_test param_test
+
+all: $(TESTS)
+%: %.c rseq.h rseq.c
+	$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
+
+include ../lib.mk
+
+clean:
+	$(RM) $(TESTS)
diff --git a/tools/testing/selftests/rseq/basic_percpu_ops_test.c b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
new file mode 100644
index 0000000..4667dc5
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
@@ -0,0 +1,279 @@
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "rseq.h"
+
+static struct rseq_lock rseq_lock;
+
+struct percpu_lock_entry {
+	intptr_t v;
+} __attribute__((aligned(128)));
+
+struct percpu_lock {
+	struct percpu_lock_entry c[CPU_SETSIZE];
+};
+
+struct test_data_entry {
+	int count;
+} __attribute__((aligned(128)));
+
+struct spinlock_test_data {
+	struct percpu_lock lock;
+	struct test_data_entry c[CPU_SETSIZE];
+	int reps;
+};
+
+struct percpu_list_node {
+	intptr_t data;
+	struct percpu_list_node *next;
+};
+
+struct percpu_list_entry {
+	struct percpu_list_node *head;
+} __attribute__((aligned(128)));
+
+struct percpu_list {
+	struct percpu_list_entry c[CPU_SETSIZE];
+};
+
+/* A simple percpu spinlock.  Returns the cpu lock was acquired on. */
+int rseq_percpu_lock(struct percpu_lock *lock)
+{
+	struct rseq_state rseq_state;
+	intptr_t *targetptr, newval;
+	int cpu;
+	bool result;
+
+	for (;;) {
+		do_rseq(&rseq_lock, rseq_state, cpu, result, targetptr, newval,
+			{
+				if (unlikely(lock->c[cpu].v)) {
+					result = false;
+				} else {
+					newval = 1;
+					targetptr = (intptr_t *)&lock->c[cpu].v;
+				}
+			});
+		if (likely(result))
+			break;
+	}
+	/*
+	 * Acquire semantic when taking lock after control dependency.
+	 * Matches smp_store_release().
+	 */
+	smp_acquire__after_ctrl_dep();
+	return cpu;
+}
+
+void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+	assert(lock->c[cpu].v == 1);
+	/*
+	 * Release lock, with release semantic. Matches
+	 * smp_acquire__after_ctrl_dep().
+	 */
+	smp_store_release(&lock->c[cpu].v, 0);
+}
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+	struct spinlock_test_data *data = arg;
+	int i, cpu;
+
+	if (rseq_init_current_thread())
+		abort();
+	for (i = 0; i < data->reps; i++) {
+		cpu = rseq_percpu_lock(&data->lock);
+		data->c[cpu].count++;
+		rseq_percpu_unlock(&data->lock, cpu);
+	}
+
+	return NULL;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu
+ * lock.  Obviously real applications might prefer to simply use a
+ * per-cpu increment; however, this is reasonable for a test and the
+ * lock can be extended to synchronize more complicated operations.
+ */
+void test_percpu_spinlock(void)
+{
+	const int num_threads = 200;
+	int i, sum;
+	pthread_t test_threads[num_threads];
+	struct spinlock_test_data data;
+
+	memset(&data, 0, sizeof(data));
+	data.reps = 5000;
+
+	for (i = 0; i < num_threads; i++)
+		pthread_create(&test_threads[i], NULL,
+			test_percpu_spinlock_thread, &data);
+
+	for (i = 0; i < num_threads; i++)
+		pthread_join(test_threads[i], NULL);
+
+	sum = 0;
+	for (i = 0; i < CPU_SETSIZE; i++)
+		sum += data.c[i].count;
+
+	assert(sum == data.reps * num_threads);
+}
+
+int percpu_list_push(struct percpu_list *list, struct percpu_list_node *node)
+{
+	struct rseq_state rseq_state;
+	intptr_t *targetptr, newval;
+	int cpu;
+	bool result;
+
+	do_rseq(&rseq_lock, rseq_state, cpu, result, targetptr, newval,
+		{
+			newval = (intptr_t)node;
+			targetptr = (intptr_t *)&list->c[cpu].head;
+			node->next = list->c[cpu].head;
+		});
+
+	return cpu;
+}
+
+/*
+ * Unlike a traditional lock-less linked list, the availability of a
+ * rseq primitive allows us to implement pop without concerns over
+ * ABA-type races.
+ */
+struct percpu_list_node *percpu_list_pop(struct percpu_list *list)
+{
+	struct percpu_list_node *head, *next;
+	struct rseq_state rseq_state;
+	intptr_t *targetptr, newval;
+	int cpu;
+	bool result;
+
+	do_rseq(&rseq_lock, rseq_state, cpu, result, targetptr, newval,
+		{
+			head = list->c[cpu].head;
+			if (!head) {
+				result = false;
+			} else {
+				next = head->next;
+				newval = (intptr_t) next;
+				targetptr = (intptr_t *)&list->c[cpu].head;
+			}
+		});
+
+	return head;
+}
+
+void *test_percpu_list_thread(void *arg)
+{
+	int i;
+	struct percpu_list *list = (struct percpu_list *)arg;
+
+	if (rseq_init_current_thread())
+		abort();
+
+	for (i = 0; i < 100000; i++) {
+		struct percpu_list_node *node = percpu_list_pop(list);
+
+		sched_yield();  /* encourage shuffling */
+		if (node)
+			percpu_list_push(list, node);
+	}
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu linked list from many threads.  */
+void test_percpu_list(void)
+{
+	int i, j;
+	long sum = 0, expected_sum = 0;
+	struct percpu_list list;
+	pthread_t test_threads[200];
+	cpu_set_t allowed_cpus;
+
+	memset(&list, 0, sizeof(list));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		for (j = 1; j <= 100; j++) {
+			struct percpu_list_node *node;
+
+			expected_sum += j;
+
+			node = malloc(sizeof(*node));
+			assert(node);
+			node->data = j;
+			node->next = list.c[i].head;
+			list.c[i].head = node;
+		}
+	}
+
+	for (i = 0; i < 200; i++)
+		assert(pthread_create(&test_threads[i], NULL,
+			test_percpu_list_thread, &list) == 0);
+
+	for (i = 0; i < 200; i++)
+		pthread_join(test_threads[i], NULL);
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		cpu_set_t pin_mask;
+		struct percpu_list_node *node;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		CPU_ZERO(&pin_mask);
+		CPU_SET(i, &pin_mask);
+		sched_setaffinity(0, sizeof(pin_mask), &pin_mask);
+
+		while ((node = percpu_list_pop(&list))) {
+			sum += node->data;
+			free(node);
+		}
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+int main(int argc, char **argv)
+{
+	if (rseq_init_lock(&rseq_lock)) {
+		perror("rseq_init_lock");
+		return -1;
+	}
+	if (rseq_init_current_thread())
+		goto error;
+	printf("spinlock\n");
+	test_percpu_spinlock();
+	printf("percpu_list\n");
+	test_percpu_list();
+
+	if (rseq_destroy_lock(&rseq_lock)) {
+		perror("rseq_destroy_lock");
+		return -1;
+	}
+	return 0;
+
+error:
+	if (rseq_destroy_lock(&rseq_lock))
+		perror("rseq_destroy_lock");
+	return -1;
+}
+
diff --git a/tools/testing/selftests/rseq/basic_test.c b/tools/testing/selftests/rseq/basic_test.c
new file mode 100644
index 0000000..e8fdcd6
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_test.c
@@ -0,0 +1,106 @@
+/*
+ * Basic test coverage for critical regions and rseq_current_cpu().
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+volatile int signals_delivered;
+volatile __thread struct rseq_state sigtest_start;
+static struct rseq_lock rseq_lock;
+
+void test_cpu_pointer(void)
+{
+	cpu_set_t affinity, test_affinity;
+	int i;
+
+	sched_getaffinity(0, sizeof(affinity), &affinity);
+	CPU_ZERO(&test_affinity);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (CPU_ISSET(i, &affinity)) {
+			CPU_SET(i, &test_affinity);
+			sched_setaffinity(0, sizeof(test_affinity),
+					&test_affinity);
+			assert(rseq_current_cpu() == sched_getcpu());
+			assert(rseq_current_cpu() == i);
+			CPU_CLR(i, &test_affinity);
+		}
+	}
+	sched_setaffinity(0, sizeof(affinity), &affinity);
+}
+
+/*
+ * This depends solely on some environmental event triggering a counter
+ * increase.
+ */
+void test_critical_section(void)
+{
+	struct rseq_state start;
+	uint32_t event_counter;
+
+	start = rseq_start(&rseq_lock);
+	event_counter = start.event_counter;
+	do {
+		start = rseq_start(&rseq_lock);
+	} while (start.event_counter == event_counter);
+}
+
+void test_signal_interrupt_handler(int signo)
+{
+	struct rseq_state current;
+
+	current = rseq_start(&rseq_lock);
+	/*
+	 * The potential critical section bordered by 'start' must be
+	 * invalid.
+	 */
+	assert(current.event_counter != sigtest_start.event_counter);
+	signals_delivered++;
+}
+
+void test_signal_interrupts(void)
+{
+	struct itimerval it = { { 0, 1 }, { 0, 1 } };
+
+	setitimer(ITIMER_PROF, &it, NULL);
+	signal(SIGPROF, test_signal_interrupt_handler);
+
+	do {
+		sigtest_start = rseq_start(&rseq_lock);
+	} while (signals_delivered < 10);
+	setitimer(ITIMER_PROF, NULL, NULL);
+}
+
+int main(int argc, char **argv)
+{
+	if (rseq_init_lock(&rseq_lock)) {
+		perror("rseq_init_lock");
+		return -1;
+	}
+	if (rseq_init_current_thread())
+		goto init_thread_error;
+	printf("testing current cpu\n");
+	test_cpu_pointer();
+	printf("testing critical section\n");
+	test_critical_section();
+	printf("testing critical section is interrupted by signal\n");
+	test_signal_interrupts();
+
+	if (rseq_destroy_lock(&rseq_lock)) {
+		perror("rseq_destroy_lock");
+		return -1;
+	}
+	return 0;
+
+init_thread_error:
+	if (rseq_destroy_lock(&rseq_lock))
+		perror("rseq_destroy_lock");
+	return -1;
+}
diff --git a/tools/testing/selftests/rseq/param_test.c b/tools/testing/selftests/rseq/param_test.c
new file mode 100644
index 0000000..f95fba5
--- /dev/null
+++ b/tools/testing/selftests/rseq/param_test.c
@@ -0,0 +1,707 @@
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+#include <poll.h>
+#include <sys/types.h>
+#include <signal.h>
+#include <errno.h>
+
+static inline pid_t gettid(void)
+{
+	return syscall(__NR_gettid);
+}
+
+#define NR_INJECT	9
+static int loop_cnt[NR_INJECT + 1];
+
+static int opt_modulo;
+
+static int opt_yield, opt_signal, opt_sleep, opt_fallback_cnt = 3,
+		opt_disable_rseq, opt_threads = 200,
+		opt_reps = 5000, opt_disable_mod = 0, opt_test = 's';
+
+static __thread unsigned int signals_delivered;
+
+static struct rseq_lock rseq_lock;
+
+#ifndef BENCHMARK
+
+static __thread unsigned int yield_mod_cnt, nr_retry;
+
+#define printf_nobench(fmt, ...)	printf(fmt, ## __VA_ARGS__)
+
+#define RSEQ_INJECT_INPUT \
+	, [loop_cnt_1]"m"(loop_cnt[1]) \
+	, [loop_cnt_2]"m"(loop_cnt[2]) \
+	, [loop_cnt_3]"m"(loop_cnt[3]) \
+	, [loop_cnt_4]"m"(loop_cnt[4])
+
+#if defined(__x86_64__) || defined(__i386__)
+
+#define INJECT_ASM_REG	"eax"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"mov %[loop_cnt_" #n "], %%" INJECT_ASM_REG "\n\t" \
+	"test %%" INJECT_ASM_REG ",%%" INJECT_ASM_REG "\n\t" \
+	"jz 333f\n\t" \
+	"222:\n\t" \
+	"dec %%" INJECT_ASM_REG "\n\t" \
+	"jnz 222b\n\t" \
+	"333:\n\t"
+
+#elif defined(__ARMEL__)
+
+#define INJECT_ASM_REG	"r4"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"ldr " INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
+	"cmp " INJECT_ASM_REG ", #0\n\t" \
+	"beq 333f\n\t" \
+	"222:\n\t" \
+	"subs " INJECT_ASM_REG ", #1\n\t" \
+	"bne 222b\n\t" \
+	"333:\n\t"
+
+#else
+#error unsupported target
+#endif
+
+#define RSEQ_INJECT_FAILED \
+	nr_retry++;
+
+#define RSEQ_INJECT_C(n) \
+{ \
+	int loc_i, loc_nr_loops = loop_cnt[n]; \
+	\
+	for (loc_i = 0; loc_i < loc_nr_loops; loc_i++) { \
+		barrier(); \
+	} \
+	if (loc_nr_loops == -1 && opt_modulo) { \
+		if (yield_mod_cnt == opt_modulo - 1) { \
+			if (opt_sleep > 0) \
+				poll(NULL, 0, opt_sleep); \
+			if (opt_yield) \
+				sched_yield(); \
+			if (opt_signal) \
+				raise(SIGUSR1); \
+			yield_mod_cnt = 0; \
+		} else { \
+			yield_mod_cnt++; \
+		} \
+	} \
+}
+
+#define RSEQ_FALLBACK_CNT	\
+	opt_fallback_cnt
+
+#else
+
+#define printf_nobench(fmt, ...)
+
+#endif /* BENCHMARK */
+
+#include "rseq.h"
+
+struct percpu_lock_entry {
+	intptr_t v;
+} __attribute__((aligned(128)));
+
+struct percpu_lock {
+	struct percpu_lock_entry c[CPU_SETSIZE];
+};
+
+struct test_data_entry {
+	int count;
+} __attribute__((aligned(128)));
+
+struct spinlock_test_data {
+	struct percpu_lock lock;
+	struct test_data_entry c[CPU_SETSIZE];
+};
+
+struct spinlock_thread_test_data {
+	struct spinlock_test_data *data;
+	int reps;
+	int reg;
+};
+
+struct inc_test_data {
+	struct test_data_entry c[CPU_SETSIZE];
+};
+
+struct inc_thread_test_data {
+	struct inc_test_data *data;
+	int reps;
+	int reg;
+};
+
+struct percpu_list_node {
+	intptr_t data;
+	struct percpu_list_node *next;
+};
+
+struct percpu_list_entry {
+	struct percpu_list_node *head;
+} __attribute__((aligned(128)));
+
+struct percpu_list {
+	struct percpu_list_entry c[CPU_SETSIZE];
+};
+
+/* A simple percpu spinlock.  Returns the cpu lock was acquired on. */
+static int rseq_percpu_lock(struct percpu_lock *lock)
+{
+	struct rseq_state rseq_state;
+	intptr_t *targetptr, newval;
+	int cpu;
+	bool result;
+
+	for (;;) {
+		do_rseq(&rseq_lock, rseq_state, cpu, result, targetptr, newval,
+			{
+				if (unlikely(lock->c[cpu].v)) {
+					result = false;
+				} else {
+					newval = 1;
+					targetptr = (intptr_t *)&lock->c[cpu].v;
+				}
+			});
+		if (likely(result))
+			break;
+	}
+	/*
+	 * Acquire semantic when taking lock after control dependency.
+	 * Matches smp_store_release().
+	 */
+	smp_acquire__after_ctrl_dep();
+	return cpu;
+}
+
+static void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+	assert(lock->c[cpu].v == 1);
+	/*
+	 * Release lock, with release semantic. Matches
+	 * smp_acquire__after_ctrl_dep().
+	 */
+	smp_store_release(&lock->c[cpu].v, 0);
+}
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+	struct spinlock_thread_test_data *thread_data = arg;
+	struct spinlock_test_data *data = thread_data->data;
+	int i, cpu;
+
+	if (!opt_disable_rseq && thread_data->reg
+			&& rseq_init_current_thread())
+		abort();
+	for (i = 0; i < thread_data->reps; i++) {
+		cpu = rseq_percpu_lock(&data->lock);
+		data->c[cpu].count++;
+		rseq_percpu_unlock(&data->lock, cpu);
+#ifndef BENCHMARK
+		if (i != 0 && !(i % (thread_data->reps / 10)))
+			printf("tid %d: count %d\n", (int) gettid(), i);
+#endif
+	}
+	printf_nobench("tid %d: number of retry: %d, signals delivered: %u, nr_fallback %u, nr_fallback_wait %u\n",
+		(int) gettid(), nr_retry, signals_delivered,
+		__rseq_thread_state.fallback_cnt,
+		__rseq_thread_state.fallback_wait_cnt);
+	return NULL;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu
+ * lock.  Obviously real applications might prefer to simply use a
+ * per-cpu increment; however, this is reasonable for a test and the
+ * lock can be extended to synchronize more complicated operations.
+ */
+void test_percpu_spinlock(void)
+{
+	const int num_threads = opt_threads;
+	int i, sum, ret;
+	pthread_t test_threads[num_threads];
+	struct spinlock_test_data data;
+	struct spinlock_thread_test_data thread_data[num_threads];
+
+	memset(&data, 0, sizeof(data));
+	for (i = 0; i < num_threads; i++) {
+		thread_data[i].reps = opt_reps;
+		if (opt_disable_mod <= 0 || (i % opt_disable_mod))
+			thread_data[i].reg = 1;
+		else
+			thread_data[i].reg = 0;
+		thread_data[i].data = &data;
+		ret = pthread_create(&test_threads[i], NULL,
+			test_percpu_spinlock_thread, &thread_data[i]);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	sum = 0;
+	for (i = 0; i < CPU_SETSIZE; i++)
+		sum += data.c[i].count;
+
+	assert(sum == opt_reps * num_threads);
+}
+
+void *test_percpu_inc_thread(void *arg)
+{
+	struct inc_thread_test_data *thread_data = arg;
+	struct inc_test_data *data = thread_data->data;
+	int i;
+
+	if (!opt_disable_rseq && thread_data->reg
+			&& rseq_init_current_thread())
+		abort();
+	for (i = 0; i < thread_data->reps; i++) {
+		struct rseq_state rseq_state;
+		intptr_t *targetptr, newval;
+		int cpu;
+		bool result;
+
+		do_rseq(&rseq_lock, rseq_state, cpu, result, targetptr, newval,
+			{
+				newval = (intptr_t)data->c[cpu].count + 1;
+				targetptr = (intptr_t *)&data->c[cpu].count;
+			});
+
+#ifndef BENCHMARK
+		if (i != 0 && !(i % (thread_data->reps / 10)))
+			printf("tid %d: count %d\n", (int) gettid(), i);
+#endif
+	}
+	printf_nobench("tid %d: number of retry: %d, signals delivered: %u, nr_fallback %u, nr_fallback_wait %u\n",
+		(int) gettid(), nr_retry, signals_delivered,
+		__rseq_thread_state.fallback_cnt,
+		__rseq_thread_state.fallback_wait_cnt);
+	return NULL;
+}
+
+void test_percpu_inc(void)
+{
+	const int num_threads = opt_threads;
+	int i, sum, ret;
+	pthread_t test_threads[num_threads];
+	struct inc_test_data data;
+	struct inc_thread_test_data thread_data[num_threads];
+
+	memset(&data, 0, sizeof(data));
+	for (i = 0; i < num_threads; i++) {
+		thread_data[i].reps = opt_reps;
+		if (opt_disable_mod <= 0 || (i % opt_disable_mod))
+			thread_data[i].reg = 1;
+		else
+			thread_data[i].reg = 0;
+		thread_data[i].data = &data;
+		ret = pthread_create(&test_threads[i], NULL,
+			test_percpu_inc_thread, &thread_data[i]);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	sum = 0;
+	for (i = 0; i < CPU_SETSIZE; i++)
+		sum += data.c[i].count;
+
+	assert(sum == opt_reps * num_threads);
+}
+
+int percpu_list_push(struct percpu_list *list, struct percpu_list_node *node)
+{
+	struct rseq_state rseq_state;
+	intptr_t *targetptr, newval;
+	int cpu;
+	bool result;
+
+	do_rseq(&rseq_lock, rseq_state, cpu, result, targetptr, newval,
+		{
+			newval = (intptr_t)node;
+			targetptr = (intptr_t *)&list->c[cpu].head;
+			node->next = list->c[cpu].head;
+		});
+
+	return cpu;
+}
+
+/*
+ * Unlike a traditional lock-less linked list, the availability of a
+ * rseq primitive allows us to implement pop without concerns over
+ * ABA-type races.
+ */
+struct percpu_list_node *percpu_list_pop(struct percpu_list *list)
+{
+	struct percpu_list_node *head, *next;
+	struct rseq_state rseq_state;
+	intptr_t *targetptr, newval;
+	int cpu;
+	bool result;
+
+	do_rseq(&rseq_lock, rseq_state, cpu, result, targetptr, newval,
+		{
+			head = list->c[cpu].head;
+			if (!head) {
+				result = false;
+			} else {
+				next = head->next;
+				newval = (intptr_t) next;
+				targetptr = (intptr_t *) &list->c[cpu].head;
+			}
+		});
+
+	return head;
+}
+
+void *test_percpu_list_thread(void *arg)
+{
+	int i;
+	struct percpu_list *list = (struct percpu_list *)arg;
+
+	if (rseq_init_current_thread())
+		abort();
+
+	for (i = 0; i < opt_reps; i++) {
+		struct percpu_list_node *node = percpu_list_pop(list);
+
+		if (opt_yield)
+			sched_yield();  /* encourage shuffling */
+		if (node)
+			percpu_list_push(list, node);
+	}
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu linked list from many threads.  */
+void test_percpu_list(void)
+{
+	const int num_threads = opt_threads;
+	int i, j, ret;
+	long sum = 0, expected_sum = 0;
+	struct percpu_list list;
+	pthread_t test_threads[num_threads];
+	cpu_set_t allowed_cpus;
+
+	memset(&list, 0, sizeof(list));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		for (j = 1; j <= 100; j++) {
+			struct percpu_list_node *node;
+
+			expected_sum += j;
+
+			node = malloc(sizeof(*node));
+			assert(node);
+			node->data = j;
+			node->next = list.c[i].head;
+			list.c[i].head = node;
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_create(&test_threads[i], NULL,
+			test_percpu_list_thread, &list);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		cpu_set_t pin_mask;
+		struct percpu_list_node *node;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		CPU_ZERO(&pin_mask);
+		CPU_SET(i, &pin_mask);
+		sched_setaffinity(0, sizeof(pin_mask), &pin_mask);
+
+		while ((node = percpu_list_pop(&list))) {
+			sum += node->data;
+			free(node);
+		}
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+static void test_signal_interrupt_handler(int signo)
+{
+	signals_delivered++;
+}
+
+static int set_signal_handler(void)
+{
+	int ret = 0;
+	struct sigaction sa;
+	sigset_t sigset;
+
+	ret = sigemptyset(&sigset);
+	if (ret < 0) {
+		perror("sigemptyset");
+		return ret;
+	}
+
+	sa.sa_handler = test_signal_interrupt_handler;
+	sa.sa_mask = sigset;
+	sa.sa_flags = 0;
+	ret = sigaction(SIGUSR1, &sa, NULL);
+	if (ret < 0) {
+		perror("sigaction");
+		return ret;
+	}
+
+	printf_nobench("Signal handler set for SIGUSR1\n");
+
+	return ret;
+}
+
+static void show_usage(int argc, char **argv)
+{
+	printf("Usage : %s <OPTIONS>\n",
+		argv[0]);
+	printf("OPTIONS:\n");
+	printf("	[-1 loops] Number of loops for delay injection 1\n");
+	printf("	[-2 loops] Number of loops for delay injection 2\n");
+	printf("	[-3 loops] Number of loops for delay injection 3\n");
+	printf("	[-4 loops] Number of loops for delay injection 4\n");
+	printf("	[-5 loops] Number of loops for delay injection 5 (-1 to enable -m)\n");
+	printf("	[-6 loops] Number of loops for delay injection 6 (-1 to enable -m)\n");
+	printf("	[-7 loops] Number of loops for delay injection 7 (-1 to enable -m)\n");
+	printf("	[-8 loops] Number of loops for delay injection 8 (-1 to enable -m)\n");
+	printf("	[-9 loops] Number of loops for delay injection 9 (-1 to enable -m)\n");
+	printf("	[-m N] Yield/sleep/kill every modulo N (default 0: disabled) (>= 0)\n");
+	printf("	[-y] Yield\n");
+	printf("	[-k] Kill thread with signal\n");
+	printf("	[-s S] S: =0: disabled (default), >0: sleep time (ms)\n");
+	printf("	[-f N] Use fallback every N failure (>= 1)\n");
+	printf("	[-t N] Number of threads (default 200)\n");
+	printf("	[-r N] Number of repetitions per thread (default 5000)\n");
+	printf("	[-d] Disable rseq system call (no initialization)\n");
+	printf("	[-D M] Disable rseq for each M threads\n");
+	printf("	[-T test] Choose test: (s)pinlock, (l)ist, (i)ncrement\n");
+	printf("	[-h] Show this help.\n");
+	printf("\n");
+}
+
+int main(int argc, char **argv)
+{
+	int i;
+
+	if (rseq_init_lock(&rseq_lock)) {
+		perror("rseq_init_lock");
+		return -1;
+	}
+	if (set_signal_handler())
+		goto error;
+	for (i = 1; i < argc; i++) {
+		if (argv[i][0] != '-')
+			continue;
+		switch (argv[i][1]) {
+		case '1':
+		case '2':
+		case '3':
+		case '4':
+		case '5':
+		case '6':
+		case '7':
+		case '8':
+		case '9':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			loop_cnt[argv[i][1] - '0'] = atol(argv[i + 1]);
+			i++;
+			break;
+		case 'm':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_modulo = atol(argv[i + 1]);
+			if (opt_modulo < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 's':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_sleep = atol(argv[i + 1]);
+			if (opt_sleep < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'y':
+			opt_yield = 1;
+			break;
+		case 'k':
+			opt_signal = 1;
+			break;
+		case 'd':
+			opt_disable_rseq = 1;
+			break;
+		case 'D':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_disable_mod = atol(argv[i + 1]);
+			if (opt_disable_mod < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'f':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_fallback_cnt = atol(argv[i + 1]);
+			if (opt_fallback_cnt < 1) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 't':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_threads = atol(argv[i + 1]);
+			if (opt_threads < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'r':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_reps = atol(argv[i + 1]);
+			if (opt_reps < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'h':
+			show_usage(argc, argv);
+			goto end;
+		case 'T':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_test = *argv[i + 1];
+			switch (opt_test) {
+			case 's':
+			case 'l':
+			case 'i':
+				break;
+			default:
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		default:
+			show_usage(argc, argv);
+			goto error;
+		}
+	}
+
+	if (!opt_disable_rseq && rseq_init_current_thread())
+		goto error;
+	switch (opt_test) {
+	case 's':
+		printf_nobench("spinlock\n");
+		test_percpu_spinlock();
+		break;
+	case 'l':
+		printf_nobench("linked list\n");
+		test_percpu_list();
+		break;
+	case 'i':
+		printf_nobench("counter increment\n");
+		test_percpu_inc();
+		break;
+	}
+end:
+	return 0;
+
+error:
+	if (rseq_destroy_lock(&rseq_lock))
+		perror("rseq_destroy_lock");
+	return -1;
+}
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
new file mode 100644
index 0000000..f411be2
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -0,0 +1,200 @@
+#define _GNU_SOURCE
+#include <errno.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <syscall.h>
+#include <assert.h>
+#include <signal.h>
+#include <linux/membarrier.h>
+
+#include "rseq.h"
+
+#ifdef __NR_membarrier
+# define membarrier(...)		syscall(__NR_membarrier, __VA_ARGS__)
+#else
+# define membarrier(...)		-ENOSYS
+#endif
+
+__thread volatile struct rseq_thread_state __rseq_thread_state = {
+	.abi.u.e.cpu_id = -1,
+};
+
+int rseq_has_sys_membarrier;
+
+static int sys_rseq(volatile struct rseq *rseq_abi, int flags)
+{
+	return syscall(__NR_rseq, rseq_abi, flags);
+}
+
+int rseq_init_current_thread(void)
+{
+	int rc;
+
+	rc = sys_rseq(&__rseq_thread_state.abi, 0);
+	if (rc) {
+		fprintf(stderr, "Error: sys_rseq(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		return -1;
+	}
+	assert(rseq_current_cpu() >= 0);
+	return 0;
+}
+
+int rseq_init_lock(struct rseq_lock *rlock)
+{
+	int ret;
+
+	ret = pthread_mutex_init(&rlock->lock, NULL);
+	if (ret) {
+		errno = ret;
+		return -1;
+	}
+	rlock->state = RSEQ_LOCK_STATE_RESTART;
+	return 0;
+}
+
+int rseq_destroy_lock(struct rseq_lock *rlock)
+{
+	int ret;
+
+	ret = pthread_mutex_destroy(&rlock->lock);
+	if (ret) {
+		errno = ret;
+		return -1;
+	}
+	return 0;
+}
+
+static void signal_off_save(sigset_t *oldset)
+{
+	sigset_t set;
+	int ret;
+
+	sigfillset(&set);
+	ret = pthread_sigmask(SIG_BLOCK, &set, oldset);
+	if (ret)
+		abort();
+}
+
+static void signal_restore(sigset_t oldset)
+{
+	int ret;
+
+	ret = pthread_sigmask(SIG_SETMASK, &oldset, NULL);
+	if (ret)
+		abort();
+}
+
+static void rseq_fallback_lock(struct rseq_lock *rlock)
+{
+	signal_off_save((sigset_t *)&__rseq_thread_state.sigmask_saved);
+	pthread_mutex_lock(&rlock->lock);
+	__rseq_thread_state.fallback_cnt++;
+	/*
+	 * For concurrent threads arriving before we set LOCK:
+	 * reading cpu_id after setting the state to LOCK
+	 * ensures they restart.
+	 */
+	ACCESS_ONCE(rlock->state) = RSEQ_LOCK_STATE_LOCK;
+	/*
+	 * For concurrent threads arriving after we set LOCK:
+	 * those will grab the lock, so we are protected by
+	 * mutual exclusion.
+	 */
+}
+
+void rseq_fallback_wait(struct rseq_lock *rlock)
+{
+	signal_off_save((sigset_t *)&__rseq_thread_state.sigmask_saved);
+	pthread_mutex_lock(&rlock->lock);
+	__rseq_thread_state.fallback_wait_cnt++;
+	pthread_mutex_unlock(&rlock->lock);
+	signal_restore(__rseq_thread_state.sigmask_saved);
+}
+
+static void rseq_fallback_unlock(struct rseq_lock *rlock, int cpu_at_start)
+{
+	/*
+	 * Concurrent rseq arriving before we set state back to RESTART
+	 * grab the lock. Those arriving after we set state back to
+	 * RESTART will perform restartable critical sections. The next
+	 * owner of the lock will take care of making sure it prevents
+	 * concurrent restartable sequences from completing.  We may be
+	 * writing from another CPU, so update the state with a store
+	 * release semantic to ensure restartable sections will see our
+	 * side effect (writing to *p) before they enter their
+	 * restartable critical section.
+	 *
+	 * In cases where we observe that we are on the right CPU after the
+	 * critical section, program order ensures that following restartable
+	 * critical sections will see our stores, so we don't have to use
+	 * store-release or membarrier.
+	 *
+	 * Use sys_membarrier when available to remove the memory barrier
+	 * implied by smp_load_acquire().
+	 */
+	barrier();
+	if (likely(rseq_current_cpu() == cpu_at_start)) {
+		ACCESS_ONCE(rlock->state) = RSEQ_LOCK_STATE_RESTART;
+	} else {
+		if (!has_fast_acquire_release() && rseq_has_sys_membarrier) {
+			if (membarrier(MEMBARRIER_CMD_SHARED, 0))
+				abort();
+			ACCESS_ONCE(rlock->state) = RSEQ_LOCK_STATE_RESTART;
+		} else {
+			/*
+			 * Store with release semantic to ensure
+			 * restartable sections will see our side effect
+			 * (writing to *p) before they enter their
+			 * restartable critical section. Matches
+			 * smp_load_acquire() in rseq_start().
+			 */
+			smp_store_release(&rlock->state,
+				RSEQ_LOCK_STATE_RESTART);
+		}
+	}
+	pthread_mutex_unlock(&rlock->lock);
+	signal_restore(__rseq_thread_state.sigmask_saved);
+}
+
+int rseq_fallback_current_cpu(void)
+{
+	int cpu;
+
+	cpu = sched_getcpu();
+	if (cpu < 0) {
+		perror("sched_getcpu()");
+		abort();
+	}
+	return cpu;
+}
+
+int rseq_fallback_begin(struct rseq_lock *rlock)
+{
+	rseq_fallback_lock(rlock);
+	return rseq_fallback_current_cpu();
+}
+
+void rseq_fallback_end(struct rseq_lock *rlock, int cpu)
+{
+	rseq_fallback_unlock(rlock, cpu);
+}
+
+/* Handle non-initialized rseq for this thread. */
+void rseq_fallback_noinit(struct rseq_state *rseq_state)
+{
+	rseq_state->lock_state = RSEQ_LOCK_STATE_FAIL;
+	rseq_state->cpu_id = 0;
+}
+
+void __attribute__((constructor)) rseq_init(void)
+{
+	int ret;
+
+	ret = membarrier(MEMBARRIER_CMD_QUERY, 0);
+	if (ret >= 0 && (ret & MEMBARRIER_CMD_SHARED))
+		rseq_has_sys_membarrier = 1;
+}
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
new file mode 100644
index 0000000..791e14c
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -0,0 +1,449 @@
+#ifndef RSEQ_H
+#define RSEQ_H
+
+#include <stdint.h>
+#include <stdbool.h>
+#include <pthread.h>
+#include <signal.h>
+#include <sched.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sched.h>
+#include <linux/rseq.h>
+
+/*
+ * Empty code injection macros, override when testing.
+ * It is important to consider that the ASM injection macros need to be
+ * fully reentrant (e.g. do not modify the stack).
+ */
+#ifndef RSEQ_INJECT_ASM
+#define RSEQ_INJECT_ASM(n)
+#endif
+
+#ifndef RSEQ_INJECT_C
+#define RSEQ_INJECT_C(n)
+#endif
+
+#ifndef RSEQ_INJECT_INPUT
+#define RSEQ_INJECT_INPUT
+#endif
+
+#ifndef RSEQ_INJECT_CLOBBER
+#define RSEQ_INJECT_CLOBBER
+#endif
+
+#ifndef RSEQ_INJECT_FAILED
+#define RSEQ_INJECT_FAILED
+#endif
+
+#ifndef RSEQ_FALLBACK_CNT
+#define RSEQ_FALLBACK_CNT	3
+#endif
+
+struct rseq_thread_state {
+	struct rseq abi;	/* Kernel ABI. */
+	uint32_t fallback_wait_cnt;
+	uint32_t fallback_cnt;
+	sigset_t sigmask_saved;
+};
+
+extern __thread volatile struct rseq_thread_state __rseq_thread_state;
+extern int rseq_has_sys_membarrier;
+
+#define likely(x)		__builtin_expect(!!(x), 1)
+#define unlikely(x)		__builtin_expect(!!(x), 0)
+#define barrier()		__asm__ __volatile__("" : : : "memory")
+
+#define ACCESS_ONCE(x)		(*(__volatile__  __typeof__(x) *)&(x))
+#define WRITE_ONCE(x, v)	__extension__ ({ ACCESS_ONCE(x) = (v); })
+#define READ_ONCE(x)		ACCESS_ONCE(x)
+
+#ifdef __x86_64__
+
+#define smp_mb()	__asm__ __volatile__ ("mfence" : : : "memory")
+#define smp_rmb()	barrier()
+#define smp_wmb()	barrier()
+
+#define smp_load_acquire(p)						\
+__extension__ ({							\
+	__typeof(*p) ____p1 = READ_ONCE(*p);				\
+	barrier();							\
+	____p1;								\
+})
+
+#define smp_acquire__after_ctrl_dep()	smp_rmb()
+
+#define smp_store_release(p, v)						\
+do {									\
+	barrier();							\
+	WRITE_ONCE(*p, v);						\
+} while (0)
+
+#define has_fast_acquire_release()	1
+#define has_single_copy_load_64()	1
+
+#elif __i386__
+
+/*
+ * Support older 32-bit architectures that do not implement fence
+ * instructions.
+ */
+#define smp_mb()	\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+#define smp_rmb()	\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+#define smp_wmb()	\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+
+#define smp_load_acquire(p)						\
+__extension__ ({							\
+	__typeof(*p) ____p1 = READ_ONCE(*p);				\
+	smp_mb();							\
+	____p1;								\
+})
+
+#define smp_acquire__after_ctrl_dep()	smp_rmb()
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	WRITE_ONCE(*p, v);						\
+} while (0)
+
+#define has_fast_acquire_release()	0
+#define has_single_copy_load_64()	0
+
+#elif defined(__ARMEL__)
+
+#define smp_mb()	__asm__ __volatile__ ("dmb" : : : "memory")
+#define smp_rmb()	__asm__ __volatile__ ("dmb" : : : "memory")
+#define smp_wmb()	__asm__ __volatile__ ("dmb" : : : "memory")
+
+#define smp_load_acquire(p)						\
+__extension__ ({							\
+	__typeof(*p) ____p1 = READ_ONCE(*p);				\
+	smp_mb();							\
+	____p1;								\
+})
+
+#define smp_acquire__after_ctrl_dep()	smp_rmb()
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	WRITE_ONCE(*p, v);						\
+} while (0)
+
+#define has_fast_acquire_release()	0
+#define has_single_copy_load_64()	1
+
+#else
+#error unsupported target
+#endif
+
+enum rseq_lock_state {
+	RSEQ_LOCK_STATE_RESTART = 0,
+	RSEQ_LOCK_STATE_LOCK = 1,
+	RSEQ_LOCK_STATE_FAIL = 2,
+};
+
+struct rseq_lock {
+	pthread_mutex_t lock;
+	int32_t state;		/* enum rseq_lock_state */
+};
+
+/* State returned by rseq_start, passed as argument to rseq_finish. */
+struct rseq_state {
+	volatile struct rseq_thread_state *rseqp;
+	int32_t cpu_id;		/* cpu_id at start. */
+	uint32_t event_counter;	/* event_counter at start. */
+	int32_t lock_state;	/* Lock state at start. */
+};
+
+/*
+ * Initialize rseq for the current thread.  Must be called once by any
+ * thread which uses restartable sequences, before they start using
+ * restartable sequences. If initialization is not invoked, or if it
+ * fails, the restartable critical sections will fall-back on locking
+ * (rseq_lock).
+ */
+int rseq_init_current_thread(void);
+
+/*
+ * The fallback lock should be initialized before being used by any
+ * thread, and destroyed after all threads are done using it. This lock
+ * should be used by all rseq calls associated with shared data, either
+ * between threads, or between processes in a shared memory.
+ *
+ * There may be many rseq_lock per process, e.g. one per protected data
+ * structure.
+ */
+int rseq_init_lock(struct rseq_lock *rlock);
+int rseq_destroy_lock(struct rseq_lock *rlock);
+
+/*
+ * Restartable sequence fallback prototypes. Fallback on locking when
+ * rseq is not initialized, not available on the system, or during
+ * single-stepping to ensure forward progress.
+ */
+int rseq_fallback_begin(struct rseq_lock *rlock);
+void rseq_fallback_end(struct rseq_lock *rlock, int cpu);
+void rseq_fallback_wait(struct rseq_lock *rlock);
+void rseq_fallback_noinit(struct rseq_state *rseq_state);
+
+/*
+ * Restartable sequence fallback for reading the current CPU number.
+ */
+int rseq_fallback_current_cpu(void);
+
+static inline int32_t rseq_cpu_at_start(struct rseq_state start_value)
+{
+	return start_value.cpu_id;
+}
+
+static inline int32_t rseq_current_cpu_raw(void)
+{
+	return ACCESS_ONCE(__rseq_thread_state.abi.u.e.cpu_id);
+}
+
+static inline int32_t rseq_current_cpu(void)
+{
+	int32_t cpu;
+
+	cpu = rseq_current_cpu_raw();
+	if (unlikely(cpu < 0))
+		cpu = rseq_fallback_current_cpu();
+	return cpu;
+}
+
+static inline __attribute__((always_inline))
+struct rseq_state rseq_start(struct rseq_lock *rlock)
+{
+	struct rseq_state result;
+
+	result.rseqp = &__rseq_thread_state;
+	if (has_single_copy_load_64()) {
+		union {
+			struct {
+				uint32_t cpu_id;
+				uint32_t event_counter;
+			} e;
+			uint64_t v;
+		} u;
+
+		u.v = ACCESS_ONCE(result.rseqp->abi.u.v);
+		result.event_counter = u.e.event_counter;
+		result.cpu_id = u.e.cpu_id;
+	} else {
+		result.event_counter =
+			ACCESS_ONCE(result.rseqp->abi.u.e.event_counter);
+		/* load event_counter before cpu_id. */
+		RSEQ_INJECT_C(5)
+		result.cpu_id = ACCESS_ONCE(result.rseqp->abi.u.e.cpu_id);
+	}
+	/*
+	 * Read event counter before lock state and cpu_id. This ensures
+	 * that when the state changes from RESTART to LOCK, if we have
+	 * some threads that have already seen the RESTART still in
+	 * flight, they will necessarily be preempted/signalled before a
+	 * thread can see the LOCK state for that same CPU. That
+	 * preemption/signalling will cause them to restart, so they
+	 * don't interfere with the lock.
+	 */
+	RSEQ_INJECT_C(6)
+
+	if (!has_fast_acquire_release() && likely(rseq_has_sys_membarrier)) {
+		result.lock_state = ACCESS_ONCE(rlock->state);
+		barrier();
+	} else {
+		/*
+		 * Load lock state with acquire semantic. Matches
+		 * smp_store_release() in rseq_fallback_end().
+		 */
+		result.lock_state = smp_load_acquire(&rlock->state);
+	}
+	if (unlikely(result.cpu_id < 0))
+		rseq_fallback_noinit(&result);
+	/*
+	 * We need to ensure that the compiler does not re-order the
+	 * loads of any protected values before we read the current
+	 * state.
+	 */
+	barrier();
+	return result;
+}
+
+static inline __attribute__((always_inline))
+bool rseq_finish(struct rseq_lock *rlock,
+		intptr_t *p, intptr_t to_write,
+		struct rseq_state start_value)
+{
+	RSEQ_INJECT_C(9)
+
+	if (unlikely(start_value.lock_state != RSEQ_LOCK_STATE_RESTART)) {
+		if (start_value.lock_state == RSEQ_LOCK_STATE_LOCK)
+			rseq_fallback_wait(rlock);
+		return false;
+	}
+
+#ifdef __x86_64__
+	/*
+	 * The __rseq_table section can be used by debuggers to better
+	 * handle single-stepping through the restartable critical
+	 * sections.
+	 */
+	__asm__ __volatile__ goto (
+		".pushsection __rseq_table, \"aw\"\n\t"
+		".balign 8\n\t"
+		"4:\n\t"
+		".quad 1f, 2f, 3f\n\t"
+		".popsection\n\t"
+		"1:\n\t"
+		RSEQ_INJECT_ASM(1)
+		"movq $4b, (%[rseq_cs])\n\t"
+		RSEQ_INJECT_ASM(2)
+		"cmpl %[start_event_counter], %[current_event_counter]\n\t"
+		"jnz 3f\n\t"
+		RSEQ_INJECT_ASM(3)
+		"movq %[to_write], (%[target])\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(4)
+		"movq $0, (%[rseq_cs])\n\t"
+		"jmp %l[succeed]\n\t"
+		"3: movq $0, (%[rseq_cs])\n\t"
+		: /* no outputs */
+		: [start_event_counter]"r"(start_value.event_counter),
+		  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
+		  [to_write]"r"(to_write),
+		  [target]"r"(p),
+		  [rseq_cs]"r"(&start_value.rseqp->abi.rseq_cs)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc"
+		  RSEQ_INJECT_CLOBBER
+		: succeed
+	);
+#elif defined(__i386__)
+	/*
+	 * The __rseq_table section can be used by debuggers to better
+	 * handle single-stepping through the restartable critical
+	 * sections.
+	 */
+	__asm__ __volatile__ goto (
+		".pushsection __rseq_table, \"aw\"\n\t"
+		".balign 8\n\t"
+		"4:\n\t"
+		".long 1f, 0x0, 2f, 0x0, 3f, 0x0\n\t"
+		".popsection\n\t"
+		"1:\n\t"
+		RSEQ_INJECT_ASM(1)
+		"movl $4b, (%[rseq_cs])\n\t"
+		RSEQ_INJECT_ASM(2)
+		"cmpl %[start_event_counter], %[current_event_counter]\n\t"
+		"jnz 3f\n\t"
+		RSEQ_INJECT_ASM(3)
+		"movl %[to_write], (%[target])\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(4)
+		"movl $0, (%[rseq_cs])\n\t"
+		"jmp %l[succeed]\n\t"
+		"3: movl $0, (%[rseq_cs])\n\t"
+		: /* no outputs */
+		: [start_event_counter]"r"(start_value.event_counter),
+		  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
+		  [to_write]"r"(to_write),
+		  [target]"r"(p),
+		  [rseq_cs]"r"(&start_value.rseqp->abi.rseq_cs)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc"
+		  RSEQ_INJECT_CLOBBER
+		: succeed
+	);
+#elif defined(__ARMEL__)
+	{
+		/*
+		 * The __rseq_table section can be used by debuggers to better
+		 * handle single-stepping through the restartable critical
+		 * sections.
+		 */
+		__asm__ __volatile__ goto (
+			".pushsection __rseq_table, \"aw\"\n\t"
+			".balign 8\n\t"
+			".word 1f, 0x0, 2f, 0x0, 3f, 0x0\n\t"
+			".popsection\n\t"
+			"1:\n\t"
+			RSEQ_INJECT_ASM(1)
+			"adr r0, 4f\n\t"
+			"str r0, [%[rseq_cs]]\n\t"
+			RSEQ_INJECT_ASM(2)
+			"ldr r0, %[current_event_counter]\n\t"
+			"mov r1, #0\n\t"
+			"cmp %[start_event_counter], r0\n\t"
+			"bne 3f\n\t"
+			RSEQ_INJECT_ASM(3)
+			"str %[to_write], [%[target]]\n\t"
+			"2:\n\t"
+			RSEQ_INJECT_ASM(4)
+			"str r1, [%[rseq_cs]]\n\t"
+			"b %l[succeed]\n\t"
+			".balign 8\n\t"
+			"4:\n\t"
+			".word 1b, 0x0, 2b, 0x0, 3f, 0x0\n\t"
+			"3:\n\t"
+			"mov r1, #0\n\t"
+			"str r1, [%[rseq_cs]]\n\t"
+			: /* no outputs */
+			: [start_event_counter]"r"(start_value.event_counter),
+			  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
+			  [to_write]"r"(to_write),
+			  [rseq_cs]"r"(&start_value.rseqp->abi.rseq_cs),
+			  [target]"r"(p)
+			  RSEQ_INJECT_INPUT
+			: "r0", "r1", "memory", "cc"
+			  RSEQ_INJECT_CLOBBER
+			: succeed
+		);
+	}
+#else
+#error unsupported target
+#endif
+	RSEQ_INJECT_FAILED
+	return false;
+succeed:
+	return true;
+}
+
+/*
+ * Helper macro doing two restartable critical section attempts, and if
+ * they fail, fallback on locking.
+ */
+#define do_rseq(_lock, _rseq_state, _cpu, _result, _targetptr, _newval, \
+		_code)							\
+	do {								\
+		_rseq_state = rseq_start(_lock);			\
+		_cpu = rseq_cpu_at_start(_rseq_state);			\
+		_result = true;						\
+		_code							\
+		if (unlikely(!_result))					\
+			break;						\
+		if (likely(rseq_finish(_lock, _targetptr, _newval,	\
+				_rseq_state)))				\
+			break;						\
+		_rseq_state = rseq_start(_lock);			\
+		_cpu = rseq_cpu_at_start(_rseq_state);			\
+		_result = true;						\
+		_code							\
+		if (unlikely(!_result))					\
+			break;						\
+		if (likely(rseq_finish(_lock, _targetptr, _newval,	\
+				_rseq_state)))				\
+			break;						\
+		_cpu = rseq_fallback_begin(_lock);			\
+		_result = true;						\
+		_code							\
+		if (likely(_result))					\
+			*(_targetptr) = (_newval);			\
+		rseq_fallback_end(_lock, _cpu);				\
+	} while (0)
+
+#endif  /* RSEQ_H */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
       [not found]   ` <CO1PR15MB09822FC140F84DCEEF2004CDDD0B0@CO1PR15MB0982.namprd15.prod.outlook.com>
@ 2016-07-24  3:09     ` Mathieu Desnoyers
  2016-07-24 18:01       ` Dave Watson
  2016-07-25 18:12     ` Mathieu Desnoyers
  1 sibling, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-07-24  3:09 UTC (permalink / raw)
  To: Dave Watson
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Jul 23, 2016, at 5:26 PM, Dave Watson davejwatson@fb.com wrote:

> Hi Mathieu,

> > Implements two basic tests of RSEQ functionality, and one more
> > exhaustive parameterizable test.

> Thanks for beefing up the tests. I ran this set through our jemalloc
> tests using rseq, and everything looks good so far.

> +static inline __attribute__((always_inline))
> +bool rseq_finish(struct rseq_lock *rlock,
> + intptr_t *p, intptr_t to_write,
> + struct rseq_state start_value)
> +{
> + RSEQ_INJECT_C(9)
> +
> + if (unlikely(start_value.lock_state != RSEQ_LOCK_STATE_RESTART)) {
> + if (start_value.lock_state == RSEQ_LOCK_STATE_LOCK)
> + rseq_fallback_wait(rlock);
> + return false;
> + }
> +
> +#ifdef __x86_64__
> + /*
> + * The __rseq_table section can be used by debuggers to better
> + * handle single-stepping through the restartable critical
> + * sections.
> + */
> + __asm__ __volatile__ goto (
> + ".pushsection __rseq_table, \"aw\"\n\t"
> + ".balign 8\n\t"
> + "4:\n\t"
> + ".quad 1f, 2f, 3f\n\t"
> + ".popsection\n\t"

> Is there a reason we're also passing the start ip? It looks unused.
> I see the "for debuggers" comment, but it looks like all the debugger
> support is done in userspace.

> + "1:\n\t"
> + RSEQ_INJECT_ASM(1)
> + "movq $4b, (%[rseq_cs])\n\t"
> + RSEQ_INJECT_ASM(2)
> + "cmpl %[start_event_counter], %[current_event_counter]\n\t"
> + "jnz 3f\n\t"
> + RSEQ_INJECT_ASM(3)
> + "movq %[to_write], (%[target])\n\t"
> + "2:\n\t"
> + RSEQ_INJECT_ASM(4)
> + "movq $0, (%[rseq_cs])\n\t"
> + "jmp %l[succeed]\n\t"
> + "3: movq $0, (%[rseq_cs])\n\t"
> + : /* no outputs */
> + : [start_event_counter]"r"(start_value.event_counter),
> + [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
> + [to_write]"r"(to_write),
> + [target]"r"(p),
> + [rseq_cs]"r"(&start_value.rseqp->abi.rseq_cs)
> + RSEQ_INJECT_INPUT
> + : "memory", "cc"
> + RSEQ_INJECT_CLOBBER
> + : succeed
> + );

> This ABI looks like it will work fine for our use case. I don't think it
> has been mentioned yet, but we may still need multiple asm blocks
> for differing numbers of writes. For example, an array-based freelist push:

> void push(void *obj) {
> if (index < maxlen) {
> freelist[index++] = obj;
> }
> }

> would be more efficiently implemented with a two-write rseq_finish:

> rseq_finish2(&freelist[index], obj, // first write
> &index, index + 1, // second write
> ...);

> where it is ok to abort between the two writes, but both need to happen
> on the same cpu.

(re-send without html formatting for the mailing lists)

Would pairing one rseq_start with two rseq_finish do the trick
there ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-07-24  3:09     ` Mathieu Desnoyers
@ 2016-07-24 18:01       ` Dave Watson
  2016-07-25 16:43         ` Mathieu Desnoyers
  2016-08-11 23:26         ` Mathieu Desnoyers
  0 siblings, 2 replies; 82+ messages in thread
From: Dave Watson @ 2016-07-24 18:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

>> +static inline __attribute__((always_inline))
>> +bool rseq_finish(struct rseq_lock *rlock,
>> + intptr_t *p, intptr_t to_write,
>> + struct rseq_state start_value)

>> This ABI looks like it will work fine for our use case. I don't think it
>> has been mentioned yet, but we may still need multiple asm blocks
>> for differing numbers of writes. For example, an array-based freelist push:

>> void push(void *obj) {
>> if (index < maxlen) {
>> freelist[index++] = obj;
>> }
>> }

>> would be more efficiently implemented with a two-write rseq_finish:

>> rseq_finish2(&freelist[index], obj, // first write
>> &index, index + 1, // second write
>> ...);

> Would pairing one rseq_start with two rseq_finish do the trick
> there ?

Yes, two rseq_finish works, as long as the extra rseq management overhead
is not substantial.  
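
For concreteness, here is a rough sketch of what the paired attempts
could look like on top of the selftest API from this patch, assuming a
rseq_lock initialized as in the tests (the percpu_freelist layout and
field names are made up for illustration, and the locking fallback that
do_rseq() provides under single-stepping is omitted, so this sketch
does not guarantee forward progress by itself):

	struct percpu_freelist_entry {
		intptr_t slots[128];
		intptr_t index;
	} __attribute__((aligned(128)));

	struct percpu_freelist {
		struct percpu_freelist_entry c[CPU_SETSIZE];
	};

	static void push(struct percpu_freelist *fl, void *obj)
	{
		struct rseq_state rs;
		intptr_t idx;
		int cpu;

	retry:
		/* First commit: store the object into the next free slot. */
		rs = rseq_start(&rseq_lock);
		cpu = rseq_cpu_at_start(rs);
		idx = fl->c[cpu].index;
		if (idx >= 128)
			return;		/* per-cpu freelist is full */
		if (!rseq_finish(&rseq_lock, &fl->c[cpu].slots[idx],
				(intptr_t)obj, rs))
			goto retry;
		/*
		 * Second commit: publish the new index, re-validating on
		 * the same cpu that no other push/pop slipped in between
		 * the two commits.
		 */
		rs = rseq_start(&rseq_lock);
		if (rseq_cpu_at_start(rs) != cpu
				|| fl->c[cpu].index != idx
				|| fl->c[cpu].slots[idx] != (intptr_t)obj)
			goto retry;
		if (!rseq_finish(&rseq_lock, &fl->c[cpu].index, idx + 1, rs))
			goto retry;
	}

An abort between the two commits only leaves a stale value in a slot
beyond the published index, which the next push on that cpu overwrites,
whereas a single rseq_finish2() would fold both writes into one
critical section and avoid the re-validation.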

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-07-24 18:01       ` Dave Watson
@ 2016-07-25 16:43         ` Mathieu Desnoyers
  2016-08-11 23:26         ` Mathieu Desnoyers
  1 sibling, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-07-25 16:43 UTC (permalink / raw)
  To: Dave Watson
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Jul 24, 2016, at 2:01 PM, Dave Watson davejwatson@fb.com wrote:

>>> +static inline __attribute__((always_inline))
>>> +bool rseq_finish(struct rseq_lock *rlock,
>>> + intptr_t *p, intptr_t to_write,
>>> + struct rseq_state start_value)
> 
>>> This ABI looks like it will work fine for our use case. I don't think it
>>> has been mentioned yet, but we may still need multiple asm blocks
>>> for differing numbers of writes. For example, an array-based freelist push:
> 
>>> void push(void *obj) {
>>> if (index < maxlen) {
>>> freelist[index++] = obj;
>>> }
>>> }
> 
>>> would be more efficiently implemented with a two-write rseq_finish:
> 
>>> rseq_finish2(&freelist[index], obj, // first write
>>> &index, index + 1, // second write
>>> ...);
> 
>> Would pairing one rseq_start with two rseq_finish do the trick
>> there ?
> 
> Yes, two rseq_finish works, as long as the extra rseq management overhead
> is not substantial.

The difference is actually not negligible. On x86-64
Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
(counter increment benchmark (single-thread))

* Single store per increment:                                  3.6 ns
* Two rseq_finish() per increment:                             5.2 ns
* rseq_finish2() with two mov instructions per rseq_finish2(): 4.0 ns

And I expect the difference to be even larger on non-x86 architectures.

I'll try to figure out a way to do rseq_finish() and rseq_finish2()
without duplicating the code. Perhaps macros will be helpful there.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
       [not found]   ` <CO1PR15MB09822FC140F84DCEEF2004CDDD0B0@CO1PR15MB0982.namprd15.prod.outlook.com>
  2016-07-24  3:09     ` Mathieu Desnoyers
@ 2016-07-25 18:12     ` Mathieu Desnoyers
  1 sibling, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-07-25 18:12 UTC (permalink / raw)
  To: Dave Watson
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Jul 23, 2016, at 5:26 PM, Dave Watson davejwatson@fb.com wrote:

[...]

> +static inline __attribute__((always_inline))
> +bool rseq_finish(struct rseq_lock *rlock,
> + intptr_t *p, intptr_t to_write,
> + struct rseq_state start_value)
> +{
> + RSEQ_INJECT_C(9)
> +
> + if (unlikely(start_value.lock_state != RSEQ_LOCK_STATE_RESTART)) {
> + if (start_value.lock_state == RSEQ_LOCK_STATE_LOCK)
> + rseq_fallback_wait(rlock);
> + return false;
> + }
> +
> +#ifdef __x86_64__
> + /*
> + * The __rseq_table section can be used by debuggers to better
> + * handle single-stepping through the restartable critical
> + * sections.
> + */
> + __asm__ __volatile__ goto (
> + ".pushsection __rseq_table, \"aw\"\n\t"
> + ".balign 8\n\t"
> + "4:\n\t"
> + ".quad 1f, 2f, 3f\n\t"
> + ".popsection\n\t"

> Is there a reason we're also passing the start ip? It looks unused.
> I see the "for debuggers" comment, but it looks like all the debugger
> support is done in userspace.

I notice I did not answer this question. This __rseq_table section is
populated by struct rseq_cs elements. This has two uses:

1) Interface with the kernel: only fields "post_commit_ip" and "abort_ip"
   are currently used by the kernel. This is a "critical section descriptor".
   User-space stores a pointer to this descriptor in the struct rseq
   "rseq_cs" field to tell the kernel that it needs to handle the
   rseq assembly block critical section, and it is set back to 0 when
   exiting the critical section.

2) Interface for debuggers: all three fields are used: "start_ip",
   "post_commit_ip", and "abort_ip". When a debugger single-steps
   through the rseq assembly block by placing breakpoints at the
   following instruction (I observed this behavior with gdb on arm32),
   it needs to be made aware that, when single-stepping instructions
   between "start_ip" (included) and "post_commit_ip" (excluded), it
   should also place a breakpoint at "abort_ip", or single-stepping
   would be fooled by an abort.

On 32-bit and 64-bit x86, we can combine the structures for (1) and (2)
and only keep one structure for both. The assembly fast-path can therefore
use the address within the __rseq_table section as the descriptor pointer.

On 32-bit ARM, as an optimization, we keep two copies of this structure:
one is in the __rseq_table section (for debuggers), and the other is
placed near the instruction pointer, so a cheaper ip-relative "adr"
instruction can be used to calculate the address of the descriptor.

If my understanding is correct, you suggest we do the following instead:

struct rseq_cs {
        RSEQ_FIELD_u32_u64(post_commit_ip);
        RSEQ_FIELD_u32_u64(abort_ip);
};

struct rseq_debug_cs {
        struct rseq_cs rseq_cs;
        RSEQ_FIELD_u32_u64(start_ip);
};

So we put struct rseq_debug_cs elements within the __rseq_table section,
and only expose the struct rseq_cs part to the kernel ABI.

Thinking a bit more about this, I think we should use the start_ip in the
kernel too. The idea here is that the end of critical sections (user-space fast
path) currently looks like this tangled mess (e.g. x86-64):

                "cmpl %[start_event_counter], %[current_event_counter]\n\t"
                "jnz 3f\n\t"                         <--- branch in case of failure
                "movq %[to_write], (%[target])\n\t"  <--- commit instruction
                "2:\n\t"
                "movq $0, (%[rseq_cs])\n\t"
                "jmp %l[succeed]\n\t"                <--- jump over the failure path
                "3: movq $0, (%[rseq_cs])\n\t"

Where we basically need to jump over the failure path at the end of the
successful fast path, all because the failure path needs to be placed
at addresses greater than or equal to the post_commit_ip.

In the kernel, if rather than testing for:

if ((void __user *)instruction_pointer(regs) < post_commit_ip) {

we could test for both start_ip and post_commit_ip:

if ((void __user *)instruction_pointer(regs) < post_commit_ip
    && (void __user *)instruction_pointer(regs) >= start_ip) {

We could perform the failure path (storing NULL into the rseq_cs
field of struct rseq) in C rather than being required to do it in
assembly at addresses >= post_commit_ip, all because the kernel
would test whether we are within the assembly block address range
using both the lower and upper bounds (start_ip and post_commit_ip).
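
Concretely (sketch only), the end of the x86-64 fast path could then
become something like:

                "cmpl %[start_event_counter], %[current_event_counter]\n\t"
                "jnz %l[failure]\n\t"                <--- abort straight to a C label
                "movq %[to_write], (%[target])\n\t"  <--- commit instruction
                "2:\n\t"
                "movq $0, (%[rseq_cs])\n\t"          <--- fall through on success

with the failure path (clearing the rseq_cs field and retrying) written in
plain C at the "failure" label, which would also serve as the abort_ip
target.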

The extra check with start_ip in the kernel is only done in a slow
path (in the notify resume handler), and only if the rseq_cs field
of struct rseq is non-NULL (when the kernel actually preempts or
delivers a signal over a rseq critical section), so it should not
matter at all in terms of kernel performance.

Removing this extra jump from the user-space fast-path might not have
much impact on x86, but I expect it to be more important on
architectures like ARM32, where the architecture is less forgiving
when fed sub-optimal assembly.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-07-21 21:14 ` [RFC PATCH v7 1/7] " Mathieu Desnoyers
@ 2016-07-25 23:02   ` Andy Lutomirski
  2016-07-26  3:02     ` Mathieu Desnoyers
  2016-07-27 15:03   ` Boqun Feng
  2016-08-03 13:19   ` Peter Zijlstra
  2 siblings, 1 reply; 82+ messages in thread
From: Andy Lutomirski @ 2016-07-25 23:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, Linux API, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Thu, Jul 21, 2016 at 2:14 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> Man page associated:
>
> RSEQ(2)                Linux Programmer's Manual               RSEQ(2)
>
> NAME
>        rseq - Restartable sequences and cpu number cache
>
> SYNOPSIS
>        #include <linux/rseq.h>
>
>        int rseq(struct rseq * rseq, int flags);
>
> DESCRIPTION
>        The  rseq()  ABI  accelerates  user-space operations on per-cpu
>        data by defining a shared data structure ABI between each user-
>        space thread and the kernel.
>
>        The  rseq argument is a pointer to the thread-local rseq struc‐
>        ture to be shared between kernel and user-space.  A  NULL  rseq
>        value  can  be used to check whether rseq is registered for the
>        current thread.
>
>        The layout of struct rseq is as follows:
>
>        Structure alignment
>               This structure needs to be aligned on  multiples  of  64
>               bytes.
>
>        Structure size
>               This structure has a fixed size of 128 bytes.
>
>        Fields
>
>            cpu_id
>               Cache  of  the CPU number on which the calling thread is
>               running.
>
>            event_counter
>               Restartable sequences event_counter field.

That's an unhelpful description.

>
>            rseq_cs
>               Restartable sequences rseq_cs field. Points to a  struct
>               rseq_cs.

Why is it a pointer?

>
>        The layout of struct rseq_cs is as follows:
>
>        Structure alignment
>               This  structure  needs  to be aligned on multiples of 64
>               bytes.
>
>        Structure size
>               This structure has a fixed size of 192 bytes.
>
>        Fields
>
>            start_ip
>               Instruction pointer address of the first instruction  of
>               the sequence of consecutive assembly instructions.
>
>            post_commit_ip
>               Instruction  pointer  address after the last instruction
>               of the sequence of consecutive assembly instructions.
>
>            abort_ip
>               Instruction pointer address where to move the  execution
>               flow  in  case  of  abort of the sequence of consecutive
>               assembly instructions.
>
>        The flags argument is currently unused and must be specified as
>        0.
>
>        Typically,  a  library or application will keep the rseq struc‐
>        ture in a thread-local storage variable, or other memory  areas

"variable or other memory area"

>        belonging to each thread. It is recommended to perform volatile
>        reads of the thread-local cache to prevent  the  compiler  from
>        doing  load  tearing.  An  alternative approach is to read each
>        field from inline assembly.

I don't think the man page needs to tell people how to implement
correct atomic loads.

>
>        Each thread is responsible for registering its rseq  structure.
>        Only  one  rseq structure address can be registered per thread.
>        Once set, the rseq address is idempotent for a given thread.

"Idempotent" is a property that applies to an action, and the "rseq
address" is not an action.  I don't know what you're trying to say.

>
>        In a typical usage scenario, the thread  registering  the  rseq
>        structure  will  be  performing  loads  and stores from/to that
>        structure. It is however also allowed to  read  that  structure
>        from  other  threads.   The rseq field updates performed by the
>        kernel provide single-copy atomicity semantics, which guarantee
>        that  other  threads performing single-copy atomic reads of the
>        cpu number cache will always observe a consistent value.

s/single-copy/relaxed atomic/ perhaps?

>
>        Memory registered as rseq structure should never be deallocated
>        before  the  thread which registered it exits: specifically, it
>        should not be freed, and the library containing the  registered
>        thread-local  storage  should  not be dlclose'd. Violating this
>        constraint may cause a SIGSEGV signal to be  delivered  to  the
>        thread.

That's an unfortunate constraint for threads that exit without help.

>
>        Unregistration  of associated rseq structure is implicitly per‐
>        formed when a thread or process exit.

exits.

[...]

Can you please document what this thing does prior to giving an
example of how to use it.

Hmm, here are the docs, sort of:

> diff --git a/kernel/rseq.c b/kernel/rseq.c
> new file mode 100644
> index 0000000..e1c847b
> --- /dev/null
> +++ b/kernel/rseq.c

> +/*
> + * Each restartable sequence assembly block defines a "struct rseq_cs"
> + * structure which describes the post_commit_ip address, and the
> + * abort_ip address where the kernel should move the thread instruction
> + * pointer if a rseq critical section assembly block is preempted or if
> + * a signal is delivered on top of a rseq critical section assembly
> + * block. It also contains a start_ip, which is the address of the start
> + * of the rseq assembly block, which is useful to debuggers.
> + *
> + * The algorithm for a restartable sequence assembly block is as
> + * follows:
> + *
> + * rseq_start()
> + *
> + *   0. Userspace loads the current event counter value from the
> + *      event_counter field of the registered struct rseq TLS area,
> + *
> + * rseq_finish()
> + *
> + *   Steps [1]-[3] (inclusive) need to be a sequence of instructions in
> + *   userspace that can handle being moved to the abort_ip between any
> + *   of those instructions.
> + *
> + *   The abort_ip address needs to be equal or above the post_commit_ip.
> + *   Step [4] and the failure code step [F1] need to be at addresses
> + *   equal or above the post_commit_ip.
> + *
> + *   1.  Userspace stores the address of the struct rseq cs rseq

"struct rseq cs rseq" contains a typo.

> + *       assembly block descriptor into the rseq_cs field of the
> + *       registered struct rseq TLS area.
> + *
> + *   2.  Userspace tests to see whether the current event counter values
> + *       match those loaded at [0]. Manually jumping to [F1] in case of
> + *       a mismatch.

Grammar issues here.  More importantly, you said "values", but you
only described one value.

> + *
> + *       Note that if we are preempted or interrupted by a signal
> + *       after [1] and before post_commit_ip, then the kernel also
> + *       performs the comparison performed in [2], and conditionally
> + *       clears rseq_cs, then jumps us to abort_ip.

This is the first I've heard of rseq_cs being something that gets
changed as a result of using this facility.  What code sets it in the
first place?

I think you've also mentioned "preemption" and "migration".  Which do you mean?

> + *
> + *   3.  Userspace critical section final instruction before
> + *       post_commit_ip is the commit. The critical section is
> + *       self-terminating.
> + *       [post_commit_ip]
> + *
> + *   4.  Userspace clears the rseq_cs field of the struct rseq
> + *       TLS area.
> + *
> + *   5.  Return true.
> + *
> + *   On failure at [2]:
> + *

A major issue I have with percpu critical sections or rseqs or
whatever you want to call them is that, every time I talk to someone
about them, there are a different set of requirements that they are
supposed to satisfy.  So:

What problem does this solve?

What are its atomicity properties?  Under what conditions does it
work?  What assumptions does it make?

What real-world operations become faster as a result of rseq (as
opposed to just cpu number queries)?

Why is it important for the kernel to do something special on every preemption?

What "events" does "event_counter" count and why?


If I'm understanding the intent of this code correctly (which is a big
if), I think you're trying to do this:

start a critical section;
compute something;
commit;
if (commit worked)
  return;
else
  try again;

where "commit;" is a single instruction.  The kernel guarantees that
if the thread is preempted (or migrated, perhaps?) between the start
and commit steps then commit will be forced to fail (or be skipped
entirely).  Because I don't understand what you're doing with this
primitive, I can't really tell why you need to detect preemption as
opposed to just migration.

For example: would the following primitive solve the same problem?

begin_dont_migrate_me()

figure out what store to do to take the percpu lock;
do that store;

if (end_dont_migrate_me())
  return;

// oops, the kernel migrated us.  retry.


--Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-07-25 23:02   ` Andy Lutomirski
@ 2016-07-26  3:02     ` Mathieu Desnoyers
  2016-08-03 12:27       ` Peter Zijlstra
  2016-08-03 18:29       ` Christoph Lameter
  0 siblings, 2 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-07-26  3:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Jul 25, 2016, at 7:02 PM, Andy Lutomirski luto@amacapital.net wrote:

> On Thu, Jul 21, 2016 at 2:14 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>> Man page associated:
>>
>> RSEQ(2)                Linux Programmer's Manual               RSEQ(2)
>>
>> NAME
>>        rseq - Restartable sequences and cpu number cache
>>
>> SYNOPSIS
>>        #include <linux/rseq.h>
>>
>>        int rseq(struct rseq * rseq, int flags);
>>
>> DESCRIPTION
>>        The  rseq()  ABI  accelerates  user-space operations on per-cpu
>>        data by defining a shared data structure ABI between each user-
>>        space thread and the kernel.
>>
>>        The  rseq argument is a pointer to the thread-local rseq struc‐
>>        ture to be shared between kernel and user-space.  A  NULL  rseq
>>        value  can  be used to check whether rseq is registered for the
>>        current thread.
>>
>>        The layout of struct rseq is as follows:
>>
>>        Structure alignment
>>               This structure needs to be aligned on  multiples  of  64
>>               bytes.
>>
>>        Structure size
>>               This structure has a fixed size of 128 bytes.
>>
>>        Fields
>>
>>            cpu_id
>>               Cache  of  the CPU number on which the calling thread is
>>               running.
>>
>>            event_counter
>>               Restartable sequences event_counter field.
> 
> That's an unhelpful description.

Good point, how about:

event_counter
   Counter guaranteed to be incremented when the current thread is
   preempted or when a signal is delivered to the current thread.

In that same line of thoughts, I would reword cpu_id as:

cpu_id
   Cache  of  the CPU number on which the current thread is
   running.

> 
>>
>>            rseq_cs
>>               Restartable sequences rseq_cs field. Points to a  struct
>>               rseq_cs.
> 
> Why is it a pointer?

Rewording like this should help understand:

rseq_cs
   The rseq_cs field is a pointer to a struct rseq_cs. It is NULL when
   no rseq assembly block critical section is active for the current
   thread. Setting it to point to a critical section descriptor (struct
   rseq_cs) marks the beginning of the critical section. It is cleared
   after the end of the critical section.


> 
>>
>>        The layout of struct rseq_cs is as follows:
>>
>>        Structure alignment
>>               This  structure  needs  to be aligned on multiples of 64
>>               bytes.
>>
>>        Structure size
>>               This structure has a fixed size of 192 bytes.
>>
>>        Fields
>>
>>            start_ip
>>               Instruction pointer address of the first instruction  of
>>               the sequence of consecutive assembly instructions.
>>
>>            post_commit_ip
>>               Instruction  pointer  address after the last instruction
>>               of the sequence of consecutive assembly instructions.
>>
>>            abort_ip
>>               Instruction pointer address where to move the  execution
>>               flow  in  case  of  abort of the sequence of consecutive
>>               assembly instructions.
>>
>>        The flags argument is currently unused and must be specified as
>>        0.
>>
>>        Typically,  a  library or application will keep the rseq struc‐
>>        ture in a thread-local storage variable, or other memory  areas
> 
> "variable or other memory area"

ok

> 
>>        belonging to each thread. It is recommended to perform volatile
>>        reads of the thread-local cache to prevent  the  compiler  from
>>        doing  load  tearing.  An  alternative approach is to read each
>>        field from inline assembly.
> 
> I don't think the man page needs to tell people how to implement
> correct atomic loads.

ok, I can remove the two previous sentences.

> 
>>
>>        Each thread is responsible for registering its rseq  structure.
>>        Only  one  rseq structure address can be registered per thread.
>>        Once set, the rseq address is idempotent for a given thread.
> 
> "Idempotent" is a property that applies to an action, and the "rseq
> address" is not an action.  I don't know what you're trying to say.

I mean there is only one address registered per thread, and it stays
registered for the life-time of the thread. Perhaps I could say:

  "Once set, the rseq address never changes for a given thread."

> 
>>
>>        In a typical usage scenario, the thread  registering  the  rseq
>>        structure  will  be  performing  loads  and stores from/to that
>>        structure. It is however also allowed to  read  that  structure
>>        from  other  threads.   The rseq field updates performed by the
>>        kernel provide single-copy atomicity semantics, which guarantee
>>        that  other  threads performing single-copy atomic reads of the
>>        cpu number cache will always observe a consistent value.
> 
> s/single-copy/relaxed atomic/ perhaps?

ok

> 
>>
>>        Memory registered as rseq structure should never be deallocated
>>        before  the  thread which registered it exits: specifically, it
>>        should not be freed, and the library containing the  registered
>>        thread-local  storage  should  not be dlclose'd. Violating this
>>        constraint may cause a SIGSEGV signal to be  delivered  to  the
>>        thread.
> 
> That's an unfortunate constraint for threads that exit without help.

I don't understand what you are pointing at here. I see this mostly as
a constraint on the life-time of the library that holds the struct rseq
TLS more than a constraint on the thread life-time.

> 
>>
>>        Unregistration  of associated rseq structure is implicitly per‐
>>        formed when a thread or process exit.
> 
> exits.

ok

> 
> [...]
> 
> Can you please document what this thing does prior to giving an
> example of how to use it.

Good point, will do. (more comments on what can be added as documentation
below)

> 
> Hmm, here are the docs, sort of:
> 
>> diff --git a/kernel/rseq.c b/kernel/rseq.c
>> new file mode 100644
>> index 0000000..e1c847b
>> --- /dev/null
>> +++ b/kernel/rseq.c
> 
>> +/*
>> + * Each restartable sequence assembly block defines a "struct rseq_cs"
>> + * structure which describes the post_commit_ip address, and the
>> + * abort_ip address where the kernel should move the thread instruction
>> + * pointer if a rseq critical section assembly block is preempted or if
>> + * a signal is delivered on top of a rseq critical section assembly
>> + * block. It also contains a start_ip, which is the address of the start
>> + * of the rseq assembly block, which is useful to debuggers.
>> + *
>> + * The algorithm for a restartable sequence assembly block is as
>> + * follows:
>> + *
>> + * rseq_start()
>> + *
>> + *   0. Userspace loads the current event counter value from the
>> + *      event_counter field of the registered struct rseq TLS area,
>> + *
>> + * rseq_finish()
>> + *
>> + *   Steps [1]-[3] (inclusive) need to be a sequence of instructions in
>> + *   userspace that can handle being moved to the abort_ip between any
>> + *   of those instructions.
>> + *
>> + *   The abort_ip address needs to be equal or above the post_commit_ip.
>> + *   Step [4] and the failure code step [F1] need to be at addresses
>> + *   equal or above the post_commit_ip.
>> + *
>> + *   1.  Userspace stores the address of the struct rseq cs rseq
> 
> "struct rseq cs rseq" contains a typo.

should be "struct rseq_cs"

> 
>> + *       assembly block descriptor into the rseq_cs field of the
>> + *       registered struct rseq TLS area.
>> + *
>> + *   2.  Userspace tests to see whether the current event counter values
>> + *       match those loaded at [0]. Manually jumping to [F1] in case of
>> + *       a mismatch.
> 
> Grammar issues here.  More importantly, you said "values", but you
> only described one value.

Indeed, values -> value, and those -> the value

> 
>> + *
>> + *       Note that if we are preempted or interrupted by a signal
>> + *       after [1] and before post_commit_ip, then the kernel also
>> + *       performs the comparison performed in [2], and conditionally
>> + *       clears rseq_cs, then jumps us to abort_ip.
> 
> This is the first I've heard of rseq_cs being something that gets
> changed as a result of using this facility.  What code sets it in the
> first place?

struct rseq_cs (the critical section descriptor) is statically declared,
never changes. What I should clarify above is that the rseq_cs field of
struct rseq gets cleared (not the struct rseq_cs per se).

The rseq_cs field of struct rseq is initially NULL, and is populated with
the struct rseq_cs descriptor address when entering the critical section.
It is set back to NULL right after exiting the critical section, through
both the success and failure paths.
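
In the x86-64 fast path from patch 7/7, this corresponds roughly to the
following stores (annotated sketch, "4:" being the descriptor emitted into
the __rseq_table section):

                "movq $4b, (%[rseq_cs])\n\t"    <--- enter: publish descriptor address
                ...                                  (event counter check + commit)
                "movq $0, (%[rseq_cs])\n\t"     <--- success path: clear
                "3: movq $0, (%[rseq_cs])\n\t"  <--- failure path: clear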

> 
> I think you've also mentioned "preemption" and "migration".  Which do you mean?

We really care about preemption here. Every migration implies a
preemption from a user-space perspective. If we only cared
about keeping the CPU id up-to-date, hooking into migration would be
enough. But since we want atomicity guarantees for restartable
sequences, we need to hook into preemption.

I should update the changelog of patch 1/7 to specify that we really do
hook on preemption, even for the cpu_id update part.

> 
>> + *
>> + *   3.  Userspace critical section final instruction before
>> + *       post_commit_ip is the commit. The critical section is
>> + *       self-terminating.
>> + *       [post_commit_ip]
>> + *
>> + *   4.  Userspace clears the rseq_cs field of the struct rseq
>> + *       TLS area.
>> + *
>> + *   5.  Return true.
>> + *
>> + *   On failure at [2]:
>> + *
> 
> A major issue I have with percpu critical sections or rseqs or
> whatever you want to call them is that, every time I talk to someone
> about them, there are a different set of requirements that they are
> supposed to satisfy.  So:
> 
> What problem does this solve?

It allows user-space to perform update operations on per-cpu data without
requiring heavy-weight atomic operations.

> 
> What are its atomicity properties?  Under what conditions does it
> work?  What assumptions does it make?

Restartable sequences are atomic with respect to preemption (making them
atomic with respect to other threads running on the same CPU), as well
as signal delivery (user-space execution contexts nested over the same
thread).

It is suited for update operations on per-cpu data.

It can be used on data structures shared between threads within a process,
and on data structures shared between threads across different processes.

> 
> What real-world operations become faster as a result of rseq (as
> opposed to just cpu number queries)?

A few examples of operations accelerated:

- incrementing per-cpu counters,
- per-cpu spin-lock,
- per-cpu linked-lists (including memory allocator free-list),
- per-cpu ring buffer,

Perhaps others will have other operations in mind ?

Note that compared to Paul Turner's patchset, I removed the percpu_cmpxchg
and percpu_cmpxchg_check APIs from the test program rseq.h in user-space,
because I found out that it was difficult to guarantee progress with those
APIs. The do_rseq() approach, which does 2 attempts and falls back to
locking, does provide progress guarantees even in the face of (unlikely)
frequent migrations.
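
As a concrete (and simplified) example, a per-cpu counter increment built
on the rseq_start()/rseq_finish() pair from the selftests could look
roughly like the sketch below; the rseq_cpu_at_start() accessor name is
assumed, and a real user would rather go through do_rseq() to get the
locking fallback described above:

#include <stdint.h>
#include <sched.h>		/* CPU_SETSIZE */
#include "rseq.h"		/* selftests rseq.h */

static struct rseq_lock rlock;
static intptr_t count[CPU_SETSIZE];

static void percpu_inc(void)
{
	for (;;) {
		/* Snapshot event counter and cpu number. */
		struct rseq_state start = rseq_start(&rlock);
		int cpu = rseq_cpu_at_start(start);

		/* Commit iff not preempted/signalled since rseq_start(). */
		if (rseq_finish(&rlock, &count[cpu], count[cpu] + 1, start))
			return;
		/* Aborted (or fallback lock held): retry. */
	}
}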

> 
> Why is it important for the kernel to do something special on every preemption?

This is how we can ensure that the entire critical section,
consisting of both the C part and the assembly instruction
sequence, will issue the commit instruction only if executed
atomically with respect to other threads scheduled on the
same CPU.

> 
> What "events" does "event_counter" count and why?

Technically, it increments each time a thread returns to
user-space with the NOTIFY_RESUME thread flag set. We ensure
to set this flag on preemption (out), as well as signal delivery.
So it is guaranteed to increment when either of those events takes
place. It can however increment due to other kernel code setting
TIF_NOTIFY_RESUME before returning to user-space.

It is meant to allow user-space to detect preemption and signal
delivery, not to count the exact number of such events.

> 
> 
> If I'm understanding the intent of this code correctly (which is a big
> if), I think you're trying to do this:
> 
> start a critical section;
> compute something;
> commit;
> if (commit worked)
>  return;
> else
>  try again;
> 
> where "commit;" is a single instruction.  The kernel guarantees that
> if the thread is preempted (or migrated, perhaps?)

A thread needs to have been preempted in order to be migrated, so
from a user-space perspective, detecting preemption is a super-set
of detecting migration. We track preemption and signal delivery here.

> between the start
> and commit steps then commit will be forced to fail (or be skipped
> entirely).  Because I don't understand what you're doing with this
> primitive, I can't really tell why you need to detect preemption as
> opposed to just migration.
> 
> For example: would the following primitive solve the same problem?
> 
> begin_dont_migrate_me()
> 
> figure out what store to do to take the percpu lock;
> do that store;
> 
> if (end_dont_migrate_me())
>  return;
> 
> // oops, the kernel migrated us.  retry.

First, prohibiting migration from user-space has been frowned upon
by scheduler developers for a long time, and I doubt this mindset will
change.

But if we look at it from the point of view of letting user-space
retry when it detects migration (rather than preemption), it would
require that we use an atomic instruction (although without the lock
prefix) as the commit instruction to ensure atomicity with respect
to other threads running on the same CPU. Detecting preemption
instead allows us to use a simple store instruction as the commit.
Simple store instructions (e.g. mov) are faster than atomic
instructions (e.g. xadd, cmpxchg...). Moreover, detecting
migrations and using atomic instructions as the commit is prone to ABA
problems (e.g. the free-list use-case), which are prevented by the restart
on preemption or signal delivery.

Thanks for looking into it!

Mathieu


> 
> 
> --Andy

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-07-21 21:14 ` [RFC PATCH v7 1/7] " Mathieu Desnoyers
  2016-07-25 23:02   ` Andy Lutomirski
@ 2016-07-27 15:03   ` Boqun Feng
  2016-07-27 15:05     ` [RFC 1/4] rseq/param_test: Convert test_data_entry::count to intptr_t Boqun Feng
  2016-07-28  3:10     ` [RFC PATCH v7 1/7] Restartable sequences system call Mathieu Desnoyers
  2016-08-03 13:19   ` Peter Zijlstra
  2 siblings, 2 replies; 82+ messages in thread
From: Boqun Feng @ 2016-07-27 15:03 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, Steven Rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 8429 bytes --]

Hi Mathieu,

On Thu, Jul 21, 2016 at 05:14:16PM -0400, Mathieu Desnoyers wrote:
> Expose a new system call allowing each thread to register one userspace
> memory area to be used as an ABI between kernel and user-space for two
> purposes: user-space restartable sequences and quick access to read the
> current CPU number value from user-space.
> 
> * Restartable sequences (per-cpu atomics)
> 
> The restartable critical sections (percpu atomics) work has been started
> by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
> critical sections. [1] [2] The re-implementation proposed here brings a
> few simplifications to the ABI which facilitates porting to other

Agreed ;-)

> architectures and speeds up the user-space fast path. A locking-based
> fall-back, purely implemented in user-space, is proposed here to deal
> with debugger single-stepping. This fallback interacts with rseq_start()
> and rseq_finish(), which force retries in response to concurrent
> lock-based activity.
> 

So I have enabled this on powerpc, thanks to your nice work to make
things easy for porting ;-)

A patchset will follow in-reply-to this email, which includes patches
enabling this on powerpc and a patch that improves the portability of
the selftests; I don't think the latter needs to be a standalone patch,
so it's OK to merge it into your patch #7.

I did some tests on 64-bit little/big endian pSeries (guest) kernels with
the selftest cases (64-bit LE selftests on a 64-bit LE kernel, 64/32-bit BE
selftests on a 64-bit BE kernel), and things seemingly went well ;-)

Here are some benchmark results I got on a little endian guest with 64
VCPUs:

Benchmarking various approaches for reading the current CPU number:

Power8 PSeries Guest(64 VCPUs, the host has 16 cores, 128 hardware
threads):
							
- Baseline (empty loop):                                   1.56 ns
- Read CPU from rseq cpu_id:                               1.56 ns
- Read CPU from rseq cpu_id (lazy register):               2.08 ns
- glibc 2.23-0ubuntu3 getcpu:                              7.72 ns
- getcpu system call:                                     91.80 ns


Benchmarking various approaches for counter increment:

Power8 PSeries KVM Guest(64 VCPUs, the host has 16 cores, 128 hardware
threads):

                                 Counter increment speed (ns/increment)
                              1 thread   2 threads   4 threads   8 threads   16 threads   32 threads
global increment (baseline)     6.5          N/A         N/A         N/A         N/A           N/A
percpu rseq increment           6.9          6.9         7.2         7.3        15.4          35.5
percpu rseq spinlock           19.0         18.9        19.4        19.4        35.5          71.8
global atomic increment        25.8        111.0       261.0       905.2      2319.5        4170.5 (__sync_add_and_fetch_4)
global atomic CAS              26.2        119.0       341.6      1183.0      3951.3        9312.5 (__sync_val_compare_and_swap_4)
global pthread mutex           40.0        238.1       644.0      2052.2      4272.5        8612.2


I surely need to run more tests for my patches in different
environments, and will try to adjust the patchset according to whatever
change you make(e.g. rseq_finish2) in the future.

(Add PPC maintainers in Cc)

Regards,
Boqun

> Here are benchmarks of counter increment in various scenarios compared
> to restartable sequences:
> 
> ARMv7 Processor rev 4 (v7l)
> Machine model: Cubietruck
> 
>                       Counter increment speed (ns/increment)
>                              1 thread    2 threads
> global increment (baseline)      6           N/A
> percpu rseq increment           50            52
> percpu rseq spinlock            94            94
> global atomic increment         48            74 (__sync_add_and_fetch_4)
> global atomic CAS               50           172 (__sync_val_compare_and_swap_4)
> global pthread mutex           148           862
> 
> ARMv7 Processor rev 10 (v7l)
> Machine model: Wandboard
> 
>                       Counter increment speed (ns/increment)
>                              1 thread    4 threads
> global increment (baseline)      7           N/A
> percpu rseq increment           50            50
> percpu rseq spinlock            82            84
> global atomic increment         44           262 (__sync_add_and_fetch_4)
> global atomic CAS               46           316 (__sync_val_compare_and_swap_4)
> global pthread mutex           146          1400
> 
> x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
> 
>                       Counter increment speed (ns/increment)
>                               1 thread           8 threads
> global increment (baseline)      3.0                N/A
> percpu rseq increment            3.6                3.8
> percpu rseq spinlock             5.6                6.2
> global LOCK; inc                 8.0              166.4
> global LOCK; cmpxchg            13.4              435.2
> global pthread mutex            25.2             1363.6
> 
> * Reading the current CPU number
> 
> Speeding up reading the current CPU number on which the caller thread is
> running is done by keeping the current CPU number up to date within the
> cpu_id field of the memory area registered by the thread. This is done
> by making scheduler migration set the TIF_NOTIFY_RESUME flag on the
> current thread. Upon return to user-space, a notify-resume handler
> updates the current CPU value within the registered user-space memory
> area. User-space can then read the current CPU number directly from
> memory.
> 
> Keeping the current cpu id in a memory area shared between kernel and
> user-space is an improvement over current mechanisms available to read
> the current CPU number, which has the following benefits over
> alternative approaches:
> 
> - 35x speedup on ARM vs system call through glibc
> - 20x speedup on x86 compared to calling glibc, which calls vdso
>   executing a "lsl" instruction,
> - 14x speedup on x86 compared to inlined "lsl" instruction,
> - Unlike vdso approaches, this cpu_id value can be read from an inline
>   assembly, which makes it a useful building block for restartable
>   sequences.
> - The approach of reading the cpu id through memory mapping shared
>   between kernel and user-space is portable (e.g. ARM), which is not the
>   case for the lsl-based x86 vdso.
> 
> On x86, yet another possible approach would be to use the gs segment
> selector to point to user-space per-cpu data. This approach performs
> similarly to the cpu id cache, but it has two disadvantages: it is
> not portable, and it is incompatible with existing applications already
> using the gs segment selector for other purposes.
> 
> Benchmarking various approaches for reading the current CPU number:
> 
> ARMv7 Processor rev 4 (v7l)
> Machine model: Cubietruck
> - Baseline (empty loop):                                    8.4 ns
> - Read CPU from rseq cpu_id:                               16.7 ns
> - Read CPU from rseq cpu_id (lazy register):               19.8 ns
> - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
> - getcpu system call:                                     234.9 ns
> 
> x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
> - Baseline (empty loop):                                    0.8 ns
> - Read CPU from rseq cpu_id:                                0.8 ns
> - Read CPU from rseq cpu_id (lazy register):                0.8 ns
> - Read using gs segment selector:                           0.8 ns
> - "lsl" inline assembly:                                   13.0 ns
> - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
> - getcpu system call:                                      53.9 ns
> 
> - Speed
> 
> Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
> expectations, that enabling CONFIG_RSEQ slightly accelerates the
> scheduler:
> 
> Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
> 2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
> saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
> kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
> restartable sequences series applied.
> 

[snip]

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [RFC 1/4] rseq/param_test: Convert test_data_entry::count to intptr_t
  2016-07-27 15:03   ` Boqun Feng
@ 2016-07-27 15:05     ` Boqun Feng
  2016-07-27 15:05       ` [RFC 2/4] Restartable sequences: powerpc architecture support Boqun Feng
                         ` (3 more replies)
  2016-07-28  3:10     ` [RFC PATCH v7 1/7] Restartable sequences system call Mathieu Desnoyers
  1 sibling, 4 replies; 82+ messages in thread
From: Boqun Feng @ 2016-07-27 15:05 UTC (permalink / raw)
  To: linux-kernel, linux-api
  Cc: Mathieu Desnoyers, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Boqun Feng

The current semantics of do_rseq() is to do an intptr_t type store in
successful cases; however, in test_percpu_{inc,spinlock}, we use
test_data_entry::count as the location to store, whose type is int.

intptr_t and int have different sizes on LP64 systems, and despite the
inconsistency of types, having test_data_entry::count as int needs more
care on endian handling.

To make things simpler and more consistent, convert
test_data_entry::count to type intptr_t, which also makes the coming
tests for ppc64le and ppc64 share the same code.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 tools/testing/selftests/rseq/param_test.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/rseq/param_test.c b/tools/testing/selftests/rseq/param_test.c
index f95fba5a1b2a..db25e0a818e5 100644
--- a/tools/testing/selftests/rseq/param_test.c
+++ b/tools/testing/selftests/rseq/param_test.c
@@ -124,7 +124,7 @@ struct percpu_lock {
 };
 
 struct test_data_entry {
-	int count;
+	intptr_t count;
 } __attribute__((aligned(128)));
 
 struct spinlock_test_data {
@@ -234,7 +234,8 @@ void *test_percpu_spinlock_thread(void *arg)
 void test_percpu_spinlock(void)
 {
 	const int num_threads = opt_threads;
-	int i, sum, ret;
+	int i, ret;
+	intptr_t sum;
 	pthread_t test_threads[num_threads];
 	struct spinlock_test_data data;
 	struct spinlock_thread_test_data thread_data[num_threads];
@@ -308,7 +309,8 @@ void *test_percpu_inc_thread(void *arg)
 void test_percpu_inc(void)
 {
 	const int num_threads = opt_threads;
-	int i, sum, ret;
+	int i, ret;
+	intptr_t sum;
 	pthread_t test_threads[num_threads];
 	struct inc_test_data data;
 	struct inc_thread_test_data thread_data[num_threads];
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC 2/4] Restartable sequences: powerpc architecture support
  2016-07-27 15:05     ` [RFC 1/4] rseq/param_test: Convert test_data_entry::count to intptr_t Boqun Feng
@ 2016-07-27 15:05       ` Boqun Feng
  2016-07-28  3:13         ` Mathieu Desnoyers
  2016-07-27 15:05       ` [RFC 3/4] Restartable sequences: Wire up powerpc system call Boqun Feng
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 82+ messages in thread
From: Boqun Feng @ 2016-07-27 15:05 UTC (permalink / raw)
  To: linux-kernel, linux-api
  Cc: Mathieu Desnoyers, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Boqun Feng

Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.

Increment the event counter and perform fixup on the pre-signal frame when
a signal is delivered on top of a restartable sequence critical section.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/Kconfig         | 1 +
 arch/powerpc/kernel/signal.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 0a9d439bcda6..4e93629c6b84 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -120,6 +120,7 @@ config PPC
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
+	select HAVE_RSEQ
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select SPARSE_IRQ
 	select IRQ_DOMAIN
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index cb64d6feb45a..339d0ebe2906 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -131,6 +131,8 @@ static void do_signal(struct pt_regs *regs)
 	/* Re-enable the breakpoints for the signal stack */
 	thread_change_pc(current, regs);
 
+	rseq_signal_deliver(regs);
+
 	if (is32) {
         	if (ksig.ka.sa.sa_flags & SA_SIGINFO)
 			ret = handle_rt_signal32(&ksig, oldset, regs);
@@ -157,6 +159,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
 	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
 		clear_thread_flag(TIF_NOTIFY_RESUME);
 		tracehook_notify_resume(regs);
+		rseq_handle_notify_resume(regs);
 	}
 
 	user_enter();
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC 3/4] Restartable sequences: Wire up powerpc system call
  2016-07-27 15:05     ` [RFC 1/4] rseq/param_test: Convert test_data_entry::count to intptr_t Boqun Feng
  2016-07-27 15:05       ` [RFC 2/4] Restartable sequences: powerpc architecture support Boqun Feng
@ 2016-07-27 15:05       ` Boqun Feng
  2016-07-28  3:13         ` Mathieu Desnoyers
  2016-07-27 15:05       ` [RFC 4/4] Restartable sequences: Add self-tests for PPC Boqun Feng
  2016-07-28  3:07       ` [RFC 1/4] rseq/param_test: Convert test_data_entry::count to intptr_t Mathieu Desnoyers
  3 siblings, 1 reply; 82+ messages in thread
From: Boqun Feng @ 2016-07-27 15:05 UTC (permalink / raw)
  To: linux-kernel, linux-api
  Cc: Mathieu Desnoyers, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Boqun Feng

Wire up the rseq system call on powerpc.

This provides an ABI improving the speed of a user-space getcpu
operation on powerpc by skipping the getcpu system call on the fast
path, as well as improving the speed of user-space operations on per-cpu
data compared to using load-reservation/store-conditional atomics.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/include/asm/systbl.h      | 1 +
 arch/powerpc/include/asm/unistd.h      | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 2fc5d4db503c..c68f4d0d00b2 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -386,3 +386,4 @@ SYSCALL(mlock2)
 SYSCALL(copy_file_range)
 COMPAT_SYS_SPU(preadv2)
 COMPAT_SYS_SPU(pwritev2)
+SYSCALL(rseq)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index cf12c580f6b2..a01e97d3f305 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define NR_syscalls		382
+#define NR_syscalls		383
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index e9f5f41aa55a..d1849d64c8ef 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -392,5 +392,6 @@
 #define __NR_copy_file_range	379
 #define __NR_preadv2		380
 #define __NR_pwritev2		381
+#define __NR_rseq		382
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC 4/4] Restartable sequences: Add self-tests for PPC
  2016-07-27 15:05     ` [RFC 1/4] rseq/param_test: Convert test_data_entry::count to intptr_t Boqun Feng
  2016-07-27 15:05       ` [RFC 2/4] Restartable sequences: powerpc architecture support Boqun Feng
  2016-07-27 15:05       ` [RFC 3/4] Restartable sequences: Wire up powerpc system call Boqun Feng
@ 2016-07-27 15:05       ` Boqun Feng
  2016-07-28  2:59         ` Mathieu Desnoyers
  2016-07-28  3:07       ` [RFC 1/4] rseq/param_test: Convert test_data_entry::count to intptr_t Mathieu Desnoyers
  3 siblings, 1 reply; 82+ messages in thread
From: Boqun Feng @ 2016-07-27 15:05 UTC (permalink / raw)
  To: linux-kernel, linux-api
  Cc: Mathieu Desnoyers, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Boqun Feng

As rseq syscall is enabled on PPC, implement the self-tests on PPC to
verify the implementation of the syscall.

Please note we only support 32bit userspace on BE kernel.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 tools/testing/selftests/rseq/param_test.c |  14 ++++
 tools/testing/selftests/rseq/rseq.h       | 120 ++++++++++++++++++++++++++++++
 2 files changed, 134 insertions(+)

diff --git a/tools/testing/selftests/rseq/param_test.c b/tools/testing/selftests/rseq/param_test.c
index db25e0a818e5..e2cb1b165f81 100644
--- a/tools/testing/selftests/rseq/param_test.c
+++ b/tools/testing/selftests/rseq/param_test.c
@@ -75,6 +75,20 @@ static __thread unsigned int yield_mod_cnt, nr_retry;
 	"bne 222b\n\t" \
 	"333:\n\t"
 
+#elif __PPC__
+#define INJECT_ASM_REG	"r18"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"lwz %%" INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
+	"cmpwi %%" INJECT_ASM_REG ", 0\n\t" \
+	"beq 333f\n\t" \
+	"222:\n\t" \
+	"subic. %%" INJECT_ASM_REG ", %%" INJECT_ASM_REG ", 1\n\t" \
+	"bne 222b\n\t" \
+	"333:\n\t"
 #else
 #error unsupported target
 #endif
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
index 791e14cf42ae..dea0bea52566 100644
--- a/tools/testing/selftests/rseq/rseq.h
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -138,6 +138,35 @@ do {									\
 #define has_fast_acquire_release()	0
 #define has_single_copy_load_64()	1
 
+#elif __PPC__
+#define smp_mb()	__asm__ __volatile__ ("sync" : : : "memory")
+#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
+#define smp_rmb()	smp_lwsync()
+#define smp_wmb()	smp_lwsync()
+
+#define smp_load_acquire(p)						\
+__extension__ ({							\
+	__typeof(*p) ____p1 = READ_ONCE(*p);				\
+	smp_lwsync();							\
+	____p1;								\
+})
+
+#define smp_acquire__after_ctrl_dep()	smp_lwsync()
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_lwsync();							\
+	WRITE_ONCE(*p, v);						\
+} while (0)
+
+#define has_fast_acquire_release()	1
+
+# if __PPC64__
+# define has_single_copy_load_64()	1
+# else
+# define has_single_copy_load_64()	0
+# endif
+
 #else
 #error unsupported target
 #endif
@@ -404,6 +433,97 @@ bool rseq_finish(struct rseq_lock *rlock,
 			: succeed
 		);
 	}
+#elif __PPC64__
+	{
+		/*
+		 * The __rseq_table section can be used by debuggers to better
+		 * handle single-stepping through the restartable critical
+		 * sections.
+		 */
+		__asm__ __volatile__ goto (
+			".pushsection __rseq_table, \"aw\"\n\t"
+			".balign 8\n\t"
+			"4:\n\t"
+			".quad 1f, 2f, 3f\n\t"
+			".popsection\n\t"
+			"1:\n\t"
+			RSEQ_INJECT_ASM(1)
+			"lis %%r17, (4b)@highest\n\t"
+			"ori %%r17, %%r17, (4b)@higher\n\t"
+			"rldicr %%r17, %%r17, 32, 31\n\t"
+			"oris %%r17, %%r17, (4b)@h\n\t"
+			"ori %%r17, %%r17, (4b)@l\n\t"
+			"std %%r17, 0(%[rseq_cs])\n\t"
+			RSEQ_INJECT_ASM(2)
+			"lwz %%r17, %[current_event_counter]\n\t"
+			"li %%r16, 0\n\t"
+			"cmpw cr7, %[start_event_counter], %%r17\n\t"
+			"bne cr7, 3f\n\t"
+			RSEQ_INJECT_ASM(3)
+			"std %[to_write], 0(%[target])\n\t"
+			"2:\n\t"
+			RSEQ_INJECT_ASM(4)
+			"std %%r16, 0(%[rseq_cs])\n\t"
+			"b %l[succeed]\n\t"
+			"3:\n\t"
+			"li %%r16, 0\n\t"
+			"std %%r16, 0(%[rseq_cs])\n\t"
+			: /* no outputs */
+			: [start_event_counter]"r"(start_value.event_counter),
+			  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
+			  [to_write]"r"(to_write),
+			  [target]"b"(p),
+			  [rseq_cs]"b"(&start_value.rseqp->abi.rseq_cs)
+			  RSEQ_INJECT_INPUT
+			: "r16", "r17", "memory", "cc"
+			  RSEQ_INJECT_CLOBBER
+			: succeed
+		);
+	}
+#elif __PPC__
+	{
+		/*
+		 * The __rseq_table section can be used by debuggers to better
+		 * handle single-stepping through the restartable critical
+		 * sections.
+		 */
+		__asm__ __volatile__ goto (
+			".pushsection __rseq_table, \"aw\"\n\t"
+			".balign 8\n\t"
+			"4:\n\t"
+			".long 0x0, 1f, 0x0, 2f, 0x0, 3f\n\t" /* 32 bit only supported on BE */
+			".popsection\n\t"
+			"1:\n\t"
+			RSEQ_INJECT_ASM(1)
+			"lis %%r17, (4b)@ha\n\t"
+			"addi %%r17, %%r17, (4b)@l\n\t"
+			"stw %%r17, 0(%[rseq_cs])\n\t"
+			RSEQ_INJECT_ASM(2)
+			"lwz %%r17, %[current_event_counter]\n\t"
+			"li %%r16, 0\n\t"
+			"cmpw cr7, %[start_event_counter], %%r17\n\t"
+			"bne cr7, 3f\n\t"
+			RSEQ_INJECT_ASM(3)
+			"stw %[to_write], 0(%[target])\n\t"
+			"2:\n\t"
+			RSEQ_INJECT_ASM(4)
+			"stw %%r16, 0(%[rseq_cs])\n\t"
+			"b %l[succeed]\n\t"
+			"3:\n\t"
+			"li %%r16, 0\n\t"
+			"stw %%r16, 0(%[rseq_cs])\n\t"
+			: /* no outputs */
+			: [start_event_counter]"r"(start_value.event_counter),
+			  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
+			  [to_write]"r"(to_write),
+			  [target]"b"(p),
+			  [rseq_cs]"b"(&start_value.rseqp->abi.rseq_cs)
+			  RSEQ_INJECT_INPUT
+			: "r16", "r17", "memory", "cc"
+			  RSEQ_INJECT_CLOBBER
+			: succeed
+		);
+	}
 #else
 #error unsupported target
 #endif
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC 4/4] Restartable sequences: Add self-tests for PPC
  2016-07-27 15:05       ` [RFC 4/4] Restartable sequences: Add self-tests for PPC Boqun Feng
@ 2016-07-28  2:59         ` Mathieu Desnoyers
  2016-07-28  4:43           ` Boqun Feng
  0 siblings, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-07-28  2:59 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linux-api, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras

----- On Jul 27, 2016, at 11:05 AM, Boqun Feng boqun.feng@gmail.com wrote:

> As rseq syscall is enabled on PPC, implement the self-tests on PPC to
> verify the implementation of the syscall.
> 
> Please note we only support 32bit userspace on BE kernel.
> 
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> tools/testing/selftests/rseq/param_test.c |  14 ++++
> tools/testing/selftests/rseq/rseq.h       | 120 ++++++++++++++++++++++++++++++
> 2 files changed, 134 insertions(+)
> 
> diff --git a/tools/testing/selftests/rseq/param_test.c
> b/tools/testing/selftests/rseq/param_test.c
> index db25e0a818e5..e2cb1b165f81 100644
> --- a/tools/testing/selftests/rseq/param_test.c
> +++ b/tools/testing/selftests/rseq/param_test.c
> @@ -75,6 +75,20 @@ static __thread unsigned int yield_mod_cnt, nr_retry;
> 	"bne 222b\n\t" \
> 	"333:\n\t"
> 
> +#elif __PPC__
> +#define INJECT_ASM_REG	"r18"
> +
> +#define RSEQ_INJECT_CLOBBER \
> +	, INJECT_ASM_REG
> +
> +#define RSEQ_INJECT_ASM(n) \
> +	"lwz %%" INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
> +	"cmpwi %%" INJECT_ASM_REG ", 0\n\t" \
> +	"beq 333f\n\t" \
> +	"222:\n\t" \
> +	"subic. %%" INJECT_ASM_REG ", %%" INJECT_ASM_REG ", 1\n\t" \
> +	"bne 222b\n\t" \
> +	"333:\n\t"
> #else
> #error unsupported target
> #endif
> diff --git a/tools/testing/selftests/rseq/rseq.h
> b/tools/testing/selftests/rseq/rseq.h
> index 791e14cf42ae..dea0bea52566 100644
> --- a/tools/testing/selftests/rseq/rseq.h
> +++ b/tools/testing/selftests/rseq/rseq.h
> @@ -138,6 +138,35 @@ do {									\
> #define has_fast_acquire_release()	0
> #define has_single_copy_load_64()	1
> 
> +#elif __PPC__
> +#define smp_mb()	__asm__ __volatile__ ("sync" : : : "memory")
> +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
> +#define smp_rmb()	smp_lwsync()
> +#define smp_wmb()	smp_lwsync()
> +
> +#define smp_load_acquire(p)						\
> +__extension__ ({							\
> +	__typeof(*p) ____p1 = READ_ONCE(*p);				\
> +	smp_lwsync();							\
> +	____p1;								\
> +})
> +
> +#define smp_acquire__after_ctrl_dep()	smp_lwsync()
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_lwsync();							\
> +	WRITE_ONCE(*p, v);						\
> +} while (0)
> +
> +#define has_fast_acquire_release()	1

Can you check if defining has_fast_acquire_release() to 0 speeds up
performance significantly ? It turns the smp_lwsync() into a
compiler barrier() on the smp_load_acquire() side (fast-path), and
turns the smp_lwsync() into a membarrier system call instead of the
matching smp_store_release() (slow path).

Thanks,

Mathieu

> +
> +# if __PPC64__
> +# define has_single_copy_load_64()	1
> +# else
> +# define has_single_copy_load_64()	0
> +# endif
> +
> #else
> #error unsupported target
> #endif
> @@ -404,6 +433,97 @@ bool rseq_finish(struct rseq_lock *rlock,
> 			: succeed
> 		);
> 	}
> +#elif __PPC64__
> +	{
> +		/*
> +		 * The __rseq_table section can be used by debuggers to better
> +		 * handle single-stepping through the restartable critical
> +		 * sections.
> +		 */
> +		__asm__ __volatile__ goto (
> +			".pushsection __rseq_table, \"aw\"\n\t"
> +			".balign 8\n\t"
> +			"4:\n\t"
> +			".quad 1f, 2f, 3f\n\t"
> +			".popsection\n\t"
> +			"1:\n\t"
> +			RSEQ_INJECT_ASM(1)
> +			"lis %%r17, (4b)@highest\n\t"
> +			"ori %%r17, %%r17, (4b)@higher\n\t"
> +			"rldicr %%r17, %%r17, 32, 31\n\t"
> +			"oris %%r17, %%r17, (4b)@h\n\t"
> +			"ori %%r17, %%r17, (4b)@l\n\t"
> +			"std %%r17, 0(%[rseq_cs])\n\t"
> +			RSEQ_INJECT_ASM(2)
> +			"lwz %%r17, %[current_event_counter]\n\t"
> +			"li %%r16, 0\n\t"
> +			"cmpw cr7, %[start_event_counter], %%r17\n\t"
> +			"bne cr7, 3f\n\t"
> +			RSEQ_INJECT_ASM(3)
> +			"std %[to_write], 0(%[target])\n\t"
> +			"2:\n\t"
> +			RSEQ_INJECT_ASM(4)
> +			"std %%r16, 0(%[rseq_cs])\n\t"
> +			"b %l[succeed]\n\t"
> +			"3:\n\t"
> +			"li %%r16, 0\n\t"
> +			"std %%r16, 0(%[rseq_cs])\n\t"
> +			: /* no outputs */
> +			: [start_event_counter]"r"(start_value.event_counter),
> +			  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
> +			  [to_write]"r"(to_write),
> +			  [target]"b"(p),
> +			  [rseq_cs]"b"(&start_value.rseqp->abi.rseq_cs)
> +			  RSEQ_INJECT_INPUT
> +			: "r16", "r17", "memory", "cc"
> +			  RSEQ_INJECT_CLOBBER
> +			: succeed
> +		);
> +	}
> +#elif __PPC__
> +	{
> +		/*
> +		 * The __rseq_table section can be used by debuggers to better
> +		 * handle single-stepping through the restartable critical
> +		 * sections.
> +		 */
> +		__asm__ __volatile__ goto (
> +			".pushsection __rseq_table, \"aw\"\n\t"
> +			".balign 8\n\t"
> +			"4:\n\t"
> +			".long 0x0, 1f, 0x0, 2f, 0x0, 3f\n\t" /* 32 bit only supported on BE */
> +			".popsection\n\t"
> +			"1:\n\t"
> +			RSEQ_INJECT_ASM(1)
> +			"lis %%r17, (4b)@ha\n\t"
> +			"addi %%r17, %%r17, (4b)@l\n\t"
> +			"stw %%r17, 0(%[rseq_cs])\n\t"
> +			RSEQ_INJECT_ASM(2)
> +			"lwz %%r17, %[current_event_counter]\n\t"
> +			"li %%r16, 0\n\t"
> +			"cmpw cr7, %[start_event_counter], %%r17\n\t"
> +			"bne cr7, 3f\n\t"
> +			RSEQ_INJECT_ASM(3)
> +			"stw %[to_write], 0(%[target])\n\t"
> +			"2:\n\t"
> +			RSEQ_INJECT_ASM(4)
> +			"stw %%r16, 0(%[rseq_cs])\n\t"
> +			"b %l[succeed]\n\t"
> +			"3:\n\t"
> +			"li %%r16, 0\n\t"
> +			"stw %%r16, 0(%[rseq_cs])\n\t"
> +			: /* no outputs */
> +			: [start_event_counter]"r"(start_value.event_counter),
> +			  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
> +			  [to_write]"r"(to_write),
> +			  [target]"b"(p),
> +			  [rseq_cs]"b"(&start_value.rseqp->abi.rseq_cs)
> +			  RSEQ_INJECT_INPUT
> +			: "r16", "r17", "memory", "cc"
> +			  RSEQ_INJECT_CLOBBER
> +			: succeed
> +		);
> +	}
> #else
> #error unsupported target
> #endif
> --
> 2.9.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/4] rseq/param_test: Convert test_data_entry::count to intptr_t
  2016-07-27 15:05     ` [RFC 1/4] rseq/param_test: Convert test_data_entry::count to intptr_t Boqun Feng
                         ` (2 preceding siblings ...)
  2016-07-27 15:05       ` [RFC 4/4] Restartable sequences: Add self-tests for PPC Boqun Feng
@ 2016-07-28  3:07       ` Mathieu Desnoyers
  3 siblings, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-07-28  3:07 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linux-api, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras

----- On Jul 27, 2016, at 11:05 AM, Boqun Feng boqun.feng@gmail.com wrote:

> The current semantics of do_rseq() is to do an intptr_t-typed store in
> successful cases; however, in test_percpu_{inc,spinlock}, we use
> test_data_entry::count as the location to store to, whose type is int.
> 
> intptr_t and int have different sizes on LP64 systems, and besides the
> type inconsistency, keeping test_data_entry::count as an int requires
> extra care about endianness.
> 
> To make things simpler and more consistent, convert
> test_data_entry::count to type intptr_t, which also makes the coming
> tests for ppc64le and ppc64 share the same code.
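
The endianness point is easy to see with a small standalone example
(illustration only, not part of the patch):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
	union {
		int64_t word;		/* what an intptr_t-sized store writes on LP64 */
		int32_t halves[2];	/* what an 'int count' would overlay at that address */
	} u;

	u.word = 1;

	/*
	 * Little-endian: halves[0] == 1.  Big-endian: halves[0] == 0 and
	 * halves[1] == 1, so an int-typed count at the store target would
	 * silently read back 0.  Declaring count as intptr_t removes the
	 * mismatch entirely.
	 */
	printf("halves[0]=%" PRId32 " halves[1]=%" PRId32 "\n",
	       u.halves[0], u.halves[1]);
	return 0;
}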

Folded into my rseq tests patch for next round, thanks!

I also took care of basic_percpu_ops_test.c which had the
same issue.

Thanks!

Mathieu

> 
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> tools/testing/selftests/rseq/param_test.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/testing/selftests/rseq/param_test.c
> b/tools/testing/selftests/rseq/param_test.c
> index f95fba5a1b2a..db25e0a818e5 100644
> --- a/tools/testing/selftests/rseq/param_test.c
> +++ b/tools/testing/selftests/rseq/param_test.c
> @@ -124,7 +124,7 @@ struct percpu_lock {
> };
> 
> struct test_data_entry {
> -	int count;
> +	intptr_t count;
> } __attribute__((aligned(128)));
> 
> struct spinlock_test_data {
> @@ -234,7 +234,8 @@ void *test_percpu_spinlock_thread(void *arg)
> void test_percpu_spinlock(void)
> {
> 	const int num_threads = opt_threads;
> -	int i, sum, ret;
> +	int i, ret;
> +	intptr_t sum;
> 	pthread_t test_threads[num_threads];
> 	struct spinlock_test_data data;
> 	struct spinlock_thread_test_data thread_data[num_threads];
> @@ -308,7 +309,8 @@ void *test_percpu_inc_thread(void *arg)
> void test_percpu_inc(void)
> {
> 	const int num_threads = opt_threads;
> -	int i, sum, ret;
> +	int i, ret;
> +	intptr_t sum;
> 	pthread_t test_threads[num_threads];
> 	struct inc_test_data data;
> 	struct inc_thread_test_data thread_data[num_threads];
> --
> 2.9.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-07-27 15:03   ` Boqun Feng
  2016-07-27 15:05     ` [RFC 1/4] rseq/param_test: Convert test_data_entry::count to intptr_t Boqun Feng
@ 2016-07-28  3:10     ` Mathieu Desnoyers
  1 sibling, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-07-28  3:10 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

----- On Jul 27, 2016, at 11:03 AM, Boqun Feng boqun.feng@gmail.com wrote:

> Hi Mathieu,
> 
> On Thu, Jul 21, 2016 at 05:14:16PM -0400, Mathieu Desnoyers wrote:
>> Expose a new system call allowing each thread to register one userspace
>> memory area to be used as an ABI between kernel and user-space for two
>> purposes: user-space restartable sequences and quick access to read the
>> current CPU number value from user-space.
>> 
>> * Restartable sequences (per-cpu atomics)
>> 
>> The restartable critical sections (percpu atomics) work has been started
>> by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
>> critical sections. [1] [2] The re-implementation proposed here brings a
>> few simplifications to the ABI which facilitates porting to other
> 
> Agreed ;-)
> 
>> architectures and speeds up the user-space fast path. A locking-based
>> fall-back, purely implemented in user-space, is proposed here to deal
>> with debugger single-stepping. This fallback interacts with rseq_start()
>> and rseq_finish(), which force retries in response to concurrent
>> lock-based activity.
>> 
> 
> So I have enabled this on powerpc, thanks to your nice work making
> porting easy ;-)
> 
> A patchset will follow in-reply-to this email, which includes patches
> enabling this on powerpc and a patch that improves the portability of
> the selftests; I don't think the latter needs to be a standalone
> patch, so it's OK to merge it into your patch #7.
> 
> I did some tests on 64-bit little/big endian pSeries (guest) kernels
> with the selftest cases (64-bit LE selftests on a 64-bit LE kernel,
> 64/32-bit BE selftests on a 64-bit BE kernel), and things seemingly
> went well ;-)
> 
> Here are some benchmark results I got on a little endian guest with 64
> VCPUs:
> 
> Benchmarking various approaches for reading the current CPU number:
> 
> Power8 PSeries Guest(64 VCPUs, the host has 16 cores, 128 hardware
> threads):
>							
> - Baseline (empty loop):                                   1.56 ns
> - Read CPU from rseq cpu_id:                               1.56 ns
> - Read CPU from rseq cpu_id (lazy register):               2.08 ns
> - glibc 2.23-0ubuntu3 getcpu:                              7.72 ns
> - getcpu system call:                                     91.80 ns
> 
> 
> Benchmarking various approaches for counter increment:
> 
> Power8 PSeries KVM Guest(64 VCPUs, the host has 16 cores, 128 hardware
> threads):
> 
>                                 Counter increment speed (ns/increment)
>                              1 thread   2 threads   4 threads   8 threads   16 threads   32 threads
> global increment (baseline)     6.5          N/A         N/A         N/A         N/A           N/A
> percpu rseq increment           6.9          6.9         7.2         7.3        15.4          35.5
> percpu rseq spinlock           19.0         18.9        19.4        19.4        35.5          71.8
> global atomic increment        25.8        111.0       261.0       905.2      2319.5        4170.5 (__sync_add_and_fetch_4)
> global atomic CAS              26.2        119.0       341.6      1183.0      3951.3        9312.5 (__sync_val_compare_and_swap_4)
> global pthread mutex           40.0        238.1       644.0      2052.2      4272.5        8612.2
> 
> 
> I surely need to run more tests for my patches in different
> environments, and will try to adjust the patchset according to whatever
> change you make (e.g. rseq_finish2) in the future.

I'm very glad to see it brings a speedup on powerpc too! I plan
minor changes following the feedback I already got. I'll surely
fold your updated benchmark numbers into my changelog when I stop
hiding behind the RFC tag. ;)

Thanks,

Mathieu

> 
> (Add PPC maintainers in Cc)
> 
> Regards,
> Boqun
> 
>> Here are benchmarks of counter increment in various scenarios compared
>> to restartable sequences:
>> 
>> ARMv7 Processor rev 4 (v7l)
>> Machine model: Cubietruck
>> 
>>                       Counter increment speed (ns/increment)
>>                              1 thread    2 threads
>> global increment (baseline)      6           N/A
>> percpu rseq increment           50            52
>> percpu rseq spinlock            94            94
>> global atomic increment         48            74 (__sync_add_and_fetch_4)
>> global atomic CAS               50           172 (__sync_val_compare_and_swap_4)
>> global pthread mutex           148           862
>> 
>> ARMv7 Processor rev 10 (v7l)
>> Machine model: Wandboard
>> 
>>                       Counter increment speed (ns/increment)
>>                              1 thread    4 threads
>> global increment (baseline)      7           N/A
>> percpu rseq increment           50            50
>> percpu rseq spinlock            82            84
>> global atomic increment         44           262 (__sync_add_and_fetch_4)
>> global atomic CAS               46           316 (__sync_val_compare_and_swap_4)
>> global pthread mutex           146          1400
>> 
>> x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
>> 
>>                       Counter increment speed (ns/increment)
>>                               1 thread           8 threads
>> global increment (baseline)      3.0                N/A
>> percpu rseq increment            3.6                3.8
>> percpu rseq spinlock             5.6                6.2
>> global LOCK; inc                 8.0              166.4
>> global LOCK; cmpxchg            13.4              435.2
>> global pthread mutex            25.2             1363.6
>> 
>> * Reading the current CPU number
>> 
>> Speeding up reading the current CPU number on which the caller thread is
>> running is done by keeping the current CPU number up do date within the
>> cpu_id field of the memory area registered by the thread. This is done
>> by making scheduler migration set the TIF_NOTIFY_RESUME flag on the
>> current thread. Upon return to user-space, a notify-resume handler
>> updates the current CPU value within the registered user-space memory
>> area. User-space can then read the current CPU number directly from
>> memory.
>> 
>> Keeping the current cpu id in a memory area shared between kernel and
>> user-space is an improvement over current mechanisms available to read
>> the current CPU number, which has the following benefits over
>> alternative approaches:
>> 
>> - 35x speedup on ARM vs system call through glibc
>> - 20x speedup on x86 compared to calling glibc, which calls vdso
>>   executing a "lsl" instruction,
>> - 14x speedup on x86 compared to inlined "lsl" instruction,
>> - Unlike vdso approaches, this cpu_id value can be read from an inline
>>   assembly, which makes it a useful building block for restartable
>>   sequences.
>> - The approach of reading the cpu id through memory mapping shared
>>   between kernel and user-space is portable (e.g. ARM), which is not the
>>   case for the lsl-based x86 vdso.
>> 
>> On x86, yet another possible approach would be to use the gs segment
>> selector to point to user-space per-cpu data. This approach performs
>> similarly to the cpu id cache, but it has two disadvantages: it is
>> not portable, and it is incompatible with existing applications already
>> using the gs segment selector for other purposes.
>> 
>> Benchmarking various approaches for reading the current CPU number:
>> 
>> ARMv7 Processor rev 4 (v7l)
>> Machine model: Cubietruck
>> - Baseline (empty loop):                                    8.4 ns
>> - Read CPU from rseq cpu_id:                               16.7 ns
>> - Read CPU from rseq cpu_id (lazy register):               19.8 ns
>> - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
>> - getcpu system call:                                     234.9 ns
>> 
>> x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
>> - Baseline (empty loop):                                    0.8 ns
>> - Read CPU from rseq cpu_id:                                0.8 ns
>> - Read CPU from rseq cpu_id (lazy register):                0.8 ns
>> - Read using gs segment selector:                           0.8 ns
>> - "lsl" inline assembly:                                   13.0 ns
>> - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
>> - getcpu system call:                                      53.9 ns
>> 
>> - Speed
>> 
>> Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
>> expectations, that enabling CONFIG_RSEQ slightly accelerates the
>> scheduler:
>> 
>> Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
>> 2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
>> saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
>> kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
>> restartable sequences series applied.
>> 
> 
> [snip]

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 2/4] Restartable sequences: powerpc architecture support
  2016-07-27 15:05       ` [RFC 2/4] Restartable sequences: powerpc architecture support Boqun Feng
@ 2016-07-28  3:13         ` Mathieu Desnoyers
  0 siblings, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-07-28  3:13 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linux-api, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras

----- On Jul 27, 2016, at 11:05 AM, Boqun Feng boqun.feng@gmail.com wrote:

> Call the rseq_handle_notify_resume() function on return to userspace if
> TIF_NOTIFY_RESUME thread flag is set.
> 
> Increment the event counter and perform fixup on the pre-signal when a
> signal is delivered on top of a restartable sequence critical section.

Picked into my patchset. I'm keeping a volatile branch of my current
work at https://github.com/compudj/linux-percpu-dev/tree/rseq-fallback
if you are interested in the progress.

Thanks!

Mathieu

> 
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> arch/powerpc/Kconfig         | 1 +
> arch/powerpc/kernel/signal.c | 3 +++
> 2 files changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 0a9d439bcda6..4e93629c6b84 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -120,6 +120,7 @@ config PPC
> 	select HAVE_PERF_USER_STACK_DUMP
> 	select HAVE_REGS_AND_STACK_ACCESS_API
> 	select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
> +	select HAVE_RSEQ
> 	select ARCH_WANT_IPC_PARSE_VERSION
> 	select SPARSE_IRQ
> 	select IRQ_DOMAIN
> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> index cb64d6feb45a..339d0ebe2906 100644
> --- a/arch/powerpc/kernel/signal.c
> +++ b/arch/powerpc/kernel/signal.c
> @@ -131,6 +131,8 @@ static void do_signal(struct pt_regs *regs)
> 	/* Re-enable the breakpoints for the signal stack */
> 	thread_change_pc(current, regs);
> 
> +	rseq_signal_deliver(regs);
> +
> 	if (is32) {
>         	if (ksig.ka.sa.sa_flags & SA_SIGINFO)
> 			ret = handle_rt_signal32(&ksig, oldset, regs);
> @@ -157,6 +159,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
> 	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
> 		clear_thread_flag(TIF_NOTIFY_RESUME);
> 		tracehook_notify_resume(regs);
> +		rseq_handle_notify_resume(regs);
> 	}
> 
> 	user_enter();
> --
> 2.9.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 3/4] Restartable sequences: Wire up powerpc system call
  2016-07-27 15:05       ` [RFC 3/4] Restartable sequences: Wire up powerpc system call Boqun Feng
@ 2016-07-28  3:13         ` Mathieu Desnoyers
  0 siblings, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-07-28  3:13 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linux-api, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras

----- On Jul 27, 2016, at 11:05 AM, Boqun Feng boqun.feng@gmail.com wrote:

> Wire up the rseq system call on powerpc.
> 
> This provides an ABI improving the speed of a user-space getcpu
> operation on powerpc by skipping the getcpu system call on the fast
> path, as well as improving the speed of user-space operations on per-cpu
> data compared to using load-reservation/store-conditional atomics.
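
For illustration, the user-space fast path this enables looks roughly
like the following (a sketch; the TLS variable name is made up and the
uapi struct rseq from patch 1/7 is assumed to be available):

#include <sched.h>		/* sched_getcpu() for the slow path */
#include <linux/rseq.h>		/* uapi struct rseq added by patch 1/7 */

static __thread volatile struct rseq rseq_state;	/* registered via the rseq syscall */

static inline int32_t read_current_cpu(void)
{
	int32_t cpu = rseq_state.u.e.cpu_id;	/* single 32-bit load, no syscall */

	if (cpu < 0)			/* negative: area not (yet) registered */
		cpu = sched_getcpu();	/* slow path */
	return cpu;
}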

Picked up in my dev branch too, thanks!

Mathieu

> 
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> arch/powerpc/include/asm/systbl.h      | 1 +
> arch/powerpc/include/asm/unistd.h      | 2 +-
> arch/powerpc/include/uapi/asm/unistd.h | 1 +
> 3 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/systbl.h
> b/arch/powerpc/include/asm/systbl.h
> index 2fc5d4db503c..c68f4d0d00b2 100644
> --- a/arch/powerpc/include/asm/systbl.h
> +++ b/arch/powerpc/include/asm/systbl.h
> @@ -386,3 +386,4 @@ SYSCALL(mlock2)
> SYSCALL(copy_file_range)
> COMPAT_SYS_SPU(preadv2)
> COMPAT_SYS_SPU(pwritev2)
> +SYSCALL(rseq)
> diff --git a/arch/powerpc/include/asm/unistd.h
> b/arch/powerpc/include/asm/unistd.h
> index cf12c580f6b2..a01e97d3f305 100644
> --- a/arch/powerpc/include/asm/unistd.h
> +++ b/arch/powerpc/include/asm/unistd.h
> @@ -12,7 +12,7 @@
> #include <uapi/asm/unistd.h>
> 
> 
> -#define NR_syscalls		382
> +#define NR_syscalls		383
> 
> #define __NR__exit __NR_exit
> 
> diff --git a/arch/powerpc/include/uapi/asm/unistd.h
> b/arch/powerpc/include/uapi/asm/unistd.h
> index e9f5f41aa55a..d1849d64c8ef 100644
> --- a/arch/powerpc/include/uapi/asm/unistd.h
> +++ b/arch/powerpc/include/uapi/asm/unistd.h
> @@ -392,5 +392,6 @@
> #define __NR_copy_file_range	379
> #define __NR_preadv2		380
> #define __NR_pwritev2		381
> +#define __NR_rseq		382
> 
> #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
> --
> 2.9.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 4/4] Restartable sequences: Add self-tests for PPC
  2016-07-28  2:59         ` Mathieu Desnoyers
@ 2016-07-28  4:43           ` Boqun Feng
  2016-07-28  7:37             ` [RFC v2] " Boqun Feng
  2016-07-28 13:42             ` [RFC 4/4] " Mathieu Desnoyers
  0 siblings, 2 replies; 82+ messages in thread
From: Boqun Feng @ 2016-07-28  4:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, linux-api, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 7346 bytes --]

On Thu, Jul 28, 2016 at 02:59:45AM +0000, Mathieu Desnoyers wrote:
> ----- On Jul 27, 2016, at 11:05 AM, Boqun Feng boqun.feng@gmail.com wrote:
> 
> > As rseq syscall is enabled on PPC, implement the self-tests on PPC to
> > verify the implementation of the syscall.
> > 
> > Please note we only support 32bit userspace on BE kernel.
> > 
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > ---
> > tools/testing/selftests/rseq/param_test.c |  14 ++++
> > tools/testing/selftests/rseq/rseq.h       | 120 ++++++++++++++++++++++++++++++
> > 2 files changed, 134 insertions(+)
> > 
> > diff --git a/tools/testing/selftests/rseq/param_test.c
> > b/tools/testing/selftests/rseq/param_test.c
> > index db25e0a818e5..e2cb1b165f81 100644
> > --- a/tools/testing/selftests/rseq/param_test.c
> > +++ b/tools/testing/selftests/rseq/param_test.c
> > @@ -75,6 +75,20 @@ static __thread unsigned int yield_mod_cnt, nr_retry;
> > 	"bne 222b\n\t" \
> > 	"333:\n\t"
> > 
> > +#elif __PPC__
> > +#define INJECT_ASM_REG	"r18"
> > +
> > +#define RSEQ_INJECT_CLOBBER \
> > +	, INJECT_ASM_REG
> > +
> > +#define RSEQ_INJECT_ASM(n) \
> > +	"lwz %%" INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
> > +	"cmpwi %%" INJECT_ASM_REG ", 0\n\t" \
> > +	"beq 333f\n\t" \
> > +	"222:\n\t" \
> > +	"subic. %%" INJECT_ASM_REG ", %%" INJECT_ASM_REG ", 1\n\t" \
> > +	"bne 222b\n\t" \
> > +	"333:\n\t"
> > #else
> > #error unsupported target
> > #endif
> > diff --git a/tools/testing/selftests/rseq/rseq.h
> > b/tools/testing/selftests/rseq/rseq.h
> > index 791e14cf42ae..dea0bea52566 100644
> > --- a/tools/testing/selftests/rseq/rseq.h
> > +++ b/tools/testing/selftests/rseq/rseq.h
> > @@ -138,6 +138,35 @@ do {									\
> > #define has_fast_acquire_release()	0
> > #define has_single_copy_load_64()	1
> > 
> > +#elif __PPC__
> > +#define smp_mb()	__asm__ __volatile__ ("sync" : : : "memory")
> > +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
> > +#define smp_rmb()	smp_lwsync()
> > +#define smp_wmb()	smp_lwsync()
> > +
> > +#define smp_load_acquire(p)						\
> > +__extension__ ({							\
> > +	__typeof(*p) ____p1 = READ_ONCE(*p);				\
> > +	smp_lwsync();							\
> > +	____p1;								\
> > +})
> > +
> > +#define smp_acquire__after_ctrl_dep()	smp_lwsync()
> > +
> > +#define smp_store_release(p, v)						\
> > +do {									\
> > +	smp_lwsync();							\
> > +	WRITE_ONCE(*p, v);						\
> > +} while (0)
> > +
> > +#define has_fast_acquire_release()	1
> 
> Can you check if defining has_fast_acquire_release() to 0 speeds up
> performance significantly ? It turns the smp_lwsync() into a
> compiler barrier() on the smp_load_acquire() side (fast-path), and
> turn the smp_lwsync() into a membarrier system call instead of the
> matching smp_store_release() (slow path).
> 

Good point. Here are the numbers:

Power8 PSeries KVM Guest(64 VCPUs, the host has 16 cores, 128 hardware
threads):

                                 Counter increment speed (ns/increment)
                              1 thread   2 threads   4 threads   8 threads   16 threads   32 threads
global increment (baseline)     6.5          N/A         N/A         N/A       N/A           N/A
percpu rseq increment           7.0          7.0         7.2         7.2       9.3          14.5
percpu rseq spinlock           18.5         18.5        18.6        18.8      25.5          52.7

So it looks like defining has_fast_acquire_release() to 0 benefits the
cases with more threads in the current benchmark. I will send an
updated patch doing this.

And as discussed on IRC, I will also remove the jump from the
rseq_finish() fast path in the powerpc asm in the updated patch, as you
did for x86 and ARM.
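
Schematically, that fast-path change between v1 and v2 of the PPC asm
amounts to the following (commit tail only, operand syntax as in the
patches; the full versions appear elsewhere in this thread):

	# v1: every successful commit ends with a taken branch
	bne	cr7, 3f				# mismatch: out-of-line cleanup
	std	%[to_write], 0(%[target])	# commit
	std	%%r16, 0(%[rseq_cs])		# clear rseq_cs
	b	%l[succeed]			# branch taken on every success

	# v2: success falls through; only the counter mismatch branches out
	bne-	cr7, %l[failure]		# mismatch: straight to the failure label
	std	%[to_write], 0(%[target])	# commit
	li	%%r17, 0
	std	%%r17, 0(%[rseq_cs])		# clear rseq_cs, fall through on success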

Regards,
Boqun


> Thanks,
> 
> Mathieu
> 
> > +
> > +# if __PPC64__
> > +# define has_single_copy_load_64()	1
> > +# else
> > +# define has_single_copy_load_64()	0
> > +# endif
> > +
> > #else
> > #error unsupported target
> > #endif
> > @@ -404,6 +433,97 @@ bool rseq_finish(struct rseq_lock *rlock,
> > 			: succeed
> > 		);
> > 	}
> > +#elif __PPC64__
> > +	{
> > +		/*
> > +		 * The __rseq_table section can be used by debuggers to better
> > +		 * handle single-stepping through the restartable critical
> > +		 * sections.
> > +		 */
> > +		__asm__ __volatile__ goto (
> > +			".pushsection __rseq_table, \"aw\"\n\t"
> > +			".balign 8\n\t"
> > +			"4:\n\t"
> > +			".quad 1f, 2f, 3f\n\t"
> > +			".popsection\n\t"
> > +			"1:\n\t"
> > +			RSEQ_INJECT_ASM(1)
> > +			"lis %%r17, (4b)@highest\n\t"
> > +			"ori %%r17, %%r17, (4b)@higher\n\t"
> > +			"rldicr %%r17, %%r17, 32, 31\n\t"
> > +			"oris %%r17, %%r17, (4b)@h\n\t"
> > +			"ori %%r17, %%r17, (4b)@l\n\t"
> > +			"std %%r17, 0(%[rseq_cs])\n\t"
> > +			RSEQ_INJECT_ASM(2)
> > +			"lwz %%r17, %[current_event_counter]\n\t"
> > +			"li %%r16, 0\n\t"
> > +			"cmpw cr7, %[start_event_counter], %%r17\n\t"
> > +			"bne cr7, 3f\n\t"
> > +			RSEQ_INJECT_ASM(3)
> > +			"std %[to_write], 0(%[target])\n\t"
> > +			"2:\n\t"
> > +			RSEQ_INJECT_ASM(4)
> > +			"std %%r16, 0(%[rseq_cs])\n\t"
> > +			"b %l[succeed]\n\t"
> > +			"3:\n\t"
> > +			"li %%r16, 0\n\t"
> > +			"std %%r16, 0(%[rseq_cs])\n\t"
> > +			: /* no outputs */
> > +			: [start_event_counter]"r"(start_value.event_counter),
> > +			  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
> > +			  [to_write]"r"(to_write),
> > +			  [target]"b"(p),
> > +			  [rseq_cs]"b"(&start_value.rseqp->abi.rseq_cs)
> > +			  RSEQ_INJECT_INPUT
> > +			: "r16", "r17", "memory", "cc"
> > +			  RSEQ_INJECT_CLOBBER
> > +			: succeed
> > +		);
> > +	}
> > +#elif __PPC__
> > +	{
> > +		/*
> > +		 * The __rseq_table section can be used by debuggers to better
> > +		 * handle single-stepping through the restartable critical
> > +		 * sections.
> > +		 */
> > +		__asm__ __volatile__ goto (
> > +			".pushsection __rseq_table, \"aw\"\n\t"
> > +			".balign 8\n\t"
> > +			"4:\n\t"
> > +			".long 0x0, 1f, 0x0, 2f, 0x0, 3f\n\t" /* 32 bit only supported on BE */
> > +			".popsection\n\t"
> > +			"1:\n\t"
> > +			RSEQ_INJECT_ASM(1)
> > +			"lis %%r17, (4b)@ha\n\t"
> > +			"addi %%r17, %%r17, (4b)@l\n\t"
> > +			"stw %%r17, 0(%[rseq_cs])\n\t"
> > +			RSEQ_INJECT_ASM(2)
> > +			"lwz %%r17, %[current_event_counter]\n\t"
> > +			"li %%r16, 0\n\t"
> > +			"cmpw cr7, %[start_event_counter], %%r17\n\t"
> > +			"bne cr7, 3f\n\t"
> > +			RSEQ_INJECT_ASM(3)
> > +			"stw %[to_write], 0(%[target])\n\t"
> > +			"2:\n\t"
> > +			RSEQ_INJECT_ASM(4)
> > +			"stw %%r16, 0(%[rseq_cs])\n\t"
> > +			"b %l[succeed]\n\t"
> > +			"3:\n\t"
> > +			"li %%r16, 0\n\t"
> > +			"stw %%r16, 0(%[rseq_cs])\n\t"
> > +			: /* no outputs */
> > +			: [start_event_counter]"r"(start_value.event_counter),
> > +			  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
> > +			  [to_write]"r"(to_write),
> > +			  [target]"b"(p),
> > +			  [rseq_cs]"b"(&start_value.rseqp->abi.rseq_cs)
> > +			  RSEQ_INJECT_INPUT
> > +			: "r16", "r17", "memory", "cc"
> > +			  RSEQ_INJECT_CLOBBER
> > +			: succeed
> > +		);
> > +	}
> > #else
> > #error unsupported target
> > #endif
> > --
> > 2.9.0
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [RFC v2] Restartable sequences: Add self-tests for PPC
  2016-07-28  4:43           ` Boqun Feng
@ 2016-07-28  7:37             ` Boqun Feng
  2016-07-28 14:04               ` Mathieu Desnoyers
  2016-07-28 13:42             ` [RFC 4/4] " Mathieu Desnoyers
  1 sibling, 1 reply; 82+ messages in thread
From: Boqun Feng @ 2016-07-28  7:37 UTC (permalink / raw)
  To: linux-kernel, linux-api
  Cc: Mathieu Desnoyers, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Boqun Feng

As the rseq syscall is now enabled on PPC, implement the self-tests on
PPC to verify the implementation of the syscall.

Please note we only support 32-bit userspace on BE kernels.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
v1-->v2:
	1. Remove the branch in the rseq_finish() fast path.

	2. Use bne- instead of bne to branch on failure.

	3. Use r17 instead of r16 for storing zero to rseq_cs, which
	saves a register in the rseq_finish() asm block.

 tools/testing/selftests/rseq/param_test.c |  14 ++++
 tools/testing/selftests/rseq/rseq.h       | 112 ++++++++++++++++++++++++++++++
 2 files changed, 126 insertions(+)

diff --git a/tools/testing/selftests/rseq/param_test.c b/tools/testing/selftests/rseq/param_test.c
index db25e0a818e5..e2cb1b165f81 100644
--- a/tools/testing/selftests/rseq/param_test.c
+++ b/tools/testing/selftests/rseq/param_test.c
@@ -75,6 +75,20 @@ static __thread unsigned int yield_mod_cnt, nr_retry;
 	"bne 222b\n\t" \
 	"333:\n\t"
 
+#elif __PPC__
+#define INJECT_ASM_REG	"r18"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"lwz %%" INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
+	"cmpwi %%" INJECT_ASM_REG ", 0\n\t" \
+	"beq 333f\n\t" \
+	"222:\n\t" \
+	"subic. %%" INJECT_ASM_REG ", %%" INJECT_ASM_REG ", 1\n\t" \
+	"bne 222b\n\t" \
+	"333:\n\t"
 #else
 #error unsupported target
 #endif
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
index 35b60ee3bb02..b5336cf54788 100644
--- a/tools/testing/selftests/rseq/rseq.h
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -138,6 +138,35 @@ do {									\
 #define has_fast_acquire_release()	0
 #define has_single_copy_load_64()	1
 
+#elif __PPC__
+#define smp_mb()	__asm__ __volatile__ ("sync" : : : "memory")
+#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
+#define smp_rmb()	smp_lwsync()
+#define smp_wmb()	smp_lwsync()
+
+#define smp_load_acquire(p)						\
+__extension__ ({							\
+	__typeof(*p) ____p1 = READ_ONCE(*p);				\
+	smp_lwsync();							\
+	____p1;								\
+})
+
+#define smp_acquire__after_ctrl_dep()	smp_lwsync()
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_lwsync();							\
+	WRITE_ONCE(*p, v);						\
+} while (0)
+
+#define has_fast_acquire_release()	0
+
+# if __PPC64__
+# define has_single_copy_load_64()	1
+# else
+# define has_single_copy_load_64()	0
+# endif
+
 #else
 #error unsupported target
 #endif
@@ -398,6 +427,89 @@ bool rseq_finish(struct rseq_lock *rlock,
 			: failure
 		);
 	}
+#elif __PPC64__
+	{
+		/*
+		 * The __rseq_table section can be used by debuggers to better
+		 * handle single-stepping through the restartable critical
+		 * sections.
+		 */
+		__asm__ __volatile__ goto (
+			".pushsection __rseq_table, \"aw\"\n\t"
+			".balign 8\n\t"
+			"3:\n\t"
+			".quad 1f, 2f, %l[failure]\n\t"
+			".popsection\n\t"
+			"1:\n\t"
+			RSEQ_INJECT_ASM(1)
+			"lis %%r17, (3b)@highest\n\t"
+			"ori %%r17, %%r17, (3b)@higher\n\t"
+			"rldicr %%r17, %%r17, 32, 31\n\t"
+			"oris %%r17, %%r17, (3b)@h\n\t"
+			"ori %%r17, %%r17, (3b)@l\n\t"
+			"std %%r17, 0(%[rseq_cs])\n\t"
+			RSEQ_INJECT_ASM(2)
+			"lwz %%r17, %[current_event_counter]\n\t"
+			"cmpw cr7, %[start_event_counter], %%r17\n\t"
+			"bne- cr7, %l[failure]\n\t"
+			RSEQ_INJECT_ASM(3)
+			"std %[to_write], 0(%[target])\n\t"
+			"2:\n\t"
+			RSEQ_INJECT_ASM(4)
+			"li %%r17, 0\n\t"
+			"std %%r17, 0(%[rseq_cs])\n\t"
+			: /* no outputs */
+			: [start_event_counter]"r"(start_value.event_counter),
+			  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
+			  [to_write]"r"(to_write),
+			  [target]"b"(p),
+			  [rseq_cs]"b"(&start_value.rseqp->abi.rseq_cs)
+			  RSEQ_INJECT_INPUT
+			: "r17", "memory", "cc"
+			  RSEQ_INJECT_CLOBBER
+			: failure
+		);
+	}
+#elif __PPC__
+	{
+		/*
+		 * The __rseq_table section can be used by debuggers to better
+		 * handle single-stepping through the restartable critical
+		 * sections.
+		 */
+		__asm__ __volatile__ goto (
+			".pushsection __rseq_table, \"aw\"\n\t"
+			".balign 8\n\t"
+			"3:\n\t"
+			".long 0x0, 1f, 0x0, 2f, 0x0, %l[failure]\n\t" /* 32 bit only supported on BE */
+			".popsection\n\t"
+			"1:\n\t"
+			RSEQ_INJECT_ASM(1)
+			"lis %%r17, (3b)@ha\n\t"
+			"addi %%r17, %%r17, (3b)@l\n\t"
+			"stw %%r17, 0(%[rseq_cs])\n\t"
+			RSEQ_INJECT_ASM(2)
+			"lwz %%r17, %[current_event_counter]\n\t"
+			"cmpw cr7, %[start_event_counter], %%r17\n\t"
+			"bne- cr7, %l[failure]\n\t"
+			RSEQ_INJECT_ASM(3)
+			"stw %[to_write], 0(%[target])\n\t"
+			"2:\n\t"
+			RSEQ_INJECT_ASM(4)
+			"li %%r17, 0\n\t"
+			"stw %%r17, 0(%[rseq_cs])\n\t"
+			: /* no outputs */
+			: [start_event_counter]"r"(start_value.event_counter),
+			  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
+			  [to_write]"r"(to_write),
+			  [target]"b"(p),
+			  [rseq_cs]"b"(&start_value.rseqp->abi.rseq_cs)
+			  RSEQ_INJECT_INPUT
+			: "r17", "memory", "cc"
+			  RSEQ_INJECT_CLOBBER
+			: failure
+		);
+	}
 #else
 #error unsupported target
 #endif
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC 4/4] Restartable sequences: Add self-tests for PPC
  2016-07-28  4:43           ` Boqun Feng
  2016-07-28  7:37             ` [RFC v2] " Boqun Feng
@ 2016-07-28 13:42             ` Mathieu Desnoyers
  1 sibling, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-07-28 13:42 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linux-api, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras

----- On Jul 28, 2016, at 12:43 AM, Boqun Feng boqun.feng@gmail.com wrote:

> On Thu, Jul 28, 2016 at 02:59:45AM +0000, Mathieu Desnoyers wrote:
>> ----- On Jul 27, 2016, at 11:05 AM, Boqun Feng boqun.feng@gmail.com wrote:
>> 
>> > As rseq syscall is enabled on PPC, implement the self-tests on PPC to
>> > verify the implementation of the syscall.
>> > 
>> > Please note we only support 32bit userspace on BE kernel.
>> > 
>> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
>> > ---
>> > tools/testing/selftests/rseq/param_test.c |  14 ++++
>> > tools/testing/selftests/rseq/rseq.h       | 120 ++++++++++++++++++++++++++++++
>> > 2 files changed, 134 insertions(+)
>> > 
>> > diff --git a/tools/testing/selftests/rseq/param_test.c
>> > b/tools/testing/selftests/rseq/param_test.c
>> > index db25e0a818e5..e2cb1b165f81 100644
>> > --- a/tools/testing/selftests/rseq/param_test.c
>> > +++ b/tools/testing/selftests/rseq/param_test.c
>> > @@ -75,6 +75,20 @@ static __thread unsigned int yield_mod_cnt, nr_retry;
>> > 	"bne 222b\n\t" \
>> > 	"333:\n\t"
>> > 
>> > +#elif __PPC__
>> > +#define INJECT_ASM_REG	"r18"
>> > +
>> > +#define RSEQ_INJECT_CLOBBER \
>> > +	, INJECT_ASM_REG
>> > +
>> > +#define RSEQ_INJECT_ASM(n) \
>> > +	"lwz %%" INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
>> > +	"cmpwi %%" INJECT_ASM_REG ", 0\n\t" \
>> > +	"beq 333f\n\t" \
>> > +	"222:\n\t" \
>> > +	"subic. %%" INJECT_ASM_REG ", %%" INJECT_ASM_REG ", 1\n\t" \
>> > +	"bne 222b\n\t" \
>> > +	"333:\n\t"
>> > #else
>> > #error unsupported target
>> > #endif
>> > diff --git a/tools/testing/selftests/rseq/rseq.h
>> > b/tools/testing/selftests/rseq/rseq.h
>> > index 791e14cf42ae..dea0bea52566 100644
>> > --- a/tools/testing/selftests/rseq/rseq.h
>> > +++ b/tools/testing/selftests/rseq/rseq.h
>> > @@ -138,6 +138,35 @@ do {									\
>> > #define has_fast_acquire_release()	0
>> > #define has_single_copy_load_64()	1
>> > 
>> > +#elif __PPC__
>> > +#define smp_mb()	__asm__ __volatile__ ("sync" : : : "memory")
>> > +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
>> > +#define smp_rmb()	smp_lwsync()
>> > +#define smp_wmb()	smp_lwsync()
>> > +
>> > +#define smp_load_acquire(p)						\
>> > +__extension__ ({							\
>> > +	__typeof(*p) ____p1 = READ_ONCE(*p);				\
>> > +	smp_lwsync();							\
>> > +	____p1;								\
>> > +})
>> > +
>> > +#define smp_acquire__after_ctrl_dep()	smp_lwsync()
>> > +
>> > +#define smp_store_release(p, v)						\
>> > +do {									\
>> > +	smp_lwsync();							\
>> > +	WRITE_ONCE(*p, v);						\
>> > +} while (0)
>> > +
>> > +#define has_fast_acquire_release()	1
>> 
>> Can you check if defining has_fast_acquire_release() to 0 speeds up
>> performance significantly ? It turns the smp_lwsync() into a
>> compiler barrier() on the smp_load_acquire() side (fast-path), and
>> turn the smp_lwsync() into a membarrier system call instead of the
>> matching smp_store_release() (slow path).
>> 
> 
> Good point. Here are the numbers:
> 
> Power8 PSeries KVM Guest(64 VCPUs, the host has 16 cores, 128 hardware
> threads):
> 
>                                 Counter increment speed (ns/increment)
>                              1 thread   2 threads   4 threads   8 threads   16 threads   32 threads
> global increment (baseline)     6.5          N/A         N/A         N/A       N/A           N/A
> percpu rseq increment           7.0          7.0         7.2         7.2       9.3          14.5
> percpu rseq spinlock           18.5         18.5        18.6        18.8      25.5          52.7
> 
> So looks like defining has_fast_acquire_release() to 0 could benefit the
> cases with more threads in current benchmark. I will send a updated
> patch doing this.

Good to know the lwsync barrier overhead kicks in at that level of
workload on Power8.

> 
> And as discussed in IRC, I will also remove jump from rseq_finish()
> fast-path in powerpc asm in the updated patch as you did for x86 and
> ARM.

Alright, thanks!

Mathieu

> 
> Regards,
> Boqun
> 
> 
>> Thanks,
>> 
>> Mathieu
>> 
>> > +
>> > +# if __PPC64__
>> > +# define has_single_copy_load_64()	1
>> > +# else
>> > +# define has_single_copy_load_64()	0
>> > +# endif
>> > +
>> > #else
>> > #error unsupported target
>> > #endif
>> > @@ -404,6 +433,97 @@ bool rseq_finish(struct rseq_lock *rlock,
>> > 			: succeed
>> > 		);
>> > 	}
>> > +#elif __PPC64__
>> > +	{
>> > +		/*
>> > +		 * The __rseq_table section can be used by debuggers to better
>> > +		 * handle single-stepping through the restartable critical
>> > +		 * sections.
>> > +		 */
>> > +		__asm__ __volatile__ goto (
>> > +			".pushsection __rseq_table, \"aw\"\n\t"
>> > +			".balign 8\n\t"
>> > +			"4:\n\t"
>> > +			".quad 1f, 2f, 3f\n\t"
>> > +			".popsection\n\t"
>> > +			"1:\n\t"
>> > +			RSEQ_INJECT_ASM(1)
>> > +			"lis %%r17, (4b)@highest\n\t"
>> > +			"ori %%r17, %%r17, (4b)@higher\n\t"
>> > +			"rldicr %%r17, %%r17, 32, 31\n\t"
>> > +			"oris %%r17, %%r17, (4b)@h\n\t"
>> > +			"ori %%r17, %%r17, (4b)@l\n\t"
>> > +			"std %%r17, 0(%[rseq_cs])\n\t"
>> > +			RSEQ_INJECT_ASM(2)
>> > +			"lwz %%r17, %[current_event_counter]\n\t"
>> > +			"li %%r16, 0\n\t"
>> > +			"cmpw cr7, %[start_event_counter], %%r17\n\t"
>> > +			"bne cr7, 3f\n\t"
>> > +			RSEQ_INJECT_ASM(3)
>> > +			"std %[to_write], 0(%[target])\n\t"
>> > +			"2:\n\t"
>> > +			RSEQ_INJECT_ASM(4)
>> > +			"std %%r16, 0(%[rseq_cs])\n\t"
>> > +			"b %l[succeed]\n\t"
>> > +			"3:\n\t"
>> > +			"li %%r16, 0\n\t"
>> > +			"std %%r16, 0(%[rseq_cs])\n\t"
>> > +			: /* no outputs */
>> > +			: [start_event_counter]"r"(start_value.event_counter),
>> > +			  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
>> > +			  [to_write]"r"(to_write),
>> > +			  [target]"b"(p),
>> > +			  [rseq_cs]"b"(&start_value.rseqp->abi.rseq_cs)
>> > +			  RSEQ_INJECT_INPUT
>> > +			: "r16", "r17", "memory", "cc"
>> > +			  RSEQ_INJECT_CLOBBER
>> > +			: succeed
>> > +		);
>> > +	}
>> > +#elif __PPC__
>> > +	{
>> > +		/*
>> > +		 * The __rseq_table section can be used by debuggers to better
>> > +		 * handle single-stepping through the restartable critical
>> > +		 * sections.
>> > +		 */
>> > +		__asm__ __volatile__ goto (
>> > +			".pushsection __rseq_table, \"aw\"\n\t"
>> > +			".balign 8\n\t"
>> > +			"4:\n\t"
>> > +			".long 0x0, 1f, 0x0, 2f, 0x0, 3f\n\t" /* 32 bit only supported on BE */
>> > +			".popsection\n\t"
>> > +			"1:\n\t"
>> > +			RSEQ_INJECT_ASM(1)
>> > +			"lis %%r17, (4b)@ha\n\t"
>> > +			"addi %%r17, %%r17, (4b)@l\n\t"
>> > +			"stw %%r17, 0(%[rseq_cs])\n\t"
>> > +			RSEQ_INJECT_ASM(2)
>> > +			"lwz %%r17, %[current_event_counter]\n\t"
>> > +			"li %%r16, 0\n\t"
>> > +			"cmpw cr7, %[start_event_counter], %%r17\n\t"
>> > +			"bne cr7, 3f\n\t"
>> > +			RSEQ_INJECT_ASM(3)
>> > +			"stw %[to_write], 0(%[target])\n\t"
>> > +			"2:\n\t"
>> > +			RSEQ_INJECT_ASM(4)
>> > +			"stw %%r16, 0(%[rseq_cs])\n\t"
>> > +			"b %l[succeed]\n\t"
>> > +			"3:\n\t"
>> > +			"li %%r16, 0\n\t"
>> > +			"stw %%r16, 0(%[rseq_cs])\n\t"
>> > +			: /* no outputs */
>> > +			: [start_event_counter]"r"(start_value.event_counter),
>> > +			  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
>> > +			  [to_write]"r"(to_write),
>> > +			  [target]"b"(p),
>> > +			  [rseq_cs]"b"(&start_value.rseqp->abi.rseq_cs)
>> > +			  RSEQ_INJECT_INPUT
>> > +			: "r16", "r17", "memory", "cc"
>> > +			  RSEQ_INJECT_CLOBBER
>> > +			: succeed
>> > +		);
>> > +	}
>> > #else
>> > #error unsupported target
>> > #endif
>> > --
>> > 2.9.0
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
> > http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v2] Restartable sequences: Add self-tests for PPC
  2016-07-28  7:37             ` [RFC v2] " Boqun Feng
@ 2016-07-28 14:04               ` Mathieu Desnoyers
  0 siblings, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-07-28 14:04 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linux-api, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras

----- On Jul 28, 2016, at 3:37 AM, Boqun Feng boqun.feng@gmail.com wrote:

> As rseq syscall is enabled on PPC, implement the self-tests on PPC to
> verify the implementation of the syscall.
> 
> Please note we only support 32bit userspace on BE kernel.

Picked into my rseq-fallback dev branch, thanks!

Mathieu

> 
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> v1-->v2:
>	1. Remove branch in rseq_finish() fastpath
> 
>	2. Use bne- instead of bne to jump when failure.
> 
>	3. Use r17 instead of r16 for storing zero to rseq_cs, which
>	could save a register in rseq_finish() asm block.
> 
> tools/testing/selftests/rseq/param_test.c |  14 ++++
> tools/testing/selftests/rseq/rseq.h       | 112 ++++++++++++++++++++++++++++++
> 2 files changed, 126 insertions(+)
> 
> diff --git a/tools/testing/selftests/rseq/param_test.c
> b/tools/testing/selftests/rseq/param_test.c
> index db25e0a818e5..e2cb1b165f81 100644
> --- a/tools/testing/selftests/rseq/param_test.c
> +++ b/tools/testing/selftests/rseq/param_test.c
> @@ -75,6 +75,20 @@ static __thread unsigned int yield_mod_cnt, nr_retry;
> 	"bne 222b\n\t" \
> 	"333:\n\t"
> 
> +#elif __PPC__
> +#define INJECT_ASM_REG	"r18"
> +
> +#define RSEQ_INJECT_CLOBBER \
> +	, INJECT_ASM_REG
> +
> +#define RSEQ_INJECT_ASM(n) \
> +	"lwz %%" INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
> +	"cmpwi %%" INJECT_ASM_REG ", 0\n\t" \
> +	"beq 333f\n\t" \
> +	"222:\n\t" \
> +	"subic. %%" INJECT_ASM_REG ", %%" INJECT_ASM_REG ", 1\n\t" \
> +	"bne 222b\n\t" \
> +	"333:\n\t"
> #else
> #error unsupported target
> #endif
> diff --git a/tools/testing/selftests/rseq/rseq.h
> b/tools/testing/selftests/rseq/rseq.h
> index 35b60ee3bb02..b5336cf54788 100644
> --- a/tools/testing/selftests/rseq/rseq.h
> +++ b/tools/testing/selftests/rseq/rseq.h
> @@ -138,6 +138,35 @@ do {									\
> #define has_fast_acquire_release()	0
> #define has_single_copy_load_64()	1
> 
> +#elif __PPC__
> +#define smp_mb()	__asm__ __volatile__ ("sync" : : : "memory")
> +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
> +#define smp_rmb()	smp_lwsync()
> +#define smp_wmb()	smp_lwsync()
> +
> +#define smp_load_acquire(p)						\
> +__extension__ ({							\
> +	__typeof(*p) ____p1 = READ_ONCE(*p);				\
> +	smp_lwsync();							\
> +	____p1;								\
> +})
> +
> +#define smp_acquire__after_ctrl_dep()	smp_lwsync()
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_lwsync();							\
> +	WRITE_ONCE(*p, v);						\
> +} while (0)
> +
> +#define has_fast_acquire_release()	0
> +
> +# if __PPC64__
> +# define has_single_copy_load_64()	1
> +# else
> +# define has_single_copy_load_64()	0
> +# endif
> +
> #else
> #error unsupported target
> #endif
> @@ -398,6 +427,89 @@ bool rseq_finish(struct rseq_lock *rlock,
> 			: failure
> 		);
> 	}
> +#elif __PPC64__
> +	{
> +		/*
> +		 * The __rseq_table section can be used by debuggers to better
> +		 * handle single-stepping through the restartable critical
> +		 * sections.
> +		 */
> +		__asm__ __volatile__ goto (
> +			".pushsection __rseq_table, \"aw\"\n\t"
> +			".balign 8\n\t"
> +			"3:\n\t"
> +			".quad 1f, 2f, %l[failure]\n\t"
> +			".popsection\n\t"
> +			"1:\n\t"
> +			RSEQ_INJECT_ASM(1)
> +			"lis %%r17, (3b)@highest\n\t"
> +			"ori %%r17, %%r17, (3b)@higher\n\t"
> +			"rldicr %%r17, %%r17, 32, 31\n\t"
> +			"oris %%r17, %%r17, (3b)@h\n\t"
> +			"ori %%r17, %%r17, (3b)@l\n\t"
> +			"std %%r17, 0(%[rseq_cs])\n\t"
> +			RSEQ_INJECT_ASM(2)
> +			"lwz %%r17, %[current_event_counter]\n\t"
> +			"cmpw cr7, %[start_event_counter], %%r17\n\t"
> +			"bne- cr7, %l[failure]\n\t"
> +			RSEQ_INJECT_ASM(3)
> +			"std %[to_write], 0(%[target])\n\t"
> +			"2:\n\t"
> +			RSEQ_INJECT_ASM(4)
> +			"li %%r17, 0\n\t"
> +			"std %%r17, 0(%[rseq_cs])\n\t"
> +			: /* no outputs */
> +			: [start_event_counter]"r"(start_value.event_counter),
> +			  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
> +			  [to_write]"r"(to_write),
> +			  [target]"b"(p),
> +			  [rseq_cs]"b"(&start_value.rseqp->abi.rseq_cs)
> +			  RSEQ_INJECT_INPUT
> +			: "r17", "memory", "cc"
> +			  RSEQ_INJECT_CLOBBER
> +			: failure
> +		);
> +	}
> +#elif __PPC__
> +	{
> +		/*
> +		 * The __rseq_table section can be used by debuggers to better
> +		 * handle single-stepping through the restartable critical
> +		 * sections.
> +		 */
> +		__asm__ __volatile__ goto (
> +			".pushsection __rseq_table, \"aw\"\n\t"
> +			".balign 8\n\t"
> +			"3:\n\t"
> +			".long 0x0, 1f, 0x0, 2f, 0x0, %l[failure]\n\t" /* 32 bit only supported on BE */
> +			".popsection\n\t"
> +			"1:\n\t"
> +			RSEQ_INJECT_ASM(1)
> +			"lis %%r17, (3b)@ha\n\t"
> +			"addi %%r17, %%r17, (3b)@l\n\t"
> +			"stw %%r17, 0(%[rseq_cs])\n\t"
> +			RSEQ_INJECT_ASM(2)
> +			"lwz %%r17, %[current_event_counter]\n\t"
> +			"cmpw cr7, %[start_event_counter], %%r17\n\t"
> +			"bne- cr7, %l[failure]\n\t"
> +			RSEQ_INJECT_ASM(3)
> +			"stw %[to_write], 0(%[target])\n\t"
> +			"2:\n\t"
> +			RSEQ_INJECT_ASM(4)
> +			"li %%r17, 0\n\t"
> +			"stw %%r17, 0(%[rseq_cs])\n\t"
> +			: /* no outputs */
> +			: [start_event_counter]"r"(start_value.event_counter),
> +			  [current_event_counter]"m"(start_value.rseqp->abi.u.e.event_counter),
> +			  [to_write]"r"(to_write),
> +			  [target]"b"(p),
> +			  [rseq_cs]"b"(&start_value.rseqp->abi.rseq_cs)
> +			  RSEQ_INJECT_INPUT
> +			: "r17", "memory", "cc"
> +			  RSEQ_INJECT_CLOBBER
> +			: failure
> +		);
> +	}
> #else
> #error unsupported target
> #endif
> --
> 2.9.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-07-26  3:02     ` Mathieu Desnoyers
@ 2016-08-03 12:27       ` Peter Zijlstra
  2016-08-03 16:37         ` Andy Lutomirski
  2016-08-03 18:29       ` Christoph Lameter
  1 sibling, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2016-08-03 12:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andy Lutomirski, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Tue, Jul 26, 2016 at 03:02:19AM +0000, Mathieu Desnoyers wrote:
> We really care about preemption here. Every migration implies a
> preemption from a user-space perspective. If we would only care
> about keeping the CPU id up-to-date, hooking into migration would be
> enough. But since we want atomicity guarantees for restartable
> sequences, we need to hook into preemption.

> It allows user-space to perform update operations on per-cpu data without
> requiring heavy-weight atomic operations.

Well, a CMPXCHG without LOCK prefix isn't all that expensive on x86.

It is, however, on PPC and possibly other architectures, so in the name
of simplicity supporting only the one variant makes sense.
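
For illustration, the LOCK prefix difference amounts to roughly this
(x86-64 GNU C sketch, not code from this patch set):

#include <stdbool.h>
#include <stdint.h>

/* Not atomic w.r.t. other CPUs, but still a single instruction, so it
 * cannot be torn by an interrupt on the local CPU. */
static inline bool cmpxchg_local(uint64_t *p, uint64_t old, uint64_t newval)
{
	uint64_t prev = old;

	__asm__ __volatile__("cmpxchgq %2, %1"
			     : "+a" (prev), "+m" (*p)
			     : "r" (newval)
			     : "memory", "cc");
	return prev == old;
}

/* The LOCK prefix makes it atomic across CPUs, at a much higher cost. */
static inline bool cmpxchg_global(uint64_t *p, uint64_t old, uint64_t newval)
{
	uint64_t prev = old;

	__asm__ __volatile__("lock; cmpxchgq %2, %1"
			     : "+a" (prev), "+m" (*p)
			     : "r" (newval)
			     : "memory", "cc");
	return prev == old;
}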

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-07-21 21:14 ` [RFC PATCH v7 1/7] " Mathieu Desnoyers
  2016-07-25 23:02   ` Andy Lutomirski
  2016-07-27 15:03   ` Boqun Feng
@ 2016-08-03 13:19   ` Peter Zijlstra
  2016-08-03 14:53     ` Paul E. McKenney
                       ` (2 more replies)
  2 siblings, 3 replies; 82+ messages in thread
From: Peter Zijlstra @ 2016-08-03 13:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Thu, Jul 21, 2016 at 05:14:16PM -0400, Mathieu Desnoyers wrote:
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1209323..daef027 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5085,6 +5085,13 @@ M:	Joe Perches <joe@perches.com>
>  S:	Maintained
>  F:	scripts/get_maintainer.pl
>  
> +RESTARTABLE SEQUENCES SUPPORT
> +M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

It would be good to have multiple people here; if we lack volunteers I'd
be willing. Paul, Andrew, are any of you guys willing?

> +L:	linux-kernel@vger.kernel.org
> +S:	Supported
> +F:	kernel/rseq.c
> +F:	include/uapi/linux/rseq.h
> +
>  GFS2 FILE SYSTEM
>  M:	Steven Whitehouse <swhiteho@redhat.com>
>  M:	Bob Peterson <rpeterso@redhat.com>


> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 253538f..5c4b900 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -59,6 +59,7 @@ struct sched_param {
>  #include <linux/gfp.h>
>  #include <linux/magic.h>
>  #include <linux/cgroup-defs.h>
> +#include <linux/rseq.h>
>  
>  #include <asm/processor.h>
>  
> @@ -1918,6 +1919,10 @@ struct task_struct {
>  #ifdef CONFIG_MMU
>  	struct task_struct *oom_reaper_list;
>  #endif
> +#ifdef CONFIG_RSEQ
> +	struct rseq __user *rseq;
> +	uint32_t rseq_event_counter;

This is kernel code; should we not use u32 instead?

Also, do we want a comment somewhere that explains why overflow isn't a
problem?

> +#endif
>  /* CPU-specific state of this task */
>  	struct thread_struct thread;
>  /*
> @@ -3387,4 +3392,67 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
>  void cpufreq_remove_update_util_hook(int cpu);
>  #endif /* CONFIG_CPU_FREQ */
>  
> +#ifdef CONFIG_RSEQ
> +static inline void rseq_set_notify_resume(struct task_struct *t)
> +{
> +	if (t->rseq)
> +		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
> +}

Maybe I missed it, but why do we want to hook into NOTIFY_RESUME and not
have our own TIF flag?


> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> new file mode 100644
> index 0000000..3e79fa9
> --- /dev/null
> +++ b/include/uapi/linux/rseq.h
> @@ -0,0 +1,85 @@
> +#ifndef _UAPI_LINUX_RSEQ_H
> +#define _UAPI_LINUX_RSEQ_H
> +
> +/*
> + * linux/rseq.h
> + *
> + * Restartable sequences system call API
> + *
> + * Copyright (c) 2015-2016 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#ifdef __KERNEL__
> +# include <linux/types.h>
> +#else	/* #ifdef __KERNEL__ */
> +# include <stdint.h>
> +#endif	/* #else #ifdef __KERNEL__ */
> +
> +#include <asm/byteorder.h>
> +
> +#ifdef __LP64__
> +# define RSEQ_FIELD_u32_u64(field)	uint64_t field
> +#elif defined(__BYTE_ORDER) ? \
> +	__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> +# define RSEQ_FIELD_u32_u64(field)	uint32_t _padding ## field, field
> +#else
> +# define RSEQ_FIELD_u32_u64(field)	uint32_t field, _padding ## field
> +#endif
> +
> +struct rseq_cs {
> +	RSEQ_FIELD_u32_u64(start_ip);
> +	RSEQ_FIELD_u32_u64(post_commit_ip);
> +	RSEQ_FIELD_u32_u64(abort_ip);
> +} __attribute__((aligned(sizeof(uint64_t))));

Do we want to either grow that alignment to L1_CACHE_BYTES or place a
comment nearby noting that, for performance, it is best to ensure the
whole thing fits into one cache line?

Alternatively, growing the alignment to 4*8 bytes would probably be
sufficient to ensure that, and would waste fewer bytes.
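
Concretely, that second suggestion would amount to something like
(sketch only):

struct rseq_cs {
	RSEQ_FIELD_u32_u64(start_ip);
	RSEQ_FIELD_u32_u64(post_commit_ip);
	RSEQ_FIELD_u32_u64(abort_ip);
} __attribute__((aligned(4 * sizeof(uint64_t))));
	/* 32-byte alignment: the 24-byte struct can never straddle a cache line. */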

> +struct rseq {
> +	union {
> +		struct {
> +			/*
> +			 * Restartable sequences cpu_id field.
> +			 * Updated by the kernel, and read by user-space with
> +			 * single-copy atomicity semantics. Aligned on 32-bit.
> +			 * Negative values are reserved for user-space.
> +			 */
> +			int32_t cpu_id;
> +			/*
> +			 * Restartable sequences event_counter field.
> +			 * Updated by the kernel, and read by user-space with
> +			 * single-copy atomicity semantics. Aligned on 32-bit.
> +			 */
> +			uint32_t event_counter;
> +		} e;
> +		/*
> +		 * On architectures with 64-bit aligned reads, both cpu_id and
> +		 * event_counter can be read with single-copy atomicity
> +		 * semantics.
> +		 */
> +		uint64_t v;
> +	} u;
> +	/*
> +	 * Restartable sequences rseq_cs field.
> +	 * Updated by user-space, read by the kernel with
> +	 * single-copy atomicity semantics. Aligned on 64-bit.
> +	 */
> +	RSEQ_FIELD_u32_u64(rseq_cs);
> +} __attribute__((aligned(sizeof(uint64_t))));

2*sizeof(uint64_t) ?

Also, I think it would be good to have a comment explaining why this is
split into two structures. Don't you rely on the address dependency?
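
As an illustration only (comment wording assumed, not from the patch),
that could look like:

/*
 * struct rseq is the per-thread TLS area registered with the kernel;
 * the rseq_cs field points to a separate struct rseq_cs descriptor, so
 * the kernel's read of the descriptor is ordered after its read of the
 * pointer (address dependency). 16-byte alignment keeps the 16-byte
 * structure within a single cache line.
 */
struct rseq {
	union {
		struct {
			int32_t cpu_id;
			uint32_t event_counter;
		} e;
		uint64_t v;
	} u;
	RSEQ_FIELD_u32_u64(rseq_cs);
} __attribute__((aligned(2 * sizeof(uint64_t))));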

> +#endif /* _UAPI_LINUX_RSEQ_H */

> diff --git a/kernel/rseq.c b/kernel/rseq.c
> new file mode 100644
> index 0000000..e1c847b
> --- /dev/null
> +++ b/kernel/rseq.c
> @@ -0,0 +1,231 @@

> +/*
> + * Each restartable sequence assembly block defines a "struct rseq_cs"
> + * structure which describes the post_commit_ip address, and the
> + * abort_ip address where the kernel should move the thread instruction
> + * pointer if a rseq critical section assembly block is preempted or if
> + * a signal is delivered on top of a rseq critical section assembly
> + * block. It also contains a start_ip, which is the address of the start
> + * of the rseq assembly block, which is useful to debuggers.
> + *
> + * The algorithm for a restartable sequence assembly block is as
> + * follows:
> + *
> + * rseq_start()
> + *
> + *   0. Userspace loads the current event counter value from the
> + *      event_counter field of the registered struct rseq TLS area,
> + *
> + * rseq_finish()
> + *
> + *   Steps [1]-[3] (inclusive) need to be a sequence of instructions in
> + *   userspace that can handle being moved to the abort_ip between any
> + *   of those instructions.
> + *
> + *   The abort_ip address needs to be equal or above the post_commit_ip.

Above, as in: abort_ip >= post_commit_ip? Would not 'after' or
greater-or-equal be easier to understand?

> + *   Step [4] and the failure code step [F1] need to be at addresses
> + *   equal or above the post_commit_ip.

idem.

> + *   1.  Userspace stores the address of the struct rseq cs rseq
> + *       assembly block descriptor into the rseq_cs field of the
> + *       registered struct rseq TLS area.

And this should be something like a store-release, which would
basically be a regular store, but one that restrains the compiler from
moving the stores that initialize the structure itself to after it.
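
A userspace sketch of that idea (illustrative names, LP64 and a
GCC/Clang toolchain assumed):

static struct rseq_cs cs;	/* start/post_commit/abort IPs filled in beforehand */
extern __thread struct rseq rseq_tls;	/* area registered through sys_rseq() */

static inline void rseq_publish_cs(void)
{
	/*
	 * Release store: the stores initializing "cs" cannot be moved
	 * after the store that makes it visible to the kernel. A plain
	 * compiler barrier would also do, since the kernel reads it
	 * from the same CPU.
	 */
	__atomic_store_n(&rseq_tls.rseq_cs, (uint64_t)(uintptr_t)&cs,
			 __ATOMIC_RELEASE);
}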

> + *
> + *   2.  Userspace tests to see whether the current event counter values
> + *       match those loaded at [0]. Manually jumping to [F1] in case of
> + *       a mismatch.
> + *
> + *       Note that if we are preempted or interrupted by a signal
> + *       after [1] and before post_commit_ip, then the kernel also
> + *       performs the comparison performed in [2], and conditionally
> + *       clears rseq_cs, then jumps us to abort_ip.
> + *
> + *   3.  Userspace critical section final instruction before
> + *       post_commit_ip is the commit. The critical section is
> + *       self-terminating.
> + *       [post_commit_ip]
> + *
> + *   4.  Userspace clears the rseq_cs field of the struct rseq
> + *       TLS area.
> + *
> + *   5.  Return true.
> + *
> + *   On failure at [2]:
> + *
> + *   F1. Userspace clears the rseq_cs field of the struct rseq
> + *       TLS area. Followed by step [F2].
> + *
> + *       [abort_ip]
> + *   F2. Return false.
> + */
> +
> +static int rseq_increment_event_counter(struct task_struct *t)
> +{
> +	if (__put_user(++t->rseq_event_counter,
> +			&t->rseq->u.e.event_counter))
> +		return -1;
> +	return 0;
> +}

this,

> +static int rseq_get_rseq_cs(struct task_struct *t,
> +		void __user **post_commit_ip,
> +		void __user **abort_ip)
> +{
> +	unsigned long ptr;
> +	struct rseq_cs __user *rseq_cs;
> +
> +	if (__get_user(ptr, &t->rseq->rseq_cs))
> +		return -1;
> +	if (!ptr)
> +		return 0;
> +#ifdef CONFIG_COMPAT
> +	if (in_compat_syscall()) {
> +		rseq_cs = compat_ptr((compat_uptr_t)ptr);
> +		if (get_user(ptr, &rseq_cs->post_commit_ip))
> +			return -1;
> +		*post_commit_ip = compat_ptr((compat_uptr_t)ptr);
> +		if (get_user(ptr, &rseq_cs->abort_ip))
> +			return -1;
> +		*abort_ip = compat_ptr((compat_uptr_t)ptr);
> +		return 0;
> +	}
> +#endif
> +	rseq_cs = (struct rseq_cs __user *)ptr;
> +	if (get_user(ptr, &rseq_cs->post_commit_ip))
> +		return -1;
> +	*post_commit_ip = (void __user *)ptr;
> +	if (get_user(ptr, &rseq_cs->abort_ip))
> +		return -1;

Given we want all 3 of those values on a single cache line, and doing 3
get_user() calls ends up doing 3 pairs of STAC/CLAC, should we not use
either __copy_from_user_inatomic() or unsafe_get_user() paired with a
user_access_begin()/user_access_end() pair?
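
Not necessarily the exact variant meant above, but as a sketch, a single
bulk copy of the descriptor (native, non-compat case only) already gets
this down to one STAC/CLAC pair:

	struct rseq_cs cs;

	/* One user access section instead of three separate get_user() calls. */
	if (copy_from_user(&cs, rseq_cs, sizeof(cs)))
		return -1;
	*post_commit_ip = (void __user *)(unsigned long)cs.post_commit_ip;
	*abort_ip = (void __user *)(unsigned long)cs.abort_ip;
	return 0;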

> +	*abort_ip = (void __user *)ptr;
> +	return 0;
> +}

this and,

> +static int rseq_ip_fixup(struct pt_regs *regs)
> +{
> +	struct task_struct *t = current;
> +	void __user *post_commit_ip = NULL;
> +	void __user *abort_ip = NULL;
> +
> +	if (rseq_get_rseq_cs(t, &post_commit_ip, &abort_ip))
> +		return -1;
> +
> +	/* Handle potentially being within a critical section. */
> +	if ((void __user *)instruction_pointer(regs) < post_commit_ip) {

Alternatively you can do:

	if (likely((void __user *)instruction_pointer(regs) >= post_commit_ip))
		return 0;

and you can save an indent level below.

> +		/*
> +		 * We need to clear rseq_cs upon entry into a signal
> +		 * handler nested on top of a rseq assembly block, so
> +		 * the signal handler will not be fixed up if itself
> +		 * interrupted by a nested signal handler or preempted.
> +		 */
> +		if (clear_user(&t->rseq->rseq_cs,
> +				sizeof(t->rseq->rseq_cs)))
> +			return -1;
> +
> +		/*
> +		 * We set this after potentially failing in
> +		 * clear_user so that the signal arrives at the
> +		 * faulting rip.
> +		 */
> +		instruction_pointer_set(regs, (unsigned long)abort_ip);
> +	}
> +	return 0;
> +}

this function looks like it should return bool.

> +/*
> + * This resume handler should always be executed between any of:
> + * - preemption,
> + * - signal delivery,
> + * and return to user-space.
> + */
> +void __rseq_handle_notify_resume(struct pt_regs *regs)
> +{
> +	struct task_struct *t = current;
> +
> +	if (unlikely(t->flags & PF_EXITING))
> +		return;
> +	if (!access_ok(VERIFY_WRITE, t->rseq, sizeof(*t->rseq)))
> +		goto error;
> +	if (__put_user(raw_smp_processor_id(), &t->rseq->u.e.cpu_id))
> +		goto error;
> +	if (rseq_increment_event_counter(t))

It seems a shame not to use a single __put_user() here. You did the
layout explicitly to allow for this, but then you don't use it.
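
A sketch of what that could look like (little-endian layout assumed, so
cpu_id ends up in the low half of u.v):

	u64 val;

	/* Publish cpu_id and event_counter with a single 64-bit store. */
	val = ((u64)(++t->rseq_event_counter) << 32) |
	      (u32)raw_smp_processor_id();
	if (__put_user(val, &t->rseq->u.v))
		goto error;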

> +		goto error;
> +	if (rseq_ip_fixup(regs))
> +		goto error;
> +	return;
> +
> +error:
> +	force_sig(SIGSEGV, t);
> +}
> +
> +/*
> + * sys_rseq - setup restartable sequences for caller thread.
> + */
> +SYSCALL_DEFINE2(rseq, struct rseq __user *, rseq, int, flags)
> +{
> +	if (unlikely(flags))
> +		return -EINVAL;

(add whitespace)

> +	if (!rseq) {
> +		if (!current->rseq)
> +			return -ENOENT;
> +		return 0;
> +	}
> +
> +	if (current->rseq) {
> +		/*
> +		 * If rseq is already registered, check whether
> +		 * the provided address differs from the prior
> +		 * one.
> +		 */
> +		if (current->rseq != rseq)
> +			return -EBUSY;

Why explicitly allow resetting the same value?

> +	} else {
> +		/*
> +		 * If there was no rseq previously registered,
> +		 * we need to ensure the provided rseq is
> +		 * properly aligned and valid.
> +		 */
> +		if (!IS_ALIGNED((unsigned long)rseq, sizeof(uint64_t)))
> +			return -EINVAL;
> +		if (!access_ok(VERIFY_WRITE, rseq, sizeof(*rseq)))
> +			return -EFAULT;

GCC has __alignof__(struct rseq) for this. And as per the above, I would
recommend you change this to 2*sizeof(u64) to ensure the whole thing
fits on a single cache line.
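
i.e. something along the lines of (sketch):

		if (!IS_ALIGNED((unsigned long)rseq, __alignof__(struct rseq)))
			return -EINVAL;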

> +		current->rseq = rseq;
> +		/*
> +		 * If rseq was previously inactive, and has just
> +		 * been registered, ensure the cpu_id and
> +		 * event_counter fields are updated before
> +		 * returning to user-space.
> +		 */
> +		rseq_set_notify_resume(current);
> +	}
> +
> +	return 0;
> +}
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 51d7105..fbef0c3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2664,6 +2664,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>  {
>  	sched_info_switch(rq, prev, next);
>  	perf_event_task_sched_out(prev, next);
> +	rseq_sched_out(prev);

One thing I considered is doing something like:

static inline void rseq_sched_out(struct task_struct *t)
{
	unsigned long ptr;
	int err;

	if (!t->rseq)
		return;

	err = __get_user(ptr, &t->rseq->rseq_cs);
	if (err || ptr)
		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
}

That will optimistically try to read the rseq_cs pointer and, on success
and empty (the most likely case), avoid setting the TIF flag.

This will require an explicit migration hook to unconditionally set the
TIF flag such that we keep the cpu_id field correct of course.

And obviously we can do this later, as an optimization. It's just
something I figured might be worth it.
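
The migration hook itself could be as trivial as (sketch, hook name made
up):

static inline void rseq_migrate(struct task_struct *t)
{
	/* Keep cpu_id fresh even when no rseq_cs was armed at sched out. */
	if (t->rseq)
		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
}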

>  	fire_sched_out_preempt_notifiers(prev, next);
>  	prepare_lock_switch(rq, next);
>  	prepare_arch_switch(next);

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-03 13:19   ` Peter Zijlstra
@ 2016-08-03 14:53     ` Paul E. McKenney
  2016-08-03 15:45     ` Boqun Feng
  2016-08-09 20:06     ` Mathieu Desnoyers
  2 siblings, 0 replies; 82+ messages in thread
From: Paul E. McKenney @ 2016-08-03 14:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, Steven Rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Wed, Aug 03, 2016 at 03:19:40PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 21, 2016 at 05:14:16PM -0400, Mathieu Desnoyers wrote:
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 1209323..daef027 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -5085,6 +5085,13 @@ M:	Joe Perches <joe@perches.com>
> >  S:	Maintained
> >  F:	scripts/get_maintainer.pl
> >  
> > +RESTARTABLE SEQUENCES SUPPORT
> > +M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> 
> It would be good to have multiple people here, if we lack volunteers I'd
> be willing. Paul, Andrew any of you guys willing?

I will join you in the "if we lack volunteers" category.

							Thanx, Paul

> > +L:	linux-kernel@vger.kernel.org
> > +S:	Supported
> > +F:	kernel/rseq.c
> > +F:	include/uapi/linux/rseq.h
> > +
> >  GFS2 FILE SYSTEM
> >  M:	Steven Whitehouse <swhiteho@redhat.com>
> >  M:	Bob Peterson <rpeterso@redhat.com>
> 
> [...]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-03 13:19   ` Peter Zijlstra
  2016-08-03 14:53     ` Paul E. McKenney
@ 2016-08-03 15:45     ` Boqun Feng
  2016-08-07 15:36       ` Mathieu Desnoyers
  2016-08-09 20:06     ` Mathieu Desnoyers
  2 siblings, 1 reply; 82+ messages in thread
From: Boqun Feng @ 2016-08-03 15:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, Steven Rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk

[-- Attachment #1: Type: text/plain, Size: 1043 bytes --]

On Wed, Aug 03, 2016 at 03:19:40PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 21, 2016 at 05:14:16PM -0400, Mathieu Desnoyers wrote:
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 1209323..daef027 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -5085,6 +5085,13 @@ M:	Joe Perches <joe@perches.com>
> >  S:	Maintained
> >  F:	scripts/get_maintainer.pl
> >  
> > +RESTARTABLE SEQUENCES SUPPORT
> > +M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> 
> It would be good to have multiple people here, if we lack volunteers I'd
> be willing. Paul, Andrew any of you guys willing?
> 

I volunteer to review related patches, do tests/benchmarks(esp. on PPC)
and try to fix/improve any issue as I can.

Mathieu, may I join the party? ;-)

Regards,
Boqun

> > +L:	linux-kernel@vger.kernel.org
> > +S:	Supported
> > +F:	kernel/rseq.c
> > +F:	include/uapi/linux/rseq.h
> > +
> >  GFS2 FILE SYSTEM
> >  M:	Steven Whitehouse <swhiteho@redhat.com>
> >  M:	Bob Peterson <rpeterso@redhat.com>
> 
[...]

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-03 12:27       ` Peter Zijlstra
@ 2016-08-03 16:37         ` Andy Lutomirski
  2016-08-03 18:31           ` Christoph Lameter
  2016-08-04  4:27           ` Boqun Feng
  0 siblings, 2 replies; 82+ messages in thread
From: Andy Lutomirski @ 2016-08-03 16:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Wed, Aug 3, 2016 at 5:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jul 26, 2016 at 03:02:19AM +0000, Mathieu Desnoyers wrote:
>> We really care about preemption here. Every migration implies a
>> preemption from a user-space perspective. If we would only care
>> about keeping the CPU id up-to-date, hooking into migration would be
>> enough. But since we want atomicity guarantees for restartable
>> sequences, we need to hook into preemption.
>
>> It allows user-space to perform update operations on per-cpu data without
>> requiring heavy-weight atomic operations.
>
> Well, a CMPXCHG without LOCK prefix isn't all that expensive on x86.
>
> It is however on PPC and possibly other architectures, so in name of
> simplicity supporting only the one variant makes sense.
>

I wouldn't want to depend on CMPXCHG.  But imagine we had primitives
that were narrower than the full abort-on-preemption primitive.
Specifically, suppose we had abort if (actual cpu != expected_cpu ||
*aptr != aval).  We could do things like:

expected_cpu = cpu;
aval = NULL;  // disarm for now
begin();
aval = event_count[cpu] + 1;
event_count[cpu] = aval;
event_count[cpu]++;

... compute something ...

// arm the rest of it
aptr = &event_count[cpu];
if (*aptr != aval)
  goto fail;

*thing_im_writing = value_i_computed;
end();

The idea here is that we don't rely on the scheduler to increment the
event count at all, which means that we get to determine the scope of
what kinds of access conflicts we care about ourselves.

This has an obvious downside: it's more complicated.

It has several benefits, I think.  It's debuggable without hassle
(unless someone, accidentally or otherwise, sets aval incorrectly).
It also allows much longer critical sections to work well, as merely
being preempted in the middle won't cause an abort any more.

So I'm hoping to understand whether we could make something like this
work.  This whole thing is roughly equivalent to abort-if-migrated
plus an atomic "if (*aptr == aval) *b = c;" operation.

(I think that, if this worked, we could improve it a bit by making the
abort operation jump back to the "if (*aptr != aval) goto fail;" code,
which should reduce the scope for error a bit and also reduces the
need for extra code paths that only execute on an abort.)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-07-26  3:02     ` Mathieu Desnoyers
  2016-08-03 12:27       ` Peter Zijlstra
@ 2016-08-03 18:29       ` Christoph Lameter
  2016-08-10 16:47         ` Mathieu Desnoyers
  1 sibling, 1 reply; 82+ messages in thread
From: Christoph Lameter @ 2016-08-03 18:29 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andy Lutomirski, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andi Kleen,
	Dave Watson, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Tue, 26 Jul 2016, Mathieu Desnoyers wrote:

> > What problem does this solve?
>
> It allows user-space to perform update operations on per-cpu data without
> requiring heavy-weight atomic operations.


This is great but seems to indicate that such a facility would be better
for kernel code instead of user space code.

> First, prohibiting migration from user-space has been frowned upon
> by scheduler developers for a long time, and I doubt this mindset will
> change.

Note that the task isolation patchset from Chris Metcalf does something
that goes a long way towards this. If you set strict isolation mode then
the kernel will terminate the process or notify you if the scheduler
becomes involved. In some way we are getting that as a side effect.

Also prohibiting migration is trivial from user space. Just do a taskset
to a single CPU.
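
The programmatic equivalent is a couple of lines as well (sketch):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU; returns 0 on success. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}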

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-03 16:37         ` Andy Lutomirski
@ 2016-08-03 18:31           ` Christoph Lameter
  2016-08-04  5:01             ` Andy Lutomirski
  2016-08-04  4:27           ` Boqun Feng
  1 sibling, 1 reply; 82+ messages in thread
From: Christoph Lameter @ 2016-08-03 18:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Mathieu Desnoyers, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	linux-api, Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Ben Maurer, rostedt, Paul E. McKenney, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon, Michael Kerrisk,
	Boqun Feng

On Wed, 3 Aug 2016, Andy Lutomirski wrote:

> > Well, a CMPXCHG without LOCK prefix isn't all that expensive on x86.
> >
> > It is however on PPC and possibly other architectures, so in name of
> > simplicity supporting only the one variant makes sense.
> >
>
> I wouldn't want to depend on CMPXCHG.  But imagine we had primitives
> that were narrower than the full abort-on-preemption primitive.
> Specifically, suppose we had abort if (actual cpu != expected_cpu ||
> *aptr != aval).  We could do things like:
>

The latency issues that are addressed by restartable sequences require
minimal instruction overhead. Lockless CMPXCHG is very important in that
area and I would not simply remove it from consideration.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-03 16:37         ` Andy Lutomirski
  2016-08-03 18:31           ` Christoph Lameter
@ 2016-08-04  4:27           ` Boqun Feng
  2016-08-04  5:03             ` Andy Lutomirski
  1 sibling, 1 reply; 82+ messages in thread
From: Boqun Feng @ 2016-08-04  4:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Mathieu Desnoyers, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	linux-api, Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

[-- Attachment #1: Type: text/plain, Size: 3467 bytes --]

On Wed, Aug 03, 2016 at 09:37:57AM -0700, Andy Lutomirski wrote:
> On Wed, Aug 3, 2016 at 5:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Jul 26, 2016 at 03:02:19AM +0000, Mathieu Desnoyers wrote:
> >> We really care about preemption here. Every migration implies a
> >> preemption from a user-space perspective. If we would only care
> >> about keeping the CPU id up-to-date, hooking into migration would be
> >> enough. But since we want atomicity guarantees for restartable
> >> sequences, we need to hook into preemption.
> >
> >> It allows user-space to perform update operations on per-cpu data without
> >> requiring heavy-weight atomic operations.
> >
> > Well, a CMPXCHG without LOCK prefix isn't all that expensive on x86.
> >
> > It is however on PPC and possibly other architectures, so in name of
> > simplicity supporting only the one variant makes sense.
> >
> 
> I wouldn't want to depend on CMPXCHG.  But imagine we had primitives
> that were narrower than the full abort-on-preemption primitive.
> Specifically, suppose we had abort if (actual cpu != expected_cpu ||
> *aptr != aval).  We could do things like:
> 
> expected_cpu = cpu;
> aval = NULL;  // disarm for now
> begin();
> aval = event_count[cpu] + 1;
> event_count[cpu] = aval;
> event_count[cpu]++;

This line is redundant, right? Because it will guarantee a failure even
in no-contention cases.

> 
> ... compute something ...
> 
> // arm the rest of it
> aptr = &event_count[cpu];
> if (*aptr != aval)
>   goto fail;
> 
> *thing_im_writing = value_i_computed;
> end();
> 
> The idea here is that we don't rely on the scheduler to increment the
> event count at all, which means that we get to determine the scope of
> what kinds of access conflicts we care about ourselves.
> 

If we increase the event count in userspace, how could we prevent two
userspace threads from racing on the event_count[cpu] field? For
example:

	CPU 0
	================
	{event_count[0] is initially 0}

	[Thread 1]
	begin();
	aval = event_count[cpu] + 1; // 1

	(preempted)
	[Thread 2]
	begin();
	aval = event_count[cpu] + 1; // 1, too
	event_count[cpu] = aval; // event_count[0] is 1

	(preempted)
	[Thread 1]
	event_count[cpu] = aval; // event_count[0] is 1

	... 

	aptr = &event_count[cpu];
	if (*aptr != aval) // false.
		...

	[Thread 2]
	aptr = &event_count[cpu];
	if (*aptr != aval) // false.
		...

, in which case, both the critical sections are successful, and Thread 1
and Thread 2 will race on *thing_im_writing.

Am I missing your point here?

Regards,
Boqun

> This has an obvious downside: it's more complicated.
> 
> It has several benefits, I think.  It's debuggable without hassle
> (unless someone, accidentally or otherwise, sets aval incorrectly).
> It also allows much longer critical sections to work well, as merely
> being preempted in the middle won't cause an abort any more.
> 
> So I'm hoping to understand whether we could make something like this
> work.  This whole thing is roughly equivalent to abort-if-migrated
> plus an atomic "if (*aptr == aval) *b = c;" operation.
> 
> (I think that, if this worked, we could improve it a bit by making the
> abort operation jump back to the "if (*aptr != aval) goto fail;" code,
> which should reduce the scope for error a bit and also reduces the
> need for extra code paths that only execute on an abort.)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-03 18:31           ` Christoph Lameter
@ 2016-08-04  5:01             ` Andy Lutomirski
  0 siblings, 0 replies; 82+ messages in thread
From: Andy Lutomirski @ 2016-08-04  5:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ben Maurer, Thomas Gleixner, Ingo Molnar, Russell King,
	linux-api, Andrew Morton, Michael Kerrisk, Dave Watson, rostedt,
	Will Deacon, Linus Torvalds, Mathieu Desnoyers, Paul E. McKenney,
	Andi Kleen, Josh Triplett, Paul Turner, Boqun Feng, linux-kernel,
	Catalin Marinas, Andrew Hunter, H. Peter Anvin, Peter Zijlstra

On Aug 3, 2016 11:31 AM, "Christoph Lameter" <cl@linux.com> wrote:
>
> On Wed, 3 Aug 2016, Andy Lutomirski wrote:
>
> > > Well, a CMPXCHG without LOCK prefix isn't all that expensive on x86.
> > >
> > > It is however on PPC and possibly other architectures, so in name of
> > > simplicity supporting only the one variant makes sense.
> > >
> >
> > I wouldn't want to depend on CMPXCHG.  But imagine we had primitives
> > that were narrower than the full abort-on-preemption primitive.
> > Specifically, suppose we had abort if (actual cpu != expected_cpu ||
> > *aptr != aval).  We could do things like:
> >
>
> The latency issues that are addressed by restartable sequences require
> minimal instruction overhead. Lockless CMPXCHG is very important in that
> area and I would not simply remove it from consideration.

What I mean is: I think the solution shouldn't depend on the
x86-specific unlocked CMPXCHG instruction if it can be avoided.
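
To make that concrete, the rseq model expresses the same per-cpu update
without any cmpxchg at all. A rough userspace sketch (helper names and
signatures loosely modeled on the selftests, so treat them as
assumptions):

/*
 * Per-cpu "compare and store" built on rseq: the plain load, compare
 * and store are safe because any preemption or signal between
 * rseq_start() and the commit makes rseq_finish() fail, and we retry.
 */
static bool percpu_cas(intptr_t *slot, intptr_t expect, intptr_t newval)
{
	struct rseq_state start;

	do {
		start = rseq_start();
		if (*slot != expect)
			return false;	/* value changed, let the caller decide */
	} while (!rseq_finish(slot, newval, start));
	return true;
}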

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-04  4:27           ` Boqun Feng
@ 2016-08-04  5:03             ` Andy Lutomirski
  2016-08-09 16:13               ` Boqun Feng
  2016-08-10  8:13               ` Andy Lutomirski
  0 siblings, 2 replies; 82+ messages in thread
From: Andy Lutomirski @ 2016-08-04  5:03 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Mathieu Desnoyers, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	linux-api, Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Wed, Aug 3, 2016 at 9:27 PM, Boqun Feng <boqun.feng@gmail.com> wrote:
> On Wed, Aug 03, 2016 at 09:37:57AM -0700, Andy Lutomirski wrote:
>> On Wed, Aug 3, 2016 at 5:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, Jul 26, 2016 at 03:02:19AM +0000, Mathieu Desnoyers wrote:
>> >> We really care about preemption here. Every migration implies a
>> >> preemption from a user-space perspective. If we would only care
>> >> about keeping the CPU id up-to-date, hooking into migration would be
>> >> enough. But since we want atomicity guarantees for restartable
>> >> sequences, we need to hook into preemption.
>> >
>> >> It allows user-space to perform update operations on per-cpu data without
>> >> requiring heavy-weight atomic operations.
>> >
>> > Well, a CMPXCHG without LOCK prefix isn't all that expensive on x86.
>> >
>> > It is however on PPC and possibly other architectures, so in name of
>> > simplicity supporting only the one variant makes sense.
>> >
>>
>> I wouldn't want to depend on CMPXCHG.  But imagine we had primitives
>> that were narrower than the full abort-on-preemption primitive.
>> Specifically, suppose we had abort if (actual cpu != expected_cpu ||
>> *aptr != aval).  We could do things like:
>>
>> expected_cpu = cpu;
>> aval = NULL;  // disarm for now
>> begin();
>> aval = event_count[cpu] + 1;
>> event_count[cpu] = aval;
>> event_count[cpu]++;
>
> This line is redundant, right? Because it will guarantee a failure even
> in no-contention cases.
>
>>
>> ... compute something ...
>>
>> // arm the rest of it
>> aptr = &event_count[cpu];
>> if (*aptr != aval)
>>   goto fail;
>>
>> *thing_im_writing = value_i_computed;
>> end();
>>
>> The idea here is that we don't rely on the scheduler to increment the
>> event count at all, which means that we get to determine the scope of
>> what kinds of access conflicts we care about ourselves.
>>
>
> If we increase the event count in userspace, how could we prevent two
> userspace threads from racing on the event_count[cpu] field? For
> example:
>
>         CPU 0
>         ================
>         {event_count[0] is initially 0}
>
>         [Thread 1]
>         begin();
>         aval = event_count[cpu] + 1; // 1
>
>         (preempted)
>         [Thread 2]
>         begin();
>         aval = event_count[cpu] + 1; // 1, too
>         event_count[cpu] = aval; // event_count[0] is 1
>

You're right :(  This would work with an xadd instruction, but that's
very slow and doesn't exist on most architectures.  It could also work
if we did:

aval = some_tls_value++;

where some_tls_value is set up such that no two threads could ever end
up with the same values (using high bits as thread ids, perhaps), but
that's messy.  Maybe my idea is no good.
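
A sketch of that "unique per-thread values" scheme (illustrative only):

/*
 * High 32 bits identify the thread, low 32 bits are a per-thread
 * counter, so no two threads can ever produce the same aval.
 */
static __thread uint64_t some_tls_value;

static void init_some_tls_value(uint32_t unique_thread_id)
{
	some_tls_value = (uint64_t)unique_thread_id << 32;
}

static inline uint64_t next_aval(void)
{
	return ++some_tls_value;
}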

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-03 15:45     ` Boqun Feng
@ 2016-08-07 15:36       ` Mathieu Desnoyers
  2016-08-07 23:35         ` Boqun Feng
  0 siblings, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-07 15:36 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk

----- On Aug 3, 2016, at 11:45 AM, Boqun Feng boqun.feng@gmail.com wrote:

> On Wed, Aug 03, 2016 at 03:19:40PM +0200, Peter Zijlstra wrote:
>> On Thu, Jul 21, 2016 at 05:14:16PM -0400, Mathieu Desnoyers wrote:
>> > diff --git a/MAINTAINERS b/MAINTAINERS
>> > index 1209323..daef027 100644
>> > --- a/MAINTAINERS
>> > +++ b/MAINTAINERS
>> > @@ -5085,6 +5085,13 @@ M:	Joe Perches <joe@perches.com>
>> >  S:	Maintained
>> >  F:	scripts/get_maintainer.pl
>> >  
>> > +RESTARTABLE SEQUENCES SUPPORT
>> > +M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> 
>> It would be good to have multiple people here, if we lack volunteers I'd
>> be willing. Paul, Andrew any of you guys willing?
>> 
> 
> I volunteer to review related patches, do tests/benchmarks(esp. on PPC)
> and try to fix/improve any issue as I can.
> 
> Mathieu, may I join the party? ;-)

Hi!

I'm glad to see so much interest in helping me maintain rseq :)
I'll therefore tentatively add the following lines to the maintainers
list in my next round:

RESTARTABLE SEQUENCES SUPPORT
M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
M:	Peter Zijlstra <peterz@infradead.org>
M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
M:	Boqun Feng <boqun.feng@gmail.com>
L:	linux-kernel@vger.kernel.org
S:	Supported
F:	kernel/rseq.c
F:	include/uapi/linux/rseq.h

Thanks!

Mathieu

> 
> Regards,
> Boqun
> 
>> > +L:	linux-kernel@vger.kernel.org
>> > +S:	Supported
>> > +F:	kernel/rseq.c
>> > +F:	include/uapi/linux/rseq.h
>> > +
>> >  GFS2 FILE SYSTEM
>> >  M:	Steven Whitehouse <swhiteho@redhat.com>
>> >  M:	Bob Peterson <rpeterso@redhat.com>
>> 
> [...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-07 15:36       ` Mathieu Desnoyers
@ 2016-08-07 23:35         ` Boqun Feng
  2016-08-09 13:22           ` Mathieu Desnoyers
  0 siblings, 1 reply; 82+ messages in thread
From: Boqun Feng @ 2016-08-07 23:35 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk

[-- Attachment #1: Type: text/plain, Size: 2151 bytes --]

On Sun, Aug 07, 2016 at 03:36:24PM +0000, Mathieu Desnoyers wrote:
> ----- On Aug 3, 2016, at 11:45 AM, Boqun Feng boqun.feng@gmail.com wrote:
> 
> > On Wed, Aug 03, 2016 at 03:19:40PM +0200, Peter Zijlstra wrote:
> >> On Thu, Jul 21, 2016 at 05:14:16PM -0400, Mathieu Desnoyers wrote:
> >> > diff --git a/MAINTAINERS b/MAINTAINERS
> >> > index 1209323..daef027 100644
> >> > --- a/MAINTAINERS
> >> > +++ b/MAINTAINERS
> >> > @@ -5085,6 +5085,13 @@ M:	Joe Perches <joe@perches.com>
> >> >  S:	Maintained
> >> >  F:	scripts/get_maintainer.pl
> >> >  
> >> > +RESTARTABLE SEQUENCES SUPPORT
> >> > +M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >> 
> >> It would be good to have multiple people here, if we lack volunteers I'd
> >> be willing. Paul, Andrew any of you guys willing?
> >> 
> > 
> > I volunteer to review related patches, do tests/benchmarks(esp. on PPC)
> > and try to fix/improve any issue as I can.
> > 
> > Mathieu, may I join the party? ;-)
> 
> Hi!
> 
> I'm glad to see so much interest in helping me maintain rseq :)
> I'll therefore tentatively add the following lines to the maintainers
> list in my next round:
> 
> RESTARTABLE SEQUENCES SUPPORT
> M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> M:	Peter Zijlstra <peterz@infradead.org>
> M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> M:	Boqun Feng <boqun.feng@gmail.com>
> L:	linux-kernel@vger.kernel.org
> S:	Supported
> F:	kernel/rseq.c
> F:	include/uapi/linux/rseq.h
> 

Thank you, Mathieu ;-)

Maybe we also should put the selftest directory here? Like:

F:	tools/testing/selftests/rseq

Of course, this line better be added in patch 7 rather than patch 1.

Regards,
Boqun

> Thanks!
> 
> Mathieu
> 
> > 
> > Regards,
> > Boqun
> > 
> >> > +L:	linux-kernel@vger.kernel.org
> >> > +S:	Supported
> >> > +F:	kernel/rseq.c
> >> > +F:	include/uapi/linux/rseq.h
> >> > +
> >> >  GFS2 FILE SYSTEM
> >> >  M:	Steven Whitehouse <swhiteho@redhat.com>
> >> >  M:	Bob Peterson <rpeterso@redhat.com>
> >> 
> > [...]
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-07 23:35         ` Boqun Feng
@ 2016-08-09 13:22           ` Mathieu Desnoyers
  0 siblings, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-09 13:22 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk

----- On Aug 7, 2016, at 7:35 PM, Boqun Feng boqun.feng@gmail.com wrote:

> On Sun, Aug 07, 2016 at 03:36:24PM +0000, Mathieu Desnoyers wrote:
>> ----- On Aug 3, 2016, at 11:45 AM, Boqun Feng boqun.feng@gmail.com wrote:
>> 
>> > On Wed, Aug 03, 2016 at 03:19:40PM +0200, Peter Zijlstra wrote:
>> >> On Thu, Jul 21, 2016 at 05:14:16PM -0400, Mathieu Desnoyers wrote:
>> >> > diff --git a/MAINTAINERS b/MAINTAINERS
>> >> > index 1209323..daef027 100644
>> >> > --- a/MAINTAINERS
>> >> > +++ b/MAINTAINERS
>> >> > @@ -5085,6 +5085,13 @@ M:	Joe Perches <joe@perches.com>
>> >> >  S:	Maintained
>> >> >  F:	scripts/get_maintainer.pl
>> >> >  
>> >> > +RESTARTABLE SEQUENCES SUPPORT
>> >> > +M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> >> 
>> >> It would be good to have multiple people here, if we lack volunteers I'd
>> >> be willing. Paul, Andrew any of you guys willing?
>> >> 
>> > 
>> > I volunteer to review related patches, do tests/benchmarks(esp. on PPC)
>> > and try to fix/improve any issue as I can.
>> > 
>> > Mathieu, may I join the party? ;-)
>> 
>> Hi!
>> 
>> I'm glad to see so much interest in helping me maintain rseq :)
>> I'll therefore tentatively add the following lines to the maintainers
>> list in my next round:
>> 
>> RESTARTABLE SEQUENCES SUPPORT
>> M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> M:	Peter Zijlstra <peterz@infradead.org>
>> M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
>> M:	Boqun Feng <boqun.feng@gmail.com>
>> L:	linux-kernel@vger.kernel.org
>> S:	Supported
>> F:	kernel/rseq.c
>> F:	include/uapi/linux/rseq.h
>> 
> 
> Thank you, Mathieu ;-)
> 
> Maybe we also should put the selftest directory here? Like:
> 
> F:	tools/testing/selftests/rseq
> 
> Of course, this line better be added in patch 7 rather than patch 1.

Adding this to patch 7, with a "/" at the end of the line, since it
targets an entire directory.

Thanks,

Mathieu

> 
> Regards,
> Boqun
> 
>> Thanks!
>> 
>> Mathieu
>> 
>> > 
>> > Regards,
>> > Boqun
>> > 
>> >> > +L:	linux-kernel@vger.kernel.org
>> >> > +S:	Supported
>> >> > +F:	kernel/rseq.c
>> >> > +F:	include/uapi/linux/rseq.h
>> >> > +
>> >> >  GFS2 FILE SYSTEM
>> >> >  M:	Steven Whitehouse <swhiteho@redhat.com>
>> >> >  M:	Bob Peterson <rpeterso@redhat.com>
>> >> 
>> > [...]
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
> > http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-04  5:03             ` Andy Lutomirski
@ 2016-08-09 16:13               ` Boqun Feng
  2016-08-10  8:01                 ` Andy Lutomirski
  2016-08-10 17:33                 ` Mathieu Desnoyers
  2016-08-10  8:13               ` Andy Lutomirski
  1 sibling, 2 replies; 82+ messages in thread
From: Boqun Feng @ 2016-08-09 16:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Mathieu Desnoyers, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	linux-api, Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

[-- Attachment #1: Type: text/plain, Size: 5319 bytes --]

On Wed, Aug 03, 2016 at 10:03:32PM -0700, Andy Lutomirski wrote:
> On Wed, Aug 3, 2016 at 9:27 PM, Boqun Feng <boqun.feng@gmail.com> wrote:
> > On Wed, Aug 03, 2016 at 09:37:57AM -0700, Andy Lutomirski wrote:
> >> On Wed, Aug 3, 2016 at 5:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Tue, Jul 26, 2016 at 03:02:19AM +0000, Mathieu Desnoyers wrote:
> >> >> We really care about preemption here. Every migration implies a
> >> >> preemption from a user-space perspective. If we would only care
> >> >> about keeping the CPU id up-to-date, hooking into migration would be
> >> >> enough. But since we want atomicity guarantees for restartable
> >> >> sequences, we need to hook into preemption.
> >> >
> >> >> It allows user-space to perform update operations on per-cpu data without
> >> >> requiring heavy-weight atomic operations.
> >> >
> >> > Well, a CMPXCHG without LOCK prefix isn't all that expensive on x86.
> >> >
> >> > It is however on PPC and possibly other architectures, so in name of
> >> > simplicity supporting only the one variant makes sense.
> >> >
> >>
> >> I wouldn't want to depend on CMPXCHG.  But imagine we had primitives
> >> that were narrower than the full abort-on-preemption primitive.
> >> Specifically, suppose we had abort if (actual cpu != expected_cpu ||
> >> *aptr != aval).  We could do things like:
> >>
> >> expected_cpu = cpu;
> >> aval = NULL;  // disarm for now
> >> begin();
> >> aval = event_count[cpu] + 1;
> >> event_count[cpu] = aval;
> >> event_count[cpu]++;
> >
> > This line is redundant, right? Because it will guarantee a failure even
> > in no-contention cases.
> >
> >>
> >> ... compute something ...
> >>
> >> // arm the rest of it
> >> aptr = &event_count[cpu];
> >> if (*aptr != aval)
> >>   goto fail;
> >>
> >> *thing_im_writing = value_i_computed;
> >> end();
> >>
> >> The idea here is that we don't rely on the scheduler to increment the
> >> event count at all, which means that we get to determine the scope of
> >> what kinds of access conflicts we care about ourselves.
> >>
> >
> > If we increase the event count in userspace, how could we prevent two
> > userspace threads from racing on the event_count[cpu] field? For
> > example:
> >
> >         CPU 0
> >         ================
> >         {event_count[0] is initially 0}
> >
> >         [Thread 1]
> >         begin();
> >         aval = event_count[cpu] + 1; // 1
> >
> >         (preempted)
> >         [Thread 2]
> >         begin();
> >         aval = event_count[cpu] + 1; // 1, too
> >         event_count[cpu] = aval; // event_count[0] is 1
> >
> 
> You're right :(  This would work with an xadd instruction, but that's
> very slow and doesn't exist on most architectures.  It could also work
> if we did:
> 
> aval = some_tls_value++;
> 
> where some_tls_value is set up such that no two threads could ever end
> up with the same values (using high bits as thread ids, perhaps), but
> that's messy.  Maybe my idea is no good.

This is a little more complex, plus I failed to find a way to do an
atomic "if (*aptr == aval) *b = c" in userspace ;-(

However, I'm thinking maybe we can use some tricks to avoid unnecessary
aborts-on-preemption.

First of all, I notice we haven't made any constraints on what kinds of
memory objects can be "protected" by rseq critical sections yet. And I
think this is something we should decide before adding this feature into
the kernel.

We can do some optimizations if we have some constraints. For example, if
the memory objects inside rseq critical sections can only be modified by
userspace programs, we don't need to abort immediately on a userspace
task -> kernel task context switch.

Furthermore, if the memory objects inside rseq critical sections can
only be modified by userspace programs that have registered their rseq
structures, we don't need to abort immediately on context switches
between two rseq-unregistered tasks, or between an rseq-registered task
and an rseq-unregistered task.

Instead, we can do tricks as follows:

defining a percpu pointer in the kernel:

DEFINE_PER_CPU(struct task_struct *, rseq_owner);

and a cpu field in struct task_struct:

	struct task_struct {
	...
	#ifdef CONFIG_RSEQ
		struct rseq __user *rseq;
		uint32_t rseq_event_counter;
		int rseq_cpu;
	#endif
	...
	};

(task_struct::rseq_cpu should be initialized to -1.)

each time at sched out (in rseq_sched_out()), we do something like:

	if (prev->rseq) {
		raw_cpu_write(rseq_owner, prev);
		prev->rseq_cpu = smp_processor_id();
	}

each time at sched in (in rseq_handle_notify_resume()), we do something
like:

	if (current->rseq &&
	    (this_cpu_read(rseq_owner) != current || 
	     current->rseq_cpu != smp_processor_id()))
		__rseq_handle_notify_resume(regs);

(Also need to modify rseq_signal_deliver() to call
__rseq_handle_notify_resume() directly).
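
A sketch of that change, assuming the hook keeps its current name:

	static inline void rseq_signal_deliver(struct pt_regs *regs)
	{
		/* Signal delivery always needs the fixup, skip the owner check. */
		if (current->rseq)
			__rseq_handle_notify_resume(regs);
	}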


I think this could save some unnecessary aborts-on-preemption; however,
TBH, I'm too sleepy to verify every corner case. Will recheck this
tomorrow.

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-03 13:19   ` Peter Zijlstra
  2016-08-03 14:53     ` Paul E. McKenney
  2016-08-03 15:45     ` Boqun Feng
@ 2016-08-09 20:06     ` Mathieu Desnoyers
  2016-08-09 21:33       ` Peter Zijlstra
                         ` (3 more replies)
  2 siblings, 4 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-09 20:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Aug 3, 2016, at 9:19 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Thu, Jul 21, 2016 at 05:14:16PM -0400, Mathieu Desnoyers wrote:
[...]
> 
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 253538f..5c4b900 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -59,6 +59,7 @@ struct sched_param {
>>  #include <linux/gfp.h>
>>  #include <linux/magic.h>
>>  #include <linux/cgroup-defs.h>
>> +#include <linux/rseq.h>
>>  
>>  #include <asm/processor.h>
>>  
>> @@ -1918,6 +1919,10 @@ struct task_struct {
>>  #ifdef CONFIG_MMU
>>  	struct task_struct *oom_reaper_list;
>>  #endif
>> +#ifdef CONFIG_RSEQ
>> +	struct rseq __user *rseq;
>> +	uint32_t rseq_event_counter;
> 
> This is kernel code, should we not use u32 instead?
> 

Good point. Will fix.

> Also, do we want a comment somewhere that explains why overflow isn't a
> problem?

I can add a comment about rseq_increment_event_counter stating:

 * Overflow of the event counter is not a problem in practice. It
 * increments at most once between each user-space thread instruction
 * executed, so we would need a thread to execute 2^32 instructions or
 * more between rseq_start() and rseq_finish(), while single-stepping,
 * for this to be an issue.
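
For reference, user-space only ever compares the counter for equality,
along the lines of the following sketch (the TLS variable name is
illustrative only):

        /* rseq_start(): snapshot the counter before the C code runs. */
        uint32_t start_cnt = rseq_tls.u.e.event_counter;

        /* ... critical section C code ... */

        /* rseq_finish(): the asm sequence re-checks it before the commit. */
        if (rseq_tls.u.e.event_counter != start_cnt)
                goto abort;     /* restart the whole sequence */

so a wrap-around is only visible if the counter comes back to the exact
same value between the two reads.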

Is it fine, or should we be more conservative and care about the overflow,
extending the counter to a 64-bit value in the process ?

> 
>> +#endif
>>  /* CPU-specific state of this task */
>>  	struct thread_struct thread;
>>  /*
>> @@ -3387,4 +3392,67 @@ void cpufreq_add_update_util_hook(int cpu, struct
>> update_util_data *data,
>>  void cpufreq_remove_update_util_hook(int cpu);
>>  #endif /* CONFIG_CPU_FREQ */
>>  
>> +#ifdef CONFIG_RSEQ
>> +static inline void rseq_set_notify_resume(struct task_struct *t)
>> +{
>> +	if (t->rseq)
>> +		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
>> +}
> 
> Maybe I missed it, but why do we want to hook into NOTIFY_RESUME and not
> have our own TIF flag?

The short answer is that I used the same approach as Paul Turner's patchset. ;)

Taking a deeper look into this, the only times we set the flag are when
preempting or delivering a signal to a thread that has registered with
rseq.

Upon return to user-space with the flag set, the performance difference
between having our own flag and hopping onto the NOTIFY_RESUME bandwagon
is that with our own flag we can skip the various tests in
exit_to_usermode_loop(), at the expense of crowding the thread flags even
nearer to filling up 32 bits, which will at some point require extra
tests on the fast-path.

Thinking about it, one benchmark I have not done so far is to modify
hackbench so it registers its threads with the rseq system call. We
can then figure out whether reserving a flag for rseq is justified or
not.
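
The per-thread registration itself is tiny; a minimal sketch of what the
modified hackbench does in each thread (assuming __NR_rseq is wired up as
in this series; the TLS variable and helper names are illustrative):

        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdio.h>
        #include <linux/rseq.h>         /* struct rseq, from this series */

        static __thread struct rseq rseq_state; /* zero-initialized TLS area */

        static void rseq_register_this_thread(void)
        {
                /* Register this thread's struct rseq TLS area with the kernel. */
                if (syscall(__NR_rseq, &rseq_state, 0))
                        perror("rseq registration");
        }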

Comparing 10 runs of hackbench registering its sender/receiver threads
with unmodified hackbench: (hackbench -l 100000)

    Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
    2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
    saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
    kernel parameter), with a Linux v4.7 defconfig+localyesconfig,
    restartable sequences series applied.

                                    Avg. Time (s)    Std.dev. (s)
Unmodified Hackbench                    40.5            0.1
Rseq-Registered Hackbench Threads       40.4            0.1

So initial results seem to indicate that adding the notify_resume
handling upon preemption does not have a noticeable effect on
performance, so I don't consider it worthwhile to try optimizing
it by reserving its own thread flag. Or perhaps I am missing something
important here?

> 
> 
>> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
>> new file mode 100644
>> index 0000000..3e79fa9
>> --- /dev/null
>> +++ b/include/uapi/linux/rseq.h
>> @@ -0,0 +1,85 @@
>> +#ifndef _UAPI_LINUX_RSEQ_H
>> +#define _UAPI_LINUX_RSEQ_H
>> +
>> +/*
>> + * linux/rseq.h
>> + *
>> + * Restartable sequences system call API
>> + *
>> + * Copyright (c) 2015-2016 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>> + * of this software and associated documentation files (the "Software"), to
>> deal
>> + * in the Software without restriction, including without limitation the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>> + * copies of the Software, and to permit persons to whom the Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
>> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
>> FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>> THE
>> + * SOFTWARE.
>> + */
>> +
>> +#ifdef __KERNEL__
>> +# include <linux/types.h>
>> +#else	/* #ifdef __KERNEL__ */
>> +# include <stdint.h>
>> +#endif	/* #else #ifdef __KERNEL__ */
>> +
>> +#include <asm/byteorder.h>
>> +
>> +#ifdef __LP64__
>> +# define RSEQ_FIELD_u32_u64(field)	uint64_t field
>> +#elif defined(__BYTE_ORDER) ? \
>> +	__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
>> +# define RSEQ_FIELD_u32_u64(field)	uint32_t _padding ## field, field
>> +#else
>> +# define RSEQ_FIELD_u32_u64(field)	uint32_t field, _padding ## field
>> +#endif
>> +
>> +struct rseq_cs {
>> +	RSEQ_FIELD_u32_u64(start_ip);
>> +	RSEQ_FIELD_u32_u64(post_commit_ip);
>> +	RSEQ_FIELD_u32_u64(abort_ip);
>> +} __attribute__((aligned(sizeof(uint64_t))));
> 
> Do we either want to grow that alignment to L1_CACHE_BYTES or place a
> comment near that it would be best for performance to ensure the whole
> thing fits into 1 line?
> 
> Alternatively, growing the alignment to 4*8 would probably be sufficient
> to ensure that and waste less bytes.

I am tempted to go for an alignment of 4 * sizeof(uint64_t) to ensure it
is contained within a single cache-line without wasting space needlessly,
adding the following comment:

/*
 * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
 * contained within a single cache-line.
 */

There seems to be no gain in aligning it on a larger value, because it
is only ever read at run-time (never updated), so there can be no
false-sharing.

I'm updating the selftests rseq.h accordingly.

> 
>> +struct rseq {
>> +	union {
>> +		struct {
>> +			/*
>> +			 * Restartable sequences cpu_id field.
>> +			 * Updated by the kernel, and read by user-space with
>> +			 * single-copy atomicity semantics. Aligned on 32-bit.
>> +			 * Negative values are reserved for user-space.
>> +			 */
>> +			int32_t cpu_id;
>> +			/*
>> +			 * Restartable sequences event_counter field.
>> +			 * Updated by the kernel, and read by user-space with
>> +			 * single-copy atomicity semantics. Aligned on 32-bit.
>> +			 */
>> +			uint32_t event_counter;
>> +		} e;
>> +		/*
>> +		 * On architectures with 64-bit aligned reads, both cpu_id and
>> +		 * event_counter can be read with single-copy atomicity
>> +		 * semantics.
>> +		 */
>> +		uint64_t v;
>> +	} u;
>> +	/*
>> +	 * Restartable sequences rseq_cs field.
>> +	 * Updated by user-space, read by the kernel with
>> +	 * single-copy atomicity semantics. Aligned on 64-bit.
>> +	 */
>> +	RSEQ_FIELD_u32_u64(rseq_cs);
>> +} __attribute__((aligned(sizeof(uint64_t))));
> 
> 2*sizeof(uint64_t) ?
> 

Yes. Will do.

> Also, I think it would be good to have a comment explaining why this is
> split in two structures? Don't you rely on the address dependency?

The comment above the rseq_cs fields needs clarification, how about:

        /*
         * Restartable sequences rseq_cs field.
         * Contains NULL when no critical section is active for the
         * current thread, or holds a pointer to the currently active
         * struct rseq_cs.
         * Updated by user-space at the beginning and end of assembly
         * instruction sequence block, and by the kernel when it
         * restarts an assembly instruction sequence block. Read by the
         * kernel with single-copy atomicity semantics. Aligned on
         * 64-bit.
         */

This really explains that rseq_cs field of struct rseq holds a pointer
to the current struct rseq_cs (or NULL), which makes it obvious why this
needs to be two different structures.

> 
>> +#endif /* _UAPI_LINUX_RSEQ_H */
> 
>> diff --git a/kernel/rseq.c b/kernel/rseq.c
>> new file mode 100644
>> index 0000000..e1c847b
>> --- /dev/null
>> +++ b/kernel/rseq.c
>> @@ -0,0 +1,231 @@
> 
>> +/*
>> + * Each restartable sequence assembly block defines a "struct rseq_cs"
>> + * structure which describes the post_commit_ip address, and the
>> + * abort_ip address where the kernel should move the thread instruction
>> + * pointer if a rseq critical section assembly block is preempted or if
>> + * a signal is delivered on top of a rseq critical section assembly
>> + * block. It also contains a start_ip, which is the address of the start
>> + * of the rseq assembly block, which is useful to debuggers.
>> + *
>> + * The algorithm for a restartable sequence assembly block is as
>> + * follows:
>> + *
>> + * rseq_start()
>> + *
>> + *   0. Userspace loads the current event counter value from the
>> + *      event_counter field of the registered struct rseq TLS area,
>> + *
>> + * rseq_finish()
>> + *
>> + *   Steps [1]-[3] (inclusive) need to be a sequence of instructions in
>> + *   userspace that can handle being moved to the abort_ip between any
>> + *   of those instructions.
>> + *
>> + *   The abort_ip address needs to be equal or above the post_commit_ip.
> 
> Above, as in: abort_ip >= post_commit_ip? Would not 'after' or
> greater-or-equal be easier to understand?

Fixed. Using "lesser" and "greater-or-equal" for consistency.

> 
>> + *   Step [4] and the failure code step [F1] need to be at addresses
>> + *   equal or above the post_commit_ip.
> 
> idem.

Fixed.

Combined with other recent feedback, this becomes:

 *   The abort_ip address needs to be lesser than start_ip, or
 *   greater-or-equal the post_commit_ip. Step [4] and the failure
 *   code step [F1] need to be at addresses lesser than start_ip, or
 *   greater-or-equal the post_commit_ip.


> 
>> + *   1.  Userspace stores the address of the struct rseq cs rseq
>> + *       assembly block descriptor into the rseq_cs field of the
>> + *       registered struct rseq TLS area.
> 
> And this should be something like up-store-release, which would
> basically be a regular store, but such that the compiler is restrained
> from placing the stores to the structure itself later.
> 

The compiler should also prevent following loads from being moved before
this store. Updated to:

 *   1.  Userspace stores the address of the struct rseq_cs assembly
 *       block descriptor into the rseq_cs field of the registered
 *       struct rseq TLS area. This update is performed through a single
 *       store, followed by a compiler barrier which prevents the
 *       compiler from moving following loads or stores before this
 *       store.
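
On the user-space side, this step boils down to something like the
following sketch (the TLS variable and helper names are illustrative):

        #include <stdint.h>
        #include <linux/rseq.h>         /* struct rseq, struct rseq_cs */

        static __thread struct rseq rseq_state;

        static inline void rseq_set_current_cs(const struct rseq_cs *cs)
        {
                /* Single store of the descriptor address into rseq_cs... */
                rseq_state.rseq_cs = (uintptr_t)cs;
                /* ...followed by a compiler barrier so later loads/stores
                 * cannot be moved before it. */
                __asm__ __volatile__ ("" : : : "memory");
        }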


>> + *
>> + *   2.  Userspace tests to see whether the current event counter values
>> + *       match those loaded at [0]. Manually jumping to [F1] in case of
>> + *       a mismatch.
>> + *
>> + *       Note that if we are preempted or interrupted by a signal
>> + *       after [1] and before post_commit_ip, then the kernel also
>> + *       performs the comparison performed in [2], and conditionally
>> + *       clears rseq_cs, then jumps us to abort_ip.
>> + *
>> + *   3.  Userspace critical section final instruction before
>> + *       post_commit_ip is the commit. The critical section is
>> + *       self-terminating.
>> + *       [post_commit_ip]
>> + *
>> + *   4.  Userspace clears the rseq_cs field of the struct rseq
>> + *       TLS area.
>> + *
>> + *   5.  Return true.
>> + *
>> + *   On failure at [2]:
>> + *
>> + *   F1. Userspace clears the rseq_cs field of the struct rseq
>> + *       TLS area. Followed by step [F2].
>> + *
>> + *       [abort_ip]
>> + *   F2. Return false.
>> + */
>> +
>> +static int rseq_increment_event_counter(struct task_struct *t)
>> +{
>> +	if (__put_user(++t->rseq_event_counter,
>> +			&t->rseq->u.e.event_counter))
>> +		return -1;
>> +	return 0;
>> +}
> 
> this,

ok. Will use bool.

> 
>> +static int rseq_get_rseq_cs(struct task_struct *t,
>> +		void __user **post_commit_ip,
>> +		void __user **abort_ip)
>> +{
>> +	unsigned long ptr;
>> +	struct rseq_cs __user *rseq_cs;
>> +
>> +	if (__get_user(ptr, &t->rseq->rseq_cs))
>> +		return -1;
>> +	if (!ptr)
>> +		return 0;
>> +#ifdef CONFIG_COMPAT
>> +	if (in_compat_syscall()) {
>> +		rseq_cs = compat_ptr((compat_uptr_t)ptr);
>> +		if (get_user(ptr, &rseq_cs->post_commit_ip))
>> +			return -1;
>> +		*post_commit_ip = compat_ptr((compat_uptr_t)ptr);
>> +		if (get_user(ptr, &rseq_cs->abort_ip))
>> +			return -1;
>> +		*abort_ip = compat_ptr((compat_uptr_t)ptr);
>> +		return 0;
>> +	}
>> +#endif
>> +	rseq_cs = (struct rseq_cs __user *)ptr;
>> +	if (get_user(ptr, &rseq_cs->post_commit_ip))
>> +		return -1;
>> +	*post_commit_ip = (void __user *)ptr;
>> +	if (get_user(ptr, &rseq_cs->abort_ip))
>> +		return -1;
> 
> Given we want all 3 of those values in a single line and doing 3
> get_user() calls ends up doing 3 pairs of STAC/CLAC, should we not use
> either copy_from_user_inatomic or unsafe_get_user() paired with
> user_access_begin/end() pairs.

Actually, we want copy_from_user() there. This executes upon
resume to user-space, so we can take a page fault if needed; no
"inatomic" variant is required. I therefore suggest:

static bool rseq_get_rseq_cs(struct task_struct *t,
                void __user **start_ip,
                void __user **post_commit_ip,
                void __user **abort_ip)
{
        unsigned long ptr;
        struct rseq_cs __user *urseq_cs;
        struct rseq_cs rseq_cs;

        if (__get_user(ptr, &t->rseq->rseq_cs))
                return false;
        if (!ptr)
                return true;
#ifdef CONFIG_COMPAT
        if (in_compat_syscall()) {
                urseq_cs = compat_ptr((compat_uptr_t)ptr);
                if (copy_from_user(&rseq_cs, urseq_cs, sizeof(rseq_cs)))
                        return false;
                *start_ip = compat_ptr((compat_uptr_t)rseq_cs.start_ip);
                *post_commit_ip = compat_ptr((compat_uptr_t)rseq_cs.post_commit_ip);
                *abort_ip = compat_ptr((compat_uptr_t)rseq_cs.abort_ip);
                return true;
        }
#endif
        urseq_cs = (struct rseq_cs __user *)ptr;
        if (copy_from_user(&rseq_cs, urseq_cs, sizeof(rseq_cs)))
                return false;
        *start_ip = (void __user *)rseq_cs.start_ip;
        *post_commit_ip = (void __user *)rseq_cs.post_commit_ip;
        *abort_ip = (void __user *)rseq_cs.abort_ip;
        return true;
}


> 
>> +	*abort_ip = (void __user *)ptr;
>> +	return 0;
>> +}
> 
> this and,

ok. Will use bool.

> 
>> +static int rseq_ip_fixup(struct pt_regs *regs)
>> +{
>> +	struct task_struct *t = current;
>> +	void __user *post_commit_ip = NULL;
>> +	void __user *abort_ip = NULL;
>> +
>> +	if (rseq_get_rseq_cs(t, &post_commit_ip, &abort_ip))
>> +		return -1;
>> +
>> +	/* Handle potentially being within a critical section. */
>> +	if ((void __user *)instruction_pointer(regs) < post_commit_ip) {
> 
> Alternatively you can do:
> 
>	if (likely(void __user *)instruction_pointer(regs) >= post_commit_ip)
>		return 0;
> 
> and you can save an indent level below.

ok. will do.
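
Applying that, a sketch of the resulting fixup path (same logic as the
patch, just restructured around the early return):

        static int rseq_ip_fixup(struct pt_regs *regs)
        {
                struct task_struct *t = current;
                void __user *post_commit_ip = NULL;
                void __user *abort_ip = NULL;

                if (rseq_get_rseq_cs(t, &post_commit_ip, &abort_ip))
                        return -1;

                /* Common case: not within a rseq critical section. */
                if (likely((void __user *)instruction_pointer(regs) >= post_commit_ip))
                        return 0;

                /*
                 * Clear rseq_cs so a signal handler nested on top of the
                 * critical section is not fixed up again, then move the
                 * instruction pointer to the abort handler. The ip is only
                 * updated after clear_user succeeds, so a fault signal
                 * still arrives at the faulting rip.
                 */
                if (clear_user(&t->rseq->rseq_cs, sizeof(t->rseq->rseq_cs)))
                        return -1;
                instruction_pointer_set(regs, (unsigned long)abort_ip);
                return 0;
        }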

> 
>> +		/*
>> +		 * We need to clear rseq_cs upon entry into a signal
>> +		 * handler nested on top of a rseq assembly block, so
>> +		 * the signal handler will not be fixed up if itself
>> +		 * interrupted by a nested signal handler or preempted.
>> +		 */
>> +		if (clear_user(&t->rseq->rseq_cs,
>> +				sizeof(t->rseq->rseq_cs)))
>> +			return -1;
>> +
>> +		/*
>> +		 * We set this after potentially failing in
>> +		 * clear_user so that the signal arrives at the
>> +		 * faulting rip.
>> +		 */
>> +		instruction_pointer_set(regs, (unsigned long)abort_ip);
>> +	}
>> +	return 0;
>> +}
> 
> this function look like it should return bool.

ok. Will use bool.

> 
>> +/*
>> + * This resume handler should always be executed between any of:
>> + * - preemption,
>> + * - signal delivery,
>> + * and return to user-space.
>> + */
>> +void __rseq_handle_notify_resume(struct pt_regs *regs)
>> +{
>> +	struct task_struct *t = current;
>> +
>> +	if (unlikely(t->flags & PF_EXITING))
>> +		return;
>> +	if (!access_ok(VERIFY_WRITE, t->rseq, sizeof(*t->rseq)))
>> +		goto error;
>> +	if (__put_user(raw_smp_processor_id(), &t->rseq->u.e.cpu_id))
>> +		goto error;
>> +	if (rseq_increment_event_counter(t))
> 
> It seems a shame to not use a single __put_user() here. You did the
> layout to explicitly allow for this, but then you don't.

The event counter increment needs to be performed at least once before
returning to user-space whenever the thread is preempted or has a signal
delivered. This counter increment needs to occur even if we are not nested
over a restartable assembly block. (more detailed explanation about this
follows at the end of this email)

The rseq_ip_fixup only ever needs to update the rseq_cs pointer
field if it preempts/delivers a signal over a restartable
assembly block, which happens very rarely.

Therefore, since the event counter increment is more frequent than
setting rseq_cs ptr, I don't see much value in trying to combine
those two into a single __put_user().

The reason why I combined both the cpu_id and event_counter
fields into the same 64-bit integer is for user-space rseq_start()
to be able to fetch them through a single load when the architecture
allows it.
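
On the user-space side, the fast path then looks roughly like this
(reading through an equivalent union so the layout matches the kernel's
single 64-bit store; variable names are illustrative):

        union {
                struct {
                        int32_t cpu_id;
                        uint32_t event_counter;
                } e;
                uint64_t v;
        } snapshot;

        /* One 64-bit load fetches both fields where the architecture allows it. */
        snapshot.v = *(volatile uint64_t *)&rseq_tls.u.v;
        cpu = snapshot.e.cpu_id;
        start_cnt = snapshot.e.event_counter;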

> 
>> +		goto error;
>> +	if (rseq_ip_fixup(regs))
>> +		goto error;
>> +	return;
>> +
>> +error:
>> +	force_sig(SIGSEGV, t);
>> +}
>> +
>> +/*
>> + * sys_rseq - setup restartable sequences for caller thread.
>> + */
>> +SYSCALL_DEFINE2(rseq, struct rseq __user *, rseq, int, flags)
>> +{
>> +	if (unlikely(flags))
>> +		return -EINVAL;
> 
> (add whitespace)

fixed.

> 
>> +	if (!rseq) {
>> +		if (!current->rseq)
>> +			return -ENOENT;
>> +		return 0;
>> +	}
>> +
>> +	if (current->rseq) {
>> +		/*
>> +		 * If rseq is already registered, check whether
>> +		 * the provided address differs from the prior
>> +		 * one.
>> +		 */
>> +		if (current->rseq != rseq)
>> +			return -EBUSY;
> 
> Why explicitly allow resetting the same value?

The foreseen use is as follows: let's assume we have one or more
user-space libraries, and possibly the application, each using rseq.
They would each define a struct rseq TLS. They are expected to all
give it the same name (e.g. __rseq_thread_state), and mark it as a
weak symbol, so all uses of that symbol within the process address
space will refer to the same address for a given thread.

Registration of this TLS is done through a call to the rseq system
call. In cases where the application uses rseq, registration can be
done explicitly at thread creation by the application, since it controls
the beginning of the thread execution.

However, if we have uses in libraries that cannot rely on the application
registering the TLS, those libraries will need to lazily register their
struct rseq TLS the first time they use it within each thread.

So rather than keeping an extra flag in the __rseq_thread_state TLS
shared across the application and the various libs using rseq to indicate
whether it has been registered (with the signal-handler races this may
involve if rseq is used by a library from a signal handler), just allow
the rseq system call to succeed if multiple attempts are made to register
the same TLS address for a given thread.

One use-case I have in mind for libraries using rseq without requiring
the application to know about it is user-space tracing, as you would
have probably guessed. :)
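
A minimal sketch of such lazy registration from the library side (only
the weak-symbol convention matters here; the helper and flag names are
illustrative):

        #include <sys/syscall.h>
        #include <unistd.h>
        #include <linux/rseq.h>         /* struct rseq, from this series */

        /* Weak TLS definition shared by all rseq users in the process. */
        __attribute__((weak)) __thread struct rseq __rseq_thread_state;

        static __thread int rseq_registered;

        static inline int rseq_lazy_register(void)
        {
                if (rseq_registered)
                        return 0;
                /*
                 * Re-registering the same address succeeds, so racing with
                 * another library (or a signal handler) is harmless.
                 */
                if (syscall(__NR_rseq, &__rseq_thread_state, 0))
                        return -1;
                rseq_registered = 1;
                return 0;
        }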

> 
>> +	} else {
>> +		/*
>> +		 * If there was no rseq previously registered,
>> +		 * we need to ensure the provided rseq is
>> +		 * properly aligned and valid.
>> +		 */
>> +		if (!IS_ALIGNED((unsigned long)rseq, sizeof(uint64_t)))
>> +			return -EINVAL;
>> +		if (!access_ok(VERIFY_WRITE, rseq, sizeof(*rseq)))
>> +			return -EFAULT;
> 
> GCC has __alignof__(struct rseq) for this. And as per the above, I would
> recommend you change this to 2*sizeof(u64) to ensure the whole thing
> fits in a single line.

Will use __alignof__(*rseq) for the IS_ALIGNED check, to match the
sizeof(*rseq) just below.

> 
>> +		current->rseq = rseq;
>> +		/*
>> +		 * If rseq was previously inactive, and has just
>> +		 * been registered, ensure the cpu_id and
>> +		 * event_counter fields are updated before
>> +		 * returning to user-space.
>> +		 */
>> +		rseq_set_notify_resume(current);
>> +	}
>> +
>> +	return 0;
>> +}
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 51d7105..fbef0c3 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2664,6 +2664,7 @@ prepare_task_switch(struct rq *rq, struct task_struct
>> *prev,
>>  {
>>  	sched_info_switch(rq, prev, next);
>>  	perf_event_task_sched_out(prev, next);
>> +	rseq_sched_out(prev);
> 
> One thing I considered is doing something like:
> 
> static inline void rseq_sched_out(struct task_struct *t)
> {
>	unsigned long ptr;
>	int err;
> 
>	if (!t->rseq)
>		return;
> 
>	err = __get_user(ptr, &t->rseq->rseq_cs);
>	if (err || ptr)
>		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
> }
> 
> That will optimistically try to read the rseq_cs pointer and, on success
> and empty (the most likely case) avoid setting the TIF flag.
> 
> This will require an explicit migration hook to unconditionally set the
> TIF flag such that we keep the cpu_id field correct of course.
> 
> And obviously we can do this later, as an optimization. Its just
> something I figured might be worth it.

This won't work. The rseq mechanism proposed here is really the overlap
of _two_ distinct restart mechanisms: a sequence counter for C code,
and an ip-fixup-based mechanism for the assembly "finish" instruction
sequence.

What you propose here only considers the fixup of the assembly instruction
sequence, but not the C code that runs before. The C code between
rseq_start() and rseq_finish() loads the current value of the sequence
counter in rseq_start(), and then it gets compared with the new current
value within the rseq_finish restartable sequence of instructions. So the
sequence counter needs to be updated upon preemption/signal delivery that
occurs on top of C code, even if not nesting over a sequence of
restartable assembly instructions.
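
For illustration, a typical caller ends up looking roughly like this
(helper names and signatures loosely follow the selftests and may
differ):

        /* Illustrative only: increment a per-cpu slot using the two phases. */
        static void percpu_inc(intptr_t *percpu_slots)
        {
                struct rseq_state start;
                int cpu;

                do {
                        /* Phase 1: C code, protected by the event counter. */
                        start = rseq_start();
                        cpu = rseq_cpu_at_start(start);
                } while (!rseq_finish(&percpu_slots[cpu],
                                      percpu_slots[cpu] + 1, start));
                /*
                 * Phase 2: rseq_finish() is the asm block that re-checks the
                 * event counter and performs the single commit store.
                 */
        }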

Thanks for the thorough review!

Mathieu

> 
>>  	fire_sched_out_preempt_notifiers(prev, next);
>>  	prepare_lock_switch(rq, next);
>>  	prepare_arch_switch(next);

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-09 20:06     ` Mathieu Desnoyers
@ 2016-08-09 21:33       ` Peter Zijlstra
  2016-08-09 22:41         ` Mathieu Desnoyers
  2016-08-10  8:10       ` Andy Lutomirski
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2016-08-09 21:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Tue, Aug 09, 2016 at 08:06:40PM +0000, Mathieu Desnoyers wrote:
> >> +static int rseq_increment_event_counter(struct task_struct *t)
> >> +{
> >> +	if (__put_user(++t->rseq_event_counter,
> >> +			&t->rseq->u.e.event_counter))
> >> +		return -1;
> >> +	return 0;
> >> +}

> >> +void __rseq_handle_notify_resume(struct pt_regs *regs)
> >> +{
> >> +	struct task_struct *t = current;
> >> +
> >> +	if (unlikely(t->flags & PF_EXITING))
> >> +		return;
> >> +	if (!access_ok(VERIFY_WRITE, t->rseq, sizeof(*t->rseq)))
> >> +		goto error;
> >> +	if (__put_user(raw_smp_processor_id(), &t->rseq->u.e.cpu_id))
> >> +		goto error;
> >> +	if (rseq_increment_event_counter(t))
> > 
> > It seems a shame to not use a single __put_user() here. You did the
> > layout to explicitly allow for this, but then you don't.
> 
> The event counter increment needs to be performed at least once before
> returning to user-space whenever the thread is preempted or has a signal
> delivered. This counter increment needs to occur even if we are not nested
> over a restartable assembly block. (more detailed explanation about this
> follows at the end of this email)
> 
> The rseq_ip_fixup only ever needs to update the rseq_cs pointer
> field if it preempts/delivers a signal over a restartable
> assembly block, which happens very rarely.
> 
> Therefore, since the event counter increment is more frequent than
> setting rseq_cs ptr, I don't see much value in trying to combine
> those two into a single __put_user().
> 
> The reason why I combined both the cpu_id and event_counter
> fields into the same 64-bit integer is for user-space rseq_start()
> to be able to fetch them through a single load when the architecture
> allows it.

I wasn't talking about the rseq_ip_fixup(), I was talking about both
unconditional __put_user()'s on cpu_id and event_counter.

These are 2 unconditional u32 stores that could very easily be done as a
single u64 store (on 64bit hardware).

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-09 21:33       ` Peter Zijlstra
@ 2016-08-09 22:41         ` Mathieu Desnoyers
  2016-08-10  7:50           ` Peter Zijlstra
  0 siblings, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-09 22:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Aug 9, 2016, at 5:33 PM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Aug 09, 2016 at 08:06:40PM +0000, Mathieu Desnoyers wrote:
>> >> +static int rseq_increment_event_counter(struct task_struct *t)
>> >> +{
>> >> +	if (__put_user(++t->rseq_event_counter,
>> >> +			&t->rseq->u.e.event_counter))
>> >> +		return -1;
>> >> +	return 0;
>> >> +}
> 
>> >> +void __rseq_handle_notify_resume(struct pt_regs *regs)
>> >> +{
>> >> +	struct task_struct *t = current;
>> >> +
>> >> +	if (unlikely(t->flags & PF_EXITING))
>> >> +		return;
>> >> +	if (!access_ok(VERIFY_WRITE, t->rseq, sizeof(*t->rseq)))
>> >> +		goto error;
>> >> +	if (__put_user(raw_smp_processor_id(), &t->rseq->u.e.cpu_id))
>> >> +		goto error;
>> >> +	if (rseq_increment_event_counter(t))
>> > 
>> > It seems a shame to not use a single __put_user() here. You did the
>> > layout to explicitly allow for this, but then you don't.
>> 
>> The event counter increment needs to be performed at least once before
>> returning to user-space whenever the thread is preempted or has a signal
>> delivered. This counter increment needs to occur even if we are not nested
>> over a restartable assembly block. (more detailed explanation about this
>> follows at the end of this email)
>> 
>> The rseq_ip_fixup only ever needs to update the rseq_cs pointer
>> field if it preempts/delivers a signal over a restartable
>> assembly block, which happens very rarely.
>> 
>> Therefore, since the event counter increment is more frequent than
>> setting rseq_cs ptr, I don't see much value in trying to combine
>> those two into a single __put_user().
>> 
>> The reason why I combined both the cpu_id and event_counter
>> fields into the same 64-bit integer is for user-space rseq_start()
>> to be able to fetch them through a single load when the architecture
>> allows it.
> 
> I wasn't talking about the rseq_up_fixup(), I was talking about both
> unconditional __put_user()'s on cpu_id and event_counter.
> 
> These are 2 unconditinoal u32 stores that could very easily be done as a
> single u64 store (on 64bit hardware).

Gotcha. I'll therefore move the union outside of struct rseq in rseq.h
so we can re-use it:

union rseq_cpu_event {
        struct {
                /*
                 * Restartable sequences cpu_id field.
                 * Updated by the kernel, and read by user-space with
                 * single-copy atomicity semantics. Aligned on 32-bit.
                 * Negative values are reserved for user-space.
                 */
                int32_t cpu_id;
                /*
                 * Restartable sequences event_counter field.
                 * Updated by the kernel, and read by user-space with
                 * single-copy atomicity semantics. Aligned on 32-bit.
                 */
                uint32_t event_counter;
        } e;
        /*
         * On architectures with 64-bit aligned reads, both cpu_id and
         * event_counter can be read with single-copy atomicity
         * semantics.
         */
        uint64_t v;
};

/*
 * struct rseq is aligned on 2 * 8 bytes to ensure it is always
 * contained within a single cache-line.
 */
struct rseq {
        union rseq_cpu_event u;
        /*
         * Restartable sequences rseq_cs field.
         * Contains NULL when no critical section is active for the
         * current thread, or holds a pointer to the currently active
         * struct rseq_cs.
         * Updated by user-space at the beginning and end of assembly
         * instruction sequence block, and by the kernel when it
         * restarts an assembly instruction sequence block. Read by the
         * kernel with single-copy atomicity semantics. Aligned on
         * 64-bit.
         */
        RSEQ_FIELD_u32_u64(rseq_cs);
} __attribute__((aligned(2 * sizeof(uint64_t))));



I'll replace the two updates by this call in __rseq_handle_notify_resume():

        if (!rseq_update_cpu_id_event_counter(t))
                goto error;

And the given implementation:

/*
 * The rseq_event_counter allow user-space to detect preemption and
 * signal delivery. It increments at least once before returning to
 * user-space if a thread is preempted or has a signal delivered. It is
 * not meant to be an exact counter of such events.
 *
 * Overflow of the event counter is not a problem in practice. It
 * increments at most once between each user-space thread instruction
 * executed, so we would need a thread to execute 2^32 instructions or
 * more between rseq_start() and rseq_finish(), while single-stepping,
 * for this to be an issue.
 *
 * On 64-bit architectures, both cpu_id and event_counter can be updated
 * with a single 64-bit store. On 32-bit architectures, we instead
 * perform two 32-bit single-copy stores, just in case the architecture
 * 64-bit __put_user() would fallback on a bytewise copy, which would
 * not guarantee single-copy atomicity semantics for other threads.
 */
#ifdef __LP64__

static bool rseq_update_cpu_id_event_counter(struct task_struct *t)
{
        union rseq_cpu_event u;

        u.e.cpu_id = raw_smp_processor_id();
        u.e.event_counter = ++t->rseq_event_counter;
        if (__put_user(u.v, &t->rseq->u.v))
                return false;
        trace_rseq_inc(t->rseq_event_counter);
        return true;
}

#else /* #ifdef __LP64__ */

static bool rseq_update_cpu_id_event_counter(struct task_struct *t)
{
        if (__put_user(raw_smp_processor_id(), &t->rseq->u.e.cpu_id))
                return false;
        if (__put_user(++t->rseq_event_counter, &t->rseq->u.e.event_counter))
                return false;
        trace_rseq_inc(t->rseq_event_counter);
        return true;
}

#endif /* #else #ifdef __LP64__ */


Let me know if I missed anything.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-09 22:41         ` Mathieu Desnoyers
@ 2016-08-10  7:50           ` Peter Zijlstra
  2016-08-10 13:26             ` Mathieu Desnoyers
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2016-08-10  7:50 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Tue, Aug 09, 2016 at 10:41:47PM +0000, Mathieu Desnoyers wrote:
> #ifdef __LP64__
> 
> static bool rseq_update_cpu_id_event_counter(struct task_struct *t)
> {
>         union rseq_cpu_event u;
> 
>         u.e.cpu_id = raw_smp_processor_id();
>         u.e.event_counter = ++t->rseq_event_counter;
>         if (__put_user(u.v, &t->rseq->u.v))
>                 return false;
>         trace_rseq_inc(t->rseq_event_counter);
>         return true;
> }
> 
> #else /* #ifdef __LP64__ */
> 
> static bool rseq_update_cpu_id_event_counter(struct task_struct *t)
> {
>         if (__put_user(raw_smp_processor_id(), &t->rseq->u.e.cpu_id))
>                 return false;
>         if (__put_user(++t->rseq_event_counter, &t->rseq->u.e.event_counter))
>                 return false;
>         trace_rseq_inc(t->rseq_event_counter);
>         return true;
> }
> 
> #endif /* #else #ifdef __LP64__ */

I don't think you need to guard it (and CONFIG_64BIT is the 'right'
kernel symbol for that), 32bit should have u64 __put_user() only
implemented as 2 u32 stores.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-09 16:13               ` Boqun Feng
@ 2016-08-10  8:01                 ` Andy Lutomirski
  2016-08-10 17:40                   ` Mathieu Desnoyers
  2016-08-10 17:33                 ` Mathieu Desnoyers
  1 sibling, 1 reply; 82+ messages in thread
From: Andy Lutomirski @ 2016-08-10  8:01 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Mathieu Desnoyers, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	linux-api, Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Tue, Aug 9, 2016 at 9:13 AM, Boqun Feng <boqun.feng@gmail.com> wrote:
> On Wed, Aug 03, 2016 at 10:03:32PM -0700, Andy Lutomirski wrote:
>> On Wed, Aug 3, 2016 at 9:27 PM, Boqun Feng <boqun.feng@gmail.com> wrote:
>> > On Wed, Aug 03, 2016 at 09:37:57AM -0700, Andy Lutomirski wrote:
>> >> On Wed, Aug 3, 2016 at 5:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> > On Tue, Jul 26, 2016 at 03:02:19AM +0000, Mathieu Desnoyers wrote:
>> >> >> We really care about preemption here. Every migration implies a
>> >> >> preemption from a user-space perspective. If we would only care
>> >> >> about keeping the CPU id up-to-date, hooking into migration would be
>> >> >> enough. But since we want atomicity guarantees for restartable
>> >> >> sequences, we need to hook into preemption.
>> >> >
>> >> >> It allows user-space to perform update operations on per-cpu data without
>> >> >> requiring heavy-weight atomic operations.
>> >> >
>> >> > Well, a CMPXCHG without LOCK prefix isn't all that expensive on x86.
>> >> >
>> >> > It is however on PPC and possibly other architectures, so in name of
>> >> > simplicity supporting only the one variant makes sense.
>> >> >
>> >>
>> >> I wouldn't want to depend on CMPXCHG.  But imagine we had primitives
>> >> that were narrower than the full abort-on-preemption primitive.
>> >> Specifically, suppose we had abort if (actual cpu != expected_cpu ||
>> >> *aptr != aval).  We could do things like:
>> >>
>> >> expected_cpu = cpu;
>> >> aval = NULL;  // disarm for now
>> >> begin();
>> >> aval = event_count[cpu] + 1;
>> >> event_count[cpu] = aval;
>> >> event_count[cpu]++;
>> >
>> > This line is redundant, right? Because it will guarantee a failure even
>> > in no-contention cases.
>> >
>> >>
>> >> ... compute something ...
>> >>
>> >> // arm the rest of it
>> >> aptr = &event_count[cpu];
>> >> if (*aptr != aval)
>> >>   goto fail;
>> >>
>> >> *thing_im_writing = value_i_computed;
>> >> end();
>> >>
>> >> The idea here is that we don't rely on the scheduler to increment the
>> >> event count at all, which means that we get to determine the scope of
>> >> what kinds of access conflicts we care about ourselves.
>> >>
>> >
>> > If we increase the event count in userspace, how could we prevent two
>> > userspace threads from racing on the event_count[cpu] field? For
>> > example:
>> >
>> >         CPU 0
>> >         ================
>> >         {event_count[0] is initially 0}
>> >
>> >         [Thread 1]
>> >         begin();
>> >         aval = event_count[cpu] + 1; // 1
>> >
>> >         (preempted)
>> >         [Thread 2]
>> >         begin();
>> >         aval = event_count[cpu] + 1; // 1, too
>> >         event_count[cpu] = aval; // event_count[0] is 1
>> >
>>
>> You're right :(  This would work with an xadd instruction, but that's
>> very slow and doesn't exist on most architectures.  It could also work
>> if we did:
>>
>> aval = some_tls_value++;
>>
>> where some_tls_value is set up such that no two threads could ever end
>> up with the same values (using high bits as thread ids, perhaps), but
>> that's messy.  Maybe my idea is no good.
>
> This is a little more complex, plus I failed to find a way to do an
> atomic "if (*aptr == aval) *b = c" in userspace ;-(
>

But the kernel might be able to help using something similar to this patchset.

> However, I'm thinking maybe we can use some tricks to avoid unnecessary
> aborts-on-preemption.
>
> First of all, I notice we haven't made any constraints on what kind of
> memory objects can be "protected" by rseq critical sections yet. And I
> think this is something we should decide before adding this feature into
> the kernel.
>
> We can do some optimizations if we have some constraints. For example,
> if the memory objects inside the rseq critical sections can only be
> modified by userspace programs, we don't need to abort immediately on a
> userspace task -> kernel task context switch.

True, although trying to do a syscall in an rseq critical section
seems like a bad idea in general.

>
> Furthermore, if the memory objects inside the rseq critical sections
> can only be modified by userspace programs that have registered their
> rseq structures, we don't need to abort immediately on context switches
> between two rseq-unregistered tasks, or between one rseq-registered
> task and one rseq-unregistered task.
>
> Instead, we can do tricks as follows:
>
> defining a percpu pointer in the kernel:
>
> DEFINE_PER_CPU(struct task_struct *, rseq_owner);
>
> and a cpu field in struct task_struct:
>
>         struct task_struct {
>         ...
>         #ifdef CONFIG_RSEQ
>                 struct rseq __user *rseq;
>                 uint32_t rseq_event_counter;
>                 int rseq_cpu;
>         #endif
>         ...
>         };
>
> (task_struct::rseq_cpu should be initialized as -1.)
>
> each time at sched out (in rseq_sched_out()), we do something like:
>
>         if (prev->rseq) {
>                 raw_cpu_write(rseq_owner, prev);
>                 prev->rseq_cpu = smp_processor_id();
>         }
>
> each time at sched in (in rseq_handle_notify_resume()), we do something
> like:
>
>         if (current->rseq &&
>             (this_cpu_read(rseq_owner) != current ||
>              current->rseq_cpu != smp_processor_id()))
>                 __rseq_handle_notify_resume(regs);
>
> (Also need to modify rseq_signal_deliver() to call
> __rseq_handle_notify_resume() directly).
>
>
> I think this could save some unnecessary aborts-on-preemption; however,
> TBH, I'm too sleepy to verify every corner case. Will recheck this
> tomorrow.

Interesting.  That could help a bit, although it would help less if
everyone started using rseq.

But do we need to protect MAP_SHARED objects?  If not, maybe we could
only track context switches between different tasks sharing the same
mm.

--Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-09 20:06     ` Mathieu Desnoyers
  2016-08-09 21:33       ` Peter Zijlstra
@ 2016-08-10  8:10       ` Andy Lutomirski
  2016-08-10 19:04         ` Mathieu Desnoyers
  2016-08-10  8:43       ` Peter Zijlstra
  2016-08-10 13:29       ` Peter Zijlstra
  3 siblings, 1 reply; 82+ messages in thread
From: Andy Lutomirski @ 2016-08-10  8:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Tue, Aug 9, 2016 at 1:06 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> ----- On Aug 3, 2016, at 9:19 AM, Peter Zijlstra peterz@infradead.org wrote:
>

>>
>>> +#endif
>>>  /* CPU-specific state of this task */
>>>      struct thread_struct thread;
>>>  /*
>>> @@ -3387,4 +3392,67 @@ void cpufreq_add_update_util_hook(int cpu, struct
>>> update_util_data *data,
>>>  void cpufreq_remove_update_util_hook(int cpu);
>>>  #endif /* CONFIG_CPU_FREQ */
>>>
>>> +#ifdef CONFIG_RSEQ
>>> +static inline void rseq_set_notify_resume(struct task_struct *t)
>>> +{
>>> +    if (t->rseq)
>>> +            set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
>>> +}
>>
>> Maybe I missed it, but why do we want to hook into NOTIFY_RESUME and not
>> have our own TIF flag?
>
> The short answer is that I used the same approach as Paul Turner's patchset. ;)
>
> Taking a deeper look into this, the only times we set the flag are when
> preempting or delivering a signal to a thread that has registered with
> rseq.
>
> Upon return to user-space with the flag set, the performance difference
> between having our own flag and hopping onto the NOTIFY_RESUME bandwagon
> is that with our own flag we can skip the various tests in
> exit_to_usermode_loop(), at the expense of crowding the thread flags even
> nearer to filling up 32 bits, which will at some point require extra
> tests on the fast-path.

I don't think we're anywhere near running out.  Several of those flags
can probably go away pretty easily, too.

>
> Thinking about it, one benchmark I have not done so far is to modify
> hackbench so it registers its threads with the rseq system call. We
> can then figure out whether reserving a flag for rseq is justified or
> not.
>
> Comparing 10 runs of hackbench registering its sender/receiver threads
> with unmodified hackbench: (hackbench -l 100000)
>
>     Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
>     2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
>     saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
>     kernel parameter), with a Linux v4.7 defconfig+localyesconfig,
>     restartable sequences series applied.
>
>                                     Avg. Time (s)    Std.dev. (s)
> Unmodified Hackbench                    40.5            0.1
> Rseq-Registered Hackbench Threads       40.4            0.1
>
> So initial results seem to indicate that adding the notify_resume
> handling upon preemption does not have a noticeable effect on
> performance, so I don't consider it worthwhile to try optimizing
> it by reserving its own thread flag. Or perhaps I am missing something
> important here?
>

I don't think so.  One benefit of using do_notify_resume would be less
arch code.

> Actually, we want copy_from_user() there. This executes upon
> resume to user-space, so we can take a page fault if needed; no
> "inatomic" variant is required. I therefore suggest:

Running the code below via exit_to_usermode_loop...

>
> static bool rseq_get_rseq_cs(struct task_struct *t,
>                 void __user **start_ip,
>                 void __user **post_commit_ip,
>                 void __user **abort_ip)
> {
>         unsigned long ptr;
>         struct rseq_cs __user *urseq_cs;
>         struct rseq_cs rseq_cs;
>
>         if (__get_user(ptr, &t->rseq->rseq_cs))
>                 return false;
>         if (!ptr)
>                 return true;
> #ifdef CONFIG_COMPAT
>         if (in_compat_syscall()) {
>                 urseq_cs = compat_ptr((compat_uptr_t)ptr);
>                 if (copy_from_user(&rseq_cs, urseq_cs, sizeof(rseq_cs)))
>                         return false;
>                 *start_ip = compat_ptr((compat_uptr_t)rseq_cs.start_ip);
>                 *post_commit_ip = compat_ptr((compat_uptr_t)rseq_cs.post_commit_ip);
>                 *abort_ip = compat_ptr((compat_uptr_t)rseq_cs.abort_ip);
>                 return true;
>         }
> #endif

...means that in_compat_syscall() is nonsense.  (It *works* there, but
I can't imagine that it does anything that is actually sensible for
this use.)

Can't you just define the ABI so that no compat junk is needed?
(Also, CRIU will thank you for doing that.)


>>> +SYSCALL_DEFINE2(rseq, struct rseq __user *, rseq, int, flags)
>>> +{
>>> +    if (unlikely(flags))
>>> +            return -EINVAL;
>>
>> (add whitespace)
>
> fixed.
>
>>
>>> +    if (!rseq) {
>>> +            if (!current->rseq)
>>> +                    return -ENOENT;
>>> +            return 0;
>>> +    }

This looks entirely wrong.  Setting rseq to NULL fails if it's already
NULL but silently does nothing if rseq is already set?  Surely it
should always succeed and it should actually do something if rseq is
set.


-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-04  5:03             ` Andy Lutomirski
  2016-08-09 16:13               ` Boqun Feng
@ 2016-08-10  8:13               ` Andy Lutomirski
  1 sibling, 0 replies; 82+ messages in thread
From: Andy Lutomirski @ 2016-08-10  8:13 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Mathieu Desnoyers, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	linux-api, Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Wed, Aug 3, 2016 at 10:03 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Aug 3, 2016 at 9:27 PM, Boqun Feng <boqun.feng@gmail.com> wrote:
>> On Wed, Aug 03, 2016 at 09:37:57AM -0700, Andy Lutomirski wrote:
>>> On Wed, Aug 3, 2016 at 5:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> > On Tue, Jul 26, 2016 at 03:02:19AM +0000, Mathieu Desnoyers wrote:
>>> >> We really care about preemption here. Every migration implies a
>>> >> preemption from a user-space perspective. If we would only care
>>> >> about keeping the CPU id up-to-date, hooking into migration would be
>>> >> enough. But since we want atomicity guarantees for restartable
>>> >> sequences, we need to hook into preemption.
>>> >
>>> >> It allows user-space to perform update operations on per-cpu data without
>>> >> requiring heavy-weight atomic operations.
>>> >
>>> > Well, a CMPXCHG without LOCK prefix isn't all that expensive on x86.
>>> >
>>> > It is however on PPC and possibly other architectures, so in name of
>>> > simplicity supporting only the one variant makes sense.
>>> >
>>>
>>> I wouldn't want to depend on CMPXCHG.  But imagine we had primitives
>>> that were narrower than the full abort-on-preemption primitive.
>>> Specifically, suppose we had abort if (actual cpu != expected_cpu ||
>>> *aptr != aval).  We could do things like:
>>>
>>> expected_cpu = cpu;
>>> aval = NULL;  // disarm for now
>>> begin();
>>> aval = event_count[cpu] + 1;
>>> event_count[cpu] = aval;
>>> event_count[cpu]++;
>>
>> This line is redundant, right? Because it will guarantee a failure even
>> in no-contention cases.
>>
>>>
>>> ... compute something ...
>>>
>>> // arm the rest of it
>>> aptr = &event_count[cpu];
>>> if (*aptr != aval)
>>>   goto fail;
>>>
>>> *thing_im_writing = value_i_computed;
>>> end();
>>>
>>> The idea here is that we don't rely on the scheduler to increment the
>>> event count at all, which means that we get to determine the scope of
>>> what kinds of access conflicts we care about ourselves.
>>>
>>
>> If we increase the event count in userspace, how could we prevent two
>> userspace threads from racing on the event_count[cpu] field? For
>> example:
>>
>>         CPU 0
>>         ================
>>         {event_count[0] is initially 0}
>>
>>         [Thread 1]
>>         begin();
>>         aval = event_count[cpu] + 1; // 1
>>
>>         (preempted)
>>         [Thread 2]
>>         begin();
>>         aval = event_count[cpu] + 1; // 1, too
>>         event_count[cpu] = aval; // event_count[0] is 1
>>
>
> You're right :(  This would work with an xadd instruction, but that's
> very slow and doesn't exist on most architectures.  It could also work
> if we did:

Thinking about this slightly more, maybe it does work.  We could use
basically the same mechanism to allow the kernel to restart if the
specific sequence:

aval = event_count[cpu] + 1
event_count[cpu] = aval

gets preempted by setting aptr = &event_count[cpu] and aval to
event_count[cpu], like this (although I might have screwed up any
number of small details):

aptr = &event_count[cpu];
barrier();
aval = event_count[cpu];
barrier();
tmp = aval + 1;
event_count[cpu] = tmp;
/* preemption here will cause an unnecessary retry, but that's okay */
aval = tmp;

--Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-09 20:06     ` Mathieu Desnoyers
  2016-08-09 21:33       ` Peter Zijlstra
  2016-08-10  8:10       ` Andy Lutomirski
@ 2016-08-10  8:43       ` Peter Zijlstra
  2016-08-10 13:57         ` Mathieu Desnoyers
  2016-08-10 13:29       ` Peter Zijlstra
  3 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2016-08-10  8:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Tue, Aug 09, 2016 at 08:06:40PM +0000, Mathieu Desnoyers wrote:
> > Also, do we want a comment somewhere that explains why overflow isn't a
> > problem?
> 
> I can add a comment about rseq_increment_event_counter stating:
> 
>  * Overflow of the event counter is not a problem in practice. It
>  * increments at most once between each user-space thread instruction
>  * executed, so we would need a thread to execute 2^32 instructions or
>  * more between rseq_start() and rseq_finish(), while single-stepping,
>  * for this to be an issue.
> 
> Is it fine, or should we be more conservative and care about the overflow,
> extending the counter to a 64-bit value in the process ?

I think it's good enough; and using u64 has the unfortunate side effect
of not being able to share the word with the cpu number.

My point was more to have this stuff clearly documented.

> > Maybe I missed it, but why do we want to hook into NOTIFY_RESUME and not
> > have our own TIF flag?
> 
> The short answer is that I used the same approach as Paul Turner's patchset. ;)
> 
> Taking a deeper look into this, the only times we set the flag are when
> preempting or delivering a signal to a thread that has registered with
> rseq.

 <snip>

> So initial results seem to indicate that adding the notify_resume
> handling upon preemption does not have a noticeable effect on
> performance, so I don't consider it worthwhile to try optimizing
> it by reserving its own thread flag. Or perhaps I am missing something
> important here?

Not sure; seems like we can leave it as is for the moment. Again my
point was to make sure we've thought about the decision, and per the
above you clearly have now ;-)

> > Also, I think it would be good to have a comment explaining why this is
> > split in two structures? Don't you rely on the address dependency?
> 
> The comment above the rseq_cs fields needs clarification, how about:
> 
>         /*
>          * Restartable sequences rseq_cs field.
>          * Contains NULL when no critical section is active for the
>          * current thread, or holds a pointer to the currently active
>          * struct rseq_cs.
>          * Updated by user-space at the beginning and end of assembly
>          * instruction sequence block, and by the kernel when it
>          * restarts an assembly instruction sequence block. Read by the
>          * kernel with single-copy atomicity semantics. Aligned on
>          * 64-bit.
>          */
> 
> This really explains that rseq_cs field of struct rseq holds a pointer
> to the current struct rseq_cs (or NULL), which makes it obvious why this
> needs to be two different structures.

I think I'm still missing things, as it's not obvious to me at all :/

We could equally well have chosen a single structure and picked the
post_commit_ip field to trigger things from, no?

The only downside seems to be that we must then impose ordering (but UP
ordering, so that's cheap) between writing the abort_ip and
post_commit_ip.

That is; something like so:

struct rseq {
	union rseq_event_cpu u;

	u64 abort_ip;
	u64 post_commit_ip;
};

Where userspace must do:

	r->abort_ip = $abort_ip;
	barrier();
	WRITE_ONCE(r->post_commit_ip, $post_commit_ip);
	barrier();

Which is not much different from what Paul did, except he kept the
abort_ip in a register (which must be loaded before setting the
commit_ip).

And the kernel checks post_commit_ip, if 0, nothing happens, otherwise
we check instruction_pointer and do magic.

Then after the commit, we clear post_commit_ip again; just like we now
clear the rseq_cs pointer.

AFAICT this is an equally valid approach. So why split and put that
indirection in?

> Combined with other recent feedback, this becomes:
> 
>  *   The abort_ip address needs to be lesser than start_ip, or

Isn't it "less than" ?

>  *   greater-or-equal the post_commit_ip. Step [4] and the failure
>  *   code step [F1] need to be at addresses lesser than start_ip, or
>  *   greater-or-equal the post_commit_ip.
> 


> >> +	if (current->rseq) {
> >> +		/*
> >> +		 * If rseq is already registered, check whether
> >> +		 * the provided address differs from the prior
> >> +		 * one.
> >> +		 */
> >> +		if (current->rseq != rseq)
> >> +			return -EBUSY;
> > 
> > Why explicitly allow resetting the same value?
> 
> The foreseen use is as follows: let's assume we have one or more
> user-space libraries, and possibly the application, each using rseq.
> They would each define a struct rseq TLS. They are expected to all
> give it the same name (e.g. __rseq_thread_state), and mark it as a
> weak symbol, so all uses of that symbol within the process address
> space will refer to the same address for a given thread.

Cute!
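
For reference, a minimal sketch of the weak-symbol pattern described
above (the symbol name, struct layout, alignment and the availability
of a __NR_rseq syscall number are illustrative assumptions, not part
of the patch):

#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

struct rseq {				/* stand-in for the uapi layout */
	union {
		struct {
			uint32_t cpu_id;
			uint32_t event_counter;
		} e;
		uint64_t v;
	} u;
	uint64_t rseq_cs;
} __attribute__((aligned(8)));

/* Same weak TLS symbol in every library and in the application, so
 * all of them resolve to a single struct rseq per thread. */
__thread struct rseq __rseq_thread_state __attribute__((weak));

static int rseq_register_current_thread(void)
{
	/*
	 * Re-registering the same address returns 0, so each user may
	 * call this unconditionally; only a different address yields
	 * -EBUSY.
	 */
	return syscall(__NR_rseq, &__rseq_thread_state, 0);
}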

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10  7:50           ` Peter Zijlstra
@ 2016-08-10 13:26             ` Mathieu Desnoyers
  2016-08-10 13:33               ` Peter Zijlstra
  0 siblings, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-10 13:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Aug 10, 2016, at 3:50 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Aug 09, 2016 at 10:41:47PM +0000, Mathieu Desnoyers wrote:
>> #ifdef __LP64__
>> 
>> static bool rseq_update_cpu_id_event_counter(struct task_struct *t)
>> {
>>         union rseq_cpu_event u;
>> 
>>         u.e.cpu_id = raw_smp_processor_id();
>>         u.e.event_counter = ++t->rseq_event_counter;
>>         if (__put_user(u.v, &t->rseq->u.v))
>>                 return false;
>>         trace_rseq_inc(t->rseq_event_counter);
>>         return true;
>> }
>> 
>> #else /* #ifdef __LP64__ */
>> 
>> static bool rseq_update_cpu_id_event_counter(struct task_struct *t)
>> {
>>         if (__put_user(raw_smp_processor_id(), &t->rseq->u.e.cpu_id))
>>                 return false;
>>         if (__put_user(++t->rseq_event_counter, &t->rseq->u.e.event_counter))
>>                 return false;
>>         trace_rseq_inc(t->rseq_event_counter);
>>         return true;
>> }
>> 
>> #endif /* #else #ifdef __LP64__ */
> 
> I don't think you need to guard it (and CONFIG_64BIT is the 'right'
> kernel symbol for that), 32bit should have u64 __put_user() only
> implemented as 2 u32 stores.

OK, I can then simplify the implementation to:

[...]
 * On 64-bit architectures, both cpu_id and event_counter can be updated
 * with a single 64-bit store. On 32-bit architectures, __put_user() is
 * expected to perform two 32-bit single-copy stores to guarantee
 * single-copy atomicity semantics for other threads.
 */
static bool rseq_update_cpu_id_event_counter(struct task_struct *t)
{
        union rseq_cpu_event u;

        u.e.cpu_id = raw_smp_processor_id();
        u.e.event_counter = ++t->rseq_event_counter;
        if (__put_user(u.v, &t->rseq->u.v))
                return false;
        trace_rseq_inc(t->rseq_event_counter);
        return true;
}

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-09 20:06     ` Mathieu Desnoyers
                         ` (2 preceding siblings ...)
  2016-08-10  8:43       ` Peter Zijlstra
@ 2016-08-10 13:29       ` Peter Zijlstra
  3 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2016-08-10 13:29 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Tue, Aug 09, 2016 at 08:06:40PM +0000, Mathieu Desnoyers wrote:
> On Aug 3, 2016, at 9:19 AM, Peter Zijlstra peterz@infradead.org wrote:

> >> +++ b/kernel/sched/core.c
> >> @@ -2664,6 +2664,7 @@ prepare_task_switch(struct rq *rq, struct task_struct
> >> *prev,
> >>  {
> >>  	sched_info_switch(rq, prev, next);
> >>  	perf_event_task_sched_out(prev, next);
> >> +	rseq_sched_out(prev);
> > 
> > One thing I considered is doing something like:
> > 
> > static inline void rseq_sched_out(struct task_struct *t)
> > {
> >	unsigned long ptr;
> >	int err;
> > 
> >	if (!t->rseq)
> >		return;
> > 
> >	err = __get_user(ptr, &t->rseq->rseq_cs);
> >	if (err || ptr)
> >		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
> > }
> > 
> > That will optimistically try to read the rseq_cs pointer and, on success
> > and empty (the most likely case) avoid setting the TIF flag.
> > 
> > This will require an explicit migration hook to unconditionally set the
> > TIF flag such that we keep the cpu_id field correct of course.
> > 
> > And obviously we can do this later, as an optimization. It's just
> > something I figured might be worth it.
> 
> This won't work. The rseq mechanism proposed here is really the overlap
> of _two_ distinct restart mechanisms: a sequence counter for C code,
> and an ip-fixup-based mechanism for the assembly "finish" instruction
> sequence.
> 
> What you propose here only considers the fixup of the assembly instruction
> sequence, but not the C code that runs before. The C code between
> rseq_start() and rseq_finish() loads the current value of the sequence
> counter in rseq_start(), and then it gets compared with the new current
> value within the rseq_finish restartable sequence of instructions. So the
> sequence counter needs to be updated upon preemption/signal delivery that
> occurs on top of C code, even if not nesting over a sequence of
> restartable assembly instructions.

True; we could of course have the rseq_start() also set a !0 state
before reading the seq, but not sure that all is worth it.
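
For illustration, a rough sketch of what that could look like on the
user-space side (the RSEQ_CS_PREPARING marker and the struct layout
are made up for this sketch):

#include <stdint.h>

#define barrier()	__asm__ __volatile__("" : : : "memory")

/* Any non-zero rseq_cs value would defeat an optimistic
 * "rseq_cs == 0" check in rseq_sched_out(). */
#define RSEQ_CS_PREPARING	((uint64_t)1)

struct rseq {				/* stand-in for the uapi layout */
	union {
		struct {
			uint32_t cpu_id;
			uint32_t event_counter;
		} e;
		uint64_t v;
	} u;
	uint64_t rseq_cs;
};

static inline uint32_t rseq_start(volatile struct rseq *r)
{
	r->rseq_cs = RSEQ_CS_PREPARING;
	barrier();	/* marker store before the counter read */
	return r->u.e.event_counter;
}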

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10 13:26             ` Mathieu Desnoyers
@ 2016-08-10 13:33               ` Peter Zijlstra
  2016-08-10 14:04                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2016-08-10 13:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Wed, Aug 10, 2016 at 01:26:04PM +0000, Mathieu Desnoyers wrote:

> static bool rseq_update_cpu_id_event_counter(struct task_struct *t)
> {
>         union rseq_cpu_event u;
> 
>         u.e.cpu_id = raw_smp_processor_id();
>         u.e.event_counter = ++t->rseq_event_counter;
>         if (__put_user(u.v, &t->rseq->u.v))
>                 return false;
>         trace_rseq_inc(t->rseq_event_counter);

I had not previously noticed the trace_* muck, but I would suggest
passing in t and leaving it up to the tracepoint implementation to pick
out the value.

Also, since this not only increments (it also updates the cpu number)
the naming is 'wrong'.

>         return true;
> }

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10  8:43       ` Peter Zijlstra
@ 2016-08-10 13:57         ` Mathieu Desnoyers
  2016-08-10 14:28           ` Peter Zijlstra
  0 siblings, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-10 13:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Aug 10, 2016, at 4:43 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Aug 09, 2016 at 08:06:40PM +0000, Mathieu Desnoyers wrote:

<snip>

>> > Also, I think it would be good to have a comment explaining why this is
>> > split in two structures? Don't you rely on the address dependency?
>> 
>> The comment above the rseq_cs fields needs clarification, how about:
>> 
>>         /*
>>          * Restartable sequences rseq_cs field.
>>          * Contains NULL when no critical section is active for the
>>          * current thread, or holds a pointer to the currently active
>>          * struct rseq_cs.
>>          * Updated by user-space at the beginning and end of assembly
>>          * instruction sequence block, and by the kernel when it
>>          * restarts an assembly instruction sequence block. Read by the
>>          * kernel with single-copy atomicity semantics. Aligned on
>>          * 64-bit.
>>          */
>> 
>> This really explains that rseq_cs field of struct rseq holds a pointer
>> to the current struct rseq_cs (or NULL), which makes it obvious why this
>> needs to be two different structures.
> 
> I think I'm still missing things as its not obvious to me at all :/
> 
> We could equally well have chosen a single structure and picked the
> post_commit_ip field to trigger things from, no?
> 
> The only down side seems to be that we must then impose ordering (but UP
> ordering, so that's cheap) between writing the abort_ip and
> post_commit_ip.
> 
> That is; something like so:
> 
> struct rseq {
>	union rseq_event_cpu u;
> 
>	u64 abort_ip;
>	u64 post_commit_ip;
> };
> 
> Where userspace must do:
> 
>	r->abort_ip = $abort_ip;
>	barrier();
>	WRITE_ONCE(r->post_commit_ip, $post_commit_ip);
>	barrier();
> 
> Which is not much different from what Paul did, except he kept the
> abort_ip in a register (which must be loaded before setting the
> commit_ip).
> 
> And the kernel checks post_commit_ip, if 0, nothing happens, otherwise
> we check instruction_pointer and do magic.
> 
> Then after the commit, we clear post_commit_ip again; just like we now
> clear the rseq_cs pointer.
> 
> AFAICT this is an equally valid approach. So why split and put that
> indirection in?

Now I understand from which angle you are looking at it.

The reason for this indirection is to speed up the user-space rseq_finish()
fast path:

With Paul Turner's approach, we needed to clobber a register, issue
instructions to move abort_ip to that register, and store the post_commit_ip
to the TLS.

With your approach here, you need 2 stores, ordered with compiler-barriers:
storing abort_ip to TLS, and then post_commit_ip to TLS.

The approach I propose (indirection) only requires a single store to the TLS:
we store the address of the currently active struct rseq_cs descriptor. The
kernel can then fetch the content of that descriptor (start_ip, post_commit_ip,
abort_ip) when/if it preempts or delivers a signal over that critical section.

On architectures like arm32, it makes a very significant difference
performance-wise to simply remove useless register movement or stores.

So I add an indirection in the kernel slow path (upon return to user-space after
preempting a rseq asm sequence, or upon signal delivery over a rseq asm sequence),
to speed up the user-space fast path.

By using the indirection approach, we also get the "start_ip" pointer for free,
which can be used to let the kernel know the exact range of the restartable
sequence, and means we can implement the abort handler in pure C, even if it
is placed at addresses before the restartable block by the compiler. This saves
us a jump on the fast path (otherwise required to skip over the abort code).
Doing the same with Paul's approach and yours would require clobbering yet
another register or adding one more store for the start_ip.
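
To make the fast-path comparison concrete, a sketch of the user-space
side of the indirection (struct layout and field names follow the
description above; this is not the selftest code):

#include <stdint.h>

struct rseq_cs {		/* descriptor, typically a link-time constant */
	uint64_t start_ip;
	uint64_t post_commit_ip;
	uint64_t abort_ip;
};

struct rseq {			/* per-thread TLS area, layout assumed */
	union {
		struct {
			uint32_t cpu_id;
			uint32_t event_counter;
		} e;
		uint64_t v;
	} u;
	uint64_t rseq_cs;	/* address of the active descriptor, or 0 */
};

/*
 * Fast path: a single store publishes the descriptor address. The
 * kernel only dereferences it in its slow path, i.e. when it preempts
 * or delivers a signal over the critical section. Clearing the field
 * after the commit is the same single store, with 0.
 */
static inline void rseq_set_active_cs(volatile struct rseq *r,
		const struct rseq_cs *cs)
{
	r->rseq_cs = (uint64_t)(uintptr_t)cs;
}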

> 
>> Combined with other recent feedback, this becomes:
>> 
>>  *   The abort_ip address needs to be lesser than start_ip, or
> 
> Isn't it "less than" ?

Indeed, I had to look this one up. "lesser" is an adjective, and here
I should use "to be less than", but below, the use of "be at addresses
lesser than" would appear to be OK.

> 
>>  *   greater-or-equal the post_commit_ip. Step [4] and the failure
>>  *   code step [F1] need to be at addresses lesser than start_ip, or
>>  *   greater-or-equal the post_commit_ip.
>> 
> 

<snip>

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10 13:33               ` Peter Zijlstra
@ 2016-08-10 14:04                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-10 14:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Aug 10, 2016, at 9:33 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Wed, Aug 10, 2016 at 01:26:04PM +0000, Mathieu Desnoyers wrote:
> 
>> static bool rseq_update_cpu_id_event_counter(struct task_struct *t)
>> {
>>         union rseq_cpu_event u;
>> 
>>         u.e.cpu_id = raw_smp_processor_id();
>>         u.e.event_counter = ++t->rseq_event_counter;
>>         if (__put_user(u.v, &t->rseq->u.v))
>>                 return false;
>>         trace_rseq_inc(t->rseq_event_counter);
> 
> I had not previously noticed the trace_* muck, but I would suggest
> passing in t and leaving it up to the tracepoint implementation to pick
> out the value.

OK, fixed.

> 
> Also, since this not only increments (it also updates the cpu number)
> the naming is 'wrong'.

I'll rename the event to "rseq_update" then, and have two fields:
cpu_id and event_counter.
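
Something along these lines (a sketch only; the TRACE_SYSTEM and
include guards of include/trace/events/rseq.h are omitted, and the
event_counter type is assumed to be uint32_t):

TRACE_EVENT(rseq_update,

	TP_PROTO(struct task_struct *t),

	TP_ARGS(t),

	TP_STRUCT__entry(
		__field(int, cpu_id)
		__field(uint32_t, event_counter)
	),

	TP_fast_assign(
		__entry->cpu_id = raw_smp_processor_id();
		__entry->event_counter = t->rseq_event_counter;
	),

	TP_printk("cpu_id=%d event_counter=%u",
		__entry->cpu_id, __entry->event_counter)
);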

Thanks,

Mathieu

> 
>>         return true;
> > }

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10 13:57         ` Mathieu Desnoyers
@ 2016-08-10 14:28           ` Peter Zijlstra
  2016-08-10 14:44             ` Mathieu Desnoyers
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2016-08-10 14:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Wed, Aug 10, 2016 at 01:57:05PM +0000, Mathieu Desnoyers wrote:

> > We could equally well have chosen a single structure and picked the
> > post_commit_ip field to trigger things from, no?
> > 
> > The only down side seems to be that we must then impose ordering (but UP
> > ordering, so that's cheap) between writing the abort_ip and
> > post_commit_ip.
> > 
> > That is; something like so:
> > 
> > struct rseq {
> >	union rseq_event_cpu u;
> > 
> >	u64 abort_ip;
> >	u64 post_commit_ip;
> > };
> > 
> > Where userspace must do:
> > 
> >	r->abort_ip = $abort_ip;
> >	barrier();
> >	WRITE_ONCE(r->post_commit_ip, $post_commit_ip);
> >	barrier();
> > 
> > Which is not much different from what Paul did, except he kept the
> > abort_ip in a register (which must be loaded before setting the
> > commit_ip).
> > 
> > And the kernel checks post_commit_ip, if 0, nothing happens, otherwise
> > we check instruction_pointer and do magic.
> > 
> > Then after the commit, we clear post_commit_ip again; just like we now
> > clear the rseq_cs pointer.
> > 
> > AFAICT this is an equally valid approach. So why split and put that
> > indirection in?
> 
> Now I understand from which angle you are looking at it.
> 
> The reason for this indirection is to speed up the user-space rseq_finish()
> fast path:
> 
> With Paul Turner's approach, we needed to clobber a register, issue
> instructions to move abort_ip to that register, and store the post_commit_ip
> to the TLS.
> 
> With your approach here, you need 2 stores, ordered with compiler-barriers:
> storing abort_ip to TLS, and then post_commit_ip to TLS.
> 
> The approach I propose (indirection) only requires a single store to the TLS:
> we store the address of the currently active struct rseq_cs descriptor. The
> kernel can then fetch the content of that descriptor (start_ip, post_commit_ip,
> abort_ip) when/if it preempts or delivers a signal over that critical section.
> 
> On architectures like arm32, it makes a very significant difference
> performance-wise to simply remove useless register movement or stores.
> 
> So I add an indirection in the kernel slow path (upon return to user-space after
> preempting a rseq asm sequence, or upon signal delivery over a rseq asm sequence),
> to speed up the user-space fast path.
> 
> By using the indirection approach, we also get the "start_ip" pointer for free,
> which can be used to let the kernel know the exact range of the restartable
> sequence, and means we can implement the abort handler in pure C, even if it
> is placed at addresses before the restartable block by the compiler. This saves
> us a jump on the fast path (otherwise required to skip over the abort code).
> Doing the same with Paul's approach and yours would require clobbering yet
> another register or adding one more store for the start_ip.

Ah, because the {start,abort,commit} tuple is link time constants? Which
means we can have this in .data and not on the stack, avoiding the
stores entirely.

Because the moment we put the thing on the stack, we need to do those
stores anyway.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10 14:28           ` Peter Zijlstra
@ 2016-08-10 14:44             ` Mathieu Desnoyers
  0 siblings, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-10 14:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Aug 10, 2016, at 10:28 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Wed, Aug 10, 2016 at 01:57:05PM +0000, Mathieu Desnoyers wrote:
> 
>> > We could equally well have chosen a single structure and picked the
>> > post_commit_ip field to trigger things from, no?
>> > 
>> > The only down side seems to be that we must then impose ordering (but UP
>> > ordering, so that's cheap) between writing the abort_ip and
>> > post_commit_ip.
>> > 
>> > That is; something like so:
>> > 
>> > struct rseq {
>> >	union rseq_event_cpu u;
>> > 
>> >	u64 abort_ip;
>> >	u64 post_commit_ip;
>> > };
>> > 
>> > Where userspace must do:
>> > 
>> >	r->abort_ip = $abort_ip;
>> >	barrier();
>> >	WRITE_ONCE(r->post_commit_ip, $post_commit_ip);
>> >	barrier();
>> > 
>> > Which is not much different from what Paul did, except he kept the
>> > abort_ip in a register (which must be loaded before setting the
>> > commit_ip).
>> > 
>> > And the kernel checks post_commit_ip, if 0, nothing happens, otherwise
>> > we check instruction_pointer and do magic.
>> > 
>> > Then after the commit, we clear post_commit_ip again; just like we now
>> > clear the rseq_cs pointer.
>> > 
>> > AFAICT this is an equally valid approach. So why split and put that
>> > indirection in?
>> 
>> Now I understand from which angle you are looking at it.
>> 
>> The reason for this indirection is to speed up the user-space rseq_finish()
>> fast path:
>> 
>> With Paul Turner's approach, we needed to clobber a register, issue
>> instructions to move abort_ip to that register, and store the post_commit_ip
>> to the TLS.
>> 
>> With your approach here, you need 2 stores, ordered with compiler-barriers:
>> storing abort_ip to TLS, and then post_commit_ip to TLS.
>> 
>> The approach I propose (indirection) only requires a single store to the TLS:
>> we store the address of the currently active struct rseq_cs descriptor. The
>> kernel can then fetch the content of that descriptor (start_ip, post_commit_ip,
>> abort_ip) when/if it preempts or delivers a signal over that critical section.
>> 
>> On architectures like arm32, it makes a very significant difference
>> performance-wise to simply remove useless register movement or stores.
>> 
>> So I add an indirection in the kernel slow path (upon return to user-space after
>> preempting a rseq asm sequence, or upon signal delivery over a rseq asm
>> sequence),
>> to speed up the user-space fast path.
>> 
>> By using the indirection approach, we also get the "start_ip" pointer for free,
>> which can be used to let the kernel know the exact range of the restartable
>> sequence, and means we can implement the abort handler in pure C, even if it
>> is placed at addresses before the restartable block by the compiler. This saves
>> us a jump on the fast path (otherwise required to skip over the abort code).
>> Doing the same with Paul's approach and yours would require clobbering yet
>> another register or adding one more store for the start_ip.
> 
> Ah, because the {start,abort,commit} tuple is link time constants? Which
> means we can have this in .data and not on the stack, avoiding the
> stores entirely.

Yes, this is exactly what we do in the selftests rseq.h for x86 and ppc. For
ARM32, we put this in the code (we jump over it), so we can calculate the
address pointing to the descriptor using the ip-relative "adr" instruction,
which is faster than loading an arbitrary address constant.

> 
> Because the moment we put the thing on the stack, we need to do those
> stores anyway.

Since those are link-time constants, we don't need to store them, ever.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-03 18:29       ` Christoph Lameter
@ 2016-08-10 16:47         ` Mathieu Desnoyers
  2016-08-10 16:59           ` Christoph Lameter
  0 siblings, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-10 16:47 UTC (permalink / raw)
  To: Chris Lameter
  Cc: Andy Lutomirski, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andi Kleen,
	Dave Watson, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Aug 3, 2016, at 2:29 PM, Chris Lameter cl@linux.com wrote:

> On Tue, 26 Jul 2016, Mathieu Desnoyers wrote:
> 
>> > What problem does this solve?
>>
>> It allows user-space to perform update operations on per-cpu data without
>> requiring heavy-weight atomic operations.
> 
> 
> This is great but seems to indicate that such a facility would be better
> for kernel code instead of user space code.

It would be interesting to eventually investigate whether rseq is
additionally useful for kernel code. It seems unrelated to its usefulness
for user-space code though.

Rseq for user-space only needs to hook into preemption and signal delivery,
which doesn't seem to have measurable effects on overall performance.

Doing rseq for kernel code would imply hooking into supplementary sites:

- preemption of kernel code (for atomicity wrt other threads). This would
  replace preempt_disable()/preempt_enable() critical sections touching
  per-cpu data shared with other threads. We would have to do the event_counter
  increment and ip fixup directly in the sched_out hook when preempting
  kernel code.
- possibly interrupt handlers (for atomicity wrt interrupts). This would
  replace local irq save/restore when touching per-cpu data shared with
  interrupt handlers. We would have to increment the event_counter and
  fixup on the pre-irq kernel frame.
- possibly NMI handlers (for atomicity wrt NMIs). This would replace
  preempt/irq off protected local atomic operations on per-cpu data
  shared with NMIs. We would have to increment the event_counter and
  fixup on the pre-NMI kernel frame.

Those supplementary hooks may add significant overall performance overhead,
so careful benchmarking would be required to figure out if it's worth it.

> 
>> First, prohibiting migration from user-space has been frowned upon
>> by scheduler developers for a long time, and I doubt this mindset will
>> change.
> 
> Note that the task isolation patchset from Chris Metcalf does something
> that goes a long way towards this. If you set strict isolation mode then
> the kernel will terminate the process or notify you if the scheduler
> becomes involved. In some way we are getting that as a side effect.

AFAIU, what you propose here is doable at the application design level.
We want to introduce rseq to speed up memory allocation, tracing, and
other uses of per-cpu data without having to modify the design of each
and every user-space application out there.

> Also prohibiting migration is trivial from user space. Just do a taskset
> to a single cpu.

This is also possible if you can redesign user-space applications, but not
from a library perspective. Invoking system calls to change the affinity of
a thread at each and every critical section would kill performance. Setting
the affinity of a thread from a library on behalf of the application and
leaving it affined requires changes to the application design.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10 16:47         ` Mathieu Desnoyers
@ 2016-08-10 16:59           ` Christoph Lameter
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Lameter @ 2016-08-10 16:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andy Lutomirski, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andi Kleen,
	Dave Watson, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Wed, 10 Aug 2016, Mathieu Desnoyers wrote:

> - preemption of kernel code (for atomicity wrt other threads). This would
>   replace preempt_disable()/preempt_enable() critical sections touching
>   per-cpu data shared with other threads. We would have to do the event_counter
>   increment and ip fixup directly in the sched_out hook when preempting
>   kernel code.

What we would need is special handling when returning from a context
switch so that we recognize what type of code section we are in and
continue execution at the proper retry site. This can be done by putting
code into special sections or using other methods that do not require
additional code.

> - possibly interrupt handlers (for atomicity wrt interrupts). This would
>   replace local irq save/restore when touching per-cpu data shared with
>   interrupt handlers. We would have to increment the event_counter and
>   fixup on the pre-irq kernel frame.

Same thing as before. Test if we are in a section by testing the return
address and then maybe continue elsewhere.

> Those supplementary hooks may add significant overall performance overhead,
> so careful benchmarking would be required to figure out if it's worth it.

We need a design that does not need these hooks. If we check the return
IP address for a special range then we would not need those. Any hooks
would bloat the code in such a way that the implementation would not be
acceptable for the kernel code.
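
For what it's worth, the "special range" check itself can be sketched
as follows (user-space analogue for illustration only: the section
name is arbitrary, and the __start_/__stop_ bounds are the symbols the
linker emits automatically for a section whose name is a valid C
identifier):

#include <stdbool.h>
#include <stdint.h>

/* Something must live in the section for the linker to emit the
 * bounds symbols; a real implementation would place the restartable
 * sequences themselves there. */
__attribute__((section("rseq_text"), used))
static void example_restartable_sequence(void)
{
}

extern const char __start_rseq_text[];
extern const char __stop_rseq_text[];

static inline bool ip_in_restartable_range(uintptr_t ip)
{
	return ip >= (uintptr_t)__start_rseq_text &&
	       ip < (uintptr_t)__stop_rseq_text;
}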

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-09 16:13               ` Boqun Feng
  2016-08-10  8:01                 ` Andy Lutomirski
@ 2016-08-10 17:33                 ` Mathieu Desnoyers
  2016-08-11  4:54                   ` Boqun Feng
  1 sibling, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-10 17:33 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Andy Lutomirski, Peter Zijlstra, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	linux-api, Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Aug 9, 2016, at 12:13 PM, Boqun Feng boqun.feng@gmail.com wrote:

<snip>

> 
> However, I'm thinking maybe we can use some tricks to avoid unnecessary
> aborts-on-preemption.
> 
> First of all, I notice we haven't made any constraints on what kind of
> memory objects could be "protected" by rseq critical sections yet. And I
> think this is something we should decide before adding this feature into
> kernel.
> 
> We can do some optimization if we have some constraints. For example, if
> the memory objects inside the rseq critical sections could only be
> modified by userspace programs, we therefore don't need to abort
> immediately upon a userspace task -> kernel task context switch.

The rseq_owner per-cpu variable and rseq_cpu field in task_struct you
propose below would indeed take care of this scenario.

> 
> Furthermore, if the memory objects inside the rseq critical sections
> could only be modified by userspace programs that have registered their
> rseq structures, we don't need to abort immediately on context switches
> between two rseq-unregistered tasks, or between one rseq-registered
> task and one rseq-unregistered task.
> 
> Instead, we do tricks as follows:
> 
> defining a percpu pointer in kernel:
> 
> DEFINE_PER_CPU(struct task_struct *, rseq_owner);
> 
> and a cpu field in struct task_struct:
> 
>	struct task_struct {
>	...
>	#ifdef CONFIG_RSEQ
>		struct rseq __user *rseq;
>		uint32_t rseq_event_counter;
>		int rseq_cpu;
>	#endif
>	...
>	};
> 
> (task_struct::rseq_cpu should be initialized as -1.)
> 
> each time at sched out(in rseq_sched_out()), we do something like:
> 
>	if (prev->rseq) {
>		raw_cpu_write(rseq_owner, prev);
>		prev->rseq_cpu = smp_processor_id();
>	}
> 
> each time sched in(in rseq_handle_notify_resume()), we do something
> like:
> 
>	if (current->rseq &&
>	    (this_cpu_read(rseq_owner) != current ||
>	     current->rseq_cpu != smp_processor_id()))
>		__rseq_handle_notify_resume(regs);
> 
> (Also need to modify rseq_signal_deliver() to call
> __rseq_handle_notify_resume() directly).
> 
> 
> I think this could save some unnecessary aborts-on-preemption, however,
> TBH, I'm too sleepy to verify every corner case. Will recheck this
> tomorrow.

This adds extra fields to the task struct, per-cpu rseq_owner pointers,
and hooks into sched_in which are not needed otherwise, all this to
eliminate unneeded abort-on-preemption.

If we look at the single-stepping use-case, this means that gdb would
only be able to single-step applications as long as neither itself, nor
any of its libraries, use rseq. This seems to be quite fragile. I prefer
requiring rseq users to implement a fallback to locking which progresses
in every situation rather than adding complexity and overhead trying to
lessen the odds of triggering the restart.

Simply lessening the odds of triggering the restart without a design that
ensures progress even in restart cases seems to make the lack-of-progress
problem just harder to debug when it surfaces in real life.
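
The usage pattern argued for here can be sketched as follows (names
and the retry bound are illustrative; rseq_try_percpu_op() is a stub
standing in for an actual rseq-based fast path):

#include <pthread.h>
#include <stdbool.h>

#define RSEQ_MAX_ATTEMPTS	10

static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stub: the real fast path would run the rseq critical section and
 * report whether it committed or was restarted. */
static bool rseq_try_percpu_op(void *data)
{
	(void)data;
	return true;
}

void percpu_op(void *data)
{
	int attempts;

	for (attempts = 0; attempts < RSEQ_MAX_ATTEMPTS; attempts++) {
		if (rseq_try_percpu_op(data))
			return;		/* committed without restart */
	}
	/*
	 * Progress guarantee: when single-stepped by a debugger, the
	 * rseq attempts keep restarting, so fall back to plain locking,
	 * which always completes.
	 */
	pthread_mutex_lock(&fallback_lock);
	/* ... perform the operation under the lock ... */
	pthread_mutex_unlock(&fallback_lock);
}

Note that a complete scheme also needs the rseq fast path to notice
when the fallback lock is held; that part is omitted from this sketch.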

Thanks,

Mathieu

> 
> Regards,
> Boqun

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10  8:01                 ` Andy Lutomirski
@ 2016-08-10 17:40                   ` Mathieu Desnoyers
  0 siblings, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-10 17:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Boqun Feng, Peter Zijlstra, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	linux-api, Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Aug 10, 2016, at 4:01 AM, Andy Lutomirski luto@amacapital.net wrote:

> On Tue, Aug 9, 2016 at 9:13 AM, Boqun Feng <boqun.feng@gmail.com> wrote:

<snip>

> 
>> However, I'm thinking maybe we can use some tricks to avoid unnecessary
>> aborts-on-preemption.
>>
>> First of all, I notice we haven't made any constraints on what kind of
>> memory objects could be "protected" by rseq critical sections yet. And I
>> think this is something we should decide before adding this feature into
>> kernel.
>>
>> We can do some optimization if we have some constraints. For example, if
>> the memory objects inside the rseq critical sections could only be
>> modified by userspace programs, we therefore don't need to abort
>> immediately upon a userspace task -> kernel task context switch.
> 
> True, although trying to do a syscall in an rseq critical section
> seems like a bad idea in general.

The scenario above does not require the rseq critical section to perform
an explicit system call. It can happen from simple timer-driven preemption
of user-space.

<snip>

> 
> But do we need to protect MAP_SHARED objects?  If not, maybe we could
> only track context switches between different tasks sharing the same
> mm.

I have tracing use-cases involving MAP_SHARED objects for rseq: per-cpu
buffers.

Moreover, if you only track context switches between tasks with the same
mm, you run into issues if you have:

Process A
  Thread 1 (rseq)
  Thread 2 (rseq)

Process B
  Thread 1

Scheduling: A.1 -> B.1 -> A.2 -> B.1 -> A.1

There is no scheduling between threads of the same process here, but
the entire chain involves two threads of the same process accessing
the same per-cpu data concurrently.

Thanks,

Mathieu


> 
> --Andy

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10  8:10       ` Andy Lutomirski
@ 2016-08-10 19:04         ` Mathieu Desnoyers
  2016-08-10 19:16           ` Andy Lutomirski
  0 siblings, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-10 19:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Aug 10, 2016, at 4:10 AM, Andy Lutomirski luto@amacapital.net wrote:

> On Tue, Aug 9, 2016 at 1:06 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:

<snip>

>> Actually, we want copy_from_user() there. This executes upon
>> resume to user-space, so we can take a page fault if needed, so
>> no "inatomic" needed. I therefore suggest:
> 
> Running the code below via exit_to_usermode_loop...
> 
>>
>> static bool rseq_get_rseq_cs(struct task_struct *t,
>>                 void __user **start_ip,
>>                 void __user **post_commit_ip,
>>                 void __user **abort_ip)
>> {
>>         unsigned long ptr;
>>         struct rseq_cs __user *urseq_cs;
>>         struct rseq_cs rseq_cs;
>>
>>         if (__get_user(ptr, &t->rseq->rseq_cs))
>>                 return false;
>>         if (!ptr)
>>                 return true;
>> #ifdef CONFIG_COMPAT
>>         if (in_compat_syscall()) {
>>                 urseq_cs = compat_ptr((compat_uptr_t)ptr);
>>                 if (copy_from_user(&rseq_cs, urseq_cs, sizeof(*rseq_cs)))
>>                         return false;
>>                 *start_ip = compat_ptr((compat_uptr_t)rseq_cs.start_ip);
>>                 *post_commit_ip = compat_ptr((compat_uptr_t)rseq_cs.post_commit_ip);
>>                 *abort_ip = compat_ptr((compat_uptr_t)rseq_cs.abort_ip);
>>                 return true;
>>         }
>> #endif
> 
> ...means that in_compat_syscall() is nonsense.  (It *works* there, but
> I can't imagine that it does anything that is actually sensible for
> this use.)

Agreed that we are not per-se in a system call here. It works for
in_ia32_syscall(), but it may not work for in_x32_syscall().

Then should we test for this ?

if (!is_64bit_mm(current->mm))

This is currently x86-specific. Is this how we are expected to test
the user-space pointer size in the current mm in arch-agnostic code ?
If so, we should implement is_64bit_mm() on all other architectures.

> 
> Can't you just define the ABI so that no compat junk is needed?
> (Also, CRIU will thank you for doing that.)

We are dealing with user-space pointers here, so AFAIU we need to
be aware of their size, which involves compat code. Am I missing
something ?

> 
> 
>>>> +SYSCALL_DEFINE2(rseq, struct rseq __user *, rseq, int, flags)
>>>> +{
>>>> +    if (unlikely(flags))
>>>> +            return -EINVAL;
>>>
>>> (add whitespace)
>>
>> fixed.
>>
>>>
>>>> +    if (!rseq) {
>>>> +            if (!current->rseq)
>>>> +                    return -ENOENT;
>>>> +            return 0;
>>>> +    }
> 
> This looks entirely wrong.  Setting rseq to NULL fails if it's already
> NULL but silently does nothing if rseq is already set?  Surely it
> should always succeed and it should actually do something if rseq is
> set.

From the proposed rseq(2) manpage:

"A NULL rseq value can be used to check whether rseq is registered
for the current thread."

The implementation does just that: it returns -1, errno=ENOENT if no
rseq is currently registered, or 0 if rseq is currently registered.

Thanks,

Mathieu


> 
> 
> --
> Andy Lutomirski
> AMA Capital Management, LLC

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10 19:04         ` Mathieu Desnoyers
@ 2016-08-10 19:16           ` Andy Lutomirski
  2016-08-10 20:06             ` Mathieu Desnoyers
  0 siblings, 1 reply; 82+ messages in thread
From: Andy Lutomirski @ 2016-08-10 19:16 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Wed, Aug 10, 2016 at 12:04 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> ----- On Aug 10, 2016, at 4:10 AM, Andy Lutomirski luto@amacapital.net wrote:
>
>> On Tue, Aug 9, 2016 at 1:06 PM, Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com> wrote:
>
> <snip>
>
>>> Actually, we want copy_from_user() there. This executes upon
>>> resume to user-space, so we can take a page fault if needed, so
>>> no "inatomic" needed. I therefore suggest:
>>
>> Running the code below via exit_to_usermode_loop...
>>
>>>
>>> static bool rseq_get_rseq_cs(struct task_struct *t,
>>>                 void __user **start_ip,
>>>                 void __user **post_commit_ip,
>>>                 void __user **abort_ip)
>>> {
>>>         unsigned long ptr;
>>>         struct rseq_cs __user *urseq_cs;
>>>         struct rseq_cs rseq_cs;
>>>
>>>         if (__get_user(ptr, &t->rseq->rseq_cs))
>>>                 return false;
>>>         if (!ptr)
>>>                 return true;
>>> #ifdef CONFIG_COMPAT
>>>         if (in_compat_syscall()) {
>>>                 urseq_cs = compat_ptr((compat_uptr_t)ptr);
>>>                 if (copy_from_user(&rseq_cs, urseq_cs, sizeof(*rseq_cs)))
>>>                         return false;
>>>                 *start_ip = compat_ptr((compat_uptr_t)rseq_cs.start_ip);
>>>                 *post_commit_ip = compat_ptr((compat_uptr_t)rseq_cs.post_commit_ip);
>>>                 *abort_ip = compat_ptr((compat_uptr_t)rseq_cs.abort_ip);
>>>                 return true;
>>>         }
>>> #endif
>>
>> ...means that in_compat_syscall() is nonsense.  (It *works* there, but
>> I can't imagine that it does anything that is actually sensible for
>> this use.)
>
> Agreed that we are not per-se in a system call here. It works for
> in_ia32_syscall(), but it may not work for in_x32_syscall().
>
> Then should we test for this ?
>
> if (!is_64bit_mm(current->mm))
>
> This is currently x86-specific. Is this how we are expected to test
> the user-space pointer size in the current mm in arch-agnostic code ?
> If so, we should implement is_64bit_mm() on all other architectures.

There is no universal concept of the user-space pointer size on x86
because x86 code can change it via long jumps.

What are you actually trying to do?  I would guess that
user_64bit_mode(regs) is the right thing here, because the rseq data
structure is describing the currently executing code.

>
>>
>> Can't you just define the ABI so that no compat junk is needed?
>> (Also, CRIU will thank you for doing that.)
>
> We are dealing with user-space pointers here, so AFAIU we need to
> be aware of their size, which involves compat code. Am I missing
> something ?

u64 is a perfectly valid, if odd, userspace pointer on all
architectures that I know of, and it's certainly a valid userspace
pointer on x86 32-bit userspace (the high bits will just all be zero).
Can you just use u64?

If this would be a performance problem on ARM, then maybe that's a
reason to use compat helpers.

>
>>
>>
>>>>> +SYSCALL_DEFINE2(rseq, struct rseq __user *, rseq, int, flags)
>>>>> +{
>>>>> +    if (unlikely(flags))
>>>>> +            return -EINVAL;
>>>>
>>>> (add whitespace)
>>>
>>> fixed.
>>>
>>>>
>>>>> +    if (!rseq) {
>>>>> +            if (!current->rseq)
>>>>> +                    return -ENOENT;
>>>>> +            return 0;
>>>>> +    }
>>
>> This looks entirely wrong.  Setting rseq to NULL fails if it's already
>> NULL but silently does nothing if rseq is already set?  Surely it
>> should always succeed and it should actually do something if rseq is
>> set.
>
> From the proposed rseq(2) manpage:
>
> "A NULL rseq value can be used to check whether rseq is registered
> for the current thread."
>
> The implementation does just that: it returns -1, errno=ENOENT if no
> rseq is currently registered, or 0 if rseq is currently registered.

I think that's problematic.  Why can't you unregister an existing
rseq?  If you can't, how is a thread supposed to clean up after
itself?

--Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10 19:16           ` Andy Lutomirski
@ 2016-08-10 20:06             ` Mathieu Desnoyers
  2016-08-10 20:09               ` Andy Lutomirski
  0 siblings, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-10 20:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Aug 10, 2016, at 3:16 PM, Andy Lutomirski luto@amacapital.net wrote:

> On Wed, Aug 10, 2016 at 12:04 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>> ----- On Aug 10, 2016, at 4:10 AM, Andy Lutomirski luto@amacapital.net wrote:
>>
>>> On Tue, Aug 9, 2016 at 1:06 PM, Mathieu Desnoyers
>>> <mathieu.desnoyers@efficios.com> wrote:
>>
>> <snip>
>>
>>>> Actually, we want copy_from_user() there. This executes upon
>>>> resume to user-space, so we can take a page fault if needed, so
>>>> no "inatomic" needed. I therefore suggest:
>>>
>>> Running the code below via exit_to_usermode_loop...
>>>
>>>>
>>>> static bool rseq_get_rseq_cs(struct task_struct *t,
>>>>                 void __user **start_ip,
>>>>                 void __user **post_commit_ip,
>>>>                 void __user **abort_ip)
>>>> {
>>>>         unsigned long ptr;
>>>>         struct rseq_cs __user *urseq_cs;
>>>>         struct rseq_cs rseq_cs;
>>>>
>>>>         if (__get_user(ptr, &t->rseq->rseq_cs))
>>>>                 return false;
>>>>         if (!ptr)
>>>>                 return true;
>>>> #ifdef CONFIG_COMPAT
>>>>         if (in_compat_syscall()) {
>>>>                 urseq_cs = compat_ptr((compat_uptr_t)ptr);
>>>>                 if (copy_from_user(&rseq_cs, urseq_cs, sizeof(*rseq_cs)))
>>>>                         return false;
>>>>                 *start_ip = compat_ptr((compat_uptr_t)rseq_cs.start_ip);
>>>>                 *post_commit_ip = compat_ptr((compat_uptr_t)rseq_cs.post_commit_ip);
>>>>                 *abort_ip = compat_ptr((compat_uptr_t)rseq_cs.abort_ip);
>>>>                 return true;
>>>>         }
>>>> #endif
>>>
>>> ...means that in_compat_syscall() is nonsense.  (It *works* there, but
>>> I can't imagine that it does anything that is actually sensible for
>>> this use.)
>>
>> Agreed that we are not per-se in a system call here. It works for
>> in_ia32_syscall(), but it may not work for in_x32_syscall().
>>
>> Then should we test for this ?
>>
>> if (!is_64bit_mm(current->mm))
>>
>> This is currently x86-specific. Is this how we are expected to test
>> the user-space pointer size in the current mm in arch-agnostic code ?
>> If so, we should implement is_64bit_mm() on all other architectures.
> 
> There is no universal concept of the user-space pointer size on x86
> because x86 code can change it via long jumps.
> 
> What are you actually trying to do?  I would guess that
> user_64bit_mode(regs) is the right thing here, because the rseq data
> structure is describing the currently executing code.

Yes, that's correct, we care about the pointer size of currently executing
code. On x86 user_64bit_mode(regs) would appear to be the right thing to do.

> 
>>
>>>
>>> Can't you just define the ABI so that no compat junk is needed?
>>> (Also, CRIU will thank you for doing that.)
>>
>> We are dealing with user-space pointers here, so AFAIU we need to
>> be aware of their size, which involves compat code. Am I missing
>> something ?
> 
> u64 is a perfectly valid, if odd, userspace pointer on all
>>> architectures that I know of, and it's certainly a valid userspace
> pointer on x86 32-bit userspace (the high bits will just all be zero).
> Can you just use u64?

My concern is about a 32-bit user-space putting garbage rather than zeroes
(on purpose) to fool the kernel on those upper 32 bits. Doing

  compat_ptr((compat_uptr_t)rseq_cs.start_ip)

effectively ends up clearing the upper 32 bits.

But since we only use those pointer values for comparisons, perhaps we
just don't care if a 32-bit userspace app tries to shoot itself in
the foot by passing garbage upper 32 bits ?

> 
> If this would be a performance problem on ARM, then maybe that's a
> reason to use compat helpers.

We already use 64-bit values for the pointers, even on 32-bit. Normally
userspace just puts zeroes in the top bits. It's mostly a question of
clearing the top 32 bits or not when loading them in the kernel. If we
don't need to, then I can remove the compat code entirely, and we don't
care about user_64bit_mode() anymore, as you initially recommended.
Does it make sense ?

> 
>>
>>>
>>>
>>>>>> +SYSCALL_DEFINE2(rseq, struct rseq __user *, rseq, int, flags)
>>>>>> +{
>>>>>> +    if (unlikely(flags))
>>>>>> +            return -EINVAL;
>>>>>
>>>>> (add whitespace)
>>>>
>>>> fixed.
>>>>
>>>>>
>>>>>> +    if (!rseq) {
>>>>>> +            if (!current->rseq)
>>>>>> +                    return -ENOENT;
>>>>>> +            return 0;
>>>>>> +    }
>>>
>>> This looks entirely wrong.  Setting rseq to NULL fails if it's already
>>> NULL but silently does nothing if rseq is already set?  Surely it
>>> should always succeed and it should actually do something if rseq is
>>> set.
>>
>> From the proposed rseq(2) manpage:
>>
>> "A NULL rseq value can be used to check whether rseq is registered
>> for the current thread."
>>
>> The implementation does just that: it returns -1, errno=ENOENT if no
>> rseq is currently registered, or 0 if rseq is currently registered.
> 
> I think that's problematic.  Why can't you unregister an existing
> rseq?  If you can't, how is a thread supposed to clean up after
> itself?
> 

Unregistering an existing thread rseq would require that we keep reference
counting, in case multiple libs and/or the app are using rseq. I am
trying to keep things as simple as needed.

If I understand your concern, the problematic scenario would be at
thread exit (this is my current approximate understanding of glibc
handling of library TLS variable reclaim at thread exit):

thread exits in userspace:
- glibc frees its rseq TLS memory area (in case the TLS is in a library),
- thread preempted before really exiting,
- kernel reads/writes to freed TLS memory.
  - corruption may occur (e.g. memory re-allocated by another thread already)

Am I getting it right ?

Thanks,

Mathieu

> --Andy

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10 20:06             ` Mathieu Desnoyers
@ 2016-08-10 20:09               ` Andy Lutomirski
  2016-08-10 21:01                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 82+ messages in thread
From: Andy Lutomirski @ 2016-08-10 20:09 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

On Wed, Aug 10, 2016 at 1:06 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> ----- On Aug 10, 2016, at 3:16 PM, Andy Lutomirski luto@amacapital.net wrote:
>
>> On Wed, Aug 10, 2016 at 12:04 PM, Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com> wrote:
>>> ----- On Aug 10, 2016, at 4:10 AM, Andy Lutomirski luto@amacapital.net wrote:
>>>
>>>> On Tue, Aug 9, 2016 at 1:06 PM, Mathieu Desnoyers
>>>> <mathieu.desnoyers@efficios.com> wrote:
>>>
>>> <snip>
>>>
>>>>> Actually, we want copy_from_user() there. This executes upon
>>>>> resume to user-space, so we can take a page fault if needed, so
>>>>> no "inatomic" needed. I therefore suggest:
>>>>
>>>> Running the code below via exit_to_usermode_loop...
>>>>
>>>>>
>>>>> static bool rseq_get_rseq_cs(struct task_struct *t,
>>>>>                 void __user **start_ip,
>>>>>                 void __user **post_commit_ip,
>>>>>                 void __user **abort_ip)
>>>>> {
>>>>>         unsigned long ptr;
>>>>>         struct rseq_cs __user *urseq_cs;
>>>>>         struct rseq_cs rseq_cs;
>>>>>
>>>>>         if (__get_user(ptr, &t->rseq->rseq_cs))
>>>>>                 return false;
>>>>>         if (!ptr)
>>>>>                 return true;
>>>>> #ifdef CONFIG_COMPAT
>>>>>         if (in_compat_syscall()) {
>>>>>                 urseq_cs = compat_ptr((compat_uptr_t)ptr);
>>>>>                 if (copy_from_user(&rseq_cs, urseq_cs, sizeof(*rseq_cs)))
>>>>>                         return false;
>>>>>                 *start_ip = compat_ptr((compat_uptr_t)rseq_cs.start_ip);
>>>>>                 *post_commit_ip = compat_ptr((compat_uptr_t)rseq_cs.post_commit_ip);
>>>>>                 *abort_ip = compat_ptr((compat_uptr_t)rseq_cs.abort_ip);
>>>>>                 return true;
>>>>>         }
>>>>> #endif
>>>>
>>>> ...means that in_compat_syscall() is nonsense.  (It *works* there, but
>>>> I can't imagine that it does anything that is actually sensible for
>>>> this use.)
>>>
>>> Agreed that we are not per-se in a system call here. It works for
>>> in_ia32_syscall(), but it may not work for in_x32_syscall().
>>>
>>> Then should we test for this ?
>>>
>>> if (!is_64bit_mm(current->mm))
>>>
>>> This is currently x86-specific. Is this how we are expected to test
>>> the user-space pointer size in the current mm in arch-agnostic code ?
>>> If so, we should implement is_64bit_mm() on all other architectures.
>>
>> There is no universal concept of the user-space pointer size on x86
>> because x86 code can change it via long jumps.
>>
>> What are you actually trying to do?  I would guess that
>> user_64bit_mode(regs) is the right thing here, because the rseq data
>> structure is describing the currently executing code.
>
> Yes, that's correct, we care about the pointer size of currently executing
> code. On x86 user_64bit_mode(regs) would appear to be the right thing to do.
>
>>
>>>
>>>>
>>>> Can't you just define the ABI so that no compat junk is needed?
>>>> (Also, CRIU will thank you for doing that.)
>>>
>>> We are dealing with user-space pointers here, so AFAIU we need to
>>> be aware of their size, which involves compat code. Am I missing
>>> something ?
>>
>> u64 is a perfectly valid, if odd, userspace pointer on all
>> architecures that I know of, and it's certainly a valid userspace
>> pointer on x86 32-bit userspace (the high bits will just all be zero).
>> Can you just use u64?
>
> My concern is about a 32-bit user-space putting garbage rather than zeroes
> (on purpose) to fool the kernel on those upper 32 bits. Doing
>
>   compat_ptr((compat_uptr_t)rseq_cs.start_ip)
>
> effectively ends up clearing the upper 32 bits.
>
> But since we only use those pointer values for comparisons, perhaps we
> just don't care if a 32-bit userspace app tries to shoot itself in
> the foot by passing garbage upper 32 bits ?
>

How is garbage in the high bits any different than garbage in any
other bits in there?

>
>> If this would be a performance problem on ARM, then maybe that's a
>> reason to use compat helpers.
>
> We already use 64-bit values for the pointers, even on 32-bit. Normally
> userspace just puts zeroes in the top bits. It's mostly a question of
> clearing the top 32 bits or not when loading them in the kernel. If we
> don't need to, then I can remove the compat code entirely, and we don't
> care about user_64bit_mode() anymore, as you initially recommended.
> Does it make sense ?

Yes, I think so.  I'd suggest just honoring all the bits.

>
>>
>>>
>>>>
>>>>
>>>>>>> +SYSCALL_DEFINE2(rseq, struct rseq __user *, rseq, int, flags)
>>>>>>> +{
>>>>>>> +    if (unlikely(flags))
>>>>>>> +            return -EINVAL;
>>>>>>
>>>>>> (add whitespace)
>>>>>
>>>>> fixed.
>>>>>
>>>>>>
>>>>>>> +    if (!rseq) {
>>>>>>> +            if (!current->rseq)
>>>>>>> +                    return -ENOENT;
>>>>>>> +            return 0;
>>>>>>> +    }
>>>>
>>>> This looks entirely wrong.  Setting rseq to NULL fails if it's already
>>>> NULL but silently does nothing if rseq is already set?  Surely it
>>>> should always succeed and it should actually do something if rseq is
>>>> set.
>>>
>>> From the proposed rseq(2) manpage:
>>>
>>> "A NULL rseq value can be used to check whether rseq is registered
>>> for the current thread."
>>>
>>> The implementation does just that: it returns -1, errno=ENOENT if no
>>> rseq is currently registered, or 0 if rseq is currently registered.
>>
>> I think that's problematic.  Why can't you unregister an existing
>> rseq?  If you can't, how is a thread supposed to clean up after
>> itself?
>>
>
> Unregistering an existing thread rseq would require that we keep reference
> counting, in case multiple libs and/or the app are using rseq. I am
> trying to keep things as simple as needed.
>
> If I understand your concern, the problematic scenario would be at
> thread exit (this is my current approximate understanding of glibc
> handling of library TLS variable reclaim at thread exit):
>
> thread exits in userspace:
> - glibc frees its rseq TLS memory area (in case the TLS is in a library),
> - thread preempted before really exiting,
> - kernel reads/writes to freed TLS memory.
>   - corruption may occur (e.g. memory re-allocated by another thread already)
>
> Am I getting it right ?

Yes.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10 20:09               ` Andy Lutomirski
@ 2016-08-10 21:01                 ` Mathieu Desnoyers
  2016-08-11  7:23                   ` Andy Lutomirski
  0 siblings, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-10 21:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Aug 10, 2016, at 4:09 PM, Andy Lutomirski luto@amacapital.net wrote:

> On Wed, Aug 10, 2016 at 1:06 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

<snip>

>>> u64 is a perfectly valid, if odd, userspace pointer on all
>>> architecures that I know of, and it's certainly a valid userspace
>>> pointer on x86 32-bit userspace (the high bits will just all be zero).
>>> Can you just use u64?
>>
>> My concern is about a 32-bit user-space putting garbage rather than zeroes
>> (on purpose) to fool the kernel on those upper 32 bits. Doing
>>
>>   compat_ptr((compat_uptr_t)rseq_cs.start_ip)
>>
>> effectively ends up clearing the upper 32 bits.
>>
>> But since we only use those pointer values for comparisons, perhaps we
>> just don't care if a 32-bit userspace app try to shoot itself in
>> the foot by passing garbage upper 32 bits ?
>>
> 
> How is garbage in the high bits any different than garbage in any
> other bits in there?

It's not :)

> 
>>
>>> If this would be a performance problem on ARM, then maybe that's a
>>> reason to use compat helpers.
>>
>> We already use 64-bit values for the pointers, even on 32-bit. Normally
>> userspace just puts zeroes in the top bits. It's mostly a question of
>> clearing the top 32 bits or not when loading them in the kernel. If we
>> don't need to, then I can remove the compat code entirely, and we don't
>> care about user_64bit_mode() anymore, as you initially recommended.
>> Does it make sense ?
> 
> Yes, I think so.  I'd suggest just honoring all the bits.

OK, will do !
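
i.e., on the kernel side this boils down to something like the following
(just a sketch; the rseq_cs field names follow the proposed layout, and the
surrounding handler context is elided):

	unsigned long ip = instruction_pointer(regs);
	bool in_cs;

	/*
	 * No compat clearing: compare the full 64-bit user-supplied values.
	 * The compared fields are never dereferenced, so garbage in the upper
	 * bits of a 32-bit task's pointers simply makes the comparison fail.
	 */
	in_cs = (u64)ip >= rseq_cs.start_ip && (u64)ip < rseq_cs.post_commit_ip;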

> 
>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>>> +SYSCALL_DEFINE2(rseq, struct rseq __user *, rseq, int, flags)
>>>>>>>> +{
>>>>>>>> +    if (unlikely(flags))
>>>>>>>> +            return -EINVAL;
>>>>>>>
>>>>>>> (add whitespace)
>>>>>>
>>>>>> fixed.
>>>>>>
>>>>>>>
>>>>>>>> +    if (!rseq) {
>>>>>>>> +            if (!current->rseq)
>>>>>>>> +                    return -ENOENT;
>>>>>>>> +            return 0;
>>>>>>>> +    }
>>>>>
>>>>> This looks entirely wrong.  Setting rseq to NULL fails if it's already
>>>>> NULL but silently does nothing if rseq is already set?  Surely it
>>>>> should always succeed and it should actually do something if rseq is
>>>>> set.
>>>>
>>>> From the proposed rseq(2) manpage:
>>>>
>>>> "A NULL rseq value can be used to check whether rseq is registered
>>>> for the current thread."
>>>>
>>>> The implementation does just that: it returns -1, errno=ENOENT if no
>>>> rseq is currently registered, or 0 if rseq is currently registered.
>>>
>>> I think that's problematic.  Why can't you unregister an existing
>>> rseq?  If you can't, how is a thread supposed to clean up after
>>> itself?
>>>
>>
>> Unregistering an existing thread rseq would require that we keep reference
>> counting, in case multiple libs and/or the app are using rseq. I am
>> trying to keep things as simple as needed.
>>
>> If I understand your concern, the problematic scenario would be at
>> thread exit (this is my current approximate understanding of glibc
>> handling of library TLS variable reclaim at thread exit):
>>
>> thread exits in userspace:
>> - glibc frees its rseq TLS memory area (in case the TLS is in a library),
>> - thread preempted before really exiting,
>> - kernel reads/writes to freed TLS memory.
>>   - corruption may occur (e.g. memory re-allocated by another thread already)
>>
>> Am I getting it right ?
> 
> Yes.

Hrm, then we should:

- add a rseq_refcount field to the task struct,
- increment this refcount whenever rseq receives a registration, after
  ensuring that we are registering the same address as was previously
  requested by preceding registrations for the thread (except if the
  refcount was 0),
- When rseq receives a NULL address, decrement refcount. Set address to
  NULL when it reaches 0.

Doing the refcounting in kernel-space rather than user-space allows us to
keep both registration/unregistration and the refcount update atomic, which
simplifies things if we plan to use rseq from signal handlers.
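
Roughly (a sketch only: the rseq_refcount task_struct field name, the -EBUSY
case for a mismatched address, and skipping the notify-resume hookup are all
assumptions at this point):

SYSCALL_DEFINE2(rseq, struct rseq __user *, rseq, int, flags)
{
	if (unlikely(flags))
		return -EINVAL;

	if (!rseq) {
		/* Unregistration: drop one reference, clear the address at zero. */
		if (!current->rseq)
			return -ENOENT;
		if (!--current->rseq_refcount)
			current->rseq = NULL;
		return 0;
	}

	/* Registration: all users within a thread must pass the same address. */
	if (current->rseq && current->rseq != rseq)
		return -EBUSY;
	current->rseq = rseq;
	current->rseq_refcount++;
	return 0;
}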

With current glibc, a library that would lazily register and use rseq
without knowledge of the application would then have to use pthread_key_create()
to set a destr_function to run at thread exit, which would take care of
unregistration.
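
For instance (sketch only; it assumes the refcounted register/unregister
semantics described above, an available __NR_rseq syscall number, and
made-up lib_rseq_* helper names):

#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/rseq.h>		/* struct rseq, from the installed uapi headers */

static pthread_key_t lib_rseq_key;
static __thread volatile struct rseq lib_rseq_abi;

static void lib_rseq_thread_exit(void *arg)
{
	/* Drop the reference this library holds for the exiting thread. */
	syscall(__NR_rseq, NULL, 0);
}

static void lib_rseq_init(void)
{
	pthread_key_create(&lib_rseq_key, lib_rseq_thread_exit);
}

static int lib_rseq_register_thread(void)
{
	if (syscall(__NR_rseq, &lib_rseq_abi, 0))
		return -1;
	/* Any non-NULL value, just so the destructor runs at thread exit. */
	return pthread_setspecific(lib_rseq_key, (void *)1);
}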

We could add an RSEQ_FORCE_UNREGISTER flag to the rseq flags argument to allow
future glibc versions to force unregistering rseq before freeing its TLS memory,
just in case a userspace library neglects to unregister itself.

Thoughts ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10 17:33                 ` Mathieu Desnoyers
@ 2016-08-11  4:54                   ` Boqun Feng
  0 siblings, 0 replies; 82+ messages in thread
From: Boqun Feng @ 2016-08-11  4:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andy Lutomirski, Peter Zijlstra, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel,
	linux-api, Paul Turner, Andrew Hunter, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

[-- Attachment #1: Type: text/plain, Size: 6382 bytes --]

On Wed, Aug 10, 2016 at 05:33:44PM +0000, Mathieu Desnoyers wrote:
> ----- On Aug 9, 2016, at 12:13 PM, Boqun Feng boqun.feng@gmail.com wrote:
> 
> <snip>
> 
> > 
> > However, I'm thinking maybe we can use some tricks to avoid unnecessary
> > aborts-on-preemption.
> > 
> > First of all, I notice we haven't make any constraint on what kind of
> > memory objects could be "protected" by rseq critical sections yet. And I
> > think this is something we should decide before adding this feature into
> > kernel.
> > 
> > We can do some optimization if we have some constraints. For example, if
> > the memory objects inside the rseq critical sections could only be
> > modified by userspace programs, we therefore don't need to abort
> > immediately when userspace task -> kernel task context switch.
> 
> The rseq_owner per-cpu variable and rseq_cpu field in task_struct you
> propose below would indeed take care of this scenario.
> 
> > 
> > Further more, if the memory objects inside the rseq critical sections
> > could only be modified by userspace programs that have registered their
> > rseq structures, we don't need to abort immediately between the context
> > switches between two rseq-unregistered tasks or one rseq-registered
> > task and one rseq-unregistered task.
> > 
> > Instead, we do tricks as follow:
> > 
> > defining a percpu pointer in kernel:
> > 
> > DEFINE_PER_CPU(struct task_struct *, rseq_owner);
> > 
> > and a cpu field in struct task_struct:
> > 
> >	struct task_struct {
> >	...
> >	#ifdef CONFIG_RSEQ
> >		struct rseq __user *rseq;
> >		uint32_t rseq_event_counter;
> >		int rseq_cpu;
> >	#endif
> >	...
> >	};
> > 
> > (task_struct::rseq_cpu should be initialized as -1.)
> > 
> > each time at sched out(in rseq_sched_out()), we do something like:
> > 
> >	if (prev->rseq) {
> >		raw_cpu_write(rseq_owner, prev);
> >		prev->rseq_cpu = smp_processor_id();
> >	}
> > 
> > each time sched in(in rseq_handle_notify_resume()), we do something
> > like:
> > 
> >	if (current->rseq &&
> >	    (this_cpu_read(rseq_owner) != current ||
> >	     current->rseq_cpu != smp_processor_id()))
> >		__rseq_handle_notify_resume(regs);
> > 
> > (Also need to modify rseq_signal_deliver() to call
> > __rseq_handle_notify_resume() directly).
> > 
> > 
> > I think this could save some unnecessary aborts-on-preemption, however,
> > TBH, I'm too sleepy to verify every corner case. Will recheck this
> > tomorrow.
> 
> This adds extra fields to the task struct, per-cpu rseq_owner pointers,
> and hooks into sched_in which are not needed otherwise, all this to
> eliminate unneeded abort-on-preemption.
> 
> If we look at the single-stepping use-case, this means that gdb would
> only be able to single-step applications as long as neither itself, nor
> any of its libraries, use rseq. This seems to be quite fragile. I prefer
> requiring rseq users to implement a fallback to locking which progresses
> in every situation rather than adding complexity and overhead trying
> lessen the odds of triggering the restart.
> 
> Simply lessening the odds of triggering the restart without a design that
> ensures progress even in restart cases seems to make the lack-of-progress
> problem just harder to debug when it will surface in real life.
> 

Fair enough.

I did my own investigation of the mechanism I proposed. The patch is attached
at the end of the email. Unfortunately, there is no noticeable
performance gain for the current benchmark. One possible reason is that
the rseq critical sections in the current benchmark are quite small, which
makes retrying inexpensive.

From another angle, this may imply that in the current scenarios,
abort-on-preemption doesn't hurt performance much. But that's
only my two cents.

> Thanks,
> 
> Mathieu
> 

---
 include/linux/sched.h | 18 +++++++++++++++---
 kernel/rseq.c         |  2 ++
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5875fdd6edc8..c23e5dee9c60 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1922,6 +1922,7 @@ struct task_struct {
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_event_counter;
+	int rseq_cpu;
 #endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
@@ -3393,6 +3394,8 @@ void cpufreq_remove_update_util_hook(int cpu);
 #endif /* CONFIG_CPU_FREQ */
 
 #ifdef CONFIG_RSEQ
+DECLARE_PER_CPU(struct task_struct *, rseq_owner);
+
 static inline void rseq_set_notify_resume(struct task_struct *t)
 {
 	if (t->rseq)
@@ -3401,7 +3404,9 @@ static inline void rseq_set_notify_resume(struct task_struct *t)
 void __rseq_handle_notify_resume(struct pt_regs *regs);
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
-	if (current->rseq)
+	if (current->rseq &&
+	    (current != raw_cpu_read(rseq_owner) ||
+	     current->rseq_cpu != smp_processor_id()))
 		__rseq_handle_notify_resume(regs);
 }
 /*
@@ -3415,9 +3420,11 @@ static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 	if (clone_flags & CLONE_THREAD) {
 		t->rseq = NULL;
 		t->rseq_event_counter = 0;
+		t->rseq_cpu = -1;
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_event_counter = current->rseq_event_counter;
+		t->rseq_cpu = -1;
 		rseq_set_notify_resume(t);
 	}
 }
@@ -3428,11 +3435,16 @@ static inline void rseq_execve(struct task_struct *t)
 }
 static inline void rseq_sched_out(struct task_struct *t)
 {
-	rseq_set_notify_resume(t);
+	if (t->rseq) {
+		raw_cpu_write(rseq_owner, t);
+		t->rseq_cpu = smp_processor_id();
+		rseq_set_notify_resume(t);
+	}
 }
 static inline void rseq_signal_deliver(struct pt_regs *regs)
 {
-	rseq_handle_notify_resume(regs);
+	if (current->rseq)
+		__rseq_handle_notify_resume(regs);
 }
 #else
 static inline void rseq_set_notify_resume(struct task_struct *t)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 7e4d1d0e9520..0390a57ef0e5 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -100,6 +100,8 @@
  *   F2. Return false.
  */
 
+DEFINE_PER_CPU(struct task_struct *, rseq_owner);
+
 /*
  * The rseq_event_counter allow user-space to detect preemption and
  * signal delivery. It increments at least once before returning to
-- 
2.9.0

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 1/7] Restartable sequences system call
  2016-08-10 21:01                 ` Mathieu Desnoyers
@ 2016-08-11  7:23                   ` Andy Lutomirski
  0 siblings, 0 replies; 82+ messages in thread
From: Andy Lutomirski @ 2016-08-11  7:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ben Maurer, Thomas Gleixner, Ingo Molnar, Russell King,
	linux-api, Andrew Morton, Michael Kerrisk, Dave Watson, rostedt,
	Will Deacon, Linus Torvalds, Paul E. McKenney, linux-kernel,
	Chris Lameter, Andi Kleen, Josh Triplett, Paul Turner,
	Boqun Feng, Catalin Marinas, Andrew Hunter, H. Peter Anvin,
	Peter Zijlstra

On Aug 11, 2016 12:01 AM, "Mathieu Desnoyers"
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Aug 10, 2016, at 4:09 PM, Andy Lutomirski luto@amacapital.net wrote:
>
> > On Wed, Aug 10, 2016 at 1:06 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> <snip>
>
> >>> u64 is a perfectly valid, if odd, userspace pointer on all
> >>> architecures that I know of, and it's certainly a valid userspace
> >>> pointer on x86 32-bit userspace (the high bits will just all be zero).
> >>> Can you just use u64?
> >>
> >> My concern is about a 32-bit user-space putting garbage rather than zeroes
> >> (on purpose) to fool the kernel on those upper 32 bits. Doing
> >>
> >>   compat_ptr((compat_uptr_t)rseq_cs.start_ip)
> >>
> >> effectively ends up clearing the upper 32 bits.
> >>
> >> But since we only use those pointer values for comparisons, perhaps we
> >> just don't care if a 32-bit userspace app try to shoot itself in
> >> the foot by passing garbage upper 32 bits ?
> >>
> >
> > How is garbage in the high bits any different than garbage in any
> > other bits in there?
>
> It's not :)
>
> >
> >>
> >>> If this would be a performance problem on ARM, then maybe that's a
> >>> reason to use compat helpers.
> >>
> >> We already use 64-bit values for the pointers, even on 32-bit. Normally
> >> userspace just puts zeroes in the top bits. It's mostly a question of
> >> clearing the top 32 bits or not when loading them in the kernel. If we
> >> don't need to, then I can remove the compat code entirely, and we don't
> >> care about user_64bit_mode() anymore, as you initially recommended.
> >> Does it make sense ?
> >
> > Yes, I think so.  I'd suggest just honoring all the bits.
>
> OK, will do !
>
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>>
> >>>>>>>> +SYSCALL_DEFINE2(rseq, struct rseq __user *, rseq, int, flags)
> >>>>>>>> +{
> >>>>>>>> +    if (unlikely(flags))
> >>>>>>>> +            return -EINVAL;
> >>>>>>>
> >>>>>>> (add whitespace)
> >>>>>>
> >>>>>> fixed.
> >>>>>>
> >>>>>>>
> >>>>>>>> +    if (!rseq) {
> >>>>>>>> +            if (!current->rseq)
> >>>>>>>> +                    return -ENOENT;
> >>>>>>>> +            return 0;
> >>>>>>>> +    }
> >>>>>
> >>>>> This looks entirely wrong.  Setting rseq to NULL fails if it's already
> >>>>> NULL but silently does nothing if rseq is already set?  Surely it
> >>>>> should always succeed and it should actually do something if rseq is
> >>>>> set.
> >>>>
> >>>> From the proposed rseq(2) manpage:
> >>>>
> >>>> "A NULL rseq value can be used to check whether rseq is registered
> >>>> for the current thread."
> >>>>
> >>>> The implementation does just that: it returns -1, errno=ENOENT if no
> >>>> rseq is currently registered, or 0 if rseq is currently registered.
> >>>
> >>> I think that's problematic.  Why can't you unregister an existing
> >>> rseq?  If you can't, how is a thread supposed to clean up after
> >>> itself?
> >>>
> >>
> >> Unregistering an existing thread rseq would require that we keep reference
> >> counting, in case multiple libs and/or the app are using rseq. I am
> >> trying to keep things as simple as needed.
> >>
> >> If I understand your concern, the problematic scenario would be at
> >> thread exit (this is my current approximate understanding of glibc
> >> handling of library TLS variable reclaim at thread exit):
> >>
> >> thread exits in userspace:
> >> - glibc frees its rseq TLS memory area (in case the TLS is in a library),
> >> - thread preempted before really exiting,
> >> - kernel reads/writes to freed TLS memory.
> >>   - corruption may occur (e.g. memory re-allocated by another thread already)
> >>
> >> Am I getting it right ?
> >
> > Yes.
>
> Hrm, then we should:
>
> - add a rseq_refcount field to the task struct,
> - increment this refcount whenever rseq receives a registration, after
>   ensuring that we are registering the same address as was previously
>   requested by preceding registrations for the thread (except if the
>   refcount was 0),
> - When rseq receives a NULL address, decrement refcount. Set address to
>   NULL when it reaches 0.
>
> Doing the refcounting in kernel-space rather than user-space allows us to
> keep both registration/unregistration and refcount atomic, which simplify
> things if we plan to use rseq from signal handlers.
>
> With current glibc, a library that would lazily register and use rseq
> without knowledge of the application would then have to use pthread_key_create()
> to set a destr_function to run at thread exit, which would take care of
> unregistration.

That sounds reasonable at first glance.

>
> We could add a RSEQ_FORCE_UNREGISTER flag to rseq flags to allow future
> glibc versions to force unregistering rseq before freeing its TLS memory,
> just in case a userspace library omits to unregister itself.

Sounds good too.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-07-24 18:01       ` Dave Watson
  2016-07-25 16:43         ` Mathieu Desnoyers
@ 2016-08-11 23:26         ` Mathieu Desnoyers
  2016-08-12  1:28           ` Boqun Feng
  2016-08-12 19:36           ` Mathieu Desnoyers
  1 sibling, 2 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-11 23:26 UTC (permalink / raw)
  To: Dave Watson
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Jul 24, 2016, at 2:01 PM, Dave Watson davejwatson@fb.com wrote:

>>> +static inline __attribute__((always_inline))
>>> +bool rseq_finish(struct rseq_lock *rlock,
>>> + intptr_t *p, intptr_t to_write,
>>> + struct rseq_state start_value)
> 
>>> This ABI looks like it will work fine for our use case. I don't think it
>>> has been mentioned yet, but we may still need multiple asm blocks
>>> for differing numbers of writes. For example, an array-based freelist push:
> 
>>> void push(void *obj) {
>>> if (index < maxlen) {
>>> freelist[index++] = obj;
>>> }
>>> }
> 
>>> would be more efficiently implemented with a two-write rseq_finish:
> 
>>> rseq_finish2(&freelist[index], obj, // first write
>>> &index, index + 1, // second write
>>> ...);
> 
>> Would pairing one rseq_start with two rseq_finish do the trick
>> there ?
> 
> Yes, two rseq_finish works, as long as the extra rseq management overhead
> is not substantial.

I've added a commit implementing rseq_finish2() in my rseq volatile
dev branch. You can fetch it at:

https://github.com/compudj/linux-percpu-dev/tree/rseq-fallback

I also have a separate test and benchmark tree in addition to the
kernel selftests here:

https://github.com/compudj/rseq-test

I named the first write a "speculative" write, and the second write
the "final" write.

Would you like to extend the test cases to cover your intended use-case ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-08-11 23:26         ` Mathieu Desnoyers
@ 2016-08-12  1:28           ` Boqun Feng
  2016-08-12  3:10             ` Mathieu Desnoyers
  2016-08-12 19:36           ` Mathieu Desnoyers
  1 sibling, 1 reply; 82+ messages in thread
From: Boqun Feng @ 2016-08-12  1:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Dave Watson, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

[-- Attachment #1: Type: text/plain, Size: 2772 bytes --]

On Thu, Aug 11, 2016 at 11:26:30PM +0000, Mathieu Desnoyers wrote:
> ----- On Jul 24, 2016, at 2:01 PM, Dave Watson davejwatson@fb.com wrote:
> 
> >>> +static inline __attribute__((always_inline))
> >>> +bool rseq_finish(struct rseq_lock *rlock,
> >>> + intptr_t *p, intptr_t to_write,
> >>> + struct rseq_state start_value)
> > 
> >>> This ABI looks like it will work fine for our use case. I don't think it
> >>> has been mentioned yet, but we may still need multiple asm blocks
> >>> for differing numbers of writes. For example, an array-based freelist push:
> > 
> >>> void push(void *obj) {
> >>> if (index < maxlen) {
> >>> freelist[index++] = obj;
> >>> }
> >>> }
> > 
> >>> would be more efficiently implemented with a two-write rseq_finish:
> > 
> >>> rseq_finish2(&freelist[index], obj, // first write
> >>> &index, index + 1, // second write
> >>> ...);
> > 
> >> Would pairing one rseq_start with two rseq_finish do the trick
> >> there ?
> > 
> > Yes, two rseq_finish works, as long as the extra rseq management overhead
> > is not substantial.
> 
> I've added a commit implementing rseq_finish2() in my rseq volatile
> dev branch. You can fetch it at:
> 
> https://github.com/compudj/linux-percpu-dev/tree/rseq-fallback
> 
> I also have a separate test and benchmark tree in addition to the
> kernel selftests here:
> 
> https://github.com/compudj/rseq-test
> 
> I named the first write a "speculative" write, and the second write
> the "final" write.
> 

Maybe I'm missing something subtle, but if the first write is only a
"speculative" write, why can't we put it in the rseq critical section
rather than in the asm block? Like this:

	do_rseq(..., result, targetptr, newval
		{	
			newval = index;
			targetptr = &index;
			if (newval < maxlen)
				freelist[newval++] = obj;
			else
				result = false;
		}

No extra rseq_finish() is needed here, but maybe a little more
"speculative" writes?

> Would you like to extend the test cases to cover your intended use-case ?
> 

Dave, if you are going to write some test cases for your use-cases,
would you also try the approach I mentioned above?


Besides, do we allow userspace programs to do read-only accesses to the
memory objects modified by do_rseq()? If so, we have a problem when
there are two writes in a do_rseq() (either in the rseq critical section
or in the asm block), because in the current implementation these two writes
are unordered, which means readers outside a do_rseq() could observe
the two writes in different orders.

For rseq_finish2(), a simple solution would be making the "final" write
a RELEASE.
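
To illustrate what I mean, ignoring the rseq machinery itself and using C11
atomics purely as ordering notation (this is only to show the required
ordering, not how the selftests would implement it):

#include <stdatomic.h>
#include <stddef.h>

#define MAXLEN 128

static void *freelist[MAXLEN];
static atomic_size_t freelist_index;

/* Writer: the two writes done inside rseq_finish2(), with the "final"
 * write promoted to a release store. */
static void push(void *obj)
{
	size_t idx = atomic_load_explicit(&freelist_index, memory_order_relaxed);

	if (idx >= MAXLEN)
		return;
	freelist[idx] = obj;					/* speculative write */
	atomic_store_explicit(&freelist_index, idx + 1,
			      memory_order_release);		/* final write */
}

/* Reader outside any rseq critical section: the acquire load of the index
 * guarantees the matching freelist[] entry is visible as well. */
static void *peek_top(void)
{
	size_t n = atomic_load_explicit(&freelist_index, memory_order_acquire);

	return n ? freelist[n - 1] : NULL;
}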

Regards,
Boqun

> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-08-12  1:28           ` Boqun Feng
@ 2016-08-12  3:10             ` Mathieu Desnoyers
  2016-08-12  3:13               ` Mathieu Desnoyers
  2016-08-12  5:30               ` Boqun Feng
  0 siblings, 2 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-12  3:10 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Dave Watson, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Aug 11, 2016, at 9:28 PM, Boqun Feng boqun.feng@gmail.com wrote:

> On Thu, Aug 11, 2016 at 11:26:30PM +0000, Mathieu Desnoyers wrote:
>> ----- On Jul 24, 2016, at 2:01 PM, Dave Watson davejwatson@fb.com wrote:
>> 
>> >>> +static inline __attribute__((always_inline))
>> >>> +bool rseq_finish(struct rseq_lock *rlock,
>> >>> + intptr_t *p, intptr_t to_write,
>> >>> + struct rseq_state start_value)
>> > 
>> >>> This ABI looks like it will work fine for our use case. I don't think it
>> >>> has been mentioned yet, but we may still need multiple asm blocks
>> >>> for differing numbers of writes. For example, an array-based freelist push:
>> > 
>> >>> void push(void *obj) {
>> >>> if (index < maxlen) {
>> >>> freelist[index++] = obj;
>> >>> }
>> >>> }
>> > 
>> >>> would be more efficiently implemented with a two-write rseq_finish:
>> > 
>> >>> rseq_finish2(&freelist[index], obj, // first write
>> >>> &index, index + 1, // second write
>> >>> ...);
>> > 
>> >> Would pairing one rseq_start with two rseq_finish do the trick
>> >> there ?
>> > 
>> > Yes, two rseq_finish works, as long as the extra rseq management overhead
>> > is not substantial.
>> 
>> I've added a commit implementing rseq_finish2() in my rseq volatile
>> dev branch. You can fetch it at:
>> 
>> https://github.com/compudj/linux-percpu-dev/tree/rseq-fallback
>> 
>> I also have a separate test and benchmark tree in addition to the
>> kernel selftests here:
>> 
>> https://github.com/compudj/rseq-test
>> 
>> I named the first write a "speculative" write, and the second write
>> the "final" write.
>> 
> 
> Maybe I miss something subtle, but if the first write is only a
> "speculative" write, why can't we put it in the rseq critical section
> rather than asm block? Like this:
> 
>	do_rseq(..., result, targetptr, newval
>		{
>			newval = index;
>			targetptr = &index;
>			if (newval < maxlen)
>				freelist[newval++] = obj;
>			else
>				result = false;
>		}
> 
> No extra rseq_finish() is needed here, but maybe a little more
> "speculative" writes?

This won't work unfortunately. The speculative stores need to be
between the rseq_event_counter comparison instruction in the rseq_finish
asm sequence and the final store. The ip fixup is really needed for
correctness of speculative stores. The sequence number scheme only works
for loads.

Putting it in the C code between rseq_start and rseq_finish would lead
to races such as:

thread A                                thread B
rseq_start
<preempted>
                                        <sched in>
                                        rseq_start
                                        freelist[offset + 1] = obj
                                        rseq_finish
                                           offset++
                                        <preempted>
<sched in>
freelist[newval + 1] = obj  <--- corrupts the list content.

<snip>

> Besides, do we allow userspace programs do read-only access to the
> memory objects modified by do_rseq(). If so, we have a problem when
> there are two writes in a do_rseq()(either in the rseq critical section
> or in the asm block), because in current implemetation, these two writes
> are unordered, which makes the readers outside a do_rseq() could observe
> the ordering of writes differently.
> 
> For rseq_finish2(), a simple solution would be making the "final" write
> a RELEASE.

Indeed, we would need a release semantic for the final store here if this
is the common use. Or we could duplicate the "flavors" of rseq_finish2 and
add a rseq_finish2_release. We should find a way to eliminate code duplication
there. I suspect we'll end up doing macros.
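
Something like this, perhaps (only a sketch of the idea; the macro name and
the per-architecture barrier choices are assumptions, not tested code):

/*
 * Inserted right before the final store in a rseq_finish2_release()
 * flavor; the plain flavor would define this as an empty string.
 */
#ifdef __x86_64__
# define RSEQ_RELEASE_BARRIER	""		/* x86 stores already have release ordering */
#elif defined(__i386__)
# define RSEQ_RELEASE_BARRIER	""
#elif defined(__ARMEL__)
# define RSEQ_RELEASE_BARRIER	"dmb ish\n\t"
#elif defined(__PPC__)
# define RSEQ_RELEASE_BARRIER	"lwsync\n\t"
#else
# error unsupported target
#endif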

Thanks,

Mathieu

> 
> Regards,
> Boqun
> 
>> Thanks,
>> 
>> Mathieu
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
> > http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-08-12  3:10             ` Mathieu Desnoyers
@ 2016-08-12  3:13               ` Mathieu Desnoyers
  2016-08-12  5:30               ` Boqun Feng
  1 sibling, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-12  3:13 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Dave Watson, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Aug 11, 2016, at 11:10 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Aug 11, 2016, at 9:28 PM, Boqun Feng boqun.feng@gmail.com wrote:
> 
>> On Thu, Aug 11, 2016 at 11:26:30PM +0000, Mathieu Desnoyers wrote:
>>> ----- On Jul 24, 2016, at 2:01 PM, Dave Watson davejwatson@fb.com wrote:
>>> 
>>> >>> +static inline __attribute__((always_inline))
>>> >>> +bool rseq_finish(struct rseq_lock *rlock,
>>> >>> + intptr_t *p, intptr_t to_write,
>>> >>> + struct rseq_state start_value)
>>> > 
>>> >>> This ABI looks like it will work fine for our use case. I don't think it
>>> >>> has been mentioned yet, but we may still need multiple asm blocks
>>> >>> for differing numbers of writes. For example, an array-based freelist push:
>>> > 
>>> >>> void push(void *obj) {
>>> >>> if (index < maxlen) {
>>> >>> freelist[index++] = obj;
>>> >>> }
>>> >>> }
>>> > 
>>> >>> would be more efficiently implemented with a two-write rseq_finish:
>>> > 
>>> >>> rseq_finish2(&freelist[index], obj, // first write
>>> >>> &index, index + 1, // second write
>>> >>> ...);
>>> > 
>>> >> Would pairing one rseq_start with two rseq_finish do the trick
>>> >> there ?
>>> > 
>>> > Yes, two rseq_finish works, as long as the extra rseq management overhead
>>> > is not substantial.
>>> 
>>> I've added a commit implementing rseq_finish2() in my rseq volatile
>>> dev branch. You can fetch it at:
>>> 
>>> https://github.com/compudj/linux-percpu-dev/tree/rseq-fallback
>>> 
>>> I also have a separate test and benchmark tree in addition to the
>>> kernel selftests here:
>>> 
>>> https://github.com/compudj/rseq-test
>>> 
>>> I named the first write a "speculative" write, and the second write
>>> the "final" write.
>>> 
>> 
>> Maybe I miss something subtle, but if the first write is only a
>> "speculative" write, why can't we put it in the rseq critical section
>> rather than asm block? Like this:
>> 
>>	do_rseq(..., result, targetptr, newval
>>		{
>>			newval = index;
>>			targetptr = &index;
>>			if (newval < maxlen)
>>				freelist[newval++] = obj;
>>			else
>>				result = false;
>>		}
>> 
>> No extra rseq_finish() is needed here, but maybe a little more
>> "speculative" writes?
> 
> This won't work unfortunately. The speculative stores need to be
> between the rseq_event_counter comparison instruction in the rseq_finish
> asm sequence and the final store. The ip fixup is really needed for
> correctness of speculative stores. The sequence number scheme only works
> for loads.
> 
> Putting it in the C code between rseq_start and rseq_finish would lead
> to races such as:
> 
> thread A                                thread B
> rseq_start
> <preempted>
>                                        <sched in>
>                                        rseq_start
>                                        freelist[offset + 1] = obj
>                                        rseq_finish
>                                           offset++
>                                        <preempted>
> <sched in>
> freelist[newval + 1] = obj  <--- corrupts the list content.
> 

Small clarification to the scenario:

thread A                                thread B
rseq_start
load offset into (register 1)
<preempted>
                                       <sched in>
                                       rseq_start
                                       freelist[offset + 1] = obj
                                       rseq_finish
                                          offset++
                                       <preempted>
<sched in>
freelist[(register 1) + 1] = obj  <--- corrupts the list content.

Thanks,

Mathieu


> <snip>
> 
>> Besides, do we allow userspace programs do read-only access to the
>> memory objects modified by do_rseq(). If so, we have a problem when
>> there are two writes in a do_rseq()(either in the rseq critical section
>> or in the asm block), because in current implemetation, these two writes
>> are unordered, which makes the readers outside a do_rseq() could observe
>> the ordering of writes differently.
>> 
>> For rseq_finish2(), a simple solution would be making the "final" write
>> a RELEASE.
> 
> Indeed, we would need a release semantic for the final store here if this
> is the common use. Or we could duplicate the "flavors" of rseq_finish2 and
> add a rseq_finish2_release. We should find a way to eliminate code duplication
> there. I suspect we'll end up doing macros.
> 
> Thanks,
> 
> Mathieu
> 
>> 
>> Regards,
>> Boqun
>> 
>>> Thanks,
>>> 
>>> Mathieu
>>> 
>>> --
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>> > http://www.efficios.com
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-08-12  3:10             ` Mathieu Desnoyers
  2016-08-12  3:13               ` Mathieu Desnoyers
@ 2016-08-12  5:30               ` Boqun Feng
  2016-08-12 16:35                 ` Boqun Feng
  1 sibling, 1 reply; 82+ messages in thread
From: Boqun Feng @ 2016-08-12  5:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Dave Watson, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

[-- Attachment #1: Type: text/plain, Size: 4742 bytes --]

On Fri, Aug 12, 2016 at 03:10:38AM +0000, Mathieu Desnoyers wrote:
> ----- On Aug 11, 2016, at 9:28 PM, Boqun Feng boqun.feng@gmail.com wrote:
> 
> > On Thu, Aug 11, 2016 at 11:26:30PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Jul 24, 2016, at 2:01 PM, Dave Watson davejwatson@fb.com wrote:
> >> 
> >> >>> +static inline __attribute__((always_inline))
> >> >>> +bool rseq_finish(struct rseq_lock *rlock,
> >> >>> + intptr_t *p, intptr_t to_write,
> >> >>> + struct rseq_state start_value)
> >> > 
> >> >>> This ABI looks like it will work fine for our use case. I don't think it
> >> >>> has been mentioned yet, but we may still need multiple asm blocks
> >> >>> for differing numbers of writes. For example, an array-based freelist push:
> >> > 
> >> >>> void push(void *obj) {
> >> >>> if (index < maxlen) {
> >> >>> freelist[index++] = obj;
> >> >>> }
> >> >>> }
> >> > 
> >> >>> would be more efficiently implemented with a two-write rseq_finish:
> >> > 
> >> >>> rseq_finish2(&freelist[index], obj, // first write
> >> >>> &index, index + 1, // second write
> >> >>> ...);
> >> > 
> >> >> Would pairing one rseq_start with two rseq_finish do the trick
> >> >> there ?
> >> > 
> >> > Yes, two rseq_finish works, as long as the extra rseq management overhead
> >> > is not substantial.
> >> 
> >> I've added a commit implementing rseq_finish2() in my rseq volatile
> >> dev branch. You can fetch it at:
> >> 
> >> https://github.com/compudj/linux-percpu-dev/tree/rseq-fallback
> >> 
> >> I also have a separate test and benchmark tree in addition to the
> >> kernel selftests here:
> >> 
> >> https://github.com/compudj/rseq-test
> >> 
> >> I named the first write a "speculative" write, and the second write
> >> the "final" write.
> >> 
> > 
> > Maybe I miss something subtle, but if the first write is only a
> > "speculative" write, why can't we put it in the rseq critical section
> > rather than asm block? Like this:
> > 
> >	do_rseq(..., result, targetptr, newval
> >		{
> >			newval = index;
> >			targetptr = &index;
> >			if (newval < maxlen)
> >				freelist[newval++] = obj;
> >			else
> >				result = false;
> >		}
> > 
> > No extra rseq_finish() is needed here, but maybe a little more
> > "speculative" writes?
> 
> This won't work unfortunately. The speculative stores need to be
> between the rseq_event_counter comparison instruction in the rseq_finish
> asm sequence and the final store. The ip fixup is really needed for
> correctness of speculative stores. The sequence number scheme only works
> for loads.
> 
> Putting it in the C code between rseq_start and rseq_finish would lead
> to races such as:
> 
> thread A                                thread B
> rseq_start
> <preempted>
>                                         <sched in>
>                                         rseq_start
>                                         freelist[offset + 1] = obj
>                                         rseq_finish
>                                            offset++
>                                         <preempted>
> <sched in>
> freelist[newval + 1] = obj  <--- corrupts the list content.
> 

Ah, right!

We can't do any "global" (truly global or percpu) update in the rseq
critical section (the code between rseq_start and rseq_finish), because
without an ip fixup we cannot abort the critical section immediately:
we have to compare the event_counter in rseq_finish, but that's too late
for speculative stores.

> <snip>
> 
> > Besides, do we allow userspace programs do read-only access to the
> > memory objects modified by do_rseq(). If so, we have a problem when
> > there are two writes in a do_rseq()(either in the rseq critical section
> > or in the asm block), because in current implemetation, these two writes
> > are unordered, which makes the readers outside a do_rseq() could observe
> > the ordering of writes differently.
> > 
> > For rseq_finish2(), a simple solution would be making the "final" write
> > a RELEASE.
> 
> Indeed, we would need a release semantic for the final store here if this
> is the common use. Or we could duplicate the "flavors" of rseq_finish2 and
> add a rseq_finish2_release. We should find a way to eliminate code duplication

I'm in favor of a separate rseq_finish2_release().

> there. I suspect we'll end up doing macros.
> 

Me too. Lemme have a try ;-)

Regards,
Boqun

> Thanks,
> 
> Mathieu
> 
> > 
> > Regards,
> > Boqun
> > 
> >> Thanks,
> >> 
> >> Mathieu
> >> 
> >> --
> >> Mathieu Desnoyers
> >> EfficiOS Inc.
> > > http://www.efficios.com
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-08-12  5:30               ` Boqun Feng
@ 2016-08-12 16:35                 ` Boqun Feng
  2016-08-12 18:11                   ` Mathieu Desnoyers
  0 siblings, 1 reply; 82+ messages in thread
From: Boqun Feng @ 2016-08-12 16:35 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Dave Watson, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Fri, Aug 12, 2016 at 01:30:15PM +0800, Boqun Feng wrote:
[snip]
> > > Besides, do we allow userspace programs do read-only access to the
> > > memory objects modified by do_rseq(). If so, we have a problem when
> > > there are two writes in a do_rseq()(either in the rseq critical section
> > > or in the asm block), because in current implemetation, these two writes
> > > are unordered, which makes the readers outside a do_rseq() could observe
> > > the ordering of writes differently.
> > > 
> > > For rseq_finish2(), a simple solution would be making the "final" write
> > > a RELEASE.
> > 
> > Indeed, we would need a release semantic for the final store here if this
> > is the common use. Or we could duplicate the "flavors" of rseq_finish2 and
> > add a rseq_finish2_release. We should find a way to eliminate code duplication
> 
> I'm in favor of a separate rseq_finish2_release().
> 
> > there. I suspect we'll end up doing macros.
> > 
> 
> Me too. Lemme have a try ;-)
> 

How about this? Although a little messy, I separated the asm block into
several parts and implemented each part in an arch-dependent way.

It compiles successfully on x86 and ppc64le; no further tests yet.

Regards,
Boqun

-------------------->8
From 3a4c40ded1320b824af462d875f942913e5c46a3 Mon Sep 17 00:00:00 2001
From: Boqun Feng <boqun.feng@gmail.com>
Date: Sat, 13 Aug 2016 00:16:13 +0800
Subject: [PATCH] WIP1

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 tools/testing/selftests/rseq/rseq.h | 541 ++++++++++++++----------------------
 1 file changed, 205 insertions(+), 336 deletions(-)

diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
index e8614e76b377..7e13aab2ec8b 100644
--- a/tools/testing/selftests/rseq/rseq.h
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -304,6 +304,172 @@ struct rseq_state rseq_start(struct rseq_lock *rlock)
 	return result;
 }
 
+/*
+ * ASM code for building the rseq_cs table
+ */
+
+#if defined(__x86_64__) || defined(__PPC64__)
+# define RSEQ_CS_TABLE(table, start, post_commit, abort)	\
+	".balign 32\n\t"						\
+	table ":\n\t"						\
+	".quad " start "," post_commit "," abort ", 0x0\n\t"
+#elif defined(__ARMEL__)
+# define RSEQ_CS_TABLE(table, start, post_commit, abort)	\
+	".balign 32\n\t"						\
+	table ":\n\t"						\
+	".long " start ", 0x0," post_commit ", 0x0," abort ", 0x0, 0x0, 0x0\n\t"
+#elif defined(__PPC__) /* PPC32 */
+# define RSEQ_CS_TABLE(table, start, post_commit, abort)		\
+	".balign 32\n\t"						\
+	table ":\n\t"							\
+	".long 0x0," start ", 0x0," post_commit ", 0x0," abort ", 0x0, 0x0\n\t"
+#else
+#endif
+
+/*
+ * ASM code for putting the rseq_cs table into a special section for debugging
+ */
+
+#define RSEQ_CS_TABLE_SECTION(table, start, post_commit, abort)		\
+	".pushsection __rseq_table, \"aw\"\n\t"				\
+	RSEQ_CS_TABLE(table, start, post_commit, abort)			\
+	".popsection\n\t"						\
+	start ":\n\t"
+
+
+/*
+ * ASM code to store the pointer of rseq_cs table into rseq structure, which
+ * indicates the start of rseq asm block
+ */
+#ifdef __x86_64__
+# define RSEQ_CS_STORE(cs_table, shadow_table, rseq_cs)			\
+	"movq $" cs_table ",(" rseq_cs ")\n\t"
+#elif defined(__i386__)
+# define RSEQ_CS_STORE(cs_table, shadow_table, rseq_cs)			\
+	"movl $" cs_table ",(" rseq_cs ")\n\t"
+#elif defined(__ARMEL__)
+# define RSEQ_CS_STORE(cs_table, shadow_table, rseq_cs)			\
+	"adr r0, " shadow_table "\n\t"					\
+	"str r0, [" rseq_cs "]\n\t"
+#elif defined(__PPC64__)
+# define RSEQ_CS_STORE(cs_table, shadow_table, rseq_cs)			\
+	"lis %%r17, (" cs_table ")@highest\n\t"			\
+	"ori %%r17, %%r17, (" cs_table ")@higher\n\t"			\
+	"rldicr %%r17, %%r17, 32, 31\n\t"				\
+	"oris %%r17, %%r17, (" cs_table ")@h\n\t"			\
+	"ori %%r17, %%r17, (" cs_table ")@l\n\t"			\
+	"std %%r17, 0(" rseq_cs ")\n\t"
+#elif defined(__PPC__)
+# define RSEQ_CS_STORE(cs_table, shadow_table, rseq_cs)			\
+	"lis %%r17, (" cs_table ")@ha\n\t"				\
+	"addi %%r17, %%r17, (" cs_table ")@l\n\t"			\
+	"stw %%r17, 0(" rseq_cs ")\n\t"
+#else
+# error unsupported target
+#endif
+
+/* ASM code to check whether the event_counter changed */
+#ifdef __x86_64__
+# define RSEQ_CHECK_COUNTER(start_counter, current_counter, abort_ip)	\
+	"cmpl " start_counter ", " current_counter "\n\t"		\
+	"jnz " abort_ip "\n\t"
+#elif defined(__i386__)
+# define RSEQ_CHECK_COUNTER(start_counter, current_counter, abort_ip)	\
+	"cmpl " start_counter ", " current_counter "\n\t"		\
+	"jnz " abort_ip "\n\t"
+#elif defined(__ARMEL__)
+# define RSEQ_CHECK_COUNTER(start_counter, current_counter, abort_ip)	\
+	"ldr r0, " current_counter "\n\t"				\
+	"cmp " start_counter ", r0\n\t"					\
+	"bne " abort_ip "\n\t"
+#elif defined(__PPC__)
+# define RSEQ_CHECK_COUNTER(start_counter, current_counter, abort_ip)	\
+	"lwz %%r17, " current_counter "\n\t"				\
+	"cmpw cr7, " start_counter ", %%r17\n\t"			\
+	"bne- cr7, " abort_ip "\n\t"
+#else
+# error unsupported target
+#endif
+
+/* ASM code to do a normal write in rseq block*/
+#ifdef __x86_64__
+# define RSEQ_WRITE(to_write, target_addr)				\
+	"movq " to_write ", (" target_addr ")\n\t"
+
+#elif defined(__i386__)
+# define RSEQ_WRITE(to_write, target_addr)				\
+	"movl " to_write ", (" target_addr ")\n\t"
+
+#elif defined(__ARMEL__)
+# define RSEQ_WRITE(to_write, target_addr)				\
+	"str " to_write ", [" target_addr "]\n\t"
+
+#elif defined(__PPC64__)
+# define RSEQ_WRITE(to_write, target_addr)				\
+	"std " to_write ", 0(" target_addr ")\n\t"
+
+#elif defined(__PPC__)
+# define RSEQ_WRITE(to_write, target_addr)				\
+	"stw " to_write ", 0(" target_addr ")\n\t"
+#else
+# error unsupported target
+#endif
+
+/* ASM code to do a commit(final) write */
+#define RSEQ_COMMIT_WRITE(to_write, target_addr, post_commit)		\
+	RSEQ_WRITE(to_write, target_addr)				\
+	post_commit ":\n\t"
+
+/*
+ * ASM code to zero the rseq_cs, which indicates the end of the rseq asm block
+ */
+#if defined(__x86_64__) || defined(__i386__)
+# define RSEQ_ZERO_CS(rseq_cs)						\
+	RSEQ_WRITE("$0", rseq_cs)
+
+#elif defined(__ARMEL__)
+# define RSEQ_ZERO_CS(rseq_cs)						\
+	"mov r0, #0\n\t"						\
+	RSEQ_WRITE("r0", rseq_cs)
+
+#elif defined(__PPC__)
+# define RSEQ_ZERO_CS(rseq_cs)						\
+	"li %%r17, 0\n\t"						\
+	RSEQ_WRITE("%%r17", rseq_cs)
+
+#else
+# error unsupported target
+#endif
+
+/* ARM use another table to set the rseq_cs */
+#if defined(__ARMEL__)
+# define RSEQ_CS_SHADOW_TABLE(table, start, post_commit, abort)	\
+	"b skip%=\n\t"						\
+	RSEQ_CS_TABLE(table, start, post_commit, abort)		\
+	"skip%=:\n\t"
+#else
+# define RSEQ_CS_SHADOW_TABLE(table, start, post_commit, abort)
+#endif
+
+#define RSEQ_VAR_REG(sym, expr)	[sym] "r" (expr)
+#define RSEQ_VAR_MEM(sym, expr)	[sym] "m" (expr)
+
+#ifdef __PPC__ /* PPC64 and PPC32 */
+# define RSEQ_ADDR_REG(sym, expr)	[sym] "b" (expr)
+#endif
+
+#ifndef RSEQ_ADDR_REG
+# define RSEQ_ADDR_REG(sym, expr)	RSEQ_VAR_REG(sym, expr)
+#endif
+
+#ifdef __PPC__
+# define RSEQ_REG_CLOBBER	,"r17"
+#elif defined(__ARMEL__)
+# define RSEQ_REG_CLOBBER	,"r0"
+#else
+# define RSEQ_REG_CLOBBER	,"memory"
+#endif
+
 static inline __attribute__((always_inline))
 bool rseq_finish(struct rseq_lock *rlock,
 		intptr_t *p, intptr_t to_write,
@@ -322,174 +488,33 @@ bool rseq_finish(struct rseq_lock *rlock,
 	 * handle single-stepping through the restartable critical
 	 * sections.
 	 */
-
-#ifdef __x86_64__
-	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		".quad 1f, 2f, %l[failure], 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
-		RSEQ_INJECT_ASM(1)
-		"movq $3b, (%[rseq_cs])\n\t"
-		RSEQ_INJECT_ASM(2)
-		"cmpl %[start_event_counter], %[current_event_counter]\n\t"
-		"jnz %l[failure]\n\t"
-		RSEQ_INJECT_ASM(3)
-		"movq %[to_write], (%[target])\n\t"
-		"2:\n\t"
-		RSEQ_INJECT_ASM(4)
-		"movq $0, (%[rseq_cs])\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write]"r"(to_write),
-		  [target]"r"(p),
-		  [rseq_cs]"r"(&start_value.rseqp->rseq_cs)
-		  RSEQ_INJECT_INPUT
-		: "memory", "cc"
-		  RSEQ_INJECT_CLOBBER
-		: failure
-	);
-#elif defined(__i386__)
 	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		".long 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0, 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
+		RSEQ_CS_TABLE_SECTION("cs_table%=", "start%=", "post_commit%=", "%l[failure]")
+		/* start */
 		RSEQ_INJECT_ASM(1)
-		"movl $3b, (%[rseq_cs])\n\t"
+		RSEQ_CS_STORE("cs_table%=", "shadow_table%=", "%[rseq_cs]")
 		RSEQ_INJECT_ASM(2)
-		"cmpl %[start_event_counter], %[current_event_counter]\n\t"
-		"jnz %l[failure]\n\t"
+		RSEQ_CHECK_COUNTER("%[start_event_counter]",
+				   "%[current_event_counter]",
+				   "%l[failure]")
 		RSEQ_INJECT_ASM(3)
-		"movl %[to_write], (%[target])\n\t"
-		"2:\n\t"
+		RSEQ_COMMIT_WRITE("%[to_write]", "%[target]", "post_commit%=")
+		/* post_commit */
 		RSEQ_INJECT_ASM(4)
-		"movl $0, (%[rseq_cs])\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write]"r"(to_write),
-		  [target]"r"(p),
-		  [rseq_cs]"r"(&start_value.rseqp->rseq_cs)
+		RSEQ_ZERO_CS("%[rseq_cs]")
+		RSEQ_CS_SHADOW_TABLE("shadow_table%=", "start%=", "post_commit%=", "%l[failure]")
+		:
+		: RSEQ_VAR_REG(start_event_counter, start_value.event_counter),
+		  RSEQ_VAR_MEM(current_event_counter, start_value.rseqp->u.e.event_counter),
+		  RSEQ_VAR_REG(to_write, to_write),
+		  RSEQ_ADDR_REG(target, p),
+		  RSEQ_ADDR_REG(rseq_cs, &start_value.rseqp->rseq_cs)
 		  RSEQ_INJECT_INPUT
 		: "memory", "cc"
+		  RSEQ_REG_CLOBBER
 		  RSEQ_INJECT_CLOBBER
 		: failure
-	);
-#elif defined(__ARMEL__)
-	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		".word 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0, 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
-		RSEQ_INJECT_ASM(1)
-		"adr r0, 3f\n\t"
-		"str r0, [%[rseq_cs]]\n\t"
-		RSEQ_INJECT_ASM(2)
-		"ldr r0, %[current_event_counter]\n\t"
-		"mov r1, #0\n\t"
-		"cmp %[start_event_counter], r0\n\t"
-		"bne %l[failure]\n\t"
-		RSEQ_INJECT_ASM(3)
-		"str %[to_write], [%[target]]\n\t"
-		"2:\n\t"
-		RSEQ_INJECT_ASM(4)
-		"str r1, [%[rseq_cs]]\n\t"
-		"b 4f\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		".word 1b, 0x0, 2b, 0x0, l[failure], 0x0, 0x0, 0x0\n\t"
-		"4:\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write]"r"(to_write),
-		  [target]"r"(p),
-		  [rseq_cs]"r"(&start_value.rseqp->rseq_cs)
-		  RSEQ_INJECT_INPUT
-		: "r0", "r1", "memory", "cc"
-		  RSEQ_INJECT_CLOBBER
-		: failure
-	);
-#elif __PPC64__
-	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		".quad 1f, 2f, %l[failure], 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
-		RSEQ_INJECT_ASM(1)
-		"lis %%r17, (3b)@highest\n\t"
-		"ori %%r17, %%r17, (3b)@higher\n\t"
-		"rldicr %%r17, %%r17, 32, 31\n\t"
-		"oris %%r17, %%r17, (3b)@h\n\t"
-		"ori %%r17, %%r17, (3b)@l\n\t"
-		"std %%r17, 0(%[rseq_cs])\n\t"
-		RSEQ_INJECT_ASM(2)
-		"lwz %%r17, %[current_event_counter]\n\t"
-		"cmpw cr7, %[start_event_counter], %%r17\n\t"
-		"bne- cr7, %l[failure]\n\t"
-		RSEQ_INJECT_ASM(3)
-		"std %[to_write], 0(%[target])\n\t"
-		"2:\n\t"
-		RSEQ_INJECT_ASM(4)
-		"li %%r17, 0\n\t"
-		"std %%r17, 0(%[rseq_cs])\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write]"r"(to_write),
-		  [target]"b"(p),
-		  [rseq_cs]"b"(&start_value.rseqp->rseq_cs)
-		  RSEQ_INJECT_INPUT
-		: "r17", "memory", "cc"
-		  RSEQ_INJECT_CLOBBER
-		: failure
-	);
-#elif __PPC__
-	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		/* 32-bit only supported on BE */
-		".long 0x0, 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
-		RSEQ_INJECT_ASM(1)
-		"lis %%r17, (3b)@ha\n\t"
-		"addi %%r17, %%r17, (3b)@l\n\t"
-		"stw %%r17, 0(%[rseq_cs])\n\t"
-		RSEQ_INJECT_ASM(2)
-		"lwz %%r17, %[current_event_counter]\n\t"
-		"cmpw cr7, %[start_event_counter], %%r17\n\t"
-		"bne- cr7, %l[failure]\n\t"
-		RSEQ_INJECT_ASM(3)
-		"stw %[to_write], 0(%[target])\n\t"
-		"2:\n\t"
-		RSEQ_INJECT_ASM(4)
-		"li %%r17, 0\n\t"
-		"stw %%r17, 0(%[rseq_cs])\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write]"r"(to_write),
-		  [target]"b"(p),
-		  [rseq_cs]"b"(&start_value.rseqp->rseq_cs)
-		  RSEQ_INJECT_INPUT
-		: "r17", "memory", "cc"
-		  RSEQ_INJECT_CLOBBER
-		: failure
-	);
-#else
-#error unsupported target
-#endif
+		);
 	return true;
 failure:
 	RSEQ_INJECT_FAILED
@@ -525,193 +550,37 @@ bool rseq_finish2(struct rseq_lock *rlock,
 	 * sections.
 	 */
 
-#ifdef __x86_64__
 	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		".quad 1f, 2f, %l[failure], 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
+		RSEQ_CS_TABLE_SECTION("cs_table%=", "start%=", "post_commit%=", "%l[failure]")
+		/* start */
 		RSEQ_INJECT_ASM(1)
-		"movq $3b, (%[rseq_cs])\n\t"
+		RSEQ_CS_STORE("cs_table%=", "shadow_table%=", "%[rseq_cs]")
 		RSEQ_INJECT_ASM(2)
-		"cmpl %[start_event_counter], %[current_event_counter]\n\t"
-		"jnz %l[failure]\n\t"
+		RSEQ_CHECK_COUNTER("%[start_event_counter]",
+				   "%[current_event_counter]",
+				   "%l[failure]")
 		RSEQ_INJECT_ASM(3)
-		"movq %[to_write_spec], (%[target_spec])\n\t"
+		RSEQ_WRITE("%[to_write_spec]", "%[target_spec]")
 		RSEQ_INJECT_ASM(4)
-		"movq %[to_write_final], (%[target_final])\n\t"
-		"2:\n\t"
+		RSEQ_COMMIT_WRITE("%[to_write_final]", "%[target_final]", "post_commit%=")
+		/* post_commit */
 		RSEQ_INJECT_ASM(5)
-		"movq $0, (%[rseq_cs])\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write_spec]"r"(to_write_spec),
-		  [target_spec]"r"(p_spec),
-		  [to_write_final]"r"(to_write_final),
-		  [target_final]"r"(p_final),
-		  [rseq_cs]"r"(&start_value.rseqp->rseq_cs)
+		RSEQ_ZERO_CS("%[rseq_cs]")
+		RSEQ_CS_SHADOW_TABLE("shadow_table%=", "start%=", "post_commit%=", "%l[failure]")
+		:
+		: RSEQ_VAR_REG(start_event_counter, start_value.event_counter),
+		  RSEQ_VAR_MEM(current_event_counter, start_value.rseqp->u.e.event_counter),
+		  RSEQ_VAR_REG(to_write_spec, to_write_spec),
+		  RSEQ_ADDR_REG(target_spec, p_spec),
+		  RSEQ_VAR_REG(to_write_final, to_write_final),
+		  RSEQ_ADDR_REG(target_final, p_final),
+		  RSEQ_ADDR_REG(rseq_cs, &start_value.rseqp->rseq_cs)
 		  RSEQ_INJECT_INPUT
 		: "memory", "cc"
+		  RSEQ_REG_CLOBBER
 		  RSEQ_INJECT_CLOBBER
 		: failure
-	);
-#elif defined(__i386__)
-	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		".long 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0, 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
-		RSEQ_INJECT_ASM(1)
-		"movl $3b, (%[rseq_cs])\n\t"
-		RSEQ_INJECT_ASM(2)
-		"cmpl %[start_event_counter], %[current_event_counter]\n\t"
-		"jnz %l[failure]\n\t"
-		RSEQ_INJECT_ASM(3)
-		"movl %[to_write_spec], (%[target_spec])\n\t"
-		RSEQ_INJECT_ASM(4)
-		"movl %[to_write_final], (%[target_final])\n\t"
-		"2:\n\t"
-		RSEQ_INJECT_ASM(5)
-		"movl $0, (%[rseq_cs])\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write_spec]"r"(to_write_spec),
-		  [target_spec]"r"(p_spec),
-		  [to_write_final]"r"(to_write_final),
-		  [target_final]"r"(p_final),
-		  [rseq_cs]"r"(&start_value.rseqp->rseq_cs)
-		  RSEQ_INJECT_INPUT
-		: "memory", "cc"
-		  RSEQ_INJECT_CLOBBER
-		: failure
-	);
-#elif defined(__ARMEL__)
-	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		".word 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0, 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
-		RSEQ_INJECT_ASM(1)
-		"adr r0, 3f\n\t"
-		"str r0, [%[rseq_cs]]\n\t"
-		RSEQ_INJECT_ASM(2)
-		"ldr r0, %[current_event_counter]\n\t"
-		"mov r1, #0\n\t"
-		"cmp %[start_event_counter], r0\n\t"
-		"bne %l[failure]\n\t"
-		RSEQ_INJECT_ASM(3)
-		"str %[to_write_spec], [%[target_spec]]\n\t"
-		RSEQ_INJECT_ASM(4)
-		"str %[to_write_final], [%[target_final]]\n\t"
-		"2:\n\t"
-		RSEQ_INJECT_ASM(5)
-		"str r1, [%[rseq_cs]]\n\t"
-		"b 4f\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		".word 1b, 0x0, 2b, 0x0, l[failure], 0x0, 0x0, 0x0\n\t"
-		"4:\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write_spec]"r"(to_write_spec),
-		  [target_spec]"r"(p_spec),
-		  [to_write_final]"r"(to_write_final),
-		  [target_final]"r"(p_final),
-		  [rseq_cs]"r"(&start_value.rseqp->rseq_cs)
-		  RSEQ_INJECT_INPUT
-		: "r0", "r1", "memory", "cc"
-		  RSEQ_INJECT_CLOBBER
-		: failure
-	);
-#elif __PPC64__
-	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		".quad 1f, 2f, %l[failure], 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
-		RSEQ_INJECT_ASM(1)
-		"lis %%r17, (3b)@highest\n\t"
-		"ori %%r17, %%r17, (3b)@higher\n\t"
-		"rldicr %%r17, %%r17, 32, 31\n\t"
-		"oris %%r17, %%r17, (3b)@h\n\t"
-		"ori %%r17, %%r17, (3b)@l\n\t"
-		"std %%r17, 0(%[rseq_cs])\n\t"
-		RSEQ_INJECT_ASM(2)
-		"lwz %%r17, %[current_event_counter]\n\t"
-		"cmpw cr7, %[start_event_counter], %%r17\n\t"
-		"bne- cr7, %l[failure]\n\t"
-		RSEQ_INJECT_ASM(3)
-		"std %[to_write_spec], 0(%[target_spec])\n\t"
-		RSEQ_INJECT_ASM(4)
-		"std %[to_write_final], 0(%[target_final])\n\t"
-		"2:\n\t"
-		RSEQ_INJECT_ASM(5)
-		"li %%r17, 0\n\t"
-		"std %%r17, 0(%[rseq_cs])\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write_spec]"r"(to_write_spec),
-		  [target_spec]"b"(p_spec),
-		  [to_write_final]"r"(to_write_final),
-		  [target_final]"b"(p_final),
-		  [rseq_cs]"b"(&start_value.rseqp->rseq_cs)
-		  RSEQ_INJECT_INPUT
-		: "r17", "memory", "cc"
-		  RSEQ_INJECT_CLOBBER
-		: failure
-	);
-#elif __PPC__
-	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		/* 32-bit only supported on BE */
-		".long 0x0, 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
-		RSEQ_INJECT_ASM(1)
-		"lis %%r17, (3b)@ha\n\t"
-		"addi %%r17, %%r17, (3b)@l\n\t"
-		"stw %%r17, 0(%[rseq_cs])\n\t"
-		RSEQ_INJECT_ASM(2)
-		"lwz %%r17, %[current_event_counter]\n\t"
-		"cmpw cr7, %[start_event_counter], %%r17\n\t"
-		"bne- cr7, %l[failure]\n\t"
-		RSEQ_INJECT_ASM(3)
-		"stw %[to_write_spec], 0(%[target_spec])\n\t"
-		RSEQ_INJECT_ASM(4)
-		"stw %[to_write_final], 0(%[target_final])\n\t"
-		"2:\n\t"
-		RSEQ_INJECT_ASM(5)
-		"li %%r17, 0\n\t"
-		"stw %%r17, 0(%[rseq_cs])\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write_spec]"r"(to_write_spec),
-		  [target_spec]"b"(p_spec),
-		  [to_write_final]"r"(to_write_final),
-		  [target_final]"b"(p_final),
-		  [rseq_cs]"b"(&start_value.rseqp->rseq_cs)
-		  RSEQ_INJECT_INPUT
-		: "r17", "memory", "cc"
-		  RSEQ_INJECT_CLOBBER
-		: failure
-	);
-#else
-#error unsupported target
-#endif
+		);
 	return true;
 failure:
 	RSEQ_INJECT_FAILED
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-08-12 16:35                 ` Boqun Feng
@ 2016-08-12 18:11                   ` Mathieu Desnoyers
  2016-08-13  1:28                     ` Boqun Feng
  0 siblings, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-12 18:11 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Dave Watson, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Aug 12, 2016, at 12:35 PM, Boqun Feng boqun.feng@gmail.com wrote:

> On Fri, Aug 12, 2016 at 01:30:15PM +0800, Boqun Feng wrote:
> [snip]
>> > > Besides, do we allow userspace programs do read-only access to the
>> > > memory objects modified by do_rseq(). If so, we have a problem when
>> > > there are two writes in a do_rseq()(either in the rseq critical section
>> > > or in the asm block), because in current implemetation, these two writes
>> > > are unordered, which makes the readers outside a do_rseq() could observe
>> > > the ordering of writes differently.
>> > > 
>> > > For rseq_finish2(), a simple solution would be making the "final" write
>> > > a RELEASE.
>> > 
>> > Indeed, we would need a release semantic for the final store here if this
>> > is the common use. Or we could duplicate the "flavors" of rseq_finish2 and
>> > add a rseq_finish2_release. We should find a way to eliminate code duplication
>> 
>> I'm in favor of a separate rseq_finish2_release().
>> 
>> > there. I suspect we'll end up doing macros.
>> > 
>> 
>> Me too. Lemme have a try ;-)
>> 
> 
> How about this? Although a little messy, I separated the asm block into
> several parts and implemented each part in an arch-dependent way.

I find it rather hard to follow the per-arch assembly with this approach.
It might prove to be troublesome if we want to do arch-specific optimizations
in the future.

I've come up with the following macro approach instead, feedback welcome!


commit 4d27431d6aefaee617540ef04518962b0e4d14f4
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Thu Aug 11 19:11:27 2016 -0400

    rseq_finish2, rseq_finish2_release (WIP)

diff --git a/tools/testing/selftests/rseq/param_test.c b/tools/testing/selftests/rseq/param_test.c
index 5f88b6b..59efc98 100644
--- a/tools/testing/selftests/rseq/param_test.c
+++ b/tools/testing/selftests/rseq/param_test.c
@@ -41,7 +41,8 @@ static __thread unsigned int yield_mod_cnt, nr_retry;
 	, [loop_cnt_1]"m"(loop_cnt[1]) \
 	, [loop_cnt_2]"m"(loop_cnt[2]) \
 	, [loop_cnt_3]"m"(loop_cnt[3]) \
-	, [loop_cnt_4]"m"(loop_cnt[4])
+	, [loop_cnt_4]"m"(loop_cnt[4]) \
+	, [loop_cnt_5]"m"(loop_cnt[5])
 
 #if defined(__x86_64__) || defined(__i386__)
 
@@ -548,7 +549,7 @@ static void show_usage(int argc, char **argv)
 	printf("	[-2 loops] Number of loops for delay injection 2\n");
 	printf("	[-3 loops] Number of loops for delay injection 3\n");
 	printf("	[-4 loops] Number of loops for delay injection 4\n");
-	printf("	[-5 loops] Number of loops for delay injection 5 (-1 to enable -m)\n");
+	printf("	[-5 loops] Number of loops for delay injection 5\n");
 	printf("	[-6 loops] Number of loops for delay injection 6 (-1 to enable -m)\n");
 	printf("	[-7 loops] Number of loops for delay injection 7 (-1 to enable -m)\n");
 	printf("	[-8 loops] Number of loops for delay injection 8 (-1 to enable -m)\n");
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
index 5853b17..6da993d 100644
--- a/tools/testing/selftests/rseq/rseq.h
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -269,7 +269,7 @@ struct rseq_state rseq_start(struct rseq_lock *rlock)
 		result.event_counter =
 			ACCESS_ONCE(result.rseqp->u.e.event_counter);
 		/* load event_counter before cpu_id. */
-		RSEQ_INJECT_C(5)
+		RSEQ_INJECT_C(6)
 		result.cpu_id = ACCESS_ONCE(result.rseqp->u.e.cpu_id);
 	}
 	/*
@@ -281,7 +281,7 @@ struct rseq_state rseq_start(struct rseq_lock *rlock)
 	 * preemption/signalling will cause them to restart, so they
 	 * don't interfere with the lock.
 	 */
-	RSEQ_INJECT_C(6)
+	RSEQ_INJECT_C(7)
 
 	if (!has_fast_acquire_release() && likely(rseq_has_sys_membarrier)) {
 		result.lock_state = ACCESS_ONCE(rlock->state);
@@ -304,192 +304,342 @@ struct rseq_state rseq_start(struct rseq_lock *rlock)
 	return result;
 }
 
-static inline __attribute__((always_inline))
-bool rseq_finish(struct rseq_lock *rlock,
-		intptr_t *p, intptr_t to_write,
-		struct rseq_state start_value)
-{
-	RSEQ_INJECT_C(9)
+/*
+ * The __rseq_table section can be used by debuggers to better handle
+ * single-stepping through the restartable critical sections.
+ */
 
-	if (unlikely(start_value.lock_state != RSEQ_LOCK_STATE_RESTART)) {
-		if (start_value.lock_state == RSEQ_LOCK_STATE_LOCK)
-			rseq_fallback_wait(rlock);
-		return false;
-	}
+#ifdef __x86_64__
 
-	/*
-	 * The __rseq_table section can be used by debuggers to better
-	 * handle single-stepping through the restartable critical
-	 * sections.
-	 */
+#define RSEQ_FINISH_ASM(_target_final, _to_write_final, _start_value, \
+		_failure, extra_store, extra_input) \
+	__asm__ __volatile__ goto ( \
+		".pushsection __rseq_table, \"aw\"\n\t" \
+		".balign 32\n\t" \
+		"3:\n\t" \
+		".quad 1f, 2f, %l[failure], 0x0\n\t" \
+		".popsection\n\t" \
+		"1:\n\t" \
+		RSEQ_INJECT_ASM(1) \
+		"movq $3b, (%[rseq_cs])\n\t" \
+		RSEQ_INJECT_ASM(2) \
+		"cmpl %[start_event_counter], %[current_event_counter]\n\t" \
+		"jnz %l[failure]\n\t" \
+		RSEQ_INJECT_ASM(3) \
+		extra_store \
+		"movq %[to_write_final], (%[target_final])\n\t" \
+		"2:\n\t" \
+		RSEQ_INJECT_ASM(5) \
+		"movq $0, (%[rseq_cs])\n\t" \
+		: /* no outputs */ \
+		: [start_event_counter]"r"((_start_value).event_counter), \
+		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
+		  [to_write_final]"r"(_to_write_final), \
+		  [target_final]"r"(_target_final), \
+		  [rseq_cs]"r"(&(_start_value).rseqp->rseq_cs) \
+		  extra_input \
+		  RSEQ_INJECT_INPUT \
+		: "memory", "cc" \
+		  RSEQ_INJECT_CLOBBER \
+		: _failure \
+	);
 
-#ifdef __x86_64__
-	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		".quad 1f, 2f, %l[failure], 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
-		RSEQ_INJECT_ASM(1)
-		"movq $3b, (%[rseq_cs])\n\t"
-		RSEQ_INJECT_ASM(2)
-		"cmpl %[start_event_counter], %[current_event_counter]\n\t"
-		"jnz %l[failure]\n\t"
-		RSEQ_INJECT_ASM(3)
-		"movq %[to_write], (%[target])\n\t"
-		"2:\n\t"
+#define RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \
+		"movq %[to_write_spec], (%[target_spec])\n\t" \
 		RSEQ_INJECT_ASM(4)
-		"movq $0, (%[rseq_cs])\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write]"r"(to_write),
-		  [target]"r"(p),
-		  [rseq_cs]"r"(&start_value.rseqp->rseq_cs)
-		  RSEQ_INJECT_INPUT
-		: "memory", "cc"
-		  RSEQ_INJECT_CLOBBER
-		: failure
-	);
+
+/* x86-64 is TSO */
+#define RSEQ_FINISH2_RELEASE_SPECULATIVE_STORE_ASM() \
+	RSEQ_FINISH2_SPECULATIVE_STORE_ASM()
+
+#define RSEQ_FINISH2_SPECULATIVE_STORE_INPUT_ASM(_target_spec, _to_write_spec) \
+		, [to_write_spec]"r"(_to_write_spec), \
+		[target_spec]"r"(_target_spec)
+
 #elif defined(__i386__)
-	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		".long 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0, 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
-		RSEQ_INJECT_ASM(1)
-		"movl $3b, (%[rseq_cs])\n\t"
-		RSEQ_INJECT_ASM(2)
-		"cmpl %[start_event_counter], %[current_event_counter]\n\t"
-		"jnz %l[failure]\n\t"
-		RSEQ_INJECT_ASM(3)
-		"movl %[to_write], (%[target])\n\t"
-		"2:\n\t"
-		RSEQ_INJECT_ASM(4)
-		"movl $0, (%[rseq_cs])\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write]"r"(to_write),
-		  [target]"r"(p),
-		  [rseq_cs]"r"(&start_value.rseqp->rseq_cs)
-		  RSEQ_INJECT_INPUT
-		: "memory", "cc"
-		  RSEQ_INJECT_CLOBBER
-		: failure
+
+#define RSEQ_FINISH_ASM(_target_final, _to_write_final, _start_value, \
+		_failure, extra_store, extra_input) \
+	__asm__ __volatile__ goto ( \
+		".pushsection __rseq_table, \"aw\"\n\t" \
+		".balign 32\n\t" \
+		"3:\n\t" \
+		".long 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0, 0x0\n\t" \
+		".popsection\n\t" \
+		"1:\n\t" \
+		RSEQ_INJECT_ASM(1) \
+		"movl $3b, (%[rseq_cs])\n\t" \
+		RSEQ_INJECT_ASM(2) \
+		"cmpl %[start_event_counter], %[current_event_counter]\n\t" \
+		"jnz %l[failure]\n\t" \
+		RSEQ_INJECT_ASM(3) \
+		extra_store \
+		"movl %[to_write_final], (%[target_final])\n\t" \
+		"2:\n\t" \
+		RSEQ_INJECT_ASM(5) \
+		"movl $0, (%[rseq_cs])\n\t" \
+		: /* no outputs */ \
+		: [start_event_counter]"r"((_start_value).event_counter), \
+		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
+		  [to_write_final]"r"(_to_write_final), \
+		  [target_final]"r"(_target_final), \
+		  [rseq_cs]"r"(&(_start_value).rseqp->rseq_cs) \
+		  extra_input \
+		  RSEQ_INJECT_INPUT \
+		: "memory", "cc" \
+		  RSEQ_INJECT_CLOBBER \
+		: _failure \
 	);
-#elif defined(__ARMEL__)
-	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		".word 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0, 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
-		RSEQ_INJECT_ASM(1)
-		"adr r0, 3f\n\t"
-		"str r0, [%[rseq_cs]]\n\t"
-		RSEQ_INJECT_ASM(2)
-		"ldr r0, %[current_event_counter]\n\t"
-		"mov r1, #0\n\t"
-		"cmp %[start_event_counter], r0\n\t"
-		"bne %l[failure]\n\t"
-		RSEQ_INJECT_ASM(3)
-		"str %[to_write], [%[target]]\n\t"
-		"2:\n\t"
+
+#define RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \
+		"movl %[to_write_spec], (%[target_spec])\n\t" \
 		RSEQ_INJECT_ASM(4)
-		"str r1, [%[rseq_cs]]\n\t"
-		"b 4f\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		".word 1b, 0x0, 2b, 0x0, l[failure], 0x0, 0x0, 0x0\n\t"
-		"4:\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write]"r"(to_write),
-		  [target]"r"(p),
-		  [rseq_cs]"r"(&start_value.rseqp->rseq_cs)
-		  RSEQ_INJECT_INPUT
-		: "r0", "r1", "memory", "cc"
-		  RSEQ_INJECT_CLOBBER
-		: failure
+
+#define RSEQ_FINISH2_RELEASE_SPECULATIVE_STORE_ASM() \
+		RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \
+		"lock; addl $0,0(%%esp)\n\t"
+
+#define RSEQ_FINISH2_SPECULATIVE_STORE_INPUT_ASM(_target_spec, _to_write_spec) \
+		, [to_write_spec]"r"(_to_write_spec), \
+		[target_spec]"r"(_target_spec)
+
+#elif defined(__ARMEL__)
+
+#define RSEQ_FINISH_ASM(_target_final, _to_write_final, _start_value, \
+		_failure, extra_store, extra_input) \
+	__asm__ __volatile__ goto ( \
+		".pushsection __rseq_table, \"aw\"\n\t" \
+		".balign 32\n\t" \
+		".word 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0, 0x0\n\t" \
+		".popsection\n\t" \
+		"1:\n\t" \
+		RSEQ_INJECT_ASM(1) \
+		"adr r0, 3f\n\t" \
+		"str r0, [%[rseq_cs]]\n\t" \
+		RSEQ_INJECT_ASM(2) \
+		"ldr r0, %[current_event_counter]\n\t" \
+		"mov r1, #0\n\t" \
+		"cmp %[start_event_counter], r0\n\t" \
+		"bne %l[failure]\n\t" \
+		RSEQ_INJECT_ASM(3) \
+		extra_store \
+		"str %[to_write_final], [%[target_final]]\n\t" \
+		"2:\n\t" \
+		RSEQ_INJECT_ASM(5) \
+		"str r1, [%[rseq_cs]]\n\t" \
+		"b 4f\n\t" \
+		".balign 32\n\t" \
+		"3:\n\t" \
+		".word 1b, 0x0, 2b, 0x0, l[failure], 0x0, 0x0, 0x0\n\t" \
+		"4:\n\t" \
+		: /* no outputs */ \
+		: [start_event_counter]"r"((_start_value).event_counter), \
+		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
+		  [to_write_final]"r"(_to_write_final), \
+		  [target_final]"r"(_target_final), \
+		  [rseq_cs]"r"(&(_start_value).rseqp->rseq_cs) \
+		  extra_input \
+		  RSEQ_INJECT_INPUT \
+		: "r0", "r1", "memory", "cc" \
+		  RSEQ_INJECT_CLOBBER \
+		: _failure \
 	);
-#elif __PPC64__
-	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		".quad 1f, 2f, %l[failure], 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
-		RSEQ_INJECT_ASM(1)
-		"lis %%r17, (3b)@highest\n\t"
-		"ori %%r17, %%r17, (3b)@higher\n\t"
-		"rldicr %%r17, %%r17, 32, 31\n\t"
-		"oris %%r17, %%r17, (3b)@h\n\t"
-		"ori %%r17, %%r17, (3b)@l\n\t"
-		"std %%r17, 0(%[rseq_cs])\n\t"
-		RSEQ_INJECT_ASM(2)
-		"lwz %%r17, %[current_event_counter]\n\t"
-		"cmpw cr7, %[start_event_counter], %%r17\n\t"
-		"bne- cr7, %l[failure]\n\t"
-		RSEQ_INJECT_ASM(3)
-		"std %[to_write], 0(%[target])\n\t"
-		"2:\n\t"
+
+#define RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \
+		"str %[to_write_spec], [%[target_spec]]\n\t" \
 		RSEQ_INJECT_ASM(4)
-		"li %%r17, 0\n\t"
-		"std %%r17, 0(%[rseq_cs])\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write]"r"(to_write),
-		  [target]"b"(p),
-		  [rseq_cs]"b"(&start_value.rseqp->rseq_cs)
-		  RSEQ_INJECT_INPUT
-		: "r17", "memory", "cc"
-		  RSEQ_INJECT_CLOBBER
-		: failure
+
+#define RSEQ_FINISH2_RELEASE_SPECULATIVE_STORE_ASM() \
+		RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \
+		"dmb\n\t"
+
+#define RSEQ_FINISH2_SPECULATIVE_STORE_INPUT_ASM(_target_spec, _to_write_spec) \
+		, [to_write_spec]"r"(_to_write_spec), \
+		[target_spec]"r"(_target_spec)
+
+#elif __PPC64__
+
+#define RSEQ_FINISH_ASM(_target_final, _to_write_final, _start_value, \
+		_failure, extra_store, extra_input) \
+	__asm__ __volatile__ goto ( \
+		".pushsection __rseq_table, \"aw\"\n\t" \
+		".balign 32\n\t" \
+		"3:\n\t" \
+		".quad 1f, 2f, %l[failure], 0x0\n\t" \
+		".popsection\n\t" \
+		"1:\n\t" \
+		RSEQ_INJECT_ASM(1) \
+		"lis %%r17, (3b)@highest\n\t" \
+		"ori %%r17, %%r17, (3b)@higher\n\t" \
+		"rldicr %%r17, %%r17, 32, 31\n\t" \
+		"oris %%r17, %%r17, (3b)@h\n\t" \
+		"ori %%r17, %%r17, (3b)@l\n\t" \
+		"std %%r17, 0(%[rseq_cs])\n\t" \
+		RSEQ_INJECT_ASM(2) \
+		"lwz %%r17, %[current_event_counter]\n\t" \
+		"cmpw cr7, %[start_event_counter], %%r17\n\t" \
+		"bne- cr7, %l[failure]\n\t" \
+		RSEQ_INJECT_ASM(3) \
+		extra_store \
+		"std %[to_write_final], 0(%[target_final])\n\t" \
+		"2:\n\t" \
+		RSEQ_INJECT_ASM(5) \
+		"li %%r17, 0\n\t" \
+		"std %%r17, 0(%[rseq_cs])\n\t" \
+		: /* no outputs */ \
+		: [start_event_counter]"r"((_start_value).event_counter), \
+		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
+		  [to_write_final]"r"(_to_write_final), \
+		  [target_final]"b"(_target_final), \
+		  [rseq_cs]"b"(&(_start_value).rseqp->rseq_cs) \
+		  extra_input \
+		  RSEQ_INJECT_INPUT \
+		: "r17", "memory", "cc" \
+		  RSEQ_INJECT_CLOBBER \
+		: _failure \
 	);
-#elif __PPC__
-	__asm__ __volatile__ goto (
-		".pushsection __rseq_table, \"aw\"\n\t"
-		".balign 32\n\t"
-		"3:\n\t"
-		/* 32-bit only supported on BE */
-		".long 0x0, 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0\n\t"
-		".popsection\n\t"
-		"1:\n\t"
-		RSEQ_INJECT_ASM(1)
-		"lis %%r17, (3b)@ha\n\t"
-		"addi %%r17, %%r17, (3b)@l\n\t"
-		"stw %%r17, 0(%[rseq_cs])\n\t"
-		RSEQ_INJECT_ASM(2)
-		"lwz %%r17, %[current_event_counter]\n\t"
-		"cmpw cr7, %[start_event_counter], %%r17\n\t"
-		"bne- cr7, %l[failure]\n\t"
-		RSEQ_INJECT_ASM(3)
-		"stw %[to_write], 0(%[target])\n\t"
-		"2:\n\t"
+
+#define RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \
+		"std %[to_write_spec], 0(%[target_spec])\n\t" \
 		RSEQ_INJECT_ASM(4)
-		"li %%r17, 0\n\t"
-		"stw %%r17, 0(%[rseq_cs])\n\t"
-		: /* no outputs */
-		: [start_event_counter]"r"(start_value.event_counter),
-		  [current_event_counter]"m"(start_value.rseqp->u.e.event_counter),
-		  [to_write]"r"(to_write),
-		  [target]"b"(p),
-		  [rseq_cs]"b"(&start_value.rseqp->rseq_cs)
-		  RSEQ_INJECT_INPUT
-		: "r17", "memory", "cc"
-		  RSEQ_INJECT_CLOBBER
-		: failure
+
+#define RSEQ_FINISH2_RELEASE_SPECULATIVE_STORE_ASM() \
+		RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \
+		"lwsync\n\t"
+
+#define RSEQ_FINISH2_SPECULATIVE_STORE_INPUT_ASM(_target_spec, _to_write_spec) \
+		, [to_write_spec]"r"(_to_write_spec), \
+		[target_spec]"b"(_target_spec)
+
+#elif __PPC__
+
+#define RSEQ_FINISH_ASM(_target_final, _to_write_final, _start_value, \
+		_failure, extra_store, extra_input) \
+	__asm__ __volatile__ goto ( \
+		".pushsection __rseq_table, \"aw\"\n\t" \
+		".balign 32\n\t" \
+		"3:\n\t" \
+		/* 32-bit only supported on BE */ \
+		".long 0x0, 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0\n\t" \
+		".popsection\n\t" \
+		"1:\n\t" \
+		RSEQ_INJECT_ASM(1) \
+		"lis %%r17, (3b)@ha\n\t" \
+		"addi %%r17, %%r17, (3b)@l\n\t" \
+		"stw %%r17, 0(%[rseq_cs])\n\t" \
+		RSEQ_INJECT_ASM(2) \
+		"lwz %%r17, %[current_event_counter]\n\t" \
+		"cmpw cr7, %[start_event_counter], %%r17\n\t" \
+		"bne- cr7, %l[failure]\n\t" \
+		RSEQ_INJECT_ASM(3) \
+		extra_store \
+		"stw %[to_write_final], 0(%[target_final])\n\t" \
+		"2:\n\t" \
+		RSEQ_INJECT_ASM(5) \
+		"li %%r17, 0\n\t" \
+		"stw %%r17, 0(%[rseq_cs])\n\t" \
+		: /* no outputs */ \
+		: [start_event_counter]"r"((_start_value).event_counter), \
+		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
+		  [to_write_final]"r"(_to_write_final), \
+		  [target_final]"b"(_target_final), \
+		  [rseq_cs]"b"(&(_start_value).rseqp->rseq_cs) \
+		  extra_input \
+		  RSEQ_INJECT_INPUT \
+		: "r17", "memory", "cc" \
+		  RSEQ_INJECT_CLOBBER \
+		: _failure \
 	);
+
+#define RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \
+		"stw %[to_write_spec], 0(%[target_spec])\n\t" \
+		RSEQ_INJECT_ASM(4)
+
+#define RSEQ_FINISH2_RELEASE_SPECULATIVE_STORE_ASM() \
+		RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \
+		"lwsync\n\t"
+
+#define RSEQ_FINISH2_SPECULATIVE_STORE_INPUT_ASM(_target_spec, _to_write_spec) \
+		, [to_write_spec]"r"(_to_write_spec), \
+		[target_spec]"b"(_target_spec)
+
 #else
 #error unsupported target
 #endif
+
+static inline __attribute__((always_inline))
+bool rseq_finish(struct rseq_lock *rlock,
+		intptr_t *p, intptr_t to_write,
+		struct rseq_state start_value)
+{
+	RSEQ_INJECT_C(8)
+
+	if (unlikely(start_value.lock_state != RSEQ_LOCK_STATE_RESTART)) {
+		if (start_value.lock_state == RSEQ_LOCK_STATE_LOCK)
+			rseq_fallback_wait(rlock);
+		return false;
+	}
+
+	RSEQ_FINISH_ASM(p, to_write, start_value, failure, , );
+
+	return true;
+failure:
+	RSEQ_INJECT_FAILED
+	ACCESS_ONCE(start_value.rseqp->rseq_cs) = 0;
+	return false;
+}
+
+/*
+ * p_spec and to_write_spec are used for a speculative write attempted
+ * near the end of the restartable sequence. A rseq_finish2 may fail
+ * even after this write takes place.
+ *
+ * p_final and to_write_final are used for the final write. If this
+ * write takes place, the rseq_finish2 is guaranteed to succeed.
+ */
+static inline __attribute__((always_inline))
+bool rseq_finish2(struct rseq_lock *rlock,
+		intptr_t *p_spec, intptr_t to_write_spec,
+		intptr_t *p_final, intptr_t to_write_final,
+		struct rseq_state start_value)
+{
+	RSEQ_INJECT_C(9)
+
+	if (unlikely(start_value.lock_state != RSEQ_LOCK_STATE_RESTART)) {
+		if (start_value.lock_state == RSEQ_LOCK_STATE_LOCK)
+			rseq_fallback_wait(rlock);
+		return false;
+	}
+
+	RSEQ_FINISH_ASM(p_final, to_write_final, start_value, failure,
+		RSEQ_FINISH2_SPECULATIVE_STORE_ASM(),
+		RSEQ_FINISH2_SPECULATIVE_STORE_INPUT_ASM(p_spec, to_write_spec)
+	);
+	return true;
+failure:
+	RSEQ_INJECT_FAILED
+	ACCESS_ONCE(start_value.rseqp->rseq_cs) = 0;
+	return false;
+}
+
+static inline __attribute__((always_inline))
+bool rseq_finish2_release(struct rseq_lock *rlock,
+		intptr_t *p_spec, intptr_t to_write_spec,
+		intptr_t *p_final, intptr_t to_write_final,
+		struct rseq_state start_value)
+{
+	RSEQ_INJECT_C(9)
+
+	if (unlikely(start_value.lock_state != RSEQ_LOCK_STATE_RESTART)) {
+		if (start_value.lock_state == RSEQ_LOCK_STATE_LOCK)
+			rseq_fallback_wait(rlock);
+		return false;
+	}
+
+	RSEQ_FINISH_ASM(p_final, to_write_final, start_value, failure,
+		RSEQ_FINISH2_RELEASE_SPECULATIVE_STORE_ASM(),
+		RSEQ_FINISH2_SPECULATIVE_STORE_INPUT_ASM(p_spec, to_write_spec)
+	);
 	return true;
 failure:
 	RSEQ_INJECT_FAILED

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-08-11 23:26         ` Mathieu Desnoyers
  2016-08-12  1:28           ` Boqun Feng
@ 2016-08-12 19:36           ` Mathieu Desnoyers
  2016-08-12 20:05             ` Dave Watson
  1 sibling, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-12 19:36 UTC (permalink / raw)
  To: Dave Watson
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Aug 11, 2016, at 7:26 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Jul 24, 2016, at 2:01 PM, Dave Watson davejwatson@fb.com wrote:
> 
>>>> +static inline __attribute__((always_inline))
>>>> +bool rseq_finish(struct rseq_lock *rlock,
>>>> + intptr_t *p, intptr_t to_write,
>>>> + struct rseq_state start_value)
>> 
>>>> This ABI looks like it will work fine for our use case. I don't think it
>>>> has been mentioned yet, but we may still need multiple asm blocks
>>>> for differing numbers of writes. For example, an array-based freelist push:
>> 
>>>> void push(void *obj) {
>>>> if (index < maxlen) {
>>>> freelist[index++] = obj;
>>>> }
>>>> }
>> 
>>>> would be more efficiently implemented with a two-write rseq_finish:
>> 
>>>> rseq_finish2(&freelist[index], obj, // first write
>>>> &index, index + 1, // second write
>>>> ...);
>> 
>>> Would pairing one rseq_start with two rseq_finish do the trick
>>> there ?
>> 
>> Yes, two rseq_finish works, as long as the extra rseq management overhead
>> is not substantial.
> 
> I've added a commit implementing rseq_finish2() in my rseq volatile
> dev branch. You can fetch it at:
> 
> https://github.com/compudj/linux-percpu-dev/tree/rseq-fallback
> 
> I also have a separate test and benchmark tree in addition to the
> kernel selftests here:
> 
> https://github.com/compudj/rseq-test
> 
> I named the first write a "speculative" write, and the second write
> the "final" write.
> 
> Would you like to extend the test cases to cover your intended use-case ?
> 

Hi Dave!

I just pushed a rseq_finish2() test in my rseq-fallback branch. It implements
a per-cpu buffer holding pointers, and pushes/pops items to/from it.
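
For reference, the push side of that test boils down to something like the
following (a simplified sketch; the buffer layout, field names, and the
rseq_lock/buffer/node variables are assumptions for illustration, the real
code is in param_test.c):

	struct rseq_state rs;
	bool pushed = false;

	do {
		rs = rseq_start(&rseq_lock);
		intptr_t offset = buffer[rs.cpu_id].offset;

		if (offset == buffer[rs.cpu_id].buflen)
			break;		/* per-cpu buffer full */
		/*
		 * Speculative write: store the pointer into its slot.
		 * Final write: publish the new offset.
		 */
		pushed = rseq_finish2(&rseq_lock,
				(intptr_t *)&buffer[rs.cpu_id].array[offset],
				(intptr_t)node,
				&buffer[rs.cpu_id].offset, offset + 1,
				rs);
	} while (!pushed);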

To use it:

cd tools/testing/selftests/rseq
./param_test -T b

(see -h for advanced usage)

Let me know if I got it right!

Thanks,

Mathieu

> Thanks,
> 
> Mathieu
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-08-12 19:36           ` Mathieu Desnoyers
@ 2016-08-12 20:05             ` Dave Watson
  2016-08-14 17:09               ` Mathieu Desnoyers
  0 siblings, 1 reply; 82+ messages in thread
From: Dave Watson @ 2016-08-12 20:05 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng


>>>> Would pairing one rseq_start with two rseq_finish do the trick
>>>> there ?
>>>
>>> Yes, two rseq_finish works, as long as the extra rseq management overhead
>>> is not substantial.
>>
>> I've added a commit implementing rseq_finish2() in my rseq volatile
>> dev branch. You can fetch it at:
>>
>> https://github.com/compudj/linux-percpu-dev/tree/rseq-fallback
>>
>> I also have a separate test and benchmark tree in addition to the
>> kernel selftests here:
>>
>> https://github.com/compudj/rseq-test
>>
>> I named the first write a "speculative" write, and the second write
>> the "final" write.
>>
>> Would you like to extend the test cases to cover your intended use-case ?
>>
>
>Hi Dave!
>
>I just pushed a rseq_finish2() test in my rseq-fallback branch. It implements
>a per-cpu buffer holding pointers, and pushes/pops items to/from it.
>
>To use it:
>
>cd tools/testing/selftests/rseq
>./param_test -T b
>
>(see -h for advanced usage)
>
>Let me know if I got it right!

Hi Mathieu,

Thanks, you beat me to it. I commented on GitHub, that's pretty much it.

> In the kernel, if rather than testing for:
> 
> if ((void __user *)instruction_pointer(regs) < post_commit_ip) {
> 
> we could test for both start_ip and post_commit_ip:
> 
> if ((void __user *)instruction_pointer(regs) < post_commit_ip
>     && (void __user *)instruction_pointer(regs) >= start_ip) {
> 
> We could perform the failure path (storing NULL into the rseq_cs
> field of struct rseq) in C rather than being required to do it in
> assembly at addresses >= to post_commit_ip, all because the kernel
> would test whether we are within the assembly block address range
> using both the lower and upper bounds (start_ip and post_commit_ip).

Sounds reasonable to me.  I agree it would be best to move the failure path 
out of the asm if possible.
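
For illustration only, the double-bound check discussed above could be
expressed as a small helper (a sketch, not the actual kernel code; the
patchset currently tests only the upper bound):

	static bool in_rseq_cs(unsigned long ip, unsigned long start_ip,
			unsigned long post_commit_ip)
	{
		/*
		 * Only restart when the IP lies within the critical
		 * section; outside of [start_ip, post_commit_ip) the
		 * failure path can be run in plain C.
		 */
		return ip >= start_ip && ip < post_commit_ip;
	}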

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-08-12 18:11                   ` Mathieu Desnoyers
@ 2016-08-13  1:28                     ` Boqun Feng
  2016-08-14 15:02                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 82+ messages in thread
From: Boqun Feng @ 2016-08-13  1:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Dave Watson, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

[-- Attachment #1: Type: text/plain, Size: 4705 bytes --]

On Fri, Aug 12, 2016 at 06:11:45PM +0000, Mathieu Desnoyers wrote:
> ----- On Aug 12, 2016, at 12:35 PM, Boqun Feng boqun.feng@gmail.com wrote:
> 
> > On Fri, Aug 12, 2016 at 01:30:15PM +0800, Boqun Feng wrote:
> > [snip]
> >> > > Besides, do we allow userspace programs do read-only access to the
> >> > > memory objects modified by do_rseq(). If so, we have a problem when
> >> > > there are two writes in a do_rseq()(either in the rseq critical section
> >> > > or in the asm block), because in current implemetation, these two writes
> >> > > are unordered, which makes the readers outside a do_rseq() could observe
> >> > > the ordering of writes differently.
> >> > > 
> >> > > For rseq_finish2(), a simple solution would be making the "final" write
> >> > > a RELEASE.
> >> > 
> >> > Indeed, we would need a release semantic for the final store here if this
> >> > is the common use. Or we could duplicate the "flavors" of rseq_finish2 and
> >> > add a rseq_finish2_release. We should find a way to eliminate code duplication
> >> 
> >> I'm in favor of a separate rseq_finish2_release().
> >> 
> >> > there. I suspect we'll end up doing macros.
> >> > 
> >> 
> >> Me too. Lemme have a try ;-)
> >> 
> > 
> > How about this? Although a little messy, I separated the asm block into
> > several parts and implemented each part in an arch-dependent way.
> 
> I find it rather hard to follow the per-arch assembly with this approach.
> It might prove to be troublesome if we want to do arch-specific optimizations
> in the future.
> 

It might be, but I was just trying to kill as much duplicated code as
possible, because the more duplication we have, the more maintenance effort
we need.

For example, PPC32 and PPC64 may have the same asm code to check the
event counter, but different code to do the final store.  Having the
same RSEQ_CHECK_COUNTER() for PPC32 and PPC64 actually makes it easier if
we come up with a way to optimize the counter check code on PPC.

And if some arch wants some very specific optimizations,
it could always write the whole asm block again rather than use the
helper macros.

> I've come up with the following macro approach instead, feedback welcome!
> 
> 
> commit 4d27431d6aefaee617540ef04518962b0e4d14f4
> Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Date:   Thu Aug 11 19:11:27 2016 -0400
> 
>     rseq_finish2, rseq_finish2_release (WIP)
> 
[...]
> +#elif defined(__ARMEL__)
> +
> +#define RSEQ_FINISH_ASM(_target_final, _to_write_final, _start_value, \
> +		_failure, extra_store, extra_input) \
> +	__asm__ __volatile__ goto ( \
> +		".pushsection __rseq_table, \"aw\"\n\t" \
> +		".balign 32\n\t" \
> +		".word 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0, 0x0\n\t" \
> +		".popsection\n\t" \
> +		"1:\n\t" \
> +		RSEQ_INJECT_ASM(1) \
> +		"adr r0, 3f\n\t" \
> +		"str r0, [%[rseq_cs]]\n\t" \
> +		RSEQ_INJECT_ASM(2) \
> +		"ldr r0, %[current_event_counter]\n\t" \
> +		"mov r1, #0\n\t" \
> +		"cmp %[start_event_counter], r0\n\t" \
> +		"bne %l[failure]\n\t" \
> +		RSEQ_INJECT_ASM(3) \
> +		extra_store \
> +		"str %[to_write_final], [%[target_final]]\n\t" \
> +		"2:\n\t" \
> +		RSEQ_INJECT_ASM(5) \
> +		"str r1, [%[rseq_cs]]\n\t" \

I find it a little weird to use an extra register here just for
zeroing rseq_cs. Could we do

		"mov r0, #0\n\t"
		"str r0, [%[rseq_cs]]\n\t"

here? That not only saves a register, but also removes the "mov r1,
#0" instruction from the fast path. Am I missing something subtle?

> +		"b 4f\n\t" \
> +		".balign 32\n\t" \
> +		"3:\n\t" \
> +		".word 1b, 0x0, 2b, 0x0, l[failure], 0x0, 0x0, 0x0\n\t" \
> +		"4:\n\t" \
> +		: /* no outputs */ \
> +		: [start_event_counter]"r"((_start_value).event_counter), \
> +		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
> +		  [to_write_final]"r"(_to_write_final), \
> +		  [target_final]"r"(_target_final), \
> +		  [rseq_cs]"r"(&(_start_value).rseqp->rseq_cs) \
> +		  extra_input \
> +		  RSEQ_INJECT_INPUT \
> +		: "r0", "r1", "memory", "cc" \
> +		  RSEQ_INJECT_CLOBBER \
> +		: _failure \
>  	);

[...]

> +
> +#define RSEQ_FINISH2_RELEASE_SPECULATIVE_STORE_ASM() \
> +		RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \
> +		"dmb\n\t"
> +

Having a RELEASE barrier here may be OK for all the archs we currently
support, but there are archs which, rather than using a lightweight
RELEASE barrier, use a special instruction for RELEASE operations,
for example AArch64. Do we need to take that into consideration and
define a RSEQ_FINISH_ASM_RELEASE() rather than a
RSEQ_FINISH2_SPECULATIVE_STORE_INPUT_ASM()?
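
On AArch64, for instance, the natural form would be a single
store-release instruction on the final store rather than a barrier plus
a plain store (a purely hypothetical sketch, as AArch64 is not covered
by this patchset):

	/* ARMv7, as in the patch above: dmb between the stores */
	"str %[to_write_spec], [%[target_spec]]\n\t"
	"dmb\n\t"
	"str %[to_write_final], [%[target_final]]\n\t"

	/* hypothetical AArch64 equivalent: store-release on the final store */
	"str %[to_write_spec], [%[target_spec]]\n\t"
	"stlr %[to_write_final], [%[target_final]]\n\t"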

[...]

Regards
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-08-13  1:28                     ` Boqun Feng
@ 2016-08-14 15:02                       ` Mathieu Desnoyers
  2016-08-15  0:56                         ` Boqun Feng
  0 siblings, 1 reply; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-14 15:02 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Dave Watson, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Aug 12, 2016, at 9:28 PM, Boqun Feng boqun.feng@gmail.com wrote:

> On Fri, Aug 12, 2016 at 06:11:45PM +0000, Mathieu Desnoyers wrote:
>> ----- On Aug 12, 2016, at 12:35 PM, Boqun Feng boqun.feng@gmail.com wrote:
>> 
>> > On Fri, Aug 12, 2016 at 01:30:15PM +0800, Boqun Feng wrote:
>> > [snip]
>> >> > > Besides, do we allow userspace programs do read-only access to the
>> >> > > memory objects modified by do_rseq(). If so, we have a problem when
>> >> > > there are two writes in a do_rseq()(either in the rseq critical section
>> >> > > or in the asm block), because in current implemetation, these two writes
>> >> > > are unordered, which makes the readers outside a do_rseq() could observe
>> >> > > the ordering of writes differently.
>> >> > > 
>> >> > > For rseq_finish2(), a simple solution would be making the "final" write
>> >> > > a RELEASE.
>> >> > 
>> >> > Indeed, we would need a release semantic for the final store here if this
>> >> > is the common use. Or we could duplicate the "flavors" of rseq_finish2 and
>> >> > add a rseq_finish2_release. We should find a way to eliminate code duplication
>> >> 
>> >> I'm in favor of a separate rseq_finish2_release().
>> >> 
>> >> > there. I suspect we'll end up doing macros.
>> >> > 
>> >> 
>> >> Me too. Lemme have a try ;-)
>> >> 
>> > 
>> > How about this? Although a little messy, I separated the asm block into
>> > several parts and implemented each part in an arch-dependent way.
>> 
>> I find it rather hard to follow the per-arch assembly with this approach.
>> It might prove to be troublesome if we want to do arch-specific optimizations
>> in the future.
>> 
> 
> It might be, but I was just trying to kill as much duplicate code as
> possible, because the more duplicate we have, the more maintain effort
> we need.
> 
> For example, PPC32 and PPC64 may have the same asm code to check the
> event counter, but different code to do the final store.  Having the
> same RSEQ_CHECK_COUNTER() for PPC32 and PPC64 actually makes it easy if
> we come up a way to optimize the counter check code on PPC.
> 
> And if some arch wants to have some very specifical optimizations,
> it could always write the whole asm block again rather than use the
> helpers macros.

Creating macros for each assembly "operation" done in the restartable
sequence ends up requiring that people learn a new custom mini-language,
and implement those macros for each architecture.

I'd rather prefer to let each architecture maintainer express the
restartable sequence directly in assembly, which is already known to
them, than require them to learn a new small macro-based language.

Eliminating duplicated code is a goal I agree with, but there are
ways to achieve this which don't end up creating a macro-based custom
mini-language (such as what I proposed below).

> 
>> I've come up with the following macro approach instead, feedback welcome!
>> 
>> 
>> commit 4d27431d6aefaee617540ef04518962b0e4d14f4
>> Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Date:   Thu Aug 11 19:11:27 2016 -0400
>> 
>>     rseq_finish2, rseq_finish2_release (WIP)
>> 
> [...]
>> +#elif defined(__ARMEL__)
>> +
>> +#define RSEQ_FINISH_ASM(_target_final, _to_write_final, _start_value, \
>> +		_failure, extra_store, extra_input) \
>> +	__asm__ __volatile__ goto ( \
>> +		".pushsection __rseq_table, \"aw\"\n\t" \
>> +		".balign 32\n\t" \
>> +		".word 1f, 0x0, 2f, 0x0, %l[failure], 0x0, 0x0, 0x0\n\t" \
>> +		".popsection\n\t" \
>> +		"1:\n\t" \
>> +		RSEQ_INJECT_ASM(1) \
>> +		"adr r0, 3f\n\t" \
>> +		"str r0, [%[rseq_cs]]\n\t" \
>> +		RSEQ_INJECT_ASM(2) \
>> +		"ldr r0, %[current_event_counter]\n\t" \
>> +		"mov r1, #0\n\t" \
>> +		"cmp %[start_event_counter], r0\n\t" \
>> +		"bne %l[failure]\n\t" \
>> +		RSEQ_INJECT_ASM(3) \
>> +		extra_store \
>> +		"str %[to_write_final], [%[target_final]]\n\t" \
>> +		"2:\n\t" \
>> +		RSEQ_INJECT_ASM(5) \
>> +		"str r1, [%[rseq_cs]]\n\t" \
> 
> I find this is a little weird here, that is having an extra register for
> zeroing the rseq_cs. Could we
> 
>		"mov r0, #0\n\t"
>		"str r0, [%[rseq_cs]]\n\t"
> 
> here? Which not only saves a register, but also an instruction "mov r1,
> #0" in the fast path. Am I missing something subtle?

In terms of fast-path, you would be trading:

(1)
	"ldr r0, %[current_event_counter]\n\t" \
	"mov r1, #0\n\t"
	"cmp %[start_event_counter], r0\n\t" \
	"bne %l[failure]\n\t" \
	"str %[to_write_final], [%[target_final]]\n\t" \
	"2:\n\t" \
	"str r1, [%[rseq_cs]]\n\t" \
for

(2)
	"ldr r0, %[current_event_counter]\n\t" \
	"cmp %[start_event_counter], r0\n\t" \
	"bne %l[failure]\n\t" \
	"str %[to_write_final], [%[target_final]]\n\t" \
	"2:\n\t" \
	"mov r0, #0\n\t"
	"str r0, [%[rseq_cs]]\n\t" \

Your proposal (2) saves a register (does not clobber r1), but this
is at the expense of a slower fast-path. In (1), loading the constant
0 is done while the processor is stalled on the current_event_counter
load, which is needed by a following comparison. Therefore, we can
increase instruction-level parallelism by placing the immediate value
0 load right after the ldr instruction. This, however, requires that
we use a different register than r0, because r0 is already used by the
ldr/cmp instructions.

Since this is a fast-path, achieving higher instruction throughput
is more important than saving a register.

I came up with this as an optimization while doing benchmarking
on an ARM32 Cubietruck as a reference architecture.

> 
>> +		"b 4f\n\t" \
>> +		".balign 32\n\t" \
>> +		"3:\n\t" \
>> +		".word 1b, 0x0, 2b, 0x0, l[failure], 0x0, 0x0, 0x0\n\t" \
>> +		"4:\n\t" \
>> +		: /* no outputs */ \
>> +		: [start_event_counter]"r"((_start_value).event_counter), \
>> +		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
>> +		  [to_write_final]"r"(_to_write_final), \
>> +		  [target_final]"r"(_target_final), \
>> +		  [rseq_cs]"r"(&(_start_value).rseqp->rseq_cs) \
>> +		  extra_input \
>> +		  RSEQ_INJECT_INPUT \
>> +		: "r0", "r1", "memory", "cc" \
>> +		  RSEQ_INJECT_CLOBBER \
>> +		: _failure \
>>  	);
> 
> [...]
> 
>> +
>> +#define RSEQ_FINISH2_RELEASE_SPECULATIVE_STORE_ASM() \
>> +		RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \
>> +		"dmb\n\t"
>> +
> 
> Having a RELEASE barrier here may be OK for all current archs we
> support, but there are archs which rather than have a lightweight
> RELEASE barrier but use a special instruction for RELEASE operations,
> for example, AArch64. Do we need to take that into consideration and
> define a RSEQ_FINISH_ASM_RELEASE() rather than a
> RSEQ_FINISH2_SPECULATIVE_STORE_INPUT_ASM()?

Good point. We should introduce the barrier before the final
store to fit this scenario. This would also work if we want to
do many speculative stores followed by a final store: it really
makes sense to put the barrier just before the final store rather
than after each speculative store.
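
On ARM 32 that amounts to something like the following (sketch only; the
pushed commit may differ in the details):

	/* speculative store(s), no barrier after each of them */
	"str %[to_write_spec], [%[target_spec]]\n\t"
	/* ... further speculative stores ... */
	/* one release barrier, just before the final store */
	"dmb\n\t"
	"str %[to_write_final], [%[target_final]]\n\t"
	"2:\n\t"
	"str r1, [%[rseq_cs]]\n\t"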

I just pushed a commit in my dev branch implementing this. Testing
is welcome.

Thanks!

Mathieu

> 
> [...]
> 
> Regards
> Boqun

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-08-12 20:05             ` Dave Watson
@ 2016-08-14 17:09               ` Mathieu Desnoyers
  0 siblings, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-14 17:09 UTC (permalink / raw)
  To: Dave Watson
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng

----- On Aug 12, 2016, at 4:05 PM, Dave Watson davejwatson@fb.com wrote:

>>>>> Would pairing one rseq_start with two rseq_finish do the trick
>>>>> there ?
>>>>
>>>> Yes, two rseq_finish works, as long as the extra rseq management overhead
>>>> is not substantial.
>>>
>>> I've added a commit implementing rseq_finish2() in my rseq volatile
>>> dev branch. You can fetch it at:
>>>
>>> https://github.com/compudj/linux-percpu-dev/tree/rseq-fallback
>>>
>>> I also have a separate test and benchmark tree in addition to the
>>> kernel selftests here:
>>>
>>> https://github.com/compudj/rseq-test
>>>
>>> I named the first write a "speculative" write, and the second write
>>> the "final" write.
>>>
>>> Would you like to extend the test cases to cover your intended use-case ?
>>>
>>
>>Hi Dave!
>>
>>I just pushed a rseq_finish2() test in my rseq-fallback branch. It implements
>>a per-cpu buffer holding pointers, and pushes/pops items to/from it.
>>
>>To use it:
>>
>>cd tools/testing/selftests/rseq
>>./param_test -T b
>>
>>(see -h for advanced usage)
>>
>>Let me know if I got it right!
> 

FYI, I have started implementing rseq_finish_memcpy() and rseq_finish_memcpy_release().
The idea is to perform an inline memcpy as the speculative writes before the
final store (the offset update). I have pushed the work in progress to my dev branch.

This would be an alternative to rseq_finish2() (which I still consider very useful)
in cases where we want to push a sequence of bytes into a ring buffer before updating
the offset counter, without having to rely on memory allocation.
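
A hypothetical usage sketch for that case (the rseq_finish_memcpy()
signature and the ring buffer layout below are assumptions, not the final
API; src/len are the payload being pushed and rseq_lock/ring are assumed
declared elsewhere):

	bool done = false;

	do {
		struct rseq_state rs = rseq_start(&rseq_lock);
		size_t offset = ring[rs.cpu_id].offset;

		if (offset + len > ring[rs.cpu_id].buflen)
			break;		/* not enough room */
		/*
		 * The inline memcpy of the payload is the speculative
		 * part; storing the updated offset is the final store.
		 */
		done = rseq_finish_memcpy(&rseq_lock,
				&ring[rs.cpu_id].buf[offset], src, len,
				(intptr_t *)&ring[rs.cpu_id].offset,
				(intptr_t)(offset + len), rs);
	} while (!done);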

Feedback is welcome!

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-08-14 15:02                       ` Mathieu Desnoyers
@ 2016-08-15  0:56                         ` Boqun Feng
  2016-08-15 18:06                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 82+ messages in thread
From: Boqun Feng @ 2016-08-15  0:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Dave Watson, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

[-- Attachment #1: Type: text/plain, Size: 6728 bytes --]

On Sun, Aug 14, 2016 at 03:02:20PM +0000, Mathieu Desnoyers wrote:
> ----- On Aug 12, 2016, at 9:28 PM, Boqun Feng boqun.feng@gmail.com wrote:
> 
> > On Fri, Aug 12, 2016 at 06:11:45PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Aug 12, 2016, at 12:35 PM, Boqun Feng boqun.feng@gmail.com wrote:
> >> 
> >> > On Fri, Aug 12, 2016 at 01:30:15PM +0800, Boqun Feng wrote:
> >> > [snip]
> >> >> > > Besides, do we allow userspace programs do read-only access to the
> >> >> > > memory objects modified by do_rseq(). If so, we have a problem when
> >> >> > > there are two writes in a do_rseq()(either in the rseq critical section
> >> >> > > or in the asm block), because in current implemetation, these two writes
> >> >> > > are unordered, which makes the readers outside a do_rseq() could observe
> >> >> > > the ordering of writes differently.
> >> >> > > 
> >> >> > > For rseq_finish2(), a simple solution would be making the "final" write
> >> >> > > a RELEASE.
> >> >> > 
> >> >> > Indeed, we would need a release semantic for the final store here if this
> >> >> > is the common use. Or we could duplicate the "flavors" of rseq_finish2 and
> >> >> > add a rseq_finish2_release. We should find a way to eliminate code duplication
> >> >> 
> >> >> I'm in favor of a separate rseq_finish2_release().
> >> >> 
> >> >> > there. I suspect we'll end up doing macros.
> >> >> > 
> >> >> 
> >> >> Me too. Lemme have a try ;-)
> >> >> 
> >> > 
> >> > How about this? Although a little messy, I separated the asm block into
> >> > several parts and implemented each part in an arch-dependent way.
> >> 
> >> I find it rather hard to follow the per-arch assembly with this approach.
> >> It might prove to be troublesome if we want to do arch-specific optimizations
> >> in the future.
> >> 
> > 
> > It might be, but I was just trying to kill as much duplicate code as
> > possible, because the more duplicate we have, the more maintain effort
> > we need.
> > 
> > For example, PPC32 and PPC64 may have the same asm code to check the
> > event counter, but different code to do the final store.  Having the
> > same RSEQ_CHECK_COUNTER() for PPC32 and PPC64 actually makes it easy if
> > we come up a way to optimize the counter check code on PPC.
> > 
> > And if some arch wants to have some very specifical optimizations,
> > it could always write the whole asm block again rather than use the
> > helpers macros.
> 
> Creating macros for each assembly "operation" done in the restartable
> sequence ends up requiring that people learn a new custom mini-language,
> and implement those macros for each architecture.
> 
> I'd rather prefer to let each architecture maintainer express the
> restartable sequence directly in assembly, which is already known to
> them, than require them to learn a new small macro-based language.
> 
> Eliminating duplicated code is a goal I agree with, but there are
> ways to achieve this which don't end up creating a macro-based custom
> mini-language (such as what I proposed below).
> 

Fair point ;-)

One more thing, do we want to use arch-specific header files to hold the
arch-specific assembly code? For example, rseq-x86.h, rseq-powerpc.h,
etc. This may save readers a lot of time if they are only interested
in a particular arch, and also make maintenance a little easier (no need
to worry about breaking other archs accidentally).
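
Something as simple as a dispatch header would do, e.g. (a sketch; the
arm file name is assumed by analogy with the names suggested above):

	/* rseq.h: generic helpers stay here, per-arch asm moves out */
	#if defined(__x86_64__) || defined(__i386__)
	#include "rseq-x86.h"
	#elif defined(__ARMEL__)
	#include "rseq-arm.h"
	#elif defined(__PPC__)
	#include "rseq-powerpc.h"
	#else
	#error unsupported target
	#endif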

[...]
> 
> In terms of fast-path, you would be trading:
> 
> (1)
> 	"ldr r0, %[current_event_counter]\n\t" \
> 	"mov r1, #0\n\t"
> 	"cmp %[start_event_counter], r0\n\t" \
> 	"bne %l[failure]\n\t" \
> 	"str %[to_write_final], [%[target_final]]\n\t" \
> 	"2:\n\t" \
> 	"str r1, [%[rseq_cs]]\n\t" \
> for
> 
> (2)
> 	"ldr r0, %[current_event_counter]\n\t" \
> 	"cmp %[start_event_counter], r0\n\t" \
> 	"bne %l[failure]\n\t" \
> 	"str %[to_write_final], [%[target_final]]\n\t" \
> 	"2:\n\t" \
> 	"mov r0, #0\n\t"
> 	"str r0, [%[rseq_cs]]\n\t" \
> 
> Your proposal (2) saves a register (does not clobber r1), but this
> is at the expense of a slower fast-path. In (1), loading the constant
> 0 is done while the processor is stalled on the current_event_counter
> load, which is needed by a following comparison. Therefore, we can
> increase instruction-level parallelism by placing the immediate value
> 0 load right after the ldr instruction. This, however, requires that
> we use a different register than r0, because r0 is already used by the
> ldr/cmp instructions.
> 
> Since this is a fast-path, achieving higher instruction throughput
> is more important than saving a register.
> 
> I came up with this as an optimization while doing benchmarking
> on a ARM32 Cubietruck as a reference architecture.
> 

Nice ;-) Better to put a comment there?

I should try to investigate something similar for powerpc.

> > 
> >> +		"b 4f\n\t" \
> >> +		".balign 32\n\t" \
> >> +		"3:\n\t" \
> >> +		".word 1b, 0x0, 2b, 0x0, l[failure], 0x0, 0x0, 0x0\n\t" \
> >> +		"4:\n\t" \
> >> +		: /* no outputs */ \
> >> +		: [start_event_counter]"r"((_start_value).event_counter), \
> >> +		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
> >> +		  [to_write_final]"r"(_to_write_final), \
> >> +		  [target_final]"r"(_target_final), \
> >> +		  [rseq_cs]"r"(&(_start_value).rseqp->rseq_cs) \
> >> +		  extra_input \
> >> +		  RSEQ_INJECT_INPUT \
> >> +		: "r0", "r1", "memory", "cc" \
> >> +		  RSEQ_INJECT_CLOBBER \
> >> +		: _failure \
> >>  	);
> > 
> > [...]
> > 
> >> +
> >> +#define RSEQ_FINISH2_RELEASE_SPECULATIVE_STORE_ASM() \
> >> +		RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \
> >> +		"dmb\n\t"
> >> +
> > 
> > Having a RELEASE barrier here may be OK for all current archs we
> > support, but there are archs which rather than have a lightweight
> > RELEASE barrier but use a special instruction for RELEASE operations,
> > for example, AArch64. Do we need to take that into consideration and
> > define a RSEQ_FINISH_ASM_RELEASE() rather than a
> > RSEQ_FINISH2_SPECULATIVE_STORE_INPUT_ASM()?
> 
> Good point. We should introduce the barrier before the final
> store to fit this scenario. This would also work if we want to
> do many speculative stores followed by a final store: it really
> makes sense to put the barrier just before the final store rather
> than after each speculative store.
> 
> I just pushed a commit in my dev branch implementing this. Testing
> is welcome.
> 

Sure, let me play around ;-)

Regards,
Boqun

> Thanks!
> 
> Mathieu
> 
> > 
> > [...]
> > 
> > Regards
> > Boqun
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests
  2016-08-15  0:56                         ` Boqun Feng
@ 2016-08-15 18:06                           ` Mathieu Desnoyers
  0 siblings, 0 replies; 82+ messages in thread
From: Mathieu Desnoyers @ 2016-08-15 18:06 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Dave Watson, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Aug 14, 2016, at 8:56 PM, Boqun Feng boqun.feng@gmail.com wrote:

> On Sun, Aug 14, 2016 at 03:02:20PM +0000, Mathieu Desnoyers wrote:
>> ----- On Aug 12, 2016, at 9:28 PM, Boqun Feng boqun.feng@gmail.com wrote:
>> 
>> > On Fri, Aug 12, 2016 at 06:11:45PM +0000, Mathieu Desnoyers wrote:
>> >> ----- On Aug 12, 2016, at 12:35 PM, Boqun Feng boqun.feng@gmail.com wrote:
>> >> 
>> >> > On Fri, Aug 12, 2016 at 01:30:15PM +0800, Boqun Feng wrote:
>> >> > [snip]
>> >> >> > > Besides, do we allow userspace programs do read-only access to the
>> >> >> > > memory objects modified by do_rseq(). If so, we have a problem when
>> >> >> > > there are two writes in a do_rseq()(either in the rseq critical section
>> >> >> > > or in the asm block), because in current implemetation, these two writes
>> >> >> > > are unordered, which makes the readers outside a do_rseq() could observe
>> >> >> > > the ordering of writes differently.
>> >> >> > > 
>> >> >> > > For rseq_finish2(), a simple solution would be making the "final" write
>> >> >> > > a RELEASE.
>> >> >> > 
>> >> >> > Indeed, we would need a release semantic for the final store here if this
>> >> >> > is the common use. Or we could duplicate the "flavors" of rseq_finish2 and
>> >> >> > add a rseq_finish2_release. We should find a way to eliminate code duplication
>> >> >> 
>> >> >> I'm in favor of a separate rseq_finish2_release().
>> >> >> 
>> >> >> > there. I suspect we'll end up doing macros.
>> >> >> > 
>> >> >> 
>> >> >> Me too. Lemme have a try ;-)
>> >> >> 
>> >> > 
>> >> > How about this? Although a little messy, I separated the asm block into
>> >> > several parts and implemented each part in an arch-dependent way.
>> >> 
>> >> I find it rather hard to follow the per-arch assembly with this approach.
>> >> It might prove to be troublesome if we want to do arch-specific optimizations
>> >> in the future.
>> >> 
>> > 
>> > It might be, but I was just trying to kill as much duplicate code as
>> > possible, because the more duplicate we have, the more maintain effort
>> > we need.
>> > 
>> > For example, PPC32 and PPC64 may have the same asm code to check the
>> > event counter, but different code to do the final store.  Having the
>> > same RSEQ_CHECK_COUNTER() for PPC32 and PPC64 actually makes it easy if
>> > we come up a way to optimize the counter check code on PPC.
>> > 
>> > And if some arch wants to have some very specifical optimizations,
>> > it could always write the whole asm block again rather than use the
>> > helpers macros.
>> 
>> Creating macros for each assembly "operation" done in the restartable
>> sequence ends up requiring that people learn a new custom mini-language,
>> and implement those macros for each architecture.
>> 
>> I'd rather prefer to let each architecture maintainer express the
>> restartable sequence directly in assembly, which is already known to
>> them, than require them to learn a new small macro-based language.
>> 
>> Eliminating duplicated code is a goal I agree with, but there are
>> ways to achieve this which don't end up creating a macro-based custom
>> mini-language (such as what I proposed below).
>> 
> 
> Fair point ;-)
> 
> One more thing, do we want to use arch-specific header files to put
> arch-specific assembly code? For example, rseq-x86.h, rseq-powerpc.h,
> etc. This may save readers a lot of time if he or she is only interested
> in a particular arch, and also make maintaining a little easier(no need
> to worry about breaking other archs accidentally)
> 
> [...]

Good point. I wanted to wait until we had enough architectures before
doing this, but now that we have x86 32/64, ppc 32/64 and arm 32, it
appears to be the right time. Done and pushed.

>> 
>> In terms of fast-path, you would be trading:
>> 
>> (1)
>> 	"ldr r0, %[current_event_counter]\n\t" \
>> 	"mov r1, #0\n\t"
>> 	"cmp %[start_event_counter], r0\n\t" \
>> 	"bne %l[failure]\n\t" \
>> 	"str %[to_write_final], [%[target_final]]\n\t" \
>> 	"2:\n\t" \
>> 	"str r1, [%[rseq_cs]]\n\t" \
>> for
>> 
>> (2)
>> 	"ldr r0, %[current_event_counter]\n\t" \
>> 	"cmp %[start_event_counter], r0\n\t" \
>> 	"bne %l[failure]\n\t" \
>> 	"str %[to_write_final], [%[target_final]]\n\t" \
>> 	"2:\n\t" \
>> 	"mov r0, #0\n\t"
>> 	"str r0, [%[rseq_cs]]\n\t" \
>> 
>> Your proposal (2) saves a register (does not clobber r1), but this
>> is at the expense of a slower fast-path. In (1), loading the constant
>> 0 is done while the processor is stalled on the current_event_counter
>> load, which is needed by a following comparison. Therefore, we can
>> increase instruction-level parallelism by placing the immediate value
>> 0 load right after the ldr instruction. This, however, requires that
>> we use a different register than r0, because r0 is already used by the
>> ldr/cmp instructions.
>> 
>> Since this is a fast-path, achieving higher instruction throughput
>> is more important than saving a register.
>> 
>> I came up with this as an optimization while benchmarking
>> on an ARM32 Cubietruck as a reference platform.
>> 
> 
> Nice ;-) Better to put a comment there?

Done.
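
For reference, the fast-path now carries a comment along these lines
(paraphrased sketch, not the exact wording committed):

	"ldr r0, %[current_event_counter]\n\t" \
	/* Load the constant 0 into r1 while the pipeline is stalled on the \
	 * ldr above: the following cmp needs r0, so this hides the latency \
	 * of clearing rseq_cs at label 2, at the cost of clobbering r1. */ \
	"mov r1, #0\n\t" \
	"cmp %[start_event_counter], r0\n\t" \
	"bne %l[failure]\n\t" \
	"str %[to_write_final], [%[target_final]]\n\t" \
	"2:\n\t" \
	"str r1, [%[rseq_cs]]\n\t" \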

> 
> I should try to investigate something similar for powerpc.
> 

Yes, you could try clobbering one extra register to move the
"li %%r17, 0\n\t" right after the lwz instruction. Depending on
the architecture's characteristics, it may speed things up a bit. I
would expect benchmarks on older architectures (e.g. old ppc32) to be
more affected by such a tweak than newer POWER8 machines.
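
Roughly, the reordered sequence would look like this (untested sketch:
r17/r18, the compare, and the store widths are illustrative only, not
copied from rseq-powerpc.h):

	"lwz %%r17, %[current_event_counter]\n\t" \
	/* Moved up: issue the li while the lwz above is still outstanding; \
	 * needs an extra clobbered register since r17 holds the counter. */ \
	"li %%r18, 0\n\t" \
	"cmpw %[start_event_counter], %%r17\n\t" \
	"bne- %l[failure]\n\t" \
	/* ... final store to %[target_final] ... */ \
	"2:\n\t" \
	"stw %%r18, 0(%[rseq_cs])\n\t" \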

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


end of thread, other threads:[~2016-08-15 18:06 UTC | newest]

Thread overview: 82+ messages
2016-07-21 21:14 [RFC PATCH v7 0/7] Restartable sequences system call Mathieu Desnoyers
2016-07-21 21:14 ` [RFC PATCH v7 1/7] " Mathieu Desnoyers
2016-07-25 23:02   ` Andy Lutomirski
2016-07-26  3:02     ` Mathieu Desnoyers
2016-08-03 12:27       ` Peter Zijlstra
2016-08-03 16:37         ` Andy Lutomirski
2016-08-03 18:31           ` Christoph Lameter
2016-08-04  5:01             ` Andy Lutomirski
2016-08-04  4:27           ` Boqun Feng
2016-08-04  5:03             ` Andy Lutomirski
2016-08-09 16:13               ` Boqun Feng
2016-08-10  8:01                 ` Andy Lutomirski
2016-08-10 17:40                   ` Mathieu Desnoyers
2016-08-10 17:33                 ` Mathieu Desnoyers
2016-08-11  4:54                   ` Boqun Feng
2016-08-10  8:13               ` Andy Lutomirski
2016-08-03 18:29       ` Christoph Lameter
2016-08-10 16:47         ` Mathieu Desnoyers
2016-08-10 16:59           ` Christoph Lameter
2016-07-27 15:03   ` Boqun Feng
2016-07-27 15:05     ` [RFC 1/4] rseq/param_test: Convert test_data_entry::count to intptr_t Boqun Feng
2016-07-27 15:05       ` [RFC 2/4] Restartable sequences: powerpc architecture support Boqun Feng
2016-07-28  3:13         ` Mathieu Desnoyers
2016-07-27 15:05       ` [RFC 3/4] Restartable sequences: Wire up powerpc system call Boqun Feng
2016-07-28  3:13         ` Mathieu Desnoyers
2016-07-27 15:05       ` [RFC 4/4] Restartable sequences: Add self-tests for PPC Boqun Feng
2016-07-28  2:59         ` Mathieu Desnoyers
2016-07-28  4:43           ` Boqun Feng
2016-07-28  7:37             ` [RFC v2] " Boqun Feng
2016-07-28 14:04               ` Mathieu Desnoyers
2016-07-28 13:42             ` [RFC 4/4] " Mathieu Desnoyers
2016-07-28  3:07       ` [RFC 1/4] rseq/param_test: Convert test_data_entry::count to intptr_t Mathieu Desnoyers
2016-07-28  3:10     ` [RFC PATCH v7 1/7] Restartable sequences system call Mathieu Desnoyers
2016-08-03 13:19   ` Peter Zijlstra
2016-08-03 14:53     ` Paul E. McKenney
2016-08-03 15:45     ` Boqun Feng
2016-08-07 15:36       ` Mathieu Desnoyers
2016-08-07 23:35         ` Boqun Feng
2016-08-09 13:22           ` Mathieu Desnoyers
2016-08-09 20:06     ` Mathieu Desnoyers
2016-08-09 21:33       ` Peter Zijlstra
2016-08-09 22:41         ` Mathieu Desnoyers
2016-08-10  7:50           ` Peter Zijlstra
2016-08-10 13:26             ` Mathieu Desnoyers
2016-08-10 13:33               ` Peter Zijlstra
2016-08-10 14:04                 ` Mathieu Desnoyers
2016-08-10  8:10       ` Andy Lutomirski
2016-08-10 19:04         ` Mathieu Desnoyers
2016-08-10 19:16           ` Andy Lutomirski
2016-08-10 20:06             ` Mathieu Desnoyers
2016-08-10 20:09               ` Andy Lutomirski
2016-08-10 21:01                 ` Mathieu Desnoyers
2016-08-11  7:23                   ` Andy Lutomirski
2016-08-10  8:43       ` Peter Zijlstra
2016-08-10 13:57         ` Mathieu Desnoyers
2016-08-10 14:28           ` Peter Zijlstra
2016-08-10 14:44             ` Mathieu Desnoyers
2016-08-10 13:29       ` Peter Zijlstra
2016-07-21 21:14 ` [RFC PATCH v7 2/7] tracing: instrument restartable sequences Mathieu Desnoyers
2016-07-21 21:14 ` [RFC PATCH v7 3/7] Restartable sequences: ARM 32 architecture support Mathieu Desnoyers
2016-07-21 21:14 ` [RFC PATCH v7 4/7] Restartable sequences: wire up ARM 32 system call Mathieu Desnoyers
2016-07-21 21:14 ` [RFC PATCH v7 5/7] Restartable sequences: x86 32/64 architecture support Mathieu Desnoyers
2016-07-21 21:14 ` [RFC PATCH v7 6/7] Restartable sequences: wire up x86 32/64 system call Mathieu Desnoyers
2016-07-21 21:14 ` [RFC PATCH v7 7/7] Restartable sequences: self-tests Mathieu Desnoyers
     [not found]   ` <CO1PR15MB09822FC140F84DCEEF2004CDDD0B0@CO1PR15MB0982.namprd15.prod.outlook.com>
2016-07-24  3:09     ` Mathieu Desnoyers
2016-07-24 18:01       ` Dave Watson
2016-07-25 16:43         ` Mathieu Desnoyers
2016-08-11 23:26         ` Mathieu Desnoyers
2016-08-12  1:28           ` Boqun Feng
2016-08-12  3:10             ` Mathieu Desnoyers
2016-08-12  3:13               ` Mathieu Desnoyers
2016-08-12  5:30               ` Boqun Feng
2016-08-12 16:35                 ` Boqun Feng
2016-08-12 18:11                   ` Mathieu Desnoyers
2016-08-13  1:28                     ` Boqun Feng
2016-08-14 15:02                       ` Mathieu Desnoyers
2016-08-15  0:56                         ` Boqun Feng
2016-08-15 18:06                           ` Mathieu Desnoyers
2016-08-12 19:36           ` Mathieu Desnoyers
2016-08-12 20:05             ` Dave Watson
2016-08-14 17:09               ` Mathieu Desnoyers
2016-07-25 18:12     ` Mathieu Desnoyers
