* [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11
From: Mathieu Desnoyers @ 2017-11-14 20:03 UTC
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers

Here is the last RFC round of the updated rseq patchset containing:

- Restartable sequences system call (x86 32/64, powerpc 32/64, arm 32),
- CPU operation vector system call (x86 32/64, powerpc 32/64, arm 32),
- membarrier shared expedited, and private expedited core serializing
  commands.

This is *not* yet a pull request. I'm submitting the patchset as RFC
one last time because I did a few small fixes and reordered the patches
since the last RFC round.

You can alternatively find this patchset as a git branch at this
location:

https://git.kernel.org/pub/scm/linux/kernel/git/rseq/linux-rseq.git
branch: v4.14-rseq-20171114

Orion Hodson is currently testing the private expedited core
serializing membarrier command on Android, where they have a
reproducer for a context synchronization issue on arm64. We should
know more about this shortly.

Thanks,

Mathieu

Boqun Feng (2):
  Restartable sequences: powerpc architecture support
  Restartable sequences: Wire up powerpc system call

Mathieu Desnoyers (22):
  Restartable sequences system call (v11)
  Restartable sequences: ARM 32 architecture support
  Restartable sequences: wire up ARM 32 system call
  Restartable sequences: x86 32/64 architecture support
  Restartable sequences: wire up x86 32/64 system call
  Provide cpu_opv system call (v3)
  cpu_opv: Wire up x86 32/64 system call
  cpu_opv: Wire up powerpc system call
  cpu_opv: Wire up ARM32 system call
  cpu_opv: Implement selftests (v2)
  Restartable sequences: Provide self-tests (v2)
  Restartable sequences selftests: arm: workaround gcc asm size guess
  membarrier: selftest: Test private expedited cmd (v2)
  membarrier: powerpc: Skip memory barrier in switch_mm() (v7)
  membarrier: Document scheduler barrier requirements (v5)
  membarrier: provide SHARED_EXPEDITED command
  membarrier: selftest: Test shared expedited cmd
  membarrier: Provide core serializing command
  x86: Introduce sync_core_before_usermode (v2)
  membarrier: x86: Provide core serializing command (v2)
  membarrier: selftest: Test private expedited sync core cmd
  membarrier: arm64: Provide core serializing command

 MAINTAINERS                                        |   21 +
 arch/Kconfig                                       |    7 +
 arch/arm/Kconfig                                   |    1 +
 arch/arm/kernel/signal.c                           |    7 +
 arch/arm/tools/syscall.tbl                         |    2 +
 arch/arm64/Kconfig                                 |    1 +
 arch/arm64/kernel/entry.S                          |    4 +
 arch/powerpc/Kconfig                               |    2 +
 arch/powerpc/include/asm/membarrier.h              |   26 +
 arch/powerpc/include/asm/systbl.h                  |    2 +
 arch/powerpc/include/asm/unistd.h                  |    2 +-
 arch/powerpc/include/uapi/asm/unistd.h             |    2 +
 arch/powerpc/kernel/signal.c                       |    3 +
 arch/powerpc/mm/mmu_context.c                      |    7 +
 arch/x86/Kconfig                                   |    3 +
 arch/x86/entry/common.c                            |    1 +
 arch/x86/entry/entry_32.S                          |    5 +
 arch/x86/entry/entry_64.S                          |    8 +
 arch/x86/entry/syscalls/syscall_32.tbl             |    2 +
 arch/x86/entry/syscalls/syscall_64.tbl             |    2 +
 arch/x86/include/asm/processor.h                   |   10 +
 arch/x86/kernel/signal.c                           |    6 +
 arch/x86/mm/tlb.c                                  |    6 +
 fs/exec.c                                          |    1 +
 include/linux/processor.h                          |    6 +
 include/linux/sched.h                              |   89 ++
 include/linux/sched/mm.h                           |   39 +-
 include/trace/events/rseq.h                        |   60 +
 include/uapi/linux/cpu_opv.h                       |  117 ++
 include/uapi/linux/membarrier.h                    |   66 +-
 include/uapi/linux/rseq.h                          |  138 +++
 init/Kconfig                                       |   37 +
 kernel/Makefile                                    |    2 +
 kernel/cpu_opv.c                                   |  968 +++++++++++++++
 kernel/fork.c                                      |    2 +
 kernel/rseq.c                                      |  328 +++++
 kernel/sched/core.c                                |   95 +-
 kernel/sched/membarrier.c                          |  169 ++-
 kernel/sched/sched.h                               |    2 +
 kernel/sys_ni.c                                    |    4 +
 tools/testing/selftests/Makefile                   |    2 +
 tools/testing/selftests/cpu-opv/.gitignore         |    1 +
 tools/testing/selftests/cpu-opv/Makefile           |   17 +
 .../testing/selftests/cpu-opv/basic_cpu_opv_test.c | 1157 ++++++++++++++++++
 tools/testing/selftests/cpu-opv/cpu-op.c           |  348 ++++++
 tools/testing/selftests/cpu-opv/cpu-op.h           |   68 ++
 tools/testing/selftests/lib.mk                     |    4 +
 .../testing/selftests/membarrier/membarrier_test.c |  235 +++-
 tools/testing/selftests/rseq/.gitignore            |    4 +
 tools/testing/selftests/rseq/Makefile              |   23 +
 .../testing/selftests/rseq/basic_percpu_ops_test.c |  333 +++++
 tools/testing/selftests/rseq/basic_test.c          |   55 +
 tools/testing/selftests/rseq/param_test.c          | 1285 ++++++++++++++++++++
 tools/testing/selftests/rseq/rseq-arm.h            |  568 +++++++++
 tools/testing/selftests/rseq/rseq-ppc.h            |  567 +++++++++
 tools/testing/selftests/rseq/rseq-x86.h            |  898 ++++++++++++++
 tools/testing/selftests/rseq/rseq.c                |  116 ++
 tools/testing/selftests/rseq/rseq.h                |  154 +++
 tools/testing/selftests/rseq/run_param_test.sh     |  124 ++
 59 files changed, 8149 insertions(+), 63 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h
 create mode 100644 include/trace/events/rseq.h
 create mode 100644 include/uapi/linux/cpu_opv.h
 create mode 100644 include/uapi/linux/rseq.h
 create mode 100644 kernel/cpu_opv.c
 create mode 100644 kernel/rseq.c
 create mode 100644 tools/testing/selftests/cpu-opv/.gitignore
 create mode 100644 tools/testing/selftests/cpu-opv/Makefile
 create mode 100644 tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
 create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.c
 create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.h
 create mode 100644 tools/testing/selftests/rseq/.gitignore
 create mode 100644 tools/testing/selftests/rseq/Makefile
 create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c
 create mode 100644 tools/testing/selftests/rseq/basic_test.c
 create mode 100644 tools/testing/selftests/rseq/param_test.c
 create mode 100644 tools/testing/selftests/rseq/rseq-arm.h
 create mode 100644 tools/testing/selftests/rseq/rseq-ppc.h
 create mode 100644 tools/testing/selftests/rseq/rseq-x86.h
 create mode 100644 tools/testing/selftests/rseq/rseq.c
 create mode 100644 tools/testing/selftests/rseq/rseq.h
 create mode 100755 tools/testing/selftests/rseq/run_param_test.sh

-- 
2.11.0

* [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
From: Mathieu Desnoyers @ 2017-11-14 20:03 UTC
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Alexander Viro

Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.

* Restartable sequences (per-cpu atomics)

Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.

The restartable critical sections (per-cpu atomics) work was started by
Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections [1] [2]. The re-implementation proposed here brings a
few simplifications to the ABI which facilitate porting to other
architectures and speed up the user-space fast path. A second system
call, cpu_opv(), is proposed as a fallback to deal with debugger
single-stepping. cpu_opv() executes a sequence of operations on behalf
of user-space with preemption disabled.

Here are benchmarks of various rseq use-cases.

Test hardware:

arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

The following benchmarks were all performed on a single thread.

* Per-CPU statistic counter increment

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                344.0                 31.4          11.0
x86-64:                15.3                  2.0           7.7
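
For reference, a minimal sketch of the kind of getcpu+atomic baseline
being compared against in the table above (the counter array and
function name are illustrative only, not code from this series):

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>

/* Illustrative per-cpu counter array, one slot per possible CPU. */
extern uint64_t percpu_counters[];

static void counter_inc_getcpu_atomic(void)
{
	int cpu = sched_getcpu();	/* vdso or system call */

	/* Atomic RMW keeps the slot consistent across migration. */
	__atomic_fetch_add(&percpu_counters[cpu], 1, __ATOMIC_RELAXED);
}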

* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
             per-cpu buffer

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:               2502.0                 2250.0         1.1
x86-64:               117.4                   98.0         1.2

* liburcu percpu: lock-unlock pair, dereference, read/compare word

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                751.0                 128.5          5.8
x86-64:                53.4                  28.6          1.9

* jemalloc memory allocator adapted to use rseq

Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):

The production workload sees a 1-2% improvement in average
response-time latency, and the P99 overall latency drops by 2-3%.

* Reading the current CPU number

Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
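
For illustration, a minimal user-space sketch of such a read, assuming
the thread has already registered its struct rseq TLS area with the
rseq system call and that the uapi header from this patch is installed
(the rseq_abi variable name and the sched_getcpu() fallback are
illustrative, not part of this series):

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <linux/rseq.h>

static __thread struct rseq rseq_abi = {
	.cpu_id = (uint32_t)-1,		/* -1 means "not registered yet" */
};

static inline int read_current_cpu(void)
{
	/* Single-copy atomic read of the kernel-maintained cpu_id field. */
	int32_t cpu = (int32_t)__atomic_load_n(&rseq_abi.cpu_id,
					       __ATOMIC_RELAXED);

	if (cpu < 0)			/* -1: unregistered, -2: failed */
		return sched_getcpu();	/* fall back to glibc/vdso/syscall */
	return cpu;
}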

Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over the mechanisms currently available to
read the current CPU number. It has the following benefits over
alternative approaches:

- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
  executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from inline
  assembly, which makes it a useful building block for restartable
  sequences.
- The approach of reading the cpu id through memory mapping shared
  between kernel and user-space is portable (e.g. ARM), which is not the
  case for the lsl-based x86 vdso.

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.

Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop):                                    8.4 ns
- Read CPU from rseq cpu_id:                               16.7 ns
- Read CPU from rseq cpu_id (lazy register):               19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
- getcpu system call:                                     234.9 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop):                                    0.8 ns
- Read CPU from rseq cpu_id:                                0.8 ns
- Read CPU from rseq cpu_id (lazy register):                0.8 ns
- Read using gs segment selector:                           0.8 ns
- "lsl" inline assembly:                                   13.0 ns
- glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
- getcpu system call:                                      53.9 ns

- Speed (benchmark taken on v8 of patchset)

Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:

Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.

* CONFIG_RSEQ=n

avg.:      41.37 s
std.dev.:   0.36 s

* CONFIG_RSEQ=y

avg.:      40.46 s
std.dev.:   0.33 s

- Size

On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.

On x86-64, between CONFIG_CPU_OPV=n/y, the text size increase of vmlinux is
5576 bytes, and the data size increase of vmlinux is 6164 bytes.

[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit-G8L5E6GV2z5XSTzz+wBt03oUN1GumTyQ7j82oEJ37pA@public.gmane.org
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit-tdHu5vqousHHt/MElyovVYaSKrA+ACpX0E9HWUfgJXw@public.gmane.org
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
CC: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
CC: Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
CC: Chris Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
CC: Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org>
CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
CC: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
CC: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
CC: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
CC: Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: Alexander Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---

Changes since v1:
- Return -1, errno=EINVAL if cpu_cache pointer is not aligned on
  sizeof(int32_t).
- Update man page to describe the pointer alignment requirements and
  update atomicity guarantees.
- Add MAINTAINERS file GETCPU_CACHE entry.
- Remove dynamic memory allocation: go back to having a single
  getcpu_cache entry per thread. Update documentation accordingly.
- Rebased on Linux 4.4.

Changes since v2:
- Introduce a "cmd" argument, along with an enum with GETCPU_CACHE_GET
  and GETCPU_CACHE_SET. Introduce a uapi header linux/getcpu_cache.h
  defining this enumeration.
- Split resume notifier architecture implementation from the system call
  wire up in the following arch-specific patches.
- Man pages updates.
- Handle 32-bit compat pointers.
- Simplify handling of getcpu_cache GETCPU_CACHE_SET compiler barrier:
  set the current cpu cache pointer before doing the cache update, and
  set it back to NULL if the update fails. Setting it back to NULL on
  error ensures that no resume notifier will trigger a SIGSEGV if a
  migration happened concurrently.

Changes since v3:
- Fix __user annotations in compat code,
- Update memory ordering comments.
- Rebased on kernel v4.5-rc5.

Changes since v4:
- Inline getcpu_cache_fork, getcpu_cache_execve, and getcpu_cache_exit.
- Add new line between if() and switch() to improve readability.
- Added sched switch benchmarks (hackbench) and size overhead comparison
  to change log.

Changes since v5:
- Rename "getcpu_cache" to "thread_local_abi", allowing this system
  call to be extended to cover future features such as restartable critical
  sections. Generalizing this system call ensures that we can add
  features similar to the cpu_id field within the same cache-line
  without having to track one pointer per feature within the task
  struct.
- Add a tlabi_nr parameter to the system call, allowing the ABI to be
  extended beyond the initial 64-byte structure by registering structures
  with tlabi_nr greater than 0. The initial ABI structure is associated
  with tlabi_nr 0.
- Rebased on kernel v4.5.

Changes since v6:
- Integrate "restartable sequences" v2 patchset from Paul Turner.
- Add handling of single-stepping purely in user-space, with a
  fallback to locking after 2 rseq failures to ensure progress, and
  by exposing a __rseq_table section to debuggers so they know where
  to put breakpoints when dealing with rseq assembly blocks which
  can be aborted at any point.
- make the code and ABI generic: porting the kernel implementation
  simply requires wiring up the signal handler and return-to-user-space
  hooks, and allocating the syscall number.
- extend testing with a fully configurable test program. See
  param_spinlock_test -h for details.
- handling of rseq ENOSYS in user-space, also with a fallback
  to locking.
- modify Paul Turner's rseq ABI to only require a single TLS store on
  the user-space fast-path, removing the need to populate two additional
  registers. This is made possible by introducing struct rseq_cs into
  the ABI to describe a critical section start_ip, post_commit_ip, and
  abort_ip.
- Rebased on kernel v4.7-rc7.

Changes since v7:
- Documentation updates.
- Integrated powerpc architecture support.
- Compare rseq critical section start_ip, which allows shrinking the
  user-space fast-path code size.
- Added Peter Zijlstra, Paul E. McKenney and Boqun Feng as
  co-maintainers.
- Added do_rseq2 and do_rseq_memcpy to test program helper library.
- Code cleanup based on review from Peter Zijlstra, Andy Lutomirski and
  Boqun Feng.
- Rebase on kernel v4.8-rc2.

Changes since v8:
- clear rseq_cs even if non-nested. Speeds up user-space fast path by
  removing the final "rseq_cs=NULL" assignment.
- add enum rseq_flags: critical sections and threads can set migration,
  preemption and signal "disable" flags to inhibit rseq behavior.
- rseq_event_counter needs to be updated with a pre-increment: otherwise
  it misses an increment after exec (when TLS and in-kernel states are
  initially 0).

Changes since v9:
- Update changelog.
- Fold instrumentation patch.
- check abort-ip signature: Add a signature before the abort-ip landing
  address. This signature is also received as a new parameter to the
  rseq system call. The kernel uses it to ensure that rseq cannot be used
  as an exploit vector to redirect execution to arbitrary code.
- Use rseq pointer for both register and unregister. This is more
  symmetric, and eventually allows supporting a linked list of rseq
  structs per thread if needed in the future.
- Unregistration of a rseq structure is now done with
  RSEQ_FLAG_UNREGISTER.
- Remove reference counting. Return "EBUSY" to the caller if rseq is
  already registered for the current thread. This simplifies
  implementation while still allowing user-space to perform lazy
  registration in multi-lib use-cases. (suggested by Ben Maurer)
- Clear rseq_cs upon unregister.
- Set cpu_id back to -1 on unregister, so rseq user libraries that
  expect to lazily re-register rseq after an unregister can do so.
- Document rseq_cs clear requirement: JIT should reset the rseq_cs
  pointer before reclaiming memory of rseq_cs structure.
- Introduce rseq_len syscall parameter and rseq_cs version field:
  keep track of the registered rseq struct length, and add rseq_cs
  version as its first field, to allow future extensions.
- Use offset and unsigned arithmetic to save a conditional branch when
  comparing the instruction pointer against a rseq_cs descriptor's
  address range, by expressing post_commit_ip as an offset from
  start_ip and using an unsigned integer comparison.
  Suggested by Ben Maurer.
- Remove event counter from ABI. Suggested by Andy Lutomirski.
- Add INIT_ONSTACK macro: Introduce the
  RSEQ_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users
  correctly initialize the upper bits of RSEQ_FIELD_u32_u64() on their
  stack to 0 on 32-bit architectures.
- Select MEMBARRIER: Allows user-space rseq fast-paths to use the value
  of cpu_id field (inherently required by the rseq algorithm) to figure
  out whether membarrier can be expected to be available.
  This effectively allows user-space fast-paths to remove extra
  comparisons and branch testing whether membarrier is enabled, and thus
  whether a full barrier is required (e.g. in userspace RCU
  implementation after rcu_read_lock/before rcu_read_unlock).
- Expose cpu_id_start field: Checking whether cpu_id < 0 in the C
  preparation part of the rseq fast-path brings significant overhead, at
  least on arm32. We can remove this extra comparison by exposing two
  distinct cpu_id fields in the rseq TLS:

  The field cpu_id_start always contains a *possible* cpu number,
  although it may not be the current one if, for instance, rseq is not
  initialized for the current thread. cpu_id_start is meant to be used
  in the C code for the pointer chasing to figure out which per-cpu
  data structure should be passed to the rseq asm sequence.

  The field cpu_id holds -1 when rseq is not initialized, and -2 when
  initialization failed. That field is used in the rseq asm sequence to
  confirm that the cpu_id_start value was indeed the current cpu number.
  It also ends up confirming that rseq is initialized for the current
  thread, because the values -1 and -2 will never match a possible cpu
  number read from cpu_id_start.

  This allows checking the current CPU number and rseq initialization
  state with a single comparison on the fast-path.
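
For illustration, a minimal sketch of why that single comparison
suffices (the helper below is hypothetical; in real use the
confirmation is done inside the rseq assembly block, not in C):

#include <stdbool.h>
#include <stdint.h>
#include <linux/rseq.h>

/*
 * The C preparation code snapshots cpu_id_start (always a possible CPU
 * number) and uses it for the per-cpu pointer chasing. The assembly
 * sequence then only compares cpu_id against that snapshot: the values
 * -1U (uninitialized) and -2U (initialization failed) can never equal
 * a possible CPU number, so a single test covers both "wrong CPU" and
 * "rseq not usable".
 */
static inline bool rseq_cpu_confirmed(const struct rseq *r, uint32_t snapshot)
{
	return __atomic_load_n(&r->cpu_id, __ATOMIC_RELAXED) == snapshot;
}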

Changes since v10:

- Update rseq.c comment, removing reference to event_counter.

Man page associated:

RSEQ(2)                Linux Programmer's Manual               RSEQ(2)

NAME
       rseq - Restartable sequences and cpu number cache

SYNOPSIS
       #include <linux/rseq.h>

       int rseq(struct rseq * rseq, uint32_t rseq_len, int flags, uint32_t sig);

DESCRIPTION
       The  rseq()  ABI  accelerates  user-space operations on per-cpu
       data by defining a shared data structure ABI between each user-
       space thread and the kernel.

       It  allows  user-space  to perform update operations on per-cpu
       data without requiring heavy-weight atomic operations.

       Restartable sequences are atomic with respect to preemption
       (making them atomic with respect to other threads running on the
       same CPU), as well as signal delivery (user-space execution
       contexts nested over the same thread).

       It is suited for update operations on per-cpu data.

       It can be used on data structures shared between threads within
       a process, and on data structures shared between threads across
       different processes.

       Some examples of operations that can be accelerated or improved
       by this ABI:

       · Memory allocator per-cpu free-lists,

       · Querying the current CPU number,

       · Incrementing per-CPU counters,

       · Modifying data protected by per-CPU spinlocks,

       · Inserting/removing elements in per-CPU linked-lists,

       · Writing/reading per-CPU ring buffers content.

       · Accurately reading performance monitoring unit counters  with
         respect to thread migration.

       The rseq argument is a pointer to the thread-local rseq struc‐
       ture to be shared between kernel and user-space. Unregistration
       is performed by passing the previously registered rseq pointer
       together with the RSEQ_FLAG_UNREGISTER flag.

       The layout of struct rseq is as follows:

       Structure alignment
              This structure is aligned on multiples of 32 bytes.

       Structure size
              This  structure  is  extensible.  Its  size is passed as
              parameter to the rseq system call.

       Fields

           cpu_id_start
              Optimistic cache of the CPU number on which the  current
              thread  is running. Its value is guaranteed to always be
              a possible CPU number, even when rseq  is  not  initial‐
              ized.  The  value it contains should always be confirmed
              by reading the cpu_id field.

           cpu_id
              Cache of the CPU number on which the current  thread  is
              running.  -1 if uninitialized.

           rseq_cs
              The rseq_cs field is a pointer to a struct rseq_cs. It
              is NULL when no rseq assembly block critical section is
              active for the current thread.  Setting it to point to a
              critical section descriptor (struct rseq_cs)  marks  the
              beginning of the critical section.

           flags
              Flags  indicating  the  restart behavior for the current
              thread. This is mainly used for debugging purposes.  Can
              be any of:

       ·      RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT

       ·      RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL

       ·      RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE

       The layout of struct rseq_cs version 0 is as follows:

       Structure alignment
              This structure is aligned on multiples of 32 bytes.

       Structure size
              This structure has a fixed size of 32 bytes.

       Fields

           version
              Version of this structure.

           flags
              Flags indicating the restart behavior of this structure.
              Can be any of:

       ·      RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT

       ·      RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL

       ·      RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE

           start_ip
              Instruction pointer address of the first instruction  of
              the sequence of consecutive assembly instructions.

           post_commit_offset
              Offset  (from start_ip address) of the address after the
              last instruction of the sequence of consecutive assembly
              instructions.

           abort_ip
              Instruction  pointer address where to move the execution
              flow in case of abort of  the  sequence  of  consecutive
              assembly instructions.

       The  rseq_len argument is the size of the struct rseq to regis‐
       ter.

       The flags argument is 0 for registration, and  RSEQ_FLAG_UNREG‐
       ISTER for unregistration.

       The  sig argument is the 32-bit signature to be expected before
       the abort handler code.

       A single library per process should keep the rseq structure  in
       a  thread-local  storage  variable.  The cpu_id field should be
       initialized to -1, and the cpu_id_start field  should  be  ini‐
       tialized to a possible CPU value (typically 0).

       Each  thread  is  responsible for registering and unregistering
       its rseq structure. No more than one rseq structure address can
       be registered per thread at a given time.

       In  a  typical  usage scenario, the thread registering the rseq
       structure will be performing  loads  and  stores  from/to  that
       structure.  It  is  however also allowed to read that structure
       from other threads.  The rseq field updates  performed  by  the
       kernel  provide  relaxed  atomicity  semantics, which guarantee
       that other threads performing relaxed atomic reads of  the  cpu
       number cache will always observe a consistent value.

RETURN VALUE
       A  return  value  of  0  indicates  success.  On  error,  -1 is
       returned, and errno is set appropriately.

ERRORS
       EINVAL Either flags contains an invalid value, or rseq contains
              an  address  which  is  not  appropriately  aligned,  or
              rseq_len contains a size that does not  match  the  size
              received on registration.

       ENOSYS The  rseq()  system call is not implemented by this ker‐
              nel.

       EFAULT rseq is an invalid address.

       EBUSY  Restartable sequence  is  already  registered  for  this
              thread.

       EPERM  The  sig  argument  on unregistration does not match the
              signature received on registration.

VERSIONS
       The rseq() system call was added in Linux 4.X (TODO).

CONFORMING TO
       rseq() is Linux-specific.

SEE ALSO
       sched_getcpu(3)

Linux                         2017-11-06                       RSEQ(2)
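
To make the registration flow concrete, here is a minimal sketch of
per-thread registration and unregistration. It assumes the __NR_rseq
syscall number wired up by later patches in this series and an
application-chosen signature; the RSEQ_SIG value and helper names
below are illustrative only (glibc provides no wrapper for this call):

#define _GNU_SOURCE
#include <linux/rseq.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#define RSEQ_SIG	0x53053053	/* example application-chosen signature */

static __thread struct rseq rseq_abi = {
	.cpu_id = (uint32_t)-1,		/* -1 means "not registered" */
};

/* Register the calling thread's rseq TLS area. Returns 0 on success. */
static int rseq_register_current_thread(void)
{
	return syscall(__NR_rseq, &rseq_abi, sizeof(rseq_abi), 0, RSEQ_SIG);
}

/* Unregister: same pointer, length and signature, plus the unregister flag. */
static int rseq_unregister_current_thread(void)
{
	return syscall(__NR_rseq, &rseq_abi, sizeof(rseq_abi),
		       RSEQ_FLAG_UNREGISTER, RSEQ_SIG);
}
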
---
 MAINTAINERS                 |  11 ++
 arch/Kconfig                |   7 +
 fs/exec.c                   |   1 +
 include/linux/sched.h       |  89 ++++++++++++
 include/trace/events/rseq.h |  60 ++++++++
 include/uapi/linux/rseq.h   | 138 +++++++++++++++++++
 init/Kconfig                |  14 ++
 kernel/Makefile             |   1 +
 kernel/fork.c               |   2 +
 kernel/rseq.c               | 328 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c         |   4 +
 kernel/sys_ni.c             |   3 +
 12 files changed, 658 insertions(+)
 create mode 100644 include/trace/events/rseq.h
 create mode 100644 include/uapi/linux/rseq.h
 create mode 100644 kernel/rseq.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 2811a211632c..c9f95f8b07ed 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11497,6 +11497,17 @@ F:	include/dt-bindings/reset/
 F:	include/linux/reset.h
 F:	include/linux/reset-controller.h
 
+RESTARTABLE SEQUENCES SUPPORT
+M:	Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
+M:	Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
+M:	"Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
+M:	Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
+L:	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+S:	Supported
+F:	kernel/rseq.c
+F:	include/uapi/linux/rseq.h
+F:	include/trace/events/rseq.h
+
 RFKILL
 M:	Johannes Berg <johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>
 L:	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
diff --git a/arch/Kconfig b/arch/Kconfig
index 057370a0ac4e..b5e7f977fc29 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -258,6 +258,13 @@ config HAVE_REGS_AND_STACK_ACCESS_API
 	  declared in asm/ptrace.h
 	  For example the kprobes-based event tracer needs this API.
 
+config HAVE_RSEQ
+	bool
+	depends on HAVE_REGS_AND_STACK_ACCESS_API
+	help
+	  This symbol should be selected by an architecture if it
+	  supports an implementation of restartable sequences.
+
 config HAVE_CLK
 	bool
 	help
diff --git a/fs/exec.c b/fs/exec.c
index 3e14ba25f678..3faf8ff0fc6d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1803,6 +1803,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	membarrier_execve(current);
+	rseq_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current);
 	free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fdf74f27acf1..b995a3b5bfc4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -27,6 +27,7 @@
 #include <linux/signal_types.h>
 #include <linux/mm_types_task.h>
 #include <linux/task_io_accounting.h>
+#include <linux/rseq.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -977,6 +978,13 @@ struct task_struct {
 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_RSEQ
+	struct rseq __user *rseq;
+	u32 rseq_len;
+	u32 rseq_sig;
+	bool rseq_preempt, rseq_signal, rseq_migrate;
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	struct rcu_head			rcu;
@@ -1667,4 +1675,85 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 #define TASK_SIZE_OF(tsk)	TASK_SIZE
 #endif
 
+#ifdef CONFIG_RSEQ
+static inline void rseq_set_notify_resume(struct task_struct *t)
+{
+	if (t->rseq)
+		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+void __rseq_handle_notify_resume(struct pt_regs *regs);
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+{
+	if (current->rseq)
+		__rseq_handle_notify_resume(regs);
+}
+/*
+ * If parent process has a registered restartable sequences area, the
+ * child inherits. Only applies when forking a process, not a thread. In
+ * case a parent fork() in the middle of a restartable sequence, set the
+ * resume notifier to force the child to retry.
+ */
+static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
+{
+	if (clone_flags & CLONE_THREAD) {
+		t->rseq = NULL;
+		t->rseq_len = 0;
+		t->rseq_sig = 0;
+	} else {
+		t->rseq = current->rseq;
+		t->rseq_len = current->rseq_len;
+		t->rseq_sig = current->rseq_sig;
+		rseq_set_notify_resume(t);
+	}
+}
+static inline void rseq_execve(struct task_struct *t)
+{
+	t->rseq = NULL;
+	t->rseq_len = 0;
+	t->rseq_sig = 0;
+}
+static inline void rseq_sched_out(struct task_struct *t)
+{
+	rseq_set_notify_resume(t);
+}
+static inline void rseq_signal_deliver(struct pt_regs *regs)
+{
+	current->rseq_signal = true;
+	rseq_handle_notify_resume(regs);
+}
+static inline void rseq_preempt(struct task_struct *t)
+{
+	t->rseq_preempt = true;
+}
+static inline void rseq_migrate(struct task_struct *t)
+{
+	t->rseq_migrate = true;
+}
+#else
+static inline void rseq_set_notify_resume(struct task_struct *t)
+{
+}
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+{
+}
+static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
+{
+}
+static inline void rseq_execve(struct task_struct *t)
+{
+}
+static inline void rseq_sched_out(struct task_struct *t)
+{
+}
+static inline void rseq_signal_deliver(struct pt_regs *regs)
+{
+}
+static inline void rseq_preempt(struct task_struct *t)
+{
+}
+static inline void rseq_migrate(struct task_struct *t)
+{
+}
+#endif
+
 #endif
diff --git a/include/trace/events/rseq.h b/include/trace/events/rseq.h
new file mode 100644
index 000000000000..4d30d77c86b4
--- /dev/null
+++ b/include/trace/events/rseq.h
@@ -0,0 +1,60 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rseq
+
+#if !defined(_TRACE_RSEQ_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_RSEQ_H
+
+#include <linux/tracepoint.h>
+#include <linux/types.h>
+
+TRACE_EVENT(rseq_update,
+
+	TP_PROTO(struct task_struct *t),
+
+	TP_ARGS(t),
+
+	TP_STRUCT__entry(
+		__field(s32, cpu_id)
+	),
+
+	TP_fast_assign(
+		__entry->cpu_id = raw_smp_processor_id();
+	),
+
+	TP_printk("cpu_id=%d", __entry->cpu_id)
+);
+
+TRACE_EVENT(rseq_ip_fixup,
+
+	TP_PROTO(void __user *regs_ip, void __user *start_ip,
+		unsigned long post_commit_offset, void __user *abort_ip,
+		int ret),
+
+	TP_ARGS(regs_ip, start_ip, post_commit_offset, abort_ip, ret),
+
+	TP_STRUCT__entry(
+		__field(void __user *, regs_ip)
+		__field(void __user *, start_ip)
+		__field(unsigned long, post_commit_offset)
+		__field(void __user *, abort_ip)
+		__field(int, ret)
+	),
+
+	TP_fast_assign(
+		__entry->regs_ip = regs_ip;
+		__entry->start_ip = start_ip;
+		__entry->post_commit_offset = post_commit_offset;
+		__entry->abort_ip = abort_ip;
+		__entry->ret = ret;
+	),
+
+	TP_printk("regs_ip=%p start_ip=%p post_commit_offset=%lu abort_ip=%p ret=%d",
+		__entry->regs_ip, __entry->start_ip,
+		__entry->post_commit_offset, __entry->abort_ip,
+		__entry->ret)
+);
+
+#endif /* _TRACE_RSEQ_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
new file mode 100644
index 000000000000..28ee2ebd3dae
--- /dev/null
+++ b/include/uapi/linux/rseq.h
@@ -0,0 +1,138 @@
+#ifndef _UAPI_LINUX_RSEQ_H
+#define _UAPI_LINUX_RSEQ_H
+
+/*
+ * linux/rseq.h
+ *
+ * Restartable sequences system call API
+ *
+ * Copyright (c) 2015-2016 Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifdef __KERNEL__
+# include <linux/types.h>
+#else	/* #ifdef __KERNEL__ */
+# include <stdint.h>
+#endif	/* #else #ifdef __KERNEL__ */
+
+#include <asm/byteorder.h>
+
+#ifdef __LP64__
+# define RSEQ_FIELD_u32_u64(field)			uint64_t field
+# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)	field = (intptr_t)v
+#elif defined(__BYTE_ORDER) ? \
+	__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+# define RSEQ_FIELD_u32_u64(field)	uint32_t field ## _padding, field
+# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
+	field ## _padding = 0, field = (intptr_t)v
+#else
+# define RSEQ_FIELD_u32_u64(field)	uint32_t field, field ## _padding
+# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
+	field = (intptr_t)v, field ## _padding = 0
+#endif
+
+enum rseq_flags {
+	RSEQ_FLAG_UNREGISTER = (1 << 0),
+};
+
+enum rseq_cs_flags {
+	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT	= (1U << 0),
+	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL	= (1U << 1),
+	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	= (1U << 2),
+};
+
+/*
+ * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
+ * contained within a single cache-line. It is usually declared as
+ * link-time constant data.
+ */
+struct rseq_cs {
+	uint32_t version;	/* Version of this structure. */
+	uint32_t flags;		/* enum rseq_cs_flags */
+	RSEQ_FIELD_u32_u64(start_ip);
+	RSEQ_FIELD_u32_u64(post_commit_offset);	/* From start_ip */
+	RSEQ_FIELD_u32_u64(abort_ip);
+} __attribute__((aligned(4 * sizeof(uint64_t))));
+
+/*
+ * struct rseq is aligned on 4 * 8 bytes to ensure it is always
+ * contained within a single cache-line.
+ *
+ * A single struct rseq per thread is allowed.
+ */
+struct rseq {
+	/*
+	 * Restartable sequences cpu_id_start field. Updated by the
+	 * kernel, and read by user-space with single-copy atomicity
+	 * semantics. Aligned on 32-bit. Always contains a value in the
+	 * range of possible CPUs, although the value may not be the
+	 * actual current CPU (e.g. if rseq is not initialized). This
+	 * CPU number value should always be confirmed against the value
+	 * of the cpu_id field.
+	 */
+	uint32_t cpu_id_start;
+	/*
+	 * Restartable sequences cpu_id field. Updated by the kernel,
+	 * and read by user-space with single-copy atomicity semantics.
+	 * Aligned on 32-bit. Values -1U and -2U have a special
+	 * semantic: -1U means "rseq uninitialized", and -2U means "rseq
+	 * initialization failed".
+	 */
+	uint32_t cpu_id;
+	/*
+	 * Restartable sequences rseq_cs field.
+	 *
+	 * Contains NULL when no critical section is active for the current
+	 * thread, or holds a pointer to the currently active struct rseq_cs.
+	 *
+	 * Updated by user-space at the beginning of assembly instruction
+	 * sequence block, and by the kernel when it restarts an assembly
+	 * instruction sequence block, and when the kernel detects that it
+	 * is preempting or delivering a signal outside of the range
+	 * targeted by the rseq_cs. Also needs to be cleared by user-space
+	 * before reclaiming memory that contains the targeted struct
+	 * rseq_cs.
+	 *
+	 * Read and set by the kernel with single-copy atomicity semantics.
+	 * Aligned on 64-bit.
+	 */
+	RSEQ_FIELD_u32_u64(rseq_cs);
+	/*
+	 * - RSEQ_DISABLE flag:
+	 *
+	 * Fallback fast-track flag for single-stepping.
+	 * Set by user-space if lack of progress is detected.
+	 * Cleared by user-space after rseq finish.
+	 * Read by the kernel.
+	 * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+	 *     Inhibit instruction sequence block restart and event
+	 *     counter increment on preemption for this thread.
+	 * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+	 *     Inhibit instruction sequence block restart and event
+	 *     counter increment on signal delivery for this thread.
+	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+	 *     Inhibit instruction sequence block restart and event
+	 *     counter increment on migration for this thread.
+	 */
+	uint32_t flags;
+} __attribute__((aligned(4 * sizeof(uint64_t))));
+
+#endif /* _UAPI_LINUX_RSEQ_H */
diff --git a/init/Kconfig b/init/Kconfig
index 3c1faaa2af4a..cbedfb91b40a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1400,6 +1400,20 @@ config MEMBARRIER
 
 	  If unsure, say Y.
 
+config RSEQ
+	bool "Enable rseq() system call" if EXPERT
+	default y
+	depends on HAVE_RSEQ
+	select MEMBARRIER
+	help
+	  Enable the restartable sequences system call. It provides a
+	  user-space cache for the current CPU number value, which
+	  speeds up getting the current CPU number from user-space,
+	  as well as an ABI to speed up user-space operations on
+	  per-CPU data.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 172d151d429c..3574669dafd9 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -112,6 +112,7 @@ obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_RSEQ) += rseq.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 07cc743698d3..1f3c25e28742 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1862,6 +1862,8 @@ static __latent_entropy struct task_struct *copy_process(
 	 */
 	copy_seccomp(p);
 
+	rseq_fork(p, clone_flags);
+
 	/*
 	 * Process group and session signals need to be delivered to just the
 	 * parent before the fork or both the parent and the child after the
diff --git a/kernel/rseq.c b/kernel/rseq.c
new file mode 100644
index 000000000000..6f0d48c2c084
--- /dev/null
+++ b/kernel/rseq.c
@@ -0,0 +1,328 @@
+/*
+ * Restartable sequences system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) 2015, Google, Inc.,
+ * Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> and Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
+ * Copyright (C) 2015-2016, EfficiOS Inc.,
+ * Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/rseq.h>
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/rseq.h>
+
+/*
+ *
+ * Restartable sequences are a lightweight interface that allows
+ * user-level code to be executed atomically relative to scheduler
+ * preemption and signal delivery. Typically used for implementing
+ * per-cpu operations.
+ *
+ * It allows user-space to perform update operations on per-cpu data
+ * without requiring heavy-weight atomic operations.
+ *
+ * Detailed algorithm of rseq user-space assembly sequences:
+ *
+ *   Steps [1]-[3] (inclusive) need to be a sequence of instructions in
+ *   userspace that can handle being moved to the abort_ip between any
+ *   of those instructions.
+ *
+ *   The abort_ip address needs to be less than start_ip, or
+ *   greater-or-equal the post_commit_ip. Step [5] and the failure
+ *   code step [F1] need to be at addresses lesser than start_ip, or
+ *   greater-or-equal the post_commit_ip.
+ *
+ *       [start_ip]
+ *   1.  Userspace stores the address of the struct rseq_cs assembly
+ *       block descriptor into the rseq_cs field of the registered
+ *       struct rseq TLS area. This update is performed through a single
+ *       store, followed by a compiler barrier which prevents the
+ *       compiler from moving following loads or stores before this
+ *       store.
+ *
+ *   2.  Userspace tests to see whether the current cpu_id field
+ *       matches the cpu number loaded before start_ip, manually jumping
+ *       to [F1] in case of a mismatch.
+ *
+ *       Note that if we are preempted or interrupted by a signal
+ *       after [1] and before post_commit_ip, then the kernel
+ *       clears the rseq_cs field of struct rseq, then jumps us to
+ *       abort_ip.
+ *
+ *   3.  Userspace critical section final instruction before
+ *       post_commit_ip is the commit. The critical section is
+ *       self-terminating.
+ *       [post_commit_ip]
+ *
+ *   4.  success
+ *
+ *   On failure at [2]:
+ *
+ *       [abort_ip]
+ *   F1. goto failure label
+ */
+
+static bool rseq_update_cpu_id(struct task_struct *t)
+{
+	uint32_t cpu_id = raw_smp_processor_id();
+
+	if (__put_user(cpu_id, &t->rseq->cpu_id_start))
+		return false;
+	if (__put_user(cpu_id, &t->rseq->cpu_id))
+		return false;
+	trace_rseq_update(t);
+	return true;
+}
+
+static bool rseq_reset_rseq_cpu_id(struct task_struct *t)
+{
+	uint32_t cpu_id_start = 0, cpu_id = -1U;
+
+	/*
+	 * Reset cpu_id_start to its initial state (0).
+	 */
+	if (__put_user(cpu_id_start, &t->rseq->cpu_id_start))
+		return false;
+	/*
+	 * Reset cpu_id to -1U, so any user coming in after unregistration can
+	 * figure out that rseq needs to be registered again.
+	 */
+	if (__put_user(cpu_id, &t->rseq->cpu_id))
+		return false;
+	return true;
+}
+
+static bool rseq_get_rseq_cs(struct task_struct *t,
+		void __user **start_ip,
+		unsigned long *post_commit_offset,
+		void __user **abort_ip,
+		uint32_t *cs_flags)
+{
+	unsigned long ptr;
+	struct rseq_cs __user *urseq_cs;
+	struct rseq_cs rseq_cs;
+	u32 __user *usig;
+	u32 sig;
+
+	if (__get_user(ptr, &t->rseq->rseq_cs))
+		return false;
+	if (!ptr)
+		return true;
+	urseq_cs = (struct rseq_cs __user *)ptr;
+	if (copy_from_user(&rseq_cs, urseq_cs, sizeof(rseq_cs)))
+		return false;
+	/*
+	 * We need to clear rseq_cs upon entry into a signal handler
+	 * nested on top of a rseq assembly block, so the signal handler
+	 * will not be fixed up if itself interrupted by a nested signal
+	 * handler or preempted.  We also need to clear rseq_cs if we
+	 * preempt or deliver a signal on top of code outside of the
+	 * rseq assembly block, to ensure that a following preemption or
+	 * signal delivery will not try to perform a fixup needlessly.
+	 */
+	if (clear_user(&t->rseq->rseq_cs, sizeof(t->rseq->rseq_cs)))
+		return false;
+	if (rseq_cs.version > 0)
+		return false;
+	*cs_flags = rseq_cs.flags;
+	*start_ip = (void __user *)rseq_cs.start_ip;
+	*post_commit_offset = (unsigned long)rseq_cs.post_commit_offset;
+	*abort_ip = (void __user *)rseq_cs.abort_ip;
+	usig = (u32 __user *)(rseq_cs.abort_ip - sizeof(u32));
+	if (get_user(sig, usig))
+		return false;
+	if (current->rseq_sig != sig) {
+		printk_ratelimited(KERN_WARNING
+			"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
+			sig, current->rseq_sig, current->pid, usig);
+		return false;
+	}
+	return true;
+}
+
+static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags)
+{
+	bool need_restart = false;
+	uint32_t flags;
+
+	/* Get thread flags. */
+	if (__get_user(flags, &t->rseq->flags))
+		return -EFAULT;
+
+	/* Take into account critical section flags. */
+	flags |= cs_flags;
+
+	/*
+	 * Restart on signal can only be inhibited when restart on
+	 * preempt and restart on migrate are inhibited too. Otherwise,
+	 * a preempted signal handler could fail to restart the prior
+	 * execution context on sigreturn.
+	 */
+	if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) {
+		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
+			return -EINVAL;
+		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
+			return -EINVAL;
+	}
+	if (t->rseq_migrate
+			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
+		need_restart = true;
+	else if (t->rseq_preempt
+			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
+		need_restart = true;
+	else if (t->rseq_signal
+			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL))
+		need_restart = true;
+
+	t->rseq_preempt = false;
+	t->rseq_signal = false;
+	t->rseq_migrate = false;
+	if (need_restart)
+		return 1;
+	return 0;
+}
+
+static int rseq_ip_fixup(struct pt_regs *regs)
+{
+	struct task_struct *t = current;
+	void __user *start_ip = NULL;
+	unsigned long post_commit_offset = 0;
+	void __user *abort_ip = NULL;
+	uint32_t cs_flags = 0;
+	int ret;
+
+	ret = rseq_get_rseq_cs(t, &start_ip, &post_commit_offset, &abort_ip,
+			&cs_flags);
+	trace_rseq_ip_fixup((void __user *)instruction_pointer(regs),
+		start_ip, post_commit_offset, abort_ip, ret);
+	if (!ret)
+		return -EFAULT;
+
+	ret = rseq_need_restart(t, cs_flags);
+	if (ret < 0)
+		return -EFAULT;
+	if (!ret)
+		return 0;
+	/*
+	 * Handle potentially not being within a critical section.
+	 * Unsigned comparison will be true when
+	 * ip < start_ip (wrap-around to large values), and when
+	 * ip >= start_ip + post_commit_offset.
+	 */
+	if ((unsigned long)instruction_pointer(regs) - (unsigned long)start_ip
+			>= post_commit_offset)
+		return 1;
+
+	instruction_pointer_set(regs, (unsigned long)abort_ip);
+	return 1;
+}
+
+/*
+ * This resume handler should always be executed between any of:
+ * - preemption,
+ * - signal delivery,
+ * and return to user-space.
+ *
+ * This is how we can ensure that the entire rseq critical section,
+ * consisting of both the C part and the assembly instruction sequence,
+ * will issue the commit instruction only if executed atomically with
+ * respect to other threads scheduled on the same CPU, and with respect
+ * to signal handlers.
+ */
+void __rseq_handle_notify_resume(struct pt_regs *regs)
+{
+	struct task_struct *t = current;
+	int ret;
+
+	if (unlikely(t->flags & PF_EXITING))
+		return;
+	if (unlikely(!access_ok(VERIFY_WRITE, t->rseq, sizeof(*t->rseq))))
+		goto error;
+	ret = rseq_ip_fixup(regs);
+	if (unlikely(ret < 0))
+		goto error;
+	if (unlikely(!rseq_update_cpu_id(t)))
+		goto error;
+	return;
+
+error:
+	force_sig(SIGSEGV, t);
+}
+
+/*
+ * sys_rseq - setup restartable sequences for caller thread.
+ */
+SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, uint32_t, rseq_len,
+		int, flags, uint32_t, sig)
+{
+	if (flags & RSEQ_FLAG_UNREGISTER) {
+		/* Unregister rseq for current thread. */
+		if (current->rseq != rseq || !current->rseq)
+			return -EINVAL;
+		if (current->rseq_len != rseq_len)
+			return -EINVAL;
+		if (current->rseq_sig != sig)
+			return -EPERM;
+		if (!rseq_reset_rseq_cpu_id(current))
+			return -EFAULT;
+		current->rseq = NULL;
+		current->rseq_len = 0;
+		current->rseq_sig = 0;
+		return 0;
+	}
+
+	if (unlikely(flags))
+		return -EINVAL;
+
+	if (current->rseq) {
+		/*
+		 * If rseq is already registered, check whether
+		 * the provided address differs from the prior
+		 * one.
+		 */
+		if (current->rseq != rseq
+				|| current->rseq_len != rseq_len)
+			return -EINVAL;
+		if (current->rseq_sig != sig)
+			return -EPERM;
+		return -EBUSY;	/* Already registered. */
+	} else {
+		/*
+		 * If there was no rseq previously registered,
+		 * we need to ensure the provided rseq is
+		 * properly aligned and valid.
+		 */
+		if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq))
+				|| rseq_len != sizeof(*rseq))
+			return -EINVAL;
+		if (!access_ok(VERIFY_WRITE, rseq, rseq_len))
+			return -EFAULT;
+		current->rseq = rseq;
+		current->rseq_len = rseq_len;
+		current->rseq_sig = sig;
+		/*
+		 * If rseq was previously inactive, and has just been
+		 * registered, ensure the cpu_id_start and cpu_id fields
+		 * are updated before returning to user-space.
+		 */
+		rseq_set_notify_resume(current);
+	}
+
+	return 0;
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..6bba05f47e51 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1179,6 +1179,8 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	WARN_ON_ONCE(!cpu_online(new_cpu));
 #endif
 
+	rseq_migrate(p);
+
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
@@ -2581,6 +2583,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 {
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
+	rseq_sched_out(prev);
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
@@ -3341,6 +3344,7 @@ static void __sched notrace __schedule(bool preempt)
 	clear_preempt_need_resched();
 
 	if (likely(prev != next)) {
+		rseq_preempt(prev);
 		rq->nr_switches++;
 		rq->curr = next;
 		/*
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index b5189762d275..bfa1ee1bf669 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -259,3 +259,6 @@ cond_syscall(sys_membarrier);
 cond_syscall(sys_pkey_mprotect);
 cond_syscall(sys_pkey_alloc);
 cond_syscall(sys_pkey_free);
+
+/* restartable sequence */
+cond_syscall(sys_rseq);
-- 
2.11.0

* [RFC PATCH for 4.15 02/24] Restartable sequences: ARM 32 architecture support
From: Mathieu Desnoyers @ 2017-11-14 20:03 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers

Call the rseq_handle_notify_resume() function on return to
userspace if the TIF_NOTIFY_RESUME thread flag is set.

Increment the event counter and perform fixup on the pre-signal frame
when a signal is delivered on top of a restartable sequence critical
section.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
CC: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
CC: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
CC: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
CC: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
CC: Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
CC: Chris Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org>
CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
CC: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 arch/arm/Kconfig         | 1 +
 arch/arm/kernel/signal.c | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index d1346a160760..1469f3f39475 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -87,6 +87,7 @@ config ARM
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_RCU_TABLE_FREE if (SMP && ARM_LPAE)
 	select HAVE_REGS_AND_STACK_ACCESS_API
+	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UID16
 	select HAVE_VIRT_CPU_ACCOUNTING_GEN
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index b67ae12503f3..cc3260f475b0 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -518,6 +518,12 @@ static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 	int ret;
 
 	/*
+	 * Increment event counter and perform fixup for the pre-signal
+	 * frame.
+	 */
+	rseq_signal_deliver(regs);
+
+	/*
 	 * Set up the stack frame
 	 */
 	if (ksig->ka.sa.sa_flags & SA_SIGINFO)
@@ -637,6 +643,7 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 			} else {
 				clear_thread_flag(TIF_NOTIFY_RESUME);
 				tracehook_notify_resume(regs);
+				rseq_handle_notify_resume(regs);
 			}
 		}
 		local_irq_disable();
-- 
2.11.0


* [RFC PATCH for 4.15 03/24] Restartable sequences: wire up ARM 32 system call
  2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
@ 2017-11-14 20:03 ` Mathieu Desnoyers
       [not found] ` <20171114200414.2188-1-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:03 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers

Wire up the rseq system call on 32-bit ARM.

This provides an ABI improving the speed of a user-space getcpu
operation on ARM by skipping the getcpu system call on the fast path, as
well as improving the speed of user-space operations on per-cpu data
compared to using load-linked/store-conditional.
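
For illustration only, here is a minimal user-space sketch of such a
fast-path getcpu (not part of this patch). It assumes the uapi struct
rseq from this series is installed as <linux/rseq.h>, uses the syscall
number 398 wired up below, and picks an arbitrary signature value:

  #include <linux/rseq.h>     /* struct rseq from this series (assumed installed) */
  #include <sys/syscall.h>
  #include <unistd.h>

  #define MY_RSEQ_SIG 0x53053053  /* arbitrary application-chosen signature */

  static __thread volatile struct rseq rseq_abi;

  /* Register the per-thread rseq area; done once per thread. */
  static int my_rseq_register(void)
  {
          return syscall(398 /* __NR_rseq, wired up below */,
                         &rseq_abi, sizeof(rseq_abi), 0, MY_RSEQ_SIG);
  }

  /* Fast-path getcpu: a plain load from the rseq area, no system call. */
  static int my_getcpu(void)
  {
          return rseq_abi.cpu_id;   /* updated by the kernel on migration */
  }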

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/arm/tools/syscall.tbl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 0bb0e9c6376c..fbc74b5fa3ed 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -412,3 +412,4 @@
 395	common	pkey_alloc		sys_pkey_alloc
 396	common	pkey_free		sys_pkey_free
 397	common	statx			sys_statx
+398	common	rseq			sys_rseq
-- 
2.11.0


* [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support
       [not found] ` <20171114200414.2188-1-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-11-14 20:03   ` [RFC PATCH v11 for 4.15 01/24] Restartable sequences " Mathieu Desnoyers
  2017-11-14 20:03   ` [RFC PATCH for 4.15 02/24] Restartable sequences: ARM 32 architecture support Mathieu Desnoyers
@ 2017-11-14 20:03   ` Mathieu Desnoyers
       [not found]     ` <20171114200414.2188-5-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-11-14 20:03   ` [RFC PATCH for 4.15 05/24] Restartable sequences: wire up x86 32/64 system call Mathieu Desnoyers
                     ` (8 subsequent siblings)
  11 siblings, 1 reply; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:03 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers

Call the rseq_handle_notify_resume() function on return to userspace if
the TIF_NOTIFY_RESUME thread flag is set.

Increment the event counter and perform fixup on the pre-signal frame
when a signal is delivered on top of a restartable sequence critical
section.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
CC: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
CC: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
CC: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
CC: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
CC: Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
CC: Chris Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
CC: Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org>
CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
CC: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 arch/x86/Kconfig         | 1 +
 arch/x86/entry/common.c  | 1 +
 arch/x86/kernel/signal.c | 6 ++++++
 3 files changed, 8 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2fdb23313dd5..01f78c1d40b5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -173,6 +173,7 @@ config X86
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE		if X86_64 && FRAME_POINTER_UNWINDER && STACK_VALIDATION
 	select HAVE_STACK_VALIDATION		if X86_64
+	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
 	select HAVE_USER_RETURN_NOTIFIER
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 03505ffbe1b6..4c717bdd1139 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -160,6 +160,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_NOTIFY_RESUME) {
 			clear_thread_flag(TIF_NOTIFY_RESUME);
 			tracehook_notify_resume(regs);
+			rseq_handle_notify_resume(regs);
 		}
 
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index b9e00e8f1c9b..991017d26d00 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -687,6 +687,12 @@ setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 	sigset_t *set = sigmask_to_save();
 	compat_sigset_t *cset = (compat_sigset_t *) set;
 
+	/*
+	 * Increment event counter and perform fixup for the pre-signal
+	 * frame.
+	 */
+	rseq_signal_deliver(regs);
+
 	/* Set up the stack frame */
 	if (is_ia32_frame(ksig)) {
 		if (ksig->ka.sa.sa_flags & SA_SIGINFO)
-- 
2.11.0


* [RFC PATCH for 4.15 05/24] Restartable sequences: wire up x86 32/64 system call
       [not found] ` <20171114200414.2188-1-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2017-11-14 20:03   ` [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 " Mathieu Desnoyers
@ 2017-11-14 20:03   ` Mathieu Desnoyers
  2017-11-14 20:03   ` [RFC PATCH for 4.15 06/24] Restartable sequences: powerpc architecture support Mathieu Desnoyers
                     ` (7 subsequent siblings)
  11 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:03 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers

Wire up the rseq system call on x86 32/64.

This provides an ABI improving the speed of a user-space getcpu
operation on x86 by removing the need to perform a function call, "lsl"
instruction, or system call on the fast path, as well as improving the
speed of user-space operations on per-cpu data.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
CC: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
CC: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
CC: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
CC: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
CC: Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
CC: Chris Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
CC: Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org>
CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
CC: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 448ac2161112..ba43ee75e425 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
 382	i386	pkey_free		sys_pkey_free
 383	i386	statx			sys_statx
 384	i386	arch_prctl		sys_arch_prctl			compat_sys_arch_prctl
+385	i386	rseq			sys_rseq
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..3ad03495bbb9 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330	common	pkey_alloc		sys_pkey_alloc
 331	common	pkey_free		sys_pkey_free
 332	common	statx			sys_statx
+333	common	rseq			sys_rseq
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.11.0


* [RFC PATCH for 4.15 06/24] Restartable sequences: powerpc architecture support
       [not found] ` <20171114200414.2188-1-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2017-11-14 20:03   ` [RFC PATCH for 4.15 05/24] Restartable sequences: wire up x86 32/64 system call Mathieu Desnoyers
@ 2017-11-14 20:03   ` Mathieu Desnoyers
  2017-11-14 20:03   ` [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call Mathieu Desnoyers
                     ` (6 subsequent siblings)
  11 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:03 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers,
	Benjamin Herrenschmidt

From: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Call the rseq_handle_notify_resume() function on return to userspace if
the TIF_NOTIFY_RESUME thread flag is set.

Increment the event counter and perform fixup on the pre-signal frame
when a signal is delivered on top of a restartable sequence critical
section.

Signed-off-by: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
CC: Benjamin Herrenschmidt <benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>
CC: Paul Mackerras <paulus-eUNUBHrolfbYtjvyW6yDsg@public.gmane.org>
CC: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
CC: linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ@public.gmane.org
---
 arch/powerpc/Kconfig         | 1 +
 arch/powerpc/kernel/signal.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index cb782ac1c35d..41d1dae3b1b5 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -219,6 +219,7 @@ config PPC
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_VIRT_CPU_ACCOUNTING
 	select HAVE_IRQ_TIME_ACCOUNTING
+	select HAVE_RSEQ
 	select IRQ_DOMAIN
 	select IRQ_FORCED_THREADING
 	select MODULES_USE_ELF_RELA
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index e9436c5e1e09..17a994b801b1 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
 	/* Re-enable the breakpoints for the signal stack */
 	thread_change_pc(tsk, tsk->thread.regs);
 
+	rseq_signal_deliver(tsk->thread.regs);
+
 	if (is32) {
         	if (ksig.ka.sa.sa_flags & SA_SIGINFO)
 			ret = handle_rt_signal32(&ksig, oldset, tsk);
@@ -161,6 +163,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
 	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
 		clear_thread_flag(TIF_NOTIFY_RESUME);
 		tracehook_notify_resume(regs);
+		rseq_handle_notify_resume(regs);
 	}
 
 	if (thread_info_flags & _TIF_PATCH_PENDING)
-- 
2.11.0


* [RFC PATCH for 4.15 07/24] Restartable sequences: Wire up powerpc system call
  2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
  2017-11-14 20:03 ` [RFC PATCH for 4.15 03/24] Restartable sequences: wire up ARM 32 system call Mathieu Desnoyers
       [not found] ` <20171114200414.2188-1-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-14 20:03 ` Mathieu Desnoyers
  2017-11-14 20:04 ` [RFC PATCH for 4.15 10/24] cpu_opv: " Mathieu Desnoyers
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:03 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: Will Deacon, Andi Kleen, Paul Mackerras, H . Peter Anvin,
	Chris Lameter, Russell King, Andrew Hunter, Ingo Molnar,
	Michael Kerrisk, Catalin Marinas, Paul Turner, Josh Triplett,
	Steven Rostedt, Ben Maurer, Mathieu Desnoyers, Thomas Gleixner,
	linux-api, linuxppc-dev, linux-kernel, Andrew Morton,
	Linus Torvalds

From: Boqun Feng <boqun.feng@gmail.com>

Wire up the rseq system call on powerpc.

This provides an ABI improving the speed of a user-space getcpu
operation on powerpc by skipping the getcpu system call on the fast
path, as well as improving the speed of user-space operations on per-cpu
data compared to using load-reservation/store-conditional atomics.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Peter Zijlstra <peterz@infradead.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h      | 1 +
 arch/powerpc/include/asm/unistd.h      | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 449912f057f6..964321a5799c 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -389,3 +389,4 @@ COMPAT_SYS_SPU(preadv2)
 COMPAT_SYS_SPU(pwritev2)
 SYSCALL(kexec_file_load)
 SYSCALL(statx)
+SYSCALL(rseq)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 9ba11dbcaca9..e76bd5601ea4 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define NR_syscalls		384
+#define NR_syscalls		385
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index df8684f31919..b1980fcd56d5 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -395,5 +395,6 @@
 #define __NR_pwritev2		381
 #define __NR_kexec_file_load	382
 #define __NR_statx		383
+#define __NR_rseq		384
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0


* [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
       [not found] ` <20171114200414.2188-1-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
                     ` (4 preceding siblings ...)
  2017-11-14 20:03   ` [RFC PATCH for 4.15 06/24] Restartable sequences: powerpc architecture support Mathieu Desnoyers
@ 2017-11-14 20:03   ` Mathieu Desnoyers
       [not found]     ` <20171114200414.2188-9-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-11-14 20:03   ` [RFC PATCH for 4.15 09/24] cpu_opv: Wire up x86 32/64 " Mathieu Desnoyers
                     ` (5 subsequent siblings)
  11 siblings, 1 reply; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:03 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers

This new cpu_opv system call executes a vector of operations on behalf
of user-space on a specific CPU with preemption disabled. It is inspired
from readv() and writev() system calls which take a "struct iovec" array
as argument.

The operations available are: comparison, memcpy, add, or, and, xor,
left shift, right shift, and mb. The system call receives a CPU number
from user-space as argument, which is the CPU on which those operations
need to be performed. All preparation steps such as loading pointers,
and applying offsets to arrays, need to be performed by user-space
before invoking the system call. The "comparison" operation can be used
to check that the data used in the preparation step did not change
between preparation of system call inputs and operation execution within
the preempt-off critical section.
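
As an illustration (not part of this patch), a minimal user-space
compare-and-store on per-cpu data could fill the vector as follows. The
sketch assumes the uapi header added by this patch is installed as
<linux/cpu_opv.h>, and that __NR_cpu_opv is defined by the wire-up
patches later in this series:

  #include <linux/cpu_opv.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <stdint.h>
  #include <string.h>

  /*
   * Store "newval" into the 64-bit word at "p", bound to "cpu", only if
   * it currently contains "expect". Returns 0 on success, op index + 1
   * (here 1) if the comparison fails, or -1 with errno set (e.g. EAGAIN)
   * on error.
   */
  static int percpu_cmpstore64(uint64_t *p, uint64_t expect, uint64_t newval,
                               int cpu)
  {
          struct cpu_op opvec[2];

          memset(opvec, 0, sizeof(opvec));        /* expect_fault fields = 0 */
          opvec[0].op = CPU_COMPARE_EQ_OP;
          opvec[0].len = sizeof(*p);
          CPU_OP_FIELD_u32_u64_INIT_ONSTACK(opvec[0].u.compare_op.a, p);
          CPU_OP_FIELD_u32_u64_INIT_ONSTACK(opvec[0].u.compare_op.b, &expect);
          opvec[1].op = CPU_MEMCPY_OP;
          opvec[1].len = sizeof(*p);
          CPU_OP_FIELD_u32_u64_INIT_ONSTACK(opvec[1].u.memcpy_op.dst, p);
          CPU_OP_FIELD_u32_u64_INIT_ONSTACK(opvec[1].u.memcpy_op.src, &newval);
          return syscall(__NR_cpu_opv, opvec, 2, cpu, 0);
  }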

The reason why we require all pointer offsets to be calculated by
user-space beforehand is because we need to use get_user_pages_fast() to
first pin all pages touched by each operation. This takes care of
faulting-in the pages. Then, preemption is disabled, and the operations
are performed atomically with respect to other thread execution on that
CPU, without generating any page fault.

A maximum limit of 16 operations per cpu_opv syscall invocation is
enforced, so user-space cannot generate an overly long preempt-off
critical section. Each operation is also limited to a length of
PAGE_SIZE bytes, meaning that an operation can touch a maximum of 4
pages (memcpy: 2 pages for source, 2 pages for destination if addresses
are not aligned on page boundaries). Moreover, a total limit of 4216
bytes is applied to the sum of the operation lengths.

If the thread is not running on the requested CPU, the new
push_task_to_cpu() helper is invoked to migrate the task to the requested CPU.
If the requested CPU is not part of the cpus allowed mask of the thread,
the system call fails with EINVAL. After the migration has been
performed, preemption is disabled, and the current CPU number is checked
again and compared to the requested CPU number. If it still differs, it
means the scheduler migrated us away from that CPU. Return EAGAIN to
user-space in that case, and let user-space retry (either requesting the
same CPU number, or a different one, depending on the user-space
algorithm constraints).
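
The expected user-space pattern is therefore a retry loop. Reusing the
includes and conventions of the sketch above (plus <errno.h>, <sched.h>
and <stdlib.h>; sched_getcpu() requires _GNU_SOURCE), a per-cpu 64-bit
counter increment could look like:

  /*
   * Add "count" to the counter of the current CPU. "counters" is an
   * array indexed by CPU number. sched_getcpu() is only a hint: the
   * kernel re-checks the CPU with preemption disabled and returns
   * EAGAIN if it no longer matches.
   */
  static void percpu_add64(uint64_t *counters, uint64_t count)
  {
          for (;;) {
                  int cpu = sched_getcpu();
                  struct cpu_op op;

                  memset(&op, 0, sizeof(op));
                  op.op = CPU_ADD_OP;
                  op.len = sizeof(uint64_t);
                  CPU_OP_FIELD_u32_u64_INIT_ONSTACK(op.u.arithmetic_op.p,
                                                    &counters[cpu]);
                  op.u.arithmetic_op.count = count;
                  if (!syscall(__NR_cpu_opv, &op, 1, cpu, 0))
                          return;
                  if (errno != EAGAIN)
                          abort();        /* unexpected error */
                  /* EAGAIN: raced with migration or hotplug, retry. */
          }
  }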

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
CC: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
CC: Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
CC: Chris Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
CC: Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org>
CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
CC: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
CC: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
CC: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
CC: Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---

Changes since v1:
- handle CPU hotplug,
- cleanup implementation using function pointers: We can use function
  pointers to implement the operations rather than duplicating all the
  user-access code.
- refuse device pages: Performing cpu_opv operations on io map'd pages
  with preemption disabled could generate long preempt-off critical
  sections, which leads to unwanted scheduler latency. Return EFAULT if
  a device page is received as a parameter.
- restrict op vector to a 4216-byte length sum: Restrict the operation
  vector to a length sum of:
  - 4096 bytes (typical page size on most architectures, should be
    enough for a string, or structures)
  - 15 * 8 bytes (typical operations on integers or pointers).
  The goal here is to keep the duration of the preempt-off critical
  section short, so we don't add significant scheduler latency.
- Add INIT_ONSTACK macro: Introduce the
  CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users
  correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their
  stack to 0 on 32-bit architectures.
- Add CPU_MB_OP operation:
  Use-cases with:
  - two consecutive stores,
  - a memcpy followed by a store,
  require a memory barrier before the final store operation. A typical
  use-case is a store-release on the final store. Given that this is a
  slow path, just providing an explicit full barrier instruction should
  be sufficient.
- Add expect fault field:
  The use-case of list_pop brings interesting challenges. With rseq, we
  can use rseq_cmpnev_storeoffp_load(), and therefore load a pointer,
  compare it against NULL, add an offset, and load the target "next"
  pointer from the object, all within a single rseq critical section.

  Life is not so easy for cpu_opv in this use-case, mainly because we
  need to pin all pages we are going to touch in the preempt-off
  critical section beforehand. So we need to know the target object (in
  which we apply an offset to fetch the next pointer) when we pin pages
  before disabling preemption.

  So the approach is to load the head pointer and compare it against
  NULL in user-space, before doing the cpu_opv syscall. User-space can
  then compute the address of the head->next field, *without loading it*.

  The cpu_opv system call will first need to pin all pages associated
  with input data. This includes the page backing the head->next object,
  which may have been concurrently deallocated and unmapped. Therefore,
  in this case, getting -EFAULT when trying to pin those pages may
  happen: it just means they have been concurrently unmapped. This is
  an expected situation, and should just return -EAGAIN to user-space,
  so user-space can distinguish between "should retry" types of
  situations and actual errors that should be handled with extreme
  prejudice to the program (e.g. abort()).

  Therefore, add "expect_fault" fields along with op input address
  pointers, so user-space can identify whether a fault when getting a
  field should return EAGAIN rather than EFAULT.
- Add compiler barrier between operations: Adding a compiler barrier
  between store operations in a cpu_opv sequence can be useful when
  paired with membarrier system call.

  An algorithm with a paired slow path and fast path can use
  sys_membarrier on the slow path to replace fast-path memory barriers
  with compiler barriers.

  Adding an explicit compiler barrier between operations allows
  cpu_opv to be used as a fallback for operations meant to match
  the membarrier system call.

Changes since v2:

- Fix memory leak by introducing struct cpu_opv_pinned_pages.
  Suggested by Boqun Feng.
- Cast argument 1 passed to access_ok from integer to void __user *,
  fixing sparse warning.
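
For illustration, here is a sketch (not part of this patch) of the
list_pop fallback described in the "expect fault" item above. The
"struct node"/"struct percpu_list" layouts are hypothetical, and the
sketch reuses the includes of the sketches in the patch description:

  struct node {
          struct node *next;
          /* ... payload ... */
  };

  struct percpu_list {
          struct node *head;
  };

  /*
   * Pop the head of the list bound to "cpu". Returns 1 and sets *result
   * on success, 0 if the list is empty, -1 if the caller should retry
   * (head changed, or an expected fault / migration race occurred).
   */
  static int percpu_list_pop(struct percpu_list *lists, int cpu,
                             struct node **result)
  {
          struct percpu_list *l = &lists[cpu];
          struct node *head = l->head;    /* load the head in user-space */
          struct cpu_op opvec[2];
          long ret;

          if (!head)
                  return 0;
          memset(opvec, 0, sizeof(opvec));
          /* Abort (returns op index + 1) if the head changed under us. */
          opvec[0].op = CPU_COMPARE_EQ_OP;
          opvec[0].len = sizeof(head);
          CPU_OP_FIELD_u32_u64_INIT_ONSTACK(opvec[0].u.compare_op.a, &l->head);
          CPU_OP_FIELD_u32_u64_INIT_ONSTACK(opvec[0].u.compare_op.b, &head);
          /*
           * l->head = head->next. Only the *address* of head->next is
           * computed here; the load is done by the kernel, and a fault
           * on it is expected (concurrent free/unmap), hence
           * expect_fault_src, which turns EFAULT into EAGAIN.
           */
          opvec[1].op = CPU_MEMCPY_OP;
          opvec[1].len = sizeof(head);
          CPU_OP_FIELD_u32_u64_INIT_ONSTACK(opvec[1].u.memcpy_op.dst, &l->head);
          CPU_OP_FIELD_u32_u64_INIT_ONSTACK(opvec[1].u.memcpy_op.src, &head->next);
          opvec[1].u.memcpy_op.expect_fault_src = 1;
          ret = syscall(__NR_cpu_opv, opvec, 2, cpu, 0);
          if (ret > 0 || (ret < 0 && errno == EAGAIN))
                  return -1;              /* head changed or raced: retry */
          if (ret < 0)
                  abort();                /* unexpected error */
          *result = head;
          return 1;
  }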
---
 MAINTAINERS                  |   7 +
 include/uapi/linux/cpu_opv.h | 117 ++++++
 init/Kconfig                 |  14 +
 kernel/Makefile              |   1 +
 kernel/cpu_opv.c             | 968 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          |  37 ++
 kernel/sched/sched.h         |   2 +
 kernel/sys_ni.c              |   1 +
 8 files changed, 1147 insertions(+)
 create mode 100644 include/uapi/linux/cpu_opv.h
 create mode 100644 kernel/cpu_opv.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c9f95f8b07ed..45a1bbdaa287 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3675,6 +3675,13 @@ B:	https://bugzilla.kernel.org
 F:	drivers/cpuidle/*
 F:	include/linux/cpuidle.h
 
+CPU NON-PREEMPTIBLE OPERATION VECTOR SUPPORT
+M:	Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
+L:	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+S:	Supported
+F:	kernel/cpu_opv.c
+F:	include/uapi/linux/cpu_opv.h
+
 CRAMFS FILESYSTEM
 W:	http://sourceforge.net/projects/cramfs/
 S:	Orphan / Obsolete
diff --git a/include/uapi/linux/cpu_opv.h b/include/uapi/linux/cpu_opv.h
new file mode 100644
index 000000000000..17f7d46e053b
--- /dev/null
+++ b/include/uapi/linux/cpu_opv.h
@@ -0,0 +1,117 @@
+#ifndef _UAPI_LINUX_CPU_OPV_H
+#define _UAPI_LINUX_CPU_OPV_H
+
+/*
+ * linux/cpu_opv.h
+ *
+ * CPU preempt-off operation vector system call API
+ *
+ * Copyright (c) 2017 Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifdef __KERNEL__
+# include <linux/types.h>
+#else	/* #ifdef __KERNEL__ */
+# include <stdint.h>
+#endif	/* #else #ifdef __KERNEL__ */
+
+#include <asm/byteorder.h>
+
+#ifdef __LP64__
+# define CPU_OP_FIELD_u32_u64(field)			uint64_t field
+# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)	field = (intptr_t)v
+#elif defined(__BYTE_ORDER) ? \
+	__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+# define CPU_OP_FIELD_u32_u64(field)	uint32_t field ## _padding, field
+# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
+	field ## _padding = 0, field = (intptr_t)v
+#else
+# define CPU_OP_FIELD_u32_u64(field)	uint32_t field, field ## _padding
+# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
+	field = (intptr_t)v, field ## _padding = 0
+#endif
+
+#define CPU_OP_VEC_LEN_MAX		16
+#define CPU_OP_ARG_LEN_MAX		24
+/* Max. data len per operation. */
+#define CPU_OP_DATA_LEN_MAX		PAGE_SIZE
+/*
+ * Max. data len for overall vector. We restrict the amount of
+ * user-space data touched by the kernel in non-preemptible context so
+ * we do not introduce long scheduler latencies.
+ * This allows one copy of up to 4096 bytes, and 15 operations touching
+ * 8 bytes each.
+ * This limit is applied to the sum of length specified for all
+ * operations in a vector.
+ */
+#define CPU_OP_VEC_DATA_LEN_MAX		(4096 + 15*8)
+#define CPU_OP_MAX_PAGES		4	/* Max. pages per op. */
+
+enum cpu_op_type {
+	CPU_COMPARE_EQ_OP,	/* compare */
+	CPU_COMPARE_NE_OP,	/* compare */
+	CPU_MEMCPY_OP,		/* memcpy */
+	CPU_ADD_OP,		/* arithmetic */
+	CPU_OR_OP,		/* bitwise */
+	CPU_AND_OP,		/* bitwise */
+	CPU_XOR_OP,		/* bitwise */
+	CPU_LSHIFT_OP,		/* shift */
+	CPU_RSHIFT_OP,		/* shift */
+	CPU_MB_OP,		/* memory barrier */
+};
+
+/* Vector of operations to perform. Limited to 16. */
+struct cpu_op {
+	int32_t op;	/* enum cpu_op_type. */
+	uint32_t len;	/* data length, in bytes. */
+	union {
+		struct {
+			CPU_OP_FIELD_u32_u64(a);
+			CPU_OP_FIELD_u32_u64(b);
+			uint8_t expect_fault_a;
+			uint8_t expect_fault_b;
+		} compare_op;
+		struct {
+			CPU_OP_FIELD_u32_u64(dst);
+			CPU_OP_FIELD_u32_u64(src);
+			uint8_t expect_fault_dst;
+			uint8_t expect_fault_src;
+		} memcpy_op;
+		struct {
+			CPU_OP_FIELD_u32_u64(p);
+			int64_t count;
+			uint8_t expect_fault_p;
+		} arithmetic_op;
+		struct {
+			CPU_OP_FIELD_u32_u64(p);
+			uint64_t mask;
+			uint8_t expect_fault_p;
+		} bitwise_op;
+		struct {
+			CPU_OP_FIELD_u32_u64(p);
+			uint32_t bits;
+			uint8_t expect_fault_p;
+		} shift_op;
+		char __padding[CPU_OP_ARG_LEN_MAX];
+	} u;
+};
+
+#endif /* _UAPI_LINUX_CPU_OPV_H */
diff --git a/init/Kconfig b/init/Kconfig
index cbedfb91b40a..e4fbb5dd6a24 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1404,6 +1404,7 @@ config RSEQ
 	bool "Enable rseq() system call" if EXPERT
 	default y
 	depends on HAVE_RSEQ
+	select CPU_OPV
 	select MEMBARRIER
 	help
 	  Enable the restartable sequences system call. It provides a
@@ -1414,6 +1415,19 @@ config RSEQ
 
 	  If unsure, say Y.
 
+config CPU_OPV
+	bool "Enable cpu_opv() system call" if EXPERT
+	default y
+	help
+	  Enable the CPU preempt-off operation vector system call.
+	  It allows user-space to perform a sequence of operations on
+	  per-cpu data with preemption disabled. Useful as
+	  single-stepping fall-back for restartable sequences, and for
+	  performing more complex operations on per-cpu data that would
+	  not be otherwise possible to do with restartable sequences.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 3574669dafd9..cac8855196ff 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -113,6 +113,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
 obj-$(CONFIG_RSEQ) += rseq.o
+obj-$(CONFIG_CPU_OPV) += cpu_opv.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c
new file mode 100644
index 000000000000..a81837a14b17
--- /dev/null
+++ b/kernel/cpu_opv.c
@@ -0,0 +1,968 @@
+/*
+ * CPU preempt-off operation vector system call
+ *
+ * It allows user-space to perform a sequence of operations on per-cpu
+ * data with preemption disabled. Useful as single-stepping fall-back
+ * for restartable sequences, and for performing more complex operations
+ * on per-cpu data that would not be otherwise possible to do with
+ * restartable sequences.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) 2017, EfficiOS Inc.,
+ * Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/cpu_opv.h>
+#include <linux/types.h>
+#include <linux/mutex.h>
+#include <linux/pagemap.h>
+#include <asm/ptrace.h>
+#include <asm/byteorder.h>
+
+#include "sched/sched.h"
+
+#define TMP_BUFLEN			64
+#define NR_PINNED_PAGES_ON_STACK	8
+
+union op_fn_data {
+	uint8_t _u8;
+	uint16_t _u16;
+	uint32_t _u32;
+	uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+	uint32_t _u64_split[2];
+#endif
+};
+
+struct cpu_opv_pinned_pages {
+	struct page **pages;
+	size_t nr;
+	bool is_kmalloc;
+};
+
+typedef int (*op_fn_t)(union op_fn_data *data, uint64_t v, uint32_t len);
+
+static DEFINE_MUTEX(cpu_opv_offline_lock);
+
+/*
+ * The cpu_opv system call executes a vector of operations on behalf of
+ * user-space on a specific CPU with preemption disabled. It is inspired
+ * from readv() and writev() system calls which take a "struct iovec"
+ * array as argument.
+ *
+ * The operations available are: comparison, memcpy, add, or, and, xor,
+ * left shift, and right shift. The system call receives a CPU number
+ * from user-space as argument, which is the CPU on which those
+ * operations need to be performed. All preparation steps such as
+ * loading pointers, and applying offsets to arrays, need to be
+ * performed by user-space before invoking the system call. The
+ * "comparison" operation can be used to check that the data used in the
+ * preparation step did not change between preparation of system call
+ * inputs and operation execution within the preempt-off critical
+ * section.
+ *
+ * The reason why we require all pointer offsets to be calculated by
+ * user-space beforehand is because we need to use get_user_pages_fast()
+ * to first pin all pages touched by each operation. This takes care of
+ * faulting-in the pages. Then, preemption is disabled, and the
+ * operations are performed atomically with respect to other thread
+ * execution on that CPU, without generating any page fault.
+ *
+ * A maximum limit of 16 operations per cpu_opv syscall invocation is
+ * enforced, and an overall maximum length sum, so user-space cannot
+ * generate an overly long preempt-off critical section. Each operation
+ * is also limited to a length of PAGE_SIZE bytes, meaning that an operation
+ * can touch a maximum of 4 pages (memcpy: 2 pages for source, 2 pages
+ * for destination if addresses are not aligned on page boundaries).
+ *
+ * If the thread is not running on the requested CPU, a new
+ * push_task_to_cpu() is invoked to migrate the task to the requested
+ * CPU.  If the requested CPU is not part of the cpus allowed mask of
+ * the thread, the system call fails with EINVAL. After the migration
+ * has been performed, preemption is disabled, and the current CPU
+ * number is checked again and compared to the requested CPU number. If
+ * it still differs, it means the scheduler migrated us away from that
+ * CPU. Return EAGAIN to user-space in that case, and let user-space
+ * retry (either requesting the same CPU number, or a different one,
+ * depending on the user-space algorithm constraints).
+ */
+
+/*
+ * Check operation types and length parameters.
+ */
+static int cpu_opv_check(struct cpu_op *cpuop, int cpuopcnt)
+{
+	int i;
+	uint32_t sum = 0;
+
+	for (i = 0; i < cpuopcnt; i++) {
+		struct cpu_op *op = &cpuop[i];
+
+		switch (op->op) {
+		case CPU_MB_OP:
+			break;
+		default:
+			sum += op->len;
+		}
+		switch (op->op) {
+		case CPU_COMPARE_EQ_OP:
+		case CPU_COMPARE_NE_OP:
+		case CPU_MEMCPY_OP:
+			if (op->len > CPU_OP_DATA_LEN_MAX)
+				return -EINVAL;
+			break;
+		case CPU_ADD_OP:
+		case CPU_OR_OP:
+		case CPU_AND_OP:
+		case CPU_XOR_OP:
+			switch (op->len) {
+			case 1:
+			case 2:
+			case 4:
+			case 8:
+				break;
+			default:
+				return -EINVAL;
+			}
+			break;
+		case CPU_LSHIFT_OP:
+		case CPU_RSHIFT_OP:
+			switch (op->len) {
+			case 1:
+				if (op->u.shift_op.bits > 7)
+					return -EINVAL;
+				break;
+			case 2:
+				if (op->u.shift_op.bits > 15)
+					return -EINVAL;
+				break;
+			case 4:
+				if (op->u.shift_op.bits > 31)
+					return -EINVAL;
+				break;
+			case 8:
+				if (op->u.shift_op.bits > 63)
+					return -EINVAL;
+				break;
+			default:
+				return -EINVAL;
+			}
+			break;
+		case CPU_MB_OP:
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+	if (sum > CPU_OP_VEC_DATA_LEN_MAX)
+		return -EINVAL;
+	return 0;
+}
+
+static unsigned long cpu_op_range_nr_pages(unsigned long addr,
+		unsigned long len)
+{
+	return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1;
+}
+
+static int cpu_op_check_page(struct page *page)
+{
+	struct address_space *mapping;
+
+	if (is_zone_device_page(page))
+		return -EFAULT;
+	page = compound_head(page);
+	mapping = READ_ONCE(page->mapping);
+	if (!mapping) {
+		int shmem_swizzled;
+
+		/*
+		 * Check again with page lock held to guard against
+		 * memory pressure making shmem_writepage move the page
+		 * from filecache to swapcache.
+		 */
+		lock_page(page);
+		shmem_swizzled = PageSwapCache(page) || page->mapping;
+		unlock_page(page);
+		if (shmem_swizzled)
+			return -EAGAIN;
+		return -EFAULT;
+	}
+	return 0;
+}
+
+/*
+ * Refuse device pages, the zero page, pages in the gate area, and
+ * special mappings. Inspired by the futex.c checks.
+ */
+static int cpu_op_check_pages(struct page **pages,
+		unsigned long nr_pages)
+{
+	unsigned long i;
+
+	for (i = 0; i < nr_pages; i++) {
+		int ret;
+
+		ret = cpu_op_check_page(pages[i]);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static int cpu_op_pin_pages(unsigned long addr, unsigned long len,
+		struct cpu_opv_pinned_pages *pin_pages, int write)
+{
+	struct page *pages[2];
+	int ret, nr_pages;
+
+	if (!len)
+		return 0;
+	nr_pages = cpu_op_range_nr_pages(addr, len);
+	BUG_ON(nr_pages > 2);
+	if (!pin_pages->is_kmalloc && pin_pages->nr + nr_pages
+			> NR_PINNED_PAGES_ON_STACK) {
+		struct page **pinned_pages =
+			kzalloc(CPU_OP_VEC_LEN_MAX * CPU_OP_MAX_PAGES
+				* sizeof(struct page *), GFP_KERNEL);
+		if (!pinned_pages)
+			return -ENOMEM;
+		memcpy(pinned_pages, pin_pages->pages,
+			pin_pages->nr * sizeof(struct page *));
+		pin_pages->pages = pinned_pages;
+		pin_pages->is_kmalloc = true;
+	}
+again:
+	ret = get_user_pages_fast(addr, nr_pages, write, pages);
+	if (ret < nr_pages) {
+		if (ret > 0)
+			put_page(pages[0]);
+		return -EFAULT;
+	}
+	/*
+	 * Refuse device pages, the zero page, pages in the gate area,
+	 * and special mappings.
+	 */
+	ret = cpu_op_check_pages(pages, nr_pages);
+	if (ret == -EAGAIN) {
+		put_page(pages[0]);
+		if (nr_pages > 1)
+			put_page(pages[1]);
+		goto again;
+	}
+	if (ret)
+		goto error;
+	pin_pages->pages[pin_pages->nr++] = pages[0];
+	if (nr_pages > 1)
+		pin_pages->pages[pin_pages->nr++] = pages[1];
+	return 0;
+
+error:
+	put_page(pages[0]);
+	if (nr_pages > 1)
+		put_page(pages[1]);
+	return -EFAULT;
+}
+
+static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt,
+		struct cpu_opv_pinned_pages *pin_pages)
+{
+	int ret, i;
+	bool expect_fault = false;
+
+	/* Check access, pin pages. */
+	for (i = 0; i < cpuopcnt; i++) {
+		struct cpu_op *op = &cpuop[i];
+
+		switch (op->op) {
+		case CPU_COMPARE_EQ_OP:
+		case CPU_COMPARE_NE_OP:
+			ret = -EFAULT;
+			expect_fault = op->u.compare_op.expect_fault_a;
+			if (!access_ok(VERIFY_READ,
+					(void __user *)op->u.compare_op.a,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.compare_op.a,
+					op->len, pin_pages, 0);
+			if (ret)
+				goto error;
+			ret = -EFAULT;
+			expect_fault = op->u.compare_op.expect_fault_b;
+			if (!access_ok(VERIFY_READ,
+					(void __user *)op->u.compare_op.b,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.compare_op.b,
+					op->len, pin_pages, 0);
+			if (ret)
+				goto error;
+			break;
+		case CPU_MEMCPY_OP:
+			ret = -EFAULT;
+			expect_fault = op->u.memcpy_op.expect_fault_dst;
+			if (!access_ok(VERIFY_WRITE,
+					(void __user *)op->u.memcpy_op.dst,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.memcpy_op.dst,
+					op->len, pin_pages, 1);
+			if (ret)
+				goto error;
+			ret = -EFAULT;
+			expect_fault = op->u.memcpy_op.expect_fault_src;
+			if (!access_ok(VERIFY_READ,
+					(void __user *)op->u.memcpy_op.src,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.memcpy_op.src,
+					op->len, pin_pages, 0);
+			if (ret)
+				goto error;
+			break;
+		case CPU_ADD_OP:
+			ret = -EFAULT;
+			expect_fault = op->u.arithmetic_op.expect_fault_p;
+			if (!access_ok(VERIFY_WRITE,
+					(void __user *)op->u.arithmetic_op.p,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.arithmetic_op.p,
+					op->len, pin_pages, 1);
+			if (ret)
+				goto error;
+			break;
+		case CPU_OR_OP:
+		case CPU_AND_OP:
+		case CPU_XOR_OP:
+			ret = -EFAULT;
+			expect_fault = op->u.bitwise_op.expect_fault_p;
+			if (!access_ok(VERIFY_WRITE,
+					(void __user *)op->u.bitwise_op.p,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.bitwise_op.p,
+					op->len, pin_pages, 1);
+			if (ret)
+				goto error;
+			break;
+		case CPU_LSHIFT_OP:
+		case CPU_RSHIFT_OP:
+			ret = -EFAULT;
+			expect_fault = op->u.shift_op.expect_fault_p;
+			if (!access_ok(VERIFY_WRITE,
+					(void __user *)op->u.shift_op.p,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.shift_op.p,
+					op->len, pin_pages, 1);
+			if (ret)
+				goto error;
+			break;
+		case CPU_MB_OP:
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+	return 0;
+
+error:
+	for (i = 0; i < pin_pages->nr; i++)
+		put_page(pin_pages->pages[i]);
+	pin_pages->nr = 0;
+	/*
+	 * If faulting access is expected, return EAGAIN to user-space.
+	 * It allows user-space to distinguish a fault caused by an
+	 * access which is expected to fault (e.g. due to concurrent
+	 * unmapping of underlying memory) from an unexpected fault from
+	 * which a retry would not recover.
+	 */
+	if (ret == -EFAULT && expect_fault)
+		return -EAGAIN;
+	return ret;
+}
+
+/* Return 0 if same, > 0 if different, < 0 on error. */
+static int do_cpu_op_compare_iter(void __user *a, void __user *b, uint32_t len)
+{
+	char bufa[TMP_BUFLEN], bufb[TMP_BUFLEN];
+	uint32_t compared = 0;
+
+	while (compared != len) {
+		unsigned long to_compare;
+
+		to_compare = min_t(uint32_t, TMP_BUFLEN, len - compared);
+		if (__copy_from_user_inatomic(bufa, a + compared, to_compare))
+			return -EFAULT;
+		if (__copy_from_user_inatomic(bufb, b + compared, to_compare))
+			return -EFAULT;
+		if (memcmp(bufa, bufb, to_compare))
+			return 1;	/* different */
+		compared += to_compare;
+	}
+	return 0;	/* same */
+}
+
+/* Return 0 if same, > 0 if different, < 0 on error. */
+static int do_cpu_op_compare(void __user *a, void __user *b, uint32_t len)
+{
+	int ret = -EFAULT;
+	union {
+		uint8_t _u8;
+		uint16_t _u16;
+		uint32_t _u32;
+		uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+		uint32_t _u64_split[2];
+#endif
+	} tmp[2];
+
+	pagefault_disable();
+	switch (len) {
+	case 1:
+		if (__get_user(tmp[0]._u8, (uint8_t __user *)a))
+			goto end;
+		if (__get_user(tmp[1]._u8, (uint8_t __user *)b))
+			goto end;
+		ret = !!(tmp[0]._u8 != tmp[1]._u8);
+		break;
+	case 2:
+		if (__get_user(tmp[0]._u16, (uint16_t __user *)a))
+			goto end;
+		if (__get_user(tmp[1]._u16, (uint16_t __user *)b))
+			goto end;
+		ret = !!(tmp[0]._u16 != tmp[1]._u16);
+		break;
+	case 4:
+		if (__get_user(tmp[0]._u32, (uint32_t __user *)a))
+			goto end;
+		if (__get_user(tmp[1]._u32, (uint32_t __user *)b))
+			goto end;
+		ret = !!(tmp[0]._u32 != tmp[1]._u32);
+		break;
+	case 8:
+#if (BITS_PER_LONG >= 64)
+		if (__get_user(tmp[0]._u64, (uint64_t __user *)a))
+			goto end;
+		if (__get_user(tmp[1]._u64, (uint64_t __user *)b))
+			goto end;
+#else
+		if (__get_user(tmp[0]._u64_split[0], (uint32_t __user *)a))
+			goto end;
+		if (__get_user(tmp[0]._u64_split[1], (uint32_t __user *)a + 1))
+			goto end;
+		if (__get_user(tmp[1]._u64_split[0], (uint32_t __user *)b))
+			goto end;
+		if (__get_user(tmp[1]._u64_split[1], (uint32_t __user *)b + 1))
+			goto end;
+#endif
+		ret = !!(tmp[0]._u64 != tmp[1]._u64);
+		break;
+	default:
+		pagefault_enable();
+		return do_cpu_op_compare_iter(a, b, len);
+	}
+end:
+	pagefault_enable();
+	return ret;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_memcpy_iter(void __user *dst, void __user *src,
+		uint32_t len)
+{
+	char buf[TMP_BUFLEN];
+	uint32_t copied = 0;
+
+	while (copied != len) {
+		unsigned long to_copy;
+
+		to_copy = min_t(uint32_t, TMP_BUFLEN, len - copied);
+		if (__copy_from_user_inatomic(buf, src + copied, to_copy))
+			return -EFAULT;
+		if (__copy_to_user_inatomic(dst + copied, buf, to_copy))
+			return -EFAULT;
+		copied += to_copy;
+	}
+	return 0;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_memcpy(void __user *dst, void __user *src, uint32_t len)
+{
+	int ret = -EFAULT;
+	union {
+		uint8_t _u8;
+		uint16_t _u16;
+		uint32_t _u32;
+		uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+		uint32_t _u64_split[2];
+#endif
+	} tmp;
+
+	pagefault_disable();
+	switch (len) {
+	case 1:
+		if (__get_user(tmp._u8, (uint8_t __user *)src))
+			goto end;
+		if (__put_user(tmp._u8, (uint8_t __user *)dst))
+			goto end;
+		break;
+	case 2:
+		if (__get_user(tmp._u16, (uint16_t __user *)src))
+			goto end;
+		if (__put_user(tmp._u16, (uint16_t __user *)dst))
+			goto end;
+		break;
+	case 4:
+		if (__get_user(tmp._u32, (uint32_t __user *)src))
+			goto end;
+		if (__put_user(tmp._u32, (uint32_t __user *)dst))
+			goto end;
+		break;
+	case 8:
+#if (BITS_PER_LONG >= 64)
+		if (__get_user(tmp._u64, (uint64_t __user *)src))
+			goto end;
+		if (__put_user(tmp._u64, (uint64_t __user *)dst))
+			goto end;
+#else
+		if (__get_user(tmp._u64_split[0], (uint32_t __user *)src))
+			goto end;
+		if (__get_user(tmp._u64_split[1], (uint32_t __user *)src + 1))
+			goto end;
+		if (__put_user(tmp._u64_split[0], (uint32_t __user *)dst))
+			goto end;
+		if (__put_user(tmp._u64_split[1], (uint32_t __user *)dst + 1))
+			goto end;
+#endif
+		break;
+	default:
+		pagefault_enable();
+		return do_cpu_op_memcpy_iter(dst, src, len);
+	}
+	ret = 0;
+end:
+	pagefault_enable();
+	return ret;
+}
+
+static int op_add_fn(union op_fn_data *data, uint64_t count, uint32_t len)
+{
+	int ret = 0;
+
+	switch (len) {
+	case 1:
+		data->_u8 += (uint8_t)count;
+		break;
+	case 2:
+		data->_u16 += (uint16_t)count;
+		break;
+	case 4:
+		data->_u32 += (uint32_t)count;
+		break;
+	case 8:
+		data->_u64 += (uint64_t)count;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	return ret;
+}
+
+static int op_or_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
+{
+	int ret = 0;
+
+	switch (len) {
+	case 1:
+		data->_u8 |= (uint8_t)mask;
+		break;
+	case 2:
+		data->_u16 |= (uint16_t)mask;
+		break;
+	case 4:
+		data->_u32 |= (uint32_t)mask;
+		break;
+	case 8:
+		data->_u64 |= (uint64_t)mask;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	return ret;
+}
+
+static int op_and_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
+{
+	int ret = 0;
+
+	switch (len) {
+	case 1:
+		data->_u8 &= (uint8_t)mask;
+		break;
+	case 2:
+		data->_u16 &= (uint16_t)mask;
+		break;
+	case 4:
+		data->_u32 &= (uint32_t)mask;
+		break;
+	case 8:
+		data->_u64 &= (uint64_t)mask;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	return ret;
+}
+
+static int op_xor_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
+{
+	int ret = 0;
+
+	switch (len) {
+	case 1:
+		data->_u8 ^= (uint8_t)mask;
+		break;
+	case 2:
+		data->_u16 ^= (uint16_t)mask;
+		break;
+	case 4:
+		data->_u32 ^= (uint32_t)mask;
+		break;
+	case 8:
+		data->_u64 ^= (uint64_t)mask;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	return ret;
+}
+
+static int op_lshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
+{
+	int ret = 0;
+
+	switch (len) {
+	case 1:
+		data->_u8 <<= (uint8_t)bits;
+		break;
+	case 2:
+		data->_u16 <<= (uint16_t)bits;
+		break;
+	case 4:
+		data->_u32 <<= (uint32_t)bits;
+		break;
+	case 8:
+		data->_u64 <<= (uint64_t)bits;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	return ret;
+}
+
+static int op_rshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
+{
+	int ret = 0;
+
+	switch (len) {
+	case 1:
+		data->_u8 >>= (uint8_t)bits;
+		break;
+	case 2:
+		data->_u16 >>= (uint16_t)bits;
+		break;
+	case 4:
+		data->_u32 >>= (uint32_t)bits;
+		break;
+	case 8:
+		data->_u64 >>= (uint64_t)bits;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	return ret;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_fn(op_fn_t op_fn, void __user *p, uint64_t v,
+		uint32_t len)
+{
+	int ret = -EFAULT;
+	union op_fn_data tmp;
+
+	pagefault_disable();
+	switch (len) {
+	case 1:
+		if (__get_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		if (op_fn(&tmp, v, len))
+			goto end;
+		if (__put_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		break;
+	case 2:
+		if (__get_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		if (op_fn(&tmp, v, len))
+			goto end;
+		if (__put_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		break;
+	case 4:
+		if (__get_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		if (op_fn(&tmp, v, len))
+			goto end;
+		if (__put_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		break;
+	case 8:
+#if (BITS_PER_LONG >= 64)
+		if (__get_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__get_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__get_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		if (op_fn(&tmp, v, len))
+			goto end;
+#if (BITS_PER_LONG >= 64)
+		if (__put_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__put_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__put_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		goto end;
+	}
+	ret = 0;
+end:
+	pagefault_enable();
+	return ret;
+}
+
+static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt)
+{
+	int i, ret;
+
+	for (i = 0; i < cpuopcnt; i++) {
+		struct cpu_op *op = &cpuop[i];
+
+		/* Guarantee a compiler barrier between each operation. */
+		barrier();
+
+		switch (op->op) {
+		case CPU_COMPARE_EQ_OP:
+			ret = do_cpu_op_compare(
+					(void __user *)op->u.compare_op.a,
+					(void __user *)op->u.compare_op.b,
+					op->len);
+			/* Stop execution on error. */
+			if (ret < 0)
+				return ret;
+			/*
+			 * Stop execution, return op index + 1 if comparison
+			 * differs.
+			 */
+			if (ret > 0)
+				return i + 1;
+			break;
+		case CPU_COMPARE_NE_OP:
+			ret = do_cpu_op_compare(
+					(void __user *)op->u.compare_op.a,
+					(void __user *)op->u.compare_op.b,
+					op->len);
+			/* Stop execution on error. */
+			if (ret < 0)
+				return ret;
+			/*
+			 * Stop execution, return op index + 1 if comparison
+			 * is identical.
+			 */
+			if (ret == 0)
+				return i + 1;
+			break;
+		case CPU_MEMCPY_OP:
+			ret = do_cpu_op_memcpy(
+					(void __user *)op->u.memcpy_op.dst,
+					(void __user *)op->u.memcpy_op.src,
+					op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		case CPU_ADD_OP:
+			ret = do_cpu_op_fn(op_add_fn,
+					(void __user *)op->u.arithmetic_op.p,
+					op->u.arithmetic_op.count, op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		case CPU_OR_OP:
+			ret = do_cpu_op_fn(op_or_fn,
+					(void __user *)op->u.bitwise_op.p,
+					op->u.bitwise_op.mask, op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		case CPU_AND_OP:
+			ret = do_cpu_op_fn(op_and_fn,
+					(void __user *)op->u.bitwise_op.p,
+					op->u.bitwise_op.mask, op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		case CPU_XOR_OP:
+			ret = do_cpu_op_fn(op_xor_fn,
+					(void __user *)op->u.bitwise_op.p,
+					op->u.bitwise_op.mask, op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		case CPU_LSHIFT_OP:
+			ret = do_cpu_op_fn(op_lshift_fn,
+					(void __user *)op->u.shift_op.p,
+					op->u.shift_op.bits, op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		case CPU_RSHIFT_OP:
+			ret = do_cpu_op_fn(op_rshift_fn,
+					(void __user *)op->u.shift_op.p,
+					op->u.shift_op.bits, op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		case CPU_MB_OP:
+			smp_mb();
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt, int cpu)
+{
+	int ret;
+
+	if (cpu != raw_smp_processor_id()) {
+		ret = push_task_to_cpu(current, cpu);
+		if (ret)
+			goto check_online;
+	}
+	preempt_disable();
+	if (cpu != smp_processor_id()) {
+		ret = -EAGAIN;
+		goto end;
+	}
+	ret = __do_cpu_opv(cpuop, cpuopcnt);
+end:
+	preempt_enable();
+	return ret;
+
+check_online:
+	if (!cpu_possible(cpu))
+		return -EINVAL;
+	get_online_cpus();
+	if (cpu_online(cpu)) {
+		ret = -EAGAIN;
+		goto put_online_cpus;
+	}
+	/*
+	 * The CPU is offline. Perform the operation from the current CPU,
+	 * with the cpu_online read lock held to prevent that CPU from
+	 * coming online, and with the mutex held to provide mutual
+	 * exclusion against other callers also targeting an offline CPU.
+	 */
+	mutex_lock(&cpu_opv_offline_lock);
+	ret = __do_cpu_opv(cpuop, cpuopcnt);
+	mutex_unlock(&cpu_opv_offline_lock);
+put_online_cpus:
+	put_online_cpus();
+	return ret;
+}
+
+/*
+ * cpu_opv - execute operation vector on a given CPU with preempt off.
+ *
+ * Userspace should pass the current CPU number as parameter. May fail
+ * with -EAGAIN if the task is currently executing on the wrong CPU.
+ */
+SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
+		int, cpu, int, flags)
+{
+	struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX];
+	struct page *pinned_pages_on_stack[NR_PINNED_PAGES_ON_STACK];
+	struct cpu_opv_pinned_pages pin_pages = {
+		.pages = pinned_pages_on_stack,
+		.nr = 0,
+		.is_kmalloc = false,
+	};
+	int ret, i;
+
+	if (unlikely(flags))
+		return -EINVAL;
+	if (unlikely(cpu < 0))
+		return -EINVAL;
+	if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX)
+		return -EINVAL;
+	if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op)))
+		return -EFAULT;
+	ret = cpu_opv_check(cpuopv, cpuopcnt);
+	if (ret)
+		return ret;
+	ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &pin_pages);
+	if (ret)
+		goto end;
+	ret = do_cpu_opv(cpuopv, cpuopcnt, cpu);
+	for (i = 0; i < pin_pages.nr; i++)
+		put_page(pin_pages.pages[i]);
+end:
+	if (pin_pages.is_kmalloc)
+		kfree(pin_pages.pages);
+	return ret;
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6bba05f47e51..e547f93a46c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1052,6 +1052,43 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 		set_curr_task(rq, p);
 }
 
+int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+	int ret = 0;
+
+	rq = task_rq_lock(p, &rf);
+	update_rq_clock(rq);
+
+	if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (task_cpu(p) == dest_cpu)
+		goto out;
+
+	if (task_running(rq, p) || p->state == TASK_WAKING) {
+		struct migration_arg arg = { p, dest_cpu };
+		/* Need help from migration thread: drop lock and wait. */
+		task_rq_unlock(rq, p, &rf);
+		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
+		tlb_migrate_finish(p->mm);
+		return 0;
+	} else if (task_on_rq_queued(p)) {
+		/*
+		 * OK, since we're going to drop the lock immediately
+		 * afterwards anyway.
+		 */
+		rq = move_queued_task(rq, &rf, p, dest_cpu);
+	}
+out:
+	task_rq_unlock(rq, p, &rf);
+
+	return ret;
+}
+
 /*
  * Change a given task's CPU affinity. Migrate the thread to a
  * proper CPU and schedule it away if the CPU it's executing on
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3b448ba82225..cab256c1720a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1209,6 +1209,8 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 #endif
 }
 
+int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);
+
 /*
  * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
  */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index bfa1ee1bf669..59e622296dc3 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -262,3 +262,4 @@ cond_syscall(sys_pkey_free);
 
 /* restartable sequence */
 cond_syscall(sys_rseq);
+cond_syscall(sys_cpu_opv);
-- 
2.11.0
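
For reference, here is a minimal user-space sketch of invoking the
cpu_opv() system call implemented above. It is not part of the patch: it
assumes the uapi header <linux/cpu_opv.h> added earlier in this series
and the __NR_cpu_opv number wired up in the following patches, and it
follows the retry-on-EAGAIN pattern used by the selftests later in the
thread. The (unsigned long) cast of the pointer is fine for a 64-bit
process; a 32-bit process would initialize the pointer field with
CPU_OP_FIELD_u32_u64_INIT_ONSTACK(), as the selftests do.

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>		/* sched_getcpu(), CPU_SETSIZE */
#include <stdint.h>
#include <syscall.h>
#include <unistd.h>
#include <linux/cpu_opv.h>	/* struct cpu_op, CPU_ADD_OP */

/* One counter per possible CPU; CPU_SETSIZE is a generous upper bound. */
static intptr_t counters[CPU_SETSIZE];

static int percpu_counter_add(int64_t count)
{
	for (;;) {
		int cpu = sched_getcpu();
		struct cpu_op op;
		int ret;

		if (cpu < 0)
			return -1;
		op = (struct cpu_op) {
			.op = CPU_ADD_OP,
			.len = sizeof(intptr_t),
			.u.arithmetic_op.p = (unsigned long)&counters[cpu],
			.u.arithmetic_op.count = count,
			.u.arithmetic_op.expect_fault_p = 0,
		};
		ret = syscall(__NR_cpu_opv, &op, 1, cpu, 0);
		/* EAGAIN means we migrated off @cpu before the op ran; retry. */
		if (ret == -1 && errno == EAGAIN)
			continue;
		return ret;
	}
}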


* [RFC PATCH for 4.15 09/24] cpu_opv: Wire up x86 32/64 system call
       [not found] ` <20171114200414.2188-1-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
                     ` (5 preceding siblings ...)
  2017-11-14 20:03   ` [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call Mathieu Desnoyers
@ 2017-11-14 20:03   ` Mathieu Desnoyers
  2017-11-14 20:04   ` [RFC PATCH v2 for 4.15 12/24] cpu_opv: Implement selftests Mathieu Desnoyers
                     ` (4 subsequent siblings)
  11 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:03 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
CC: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
CC: Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
CC: Chris Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
CC: Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org>
CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
CC: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
CC: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
CC: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
CC: Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ba43ee75e425..afc6988fb2c8 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -392,3 +392,4 @@
 383	i386	statx			sys_statx
 384	i386	arch_prctl		sys_arch_prctl			compat_sys_arch_prctl
 385	i386	rseq			sys_rseq
+386	i386	cpu_opv			sys_cpu_opv
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 3ad03495bbb9..ab5d1f9f9396 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -340,6 +340,7 @@
 331	common	pkey_free		sys_pkey_free
 332	common	statx			sys_statx
 333	common	rseq			sys_rseq
+334	common	cpu_opv			sys_cpu_opv
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.11.0


* [RFC PATCH for 4.15 10/24] cpu_opv: Wire up powerpc system call
  2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2017-11-14 20:03 ` [RFC PATCH for 4.15 07/24] Restartable sequences: Wire up powerpc system call Mathieu Desnoyers
@ 2017-11-14 20:04 ` Mathieu Desnoyers
  2017-11-14 20:04 ` [RFC PATCH for 4.15 11/24] cpu_opv: Wire up ARM32 " Mathieu Desnoyers
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: Will Deacon, Andi Kleen, Paul Mackerras, H . Peter Anvin,
	Chris Lameter, Russell King, Andrew Hunter, Ingo Molnar,
	Michael Kerrisk, Catalin Marinas, Paul Turner, Josh Triplett,
	Steven Rostedt, Ben Maurer, Mathieu Desnoyers, Thomas Gleixner,
	linux-api, linuxppc-dev, linux-kernel, Andrew Morton,
	Linus Torvalds

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h      | 1 +
 arch/powerpc/include/asm/unistd.h      | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 964321a5799c..f9cdb896fbaa 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -390,3 +390,4 @@ COMPAT_SYS_SPU(pwritev2)
 SYSCALL(kexec_file_load)
 SYSCALL(statx)
 SYSCALL(rseq)
+SYSCALL(cpu_opv)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index e76bd5601ea4..48f80f452e31 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define NR_syscalls		385
+#define NR_syscalls		386
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index b1980fcd56d5..972a7d68c143 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -396,5 +396,6 @@
 #define __NR_kexec_file_load	382
 #define __NR_statx		383
 #define __NR_rseq		384
+#define __NR_cpu_opv		385
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0


* [RFC PATCH for 4.15 11/24] cpu_opv: Wire up ARM32 system call
  2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2017-11-14 20:04 ` [RFC PATCH for 4.15 10/24] cpu_opv: " Mathieu Desnoyers
@ 2017-11-14 20:04 ` Mathieu Desnoyers
  2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 15/24] membarrier: selftest: Test private expedited cmd Mathieu Desnoyers
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/arm/tools/syscall.tbl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index fbc74b5fa3ed..213ccfc2c437 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -413,3 +413,4 @@
 396	common	pkey_free		sys_pkey_free
 397	common	statx			sys_statx
 398	common	rseq			sys_rseq
+399	common	cpu_opv			sys_cpu_opv
-- 
2.11.0


* [RFC PATCH v2 for 4.15 12/24] cpu_opv: Implement selftests
       [not found] ` <20171114200414.2188-1-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
                     ` (6 preceding siblings ...)
  2017-11-14 20:03   ` [RFC PATCH for 4.15 09/24] cpu_opv: Wire up x86 32/64 " Mathieu Desnoyers
@ 2017-11-14 20:04   ` Mathieu Desnoyers
  2017-11-14 20:04   ` [RFC PATCH v2 for 4.15 13/24] Restartable sequences: Provide self-tests Mathieu Desnoyers
                     ` (3 subsequent siblings)
  11 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Shuah Khan,
	linux-kselftest-u79uwXL29TY76Z2rM5mHXA

Implement the cpu_opv selftests. They need to express dependencies on
header files and a shared object, which requires overriding the
selftests lib.mk targets. Introduce a new OVERRIDE_TARGETS define for
this purpose.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
CC: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
CC: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
CC: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
CC: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
CC: Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
CC: Chris Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
CC: Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org>
CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
CC: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: Shuah Khan <shuah-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
CC: linux-kselftest-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
Changes since v1:

- Expose a library API closely matching the rseq APIs, following
  removal of the event counter from the rseq kernel API.
- Update the Makefile to fix the "make run_tests" dependency on "all".
- Introduce an OVERRIDE_TARGETS define.
---
 MAINTAINERS                                        |    1 +
 tools/testing/selftests/Makefile                   |    1 +
 tools/testing/selftests/cpu-opv/.gitignore         |    1 +
 tools/testing/selftests/cpu-opv/Makefile           |   17 +
 .../testing/selftests/cpu-opv/basic_cpu_opv_test.c | 1157 ++++++++++++++++++++
 tools/testing/selftests/cpu-opv/cpu-op.c           |  348 ++++++
 tools/testing/selftests/cpu-opv/cpu-op.h           |   68 ++
 tools/testing/selftests/lib.mk                     |    4 +
 8 files changed, 1597 insertions(+)
 create mode 100644 tools/testing/selftests/cpu-opv/.gitignore
 create mode 100644 tools/testing/selftests/cpu-opv/Makefile
 create mode 100644 tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
 create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.c
 create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 45a1bbdaa287..85a1e8781f56 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3681,6 +3681,7 @@ L:	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
 S:	Supported
 F:	kernel/cpu_opv.c
 F:	include/uapi/linux/cpu_opv.h
+F:	tools/testing/selftests/cpu-opv/
 
 CRAMFS FILESYSTEM
 W:	http://sourceforge.net/projects/cramfs/
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 3c9c0bbe7dbb..c66e5e67cfab 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -4,6 +4,7 @@ TARGETS += breakpoints
 TARGETS += capabilities
 TARGETS += cpufreq
 TARGETS += cpu-hotplug
+TARGETS += cpu-opv
 TARGETS += efivarfs
 TARGETS += exec
 TARGETS += firmware
diff --git a/tools/testing/selftests/cpu-opv/.gitignore b/tools/testing/selftests/cpu-opv/.gitignore
new file mode 100644
index 000000000000..c7186eb95cf5
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/.gitignore
@@ -0,0 +1 @@
+basic_cpu_opv_test
diff --git a/tools/testing/selftests/cpu-opv/Makefile b/tools/testing/selftests/cpu-opv/Makefile
new file mode 100644
index 000000000000..21e63545d521
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/Makefile
@@ -0,0 +1,17 @@
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/ -L./ -Wl,-rpath=./
+
+# Use our own build rules: build against the first prerequisite only, while
+# still tracking header file changes and depending on the shared object.
+OVERRIDE_TARGETS = 1
+
+TEST_GEN_PROGS = basic_cpu_opv_test
+
+TEST_GEN_PROGS_EXTENDED = libcpu-op.so
+
+include ../lib.mk
+
+$(OUTPUT)/libcpu-op.so: cpu-op.c cpu-op.h
+	$(CC) $(CFLAGS) -shared -fPIC $< -o $@
+
+$(OUTPUT)/%: %.c $(TEST_GEN_PROGS_EXTENDED) cpu-op.h
+	$(CC) $(CFLAGS) $< -lcpu-op -o $@
diff --git a/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c b/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
new file mode 100644
index 000000000000..6b624f1939ea
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
@@ -0,0 +1,1157 @@
+/*
+ * Basic test coverage for cpu_opv system call.
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/time.h>
+#include <errno.h>
+#include <stdlib.h>
+
+#include "cpu-op.h"
+
+#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
+
+#define TESTBUFLEN	4096
+#define TESTBUFLEN_CMP	16
+
+#define TESTBUFLEN_PAGE_MAX	65536
+
+static int test_compare_eq_op(char *a, char *b, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.a, a),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.b, b),
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_compare_eq_same(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_compare_eq same";
+
+	printf("Testing %s\n", test_name);
+
+	/* Test compare_eq */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf2[i] = (char)i;
+	ret = test_compare_eq_op(buf2, buf1, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret > 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 0);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_compare_eq_diff(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_compare_eq different";
+
+	printf("Testing %s\n", test_name);
+
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN);
+	ret = test_compare_eq_op(buf2, buf1, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret == 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 1);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_compare_ne_op(char *a, char *b, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_NE_OP,
+			.len = len,
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.a, a),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.b, b),
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_compare_ne_same(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_compare_ne same";
+
+	printf("Testing %s\n", test_name);
+
+	/* Test compare_ne */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf2[i] = (char)i;
+	ret = test_compare_ne_op(buf2, buf1, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret == 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 1);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_compare_ne_diff(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_compare_ne different";
+
+	printf("Testing %s\n", test_name);
+
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN);
+	ret = test_compare_ne_op(buf2, buf1, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 0);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_2compare_eq_op(char *a, char *b, char *c, char *d,
+		size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.a, a),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.b, b),
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.a, c),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.b, d),
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_2compare_eq_index(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN_CMP];
+	char buf2[TESTBUFLEN_CMP];
+	char buf3[TESTBUFLEN_CMP];
+	char buf4[TESTBUFLEN_CMP];
+	const char *test_name = "test_2compare_eq index";
+
+	printf("Testing %s\n", test_name);
+
+	for (i = 0; i < TESTBUFLEN_CMP; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN_CMP);
+	memset(buf3, 0, TESTBUFLEN_CMP);
+	memset(buf4, 0, TESTBUFLEN_CMP);
+
+	/* First compare failure is op[0], expect 1. */
+	ret = test_2compare_eq_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 1) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 1);
+		return -1;
+	}
+
+	/* All compares succeed. */
+	for (i = 0; i < TESTBUFLEN_CMP; i++)
+		buf2[i] = (char)i;
+	ret = test_2compare_eq_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 0);
+		return -1;
+	}
+
+	/* First compare failure is op[1], expect 2. */
+	for (i = 0; i < TESTBUFLEN_CMP; i++)
+		buf3[i] = (char)i;
+	ret = test_2compare_eq_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 2) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 2);
+		return -1;
+	}
+
+	return 0;
+}
+
+static int test_2compare_ne_op(char *a, char *b, char *c, char *d,
+		size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_NE_OP,
+			.len = len,
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.a, a),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.b, b),
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_COMPARE_NE_OP,
+			.len = len,
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.a, c),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.b, d),
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_2compare_ne_index(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN_CMP];
+	char buf2[TESTBUFLEN_CMP];
+	char buf3[TESTBUFLEN_CMP];
+	char buf4[TESTBUFLEN_CMP];
+	const char *test_name = "test_2compare_ne index";
+
+	printf("Testing %s\n", test_name);
+
+	memset(buf1, 0, TESTBUFLEN_CMP);
+	memset(buf2, 0, TESTBUFLEN_CMP);
+	memset(buf3, 0, TESTBUFLEN_CMP);
+	memset(buf4, 0, TESTBUFLEN_CMP);
+
+	/* First compare ne failure is op[0], expect 1. */
+	ret = test_2compare_ne_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 1) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 1);
+		return -1;
+	}
+
+	/* All compare ne succeed. */
+	for (i = 0; i < TESTBUFLEN_CMP; i++)
+		buf1[i] = (char)i;
+	for (i = 0; i < TESTBUFLEN_CMP; i++)
+		buf3[i] = (char)i;
+	ret = test_2compare_ne_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 0);
+		return -1;
+	}
+
+	/* First compare failure is op[1], expect 2. */
+	for (i = 0; i < TESTBUFLEN_CMP; i++)
+		buf4[i] = (char)i;
+	ret = test_2compare_ne_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 2) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 2);
+		return -1;
+	}
+
+	return 0;
+}
+
+static int test_memcpy_op(void *dst, void *src, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.dst, dst),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.src, src),
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_memcpy(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_memcpy";
+
+	printf("Testing %s\n", test_name);
+
+	/* Test memcpy */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN);
+	ret = test_memcpy_op(buf2, buf1, TESTBUFLEN);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	for (i = 0; i < TESTBUFLEN; i++) {
+		if (buf2[i] != (char)i) {
+			printf("%s failed. Expecting '%d', found '%d' at offset %d\n",
+				test_name, (char)i, buf2[i], i);
+			return -1;
+		}
+	}
+	return 0;
+}
+
+static int test_memcpy_u32(void)
+{
+	int ret;
+	uint32_t v1, v2;
+	const char *test_name = "test_memcpy_u32";
+
+	printf("Testing %s\n", test_name);
+
+	/* Test memcpy_u32 */
+	v1 = 42;
+	v2 = 0;
+	ret = test_memcpy_op(&v2, &v1, sizeof(v1));
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (v1 != v2) {
+		printf("%s failed. Expecting '%d', found '%d'\n",
+			test_name, v1, v2);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_memcpy_mb_memcpy_op(void *dst1, void *src1,
+		void *dst2, void *src2, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.dst, dst1),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.src, src1),
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[1] = {
+			.op = CPU_MB_OP,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.dst, dst2),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.src, src2),
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_memcpy_mb_memcpy(void)
+{
+	int ret;
+	int v1, v2, v3;
+	const char *test_name = "test_memcpy_mb_memcpy";
+
+	printf("Testing %s\n", test_name);
+
+	/* Test memcpy, memory barrier, memcpy sequence */
+	v1 = 42;
+	v2 = v3 = 0;
+	ret = test_memcpy_mb_memcpy_op(&v2, &v1, &v3, &v2, sizeof(int));
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (v3 != v1) {
+		printf("%s failed. Expecting '%d', found '%d'\n",
+			test_name, v1, v3);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_add_op(int *v, int64_t increment)
+{
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_op_add(v, increment, sizeof(*v), cpu);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_add(void)
+{
+	int orig_v = 42, v, ret;
+	int increment = 1;
+	const char *test_name = "test_add";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_add_op(&v, increment);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != orig_v + increment) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, orig_v + increment);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_two_add_op(int *v, int64_t *increments)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_ADD_OP,
+			.len = sizeof(*v),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(
+				.u.arithmetic_op.p, v),
+			.u.arithmetic_op.count = increments[0],
+			.u.arithmetic_op.expect_fault_p = 0,
+		},
+		[1] = {
+			.op = CPU_ADD_OP,
+			.len = sizeof(*v),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(
+				.u.arithmetic_op.p, v),
+			.u.arithmetic_op.count = increments[1],
+			.u.arithmetic_op.expect_fault_p = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_two_add(void)
+{
+	int orig_v = 42, v, ret;
+	int64_t increments[2] = { 99, 123 };
+	const char *test_name = "test_two_add";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_two_add_op(&v, increments);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != orig_v + increments[0] + increments[1]) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, (int)(orig_v + increments[0] + increments[1]));
+		return -1;
+	}
+	return 0;
+}
+
+static int test_or_op(int *v, uint64_t mask)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_OR_OP,
+			.len = sizeof(*v),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(
+				.u.bitwise_op.p, v),
+			.u.bitwise_op.mask = mask,
+			.u.bitwise_op.expect_fault_p = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_or(void)
+{
+	int orig_v = 0xFF00000, v, ret;
+	uint32_t mask = 0xFFF;
+	const char *test_name = "test_or";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_or_op(&v, mask);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v | mask)) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, orig_v | mask);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_and_op(int *v, uint64_t mask)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_AND_OP,
+			.len = sizeof(*v),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(
+				.u.bitwise_op.p, v),
+			.u.bitwise_op.mask = mask,
+			.u.bitwise_op.expect_fault_p = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_and(void)
+{
+	int orig_v = 0xF00, v, ret;
+	uint32_t mask = 0xFFF;
+	const char *test_name = "test_and";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_and_op(&v, mask);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v & mask)) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, orig_v & mask);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_xor_op(int *v, uint64_t mask)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_XOR_OP,
+			.len = sizeof(*v),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(
+				.u.bitwise_op.p, v),
+			.u.bitwise_op.mask = mask,
+			.u.bitwise_op.expect_fault_p = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_xor(void)
+{
+	int orig_v = 0xF00, v, ret;
+	uint32_t mask = 0xFFF;
+	const char *test_name = "test_xor";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_xor_op(&v, mask);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v ^ mask)) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, orig_v ^ mask);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_lshift_op(int *v, uint32_t bits)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_LSHIFT_OP,
+			.len = sizeof(*v),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(
+				.u.shift_op.p, v),
+			.u.shift_op.bits = bits,
+			.u.shift_op.expect_fault_p = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_lshift(void)
+{
+	int orig_v = 0xF00, v, ret;
+	uint32_t bits = 5;
+	const char *test_name = "test_lshift";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_lshift_op(&v, bits);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v << bits)) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, orig_v << bits);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_rshift_op(int *v, uint32_t bits)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_RSHIFT_OP,
+			.len = sizeof(*v),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(
+				.u.shift_op.p, v),
+			.u.shift_op.bits = bits,
+			.u.shift_op.expect_fault_p = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_rshift(void)
+{
+	int orig_v = 0xF00, v, ret;
+	uint32_t bits = 5;
+	const char *test_name = "test_rshift";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_rshift_op(&v, bits);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v >> bits)) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, orig_v >> bits);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_cmpxchg_op(void *v, void *expect, void *old, void *n,
+		size_t len)
+{
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_op_cmpxchg(v, expect, old, n, len, cpu);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+
+static int test_cmpxchg_success(void)
+{
+	int ret;
+	uint64_t orig_v = 1, v, expect = 1, old = 0, n = 3;
+	const char *test_name = "test_cmpxchg success";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_cmpxchg_op(&v, &expect, &old, &n, sizeof(uint64_t));
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 0);
+		return -1;
+	}
+	if (v != n) {
+		printf("%s v is %lld, expecting %lld\n",
+			test_name, (long long)v, (long long)n);
+		return -1;
+	}
+	if (old != orig_v) {
+		printf("%s old is %lld, expecting %lld\n",
+			test_name, (long long)old, (long long)orig_v);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_cmpxchg_fail(void)
+{
+	int ret;
+	uint64_t orig_v = 1, v, expect = 123, old = 0, n = 3;
+	const char *test_name = "test_cmpxchg fail";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_cmpxchg_op(&v, &expect, &old, &n, sizeof(uint64_t));
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret == 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 1);
+		return -1;
+	}
+	if (v == n) {
+		printf("%s v is %lld, expecting %lld\n",
+			test_name, (long long)v, (long long)orig_v);
+		return -1;
+	}
+	if (old != orig_v) {
+		printf("%s old is %lld, expecting %lld\n",
+			test_name, (long long)old, (long long)orig_v);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_memcpy_expect_fault_op(void *dst, void *src, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.dst, dst),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.src, src),
+			.u.memcpy_op.expect_fault_dst = 0,
+			/* Return EAGAIN on fault. */
+			.u.memcpy_op.expect_fault_src = 1,
+		},
+	};
+	int cpu;
+
+	cpu = cpu_op_get_current_cpu();
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int test_memcpy_fault(void)
+{
+	int ret;
+	char buf1[TESTBUFLEN];
+	const char *test_name = "test_memcpy_fault";
+
+	printf("Testing %s\n", test_name);
+
+	/* Test memcpy from a NULL source, expecting EFAULT */
+	ret = test_memcpy_op(buf1, NULL, TESTBUFLEN);
+	if (!ret || (ret < 0 && errno != EFAULT)) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	/* Test memcpy expect fault */
+	ret = test_memcpy_expect_fault_op(buf1, NULL, TESTBUFLEN);
+	if (!ret || (ret < 0 && errno != EAGAIN)) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	return 0;
+}
+
+static int do_test_unknown_op(void)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = -1,	/* Unknown */
+			.len = 0,
+		},
+	};
+	int cpu;
+
+	cpu = cpu_op_get_current_cpu();
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int test_unknown_op(void)
+{
+	int ret;
+	const char *test_name = "test_unknown_op";
+
+	printf("Testing %s\n", test_name);
+
+	ret = do_test_unknown_op();
+	if (!ret || (ret < 0 && errno != EINVAL)) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	return 0;
+}
+
+static int do_test_max_ops(void)
+{
+	struct cpu_op opvec[] = {
+		[0] = { .op = CPU_MB_OP, },
+		[1] = { .op = CPU_MB_OP, },
+		[2] = { .op = CPU_MB_OP, },
+		[3] = { .op = CPU_MB_OP, },
+		[4] = { .op = CPU_MB_OP, },
+		[5] = { .op = CPU_MB_OP, },
+		[6] = { .op = CPU_MB_OP, },
+		[7] = { .op = CPU_MB_OP, },
+		[8] = { .op = CPU_MB_OP, },
+		[9] = { .op = CPU_MB_OP, },
+		[10] = { .op = CPU_MB_OP, },
+		[11] = { .op = CPU_MB_OP, },
+		[12] = { .op = CPU_MB_OP, },
+		[13] = { .op = CPU_MB_OP, },
+		[14] = { .op = CPU_MB_OP, },
+		[15] = { .op = CPU_MB_OP, },
+	};
+	int cpu;
+
+	cpu = cpu_op_get_current_cpu();
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int test_max_ops(void)
+{
+	int ret;
+	const char *test_name = "test_max_ops";
+
+	printf("Testing %s\n", test_name);
+
+	ret = do_test_max_ops();
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	return 0;
+}
+
+static int do_test_too_many_ops(void)
+{
+	struct cpu_op opvec[] = {
+		[0] = { .op = CPU_MB_OP, },
+		[1] = { .op = CPU_MB_OP, },
+		[2] = { .op = CPU_MB_OP, },
+		[3] = { .op = CPU_MB_OP, },
+		[4] = { .op = CPU_MB_OP, },
+		[5] = { .op = CPU_MB_OP, },
+		[6] = { .op = CPU_MB_OP, },
+		[7] = { .op = CPU_MB_OP, },
+		[8] = { .op = CPU_MB_OP, },
+		[9] = { .op = CPU_MB_OP, },
+		[10] = { .op = CPU_MB_OP, },
+		[11] = { .op = CPU_MB_OP, },
+		[12] = { .op = CPU_MB_OP, },
+		[13] = { .op = CPU_MB_OP, },
+		[14] = { .op = CPU_MB_OP, },
+		[15] = { .op = CPU_MB_OP, },
+		[16] = { .op = CPU_MB_OP, },
+	};
+	int cpu;
+
+	cpu = cpu_op_get_current_cpu();
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int test_too_many_ops(void)
+{
+	int ret;
+	const char *test_name = "test_too_many_ops";
+
+	printf("Testing %s\n", test_name);
+
+	ret = do_test_too_many_ops();
+	if (!ret || (ret < 0 && errno != EINVAL)) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	return 0;
+}
+
+/* Use 64kB + 1 len: one byte above the largest page size known on Linux. */
+static int test_memcpy_single_too_large(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN_PAGE_MAX + 1];
+	char buf2[TESTBUFLEN_PAGE_MAX + 1];
+	const char *test_name = "test_memcpy_single_too_large";
+
+	printf("Testing %s\n", test_name);
+
+	/* Test a single memcpy op exceeding the per-op length limit */
+	for (i = 0; i < TESTBUFLEN_PAGE_MAX + 1; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN_PAGE_MAX + 1);
+	ret = test_memcpy_op(buf2, buf1, TESTBUFLEN_PAGE_MAX + 1);
+	if (!ret || (ret < 0 && errno != EINVAL)) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	return 0;
+}
+
+static int test_memcpy_single_ok_sum_too_large_op(void *dst, void *src, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.dst, dst),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.src, src),
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.dst, dst),
+			CPU_OP_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.src, src),
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_memcpy_single_ok_sum_too_large(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_memcpy_single_ok_sum_too_large";
+
+	printf("Testing %s\n", test_name);
+
+	/* Test two memcpy ops whose summed length exceeds the limit */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN);
+	ret = test_memcpy_single_ok_sum_too_large_op(buf2, buf1, TESTBUFLEN);
+	if (!ret || (ret < 0 && errno != EINVAL)) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	int ret = 0;
+
+	ret |= test_compare_eq_same();
+	ret |= test_compare_eq_diff();
+	ret |= test_compare_ne_same();
+	ret |= test_compare_ne_diff();
+	ret |= test_2compare_eq_index();
+	ret |= test_2compare_ne_index();
+	ret |= test_memcpy();
+	ret |= test_memcpy_u32();
+	ret |= test_memcpy_mb_memcpy();
+	ret |= test_add();
+	ret |= test_two_add();
+	ret |= test_or();
+	ret |= test_and();
+	ret |= test_xor();
+	ret |= test_lshift();
+	ret |= test_rshift();
+	ret |= test_cmpxchg_success();
+	ret |= test_cmpxchg_fail();
+	ret |= test_memcpy_fault();
+	ret |= test_unknown_op();
+	ret |= test_max_ops();
+	ret |= test_too_many_ops();
+	ret |= test_memcpy_single_too_large();
+	ret |= test_memcpy_single_ok_sum_too_large();
+
+	return ret;
+}
diff --git a/tools/testing/selftests/cpu-opv/cpu-op.c b/tools/testing/selftests/cpu-opv/cpu-op.c
new file mode 100644
index 000000000000..d7ba481cca04
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/cpu-op.c
@@ -0,0 +1,348 @@
+/*
+ * cpu-op.c
+ *
+ * Copyright (C) 2017 Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; only
+ * version 2.1 of the License.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <syscall.h>
+#include <assert.h>
+#include <signal.h>
+
+#include "cpu-op.h"
+
+#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
+
+int cpu_opv(struct cpu_op *cpu_opv, int cpuopcnt, int cpu, int flags)
+{
+	return syscall(__NR_cpu_opv, cpu_opv, cpuopcnt, cpu, flags);
+}
+
+int cpu_op_get_current_cpu(void)
+{
+	int cpu;
+
+	cpu = sched_getcpu();
+	if (cpu < 0) {
+		perror("sched_getcpu()");
+		abort();
+	}
+	return cpu;
+}
+
+int cpu_op_cmpxchg(void *v, void *expect, void *old, void *n,
+		size_t len, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)old,
+			.u.memcpy_op.src = (unsigned long)v,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[1] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)n,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_add(void *v, int64_t count, size_t len, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_ADD_OP,
+			.len = len,
+			.u.arithmetic_op.p = (unsigned long)v,
+			.u.arithmetic_op.count = count,
+			.u.arithmetic_op.expect_fault_p = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv,
+		int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)&newv,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int cpu_op_cmpeqv_storep_expect_fault(intptr_t *v, intptr_t expect,
+		intptr_t *newp, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)newp,
+			.u.memcpy_op.expect_fault_dst = 0,
+			/* Return EAGAIN on src fault. */
+			.u.memcpy_op.expect_fault_src = 1,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+		off_t voffp, intptr_t *load, int cpu)
+{
+	intptr_t oldv = READ_ONCE(*v);
+	intptr_t *newp = (intptr_t *)(oldv + voffp);
+	int ret;
+
+	if (oldv == expectnot)
+		return 1;
+	ret = cpu_op_cmpeqv_storep_expect_fault(v, oldv, newp, cpu);
+	if (!ret) {
+		*load = oldv;
+		return 0;
+	}
+	if (ret > 0) {
+		errno = EAGAIN;
+		return -1;
+	}
+	return -1;
+}
+
+int cpu_op_cmpeqv_storev_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t newv2, intptr_t newv,
+		int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v2,
+			.u.memcpy_op.src = (unsigned long)&newv2,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)&newv,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_storev_mb_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t newv2, intptr_t newv,
+		int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v2,
+			.u.memcpy_op.src = (unsigned long)&newv2,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[2] = {
+			.op = CPU_MB_OP,
+		},
+		[3] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)&newv,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t expect2, intptr_t newv,
+		int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v2,
+			.u.compare_op.b = (unsigned long)&expect2,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)&newv,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_memcpy_storev(intptr_t *v, intptr_t expect,
+		void *dst, void *src, size_t len, intptr_t newv,
+		int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)dst,
+			.u.memcpy_op.src = (unsigned long)src,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)&newv,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_memcpy_mb_storev(intptr_t *v, intptr_t expect,
+		void *dst, void *src, size_t len, intptr_t newv,
+		int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)dst,
+			.u.memcpy_op.src = (unsigned long)src,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[2] = {
+			.op = CPU_MB_OP,
+		},
+		[3] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)&newv,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_addv(intptr_t *v, int64_t count, int cpu)
+{
+	return cpu_op_add(v, count, sizeof(intptr_t), cpu);
+}
diff --git a/tools/testing/selftests/cpu-opv/cpu-op.h b/tools/testing/selftests/cpu-opv/cpu-op.h
new file mode 100644
index 000000000000..ba2ec578ec50
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/cpu-op.h
@@ -0,0 +1,68 @@
+/*
+ * cpu-op.h
+ *
+ * (C) Copyright 2017 - Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef CPU_OPV_H
+#define CPU_OPV_H
+
+#include <stdlib.h>
+#include <stdint.h>
+#include <linux/cpu_opv.h>
+
+#define likely(x)		__builtin_expect(!!(x), 1)
+#define unlikely(x)		__builtin_expect(!!(x), 0)
+#define barrier()		__asm__ __volatile__("" : : : "memory")
+
+#define ACCESS_ONCE(x)		(*(__volatile__  __typeof__(x) *)&(x))
+#define WRITE_ONCE(x, v)	__extension__ ({ ACCESS_ONCE(x) = (v); })
+#define READ_ONCE(x)		ACCESS_ONCE(x)
+
+int cpu_opv(struct cpu_op *cpuopv, int cpuopcnt, int cpu, int flags);
+int cpu_op_get_current_cpu(void);
+
+int cpu_op_cmpxchg(void *v, void *expect, void *old, void *_new,
+		size_t len, int cpu);
+int cpu_op_add(void *v, int64_t count, size_t len, int cpu);
+
+int cpu_op_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv,
+		int cpu);
+int cpu_op_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+		off_t voffp, intptr_t *load, int cpu);
+int cpu_op_cmpeqv_storev_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t newv2, intptr_t newv,
+		int cpu);
+int cpu_op_cmpeqv_storev_mb_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t newv2, intptr_t newv,
+		int cpu);
+int cpu_op_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t expect2, intptr_t newv,
+		int cpu);
+int cpu_op_cmpeqv_memcpy_storev(intptr_t *v, intptr_t expect,
+		void *dst, void *src, size_t len, intptr_t newv,
+		int cpu);
+int cpu_op_cmpeqv_memcpy_mb_storev(intptr_t *v, intptr_t expect,
+		void *dst, void *src, size_t len, intptr_t newv,
+		int cpu);
+int cpu_op_addv(intptr_t *v, int64_t count, int cpu);
+
+#endif  /* CPU_OPV_H */
diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 5bef05d6ba39..441d7bc63bb7 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -105,6 +105,9 @@ COMPILE.S = $(CC) $(ASFLAGS) $(CPPFLAGS) $(TARGET_ARCH) -c
 LINK.S = $(CC) $(ASFLAGS) $(CPPFLAGS) $(LDFLAGS) $(TARGET_ARCH)
 endif
 
+# Selftest makefiles can override those targets by setting
+# OVERRIDE_TARGETS = 1.
+ifeq ($(OVERRIDE_TARGETS),)
 $(OUTPUT)/%:%.c
 	$(LINK.c) $^ $(LDLIBS) -o $@
 
@@ -113,5 +116,6 @@ $(OUTPUT)/%.o:%.S
 
 $(OUTPUT)/%:%.S
 	$(LINK.S) $^ $(LDLIBS) -o $@
+endif
 
 .PHONY: run_tests all clean install emit_tests
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH v2 for 4.15 13/24] Restartable sequences: Provide self-tests
       [not found] ` <20171114200414.2188-1-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
                     ` (7 preceding siblings ...)
  2017-11-14 20:04   ` [RFC PATCH v2 for 4.15 12/24] cpu_opv: Implement selftests Mathieu Desnoyers
@ 2017-11-14 20:04   ` Mathieu Desnoyers
  2017-11-14 20:04   ` [RFC PATCH for 4.15 14/24] Restartable sequences selftests: arm: workaround gcc asm size guess Mathieu Desnoyers
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Shuah Khan,
	linux-kselftest-u79uwXL29TY76Z2rM5mHXA

Implements two basic tests of RSEQ functionality, and one more
exhaustive parameterizable test.

The first, "basic_test", only asserts that RSEQ works moderately
correctly, e.g. that the CPU ID pointer works.

"basic_percpu_ops_test" is a slightly more "realistic" variant,
implementing a few simple per-cpu operations and testing their
correctness.

"param_test" is a parametrizable restartable sequences test. See
the "--help" output for usage.

A run_param_test.sh script runs many variants of the parametrizable
tests.

As part of those tests, a helper library "rseq" implements a user-space
API around restartable sequences. It uses the cpu_opv system call as
fallback when single-stepped by a debugger. It exposes the instruction
pointer addresses where the rseq assembly blocks begin and end, as well
as the associated abort instruction pointer, in the __rseq_table
section. This section allows debuggers to know where to place
breakpoints when single-stepping through assembly blocks which may be
aborted at any point by the kernel.
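
For illustration, here is a minimal sketch (not part of this patch) of
how a user-space tool could walk that section to enumerate critical
sections and their abort targets. It assumes the GNU-linker-provided
__start___rseq_table/__stop___rseq_table symbols and the struct rseq_cs
field layout used by this series; the entry struct and function names
below are made up for the example:

    #include <inttypes.h>
    #include <stdio.h>

    struct rseq_table_entry {           /* mirrors struct rseq_cs */
        uint32_t version;
        uint32_t flags;
        uint64_t start_ip;
        uint64_t post_commit_offset;
        uint64_t abort_ip;
    } __attribute__((aligned(32)));

    extern struct rseq_table_entry __start___rseq_table[];
    extern struct rseq_table_entry __stop___rseq_table[];

    static void list_rseq_critical_sections(void)
    {
        struct rseq_table_entry *entry;

        for (entry = __start___rseq_table;
             entry < __stop___rseq_table; entry++)
            printf("start: 0x%" PRIx64 ", commit offset: 0x%" PRIx64
                   ", abort: 0x%" PRIx64 "\n",
                   entry->start_ip, entry->post_commit_offset,
                   entry->abort_ip);
    }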

The rseq library exposes APIs that present the fast-path operations.
The usage from user-space is, e.g. for a counter increment:

    cpu = rseq_cpu_start();
    ret = rseq_addv(&data->c[cpu].count, 1, cpu);
    if (likely(!ret))
        return 0;        /* Success. */
    do {
        cpu = rseq_current_cpu();
        ret = cpu_op_addv(&data->c[cpu].count, 1, cpu);
        if (likely(!ret))
            return 0;    /* Success. */
    } while (ret > 0 || errno == EAGAIN);
    perror("cpu_op_addv");
    return -1;           /* Unexpected error. */

PowerPC tests have been implemented by Boqun Feng.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
CC: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
CC: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
CC: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
CC: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
CC: Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
CC: Chris Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
CC: Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org>
CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
CC: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: Shuah Khan <shuah-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
CC: linux-kselftest-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
Changes since v1:
- Provide abort-ip signature: The abort-ip signature is located just
  before the abort-ip target. It is currently hardcoded, but a
  user-space application could use the __rseq_table to iterate on all
  abort-ip targets and use a random value as signature if needed in the
  future.
- Add rseq_prepare_unload(): Libraries and JIT code using rseq critical
  sections need to issue rseq_prepare_unload() on each thread at least
  once before reclaim of struct rseq_cs (see the sketch after this list).
- Use initial-exec TLS model, non-weak symbol: The initial-exec model is
  signal-safe, whereas the global-dynamic model is not. Remove the
  "weak" symbol attribute from __rseq_abi in rseq.c. The librseq.so
  library will have ownership of that symbol, and there is no reason for
  an application or user library to try to define that symbol.
  The expected use is to link against librseq.so, which owns and provides
  that symbol.
- Set cpu_id to -2 on register error
- Add rseq_len syscall parameter, rseq_cs version
- Ensure disassembler-friendly signature: x86 32/64 disassemblers have a
  hard time decoding the instruction stream after a bad instruction. Use
  a nopl instruction to encode the signature. Suggested by Andy Lutomirski.
- Exercise parametrized test variants in a shell script.
- Restartable sequences selftests: Remove use of event counter.
- Use cpu_id_start field:  With the cpu_id_start field, the C
  preparation phase of the fast-path does not need to compare cpu_id < 0
  anymore.
- Signal-safe registration and refcounting: Allow libraries using
  librseq.so to register it from signal handlers.
- Use OVERRIDE_TARGETS in makefile.
- Use "m" constraints for rseq_cs field.
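
As a hypothetical illustration of the rseq_prepare_unload() item above
(not part of this patch; the no-argument signature is an assumption based
on the changelog wording), a library or JIT engine embedding rseq
critical sections could hook its per-thread teardown like this before
the memory holding its struct rseq_cs descriptors is reclaimed:

    /* Hypothetical per-thread teardown hook in a JIT engine. */
    static void jit_thread_teardown(void)
    {
        /*
         * Ensure this thread no longer references any struct rseq_cs
         * owned by the code region about to be reclaimed.
         */
        rseq_prepare_unload();
    }
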
---
 MAINTAINERS                                        |    1 +
 tools/testing/selftests/Makefile                   |    1 +
 tools/testing/selftests/rseq/.gitignore            |    4 +
 tools/testing/selftests/rseq/Makefile              |   23 +
 .../testing/selftests/rseq/basic_percpu_ops_test.c |  333 +++++
 tools/testing/selftests/rseq/basic_test.c          |   55 +
 tools/testing/selftests/rseq/param_test.c          | 1285 ++++++++++++++++++++
 tools/testing/selftests/rseq/rseq-arm.h            |  535 ++++++++
 tools/testing/selftests/rseq/rseq-ppc.h            |  567 +++++++++
 tools/testing/selftests/rseq/rseq-x86.h            |  898 ++++++++++++++
 tools/testing/selftests/rseq/rseq.c                |  116 ++
 tools/testing/selftests/rseq/rseq.h                |  154 +++
 tools/testing/selftests/rseq/run_param_test.sh     |  124 ++
 13 files changed, 4096 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/.gitignore
 create mode 100644 tools/testing/selftests/rseq/Makefile
 create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c
 create mode 100644 tools/testing/selftests/rseq/basic_test.c
 create mode 100644 tools/testing/selftests/rseq/param_test.c
 create mode 100644 tools/testing/selftests/rseq/rseq-arm.h
 create mode 100644 tools/testing/selftests/rseq/rseq-ppc.h
 create mode 100644 tools/testing/selftests/rseq/rseq-x86.h
 create mode 100644 tools/testing/selftests/rseq/rseq.c
 create mode 100644 tools/testing/selftests/rseq/rseq.h
 create mode 100755 tools/testing/selftests/rseq/run_param_test.sh

diff --git a/MAINTAINERS b/MAINTAINERS
index 85a1e8781f56..d0bf5c5b4267 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11515,6 +11515,7 @@ S:	Supported
 F:	kernel/rseq.c
 F:	include/uapi/linux/rseq.h
 F:	include/trace/events/rseq.h
+F:	tools/testing/selftests/rseq/
 
 RFKILL
 M:	Johannes Berg <johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index c66e5e67cfab..b7fcd7bcb87e 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -25,6 +25,7 @@ TARGETS += nsfs
 TARGETS += powerpc
 TARGETS += pstore
 TARGETS += ptrace
+TARGETS += rseq
 TARGETS += seccomp
 TARGETS += sigaltstack
 TARGETS += size
diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
new file mode 100644
index 000000000000..9409c3db99b2
--- /dev/null
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -0,0 +1,4 @@
+basic_percpu_ops_test
+basic_test
+basic_rseq_op_test
+param_test
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
new file mode 100644
index 000000000000..e4f638e5752c
--- /dev/null
+++ b/tools/testing/selftests/rseq/Makefile
@@ -0,0 +1,23 @@
+CFLAGS += -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/ -L./ -Wl,-rpath=./
+LDLIBS += -lpthread
+
+# Own dependencies because we only want to build against 1st prerequisite, but
+# still track changes to header files and depend on shared object.
+OVERRIDE_TARGETS = 1
+
+TEST_GEN_PROGS = basic_test basic_percpu_ops_test param_test
+
+TEST_GEN_PROGS_EXTENDED = librseq.so libcpu-op.so
+
+TEST_PROGS = run_param_test.sh
+
+include ../lib.mk
+
+$(OUTPUT)/librseq.so: rseq.c rseq.h rseq-*.h
+	$(CC) $(CFLAGS) -shared -fPIC $< $(LDLIBS) -o $@
+
+$(OUTPUT)/libcpu-op.so: ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h
+	$(CC) $(CFLAGS) -shared -fPIC $< $(LDLIBS) -o $@
+
+$(OUTPUT)/%: %.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h ../cpu-opv/cpu-op.h
+	$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -lcpu-op -o $@
diff --git a/tools/testing/selftests/rseq/basic_percpu_ops_test.c b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
new file mode 100644
index 000000000000..e5f7fed06a03
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
@@ -0,0 +1,333 @@
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+
+#include "rseq.h"
+#include "cpu-op.h"
+
+#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
+
+struct percpu_lock_entry {
+	intptr_t v;
+} __attribute__((aligned(128)));
+
+struct percpu_lock {
+	struct percpu_lock_entry c[CPU_SETSIZE];
+};
+
+struct test_data_entry {
+	intptr_t count;
+} __attribute__((aligned(128)));
+
+struct spinlock_test_data {
+	struct percpu_lock lock;
+	struct test_data_entry c[CPU_SETSIZE];
+	int reps;
+};
+
+struct percpu_list_node {
+	intptr_t data;
+	struct percpu_list_node *next;
+};
+
+struct percpu_list_entry {
+	struct percpu_list_node *head;
+} __attribute__((aligned(128)));
+
+struct percpu_list {
+	struct percpu_list_entry c[CPU_SETSIZE];
+};
+
+/* A simple percpu spinlock.  Returns the cpu lock was acquired on. */
+int rseq_percpu_lock(struct percpu_lock *lock)
+{
+	int cpu;
+
+	for (;;) {
+		int ret;
+
+#ifndef SKIP_FASTPATH
+		/* Try fast path. */
+		cpu = rseq_cpu_start();
+		ret = rseq_cmpeqv_storev(&lock->c[cpu].v,
+				0, 1, cpu);
+		if (likely(!ret))
+			break;
+		if (ret > 0)
+			continue;	/* Retry. */
+#endif
+	slowpath:
+		__attribute__((unused));
+		/* Fallback on cpu_opv system call. */
+		cpu = rseq_current_cpu();
+		ret = cpu_op_cmpeqv_storev(&lock->c[cpu].v, 0, 1, cpu);
+		if (likely(!ret))
+			break;
+		assert(ret >= 0 || errno == EAGAIN);
+	}
+	/*
+	 * Acquire semantic when taking lock after control dependency.
+	 * Matches rseq_smp_store_release().
+	 */
+	rseq_smp_acquire__after_ctrl_dep();
+	return cpu;
+}
+
+void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+	assert(lock->c[cpu].v == 1);
+	/*
+	 * Release lock, with release semantic. Matches
+	 * rseq_smp_acquire__after_ctrl_dep().
+	 */
+	rseq_smp_store_release(&lock->c[cpu].v, 0);
+}
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+	struct spinlock_test_data *data = arg;
+	int i, cpu;
+
+	if (rseq_register_current_thread()) {
+		fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		abort();
+	}
+	for (i = 0; i < data->reps; i++) {
+		cpu = rseq_percpu_lock(&data->lock);
+		data->c[cpu].count++;
+		rseq_percpu_unlock(&data->lock, cpu);
+	}
+	if (rseq_unregister_current_thread()) {
+		fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		abort();
+	}
+
+	return NULL;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu
+ * lock.  Obviously real applications might prefer to simply use a
+ * per-cpu increment; however, this is reasonable for a test and the
+ * lock can be extended to synchronize more complicated operations.
+ */
+void test_percpu_spinlock(void)
+{
+	const int num_threads = 200;
+	int i;
+	uint64_t sum;
+	pthread_t test_threads[num_threads];
+	struct spinlock_test_data data;
+
+	memset(&data, 0, sizeof(data));
+	data.reps = 5000;
+
+	for (i = 0; i < num_threads; i++)
+		pthread_create(&test_threads[i], NULL,
+			test_percpu_spinlock_thread, &data);
+
+	for (i = 0; i < num_threads; i++)
+		pthread_join(test_threads[i], NULL);
+
+	sum = 0;
+	for (i = 0; i < CPU_SETSIZE; i++)
+		sum += data.c[i].count;
+
+	assert(sum == (uint64_t)data.reps * num_threads);
+}
+
+int percpu_list_push(struct percpu_list *list, struct percpu_list_node *node)
+{
+	intptr_t *targetptr, newval, expect;
+	int cpu, ret;
+
+#ifndef SKIP_FASTPATH
+	/* Try fast path. */
+	cpu = rseq_cpu_start();
+	/* Load list->c[cpu].head with single-copy atomicity. */
+	expect = (intptr_t)READ_ONCE(list->c[cpu].head);
+	newval = (intptr_t)node;
+	targetptr = (intptr_t *)&list->c[cpu].head;
+	node->next = (struct percpu_list_node *)expect;
+	ret = rseq_cmpeqv_storev(targetptr, expect, newval, cpu);
+	if (likely(!ret))
+		return cpu;
+#endif
+	/* Fallback on cpu_opv system call. */
+	slowpath:
+		__attribute__((unused));
+	for (;;) {
+		cpu = rseq_current_cpu();
+		/* Load list->c[cpu].head with single-copy atomicity. */
+		expect = (intptr_t)READ_ONCE(list->c[cpu].head);
+		newval = (intptr_t)node;
+		targetptr = (intptr_t *)&list->c[cpu].head;
+		node->next = (struct percpu_list_node *)expect;
+		ret = cpu_op_cmpeqv_storev(targetptr, expect, newval, cpu);
+		if (likely(!ret))
+			break;
+		assert(ret >= 0 || errno == EAGAIN);
+	}
+	return cpu;
+}
+
+/*
+ * Unlike a traditional lock-less linked list, the availability of an
+ * rseq primitive allows us to implement pop without concerns over
+ * ABA-type races.
+ */
+struct percpu_list_node *percpu_list_pop(struct percpu_list *list)
+{
+	struct percpu_list_node *head;
+	int cpu, ret;
+
+#ifndef SKIP_FASTPATH
+	/* Try fast path. */
+	cpu = rseq_cpu_start();
+	ret = rseq_cmpnev_storeoffp_load((intptr_t *)&list->c[cpu].head,
+		(intptr_t)NULL,
+		offsetof(struct percpu_list_node, next),
+		(intptr_t *)&head, cpu);
+	if (likely(!ret))
+		return head;
+	if (ret > 0)
+		return NULL;
+#endif
+	/* Fallback on cpu_opv system call. */
+	slowpath:
+		__attribute__((unused));
+	for (;;) {
+		cpu = rseq_current_cpu();
+		ret = cpu_op_cmpnev_storeoffp_load(
+			(intptr_t *)&list->c[cpu].head,
+			(intptr_t)NULL,
+			offsetof(struct percpu_list_node, next),
+			(intptr_t *)&head, cpu);
+		if (likely(!ret))
+			break;
+		if (ret > 0)
+			return NULL;
+		assert(ret >= 0 || errno == EAGAIN);
+	}
+	return head;
+}
+
+void *test_percpu_list_thread(void *arg)
+{
+	int i;
+	struct percpu_list *list = (struct percpu_list *)arg;
+
+	if (rseq_register_current_thread()) {
+		fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		abort();
+	}
+
+	for (i = 0; i < 100000; i++) {
+		struct percpu_list_node *node = percpu_list_pop(list);
+
+		sched_yield();  /* encourage shuffling */
+		if (node)
+			percpu_list_push(list, node);
+	}
+
+	if (rseq_unregister_current_thread()) {
+		fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		abort();
+	}
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu linked list from many threads.  */
+void test_percpu_list(void)
+{
+	int i, j;
+	uint64_t sum = 0, expected_sum = 0;
+	struct percpu_list list;
+	pthread_t test_threads[200];
+	cpu_set_t allowed_cpus;
+
+	memset(&list, 0, sizeof(list));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		for (j = 1; j <= 100; j++) {
+			struct percpu_list_node *node;
+
+			expected_sum += j;
+
+			node = malloc(sizeof(*node));
+			assert(node);
+			node->data = j;
+			node->next = list.c[i].head;
+			list.c[i].head = node;
+		}
+	}
+
+	for (i = 0; i < 200; i++)
+		assert(pthread_create(&test_threads[i], NULL,
+			test_percpu_list_thread, &list) == 0);
+
+	for (i = 0; i < 200; i++)
+		pthread_join(test_threads[i], NULL);
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		cpu_set_t pin_mask;
+		struct percpu_list_node *node;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		CPU_ZERO(&pin_mask);
+		CPU_SET(i, &pin_mask);
+		sched_setaffinity(0, sizeof(pin_mask), &pin_mask);
+
+		while ((node = percpu_list_pop(&list))) {
+			sum += node->data;
+			free(node);
+		}
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+int main(int argc, char **argv)
+{
+	if (rseq_register_current_thread()) {
+		fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		goto error;
+	}
+	printf("spinlock\n");
+	test_percpu_spinlock();
+	printf("percpu_list\n");
+	test_percpu_list();
+	if (rseq_unregister_current_thread()) {
+		fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		goto error;
+	}
+	return 0;
+
+error:
+	return -1;
+}
+
diff --git a/tools/testing/selftests/rseq/basic_test.c b/tools/testing/selftests/rseq/basic_test.c
new file mode 100644
index 000000000000..e2086b3885d7
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_test.c
@@ -0,0 +1,55 @@
+/*
+ * Basic test coverage for critical regions and rseq_current_cpu().
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+void test_cpu_pointer(void)
+{
+	cpu_set_t affinity, test_affinity;
+	int i;
+
+	sched_getaffinity(0, sizeof(affinity), &affinity);
+	CPU_ZERO(&test_affinity);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (CPU_ISSET(i, &affinity)) {
+			CPU_SET(i, &test_affinity);
+			sched_setaffinity(0, sizeof(test_affinity),
+					&test_affinity);
+			assert(sched_getcpu() == i);
+			assert(rseq_current_cpu() == i);
+			assert(rseq_current_cpu_raw() == i);
+			assert(rseq_cpu_start() == i);
+			CPU_CLR(i, &test_affinity);
+		}
+	}
+	sched_setaffinity(0, sizeof(affinity), &affinity);
+}
+
+int main(int argc, char **argv)
+{
+	if (rseq_register_current_thread()) {
+		fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		goto init_thread_error;
+	}
+	printf("testing current cpu\n");
+	test_cpu_pointer();
+	if (rseq_unregister_current_thread()) {
+		fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		goto init_thread_error;
+	}
+	return 0;
+
+init_thread_error:
+	return -1;
+}
diff --git a/tools/testing/selftests/rseq/param_test.c b/tools/testing/selftests/rseq/param_test.c
new file mode 100644
index 000000000000..c7a16b656a36
--- /dev/null
+++ b/tools/testing/selftests/rseq/param_test.c
@@ -0,0 +1,1285 @@
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+#include <poll.h>
+#include <sys/types.h>
+#include <signal.h>
+#include <errno.h>
+#include <stddef.h>
+
+#include "cpu-op.h"
+
+static inline pid_t gettid(void)
+{
+	return syscall(__NR_gettid);
+}
+
+#define NR_INJECT	9
+static int loop_cnt[NR_INJECT + 1];
+
+static int opt_modulo, verbose;
+
+static int opt_yield, opt_signal, opt_sleep,
+		opt_disable_rseq, opt_threads = 200,
+		opt_disable_mod = 0, opt_test = 's', opt_mb = 0;
+
+static long long opt_reps = 5000;
+
+static __thread __attribute__((tls_model("initial-exec"))) unsigned int signals_delivered;
+
+#ifndef BENCHMARK
+
+static __thread __attribute__((tls_model("initial-exec"))) unsigned int yield_mod_cnt, nr_abort;
+
+#define printf_verbose(fmt, ...)			\
+	do {						\
+		if (verbose)				\
+			printf(fmt, ## __VA_ARGS__);	\
+	} while (0)
+
+#define RSEQ_INJECT_INPUT \
+	, [loop_cnt_1]"m"(loop_cnt[1]) \
+	, [loop_cnt_2]"m"(loop_cnt[2]) \
+	, [loop_cnt_3]"m"(loop_cnt[3]) \
+	, [loop_cnt_4]"m"(loop_cnt[4]) \
+	, [loop_cnt_5]"m"(loop_cnt[5]) \
+	, [loop_cnt_6]"m"(loop_cnt[6])
+
+#if defined(__x86_64__) || defined(__i386__)
+
+#define INJECT_ASM_REG	"eax"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"mov %[loop_cnt_" #n "], %%" INJECT_ASM_REG "\n\t" \
+	"test %%" INJECT_ASM_REG ",%%" INJECT_ASM_REG "\n\t" \
+	"jz 333f\n\t" \
+	"222:\n\t" \
+	"dec %%" INJECT_ASM_REG "\n\t" \
+	"jnz 222b\n\t" \
+	"333:\n\t"
+
+#elif defined(__ARMEL__)
+
+#define INJECT_ASM_REG	"r4"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"ldr " INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
+	"cmp " INJECT_ASM_REG ", #0\n\t" \
+	"beq 333f\n\t" \
+	"222:\n\t" \
+	"subs " INJECT_ASM_REG ", #1\n\t" \
+	"bne 222b\n\t" \
+	"333:\n\t"
+
+#elif __PPC__
+#define INJECT_ASM_REG	"r18"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"lwz %%" INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
+	"cmpwi %%" INJECT_ASM_REG ", 0\n\t" \
+	"beq 333f\n\t" \
+	"222:\n\t" \
+	"subic. %%" INJECT_ASM_REG ", %%" INJECT_ASM_REG ", 1\n\t" \
+	"bne 222b\n\t" \
+	"333:\n\t"
+#else
+#error unsupported target
+#endif
+
+#define RSEQ_INJECT_FAILED \
+	nr_abort++;
+
+#define RSEQ_INJECT_C(n) \
+{ \
+	int loc_i, loc_nr_loops = loop_cnt[n]; \
+	\
+	for (loc_i = 0; loc_i < loc_nr_loops; loc_i++) { \
+		barrier(); \
+	} \
+	if (loc_nr_loops == -1 && opt_modulo) { \
+		if (yield_mod_cnt == opt_modulo - 1) { \
+			if (opt_sleep > 0) \
+				poll(NULL, 0, opt_sleep); \
+			if (opt_yield) \
+				sched_yield(); \
+			if (opt_signal) \
+				raise(SIGUSR1); \
+			yield_mod_cnt = 0; \
+		} else { \
+			yield_mod_cnt++; \
+		} \
+	} \
+}
+
+#else
+
+#define printf_verbose(fmt, ...)
+
+#endif /* BENCHMARK */
+
+#include "rseq.h"
+
+struct percpu_lock_entry {
+	intptr_t v;
+} __attribute__((aligned(128)));
+
+struct percpu_lock {
+	struct percpu_lock_entry c[CPU_SETSIZE];
+};
+
+struct test_data_entry {
+	intptr_t count;
+} __attribute__((aligned(128)));
+
+struct spinlock_test_data {
+	struct percpu_lock lock;
+	struct test_data_entry c[CPU_SETSIZE];
+};
+
+struct spinlock_thread_test_data {
+	struct spinlock_test_data *data;
+	long long reps;
+	int reg;
+};
+
+struct inc_test_data {
+	struct test_data_entry c[CPU_SETSIZE];
+};
+
+struct inc_thread_test_data {
+	struct inc_test_data *data;
+	long long reps;
+	int reg;
+};
+
+struct percpu_list_node {
+	intptr_t data;
+	struct percpu_list_node *next;
+};
+
+struct percpu_list_entry {
+	struct percpu_list_node *head;
+} __attribute__((aligned(128)));
+
+struct percpu_list {
+	struct percpu_list_entry c[CPU_SETSIZE];
+};
+
+#define BUFFER_ITEM_PER_CPU	100
+
+struct percpu_buffer_node {
+	intptr_t data;
+};
+
+struct percpu_buffer_entry {
+	intptr_t offset;
+	intptr_t buflen;
+	struct percpu_buffer_node **array;
+} __attribute__((aligned(128)));
+
+struct percpu_buffer {
+	struct percpu_buffer_entry c[CPU_SETSIZE];
+};
+
+#define MEMCPY_BUFFER_ITEM_PER_CPU	100
+
+struct percpu_memcpy_buffer_node {
+	intptr_t data1;
+	uint64_t data2;
+};
+
+struct percpu_memcpy_buffer_entry {
+	intptr_t offset;
+	intptr_t buflen;
+	struct percpu_memcpy_buffer_node *array;
+} __attribute__((aligned(128)));
+
+struct percpu_memcpy_buffer {
+	struct percpu_memcpy_buffer_entry c[CPU_SETSIZE];
+};
+
+/* A simple percpu spinlock.  Returns the cpu lock was acquired on. */
+static int rseq_percpu_lock(struct percpu_lock *lock)
+{
+	int cpu;
+
+	for (;;) {
+		int ret;
+
+#ifndef SKIP_FASTPATH
+		/* Try fast path. */
+		cpu = rseq_cpu_start();
+		ret = rseq_cmpeqv_storev(&lock->c[cpu].v,
+				0, 1, cpu);
+		if (likely(!ret))
+			break;
+		if (ret > 0)
+			continue;	/* Retry. */
+#endif
+	slowpath:
+		__attribute__((unused));
+		/* Fallback on cpu_opv system call. */
+		cpu = rseq_current_cpu();
+		ret = cpu_op_cmpeqv_storev(&lock->c[cpu].v, 0, 1, cpu);
+		if (likely(!ret))
+			break;
+		assert(ret >= 0 || errno == EAGAIN);
+	}
+	/*
+	 * Acquire semantic when taking lock after control dependency.
+	 * Matches rseq_smp_store_release().
+	 */
+	rseq_smp_acquire__after_ctrl_dep();
+	return cpu;
+}
+
+static void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+	assert(lock->c[cpu].v == 1);
+	/*
+	 * Release lock, with release semantic. Matches
+	 * rseq_smp_acquire__after_ctrl_dep().
+	 */
+	rseq_smp_store_release(&lock->c[cpu].v, 0);
+}
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+	struct spinlock_thread_test_data *thread_data = arg;
+	struct spinlock_test_data *data = thread_data->data;
+	int cpu;
+	long long i, reps;
+
+	if (!opt_disable_rseq && thread_data->reg
+			&& rseq_register_current_thread())
+		abort();
+	reps = thread_data->reps;
+	for (i = 0; i < reps; i++) {
+		cpu = rseq_percpu_lock(&data->lock);
+		data->c[cpu].count++;
+		rseq_percpu_unlock(&data->lock, cpu);
+#ifndef BENCHMARK
+		if (i != 0 && !(i % (reps / 10)))
+			printf_verbose("tid %d: count %lld\n", (int) gettid(), i);
+#endif
+	}
+	printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+		(int) gettid(), nr_abort, signals_delivered);
+	if (!opt_disable_rseq && thread_data->reg
+			&& rseq_unregister_current_thread())
+		abort();
+	return NULL;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu
+ * lock.  Obviously real applications might prefer to simply use a
+ * per-cpu increment; however, this is reasonable for a test and the
+ * lock can be extended to synchronize more complicated operations.
+ */
+void test_percpu_spinlock(void)
+{
+	const int num_threads = opt_threads;
+	int i, ret;
+	uint64_t sum;
+	pthread_t test_threads[num_threads];
+	struct spinlock_test_data data;
+	struct spinlock_thread_test_data thread_data[num_threads];
+
+	memset(&data, 0, sizeof(data));
+	for (i = 0; i < num_threads; i++) {
+		thread_data[i].reps = opt_reps;
+		if (opt_disable_mod <= 0 || (i % opt_disable_mod))
+			thread_data[i].reg = 1;
+		else
+			thread_data[i].reg = 0;
+		thread_data[i].data = &data;
+		ret = pthread_create(&test_threads[i], NULL,
+			test_percpu_spinlock_thread, &thread_data[i]);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	sum = 0;
+	for (i = 0; i < CPU_SETSIZE; i++)
+		sum += data.c[i].count;
+
+	assert(sum == (uint64_t)opt_reps * num_threads);
+}
+
+void *test_percpu_inc_thread(void *arg)
+{
+	struct inc_thread_test_data *thread_data = arg;
+	struct inc_test_data *data = thread_data->data;
+	long long i, reps;
+
+	if (!opt_disable_rseq && thread_data->reg
+			&& rseq_register_current_thread())
+		abort();
+	reps = thread_data->reps;
+	for (i = 0; i < reps; i++) {
+		int cpu, ret;
+
+#ifndef SKIP_FASTPATH
+		/* Try fast path. */
+		cpu = rseq_cpu_start();
+		ret = rseq_addv(&data->c[cpu].count, 1, cpu);
+		if (likely(!ret))
+			goto next;
+#endif
+	slowpath:
+		__attribute__((unused));
+		for (;;) {
+			/* Fallback on cpu_opv system call. */
+			cpu = rseq_current_cpu();
+			ret = cpu_op_addv(&data->c[cpu].count, 1, cpu);
+			if (likely(!ret))
+				break;
+			assert(ret >= 0 || errno == EAGAIN);
+		}
+	next:
+		__attribute__((unused));
+#ifndef BENCHMARK
+		if (i != 0 && !(i % (reps / 10)))
+			printf_verbose("tid %d: count %lld\n", (int) gettid(), i);
+#endif
+	}
+	printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+		(int) gettid(), nr_abort, signals_delivered);
+	if (!opt_disable_rseq && thread_data->reg
+			&& rseq_unregister_current_thread())
+		abort();
+	return NULL;
+}
+
+void test_percpu_inc(void)
+{
+	const int num_threads = opt_threads;
+	int i, ret;
+	uint64_t sum;
+	pthread_t test_threads[num_threads];
+	struct inc_test_data data;
+	struct inc_thread_test_data thread_data[num_threads];
+
+	memset(&data, 0, sizeof(data));
+	for (i = 0; i < num_threads; i++) {
+		thread_data[i].reps = opt_reps;
+		if (opt_disable_mod <= 0 || (i % opt_disable_mod))
+			thread_data[i].reg = 1;
+		else
+			thread_data[i].reg = 0;
+		thread_data[i].data = &data;
+		ret = pthread_create(&test_threads[i], NULL,
+			test_percpu_inc_thread, &thread_data[i]);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	sum = 0;
+	for (i = 0; i < CPU_SETSIZE; i++)
+		sum += data.c[i].count;
+
+	assert(sum == (uint64_t)opt_reps * num_threads);
+}
+
+int percpu_list_push(struct percpu_list *list, struct percpu_list_node *node)
+{
+	intptr_t *targetptr, newval, expect;
+	int cpu, ret;
+
+#ifndef SKIP_FASTPATH
+	/* Try fast path. */
+	cpu = rseq_cpu_start();
+	/* Load list->c[cpu].head with single-copy atomicity. */
+	expect = (intptr_t)READ_ONCE(list->c[cpu].head);
+	newval = (intptr_t)node;
+	targetptr = (intptr_t *)&list->c[cpu].head;
+	node->next = (struct percpu_list_node *)expect;
+	ret = rseq_cmpeqv_storev(targetptr, expect, newval, cpu);
+	if (likely(!ret))
+		return cpu;
+#endif
+	/* Fallback on cpu_opv system call. */
+slowpath:
+	__attribute__((unused));
+	for (;;) {
+		cpu = rseq_current_cpu();
+		/* Load list->c[cpu].head with single-copy atomicity. */
+		expect = (intptr_t)READ_ONCE(list->c[cpu].head);
+		newval = (intptr_t)node;
+		targetptr = (intptr_t *)&list->c[cpu].head;
+		node->next = (struct percpu_list_node *)expect;
+		ret = cpu_op_cmpeqv_storev(targetptr, expect, newval, cpu);
+		if (likely(!ret))
+			break;
+		assert(ret >= 0 || errno == EAGAIN);
+	}
+	return cpu;
+}
+
+/*
+ * Unlike a traditional lock-less linked list, the availability of an
+ * rseq primitive allows us to implement pop without concerns over
+ * ABA-type races.
+ */
+struct percpu_list_node *percpu_list_pop(struct percpu_list *list)
+{
+	struct percpu_list_node *head;
+	int cpu, ret;
+
+#ifndef SKIP_FASTPATH
+	/* Try fast path. */
+	cpu = rseq_cpu_start();
+	ret = rseq_cmpnev_storeoffp_load((intptr_t *)&list->c[cpu].head,
+		(intptr_t)NULL,
+		offsetof(struct percpu_list_node, next),
+		(intptr_t *)&head, cpu);
+	if (likely(!ret))
+		return head;
+	if (ret > 0)
+		return NULL;
+#endif
+	/* Fallback on cpu_opv system call. */
+	slowpath:
+		__attribute__((unused));
+	for (;;) {
+		cpu = rseq_current_cpu();
+		ret = cpu_op_cmpnev_storeoffp_load(
+			(intptr_t *)&list->c[cpu].head,
+			(intptr_t)NULL,
+			offsetof(struct percpu_list_node, next),
+			(intptr_t *)&head, cpu);
+		if (likely(!ret))
+			break;
+		if (ret > 0)
+			return NULL;
+		assert(ret >= 0 || errno == EAGAIN);
+	}
+	return head;
+}
+
+void *test_percpu_list_thread(void *arg)
+{
+	long long i, reps;
+	struct percpu_list *list = (struct percpu_list *)arg;
+
+	if (!opt_disable_rseq && rseq_register_current_thread())
+		abort();
+
+	reps = opt_reps;
+	for (i = 0; i < reps; i++) {
+		struct percpu_list_node *node = percpu_list_pop(list);
+
+		if (opt_yield)
+			sched_yield();  /* encourage shuffling */
+		if (node)
+			percpu_list_push(list, node);
+	}
+
+	printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+		(int) gettid(), nr_abort, signals_delivered);
+	if (!opt_disable_rseq && rseq_unregister_current_thread())
+		abort();
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu linked list from many threads.  */
+void test_percpu_list(void)
+{
+	const int num_threads = opt_threads;
+	int i, j, ret;
+	uint64_t sum = 0, expected_sum = 0;
+	struct percpu_list list;
+	pthread_t test_threads[num_threads];
+	cpu_set_t allowed_cpus;
+
+	memset(&list, 0, sizeof(list));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		for (j = 1; j <= 100; j++) {
+			struct percpu_list_node *node;
+
+			expected_sum += j;
+
+			node = malloc(sizeof(*node));
+			assert(node);
+			node->data = j;
+			node->next = list.c[i].head;
+			list.c[i].head = node;
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_create(&test_threads[i], NULL,
+			test_percpu_list_thread, &list);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		cpu_set_t pin_mask;
+		struct percpu_list_node *node;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		CPU_ZERO(&pin_mask);
+		CPU_SET(i, &pin_mask);
+		sched_setaffinity(0, sizeof(pin_mask), &pin_mask);
+
+		while ((node = percpu_list_pop(&list))) {
+			sum += node->data;
+			free(node);
+		}
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+bool percpu_buffer_push(struct percpu_buffer *buffer,
+		struct percpu_buffer_node *node)
+{
+	intptr_t *targetptr_spec, newval_spec;
+	intptr_t *targetptr_final, newval_final;
+	int cpu, ret;
+	intptr_t offset;
+
+#ifndef SKIP_FASTPATH
+	/* Try fast path. */
+	cpu = rseq_cpu_start();
+	/* Load offset with single-copy atomicity. */
+	offset = READ_ONCE(buffer->c[cpu].offset);
+	if (offset == buffer->c[cpu].buflen) {
+		if (unlikely(cpu != rseq_current_cpu_raw()))
+			goto slowpath;
+		return false;
+	}
+	newval_spec = (intptr_t)node;
+	targetptr_spec = (intptr_t *)&buffer->c[cpu].array[offset];
+	newval_final = offset + 1;
+	targetptr_final = &buffer->c[cpu].offset;
+	if (opt_mb)
+		ret = rseq_cmpeqv_trystorev_storev_release(targetptr_final,
+			offset, targetptr_spec, newval_spec,
+			newval_final, cpu);
+	else
+		ret = rseq_cmpeqv_trystorev_storev(targetptr_final,
+			offset, targetptr_spec, newval_spec,
+			newval_final, cpu);
+	if (likely(!ret))
+		return true;
+#endif
+slowpath:
+	__attribute__((unused));
+	/* Fallback on cpu_opv system call. */
+	for (;;) {
+		cpu = rseq_current_cpu();
+		/* Load offset with single-copy atomicity. */
+		offset = READ_ONCE(buffer->c[cpu].offset);
+		if (offset == buffer->c[cpu].buflen)
+			return false;
+		newval_spec = (intptr_t)node;
+		targetptr_spec = (intptr_t *)&buffer->c[cpu].array[offset];
+		newval_final = offset + 1;
+		targetptr_final = &buffer->c[cpu].offset;
+		if (opt_mb)
+			ret = cpu_op_cmpeqv_storev_mb_storev(targetptr_final,
+				offset, targetptr_spec, newval_spec,
+				newval_final, cpu);
+		else
+			ret = cpu_op_cmpeqv_storev_storev(targetptr_final,
+				offset, targetptr_spec, newval_spec,
+				newval_final, cpu);
+		if (likely(!ret))
+			break;
+		assert(ret >= 0 || errno == EAGAIN);
+	}
+	return true;
+}
+
+struct percpu_buffer_node *percpu_buffer_pop(struct percpu_buffer *buffer)
+{
+	struct percpu_buffer_node *head;
+	intptr_t *targetptr, newval;
+	int cpu, ret;
+	intptr_t offset;
+
+#ifndef SKIP_FASTPATH
+	/* Try fast path. */
+	cpu = rseq_cpu_start();
+	/* Load offset with single-copy atomicity. */
+	offset = READ_ONCE(buffer->c[cpu].offset);
+	if (offset == 0) {
+		if (unlikely(cpu != rseq_current_cpu_raw()))
+			goto slowpath;
+		return NULL;
+	}
+	head = buffer->c[cpu].array[offset - 1];
+	newval = offset - 1;
+	targetptr = (intptr_t *)&buffer->c[cpu].offset;
+	ret = rseq_cmpeqv_cmpeqv_storev(targetptr, offset,
+		(intptr_t *)&buffer->c[cpu].array[offset - 1], (intptr_t)head,
+		newval, cpu);
+	if (likely(!ret))
+		return head;
+#endif
+slowpath:
+	__attribute__((unused));
+	/* Fallback on cpu_opv system call. */
+	for (;;) {
+		cpu = rseq_current_cpu();
+		/* Load offset with single-copy atomicity. */
+		offset = READ_ONCE(buffer->c[cpu].offset);
+		if (offset == 0)
+			return NULL;
+		head = buffer->c[cpu].array[offset - 1];
+		newval = offset - 1;
+		targetptr = (intptr_t *)&buffer->c[cpu].offset;
+		ret = cpu_op_cmpeqv_cmpeqv_storev(targetptr, offset,
+			(intptr_t *)&buffer->c[cpu].array[offset - 1],
+			(intptr_t)head, newval, cpu);
+		if (likely(!ret))
+			break;
+		assert(ret >= 0 || errno == EAGAIN);
+	}
+	return head;
+}
+
+void *test_percpu_buffer_thread(void *arg)
+{
+	long long i, reps;
+	struct percpu_buffer *buffer = (struct percpu_buffer *)arg;
+
+	if (!opt_disable_rseq && rseq_register_current_thread())
+		abort();
+
+	reps = opt_reps;
+	for (i = 0; i < reps; i++) {
+		struct percpu_buffer_node *node = percpu_buffer_pop(buffer);
+
+		if (opt_yield)
+			sched_yield();  /* encourage shuffling */
+		if (node) {
+			if (!percpu_buffer_push(buffer, node)) {
+				/* Should increase buffer size. */
+				abort();
+			}
+		}
+	}
+
+	printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+		(int) gettid(), nr_abort, signals_delivered);
+	if (!opt_disable_rseq && rseq_unregister_current_thread())
+		abort();
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu buffer from many threads.  */
+void test_percpu_buffer(void)
+{
+	const int num_threads = opt_threads;
+	int i, j, ret;
+	uint64_t sum = 0, expected_sum = 0;
+	struct percpu_buffer buffer;
+	pthread_t test_threads[num_threads];
+	cpu_set_t allowed_cpus;
+
+	memset(&buffer, 0, sizeof(buffer));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		/* Worst case is every item on the same CPU. */
+		buffer.c[i].array =
+			malloc(sizeof(*buffer.c[i].array) * CPU_SETSIZE
+				* BUFFER_ITEM_PER_CPU);
+		assert(buffer.c[i].array);
+		buffer.c[i].buflen = CPU_SETSIZE * BUFFER_ITEM_PER_CPU;
+		for (j = 1; j <= BUFFER_ITEM_PER_CPU; j++) {
+			struct percpu_buffer_node *node;
+
+			expected_sum += j;
+
+			/*
+			 * We could theoretically put the word-sized
+			 * "data" directly in the buffer. However, we
+			 * want to model objects that would not fit
+			 * within a single word, so allocate an object
+			 * for each node.
+			 */
+			node = malloc(sizeof(*node));
+			assert(node);
+			node->data = j;
+			buffer.c[i].array[j - 1] = node;
+			buffer.c[i].offset++;
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_create(&test_threads[i], NULL,
+			test_percpu_buffer_thread, &buffer);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		cpu_set_t pin_mask;
+		struct percpu_buffer_node *node;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		CPU_ZERO(&pin_mask);
+		CPU_SET(i, &pin_mask);
+		sched_setaffinity(0, sizeof(pin_mask), &pin_mask);
+
+		while ((node = percpu_buffer_pop(&buffer))) {
+			sum += node->data;
+			free(node);
+		}
+		free(buffer.c[i].array);
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+bool percpu_memcpy_buffer_push(struct percpu_memcpy_buffer *buffer,
+		struct percpu_memcpy_buffer_node item)
+{
+	char *destptr, *srcptr;
+	size_t copylen;
+	intptr_t *targetptr_final, newval_final;
+	int cpu, ret;
+	intptr_t offset;
+
+#ifndef SKIP_FASTPATH
+	/* Try fast path. */
+	cpu = rseq_cpu_start();
+	/* Load offset with single-copy atomicity. */
+	offset = READ_ONCE(buffer->c[cpu].offset);
+	if (offset == buffer->c[cpu].buflen) {
+		if (unlikely(cpu != rseq_current_cpu_raw()))
+			goto slowpath;
+		return false;
+	}
+	destptr = (char *)&buffer->c[cpu].array[offset];
+	srcptr = (char *)&item;
+	copylen = sizeof(item);
+	newval_final = offset + 1;
+	targetptr_final = &buffer->c[cpu].offset;
+	if (opt_mb)
+		ret = rseq_cmpeqv_trymemcpy_storev_release(targetptr_final,
+			offset, destptr, srcptr, copylen,
+			newval_final, cpu);
+	else
+		ret = rseq_cmpeqv_trymemcpy_storev(targetptr_final,
+			offset, destptr, srcptr, copylen,
+			newval_final, cpu);
+	if (likely(!ret))
+		return true;
+#endif
+slowpath:
+	__attribute__((unused));
+	/* Fallback on cpu_opv system call. */
+	for (;;) {
+		cpu = rseq_current_cpu();
+		/* Load offset with single-copy atomicity. */
+		offset = READ_ONCE(buffer->c[cpu].offset);
+		if (offset == buffer->c[cpu].buflen)
+			return false;
+		destptr = (char *)&buffer->c[cpu].array[offset];
+		srcptr = (char *)&item;
+		copylen = sizeof(item);
+		newval_final = offset + 1;
+		targetptr_final = &buffer->c[cpu].offset;
+		/* copylen must be <= PAGE_SIZE. */
+		if (opt_mb)
+			ret = cpu_op_cmpeqv_memcpy_mb_storev(targetptr_final,
+				offset, destptr, srcptr, copylen,
+				newval_final, cpu);
+		else
+			ret = cpu_op_cmpeqv_memcpy_storev(targetptr_final,
+				offset, destptr, srcptr, copylen,
+				newval_final, cpu);
+		if (likely(!ret))
+			break;
+		assert(ret >= 0 || errno == EAGAIN);
+	}
+	return true;
+}
+
+bool percpu_memcpy_buffer_pop(struct percpu_memcpy_buffer *buffer,
+		struct percpu_memcpy_buffer_node *item)
+{
+	char *destptr, *srcptr;
+	size_t copylen;
+	intptr_t *targetptr_final, newval_final;
+	int cpu, ret;
+	intptr_t offset;
+
+#ifndef SKIP_FASTPATH
+	/* Try fast path. */
+	cpu = rseq_cpu_start();
+	/* Load offset with single-copy atomicity. */
+	offset = READ_ONCE(buffer->c[cpu].offset);
+	if (offset == 0) {
+		if (unlikely(cpu != rseq_current_cpu_raw()))
+			goto slowpath;
+		return false;
+	}
+	destptr = (char *)item;
+	srcptr = (char *)&buffer->c[cpu].array[offset - 1];
+	copylen = sizeof(*item);
+	newval_final = offset - 1;
+	targetptr_final = &buffer->c[cpu].offset;
+	ret = rseq_cmpeqv_trymemcpy_storev(targetptr_final,
+		offset, destptr, srcptr, copylen,
+		newval_final, cpu);
+	if (likely(!ret))
+		return true;
+#endif
+slowpath:
+	__attribute__((unused));
+	/* Fallback on cpu_opv system call. */
+	for (;;) {
+		cpu = rseq_current_cpu();
+		/* Load offset with single-copy atomicity. */
+		offset = READ_ONCE(buffer->c[cpu].offset);
+		if (offset == 0)
+			return false;
+		destptr = (char *)item;
+		srcptr = (char *)&buffer->c[cpu].array[offset - 1];
+		copylen = sizeof(*item);
+		newval_final = offset - 1;
+		targetptr_final = &buffer->c[cpu].offset;
+		/* copylen must be <= PAGE_SIZE. */
+		ret = cpu_op_cmpeqv_memcpy_storev(targetptr_final,
+			offset, destptr, srcptr, copylen,
+			newval_final, cpu);
+		if (likely(!ret))
+			break;
+		assert(ret >= 0 || errno == EAGAIN);
+	}
+	return true;
+}
+
+void *test_percpu_memcpy_buffer_thread(void *arg)
+{
+	long long i, reps;
+	struct percpu_memcpy_buffer *buffer = (struct percpu_memcpy_buffer *)arg;
+
+	if (!opt_disable_rseq && rseq_register_current_thread())
+		abort();
+
+	reps = opt_reps;
+	for (i = 0; i < reps; i++) {
+		struct percpu_memcpy_buffer_node item;
+		bool result;
+
+		result = percpu_memcpy_buffer_pop(buffer, &item);
+		if (opt_yield)
+			sched_yield();  /* encourage shuffling */
+		if (result) {
+			if (!percpu_memcpy_buffer_push(buffer, item)) {
+				/* Should increase buffer size. */
+				abort();
+			}
+		}
+	}
+
+	printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+		(int) gettid(), nr_abort, signals_delivered);
+	if (!opt_disable_rseq && rseq_unregister_current_thread())
+		abort();
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu buffer from many threads.  */
+void test_percpu_memcpy_buffer(void)
+{
+	const int num_threads = opt_threads;
+	int i, j, ret;
+	uint64_t sum = 0, expected_sum = 0;
+	struct percpu_memcpy_buffer buffer;
+	pthread_t test_threads[num_threads];
+	cpu_set_t allowed_cpus;
+
+	memset(&buffer, 0, sizeof(buffer));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		/* Worst case is every item on the same CPU. */
+		buffer.c[i].array =
+			malloc(sizeof(*buffer.c[i].array) * CPU_SETSIZE
+				* MEMCPY_BUFFER_ITEM_PER_CPU);
+		assert(buffer.c[i].array);
+		buffer.c[i].buflen = CPU_SETSIZE * MEMCPY_BUFFER_ITEM_PER_CPU;
+		for (j = 1; j <= MEMCPY_BUFFER_ITEM_PER_CPU; j++) {
+			expected_sum += 2 * j + 1;
+
+			/*
+			 * We could theoretically put the word-sized
+			 * "data" directly in the buffer. However, we
+			 * want to model objects that would not fit
+			 * within a single word, so each node carries
+			 * two fields that are copied by memcpy.
+			 */
+			buffer.c[i].array[j - 1].data1 = j;
+			buffer.c[i].array[j - 1].data2 = j + 1;
+			buffer.c[i].offset++;
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_create(&test_threads[i], NULL,
+			test_percpu_memcpy_buffer_thread, &buffer);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		cpu_set_t pin_mask;
+		struct percpu_memcpy_buffer_node item;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		CPU_ZERO(&pin_mask);
+		CPU_SET(i, &pin_mask);
+		sched_setaffinity(0, sizeof(pin_mask), &pin_mask);
+
+		while (percpu_memcpy_buffer_pop(&buffer, &item)) {
+			sum += item.data1;
+			sum += item.data2;
+		}
+		free(buffer.c[i].array);
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+static void test_signal_interrupt_handler(int signo)
+{
+	signals_delivered++;
+}
+
+static int set_signal_handler(void)
+{
+	int ret = 0;
+	struct sigaction sa;
+	sigset_t sigset;
+
+	ret = sigemptyset(&sigset);
+	if (ret < 0) {
+		perror("sigemptyset");
+		return ret;
+	}
+
+	sa.sa_handler = test_signal_interrupt_handler;
+	sa.sa_mask = sigset;
+	sa.sa_flags = 0;
+	ret = sigaction(SIGUSR1, &sa, NULL);
+	if (ret < 0) {
+		perror("sigaction");
+		return ret;
+	}
+
+	printf_verbose("Signal handler set for SIGUSR1\n");
+
+	return ret;
+}
+
+static void show_usage(int argc, char **argv)
+{
+	printf("Usage : %s <OPTIONS>\n",
+		argv[0]);
+	printf("OPTIONS:\n");
+	printf("	[-1 loops] Number of loops for delay injection 1\n");
+	printf("	[-2 loops] Number of loops for delay injection 2\n");
+	printf("	[-3 loops] Number of loops for delay injection 3\n");
+	printf("	[-4 loops] Number of loops for delay injection 4\n");
+	printf("	[-5 loops] Number of loops for delay injection 5\n");
+	printf("	[-6 loops] Number of loops for delay injection 6\n");
+	printf("	[-7 loops] Number of loops for delay injection 7 (-1 to enable -m)\n");
+	printf("	[-8 loops] Number of loops for delay injection 8 (-1 to enable -m)\n");
+	printf("	[-9 loops] Number of loops for delay injection 9 (-1 to enable -m)\n");
+	printf("	[-m N] Yield/sleep/kill every modulo N (default 0: disabled) (>= 0)\n");
+	printf("	[-y] Yield\n");
+	printf("	[-k] Kill thread with signal\n");
+	printf("	[-s S] S: =0: disabled (default), >0: sleep time (ms)\n");
+	printf("	[-t N] Number of threads (default 200)\n");
+	printf("	[-r N] Number of repetitions per thread (default 5000)\n");
+	printf("	[-d] Disable rseq system call (no initialization)\n");
+	printf("	[-D M] Disable rseq for each M threads\n");
+	printf("	[-T test] Choose test: (s)pinlock, (l)ist, (b)uffer, (m)emcpy, (i)ncrement\n");
+	printf("	[-M] Push into buffer and memcpy buffer with memory barriers.\n");
+	printf("	[-v] Verbose output.\n");
+	printf("	[-h] Show this help.\n");
+	printf("\n");
+}
+
+int main(int argc, char **argv)
+{
+	int i;
+
+	for (i = 1; i < argc; i++) {
+		if (argv[i][0] != '-')
+			continue;
+		switch (argv[i][1]) {
+		case '1':
+		case '2':
+		case '3':
+		case '4':
+		case '5':
+		case '6':
+		case '7':
+		case '8':
+		case '9':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			loop_cnt[argv[i][1] - '0'] = atol(argv[i + 1]);
+			i++;
+			break;
+		case 'm':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_modulo = atol(argv[i + 1]);
+			if (opt_modulo < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 's':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_sleep = atol(argv[i + 1]);
+			if (opt_sleep < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'y':
+			opt_yield = 1;
+			break;
+		case 'k':
+			opt_signal = 1;
+			break;
+		case 'd':
+			opt_disable_rseq = 1;
+			break;
+		case 'D':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_disable_mod = atol(argv[i + 1]);
+			if (opt_disable_mod < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 't':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_threads = atol(argv[i + 1]);
+			if (opt_threads < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'r':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_reps = atoll(argv[i + 1]);
+			if (opt_reps < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'h':
+			show_usage(argc, argv);
+			goto end;
+		case 'T':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_test = *argv[i + 1];
+			switch (opt_test) {
+			case 's':
+			case 'l':
+			case 'i':
+			case 'b':
+			case 'm':
+				break;
+			default:
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'v':
+			verbose = 1;
+			break;
+		case 'M':
+			opt_mb = 1;
+			break;
+		default:
+			show_usage(argc, argv);
+			goto error;
+		}
+	}
+
+	if (set_signal_handler())
+		goto error;
+
+	if (!opt_disable_rseq && rseq_register_current_thread())
+		goto error;
+	switch (opt_test) {
+	case 's':
+		printf_verbose("spinlock\n");
+		test_percpu_spinlock();
+		break;
+	case 'l':
+		printf_verbose("linked list\n");
+		test_percpu_list();
+		break;
+	case 'b':
+		printf_verbose("buffer\n");
+		test_percpu_buffer();
+		break;
+	case 'm':
+		printf_verbose("memcpy buffer\n");
+		test_percpu_memcpy_buffer();
+		break;
+	case 'i':
+		printf_verbose("counter increment\n");
+		test_percpu_inc();
+		break;
+	}
+	if (!opt_disable_rseq && rseq_unregister_current_thread())
+		abort();
+end:
+	return 0;
+
+error:
+	return -1;
+}
diff --git a/tools/testing/selftests/rseq/rseq-arm.h b/tools/testing/selftests/rseq/rseq-arm.h
new file mode 100644
index 000000000000..b02489cde80b
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -0,0 +1,535 @@
+/*
+ * rseq-arm.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#define RSEQ_SIG	0x53053053
+
+#define rseq_smp_mb()	__asm__ __volatile__ ("dmb" : : : "memory")
+#define rseq_smp_rmb()	__asm__ __volatile__ ("dmb" : : : "memory")
+#define rseq_smp_wmb()	__asm__ __volatile__ ("dmb" : : : "memory")
+
+#define rseq_smp_load_acquire(p)					\
+__extension__ ({							\
+	__typeof(*p) ____p1 = RSEQ_READ_ONCE(*p);			\
+	rseq_smp_mb();							\
+	____p1;								\
+})
+
+#define rseq_smp_acquire__after_ctrl_dep()	rseq_smp_rmb()
+
+#define rseq_smp_store_release(p, v)					\
+do {									\
+	rseq_smp_mb();							\
+	WRITE_ONCE(*p, v);						\
+} while (0)
+
+#define RSEQ_ASM_DEFINE_TABLE(section, version, flags,			\
+			start_ip, post_commit_offset, abort_ip)		\
+		".pushsection " __rseq_str(section) ", \"aw\"\n\t"	\
+		".balign 32\n\t"					\
+		".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
+		".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
+		".popsection\n\t"
+
+#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs)		\
+		__rseq_str(label) ":\n\t"				\
+		RSEQ_INJECT_ASM(1)					\
+		"adr r0, " __rseq_str(cs_label) "\n\t"			\
+		"str r0, %[" __rseq_str(rseq_cs) "]\n\t"
+
+#define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label)		\
+		RSEQ_INJECT_ASM(2)					\
+		"ldr r0, %[" __rseq_str(current_cpu_id) "]\n\t"	\
+		"cmp %[" __rseq_str(cpu_id) "], r0\n\t"		\
+		"bne " __rseq_str(label) "\n\t"
+
+#define RSEQ_ASM_DEFINE_ABORT(table_label, label, section, sig,		\
+			teardown, abort_label, version, flags, start_ip,\
+			post_commit_offset, abort_ip)			\
+		__rseq_str(table_label) ":\n\t" 			\
+		".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
+		".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
+		".word " __rseq_str(RSEQ_SIG) "\n\t"			\
+		__rseq_str(label) ":\n\t"				\
+		teardown						\
+		"b %l[" __rseq_str(abort_label) "]\n\t"
+
+#define RSEQ_ASM_DEFINE_CMPFAIL(label, section, teardown, cmpfail_label) \
+		__rseq_str(label) ":\n\t"				\
+		teardown						\
+		"b %l[" __rseq_str(cmpfail_label) "]\n\t"
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"ldr r0, %[v]\n\t"
+		"cmp %[expect], r0\n\t"
+		"bne 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		/* final store */
+		"str %[newv], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(5)
+		"b 6f\n\t"
+		RSEQ_ASM_DEFINE_ABORT(3, 4, __rseq_failure, RSEQ_SIG, "", abort,
+			0x0, 0x0, 1b, 2b-1b, 4f)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		"6:\n\t"
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "r0", "memory", "cc"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+		off_t voffp, intptr_t *load, int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"ldr r0, %[v]\n\t"
+		"cmp %[expectnot], r0\n\t"
+		"beq 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		"str r0, %[load]\n\t"
+		"add r0, %[voffp]\n\t"
+		"ldr r0, [r0]\n\t"
+		/* final store */
+		"str r0, %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(5)
+		"b 6f\n\t"
+		RSEQ_ASM_DEFINE_ABORT(3, 4, __rseq_failure, RSEQ_SIG, "", abort,
+			0x0, 0x0, 1b, 2b-1b, 4f)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		"6:\n\t"
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expectnot]"r"(expectnot),
+		  [voffp]"Ir"(voffp),
+		  [load]"m"(*load)
+		  RSEQ_INJECT_INPUT
+		: "r0", "memory", "cc"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_addv(intptr_t *v, intptr_t count, int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"ldr r0, %[v]\n\t"
+		"add r0, %[count]\n\t"
+		/* final store */
+		"str r0, %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(4)
+		"b 6f\n\t"
+		RSEQ_ASM_DEFINE_ABORT(3, 4, __rseq_failure, RSEQ_SIG, "", abort,
+			0x0, 0x0, 1b, 2b-1b, 4f)
+		"6:\n\t"
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  [v]"m"(*v),
+		  [count]"Ir"(count)
+		  RSEQ_INJECT_INPUT
+		: "r0", "memory", "cc"
+		  RSEQ_INJECT_CLOBBER
+		: abort
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t newv2, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"ldr r0, %[v]\n\t"
+		"cmp %[expect], r0\n\t"
+		"bne 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		/* try store */
+		"str %[newv2], %[v2]\n\t"
+		RSEQ_INJECT_ASM(5)
+		/* final store */
+		"str %[newv], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(6)
+		"b 6f\n\t"
+		RSEQ_ASM_DEFINE_ABORT(3, 4, __rseq_failure, RSEQ_SIG, "", abort,
+			0x0, 0x0, 1b, 2b-1b, 4f)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		"6:\n\t"
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* try store input */
+		  [v2]"m"(*v2),
+		  [newv2]"r"(newv2),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "r0", "memory", "cc"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev_release(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t newv2, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"ldr r0, %[v]\n\t"
+		"cmp %[expect], r0\n\t"
+		"bne 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		/* try store */
+		"str %[newv2], %[v2]\n\t"
+		RSEQ_INJECT_ASM(5)
+		"dmb\n\t"	/* full mb provides store-release */
+		/* final store */
+		"str %[newv], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(6)
+		"b 6f\n\t"
+		RSEQ_ASM_DEFINE_ABORT(3, 4, __rseq_failure, RSEQ_SIG, "", abort,
+			0x0, 0x0, 1b, 2b-1b, 4f)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		"6:\n\t"
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* try store input */
+		  [v2]"m"(*v2),
+		  [newv2]"r"(newv2),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "r0", "memory", "cc"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t expect2, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"ldr r0, %[v]\n\t"
+		"cmp %[expect], r0\n\t"
+		"bne 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		"ldr r0, %[v2]\n\t"
+		"cmp %[expect2], r0\n\t"
+		"bne 5f\n\t"
+		RSEQ_INJECT_ASM(5)
+		/* final store */
+		"str %[newv], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(6)
+		"b 6f\n\t"
+		RSEQ_ASM_DEFINE_ABORT(3, 4, __rseq_failure, RSEQ_SIG, "", abort,
+			0x0, 0x0, 1b, 2b-1b, 4f)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		"6:\n\t"
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* cmp2 input */
+		  [v2]"m"(*v2),
+		  [expect2]"r"(expect2),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "r0", "memory", "cc"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev(intptr_t *v, intptr_t expect,
+		void *dst, void *src, size_t len, intptr_t newv,
+		int cpu)
+{
+	uint32_t rseq_scratch[3];
+
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		"str %[src], %[rseq_scratch0]\n\t"
+		"str %[dst], %[rseq_scratch1]\n\t"
+		"str %[len], %[rseq_scratch2]\n\t"
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"ldr r0, %[v]\n\t"
+		"cmp %[expect], r0\n\t"
+		"bne 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		/* try memcpy */
+		"cmp %[len], #0\n\t" \
+		"beq 333f\n\t" \
+		"222:\n\t" \
+		"ldrb %%r0, [%[src]]\n\t" \
+		"strb %%r0, [%[dst]]\n\t" \
+		"adds %[src], #1\n\t" \
+		"adds %[dst], #1\n\t" \
+		"subs %[len], #1\n\t" \
+		"bne 222b\n\t" \
+		"333:\n\t" \
+		RSEQ_INJECT_ASM(5)
+		/* final store */
+		"str %[newv], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(6)
+		/* teardown */
+		"ldr %[len], %[rseq_scratch2]\n\t"
+		"ldr %[dst], %[rseq_scratch1]\n\t"
+		"ldr %[src], %[rseq_scratch0]\n\t"
+		"b 6f\n\t"
+		RSEQ_ASM_DEFINE_ABORT(3, 4, __rseq_failure, RSEQ_SIG,
+			/* teardown */
+			"ldr %[len], %[rseq_scratch2]\n\t"
+			"ldr %[dst], %[rseq_scratch1]\n\t"
+			"ldr %[src], %[rseq_scratch0]\n\t",
+			abort, 0x0, 0x0, 1b, 2b-1b, 4f)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure,
+			/* teardown */
+			"ldr %[len], %[rseq_scratch2]\n\t"
+			"ldr %[dst], %[rseq_scratch1]\n\t"
+			"ldr %[src], %[rseq_scratch0]\n\t",
+			cmpfail)
+		"6:\n\t"
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv),
+		  /* try memcpy input */
+		  [dst]"r"(dst),
+		  [src]"r"(src),
+		  [len]"r"(len),
+		  [rseq_scratch0]"m"(rseq_scratch[0]),
+		  [rseq_scratch1]"m"(rseq_scratch[1]),
+		  [rseq_scratch2]"m"(rseq_scratch[2])
+		  RSEQ_INJECT_INPUT
+		: "r0", "memory", "cc"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev_release(intptr_t *v, intptr_t expect,
+		void *dst, void *src, size_t len, intptr_t newv,
+		int cpu)
+{
+	uint32_t rseq_scratch[3];
+
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		"str %[src], %[rseq_scratch0]\n\t"
+		"str %[dst], %[rseq_scratch1]\n\t"
+		"str %[len], %[rseq_scratch2]\n\t"
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"ldr r0, %[v]\n\t"
+		"cmp %[expect], r0\n\t"
+		"bne 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		/* try memcpy */
+		"cmp %[len], #0\n\t" \
+		"beq 333f\n\t" \
+		"222:\n\t" \
+		"ldrb %%r0, [%[src]]\n\t" \
+		"strb %%r0, [%[dst]]\n\t" \
+		"adds %[src], #1\n\t" \
+		"adds %[dst], #1\n\t" \
+		"subs %[len], #1\n\t" \
+		"bne 222b\n\t" \
+		"333:\n\t" \
+		RSEQ_INJECT_ASM(5)
+		"dmb\n\t"	/* full mb provides store-release */
+		/* final store */
+		"str %[newv], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(6)
+		/* teardown */
+		"ldr %[len], %[rseq_scratch2]\n\t"
+		"ldr %[dst], %[rseq_scratch1]\n\t"
+		"ldr %[src], %[rseq_scratch0]\n\t"
+		"b 6f\n\t"
+		RSEQ_ASM_DEFINE_ABORT(3, 4, __rseq_failure, RSEQ_SIG,
+			/* teardown */
+			"ldr %[len], %[rseq_scratch2]\n\t"
+			"ldr %[dst], %[rseq_scratch1]\n\t"
+			"ldr %[src], %[rseq_scratch0]\n\t",
+			abort, 0x0, 0x0, 1b, 2b-1b, 4f)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure,
+			/* teardown */
+			"ldr %[len], %[rseq_scratch2]\n\t"
+			"ldr %[dst], %[rseq_scratch1]\n\t"
+			"ldr %[src], %[rseq_scratch0]\n\t",
+			cmpfail)
+		"6:\n\t"
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv),
+		  /* try memcpy input */
+		  [dst]"r"(dst),
+		  [src]"r"(src),
+		  [len]"r"(len),
+		  [rseq_scratch0]"m"(rseq_scratch[0]),
+		  [rseq_scratch1]"m"(rseq_scratch[1]),
+		  [rseq_scratch2]"m"(rseq_scratch[2])
+		  RSEQ_INJECT_INPUT
+		: "r0", "memory", "cc"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
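
All of the per-architecture helpers above share one return convention: 0 when
the final store was committed, -1 when the restartable sequence was aborted
(preemption, signal delivery or CPU migration), and 1 when the initial compare
failed. As a rough illustration (not part of the patch), a per-CPU counter
increment on top of rseq_cmpeqv_storev() simply retries on either non-zero
outcome; rseq_current_cpu_raw() and the flat counters[] layout are assumptions
made for the sake of the sketch:

#include <stdint.h>

#include "rseq.h"

/*
 * Illustrative sketch: one counter slot per possible CPU, incremented
 * through the rseq fast path and retried until the commit succeeds.
 */
static int percpu_counter_inc(intptr_t *counters)
{
	for (;;) {
		intptr_t old;
		int cpu;

		cpu = rseq_current_cpu_raw();
		if (cpu < 0)
			return -1;	/* thread not registered with rseq */
		old = counters[cpu];
		/* Commit old + 1 only if still on @cpu and value unchanged. */
		if (!rseq_cmpeqv_storev(&counters[cpu], old, old + 1, cpu))
			return 0;
		/* -1 (aborted) or 1 (value changed under us): retry. */
	}
}
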
diff --git a/tools/testing/selftests/rseq/rseq-ppc.h b/tools/testing/selftests/rseq/rseq-ppc.h
new file mode 100644
index 000000000000..bff0d97db0ff
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-ppc.h
@@ -0,0 +1,567 @@
+/*
+ * rseq-ppc.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
+ * (C) Copyright 2016 - Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#define RSEQ_SIG	0x53053053
+
+#define rseq_smp_mb()		__asm__ __volatile__ ("sync" : : : "memory")
+#define rseq_smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
+#define rseq_smp_rmb()		rseq_smp_lwsync()
+#define rseq_smp_wmb()		rseq_smp_lwsync()
+
+#define rseq_smp_load_acquire(p)					\
+__extension__ ({							\
+	__typeof(*p) ____p1 = RSEQ_READ_ONCE(*p);			\
+	rseq_smp_lwsync();						\
+	____p1;								\
+})
+
+#define rseq_smp_acquire__after_ctrl_dep()	rseq_smp_lwsync()
+
+#define rseq_smp_store_release(p, v)					\
+do {									\
+	rseq_smp_lwsync();						\
+	RSEQ_WRITE_ONCE(*p, v);						\
+} while (0)
+
+/*
+ * The __rseq_table section can be used by debuggers to better handle
+ * single-stepping through the restartable critical sections.
+ */
+
+#ifdef __PPC64__
+
+#define STORE_WORD	"std "
+#define LOAD_WORD	"ld "
+#define LOADX_WORD	"ldx "
+#define CMP_WORD	"cmpd "
+
+#define RSEQ_ASM_DEFINE_TABLE(label, section, version, flags,			\
+			start_ip, post_commit_offset, abort_ip)			\
+		".pushsection " __rseq_str(section) ", \"aw\"\n\t"		\
+		".balign 32\n\t"						\
+		__rseq_str(label) ":\n\t"					\
+		".long " __rseq_str(version) ", " __rseq_str(flags) "\n\t"	\
+		".quad " __rseq_str(start_ip) ", " __rseq_str(post_commit_offset) ", " __rseq_str(abort_ip) "\n\t" \
+		".popsection\n\t"
+
+#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs)			\
+		__rseq_str(label) ":\n\t"					\
+		RSEQ_INJECT_ASM(1)						\
+		"lis %%r17, (" __rseq_str(cs_label) ")@highest\n\t"		\
+		"ori %%r17, %%r17, (" __rseq_str(cs_label) ")@higher\n\t"	\
+		"rldicr %%r17, %%r17, 32, 31\n\t"				\
+		"oris %%r17, %%r17, (" __rseq_str(cs_label) ")@high\n\t"	\
+		"ori %%r17, %%r17, (" __rseq_str(cs_label) ")@l\n\t"		\
+		"std %%r17, %[" __rseq_str(rseq_cs) "]\n\t"
+
+#else /* #ifdef __PPC64__ */
+
+#define STORE_WORD	"stw "
+#define LOAD_WORD	"lwz "
+#define LOADX_WORD	"lwzx "
+#define CMP_WORD	"cmpw "
+
+#define RSEQ_ASM_DEFINE_TABLE(label, section, version, flags,			\
+			start_ip, post_commit_offset, abort_ip)			\
+		".pushsection " __rseq_str(section) ", \"aw\"\n\t"		\
+		".balign 32\n\t"						\
+		__rseq_str(label) ":\n\t"					\
+		".long " __rseq_str(version) ", " __rseq_str(flags) "\n\t"	\
+		/* 32-bit only supported on BE */				\
+		".long 0x0, " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) "\n\t" \
+		".popsection\n\t"
+
+#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs)			\
+		__rseq_str(label) ":\n\t"					\
+		RSEQ_INJECT_ASM(1)						\
+		"lis %%r17, (" __rseq_str(cs_label) ")@ha\n\t"			\
+		"addi %%r17, %%r17, (" __rseq_str(cs_label) ")@l\n\t"		\
+		"stw %%r17, %[" __rseq_str(rseq_cs) "]\n\t"
+
+#endif /* #ifdef __PPC64__ */
+
+#define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label)			\
+		RSEQ_INJECT_ASM(2)						\
+		"lwz %%r17, %[" __rseq_str(current_cpu_id) "]\n\t"		\
+		"cmpw cr7, %[" __rseq_str(cpu_id) "], %%r17\n\t"		\
+		"bne- cr7, " __rseq_str(label) "\n\t"
+
+#define RSEQ_ASM_DEFINE_ABORT(label, section, sig, teardown, abort_label)	\
+		".pushsection " __rseq_str(section) ", \"ax\"\n\t"		\
+		".long " __rseq_str(sig) "\n\t"					\
+		__rseq_str(label) ":\n\t"					\
+		teardown							\
+		"b %l[" __rseq_str(abort_label) "]\n\t"			\
+		".popsection\n\t"
+
+#define RSEQ_ASM_DEFINE_CMPFAIL(label, section, teardown, cmpfail_label)	\
+		".pushsection " __rseq_str(section) ", \"ax\"\n\t"		\
+		__rseq_str(label) ":\n\t"					\
+		teardown							\
+		"b %l[" __rseq_str(cmpfail_label) "]\n\t"			\
+		".popsection\n\t"
+
+
+/*
+ * RSEQ_ASM_OPs: asm operations for rseq
+ * 	RSEQ_ASM_OP_R_*: has hard-coded registers in it
+ * 	RSEQ_ASM_OP_* (else): doesn't have hard-coded registers (unless cr7)
+ */
+#define RSEQ_ASM_OP_CMPEQ(var, expect, label)					\
+		LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t"			\
+		CMP_WORD "cr7, %%r17, %[" __rseq_str(expect) "]\n\t"		\
+		"bne- cr7, " __rseq_str(label) "\n\t"
+
+#define RSEQ_ASM_OP_CMPNE(var, expectnot, label)				\
+		LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t"			\
+		CMP_WORD "cr7, %%r17, %[" __rseq_str(expectnot) "]\n\t"	\
+		"beq- cr7, " __rseq_str(label) "\n\t"
+
+#define RSEQ_ASM_OP_STORE(value, var)						\
+		STORE_WORD "%[" __rseq_str(value) "], %[" __rseq_str(var) "]\n\t"
+
+/* Load @var to r17 */
+#define RSEQ_ASM_OP_R_LOAD(var)							\
+		LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t"
+
+/* Store r17 to @var */
+#define RSEQ_ASM_OP_R_STORE(var)						\
+		STORE_WORD "%%r17, %[" __rseq_str(var) "]\n\t"
+
+/* Add @count to r17 */
+#define RSEQ_ASM_OP_R_ADD(count)						\
+		"add %%r17, %[" __rseq_str(count) "], %%r17\n\t"
+
+/* Load (r17 + voffp) to r17 */
+#define RSEQ_ASM_OP_R_LOADX(voffp)						\
+		LOADX_WORD "%%r17, %[" __rseq_str(voffp) "], %%r17\n\t"
+
+/* TODO: implement a faster memcpy. */
+#define RSEQ_ASM_OP_R_MEMCPY() \
+		"cmpdi %%r19, 0\n\t" \
+		"beq 333f\n\t" \
+		"addi %%r20, %%r20, -1\n\t" \
+		"addi %%r21, %%r21, -1\n\t" \
+		"222:\n\t" \
+		"lbzu %%r18, 1(%%r20)\n\t" \
+		"stbu %%r18, 1(%%r21)\n\t" \
+		"addi %%r19, %%r19, -1\n\t" \
+		"cmpdi %%r19, 0\n\t" \
+		"bne 222b\n\t" \
+		"333:\n\t" \
+
+#define RSEQ_ASM_OP_R_FINAL_STORE(var, post_commit_label)			\
+		STORE_WORD "%%r17, %[" __rseq_str(var) "]\n\t"			\
+		__rseq_str(post_commit_label) ":\n\t"
+
+#define RSEQ_ASM_OP_FINAL_STORE(value, var, post_commit_label)			\
+		STORE_WORD "%[" __rseq_str(value) "], %[" __rseq_str(var) "]\n\t"	\
+		__rseq_str(post_commit_label) ":\n\t"
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		/* cmp cpuid */
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		/* cmp @v equal to @expect */
+		RSEQ_ASM_OP_CMPEQ(v, expect, 5f)
+		RSEQ_INJECT_ASM(4)
+		/* final store */
+		RSEQ_ASM_OP_FINAL_STORE(newv, v, 2)
+		RSEQ_INJECT_ASM(5)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "r17"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+		off_t voffp, intptr_t *load, int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		/* cmp cpuid */
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		/* cmp @v not equal to @expectnot */
+		RSEQ_ASM_OP_CMPNE(v, expectnot, 5f)
+		RSEQ_INJECT_ASM(4)
+		/* load the value of @v */
+		RSEQ_ASM_OP_R_LOAD(v)
+		/* store it in @load */
+		RSEQ_ASM_OP_R_STORE(load)
+		/* dereference voffp(v) */
+		RSEQ_ASM_OP_R_LOADX(voffp)
+		/* final store the value at voffp(v) */
+		RSEQ_ASM_OP_R_FINAL_STORE(v, 2)
+		RSEQ_INJECT_ASM(5)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expectnot]"r"(expectnot),
+		  [voffp]"b"(voffp),
+		  [load]"m"(*load)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "r17"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_addv(intptr_t *v, intptr_t count, int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		/* cmp cpuid */
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		/* load the value of @v */
+		RSEQ_ASM_OP_R_LOAD(v)
+		/* add @count to it */
+		RSEQ_ASM_OP_R_ADD(count)
+		/* final store */
+		RSEQ_ASM_OP_R_FINAL_STORE(v, 2)
+		RSEQ_INJECT_ASM(4)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [count]"r"(count)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "r17"
+		  RSEQ_INJECT_CLOBBER
+		: abort
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t newv2, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		/* cmp cpuid */
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		/* cmp @v equal to @expect */
+		RSEQ_ASM_OP_CMPEQ(v, expect, 5f)
+		RSEQ_INJECT_ASM(4)
+		/* try store */
+		RSEQ_ASM_OP_STORE(newv2, v2)
+		RSEQ_INJECT_ASM(5)
+		/* final store */
+		RSEQ_ASM_OP_FINAL_STORE(newv, v, 2)
+		RSEQ_INJECT_ASM(6)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* try store input */
+		  [v2]"m"(*v2),
+		  [newv2]"r"(newv2),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "r17"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev_release(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t newv2, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		/* cmp cpuid */
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		/* cmp @v equal to @expect */
+		RSEQ_ASM_OP_CMPEQ(v, expect, 5f)
+		RSEQ_INJECT_ASM(4)
+		/* try store */
+		RSEQ_ASM_OP_STORE(newv2, v2)
+		RSEQ_INJECT_ASM(5)
+		/* for 'release' */
+		"lwsync\n\t"
+		/* final store */
+		RSEQ_ASM_OP_FINAL_STORE(newv, v, 2)
+		RSEQ_INJECT_ASM(6)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* try store input */
+		  [v2]"m"(*v2),
+		  [newv2]"r"(newv2),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "r17"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t expect2, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		/* cmp cpuid */
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		/* cmp @v equal to @expect */
+		RSEQ_ASM_OP_CMPEQ(v, expect, 5f)
+		RSEQ_INJECT_ASM(4)
+		/* cmp @v2 equal to @expect2 */
+		RSEQ_ASM_OP_CMPEQ(v2, expect2, 5f)
+		RSEQ_INJECT_ASM(5)
+		/* final store */
+		RSEQ_ASM_OP_FINAL_STORE(newv, v, 2)
+		RSEQ_INJECT_ASM(6)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* cmp2 input */
+		  [v2]"m"(*v2),
+		  [expect2]"r"(expect2),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "r17"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev(intptr_t *v, intptr_t expect,
+		void *dst, void *src, size_t len, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		/* setup for memcpy */
+		"mr %%r19, %[len]\n\t" \
+		"mr %%r20, %[src]\n\t" \
+		"mr %%r21, %[dst]\n\t" \
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		/* cmp cpuid */
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		/* cmp @v equal to @expect */
+		RSEQ_ASM_OP_CMPEQ(v, expect, 5f)
+		RSEQ_INJECT_ASM(4)
+		/* try memcpy */
+		RSEQ_ASM_OP_R_MEMCPY()
+		RSEQ_INJECT_ASM(5)
+		/* final store */
+		RSEQ_ASM_OP_FINAL_STORE(newv, v, 2)
+		RSEQ_INJECT_ASM(6)
+		/* teardown */
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv),
+		  /* try memcpy input */
+		  [dst]"r"(dst),
+		  [src]"r"(src),
+		  [len]"r"(len)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "r17", "r18", "r19", "r20", "r21"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev_release(intptr_t *v, intptr_t expect,
+		void *dst, void *src, size_t len, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		/* setup for memcpy */
+		"mr %%r19, %[len]\n\t" \
+		"mr %%r20, %[src]\n\t" \
+		"mr %%r21, %[dst]\n\t" \
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		/* cmp cpuid */
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		/* cmp @v equal to @expect */
+		RSEQ_ASM_OP_CMPEQ(v, expect, 5f)
+		RSEQ_INJECT_ASM(4)
+		/* try memcpy */
+		RSEQ_ASM_OP_R_MEMCPY()
+		RSEQ_INJECT_ASM(5)
+		/* for 'release' */
+		"lwsync\n\t"
+		/* final store */
+		RSEQ_ASM_OP_FINAL_STORE(newv, v, 2)
+		RSEQ_INJECT_ASM(6)
+		/* teardown */
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv),
+		  /* try memcpy input */
+		  [dst]"r"(dst),
+		  [src]"r"(src),
+		  [len]"r"(len)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "r17", "r18", "r19", "r20", "r21"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+#undef STORE_WORD
+#undef LOAD_WORD
+#undef LOADX_WORD
+#undef CMP_WORD
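
rseq_cmpnev_storeoffp_load() is the building block for a per-CPU linked-list
pop: it refuses to commit when the head equals @expectnot (e.g. NULL for an
empty list), and otherwise saves the old head into @load and publishes the
word at head + @voffp (the next pointer) as the new head, all within the
critical section. Below is a minimal sketch of such a pop, loosely following
what the test_percpu_list() selftest above exercises; the node and list types
and rseq_current_cpu_raw() are illustrative assumptions, not part of the
patch:

#include <sched.h>	/* CPU_SETSIZE */
#include <stddef.h>	/* offsetof */
#include <stdint.h>

#include "rseq.h"

struct percpu_list_node {
	intptr_t data;
	struct percpu_list_node *next;
};

struct percpu_list {
	/* One head per possible CPU; cache-line padding omitted. */
	intptr_t head[CPU_SETSIZE];
};

static struct percpu_list_node *percpu_list_pop(struct percpu_list *list)
{
	for (;;) {
		intptr_t head;
		int cpu, ret;

		cpu = rseq_current_cpu_raw();
		if (cpu < 0)
			return NULL;
		ret = rseq_cmpnev_storeoffp_load(&list->head[cpu],
				(intptr_t)NULL,	/* fail if head == NULL */
				offsetof(struct percpu_list_node, next),
				&head, cpu);
		if (!ret)
			return (struct percpu_list_node *)head;
		if (ret > 0)
			return NULL;	/* list empty on this CPU */
		/* ret < 0: sequence aborted, retry. */
	}
}
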
diff --git a/tools/testing/selftests/rseq/rseq-x86.h b/tools/testing/selftests/rseq/rseq-x86.h
new file mode 100644
index 000000000000..7e4c21751c52
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-x86.h
@@ -0,0 +1,898 @@
+/*
+ * rseq-x86.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <stdint.h>
+
+#define RSEQ_SIG	0x53053053
+
+#ifdef __x86_64__
+
+#define rseq_smp_mb()	__asm__ __volatile__ ("mfence" : : : "memory")
+#define rseq_smp_rmb()	barrier()
+#define rseq_smp_wmb()	barrier()
+
+#define rseq_smp_load_acquire(p)					\
+__extension__ ({							\
+	__typeof(*p) ____p1 = RSEQ_READ_ONCE(*p);			\
+	barrier();							\
+	____p1;								\
+})
+
+#define rseq_smp_acquire__after_ctrl_dep()	rseq_smp_rmb()
+
+#define rseq_smp_store_release(p, v)					\
+do {									\
+	barrier();							\
+	RSEQ_WRITE_ONCE(*p, v);						\
+} while (0)
+
+#define RSEQ_ASM_DEFINE_TABLE(label, section, version, flags,		\
+			start_ip, post_commit_offset, abort_ip)		\
+		".pushsection " __rseq_str(section) ", \"aw\"\n\t"	\
+		".balign 32\n\t"					\
+		__rseq_str(label) ":\n\t"				\
+		".long " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
+		".quad " __rseq_str(start_ip) ", " __rseq_str(post_commit_offset) ", " __rseq_str(abort_ip) "\n\t" \
+		".popsection\n\t"
+
+#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs)		\
+		__rseq_str(label) ":\n\t"				\
+		RSEQ_INJECT_ASM(1)					\
+		"leaq " __rseq_str(cs_label) "(%%rip), %%rax\n\t"	\
+		"movq %%rax, %[" __rseq_str(rseq_cs) "]\n\t"
+
+#define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label)		\
+		RSEQ_INJECT_ASM(2)					\
+		"cmpl %[" __rseq_str(cpu_id) "], %[" __rseq_str(current_cpu_id) "]\n\t" \
+		"jnz " __rseq_str(label) "\n\t"
+
+#define RSEQ_ASM_DEFINE_ABORT(label, section, sig, teardown, abort_label) \
+		".pushsection " __rseq_str(section) ", \"ax\"\n\t"	\
+		/* Disassembler-friendly signature: nopl <sig>(%rip). */\
+		".byte 0x0f, 0x1f, 0x05\n\t"				\
+		".long " __rseq_str(sig) "\n\t"			\
+		__rseq_str(label) ":\n\t"				\
+		teardown						\
+		"jmp %l[" __rseq_str(abort_label) "]\n\t"		\
+		".popsection\n\t"
+
+#define RSEQ_ASM_DEFINE_CMPFAIL(label, section, teardown, cmpfail_label) \
+		".pushsection " __rseq_str(section) ", \"ax\"\n\t"	\
+		__rseq_str(label) ":\n\t"				\
+		teardown						\
+		"jmp %l[" __rseq_str(cmpfail_label) "]\n\t"		\
+		".popsection\n\t"
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"cmpq %[v], %[expect]\n\t"
+		"jnz 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		/* final store */
+		"movq %[newv], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(5)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "rax"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+		off_t voffp, intptr_t *load, int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"cmpq %[v], %[expectnot]\n\t"
+		"jz 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		"movq %[v], %%rax\n\t"
+		"movq %%rax, %[load]\n\t"
+		"addq %[voffp], %%rax\n\t"
+		"movq (%%rax), %%rax\n\t"
+		/* final store */
+		"movq %%rax, %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(5)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expectnot]"r"(expectnot),
+		  [voffp]"er"(voffp),
+		  [load]"m"(*load)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "rax"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_addv(intptr_t *v, intptr_t count, int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		/* final store */
+		"addq %[count], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(4)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [count]"er"(count)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "rax"
+		  RSEQ_INJECT_CLOBBER
+		: abort
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t newv2, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"cmpq %[v], %[expect]\n\t"
+		"jnz 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		/* try store */
+		"movq %[newv2], %[v2]\n\t"
+		RSEQ_INJECT_ASM(5)
+		/* final store */
+		"movq %[newv], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(6)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* try store input */
+		  [v2]"m"(*v2),
+		  [newv2]"r"(newv2),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "rax"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+/* x86-64 is TSO. */
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev_release(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t newv2, intptr_t newv,
+		int cpu)
+{
+	return rseq_cmpeqv_trystorev_storev(v, expect, v2, newv2,
+			newv, cpu);
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t expect2, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"cmpq %[v], %[expect]\n\t"
+		"jnz 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		"cmpq %[v2], %[expect2]\n\t"
+		"jnz 5f\n\t"
+		RSEQ_INJECT_ASM(5)
+		/* final store */
+		"movq %[newv], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(6)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* cmp2 input */
+		  [v2]"m"(*v2),
+		  [expect2]"r"(expect2),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "rax"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev(intptr_t *v, intptr_t expect,
+		void *dst, void *src, size_t len, intptr_t newv,
+		int cpu)
+{
+	uint64_t rseq_scratch[3];
+
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		"movq %[src], %[rseq_scratch0]\n\t"
+		"movq %[dst], %[rseq_scratch1]\n\t"
+		"movq %[len], %[rseq_scratch2]\n\t"
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"cmpq %[v], %[expect]\n\t"
+		"jnz 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		/* try memcpy */
+		"test %[len], %[len]\n\t" \
+		"jz 333f\n\t" \
+		"222:\n\t" \
+		"movb (%[src]), %%al\n\t" \
+		"movb %%al, (%[dst])\n\t" \
+		"inc %[src]\n\t" \
+		"inc %[dst]\n\t" \
+		"dec %[len]\n\t" \
+		"jnz 222b\n\t" \
+		"333:\n\t" \
+		RSEQ_INJECT_ASM(5)
+		/* final store */
+		"movq %[newv], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(6)
+		/* teardown */
+		"movq %[rseq_scratch2], %[len]\n\t"
+		"movq %[rseq_scratch1], %[dst]\n\t"
+		"movq %[rseq_scratch0], %[src]\n\t"
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG,
+			"movq %[rseq_scratch2], %[len]\n\t"
+			"movq %[rseq_scratch1], %[dst]\n\t"
+			"movq %[rseq_scratch0], %[src]\n\t",
+			abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure,
+			"movq %[rseq_scratch2], %[len]\n\t"
+			"movq %[rseq_scratch1], %[dst]\n\t"
+			"movq %[rseq_scratch0], %[src]\n\t",
+			cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv),
+		  /* try memcpy input */
+		  [dst]"r"(dst),
+		  [src]"r"(src),
+		  [len]"r"(len),
+		  [rseq_scratch0]"m"(rseq_scratch[0]),
+		  [rseq_scratch1]"m"(rseq_scratch[1]),
+		  [rseq_scratch2]"m"(rseq_scratch[2])
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "rax"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+/* x86-64 is TSO. */
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev_release(intptr_t *v, intptr_t expect,
+		void *dst, void *src, size_t len, intptr_t newv,
+		int cpu)
+{
+	return rseq_cmpeqv_trymemcpy_storev(v, expect, dst, src,
+			len, newv, cpu);
+}
+
+#elif __i386__
+
+/*
+ * Support older 32-bit architectures that do not implement fence
+ * instructions.
+ */
+#define rseq_smp_mb()	\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+#define rseq_smp_rmb()	\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+#define rseq_smp_wmb()	\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+
+#define rseq_smp_load_acquire(p)					\
+__extension__ ({							\
+	__typeof(*p) ____p1 = RSEQ_READ_ONCE(*p);			\
+	rseq_smp_mb();							\
+	____p1;								\
+})
+
+#define rseq_smp_acquire__after_ctrl_dep()	rseq_smp_rmb()
+
+#define rseq_smp_store_release(p, v)					\
+do {									\
+	rseq_smp_mb();							\
+	RSEQ_WRITE_ONCE(*p, v);						\
+} while (0)
+
+/*
+ * Use eax as scratch register and take memory operands as input to
+ * lessen register pressure. Especially needed when compiling with -O0.
+ */
+#define RSEQ_ASM_DEFINE_TABLE(label, section, version, flags,		\
+			start_ip, post_commit_offset, abort_ip)		\
+		".pushsection " __rseq_str(section) ", \"aw\"\n\t"	\
+		".balign 32\n\t"					\
+		__rseq_str(label) ":\n\t"				\
+		".long " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
+		".long " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
+		".popsection\n\t"
+
+#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs)		\
+		__rseq_str(label) ":\n\t"				\
+		RSEQ_INJECT_ASM(1)					\
+		"movl $" __rseq_str(cs_label) ", %[rseq_cs]\n\t"
+
+#define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label)		\
+		RSEQ_INJECT_ASM(2)					\
+		"cmpl %[" __rseq_str(cpu_id) "], %[" __rseq_str(current_cpu_id) "]\n\t" \
+		"jnz " __rseq_str(label) "\n\t"
+
+#define RSEQ_ASM_DEFINE_ABORT(label, section, sig, teardown, abort_label) \
+		".pushsection " __rseq_str(section) ", \"ax\"\n\t"	\
+		/* Disassembler-friendly signature: nopl <sig>. */\
+		".byte 0x0f, 0x1f, 0x05\n\t"				\
+		".long " __rseq_str(sig) "\n\t"			\
+		__rseq_str(label) ":\n\t"				\
+		teardown						\
+		"jmp %l[" __rseq_str(abort_label) "]\n\t"		\
+		".popsection\n\t"
+
+#define RSEQ_ASM_DEFINE_CMPFAIL(label, section, teardown, cmpfail_label) \
+		".pushsection " __rseq_str(section) ", \"ax\"\n\t"	\
+		__rseq_str(label) ":\n\t"				\
+		teardown						\
+		"jmp %l[" __rseq_str(cmpfail_label) "]\n\t"		\
+		".popsection\n\t"
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"cmpl %[v], %[expect]\n\t"
+		"jnz 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		/* final store */
+		"movl %[newv], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(5)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "eax"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+		off_t voffp, intptr_t *load, int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"cmpl %[v], %[expectnot]\n\t"
+		"jz 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		"movl %[v], %%eax\n\t"
+		"movl %%eax, %[load]\n\t"
+		"addl %[voffp], %%eax\n\t"
+		"movl (%%eax), %%eax\n\t"
+		/* final store */
+		"movl %%eax, %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(5)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expectnot]"r"(expectnot),
+		  [voffp]"ir"(voffp),
+		  [load]"m"(*load)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "eax"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_addv(intptr_t *v, intptr_t count, int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		/* final store */
+		"addl %[count], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(4)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [count]"ir"(count)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "eax"
+		  RSEQ_INJECT_CLOBBER
+		: abort
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t newv2, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"cmpl %[v], %[expect]\n\t"
+		"jnz 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		/* try store */
+		"movl %[newv2], %%eax\n\t"
+		"movl %%eax, %[v2]\n\t"
+		RSEQ_INJECT_ASM(5)
+		/* final store */
+		"movl %[newv], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(6)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* try store input */
+		  [v2]"m"(*v2),
+		  [newv2]"m"(newv2),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "eax"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev_release(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t newv2, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"movl %[expect], %%eax\n\t"
+		"cmpl %[v], %%eax\n\t"
+		"jnz 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		/* try store */
+		"movl %[newv2], %[v2]\n\t"
+		RSEQ_INJECT_ASM(5)
+		"lock; addl $0,0(%%esp)\n\t"
+		/* final store */
+		"movl %[newv], %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(6)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* try store input */
+		  [v2]"m"(*v2),
+		  [newv2]"r"(newv2),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"m"(expect),
+		  [newv]"r"(newv)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "eax"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+		intptr_t *v2, intptr_t expect2, intptr_t newv,
+		int cpu)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"cmpl %[v], %[expect]\n\t"
+		"jnz 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		"cmpl %[expect2], %[v2]\n\t"
+		"jnz 5f\n\t"
+		RSEQ_INJECT_ASM(5)
+		"movl %[newv], %%eax\n\t"
+		/* final store */
+		"movl %%eax, %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(6)
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG, "", abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure, "", cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* cmp2 input */
+		  [v2]"m"(*v2),
+		  [expect2]"r"(expect2),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"r"(expect),
+		  [newv]"m"(newv)
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "eax"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+/* TODO: implement a faster memcpy. */
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev(intptr_t *v, intptr_t expect,
+		void *dst, void *src, size_t len, intptr_t newv,
+		int cpu)
+{
+	uint32_t rseq_scratch[3];
+
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		"movl %[src], %[rseq_scratch0]\n\t"
+		"movl %[dst], %[rseq_scratch1]\n\t"
+		"movl %[len], %[rseq_scratch2]\n\t"
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"movl %[expect], %%eax\n\t"
+		"cmpl %%eax, %[v]\n\t"
+		"jnz 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		/* try memcpy */
+		"test %[len], %[len]\n\t" \
+		"jz 333f\n\t" \
+		"222:\n\t" \
+		"movb (%[src]), %%al\n\t" \
+		"movb %%al, (%[dst])\n\t" \
+		"inc %[src]\n\t" \
+		"inc %[dst]\n\t" \
+		"dec %[len]\n\t" \
+		"jnz 222b\n\t" \
+		"333:\n\t" \
+		RSEQ_INJECT_ASM(5)
+		"movl %[newv], %%eax\n\t"
+		/* final store */
+		"movl %%eax, %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(6)
+		/* teardown */
+		"movl %[rseq_scratch2], %[len]\n\t"
+		"movl %[rseq_scratch1], %[dst]\n\t"
+		"movl %[rseq_scratch0], %[src]\n\t"
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG,
+			"movl %[rseq_scratch2], %[len]\n\t"
+			"movl %[rseq_scratch1], %[dst]\n\t"
+			"movl %[rseq_scratch0], %[src]\n\t",
+			abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure,
+			"movl %[rseq_scratch2], %[len]\n\t"
+			"movl %[rseq_scratch1], %[dst]\n\t"
+			"movl %[rseq_scratch0], %[src]\n\t",
+			cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"m"(expect),
+		  [newv]"m"(newv),
+		  /* try memcpy input */
+		  [dst]"r"(dst),
+		  [src]"r"(src),
+		  [len]"r"(len),
+		  [rseq_scratch0]"m"(rseq_scratch[0]),
+		  [rseq_scratch1]"m"(rseq_scratch[1]),
+		  [rseq_scratch2]"m"(rseq_scratch[2])
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "eax"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+/* TODO: implement a faster memcpy. */
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev_release(intptr_t *v, intptr_t expect,
+		void *dst, void *src, size_t len, intptr_t newv,
+		int cpu)
+{
+	uint32_t rseq_scratch[3];
+
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, __rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
+		"movl %[src], %[rseq_scratch0]\n\t"
+		"movl %[dst], %[rseq_scratch1]\n\t"
+		"movl %[len], %[rseq_scratch2]\n\t"
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+		RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+		RSEQ_INJECT_ASM(3)
+		"movl %[expect], %%eax\n\t"
+		"cmpl %%eax, %[v]\n\t"
+		"jnz 5f\n\t"
+		RSEQ_INJECT_ASM(4)
+		/* try memcpy */
+		"test %[len], %[len]\n\t" \
+		"jz 333f\n\t" \
+		"222:\n\t" \
+		"movb (%[src]), %%al\n\t" \
+		"movb %%al, (%[dst])\n\t" \
+		"inc %[src]\n\t" \
+		"inc %[dst]\n\t" \
+		"dec %[len]\n\t" \
+		"jnz 222b\n\t" \
+		"333:\n\t" \
+		RSEQ_INJECT_ASM(5)
+		"lock; addl $0,0(%%esp)\n\t"
+		"movl %[newv], %%eax\n\t"
+		/* final store */
+		"movl %%eax, %[v]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(6)
+		/* teardown */
+		"movl %[rseq_scratch2], %[len]\n\t"
+		"movl %[rseq_scratch1], %[dst]\n\t"
+		"movl %[rseq_scratch0], %[src]\n\t"
+		RSEQ_ASM_DEFINE_ABORT(4, __rseq_failure, RSEQ_SIG,
+			"movl %[rseq_scratch2], %[len]\n\t"
+			"movl %[rseq_scratch1], %[dst]\n\t"
+			"movl %[rseq_scratch0], %[src]\n\t",
+			abort)
+		RSEQ_ASM_DEFINE_CMPFAIL(5, __rseq_failure,
+			"movl %[rseq_scratch2], %[len]\n\t"
+			"movl %[rseq_scratch1], %[dst]\n\t"
+			"movl %[rseq_scratch0], %[src]\n\t",
+			cmpfail)
+		: /* gcc asm goto does not allow outputs */
+		: [cpu_id]"r"(cpu),
+		  [current_cpu_id]"m"(__rseq_abi.cpu_id),
+		  [rseq_cs]"m"(__rseq_abi.rseq_cs),
+		  /* final store input */
+		  [v]"m"(*v),
+		  [expect]"m"(expect),
+		  [newv]"m"(newv),
+		  /* try memcpy input */
+		  [dst]"r"(dst),
+		  [src]"r"(src),
+		  [len]"r"(len),
+		  [rseq_scratch0]"m"(rseq_scratch[0]),
+		  [rseq_scratch1]"m"(rseq_scratch[1]),
+		  [rseq_scratch2]"m"(rseq_scratch[2])
+		  RSEQ_INJECT_INPUT
+		: "memory", "cc", "eax"
+		  RSEQ_INJECT_CLOBBER
+		: abort, cmpfail
+	);
+	return 0;
+abort:
+	RSEQ_INJECT_FAILED
+	return -1;
+cmpfail:
+	return 1;
+}
+
+#endif
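
The *_trystorev_storev() variants order a speculative "try store" before the
committing "final store": for an array-based per-CPU buffer the item can be
written into its slot first, and the offset is only published by the final
store, so an abort before the commit leaves the buffer logically unchanged.
A rough sketch (not part of the patch; the buffer layout and
rseq_current_cpu_raw() are assumptions):

#include <sched.h>	/* CPU_SETSIZE */
#include <stdint.h>

#include "rseq.h"

struct percpu_buffer_entry {
	intptr_t offset;	/* number of items currently stored */
	intptr_t buflen;
	intptr_t *array;
};

struct percpu_buffer {
	struct percpu_buffer_entry c[CPU_SETSIZE];
};

/*
 * Returns 0 on success, 1 when the current CPU's buffer is full,
 * -1 when the thread is not registered with rseq.
 */
static int percpu_buffer_push(struct percpu_buffer *buf, intptr_t item)
{
	for (;;) {
		intptr_t offset;
		int cpu, ret;

		cpu = rseq_current_cpu_raw();
		if (cpu < 0)
			return -1;
		offset = buf->c[cpu].offset;
		if (offset == buf->c[cpu].buflen)
			return 1;
		/* Try store: item into its slot; final store: new offset. */
		ret = rseq_cmpeqv_trystorev_storev(&buf->c[cpu].offset, offset,
				&buf->c[cpu].array[offset], item,
				offset + 1, cpu);
		if (!ret)
			return 0;
		/* Aborted or offset moved under us: retry. */
	}
}
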
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
new file mode 100644
index 000000000000..3db193c0afb0
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -0,0 +1,116 @@
+/*
+ * rseq.c
+ *
+ * Copyright (C) 2016 Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; only
+ * version 2.1 of the License.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <syscall.h>
+#include <assert.h>
+#include <signal.h>
+
+#include "rseq.h"
+
+#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
+
+__attribute__((tls_model("initial-exec"))) __thread
+volatile struct rseq __rseq_abi = {
+	.cpu_id = -1,
+};
+
+static __attribute__((tls_model("initial-exec"))) __thread
+volatile int refcount;
+
+static void signal_off_save(sigset_t *oldset)
+{
+	sigset_t set;
+	int ret;
+
+	sigfillset(&set);
+	ret = pthread_sigmask(SIG_BLOCK, &set, oldset);
+	if (ret)
+		abort();
+}
+
+static void signal_restore(sigset_t oldset)
+{
+	int ret;
+
+	ret = pthread_sigmask(SIG_SETMASK, &oldset, NULL);
+	if (ret)
+		abort();
+}
+
+static int sys_rseq(volatile struct rseq *rseq_abi, uint32_t rseq_len,
+		int flags, uint32_t sig)
+{
+	return syscall(__NR_rseq, rseq_abi, rseq_len, flags, sig);
+}
+
+int rseq_register_current_thread(void)
+{
+	int rc, ret = 0;
+	sigset_t oldset;
+
+	signal_off_save(&oldset);
+	if (refcount++)
+		goto end;
+	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);
+	if (!rc) {
+		assert(rseq_current_cpu_raw() >= 0);
+		goto end;
+	}
+	if (errno != EBUSY)
+		__rseq_abi.cpu_id = -2;
+	ret = -1;
+	refcount--;
+end:
+	signal_restore(oldset);
+	return ret;
+}
+
+int rseq_unregister_current_thread(void)
+{
+	int rc, ret = 0;
+	sigset_t oldset;
+
+	signal_off_save(&oldset);
+	if (--refcount)
+		goto end;
+	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq),
+			RSEQ_FLAG_UNREGISTER, RSEQ_SIG);
+	if (!rc)
+		goto end;
+	ret = -1;
+end:
+	signal_restore(oldset);
+	return ret;
+}
+
+int32_t rseq_fallback_current_cpu(void)
+{
+	int32_t cpu;
+
+	cpu = sched_getcpu();
+	if (cpu < 0) {
+		perror("sched_getcpu()");
+		abort();
+	}
+	return cpu;
+}
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
new file mode 100644
index 000000000000..26c8ea01e940
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -0,0 +1,154 @@
+/*
+ * rseq.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef RSEQ_H
+#define RSEQ_H
+
+#include <stdint.h>
+#include <stdbool.h>
+#include <pthread.h>
+#include <signal.h>
+#include <sched.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sched.h>
+#include <linux/rseq.h>
+
+/*
+ * Empty code injection macros, override when testing.
+ * It is important to consider that the ASM injection macros need to be
+ * fully reentrant (e.g. do not modify the stack).
+ */
+#ifndef RSEQ_INJECT_ASM
+#define RSEQ_INJECT_ASM(n)
+#endif
+
+#ifndef RSEQ_INJECT_C
+#define RSEQ_INJECT_C(n)
+#endif
+
+#ifndef RSEQ_INJECT_INPUT
+#define RSEQ_INJECT_INPUT
+#endif
+
+#ifndef RSEQ_INJECT_CLOBBER
+#define RSEQ_INJECT_CLOBBER
+#endif
+
+#ifndef RSEQ_INJECT_FAILED
+#define RSEQ_INJECT_FAILED
+#endif
+
+extern __thread volatile struct rseq __rseq_abi;
+
+#define rseq_likely(x)		__builtin_expect(!!(x), 1)
+#define rseq_unlikely(x)	__builtin_expect(!!(x), 0)
+#define rseq_barrier()		__asm__ __volatile__("" : : : "memory")
+
+#define RSEQ_ACCESS_ONCE(x)	(*(__volatile__  __typeof__(x) *)&(x))
+#define RSEQ_WRITE_ONCE(x, v)	__extension__ ({ RSEQ_ACCESS_ONCE(x) = (v); })
+#define RSEQ_READ_ONCE(x)	RSEQ_ACCESS_ONCE(x)
+
+#define __rseq_str_1(x)	#x
+#define __rseq_str(x)		__rseq_str_1(x)
+
+#if defined(__x86_64__) || defined(__i386__)
+#include <rseq-x86.h>
+#elif defined(__ARMEL__)
+#include <rseq-arm.h>
+#elif defined(__PPC__)
+#include <rseq-ppc.h>
+#else
+#error unsupported target
+#endif
+
+/*
+ * Register rseq for the current thread. This needs to be called once
+ * by each thread which uses restartable sequences, before that thread
+ * starts using restartable sequences, so that its rseq critical
+ * sections can succeed. A restartable sequence executed from a
+ * non-registered thread will always fail.
+ */
+int rseq_register_current_thread(void);
+
+/*
+ * Unregister rseq for current thread.
+ */
+int rseq_unregister_current_thread(void);
+
+/*
+ * Restartable sequence fallback for reading the current CPU number.
+ */
+int32_t rseq_fallback_current_cpu(void);
+
+/*
+ * Values returned can be either the current CPU number, -1 (rseq is
+ * uninitialized), or -2 (rseq initialization has failed).
+ */
+static inline int32_t rseq_current_cpu_raw(void)
+{
+	return RSEQ_ACCESS_ONCE(__rseq_abi.cpu_id);
+}
+
+/*
+ * Returns a possible CPU number, which is typically the current CPU.
+ * The returned CPU number can be used to prepare for an rseq critical
+ * section, which will confirm whether the cpu number is indeed the
+ * current one, and whether rseq is initialized.
+ *
+ * The CPU number returned by rseq_cpu_start should always be validated
+ * by passing it to a rseq asm sequence, or by comparing it to the
+ * return value of rseq_current_cpu_raw() if the rseq asm sequence
+ * does not need to be invoked.
+ */
+static inline uint32_t rseq_cpu_start(void)
+{
+	return RSEQ_ACCESS_ONCE(__rseq_abi.cpu_id_start);
+}
+
+static inline uint32_t rseq_current_cpu(void)
+{
+	int32_t cpu;
+
+	cpu = rseq_current_cpu_raw();
+	if (rseq_unlikely(cpu < 0))
+		cpu = rseq_fallback_current_cpu();
+	return cpu;
+}
+
+/*
+ * rseq_prepare_unload() should be invoked by each thread using rseq_finish*()
+ * at least once between its last rseq_finish*() and the unload of the
+ * library defining the rseq critical section (struct rseq_cs). This also
+ * applies to rseq used in code generated by a JIT: rseq_prepare_unload()
+ * should be invoked at least once by each thread using rseq_finish*() before
+ * the memory holding the struct rseq_cs is reclaimed.
+ */
+static inline void rseq_prepare_unload(void)
+{
+	__rseq_abi.rseq_cs = 0;
+}
+
+#endif  /* RSEQ_H */
diff --git a/tools/testing/selftests/rseq/run_param_test.sh b/tools/testing/selftests/rseq/run_param_test.sh
new file mode 100755
index 000000000000..c7475a2bef11
--- /dev/null
+++ b/tools/testing/selftests/rseq/run_param_test.sh
@@ -0,0 +1,124 @@
+#!/bin/bash
+
+EXTRA_ARGS=${@}
+
+OLDIFS="$IFS"
+IFS=$'\n'
+TEST_LIST=(
+	"-T s"
+	"-T l"
+	"-T b"
+	"-T b -M"
+	"-T m"
+	"-T m -M"
+	"-T i"
+)
+
+TEST_NAME=(
+	"spinlock"
+	"list"
+	"buffer"
+	"buffer with barrier"
+	"memcpy"
+	"memcpy with barrier"
+	"increment"
+)
+IFS="$OLDIFS"
+
+function do_tests()
+{
+	local i=0
+	while [ "$i" -lt "${#TEST_LIST[@]}" ]; do
+		echo "Running test ${TEST_NAME[$i]}"
+		./param_test ${TEST_LIST[$i]} ${@} ${EXTRA_ARGS} || exit 1
+		let "i++"
+	done
+}
+
+echo "Default parameters"
+do_tests
+
+echo "Loop injection: 10000 loops"
+
+OLDIFS="$IFS"
+IFS=$'\n'
+INJECT_LIST=(
+	"1"
+	"2"
+	"3"
+	"4"
+	"5"
+	"6"
+	"7"
+	"8"
+	"9"
+)
+IFS="$OLDIFS"
+
+NR_LOOPS=10000
+
+i=0
+while [ "$i" -lt "${#INJECT_LIST[@]}" ]; do
+	echo "Injecting at <${INJECT_LIST[$i]}>"
+	do_tests -${INJECT_LIST[i]} ${NR_LOOPS}
+	let "i++"
+done
+NR_LOOPS=
+
+function inject_blocking()
+{
+	OLDIFS="$IFS"
+	IFS=$'\n'
+	INJECT_LIST=(
+		"7"
+		"8"
+		"9"
+	)
+	IFS="$OLDIFS"
+
+	NR_LOOPS=-1
+
+	i=0
+	while [ "$i" -lt "${#INJECT_LIST[@]}" ]; do
+		echo "Injecting at <${INJECT_LIST[$i]}>"
+		do_tests -${INJECT_LIST[i]} -1 ${@}
+		let "i++"
+	done
+	NR_LOOPS=
+}
+
+echo "Yield injection (25%)"
+inject_blocking -m 4 -y -r 100
+
+echo "Yield injection (50%)"
+inject_blocking -m 2 -y -r 100
+
+echo "Yield injection (100%)"
+inject_blocking -m 1 -y -r 100
+
+echo "Kill injection (25%)"
+inject_blocking -m 4 -k -r 100
+
+echo "Kill injection (50%)"
+inject_blocking -m 2 -k -r 100
+
+echo "Kill injection (100%)"
+inject_blocking -m 1 -k -r 100
+
+echo "Sleep injection (1ms, 25%)"
+inject_blocking -m 4 -s 1 -r 100
+
+echo "Sleep injection (1ms, 50%)"
+inject_blocking -m 2 -s 1 -r 100
+
+echo "Sleep injection (1ms, 100%)"
+inject_blocking -m 1 -s 1 -r 100
+
+echo "Disable rseq for 25% threads"
+do_tests -D 4
+
+echo "Disable rseq for 50% threads"
+do_tests -D 2
+
+echo "Disable rseq"
+do_tests -d
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH for 4.15 14/24] Restartable sequences selftests: arm: workaround gcc asm size guess
       [not found] ` <20171114200414.2188-1-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
                     ` (8 preceding siblings ...)
  2017-11-14 20:04   ` [RFC PATCH v2 for 4.15 13/24] Restartable sequences: Provide self-tests Mathieu Desnoyers
@ 2017-11-14 20:04   ` Mathieu Desnoyers
  2017-11-14 20:04   ` [RFC PATCH v5 for 4.15 17/24] membarrier: Document scheduler barrier requirements Mathieu Desnoyers
  2017-11-14 21:08   ` [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Linus Torvalds
  11 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Florian Weimer,
	Shuah Khan

Fixes assembler errors:
/tmp/cceKwI9a.s: Assembler messages:
/tmp/cceKwI9a.s:849: Error: co-processor offset out of range

with gcc prior to gcc-7. This can trigger when multiple rseq inline asm
statements are used within the same function.

My best guess at the cause of this issue is that gcc has a hard
time estimating the actual size of the inline asm, and therefore
does not accurately compute the program-counter-relative offsets
at which literal values can be placed.

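For illustration, the resulting pattern looks like the following minimal
sketch (the function below is a placeholder, not part of this patch):
every inline asm goto is bracketed by the empty volatile asm statement,
on entry and on each exit path, which appears to be enough to keep gcc's
size estimate of the surrounding code conservative.

#define rseq_workaround_gcc_asm_size_guess()	__asm__ __volatile__("")

static inline int rseq_op_sketch(void)
{
	rseq_workaround_gcc_asm_size_guess();	/* before the inline asm */
	__asm__ __volatile__ goto (
		/* stand-in for the large rseq asm sequence */
		"nop\n\t"
		: : : "memory" : abort);
	rseq_workaround_gcc_asm_size_guess();	/* on the fall-through path */
	return 0;
abort:
	rseq_workaround_gcc_asm_size_guess();	/* and on every abort/cmpfail path */
	return -1;
}
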
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
CC: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
CC: Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
CC: Chris Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
CC: Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org>
CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
CC: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
CC: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
CC: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
CC: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
CC: Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: Florian Weimer <fweimer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: Shuah Khan <shuah-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
CC: linux-kselftest-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 tools/testing/selftests/rseq/rseq-arm.h | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/tools/testing/selftests/rseq/rseq-arm.h b/tools/testing/selftests/rseq/rseq-arm.h
index b02489cde80b..c42db74c48ae 100644
--- a/tools/testing/selftests/rseq/rseq-arm.h
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -79,12 +79,15 @@ do {									\
 		teardown						\
 		"b %l[" __rseq_str(cmpfail_label) "]\n\t"
 
+#define rseq_workaround_gcc_asm_size_guess()	__asm__ __volatile__("")
+
 static inline __attribute__((always_inline))
 int rseq_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv,
 		int cpu)
 {
 	RSEQ_INJECT_C(9)
 
+	rseq_workaround_gcc_asm_size_guess();
 	__asm__ __volatile__ goto (
 		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
 		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
@@ -115,11 +118,14 @@ int rseq_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv,
 		  RSEQ_INJECT_CLOBBER
 		: abort, cmpfail
 	);
+	rseq_workaround_gcc_asm_size_guess();
 	return 0;
 abort:
+	rseq_workaround_gcc_asm_size_guess();
 	RSEQ_INJECT_FAILED
 	return -1;
 cmpfail:
+	rseq_workaround_gcc_asm_size_guess();
 	return 1;
 }
 
@@ -129,6 +135,7 @@ int rseq_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
 {
 	RSEQ_INJECT_C(9)
 
+	rseq_workaround_gcc_asm_size_guess();
 	__asm__ __volatile__ goto (
 		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
 		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
@@ -164,11 +171,14 @@ int rseq_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
 		  RSEQ_INJECT_CLOBBER
 		: abort, cmpfail
 	);
+	rseq_workaround_gcc_asm_size_guess();
 	return 0;
 abort:
+	rseq_workaround_gcc_asm_size_guess();
 	RSEQ_INJECT_FAILED
 	return -1;
 cmpfail:
+	rseq_workaround_gcc_asm_size_guess();
 	return 1;
 }
 
@@ -177,6 +187,7 @@ int rseq_addv(intptr_t *v, intptr_t count, int cpu)
 {
 	RSEQ_INJECT_C(9)
 
+	rseq_workaround_gcc_asm_size_guess();
 	__asm__ __volatile__ goto (
 		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
 		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
@@ -203,8 +214,10 @@ int rseq_addv(intptr_t *v, intptr_t count, int cpu)
 		  RSEQ_INJECT_CLOBBER
 		: abort
 	);
+	rseq_workaround_gcc_asm_size_guess();
 	return 0;
 abort:
+	rseq_workaround_gcc_asm_size_guess();
 	RSEQ_INJECT_FAILED
 	return -1;
 }
@@ -216,6 +229,7 @@ int rseq_cmpeqv_trystorev_storev(intptr_t *v, intptr_t expect,
 {
 	RSEQ_INJECT_C(9)
 
+	rseq_workaround_gcc_asm_size_guess();
 	__asm__ __volatile__ goto (
 		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
 		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
@@ -253,11 +267,14 @@ int rseq_cmpeqv_trystorev_storev(intptr_t *v, intptr_t expect,
 		  RSEQ_INJECT_CLOBBER
 		: abort, cmpfail
 	);
+	rseq_workaround_gcc_asm_size_guess();
 	return 0;
 abort:
+	rseq_workaround_gcc_asm_size_guess();
 	RSEQ_INJECT_FAILED
 	return -1;
 cmpfail:
+	rseq_workaround_gcc_asm_size_guess();
 	return 1;
 }
 
@@ -268,6 +285,7 @@ int rseq_cmpeqv_trystorev_storev_release(intptr_t *v, intptr_t expect,
 {
 	RSEQ_INJECT_C(9)
 
+	rseq_workaround_gcc_asm_size_guess();
 	__asm__ __volatile__ goto (
 		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
 		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
@@ -306,11 +324,14 @@ int rseq_cmpeqv_trystorev_storev_release(intptr_t *v, intptr_t expect,
 		  RSEQ_INJECT_CLOBBER
 		: abort, cmpfail
 	);
+	rseq_workaround_gcc_asm_size_guess();
 	return 0;
 abort:
+	rseq_workaround_gcc_asm_size_guess();
 	RSEQ_INJECT_FAILED
 	return -1;
 cmpfail:
+	rseq_workaround_gcc_asm_size_guess();
 	return 1;
 }
 
@@ -321,6 +342,7 @@ int rseq_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
 {
 	RSEQ_INJECT_C(9)
 
+	rseq_workaround_gcc_asm_size_guess();
 	__asm__ __volatile__ goto (
 		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
 		RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
@@ -359,11 +381,14 @@ int rseq_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
 		  RSEQ_INJECT_CLOBBER
 		: abort, cmpfail
 	);
+	rseq_workaround_gcc_asm_size_guess();
 	return 0;
 abort:
+	rseq_workaround_gcc_asm_size_guess();
 	RSEQ_INJECT_FAILED
 	return -1;
 cmpfail:
+	rseq_workaround_gcc_asm_size_guess();
 	return 1;
 }
 
@@ -376,6 +401,7 @@ int rseq_cmpeqv_trymemcpy_storev(intptr_t *v, intptr_t expect,
 
 	RSEQ_INJECT_C(9)
 
+	rseq_workaround_gcc_asm_size_guess();
 	__asm__ __volatile__ goto (
 		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
 		"str %[src], %[rseq_scratch0]\n\t"
@@ -442,11 +468,14 @@ int rseq_cmpeqv_trymemcpy_storev(intptr_t *v, intptr_t expect,
 		  RSEQ_INJECT_CLOBBER
 		: abort, cmpfail
 	);
+	rseq_workaround_gcc_asm_size_guess();
 	return 0;
 abort:
+	rseq_workaround_gcc_asm_size_guess();
 	RSEQ_INJECT_FAILED
 	return -1;
 cmpfail:
+	rseq_workaround_gcc_asm_size_guess();
 	return 1;
 }
 
@@ -459,6 +488,7 @@ int rseq_cmpeqv_trymemcpy_storev_release(intptr_t *v, intptr_t expect,
 
 	RSEQ_INJECT_C(9)
 
+	rseq_workaround_gcc_asm_size_guess();
 	__asm__ __volatile__ goto (
 		RSEQ_ASM_DEFINE_TABLE(__rseq_table, 0x0, 0x0, 1f, 2f-1f, 4f)
 		"str %[src], %[rseq_scratch0]\n\t"
@@ -526,10 +556,13 @@ int rseq_cmpeqv_trymemcpy_storev_release(intptr_t *v, intptr_t expect,
 		  RSEQ_INJECT_CLOBBER
 		: abort, cmpfail
 	);
+	rseq_workaround_gcc_asm_size_guess();
 	return 0;
 abort:
+	rseq_workaround_gcc_asm_size_guess();
 	RSEQ_INJECT_FAILED
 	return -1;
 cmpfail:
+	rseq_workaround_gcc_asm_size_guess();
 	return 1;
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH v2 for 4.15 15/24] membarrier: selftest: Test private expedited cmd
  2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2017-11-14 20:04 ` [RFC PATCH for 4.15 11/24] cpu_opv: Wire up ARM32 " Mathieu Desnoyers
@ 2017-11-14 20:04 ` Mathieu Desnoyers
  2017-11-14 20:04 ` [RFC PATCH v7 for 4.15 16/24] membarrier: powerpc: Skip memory barrier in switch_mm() Mathieu Desnoyers
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Maged Michael,
	Avi Kivity

Test the new MEMBARRIER_CMD_PRIVATE_EXPEDITED and
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED commands.

Add checks expecting specific error values on system calls expected to
fail.

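For reference, the call sequence exercised by this selftest boils down
to the following stand-alone sketch (assuming the uapi headers from this
series; error handling is reduced to the bare minimum):

#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>

static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	/* Not registered yet: the expedited command must fail with EPERM. */
	if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) != -1
			|| errno != EPERM)
		return 1;
	/* Register intent to use private expedited, after which it succeeds. */
	if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0))
		return 1;
	return membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) ? 1 : 0;
}
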
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Shuah Khan <shuahkh@osg.samsung.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Dave Watson <davejwatson@fb.com>
CC: Alan Stern <stern@rowland.harvard.edu>
CC: Will Deacon <will.deacon@arm.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Alice Ferrazzi <alice.ferrazzi@gmail.com>
CC: Paul Elder <paul.elder@pitt.edu>
CC: linux-kselftest@vger.kernel.org
CC: linux-arch@vger.kernel.org
---
Changes since v1:
- return result of ksft_exit_pass from main(), silencing compiler
  warning about missing return value.
---
 .../testing/selftests/membarrier/membarrier_test.c | 111 ++++++++++++++++++---
 1 file changed, 95 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
index 9e674d9514d1..e6ee73d01fa1 100644
--- a/tools/testing/selftests/membarrier/membarrier_test.c
+++ b/tools/testing/selftests/membarrier/membarrier_test.c
@@ -16,49 +16,119 @@ static int sys_membarrier(int cmd, int flags)
 static int test_membarrier_cmd_fail(void)
 {
 	int cmd = -1, flags = 0;
+	const char *test_name = "sys membarrier invalid command";
 
 	if (sys_membarrier(cmd, flags) != -1) {
 		ksft_exit_fail_msg(
-			"sys membarrier invalid command test: command = %d, flags = %d. Should fail, but passed\n",
-			cmd, flags);
+			"%s test: command = %d, flags = %d. Should fail, but passed\n",
+			test_name, cmd, flags);
+	}
+	if (errno != EINVAL) {
+		ksft_exit_fail_msg(
+			"%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
+			test_name, flags, EINVAL, strerror(EINVAL),
+			errno, strerror(errno));
 	}
 
 	ksft_test_result_pass(
-		"sys membarrier invalid command test: command = %d, flags = %d. Failed as expected\n",
-		cmd, flags);
+		"%s test: command = %d, flags = %d, errno = %d. Failed as expected\n",
+		test_name, cmd, flags, errno);
 	return 0;
 }
 
 static int test_membarrier_flags_fail(void)
 {
 	int cmd = MEMBARRIER_CMD_QUERY, flags = 1;
+	const char *test_name = "sys membarrier MEMBARRIER_CMD_QUERY invalid flags";
 
 	if (sys_membarrier(cmd, flags) != -1) {
 		ksft_exit_fail_msg(
-			"sys membarrier MEMBARRIER_CMD_QUERY invalid flags test: flags = %d. Should fail, but passed\n",
-			flags);
+			"%s test: flags = %d. Should fail, but passed\n",
+			test_name, flags);
+	}
+	if (errno != EINVAL) {
+		ksft_exit_fail_msg(
+			"%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
+			test_name, flags, EINVAL, strerror(EINVAL),
+			errno, strerror(errno));
 	}
 
 	ksft_test_result_pass(
-		"sys membarrier MEMBARRIER_CMD_QUERY invalid flags test: flags = %d. Failed as expected\n",
-		flags);
+		"%s test: flags = %d, errno = %d. Failed as expected\n",
+		test_name, flags, errno);
 	return 0;
 }
 
-static int test_membarrier_success(void)
+static int test_membarrier_shared_success(void)
 {
 	int cmd = MEMBARRIER_CMD_SHARED, flags = 0;
-	const char *test_name = "sys membarrier MEMBARRIER_CMD_SHARED\n";
+	const char *test_name = "sys membarrier MEMBARRIER_CMD_SHARED";
+
+	if (sys_membarrier(cmd, flags) != 0) {
+		ksft_exit_fail_msg(
+			"%s test: flags = %d, errno = %d\n",
+			test_name, flags, errno);
+	}
+
+	ksft_test_result_pass(
+		"%s test: flags = %d\n", test_name, flags);
+	return 0;
+}
+
+static int test_membarrier_private_expedited_fail(void)
+{
+	int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED, flags = 0;
+	const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED not registered failure";
+
+	if (sys_membarrier(cmd, flags) != -1) {
+		ksft_exit_fail_msg(
+			"%s test: flags = %d. Should fail, but passed\n",
+			test_name, flags);
+	}
+	if (errno != EPERM) {
+		ksft_exit_fail_msg(
+			"%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
+			test_name, flags, EPERM, strerror(EPERM),
+			errno, strerror(errno));
+	}
+
+	ksft_test_result_pass(
+		"%s test: flags = %d, errno = %d\n",
+		test_name, flags, errno);
+	return 0;
+}
+
+static int test_membarrier_register_private_expedited_success(void)
+{
+	int cmd = MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, flags = 0;
+	const char *test_name = "sys membarrier MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED";
 
 	if (sys_membarrier(cmd, flags) != 0) {
 		ksft_exit_fail_msg(
-			"sys membarrier MEMBARRIER_CMD_SHARED test: flags = %d\n",
-			flags);
+			"%s test: flags = %d, errno = %d\n",
+			test_name, flags, errno);
 	}
 
 	ksft_test_result_pass(
-		"sys membarrier MEMBARRIER_CMD_SHARED test: flags = %d\n",
-		flags);
+		"%s test: flags = %d\n",
+		test_name, flags);
+	return 0;
+}
+
+static int test_membarrier_private_expedited_success(void)
+{
+	int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED, flags = 0;
+	const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED";
+
+	if (sys_membarrier(cmd, flags) != 0) {
+		ksft_exit_fail_msg(
+			"%s test: flags = %d, errno = %d\n",
+			test_name, flags, errno);
+	}
+
+	ksft_test_result_pass(
+		"%s test: flags = %d\n",
+		test_name, flags);
 	return 0;
 }
 
@@ -72,7 +142,16 @@ static int test_membarrier(void)
 	status = test_membarrier_flags_fail();
 	if (status)
 		return status;
-	status = test_membarrier_success();
+	status = test_membarrier_shared_success();
+	if (status)
+		return status;
+	status = test_membarrier_private_expedited_fail();
+	if (status)
+		return status;
+	status = test_membarrier_register_private_expedited_success();
+	if (status)
+		return status;
+	status = test_membarrier_private_expedited_success();
 	if (status)
 		return status;
 	return 0;
@@ -108,5 +187,5 @@ int main(int argc, char **argv)
 	test_membarrier_query();
 	test_membarrier();
 
-	ksft_exit_pass();
+	return ksft_exit_pass();
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH v7 for 4.15 16/24] membarrier: powerpc: Skip memory barrier in switch_mm()
  2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 15/24] membarrier: selftest: Test private expedited cmd Mathieu Desnoyers
@ 2017-11-14 20:04 ` Mathieu Desnoyers
  2017-11-14 20:04 ` [RFC PATCH for 4.15 18/24] membarrier: provide SHARED_EXPEDITED command Mathieu Desnoyers
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Maged Michael,
	Avi Kivity

Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use private expedited membarrier.

Threads which target the same VM but belong to different thread
groups are a tricky case. This has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates over the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.

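To make the CLONE_VM without CLONE_THREAD case concrete, here is a
small userspace sketch (illustration only, not part of this patch) of
how an mm can gain users outside the thread group:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>

static char child_stack[16384] __attribute__((aligned(16)));

static int child_fn(void *arg)
{
	return 0;	/* child exits immediately; reaping omitted */
}

static void share_vm_across_thread_groups(void)
{
	/*
	 * The child shares the caller's mm (CLONE_VM) but is not part
	 * of the caller's thread group (no CLONE_THREAD), so
	 * get_nr_threads() stays at 1 for each task while mm_users is
	 * now greater than 1.
	 */
	clone(child_fn, child_stack + sizeof(child_stack),
			CLONE_VM | SIGCHLD, NULL);
}
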
Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Changes since v5:
- Rebase on v4.14-rc6.
- Fold "Fix: membarrier: Handle CLONE_VM + !CLONE_THREAD correctly on
  powerpc (v2)"

Changes since v6:
- Rename MEMBARRIER_STATE_SWITCH_MM to MEMBARRIER_STATE_PRIVATE_EXPEDITED.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Dave Watson <davejwatson@fb.com>
CC: Alan Stern <stern@rowland.harvard.edu>
CC: Will Deacon <will.deacon@arm.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Nicholas Piggin <npiggin@gmail.com>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-arch@vger.kernel.org
---
 MAINTAINERS                           |  1 +
 arch/powerpc/Kconfig                  |  1 +
 arch/powerpc/include/asm/membarrier.h | 25 +++++++++++++++++++++++++
 arch/powerpc/mm/mmu_context.c         |  7 +++++++
 include/linux/sched/mm.h              | 12 +++++++++++-
 init/Kconfig                          |  3 +++
 kernel/sched/core.c                   | 10 ----------
 kernel/sched/membarrier.c             |  9 +++++++++
 8 files changed, 57 insertions(+), 11 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index d0bf5c5b4267..cf41984201d2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8842,6 +8842,7 @@ L:	linux-kernel@vger.kernel.org
 S:	Supported
 F:	kernel/sched/membarrier.c
 F:	include/uapi/linux/membarrier.h
+F:	arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 41d1dae3b1b5..e54a822e5fb9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -139,6 +139,7 @@ config PPC
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_FORTIFY_SOURCE
 	select ARCH_HAS_GCOV_PROFILE_ALL
+	select ARCH_HAS_MEMBARRIER_HOOKS
 	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_TICK_BROADCAST		if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
new file mode 100644
index 000000000000..046f96768ab5
--- /dev/null
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -0,0 +1,25 @@
+#ifndef _ASM_POWERPC_MEMBARRIER_H
+#define _ASM_POWERPC_MEMBARRIER_H
+
+static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
+		struct mm_struct *next, struct task_struct *tsk)
+{
+	/*
+	 * Only need the full barrier when switching between processes.
+	 * Barrier when switching from kernel to userspace is not
+	 * required here, given that it is implied by mmdrop(). Barrier
+	 * when switching from userspace to kernel is not needed after
+	 * store to rq->curr.
+	 */
+	if (likely(!(atomic_read(&next->membarrier_state)
+			& MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
+		return;
+
+	/*
+	 * The membarrier system call requires a full memory barrier
+	 * after storing to rq->curr, before going back to user-space.
+	 */
+	smp_mb();
+}
+
+#endif /* _ASM_POWERPC_MEMBARRIER_H */
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 0f613bc63c50..22f5c91cdc38 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -12,6 +12,7 @@
 
 #include <linux/mm.h>
 #include <linux/cpu.h>
+#include <linux/sched/mm.h>
 
 #include <asm/mmu_context.h>
 
@@ -67,6 +68,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 *
 		 * On the read side the barrier is in pte_xchg(), which orders
 		 * the store to the PTE vs the load of mm_cpumask.
+		 *
+		 * This full barrier is needed by membarrier when switching
+		 * between processes after store to rq->curr, before user-space
+		 * memory accesses.
 		 */
 		smp_mb();
 
@@ -89,6 +94,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 
 	if (new_on_cpu)
 		radix_kvm_prefetch_workaround(next);
+	else
+		membarrier_arch_switch_mm(prev, next, tsk);
 
 	/*
 	 * The actual HW switching method differs between the various
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 3d49b91b674d..7077253d0df4 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -215,14 +215,24 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
 #ifdef CONFIG_MEMBARRIER
 enum {
 	MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY	= (1U << 0),
-	MEMBARRIER_STATE_SWITCH_MM			= (1U << 1),
+	MEMBARRIER_STATE_PRIVATE_EXPEDITED		= (1U << 1),
 };
 
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
+#include <asm/membarrier.h>
+#endif
+
 static inline void membarrier_execve(struct task_struct *t)
 {
 	atomic_set(&t->mm->membarrier_state, 0);
 }
 #else
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
+static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
+		struct mm_struct *next, struct task_struct *tsk)
+{
+}
+#endif
 static inline void membarrier_execve(struct task_struct *t)
 {
 }
diff --git a/init/Kconfig b/init/Kconfig
index e4fbb5dd6a24..609296e764d6 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1400,6 +1400,9 @@ config MEMBARRIER
 
 	  If unsure, say Y.
 
+config ARCH_HAS_MEMBARRIER_HOOKS
+	bool
+
 config RSEQ
 	bool "Enable rseq() system call" if EXPERT
 	default y
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e547f93a46c2..0ac96e8329d5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2684,16 +2684,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	prev_state = prev->state;
 	vtime_task_switch(prev);
 	perf_event_task_sched_in(prev, current);
-	/*
-	 * The membarrier system call requires a full memory barrier
-	 * after storing to rq->curr, before going back to user-space.
-	 *
-	 * TODO: This smp_mb__after_unlock_lock can go away if PPC end
-	 * up adding a full barrier to switch_mm(), or we should figure
-	 * out if a smp_mb__after_unlock_lock is really the proper API
-	 * to use.
-	 */
-	smp_mb__after_unlock_lock();
 	finish_lock_switch(rq, prev);
 	finish_arch_post_lock_switch();
 
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index dd7908743dab..b045974346d0 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -116,6 +116,15 @@ static void membarrier_register_private_expedited(void)
 	if (atomic_read(&mm->membarrier_state)
 			& MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
 		return;
+	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED,
+			&mm->membarrier_state);
+	if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
+		/*
+		 * Ensure all future scheduler executions will observe the
+		 * new thread flag state for this process.
+		 */
+		synchronize_sched();
+	}
 	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
 			&mm->membarrier_state);
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH v5 for 4.15 17/24] membarrier: Document scheduler barrier requirements
       [not found] ` <20171114200414.2188-1-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
                     ` (9 preceding siblings ...)
  2017-11-14 20:04   ` [RFC PATCH for 4.15 14/24] Restartable sequences selftests: arm: workaround gcc asm size guess Mathieu Desnoyers
@ 2017-11-14 20:04   ` Mathieu Desnoyers
  2017-11-14 21:08   ` [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Linus Torvalds
  11 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Maged Michael,
	Avi Kivity

Document the membarrier requirement of having a full memory barrier in
__schedule() after coming from user-space, before storing to rq->curr.
This barrier is provided by smp_mb__after_spinlock() in __schedule().

Document that membarrier requires a full barrier on transition from
kernel thread to userspace thread. We currently have an implicit barrier
from atomic_dec_and_test() in mmdrop() that ensures this.

The x86 switch_mm_irqs_off() full barrier is currently provided by many
cpumask update operations as well as write_cr3(). Document that
write_cr3() provides this barrier.

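As an informal summary (not part of the patch itself), the barrier
placement being documented can be sketched as follows; the two
scheduler-side barriers pair with the smp_mb() calls issued around the
rq->curr reads in the membarrier system call:

/*
 * __schedule() on a remote CPU, as relied upon by sys_membarrier():
 *
 *	<user-space accesses of the previous task>
 *	rq_lock(rq, &rf);
 *	smp_mb__after_spinlock();	<- full barrier after coming from
 *					   user-space, before storing to rq->curr
 *	rq->curr = next;
 *	...
 *	switch_mm() or mmdrop();	<- full barrier after storing to
 *					   rq->curr, before returning to user-space
 *	<user-space accesses of the next task>
 */
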
Changes since v1:
- Update comments to match reality for code paths which are after
  storing to rq->curr, before returning to user-space, based on feedback
  from Andrea Parri.
Changes since v2:
- Update changelog (smp_mb__before_spinlock -> smp_mb__after_spinlock).
  Based on feedback from Andrea Parri.
Changes since v3:
- Clarify comments following feeback from Peter Zijlstra.
Changes since v4:
- Update comment regarding powerpc barrier.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
CC: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
CC: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Maged Michael <maged.michael-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: Avi Kivity <avi-VrcmuVmyx1hWk0Htik3J/w@public.gmane.org>
CC: Benjamin Herrenschmidt <benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>
CC: Paul Mackerras <paulus-eUNUBHrolfbYtjvyW6yDsg@public.gmane.org>
CC: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
CC: Andrea Parri <parri.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
---
 arch/x86/mm/tlb.c        |  5 +++++
 include/linux/sched/mm.h |  5 +++++
 kernel/sched/core.c      | 37 ++++++++++++++++++++++++++-----------
 3 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3118392cdf75..5abf9bfcca1f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -146,6 +146,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 #endif
 	this_cpu_write(cpu_tlbstate.is_lazy, false);
 
+	/*
+	 * The membarrier system call requires a full memory barrier
+	 * before returning to user-space, after storing to rq->curr.
+	 * Writing to CR3 provides that full memory barrier.
+	 */
 	if (real_prev == next) {
 		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			   next->context.ctx_id);
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 7077253d0df4..0f9e1a96b890 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -39,6 +39,11 @@ static inline void mmgrab(struct mm_struct *mm)
 extern void __mmdrop(struct mm_struct *);
 static inline void mmdrop(struct mm_struct *mm)
 {
+	/*
+	 * The implicit full barrier implied by atomic_dec_and_test is
+	 * required by the membarrier system call before returning to
+	 * user-space, after storing to rq->curr.
+	 */
 	if (unlikely(atomic_dec_and_test(&mm->mm_count)))
 		__mmdrop(mm);
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0ac96e8329d5..c79e94278613 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2688,6 +2688,12 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	finish_arch_post_lock_switch();
 
 	fire_sched_in_preempt_notifiers(current);
+	/*
+	 * When transitioning from a kernel thread to a userspace
+	 * thread, mmdrop()'s implicit full barrier is required by the
+	 * membarrier system call, because the current active_mm can
+	 * become the current mm without going through switch_mm().
+	 */
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
@@ -2793,6 +2799,13 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 */
 	arch_start_context_switch(prev);
 
+	/*
+	 * If mm is non-NULL, we pass through switch_mm(). If mm is
+	 * NULL, we will pass through mmdrop() in finish_task_switch().
+	 * Both of these contain the full memory barrier required by
+	 * membarrier after storing to rq->curr, before returning to
+	 * user-space.
+	 */
 	if (!mm) {
 		next->active_mm = oldmm;
 		mmgrab(oldmm);
@@ -3329,6 +3342,9 @@ static void __sched notrace __schedule(bool preempt)
 	 * Make sure that signal_pending_state()->signal_pending() below
 	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
 	 * done by the caller to avoid the race with signal_wake_up().
+	 *
+	 * The membarrier system call requires a full memory barrier
+	 * after coming from user-space, before storing to rq->curr.
 	 */
 	rq_lock(rq, &rf);
 	smp_mb__after_spinlock();
@@ -3377,17 +3393,16 @@ static void __sched notrace __schedule(bool preempt)
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
-		 * rq->curr, before returning to user-space. For TSO
-		 * (e.g. x86), the architecture must provide its own
-		 * barrier in switch_mm(). For weakly ordered machines
-		 * for which spin_unlock() acts as a full memory
-		 * barrier, finish_lock_switch() in common code takes
-		 * care of this barrier. For weakly ordered machines for
-		 * which spin_unlock() acts as a RELEASE barrier (only
-		 * arm64 and PowerPC), arm64 has a full barrier in
-		 * switch_to(), and PowerPC has
-		 * smp_mb__after_unlock_lock() before
-		 * finish_lock_switch().
+		 * rq->curr, before returning to user-space.
+		 *
+		 * Here are the schemes providing that barrier on the
+		 * various architectures:
+		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
+		 *   switch_mm() relies on membarrier_arch_switch_mm() on PowerPC.
+		 * - finish_lock_switch() for weakly-ordered
+		 *   architectures where spin_unlock is a full barrier,
+		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
+		 *   is a RELEASE barrier),
 		 */
 		++*switch_count;
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH for 4.15 18/24] membarrier: provide SHARED_EXPEDITED command
  2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
                   ` (6 preceding siblings ...)
  2017-11-14 20:04 ` [RFC PATCH v7 for 4.15 16/24] membarrier: powerpc: Skip memory barrier in switch_mm() Mathieu Desnoyers
@ 2017-11-14 20:04 ` Mathieu Desnoyers
       [not found]   ` <20171114200414.2188-19-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-11-14 20:04 ` [RFC PATCH for 4.15 19/24] membarrier: selftest: Test shared expedited cmd Mathieu Desnoyers
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Maged Michael,
	Avi Kivity

Allow expedited membarrier to be used for data shared between processes
(shared memory).

Processes wishing to receive the membarriers register with
MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED. Those which want to issue a
membarrier invoke MEMBARRIER_CMD_SHARED_EXPEDITED.

This allows an extremely simple kernel-level implementation: we already
have almost everything we need in the PRIVATE_EXPEDITED barrier code.
All we need to do is add a flag to the mm_struct which is checked to
decide whether we need to send the IPI to the current thread of each CPU.

There is a slight downside of this approach compared to targeting
specific shared memory users: when performing a membarrier operation,
all registered "shared" receivers will get the barrier, even if they
don't share a memory mapping with the "sender" issuing
MEMBARRIER_CMD_SHARED_EXPEDITED.

This registration approach seems to fit the requirement of not
disturbing processes that deeply care about real-time latency: they
simply should not register with MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED.

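For illustration, a minimal usage sketch (assuming the uapi header from
this patch; the receiver and the sender would typically be distinct
processes sharing a memory mapping):

#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/* In each process that needs to receive the barriers, e.g. at startup: */
static int receiver_register(void)
{
	return membarrier(MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED, 0);
}

/* In the process issuing the barrier; it does not need to be registered: */
static int issue_shared_barrier(void)
{
	return membarrier(MEMBARRIER_CMD_SHARED_EXPEDITED, 0);
}
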
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Dave Watson <davejwatson@fb.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andrea Parri <parri.andrea@gmail.com>
CC: x86@kernel.org
---
 arch/powerpc/include/asm/membarrier.h |   3 +-
 include/linux/sched/mm.h              |   6 +-
 include/uapi/linux/membarrier.h       |  34 +++++++++--
 kernel/sched/membarrier.c             | 112 ++++++++++++++++++++++++++++++++--
 4 files changed, 141 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
index 046f96768ab5..ddf4baedd132 100644
--- a/arch/powerpc/include/asm/membarrier.h
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -12,7 +12,8 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
 	 * store to rq->curr.
 	 */
 	if (likely(!(atomic_read(&next->membarrier_state)
-			& MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
+			& (MEMBARRIER_STATE_PRIVATE_EXPEDITED
+			| MEMBARRIER_STATE_SHARED_EXPEDITED)) || !prev))
 		return;
 
 	/*
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 0f9e1a96b890..c7b0f5970d7c 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -219,8 +219,10 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
 
 #ifdef CONFIG_MEMBARRIER
 enum {
-	MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY	= (1U << 0),
-	MEMBARRIER_STATE_PRIVATE_EXPEDITED		= (1U << 1),
+	MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY		= (1U << 0),
+	MEMBARRIER_STATE_PRIVATE_EXPEDITED			= (1U << 1),
+	MEMBARRIER_STATE_SHARED_EXPEDITED_READY			= (1U << 2),
+	MEMBARRIER_STATE_SHARED_EXPEDITED			= (1U << 3),
 };
 
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index 4e01ad7ffe98..2de01e595d3b 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -40,6 +40,28 @@
  *                          (non-running threads are de facto in such a
  *                          state). This covers threads from all processes
  *                          running on the system. This command returns 0.
+ * @MEMBARRIER_CMD_SHARED_EXPEDITED:
+ *                          Execute a memory barrier on all running threads
+ *                          part of a process which previously registered
+ *                          with MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED.
+ *                          Upon return from system call, the caller thread
+ *                          is ensured that all running threads have passed
+ *                          through a state where all memory accesses to
+ *                          user-space addresses match program order between
+ *                          entry to and return from the system call
+ *                          (non-running threads are de facto in such a
+ *                          state). This only covers threads from processes
+ *                          which registered with
+ *                          MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED.
+ *                          This command returns 0. Given that
+ *                          registration is about the intent to receive
+ *                          the barriers, it is valid to invoke
+ *                          MEMBARRIER_CMD_SHARED_EXPEDITED from a
+ *                          non-registered process.
+ * @MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED:
+ *                          Register the process intent to receive
+ *                          MEMBARRIER_CMD_SHARED_EXPEDITED memory
+ *                          barriers. Always returns 0.
  * @MEMBARRIER_CMD_PRIVATE_EXPEDITED:
  *                          Execute a memory barrier on each running
  *                          thread belonging to the same process as the current
@@ -70,12 +92,12 @@
  * the value 0.
  */
 enum membarrier_cmd {
-	MEMBARRIER_CMD_QUERY				= 0,
-	MEMBARRIER_CMD_SHARED				= (1 << 0),
-	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
-	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
-	MEMBARRIER_CMD_PRIVATE_EXPEDITED		= (1 << 3),
-	MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED	= (1 << 4),
+	MEMBARRIER_CMD_QUERY					= 0,
+	MEMBARRIER_CMD_SHARED					= (1 << 0),
+	MEMBARRIER_CMD_SHARED_EXPEDITED				= (1 << 1),
+	MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED		= (1 << 2),
+	MEMBARRIER_CMD_PRIVATE_EXPEDITED			= (1 << 3),
+	MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED		= (1 << 4),
 };
 
 #endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index b045974346d0..76534531098f 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -27,7 +27,9 @@
  * except MEMBARRIER_CMD_QUERY.
  */
 #define MEMBARRIER_CMD_BITMASK	\
-	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED	\
+	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_SHARED_EXPEDITED \
+	| MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED \
+	| MEMBARRIER_CMD_PRIVATE_EXPEDITED	\
 	| MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED)
 
 static void ipi_mb(void *info)
@@ -35,6 +37,71 @@ static void ipi_mb(void *info)
 	smp_mb();	/* IPIs should be serializing but paranoid. */
 }
 
+static int membarrier_shared_expedited(void)
+{
+	int cpu;
+	bool fallback = false;
+	cpumask_var_t tmpmask;
+
+	if (num_online_cpus() == 1)
+		return 0;
+
+	/*
+	 * Matches memory barriers around rq->curr modification in
+	 * scheduler.
+	 */
+	smp_mb();	/* system call entry is not a mb. */
+
+	/*
+	 * Expedited membarrier commands guarantee that they won't
+	 * block, hence the GFP_NOWAIT allocation flag and fallback
+	 * implementation.
+	 */
+	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
+		/* Fallback for OOM. */
+		fallback = true;
+	}
+
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		struct task_struct *p;
+
+		/*
+		 * Skipping the current CPU is OK even though we can be
+		 * migrated at any point. The current CPU, at the point
+		 * where we read raw_smp_processor_id(), is ensured to
+		 * be in program order with respect to the caller
+		 * thread. Therefore, we can skip this CPU from the
+		 * iteration.
+		 */
+		if (cpu == raw_smp_processor_id())
+			continue;
+		rcu_read_lock();
+		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		if (p && p->mm && (atomic_read(&p->mm->membarrier_state)
+				& MEMBARRIER_STATE_SHARED_EXPEDITED)) {
+			if (!fallback)
+				__cpumask_set_cpu(cpu, tmpmask);
+			else
+				smp_call_function_single(cpu, ipi_mb, NULL, 1);
+		}
+		rcu_read_unlock();
+	}
+	if (!fallback) {
+		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
+		free_cpumask_var(tmpmask);
+	}
+	cpus_read_unlock();
+
+	/*
+	 * Memory barrier on the caller thread _after_ we finished
+	 * waiting for the last IPI. Matches memory barriers around
+	 * rq->curr modification in scheduler.
+	 */
+	smp_mb();	/* exit from system call is not a mb */
+	return 0;
+}
+
 static int membarrier_private_expedited(void)
 {
 	int cpu;
@@ -103,7 +170,38 @@ static int membarrier_private_expedited(void)
 	return 0;
 }
 
-static void membarrier_register_private_expedited(void)
+static int membarrier_register_shared_expedited(void)
+{
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+
+	if (atomic_read(&mm->membarrier_state)
+			& MEMBARRIER_STATE_SHARED_EXPEDITED_READY)
+		return 0;
+	atomic_or(MEMBARRIER_STATE_SHARED_EXPEDITED, &mm->membarrier_state);
+	if (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1) {
+		/*
+		 * For single mm user, single threaded process, we can
+		 * simply issue a memory barrier after setting
+		 * MEMBARRIER_STATE_SHARED_EXPEDITED to guarantee that
+		 * no memory access following registration is reordered
+		 * before registration.
+		 */
+		smp_mb();
+	} else {
+		/*
+		 * For multi-mm user threads, we need to ensure all
+		 * future scheduler executions will observe the new
+		 * thread flag state for this mm.
+		 */
+		synchronize_sched();
+	}
+	atomic_or(MEMBARRIER_STATE_SHARED_EXPEDITED_READY,
+			&mm->membarrier_state);
+	return 0;
+}
+
+static int membarrier_register_private_expedited(void)
 {
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
@@ -115,7 +213,7 @@ static void membarrier_register_private_expedited(void)
 	 */
 	if (atomic_read(&mm->membarrier_state)
 			& MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
-		return;
+		return 0;
 	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED,
 			&mm->membarrier_state);
 	if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
@@ -127,6 +225,7 @@ static void membarrier_register_private_expedited(void)
 	}
 	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
 			&mm->membarrier_state);
+	return 0;
 }
 
 /**
@@ -176,11 +275,14 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
 		if (num_online_cpus() > 1)
 			synchronize_sched();
 		return 0;
+	case MEMBARRIER_CMD_SHARED_EXPEDITED:
+		return membarrier_shared_expedited();
+	case MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED:
+		return membarrier_register_shared_expedited();
 	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
 		return membarrier_private_expedited();
 	case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED:
-		membarrier_register_private_expedited();
-		return 0;
+		return membarrier_register_private_expedited();
 	default:
 		return -EINVAL;
 	}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH for 4.15 19/24] membarrier: selftest: Test shared expedited cmd
  2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
                   ` (7 preceding siblings ...)
  2017-11-14 20:04 ` [RFC PATCH for 4.15 18/24] membarrier: provide SHARED_EXPEDITED command Mathieu Desnoyers
@ 2017-11-14 20:04 ` Mathieu Desnoyers
       [not found]   ` <20171114200414.2188-20-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-11-14 20:04 ` [RFC PATCH for 4.15 20/24] membarrier: Provide core serializing command Mathieu Desnoyers
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Shuah Khan,
	Greg Kroah-Hartman

Test the new MEMBARRIER_CMD_SHARED_EXPEDITED and
MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED commands.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Shuah Khan <shuahkh@osg.samsung.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Dave Watson <davejwatson@fb.com>
CC: Alan Stern <stern@rowland.harvard.edu>
CC: Will Deacon <will.deacon@arm.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Alice Ferrazzi <alice.ferrazzi@gmail.com>
CC: Paul Elder <paul.elder@pitt.edu>
CC: linux-kselftest@vger.kernel.org
CC: linux-arch@vger.kernel.org
---
 .../testing/selftests/membarrier/membarrier_test.c | 51 +++++++++++++++++++++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
index e6ee73d01fa1..bb9c58072c5c 100644
--- a/tools/testing/selftests/membarrier/membarrier_test.c
+++ b/tools/testing/selftests/membarrier/membarrier_test.c
@@ -132,6 +132,40 @@ static int test_membarrier_private_expedited_success(void)
 	return 0;
 }
 
+static int test_membarrier_register_shared_expedited_success(void)
+{
+	int cmd = MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED, flags = 0;
+	const char *test_name = "sys membarrier MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED";
+
+	if (sys_membarrier(cmd, flags) != 0) {
+		ksft_exit_fail_msg(
+			"%s test: flags = %d, errno = %d\n",
+			test_name, flags, errno);
+	}
+
+	ksft_test_result_pass(
+		"%s test: flags = %d\n",
+		test_name, flags);
+	return 0;
+}
+
+static int test_membarrier_shared_expedited_success(void)
+{
+	int cmd = MEMBARRIER_CMD_SHARED_EXPEDITED, flags = 0;
+	const char *test_name = "sys membarrier MEMBARRIER_CMD_SHARED_EXPEDITED";
+
+	if (sys_membarrier(cmd, flags) != 0) {
+		ksft_exit_fail_msg(
+			"%s test: flags = %d, errno = %d\n",
+			test_name, flags, errno);
+	}
+
+	ksft_test_result_pass(
+		"%s test: flags = %d\n",
+		test_name, flags);
+	return 0;
+}
+
 static int test_membarrier(void)
 {
 	int status;
@@ -154,6 +188,19 @@ static int test_membarrier(void)
 	status = test_membarrier_private_expedited_success();
 	if (status)
 		return status;
+	/*
+	 * It is valid to send a shared membarrier from a non-registered
+	 * process.
+	 */
+	status = test_membarrier_shared_expedited_success();
+	if (status)
+		return status;
+	status = test_membarrier_register_shared_expedited_success();
+	if (status)
+		return status;
+	status = test_membarrier_shared_expedited_success();
+	if (status)
+		return status;
 	return 0;
 }
 
@@ -173,8 +220,10 @@ static int test_membarrier_query(void)
 		}
 		ksft_exit_fail_msg("sys_membarrier() failed\n");
 	}
-	if (!(ret & MEMBARRIER_CMD_SHARED))
+	if (!(ret & MEMBARRIER_CMD_SHARED)) {
+		ksft_test_result_fail("sys_membarrier() CMD_SHARED query failed\n");
 		ksft_exit_fail_msg("sys_membarrier is not supported.\n");
+	}
 
 	ksft_test_result_pass("sys_membarrier available\n");
 	return 0;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH for 4.15 20/24] membarrier: Provide core serializing command
  2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
                   ` (8 preceding siblings ...)
  2017-11-14 20:04 ` [RFC PATCH for 4.15 19/24] membarrier: selftest: Test shared expedited cmd Mathieu Desnoyers
@ 2017-11-14 20:04 ` Mathieu Desnoyers
  2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 21/24] x86: Introduce sync_core_before_usermode Mathieu Desnoyers
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Andy Lutomirski,
	Maged Michael

Provide core serializing membarrier command to support memory reclaim
by JIT.

Each architecture needs to explicitly opt into that support by
documenting in its architecture code how it provides the core
serializing instructions required when returning from the membarrier
IPI, and after the scheduler has updated the curr->mm pointer (before
going back to user-space). It should then select
ARCH_HAS_MEMBARRIER_SYNC_CORE to enable support for that command on
that architecture.
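
For context, a minimal user-space usage sketch (illustrative only: the
sys_membarrier() wrapper below is hand-rolled, and it assumes the
updated uapi <linux/membarrier.h> from this patch; error handling is
reduced to perror):

  #include <linux/membarrier.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <stdio.h>

  static int sys_membarrier(int cmd, int flags)
  {
          return syscall(__NR_membarrier, cmd, flags);
  }

  int main(void)
  {
          /*
           * Register intent once, e.g. at JIT startup. This fails with
           * EINVAL if the architecture did not select
           * ARCH_HAS_MEMBARRIER_SYNC_CORE.
           */
          if (sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0)) {
                  perror("membarrier register");
                  return 1;
          }

          /*
           * Before reusing memory that sibling threads may still be
           * executing from, make every running sibling thread issue a
           * core serializing instruction.
           */
          if (sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0)) {
                  perror("membarrier sync_core");
                  return 1;
          }
          return 0;
  }

Registration happens once per process; the expedited command itself can
then be issued on every code-reclaim cycle without blocking.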

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@kernel.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Dave Watson <davejwatson@fb.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andrea Parri <parri.andrea@gmail.com>
CC: Russell King <linux@armlinux.org.uk>
CC: Greg Hackmann <ghackmann@google.com>
CC: Will Deacon <will.deacon@arm.com>
CC: David Sehr <sehr@google.com>
CC: linux-arch@vger.kernel.org
---
 include/linux/sched/mm.h        |  6 +++++
 include/uapi/linux/membarrier.h | 32 +++++++++++++++++++++++++-
 init/Kconfig                    |  3 +++
 kernel/sched/membarrier.c       | 50 +++++++++++++++++++++++++++++++----------
 4 files changed, 78 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index c7b0f5970d7c..b7abb7de250f 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -223,6 +223,12 @@ enum {
 	MEMBARRIER_STATE_PRIVATE_EXPEDITED			= (1U << 1),
 	MEMBARRIER_STATE_SHARED_EXPEDITED_READY			= (1U << 2),
 	MEMBARRIER_STATE_SHARED_EXPEDITED			= (1U << 3),
+	MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY	= (1U << 4),
+	MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE		= (1U << 5),
+};
+
+enum {
+	MEMBARRIER_FLAG_SYNC_CORE	= (1U << 0),
 };
 
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index 2de01e595d3b..99a66577bd85 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -73,7 +73,7 @@
  *                          to and return from the system call
  *                          (non-running threads are de facto in such a
  *                          state). This only covers threads from the
- *                          same processes as the caller thread. This
+ *                          same process as the caller thread. This
  *                          command returns 0 on success. The
  *                          "expedited" commands complete faster than
  *                          the non-expedited ones, they never block,
@@ -86,6 +86,34 @@
  *                          Register the process intent to use
  *                          MEMBARRIER_CMD_PRIVATE_EXPEDITED. Always
  *                          returns 0.
+ * @MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE:
+ *                          In addition to providing the memory
+ *                          ordering guarantees described in
+ *                          MEMBARRIER_CMD_PRIVATE_EXPEDITED, ensure
+ *                          that, upon return from the system call,
+ *                          all running sibling threads of the caller
+ *                          have executed a core serializing
+ *                          instruction. (Architectures are required
+ *                          to guarantee that non-running threads
+ *                          issue core serializing instructions before
+ *                          they resume user-space execution.) This only
+ *                          covers threads from the same process as the
+ *                          caller thread. This command returns 0 on
+ *                          success. The "expedited" commands complete
+ *                          faster than the non-expedited ones, they
+ *                          never block, but have the downside of
+ *                          causing extra overhead. If this command is
+ *                          not implemented by an architecture, -EINVAL
+ *                          is returned. A process needs to register its
+ *                          intent to use the private expedited sync
+ *                          core command prior to using it, otherwise
+ *                          this command returns -EPERM.
+ * @MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE:
+ *                          Register the process intent to use
+ *                          MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE.
+ *                          If this command is not implemented by an
+ *                          architecture, -EINVAL is returned.
+ *                          Returns 0 on success.
  *
  * Command to be passed to the membarrier system call. The commands need to
  * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
@@ -98,6 +126,8 @@ enum membarrier_cmd {
 	MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED		= (1 << 2),
 	MEMBARRIER_CMD_PRIVATE_EXPEDITED			= (1 << 3),
 	MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED		= (1 << 4),
+	MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE		= (1 << 5),
+	MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE	= (1 << 6),
 };
 
 #endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/init/Kconfig b/init/Kconfig
index 609296e764d6..d3e5440051b8 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1403,6 +1403,9 @@ config MEMBARRIER
 config ARCH_HAS_MEMBARRIER_HOOKS
 	bool
 
+config ARCH_HAS_MEMBARRIER_SYNC_CORE
+	bool
+
 config RSEQ
 	bool "Enable rseq() system call" if EXPERT
 	default y
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 76534531098f..72f42eac99ab 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -26,11 +26,20 @@
  * Bitmask made from a "or" of all commands within enum membarrier_cmd,
  * except MEMBARRIER_CMD_QUERY.
  */
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
+#define MEMBARRIER_PRIVATE_EXPEDITED_SYNC_CORE_BITMASK	\
+	(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE \
+	| MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE)
+#else
+#define MEMBARRIER_PRIVATE_EXPEDITED_SYNC_CORE_BITMASK	0
+#endif
+
 #define MEMBARRIER_CMD_BITMASK	\
 	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_SHARED_EXPEDITED \
 	| MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED \
 	| MEMBARRIER_CMD_PRIVATE_EXPEDITED	\
-	| MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED)
+	| MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED	\
+	| MEMBARRIER_PRIVATE_EXPEDITED_SYNC_CORE_BITMASK)
 
 static void ipi_mb(void *info)
 {
@@ -102,15 +111,23 @@ static int membarrier_shared_expedited(void)
 	return 0;
 }
 
-static int membarrier_private_expedited(void)
+static int membarrier_private_expedited(int flags)
 {
 	int cpu;
 	bool fallback = false;
 	cpumask_var_t tmpmask;
 
-	if (!(atomic_read(&current->mm->membarrier_state)
-			& MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY))
-		return -EPERM;
+	if (flags & MEMBARRIER_FLAG_SYNC_CORE) {
+		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
+			return -EINVAL;
+		if (!(atomic_read(&current->mm->membarrier_state)
+				& MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
+			return -EPERM;
+	} else {
+		if (!(atomic_read(&current->mm->membarrier_state)
+				& MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY))
+			return -EPERM;
+	}
 
 	if (num_online_cpus() == 1)
 		return 0;
@@ -201,18 +218,24 @@ static int membarrier_register_shared_expedited(void)
 	return 0;
 }
 
-static int membarrier_register_private_expedited(void)
+static int membarrier_register_private_expedited(int flags)
 {
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	int state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY;
+
+	if (flags & MEMBARRIER_FLAG_SYNC_CORE) {
+		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
+			return -EINVAL;
+		state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY;
+	}
 
 	/*
 	 * We need to consider threads belonging to different thread
 	 * groups, which use the same mm. (CLONE_VM but not
 	 * CLONE_THREAD).
 	 */
-	if (atomic_read(&mm->membarrier_state)
-			& MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
+	if (atomic_read(&mm->membarrier_state) & state)
 		return 0;
 	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED,
 			&mm->membarrier_state);
@@ -223,8 +246,7 @@ static int membarrier_register_private_expedited(void)
 		 */
 		synchronize_sched();
 	}
-	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
-			&mm->membarrier_state);
+	atomic_or(state, &mm->membarrier_state);
 	return 0;
 }
 
@@ -280,9 +302,13 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
 	case MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED:
 		return membarrier_register_shared_expedited();
 	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
-		return membarrier_private_expedited();
+		return membarrier_private_expedited(0);
 	case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED:
-		return membarrier_register_private_expedited();
+		return membarrier_register_private_expedited(0);
+	case MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE:
+		return membarrier_private_expedited(MEMBARRIER_FLAG_SYNC_CORE);
+	case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE:
+		return membarrier_register_private_expedited(MEMBARRIER_FLAG_SYNC_CORE);
 	default:
 		return -EINVAL;
 	}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH v2 for 4.15 21/24] x86: Introduce sync_core_before_usermode
  2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
                   ` (9 preceding siblings ...)
  2017-11-14 20:04 ` [RFC PATCH for 4.15 20/24] membarrier: Provide core serializing command Mathieu Desnoyers
@ 2017-11-14 20:04 ` Mathieu Desnoyers
  2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 22/24] membarrier: x86: Provide core serializing command Mathieu Desnoyers
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Andy Lutomirski,
	Maged Michael

Introduce an architecture function that ensures the current CPU
issues a core serializing instruction before returning to usermode.

This is needed for the membarrier "sync_core" command.

Architectures defining the sync_core_before_usermode() static inline
need to select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@kernel.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Dave Watson <davejwatson@fb.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andrea Parri <parri.andrea@gmail.com>
CC: Russell King <linux@armlinux.org.uk>
CC: Greg Hackmann <ghackmann@google.com>
CC: Will Deacon <will.deacon@arm.com>
CC: David Sehr <sehr@google.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: x86@kernel.org
CC: linux-arch@vger.kernel.org
---
Changes since v1:
- Fix prototype of sync_core_before_usermode in generic code (missing
  return type).
- Add linux/processor.h include to sched/core.c.
- Add ARCH_HAS_SYNC_CORE_BEFORE_USERMODE to init/Kconfig.
- Fix linux/processor.h ifdef to target
  CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE rather than
  ARCH_HAS_SYNC_CORE_BEFORE_USERMODE.
---
 arch/x86/Kconfig                 |  1 +
 arch/x86/include/asm/processor.h | 10 ++++++++++
 include/linux/processor.h        |  6 ++++++
 init/Kconfig                     |  3 +++
 kernel/sched/core.c              |  1 +
 5 files changed, 21 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 01f78c1d40b5..54fbb8960d94 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -62,6 +62,7 @@ config X86
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
+	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAS_ZONE_DEVICE		if X86_64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index bdac19ab2488..e47a98a33c2e 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -706,6 +706,16 @@ static inline void sync_core(void)
 #endif
 }
 
+/*
+ * Ensure that a core serializing instruction is issued before returning
+ * to user-mode. x86 implements return to user-space through sysexit,
+ * sysretl, and sysretq, which are not core serializing.
+ */
+static inline void sync_core_before_usermode(void)
+{
+	sync_core();
+}
+
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void amd_e400_c1e_apic_setup(void);
 
diff --git a/include/linux/processor.h b/include/linux/processor.h
index dbc952eec869..866de5326d34 100644
--- a/include/linux/processor.h
+++ b/include/linux/processor.h
@@ -68,4 +68,10 @@ do {								\
 
 #endif
 
+#ifndef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
+static inline void sync_core_before_usermode(void)
+{
+}
+#endif
+
 #endif /* _LINUX_PROCESSOR_H */
diff --git a/init/Kconfig b/init/Kconfig
index d3e5440051b8..ab073d436a9a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1943,3 +1943,6 @@ config ASN1
 	  functions to call on what tags.
 
 source "kernel/Kconfig.locks"
+
+config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
+	bool
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c79e94278613..688b65c68731 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -26,6 +26,7 @@
 #include <linux/profile.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/processor.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH v2 for 4.15 22/24] membarrier: x86: Provide core serializing command
  2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
                   ` (10 preceding siblings ...)
  2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 21/24] x86: Introduce sync_core_before_usermode Mathieu Desnoyers
@ 2017-11-14 20:04 ` Mathieu Desnoyers
  2017-11-14 20:04 ` [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd Mathieu Desnoyers
  2017-11-14 20:04 ` [RFC PATCH for 4.15 24/24] membarrier: arm64: Provide core serializing command Mathieu Desnoyers
  13 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Andy Lutomirski,
	Maged Michael

There are two places where core serialization is needed by membarrier:

1) When returning from the membarrier IPI,
2) After scheduler updates curr to a thread with a different mm, before
   going back to user-space, since the curr->mm is used by membarrier to
   check whether it needs to send an IPI to that CPU.

x86-32 uses iret as the return from interrupt, and both iret and sysexit
to go back to user-space. The iret instruction is core serializing, but
sysexit is not.

x86-64 uses iret as the return from interrupt, which takes care of the IPI.
However, it can return to user-space through either sysretl (compat
code), sysretq, or iret. Given that sysret{l,q} are not core serializing,
we rely instead on the write_cr3() performed by switch_mm() to provide
core serialization after changing the current mm, and deal with the
special case of kthread -> uthread (temporarily keeping the current mm
in active_mm) by adding a sync_core() in that specific case.

Use the new sync_core_before_usermode() to guarantee this.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@kernel.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Dave Watson <davejwatson@fb.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andrea Parri <parri.andrea@gmail.com>
CC: Russell King <linux@armlinux.org.uk>
CC: Greg Hackmann <ghackmann@google.com>
CC: Will Deacon <will.deacon@arm.com>
CC: David Sehr <sehr@google.com>
CC: x86@kernel.org
CC: linux-arch@vger.kernel.org

---
Changes since v1:
- Use the newly introduced sync_core_before_usermode(). Move all state
  handling to generic code.
- Add linux/processor.h include to include/linux/sched/mm.h.
---
 arch/x86/Kconfig          |  1 +
 arch/x86/entry/entry_32.S |  5 +++++
 arch/x86/entry/entry_64.S |  8 ++++++++
 arch/x86/mm/tlb.c         |  7 ++++---
 include/linux/sched/mm.h  | 12 ++++++++++++
 kernel/sched/core.c       |  6 +++++-
 kernel/sched/membarrier.c |  4 ++++
 7 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 54fbb8960d94..94bdf5fc7d94 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
 	select ARCH_HAS_FORTIFY_SOURCE
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_KCOV			if X86_64
+	select ARCH_HAS_MEMBARRIER_SYNC_CORE
 	select ARCH_HAS_PMEM_API		if X86_64
 	# Causing hangs/crashes, see the commit that added this change for details.
 	select ARCH_HAS_REFCOUNT		if BROKEN
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 4838037f97f6..04e5daba8456 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -553,6 +553,11 @@ restore_all:
 .Lrestore_nocheck:
 	RESTORE_REGS 4				# skip orig_eax/error_code
 .Lirq_return:
+	/*
+	 * ARCH_HAS_MEMBARRIER_SYNC_CORE relies on iret core serialization
+	 * when returning from the IPI handler and when returning from the
+	 * scheduler to user-space.
+	 */
 	INTERRUPT_RETURN
 
 .section .fixup, "ax"
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index bcfc5668dcb2..4859f04e1695 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -642,6 +642,10 @@ GLOBAL(restore_regs_and_iret)
 restore_c_regs_and_iret:
 	RESTORE_C_REGS
 	REMOVE_PT_GPREGS_FROM_STACK 8
+	/*
+	 * ARCH_HAS_MEMBARRIER_SYNC_CORE relies on iret core serialization
+	 * when returning from the IPI handler.
+	 */
 	INTERRUPT_RETURN
 
 ENTRY(native_iret)
@@ -1122,6 +1126,10 @@ paranoid_exit_restore:
 	RESTORE_EXTRA_REGS
 	RESTORE_C_REGS
 	REMOVE_PT_GPREGS_FROM_STACK 8
+	/*
+	 * ARCH_HAS_MEMBARRIER_SYNC_CORE relies on iret core serialization
+	 * when returning from the IPI handler.
+	 */
 	INTERRUPT_RETURN
 END(paranoid_exit)
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5abf9bfcca1f..3b13d6735fa5 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -147,9 +147,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	this_cpu_write(cpu_tlbstate.is_lazy, false);
 
 	/*
-	 * The membarrier system call requires a full memory barrier
-	 * before returning to user-space, after storing to rq->curr.
-	 * Writing to CR3 provides that full memory barrier.
+	 * The membarrier system call requires a full memory barrier and
+	 * core serialization before returning to user-space, after
+	 * storing to rq->curr. Writing to CR3 provides that full
+	 * memory barrier and core serializing instruction.
 	 */
 	if (real_prev == next) {
 		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index b7abb7de250f..5b763a73ae43 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -7,6 +7,7 @@
 #include <linux/sched.h>
 #include <linux/mm_types.h>
 #include <linux/gfp.h>
+#include <linux/processor.h>
 
 /*
  * Routines for handling mm_structs
@@ -235,6 +236,14 @@ enum {
 #include <asm/membarrier.h>
 #endif
 
+static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
+{
+	if (likely(!(atomic_read(&mm->membarrier_state) &
+			MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
+		return;
+	sync_core_before_usermode();
+}
+
 static inline void membarrier_execve(struct task_struct *t)
 {
 	atomic_set(&t->mm->membarrier_state, 0);
@@ -249,6 +258,9 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
 static inline void membarrier_execve(struct task_struct *t)
 {
 }
+static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
+{
+}
 #endif
 
 #endif /* _LINUX_SCHED_MM_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 688b65c68731..43099a091742 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2694,9 +2694,13 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 * thread, mmdrop()'s implicit full barrier is required by the
 	 * membarrier system call, because the current active_mm can
 	 * become the current mm without going through switch_mm().
+	 * membarrier also requires a core serializing instruction
+	 * before going back to user-space after storing to rq->curr.
 	 */
-	if (mm)
+	if (mm) {
+		membarrier_mm_sync_core_before_usermode(mm);
 		mmdrop(mm);
+	}
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 72f42eac99ab..d48185974ae0 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -239,6 +239,10 @@ static int membarrier_register_private_expedited(int flags)
 		return 0;
 	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED,
 			&mm->membarrier_state);
+	if (flags & MEMBARRIER_FLAG_SYNC_CORE) {
+		atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE,
+				&mm->membarrier_state);
+	}
 	if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
 		/*
 		 * Ensure all future scheduler executions will observe the
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd
  2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
                   ` (11 preceding siblings ...)
  2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 22/24] membarrier: x86: Provide core serializing command Mathieu Desnoyers
@ 2017-11-14 20:04 ` Mathieu Desnoyers
  2017-11-17 15:09   ` Shuah Khan
  2017-11-14 20:04 ` [RFC PATCH for 4.15 24/24] membarrier: arm64: Provide core serializing command Mathieu Desnoyers
  13 siblings, 1 reply; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Shuah Khan,
	Greg Kroah-Hartman

Test the new MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE and
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE commands.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Shuah Khan <shuahkh@osg.samsung.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Dave Watson <davejwatson@fb.com>
CC: Alan Stern <stern@rowland.harvard.edu>
CC: Will Deacon <will.deacon@arm.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Alice Ferrazzi <alice.ferrazzi@gmail.com>
CC: Paul Elder <paul.elder@pitt.edu>
CC: linux-kselftest@vger.kernel.org
CC: linux-arch@vger.kernel.org
---
 .../testing/selftests/membarrier/membarrier_test.c | 73 ++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
index bb9c58072c5c..d9ab8b6ee52e 100644
--- a/tools/testing/selftests/membarrier/membarrier_test.c
+++ b/tools/testing/selftests/membarrier/membarrier_test.c
@@ -132,6 +132,63 @@ static int test_membarrier_private_expedited_success(void)
 	return 0;
 }
 
+static int test_membarrier_private_expedited_sync_core_fail(void)
+{
+	int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, flags = 0;
+	const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE not registered failure";
+
+	if (sys_membarrier(cmd, flags) != -1) {
+		ksft_exit_fail_msg(
+			"%s test: flags = %d. Should fail, but passed\n",
+			test_name, flags);
+	}
+	if (errno != EPERM) {
+		ksft_exit_fail_msg(
+			"%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
+			test_name, flags, EPERM, strerror(EPERM),
+			errno, strerror(errno));
+	}
+
+	ksft_test_result_pass(
+		"%s test: flags = %d, errno = %d\n",
+		test_name, flags, errno);
+	return 0;
+}
+
+static int test_membarrier_register_private_expedited_sync_core_success(void)
+{
+	int cmd = MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, flags = 0;
+	const char *test_name = "sys membarrier MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE";
+
+	if (sys_membarrier(cmd, flags) != 0) {
+		ksft_exit_fail_msg(
+			"%s test: flags = %d, errno = %d\n",
+			test_name, flags, errno);
+	}
+
+	ksft_test_result_pass(
+		"%s test: flags = %d\n",
+		test_name, flags);
+	return 0;
+}
+
+static int test_membarrier_private_expedited_sync_core_success(void)
+{
+	int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, flags = 0;
+	const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE";
+
+	if (sys_membarrier(cmd, flags) != 0) {
+		ksft_exit_fail_msg(
+			"%s test: flags = %d, errno = %d\n",
+			test_name, flags, errno);
+	}
+
+	ksft_test_result_pass(
+		"%s test: flags = %d\n",
+		test_name, flags);
+	return 0;
+}
+
 static int test_membarrier_register_shared_expedited_success(void)
 {
 	int cmd = MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED, flags = 0;
@@ -188,6 +245,22 @@ static int test_membarrier(void)
 	status = test_membarrier_private_expedited_success();
 	if (status)
 		return status;
+	status = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);
+	if (status < 0) {
+		ksft_test_result_fail("sys_membarrier() failed\n");
+		return status;
+	}
+	if (status & MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE) {
+		status = test_membarrier_private_expedited_sync_core_fail();
+		if (status)
+			return status;
+		status = test_membarrier_register_private_expedited_sync_core_success();
+		if (status)
+			return status;
+		status = test_membarrier_private_expedited_sync_core_success();
+		if (status)
+			return status;
+	}
 	/*
 	 * It is valid to send a shared membarrier from a non-registered
 	 * process.
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH for 4.15 24/24] membarrier: arm64: Provide core serializing command
  2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
                   ` (12 preceding siblings ...)
  2017-11-14 20:04 ` [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd Mathieu Desnoyers
@ 2017-11-14 20:04 ` Mathieu Desnoyers
  13 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:04 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Mathieu Desnoyers, Andy Lutomirski,
	Maged Michael

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@kernel.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Dave Watson <davejwatson@fb.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andrea Parri <parri.andrea@gmail.com>
CC: Russell King <linux@armlinux.org.uk>
CC: Greg Hackmann <ghackmann@google.com>
CC: Will Deacon <will.deacon@arm.com>
CC: David Sehr <sehr@google.com>
CC: x86@kernel.org
CC: linux-arch@vger.kernel.org
---
 arch/arm64/Kconfig        | 1 +
 arch/arm64/kernel/entry.S | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..38272601850d 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -16,6 +16,7 @@ config ARM64
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
 	select ARCH_HAS_KCOV
+	select ARCH_HAS_MEMBARRIER_SYNC_CORE
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index e1c59d4008a8..0ec02508e53c 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -300,6 +300,10 @@ alternative_else_nop_endif
 	ldp	x28, x29, [sp, #16 * 14]
 	ldr	lr, [sp, #S_LR]
 	add	sp, sp, #S_FRAME_SIZE		// restore sp
+	/*
+	 * ARCH_HAS_MEMBARRIER_SYNC_CORE relies on eret context synchronization
+	 * when returning from the IPI handler and when returning to user-space.
+	 */
 	eret					// return to kernel
 	.endm
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]     ` <20171114200414.2188-2-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-14 20:39       ` Ben Maurer
       [not found]         ` <CY4PR15MB168866BFDCFECF81B7EF4CF1CF280-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-11-16 16:18       ` Peter Zijlstra
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 80+ messages in thread
From: Ben Maurer @ 2017-11-14 20:39 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Zijlstra, Paul E . McKenney, Boqun Feng,
	Andy Lutomirski, Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Steven Rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Alexander Viro

>       int rseq(struct rseq * rseq, uint32_t rseq_len, int flags, uint32_t sig);

Really dumb question -- and one I'm sorry to bring up at the last minute. Should we consider making the syscall name something more generic "register_tls_abi"? I'm assuming that if we ever want to use a per-thread userspace/kernel ABI we'll want to use this field given the difficulty of getting adoption of registration, the need to involve glibc, etc. It seems like there could be future use cases of this TLS area that have nothing to do with rseq. 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]       ` <CY4PR15MB168884529B3C0F8E6CC06257CF280-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-11-14 20:49         ` Ben Maurer
       [not found]           ` <CY4PR15MB1688CE0F2139CEB72B467242CF280-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 80+ messages in thread
From: Ben Maurer @ 2017-11-14 20:49 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Zijlstra, Paul E . McKenney, Boqun Feng,
	Andy Lutomirski, Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Steven Rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Alexander Viro

(apologies for the duplicate email, the previous one bounced as it was accidentally using HTML formatting)

If I understand correctly this is run on every context switch so we probably want to make it really fast

> +static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags)
> +{
> +       bool need_restart = false;
> +       uint32_t flags;
> +
> +       /* Get thread flags. */
> +       if (__get_user(flags, &t->rseq->flags))
> +               return -EFAULT;
> +
> +       /* Take into account critical section flags. */
> +       flags |= cs_flags;
> +
> +       /*
> +        * Restart on signal can only be inhibited when restart on
> +        * preempt and restart on migrate are inhibited too. Otherwise,
> +        * a preempted signal handler could fail to restart the prior
> +        * execution context on sigreturn.
> +        */
> +       if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) {
> +               if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
> +                       return -EINVAL;
> +               if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
> +                       return -EINVAL;
> +       }

How does this error even get to userspace? Is it worth doing this switch on every execution?


> +       if (t->rseq_migrate
> +                       && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
> +               need_restart = true;
> +       else if (t->rseq_preempt
> +                       && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
> +               need_restart = true;
> +       else if (t->rseq_signal
> +                       && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL))
> +               need_restart = true;

This could potentially be sped up by having the rseq_* fields in t use a single bitmask with the same bit offsets as RSEQ_CS_FLAG_NO_* then using bit operations to check the appropriate overlap.
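
A rough sketch of that idea, with a hypothetical per-task event mask
(nothing named like this exists in the patch; types are user-space
equivalents for illustration):

  #include <stdbool.h>
  #include <stdint.h>

  /*
   * Hypothetical: instead of three booleans, the task would carry a
   * single event mask whose bits reuse the RSEQ_CS_FLAG_NO_RESTART_ON_*
   * offsets and are set from the migrate/preempt/signal paths.
   */
  static bool rseq_need_restart_mask(uint32_t event_mask, uint32_t cs_flags)
  {
          /* Restart iff some recorded event lacks its "no restart" bit. */
          return event_mask & ~cs_flags;
  }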

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]         ` <CY4PR15MB168866BFDCFECF81B7EF4CF1CF280-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-11-14 20:52           ` Mathieu Desnoyers
       [not found]             ` <574606484.15158.1510692743725.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 20:52 UTC (permalink / raw)
  To: Ben Maurer
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas

----- On Nov 14, 2017, at 3:39 PM, Ben Maurer bmaurer-b10kYP2dOMg@public.gmane.org wrote:

>>       int rseq(struct rseq * rseq, uint32_t rseq_len, int flags, uint32_t sig);
> 
> Really dumb question -- and one I'm sorry to bring up at the last minute. Should
> we consider making the syscall name something more generic "register_tls_abi"?
> I'm assuming that if we ever want to use a per-thread userspace/kernel ABI
> we'll want to use this field given the difficulty of getting adoption of
> registration, the need to involve glibc, etc. It seems like there could be
> future use cases of this TLS area that have nothing to do with rseq.

I proposed that approach back in 2016 ("tls abi" system call), and the feedback
I received back then is that it was preferred to have a dedicated "rseq" system
call than an "open ended" and generic "tls abi" system call.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]           ` <CY4PR15MB1688CE0F2139CEB72B467242CF280-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-11-14 21:03             ` Mathieu Desnoyers
  0 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 21:03 UTC (permalink / raw)
  To: Ben Maurer
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas

----- On Nov 14, 2017, at 3:49 PM, Ben Maurer bmaurer-b10kYP2dOMg@public.gmane.org wrote:

> (apologies for the duplicate email, the previous one bounced as it was
> accidentally using HTML formatting)
> 
> If I understand correctly this is run on every context switch so we probably
> want to make it really fast

Yes, more precisely, it runs on return to user-space, after every context
switch going back to a registered rseq thread.

> 
>> +static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags)
>> +{
>> +       bool need_restart = false;
>> +       uint32_t flags;
>> +
>> +       /* Get thread flags. */
>> +       if (__get_user(flags, &t->rseq->flags))
>> +               return -EFAULT;
>> +
>> +       /* Take into account critical section flags. */
>> +       flags |= cs_flags;
>> +
>> +       /*
>> +        * Restart on signal can only be inhibited when restart on
>> +        * preempt and restart on migrate are inhibited too. Otherwise,
>> +        * a preempted signal handler could fail to restart the prior
>> +        * execution context on sigreturn.
>> +        */
>> +       if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) {
>> +               if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
>> +                       return -EINVAL;
>> +               if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
>> +                       return -EINVAL;
>> +       }
> 
> How does this error even get to userspace? Is it worth doing this switch on
> every execution?

If we detect this situation, the rseq_need_restart caller will end up
sending a SIGSEGV signal to user-space. Note that the two nested if()
checks only execute in the unlikely case where the NO_RESTART_ON_SIGNAL
flag is set.

> 
> 
>> +       if (t->rseq_migrate
>> +                       && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
>> +               need_restart = true;
>> +       else if (t->rseq_preempt
>> +                       && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
>> +               need_restart = true;
>> +       else if (t->rseq_signal
>> +                       && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL))
>> +               need_restart = true;
> 
> This could potentially be sped up by having the rseq_* fields in t use a single
> bitmask with the same bit offsets as RSEQ_CS_FLAG_NO_* then using bit
> operations to check the appropriate overlap.

Given that those are not requests impacting the ABI presented to user-space,
I'm tempted to keep these optimizations for the following 4.16 merge window.
Is that ok with you ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11
       [not found] ` <20171114200414.2188-1-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
                     ` (10 preceding siblings ...)
  2017-11-14 20:04   ` [RFC PATCH v5 for 4.15 17/24] membarrier: Document scheduler barrier requirements Mathieu Desnoyers
@ 2017-11-14 21:08   ` Linus Torvalds
       [not found]     ` <CA+55aFzZcQKEvu5S3TwD9MscqDhqq3pKa0Kam79NncjP8RnvoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  11 siblings, 1 reply; 80+ messages in thread
From: Linus Torvalds @ 2017-11-14 21:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, Linux Kernel Mailing List, Linux API, Paul Turner,
	Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H . Peter Anvin, Andrew Hunter, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Catalin Marinas

On Tue, Nov 14, 2017 at 12:03 PM, Mathieu Desnoyers
<mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> Here is the last RFC round of the updated rseq patchset containing:

Andy? You were the one with concerns here and said you'd have
something else ready for comparison.

               Linus

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11
       [not found]     ` <CA+55aFzZcQKEvu5S3TwD9MscqDhqq3pKa0Kam79NncjP8RnvoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-11-14 21:15       ` Andy Lutomirski
       [not found]         ` <CALCETrVMvk0dsBMF8F-gPZCGnfJt=RQOvTnVzJhVaAFhEFbq2w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 80+ messages in thread
From: Andy Lutomirski @ 2017-11-14 21:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E . McKenney, Boqun Feng,
	Dave Watson, Linux Kernel Mailing List, Linux API, Paul Turner,
	Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H . Peter Anvin, Andrew Hunter, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Catalin Marinas

On Tue, Nov 14, 2017 at 1:08 PM, Linus Torvalds
<torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> On Tue, Nov 14, 2017 at 12:03 PM, Mathieu Desnoyers
> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>> Here is the last RFC round of the updated rseq patchset containing:
>
> Andy? You were the one with concerns here and said you'd have
> something else ready for comparison.
>

I had a long discussion with Mathieu and KS and I think that this is a
good compromise.  I haven't reviewed the series all that carefully,
but I think the idea is sound.

Basically, event_counter is gone (to be re-added in a later kernel if
it really ends up being necessary, but it looks like it may primarily
be a temptation to write subtly incorrect user code and to see
scheduling details that shouldn't be readily exposed rather than a
genuinely useful feature) and the versioning mechanism for the asm
critical section bit is improved.  My crazy proposal should be doable
on top of this if there's demand and if anyone wants to write the
gnarly code involved.

IOW no objection from me as long as those changes were made, which I
*think* they were.  Mathieu?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11
       [not found]         ` <CALCETrVMvk0dsBMF8F-gPZCGnfJt=RQOvTnVzJhVaAFhEFbq2w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-11-14 21:32           ` Paul Turner
  2018-03-27 18:15             ` Mathieu Desnoyers
  2017-11-14 21:32           ` Mathieu Desnoyers
  1 sibling, 1 reply; 80+ messages in thread
From: Paul Turner @ 2017-11-14 21:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Mathieu Desnoyers, Peter Zijlstra,
	Paul E . McKenney, Boqun Feng, Dave Watson,
	Linux Kernel Mailing List, Linux API, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Catalin

I have some comments that apply to many of the threads.
I've been fully occupied with a wedding and a security issue; but I'm
about to be free to spend the majority of my time on RSEQ things.
I was sorely hoping that day would be today.  But it's looking like
I'm still a day or two from being free for this.
Thank you for the extensive clean-ups and user-side development.  I
have some updates on these topics also.

- Paul

On Tue, Nov 14, 2017 at 1:15 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Tue, Nov 14, 2017 at 1:08 PM, Linus Torvalds
> <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
>> On Tue, Nov 14, 2017 at 12:03 PM, Mathieu Desnoyers
>> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>>> Here is the last RFC round of the updated rseq patchset containing:
>>
>> Andy? You were the one with concerns here and said you'd have
>> something else ready for comparison.
>>
>
> I had a long discussion with Mathieu and KS and I think that this is a
> good compromise.  I haven't reviewed the series all that carefully,
> but I think the idea is sound.
>
> Basically, event_counter is gone (to be re-added in a later kernel if
> it really ends up being necessary, but it looks like it may primarily
> be a temptation to write subtly incorrect user code and to see
> scheduling details that shouldn't be readily exposed rather than a
> genuinely useful feature) and the versioning mechanism for the asm
> critical section bit is improved.  My crazy proposal should be doable
> on top of this if there's demand and if anyone wants to write the
> gnarly code involved.
>
> IOW no objection from me as long as those changes were made, which I
> *think* they were.  Mathieu?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11
       [not found]         ` <CALCETrVMvk0dsBMF8F-gPZCGnfJt=RQOvTnVzJhVaAFhEFbq2w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-11-14 21:32           ` Paul Turner
@ 2017-11-14 21:32           ` Mathieu Desnoyers
       [not found]             ` <2115146800.15215.1510695175687.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-14 21:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Catalin Marinas, Will Deacon

----- On Nov 14, 2017, at 4:15 PM, Andy Lutomirski luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org wrote:

> On Tue, Nov 14, 2017 at 1:08 PM, Linus Torvalds
> <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
>> On Tue, Nov 14, 2017 at 12:03 PM, Mathieu Desnoyers
>> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>>> Here is the last RFC round of the updated rseq patchset containing:
>>
>> Andy? You were the one with concerns here and said you'd have
>> something else ready for comparison.
>>
> 
> I had a long discussion with Mathieu and KS and I think that this is a
> good compromise.  I haven't reviewed the series all that carefully,
> but I think the idea is sound.
> 
> Basically, event_counter is gone (to be re-added in a later kernel if
> it really ends up being necessary, but it looks like it may primarily
> be a temptation to write subtly incorrect user code and to see
> scheduling details that shouldn't be readily exposed rather than a
> genuinely useful feature) and the versioning mechanism for the asm
> critical section bit is improved.  My crazy proposal should be doable
> on top of this if there's demand and if anyone wants to write the
> gnarly code involved.
> 
> IOW no objection from me as long as those changes were made, which I
> *think* they were.  Mathieu?

Yes, I applied all your suggestions.

The major change was removal of the event_counter. I did improve the
user-space side of things significantly, both in terms of speed, and
reduced complexity, as the rseq and cpu_opv C code is now very much
alike. We could even have a library header macro that would ensure both
fast and slow paths are the same.

I added the version field to struct rseq_cs. Along with the "rseq_len"
parameter to sys_rseq for the struct rseq, we should be good to handle
future extensions.

One thing I kept, however, that diverges from your recommendation is the
"sign" parameter to the rseq syscall. I prefer this flexible
approach to a hardcoded signature value. We never know when we may
need to randomize or change this in the future.

Regarding the abort target signature vs x86 disassemblers, I used a
5-byte no-op on x86 32/64:

  x86-32: nopl <sig>
  x86-64: nopl <sig>(%rip)

Other architectures (e.g. arm32, powerpc 32/64) with fixed-size instruction
set don't need this kind of trick to make disassemblers happy. Actually,
it is common practice on e.g. arm 32 to put literal values after branch
instructions so they are close to the load instruction.
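
For illustration, the x86-64 variant of that trick boils down to
something like the snippet below (section, label and handler names are
invented for this example, and it assumes the kernel compares the 32-bit
word immediately preceding the abort address against the registered
"sig" value):

  __attribute__((used))
  static void example_abort(void)
  {
          /* Slow-path recovery for the aborted sequence would go here. */
  }

  /*
   * The 4-byte signature (0x53053053 here, matching whatever was passed
   * as the sys_rseq() signature) is emitted as the displacement of a
   * nopl, so it sits right before the abort address while still
   * disassembling as a harmless no-op.
   */
  asm (   ".pushsection __rseq_abort_demo, \"ax\"\n\t"
          ".byte 0x0f, 0x1f, 0x05\n\t"    /* opcode + modrm of nopl <sig>(%rip) */
          ".long 0x53053053\n"            /* the signature itself */
          "example_abort_ip:\n\t"
          "jmp example_abort\n\t"
          ".popsection");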

Those were all the action items I had gathered following our discussion
at KS. Let me know if I'm missing anything.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]             ` <574606484.15158.1510692743725.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-14 21:48               ` Ben Maurer
  0 siblings, 0 replies; 80+ messages in thread
From: Ben Maurer @ 2017-11-14 21:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas


>>>       int rseq(struct rseq * rseq, uint32_t rseq_len, int flags, uint32_t sig);
>> 
>> Really dumb question -- and one I'm sorry to bring up at the last minute. Should
>> we consider making the syscall name something more generic "register_tls_abi"?
> I proposed that approach back in 2016 ("tls abi" system call), and the feedback
> I received back then is that it was preferred to have a dedicated "rseq" system
> call than an "open ended" and generic "tls abi" system call.

Ultimately I'm fine either way. I do think that in the past few months of review it has become clear that creating this tls abi requires a fair bit of work. It'd be a shame to see a future attempt to use such an ABI made difficult by forcing the author to figure out the registration process yet again. I assume the maintainers of glibc would also like to avoid the need to register multiple ABIs.

-b

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
       [not found]     ` <20171114200414.2188-9-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-15  1:34       ` Mathieu Desnoyers
  2017-11-15  7:44       ` Michael Kerrisk (man-pages)
  2017-11-16 23:26       ` Thomas Gleixner
  2 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-15  1:34 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Nov 14, 2017, at 3:03 PM, Mathieu Desnoyers mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org wrote:
[...]
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3b448ba82225..cab256c1720a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1209,6 +1209,8 @@ static inline void __set_task_cpu(struct task_struct *p,
> unsigned int cpu)
> #endif
> }
> 
> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);

Testing on CONFIG_SMP=n showed that I needed to add an empty static
inline (returning 0) for the !SMP case.
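
For reference, a minimal sketch of that stub next to the SMP prototype
quoted above (on UP there is nothing to migrate, so the push trivially
succeeds):

#ifdef CONFIG_SMP
int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);
#else
static inline int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
{
	return 0;
}
#endif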

Mathieu


> +
> /*
>  * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
>  */

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH for 4.15 18/24] membarrier: provide SHARED_EXPEDITED command
       [not found]   ` <20171114200414.2188-19-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-15  1:36     ` Mathieu Desnoyers
  0 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-15  1:36 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, maged michael, Avi Kivity

----- On Nov 14, 2017, at 3:04 PM, Mathieu Desnoyers mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org wrote:
[...]
> +	if (!fallback) {
> +		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
> +		free_cpumask_var(tmpmask);
> +	}

Testing with preempt debugging options showed that I needed to
disable preemption around smp_call_function_many(). The membarrier
private expedited command in 4.14 is also affected. I will add a
fix to the series covering that case separately, and CC stable.
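
For reference, a sketch of the intended fix on the hunk quoted above
(names taken from that hunk); smp_call_function_many() must be invoked
with preemption disabled:

	if (!fallback) {
		preempt_disable();
		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
		preempt_enable();
		free_cpumask_var(tmpmask);
	}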

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11
       [not found]             ` <2115146800.15215.1510695175687.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-15  4:12               ` Andy Lutomirski
       [not found]                 ` <CALCETrX4dzY_kyZmqR+srKZf7vVYzODH5i9bguFAzdm0dcU3ZQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 80+ messages in thread
From: Andy Lutomirski @ 2017-11-15  4:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Catalin Marinas, Will Deacon

> On Nov 14, 2017, at 1:32 PM, Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>
> ----- On Nov 14, 2017, at 4:15 PM, Andy Lutomirski luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org wrote:
>
>
> One thing I kept, however, that diverges from your recommendation is the
> "sig" parameter to the rseq syscall. I prefer this flexible
> approach to a hardcoded signature value. We never know when we may
> need to randomize or change this in the future.
>
> Regarding the abort target signature vs x86 disassemblers, I used a
> 5-byte no-op on x86 32/64:
>
>  x86-32: nopl <sig>
>  x86-64: nopl <sig>(%rip)

I still don't see how this can possibly work well with libraries.  If
glibc or whatever issues the syscall and registers some signature,
that signature *must* match the expectation of all libraries used in
that thread or it's not going to work.  I can see two reasonable ways
to handle it:

1. The signature is just a well-known constant.  If you have an rseq
abort landing site, you end up with something like:

nopl $11223344(%rip)
landing_site:

or whatever the constant is.

2. The signature varies depending on the rseq_cs in use.  So you get:

static struct rseq_cs this_cs = {
  .signature = 0x55667788,
  ...
};

and then the abort landing site has:

nopl $11223344(%rip)
nopl $55667788(%rax)
landing_site:

The former is a bit easier to deal with.  The latter has the nice
property that you can't subvert one rseq_cs to land somewhere else,
but it's not clear to me what actual attack this prevents, so I
think I prefer #1.  I just think that your variant is asking for
trouble down the road with incompatible userspace.

--Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11
       [not found]                 ` <CALCETrX4dzY_kyZmqR+srKZf7vVYzODH5i9bguFAzdm0dcU3ZQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-11-15  6:34                   ` Mathieu Desnoyers
  0 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-15  6:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Catalin Marinas, Will Deacon

----- On Nov 14, 2017, at 11:12 PM, Andy Lutomirski luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org wrote:

>> On Nov 14, 2017, at 1:32 PM, Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
>> wrote:
>>
>> ----- On Nov 14, 2017, at 4:15 PM, Andy Lutomirski luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org wrote:
>>
>>
>> One thing I kept, however, that diverges from your recommendation is the
>> "sig" parameter to the rseq syscall. I prefer this flexible
>> approach to a hardcoded signature value. We never know when we may
>> need to randomize or change this in the future.
>>
>> Regarding the abort target signature vs x86 disassemblers, I used a
>> 5-byte no-op on x86 32/64:
>>
>>  x86-32: nopl <sig>
>>  x86-64: nopl <sig>(%rip)
> 
> I still don't see how this can possibly work well with libraries.  If
> glibc or whatever issues the syscall and registers some signature,
> that signature *must* match the expectation of all libraries used in
> that thread or it's not going to work.

Here is how I envision this signature can eventually be randomized:

A librseq.so provided by glibc manages rseq thread registration. That
library could generate a random uint32_t value as the signature for each
process, either within a constructor or lazily upon the first call to a
signature query function (whichever comes first).

The constructors of every program/library using rseq would then invoke
that signature getter function to query the random value, iterate over a
section of pointers to signatures, and update them as part of the
constructors (temporarily mprotecting the code pages as writable).

Given that this would prevent page sharing across processes due to
CoW, I would not advise going for this randomized signature solution
unless necessary, but I think it's good to keep the door open to this
by keeping a uint32_t sig argument to sys_rseq.
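
For illustration, a rough user-space sketch of that constructor-based
patching. rseq_get_signature() and the "rseq_sig_ptrs" section are made-up
names, not part of this series; signatures straddling a page boundary are
ignored for brevity:

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

extern uint32_t rseq_get_signature(void);	/* would come from librseq.so */

/* Each rseq user emits the address of its in-code signature word here. */
extern uint32_t *__start_rseq_sig_ptrs[];	/* linker-provided section bounds */
extern uint32_t *__stop_rseq_sig_ptrs[];

__attribute__((constructor))
static void rseq_patch_signatures(void)
{
	uint32_t sig = rseq_get_signature();
	long psz = sysconf(_SC_PAGESIZE);
	uint32_t **p;

	for (p = __start_rseq_sig_ptrs; p < __stop_rseq_sig_ptrs; p++) {
		void *page = (void *)((uintptr_t)*p & ~((uintptr_t)psz - 1));

		/* Temporarily make the code page writable, patch, restore. */
		mprotect(page, psz, PROT_READ | PROT_WRITE | PROT_EXEC);
		**p = sig;
		mprotect(page, psz, PROT_READ | PROT_EXEC);
	}
}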


> I can see two reasonable ways
> to handle it:
> 
> 1. The signature is just a well-known constant.  If you have an rseq
> abort landing site, you end up with something like:
> 
> nopl $11223344(%rip)
> landing_site:
> 
> or whatever the constant is.

If librseq.so passes a hardcoded constant to sys_rseq, then my solution
is very similar to this one, except that mine can allow randomized
signatures in the future, from a kernel ABI perspective.


> 
> 2. The signature varies depending on the rseq_cs in use.  So you get:
> 
> static struct rseq_cs this_cs = {
>  .signature = 0x55667788,
>  ...
> };
> 
> and then the abort landing site has:
> 
> nopl $11223344(%rip)
> nopl $55667788(%rax)
> landing_site:

AFAIU, this solution defeats the purpose of having code signatures in
the first place. An attacker simply has to:

1) Craft a dummy struct rseq_cs on the stack, with:

struct rseq_cs {
  .signature = <whatever needs to be matched prior to the wanted program target>,
  .start_ip = 0x0,
  .len = -1UL,
  .abort_ip = <address of system() or such>,
}

2) Store the address of this dummy struct rseq_cs into __rseq_abi.rseq_cs.

3) Profit.

You should _never_ compare the signature in the code with an integer
value which can end up being controlled by the attacker.

Passing the signature to the system call upon registration leaves to the
kernel the job of keeping that signature around. An attacker would need
to first invoke sys_rseq to unregister the current __rseq_abi and re-register
with another signature in order to make this work. If an attacker has that
much access to control program execution and issue system calls at will,
then the game is already lost: they already control the execution flow,
so what's the point in trying to prevent branching to a specific address?

> 
> The former is a bit easier to deal with.  The latter has the nice
> property that you can't subvert one rseq_cs to land somewhere else,
> but it's not clear to me what actual attack this prevents, so I
> think I prefer #1.  I just think that your variant is asking for
> trouble down the road with incompatible userspace.

As described above, user-space can easily make the signature randomization
work by having all users patch code within constructors.

Thanks,

Mathieu


> 
> --Andy

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
       [not found]     ` <20171114200414.2188-9-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-11-15  1:34       ` Mathieu Desnoyers
@ 2017-11-15  7:44       ` Michael Kerrisk (man-pages)
       [not found]         ` <CAKgNAkjrh_OMi+7EUJxqM0-84WUxL0d_vse4neOL93EB-sGKXw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-11-16 23:26       ` Thomas Gleixner
  2 siblings, 1 reply; 80+ messages in thread
From: Michael Kerrisk (man-pages) @ 2017-11-15  7:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, lkml, Linux API, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin

Hi Mathieu

On 14 November 2017 at 21:03, Mathieu Desnoyers
<mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> This new cpu_opv system call executes a vector of operations on behalf
> of user-space on a specific CPU with preemption disabled. It is inspired
> from readv() and writev() system calls which take a "struct iovec" array
> as argument.

Do you have a man page for this syscall already?

Thanks,

Michael


> The operations available are: comparison, memcpy, add, or, and, xor,
> left shift, right shift, and mb. The system call receives a CPU number
> from user-space as argument, which is the CPU on which those operations
> need to be performed. All preparation steps such as loading pointers,
> and applying offsets to arrays, need to be performed by user-space
> before invoking the system call. The "comparison" operation can be used
> to check that the data used in the preparation step did not change
> between preparation of system call inputs and operation execution within
> the preempt-off critical section.
>
> The reason why we require all pointer offsets to be calculated by
> user-space beforehand is because we need to use get_user_pages_fast() to
> first pin all pages touched by each operation. This takes care of
> faulting-in the pages. Then, preemption is disabled, and the operations
> are performed atomically with respect to other thread execution on that
> CPU, without generating any page fault.
>
> A maximum limit of 16 operations per cpu_opv syscall invocation is
> enforced, so user-space cannot generate a too long preempt-off critical
> section. Each operation is also limited to a length of PAGE_SIZE bytes,
> meaning that an operation can touch a maximum of 4 pages (memcpy: 2
> pages for source, 2 pages for destination if addresses are not aligned
> on page boundaries). Moreover, a total limit of 4216 bytes is applied
> to operation lengths.
>
> If the thread is not running on the requested CPU, a new
> push_task_to_cpu() is invoked to migrate the task to the requested CPU.
> If the requested CPU is not part of the cpus allowed mask of the thread,
> the system call fails with EINVAL. After the migration has been
> performed, preemption is disabled, and the current CPU number is checked
> again and compared to the requested CPU number. If it still differs, it
> means the scheduler migrated us away from that CPU. Return EAGAIN to
> user-space in that case, and let user-space retry (either requesting the
> same CPU number, or a different one, depending on the user-space
> algorithm constraints).
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
> CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
> CC: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
> CC: Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
> CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
> CC: Chris Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
> CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> CC: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
> CC: Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org>
> CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
> CC: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
> CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> CC: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
> CC: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
> CC: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> CC: Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> CC: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> ---
>
> Changes since v1:
> - handle CPU hotplug,
> - cleanup implementation using function pointers: We can use function
>   pointers to implement the operations rather than duplicating all the
>   user-access code.
> - refuse device pages: Performing cpu_opv operations on io map'd pages
>   with preemption disabled could generate long preempt-off critical
>   sections, which leads to unwanted scheduler latency. Return EFAULT if
>   a device page is received as parameter
> - restrict op vector to 4216 bytes length sum: Restrict the operation
>   vector to length sum of:
>   - 4096 bytes (typical page size on most architectures, should be
>     enough for a string, or structures)
>   - 15 * 8 bytes (typical operations on integers or pointers).
>   The goal here is to keep the duration of preempt off critical section
>   short, so we don't add significant scheduler latency.
> - Add INIT_ONSTACK macro: Introduce the
>   CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users
>   correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their
>   stack to 0 on 32-bit architectures.
> - Add CPU_MB_OP operation:
>   Use-cases with:
>   - two consecutive stores,
>   - a mempcy followed by a store,
>   require a memory barrier before the final store operation. A typical
>   use-case is a store-release on the final store. Given that this is a
>   slow path, just providing an explicit full barrier instruction should
>   be sufficient.
> - Add expect fault field:
>   The use-case of list_pop brings interesting challenges. With rseq, we
>   can use rseq_cmpnev_storeoffp_load(), and therefore load a pointer,
>   compare it against NULL, add an offset, and load the target "next"
>   pointer from the object, all within a single rseq critical section.
>
>   Life is not so easy for cpu_opv in this use-case, mainly because we
>   need to pin all pages we are going to touch in the preempt-off
>   critical section beforehand. So we need to know the target object (in
>   which we apply an offset to fetch the next pointer) when we pin pages
>   before disabling preemption.
>
>   So the approach is to load the head pointer and compare it against
>   NULL in user-space, before doing the cpu_opv syscall. User-space can
>   then compute the address of the head->next field, *without loading it*.
>
>   The cpu_opv system call will first need to pin all pages associated
>   with input data. This includes the page backing the head->next object,
>   which may have been concurrently deallocated and unmapped. Therefore,
>   in this case, getting -EFAULT when trying to pin those pages may
>   happen: it just means they have been concurrently unmapped. This is
>   an expected situation, and should just return -EAGAIN to user-space,
>   so user-space can distinguish between "should retry" type of
>   situations and actual errors that should be handled with extreme
>   prejudice to the program (e.g. abort()).
>
>   Therefore, add "expect_fault" fields along with op input address
>   pointers, so user-space can identify whether a fault when getting a
>   field should return EAGAIN rather than EFAULT.
> - Add compiler barrier between operations: Adding a compiler barrier
>   between store operations in a cpu_opv sequence can be useful when
>   paired with membarrier system call.
>
>   An algorithm with a paired slow path and fast path can use
>   sys_membarrier on the slow path to replace fast-path memory barriers
>   by compiler barrier.
>
>   Adding an explicit compiler barrier between operations allows
>   cpu_opv to be used as fallback for operations meant to match
>   the membarrier system call.
>
> Changes since v2:
>
> - Fix memory leak by introducing struct cpu_opv_pinned_pages.
>   Suggested by Boqun Feng.
> - Cast argument 1 passed to access_ok from integer to void __user *,
>   fixing sparse warning.
> ---
>  MAINTAINERS                  |   7 +
>  include/uapi/linux/cpu_opv.h | 117 ++++++
>  init/Kconfig                 |  14 +
>  kernel/Makefile              |   1 +
>  kernel/cpu_opv.c             | 968 +++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/core.c          |  37 ++
>  kernel/sched/sched.h         |   2 +
>  kernel/sys_ni.c              |   1 +
>  8 files changed, 1147 insertions(+)
>  create mode 100644 include/uapi/linux/cpu_opv.h
>  create mode 100644 kernel/cpu_opv.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c9f95f8b07ed..45a1bbdaa287 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3675,6 +3675,13 @@ B:       https://bugzilla.kernel.org
>  F:     drivers/cpuidle/*
>  F:     include/linux/cpuidle.h
>
> +CPU NON-PREEMPTIBLE OPERATION VECTOR SUPPORT
> +M:     Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
> +L:     linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> +S:     Supported
> +F:     kernel/cpu_opv.c
> +F:     include/uapi/linux/cpu_opv.h
> +
>  CRAMFS FILESYSTEM
>  W:     http://sourceforge.net/projects/cramfs/
>  S:     Orphan / Obsolete
> diff --git a/include/uapi/linux/cpu_opv.h b/include/uapi/linux/cpu_opv.h
> new file mode 100644
> index 000000000000..17f7d46e053b
> --- /dev/null
> +++ b/include/uapi/linux/cpu_opv.h
> @@ -0,0 +1,117 @@
> +#ifndef _UAPI_LINUX_CPU_OPV_H
> +#define _UAPI_LINUX_CPU_OPV_H
> +
> +/*
> + * linux/cpu_opv.h
> + *
> + * CPU preempt-off operation vector system call API
> + *
> + * Copyright (c) 2017 Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#ifdef __KERNEL__
> +# include <linux/types.h>
> +#else  /* #ifdef __KERNEL__ */
> +# include <stdint.h>
> +#endif /* #else #ifdef __KERNEL__ */
> +
> +#include <asm/byteorder.h>
> +
> +#ifdef __LP64__
> +# define CPU_OP_FIELD_u32_u64(field)                   uint64_t field
> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)   field = (intptr_t)v
> +#elif defined(__BYTE_ORDER) ? \
> +       __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> +# define CPU_OP_FIELD_u32_u64(field)   uint32_t field ## _padding, field
> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)   \
> +       field ## _padding = 0, field = (intptr_t)v
> +#else
> +# define CPU_OP_FIELD_u32_u64(field)   uint32_t field, field ## _padding
> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)   \
> +       field = (intptr_t)v, field ## _padding = 0
> +#endif
> +
> +#define CPU_OP_VEC_LEN_MAX             16
> +#define CPU_OP_ARG_LEN_MAX             24
> +/* Max. data len per operation. */
> +#define CPU_OP_DATA_LEN_MAX            PAGE_SIZE
> +/*
> + * Max. data len for overall vector. We want to restrict the amount of
> + * user-space data touched by the kernel in non-preemptible context so
> + * we do not introduce long scheduler latencies.
> + * This allows one copy of up to 4096 bytes, and 15 operations touching
> + * 8 bytes each.
> + * This limit is applied to the sum of length specified for all
> + * operations in a vector.
> + */
> +#define CPU_OP_VEC_DATA_LEN_MAX                (4096 + 15*8)
> +#define CPU_OP_MAX_PAGES               4       /* Max. pages per op. */
> +
> +enum cpu_op_type {
> +       CPU_COMPARE_EQ_OP,      /* compare */
> +       CPU_COMPARE_NE_OP,      /* compare */
> +       CPU_MEMCPY_OP,          /* memcpy */
> +       CPU_ADD_OP,             /* arithmetic */
> +       CPU_OR_OP,              /* bitwise */
> +       CPU_AND_OP,             /* bitwise */
> +       CPU_XOR_OP,             /* bitwise */
> +       CPU_LSHIFT_OP,          /* shift */
> +       CPU_RSHIFT_OP,          /* shift */
> +       CPU_MB_OP,              /* memory barrier */
> +};
> +
> +/* Vector of operations to perform. Limited to 16. */
> +struct cpu_op {
> +       int32_t op;     /* enum cpu_op_type. */
> +       uint32_t len;   /* data length, in bytes. */
> +       union {
> +               struct {
> +                       CPU_OP_FIELD_u32_u64(a);
> +                       CPU_OP_FIELD_u32_u64(b);
> +                       uint8_t expect_fault_a;
> +                       uint8_t expect_fault_b;
> +               } compare_op;
> +               struct {
> +                       CPU_OP_FIELD_u32_u64(dst);
> +                       CPU_OP_FIELD_u32_u64(src);
> +                       uint8_t expect_fault_dst;
> +                       uint8_t expect_fault_src;
> +               } memcpy_op;
> +               struct {
> +                       CPU_OP_FIELD_u32_u64(p);
> +                       int64_t count;
> +                       uint8_t expect_fault_p;
> +               } arithmetic_op;
> +               struct {
> +                       CPU_OP_FIELD_u32_u64(p);
> +                       uint64_t mask;
> +                       uint8_t expect_fault_p;
> +               } bitwise_op;
> +               struct {
> +                       CPU_OP_FIELD_u32_u64(p);
> +                       uint32_t bits;
> +                       uint8_t expect_fault_p;
> +               } shift_op;
> +               char __padding[CPU_OP_ARG_LEN_MAX];
> +       } u;
> +};
> +
> +#endif /* _UAPI_LINUX_CPU_OPV_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index cbedfb91b40a..e4fbb5dd6a24 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1404,6 +1404,7 @@ config RSEQ
>         bool "Enable rseq() system call" if EXPERT
>         default y
>         depends on HAVE_RSEQ
> +       select CPU_OPV
>         select MEMBARRIER
>         help
>           Enable the restartable sequences system call. It provides a
> @@ -1414,6 +1415,19 @@ config RSEQ
>
>           If unsure, say Y.
>
> +config CPU_OPV
> +       bool "Enable cpu_opv() system call" if EXPERT
> +       default y
> +       help
> +         Enable the CPU preempt-off operation vector system call.
> +         It allows user-space to perform a sequence of operations on
> +         per-cpu data with preemption disabled. Useful as
> +         single-stepping fall-back for restartable sequences, and for
> +         performing more complex operations on per-cpu data that would
> +         not be otherwise possible to do with restartable sequences.
> +
> +         If unsure, say Y.
> +
>  config EMBEDDED
>         bool "Embedded system"
>         option allnoconfig_y
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 3574669dafd9..cac8855196ff 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -113,6 +113,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
>
>  obj-$(CONFIG_HAS_IOMEM) += memremap.o
>  obj-$(CONFIG_RSEQ) += rseq.o
> +obj-$(CONFIG_CPU_OPV) += cpu_opv.o
>
>  $(obj)/configs.o: $(obj)/config_data.h
>
> diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c
> new file mode 100644
> index 000000000000..a81837a14b17
> --- /dev/null
> +++ b/kernel/cpu_opv.c
> @@ -0,0 +1,968 @@
> +/*
> + * CPU preempt-off operation vector system call
> + *
> + * It allows user-space to perform a sequence of operations on per-cpu
> + * data with preemption disabled. Useful as single-stepping fall-back
> + * for restartable sequences, and for performing more complex operations
> + * on per-cpu data that would not be otherwise possible to do with
> + * restartable sequences.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * Copyright (C) 2017, EfficiOS Inc.,
> + * Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/uaccess.h>
> +#include <linux/syscalls.h>
> +#include <linux/cpu_opv.h>
> +#include <linux/types.h>
> +#include <linux/mutex.h>
> +#include <linux/pagemap.h>
> +#include <asm/ptrace.h>
> +#include <asm/byteorder.h>
> +
> +#include "sched/sched.h"
> +
> +#define TMP_BUFLEN                     64
> +#define NR_PINNED_PAGES_ON_STACK       8
> +
> +union op_fn_data {
> +       uint8_t _u8;
> +       uint16_t _u16;
> +       uint32_t _u32;
> +       uint64_t _u64;
> +#if (BITS_PER_LONG < 64)
> +       uint32_t _u64_split[2];
> +#endif
> +};
> +
> +struct cpu_opv_pinned_pages {
> +       struct page **pages;
> +       size_t nr;
> +       bool is_kmalloc;
> +};
> +
> +typedef int (*op_fn_t)(union op_fn_data *data, uint64_t v, uint32_t len);
> +
> +static DEFINE_MUTEX(cpu_opv_offline_lock);
> +
> +/*
> + * The cpu_opv system call executes a vector of operations on behalf of
> + * user-space on a specific CPU with preemption disabled. It is inspired
> + * from readv() and writev() system calls which take a "struct iovec"
> + * array as argument.
> + *
> + * The operations available are: comparison, memcpy, add, or, and, xor,
> + * left shift, and right shift. The system call receives a CPU number
> + * from user-space as argument, which is the CPU on which those
> + * operations need to be performed. All preparation steps such as
> + * loading pointers, and applying offsets to arrays, need to be
> + * performed by user-space before invoking the system call. The
> + * "comparison" operation can be used to check that the data used in the
> + * preparation step did not change between preparation of system call
> + * inputs and operation execution within the preempt-off critical
> + * section.
> + *
> + * The reason why we require all pointer offsets to be calculated by
> + * user-space beforehand is because we need to use get_user_pages_fast()
> + * to first pin all pages touched by each operation. This takes care of
> + * faulting-in the pages. Then, preemption is disabled, and the
> + * operations are performed atomically with respect to other thread
> + * execution on that CPU, without generating any page fault.
> + *
> + * A maximum limit of 16 operations per cpu_opv syscall invocation is
> + * enforced, and an overall maximum length sum, so user-space cannot
> + * generate a too long preempt-off critical section. Each operation is
> + * also limited to a length of PAGE_SIZE bytes, meaning that an operation
> + * can touch a maximum of 4 pages (memcpy: 2 pages for source, 2 pages
> + * for destination if addresses are not aligned on page boundaries).
> + *
> + * If the thread is not running on the requested CPU, a new
> + * push_task_to_cpu() is invoked to migrate the task to the requested
> + * CPU.  If the requested CPU is not part of the cpus allowed mask of
> + * the thread, the system call fails with EINVAL. After the migration
> + * has been performed, preemption is disabled, and the current CPU
> + * number is checked again and compared to the requested CPU number. If
> + * it still differs, it means the scheduler migrated us away from that
> + * CPU. Return EAGAIN to user-space in that case, and let user-space
> + * retry (either requesting the same CPU number, or a different one,
> + * depending on the user-space algorithm constraints).
> + */
> +
> +/*
> + * Check operation types and length parameters.
> + */
> +static int cpu_opv_check(struct cpu_op *cpuop, int cpuopcnt)
> +{
> +       int i;
> +       uint32_t sum = 0;
> +
> +       for (i = 0; i < cpuopcnt; i++) {
> +               struct cpu_op *op = &cpuop[i];
> +
> +               switch (op->op) {
> +               case CPU_MB_OP:
> +                       break;
> +               default:
> +                       sum += op->len;
> +               }
> +               switch (op->op) {
> +               case CPU_COMPARE_EQ_OP:
> +               case CPU_COMPARE_NE_OP:
> +               case CPU_MEMCPY_OP:
> +                       if (op->len > CPU_OP_DATA_LEN_MAX)
> +                               return -EINVAL;
> +                       break;
> +               case CPU_ADD_OP:
> +               case CPU_OR_OP:
> +               case CPU_AND_OP:
> +               case CPU_XOR_OP:
> +                       switch (op->len) {
> +                       case 1:
> +                       case 2:
> +                       case 4:
> +                       case 8:
> +                               break;
> +                       default:
> +                               return -EINVAL;
> +                       }
> +                       break;
> +               case CPU_LSHIFT_OP:
> +               case CPU_RSHIFT_OP:
> +                       switch (op->len) {
> +                       case 1:
> +                               if (op->u.shift_op.bits > 7)
> +                                       return -EINVAL;
> +                               break;
> +                       case 2:
> +                               if (op->u.shift_op.bits > 15)
> +                                       return -EINVAL;
> +                               break;
> +                       case 4:
> +                               if (op->u.shift_op.bits > 31)
> +                                       return -EINVAL;
> +                               break;
> +                       case 8:
> +                               if (op->u.shift_op.bits > 63)
> +                                       return -EINVAL;
> +                               break;
> +                       default:
> +                               return -EINVAL;
> +                       }
> +                       break;
> +               case CPU_MB_OP:
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +       if (sum > CPU_OP_VEC_DATA_LEN_MAX)
> +               return -EINVAL;
> +       return 0;
> +}
> +
> +static unsigned long cpu_op_range_nr_pages(unsigned long addr,
> +               unsigned long len)
> +{
> +       return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1;
> +}
> +
> +static int cpu_op_check_page(struct page *page)
> +{
> +       struct address_space *mapping;
> +
> +       if (is_zone_device_page(page))
> +               return -EFAULT;
> +       page = compound_head(page);
> +       mapping = READ_ONCE(page->mapping);
> +       if (!mapping) {
> +               int shmem_swizzled;
> +
> +               /*
> +                * Check again with page lock held to guard against
> +                * memory pressure making shmem_writepage move the page
> +                * from filecache to swapcache.
> +                */
> +               lock_page(page);
> +               shmem_swizzled = PageSwapCache(page) || page->mapping;
> +               unlock_page(page);
> +               if (shmem_swizzled)
> +                       return -EAGAIN;
> +               return -EFAULT;
> +       }
> +       return 0;
> +}
> +
> +/*
> + * Refusing device pages, the zero page, pages in the gate area, and
> + * special mappings. Inspired from futex.c checks.
> + */
> +static int cpu_op_check_pages(struct page **pages,
> +               unsigned long nr_pages)
> +{
> +       unsigned long i;
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               int ret;
> +
> +               ret = cpu_op_check_page(pages[i]);
> +               if (ret)
> +                       return ret;
> +       }
> +       return 0;
> +}
> +
> +static int cpu_op_pin_pages(unsigned long addr, unsigned long len,
> +               struct cpu_opv_pinned_pages *pin_pages, int write)
> +{
> +       struct page *pages[2];
> +       int ret, nr_pages;
> +
> +       if (!len)
> +               return 0;
> +       nr_pages = cpu_op_range_nr_pages(addr, len);
> +       BUG_ON(nr_pages > 2);
> +       if (!pin_pages->is_kmalloc && pin_pages->nr + nr_pages
> +                       > NR_PINNED_PAGES_ON_STACK) {
> +               struct page **pinned_pages =
> +                       kzalloc(CPU_OP_VEC_LEN_MAX * CPU_OP_MAX_PAGES
> +                               * sizeof(struct page *), GFP_KERNEL);
> +               if (!pinned_pages)
> +                       return -ENOMEM;
> +               memcpy(pinned_pages, pin_pages->pages,
> +                       pin_pages->nr * sizeof(struct page *));
> +               pin_pages->pages = pinned_pages;
> +               pin_pages->is_kmalloc = true;
> +       }
> +again:
> +       ret = get_user_pages_fast(addr, nr_pages, write, pages);
> +       if (ret < nr_pages) {
> +               if (ret > 0)
> +                       put_page(pages[0]);
> +               return -EFAULT;
> +       }
> +       /*
> +        * Refuse device pages, the zero page, pages in the gate area,
> +        * and special mappings.
> +        */
> +       ret = cpu_op_check_pages(pages, nr_pages);
> +       if (ret == -EAGAIN) {
> +               put_page(pages[0]);
> +               if (nr_pages > 1)
> +                       put_page(pages[1]);
> +               goto again;
> +       }
> +       if (ret)
> +               goto error;
> +       pin_pages->pages[pin_pages->nr++] = pages[0];
> +       if (nr_pages > 1)
> +               pin_pages->pages[pin_pages->nr++] = pages[1];
> +       return 0;
> +
> +error:
> +       put_page(pages[0]);
> +       if (nr_pages > 1)
> +               put_page(pages[1]);
> +       return -EFAULT;
> +}
> +
> +static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt,
> +               struct cpu_opv_pinned_pages *pin_pages)
> +{
> +       int ret, i;
> +       bool expect_fault = false;
> +
> +       /* Check access, pin pages. */
> +       for (i = 0; i < cpuopcnt; i++) {
> +               struct cpu_op *op = &cpuop[i];
> +
> +               switch (op->op) {
> +               case CPU_COMPARE_EQ_OP:
> +               case CPU_COMPARE_NE_OP:
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.compare_op.expect_fault_a;
> +                       if (!access_ok(VERIFY_READ,
> +                                       (void __user *)op->u.compare_op.a,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.compare_op.a,
> +                                       op->len, pin_pages, 0);
> +                       if (ret)
> +                               goto error;
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.compare_op.expect_fault_b;
> +                       if (!access_ok(VERIFY_READ,
> +                                       (void __user *)op->u.compare_op.b,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.compare_op.b,
> +                                       op->len, pin_pages, 0);
> +                       if (ret)
> +                               goto error;
> +                       break;
> +               case CPU_MEMCPY_OP:
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.memcpy_op.expect_fault_dst;
> +                       if (!access_ok(VERIFY_WRITE,
> +                                       (void __user *)op->u.memcpy_op.dst,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.memcpy_op.dst,
> +                                       op->len, pin_pages, 1);
> +                       if (ret)
> +                               goto error;
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.memcpy_op.expect_fault_src;
> +                       if (!access_ok(VERIFY_READ,
> +                                       (void __user *)op->u.memcpy_op.src,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.memcpy_op.src,
> +                                       op->len, pin_pages, 0);
> +                       if (ret)
> +                               goto error;
> +                       break;
> +               case CPU_ADD_OP:
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.arithmetic_op.expect_fault_p;
> +                       if (!access_ok(VERIFY_WRITE,
> +                                       (void __user *)op->u.arithmetic_op.p,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.arithmetic_op.p,
> +                                       op->len, pin_pages, 1);
> +                       if (ret)
> +                               goto error;
> +                       break;
> +               case CPU_OR_OP:
> +               case CPU_AND_OP:
> +               case CPU_XOR_OP:
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.bitwise_op.expect_fault_p;
> +                       if (!access_ok(VERIFY_WRITE,
> +                                       (void __user *)op->u.bitwise_op.p,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.bitwise_op.p,
> +                                       op->len, pin_pages, 1);
> +                       if (ret)
> +                               goto error;
> +                       break;
> +               case CPU_LSHIFT_OP:
> +               case CPU_RSHIFT_OP:
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.shift_op.expect_fault_p;
> +                       if (!access_ok(VERIFY_WRITE,
> +                                       (void __user *)op->u.shift_op.p,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.shift_op.p,
> +                                       op->len, pin_pages, 1);
> +                       if (ret)
> +                               goto error;
> +                       break;
> +               case CPU_MB_OP:
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +       return 0;
> +
> +error:
> +       for (i = 0; i < pin_pages->nr; i++)
> +               put_page(pin_pages->pages[i]);
> +       pin_pages->nr = 0;
> +       /*
> +        * If faulting access is expected, return EAGAIN to user-space.
> +        * It allows user-space to distinguish between a fault caused by
> +        * an access which is expected to fault (e.g. due to concurrent
> +        * unmapping of underlying memory) from an unexpected fault from
> +        * which a retry would not recover.
> +        */
> +       if (ret == -EFAULT && expect_fault)
> +               return -EAGAIN;
> +       return ret;
> +}
> +
> +/* Return 0 if same, > 0 if different, < 0 on error. */
> +static int do_cpu_op_compare_iter(void __user *a, void __user *b, uint32_t len)
> +{
> +       char bufa[TMP_BUFLEN], bufb[TMP_BUFLEN];
> +       uint32_t compared = 0;
> +
> +       while (compared != len) {
> +               unsigned long to_compare;
> +
> +               to_compare = min_t(uint32_t, TMP_BUFLEN, len - compared);
> +               if (__copy_from_user_inatomic(bufa, a + compared, to_compare))
> +                       return -EFAULT;
> +               if (__copy_from_user_inatomic(bufb, b + compared, to_compare))
> +                       return -EFAULT;
> +               if (memcmp(bufa, bufb, to_compare))
> +                       return 1;       /* different */
> +               compared += to_compare;
> +       }
> +       return 0;       /* same */
> +}
> +
> +/* Return 0 if same, > 0 if different, < 0 on error. */
> +static int do_cpu_op_compare(void __user *a, void __user *b, uint32_t len)
> +{
> +       int ret = -EFAULT;
> +       union {
> +               uint8_t _u8;
> +               uint16_t _u16;
> +               uint32_t _u32;
> +               uint64_t _u64;
> +#if (BITS_PER_LONG < 64)
> +               uint32_t _u64_split[2];
> +#endif
> +       } tmp[2];
> +
> +       pagefault_disable();
> +       switch (len) {
> +       case 1:
> +               if (__get_user(tmp[0]._u8, (uint8_t __user *)a))
> +                       goto end;
> +               if (__get_user(tmp[1]._u8, (uint8_t __user *)b))
> +                       goto end;
> +               ret = !!(tmp[0]._u8 != tmp[1]._u8);
> +               break;
> +       case 2:
> +               if (__get_user(tmp[0]._u16, (uint16_t __user *)a))
> +                       goto end;
> +               if (__get_user(tmp[1]._u16, (uint16_t __user *)b))
> +                       goto end;
> +               ret = !!(tmp[0]._u16 != tmp[1]._u16);
> +               break;
> +       case 4:
> +               if (__get_user(tmp[0]._u32, (uint32_t __user *)a))
> +                       goto end;
> +               if (__get_user(tmp[1]._u32, (uint32_t __user *)b))
> +                       goto end;
> +               ret = !!(tmp[0]._u32 != tmp[1]._u32);
> +               break;
> +       case 8:
> +#if (BITS_PER_LONG >= 64)
> +               if (__get_user(tmp[0]._u64, (uint64_t __user *)a))
> +                       goto end;
> +               if (__get_user(tmp[1]._u64, (uint64_t __user *)b))
> +                       goto end;
> +#else
> +               if (__get_user(tmp[0]._u64_split[0], (uint32_t __user *)a))
> +                       goto end;
> +               if (__get_user(tmp[0]._u64_split[1], (uint32_t __user *)a + 1))
> +                       goto end;
> +               if (__get_user(tmp[1]._u64_split[0], (uint32_t __user *)b))
> +                       goto end;
> +               if (__get_user(tmp[1]._u64_split[1], (uint32_t __user *)b + 1))
> +                       goto end;
> +#endif
> +               ret = !!(tmp[0]._u64 != tmp[1]._u64);
> +               break;
> +       default:
> +               pagefault_enable();
> +               return do_cpu_op_compare_iter(a, b, len);
> +       }
> +end:
> +       pagefault_enable();
> +       return ret;
> +}
> +
> +/* Return 0 on success, < 0 on error. */
> +static int do_cpu_op_memcpy_iter(void __user *dst, void __user *src,
> +               uint32_t len)
> +{
> +       char buf[TMP_BUFLEN];
> +       uint32_t copied = 0;
> +
> +       while (copied != len) {
> +               unsigned long to_copy;
> +
> +               to_copy = min_t(uint32_t, TMP_BUFLEN, len - copied);
> +               if (__copy_from_user_inatomic(buf, src + copied, to_copy))
> +                       return -EFAULT;
> +               if (__copy_to_user_inatomic(dst + copied, buf, to_copy))
> +                       return -EFAULT;
> +               copied += to_copy;
> +       }
> +       return 0;
> +}
> +
> +/* Return 0 on success, < 0 on error. */
> +static int do_cpu_op_memcpy(void __user *dst, void __user *src, uint32_t len)
> +{
> +       int ret = -EFAULT;
> +       union {
> +               uint8_t _u8;
> +               uint16_t _u16;
> +               uint32_t _u32;
> +               uint64_t _u64;
> +#if (BITS_PER_LONG < 64)
> +               uint32_t _u64_split[2];
> +#endif
> +       } tmp;
> +
> +       pagefault_disable();
> +       switch (len) {
> +       case 1:
> +               if (__get_user(tmp._u8, (uint8_t __user *)src))
> +                       goto end;
> +               if (__put_user(tmp._u8, (uint8_t __user *)dst))
> +                       goto end;
> +               break;
> +       case 2:
> +               if (__get_user(tmp._u16, (uint16_t __user *)src))
> +                       goto end;
> +               if (__put_user(tmp._u16, (uint16_t __user *)dst))
> +                       goto end;
> +               break;
> +       case 4:
> +               if (__get_user(tmp._u32, (uint32_t __user *)src))
> +                       goto end;
> +               if (__put_user(tmp._u32, (uint32_t __user *)dst))
> +                       goto end;
> +               break;
> +       case 8:
> +#if (BITS_PER_LONG >= 64)
> +               if (__get_user(tmp._u64, (uint64_t __user *)src))
> +                       goto end;
> +               if (__put_user(tmp._u64, (uint64_t __user *)dst))
> +                       goto end;
> +#else
> +               if (__get_user(tmp._u64_split[0], (uint32_t __user *)src))
> +                       goto end;
> +               if (__get_user(tmp._u64_split[1], (uint32_t __user *)src + 1))
> +                       goto end;
> +               if (__put_user(tmp._u64_split[0], (uint32_t __user *)dst))
> +                       goto end;
> +               if (__put_user(tmp._u64_split[1], (uint32_t __user *)dst + 1))
> +                       goto end;
> +#endif
> +               break;
> +       default:
> +               pagefault_enable();
> +               return do_cpu_op_memcpy_iter(dst, src, len);
> +       }
> +       ret = 0;
> +end:
> +       pagefault_enable();
> +       return ret;
> +}
> +
> +static int op_add_fn(union op_fn_data *data, uint64_t count, uint32_t len)
> +{
> +       int ret = 0;
> +
> +       switch (len) {
> +       case 1:
> +               data->_u8 += (uint8_t)count;
> +               break;
> +       case 2:
> +               data->_u16 += (uint16_t)count;
> +               break;
> +       case 4:
> +               data->_u32 += (uint32_t)count;
> +               break;
> +       case 8:
> +               data->_u64 += (uint64_t)count;
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +       return ret;
> +}
> +
> +static int op_or_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
> +{
> +       int ret = 0;
> +
> +       switch (len) {
> +       case 1:
> +               data->_u8 |= (uint8_t)mask;
> +               break;
> +       case 2:
> +               data->_u16 |= (uint16_t)mask;
> +               break;
> +       case 4:
> +               data->_u32 |= (uint32_t)mask;
> +               break;
> +       case 8:
> +               data->_u64 |= (uint64_t)mask;
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +       return ret;
> +}
> +
> +static int op_and_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
> +{
> +       int ret = 0;
> +
> +       switch (len) {
> +       case 1:
> +               data->_u8 &= (uint8_t)mask;
> +               break;
> +       case 2:
> +               data->_u16 &= (uint16_t)mask;
> +               break;
> +       case 4:
> +               data->_u32 &= (uint32_t)mask;
> +               break;
> +       case 8:
> +               data->_u64 &= (uint64_t)mask;
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +       return ret;
> +}
> +
> +static int op_xor_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
> +{
> +       int ret = 0;
> +
> +       switch (len) {
> +       case 1:
> +               data->_u8 ^= (uint8_t)mask;
> +               break;
> +       case 2:
> +               data->_u16 ^= (uint16_t)mask;
> +               break;
> +       case 4:
> +               data->_u32 ^= (uint32_t)mask;
> +               break;
> +       case 8:
> +               data->_u64 ^= (uint64_t)mask;
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +       return ret;
> +}
> +
> +static int op_lshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
> +{
> +       int ret = 0;
> +
> +       switch (len) {
> +       case 1:
> +               data->_u8 <<= (uint8_t)bits;
> +               break;
> +       case 2:
> +               data->_u16 <<= (uint16_t)bits;
> +               break;
> +       case 4:
> +               data->_u32 <<= (uint32_t)bits;
> +               break;
> +       case 8:
> +               data->_u64 <<= (uint64_t)bits;
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +       return ret;
> +}
> +
> +static int op_rshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
> +{
> +       int ret = 0;
> +
> +       switch (len) {
> +       case 1:
> +               data->_u8 >>= (uint8_t)bits;
> +               break;
> +       case 2:
> +               data->_u16 >>= (uint16_t)bits;
> +               break;
> +       case 4:
> +               data->_u32 >>= (uint32_t)bits;
> +               break;
> +       case 8:
> +               data->_u64 >>= (uint64_t)bits;
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +       return ret;
> +}
> +
> +/* Return 0 on success, < 0 on error. */
> +static int do_cpu_op_fn(op_fn_t op_fn, void __user *p, uint64_t v,
> +               uint32_t len)
> +{
> +       int ret = -EFAULT;
> +       union op_fn_data tmp;
> +
> +       pagefault_disable();
> +       switch (len) {
> +       case 1:
> +               if (__get_user(tmp._u8, (uint8_t __user *)p))
> +                       goto end;
> +               if (op_fn(&tmp, v, len))
> +                       goto end;
> +               if (__put_user(tmp._u8, (uint8_t __user *)p))
> +                       goto end;
> +               break;
> +       case 2:
> +               if (__get_user(tmp._u16, (uint16_t __user *)p))
> +                       goto end;
> +               if (op_fn(&tmp, v, len))
> +                       goto end;
> +               if (__put_user(tmp._u16, (uint16_t __user *)p))
> +                       goto end;
> +               break;
> +       case 4:
> +               if (__get_user(tmp._u32, (uint32_t __user *)p))
> +                       goto end;
> +               if (op_fn(&tmp, v, len))
> +                       goto end;
> +               if (__put_user(tmp._u32, (uint32_t __user *)p))
> +                       goto end;
> +               break;
> +       case 8:
> +#if (BITS_PER_LONG >= 64)
> +               if (__get_user(tmp._u64, (uint64_t __user *)p))
> +                       goto end;
> +#else
> +               if (__get_user(tmp._u64_split[0], (uint32_t __user *)p))
> +                       goto end;
> +               if (__get_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
> +                       goto end;
> +#endif
> +               if (op_fn(&tmp, v, len))
> +                       goto end;
> +#if (BITS_PER_LONG >= 64)
> +               if (__put_user(tmp._u64, (uint64_t __user *)p))
> +                       goto end;
> +#else
> +               if (__put_user(tmp._u64_split[0], (uint32_t __user *)p))
> +                       goto end;
> +               if (__put_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
> +                       goto end;
> +#endif
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               goto end;
> +       }
> +       ret = 0;
> +end:
> +       pagefault_enable();
> +       return ret;
> +}
> +
> +static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt)
> +{
> +       int i, ret;
> +
> +       for (i = 0; i < cpuopcnt; i++) {
> +               struct cpu_op *op = &cpuop[i];
> +
> +               /* Guarantee a compiler barrier between each operation. */
> +               barrier();
> +
> +               switch (op->op) {
> +               case CPU_COMPARE_EQ_OP:
> +                       ret = do_cpu_op_compare(
> +                                       (void __user *)op->u.compare_op.a,
> +                                       (void __user *)op->u.compare_op.b,
> +                                       op->len);
> +                       /* Stop execution on error. */
> +                       if (ret < 0)
> +                               return ret;
> +                       /*
> +                        * Stop execution, return op index + 1 if comparison
> +                        * differs.
> +                        */
> +                       if (ret > 0)
> +                               return i + 1;
> +                       break;
> +               case CPU_COMPARE_NE_OP:
> +                       ret = do_cpu_op_compare(
> +                                       (void __user *)op->u.compare_op.a,
> +                                       (void __user *)op->u.compare_op.b,
> +                                       op->len);
> +                       /* Stop execution on error. */
> +                       if (ret < 0)
> +                               return ret;
> +                       /*
> +                        * Stop execution, return op index + 1 if comparison
> +                        * is identical.
> +                        */
> +                       if (ret == 0)
> +                               return i + 1;
> +                       break;
> +               case CPU_MEMCPY_OP:
> +                       ret = do_cpu_op_memcpy(
> +                                       (void __user *)op->u.memcpy_op.dst,
> +                                       (void __user *)op->u.memcpy_op.src,
> +                                       op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_ADD_OP:
> +                       ret = do_cpu_op_fn(op_add_fn,
> +                                       (void __user *)op->u.arithmetic_op.p,
> +                                       op->u.arithmetic_op.count, op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_OR_OP:
> +                       ret = do_cpu_op_fn(op_or_fn,
> +                                       (void __user *)op->u.bitwise_op.p,
> +                                       op->u.bitwise_op.mask, op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_AND_OP:
> +                       ret = do_cpu_op_fn(op_and_fn,
> +                                       (void __user *)op->u.bitwise_op.p,
> +                                       op->u.bitwise_op.mask, op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_XOR_OP:
> +                       ret = do_cpu_op_fn(op_xor_fn,
> +                                       (void __user *)op->u.bitwise_op.p,
> +                                       op->u.bitwise_op.mask, op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_LSHIFT_OP:
> +                       ret = do_cpu_op_fn(op_lshift_fn,
> +                                       (void __user *)op->u.shift_op.p,
> +                                       op->u.shift_op.bits, op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_RSHIFT_OP:
> +                       ret = do_cpu_op_fn(op_rshift_fn,
> +                                       (void __user *)op->u.shift_op.p,
> +                                       op->u.shift_op.bits, op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_MB_OP:
> +                       smp_mb();
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +       return 0;
> +}
> +
> +static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt, int cpu)
> +{
> +       int ret;
> +
> +       if (cpu != raw_smp_processor_id()) {
> +               ret = push_task_to_cpu(current, cpu);
> +               if (ret)
> +                       goto check_online;
> +       }
> +       preempt_disable();
> +       if (cpu != smp_processor_id()) {
> +               ret = -EAGAIN;
> +               goto end;
> +       }
> +       ret = __do_cpu_opv(cpuop, cpuopcnt);
> +end:
> +       preempt_enable();
> +       return ret;
> +
> +check_online:
> +       if (!cpu_possible(cpu))
> +               return -EINVAL;
> +       get_online_cpus();
> +       if (cpu_online(cpu)) {
> +               ret = -EAGAIN;
> +               goto put_online_cpus;
> +       }
> +       /*
> +        * CPU is offline. Perform operation from the current CPU with
> +        * cpu_online read lock held, preventing that CPU from coming online,
> +        * and with mutex held, providing mutual exclusion against other
> +        * CPUs also finding out about an offline CPU.
> +        */
> +       mutex_lock(&cpu_opv_offline_lock);
> +       ret = __do_cpu_opv(cpuop, cpuopcnt);
> +       mutex_unlock(&cpu_opv_offline_lock);
> +put_online_cpus:
> +       put_online_cpus();
> +       return ret;
> +}
> +
> +/*
> + * cpu_opv - execute operation vector on a given CPU with preempt off.
> + *
> + * Userspace should pass current CPU number as parameter. May fail with
> + * -EAGAIN if currently executing on the wrong CPU.
> + */
> +SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
> +               int, cpu, int, flags)
> +{
> +       struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX];
> +       struct page *pinned_pages_on_stack[NR_PINNED_PAGES_ON_STACK];
> +       struct cpu_opv_pinned_pages pin_pages = {
> +               .pages = pinned_pages_on_stack,
> +               .nr = 0,
> +               .is_kmalloc = false,
> +       };
> +       int ret, i;
> +
> +       if (unlikely(flags))
> +               return -EINVAL;
> +       if (unlikely(cpu < 0))
> +               return -EINVAL;
> +       if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX)
> +               return -EINVAL;
> +       if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op)))
> +               return -EFAULT;
> +       ret = cpu_opv_check(cpuopv, cpuopcnt);
> +       if (ret)
> +               return ret;
> +       ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &pin_pages);
> +       if (ret)
> +               goto end;
> +       ret = do_cpu_opv(cpuopv, cpuopcnt, cpu);
> +       for (i = 0; i < pin_pages.nr; i++)
> +               put_page(pin_pages.pages[i]);
> +end:
> +       if (pin_pages.is_kmalloc)
> +               kfree(pin_pages.pages);
> +       return ret;
> +}
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6bba05f47e51..e547f93a46c2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1052,6 +1052,43 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
>                 set_curr_task(rq, p);
>  }
>
> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
> +{
> +       struct rq_flags rf;
> +       struct rq *rq;
> +       int ret = 0;
> +
> +       rq = task_rq_lock(p, &rf);
> +       update_rq_clock(rq);
> +
> +       if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       if (task_cpu(p) == dest_cpu)
> +               goto out;
> +
> +       if (task_running(rq, p) || p->state == TASK_WAKING) {
> +               struct migration_arg arg = { p, dest_cpu };
> +               /* Need help from migration thread: drop lock and wait. */
> +               task_rq_unlock(rq, p, &rf);
> +               stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
> +               tlb_migrate_finish(p->mm);
> +               return 0;
> +       } else if (task_on_rq_queued(p)) {
> +               /*
> +                * OK, since we're going to drop the lock immediately
> +                * afterwards anyway.
> +                */
> +               rq = move_queued_task(rq, &rf, p, dest_cpu);
> +       }
> +out:
> +       task_rq_unlock(rq, p, &rf);
> +
> +       return ret;
> +}
> +
>  /*
>   * Change a given task's CPU affinity. Migrate the thread to a
>   * proper CPU and schedule it away if the CPU it's executing on
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3b448ba82225..cab256c1720a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1209,6 +1209,8 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
>  #endif
>  }
>
> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);
> +
>  /*
>   * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
>   */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index bfa1ee1bf669..59e622296dc3 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -262,3 +262,4 @@ cond_syscall(sys_pkey_free);
>
>  /* restartable sequence */
>  cond_syscall(sys_rseq);
> +cond_syscall(sys_cpu_opv);
> --
> 2.11.0
>
>
>



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
       [not found]         ` <CAKgNAkjrh_OMi+7EUJxqM0-84WUxL0d_vse4neOL93EB-sGKXw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-11-15 14:30           ` Mathieu Desnoyers
  0 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-15 14:30 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas

----- On Nov 15, 2017, at 2:44 AM, Michael Kerrisk mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:

> Hi Mathieu
> 
> On 14 November 2017 at 21:03, Mathieu Desnoyers
> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>> This new cpu_opv system call executes a vector of operations on behalf
>> of user-space on a specific CPU with preemption disabled. It is inspired
>> from readv() and writev() system calls which take a "struct iovec" array
>> as argument.
> 
> Do you have a man page for this syscall already?

Hi Michael,

It's the next thing on my roadmap when the syscall reaches mainline.
That and membarrier commands man pages updates.

Thanks,

Mathieu

> 
> Thanks,
> 
> Michael
> 
> 
>> The operations available are: comparison, memcpy, add, or, and, xor,
>> left shift, right shift, and mb. The system call receives a CPU number
>> from user-space as argument, which is the CPU on which those operations
>> need to be performed. All preparation steps such as loading pointers,
>> and applying offsets to arrays, need to be performed by user-space
>> before invoking the system call. The "comparison" operation can be used
>> to check that the data used in the preparation step did not change
>> between preparation of system call inputs and operation execution within
>> the preempt-off critical section.
>>
>> The reason why we require all pointer offsets to be calculated by
>> user-space beforehand is because we need to use get_user_pages_fast() to
>> first pin all pages touched by each operation. This takes care of
>> faulting-in the pages. Then, preemption is disabled, and the operations
>> are performed atomically with respect to other thread execution on that
>> CPU, without generating any page fault.
>>
>> A maximum limit of 16 operations per cpu_opv syscall invocation is
>> enforced, so user-space cannot generate an overly long preempt-off critical
>> section. Each operation is also limited to a length of PAGE_SIZE bytes,
>> meaning that an operation can touch a maximum of 4 pages (memcpy: 2
>> pages for source, 2 pages for destination if addresses are not aligned
>> on page boundaries). Moreover, a total limit of 4216 bytes is applied
>> to operation lengths.
>>
>> If the thread is not running on the requested CPU, a new
>> push_task_to_cpu() is invoked to migrate the task to the requested CPU.
>> If the requested CPU is not part of the cpus allowed mask of the thread,
>> the system call fails with EINVAL. After the migration has been
>> performed, preemption is disabled, and the current CPU number is checked
>> again and compared to the requested CPU number. If it still differs, it
>> means the scheduler migrated us away from that CPU. Return EAGAIN to
>> user-space in that case, and let user-space retry (either requesting the
>> same CPU number, or a different one, depending on the user-space
>> algorithm constraints).
>>
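As an illustration of the retry contract described above, here is a minimal
user-space sketch (not taken from this patch): it assumes an LP64 build, the
uapi header from this patch installed as <linux/cpu_opv.h>, a libc providing
sched_getcpu(), and a hypothetical __NR_cpu_opv number as wired up by the
architecture. The helper and function names are made up for illustration.

	#define _GNU_SOURCE
	#include <errno.h>
	#include <sched.h>
	#include <stdint.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/cpu_opv.h>

	static int sys_cpu_opv(struct cpu_op *opv, int cnt, int cpu, int flags)
	{
		return syscall(__NR_cpu_opv, opv, cnt, cpu, flags);
	}

	/* Add @count to the element of @counters belonging to the current CPU. */
	static int percpu_counter_add(uint64_t *counters, int64_t count)
	{
		for (;;) {
			int cpu = sched_getcpu();
			struct cpu_op opv[] = {
				{
					.op = CPU_ADD_OP,
					.len = sizeof(uint64_t),
					.u.arithmetic_op.p = (unsigned long)&counters[cpu],
					.u.arithmetic_op.count = count,
				},
			};

			if (!sys_cpu_opv(opv, 1, cpu, 0))
				return 0;
			if (errno == EAGAIN)
				continue;	/* migrated away; pick the CPU again */
			return -1;
		}
	}

The EAGAIN retry covers the case where the task is migrated between
sched_getcpu() and the preempt-off CPU check inside the system call.
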
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
>> CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
>> CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
>> CC: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
>> CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
>> CC: Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
>> CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
>> CC: Chris Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
>> CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> CC: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
>> CC: Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org>
>> CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
>> CC: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
>> CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>> CC: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
>> CC: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
>> CC: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
>> CC: Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> CC: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> ---
>>
>> Changes since v1:
>> - handle CPU hotplug,
>> - cleanup implementation using function pointers: We can use function
>>   pointers to implement the operations rather than duplicating all the
>>   user-access code.
>> - refuse device pages: Performing cpu_opv operations on io map'd pages
>>   with preemption disabled could generate long preempt-off critical
>>   sections, which leads to unwanted scheduler latency. Return EFAULT if
>>   a device page is received as parameter
>> - restrict op vector to 4216 bytes length sum: Restrict the operation
>>   vector to length sum of:
>>   - 4096 bytes (typical page size on most architectures, should be
>>     enough for a string, or structures)
>>   - 15 * 8 bytes (typical operations on integers or pointers).
>>   The goal here is to keep the duration of preempt off critical section
>>   short, so we don't add significant scheduler latency.
>> - Add INIT_ONSTACK macro: Introduce the
>>   CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users
>>   correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their
>>   stack to 0 on 32-bit architectures.
>> - Add CPU_MB_OP operation:
>>   Use-cases with:
>>   - two consecutive stores,
>>   - a memcpy followed by a store,
>>   require a memory barrier before the final store operation. A typical
>>   use-case is a store-release on the final store. Given that this is a
>>   slow path, just providing an explicit full barrier instruction should
>>   be sufficient.
>> - Add expect fault field:
>>   The use-case of list_pop brings interesting challenges. With rseq, we
>>   can use rseq_cmpnev_storeoffp_load(), and therefore load a pointer,
>>   compare it against NULL, add an offset, and load the target "next"
>>   pointer from the object, all within a single rseq critical section.
>>
>>   Life is not so easy for cpu_opv in this use-case, mainly because we
>>   need to pin all pages we are going to touch in the preempt-off
>>   critical section beforehand. So we need to know the target object (in
>>   which we apply an offset to fetch the next pointer) when we pin pages
>>   before disabling preemption.
>>
>>   So the approach is to load the head pointer and compare it against
>>   NULL in user-space, before doing the cpu_opv syscall. User-space can
>>   then compute the address of the head->next field, *without loading it*.
>>
>>   The cpu_opv system call will first need to pin all pages associated
>>   with input data. This includes the page backing the head->next object,
>>   which may have been concurrently deallocated and unmapped. Therefore,
>>   in this case, getting -EFAULT when trying to pin those pages may
>>   happen: it just means they have been concurrently unmapped. This is
>>   an expected situation, and should just return -EAGAIN to user-space,
>>   so user-space can distinguish between "should retry" type of
>>   situations and actual errors that should be handled with extreme
>>   prejudice to the program (e.g. abort()).
>>
>>   Therefore, add "expect_fault" fields along with op input address
>>   pointers, so user-space can identify whether a fault when getting a
>>   field should return EAGAIN rather than EFAULT.
>> - Add compiler barrier between operations: Adding a compiler barrier
>>   between store operations in a cpu_opv sequence can be useful when
>>   paired with membarrier system call.
>>
>>   An algorithm with a paired slow path and fast path can use
>>   sys_membarrier on the slow path to replace fast-path memory barriers
>>   by compiler barrier.
>>
>>   Adding an explicit compiler barrier between operations allows
>>   cpu_opv to be used as fallback for operations meant to match
>>   the membarrier system call.
>>
>> Changes since v2:
>>
>> - Fix memory leak by introducing struct cpu_opv_pinned_pages.
>>   Suggested by Boqun Feng.
>> - Cast argument 1 passed to access_ok from integer to void __user *,
>>   fixing sparse warning.
>> ---
>>  MAINTAINERS                  |   7 +
>>  include/uapi/linux/cpu_opv.h | 117 ++++++
>>  init/Kconfig                 |  14 +
>>  kernel/Makefile              |   1 +
>>  kernel/cpu_opv.c             | 968 +++++++++++++++++++++++++++++++++++++++++++
>>  kernel/sched/core.c          |  37 ++
>>  kernel/sched/sched.h         |   2 +
>>  kernel/sys_ni.c              |   1 +
>>  8 files changed, 1147 insertions(+)
>>  create mode 100644 include/uapi/linux/cpu_opv.h
>>  create mode 100644 kernel/cpu_opv.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index c9f95f8b07ed..45a1bbdaa287 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -3675,6 +3675,13 @@ B:       https://bugzilla.kernel.org
>>  F:     drivers/cpuidle/*
>>  F:     include/linux/cpuidle.h
>>
>> +CPU NON-PREEMPTIBLE OPERATION VECTOR SUPPORT
>> +M:     Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
>> +L:     linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> +S:     Supported
>> +F:     kernel/cpu_opv.c
>> +F:     include/uapi/linux/cpu_opv.h
>> +
>>  CRAMFS FILESYSTEM
>>  W:     http://sourceforge.net/projects/cramfs/
>>  S:     Orphan / Obsolete
>> diff --git a/include/uapi/linux/cpu_opv.h b/include/uapi/linux/cpu_opv.h
>> new file mode 100644
>> index 000000000000..17f7d46e053b
>> --- /dev/null
>> +++ b/include/uapi/linux/cpu_opv.h
>> @@ -0,0 +1,117 @@
>> +#ifndef _UAPI_LINUX_CPU_OPV_H
>> +#define _UAPI_LINUX_CPU_OPV_H
>> +
>> +/*
>> + * linux/cpu_opv.h
>> + *
>> + * CPU preempt-off operation vector system call API
>> + *
>> + * Copyright (c) 2017 Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>> + * of this software and associated documentation files (the "Software"), to deal
>> + * in the Software without restriction, including without limitation the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>> + * copies of the Software, and to permit persons to whom the Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
>> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
>> + * SOFTWARE.
>> + */
>> +
>> +#ifdef __KERNEL__
>> +# include <linux/types.h>
>> +#else  /* #ifdef __KERNEL__ */
>> +# include <stdint.h>
>> +#endif /* #else #ifdef __KERNEL__ */
>> +
>> +#include <asm/byteorder.h>
>> +
>> +#ifdef __LP64__
>> +# define CPU_OP_FIELD_u32_u64(field)                   uint64_t field
>> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)   field = (intptr_t)v
>> +#elif defined(__BYTE_ORDER) ? \
>> +       __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
>> +# define CPU_OP_FIELD_u32_u64(field)   uint32_t field ## _padding, field
>> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)   \
>> +       field ## _padding = 0, field = (intptr_t)v
>> +#else
>> +# define CPU_OP_FIELD_u32_u64(field)   uint32_t field, field ## _padding
>> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)   \
>> +       field = (intptr_t)v, field ## _padding = 0
>> +#endif
>> +
>> +#define CPU_OP_VEC_LEN_MAX             16
>> +#define CPU_OP_ARG_LEN_MAX             24
>> +/* Max. data len per operation. */
>> +#define CPU_OP_DATA_LEN_MAX            PAGE_SIZE
>> +/*
>> + * Max. data len for overall vector. We need to restrict the amount of
>> + * user-space data touched by the kernel in non-preemptible context so
>> + * we do not introduce long scheduler latencies.
>> + * This allows one copy of up to 4096 bytes, and 15 operations touching
>> + * 8 bytes each.
>> + * This limit is applied to the sum of length specified for all
>> + * operations in a vector.
>> + */
>> +#define CPU_OP_VEC_DATA_LEN_MAX                (4096 + 15*8)
>> +#define CPU_OP_MAX_PAGES               4       /* Max. pages per op. */
>> +
>> +enum cpu_op_type {
>> +       CPU_COMPARE_EQ_OP,      /* compare */
>> +       CPU_COMPARE_NE_OP,      /* compare */
>> +       CPU_MEMCPY_OP,          /* memcpy */
>> +       CPU_ADD_OP,             /* arithmetic */
>> +       CPU_OR_OP,              /* bitwise */
>> +       CPU_AND_OP,             /* bitwise */
>> +       CPU_XOR_OP,             /* bitwise */
>> +       CPU_LSHIFT_OP,          /* shift */
>> +       CPU_RSHIFT_OP,          /* shift */
>> +       CPU_MB_OP,              /* memory barrier */
>> +};
>> +
>> +/* Vector of operations to perform. Limited to 16. */
>> +struct cpu_op {
>> +       int32_t op;     /* enum cpu_op_type. */
>> +       uint32_t len;   /* data length, in bytes. */
>> +       union {
>> +               struct {
>> +                       CPU_OP_FIELD_u32_u64(a);
>> +                       CPU_OP_FIELD_u32_u64(b);
>> +                       uint8_t expect_fault_a;
>> +                       uint8_t expect_fault_b;
>> +               } compare_op;
>> +               struct {
>> +                       CPU_OP_FIELD_u32_u64(dst);
>> +                       CPU_OP_FIELD_u32_u64(src);
>> +                       uint8_t expect_fault_dst;
>> +                       uint8_t expect_fault_src;
>> +               } memcpy_op;
>> +               struct {
>> +                       CPU_OP_FIELD_u32_u64(p);
>> +                       int64_t count;
>> +                       uint8_t expect_fault_p;
>> +               } arithmetic_op;
>> +               struct {
>> +                       CPU_OP_FIELD_u32_u64(p);
>> +                       uint64_t mask;
>> +                       uint8_t expect_fault_p;
>> +               } bitwise_op;
>> +               struct {
>> +                       CPU_OP_FIELD_u32_u64(p);
>> +                       uint32_t bits;
>> +                       uint8_t expect_fault_p;
>> +               } shift_op;
>> +               char __padding[CPU_OP_ARG_LEN_MAX];
>> +       } u;
>> +};
>> +
>> +#endif /* _UAPI_LINUX_CPU_OPV_H */
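
To make the expect_fault fields concrete, here is a minimal user-space sketch
(not part of the patch) of the list_pop pattern discussed in the changelog
above. It assumes an LP64 build, where CPU_OP_FIELD_u32_u64() is a plain
uint64_t; struct node and the helper name are made up for illustration.
User-space loads the current head into a local, computes &head->next without
dereferencing it, and lets the kernel re-check the head before storing
head->next over it. expect_fault_src is set because the popped node may be
unmapped concurrently, in which case cpu_opv returns -EAGAIN rather than
-EFAULT.

	struct node {
		struct node *next;
		/* payload ... */
	};

	/*
	 * Fill @opv with two operations: (1) compare *head against *expect,
	 * (2) copy (*expect)->next over *head. Returns the operation count.
	 * A cpu_opv() return of 1 means the comparison at index 0 differed
	 * (the head changed); -EAGAIN means an expected-fault address could
	 * not be pinned or the task was migrated, so the caller retries.
	 */
	static int list_pop_build_opv(struct cpu_op *opv, struct node **head,
				      struct node **expect)
	{
		opv[0] = (struct cpu_op) {
			.op = CPU_COMPARE_EQ_OP,
			.len = sizeof(struct node *),
			.u.compare_op.a = (unsigned long)head,
			.u.compare_op.b = (unsigned long)expect,
		};
		opv[1] = (struct cpu_op) {
			.op = CPU_MEMCPY_OP,
			.len = sizeof(struct node *),
			.u.memcpy_op.dst = (unsigned long)head,
			.u.memcpy_op.src = (unsigned long)&(*expect)->next,
			.u.memcpy_op.expect_fault_src = 1,
		};
		return 2;
	}
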
>> diff --git a/init/Kconfig b/init/Kconfig
>> index cbedfb91b40a..e4fbb5dd6a24 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1404,6 +1404,7 @@ config RSEQ
>>         bool "Enable rseq() system call" if EXPERT
>>         default y
>>         depends on HAVE_RSEQ
>> +       select CPU_OPV
>>         select MEMBARRIER
>>         help
>>           Enable the restartable sequences system call. It provides a
>> @@ -1414,6 +1415,19 @@ config RSEQ
>>
>>           If unsure, say Y.
>>
>> +config CPU_OPV
>> +       bool "Enable cpu_opv() system call" if EXPERT
>> +       default y
>> +       help
>> +         Enable the CPU preempt-off operation vector system call.
>> +         It allows user-space to perform a sequence of operations on
>> +         per-cpu data with preemption disabled. Useful as
>> +         single-stepping fall-back for restartable sequences, and for
>> +         performing more complex operations on per-cpu data that would
>> +         not be otherwise possible to do with restartable sequences.
>> +
>> +         If unsure, say Y.
>> +
>>  config EMBEDDED
>>         bool "Embedded system"
>>         option allnoconfig_y
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index 3574669dafd9..cac8855196ff 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -113,6 +113,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
>>
>>  obj-$(CONFIG_HAS_IOMEM) += memremap.o
>>  obj-$(CONFIG_RSEQ) += rseq.o
>> +obj-$(CONFIG_CPU_OPV) += cpu_opv.o
>>
>>  $(obj)/configs.o: $(obj)/config_data.h
>>
>> diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c
>> new file mode 100644
>> index 000000000000..a81837a14b17
>> --- /dev/null
>> +++ b/kernel/cpu_opv.c
>> @@ -0,0 +1,968 @@
>> +/*
>> + * CPU preempt-off operation vector system call
>> + *
>> + * It allows user-space to perform a sequence of operations on per-cpu
>> + * data with preemption disabled. Useful as single-stepping fall-back
>> + * for restartable sequences, and for performing more complex operations
>> + * on per-cpu data that would not be otherwise possible to do with
>> + * restartable sequences.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * Copyright (C) 2017, EfficiOS Inc.,
>> + * Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
>> + */
>> +
>> +#include <linux/sched.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/syscalls.h>
>> +#include <linux/cpu_opv.h>
>> +#include <linux/types.h>
>> +#include <linux/mutex.h>
>> +#include <linux/pagemap.h>
>> +#include <asm/ptrace.h>
>> +#include <asm/byteorder.h>
>> +
>> +#include "sched/sched.h"
>> +
>> +#define TMP_BUFLEN                     64
>> +#define NR_PINNED_PAGES_ON_STACK       8
>> +
>> +union op_fn_data {
>> +       uint8_t _u8;
>> +       uint16_t _u16;
>> +       uint32_t _u32;
>> +       uint64_t _u64;
>> +#if (BITS_PER_LONG < 64)
>> +       uint32_t _u64_split[2];
>> +#endif
>> +};
>> +
>> +struct cpu_opv_pinned_pages {
>> +       struct page **pages;
>> +       size_t nr;
>> +       bool is_kmalloc;
>> +};
>> +
>> +typedef int (*op_fn_t)(union op_fn_data *data, uint64_t v, uint32_t len);
>> +
>> +static DEFINE_MUTEX(cpu_opv_offline_lock);
>> +
>> +/*
>> + * The cpu_opv system call executes a vector of operations on behalf of
>> + * user-space on a specific CPU with preemption disabled. It is inspired
>> + * from readv() and writev() system calls which take a "struct iovec"
>> + * array as argument.
>> + *
>> + * The operations available are: comparison, memcpy, add, or, and, xor,
>> + * left shift, and right shift. The system call receives a CPU number
>> + * from user-space as argument, which is the CPU on which those
>> + * operations need to be performed. All preparation steps such as
>> + * loading pointers, and applying offsets to arrays, need to be
>> + * performed by user-space before invoking the system call. The
>> + * "comparison" operation can be used to check that the data used in the
>> + * preparation step did not change between preparation of system call
>> + * inputs and operation execution within the preempt-off critical
>> + * section.
>> + *
>> + * The reason why we require all pointer offsets to be calculated by
>> + * user-space beforehand is because we need to use get_user_pages_fast()
>> + * to first pin all pages touched by each operation. This takes care of
>> + * faulting-in the pages. Then, preemption is disabled, and the
>> + * operations are performed atomically with respect to other thread
>> + * execution on that CPU, without generating any page fault.
>> + *
>> + * A maximum limit of 16 operations per cpu_opv syscall invocation is
>> + * enforced, and an overall maximum length sum, so user-space cannot
>> + * generate an overly long preempt-off critical section. Each operation is
>> + * also limited to a length of PAGE_SIZE bytes, meaning that an operation
>> + * can touch a maximum of 4 pages (memcpy: 2 pages for source, 2 pages
>> + * for destination if addresses are not aligned on page boundaries).
>> + *
>> + * If the thread is not running on the requested CPU, a new
>> + * push_task_to_cpu() is invoked to migrate the task to the requested
>> + * CPU.  If the requested CPU is not part of the cpus allowed mask of
>> + * the thread, the system call fails with EINVAL. After the migration
>> + * has been performed, preemption is disabled, and the current CPU
>> + * number is checked again and compared to the requested CPU number. If
>> + * it still differs, it means the scheduler migrated us away from that
>> + * CPU. Return EAGAIN to user-space in that case, and let user-space
>> + * retry (either requesting the same CPU number, or a different one,
>> + * depending on the user-space algorithm constraints).
>> + */
>> +
>> +/*
>> + * Check operation types and length parameters.
>> + */
>> +static int cpu_opv_check(struct cpu_op *cpuop, int cpuopcnt)
>> +{
>> +       int i;
>> +       uint32_t sum = 0;
>> +
>> +       for (i = 0; i < cpuopcnt; i++) {
>> +               struct cpu_op *op = &cpuop[i];
>> +
>> +               switch (op->op) {
>> +               case CPU_MB_OP:
>> +                       break;
>> +               default:
>> +                       sum += op->len;
>> +               }
>> +               switch (op->op) {
>> +               case CPU_COMPARE_EQ_OP:
>> +               case CPU_COMPARE_NE_OP:
>> +               case CPU_MEMCPY_OP:
>> +                       if (op->len > CPU_OP_DATA_LEN_MAX)
>> +                               return -EINVAL;
>> +                       break;
>> +               case CPU_ADD_OP:
>> +               case CPU_OR_OP:
>> +               case CPU_AND_OP:
>> +               case CPU_XOR_OP:
>> +                       switch (op->len) {
>> +                       case 1:
>> +                       case 2:
>> +                       case 4:
>> +                       case 8:
>> +                               break;
>> +                       default:
>> +                               return -EINVAL;
>> +                       }
>> +                       break;
>> +               case CPU_LSHIFT_OP:
>> +               case CPU_RSHIFT_OP:
>> +                       switch (op->len) {
>> +                       case 1:
>> +                               if (op->u.shift_op.bits > 7)
>> +                                       return -EINVAL;
>> +                               break;
>> +                       case 2:
>> +                               if (op->u.shift_op.bits > 15)
>> +                                       return -EINVAL;
>> +                               break;
>> +                       case 4:
>> +                               if (op->u.shift_op.bits > 31)
>> +                                       return -EINVAL;
>> +                               break;
>> +                       case 8:
>> +                               if (op->u.shift_op.bits > 63)
>> +                                       return -EINVAL;
>> +                               break;
>> +                       default:
>> +                               return -EINVAL;
>> +                       }
>> +                       break;
>> +               case CPU_MB_OP:
>> +                       break;
>> +               default:
>> +                       return -EINVAL;
>> +               }
>> +       }
>> +       if (sum > CPU_OP_VEC_DATA_LEN_MAX)
>> +               return -EINVAL;
>> +       return 0;
>> +}
>> +
>> +static unsigned long cpu_op_range_nr_pages(unsigned long addr,
>> +               unsigned long len)
>> +{
>> +       return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1;
>> +}
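
As a worked example (assuming 4 KiB pages, i.e. PAGE_SHIFT == 12): addr =
0x1ffc with len = 8 straddles a page boundary, so this returns
((0x1ffc + 7) >> 12) - (0x1ffc >> 12) + 1 = 2 - 1 + 1 = 2, while addr =
0x2000 with the same length stays within one page and returns 1.
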
>> +
>> +static int cpu_op_check_page(struct page *page)
>> +{
>> +       struct address_space *mapping;
>> +
>> +       if (is_zone_device_page(page))
>> +               return -EFAULT;
>> +       page = compound_head(page);
>> +       mapping = READ_ONCE(page->mapping);
>> +       if (!mapping) {
>> +               int shmem_swizzled;
>> +
>> +               /*
>> +                * Check again with page lock held to guard against
>> +                * memory pressure making shmem_writepage move the page
>> +                * from filecache to swapcache.
>> +                */
>> +               lock_page(page);
>> +               shmem_swizzled = PageSwapCache(page) || page->mapping;
>> +               unlock_page(page);
>> +               if (shmem_swizzled)
>> +                       return -EAGAIN;
>> +               return -EFAULT;
>> +       }
>> +       return 0;
>> +}
>> +
>> +/*
>> + * Refusing device pages, the zero page, pages in the gate area, and
>> + * special mappings. Inspired from futex.c checks.
>> + */
>> +static int cpu_op_check_pages(struct page **pages,
>> +               unsigned long nr_pages)
>> +{
>> +       unsigned long i;
>> +
>> +       for (i = 0; i < nr_pages; i++) {
>> +               int ret;
>> +
>> +               ret = cpu_op_check_page(pages[i]);
>> +               if (ret)
>> +                       return ret;
>> +       }
>> +       return 0;
>> +}
>> +
>> +static int cpu_op_pin_pages(unsigned long addr, unsigned long len,
>> +               struct cpu_opv_pinned_pages *pin_pages, int write)
>> +{
>> +       struct page *pages[2];
>> +       int ret, nr_pages;
>> +
>> +       if (!len)
>> +               return 0;
>> +       nr_pages = cpu_op_range_nr_pages(addr, len);
>> +       BUG_ON(nr_pages > 2);
>> +       if (!pin_pages->is_kmalloc && pin_pages->nr + nr_pages
>> +                       > NR_PINNED_PAGES_ON_STACK) {
>> +               struct page **pinned_pages =
>> +                       kzalloc(CPU_OP_VEC_LEN_MAX * CPU_OP_MAX_PAGES
>> +                               * sizeof(struct page *), GFP_KERNEL);
>> +               if (!pinned_pages)
>> +                       return -ENOMEM;
>> +               memcpy(pinned_pages, pin_pages->pages,
>> +                       pin_pages->nr * sizeof(struct page *));
>> +               pin_pages->pages = pinned_pages;
>> +               pin_pages->is_kmalloc = true;
>> +       }
>> +again:
>> +       ret = get_user_pages_fast(addr, nr_pages, write, pages);
>> +       if (ret < nr_pages) {
>> +               if (ret > 0)
>> +                       put_page(pages[0]);
>> +               return -EFAULT;
>> +       }
>> +       /*
>> +        * Refuse device pages, the zero page, pages in the gate area,
>> +        * and special mappings.
>> +        */
>> +       ret = cpu_op_check_pages(pages, nr_pages);
>> +       if (ret == -EAGAIN) {
>> +               put_page(pages[0]);
>> +               if (nr_pages > 1)
>> +                       put_page(pages[1]);
>> +               goto again;
>> +       }
>> +       if (ret)
>> +               goto error;
>> +       pin_pages->pages[pin_pages->nr++] = pages[0];
>> +       if (nr_pages > 1)
>> +               pin_pages->pages[pin_pages->nr++] = pages[1];
>> +       return 0;
>> +
>> +error:
>> +       put_page(pages[0]);
>> +       if (nr_pages > 1)
>> +               put_page(pages[1]);
>> +       return -EFAULT;
>> +}
>> +
>> +static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt,
>> +               struct cpu_opv_pinned_pages *pin_pages)
>> +{
>> +       int ret, i;
>> +       bool expect_fault = false;
>> +
>> +       /* Check access, pin pages. */
>> +       for (i = 0; i < cpuopcnt; i++) {
>> +               struct cpu_op *op = &cpuop[i];
>> +
>> +               switch (op->op) {
>> +               case CPU_COMPARE_EQ_OP:
>> +               case CPU_COMPARE_NE_OP:
>> +                       ret = -EFAULT;
>> +                       expect_fault = op->u.compare_op.expect_fault_a;
>> +                       if (!access_ok(VERIFY_READ,
>> +                                       (void __user *)op->u.compare_op.a,
>> +                                       op->len))
>> +                               goto error;
>> +                       ret = cpu_op_pin_pages(
>> +                                       (unsigned long)op->u.compare_op.a,
>> +                                       op->len, pin_pages, 0);
>> +                       if (ret)
>> +                               goto error;
>> +                       ret = -EFAULT;
>> +                       expect_fault = op->u.compare_op.expect_fault_b;
>> +                       if (!access_ok(VERIFY_READ,
>> +                                       (void __user *)op->u.compare_op.b,
>> +                                       op->len))
>> +                               goto error;
>> +                       ret = cpu_op_pin_pages(
>> +                                       (unsigned long)op->u.compare_op.b,
>> +                                       op->len, pin_pages, 0);
>> +                       if (ret)
>> +                               goto error;
>> +                       break;
>> +               case CPU_MEMCPY_OP:
>> +                       ret = -EFAULT;
>> +                       expect_fault = op->u.memcpy_op.expect_fault_dst;
>> +                       if (!access_ok(VERIFY_WRITE,
>> +                                       (void __user *)op->u.memcpy_op.dst,
>> +                                       op->len))
>> +                               goto error;
>> +                       ret = cpu_op_pin_pages(
>> +                                       (unsigned long)op->u.memcpy_op.dst,
>> +                                       op->len, pin_pages, 1);
>> +                       if (ret)
>> +                               goto error;
>> +                       ret = -EFAULT;
>> +                       expect_fault = op->u.memcpy_op.expect_fault_src;
>> +                       if (!access_ok(VERIFY_READ,
>> +                                       (void __user *)op->u.memcpy_op.src,
>> +                                       op->len))
>> +                               goto error;
>> +                       ret = cpu_op_pin_pages(
>> +                                       (unsigned long)op->u.memcpy_op.src,
>> +                                       op->len, pin_pages, 0);
>> +                       if (ret)
>> +                               goto error;
>> +                       break;
>> +               case CPU_ADD_OP:
>> +                       ret = -EFAULT;
>> +                       expect_fault = op->u.arithmetic_op.expect_fault_p;
>> +                       if (!access_ok(VERIFY_WRITE,
>> +                                       (void __user *)op->u.arithmetic_op.p,
>> +                                       op->len))
>> +                               goto error;
>> +                       ret = cpu_op_pin_pages(
>> +                                       (unsigned long)op->u.arithmetic_op.p,
>> +                                       op->len, pin_pages, 1);
>> +                       if (ret)
>> +                               goto error;
>> +                       break;
>> +               case CPU_OR_OP:
>> +               case CPU_AND_OP:
>> +               case CPU_XOR_OP:
>> +                       ret = -EFAULT;
>> +                       expect_fault = op->u.bitwise_op.expect_fault_p;
>> +                       if (!access_ok(VERIFY_WRITE,
>> +                                       (void __user *)op->u.bitwise_op.p,
>> +                                       op->len))
>> +                               goto error;
>> +                       ret = cpu_op_pin_pages(
>> +                                       (unsigned long)op->u.bitwise_op.p,
>> +                                       op->len, pin_pages, 1);
>> +                       if (ret)
>> +                               goto error;
>> +                       break;
>> +               case CPU_LSHIFT_OP:
>> +               case CPU_RSHIFT_OP:
>> +                       ret = -EFAULT;
>> +                       expect_fault = op->u.shift_op.expect_fault_p;
>> +                       if (!access_ok(VERIFY_WRITE,
>> +                                       (void __user *)op->u.shift_op.p,
>> +                                       op->len))
>> +                               goto error;
>> +                       ret = cpu_op_pin_pages(
>> +                                       (unsigned long)op->u.shift_op.p,
>> +                                       op->len, pin_pages, 1);
>> +                       if (ret)
>> +                               goto error;
>> +                       break;
>> +               case CPU_MB_OP:
>> +                       break;
>> +               default:
>> +                       return -EINVAL;
>> +               }
>> +       }
>> +       return 0;
>> +
>> +error:
>> +       for (i = 0; i < pin_pages->nr; i++)
>> +               put_page(pin_pages->pages[i]);
>> +       pin_pages->nr = 0;
>> +       /*
>> +        * If faulting access is expected, return EAGAIN to user-space.
>> +        * It allows user-space to distinguish a fault caused by
>> +        * an access which is expected to fault (e.g. due to concurrent
>> +        * unmapping of underlying memory) from an unexpected fault from
>> +        * which a retry would not recover.
>> +        */
>> +       if (ret == -EFAULT && expect_fault)
>> +               return -EAGAIN;
>> +       return ret;
>> +}
>> +
>> +/* Return 0 if same, > 0 if different, < 0 on error. */
>> +static int do_cpu_op_compare_iter(void __user *a, void __user *b, uint32_t len)
>> +{
>> +       char bufa[TMP_BUFLEN], bufb[TMP_BUFLEN];
>> +       uint32_t compared = 0;
>> +
>> +       while (compared != len) {
>> +               unsigned long to_compare;
>> +
>> +               to_compare = min_t(uint32_t, TMP_BUFLEN, len - compared);
>> +               if (__copy_from_user_inatomic(bufa, a + compared, to_compare))
>> +                       return -EFAULT;
>> +               if (__copy_from_user_inatomic(bufb, b + compared, to_compare))
>> +                       return -EFAULT;
>> +               if (memcmp(bufa, bufb, to_compare))
>> +                       return 1;       /* different */
>> +               compared += to_compare;
>> +       }
>> +       return 0;       /* same */
>> +}
>> +
>> +/* Return 0 if same, > 0 if different, < 0 on error. */
>> +static int do_cpu_op_compare(void __user *a, void __user *b, uint32_t len)
>> +{
>> +       int ret = -EFAULT;
>> +       union {
>> +               uint8_t _u8;
>> +               uint16_t _u16;
>> +               uint32_t _u32;
>> +               uint64_t _u64;
>> +#if (BITS_PER_LONG < 64)
>> +               uint32_t _u64_split[2];
>> +#endif
>> +       } tmp[2];
>> +
>> +       pagefault_disable();
>> +       switch (len) {
>> +       case 1:
>> +               if (__get_user(tmp[0]._u8, (uint8_t __user *)a))
>> +                       goto end;
>> +               if (__get_user(tmp[1]._u8, (uint8_t __user *)b))
>> +                       goto end;
>> +               ret = !!(tmp[0]._u8 != tmp[1]._u8);
>> +               break;
>> +       case 2:
>> +               if (__get_user(tmp[0]._u16, (uint16_t __user *)a))
>> +                       goto end;
>> +               if (__get_user(tmp[1]._u16, (uint16_t __user *)b))
>> +                       goto end;
>> +               ret = !!(tmp[0]._u16 != tmp[1]._u16);
>> +               break;
>> +       case 4:
>> +               if (__get_user(tmp[0]._u32, (uint32_t __user *)a))
>> +                       goto end;
>> +               if (__get_user(tmp[1]._u32, (uint32_t __user *)b))
>> +                       goto end;
>> +               ret = !!(tmp[0]._u32 != tmp[1]._u32);
>> +               break;
>> +       case 8:
>> +#if (BITS_PER_LONG >= 64)
>> +               if (__get_user(tmp[0]._u64, (uint64_t __user *)a))
>> +                       goto end;
>> +               if (__get_user(tmp[1]._u64, (uint64_t __user *)b))
>> +                       goto end;
>> +#else
>> +               if (__get_user(tmp[0]._u64_split[0], (uint32_t __user *)a))
>> +                       goto end;
>> +               if (__get_user(tmp[0]._u64_split[1], (uint32_t __user *)a + 1))
>> +                       goto end;
>> +               if (__get_user(tmp[1]._u64_split[0], (uint32_t __user *)b))
>> +                       goto end;
>> +               if (__get_user(tmp[1]._u64_split[1], (uint32_t __user *)b + 1))
>> +                       goto end;
>> +#endif
>> +               ret = !!(tmp[0]._u64 != tmp[1]._u64);
>> +               break;
>> +       default:
>> +               pagefault_enable();
>> +               return do_cpu_op_compare_iter(a, b, len);
>> +       }
>> +end:
>> +       pagefault_enable();
>> +       return ret;
>> +}
>> +
>> +/* Return 0 on success, < 0 on error. */
>> +static int do_cpu_op_memcpy_iter(void __user *dst, void __user *src,
>> +               uint32_t len)
>> +{
>> +       char buf[TMP_BUFLEN];
>> +       uint32_t copied = 0;
>> +
>> +       while (copied != len) {
>> +               unsigned long to_copy;
>> +
>> +               to_copy = min_t(uint32_t, TMP_BUFLEN, len - copied);
>> +               if (__copy_from_user_inatomic(buf, src + copied, to_copy))
>> +                       return -EFAULT;
>> +               if (__copy_to_user_inatomic(dst + copied, buf, to_copy))
>> +                       return -EFAULT;
>> +               copied += to_copy;
>> +       }
>> +       return 0;
>> +}
>> +
>> +/* Return 0 on success, < 0 on error. */
>> +static int do_cpu_op_memcpy(void __user *dst, void __user *src, uint32_t len)
>> +{
>> +       int ret = -EFAULT;
>> +       union {
>> +               uint8_t _u8;
>> +               uint16_t _u16;
>> +               uint32_t _u32;
>> +               uint64_t _u64;
>> +#if (BITS_PER_LONG < 64)
>> +               uint32_t _u64_split[2];
>> +#endif
>> +       } tmp;
>> +
>> +       pagefault_disable();
>> +       switch (len) {
>> +       case 1:
>> +               if (__get_user(tmp._u8, (uint8_t __user *)src))
>> +                       goto end;
>> +               if (__put_user(tmp._u8, (uint8_t __user *)dst))
>> +                       goto end;
>> +               break;
>> +       case 2:
>> +               if (__get_user(tmp._u16, (uint16_t __user *)src))
>> +                       goto end;
>> +               if (__put_user(tmp._u16, (uint16_t __user *)dst))
>> +                       goto end;
>> +               break;
>> +       case 4:
>> +               if (__get_user(tmp._u32, (uint32_t __user *)src))
>> +                       goto end;
>> +               if (__put_user(tmp._u32, (uint32_t __user *)dst))
>> +                       goto end;
>> +               break;
>> +       case 8:
>> +#if (BITS_PER_LONG >= 64)
>> +               if (__get_user(tmp._u64, (uint64_t __user *)src))
>> +                       goto end;
>> +               if (__put_user(tmp._u64, (uint64_t __user *)dst))
>> +                       goto end;
>> +#else
>> +               if (__get_user(tmp._u64_split[0], (uint32_t __user *)src))
>> +                       goto end;
>> +               if (__get_user(tmp._u64_split[1], (uint32_t __user *)src + 1))
>> +                       goto end;
>> +               if (__put_user(tmp._u64_split[0], (uint32_t __user *)dst))
>> +                       goto end;
>> +               if (__put_user(tmp._u64_split[1], (uint32_t __user *)dst + 1))
>> +                       goto end;
>> +#endif
>> +               break;
>> +       default:
>> +               pagefault_enable();
>> +               return do_cpu_op_memcpy_iter(dst, src, len);
>> +       }
>> +       ret = 0;
>> +end:
>> +       pagefault_enable();
>> +       return ret;
>> +}
>> +
>> +static int op_add_fn(union op_fn_data *data, uint64_t count, uint32_t len)
>> +{
>> +       int ret = 0;
>> +
>> +       switch (len) {
>> +       case 1:
>> +               data->_u8 += (uint8_t)count;
>> +               break;
>> +       case 2:
>> +               data->_u16 += (uint16_t)count;
>> +               break;
>> +       case 4:
>> +               data->_u32 += (uint32_t)count;
>> +               break;
>> +       case 8:
>> +               data->_u64 += (uint64_t)count;
>> +               break;
>> +       default:
>> +               ret = -EINVAL;
>> +               break;
>> +       }
>> +       return ret;
>> +}
>> +
>> +static int op_or_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
>> +{
>> +       int ret = 0;
>> +
>> +       switch (len) {
>> +       case 1:
>> +               data->_u8 |= (uint8_t)mask;
>> +               break;
>> +       case 2:
>> +               data->_u16 |= (uint16_t)mask;
>> +               break;
>> +       case 4:
>> +               data->_u32 |= (uint32_t)mask;
>> +               break;
>> +       case 8:
>> +               data->_u64 |= (uint64_t)mask;
>> +               break;
>> +       default:
>> +               ret = -EINVAL;
>> +               break;
>> +       }
>> +       return ret;
>> +}
>> +
>> +static int op_and_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
>> +{
>> +       int ret = 0;
>> +
>> +       switch (len) {
>> +       case 1:
>> +               data->_u8 &= (uint8_t)mask;
>> +               break;
>> +       case 2:
>> +               data->_u16 &= (uint16_t)mask;
>> +               break;
>> +       case 4:
>> +               data->_u32 &= (uint32_t)mask;
>> +               break;
>> +       case 8:
>> +               data->_u64 &= (uint64_t)mask;
>> +               break;
>> +       default:
>> +               ret = -EINVAL;
>> +               break;
>> +       }
>> +       return ret;
>> +}
>> +
>> +static int op_xor_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
>> +{
>> +       int ret = 0;
>> +
>> +       switch (len) {
>> +       case 1:
>> +               data->_u8 ^= (uint8_t)mask;
>> +               break;
>> +       case 2:
>> +               data->_u16 ^= (uint16_t)mask;
>> +               break;
>> +       case 4:
>> +               data->_u32 ^= (uint32_t)mask;
>> +               break;
>> +       case 8:
>> +               data->_u64 ^= (uint64_t)mask;
>> +               break;
>> +       default:
>> +               ret = -EINVAL;
>> +               break;
>> +       }
>> +       return ret;
>> +}
>> +
>> +static int op_lshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
>> +{
>> +       int ret = 0;
>> +
>> +       switch (len) {
>> +       case 1:
>> +               data->_u8 <<= (uint8_t)bits;
>> +               break;
>> +       case 2:
>> +               data->_u16 <<= (uint16_t)bits;
>> +               break;
>> +       case 4:
>> +               data->_u32 <<= (uint32_t)bits;
>> +               break;
>> +       case 8:
>> +               data->_u64 <<= (uint64_t)bits;
>> +               break;
>> +       default:
>> +               ret = -EINVAL;
>> +               break;
>> +       }
>> +       return ret;
>> +}
>> +
>> +static int op_rshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
>> +{
>> +       int ret = 0;
>> +
>> +       switch (len) {
>> +       case 1:
>> +               data->_u8 >>= (uint8_t)bits;
>> +               break;
>> +       case 2:
>> +               data->_u16 >>= (uint16_t)bits;
>> +               break;
>> +       case 4:
>> +               data->_u32 >>= (uint32_t)bits;
>> +               break;
>> +       case 8:
>> +               data->_u64 >>= (uint64_t)bits;
>> +               break;
>> +       default:
>> +               ret = -EINVAL;
>> +               break;
>> +       }
>> +       return ret;
>> +}
>> +
>> +/* Return 0 on success, < 0 on error. */
>> +static int do_cpu_op_fn(op_fn_t op_fn, void __user *p, uint64_t v,
>> +               uint32_t len)
>> +{
>> +       int ret = -EFAULT;
>> +       union op_fn_data tmp;
>> +
>> +       pagefault_disable();
>> +       switch (len) {
>> +       case 1:
>> +               if (__get_user(tmp._u8, (uint8_t __user *)p))
>> +                       goto end;
>> +               if (op_fn(&tmp, v, len))
>> +                       goto end;
>> +               if (__put_user(tmp._u8, (uint8_t __user *)p))
>> +                       goto end;
>> +               break;
>> +       case 2:
>> +               if (__get_user(tmp._u16, (uint16_t __user *)p))
>> +                       goto end;
>> +               if (op_fn(&tmp, v, len))
>> +                       goto end;
>> +               if (__put_user(tmp._u16, (uint16_t __user *)p))
>> +                       goto end;
>> +               break;
>> +       case 4:
>> +               if (__get_user(tmp._u32, (uint32_t __user *)p))
>> +                       goto end;
>> +               if (op_fn(&tmp, v, len))
>> +                       goto end;
>> +               if (__put_user(tmp._u32, (uint32_t __user *)p))
>> +                       goto end;
>> +               break;
>> +       case 8:
>> +#if (BITS_PER_LONG >= 64)
>> +               if (__get_user(tmp._u64, (uint64_t __user *)p))
>> +                       goto end;
>> +#else
>> +               if (__get_user(tmp._u64_split[0], (uint32_t __user *)p))
>> +                       goto end;
>> +               if (__get_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
>> +                       goto end;
>> +#endif
>> +               if (op_fn(&tmp, v, len))
>> +                       goto end;
>> +#if (BITS_PER_LONG >= 64)
>> +               if (__put_user(tmp._u64, (uint64_t __user *)p))
>> +                       goto end;
>> +#else
>> +               if (__put_user(tmp._u64_split[0], (uint32_t __user *)p))
>> +                       goto end;
>> +               if (__put_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
>> +                       goto end;
>> +#endif
>> +               break;
>> +       default:
>> +               ret = -EINVAL;
>> +               goto end;
>> +       }
>> +       ret = 0;
>> +end:
>> +       pagefault_enable();
>> +       return ret;
>> +}
>> +
>> +static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt)
>> +{
>> +       int i, ret;
>> +
>> +       for (i = 0; i < cpuopcnt; i++) {
>> +               struct cpu_op *op = &cpuop[i];
>> +
>> +               /* Guarantee a compiler barrier between each operation. */
>> +               barrier();
>> +
>> +               switch (op->op) {
>> +               case CPU_COMPARE_EQ_OP:
>> +                       ret = do_cpu_op_compare(
>> +                                       (void __user *)op->u.compare_op.a,
>> +                                       (void __user *)op->u.compare_op.b,
>> +                                       op->len);
>> +                       /* Stop execution on error. */
>> +                       if (ret < 0)
>> +                               return ret;
>> +                       /*
>> +                        * Stop execution, return op index + 1 if comparison
>> +                        * differs.
>> +                        */
>> +                       if (ret > 0)
>> +                               return i + 1;
>> +                       break;
>> +               case CPU_COMPARE_NE_OP:
>> +                       ret = do_cpu_op_compare(
>> +                                       (void __user *)op->u.compare_op.a,
>> +                                       (void __user *)op->u.compare_op.b,
>> +                                       op->len);
>> +                       /* Stop execution on error. */
>> +                       if (ret < 0)
>> +                               return ret;
>> +                       /*
>> +                        * Stop execution, return op index + 1 if comparison
>> +                        * is identical.
>> +                        */
>> +                       if (ret == 0)
>> +                               return i + 1;
>> +                       break;
>> +               case CPU_MEMCPY_OP:
>> +                       ret = do_cpu_op_memcpy(
>> +                                       (void __user *)op->u.memcpy_op.dst,
>> +                                       (void __user *)op->u.memcpy_op.src,
>> +                                       op->len);
>> +                       /* Stop execution on error. */
>> +                       if (ret)
>> +                               return ret;
>> +                       break;
>> +               case CPU_ADD_OP:
>> +                       ret = do_cpu_op_fn(op_add_fn,
>> +                                       (void __user *)op->u.arithmetic_op.p,
>> +                                       op->u.arithmetic_op.count, op->len);
>> +                       /* Stop execution on error. */
>> +                       if (ret)
>> +                               return ret;
>> +                       break;
>> +               case CPU_OR_OP:
>> +                       ret = do_cpu_op_fn(op_or_fn,
>> +                                       (void __user *)op->u.bitwise_op.p,
>> +                                       op->u.bitwise_op.mask, op->len);
>> +                       /* Stop execution on error. */
>> +                       if (ret)
>> +                               return ret;
>> +                       break;
>> +               case CPU_AND_OP:
>> +                       ret = do_cpu_op_fn(op_and_fn,
>> +                                       (void __user *)op->u.bitwise_op.p,
>> +                                       op->u.bitwise_op.mask, op->len);
>> +                       /* Stop execution on error. */
>> +                       if (ret)
>> +                               return ret;
>> +                       break;
>> +               case CPU_XOR_OP:
>> +                       ret = do_cpu_op_fn(op_xor_fn,
>> +                                       (void __user *)op->u.bitwise_op.p,
>> +                                       op->u.bitwise_op.mask, op->len);
>> +                       /* Stop execution on error. */
>> +                       if (ret)
>> +                               return ret;
>> +                       break;
>> +               case CPU_LSHIFT_OP:
>> +                       ret = do_cpu_op_fn(op_lshift_fn,
>> +                                       (void __user *)op->u.shift_op.p,
>> +                                       op->u.shift_op.bits, op->len);
>> +                       /* Stop execution on error. */
>> +                       if (ret)
>> +                               return ret;
>> +                       break;
>> +               case CPU_RSHIFT_OP:
>> +                       ret = do_cpu_op_fn(op_rshift_fn,
>> +                                       (void __user *)op->u.shift_op.p,
>> +                                       op->u.shift_op.bits, op->len);
>> +                       /* Stop execution on error. */
>> +                       if (ret)
>> +                               return ret;
>> +                       break;
>> +               case CPU_MB_OP:
>> +                       smp_mb();
>> +                       break;
>> +               default:
>> +                       return -EINVAL;
>> +               }
>> +       }
>> +       return 0;
>> +}
>> +
>> +static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt, int cpu)
>> +{
>> +       int ret;
>> +
>> +       if (cpu != raw_smp_processor_id()) {
>> +               ret = push_task_to_cpu(current, cpu);
>> +               if (ret)
>> +                       goto check_online;
>> +       }
>> +       preempt_disable();
>> +       if (cpu != smp_processor_id()) {
>> +               ret = -EAGAIN;
>> +               goto end;
>> +       }
>> +       ret = __do_cpu_opv(cpuop, cpuopcnt);
>> +end:
>> +       preempt_enable();
>> +       return ret;
>> +
>> +check_online:
>> +       if (!cpu_possible(cpu))
>> +               return -EINVAL;
>> +       get_online_cpus();
>> +       if (cpu_online(cpu)) {
>> +               ret = -EAGAIN;
>> +               goto put_online_cpus;
>> +       }
>> +       /*
>> +        * CPU is offline. Perform operation from the current CPU with
>> +        * cpu_online read lock held, preventing that CPU from coming online,
>> +        * and with mutex held, providing mutual exclusion against other
>> +        * CPUs also finding out about an offline CPU.
>> +        */
>> +       mutex_lock(&cpu_opv_offline_lock);
>> +       ret = __do_cpu_opv(cpuop, cpuopcnt);
>> +       mutex_unlock(&cpu_opv_offline_lock);
>> +put_online_cpus:
>> +       put_online_cpus();
>> +       return ret;
>> +}
>> +
>> +/*
>> + * cpu_opv - execute operation vector on a given CPU with preempt off.
>> + *
>> + * Userspace should pass current CPU number as parameter. May fail with
>> + * -EAGAIN if currently executing on the wrong CPU.
>> + */
>> +SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
>> +               int, cpu, int, flags)
>> +{
>> +       struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX];
>> +       struct page *pinned_pages_on_stack[NR_PINNED_PAGES_ON_STACK];
>> +       struct cpu_opv_pinned_pages pin_pages = {
>> +               .pages = pinned_pages_on_stack,
>> +               .nr = 0,
>> +               .is_kmalloc = false,
>> +       };
>> +       int ret, i;
>> +
>> +       if (unlikely(flags))
>> +               return -EINVAL;
>> +       if (unlikely(cpu < 0))
>> +               return -EINVAL;
>> +       if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX)
>> +               return -EINVAL;
>> +       if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op)))
>> +               return -EFAULT;
>> +       ret = cpu_opv_check(cpuopv, cpuopcnt);
>> +       if (ret)
>> +               return ret;
>> +       ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &pin_pages);
>> +       if (ret)
>> +               goto end;
>> +       ret = do_cpu_opv(cpuopv, cpuopcnt, cpu);
>> +       for (i = 0; i < pin_pages.nr; i++)
>> +               put_page(pin_pages.pages[i]);
>> +end:
>> +       if (pin_pages.is_kmalloc)
>> +               kfree(pin_pages.pages);
>> +       return ret;
>> +}
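As a rough illustration of the intended usage (not part of the patch): user-space
builds a small vector of operations and passes it together with its current CPU
number. The sketch below only relies on the field names visible in __do_cpu_opv()
above; the syscall number, the uapi header and the exact integer types of the
struct cpu_op fields are assumptions made for the sake of the example.

	/* Illustration only; assumes a uapi header exposing struct cpu_op,
	 * the CPU_*_OP opcodes and a __NR_cpu_opv syscall number. */
	#define _GNU_SOURCE
	#include <errno.h>
	#include <sched.h>
	#include <stdint.h>
	#include <syscall.h>
	#include <unistd.h>

	static int cpu_opv(struct cpu_op *ops, int nr_ops, int cpu, int flags)
	{
		return syscall(__NR_cpu_opv, ops, nr_ops, cpu, flags);
	}

	/* Compare-and-copy one 8-byte per-cpu slot, bound to the current CPU. */
	static int percpu_cmpxchg_u64(uint64_t *slot, uint64_t *expect, uint64_t *newv)
	{
		struct cpu_op ops[] = {
			{
				.op = CPU_COMPARE_EQ_OP,
				.len = 8,
				.u.compare_op = {
					.a = (unsigned long)slot,
					.b = (unsigned long)expect,
				},
			},
			{
				.op = CPU_MEMCPY_OP,
				.len = 8,
				.u.memcpy_op = {
					.dst = (unsigned long)slot,
					.src = (unsigned long)newv,
				},
			},
		};
		int ret;

		do {
			ret = cpu_opv(ops, 2, sched_getcpu(), 0);
		} while (ret < 0 && errno == EAGAIN);
		/* 0: both ops executed, 1: comparison differed, < 0: error. */
		return ret;
	}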
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 6bba05f47e51..e547f93a46c2 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1052,6 +1052,43 @@ void do_set_cpus_allowed(struct task_struct *p, const
>> struct cpumask *new_mask)
>>                 set_curr_task(rq, p);
>>  }
>>
>> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
>> +{
>> +       struct rq_flags rf;
>> +       struct rq *rq;
>> +       int ret = 0;
>> +
>> +       rq = task_rq_lock(p, &rf);
>> +       update_rq_clock(rq);
>> +
>> +       if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) {
>> +               ret = -EINVAL;
>> +               goto out;
>> +       }
>> +
>> +       if (task_cpu(p) == dest_cpu)
>> +               goto out;
>> +
>> +       if (task_running(rq, p) || p->state == TASK_WAKING) {
>> +               struct migration_arg arg = { p, dest_cpu };
>> +               /* Need help from migration thread: drop lock and wait. */
>> +               task_rq_unlock(rq, p, &rf);
>> +               stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
>> +               tlb_migrate_finish(p->mm);
>> +               return 0;
>> +       } else if (task_on_rq_queued(p)) {
>> +               /*
>> +                * OK, since we're going to drop the lock immediately
>> +                * afterwards anyway.
>> +                */
>> +               rq = move_queued_task(rq, &rf, p, dest_cpu);
>> +       }
>> +out:
>> +       task_rq_unlock(rq, p, &rf);
>> +
>> +       return ret;
>> +}
>> +
>>  /*
>>   * Change a given task's CPU affinity. Migrate the thread to a
>>   * proper CPU and schedule it away if the CPU it's executing on
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 3b448ba82225..cab256c1720a 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1209,6 +1209,8 @@ static inline void __set_task_cpu(struct task_struct *p,
>> unsigned int cpu)
>>  #endif
>>  }
>>
>> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);
>> +
>>  /*
>>   * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
>>   */
>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>> index bfa1ee1bf669..59e622296dc3 100644
>> --- a/kernel/sys_ni.c
>> +++ b/kernel/sys_ni.c
>> @@ -262,3 +262,4 @@ cond_syscall(sys_pkey_free);
>>
>>  /* restartable sequence */
>>  cond_syscall(sys_rseq);
>> +cond_syscall(sys_cpu_opv);
>> --
>> 2.11.0
>>
>>
>>
> 
> 
> 
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]     ` <20171114200414.2188-2-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-11-14 20:39       ` Ben Maurer
@ 2017-11-16 16:18       ` Peter Zijlstra
       [not found]         ` <20171116161815.dg4hi2z35rkh4u4s-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
  2017-11-16 18:43       ` Peter Zijlstra
  2017-11-16 21:08       ` Thomas Gleixner
  3 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2017-11-16 16:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E . McKenney, Boqun Feng, Andy Lutomirski, Dave Watson,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon

On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote:
> @@ -977,6 +978,13 @@ struct task_struct {
>  	unsigned long			numa_pages_migrated;
>  #endif /* CONFIG_NUMA_BALANCING */
>  
> +#ifdef CONFIG_RSEQ
> +	struct rseq __user *rseq;
> +	u32 rseq_len;
> +	u32 rseq_sig;
> +	bool rseq_preempt, rseq_signal, rseq_migrate;

No bool please. Use something that has a defined size in ILP32/LP64.
_Bool makes it absolutely impossible to speculate on structure layout
across architectures.

> +#endif
> +
>  	struct tlbflush_unmap_batch	tlb_ubc;
>  
>  	struct rcu_head			rcu;

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]         ` <20171116161815.dg4hi2z35rkh4u4s-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
@ 2017-11-16 16:27           ` Mathieu Desnoyers
       [not found]             ` <438349693.16595.1510849627973.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-16 16:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Boqun Feng, Andy Lutomirski, Dave Watson,
	linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon

----- On Nov 16, 2017, at 11:18 AM, Peter Zijlstra peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote:

> On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote:
>> @@ -977,6 +978,13 @@ struct task_struct {
>>  	unsigned long			numa_pages_migrated;
>>  #endif /* CONFIG_NUMA_BALANCING */
>>  
>> +#ifdef CONFIG_RSEQ
>> +	struct rseq __user *rseq;
>> +	u32 rseq_len;
>> +	u32 rseq_sig;
>> +	bool rseq_preempt, rseq_signal, rseq_migrate;
> 
> No bool please. Use something that has a defined size in ILP32/LP64.
> _Bool makes it absolutely impossible to speculate on structure layout
> across architectures.

I should as well make all those a bitmask within a "u32 rseq_event_mask" then,
sounds fair ?

Thanks,

Mathieu

> 
>> +#endif
>> +
>>  	struct tlbflush_unmap_batch	tlb_ubc;
>>  
>>  	struct rcu_head			rcu;

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]             ` <438349693.16595.1510849627973.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-16 16:32               ` Peter Zijlstra
       [not found]                 ` <20171116163218.fg4u4bbzfrbxatvz-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
  0 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2017-11-16 16:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Boqun Feng, Andy Lutomirski, Dave Watson,
	linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon

On Thu, Nov 16, 2017 at 04:27:07PM +0000, Mathieu Desnoyers wrote:
> ----- On Nov 16, 2017, at 11:18 AM, Peter Zijlstra peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote:
> 
> > On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote:
> >> @@ -977,6 +978,13 @@ struct task_struct {
> >>  	unsigned long			numa_pages_migrated;
> >>  #endif /* CONFIG_NUMA_BALANCING */
> >>  
> >> +#ifdef CONFIG_RSEQ
> >> +	struct rseq __user *rseq;
> >> +	u32 rseq_len;
> >> +	u32 rseq_sig;
> >> +	bool rseq_preempt, rseq_signal, rseq_migrate;
> > 
> > No bool please. Use something that has a defined size in ILP32/LP64.
> > _Bool makes it absolutely impossible to speculate on structure layout
> > across architectures.
> 
> I should as well make all those a bitmask within a "u32 rseq_event_mask" then,
> sounds fair ?

Sure, whatever works and isn't _Bool ;-)

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]                 ` <20171116163218.fg4u4bbzfrbxatvz-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
@ 2017-11-16 17:09                   ` Mathieu Desnoyers
  0 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-16 17:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Boqun Feng, Andy Lutomirski, Dave Watson,
	linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon

----- On Nov 16, 2017, at 11:32 AM, Peter Zijlstra peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote:

> On Thu, Nov 16, 2017 at 04:27:07PM +0000, Mathieu Desnoyers wrote:
>> ----- On Nov 16, 2017, at 11:18 AM, Peter Zijlstra peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote:
>> 
>> > On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote:
>> >> @@ -977,6 +978,13 @@ struct task_struct {
>> >>  	unsigned long			numa_pages_migrated;
>> >>  #endif /* CONFIG_NUMA_BALANCING */
>> >>  
>> >> +#ifdef CONFIG_RSEQ
>> >> +	struct rseq __user *rseq;
>> >> +	u32 rseq_len;
>> >> +	u32 rseq_sig;
>> >> +	bool rseq_preempt, rseq_signal, rseq_migrate;
>> > 
>> > No bool please. Use something that has a defined size in ILP32/LP64.
>> > _Bool makes it absolutely impossible to speculate on structure layout
>> > across architectures.
>> 
>> I should as well make all those a bitmask within a "u32 rseq_event_mask" then,
>> sounds fair ?
> 
> Sure, whatever works and isn't _Bool ;-)

So something along those lines should do the trick (including
the mask request from Ben Maurer):

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b995a3b..44aef30 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -982,7 +982,7 @@ struct task_struct {
        struct rseq __user *rseq;
        u32 rseq_len;
        u32 rseq_sig;
-       bool rseq_preempt, rseq_signal, rseq_migrate;
+       u32 rseq_event_mask;
 #endif
 
        struct tlbflush_unmap_batch     tlb_ubc;
@@ -1676,6 +1676,16 @@ static inline void set_task_cpu(struct task_struct *p, un
 #endif
 
 #ifdef CONFIG_RSEQ
+/*
+ * Map the event mask on the user-space ABI enum rseq_cs_flags
+ * for direct mask checks.
+ */
+enum rseq_event_mask {
+       RSEQ_EVENT_PREEMPT      = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT,
+       RSEQ_EVENT_SIGNAL       = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL,
+       RSEQ_EVENT_MIGRATE      = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE,
+};
+
 static inline void rseq_set_notify_resume(struct task_struct *t)
 {
        if (t->rseq)
@@ -1718,16 +1728,16 @@ static inline void rseq_sched_out(struct task_struct *t)
 }
 static inline void rseq_signal_deliver(struct pt_regs *regs)
 {
-       current->rseq_signal = true;
+       current->rseq_event_mask |= RSEQ_EVENT_SIGNAL;
        rseq_handle_notify_resume(regs);
 }
 static inline void rseq_preempt(struct task_struct *t)
 {
-       t->rseq_preempt = true;
+       t->rseq_event_mask |= RSEQ_EVENT_PREEMPT;
 }
 static inline void rseq_migrate(struct task_struct *t)
 {
-       t->rseq_migrate = true;
+       t->rseq_event_mask |= RSEQ_EVENT_MIGRATE;
 }
 #else
 static inline void rseq_set_notify_resume(struct task_struct *t)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 6f0d48c..d773003 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -159,7 +159,7 @@ static bool rseq_get_rseq_cs(struct task_struct *t,
 static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags)
 {
        bool need_restart = false;
-       uint32_t flags;
+       uint32_t flags, event_mask;
 
        /* Get thread flags. */
        if (__get_user(flags, &t->rseq->flags))
@@ -174,26 +174,17 @@ static int rseq_need_restart(struct task_struct *t, uint32
         * a preempted signal handler could fail to restart the prior
         * execution context on sigreturn.
         */
-       if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) {
-               if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
-                       return -EINVAL;
-               if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
+       if (unlikely(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) {
+               if ((flags & (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+                               | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
+                       != (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+                                | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
                        return -EINVAL;
        }
-       if (t->rseq_migrate
-                       && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
-               need_restart = true;
-       else if (t->rseq_preempt
-                       && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
-               need_restart = true;
-       else if (t->rseq_signal
-                       && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL))
-               need_restart = true;
-
-       t->rseq_preempt = false;
-       t->rseq_signal = false;
-       t->rseq_migrate = false;
-       if (need_restart)
+       event_mask = t->rseq_event_mask;
+       t->rseq_event_mask = 0;
+       event_mask &= ~flags;
+       if (event_mask)
                return 1;
        return 0;
 }
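
The reason this works is that the RSEQ_EVENT_* bits are defined to be numerically
equal to the corresponding RSEQ_CS_FLAG_NO_RESTART_ON_* bits, so the three boolean
tests collapse into a single mask-and-clear. Worked through by hand:

	/* Preempted, and the critical section inhibits restart on preempt: */
	event_mask = RSEQ_EVENT_PREEMPT;			/* 0x1 */
	flags = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT;		/* 0x1 */
	event_mask &= ~flags;					/* 0x0 -> no restart */

	/* Preempted, no inhibit flags set: */
	event_mask = RSEQ_EVENT_PREEMPT;			/* 0x1 */
	flags = 0;
	event_mask &= ~flags;					/* 0x1 -> restart */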


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]     ` <20171114200414.2188-2-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-11-14 20:39       ` Ben Maurer
  2017-11-16 16:18       ` Peter Zijlstra
@ 2017-11-16 18:43       ` Peter Zijlstra
       [not found]         ` <20171116184305.snpudnjdhua2obby-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
  2017-11-16 21:08       ` Thomas Gleixner
  3 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2017-11-16 18:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E . McKenney, Boqun Feng, Andy Lutomirski, Dave Watson,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon

On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote:
> +/*
> + * If parent process has a registered restartable sequences area, the
> + * child inherits. Only applies when forking a process, not a thread. In
> + * case a parent fork() in the middle of a restartable sequence, set the
> + * resume notifier to force the child to retry.
> + */
> +static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
> +{
> +	if (clone_flags & CLONE_THREAD) {
> +		t->rseq = NULL;
> +		t->rseq_len = 0;
> +		t->rseq_sig = 0;
> +	} else {
> +		t->rseq = current->rseq;
> +		t->rseq_len = current->rseq_len;
> +		t->rseq_sig = current->rseq_sig;
> +		rseq_set_notify_resume(t);
> +	}
> +}

This hurts my brain... what happens if you fork a multi-threaded
process?

Do we fully inherit the TLS state of the calling thread?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]         ` <20171116184305.snpudnjdhua2obby-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
@ 2017-11-16 18:49           ` Mathieu Desnoyers
       [not found]             ` <1523632942.16739.1510858189882.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-16 18:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Boqun Feng, Andy Lutomirski, Dave Watson,
	linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon

----- On Nov 16, 2017, at 1:43 PM, Peter Zijlstra peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote:

> On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote:
>> +/*
>> + * If parent process has a registered restartable sequences area, the
>> + * child inherits. Only applies when forking a process, not a thread. In
>> + * case a parent fork() in the middle of a restartable sequence, set the
>> + * resume notifier to force the child to retry.
>> + */
>> +static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
>> +{
>> +	if (clone_flags & CLONE_THREAD) {
>> +		t->rseq = NULL;
>> +		t->rseq_len = 0;
>> +		t->rseq_sig = 0;
>> +	} else {
>> +		t->rseq = current->rseq;
>> +		t->rseq_len = current->rseq_len;
>> +		t->rseq_sig = current->rseq_sig;
>> +		rseq_set_notify_resume(t);
>> +	}
>> +}
> 
> This hurts my brain... what happens if you fork a multi-threaded
> process?
> 
> Do we fully inherit the TLS state of the calling thread?

Yes, exactly. The user-space TLS should be inherited from that of
the calling thread.

At kernel-level, the only thing that's not inherited here is the
task struct rseq_event_mask, which tracks whether a restart is
needed. But this would only be relevant if fork() can be invoked
from a signal handler, or if fork() could be invoked from a
rseq critical section (which really makes little sense).

Should I copy the current->rseq_event_mask on process fork just to
be on the safe side though ?

Thanks,

Mathieu



-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]             ` <1523632942.16739.1510858189882.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-16 19:06               ` Thomas Gleixner
  2017-11-16 20:06                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2017-11-16 19:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon

On Thu, 16 Nov 2017, Mathieu Desnoyers wrote:
> ----- On Nov 16, 2017, at 1:43 PM, Peter Zijlstra peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote:
> 
> > On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote:
> >> +/*
> >> + * If parent process has a registered restartable sequences area, the
> >> + * child inherits. Only applies when forking a process, not a thread. In
> >> + * case a parent fork() in the middle of a restartable sequence, set the
> >> + * resume notifier to force the child to retry.
> >> + */
> >> +static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
> >> +{
> >> +	if (clone_flags & CLONE_THREAD) {
> >> +		t->rseq = NULL;
> >> +		t->rseq_len = 0;
> >> +		t->rseq_sig = 0;
> >> +	} else {
> >> +		t->rseq = current->rseq;
> >> +		t->rseq_len = current->rseq_len;
> >> +		t->rseq_sig = current->rseq_sig;
> >> +		rseq_set_notify_resume(t);
> >> +	}
> >> +}
> > 
> > This hurts my brain... what happens if you fork a multi-threaded
> > process?
> > 
> > Do we fully inherit the TLS state of the calling thread?
> 
> Yes, exactly. The user-space TLS should be inherited from that of
> the calling thread.
> 
> At kernel-level, the only thing that's not inherited here is the
> task struct rseq_event_mask, which tracks whether a restart is
> needed. But this would only be relevant if fork() can be invoked
> from a signal handler, or if fork() could be invoked from a
> rseq critical section (which really makes little sense).

Whether it makes sense or not does not matter much, especially in context
of user space. You cannot make assumptions like that. When something can be
done, then it's bound to happen sooner than later because somebody thinks
he is extra clever.

The first priority is robustness in any aspect which has to do with user
space.

> Should I copy the current->rseq_event_mask on process fork just to
> be on the safe side though ?

I think so, unless you let fork() fail when invoked from a rseq critical
section.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
  2017-11-14 20:03   ` [RFC PATCH v11 for 4.15 01/24] Restartable sequences " Mathieu Desnoyers
       [not found]     ` <CY4PR15MB168884529B3C0F8E6CC06257CF280@CY4PR15MB1688.namprd15.prod.outlook.com>
       [not found]     ` <20171114200414.2188-2-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-16 19:14     ` Peter Zijlstra
       [not found]       ` <20171116191448.rmds347hwsyibipm-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
  2 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2017-11-16 19:14 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E . McKenney, Boqun Feng, Andy Lutomirski, Dave Watson,
	linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon

On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote:
> +static bool rseq_update_cpu_id(struct task_struct *t)
> +{
> +	uint32_t cpu_id = raw_smp_processor_id();
> +
> +	if (__put_user(cpu_id, &t->rseq->cpu_id_start))
> +		return false;
> +	if (__put_user(cpu_id, &t->rseq->cpu_id))
> +		return false;

For LP64 this _could_ be a single 64bit store, right? It would save some
stac/clac noise on x86_64.

> +	trace_rseq_update(t);
> +	return true;
> +}

> +static bool rseq_get_rseq_cs(struct task_struct *t,

bool return value, but is used as a C int error value later (it works,
but is inconsistent).

> +		void __user **start_ip,
> +		unsigned long *post_commit_offset,
> +		void __user **abort_ip,
> +		uint32_t *cs_flags)

That's a fair amount of arguments, and I suppose that isn't a problem
because there's only the one callsite and it all gets inlined anyway.

> +{
> +	unsigned long ptr;
> +	struct rseq_cs __user *urseq_cs;
> +	struct rseq_cs rseq_cs;
> +	u32 __user *usig;
> +	u32 sig;
> +
> +	if (__get_user(ptr, &t->rseq->rseq_cs))
> +		return false;
> +	if (!ptr)
> +		return true;
> +	urseq_cs = (struct rseq_cs __user *)ptr;
> +	if (copy_from_user(&rseq_cs, urseq_cs, sizeof(rseq_cs)))
> +		return false;
> +	/*
> +	 * We need to clear rseq_cs upon entry into a signal handler
> +	 * nested on top of a rseq assembly block, so the signal handler
> +	 * will not be fixed up if itself interrupted by a nested signal
> +	 * handler or preempted.  We also need to clear rseq_cs if we
> +	 * preempt or deliver a signal on top of code outside of the
> +	 * rseq assembly block, to ensure that a following preemption or
> +	 * signal delivery will not try to perform a fixup needlessly.
> +	 */
> +	if (clear_user(&t->rseq->rseq_cs, sizeof(t->rseq->rseq_cs)))
> +		return false;
> +	if (rseq_cs.version > 0)
> +		return false;

> +	*cs_flags = rseq_cs.flags;
> +	*start_ip = (void __user *)rseq_cs.start_ip;
> +	*post_commit_offset = (unsigned long)rseq_cs.post_commit_offset;
> +	*abort_ip = (void __user *)rseq_cs.abort_ip;

> +	usig = (u32 __user *)(rseq_cs.abort_ip - sizeof(u32));
> +	if (get_user(sig, usig))
> +		return false;

> +	if (current->rseq_sig != sig) {
> +		printk_ratelimited(KERN_WARNING
> +			"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
> +			sig, current->rseq_sig, current->pid, usig);
> +		return false;
> +	}
> +	return true;
> +}
> +
> +static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags)
> +{
> +	bool need_restart = false;
> +	uint32_t flags;
> +
> +	/* Get thread flags. */
> +	if (__get_user(flags, &t->rseq->flags))
> +		return -EFAULT;
> +
> +	/* Take into account critical section flags. */
> +	flags |= cs_flags;
> +
> +	/*
> +	 * Restart on signal can only be inhibited when restart on
> +	 * preempt and restart on migrate are inhibited too. Otherwise,
> +	 * a preempted signal handler could fail to restart the prior
> +	 * execution context on sigreturn.
> +	 */
> +	if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) {
> +		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
> +			return -EINVAL;
> +		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
> +			return -EINVAL;
> +	}
> +	if (t->rseq_migrate
> +			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))

That's a horrible code form, please put the && at the end of the
previous line and begin the next line aligned with the (, like:

	if (t->rseq_migrate &&
	    !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))

Luckily you've already killed this code, but try and remember for a next
time ;-)

> +		need_restart = true;
> +	else if (t->rseq_preempt
> +			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
> +		need_restart = true;
> +	else if (t->rseq_signal
> +			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL))
> +		need_restart = true;
> +
> +	t->rseq_preempt = false;
> +	t->rseq_signal = false;
> +	t->rseq_migrate = false;
> +	if (need_restart)
> +		return 1;
> +	return 0;
> +}
> +
> +static int rseq_ip_fixup(struct pt_regs *regs)
> +{
> +	struct task_struct *t = current;
> +	void __user *start_ip = NULL;
> +	unsigned long post_commit_offset = 0;
> +	void __user *abort_ip = NULL;
> +	uint32_t cs_flags = 0;
> +	int ret;

	unsigned long ip = instruction_pointer(regs);

> +
> +	ret = rseq_get_rseq_cs(t, &start_ip, &post_commit_offset, &abort_ip,
> +			&cs_flags);
	trace_rseq_ip_fixup((void __user *)ip,
> +		start_ip, post_commit_offset, abort_ip, ret);

Why trace here and not right before/after instruction_pointer_set()?

> +	if (!ret)
> +		return -EFAULT;
> +
> +	ret = rseq_need_restart(t, cs_flags);
> +	if (ret < 0)
> +		return -EFAULT;
> +	if (!ret)
> +		return 0;
> +	/*
> +	 * Handle potentially not being within a critical section.
> +	 * Unsigned comparison will be true when
> +	 * ip < start_ip (wrap-around to large values), and when
> +	 * ip >= start_ip + post_commit_offset.
> +	 */
> +	if ((unsigned long)instruction_pointer(regs) - (unsigned long)start_ip
> +			>= post_commit_offset)

	if ((unsigned long)(ip - start_ip) >= post_commit_offset)

> +		return 1;
> +
> +	instruction_pointer_set(regs, (unsigned long)abort_ip);

Since you only ever use abort_ip as unsigned long, why propagate this
"void __user *" all the way from rseq_get_rseq_cs() ? Save yourself some
typing and casts :-)

> +	return 1;
> +}

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
  2017-11-16 19:06               ` Thomas Gleixner
@ 2017-11-16 20:06                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-16 20:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon

----- On Nov 16, 2017, at 2:06 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:

> On Thu, 16 Nov 2017, Mathieu Desnoyers wrote:
>> ----- On Nov 16, 2017, at 1:43 PM, Peter Zijlstra peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote:
>> 
>> > On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote:
>> >> +/*
>> >> + * If parent process has a registered restartable sequences area, the
>> >> + * child inherits. Only applies when forking a process, not a thread. In
>> >> + * case a parent fork() in the middle of a restartable sequence, set the
>> >> + * resume notifier to force the child to retry.
>> >> + */
>> >> +static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
>> >> +{
>> >> +	if (clone_flags & CLONE_THREAD) {
>> >> +		t->rseq = NULL;
>> >> +		t->rseq_len = 0;
>> >> +		t->rseq_sig = 0;
>> >> +	} else {
>> >> +		t->rseq = current->rseq;
>> >> +		t->rseq_len = current->rseq_len;
>> >> +		t->rseq_sig = current->rseq_sig;
>> >> +		rseq_set_notify_resume(t);
>> >> +	}
>> >> +}
>> > 
>> > This hurts my brain... what happens if you fork a multi-threaded
>> > process?
>> > 
>> > Do we fully inherit the TLS state of the calling thread?
>> 
>> Yes, exactly. The user-space TLS should be inherited from that of
>> the calling thread.
>> 
>> At kernel-level, the only thing that's not inherited here is the
>> task struct rseq_event_mask, which tracks whether a restart is
>> needed. But this would only be relevant if fork() can be invoked
>> from a signal handler, or if fork() could be invoked from a
>> rseq critical section (which really makes little sense).
> 
> Whether it makes sense or not does not matter much, especially in context
> of user space. You cannot make assumptions like that. When something can be
> done, then it's bound to happen sooner than later because somebody thinks
> he is extra clever.
> 
> The first priority is robustness in any aspect which has to do with user
> space.
> 
>> Should I copy the current->rseq_event_mask on process fork just to
>> be on the safe side though ?
> 
> I think so, unless you let fork() fail when invoked from a rseq critical
> section.

Alright, I'll set the rseq_event_mask to 0 explicitly on exec() and
thread-fork, and copy it from the parent on process-fork.
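
Roughly, the fork path would then become something like the following (a sketch on
top of the rseq_fork() quoted earlier in this thread; the exec() path would clear
the same field):

	static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
	{
		if (clone_flags & CLONE_THREAD) {
			/* Threads do not inherit the rseq registration. */
			t->rseq = NULL;
			t->rseq_len = 0;
			t->rseq_sig = 0;
			t->rseq_event_mask = 0;
		} else {
			/*
			 * A process fork inherits the parent's registration,
			 * including any pending restart events.
			 */
			t->rseq = current->rseq;
			t->rseq_len = current->rseq_len;
			t->rseq_sig = current->rseq_sig;
			t->rseq_event_mask = current->rseq_event_mask;
			rseq_set_notify_resume(t);
		}
	}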

Thanks,

Mathieu

> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]       ` <20171116191448.rmds347hwsyibipm-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
@ 2017-11-16 20:37         ` Mathieu Desnoyers
       [not found]           ` <1083699948.16848.1510864678185.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-16 20:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Boqun Feng, Andy Lutomirski, Dave Watson,
	linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon

----- On Nov 16, 2017, at 2:14 PM, Peter Zijlstra peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote:

> On Tue, Nov 14, 2017 at 03:03:51PM -0500, Mathieu Desnoyers wrote:
>> +static bool rseq_update_cpu_id(struct task_struct *t)
>> +{
>> +	uint32_t cpu_id = raw_smp_processor_id();
>> +
>> +	if (__put_user(cpu_id, &t->rseq->cpu_id_start))
>> +		return false;
>> +	if (__put_user(cpu_id, &t->rseq->cpu_id))
>> +		return false;
> 
> For LP64 this _could_ be a single 64bit store, right? It would save some
> stac/clac noise on x86_64.

Yes it could, but last time I checked a __put_user of a u64
did not guarantee single-copy atomicity of each of the two
32-bit words on 32-bit architectures, so I figured that it
would be better to postpone that optimization to a point
where architectures would provide a u64 __put_user that
guarantees single-copy atomicity of each 32-bit word on 32-bit
architectures.
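
For the LP64-only case, the combined update would look something like this
(sketch only; it relies on cpu_id_start and cpu_id being the first two adjacent
32-bit fields of struct rseq, and since both halves carry the same value the
byte order does not matter):

	union {
		struct {
			uint32_t cpu_id_start;
			uint32_t cpu_id;
		};
		uint64_t v;
	} u = { .cpu_id_start = cpu_id, .cpu_id = cpu_id };

	/* One __put_user() covering both adjacent 32-bit fields. */
	if (__put_user(u.v, (uint64_t __user *)&t->rseq->cpu_id_start))
		return false;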

> 
>> +	trace_rseq_update(t);
>> +	return true;
>> +}
> 
>> +static bool rseq_get_rseq_cs(struct task_struct *t,
> 
> bool return value, but is used as a C int error value later (it works,
> but is inconsistent).

I can do the following on the caller side instead:

        if (!rseq_get_rseq_cs(t, &start_ip, &post_commit_offset, &abort_ip,
                        &cs_flags))
                return -EFAULT;


> 
>> +		void __user **start_ip,
>> +		unsigned long *post_commit_offset,
>> +		void __user **abort_ip,
>> +		uint32_t *cs_flags)
> 
> That's a fair amount of arguments, and I suppose that isn't a problem
> because there's only the one callsite and it all gets inlined anyway.

Yep.

> 
>> +{
>> +	unsigned long ptr;
>> +	struct rseq_cs __user *urseq_cs;
>> +	struct rseq_cs rseq_cs;
>> +	u32 __user *usig;
>> +	u32 sig;
>> +
>> +	if (__get_user(ptr, &t->rseq->rseq_cs))
>> +		return false;
>> +	if (!ptr)
>> +		return true;
>> +	urseq_cs = (struct rseq_cs __user *)ptr;
>> +	if (copy_from_user(&rseq_cs, urseq_cs, sizeof(rseq_cs)))
>> +		return false;
>> +	/*
>> +	 * We need to clear rseq_cs upon entry into a signal handler
>> +	 * nested on top of a rseq assembly block, so the signal handler
>> +	 * will not be fixed up if itself interrupted by a nested signal
>> +	 * handler or preempted.  We also need to clear rseq_cs if we
>> +	 * preempt or deliver a signal on top of code outside of the
>> +	 * rseq assembly block, to ensure that a following preemption or
>> +	 * signal delivery will not try to perform a fixup needlessly.
>> +	 */
>> +	if (clear_user(&t->rseq->rseq_cs, sizeof(t->rseq->rseq_cs)))
>> +		return false;
>> +	if (rseq_cs.version > 0)
>> +		return false;
> 
>> +	*cs_flags = rseq_cs.flags;
>> +	*start_ip = (void __user *)rseq_cs.start_ip;
>> +	*post_commit_offset = (unsigned long)rseq_cs.post_commit_offset;
>> +	*abort_ip = (void __user *)rseq_cs.abort_ip;
> 
>> +	usig = (u32 __user *)(rseq_cs.abort_ip - sizeof(u32));
>> +	if (get_user(sig, usig))
>> +		return false;
> 

ok for adding newlines.

>> +	if (current->rseq_sig != sig) {
>> +		printk_ratelimited(KERN_WARNING
>> +			"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x
>> (pid=%d, addr=%p).\n",
>> +			sig, current->rseq_sig, current->pid, usig);
>> +		return false;
>> +	}
>> +	return true;
>> +}
>> +
>> +static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags)
>> +{
>> +	bool need_restart = false;
>> +	uint32_t flags;
>> +
>> +	/* Get thread flags. */
>> +	if (__get_user(flags, &t->rseq->flags))
>> +		return -EFAULT;
>> +
>> +	/* Take into account critical section flags. */
>> +	flags |= cs_flags;
>> +
>> +	/*
>> +	 * Restart on signal can only be inhibited when restart on
>> +	 * preempt and restart on migrate are inhibited too. Otherwise,
>> +	 * a preempted signal handler could fail to restart the prior
>> +	 * execution context on sigreturn.
>> +	 */
>> +	if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) {
>> +		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
>> +			return -EINVAL;
>> +		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
>> +			return -EINVAL;
>> +	}
>> +	if (t->rseq_migrate
>> +			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
> 
> That's a horrible code form, please put the && at the end of the
> previous line and begin the next line aligned with the (, like:
> 
>	if (t->rseq_migrate &&
>	    !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
> 
> Luckily you've already killed this code, but try and remember for a next
> time ;-)

I usually never space-align with open parenthesis "(". Is it a coding
style requirement of the kernel for multi-line if () conditions ?

Would the following replacement code be ok?

        if (unlikely(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) {
                if ((flags & (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
                                | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) !=
                                (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
                                | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
                        return -EINVAL;
        }
        event_mask = t->rseq_event_mask;
        t->rseq_event_mask = 0;
        event_mask &= ~flags;
        if (event_mask)
                return 1;
        return 0;

I'm uneasy with the wall of text caused by the flags. And based on
your comment, I should align on the if ( parenthesis. Style improvement
ideas are welcome. An alternative would be:

        if (unlikely(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) {
                if ((flags & (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
                    | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) !=
                    (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
                     | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
                        return -EINVAL;
        }
[...]

> 
>> +		need_restart = true;
>> +	else if (t->rseq_preempt
>> +			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
>> +		need_restart = true;
>> +	else if (t->rseq_signal
>> +			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL))
>> +		need_restart = true;
>> +
>> +	t->rseq_preempt = false;
>> +	t->rseq_signal = false;
>> +	t->rseq_migrate = false;
>> +	if (need_restart)
>> +		return 1;
>> +	return 0;
>> +}
>> +
>> +static int rseq_ip_fixup(struct pt_regs *regs)
>> +{
>> +	struct task_struct *t = current;
>> +	void __user *start_ip = NULL;
>> +	unsigned long post_commit_offset = 0;
>> +	void __user *abort_ip = NULL;
>> +	uint32_t cs_flags = 0;
>> +	int ret;
> 
>	unsigned long ip = instruction_pointer(regs);

ok

> 
>> +
>> +	ret = rseq_get_rseq_cs(t, &start_ip, &post_commit_offset, &abort_ip,
>> +			&cs_flags);
>	trace_rseq_ip_fixup((void __user *)ip,
>> +		start_ip, post_commit_offset, abort_ip, ret);
> 
> Why trace here and not right before/after instruction_pointer_set()?

Good point. Tracing right before instruction_pointer_set() would make sense.
I can remove the "ret" parameter too.

> 
>> +	if (!ret)
>> +		return -EFAULT;
>> +
>> +	ret = rseq_need_restart(t, cs_flags);
>> +	if (ret < 0)
>> +		return -EFAULT;
>> +	if (!ret)
>> +		return 0;
>> +	/*
>> +	 * Handle potentially not being within a critical section.
>> +	 * Unsigned comparison will be true when
>> +	 * ip < start_ip (wrap-around to large values), and when
>> +	 * ip >= start_ip + post_commit_offset.
>> +	 */
>> +	if ((unsigned long)instruction_pointer(regs) - (unsigned long)start_ip
>> +			>= post_commit_offset)
> 
>	if ((unsigned long)(ip - start_ip) >= post_commit_offset)

Now that both ip and start_ip are unsigned long, I can simply do:


if (ip - start_ip >= post_commit_offset)
  ...

> 
>> +		return 1;
>> +
>> +	instruction_pointer_set(regs, (unsigned long)abort_ip);
> 
> Since you only ever use abort_ip as unsigned long, why propagate this
> "void __user *" all the way from rseq_get_rseq_cs() ? Save yourself some
> typing and casts :-)

Will do, I'll use unsigned long instead,

Thanks!

Mathieu


> 
>> +	return 1;
>> +}

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]           ` <1083699948.16848.1510864678185.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-16 20:46             ` Peter Zijlstra
  0 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2017-11-16 20:46 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Boqun Feng, Andy Lutomirski, Dave Watson,
	linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon

On Thu, Nov 16, 2017 at 08:37:58PM +0000, Mathieu Desnoyers wrote:
> I usually never space-align with open parenthesis "(". Is it a coding
> style requirement of the kernel for multi-line if () conditions ?

Not sure, but it is the predominant pattern in most of the code.

> Would the following replacement code be ok?
> 
>         if (unlikely(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) {
>                 if ((flags & (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>                                 | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) !=
>                                 (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>                                 | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
>                         return -EINVAL;

I really prefer the operator at the end,

git grep "&&$" | wc -l
40708

git grep "^[[:space:]]*&&" | wc -l
3901

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
       [not found]     ` <20171114200414.2188-2-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
                         ` (2 preceding siblings ...)
  2017-11-16 18:43       ` Peter Zijlstra
@ 2017-11-16 21:08       ` Thomas Gleixner
  2017-11-19 17:24         ` Mathieu Desnoyers
  3 siblings, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2017-11-16 21:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H . Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, Steven Rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon

On Tue, 14 Nov 2017, Mathieu Desnoyers wrote:
> +#ifdef __KERNEL__
> +# include <linux/types.h>
> +#else	/* #ifdef __KERNEL__ */

Please drop these comments. They are distracting and not helpful at
all. They are valuable for long #ifdeffed sections but then the normal form
is:

 /* __KERNEL__ */
 
 /* !__KERNEL__ */

> +# include <stdint.h>
> +#endif	/* #else #ifdef __KERNEL__ */
> +
> +#include <asm/byteorder.h>
> +
> +#ifdef __LP64__
> +# define RSEQ_FIELD_u32_u64(field)			uint64_t field
> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)	field = (intptr_t)v
> +#elif defined(__BYTE_ORDER) ? \
> +	__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)

Can you please make this decision separate and propagate the result ?

> +# define RSEQ_FIELD_u32_u64(field)	uint32_t field ## _padding, field
> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
> +	field ## _padding = 0, field = (intptr_t)v
> +#else
> +# define RSEQ_FIELD_u32_u64(field)	uint32_t field, field ## _padding
> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
> +	field = (intptr_t)v, field ## _padding = 0
> +#endif
> +
> +enum rseq_flags {
> +	RSEQ_FLAG_UNREGISTER = (1 << 0),
> +};
> +
> +enum rseq_cs_flags {
> +	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT	= (1U << 0),
> +	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL	= (1U << 1),
> +	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	= (1U << 2),
> +};
> +
> +/*
> + * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
> + * contained within a single cache-line. It is usually declared as
> + * link-time constant data.
> + */
> +struct rseq_cs {
> +	uint32_t version;	/* Version of this structure. */
> +	uint32_t flags;		/* enum rseq_cs_flags */
> +	RSEQ_FIELD_u32_u64(start_ip);
> +	RSEQ_FIELD_u32_u64(post_commit_offset);	/* From start_ip */
> +	RSEQ_FIELD_u32_u64(abort_ip);
> +} __attribute__((aligned(4 * sizeof(uint64_t))));
> +
> +/*
> + * struct rseq is aligned on 4 * 8 bytes to ensure it is always
> + * contained within a single cache-line.
> + *
> + * A single struct rseq per thread is allowed.
> + */
> +struct rseq {
> +	/*
> +	 * Restartable sequences cpu_id_start field. Updated by the
> +	 * kernel, and read by user-space with single-copy atomicity
> +	 * semantics. Aligned on 32-bit. Always contain a value in the

contains

> +	 * range of possible CPUs, although the value may not be the
> +	 * actual current CPU (e.g. if rseq is not initialized). This
> +	 * CPU number value should always be confirmed against the value
> +	 * of the cpu_id field.

Who is supposed to confirm that? I think I know what the purpose of the
field is, but from that comment it's not obvious at all.

> +	 */
> +	uint32_t cpu_id_start;
> +	/*
> +	 * Restartable sequences cpu_id field. Updated by the kernel,
> +	 * and read by user-space with single-copy atomicity semantics.

Again. What's the purpose of reading it?

> +	 * Aligned on 32-bit. Values -1U and -2U have a special
> +	 * semantic: -1U means "rseq uninitialized", and -2U means "rseq
> +	 * initialization failed".
> +	 */
> +	uint32_t cpu_id;
> +	/*
> +	 * Restartable sequences rseq_cs field.
> +	 *
> +	 * Contains NULL when no critical section is active for the current
> +	 * thread, or holds a pointer to the currently active struct rseq_cs.
> +	 *
> +	 * Updated by user-space at the beginning of assembly instruction
> +	 * sequence block, and by the kernel when it restarts an assembly
> +	 * instruction sequence block, and when the kernel detects that it
> +	 * is preempting or delivering a signal outside of the range
> +	 * targeted by the rseq_cs. Also needs to be cleared by user-space
> +	 * before reclaiming memory that contains the targeted struct
> +	 * rseq_cs.

This paragraph is pretty convoluted and it's not really clear what the
actual purpose is and how it is supposed to be used.

   It's NULL when no critical section is active.

   It holds a pointer to a struct rseq_cs when a critical section is active. Fine

Now the update rules:

    - By user space at the start of the critical section, i.e. user space
      sets the pointer to rseq_cs

    - By the kernel when it restarts a sequence block etc ....

      What happens to this field? Is the pointer updated or cleared or
      what? How is the kernel supposed to fiddle with the pointer?

> +	 *
> +	 * Read and set by the kernel with single-copy atomicity semantics.

This looks like it's purely kernel owned, but above you say it's written by
user space. There are no rules for user space?

> +	 * Aligned on 64-bit.
> +	 */
> +	RSEQ_FIELD_u32_u64(rseq_cs);
> +	/*
> +	 * - RSEQ_DISABLE flag:
> +	 *
> +	 * Fallback fast-track flag for single-stepping.
> +	 * Set by user-space if lack of progress is detected.
> +	 * Cleared by user-space after rseq finish.
> +	 * Read by the kernel.
> +	 * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
> +	 *     Inhibit instruction sequence block restart and event
> +	 *     counter increment on preemption for this thread.
> +	 * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
> +	 *     Inhibit instruction sequence block restart and event
> +	 *     counter increment on signal delivery for this thread.
> +	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
> +	 *     Inhibit instruction sequence block restart and event
> +	 *     counter increment on migration for this thread.

That looks dangerous. You want to single step through the critical section
and just ignore whether you've been preempted or migrated. How is that
supposed to work?

> +++ b/kernel/rseq.c
> @@ -0,0 +1,328 @@
> + * Detailed algorithm of rseq user-space assembly sequences:
> + *
> + *   Steps [1]-[3] (inclusive) need to be a sequence of instructions in
> + *   userspace that can handle being moved to the abort_ip between any
> + *   of those instructions.

A sequence of instructions cannot be moved. Please describe this in
technically correct wording.

> + *   The abort_ip address needs to be less than start_ip, or
> + *   greater-or-equal the post_commit_ip. Step [5] and the failure

s/the/than/

> + *   code step [F1] need to be at addresses lesser than start_ip, or
> + *   greater-or-equal the post_commit_ip.

Please describe that block visually for clarity

		init(rseq_cs)
		cpu = TLS->rseq::cpu_id
       
start_ip	-----------------
[1]		TLS->rseq::rseq_cs = rseq_cs
		barrier()

[2]		if (cpu != TLS->rseq::cpu_id)
			goto fail_ip;

[3]		last_instruction_in_cs()
post_commit_ip  ----------------

The address of jump target fail_ip must be outside the critical region, i.e.

    fail_ip < start_ip  ||	 fail_ip >= post_commit_ip

Some textual explanation along with that is certainly helpful, but.

> + *       [start_ip]
> + *   1.  Userspace stores the address of the struct rseq_cs assembly
> + *       block descriptor into the rseq_cs field of the registered
> + *       struct rseq TLS area. This update is performed through a single
> + *       store, followed by a compiler barrier which prevents the
> + *       compiler from moving following loads or stores before this
> + *       store.
> + *
> + *   2.  Userspace tests to see whether the current cpu_id field
> + *       match the cpu number loaded before start_ip. Manually jumping
> + *       to [F1] in case of a mismatch.

  Manually jumping?

> + *
> + *       Note that if we are preempted or interrupted by a signal

Up to this point the description was technical. Now you start to
impersonate. That's inconsistent at best.

> + *       after [1] and before post_commit_ip, then the kernel

How does the kernel know about being "after" [1]? Is there something other
than start_ip and post_commit_ip? According to this, yes. And that wants a
name and wants to be shown in the visual block. I call it magic_ip for now.

> + *       clears the rseq_cs field of struct rseq, then jumps us to
> + *       abort_ip.

The kernel does not jump us.

    	    If the execution sequence gets preempted at an address >=
    	    magic_ip and < post_commit_ip, the kernel sets
    	    TLS->rseq::rseq_cs to NULL and sets the user space return ip to
    	    fail_ip before returning to user space, so the preempted
    	    execution resumes at fail_ip.

Hmm?

> + *   3.  Userspace critical section final instruction before
> + *       post_commit_ip is the commit. The critical section is
> + *       self-terminating.
> + *       [post_commit_ip]
> + *
> + *   4.  success
> + *
> + *   On failure at [2]:
> + *
> + *       [abort_ip]

Now you introduce abort_ip. Why not use the same terminology consistently?
Because it would make sense and not confuse the reader?

> + *   F1. goto failure label
> + */
> +
> +static bool rseq_update_cpu_id(struct task_struct *t)
> +{
> +	uint32_t cpu_id = raw_smp_processor_id();
> +
> +	if (__put_user(cpu_id, &t->rseq->cpu_id_start))
> +		return false;
> +	if (__put_user(cpu_id, &t->rseq->cpu_id))
> +		return false;
> +	trace_rseq_update(t);
> +	return true;
> +}
> +
> +static bool rseq_reset_rseq_cpu_id(struct task_struct *t)
> +{
> +	uint32_t cpu_id_start = 0, cpu_id = -1U;

Please do not use -1U. Define a proper symbol for it. Hardcoded constant
numbers which have a special meaning are annoying.

> +	/*
> +	 * Reset cpu_id_start to its initial state (0).
> +	 */
> +	if (__put_user(cpu_id_start, &t->rseq->cpu_id_start))
> +		return false;

Why bool? If the callsite propagates an error code return it right from
here please.
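
IOW something like this (untested sketch; RSEQ_CPU_ID_UNINITIALIZED is a
made-up name for the -1U constant I asked for above):

static int rseq_reset_rseq_cpu_id(struct task_struct *t)
{
        u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED;

        /* Reset cpu_id_start to its initial state (0). */
        if (__put_user(cpu_id_start, &t->rseq->cpu_id_start))
                return -EFAULT;
        /*
         * Reset cpu_id to the "unregistered" value, so any user coming
         * in after unregistration can figure out that rseq needs to be
         * registered again.
         */
        if (__put_user(cpu_id, &t->rseq->cpu_id))
                return -EFAULT;
        return 0;
}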

> +	/*
> +	 * Reset cpu_id to -1U, so any user coming in after unregistration can
> +	 * figure out that rseq needs to be registered again.
> +	 */
> +	if (__put_user(cpu_id, &t->rseq->cpu_id))
> +		return false;
> +	return true;
> +}
> +
> +static bool rseq_get_rseq_cs(struct task_struct *t,
> +		void __user **start_ip,
> +		unsigned long *post_commit_offset,
> +		void __user **abort_ip,
> +		uint32_t *cs_flags)

Please align the arguments with the argument in the first line

> +{
> +	unsigned long ptr;
> +	struct rseq_cs __user *urseq_cs;
> +	struct rseq_cs rseq_cs;
> +	u32 __user *usig;
> +	u32 sig;

Please sort those variables by length in reverse fir tree order.

> +
> +	if (__get_user(ptr, &t->rseq->rseq_cs))
> +		return false;

Call site stores an int and then returns -EFAULT. Works, but pretty is
something else.

> +	if (!ptr)
> +		return true;

What's wrong with 0 / -ERRORCODE returns which are the standard way in the
kernel?

> +	urseq_cs = (struct rseq_cs __user *)ptr;
> +	if (copy_from_user(&rseq_cs, urseq_cs, sizeof(rseq_cs)))
> +		return false;
> +	/*
> +	 * We need to clear rseq_cs upon entry into a signal handler
> +	 * nested on top of a rseq assembly block, so the signal handler
> +	 * will not be fixed up if itself interrupted by a nested signal
> +	 * handler or preempted.

This sentence does not parse.

> +	   We also need to clear rseq_cs if we
> +	 * preempt or deliver a signal on top of code outside of the
> +	 * rseq assembly block, to ensure that a following preemption or
> +	 * signal delivery will not try to perform a fixup needlessly.

Please try to avoid the impersonation. We are not doing anything.

> +	 */
> +	if (clear_user(&t->rseq->rseq_cs, sizeof(t->rseq->rseq_cs)))
> +		return false;
> +	if (rseq_cs.version > 0)
> +		return false;
> +	*cs_flags = rseq_cs.flags;
> +	*start_ip = (void __user *)rseq_cs.start_ip;
> +	*post_commit_offset = (unsigned long)rseq_cs.post_commit_offset;
> +	*abort_ip = (void __user *)rseq_cs.abort_ip;
> +	usig = (u32 __user *)(rseq_cs.abort_ip - sizeof(u32));

Is there no way to avoid this abundant type casting?  It's hard to find the
code in the casts.

> +	if (get_user(sig, usig))
> +		return false;
> +	if (current->rseq_sig != sig) {
> +		printk_ratelimited(KERN_WARNING
> +			"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
> +			sig, current->rseq_sig, current->pid, usig);
> +		return false;
> +	}
> +	return true;
> +}
> +
> +static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags)
> +{
> +	bool need_restart = false;
> +	uint32_t flags;
> +
> +	/* Get thread flags. */
> +	if (__get_user(flags, &t->rseq->flags))
> +		return -EFAULT;
> +
> +	/* Take into account critical section flags. */

Take critical section flags into account. Please

> +	flags |= cs_flags;
> +
> +	/*
> +	 * Restart on signal can only be inhibited when restart on
> +	 * preempt and restart on migrate are inhibited too. Otherwise,
> +	 * a preempted signal handler could fail to restart the prior
> +	 * execution context on sigreturn.
> +	 */
> +	if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) {
> +		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
> +			return -EINVAL;
> +		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
> +			return -EINVAL;
> +	}
> +	if (t->rseq_migrate
> +			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))

	if (t->rseq_migrate &&
	    !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))

please.

> +		need_restart = true;
> +	else if (t->rseq_preempt
> +			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
> +		need_restart = true;
> +	else if (t->rseq_signal
> +			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL))
> +		need_restart = true;

If you make all of these rseq_flags explicit bits in a u32 then you can
just do a simple

     	if ((t->rseq_flags ^ flags) & t->rseq_flags)

and you can probably simplify the above checks as well.
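
Roughly like this (completely untested; assumes the event bits in
t->rseq_flags use the same bit positions as the corresponding
RSEQ_CS_FLAG_NO_RESTART_* flags):

static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
{
        u32 flags, event_mask;

        /* Get thread flags. */
        if (__get_user(flags, &t->rseq->flags))
                return -EFAULT;

        /* Take the critical section flags into account. */
        flags |= cs_flags;

        /*
         * Restart on signal can only be inhibited when restart on
         * preempt and restart on migrate are inhibited too.
         */
        if ((flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) &&
            (flags & (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT |
                      RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)) !=
                     (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT |
                      RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
                return -EINVAL;

        /* Event bits which are set and not inhibited require a restart. */
        event_mask = t->rseq_flags;
        t->rseq_flags = 0;
        return !!((event_mask ^ flags) & event_mask);
}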

> +
> +	t->rseq_preempt = false;
> +	t->rseq_signal = false;
> +	t->rseq_migrate = false;

This becomes a simple t->rseq_flags = 0;

> +	if (need_restart)
> +		return 1;
> +	return 0;

Why are you having a bool in the first place if you have to convert it into
an integer return value at the end? Sure the compiler can optimize that
away, but still...

> +}
> +
> +static int rseq_ip_fixup(struct pt_regs *regs)
> +{
> +	struct task_struct *t = current;
> +	void __user *start_ip = NULL;
> +	unsigned long post_commit_offset = 0;
> +	void __user *abort_ip = NULL;
> +	uint32_t cs_flags = 0;
> +	int ret;
> +
> +	ret = rseq_get_rseq_cs(t, &start_ip, &post_commit_offset, &abort_ip,
> +			&cs_flags);
> +	trace_rseq_ip_fixup((void __user *)instruction_pointer(regs),
> +		start_ip, post_commit_offset, abort_ip, ret);
> +	if (!ret)
> +		return -EFAULT;

This boolean logic is really horrible.

> +	ret = rseq_need_restart(t, cs_flags);
> +	if (ret < 0)
> +		return -EFAULT;

Why can't you propagate ret?

> +	if (!ret)
> +		return 0;
> +	/*
> +	 * Handle potentially not being within a critical section.
> +	 * Unsigned comparison will be true when
> +	 * ip < start_ip (wrap-around to large values), and when
> +	 * ip >= start_ip + post_commit_offset.
> +	 */
> +	if ((unsigned long)instruction_pointer(regs) - (unsigned long)start_ip
> +			>= post_commit_offset)

Neither start_ip nor abort_ip need to be void __user * type. They are not
accessed at all, so why not make them unsigned long and spare all the type
cast mess here and in rseq_get_rseq_cs()?
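
IOW, with plain unsigned longs all the way through (and rseq_get_rseq_cs()
taking unsigned long pointers for start_ip/abort_ip), the tail of
rseq_ip_fixup() becomes (untested):

        unsigned long ip = instruction_pointer(regs);

        /*
         * Handle potentially not being within a critical section.
         * Unsigned arithmetic wraps for ip < start_ip, so a single
         * comparison covers both ends of the critical section.
         */
        if (ip - start_ip >= post_commit_offset)
                return 1;

        instruction_pointer_set(regs, abort_ip);
        return 1;

and all the casts in rseq_get_rseq_cs() go away as well.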

> +		return 1;
> +
> +	instruction_pointer_set(regs, (unsigned long)abort_ip);
> +	return 1;
> +}
> +
> +/*
> + * This resume handler should always be executed between any of:

Should? Or must?

> + * - preemption,
> + * - signal delivery,
> + * and return to user-space.
> + *
> +	if (current->rseq) {
> +		/*
> +		 * If rseq is already registered, check whether
> +		 * the provided address differs from the prior
> +		 * one.
> +		 */
> +		if (current->rseq != rseq
> +				|| current->rseq_len != rseq_len)

Align as shown above please. Same for all other malformatted multi-line
conditionals.

> +			return -EINVAL;
> +		if (current->rseq_sig != sig)
> +			return -EPERM;
> +		return -EBUSY;	/* Already registered. */

Please do not use tail comments. They disturb the reading flow.

> +	} else {
> +		/*
> +		 * If there was no rseq previously registered,
> +		 * we need to ensure the provided rseq is

s/we need to//  Like in changelogs. Describe it in imperative mood.

> +		 * properly aligned and valid.
> +		 */
> +		if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq))
> +				|| rseq_len != sizeof(*rseq))
> +			return -EINVAL;
> +		if (!access_ok(VERIFY_WRITE, rseq, rseq_len))
> +			return -EFAULT;
> +		current->rseq = rseq;
> +		current->rseq_len = rseq_len;
> +		current->rseq_sig = sig;
> +		/*
> +		 * If rseq was previously inactive, and has just been
> +		 * registered, ensure the cpu_id_start and cpu_id fields
> +		 * are updated before returning to user-space.
> +		 */
> +		rseq_set_notify_resume(current);
> +	}

Thanks,

	tglx


* Re: [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support
       [not found]     ` <20171114200414.2188-5-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-16 21:14       ` Thomas Gleixner
  2017-11-19 17:41         ` Mathieu Desnoyers
  0 siblings, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2017-11-16 21:14 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H . Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, Steven Rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon

On Tue, 14 Nov 2017, Mathieu Desnoyers wrote:

Please fix the subject line:

x86: Add support for restartable sequences

or something like that.

And for the actual rseq patches please come up with a proper short
subsystem prefix for restartable sequences. There is no point in occupying
half of the subject space for a prefix.

Other than that.

Reviewed-by: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>


* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
       [not found]     ` <20171114200414.2188-9-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-11-15  1:34       ` Mathieu Desnoyers
  2017-11-15  7:44       ` Michael Kerrisk (man-pages)
@ 2017-11-16 23:26       ` Thomas Gleixner
  2017-11-17  0:14         ` Andi Kleen
  2017-11-20 16:13         ` Mathieu Desnoyers
  2 siblings, 2 replies; 80+ messages in thread
From: Thomas Gleixner @ 2017-11-16 23:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E . McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H . Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, Steven Rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon

On Tue, 14 Nov 2017, Mathieu Desnoyers wrote:
> +#ifdef __KERNEL__
> +# include <linux/types.h>
> +#else	/* #ifdef __KERNEL__ */

  		   Sigh.

> +# include <stdint.h>
> +#endif	/* #else #ifdef __KERNEL__ */
> +
> +#include <asm/byteorder.h>
> +
> +#ifdef __LP64__
> +# define CPU_OP_FIELD_u32_u64(field)			uint64_t field
> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)	field = (intptr_t)v
> +#elif defined(__BYTE_ORDER) ? \
> +	__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> +# define CPU_OP_FIELD_u32_u64(field)	uint32_t field ## _padding, field
> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
> +	field ## _padding = 0, field = (intptr_t)v
> +#else
> +# define CPU_OP_FIELD_u32_u64(field)	uint32_t field, field ## _padding
> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
> +	field = (intptr_t)v, field ## _padding = 0
> +#endif

So in the rseq patch you have:

+#ifdef __LP64__
+# define RSEQ_FIELD_u32_u64(field)                     uint64_t field
+# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)     field = (intptr_t)v
+#elif defined(__BYTE_ORDER) ? \
+       __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+# define RSEQ_FIELD_u32_u64(field)     uint32_t field ## _padding, field
+# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)     \
+       field ## _padding = 0, field = (intptr_t)v
+#else
+# define RSEQ_FIELD_u32_u64(field)     uint32_t field, field ## _padding
+# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)     \
+       field = (intptr_t)v, field ## _padding = 0
+#endif

IOW the same macro maze. Please use a separate header file which provides
these macros once and share them between the two facilities.
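
I.e. something like this, defined exactly once (header and macro names made
up), with both rseq.h and cpu_opv.h including it and using
LINUX_FIELD_u32_u64() directly:

/* include/uapi/linux/types_32_64.h -- name made up */
#ifdef __KERNEL__
# include <linux/types.h>
#else
# include <stdint.h>
#endif

#include <asm/byteorder.h>

#ifdef __LP64__
# define LINUX_FIELD_u32_u64(field)                     uint64_t field
# define LINUX_FIELD_u32_u64_INIT_ONSTACK(field, v)     field = (intptr_t)v
#elif defined(__BYTE_ORDER) ? \
        __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
# define LINUX_FIELD_u32_u64(field)     uint32_t field ## _padding, field
# define LINUX_FIELD_u32_u64_INIT_ONSTACK(field, v)     \
        field ## _padding = 0, field = (intptr_t)v
#else
# define LINUX_FIELD_u32_u64(field)     uint32_t field, field ## _padding
# define LINUX_FIELD_u32_u64_INIT_ONSTACK(field, v)     \
        field = (intptr_t)v, field ## _padding = 0
#endif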

> +#define CPU_OP_VEC_LEN_MAX		16
> +#define CPU_OP_ARG_LEN_MAX		24
> +/* Max. data len per operation. */
> +#define CPU_OP_DATA_LEN_MAX		PAGE_SIZE

That's something between 4K and 256K depending on the architecture. 

You really want to allow up to 256K data copy with preemption disabled?
Shudder.

> +/*
> + * Max. data len for overall vector. We to restrict the amount of

We to ?

> + * user-space data touched by the kernel in non-preemptible context so
> + * we do not introduce long scheduler latencies.
> + * This allows one copy of up to 4096 bytes, and 15 operations touching
> + * 8 bytes each.
> + * This limit is applied to the sum of length specified for all
> + * operations in a vector.
> + */
> +#define CPU_OP_VEC_DATA_LEN_MAX		(4096 + 15*8)

Magic numbers. Please use proper defines, for heaven's sake.
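
E.g. (names made up):

/* Max. data len for a single memcpy operation. */
#define CPU_OP_MEMCPY_DATA_LEN_MAX      4096
/* Max. data len for each of the remaining (single word) operations. */
#define CPU_OP_WORD_DATA_LEN_MAX        8

#define CPU_OP_VEC_DATA_LEN_MAX                                 \
        (CPU_OP_MEMCPY_DATA_LEN_MAX +                           \
         (CPU_OP_VEC_LEN_MAX - 1) * CPU_OP_WORD_DATA_LEN_MAX)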

> +#define CPU_OP_MAX_PAGES		4	/* Max. pages per op. */
> +
> +enum cpu_op_type {
> +	CPU_COMPARE_EQ_OP,	/* compare */
> +	CPU_COMPARE_NE_OP,	/* compare */
> +	CPU_MEMCPY_OP,		/* memcpy */
> +	CPU_ADD_OP,		/* arithmetic */
> +	CPU_OR_OP,		/* bitwise */
> +	CPU_AND_OP,		/* bitwise */
> +	CPU_XOR_OP,		/* bitwise */
> +	CPU_LSHIFT_OP,		/* shift */
> +	CPU_RSHIFT_OP,		/* shift */
> +	CPU_MB_OP,		/* memory barrier */
> +};
> +
> +/* Vector of operations to perform. Limited to 16. */
> +struct cpu_op {
> +	int32_t op;	/* enum cpu_op_type. */
> +	uint32_t len;	/* data length, in bytes. */

Please get rid of these tail comments

> +	union {
> +#define TMP_BUFLEN			64
> +#define NR_PINNED_PAGES_ON_STACK	8

8 pinned pages on stack? Which stack?

> +/*
> + * The cpu_opv system call executes a vector of operations on behalf of
> + * user-space on a specific CPU with preemption disabled. It is inspired
> + * from readv() and writev() system calls which take a "struct iovec"

s/from/by/

> + * array as argument.
> + *
> + * The operations available are: comparison, memcpy, add, or, and, xor,
> + * left shift, and right shift. The system call receives a CPU number
> + * from user-space as argument, which is the CPU on which those
> + * operations need to be performed. All preparation steps such as
> + * loading pointers, and applying offsets to arrays, need to be
> + * performed by user-space before invoking the system call. The

loading pointers and applying offsets? That makes no sense.

> + * "comparison" operation can be used to check that the data used in the
> + * preparation step did not change between preparation of system call
> + * inputs and operation execution within the preempt-off critical
> + * section.
> + *
> + * The reason why we require all pointer offsets to be calculated by
> + * user-space beforehand is because we need to use get_user_pages_fast()
> + * to first pin all pages touched by each operation. This takes care of

That doesn't explain it either.

> + * faulting-in the pages.  Then, preemption is disabled, and the
> + * operations are performed atomically with respect to other thread
> + * execution on that CPU, without generating any page fault.
> + *
> + * A maximum limit of 16 operations per cpu_opv syscall invocation is
> + * enforced, and a overall maximum length sum, so user-space cannot
> + * generate a too long preempt-off critical section. Each operation is
> + * also limited a length of PAGE_SIZE bytes, meaning that an operation
> + * can touch a maximum of 4 pages (memcpy: 2 pages for source, 2 pages
> + * for destination if addresses are not aligned on page boundaries).

What's the critical section duration for operations which go to the limits
of this on an average x86 64 machine?

> + * If the thread is not running on the requested CPU, a new
> + * push_task_to_cpu() is invoked to migrate the task to the requested

new push_task_to_cpu()? Once that patch is merged push_task_to_cpu() is
hardly new.

Please refrain from putting function level details into comments which
describe the concept. The function name might change in 3 months from now
and the comment will stay stale. It's sufficient to say:

 * If the thread is not running on the requested CPU it is migrated
 * to it.

That explains the concept. It's completely irrelevant which mechanism is
used to achieve that.

> + * CPU.  If the requested CPU is not part of the cpus allowed mask of
> + * the thread, the system call fails with EINVAL. After the migration
> + * has been performed, preemption is disabled, and the current CPU
> + * number is checked again and compared to the requested CPU number. If
> + * it still differs, it means the scheduler migrated us away from that
> + * CPU. Return EAGAIN to user-space in that case, and let user-space
> + * retry (either requesting the same CPU number, or a different one,
> + * depending on the user-space algorithm constraints).

This mixture of imperative and impersonated mood is really hard to read.

> +/*
> + * Check operation types and length parameters.
> + */
> +static int cpu_opv_check(struct cpu_op *cpuop, int cpuopcnt)
> +{
> +	int i;
> +	uint32_t sum = 0;
> +
> +	for (i = 0; i < cpuopcnt; i++) {
> +		struct cpu_op *op = &cpuop[i];
> +
> +		switch (op->op) {
> +		case CPU_MB_OP:
> +			break;
> +		default:
> +			sum += op->len;
> +		}

Please separate the switch cases with an empty line.

> +static unsigned long cpu_op_range_nr_pages(unsigned long addr,
> +		unsigned long len)

Please align the arguments

static unsigned long cpu_op_range_nr_pages(unsigned long addr,
					   unsigned long len)

is way simpler to parse. All over the place please.

> +{
> +	return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1;

I'm surprised that there is no existing magic for this.
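
The closest to existing magic is probably PAGE_ALIGN()/PAGE_MASK, which at
least makes the intent obvious (untested; note this returns 0 for len == 0,
but the caller bails out on !len anyway):

static unsigned long cpu_op_range_nr_pages(unsigned long addr,
                                           unsigned long len)
{
        return (PAGE_ALIGN(addr + len) - (addr & PAGE_MASK)) >> PAGE_SHIFT;
}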

> +}
> +
> +static int cpu_op_check_page(struct page *page)
> +{
> +	struct address_space *mapping;
> +
> +	if (is_zone_device_page(page))
> +		return -EFAULT;
> +	page = compound_head(page);
> +	mapping = READ_ONCE(page->mapping);
> +	if (!mapping) {
> +		int shmem_swizzled;
> +
> +		/*
> +		 * Check again with page lock held to guard against
> +		 * memory pressure making shmem_writepage move the page
> +		 * from filecache to swapcache.
> +		 */
> +		lock_page(page);
> +		shmem_swizzled = PageSwapCache(page) || page->mapping;
> +		unlock_page(page);
> +		if (shmem_swizzled)
> +			return -EAGAIN;
> +		return -EFAULT;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Refusing device pages, the zero page, pages in the gate area, and
> + * special mappings. Inspired from futex.c checks.

That comment should be on the function above, because this loop does not do
much checking. Aside of that, a more elaborate explanation of how those
checks are done and why they work would be appreciated.

> + */
> +static int cpu_op_check_pages(struct page **pages,
> +		unsigned long nr_pages)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		int ret;
> +
> +		ret = cpu_op_check_page(pages[i]);
> +		if (ret)
> +			return ret;
> +	}
> +	return 0;
> +}
> +
> +static int cpu_op_pin_pages(unsigned long addr, unsigned long len,
> +		struct cpu_opv_pinned_pages *pin_pages, int write)
> +{
> +	struct page *pages[2];
> +	int ret, nr_pages;
> +
> +	if (!len)
> +		return 0;
> +	nr_pages = cpu_op_range_nr_pages(addr, len);
> +	BUG_ON(nr_pages > 2);

If that happens then you can emit a warning and return a proper error
code. BUG() is the last resort if there is no way to recover. This really
does not qualify.
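
I.e.:

        nr_pages = cpu_op_range_nr_pages(addr, len);
        if (WARN_ON_ONCE(nr_pages > 2))
                return -EINVAL;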

> +	if (!pin_pages->is_kmalloc && pin_pages->nr + nr_pages
> +			> NR_PINNED_PAGES_ON_STACK) {

Now I see what this is used for. That's a complete misnomer.

And this check is of course completely self explaining..... NOT!

> +		struct page **pinned_pages =
> +			kzalloc(CPU_OP_VEC_LEN_MAX * CPU_OP_MAX_PAGES
> +				* sizeof(struct page *), GFP_KERNEL);
> +		if (!pinned_pages)
> +			return -ENOMEM;
> +		memcpy(pinned_pages, pin_pages->pages,
> +			pin_pages->nr * sizeof(struct page *));
> +		pin_pages->pages = pinned_pages;
> +		pin_pages->is_kmalloc = true;

I have no idea why this needs to be done here and cannot be done in a
preparation step. That's horrible. You allocate conditionally at some
random place and then free at the end of the syscall.

What's wrong with:

       prepare_stuff()
       pin_pages()
       do_ops()
       cleanup_stuff()

Hmm?
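
At the syscall level that would look like this (sketch only, untested;
cpu_opv_prepare()/cpu_opv_cleanup() are made-up names):

SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
                int, cpu, int, flags)
{
        struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX];
        struct cpu_opv_pinned_pages pin_pages;
        int ret;

        /* Copy in, validate, and allocate the page array worst case. */
        ret = cpu_opv_prepare(ucpuopv, cpuopv, cpuopcnt, cpu, flags,
                              &pin_pages);
        if (ret)
                return ret;
        ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &pin_pages);
        if (!ret)
                ret = do_cpu_opv(cpuopv, cpuopcnt, cpu);
        /* Drop the page references and free the page array. */
        cpu_opv_cleanup(&pin_pages);
        return ret;
}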

> +	}
> +again:
> +	ret = get_user_pages_fast(addr, nr_pages, write, pages);
> +	if (ret < nr_pages) {
> +		if (ret > 0)
> +			put_page(pages[0]);
> +		return -EFAULT;
> +	}
> +	/*
> +	 * Refuse device pages, the zero page, pages in the gate area,
> +	 * and special mappings.

So the same comment again. Really helpful.

> +	 */
> +	ret = cpu_op_check_pages(pages, nr_pages);
> +	if (ret == -EAGAIN) {
> +		put_page(pages[0]);
> +		if (nr_pages > 1)
> +			put_page(pages[1]);
> +		goto again;
> +	}

So why can't you propagate EAGAIN to the caller and use the error cleanup
label? Or put the sequence of get_user_pages_fast() and check_pages() into
one function and confine the mess there instead of having the same cleanup
sequence 3 times in this function.
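
I.e. something like (untested, name made up):

static int cpu_op_get_pages(unsigned long addr, int nr_pages, int write,
                            struct page **pages)
{
        int ret;

        ret = get_user_pages_fast(addr, nr_pages, write, pages);
        if (ret < nr_pages) {
                if (ret > 0)
                        put_page(pages[0]);
                return -EFAULT;
        }
        /* Refuse device pages, the zero page, gate area, special mappings. */
        ret = cpu_op_check_pages(pages, nr_pages);
        if (ret) {
                put_page(pages[0]);
                if (nr_pages > 1)
                        put_page(pages[1]);
        }
        return ret;
}

Then the call site is just a loop around that which retries on -EAGAIN and
propagates everything else.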

> +	if (ret)
> +		goto error;
> +	pin_pages->pages[pin_pages->nr++] = pages[0];
> +	if (nr_pages > 1)
> +		pin_pages->pages[pin_pages->nr++] = pages[1];
> +	return 0;
> +
> +error:
> +	put_page(pages[0]);
> +	if (nr_pages > 1)
> +		put_page(pages[1]);
> +	return -EFAULT;
> +}
> +
> +static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt,
> +		struct cpu_opv_pinned_pages *pin_pages)
> +{
> +	int ret, i;
> +	bool expect_fault = false;
> +
> +	/* Check access, pin pages. */
> +	for (i = 0; i < cpuopcnt; i++) {
> +		struct cpu_op *op = &cpuop[i];
> +
> +		switch (op->op) {
> +		case CPU_COMPARE_EQ_OP:
> +		case CPU_COMPARE_NE_OP:
> +			ret = -EFAULT;
> +			expect_fault = op->u.compare_op.expect_fault_a;
> +			if (!access_ok(VERIFY_READ,
> +					(void __user *)op->u.compare_op.a,
> +					op->len))
> +				goto error;
> +			ret = cpu_op_pin_pages(
> +					(unsigned long)op->u.compare_op.a,
> +					op->len, pin_pages, 0);

Bah, this sucks. Moving the switch() into a separate function spares you
one indentation level and all these horrible to read line breaks.

And again I really have to ask why all of this stuff needs to be type
casted for every invocation. If you use the proper type for the argument
and then do the cast at the function entry then you can spare all that hard
to read crap.

> +error:
> +	for (i = 0; i < pin_pages->nr; i++)
> +		put_page(pin_pages->pages[i]);
> +	pin_pages->nr = 0;

Sigh. Why can't you do that at the call site where you have exactly the
same thing?

> +	/*
> +	 * If faulting access is expected, return EAGAIN to user-space.
> +	 * It allows user-space to distinguish between a fault caused by
> +	 * an access which is expect to fault (e.g. due to concurrent
> +	 * unmapping of underlying memory) from an unexpected fault from
> +	 * which a retry would not recover.
> +	 */
> +	if (ret == -EFAULT && expect_fault)
> +		return -EAGAIN;
> +	return ret;
> +}
> +
> +/* Return 0 if same, > 0 if different, < 0 on error. */
> +static int do_cpu_op_compare_iter(void __user *a, void __user *b, uint32_t len)
> +{
> +	char bufa[TMP_BUFLEN], bufb[TMP_BUFLEN];
> +	uint32_t compared = 0;
> +
> +	while (compared != len) {
> +		unsigned long to_compare;
> +
> +		to_compare = min_t(uint32_t, TMP_BUFLEN, len - compared);
> +		if (__copy_from_user_inatomic(bufa, a + compared, to_compare))
> +			return -EFAULT;
> +		if (__copy_from_user_inatomic(bufb, b + compared, to_compare))
> +			return -EFAULT;
> +		if (memcmp(bufa, bufb, to_compare))
> +			return 1;	/* different */

These tail comments are really crap. It's entirely obvious that if memcmp
!= 0 the result is different. So what is the exact value of the comment,
aside from making the code hard to read?

> +		compared += to_compare;
> +	}
> +	return 0;	/* same */
> +}
> +
> +/* Return 0 if same, > 0 if different, < 0 on error. */
> +static int do_cpu_op_compare(void __user *a, void __user *b, uint32_t len)
> +{
> +	int ret = -EFAULT;
> +	union {
> +		uint8_t _u8;
> +		uint16_t _u16;
> +		uint32_t _u32;
> +		uint64_t _u64;
> +#if (BITS_PER_LONG < 64)
> +		uint32_t _u64_split[2];
> +#endif
> +	} tmp[2];

I've seen the same union before

> +union op_fn_data {

......

> +
> +	pagefault_disable();
> +	switch (len) {
> +	case 1:
> +		if (__get_user(tmp[0]._u8, (uint8_t __user *)a))
> +			goto end;
> +		if (__get_user(tmp[1]._u8, (uint8_t __user *)b))
> +			goto end;
> +		ret = !!(tmp[0]._u8 != tmp[1]._u8);
> +		break;
> +	case 2:
> +		if (__get_user(tmp[0]._u16, (uint16_t __user *)a))
> +			goto end;
> +		if (__get_user(tmp[1]._u16, (uint16_t __user *)b))
> +			goto end;
> +		ret = !!(tmp[0]._u16 != tmp[1]._u16);
> +		break;
> +	case 4:
> +		if (__get_user(tmp[0]._u32, (uint32_t __user *)a))
> +			goto end;
> +		if (__get_user(tmp[1]._u32, (uint32_t __user *)b))
> +			goto end;
> +		ret = !!(tmp[0]._u32 != tmp[1]._u32);
> +		break;
> +	case 8:
> +#if (BITS_PER_LONG >= 64)

We already prepare for 128 bit?

> +		if (__get_user(tmp[0]._u64, (uint64_t __user *)a))
> +			goto end;
> +		if (__get_user(tmp[1]._u64, (uint64_t __user *)b))
> +			goto end;
> +#else
> +		if (__get_user(tmp[0]._u64_split[0], (uint32_t __user *)a))
> +			goto end;
> +		if (__get_user(tmp[0]._u64_split[1], (uint32_t __user *)a + 1))
> +			goto end;
> +		if (__get_user(tmp[1]._u64_split[0], (uint32_t __user *)b))
> +			goto end;
> +		if (__get_user(tmp[1]._u64_split[1], (uint32_t __user *)b + 1))
> +			goto end;
> +#endif
> +		ret = !!(tmp[0]._u64 != tmp[1]._u64);

This really sucks.

        union foo va, vb;

	pagefault_disable();
	switch (len) {
	case 1:
	case 2:
	case 4:
	case 8:
		va._u64 = _vb._u64 = 0;
		if (op_get_user(&va, a, len))
			goto out;
		if (op_get_user(&vb, b, len))
			goto out;
		ret = !!(va._u64 != vb._u64);
		break;
	default:
		...

and have

int op_get_user(union foo *val, void *p, int len)
{
	switch (len) {
	case 1:
	     ......

And do the magic once in that function then you spare that copy and pasted
code above. It can be reused in the other ops as well and reduces the amount
of copy and paste code significantly.
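
The helper would then look about like this (untested; assumes union
op_fn_data has the same members as the anonymous union above):

static int op_get_user(union op_fn_data *val, void __user *p, int len)
{
        switch (len) {
        case 1:
                return __get_user(val->_u8, (uint8_t __user *)p);

        case 2:
                return __get_user(val->_u16, (uint16_t __user *)p);

        case 4:
                return __get_user(val->_u32, (uint32_t __user *)p);

        case 8:
#if (BITS_PER_LONG >= 64)
                return __get_user(val->_u64, (uint64_t __user *)p);
#else
                if (__get_user(val->_u64_split[0], (uint32_t __user *)p))
                        return -EFAULT;
                return __get_user(val->_u64_split[1],
                                  (uint32_t __user *)p + 1);
#endif
        default:
                return -EINVAL;
        }
}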

> +		break;
> +	default:
> +		pagefault_enable();
> +		return do_cpu_op_compare_iter(a, b, len);
> +	}
> +end:
> +	pagefault_enable();
> +	return ret;
> +}

> +static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt)
> +{
> +	int i, ret;
> +
> +	for (i = 0; i < cpuopcnt; i++) {
> +		struct cpu_op *op = &cpuop[i];
> +
> +		/* Guarantee a compiler barrier between each operation. */
> +		barrier();
> +
> +		switch (op->op) {
> +		case CPU_COMPARE_EQ_OP:
> +			ret = do_cpu_op_compare(
> +					(void __user *)op->u.compare_op.a,
> +					(void __user *)op->u.compare_op.b,
> +					op->len);

I think you know by now how to spare an indentation level and type casts.

> +static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt, int cpu)
> +{
> +	int ret;
> +
> +	if (cpu != raw_smp_processor_id()) {
> +		ret = push_task_to_cpu(current, cpu);
> +		if (ret)
> +			goto check_online;
> +	}
> +	preempt_disable();
> +	if (cpu != smp_processor_id()) {
> +		ret = -EAGAIN;

This is weird. When the above raw_smp_processor_id() check fails you push,
but here you return. Not really consistent behaviour.

> +		goto end;
> +	}
> +	ret = __do_cpu_opv(cpuop, cpuopcnt);
> +end:
> +	preempt_enable();
> +	return ret;
> +
> +check_online:
> +	if (!cpu_possible(cpu))
> +		return -EINVAL;
> +	get_online_cpus();
> +	if (cpu_online(cpu)) {
> +		ret = -EAGAIN;
> +		goto put_online_cpus;
> +	}
> +	/*
> +	 * CPU is offline. Perform operation from the current CPU with
> +	 * cpu_online read lock held, preventing that CPU from coming online,
> +	 * and with mutex held, providing mutual exclusion against other
> +	 * CPUs also finding out about an offline CPU.
> +	 */

That's not mentioned in the comment at the top IIRC. 

> +	mutex_lock(&cpu_opv_offline_lock);
> +	ret = __do_cpu_opv(cpuop, cpuopcnt);
> +	mutex_unlock(&cpu_opv_offline_lock);
> +put_online_cpus:
> +	put_online_cpus();
> +	return ret;
> +}
> +
> +/*
> + * cpu_opv - execute operation vector on a given CPU with preempt off.
> + *
> + * Userspace should pass current CPU number as parameter. May fail with
> + * -EAGAIN if currently executing on the wrong CPU.
> + */
> +SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
> +		int, cpu, int, flags)
> +{
> +	struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX];
> +	struct page *pinned_pages_on_stack[NR_PINNED_PAGES_ON_STACK];

Oh well.... Naming sucks. 

> +	struct cpu_opv_pinned_pages pin_pages = {
> +		.pages = pinned_pages_on_stack,
> +		.nr = 0,
> +		.is_kmalloc = false,
> +	};
> +	int ret, i;
> +
> +	if (unlikely(flags))
> +		return -EINVAL;
> +	if (unlikely(cpu < 0))
> +		return -EINVAL;
> +	if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX)
> +		return -EINVAL;
> +	if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op)))
> +		return -EFAULT;
> +	ret = cpu_opv_check(cpuopv, cpuopcnt);

AFAICT you can calculate the number of pages already in the check and then
do that allocation before pinning the pages.

> +	if (ret)
> +		return ret;
> +	ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &pin_pages);
> +	if (ret)
> +		goto end;
> +	ret = do_cpu_opv(cpuopv, cpuopcnt, cpu);
> +	for (i = 0; i < pin_pages.nr; i++)
> +		put_page(pin_pages.pages[i]);
> +end:
> +	if (pin_pages.is_kmalloc)
> +		kfree(pin_pages.pages);
> +	return ret;
> +}


> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6bba05f47e51..e547f93a46c2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1052,6 +1052,43 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
>  		set_curr_task(rq, p);
>  }

This is NOT part of this functionality. It's a prerequisite and wants to be
in a separate patch. And I'm dead tired by now so I leave that thing for
tomorrow or for Peter.

Thanks,

	tglx


* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
  2017-11-16 23:26       ` Thomas Gleixner
@ 2017-11-17  0:14         ` Andi Kleen
       [not found]           ` <20171117001410.GG2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
  2017-11-20 16:13         ` Mathieu Desnoyers
  1 sibling, 1 reply; 80+ messages in thread
From: Andi Kleen @ 2017-11-17  0:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E . McKenney, Boqun Feng,
	Andy Lutomirski, Dave Watson,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H . Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, Steven Rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas


My preference would be just to drop this new super ugly system call. 

It's also not just the ugliness, but the very large attack surface
that worries me here.

As far as I know it is only needed to support single stepping, correct?

We already have other code that cannot be single stepped, most
prominently the ring 3 vdso time functions that rely on seq locks.

The code that actually cannot be single stepped is very small
and only a few lines.

As far as I know this wasn't ever a significant problem for anybody, and
there is always a simple workaround (set a temporary breakpoint
after it and continue).

The same should apply to the new rseq regions. They should all be
tiny, and we should just accept that they cannot be single-stepped,
but only skipped.

Then this whole mess would disappear.

-Andi


* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
       [not found]           ` <20171117001410.GG2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
@ 2017-11-17 10:09             ` Thomas Gleixner
  2017-11-17 17:14               ` Mathieu Desnoyers
  0 siblings, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2017-11-17 10:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E . McKenney, Boqun Feng,
	Andy Lutomirski, Dave Watson,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H . Peter Anvin, Andrew Hunter,
	Chris Lameter, Ben Maurer, Steven Rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon

On Thu, 16 Nov 2017, Andi Kleen wrote:
> My preference would be just to drop this new super ugly system call. 
> 
> It's also not just the ugliness, but the very large attack surface
> that worries me here.
> 
> As far as I know it is only needed to support single stepping, correct?

I can't figure that out because the changelog describes only WHAT the patch
does and not WHY. Useful, isn't it?

> Then this whole mess would disappear.

Agreed. That would be much appreciated.

Thanks,

	tglx


* Re: [RFC PATCH for 4.15 19/24] membarrier: selftest: Test shared expedited cmd
       [not found]   ` <20171114200414.2188-20-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-17 15:07     ` Shuah Khan
  0 siblings, 0 replies; 80+ messages in thread
From: Shuah Khan @ 2017-11-17 15:07 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Zijlstra, Paul E . McKenney, Boqun Feng,
	Andy Lutomirski, Dave Watson
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Greg Kroah-Hartman, Maged Michael,
	Avi Kivity

On 11/14/2017 01:04 PM, Mathieu Desnoyers wrote:
> Test the new MEMBARRIER_CMD_SHARED_EXPEDITED and
> MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED commands.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
> CC: Shuah Khan <shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
> CC: Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>
> CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
> CC: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> CC: Boqun Feng <boqun.feng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> CC: Maged Michael <maged.michael-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> CC: Avi Kivity <avi-VrcmuVmyx1hWk0Htik3J/w@public.gmane.org>
> CC: Benjamin Herrenschmidt <benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>
> CC: Paul Mackerras <paulus-eUNUBHrolfbYtjvyW6yDsg@public.gmane.org>
> CC: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
> CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
> CC: Alan Stern <stern-nwvwT67g6+6dFdvTe/nMLpVzexx5G7lz@public.gmane.org>
> CC: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> CC: Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> CC: Alice Ferrazzi <alice.ferrazzi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> CC: Paul Elder <paul.elder-fYq5UfK3d1k@public.gmane.org>
> CC: linux-kselftest-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> CC: linux-arch-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> ---
>  .../testing/selftests/membarrier/membarrier_test.c | 51 +++++++++++++++++++++-
>  1 file changed, 50 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
> index e6ee73d01fa1..bb9c58072c5c 100644
> --- a/tools/testing/selftests/membarrier/membarrier_test.c
> +++ b/tools/testing/selftests/membarrier/membarrier_test.c
> @@ -132,6 +132,40 @@ static int test_membarrier_private_expedited_success(void)
>  	return 0;
>  }
>  
> +static int test_membarrier_register_shared_expedited_success(void)
> +{
> +	int cmd = MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED, flags = 0;
> +	const char *test_name = "sys membarrier MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED";
> +
> +	if (sys_membarrier(cmd, flags) != 0) {
> +		ksft_exit_fail_msg(
> +			"%s test: flags = %d, errno = %d\n",
> +			test_name, flags, errno);
> +	}
> +
> +	ksft_test_result_pass(
> +		"%s test: flags = %d\n",
> +		test_name, flags);
> +	return 0;
> +}
> +
> +static int test_membarrier_shared_expedited_success(void)
> +{
> +	int cmd = MEMBARRIER_CMD_SHARED_EXPEDITED, flags = 0;
> +	const char *test_name = "sys membarrier MEMBARRIER_CMD_SHARED_EXPEDITED";
> +
> +	if (sys_membarrier(cmd, flags) != 0) {
> +		ksft_exit_fail_msg(
> +			"%s test: flags = %d, errno = %d\n",
> +			test_name, flags, errno);
> +	}
> +
> +	ksft_test_result_pass(
> +		"%s test: flags = %d\n",
> +		test_name, flags);
> +	return 0;
> +}
> +
>  static int test_membarrier(void)
>  {
>  	int status;
> @@ -154,6 +188,19 @@ static int test_membarrier(void)
>  	status = test_membarrier_private_expedited_success();
>  	if (status)
>  		return status;
> +	/*
> +	 * It is valid to send a shared membarrier from a non-registered
> +	 * process.
> +	 */
> +	status = test_membarrier_shared_expedited_success();
> +	if (status)
> +		return status;
> +	status = test_membarrier_register_shared_expedited_success();
> +	if (status)
> +		return status;
> +	status = test_membarrier_shared_expedited_success();
> +	if (status)
> +		return status;
>  	return 0;
>  }
>  
> @@ -173,8 +220,10 @@ static int test_membarrier_query(void)
>  		}
>  		ksft_exit_fail_msg("sys_membarrier() failed\n");
>  	}
> -	if (!(ret & MEMBARRIER_CMD_SHARED))
> +	if (!(ret & MEMBARRIER_CMD_SHARED)) {
> +		ksft_test_result_fail("sys_membarrier() CMD_SHARED query failed\n");
>  		ksft_exit_fail_msg("sys_membarrier is not supported.\n");
> +	}
>  
>  	ksft_test_result_pass("sys_membarrier available\n");
>  	return 0;
> 

Looks good to me. I am assuming this patch goes in with the rest of the
series. For this patch:

Acked-by: Shuah Khan <shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>

thanks,
-- Shuah


* Re: [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd
  2017-11-14 20:04 ` [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd Mathieu Desnoyers
@ 2017-11-17 15:09   ` Shuah Khan
  2017-11-17 16:17     ` Mathieu Desnoyers
  0 siblings, 1 reply; 80+ messages in thread
From: Shuah Khan @ 2017-11-17 15:09 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Zijlstra, Paul E . McKenney, Boqun Feng,
	Andy Lutomirski, Dave Watson
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Greg Kroah-Hartman, Maged Michael,
	Avi Kivity

On 11/14/2017 01:04 PM, Mathieu Desnoyers wrote:
> Test the new MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE and
> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE commands.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> CC: Shuah Khan <shuahkh@osg.samsung.com>
> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> CC: Boqun Feng <boqun.feng@gmail.com>
> CC: Andrew Hunter <ahh@google.com>
> CC: Maged Michael <maged.michael@gmail.com>
> CC: Avi Kivity <avi@scylladb.com>
> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> CC: Paul Mackerras <paulus@samba.org>
> CC: Michael Ellerman <mpe@ellerman.id.au>
> CC: Dave Watson <davejwatson@fb.com>
> CC: Alan Stern <stern@rowland.harvard.edu>
> CC: Will Deacon <will.deacon@arm.com>
> CC: Andy Lutomirski <luto@kernel.org>
> CC: Alice Ferrazzi <alice.ferrazzi@gmail.com>
> CC: Paul Elder <paul.elder@pitt.edu>
> CC: linux-kselftest@vger.kernel.org
> CC: linux-arch@vger.kernel.org
> ---
>  .../testing/selftests/membarrier/membarrier_test.c | 73 ++++++++++++++++++++++
>  1 file changed, 73 insertions(+)
> 
> diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
> index bb9c58072c5c..d9ab8b6ee52e 100644
> --- a/tools/testing/selftests/membarrier/membarrier_test.c
> +++ b/tools/testing/selftests/membarrier/membarrier_test.c
> @@ -132,6 +132,63 @@ static int test_membarrier_private_expedited_success(void)
>  	return 0;
>  }
>  
> +static int test_membarrier_private_expedited_sync_core_fail(void)
> +{
> +	int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, flags = 0;
> +	const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE not registered failure";
> +
> +	if (sys_membarrier(cmd, flags) != -1) {
> +		ksft_exit_fail_msg(
> +			"%s test: flags = %d. Should fail, but passed\n",
> +			test_name, flags);
> +	}
> +	if (errno != EPERM) {
> +		ksft_exit_fail_msg(
> +			"%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
> +			test_name, flags, EPERM, strerror(EPERM),
> +			errno, strerror(errno));
> +	}
> +
> +	ksft_test_result_pass(
> +		"%s test: flags = %d, errno = %d\n",
> +		test_name, flags, errno);
> +	return 0;
> +}
> +
> +static int test_membarrier_register_private_expedited_sync_core_success(void)
> +{
> +	int cmd = MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, flags = 0;
> +	const char *test_name = "sys membarrier MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE";
> +
> +	if (sys_membarrier(cmd, flags) != 0) {
> +		ksft_exit_fail_msg(
> +			"%s test: flags = %d, errno = %d\n",
> +			test_name, flags, errno);
> +	}
> +
> +	ksft_test_result_pass(
> +		"%s test: flags = %d\n",
> +		test_name, flags);
> +	return 0;
> +}
> +
> +static int test_membarrier_private_expedited_sync_core_success(void)
> +{
> +	int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED, flags = 0;
> +	const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE";
> +
> +	if (sys_membarrier(cmd, flags) != 0) {
> +		ksft_exit_fail_msg(
> +			"%s test: flags = %d, errno = %d\n",
> +			test_name, flags, errno);
> +	}
> +
> +	ksft_test_result_pass(
> +		"%s test: flags = %d\n",
> +		test_name, flags);
> +	return 0;
> +}
> +
>  static int test_membarrier_register_shared_expedited_success(void)
>  {
>  	int cmd = MEMBARRIER_CMD_REGISTER_SHARED_EXPEDITED, flags = 0;
> @@ -188,6 +245,22 @@ static int test_membarrier(void)
>  	status = test_membarrier_private_expedited_success();
>  	if (status)
>  		return status;
> +	status = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);
> +	if (status < 0) {
> +		ksft_test_result_fail("sys_membarrier() failed\n");
> +		return status;
> +	}
> +	if (status & MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE) {
> +		status = test_membarrier_private_expedited_sync_core_fail();
> +		if (status)
> +			return status;
> +		status = test_membarrier_register_private_expedited_sync_core_success();
> +		if (status)
> +			return status;
> +		status = test_membarrier_private_expedited_sync_core_success();
> +		if (status)
> +			return status;
> +	}
>  	/*
>  	 * It is valid to send a shared membarrier from a non-registered
>  	 * process.
> 

Looks good to me. I am assuming this patch goes in with the rest of the
series. For this patch:

Acked-by: Shuah Khan <shuahkh@osg.samsung.com>

thanks,
-- Shuah


* Re: [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd
  2017-11-17 15:09   ` Shuah Khan
@ 2017-11-17 16:17     ` Mathieu Desnoyers
  0 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-17 16:17 UTC (permalink / raw)
  To: Shuah Khan
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas

----- On Nov 17, 2017, at 10:09 AM, Shuah Khan shuahkh@osg.samsung.com wrote:
> 
> Looks good to me. I am assuming this patch goes in with the rest of the
> series. For this patch:
> 
> Acked-by: Shuah Khan <shuahkh@osg.samsung.com>

Thanks for the Acked-by. I'll add it to the patch changelog.
The sync_core membarrier patches are now queued for 4.16 however.

Thanks,

Mathieu

> 
> thanks,
> -- Shuah

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
  2017-11-17 10:09             ` Thomas Gleixner
@ 2017-11-17 17:14               ` Mathieu Desnoyers
       [not found]                 ` <1756446476.17265.1510938872121.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-11-17 20:22                 ` Thomas Gleixner
  0 siblings, 2 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-17 17:14 UTC (permalink / raw)
  To: Thomas Gleixner, Andi Kleen
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon, Michael

----- On Nov 17, 2017, at 5:09 AM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:

> On Thu, 16 Nov 2017, Andi Kleen wrote:
>> My preference would be just to drop this new super ugly system call.
>> 
>> It's also not just the ugliness, but the very large attack surface
>> that worries me here.
>> 
>> As far as I know it is only needed to support single stepping, correct?
> 
> I can't figure that out because the changelog describes only WHAT the patch
> does and not WHY. Useful, isn't it?
> 
>> Then this whole mess would disappear.
> 
> Agreed. That would be much appreciated.

Let's have a look at why cpu_opv is needed. I'll make sure to enhance the
changelog and documentation to include that information.

1) Handling single-stepping from tools

Tools like debuggers, and simulators like record-replay ("rr") use
single-stepping to run through existing programs. If core libraries start
to use restartable sequences for e.g. memory allocation, this means
pre-existing programs cannot be single-stepped, simply because the
underlying glibc or jemalloc has changed.

The rseq user-space does expose a __rseq_table section for the sake of
debuggers, so they can skip over the rseq critical sections if they want.
However, this requires upgrading tools, and still breaks single-stepping
in cases where glibc or jemalloc is updated but not the tooling.
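
For reference, each __rseq_table entry is simply the struct rseq_cs
descriptor the kernel uses for the fixup. Roughly (sketch only; field
widths shown for LP64, the uapi header uses the u32/u64 compat macros):

struct rseq_cs {
        uint32_t version;
        uint32_t flags;
        uint64_t start_ip;
        uint64_t post_commit_offset;
        uint64_t abort_ip;
};

A debugger which knows about that section can check whether the stopped
address falls within [start_ip, start_ip + post_commit_offset) and, rather
than single-stepping, continue with breakpoints at the end of the critical
section and at abort_ip.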

Having a performance-related library improvement break tooling is likely to
cause a big push-back against wide adoption of rseq. *I* would not even
be using rseq in liburcu and lttng-ust until gdb gets updated in every
distribution that my users depend on. This will likely happen... never.


2) Forward-progress guarantee

Having a piece of user-space code that stops progressing due to
external conditions is pretty bad. We are used to think of fast-path and
slow-path (e.g. for locking), where the contended vs uncontended cases
have different performance characteristics, but each need to provide some
level of progress guarantees.

I'm very concerned about proposing just "rseq" without the associated
slow-path (cpu_opv) that guarantees progress. It's just asking for trouble
when real life happens: page faults, uprobes, and other unforeseen
conditions that could, however rarely, cause a rseq fast-path to never progress.


3) Handling page faults

If we get creative enough, it's pretty easy to come up with corner-case
scenarios where rseq does not progress without the help from cpu_opv. For
instance, a system with swap enabled which is under high memory pressure
could trigger page faults at pretty much every rseq attempt. I recognize
that this scenario is extremely unlikely, but I'm not comfortable making
rseq the weak link of the chain here.


4) Comparison with LL/SC

Anyone versed in the load-link/store-conditional instructions in
RISC architectures will notice the similarity between rseq and LL/SC
critical sections. The comparison can even be pushed further: since
debuggers can handle those LL/SC critical sections, they should be
able to handle rseq c.s. in the same way.

First, the way gdb recognises LL/SC c.s. patterns is very fragile:
it's limited to specific common patterns, and will miss the pattern
in all other cases. But fear not, having the rseq c.s. expose a
__rseq_table to debuggers removes that guessing part.

The main difference between LL/SC and rseq is that debuggers had
to support single-stepping through LL/SC critical sections from the
get go in order to support a given architecture. For rseq, we're
adding critical sections into pre-existing applications/libraries,
so the user expectation is that tools don't break due to a library
optimization.


5) Perform maintenance operations on per-cpu data

rseq c.s. are quite limited feature-wise: they need to end with a
*single* commit instruction that updates a memory location. On the
other hand, the cpu_opv system call can combine a sequence of operations
that need to be executed with preemption disabled. While slower than
rseq, this allows for more complex maintenance operations to be
performed on per-cpu data concurrently with rseq fast-paths, in cases
where it's not possible to map those sequences of ops to a rseq.


6) Use cpu_opv as generic implementation for architectures not
   implementing rseq assembly code

rseq critical sections require architecture-specific user-space code
to be crafted in order to port an algorithm to a given architecture.
In addition, it requires that the kernel architecture implementation
adds hooks into signal delivery and resume to user-space.

In order to facilitate integration of rseq into user-space, cpu_opv
can provide a (relatively slower) architecture-agnostic implementation
of rseq. This means that user-space code can be ported to all
architectures through use of cpu_opv initially, and have the fast-path
use rseq whenever the asm code is implemented.


7) Allow libraries with multi-part algorithms to work on same per-cpu
   data without affecting the allowed cpu mask

I stumbled on an interesting use-case within the lttng-ust tracer
per-cpu buffers: the algorithm needs to update a "reserve" counter,
serialize data into the buffer, and then update a "commit" counter
_on the same per-cpu buffer_. My goal is to use rseq for both reserve
and commit.

Clearly, if rseq reserve fails, the algorithm can retry on a different
per-cpu buffer. However, it's not that easy for the commit. It needs to
be performed on the same per-cpu buffer as the reserve.

The cpu_opv system call solves that problem by receiving the cpu number
on which the operation needs to be performed as argument. It can push
the task to the right CPU if needed, and perform the operations there
with preemption disabled.

Changing the allowed cpu mask for the current thread is not an acceptable
alternative for a tracing library, because the application being traced
does not expect that mask to be changed by libraries.


So, TLDR: cpu_opv is needed for many use-cases other than single-stepping,
and facilitates adoption of rseq into pre-existing applications.


Thanks,

Mathieu

> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
       [not found]                 ` <1756446476.17265.1510938872121.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-17 18:18                   ` Andi Kleen
       [not found]                     ` <20171117181839.GH2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
  0 siblings, 1 reply; 80+ messages in thread
From: Andi Kleen @ 2017-11-17 18:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Andi Kleen, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Andy Lutomirski, Dave Watson, linux-kernel,
	linux-api, Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andrew Hunter, Chris Lameter, Ben Maurer,
	rostedt, Josh Triplett, Linus Torvalds, Catalin

Thanks for the detailed write up. That should have been in the
changelog...

Some comments below. Overall I think the case for the syscall is still
very weak.

> Let's have a look at why cpu_opv is needed. I'll make sure to enhance the
> changelog and documentation to include that information.
> 
> 1) Handling single-stepping from tools
> 
> Tools like debuggers, and simulators like record-replay ("rr") use
> single-stepping to run through existing programs. If core libraries start

No, rr doesn't use single stepping. It uses branch stepping based on the
PMU, and those should only happen on external events or syscalls which would
abort the rseq anyways.

Eventually it will succeed because not on every retry there will be a new
signal. If you put a syscall into your rseq you will just get
what you deserve.

If it was single stepping it couldn't execute the vDSO (and it would
be incredibly slow).

Yes debuggers have to skip instead of step, but that's easily done
(and needed today already for every gettimeofday, which tends to be
the most common syscall...) 

> to use restartable sequences for e.g. memory allocation, this means
> pre-existing programs cannot be single-stepped, simply because the
> underlying glibc or jemalloc has changed.

But how likely is it that the fall back path even works?

It would never be exercised in normal operation, so it would be a prime
target for bit rot, or for never being tested and thus broken in the first place.

> Having a performance-related library improvement break tooling is likely to
> cause a big push-back against wide adoption of rseq. *I* would not even
> be using rseq in liburcu and lttng-ust until gdb gets updated in every
> distribution that my users depend on. This will likely happen... never.

I suspect that, due to the above, your scheme already has a <50% likelihood
of working, so it's equivalent.

> 
> 
> 2) Forward-progress guarantee
> 
> Having a piece of user-space code that stops progressing due to
> external conditions is pretty bad. We are used to think of fast-path and
> slow-path (e.g. for locking), where the contended vs uncontended cases
> have different performance characteristics, but each need to provide some
> level of progress guarantees.

We already have that in the vDSO. Has never been a problem.

> 3) Handling page faults
> 
> If we get creative enough, it's pretty easy to come up with corner-case
> scenarios where rseq does not progress without the help from cpu_opv. For
> instance, a system with swap enabled which is under high memory pressure
> could trigger page faults at pretty much every rseq attempt. I recognize
> that this scenario is extremely unlikely, but I'm not comfortable making
> rseq the weak link of the chain here.

Seems very unlikely. But if this happens the program is dead anyways,
so doesn't make much difference.


> The main difference between LL/SC and rseq is that debuggers had
> to support single-stepping through LL/SC critical sections from the
> get go in order to support a given architecture. For rseq, we're
> adding critical sections into pre-existing applications/libraries,
> so the user expectation is that tools don't break due to a library
> optimization.

I would argue that debugging some path other than the one that is normally
executed is wrong by definition. How would you find the bug if it is
in the original path only, but not in the fallback?

The whole point of debugging is to punch through abstractions,
but you're adding another layer of obfuscation here. And worse
you have no guarantee that the new layer is actually functionally
equivalent.

Having less magic and just assuming the user can do the right thing
seems like a far more practical scheme.

> In order to facilitate integration of rseq into user-space, cpu_opv
> can provide a (relatively slower) architecture-agnostic implementation
> of rseq. This means that user-space code can be ported to all
> architectures through use of cpu_opv initially, and have the fast-path
> use rseq whenever the asm code is implemented.

While that's in theory correct, in practice it will be so slow
that it is useless. Nobody will want a system call in their malloc
fast path.


-Andi


* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
       [not found]                     ` <20171117181839.GH2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
@ 2017-11-17 18:59                       ` Thomas Gleixner
  2017-11-17 19:15                         ` Andi Kleen
  0 siblings, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2017-11-17 18:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Andy Lutomirski, Dave Watson, linux-kernel, linux-api,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andrew Hunter, Chris Lameter, Ben Maurer,
	rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas

On Fri, 17 Nov 2017, Andi Kleen wrote:
> > 1) Handling single-stepping from tools
> > 
> > Tools like debuggers, and simulators like record-replay ("rr") use
> > single-stepping to run through existing programs. If core libraries start
> 
> No, rr doesn't use single stepping. It uses branch stepping based on the
> PMU, and those should only happen on external events or syscalls which would
> abort the rseq anyway.
> 
> Eventually it will succeed because not every retry there will be a new
> signal. If you put a syscall into your rseq you will just get
> what you deserve.
> 
> If it was single stepping it couldn't execute the vDSO (and it would
> be incredibly slow)
>
> Yes debuggers have to skip instead of step, but that's easily done
> (and needed today already for every gettimeofday, which tends to be
> the most common syscall...)

The same problem exists with TSX. Hitting a breakpoint inside a
transaction triggers an abort with the cause bit 'breakpoint' set.

rseq can be looked at as a variant of software transactional memory with a
limited feature set, but the underlying problems are exactly the same.
Breakpoints do not work there either.

> > to use restartable sequences for e.g. memory allocation, this means
> > pre-existing programs cannot be single-stepped, simply because the
> > underlying glibc or jemalloc has changed.
> 
> But how likely is it that the fall back path even works?
> 
> It would never be exercised in normal operation, so it would be a prime
> target for bit rot, or not ever being tested and be broken in the first place.

What's worse is that the fallback path cannot be debugged at all. It's a
magic byte code which is interpreted inside the kernel.


I really do not understand why this does not utilize existing design
patterns well known from transactional memory.

The most straightforward approach is to have a mechanism which forces
everything into the slow path in case of debugging, lack of progress, etc.
The slow path uses traditional locking to resolve the situation. That's well
known to work, and if done correctly the only difference between slow path
and fast path is the locking/transaction control, i.e. it's a single
implementation of the actual memory operations which can be single-stepped,
debug-traced and whatever.

It solves _ALL_ of the problems you describe including support for systems
which do not support rseq at all.

This syscall is horribly overengineered and creates more problems than it
solves.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
  2017-11-17 18:59                       ` Thomas Gleixner
@ 2017-11-17 19:15                         ` Andi Kleen
       [not found]                           ` <20171117191547.GI2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
  0 siblings, 1 reply; 80+ messages in thread
From: Andi Kleen @ 2017-11-17 19:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Andy Lutomirski, Dave Watson, linux-kernel,
	linux-api, Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andrew Hunter, Chris Lameter, Ben Maurer,
	rostedt, Josh Triplett, Linus Torvalds

> The most straight forward is to have a mechanism which forces everything
> into the slow path in case of debugging, lack of progress, etc. The slow

That's the abort address, right?

For the generic case, the fallback path would unfortunately require
disabling preemption, for which we don't have a mechanism in user space.

I think that is what Mathieu tried to implement here with this call.

There may be some special cases where it's possible without preemption
control, e.g. a malloc could just not use the per-CPU cache. But I doubt
that is possible in all cases that Mathieu envisions.
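
Roughly, such a malloc fallback could look like this (sketch only, all
function names here are made up):

#include <stddef.h>

void *percpu_cache_alloc(size_t size);  /* rseq-based fast path, may fail */
void *central_heap_alloc(size_t size);  /* lock-based, needs no preemption control */

static void *my_malloc(size_t size)
{
        void *p = percpu_cache_alloc(size);

        /* Fallback (e.g. while single-stepping): bypass the per-CPU cache. */
        if (!p)
                p = central_heap_alloc(size);
        return p;
}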

But again, that would be a different code path, and I question the need for
it when we can just let the operator of the debugger deal with it.

-Andi

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
       [not found]                           ` <20171117191547.GI2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
@ 2017-11-17 20:07                             ` Thomas Gleixner
  2017-11-18 21:09                               ` Andy Lutomirski
  0 siblings, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2017-11-17 20:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Andy Lutomirski, Dave Watson, linux-kernel, linux-api,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andrew Hunter, Chris Lameter, Ben Maurer,
	rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas

On Fri, 17 Nov 2017, Andi Kleen wrote:
> > The most straight forward is to have a mechanism which forces everything
> > into the slow path in case of debugging, lack of progress, etc. The slow
> 
> That's the abort address, right?

Yes.

> For the generic case the fall back path would require disabling preemption
> unfortunately, for which we don't have a mechanism in user space.
> 
> I think that is what Mathieu tried to implement here with this call.

Yes: preempt-disabled execution of byte code to make sure that the
transaction succeeds.

But, why is disabling preemption mandatory? If stuff fails due to hitting a
breakpoint or because it retried a gazillion times without progress, then
the abort code can detect that and act accordingly. Pseudo code:

abort:
	if (!slowpath_required() &&
	    !breakpoint_caused_abort() &&
	    !stall_detected()) {
		do_the_normal_abort_postprocessing();
		goto retry;
	}

	lock(slowpath_lock[cpu]);

	if (!slowpath_required()) {
	   	unlock(slowpath_lock[cpu]);
		goto retry;
	}

	if (rseq_supported)
		set_slow_path();

	/* Same code as inside the actual rseq */
	do_transaction();

	if (rseq_supported)
		unset_slow_path();

	unlock(slowpath_lock[cpu]);

The only interesting question is how to make sure that all threads on that
CPU see that the slow path is required before they execute the commit, so
they are forced into the slow path. The simplest thing would be atomics, but
that's exactly what rseq wants to avoid.

I think that this can be solved cleanly with the help of the membarrier
syscall or some variant of that without all that 'yet another byte code
interpreter' mess.
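
The flag-publishing side could look roughly like this (just a sketch,
assuming the private expedited membarrier command and a hypothetical
slowpath_required[] array which the fast path checks before committing):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical per-CPU flag array, read by the rseq fast path. */
extern int slowpath_required[];

static void publish_slowpath_required(int cpu)
{
        slowpath_required[cpu] = 1;
        /*
         * Assumes the process registered with
         * MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED at startup. The
         * expedited barrier is meant to pair with the fast path's flag
         * check; whether this alone is enough to force every thread into
         * the slow path before its next commit is exactly the ordering
         * question raised above.
         */
        syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
}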

The other question is whether do_transaction() is required to run on that
specific CPU. I don't think so, because that magic interpreter operates even
when the required target CPU is offline, and with locking in place there is
no reason why running on the target CPU would be required.

Sure, that's going to affect performance, but only for two cases:

  1) Debugging. That's completely uninteresting

  2) No progress at all. Performance is down the drain anyway, so it does
     not matter at all whether you spend a few more cycles or not to
     resolve that.

I might be missing something as usual :)

Thanks

	tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
  2017-11-17 17:14               ` Mathieu Desnoyers
       [not found]                 ` <1756446476.17265.1510938872121.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-17 20:22                 ` Thomas Gleixner
  2017-11-20 17:13                   ` Mathieu Desnoyers
  1 sibling, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2017-11-17 20:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Andy Lutomirski, Dave Watson, linux-kernel, linux-api,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andrew Hunter, Chris Lameter, Ben Maurer,
	rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas, Will

On Fri, 17 Nov 2017, Mathieu Desnoyers wrote:
> ----- On Nov 17, 2017, at 5:09 AM, Thomas Gleixner tglx@linutronix.de wrote:
> 7) Allow libraries with multi-part algorithms to work on same per-cpu
>    data without affecting the allowed cpu mask
> 
> I stumbled on an interesting use-case within the lttng-ust tracer
> per-cpu buffers: the algorithm needs to update a "reserve" counter,
> serialize data into the buffer, and then update a "commit" counter
> _on the same per-cpu buffer_. My goal is to use rseq for both reserve
> and commit.
> 
> Clearly, if rseq reserve fails, the algorithm can retry on a different
> per-cpu buffer. However, it's not that easy for the commit. It needs to
> be performed on the same per-cpu buffer as the reserve.
> 
> The cpu_opv system call solves that problem by receiving the cpu number
> on which the operation needs to be performed as argument. It can push
> the task to the right CPU if needed, and perform the operations there
> with preemption disabled.

If your transaction cannot be done in one go, then abusing that byte code
interpreter for concluding it is just hilarious. That whole exercise is a
gazillion times slower than the atomic operations which are necessary to do
it without all that.

I'm even more convinced now that this is overengineered beyond repair.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
  2017-11-17 20:07                             ` Thomas Gleixner
@ 2017-11-18 21:09                               ` Andy Lutomirski
  0 siblings, 0 replies; 80+ messages in thread
From: Andy Lutomirski @ 2017-11-18 21:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, linux-kernel, linux-api, Paul Turner,
	Andrew Morton, Russell King, Ingo Molnar, H. Peter Anvin,
	Andrew Hunter, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas

On Fri, Nov 17, 2017 at 12:07 PM, Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org> wrote:
> On Fri, 17 Nov 2017, Andi Kleen wrote:
>> > The most straight forward is to have a mechanism which forces everything
>> > into the slow path in case of debugging, lack of progress, etc. The slow
>>
>> That's the abort address, right?
>
> Yes.
>
>> For the generic case the fall back path would require disabling preemption
>> unfortunately, for which we don't have a mechanism in user space.
>>
>> I think that is what Mathieu tried to implement here with this call.
>
> Yes. preempt disabled execution of byte code to make sure that the
> transaction succeeds.
>
> But, why is disabling preemption mandatory? If stuff fails due to hitting a
> breakpoint or because it retried a gazillion times without progress, then
> the abort code can detect that and act accordingly. Pseudo code:
>
> abort:
>         if (!slowpath_required() &&
>             !breakpoint_caused_abort() &&
>             !stall_detected()) {
>                 do_the_normal_abort_postprocessing();
>                 goto retry;
>         }
>
>         lock(slowpath_lock[cpu]);
>
>         if (!slowpath_required()) {
>                 unlock(slowpath_lock[cpu]);
>                 goto retry;
>         }
>
>         if (rseq_supported)
>                 set_slow_path();
>
>         /* Same code as inside the actual rseq */
>         do_transaction();
>
>         if (rseq_supported)
>                 unset_slow_path();
>
>         unlock(slowpath_lock[cpu]);

My objection to this approach is that people will get it wrong and not
notice until it's too late.  TSX has two things going for it:

1. It's part of the ISA, so debuggers have very well-defined semantics
to deal with and debuggers will know about it.  rseq is a made-up
Linux thing and debuggers may not know what to do with it.

2. TSX is slow and crappy, so it may not be that widely used.  glibc,
OTOH, will probably start using rseq on all machines if the patches
are merged.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
  2017-11-16 21:08       ` Thomas Gleixner
@ 2017-11-19 17:24         ` Mathieu Desnoyers
  0 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-19 17:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon

----- On Nov 16, 2017, at 4:08 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:

> On Tue, 14 Nov 2017, Mathieu Desnoyers wrote:
>> +#ifdef __KERNEL__
>> +# include <linux/types.h>
>> +#else	/* #ifdef __KERNEL__ */
> 
> Please drop these comments. They are distracting and not helpful at
> all. They are valuable for long #ideffed sections but then the normal form
> is:
> 
> /* __KERNEL__ */
> 
> /* !__KERNEL__ */
> 

ok

>> +# include <stdint.h>
>> +#endif	/* #else #ifdef __KERNEL__ */
>> +
>> +#include <asm/byteorder.h>
>> +
>> +#ifdef __LP64__
>> +# define RSEQ_FIELD_u32_u64(field)			uint64_t field
>> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)	field = (intptr_t)v
>> +#elif defined(__BYTE_ORDER) ? \
>> +	__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> 
> Can you please make this decision separate and propagate the result ?

Something like this ?

#ifdef __BYTE_ORDER
# if (__BYTE_ORDER == __BIG_ENDIAN)
#  define RSEQ_BYTE_ORDER_BIG_ENDIAN
# else
#  define RSEQ_BYTE_ORDER_LITTLE_ENDIAN
# endif
#else
# ifdef __BIG_ENDIAN
#  define RSEQ_BYTE_ORDER_BIG_ENDIAN
# else
#  define RSEQ_BYTE_ORDER_LITTLE_ENDIAN
# endif
#endif

#ifdef __LP64__
# define RSEQ_FIELD_u32_u64(field)                      uint64_t field
# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)      field = (intptr_t)v
#else
# ifdef RSEQ_BYTE_ORDER_BIG_ENDIAN
#  define RSEQ_FIELD_u32_u64(field)     uint32_t field ## _padding, field
#  define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)     \
        field ## _padding = 0, field = (intptr_t)v
# else
#  define RSEQ_FIELD_u32_u64(field)     uint32_t field, field ## _padding
#  define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)     \
        field = (intptr_t)v, field ## _padding = 0
# endif
#endif


> 
>> +# define RSEQ_FIELD_u32_u64(field)	uint32_t field ## _padding, field
>> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
>> +	field ## _padding = 0, field = (intptr_t)v
>> +#else
>> +# define RSEQ_FIELD_u32_u64(field)	uint32_t field, field ## _padding
>> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
>> +	field = (intptr_t)v, field ## _padding = 0
>> +#endif
>> +
>> +enum rseq_flags {
>> +	RSEQ_FLAG_UNREGISTER = (1 << 0),
>> +};
>> +
>> +enum rseq_cs_flags {
>> +	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT	= (1U << 0),
>> +	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL	= (1U << 1),
>> +	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	= (1U << 2),
>> +};
>> +
>> +/*
>> + * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
>> + * contained within a single cache-line. It is usually declared as
>> + * link-time constant data.
>> + */
>> +struct rseq_cs {
>> +	uint32_t version;	/* Version of this structure. */
>> +	uint32_t flags;		/* enum rseq_cs_flags */
>> +	RSEQ_FIELD_u32_u64(start_ip);
>> +	RSEQ_FIELD_u32_u64(post_commit_offset);	/* From start_ip */

I'll move the tail comments to their own line.

>> +	RSEQ_FIELD_u32_u64(abort_ip);
>> +} __attribute__((aligned(4 * sizeof(uint64_t))));
>> +
>> +/*
>> + * struct rseq is aligned on 4 * 8 bytes to ensure it is always
>> + * contained within a single cache-line.
>> + *
>> + * A single struct rseq per thread is allowed.
>> + */
>> +struct rseq {
>> +	/*
>> +	 * Restartable sequences cpu_id_start field. Updated by the
>> +	 * kernel, and read by user-space with single-copy atomicity
>> +	 * semantics. Aligned on 32-bit. Always contain a value in the
> 
> contains

ok

> 
>> +	 * range of possible CPUs, although the value may not be the
>> +	 * actual current CPU (e.g. if rseq is not initialized). This
>> +	 * CPU number value should always be confirmed against the value
>> +	 * of the cpu_id field.
> 
> Who is supposed to confirm that? I think I know what the purpose of the
> field is, but from that comment it's not obvious at all.

Proposed update:

        /*
         * Restartable sequences cpu_id_start field. Updated by the
         * kernel, and read by user-space with single-copy atomicity
         * semantics. Aligned on 32-bit. Always contains a value in the
         * range of possible CPUs, although the value may not be the
         * actual current CPU (e.g. if rseq is not initialized). This
         * CPU number value should always be compared against the value
         * of the cpu_id field before performing a rseq commit or
         * returning a value read from a data structure indexed using
         * the cpu_id_start value.
         */
        uint32_t cpu_id_start;

> 
>> +	 */
>> +	uint32_t cpu_id_start;
>> +	/*
>> +	 * Restartable sequences cpu_id field. Updated by the kernel,
>> +	 * and read by user-space with single-copy atomicity semantics.
> 
> Again. What's the purpose of reading it.
> 

Update:

        /*
         * Restartable sequences cpu_id field. Updated by the kernel,
         * and read by user-space with single-copy atomicity semantics.
         * Aligned on 32-bit. Values RSEQ_CPU_ID_UNINITIALIZED and
         * RSEQ_CPU_ID_REGISTRATION_FAILED have a special semantic: the
         * former means "rseq uninitialized", and latter means "rseq
         * initialization failed". This value is meant to be read within
         * rseq critical sections and compared with the cpu_id_start
         * value previously read, before performing the commit instruction,
         * or read and compared with the cpu_id_start value before returning
         * a value loaded from a data structure indexed using the
         * cpu_id_start value.
         */
        uint32_t cpu_id;
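
To illustrate the intended use of those two fields, a C-level sketch only
(the real critical section is inline asm with a rseq_cs descriptor and a
single-instruction commit; __rseq_abi is the TLS area as declared in the
selftests, and percpu_count is a hypothetical per-cpu array):

#include <stdint.h>
#include <linux/rseq.h>         /* uapi header added by this series */

extern __thread volatile struct rseq __rseq_abi;

static void percpu_count_inc(intptr_t *percpu_count)
{
        uint32_t cpu;

retry:
        cpu = __rseq_abi.cpu_id_start;
        /* ... preparation may index per-cpu data with 'cpu' here ... */
        if (cpu != __rseq_abi.cpu_id)
                goto retry;             /* stands in for the branch to abort_ip */
        percpu_count[cpu]++;            /* commit */
}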


>> +	 * Aligned on 32-bit. Values -1U and -2U have a special
>> +	 * semantic: -1U means "rseq uninitialized", and -2U means "rseq
>> +	 * initialization failed".
>> +	 */
>> +	uint32_t cpu_id;
>> +	/*
>> +	 * Restartable sequences rseq_cs field.
>> +	 *
>> +	 * Contains NULL when no critical section is active for the current
>> +	 * thread, or holds a pointer to the currently active struct rseq_cs.
>> +	 *
>> +	 * Updated by user-space at the beginning of assembly instruction
>> +	 * sequence block, and by the kernel when it restarts an assembly
>> +	 * instruction sequence block, and when the kernel detects that it
>> +	 * is preempting or delivering a signal outside of the range
>> +	 * targeted by the rseq_cs. Also needs to be cleared by user-space
>> +	 * before reclaiming memory that contains the targeted struct
>> +	 * rseq_cs.
> 
> This paragraph is pretty convoluted and it's not really clear what the
> actual purpose is and how it is supposed to be used.
> 
>   It's NULL when no critical section is active.
> 
>   It holds a pointer to a struct rseq_cs when a critical section is active. Fine
> 
> Now the update rules:
> 
>    - By user space at the start of the critical section, i.e. user space
>      sets the pointer to rseq_cs
> 
>    - By the kernel when it restarts a sequence block etc ....
> 
>      What happens to this field? Is the pointer updated or cleared or
>      what? How is the kernel supposed to fiddle with the pointer?

The kernel sets it back to NULL when it encounters a non-NULL ptr, independently
of whether it was nesting over a rseq c.s. or not. Updating to:

         * Updated by user-space, which sets the address of the currently
         * active rseq_cs at the beginning of assembly instruction sequence
         * block, and set to NULL by the kernel when it restarts an assembly
         * instruction sequence block, as well as when the kernel detects that
         * it is preempting or delivering a signal outside of the range
         * targeted by the rseq_cs. Also needs to be set to NULL by user-space
         * before reclaiming memory that contains the targeted struct rseq_cs.


>> +	 *
>> +	 * Read and set by the kernel with single-copy atomicity semantics.
> 
> This looks like it's purely kernel owned, but above you say it's written by
> user space. There are no rules for user space?

Update:

         * Read and set by the kernel with single-copy atomicity semantics.
         * Set by user-space with single-copy atomicity semantics. Aligned
         * on 64-bit.

> 
>> +	 * Aligned on 64-bit.
>> +	 */
>> +	RSEQ_FIELD_u32_u64(rseq_cs);
>> +	/*
>> +	 * - RSEQ_DISABLE flag:
>> +	 *
>> +	 * Fallback fast-track flag for single-stepping.
>> +	 * Set by user-space if lack of progress is detected.
>> +	 * Cleared by user-space after rseq finish.
>> +	 * Read by the kernel.
>> +	 * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
>> +	 *     Inhibit instruction sequence block restart and event
>> +	 *     counter increment on preemption for this thread.
>> +	 * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
>> +	 *     Inhibit instruction sequence block restart and event
>> +	 *     counter increment on signal delivery for this thread.
>> +	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>> +	 *     Inhibit instruction sequence block restart and event
>> +	 *     counter increment on migration for this thread.
> 
> That looks dangerous. You want to single step through the critical section
> and just ignore whether you've been preempted or migrated. How is that
> supposed to work?

If you're closely inspecting a program with single-stepping, and the user
really wants to single-step through the rseq c.s., those flags can be set by
the user or debugger to allow this single-stepping to proceed. It's then up
to the user/tool to ensure mutual exclusion with other rseq c.s. by other
means. One of its uses is for implementers of rseq c.s. assembly code, where
they may want to single-step through that code while ensuring consistency
through other mechanisms (e.g. mutual exclusion, or running single-threaded).

> 
>> +++ b/kernel/rseq.c
>> @@ -0,0 +1,328 @@
>> + * Detailed algorithm of rseq user-space assembly sequences:
>> + *
>> + *   Steps [1]-[3] (inclusive) need to be a sequence of instructions in
>> + *   userspace that can handle being moved to the abort_ip between any
>> + *   of those instructions.
> 
> A sequence of instructions cannot be moved. Please describe this in
> technical correct wording.

Update:

 *   Steps [1]-[3] (inclusive) need to be a sequence of instructions in
 *   userspace that can handle being interrupted between any of those
 *   instructions, and then resumed to the abort_ip.

> 
>> + *   The abort_ip address needs to be less than start_ip, or
>> + *   greater-or-equal the post_commit_ip. Step [5] and the failure
> 
> s/the/than/

ok

> 
>> + *   code step [F1] need to be at addresses lesser than start_ip, or
>> + *   greater-or-equal the post_commit_ip.
> 
> Please describe that block visually for clarity
> 
>		init(rseq_cs)
>		cpu = TLS->rseq::cpu_id
>       
> start_ip	-----------------
> [1]		TLS->rseq::rseq_cs = rseq_cs
>		barrier()
> 
> [2]		if (cpu != TLS->rseq::cpu_id)
>			goto fail_ip;
> 
> [3]		last_instruction_in_cs()
> post_commit_ip  ----------------
> 
> The address of jump target fail_ip must be outside the critical region, i.e.
> 
>    fail_ip < start_ip  ||	 fail_ip >= post_commit_ip
> 
> Some textual explanation along with that is certainly helpful, but.

Updated to:

 *                     init(rseq_cs)
 *                     cpu = TLS->rseq::cpu_id_start
 *   [1]               TLS->rseq::rseq_cs = rseq_cs
 *   [start_ip]        ----------------------------
 *   [2]               if (cpu != TLS->rseq::cpu_id)
 *                             goto abort_ip;
 *   [3]               <last_instruction_in_cs>
 *   [post_commit_ip]  ----------------------------
 *  
 *   The address of jump target abort_ip must be outside the critical
 *   region, i.e.:
 *
 *     [abort_ip] < [start_ip]  || [abort_ip] >= [post_commit_ip]


> 
>> + *       [start_ip]
>> + *   1.  Userspace stores the address of the struct rseq_cs assembly
>> + *       block descriptor into the rseq_cs field of the registered
>> + *       struct rseq TLS area. This update is performed through a single
>> + *       store, followed by a compiler barrier which prevents the
>> + *       compiler from moving following loads or stores before this
>> + *       store.

Actually, given that this all needs to be in an inline assembly, it
makes no sense to talk about "compiler barrier" anymore. I'll update
the last part.

The "[start_ip] tag moves just after the paragraph at "1.".


>> + *
>> + *   2.  Userspace tests to see whether the current cpu_id field
>> + *       match the cpu number loaded before start_ip. Manually jumping
>> + *       to [F1] in case of a mismatch.
> 
>  Manually jumping?

"Branching to abort_ip" would be better indeed.

> 
>> + *
>> + *       Note that if we are preempted or interrupted by a signal
> 
> Up to this point the description was technical, Now you start to
> impersonate. That's inconsistent at best.

ok

> 
>> + *       after [1] and before post_commit_ip, then the kernel
> 
> How does the kernel know about being "after" [1]. Is there something else
> than start_ip and post_commit_id? According to this, yes. And that wants a
> name and wants to be shown in the visual block. I call it magic_ip for now.

It should have been "at or after":

 *       If the sequence is preempted or interrupted by a signal
 *       at or after start_ip and before post_commit_ip, then the kernel
 *       clears TLS->__rseq_abi::rseq_cs, then resumes execution at the
 *       abort_ip.

You bring up a good point. Although it has no impact in practice, the
"start_ip" can indicate the address right *after* the store to rseq_cs.
I'll change the documentation and user-space implementation accordingly.

> 
>> + *       clears the rseq_cs field of struct rseq, then jumps us to
>> + *       abort_ip.
> 
> The kernel does not jump us.
> 
>    	    If the execution sequence gets preempted at an address >=
>    	    magic_ip and < post_commit_ip, the kernel sets
>    	    TLS->rseq::rseq_cs to NULL and sets the user space return ip to
>    	    fail_ip before returning to user space, so the preempted
>    	    execution resumes at fail_ip.
> 
> Hmm?

Updated to:

 *       If the sequence is preempted or interrupted by a signal
 *       at or after start_ip and before post_commit_ip, then the kernel
 *       clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
 *       ip to abort_ip before returning to user-space, so the preempted
 *       execution resumes at abort_ip.

> 
>> + *   3.  Userspace critical section final instruction before
>> + *       post_commit_ip is the commit. The critical section is
>> + *       self-terminating.
>> + *       [post_commit_ip]
>> + *
>> + *   4.  success
>> + *
>> + *   On failure at [2]:
>> + *
>> + *       [abort_ip]
> 
> Now you introduce abort_ip. Why not use the same terminology consistently?
> Because it would make sense and not confuse the reader?

I use abort_ip everywhere. Not sure where you saw "fail_ip".

> 
>> + *   F1. goto failure label
>> + */
>> +
>> +static bool rseq_update_cpu_id(struct task_struct *t)
>> +{
>> +	uint32_t cpu_id = raw_smp_processor_id();
>> +
>> +	if (__put_user(cpu_id, &t->rseq->cpu_id_start))
>> +		return false;
>> +	if (__put_user(cpu_id, &t->rseq->cpu_id))
>> +		return false;
>> +	trace_rseq_update(t);
>> +	return true;
>> +}
>> +
>> +static bool rseq_reset_rseq_cpu_id(struct task_struct *t)
>> +{
>> +	uint32_t cpu_id_start = 0, cpu_id = -1U;
> 
> Please do not use -1U. Define a proper symbol for it. Hardcoded constant
> numbers which have a special measing are annoying.

I'll add the following enum to uapi rseq.h:

enum rseq_cpu_id_state {
        RSEQ_CPU_ID_UNINITIALIZED               = -1,
        RSEQ_CPU_ID_REGISTRATION_FAILED         = -2,   
};

And use it both in the user-space library and in the kernel
"reset" code.

> 
>> +	/*
>> +	 * Reset cpu_id_start to its initial state (0).
>> +	 */
>> +	if (__put_user(cpu_id_start, &t->rseq->cpu_id_start))
>> +		return false;
> 
> Why bool? If the callsite propagates an error code return it right from
> here please.

ok, fixed.

> 
>> +	/*
>> +	 * Reset cpu_id to -1U, so any user coming in after unregistration can
>> +	 * figure out that rseq needs to be registered again.
>> +	 */
>> +	if (__put_user(cpu_id, &t->rseq->cpu_id))
>> +		return false;
>> +	return true;
>> +}
>> +
>> +static bool rseq_get_rseq_cs(struct task_struct *t,
>> +		void __user **start_ip,
>> +		unsigned long *post_commit_offset,
>> +		void __user **abort_ip,
>> +		uint32_t *cs_flags)
> 
> Please align the arguments with the argument in the first line
> 

done.

>> +{
>> +	unsigned long ptr;
>> +	struct rseq_cs __user *urseq_cs;
>> +	struct rseq_cs rseq_cs;
>> +	u32 __user *usig;
>> +	u32 sig;
> 
> Please sort those variables by length in reverse fir tree order.

ok

> 
>> +
>> +	if (__get_user(ptr, &t->rseq->rseq_cs))
>> +		return false;
> 
> Call site stores an int and then returns -EFAULT. Works, but pretty is
> something else.

moving all return values to "int", and propagating result.

> 
>> +	if (!ptr)
>> +		return true;
> 
> What's wrong with 0 / -ERRORCODE returns which are the standard way in the
> kernel?

done

> 
>> +	urseq_cs = (struct rseq_cs __user *)ptr;
>> +	if (copy_from_user(&rseq_cs, urseq_cs, sizeof(rseq_cs)))
>> +		return false;
>> +	/*
>> +	 * We need to clear rseq_cs upon entry into a signal handler
>> +	 * nested on top of a rseq assembly block, so the signal handler
>> +	 * will not be fixed up if itself interrupted by a nested signal
>> +	 * handler or preempted.
> 
> This sentence does not parse.

Updating to:

        /*
         * The rseq_cs field is set to NULL on preemption or signal
         * delivery on top of rseq assembly block, as well as on top
         * of code outside of the rseq assembly block. This performs
         * a lazy clear of the rseq_cs field.
         *
         * Set rseq_cs to NULL with single-copy atomicity.
         */
        ptr = 0;
        ret = __put_user(ptr, &t->rseq->rseq_cs);
        if (ret)
                return ret;

All the discussion about not fixing up while executing a signal
handler was relevant with Paul Turner's original approach,
but now that we have rseq critical section descriptors that
contain the start_ip and post_commit_offset, guaranteeing that
the rseq_cs pointer is cleared before returning to a signal handler
is not relevant anymore.

I'll use __put_user rather than clear_user() in order to be consistent
with all other updates that provide single-copy atomicity guarantees.

> 
>> +	   We also need to clear rseq_cs if we
>> +	 * preempt or deliver a signal on top of code outside of the
>> +	 * rseq assembly block, to ensure that a following preemption or
>> +	 * signal delivery will not try to perform a fixup needlessly.
> 
> Please try to avoid the impersonation. We are not doing anything.

ok

> 
>> +	 */
>> +	if (clear_user(&t->rseq->rseq_cs, sizeof(t->rseq->rseq_cs)))
>> +		return false;
>> +	if (rseq_cs.version > 0)
>> +		return false;

I'll add this check to ensure that abort_ip is not within the
rseq c.s., which would be invalid:

        /* Ensure that abort_ip is not in the critical section. */
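        /*
         * Unsigned arithmetic: if abort_ip < start_ip, the subtraction
         * wraps to a large value, so such an abort_ip is correctly
         * treated as being outside the critical section.
         */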
        if (rseq_cs.abort_ip - rseq_cs.start_ip < rseq_cs.post_commit_offset)
                return false;

>> +	*cs_flags = rseq_cs.flags;
>> +	*start_ip = (void __user *)rseq_cs.start_ip;
>> +	*post_commit_offset = (unsigned long)rseq_cs.post_commit_offset;
>> +	*abort_ip = (void __user *)rseq_cs.abort_ip;
>> +	usig = (u32 __user *)(rseq_cs.abort_ip - sizeof(u32));
> 
> Is there no way to avoid this abundant type casting?  It's hard to find the
> code in the casts.

Following peterz' advice, I use unsigned long type now.

> 
>> +	if (get_user(sig, usig))
>> +		return false;
>> +	if (current->rseq_sig != sig) {
>> +		printk_ratelimited(KERN_WARNING
>> +			"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x
>> (pid=%d, addr=%p).\n",
>> +			sig, current->rseq_sig, current->pid, usig);
>> +		return false;
>> +	}
>> +	return true;
>> +}
>> +
>> +static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags)
>> +{
>> +	bool need_restart = false;
>> +	uint32_t flags;
>> +
>> +	/* Get thread flags. */
>> +	if (__get_user(flags, &t->rseq->flags))
>> +		return -EFAULT;
>> +
>> +	/* Take into account critical section flags. */
> 
> Take critical section flags into account. Please
> 

ok

>> +	flags |= cs_flags;
>> +
>> +	/*
>> +	 * Restart on signal can only be inhibited when restart on
>> +	 * preempt and restart on migrate are inhibited too. Otherwise,
>> +	 * a preempted signal handler could fail to restart the prior
>> +	 * execution context on sigreturn.
>> +	 */
>> +	if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) {
>> +		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
>> +			return -EINVAL;
>> +		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
>> +			return -EINVAL;
>> +	}
>> +	if (t->rseq_migrate
>> +			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
> 
>	if (t->rseq_migrate &&
>	    !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
> 
> please.
> 
>> +		need_restart = true;
>> +	else if (t->rseq_preempt
>> +			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
>> +		need_restart = true;
>> +	else if (t->rseq_signal
>> +			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL))
>> +		need_restart = true;
> 
> If you make all of these rseq_flags explicit bits in a u32 then you can
> just do a simple
> 
>     	if ((t->rseq_flags ^ flags) & t->rseq_flags)
> 
> and you can probably simplify the above checks as well.

I don't think xor is the operation we want here. But yes, using
masks tremendously simplifies the code. I did those changes
following peterz' feedback:

        event_mask = t->rseq_event_mask;
        t->rseq_event_mask = 0;
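        /* Clear the events whose restart is inhibited by the flags. */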
        event_mask &= ~flags;
        if (event_mask)
                return 1;
        return 0;
> 
>> +
>> +	t->rseq_preempt = false;
>> +	t->rseq_signal = false;
>> +	t->rseq_migrate = false;
> 
> This becomes a simple t->rseq_flags = 0;

yes.

> 
>> +	if (need_restart)
>> +		return 1;
>> +	return 0;
> 
> Why are you having a bool in the first place if you have to convert it into
> a integer return value at the end. Sure the compiler can optimize that
> away, but still...

Changed to an "int".

> 
>> +}
>> +
>> +static int rseq_ip_fixup(struct pt_regs *regs)
>> +{
>> +	struct task_struct *t = current;
>> +	void __user *start_ip = NULL;
>> +	unsigned long post_commit_offset = 0;
>> +	void __user *abort_ip = NULL;
>> +	uint32_t cs_flags = 0;
>> +	int ret;
>> +
>> +	ret = rseq_get_rseq_cs(t, &start_ip, &post_commit_offset, &abort_ip,
>> +			&cs_flags);
>> +	trace_rseq_ip_fixup((void __user *)instruction_pointer(regs),
>> +		start_ip, post_commit_offset, abort_ip, ret);
>> +	if (!ret)
>> +		return -EFAULT;
> 
> This boolean logic is really horrible.

Changed to "int", returning the callee return value.

> 
>> +	ret = rseq_need_restart(t, cs_flags);
>> +	if (ret < 0)
>> +		return -EFAULT;
> 
> Why can't you propagate ret?

done. It becomes:

        ret = rseq_need_restart(t, cs_flags);
        if (ret <= 0)
                return ret;


> 
>> +	if (!ret)
>> +		return 0;
>> +	/*
>> +	 * Handle potentially not being within a critical section.
>> +	 * Unsigned comparison will be true when
>> +	 * ip < start_ip (wrap-around to large values), and when
>> +	 * ip >= start_ip + post_commit_offset.
>> +	 */
>> +	if ((unsigned long)instruction_pointer(regs) - (unsigned long)start_ip
>> +			>= post_commit_offset)
> 
> Neither start_ip nor abort_ip need to be void __user * type. They are not
> accessed at all, So why not make them unsigned long and spare all the type
> cast mess here and in rseq_get_rseq_cs() ?

done as per peterz's feedback.

> 
>> +		return 1;
>> +
>> +	instruction_pointer_set(regs, (unsigned long)abort_ip);
>> +	return 1;
>> +}
>> +
>> +/*
>> + * This resume handler should always be executed between any of:
> 
> Should? Or must?

Yes, must.

> 
>> + * - preemption,
>> + * - signal delivery,
>> + * and return to user-space.
>> + *
>> +	if (current->rseq) {
>> +		/*
>> +		 * If rseq is already registered, check whether
>> +		 * the provided address differs from the prior
>> +		 * one.
>> +		 */
>> +		if (current->rseq != rseq
>> +				|| current->rseq_len != rseq_len)
> 
> Align as shown above please. Same for all other malformatted multi line
> conditionals.

done

> 
>> +			return -EINVAL;
>> +		if (current->rseq_sig != sig)
>> +			return -EPERM;
>> +		return -EBUSY;	/* Already registered. */
> 
> Please do not use tail comments. They disturb the reading flow.

ok. Will do:

                /* Already registered. */
                return -EBUSY; 

> 
>> +	} else {
>> +		/*
>> +		 * If there was no rseq previously registered,
>> +		 * we need to ensure the provided rseq is
> 
> s/we need to//  Like in changelogs. Describe it in imperative mood.

ok

> 
>> +		 * properly aligned and valid.
>> +		 */
>> +		if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq))
>> +				|| rseq_len != sizeof(*rseq))
>> +			return -EINVAL;
>> +		if (!access_ok(VERIFY_WRITE, rseq, rseq_len))
>> +			return -EFAULT;
>> +		current->rseq = rseq;
>> +		current->rseq_len = rseq_len;
>> +		current->rseq_sig = sig;
>> +		/*
>> +		 * If rseq was previously inactive, and has just been
>> +		 * registered, ensure the cpu_id_start and cpu_id fields
>> +		 * are updated before returning to user-space.
>> +		 */
>> +		rseq_set_notify_resume(current);
>> +	}
> 
> Thanks,

Thanks a lot for the thorough review Thomas !!

Mathieu


> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support
  2017-11-16 21:14       ` Thomas Gleixner
@ 2017-11-19 17:41         ` Mathieu Desnoyers
       [not found]           ` <1390396579.17843.1511113291117.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-19 17:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon

----- On Nov 16, 2017, at 4:14 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:

> On Tue, 14 Nov 2017, Mathieu Desnoyers wrote:
> 
> Please fix the subject line:
> 
> x86: Add support for restartable sequences
> 
> or something like that.
> 
> And for the actual rseq patches please come up with a proper short
> susbsytem prefix for restartable sequences. There is no point in occupying
> half of the subject space for a prefix.

ok. done.

> 
> Other than that.
> 
> Reviewed-by: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>

Thanks! Should I apply the reviewed-by only to the x86 patches,
or patch 01 (rseq: Introduce restartable sequences system call)
as well ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support
       [not found]           ` <1390396579.17843.1511113291117.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-20  8:38             ` Thomas Gleixner
  0 siblings, 0 replies; 80+ messages in thread
From: Thomas Gleixner @ 2017-11-20  8:38 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon

On Sun, 19 Nov 2017, Mathieu Desnoyers wrote:

> ----- On Nov 16, 2017, at 4:14 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
> 
> > On Tue, 14 Nov 2017, Mathieu Desnoyers wrote:
> > 
> > Please fix the subject line:
> > 
> > x86: Add support for restartable sequences
> > 
> > or something like that.
> > 
> > And for the actual rseq patches please come up with a proper short
> > susbsytem prefix for restartable sequences. There is no point in occupying
> > half of the subject space for a prefix.
> 
> ok. done.
> 
> > 
> > Other than that.
> > 
> > Reviewed-by: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> 
> Thanks! Should I apply the reviewed-by only to the x86 patches,
> or patch 01 (rseq: Introduce restartable sequences system call)
> as well ?

This one of course. I've certainly not yet reviewed your next version :)

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
  2017-11-16 23:26       ` Thomas Gleixner
  2017-11-17  0:14         ` Andi Kleen
@ 2017-11-20 16:13         ` Mathieu Desnoyers
       [not found]           ` <1766414702.18278.1511194398489.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-20 16:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon

----- On Nov 16, 2017, at 6:26 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:

> On Tue, 14 Nov 2017, Mathieu Desnoyers wrote:
>> +#ifdef __KERNEL__
>> +# include <linux/types.h>
>> +#else	/* #ifdef __KERNEL__ */
> 
>  		   Sigh.

fixed.

> 
>> +# include <stdint.h>
>> +#endif	/* #else #ifdef __KERNEL__ */
>> +
>> +#include <asm/byteorder.h>
>> +
>> +#ifdef __LP64__
>> +# define CPU_OP_FIELD_u32_u64(field)			uint64_t field
>> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)	field = (intptr_t)v
>> +#elif defined(__BYTE_ORDER) ? \
>> +	__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
>> +# define CPU_OP_FIELD_u32_u64(field)	uint32_t field ## _padding, field
>> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
>> +	field ## _padding = 0, field = (intptr_t)v
>> +#else
>> +# define CPU_OP_FIELD_u32_u64(field)	uint32_t field, field ## _padding
>> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
>> +	field = (intptr_t)v, field ## _padding = 0
>> +#endif
> 
> So in the rseq patch you have:
> 
> +#ifdef __LP64__
> +# define RSEQ_FIELD_u32_u64(field)                     uint64_t field
> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)     field = (intptr_t)v
> +#elif defined(__BYTE_ORDER) ? \
> +       __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> +# define RSEQ_FIELD_u32_u64(field)     uint32_t field ## _padding, field
> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)     \
> +       field ## _padding = 0, field = (intptr_t)v
> +#else
> +# define RSEQ_FIELD_u32_u64(field)     uint32_t field, field ## _padding
> +# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)     \
> +       field = (intptr_t)v, field ## _padding = 0
> +#endif
> 
> IOW the same macro maze. Please use a separate header file which provides
> these macros once and share them between the two facilities.

ok. I'll introduce uapi/linux/types_32_64.h, and prefix defines with
"LINUX_":

LINUX_FIELD_u32_u64()

unless other names are preferred. It will be in a separate patch.

> 
>> +#define CPU_OP_VEC_LEN_MAX		16
>> +#define CPU_OP_ARG_LEN_MAX		24
>> +/* Max. data len per operation. */
>> +#define CPU_OP_DATA_LEN_MAX		PAGE_SIZE
> 
> That's something between 4K and 256K depending on the architecture.
> 
> You really want to allow up to 256K data copy with preemption disabled?
> Shudder.

This is the max per operation. Following peterz' comments at KS, I added a
limit of 4096 + 15 * 8 on the sum of len for all operations in a vector. This
is defined below as CPU_OP_VEC_DATA_LEN_MAX.

So each of the 16 ops cannot have a len larger than PAGE_SIZE, so we
pin at most 4 pages per op (e.g. memcpy: 2 pages for src, 2 pages for dst),
*and* the sum of all op lengths needs to be <= 4216 bytes. So the max limit
you are interested in here is the 4216-byte limit.

> 
>> +/*
>> + * Max. data len for overall vector. We to restrict the amount of
> 
> We to ?

fixed

> 
>> + * user-space data touched by the kernel in non-preemptible context so
>> + * we do not introduce long scheduler latencies.
>> + * This allows one copy of up to 4096 bytes, and 15 operations touching
>> + * 8 bytes each.
>> + * This limit is applied to the sum of length specified for all
>> + * operations in a vector.
>> + */
>> +#define CPU_OP_VEC_DATA_LEN_MAX		(4096 + 15*8)
> 
> Magic numbers. Please use proper defines for heavens sake.

ok, it becomes:

#define CPU_OP_MEMCPY_EXPECT_LEN        4096
#define CPU_OP_EXPECT_LEN               8
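/* i.e. 4096 + 15 * 8 = 4216 bytes */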
#define CPU_OP_VEC_DATA_LEN_MAX         \
        (CPU_OP_MEMCPY_EXPECT_LEN +     \
         (CPU_OP_VEC_LEN_MAX - 1) * CPU_OP_EXPECT_LEN)

> 
>> +#define CPU_OP_MAX_PAGES		4	/* Max. pages per op. */

Actually, I'll move the CPU_OP_MAX_PAGES define to cpu_opv.c. It's not
needed in the uapi header.


>> +
>> +enum cpu_op_type {
>> +	CPU_COMPARE_EQ_OP,	/* compare */
>> +	CPU_COMPARE_NE_OP,	/* compare */
>> +	CPU_MEMCPY_OP,		/* memcpy */
>> +	CPU_ADD_OP,		/* arithmetic */
>> +	CPU_OR_OP,		/* bitwise */
>> +	CPU_AND_OP,		/* bitwise */
>> +	CPU_XOR_OP,		/* bitwise */
>> +	CPU_LSHIFT_OP,		/* shift */
>> +	CPU_RSHIFT_OP,		/* shift */
>> +	CPU_MB_OP,		/* memory barrier */
>> +};
>> +
>> +/* Vector of operations to perform. Limited to 16. */
>> +struct cpu_op {
>> +	int32_t op;	/* enum cpu_op_type. */
>> +	uint32_t len;	/* data length, in bytes. */
> 
> Please get rid of these tail comments

ok

> 
>> +	union {
>> +#define TMP_BUFLEN			64
>> +#define NR_PINNED_PAGES_ON_STACK	8
> 
> 8 pinned pages on stack? Which stack?

The common cases only need to touch a few pages, and the struct page
pointers can be kept in an array on the kernel stack within the cpu_opv
system call.

Updating to:

/*
 * Typical invocation of cpu_opv need few pages. Keep struct page
 * pointers in an array on the stack of the cpu_opv system call up to
 * this limit, beyond which the array is dynamically allocated.
 */
#define NR_PIN_PAGES_ON_STACK        8

> 
>> +/*
>> + * The cpu_opv system call executes a vector of operations on behalf of
>> + * user-space on a specific CPU with preemption disabled. It is inspired
>> + * from readv() and writev() system calls which take a "struct iovec"
> 
> s/from/by/

ok

> 
>> + * array as argument.
>> + *
>> + * The operations available are: comparison, memcpy, add, or, and, xor,
>> + * left shift, and right shift. The system call receives a CPU number
>> + * from user-space as argument, which is the CPU on which those
>> + * operations need to be performed. All preparation steps such as
>> + * loading pointers, and applying offsets to arrays, need to be
>> + * performed by user-space before invoking the system call. The
> 
> loading pointers and applying offsets? That makes no sense.

Updating to:

 * All preparation steps such as
 * loading base pointers, and adding offsets derived from the current
 * CPU number, need to be performed by user-space before invoking the
 * system call.

> 
>> + * "comparison" operation can be used to check that the data used in the
>> + * preparation step did not change between preparation of system call
>> + * inputs and operation execution within the preempt-off critical
>> + * section.
>> + *
>> + * The reason why we require all pointer offsets to be calculated by
>> + * user-space beforehand is because we need to use get_user_pages_fast()
>> + * to first pin all pages touched by each operation. This takes care of
> 
> That doesnt explain it either.

What kind of explanation are you looking for here? Perhaps being too close
to the implementation prevents me from seeing what is unclear from
your perspective.

> 
>> + * faulting-in the pages.  Then, preemption is disabled, and the
>> + * operations are performed atomically with respect to other thread
>> + * execution on that CPU, without generating any page fault.
>> + *
>> + * A maximum limit of 16 operations per cpu_opv syscall invocation is
>> + * enforced, and a overall maximum length sum, so user-space cannot
>> + * generate a too long preempt-off critical section. Each operation is
>> + * also limited a length of PAGE_SIZE bytes, meaning that an operation
>> + * can touch a maximum of 4 pages (memcpy: 2 pages for source, 2 pages
>> + * for destination if addresses are not aligned on page boundaries).
> 

Sorry, that paragraph was unclear. Updated:

 * An overall maximum of 4216 bytes is enforced on the sum of operation
 * lengths within an operation vector, so user-space cannot generate a
 * too long preempt-off critical section (cache-cold critical section
 * duration measured as 4.7µs on x86-64). Each operation is also limited
 * to a length of PAGE_SIZE bytes, meaning that an operation can touch a
 * maximum of 4 pages (memcpy: 2 pages for source, 2 pages for
 * destination if addresses are not aligned on page boundaries).

> What's the critical section duration for operations which go to the limits
> of this on a average x86 64 machine?

When cache-cold, I measure 4.7 µs per critical section doing a
4k memcpy plus 15 * 8 bytes of memcpy on an E5-2630 v3 @ 2.4GHz. Is that an
acceptable preempt-off latency for RT?

> 
>> + * If the thread is not running on the requested CPU, a new
>> + * push_task_to_cpu() is invoked to migrate the task to the requested
> 
> new push_task_to_cpu()? Once that patch is merged push_task_to_cpu() is
> hardly new.
> 
> Please refrain from putting function level details into comments which
> describe the concept. The function name might change in 3 month from now
> and the comment will stay stale, Its sufficient to say:
> 
> * If the thread is not running on the requested CPU it is migrated
> * to it.
> 
> That explains the concept. It's completely irrelevant which mechanism is
> used to achieve that.
> 
>> + * CPU.  If the requested CPU is not part of the cpus allowed mask of
>> + * the thread, the system call fails with EINVAL. After the migration
>> + * has been performed, preemption is disabled, and the current CPU
>> + * number is checked again and compared to the requested CPU number. If
>> + * it still differs, it means the scheduler migrated us away from that
>> + * CPU. Return EAGAIN to user-space in that case, and let user-space
>> + * retry (either requesting the same CPU number, or a different one,
>> + * depending on the user-space algorithm constraints).
> 
> This mixture of imperative and impersonated mood is really hard to read.

This whole paragraph is replaced by:

 * If the thread is not running on the requested CPU, it is migrated to
 * it. If the scheduler then migrates the task away from the requested CPU
 * before the critical section executes, return EAGAIN to user-space,
 * and let user-space retry (either requesting the same CPU number, or a
 * different one, depending on the user-space algorithm constraints).

> 
>> +/*
>> + * Check operation types and length parameters.
>> + */
>> +static int cpu_opv_check(struct cpu_op *cpuop, int cpuopcnt)
>> +{
>> +	int i;
>> +	uint32_t sum = 0;
>> +
>> +	for (i = 0; i < cpuopcnt; i++) {
>> +		struct cpu_op *op = &cpuop[i];
>> +
>> +		switch (op->op) {
>> +		case CPU_MB_OP:
>> +			break;
>> +		default:
>> +			sum += op->len;
>> +		}
> 
> Please separate the switch cases with an empty line.

ok

> 
>> +static unsigned long cpu_op_range_nr_pages(unsigned long addr,
>> +		unsigned long len)
> 
> Please align the arguments
> 
> static unsigned long cpu_op_range_nr_pages(unsigned long addr,
>					   unsigned long len)
> 
> is way simpler to parse. All over the place please.

ok

> 
>> +{
>> +	return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1;
> 
> I'm surprised that there is no existing magic for this.

populate_vma_page_range() does:

unsigned long nr_pages = (end - start) / PAGE_SIZE;

where "start" and "end" need to fall onto a page boundary. It does not
seem to be appropriate for cases where addr is not page aligned, and where
the length is smaller than a page.

I have not seen helpers for this either.
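
For example, with PAGE_SHIFT == 12, addr = 0x1fff and len = 2 (a two-byte
access straddling a page boundary):

        ((0x1fff + 2 - 1) >> 12) - (0x1fff >> 12) + 1
                = (0x2000 >> 12) - (0x1fff >> 12) + 1
                = 2 - 1 + 1
                = 2 pages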

> 
>> +}
>> +
>> +static int cpu_op_check_page(struct page *page)
>> +{
>> +	struct address_space *mapping;
>> +
>> +	if (is_zone_device_page(page))
>> +		return -EFAULT;
>> +	page = compound_head(page);
>> +	mapping = READ_ONCE(page->mapping);
>> +	if (!mapping) {
>> +		int shmem_swizzled;
>> +
>> +		/*
>> +		 * Check again with page lock held to guard against
>> +		 * memory pressure making shmem_writepage move the page
>> +		 * from filecache to swapcache.
>> +		 */
>> +		lock_page(page);
>> +		shmem_swizzled = PageSwapCache(page) || page->mapping;
>> +		unlock_page(page);
>> +		if (shmem_swizzled)
>> +			return -EAGAIN;
>> +		return -EFAULT;
>> +	}
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Refusing device pages, the zero page, pages in the gate area, and
>> + * special mappings. Inspired from futex.c checks.
> 
> That comment should be on the function above, because this loop does not
> much checking. Aside of that a more elaborate explanation how those checks
> are done and how that works would be appreciated.

OK. I also noticed through testing that I missed faulting in pages, similarly
to sys_futex(). I'll add it, and I'm also adding a test in the selftests
for this case.

I'll import comments from futex.c.

> 
>> + */
>> +static int cpu_op_check_pages(struct page **pages,
>> +		unsigned long nr_pages)
>> +{
>> +	unsigned long i;
>> +
>> +	for (i = 0; i < nr_pages; i++) {
>> +		int ret;
>> +
>> +		ret = cpu_op_check_page(pages[i]);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int cpu_op_pin_pages(unsigned long addr, unsigned long len,
>> +		struct cpu_opv_pinned_pages *pin_pages, int write)
>> +{
>> +	struct page *pages[2];
>> +	int ret, nr_pages;
>> +
>> +	if (!len)
>> +		return 0;
>> +	nr_pages = cpu_op_range_nr_pages(addr, len);
>> +	BUG_ON(nr_pages > 2);
> 
> If that happens then you can emit a warning and return a proper error
> code. BUG() is the last resort if there is no way to recover. This really
> does not qualify.

ok. will do:

        nr_pages = cpu_op_range_nr_pages(addr, len);
        if (nr_pages > 2) {
                WARN_ON(1);
                return -EINVAL;
        }

> 
>> +	if (!pin_pages->is_kmalloc && pin_pages->nr + nr_pages
>> +			> NR_PINNED_PAGES_ON_STACK) {
> 
> Now I see what this is used for. That's a complete misnomer.
> 
> And this check is of course completely self explaining..... NOT!
> 
>> +		struct page **pinned_pages =
>> +			kzalloc(CPU_OP_VEC_LEN_MAX * CPU_OP_MAX_PAGES
>> +				* sizeof(struct page *), GFP_KERNEL);
>> +		if (!pinned_pages)
>> +			return -ENOMEM;
>> +		memcpy(pinned_pages, pin_pages->pages,
>> +			pin_pages->nr * sizeof(struct page *));
>> +		pin_pages->pages = pinned_pages;
>> +		pin_pages->is_kmalloc = true;
> 
> I have no idea why this needs to be done here and cannot be done in a
> preparation step. That's horrible. You allocate conditionally at some
> random place and then free at the end of the syscall.
> 
> What's wrong with:
> 
>       prepare_stuff()
>       pin_pages()
>       do_ops()
>       cleanup_stuff()
> 
> Hmm?

Will do.

> 
>> +	}
>> +again:
>> +	ret = get_user_pages_fast(addr, nr_pages, write, pages);
>> +	if (ret < nr_pages) {
>> +		if (ret > 0)
>> +			put_page(pages[0]);
>> +		return -EFAULT;
>> +	}
>> +	/*
>> +	 * Refuse device pages, the zero page, pages in the gate area,
>> +	 * and special mappings.
> 
> So the same comment again. Really helpful.

I'll remove this duplicated comment.

> 
>> +	 */
>> +	ret = cpu_op_check_pages(pages, nr_pages);
>> +	if (ret == -EAGAIN) {
>> +		put_page(pages[0]);
>> +		if (nr_pages > 1)
>> +			put_page(pages[1]);
>> +		goto again;
>> +	}
> 
> So why can't you propagate EAGAIN to the caller and use the error cleanup
> label?

Because it needs to retry immediately in case the page has been faulted in,
or is being swapped in.

> Or put the sequence of get_user_pages_fast() and check_pages() into
> one function and confine the mess there instead of having the same cleanup
> sequence 3 times in this function.

I'll merge all this into a single error path.

> 
>> +	if (ret)
>> +		goto error;
>> +	pin_pages->pages[pin_pages->nr++] = pages[0];
>> +	if (nr_pages > 1)
>> +		pin_pages->pages[pin_pages->nr++] = pages[1];
>> +	return 0;
>> +
>> +error:
>> +	put_page(pages[0]);
>> +	if (nr_pages > 1)
>> +		put_page(pages[1]);
>> +	return -EFAULT;
>> +}

Updated function:

static int cpu_op_pin_pages(unsigned long addr, unsigned long len,
                            struct cpu_opv_pin_pages *pin_pages,
                            int write)
{
        struct page *pages[2];
        int ret, nr_pages, nr_put_pages, n;

        nr_pages = cpu_op_count_pages(addr, len);
        if (!nr_pages)
                return 0;
again:
        ret = get_user_pages_fast(addr, nr_pages, write, pages);
        if (ret < nr_pages) {
                if (ret >= 0) {
                        nr_put_pages = ret;
                        ret = -EFAULT;
                } else {
                        nr_put_pages = 0;
                }
                goto error;
        }
        ret = cpu_op_check_pages(pages, nr_pages, addr);
        if (ret) {
                nr_put_pages = nr_pages;
                goto error;
        }
        for (n = 0; n < nr_pages; n++)
                pin_pages->pages[pin_pages->nr++] = pages[n];
        return 0;

error:
        for (n = 0; n < nr_put_pages; n++)
                put_page(pages[n]);
        /*
         * Retry if a page has been faulted in, or is being swapped in.
         */
        if (ret == -EAGAIN)
                goto again;
        return ret;
}


>> +
>> +static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt,
>> +		struct cpu_opv_pinned_pages *pin_pages)
>> +{
>> +	int ret, i;
>> +	bool expect_fault = false;
>> +
>> +	/* Check access, pin pages. */
>> +	for (i = 0; i < cpuopcnt; i++) {
>> +		struct cpu_op *op = &cpuop[i];
>> +
>> +		switch (op->op) {
>> +		case CPU_COMPARE_EQ_OP:
>> +		case CPU_COMPARE_NE_OP:
>> +			ret = -EFAULT;
>> +			expect_fault = op->u.compare_op.expect_fault_a;
>> +			if (!access_ok(VERIFY_READ,
>> +					(void __user *)op->u.compare_op.a,
>> +					op->len))
>> +				goto error;
>> +			ret = cpu_op_pin_pages(
>> +					(unsigned long)op->u.compare_op.a,
>> +					op->len, pin_pages, 0);
> 
> Bah, this sucks. Moving the switch() into a separate function spares you
> one indentation level and all these horrible to read line breaks.

done

> 
> And again I really have to ask why all of this stuff needs to be type
> casted for every invocation. If you use the proper type for the argument
> and then do the cast at the function entry then you can spare all that hard
> to read crap.

fixed for cpu_op_pin_pages. I don't control the type expected by access_ok()
though.


> 
>> +error:
>> +	for (i = 0; i < pin_pages->nr; i++)
>> +		put_page(pin_pages->pages[i]);
>> +	pin_pages->nr = 0;
> 
> Sigh. Why can't you do that at the call site where you have exactly the
> same thing?

Good point. fixed.

> 
>> +	/*
>> +	 * If faulting access is expected, return EAGAIN to user-space.
>> +	 * It allows user-space to distinguish a fault caused by
>> +	 * an access which is expected to fault (e.g. due to concurrent
>> +	 * unmapping of underlying memory) from an unexpected fault from
>> +	 * which a retry would not recover.
>> +	 */
>> +	if (ret == -EFAULT && expect_fault)
>> +		return -EAGAIN;
>> +	return ret;
>> +}
>> +
>> +/* Return 0 if same, > 0 if different, < 0 on error. */
>> +static int do_cpu_op_compare_iter(void __user *a, void __user *b, uint32_t len)
>> +{
>> +	char bufa[TMP_BUFLEN], bufb[TMP_BUFLEN];
>> +	uint32_t compared = 0;
>> +
>> +	while (compared != len) {
>> +		unsigned long to_compare;
>> +
>> +		to_compare = min_t(uint32_t, TMP_BUFLEN, len - compared);
>> +		if (__copy_from_user_inatomic(bufa, a + compared, to_compare))
>> +			return -EFAULT;
>> +		if (__copy_from_user_inatomic(bufb, b + compared, to_compare))
>> +			return -EFAULT;
>> +		if (memcmp(bufa, bufb, to_compare))
>> +			return 1;	/* different */
> 
> These tail comments are really crap. It's entirely obvious that if memcmp
> != 0 the result is different. So what is the exact value aside of making it
> hard to read?

Removed.


> 
>> +		compared += to_compare;
>> +	}
>> +	return 0;	/* same */

Ditto.


>> +}
>> +
>> +/* Return 0 if same, > 0 if different, < 0 on error. */
>> +static int do_cpu_op_compare(void __user *a, void __user *b, uint32_t len)
>> +{
>> +	int ret = -EFAULT;
>> +	union {
>> +		uint8_t _u8;
>> +		uint16_t _u16;
>> +		uint32_t _u32;
>> +		uint64_t _u64;
>> +#if (BITS_PER_LONG < 64)
>> +		uint32_t _u64_split[2];
>> +#endif
>> +	} tmp[2];
> 
> I've seen the same union before
> 
>> +union op_fn_data {
> 
> ......

Ah, yes. It's already declared! I should indeed use it :)

> 
>> +
>> +	pagefault_disable();
>> +	switch (len) {
>> +	case 1:
>> +		if (__get_user(tmp[0]._u8, (uint8_t __user *)a))
>> +			goto end;
>> +		if (__get_user(tmp[1]._u8, (uint8_t __user *)b))
>> +			goto end;
>> +		ret = !!(tmp[0]._u8 != tmp[1]._u8);
>> +		break;
>> +	case 2:
>> +		if (__get_user(tmp[0]._u16, (uint16_t __user *)a))
>> +			goto end;
>> +		if (__get_user(tmp[1]._u16, (uint16_t __user *)b))
>> +			goto end;
>> +		ret = !!(tmp[0]._u16 != tmp[1]._u16);
>> +		break;
>> +	case 4:
>> +		if (__get_user(tmp[0]._u32, (uint32_t __user *)a))
>> +			goto end;
>> +		if (__get_user(tmp[1]._u32, (uint32_t __user *)b))
>> +			goto end;
>> +		ret = !!(tmp[0]._u32 != tmp[1]._u32);
>> +		break;
>> +	case 8:
>> +#if (BITS_PER_LONG >= 64)
> 
> We alredy prepare for 128 bit?

== it is then ;)

> 
>> +		if (__get_user(tmp[0]._u64, (uint64_t __user *)a))
>> +			goto end;
>> +		if (__get_user(tmp[1]._u64, (uint64_t __user *)b))
>> +			goto end;
>> +#else
>> +		if (__get_user(tmp[0]._u64_split[0], (uint32_t __user *)a))
>> +			goto end;
>> +		if (__get_user(tmp[0]._u64_split[1], (uint32_t __user *)a + 1))
>> +			goto end;
>> +		if (__get_user(tmp[1]._u64_split[0], (uint32_t __user *)b))
>> +			goto end;
>> +		if (__get_user(tmp[1]._u64_split[1], (uint32_t __user *)b + 1))
>> +			goto end;
>> +#endif
>> +		ret = !!(tmp[0]._u64 != tmp[1]._u64);
> 
> This really sucks.
> 
>        union foo va, vb;
> 
>	pagefault_disable();
>	switch (len) {
>	case 1:
>	case 2:
>	case 4:
>	case 8:
>		va._u64 = _vb._u64 = 0;
>		if (op_get_user(&va, a, len))
>			goto out;
>		if (op_get_user(&vb, b, len))
>			goto out;
>		ret = !!(va._u64 != vb._u64);
>		break;
>	default:
>		...
> 
> and have
> 
> int op_get_user(union foo *val, void *p, int len)
> {
>	switch (len) {
>	case 1:
>	     ......
> 
> And do the magic once in that function then you spare that copy and pasted
> code above. It can be reused in the other ops as well and reduces the amount
> of copy and paste code significantly.

Good point! done.
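
For reference, the helper ends up looking roughly like this (a sketch; it
reuses the op_fn_data union and mirrors the 32-bit split-load handling above,
and must be called with page faults disabled like its callers):

static int op_get_user(union op_fn_data *data, void __user *p, int len)
{
	switch (len) {
	case 1:
		if (__get_user(data->_u8, (uint8_t __user *)p))
			return -EFAULT;
		break;
	case 2:
		if (__get_user(data->_u16, (uint16_t __user *)p))
			return -EFAULT;
		break;
	case 4:
		if (__get_user(data->_u32, (uint32_t __user *)p))
			return -EFAULT;
		break;
	case 8:
#if (BITS_PER_LONG == 64)
		if (__get_user(data->_u64, (uint64_t __user *)p))
			return -EFAULT;
#else
		if (__get_user(data->_u64_split[0], (uint32_t __user *)p))
			return -EFAULT;
		if (__get_user(data->_u64_split[1], (uint32_t __user *)p + 1))
			return -EFAULT;
#endif
		break;
	default:
		return -EINVAL;
	}
	return 0;
}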

> 
>> +		break;
>> +	default:
>> +		pagefault_enable();
>> +		return do_cpu_op_compare_iter(a, b, len);
>> +	}
>> +end:
>> +	pagefault_enable();
>> +	return ret;
>> +}
> 
>> +static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt)
>> +{
>> +	int i, ret;
>> +
>> +	for (i = 0; i < cpuopcnt; i++) {
>> +		struct cpu_op *op = &cpuop[i];
>> +
>> +		/* Guarantee a compiler barrier between each operation. */
>> +		barrier();
>> +
>> +		switch (op->op) {
>> +		case CPU_COMPARE_EQ_OP:
>> +			ret = do_cpu_op_compare(
>> +					(void __user *)op->u.compare_op.a,
>> +					(void __user *)op->u.compare_op.b,
>> +					op->len);
> 
> I think you know by now how to spare an indentation level and type casts.

done

> 
>> +static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt, int cpu)
>> +{
>> +	int ret;
>> +
>> +	if (cpu != raw_smp_processor_id()) {
>> +		ret = push_task_to_cpu(current, cpu);
>> +		if (ret)
>> +			goto check_online;
>> +	}
>> +	preempt_disable();
>> +	if (cpu != smp_processor_id()) {
>> +		ret = -EAGAIN;
> 
> This is weird. When the above raw_smp_processor_id() check fails you push,
> but here you return. Not really consistent behaviour.

Good point. We could re-try internally rather than let user-space
deal with an EAGAIN. It will make the error checking easier in
user-space.
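
Sketch of what that internal retry could look like (the offline-CPU handling
is kept as in the current patch, except that it retries if the CPU came back
online in the meantime):

static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt, int cpu)
{
	int ret;

retry:
	if (cpu != raw_smp_processor_id()) {
		ret = push_task_to_cpu(current, cpu);
		if (ret)
			goto check_online;
	}
	preempt_disable();
	if (cpu != smp_processor_id()) {
		/* Migrated again before preemption got disabled: retry. */
		preempt_enable();
		goto retry;
	}
	ret = __do_cpu_opv(cpuop, cpuopcnt);
	preempt_enable();
	return ret;

check_online:
	if (!cpu_possible(cpu))
		return -EINVAL;
	get_online_cpus();
	if (cpu_online(cpu)) {
		/* The CPU came back online: retry on that CPU. */
		put_online_cpus();
		goto retry;
	}
	/*
	 * CPU is offline: perform the operations from the current CPU,
	 * with the hotplug read lock held to keep it offline, and the
	 * mutex providing mutual exclusion against other callers also
	 * seeing that CPU offline.
	 */
	mutex_lock(&cpu_opv_offline_lock);
	ret = __do_cpu_opv(cpuop, cpuopcnt);
	mutex_unlock(&cpu_opv_offline_lock);
	put_online_cpus();
	return ret;
}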

> 
>> +		goto end;
>> +	}
>> +	ret = __do_cpu_opv(cpuop, cpuopcnt);
>> +end:
>> +	preempt_enable();
>> +	return ret;
>> +
>> +check_online:
>> +	if (!cpu_possible(cpu))
>> +		return -EINVAL;
>> +	get_online_cpus();
>> +	if (cpu_online(cpu)) {
>> +		ret = -EAGAIN;
>> +		goto put_online_cpus;
>> +	}
>> +	/*
>> +	 * CPU is offline. Perform operation from the current CPU with
>> +	 * cpu_online read lock held, preventing that CPU from coming online,
>> +	 * and with mutex held, providing mutual exclusion against other
>> +	 * CPUs also finding out about an offline CPU.
>> +	 */
> 
> That's not mentioned in the comment at the top IIRC.

Updated.

> 
>> +	mutex_lock(&cpu_opv_offline_lock);
>> +	ret = __do_cpu_opv(cpuop, cpuopcnt);
>> +	mutex_unlock(&cpu_opv_offline_lock);
>> +put_online_cpus:
>> +	put_online_cpus();
>> +	return ret;
>> +}
>> +
>> +/*
>> + * cpu_opv - execute operation vector on a given CPU with preempt off.
>> + *
>> + * Userspace should pass current CPU number as parameter. May fail with
>> + * -EAGAIN if currently executing on the wrong CPU.
>> + */
>> +SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
>> +		int, cpu, int, flags)
>> +{
>> +	struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX];
>> +	struct page *pinned_pages_on_stack[NR_PINNED_PAGES_ON_STACK];
> 
> Oh well.... Naming sucks.
> 
>> +	struct cpu_opv_pinned_pages pin_pages = {
>> +		.pages = pinned_pages_on_stack,
>> +		.nr = 0,
>> +		.is_kmalloc = false,
>> +	};
>> +	int ret, i;
>> +
>> +	if (unlikely(flags))
>> +		return -EINVAL;
>> +	if (unlikely(cpu < 0))
>> +		return -EINVAL;
>> +	if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX)
>> +		return -EINVAL;
>> +	if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op)))
>> +		return -EFAULT;
>> +	ret = cpu_opv_check(cpuopv, cpuopcnt);
> 
> AFAICT you can calculate the number of pages already in the check and then
> do that allocation before pinning the pages.

will do.
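
Sketch of how the check pass could return the page count (assuming
cpu_opv_check() grows an output parameter; only the compare case is shown,
the other op types accumulate their counts the same way):

static int cpu_opv_check(struct cpu_op *cpuopv, int cpuopcnt,
			 unsigned long *nr_pages_total)
{
	unsigned long nr_pages = 0;
	int i;

	for (i = 0; i < cpuopcnt; i++) {
		struct cpu_op *op = &cpuopv[i];

		/* ... existing per-op validation of op->op and op->len ... */
		switch (op->op) {
		case CPU_COMPARE_EQ_OP:
		case CPU_COMPARE_NE_OP:
			nr_pages += cpu_op_count_pages(
					(unsigned long)op->u.compare_op.a,
					op->len);
			nr_pages += cpu_op_count_pages(
					(unsigned long)op->u.compare_op.b,
					op->len);
			break;
		/* ... other op types ... */
		}
	}
	*nr_pages_total = nr_pages;
	return 0;
}

The syscall can then kzalloc the page-pointer array (when it does not fit on
the stack) right after the check, before any page is pinned.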

> 
>> +	if (ret)
>> +		return ret;
>> +	ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &pin_pages);
>> +	if (ret)
>> +		goto end;
>> +	ret = do_cpu_opv(cpuopv, cpuopcnt, cpu);
>> +	for (i = 0; i < pin_pages.nr; i++)
>> +		put_page(pin_pages.pages[i]);
>> +end:
>> +	if (pin_pages.is_kmalloc)
>> +		kfree(pin_pages.pages);
>> +	return ret;
>> +}
> 
> 
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 6bba05f47e51..e547f93a46c2 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1052,6 +1052,43 @@ void do_set_cpus_allowed(struct task_struct *p, const
>> struct cpumask *new_mask)
>>  		set_curr_task(rq, p);
>>  }
> 
> This is NOT part of this functionality. It's a prerequisite and wants to be
> in a separate patch. And I'm dead tired by now so I leave that thing for
> tomorrow or for Peter.

I'll split that part into a separate patch.

Thanks!

Mathieu


> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
  2017-11-17 20:22                 ` Thomas Gleixner
@ 2017-11-20 17:13                   ` Mathieu Desnoyers
  0 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-20 17:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Andy Lutomirski, Dave Watson, linux-kernel, linux-api,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andrew Hunter, Chris Lameter, Ben Maurer,
	rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas, Will

----- On Nov 17, 2017, at 3:22 PM, Thomas Gleixner tglx@linutronix.de wrote:

> On Fri, 17 Nov 2017, Mathieu Desnoyers wrote:
>> ----- On Nov 17, 2017, at 5:09 AM, Thomas Gleixner tglx@linutronix.de wrote:
>> 7) Allow libraries with multi-part algorithms to work on same per-cpu
>>    data without affecting the allowed cpu mask
>> 
>> I stumbled on an interesting use-case within the lttng-ust tracer
>> per-cpu buffers: the algorithm needs to update a "reserve" counter,
>> serialize data into the buffer, and then update a "commit" counter
>> _on the same per-cpu buffer_. My goal is to use rseq for both reserve
>> and commit.
>> 
>> Clearly, if rseq reserve fails, the algorithm can retry on a different
>> per-cpu buffer. However, it's not that easy for the commit. It needs to
>> be performed on the same per-cpu buffer as the reserve.
>> 
>> The cpu_opv system call solves that problem by receiving the cpu number
>> on which the operation needs to be performed as argument. It can push
>> the task to the right CPU if needed, and perform the operations there
>> with preemption disabled.
> 
> If your transaction cannot be done in one go, then abusing that byte code
> interpreter for concluding it is just hilarious. That whole exercise is a
> gazillion times slower than the atomic operations which are necessary to do
> it without all that.
> 
> I'm even more convinced now that this is overengineered beyond repair.

The fast-path (typical case) will execute on the right CPU, and rseq will
do both the reserve and the commit, and that is faster than atomic ops.

However, we need to handle migration between reserve and commit.
Unfortunately, concurrent rseq and atomic ops don't mix well on the
same per-cpu data, so we cannot fall-back on atomic ops, except in
very specific cases where we can use a split-counter strategy.

So cpu_opv handles migration for this use-case by ensuring
the slow-path is performed with preemption-off after migrating
the current task to the right CPU.

Thanks,

Mathieu


> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
       [not found]           ` <1766414702.18278.1511194398489.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-20 17:48             ` Thomas Gleixner
  2017-11-20 18:03               ` Thomas Gleixner
  2017-11-20 18:39               ` Mathieu Desnoyers
  0 siblings, 2 replies; 80+ messages in thread
From: Thomas Gleixner @ 2017-11-20 17:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon


On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
> ----- On Nov 16, 2017, at 6:26 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
> >> +#define NR_PINNED_PAGES_ON_STACK	8
> > 
> > 8 pinned pages on stack? Which stack?
> 
> The common cases need to touch few pages, and we can keep the
> pointers in an array on the kernel stack within the cpu_opv system
> call.
> 
> Updating to:
> 
> /*
>  * A typical invocation of cpu_opv needs few pages. Keep struct page
>  * pointers in an array on the stack of the cpu_opv system call up to
>  * this limit, beyond which the array is dynamically allocated.
>  */
> #define NR_PIN_PAGES_ON_STACK        8

That name still sucks. NR_PAGE_PTRS_ON_STACK would be immediately obvious.

> >> + * The operations available are: comparison, memcpy, add, or, and, xor,
> >> + * left shift, and right shift. The system call receives a CPU number
> >> + * from user-space as argument, which is the CPU on which those
> >> + * operations need to be performed. All preparation steps such as
> >> + * loading pointers, and applying offsets to arrays, need to be
> >> + * performed by user-space before invoking the system call. The
> > 
> > loading pointers and applying offsets? That makes no sense.
> 
> Updating to:
> 
>  * All preparation steps such as
>  * loading base pointers, and adding offsets derived from the current
>  * CPU number, need to be performed by user-space before invoking the
>  * system call.

This still does not explain anything, really.

Which base pointer is loaded?  I nowhere see a reference to a base
pointer.

And what are the offsets about?

derived from current cpu number? What is current CPU number? The one on
which the task executes now or the one which it should execute on?

I assume what you want to say is:

  All pointers in the ops must have been set up to point to the per CPU
  memory of the CPU on which the operations should be executed.

At least that's what I oracle in to that.

> >> + * "comparison" operation can be used to check that the data used in the
> >> + * preparation step did not change between preparation of system call
> >> + * inputs and operation execution within the preempt-off critical
> >> + * section.
> >> + *
> >> + * The reason why we require all pointer offsets to be calculated by
> >> + * user-space beforehand is because we need to use get_user_pages_fast()
> >> + * to first pin all pages touched by each operation. This takes care of
> > 
> > That doesnt explain it either.
> 
> What kind of explication are you looking for here ? Perhaps being too close
> to the implementation prevents me from understanding what is unclear from
> your perspective.

What the heck are pointer offsets?

The ops have one or two pointer(s) to a lump of memory. So if a pointer
points to the wrong lump of memory then you're screwed, but that's true for
all pointers handed to the kernel.

> Sorry, that paragraph was unclear. Updated:
> 
>  * An overall maximum of 4216 bytes is enforced on the sum of operation
>  * length within an operation vector, so user-space cannot generate a
>  * too long preempt-off critical section (cache cold critical section
>  * duration measured as 4.7µs on x86-64). Each operation is also limited
>  * to a length of PAGE_SIZE bytes,

Again PAGE_SIZE is the wrong unit here. PAGE_SIZE can vary. What you want
> is a hard limit of 4K. And because there is no alignment requirement the
rest of the sentence is stating the obvious.

>  * meaning that an operation can touch a
>  * maximum of 4 pages (memcpy: 2 pages for source, 2 pages for
>  * destination if addresses are not aligned on page boundaries).

I still have to understand why the 4K copy is necessary in the first place.

> > What's the critical section duration for operations which go to the limits
> > of this on a average x86 64 machine?
> 
> When cache-cold, I measure 4.7 µs per critical section doing a
> 4k memcpy and 15 * 8 bytes memcpy on an E5-2630 v3 @2.4GHz. Is it an
> acceptable preempt-off latency for RT?

Depends on the use case as always ....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
  2017-11-20 17:48             ` Thomas Gleixner
@ 2017-11-20 18:03               ` Thomas Gleixner
  2017-11-20 18:42                 ` Mathieu Desnoyers
  2017-11-20 18:39               ` Mathieu Desnoyers
  1 sibling, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2017-11-20 18:03 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon

On Mon, 20 Nov 2017, Thomas Gleixner wrote:
> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
> > >> + * The reason why we require all pointer offsets to be calculated by
> > >> + * user-space beforehand is because we need to use get_user_pages_fast()
> > >> + * to first pin all pages touched by each operation. This takes care of
> > > 
> > > That doesnt explain it either.
> > 
> > What kind of explication are you looking for here ? Perhaps being too close
> > to the implementation prevents me from understanding what is unclear from
> > your perspective.
> 
> What the heck are pointer offsets?
> 
> The ops have one or two pointer(s) to a lump of memory. So if a pointer
> points to the wrong lump of memory then you're screwed, but that's true for
> all pointers handed to the kernel.

So I think you mix here the 'user space programmer guide - how to set up
the magic ops - into the kernel side documentation. The kernel side does
not care about the pointers, as long as they are valid to access.

Again. Try to explain things at the conceptual level.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
  2017-11-20 17:48             ` Thomas Gleixner
  2017-11-20 18:03               ` Thomas Gleixner
@ 2017-11-20 18:39               ` Mathieu Desnoyers
       [not found]                 ` <204285712.18480.1511203151076.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-20 18:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon

----- On Nov 20, 2017, at 12:48 PM, Thomas Gleixner tglx@linutronix.de wrote:

> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
>> ----- On Nov 16, 2017, at 6:26 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> >> +#define NR_PINNED_PAGES_ON_STACK	8
>> > 
>> > 8 pinned pages on stack? Which stack?
>> 
>> The common cases need to touch few pages, and we can keep the
>> pointers in an array on the kernel stack within the cpu_opv system
>> call.
>> 
>> Updating to:
>> 
>> /*
>>  * A typical invocation of cpu_opv needs few pages. Keep struct page
>>  * pointers in an array on the stack of the cpu_opv system call up to
>>  * this limit, beyond which the array is dynamically allocated.
>>  */
>> #define NR_PIN_PAGES_ON_STACK        8
> 
> That name still sucks. NR_PAGE_PTRS_ON_STACK would be immediately obvious.

fixed.

> 
>> >> + * The operations available are: comparison, memcpy, add, or, and, xor,
>> >> + * left shift, and right shift. The system call receives a CPU number
>> >> + * from user-space as argument, which is the CPU on which those
>> >> + * operations need to be performed. All preparation steps such as
>> >> + * loading pointers, and applying offsets to arrays, need to be
>> >> + * performed by user-space before invoking the system call. The
>> > 
>> > loading pointers and applying offsets? That makes no sense.
>> 
>> Updating to:
>> 
>>  * All preparation steps such as
>>  * loading base pointers, and adding offsets derived from the current
>>  * CPU number, need to be performed by user-space before invoking the
>>  * system call.
> 
> This still does not explain anything, really.
> 
> Which base pointer is loaded?  I nowhere see a reference to a base
> pointer.
> 
> And what are the offsets about?
> 
> derived from current cpu number? What is current CPU number? The one on
> which the task executes now or the one which it should execute on?
> 
> I assume what you want to say is:
> 
>  All pointers in the ops must have been set up to point to the per CPU
>  memory of the CPU on which the operations should be executed.
> 
> At least that's what I oracle in to that.

Exactly that. Will update to use this description instead.

> 
>> >> + * "comparison" operation can be used to check that the data used in the
>> >> + * preparation step did not change between preparation of system call
>> >> + * inputs and operation execution within the preempt-off critical
>> >> + * section.
>> >> + *
>> >> + * The reason why we require all pointer offsets to be calculated by
>> >> + * user-space beforehand is because we need to use get_user_pages_fast()
>> >> + * to first pin all pages touched by each operation. This takes care of
>> > 
>> > That doesnt explain it either.
>> 
>> What kind of explication are you looking for here ? Perhaps being too close
>> to the implementation prevents me from understanding what is unclear from
>> your perspective.
> 
> What the heck are pointer offsets?
> 
> The ops have one or two pointer(s) to a lump of memory. So if a pointer
> points to the wrong lump of memory then you're screwed, but that's true for
> all pointers handed to the kernel.

I think the sentence you suggested above is clear enough. I'll simply use
it.

> 
>> Sorry, that paragraph was unclear. Updated:
>> 
>>  * An overall maximum of 4216 bytes is enforced on the sum of operation
>>  * length within an operation vector, so user-space cannot generate a
>>  * too long preempt-off critical section (cache cold critical section
>>  * duration measured as 4.7µs on x86-64). Each operation is also limited
>>  * to a length of PAGE_SIZE bytes,
> 
> Again PAGE_SIZE is the wrong unit here. PAGE_SIZE can vary. What you want
> is a hard limit of 4K. And because there is no alignment requirement the
> rest of the sentence is stating the obvious.

I can make that a 4K limit if you prefer. This presumes that no architecture
has pages smaller than 4K, which is true on Linux.

> 
>>  * meaning that an operation can touch a
>>  * maximum of 4 pages (memcpy: 2 pages for source, 2 pages for
>>  * destination if addresses are not aligned on page boundaries).
> 
> I still have to understand why the 4K copy is necessary in the first place.
> 
>> > What's the critical section duration for operations which go to the limits
>> > of this on a average x86 64 machine?
>> 
>> When cache-cold, I measure 4.7 µs per critical section doing a
>> 4k memcpy and 15 * 8 bytes memcpy on an E5-2630 v3 @2.4GHz. Is it an
>> acceptable preempt-off latency for RT?
> 
> Depends on the use case as always ....

The use-case for 4k memcpy operation is a per-cpu ring buffer where
the rseq fast-path does the following:

- ring buffer push: in the rseq asm instruction sequence, a memcpy of a
  given structure (limited to 4k in size) into a ring buffer,
  followed by the final commit instruction which increments the current
  position offset by the number of bytes pushed.

- ring buffer pop: in the rseq asm instruction sequence, a memcpy of
  a given structure (up to 4k) from the ring buffer, at "position" offset.
  The final commit instruction decrements the current position offset by
  the number of bytes pop'd.

Having cpu_opv do a 4k memcpy allows it to handle scenarios where
rseq fails to progress.
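
For illustration, the push fallback submits an operation vector shaped like
this (a sketch only: "struct percpu_buf", the CPU_MEMCPY_OP/CPU_ADD_OP names
and the union members below are placeholders, not necessarily the exact uapi
of this patch):

static int buf_push_slowpath(struct percpu_buf *buf, void *record,
			     size_t record_len, intptr_t old_offset, int cpu)
{
	struct cpu_op ops[3] = {
		{	/* Abort (-EAGAIN) if the offset moved since it was read. */
			.op = CPU_COMPARE_EQ_OP,
			.len = sizeof(intptr_t),
			.u.compare_op = {
				.a = (unsigned long)&buf->offset,
				.b = (unsigned long)&old_offset,
			},
		},
		{	/* Copy the record into the per-cpu buffer (up to 4 kB). */
			.op = CPU_MEMCPY_OP,		/* placeholder name */
			.len = record_len,
			.u.memcpy_op = {		/* placeholder member */
				.dst = (unsigned long)(buf->data + old_offset),
				.src = (unsigned long)record,
			},
		},
		{	/* Commit: advance the offset by the pushed length. */
			.op = CPU_ADD_OP,		/* placeholder name */
			.len = sizeof(intptr_t),
			.u.arithmetic_op = {		/* placeholder member */
				.p = (unsigned long)&buf->offset,
				.count = record_len,
			},
		},
	};

	return syscall(__NR_cpu_opv, ops, 3, cpu, 0);
}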

Thanks,

Mathieu



> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
  2017-11-20 18:03               ` Thomas Gleixner
@ 2017-11-20 18:42                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-20 18:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon

----- On Nov 20, 2017, at 1:03 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:

> On Mon, 20 Nov 2017, Thomas Gleixner wrote:
>> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
>> > >> + * The reason why we require all pointer offsets to be calculated by
>> > >> + * user-space beforehand is because we need to use get_user_pages_fast()
>> > >> + * to first pin all pages touched by each operation. This takes care of
>> > > 
>> > > That doesnt explain it either.
>> > 
>> > What kind of explication are you looking for here ? Perhaps being too close
>> > to the implementation prevents me from understanding what is unclear from
>> > your perspective.
>> 
>> What the heck are pointer offsets?
>> 
>> The ops have one or two pointer(s) to a lump of memory. So if a pointer
>> points to the wrong lump of memory then you're screwed, but that's true for
>> all pointers handed to the kernel.
> 
> So I think you mix here the 'user space programmer guide - how to set up
> the magic ops - into the kernel side documentation. The kernel side does
> not care about the pointers, as long as they are valid to access.
> 
> Again. Try to explain things at the conceptual level.

Yes, I took the sentence you suggested for that reason: it removes
usage details that are meant for user-space implementer, which do not
belong in those comments.

Thanks,

Mathieu

> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
       [not found]                 ` <204285712.18480.1511203151076.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-11-20 18:49                   ` Andi Kleen
       [not found]                     ` <20171120184927.GK2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
  2017-11-20 19:44                   ` Thomas Gleixner
  1 sibling, 1 reply; 80+ messages in thread
From: Andi Kleen @ 2017-11-20 18:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Andy Lutomirski, Dave Watson, linux-kernel, linux-api,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andrew Hunter, Andi Kleen, Chris Lameter,
	Ben Maurer, rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas

> Having cpu_opv do a 4k memcpy allows it to handle scenarios where
> rseq fails to progress.

If anybody ever gets that right. It will be really hard to just
test such a path.

It also seems fairly theoretical to me. Do you even have a 
test case where the normal path stops making forward progress?

-Andi

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
       [not found]                 ` <204285712.18480.1511203151076.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-11-20 18:49                   ` Andi Kleen
@ 2017-11-20 19:44                   ` Thomas Gleixner
  2017-11-21 11:25                     ` Mathieu Desnoyers
  1 sibling, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2017-11-20 19:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon

On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
> ----- On Nov 20, 2017, at 12:48 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
> The use-case for 4k memcpy operation is a per-cpu ring buffer where
> the rseq fast-path does the following:
> 
> - ring buffer push: in the rseq asm instruction sequence, a memcpy of a
>   given structure (limited to 4k in size) into a ring buffer,
>   followed by the final commit instruction which increments the current
>   position offset by the number of bytes pushed.
> 
> - ring buffer pop: in the rseq asm instruction sequence, a memcpy of
>   a given structure (up to 4k) from the ring buffer, at "position" offset.
>   The final commit instruction decrements the current position offset by
>   the number of bytes pop'd.
> 
> Having cpu_opv do a 4k memcpy allows it to handle scenarios where
> rseq fails to progress.

I'm still confused. Before you talked about the sequence:

    1) Reserve

    2) Commit

> and both use rseq, but obviously you cannot do two "atomic" operations in
one section.

So now you talk about push. Is that what you described earlier as commit?

Assumed that it is, I still have no idea why the memcpy needs to happen in
that preempt disabled region.

If you have space reserved, then the copy does not have any dependencies
neither on the cpu it runs on nor on anything else. So the only
interresting operation is the final commit instruction which tells the
consumer that its ready.

So what is the part I am missing here aside of having difficulties to map
the constantly changing names of this stuff?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
       [not found]                     ` <20171120184927.GK2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
@ 2017-11-20 22:46                       ` Mathieu Desnoyers
  0 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-20 22:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Thomas Gleixner, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Andy Lutomirski, Dave Watson, linux-kernel, linux-api,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andrew Hunter, Chris Lameter, Ben Maurer,
	rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas, Will

----- On Nov 20, 2017, at 1:49 PM, Andi Kleen andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org wrote:

>> Having cpu_opv do a 4k memcpy allows it to handle scenarios where
>> rseq fails to progress.
> 
> If anybody ever gets that right. It will be really hard to just
> test such a path.
> 
> It also seems fairly theoretical to me. Do you even have a
> test case where the normal path stops making forward progress?

We expect the following loop to progress, typically after a single
iteration:

	do {
		cpu = rseq_cpu_start();
		ret = rseq_addv(&v, 1, cpu);
		attempts++;
	} while (ret);

Now, running this in gdb (break on "main", run, single-step
execution with "next"), the program gets stuck in an infinite loop.

What solution do you have in mind to handle this kind of
scenario without breaking pre-existing debuggers ?

Looking at vDSO examples of vgetcpu and vclock_gettime under
gdb 7.7.1 (debian) with glibc 2.19:

sched_getcpu behavior under single-stepping per source line
with "step" seems to only see the ../sysdeps/unix/sysv/linux/x86_64/sched_getcpu.S
source lines, which makes it skip single-stepping of the vDSO.

sched_getcpu under "stepi": it does go through the vDSO instruction
addresses. It does progress, given that there is no loop there.

clock_gettime under "step": it only sees source lines of
../sysdeps/unix/clock_gettime.c.

clock_gettime under "stepi": it's stuck in an infinite loop.

So instruction-level stepping from gdb turns clock_gettime vDSO
into a never-ending loop, which is already bad. But with rseq,
the situation is even worse, because it turns source line level
single-stepping into infinite loops.

My understanding from https://sourceware.org/bugzilla/show_bug.cgi?id=14466
is that GDB currently simply removes the vDSO from its list of library
mappings, which is probably why it skips over vDSO for the source
lines single-stepping case. We cannot do that with rseq, because we
_want_ the rseq critical section to be inlined into the application
or library. A function call costs more than most rseq critical sections.

I plan to have the rseq user-space code provide a "__rseq_table" section
so debuggers can eventually figure out that they need to skip over the
rseq critical sections. However, it won't help the fact that pre-existing
debugger single-stepping will start turning perfectly working programs
into never-ending loops simply by having glibc use rseq for memory
allocation.

Using the cpu_opv system call on rseq failure solves this problem
entirely.

I would even go further and recommend to take a similar approach when
lack of progress is detected in a vDSO, and invoke the equivalent
system call. The current implementation of the clock_gettime()
vDSO turns instruction-level single-stepping into never
ending loops, which is far from being elegant.

Thanks,

Mathieu	

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
  2017-11-20 19:44                   ` Thomas Gleixner
@ 2017-11-21 11:25                     ` Mathieu Desnoyers
  0 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2017-11-21 11:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Andy Lutomirski,
	Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon

----- On Nov 20, 2017, at 2:44 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:

> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
>> ----- On Nov 20, 2017, at 12:48 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
>> The use-case for 4k memcpy operation is a per-cpu ring buffer where
>> the rseq fast-path does the following:
>> 
>> - ring buffer push: in the rseq asm instruction sequence, a memcpy of a
>>   given structure (limited to 4k in size) into a ring buffer,
>>   followed by the final commit instruction which increments the current
>>   position offset by the number of bytes pushed.
>> 
>> - ring buffer pop: in the rseq asm instruction sequence, a memcpy of
>>   a given structure (up to 4k) from the ring buffer, at "position" offset.
>>   The final commit instruction decrements the current position offset by
>>   the number of bytes pop'd.
>> 
>> Having cpu_opv do a 4k memcpy allows it to handle scenarios where
>> rseq fails to progress.
> 
> I'm still confused. Before you talked about the sequence:
> 
>    1) Reserve
> 
>    2) Commit
> 
> and both use rseq, but obviously you cannot do two "atomic" operation in
> one section.
> 
> So now you talk about push. Is that what you described earlier as commit?
> 
> Assumed that it is, I still have no idea why the memcpy needs to happen in
> that preempt disabled region.
> 
> If you have space reserved, then the copy does not have any dependencies
> neither on the cpu it runs on nor on anything else. So the only
> interresting operation is the final commit instruction which tells the
> consumer that its ready.
> 
> So what is the part I am missing here aside of having difficulties to map
> the constantly changing names of this stuff?

Let's clear up some confusion: those are two different use-cases. The
ring buffer with reserve+commit is a FIFO ring buffer, and the ring buffer
with memcpy+position update is a LIFO queue.

Let me explain the various use-cases here, so we know what we're talking
about.

rseq and cpu_opv use-cases

1) per-cpu spinlock

A per-cpu spinlock can be implemented as a rseq consisting of a
comparison operation (== 0) on a word, and a word store (1), followed
by an acquire barrier after control dependency. The unlock path can be
performed with a simple store-release of 0 to the word, which does
not require rseq.

The cpu_opv fallback requires a single-word comparison (== 0) and a
single-word store (1).
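
As a sketch, the trylock fallback can be expressed as a two-op vector (the
store op name and union member below are placeholders for the actual ABI):

static int percpu_lock_trylock_slowpath(intptr_t *lock_word, int cpu)
{
	const intptr_t zero = 0, one = 1;
	struct cpu_op ops[2] = {
		{	/* Abort (-EAGAIN) if the lock is already held. */
			.op = CPU_COMPARE_EQ_OP,
			.len = sizeof(intptr_t),
			.u.compare_op = {
				.a = (unsigned long)lock_word,
				.b = (unsigned long)&zero,
			},
		},
		{	/* Commit: store 1 into the lock word. */
			.op = CPU_MEMCPY_OP,		/* placeholder: word store */
			.len = sizeof(intptr_t),
			.u.memcpy_op = {		/* placeholder member */
				.dst = (unsigned long)lock_word,
				.src = (unsigned long)&one,
			},
		},
	};

	return syscall(__NR_cpu_opv, ops, 2, cpu, 0);
}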


2) per-cpu statistics counters

A per-cpu statistics counters can be implemented as a rseq consisting
of a final "add" instruction on a word as commit.

The cpu_opv fallback can be implemented as a "ADD" operation.

Besides statistics tracking, these counters can be used to implement
user-space RCU per-cpu grace period tracking for both single and
multi-process user-space RCU.
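
Sketch of the counter increment combining the rseq fast path with the cpu_opv
fallback (rseq_cpu_start()/rseq_addv() are the selftest helpers used elsewhere
in this series; the CPU_ADD_OP name and union member are placeholders):

static void percpu_counter_inc(intptr_t *counters, int stride)
{
	for (;;) {
		int cpu = rseq_cpu_start();
		intptr_t *counter = counters + cpu * stride;
		struct cpu_op op;

		if (!rseq_addv(counter, 1, cpu))
			return;			/* rseq fast path succeeded */

		/* Fallback: ask the kernel to do the add on that CPU. */
		op = (struct cpu_op) {
			.op = CPU_ADD_OP,		/* placeholder name */
			.len = sizeof(intptr_t),
			.u.arithmetic_op = {		/* placeholder member */
				.p = (unsigned long)counter,
				.count = 1,
			},
		};
		if (!syscall(__NR_cpu_opv, &op, 1, cpu, 0))
			return;
		/* -EAGAIN (e.g. CPU hotplug race): retry. */
	}
}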


3) per-cpu LIFO linked-list (unlimited size stack)

A per-cpu LIFO linked-list has a "push" and "pop" operation,
which respectively adds an item to the list, and removes an
item from the list.

The "push" operation can be implemented as a rseq consisting of
a word comparison instruction against head followed by a word store
(commit) to head. Its cpu_opv fallback can be implemented as a
word-compare followed by word-store as well.

The "pop" operation can be implemented as a rseq consisting of
loading head, comparing it against NULL, loading the next pointer
at the right offset within the head item, and the next pointer as
a new head, returning the old head on success.

The cpu_opv fallback for "pop" differs from its rseq algorithm:
considering that cpu_opv requires knowing all pointers at system
call entry so it can pin all pages, cpu_opv cannot simply load
head and then load the head->next address within the preempt-off
critical section. User-space needs to pass the head and head->next
addresses to the kernel, and the kernel needs to check that the
head address is unchanged since it has been loaded by user-space.
However, when accessing head->next in an ABA situation, it's
possible that head is unchanged, but loading head->next can
result in a page fault due to a concurrently freed head object.
This is why the "expect_fault" operation field is introduced: if a
fault is triggered by this access, "-EAGAIN" will be returned by
cpu_opv rather than -EFAULT, thus indicating that the operation
vector should be attempted again. The "pop" operation can thus be
implemented as a word comparison of head against the head loaded
by user-space, followed by a load of the head->next pointer (which
may fault), and a store of that pointer as a new head.
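
As a sketch, the "pop" fallback looks like this ("struct percpu_list" and
"struct node" are illustrative, and the memcpy op name, union member and
expect_fault field spelling are placeholders for the actual ABI):

static int list_pop_slowpath(struct percpu_list *list, struct node *expect_head,
			     struct node **result, int cpu)
{
	struct cpu_op ops[2] = {
		{	/* Abort (-EAGAIN) if head changed since user-space read it. */
			.op = CPU_COMPARE_EQ_OP,
			.len = sizeof(struct node *),
			.u.compare_op = {
				.a = (unsigned long)&list->head,
				.b = (unsigned long)&expect_head,
			},
		},
		{	/*
			 * Store expect_head->next as the new head. Reading
			 * expect_head->next may fault if the node was freed
			 * concurrently (ABA); expect_fault turns that -EFAULT
			 * into -EAGAIN so the caller simply retries.
			 */
			.op = CPU_MEMCPY_OP,			/* placeholder */
			.len = sizeof(struct node *),
			.u.memcpy_op = {			/* placeholder */
				.dst = (unsigned long)&list->head,
				.src = (unsigned long)&expect_head->next,
				.expect_fault_src = 1,		/* placeholder */
			},
		},
	};
	int ret = syscall(__NR_cpu_opv, ops, 2, cpu, 0);

	if (!ret)
		*result = expect_head;
	return ret;		/* on -EAGAIN: reload head and retry */
}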


4) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)

This structure is useful for passing around allocated objects
by passing pointers through per-cpu fixed-sized stack.

The "push" side can be implemented with a check of the current
offset against the maximum buffer length, followed by a rseq
consisting of a comparison of the previously loaded offset
against the current offset, a word "try store" operation into the
next ring buffer array index (it's OK to abort after a try-store,
since it's not the commit, and its side-effect can be overwritten),
then followed by a word-store to increment the current offset (commit).

The "push" cpu_opv fallback can be done with the comparison, and
two consecutive word stores, all within the preempt-off section.

The "pop" side can be implemented with a check that offset is not
0 (whether the buffer is empty), a load of the "head" pointer before the
offset array index, followed by a rseq consisting of a word
comparison checking that the offset is unchanged since previously
loaded, another check ensuring that the "head" pointer is unchanged,
followed by a store decrementing the current offset.

The cpu_opv "pop" can be implemented with the same algorithm
as the rseq fast-path (compare, compare, store).


5) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)
   supporting "peek" from remote CPU

In order to implement work queues with work-stealing between CPUs, it is
useful to ensure the offset "commit" in scenario 4) "push" has
store-release semantics, thus allowing a remote CPU to load the offset
with acquire semantics, and load the top pointer, in order to check if
work-stealing should be performed. The task (work queue item) existence
should be protected by other means, e.g. RCU.

If the peek operation notices that work-stealing should indeed be
performed, a thread can use cpu_opv to move the task between per-cpu
workqueues, by first invoking cpu_opv passing the remote work queue
cpu number as argument to pop the task, and then again as "push" with
the target work queue CPU number.


6) per-cpu LIFO ring buffer with data copy (fixed-sized stack)
   (with and without acquire-release)

This structure is useful for passing around data without requiring
memory allocation by copying the data content into per-cpu fixed-sized
stack.

The "push" operation is performed with an offset comparison against
the buffer size (figuring out if the buffer is full), followed by
a rseq consisting of a comparison of the offset, a try-memcpy attempting
to copy the data content into the buffer (which can be aborted and
overwritten), and a final store incrementing the offset.

The cpu_opv fallback needs the same operations, except that the memcpy
is guaranteed to complete, given that it is performed with preemption
disabled. This requires a memcpy operation supporting length up to 4kB.

The "pop" operation is similar to the "push, except that the offset
is first compared to 0 to ensure the buffer is not empty. The
copy source is the ring buffer, and the destination is an output
buffer.


7) per-cpu FIFO ring buffer (fixed-sized queue)

This structure is useful wherever a FIFO behavior (queue) is needed.
One major use-case is tracer ring buffer.

An implementation of this ring buffer has a "reserve", followed by
serialization of multiple bytes into the buffer, ended by a "commit".
The "reserve" can be implemented as a rseq consisting of a word
comparison followed by a word store. The reserve operation moves the
producer "head". The multi-byte serialization can be performed
non-atomically. Finally, the "commit" update can be performed with
a rseq "add" commit instruction with store-release semantic. The
ring buffer consumer reads the commit value with load-acquire
semantic to know whenever it is safe to read from the ring buffer.

This use-case requires that both "reserve" and "commit" operations
be performed on the same per-cpu ring buffer, even if a migration
happens between those operations. In the typical case, both operations
will happens on the same CPU and use rseq. In the unlikely event of a
migration, the cpu_opv system call will ensure the commit can be
performed on the right CPU by migrating the task to that CPU.

On the consumer side, an alternative to using store-release and
load-acquire on the commit counter would be to use cpu_opv to
ensure the commit counter load is performed on the right CPU. This
effectively allows moving a consumer thread between CPUs to execute
close to the ring buffer cache lines it will read.
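
As a sketch, the commit path then looks like this (the release-semantic rseq
helper and the CPU_ADD_RELEASE_OP name/union member are placeholders; the
actual ABI may express the release semantic differently):

static int ring_commit(struct percpu_ring *ring, int reserve_cpu, intptr_t len)
{
	struct cpu_op op;

	/* Fast path: rseq add with release semantic on the reserving CPU. */
	if (!rseq_addv_release(&ring->commit, len, reserve_cpu))	/* placeholder helper */
		return 0;

	/*
	 * Slow path (e.g. migrated between reserve and commit): cpu_opv
	 * pushes the task back to reserve_cpu and performs the commit there.
	 */
	op = (struct cpu_op) {
		.op = CPU_ADD_RELEASE_OP,		/* placeholder name */
		.len = sizeof(intptr_t),
		.u.arithmetic_op = {			/* placeholder member */
			.p = (unsigned long)&ring->commit,
			.count = len,
		},
	};
	return syscall(__NR_cpu_opv, &op, 1, reserve_cpu, 0);
}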

Thanks,

Mathieu


> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11
  2017-11-14 21:32           ` Paul Turner
@ 2018-03-27 18:15             ` Mathieu Desnoyers
  0 siblings, 0 replies; 80+ messages in thread
From: Mathieu Desnoyers @ 2018-03-27 18:15 UTC (permalink / raw)
  To: Paul Turner
  Cc: Andy Lutomirski, Linus Torvalds, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Dave Watson, linux-kernel,
	linux-api, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Andrew Hunter, Andi Kleen,
	Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Catalin Marinas, Will

Hi Paul!

I guess stuff like Spectre/Meltdown can turn 1-2 days into months. ;-)

I did not want to distract you too much from that work, but you'll notice
I've sent an updated patch series against 4.16-rc7, aiming at 4.17 [1]
(it should be in your inbox).

I would really appreciate if you can find time to provide feedback on
that version.

Congratulations on the wedding!

Thanks,

Mathieu

[1] https://lkml.kernel.org/r/20180327160542.28457-1-mathieu.desnoyers@efficios.com

----- On Nov 14, 2017, at 4:32 PM, Paul Turner pjt@google.com wrote:

> I have some comments that apply to many of the threads.
> I've been fully occupied with a wedding and a security issue; but I'm
> about to be free to spend the majority of my time on RSEQ things.
> I was sorely hoping that day would be today.  But it's looking like
> I'm still a day or two from being free for this.
> Thank you for the extensive clean-ups and user-side development.  I
> have some updates on these topics also.
> 
> - Paul
> 
> On Tue, Nov 14, 2017 at 1:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Tue, Nov 14, 2017 at 1:08 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Tue, Nov 14, 2017 at 12:03 PM, Mathieu Desnoyers
>>> <mathieu.desnoyers@efficios.com> wrote:
>>>> Here is the last RFC round of the updated rseq patchset containing:
>>>
>>> Andy? You were the one with concerns here and said you'd have
>>> something else ready for comparison.
>>>
>>
>> I had a long discussion with Mathieu and KS and I think that this is a
>> good compromise.  I haven't reviewed the series all that carefully,
>> but I think the idea is sound.
>>
>> Basically, event_counter is gone (to be re-added in a later kernel if
>> it really ends up being necessary, but it looks like it may primarily
>> be a temptation to write subtly incorrect user code and to see
>> scheduling details that shouldn't be readily exposed rather than a
>> genuinely useful feature) and the versioning mechanism for the asm
>> critical section bit is improved.  My crazy proposal should be doable
>> on top of this if there's demand and if anyone wants to write the
>> gnarly code involved.
>>
>> IOW no objection from me as long as those changes were made, which I
> > *think* they were.  Mathieu?

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2018-03-27 18:15 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 03/24] Restartable sequences: wire up ARM 32 system call Mathieu Desnoyers
     [not found] ` <20171114200414.2188-1-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-14 20:03   ` [RFC PATCH v11 for 4.15 01/24] Restartable sequences " Mathieu Desnoyers
     [not found]     ` <CY4PR15MB168884529B3C0F8E6CC06257CF280@CY4PR15MB1688.namprd15.prod.outlook.com>
     [not found]       ` <CY4PR15MB168884529B3C0F8E6CC06257CF280-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-11-14 20:49         ` Ben Maurer
     [not found]           ` <CY4PR15MB1688CE0F2139CEB72B467242CF280-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-11-14 21:03             ` Mathieu Desnoyers
     [not found]     ` <20171114200414.2188-2-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-14 20:39       ` Ben Maurer
     [not found]         ` <CY4PR15MB168866BFDCFECF81B7EF4CF1CF280-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-11-14 20:52           ` Mathieu Desnoyers
     [not found]             ` <574606484.15158.1510692743725.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-14 21:48               ` Ben Maurer
2017-11-16 16:18       ` Peter Zijlstra
     [not found]         ` <20171116161815.dg4hi2z35rkh4u4s-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
2017-11-16 16:27           ` Mathieu Desnoyers
     [not found]             ` <438349693.16595.1510849627973.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-16 16:32               ` Peter Zijlstra
     [not found]                 ` <20171116163218.fg4u4bbzfrbxatvz-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
2017-11-16 17:09                   ` Mathieu Desnoyers
2017-11-16 18:43       ` Peter Zijlstra
     [not found]         ` <20171116184305.snpudnjdhua2obby-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
2017-11-16 18:49           ` Mathieu Desnoyers
     [not found]             ` <1523632942.16739.1510858189882.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-16 19:06               ` Thomas Gleixner
2017-11-16 20:06                 ` Mathieu Desnoyers
2017-11-16 21:08       ` Thomas Gleixner
2017-11-19 17:24         ` Mathieu Desnoyers
2017-11-16 19:14     ` Peter Zijlstra
     [not found]       ` <20171116191448.rmds347hwsyibipm-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
2017-11-16 20:37         ` Mathieu Desnoyers
     [not found]           ` <1083699948.16848.1510864678185.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-16 20:46             ` Peter Zijlstra
2017-11-14 20:03   ` [RFC PATCH for 4.15 02/24] Restartable sequences: ARM 32 architecture support Mathieu Desnoyers
2017-11-14 20:03   ` [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 " Mathieu Desnoyers
     [not found]     ` <20171114200414.2188-5-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-16 21:14       ` Thomas Gleixner
2017-11-19 17:41         ` Mathieu Desnoyers
     [not found]           ` <1390396579.17843.1511113291117.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-20  8:38             ` Thomas Gleixner
2017-11-14 20:03   ` [RFC PATCH for 4.15 05/24] Restartable sequences: wire up x86 32/64 system call Mathieu Desnoyers
2017-11-14 20:03   ` [RFC PATCH for 4.15 06/24] Restartable sequences: powerpc architecture support Mathieu Desnoyers
2017-11-14 20:03   ` [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call Mathieu Desnoyers
     [not found]     ` <20171114200414.2188-9-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-15  1:34       ` Mathieu Desnoyers
2017-11-15  7:44       ` Michael Kerrisk (man-pages)
     [not found]         ` <CAKgNAkjrh_OMi+7EUJxqM0-84WUxL0d_vse4neOL93EB-sGKXw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-11-15 14:30           ` Mathieu Desnoyers
2017-11-16 23:26       ` Thomas Gleixner
2017-11-17  0:14         ` Andi Kleen
     [not found]           ` <20171117001410.GG2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
2017-11-17 10:09             ` Thomas Gleixner
2017-11-17 17:14               ` Mathieu Desnoyers
     [not found]                 ` <1756446476.17265.1510938872121.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-17 18:18                   ` Andi Kleen
     [not found]                     ` <20171117181839.GH2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
2017-11-17 18:59                       ` Thomas Gleixner
2017-11-17 19:15                         ` Andi Kleen
     [not found]                           ` <20171117191547.GI2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
2017-11-17 20:07                             ` Thomas Gleixner
2017-11-18 21:09                               ` Andy Lutomirski
2017-11-17 20:22                 ` Thomas Gleixner
2017-11-20 17:13                   ` Mathieu Desnoyers
2017-11-20 16:13         ` Mathieu Desnoyers
     [not found]           ` <1766414702.18278.1511194398489.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-20 17:48             ` Thomas Gleixner
2017-11-20 18:03               ` Thomas Gleixner
2017-11-20 18:42                 ` Mathieu Desnoyers
2017-11-20 18:39               ` Mathieu Desnoyers
     [not found]                 ` <204285712.18480.1511203151076.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-20 18:49                   ` Andi Kleen
     [not found]                     ` <20171120184927.GK2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
2017-11-20 22:46                       ` Mathieu Desnoyers
2017-11-20 19:44                   ` Thomas Gleixner
2017-11-21 11:25                     ` Mathieu Desnoyers
2017-11-14 20:03   ` [RFC PATCH for 4.15 09/24] cpu_opv: Wire up x86 32/64 " Mathieu Desnoyers
2017-11-14 20:04   ` [RFC PATCH v2 for 4.15 12/24] cpu_opv: Implement selftests Mathieu Desnoyers
2017-11-14 20:04   ` [RFC PATCH v2 for 4.15 13/24] Restartable sequences: Provide self-tests Mathieu Desnoyers
2017-11-14 20:04   ` [RFC PATCH for 4.15 14/24] Restartable sequences selftests: arm: workaround gcc asm size guess Mathieu Desnoyers
2017-11-14 20:04   ` [RFC PATCH v5 for 4.15 17/24] membarrier: Document scheduler barrier requirements Mathieu Desnoyers
2017-11-14 21:08   ` [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Linus Torvalds
     [not found]     ` <CA+55aFzZcQKEvu5S3TwD9MscqDhqq3pKa0Kam79NncjP8RnvoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-11-14 21:15       ` Andy Lutomirski
     [not found]         ` <CALCETrVMvk0dsBMF8F-gPZCGnfJt=RQOvTnVzJhVaAFhEFbq2w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-11-14 21:32           ` Paul Turner
2018-03-27 18:15             ` Mathieu Desnoyers
2017-11-14 21:32           ` Mathieu Desnoyers
     [not found]             ` <2115146800.15215.1510695175687.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-15  4:12               ` Andy Lutomirski
     [not found]                 ` <CALCETrX4dzY_kyZmqR+srKZf7vVYzODH5i9bguFAzdm0dcU3ZQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-11-15  6:34                   ` Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 07/24] Restartable sequences: Wire up powerpc system call Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 10/24] cpu_opv: " Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 11/24] cpu_opv: Wire up ARM32 " Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 15/24] membarrier: selftest: Test private expedited cmd Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v7 for 4.15 16/24] membarrier: powerpc: Skip memory barrier in switch_mm() Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 18/24] membarrier: provide SHARED_EXPEDITED command Mathieu Desnoyers
     [not found]   ` <20171114200414.2188-19-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-15  1:36     ` Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 19/24] membarrier: selftest: Test shared expedited cmd Mathieu Desnoyers
     [not found]   ` <20171114200414.2188-20-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-11-17 15:07     ` Shuah Khan
2017-11-14 20:04 ` [RFC PATCH for 4.15 20/24] membarrier: Provide core serializing command Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 21/24] x86: Introduce sync_core_before_usermode Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 22/24] membarrier: x86: Provide core serializing command Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd Mathieu Desnoyers
2017-11-17 15:09   ` Shuah Khan
2017-11-17 16:17     ` Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 24/24] membarrier: arm64: Provide core serializing command Mathieu Desnoyers
