* [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call
@ 2018-10-10 19:19 Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc Mathieu Desnoyers
                   ` (15 more replies)
  0 siblings, 16 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers

Hi,

Considering it's already late in the 4.19 rc cycle, I'm submitting this
patchset as RFC for 4.21 to give everyone plenty of time to provide
feedback.

This series contains:

- rseq selftests (this could be 4.20 material):
  - Add a reference counter within the user-space __rseq_abi structure, so
    rseq applications/libraries can integrate with future glibc support,
  - Adapt number of threads to the number of online cpus.

- cpu_opv (4.21 material):
  - Implement push_task_to_cpu() (scheduler),
  - Introduce vm_map_user_ram()/vm_unmap_user_ram() (mm),
  - Provide is_vma_noncached() (mm),
  - Introduce cpu_opv system call, with vmap space limiting,
    - Wire up cpu_opv on x86, powerpc, arm,
  - Provide cpu_opv selftests.

The cpu_opv system call covers the use-cases that rseq does not handle,
namely single-stepping with debuggers, moving data between per-cpu data
structures without interfering with cpu affinity masks, and using rseq
from signal handlers nested between thread creation and rseq
registration by glibc, or between rseq unregistration by glibc and
thread teardown.

Thanks,

Mathieu

Mathieu Desnoyers (16):
  rseq/selftests: Add reference counter to coexist with glibc
  rseq/selftests: Adapt number of threads to the number of detected cpus
  sched: Implement push_task_to_cpu (v2)
  mm: Introduce vm_map_user_ram, vm_unmap_user_ram
  mm: Provide is_vma_noncached
  cpu_opv: Provide cpu_opv system call (v8)
  cpu_opv: limit amount of virtual address space used by cpu_opv
  x86: Wire up cpu_opv system call
  powerpc: Wire up cpu_opv system call
  arm: Wire up cpu_opv system call
  cpu-opv/selftests: Provide cpu-op library
  cpu-opv/selftests: Provide basic test
  cpu-opv/selftests: Provide percpu_op API
  cpu-opv/selftests: Provide basic percpu ops test
  cpu-opv/selftests: Provide parametrized tests
  cpu-opv/selftests: Provide Makefile, scripts, gitignore

 MAINTAINERS                                        |    8 +
 arch/arm/tools/syscall.tbl                         |    1 +
 arch/powerpc/include/asm/systbl.h                  |    1 +
 arch/powerpc/include/uapi/asm/unistd.h             |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl             |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl             |    1 +
 include/linux/mm.h                                 |   24 +
 include/linux/syscalls.h                           |    3 +
 include/linux/vmalloc.h                            |    4 +
 include/uapi/linux/cpu_opv.h                       |  114 ++
 init/Kconfig                                       |   17 +
 kernel/Makefile                                    |    1 +
 kernel/cpu_opv.c                                   | 1190 +++++++++++++++++
 kernel/sched/core.c                                |   42 +
 kernel/sched/sched.h                               |    9 +
 kernel/sys_ni.c                                    |    1 +
 kernel/sysctl.c                                    |   15 +
 mm/vmalloc.c                                       |   64 +
 tools/testing/selftests/Makefile                   |    1 +
 tools/testing/selftests/cpu-opv/.gitignore         |    6 +
 tools/testing/selftests/cpu-opv/Makefile           |   39 +
 .../testing/selftests/cpu-opv/basic_cpu_opv_test.c | 1362 ++++++++++++++++++++
 .../selftests/cpu-opv/basic_percpu_ops_test.c      |  295 +++++
 tools/testing/selftests/cpu-opv/cpu-op.c           |  353 +++++
 tools/testing/selftests/cpu-opv/cpu-op.h           |   42 +
 tools/testing/selftests/cpu-opv/param_test.c       | 1187 +++++++++++++++++
 tools/testing/selftests/cpu-opv/percpu-op.h        |  151 +++
 tools/testing/selftests/cpu-opv/run_param_test.sh  |  134 ++
 tools/testing/selftests/rseq/rseq.c                |   32 +-
 tools/testing/selftests/rseq/run_param_test.sh     |    7 +-
 30 files changed, 5096 insertions(+), 10 deletions(-)
 create mode 100644 include/uapi/linux/cpu_opv.h
 create mode 100644 kernel/cpu_opv.c
 create mode 100644 tools/testing/selftests/cpu-opv/.gitignore
 create mode 100644 tools/testing/selftests/cpu-opv/Makefile
 create mode 100644 tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
 create mode 100644 tools/testing/selftests/cpu-opv/basic_percpu_ops_test.c
 create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.c
 create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.h
 create mode 100644 tools/testing/selftests/cpu-opv/param_test.c
 create mode 100644 tools/testing/selftests/cpu-opv/percpu-op.h
 create mode 100755 tools/testing/selftests/cpu-opv/run_param_test.sh

-- 
2.11.0



* [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-11 10:37   ` Szabolcs Nagy
  2018-10-10 19:19 ` [RFC PATCH for 4.21 02/16] rseq/selftests: Adapt number of threads to the number of detected cpus Mathieu Desnoyers
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers, Shuah Khan, Carlos O'Donell,
	Florian Weimer, Joseph Myers, Szabolcs Nagy

In order to integrate rseq into user-space applications, add a reference
counter field after the struct rseq TLS ABI so that multiple rseq users
can be linked into the same application (e.g. librseq and glibc). The
reference count ensures that rseq syscall registration/unregistration is
performed only by the earliest and latest user on each thread, thus ensuring
that rseq is registered across the lifetime of all rseq users for a
given thread.

struct rseq itself still contains only the fields shared between kernel
and user-space, and remains the ABI between kernel and user-space.
The extra field added after struct rseq is an ABI between the user-space
executable and its libraries; the kernel does not care about that field,
so it is not part of the Linux uapi.
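
For illustration, here is a minimal sketch of how a second rseq user (for
instance a future glibc) could coexist with the selftest code below through
that reference count. The struct layout and the weak __rseq_abi alias mirror
the patch below; the helper name, the sig parameter, and the omission of the
signal blocking done in rseq.c are simplifications for illustration only:

/*
 * Sketch only: a second rseq user coexisting through the refcount.
 * Each user defines the same extended layout; the weak __rseq_abi
 * alias makes all users share one TLS object per thread.
 */
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/rseq.h>

struct libc_rseq {		/* must match the layout in rseq.c below */
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;
	uint32_t flags;
	uint32_t refcount;	/* user-space-only field, ignored by the kernel */
} __attribute__((aligned(4 * sizeof(uint64_t))));

extern __thread volatile struct rseq __rseq_abi;	/* weak alias below */

/* sig must match the RSEQ_SIG value used by every rseq user in the process. */
static int second_user_register_current_thread(uint32_t sig)
{
	volatile struct libc_rseq *abi =
		(volatile struct libc_rseq *)&__rseq_abi;

	if (abi->refcount++)
		return 0;	/* an earlier user already registered this thread */
	return syscall(__NR_rseq, &__rseq_abi, sizeof(struct rseq), 0, sig);
}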

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Carlos O'Donell <carlos@redhat.com>
CC: Florian Weimer <fweimer@redhat.com>
CC: Joseph Myers <joseph@codesourcery.com>
CC: Szabolcs Nagy <szabolcs.nagy@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ben Maurer <bmaurer@fb.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Dave Watson <davejwatson@fb.com>
CC: Paul Turner <pjt@google.com>
CC: linux-api@vger.kernel.org
---
 tools/testing/selftests/rseq/rseq.c | 32 ++++++++++++++++++++++++--------
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index 4847e97ed049..7e9ae973f786 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -30,13 +30,29 @@
 
 #define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
 
-__attribute__((tls_model("initial-exec"))) __thread
-volatile struct rseq __rseq_abi = {
+/*
+ * linux/rseq.h defines struct rseq as aligned on 32 bytes. The kernel ABI
+ * size is 20 bytes. For support of multiple rseq users within a process,
+ * user-space defines an extra 4-byte field as a reference count, for a
+ * total of 24 bytes.
+ */
+struct libc_rseq {
+	/* kernel-userspace ABI. */
+	__u32 cpu_id_start;
+	__u32 cpu_id;
+	__u64 rseq_cs;
+	__u32 flags;
+	/* user-space ABI. */
+	__u32 refcount;
+} __attribute__((aligned(4 * sizeof(__u64))));
+
+__attribute__((visibility("hidden"))) __thread
+volatile struct libc_rseq __lib_rseq_abi = {
 	.cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
 };
 
-static __attribute__((tls_model("initial-exec"))) __thread
-volatile int refcount;
+extern __attribute__((weak, alias("__lib_rseq_abi"))) __thread
+volatile struct rseq __rseq_abi;
 
 static void signal_off_save(sigset_t *oldset)
 {
@@ -70,7 +86,7 @@ int rseq_register_current_thread(void)
 	sigset_t oldset;
 
 	signal_off_save(&oldset);
-	if (refcount++)
+	if (__lib_rseq_abi.refcount++)
 		goto end;
 	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);
 	if (!rc) {
@@ -78,9 +94,9 @@ int rseq_register_current_thread(void)
 		goto end;
 	}
 	if (errno != EBUSY)
-		__rseq_abi.cpu_id = -2;
+		__rseq_abi.cpu_id = RSEQ_CPU_ID_REGISTRATION_FAILED;
 	ret = -1;
-	refcount--;
+	__lib_rseq_abi.refcount--;
 end:
 	signal_restore(oldset);
 	return ret;
@@ -92,7 +108,7 @@ int rseq_unregister_current_thread(void)
 	sigset_t oldset;
 
 	signal_off_save(&oldset);
-	if (--refcount)
+	if (--__lib_rseq_abi.refcount)
 		goto end;
 	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq),
 		      RSEQ_FLAG_UNREGISTER, RSEQ_SIG);
-- 
2.11.0



* [RFC PATCH for 4.21 02/16] rseq/selftests: Adapt number of threads to the number of detected cpus
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 03/16] sched: Implement push_task_to_cpu (v2) Mathieu Desnoyers
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers, Shuah Khan, linux-kselftest

Running a test with 200 threads can take a long time on machines with
a small number of CPUs.

Detect the number of online cpus at test runtime, and multiply that
by 6 to have 6 rseq threads per cpu preempting each other.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 tools/testing/selftests/rseq/run_param_test.sh | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/rseq/run_param_test.sh b/tools/testing/selftests/rseq/run_param_test.sh
index 3acd6d75ff9f..e426304fd4a0 100755
--- a/tools/testing/selftests/rseq/run_param_test.sh
+++ b/tools/testing/selftests/rseq/run_param_test.sh
@@ -1,6 +1,8 @@
 #!/bin/bash
 # SPDX-License-Identifier: GPL-2.0+ or MIT
 
+NR_CPUS=`grep '^processor' /proc/cpuinfo | wc -l`
+
 EXTRA_ARGS=${@}
 
 OLDIFS="$IFS"
@@ -28,15 +30,16 @@ IFS="$OLDIFS"
 
 REPS=1000
 SLOW_REPS=100
+NR_THREADS=$((6*${NR_CPUS}))
 
 function do_tests()
 {
 	local i=0
 	while [ "$i" -lt "${#TEST_LIST[@]}" ]; do
 		echo "Running test ${TEST_NAME[$i]}"
-		./param_test ${TEST_LIST[$i]} -r ${REPS} ${@} ${EXTRA_ARGS} || exit 1
+		./param_test ${TEST_LIST[$i]} -r ${REPS} -t ${NR_THREADS} ${@} ${EXTRA_ARGS} || exit 1
 		echo "Running compare-twice test ${TEST_NAME[$i]}"
-		./param_test_compare_twice ${TEST_LIST[$i]} -r ${REPS} ${@} ${EXTRA_ARGS} || exit 1
+		./param_test_compare_twice ${TEST_LIST[$i]} -r ${REPS} -t ${NR_THREADS} ${@} ${EXTRA_ARGS} || exit 1
 		let "i++"
 	done
 }
-- 
2.11.0



* [RFC PATCH for 4.21 03/16] sched: Implement push_task_to_cpu (v2)
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 02/16] rseq/selftests: Adapt number of threads to the number of detected cpus Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-17  6:51   ` Srikar Dronamraju
  2018-10-10 19:19 ` [RFC PATCH for 4.21 04/16] mm: Introduce vm_map_user_ram, vm_unmap_user_ram Mathieu Desnoyers
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers

Implement push_task_to_cpu(), which moves the task received as argument
to the destination cpu's runqueue. It only does so if the CPU is within
the CPU allowed mask of the task and if the CPU is active. If the CPU is
not part of the allowed mask, -EINVAL is returned. If the CPU is not
active, -EBUSY is returned.

It does not change the CPU allowed mask, and can therefore be used
within applications which rely on owning the sched_setaffinity() state.

It does not pin the task to the destination CPU, which means that the
scheduler may choose to move the task away from that CPU before the
task executes. Code invoking push_task_to_cpu() must be prepared to
retry in that case.
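
For illustration, a kernel-side caller would typically pair push_task_to_cpu()
with a retry loop along these lines (a sketch only, not taken from the cpu_opv
implementation later in this series; the helper name is made up):

/* push_task_to_cpu() is declared in kernel/sched/sched.h. */
#include <linux/sched.h>
#include <linux/preempt.h>
#include <linux/smp.h>

/*
 * Sketch: move the current task to target_cpu and perform some
 * preempt-off work there, retrying if the scheduler migrates the task
 * away again before preemption is disabled.
 */
static int example_run_on_cpu(unsigned int target_cpu)
{
	int ret;

	for (;;) {
		ret = push_task_to_cpu(current, target_cpu);
		if (ret)
			return ret;	/* -EINVAL or -EBUSY, see above */
		preempt_disable();
		if (smp_processor_id() == target_cpu)
			break;
		/* Migrated away before preemption was disabled: retry. */
		preempt_enable();
	}
	/* ... preempt-off work running on target_cpu ... */
	preempt_enable();
	return 0;
}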

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul Turner <pjt@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org

---
Change since v1:
- Return -EBUSY if CPU is not active.
---
 kernel/sched/core.c  | 42 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  9 +++++++++
 2 files changed, 51 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ad97f3ba5ec5..ee302988b342 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1036,6 +1036,48 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 		set_curr_task(rq, p);
 }
 
+int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+	int ret = 0;
+
+	rq = task_rq_lock(p, &rf);
+	update_rq_clock(rq);
+
+	if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!cpumask_test_cpu(dest_cpu, cpu_active_mask)) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	if (task_cpu(p) == dest_cpu)
+		goto out;
+
+	if (task_running(rq, p) || p->state == TASK_WAKING) {
+		struct migration_arg arg = { p, dest_cpu };
+		/* Need help from migration thread: drop lock and wait. */
+		task_rq_unlock(rq, p, &rf);
+		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
+		tlb_migrate_finish(p->mm);
+		return 0;
+	} else if (task_on_rq_queued(p)) {
+		/*
+		 * OK, since we're going to drop the lock immediately
+		 * afterwards anyway.
+		 */
+		rq = move_queued_task(rq, &rf, p, dest_cpu);
+	}
+out:
+	task_rq_unlock(rq, p, &rf);
+
+	return ret;
+}
+
 /*
  * Change a given task's CPU affinity. Migrate the thread to a
  * proper CPU and schedule it away if the CPU it's executing on
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 455fa330de04..27ad25780204 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1340,6 +1340,15 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 #endif
 }
 
+#ifdef CONFIG_SMP
+int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);
+#else
+static inline int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
+{
+	return 0;
+}
+#endif
+
 /*
  * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
  */
-- 
2.11.0



* [RFC PATCH for 4.21 04/16] mm: Introduce vm_map_user_ram, vm_unmap_user_ram
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 03/16] sched: Implement push_task_to_cpu (v2) Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-16 18:30   ` Steven Rostedt
  2018-10-10 19:19 ` [RFC PATCH for 4.21 05/16] mm: Provide is_vma_noncached Mathieu Desnoyers
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers

Create and destroy mappings aliased to a user-space mapping with the same
cache coloring as the userspace mapping. Allow the kernel to load from
and store to pages shared with user-space through its own mapping in
kernel virtual addresses while ensuring cache coherency between kernel
and userspace mappings for virtually aliased architectures.
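
As an illustration of the intended calling pattern, a sketch of a hypothetical
helper follows; it assumes the accessed range stays within a single page and
abbreviates error handling. The actual user is the cpu_opv system call later
in this series:

#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/highmem.h>
#include <linux/string.h>
#include <linux/topology.h>

/* Sketch: write to a pinned user page through a coloring-aware alias. */
static int example_write_through_alias(unsigned long uaddr, const void *src,
				       size_t len)
{
	unsigned long offset = uaddr & ~PAGE_MASK;
	struct page *page;
	void *kaddr;
	int ret;

	ret = get_user_pages_fast(uaddr, 1, 1, &page);	/* one writable page */
	if (ret != 1)
		return ret < 0 ? ret : -EFAULT;
	kaddr = vm_map_user_ram(&page, 1, uaddr, numa_node_id(), PAGE_KERNEL);
	if (!kaddr) {
		put_page(page);
		return -ENOMEM;
	}
	memcpy(kaddr + offset, src, len);
	/* Keep the user-space mapping coherent on virtually aliased caches. */
	flush_kernel_vmap_range(kaddr + offset, len);
	vm_unmap_user_ram(kaddr, 1);
	put_page(page);
	return 0;
}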

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul Turner <pjt@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/vmalloc.h |  4 ++++
 mm/vmalloc.c            | 64 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd61..899657b3d469 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -59,6 +59,10 @@ struct vmap_area {
 extern void vm_unmap_ram(const void *mem, unsigned int count);
 extern void *vm_map_ram(struct page **pages, unsigned int count,
 				int node, pgprot_t prot);
+extern void vm_unmap_user_ram(const void *mem, unsigned int count);
+extern void *vm_map_user_ram(struct page **pages, unsigned int count,
+				unsigned long uaddr, int node, pgprot_t prot);
+
 extern void vm_unmap_aliases(void);
 
 #ifdef CONFIG_MMU
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a728fc492557..a86bf550b027 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1186,6 +1186,70 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro
 }
 EXPORT_SYMBOL(vm_map_ram);
 
+/**
+ * vm_unmap_user_ram - unmap linear kernel address space set up by vm_map_user_ram
+ * @mem: the pointer returned by vm_map_user_ram
+ * @count: the count passed to that vm_map_user_ram call (cannot unmap partial)
+ */
+void vm_unmap_user_ram(const void *mem, unsigned int count)
+{
+	unsigned long size = (unsigned long)count << PAGE_SHIFT;
+	unsigned long addr = (unsigned long)mem;
+	struct vmap_area *va;
+
+	might_sleep();
+	BUG_ON(!addr);
+	BUG_ON(addr < VMALLOC_START);
+	BUG_ON(addr > VMALLOC_END);
+	BUG_ON(!PAGE_ALIGNED(addr));
+
+	debug_check_no_locks_freed(mem, size);
+	va = find_vmap_area(addr);
+	BUG_ON(!va);
+	free_unmap_vmap_area(va);
+}
+EXPORT_SYMBOL(vm_unmap_user_ram);
+
+/**
+ * vm_map_user_ram - map user space pages linearly into kernel virtual address
+ * @pages: an array of pointers to the virtually contiguous pages to be mapped
+ * @count: number of pages
+ * @uaddr: address within the first page in the userspace mapping
+ * @node: prefer to allocate data structures on this node
+ * @prot: memory protection to use. PAGE_KERNEL for regular RAM
+ *
+ * Create a mapping aliased to a user-space mapping with the same cache
+ * coloring as the userspace mapping. Allow the kernel to load from and
+ * store to pages shared with user-space through its own mapping in kernel
+ * virtual addresses while ensuring cache coherency between kernel and
+ * userspace mappings for virtually aliased architectures.
+ *
+ * Returns: a pointer to the address that has been mapped, or %NULL on failure
+ */
+void *vm_map_user_ram(struct page **pages, unsigned int count,
+		unsigned long uaddr, int node, pgprot_t prot)
+{
+	unsigned long size = (unsigned long)count << PAGE_SHIFT;
+	unsigned long va_offset = ALIGN_DOWN(uaddr, PAGE_SIZE) & (SHMLBA - 1);
+	unsigned long alloc_size = ALIGN(va_offset + size, SHMLBA);
+	struct vmap_area *va;
+	unsigned long addr;
+	void *mem;
+
+	va = alloc_vmap_area(alloc_size, SHMLBA, VMALLOC_START, VMALLOC_END,
+			node, GFP_KERNEL);
+	if (IS_ERR(va))
+		return NULL;
+	addr = va->va_start + va_offset;
+	mem = (void *)addr;
+	if (vmap_page_range(addr, addr + size, prot, pages) < 0) {
+		vm_unmap_user_ram(mem, count);
+		return NULL;
+	}
+	return mem;
+}
+EXPORT_SYMBOL(vm_map_user_ram);
+
 static struct vm_struct *vmlist __initdata;
 /**
  * vm_area_add_early - add vmap area early during boot
-- 
2.11.0



* [RFC PATCH for 4.21 05/16] mm: Provide is_vma_noncached
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 04/16] mm: Introduce vm_map_user_ram, vm_unmap_user_ram Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 06/16] cpu_opv: Provide cpu_opv system call (v8) Mathieu Desnoyers
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers, linux-mm

Provide is_vma_noncached() static inline to allow generic code to
check whether the given vma consists of noncached memory.
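
For illustration, a caller validating a user-supplied address could use it
roughly as follows (a sketch only; the helper name is made up and the actual
user is the cpu_opv patch later in this series):

#include <linux/mm.h>
#include <linux/sched.h>

/* Sketch: reject user addresses backed by noncached mappings. */
static int example_check_uaddr_cached(unsigned long uaddr)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	int ret = 0;

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, uaddr);
	if (!vma || vma->vm_start > uaddr || is_vma_noncached(vma))
		ret = -EFAULT;
	up_read(&mm->mmap_sem);
	return ret;
}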

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul Turner <pjt@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-mm@kvack.org
---
 include/linux/mm.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0416a7204be3..18acf4f339f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2551,6 +2551,30 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 	return follow_page_mask(vma, address, foll_flags, &unused_page_mask);
 }
 
+static inline bool pgprot_same(pgprot_t a, pgprot_t b)
+{
+	return pgprot_val(a) == pgprot_val(b);
+}
+
+#ifdef pgprot_noncached
+static inline bool is_vma_noncached(struct vm_area_struct *vma)
+{
+	pgprot_t pgprot = vma->vm_page_prot;
+
+	/* Check whether architecture implements noncached pages. */
+	if (pgprot_same(pgprot_noncached(PAGE_KERNEL), PAGE_KERNEL))
+		return false;
+	if (!pgprot_same(pgprot, pgprot_noncached(pgprot)))
+		return false;
+	return true;
+}
+#else
+static inline bool is_vma_noncached(struct vm_area_struct *vma)
+{
+	return false;
+}
+#endif
+
 #define FOLL_WRITE	0x01	/* check pte is writable */
 #define FOLL_TOUCH	0x02	/* mark page accessed */
 #define FOLL_GET	0x04	/* do get_page on page */
-- 
2.11.0



* [RFC PATCH for 4.21 06/16] cpu_opv: Provide cpu_opv system call (v8)
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 05/16] mm: Provide is_vma_noncached Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-16  8:10   ` Sergey Senozhatsky
  2018-10-17  7:19   ` Srikar Dronamraju
  2018-10-10 19:19 ` [RFC PATCH for 4.21 07/16] cpu_opv: limit amount of virtual address space used by cpu_opv Mathieu Desnoyers
                   ` (9 subsequent siblings)
  15 siblings, 2 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers

The cpu_opv system call executes a vector of operations on behalf of
user-space on a specific CPU with preemption disabled. It is inspired
by readv() and writev() system calls which take a "struct iovec"
array as argument.

The operations available are: comparison, memcpy, add, or, and, xor,
left shift, right shift, and memory barrier. The system call receives
a CPU number from user-space as argument, which is the CPU on which
those operations need to be performed.  All pointers in the ops must
have been set up to point to the per CPU memory of the CPU on which
the operations should be executed. The "comparison" operation can be
used to check that the data used in the preparation step did not
change between preparation of system call inputs and operation
execution within the preempt-off critical section.

The reason why we require all pointer offsets to be calculated by
user-space beforehand is because we need to use get_user_pages()
to first pin all pages touched by each operation. This takes care of
faulting-in the pages. Then, preemption is disabled, and the
operations are performed atomically with respect to other thread
execution on that CPU, without generating any page fault.

An overall maximum of 4216 bytes is enforced on the sum of operation
lengths within an operation vector, so user-space cannot generate an
overly long preempt-off critical section (cache-cold critical section
duration measured as 4.7µs on x86-64). Each operation is also limited
to a length of 4096 bytes, meaning that an operation can touch a
maximum of 4 pages (memcpy: 2 pages for source, 2 pages for
destination if addresses are not aligned on page boundaries).

If the thread is not running on the requested CPU, it is migrated to
it.

**** Justification for cpu_opv ****

Here are a few reasons justifying why the cpu_opv system call is
needed in addition to rseq:

1) Allow algorithms to perform per-cpu data migration without relying on
   sched_setaffinity()

The use-cases are migrating memory between per-cpu memory free-lists, or
stealing tasks from other per-cpu work queues: each require that
accesses to remote per-cpu data structures are performed.

Just rseq is not enough to cover those use-cases without additionally
relying on sched_setaffinity, which is unfortunately not
CPU-hotplug-safe.

The cpu_opv system call receives a CPU number as argument, and migrates
the current task to the right CPU to perform the operation sequence. If
the requested CPU is offline, it performs the operations from the
current CPU while preventing CPU hotplug, and with a mutex held.

2) Handling single-stepping from tools

Tools like debuggers, and simulators use single-stepping to run through
existing programs. If core libraries start to use restartable sequences
for e.g. memory allocation, this means pre-existing programs cannot be
single-stepped, simply because the underlying glibc or jemalloc has
changed.

The rseq user-space does expose a __rseq_table section for the sake of
debuggers, so they can skip over the rseq critical sections if they
want.  However, this requires upgrading tools, and still breaks
single-stepping in case where glibc or jemalloc is updated, but not the
tooling.

Having a performance-related library improvement break tooling is likely
to cause a big push-back against wide adoption of rseq.

3) Forward-progress guarantee

Having a piece of user-space code that stops progressing due to external
conditions is pretty bad. Developers are used to thinking of fast-paths
and slow-paths (e.g. for locking), where the contended vs uncontended
cases have different performance characteristics, but each needs to provide
some level of progress guarantees.

There are concerns about proposing just "rseq" without the associated
slow-path (cpu_opv) that guarantees progress. It's asking for
trouble when real life happens: page faults, uprobes, and other
unforeseen conditions can occasionally cause a rseq fast-path to
never progress.

4) Handling page faults

It's pretty easy to come up with corner-case scenarios where rseq does
not progress without the help from cpu_opv. For instance, a system with
swap enabled which is under high memory pressure could trigger page
faults at pretty much every rseq attempt. Although this scenario
is extremely unlikely, rseq becomes the weak link of the chain.

5) Comparison with LL/SC

Anyone versed in the load-link/store-conditional instructions of
RISC architectures will notice the similarity between rseq and LL/SC
critical sections. The comparison can even be pushed further: since
debuggers can handle those LL/SC critical sections, they should be
able to handle rseq c.s. in the same way.

First, the way gdb recognises LL/SC c.s. patterns is very fragile:
it's limited to specific common patterns, and will miss the pattern
in all other cases. But fear not, having the rseq c.s. expose a
__rseq_table to debuggers removes that guessing part.

The main difference between LL/SC and rseq is that debuggers had
to support single-stepping through LL/SC critical sections from the
get go in order to support a given architecture. For rseq, we're
adding critical sections into pre-existing applications/libraries,
so the user expectation is that tools don't break due to a library
optimization.

6) Perform maintenance operations on per-cpu data

rseq c.s. are quite limited feature-wise: they need to end with a
*single* commit instruction that updates a memory location. On the other
hand, the cpu_opv system call can combine a sequence of operations that
need to be executed with preemption disabled. While slower than rseq,
this allows for more complex maintenance operations to be performed on
per-cpu data concurrently with rseq fast-paths, in cases where it's not
possible to map those sequences of ops to a rseq.

7) Use cpu_opv as generic implementation for architectures not
   implementing rseq assembly code

rseq critical sections require architecture-specific user-space code to
be crafted in order to port an algorithm to a given architecture.  In
addition, it requires that the kernel architecture implementation adds
hooks into signal delivery and resume to user-space.

In order to facilitate integration of rseq into user-space, cpu_opv can
provide a (relatively slower) architecture-agnostic implementation of
rseq. This means that user-space code can be ported to all architectures
through use of cpu_opv initially, and have the fast-path use rseq
whenever the asm code is implemented.

8) Allow libraries with multi-part algorithms to work on same per-cpu
   data without affecting the allowed cpu mask

The lttng-ust tracer presents an interesting use-case for per-cpu
buffers: the algorithm needs to update a "reserve" counter, serialize
data into the buffer, and then update a "commit" counter _on the same
per-cpu buffer_. Using rseq for both reserve and commit can bring
significant performance benefits.

Clearly, if rseq reserve fails, the algorithm can retry on a different
per-cpu buffer. However, it's not that easy for the commit. It needs to
be performed on the same per-cpu buffer as the reserve.

The cpu_opv system call solves that problem by receiving the cpu number
on which the operation needs to be performed as argument. It can push
the task to the right CPU if needed, and perform the operations there
with preemption disabled.

Changing the allowed cpu mask for the current thread is not an
acceptable alternative for a tracing library, because the application
being traced does not expect that mask to be changed by libraries.

9) Ensure that data structures don't need store-release/load-acquire
   semantic to handle fall-back

cpu_opv performs the fall-back on the requested CPU by migrating the
task to that CPU. Executing the slow-path on the right CPU ensures that
store-release/load-acquire semantics are required neither on the
fast-path nor on the slow-path.

10) Allow use of rseq critical sections from signal handlers

Considering that rseq needs to be registered/unregistered from the
current thread, it means there is a window at thread creation/exit where
a signal handler can nest over the thread before rseq is registered by
glibc, or after it has been unregistered by glibc. One possibility to
handle this would be to extend clone() to have rseq registered
immediately when the thread is created, and unregistered implicitly when
the thread vanishes. Adding complexity to clone() has not been well
received so far. So an alternative solution is to ensure that
signal handlers using rseq critical sections have a fallback mechanism
(cpu_opv) to work on per-cpu data structures when they are nested over
threads for which rseq is not currently registered.

**** rseq and cpu_opv use-cases ****

1) per-cpu spinlock

A per-cpu spinlock can be implemented as a rseq consisting of a
comparison operation (== 0) on a word, and a word store (1), followed
by an acquire barrier after control dependency. The unlock path can be
performed with a simple store-release of 0 to the word, which does
not require rseq.

The cpu_opv fallback requires a single-word comparison (== 0) and a
single-word store (1).
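
Sketched from user-space as a cpu_opv operation vector, that fallback looks
roughly like the following. Operation and field names follow the man page
included below; the raw pointer assignments and the __NR_cpu_opv usage are
simplifications of the actual uapi (which encodes pointer fields with
CPU_OP_FIELD_u32_u64()), so treat this as illustrative only:

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/cpu_opv.h>

/*
 * Per-cpu spinlock fallback: compare the lock word against 0, then
 * store 1, both executed on the lock's CPU with preemption disabled.
 * Returns 0 when the lock is acquired, a positive op index when the
 * comparison fails (lock already held), -1 with errno set on error.
 */
static int percpu_trylock_fallback(int cpu, uint32_t *lock_word)
{
	uint32_t expect = 0, newval = 1;
	struct cpu_op ops[2];

	memset(ops, 0, sizeof(ops));
	ops[0].op = CPU_COMPARE_EQ_OP;
	ops[0].len = sizeof(uint32_t);
	ops[0].u.compare_op.a = (unsigned long)lock_word;	/* encoding assumed */
	ops[0].u.compare_op.b = (unsigned long)&expect;
	ops[1].op = CPU_MEMCPY_OP;		/* single-word store (commit) */
	ops[1].len = sizeof(uint32_t);
	ops[1].u.memcpy_op.dst = (unsigned long)lock_word;
	ops[1].u.memcpy_op.src = (unsigned long)&newval;
	return syscall(__NR_cpu_opv, ops, 2, cpu, 0);
}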

2) per-cpu statistics counters

A per-cpu statistics counter can be implemented as a rseq consisting
of a final "add" instruction on a word as commit.

The cpu_opv fallback can be implemented as an "ADD" operation.

Besides statistics tracking, these counters can be used to implement
user-space RCU per-cpu grace period tracking for both single and
multi-process user-space RCU.

3) per-cpu LIFO linked-list (unlimited size stack)

A per-cpu LIFO linked-list has a "push" and "pop" operation,
which respectively adds an item to the list, and removes an
item from the list.

The "push" operation can be implemented as a rseq consisting of
a word comparison instruction against head followed by a word store
(commit) to head. Its cpu_opv fallback can be implemented as a
word-compare followed by word-store as well.

The "pop" operation can be implemented as a rseq consisting of
loading head, comparing it against NULL, loading the next pointer
at the right offset within the head item, and the next pointer as
a new head, returning the old head on success.

The cpu_opv fallback for "pop" differs from its rseq algorithm:
because cpu_opv requires knowing all pointers at system call entry
so it can pin all pages, cpu_opv cannot simply load head and then
load the head->next address within the preempt-off critical section.
User-space needs to pass the head and head->next addresses to the
kernel, and the kernel needs to check that the head address is
unchanged since it has been loaded by user-space. However, when
accessing head->next in an ABA situation, it's possible that head is
unchanged, but loading head->next can result in a page fault due to
a concurrently freed head object. This is why the "expect_fault"
operation field is introduced: if a fault is triggered by this
access, -EAGAIN will be returned by cpu_opv rather than -EFAULT,
thus indicating that the operation vector should be attempted again.
The "pop" operation can thus be
implemented as a word comparison of head against the head loaded
by user-space, followed by a load of the head->next pointer (which
may fault), and a store of that pointer as a new head.
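
A sketch of that "pop" fallback as an operation vector follows, with the same
caveats about field encoding as the spinlock sketch above; the list and node
types and the helper name are made up for illustration:

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/cpu_opv.h>

struct example_node { struct example_node *next; };
struct example_list { struct example_node *head; };

/*
 * op[0] checks that head is still the value user-space loaded before
 * the syscall; op[1] copies expect_head->next over the head pointer.
 * The copy source may have been freed and unmapped concurrently, so
 * expect_fault_src turns such a fault into -EAGAIN (retry) rather
 * than -EFAULT.
 */
static int percpu_list_pop_fallback(int cpu, struct example_list *list,
				    struct example_node *expect_head)
{
	struct cpu_op ops[2];

	memset(ops, 0, sizeof(ops));
	ops[0].op = CPU_COMPARE_EQ_OP;
	ops[0].len = sizeof(void *);
	ops[0].u.compare_op.a = (unsigned long)&list->head;
	ops[0].u.compare_op.b = (unsigned long)&expect_head;
	ops[1].op = CPU_MEMCPY_OP;
	ops[1].len = sizeof(void *);
	ops[1].u.memcpy_op.dst = (unsigned long)&list->head;
	ops[1].u.memcpy_op.src = (unsigned long)&expect_head->next;
	ops[1].u.memcpy_op.expect_fault_src = 1;
	return syscall(__NR_cpu_opv, ops, 2, cpu, 0);
}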

4) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)

This structure is useful for passing around allocated objects
by passing pointers through per-cpu fixed-sized stack.

The "push" side can be implemented with a check of the current
offset against the maximum buffer length, followed by a rseq
consisting of a comparison of the previously loaded offset
against the current offset, a word "try store" operation into the
next ring buffer array index (it's OK to abort after a try-store,
since it's not the commit, and its side-effect can be overwritten),
then followed by a word-store to increment the current offset (commit).

The "push" cpu_opv fallback can be done with the comparison, and
two consecutive word stores, all within the preempt-off section.

The "pop" side can be implemented with a check that offset is not
0 (whether the buffer is empty), a load of the "head" pointer before the
offset array index, followed by a rseq consisting of a word
comparison checking that the offset is unchanged since previously
loaded, another check ensuring that the "head" pointer is unchanged,
followed by a store decrementing the current offset.

The cpu_opv "pop" can be implemented with the same algorithm
as the rseq fast-path (compare, compare, store).

5) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)
   supporting "peek" from remote CPU

In order to implement work queues with work-stealing between CPUs, it is
useful to ensure the offset "commit" in scenario 4) "push" have a
store-release semantic, thus allowing remote CPU to load the offset
with acquire semantic, and load the top pointer, in order to check if
work-stealing should be performed. The task (work queue item) existence
should be protected by other means, e.g. RCU.

If the peek operation notices that work-stealing should indeed be
performed, a thread can use cpu_opv to move the task between per-cpu
workqueues, by first invoking cpu_opv passing the remote work queue
cpu number as argument to pop the task, and then again as "push" with
the target work queue CPU number.

6) per-cpu LIFO ring buffer with data copy (fixed-sized stack)
   (with and without acquire-release)

This structure is useful for passing around data without requiring
memory allocation by copying the data content into per-cpu fixed-sized
stack.

The "push" operation is performed with an offset comparison against
the buffer size (figuring out if the buffer is full), followed by
a rseq consisting of a comparison of the offset, a try-memcpy attempting
to copy the data content into the buffer (which can be aborted and
overwritten), and a final store incrementing the offset.

The cpu_opv fallback needs the same operations, except that the memcpy
is guaranteed to complete, given that it is performed with preemption
disabled. This requires a memcpy operation supporting lengths up to 4kB.

The "pop" operation is similar to the "push, except that the offset
is first compared to 0 to ensure the buffer is not empty. The
copy source is the ring buffer, and the destination is an output
buffer.

7) per-cpu FIFO ring buffer (fixed-sized queue)

This structure is useful wherever a FIFO behavior (queue) is needed.
One major use-case is tracer ring buffer.

An implementation of this ring buffer has a "reserve", followed by
serialization of multiple bytes into the buffer, ended by a "commit".
The "reserve" can be implemented as a rseq consisting of a word
comparison followed by a word store. The reserve operation moves the
producer "head". The multi-byte serialization can be performed
non-atomically. Finally, the "commit" update can be performed with
a rseq "add" commit instruction with store-release semantic. The
ring buffer consumer reads the commit value with load-acquire
semantic to know whenever it is safe to read from the ring buffer.

This use-case requires that both "reserve" and "commit" operations
be performed on the same per-cpu ring buffer, even if a migration
happens between those operations. In the typical case, both operations
will happen on the same CPU and use rseq. In the unlikely event of a
migration, the cpu_opv system call will ensure the commit can be
performed on the right CPU by migrating the task to that CPU.

On the consumer side, an alternative to using store-release and
load-acquire on the commit counter would be to use cpu_opv to
ensure the commit counter load is performed on the right CPU. This
effectively allows moving a consumer thread between CPUs to execute
close to the ring buffer cache lines it will read.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul Turner <pjt@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
Changes since v1:
- handle CPU hotplug,
- cleanup implementation using function pointers: We can use function
  pointers to implement the operations rather than duplicating all the
  user-access code.
- refuse device pages: Performing cpu_opv operations on io map'd pages
  with preemption disabled could generate long preempt-off critical
  sections, which leads to unwanted scheduler latency. Return EFAULT if
  a device page is received as parameter
- restrict op vector to 4216 bytes length sum: Restrict the operation
  vector to length sum of:
  - 4096 bytes (typical page size on most architectures, should be
    enough for a string, or structures)
  - 15 * 8 bytes (typical operations on integers or pointers).
  The goal here is to keep the duration of preempt off critical section
  short, so we don't add significant scheduler latency.
- Add INIT_ONSTACK macro: Introduce the
  CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users
  correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their
  stack to 0 on 32-bit architectures.
- Add CPU_MB_OP operation:
  Use-cases with:
  - two consecutive stores,
  - a memcpy followed by a store,
  require a memory barrier before the final store operation. A typical
  use-case is a store-release on the final store. Given that this is a
  slow path, just providing an explicit full barrier instruction should
  be sufficient.
- Add expect fault field:
  The use-case of list_pop brings interesting challenges. With rseq, we
  can use rseq_cmpnev_storeoffp_load(), and therefore load a pointer,
  compare it against NULL, add an offset, and load the target "next"
  pointer from the object, all within a single rseq critical section.

  Life is not so easy for cpu_opv in this use-case, mainly because we
  need to pin all pages we are going to touch in the preempt-off
  critical section beforehand. So we need to know the target object (in
  which we apply an offset to fetch the next pointer) when we pin pages
  before disabling preemption.

  So the approach is to load the head pointer and compare it against
  NULL in user-space, before doing the cpu_opv syscall. User-space can
  then compute the address of the head->next field, *without loading it*.

  The cpu_opv system call will first need to pin all pages associated
  with input data. This includes the page backing the head->next object,
  which may have been concurrently deallocated and unmapped. Therefore,
  in this case, getting -EFAULT when trying to pin those pages may
  happen: it just means they have been concurrently unmapped. This is
  an expected situation, and should just return -EAGAIN to user-space,
  so user-space can distinguish between "should retry" type of
  situations and actual errors that should be handled with extreme
  prejudice to the program (e.g. abort()).

  Therefore, add "expect_fault" fields along with op input address
  pointers, so user-space can identify whether a fault when getting a
  field should return EAGAIN rather than EFAULT.
- Add compiler barrier between operations: Adding a compiler barrier
  between store operations in a cpu_opv sequence can be useful when
  paired with membarrier system call.

  An algorithm with a paired slow path and fast path can use
  sys_membarrier on the slow path to replace fast-path memory barriers
  by compiler barrier.

  Adding an explicit compiler barrier between operations allows
  cpu_opv to be used as fallback for operations meant to match
  the membarrier system call.

Changes since v2:

- Fix memory leak by introducing struct cpu_opv_pinned_pages.
  Suggested by Boqun Feng.
- Cast argument 1 passed to access_ok from integer to void __user *,
  fixing sparse warning.

Changes since v3:

- Fix !SMP by adding push_task_to_cpu() empty static inline.
- Add missing sys_cpu_opv() asmlinkage declaration to
  include/linux/syscalls.h.

Changes since v4:

- Cleanup based on Thomas Gleixner's feedback.
- Handle retry in case where the scheduler migrates the thread away
  from the target CPU after migration within the syscall rather than
  returning EAGAIN to user-space.
- Move push_task_to_cpu() to its own patch.
- New scheme for touching user-space memory:
   1) get_user_pages_fast() to pin/get all pages (which can sleep),
   2) vm_map_ram() those pages
   3) grab mmap_sem (read lock)
   4) __get_user_pages_fast() (or get_user_pages() on failure)
      -> Confirm that the same page pointers are returned. This
         catches cases where COW mappings are changed concurrently.
      -> If page pointers differ, or on gup failure, release mmap_sem,
         vm_unmap_ram/put_page and retry from step (1).
      -> perform put_page on the extra reference immediately for each
         page.
   5) preempt disable
   6) Perform operations on vmap. Those operations are normal
      loads/stores/memcpy.
   7) preempt enable
   8) release mmap_sem
   9) vm_unmap_ram() all virtual addresses
  10) put_page() all pages
- Handle architectures with VIVT caches along with vmap(): call
  flush_kernel_vmap_range() after each "write" operation. This
  ensures that the user-space mapping and vmap reach a consistent
  state between each operation.
- Depend on MMU for is_zero_pfn(). e.g. Blackfin and SH architectures
  don't provide the zero_pfn symbol.

Changes since v5:

- Fix handling of push_task_to_cpu() when argument is a cpu which is
  not part of the task's allowed cpu mask.
- Add CPU_OP_NR_FLAG flag, which returns the number of operations
  supported by the system call.

Changes since v6:

- Use __u* in public uapi header rather than uint*_t.
- Disallow cpu_opv targeting noncached vma, which requires using
  get_user_pages() rather than get_user_pages_fast() to get the
  vma.
- Fix handling of vm_map_ram() errors by increasing nr_vaddr only after
  success.
- Issue vm_unmap_aliases() after each cpu_opv system call, thus ensuring
  lazy unmapping does not exhaust vmalloc address space in stress-tests on
  32-bit systems.
- Use vm_map_user_ram() and vm_unmap_user_ram() to ensure cache coherency
  on virtually aliased architectures.

Changes since v7:

- Adapt to removal of types_32_64.h.

---
Man page associated:

CPU_OPV(2)             Linux Programmer's Manual            CPU_OPV(2)

NAME
       cpu_opv - CPU preempt-off operation vector system call

SYNOPSIS
       #include <linux/cpu_opv.h>

       int cpu_opv(struct cpu_op *cpu_opv, int cpuopcnt, int cpu,
                   int flags);

DESCRIPTION
       The  cpu_opv  system  call  executes  a vector of operations on
       behalf of user-space on a specific  CPU  with  preemption  dis‐
       abled.

       The  term  CPU  used in this documentation refers to a hardware
       execution context.

       The operations available are: comparison, memcpy, add, or, and,
       xor, left shift, right shift, and memory  barrier.  The  system
       call  receives  a CPU number from user-space as argument, which
       is the CPU on which those operations need to be performed.  All
       pointers  in  the ops must have been set up to point to the per
       CPU memory of the CPU on which the operations  should  be  exe‐
       cuted. The "comparison" operation can be used to check that the
       data used in the preparation step did not change between prepa‐
       ration of system call inputs and operation execution within the
       preempt-off critical section.

       An overall maximum of 4216 bytes is enforced on the sum of
       operation lengths within an operation vector, so user-space
       cannot generate an overly long preempt-off critical section.
       Each operation is also limited to a length of 4096 bytes. A
       maximum limit of 16 operations per cpu_opv syscall invocation
       is enforced.

       If  the  thread  is  not  running  on  the requested CPU, it is
       migrated to it.

       The layout of struct cpu_opv is as follows:

       Fields

          op Operation of type  enum  cpu_op_type  to  perform.  This
              operation type selects the associated "u" union field.

           len
              Length  (in  bytes)  of data to consider for this opera‐
              tion.

           u.compare_op
              For a CPU_COMPARE_EQ_OP , and CPU_COMPARE_NE_OP  ,  con‐
              tains   the   a   and   b   pointers   to  compare.  The
              expect_fault_a  and   expect_fault_b   fields   indicate
              whether  a  page  fault  should  be expected for each of
              those pointers.  If expect_fault_a ,  or  expect_fault_b
              is  set,  EAGAIN  is  returned  on fault, else EFAULT is
              returned. The len field is allowed to take values from 0
              to 4096 for comparison operations.

           u.memcpy_op
              For a CPU_MEMCPY_OP , contains the dst and src pointers,
              expressing a copy of src into dst. The  expect_fault_dst
              and  expect_fault_src  fields  indicate  whether  a page
              fault should be expected for each of those pointers.  If
              expect_fault_dst , or expect_fault_src is set, EAGAIN is
              returned on fault, else  EFAULT  is  returned.  The  len
              field  is allowed to take values from 0 to 4096 for mem‐
              cpy operations.

           u.arithmetic_op
              For a  CPU_ADD_OP  ,  contains  the  p  ,  count  ,  and
              expect_fault_p  fields, which are respectively a pointer
              to the memory location to increment, the  64-bit  signed
              integer value to add, and whether a page fault should be
              expected for p .  If expect_fault_p is  set,  EAGAIN  is
              returned  on  fault,  else  EFAULT  is returned. The len
              field is allowed to take values of 1, 2, 4, 8 bytes  for
              arithmetic operations.

           u.bitwise_op
              For a CPU_OR_OP , CPU_AND_OP , and CPU_XOR_OP , contains
              the p , mask ,  and  expect_fault_p  fields,  which  are
              respectively a pointer to the memory location to target,
              the mask to apply, and whether a page  fault  should  be
              expected  for  p  .  If expect_fault_p is set, EAGAIN is
              returned on fault, else  EFAULT  is  returned.  The  len
              field  is allowed to take values of 1, 2, 4, 8 bytes for
              bitwise operations.

           u.shift_op
              For a CPU_LSHIFT_OP , and CPU_RSHIFT_OP , contains the p
              ,  bits  ,  and expect_fault_p fields, which are respec‐
              tively a pointer to the memory location to  target,  the
              number  of  bits  to  shift  either  left  of right, and
              whether a page fault should be  expected  for  p  .   If
              expect_fault_p is set, EAGAIN is returned on fault, else
              EFAULT is returned. The len field  is  allowed  to  take
              values  of  1,  2,  4, 8 bytes for shift operations. The
              bits field is allowed to take values between 0 and 63.

       The enum cpu_op_types contains the following operations:

       · CPU_COMPARE_EQ_OP: Compare whether two memory  locations  are
         equal,

       · CPU_COMPARE_NE_OP:  Compare whether two memory locations dif‐
         fer,

       · CPU_MEMCPY_OP: Copy a source memory location into a  destina‐
         tion,

        · CPU_ADD_OP: Increment a target memory location  by  a  given
          count,

        · CPU_OR_OP: Apply an "or" mask to a memory location,

        · CPU_AND_OP: Apply an "and" mask to a memory location,

        · CPU_XOR_OP: Apply a "xor" mask to a memory location,

        · CPU_LSHIFT_OP: Shift a memory location left by a given number
          of bits,

        · CPU_RSHIFT_OP: Shift a memory location right by a given  num‐
          ber of bits,

       · CPU_MB_OP: Issue a memory barrier.

         All of the operations  above  provide  single-copy  atomicity
         guarantees  for word-sized, word-aligned target pointers, for
         both loads and stores.

       The cpuopcnt argument is the number of elements in the  cpu_opv
       array. It can take values from 0 to 16.

       The  cpu  argument  is  the  CPU  number on which the operation
       sequence needs to be executed.

       The flags argument is a bitmask. When  CPU_OP_NR_FLAG  is  set,
       the  cpu_opv()  system  call  returns  the number of operations
       available. When flags is 0, the sequence of operations received
       as parameter is performed.
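
        For illustration only (not part of this  manual  page):  the
        number of operation types supported by the running kernel can
        be queried with a raw syscall(2) wrapper, assuming the
        __NR_cpu_opv number wired up later in this series:

            int nr_ops = syscall(__NR_cpu_opv, NULL, 0, 0, CPU_OP_NR_FLAG);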

RETURN VALUE
       A  return  value  of  0  indicates  success.  On  error,  -1 is
       returned, and errno is set appropriately. If a comparison oper‐
       ation  fails, execution of the operation vector is stopped, and
       the return value is the index after  the  comparison  operation
       (values between 1 and 16).

ERRORS
        EAGAIN The cpu_opv() system call should be attempted again.

       EINVAL Either  flags contains an invalid value, or cpu contains
              an invalid value or a value not allowed by  the  current
              thread's  allowed  cpu  mask,  or  cpuopcnt  contains an
              invalid value, or the cpu_opv operation vector  contains
              an  invalid  op  value,  or the cpu_opv operation vector
              contains an invalid len value, or the cpu_opv  operation
              vector sum of len values is too large.

       ENOSYS The  cpu_opv()  system  call  is not implemented by this
              kernel.

       EFAULT cpu_opv is an invalid address, or  a  pointer  contained
              within  an  operation  is  invalid  (and  a fault is not
              expected for that pointer). Pointers to device and  non‐
              cached   memory   within  an  operation  are  considered
              invalid.

VERSIONS
       The cpu_opv() system call was added in Linux 4.X (TODO).

CONFORMING TO
       cpu_opv() is Linux-specific.

SEE ALSO
       membarrier(2), rseq(2)

Linux                         2018-03-22                    CPU_OPV(2)
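
A minimal userspace sketch of the interface documented above (mine, not
part of the patch): a compare-and-store on a per-cpu slot expressed as a
two-operation vector. It assumes the __NR_cpu_opv number wired up by the
later arch patches in this series, and uses the struct cpu_op layout from
the uapi header added below.

#include <stdint.h>
#include <string.h>
#include <syscall.h>
#include <unistd.h>
#include <linux/cpu_opv.h>

/* Return 0 on success, 1 if *slot != expect, -1 on error (see ERRORS). */
static int percpu_cmpxchg_store(intptr_t *slot, intptr_t expect,
				intptr_t newval, int cpu)
{
	struct cpu_op ops[2];

	memset(ops, 0, sizeof(ops));	/* expect_fault_* default to 0. */
	/* Stop and return index 1 if the comparison fails. */
	ops[0].op = CPU_COMPARE_EQ_OP;
	ops[0].len = sizeof(intptr_t);
	ops[0].u.compare_op.a = (unsigned long)slot;
	ops[0].u.compare_op.b = (unsigned long)&expect;
	/* Otherwise store newval into *slot. */
	ops[1].op = CPU_MEMCPY_OP;
	ops[1].len = sizeof(intptr_t);
	ops[1].u.memcpy_op.dst = (unsigned long)slot;
	ops[1].u.memcpy_op.src = (unsigned long)&newval;

	/* flags == 0: execute the vector on the requested cpu. */
	return syscall(__NR_cpu_opv, ops, 2, cpu, 0);
}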
---
 MAINTAINERS                  |    7 +
 include/linux/syscalls.h     |    3 +
 include/uapi/linux/cpu_opv.h |  114 +++++
 init/Kconfig                 |   17 +
 kernel/Makefile              |    1 +
 kernel/cpu_opv.c             | 1117 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sys_ni.c              |    1 +
 7 files changed, 1260 insertions(+)
 create mode 100644 include/uapi/linux/cpu_opv.h
 create mode 100644 kernel/cpu_opv.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 48a65c3a4189..3b50578fc5d9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3858,6 +3858,13 @@ B:	https://bugzilla.kernel.org
 F:	drivers/cpuidle/*
 F:	include/linux/cpuidle.h
 
+CPU NON-PREEMPTIBLE OPERATION VECTOR SUPPORT
+M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	kernel/cpu_opv.c
+F:	include/uapi/linux/cpu_opv.h
+
 CRAMFS FILESYSTEM
 M:	Nicolas Pitre <nico@linaro.org>
 S:	Maintained
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2ff814c92f7f..c5af29eccd0e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -68,6 +68,7 @@ struct perf_event_attr;
 struct file_handle;
 struct sigaltstack;
 struct rseq;
+struct cpu_op;
 union bpf_attr;
 
 #include <linux/types.h>
@@ -906,6 +907,8 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
 			  unsigned mask, struct statx __user *buffer);
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
 			 int flags, uint32_t sig);
+asmlinkage long sys_cpu_opv(struct cpu_op __user *ucpuopv, int cpuopcnt,
+			int cpu, int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/cpu_opv.h b/include/uapi/linux/cpu_opv.h
new file mode 100644
index 000000000000..a57eb939efde
--- /dev/null
+++ b/include/uapi/linux/cpu_opv.h
@@ -0,0 +1,114 @@
+#ifndef _UAPI_LINUX_CPU_OPV_H
+#define _UAPI_LINUX_CPU_OPV_H
+
+/*
+ * linux/cpu_opv.h
+ *
+ * CPU preempt-off operation vector system call API
+ *
+ * Copyright (c) 2017 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/types.h>
+
+#define CPU_OP_VEC_LEN_MAX		16
+#define CPU_OP_ARG_LEN_MAX		24
+/* Maximum data len per operation. */
+#define CPU_OP_DATA_LEN_MAX		4096
+/*
+ * Maximum data len for overall vector. Restrict the amount of user-space
+ * data touched by the kernel in non-preemptible context, so it does not
+ * introduce long scheduler latencies.
+ * This allows one copy of up to 4096 bytes, and 15 operations touching 8
+ * bytes each.
+ * This limit is applied to the sum of lengths specified for all operations
+ * in a vector.
+ */
+#define CPU_OP_MEMCPY_EXPECT_LEN	4096
+#define CPU_OP_EXPECT_LEN		8
+#define CPU_OP_VEC_DATA_LEN_MAX		\
+	(CPU_OP_MEMCPY_EXPECT_LEN +	\
+	 (CPU_OP_VEC_LEN_MAX - 1) * CPU_OP_EXPECT_LEN)
+
+enum cpu_op_flags {
+	CPU_OP_NR_FLAG =	(1U << 0),
+};
+
+enum cpu_op_type {
+	/* compare */
+	CPU_COMPARE_EQ_OP,
+	CPU_COMPARE_NE_OP,
+	/* memcpy */
+	CPU_MEMCPY_OP,
+	/* arithmetic */
+	CPU_ADD_OP,
+	/* bitwise */
+	CPU_OR_OP,
+	CPU_AND_OP,
+	CPU_XOR_OP,
+	/* shift */
+	CPU_LSHIFT_OP,
+	CPU_RSHIFT_OP,
+	/* memory barrier */
+	CPU_MB_OP,
+
+	NR_CPU_OPS,
+};
+
+/* Vector of operations to perform. Limited to 16. */
+struct cpu_op {
+	/* enum cpu_op_type. */
+	__s32 op;
+	/* data length, in bytes. */
+	__u32 len;
+	union {
+		struct {
+			__u64 a;
+			__u64 b;
+			__u8 expect_fault_a;
+			__u8 expect_fault_b;
+		} compare_op;
+		struct {
+			__u64 dst;
+			__u64 src;
+			__u8 expect_fault_dst;
+			__u8 expect_fault_src;
+		} memcpy_op;
+		struct {
+			__u64 p;
+			__s64 count;
+			__u8 expect_fault_p;
+		} arithmetic_op;
+		struct {
+			__u64 p;
+			__u64 mask;
+			__u8 expect_fault_p;
+		} bitwise_op;
+		struct {
+			__u64 p;
+			__u32 bits;
+			__u8 expect_fault_p;
+		} shift_op;
+		char __padding[CPU_OP_ARG_LEN_MAX];
+	} u;
+};
+
+#endif /* _UAPI_LINUX_CPU_OPV_H */
diff --git a/init/Kconfig b/init/Kconfig
index 1e234e2f1cba..413981ac1e4b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1483,6 +1483,8 @@ config RSEQ
 	bool "Enable rseq() system call" if EXPERT
 	default y
 	depends on HAVE_RSEQ
+	depends on MMU
+	select CPU_OPV
 	select MEMBARRIER
 	help
 	  Enable the restartable sequences system call. It provides a
@@ -1502,6 +1504,21 @@ config DEBUG_RSEQ
 
 	  If unsure, say N.
 
+# CPU_OPV depends on MMU for is_zero_pfn()
+config CPU_OPV
+	bool "Enable cpu_opv() system call" if EXPERT
+	default y
+	depends on MMU
+	help
+	  Enable the CPU preempt-off operation vector system call.
+	  It allows user-space to perform a sequence of operations on
+	  per-cpu data with preemption disabled. Useful as
+	  single-stepping fall-back for restartable sequences, and for
+	  performing more complex operations on per-cpu data that would
+	  not be otherwise possible to do with restartable sequences.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 7a63d567fdb5..507150b93521 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -116,6 +116,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_HAS_IOMEM) += iomem.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_RSEQ) += rseq.o
+obj-$(CONFIG_CPU_OPV) += cpu_opv.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c
new file mode 100644
index 000000000000..c4e4040bb5ff
--- /dev/null
+++ b/kernel/cpu_opv.c
@@ -0,0 +1,1117 @@
+/*
+ * CPU preempt-off operation vector system call
+ *
+ * It allows user-space to perform a sequence of operations on per-cpu
+ * data with preemption disabled. Useful as single-stepping fall-back
+ * for restartable sequences, and for performing more complex operations
+ * on per-cpu data that would not be otherwise possible to do with
+ * restartable sequences.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) 2017, EfficiOS Inc.,
+ * Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/cpu_opv.h>
+#include <linux/types.h>
+#include <linux/mutex.h>
+#include <linux/pagemap.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <asm/ptrace.h>
+#include <asm/byteorder.h>
+#include <asm/cacheflush.h>
+
+#include "sched/sched.h"
+
+/*
+ * Typical invocations of cpu_opv need few virtual address pointers. Keep
+ * those in an array on the stack of the cpu_opv system call up to
+ * this limit, beyond which the array is dynamically allocated.
+ */
+#define NR_VADDR_ON_STACK		8
+
+/* Maximum pages per op. */
+#define CPU_OP_MAX_PAGES		4
+
+/* Maximum number of virtual addresses per op. */
+#define CPU_OP_VEC_MAX_ADDR		(2 * CPU_OP_VEC_LEN_MAX)
+
+union op_fn_data {
+	uint8_t _u8;
+	uint16_t _u16;
+	uint32_t _u32;
+	uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+	uint32_t _u64_split[2];
+#endif
+};
+
+struct vaddr {
+	unsigned long mem;
+	unsigned long uaddr;
+	struct page *pages[2];
+	unsigned int nr_pages;
+	int write;
+};
+
+struct cpu_opv_vaddr {
+	struct vaddr *addr;
+	size_t nr_vaddr;
+	bool is_kmalloc;
+};
+
+typedef int (*op_fn_t)(union op_fn_data *data, uint64_t v, uint32_t len);
+
+/*
+ * Provide mutual exclusion for threads executing a cpu_opv against an
+ * offline CPU.
+ */
+static DEFINE_MUTEX(cpu_opv_offline_lock);
+
+/*
+ * The cpu_opv system call executes a vector of operations on behalf of
+ * user-space on a specific CPU with preemption disabled. It is inspired
+ * by readv() and writev() system calls which take a "struct iovec"
+ * array as argument.
+ *
+ * The operations available are: comparison, memcpy, add, or, and, xor,
+ * left shift, right shift, and memory barrier. The system call receives
+ * a CPU number from user-space as argument, which is the CPU on which
+ * those operations need to be performed.  All pointers in the ops must
+ * have been set up to point to the per CPU memory of the CPU on which
+ * the operations should be executed. The "comparison" operation can be
+ * used to check that the data used in the preparation step did not
+ * change between preparation of system call inputs and operation
+ * execution within the preempt-off critical section.
+ *
+ * The reason why we require all pointer offsets to be calculated by
+ * user-space beforehand is because we need to use get_user_pages()
+ * to first pin all pages touched by each operation. This takes care of
+ * faulting-in the pages. Then, preemption is disabled, and the
+ * operations are performed atomically with respect to other thread
+ * execution on that CPU, without generating any page fault.
+ *
+ * An overall maximum of 4216 bytes is enforced on the sum of operation
+ * lengths within an operation vector, so user-space cannot generate an
+ * overly long preempt-off critical section (cache-cold critical section
+ * duration measured as 4.7µs on x86-64). Each operation is also limited
+ * to a length of 4096 bytes, meaning that an operation can touch a
+ * maximum of 4 pages (memcpy: 2 pages for source, 2 pages for
+ * destination if addresses are not aligned on page boundaries).
+ *
+ * If the thread is not running on the requested CPU, it is migrated to
+ * it.
+ */
+
+static unsigned long cpu_op_range_nr_pages(unsigned long addr,
+					   unsigned long len)
+{
+	return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1;
+}
+
+static int cpu_op_count_pages(u64 addr, unsigned long len)
+{
+	unsigned long nr_pages;
+
+	/*
+	 * Validate that the address is within the process address space.
+	 * This allows cast of those addresses to unsigned long throughout the
+	 * rest of this system call, because it would be invalid to have an
+	 * address over 4GB on a 32-bit kernel.
+	 */
+	if (addr >= TASK_SIZE)
+		return -EINVAL;
+	if (!len)
+		return 0;
+	nr_pages = cpu_op_range_nr_pages(addr, len);
+	if (nr_pages > 2) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	return nr_pages;
+}
+
+static struct vaddr *cpu_op_alloc_vaddr_vector(int nr_vaddr)
+{
+	return kzalloc(nr_vaddr * sizeof(struct vaddr), GFP_KERNEL);
+}
+
+/*
+ * Check operation types and length parameters. Count number of pages.
+ */
+static int cpu_opv_check_op(struct cpu_op *op, int *nr_vaddr, uint32_t *sum)
+{
+	int ret;
+
+	switch (op->op) {
+	case CPU_MB_OP:
+		break;
+	default:
+		*sum += op->len;
+	}
+
+	/* Validate inputs. */
+	switch (op->op) {
+	case CPU_COMPARE_EQ_OP:
+	case CPU_COMPARE_NE_OP:
+	case CPU_MEMCPY_OP:
+		if (op->len > CPU_OP_DATA_LEN_MAX)
+			return -EINVAL;
+		break;
+	case CPU_ADD_OP:
+	case CPU_OR_OP:
+	case CPU_AND_OP:
+	case CPU_XOR_OP:
+		switch (op->len) {
+		case 1:
+		case 2:
+		case 4:
+		case 8:
+			break;
+		default:
+			return -EINVAL;
+		}
+		break;
+	case CPU_LSHIFT_OP:
+	case CPU_RSHIFT_OP:
+		switch (op->len) {
+		case 1:
+			if (op->u.shift_op.bits > 7)
+				return -EINVAL;
+			break;
+		case 2:
+			if (op->u.shift_op.bits > 15)
+				return -EINVAL;
+			break;
+		case 4:
+			if (op->u.shift_op.bits > 31)
+				return -EINVAL;
+			break;
+		case 8:
+			if (op->u.shift_op.bits > 63)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+		break;
+	case CPU_MB_OP:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/* Validate pointers, count pages and virtual addresses. */
+	switch (op->op) {
+	case CPU_COMPARE_EQ_OP:
+	case CPU_COMPARE_NE_OP:
+		ret = cpu_op_count_pages(op->u.compare_op.a, op->len);
+		if (ret < 0)
+			return ret;
+		ret = cpu_op_count_pages(op->u.compare_op.b, op->len);
+		if (ret < 0)
+			return ret;
+		*nr_vaddr += 2;
+		break;
+	case CPU_MEMCPY_OP:
+		ret = cpu_op_count_pages(op->u.memcpy_op.dst, op->len);
+		if (ret < 0)
+			return ret;
+		ret = cpu_op_count_pages(op->u.memcpy_op.src, op->len);
+		if (ret < 0)
+			return ret;
+		*nr_vaddr += 2;
+		break;
+	case CPU_ADD_OP:
+		ret = cpu_op_count_pages(op->u.arithmetic_op.p, op->len);
+		if (ret < 0)
+			return ret;
+		(*nr_vaddr)++;
+		break;
+	case CPU_OR_OP:
+	case CPU_AND_OP:
+	case CPU_XOR_OP:
+		ret = cpu_op_count_pages(op->u.bitwise_op.p, op->len);
+		if (ret < 0)
+			return ret;
+		(*nr_vaddr)++;
+		break;
+	case CPU_LSHIFT_OP:
+	case CPU_RSHIFT_OP:
+		ret = cpu_op_count_pages(op->u.shift_op.p, op->len);
+		if (ret < 0)
+			return ret;
+		(*nr_vaddr)++;
+		break;
+	case CPU_MB_OP:
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+/*
+ * Check operation types and length parameters. Count number of pages.
+ */
+static int cpu_opv_check(struct cpu_op *cpuopv, int cpuopcnt, int *nr_vaddr)
+{
+	uint32_t sum = 0;
+	int i, ret;
+
+	for (i = 0; i < cpuopcnt; i++) {
+		ret = cpu_opv_check_op(&cpuopv[i], nr_vaddr, &sum);
+		if (ret)
+			return ret;
+	}
+	if (sum > CPU_OP_VEC_DATA_LEN_MAX)
+		return -EINVAL;
+	return 0;
+}
+
+static int cpu_op_check_page(struct page *page, int write)
+{
+	struct address_space *mapping;
+
+	if (is_zone_device_page(page))
+		return -EFAULT;
+
+	/*
+	 * The page lock protects many things but in this context the page
+	 * lock stabilizes mapping, prevents inode freeing in the shared
+	 * file-backed region case and guards against movement to swap
+	 * cache.
+	 *
+	 * Strictly speaking the page lock is not needed in all cases being
+	 * considered here, and the page lock forces unnecessary serialization.
+	 * From this point on, mapping will be re-verified if necessary and
+	 * the page lock will be acquired only if it is unavoidable.
+	 *
+	 * Mapping checks require the head page for any compound page so the
+	 * head page and mapping is looked up now.
+	 */
+	page = compound_head(page);
+	mapping = READ_ONCE(page->mapping);
+
+	/*
+	 * If page->mapping is NULL, then it cannot be a PageAnon page;
+	 * but it might be the ZERO_PAGE (which is OK to read from), or
+	 * in the gate area or in a special mapping (for which this
+	 * check should fail); or it may have been a good file page when
+	 * get_user_pages found it, but truncated or holepunched or
+	 * subjected to invalidate_complete_page2 before the page lock
+	 * is acquired (also cases which should fail). Given that a
+	 * reference to the page is currently held, refcount care in
+	 * invalidate_complete_page's remove_mapping prevents
+	 * drop_caches from setting mapping to NULL concurrently.
+	 *
+	 * The case to guard against is when memory pressure causes
+	 * shmem_writepage to move the page from filecache to swapcache
+	 * concurrently: an unlikely race, but a retry for page->mapping
+	 * is required in that situation.
+	 */
+	if (!mapping) {
+		int shmem_swizzled;
+
+		/*
+		 * Check again with page lock held to guard against
+		 * memory pressure making shmem_writepage move the page
+		 * from filecache to swapcache.
+		 */
+		lock_page(page);
+		shmem_swizzled = PageSwapCache(page) || page->mapping;
+		unlock_page(page);
+		if (shmem_swizzled)
+			return -EAGAIN;
+		/*
+		 * It is valid to read from, but invalid to write to the
+		 * ZERO_PAGE.
+		 */
+		if (!(is_zero_pfn(page_to_pfn(page)) ||
+		      is_huge_zero_page(page)) || write)
+			return -EFAULT;
+	}
+	return 0;
+}
+
+static int cpu_op_check_pages(struct page **pages,
+			      unsigned long nr_pages,
+			      int write)
+{
+	unsigned long i;
+
+	for (i = 0; i < nr_pages; i++) {
+		int ret;
+
+		ret = cpu_op_check_page(pages[i], write);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static int cpu_op_pin_pages(unsigned long addr, unsigned long len,
+			    struct cpu_opv_vaddr *vaddr_ptrs,
+			    unsigned long *vaddr, int write)
+{
+	struct page *pages[2];
+	struct vm_area_struct *vmas[2];
+	int ret, nr_pages, nr_put_pages, n;
+	unsigned long _vaddr;
+	struct vaddr *va;
+	struct mm_struct *mm = current->mm;
+
+	nr_pages = cpu_op_count_pages(addr, len);
+	if (nr_pages <= 0)
+		return nr_pages;
+again:
+	down_read(&mm->mmap_sem);
+	ret = get_user_pages(addr, nr_pages, write ? FOLL_WRITE : 0, pages,
+			     vmas);
+	if (ret < nr_pages) {
+		if (ret >= 0) {
+			nr_put_pages = ret;
+			ret = -EFAULT;
+		} else {
+			nr_put_pages = 0;
+		}
+		up_read(&mm->mmap_sem);
+		goto error;
+	}
+	/*
+	 * cpu_opv() accesses its own cached mapping of the userspace pages.
+	 * Considering that concurrent noncached and cached accesses may lead
+	 * to unexpected results in terms of memory consistency, explicitly
+	 * disallow cpu_opv on noncached memory.
+	 */
+	for (n = 0; n < nr_pages; n++) {
+		if (is_vma_noncached(vmas[n])) {
+			nr_put_pages = nr_pages;
+			ret = -EFAULT;
+			up_read(&mm->mmap_sem);
+			goto error;
+		}
+	}
+	up_read(&mm->mmap_sem);
+	ret = cpu_op_check_pages(pages, nr_pages, write);
+	if (ret) {
+		nr_put_pages = nr_pages;
+		goto error;
+	}
+	_vaddr = (unsigned long)vm_map_user_ram(pages, nr_pages, addr,
+						numa_node_id(), PAGE_KERNEL);
+	if (!_vaddr) {
+		nr_put_pages = nr_pages;
+		ret = -ENOMEM;
+		goto error;
+	}
+	va = &vaddr_ptrs->addr[vaddr_ptrs->nr_vaddr++];
+	va->mem = _vaddr;
+	va->uaddr = addr;
+	for (n = 0; n < nr_pages; n++)
+		va->pages[n] = pages[n];
+	va->nr_pages = nr_pages;
+	va->write = write;
+	*vaddr = _vaddr + (addr & ~PAGE_MASK);
+	return 0;
+
+error:
+	for (n = 0; n < nr_put_pages; n++)
+		put_page(pages[n]);
+	/*
+	 * Retry if a page has been faulted in, or is being swapped in.
+	 */
+	if (ret == -EAGAIN)
+		goto again;
+	return ret;
+}
+
+static int cpu_opv_pin_pages_op(struct cpu_op *op,
+				struct cpu_opv_vaddr *vaddr_ptrs,
+				bool *expect_fault)
+{
+	int ret;
+	unsigned long vaddr = 0;
+
+	switch (op->op) {
+	case CPU_COMPARE_EQ_OP:
+	case CPU_COMPARE_NE_OP:
+		ret = -EFAULT;
+		*expect_fault = op->u.compare_op.expect_fault_a;
+		if (!access_ok(VERIFY_READ,
+			       (void __user *)(unsigned long)op->u.compare_op.a,
+			       op->len))
+			return ret;
+		ret = cpu_op_pin_pages(op->u.compare_op.a, op->len,
+				       vaddr_ptrs, &vaddr, 0);
+		if (ret)
+			return ret;
+		op->u.compare_op.a = vaddr;
+		ret = -EFAULT;
+		*expect_fault = op->u.compare_op.expect_fault_b;
+		if (!access_ok(VERIFY_READ,
+			       (void __user *)(unsigned long)op->u.compare_op.b,
+			       op->len))
+			return ret;
+		ret = cpu_op_pin_pages(op->u.compare_op.b, op->len,
+				       vaddr_ptrs, &vaddr, 0);
+		if (ret)
+			return ret;
+		op->u.compare_op.b = vaddr;
+		break;
+	case CPU_MEMCPY_OP:
+		ret = -EFAULT;
+		*expect_fault = op->u.memcpy_op.expect_fault_dst;
+		if (!access_ok(VERIFY_WRITE,
+			       (void __user *)(unsigned long)op->u.memcpy_op.dst,
+			       op->len))
+			return ret;
+		ret = cpu_op_pin_pages(op->u.memcpy_op.dst, op->len,
+				       vaddr_ptrs, &vaddr, 1);
+		if (ret)
+			return ret;
+		op->u.memcpy_op.dst = vaddr;
+		ret = -EFAULT;
+		*expect_fault = op->u.memcpy_op.expect_fault_src;
+		if (!access_ok(VERIFY_READ,
+			       (void __user *)(unsigned long)op->u.memcpy_op.src,
+			       op->len))
+			return ret;
+		ret = cpu_op_pin_pages(op->u.memcpy_op.src, op->len,
+				       vaddr_ptrs, &vaddr, 0);
+		if (ret)
+			return ret;
+		op->u.memcpy_op.src = vaddr;
+		break;
+	case CPU_ADD_OP:
+		ret = -EFAULT;
+		*expect_fault = op->u.arithmetic_op.expect_fault_p;
+		if (!access_ok(VERIFY_WRITE,
+			       (void __user *)(unsigned long)op->u.arithmetic_op.p,
+			       op->len))
+			return ret;
+		ret = cpu_op_pin_pages(op->u.arithmetic_op.p, op->len,
+				       vaddr_ptrs, &vaddr, 1);
+		if (ret)
+			return ret;
+		op->u.arithmetic_op.p = vaddr;
+		break;
+	case CPU_OR_OP:
+	case CPU_AND_OP:
+	case CPU_XOR_OP:
+		ret = -EFAULT;
+		*expect_fault = op->u.bitwise_op.expect_fault_p;
+		if (!access_ok(VERIFY_WRITE,
+			       (void __user *)(unsigned long)op->u.bitwise_op.p,
+			       op->len))
+			return ret;
+		ret = cpu_op_pin_pages(op->u.bitwise_op.p, op->len,
+				       vaddr_ptrs, &vaddr, 1);
+		if (ret)
+			return ret;
+		op->u.bitwise_op.p = vaddr;
+		break;
+	case CPU_LSHIFT_OP:
+	case CPU_RSHIFT_OP:
+		ret = -EFAULT;
+		*expect_fault = op->u.shift_op.expect_fault_p;
+		if (!access_ok(VERIFY_WRITE,
+			       (void __user *)(unsigned long)op->u.shift_op.p,
+			       op->len))
+			return ret;
+		ret = cpu_op_pin_pages(op->u.shift_op.p, op->len,
+				       vaddr_ptrs, &vaddr, 1);
+		if (ret)
+			return ret;
+		op->u.shift_op.p = vaddr;
+		break;
+	case CPU_MB_OP:
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt,
+			     struct cpu_opv_vaddr *vaddr_ptrs)
+{
+	int ret, i;
+	bool expect_fault = false;
+
+	/* Check access, pin pages. */
+	for (i = 0; i < cpuopcnt; i++) {
+		ret = cpu_opv_pin_pages_op(&cpuop[i], vaddr_ptrs,
+					   &expect_fault);
+		if (ret)
+			goto error;
+	}
+	return 0;
+
+error:
+	/*
+	 * If faulting access is expected, return EAGAIN to user-space.
+	 * This allows user-space to distinguish a fault caused by
+	 * an access which is expected to fault (e.g. due to concurrent
+	 * unmapping of underlying memory) from an unexpected fault from
+	 * which a retry would not recover.
+	 */
+	if (ret == -EFAULT && expect_fault)
+		return -EAGAIN;
+	return ret;
+}
+
+static int __op_get(union op_fn_data *data, void *p, size_t len)
+{
+	switch (len) {
+	case 1:
+		data->_u8 = READ_ONCE(*(uint8_t *)p);
+		break;
+	case 2:
+		data->_u16 = READ_ONCE(*(uint16_t *)p);
+		break;
+	case 4:
+		data->_u32 = READ_ONCE(*(uint32_t *)p);
+		break;
+	case 8:
+#if (BITS_PER_LONG == 64)
+		data->_u64 = READ_ONCE(*(uint64_t *)p);
+#else
+	{
+		data->_u64_split[0] = READ_ONCE(*(uint32_t *)p);
+		data->_u64_split[1] = READ_ONCE(*((uint32_t *)p + 1));
+	}
+#endif
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int __op_put(union op_fn_data *data, void *p, size_t len)
+{
+	switch (len) {
+	case 1:
+		WRITE_ONCE(*(uint8_t *)p, data->_u8);
+		break;
+	case 2:
+		WRITE_ONCE(*(uint16_t *)p, data->_u16);
+		break;
+	case 4:
+		WRITE_ONCE(*(uint32_t *)p, data->_u32);
+		break;
+	case 8:
+#if (BITS_PER_LONG == 64)
+		WRITE_ONCE(*(uint64_t *)p, data->_u64);
+#else
+	{
+		WRITE_ONCE(*(uint32_t *)p, data->_u64_split[0]);
+		WRITE_ONCE(*((uint32_t *)p + 1), data->_u64_split[1]);
+	}
+#endif
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+/* Return 0 if same, > 0 if different, < 0 on error. */
+static int do_cpu_op_compare(unsigned long _a, unsigned long _b, uint32_t len)
+{
+	void *a = (void *)_a;
+	void *b = (void *)_b;
+	union op_fn_data tmp[2];
+	int ret;
+
+	switch (len) {
+	case 1:
+	case 2:
+	case 4:
+	case 8:
+		if (!IS_ALIGNED(_a, len) || !IS_ALIGNED(_b, len))
+			goto memcmp;
+		break;
+	default:
+		goto memcmp;
+	}
+
+	ret = __op_get(&tmp[0], a, len);
+	if (ret)
+		return ret;
+	ret = __op_get(&tmp[1], b, len);
+	if (ret)
+		return ret;
+
+	switch (len) {
+	case 1:
+		ret = !!(tmp[0]._u8 != tmp[1]._u8);
+		break;
+	case 2:
+		ret = !!(tmp[0]._u16 != tmp[1]._u16);
+		break;
+	case 4:
+		ret = !!(tmp[0]._u32 != tmp[1]._u32);
+		break;
+	case 8:
+		ret = !!(tmp[0]._u64 != tmp[1]._u64);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return ret;
+
+memcmp:
+	if (memcmp(a, b, len))
+		return 1;
+	return 0;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_memcpy(unsigned long _dst, unsigned long _src,
+			    uint32_t len)
+{
+	void *dst = (void *)_dst;
+	void *src = (void *)_src;
+	union op_fn_data tmp;
+	int ret;
+
+	switch (len) {
+	case 1:
+	case 2:
+	case 4:
+	case 8:
+		if (!IS_ALIGNED(_dst, len) || !IS_ALIGNED(_src, len))
+			goto memcpy;
+		break;
+	default:
+		goto memcpy;
+	}
+
+	ret = __op_get(&tmp, src, len);
+	if (ret)
+		return ret;
+	return __op_put(&tmp, dst, len);
+
+memcpy:
+	memcpy(dst, src, len);
+	return 0;
+}
+
+static int op_add_fn(union op_fn_data *data, uint64_t count, uint32_t len)
+{
+	switch (len) {
+	case 1:
+		data->_u8 += (uint8_t)count;
+		break;
+	case 2:
+		data->_u16 += (uint16_t)count;
+		break;
+	case 4:
+		data->_u32 += (uint32_t)count;
+		break;
+	case 8:
+		data->_u64 += (uint64_t)count;
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int op_or_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
+{
+	switch (len) {
+	case 1:
+		data->_u8 |= (uint8_t)mask;
+		break;
+	case 2:
+		data->_u16 |= (uint16_t)mask;
+		break;
+	case 4:
+		data->_u32 |= (uint32_t)mask;
+		break;
+	case 8:
+		data->_u64 |= (uint64_t)mask;
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int op_and_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
+{
+	switch (len) {
+	case 1:
+		data->_u8 &= (uint8_t)mask;
+		break;
+	case 2:
+		data->_u16 &= (uint16_t)mask;
+		break;
+	case 4:
+		data->_u32 &= (uint32_t)mask;
+		break;
+	case 8:
+		data->_u64 &= (uint64_t)mask;
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int op_xor_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
+{
+	switch (len) {
+	case 1:
+		data->_u8 ^= (uint8_t)mask;
+		break;
+	case 2:
+		data->_u16 ^= (uint16_t)mask;
+		break;
+	case 4:
+		data->_u32 ^= (uint32_t)mask;
+		break;
+	case 8:
+		data->_u64 ^= (uint64_t)mask;
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int op_lshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
+{
+	switch (len) {
+	case 1:
+		data->_u8 <<= (uint8_t)bits;
+		break;
+	case 2:
+		data->_u16 <<= (uint16_t)bits;
+		break;
+	case 4:
+		data->_u32 <<= (uint32_t)bits;
+		break;
+	case 8:
+		data->_u64 <<= (uint64_t)bits;
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int op_rshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
+{
+	switch (len) {
+	case 1:
+		data->_u8 >>= (uint8_t)bits;
+		break;
+	case 2:
+		data->_u16 >>= (uint16_t)bits;
+		break;
+	case 4:
+		data->_u32 >>= (uint32_t)bits;
+		break;
+	case 8:
+		data->_u64 >>= (uint64_t)bits;
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_fn(op_fn_t op_fn, unsigned long _p, uint64_t v,
+			uint32_t len)
+{
+	union op_fn_data tmp;
+	void *p = (void *)_p;
+	int ret;
+
+	ret = __op_get(&tmp, p, len);
+	if (ret)
+		return ret;
+	ret = op_fn(&tmp, v, len);
+	if (ret)
+		return ret;
+	ret = __op_put(&tmp, p, len);
+	if (ret)
+		return ret;
+	return 0;
+}
+
+/*
+ * Return negative value on error, positive value if comparison
+ * fails, 0 on success.
+ */
+static int __do_cpu_opv_op(struct cpu_op *op)
+{
+	/* Guarantee a compiler barrier between each operation. */
+	barrier();
+
+	switch (op->op) {
+	case CPU_COMPARE_EQ_OP:
+		return do_cpu_op_compare(op->u.compare_op.a,
+					 op->u.compare_op.b,
+					 op->len);
+	case CPU_COMPARE_NE_OP:
+	{
+		int ret;
+
+		ret = do_cpu_op_compare(op->u.compare_op.a,
+					op->u.compare_op.b,
+					op->len);
+		if (ret < 0)
+			return ret;
+		/*
+		 * Stop execution, return positive value if comparison
+		 * is identical.
+		 */
+		if (ret == 0)
+			return 1;
+		return 0;
+	}
+	case CPU_MEMCPY_OP:
+		return do_cpu_op_memcpy(op->u.memcpy_op.dst,
+					op->u.memcpy_op.src,
+					op->len);
+	case CPU_ADD_OP:
+		return do_cpu_op_fn(op_add_fn, op->u.arithmetic_op.p,
+				    op->u.arithmetic_op.count, op->len);
+	case CPU_OR_OP:
+		return do_cpu_op_fn(op_or_fn, op->u.bitwise_op.p,
+				    op->u.bitwise_op.mask, op->len);
+	case CPU_AND_OP:
+		return do_cpu_op_fn(op_and_fn, op->u.bitwise_op.p,
+				    op->u.bitwise_op.mask, op->len);
+	case CPU_XOR_OP:
+		return do_cpu_op_fn(op_xor_fn, op->u.bitwise_op.p,
+				    op->u.bitwise_op.mask, op->len);
+	case CPU_LSHIFT_OP:
+		return do_cpu_op_fn(op_lshift_fn, op->u.shift_op.p,
+				    op->u.shift_op.bits, op->len);
+	case CPU_RSHIFT_OP:
+		return do_cpu_op_fn(op_rshift_fn, op->u.shift_op.p,
+				    op->u.shift_op.bits, op->len);
+	case CPU_MB_OP:
+		/* Memory barrier provided by this operation. */
+		smp_mb();
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
+static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt)
+{
+	int i, ret;
+
+	for (i = 0; i < cpuopcnt; i++) {
+		ret = __do_cpu_opv_op(&cpuop[i]);
+		/* If comparison fails, stop execution and return index + 1. */
+		if (ret > 0)
+			return i + 1;
+		/* On error, stop execution. */
+		if (ret < 0)
+			return ret;
+	}
+	return 0;
+}
+
+/*
+ * Check that the page pointers pinned by get_user_pages()
+ * are still in the page table. Invoked with mmap_sem held.
+ * Return 0 if pointers match, -EAGAIN if they don't.
+ */
+static int vaddr_check(struct vaddr *vaddr)
+{
+	struct page *pages[2];
+	int ret, n;
+
+	ret = __get_user_pages_fast(vaddr->uaddr, vaddr->nr_pages,
+				    vaddr->write, pages);
+	for (n = 0; n < ret; n++)
+		put_page(pages[n]);
+	if (ret < vaddr->nr_pages) {
+		ret = get_user_pages(vaddr->uaddr, vaddr->nr_pages,
+				     vaddr->write ? FOLL_WRITE : 0,
+				     pages, NULL);
+		if (ret < 0)
+			return -EAGAIN;
+		for (n = 0; n < ret; n++)
+			put_page(pages[n]);
+		if (ret < vaddr->nr_pages)
+			return -EAGAIN;
+	}
+	for (n = 0; n < vaddr->nr_pages; n++) {
+		if (pages[n] != vaddr->pages[n])
+			return -EAGAIN;
+	}
+	return 0;
+}
+
+static int vaddr_ptrs_check(struct cpu_opv_vaddr *vaddr_ptrs)
+{
+	int i;
+
+	for (i = 0; i < vaddr_ptrs->nr_vaddr; i++) {
+		int ret;
+
+		ret = vaddr_check(&vaddr_ptrs->addr[i]);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt,
+		      struct cpu_opv_vaddr *vaddr_ptrs, int cpu)
+{
+	struct mm_struct *mm = current->mm;
+	int ret;
+
+retry:
+	if (cpu != raw_smp_processor_id()) {
+		ret = push_task_to_cpu(current, cpu);
+		if (ret)
+			goto check_online;
+	}
+	down_read(&mm->mmap_sem);
+	ret = vaddr_ptrs_check(vaddr_ptrs);
+	if (ret)
+		goto end;
+	preempt_disable();
+	if (cpu != smp_processor_id()) {
+		preempt_enable();
+		up_read(&mm->mmap_sem);
+		goto retry;
+	}
+	ret = __do_cpu_opv(cpuop, cpuopcnt);
+	preempt_enable();
+end:
+	up_read(&mm->mmap_sem);
+	return ret;
+
+check_online:
+	/*
+	 * push_task_to_cpu() returns -EINVAL if the requested cpu is not part
+	 * of the current thread's cpus_allowed mask.
+	 */
+	if (ret == -EINVAL)
+		return ret;
+	get_online_cpus();
+	if (cpu_online(cpu)) {
+		put_online_cpus();
+		goto retry;
+	}
+	/*
+	 * CPU is offline. Perform operation from the current CPU with
+	 * cpu_online read lock held, preventing that CPU from coming online,
+	 * and with mutex held, providing mutual exclusion against other
+	 * CPUs also finding out about an offline CPU.
+	 */
+	down_read(&mm->mmap_sem);
+	ret = vaddr_ptrs_check(vaddr_ptrs);
+	if (ret)
+		goto offline_end;
+	mutex_lock(&cpu_opv_offline_lock);
+	ret = __do_cpu_opv(cpuop, cpuopcnt);
+	mutex_unlock(&cpu_opv_offline_lock);
+offline_end:
+	up_read(&mm->mmap_sem);
+	put_online_cpus();
+	return ret;
+}
+
+/*
+ * cpu_opv - execute operation vector on a given CPU with preempt off.
+ *
+ * Userspace should pass the CPU number on which the operation vector
+ * should be executed as parameter.
+ */
+SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
+		int, cpu, int, flags)
+{
+	struct vaddr vaddr_on_stack[NR_VADDR_ON_STACK];
+	struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX];
+	struct cpu_opv_vaddr vaddr_ptrs = {
+		.addr = vaddr_on_stack,
+		.nr_vaddr = 0,
+		.is_kmalloc = false,
+	};
+	int ret, i, nr_vaddr = 0;
+	bool retry = false;
+
+	if (unlikely(flags & ~CPU_OP_NR_FLAG))
+		return -EINVAL;
+	if (flags & CPU_OP_NR_FLAG)
+		return NR_CPU_OPS;
+	if (unlikely(cpu < 0))
+		return -EINVAL;
+	if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX)
+		return -EINVAL;
+	if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op)))
+		return -EFAULT;
+	ret = cpu_opv_check(cpuopv, cpuopcnt, &nr_vaddr);
+	if (ret)
+		return ret;
+	if (nr_vaddr > NR_VADDR_ON_STACK) {
+		vaddr_ptrs.addr = cpu_op_alloc_vaddr_vector(nr_vaddr);
+		if (!vaddr_ptrs.addr) {
+			ret = -ENOMEM;
+			goto end;
+		}
+		vaddr_ptrs.is_kmalloc = true;
+	}
+again:
+	ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &vaddr_ptrs);
+	if (ret)
+		goto end;
+	ret = do_cpu_opv(cpuopv, cpuopcnt, &vaddr_ptrs, cpu);
+	if (ret == -EAGAIN)
+		retry = true;
+end:
+	for (i = 0; i < vaddr_ptrs.nr_vaddr; i++) {
+		struct vaddr *vaddr = &vaddr_ptrs.addr[i];
+		int j;
+
+		vm_unmap_user_ram((void *)vaddr->mem, vaddr->nr_pages);
+		for (j = 0; j < vaddr->nr_pages; j++) {
+			if (vaddr->write)
+				set_page_dirty(vaddr->pages[j]);
+			put_page(vaddr->pages[j]);
+		}
+	}
+	/*
+	 * Force vm_map flush to ensure we don't exhaust available vmalloc
+	 * address space.
+	 */
+	if (vaddr_ptrs.nr_vaddr)
+		vm_unmap_aliases();
+	if (retry) {
+		retry = false;
+		vaddr_ptrs.nr_vaddr = 0;
+		goto again;
+	}
+	if (vaddr_ptrs.is_kmalloc)
+		kfree(vaddr_ptrs.addr);
+	return ret;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index df556175be50..0a6410d77c33 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -435,3 +435,4 @@ COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+COND_SYSCALL(cpu_opv);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH for 4.21 07/16] cpu_opv: limit amount of virtual address space used by cpu_opv
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 06/16] cpu_opv: Provide cpu_opv system call (v8) Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 08/16] x86: Wire up cpu_opv system call Mathieu Desnoyers
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers

Introduce sysctl cpu_opv_va_max_bytes, which limits the amount of
virtual address space that can be used by cpu_opv.

Its default value is the maximum amount of virtual address space which
can be used by a single cpu_opv system call (256 kB on x86).
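
For reference (my arithmetic, not part of the patch): the per-call worst
case is CPU_OP_VEC_MAX_ADDR (2 * 16 = 32) mappings of up to
CPU_OP_RANGE_PER_ADDR_MAX (2 * SHMLBA) bytes each; on x86, where SHMLBA
equals the 4 kB page size, that is 32 * 8 kB = 256 kB, matching the
default above. Assuming the sysctl is registered under the "kernel"
table as in the kern_table hunk below, it could be inspected with a
sketch like this:

#include <stdio.h>

int main(void)
{
	long max_bytes;
	FILE *f = fopen("/proc/sys/kernel/cpu_opv_va_max_bytes", "r");

	if (!f)
		return 1;
	if (fscanf(f, "%ld", &max_bytes) == 1)
		printf("cpu_opv_va_max_bytes = %ld\n", max_bytes);
	fclose(f);
	return 0;
}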

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul Turner <pjt@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 kernel/cpu_opv.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c  | 15 ++++++++++++
 2 files changed, 89 insertions(+), 1 deletion(-)

diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c
index c4e4040bb5ff..db144b71d51a 100644
--- a/kernel/cpu_opv.c
+++ b/kernel/cpu_opv.c
@@ -30,6 +30,7 @@
 #include <linux/pagemap.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
+#include <linux/atomic.h>
 #include <asm/ptrace.h>
 #include <asm/byteorder.h>
 #include <asm/cacheflush.h>
@@ -49,6 +50,16 @@
 /* Maximum number of virtual addresses per op. */
 #define CPU_OP_VEC_MAX_ADDR		(2 * CPU_OP_VEC_LEN_MAX)
 
+/* Maximum address range size (aligned on SHMLBA) per virtual address. */
+#define CPU_OP_RANGE_PER_ADDR_MAX	(2 * SHMLBA)
+
+/*
+ * Minimum value for sysctl_cpu_opv_va_max_bytes is the maximum virtual memory
+ * space needed by one cpu_opv system call.
+ */
+#define CPU_OPV_VA_MAX_BYTES_MIN	\
+		(CPU_OP_VEC_MAX_ADDR * CPU_OP_RANGE_PER_ADDR_MAX)
+
 union op_fn_data {
 	uint8_t _u8;
 	uint16_t _u16;
@@ -81,6 +92,15 @@ typedef int (*op_fn_t)(union op_fn_data *data, uint64_t v, uint32_t len);
  */
 static DEFINE_MUTEX(cpu_opv_offline_lock);
 
+/* Maximum virtual address space which can be used by cpu_opv. */
+int sysctl_cpu_opv_va_max_bytes __read_mostly;
+int sysctl_cpu_opv_va_max_bytes_min;
+
+static atomic_t cpu_opv_va_allocated_bytes;
+
+/* Waitqueue for cpu_opv blocked on virtual address space reservation. */
+static DECLARE_WAIT_QUEUE_HEAD(cpu_opv_va_wait);
+
 /*
  * The cpu_opv system call executes a vector of operations on behalf of
  * user-space on a specific CPU with preemption disabled. It is inspired
@@ -546,6 +566,43 @@ static int cpu_opv_pin_pages_op(struct cpu_op *op,
 	return 0;
 }
 
+/*
+ * Approximate the amount of virtual address space required per
+ * vaddr to a worst-case of CPU_OP_RANGE_PER_ADDR_MAX.
+ */
+static int cpu_opv_reserve_va(int nr_vaddr, int *reserved_va)
+{
+	int nr_bytes = nr_vaddr * CPU_OP_RANGE_PER_ADDR_MAX;
+	int old_bytes, new_bytes;
+
+	WARN_ON_ONCE(*reserved_va != 0);
+	if (nr_bytes > sysctl_cpu_opv_va_max_bytes) {
+		WARN_ON_ONCE(1);
+		return -EINVAL;
+	}
+	do {
+		wait_event(cpu_opv_va_wait,
+			(old_bytes = atomic_read(&cpu_opv_va_allocated_bytes)) +
+			nr_bytes <= sysctl_cpu_opv_va_max_bytes);
+		new_bytes = old_bytes + nr_bytes;
+	} while (atomic_cmpxchg(&cpu_opv_va_allocated_bytes,
+		 old_bytes, new_bytes) != old_bytes);
+
+	*reserved_va = nr_bytes;
+	return 0;
+}
+
+static void cpu_opv_unreserve_va(int *reserved_va)
+{
+	int nr_bytes = *reserved_va;
+
+	if (!nr_bytes)
+		return;
+	atomic_sub(nr_bytes, &cpu_opv_va_allocated_bytes);
+	wake_up(&cpu_opv_va_wait);
+	*reserved_va = 0;
+}
+
 static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt,
 			     struct cpu_opv_vaddr *vaddr_ptrs)
 {
@@ -1057,7 +1114,7 @@ SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
 		.nr_vaddr = 0,
 		.is_kmalloc = false,
 	};
-	int ret, i, nr_vaddr = 0;
+	int ret, i, nr_vaddr = 0, reserved_va = 0;
 	bool retry = false;
 
 	if (unlikely(flags & ~CPU_OP_NR_FLAG))
@@ -1082,6 +1139,9 @@ SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
 		vaddr_ptrs.is_kmalloc = true;
 	}
 again:
+	ret = cpu_opv_reserve_va(nr_vaddr, &reserved_va);
+	if (ret)
+		goto end;
 	ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &vaddr_ptrs);
 	if (ret)
 		goto end;
@@ -1106,6 +1166,7 @@ SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
 	 */
 	if (vaddr_ptrs.nr_vaddr)
 		vm_unmap_aliases();
+	cpu_opv_unreserve_va(&reserved_va);
 	if (retry) {
 		retry = false;
 		vaddr_ptrs.nr_vaddr = 0;
@@ -1115,3 +1176,15 @@ SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
 		kfree(vaddr_ptrs.addr);
 	return ret;
 }
+
+/*
+ * Dynamic initialization is required on sparc because SHMLBA is not a
+ * constant.
+ */
+static int __init cpu_opv_init(void)
+{
+	sysctl_cpu_opv_va_max_bytes = CPU_OPV_VA_MAX_BYTES_MIN;
+	sysctl_cpu_opv_va_max_bytes_min = CPU_OPV_VA_MAX_BYTES_MIN;
+	return 0;
+}
+core_initcall(cpu_opv_init);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index cc02050fd0c4..eb34c6be2aa4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -175,6 +175,11 @@ extern int unaligned_dump_stack;
 extern int no_unaligned_warning;
 #endif
 
+#ifdef CONFIG_CPU_OPV
+extern int sysctl_cpu_opv_va_max_bytes;
+extern int sysctl_cpu_opv_va_max_bytes_min;
+#endif
+
 #ifdef CONFIG_PROC_SYSCTL
 
 /**
@@ -1233,6 +1238,16 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_CPU_OPV
+	{
+		.procname	= "cpu_opv_va_max_bytes",
+		.data		= &sysctl_cpu_opv_va_max_bytes,
+		.maxlen		= sizeof(sysctl_cpu_opv_va_max_bytes),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &sysctl_cpu_opv_va_max_bytes_min,
+	},
+#endif
 	{ }
 };
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH for 4.21 08/16] x86: Wire up cpu_opv system call
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
                   ` (6 preceding siblings ...)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 07/16] cpu_opv: limit amount of virtual address space used by cpu_opv Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 09/16] powerpc: " Mathieu Desnoyers
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul Turner <pjt@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)
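
For userspace built against older kernel headers, the numbers added by
this patch (335 on x86-64, 387 on i386, per the tables below) could serve
as a fallback definition; this is a sketch derived only from this patch,
as libc may eventually provide __NR_cpu_opv itself:

#ifndef __NR_cpu_opv
# if defined(__x86_64__)
#  define __NR_cpu_opv 335	/* from syscall_64.tbl below */
# elif defined(__i386__)
#  define __NR_cpu_opv 387	/* from syscall_32.tbl below */
# endif
#endif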

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..d3253547e15e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,4 @@
 384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
 385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
+387	i386	cpu_opv			sys_cpu_opv			__ia32_sys_cpu_opv
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..1391971b1517 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
 332	common	statx			__x64_sys_statx
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
+335	common	cpu_opv			__x64_sys_cpu_opv
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH for 4.21 09/16] powerpc: Wire up cpu_opv system call
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
                   ` (7 preceding siblings ...)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 08/16] x86: Wire up cpu_opv system call Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 10/16] arm: " Mathieu Desnoyers
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, linuxppc-dev

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h      | 1 +
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 01b5171ea189..8f58710f5e8b 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -394,3 +394,4 @@ SYSCALL(pkey_free)
 SYSCALL(pkey_mprotect)
 SYSCALL(rseq)
 COMPAT_SYS(io_pgetevents)
+SYSCALL(cpu_opv)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index 985534d0b448..112e2c54750a 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -400,5 +400,6 @@
 #define __NR_pkey_mprotect	386
 #define __NR_rseq		387
 #define __NR_io_pgetevents	388
+#define __NR_cpu_opv		389
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH for 4.21 10/16] arm: Wire up cpu_opv system call
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
                   ` (8 preceding siblings ...)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 09/16] powerpc: " Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 11/16] cpu-opv/selftests: Provide cpu-op library Mathieu Desnoyers
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/arm/tools/syscall.tbl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 8edf93b4490f..da2cb7a12644 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -414,3 +414,4 @@
 397	common	statx			sys_statx
 398	common	rseq			sys_rseq
 399	common	io_pgetevents		sys_io_pgetevents
+400	common	cpu_opv			sys_cpu_opv
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH for 4.21 11/16] cpu-opv/selftests: Provide cpu-op library
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
                   ` (9 preceding siblings ...)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 10/16] arm: " Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 12/16] cpu-opv/selftests: Provide basic test Mathieu Desnoyers
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers, Shuah Khan, linux-kselftest

This cpu-op helper library provides a user-space API to the cpu_opv()
system call.
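
A hedged usage sketch (mine, not part of the patch) for the helpers
below: bump the current thread's slot in a per-cpu counter array,
retrying if the kernel asks for it:

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#include "cpu-op.h"

static void percpu_counter_inc(intptr_t *counters /* one slot per cpu */)
{
	int cpu, ret;

	do {
		cpu = cpu_op_get_current_cpu();
		/* Single CPU_ADD_OP executed on "cpu" with preemption off. */
		ret = cpu_op_add(&counters[cpu], 1, sizeof(intptr_t), cpu);
	} while (ret && errno == EAGAIN);
	if (ret)
		abort();
}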

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 tools/testing/selftests/cpu-opv/cpu-op.c | 353 +++++++++++++++++++++++++++++++
 tools/testing/selftests/cpu-opv/cpu-op.h |  42 ++++
 2 files changed, 395 insertions(+)
 create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.c
 create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.h

diff --git a/tools/testing/selftests/cpu-opv/cpu-op.c b/tools/testing/selftests/cpu-opv/cpu-op.c
new file mode 100644
index 000000000000..575cd51da421
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/cpu-op.c
@@ -0,0 +1,353 @@
+// SPDX-License-Identifier: LGPL-2.1
+/*
+ * cpu-op.c
+ *
+ * Copyright (C) 2017 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; only
+ * version 2.1 of the License.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <syscall.h>
+#include <assert.h>
+#include <signal.h>
+
+#include "cpu-op.h"
+
+#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
+
+#define ACCESS_ONCE(x)		(*(__volatile__  __typeof__(x) *)&(x))
+#define WRITE_ONCE(x, v)	__extension__ ({ ACCESS_ONCE(x) = (v); })
+#define READ_ONCE(x)		ACCESS_ONCE(x)
+
+int cpu_opv(struct cpu_op *cpu_opv, int cpuopcnt, int cpu, int flags)
+{
+	return syscall(__NR_cpu_opv, cpu_opv, cpuopcnt, cpu, flags);
+}
+
+int cpu_op_get_current_cpu(void)
+{
+	int cpu;
+
+	cpu = sched_getcpu();
+	if (cpu < 0) {
+		perror("sched_getcpu()");
+		abort();
+	}
+	return cpu;
+}
+
+int cpu_op_cmpxchg(void *v, void *expect, void *old, void *n, size_t len,
+		   int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)old,
+			.u.memcpy_op.src = (unsigned long)v,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[1] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)n,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_add(void *v, int64_t count, size_t len, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_ADD_OP,
+			.len = len,
+			.u.arithmetic_op.p = (unsigned long)v,
+			.u.arithmetic_op.count = count,
+			.u.arithmetic_op.expect_fault_p = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv,
+			 int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)&newv,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int cpu_op_cmpeqv_storep_expect_fault(intptr_t *v, intptr_t expect,
+					     intptr_t *newp, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)newp,
+			.u.memcpy_op.expect_fault_dst = 0,
+			/* Return EAGAIN on src fault. */
+			.u.memcpy_op.expect_fault_src = 1,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+				 off_t voffp, intptr_t *load, int cpu)
+{
+	int ret;
+
+	do {
+		intptr_t oldv = READ_ONCE(*v);
+		intptr_t *newp = (intptr_t *)(oldv + voffp);
+
+		if (oldv == expectnot)
+			return 1;
+		ret = cpu_op_cmpeqv_storep_expect_fault(v, oldv, newp, cpu);
+		if (!ret) {
+			*load = oldv;
+			return 0;
+		}
+	} while (ret > 0);
+
+	return -1;
+}
+
+int cpu_op_cmpeqv_storev_storev(intptr_t *v, intptr_t expect,
+				intptr_t *v2, intptr_t newv2,
+				intptr_t newv, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v2,
+			.u.memcpy_op.src = (unsigned long)&newv2,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)&newv,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_storev_mb_storev(intptr_t *v, intptr_t expect,
+				   intptr_t *v2, intptr_t newv2,
+				   intptr_t newv, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v2,
+			.u.memcpy_op.src = (unsigned long)&newv2,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[2] = {
+			.op = CPU_MB_OP,
+		},
+		[3] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)&newv,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+				intptr_t *v2, intptr_t expect2,
+				intptr_t newv, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v2,
+			.u.compare_op.b = (unsigned long)&expect2,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)&newv,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_memcpy_storev(intptr_t *v, intptr_t expect,
+				void *dst, void *src, size_t len,
+				intptr_t newv, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)dst,
+			.u.memcpy_op.src = (unsigned long)src,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)&newv,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_memcpy_mb_storev(intptr_t *v, intptr_t expect,
+				   void *dst, void *src, size_t len,
+				   intptr_t newv, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = sizeof(intptr_t),
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)&expect,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)dst,
+			.u.memcpy_op.src = (unsigned long)src,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[2] = {
+			.op = CPU_MB_OP,
+		},
+		[3] = {
+			.op = CPU_MEMCPY_OP,
+			.len = sizeof(intptr_t),
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)&newv,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_addv(intptr_t *v, int64_t count, int cpu)
+{
+	return cpu_op_add(v, count, sizeof(intptr_t), cpu);
+}
diff --git a/tools/testing/selftests/cpu-opv/cpu-op.h b/tools/testing/selftests/cpu-opv/cpu-op.h
new file mode 100644
index 000000000000..075687a1365f
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/cpu-op.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * cpu-op.h
+ *
+ * (C) Copyright 2017-2018 - Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ */
+
+#ifndef CPU_OPV_H
+#define CPU_OPV_H
+
+#include <stdlib.h>
+#include <stdint.h>
+#include <linux/cpu_opv.h>
+
+int cpu_opv(struct cpu_op *cpuopv, int cpuopcnt, int cpu, int flags);
+int cpu_op_get_current_cpu(void);
+
+int cpu_op_cmpxchg(void *v, void *expect, void *old, void *_new, size_t len,
+		   int cpu);
+int cpu_op_add(void *v, int64_t count, size_t len, int cpu);
+
+int cpu_op_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv, int cpu);
+int cpu_op_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+				 off_t voffp, intptr_t *load, int cpu);
+int cpu_op_cmpeqv_storev_storev(intptr_t *v, intptr_t expect,
+				intptr_t *v2, intptr_t newv2,
+				intptr_t newv, int cpu);
+int cpu_op_cmpeqv_storev_mb_storev(intptr_t *v, intptr_t expect,
+				   intptr_t *v2, intptr_t newv2,
+				   intptr_t newv, int cpu);
+int cpu_op_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+				intptr_t *v2, intptr_t expect2,
+				intptr_t newv, int cpu);
+int cpu_op_cmpeqv_memcpy_storev(intptr_t *v, intptr_t expect,
+				void *dst, void *src, size_t len,
+				intptr_t newv, int cpu);
+int cpu_op_cmpeqv_memcpy_mb_storev(intptr_t *v, intptr_t expect,
+				   void *dst, void *src, size_t len,
+				   intptr_t newv, int cpu);
+int cpu_op_addv(intptr_t *v, int64_t count, int cpu);
+
+#endif  /* CPU_OPV_H */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH for 4.21 12/16] cpu-opv/selftests: Provide basic test
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
                   ` (10 preceding siblings ...)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 11/16] cpu-opv/selftests: Provide cpu-op library Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 13/16] cpu-opv/selftests: Provide percpu_op API Mathieu Desnoyers
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers, Shuah Khan, linux-kselftest

"basic_cpu_opv_test" tests basic functionality of cpu_opv.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 .../testing/selftests/cpu-opv/basic_cpu_opv_test.c | 1362 ++++++++++++++++++++
 1 file changed, 1362 insertions(+)
 create mode 100644 tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c

diff --git a/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c b/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
new file mode 100644
index 000000000000..ec5adee20c7f
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
@@ -0,0 +1,1362 @@
+// SPDX-License-Identifier: LGPL-2.1
+/*
+ * Basic test coverage for cpu_opv system call.
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <sys/mman.h>
+#include <sched.h>
+
+#include "../kselftest.h"
+
+#include "cpu-op.h"
+
+#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
+
+#define TESTBUFLEN	4096
+#define TESTBUFLEN_CMP	16
+
+#define TESTBUFLEN_PAGE_MAX	65536
+
+#define NR_PF_ARRAY	16384
+#define PF_ARRAY_LEN	4096
+
+#define NR_HUGE_ARRAY	512
+#define HUGEMAPLEN	(NR_HUGE_ARRAY * PF_ARRAY_LEN)
+
+/* 64 MB arrays for page fault testing. */
+char pf_array_dst[NR_PF_ARRAY][PF_ARRAY_LEN];
+char pf_array_src[NR_PF_ARRAY][PF_ARRAY_LEN];
+
+static int test_ops_supported(void)
+{
+	const char *test_name = "test_ops_supported";
+	int ret;
+
+	ret = cpu_opv(NULL, 0, -1, CPU_OP_NR_FLAG);
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (ret < NR_CPU_OPS) {
+		ksft_test_result_fail("%s test: only %d operations supported, expecting at least %d\n",
+				      test_name, ret, NR_CPU_OPS);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_compare_eq_op(char *a, char *b, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)a,
+			.u.compare_op.b = (unsigned long)b,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_compare_eq_same(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_compare_eq same";
+
+	/* Test compare_eq */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf2[i] = (char)i;
+	ret = test_compare_eq_op(buf2, buf1, TESTBUFLEN);
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (ret > 0) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, ret, 0);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_compare_eq_diff(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_compare_eq different";
+
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN);
+	ret = test_compare_eq_op(buf2, buf1, TESTBUFLEN);
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (ret == 0) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, ret, 1);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_compare_ne_op(char *a, char *b, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_NE_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)a,
+			.u.compare_op.b = (unsigned long)b,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_compare_ne_same(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_compare_ne same";
+
+	/* Test compare_ne */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf2[i] = (char)i;
+	ret = test_compare_ne_op(buf2, buf1, TESTBUFLEN);
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (ret == 0) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, ret, 1);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_compare_ne_diff(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_compare_ne different";
+
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN);
+	ret = test_compare_ne_op(buf2, buf1, TESTBUFLEN);
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (ret != 0) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, ret, 0);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_2compare_eq_op(char *a, char *b, char *c, char *d,
+		size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)a,
+			.u.compare_op.b = (unsigned long)b,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)c,
+			.u.compare_op.b = (unsigned long)d,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_2compare_eq_index(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN_CMP];
+	char buf2[TESTBUFLEN_CMP];
+	char buf3[TESTBUFLEN_CMP];
+	char buf4[TESTBUFLEN_CMP];
+	const char *test_name = "test_2compare_eq index";
+
+	for (i = 0; i < TESTBUFLEN_CMP; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN_CMP);
+	memset(buf3, 0, TESTBUFLEN_CMP);
+	memset(buf4, 0, TESTBUFLEN_CMP);
+
+	/* First compare failure is op[0], expect 1. */
+	ret = test_2compare_eq_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (ret != 1) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, ret, 1);
+		return -1;
+	}
+
+	/* All compares succeed. */
+	for (i = 0; i < TESTBUFLEN_CMP; i++)
+		buf2[i] = (char)i;
+	ret = test_2compare_eq_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (ret != 0) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, ret, 0);
+		return -1;
+	}
+
+	/* First compare failure is op[1], expect 2. */
+	for (i = 0; i < TESTBUFLEN_CMP; i++)
+		buf3[i] = (char)i;
+	ret = test_2compare_eq_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (ret != 2) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, ret, 2);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_2compare_ne_op(char *a, char *b, char *c, char *d,
+		size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_NE_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)a,
+			.u.compare_op.b = (unsigned long)b,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+		[1] = {
+			.op = CPU_COMPARE_NE_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)c,
+			.u.compare_op.b = (unsigned long)d,
+			.u.compare_op.expect_fault_a = 0,
+			.u.compare_op.expect_fault_b = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_2compare_ne_index(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN_CMP];
+	char buf2[TESTBUFLEN_CMP];
+	char buf3[TESTBUFLEN_CMP];
+	char buf4[TESTBUFLEN_CMP];
+	const char *test_name = "test_2compare_ne index";
+
+	memset(buf1, 0, TESTBUFLEN_CMP);
+	memset(buf2, 0, TESTBUFLEN_CMP);
+	memset(buf3, 0, TESTBUFLEN_CMP);
+	memset(buf4, 0, TESTBUFLEN_CMP);
+
+	/* First compare ne failure is op[0], expect 1. */
+	ret = test_2compare_ne_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (ret != 1) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, ret, 1);
+		return -1;
+	}
+
+	/* All compare ne succeed. */
+	for (i = 0; i < TESTBUFLEN_CMP; i++)
+		buf1[i] = (char)i;
+	for (i = 0; i < TESTBUFLEN_CMP; i++)
+		buf3[i] = (char)i;
+	ret = test_2compare_ne_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (ret != 0) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, ret, 0);
+		return -1;
+	}
+
+	/* First compare failure is op[1], expect 2. */
+	for (i = 0; i < TESTBUFLEN_CMP; i++)
+		buf4[i] = (char)i;
+	ret = test_2compare_ne_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (ret != 2) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, ret, 2);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_memcpy_op(void *dst, void *src, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)dst,
+			.u.memcpy_op.src = (unsigned long)src,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_memcpy(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_memcpy";
+
+	/* Test memcpy */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN);
+	ret = test_memcpy_op(buf2, buf1, TESTBUFLEN);
+	if (ret) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	for (i = 0; i < TESTBUFLEN; i++) {
+		if (buf2[i] != (char)i) {
+			ksft_test_result_fail("%s test: unexpected value at offset %d. Found %d. Should be %d.\n",
+					      test_name, i, buf2[i], (char)i);
+			return -1;
+		}
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_memcpy_u32(void)
+{
+	int ret;
+	uint32_t v1, v2;
+	const char *test_name = "test_memcpy_u32";
+
+	/* Test memcpy_u32 */
+	v1 = 42;
+	v2 = 0;
+	ret = test_memcpy_op(&v2, &v1, sizeof(v1));
+	if (ret) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v1 != v2) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, v2, v1);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_memcpy_mb_memcpy_op(void *dst1, void *src1,
+		void *dst2, void *src2, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)dst1,
+			.u.memcpy_op.src = (unsigned long)src1,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[1] = {
+			.op = CPU_MB_OP,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)dst2,
+			.u.memcpy_op.src = (unsigned long)src2,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_memcpy_mb_memcpy(void)
+{
+	int ret;
+	int v1, v2, v3;
+	const char *test_name = "test_memcpy_mb_memcpy";
+
+	/* Test memcpy */
+	v1 = 42;
+	v2 = v3 = 0;
+	ret = test_memcpy_mb_memcpy_op(&v2, &v1, &v3, &v2, sizeof(int));
+	if (ret) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v3 != v1) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, v3, v1);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_add_op(int *v, int64_t increment)
+{
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_op_add(v, increment, sizeof(*v), cpu);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_add(void)
+{
+	int orig_v = 42, v, ret;
+	int increment = 1;
+	const char *test_name = "test_add";
+
+	v = orig_v;
+	ret = test_add_op(&v, increment);
+	if (ret) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != orig_v + increment) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, v,
+				      orig_v + increment);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_two_add_op(int *v, int64_t *increments)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_ADD_OP,
+			.len = sizeof(*v),
+			.u.arithmetic_op.p = (unsigned long)v,
+			.u.arithmetic_op.count = increments[0],
+			.u.arithmetic_op.expect_fault_p = 0,
+		},
+		[1] = {
+			.op = CPU_ADD_OP,
+			.len = sizeof(*v),
+			.u.arithmetic_op.p = (unsigned long)v,
+			.u.arithmetic_op.count = increments[1],
+			.u.arithmetic_op.expect_fault_p = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_two_add(void)
+{
+	int orig_v = 42, v, ret;
+	int64_t increments[2] = { 99, 123 };
+	const char *test_name = "test_two_add";
+
+	v = orig_v;
+	ret = test_two_add_op(&v, increments);
+	if (ret) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != orig_v + increments[0] + increments[1]) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, v,
+				      orig_v + increments[0] + increments[1]);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_or_op(int *v, uint64_t mask)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_OR_OP,
+			.len = sizeof(*v),
+			.u.bitwise_op.p = (unsigned long)v,
+			.u.bitwise_op.mask = mask,
+			.u.bitwise_op.expect_fault_p = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_or(void)
+{
+	int orig_v = 0xFF00000, v, ret;
+	uint32_t mask = 0xFFF;
+	const char *test_name = "test_or";
+
+	v = orig_v;
+	ret = test_or_op(&v, mask);
+	if (ret) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v | mask)) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, v, orig_v | mask);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_and_op(int *v, uint64_t mask)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_AND_OP,
+			.len = sizeof(*v),
+			.u.bitwise_op.p = (unsigned long)v,
+			.u.bitwise_op.mask = mask,
+			.u.bitwise_op.expect_fault_p = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_and(void)
+{
+	int orig_v = 0xF00, v, ret;
+	uint32_t mask = 0xFFF;
+	const char *test_name = "test_and";
+
+	v = orig_v;
+	ret = test_and_op(&v, mask);
+	if (ret) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v & mask)) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, v, orig_v & mask);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_xor_op(int *v, uint64_t mask)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_XOR_OP,
+			.len = sizeof(*v),
+			.u.bitwise_op.p = (unsigned long)v,
+			.u.bitwise_op.mask = mask,
+			.u.bitwise_op.expect_fault_p = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_xor(void)
+{
+	int orig_v = 0xF00, v, ret;
+	uint32_t mask = 0xFFF;
+	const char *test_name = "test_xor";
+
+	v = orig_v;
+	ret = test_xor_op(&v, mask);
+	if (ret) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v ^ mask)) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, v, orig_v ^ mask);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_lshift_op(int *v, uint32_t bits)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_LSHIFT_OP,
+			.len = sizeof(*v),
+			.u.shift_op.p = (unsigned long)v,
+			.u.shift_op.bits = bits,
+			.u.shift_op.expect_fault_p = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_lshift(void)
+{
+	int orig_v = 0xF00, v, ret;
+	uint32_t bits = 5;
+	const char *test_name = "test_lshift";
+
+	v = orig_v;
+	ret = test_lshift_op(&v, bits);
+	if (ret) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v << bits)) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, v, orig_v << bits);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_rshift_op(int *v, uint32_t bits)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_RSHIFT_OP,
+			.len = sizeof(*v),
+			.u.shift_op.p = (unsigned long)v,
+			.u.shift_op.bits = bits,
+			.u.shift_op.expect_fault_p = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_rshift(void)
+{
+	int orig_v = 0xF00, v, ret;
+	uint32_t bits = 5;
+	const char *test_name = "test_rshift";
+
+	v = orig_v;
+	ret = test_rshift_op(&v, bits);
+	if (ret) {
+		ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v >> bits)) {
+		ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+				      test_name, v, orig_v >> bits);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_cmpxchg_op(void *v, void *expect, void *old, void *n,
+		size_t len)
+{
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_op_cmpxchg(v, expect, old, n, len, cpu);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_cmpxchg_success(void)
+{
+	int ret;
+	uint64_t orig_v = 1, v, expect = 1, old = 0, n = 3;
+	const char *test_name = "test_cmpxchg success";
+
+	v = orig_v;
+	ret = test_cmpxchg_op(&v, &expect, &old, &n, sizeof(uint64_t));
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (ret) {
+		ksft_test_result_fail("%s returned %d, expecting %d\n",
+				      test_name, ret, 0);
+		return -1;
+	}
+	if (v != n) {
+		ksft_test_result_fail("%s v is %lld, expecting %lld\n",
+				      test_name, (long long)v, (long long)n);
+		return -1;
+	}
+	if (old != orig_v) {
+		ksft_test_result_fail("%s old is %lld, expecting %lld\n",
+				      test_name, (long long)old,
+				      (long long)orig_v);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_cmpxchg_fail(void)
+{
+	int ret;
+	uint64_t orig_v = 1, v, expect = 123, old = 0, n = 3;
+	const char *test_name = "test_cmpxchg fail";
+
+	v = orig_v;
+	ret = test_cmpxchg_op(&v, &expect, &old, &n, sizeof(uint64_t));
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (ret == 0) {
+		ksft_test_result_fail("%s returned %d, expecting %d\n",
+				      test_name, ret, 1);
+		return -1;
+	}
+	if (v == n) {
+		ksft_test_result_fail("%s returned %lld, expecting %lld\n",
+				      test_name, (long long)v,
+				      (long long)orig_v);
+		return -1;
+	}
+	if (old != orig_v) {
+		ksft_test_result_fail("%s old is %lld, expecting %lld\n",
+				      test_name, (long long)old,
+				      (long long)orig_v);
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_memcpy_expect_fault_op(void *dst, void *src, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)dst,
+			.u.memcpy_op.src = (unsigned long)src,
+			.u.memcpy_op.expect_fault_dst = 0,
+			/* Return EAGAIN on fault. */
+			.u.memcpy_op.expect_fault_src = 1,
+		},
+	};
+	int cpu;
+
+	cpu = cpu_op_get_current_cpu();
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int test_memcpy_fault(void)
+{
+	int ret;
+	char buf1[TESTBUFLEN];
+	const char *test_name = "test_memcpy_fault";
+
+	/* Test memcpy */
+	ret = test_memcpy_op(buf1, NULL, TESTBUFLEN);
+	if (!ret || (ret < 0 && errno != EFAULT)) {
+		ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	/* Test memcpy expect fault */
+	ret = test_memcpy_expect_fault_op(buf1, NULL, TESTBUFLEN);
+	if (!ret || (ret < 0 && errno != EAGAIN)) {
+		ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int do_test_unknown_op(void)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = -1,	/* Unknown */
+			.len = 0,
+		},
+	};
+	int cpu;
+
+	cpu = cpu_op_get_current_cpu();
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int test_unknown_op(void)
+{
+	int ret;
+	const char *test_name = "test_unknown_op";
+
+	ret = do_test_unknown_op();
+	if (!ret || (ret < 0 && errno != EINVAL)) {
+		ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int do_test_max_ops(void)
+{
+	struct cpu_op opvec[] = {
+		[0] = { .op = CPU_MB_OP, },
+		[1] = { .op = CPU_MB_OP, },
+		[2] = { .op = CPU_MB_OP, },
+		[3] = { .op = CPU_MB_OP, },
+		[4] = { .op = CPU_MB_OP, },
+		[5] = { .op = CPU_MB_OP, },
+		[6] = { .op = CPU_MB_OP, },
+		[7] = { .op = CPU_MB_OP, },
+		[8] = { .op = CPU_MB_OP, },
+		[9] = { .op = CPU_MB_OP, },
+		[10] = { .op = CPU_MB_OP, },
+		[11] = { .op = CPU_MB_OP, },
+		[12] = { .op = CPU_MB_OP, },
+		[13] = { .op = CPU_MB_OP, },
+		[14] = { .op = CPU_MB_OP, },
+		[15] = { .op = CPU_MB_OP, },
+	};
+	int cpu;
+
+	cpu = cpu_op_get_current_cpu();
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int test_max_ops(void)
+{
+	int ret;
+	const char *test_name = "test_max_ops";
+
+	ret = do_test_max_ops();
+	if (ret < 0) {
+		ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int do_test_too_many_ops(void)
+{
+	struct cpu_op opvec[] = {
+		[0] = { .op = CPU_MB_OP, },
+		[1] = { .op = CPU_MB_OP, },
+		[2] = { .op = CPU_MB_OP, },
+		[3] = { .op = CPU_MB_OP, },
+		[4] = { .op = CPU_MB_OP, },
+		[5] = { .op = CPU_MB_OP, },
+		[6] = { .op = CPU_MB_OP, },
+		[7] = { .op = CPU_MB_OP, },
+		[8] = { .op = CPU_MB_OP, },
+		[9] = { .op = CPU_MB_OP, },
+		[10] = { .op = CPU_MB_OP, },
+		[11] = { .op = CPU_MB_OP, },
+		[12] = { .op = CPU_MB_OP, },
+		[13] = { .op = CPU_MB_OP, },
+		[14] = { .op = CPU_MB_OP, },
+		[15] = { .op = CPU_MB_OP, },
+		[16] = { .op = CPU_MB_OP, },
+	};
+	int cpu;
+
+	cpu = cpu_op_get_current_cpu();
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int test_too_many_ops(void)
+{
+	int ret;
+	const char *test_name = "test_too_many_ops";
+
+	ret = do_test_too_many_ops();
+	if (!ret || (ret < 0 && errno != EINVAL)) {
+		ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+/* Use a 64kB length, the largest base page size known on Linux. */
+static int test_memcpy_single_too_large(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN_PAGE_MAX + 1];
+	char buf2[TESTBUFLEN_PAGE_MAX + 1];
+	const char *test_name = "test_memcpy_single_too_large";
+
+	/* Test memcpy */
+	for (i = 0; i < TESTBUFLEN_PAGE_MAX + 1; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN_PAGE_MAX + 1);
+	ret = test_memcpy_op(buf2, buf1, TESTBUFLEN_PAGE_MAX + 1);
+	if (!ret || (ret < 0 && errno != EINVAL)) {
+		ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+static int test_memcpy_single_ok_sum_too_large_op(void *dst, void *src,
+						  size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)dst,
+			.u.memcpy_op.src = (unsigned long)src,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)dst,
+			.u.memcpy_op.src = (unsigned long)src,
+			.u.memcpy_op.expect_fault_dst = 0,
+			.u.memcpy_op.expect_fault_src = 0,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_memcpy_single_ok_sum_too_large(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_memcpy_single_ok_sum_too_large";
+
+	/* Test memcpy */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN);
+	ret = test_memcpy_single_ok_sum_too_large_op(buf2, buf1, TESTBUFLEN);
+	if (!ret || (ret < 0 && errno != EINVAL)) {
+		ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+/*
+ * Iterate over large uninitialized arrays to trigger page faults.
+ * This includes reading from zero pages.
+ */
+int test_page_fault(void)
+{
+	int ret = 0;
+	uint64_t i;
+	const char *test_name = "test_page_fault";
+
+	for (i = 0; i < NR_PF_ARRAY; i++) {
+		ret = test_memcpy_op(pf_array_dst[i],
+				     pf_array_src[i],
+				     PF_ARRAY_LEN);
+		if (ret) {
+			ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+					      test_name, ret, strerror(errno));
+			return ret;
+		}
+	}
+	ksft_test_result_pass("%s test\n", test_name);
+	return 0;
+}
+
+/*
+ * Try to use 2MB huge pages.
+ */
+int test_hugetlb(void)
+{
+	int ret = 0;
+	uint64_t i;
+	const char *test_name = "test_hugetlb";
+	int *dst, *src;
+
+	dst = mmap(NULL, HUGEMAPLEN, PROT_READ | PROT_WRITE,
+		   MAP_HUGETLB | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (dst == MAP_FAILED) {
+		switch (errno) {
+		case ENOMEM:
+		case ENOENT:
+		case EINVAL:
+			ksft_test_result_skip("%s test.\n", test_name);
+			goto end;
+		default:
+			break;
+		}
+		perror("mmap");
+		abort();
+	}
+	src = mmap(NULL, HUGEMAPLEN, PROT_READ | PROT_WRITE,
+		   MAP_HUGETLB | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (src == MAP_FAILED) {
+		if (errno == ENOMEM) {
+			ksft_test_result_skip("%s test.\n", test_name);
+			goto unmap_dst;
+		}
+		perror("mmap");
+		abort();
+	}
+
+	/* Read/write from/to huge zero pages. */
+	for (i = 0; i < NR_HUGE_ARRAY; i++) {
+		ret = test_memcpy_op(dst + (i * PF_ARRAY_LEN / sizeof(int)),
+				     src + (i * PF_ARRAY_LEN / sizeof(int)),
+				     PF_ARRAY_LEN);
+		if (ret) {
+			ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+					      test_name, ret, strerror(errno));
+			return ret;
+		}
+	}
+	for (i = 0; i < NR_HUGE_ARRAY * (PF_ARRAY_LEN / sizeof(int)); i++)
+		src[i] = i;
+
+	for (i = 0; i < NR_HUGE_ARRAY; i++) {
+		ret = test_memcpy_op(dst + (i * PF_ARRAY_LEN / sizeof(int)),
+				     src + (i * PF_ARRAY_LEN / sizeof(int)),
+				     PF_ARRAY_LEN);
+		if (ret) {
+			ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+					      test_name, ret, strerror(errno));
+			return ret;
+		}
+	}
+
+	for (i = 0; i < NR_HUGE_ARRAY * (PF_ARRAY_LEN / sizeof(int)); i++) {
+		if (dst[i] != i) {
+			ksft_test_result_fail("%s mismatch, expect %llu, got %d\n",
+					      test_name, (unsigned long long)i, dst[i]);
+			return -1;
+		}
+	}
+
+	ksft_test_result_pass("%s test\n", test_name);
+
+	if (munmap(src, HUGEMAPLEN)) {
+		perror("munmap");
+		abort();
+	}
+unmap_dst:
+	if (munmap(dst, HUGEMAPLEN)) {
+		perror("munmap");
+		abort();
+	}
+end:
+	return 0;
+}
+
+static int test_cmpxchg_op_cpu(void *v, void *expect, void *old, void *n,
+		size_t len, int cpu)
+{
+	int ret;
+
+	do {
+		ret = cpu_op_cmpxchg(v, expect, old, n, len, cpu);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_over_possible_cpu(void)
+{
+	int ret;
+	uint64_t orig_v = 1, v, expect = 1, old = 0, n = 3;
+	const char *test_name = "test_over_possible_cpu";
+
+	v = orig_v;
+	ret = test_cmpxchg_op_cpu(&v, &expect, &old, &n, sizeof(uint64_t),
+				  0xFFFFFFFF);
+	if (ret == 0) {
+		ksft_test_result_fail("%s test: ret = %d\n",
+				      test_name, ret);
+		return -1;
+	}
+	if (ret < 0 && errno == EINVAL) {
+		ksft_test_result_pass("%s test\n", test_name);
+		return 0;
+	}
+	ksft_test_result_fail("%s returned %d, errno %s, expecting %d, errno %s\n",
+			      test_name, ret, strerror(errno),
+			      -1, strerror(EINVAL));
+	return -1;
+}
+
+static int test_allowed_affinity(void)
+{
+	int ret;
+	uint64_t orig_v = 1, v, expect = 1, old = 0, n = 3;
+	const char *test_name = "test_allowed_affinity";
+	cpu_set_t allowed_cpus, cpuset;
+
+	ret = sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	if (ret) {
+		ksft_test_result_fail("%s returned %d, errno %s\n",
+				      test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (!(CPU_ISSET(0, &allowed_cpus) && CPU_ISSET(1, &allowed_cpus))) {
+		ksft_test_result_skip("%s test. Requiring allowed CPUs 0 and 1.\n",
+				      test_name);
+		return 0;
+	}
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0) {
+		ksft_test_result_fail("%s test. Unable to set affinity. errno = %s\n",
+				      test_name, strerror(errno));
+		return -1;
+	}
+	v = orig_v;
+	ret = test_cmpxchg_op_cpu(&v, &expect, &old, &n, sizeof(uint64_t),
+				  1);
+	if (sched_setaffinity(0, sizeof(allowed_cpus), &allowed_cpus) != 0) {
+		ksft_test_result_fail("%s test. Unable to set affinity. errno = %s\n",
+				      test_name, strerror(errno));
+		return -1;
+	}
+	if (ret == 0) {
+		ksft_test_result_fail("%s test: ret = %d\n",
+				      test_name, ret);
+		return -1;
+	}
+
+	if (ret < 0 && errno == EINVAL) {
+		ksft_test_result_pass("%s test\n", test_name);
+		return 0;
+	}
+	ksft_test_result_fail("%s returned %d, errno %s, expecting %d, errno %s\n",
+			      test_name, ret, strerror(errno),
+			      -1, strerror(EINVAL));
+	return -1;
+}
+
+int main(int argc, char **argv)
+{
+	ksft_print_header();
+
+	test_ops_supported();
+	test_compare_eq_same();
+	test_compare_eq_diff();
+	test_compare_ne_same();
+	test_compare_ne_diff();
+	test_2compare_eq_index();
+	test_2compare_ne_index();
+	test_memcpy();
+	test_memcpy_u32();
+	test_memcpy_mb_memcpy();
+	test_add();
+	test_two_add();
+	test_or();
+	test_and();
+	test_xor();
+	test_lshift();
+	test_rshift();
+	test_cmpxchg_success();
+	test_cmpxchg_fail();
+	test_memcpy_fault();
+	test_unknown_op();
+	test_max_ops();
+	test_too_many_ops();
+	test_memcpy_single_too_large();
+	test_memcpy_single_ok_sum_too_large();
+	test_page_fault();
+	test_hugetlb();
+	test_over_possible_cpu();
+	test_allowed_affinity();
+
+	return ksft_exit_pass();
+}
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH for 4.21 13/16] cpu-opv/selftests: Provide percpu_op API
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
                   ` (11 preceding siblings ...)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 12/16] cpu-opv/selftests: Provide basic test Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 14/16] cpu-opv/selftests: Provide basic percpu ops test Mathieu Desnoyers
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers, Shuah Khan, linux-kselftest

Introduce the percpu-op.h API. It uses rseq internally as the fast path
when invoked from the right CPU, and falls back to cpu_opv as the slow
path when called from the wrong CPU or when rseq fails.

This allows acting on per-cpu data from various CPUs transparently from
user-space: cpu_opv will take care of migrating the thread to the
requested CPU. Use-cases such as rebalancing memory across per-cpu
memory pools, or migrating tasks for a user-space scheduler, are thus
facilitated. This also handles debugger single-stepping.

Usage from user-space, e.g. for a counter increment:

    int cpu, ret;

    cpu = percpu_current_cpu();
    ret = percpu_addv(&data->c[cpu].count, 1, cpu);
    if (unlikely(ret)) {
         perror("percpu_addv");
         return -1;
    }
    return 0;

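Internally, each percpu_*() helper follows the same fallback shape: try
the rseq fast-path first, hand a compare failure (positive return value)
straight back to the caller, and only take the cpu_opv slow-path when
rseq could not complete. Simplified from the compare-and-store helper in
the header below:

    ret = rseq_cmpeqv_storev(v, expect, newv, cpu);
    if (rseq_unlikely(ret)) {
        if (ret > 0)    /* Comparison failed: report it to the caller. */
            return ret;
        /* rseq could not complete: fall back to cpu_opv. */
        return cpu_op_cmpeqv_storev(v, expect, newv, cpu);
    }
    return 0;
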
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-kselftest@vger.kernel.org
CC: linux-api@vger.kernel.org
---
 tools/testing/selftests/cpu-opv/percpu-op.h | 151 ++++++++++++++++++++++++++++
 1 file changed, 151 insertions(+)
 create mode 100644 tools/testing/selftests/cpu-opv/percpu-op.h

diff --git a/tools/testing/selftests/cpu-opv/percpu-op.h b/tools/testing/selftests/cpu-opv/percpu-op.h
new file mode 100644
index 000000000000..ffc64b268fd3
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/percpu-op.h
@@ -0,0 +1,151 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * percpu-op.h
+ *
+ * (C) Copyright 2017-2018 - Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ */
+
+#ifndef PERCPU_OP_H
+#define PERCPU_OP_H
+
+#include <stdint.h>
+#include <stdbool.h>
+#include <errno.h>
+#include <stdlib.h>
+#include "rseq.h"
+#include "cpu-op.h"
+
+static inline uint32_t percpu_current_cpu(void)
+{
+	return rseq_current_cpu();
+}
+
+static inline __attribute__((always_inline))
+int percpu_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv,
+			 int cpu)
+{
+	int ret;
+
+	ret = rseq_cmpeqv_storev(v, expect, newv, cpu);
+	if (rseq_unlikely(ret)) {
+		if (ret > 0)
+			return ret;
+		return cpu_op_cmpeqv_storev(v, expect, newv, cpu);
+	}
+	return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+			       off_t voffp, intptr_t *load, int cpu)
+{
+	int ret;
+
+	ret = rseq_cmpnev_storeoffp_load(v, expectnot, voffp, load, cpu);
+	if (rseq_unlikely(ret)) {
+		if (ret > 0)
+			return ret;
+		return cpu_op_cmpnev_storeoffp_load(v, expectnot, voffp,
+						    load, cpu);
+	}
+	return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_addv(intptr_t *v, intptr_t count, int cpu)
+{
+	if (rseq_unlikely(rseq_addv(v, count, cpu)))
+		return cpu_op_addv(v, count, cpu);
+	return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_cmpeqv_storev_storev(intptr_t *v, intptr_t expect,
+				intptr_t *v2, intptr_t newv2,
+				intptr_t newv, int cpu)
+{
+	int ret;
+
+	ret = rseq_cmpeqv_trystorev_storev(v, expect, v2, newv2,
+					   newv, cpu);
+	if (rseq_unlikely(ret)) {
+		if (ret > 0)
+			return ret;
+		return cpu_op_cmpeqv_storev_storev(v, expect, v2, newv2,
+						   newv, cpu);
+	}
+	return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_cmpeqv_storev_storev_release(intptr_t *v, intptr_t expect,
+					intptr_t *v2, intptr_t newv2,
+					intptr_t newv, int cpu)
+{
+	int ret;
+
+	ret = rseq_cmpeqv_trystorev_storev_release(v, expect, v2, newv2,
+						   newv, cpu);
+	if (rseq_unlikely(ret)) {
+		if (ret > 0)
+			return ret;
+		return cpu_op_cmpeqv_storev_mb_storev(v, expect, v2, newv2,
+						      newv, cpu);
+	}
+	return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+				intptr_t *v2, intptr_t expect2,
+				intptr_t newv, int cpu)
+{
+	int ret;
+
+	ret = rseq_cmpeqv_cmpeqv_storev(v, expect, v2, expect2, newv, cpu);
+	if (rseq_unlikely(ret)) {
+		if (ret > 0)
+			return ret;
+		return cpu_op_cmpeqv_cmpeqv_storev(v, expect, v2, expect2,
+						   newv, cpu);
+	}
+	return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_cmpeqv_memcpy_storev(intptr_t *v, intptr_t expect,
+				void *dst, void *src, size_t len,
+				intptr_t newv, int cpu)
+{
+	int ret;
+
+	ret = rseq_cmpeqv_trymemcpy_storev(v, expect, dst, src, len,
+					   newv, cpu);
+	if (rseq_unlikely(ret)) {
+		if (ret > 0)
+			return ret;
+		return cpu_op_cmpeqv_memcpy_storev(v, expect, dst, src, len,
+						   newv, cpu);
+	}
+	return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_cmpeqv_memcpy_storev_release(intptr_t *v, intptr_t expect,
+					void *dst, void *src, size_t len,
+					intptr_t newv, int cpu)
+{
+	int ret;
+
+	ret = rseq_cmpeqv_trymemcpy_storev_release(v, expect, dst, src, len,
+						   newv, cpu);
+	if (rseq_unlikely(ret)) {
+		if (ret > 0)
+			return ret;
+		return cpu_op_cmpeqv_memcpy_mb_storev(v, expect, dst, src, len,
+						      newv, cpu);
+	}
+	return 0;
+}
+
+#endif  /* PERCPU_OP_H */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH for 4.21 14/16] cpu-opv/selftests: Provide basic percpu ops test
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
                   ` (12 preceding siblings ...)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 13/16] cpu-opv/selftests: Provide percpu_op API Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 15/16] cpu-opv/selftests: Provide parametrized tests Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 16/16] cpu-opv/selftests: Provide Makefile, scripts, gitignore Mathieu Desnoyers
  15 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers, Shuah Khan, linux-kselftest

"basic_percpu_ops_test" implements a few simple per-cpu operations and
testing their correctness.
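
The test exercises two data structures built on the percpu-op.h API from
the previous patch: a per-cpu spinlock protecting a sharded counter, and
a per-cpu linked list with push and pop. The counter side boils down to
the following loop body (a sketch; the rseq_percpu_lock()/unlock()
helpers are defined in the test itself):

    cpu = percpu_current_cpu();
    rseq_percpu_lock(&data->lock, cpu);
    data->c[cpu].count++;
    rseq_percpu_unlock(&data->lock, cpu);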

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 .../selftests/cpu-opv/basic_percpu_ops_test.c      | 295 +++++++++++++++++++++
 1 file changed, 295 insertions(+)
 create mode 100644 tools/testing/selftests/cpu-opv/basic_percpu_ops_test.c

diff --git a/tools/testing/selftests/cpu-opv/basic_percpu_ops_test.c b/tools/testing/selftests/cpu-opv/basic_percpu_ops_test.c
new file mode 100644
index 000000000000..2ce5202f25b2
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/basic_percpu_ops_test.c
@@ -0,0 +1,295 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+
+#include "percpu-op.h"
+
+#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
+
+struct percpu_lock_entry {
+	intptr_t v;
+} __attribute__((aligned(128)));
+
+struct percpu_lock {
+	struct percpu_lock_entry c[CPU_SETSIZE];
+};
+
+struct test_data_entry {
+	intptr_t count;
+} __attribute__((aligned(128)));
+
+struct spinlock_test_data {
+	struct percpu_lock lock;
+	struct test_data_entry c[CPU_SETSIZE];
+	int reps;
+};
+
+struct percpu_list_node {
+	intptr_t data;
+	struct percpu_list_node *next;
+};
+
+struct percpu_list_entry {
+	struct percpu_list_node *head;
+} __attribute__((aligned(128)));
+
+struct percpu_list {
+	struct percpu_list_entry c[CPU_SETSIZE];
+};
+
+/* A simple percpu spinlock. */
+void rseq_percpu_lock(struct percpu_lock *lock, int cpu)
+{
+	for (;;) {
+		int ret;
+
+		ret = percpu_cmpeqv_storev(&lock->c[cpu].v,
+					   0, 1, cpu);
+		if (rseq_likely(!ret))
+			break;
+		if (rseq_unlikely(ret < 0)) {
+			perror("cpu_opv");
+			abort();
+		}
+		/* Retry if comparison fails. */
+	}
+	/*
+	 * Acquire semantic when taking lock after control dependency.
+	 * Matches rseq_smp_store_release().
+	 */
+	rseq_smp_acquire__after_ctrl_dep();
+}
+
+void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+	assert(lock->c[cpu].v == 1);
+	/*
+	 * Release lock, with release semantic. Matches
+	 * rseq_smp_acquire__after_ctrl_dep().
+	 */
+	rseq_smp_store_release(&lock->c[cpu].v, 0);
+}
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+	struct spinlock_test_data *data = arg;
+	int i;
+
+	if (rseq_register_current_thread()) {
+		fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		abort();
+	}
+	for (i = 0; i < data->reps; i++) {
+		int cpu = percpu_current_cpu();
+
+		rseq_percpu_lock(&data->lock, cpu);
+		data->c[cpu].count++;
+		rseq_percpu_unlock(&data->lock, cpu);
+	}
+	if (rseq_unregister_current_thread()) {
+		fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		abort();
+	}
+
+	return NULL;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu
+ * lock.  Obviously real applications might prefer to simply use a
+ * per-cpu increment; however, this is reasonable for a test and the
+ * lock can be extended to synchronize more complicated operations.
+ */
+void test_percpu_spinlock(void)
+{
+	const int num_threads = 200;
+	int i;
+	uint64_t sum;
+	pthread_t test_threads[num_threads];
+	struct spinlock_test_data data;
+
+	memset(&data, 0, sizeof(data));
+	data.reps = 5000;
+
+	for (i = 0; i < num_threads; i++)
+		pthread_create(&test_threads[i], NULL,
+			       test_percpu_spinlock_thread, &data);
+
+	for (i = 0; i < num_threads; i++)
+		pthread_join(test_threads[i], NULL);
+
+	sum = 0;
+	for (i = 0; i < CPU_SETSIZE; i++)
+		sum += data.c[i].count;
+
+	assert(sum == (uint64_t)data.reps * num_threads);
+}
+
+int percpu_list_push(struct percpu_list *list, struct percpu_list_node *node,
+		     int cpu)
+{
+	for (;;) {
+		intptr_t *targetptr, newval, expect;
+		int ret;
+
+		/* Load list->c[cpu].head with single-copy atomicity. */
+		expect = (intptr_t)RSEQ_READ_ONCE(list->c[cpu].head);
+		newval = (intptr_t)node;
+		targetptr = (intptr_t *)&list->c[cpu].head;
+		node->next = (struct percpu_list_node *)expect;
+		ret = percpu_cmpeqv_storev(targetptr, expect, newval, cpu);
+		if (rseq_likely(!ret))
+			break;
+		if (rseq_unlikely(ret < 0)) {
+			perror("cpu_opv");
+			abort();
+		}
+		/* Retry if comparison fails. */
+	}
+	return cpu;
+}
+
+/*
+ * Unlike a traditional lock-less linked list; the availability of a
+ * rseq primitive allows us to implement pop without concerns over
+ * ABA-type races.
+ */
+struct percpu_list_node *percpu_list_pop(struct percpu_list *list,
+					 int cpu)
+{
+	struct percpu_list_node *head;
+	intptr_t *targetptr, expectnot, *load;
+	off_t offset;
+	int ret;
+
+	targetptr = (intptr_t *)&list->c[cpu].head;
+	expectnot = (intptr_t)NULL;
+	offset = offsetof(struct percpu_list_node, next);
+	load = (intptr_t *)&head;
+	ret = percpu_cmpnev_storeoffp_load(targetptr, expectnot,
+					   offset, load, cpu);
+	if (rseq_unlikely(ret < 0)) {
+		perror("cpu_opv");
+		abort();
+	}
+	if (ret > 0)
+		return NULL;
+	return head;
+}
+
+void *test_percpu_list_thread(void *arg)
+{
+	int i;
+	struct percpu_list *list = (struct percpu_list *)arg;
+
+	if (rseq_register_current_thread()) {
+		fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		abort();
+	}
+
+	for (i = 0; i < 100000; i++) {
+		struct percpu_list_node *node;
+
+		node = percpu_list_pop(list, percpu_current_cpu());
+		sched_yield();  /* encourage shuffling */
+		if (node)
+			percpu_list_push(list, node, percpu_current_cpu());
+	}
+
+	if (rseq_unregister_current_thread()) {
+		fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		abort();
+	}
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu linked list from many threads.  */
+void test_percpu_list(void)
+{
+	int i, j;
+	uint64_t sum = 0, expected_sum = 0;
+	struct percpu_list list;
+	pthread_t test_threads[200];
+	cpu_set_t allowed_cpus;
+
+	memset(&list, 0, sizeof(list));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		for (j = 1; j <= 100; j++) {
+			struct percpu_list_node *node;
+
+			expected_sum += j;
+
+			node = malloc(sizeof(*node));
+			assert(node);
+			node->data = j;
+			node->next = list.c[i].head;
+			list.c[i].head = node;
+		}
+	}
+
+	for (i = 0; i < 200; i++)
+		pthread_create(&test_threads[i], NULL,
+		       test_percpu_list_thread, &list);
+
+	for (i = 0; i < 200; i++)
+		pthread_join(test_threads[i], NULL);
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		struct percpu_list_node *node;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		while ((node = percpu_list_pop(&list, i))) {
+			sum += node->data;
+			free(node);
+		}
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+int main(int argc, char **argv)
+{
+	if (rseq_register_current_thread()) {
+		fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		goto error;
+	}
+	printf("spinlock\n");
+	test_percpu_spinlock();
+	printf("percpu_list\n");
+	test_percpu_list();
+	if (rseq_unregister_current_thread()) {
+		fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		goto error;
+	}
+	return 0;
+
+error:
+	return -1;
+}
+
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH for 4.21 15/16] cpu-opv/selftests: Provide parametrized tests
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
                   ` (13 preceding siblings ...)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 14/16] cpu-opv/selftests: Provide basic percpu ops test Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  2018-10-10 19:19 ` [RFC PATCH for 4.21 16/16] cpu-opv/selftests: Provide Makefile, scripts, gitignore Mathieu Desnoyers
  15 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers, Shuah Khan, linux-kselftest

"param_test" is a parametrizable percpu operations test using
both restartable sequences and cpu_opv. See the "--help" output for
usage.

"param_test_benchmark" is the same as "param_test", but it removes
testing book-keeping code to allow accurate benchmarks.

"param_test_compare_twice" is the same as "param_test", but it performs
each comparison within rseq critical section twice, thus validating
invariants. If any of the second comparisons fails, an error message
is printed and the test aborts.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 tools/testing/selftests/cpu-opv/param_test.c | 1187 ++++++++++++++++++++++++++
 1 file changed, 1187 insertions(+)
 create mode 100644 tools/testing/selftests/cpu-opv/param_test.c

diff --git a/tools/testing/selftests/cpu-opv/param_test.c b/tools/testing/selftests/cpu-opv/param_test.c
new file mode 100644
index 000000000000..c62e75f07385
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/param_test.c
@@ -0,0 +1,1187 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+#include <poll.h>
+#include <sys/types.h>
+#include <signal.h>
+#include <errno.h>
+#include <stddef.h>
+
+static inline pid_t gettid(void)
+{
+	return syscall(__NR_gettid);
+}
+
+#define NR_INJECT	9
+static int loop_cnt[NR_INJECT + 1];
+
+static int loop_cnt_1 asm("asm_loop_cnt_1") __attribute__((used));
+static int loop_cnt_2 asm("asm_loop_cnt_2") __attribute__((used));
+static int loop_cnt_3 asm("asm_loop_cnt_3") __attribute__((used));
+static int loop_cnt_4 asm("asm_loop_cnt_4") __attribute__((used));
+static int loop_cnt_5 asm("asm_loop_cnt_5") __attribute__((used));
+static int loop_cnt_6 asm("asm_loop_cnt_6") __attribute__((used));
+
+static int opt_modulo, verbose;
+
+static int opt_yield, opt_signal, opt_sleep,
+		opt_disable_rseq, opt_threads = 200,
+		opt_disable_mod = 0, opt_test = 's', opt_mb = 0;
+
+#ifndef RSEQ_SKIP_FASTPATH
+static long long opt_reps = 5000;
+#else
+static long long opt_reps = 100;
+#endif
+
+static __thread __attribute__((tls_model("initial-exec")))
+unsigned int signals_delivered;
+
+#ifndef BENCHMARK
+
+static __thread __attribute__((tls_model("initial-exec"), unused))
+unsigned int yield_mod_cnt, nr_abort;
+
+#define printf_verbose(fmt, ...)			\
+	do {						\
+		if (verbose)				\
+			printf(fmt, ## __VA_ARGS__);	\
+	} while (0)
+
+#ifdef __i386__
+
+#define INJECT_ASM_REG	"eax"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"mov asm_loop_cnt_" #n ", %%" INJECT_ASM_REG "\n\t" \
+	"test %%" INJECT_ASM_REG ",%%" INJECT_ASM_REG "\n\t" \
+	"jz 333f\n\t" \
+	"222:\n\t" \
+	"dec %%" INJECT_ASM_REG "\n\t" \
+	"jnz 222b\n\t" \
+	"333:\n\t"
+
+#elif defined(__x86_64__)
+
+#define INJECT_ASM_REG_P	"rax"
+#define INJECT_ASM_REG		"eax"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG_P \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"lea asm_loop_cnt_" #n "(%%rip), %%" INJECT_ASM_REG_P "\n\t" \
+	"mov (%%" INJECT_ASM_REG_P "), %%" INJECT_ASM_REG "\n\t" \
+	"test %%" INJECT_ASM_REG ",%%" INJECT_ASM_REG "\n\t" \
+	"jz 333f\n\t" \
+	"222:\n\t" \
+	"dec %%" INJECT_ASM_REG "\n\t" \
+	"jnz 222b\n\t" \
+	"333:\n\t"
+
+#elif defined(__ARMEL__)
+
+#define RSEQ_INJECT_INPUT \
+	, [loop_cnt_1]"m"(loop_cnt[1]) \
+	, [loop_cnt_2]"m"(loop_cnt[2]) \
+	, [loop_cnt_3]"m"(loop_cnt[3]) \
+	, [loop_cnt_4]"m"(loop_cnt[4]) \
+	, [loop_cnt_5]"m"(loop_cnt[5]) \
+	, [loop_cnt_6]"m"(loop_cnt[6])
+
+#define INJECT_ASM_REG	"r4"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"ldr " INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
+	"cmp " INJECT_ASM_REG ", #0\n\t" \
+	"beq 333f\n\t" \
+	"222:\n\t" \
+	"subs " INJECT_ASM_REG ", #1\n\t" \
+	"bne 222b\n\t" \
+	"333:\n\t"
+
+#elif __PPC__
+
+#define RSEQ_INJECT_INPUT \
+	, [loop_cnt_1]"m"(loop_cnt[1]) \
+	, [loop_cnt_2]"m"(loop_cnt[2]) \
+	, [loop_cnt_3]"m"(loop_cnt[3]) \
+	, [loop_cnt_4]"m"(loop_cnt[4]) \
+	, [loop_cnt_5]"m"(loop_cnt[5]) \
+	, [loop_cnt_6]"m"(loop_cnt[6])
+
+#define INJECT_ASM_REG	"r18"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"lwz %%" INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
+	"cmpwi %%" INJECT_ASM_REG ", 0\n\t" \
+	"beq 333f\n\t" \
+	"222:\n\t" \
+	"subic. %%" INJECT_ASM_REG ", %%" INJECT_ASM_REG ", 1\n\t" \
+	"bne 222b\n\t" \
+	"333:\n\t"
+#else
+#error unsupported target
+#endif
+
+#define RSEQ_INJECT_FAILED \
+	nr_abort++;
+
+#define RSEQ_INJECT_C(n) \
+{ \
+	int loc_i, loc_nr_loops = loop_cnt[n]; \
+	\
+	for (loc_i = 0; loc_i < loc_nr_loops; loc_i++) { \
+		rseq_barrier(); \
+	} \
+	if (loc_nr_loops == -1 && opt_modulo) { \
+		if (yield_mod_cnt == opt_modulo - 1) { \
+			if (opt_sleep > 0) \
+				poll(NULL, 0, opt_sleep); \
+			if (opt_yield) \
+				sched_yield(); \
+			if (opt_signal) \
+				raise(SIGUSR1); \
+			yield_mod_cnt = 0; \
+		} else { \
+			yield_mod_cnt++; \
+		} \
+	} \
+}
+
+#else
+
+#define printf_verbose(fmt, ...)
+
+#endif /* BENCHMARK */
+
+#include "percpu-op.h"
+
+struct percpu_lock_entry {
+	intptr_t v;
+} __attribute__((aligned(128)));
+
+struct percpu_lock {
+	struct percpu_lock_entry c[CPU_SETSIZE];
+};
+
+struct test_data_entry {
+	intptr_t count;
+} __attribute__((aligned(128)));
+
+struct spinlock_test_data {
+	struct percpu_lock lock;
+	struct test_data_entry c[CPU_SETSIZE];
+};
+
+struct spinlock_thread_test_data {
+	struct spinlock_test_data *data;
+	long long reps;
+	int reg;
+};
+
+struct inc_test_data {
+	struct test_data_entry c[CPU_SETSIZE];
+};
+
+struct inc_thread_test_data {
+	struct inc_test_data *data;
+	long long reps;
+	int reg;
+};
+
+struct percpu_list_node {
+	intptr_t data;
+	struct percpu_list_node *next;
+};
+
+struct percpu_list_entry {
+	struct percpu_list_node *head;
+} __attribute__((aligned(128)));
+
+struct percpu_list {
+	struct percpu_list_entry c[CPU_SETSIZE];
+};
+
+#define BUFFER_ITEM_PER_CPU	100
+
+struct percpu_buffer_node {
+	intptr_t data;
+};
+
+struct percpu_buffer_entry {
+	intptr_t offset;
+	intptr_t buflen;
+	struct percpu_buffer_node **array;
+} __attribute__((aligned(128)));
+
+struct percpu_buffer {
+	struct percpu_buffer_entry c[CPU_SETSIZE];
+};
+
+#define MEMCPY_BUFFER_ITEM_PER_CPU	100
+
+struct percpu_memcpy_buffer_node {
+	intptr_t data1;
+	uint64_t data2;
+};
+
+struct percpu_memcpy_buffer_entry {
+	intptr_t offset;
+	intptr_t buflen;
+	struct percpu_memcpy_buffer_node *array;
+} __attribute__((aligned(128)));
+
+struct percpu_memcpy_buffer {
+	struct percpu_memcpy_buffer_entry c[CPU_SETSIZE];
+};
+
+/* A simple percpu spinlock. */
+static void rseq_percpu_lock(struct percpu_lock *lock, int cpu)
+{
+	for (;;) {
+		int ret;
+
+		ret = percpu_cmpeqv_storev(&lock->c[cpu].v,
+					   0, 1, cpu);
+		if (rseq_likely(!ret))
+			break;
+		if (rseq_unlikely(ret < 0)) {
+			perror("cpu_opv");
+			abort();
+		}
+		/* Retry if comparison fails. */
+	}
+	/*
+	 * Acquire semantic when taking lock after control dependency.
+	 * Matches rseq_smp_store_release().
+	 */
+	rseq_smp_acquire__after_ctrl_dep();
+}
+
+static void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+	assert(lock->c[cpu].v == 1);
+	/*
+	 * Release lock, with release semantic. Matches
+	 * rseq_smp_acquire__after_ctrl_dep().
+	 */
+	rseq_smp_store_release(&lock->c[cpu].v, 0);
+}
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+	struct spinlock_thread_test_data *thread_data = arg;
+	struct spinlock_test_data *data = thread_data->data;
+	long long i, reps;
+
+	if (!opt_disable_rseq && thread_data->reg &&
+	    rseq_register_current_thread())
+		abort();
+	reps = thread_data->reps;
+	for (i = 0; i < reps; i++) {
+		int cpu = rseq_cpu_start();
+
+		rseq_percpu_lock(&data->lock, cpu);
+		data->c[cpu].count++;
+		rseq_percpu_unlock(&data->lock, cpu);
+#ifndef BENCHMARK
+		if (i != 0 && !(i % (reps / 10)))
+			printf_verbose("tid %d: count %lld\n", (int) gettid(), i);
+#endif
+	}
+	printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+		       (int) gettid(), nr_abort, signals_delivered);
+	if (!opt_disable_rseq && thread_data->reg &&
+	    rseq_unregister_current_thread())
+		abort();
+	return NULL;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu
+ * lock.  Obviously real applications might prefer to simply use a
+ * per-cpu increment; however, this is reasonable for a test and the
+ * lock can be extended to synchronize more complicated operations.
+ */
+void test_percpu_spinlock(void)
+{
+	const int num_threads = opt_threads;
+	int i, ret;
+	uint64_t sum;
+	pthread_t test_threads[num_threads];
+	struct spinlock_test_data data;
+	struct spinlock_thread_test_data thread_data[num_threads];
+
+	memset(&data, 0, sizeof(data));
+	for (i = 0; i < num_threads; i++) {
+		thread_data[i].reps = opt_reps;
+		if (opt_disable_mod <= 0 || (i % opt_disable_mod))
+			thread_data[i].reg = 1;
+		else
+			thread_data[i].reg = 0;
+		thread_data[i].data = &data;
+		ret = pthread_create(&test_threads[i], NULL,
+				     test_percpu_spinlock_thread,
+				     &thread_data[i]);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	sum = 0;
+	for (i = 0; i < CPU_SETSIZE; i++)
+		sum += data.c[i].count;
+
+	assert(sum == (uint64_t)opt_reps * num_threads);
+}
+
+void *test_percpu_inc_thread(void *arg)
+{
+	struct inc_thread_test_data *thread_data = arg;
+	struct inc_test_data *data = thread_data->data;
+	long long i, reps;
+
+	if (!opt_disable_rseq && thread_data->reg &&
+	    rseq_register_current_thread())
+		abort();
+	reps = thread_data->reps;
+	for (i = 0; i < reps; i++) {
+		int cpu, ret;
+
+		cpu = rseq_cpu_start();
+		ret = percpu_addv(&data->c[cpu].count, 1, cpu);
+		if (rseq_unlikely(ret)) {
+			perror("cpu_opv");
+			abort();
+		}
+#ifndef BENCHMARK
+		if (i != 0 && !(i % (reps / 10)))
+			printf_verbose("tid %d: count %lld\n", (int) gettid(), i);
+#endif
+	}
+	printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+		       (int) gettid(), nr_abort, signals_delivered);
+	if (!opt_disable_rseq && thread_data->reg &&
+	    rseq_unregister_current_thread())
+		abort();
+	return NULL;
+}
+
+void test_percpu_inc(void)
+{
+	const int num_threads = opt_threads;
+	int i, ret;
+	uint64_t sum;
+	pthread_t test_threads[num_threads];
+	struct inc_test_data data;
+	struct inc_thread_test_data thread_data[num_threads];
+
+	memset(&data, 0, sizeof(data));
+	for (i = 0; i < num_threads; i++) {
+		thread_data[i].reps = opt_reps;
+		if (opt_disable_mod <= 0 || (i % opt_disable_mod))
+			thread_data[i].reg = 1;
+		else
+			thread_data[i].reg = 0;
+		thread_data[i].data = &data;
+		ret = pthread_create(&test_threads[i], NULL,
+				     test_percpu_inc_thread,
+				     &thread_data[i]);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	sum = 0;
+	for (i = 0; i < CPU_SETSIZE; i++)
+		sum += data.c[i].count;
+
+	assert(sum == (uint64_t)opt_reps * num_threads);
+}
+
+void percpu_list_push(struct percpu_list *list,
+		      struct percpu_list_node *node,
+		      int cpu)
+{
+	for (;;) {
+		intptr_t *targetptr, newval, expect;
+		int ret;
+
+		/* Load list->c[cpu].head with single-copy atomicity. */
+		expect = (intptr_t)RSEQ_READ_ONCE(list->c[cpu].head);
+		newval = (intptr_t)node;
+		targetptr = (intptr_t *)&list->c[cpu].head;
+		node->next = (struct percpu_list_node *)expect;
+		ret = percpu_cmpeqv_storev(targetptr, expect, newval, cpu);
+		if (rseq_likely(!ret))
+			break;
+		if (rseq_unlikely(ret < 0)) {
+			perror("cpu_opv");
+			abort();
+		}
+		/* Retry if comparison fails. */
+	}
+}
+
+/*
+ * Unlike a traditional lock-less linked list; the availability of a
+ * rseq primitive allows us to implement pop without concerns over
+ * ABA-type races.
+ */
+struct percpu_list_node *percpu_list_pop(struct percpu_list *list,
+					 int cpu)
+{
+	struct percpu_list_node *head;
+	intptr_t *targetptr, expectnot, *load;
+	off_t offset;
+	int ret;
+
+	targetptr = (intptr_t *)&list->c[cpu].head;
+	expectnot = (intptr_t)NULL;
+	offset = offsetof(struct percpu_list_node, next);
+	load = (intptr_t *)&head;
+	ret = percpu_cmpnev_storeoffp_load(targetptr, expectnot,
+					   offset, load, cpu);
+	if (rseq_unlikely(ret < 0)) {
+		perror("cpu_opv");
+		abort();
+	}
+	if (ret > 0)
+		return NULL;
+	return head;
+}
+
+void *test_percpu_list_thread(void *arg)
+{
+	long long i, reps;
+	struct percpu_list *list = (struct percpu_list *)arg;
+
+	if (!opt_disable_rseq && rseq_register_current_thread())
+		abort();
+
+	reps = opt_reps;
+	for (i = 0; i < reps; i++) {
+		struct percpu_list_node *node;
+
+		node = percpu_list_pop(list, rseq_cpu_start());
+		if (opt_yield)
+			sched_yield();  /* encourage shuffling */
+		if (node)
+			percpu_list_push(list, node, rseq_cpu_start());
+	}
+
+	printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+		       (int) gettid(), nr_abort, signals_delivered);
+	if (!opt_disable_rseq && rseq_unregister_current_thread())
+		abort();
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu linked list from many threads.  */
+void test_percpu_list(void)
+{
+	const int num_threads = opt_threads;
+	int i, j, ret;
+	uint64_t sum = 0, expected_sum = 0;
+	struct percpu_list list;
+	pthread_t test_threads[num_threads];
+	cpu_set_t allowed_cpus;
+
+	memset(&list, 0, sizeof(list));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		for (j = 1; j <= 100; j++) {
+			struct percpu_list_node *node;
+
+			expected_sum += j;
+
+			node = malloc(sizeof(*node));
+			assert(node);
+			node->data = j;
+			node->next = list.c[i].head;
+			list.c[i].head = node;
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_create(&test_threads[i], NULL,
+				     test_percpu_list_thread, &list);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		struct percpu_list_node *node;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		while ((node = percpu_list_pop(&list, i))) {
+			sum += node->data;
+			free(node);
+		}
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+bool percpu_buffer_push(struct percpu_buffer *buffer,
+			struct percpu_buffer_node *node,
+			int cpu)
+{
+	for (;;) {
+		intptr_t *targetptr_spec, newval_spec;
+		intptr_t *targetptr_final, newval_final;
+		intptr_t offset;
+		int ret;
+
+		offset = RSEQ_READ_ONCE(buffer->c[cpu].offset);
+		if (offset == buffer->c[cpu].buflen)
+			return false;
+		newval_spec = (intptr_t)node;
+		targetptr_spec = (intptr_t *)&buffer->c[cpu].array[offset];
+		newval_final = offset + 1;
+		targetptr_final = &buffer->c[cpu].offset;
+		if (opt_mb)
+			ret = percpu_cmpeqv_storev_storev_release(
+				targetptr_final, offset, targetptr_spec,
+				newval_spec, newval_final, cpu);
+		else
+			ret = percpu_cmpeqv_storev_storev(targetptr_final,
+				offset, targetptr_spec, newval_spec,
+				newval_final, cpu);
+		if (rseq_likely(!ret))
+			break;
+		if (rseq_unlikely(ret < 0)) {
+			perror("cpu_opv");
+			abort();
+		}
+		/* Retry if comparison fails. */
+	}
+	return true;
+}
+
+struct percpu_buffer_node *percpu_buffer_pop(struct percpu_buffer *buffer,
+					     int cpu)
+{
+	struct percpu_buffer_node *head;
+
+	for (;;) {
+		intptr_t *targetptr, newval;
+		intptr_t offset;
+		int ret;
+
+		/* Load offset with single-copy atomicity. */
+		offset = RSEQ_READ_ONCE(buffer->c[cpu].offset);
+		if (offset == 0)
+			return NULL;
+		head = RSEQ_READ_ONCE(buffer->c[cpu].array[offset - 1]);
+		newval = offset - 1;
+		targetptr = (intptr_t *)&buffer->c[cpu].offset;
+		ret = percpu_cmpeqv_cmpeqv_storev(targetptr, offset,
+			(intptr_t *)&buffer->c[cpu].array[offset - 1],
+			(intptr_t)head, newval, cpu);
+		if (rseq_likely(!ret))
+			break;
+		if (rseq_unlikely(ret < 0)) {
+			perror("cpu_opv");
+			abort();
+		}
+		/* Retry if comparison fails. */
+	}
+	return head;
+}
+
+void *test_percpu_buffer_thread(void *arg)
+{
+	long long i, reps;
+	struct percpu_buffer *buffer = (struct percpu_buffer *)arg;
+
+	if (!opt_disable_rseq && rseq_register_current_thread())
+		abort();
+
+	reps = opt_reps;
+	for (i = 0; i < reps; i++) {
+		struct percpu_buffer_node *node;
+
+		node = percpu_buffer_pop(buffer, rseq_cpu_start());
+		if (opt_yield)
+			sched_yield();  /* encourage shuffling */
+		if (node) {
+			if (!percpu_buffer_push(buffer, node,
+						rseq_cpu_start())) {
+				/* Should increase buffer size. */
+				abort();
+			}
+		}
+	}
+
+	printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+		       (int) gettid(), nr_abort, signals_delivered);
+	if (!opt_disable_rseq && rseq_unregister_current_thread())
+		abort();
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu buffer from many threads.  */
+void test_percpu_buffer(void)
+{
+	const int num_threads = opt_threads;
+	int i, j, ret;
+	uint64_t sum = 0, expected_sum = 0;
+	struct percpu_buffer buffer;
+	pthread_t test_threads[num_threads];
+	cpu_set_t allowed_cpus;
+
+	memset(&buffer, 0, sizeof(buffer));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		/* Worst-case is every item in same CPU. */
+		buffer.c[i].array =
+			malloc(sizeof(*buffer.c[i].array) * CPU_SETSIZE *
+			       BUFFER_ITEM_PER_CPU);
+		assert(buffer.c[i].array);
+		buffer.c[i].buflen = CPU_SETSIZE * BUFFER_ITEM_PER_CPU;
+		for (j = 1; j <= BUFFER_ITEM_PER_CPU; j++) {
+			struct percpu_buffer_node *node;
+
+			expected_sum += j;
+
+			/*
+			 * We could theoretically put the word-sized
+			 * "data" directly in the buffer. However, we
+			 * want to model objects that would not fit
+			 * within a single word, so allocate an object
+			 * for each node.
+			 */
+			node = malloc(sizeof(*node));
+			assert(node);
+			node->data = j;
+			buffer.c[i].array[j - 1] = node;
+			buffer.c[i].offset++;
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_create(&test_threads[i], NULL,
+				     test_percpu_buffer_thread, &buffer);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		struct percpu_buffer_node *node;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		while ((node = percpu_buffer_pop(&buffer, i))) {
+			sum += node->data;
+			free(node);
+		}
+		free(buffer.c[i].array);
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+bool percpu_memcpy_buffer_push(struct percpu_memcpy_buffer *buffer,
+			       struct percpu_memcpy_buffer_node item, int cpu)
+{
+	for (;;) {
+		intptr_t *targetptr_final, newval_final, offset;
+		char *destptr, *srcptr;
+		size_t copylen;
+		int ret;
+
+		/* Load offset with single-copy atomicity. */
+		offset = RSEQ_READ_ONCE(buffer->c[cpu].offset);
+		if (offset == buffer->c[cpu].buflen)
+			return false;
+		destptr = (char *)&buffer->c[cpu].array[offset];
+		srcptr = (char *)&item;
+		/* copylen must be <= 4kB. */
+		copylen = sizeof(item);
+		newval_final = offset + 1;
+		targetptr_final = &buffer->c[cpu].offset;
+		if (opt_mb)
+			ret = percpu_cmpeqv_memcpy_storev_release(
+				targetptr_final, offset,
+				destptr, srcptr, copylen,
+				newval_final, cpu);
+		else
+			ret = percpu_cmpeqv_memcpy_storev(targetptr_final,
+				offset, destptr, srcptr, copylen,
+				newval_final, cpu);
+		if (rseq_likely(!ret))
+			break;
+		if (rseq_unlikely(ret < 0)) {
+			perror("cpu_opv");
+			abort();
+		}
+		/* Retry if comparison fails. */
+	}
+	return true;
+}
+
+bool percpu_memcpy_buffer_pop(struct percpu_memcpy_buffer *buffer,
+		struct percpu_memcpy_buffer_node *item, int cpu)
+{
+	for (;;) {
+		intptr_t *targetptr_final, newval_final, offset;
+		char *destptr, *srcptr;
+		size_t copylen;
+		int ret;
+
+		/* Load offset with single-copy atomicity. */
+		offset = RSEQ_READ_ONCE(buffer->c[cpu].offset);
+		if (offset == 0)
+			return false;
+		destptr = (char *)item;
+		srcptr = (char *)&buffer->c[cpu].array[offset - 1];
+		/* copylen must be <= 4kB. */
+		copylen = sizeof(*item);
+		newval_final = offset - 1;
+		targetptr_final = &buffer->c[cpu].offset;
+		ret = percpu_cmpeqv_memcpy_storev(targetptr_final,
+			offset, destptr, srcptr, copylen,
+			newval_final, cpu);
+		if (rseq_likely(!ret))
+			break;
+		if (rseq_unlikely(ret < 0)) {
+			perror("cpu_opv");
+			abort();
+		}
+		/* Retry if comparison fails. */
+	}
+	return true;
+}
+
+void *test_percpu_memcpy_buffer_thread(void *arg)
+{
+	long long i, reps;
+	struct percpu_memcpy_buffer *buffer = (struct percpu_memcpy_buffer *)arg;
+
+	if (!opt_disable_rseq && rseq_register_current_thread())
+		abort();
+
+	reps = opt_reps;
+	for (i = 0; i < reps; i++) {
+		struct percpu_memcpy_buffer_node item;
+		bool result;
+
+		result = percpu_memcpy_buffer_pop(buffer, &item,
+						  rseq_cpu_start());
+		if (opt_yield)
+			sched_yield();  /* encourage shuffling */
+		if (result) {
+			if (!percpu_memcpy_buffer_push(buffer, item,
+						       rseq_cpu_start())) {
+				/* Should increase buffer size. */
+				abort();
+			}
+		}
+	}
+
+	printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+		       (int) gettid(), nr_abort, signals_delivered);
+	if (!opt_disable_rseq && rseq_unregister_current_thread())
+		abort();
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu buffer from many threads.  */
+void test_percpu_memcpy_buffer(void)
+{
+	const int num_threads = opt_threads;
+	int i, j, ret;
+	uint64_t sum = 0, expected_sum = 0;
+	struct percpu_memcpy_buffer buffer;
+	pthread_t test_threads[num_threads];
+	cpu_set_t allowed_cpus;
+
+	memset(&buffer, 0, sizeof(buffer));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		/* Worst-case is every item in same CPU. */
+		buffer.c[i].array =
+			malloc(sizeof(*buffer.c[i].array) * CPU_SETSIZE *
+			       MEMCPY_BUFFER_ITEM_PER_CPU);
+		assert(buffer.c[i].array);
+		buffer.c[i].buflen = CPU_SETSIZE * MEMCPY_BUFFER_ITEM_PER_CPU;
+		for (j = 1; j <= MEMCPY_BUFFER_ITEM_PER_CPU; j++) {
+			expected_sum += 2 * j + 1;
+
+			/*
+			 * We could theoretically put the word-sized
+			 * "data" directly in the buffer. However, we
+			 * want to model objects that would not fit
+			 * within a single word, so allocate an object
+			 * for each node.
+			 */
+			buffer.c[i].array[j - 1].data1 = j;
+			buffer.c[i].array[j - 1].data2 = j + 1;
+			buffer.c[i].offset++;
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_create(&test_threads[i], NULL,
+				     test_percpu_memcpy_buffer_thread,
+				     &buffer);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		struct percpu_memcpy_buffer_node item;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		while (percpu_memcpy_buffer_pop(&buffer, &item, i)) {
+			sum += item.data1;
+			sum += item.data2;
+		}
+		free(buffer.c[i].array);
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+static void test_signal_interrupt_handler(int signo)
+{
+	signals_delivered++;
+}
+
+static int set_signal_handler(void)
+{
+	int ret = 0;
+	struct sigaction sa;
+	sigset_t sigset;
+
+	ret = sigemptyset(&sigset);
+	if (ret < 0) {
+		perror("sigemptyset");
+		return ret;
+	}
+
+	sa.sa_handler = test_signal_interrupt_handler;
+	sa.sa_mask = sigset;
+	sa.sa_flags = 0;
+	ret = sigaction(SIGUSR1, &sa, NULL);
+	if (ret < 0) {
+		perror("sigaction");
+		return ret;
+	}
+
+	printf_verbose("Signal handler set for SIGUSR1\n");
+
+	return ret;
+}
+
+static void show_usage(int argc, char **argv)
+{
+	printf("Usage : %s <OPTIONS>\n",
+		argv[0]);
+	printf("OPTIONS:\n");
+	printf("	[-1 loops] Number of loops for delay injection 1\n");
+	printf("	[-2 loops] Number of loops for delay injection 2\n");
+	printf("	[-3 loops] Number of loops for delay injection 3\n");
+	printf("	[-4 loops] Number of loops for delay injection 4\n");
+	printf("	[-5 loops] Number of loops for delay injection 5\n");
+	printf("	[-6 loops] Number of loops for delay injection 6\n");
+	printf("	[-7 loops] Number of loops for delay injection 7 (-1 to enable -m)\n");
+	printf("	[-8 loops] Number of loops for delay injection 8 (-1 to enable -m)\n");
+	printf("	[-9 loops] Number of loops for delay injection 9 (-1 to enable -m)\n");
+	printf("	[-m N] Yield/sleep/kill every modulo N (default 0: disabled) (>= 0)\n");
+	printf("	[-y] Yield\n");
+	printf("	[-k] Kill thread with signal\n");
+	printf("	[-s S] S: =0: disabled (default), >0: sleep time (ms)\n");
+	printf("	[-t N] Number of threads (default 200)\n");
+	printf("	[-r N] Number of repetitions per thread (default 5000)\n");
+	printf("	[-d] Disable rseq system call (no initialization)\n");
+	printf("	[-D M] Disable rseq for each M threads\n");
+	printf("	[-T test] Choose test: (s)pinlock, (l)ist, (b)uffer, (m)emcpy, (i)ncrement\n");
+	printf("	[-M] Push into buffer and memcpy buffer with memory barriers.\n");
+	printf("	[-v] Verbose output.\n");
+	printf("	[-h] Show this help.\n");
+	printf("\n");
+}
+
+int main(int argc, char **argv)
+{
+	int i;
+
+	for (i = 1; i < argc; i++) {
+		if (argv[i][0] != '-')
+			continue;
+		switch (argv[i][1]) {
+		case '1':
+		case '2':
+		case '3':
+		case '4':
+		case '5':
+		case '6':
+		case '7':
+		case '8':
+		case '9':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			loop_cnt[argv[i][1] - '0'] = atol(argv[i + 1]);
+			i++;
+			break;
+		case 'm':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_modulo = atol(argv[i + 1]);
+			if (opt_modulo < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 's':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_sleep = atol(argv[i + 1]);
+			if (opt_sleep < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'y':
+			opt_yield = 1;
+			break;
+		case 'k':
+			opt_signal = 1;
+			break;
+		case 'd':
+			opt_disable_rseq = 1;
+			break;
+		case 'D':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_disable_mod = atol(argv[i + 1]);
+			if (opt_disable_mod < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 't':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_threads = atol(argv[i + 1]);
+			if (opt_threads < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'r':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_reps = atoll(argv[i + 1]);
+			if (opt_reps < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'h':
+			show_usage(argc, argv);
+			goto end;
+		case 'T':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_test = *argv[i + 1];
+			switch (opt_test) {
+			case 's':
+			case 'l':
+			case 'i':
+			case 'b':
+			case 'm':
+				break;
+			default:
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'v':
+			verbose = 1;
+			break;
+		case 'M':
+			opt_mb = 1;
+			break;
+		default:
+			show_usage(argc, argv);
+			goto error;
+		}
+	}
+
+	loop_cnt_1 = loop_cnt[1];
+	loop_cnt_2 = loop_cnt[2];
+	loop_cnt_3 = loop_cnt[3];
+	loop_cnt_4 = loop_cnt[4];
+	loop_cnt_5 = loop_cnt[5];
+	loop_cnt_6 = loop_cnt[6];
+
+	if (set_signal_handler())
+		goto error;
+
+	if (!opt_disable_rseq && rseq_register_current_thread())
+		goto error;
+	switch (opt_test) {
+	case 's':
+		printf_verbose("spinlock\n");
+		test_percpu_spinlock();
+		break;
+	case 'l':
+		printf_verbose("linked list\n");
+		test_percpu_list();
+		break;
+	case 'b':
+		printf_verbose("buffer\n");
+		test_percpu_buffer();
+		break;
+	case 'm':
+		printf_verbose("memcpy buffer\n");
+		test_percpu_memcpy_buffer();
+		break;
+	case 'i':
+		printf_verbose("counter increment\n");
+		test_percpu_inc();
+		break;
+	}
+	if (!opt_disable_rseq && rseq_unregister_current_thread())
+		abort();
+end:
+	return 0;
+
+error:
+	return -1;
+}
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH for 4.21 16/16] cpu-opv/selftests: Provide Makefile, scripts, gitignore
  2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
                   ` (14 preceding siblings ...)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 15/16] cpu-opv/selftests: Provide parametrized tests Mathieu Desnoyers
@ 2018-10-10 19:19 ` Mathieu Desnoyers
  15 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-10 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Mathieu Desnoyers, Shuah Khan, linux-kselftest

The run_param_test.sh script runs many variants of the parametrizable
tests.

Wire up the cpu-opv selftests Makefile and add a directory entry to the
MAINTAINERS file.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 MAINTAINERS                                       |   1 +
 tools/testing/selftests/Makefile                  |   1 +
 tools/testing/selftests/cpu-opv/.gitignore        |   6 +
 tools/testing/selftests/cpu-opv/Makefile          |  39 +++++++
 tools/testing/selftests/cpu-opv/run_param_test.sh | 134 ++++++++++++++++++++++
 5 files changed, 181 insertions(+)
 create mode 100644 tools/testing/selftests/cpu-opv/.gitignore
 create mode 100644 tools/testing/selftests/cpu-opv/Makefile
 create mode 100755 tools/testing/selftests/cpu-opv/run_param_test.sh

diff --git a/MAINTAINERS b/MAINTAINERS
index 3b50578fc5d9..206fe4597697 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3864,6 +3864,7 @@ L:	linux-kernel@vger.kernel.org
 S:	Supported
 F:	kernel/cpu_opv.c
 F:	include/uapi/linux/cpu_opv.h
+F:	tools/testing/selftests/cpu-opv/
 
 CRAMFS FILESYSTEM
 M:	Nicolas Pitre <nico@linaro.org>
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index f1fe492c8e17..42bcc6e5fc5d 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -6,6 +6,7 @@ TARGETS += capabilities
 TARGETS += cgroup
 TARGETS += cpufreq
 TARGETS += cpu-hotplug
+TARGETS += cpu-opv
 TARGETS += efivarfs
 TARGETS += exec
 TARGETS += filesystems
diff --git a/tools/testing/selftests/cpu-opv/.gitignore b/tools/testing/selftests/cpu-opv/.gitignore
new file mode 100644
index 000000000000..8c7bb1f8be79
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/.gitignore
@@ -0,0 +1,6 @@
+basic_cpu_opv_test
+basic_percpu_ops_test
+param_test
+param_test_benchmark
+param_test_compare_twice
+param_test_skip_fastpath
diff --git a/tools/testing/selftests/cpu-opv/Makefile b/tools/testing/selftests/cpu-opv/Makefile
new file mode 100644
index 000000000000..46f49bf30bae
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/Makefile
@@ -0,0 +1,39 @@
+# SPDX-License-Identifier: GPL-2.0+ OR MIT
+CFLAGS += -O2 -Wall -g -I./ -I../rseq/ -I../../../../usr/include/ -L./ -Wl,-rpath=./
+LDLIBS += -lpthread
+
+# Own dependencies because we only want to build against 1st prerequisite, but
+# still track changes to header files and depend on shared object.
+OVERRIDE_TARGETS = 1
+
+TEST_GEN_PROGS = basic_cpu_opv_test basic_percpu_ops_test \
+		param_test param_test_skip_fastpath \
+		param_test_benchmark param_test_compare_twice
+
+TEST_GEN_PROGS_EXTENDED = librseq.so libcpu-op.so
+
+TEST_PROGS = run_param_test.sh
+
+include ../lib.mk
+
+$(OUTPUT)/libcpu-op.so: cpu-op.c cpu-op.h
+	$(CC) $(CFLAGS) -shared -fPIC $< $(LDLIBS) -o $@
+
+$(OUTPUT)/librseq.so: ../rseq/rseq.c ../rseq/rseq.h ../rseq/rseq-*.h
+	$(CC) $(CFLAGS) -shared -fPIC $< $(LDLIBS) -o $@
+
+$(OUTPUT)/%: %.c $(TEST_GEN_PROGS_EXTENDED) ../rseq/rseq.h ../rseq/rseq-*.h cpu-op.h percpu-op.h
+	$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -lcpu-op -o $@
+
+$(OUTPUT)/param_test_skip_fastpath: param_test.c $(TEST_GEN_PROGS_EXTENDED) \
+					../rseq/rseq.h ../rseq/rseq-*.h cpu-op.h percpu-op.h
+	$(CC) $(CFLAGS) -DRSEQ_SKIP_FASTPATH $< $(LDLIBS) -lrseq -lcpu-op -o $@
+
+$(OUTPUT)/param_test_benchmark: param_test.c $(TEST_GEN_PROGS_EXTENDED) \
+					../rseq/rseq.h ../rseq/rseq-*.h cpu-op.h percpu-op.h
+	$(CC) $(CFLAGS) -DBENCHMARK $< $(LDLIBS) -lrseq -lcpu-op -o $@
+
+$(OUTPUT)/param_test_compare_twice: param_test.c $(TEST_GEN_PROGS_EXTENDED) \
+					../rseq/rseq.h ../rseq/rseq-*.h cpu-op.h percpu-op.h
+	$(CC) $(CFLAGS) -DRSEQ_COMPARE_TWICE $< $(LDLIBS) -lrseq -lcpu-op -o $@
+
diff --git a/tools/testing/selftests/cpu-opv/run_param_test.sh b/tools/testing/selftests/cpu-opv/run_param_test.sh
new file mode 100755
index 000000000000..066d2479893c
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/run_param_test.sh
@@ -0,0 +1,134 @@
+#!/bin/bash
+
+NR_CPUS=`grep '^processor' /proc/cpuinfo | wc -l`
+
+EXTRA_ARGS=${@}
+
+OLDIFS="$IFS"
+IFS=$'\n'
+TEST_LIST=(
+	"-T s"
+	"-T l"
+	"-T b"
+	"-T b -M"
+	"-T m"
+	"-T m -M"
+	"-T i"
+)
+
+TEST_NAME=(
+	"spinlock"
+	"list"
+	"buffer"
+	"buffer with barrier"
+	"memcpy"
+	"memcpy with barrier"
+	"increment"
+)
+IFS="$OLDIFS"
+
+REPS=1000
+SLOW_REPS=100
+NR_THREADS=$((6*${NR_CPUS}))
+
+function do_tests()
+{
+	local i=0
+	while [ "$i" -lt "${#TEST_LIST[@]}" ]; do
+		echo "Running test ${TEST_NAME[$i]}"
+		./param_test ${TEST_LIST[$i]} -r ${REPS} -t ${NR_THREADS} ${@} ${EXTRA_ARGS} || exit 1
+		echo "Running skip fast-path test ${TEST_NAME[$i]}"
+		./param_test_skip_fastpath ${TEST_LIST[$i]} -r ${SLOW_REPS} -t ${NR_THREADS} ${@} ${EXTRA_ARGS} || exit 1
+		echo "Running compare-twice test ${TEST_NAME[$i]}"
+		./param_test_compare_twice ${TEST_LIST[$i]} -r ${REPS} -t ${NR_THREADS} ${@} ${EXTRA_ARGS} || exit 1
+		let "i++"
+	done
+}
+
+echo "Default parameters"
+do_tests
+
+echo "Loop injection: 10000 loops"
+
+OLDIFS="$IFS"
+IFS=$'\n'
+INJECT_LIST=(
+	"1"
+	"2"
+	"3"
+	"4"
+	"5"
+	"6"
+	"7"
+	"8"
+	"9"
+)
+IFS="$OLDIFS"
+
+NR_LOOPS=10000
+
+i=0
+while [ "$i" -lt "${#INJECT_LIST[@]}" ]; do
+	echo "Injecting at <${INJECT_LIST[$i]}>"
+	do_tests -${INJECT_LIST[i]} ${NR_LOOPS}
+	let "i++"
+done
+NR_LOOPS=
+
+function inject_blocking()
+{
+	OLDIFS="$IFS"
+	IFS=$'\n'
+	INJECT_LIST=(
+		"7"
+		"8"
+		"9"
+	)
+	IFS="$OLDIFS"
+
+	NR_LOOPS=-1
+
+	i=0
+	while [ "$i" -lt "${#INJECT_LIST[@]}" ]; do
+		echo "Injecting at <${INJECT_LIST[$i]}>"
+		do_tests -${INJECT_LIST[i]} -1 ${@}
+		let "i++"
+	done
+	NR_LOOPS=
+}
+
+echo "Yield injection (25%)"
+inject_blocking -m 4 -y
+
+echo "Yield injection (50%)"
+inject_blocking -m 2 -y
+
+echo "Yield injection (100%)"
+inject_blocking -m 1 -y
+
+echo "Kill injection (25%)"
+inject_blocking -m 4 -k
+
+echo "Kill injection (50%)"
+inject_blocking -m 2 -k
+
+echo "Kill injection (100%)"
+inject_blocking -m 1 -k
+
+echo "Sleep injection (1ms, 25%)"
+inject_blocking -m 4 -s 1
+
+echo "Sleep injection (1ms, 50%)"
+inject_blocking -m 2 -s 1
+
+echo "Sleep injection (1ms, 100%)"
+inject_blocking -m 1 -s 1
+
+echo "Disable rseq for 25% threads"
+do_tests -D 4
+
+echo "Disable rseq for 50% threads"
+do_tests -D 2
+
+echo "Disable rseq"
+do_tests -d
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc
  2018-10-10 19:19 ` [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc Mathieu Desnoyers
@ 2018-10-11 10:37   ` Szabolcs Nagy
  2018-10-11 15:13     ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Szabolcs Nagy @ 2018-10-11 10:37 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Zijlstra, Paul E . McKenney, Boqun Feng
  Cc: nd, linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes,
	Shuah Khan, Carlos O'Donell, Florian Weimer, Joseph Myers

On 10/10/18 20:19, Mathieu Desnoyers wrote:
> In order to integrate rseq into user-space applications, add a reference
> counter field after the struct rseq TLS ABI so many rseq users can be
> linked into the same application (e.g. librseq and glibc). The
> reference count ensures that rseq syscall registration/unregistration
> happens only for the most early/late user for each thread, thus ensuring
> that rseq is registered across the lifetime of all rseq users for a
> given thread.
...
> +__attribute__((visibility("hidden"))) __thread
> +volatile struct libc_rseq __lib_rseq_abi = {
...
> +extern __attribute__((weak, alias("__lib_rseq_abi"))) __thread
> +volatile struct rseq __rseq_abi;
...
> @@ -70,7 +86,7 @@ int rseq_register_current_thread(void)
>  	sigset_t oldset;
>  
>  	signal_off_save(&oldset);
> -	if (refcount++)
> +	if (__lib_rseq_abi.refcount++)
>  		goto end;
>  	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);

why do you use a local refcounter instead of the __rseq_abi one?

what prevents calling rseq_register_current_thread more than 4G times?

why cant the kernel see that the same address is registered again and succeed?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc
  2018-10-11 10:37   ` Szabolcs Nagy
@ 2018-10-11 15:13     ` Mathieu Desnoyers
  2018-10-11 16:20       ` Szabolcs Nagy
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-11 15:13 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, nd, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes, shuah, carlos, Florian Weimer,
	Joseph Myers

----- On Oct 11, 2018, at 6:37 AM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:

> On 10/10/18 20:19, Mathieu Desnoyers wrote:
>> In order to integrate rseq into user-space applications, add a reference
>> counter field after the struct rseq TLS ABI so many rseq users can be
>> linked into the same application (e.g. librseq and glibc). The
>> reference count ensures that rseq syscall registration/unregistration
>> happens only for the most early/late user for each thread, thus ensuring
>> that rseq is registered across the lifetime of all rseq users for a
>> given thread.
> ...
>> +__attribute__((visibility("hidden"))) __thread
>> +volatile struct libc_rseq __lib_rseq_abi = {
> ...
>> +extern __attribute__((weak, alias("__lib_rseq_abi"))) __thread
>> +volatile struct rseq __rseq_abi;
> ...
>> @@ -70,7 +86,7 @@ int rseq_register_current_thread(void)
>>  	sigset_t oldset;
>>  
>>  	signal_off_save(&oldset);
>> -	if (refcount++)
>> +	if (__lib_rseq_abi.refcount++)
>>  		goto end;
>>  	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);
> 
> why do you use a local refcounter instead of the __rseq_abi one?

There is no refcount in struct rseq (the ABI between kernel and user-space).
The registration refcount was part of an earlier version of the rseq system call,
but we decided against keeping it in the kernel.

So I'm adding one _after_ struct rseq, purely to allow interaction between
various user-space components (program/libraries).

> 
> what prevents calling rseq_register_current_thread more than 4G times?

Nothing. It would indeed be cleaner to error out if we detect that refcount is at
INT_MAX. Is that what you have in mind ?

> 
> why cant the kernel see that the same address is registered again and succeed?

It can, and it does. However, refcounting at user-level is needed to ensure
the registration "lifetime" for rseq covers its entire use. If we have two libraries
using rseq, we end up with the following scenario:

Thread 1

  libA registers rseq
  libB registers rseq
  libB unregisters rseq
  libA uses rseq -> bug! it's been unregistered by libB.
  libA unregisters rseq -> unexpected, it's already been unregistered.
 
same applies if libA unregisters rseq before libB (and libB tries to use rseq
after libA has unregistered).

The refcount in user-space fixes this.
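
For reference, the pattern each rseq user would implement is roughly the
following sketch (simplified from the selftests code quoted above; the
signal blocking around the refcount update and the errno handling are
elided, and RSEQ_FLAG_UNREGISTER is the unregistration flag from
linux/rseq.h):

int rseq_register_current_thread(void)
{
	/* Only the first rseq user of this thread registers with the kernel. */
	if (__lib_rseq_abi.refcount++)
		return 0;
	return sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);
}

int rseq_unregister_current_thread(void)
{
	/* Only the last rseq user of this thread unregisters with the kernel. */
	if (--__lib_rseq_abi.refcount)
		return 0;
	return sys_rseq(&__rseq_abi, sizeof(struct rseq),
			RSEQ_FLAG_UNREGISTER, RSEQ_SIG);
}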

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc
  2018-10-11 15:13     ` Mathieu Desnoyers
@ 2018-10-11 16:20       ` Szabolcs Nagy
  2018-10-11 16:37         ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Szabolcs Nagy @ 2018-10-11 16:20 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: nd, Peter Zijlstra, Paul E. McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes, shuah, carlos, Florian Weimer,
	Joseph Myers

On 11/10/18 16:13, Mathieu Desnoyers wrote:
> ----- On Oct 11, 2018, at 6:37 AM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
> 
>> On 10/10/18 20:19, Mathieu Desnoyers wrote:
>>> In order to integrate rseq into user-space applications, add a reference
>>> counter field after the struct rseq TLS ABI so many rseq users can be
>>> linked into the same application (e.g. librseq and glibc). The
>>> reference count ensures that rseq syscall registration/unregistration
>>> happens only for the most early/late user for each thread, thus ensuring
>>> that rseq is registered across the lifetime of all rseq users for a
>>> given thread.
>> ...
>>> +__attribute__((visibility("hidden"))) __thread
>>> +volatile struct libc_rseq __lib_rseq_abi = {
>> ...
>>> +extern __attribute__((weak, alias("__lib_rseq_abi"))) __thread
>>> +volatile struct rseq __rseq_abi;
>> ...
>>> @@ -70,7 +86,7 @@ int rseq_register_current_thread(void)
>>>  	sigset_t oldset;
>>>  
>>>  	signal_off_save(&oldset);
>>> -	if (refcount++)
>>> +	if (__lib_rseq_abi.refcount++)
>>>  		goto end;
>>>  	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);
>>
>> why do you use a local refcounter instead of the __rseq_abi one?
> 
> There is no refcount in struct rseq (the ABI between kernel and user-space).
> The registration refcount was part of an earlier version of the rseq system call,
> but we decided against keeping it in the kernel.
> 
> So I'm adding one _after_ struct rseq, purely to allow interaction between
> various user-space components (program/libraries).

then all those components must use the same

  rseq_register_current_thread
  rseq_unregister_current_thread

functions and not call the syscall on their own.

in which case the refcount could be a static __thread variable.

but it's in a magic struct that's called "abi" which is confusing,
the counter is not abi, it's in a hidden object.

>> what prevents calling rseq_register_current_thread more than 4G times?
> 
> Nothing. It would indeed be cleaner to error out if we detect that refcount is at
> INT_MAX. Is that what you have in mind ?

yes

>> why cant the kernel see that the same address is registered again and succeed?
> 
> It can, and it does. However, refcounting at user-level is needed to ensure
> the registration "lifetime" for rseq covers its entire use. If we have two libraries
> using rseq, we end up with the following scenario:
> 
> Thread 1
> 
>   libA registers rseq
>   libB registers rseq
>   libB unregisters rseq
>   libA uses rseq -> bug! it's been unregistered by libB.
>   libA unregisters rseq -> unexpected, it's already been unregistered.
>  
> same applies if libA unregisters rseq before libB (and libB tries to use rseq
> after libA has unregistered).
> 
> The refcount in user-space fixes this.

i see.

> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc
  2018-10-11 16:20       ` Szabolcs Nagy
@ 2018-10-11 16:37         ` Mathieu Desnoyers
  2018-10-11 17:04           ` Szabolcs Nagy
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-11 16:37 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: nd, Peter Zijlstra, Paul E. McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes, shuah, carlos, Florian Weimer,
	Joseph Myers

----- On Oct 11, 2018, at 12:20 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:

> On 11/10/18 16:13, Mathieu Desnoyers wrote:
>> ----- On Oct 11, 2018, at 6:37 AM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>> 
>>> On 10/10/18 20:19, Mathieu Desnoyers wrote:
>>>> In order to integrate rseq into user-space applications, add a reference
>>>> counter field after the struct rseq TLS ABI so many rseq users can be
>>>> linked into the same application (e.g. librseq and glibc). The
>>>> reference count ensures that rseq syscall registration/unregistration
>>>> happens only for the most early/late user for each thread, thus ensuring
>>>> that rseq is registered across the lifetime of all rseq users for a
>>>> given thread.
>>> ...
>>>> +__attribute__((visibility("hidden"))) __thread
>>>> +volatile struct libc_rseq __lib_rseq_abi = {
>>> ...
>>>> +extern __attribute__((weak, alias("__lib_rseq_abi"))) __thread
>>>> +volatile struct rseq __rseq_abi;
>>> ...
>>>> @@ -70,7 +86,7 @@ int rseq_register_current_thread(void)
>>>>  	sigset_t oldset;
>>>>  
>>>>  	signal_off_save(&oldset);
>>>> -	if (refcount++)
>>>> +	if (__lib_rseq_abi.refcount++)
>>>>  		goto end;
>>>>  	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);
>>>
>>> why do you use a local refcounter instead of the __rseq_abi one?
>> 
>> There is no refcount in struct rseq (the ABI between kernel and user-space).
>> The registration refcount was part of an earlier version of the rseq system
>> call,
>> but we decided against keeping it in the kernel.
>> 
>> So I'm adding one _after_ struct rseq, purely to allow interaction between
>> various user-space components (program/libraries).
> 
> then all those components must use the same
> 
>  rseq_register_current_thread
>  rseq_unregister_current_thread
> 
> functions and not call the syscall on their own.

Not quite. Each user (program or shared object) must handle the refcount in a
similar way if it wishes to invoke the syscall by itself. It can
alternatively use the librseq APIs if it does not wish to have a local
implementation of the reference counting and syscall registration/unregistration.

> 
> in which case the refcount could be a static __thread variable.

Yes, but I want to limit the number of symbols we need to export
from glibc by appending the refcount field at the end of struct rseq.
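
Concretely, the layout is along the following lines (a sketch: the
leading fields mirror struct rseq from linux/rseq.h, with rseq_cs shown
as a plain __u64 for brevity; only refcount is the user-space
extension):

struct libc_rseq {
	/* Kernel-consumed fields, same layout as struct rseq: */
	__u32 cpu_id_start;
	__u32 cpu_id;
	__u64 rseq_cs;
	__u32 flags;
	/* User-space-only field, never read by the kernel: */
	__u32 refcount;
} __attribute__((aligned(4 * sizeof(__u64))));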

> 
> but it's in a magic struct that's called "abi" which is confusing,
> the counter is not abi, it's in a hidden object.

No, it is really an ABI between user-space apps/libs. It's not meant to be
hidden. glibc implements its own register/unregister functions (it does not
link against librseq). librseq exposes register/unregister functions as public
APIs. Those also use the refcount. I also plan to have existing libraries, e.g.
liblttng-ust and possibly liburcu flavors, implement the
registration/unregistration and refcount handling on their own, so we don't
have to add a requirement on additional linking on librseq for pre-existing
libraries.

So that refcount is not an ABI between kernel and user-space, but it's a
user-space ABI nevertheless (between program and shared objects).

> 
>>> what prevents calling rseq_register_current_thread more than 4G times?
>> 
>> Nothing. It would indeed be cleaner to error out if we detect that refcount is
>> at
>> INT_MAX. Is that what you have in mind ?
> 
> yes

All right, will fix.

> 
>>> why cant the kernel see that the same address is registered again and succeed?
>> 
>> It can, and it does. However, refcounting at user-level is needed to ensure
>> the registration "lifetime" for rseq covers its entire use. If we have two
>> libraries
>> using rseq, we end up with the following scenario:
>> 
>> Thread 1
>> 
>>   libA registers rseq
>>   libB registers rseq
>>   libB unregisters rseq
>>   libA uses rseq -> bug! it's been unregistered by libB.
>>   libA unregisters rseq -> unexpected, it's already been unregistered.
>>  
>> same applies if libA unregisters rseq before libB (and libB try to use rseq
>> after libA has unregistered).
>> 
>> The refcount in user-space fixes this.
> 
> i see.

Thanks for the feedback!

Mathieu

> 
>> Thoughts ?
>> 
>> Thanks,
>> 
>> Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc
  2018-10-11 16:37         ` Mathieu Desnoyers
@ 2018-10-11 17:04           ` Szabolcs Nagy
  2018-10-11 19:42             ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Szabolcs Nagy @ 2018-10-11 17:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: nd, Peter Zijlstra, Paul E. McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes, shuah, carlos, Florian Weimer,
	Joseph Myers

On 11/10/18 17:37, Mathieu Desnoyers wrote:
> ----- On Oct 11, 2018, at 12:20 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>> On 11/10/18 16:13, Mathieu Desnoyers wrote:
>>> ----- On Oct 11, 2018, at 6:37 AM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>>>> On 10/10/18 20:19, Mathieu Desnoyers wrote:
>>>>> +__attribute__((visibility("hidden"))) __thread
>>>>> +volatile struct libc_rseq __lib_rseq_abi = {
>>>> ...
>> but it's in a magic struct that's called "abi" which is confusing,
>> the counter is not abi, it's in a hidden object.
> 
> No, it is really an ABI between user-space apps/libs. It's not meant to be
> hidden. glibc implements its own register/unregister functions (it does not
> link against librseq). librseq exposes register/unregister functions as public
> APIs. Those also use the refcount. I also plan to have existing libraries, e.g.
> liblttng-ust and possibly liburcu flavors, implement the
> registration/unregistration and refcount handling on their own, so we don't
> have to add a requirement on additional linking on librseq for pre-existing
> libraries.
> 
> So that refcount is not an ABI between kernel and user-space, but it's a
> user-space ABI nevertheless (between program and shared objects).
> 

if that's what you want, then your declaration is wrong.
the object should not have hidden visibility.

then each library (glibc etc) will have its own separate
tls object with their own separate refcounter (and they
will unregister when their own refcounter hits 0)

either the struct should be public abi (extern tls
symbol) or the register/unregister functions should
be public abi (so when multiple implementations are
present in the same process only one of them will
provide definition for the public abi symbol and
thus there will be one refcounter).

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc
  2018-10-11 17:04           ` Szabolcs Nagy
@ 2018-10-11 19:42             ` Mathieu Desnoyers
  2018-10-12  9:59               ` Szabolcs Nagy
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-11 19:42 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: nd, Peter Zijlstra, Paul E. McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes, shuah, carlos, Florian Weimer,
	Joseph Myers

----- On Oct 11, 2018, at 1:04 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:

> On 11/10/18 17:37, Mathieu Desnoyers wrote:
>> ----- On Oct 11, 2018, at 12:20 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>>> On 11/10/18 16:13, Mathieu Desnoyers wrote:
>>>> ----- On Oct 11, 2018, at 6:37 AM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>>>>> On 10/10/18 20:19, Mathieu Desnoyers wrote:
>>>>>> +__attribute__((visibility("hidden"))) __thread
>>>>>> +volatile struct libc_rseq __lib_rseq_abi = {
>>>>> ...
>>> but it's in a magic struct that's called "abi" which is confusing,
>>> the counter is not abi, it's in a hidden object.
>> 
>> No, it is really an ABI between user-space apps/libs. It's not meant to be
>> hidden. glibc implements its own register/unregister functions (it does not
>> link against librseq). librseq exposes register/unregister functions as public
>> APIs. Those also use the refcount. I also plan to have existing libraries, e.g.
>> liblttng-ust and possibly liburcu flavors, implement the
>> registration/unregistration and refcount handling on their own, so we don't
>> have to add a requirement on additional linking on librseq for pre-existing
>> libraries.
>> 
>> So that refcount is not an ABI between kernel and user-space, but it's a
>> user-space ABI nevertheless (between program and shared objects).
>> 
> 
> if that's what you want, then your declaration is wrong.
> the object should not have hidden visibility.

Actually, if we look closer into my patch, it defines two symbols,
one of which is an alias:

__attribute__((visibility("hidden"))) __thread
volatile struct libc_rseq __lib_rseq_abi = {
        .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
};

extern __attribute__((weak, alias("__lib_rseq_abi"))) __thread
volatile struct rseq __rseq_abi;

Note that the public __rseq_abi symbol is weak but does not have
hidden visibility. I do this to ensure I don't get prototype
mismatch for __rseq_abi between rseq.c and rseq.h (it is required
to be a struct rseq by rseq.h), but I want the space to hold the
extra refcount field present in struct libc_rseq.


> 
> then each library (glibc etc) will have its own separate
> tls object with their own separate refcounter (and they
> will unregister when their own refcounter hits 0)

Given they all interact with the public __rseq_abi symbol,
at field refcount offset, they all effectively use the same
refcount field per thread, which serves the intended purpose.

> 
> either the struct should be public abi (extern tls
> symbol) or the register/unregister functions should
> be public abi (so when multiple implementations are
> present in the same process only one of them will
> provide definition for the public abi symbol and
> thus there will be one refcounter).

Those are two possible solutions, indeed. Considering that
we already need to expose the __rseq_abi symbol as a public
ABI in a way that ensures that multiple implementations
in a same process end up only using one of them, it seems
straightforward to simply extend that structure and hold the
refcount there, rather than having two extra ABI symbols
(register/unregister functions).

One very appropriate question here is whether we want to
expose the layout of struct libc_rseq (which includes the
refcount) in a public header file, and if so, which project
should hold it ? Or do we just want to document the layout
of this ABI so projects can define the structure layout
internally ? As my implementation currently stands, I have
the following structure duplicated into rseq selftests,
librseq, and glibc:

/*
 * linux/rseq.h defines struct rseq as aligned on 32 bytes. The kernel ABI
 * size is 20 bytes. For support of multiple rseq users within a process,
 * user-space defines an extra 4 bytes field as a reference count, for a
 * total of 24 bytes.
 */
struct libc_rseq {
        /* kernel-userspace ABI. */
        __u32 cpu_id_start;
        __u32 cpu_id;
        __u64 rseq_cs;
        __u32 flags;
        /* user-space ABI. */
        __u32 refcount;
} __attribute__((aligned(4 * sizeof(__u64))));

That duplicated structure only needs to be present in early-adopter
applications/libraries. Those linking on librseq or relying on newer
glibc to register rseq don't need to know about this extended layout:
all they need to care about is the layout of struct rseq (without the
added refcount). 
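
For illustration only (this sketch is not part of the posted patch), here is
roughly how an early-adopter library could use that shared refcount. It assumes
the struct libc_rseq layout above and an RSEQ_SIG constant as used by the
selftests, omits the signal blocking the real selftests code performs around
the refcount update, and the function names are made up:

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/rseq.h>

extern __thread volatile struct rseq __rseq_abi;

static int my_lib_rseq_register(void)
{
	volatile struct libc_rseq *r = (volatile struct libc_rseq *)&__rseq_abi;

	/* Only the first user on this thread performs the registration. */
	if (r->refcount++)
		return 0;
	return syscall(__NR_rseq, &__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);
}

static void my_lib_rseq_unregister(void)
{
	volatile struct libc_rseq *r = (volatile struct libc_rseq *)&__rseq_abi;

	/* Only the last user on this thread performs the unregistration. */
	if (--r->refcount)
		return;
	syscall(__NR_rseq, &__rseq_abi, sizeof(struct rseq),
		RSEQ_FLAG_UNREGISTER, RSEQ_SIG);
}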

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc
  2018-10-11 19:42             ` Mathieu Desnoyers
@ 2018-10-12  9:59               ` Szabolcs Nagy
  2018-10-23 14:59                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Szabolcs Nagy @ 2018-10-12  9:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: nd, Peter Zijlstra, Paul E. McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes, shuah, carlos, Florian Weimer,
	Joseph Myers

On 11/10/18 20:42, Mathieu Desnoyers wrote:
> ----- On Oct 11, 2018, at 1:04 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
> 
>> On 11/10/18 17:37, Mathieu Desnoyers wrote:
>>> ----- On Oct 11, 2018, at 12:20 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>>>> On 11/10/18 16:13, Mathieu Desnoyers wrote:
>>>>> ----- On Oct 11, 2018, at 6:37 AM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>>>>>> On 10/10/18 20:19, Mathieu Desnoyers wrote:
>>>>>>> +__attribute__((visibility("hidden"))) __thread
>>>>>>> +volatile struct libc_rseq __lib_rseq_abi = {
>>>>>> ...
>>>> but it's in a magic struct that's called "abi" which is confusing,
>>>> the counter is not abi, it's in a hidden object.
>>>
>>> No, it is really an ABI between user-space apps/libs. It's not meant to be
>>> hidden. glibc implements its own register/unregister functions (it does not
>>> link against librseq). librseq exposes register/unregister functions as public
>>> APIs. Those also use the refcount. I also plan to have existing libraries, e.g.
>>> liblttng-ust and possibly liburcu flavors, implement the
>>> registration/unregistration and refcount handling on their own, so we don't
>>> have to add a requirement on additional linking on librseq for pre-existing
>>> libraries.
>>>
>>> So that refcount is not an ABI between kernel and user-space, but it's a
>>> user-space ABI nevertheless (between program and shared objects).
>>>
>>
>> if that's what you want, then your declaration is wrong.
>> the object should not have hidden visibility.
> 
> Actually, if we look closer into my patch, it defines two symbols,
> one of which is an alias:
> 
> __attribute__((visibility("hidden"))) __thread
> volatile struct libc_rseq __lib_rseq_abi = {
>         .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
> };
> 
> extern __attribute__((weak, alias("__lib_rseq_abi"))) __thread
> volatile struct rseq __rseq_abi;
> 
> Note that the public __rseq_abi symbol is weak but does not have
> hidden visibility. I do this to ensure I don't get prototype
> mismatch for __rseq_abi between rseq.c and rseq.h (it is required
> to be a struct rseq by rseq.h), but I want the space to hold the
> extra refcount field present in struct libc_rseq.
> 

but that's wrong: the weak symbol might get resolved to
a different object in another module, while you increment
a local refcounter, so there is no coordination between
userspace components.

this was the reason for my first question in my original mail,
as soon as i saw the local counter i suspected this is broken.

and "assume there is an extra counter field" is not
acceptable as user space abi, if the counter is relevant
across modules then expose the entire struct.

>> either the struct should be public abi (extern tls
>> symbol) or the register/unregister functions should
>> be public abi (so when multiple implementations are
>> present in the same process only one of them will
>> provide definition for the public abi symbol and
>> thus there will be one refcounter).
> 
> Those are two possible solutions, indeed. Considering that
> we already need to expose the __rseq_abi symbol as a public
> ABI in a way that ensures that multiple implementations
> in a same process end up only using one of them, it seems
> straightforward to simply extend that structure and hold the
> refcount there, rather than having two extra ABI symbols
> (register/unregister functions).
> 
> One very appropriate question here is whether we want to
> expose the layout of struct libc_rseq (which includes the
> refcount) in a public header file, and if so, which project
> should hold it ? Or do we just want to document the layout
> of this ABI so projects can define the structure layout
> internally ? As my implementation currently stands, I have
> the following structure duplicated into rseq selftests,
> librseq, and glibc:
> 

"not exposed" and "the counter is abi" together is not
useful, either you want coordination in user-space or
not, that decision should imply the userspace abi/api
(e.g. adding a counter to the user-space struct).

it is true that only modules that implement registration
need to know about the counter and normal users don't,
but if you want any coordination then the layout must
be fixed and that should be exposed somewhere to avoid
breakage.

(i think ideally the api would be controlled by functions
and not object symbols with magic layout, but the rseq
design is already full of such magic. and i think it's
better to do the registration in libc only without
coordination but that might not be practical if users
want it now)

> /*
>  * linux/rseq.h defines struct rseq as aligned on 32 bytes. The kernel ABI
>  * size is 20 bytes. For support of multiple rseq users within a process,
>  * user-space defines an extra 4 bytes field as a reference count, for a
>  * total of 24 bytes.
>  */
> struct libc_rseq {
>         /* kernel-userspace ABI. */
>         __u32 cpu_id_start;
>         __u32 cpu_id;
>         __u64 rseq_cs;
>         __u32 flags;
>         /* user-space ABI. */
>         __u32 refcount;
> } __attribute__((aligned(4 * sizeof(__u64))));
> 
> That duplicated structure only needs to be present in early-adopter
> applications/libraries. Those linking on librseq or relying on newer
> glibc to register rseq don't need to know about this extended layout:
> all they need to care about is the layout of struct rseq (without the
> added refcount). 

please decide if you want multiple libraries to
be able to register rseq and coordinate or not
and document that decision in the public api.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 06/16] cpu_opv: Provide cpu_opv system call (v8)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 06/16] cpu_opv: Provide cpu_opv system call (v8) Mathieu Desnoyers
@ 2018-10-16  8:10   ` Sergey Senozhatsky
  2018-10-16 19:17     ` Mathieu Desnoyers
  2018-10-17  7:19   ` Srikar Dronamraju
  1 sibling, 1 reply; 40+ messages in thread
From: Sergey Senozhatsky @ 2018-10-16  8:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E . McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H . Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Joel Fernandes

Hi Mathieu,

On (10/10/18 15:19), Mathieu Desnoyers wrote:
[..]
> +SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
> +		int, cpu, int, flags)
> +{
[..]
> +again:
> +	ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &vaddr_ptrs);
> +	if (ret)
> +		goto end;
> +	ret = do_cpu_opv(cpuopv, cpuopcnt, &vaddr_ptrs, cpu);
> +	if (ret == -EAGAIN)
> +		retry = true;
> +end:
> +	for (i = 0; i < vaddr_ptrs.nr_vaddr; i++) {
> +		struct vaddr *vaddr = &vaddr_ptrs.addr[i];
> +		int j;
> +
> +		vm_unmap_user_ram((void *)vaddr->mem, vaddr->nr_pages);

A dumb question.

Both vm_unmap_user_ram() and vm_map_user_ram() can BUG_ON().
So this is
   userspace -> syscall -> cpu_opv() -> vm_unmap_user_ram() -> BUG_ON()

Any chance someone can exploit it?

	-ss

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 04/16] mm: Introduce vm_map_user_ram, vm_unmap_user_ram
  2018-10-10 19:19 ` [RFC PATCH for 4.21 04/16] mm: Introduce vm_map_user_ram, vm_unmap_user_ram Mathieu Desnoyers
@ 2018-10-16 18:30   ` Steven Rostedt
  2018-10-16 19:21     ` Mathieu Desnoyers
  2018-10-17  0:27     ` Sergey Senozhatsky
  0 siblings, 2 replies; 40+ messages in thread
From: Steven Rostedt @ 2018-10-16 18:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E . McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H . Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes, Sergey Senozhatsky

On Wed, 10 Oct 2018 15:19:24 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> + * vm_unmap_user_ram - unmap linear kernel address space set up by vm_map_user_ram
> + * @mem: the pointer returned by vm_map_user_ram
> + * @count: the count passed to that vm_map_user_ram call (cannot unmap partial)
> + */
> +void vm_unmap_user_ram(const void *mem, unsigned int count)
> +{
> +	unsigned long size = (unsigned long)count << PAGE_SHIFT;
> +	unsigned long addr = (unsigned long)mem;
> +	struct vmap_area *va;
> +
> +	might_sleep();
> +	BUG_ON(!addr);
> +	BUG_ON(addr < VMALLOC_START);
> +	BUG_ON(addr > VMALLOC_END);
> +	BUG_ON(!PAGE_ALIGNED(addr));
> +
> +	debug_check_no_locks_freed(mem, size);
> +	va = find_vmap_area(addr);
> +	BUG_ON(!va);
> +	free_unmap_vmap_area(va);
> +}
> +EXPORT_SYMBOL(vm_unmap_user_ram);
> +

Noticing this from Sergey's question in another patch, why are you
using BUG_ON()? That's rather extreme and something we are trying to
avoid adding more of (I still need to remove the BUG_ON()s I've added
over ten years ago). I don't see why all these BUG_ON's can't be turned
into:

	if (WARN_ON(x))
		return;

-- Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 06/16] cpu_opv: Provide cpu_opv system call (v8)
  2018-10-16  8:10   ` Sergey Senozhatsky
@ 2018-10-16 19:17     ` Mathieu Desnoyers
  2018-10-17  1:46       ` Sergey Senozhatsky
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-16 19:17 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes

----- On Oct 16, 2018, at 4:10 AM, Sergey Senozhatsky sergey.senozhatsky.work@gmail.com wrote:

> Hi Mathieu,
> 
> On (10/10/18 15:19), Mathieu Desnoyers wrote:
> [..]
>> +SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
>> +		int, cpu, int, flags)
>> +{
> [..]
>> +again:
>> +	ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &vaddr_ptrs);
>> +	if (ret)
>> +		goto end;
>> +	ret = do_cpu_opv(cpuopv, cpuopcnt, &vaddr_ptrs, cpu);
>> +	if (ret == -EAGAIN)
>> +		retry = true;
>> +end:
>> +	for (i = 0; i < vaddr_ptrs.nr_vaddr; i++) {
>> +		struct vaddr *vaddr = &vaddr_ptrs.addr[i];
>> +		int j;
>> +
>> +		vm_unmap_user_ram((void *)vaddr->mem, vaddr->nr_pages);
> 
> A dumb question.
> 
> Both vm_unmap_user_ram() and vm_map_user_ram() can BUG_ON().
> So this is
>   userspace -> syscall -> cpu_opv() -> vm_unmap_user_ram() -> BUG_ON()
> 
> Any chance someone can exploit it?

Hi Sergey,

Let's look at vm_unmap_user_ram() and vm_map_user_ram() separately.

If we look at the input to vm_unmap_user_ram(), it is called with the
following parameters by the cpu_opv system call:

        for (i = 0; i < vaddr_ptrs.nr_vaddr; i++) {
                struct vaddr *vaddr = &vaddr_ptrs.addr[i];
                int j;

                vm_unmap_user_ram((void *)vaddr->mem, vaddr->nr_pages);
[...]
        }

The vaddr_ptrs array content is filled by the call to cpu_opv_pin_pages above:

        ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &vaddr_ptrs);
        if (ret)
                goto end;

by passing the array to cpu_op_pin_pages(), which appends a virtual address at
the end of the array (on success) and increments nr_vaddr. Those virtual
addresses are returned by vm_map_user_ram(), so they are not user-controlled.

Therefore, only an internal kernel bug between vm_map_user_ram() and
vm_unmap_user_ram() should trigger the BUG_ON(). No user input is passed
to vm_unmap_user_ram().

Now, let's look at vm_map_user_ram(). It calls alloc_vmap_area(), which returns
a vmap_area. Then, if vmap_page_range() fails, vm_unmap_user_ram() is called on
the address range that vm_map_user_ram() has just set up. Again, only an internal
bug between map/unmap can trigger the BUG_ON() in vm_unmap_user_ram().

Is there another scenario I missed ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 04/16] mm: Introduce vm_map_user_ram, vm_unmap_user_ram
  2018-10-16 18:30   ` Steven Rostedt
@ 2018-10-16 19:21     ` Mathieu Desnoyers
  2018-10-16 19:40       ` Steven Rostedt
  2018-10-17  0:27     ` Sergey Senozhatsky
  1 sibling, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-16 19:21 UTC (permalink / raw)
  To: rostedt
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes, Sergey Senozhatsky

----- On Oct 16, 2018, at 2:30 PM, rostedt rostedt@goodmis.org wrote:

> On Wed, 10 Oct 2018 15:19:24 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> + * vm_unmap_user_ram - unmap linear kernel address space set up by
>> vm_map_user_ram
>> + * @mem: the pointer returned by vm_map_user_ram
>> + * @count: the count passed to that vm_map_user_ram call (cannot unmap partial)
>> + */
>> +void vm_unmap_user_ram(const void *mem, unsigned int count)
>> +{
>> +	unsigned long size = (unsigned long)count << PAGE_SHIFT;
>> +	unsigned long addr = (unsigned long)mem;
>> +	struct vmap_area *va;
>> +
>> +	might_sleep();
>> +	BUG_ON(!addr);
>> +	BUG_ON(addr < VMALLOC_START);
>> +	BUG_ON(addr > VMALLOC_END);
>> +	BUG_ON(!PAGE_ALIGNED(addr));
>> +
>> +	debug_check_no_locks_freed(mem, size);
>> +	va = find_vmap_area(addr);
>> +	BUG_ON(!va);
>> +	free_unmap_vmap_area(va);
>> +}
>> +EXPORT_SYMBOL(vm_unmap_user_ram);
>> +
> 
> Noticing this from Sergey's question in another patch, why are you
> using BUG_ON()? That's rather extreme and something we are trying to
> avoid adding more of (I still need to remove the BUG_ON()s I've added
> over ten years ago). I don't see why all these BUG_ON's can't be turned
> into:
> 
>	if (WARN_ON(x))
>		return;

I borrowed the code from vm_unmap_ram(), which has the following checks:

        BUG_ON(!addr);
        BUG_ON(addr < VMALLOC_START);
        BUG_ON(addr > VMALLOC_END);
        BUG_ON(!PAGE_ALIGNED(addr));
[...]
        va = find_vmap_area(addr);
        BUG_ON(!va);

The expectation here is that inputs to vm_unmap_ram() should always come from
vm_map_ram(), so an erroneous input is an internal kernel bug. I applied the
same logic to vm_unmap_user_ram() and vm_map_user_ram().

Should we turn all those BUG_ON() into if (WARN_ON(x)) return; in vm_{map,unmap}_ram
as well ?
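
For concreteness, a sketch (untested) of what the WARN_ON-based variant could
look like for vm_unmap_user_ram():

void vm_unmap_user_ram(const void *mem, unsigned int count)
{
	unsigned long size = (unsigned long)count << PAGE_SHIFT;
	unsigned long addr = (unsigned long)mem;
	struct vmap_area *va;

	might_sleep();
	/* Bail out with a warning instead of crashing on bad input. */
	if (WARN_ON(!addr) ||
	    WARN_ON(addr < VMALLOC_START) ||
	    WARN_ON(addr > VMALLOC_END) ||
	    WARN_ON(!PAGE_ALIGNED(addr)))
		return;

	debug_check_no_locks_freed(mem, size);
	va = find_vmap_area(addr);
	if (WARN_ON(!va))
		return;
	free_unmap_vmap_area(va);
}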

Thanks,

Mathieu


> 
> -- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 04/16] mm: Introduce vm_map_user_ram, vm_unmap_user_ram
  2018-10-16 19:21     ` Mathieu Desnoyers
@ 2018-10-16 19:40       ` Steven Rostedt
  0 siblings, 0 replies; 40+ messages in thread
From: Steven Rostedt @ 2018-10-16 19:40 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes, Sergey Senozhatsky

On Tue, 16 Oct 2018 15:21:31 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On Oct 16, 2018, at 2:30 PM, rostedt rostedt@goodmis.org wrote:
> 
> > On Wed, 10 Oct 2018 15:19:24 -0400
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >   
> >> + * vm_unmap_user_ram - unmap linear kernel address space set up by
> >> vm_map_user_ram
> >> + * @mem: the pointer returned by vm_map_user_ram
> >> + * @count: the count passed to that vm_map_user_ram call (cannot unmap partial)
> >> + */
> >> +void vm_unmap_user_ram(const void *mem, unsigned int count)
> >> +{
> >> +	unsigned long size = (unsigned long)count << PAGE_SHIFT;
> >> +	unsigned long addr = (unsigned long)mem;
> >> +	struct vmap_area *va;
> >> +
> >> +	might_sleep();
> >> +	BUG_ON(!addr);
> >> +	BUG_ON(addr < VMALLOC_START);
> >> +	BUG_ON(addr > VMALLOC_END);
> >> +	BUG_ON(!PAGE_ALIGNED(addr));
> >> +
> >> +	debug_check_no_locks_freed(mem, size);
> >> +	va = find_vmap_area(addr);
> >> +	BUG_ON(!va);
> >> +	free_unmap_vmap_area(va);
> >> +}
> >> +EXPORT_SYMBOL(vm_unmap_user_ram);
> >> +  
> > 
> > Noticing this from Sergey's question in another patch, why are you
> > using BUG_ON()? That's rather extreme and something we are trying to
> > avoid adding more of (I still need to remove the BUG_ON()s I've added
> > over ten years ago). I don't see why all these BUG_ON's can't be turned
> > into:
> > 
> >	if (WARN_ON(x))
> >		return;  
> 
> I borrowed the code from vm_unmap_ram(), which has the following checks:
> 
>         BUG_ON(!addr);
>         BUG_ON(addr < VMALLOC_START);
>         BUG_ON(addr > VMALLOC_END);
>         BUG_ON(!PAGE_ALIGNED(addr));
> [...]
>         va = find_vmap_area(addr);
>         BUG_ON(!va);
> 
> The expectation here is that inputs to vm_unmap_ram() should always come from
> vm_map_ram(), so an erroneous input is an internal kernel bug. I applied the
> same logic to vm_unmap_user_ram() and vm_map_user_ram().
> 
> Should we turn all those BUG_ON() into if (WARN_ON(x)) return; in vm_{map,unmap}_ram
> as well ?
> 
>

I would argue yes! That code was added in 2008 (which is also the same
year I added BUG_ON() to my code). Back then it wasn't such an issue,
but today we are finding (and Linus has been complaining) that BUG_ON
really shouldn't be necessary. Especially if you can get out of the
function with a simple return.

-- Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 04/16] mm: Introduce vm_map_user_ram, vm_unmap_user_ram
  2018-10-16 18:30   ` Steven Rostedt
  2018-10-16 19:21     ` Mathieu Desnoyers
@ 2018-10-17  0:27     ` Sergey Senozhatsky
  2018-10-17 15:00       ` Mathieu Desnoyers
  1 sibling, 1 reply; 40+ messages in thread
From: Sergey Senozhatsky @ 2018-10-17  0:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E . McKenney, Boqun Feng,
	linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Joel Fernandes, Sergey Senozhatsky

On (10/16/18 14:30), Steven Rostedt wrote:
> > +void vm_unmap_user_ram(const void *mem, unsigned int count)
> > +{
> > +	unsigned long size = (unsigned long)count << PAGE_SHIFT;
> > +	unsigned long addr = (unsigned long)mem;
> > +	struct vmap_area *va;
> > +
> > +	might_sleep();
> > +	BUG_ON(!addr);
> > +	BUG_ON(addr < VMALLOC_START);
> > +	BUG_ON(addr > VMALLOC_END);
> > +	BUG_ON(!PAGE_ALIGNED(addr));
> > +
> > +	debug_check_no_locks_freed(mem, size);
> > +	va = find_vmap_area(addr);
> > +	BUG_ON(!va);
> > +	free_unmap_vmap_area(va);
> > +}
> > +EXPORT_SYMBOL(vm_unmap_user_ram);
> > +
> 
> Noticing this from Sergey's question in another patch, why are you
> using BUG_ON()? That's rather extreme and something we are trying to
> avoid adding more of (I still need to remove the BUG_ON()s I've added
> over ten years ago). I don't see why all these BUG_ON's can't be turned
> into:

+1

> 	if (WARN_ON(x))
> 		return;

Given that this is somewhat MM-related, I'd maybe say VM_WARN_ON().

	-ss

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 06/16] cpu_opv: Provide cpu_opv system call (v8)
  2018-10-16 19:17     ` Mathieu Desnoyers
@ 2018-10-17  1:46       ` Sergey Senozhatsky
  0 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2018-10-17  1:46 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Sergey Senozhatsky, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H. Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes

Hi Mathieu,

On (10/16/18 15:17), Mathieu Desnoyers wrote:
>
> Therefore, only an internal kernel bug between vm_map_user_ram() and
> vm_unmap_user_ram() should trigger the BUG_ON(). No user input is passed
> to vm_unmap_user_ram().
>
> Now, let's look at vm_map_user_ram(). It calls alloc_vmap_area(), which returns
> a vmap_area. Then if vmap_page_range failed, vm_unmap_user_ram is called on the
> memory that has just been returned by vm_map_user_ram. Again, only an internal
> bug between map/unmap can trigger the BUG_ON() in vm_unmap_user_ram.

Thanks for spending time on this.
Just wanted someone to have an extra look at syscall->BUG_ON().

	-ss

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 03/16] sched: Implement push_task_to_cpu (v2)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 03/16] sched: Implement push_task_to_cpu (v2) Mathieu Desnoyers
@ 2018-10-17  6:51   ` Srikar Dronamraju
  2018-10-17 15:09     ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Srikar Dronamraju @ 2018-10-17  6:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E . McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H . Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Joel Fernandes

Hi Mathieu,

> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
> +{

In your use case, is the task going to be current?
If yes, we should simply be using migrate_task_to.

> +	struct rq_flags rf;
> +	struct rq *rq;
> +	int ret = 0;
> +
> +	rq = task_rq_lock(p, &rf);
> +	update_rq_clock(rq);
> +
> +	if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}

Ideally we should have checked cpus_allowed/cpu_active_mask before taking
the lock. This would help reduce the contention on the rqlock when the
passed parameter is not correct.
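
Something along these lines, purely as an illustration of the suggestion
(untested; the checks under the lock would remain the authoritative ones,
since both masks can change concurrently):

int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
{
	struct rq_flags rf;
	struct rq *rq;
	int ret = 0;

	/* Unlocked pre-checks to avoid taking the rq lock on bad input. */
	if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
		return -EINVAL;
	if (!cpumask_test_cpu(dest_cpu, cpu_active_mask))
		return -EBUSY;

	rq = task_rq_lock(p, &rf);
	update_rq_clock(rq);
	[...]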

> +
> +	if (!cpumask_test_cpu(dest_cpu, cpu_active_mask)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	if (task_cpu(p) == dest_cpu)
> +		goto out;

Same as above.

> +
> +	if (task_running(rq, p) || p->state == TASK_WAKING) {

Why are we using migration thread to move a task in TASK_WAKING state?

> +		struct migration_arg arg = { p, dest_cpu };
> +		/* Need help from migration thread: drop lock and wait. */
> +		task_rq_unlock(rq, p, &rf);
> +		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
> +		tlb_migrate_finish(p->mm);
> +		return 0;

Why can't we use migrate_task_to instead?

> +	} else if (task_on_rq_queued(p)) {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 455fa330de04..27ad25780204 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1340,6 +1340,15 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
>  #endif
>  }
>  
> +#ifdef CONFIG_SMP
> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);
> +#else
> +static inline int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
> +{
> +	return 0;
> +}
> +#endif
> +

Your usecase is outside kernel/sched. So I am not sure if this is the right
place for the declaration.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 06/16] cpu_opv: Provide cpu_opv system call (v8)
  2018-10-10 19:19 ` [RFC PATCH for 4.21 06/16] cpu_opv: Provide cpu_opv system call (v8) Mathieu Desnoyers
  2018-10-16  8:10   ` Sergey Senozhatsky
@ 2018-10-17  7:19   ` Srikar Dronamraju
  2018-10-17 15:11     ` Mathieu Desnoyers
  1 sibling, 1 reply; 40+ messages in thread
From: Srikar Dronamraju @ 2018-10-17  7:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E . McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H . Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Joel Fernandes


Hi Mathieu,

> +static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt,
> +		      struct cpu_opv_vaddr *vaddr_ptrs, int cpu)
> +{
> +	struct mm_struct *mm = current->mm;
> +	int ret;
> +
> +retry:
> +	if (cpu != raw_smp_processor_id()) {
> +		ret = push_task_to_cpu(current, cpu);
> +		if (ret)
> +			goto check_online;
> +	}
> +	down_read(&mm->mmap_sem);
> +	ret = vaddr_ptrs_check(vaddr_ptrs);
> +	if (ret)
> +		goto end;
> +	preempt_disable();
> +	if (cpu != smp_processor_id()) {
> +		preempt_enable();
> +		up_read(&mm->mmap_sem);
> +		goto retry;
> +	}

If a higher-priority task (or several) is pinned to the cpu, don't we end up
busy-looping until it exits or sleeps?

> +	ret = __do_cpu_opv(cpuop, cpuopcnt);
> +	preempt_enable();
> +end:
> +	up_read(&mm->mmap_sem);
> +	return ret;
> +
> +check_online:
> +	/*
> +	 * push_task_to_cpu() returns -EINVAL if the requested cpu is not part
> +	 * of the current thread's cpus_allowed mask.
> +	 */
> +	if (ret == -EINVAL)
> +		return ret;
> +	get_online_cpus();
> +	if (cpu_online(cpu)) {
> +		put_online_cpus();
> +		goto retry;
> +	}


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 04/16] mm: Introduce vm_map_user_ram, vm_unmap_user_ram
  2018-10-17  0:27     ` Sergey Senozhatsky
@ 2018-10-17 15:00       ` Mathieu Desnoyers
  2018-10-17 15:04         ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-17 15:00 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: rostedt, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H. Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Joel Fernandes

----- On Oct 16, 2018, at 8:27 PM, Sergey Senozhatsky sergey.senozhatsky.work@gmail.com wrote:

> On (10/16/18 14:30), Steven Rostedt wrote:
>> > +void vm_unmap_user_ram(const void *mem, unsigned int count)
>> > +{
>> > +	unsigned long size = (unsigned long)count << PAGE_SHIFT;
>> > +	unsigned long addr = (unsigned long)mem;
>> > +	struct vmap_area *va;
>> > +
>> > +	might_sleep();
>> > +	BUG_ON(!addr);
>> > +	BUG_ON(addr < VMALLOC_START);
>> > +	BUG_ON(addr > VMALLOC_END);
>> > +	BUG_ON(!PAGE_ALIGNED(addr));
>> > +
>> > +	debug_check_no_locks_freed(mem, size);
>> > +	va = find_vmap_area(addr);
>> > +	BUG_ON(!va);
>> > +	free_unmap_vmap_area(va);
>> > +}
>> > +EXPORT_SYMBOL(vm_unmap_user_ram);
>> > +
>> 
>> Noticing this from Sergey's question in another patch, why are you
>> using BUG_ON()? That's rather extreme and something we are trying to
>> avoid adding more of (I still need to remove the BUG_ON()s I've added
>> over ten years ago). I don't see why all these BUG_ON's can't be turned
>> into:
> 
> +1
> 
>> 	if (WARN_ON(x))
>> 		return;
> 
>> Given that this is somewhat MM-related, I'd maybe say VM_WARN_ON().

Good point, will do!

So I'll do one cleanup patch for vm_unmap_ram(), and I'll modify the new vm_unmap_user_ram().

Thanks,

Mathieu

> 
> 	-ss

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 04/16] mm: Introduce vm_map_user_ram, vm_unmap_user_ram
  2018-10-17 15:00       ` Mathieu Desnoyers
@ 2018-10-17 15:04         ` Mathieu Desnoyers
  2018-10-17 15:34           ` Sergey Senozhatsky
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-17 15:04 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: rostedt, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	linux-kernel, linux-api, Thomas Gleixner, Andy Lutomirski,
	Dave Watson, Paul Turner, Andrew Morton, Russell King,
	Ingo Molnar, H. Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Joel Fernandes

----- On Oct 17, 2018, at 11:00 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Oct 16, 2018, at 8:27 PM, Sergey Senozhatsky
> sergey.senozhatsky.work@gmail.com wrote:
> 
>> On (10/16/18 14:30), Steven Rostedt wrote:
>>> > +void vm_unmap_user_ram(const void *mem, unsigned int count)
>>> > +{
>>> > +	unsigned long size = (unsigned long)count << PAGE_SHIFT;
>>> > +	unsigned long addr = (unsigned long)mem;
>>> > +	struct vmap_area *va;
>>> > +
>>> > +	might_sleep();
>>> > +	BUG_ON(!addr);
>>> > +	BUG_ON(addr < VMALLOC_START);
>>> > +	BUG_ON(addr > VMALLOC_END);
>>> > +	BUG_ON(!PAGE_ALIGNED(addr));
>>> > +
>>> > +	debug_check_no_locks_freed(mem, size);
>>> > +	va = find_vmap_area(addr);
>>> > +	BUG_ON(!va);
>>> > +	free_unmap_vmap_area(va);
>>> > +}
>>> > +EXPORT_SYMBOL(vm_unmap_user_ram);
>>> > +
>>> 
>>> Noticing this from Sergey's question in another patch, why are you
>>> using BUG_ON()? That's rather extreme and something we are trying to
>>> avoid adding more of (I still need to remove the BUG_ON()s I've added
>>> over ten years ago). I don't see why all these BUG_ON's can't be turned
>>> into:
>> 
>> +1
>> 
>>> 	if (WARN_ON(x))
>>> 		return;
>> 
>> Given that this is somewhat MM-related, I'd maybe say VM_WARN_ON().

I notice that VM_WARN_ON() casts the result of WARN_ON() to (void), so it
cannot be used in an if () statement.

VM_WARN_ON() will only warn if CONFIG_DEBUG_VM is set.

Is it really what we want ?

Thanks,

Mathieu


> 
> Good point, will do!
> 
> So I'll do one cleanup patch for vm_unmap_ram(), and I'll modify the new
> vm_unmap_user_ram().
> 
> Thanks,
> 
> Mathieu
> 
>> 
>> 	-ss
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 03/16] sched: Implement push_task_to_cpu (v2)
  2018-10-17  6:51   ` Srikar Dronamraju
@ 2018-10-17 15:09     ` Mathieu Desnoyers
  0 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-17 15:09 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes

----- On Oct 17, 2018, at 2:51 AM, Srikar Dronamraju srikar@linux.vnet.ibm.com wrote:

> Hi Mathieu,
> 
>> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
>> +{
> 
> In your use case, is the task going to be current?
> If yes, we should simply be using migrate_task_to.
> 
>> +	struct rq_flags rf;
>> +	struct rq *rq;
>> +	int ret = 0;
>> +
>> +	rq = task_rq_lock(p, &rf);
>> +	update_rq_clock(rq);
>> +
>> +	if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) {
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
> 
> Ideally we should have checked cpus_allowed/cpu_active_mask before taking
> the lock. This would help reduce the contention on the rqlock when the
> passed parameter is not correct.
> 
>> +
>> +	if (!cpumask_test_cpu(dest_cpu, cpu_active_mask)) {
>> +		ret = -EBUSY;
>> +		goto out;
>> +	}
>> +
>> +	if (task_cpu(p) == dest_cpu)
>> +		goto out;
> 
> Same as above.
> 
>> +
>> +	if (task_running(rq, p) || p->state == TASK_WAKING) {
> 
> Why are we using migration thread to move a task in TASK_WAKING state?
> 
>> +		struct migration_arg arg = { p, dest_cpu };
>> +		/* Need help from migration thread: drop lock and wait. */
>> +		task_rq_unlock(rq, p, &rf);
>> +		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
>> +		tlb_migrate_finish(p->mm);
>> +		return 0;
> 
> Why can't we use migrate_task_to instead?

I could do that by moving migrate_task_to outside of the NUMA-specific #ifdef,
but I think we can do something much, much simpler than that; see below.

> 
>> +	} else if (task_on_rq_queued(p)) {
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 455fa330de04..27ad25780204 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1340,6 +1340,15 @@ static inline void __set_task_cpu(struct task_struct *p,
>> unsigned int cpu)
>>  #endif
>>  }
>>  
>> +#ifdef CONFIG_SMP
>> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);
>> +#else
>> +static inline int push_task_to_cpu(struct task_struct *p, unsigned int
>> dest_cpu)
>> +{
>> +	return 0;
>> +}
>> +#endif
>> +
> 
> Your usecase is outside kernel/sched. So I am not sure if this is the right
> place for the declaration.

Actually, now that I think of it, we may not need to migrate the task at all.
Now that the cpu_opv implementation takes a temporary vmap() of the user-space
pages, we can touch that virtual address range from interrupt context on
another CPU.

So cpu_opv can simply execute the vector of operations in IPI context rather
than doing all this silly dance with migration.

Thoughts ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 06/16] cpu_opv: Provide cpu_opv system call (v8)
  2018-10-17  7:19   ` Srikar Dronamraju
@ 2018-10-17 15:11     ` Mathieu Desnoyers
  2018-10-17 16:09       ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-17 15:11 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes

----- On Oct 17, 2018, at 3:19 AM, Srikar Dronamraju srikar@linux.vnet.ibm.com wrote:

> Hi Mathieu,
> 
>> +static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt,
>> +		      struct cpu_opv_vaddr *vaddr_ptrs, int cpu)
>> +{
>> +	struct mm_struct *mm = current->mm;
>> +	int ret;
>> +
>> +retry:
>> +	if (cpu != raw_smp_processor_id()) {
>> +		ret = push_task_to_cpu(current, cpu);
>> +		if (ret)
>> +			goto check_online;
>> +	}
>> +	down_read(&mm->mmap_sem);
>> +	ret = vaddr_ptrs_check(vaddr_ptrs);
>> +	if (ret)
>> +		goto end;
>> +	preempt_disable();
>> +	if (cpu != smp_processor_id()) {
>> +		preempt_enable();
>> +		up_read(&mm->mmap_sem);
>> +		goto retry;
>> +	}
> 
> If a higher-priority task (or several) is pinned to the cpu, don't we end up
> busy-looping until it exits or sleeps?

You're right!

How about we ditch the thread migration altogether, and simply perform
the cpu_opv operations in an IPI handler ?

This is possible now that cpu_opv uses a temporary vmap() rather than
trying to touch the user-space pages through the current thread's page table.

Thoughts ?

Thanks,

Mathieu

> 
>> +	ret = __do_cpu_opv(cpuop, cpuopcnt);
>> +	preempt_enable();
>> +end:
>> +	up_read(&mm->mmap_sem);
>> +	return ret;
>> +
>> +check_online:
>> +	/*
>> +	 * push_task_to_cpu() returns -EINVAL if the requested cpu is not part
>> +	 * of the current thread's cpus_allowed mask.
>> +	 */
>> +	if (ret == -EINVAL)
>> +		return ret;
>> +	get_online_cpus();
>> +	if (cpu_online(cpu)) {
>> +		put_online_cpus();
>> +		goto retry;
> > +	}

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 04/16] mm: Introduce vm_map_user_ram, vm_unmap_user_ram
  2018-10-17 15:04         ` Mathieu Desnoyers
@ 2018-10-17 15:34           ` Sergey Senozhatsky
  0 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2018-10-17 15:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Sergey Senozhatsky, rostedt, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, linux-kernel, linux-api, Thomas Gleixner,
	Andy Lutomirski, Dave Watson, Paul Turner, Andrew Morton,
	Russell King, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Chris Lameter, Ben Maurer, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes

On (10/17/18 11:04), Mathieu Desnoyers wrote:
> >> 
> >>> 	if (WARN_ON(x))
> >>> 		return;
> >> 
> >> Given that this is somewhat MM-related, I'd maybe say VM_WARN_ON().
> 
> I notice that VM_WARN_ON() casts the result of WARN_ON() to (void), so it
> cannot be used in an if () statement.
> 
> VM_WARN_ON() will only warn if CONFIG_DEBUG_VM is set.
> 
> Is it really what we want ?

Oh, indeed... Sorry, I forgot that they cast the WARN_ON() return value to void.
Let's do what Steven suggested.

	-ss

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 06/16] cpu_opv: Provide cpu_opv system call (v8)
  2018-10-17 15:11     ` Mathieu Desnoyers
@ 2018-10-17 16:09       ` Mathieu Desnoyers
  0 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-17 16:09 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes

----- On Oct 17, 2018, at 11:11 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Oct 17, 2018, at 3:19 AM, Srikar Dronamraju srikar@linux.vnet.ibm.com
> wrote:
> 
>> Hi Mathieu,
>> 
>>> +static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt,
>>> +		      struct cpu_opv_vaddr *vaddr_ptrs, int cpu)
>>> +{
>>> +	struct mm_struct *mm = current->mm;
>>> +	int ret;
>>> +
>>> +retry:
>>> +	if (cpu != raw_smp_processor_id()) {
>>> +		ret = push_task_to_cpu(current, cpu);
>>> +		if (ret)
>>> +			goto check_online;
>>> +	}
>>> +	down_read(&mm->mmap_sem);
>>> +	ret = vaddr_ptrs_check(vaddr_ptrs);
>>> +	if (ret)
>>> +		goto end;
>>> +	preempt_disable();
>>> +	if (cpu != smp_processor_id()) {
>>> +		preempt_enable();
>>> +		up_read(&mm->mmap_sem);
>>> +		goto retry;
>>> +	}
>> 
>> If a higher-priority task (or several) is pinned to the cpu, don't we end up
>> busy-looping until it exits or sleeps?
> 
> You're right!
> 
> How about we ditch the thread migration altogether, and simply perform
> the cpu_opv operations in a IPI handler ?
> 
> This is possible now that cpu_opv uses a temporary vmap() rather than
> try to touch the user-space page through the current thread's page table.
> 
> Thoughts ?

Here is the associated implementation on top of this patchset:

commit 759c5a8860d867091e168900329f0955e5101989
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Wed Oct 17 11:32:02 2018 -0400

    cpu opv: use ipi

diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c
index db144b71d51a..30405e0cc049 100644
--- a/kernel/cpu_opv.c
+++ b/kernel/cpu_opv.c
@@ -31,6 +31,7 @@
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
 #include <linux/atomic.h>
+#include <linux/smp.h>
 #include <asm/ptrace.h>
 #include <asm/byteorder.h>
 #include <asm/cacheflush.h>
@@ -1039,41 +1040,48 @@ static int vaddr_ptrs_check(struct cpu_opv_vaddr *vaddr_ptrs)
        return 0;
 }

+struct opv_ipi_args {
+       struct cpu_op *cpuop;
+       int cpuopcnt;
+       int ret;
+};
+
+static void cpu_opv_ipi(void *info)
+{
+       struct opv_ipi_args *args = info;
+
+       rseq_preempt(current);
+       args->ret = __do_cpu_opv(args->cpuop, args->cpuopcnt);
+}
+
 static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt,
                      struct cpu_opv_vaddr *vaddr_ptrs, int cpu)
 {
        struct mm_struct *mm = current->mm;
+       struct opv_ipi_args args = {
+               .cpuop = cpuop,
+               .cpuopcnt = cpuopcnt,
+       };
        int ret;

 retry:
-       if (cpu != raw_smp_processor_id()) {
-               ret = push_task_to_cpu(current, cpu);
-               if (ret)
-                       goto check_online;
-       }
+       if (!cpumask_test_cpu(cpu, &current->cpus_allowed))
+               return -EINVAL;
        down_read(&mm->mmap_sem);
        ret = vaddr_ptrs_check(vaddr_ptrs);
        if (ret)
                goto end;
-       preempt_disable();
-       if (cpu != smp_processor_id()) {
-               preempt_enable();
+       ret = smp_call_function_single(cpu, cpu_opv_ipi, &args, 1);
+       if (ret) {
                up_read(&mm->mmap_sem);
-               goto retry;
+               goto check_online;
        }
-       ret = __do_cpu_opv(cpuop, cpuopcnt);
-       preempt_enable();
+       ret = args.ret;
 end:
        up_read(&mm->mmap_sem);
        return ret;

 check_online:
-       /*
-        * push_task_to_cpu() returns -EINVAL if the requested cpu is not part
-        * of the current thread's cpus_allowed mask.
-        */
-       if (ret == -EINVAL)
-               return ret;
        get_online_cpus();
        if (cpu_online(cpu)) {
                put_online_cpus();




-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc
  2018-10-12  9:59               ` Szabolcs Nagy
@ 2018-10-23 14:59                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2018-10-23 14:59 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: nd, Peter Zijlstra, Paul E. McKenney, Boqun Feng, linux-kernel,
	linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes, shuah, carlos, Florian Weimer,
	Joseph Myers

----- On Oct 12, 2018, at 10:59 AM, Szabolcs Nagy szabolcs.nagy@arm.com wrote:

> On 11/10/18 20:42, Mathieu Desnoyers wrote:
>> ----- On Oct 11, 2018, at 1:04 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>> 
>>> On 11/10/18 17:37, Mathieu Desnoyers wrote:
>>>> ----- On Oct 11, 2018, at 12:20 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>>>>> On 11/10/18 16:13, Mathieu Desnoyers wrote:
>>>>>> ----- On Oct 11, 2018, at 6:37 AM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>>>>>>> On 10/10/18 20:19, Mathieu Desnoyers wrote:
>>>>>>>> +__attribute__((visibility("hidden"))) __thread
>>>>>>>> +volatile struct libc_rseq __lib_rseq_abi = {
>>>>>>> ...
>>>>> but it's in a magic struct that's called "abi" which is confusing,
>>>>> the counter is not abi, it's in a hidden object.
>>>>
>>>> No, it is really an ABI between user-space apps/libs. It's not meant to be
>>>> hidden. glibc implements its own register/unregister functions (it does not
>>>> link against librseq). librseq exposes register/unregister functions as public
>>>> APIs. Those also use the refcount. I also plan to have existing libraries, e.g.
>>>> liblttng-ust and possibly liburcu flavors, implement the
>>>> registration/unregistration and refcount handling on their own, so we don't
>>>> have to add a requirement on additional linking on librseq for pre-existing
>>>> libraries.
>>>>
>>>> So that refcount is not an ABI between kernel and user-space, but it's a
>>>> user-space ABI nevertheless (between program and shared objects).
>>>>
>>>
>>> if that's what you want, then your declaration is wrong.
>>> the object should not have hidden visibility.
>> 
>> Actually, if we look closer into my patch, it defines two symbols,
>> one of which is an alias:
>> 
>> __attribute__((visibility("hidden"))) __thread
>> volatile struct libc_rseq __lib_rseq_abi = {
>>         .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
>> };
>> 
>> extern __attribute__((weak, alias("__lib_rseq_abi"))) __thread
>> volatile struct rseq __rseq_abi;
>> 
>> Note that the public __rseq_abi symbol is weak but does not have
>> hidden visibility. I do this to ensure I don't get prototype
>> mismatch for __rseq_abi between rseq.c and rseq.h (it is required
>> to be a struct rseq by rseq.h), but I want the space to hold the
>> extra refcount field present in struct libc_rseq.
>> 
>

I notice this email has been sitting in my inbox for a while, sorry
for the delayed reply.
 
> but that's wrong: the weak symbol might get resolved to
> a different object in another module, while you increment
> a local refcounter, so there is no coordination between
> userspace components.

Hrm, good point. I should not use the __lib_rseq_abi symbol at all
here.

> 
> this was the reason for my first question in my original mail,
> as soon as i saw the local counter i suspected this is broken.

Good catch, yes. I think I should not use the alias approach then.

> 
> and "assume there is an extra counter field" is not
> acceptable as user space abi, if the counter is relevant
> across modules then expose the entire struct.

The question that arises here is whether I should update
uapi/linux/rseq.h and add the refcount field directly in
there, even though the kernel does not care about it per se ?

> 
>>> either the struct should be public abi (extern tls
>>> symbol) or the register/unregister functions should
>>> be public abi (so when multiple implementations are
>>> present in the same process only one of them will
>>> provide definition for the public abi symbol and
>>> thus there will be one refcounter).
>> 
>> Those are two possible solutions, indeed. Considering that
>> we already need to expose the __rseq_abi symbol as a public
>> ABI in a way that ensures that multiple implementations
>> in a same process end up only using one of them, it seems
>> straightforward to simply extend that structure and hold the
>> refcount there, rather than having two extra ABI symbols
>> (register/unregister functions).
>> 
>> One very appropriate question here is whether we want to
>> expose the layout of struct libc_rseq (which includes the
>> refcount) in a public header file, and if so, which project
>> should hold it ? Or do we just want to document the layout
>> of this ABI so projects can define the structure layout
>> internally ? As my implementation currently stands, I have
>> the following structure duplicated into rseq selftests,
>> librseq, and glibc:
>> 
> 
> "not exposed" and "the counter is abi" together is not
> useful, either you want coordination in user-space or
> not, that decision should imply the userspace abi/api
> (e.g. adding a counter to the user-space struct).

I'm inclined to add the refcount to struct rseq directly,
unless anyone objects. It seems much simpler.
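
Concretely, what I have in mind for uapi/linux/rseq.h is something
like the following (sketch only; field types abbreviated the same way
as in the libc_rseq snippet quoted below, and the exact field name is
of course up for discussion):

struct rseq {
        /* Existing kernel <-> user-space ABI, unchanged. */
        __u32 cpu_id_start;
        __u32 cpu_id;
        __u64 rseq_cs;
        __u32 flags;
        /* New: owned by user-space, never touched by the kernel. */
        __u32 refcount;
} __attribute__((aligned(4 * sizeof(__u64))));

The refcount fits in the padding left by the 32-byte alignment, so
sizeof(struct rseq) and the registered rseq_len stay the same.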

> 
> it is true that only modules that implement registration
> need to know about the counter and normal users don't,
> but if you want any coordination then the layout must
> be fixed and that should be exposed somewhere to avoid
> breakage.

Yep. uapi/linux/rseq.h seems like the most sensible place to
expose this.

> 
> (i think ideally the api would be controlled by functions
> and not object symbols with magic layout, but the rseq
> design is already full of such magic. and i think it's
> better to do the registration in libc only without
> coordination but that might not be practical if users
> want it now)

Yes, early adopters are my concern here.
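
For those early adopters, the coordination I have in mind boils down
to something like this (sketch only; assumes the refcount ends up in
the public struct rseq as discussed above and that __NR_rseq and the
arch-specific RSEQ_SIG signature are available; error handling and
blocking signals around the refcount update are omitted):

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/rseq.h>

extern __thread volatile struct rseq __rseq_abi;

int rseq_register_current_thread(void)
{
        if (__rseq_abi.refcount++)
                return 0;       /* Already registered by another component. */
        return syscall(__NR_rseq, &__rseq_abi, sizeof(__rseq_abi),
                       0, RSEQ_SIG);
}

int rseq_unregister_current_thread(void)
{
        if (--__rseq_abi.refcount)
                return 0;       /* Still in use by another component. */
        return syscall(__NR_rseq, &__rseq_abi, sizeof(__rseq_abi),
                       RSEQ_FLAG_UNREGISTER, RSEQ_SIG);
}

With that, glibc, librseq, liblttng-ust and friends can each register
at thread start without stepping on each other: only the first caller
actually issues the rseq syscall, and only the last one unregisters.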

> 
>> /*
>>  * linux/rseq.h defines struct rseq as aligned on 32 bytes. The kernel ABI
>>  * size is 20 bytes. For support of multiple rseq users within a process,
>>  * user-space defines an extra 4 bytes field as a reference count, for a
>>  * total of 24 bytes.
>>  */
>> struct libc_rseq {
>>         /* kernel-userspace ABI. */
>>         __u32 cpu_id_start;
>>         __u32 cpu_id;
>>         __u64 rseq_cs;
>>         __u32 flags;
>>         /* user-space ABI. */
>>         __u32 refcount;
>> } __attribute__((aligned(4 * sizeof(__u64))));
>> 
>> That duplicated structure only needs to be present in early-adopter
>> applications/libraries. Those linking on librseq or relying on newer
>> glibc to register rseq don't need to know about this extended layout:
>> all they need to care about is the layout of struct rseq (without the
>> added refcount).
> 
> please decide if you want multiple libraries to
> be able to register rseq and coordinate or not
> and document that decision in the public api.

Yes, I'll try this out and see how this goes.

Thanks for the feedback!

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Thread overview: 40+ messages
2018-10-10 19:19 [RFC PATCH for 4.21 00/16] rseq updates, new cpu_opv system call Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc Mathieu Desnoyers
2018-10-11 10:37   ` Szabolcs Nagy
2018-10-11 15:13     ` Mathieu Desnoyers
2018-10-11 16:20       ` Szabolcs Nagy
2018-10-11 16:37         ` Mathieu Desnoyers
2018-10-11 17:04           ` Szabolcs Nagy
2018-10-11 19:42             ` Mathieu Desnoyers
2018-10-12  9:59               ` Szabolcs Nagy
2018-10-23 14:59                 ` Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 02/16] rseq/selftests: Adapt number of threads to the number of detected cpus Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 03/16] sched: Implement push_task_to_cpu (v2) Mathieu Desnoyers
2018-10-17  6:51   ` Srikar Dronamraju
2018-10-17 15:09     ` Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 04/16] mm: Introduce vm_map_user_ram, vm_unmap_user_ram Mathieu Desnoyers
2018-10-16 18:30   ` Steven Rostedt
2018-10-16 19:21     ` Mathieu Desnoyers
2018-10-16 19:40       ` Steven Rostedt
2018-10-17  0:27     ` Sergey Senozhatsky
2018-10-17 15:00       ` Mathieu Desnoyers
2018-10-17 15:04         ` Mathieu Desnoyers
2018-10-17 15:34           ` Sergey Senozhatsky
2018-10-10 19:19 ` [RFC PATCH for 4.21 05/16] mm: Provide is_vma_noncached Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 06/16] cpu_opv: Provide cpu_opv system call (v8) Mathieu Desnoyers
2018-10-16  8:10   ` Sergey Senozhatsky
2018-10-16 19:17     ` Mathieu Desnoyers
2018-10-17  1:46       ` Sergey Senozhatsky
2018-10-17  7:19   ` Srikar Dronamraju
2018-10-17 15:11     ` Mathieu Desnoyers
2018-10-17 16:09       ` Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 07/16] cpu_opv: limit amount of virtual address space used by cpu_opv Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 08/16] x86: Wire up cpu_opv system call Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 09/16] powerpc: " Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 10/16] arm: " Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 11/16] cpu-opv/selftests: Provide cpu-op library Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 12/16] cpu-opv/selftests: Provide basic test Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 13/16] cpu-opv/selftests: Provide percpu_op API Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 14/16] cpu-opv/selftests: Provide basic percpu ops test Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 15/16] cpu-opv/selftests: Provide parametrized tests Mathieu Desnoyers
2018-10-10 19:19 ` [RFC PATCH for 4.21 16/16] cpu-opv/selftests: Provide Makefile, scripts, gitignore Mathieu Desnoyers
