linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
@ 2015-10-27 23:56 Paul Turner
  2015-10-27 23:56 ` [RFC PATCH v2 1/3] restartable sequences: user-space per-cpu " Paul Turner
                   ` (5 more replies)
  0 siblings, 6 replies; 38+ messages in thread
From: Paul Turner @ 2015-10-27 23:56 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Mathieu Desnoyers, Dave Watson, Paul E. McKenney
  Cc: linux-api, linux-kernel, Josh Triplett, Ingo Molnar,
	Chris Lameter, Linus Torvalds

This is an update to the previously posted series at:
  https://lkml.org/lkml/2015/6/24/665

Dave Watson has posted a similar follow-up which allows additional critical
regions to be registered as well as single-step support at:
  https://lkml.org/lkml/2015/10/22/588

This series is a new approach which introduces an alternate ABI that does not
depend on open-coded assembly nor a central 'repository' of rseq sequences.
Sequences may now be inlined and the preparatory[*] work for the sequence can
be written in a higher level language.

This new ABI has also been written to support debugger interaction in a way
that the previous ABI could not.

[*] A sequence essentially has 3 steps:
  1) Determine which cpu the sequence is being run on
  2) Preparatory work specific to the state read in 1)
  3) A single commit instruction which finalizes any state updates.

We require a single instruction for (3) so that if it is interrupted in any
way, we can proceed from (1) once more [or potentially bail].

This new ABI can be described as:
 Progress is ordered as follows:
  *0. Userspace stores current event+cpu counter values
   1. Userspace loads the rip to move to at failure into cx
   2. Userspace loads the rip of the instruction following
      the critical section into a registered TLS address.
   3. Userspace loads the values read at [0] into a known
      location.
   4. Userspace tests to see whether the current event and
      cpu counter values match those stored at 0.  Manually
      jumping to the address from [1] in the case of a
      mismatch.
 
      Note that if we are preempted or otherwise interrupted
      then the kernel can also now perform this comparison
      and conditionally jump us to [1].
   5. Our final instruction before [2] is then our commit.
      The critical section is self-terminating.  [2] must
      also be cleared at this point.
 
 For x86_64:
   [3] uses rdx to represent cpu and event counter as a
       single 64-bit value.
 
 For i386:
   [3] uses ax for cpu and dx for the event_counter.
 
  Both:
   Instruction after commit: rseq_state->post_commit_instr
   Current event and cpu state: rseq_state->event_and_cpu

Exactly, for x86_64 this looks like:
  movq <failed>, rcx [1]
  movq $1f, <commit_instr> [2]
  cmpq <start value>, <current value> [3] (start is in rcx)
  jnz <failed> (4)
  movq <to_write>, (<target>) (5)
  1: movq $0, <commit_instr>
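
At the C level the same pattern might be wrapped roughly as follows (a sketch
only: the rseq_start()/rseq_cpu_at_start()/rseq_finish() helpers are the ones
introduced by the selftests in patch 3, and the per-cpu counter is purely
illustrative):

  /* Illustrative only: per-cpu counter increment built on the patch 3 helpers. */
  static void percpu_counter_inc(intptr_t *counts /* one slot per cpu */)
  {
          struct rseq_state start;
          intptr_t old;
          int cpu;

          do {
                  start = rseq_start();            /* [0]: snapshot (event, cpu) */
                  cpu = rseq_cpu_at_start(start);  /* (1): which cpu are we on */
                  old = counts[cpu];               /* (2): preparatory work */
          } while (!rseq_finish(&counts[cpu], old + 1, start)); /* (3): commit */
  }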

There has been some related discussion, which I am supportive of, in which
we use fs/gs instead of TLS.  This maps naturally to the above and removes
the current requirement for per-thread initialization (this is a good thing!).

On debugger interactions:

There are some nice properties about this new style of API which allow it to
actually support safe interactions with a debugger:
 a) The event counter is a per-cpu value.  This means that we can not advance
    it if no threads from the same process execute on that cpu.  This
    naturally allows basic single step support with thread-isolation.
 b) Single-step can be augmented to evaluate the ABI without incrementing the
    event count.
 c) A debugger can also be augmented to evaluate this ABI and push restarts
    on the kernel's behalf.

This is also compatible with David's approach of not single stepping between
2-4 above.  However, I think these are ultimately a little stronger since true
single-stepping and breakpoint support would be available.  Which would be
nice to allow actual debugging of sequences.

(Note that I haven't bothered implementing these in the current patchset as we
are still winnowing down on the ABI and they just add complexity.  It's
important to note that they are possible however.)

Thanks,

- Paul




* [RFC PATCH v2 1/3] restartable sequences: user-space per-cpu critical sections
  2015-10-27 23:56 [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections Paul Turner
@ 2015-10-27 23:56 ` Paul Turner
  2015-11-19 16:38   ` Johannes Berg
  2015-12-11 12:56   ` Mathieu Desnoyers
  2015-10-27 23:57 ` [RFC PATCH v2 2/3] restartable sequences: x86 ABI Paul Turner
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 38+ messages in thread
From: Paul Turner @ 2015-10-27 23:56 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Mathieu Desnoyers, Dave Watson, Paul E. McKenney
  Cc: linux-api, linux-kernel, Josh Triplett, Ingo Molnar,
	Chris Lameter, Linus Torvalds

From: Paul Turner <pjt@google.com>

Introduce the notion of a restartable sequence.  This is a piece of user code
that can be described in 3 components:

 1) Establish where [e.g. which cpu] the thread is running
 2) Preparatory work that is dependent on the state in [1].
 3) A committing instruction that proceeds only if [2] and [1] have proceeded
    in a coherent state (i.e. no other thread from the same address space has
    run on the same cpu since [1]).

Preemption, or other interruption beyond [1], guarantees that [3] will not
execute; instead, control is transferred to a user-defined restart handler.
This handler may arrange for the original operation to be retried,
including potentially resynchronizing with dependent state that may
have been updated in the interim.

This may be used in combination with an in-memory cpu-id to allow user programs
to implement cpu-local data-structures and primitives, without the use/overhead
of any atomics.

The kernel ABI generally consists of:
 a) A shared TLS word which exports the current cpu and associated event-count
 b) A shared TLS word which, when non-zero, stores the first post-commit
    instruction if a sequence is active.  (The kernel observing rips greater
    than this takes no action and concludes user-space has completed its
    critical section, but not yet cleared this state).
 c) An architecture specific way to publish the state read from (a) which
    user-space believes to be current.

[ (a) is only read by userspace, it is published by the kernel.
  (b) is only read by the kernel, it is published by userspace. ]
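
For reference, the selftests in patch 3 keep (a) and (b) together in a single
TLS object along the following lines (field names as used there):

  struct rseq_state {
          union {
                  struct {
                          int cpu;           /* (a): current cpu, written by the kernel */
                          int event_counter; /* (a): event count, written by the kernel */
                  };
                  uint64_t storage;          /* registered as the event_and_cpu word */
          };
          void *post_commit_instr;           /* (b): non-zero while a sequence is active */
  };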

Signed-off-by: Paul Turner <pjt@google.com>
---
 arch/Kconfig                           |    7 ++
 arch/x86/Kconfig                       |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/exec.c                              |    1 
 include/linux/sched.h                  |   27 ++++++++
 include/uapi/asm-generic/unistd.h      |    4 +
 init/Kconfig                           |   11 +++
 kernel/Makefile                        |    2 -
 kernel/restartable_sequences.c         |  112 ++++++++++++++++++++++++++++++++
 kernel/sched/core.c                    |    4 +
 kernel/sched/sched.h                   |    3 +
 kernel/sys_ni.c                        |    2 +
 12 files changed, 172 insertions(+), 3 deletions(-)
 create mode 100644 kernel/restartable_sequences.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 671810c..336880c 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -241,6 +241,13 @@ config HAVE_REGS_AND_STACK_ACCESS_API
 	  declared in asm/ptrace.h
 	  For example the kprobes-based event tracer needs this API.
 
+config HAVE_RESTARTABLE_SEQUENCE_SUPPORT
+	bool
+	depends on HAVE_REGS_AND_STACK_ACCESS_API
+	help
+	  This symbol should be selected by an architecture if it supports an
+	  implementation of restartable sequences.
+
 config HAVE_CLK
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 262b79c..25a0a60 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -135,6 +135,7 @@ config X86
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_REGS_AND_STACK_ACCESS_API
+	select HAVE_RESTARTABLE_SEQUENCE_SUPPORT
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UID16			if X86_32 || IA32_EMULATION
 	select HAVE_UNSTABLE_SCHED_CLOCK
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 278842f..0fd4243 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -331,6 +331,7 @@
 322	64	execveat		stub_execveat
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
+325	common	restartable_sequences	sys_restartable_sequences
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/exec.c b/fs/exec.c
index 0a77a69..120ef19 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1599,6 +1599,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	current->in_execve = 0;
 	acct_update_integrals(current);
 	task_numa_free(current);
+	rseq_clear_state_exec();
 	free_bprm(bprm);
 	kfree(pathbuf);
 	putname(filename);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9e1e06c..03ab1f0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1189,6 +1189,21 @@ struct mempolicy;
 struct pipe_inode_info;
 struct uts_namespace;
 
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+struct restartable_sequence_state {
+	__user void *post_commit_instr_addr;
+	__user u64 *event_and_cpu;
+
+	u32 last_switch_count;
+	u32 event_counter;
+	struct preempt_notifier notifier;
+};
+
+void rseq_clear_state_exec(void);
+#else
+static inline void rseq_clear_state_exec(void) {}
+#endif
+
 struct load_weight {
 	unsigned long weight;
 	u32 inv_weight;
@@ -1820,6 +1835,9 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long	task_state_change;
 #endif
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+	struct restartable_sequence_state rseq_state;
+#endif
 	int pagefault_disabled;
 /* CPU-specific state of this task */
 	struct thread_struct thread;
@@ -3189,4 +3207,13 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+static inline int rseq_active(struct task_struct *p)
+{
+	return p->rseq_state.event_and_cpu != NULL;
+}
+#else
+static inline int rseq_active(struct task_struct *p) { return 0; }
+#endif
+
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index ee12400..9659f31 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -713,9 +713,11 @@ __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
 __SYSCALL(__NR_userfaultfd, sys_userfaultfd)
 #define __NR_membarrier 283
 __SYSCALL(__NR_membarrier, sys_membarrier)
+#define __NR_restartable_sequences 284
+__SYSCALL(__NR_restartable_sequences, sys_restartable_sequences)
 
 #undef __NR_syscalls
-#define __NR_syscalls 284
+#define __NR_syscalls 285
 
 /*
  * All syscalls below here should go away really,
diff --git a/init/Kconfig b/init/Kconfig
index c24b6f7..98c02f2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2040,7 +2040,16 @@ config STOP_MACHINE
 source "block/Kconfig"
 
 config PREEMPT_NOTIFIERS
-	bool
+	bool n
+
+config RESTARTABLE_SEQUENCES
+	bool "Userspace Restartable Sequences (RSEQ)"
+	default n
+	depends on HAVE_RESTARTABLE_SEQUENCE_SUPPORT && PREEMPT_NOTIFIERS
+	help
+	  Allows binaries to define a region of user-text within which
+	  execution will be restarted in the event of signal delivery or
+	  preemption.
 
 config PADATA
 	depends on SMP
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf00..dbe6963 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -101,8 +101,8 @@ obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
-
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_RESTARTABLE_SEQUENCES) += restartable_sequences.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/restartable_sequences.c b/kernel/restartable_sequences.c
new file mode 100644
index 0000000..c99a574
--- /dev/null
+++ b/kernel/restartable_sequences.c
@@ -0,0 +1,112 @@
+/*
+ * Restartable Sequences are a lightweight interface that allows user-level
+ * code to be executed atomically relative to scheduler preemption.  Typically
+ * used for implementing per-cpu operations.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ *
+ * Copyright (C) 2015, Google, Inc.,
+ * Paul Turner <pjt@google.com> and Andrew Hunter <ahh@google.com>
+ *
+ */
+
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+
+#include <linux/uaccess.h>
+#include <linux/preempt.h>
+#include <linux/syscalls.h>
+
+static void rseq_sched_in_nop(struct preempt_notifier *pn, int cpu) {}
+static void rseq_sched_out_nop(struct preempt_notifier *pn,
+			       struct task_struct *next) {}
+
+static __read_mostly struct preempt_ops rseq_preempt_ops = {
+	.sched_in = rseq_sched_in_nop,
+	.sched_out = rseq_sched_out_nop,
+};
+
+int rseq_configure_current(__user u64 *event_and_cpu,
+			   __user void *post_commit_instr_addr)
+{
+	struct restartable_sequence_state *rseq_state = &current->rseq_state;
+	int addr_size = test_tsk_thread_flag(current, TIF_IA32) ? 4 : 8;
+	int registered = 0;
+
+	/* Disable */
+	if (!event_and_cpu && !post_commit_instr_addr)
+		goto valid;
+
+	if (!event_and_cpu || !post_commit_instr_addr)
+		return -EINVAL;
+
+	if (!access_ok(VERIFY_WRITE, event_and_cpu, sizeof(*event_and_cpu)) ||
+	    !access_ok(VERIFY_READ, post_commit_instr_addr, addr_size))
+		return -EINVAL;
+
+valid:
+	preempt_disable();
+
+	if (rseq_state->event_and_cpu)
+		registered = 1;
+
+	rseq_state->event_counter = 1;
+	rseq_state->event_and_cpu = event_and_cpu;
+	rseq_state->post_commit_instr_addr = post_commit_instr_addr;
+
+	if (event_and_cpu && !registered) {
+		preempt_notifier_init(&rseq_state->notifier,
+				      &rseq_preempt_ops);
+		preempt_notifier_inc();
+		preempt_notifier_register(&rseq_state->notifier);
+	} else if (!event_and_cpu && registered) {
+		preempt_notifier_unregister(&rseq_state->notifier);
+		preempt_notifier_dec();
+	}
+
+	preempt_enable();
+
+	/* Will update *event_and_cpu on return. */
+	if (event_and_cpu)
+		set_thread_flag(TIF_NOTIFY_RESUME);
+
+	return 0;
+}
+
+void rseq_clear_state_exec()
+{
+	rseq_configure_current(NULL, NULL);
+}
+
+/*
+ * RSEQ syscall interface.
+ *
+ * Usage:
+ *   sys_restartable_sequences(flags, &event_and_cpu, &post_commit_instr_addr)
+
+ *  flags is currently unused.
+ */
+SYSCALL_DEFINE3(restartable_sequences,
+		int, flags, long, event_and_cpu, long, post_commit_instr_addr)
+{
+	return rseq_configure_current((__user u64 *)event_and_cpu,
+				      (__user void *)post_commit_instr_addr);
+}
+#else
+SYSCALL_DEFINE0(restartable_sequences)
+{
+	return -ENOSYS;
+}
+#endif
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aa59732..84e0641 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2120,6 +2120,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+	memset(&p->rseq_state, 0, sizeof(p->rseq_state));
+#endif
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efd3bfc..e8de833 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -957,6 +957,9 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 {
 	set_task_rq(p, cpu);
 #ifdef CONFIG_SMP
+	if (rseq_active(p))
+		set_tsk_thread_flag(p, TIF_NOTIFY_RESUME);
+
 	/*
 	 * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
 	 * successfuly executed on another CPU. We must ensure that updates of
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index a02decf..0bdf060 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -248,3 +248,5 @@ cond_syscall(sys_execveat);
 
 /* membarrier */
 cond_syscall(sys_membarrier);
+/* restartable sequences */
+cond_syscall(sys_restartable_sequences);



* [RFC PATCH v2 2/3] restartable sequences: x86 ABI
  2015-10-27 23:56 [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections Paul Turner
  2015-10-27 23:56 ` [RFC PATCH v2 1/3] restartable sequences: user-space per-cpu " Paul Turner
@ 2015-10-27 23:57 ` Paul Turner
  2015-10-28  5:03   ` Peter Zijlstra
  2015-12-11 13:30   ` Mathieu Desnoyers
  2015-10-27 23:57 ` [RFC PATCH v2 3/3] restartable sequences: basic self-tests Paul Turner
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 38+ messages in thread
From: Paul Turner @ 2015-10-27 23:57 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Mathieu Desnoyers, Dave Watson, Paul E. McKenney
  Cc: linux-api, linux-kernel, Josh Triplett, Ingo Molnar,
	Chris Lameter, Linus Torvalds

From: Paul Turner <pjt@google.com>

Recall the general ABI is:
   The kernel ABI generally consists of:
     a) A shared TLS word which exports the current cpu and event-count
     b) A shared TLS word which, when non-zero, stores the first post-commit
        instruction if a sequence is active.  (The kernel observing rips greater
        than this takes no action and concludes user-space has completed its
        critical section, but not yet cleared this state).
     c) An architecture specific way to publish the state read from (a) which
        user-space believes to be current.

This patch defines (c) for x86, both x86_64 and i386.

The exact sequence is:
  *0. Userspace stores current event+cpu counter values
   1. Userspace loads the rip to move to at failure into cx
   2. Userspace loads the rip of the instruction following
      the critical section into a registered TLS address.
   3. Userspace loads the values read at [0] into a known
      location.
   4. Userspace tests to see whether the current event and
      cpu counter values match those stored at 0.  Manually
      jumping to the address from [1] in the case of a
      mismatch.

      Note that if we are preempted or otherwise interrupted
      then the kernel can also now perform this comparison
      and conditionally jump us to [1].
   5. Our final instruction before [2] is then our commit.
      The critical section is self-terminating.  [2] must
      also be cleared at this point.

 For x86_64:
   [3] uses rdx to represent cpu and event counter as a
       single 64-bit value.

 For i386:
   [3] uses ax for cpu and dx for the event_counter.

  Both:
   Instruction after commit: rseq_state->post_commit_instr
   Current event and cpu state: rseq_state->event_and_cpu
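
 (For reference, on x86_64 the two halves are packed into a single 64-bit
 value, exactly as rseq_encode_cpu_and_event_count() below does:)

   state = ((u64)event_counter << 32) | cpu;  /* event in high 32 bits, cpu in low */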

An example user-space x86_64 implementation:
    __asm__ __volatile__ goto (
                    "movq $%l[failed], %%rcx\n"
                    "movq $1f, %[commit_instr]\n"
                    "cmpq %[start_value], %[current_value]\n"
                    "jnz %l[failed]\n"
                    "movq %[to_write], (%[target])\n"
                    "1: movq $0, %[commit_instr]\n"
      : /* no outputs */
      : [start_value]"d"(start_value.storage),
        [current_value]"m"(__rseq_state),
        [to_write]"r"(to_write),
        [target]"r"(p),
        [commit_instr]"m"(__rseq_state.post_commit_instr)
      : "rcx", "memory"
      : failed
    );
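
For completeness, the i386 variant used by the selftests in patch 3 (cpu in
ax, event_counter in dx, compared separately) looks like:

    __asm__ __volatile__ goto (
                    "movl $%l[failed], %%ecx\n"
                    "movl $1f, %[post_commit_instr]\n"
                    "cmp %[start_cpu], %[current_cpu]\n"
                    "jnz %l[failed]\n"
                    "cmp %[start_event], %[current_event]\n"
                    "jnz %l[failed]\n"
                    "movl %[to_write], (%[target])\n"
                    "1: movl $0, %[post_commit_instr]\n"
      : /* no outputs */
      : [start_cpu]"a"(start_value.cpu),
        [start_event]"d"(start_value.event_counter),
        [current_cpu]"g"(__rseq_state.cpu),
        [current_event]"g"(__rseq_state.event_counter),
        [post_commit_instr]"g"(__rseq_state.post_commit_instr),
        [to_write]"r"(to_write),
        [target]"r"(p)
      : "ecx", "memory"
      : failed
    );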

Signed-off-by: Paul Turner <pjt@google.com>
---
 arch/x86/entry/common.c                      |    3 +
 arch/x86/include/asm/restartable_sequences.h |   18 +++
 arch/x86/kernel/Makefile                     |    2 
 arch/x86/kernel/restartable_sequences.c      |  136 ++++++++++++++++++++++++++
 arch/x86/kernel/signal.c                     |    7 +
 kernel/restartable_sequences.c               |   10 +-
 6 files changed, 173 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/include/asm/restartable_sequences.h
 create mode 100644 arch/x86/kernel/restartable_sequences.c

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a89fdbc..e382487 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -23,6 +23,7 @@
 #include <linux/uprobes.h>
 
 #include <asm/desc.h>
+#include <asm/restartable_sequences.h>
 #include <asm/traps.h>
 #include <asm/vdso.h>
 #include <asm/uaccess.h>
@@ -249,6 +250,8 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_NOTIFY_RESUME) {
 			clear_thread_flag(TIF_NOTIFY_RESUME);
 			tracehook_notify_resume(regs);
+			if (rseq_active(current))
+				arch_rseq_update_event_counter(regs);
 		}
 
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
diff --git a/arch/x86/include/asm/restartable_sequences.h b/arch/x86/include/asm/restartable_sequences.h
new file mode 100644
index 0000000..75864a7
--- /dev/null
+++ b/arch/x86/include/asm/restartable_sequences.h
@@ -0,0 +1,18 @@
+#ifndef _ASM_X86_RESTARTABLE_SEQUENCES_H
+#define _ASM_X86_RESTARTABLE_SEQUENCES_H
+
+#include <asm/processor.h>
+#include <asm/ptrace.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+
+void arch_rseq_update_event_counter(struct pt_regs *regs);
+
+#else /* !CONFIG_RESTARTABLE_SEQUENCES */
+
+static inline void arch_rseq_update_event_counter(struct pt_regs *regs) {}
+
+#endif
+
+#endif /* _ASM_X86_RESTARTABLE_SEQUENCES_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index b1b78ff..ee98fb6 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -110,6 +110,8 @@ obj-$(CONFIG_EFI)			+= sysfb_efi.o
 obj-$(CONFIG_PERF_EVENTS)		+= perf_regs.o
 obj-$(CONFIG_TRACING)			+= tracepoint.o
 
+obj-$(CONFIG_RESTARTABLE_SEQUENCES)	+= restartable_sequences.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/restartable_sequences.c b/arch/x86/kernel/restartable_sequences.c
new file mode 100644
index 0000000..9f43efd
--- /dev/null
+++ b/arch/x86/kernel/restartable_sequences.c
@@ -0,0 +1,136 @@
+/*
+ * Restartable Sequences: x86 ABI.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ *
+ * Copyright (C) 2015, Google, Inc.,
+ * Paul Turner <pjt@google.com> and Andrew Hunter <ahh@google.com>
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <asm/ptrace.h>
+#include <asm/processor-flags.h>
+#include <asm/restartable_sequences.h>
+
+static inline u64 rseq_encode_cpu_and_event_count(int cpu, int event_count)
+{
+	return (u64)(event_count) << 32 | cpu;
+}
+
+static inline int rseq_regs_cpu(struct pt_regs *regs, int is_i386)
+{
+#ifdef CONFIG_64BIT
+	return is_i386 ? regs->ax : regs->dx & 0xFFFF;
+#else
+	return regs->ax;
+#endif
+}
+
+static inline int rseq_regs_event_count(struct pt_regs *regs, int is_i386)
+{
+#ifdef CONFIG_64BIT
+	return is_i386 ? regs->dx : regs->dx >> 32;
+#else
+	return regs->dx;
+#endif
+}
+
+void arch_rseq_update_event_counter(struct pt_regs *regs)
+{
+	struct restartable_sequence_state *rseq_state = &current->rseq_state;
+	int cpu = task_cpu(current);
+	int is_i386 = test_tsk_thread_flag(current, TIF_IA32);
+	int addr_size = is_i386 ? 4 : 8;
+	long post_commit_instr = 0;
+	u64 state_value;
+
+	/*
+	 * Note: post_commit_instr must be zero-initialized above for the case
+	 * of a 32-bit thread on a 64-bit system.
+	 */
+	if (copy_from_user(&post_commit_instr,
+			   rseq_state->post_commit_instr_addr, addr_size)) {
+		goto fail;
+	}
+
+	/* Handle potentially being within a critical section. */
+	if (regs->ip < post_commit_instr) {
+		/*
+		 * The ABI is relatively compact, with some differences for 32
+		 * and 64-bit threads.
+		 *
+		 * Progress is ordered as follows:
+		 *  *0. Userspace stores current event+cpu counter values
+		 *   1. Userspace loads the rip to move to at failure into cx
+		 *   2. Userspace loads the rip of the instruction following
+		 *      the critical section into a registered TLS address.
+		 *   3. Userspace loads the values read at [0] into a known
+		 *      location.
+		 *   4. Userspace tests to see whether the current event and
+		 *      cpu counter values match those stored at 0.  Manually
+		 *      jumping to the address from [1] in the case of a
+		 *      mismatch.
+		 *
+		 *      Note that if we are preempted or otherwise interrupted
+		 *      then the kernel can also now perform this comparison
+		 *      and conditionally jump us to [1].
+		 *   5. Our final instruction before [2] is then our commit.
+		 *      The critical section is self-terminating.  [2] must
+		 *      also be cleared at this point.
+		 *
+		 * For x86_64:
+		 *   [3] uses rdx to represent cpu and event counter as a
+		 *       single 64-bit value.
+		 *
+		 * For i386:
+		 *   [3] uses ax for cpu and dx for the event_counter.
+		 *
+		 *  Both:
+		 *   Instruction after commit: rseq_state->post_commit_instr
+		 *   Current event and cpu state: rseq_state->event_and_cpu
+		 *
+		 */
+		if (rseq_regs_cpu(regs, is_i386) != cpu ||
+		    rseq_regs_event_count(regs, is_i386) !=
+				rseq_state->event_counter) {
+			if (clear_user(rseq_state->post_commit_instr_addr,
+						addr_size))
+				goto fail;
+
+			/*
+			 * We set this after potentially failing in clear_user
+			 * so that the signal arrives at the faulting rip.
+			 */
+			regs->ip = regs->cx;
+		}
+	}
+
+	/* Update state. Compute the next expected state value. */
+	state_value = rseq_encode_cpu_and_event_count(cpu,
+			++rseq_state->event_counter);
+
+	if (put_user(state_value, rseq_state->event_and_cpu))
+		goto fail;
+
+	return;
+fail:
+	/*
+	 * User space has made some (invalid) protection change that does not
+	 * allow us to safely continue execution.  SEGV is the result.
+	 */
+	force_sig(SIGSEGV, current);
+}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index b7ffb7c..58f8813 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -30,6 +30,7 @@
 #include <asm/fpu/signal.h>
 #include <asm/vdso.h>
 #include <asm/mce.h>
+#include <asm/restartable_sequences.h>
 #include <asm/sighandling.h>
 #include <asm/vm86.h>
 
@@ -615,6 +616,12 @@ setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 	sigset_t *set = sigmask_to_save();
 	compat_sigset_t *cset = (compat_sigset_t *) set;
 
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+	/* Increment the event counter for the, present, pre-signal frame. */
+	if (rseq_active(current))
+		arch_rseq_update_event_counter(regs);
+#endif
+
 	/* Set up the stack frame */
 	if (is_ia32_frame()) {
 		if (ksig->ka.sa.sa_flags & SA_SIGINFO)
diff --git a/kernel/restartable_sequences.c b/kernel/restartable_sequences.c
index c99a574..5449561 100644
--- a/kernel/restartable_sequences.c
+++ b/kernel/restartable_sequences.c
@@ -24,17 +24,21 @@
 
 #ifdef CONFIG_RESTARTABLE_SEQUENCES
 
+#include <asm/restartable_sequences.h>
 #include <linux/uaccess.h>
 #include <linux/preempt.h>
 #include <linux/syscalls.h>
 
 static void rseq_sched_in_nop(struct preempt_notifier *pn, int cpu) {}
-static void rseq_sched_out_nop(struct preempt_notifier *pn,
-			       struct task_struct *next) {}
+static void rseq_sched_out(struct preempt_notifier *pn,
+			   struct task_struct *next)
+{
+	set_thread_flag(TIF_NOTIFY_RESUME);
+}
 
 static __read_mostly struct preempt_ops rseq_preempt_ops = {
 	.sched_in = rseq_sched_in_nop,
-	.sched_out = rseq_sched_out_nop,
+	.sched_out = rseq_sched_out,
 };
 
 int rseq_configure_current(__user u64 *event_and_cpu,



* [RFC PATCH v2 3/3] restartable sequences: basic self-tests
  2015-10-27 23:56 [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections Paul Turner
  2015-10-27 23:56 ` [RFC PATCH v2 1/3] restartable sequences: user-space per-cpu " Paul Turner
  2015-10-27 23:57 ` [RFC PATCH v2 2/3] restartable sequences: x86 ABI Paul Turner
@ 2015-10-27 23:57 ` Paul Turner
  2016-04-05 20:33   ` Mathieu Desnoyers
  2015-10-28 14:44 ` [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections Dave Watson
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 38+ messages in thread
From: Paul Turner @ 2015-10-27 23:57 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Mathieu Desnoyers, Dave Watson, Paul E. McKenney
  Cc: linux-api, linux-kernel, Josh Triplett, Ingo Molnar,
	Chris Lameter, Linus Torvalds

From: pjt <pjt@pjt-glaptop.roam.corp.google.com>

Implements two basic tests of RSEQ functionality.

The first, "basic_test" only asserts that RSEQ works moderately correctly.
E.g. that:
- The CPUID pointer works
- Code infinitely looping within a critical section will eventually be
interrupted.
- Critical sections are interrupted by signals.

"basic_percpu_ops_test" is a slightly more "realistic" variant, implementing a
few simple per-cpu operations and testing their correctness.  It also includes
a trivial example of how user-space may multiplex the critical section via the
restart handler.
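
For example, the spinlock test's worker threads boil down to the following
pattern (mirroring test_percpu_spinlock_thread() below):

  int i, cpu;

  for (i = 0; i < data->reps; i++) {
          cpu = rseq_percpu_lock(&data->lock);  /* returns the cpu the lock was taken on */
          data->counts[cpu]++;                  /* cpu-local work under the lock */
          rseq_percpu_unlock(&data->lock, cpu);
  }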

Signed-off-by: Paul Turner <pjt@google.com>
---
 tools/testing/selftests/rseq/Makefile              |   13 +
 .../testing/selftests/rseq/basic_percpu_ops_test.c |  268 ++++++++++++++++++++
 tools/testing/selftests/rseq/basic_test.c          |   87 ++++++
 tools/testing/selftests/rseq/rseq.c                |   36 +++
 tools/testing/selftests/rseq/rseq.h                |  109 ++++++++
 5 files changed, 513 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/Makefile
 create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c
 create mode 100644 tools/testing/selftests/rseq/basic_test.c
 create mode 100644 tools/testing/selftests/rseq/rseq.c
 create mode 100644 tools/testing/selftests/rseq/rseq.h

diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
new file mode 100644
index 0000000..40b9338
--- /dev/null
+++ b/tools/testing/selftests/rseq/Makefile
@@ -0,0 +1,13 @@
+CFLAGS += -Wall
+LDFLAGS += -lpthread
+
+TESTS = basic_test basic_percpu_ops_test
+
+all: $(TESTS)
+%: %.c
+	$(CC) $(CFLAGS) -o $@ $^ rseq.c $(LDFLAGS)
+
+include ../lib.mk
+
+clean:
+	$(RM) $(TESTS)
diff --git a/tools/testing/selftests/rseq/basic_percpu_ops_test.c b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
new file mode 100644
index 0000000..dcc57ad
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
@@ -0,0 +1,268 @@
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "rseq.h"
+
+struct percpu_lock {
+	intptr_t word[CPU_SETSIZE][64 / sizeof(intptr_t)];  /* Cache aligned */
+};
+
+/* A simple percpu spinlock.  Returns the cpu lock was acquired on. */
+int rseq_percpu_lock(struct percpu_lock *lock)
+{
+	struct rseq_state start;
+	int cpu;
+
+	do {
+		start = rseq_start();
+		cpu = rseq_cpu_at_start(start);
+	} while (lock->word[cpu][0] ||
+		 !rseq_finish(&lock->word[cpu][0], 1, start));
+	return cpu;
+}
+
+void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+	barrier();  /* need a release-store here, this suffices on x86. */
+	assert(lock->word[cpu][0] == 1);
+	lock->word[cpu][0] = 0;
+}
+
+/*
+ * cmpxchg [with an additional check value].
+ *
+ * Returns:
+ *  -1 if *p != old [ || *check_ptr != check_val ], otherwise
+ *  the cpu on which rseq_percpu_cmpxchg{check} was executed.
+ *   - If this is different from the passed cpu, no modifications were made.
+ *
+ * Note: When specified, check_ptr is dereferenced iff *p == old
+ */
+int rseq_percpu_cmpxchg(int cpu, intptr_t *p, intptr_t old, intptr_t new)
+{
+	struct rseq_state start;
+
+	while (1) {
+		start = rseq_start();
+		if (rseq_current_cpu() != cpu) return rseq_current_cpu();
+		if (*p != old)
+			return -1;
+		if (rseq_finish(p, new, start)) return cpu;
+	}
+}
+
+int rseq_percpu_cmpxchgcheck(int cpu, intptr_t *p, intptr_t old, intptr_t new,
+			     intptr_t *check_ptr, intptr_t check_val)
+{
+	struct rseq_state start;
+
+	while (1) {
+		start = rseq_start();
+		if (rseq_current_cpu() != cpu) return rseq_current_cpu();
+		/*
+		 * Note that we'd want the ultimate implementation of this to
+		 * be open coded (similar to rseq_finish) so that we can
+		 * guarantee *check is not dereferenced when old does not
+		 * match.  This could also be facilitated with a generic
+		 * rseq_read_if_valid(...) helper.
+		 */
+		if (*p != old || *check_ptr != check_val)
+			return -1;
+		if (rseq_finish(p, new, start)) return cpu;
+	}
+}
+
+void rseq_unknown_restart_addr(void *addr)
+{
+	fprintf(stderr, "rseq: unrecognized restart address %p\n", addr);
+	exit(1);
+}
+
+struct spinlock_test_data {
+	struct percpu_lock lock;
+	int counts[CPU_SETSIZE];
+	int reps;
+};
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+	struct spinlock_test_data *data = arg;
+
+	int i, cpu;
+	rseq_init_current_thread();
+	for (i = 0; i < data->reps; i++) {
+		cpu = rseq_percpu_lock(&data->lock);
+		data->counts[cpu]++;
+		rseq_percpu_unlock(&data->lock, cpu);
+	}
+
+	return 0;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu lock.
+ * Obviously real applications might prefer to simply use a per-cpu increment;
+ * however, this is reasonable for a test and the lock can be extended to
+ * synchronize more complicated operations.
+ */
+void test_percpu_spinlock()
+{
+	const int num_threads = 200;
+	int i, sum;
+	pthread_t test_threads[num_threads];
+	struct spinlock_test_data data;
+
+	memset(&data, 0, sizeof(data));
+	data.reps = 5000;
+
+	for (i = 0; i < num_threads; i++)
+		pthread_create(&test_threads[i], NULL,
+			       test_percpu_spinlock_thread, &data);
+
+	for (i = 0; i < num_threads; i++)
+		pthread_join(test_threads[i], NULL);
+
+	sum = 0;
+	for (i = 0; i < CPU_SETSIZE; i++)
+		sum += data.counts[i];
+
+	assert(sum == data.reps * num_threads);
+}
+
+struct percpu_list_node {
+	intptr_t data;
+	struct percpu_list_node *next;
+};
+
+struct percpu_list {
+	struct percpu_list_node *heads[CPU_SETSIZE];
+};
+
+int percpu_list_push(struct percpu_list *list, struct percpu_list_node *node)
+{
+	int cpu;
+
+	do {
+		cpu = rseq_current_cpu();
+		node->next = list->heads[cpu];
+	} while (cpu != rseq_percpu_cmpxchg(cpu,
+			(intptr_t *)&list->heads[cpu], (intptr_t)node->next,
+			(intptr_t)node));
+
+	return cpu;
+}
+
+struct percpu_list_node *percpu_list_pop(struct percpu_list *list)
+{
+	int cpu;
+	struct percpu_list_node *head, *next;
+
+	do {
+		cpu = rseq_current_cpu();
+		head = list->heads[cpu];
+		/*
+		 * Unlike a traditional lock-less linked list; the availability
+		 * of a cmpxchg-check primitive allows us to implement pop
+		 * without concerns over ABA-type races.
+		 */
+		if (!head) return 0;
+		next = head->next;
+	} while (cpu != rseq_percpu_cmpxchgcheck(cpu,
+		(intptr_t *)&list->heads[cpu], (intptr_t)head, (intptr_t)next,
+		(intptr_t *)&head->next, (intptr_t)next));
+
+	return head;
+}
+
+
+void *test_percpu_list_thread(void *arg)
+{
+	int i;
+	struct percpu_list *list = (struct percpu_list *)arg;
+
+	rseq_init_current_thread();
+	for (i = 0; i < 100000; i++) {
+		struct percpu_list_node *node = percpu_list_pop(list);
+		sched_yield();  /* encourage shuffling */
+		if (node) percpu_list_push(list, node);
+	}
+
+	return 0;
+}
+
+/* Simultaneous modification to a per-cpu linked list from many threads.  */
+void test_percpu_list()
+{
+	int i, j;
+	long sum = 0, expected_sum = 0;
+	struct percpu_list list;
+	pthread_t test_threads[200];
+	cpu_set_t allowed_cpus;
+
+	memset(&list, 0, sizeof(list));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus)) continue;
+		for (j = 1; j <= 100; j++) {
+			struct percpu_list_node *node;
+
+			expected_sum += j;
+
+			node = malloc(sizeof(*node));
+			assert(node);
+			node->data = j;
+			node->next = list.heads[i];
+			list.heads[i] = node;
+		}
+	}
+
+	for (i = 0; i < 200; i++)
+		assert(pthread_create(&test_threads[i], NULL,
+			       test_percpu_list_thread, &list) == 0);
+
+	for (i = 0; i < 200; i++)
+		pthread_join(test_threads[i], NULL);
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		cpu_set_t pin_mask;
+		struct percpu_list_node *node;
+
+		if (!CPU_ISSET(i, &allowed_cpus)) continue;
+
+		CPU_ZERO(&pin_mask);
+		CPU_SET(i, &pin_mask);
+		sched_setaffinity(0, sizeof(pin_mask), &pin_mask);
+
+		while ((node = percpu_list_pop(&list))) {
+			sum += node->data;
+			free(node);
+		}
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external actor
+	 * is interfering with our allowed affinity while this test is
+	 * running).
+	 */
+	assert(sum == expected_sum);
+}
+
+int main(int argc, char **argv)
+{
+	rseq_init_current_thread();
+	printf("spinlock\n");
+	test_percpu_spinlock();
+	printf("percpu_list\n");
+	test_percpu_list();
+
+	return 0;
+}
+
diff --git a/tools/testing/selftests/rseq/basic_test.c b/tools/testing/selftests/rseq/basic_test.c
new file mode 100644
index 0000000..a3d3cdf
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_test.c
@@ -0,0 +1,87 @@
+/*
+ * Basic test coverage for critical regions and rseq_current_cpu().
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+void test_cpu_pointer()
+{
+	cpu_set_t affinity, test_affinity;
+	int i;
+
+	sched_getaffinity(0, sizeof(affinity), &affinity);
+	CPU_ZERO(&test_affinity);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (CPU_ISSET(i, &affinity)) {
+			CPU_SET(i, &test_affinity);
+			sched_setaffinity(0, sizeof(test_affinity),
+					  &test_affinity);
+			assert(rseq_current_cpu() == sched_getcpu());
+			assert(rseq_current_cpu() == i);
+			CPU_CLR(i, &test_affinity);
+		}
+	}
+	sched_setaffinity(0, sizeof(affinity), &affinity);
+}
+
+void test_critical_section()
+{
+	/*
+	 * This depends solely on some environmental event triggering a counter
+	 * increase.
+	 */
+	struct rseq_state start = rseq_start(), current;
+	do {
+		current = rseq_start();
+	} while (start.cpu == current.cpu &&
+		 start.event_counter == current.event_counter);
+}
+
+volatile int signals_delivered;
+volatile struct rseq_state start;
+
+void test_signal_interrupt_handler()
+{
+	struct rseq_state current = rseq_start();
+	/*
+	 * The potential critical section bordered by 'start' must be invalid.
+	 */
+	assert(current.cpu != start.cpu ||
+	       current.event_counter != start.event_counter);
+	signals_delivered++;
+}
+
+void test_signal_interrupts()
+{
+	struct itimerval it = {{0, 1}, {0, 1}};
+	setitimer(ITIMER_PROF, &it, NULL);
+	signal(SIGPROF, test_signal_interrupt_handler);
+
+	do {
+		start = rseq_start();
+	} while (signals_delivered < 10);
+	setitimer(ITIMER_PROF, NULL, NULL);
+}
+
+int main(int argc, char **argv)
+{
+	rseq_init_current_thread();
+
+	printf("testing current cpu\n");
+	test_cpu_pointer();
+	printf("testing critical section\n");
+	test_critical_section();
+	printf("testing critical section is interrupted by signal\n");
+	test_signal_interrupts();
+
+	return 0;
+}
+
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
new file mode 100644
index 0000000..0ed5c8e
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -0,0 +1,36 @@
+#define _GNU_SOURCE
+#include <assert.h>
+#include <errno.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "rseq.h"
+
+__thread volatile struct rseq_state __rseq_state = { .cpu=-1 };
+
+#define __NR_rseq	325
+
+int sys_rseq(int flags,
+	     volatile uint64_t* event_and_cpu,
+	     volatile void* post_commit_instr)
+{
+	return syscall(__NR_rseq, flags,
+		       (intptr_t)event_and_cpu,
+		       (intptr_t)post_commit_instr);
+}
+
+void rseq_init_current_thread(void)
+{
+	int rc = sys_rseq(0, &__rseq_state.storage,
+			  &__rseq_state.post_commit_instr);
+	if (rc) {
+		fprintf(stderr,"Error: sys_rseq(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		exit(1);
+	}
+	assert(rseq_current_cpu() != -1); /* always updated prior to return. */
+}
+
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
new file mode 100644
index 0000000..d16e02d
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -0,0 +1,109 @@
+#ifndef RSEQ_H
+#define RSEQ_H
+
+#include <stdint.h>
+
+struct rseq_state {
+	union {
+		/*
+		 * Updated by the kernel.  The time between two rseq state
+		 * objects can be considered non-interrupted if-and-only-if
+		 * both cpu and event_counter match.
+		 *
+		 * Explicitly: the kernel is allowed to maintain a
+		 * per-cpu event_counter.
+		 */
+		struct {
+			int cpu;
+			int event_counter;
+		};
+		uint64_t storage;
+	};
+	void* post_commit_instr;
+};
+
+extern __thread volatile struct rseq_state __rseq_state;
+
+int sys_rseq(int flags,
+	     volatile uint64_t* event_and_cpu,
+	     volatile void* post_commit_instr);
+
+#define barrier() __asm__ __volatile__("": : :"memory")
+
+static inline struct rseq_state rseq_start()
+{
+	struct rseq_state result = __rseq_state;
+	/*
+	 * We need to ensure that the compiler does not re-order the loads of
+	 * any protected values before we read the current state.
+	 */
+	barrier();
+	return result;
+}
+
+static inline int rseq_cpu_at_start(struct rseq_state start_value)
+{
+	return start_value.cpu;
+}
+
+static inline int rseq_current_cpu(void)
+{
+	return __rseq_state.cpu;
+}
+
+static inline int rseq_finish(intptr_t *p, intptr_t to_write,
+			      struct rseq_state start_value)
+{
+#ifdef __x86_64__
+	__asm__ __volatile__ goto (
+			"movq $%l[failed], %%rcx\n"
+			"movq $1f, %[commit_instr]\n"
+			"cmpq %[start_value], %[current_value]\n"
+			"jnz %l[failed]\n"
+			"movq %[to_write], (%[target])\n"
+			"1: movq $0, %[commit_instr]\n"
+	  : /* no outputs */
+	  : [start_value]"d"(start_value.storage),
+	    [current_value]"m"(__rseq_state),
+	    [to_write]"r"(to_write),
+	    [target]"r"(p),
+	    [commit_instr]"m"(__rseq_state.post_commit_instr)
+	  : "rcx", "memory"
+	  : failed
+	);
+#elif __i386__
+	__asm__ __volatile__ goto (
+			"movl $%l[failed], %%ecx\n"
+			"movl $1f, %[post_commit_instr]\n"
+			"cmp %[start_cpu], %[current_cpu]\n"
+			"jnz %l[failed]\n"
+			"cmp %[start_event], %[current_event]\n"
+			"jnz %l[failed]\n"
+			"movl %[to_write], (%[target])\n"
+			"1: movl $0, %[post_commit_instr]\n"
+	  : /* no outputs */
+	  : [start_cpu]"a"(start_value.cpu),
+	    [start_event]"d"(start_value.event_counter),
+	    [current_cpu]"g"(__rseq_state.cpu),
+	    [current_event]"g"(__rseq_state.event_counter),
+	    [post_commit_instr]"g"(__rseq_state.post_commit_instr),
+	    [to_write]"r"(to_write),
+	    [target]"r"(p)
+	  : "ecx", "memory"
+	  : failed
+	);
+#else
+#error unsupported target
+#endif
+	return 1;
+failed:
+	return 0;
+}
+
+/*
+ * Initialize rseq for the current thread.  Must be called once by any thread
+ * which uses restartable_sequences.
+ */
+void rseq_init_current_thread(void);
+
+#endif  /* RSEQ_H */



* Re: [RFC PATCH v2 2/3] restartable sequences: x86 ABI
  2015-10-27 23:57 ` [RFC PATCH v2 2/3] restartable sequences: x86 ABI Paul Turner
@ 2015-10-28  5:03   ` Peter Zijlstra
  2015-10-28  5:19     ` Paul Turner
  2015-12-11 13:30   ` Mathieu Desnoyers
  1 sibling, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2015-10-28  5:03 UTC (permalink / raw)
  To: Paul Turner
  Cc: Andrew Hunter, Andy Lutomirski, Andi Kleen, Mathieu Desnoyers,
	Dave Watson, Paul E. McKenney, linux-api, linux-kernel,
	Josh Triplett, Ingo Molnar, Chris Lameter, Linus Torvalds

On Tue, Oct 27, 2015 at 04:57:05PM -0700, Paul Turner wrote:
> +static void rseq_sched_out(struct preempt_notifier *pn,
> +			   struct task_struct *next)
> +{
> +	set_thread_flag(TIF_NOTIFY_RESUME);
> +}
>  
>  static __read_mostly struct preempt_ops rseq_preempt_ops = {
>  	.sched_in = rseq_sched_in_nop,
> -	.sched_out = rseq_sched_out_nop,
> +	.sched_out = rseq_sched_out,
>  };

Since we're unconditionally setting this TIF flag for these tasks, can't
we introduce something similar to the (contested) TIF_NOHZ_FULL thing
which is kept on the task indefinitely.

That avoids having the preempt notifiers and this atomic op in the
schedule path.


* Re: [RFC PATCH v2 2/3] restartable sequences: x86 ABI
  2015-10-28  5:03   ` Peter Zijlstra
@ 2015-10-28  5:19     ` Paul Turner
  0 siblings, 0 replies; 38+ messages in thread
From: Paul Turner @ 2015-10-28  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Hunter, Andy Lutomirski, Andi Kleen, Mathieu Desnoyers,
	Dave Watson, Paul E. McKenney, linux-api, linux-kernel,
	Josh Triplett, Ingo Molnar, Chris Lameter, Linus Torvalds

On Tue, Oct 27, 2015 at 10:03 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Oct 27, 2015 at 04:57:05PM -0700, Paul Turner wrote:
> > +static void rseq_sched_out(struct preempt_notifier *pn,
> > +                        struct task_struct *next)
> > +{
> > +     set_thread_flag(TIF_NOTIFY_RESUME);
> > +}
> >
> >  static __read_mostly struct preempt_ops rseq_preempt_ops = {
> >       .sched_in = rseq_sched_in_nop,
> > -     .sched_out = rseq_sched_out_nop,
> > +     .sched_out = rseq_sched_out,
> >  };
>
> Since we're unconditionally setting this TIF flag for these tasks, can't
> we introduce something similar to the (contested) TIF_NOHZ_FULL thing
> which is kept on the task indefinitely.
>
So Andy and I talked about this also, I'm in favor, in particular this
has two nice effects:
 a) In exit_to_usermode_loop() we can ensure that this is evaluated
prior to _TIF_SIGPENDING.  This removes the current requirement that
we also validate this state in setup_rt_frame() [which can perturb
this state prior to our existing notifier].
 b) We avoid spurious interactions with other things that use notify resume.

> That avoids having the preempt notifiers and this atomic op in the
> schedule path.


So we still want something there (although it can definitely be
inlined as opposed to a preempt_notifier) since this allows us to only
evaluate this check on returns to user-space that might matter as
opposed to every syscall.


* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2015-10-27 23:56 [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections Paul Turner
                   ` (2 preceding siblings ...)
  2015-10-27 23:57 ` [RFC PATCH v2 3/3] restartable sequences: basic self-tests Paul Turner
@ 2015-10-28 14:44 ` Dave Watson
  2015-12-11 12:05 ` Mathieu Desnoyers
  2016-04-06 15:56 ` Andy Lutomirski
  5 siblings, 0 replies; 38+ messages in thread
From: Dave Watson @ 2015-10-28 14:44 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Mathieu Desnoyers, Paul E. McKenney, linux-api, linux-kernel,
	Josh Triplett, Ingo Molnar, Chris Lameter, Linus Torvalds

On 10/27/15 04:56 PM, Paul Turner wrote:
> This series is a new approach which introduces an alternate ABI that does not
> depend on open-coded assembly nor a central 'repository' of rseq sequences.
> Sequences may now be inlined and the preparatory[*] work for the sequence can
> be written in a higher level language.

Very nice, it's definitely much easier to use.

> Exactly, for x86_64 this looks like:
>   movq <failed>, rcx [1]
>   movq $1f, <commit_instr> [2]
>   cmpq <start value>, <current value> [3] (start is in rcx)
>   jnz <failed> (4)
>   movq <to_write>, (<target>) (5)
>   1: movq $0, <commit_instr>
>
> There has been some related discussion, which I am supportive of, in which
> we use fs/gs instead of TLS.  This maps naturally to the above and removes
> the current requirement for per-thread initialization (this is a good thing!).
>
> On debugger interactions:
>
> There are some nice properties about this new style of API which allow it to
> actually support safe interactions with a debugger:
>  a) The event counter is a per-cpu value.  This means that we can not advance
>     it if no threads from the same process execute on that cpu.  This
>     naturally allows basic single step support with thread-isolation.

I think this means multiple processes would no longer be able to use
per-cpu variables in shared memory, since they would no longer restart
with respect to each other?

>  b) Single-step can be augmented to evaluate the ABI without incrementing the
>     event count.
>  c) A debugger can also be augmented to evaluate this ABI and push restarts
>     on the kernel's behalf.
>
> This is also compatible with David's approach of not single stepping between
> 2-4 above.  However, I think these are ultimately a little stronger since true
> single-stepping and breakpoint support would be available.  Which would be
> nice to allow actual debugging of sequences.


* Re: [RFC PATCH v2 1/3] restartable sequences: user-space per-cpu critical sections
  2015-10-27 23:56 ` [RFC PATCH v2 1/3] restartable sequences: user-space per-cpu " Paul Turner
@ 2015-11-19 16:38   ` Johannes Berg
  2015-12-11 12:56   ` Mathieu Desnoyers
  1 sibling, 0 replies; 38+ messages in thread
From: Johannes Berg @ 2015-11-19 16:38 UTC (permalink / raw)
  To: Paul Turner, Peter Zijlstra, Andrew Hunter, Andy Lutomirski,
	Andi Kleen, Mathieu Desnoyers, Dave Watson, Paul E. McKenney
  Cc: linux-api, linux-kernel, Josh Triplett, Ingo Molnar,
	Chris Lameter, Linus Torvalds

On Tue, 2015-10-27 at 16:56 -0700, Paul Turner wrote:
> 
> + *  flags is currently unused.
> + */
> +SYSCALL_DEFINE3(restartable_sequences,
> +		int, flags, long, event_and_cpu, long, post_commit_instr_addr)
> +{
> +	return rseq_configure_current((__user u64 *)event_and_cpu,
> +				      (__user void *)post_commit_instr_addr);

It seems that, for forward compatibility and actual usefulness of the
'flags' argument, you need to reject the syscall if it's passed non-zero
right now.  Otherwise applications can (inadvertently) pass garbage and
you'll not be able to use it in the future.
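
Concretely, something like the following at the top of the handler would be
enough (a sketch only):

	if (flags)
		return -EINVAL;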

johannes


* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2015-10-27 23:56 [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections Paul Turner
                   ` (3 preceding siblings ...)
  2015-10-28 14:44 ` [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections Dave Watson
@ 2015-12-11 12:05 ` Mathieu Desnoyers
  2015-12-11 13:39   ` Mathieu Desnoyers
  2016-04-06 15:56 ` Andy Lutomirski
  5 siblings, 1 reply; 38+ messages in thread
From: Mathieu Desnoyers @ 2015-12-11 12:05 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Paul E. McKenney, linux-api, linux-kernel,
	Josh Triplett, Ingo Molnar, Chris Lameter, Linus Torvalds

----- On Oct 27, 2015, at 7:56 PM, Paul Turner commonly@gmail.com wrote:

> This is an update to the previously posted series at:
>  https://lkml.org/lkml/2015/6/24/665
> 
> Dave Watson has posted a similar follow-up which allows additional critical
> regions to be registered as well as single-step support at:
>  https://lkml.org/lkml/2015/10/22/588
> 
> This series is a new approach which introduces an alternate ABI that does not
> depend on open-coded assembly nor a central 'repository' of rseq sequences.
> Sequences may now be inlined and the preparatory[*] work for the sequence can
> be written in a higher level language.

Hi Paul,

Thanks for sending this updated series! A few questions:

When you say that the preparatory work can be written in a higher
level language, does it mean it still has to be inlined, or function
calls are explicitly supported ? It would be good to clarify this in
the changelog/documentation.

Another point: in this changelog, it's unclear to me what the [*] refers
to (after "preparatory"). Below there is a [*] before "A sequence essentially
has 3 steps", and and plain '*' before "0. Userspace stores current event+cpu
counter values".

> 
> This new ABI has also been written to support debugger interaction in a way
> that the previous ABI could not.
> 
> [*] A sequence essentially has 3 steps:
>  1) Determine which cpu the sequence is being run on
>  2) Preparatory work specific to the state read in 1)
>  3) A single commit instruction which finalizes any state updates.

The link between those 3 steps and the following "progress" items
below seems a bit confusing, because those are two numbered lists.
Perhaps using the numbers above for the overall steps, and sub-numbers
after a dot could help exposing the relation, e.g.:

1.1 Userspace stores current event+cpu counter values
2.1 Userspace loads the rip to move to at failure into cx
2.2 Userspace loads the rip of the instruction following
    the critical section into a registered TLS address.
....


> 
> We require a single instruction for (3) so that if it is interrupted in any
> way, we can proceed from (1) once more [or potentially bail].
> 
> This new ABI can be described as:
> Progress is ordered as follows:
>  *0. Userspace stores current event+cpu counter values

This does not explain where those "current" values are
read from ? Moreover, step [3] below seems to refer to values
read at [0], but [0] is about storing values, not reading them.
There is probably a very simple explanation to this, but in its
current form, this is puzzling. Moreover, where is it stored ?

I would suggest to change event+cpu -> (event, cpu), since they
are not summed, but rather combined as a tuple.

>   1. Userspace loads the rip to move to at failure into cx

Is it really cx (16-bit), or ecx/rcx ?

"the rip to move to at failure" could be defined as
"failure address" or "restart address", and then referred
to afterward. This would clarify point [4] below rather than
saying "to the address from ­[1]".

>   2. Userspace loads the rip of the instruction following
>      the critical section into a registered TLS address.

The critical section is not clearly defined. What is the
first instruction included in that c.s., and what is the
first instruction after the c.s. ?

>   3. Userspace loads the values read at [0] into a known
>      location.

How can we "load the value ... into" ? We usually load from,
and store to a location. What I think you mean here is that
Userspace loads the values read at [0] into a known register.
(I would expect location target a memory address).

>   4. Userspace tests to see whether the current event and
>      cpu counter values match those stored at 0.  Manually
>      jumping to the address from [1] in the case of a
>      mismatch.

address from [1] -> "failure/restart address"

> 
>      Note that if we are preempted or otherwise interrupted
>      then the kernel can also now perform this comparison
>      and conditionally jump us to [1].

Does it jump us to [1] or to the failure/restart address ?

>   5. Our final instruction before [2] is then our commit.
>      The critical section is self-terminating.  [2] must
>      also be cleared at this point.

Is it step [2] or the instruction following the critical
section that you refer to here ?

> 
> For x86_64:
>   [3] uses rdx to represent cpu and event counter as a
>       single 64-bit value.
> 
> For i386:
>   [3] uses ax for cpu and dx for the event_counter.

Having only 16 bits for the CPU id limits us to 65536 CPUs.
Can this be an issue ?

Having only 16 bits for the event counter may also make counter
overflows tricky to handle.

Are we strictly required to put both cpu id and seq counter
into the same register, or could they be put into two registers ?
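
For reference, patch 2/3 packs the tuple as follows (event counter in the
high 32 bits, cpu number in the low 32 bits):

static inline u64 rseq_encode_cpu_and_event_count(int cpu, int event_count)
{
	return (u64)(event_count) << 32 | cpu;
}

so on x86_64 the intent seems to be 32 bits for each of cpu and event counter.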

> 
>  Both:
>   Instruction after commit: rseq_state->post_commit_instr
>   Current event and cpu state: rseq_state->event_and_cpu
> 
> Exactly, for x86_64 this looks like:
>  movq <failed>, rcx [1]
>  movq $1f, <commit_instr> [2]
>  cmpq <start value>, <current value> [3] (start is in rcx)

As you stated yourself in a reply, rcx -> rdx.

>  jnz <failed> (4)
>  movq <to_write>, (<target>) (5)
>  1: movq $0, <commit_instr>
> 
> There has been some related discussion, which I am supportive of, in which
> we use fs/gs instead of TLS.  This maps naturally to the above and removes
> the current requirement for per-thread initialization (this is a good thing!).

It would be important to keep allowing other architectures to use TLS-based
approaches though.

> 
> On debugger interactions:
> 
> There are some nice properties about this new style of API which allow it to
> actually support safe interactions with a debugger:
> a) The event counter is a per-cpu value.  This means that we can not advance
>    it if no threads from the same process execute on that cpu.  This
>    naturally allows basic single step support with thread-isolation.
> b) Single-step can be augmented to evalute the ABI without incrementing the
>    event count.
> c) A debugger can also be augmented to evaluate this ABI and push restarts
>    on the kernel's behalf.
> 
> This is also compatible with David's approach of not single stepping between
> 2-4 above.  However, I think these are ultimately a little stronger since true
> single-stepping and breakpoint support would be available.  Which would be
> nice to allow actual debugging of sequences.
> 
> (Note that I haven't bothered implementing these in the current patchset as we
> are still winnowing down on the ABI and they just add complexity.  It's
> important to note that they are possible however.)

Ensuring these restartable critical sections can be used on shared memory
across processes seems rather important. I wonder if the debuggability of the
critical section really justifies introducing a "PRIVATE" and a "SHARED" mode
for the rseq (similar to sys_futex), or if we could just enforce that no
breakpoint should be inserted within the critical section (except the very
first instruction).

Thoughts ?

Thanks,

Mathieu


> 
> Thanks,
> 
> - Paul

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH v2 1/3] restartable sequences: user-space per-cpu critical sections
  2015-10-27 23:56 ` [RFC PATCH v2 1/3] restartable sequences: user-space per-cpu " Paul Turner
  2015-11-19 16:38   ` Johannes Berg
@ 2015-12-11 12:56   ` Mathieu Desnoyers
  1 sibling, 0 replies; 38+ messages in thread
From: Mathieu Desnoyers @ 2015-12-11 12:56 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Paul E. McKenney, linux-api, linux-kernel,
	Josh Triplett, Ingo Molnar, Chris Lameter, Linus Torvalds

----- On Oct 27, 2015, at 7:56 PM, Paul Turner commonly@gmail.com wrote:

> From: Paul Turner <pjt@google.com>
> 
> Introduce the notion of a restartable sequence.  This is a piece of user code
> that can be described in 3 components:
> 
> 1) Establish where [e.g. which cpu] the thread is running
> 2) Preparatory work that is dependent on the state in [1].
> 3) A committing instruction that proceeds only if [2] and [1] have proceeded
>    in a coherent state (i.e. not thread from the same address state has run
>    on the same cpu).

About (3), we'd have to figure out if we want to support shared memory maps
between processes.

> 
> Preemption, or other interruption beyond [1] guarantees that [3] will not
> execute.  With support for control being transferred to a user-defined restart
> handler.  This handler may arrange for the original operation to be retried,
> including potentially resynchronizing with dependent state that may
> have been updated in the interim.
> 
> This may be used in combination with an in-memory cpu-id to allow user programs
> to implement cpu-local data-structures and primitives, without the use/overhead
> of any atomics.

I was thinking perhaps my "thread_local_abi" system call patch could take care
of implementing the thread-local cpu ID cache in userspace ?

ref. https://lkml.org/lkml/2015/12/10/410

This thread_local_abi structure can be extended to add the extra fields
required for this restartable critical section implementation. The main
question here is whether we need to support multiple rseq registrations
per thread or just a single one ?
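
To illustrate the kind of extension I have in mind (the layout and field
names below are made up, not taken from the actual patch):

/* Illustrative only -- not the actual thread_local_abi layout. */
struct thread_local_abi {
	int32_t cpu_id;			/* cpu number cache, kernel-updated */
	uint32_t rseq_event_counter;	/* bumped on preemption/signal delivery */
	uint64_t rseq_post_commit_instr;	/* non-zero while a c.s. is active */
};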

> 
> The kernel ABI generally consists of:
> a) A shared TLS word which exports the current cpu and associated event-count

Do the current CPU and event-count need to share the same word ? AFAIU
those could be two different fields, and proper restart behavior could be
ensured by reading the event count before the current CPU (in userspace),
and then afterward re-validating the event count. This would allow using
64/32-bit seqcount on respective architectures.
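
Roughly what I have in mind on the userspace side (a minimal sketch, the
struct and field names here are hypothetical):

#include <stdint.h>

#define barrier() __asm__ __volatile__("" : : : "memory")

/* Hypothetical split layout; both fields are updated by the kernel. */
struct rseq_tls {
	uint32_t event_counter;
	int32_t cpu_id;
};

extern __thread volatile struct rseq_tls __rseq_tls;

static inline void rseq_read_cpu_and_event(int32_t *cpu, uint32_t *event)
{
	uint32_t e;

	do {
		e = __rseq_tls.event_counter;
		barrier();	/* read the event counter before the cpu id */
		*cpu = __rseq_tls.cpu_id;
		barrier();	/* then re-validate the event counter */
	} while (e != __rseq_tls.event_counter);
	*event = e;
}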

> b) A shared TLS word which, when non-zero, stores the first post-commit
>    instruction if a sequence is active.  (The kernel observing rips greater
>    than this takes no action and concludes user-space has completed its
>    critical section, but not yet cleared this state).

Ah! Interesting. So we could use this TLS word for various critical sections,
as long as there is no nesting (e.g. no c.s. within c.s., and no c.s. within
signal handler). This limitation is a bit unfortunate for tracing though,
since we'd ideally like to be able to do a c.s. within a signal handler.

Or am I missing some elegant solution to the c.s. within signal handler
problem ? Is the restart ip of a c.s. interrupted by a signal handler
fetched before or after the signal handler execution ? If before, then
c.s. within signal handler would be OK, and this characteristic should
be documented.

> c) An architecture specific way to publish the state read from (a) which
>    user-space believes to be current.

It might be good to state a bit more clearly "publish to whom". I
understand here that it's userspace publishing that state for the
sake of the kernel. Is there any reason why this is typically a
register, or could it be on the stack, or in TLS ?

> 
> [ (a) is only read by userspace, it is published by the kernel.
>  (b) is only ready by the kernel, it is published by userspace. ]
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> ---
> arch/Kconfig                           |    7 ++
> arch/x86/Kconfig                       |    1
> arch/x86/entry/syscalls/syscall_64.tbl |    1
> fs/exec.c                              |    1
> include/linux/sched.h                  |   27 ++++++++
> include/uapi/asm-generic/unistd.h      |    4 +
> init/Kconfig                           |   11 +++
> kernel/Makefile                        |    2 -
> kernel/restartable_sequences.c         |  112 ++++++++++++++++++++++++++++++++
> kernel/sched/core.c                    |    4 +
> kernel/sched/sched.h                   |    3 +
> kernel/sys_ni.c                        |    2 +
> 12 files changed, 172 insertions(+), 3 deletions(-)
> create mode 100644 kernel/restartable_sequences.c
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 671810c..336880c 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -241,6 +241,13 @@ config HAVE_REGS_AND_STACK_ACCESS_API
> 	  declared in asm/ptrace.h
> 	  For example the kprobes-based event tracer needs this API.
> 
> +config HAVE_RESTARTABLE_SEQUENCE_SUPPORT
> +	bool
> +	depends on HAVE_REGS_AND_STACK_ACCESS_API
> +	help
> +	  This symbol should be selected by an architecture if it supports an
> +	  implementation of restartable sequences.
> +
> config HAVE_CLK
> 	bool
> 	help
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 262b79c..25a0a60 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -135,6 +135,7 @@ config X86
> 	select HAVE_PERF_REGS
> 	select HAVE_PERF_USER_STACK_DUMP
> 	select HAVE_REGS_AND_STACK_ACCESS_API
> +	select HAVE_RESTARTABLE_SEQUENCE_SUPPORT
> 	select HAVE_SYSCALL_TRACEPOINTS
> 	select HAVE_UID16			if X86_32 || IA32_EMULATION
> 	select HAVE_UNSTABLE_SCHED_CLOCK
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index 278842f..0fd4243 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -331,6 +331,7 @@
> 322	64	execveat		stub_execveat
> 323	common	userfaultfd		sys_userfaultfd
> 324	common	membarrier		sys_membarrier
> +325	common	restartable_sequences	sys_restartable_sequences
> 
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/exec.c b/fs/exec.c
> index 0a77a69..120ef19 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1599,6 +1599,7 @@ static int do_execveat_common(int fd, struct filename
> *filename,
> 	current->in_execve = 0;
> 	acct_update_integrals(current);
> 	task_numa_free(current);
> +	rseq_clear_state_exec();

> 	free_bprm(bprm);
> 	kfree(pathbuf);
> 	putname(filename);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9e1e06c..03ab1f0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1189,6 +1189,21 @@ struct mempolicy;
> struct pipe_inode_info;
> struct uts_namespace;
> 
> +#ifdef CONFIG_RESTARTABLE_SEQUENCES
> +struct restartable_sequence_state {
> +	__user void *post_commit_instr_addr;
> +	__user u64 *event_and_cpu;
> +
> +	u32 last_switch_count;
> +	u32 event_counter;
> +	struct preempt_notifier notifier;
> +};
> +
> +void rseq_clear_state_exec(void);
> +#else
> +static inline void rseq_clear_state_exec(void) {}
> +#endif
> +
> struct load_weight {
> 	unsigned long weight;
> 	u32 inv_weight;
> @@ -1820,6 +1835,9 @@ struct task_struct {
> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> 	unsigned long	task_state_change;
> #endif
> +#ifdef CONFIG_RESTARTABLE_SEQUENCES
> +	struct restartable_sequence_state rseq_state;
> +#endif
> 	int pagefault_disabled;
> /* CPU-specific state of this task */
> 	struct thread_struct thread;
> @@ -3189,4 +3207,13 @@ static inline unsigned long rlimit_max(unsigned int
> limit)
> 	return task_rlimit_max(current, limit);
> }
> 
> +#ifdef CONFIG_RESTARTABLE_SEQUENCES
> +static inline int rseq_active(struct task_struct *p)
> +{
> +	return p->rseq_state.event_and_cpu != NULL;
> +}
> +#else
> +static inline int rseq_active(struct task_struct *p) { return 0; }
> +#endif
> +
> #endif
> diff --git a/include/uapi/asm-generic/unistd.h
> b/include/uapi/asm-generic/unistd.h
> index ee12400..9659f31 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -713,9 +713,11 @@ __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
> __SYSCALL(__NR_userfaultfd, sys_userfaultfd)
> #define __NR_membarrier 283
> __SYSCALL(__NR_membarrier, sys_membarrier)
> +#define __NR_restartable_sequences 284
> +__SYSCALL(__NR_restartable_sequences, sys_restartable_sequences)
> 
> #undef __NR_syscalls
> -#define __NR_syscalls 284
> +#define __NR_syscalls 285
> 
> /*
>  * All syscalls below here should go away really,
> diff --git a/init/Kconfig b/init/Kconfig
> index c24b6f7..98c02f2 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -2040,7 +2040,16 @@ config STOP_MACHINE
> source "block/Kconfig"
> 
> config PREEMPT_NOTIFIERS
> -	bool
> +	bool n

Why bool -> bool n  here ?

> +
> +config RESTARTABLE_SEQUENCES
> +	bool "Userspace Restartable Sequences (RSEQ)"
> +	default n
> +	depends on HAVE_RESTARTABLE_SEQUENCE_SUPPORT && PREEMPT_NOTIFIERS
> +	help
> +	  Allows binaries to define a region of user-text within which
> +	  execution will be restarted in the event of signal delivery or
> +	  preemption.
> 
> config PADATA
> 	depends on SMP
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 53abf00..dbe6963 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -101,8 +101,8 @@ obj-$(CONFIG_JUMP_LABEL) += jump_label.o
> obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
> obj-$(CONFIG_TORTURE_TEST) += torture.o
> obj-$(CONFIG_MEMBARRIER) += membarrier.o
> -
> obj-$(CONFIG_HAS_IOMEM) += memremap.o
> +obj-$(CONFIG_RESTARTABLE_SEQUENCES) += restartable_sequences.o
> 
> $(obj)/configs.o: $(obj)/config_data.h
> 
> diff --git a/kernel/restartable_sequences.c b/kernel/restartable_sequences.c
> new file mode 100644
> index 0000000..c99a574
> --- /dev/null
> +++ b/kernel/restartable_sequences.c
> @@ -0,0 +1,112 @@
> +/*
> + * Restartable Sequences are a lightweight interface that allows user-level
> + * code to be executed atomically relative to scheduler preemption.  Typically
> + * used for implementing per-cpu operations.

Should state that it is atomic wrt signal delivery too.

> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
> + *
> + * Copyright (C) 2015, Google, Inc.,
> + * Paul Turner <pjt@google.com> and Andrew Hunter <ahh@google.com>
> + *
> + */
> +
> +#ifdef CONFIG_RESTARTABLE_SEQUENCES

The Makefile already builds this file only when CONFIG_RESTARTABLE_SEQUENCES
is set, so this ifdef is useless here.

> +
> +#include <linux/uaccess.h>
> +#include <linux/preempt.h>
> +#include <linux/syscalls.h>
> +
> +static void rseq_sched_in_nop(struct preempt_notifier *pn, int cpu) {}
> +static void rseq_sched_out_nop(struct preempt_notifier *pn,
> +			       struct task_struct *next) {}
> +
> +static __read_mostly struct preempt_ops rseq_preempt_ops = {
> +	.sched_in = rseq_sched_in_nop,
> +	.sched_out = rseq_sched_out_nop,
> +};

My guess is that Peter Zijlstra might prefer those to be implemented
as calls from the scheduler rather than preempt notifiers, similarly
to the comments I got for the membarrier system call.

> +
> +int rseq_configure_current(__user u64 *event_and_cpu,
> +			   __user void *post_commit_instr_addr)

__user usually goes after the type, e.g.:

u64 __user *event_and_cpu, void __user *post_commit_instr_addr

> +{
> +	struct restartable_sequence_state *rseq_state = &current->rseq_state;
> +	int addr_size = test_tsk_thread_flag(current, TIF_IA32) ? 4 : 8;

May I suggest:

size_t addr_size = is_compat_task() ? sizeof(compat_uptr_t) : sizeof(void __user *);


> +	int registered = 0;
> +
> +	/* Disable */
> +	if (!event_and_cpu && !post_commit_instr_addr)
> +		goto valid;
> +
> +	if (!event_and_cpu || !post_commit_instr_addr)
> +		return -EINVAL;
> +
> +	if (!access_ok(VERIFY_WRITE, event_and_cpu, sizeof(*event_and_cpu)) ||
> +	    !access_ok(VERIFY_READ, post_commit_instr_addr, addr_size))
> +		return -EINVAL;
> +
> +valid:
> +	preempt_disable();
> +
> +	if (rseq_state->event_and_cpu)
> +		registered = 1;
> +

Let's suppose we have 2 libraries fighting to have "their"
data registered for rseq. Here the second library would just
override the first one. Is it on purpose ? Should the second lib
just fail ?
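
If the latter, something along these lines (just a sketch, not what the
patch currently does) could do it:

	/* Sketch: reject a second registration with a different address
	 * instead of silently overriding the first one. */
	if (event_and_cpu && rseq_state->event_and_cpu &&
	    rseq_state->event_and_cpu != event_and_cpu) {
		preempt_enable();
		return -EBUSY;
	}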

> +	rseq_state->event_counter = 1;
> +	rseq_state->event_and_cpu = event_and_cpu;
> +	rseq_state->post_commit_instr_addr = post_commit_instr_addr;
> +
> +	if (event_and_cpu && !registered) {
> +		preempt_notifier_init(&rseq_state->notifier,
> +				      &rseq_preempt_ops);
> +		preempt_notifier_inc();
> +		preempt_notifier_register(&rseq_state->notifier);
> +	} else if (!event_and_cpu && registered) {
> +		preempt_notifier_unregister(&rseq_state->notifier);
> +		preempt_notifier_dec();
> +	}
> +
> +	preempt_enable();
> +
> +	/* Will update *event_and_cpu on return. */
> +	if (event_and_cpu)
> +		set_thread_flag(TIF_NOTIFY_RESUME);

Do we need to set the resume notifier in case we're unregistering ?

> +
> +	return 0;
> +}
> +
> +void rseq_clear_state_exec()

() -> (void)

(unlike C++, () really means (...) in C)
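
i.e.:

void rseq_clear_state_exec(void);	/* declares a function taking no arguments */
void rseq_clear_state_exec();	/* declares a function with an unspecified argument list */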

> +{
> +	rseq_configure_current(NULL, NULL);
> +}
> +
> +/*
> + * RSEQ syscall interface.
> + *
> + * Usage:
> + *   sys_restartable_sequences(flags, &event_and_cpu, &post_commit_instr_addr)
> +
> + *  flags is currently unused.
> + */
> +SYSCALL_DEFINE3(restartable_sequences,
> +		int, flags, long, event_and_cpu, long, post_commit_instr_addr)

I would be tempted to put "flags" as the last argument (based on other syscalls
I've seen).

> +{
> +	return rseq_configure_current((__user u64 *)event_and_cpu,
> +				      (__user void *)post_commit_instr_addr);
> +}
> +#else
> +SYSCALL_DEFINE0(restartable_sequences)
> +{
> +	return -ENOSYS;
> +}

AFAIK, you don't need this: sys_ni.c does exactly this.

> +#endif
> +
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index aa59732..84e0641 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2120,6 +2120,10 @@ static void __sched_fork(unsigned long clone_flags,
> struct task_struct *p)
> 
> 	p->numa_group = NULL;
> #endif /* CONFIG_NUMA_BALANCING */
> +
> +#ifdef CONFIG_RESTARTABLE_SEQUENCES
> +	memset(&p->rseq_state, 0, sizeof(p->rseq_state));
> +#endif

Is this really what we want on fork ? Let's take this example: we
have a single-threaded process, which registers a rseq, and it forks.
I would expect the child process' thread to keep the same state as
its parent, and have the rseq registered. Copying the parent thread
state seems less surprising here. This could be implemented by a
rseq_copy_state_fork() or such.
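
Something along those lines (just a rough sketch based on the structure
introduced in this patch) is what I have in mind:

/* Sketch only: let the child inherit the parent's rseq registration. */
static void rseq_copy_state_fork(struct task_struct *p)
{
	struct restartable_sequence_state *rseq_state = &current->rseq_state;

	p->rseq_state.event_and_cpu = rseq_state->event_and_cpu;
	p->rseq_state.post_commit_instr_addr =
			rseq_state->post_commit_instr_addr;
	p->rseq_state.event_counter = rseq_state->event_counter;
	/*
	 * The preempt notifier would still need to be initialized and
	 * registered for the child; it cannot simply be copied.
	 */
}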

Thanks,

Mathieu

> }
> 
> DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index efd3bfc..e8de833 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -957,6 +957,9 @@ static inline void __set_task_cpu(struct task_struct *p,
> unsigned int cpu)
> {
> 	set_task_rq(p, cpu);
> #ifdef CONFIG_SMP
> +	if (rseq_active(p))
> +		set_tsk_thread_flag(p, TIF_NOTIFY_RESUME);
> +
> 	/*
> 	 * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
> 	 * successfuly executed on another CPU. We must ensure that updates of
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index a02decf..0bdf060 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -248,3 +248,5 @@ cond_syscall(sys_execveat);
> 
> /* membarrier */
> cond_syscall(sys_membarrier);
> +/* restartable sequences */
> +cond_syscall(sys_restartable_sequences);

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH v2 2/3] restartable sequences: x86 ABI
  2015-10-27 23:57 ` [RFC PATCH v2 2/3] restartable sequences: x86 ABI Paul Turner
  2015-10-28  5:03   ` Peter Zijlstra
@ 2015-12-11 13:30   ` Mathieu Desnoyers
  1 sibling, 0 replies; 38+ messages in thread
From: Mathieu Desnoyers @ 2015-12-11 13:30 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Paul E. McKenney, linux-api, linux-kernel,
	Josh Triplett, Ingo Molnar, Chris Lameter, Linus Torvalds,
	H. Peter Anvin

----- On Oct 27, 2015, at 7:57 PM, Paul Turner commonly@gmail.com wrote:

> From: Paul Turner <pjt@google.com>
> 
> Recall the general ABI is:
>   The kernel ABI generally consists of:
>     a) A shared TLS word which exports the current cpu and event-count
>     b) A shared TLS word which, when non-zero, stores the first post-commit
>        instruction if a sequence is active.  (The kernel observing rips greater
>        than this takes no action and concludes user-space has completed its
>        critical section, but not yet cleared this state).
>     c) An architecture specific way to publish the state read from (a) which
>        user-space believes to be current.
> 
> This patch defines (c) for x86, both x86_64 and i386.

It seems to also take care of signal handler restarts (should be
documented in the changelog).

> 
> The exact sequence is:
>  *0. Userspace stores current event+cpu counter values
>   1. Userspace loads the rip to move to at failure into cx
>   2. Userspace loads the rip of the instruction following
>      the critical section into a registered TLS address.
>   3. Userspace loads the values read at [0] into a known
>      location.
>   4. Userspace tests to see whether the current event and
>      cpu counter values match those stored at 0.  Manually
>      jumping to the address from [1] in the case of a
>      mismatch.
> 
>      Note that if we are preempted or otherwise interrupted
>      then the kernel can also now perform this comparison
>      and conditionally jump us to [1].
>   4. Our final instruction bfeore [2] is then our commit.

bfeore -> before

>      The critical section is self-terminating.  [2] must
>      also be cleared at this point.

^ see comments for patch 0/3 for this repeated description.

> 
> For x86_64:
>   [3] uses rdx to represent cpu and event counter as a
>       single 64-bit value.
> 
> For i386:
>   [3] uses ax for cpu and dx for the event_counter.

^ again, see comments for patch 0/3.

> 
>  Both:
>   Instruction after commit: rseq_state->post_commit_instr
>   Current event and cpu state: rseq_state->event_and_cpu
> 
> An example user-space x86_64 implementation:
>    __asm__ __volatile__ goto (
>                    "movq $%l[failed], %%rcx\n"
>                    "movq $1f, %[commit_instr]\n"
>                    "cmpq %[start_value], %[current_value]\n"
>                    "jnz %l[failed]\n"
>                    "movq %[to_write], (%[target])\n"
>                    "1: movq $0, %[commit_instr]\n"
>      : /* no outputs */
>      : [start_value]"d"(start_value.storage),
>        [current_value]"m"(__rseq_state),
>        [to_write]"r"(to_write),
>        [target]"r"(p),
>        [commit_instr]"m"(__rseq_state.post_commit_instr)
>      : "rcx", "memory"
>      : failed
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> ---
> arch/x86/entry/common.c                      |    3 +
> arch/x86/include/asm/restartable_sequences.h |   18 +++
> arch/x86/kernel/Makefile                     |    2
> arch/x86/kernel/restartable_sequences.c      |  136 ++++++++++++++++++++++++++
> arch/x86/kernel/signal.c                     |    7 +
> kernel/restartable_sequences.c               |   10 +-
> 6 files changed, 173 insertions(+), 3 deletions(-)
> create mode 100644 arch/x86/include/asm/restartable_sequences.h
> create mode 100644 arch/x86/kernel/restartable_sequences.c
> 
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index a89fdbc..e382487 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -23,6 +23,7 @@
> #include <linux/uprobes.h>
> 
> #include <asm/desc.h>
> +#include <asm/restartable_sequences.h>
> #include <asm/traps.h>
> #include <asm/vdso.h>
> #include <asm/uaccess.h>
> @@ -249,6 +250,8 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32
> cached_flags)
> 		if (cached_flags & _TIF_NOTIFY_RESUME) {
> 			clear_thread_flag(TIF_NOTIFY_RESUME);
> 			tracehook_notify_resume(regs);
> +			if (rseq_active(current))
> +				arch_rseq_update_event_counter(regs);
> 		}
> 
> 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
> diff --git a/arch/x86/include/asm/restartable_sequences.h
> b/arch/x86/include/asm/restartable_sequences.h
> new file mode 100644
> index 0000000..75864a7
> --- /dev/null
> +++ b/arch/x86/include/asm/restartable_sequences.h
> @@ -0,0 +1,18 @@
> +#ifndef _ASM_X86_RESTARTABLE_SEQUENCES_H
> +#define _ASM_X86_RESTARTABLE_SEQUENCES_H
> +
> +#include <asm/processor.h>
> +#include <asm/ptrace.h>
> +#include <linux/sched.h>
> +
> +#ifdef CONFIG_RESTARTABLE_SEQUENCES
> +
> +void arch_rseq_update_event_counter(struct pt_regs *regs);
> +
> +#else /* !CONFIG_RESTARTABLE_SEQUENCES */
> +
> +static inline void arch_rseq_update_event_counter(struct pt_regs *regs) {}
> +
> +#endif

for ifdef consistency, I would recommend Paul McKenney's approach:

#ifdef CONFIG_RESTARTABLE_SEQUENCES

#else /* #ifdef CONFIG_RESTARTABLE_SEQUENCES */

#endif /* #else #ifdef CONFIG_RESTARTABLE_SEQUENCES */


> +
> +#endif /* _ASM_X86_RESTARTABLE_SEQUENCES_H */
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index b1b78ff..ee98fb6 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -110,6 +110,8 @@ obj-$(CONFIG_EFI)			+= sysfb_efi.o
> obj-$(CONFIG_PERF_EVENTS)		+= perf_regs.o
> obj-$(CONFIG_TRACING)			+= tracepoint.o
> 
> +obj-$(CONFIG_RESTARTABLE_SEQUENCES)	+= restartable_sequences.o
> +
> ###
> # 64 bit specific files
> ifeq ($(CONFIG_X86_64),y)
> diff --git a/arch/x86/kernel/restartable_sequences.c
> b/arch/x86/kernel/restartable_sequences.c
> new file mode 100644
> index 0000000..9f43efd
> --- /dev/null
> +++ b/arch/x86/kernel/restartable_sequences.c
> @@ -0,0 +1,136 @@
> +/*
> + * Restartable Sequences: x86 ABI.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
> + *
> + * Copyright (C) 2015, Google, Inc.,
> + * Paul Turner <pjt@google.com> and Andrew Hunter <ahh@google.com>
> + *

Not sure why this empty line is here ?

> + */
> +
> +#include <linux/sched.h>
> +#include <linux/uaccess.h>
> +#include <asm/ptrace.h>
> +#include <asm/processor-flags.h>
> +#include <asm/restartable_sequences.h>
> +
> +static inline u64 rseq_encode_cpu_and_event_count(int cpu, int event_count)
> +{
> +	return (u64)(event_count) << 32 | cpu;
> +}

Still wondering why those need to be combined.

> +
> +static inline int rseq_regs_cpu(struct pt_regs *regs, int is_i386)
> +{
> +#ifdef CONFIG_64BIT
> +	return is_i386 ? regs->ax : regs->dx & 0xFFFF;

Should it rather be ?

return is_i386 ? regs->ax : regs->dx & 0xFFFFFFFF;

> +#else
> +	return regs->ax;
> +#endif
> +}
> +
> +static inline int rseq_regs_event_count(struct pt_regs *regs, int is_i386)
> +{
> +#ifdef CONFIG_64BIT
> +	return is_i386 ? regs->dx : regs->dx >> 32;
> +#else
> +	return regs->dx;
> +#endif
> +}
> +
> +void arch_rseq_update_event_counter(struct pt_regs *regs)
> +{
> +	struct restartable_sequence_state *rseq_state = &current->rseq_state;
> +	int cpu = task_cpu(current);

Could we replace task_cpu(current) with smp_processor_id() ?

> +	int is_i386 = test_tsk_thread_flag(current, TIF_IA32);

How should we consider x32 (is_x32_task()) ?

> +	int addr_size = is_i386 ? 4 : 8;

See other patch comment about using is_compat_task() and sizeof().

> +	long post_commit_instr = 0;
> +	u64 state_value;
> +
> +	/*
> +	 * Note: post_commit_instr must be zero-initialized above for the case
> +	 * of a 32-bit thread on a 64-bit system.
> +	 */
> +	if (copy_from_user(&post_commit_instr,
> +			   rseq_state->post_commit_instr_addr, addr_size)) {

Hrm, that's a little endian hack. Could this be cleaned up so
it won't break horribly when ported to big endian architectures ?
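
Something like the following (a sketch only, assuming is_compat_task() and
the usual compat helpers are usable here) would read the pointer without
relying on the byte layout:

	unsigned long post_commit_instr;

	if (is_compat_task()) {
		compat_uptr_t instr32;

		if (get_user(instr32, (compat_uptr_t __user *)
				rseq_state->post_commit_instr_addr))
			goto fail;
		post_commit_instr = (unsigned long)compat_ptr(instr32);
	} else {
		if (get_user(post_commit_instr, (unsigned long __user *)
				rseq_state->post_commit_instr_addr))
			goto fail;
	}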

> +		goto fail;
> +	}
> +
> +	/* Handle potentially being within a critical section. */
> +	if (regs->ip < post_commit_instr) {

It appears that this mechanism does not handle rseq critical sections
that contain calls to sub-functions, given that those sub-functions
could be above post_commit_instr, and given that it would be incorrect
to move the instruction pointer outside of the scope of the inline assembly.
Are function calls supported within the rseq c.s. ?

> +		/*
> +		 * The ABI is relatively compact, with some differences for 32
> +		 * and 64-bit threads.
> +		 *
> +		 * Progress is ordered as follows:
> +		 *  *0. USerspace stores current event+cpu counter values

USerspace -> Userspace

> +		 *   1. Userspace loads the rip to move to at failure into cx
> +		 *   2. Userspace loads the rip of the instruction following
> +		 *      the critical section into a registered TLS address.
> +		 *   3. Userspace loads the values read at [0] into a known
> +		 *      location.
> +		 *   4. Userspace tests to see whether the current event and
> +		 *      cpu counter values match those stored at 0.  Manually
> +		 *      jumping to the address from [1] in the case of a
> +		 *      mismatch.
> +		 *
> +		 *      Note that if we are preempted or otherwise interrupted
> +		 *      then the kernel can also now perform this comparison
> +		 *      and conditionally jump us to [1].
> +		 *   4. Our final instruction bfeore [2] is then our commit.

bfeore -> before

> +		 *      The critical section is self-terminating.  [2] must
> +		 *      also be cleared at this point.
> +		 *
> +		 * For x86_64:
> +		 *   [3] uses rdx to represent cpu and event counter as a
> +		 *       single 64-bit value.
> +		 *
> +		 * For i386:
> +		 *   [3] uses ax for cpu and dx for the event_counter.
> +		 *
> +		 *  Both:
> +		 *   Instruction after commit: rseq_state->post_commit_instr
> +		 *   Current event and cpu state: rseq_state->event_and_cpu

^ see comments on patch 0/3 changelog.

> +		 *

Empty line to remove.

Thanks,

Mathieu

> +		 */
> +		if (rseq_regs_cpu(regs, is_i386) != cpu ||
> +		    rseq_regs_event_count(regs, is_i386) !=
> +				rseq_state->event_counter) {
> +			if (clear_user(rseq_state->post_commit_instr_addr,
> +						addr_size))
> +				goto fail;
> +
> +			/*
> +			 * We set this after potentially failing in clear_user
> +			 * so that the signal arrives at the faulting rip.
> +			 */
> +			regs->ip = regs->cx;
> +		}
> +	}
> +
> +	/* Update state. Compute the next expected state value. */
> +	state_value = rseq_encode_cpu_and_event_count(cpu,
> +			++rseq_state->event_counter);
> +
> +	if (put_user(state_value, rseq_state->event_and_cpu))
> +		goto fail;
> +
> +	return;
> +fail:
> +	/*
> +	 * User space has made some (invalid) protection change that does not
> +	 * allow us to safely continue execution.  SEGV is the result.
> +	 */
> +	force_sig(SIGSEGV, current);
> +}
> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> index b7ffb7c..58f8813 100644
> --- a/arch/x86/kernel/signal.c
> +++ b/arch/x86/kernel/signal.c
> @@ -30,6 +30,7 @@
> #include <asm/fpu/signal.h>
> #include <asm/vdso.h>
> #include <asm/mce.h>
> +#include <asm/restartable_sequences.h>
> #include <asm/sighandling.h>
> #include <asm/vm86.h>
> 
> @@ -615,6 +616,12 @@ setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
> 	sigset_t *set = sigmask_to_save();
> 	compat_sigset_t *cset = (compat_sigset_t *) set;
> 
> +#ifdef CONFIG_RESTARTABLE_SEQUENCES
> +	/* Increment the event counter for the, present, pre-signal frame. */

There appears to be a lot of, commas, here. ;-)

> +	if (rseq_active(current))
> +		arch_rseq_update_event_counter(regs);
> +#endif
> +
> 	/* Set up the stack frame */
> 	if (is_ia32_frame()) {
> 		if (ksig->ka.sa.sa_flags & SA_SIGINFO)
> diff --git a/kernel/restartable_sequences.c b/kernel/restartable_sequences.c
> index c99a574..5449561 100644
> --- a/kernel/restartable_sequences.c
> +++ b/kernel/restartable_sequences.c
> @@ -24,17 +24,21 @@
> 
> #ifdef CONFIG_RESTARTABLE_SEQUENCES
> 
> +#include <asm/restartable_sequences.h>
> #include <linux/uaccess.h>
> #include <linux/preempt.h>
> #include <linux/syscalls.h>
> 
> static void rseq_sched_in_nop(struct preempt_notifier *pn, int cpu) {}
> -static void rseq_sched_out_nop(struct preempt_notifier *pn,
> -			       struct task_struct *next) {}
> +static void rseq_sched_out(struct preempt_notifier *pn,
> +			   struct task_struct *next)
> +{
> +	set_thread_flag(TIF_NOTIFY_RESUME);
> +}
> 
> static __read_mostly struct preempt_ops rseq_preempt_ops = {
> 	.sched_in = rseq_sched_in_nop,
> -	.sched_out = rseq_sched_out_nop,
> +	.sched_out = rseq_sched_out,
> };
> 
>  int rseq_configure_current(__user u64 *event_and_cpu,

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2015-12-11 12:05 ` Mathieu Desnoyers
@ 2015-12-11 13:39   ` Mathieu Desnoyers
  0 siblings, 0 replies; 38+ messages in thread
From: Mathieu Desnoyers @ 2015-12-11 13:39 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Paul E. McKenney, linux-api, linux-kernel,
	Josh Triplett, Ingo Molnar, Chris Lameter, Linus Torvalds

----- On Dec 11, 2015, at 7:05 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Oct 27, 2015, at 7:56 PM, Paul Turner commonly@gmail.com wrote:
> 
>> This is an update to the previously posted series at:
>>  https://lkml.org/lkml/2015/6/24/665
>> 
>> Dave Watson has posted a similar follow-up which allows additional critical
>> regions to be registered as well as single-step support at:
>>  https://lkml.org/lkml/2015/10/22/588
>> 
>> This series is a new approach which introduces an alternate ABI that does not
>> depend on open-coded assembly nor a central 'repository' of rseq sequences.
>> Sequences may now be inlined and the preparatory[*] work for the sequence can
>> be written in a higher level language.
> 
> Hi Paul,
> 
> Thanks for sending this updated series! A few questions:
> 
> When you say that the preparatory work can be written in a higher
> level language, does it mean it still has to be inlined, or function
> calls are explicitly supported ? It would be good to clarify this in
> the changelog/documentation.
> 
> Another point: in this changelog, it's unclear to me what the [*] refers
> to (after "preparatory"). Below there is a [*] before "A sequence essentially
> has 3 steps", and and plain '*' before "0. Userspace stores current event+cpu
> counter values".
> 
>> 
>> This new ABI has also been written to support debugger interaction in a way
>> that the previous ABI could not.
>> 
>> [*] A sequence essentially has 3 steps:
>>  1) Determine which cpu the sequence is being run on
>>  2) Preparatory work specific to the state read in 1)
>>  3) A single commit instruction which finalizes any state updates.
> 
> The link between those 3 steps and the following "progress" items
> below seems a bit confusing, because those are two numbered lists.
> Perhaps using the numbers above for the overall steps, and sub-numbers
> after a dot could help exposing the relation, e.g.:
> 
> 1.1 Userspace stores current event+cpu counter values
> 2.1 Userspace loads the rip to move to at failure into cx
> 2.2 Userspace loads the rip of the instruction following
>    the critical section into a registered TLS address.
> ....
> 
> 
>> 
>> We require a single instruction for (3) so that if it is interrupted in any
>> way, we can proceed from (1) once more [or potentially bail].
>> 
>> This new ABI can be described as:
>> Progress is ordered as follows:
>>  *0. Userspace stores current event+cpu counter values
> 
> This does not explain where those "current" values are
> read from ? Moreover, step [3] below seems to refer to values
> read at [0], but [0] is about storing values, not reading them.
> There is probably a very simple explanation to this, but in its
> current form, this is puzzling. Moreover, where is it stored ?
> 
> I would suggest to change event+cpu -> (event, cpu), since they
> are not summed, but rather combined as a tuple.
> 
>>   1. Userspace loads the rip to move to at failure into cx
> 
> Is it really cx (16-bit), or ecx/rcx ?
> 
> "the rip to move to at failure" could be defined as
> "failure address" or "restart address", and then referred
> to afterward. This would clarify point [4] below rather than
> saying "to the address from ­[1]".
> 
>>   2. Userspace loads the rip of the instruction following
>>      the critical section into a registered TLS address.
> 
> The critical section is not clearly defined. What is the
> first instruction included in that c.s., and what is the
> first instruction after the c.s. ?
> 
>>   3. Userspace loads the values read at [0] into a known
>>      location.
> 
> How can we "load the value ... into" ? We usually load from,
> and store to a location. What I think you mean here is that
> Userspace loads the values read at [0] into a known register.
> (I would expect location target a memory address).
> 
>>   4. Userspace tests to see whether the current event and
>>      cpu counter values match those stored at 0.  Manually
>>      jumping to the address from [1] in the case of a
>>      mismatch.
> 
> address from [1] -> "failure/restart address"
> 
>> 
>>      Note that if we are preempted or otherwise interrupted
>>      then the kernel can also now perform this comparison
>>      and conditionally jump us to [1].
> 
> Does it jump us to [1] or to the failure/restart address ?
> 
>>   5. Our final instruction before [2] is then our commit.
>>      The critical section is self-terminating.  [2] must
>>      also be cleared at this point.
> 
> Is it the step [2] or the instruction following the critical
> section to you refer to here ?
> 
>> 
>> For x86_64:
>>   [3] uses rdx to represent cpu and event counter as a
>>       single 64-bit value.
>> 
>> For i386:
>>   [3] uses ax for cpu and dx for the event_counter.
> 
> Having 16 bit only for CPU id limits us to 65536 CPUs.
> Can this be an issue ?
> 
> Having 16 bit only for the event counter may start to become
> tricky for counter overflows.
> 
> Are we strictly required to put both cpu id and seq counter
> into the same register, or those could be put into two regs ?
> 
>> 
>>  Both:
>>   Instruction after commit: rseq_state->post_commit_instr
>>   Current event and cpu state: rseq_state->event_and_cpu
>> 
>> Exactly, for x86_64 this looks like:
>>  movq <failed>, rcx [1]
>>  movq $1f, <commit_instr> [2]
>>  cmpq <start value>, <current value> [3] (start is in rcx)
> 
> As you stated yourself in a reply, rcx -> rdx.
> 
>>  jnz <failed> (4)
>>  movq <to_write>, (<target>) (5)
>>  1: movq $0, <commit_instr>
>> 
>> There has been some related discussion, which I am supportive of, in which
>> we use fs/gs instead of TLS.  This maps naturally to the above and removes
>> the current requirement for per-thread initialization (this is a good thing!).
> 
> It would be important to keep allowing other architectures to use TLS-based
> approaches though.
> 
>> 
>> On debugger interactions:
>> 
>> There are some nice properties about this new style of API which allow it to
>> actually support safe interactions with a debugger:
>> a) The event counter is a per-cpu value.  This means that we can not advance
>>    it if no threads from the same process execute on that cpu.  This
>>    naturally allows basic single step support with thread-isolation.
>> b) Single-step can be augmented to evalute the ABI without incrementing the
>>    event count.
>> c) A debugger can also be augmented to evaluate this ABI and push restarts
>>    on the kernel's behalf.
>> 
>> This is also compatible with David's approach of not single stepping between
>> 2-4 above.  However, I think these are ultimately a little stronger since true
>> single-stepping and breakpoint support would be available.  Which would be
>> nice to allow actual debugging of sequences.
>> 
>> (Note that I haven't bothered implementing these in the current patchset as we
>> are still winnowing down on the ABI and they just add complexity.  It's
>> important to note that they are possible however.)
> 
> Ensuring these restartable critical sections can be used on shared memory
> across processes seems rather important. I wonder if the debuggability of the
> critical section really justifies to introduce a "PRIVATE" and a "SHARED" mode
> for the rseq (similar to sys_futex), or if we could just enforce that no
> breakpoint should be inserted within the critical section (except the very
> first instruction).

One more thought about breakpoints: I now understand that with your new approach,
the critical section ranges are only tracked momentarily during the critical
section. There is no registry kept of all critical sections, which could make
ptrace peek tweaks challenging.

An alternative approach might be to allow the debugger to put a breakpoint
within the critical section, but override the breakpoint handler in the kernel
so that it just resumes user-space execution, effectively ignoring the
breakpoint.

Thanks,

Mathieu

> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> 
>> 
>> Thanks,
>> 
>> - Paul
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH v2 3/3] restartable sequences: basic self-tests
  2015-10-27 23:57 ` [RFC PATCH v2 3/3] restartable sequences: basic self-tests Paul Turner
@ 2016-04-05 20:33   ` Mathieu Desnoyers
  2016-04-06  7:43     ` Peter Zijlstra
  0 siblings, 1 reply; 38+ messages in thread
From: Mathieu Desnoyers @ 2016-04-05 20:33 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Paul E. McKenney, linux-api, linux-kernel,
	Josh Triplett, Ingo Molnar, Chris Lameter, Linus Torvalds

----- On Oct 27, 2015, at 7:57 PM, Paul Turner commonly@gmail.com wrote:

> From: pjt <pjt@pjt-glaptop.roam.corp.google.com>

Hi Paul,

I'm worried about a possible ABA in your percpu_list_pop() implementation
below.

[...]

> seq_percpu_cmpxchgcheck(int cpu, intptr_t *p, intptr_t old, intptr_t new,
> +			     intptr_t *check_ptr, intptr_t check_val)
> +{
> +	struct rseq_state start;
> +
> +	while (1) {
> +		start = rseq_start();
> +		if (rseq_current_cpu() != cpu) return rseq_current_cpu();
> +		/*
> +		 * Note that we'd want the ultimate implementation of this to
> +		 * be open coded (similar to rseq_finish) so that we can
> +		 * guarantee *check is not dereferenced when old does not
> +		 * match.  This could also be facilitated with a generic
> +		 * rseq_read_if_valid(...) helper.
> +		 */
> +		if (*p != old || *check_ptr != check_val)
> +			return -1;
> +		if (rseq_finish(p, new, start)) return cpu;
> +	}
> +}
> +

[...]

> +struct percpu_list_node {
> +	intptr_t data;
> +	struct percpu_list_node *next;
> +};
> +
> +struct percpu_list {
> +	struct percpu_list_node *heads[CPU_SETSIZE];
> +};
> +
> +int percpu_list_push(struct percpu_list *list, struct percpu_list_node *node)
> +{
> +	int cpu;
> +
> +	do {
> +		cpu = rseq_current_cpu();
> +		node->next = list->heads[cpu];
> +	} while (cpu != rseq_percpu_cmpxchg(cpu,
> +			(intptr_t *)&list->heads[cpu], (intptr_t)node->next,
> +			(intptr_t)node));
> +
> +	return cpu;
> +}
> +
> +struct percpu_list_node *percpu_list_pop(struct percpu_list *list)
> +{
> +	int cpu;
> +	struct percpu_list_node *head, *next;
> +
> +	do {
> +		cpu = rseq_current_cpu();
> +		head = list->heads[cpu];
> +		/*
> +		 * Unlike a traditional lock-less linked list; the availability
> +		 * of a cmpxchg-check primitive allows us to implement pop
> +		 * without concerns over ABA-type races.
> +		 */

The comment above clearly states that "rseq_percpu_cmpxchgcheck"
prevents ABA, and it indeed does, to some extent. My understanding
is that it prevents ABA from happening between rseq_start() and
rseq_finish().

The problem I see is with the percpu_list_pop() implementation. It
fetches list->heads[cpu] outside of the rseq critical section. I'm
concerned that a long preemption happening between loading the list
head and rseq_start() could cause an ABA that won't be detected by
rseq_finish().

The code seems to assume that checking that the rseq_current_cpu() value
read outside of the rseq critical section (just before loading the list
head) matches the CPU number seen in the cmpxchgcheck is sufficient to
ensure there is no ABA. However, this check detects neither preemption
nor signal delivery before rseq_start(). Moreover, it does not cover
cases where a thread is migrated to a remote CPU and then migrated back
to the original CPU.

A problematic execution sequence would be

* Exhibit A: ABA (all threads running on same CPU):

Initial state: the list has a single entry "object Z"

       Thread A                       Thread B
- percpu_list_pop()
  - cpu = rseq_current_cpu();
  - head = list->heads[cpu];
    (head is a pointer to object Z)
  - next = head->next;
  (preempted)
                                      (scheduled in)
                                      - percpu_list_pop()
                                        - cpu = rseq_current_cpu();
                                        - head = list->heads[cpu];
                                          (head is a pointer to object Z)
                                        - rseq_percpu_cmpxchgcheck succeeds
                                      - percpu_list_push of a new object Y
                                      - percpu_list_push of a re-used object Z
                                        (its next pointer now points to object Y
                                        rather than end of list)
                                      (preempted)
  (scheduled in)
  - rseq_percpu_cmpxchgcheck succeeds,
    setting a wrong value into the list
    head: it will store an end of list,
    thus skipping over object Y.

percpu_list_push() seems to show a similar ABA possibility
if preempted between reading the list head and the start of the
rseq critical section. In this situation, rseq_finish() would
not see preemption or signals delivered before the rseq_start().

In order to fix those ABA issues, my understanding is that we
should surround the entire do/while loop content of
percpu_list_push/pop by an rseq critical section, including
reading the list head.
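
In other words, something like the following (a sketch based on the rseq.h
helpers quoted further below; it only addresses the ABA concern, not the
object lifetime issue discussed next):

struct percpu_list_node *percpu_list_pop(struct percpu_list *list)
{
	struct percpu_list_node *head, *next;
	struct rseq_state start;
	int cpu;

	do {
		start = rseq_start();
		cpu = rseq_cpu_at_start(start);
		head = list->heads[cpu];
		if (!head)
			return NULL;
		/* Any preemption since rseq_start() aborts the commit. */
		next = head->next;
	} while (!rseq_finish((intptr_t *)&list->heads[cpu],
			      (intptr_t)next, start));

	return head;
}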


Another issue we run into is a use-after-free if we are preempted
before reading head->next.

* Exhibit B: Use after free (all threads running on same CPU):

Initial state: the list has a single entry "object Z"

       Thread A                       Thread B
- percpu_list_pop()
  - cpu = rseq_current_cpu();
  - head = list->heads[cpu];
    (head is a pointer to object Z)
  (preempted)
                                      (scheduled in)
                                      - percpu_list_pop()
                                        - cpu = rseq_current_cpu();
                                        - head = list->heads[cpu];
                                          (head is a pointer to object Z)
                                        - rseq_percpu_cmpxchgcheck succeeds
                                      - free object Z
                                      (preempted)
  (scheduled in)
  - reading free'd memory by trying
    to dereference:
    next = head->next;

Are complementary mechanisms expected to deal with the list element
life-time to prevent use-after-free, e.g. RCU ?

Or am I missing something ?

Thanks,

Mathieu

> +		if (!head) return 0;
> +		next = head->next;
> +	} while (cpu != rseq_percpu_cmpxchgcheck(cpu,
> +		(intptr_t *)&list->heads[cpu], (intptr_t)head, (intptr_t)next,
> +		(intptr_t *)&head->next, (intptr_t)next));
> +
> +	return head;
> +}
> +
> +
> +void *test_percpu_list_thread(void *arg)
> +{
> +	int i;
> +	struct percpu_list *list = (struct percpu_list *)arg;
> +
> +	rseq_init_current_thread();
> +	for (i = 0; i < 100000; i++) {
> +		struct percpu_list_node *node = percpu_list_pop(list);
> +		sched_yield();  /* encourage shuffling */
> +		if (node) percpu_list_push(list, node);
> +	}
> +
> +	return 0;
> +}
> +

[...]

> diff --git a/tools/testing/selftests/rseq/rseq.h
> b/tools/testing/selftests/rseq/rseq.h
> new file mode 100644
> index 0000000..d16e02d
> --- /dev/null
> +++ b/tools/testing/selftests/rseq/rseq.h
> @@ -0,0 +1,109 @@
> +#ifndef RSEQ_H
> +#define RSEQ_H
> +
> +#include <stdint.h>
> +
> +struct rseq_state {
> +	union {
> +		/*
> +		 * Updated by the kernel.  The time between two rseq state
> +		 * objects can be considered non-interrupted if-and-only-if
> +		 * both cpu and event_counter match.
> +		 *
> +		 * Explicitly: the kernel is allowed to maintain a
> +		 * per-cpu event_counter.
> +		 */
> +		struct {
> +			int cpu;
> +			int event_counter;
> +		};
> +		uint64_t storage;
> +	};
> +	void* post_commit_instr;
> +};
> +
> +extern __thread volatile struct rseq_state __rseq_state;
> +
> +int sys_rseq(int flags,
> +	     volatile uint64_t* event_and_cpu,
> +	     volatile void* post_commit_instr);
> +
> +#define barrier() __asm__ __volatile__("": : :"memory")
> +
> +static inline struct rseq_state rseq_start()
> +{
> +	struct rseq_state result = __rseq_state;
> +	/*
> +	 * We need to ensure that the compiler does not re-order the loads of
> +	 * any protected values before we read the current state.
> +	 */
> +	barrier();
> +	return result;
> +}
> +
> +static inline int rseq_cpu_at_start(struct rseq_state start_value)
> +{
> +	return start_value.cpu;
> +}
> +
> +static inline int rseq_current_cpu(void)
> +{
> +	return __rseq_state.cpu;
> +}
> +
> +static inline int rseq_finish(intptr_t *p, intptr_t to_write,
> +			      struct rseq_state start_value)
> +{
> +#ifdef __x86_64__
> +	__asm__ __volatile__ goto (
> +			"movq $%l[failed], %%rcx\n"
> +			"movq $1f, %[commit_instr]\n"
> +			"cmpq %[start_value], %[current_value]\n"
> +			"jnz %l[failed]\n"
> +			"movq %[to_write], (%[target])\n"
> +			"1: movq $0, %[commit_instr]\n"
> +	  : /* no outputs */
> +	  : [start_value]"d"(start_value.storage),
> +	    [current_value]"m"(__rseq_state),
> +	    [to_write]"r"(to_write),
> +	    [target]"r"(p),
> +	    [commit_instr]"m"(__rseq_state.post_commit_instr)
> +	  : "rcx", "memory"
> +	  : failed
> +	);
> +#elif __i386__
> +	__asm__ __volatile__ goto (
> +			"movl $%l[failed], %%ecx\n"
> +			"movl $1f, %[post_commit_instr]\n"
> +			"cmp %[start_cpu], %[current_cpu]\n"
> +			"jnz %l[failed]\n"
> +			"cmp %[start_event], %[current_event]\n"
> +			"jnz %l[failed]\n"
> +			"movl %[to_write], (%[target])\n"
> +			"1: movl $0, %[post_commit_instr]\n"
> +	  : /* no outputs */
> +	  : [start_cpu]"a"(start_value.cpu),
> +	    [start_event]"d"(start_value.event_counter),
> +	    [current_cpu]"g"(__rseq_state.cpu),
> +	    [current_event]"g"(__rseq_state.event_counter),
> +	    [post_commit_instr]"g"(__rseq_state.post_commit_instr),
> +	    [to_write]"r"(to_write),
> +	    [target]"r"(p)
> +	  : "ecx", "memory"
> +	  : failed
> +	);
> +#else
> +#error unsupported target
> +#endif
> +	return 1;
> +failed:
> +	return 0;
> +}
> +
> +/*
> + * Initialize rseq for the current thread.  Must be called once by any thread
> + * which uses restartable_sequences.
> + */
> +void rseq_init_current_thread(void);
> +
> +#endif  /* RSEQ_H_ */

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH v2 3/3] restartable sequences: basic self-tests
  2016-04-05 20:33   ` Mathieu Desnoyers
@ 2016-04-06  7:43     ` Peter Zijlstra
  2016-04-06 13:39       ` Mathieu Desnoyers
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2016-04-06  7:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul Turner, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Paul E. McKenney, linux-api, linux-kernel,
	Josh Triplett, Ingo Molnar, Chris Lameter, Linus Torvalds

On Tue, Apr 05, 2016 at 08:33:27PM +0000, Mathieu Desnoyers wrote:

> A problematic execution sequence would be
> 
> * Exhibit A: ABA (all threads running on same CPU):
> 
> Initial state: the list has a single entry "object Z"
> 
>        Thread A                       Thread B
> - percpu_list_pop()
>   - cpu = rseq_current_cpu();
>   - head = list->heads[cpu];
>     (head is a pointer to object Z)
>   - next = head->next;
>   (preempted)
>                                       (scheduled in)
>                                       - percpu_list_pop()
>                                         - cpu = rseq_current_cpu();
>                                         - head = list->heads[cpu];
>                                           (head is a pointer to object Z)
>                                         - rseq_percpu_cmpxchgcheck succeeds
>                                       - percpu_list_push of a new object Y
>                                       - percpu_list_push of a re-used object Z
>                                         (its next pointer now points to object Y
>                                         rather than end of list)
>                                       (preempted)
>   (scheduled in)
>   - rseq_percpu_cmpxchgcheck succeeds,
>     setting a wrong value into the list
>     head: it will store an end of list,
>     thus skipping over object Y.

OK, so I'm still trying to wake up, but I'm not seeing how
rseq_percpu_cmpxchgcheck() would succeed in this case.

If you look at the code, the 'check' part would fail, that is:

> +struct percpu_list_node *percpu_list_pop(struct percpu_list *list)
> +{
> +	int cpu;
> +	struct percpu_list_node *head, *next;
> +
> +	do {
> +		cpu = rseq_current_cpu();
> +		head = list->heads[cpu];
> +		/*
> +		 * Unlike a traditional lock-less linked list; the availability
> +		 * of a cmpxchg-check primitive allows us to implement pop
> +		 * without concerns over ABA-type races.
> +		 */
> +		if (!head) return 0;
> +		next = head->next;
> +	} while (cpu != rseq_percpu_cmpxchgcheck(cpu,
> +		(intptr_t *)&list->heads[cpu], (intptr_t)head, (intptr_t)next,
> +		(intptr_t *)&head->next, (intptr_t)next));

The extra compare is 'head->next == next', and our thread-A will have
@next == NULL (EOL), while the state after thread-B ran would be
@head->next = &Y.

So the check will fail, the cmpxchg will fail, and around we go.

> +
> +	return head;
> +}

Or am I completely not getting it?


* Re: [RFC PATCH v2 3/3] restartable sequences: basic self-tests
  2016-04-06  7:43     ` Peter Zijlstra
@ 2016-04-06 13:39       ` Mathieu Desnoyers
  2016-04-06 19:25         ` Peter Zijlstra
  0 siblings, 1 reply; 38+ messages in thread
From: Mathieu Desnoyers @ 2016-04-06 13:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Paul E. McKenney, linux-api, linux-kernel,
	Josh Triplett, Ingo Molnar, Chris Lameter, Linus Torvalds

----- On Apr 6, 2016, at 3:43 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Apr 05, 2016 at 08:33:27PM +0000, Mathieu Desnoyers wrote:
> 
>> A problematic execution sequence would be
>> 
>> * Exhibit A: ABA (all threads running on same CPU):
>> 
>> Initial state: the list has a single entry "object Z"
>> 
>>        Thread A                       Thread B
>> - percpu_list_pop()
>>   - cpu = rseq_current_cpu();
>>   - head = list->heads[cpu];
>>     (head is a pointer to object Z)
>>   - next = head->next;
>>   (preempted)
>>                                       (scheduled in)
>>                                       - percpu_list_pop()
>>                                         - cpu = rseq_current_cpu();
>>                                         - head = list->heads[cpu];
>>                                           (head is a pointer to object Z)
>>                                         - rseq_percpu_cmpxchgcheck succeeds
>>                                       - percpu_list_push of a new object Y
>>                                       - percpu_list_push of a re-used object Z
>>                                         (its next pointer now points to object Y
>>                                         rather than end of list)
>>                                       (preempted)
>>   (scheduled in)
>>   - rseq_percpu_cmpxchgcheck succeeds,
>>     setting a wrong value into the list
>>     head: it will store an end of list,
>>     thus skipping over object Y.
> 
> OK, so I'm still trying to wake up, but I'm not seeing how
> rseq_percpu_cmpxchgcheck() would succeed in this case.
> 
> If you look at the code, the 'check' part would fail, that is:
> 
>> +struct percpu_list_node *percpu_list_pop(struct percpu_list *list)
>> +{
>> +	int cpu;
>> +	struct percpu_list_node *head, *next;
>> +
>> +	do {
>> +		cpu = rseq_current_cpu();
>> +		head = list->heads[cpu];
>> +		/*
>> +		 * Unlike a traditional lock-less linked list; the availability
>> +		 * of a cmpxchg-check primitive allows us to implement pop
>> +		 * without concerns over ABA-type races.
>> +		 */
>> +		if (!head) return 0;
>> +		next = head->next;
>> +	} while (cpu != rseq_percpu_cmpxchgcheck(cpu,
>> +		(intptr_t *)&list->heads[cpu], (intptr_t)head, (intptr_t)next,
>> +		(intptr_t *)&head->next, (intptr_t)next));
> 
> The extra compare is 'head->next == next', and our thread-A will have
> @next == NULL (EOL), while the state after thread-B ran would be
> @head->next = &Y.
> 
> So the check will fail, the cmpxchg will fail, and around we go.
> 
>> +
>> +	return head;
>> +}
> 
> Or am I completely not getting it?

No, you're right. I entirely missed the role of check_ptr and
check_val in rseq_percpu_cmpxchgcheck. That indeed ensures we
atomically check, from a per-cpu perspective, that both the
pointer we are about to update and the next pointer are still
the same. Mystery solved. :-)

And of course, for percpu_list_push(), the plain rseq_percpu_cmpxchg()
used there is sufficient, because we only ever try to add a node we own
into the list, and only ever compare against the head. That one is
straightforwardly ABA-free even without rseq.
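
For reference, that push looks roughly like this (a reconstructed
sketch, assuming an rseq_percpu_cmpxchg(cpu, ptr, old, new) helper with
the same cpu-return convention as the cmpxchgcheck variant quoted
above; not a verbatim copy of the self-test code):

  void percpu_list_push(struct percpu_list *list,
			struct percpu_list_node *node)
  {
	int cpu;
	struct percpu_list_node *head;

	do {
		cpu = rseq_current_cpu();
		head = list->heads[cpu];
		/* We own 'node'; only the commit below publishes it. */
		node->next = head;
	} while (cpu != rseq_percpu_cmpxchg(cpu,
		(intptr_t *)&list->heads[cpu],
		(intptr_t)head, (intptr_t)node));
  }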

The question of use-after-free, however, still remains open. My
understanding is that this lock-free list should be paired with
either a type-safe memory allocator, using RCU, or a garbage
collector.
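
For example (again just a sketch, not something from the series),
recycling nodes through a dedicated per-cpu pool of the same node type
keeps the head->next load in percpu_list_pop() from ever dereferencing
memory that has been handed back to a general-purpose allocator:

  /* 'free_pool' would be a hypothetical percpu_list reserved for reuse */
  void percpu_list_node_free(struct percpu_list *free_pool,
			     struct percpu_list_node *node)
  {
	/* type-stable reuse: the node stays a percpu_list_node forever */
	percpu_list_push(free_pool, node);
  }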

Thoughts ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2015-10-27 23:56 [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections Paul Turner
                   ` (4 preceding siblings ...)
  2015-12-11 12:05 ` Mathieu Desnoyers
@ 2016-04-06 15:56 ` Andy Lutomirski
  2016-04-07 12:02   ` Peter Zijlstra
  5 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2016-04-06 15:56 UTC (permalink / raw)
  To: Paul Turner
  Cc: Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar, linux-kernel,
	Chris Lameter, Andi Kleen, Josh Triplett, Dave Watson,
	Andrew Hunter, Linus Torvalds, Linux API, Peter Zijlstra

On Oct 27, 2015 4:56 PM, "Paul Turner" <commonly@gmail.com> wrote:
>
> This is an update to the previously posted series at:
>   https://lkml.org/lkml/2015/6/24/665
>
> Dave Watson has posted a similar follow-up which allows additional critical
> regions to be registered as well as single-step support at:
>   https://lkml.org/lkml/2015/10/22/588
>
> This series is a new approach which introduces an alternate ABI that does not
> depend on open-coded assembly nor a central 'repository' of rseq sequences.
> Sequences may now be inlined and the preparatory[*] work for the sequence can
> be written in a higher level language.
>
> This new ABI has also been written to support debugger interaction in a way
> that the previous ABI could not.
>
> [*] A sequence essentially has 3 steps:
>   1) Determine which cpu the sequence is being run on
>   2) Preparatory work specific to the state read in 1)
>   3) A single commit instruction which finalizes any state updates.

After much delay, I have a few questions:


What are the useful commit operations?  IIUC they probably need to be
single instructions, which makes it sound like they're all either RMW
or store operations.  I think that plain stores are sufficient to
emulate RMW since (1) if the value changes out from under you, you'll
abort, and (2) most CPUs don't have single instruction RMW anyway.
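
(E.g. a per-cpu counter increment would be: load, add as preparatory
work, then a single committing store -- sketched below with made-up
helper names, since the exact API is what's under discussion:)

	do {
		state = rseq_start();	/* snapshot cpu + event counter */
		cpu = rseq_cpu(state);
		old = counter[cpu];	/* plain load */
		new = old + 1;		/* preparatory computation */
	} while (!rseq_commit_store(state, &counter[cpu], new));
						/* single store, or retry */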


We could probably speed up commits and make the code a bit more
obvious by feeding the kernel a pointer to a descriptor instead of
feeding it individual values.  For example, the descriptor could be:

struct commit_descriptor {
  unsigned long done_instruction;
  unsigned long failure_instruction;
  // maybe cpu and event number are in here
  // maybe *pointer* to event number is in here
};


Is your scheme safe against signals that are delivered during commit?
Would the signal context need to save and zero the commit state?


I still want to see if we can get away from the kernel-managed event
counter.  Would the following work:

start_sequence:
 read current CPU number
 change event counter
 re-read current CPU number [1]

commit:
  tell kernel we're committing
  re-check event counter and CPU number
  do the commit instruction
  tell kernel we're done committing

[1] avoids a memory ordering issue if we migrate before changing the
event counter

The kernel forces an abort if, on resume from any kernel entry, the
CPU number or event counter is wrong.

If this worked, then it would be inherently debuggable, since the only
way that it would abort in a single-threaded situation is if it
migrated during commit.

--Andy

>
> We require a single instruction for (3) so that if it is interrupted in any
> way, we can proceed from (1) once more [or potentially bail].
>
> This new ABI can be described as:
>  Progress is ordered as follows:
>   *0. Userspace stores current event+cpu counter values
>    1. Userspace loads the rip to move to at failure into cx
>    2. Userspace loads the rip of the instruction following
>       the critical section into a registered TLS address.
>    3. Userspace loads the values read at [0] into a known
>       location.
>    4. Userspace tests to see whether the current event and
>       cpu counter values match those stored at 0.  Manually
>       jumping to the address from [1] in the case of a
>       mismatch.
>
>       Note that if we are preempted or otherwise interrupted
>       then the kernel can also now perform this comparison
>       and conditionally jump us to [1].
>    5. Our final instruction before [2] is then our commit.
>       The critical section is self-terminating.  [2] must
>       also be cleared at this point.
>
>  For x86_64:
>    [3] uses rdx to represent cpu and event counter as a
>        single 64-bit value.
>
>  For i386:
>    [3] uses ax for cpu and dx for the event_counter.
>
>   Both:
>    Instruction after commit: rseq_state->post_commit_instr
>    Current event and cpu state: rseq_state->event_and_cpu
>
> Exactly, for x86_64 this looks like:
>   movq <failed>, rcx [1]
>   movq $1f, <commit_instr> [2]
>   cmpq <start value>, <current value> [3] (start is in rcx)
>   jnz <failed> (4)
>   movq <to_write>, (<target>) (5)
>   1: movq $0, <commit_instr>
>
> There has been some related discussion, which I am supportive of, in which
> we use fs/gs instead of TLS.  This maps naturally to the above and removes
> the current requirement for per-thread initialization (this is a good thing!).
>
> On debugger interactions:
>
> There are some nice properties about this new style of API which allow it to
> actually support safe interactions with a debugger:
>  a) The event counter is a per-cpu value.  This means that we can not advance
>     it if no threads from the same process execute on that cpu.  This
>     naturally allows basic single step support with thread-isolation.
>  b) Single-step can be augmented to evalute the ABI without incrementing the
>     event count.
>  c) A debugger can also be augmented to evaluate this ABI and push restarts
>     on the kernel's behalf.
>
> This is also compatible with David's approach of not single stepping between
> 2-4 above.  However, I think these are ultimately a little stronger since true
> single-stepping and breakpoint support would be available.  Which would be
> nice to allow actual debugging of sequences.
>
> (Note that I haven't bothered implementing these in the current patchset as we
> are still winnowing down on the ABI and they just add complexity.  It's
> important to note that they are possible however.)
>
> Thanks,
>
> - Paul
>
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/3] restartable sequences: basic self-tests
  2016-04-06 13:39       ` Mathieu Desnoyers
@ 2016-04-06 19:25         ` Peter Zijlstra
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2016-04-06 19:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul Turner, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Paul E. McKenney, linux-api, linux-kernel,
	Josh Triplett, Ingo Molnar, Chris Lameter, Linus Torvalds

On Wed, Apr 06, 2016 at 01:39:22PM +0000, Mathieu Desnoyers wrote:
> The question of use-after-free, however, still remains open. My
> understanding is that this lock-free list should be paired with
> either a type-safe memory allocator, using RCU, or a garbage
> collector.

Yeah, it looks that way indeed; there's no sane way to fix that either.
Even if you stick all this inside the rseq_start/finish thing you'll get
use-after-free, because we'll not restart until rseq_finish fails.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-06 15:56 ` Andy Lutomirski
@ 2016-04-07 12:02   ` Peter Zijlstra
  2016-04-07 14:35     ` Andy Lutomirski
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2016-04-07 12:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paul Turner, Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar,
	linux-kernel, Chris Lameter, Andi Kleen, Josh Triplett,
	Dave Watson, Andrew Hunter, Linus Torvalds, Linux API

On Wed, Apr 06, 2016 at 08:56:33AM -0700, Andy Lutomirski wrote:
> What are the useful commit operations?  IIUC they probably need to be
> single instructions, which makes it sound like they're all either RMW
> or store operations.  I think that plain stores are sufficient to
> emulate RMW since (1) if the value changes out from under you, you'll
> abort, and (2) most CPUs don't have single instruction RMW anyway.

Yeah so stores are sufficient. The only requirement is that the store is
a single operation/instruction. So you can typically not commit anything
wider than the native word size.

> We could probably speed up commits and make the code a bit more
> obvious by feeding the kernel a pointer to a descriptor instead of
> feeding it individual values.  For example, the descriptor could be:

See the Thread-Local-ABI effort by Mathieu, the idea is to get a single
full cacheline (64bytes) fixed size thingy allocated at a fixed offset
to the TCB.

That way we can reduce to %[gf]s:offset for these here variables (I
forever forget which segment register userspace uses for TLS).
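
Roughly, the sort of thing I mean (field names purely illustrative, not
the proposed layout; for the record, userspace TLS is %fs on x86_64 and
%gs on i386):

  struct thread_local_abi {
	uint64_t event_and_cpu;		/* cpu + event counter in one word */
	uint64_t post_commit_instr;	/* 0 when no commit is in flight   */
	/* ... room for other features, padded out to 64 bytes ... */
  } __attribute__((aligned(64)));

  extern __thread struct thread_local_abi __tlabi
	__attribute__((tls_model("initial-exec")));

With the initial-exec TLS model everything is then addressable at a
constant offset from the segment register, no indirection needed.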

> Is your scheme safe against signals that are delivered during commit?
> Would the signal context need to save and zero the commit state?

The patches you comment on here explicitly update the event from the
signal frame setup and thereby handle this.

The update not only increments the sequence count, but also tests the
post_commit_ip thing; if set, it assigns the fail value to regs->ip
(i.e. the commit fails).

tl;dr, yes it's safe against signals happening during commit.

> I still want to see if we can get away from the kernel-managed event
> counter.  Would the following work:
> 
> start_sequence:
>  read current CPU number
>  change event counter
>  re-read current CPU number [1]
> 
> commit:
>   tell kernel we're committing
>   re-check event counter and CPU number
>   do the commit instruction
>   tell kernel we're done committing
> 
> [1] avoids a memory ordering issue if we migrate before changing the
> event counter

So currently the event and cpu form a single 64bit value, so that on
64bit we can use a single load and cmp to verify them.

So letting the user manage the event is doable, but it would still be
advisable to have the event in the same shared word.
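
Something like this (the exact field order in Paul's patch may differ;
this is just to illustrate the single-word check):

  union rseq_event_and_cpu {
	uint64_t storage;		/* checked with one cmpq on x86_64 */
	struct {
		uint32_t cpu;		/* updated by the kernel           */
		uint32_t event;		/* could become user managed       */
	};
  };

A user-managed event would then only change who writes .event; the
commit still compares .storage against the value sampled at the start
in a single instruction.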

> The kernel forces an abort if, on resume from any kernel entry, the
> CPU number or event counter is wrong.
> 
> If this worked, then it would be inherently debuggable, since the only
> way that it would abort in a single-threaded situation is if it
> migrated during commit.

(or a signal happened, as per the below)

Tempting.. not signal safe though, although I suppose we can still
explicitly do the:

  if (regs->ip < post_commit_ip)
    regs->ip = regs->rcx;

thing on signal frame setup to abort any in-flight commit without
explicitly incrementing the sequence number.

Not having to manage the event count from the kernel reduces the kernel
work to migration only -- i.e. we can get rid of the preemption hook and
reduce overhead there, something I'm entirely happy with if possible.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-07 12:02   ` Peter Zijlstra
@ 2016-04-07 14:35     ` Andy Lutomirski
  2016-04-07 15:24       ` Peter Zijlstra
  0 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2016-04-07 14:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, Linux API,
	linux-kernel, Andrew Hunter, Linus Torvalds

On Apr 7, 2016 5:02 AM, "Peter Zijlstra" <peterz@infradead.org> wrote:
>
> On Wed, Apr 06, 2016 at 08:56:33AM -0700, Andy Lutomirski wrote:
> > What are the useful commit operations?  IIUC they probably need to be
> > single instructions, which makes it sound like they're all either RMW
> > or store operations.  I think that plain stores are sufficient to
> > emulate RMW since (1) if the value changes out from under you, you'll
> > abort, and (2) most CPUs don't have single instruction RMW anyway.
>
> Yeah so stores are sufficient. The only requirement is that the store is
> a single operation/instruction. So you can typically not commit anything
> wider than the native word size.
>
> > We could probably speed up commits and make the code a bit more
> > obvious by feeding the kernel a pointer to a descriptor instead of
> > feeding it individual values.  For example, the descriptor could be:
>
> See the Thread-Local-ABI effort by Mathieu, the idea is to get a single
> full cacheline (64bytes) fixed size thingy allocated at a fixed offset
> to the TCB.
>
> That way we can reduce to %[gf]s:offset for these here variables (I
> forever forget which segment register userspace uses for TLS).

What I meant was: rather than shoving individual values into the TLABI
thing, shove in a pointer:

struct commit_info {
  u64 post_commit_rip;
  u32 cpu;
  u64 *event;
  // whatever else;
};

and then put a commit_info* in TLABI.

This would save some bytes in the TLABI structure.

>
> > Is your scheme safe against signals that are delivered during commit?
> > Would the signal context need to save and zero the commit state?
>
> The patches you comment on here explicitly update the event from the
> signal frame setup and thereby handle this.
>
> The update not only increments the sequence count, but also tests the
> post_commit_ip thing; if set, it assigns the fail value to regs->ip (i.e. the
> commit fails).
>
> tl;dr, yes it's safe against signals happening during commit.

That's not quite what I meant -- see below.

>
> > I still want to see if we can get away from the kernel-managed event
> > counter.  Would the following work:
> >
> > start_sequence:
> >  read current CPU number
> >  change event counter
> >  re-read current CPU number [1]
> >
> > commit:
> >   tell kernel we're committing
> >   re-check event counter and CPU number
> >   do the commit instruction
> >   tell kernel we're done committing
> >
> > [1] avoids a memory ordering issue if we migrate before changing the
> > event counter
>
> So currently the event and cpu form a single 64bit value, so that on
> 64bit we can use a single load and cmp to verify them.
>
> So letting the user manage the event is doable, but it would still be
> advisable to have the event in the same shared word.
>

Why is a single load needed?  The event and CPU in the TLABI structure
are only ever accessed from the thread in question.

> > The kernel forces an abort if, on resume from any kernel entry, the
> > CPU number or event counter is wrong.
> >
> > If this worked, then it would be inherently debuggable, since the only
> > way that it would abort in a single-threaded situation is if it
> > migrated during commit.
>
> (or a signal happened, as per the below)
>
> Tempting.. not signal safe though, although I suppose we can still
> explicitly do the:
>
>   if (regs->ip < post_commit_ip)
>     regs->ip = regs->rcx;
>
> thing on signal frame setup to abort any in-flight commit without
> explicitly incrementing the sequence number.

What I meant was: what if we did it a bit differently?  On signal
delivery, we could stash post_commit_ip, etc in the signal context and
then zero them out.  On sigreturn, we would restore them from the
signal frame back into TLABI and then, before resuming userspace, we'd
check for an abort.

That way we could take an async signal, handle it, and resume, even in
the middle of a commit, without aborting.  Of course, if the signal
handler tried to access the same rseq-protected resource, it would bump
the event counter and cause an abort.
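
In kernel-ish pseudocode (all names made up, just to illustrate the
ordering):

	/* signal delivery */
	frame->saved_post_commit_ip = tlabi->post_commit_ip;
	frame->saved_abort_ip = tlabi->abort_ip;
	tlabi->post_commit_ip = 0;	/* handler sees no commit in flight */

	/* sigreturn */
	tlabi->post_commit_ip = frame->saved_post_commit_ip;
	tlabi->abort_ip = frame->saved_abort_ip;
	if (tlabi->post_commit_ip && rseq_event_mismatch(current))
		regs->ip = tlabi->abort_ip;	/* handler touched our state */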

I think it's safe in the sense that a signal will abort a commit
operation.  I'm wondering if we can do better: what if a signal could
avoid interfering with commit unless it would actually conflict?  Food
for thought: what if, instead of aborting ...

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-07 14:35     ` Andy Lutomirski
@ 2016-04-07 15:24       ` Peter Zijlstra
  2016-04-07 15:39         ` Peter Zijlstra
  2016-04-07 15:44         ` Andy Lutomirski
  0 siblings, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2016-04-07 15:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, Linux API,
	linux-kernel, Andrew Hunter, Linus Torvalds

On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
> What I meant was: rather than shoving individual values into the TLABI
> thing, shove in a pointer:
> 
> struct commit_info {
>   u64 post_commit_rip;
>   u32 cpu;
>   u64 *event;
>   // whatever else;
> };
> 
> and then put a commit_info* in TLABI.
> 
> This would save some bytes in the TLABI structure.

But would cost us extra indirections. The whole point was getting this
stuff at a constant offset from the TLS segment register.

> > So letting the user manage the event is doable, but it would still be
> > advisable to have the event in the same shared word.
> >
> 
> Why is a single load needed?  The event and CPU in the TLABI structure
> are only ever accessed from the thread in question.

It's not required, but being able to do a single load saves on cycles,
less cycles is more good.

> That way we could take an async signal, handle it, and resume, even in
> the middle of a commit, without aborting.  Of course, if the signal
> handler tried to access the same rseq-protected resource, it would bump
> the event counter and cause an abort.

Ah, so what happens if the signal happens before the commit but after
the load of the seqcount?

Then, even if the signal modifies the count, we'll not observe it.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-07 15:24       ` Peter Zijlstra
@ 2016-04-07 15:39         ` Peter Zijlstra
  2016-04-07 15:44         ` Andy Lutomirski
  1 sibling, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2016-04-07 15:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, Linux API,
	linux-kernel, Andrew Hunter, Linus Torvalds

On Thu, Apr 07, 2016 at 05:24:32PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
> > That way we could take an async signal, handle it, and resume, even in
> > the middle of a commit, without aborting.  Of course, if the signal
> handler tried to access the same rseq-protected resource, it would bump
> > the event counter and cause an abort.
> 
> Ah, so what happens if the signal happens before the commit but after
> the load of the seqcount?
> 
> Then, even if the signal modifies the count, we'll not observe it.

Ah, and the same is true for preemptions. Which is why all this was
preemption driven, and not migration driven.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-07 15:24       ` Peter Zijlstra
  2016-04-07 15:39         ` Peter Zijlstra
@ 2016-04-07 15:44         ` Andy Lutomirski
  2016-04-07 15:53           ` Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2016-04-07 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, Linux API,
	linux-kernel, Andrew Hunter, Linus Torvalds

On Thu, Apr 7, 2016 at 8:24 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
>> What I meant was: rather than shoving individual values into the TLABI
>> thing, shove in a pointer:
>>
>> struct commit_info {
>>   u64 post_commit_rip;
>>   u32 cpu;
>>   u64 *event;
>>   // whatever else;
>> };
>>
>> and then put a commit_info* in TLABI.
>>
>> This would save some bytes in the TLABI structure.
>
> But would cost us extra indirections. The whole point was getting this
> stuff at a constant offset from the TLS segment register.

I don't think the extra indirections would matter much.  The kernel
would have to chase the pointer, but only in the very rare case where
it resumes userspace during a commit or on the immediately following
instruction.

At the very least, post_commit_rip and the abort address (which I
forgot about) could both live in a static structure, and shoving a
pointer to *that* into TLABI space is one store instead of two.

Admittedly, this is just a minor tweak either way.  I'm just wondering
whether the extra-indirection variant ends up being slightly faster.

>
>> > So letting the user manage the event is doable, but it would still be
>> > advisable to have the event in the same shared word.
>> >
>>
>> Why is a single load needed?  The event and CPU in the TLABI structure
>> are only ever accessed from the thread in question.
>
> It's not required, but being able to do a single load saves on cycles,
> less cycles is more good.
>
>> That way we could take an async signal, handle it, and resume, even in
>> the middle of a commit, without aborting.  Of course, if the signal
> handler tried to access the same rseq-protected resource, it would bump
>> the event counter and cause an abort.
>
> Ah, so what happens if the signal happens before the commit but after
> the load of the seqcount?
>
> Then, even if the signal modifies the count, we'll not observe it.
>

Where exactly?

In my scheme, nothing except the kernel ever loads the seqcount.  The
user code generates a fresh value, writes it to memory, and then, just
before commit, writes that same value to the TLABI area and then
double-checks that the value it wrote at the beginning is still there.

If the signal modifies the count, then the user code won't directly
notice, but prepare_exit_to_usermode on the way out of the signal will
notice that the (restored) TLABI state doesn't match the counter that
the signal handler changed and will jump to the abort address.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-07 15:44         ` Andy Lutomirski
@ 2016-04-07 15:53           ` Peter Zijlstra
  2016-04-07 16:43             ` Andy Lutomirski
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2016-04-07 15:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, Linux API,
	linux-kernel, Andrew Hunter, Linus Torvalds

On Thu, Apr 07, 2016 at 08:44:38AM -0700, Andy Lutomirski wrote:
> On Thu, Apr 7, 2016 at 8:24 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
> >> What I meant was: rather than shoving individual values into the TLABI
> >> thing, shove in a pointer:
> >>
> >> struct commit_info {
> >>   u64 post_commit_rip;
> >>   u32 cpu;
> >>   u64 *event;
> >>   // whatever else;
> >> };
> >>
> >> and then put a commit_info* in TLABI.
> >>
> >> This would save some bytes in the TLABI structure.
> >
> > But would cost us extra indirections. The whole point was getting this
> > stuff at a constant offset from the TLS segment register.
> 
> I don't think the extra indirections would matter much.  The kernel
> would have to chase the pointer, but only in the very rare case where
> it resumes userspace during a commit or on the immediately following
> instruction.

It's about userspace finding these values, not the kernel.

> At the very least, post_commit_rip and the abort address (which I
> forgot about) could both live in a static structure,

Paul keeps the abort address in rcx.

> and shoving a
> pointer to *that* into TLABI space is one store instead of two.

> > Ah, so what happens if the signal happens before the commit but after
> > the load of the seqcount?
> >
> > Then, even if the signal modifies the count, we'll not observe it.
> >
> 
> Where exactly?
> 
> In my scheme, nothing except the kernel ever loads the seqcount.  The
> user code generates a fresh value, writes it to memory, and then, just
> before commit, writes that same value to the TLABI area and then
> double-checks that the value it wrote at the beginning is still there.
> 
> If the signal modifies the count, then the user code won't directly
> notice, but prepare_exit_to_usermode on the way out of the signal will
> notice that the (restored) TLABI state doesn't match the counter that
> the signal handler changed and will jump to the abort address.


OK, you lost me..  commit looks like:

+       __asm__ __volatile__ goto (
+                       "movq $%l[failed], %%rcx\n"
+                       "movq $1f, %[commit_instr]\n"
+                       "cmpq %[start_value], %[current_value]\n"

If we get preempted/signaled here without the preemption/signal entry
checking for the post_commit_instr, we'll fail hard.

+                       "jnz %l[failed]\n"
+                       "movq %[to_write], (%[target])\n"
+                       "1: movq $0, %[commit_instr]\n"
+         : /* no outputs */
+         : [start_value]"d"(start_value.storage),
+           [current_value]"m"(__rseq_state),
+           [to_write]"r"(to_write),
+           [target]"r"(p),
+           [commit_instr]"m"(__rseq_state.post_commit_instr)
+         : "rcx", "memory"
+         : failed
+       );

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-07 15:53           ` Peter Zijlstra
@ 2016-04-07 16:43             ` Andy Lutomirski
  2016-04-07 20:11               ` Peter Zijlstra
                                 ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-04-07 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, Linux API,
	linux-kernel, Andrew Hunter, Linus Torvalds

On Thu, Apr 7, 2016 at 8:53 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Apr 07, 2016 at 08:44:38AM -0700, Andy Lutomirski wrote:
>> On Thu, Apr 7, 2016 at 8:24 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
>> >> What I meant was: rather than shoving individual values into the TLABI
>> >> thing, shove in a pointer:
>> >>
>> >> struct commit_info {
>> >>   u64 post_commit_rip;
>> >>   u32 cpu;
>> >>   u64 *event;
>> >>   // whatever else;
>> >> };
>> >>
>> >> and then put a commit_info* in TLABI.
>> >>
>> >> This would save some bytes in the TLABI structure.
>> >
>> > But would cost us extra indirections. The whole point was getting this
>> > stuff at a constant offset from the TLS segment register.
>>
>> I don't think the extra indirections would matter much.  The kernel
>> would have to chase the pointer, but only in the very rare case where
>> it resumes userspace during a commit or on the immediately following
>> instruction.
>
> It's about userspace finding these values, not the kernel.

But userspace never reads post_commit_rip or the abort address AFAIK.

>
>> At the very least, post_commit_rip and the abort address (which I
>> forgot about) could both live in a static structure,
>
> Paul keeps the abort address in rcx.

So it's probably a wash.  Paul's way is to load the abort address in
rcx (one instruction, no memory access) and to put post_commit_rip in
TLABI (one store).  Mine is to put a pointer to a descriptor into
TLABI (one store).  Should barely matter.

>
>> and shoving a
>> pointer to *that* into TLABI space is one store instead of two.
>
>> > Ah, so what happens if the signal happens before the commit but after
>> > the load of the seqcount?
>> >
>> > Then, even if the signal modifies the count, we'll not observe it.
>> >
>>
>> Where exactly?
>>
>> In my scheme, nothing except the kernel ever loads the seqcount.  The
>> user code generates a fresh value, writes it to memory, and then, just
>> before commit, writes that same value to the TLABI area and then
>> double-checks that the value it wrote at the beginning is still there.
>>
>> If the signal modifies the count, then the user code won't directly
>> notice, but prepare_exit_to_usermode on the way out of the signal will
>> notice that the (restored) TLABI state doesn't match the counter that
>> the signal handler changed and will jump to the abort address.
>
>
> OK, you lost me..  commit looks like:
>
> +       __asm__ __volatile__ goto (
> +                       "movq $%l[failed], %%rcx\n"
> +                       "movq $1f, %[commit_instr]\n"
> +                       "cmpq %[start_value], %[current_value]\n"
>
> If we get preempted/signaled here without the preemption/signal entry
> checking for the post_commit_instr, we'll fail hard.

Not if I'm thinking about this right.  Because, in my scheme, we'd be
storing both the address of the counter [1] and the value that we
expect it to be (current_value?  something's confused in the asm or
the variable naming, or I'm just not following exactly which variable
is which) in registers or TLABI, and the kernel would notice that
*counter != expected_counter when it resumes and would send us to the
abort label.

Presumably there would be one register hard-coded as the expected
counter value and another register hard-coded as the address of the
counter.  That would avoid a bunch of stores to TLABI, and the kernel
could still find them.  If the C optimizer did a good job, the
resulting code would be simple.

[1] In my scheme, there would be no reason for the counter to live in
TLABI.  In contrast, there would be an excellent reason for it to
*not* live in TLABI -- user code could have more than one set of
counters, and they wouldn't interfere with each other.  This would be
nice if one of them protects something that uses long enough code in
the critical sections that it's likely to be preempted.

>
> +                       "jnz %l[failed]\n"
> +                       "movq %[to_write], (%[target])\n"
> +                       "1: movq $0, %[commit_instr]\n"
> +         : /* no outputs */
> +         : [start_value]"d"(start_value.storage),
> +           [current_value]"m"(__rseq_state),
> +           [to_write]"r"(to_write),
> +           [target]"r"(p),
> +           [commit_instr]"m"(__rseq_state.post_commit_instr)
> +         : "rcx", "memory"
> +         : failed
> +       );


More concretely, this looks like (using totally arbitrary register
assignments -- probably far from ideal, especially given how GCC's
constraints work):

enter the critical section:
1:
movq %[cpu], %%r12
movq {address of counter for our cpu}, %%r13
movq {some fresh value}, (%%r13)
cmpq %[cpu], %%r12
jne 1b

... do whatever setup or computation is needed...

movq $%l[failed], %%rcx
movq $1f, %[commit_instr]
cmpq {whatever counter we chose}, (%%r13)
jne %l[failed]
cmpq %[cpu], %%r12
jne %l[failed]

<-- a signal in here that conflicts with us would clobber (%%r13), and
the kernel would notice and send us to the failed label

movq %[to_write], (%[target])
1: movq $0, %[commit_instr]

In contrast to Paul's scheme, this has two additional (highly
predictable) branches and requires generation of a seqcount in
userspace.  In its favor, though, it doesn't need preemption hooks,
it's inherently debuggable, and it allows multiple independent
rseq-protected things to coexist without forcing each other to abort.

The first cmpq might not be necessary.  I put it there to ensure that
"some fresh value" is visible on the chosen CPU before any other loads
happen, but it's possible that the other cpu double-check is enough.


(Off topic: some of those movq instructions should probably be
rip-relative leaq instead.)

If my descriptor approach were used, we'd save the write to rcx.
Instead we'd do:

.pushsection .rodata
2:
.quad %l[failed] - .  /* or similar */
.quad 1b - .
.popsection
leaq 2b(%rip), %[tlabi_rseq_desc]

That's the optimization I was talking about above, although I did a
bad job of describing it.
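
(The kernel-side view of that descriptor would be something like this,
with made-up names and the offsets resolved against the descriptor's
own address:)

	struct rseq_commit_desc {
		int64_t abort_ip_off;		/* %l[failed] - . */
		int64_t post_commit_ip_off;	/* 1b - .         */
	};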

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-07 16:43             ` Andy Lutomirski
@ 2016-04-07 20:11               ` Peter Zijlstra
  2016-04-07 22:05                 ` Andy Lutomirski
  2016-04-08  6:41               ` Peter Zijlstra
  2016-04-11 21:55               ` Mathieu Desnoyers
  2 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2016-04-07 20:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, Linux API,
	linux-kernel, Andrew Hunter, Linus Torvalds

On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
> More concretely, this looks like (using totally arbitrary register
> assignments -- probably far from ideal, especially given how GCC's
> constraints work):
> 
> enter the critical section:
> 1:
> movq %[cpu], %%r12
> movq {address of counter for our cpu}, %%r13
> movq {some fresh value}, (%%r13)
> cmpq %[cpu], %%r12
> jne 1b
> 
> ... do whatever setup or computation is needed...
> 
> movq $%l[failed], %%rcx
> movq $1f, %[commit_instr]
> cmpq {whatever counter we chose}, (%%r13)
> jne %l[failed]
> cmpq %[cpu], %%r12
> jne %l[failed]
> 
> <-- a signal in here that conflicts with us would clobber (%%r13), and
> the kernel would notice and send us to the failed label
> 
> movq %[to_write], (%[target])
> 1: movq $0, %[commit_instr]

And the kernel, for every thread that has had the syscall called and a
thingy registered, needs to (at preempt/signal-setup):

	if (get_user(post_commit_ip, current->post_commit_ip))
		return -EFAULT;

	if (likely(!post_commit_ip))
		return 0;

	if (regs->ip >= post_commit_ip)
		return 0;

	if (get_user(seq, (u32 __user *)regs->r13))
		return -EFAULT;

	if (regs->$(which one holds our chosen seq?) == seq) {
		/* nothing changed, do not cancel, proceed to commit. */
		return 0;
	}

	if (put_user(0UL, current->post_commit_ip))
		return -EFAULT;

	regs->ip = regs->rcx;


> In contrast to Paul's scheme, this has two additional (highly
> predictable) branches and requires generation of a seqcount in
> userspace.  In its favor, though, it doesn't need preemption hooks,

Without preemption hooks, how would one thread preempting another at the
above <-- clobber anything and cause the commit to fail?

> it's inherently debuggable, 

It is more debuggable, agreed.

> and it allows multiple independent
> rseq-protected things to coexist without forcing each other to abort.

And the kernel only needs to load the second cacheline if it lands in
the middle of a finish block, which should be manageable overhead I
suppose.

But the userspace chunk is a lot slower, as it always needs to touch
multiple lines, since the @cpu, @seq and @post_commit_ip all live in
separate lines (although I suppose @cpu and @post_commit_ip could live
in the same).

The finish thing needs 3 registers for:

 - fail ip
 - seq pointer
 - seq value

Which I suppose is possible even on register constrained architectures
like i386.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-07 20:11               ` Peter Zijlstra
@ 2016-04-07 22:05                 ` Andy Lutomirski
  2016-04-08  1:11                   ` Mathieu Desnoyers
  2016-04-08 11:02                   ` Peter Zijlstra
  0 siblings, 2 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-04-07 22:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, Linux API,
	linux-kernel, Andrew Hunter, Linus Torvalds

On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
>> More concretely, this looks like (using totally arbitrary register
>> assignments -- probably far from ideal, especially given how GCC's
>> constraints work):
>>
>> enter the critical section:
>> 1:
>> movq %[cpu], %%r12
>> movq {address of counter for our cpu}, %%r13
>> movq {some fresh value}, (%%r13)
>> cmpq %[cpu], %%r12
>> jne 1b
>>
>> ... do whatever setup or computation is needed...
>>
>> movq $%l[failed], %%rcx
>> movq $1f, %[commit_instr]
>> cmpq {whatever counter we chose}, (%%r13)
>> jne %l[failed]
>> cmpq %[cpu], %%r12
>> jne %l[failed]
>>
>> <-- a signal in here that conflicts with us would clobber (%%r13), and
>> the kernel would notice and send us to the failed label
>>
>> movq %[to_write], (%[target])
>> 1: movq $0, %[commit_instr]
>
> And the kernel, for every thread that has had the syscall called and a
> thingy registered, needs to (at preempt/signal-setup):
>
>         if (get_user(post_commit_ip, current->post_commit_ip))
>                 return -EFAULT;
>
>         if (likely(!post_commit_ip))
>                 return 0;
>
>         if (regs->ip >= post_commit_ip)
>                 return 0;
>
>         if (get_user(seq, (u32 __user *)regs->r13))
>                 return -EFAULT;
>
>         if (regs->$(which one holds our chosen seq?) == seq) {
>                 /* nothing changed, do not cancel, proceed to commit. */
>                 return 0;

Only return zero if regs->$(which one holds the cpu) == smp_processor_id().

>         }
>
>         if (put_user(0UL, current->post_commit_ip))
>                 return -EFAULT;
>
>         regs->ip = regs->rcx;

I was imagining this happening at (return to userspace or preempt) and
possibly at signal return, but yes, more or less.

>
>
>> In contrast to Paul's scheme, this has two additional (highly
>> predictable) branches and requires generation of a seqcount in
>> userspace.  In its favor, though, it doesn't need preemption hooks,
>
> Without preemption hooks, how would one thread preempting another at the
> above <-- clobber anything and cause the commit to fail?

It doesn't, which is what I like about my variant.  If the thread
accesses the protected data structure, though, it should bump the
sequence count, which will cause the first thread to abort when it
gets scheduled in.

>
>> it's inherently debuggable,
>
> It is more debuggable, agreed.
>
>> and it allows multiple independent
>> rseq-protected things to coexist without forcing each other to abort.
>
> And the kernel only needs to load the second cacheline if it lands in
> the middle of a finish block, which should be manageable overhead I
> suppose.
>
> But the userspace chunk is lots slower as it needs to always touch
> multiple lines, since the @cpu, @seq and @post_commit_ip all live in
> separate lines (although I suppose @cpu and @post_commit_ip could live
> in the same).
>
> The finish thing needs 3 registers for:
>
>  - fail ip
>  - seq pointer
>  - seq value
>
> Which I suppose is possible even on register constrained architectures
> like i386.

I think this can all be munged into two cachelines:

One cacheline contains the per-thread CPU number and post_commit_ip
(either by doing it over Linus' dead body or by having userspace
allocate it carefully).  The other contains the sequence counter *and*
the percpu data structure that's protected.  So in some sense it's the
same number of cache lines as Paul's version.
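
Roughly (layout and names just for illustration):

	/* line 1: per thread */
	struct rseq_tlabi {
		uint32_t cpu;
		uint64_t post_commit_ip;
		uint64_t abort_ip;
		/* padded out to 64 bytes */
	} __attribute__((aligned(64)));

	/* line 2: per cpu -- seqcount co-located with what it protects */
	struct percpu_bucket {
		uint64_t seq;		/* bumped by anyone touching 'data' */
		uint64_t data[7];	/* placeholder for the protected state */
	} __attribute__((aligned(64)));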

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-07 22:05                 ` Andy Lutomirski
@ 2016-04-08  1:11                   ` Mathieu Desnoyers
  2016-04-08  1:21                     ` Andy Lutomirski
  2016-04-08 11:02                   ` Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: Mathieu Desnoyers @ 2016-04-08  1:11 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra
  Cc: Paul E. McKenney, Ingo Molnar, Paul Turner, Andi Kleen,
	Chris Lameter, Dave Watson, Josh Triplett, linux-api,
	linux-kernel, Andrew Hunter, Linus Torvalds

----- On Apr 7, 2016, at 6:05 PM, Andy Lutomirski luto@amacapital.net wrote:

> On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
[...]
>>
>>> it's inherently debuggable,
>>
>> It is more debuggable, agreed.
>>
>>> and it allows multiple independent
>>> rseq-protected things to coexist without forcing each other to abort.

[...]

My understanding is that the main goal of this rather more complex
proposal is to make interaction with debuggers more straightforward in
cases of single-stepping through the rseq critical section.

I recently came up with a scheme that should allow us to handle such
situations in a fashion similar to debuggers handling ll/sc
restartable sequences of instructions on e.g. powerpc. The good news
is that my scheme does not require anything at the kernel level.

The idea is simple: the userspace rseq critical sections now
become marked by 3 inline functions (rather than 2 in Paul's proposal):

rseq_start(void *rseq_key)
rseq_finish(void *rseq_key)
rseq_abort(void *rseq_key)

We associate each critical section with a unique "key" (dummy
1 byte object in the process address space), so we can group
them. The new "rseq_abort" would mark exit points that would
exit the critical section without executing the final commit
instruction.

Within each of rseq_start, rseq_finish and rseq_abort,
we declare a non-loadable section that gets populated
with the following tuples:

(RSEQ_TYPE, insn address, rseq_key)

Where RSEQ_TYPE is either RSEQ_START, RSEQ_FINISH, or RSEQ_ABORT.

That special section would be found in the executable by the
debugger, which can then skip over entire restartable critical
sections when it encounters them by placing breakpoints at
all exit points (finish and abort) associated with the same
rseq_key as the entry point (start).
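
Concretely, each tuple could look something like this (a sketch; the
exact encoding and section name would still need to be defined),
emitted by the three inline functions through inline asm into a section
that is present in the ELF file but not loaded at run time:

  enum rseq_mark_type {
	RSEQ_START,
	RSEQ_FINISH,
	RSEQ_ABORT,
  };

  struct rseq_mark {
	unsigned long type;	/* RSEQ_START / RSEQ_FINISH / RSEQ_ABORT */
	void *insn;		/* address of the marked instruction     */
	const void *rseq_key;	/* groups entry and exit points together */
  };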

This way we don't need to complexify the runtime code, neither
at kernel nor user-space level, and we get debuggability using
a trick similar to what ll/sc architectures already need to do.

Of course, this requires extending gdb, which should not be
a show-stopper.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-08  1:11                   ` Mathieu Desnoyers
@ 2016-04-08  1:21                     ` Andy Lutomirski
  2016-04-08  2:05                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2016-04-08  1:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, linux-api,
	linux-kernel, Andrew Hunter, Linus Torvalds

On Thu, Apr 7, 2016 at 6:11 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> ----- On Apr 7, 2016, at 6:05 PM, Andy Lutomirski luto@amacapital.net wrote:
>
>> On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
> [...]
>>>
>>>> it's inherently debuggable,
>>>
>>> It is more debuggable, agreed.
>>>
>>>> and it allows multiple independent
>>>> rseq-protected things to coexist without forcing each other to abort.
>
> [...]
>
> My understanding is that the main goal of this rather more complex
> proposal is to make interaction with debuggers more straightforward in
> cases of single-stepping through the rseq critical section.

The things I like about my proposal are both that you can single-step
through it just like any other code as long as you pin the thread to a
CPU and that it doesn't make preemption magical.  (Of course, you can
*force* it to do something on resume and/or preemption by sticking a
bogus value in the expected event count field, but that's not the
intended use.  Hmm, I guess it does need to hook preemption and/or
resume for all processes that enable the thing so it can know to check
for an enabled post_commit_rip, just like all the other proposals.)

Also, mine lets you have a fairly long-running critical section that
doesn't get aborted under heavy load and can interleave with other
critical sections that don't conflict.

>
> I recently came up with a scheme that should allow us to handle such
> situations in a fashion similar to debuggers handling ll/sc
> restartable sequences of instructions on e.g. powerpc. The good news
> is that my scheme does not require anything at the kernel level.
>
> The idea is simple: the userspace rseq critical sections now
> become marked by 3 inline functions (rather than 2 in Paul's proposal):
>
> rseq_start(void *rseq_key)
> rseq_finish(void *rseq_key)
> rseq_abort(void *rseq_key)

How do you use this thing?  What are its semantics?

--Andy

>
> We associate each critical section with a unique "key" (dummy
> 1 byte object in the process address space), so we can group
> them. The new "rseq_abort" would mark exit points that would
> exit the critical section without executing the final commit
> instruction.
>
> Within each of rseq_start, rseq_finish and rseq_abort,
> we declare a non-loadable section that gets populated
> with the following tuples:
>
> (RSEQ_TYPE, insn address, rseq_key)
>
> Where RSEQ_TYPE is either RSEQ_START, RSEQ_FINISH, or RSEQ_ABORT.
>
> That special section would be found in the executable by the
> debugger, which can then skip over entire restartable critical
> sections when it encounters them by placing breakpoints at
> all exit points (finish and abort) associated with the same
> rseq_key as the entry point (start).
>
> This way we don't need to complexify the runtime code, neither
> at kernel nor user-space level, and we get debuggability using
> a trick similar to what ll/sc architectures already need to do.
>
> Of course, this requires extending gdb, which should not be
> a show-stopper.
>
> Thoughts ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-08  1:21                     ` Andy Lutomirski
@ 2016-04-08  2:05                       ` Mathieu Desnoyers
  2016-04-08 17:46                         ` Mathieu Desnoyers
  0 siblings, 1 reply; 38+ messages in thread
From: Mathieu Desnoyers @ 2016-04-08  2:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, linux-api,
	linux-kernel, Andrew Hunter, Linus Torvalds

----- On Apr 7, 2016, at 9:21 PM, Andy Lutomirski luto@amacapital.net wrote:

> On Thu, Apr 7, 2016 at 6:11 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>> ----- On Apr 7, 2016, at 6:05 PM, Andy Lutomirski luto@amacapital.net wrote:
>>
>>> On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
>> [...]
>>>>
>>>>> it's inherently debuggable,
>>>>
>>>> It is more debuggable, agreed.
>>>>
>>>>> and it allows multiple independent
>>>>> rseq-protected things to coexist without forcing each other to abort.
>>
>> [...]
>>
>> My understanding is that the main goal of this rather more complex
>> proposal is to make interaction with debuggers more straightforward in
>> cases of single-stepping through the rseq critical section.
> 
> The things I like about my proposal are both that you can single-step
> through it just like any other code as long as you pin the thread to a
> CPU and that it doesn't make preemption magical.  (Of course, you can
> *force* it to do something on resume and/or preemption by sticking a
> bogus value in the expected event count field, but that's not the
> intended use.  Hmm, I guess it does need to hook preemption and/or
> resume for all processes that enable the thing so it can know to check
> for an enabled post_commit_rip, just like all the other proposals.)
> 
> Also, mine lets you have a fairly long-running critical section that
> doesn't get aborted under heavy load and can interleave with other
> critical sections that don't conflict.

Yes, those would be nice advantages. I'll have to do a few more
pseudo-code and execution scenarios to get a better understanding of
your idea.

> 
>>
>> I recently came up with a scheme that should allow us to handle such
>> situations in a fashion similar to debuggers handling ll/sc
>> restartable sequences of instructions on e.g. powerpc. The good news
>> is that my scheme does not require anything at the kernel level.
>>
>> The idea is simple: the userspace rseq critical sections now
>> become marked by 3 inline functions (rather than 2 in Paul's proposal):
>>
>> rseq_start(void *rseq_key)
>> rseq_finish(void *rseq_key)
>> rseq_abort(void *rseq_key)
> 
> How do you use this thing?  What are its semantics?

You define one rseq_key variable (dummy 1 byte variable, can be an
empty structure) for each rseq critical section you have in your
program.

A rseq critical section will typically have one entry point (rseq_start),
and one exit point (rseq_finish). I'm saying "typically" because there
may be more than one entry point, and more than one exit point per
critical section.

Entry and exit points mark the beginning and end of each rseq critical
section. rseq_start loads the sequence counter from the TLS and copies
it onto the stack. It then gets passed to rseq_finish() to be compared
with the final seqnum TLS value just before the commit. rseq_finish is
the one responsible for storing into the post_commit_instr field of the
TLS and populating rcx with the failure insn label address. rseq_finish()
does the commit.

And there is rseq_abort(), which would need to be called if we just want
to exit from a rseq critical section without doing the commit (no matching
call to rseq_finish after a rseq_start).

Each of rseq_start, finish, and abort would need to receive a pointer
to the rseq_key as parameter.

rseq_start would return the sequence number read from the TLS.

rseq_finish would also receive as a parameter the sequence number that
was returned by rseq_start.
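
A typical use would then look something like this (all signatures here
are a sketch, nothing final):

  static char my_push_key;	/* one dummy key per critical section */

  void push(struct percpu_list *list, struct percpu_list_node *node)
  {
	uint64_t seq;
	int cpu;

  retry:
	seq = rseq_start(&my_push_key);
	cpu = rseq_current_cpu();
	node->next = list->heads[cpu];		/* preparatory work */
	if (!rseq_finish(&my_push_key, seq,
			(intptr_t *)&list->heads[cpu], (intptr_t)node))
		goto retry;	/* preempted, migrated or signalled: redo */
  }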

Does it make sense ?

Thanks,

Mathieu


> 
> --Andy
> 
>>
>> We associate each critical section with a unique "key" (dummy
>> 1 byte object in the process address space), so we can group
>> them. The new "rseq_abort" would mark exit points that would
>> exit the critical section without executing the final commit
>> instruction.
>>
>> Within each of rseq_start, rseq_finish and rseq_abort,
>> we declare a non-loadable section that gets populated
>> with the following tuples:
>>
>> (RSEQ_TYPE, insn address, rseq_key)
>>
>> Where RSEQ_TYPE is either RSEQ_START, RSEQ_FINISH, or RSEQ_ABORT.
>>
>> That special section would be found in the executable by the
>> debugger, which can then skip over entire restartable critical
>> sections when it encounters them by placing breakpoints at
>> all exit points (finish and abort) associated with the same
>> rseq_key as the entry point (start).
>>
>> This way we don't need to complexify the runtime code, neither
>> at kernel nor user-space level, and we get debuggability using
>> a trick similar to what ll/sc architectures already need to do.
>>
>> Of course, this requires extending gdb, which should not be
>> a show-stopper.
>>
>> Thoughts ?
>>
>> Thanks,
>>
>> Mathieu
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com
> 
> 
> 
> --
> Andy Lutomirski
> AMA Capital Management, LLC

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-07 16:43             ` Andy Lutomirski
  2016-04-07 20:11               ` Peter Zijlstra
@ 2016-04-08  6:41               ` Peter Zijlstra
  2016-04-08 15:58                 ` Andy Lutomirski
  2016-04-11 21:55               ` Mathieu Desnoyers
  2 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2016-04-08  6:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, Linux API,
	linux-kernel, Andrew Hunter, Linus Torvalds

On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
> enter the critical section:
> 1:
> movq %[cpu], %%r12
> movq {address of counter for our cpu}, %%r13
> movq {some fresh value}, (%%r13)
> cmpq %[cpu], %%r12
> jne 1b

This is inherently racy; you forgot the detail of 'some fresh value',
but since you want to avoid collisions you really want an increment.

But load-store archs cannot do that. Or rather, they need to do:

	load	Rn, $event
	add	Rn, Rn, 1
	store	$event, Rn

But if they're preempted in the middle, two threads will collide and
generate the _same_ increment. Comparing CPU numbers will not fix that.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-07 22:05                 ` Andy Lutomirski
  2016-04-08  1:11                   ` Mathieu Desnoyers
@ 2016-04-08 11:02                   ` Peter Zijlstra
  2016-04-08 15:57                     ` Andy Lutomirski
  1 sibling, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2016-04-08 11:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, Linux API,
	linux-kernel, Andrew Hunter, Linus Torvalds

On Thu, Apr 07, 2016 at 03:05:26PM -0700, Andy Lutomirski wrote:

> It doesn't, which is what I like about my variant.  If the thread
> accesses the protected data structure, though, it should bump the
> sequence count, which will cause the first thread to abort when it
> gets scheduled in.

Nope it won't, because that first thread is right at the commit
instruction; nothing will stop it from executing that store and clobbering
what we just wrote.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-08 11:02                   ` Peter Zijlstra
@ 2016-04-08 15:57                     ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-04-08 15:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Chris Lameter, Andi Kleen, Josh Triplett, Dave Watson, Linux API,
	linux-kernel, Andrew Hunter, Linus Torvalds

On Apr 8, 2016 4:04 AM, "Peter Zijlstra" <peterz@infradead.org> wrote:
>
> On Thu, Apr 07, 2016 at 03:05:26PM -0700, Andy Lutomirski wrote:
>
> > It doesn't, which is what I like about my variant.  If the thread
> > accesses the protected data structure, though, it should bump the
> > sequence count, which will cause the first thread to abort when it
> > gets scheduled in.
>
> Nope it won't, because that first thread is right at the commit
> instruction; nothing will stop it from executing that store and clobbering
> what we just wrote.
>

I don't think so.  I write an event number.  You commit because you
didn't notice.  I haven't loaded yet from the value you wrote when you
committed, so nothing goes wrong.

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-08  6:41               ` Peter Zijlstra
@ 2016-04-08 15:58                 ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-04-08 15:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Chris Lameter, Andi Kleen, Josh Triplett, Dave Watson, Linux API,
	linux-kernel, Andrew Hunter, Linus Torvalds

On Apr 7, 2016 11:41 PM, "Peter Zijlstra" <peterz@infradead.org> wrote:
>
> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
> > enter the critical section:
> > 1:
> > movq %[cpu], %%r12
> > movq {address of counter for our cpu}, %%r13
> > movq {some fresh value}, (%%r13)
> > cmpq %[cpu], %%r12
> > jne 1b
>
> This is inherently racy; you forgot the detail of 'some fresh value',
> but since you want to avoid collisions you really want an increment.
>
> But load-store archs cannot do that. Or rather, they need to do:
>
>         load    Rn, $event
>         add     Rn, Rn, 1
>         store   $event, Rn
>
> But if they're preempted in the middle, two threads will collide and
> generate the _same_ increment. Comparing CPU numbers will not fix that.

Even on x86 this won't work -- we have no actual guarantee we're on
the right CPU, so we'd have to use an atomic.

I was thinking we'd allocate from a per-thread pool (say 24 bits of
thread ID and the rest being a nonce).  On load-store architectures
this wouldn't be async-signal-safe, though.  Hmm.
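
For what it's worth, a minimal sketch of that per-thread pool idea (every
name below is made up; the tid could be cached in TLS rather than calling
gettid each time):

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Sketch only: 24 bits of thread ID, 40 bits of per-thread nonce.
 * The nonce bump is a plain load/add/store, so on a load-store arch a
 * signal handler running this same code can hand out a value that the
 * interrupted context will hand out again -- hence the
 * async-signal-safety caveat above.
 */
static __thread uint64_t rseq_nonce;

static uint64_t rseq_fresh_value(void)
{
	uint64_t tid = (uint64_t)syscall(SYS_gettid) & 0xffffff;

	rseq_nonce++;
	return (tid << 40) | (rseq_nonce & ((1ULL << 40) - 1));
}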

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-08  2:05                       ` Mathieu Desnoyers
@ 2016-04-08 17:46                         ` Mathieu Desnoyers
  2016-04-08 21:16                           ` Andy Lutomirski
  2016-04-08 21:25                           ` Linus Torvalds
  0 siblings, 2 replies; 38+ messages in thread
From: Mathieu Desnoyers @ 2016-04-08 17:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, linux-api,
	linux-kernel, Andrew Hunter, Linus Torvalds

----- On Apr 7, 2016, at 10:05 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Apr 7, 2016, at 9:21 PM, Andy Lutomirski luto@amacapital.net wrote:
> 
>> On Thu, Apr 7, 2016 at 6:11 PM, Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com> wrote:
>>> ----- On Apr 7, 2016, at 6:05 PM, Andy Lutomirski luto@amacapital.net wrote:
>>>
>>>> On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
>>> [...]
>>>>>
>>>>>> it's inherently debuggable,
>>>>>
>>>>> It is more debuggable, agreed.
>>>>>
>>>>>> and it allows multiple independent
>>>>>> rseq-protected things to coexist without forcing each other to abort.
>>>
>>> [...]
>>>
>>> My understanding is that the main goal of this rather more complex
>>> proposal is to make interaction with debuggers more straightforward in
>>> cases of single-stepping through the rseq critical section.
>> 
>> The things I like about my proposal are both that you can single-step
>> through it just like any other code as long as you pin the thread to a
>> CPU and that it doesn't make preemption magical.  (Of course, you can
>> *force* it to do something on resume and/or preemption by sticking a
>> bogus value in the expected event count field, but that's not the
>> intended use.  Hmm, I guess it does need to hook preemption and/or
>> resume for all processes that enable the thing so it can know to check
>> for an enabled post_commit_rip, just like all the other proposals.)
>> 
>> Also, mine lets you have a fairly long-running critical section that
>> doesn't get aborted under heavy load and can interleave with other
>> critical sections that don't conflict.
> 
> Yes, those would be nice advantages. I'll have to do a few more
> pseudo-code and execution scenarios to get a better understanding of
> your idea.
> 
>> 
>>>
>>> I recently came up with a scheme that should allow us to handle such
>>> situations in a fashion similar to debuggers handling ll/sc
>>> restartable sequences of instructions on e.g. powerpc. The good news
>>> is that my scheme does not require anything at the kernel level.
>>>
>>> The idea is simple: the userspace rseq critical sections now
>>> become marked by 3 inline functions (rather than 2 in Paul's proposal):
>>>
>>> rseq_start(void *rseq_key)
>>> rseq_finish(void *rseq_key)
>>> rseq_abort(void *rseq_key)
>> 
>> How do you use this thing?  What are its semantics?
> 
> You define one rseq_key variable (dummy 1 byte variable, can be an
> empty structure) for each rseq critical section you have in your
> program.
> 
> A rseq critical section will typically have one entry point (rseq_start),
> and one exit point (rseq_finish). I'm saying "typically" because there
> may be more than one entry point, and more than one exit point per
> critical section.
> 
> Entry and exit points mark the beginning and end of each rseq critical
> section. rseq_start loads the sequence counter from the TLS and copies
> it onto the stack. It then gets passed to rseq_finish() to be compared
> with the final seqnum TLS value just before the commit. rseq_finish is
> the one responsible for storing into the post_commit_instr field of the
> TLS and populating rcx with the failure insn label address. rseq_finish()
> does the commit.
> 
> And there is rseq_abort(), which would need to be called if we just want
> to exit from a rseq critical section without doing the commit (no matching
> call to rseq_finish after a rseq_start).
> 
> Each of rseq_start, finish, and abort would need to receive a pointer
> to the rseq_key as parameter.
> 
> rseq_start would return the sequence number read from the TLS.
> 
> rseq_finish would also receive as parameter that sequence number that has
> been returned by rseq_start.
> 
> Does it make sense ?
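
For illustration, here is how a per-cpu push could look with those three
functions (sketch only: the prototypes, the extra commit arguments to
rseq_finish, the rseq_current_cpu()/percpu_list_head() helpers and
struct node are all assumed names based on the description above, not an
existing API):

#include <stdint.h>
#include <stdbool.h>

/*
 * Sketch only: these prototypes are an assumption; the description above
 * does not specify how the commit target and value reach rseq_finish().
 */
struct rseq_key { char dummy; };	/* dummy 1-byte key, one per c.s. */

extern uint64_t rseq_start(struct rseq_key *key);
extern bool rseq_finish(struct rseq_key *key, uint64_t seq,
			void **target, void *to_write);
extern int rseq_current_cpu(void);

struct node { struct node *next; };
extern struct node **percpu_list_head(int cpu);

static struct rseq_key push_key;

static void percpu_push(struct node *n)
{
	uint64_t seq;
	struct node **head;

retry:
	seq = rseq_start(&push_key);	/* entry point: snapshot the TLS seqnum */
	head = percpu_list_head(rseq_current_cpu());

	n->next = *head;		/* preparatory work, safely restartable */

	/*
	 * rseq_finish() stores post_commit_instr, loads the failure label
	 * into rcx, re-checks seq against the TLS seqnum just before the
	 * commit, and on success performs the single commit store *head = n.
	 */
	if (!rseq_finish(&push_key, seq, (void **)head, n))
		goto retry;		/* restarted: start over from rseq_start */
}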

By the way, the debugger can always decide to single-step through the
first iteration of the rseq, and then after it loops, decide to skip
single-stepping until the exit points are reached.

Thanks,

Mathieu

> 
> Thanks,
> 
> Mathieu
> 
> 
>> 
>> --Andy
>> 
>>>
>>> We associate each critical section with a unique "key" (dummy
>>> 1 byte object in the process address space), so we can group
>>> them. The new "rseq_abort" would mark exit points that would
>>> exit the critical section without executing the final commit
>>> instruction.
>>>
>>> Within each of rseq_start, rseq_finish and rseq_abort,
>>> we declare a non-loadable section that gets populated
>>> with the following tuples:
>>>
>>> (RSEQ_TYPE, insn address, rseq_key)
>>>
>>> Where RSEQ_TYPE is either RSEQ_START, RSEQ_FINISH, or RSEQ_ABORT.
>>>
>>> That special section would be found in the executable by the
>>> debugger, which can then skip over entire restartable critical
>>> sections when it encounters them by placing breakpoints at
>>> all exit points (finish and abort) associated with the same
>>> rseq_key as the entry point (start).
>>>
>>> This way we don't need to complicate the runtime code at either
>>> the kernel or the user-space level, and we get debuggability using
>>> a trick similar to what ll/sc architectures already need to do.
>>>
>>> Of course, this requires extending gdb, which should not be
>>> a show-stopper.
>>>
>>> Thoughts ?
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>> --
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> http://www.efficios.com
>> 
>> 
>> 
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-08 17:46                         ` Mathieu Desnoyers
@ 2016-04-08 21:16                           ` Andy Lutomirski
  2016-04-08 21:25                           ` Linus Torvalds
  1 sibling, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-04-08 21:16 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Ingo Molnar, linux-kernel, linux-api,
	Paul Turner, Chris Lameter, Andi Kleen, Josh Triplett,
	Dave Watson, Andrew Hunter, Linus Torvalds, Peter Zijlstra

On Apr 8, 2016 10:46 AM, "Mathieu Desnoyers"
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Apr 7, 2016, at 10:05 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
>
> > ----- On Apr 7, 2016, at 9:21 PM, Andy Lutomirski luto@amacapital.net wrote:
> >
> >> On Thu, Apr 7, 2016 at 6:11 PM, Mathieu Desnoyers
> >> <mathieu.desnoyers@efficios.com> wrote:
> >>> ----- On Apr 7, 2016, at 6:05 PM, Andy Lutomirski luto@amacapital.net wrote:
> >>>
> >>>> On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >>>>> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
> >>> [...]
> >>>>>
> >>>>>> it's inherently debuggable,
> >>>>>
> >>>>> It is more debuggable, agreed.
> >>>>>
> >>>>>> and it allows multiple independent
> >>>>>> rseq-protected things to coexist without forcing each other to abort.
> >>>
> >>> [...]
> >>>
> >>> My understanding is that the main goal of this rather more complex
> >>> proposal is to make interaction with debuggers more straightforward in
> >>> cases of single-stepping through the rseq critical section.
> >>
> >> The things I like about my proposal are both that you can single-step
> >> through it just like any other code as long as you pin the thread to a
> >> CPU and that it doesn't make preemption magical.  (Of course, you can
> >> *force* it to do something on resume and/or preemption by sticking a
> >> bogus value in the expected event count field, but that's not the
> >> intended use.  Hmm, I guess it does need to hook preemption and/or
> >> resume for all processes that enable the thing so it can know to check
> >> for an enabled post_commit_rip, just like all the other proposals.)
> >>
> >> Also, mine lets you have a fairly long-running critical section that
> >> doesn't get aborted under heavy load and can interleave with other
> >> critical sections that don't conflict.
> >
> > Yes, those would be nice advantages. I'll have to do a few more
> > pseudo-code and execution scenarios to get a better understanding of
> > your idea.
> >
> >>
> >>>
> >>> I recently came up with a scheme that should allow us to handle such
> >>> situations in a fashion similar to debuggers handling ll/sc
> >>> restartable sequences of instructions on e.g. powerpc. The good news
> >>> is that my scheme does not require anything at the kernel level.
> >>>
> >>> The idea is simple: the userspace rseq critical sections now
> >>> become marked by 3 inline functions (rather than 2 in Paul's proposal):
> >>>
> >>> rseq_start(void *rseq_key)
> >>> rseq_finish(void *rseq_key)
> >>> rseq_abort(void *rseq_key)
> >>
> >> How do you use this thing?  What are its semantics?
> >
> > You define one rseq_key variable (dummy 1 byte variable, can be an
> > empty structure) for each rseq critical section you have in your
> > program.
> >
> > A rseq critical section will typically have one entry point (rseq_start),
> > and one exit point (rseq_finish). I'm saying "typically" because there
> > may be more than one entry point, and more than one exit point per
> > critical section.
> >
> > Entry and exit points mark the beginning and end of each rseq critical
> > section. rseq_start loads the sequence counter from the TLS and copies
> > it onto the stack. It then gets passed to rseq_finish() to be compared
> > with the final seqnum TLS value just before the commit. rseq_finish is
> > the one responsible for storing into the post_commit_instr field of the
> > TLS and populating rcx with the failure insn label address. rseq_finish()
> > does the commit.
> >
> > And there is rseq_abort(), which would need to be called if we just want
> > to exit from a rseq critical section without doing the commit (no matching
> > call to rseq_finish after a rseq_start).
> >
> > Each of rseq_start, finish, and abort would need to receive a pointer
> > to the rseq_key as parameter.
> >
> > rseq_start would return the sequence number read from the TLS.
> >
> > rseq_finish would also receive as parameter that sequence number that has
> > been returned by rseq_start.
> >
> > Does it make sense ?
>
> By the way, the debugger can always decide to single-step through the
> first iteration of the rseq, and then after it loops, decide to skip
> single-stepping until the exit points are reached.

True, but you're assuming that someone will actually write that code
and then users will know how to use it.  That's something I like about
my version.

Admittedly, LL/SC and TSX have the same problem, but those are
architectural, and it's not really a good excuse to add a new thing
that's awkward to debug.

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-08 17:46                         ` Mathieu Desnoyers
  2016-04-08 21:16                           ` Andy Lutomirski
@ 2016-04-08 21:25                           ` Linus Torvalds
  2016-04-10 14:07                             ` Mathieu Desnoyers
  1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2016-04-08 21:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andy Lutomirski, Peter Zijlstra, Paul E. McKenney, Ingo Molnar,
	Paul Turner, Andi Kleen, Chris Lameter, Dave Watson,
	Josh Triplett, linux-api, Linux Kernel Mailing List,
	Andrew Hunter

On Fri, Apr 8, 2016 at 10:46 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> By the way, the debugger can always decide to single-step through the
> first iteration of the rseq, and then after it loops, decide to skip
> single-stepping until the exit points are reached.

A _human_ debugger may decide to do that yes.

But the debugger _program_ may not be that smart. For example,
let's say that you - as a human - set a few watchpoints. The debugger
will use hardware breakpoints for the first few, but in more complex
cases the debugger will actually say "oops, no more hw breakpoints,
I'll just start single-stepping instead".

The human operator may not even be aware that the debugger has gone
into that slower mode. Normally it's just slower. But you'd want it to
be _only_ slower, not "oops, the program no longer makes any forward
progress at all, because a library that the user didn't even know or
care about - and never sees, because the single-stepping is all
internal - happened to use a code sequence that doesn't work under
single-stepping".

               Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-08 21:25                           ` Linus Torvalds
@ 2016-04-10 14:07                             ` Mathieu Desnoyers
  0 siblings, 0 replies; 38+ messages in thread
From: Mathieu Desnoyers @ 2016-04-10 14:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Peter Zijlstra, Paul E. McKenney, Ingo Molnar,
	Paul Turner, Andi Kleen, Chris Lameter, Dave Watson,
	Josh Triplett, linux-api, Linux Kernel Mailing List,
	Andrew Hunter

----- On Apr 8, 2016, at 5:25 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Fri, Apr 8, 2016 at 10:46 AM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> By the way, the debugger can always decide to single-step through the
>> first iteration of the rseq, and then after it loops, decide to skip
>> single-stepping until the exit points are reached.
> 
> A _human_ debugger may decide to do that yes.
> 
> But the debugger _program_ may not be that smart. For example,
> let's say that you - as a human - set a few watchpoints. The debugger
> will use hardware breakpoints for the first few, but in more complex
> cases the debugger will actually say "oops, no more hw breakpoints,
> I'll just start single-stepping instead".
> 
> The human operator may not even be aware that the debugger has gone
> into that slower mode. Normally it's just slower. But you'd want it to
> be _only_ slower, not "oops, the program no longer makes any forward
> progress at all, because a library that the user didn't even know or
> care about - and never sees, because the single-stepping is all
> internal - happened to use a code sequence that doesn't work under
> single-stepping".

Which is why I'm proposing to extend gdb to support this automatically,
without requiring interaction or knowledge from the user.

The idea is to let gdb detect entry points into those restartable
critical sections as it single-steps through the program. It would
know about all rseq c.s. exit points too, so it can track whether
it has single-stepped over an entire rseq c.s. and thus caused a
restart. At that point, it can put a breakpoint at each exit point
associated with the entry point, thus skipping single-stepping of the
second iteration of the critical section.

I think this could be achieved by populating a section that contains
information about entry and exit points of those critical sections
in the rseq_{start,finish,abort} functions. Those sections would end
up in the app/lib ELF binary, but would not necessarily have to be
loaded into the program's memory.
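
Concretely, something along these lines could emit those records (sketch
only: the section name, the RSEQ_* constants, the record layout and the
annotate macro are all made up here, and the "i" constraints assume
non-PIC code for simplicity):

/*
 * Sketch: each of the inlined rseq_start/rseq_finish/rseq_abort drops a
 * (RSEQ_TYPE, insn address, rseq_key) record into a non-loadable
 * section (empty flags string => not SHF_ALLOC) that gdb could parse.
 */
enum { RSEQ_START = 0, RSEQ_FINISH = 1, RSEQ_ABORT = 2 };

struct rseq_key { char dummy; };		/* one per critical section */

#define RSEQ_ANNOTATE(type, keyp)					\
	__asm__ __volatile__ (						\
		"1:\n\t"						\
		".pushsection __rseq_table, \"\", @progbits\n\t"	\
		".quad %c0, 1b, %c1\n\t"				\
		".popsection"						\
		: : "i" (type), "i" (keyp))

static struct rseq_key my_cs_key;

static inline void my_rseq_start_marker(void)
{
	/* records the address of label 1 as an entry point for my_cs_key */
	RSEQ_ANNOTATE(RSEQ_START, &my_cs_key);
}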

Does it make sense to try it out, or am I missing something obvious ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
  2016-04-07 16:43             ` Andy Lutomirski
  2016-04-07 20:11               ` Peter Zijlstra
  2016-04-08  6:41               ` Peter Zijlstra
@ 2016-04-11 21:55               ` Mathieu Desnoyers
  2 siblings, 0 replies; 38+ messages in thread
From: Mathieu Desnoyers @ 2016-04-11 21:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Paul E. McKenney, Ingo Molnar, Paul Turner,
	Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, linux-api,
	linux-kernel, Andrew Hunter, Linus Torvalds

----- On Apr 7, 2016, at 12:43 PM, Andy Lutomirski luto@amacapital.net wrote:

> On Thu, Apr 7, 2016 at 8:53 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, Apr 07, 2016 at 08:44:38AM -0700, Andy Lutomirski wrote:
>>> On Thu, Apr 7, 2016 at 8:24 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> > On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
>>> >> What I meant was: rather than shoving individual values into the TLABI
>>> >> thing, shove in a pointer:
>>> >>
>>> >> struct commit_info {
>>> >>   u64 post_commit_rip;
>>> >>   u32 cpu;
>>> >>   u64 *event;
>>> >>   // whatever else;
>>> >> };
>>> >>
>>> >> and then put a commit_info* in TLABI.
>>> >>
>>> >> This would save some bytes in the TLABI structure.
>>> >
>>> > But would cost us extra indirections. The whole point was getting this
>>> > stuff at a constant offset from the TLS segment register.
>>>
>>> I don't think the extra indirections would matter much.  The kernel
>>> would have to chase the pointer, but only in the very rare case where
>>> it resumes userspace during a commit or on the immediately following
>>> instruction.
>>
>> Its about userspace finding these values, not the kernel.
> 
> But userspace never reads post_commit_rip or the abort address AFAIK.
> 
>>
>>> At the very least, post_commit_rip and the abort address (which I
>>> forgot about) could both live in a static structure,
>>
>> Paul keeps the abort address in rcx.
> 
> So it's probably a wash.  Paul's way is to load the abort address in
> rcx (one instruction, no memory access) and to put post_commit_rip in
> TLABI (one store).  Mine is to put a pointer to a descriptor into
> TLABI (one store).  Should barely matter.
> 
>>
>>> and shoving a
>>> pointer to *that* into TLABI space is one store instead of two.
>>
>>> > Ah, so what happens if the signal happens before the commit but after
>>> > the load of the seqcount?
>>> >
>>> > Then, even if the signal modifies the count, we'll not observe it.
>>> >
>>>
>>> Where exactly?
>>>
>>> In my scheme, nothing except the kernel ever loads the seqcount.  The
>>> user code generates a fresh value, writes it to memory, and then, just
>>> before commit, writes that same value to the TLABI area and then
>>> double-checks that the value it wrote at the beginning is still there.
>>>
>>> If the signal modifies the count, then the user code won't directly
>>> notice, but prepare_exit_to_usermode on the way out of the signal will
>>> notice that the (restored) TLABI state doesn't match the counter that
>>> the signal handler changed and will jump to the abort address.
>>
>>
>> OK, you lost me..  commit looks like:
>>
>> +       __asm__ __volatile__ goto (
>> +                       "movq $%l[failed], %%rcx\n"
>> +                       "movq $1f, %[commit_instr]\n"
>> +                       "cmpq %[start_value], %[current_value]\n"
>>
>> If we get preempted/signaled here without the preemption/signal entry
>> checking for the post_commit_instr, we'll fail hard.
> 
> Not if I'm thinking about this right.  Because, in my scheme, we'd be
> storing both the address of the counter [1] and the value that we
> expect it to be (current_value?  something's confused in the asm or
> the variable naming, or I'm just not following exactly which variable
> is which) in registers or TLABI, and the kernel would notice that
> *counter != expected_counter when it resumes and would send us to the
> abort label.
> 
> Presumably there would be one register hard-coded as the expected
> counter value and another register hard-coded as the address of the
> counter.  That would avoid a bunch of stores to TLABI, and the kernel
> could still find them.  If the C optimizer did a good job, the
> resulting code would be simple.
> 
> [1] In my scheme, there would be no reason for the counter to live in
> TLABI.  In contrast, there would be an excellent reason for it to
> *not* live in TLABI -- user code could have more than one set of
> counters, and they wouldn't interfere with each other.  This would be
> nice if one of them protects something that uses long enough code in
> the critical sections that it's likely to be preempted.
> 
>>
>> +                       "jnz %l[failed]\n"
>> +                       "movq %[to_write], (%[target])\n"
>> +                       "1: movq $0, %[commit_instr]\n"
>> +         : /* no outputs */
>> +         : [start_value]"d"(start_value.storage),
>> +           [current_value]"m"(__rseq_state),
>> +           [to_write]"r"(to_write),
>> +           [target]"r"(p),
>> +           [commit_instr]"m"(__rseq_state.post_commit_instr)
>> +         : "rcx", "memory"
>> +         : failed
>> +       );
> 
> 
> More concretely, this looks like (using totally arbitrary register
> assignments -- probably far from ideal, especially given how GCC's
> constraints work):
> 

I think you have an ABA-style issue here: if you are migrated
to another CPU at (A), and back to the original CPU at (B), your
CPU check is not doing much, and you may be corrupting the counter
at (%%r13) that belongs to a concurrently running thread.

> enter the critical section:
> 1:
> movq %[cpu], %%r12
> movq {address of counter for our cpu}, %%r13

(A) -> preempted, migrated to remote CPU.

> movq {some fresh value}, (%%r13)

(B) -> migrated back to original CPU.

> cmpq %[cpu], %%r12
> jne 1b

My understanding is that you should only ever _read_ from such
a per-cpu data structure upon entry into the critical section,
not write, because you have no way to know whether you have been
migrated to another CPU at that point.

Those per-cpu data ideas are interesting though. I've come up
with a sketch of a scheme that fetches a seqcount from the percpu data,
and keeps a copy in the TLS area. Whenever the kernel preempts
a thread, it increments both the seqcount in the currently used
percpu data _and_ the seqcount in the TLS area. Then checking
if we need to restart the critical section is a matter of checking
that the two counters match, else it means the percpu data counter
has been touched by some other thread.
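
Very roughly, the user-space side of that check could look like the sketch
below (assuming the kernel-side bookkeeping described above; every name is
illustrative, and the migration races mentioned just below are not handled):

#include <stdint.h>
#include <stdbool.h>

/* Sketch only: all of these types and the kernel behaviour are assumptions. */
struct rseq_percpu {
	uint64_t seq;		/* bumped by the kernel when it preempts the
				 * thread currently using this cpu's data */
	/* ... per-cpu state protected by the critical section ... */
};

struct rseq_tls {
	uint64_t seq;		/* kernel keeps this in step with the seq of
				 * the percpu data this thread was last using */
	int32_t cpu;
};

extern __thread struct rseq_tls __rseq_tls;	/* registered with the kernel */

/*
 * If some other thread ran (and was preempted) on this cpu since we
 * fetched pcpu->seq, the percpu counter has moved ahead of our TLS
 * copy and the critical section must be restarted.
 */
static inline bool rseq_needs_restart(const struct rseq_percpu *pcpu)
{
	return pcpu->seq != __rseq_tls.seq;
}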

I'll try to come up with an implementation of this idea, because
there are a few more tricks to it to ensure all migration races
are dealt with.

One thing I doubt I'll be able to handle would be having nested
restartable critical sections (e.g. one c.s. nested in another,
but not through a nested signal handler). Not sure if we really
care about this use-case though.

Thanks,

Mathieu

> 
> ... do whatever setup or computation is needed...
> 
> movq $%l[failed], %%rcx
> movq $1f, %[commit_instr]
> cmpq {whatever counter we chose}, (%%r13)
> jne %l[failed]
> cmpq %[cpu], %%r12
> jne %l[failed]
> 
> <-- a signal in here that conflicts with us would clobber (%%r13), and
> the kernel would notice and send us to the failed label
> 
> movq %[to_write], (%[target])
> 1: movq $0, %[commit_instr]
> 
> In contrast to Paul's scheme, this has two additional (highly
> predictable) branches and requires generation of a seqcount in
> userspace.  In its favor, though, it doesn't need preemption hooks,
> it's inherently debuggable, and it allows multiple independent
> rseq-protected things to coexist without forcing each other to abort.
> 
> The first cmpq might not be necessary.  I put it there to ensure that
> "some fresh value" is visible on the chosen CPU before any other loads
> happen, but it's possible that the other cpu double-check is enough.
> 
> 
> (Off topic: some of those movq instructions should probably be
> rip-relative leaq instead.)
> 
> If my descriptor approach were used, we'd save the write to rcx.
> Instead we'd do:
> 
> .pushsection .rodata
> 2:
> .quad %l[failed] - .  /* or similar */
> .quad 1b - .
> .popsection
> leaq 2b(%rip), %[tlabi_rseq_desc]
> 
> That's the optimization I was talking about above, although I did a
> bad job of describing it.
> 
> --Andy

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2016-04-11 21:55 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-27 23:56 [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections Paul Turner
2015-10-27 23:56 ` [RFC PATCH v2 1/3] restartable sequences: user-space per-cpu " Paul Turner
2015-11-19 16:38   ` Johannes Berg
2015-12-11 12:56   ` Mathieu Desnoyers
2015-10-27 23:57 ` [RFC PATCH v2 2/3] restartable sequences: x86 ABI Paul Turner
2015-10-28  5:03   ` Peter Zijlstra
2015-10-28  5:19     ` Paul Turner
2015-12-11 13:30   ` Mathieu Desnoyers
2015-10-27 23:57 ` [RFC PATCH v2 3/3] restartable sequences: basic self-tests Paul Turner
2016-04-05 20:33   ` Mathieu Desnoyers
2016-04-06  7:43     ` Peter Zijlstra
2016-04-06 13:39       ` Mathieu Desnoyers
2016-04-06 19:25         ` Peter Zijlstra
2015-10-28 14:44 ` [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections Dave Watson
2015-12-11 12:05 ` Mathieu Desnoyers
2015-12-11 13:39   ` Mathieu Desnoyers
2016-04-06 15:56 ` Andy Lutomirski
2016-04-07 12:02   ` Peter Zijlstra
2016-04-07 14:35     ` Andy Lutomirski
2016-04-07 15:24       ` Peter Zijlstra
2016-04-07 15:39         ` Peter Zijlstra
2016-04-07 15:44         ` Andy Lutomirski
2016-04-07 15:53           ` Peter Zijlstra
2016-04-07 16:43             ` Andy Lutomirski
2016-04-07 20:11               ` Peter Zijlstra
2016-04-07 22:05                 ` Andy Lutomirski
2016-04-08  1:11                   ` Mathieu Desnoyers
2016-04-08  1:21                     ` Andy Lutomirski
2016-04-08  2:05                       ` Mathieu Desnoyers
2016-04-08 17:46                         ` Mathieu Desnoyers
2016-04-08 21:16                           ` Andy Lutomirski
2016-04-08 21:25                           ` Linus Torvalds
2016-04-10 14:07                             ` Mathieu Desnoyers
2016-04-08 11:02                   ` Peter Zijlstra
2016-04-08 15:57                     ` Andy Lutomirski
2016-04-08  6:41               ` Peter Zijlstra
2016-04-08 15:58                 ` Andy Lutomirski
2016-04-11 21:55               ` Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).