linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] thread_local_abi system call: caching current CPU number (x86)
@ 2015-07-16 20:00 Mathieu Desnoyers
  2015-07-17 10:49 ` Ben Maurer
  2015-07-17 12:48 ` Nikolay Borisov
  0 siblings, 2 replies; 6+ messages in thread
From: Mathieu Desnoyers @ 2015-07-16 20:00 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Mathieu Desnoyers, Andrew Hunter, Peter Zijlstra,
	Ingo Molnar, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Andrew Morton, linux-api

Expose a new system call allowing threads to register a userspace memory
area in which to store the current CPU number. Scheduler migration sets the
TIF_NOTIFY_RESUME flag on the current thread. Upon return to user-space,
a notify-resume handler updates the current CPU value within that
user-space memory area.

This getcpu cache is an alternative to the sched_getcpu() vDSO call, and
has a few benefits:
- It is faster to do a memory read than to call a vDSO,
- The cached value can be read from within inline assembly, which
  makes it a useful building block for restartable sequences.

This approach is inspired by Paul Turner and Andrew Hunter's work
on percpu atomics, which lets the kernel handle restart of critical
sections:
Ref.:
* https://lkml.org/lkml/2015/6/24/665
* https://lwn.net/Articles/650333/
* http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Benchmark of sched_getcpu() vs. the TLS cache approach, measuring the
time to get the current CPU number:

- With Linux vdso:            12.7 ns
- With TLS-cached cpu number:  0.3 ns

The system call can be extended by registering a larger structure in
the future.
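
As a rough illustration of that extension path, here is a user-space
model of the length negotiation, not kernel code. struct
thread_local_abi_v2, its future_field member, and the helper names are
hypothetical; only the min-of-both-lengths rule and the
offset-plus-size check are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout from the patch: the only field defined so far. */
struct thread_local_abi {
	int32_t cpu;
};

/* Hypothetical extended layout a newer user-space might register. */
struct thread_local_abi_v2 {
	int32_t cpu;
	int32_t future_field;	/* assumption: not part of the patch */
};

/* Models min_t(size_t, len, sizeof(struct thread_local_abi)). */
static size_t negotiate_len(size_t user_len, size_t kernel_len)
{
	return user_len < kernel_len ? user_len : kernel_len;
}

/* A field is usable only if it lies entirely within the agreed length. */
static int field_usable(size_t agreed_len, size_t offset, size_t size)
{
	return agreed_len >= offset + size;
}

/* Newer user-space (v2 layout) talking to a kernel that only knows v1. */
static void self_check(void)
{
	size_t agreed = negotiate_len(sizeof(struct thread_local_abi_v2),
				      sizeof(struct thread_local_abi));

	assert(field_usable(agreed,
			    offsetof(struct thread_local_abi_v2, cpu),
			    sizeof(int32_t)));
	assert(!field_usable(agreed,
			     offsetof(struct thread_local_abi_v2, future_field),
			     sizeof(int32_t)));
}
```

Under this model, an application built against a larger structure would
pass its own sizeof() as len and treat the returned length as the number
of bytes the kernel actually maintains.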

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: linux-api@vger.kernel.org
---
 arch/x86/kernel/signal.c              |  2 +
 arch/x86/syscalls/syscall_64.tbl      |  1 +
 fs/exec.c                             |  1 +
 include/linux/sched.h                 | 35 ++++++++++++++
 include/uapi/asm-generic/unistd.h     |  4 +-
 include/uapi/linux/Kbuild             |  1 +
 include/uapi/linux/thread_local_abi.h | 37 ++++++++++++++
 init/Kconfig                          |  9 ++++
 kernel/Makefile                       |  1 +
 kernel/fork.c                         |  2 +
 kernel/sched/core.c                   |  4 ++
 kernel/sched/sched.h                  |  2 +
 kernel/sys_ni.c                       |  3 ++
 kernel/thread_local_abi.c             | 90 +++++++++++++++++++++++++++++++++++
 14 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/thread_local_abi.h
 create mode 100644 kernel/thread_local_abi.c

diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index e504246..157cec0 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -750,6 +750,8 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
 		clear_thread_flag(TIF_NOTIFY_RESUME);
 		tracehook_notify_resume(regs);
+		if (getcpu_cache_active(current))
+			getcpu_cache_handle_notify_resume(current);
 	}
 	if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
 		fire_user_return_notifiers();
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..0eb2fc2 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	common	thread_local_abi	sys_thread_local_abi
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/exec.c b/fs/exec.c
index c7f9b73..e5acf80 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1555,6 +1555,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	thread_local_abi_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current);
 	free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a419b65..4a3fc52 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2,6 +2,7 @@
 #define _LINUX_SCHED_H
 
 #include <uapi/linux/sched.h>
+#include <uapi/linux/thread_local_abi.h>
 
 #include <linux/sched/prio.h>
 
@@ -1710,6 +1711,10 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long	task_state_change;
 #endif
+#ifdef CONFIG_THREAD_LOCAL_ABI
+	size_t thread_local_abi_len;
+	struct thread_local_abi __user *thread_local_abi;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
@@ -3090,4 +3095,34 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+#ifdef CONFIG_THREAD_LOCAL_ABI
+void thread_local_abi_fork(struct task_struct *t);
+void thread_local_abi_execve(struct task_struct *t);
+void getcpu_cache_handle_notify_resume(struct task_struct *t);
+static inline bool getcpu_cache_active(struct task_struct *t)
+{
+	struct thread_local_abi __user *tlap = t->thread_local_abi;
+
+	if (!tlap || t->thread_local_abi_len <
+			offsetof(struct thread_local_abi, cpu)
+			+ sizeof(tlap->cpu))
+		return false;
+	return true;
+}
+#else
+static inline void thread_local_abi_fork(struct task_struct *t)
+{
+}
+static inline void thread_local_abi_execve(struct task_struct *t)
+{
+}
+static inline void getcpu_cache_handle_notify_resume(struct task_struct *t)
+{
+}
+static inline bool getcpu_cache_active(struct task_struct *t)
+{
+	return false;
+}
+#endif
+
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..50aa984 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
 __SYSCALL(__NR_bpf, sys_bpf)
 #define __NR_execveat 281
 __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_thread_local_abi 282
+__SYSCALL(__NR_thread_local_abi, sys_thread_local_abi)
 
 #undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283
 
 /*
  * All syscalls below here should go away really,
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 68ceb97..dfd6a30 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -389,6 +389,7 @@ header-y += tcp_metrics.h
 header-y += telephony.h
 header-y += termios.h
 header-y += thermal.h
+header-y += thread_local_abi.h
 header-y += time.h
 header-y += times.h
 header-y += timex.h
diff --git a/include/uapi/linux/thread_local_abi.h b/include/uapi/linux/thread_local_abi.h
new file mode 100644
index 0000000..6487c92
--- /dev/null
+++ b/include/uapi/linux/thread_local_abi.h
@@ -0,0 +1,37 @@
+#ifndef _UAPI_LINUX_THREAD_LOCAL_ABI_H
+#define _UAPI_LINUX_THREAD_LOCAL_ABI_H
+
+/*
+ * linux/thread_local_abi.h
+ *
+ * thread_local_abi system call API
+ *
+ * Copyright (c) 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/types.h>
+
+/* This structure is an ABI that can only be extended. */
+struct thread_local_abi {
+	int32_t cpu;
+};
+
+#endif /* _UAPI_LINUX_THREAD_LOCAL_ABI_H */
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..c8ff5fa 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1559,6 +1559,15 @@ config PCI_QUIRKS
 	  bugs/quirks. Disable this only if your target machine is
 	  unaffected by PCI quirks.
 
+config THREAD_LOCAL_ABI
+	bool "Enable thread-local ABI" if EXPERT
+	default y
+	help
+	  Enable the thread-local ABI system call. It provides a user-space
+	  cache for the current CPU number value.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..cc1f3d4 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -96,6 +96,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_THREAD_LOCAL_ABI) += thread_local_abi.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..e17bcb3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1549,6 +1549,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	cgroup_post_fork(p);
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_change_end(current);
+	if (!(clone_flags & CLONE_THREAD))
+		thread_local_abi_fork(p);
 	perf_event_fork(p);
 
 	trace_task_newtask(p, clone_flags);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 62671f5..668a502 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1823,6 +1823,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_THREAD_LOCAL_ABI
+	p->thread_local_abi_len = 0;
+	p->thread_local_abi = NULL;
+#endif
 }
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dc0f435..bf3e346 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -921,6 +921,8 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 {
 	set_task_rq(p, cpu);
 #ifdef CONFIG_SMP
+	if (getcpu_cache_active(p))
+		set_tsk_thread_flag(p, TIF_NOTIFY_RESUME);
 	/*
 	 * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
 	 * successfuly executed on another CPU. We must ensure that updates of
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..cadb903 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -229,3 +229,6 @@ cond_syscall(sys_bpf);
 
 /* execveat */
 cond_syscall(sys_execveat);
+
+/* thread-local ABI */
+cond_syscall(sys_thread_local_abi);
diff --git a/kernel/thread_local_abi.c b/kernel/thread_local_abi.c
new file mode 100644
index 0000000..681f06e
--- /dev/null
+++ b/kernel/thread_local_abi.c
@@ -0,0 +1,90 @@
+/*
+ * Copyright (C) 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * thread_local_abi system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+
+static int getcpu_cache_update(struct task_struct *t)
+{
+	if (put_user(raw_smp_processor_id(), &t->thread_local_abi->cpu)) {
+		t->thread_local_abi_len = 0;
+		t->thread_local_abi = NULL;
+		return -1;
+	}
+	return 0;
+}
+
+/*
+ * This resume handler should always be executed between a migration
+ * triggered by preemption and return to user-space.
+ */
+void getcpu_cache_handle_notify_resume(struct task_struct *t)
+{
+	BUG_ON(!getcpu_cache_active(t));
+	if (unlikely(t->flags & PF_EXITING))
+		return;
+	if (getcpu_cache_update(t))
+		force_sig(SIGSEGV, t);
+}
+
+/*
+ * If parent process has a thread-local ABI, the child inherits. Only applies
+ * when forking a process, not a thread.
+ */
+void thread_local_abi_fork(struct task_struct *t)
+{
+	t->thread_local_abi_len = current->thread_local_abi_len;
+	t->thread_local_abi = current->thread_local_abi;
+}
+
+void thread_local_abi_execve(struct task_struct *t)
+{
+	t->thread_local_abi_len = 0;
+	t->thread_local_abi = NULL;
+}
+
+/*
+ * sys_thread_local_abi - setup thread-local ABI for caller thread
+ */
+SYSCALL_DEFINE3(thread_local_abi, struct thread_local_abi __user *, tlap,
+		size_t, len, int, flags)
+{
+	size_t minlen;
+
+	if (flags)
+		return -EINVAL;
+	if (current->thread_local_abi && tlap)
+		return -EBUSY;
+	/* Agree on the intersection of userspace and kernel features */
+	minlen = min_t(size_t, len, sizeof(struct thread_local_abi));
+	current->thread_local_abi_len = minlen;
+	current->thread_local_abi = tlap;
+	if (!tlap)
+		return 0;
+	/*
+	 * Migration checks ->thread_local_abi to see if notify_resume
+	 * flag should be set. Therefore, we need to ensure that
+	 * the scheduler sees ->thread_local_abi before we update its content.
+	 */
+	barrier();	/* Store thread_local_abi before update content */
+	if (getcpu_cache_active(current)) {
+		if (getcpu_cache_update(current))
+			return -EFAULT;
+	}
+	return minlen;
+}
-- 
2.1.4



* RE: [RFC PATCH] thread_local_abi system call: caching current CPU number (x86)
  2015-07-16 20:00 [RFC PATCH] thread_local_abi system call: caching current CPU number (x86) Mathieu Desnoyers
@ 2015-07-17 10:49 ` Ben Maurer
  2015-07-17 16:12   ` Mathieu Desnoyers
  2015-07-17 17:03   ` Josh Triplett
  2015-07-17 12:48 ` Nikolay Borisov
  1 sibling, 2 replies; 6+ messages in thread
From: Ben Maurer @ 2015-07-17 10:49 UTC (permalink / raw)
  To: Mathieu Desnoyers, Paul Turner
  Cc: linux-kernel, Andrew Hunter, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, linux-api

Mathieu Desnoyers wrote:
> Expose a new system call allowing threads to register a userspace memory
> area where to store the current CPU number. Scheduler migration sets the

I really like that this approach makes it easier to add a per-thread interaction between userspace and the kernel in the future.

>+       if (!tlap || t->thread_local_abi_len <
>+                       offsetof(struct thread_local_abi, cpu)
>+                       + sizeof(tlap->cpu))

Could you save a branch here by enforcing that thread_local_abi_len = 0 if thread_local_abi = null?
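
For what it's worth, a user-space model of that idea might look like the
sketch below. struct task_model and the model_* helpers are made-up
stand-ins for the task_struct fields and kernel functions, so treat this
as an illustration of the invariant rather than a proposed patch. Note
that the syscall as posted stores min(len, sizeof) into
thread_local_abi_len even when tlap is NULL, so the setter would have to
zero the length in that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct thread_local_abi {
	int32_t cpu;
};

/* User-space stand-in for the two task_struct fields added by the patch. */
struct task_model {
	size_t thread_local_abi_len;
	struct thread_local_abi *thread_local_abi;
};

/* Single setter that keeps the invariant: NULL pointer implies len == 0. */
static void model_set_abi(struct task_model *t,
			  struct thread_local_abi *tlap, size_t len)
{
	size_t minlen = len < sizeof(struct thread_local_abi) ?
			len : sizeof(struct thread_local_abi);

	t->thread_local_abi = tlap;
	t->thread_local_abi_len = tlap ? minlen : 0;
}

/* With the invariant in place, the length comparison alone decides. */
static bool model_getcpu_cache_active(const struct task_model *t)
{
	return t->thread_local_abi_len >=
		offsetof(struct thread_local_abi, cpu) + sizeof(int32_t);
}
```

The NULL test then moves out of the check that runs on every migration
and into the registration path, which runs rarely.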

-b


* Re: [RFC PATCH] thread_local_abi system call: caching current CPU number (x86)
  2015-07-16 20:00 [RFC PATCH] thread_local_abi system call: caching current CPU number (x86) Mathieu Desnoyers
  2015-07-17 10:49 ` Ben Maurer
@ 2015-07-17 12:48 ` Nikolay Borisov
  2015-07-17 16:23   ` Mathieu Desnoyers
  1 sibling, 1 reply; 6+ messages in thread
From: Nikolay Borisov @ 2015-07-17 12:48 UTC (permalink / raw)
  To: Mathieu Desnoyers, Paul Turner
  Cc: linux-kernel, Andrew Hunter, Peter Zijlstra, Ingo Molnar,
	Ben Maurer, Steven Rostedt, Paul E. McKenney, Josh Triplett,
	Linus Torvalds, Andrew Morton, linux-api



On 07/16/2015 11:00 PM, Mathieu Desnoyers wrote:
> Expose a new system call allowing threads to register a userspace memory
> area where to store the current CPU number. Scheduler migration sets the
> TIF_NOTIFY_RESUME flag on the current thread. Upon return to user-space,
> a notify-resume handler updates the current CPU value within that
> user-space memory area.
>
> [...]
>
> +/*
> + * sys_thread_local_abi - setup thread-local ABI for caller thread
> + */
> +SYSCALL_DEFINE3(thread_local_abi, struct thread_local_abi __user *, tlap,
> +		size_t, len, int, flags)
> +{
> +	size_t minlen;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (current->thread_local_abi && tlap)
> +		return -EBUSY;
> +	/* Agree on the intersection of userspace and kernel features */
> +	minlen = min_t(size_t, len, sizeof(struct thread_local_abi));
> +	current->thread_local_abi_len = minlen;
> +	current->thread_local_abi = tlap;
> +	if (!tlap)
> +		return 0;
> +	/*
> +	 * Migration checks ->thread_local_abi to see if notify_resume
> +	 * flag should be set. Therefore, we need to ensure that
> +	 * the scheduler sees ->thread_local_abi before we update its content.
> +	 */
> +	barrier();	/* Store thread_local_abi before update content */
> +	if (getcpu_cache_active(current)) {

Just checking that I understand the code correctly: is this 'if'
necessary in case we have been moved to a different CPU after the
store to thread_local_abi?

> +		if (getcpu_cache_update(current))
> +			return -EFAULT;
> +	}
> +	return minlen;
> +}
> 


* Re: [RFC PATCH] thread_local_abi system call: caching current CPU number (x86)
  2015-07-17 10:49 ` Ben Maurer
@ 2015-07-17 16:12   ` Mathieu Desnoyers
  2015-07-17 17:03   ` Josh Triplett
  1 sibling, 0 replies; 6+ messages in thread
From: Mathieu Desnoyers @ 2015-07-17 16:12 UTC (permalink / raw)
  To: Ben Maurer
  Cc: Paul Turner, linux-kernel, Andrew Hunter, Peter Zijlstra,
	Ingo Molnar, rostedt, Paul E. McKenney, Josh Triplett,
	Linus Torvalds, Andrew Morton, linux-api

----- On Jul 17, 2015, at 6:49 AM, Ben Maurer bmaurer@fb.com wrote:

> Mathieu Desnoyers wrote:
>> Expose a new system call allowing threads to register a userspace memory
>> area where to store the current CPU number. Scheduler migration sets the
> 
> I really like that this approach makes it easier to add a per-thread interaction
> between userspace and the kernel in the future.
> 
>>+       if (!tlap || t->thread_local_abi_len <
>>+                       offsetof(struct thread_local_abi, cpu)
>>+                       + sizeof(tlap->cpu))
> 
> Could you save a branch here by enforcing that thread_local_abi_len = 0 if
> thread_local_abi = null?

Yes, good idea! Will do.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [RFC PATCH] thread_local_abi system call: caching current CPU number (x86)
  2015-07-17 12:48 ` Nikolay Borisov
@ 2015-07-17 16:23   ` Mathieu Desnoyers
  0 siblings, 0 replies; 6+ messages in thread
From: Mathieu Desnoyers @ 2015-07-17 16:23 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Paul Turner, linux-kernel, Andrew Hunter, Peter Zijlstra,
	Ingo Molnar, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Andrew Morton, linux-api

----- On Jul 17, 2015, at 8:48 AM, Nikolay Borisov n.borisov@siteground.com wrote:

> On 07/16/2015 11:00 PM, Mathieu Desnoyers wrote:
>> Expose a new system call allowing threads to register a userspace memory
>> area where to store the current CPU number. Scheduler migration sets the
>> TIF_NOTIFY_RESUME flag on the current thread. Upon return to user-space,
>> a notify-resume handler updates the current CPU value within that
>> user-space memory area.
>> 
>> This getcpu cache is an alternative to the sched_getcpu() vdso which has
>> a few benefits:
>> - It is faster to do a memory read than to call a vDSO,
>> - This cache value can be read from within an inline assembly, which
>>   makes it a useful building block for restartable sequences.
>> 
>> This approach is inspired by Paul Turner and Andrew Hunter's work
>> on percpu atomics, which lets the kernel handle restart of critical
>> sections:
>> Ref.:
>> * https://lkml.org/lkml/2015/6/24/665
>> * https://lwn.net/Articles/650333/
>> * http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
>> 
>> Benchmarking sched_getcpu() vs tls cache approach. Getting the
>> current CPU number:
>> 
>> - With Linux vdso:            12.7 ns
>> - With TLS-cached cpu number:  0.3 ns
>> 
>> The system call can be extended by registering a larger structure in
>> the future.
>> 
[...]
>> +/*
>> + * sys_thread_local_abi - setup thread-local ABI for caller thread
>> + */
>> +SYSCALL_DEFINE3(thread_local_abi, struct thread_local_abi __user *, tlap,
>> +		size_t, len, int, flags)
>> +{
>> +	size_t minlen;
>> +
>> +	if (flags)
>> +		return -EINVAL;
>> +	if (current->thread_local_abi && tlap)
>> +		return -EBUSY;
>> +	/* Agree on the intersection of userspace and kernel features */
>> +	minlen = min_t(size_t, len, sizeof(struct thread_local_abi));
>> +	current->thread_local_abi_len = minlen;
>> +	current->thread_local_abi = tlap;
>> +	if (!tlap)
>> +		return 0;
>> +	/*
>> +	 * Migration checks ->thread_local_abi to see if notify_resume
>> +	 * flag should be set. Therefore, we need to ensure that
>> +	 * the scheduler sees ->thread_local_abi before we update its content.
>> +	 */
>> +	barrier();	/* Store thread_local_abi before update content */
>> +	if (getcpu_cache_active(current)) {
> 
> Just checking whether my understanding of the code is correct, but this
> 'if' is necessary in case we have been moved to a different CPU after
> the store of the thread_local_abi?

No, that is not the reason. Currently, only the getcpu_cache feature is
implemented, but if struct thread_local_abi eventually grows more fields,
userspace could call the kernel with a "len" argument that does not cover
some of the features. Therefore, the generic way to check whether
getcpu_cache is enabled for the current thread is to call
getcpu_cache_active(). If it is enabled, then we need to update the
getcpu_cache content for the current thread.
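
As a rough sketch of that length-based feature check, assuming a
hypothetical grown layout: only the cpu field comes from the patch, while
struct thread_local_abi_v2, future_field, and the helper names below are
invented for illustration. A feature counts as enabled only when the
agreed length fully covers its field.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical future layout; future_field is purely illustrative. */
struct thread_local_abi_v2 {
	int32_t cpu;
	int32_t future_field;
};

/* True when a registration of reg_len bytes fully covers the field. */
#define TLABI_FIELD_COVERED(reg_len, field)				\
	((reg_len) >= offsetof(struct thread_local_abi_v2, field) +	\
		      sizeof(((struct thread_local_abi_v2 *)0)->field))

/* Agreed length: intersection of userspace and kernel sizes. */
static size_t tlabi_agreed_len(size_t user_len)
{
	size_t klen = sizeof(struct thread_local_abi_v2);

	return user_len < klen ? user_len : klen;
}

static bool getcpu_cache_covered(size_t user_len)
{
	return TLABI_FIELD_COVERED(tlabi_agreed_len(user_len), cpu);
}

static bool future_field_covered(size_t user_len)
{
	return TLABI_FIELD_COVERED(tlabi_agreed_len(user_len), future_field);
}
```

In this model, an older binary that registers only sizeof(int32_t) bytes
gets the cpu cache but not future_field, which is why each feature needs
its own "active" check rather than a plain NULL test on the pointer.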

The barrier() above is required because we want to store thread_local_abi
(and thread_local_abi_len) before we read the current CPU number and store
it into the getcpu_cache. With CONFIG_PREEMPT=y, the scheduler could
migrate us at any point between the moment we read the current CPU number
within getcpu_cache_update() and the return to userspace. Having
thread_local_abi and thread_local_abi_len set before fetching the current
CPU number ensures that the scheduler's own getcpu_cache_active() check
succeeds, so it raises the resume notifier flag upon migration, which then
fixes the CPU number before returning to userspace.
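
To make the ordering argument concrete, here is a small single-threaded
model. It is only a sketch: struct task_sim and the sim_*() helpers are
hypothetical, the migration is injected by hand right after the CPU number
is sampled, and the real barrier() is a compiler barrier that this model
does not need. It compares publishing the pointer before sampling (the
order used in the patch) with publishing it afterwards.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct tl_area { int32_t cpu; };	/* userspace-visible cache (model) */

struct task_sim {
	struct tl_area *area;	/* models ->thread_local_abi */
	int cur_cpu;		/* CPU the task currently runs on */
	bool notify_resume;	/* models TIF_NOTIFY_RESUME */
};

/* Scheduler side: only flag tasks that have a registered area. */
static void sim_migrate(struct task_sim *t, int new_cpu)
{
	t->cur_cpu = new_cpu;
	if (t->area)
		t->notify_resume = true;
}

/* Return-to-user side: refresh the cache if the flag was raised. */
static void sim_return_to_user(struct task_sim *t)
{
	if (t->notify_resume && t->area) {
		t->area->cpu = t->cur_cpu;
		t->notify_resume = false;
	}
}

/*
 * Registration with a migration injected right after the CPU number
 * has been sampled. publish_first selects the order used by the patch.
 */
static int32_t sim_register(struct task_sim *t, struct tl_area *area,
			    bool publish_first, int migrate_to)
{
	int sampled;

	if (publish_first)
		t->area = area;		/* visible to the scheduler first */
	sampled = t->cur_cpu;		/* read the current CPU number */
	sim_migrate(t, migrate_to);	/* preemption and migration here */
	if (!publish_first)
		t->area = area;		/* too late: migration was missed */
	area->cpu = sampled;		/* store the (now stale) value */
	sim_return_to_user(t);
	return area->cpu;
}
```

With publish_first set, the injected migration raises the flag and the
stale value is corrected before the simulated return to userspace. With
the opposite order, the scheduler model never sees the registration and
the stale CPU number reaches userspace.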

Thanks,

Mathieu

> 
>> +		if (getcpu_cache_update(current))
>> +			return -EFAULT;
>> +	}
>> +	return minlen;
>> +}

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [RFC PATCH] thread_local_abi system call: caching current CPU number (x86)
  2015-07-17 10:49 ` Ben Maurer
  2015-07-17 16:12   ` Mathieu Desnoyers
@ 2015-07-17 17:03   ` Josh Triplett
  1 sibling, 0 replies; 6+ messages in thread
From: Josh Triplett @ 2015-07-17 17:03 UTC (permalink / raw)
  To: Ben Maurer
  Cc: Mathieu Desnoyers, Paul Turner, linux-kernel, Andrew Hunter,
	Peter Zijlstra, Ingo Molnar, Steven Rostedt, Paul E. McKenney,
	Linus Torvalds, Andrew Morton, linux-api

On Fri, Jul 17, 2015 at 10:49:19AM +0000, Ben Maurer wrote:
> Mathieu Desnoyers wrote:
> > Expose a new system call allowing threads to register a userspace memory
> > area where to store the current CPU number. Scheduler migration sets the
> 
> I really like that this approach makes it easier to add a per-thread interaction between userspace and the kernel in the future.
> 
> >+       if (!tlap || t->thread_local_abi_len <
> >+                       offsetof(struct thread_local_abi, cpu)
> >+                       + sizeof(tlap->cpu))
> 
> Could you save a branch here by enforcing that thread_local_abi_len = 0 if thread_local_abi = null?

"saving a branch" doesn't seem like a good reason to do that; however,
it *is* the convention across other calls: if you pass 0, the pointer
is ignored, but if you pass non-zero, the pointer must be valid or you
get -EFAULT (or an actual segfault).

- Josh Triplett

Thread overview: 6+ messages
2015-07-16 20:00 [RFC PATCH] thread_local_abi system call: caching current CPU number (x86) Mathieu Desnoyers
2015-07-17 10:49 ` Ben Maurer
2015-07-17 16:12   ` Mathieu Desnoyers
2015-07-17 17:03   ` Josh Triplett
2015-07-17 12:48 ` Nikolay Borisov
2015-07-17 16:23   ` Mathieu Desnoyers
