* [RFC PATCH 0/3] Implement getcpu_cache system call
@ 2016-01-05  7:01 ` Mathieu Desnoyers
  0 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-05  7:01 UTC (permalink / raw)
  To: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra
  Cc: linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Hi,

Here is a patchset implementing a cache for the CPU number of the
currently running thread in user-space.

Benchmarks comparing this approach to a system call-based getcpu on ARM
show a 44x speedup. They show a 14x speedup on x86-64 compared to
executing lsl from a vDSO through glibc.

I added a man page in the changelog of patch 1/3, which shows example
usage of this new system call.

This patchset is sent as RFC. It applies on Linux 4.3.

Feedback is welcome,

Thanks!

Mathieu

Mathieu Desnoyers (3):
  getcpu_cache system call: cache CPU number of running thread
  getcpu_cache: wire up ARM system call
  getcpu_cache: wire up x86 32/64 system call

 arch/arm/include/uapi/asm/unistd.h     |   1 +
 arch/arm/kernel/calls.S                |   1 +
 arch/arm/kernel/signal.c               |   1 +
 arch/x86/entry/common.c                |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/exec.c                              |   1 +
 include/linux/init_task.h              |   8 ++
 include/linux/sched.h                  |  43 +++++++++
 include/uapi/linux/Kbuild              |   1 +
 include/uapi/linux/getcpu_cache.h      |  44 +++++++++
 init/Kconfig                           |  10 ++
 kernel/Makefile                        |   1 +
 kernel/fork.c                          |   7 ++
 kernel/getcpu_cache.c                  | 170 +++++++++++++++++++++++++++++++++
 kernel/sched/core.c                    |   3 +
 kernel/sched/sched.h                   |   1 +
 kernel/sys_ni.c                        |   3 +
 18 files changed, 298 insertions(+)
 create mode 100644 include/uapi/linux/getcpu_cache.h
 create mode 100644 kernel/getcpu_cache.c

-- 
2.1.4


^ permalink raw reply	[flat|nested] 42+ messages in thread


* [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-05  7:01   ` Mathieu Desnoyers
  0 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-05  7:01 UTC (permalink / raw)
  To: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra
  Cc: linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Expose a new system call allowing threads to register userspace memory
areas in which to store the CPU number on which the calling thread is
running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within each registered user-space memory
area. User-space can then read the current CPU number directly from
memory.

This getcpu cache is an improvement over the current mechanisms
available for reading the current CPU number. It has the following
benefits:

- 44x speedup on ARM vs a system call through glibc,
- 14x speedup on x86 compared to calling glibc, which calls the vdso
  executing a "lsl" instruction,
- 11x speedup on x86 compared to an inlined "lsl" instruction,
- Unlike vdso approaches, this cached value can be read from inline
  assembly, which makes it a useful building block for restartable
  sequences (see the usage sketch following this list).
- The getcpu cache approach is portable (e.g. ARM), which is not the
  case for the lsl-based x86 vdso.
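
As an illustration of the building-block aspect, here is a minimal,
hypothetical sketch of user-space code using the registered cpu_cache
to index a per-CPU counter array. It assumes headers from this patchset
(__NR_getcpu_cache, GETCPU_CACHE_CMD_REGISTER); the helper names, the
NR_CPUS_HINT bound and the counter layout are assumptions for
illustration only, not part of this patchset:

#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/getcpu_cache.h>

#define NR_CPUS_HINT	64	/* assumed upper bound on CPU numbers */

static __thread volatile int32_t cpu_cache_tls;

/* One counter per possible CPU (hypothetical data layout). */
static unsigned long per_cpu_count[NR_CPUS_HINT];

static void register_cpu_cache(void)
{
	/* Register the thread-local cache, once per thread. */
	(void) syscall(__NR_getcpu_cache, GETCPU_CACHE_CMD_REGISTER,
			&cpu_cache_tls, 0);
}

static void count_event(void)
{
	/*
	 * A volatile read of the cache prevents the compiler from
	 * tearing or re-reading the load. The value may already be
	 * stale when the increment executes, hence the atomic
	 * increment; a true restartable sequence would rely on
	 * kernel-assisted restart instead.
	 */
	int32_t cpu = cpu_cache_tls;

	__atomic_fetch_add(&per_cpu_count[cpu], 1, __ATOMIC_RELAXED);
}

int main(void)
{
	register_cpu_cache();
	count_event();
	return 0;
}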

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the getcpu cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.

This approach is inspired by Paul Turner and Andrew Hunter's work
on percpu atomics, which lets the kernel handle restart of critical
sections:
Ref.:
* https://lkml.org/lkml/2015/10/27/1095
* https://lkml.org/lkml/2015/6/24/665
* https://lwn.net/Articles/650333/
* http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 10 (v7l)
Machine model: Wandboard i.MX6 Quad Board
- Baseline (empty loop):               10.1 ns
- Read CPU from getcpu cache:          10.1 ns
- glibc 2.19-0ubuntu6.6 getcpu:       445.6 ns
- getcpu system call:                 322.2 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop):                1.0 ns
- Read CPU from getcpu cache:           1.0 ns
- Read using gs segment selector:       1.0 ns
- "lsl" inline assembly:               11.2 ns
- glibc 2.19-0ubuntu6.6 getcpu:        14.3 ns
- getcpu system call:                  51.0 ns
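
For reference, a rough sketch of the kind of timing loop behind numbers
like these follows; the iteration count, clock source and build flags
are assumptions, and the actual benchmark code is not part of this
posting:

/* Build: cc -O2 bench.c  (older glibc may need -lrt for clock_gettime) */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define NR_ITER	10000000UL

/* Thread-local cache; registration through getcpu_cache() omitted here. */
static __thread volatile int32_t cpu_cache_tls;

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
	uint64_t start, end;
	unsigned long i;
	int32_t sink = 0;

	start = now_ns();
	for (i = 0; i < NR_ITER; i++)
		sink += cpu_cache_tls;	/* swap in the approach under test */
	end = now_ns();

	printf("%.1f ns per read (sink=%d)\n",
		(double)(end - start) / NR_ITER, sink);
	return 0;
}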

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: linux-api@vger.kernel.org
---
Man page associated:

GETCPU_CACHE(2)       Linux Programmer's Manual      GETCPU_CACHE(2)

NAME
       getcpu_cache  -  cache CPU number on which the calling thread
       is running

SYNOPSIS
       #include <linux/getcpu_cache.h>

       int getcpu_cache(int cmd, int32_t * cpu_cache, int flags);

DESCRIPTION
       getcpu_cache() helps speed up reading the current CPU number
       by ensuring that the memory locations registered  by  user-
       space threads always contain, when they are read,  the  CPU
       number on which the thread is running.

       The cpu_cache argument is a pointer to an int32_t.

       The cmd argument is one of the following:

       GETCPU_CACHE_CMD_REGISTER
              Register the cpu_cache given as parameter for the cur‐
              rent thread.

       GETCPU_CACHE_CMD_UNREGISTER
              Unregister the cpu_cache given as parameter  from  the
              current thread.

       The  flags argument is currently unused and must be specified
       as 0.

       Typically, a library or application will put the cpu_cache in
       a thread-local storage variable, or in some other memory area
       belonging to each thread. It is recommended  to  perform  a
       volatile read of the cpu_cache to prevent the compiler  from
       doing load tearing. An alternative approach is to  read  the
       cpu_cache from inline assembly in a single instruction.

       Each thread is responsible for registering its own cpu_cache.
       It is possible to register many cpu_cache locations  for  a
       given thread, for instance from different libraries.

       Unregistration of the associated cpu_cache locations is per-
       formed implicitly when a thread or process exits.

RETURN VALUE
       A return value of 0  indicates  success.   On  error,  -1  is
       returned, and errno is set appropriately.

ERRORS
       EINVAL cmd  is unsupported, cpu_cache is invalid, or flags is
              non-zero.

       ENOSYS The getcpu_cache() system call is not  implemented  by
              this kernel.

       EBUSY  cmd  is  GETCPU_CACHE_CMD_REGISTER  and  cpu_cache  is
              already registered for this thread.

       EFAULT cmd is GETCPU_CACHE_CMD_REGISTER and the memory  loca‐
              tion specified by cpu_cache is a bad address.

       ENOENT cmd  is GETCPU_CACHE_CMD_UNREGISTER and cpu_cache can‐
              not be found for this thread.

       ENOMEM cmd is GETCPU_CACHE_CMD_REGISTER and  the  kernel  has
              run out of memory.

VERSIONS
       The getcpu_cache() system call was added in Linux 4.X (TODO).

CONFORMING TO
       getcpu_cache() is Linux-specific.

EXAMPLE
       The following code uses the getcpu_cache() system  call  to
       keep a thread-local storage variable up  to  date  with  the
       current CPU number. For simplicity, it is done  in  main(),
       but  multithreaded  programs  would  need   to   invoke
       getcpu_cache() from each program thread.

           #define _GNU_SOURCE
           #include <stdlib.h>
           #include <stdio.h>
           #include <unistd.h>
           #include <stdint.h>
           #include <sys/syscall.h>
           #include <linux/getcpu_cache.h>

           static inline int
           getcpu_cache(int cmd, volatile int32_t *cpu_cache, int flags)
           {
               return syscall(__NR_getcpu_cache, cmd, cpu_cache, flags);
           }

           static __thread volatile int32_t getcpu_cache_tls;

           int
           main(int argc, char **argv)
           {
               if (getcpu_cache(GETCPU_CACHE_CMD_REGISTER,
                     &getcpu_cache_tls, 0) < 0) {
                   perror("getcpu_cache register");
                   exit(EXIT_FAILURE);
               }

               printf("Current CPU number: %d\n", getcpu_cache_tls);

               if (getcpu_cache(GETCPU_CACHE_CMD_UNREGISTER,
                     &getcpu_cache_tls, 0) < 0) {
                   perror("getcpu_cache unregister");
                   exit(EXIT_FAILURE);
               }
               exit(EXIT_SUCCESS);
           }

Linux                        2016-01-01              GETCPU_CACHE(2)

Rationale for the getcpu_cache system call rather than the thread-local
ABI system call proposed earlier:

Rather than providing a "generic" thread-local ABI, this system call is
specialized for a CPU number cache only. The thread-local ABI approach
would have required introducing "feature" flags, which would have ended
up reimplementing multiplexing of features on top of a system call. It
seems better to introduce one system call per feature instead.
---
 fs/exec.c                         |   1 +
 include/linux/init_task.h         |   8 ++
 include/linux/sched.h             |  43 ++++++++++
 include/uapi/linux/Kbuild         |   1 +
 include/uapi/linux/getcpu_cache.h |  44 ++++++++++
 init/Kconfig                      |  10 +++
 kernel/Makefile                   |   1 +
 kernel/fork.c                     |   7 ++
 kernel/getcpu_cache.c             | 170 ++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c               |   3 +
 kernel/sched/sched.h              |   1 +
 kernel/sys_ni.c                   |   3 +
 12 files changed, 292 insertions(+)
 create mode 100644 include/uapi/linux/getcpu_cache.h
 create mode 100644 kernel/getcpu_cache.c

diff --git a/fs/exec.c b/fs/exec.c
index b06623a..1d66af6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1594,6 +1594,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	getcpu_cache_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current);
 	free_bprm(bprm);
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 1c1ff7e..5097798 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -183,6 +183,13 @@ extern struct task_group root_task_group;
 # define INIT_KASAN(tsk)
 #endif
 
+#ifdef CONFIG_GETCPU_CACHE
+# define INIT_GETCPU_CACHE(tsk)						\
+	.getcpu_cache_head = LIST_HEAD_INIT(tsk.getcpu_cache_head),
+#else
+# define INIT_GETCPU_CACHE(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -260,6 +267,7 @@ extern struct task_group root_task_group;
 	INIT_VTIME(tsk)							\
 	INIT_NUMA_BALANCING(tsk)					\
 	INIT_KASAN(tsk)							\
+	INIT_GETCPU_CACHE(tsk)						\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index edad7a4..044fa79 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1375,6 +1375,11 @@ struct tlbflush_unmap_batch {
 	bool writable;
 };
 
+struct getcpu_cache_entry {
+	int32_t __user *cpu_cache;
+	struct list_head entry;
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1812,6 +1817,10 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_GETCPU_CACHE
+	/* list of struct getcpu_cache_entry */
+	struct list_head getcpu_cache_head;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -3188,4 +3197,38 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+#ifdef CONFIG_GETCPU_CACHE
+int getcpu_cache_fork(struct task_struct *t);
+void getcpu_cache_execve(struct task_struct *t);
+void getcpu_cache_exit(struct task_struct *t);
+void __getcpu_cache_handle_notify_resume(struct task_struct *t);
+static inline void getcpu_cache_set_notify_resume(struct task_struct *t)
+{
+	if (!list_empty(&t->getcpu_cache_head))
+		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+static inline void getcpu_cache_handle_notify_resume(struct task_struct *t)
+{
+	if (!list_empty(&t->getcpu_cache_head))
+		__getcpu_cache_handle_notify_resume(t);
+}
+#else
+static inline int getcpu_cache_fork(struct task_struct *t)
+{
+	return 0;
+}
+static inline void getcpu_cache_execve(struct task_struct *t)
+{
+}
+static inline void getcpu_cache_exit(struct task_struct *t)
+{
+}
+static inline void getcpu_cache_set_notify_resume(struct task_struct *t)
+{
+}
+static inline void getcpu_cache_handle_notify_resume(struct task_struct *t)
+{
+}
+#endif
+
 #endif
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 628e6e6..6be3724 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -136,6 +136,7 @@ header-y += futex.h
 header-y += gameport.h
 header-y += genetlink.h
 header-y += gen_stats.h
+header-y += getcpu_cache.h
 header-y += gfs2_ondisk.h
 header-y += gigaset_dev.h
 header-y += gsmmux.h
diff --git a/include/uapi/linux/getcpu_cache.h b/include/uapi/linux/getcpu_cache.h
new file mode 100644
index 0000000..4cd1bd4
--- /dev/null
+++ b/include/uapi/linux/getcpu_cache.h
@@ -0,0 +1,44 @@
+#ifndef _UAPI_LINUX_GETCPU_CACHE_H
+#define _UAPI_LINUX_GETCPU_CACHE_H
+
+/*
+ * linux/getcpu_cache.h
+ *
+ * getcpu_cache system call API
+ *
+ * Copyright (c) 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/**
+ * enum getcpu_cache_cmd - getcpu_cache system call command
+ * @GETCPU_CACHE_CMD_REGISTER:   Register the cpu_cache for the current
+ *				 thread.
+ * @GETCPU_CACHE_CMD_UNREGISTER: Unregister the cpu_cache from the current
+ *				 thread.
+ *
+ * Command to be passed to the getcpu_cache system call.
+ */
+enum getcpu_cache_cmd {
+	GETCPU_CACHE_CMD_REGISTER = (1 << 0),
+	GETCPU_CACHE_CMD_UNREGISTER = (1 << 1),
+};
+
+#endif /* _UAPI_LINUX_GETCPU_CACHE_H */
diff --git a/init/Kconfig b/init/Kconfig
index c24b6f7..61287ff 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1614,6 +1614,16 @@ config MEMBARRIER
 
 	  If unsure, say Y.
 
+config GETCPU_CACHE
+	bool "Enable getcpu cache" if EXPERT
+	default y
+	help
+	  Enable the getcpu cache system call. It provides a user-space
+	  cache for the current CPU number value, which speeds up
+	  getting the current CPU number from user-space.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf00..b630247 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_GETCPU_CACHE) += getcpu_cache.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index f97f2c4..2d8aba6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -252,6 +252,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(tsk == current);
 
 	cgroup_free(tsk);
+	getcpu_cache_exit(tsk);
 	task_numa_free(tsk);
 	security_task_free(tsk);
 	exit_creds(tsk);
@@ -1554,6 +1555,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 */
 	copy_seccomp(p);
 
+	if (!(clone_flags & CLONE_THREAD)) {
+		retval = -ENOMEM;
+		if (getcpu_cache_fork(p))
+			goto bad_fork_cancel_cgroup;
+	}
+
 	/*
 	 * Process group and session signals need to be delivered to just the
 	 * parent before the fork or both the parent and the child after the
diff --git a/kernel/getcpu_cache.c b/kernel/getcpu_cache.c
new file mode 100644
index 0000000..d15d5a8
--- /dev/null
+++ b/kernel/getcpu_cache.c
@@ -0,0 +1,170 @@
+/*
+ * Copyright (C) 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * getcpu cache system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/getcpu_cache.h>
+
+static struct getcpu_cache_entry *
+	add_thread_entry(struct task_struct *t,
+		int32_t __user *cpu_cache)
+{
+	struct getcpu_cache_entry *te;
+
+	te = kmalloc(sizeof(*te), GFP_KERNEL);
+	if (!te)
+		return NULL;
+	te->cpu_cache = cpu_cache;
+	list_add(&te->entry, &t->getcpu_cache_head);
+	return te;
+}
+
+static void remove_thread_entry(struct getcpu_cache_entry *te)
+{
+	list_del(&te->entry);
+	kfree(te);
+}
+
+static void remove_all_thread_entry(struct task_struct *t)
+{
+	struct getcpu_cache_entry *te, *te_tmp;
+
+	list_for_each_entry_safe(te, te_tmp, &t->getcpu_cache_head, entry)
+		remove_thread_entry(te);
+}
+
+static struct getcpu_cache_entry *
+	find_thread_entry(struct task_struct *t,
+		int32_t __user *cpu_cache)
+{
+	struct getcpu_cache_entry *te;
+
+	list_for_each_entry(te, &t->getcpu_cache_head, entry) {
+		if (te->cpu_cache == cpu_cache)
+			return te;
+	}
+	return NULL;
+}
+
+static int getcpu_cache_update_entry(struct getcpu_cache_entry *te)
+{
+	if (put_user(raw_smp_processor_id(), te->cpu_cache)) {
+		/*
+		 * Force unregistration of each entry causing
+		 * put_user() errors.
+		 */
+		remove_thread_entry(te);
+		return -1;
+	}
+	return 0;
+
+}
+
+static int getcpu_cache_update(struct task_struct *t)
+{
+	struct getcpu_cache_entry *te, *te_tmp;
+	int err = 0;
+
+	list_for_each_entry_safe(te, te_tmp, &t->getcpu_cache_head, entry) {
+		if (getcpu_cache_update_entry(te))
+			err = -1;
+	}
+	return err;
+}
+
+/*
+ * This resume handler should always be executed between a migration
+ * triggered by preemption and return to user-space.
+ */
+void __getcpu_cache_handle_notify_resume(struct task_struct *t)
+{
+	if (unlikely(t->flags & PF_EXITING))
+		return;
+	if (getcpu_cache_update(t))
+		force_sig(SIGSEGV, t);
+}
+
+/*
+ * If parent process has a thread-local ABI, the child inherits. Only applies
+ * when forking a process, not a thread.
+ */
+int getcpu_cache_fork(struct task_struct *t)
+{
+	struct getcpu_cache_entry *te;
+
+	list_for_each_entry(te, &current->getcpu_cache_head, entry) {
+		if (!add_thread_entry(t, te->cpu_cache))
+			return -1;
+	}
+	return 0;
+}
+
+void getcpu_cache_execve(struct task_struct *t)
+{
+	remove_all_thread_entry(t);
+}
+
+void getcpu_cache_exit(struct task_struct *t)
+{
+	remove_all_thread_entry(t);
+}
+
+/*
+ * sys_getcpu_cache - setup getcpu cache for caller thread
+ */
+SYSCALL_DEFINE3(getcpu_cache, int, cmd, int32_t __user *, cpu_cache,
+		int, flags)
+{
+	struct getcpu_cache_entry *te;
+
+	if (unlikely(!cpu_cache || flags))
+		return -EINVAL;
+	te = find_thread_entry(current, cpu_cache);
+	switch (cmd) {
+	case GETCPU_CACHE_CMD_REGISTER:
+		/* Attempt to register cpu_cache. Check if already there. */
+		if (te)
+			return -EBUSY;
+		te = add_thread_entry(current, cpu_cache);
+		if (!te)
+			return -ENOMEM;
+		/*
+		 * Migration walks the getcpu cache entry list to see
+		 * whether the notify_resume flag should be set.
+		 * Therefore, we need to ensure that the scheduler sees
+		 * the list update before we update the getcpu cache
+		 * content with the current CPU number.
+		 *
+		 * Add thread entry to list before updating content.
+		 */
+		barrier();
+		if (getcpu_cache_update_entry(te))
+			return -EFAULT;
+		return 0;
+	case GETCPU_CACHE_CMD_UNREGISTER:
+		/* Unregistration is requested. */
+		if (!te)
+			return -ENOENT;
+		remove_thread_entry(te);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..2e93411 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2120,6 +2120,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_GETCPU_CACHE
+	INIT_LIST_HEAD(&p->getcpu_cache_head);
+#endif
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efd3bfc..8f6d5d3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -957,6 +957,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 {
 	set_task_rq(p, cpu);
 #ifdef CONFIG_SMP
+	getcpu_cache_set_notify_resume(p);
 	/*
 	 * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
 	 * successfuly executed on another CPU. We must ensure that updates of
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0623787..1e1c299 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -249,3 +249,6 @@ cond_syscall(sys_execveat);
 
 /* membarrier */
 cond_syscall(sys_membarrier);
+
+/* thread-local ABI */
+cond_syscall(sys_getcpu_cache);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 42+ messages in thread


* [RFC PATCH 2/3] getcpu_cache: wire up ARM system call
  2016-01-05  7:01 ` Mathieu Desnoyers
@ 2016-01-05  7:01 ` Mathieu Desnoyers
  -1 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-05  7:01 UTC (permalink / raw)
  To: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra
  Cc: linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Wire up the getcpu cache system call on 32-bit ARM. Call the
getcpu_cache_handle_notify_resume() function on return to
userspace if TIF_NOTIFY_RESUME thread flag is set.

This provides an ABI improving the speed of a getcpu operation
on ARM by skipping the getcpu system call on the fast path.
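
For illustration, a user-space getcpu wrapper of the kind this ABI
enables might look like the following sketch. It assumes headers from
this series (__NR_getcpu_cache, GETCPU_CACHE_CMD_REGISTER); the
function name, the lazy registration and the getcpu(2) fallback policy
are assumptions, not code from glibc or from this patch:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/getcpu_cache.h>

static __thread volatile int32_t cpu_cache_tls;
static __thread int cpu_cache_registered;

/* Hypothetical fast getcpu: cached read, falling back to getcpu(2). */
static int fast_getcpu(void)
{
	unsigned int cpu;

	if (!cpu_cache_registered
	    && !syscall(__NR_getcpu_cache, GETCPU_CACHE_CMD_REGISTER,
			&cpu_cache_tls, 0))
		cpu_cache_registered = 1;
	if (cpu_cache_registered)
		return cpu_cache_tls;	/* fast path: plain memory read */
	if (syscall(__NR_getcpu, &cpu, NULL, NULL))
		return -1;		/* slow path: getcpu system call */
	return (int)cpu;
}

int main(void)
{
	printf("Running on CPU %d\n", fast_getcpu());
	return 0;
}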

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: linux-api@vger.kernel.org
---
 arch/arm/include/uapi/asm/unistd.h | 1 +
 arch/arm/kernel/calls.S            | 1 +
 arch/arm/kernel/signal.c           | 1 +
 3 files changed, 3 insertions(+)

diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index 7a2a32a..6605e94 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -416,6 +416,7 @@
 #define __NR_execveat			(__NR_SYSCALL_BASE+387)
 #define __NR_userfaultfd		(__NR_SYSCALL_BASE+388)
 #define __NR_membarrier			(__NR_SYSCALL_BASE+389)
+#define __NR_getcpu_cache		(__NR_SYSCALL_BASE+390)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index fde6c88..6786411 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -399,6 +399,7 @@
 		CALL(sys_execveat)
 		CALL(sys_userfaultfd)
 		CALL(sys_membarrier)
+/* 390 */	CALL(sys_getcpu_cache)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index 7b8f214..ff5052c 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -594,6 +594,7 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 			} else {
 				clear_thread_flag(TIF_NOTIFY_RESUME);
 				tracehook_notify_resume(regs);
+				getcpu_cache_handle_notify_resume(current);
 			}
 		}
 		local_irq_disable();
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 3/3] getcpu_cache: wire up x86 32/64 system call
@ 2016-01-05  7:02   ` Mathieu Desnoyers
  0 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-05  7:02 UTC (permalink / raw)
  To: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra
  Cc: linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Wire up the getcpu_cache system call on x86 32/64. Call the
getcpu_cache_handle_notify_resume() function on return to
userspace if TIF_NOTIFY_RESUME thread flag is set.

This provides an ABI improving the speed of a getcpu operation
on x86 by removing the need to perform a function call, "lsl"
instruction, or system call on the fast path.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: linux-api@vger.kernel.org
---
 arch/x86/entry/common.c                | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 3 files changed, 3 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a89fdbc..f7e6e12 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -249,6 +249,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_NOTIFY_RESUME) {
 			clear_thread_flag(TIF_NOTIFY_RESUME);
 			tracehook_notify_resume(regs);
+			getcpu_cache_handle_notify_resume(current);
 		}
 
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f17705e..b3ea491 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -383,3 +383,4 @@
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
 376	i386	mlock2			sys_mlock2
+377	i386	getcpu_cache		sys_getcpu_cache
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 314a90b..442aaa0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -332,6 +332,7 @@
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
 325	common	mlock2			sys_mlock2
+326	common	getcpu_cache		sys_getcpu_cache
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 42+ messages in thread


* Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-05  7:01   ` Mathieu Desnoyers
  (?)
@ 2016-01-05 12:04   ` Will Deacon
  2016-01-05 17:31     ` Mathieu Desnoyers
  -1 siblings, 1 reply; 42+ messages in thread
From: Will Deacon @ 2016-01-05 12:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Michael Kerrisk

Hi Mathieu,

On Tue, Jan 05, 2016 at 02:01:58AM -0500, Mathieu Desnoyers wrote:
> Expose a new system call allowing threads to register userspace memory
> areas where to store the CPU number on which the calling thread is
> running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
> current thread. Upon return to user-space, a notify-resume handler
> updates the current CPU value within each registered user-space memory
> area. User-space can then read the current CPU number directly from
> memory.

What guarantees do you provide if a thread other than the one which
registered the cache tries to access the value? Obviously, there's a
potential data race here with the kernel issuing a parallel update, but
are you intending to have single-copy atomicity semantics (like relaxed
atomics in C11) or is this simply going to give you junk?

I ask because, in the absence of alignment checks on the cache pointer,
we can't guarantee single-copy atomicity on ARM when the kernel writes
the current CPU value.

Cheers,

Will

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-05 12:04   ` Will Deacon
@ 2016-01-05 17:31     ` Mathieu Desnoyers
  2016-01-05 17:34         ` Mathieu Desnoyers
  2016-01-05 17:40         ` Russell King - ARM Linux
  0 siblings, 2 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-05 17:31 UTC (permalink / raw)
  To: Will Deacon
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Michael Kerrisk

----- On Jan 5, 2016, at 7:04 AM, Will Deacon will.deacon@arm.com wrote:

> Hi Mathieu,
> 
> On Tue, Jan 05, 2016 at 02:01:58AM -0500, Mathieu Desnoyers wrote:
>> Expose a new system call allowing threads to register userspace memory
>> areas where to store the CPU number on which the calling thread is
>> running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
>> current thread. Upon return to user-space, a notify-resume handler
>> updates the current CPU value within each registered user-space memory
>> area. User-space can then read the current CPU number directly from
>> memory.
> 
> What guarantees do you provide if a thread other than the one which
> registered the cache tries to access the value? Obviously, there's a
> potential data race here with the kernel issuing a parallel update, but
> are you intending to have single-copy atomicity semantics (like relaxed
> atomics in C11) or is this simply going to give you junk?
> 
> I ask because, in the absence of alignment checks on the cache pointer,
> we can't guarantee single-copy atomicity on ARM when the kernel writes
> the current CPU value.

Hi Will,

This is an excellent question. My initial thinking was that only the
thread registering the cache would read it, but now that you ask,
there might be use-cases where other threads would be interested in
reading each other's current CPU number.

For instance, an application could create a linked list or hash map
of thread control structures, which could contain the current CPU
number of each thread. A dispatch thread could then traverse or
lookup this structure to see on which CPU each thread is running and
do work queue dispatch or scheduling decisions accordingly.

This use-case would imply ensuring that reading the current CPU value
from another CPU will never result in reading a garbage value.
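
As a rough sketch of what I have in mind (plain C11 user-space code;
nothing here is part of the proposed kernel ABI, and the structure and
function names are made up):

#include <stdatomic.h>
#include <stdint.h>

struct thread_ctl {
	/* Naturally aligned; the kernel updates it for the owner thread. */
	_Atomic int32_t cpu_cache;
	struct thread_ctl *next;
};

/* Dispatch thread: the value may be stale, but it must never be torn. */
static int32_t last_known_cpu(const struct thread_ctl *t)
{
	return atomic_load_explicit(&t->cpu_cache, memory_order_relaxed);
}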

If we indeed intend to enable this use-case, we should:

1) Add an alignment check on the cpu_cache pointer. Should we
   return -EINVAL if unaligned ?
2) Document this alignment requirement in the man page, and the
   atomicity guarantees it provides,

The tiny downside of having this alignment requirement is that
it would not be possible to put the cpu_cache into a packed
structure. I don't think anyone would care though.

Thanks!

Mathieu


> 
> Cheers,
> 
> Will

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-05 17:34         ` Mathieu Desnoyers
  0 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-05 17:34 UTC (permalink / raw)
  To: Will Deacon
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Michael Kerrisk

----- On Jan 5, 2016, at 12:31 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Jan 5, 2016, at 7:04 AM, Will Deacon will.deacon@arm.com wrote:
> 
>> Hi Mathieu,
>> 
>> On Tue, Jan 05, 2016 at 02:01:58AM -0500, Mathieu Desnoyers wrote:
>>> Expose a new system call allowing threads to register userspace memory
>>> areas where to store the CPU number on which the calling thread is
>>> running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
>>> current thread. Upon return to user-space, a notify-resume handler
>>> updates the current CPU value within each registered user-space memory
>>> area. User-space can then read the current CPU number directly from
>>> memory.
>> 
>> What guarantees do you provide if a thread other than the one which
>> registered the cache tries to access the value? Obviously, there's a
>> potential data race here with the kernel issuing a parallel update, but
>> are you intending to have single-copy atomicity semantics (like relaxed
>> atomics in C11) or is this simply going to give you junk?
>> 
>> I ask because, in the absence of alignment checks on the cache pointer,
>> we can't guarantee single-copy atomicity on ARM when the kernel writes
>> the current CPU value.
> 
> Hi Will,
> 
> This is an excellent question. My initial thinking was that only the
> thread registering the cache would read it, but now that you ask,
> there might be use-cases where other threads would be interested in
> reading each other's current CPU number.
> 
> For instance, an application could create a linked list or hash map
> of thread control structures, which could contain the current CPU
> number of each thread. A dispatch thread could then traverse or
> lookup this structure to see on which CPU each thread is running and
> do work queue dispatch or scheduling decisions accordingly.
> 
> This use-case would imply ensuring that reading the current CPU value
> from another CPU will never result in reading a garbage value.
> 
> If we indeed intend to enable this use-case, we should:
> 
> 1) Add an alignment check on the cpu_cache pointer. Should we
>   return -EINVAL if unaligned ?
> 2) Document this alignment requirement in the man page, and the
>   atomicity guarantees it provides,

Related question: if we check that cpu_cache pointer is aligned
on 4 bytes, does put_user() then guarantee single-copy atomicity
on all architectures ?

Thanks,

Mathieu

> 
> The tiny downside of having this alignment requirement is that
> it would not be possible to put the cpu_cache into a packed
> structure. I don't think anyone would care though.
> 
> Thanks!
> 
> Mathieu
> 
> 
>> 
>> Cheers,
>> 
>> Will
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-05 17:34         ` Mathieu Desnoyers
  0 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-05 17:34 UTC (permalink / raw)
  To: Will Deacon
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Michael Kerrisk

----- On Jan 5, 2016, at 12:31 PM, Mathieu Desnoyers mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org wrote:

> ----- On Jan 5, 2016, at 7:04 AM, Will Deacon will.deacon-5wv7dgnIgG8@public.gmane.org wrote:
> 
>> Hi Mathieu,
>> 
>> On Tue, Jan 05, 2016 at 02:01:58AM -0500, Mathieu Desnoyers wrote:
>>> Expose a new system call allowing threads to register userspace memory
>>> areas where to store the CPU number on which the calling thread is
>>> running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
>>> current thread. Upon return to user-space, a notify-resume handler
>>> updates the current CPU value within each registered user-space memory
>>> area. User-space can then read the current CPU number directly from
>>> memory.
>> 
>> What guarantees do you provide if a thread other than the one which
>> registered the cache tries to access the value? Obviously, there's a
>> potential data race here with the kernel issuing a parallel update, but
>> are you intending to have single-copy atomicity semantics (like relaxed
>> atomics in C11) or is this simply going to give you junk?
>> 
>> I ask because, in the absence of alignment checks on the cache pointer,
>> we can't guarantee single-copy atomicity on ARM when the kernel writes
>> the current CPU value.
> 
> Hi Will,
> 
> This is an excellent question. My initial thinking was that only the
> thread registering the cache would read it, but now that you ask,
> there might be use-cases where other threads would be interested in
> reading each other's current CPU number.
> 
> For instance, an application could create a linked list or hash map
> of thread control structures, which could contain the current CPU
> number of each thread. A dispatch thread could then traverse or
> lookup this structure to see on which CPU each thread is running and
> do work queue dispatch or scheduling decisions accordingly.
> 
> This use-case would imply ensuring that reading the current CPU value
> from another CPU will never result in reading a garbage value.
> 
> If we indeed intend to enable this use-case, we should:
> 
> 1) Add an alignment check on the cpu_cache pointer. Should we
>   return -EINVAL if unaligned ?
> 2) Document this alignment requirement in the man page, and the
>   atomicity guarantees it provides,

Related question: if we check that cpu_cache pointer is aligned
on 4 bytes, does put_user() then guarantee single-copy atomicity
on all architectures ?

Thanks,

Mathieu

> 
> The tiny downside of having this alignment requirement is that
> it would not be possible to put the cpu_cache into a packed
> structure. I don't think anyone would care though.
> 
> Thanks!
> 
> Mathieu
> 
> 
>> 
>> Cheers,
>> 
>> Will
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-05 17:40         ` Russell King - ARM Linux
  0 siblings, 0 replies; 42+ messages in thread
From: Russell King - ARM Linux @ 2016-01-05 17:40 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Will Deacon, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra, linux-kernel, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Catalin Marinas, Michael Kerrisk

On Tue, Jan 05, 2016 at 05:31:45PM +0000, Mathieu Desnoyers wrote:
> For instance, an application could create a linked list or hash map
> of thread control structures, which could contain the current CPU
> number of each thread. A dispatch thread could then traverse or
> lookup this structure to see on which CPU each thread is running and
> do work queue dispatch or scheduling decisions accordingly.

So, what happens if the linked list is walked from thread X, and we
discover that thread Y is allegedly running on CPU1.  We decide that
we want to dispatch some work on that thread due to it being on CPU1,
so we send an event to thread Y.

Thread Y becomes runnable, and the scheduler decides to schedule the
thread on CPU3 instead of CPU1.

My point is that the above idea is inherently racy.  The only case
where it isn't racy is when thread Y is bound to CPU1, and so can't
move - but then you'd know that thread Y is on CPU1 and there
wouldn't be a need for the inherent complexity suggested above.

The behaviour I've seen on ARM from the scheduler (on a quad CPU
platform, observing the system activity with top reporting the last
CPU number used by each thread) is that threads often migrate
between CPUs - especially in the case of (eg) one or two threads
running in a quad-CPU system.

Given that, I'm really not sure what the use of reading and making
decisions on the current CPU number would be within a program -
unless the thread is bound to a particular CPU or group of CPUs,
it seems that you can't rely on being on the reported CPU by the
time the system call returns.

-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-05 17:40         ` Russell King - ARM Linux
  0 siblings, 0 replies; 42+ messages in thread
From: Russell King - ARM Linux @ 2016-01-05 17:40 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Will Deacon, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ingo Molnar, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Andrew Morton, Catalin Marinas,
	Michael Kerrisk

On Tue, Jan 05, 2016 at 05:31:45PM +0000, Mathieu Desnoyers wrote:
> For instance, an application could create a linked list or hash map
> of thread control structures, which could contain the current CPU
> number of each thread. A dispatch thread could then traverse or
> lookup this structure to see on which CPU each thread is running and
> do work queue dispatch or scheduling decisions accordingly.

So, what happens if the linked list is walked from thread X, and we
discover that thread Y is allegedly running on CPU1.  We decide that
we want to dispatch some work on that thread due to it being on CPU1,
so we send an event to thread Y.

Thread Y becomes runnable, and the scheduler decides to schedule the
thread on CPU3 instead of CPU1.

My point is that the above idea is inherently racy.  The only case
where it isn't racy is when thread Y is bound to CPU1, and so can't
move - but then you'd know that thread Y is on CPU1 and there
wouldn't be a need for the inherent complexity suggested above.

The behaviour I've seen on ARM from the scheduler (on a quad CPU
platform, observing the system activity with top reporting the last
CPU number used by each thread) is that threads often migrate
between CPUs - especially in the case of (eg) one or two threads
running in a quad-CPU system.

Given that, I'm really not sure what the use of reading and making
decisions on the current CPU number would be within a program -
unless the thread is bound to a particular CPU or group of CPUs,
it seems that you can't rely on being on the reported CPU by the
time the system call returns.

-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-05 17:40         ` Russell King - ARM Linux
  (?)
@ 2016-01-05 17:49         ` Mathieu Desnoyers
  -1 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-05 17:49 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Will Deacon, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra, linux-kernel, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Catalin Marinas, Michael Kerrisk

----- On Jan 5, 2016, at 12:40 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:

> On Tue, Jan 05, 2016 at 05:31:45PM +0000, Mathieu Desnoyers wrote:
>> For instance, an application could create a linked list or hash map
>> of thread control structures, which could contain the current CPU
>> number of each thread. A dispatch thread could then traverse or
>> lookup this structure to see on which CPU each thread is running and
>> do work queue dispatch or scheduling decisions accordingly.
> 
> So, what happens if the linked list is walked from thread X, and we
> discover that thread Y is allegedly running on CPU1.  We decide that
> we want to dispatch some work on that thread due to it being on CPU1,
> so we send an event to thread Y.
> 
> Thread Y becomes runnable, and the scheduler decides to schedule the
> thread on CPU3 instead of CPU1.
> 
> My point is that the above idea is inherently racy.  The only case
> where it isn't racy is when thread Y is bound to CPU1, and so can't
> move - but then you'd know that thread Y is on CPU1 and there
> wouldn't be a need for the inherent complexity suggested above.

I agree that this is inherently racy. The goal of such a scheme would
be to make statistically better dispatch decisions based on the
assumption that migration is not performed too often.

> 
> The behaviour I've seen on ARM from the scheduler (on a quad CPU
> platform, observing the system activity with top reporting the last
> CPU number used by each thread) is that threads often migrate
> between CPUs - especially in the case of (eg) one or two threads
> running in a quad-CPU system.

That seems rather odd. I have no doubt that this might be happening
right now, but having the scheduler do frequent migrations doesn't
appear to be in the best interest of access locality.

> 
> Given that, I'm really not sure what the use of reading and making
> decisions on the current CPU number would be within a program -
> unless the thread is bound to a particular CPU or group of CPUs,
> it seems that you can't rely on being on the reported CPU by the
> time the system call returns.

I completely agree that you can't rely on that information, but it
seems rather odd that migration would happen so often that it would
make this information statistically irrelevant.

Perhaps others could shed some light on this scheduler behavior ?

Thanks!

Mathieu

> 
> --
> RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-05 21:47           ` Paul E. McKenney
  0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2016-01-05 21:47 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Mathieu Desnoyers, Will Deacon, Thomas Gleixner, Paul Turner,
	Andrew Hunter, Peter Zijlstra, linux-kernel, linux-api,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ingo Molnar, Ben Maurer, rostedt, Josh Triplett, Linus Torvalds,
	Andrew Morton, Catalin Marinas, Michael Kerrisk

On Tue, Jan 05, 2016 at 05:40:18PM +0000, Russell King - ARM Linux wrote:
> On Tue, Jan 05, 2016 at 05:31:45PM +0000, Mathieu Desnoyers wrote:
> > For instance, an application could create a linked list or hash map
> > of thread control structures, which could contain the current CPU
> > number of each thread. A dispatch thread could then traverse or
> > lookup this structure to see on which CPU each thread is running and
> > do work queue dispatch or scheduling decisions accordingly.
> 
> So, what happens if the linked list is walked from thread X, and we
> discover that thread Y is allegedly running on CPU1.  We decide that
> we want to dispatch some work on that thread due to it being on CPU1,
> so we send an event to thread Y.
> 
> Thread Y becomes runnable, and the scheduler decides to schedule the
> thread on CPU3 instead of CPU1.
> 
> My point is that the above idea is inherently racy.  The only case
> where it isn't racy is when thread Y is bound to CPU1, and so can't
> move - but then you'd know that thread Y is on CPU1 and there
> wouldn't be a need for the inherent complexity suggested above.
> 
> The behaviour I've seen on ARM from the scheduler (on a quad CPU
> platform, observing the system activity with top reporting the last
> CPU number used by each thread) is that threads often migrate
> between CPUs - especially in the case of (eg) one or two threads
> running in a quad-CPU system.
> 
> Given that, I'm really not sure what the use of reading and making
> decisions on the current CPU number would be within a program -
> unless the thread is bound to a particular CPU or group of CPUs,
> it seems that you can't rely on being on the reported CPU by the
> time the system call returns.

As I understand it, the idea is -not- to eliminate synchronization
like we do with per-CPU variables in the kernel, but rather to
reduce the average cost of synchronization.  For example, there
might be a separate data structure per CPU, each structure guarded
by its own lock.  A thread could sample the current running CPU,
acquire that CPU's corresponding lock, and operate on that CPU's
structure.  This would work correctly even if there was an arbitrarily
high number of preemptions/migrations, but would have improved
performance (compared to a single global lock) in the common case
where there were no preemptions/migrations.
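
A minimal sketch of that pattern, assuming the cached CPU number is
available in a thread-local variable (all names below are made up for
illustration, nothing here is the proposed ABI):

#include <pthread.h>

#define NR_BUCKETS	256	/* upper bound on possible CPU numbers */

struct percpu_bucket {
	pthread_mutex_t lock;	/* assumed initialized with pthread_mutex_init() */
	/* per-CPU data structure goes here */
};

static struct percpu_bucket buckets[NR_BUCKETS];

static __thread volatile int cpu_cache = -1;	/* registered with the kernel elsewhere */

static void operate_on_my_bucket(void)
{
	/* Sample the current CPU.  A migration after this point is
	 * harmless: the bucket is still protected by its own lock,
	 * we merely lose a bit of locality. */
	int cpu = cpu_cache;

	if (cpu < 0 || cpu >= NR_BUCKETS)
		cpu = 0;	/* fallback if not registered or out of range */

	pthread_mutex_lock(&buckets[cpu].lock);
	/* ... operate on buckets[cpu] ... */
	pthread_mutex_unlock(&buckets[cpu].lock);
}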

This approach can also be used in conjunction with Paul Turner's
per-CPU atomics.

Make sense, or am I missing your point?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-05 21:47           ` Paul E. McKenney
  0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2016-01-05 21:47 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Mathieu Desnoyers, Will Deacon, Thomas Gleixner, Paul Turner,
	Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Josh Triplett, Linus Torvalds, Andrew Morton,
	Catalin Marinas, Michael Kerrisk

On Tue, Jan 05, 2016 at 05:40:18PM +0000, Russell King - ARM Linux wrote:
> On Tue, Jan 05, 2016 at 05:31:45PM +0000, Mathieu Desnoyers wrote:
> > For instance, an application could create a linked list or hash map
> > of thread control structures, which could contain the current CPU
> > number of each thread. A dispatch thread could then traverse or
> > lookup this structure to see on which CPU each thread is running and
> > do work queue dispatch or scheduling decisions accordingly.
> 
> So, what happens if the linked list is walked from thread X, and we
> discover that thread Y is allegedly running on CPU1.  We decide that
> we want to dispatch some work on that thread due to it being on CPU1,
> so we send an event to thread Y.
> 
> Thread Y becomes runnable, and the scheduler decides to schedule the
> thread on CPU3 instead of CPU1.
> 
> My point is that the above idea is inherently racy.  The only case
> where it isn't racy is when thread Y is bound to CPU1, and so can't
> move - but then you'd know that thread Y is on CPU1 and there
> wouldn't be a need for the inherent complexity suggested above.
> 
> The behaviour I've seen on ARM from the scheduler (on a quad CPU
> platform, observing the system activity with top reporting the last
> CPU number used by each thread) is that threads often migrate
> between CPUs - especially in the case of (eg) one or two threads
> running in a quad-CPU system.
> 
> Given that, I'm really not sure what the use of reading and making
> decisions on the current CPU number would be within a program -
> unless the thread is bound to a particular CPU or group of CPUs,
> it seems that you can't rely on being on the reported CPU by the
> time the system call returns.

As I understand it, the idea is -not- to eliminate synchronization
like we do with per-CPU variables in the kernel, but rather to
reduce the average cost of synchronization.  For example, there
might be a separate data structure per CPU, each structure guarded
by its own lock.  A thread could sample the current running CPU,
acquire that CPU's corresponding lock, and operate on that CPU's
structure.  This would work correctly even if there was an arbitrarily
high number of preemptions/migrations, but would have improved
performance (compared to a single global lock) in the common case
where there were no preemptions/migrations.

This approach can also be used in conjunction with Paul Turner's
per-CPU atomics.

Make sense, or am I missing your point?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-05 22:34             ` Mathieu Desnoyers
  0 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-05 22:34 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Russell King - ARM Linux, Will Deacon, Thomas Gleixner,
	Paul Turner, Andrew Hunter, Peter Zijlstra, linux-kernel,
	linux-api, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ingo Molnar, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Andrew Morton, Catalin Marinas, Michael Kerrisk

----- On Jan 5, 2016, at 4:47 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:

> On Tue, Jan 05, 2016 at 05:40:18PM +0000, Russell King - ARM Linux wrote:
>> On Tue, Jan 05, 2016 at 05:31:45PM +0000, Mathieu Desnoyers wrote:
>> > For instance, an application could create a linked list or hash map
>> > of thread control structures, which could contain the current CPU
>> > number of each thread. A dispatch thread could then traverse or
>> > lookup this structure to see on which CPU each thread is running and
>> > do work queue dispatch or scheduling decisions accordingly.
>> 
>> So, what happens if the linked list is walked from thread X, and we
>> discover that thread Y is allegedly running on CPU1.  We decide that
>> we want to dispatch some work on that thread due to it being on CPU1,
>> so we send an event to thread Y.
>> 
>> Thread Y becomes runnable, and the scheduler decides to schedule the
>> thread on CPU3 instead of CPU1.
>> 
>> My point is that the above idea is inherently racy.  The only case
>> where it isn't racy is when thread Y is bound to CPU1, and so can't
>> move - but then you'd know that thread Y is on CPU1 and there
>> wouldn't be a need for the inherent complexity suggested above.
>> 
>> The behaviour I've seen on ARM from the scheduler (on a quad CPU
>> platform, observing the system activity with top reporting the last
>> CPU number used by each thread) is that threads often migrate
>> between CPUs - especially in the case of (eg) one or two threads
>> running in a quad-CPU system.
>> 
>> Given that, I'm really not sure what the use of reading and making
>> decisions on the current CPU number would be within a program -
>> unless the thread is bound to a particular CPU or group of CPUs,
>> it seems that you can't rely on being on the reported CPU by the
>> time the system call returns.
> 
> As I understand it, the idea is -not- to eliminate synchronization
> like we do with per-CPU variables in the kernel, but rather to
> reduce the average cost of synchronization.  For example, there
> might be a separate data structure per CPU, each structure guarded
> by its own lock.  A thread could sample the current running CPU,
> acquire that CPU's corresponding lock, and operate on that CPU's
> structure.  This would work correctly even if there was an arbitrarily
> high number of preemptions/migrations, but would have improved
> performance (compared to a single global lock) in the common case
> where there were no preemptions/migrations.
> 
> This approach can also be used in conjunction with Paul Turner's
> per-CPU atomics.
> 
> Make sense, or am I missing your point?

Russell's point is more about accessing a given thread's cpu_cache
variable from other threads/cores, which is beyond what is needed
for restartable critical sections.

Independently of the usefulness of reading other thread's cpu_cache
to see their current CPU, I would advocate for checking the cpu_cache
natural alignment, and return EINVAL if it is not aligned. Even for
thread-local reads, we care about ensuring there is no load tearing
when reading this variable. The behavior of the kernel updating this
variable read by a user-space thread is very similar to having a
variable updated by a signal handler nested on top of a thread. This
makes it simpler and reduces the testing state space.
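
Concretely, I am thinking of something along the lines of the following
check in the registration path (sketch only; the helper name is made up
and this is not the final patch):

static int check_cpu_cache_alignment(int32_t __user *cpu_cache)
{
	if (!IS_ALIGNED((unsigned long)cpu_cache, sizeof(int32_t)))
		return -EINVAL;
	return 0;
}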

Thoughts ?

Thanks,

Mathieu

> 
> 							Thanx, Paul

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-05 22:34             ` Mathieu Desnoyers
  0 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-05 22:34 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Russell King - ARM Linux, Will Deacon, Thomas Gleixner,
	Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Josh Triplett, Linus Torvalds, Andrew Morton,
	Catalin Marinas, Michael Kerrisk

----- On Jan 5, 2016, at 4:47 PM, Paul E. McKenney paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org wrote:

> On Tue, Jan 05, 2016 at 05:40:18PM +0000, Russell King - ARM Linux wrote:
>> On Tue, Jan 05, 2016 at 05:31:45PM +0000, Mathieu Desnoyers wrote:
>> > For instance, an application could create a linked list or hash map
>> > of thread control structures, which could contain the current CPU
>> > number of each thread. A dispatch thread could then traverse or
>> > lookup this structure to see on which CPU each thread is running and
>> > do work queue dispatch or scheduling decisions accordingly.
>> 
>> So, what happens if the linked list is walked from thread X, and we
>> discover that thread Y is allegedly running on CPU1.  We decide that
>> we want to dispatch some work on that thread due to it being on CPU1,
>> so we send an event to thread Y.
>> 
>> Thread Y becomes runnable, and the scheduler decides to schedule the
>> thread on CPU3 instead of CPU1.
>> 
>> My point is that the above idea is inherently racy.  The only case
>> where it isn't racy is when thread Y is bound to CPU1, and so can't
>> move - but then you'd know that thread Y is on CPU1 and there
>> wouldn't be a need for the inherent complexity suggested above.
>> 
>> The behaviour I've seen on ARM from the scheduler (on a quad CPU
>> platform, observing the system activity with top reporting the last
>> CPU number used by each thread) is that threads often migrate
>> between CPUs - especially in the case of (eg) one or two threads
>> running in a quad-CPU system.
>> 
>> Given that, I'm really not sure what the use of reading and making
>> decisions on the current CPU number would be within a program -
>> unless the thread is bound to a particular CPU or group of CPUs,
>> it seems that you can't rely on being on the reported CPU by the
>> time the system call returns.
> 
> As I understand it, the idea is -not- to eliminate synchronization
> like we do with per-CPU variables in the kernel, but rather to
> reduce the average cost of synchronization.  For example, there
> might be a separate data structure per CPU, each structure guarded
> by its own lock.  A thread could sample the current running CPU,
> acquire that CPU's corresponding lock, and operate on that CPU's
> structure.  This would work correctly even if there was an arbitrarily
> high number of preemptions/migrations, but would have improved
> performance (compared to a single global lock) in the common case
> where there were no preemptions/migrations.
> 
> This approach can also be used in conjunction with Paul Turner's
> per-CPU atomics.
> 
> Make sense, or am I missing your point?

Russell's point is more about accessing a given thread's cpu_cache
variable from other threads/cores, which is beyond what is needed
for restartable critical sections.

Independently of the usefulness of reading other thread's cpu_cache
to see their current CPU, I would advocate for checking the cpu_cache
natural alignment, and return EINVAL if it is not aligned. Even for
thread-local reads, we care about ensuring there is no load tearing
when reading this variable. The behavior of the kernel updating this
variable read by a user-space thread is very similar to having a
variable updated by a signal handler nested on top of a thread. This
makes it simpler and reduces the testing state space.

Thoughts ?

Thanks,

Mathieu

> 
> 							Thanx, Paul

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-05 22:54               ` Paul E. McKenney
  0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2016-01-05 22:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Russell King - ARM Linux, Will Deacon, Thomas Gleixner,
	Paul Turner, Andrew Hunter, Peter Zijlstra, linux-kernel,
	linux-api, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ingo Molnar, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Andrew Morton, Catalin Marinas, Michael Kerrisk

On Tue, Jan 05, 2016 at 10:34:04PM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 5, 2016, at 4:47 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:
> 
> > On Tue, Jan 05, 2016 at 05:40:18PM +0000, Russell King - ARM Linux wrote:
> >> On Tue, Jan 05, 2016 at 05:31:45PM +0000, Mathieu Desnoyers wrote:
> >> > For instance, an application could create a linked list or hash map
> >> > of thread control structures, which could contain the current CPU
> >> > number of each thread. A dispatch thread could then traverse or
> >> > lookup this structure to see on which CPU each thread is running and
> >> > do work queue dispatch or scheduling decisions accordingly.
> >> 
> >> So, what happens if the linked list is walked from thread X, and we
> >> discover that thread Y is allegedly running on CPU1.  We decide that
> >> we want to dispatch some work on that thread due to it being on CPU1,
> >> so we send an event to thread Y.
> >> 
> >> Thread Y becomes runnable, and the scheduler decides to schedule the
> >> thread on CPU3 instead of CPU1.
> >> 
> >> My point is that the above idea is inherently racy.  The only case
> >> where it isn't racy is when thread Y is bound to CPU1, and so can't
> >> move - but then you'd know that thread Y is on CPU1 and there
> >> wouldn't be a need for the inherent complexity suggested above.
> >> 
> >> The behaviour I've seen on ARM from the scheduler (on a quad CPU
> >> platform, observing the system activity with top reporting the last
> >> CPU number used by each thread) is that threads often migrate
> >> between CPUs - especially in the case of (eg) one or two threads
> >> running in a quad-CPU system.
> >> 
> >> Given that, I'm really not sure what the use of reading and making
> >> decisions on the current CPU number would be within a program -
> >> unless the thread is bound to a particular CPU or group of CPUs,
> >> it seems that you can't rely on being on the reported CPU by the
> >> time the system call returns.
> > 
> > As I understand it, the idea is -not- to eliminate synchronization
> > like we do with per-CPU variables in the kernel, but rather to
> > reduce the average cost of synchronization.  For example, there
> > might be a separate data structure per CPU, each structure guarded
> > by its own lock.  A thread could sample the current running CPU,
> > acquire that CPU's corresponding lock, and operate on that CPU's
> > structure.  This would work correctly even if there was an arbitrarily
> > high number of preemptions/migrations, but would have improved
> > performance (compared to a single global lock) in the common case
> > where there were no preemptions/migrations.
> > 
> > This approach can also be used in conjunction with Paul Turner's
> > per-CPU atomics.
> > 
> > Make sense, or am I missing your point?
> 
> Russell's point is more about accessing a given thread's cpu_cache
> variable from other threads/cores, which is beyond what is needed
> for restartable critical sections.

Fair enough!

> Independently of the usefulness of reading other thread's cpu_cache
> to see their current CPU, I would advocate for checking the cpu_cache
> natural alignment, and return EINVAL if it is not aligned. Even for
> thread-local reads, we care about ensuring there is no load tearing
> when reading this variable. The behavior of the kernel updating this
> variable read by a user-space thread is very similar to having a
> variable updated by a signal handler nested on top of a thread. This
> makes it simpler and reduces the testing state space.

Makes sense to me!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-05 22:54               ` Paul E. McKenney
  0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2016-01-05 22:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Russell King - ARM Linux, Will Deacon, Thomas Gleixner,
	Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Josh Triplett, Linus Torvalds, Andrew Morton,
	Catalin Marinas, Michael Kerrisk

On Tue, Jan 05, 2016 at 10:34:04PM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 5, 2016, at 4:47 PM, Paul E. McKenney paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org wrote:
> 
> > On Tue, Jan 05, 2016 at 05:40:18PM +0000, Russell King - ARM Linux wrote:
> >> On Tue, Jan 05, 2016 at 05:31:45PM +0000, Mathieu Desnoyers wrote:
> >> > For instance, an application could create a linked list or hash map
> >> > of thread control structures, which could contain the current CPU
> >> > number of each thread. A dispatch thread could then traverse or
> >> > lookup this structure to see on which CPU each thread is running and
> >> > do work queue dispatch or scheduling decisions accordingly.
> >> 
> >> So, what happens if the linked list is walked from thread X, and we
> >> discover that thread Y is allegedly running on CPU1.  We decide that
> >> we want to dispatch some work on that thread due to it being on CPU1,
> >> so we send an event to thread Y.
> >> 
> >> Thread Y becomes runnable, and the scheduler decides to schedule the
> >> thread on CPU3 instead of CPU1.
> >> 
> >> My point is that the above idea is inherently racy.  The only case
> >> where it isn't racy is when thread Y is bound to CPU1, and so can't
> >> move - but then you'd know that thread Y is on CPU1 and there
> >> wouldn't be a need for the inherent complexity suggested above.
> >> 
> >> The behaviour I've seen on ARM from the scheduler (on a quad CPU
> >> platform, observing the system activity with top reporting the last
> >> CPU number used by each thread) is that threads often migrate
> >> between CPUs - especially in the case of (eg) one or two threads
> >> running in a quad-CPU system.
> >> 
> >> Given that, I'm really not sure what the use of reading and making
> >> decisions on the current CPU number would be within a program -
> >> unless the thread is bound to a particular CPU or group of CPUs,
> >> it seems that you can't rely on being on the reported CPU by the
> >> time the system call returns.
> > 
> > As I understand it, the idea is -not- to eliminate synchronization
> > like we do with per-CPU variables in the kernel, but rather to
> > reduce the average cost of synchronization.  For example, there
> > might be a separate data structure per CPU, each structure guarded
> > by its own lock.  A thread could sample the current running CPU,
> > acquire that CPU's corresponding lock, and operate on that CPU's
> > structure.  This would work correctly even if there was an arbitrarily
> > high number of preemptions/migrations, but would have improved
> > performance (compared to a single global lock) in the common case
> > where there were no preemptions/migrations.
> > 
> > This approach can also be used in conjunction with Paul Turner's
> > per-CPU atomics.
> > 
> > Make sense, or am I missing your point?
> 
> Russell's point is more about accessing a given thread's cpu_cache
> variable from other threads/cores, which is beyond what is needed
> for restartable critical sections.

Fair enough!

> Independently of the usefulness of reading other thread's cpu_cache
> to see their current CPU, I would advocate for checking the cpu_cache
> natural alignment, and return EINVAL if it is not aligned. Even for
> thread-local reads, we care about ensuring there is no load tearing
> when reading this variable. The behavior of the kernel updating this
> variable read by a user-space thread is very similar to having a
> variable updated by a signal handler nested on top of a thread. This
> makes it simpler and reduces the testing state space.

Makes sense to me!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC PATCH 0/3] Implement getcpu_cache system call
@ 2016-01-11 22:38   ` Seymour, Shane M
  0 siblings, 0 replies; 42+ messages in thread
From: Seymour, Shane M @ 2016-01-11 22:38 UTC (permalink / raw)
  To: Mathieu Desnoyers, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra
  Cc: linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

Hi Mathieu,

I have some concerns and suggestions for you about this.

What's to stop someone in user space from requesting an arbitrarily large number of CPU # cache locations that the kernel needs to allocate memory to track and each time the task migrates to a new CPU it needs to update them all? Could you use it to dramatically slow down a system/task switching? Should there be a ulimit type value or a sysctl setting to limit the number that you're allowed to register per-task?

If you can just register consecutive addresses each 4 bytes apart the size of the structure required to keep track of it in the kernel looks to be 20 or 24 bytes long depending on kernel bitness (using kmalloc it should be 32 bytes allocated either way) so if you do something like tie up 1GiB of memory in user space registered with CPU cache locations it will tie up >8GiB of memory in the kernel and there will be a huge linked list that will take significant amounts of time to traverse. You could use it as a local denial of service attack to try and soak up memory and cause a kernel OOM because the kernel needs more memory to keep track of the request compared to the size of the memory used by user space to create it. There doesn't currently appear to be any upper bounds on the number that can be registered.
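
For reference, the worst case I am describing works out roughly as
follows (assuming the 32-byte kmalloc bucket):

  1 GiB of user space / 4 bytes per cache location = 268,435,456 registrations
  268,435,456 registrations * 32 bytes of kernel tracking >= 8 GiB of kernel memory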

In terms of tracking what it's doing would you consider some sysfs attribute files (or something in debugfs) that tracked (these would all be in the add path so it shouldn't be performance sensitive):

1) The largest number of entries someone has created in the list in any task
2) The number of times (assuming you implement an upper bound on the number allowed) the upper bound is being hit (to allow someone to monitor for issues where the upper bound is being hit)

Assuming that something (e.g. glibc) is willing to register and make an entry available for the life of the task consider allowing one flag, for example, GETCPU_CACHE_PERSISTENT with GETCPU_CACHE_CMD_REGISTER and have a new command GETCPU_CACHE_CMD_GET_PERSISTENT to allow someone to ask for the user space address of an entry that something has guaranteed will be there until the task ends. If none exist they can fall back and allocate a new one - it allows for better reuse of an existing resource but log a warning if you're forced to remove a persistent entry or someone attempts to unregister it (which should always fail) since someone in user space will have broken their promise that it will always be there until the task ends (that means persistent ones should be left to be torn down by the kernel not unregistered from user space). If you do this you might need to optimize the find process so it's more likely the first persistent one will be the first one found if you think someone is more likely to take the approach of asking for that first and falling back to creating a new one if there isn't already one present. Having this will also tend to limit the number of these that anyone will need to create if most libraries ask for a persistent entry first.

Thanks
Shane

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC PATCH 0/3] Implement getcpu_cache system call
@ 2016-01-11 22:38   ` Seymour, Shane M
  0 siblings, 0 replies; 42+ messages in thread
From: Seymour, Shane M @ 2016-01-11 22:38 UTC (permalink / raw)
  To: Mathieu Desnoyers, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

Hi Mathieu,

I have some concerns and suggestions for you about this.

What's to stop someone in user space from requesting an arbitrarily large number of CPU # cache locations that the kernel needs to allocate memory to track and each time the task migrates to a new CPU it needs to update them all? Could you use it to dramatically slow down a system/task switching? Should there be a ulimit type value or a sysctl setting to limit the number that you're allowed to register per-task?

If you can just register consecutive addresses each 4 bytes apart the size of the structure required to keep track of it in the kernel looks to be 20 or 24 bytes long depending on kernel bitness (using kmalloc it should be 32 bytes allocated either way) so if you do something like tie up 1GiB of memory in user space registered with CPU cache locations it will tie up >8GiB of memory in the kernel and there will be a huge linked list that will take significant amounts of time to traverse. You could use it as a local denial of service attack to try and soak up memory and cause a kernel OOM because the kernel needs more memory to keep track of the request compared to the size of the memory used by user space to create it. There doesn't currently appear to be any upper bounds on the number that can be registered.

In terms of tracking what it's doing would you consider some sysfs attribute files (or something in debugfs) that tracked (these would all be in the add path so it shouldn't be performance sensitive):

1) The largest number of entries someone has created in the list in any task
2) The number of times (assuming you implement an upper bound on the number allowed) the upper bound is being hit (to allow someone to monitor for issues where the upper bound is being hit)

Assuming that something (e.g. glibc) is willing to register and make an entry available for the life of the task consider allowing one flag, for example, GETCPU_CACHE_PERSISTENT with GETCPU_CACHE_CMD_REGISTER and have a new command GETCPU_CACHE_CMD_GET_PERSISTENT to allow someone to ask for the user space address of an entry that something has guaranteed will be there until the task ends. If none exist they can fall back and allocate a new one - it allows for better reuse of an existing resource but log a warning if you're forced to remove a persistent entry or someone attempts to unregister it (which should always fail) since someone in user space will have broken their promise that it will always be there until the task ends (that means persistent ones should be left to be torn down by the kernel not unregistered from user space). If you do this you might need to optimize the find process so it's more likely the first persistent one will be the first one found if you think someone is more likely to take the approach of asking for that first and falling back to creating a new one if there isn't already one present. Having this will also tend to limit the number of these that anyone will need to create if most libraries ask for a persistent entry first.

Thanks
Shane

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/3] Implement getcpu_cache system call
  2016-01-11 22:38   ` Seymour, Shane M
  (?)
@ 2016-01-11 23:03   ` Josh Triplett
  2016-01-12  0:49       ` Mathieu Desnoyers
  -1 siblings, 1 reply; 42+ messages in thread
From: Josh Triplett @ 2016-01-11 23:03 UTC (permalink / raw)
  To: Seymour, Shane M
  Cc: Mathieu Desnoyers, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra, linux-kernel, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

On Mon, Jan 11, 2016 at 10:38:28PM +0000, Seymour, Shane M wrote:
> I have some concerns and suggestions for you about this.
> 
> What's to stop someone in user space from requesting an arbitrarily large number of CPU # cache locations that the kernel needs to allocate memory to track and each time the task migrates to a new CPU it needs to update them all? Could you use it to dramatically slow down a system/task switching? Should there be a ulimit type value or a sysctl setting to limit the number that you're allowed to register per-task?

The documented behavior of the syscall allows only one location per
thread, so the kernel can track that one and only address rather easily
in the task_struct.  Allowing dynamic allocation definitely doesn't seem
like a good idea.

- Josh Triplett

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC PATCH 0/3] Implement getcpu_cache system call
@ 2016-01-11 23:16     ` Seymour, Shane M
  0 siblings, 0 replies; 42+ messages in thread
From: Seymour, Shane M @ 2016-01-11 23:16 UTC (permalink / raw)
  To: Mathieu Desnoyers, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra
  Cc: linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

Ignore my email - I'd overlooked one bit of code and misunderstood how it worked.

-----Original Message-----
From: linux-api-owner@vger.kernel.org [mailto:linux-api-owner@vger.kernel.org] On Behalf Of Seymour, Shane M
Sent: Tuesday, January 12, 2016 9:38 AM
To: Mathieu Desnoyers; Thomas Gleixner; Paul Turner; Andrew Hunter; Peter Zijlstra
Cc: linux-kernel@vger.kernel.org; linux-api@vger.kernel.org; Andy Lutomirski; Andi Kleen; Dave Watson; Chris Lameter; Ingo Molnar; Ben Maurer; Steven Rostedt; Paul E. McKenney; Josh Triplett; Linus Torvalds; Andrew Morton; Russell King; Catalin Marinas; Will Deacon; Michael Kerrisk
Subject: RE: [RFC PATCH 0/3] Implement getcpu_cache system call

Hi Mathieu,

I have some concerns and suggestions for you about this.

What's to stop someone in user space from requesting an arbitrarily large number of CPU # cache locations that the kernel needs to allocate memory to track and each time the task migrates to a new CPU it needs to update them all? Could you use it to dramatically slow down a system/task switching? Should there be a ulimit type value or a sysctl setting to limit the number that you're allowed to register per-task?

If you can just register consecutive addresses each 4 bytes apart the size of the structure required to keep track of it in the kernel looks to be 20 or 24 bytes long depending on kernel bitness (using kmalloc it should be 32 bytes allocated either way) so if you do something like tie up 1GiB of memory in user space registered with CPU cache locations it will tie up >8GiB of memory in the kernel and there will be a huge linked list that will take significant amounts of time to traverse. You could use it as a local denial of service attack to try and soak up memory and cause a kernel OOM because the kernel needs more memory to keep track of the request compared to the size of the memory used by user space to create it. There doesn't currently appear to be any upper bounds on the number that can be registered.

In terms of tracking what it's doing, would you consider some sysfs attribute files (or something in debugfs) that track the following (these would all be updated in the add path, so it shouldn't be performance sensitive):

1) The largest number of entries anyone has created in the list in any task
2) The number of times the upper bound has been hit (assuming you implement one), to allow someone to monitor for tasks running into it

Assuming that something (e.g. glibc) is willing to register an entry and make it available for the life of the task, consider allowing a flag, for example GETCPU_CACHE_PERSISTENT, with GETCPU_CACHE_CMD_REGISTER, and adding a new command GETCPU_CACHE_CMD_GET_PERSISTENT that lets someone ask for the user-space address of an entry that is guaranteed to remain until the task ends. If none exists, they can fall back and register a new one; that allows better reuse of an existing resource. Log a warning if you're ever forced to remove a persistent entry, or if someone attempts to unregister one (which should always fail), since user space will have broken its promise that the entry stays until the task ends - persistent entries should be left to be torn down by the kernel, not unregistered from user space. If you do this you might need to optimize the lookup so the first persistent entry is the first one found, on the assumption that most callers will ask for an existing persistent entry first and only create a new one if there isn't one. Having this will also tend to limit the number of entries anyone needs to create, if most libraries ask for a persistent entry first.
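
A rough sketch of that lookup-then-register pattern, purely for illustration -
every name below (syscall number, commands, flag) is only the proposal above,
not an existing ABI:

#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_getcpu_cache
#define __NR_getcpu_cache 0                     /* placeholder */
#endif
#define GETCPU_CACHE_CMD_REGISTER       0       /* placeholder */
#define GETCPU_CACHE_CMD_GET_PERSISTENT 1       /* placeholder, proposed above */
#define GETCPU_CACHE_PERSISTENT         (1 << 0) /* placeholder, proposed above */

static __thread volatile int32_t my_cpu_cache = -1;

/* Returns a persistent per-thread location, registering our own if needed. */
static volatile int32_t *getcpu_cache_location(void)
{
        volatile int32_t *addr;

        /* Reuse an entry someone has promised to keep until the task ends. */
        if (syscall(__NR_getcpu_cache, GETCPU_CACHE_CMD_GET_PERSISTENT, &addr, 0) == 0)
                return addr;

        /* None yet: register our own and make the same promise. */
        if (syscall(__NR_getcpu_cache, GETCPU_CACHE_CMD_REGISTER, &my_cpu_cache,
                    GETCPU_CACHE_PERSISTENT) == 0)
                return &my_cpu_cache;

        return NULL;    /* caller falls back to sched_getcpu() */
}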

Thanks
Shane

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/3] Implement getcpu_cache system call
  2016-01-11 23:03   ` Josh Triplett
@ 2016-01-12  0:49       ` Mathieu Desnoyers
  0 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-12  0:49 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Shane M Seymour, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra, linux-kernel, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 11, 2016, at 6:03 PM, Josh Triplett josh@joshtriplett.org wrote:

> On Mon, Jan 11, 2016 at 10:38:28PM +0000, Seymour, Shane M wrote:
>> I have some concerns and suggestions for you about this.
>> 
>> What's to stop someone in user space from requesting an arbitrarily large number
>> of CPU # cache locations that the kernel needs to allocate memory to track and
>> each time the task migrates to a new CPU it needs to update them all? Could you
>> use it to dramatically slow down a system/task switching? Should there be a
>> ulimit type value or a sysctl setting to limit the number that you're allowed
>> to register per-task?
> 
> The documented behavior of the syscall allows only one location per
> thread, so the kernel can track that one and only address rather easily
> in the task_struct.  Allowing dynamic allocation definitely doesn't seem
> like a good idea.

The current implementation now allows more than one location per
thread. Which piece of documentation states that only one location
per thread is allowed ? This was indeed the case for the prior
implementations, but I moved to implementing a linked-list of
cpu_cache areas per thread to allow the getcpu_cache system call to
be used by more than a single shared object within a given program.

Without the linked list, as soon as more than one shared object tries
to register its cache, the first one will prohibit all others from
doing so.
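
For the discussion, the per-thread bookkeeping this implies is roughly the
following sketch (illustration only, not the actual structures from the patch):

#include <stdint.h>

/*
 * One node per registered user-space location, chained off the thread;
 * on migration the kernel walks the list and updates every *uaddr.
 */
struct getcpu_cache_entry {
        int32_t *uaddr;                         /* registered user-space int32_t */
        struct getcpu_cache_entry *next;        /* next registration for this thread */
};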

We could perhaps try to document that this system call should only
ever be used by *libc, and all libraries and applications should
then use the libc TLS cache variable, but it seems rather fragile,
and any app/lib could try to register its own cache.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/3] Implement getcpu_cache system call
  2016-01-12  0:49       ` Mathieu Desnoyers
  (?)
@ 2016-01-12  2:45       ` Josh Triplett
  2016-01-12  4:27           ` Ben Maurer
  -1 siblings, 1 reply; 42+ messages in thread
From: Josh Triplett @ 2016-01-12  2:45 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Shane M Seymour, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra, linux-kernel, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

On Tue, Jan 12, 2016 at 12:49:18AM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 11, 2016, at 6:03 PM, Josh Triplett josh@joshtriplett.org wrote:
> 
> > On Mon, Jan 11, 2016 at 10:38:28PM +0000, Seymour, Shane M wrote:
> >> I have some concerns and suggestions for you about this.
> >> 
> >> What's to stop someone in user space from requesting an arbitrarily large number
> >> of CPU # cache locations that the kernel needs to allocate memory to track and
> >> each time the task migrates to a new CPU it needs to update them all? Could you
> >> use it to dramatically slow down a system/task switching? Should there be a
> >> ulimit type value or a sysctl setting to limit the number that you're allowed
> >> to register per-task?
> > 
> > The documented behavior of the syscall allows only one location per
> > thread, so the kernel can track that one and only address rather easily
> > in the task_struct.  Allowing dynamic allocation definitely doesn't seem
> > like a good idea.
> 
> The current implementation now allows more than one location per
> thread. Which piece of documentation states that only one location
> per thread is allowed ? This was indeed the case for the prior
> implementations, but I moved to implementing a linked-list of
> cpu_cache areas per thread to allow the getcpu_cache system call to
> be used by more than a single shared object within a given program.

Ah, I missed that change.

> Without the linked list, as soon as more than one shared object try
> to register their cache, the first one will prohibit all others from
> doing so.
> 
> We could perhaps try to document that this system call should only
> ever be used by *libc, and all libraries and applications should
> then use the libc TLS cache variable, but it seems rather fragile,
> and any app/lib could try to register its own cache.

That does seem a bit fragile, true; on the other hand, the linked-list
approach would allow userspace to allocate an unbounded amount of kernel
memory, without any particular control on it.  That doesn't seem
reasonable.  Introducing an rlimit or similar for this seems like
massive overkill, and hardcoding a fixed limit breaks the 0-1-infinity
rule.

Given that any registered location will always provide the same value,
allowing only a single registration doesn't seem *too* problematic;
libc-based programs can use the libc implementation, and non-libc-based
programs can register a location themselves.  And users of this API will
already likely want to use some TLS mechanism, which already interacts
heavily with libc (set_thread_area/clone).

Allowing only one registration at a time seems preferable to introducing
another way to allocate kernel resources on a process's behalf.

- Josh Triplett

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/3] Implement getcpu_cache system call
  2016-01-12  2:45       ` Josh Triplett
@ 2016-01-12  4:27           ` Ben Maurer
  0 siblings, 0 replies; 42+ messages in thread
From: Ben Maurer @ 2016-01-12  4:27 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Mathieu Desnoyers, Shane M Seymour, Thomas Gleixner, Paul Turner,
	Andrew Hunter, Peter Zijlstra, linux-kernel, linux-api,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ingo Molnar, rostedt, Paul E. McKenney, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

One disadvantage of only allowing one is that high-performance server applications tend to statically link. It'd suck to have to go through whatever type of relocation we'd need to pull this out of glibc. But if there's only one registration allowed, a statically linked app couldn't create its own if glibc might use it some day.



Sent from my iPhone

> On Jan 11, 2016, at 6:46 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> 
>> On Tue, Jan 12, 2016 at 12:49:18AM +0000, Mathieu Desnoyers wrote:
>> ----- On Jan 11, 2016, at 6:03 PM, Josh Triplett josh@joshtriplett.org wrote:
>> 
>>>> On Mon, Jan 11, 2016 at 10:38:28PM +0000, Seymour, Shane M wrote:
>>>> I have some concerns and suggestions for you about this.
>>>> 
>>>> What's to stop someone in user space from requesting an arbitrarily large number
>>>> of CPU # cache locations that the kernel needs to allocate memory to track and
>>>> each time the task migrates to a new CPU it needs to update them all? Could you
>>>> use it to dramatically slow down a system/task switching? Should there be a
>>>> ulimit type value or a sysctl setting to limit the number that you're allowed
>>>> to register per-task?
>>> 
>>> The documented behavior of the syscall allows only one location per
>>> thread, so the kernel can track that one and only address rather easily
>>> in the task_struct.  Allowing dynamic allocation definitely doesn't seem
>>> like a good idea.
>> 
>> The current implementation now allows more than one location per
>> thread. Which piece of documentation states that only one location
>> per thread is allowed ? This was indeed the case for the prior
>> implementations, but I moved to implementing a linked-list of
>> cpu_cache areas per thread to allow the getcpu_cache system call to
>> be used by more than a single shared object within a given program.
> 
> Ah, I missed that change.
> 
>> Without the linked list, as soon as more than one shared object try
>> to register their cache, the first one will prohibit all others from
>> doing so.
>> 
>> We could perhaps try to document that this system call should only
>> ever be used by *libc, and all libraries and applications should
>> then use the libc TLS cache variable, but it seems rather fragile,
>> and any app/lib could try to register its own cache.
> 
> That does seem a bit fragile, true; on the other hand, the linked-list
> approach would allow userspace to allocate an unbounded amount of kernel
> memory, without any particular control on it.  That doesn't seem
> reasonable.  Introducing an rlimit or similar for this seems like
> massive overkill, and hardcoding a fixed limit breaks the 0-1-infinity
> rule.
> 
> Given that any registered location will always provide the same value,
> allowing only a single registration doesn't seem *too* problematic;
> libc-based programs can use the libc implementation, and non-libc-based
> programs can register a location themselves.  And users of this API will
> already likely want to use some TLS mechanism, which already interacts
> heavily with libc (set_thread_area/clone).
> 
> Allowing only one registration at a time seems preferable to introducing
> another way to allocate kernel resources on a process's behalf.
> 
> - Josh Triplett

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC PATCH 0/3] Implement getcpu_cache system call
  2016-01-12  4:27           ` Ben Maurer
@ 2016-01-12  6:40             ` Seymour, Shane M
  -1 siblings, 0 replies; 42+ messages in thread
From: Seymour, Shane M @ 2016-01-12  6:40 UTC (permalink / raw)
  To: Ben Maurer, Josh Triplett
  Cc: Mathieu Desnoyers, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra, linux-kernel, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, rostedt,
	Paul E. McKenney, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Will Deacon, Michael Kerrisk


> -----Original Message-----
> From: Ben Maurer [mailto:bmaurer@fb.com]
> Sent: Tuesday, January 12, 2016 3:28 PM
> 
> One disadvantage of only allowing one is that high performance server
> applications tend to statically link. It'd suck to have to go through what ever
> type of relocation we'd need to pull this out of glibc. But if there's only one
> registration allowed a statically linked app couldn't create its own if glibc
> might use it some day.
> 

If there was a new command like GETCPU_CACHE_CMD_ADDRESS that returned the address currently registered for the task, that wouldn't be an issue (the kernel knows the address it's going to write to for that task; there's no reason not to make it available back to user space on request). The main limitation of only allowing one address is that whoever registered it would be providing an (unenforceable, implicit) guarantee that it would always be there until the task ended (GETCPU_CACHE_CMD_UNREGISTER would have to go away with only one registerable address). It's highly likely that anyone registering an address would keep it for the life of the task, but it's hard to guarantee it.

The only two impacts that I can think of quickly are:

1) Multiple shared libraries that are dynamically loaded after the program starts (via an explicit dlopen) might want to register an address in a process that doesn't already have one registered. They shouldn't register an address themselves - they should only ask for an already existing one and fall back if there isn't one currently. If such a library did register something and was later unloaded, with the memory freed/unmapped, anyone else using that address isn't going to be happy, since you could end up with non-existent addresses or use-after-free. The library that registered it would need to leak the address so anyone using it can still do so, or have some other way of cleaning up when the task ends after it has been unloaded. The potential impact of that depends on whether anyone thinks it is at all likely to happen.
2) There could be ordering issues for shared libraries with initializers and finalizers if a CPU cache address is registered in an initializer and used in a finalizer of another library that is run after the finalizer of the library that registered it (if that memory is no longer available, or if it's possible to unregister an address).

> 
> 
> Sent from my iPhone
> 
> > On Jan 11, 2016, at 6:46 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> >
> >> On Tue, Jan 12, 2016 at 12:49:18AM +0000, Mathieu Desnoyers wrote:
> >> ----- On Jan 11, 2016, at 6:03 PM, Josh Triplett josh@joshtriplett.org wrote:
> >>
> >>>> On Mon, Jan 11, 2016 at 10:38:28PM +0000, Seymour, Shane M wrote:
> >>>> I have some concerns and suggestions for you about this.
> >>>>
> >>>> What's to stop someone in user space from requesting an arbitrarily
> >>>> large number of CPU # cache locations that the kernel needs to
> >>>> allocate memory to track and each time the task migrates to a new
> >>>> CPU it needs to update them all? Could you use it to dramatically
> >>>> slow down a system/task switching? Should there be a ulimit type
> >>>> value or a sysctl setting to limit the number that you're allowed to
> register per-task?
> >>>
> >>> The documented behavior of the syscall allows only one location per
> >>> thread, so the kernel can track that one and only address rather
> >>> easily in the task_struct.  Allowing dynamic allocation definitely
> >>> doesn't seem like a good idea.
> >>
> >> The current implementation now allows more than one location per
> >> thread. Which piece of documentation states that only one location
> >> per thread is allowed ? This was indeed the case for the prior
> >> implementations, but I moved to implementing a linked-list of
> >> cpu_cache areas per thread to allow the getcpu_cache system call to
> >> be used by more than a single shared object within a given program.
> >
> > Ah, I missed that change.
> >
> >> Without the linked list, as soon as more than one shared object try
> >> to register their cache, the first one will prohibit all others from
> >> doing so.
> >>
> >> We could perhaps try to document that this system call should only
> >> ever be used by *libc, and all libraries and applications should then
> >> use the libc TLS cache variable, but it seems rather fragile, and any
> >> app/lib could try to register its own cache.
> >
> > That does seem a bit fragile, true; on the other hand, the linked-list
> > approach would allow userspace to allocate an unbounded amount of
> > kernel memory, without any particular control on it.  That doesn't
> > seem reasonable.  Introducing an rlimit or similar for this seems like
> > massive overkill, and hardcoding a fixed limit breaks the 0-1-infinity
> > rule.
> >
> > Given that any registered location will always provide the same value,
> > allowing only a single registration doesn't seem *too* problematic;
> > libc-based programs can use the libc implementation, and
> > non-libc-based programs can register a location themselves.  And users
> > of this API will already likely want to use some TLS mechanism, which
> > already interacts heavily with libc (set_thread_area/clone).
> >
> > Allowing only one registration at a time seems preferable to
> > introducing another way to allocate kernel resources on a process's behalf.
> >
> > - Josh Triplett

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/3] Implement getcpu_cache system call
@ 2016-01-12 13:15             ` Mathieu Desnoyers
  0 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-12 13:15 UTC (permalink / raw)
  To: Ben Maurer
  Cc: Josh Triplett, Shane M Seymour, Thomas Gleixner, Paul Turner,
	Andrew Hunter, Peter Zijlstra, linux-kernel, linux-api,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ingo Molnar, rostedt, Paul E. McKenney, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Jan 11, 2016, at 11:27 PM, Ben Maurer bmaurer@fb.com wrote:

> One disadvantage of only allowing one is that high performance server
> applications tend to statically link. It'd suck to have to go through what ever
> type of relocation we'd need to pull this out of glibc. But if there's only one
> registration allowed a statically linked app couldn't create its own if glibc
> might use it some day.

One idea I have would be to let the kernel reserve some space either after the
first stack address (for a stack growing down) or at the beginning of the
allocated TLS area for each thread in copy_thread_tls() by fiddling with
sp or the tls base address when creating a thread.

In theory, this would allow always returning the same address, and the memory
would exist as long as the thread exists.

Not sure whether it may have unforeseen impact though.

Thoughts ?

Thanks,

Mathieu

> 
> 
> 
> Sent from my iPhone
> 
>> On Jan 11, 2016, at 6:46 PM, Josh Triplett <josh@joshtriplett.org> wrote:
>> 
>>> On Tue, Jan 12, 2016 at 12:49:18AM +0000, Mathieu Desnoyers wrote:
>>> ----- On Jan 11, 2016, at 6:03 PM, Josh Triplett josh@joshtriplett.org wrote:
>>> 
>>>>> On Mon, Jan 11, 2016 at 10:38:28PM +0000, Seymour, Shane M wrote:
>>>>> I have some concerns and suggestions for you about this.
>>>>> 
>>>>> What's to stop someone in user space from requesting an arbitrarily large number
>>>>> of CPU # cache locations that the kernel needs to allocate memory to track and
>>>>> each time the task migrates to a new CPU it needs to update them all? Could you
>>>>> use it to dramatically slow down a system/task switching? Should there be a
>>>>> ulimit type value or a sysctl setting to limit the number that you're allowed
>>>>> to register per-task?
>>>> 
>>>> The documented behavior of the syscall allows only one location per
>>>> thread, so the kernel can track that one and only address rather easily
>>>> in the task_struct.  Allowing dynamic allocation definitely doesn't seem
>>>> like a good idea.
>>> 
>>> The current implementation now allows more than one location per
>>> thread. Which piece of documentation states that only one location
>>> per thread is allowed ? This was indeed the case for the prior
>>> implementations, but I moved to implementing a linked-list of
>>> cpu_cache areas per thread to allow the getcpu_cache system call to
>>> be used by more than a single shared object within a given program.
>> 
>> Ah, I missed that change.
>> 
>>> Without the linked list, as soon as more than one shared object try
>>> to register their cache, the first one will prohibit all others from
>>> doing so.
>>> 
>>> We could perhaps try to document that this system call should only
>>> ever be used by *libc, and all libraries and applications should
>>> then use the libc TLS cache variable, but it seems rather fragile,
>>> and any app/lib could try to register its own cache.
>> 
>> That does seem a bit fragile, true; on the other hand, the linked-list
>> approach would allow userspace to allocate an unbounded amount of kernel
>> memory, without any particular control on it.  That doesn't seem
>> reasonable.  Introducing an rlimit or similar for this seems like
>> massive overkill, and hardcoding a fixed limit breaks the 0-1-infinity
>> rule.
>> 
>> Given that any registered location will always provide the same value,
>> allowing only a single registration doesn't seem *too* problematic;
>> libc-based programs can use the libc implementation, and non-libc-based
>> programs can register a location themselves.  And users of this API will
>> already likely want to use some TLS mechanism, which already interacts
>> heavily with libc (set_thread_area/clone).
>> 
>> Allowing only one registration at a time seems preferable to introducing
>> another way to allocate kernel resources on a process's behalf.
>> 
> > - Josh Triplett

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC PATCH 0/3] Implement getcpu_cache system call
@ 2016-01-12 21:02               ` Ben Maurer
  0 siblings, 0 replies; 42+ messages in thread
From: Ben Maurer @ 2016-01-12 21:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Josh Triplett, Shane M Seymour, Thomas Gleixner, Paul Turner,
	Andrew Hunter, Peter Zijlstra, linux-kernel, linux-api,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ingo Molnar, rostedt, Paul E. McKenney, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk



> One idea I have would be to let the kernel reserve some space either after the
> first stack address (for a stack growing down) or at the beginning of the
> allocated TLS area for each thread in copy_thread_tls() by fiddling with
> sp or the tls base address when creating a thread.

Could this be implemented by having glibc use a well-known symbol name to define the per-thread TLS area? If a high-performance application wants to avoid any relocations in accessing this variable, it would define it itself and that definition would override glibc's. This is how things work with malloc: glibc has a default malloc implementation, but we link jemalloc directly into our binaries. In addition to changing the malloc implementation, this means that calls to malloc don't go through the PLT.
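
Roughly like this, as a sketch only (the symbol name is just for illustration):

/* In glibc (one translation unit): a weak, overridable per-thread definition
 * that glibc registers with the new syscall at thread start. */
__attribute__((weak)) __thread int32_t __getcpu_cache;

/* In a statically linked application (separate translation unit): a strong
 * definition overrides the weak libc one at link time, exactly as linking
 * jemalloc overrides libc malloc, so accesses need no extra indirection. */
__thread int32_t __getcpu_cache;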

-b

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/3] Implement getcpu_cache system call
@ 2016-01-13  0:22                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-13  0:22 UTC (permalink / raw)
  To: Ben Maurer
  Cc: Josh Triplett, Shane M Seymour, Thomas Gleixner, Paul Turner,
	Andrew Hunter, Peter Zijlstra, linux-kernel, linux-api,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ingo Molnar, rostedt, Paul E. McKenney, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Jan 12, 2016, at 4:02 PM, Ben Maurer bmaurer@fb.com wrote:

>> One idea I have would be to let the kernel reserve some space either after the
>> first stack address (for a stack growing down) or at the beginning of the
>> allocated TLS area for each thread in copy_thread_tls() by fiddling with
>> sp or the tls base address when creating a thread.
> 
> Could this be implemented by having glibc use a well known symbol name to define
> the per-thread TLS area? If an high performance application wants to avoid any
> relocations in accessing this variable it would define it and that definition
> would override glibc's. This is how things work with malloc. glibc has a
> default malloc implementation but we link jemalloc directly into our binaries.
> in addition to changing the malloc implementation this means that calls to
> malloc don't go through the PLT.

Just to make sure I understand your proposal: defining a well known symbol
with a weak attribute in glibc (or bionic...), e.g.:

int32_t __thread __attribute__((weak)) __getcpu_cache;

so that applications which care about bypassing the PLT can override it with:

int32_t __thread __getcpu_cache;

glibc/bionic would be responsible for calling the getcpu_cache() system call
to register/unregister this TLS variable for each thread.

One thing I would like to figure out is whether we can use this in a way that
would allow introducing getcpu_cache() into applications and libraries
(e.g. lttng-ust tracer) before it gets implemented into glibc, in a way that
would keep forward compatibility for whenever it gets introduced in glibc.

We can declare __getcpu_cache as a weak symbol in arbitrary libraries, and
make them register/unregister the cache through the getcpu_cache syscall.
The main thing that I would need to tweak at the kernel level within the
system call would be to keep a refcount of the number of times the
__getcpu_cache is registered per thread. This would allow multiple registrations,
one per library (e.g. lttng-ust) and one for glibc, but we would validate
that they all register the exact same address for a given thread.

The reference counting trick should also work for cases where applications
define a non-weak __getcpu_cache, and want to call the getcpu_cache
system call to register it themselves (before glibc adds support for it).
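
For example, a library such as lttng-ust could carry something like the sketch
below, under the assumptions above: the symbol name follows this thread, the
syscall number and command values are placeholders, and the kernel refcounts
identical registrations.

#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_getcpu_cache
#define __NR_getcpu_cache 0                     /* placeholder */
#endif
#define GETCPU_CACHE_CMD_REGISTER       0       /* placeholder */
#define GETCPU_CACHE_CMD_UNREGISTER     1       /* placeholder */

/* Weak, so glibc (or any other library) can share the same definition. */
__attribute__((weak)) __thread int32_t __getcpu_cache;

void lib_thread_init(void)
{
        /* Registering the same address again just bumps the kernel refcount. */
        syscall(__NR_getcpu_cache, GETCPU_CACHE_CMD_REGISTER, &__getcpu_cache, 0);
}

void lib_thread_exit(void)
{
        /* Drops the refcount; the kernel stops updating the cache at zero. */
        syscall(__NR_getcpu_cache, GETCPU_CACHE_CMD_UNREGISTER, &__getcpu_cache, 0);
}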

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/3] Implement getcpu_cache system call
@ 2016-01-13  0:51                   ` Josh Triplett
  0 siblings, 0 replies; 42+ messages in thread
From: Josh Triplett @ 2016-01-13  0:51 UTC (permalink / raw)
  To: Mathieu Desnoyers, Ben Maurer
  Cc: Shane M Seymour, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra, linux-kernel, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, rostedt,
	Paul E. McKenney, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Will Deacon, Michael Kerrisk

On January 12, 2016 4:22:29 PM PST, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>----- On Jan 12, 2016, at 4:02 PM, Ben Maurer bmaurer@fb.com wrote:
>
>>> One idea I have would be to let the kernel reserve some space either
>after the
>>> first stack address (for a stack growing down) or at the beginning
>of the
>>> allocated TLS area for each thread in copy_thread_tls() by fiddling
>with
>>> sp or the tls base address when creating a thread.
>> 
>> Could this be implemented by having glibc use a well known symbol
>name to define
>> the per-thread TLS area? If an high performance application wants to
>avoid any
>> relocations in accessing this variable it would define it and that
>definition
>> would override glibc's. This is how things work with malloc. glibc
>has a
>> default malloc implementation but we link jemalloc directly into our
>binaries.
>> in addition to changing the malloc implementation this means that
>calls to
>> malloc don't go through the PLT.
>
>Just to make sure I understand your proposal: defining a well known
>symbol
>with a weak attribute in glibc (or bionic...), e.g.:
>
>int32_t __thread __attribute__((weak)) __getcpu_cache;
>
>so that applications which care about bypassing the PLT can override it
>with:
>
>int32_t __thread __getcpu_cache;
>
>glibc/bionic would be responsible for calling the getcpu_cache() system
>call
>to register/unregister this TLS variable for each thread.
>
>One thing I would like to figure out is whether we can use this in a
>way that
>would allow introducing getcpu_cache() into applications and libraries
>(e.g. lttng-ust tracer) before it gets implemented into glibc, in a way
>that
>would keep forward compatibility for whenever it gets introduced in
>glibc.
>
>We can declare __getcpu_cache as a weak symbol in arbitrary libraries,
>and
>make them register/unregister the cache through the getcpu_cache
>syscall.
>The main thing that I would need to tweak at the kernel level within
>the
>system call would be to keep a refcount of the number of times the
>__getcpu_cache is registered per thread. This would allow multiple
>registrations,
>one per library (e.g. lttng-ust) and one for glibc, but we would
>validate
>that they all register the exact same address for a given thread.
>
>The reference counting trick should also work for cases where
>applications
>define a non-weak __getcpu_cache, and want to call the getcpu_cache
>system call to register it themselves (before glibc adds support for
>it).

This seems like something better done in a tiny common library, rather than the kernel or by playing symbol resolution games.
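
Something like a tiny libgetcpucache exposing just this (all names here are
hypothetical, purely to illustrate the idea):

#include <stdint.h>

/* The single shared per-thread location; fast paths read it directly. */
extern __thread volatile int32_t getcpu_cache_value;

/* Registers getcpu_cache_value for the calling thread; returns 0 on success,
 * or -1 if the syscall is unavailable and callers should fall back to
 * sched_getcpu(). */
int getcpu_cache_thread_init(void);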

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 0/3] Implement getcpu_cache system call
@ 2016-01-14 15:58                     ` Mathieu Desnoyers
  0 siblings, 0 replies; 42+ messages in thread
From: Mathieu Desnoyers @ 2016-01-14 15:58 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Ben Maurer, Shane M Seymour, Thomas Gleixner, Paul Turner,
	Andrew Hunter, Peter Zijlstra, linux-kernel, linux-api,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ingo Molnar, rostedt, Paul E. McKenney, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Jan 12, 2016, at 7:51 PM, Josh Triplett josh@joshtriplett.org wrote:

> On January 12, 2016 4:22:29 PM PST, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>----- On Jan 12, 2016, at 4:02 PM, Ben Maurer bmaurer@fb.com wrote:
>>
>>>> One idea I have would be to let the kernel reserve some space either
>>>> after the first stack address (for a stack growing down) or at the
>>>> beginning of the allocated TLS area for each thread in copy_thread_tls()
>>>> by fiddling with sp or the tls base address when creating a thread.
>>>
>>> Could this be implemented by having glibc use a well-known symbol name to
>>> define the per-thread TLS area? If a high-performance application wants
>>> to avoid any relocations in accessing this variable, it would define it
>>> and that definition would override glibc's. This is how things work with
>>> malloc: glibc has a default malloc implementation, but we link jemalloc
>>> directly into our binaries. In addition to changing the malloc
>>> implementation, this means that calls to malloc don't go through the PLT.
>>
>>Just to make sure I understand your proposal: defining a well-known symbol
>>with a weak attribute in glibc (or bionic...), e.g.:
>>
>>int32_t __thread __attribute__((weak)) __getcpu_cache;
>>
>>so that applications which care about bypassing the PLT can override it
>>with:
>>
>>int32_t __thread __getcpu_cache;
>>
>>glibc/bionic would be responsible for calling the getcpu_cache() system
>>call to register/unregister this TLS variable for each thread.
>>
>>One thing I would like to figure out is whether we can use this in a way
>>that would allow introducing getcpu_cache() into applications and libraries
>>(e.g. the lttng-ust tracer) before it gets implemented into glibc, in a way
>>that would keep forward compatibility for whenever it gets introduced in
>>glibc.
>>
>>We can declare __getcpu_cache as a weak symbol in arbitrary libraries, and
>>make them register/unregister the cache through the getcpu_cache syscall.
>>The main thing that I would need to tweak at the kernel level within the
>>system call would be to keep a refcount of the number of times
>>__getcpu_cache is registered per thread. This would allow multiple
>>registrations, one per library (e.g. lttng-ust) and one for glibc, but we
>>would validate that they all register the exact same address for a given
>>thread.
>>
>>The reference counting trick should also work for cases where applications
>>define a non-weak __getcpu_cache and want to call the getcpu_cache system
>>call to register it themselves (before glibc adds support for it).
> 
> This seems like something better done in a tiny common library, rather than the
> kernel or by playing symbol resolution games.

It does not cost much to recommend a specific symbol name and mark that
symbol as weak in shared libraries. We could then also remove the "unregister"
command, which would mean that a library registering its cache can never be
unloaded. This would remove the need to keep track of registration and
unregistration with a reference count within the kernel.

We should then document that a registered cpu_cache should not be freed before
its associated thread exits.
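
For illustration, the register-once flow from a library's point of view might
look like the sketch below. getcpu_cache_register() is a hypothetical wrapper
standing in for the actual getcpu_cache() system call (whose argument layout
is defined by patch 1/3 and not repeated here); only the __getcpu_cache symbol
name and the "register once per thread, never free before thread exit" rule
come from this discussion:

        #include <stdint.h>
        #include <sched.h>

        /* Weak, so glibc (or the application) can later own the definition. */
        __attribute__((weak)) __thread int32_t __getcpu_cache;

        /* Hypothetical wrapper around the getcpu_cache() syscall; 0 on success. */
        extern int getcpu_cache_register(int32_t *cache);

        static __thread int getcpu_cache_registered;

        /* Read the current CPU number, registering the cache on first use in
         * each thread.  There is no unregister call: the TLS slot stays valid
         * until the thread exits, which matches the documented lifetime. */
        static inline int read_cpu(void)
        {
                if (!getcpu_cache_registered &&
                    getcpu_cache_register(&__getcpu_cache) == 0)
                        getcpu_cache_registered = 1;
                if (getcpu_cache_registered)
                        /* The kernel updates the cache asynchronously. */
                        return *(volatile int32_t *)&__getcpu_cache;
                return sched_getcpu();  /* fallback on kernels without the syscall */
        }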

Would that be simple enough, or too simplistic?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2016-01-14 15:59 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-05  7:01 [RFC PATCH 0/3] Implement getcpu_cache system call Mathieu Desnoyers
2016-01-05  7:01 ` Mathieu Desnoyers
2016-01-05  7:01 ` [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread Mathieu Desnoyers
2016-01-05  7:01   ` Mathieu Desnoyers
2016-01-05 12:04   ` Will Deacon
2016-01-05 17:31     ` Mathieu Desnoyers
2016-01-05 17:34       ` Mathieu Desnoyers
2016-01-05 17:34         ` Mathieu Desnoyers
2016-01-05 17:40       ` Russell King - ARM Linux
2016-01-05 17:40         ` Russell King - ARM Linux
2016-01-05 17:49         ` Mathieu Desnoyers
2016-01-05 21:47         ` Paul E. McKenney
2016-01-05 21:47           ` Paul E. McKenney
2016-01-05 22:34           ` Mathieu Desnoyers
2016-01-05 22:34             ` Mathieu Desnoyers
2016-01-05 22:54             ` Paul E. McKenney
2016-01-05 22:54               ` Paul E. McKenney
2016-01-05  7:01 ` [RFC PATCH 2/3] getcpu_cache: wire up ARM system call Mathieu Desnoyers
2016-01-05  7:02 ` [RFC PATCH 3/3] getcpu_cache: wire up x86 32/64 " Mathieu Desnoyers
2016-01-05  7:02   ` Mathieu Desnoyers
2016-01-11 22:38 ` [RFC PATCH 0/3] Implement getcpu_cache " Seymour, Shane M
2016-01-11 22:38   ` Seymour, Shane M
2016-01-11 23:03   ` Josh Triplett
2016-01-12  0:49     ` Mathieu Desnoyers
2016-01-12  0:49       ` Mathieu Desnoyers
2016-01-12  2:45       ` Josh Triplett
2016-01-12  4:27         ` Ben Maurer
2016-01-12  4:27           ` Ben Maurer
2016-01-12  6:40           ` Seymour, Shane M
2016-01-12  6:40             ` Seymour, Shane M
2016-01-12 13:15           ` Mathieu Desnoyers
2016-01-12 13:15             ` Mathieu Desnoyers
2016-01-12 21:02             ` Ben Maurer
2016-01-12 21:02               ` Ben Maurer
2016-01-13  0:22               ` Mathieu Desnoyers
2016-01-13  0:22                 ` Mathieu Desnoyers
2016-01-13  0:51                 ` Josh Triplett
2016-01-13  0:51                   ` Josh Triplett
2016-01-14 15:58                   ` Mathieu Desnoyers
2016-01-14 15:58                     ` Mathieu Desnoyers
2016-01-11 23:16   ` Seymour, Shane M
2016-01-11 23:16     ` Seymour, Shane M
