All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v2 0/3] getcpu_cache system call
@ 2016-01-27 16:54 Mathieu Desnoyers
  2016-01-27 16:54 ` [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread Mathieu Desnoyers
                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 16:54 UTC (permalink / raw)
  To: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra
  Cc: linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Hi,

Here is a patchset implementing a cache for the CPU number of the
currently running thread in user-space.

Benchmarks comparing this approach to a getcpu based on system call on
ARM show a 44x speedup. They show a 14x speedup on x86-64 compared to
executing lsl from a vDSO through glibc.

I'm added a man page in the changelog of patch 1/3, which shows an
example usage of this new system call.

This patchset is sent as RFC. It applies on Linux 4.4.

Feedback is welcome,

Thanks!

Mathieu


Mathieu Desnoyers (3):
  getcpu_cache system call: cache CPU number of running thread
  getcpu_cache: wire up ARM system call
  getcpu_cache: wire up x86 32/64 system call

 MAINTAINERS                            |   6 ++
 arch/arm/include/uapi/asm/unistd.h     |   1 +
 arch/arm/kernel/calls.S                |   1 +
 arch/arm/kernel/signal.c               |   1 +
 arch/x86/entry/common.c                |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/exec.c                              |   1 +
 include/linux/sched.h                  |  36 +++++++++++
 init/Kconfig                           |  10 ++++
 kernel/Makefile                        |   1 +
 kernel/fork.c                          |   4 ++
 kernel/getcpu_cache.c                  | 106 +++++++++++++++++++++++++++++++++
 kernel/sched/sched.h                   |   1 +
 kernel/sys_ni.c                        |   3 +
 15 files changed, 174 insertions(+)
 create mode 100644 kernel/getcpu_cache.c

-- 
2.1.4

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-27 16:54 [RFC PATCH v2 0/3] getcpu_cache system call Mathieu Desnoyers
@ 2016-01-27 16:54 ` Mathieu Desnoyers
  2016-01-27 17:20     ` Josh Triplett
                     ` (2 more replies)
  2016-01-27 16:54 ` [RFC PATCH v2 2/3] getcpu_cache: wire up ARM system call Mathieu Desnoyers
  2016-01-27 16:54 ` [RFC PATCH v2 3/3] getcpu_cache: wire up x86 32/64 " Mathieu Desnoyers
  2 siblings, 3 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 16:54 UTC (permalink / raw)
  To: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra
  Cc: linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Expose a new system call allowing threads to register one userspace
memory area where to store the CPU number on which the calling thread is
running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within each registered user-space memory
area. User-space can then read the current CPU number directly from
memory.

This getcpu cache is an improvement over current mechanisms available to
read the current CPU number, which has the following benefits:

- 44x speedup on ARM vs system call through glibc,
- 14x speedup on x86 compared to calling glibc, which calls vdso
  executing a "lsl" instruction,
- 11x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cached value can be read from an inline
  assembly, which makes it a useful building block for restartable
  sequences.
- The getcpu cache approach is portable (e.g. ARM), which is not the
  case for the lsl-based x86 vdso.

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the getcpu cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.

This approach is inspired by Paul Turner and Andrew Hunter's work
on percpu atomics, which lets the kernel handle restart of critical
sections:
Ref.:
* https://lkml.org/lkml/2015/10/27/1095
* https://lkml.org/lkml/2015/6/24/665
* https://lwn.net/Articles/650333/
* http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 10 (v7l)
Machine model: Wandboard i.MX6 Quad Board
- Baseline (empty loop):               10.1 ns
- Read CPU from getcpu cache:          10.1 ns
- glibc 2.19-0ubuntu6.6 getcpu:       445.6 ns
- getcpu system call:                 322.2 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop):                1.0 ns
- Read CPU from getcpu cache:           1.0 ns
- Read using gs segment selector:       1.0 ns
- "lsl" inline assembly:               11.2 ns
- glibc 2.19-0ubuntu6.6 getcpu:        14.3 ns
- getcpu system call:                  51.0 ns

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: linux-api@vger.kernel.org
---

Changes since v1:

- Return -1, errno=EINVAL if cpu_cache pointer is not aligned on
  sizeof(int32_t).
- Update man page to describe the pointer alignement requirements and
  update atomicity guarantees.
- Add MAINTAINERS file GETCPU_CACHE entry.
- Remove dynamic memory allocation: go back to having a single
  getcpu_cache entry per thread. Update documentation accordingly.
- Rebased on Linux 4.4.

Rationale for the getcpu_cache system call rather than the thread-local
ABI system call proposed earlier:

Rather than doing a "generic" thread-local ABI, specialize this system
call for a cpu number cache only. Anyway, the thread-local ABI approach
would have required that we introduce "feature" flags, which would have
ended up reimplementing multiplexing of features on top of a system
call. It seems better to introduce one system call per feature instead.

Man page associated:

GETCPU_CACHE(2)       Linux Programmer's Manual      GETCPU_CACHE(2)

NAME
       getcpu_cache  -  cache CPU number on which the calling thread
       is running

SYNOPSIS
       #include <stdint.h>

       int getcpu_cache(int32_t **cpu_cachep, int flags);

DESCRIPTION
       The getcpu_cache() helps speeding up reading the current  CPU
       number  by  ensuring  that  the memory location registered by
       each user-space thread is always updated with the CPU  number
       on which the thread is running when reading that memory loca‐
       tion.

       The cpu_cachep argument is a pointer to a int32_t pointer. It
       is  used  as  both  an input argument and output argument. As
       input, it expects the target location to contain a pointer to
       a  possible  cpu  number  cache  to use for this thread (this
       pointer must be naturally aligned on  4-byte  multiples),  or
       contain  NULL. As output, on success, it populates the target
       location with a pointer to the location  of  the  cpu  number
       cache  for  this thread. This cpu number cache address can be
       either the one provided as input, or one  already  registered
       by this thread previously.

       The  flags argument is currently unused and must be specified
       as 0.

       Typically, a library or application will put the  cpu  number
       cache  in  a  thread-local  storage variable, or other memory
       areas belonging to each thread. It is recommended to  perform
       a  volatile  read of the cpu number cache to prevent the com‐
       piler from doing load tearing. An alternative approach is  to
       read  the  cpu  number cache from inline assembly in a single
       instruction.

       Each thread is responsible for registering its own cpu number
       cache.   Only  one  cpu_cache  address  can be registered per
       thread. Following registration  will  return  the  previously
       registered address in the cpu_cachep target location.

       The  symbol  __getcpu_cache_tls  is  recommended  to  be used
       across libraries  and  applications  wishing  to  register  a
       thread-local  getcpu_cache.  The  attribute  "weak" is recom‐
       mended when declaring this variable in  libraries.   Applica‐
       tions  can  choose to define their own version of this symbol
       without the weak attribute as a performance improvement.

       In a typical usage scenario, the thread registering  the  cpu
       number  cache will be performing reads from that cache. It is
       however also allowed to read the cpu number cache from  other
       threads. The cpu number cache updates performed by the kernel
       provide single-copy atomicity semantics, which guarantee that
       other  threads performing single-copy atomic reads of the cpu
       number cache will always observe a consistent value.

       Memory registered as cpu number cache should never be deallo‐
       cated  before  the thread which registered it exits: specifi‐
       cally, it should not be freed, and the library containing the
       registered thread-local storage should not be dlclose'd.

       Unregistration  of  associated  cpu_cache are implicitly per‐
       formed when a thread or process exit.

RETURN VALUE
       A return value of 0 indicates success. On success, the memory
       location  pointed  to by cpu_cachep contains the address used
       as cpu number cache for this thread, which  may  differ  from
       the address provided as input.  On error, -1 is returned, and
       errno is set appropriately.

ERRORS
       EINVAL cpu_cachep points to a location containing an  invalid
              address, cpu_cachep points to a location containing an
              address which is not aligned on 4-byte  multiples,  or
              flags is non-zero.

       ENOSYS The  getcpu_cache()  system call is not implemented by
              this kernel.

       EFAULT cpu_cachep is an invalid address, or cpu_cachep points
              to a location containing an invalid address.

VERSIONS
       The getcpu_cache() system call was added in Linux 4.X (TODO).

CONFORMING TO
       getcpu_cache() is Linux-specific.

EXAMPLE
       The  following  code  uses  the getcpu_cache() system call to
       keep a thread local storage variable up to date with the cur‐
       rent  CPU  number.  For  example  simplicity,  it  is done in
       main(), but  multithreaded  programs  would  need  to  invoke
       getcpu_cache() from each program thread.

           #define _GNU_SOURCE
           #include <stdlib.h>
           #include <stdio.h>
           #include <unistd.h>
           #include <stdint.h>
           #include <sys/syscall.h>

           static inline int
           getcpu_cache(volatile int32_t **cpu_cachep, int flags)
           {
               return syscall(__NR_getcpu_cache, cpu_cachep, flags);
           }

           /*
            * __getcpu_cache_tls is recommended as symbol name. Weak
            * attribute is recommended when declaring this variable in
            * libraries. Applications can choose to define their own
            * version of this symbol without the weak attribute as a
            * performance improvement.
            */
           __thread __attribute__((weak)) volatile int32_t __getcpu_cache_tls;

           int
           main(int argc, char **argv)
           {
               volatile int32_t *cpu_cache = &__getcpu_cache_tls;

               if (getcpu_cache(&cpu_cache, 0) < 0) {
                   perror("getcpu_cache");
                   exit(EXIT_FAILURE);
               }
               if (cpu_cache != &__getcpu_cache_tls) {
                   fprintf(stderr, "Unexpected CPU cache pointer %p\n",
                           cpu_cache);
                   exit(EXIT_FAILURE);
               }

               printf("Current CPU number: %d\n", __getcpu_cache_tls);

               exit(EXIT_SUCCESS);
           }

Linux                        2016-01-27              GETCPU_CACHE(2)
---
 MAINTAINERS           |   6 +++
 fs/exec.c             |   1 +
 include/linux/sched.h |  36 +++++++++++++++++
 init/Kconfig          |  10 +++++
 kernel/Makefile       |   1 +
 kernel/fork.c         |   4 ++
 kernel/getcpu_cache.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |   1 +
 kernel/sys_ni.c       |   3 ++
 9 files changed, 168 insertions(+)
 create mode 100644 kernel/getcpu_cache.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 233f834..e9106b7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4712,6 +4712,12 @@ M:	Joe Perches <joe@perches.com>
 S:	Maintained
 F:	scripts/get_maintainer.pl
 
+GETCPU_CACHE SUPPORT
+M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	kernel/getcpu_cache.c
+
 GFS2 FILE SYSTEM
 M:	Steven Whitehouse <swhiteho@redhat.com>
 M:	Bob Peterson <rpeterso@redhat.com>
diff --git a/fs/exec.c b/fs/exec.c
index b06623a..1d66af6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1594,6 +1594,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	getcpu_cache_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current);
 	free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fa39434..2fa2db8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1813,6 +1813,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_GETCPU_CACHE
+	int32_t __user *cpu_cache;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -3190,4 +3193,37 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+#ifdef CONFIG_GETCPU_CACHE
+void getcpu_cache_fork(struct task_struct *t);
+void getcpu_cache_execve(struct task_struct *t);
+void getcpu_cache_exit(struct task_struct *t);
+void __getcpu_cache_handle_notify_resume(struct task_struct *t);
+static inline void getcpu_cache_set_notify_resume(struct task_struct *t)
+{
+	if (t->cpu_cache)
+		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+static inline void getcpu_cache_handle_notify_resume(struct task_struct *t)
+{
+	if (t->cpu_cache)
+		__getcpu_cache_handle_notify_resume(t);
+}
+#else
+static inline void getcpu_cache_fork(struct task_struct *t)
+{
+}
+static inline void getcpu_cache_execve(struct task_struct *t)
+{
+}
+static inline void getcpu_cache_exit(struct task_struct *t)
+{
+}
+static inline void getcpu_cache_set_notify_resume(struct task_struct *t)
+{
+}
+static inline void getcpu_cache_handle_notify_resume(struct task_struct *t)
+{
+}
+#endif
+
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index 235c7a2..fee2fa1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1614,6 +1614,16 @@ config MEMBARRIER
 
 	  If unsure, say Y.
 
+config GETCPU_CACHE
+	bool "Enable getcpu cache" if EXPERT
+	default y
+	help
+	  Enable the getcpu cache system call. It provides a user-space
+	  cache for the current CPU number value, which speeds up
+	  getting the current CPU number from user-space.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf00..b630247 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_GETCPU_CACHE) += getcpu_cache.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1155eac..37d0645 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -252,6 +252,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(tsk == current);
 
 	cgroup_free(tsk);
+	getcpu_cache_exit(tsk);
 	task_numa_free(tsk);
 	security_task_free(tsk);
 	exit_creds(tsk);
@@ -1554,6 +1555,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 */
 	copy_seccomp(p);
 
+	if (!(clone_flags & CLONE_THREAD))
+		getcpu_cache_fork(p);
+
 	/*
 	 * Process group and session signals need to be delivered to just the
 	 * parent before the fork or both the parent and the child after the
diff --git a/kernel/getcpu_cache.c b/kernel/getcpu_cache.c
new file mode 100644
index 0000000..4a1bda5
--- /dev/null
+++ b/kernel/getcpu_cache.c
@@ -0,0 +1,106 @@
+/*
+ * Copyright (C) 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * getcpu cache system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+
+static int getcpu_cache_update(int32_t __user *cpu_cache)
+{
+	if (put_user(raw_smp_processor_id(), cpu_cache))
+		return -1;
+	return 0;
+}
+
+/*
+ * This resume handler should always be executed between a migration
+ * triggered by preemption and return to user-space.
+ */
+void __getcpu_cache_handle_notify_resume(struct task_struct *t)
+{
+	if (unlikely(t->flags & PF_EXITING))
+		return;
+	if (getcpu_cache_update(t->cpu_cache))
+		force_sig(SIGSEGV, t);
+}
+
+/*
+ * If parent process has a thread-local ABI, the child inherits. Only applies
+ * when forking a process, not a thread.
+ */
+void getcpu_cache_fork(struct task_struct *t)
+{
+	t->cpu_cache = current->cpu_cache;
+}
+
+void getcpu_cache_execve(struct task_struct *t)
+{
+	t->cpu_cache = NULL;
+}
+
+void getcpu_cache_exit(struct task_struct *t)
+{
+	t->cpu_cache = NULL;
+}
+
+/*
+ * sys_getcpu_cache - setup getcpu cache for caller thread
+ */
+SYSCALL_DEFINE2(getcpu_cache, int32_t __user **, cpu_cachep, int, flags)
+{
+	int32_t __user *cpu_cache;
+
+	if (unlikely(flags))
+		return -EINVAL;
+	/* Check if cpu_cache is already registered. */
+	if (current->cpu_cache) {
+		if (put_user(current->cpu_cache, cpu_cachep))
+			return -EFAULT;
+		return 0;
+	}
+	if (get_user(cpu_cache, cpu_cachep))
+			return -EFAULT;
+	if (unlikely(!IS_ALIGNED((unsigned long)cpu_cache, sizeof(int32_t))
+			|| !cpu_cache))
+		return -EINVAL;
+	/*
+	 * Do an initial cpu cache update to ensure we won't hit
+	 * SIGSEGV if put_user() fails in the resume notifier.
+	 */
+	if (getcpu_cache_update(cpu_cache)) {
+		return -EFAULT;
+	}
+	current->cpu_cache = cpu_cache;
+	/*
+	 * Migration checks the getcpu cache to see whether the
+	 * notify_resume flag should be set.
+	 * Therefore, we need to ensure that the scheduler sees
+	 * the getcpu cache pointer update before we update the getcpu
+	 * cache content with the current CPU number.
+	 *
+	 * Set cpu_cache pointer before updating content.
+	 */
+	barrier();
+	/*
+	 * Set the resume notifier to ensure we update the current CPU
+	 * number before returning to userspace if needed. This handles
+	 * migration happening between the initial
+	 * get_cpu_cache_update() call and setting the current
+	 * cpu_cache pointer.
+	 */
+	getcpu_cache_set_notify_resume(current);
+	return 0;
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b242775..3edcd13 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -957,6 +957,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 {
 	set_task_rq(p, cpu);
 #ifdef CONFIG_SMP
+	getcpu_cache_set_notify_resume(p);
 	/*
 	 * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
 	 * successfuly executed on another CPU. We must ensure that updates of
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0623787..1e1c299 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -249,3 +249,6 @@ cond_syscall(sys_execveat);
 
 /* membarrier */
 cond_syscall(sys_membarrier);
+
+/* thread-local ABI */
+cond_syscall(sys_getcpu_cache);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 2/3] getcpu_cache: wire up ARM system call
  2016-01-27 16:54 [RFC PATCH v2 0/3] getcpu_cache system call Mathieu Desnoyers
  2016-01-27 16:54 ` [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread Mathieu Desnoyers
@ 2016-01-27 16:54 ` Mathieu Desnoyers
  2016-01-27 18:19     ` Russell King - ARM Linux
  2016-01-27 16:54 ` [RFC PATCH v2 3/3] getcpu_cache: wire up x86 32/64 " Mathieu Desnoyers
  2 siblings, 1 reply; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 16:54 UTC (permalink / raw)
  To: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra
  Cc: linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Wire up the getcpu cache system call on 32-bit ARM. Call the
getcpu_cache_handle_notify_resume() function on return to
userspace if TIF_NOTIFY_RESUME thread flag is set.

This provides an ABI improving the speed of a getcpu operation
on ARM by skipping the getcpu system call on the fast path.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: linux-api@vger.kernel.org
---
 arch/arm/include/uapi/asm/unistd.h | 1 +
 arch/arm/kernel/calls.S            | 1 +
 arch/arm/kernel/signal.c           | 1 +
 3 files changed, 3 insertions(+)

diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index ede692f..ea0ea94 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -417,6 +417,7 @@
 #define __NR_userfaultfd		(__NR_SYSCALL_BASE+388)
 #define __NR_membarrier			(__NR_SYSCALL_BASE+389)
 #define __NR_mlock2			(__NR_SYSCALL_BASE+390)
+#define __NR_getcpu_cache		(__NR_SYSCALL_BASE+391)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index ac368bb..aa43e29 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -400,6 +400,7 @@
 		CALL(sys_userfaultfd)
 		CALL(sys_membarrier)
 		CALL(sys_mlock2)
+/* 391 */	CALL(sys_getcpu_cache)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index 7b8f214..ff5052c 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -594,6 +594,7 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 			} else {
 				clear_thread_flag(TIF_NOTIFY_RESUME);
 				tracehook_notify_resume(regs);
+				getcpu_cache_handle_notify_resume(current);
 			}
 		}
 		local_irq_disable();
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 3/3] getcpu_cache: wire up x86 32/64 system call
  2016-01-27 16:54 [RFC PATCH v2 0/3] getcpu_cache system call Mathieu Desnoyers
  2016-01-27 16:54 ` [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread Mathieu Desnoyers
  2016-01-27 16:54 ` [RFC PATCH v2 2/3] getcpu_cache: wire up ARM system call Mathieu Desnoyers
@ 2016-01-27 16:54 ` Mathieu Desnoyers
  2 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 16:54 UTC (permalink / raw)
  To: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra
  Cc: linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Wire up the getcpu_cache system call on x86 32/64. Call the
getcpu_cache_handle_notify_resume() function on return to
userspace if TIF_NOTIFY_RESUME thread flag is set.

This provides an ABI improving the speed of a getcpu operation
on x86 by removing the need to perform a function call, "lsl"
instruction, or system call on the fast path.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: linux-api@vger.kernel.org
---
 arch/x86/entry/common.c                | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 3 files changed, 3 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 0366374..eb6bcae 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -249,6 +249,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_NOTIFY_RESUME) {
 			clear_thread_flag(TIF_NOTIFY_RESUME);
 			tracehook_notify_resume(regs);
+			getcpu_cache_handle_notify_resume(current);
 		}
 
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f17705e..b3ea491 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -383,3 +383,4 @@
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
 376	i386	mlock2			sys_mlock2
+377	i386	getcpu_cache		sys_getcpu_cache
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 314a90b..442aaa0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -332,6 +332,7 @@
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
 325	common	mlock2			sys_mlock2
+326	common	getcpu_cache		sys_getcpu_cache
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 17:20     ` Josh Triplett
  0 siblings, 0 replies; 47+ messages in thread
From: Josh Triplett @ 2016-01-27 17:20 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 11:54:41AM -0500, Mathieu Desnoyers wrote:
> Expose a new system call allowing threads to register one userspace
> memory area where to store the CPU number on which the calling thread is
> running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
> current thread. Upon return to user-space, a notify-resume handler
> updates the current CPU value within each registered user-space memory
> area. User-space can then read the current CPU number directly from
> memory.
> 
> This getcpu cache is an improvement over current mechanisms available to
> read the current CPU number, which has the following benefits:
> 
> - 44x speedup on ARM vs system call through glibc,
> - 14x speedup on x86 compared to calling glibc, which calls vdso
>   executing a "lsl" instruction,
> - 11x speedup on x86 compared to inlined "lsl" instruction,
> - Unlike vdso approaches, this cached value can be read from an inline
>   assembly, which makes it a useful building block for restartable
>   sequences.
> - The getcpu cache approach is portable (e.g. ARM), which is not the
>   case for the lsl-based x86 vdso.
> 
> On x86, yet another possible approach would be to use the gs segment
> selector to point to user-space per-cpu data. This approach performs
> similarly to the getcpu cache, but it has two disadvantages: it is
> not portable, and it is incompatible with existing applications already
> using the gs segment selector for other purposes.
> 
> This approach is inspired by Paul Turner and Andrew Hunter's work
> on percpu atomics, which lets the kernel handle restart of critical
> sections:
> Ref.:
> * https://lkml.org/lkml/2015/10/27/1095
> * https://lkml.org/lkml/2015/6/24/665
> * https://lwn.net/Articles/650333/
> * http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
> 
> Benchmarking various approaches for reading the current CPU number:
> 
> ARMv7 Processor rev 10 (v7l)
> Machine model: Wandboard i.MX6 Quad Board
> - Baseline (empty loop):               10.1 ns
> - Read CPU from getcpu cache:          10.1 ns
> - glibc 2.19-0ubuntu6.6 getcpu:       445.6 ns
> - getcpu system call:                 322.2 ns
> 
> x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
> - Baseline (empty loop):                1.0 ns
> - Read CPU from getcpu cache:           1.0 ns
> - Read using gs segment selector:       1.0 ns
> - "lsl" inline assembly:               11.2 ns
> - glibc 2.19-0ubuntu6.6 getcpu:        14.3 ns
> - getcpu system call:                  51.0 ns
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Paul Turner <pjt@google.com>
> CC: Andrew Hunter <ahh@google.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Andi Kleen <andi@firstfloor.org>
> CC: Dave Watson <davejwatson@fb.com>
> CC: Chris Lameter <cl@linux.com>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Ben Maurer <bmaurer@fb.com>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> CC: Josh Triplett <josh@joshtriplett.org>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Russell King <linux@arm.linux.org.uk>
> CC: Catalin Marinas <catalin.marinas@arm.com>
> CC: Will Deacon <will.deacon@arm.com>
> CC: Michael Kerrisk <mtk.manpages@gmail.com>
> CC: linux-api@vger.kernel.org
> ---
> 
> Changes since v1:
> 
> - Return -1, errno=EINVAL if cpu_cache pointer is not aligned on
>   sizeof(int32_t).
> - Update man page to describe the pointer alignement requirements and
>   update atomicity guarantees.
> - Add MAINTAINERS file GETCPU_CACHE entry.
> - Remove dynamic memory allocation: go back to having a single
>   getcpu_cache entry per thread. Update documentation accordingly.
> - Rebased on Linux 4.4.

With the dynamic allocation removed, this seems sensible to me.  One
minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
number should never need to hold a negative number.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 17:20     ` Josh Triplett
  0 siblings, 0 replies; 47+ messages in thread
From: Josh Triplett @ 2016-01-27 17:20 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 11:54:41AM -0500, Mathieu Desnoyers wrote:
> Expose a new system call allowing threads to register one userspace
> memory area where to store the CPU number on which the calling thread is
> running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
> current thread. Upon return to user-space, a notify-resume handler
> updates the current CPU value within each registered user-space memory
> area. User-space can then read the current CPU number directly from
> memory.
> 
> This getcpu cache is an improvement over current mechanisms available to
> read the current CPU number, which has the following benefits:
> 
> - 44x speedup on ARM vs system call through glibc,
> - 14x speedup on x86 compared to calling glibc, which calls vdso
>   executing a "lsl" instruction,
> - 11x speedup on x86 compared to inlined "lsl" instruction,
> - Unlike vdso approaches, this cached value can be read from an inline
>   assembly, which makes it a useful building block for restartable
>   sequences.
> - The getcpu cache approach is portable (e.g. ARM), which is not the
>   case for the lsl-based x86 vdso.
> 
> On x86, yet another possible approach would be to use the gs segment
> selector to point to user-space per-cpu data. This approach performs
> similarly to the getcpu cache, but it has two disadvantages: it is
> not portable, and it is incompatible with existing applications already
> using the gs segment selector for other purposes.
> 
> This approach is inspired by Paul Turner and Andrew Hunter's work
> on percpu atomics, which lets the kernel handle restart of critical
> sections:
> Ref.:
> * https://lkml.org/lkml/2015/10/27/1095
> * https://lkml.org/lkml/2015/6/24/665
> * https://lwn.net/Articles/650333/
> * http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
> 
> Benchmarking various approaches for reading the current CPU number:
> 
> ARMv7 Processor rev 10 (v7l)
> Machine model: Wandboard i.MX6 Quad Board
> - Baseline (empty loop):               10.1 ns
> - Read CPU from getcpu cache:          10.1 ns
> - glibc 2.19-0ubuntu6.6 getcpu:       445.6 ns
> - getcpu system call:                 322.2 ns
> 
> x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
> - Baseline (empty loop):                1.0 ns
> - Read CPU from getcpu cache:           1.0 ns
> - Read using gs segment selector:       1.0 ns
> - "lsl" inline assembly:               11.2 ns
> - glibc 2.19-0ubuntu6.6 getcpu:        14.3 ns
> - getcpu system call:                  51.0 ns
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
> CC: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> CC: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
> CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
> CC: Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
> CC: Dave Watson <davejwatson-b10kYP2dOMg@public.gmane.org>
> CC: Chris Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
> CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> CC: Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org>
> CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
> CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> CC: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
> CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> CC: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
> CC: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
> CC: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> CC: Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> CC: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> ---
> 
> Changes since v1:
> 
> - Return -1, errno=EINVAL if cpu_cache pointer is not aligned on
>   sizeof(int32_t).
> - Update man page to describe the pointer alignement requirements and
>   update atomicity guarantees.
> - Add MAINTAINERS file GETCPU_CACHE entry.
> - Remove dynamic memory allocation: go back to having a single
>   getcpu_cache entry per thread. Update documentation accordingly.
> - Rebased on Linux 4.4.

With the dynamic allocation removed, this seems sensible to me.  One
minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
number should never need to hold a negative number.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 17:22     ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2016-01-27 17:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul Turner, Andrew Hunter, Peter Zijlstra, linux-kernel,
	linux-api, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ingo Molnar, Ben Maurer, Steven Rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
> +/*
> + * sys_getcpu_cache - setup getcpu cache for caller thread
> + */
> +SYSCALL_DEFINE2(getcpu_cache, int32_t __user **, cpu_cachep, int, flags)
> +{
> +	int32_t __user *cpu_cache;
> +
> +	if (unlikely(flags))
> +		return -EINVAL;
> +	/* Check if cpu_cache is already registered. */
> +	if (current->cpu_cache) {
> +		if (put_user(current->cpu_cache, cpu_cachep))
> +			return -EFAULT;
> +		return 0;
> +	}

This is really odd. How is the caller supposed to differentiate between:

  1) Get the installed cpucache pointer

  2) Set the cpucache pointer

We really want clearly seperated functionality here.

   getcpu_cache(ptr, GET_CACHEP);
   
and   

   getcpu_cache(ptr, SET_CACHEP);

    Returns -EBUSY if current->cpu_cache is already set, except we allow
    replacing the pointer.  

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 17:22     ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2016-01-27 17:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
> +/*
> + * sys_getcpu_cache - setup getcpu cache for caller thread
> + */
> +SYSCALL_DEFINE2(getcpu_cache, int32_t __user **, cpu_cachep, int, flags)
> +{
> +	int32_t __user *cpu_cache;
> +
> +	if (unlikely(flags))
> +		return -EINVAL;
> +	/* Check if cpu_cache is already registered. */
> +	if (current->cpu_cache) {
> +		if (put_user(current->cpu_cache, cpu_cachep))
> +			return -EFAULT;
> +		return 0;
> +	}

This is really odd. How is the caller supposed to differentiate between:

  1) Get the installed cpucache pointer

  2) Set the cpucache pointer

We really want clearly seperated functionality here.

   getcpu_cache(ptr, GET_CACHEP);
   
and   

   getcpu_cache(ptr, SET_CACHEP);

    Returns -EBUSY if current->cpu_cache is already set, except we allow
    replacing the pointer.  

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-27 17:20     ` Josh Triplett
@ 2016-01-27 17:24       ` Thomas Gleixner
  -1 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2016-01-27 17:24 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Mathieu Desnoyers, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, 27 Jan 2016, Josh Triplett wrote:
> With the dynamic allocation removed, this seems sensible to me.  One
> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
> number should never need to hold a negative number.

You try to block the future of computing: https://lwn.net/Articles/638673/


	 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 17:24       ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2016-01-27 17:24 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Mathieu Desnoyers, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, 27 Jan 2016, Josh Triplett wrote:
> With the dynamic allocation removed, this seems sensible to me.  One
> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
> number should never need to hold a negative number.

You try to block the future of computing: https://lwn.net/Articles/638673/


	 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-27 17:22     ` Thomas Gleixner
@ 2016-01-27 17:31       ` Mathieu Desnoyers
  -1 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 17:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul Turner, Andrew Hunter, Peter Zijlstra, linux-kernel,
	linux-api, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx@linutronix.de wrote:

> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
>> +/*
>> + * sys_getcpu_cache - setup getcpu cache for caller thread
>> + */
>> +SYSCALL_DEFINE2(getcpu_cache, int32_t __user **, cpu_cachep, int, flags)
>> +{
>> +	int32_t __user *cpu_cache;
>> +
>> +	if (unlikely(flags))
>> +		return -EINVAL;
>> +	/* Check if cpu_cache is already registered. */
>> +	if (current->cpu_cache) {
>> +		if (put_user(current->cpu_cache, cpu_cachep))
>> +			return -EFAULT;
>> +		return 0;
>> +	}
> 
> This is really odd. How is the caller supposed to differentiate between:
> 
>  1) Get the installed cpucache pointer
> 
>  2) Set the cpucache pointer
> 
> We really want clearly seperated functionality here.
> 
>   getcpu_cache(ptr, GET_CACHEP);
>   
> and
> 
>   getcpu_cache(ptr, SET_CACHEP);
> 
>    Returns -EBUSY if current->cpu_cache is already set, except we allow
>    replacing the pointer.

Sounds fair. What is the recommended typing for "ptr" then ?
uint32_t ** or uint32_t * ?

It would be expected to pass a "uint32_t *" for the set
operation, but the "get" operation requires a "uint32_t **".

Also, I'd be tempted to put the GET/SET operation selector as
a first parameter.

Thanks,

Mathieu

> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 17:31       ` Mathieu Desnoyers
  0 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 17:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:

> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
>> +/*
>> + * sys_getcpu_cache - setup getcpu cache for caller thread
>> + */
>> +SYSCALL_DEFINE2(getcpu_cache, int32_t __user **, cpu_cachep, int, flags)
>> +{
>> +	int32_t __user *cpu_cache;
>> +
>> +	if (unlikely(flags))
>> +		return -EINVAL;
>> +	/* Check if cpu_cache is already registered. */
>> +	if (current->cpu_cache) {
>> +		if (put_user(current->cpu_cache, cpu_cachep))
>> +			return -EFAULT;
>> +		return 0;
>> +	}
> 
> This is really odd. How is the caller supposed to differentiate between:
> 
>  1) Get the installed cpucache pointer
> 
>  2) Set the cpucache pointer
> 
> We really want clearly seperated functionality here.
> 
>   getcpu_cache(ptr, GET_CACHEP);
>   
> and
> 
>   getcpu_cache(ptr, SET_CACHEP);
> 
>    Returns -EBUSY if current->cpu_cache is already set, except we allow
>    replacing the pointer.

Sounds fair. What is the recommended typing for "ptr" then ?
uint32_t ** or uint32_t * ?

It would be expected to pass a "uint32_t *" for the set
operation, but the "get" operation requires a "uint32_t **".

Also, I'd be tempted to put the GET/SET operation selector as
a first parameter.

Thanks,

Mathieu

> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 17:34         ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2016-01-27 17:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul Turner, Andrew Hunter, Peter Zijlstra, linux-kernel,
	linux-api, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
> ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx@linutronix.de wrote:
> Sounds fair. What is the recommended typing for "ptr" then ?
> uint32_t ** or uint32_t * ?
> 
> It would be expected to pass a "uint32_t *" for the set
> operation, but the "get" operation requires a "uint32_t **".

Well, you can't change the types depending on the opcode, so you need to stick
with **.
 
> Also, I'd be tempted to put the GET/SET operation selector as
> a first parameter.

Makes sense.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 17:34         ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2016-01-27 17:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
> ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
> Sounds fair. What is the recommended typing for "ptr" then ?
> uint32_t ** or uint32_t * ?
> 
> It would be expected to pass a "uint32_t *" for the set
> operation, but the "get" operation requires a "uint32_t **".

Well, you can't change the types depending on the opcode, so you need to stick
with **.
 
> Also, I'd be tempted to put the GET/SET operation selector as
> a first parameter.

Makes sense.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-27 17:24       ` Thomas Gleixner
@ 2016-01-27 17:36         ` Mathieu Desnoyers
  -1 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 17:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Josh Triplett, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 27, 2016, at 12:24 PM, Thomas Gleixner tglx@linutronix.de wrote:

> On Wed, 27 Jan 2016, Josh Triplett wrote:
>> With the dynamic allocation removed, this seems sensible to me.  One
>> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
>> number should never need to hold a negative number.
> 
> You try to block the future of computing: https://lwn.net/Articles/638673/

Besides impossible architectures, there is actually a use-case for
signedness here. It makes it possible to initialize the cpu number
cache to a negative value, e.g. -1, in userspace. Then, a check for
value < 0 can be used to figure out cases where the getcpu_cache
system call is not implemented, and where a fallback (vdso or getcpu
syscall) needs to be used.

This is why I have chosen a signed type for the cpu cache so far.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 17:36         ` Mathieu Desnoyers
  0 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 17:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Josh Triplett, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 27, 2016, at 12:24 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:

> On Wed, 27 Jan 2016, Josh Triplett wrote:
>> With the dynamic allocation removed, this seems sensible to me.  One
>> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
>> number should never need to hold a negative number.
> 
> You try to block the future of computing: https://lwn.net/Articles/638673/

Besides impossible architectures, there is actually a use-case for
signedness here. It makes it possible to initialize the cpu number
cache to a negative value, e.g. -1, in userspace. Then, a check for
value < 0 can be used to figure out cases where the getcpu_cache
system call is not implemented, and where a fallback (vdso or getcpu
syscall) needs to be used.

This is why I have chosen a signed type for the cpu cache so far.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-27 17:34         ` Thomas Gleixner
@ 2016-01-27 17:37           ` Thomas Gleixner
  -1 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2016-01-27 17:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul Turner, Andrew Hunter, Peter Zijlstra, linux-kernel,
	linux-api, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, 27 Jan 2016, Thomas Gleixner wrote:

> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
> > ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx@linutronix.de wrote:
> > Sounds fair. What is the recommended typing for "ptr" then ?
> > uint32_t ** or uint32_t * ?
> > 
> > It would be expected to pass a "uint32_t *" for the set
> > operation, but the "get" operation requires a "uint32_t **".
> 
> Well, you can't change the types depending on the opcode, so you need to stick
> with **.

Alternatively you make it:

 (opcode, *newptr, **oldptr, flags);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 17:37           ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2016-01-27 17:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Wed, 27 Jan 2016, Thomas Gleixner wrote:

> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
> > ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
> > Sounds fair. What is the recommended typing for "ptr" then ?
> > uint32_t ** or uint32_t * ?
> > 
> > It would be expected to pass a "uint32_t *" for the set
> > operation, but the "get" operation requires a "uint32_t **".
> 
> Well, you can't change the types depending on the opcode, so you need to stick
> with **.

Alternatively you make it:

 (opcode, *newptr, **oldptr, flags);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 18:02           ` Andrew Hunter
  0 siblings, 0 replies; 47+ messages in thread
From: Andrew Hunter @ 2016-01-27 18:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Josh Triplett, Paul Turner, Peter Zijlstra,
	lkml, linux-api, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 9:36 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> ----- On Jan 27, 2016, at 12:24 PM, Thomas Gleixner tglx@linutronix.de wrote:
>
>> On Wed, 27 Jan 2016, Josh Triplett wrote:
>>> With the dynamic allocation removed, this seems sensible to me.  One
>>> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
>>> number should never need to hold a negative number.
>>
>> You try to block the future of computing: https://lwn.net/Articles/638673/
>
> Besides impossible architectures, there is actually a use-case for
> signedness here. It makes it possible to initialize the cpu number
> cache to a negative value, e.g. -1, in userspace. Then, a check for
> value < 0 can be used to figure out cases where the getcpu_cache
> system call is not implemented, and where a fallback (vdso or getcpu
> syscall) needs to be used.
>
> This is why I have chosen a signed type for the cpu cache so far.
>

In our internal version of this patch (part of the RSEQ system
discussed elsewhere) we have a signed CPU id for this reason.  I think
it's a good idea to keep that in userspace and it makes more sense to
match the user and kernel versions of the types.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 18:02           ` Andrew Hunter
  0 siblings, 0 replies; 47+ messages in thread
From: Andrew Hunter @ 2016-01-27 18:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Josh Triplett, Paul Turner, Peter Zijlstra,
	lkml, linux-api, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 9:36 AM, Mathieu Desnoyers
<mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> ----- On Jan 27, 2016, at 12:24 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
>
>> On Wed, 27 Jan 2016, Josh Triplett wrote:
>>> With the dynamic allocation removed, this seems sensible to me.  One
>>> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
>>> number should never need to hold a negative number.
>>
>> You try to block the future of computing: https://lwn.net/Articles/638673/
>
> Besides impossible architectures, there is actually a use-case for
> signedness here. It makes it possible to initialize the cpu number
> cache to a negative value, e.g. -1, in userspace. Then, a check for
> value < 0 can be used to figure out cases where the getcpu_cache
> system call is not implemented, and where a fallback (vdso or getcpu
> syscall) needs to be used.
>
> This is why I have chosen a signed type for the cpu cache so far.
>

In our internal version of this patch (part of the RSEQ system
discussed elsewhere) we have a signed CPU id for this reason.  I think
it's a good idea to keep that in userspace and it makes more sense to
match the user and kernel versions of the types.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 18:03           ` Josh Triplett
  0 siblings, 0 replies; 47+ messages in thread
From: Josh Triplett @ 2016-01-27 18:03 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 05:36:48PM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 27, 2016, at 12:24 PM, Thomas Gleixner tglx@linutronix.de wrote:
> > On Wed, 27 Jan 2016, Josh Triplett wrote:
> >> With the dynamic allocation removed, this seems sensible to me.  One
> >> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
> >> number should never need to hold a negative number.
> > 
> > You try to block the future of computing: https://lwn.net/Articles/638673/
> 
> Besides impossible architectures, there is actually a use-case for
> signedness here. It makes it possible to initialize the cpu number
> cache to a negative value, e.g. -1, in userspace. Then, a check for
> value < 0 can be used to figure out cases where the getcpu_cache
> system call is not implemented, and where a fallback (vdso or getcpu
> syscall) needs to be used.
> 
> This is why I have chosen a signed type for the cpu cache so far.

If getcpu_cache doesn't exist, you'll get ENOSYS.  If getcpu_cache
returns 0, then you can assume the kernel will give you a valid CPU
number.

- Josh Triplett

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 18:03           ` Josh Triplett
  0 siblings, 0 replies; 47+ messages in thread
From: Josh Triplett @ 2016-01-27 18:03 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 05:36:48PM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 27, 2016, at 12:24 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
> > On Wed, 27 Jan 2016, Josh Triplett wrote:
> >> With the dynamic allocation removed, this seems sensible to me.  One
> >> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
> >> number should never need to hold a negative number.
> > 
> > You try to block the future of computing: https://lwn.net/Articles/638673/
> 
> Besides impossible architectures, there is actually a use-case for
> signedness here. It makes it possible to initialize the cpu number
> cache to a negative value, e.g. -1, in userspace. Then, a check for
> value < 0 can be used to figure out cases where the getcpu_cache
> system call is not implemented, and where a fallback (vdso or getcpu
> syscall) needs to be used.
> 
> This is why I have chosen a signed type for the cpu cache so far.

If getcpu_cache doesn't exist, you'll get ENOSYS.  If getcpu_cache
returns 0, then you can assume the kernel will give you a valid CPU
number.

- Josh Triplett

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 2/3] getcpu_cache: wire up ARM system call
@ 2016-01-27 18:19     ` Russell King - ARM Linux
  0 siblings, 0 replies; 47+ messages in thread
From: Russell King - ARM Linux @ 2016-01-27 18:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 11:54:42AM -0500, Mathieu Desnoyers wrote:
> Wire up the getcpu cache system call on 32-bit ARM. Call the

You should note that your chosen system call number of 391 is temporary
here and can't be relied upon as we have to wire up the new
copy_file_range() call on ARM (which I've just committed the patch for.)

I'd much prefer the wiring up to be either an entirely separate patch
(which must be submitted to me, and I'll merge between -rc1 and -rc2
along with any other syscall updates) or you leave the wiring up to me.

This is so that I can ensure that (a) system calls numbers are not
duplicated and (b) ensure that we follow the same logical order as x86
where possible.

Thanks.

-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 2/3] getcpu_cache: wire up ARM system call
@ 2016-01-27 18:19     ` Russell King - ARM Linux
  0 siblings, 0 replies; 47+ messages in thread
From: Russell King - ARM Linux @ 2016-01-27 18:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 11:54:42AM -0500, Mathieu Desnoyers wrote:
> Wire up the getcpu cache system call on 32-bit ARM. Call the

You should note that your chosen system call number of 391 is temporary
here and can't be relied upon as we have to wire up the new
copy_file_range() call on ARM (which I've just committed the patch for.)

I'd much prefer the wiring up to be either an entirely separate patch
(which must be submitted to me, and I'll merge between -rc1 and -rc2
along with any other syscall updates) or you leave the wiring up to me.

This is so that I can ensure that (a) system calls numbers are not
duplicated and (b) ensure that we follow the same logical order as x86
where possible.

Thanks.

-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-27 18:03           ` Josh Triplett
  (?)
@ 2016-01-27 18:43           ` Mathieu Desnoyers
  2016-01-27 19:16             ` Josh Triplett
  -1 siblings, 1 reply; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 18:43 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 27, 2016, at 1:03 PM, Josh Triplett josh@joshtriplett.org wrote:

> On Wed, Jan 27, 2016 at 05:36:48PM +0000, Mathieu Desnoyers wrote:
>> ----- On Jan 27, 2016, at 12:24 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> > On Wed, 27 Jan 2016, Josh Triplett wrote:
>> >> With the dynamic allocation removed, this seems sensible to me.  One
>> >> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
>> >> number should never need to hold a negative number.
>> > 
>> > You try to block the future of computing: https://lwn.net/Articles/638673/
>> 
>> Besides impossible architectures, there is actually a use-case for
>> signedness here. It makes it possible to initialize the cpu number
>> cache to a negative value, e.g. -1, in userspace. Then, a check for
>> value < 0 can be used to figure out cases where the getcpu_cache
>> system call is not implemented, and where a fallback (vdso or getcpu
>> syscall) needs to be used.
>> 
>> This is why I have chosen a signed type for the cpu cache so far.
> 
> If getcpu_cache doesn't exist, you'll get ENOSYS.  If getcpu_cache
> returns 0, then you can assume the kernel will give you a valid CPU
> number.

I'm referring to the code path that read the content of the cache.
This code don't call the getcpu_cache system call each time (this
would defeat the entire purpose of this cache), but still has to
know whether it can rely on the cache content to contain the current
CPU number. Seeing a "-1" there is a nice way to tell the fast path
that it needs to go through a fallback.

Or perhaps you have another mechanism in mind for that ? How do
you intend to communicate the ENOSYS from the kernel to all
eventual readers of the cache, without adding extra function
call overhead on the fast path ?

Thanks,

Mathieu

> 
> - Josh Triplett

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 2/3] getcpu_cache: wire up ARM system call
@ 2016-01-27 18:46       ` Mathieu Desnoyers
  0 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 18:46 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Andrew Morton,
	Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 27, 2016, at 1:19 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:

> On Wed, Jan 27, 2016 at 11:54:42AM -0500, Mathieu Desnoyers wrote:
>> Wire up the getcpu cache system call on 32-bit ARM. Call the
> 
> You should note that your chosen system call number of 391 is temporary
> here and can't be relied upon as we have to wire up the new
> copy_file_range() call on ARM (which I've just committed the patch for.)
> 
> I'd much prefer the wiring up to be either an entirely separate patch
> (which must be submitted to me, and I'll merge between -rc1 and -rc2
> along with any other syscall updates) or you leave the wiring up to me.
> 
> This is so that I can ensure that (a) system calls numbers are not
> duplicated and (b) ensure that we follow the same logical order as x86
> where possible.

Just to make sure I understand: you would perfer if I split this
patch into two: one that adds the getcpu_cache_handle_notify_resume()
call in the resume notifier ("getcpu_cache ARM architecture support"),
and a separate patch for wiring up the system call ?

Thanks,

Mathieu

> 
> Thanks.
> 
> --
> RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 2/3] getcpu_cache: wire up ARM system call
@ 2016-01-27 18:46       ` Mathieu Desnoyers
  0 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 18:46 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 27, 2016, at 1:19 PM, Russell King - ARM Linux linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org wrote:

> On Wed, Jan 27, 2016 at 11:54:42AM -0500, Mathieu Desnoyers wrote:
>> Wire up the getcpu cache system call on 32-bit ARM. Call the
> 
> You should note that your chosen system call number of 391 is temporary
> here and can't be relied upon as we have to wire up the new
> copy_file_range() call on ARM (which I've just committed the patch for.)
> 
> I'd much prefer the wiring up to be either an entirely separate patch
> (which must be submitted to me, and I'll merge between -rc1 and -rc2
> along with any other syscall updates) or you leave the wiring up to me.
> 
> This is so that I can ensure that (a) system calls numbers are not
> duplicated and (b) ensure that we follow the same logical order as x86
> where possible.

Just to make sure I understand: you would perfer if I split this
patch into two: one that adds the getcpu_cache_handle_notify_resume()
call in the resume notifier ("getcpu_cache ARM architecture support"),
and a separate patch for wiring up the system call ?

Thanks,

Mathieu

> 
> Thanks.
> 
> --
> RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-27 18:43           ` Mathieu Desnoyers
@ 2016-01-27 19:16             ` Josh Triplett
  2016-01-27 21:02               ` Mathieu Desnoyers
  0 siblings, 1 reply; 47+ messages in thread
From: Josh Triplett @ 2016-01-27 19:16 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 06:43:36PM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 27, 2016, at 1:03 PM, Josh Triplett josh@joshtriplett.org wrote:
> 
> > On Wed, Jan 27, 2016 at 05:36:48PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Jan 27, 2016, at 12:24 PM, Thomas Gleixner tglx@linutronix.de wrote:
> >> > On Wed, 27 Jan 2016, Josh Triplett wrote:
> >> >> With the dynamic allocation removed, this seems sensible to me.  One
> >> >> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
> >> >> number should never need to hold a negative number.
> >> > 
> >> > You try to block the future of computing: https://lwn.net/Articles/638673/
> >> 
> >> Besides impossible architectures, there is actually a use-case for
> >> signedness here. It makes it possible to initialize the cpu number
> >> cache to a negative value, e.g. -1, in userspace. Then, a check for
> >> value < 0 can be used to figure out cases where the getcpu_cache
> >> system call is not implemented, and where a fallback (vdso or getcpu
> >> syscall) needs to be used.
> >> 
> >> This is why I have chosen a signed type for the cpu cache so far.
> > 
> > If getcpu_cache doesn't exist, you'll get ENOSYS.  If getcpu_cache
> > returns 0, then you can assume the kernel will give you a valid CPU
> > number.
> 
> I'm referring to the code path that read the content of the cache.
> This code don't call the getcpu_cache system call each time (this
> would defeat the entire purpose of this cache), but still has to
> know whether it can rely on the cache content to contain the current
> CPU number. Seeing a "-1" there is a nice way to tell the fast path
> that it needs to go through a fallback.
> 
> Or perhaps you have another mechanism in mind for that ? How do
> you intend to communicate the ENOSYS from the kernel to all
> eventual readers of the cache, without adding extra function
> call overhead on the fast path ?

Have the fast path assume the cache, without even checking for -1; only
use that fast path if getcpu_cache exists.  If you don't have
getcpu_cache, don't even attempt to use the fast path; substitute in a
fallback implementation.  Don't have a conditional in either version;
just decide which version to use based on system capabilities.

Alternatively, use the implementation you have with a placeholder value,
and just use 0xFFFFFFFF as the placeholder; that seems no more or
less valid.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-27 19:16             ` Josh Triplett
@ 2016-01-27 21:02               ` Mathieu Desnoyers
  2016-01-27 21:30                   ` Josh Triplett
  0 siblings, 1 reply; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 21:02 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 27, 2016, at 2:16 PM, Josh Triplett josh@joshtriplett.org wrote:

> On Wed, Jan 27, 2016 at 06:43:36PM +0000, Mathieu Desnoyers wrote:
>> ----- On Jan 27, 2016, at 1:03 PM, Josh Triplett josh@joshtriplett.org wrote:
>> 
>> > On Wed, Jan 27, 2016 at 05:36:48PM +0000, Mathieu Desnoyers wrote:
>> >> ----- On Jan 27, 2016, at 12:24 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> >> > On Wed, 27 Jan 2016, Josh Triplett wrote:
>> >> >> With the dynamic allocation removed, this seems sensible to me.  One
>> >> >> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
>> >> >> number should never need to hold a negative number.
>> >> > 
>> >> > You try to block the future of computing: https://lwn.net/Articles/638673/
>> >> 
>> >> Besides impossible architectures, there is actually a use-case for
>> >> signedness here. It makes it possible to initialize the cpu number
>> >> cache to a negative value, e.g. -1, in userspace. Then, a check for
>> >> value < 0 can be used to figure out cases where the getcpu_cache
>> >> system call is not implemented, and where a fallback (vdso or getcpu
>> >> syscall) needs to be used.
>> >> 
>> >> This is why I have chosen a signed type for the cpu cache so far.
>> > 
>> > If getcpu_cache doesn't exist, you'll get ENOSYS.  If getcpu_cache
>> > returns 0, then you can assume the kernel will give you a valid CPU
>> > number.
>> 
>> I'm referring to the code path that read the content of the cache.
>> This code don't call the getcpu_cache system call each time (this
>> would defeat the entire purpose of this cache), but still has to
>> know whether it can rely on the cache content to contain the current
>> CPU number. Seeing a "-1" there is a nice way to tell the fast path
>> that it needs to go through a fallback.
>> 
>> Or perhaps you have another mechanism in mind for that ? How do
>> you intend to communicate the ENOSYS from the kernel to all
>> eventual readers of the cache, without adding extra function
>> call overhead on the fast path ?
> 
> Have the fast path assume the cache, without even checking for -1; only
> use that fast path if getcpu_cache exists.  If you don't have
> getcpu_cache, don't even attempt to use the fast path; substitute in a
> fallback implementation.  Don't have a conditional in either version;
> just decide which version to use based on system capabilities.

I'm under the impression that we are talking past each other, because
I still don't get how your proposal works in practice without relying on
dynamic code patching.

Let's consider the following scenario:

Let's suppose getcpu_cache syscall gets a number assigned on ARM for kernel
4.6. We build an application against those kernel headers, so the application
will attempt to perform the getcpu_cache syscall to register the cache for
each thread.

However, said application is deployed on an older kernel, for which getcpu_cache
returns -1, errno=ENOSYS.

Within the fast-path, in our scenario, it would be a load instruction
fetching the cache within an inline assembly. How are we supposed to
turn that instruction into something else without dynamically patching
userspace code ?

One important aspect here is that we are not doing a function call to
get to the fast-path: the fast-path is inlined within the application
code.

> 
> Alternatively, use the implementation you have with a placeholder value,
> and just use 0xFFFFFFFF as the placeholder; that seems no more or
> less valid.

If we expect this comparison to be performed at every fast-path, it
would appear to produce slightly more compact code to compare against
0 (< 0) than to compare != 0xFFFFFFFF (even though cmp and test have
the same instruction throughput and latency based on Intel optimization
manuals).

e.g. on x86-64:

if (a < 0)

  400536:       85 c0                   test   %eax,%eax
  400538:       78 06                   js     400540 <fct+0x10>

-> 4 bytes


if (a != 0xFFFFFFFF)

  400536:       83 f8 ff                cmp    $0xffffffff,%eax
  400539:       74 15                   je     400550 <fct+0x20>

-> 5 bytes

I don't have a strong opinion there, but I wonder what is
the upside of having an unsigned value for the cpu number,
given that it makes the userspace code a bit more awkward
than with a signed value. If the goal is to support number
of CPUs higher than 2^31, then we should clearly think about
using a "long" type rather than int32_t for the CPU cache.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 21:30                   ` Josh Triplett
  0 siblings, 0 replies; 47+ messages in thread
From: Josh Triplett @ 2016-01-27 21:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 09:02:44PM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 27, 2016, at 2:16 PM, Josh Triplett josh@joshtriplett.org wrote:
> 
> > On Wed, Jan 27, 2016 at 06:43:36PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Jan 27, 2016, at 1:03 PM, Josh Triplett josh@joshtriplett.org wrote:
> >> 
> >> > On Wed, Jan 27, 2016 at 05:36:48PM +0000, Mathieu Desnoyers wrote:
> >> >> ----- On Jan 27, 2016, at 12:24 PM, Thomas Gleixner tglx@linutronix.de wrote:
> >> >> > On Wed, 27 Jan 2016, Josh Triplett wrote:
> >> >> >> With the dynamic allocation removed, this seems sensible to me.  One
> >> >> >> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
> >> >> >> number should never need to hold a negative number.
> >> >> > 
> >> >> > You try to block the future of computing: https://lwn.net/Articles/638673/
> >> >> 
> >> >> Besides impossible architectures, there is actually a use-case for
> >> >> signedness here. It makes it possible to initialize the cpu number
> >> >> cache to a negative value, e.g. -1, in userspace. Then, a check for
> >> >> value < 0 can be used to figure out cases where the getcpu_cache
> >> >> system call is not implemented, and where a fallback (vdso or getcpu
> >> >> syscall) needs to be used.
> >> >> 
> >> >> This is why I have chosen a signed type for the cpu cache so far.
> >> > 
> >> > If getcpu_cache doesn't exist, you'll get ENOSYS.  If getcpu_cache
> >> > returns 0, then you can assume the kernel will give you a valid CPU
> >> > number.
> >> 
> >> I'm referring to the code path that read the content of the cache.
> >> This code don't call the getcpu_cache system call each time (this
> >> would defeat the entire purpose of this cache), but still has to
> >> know whether it can rely on the cache content to contain the current
> >> CPU number. Seeing a "-1" there is a nice way to tell the fast path
> >> that it needs to go through a fallback.
> >> 
> >> Or perhaps you have another mechanism in mind for that ? How do
> >> you intend to communicate the ENOSYS from the kernel to all
> >> eventual readers of the cache, without adding extra function
> >> call overhead on the fast path ?
> > 
> > Have the fast path assume the cache, without even checking for -1; only
> > use that fast path if getcpu_cache exists.  If you don't have
> > getcpu_cache, don't even attempt to use the fast path; substitute in a
> > fallback implementation.  Don't have a conditional in either version;
> > just decide which version to use based on system capabilities.
> 
> I'm under the impression that we are talking past each other, because
> I still don't get how your proposal works in practice without relying on
> dynamic code patching.

I assumed that you could select one of two implementations at runtime,
where both implementations did not contain conditionals.  That doesn't
necessarily require dynamic code patching, except in the case where you
inline this code into applications at compile time as you mention below.
I didn't have that detail.

> Let's consider the following scenario:
> 
> Let's suppose getcpu_cache syscall gets a number assigned on ARM for kernel
> 4.6. We build an application against those kernel headers, so the application
> will attempt to perform the getcpu_cache syscall to register the cache for
> each thread.
> 
> However, said application is deployed on an older kernel, for which getcpu_cache
> returns -1, errno=ENOSYS.

I understood that much.

> Within the fast-path, in our scenario, it would be a load instruction
> fetching the cache within an inline assembly. How are we supposed to
> turn that instruction into something else without dynamically patching
> userspace code ?
>
> One important aspect here is that we are not doing a function call to
> get to the fast-path: the fast-path is inlined within the application
> code.

That changes things, then.

> > Alternatively, use the implementation you have with a placeholder value,
> > and just use 0xFFFFFFFF as the placeholder; that seems no more or
> > less valid.
> 
> If we expect this comparison to be performed at every fast-path, it
> would appear to produce slightly more compact code to compare against
> 0 (< 0) than to compare != 0xFFFFFFFF (even though cmp and test have
> the same instruction throughput and latency based on Intel optimization
> manuals).

I had assumed an '== -1', but the same concept applies: just go ahead
and check <0 then, without changing the type.  You can always cast the
pointer you pass to the kernel if you want to make that assumption.  Or
you could perform a single 64-bit read of a number and a boolean.

It seems unfortunate to use a semantically inappropriate type in the
kernel implementation based on constraints in planned userspace code.
>From the kernel's perspective, it seems odd to use int32_t for something
the kernel will only ever write a CPU number into.

That said, I also didn't expect this suggestion to add complexity,
either.  I certainly didn't make this suggestion as future-proofing
against future systems with two billion CPUs. :)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 21:30                   ` Josh Triplett
  0 siblings, 0 replies; 47+ messages in thread
From: Josh Triplett @ 2016-01-27 21:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 09:02:44PM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 27, 2016, at 2:16 PM, Josh Triplett josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org wrote:
> 
> > On Wed, Jan 27, 2016 at 06:43:36PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Jan 27, 2016, at 1:03 PM, Josh Triplett josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org wrote:
> >> 
> >> > On Wed, Jan 27, 2016 at 05:36:48PM +0000, Mathieu Desnoyers wrote:
> >> >> ----- On Jan 27, 2016, at 12:24 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
> >> >> > On Wed, 27 Jan 2016, Josh Triplett wrote:
> >> >> >> With the dynamic allocation removed, this seems sensible to me.  One
> >> >> >> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
> >> >> >> number should never need to hold a negative number.
> >> >> > 
> >> >> > You try to block the future of computing: https://lwn.net/Articles/638673/
> >> >> 
> >> >> Besides impossible architectures, there is actually a use-case for
> >> >> signedness here. It makes it possible to initialize the cpu number
> >> >> cache to a negative value, e.g. -1, in userspace. Then, a check for
> >> >> value < 0 can be used to figure out cases where the getcpu_cache
> >> >> system call is not implemented, and where a fallback (vdso or getcpu
> >> >> syscall) needs to be used.
> >> >> 
> >> >> This is why I have chosen a signed type for the cpu cache so far.
> >> > 
> >> > If getcpu_cache doesn't exist, you'll get ENOSYS.  If getcpu_cache
> >> > returns 0, then you can assume the kernel will give you a valid CPU
> >> > number.
> >> 
> >> I'm referring to the code path that read the content of the cache.
> >> This code don't call the getcpu_cache system call each time (this
> >> would defeat the entire purpose of this cache), but still has to
> >> know whether it can rely on the cache content to contain the current
> >> CPU number. Seeing a "-1" there is a nice way to tell the fast path
> >> that it needs to go through a fallback.
> >> 
> >> Or perhaps you have another mechanism in mind for that ? How do
> >> you intend to communicate the ENOSYS from the kernel to all
> >> eventual readers of the cache, without adding extra function
> >> call overhead on the fast path ?
> > 
> > Have the fast path assume the cache, without even checking for -1; only
> > use that fast path if getcpu_cache exists.  If you don't have
> > getcpu_cache, don't even attempt to use the fast path; substitute in a
> > fallback implementation.  Don't have a conditional in either version;
> > just decide which version to use based on system capabilities.
> 
> I'm under the impression that we are talking past each other, because
> I still don't get how your proposal works in practice without relying on
> dynamic code patching.

I assumed that you could select one of two implementations at runtime,
where both implementations did not contain conditionals.  That doesn't
necessarily require dynamic code patching, except in the case where you
inline this code into applications at compile time as you mention below.
I didn't have that detail.

> Let's consider the following scenario:
> 
> Let's suppose getcpu_cache syscall gets a number assigned on ARM for kernel
> 4.6. We build an application against those kernel headers, so the application
> will attempt to perform the getcpu_cache syscall to register the cache for
> each thread.
> 
> However, said application is deployed on an older kernel, for which getcpu_cache
> returns -1, errno=ENOSYS.

I understood that much.

> Within the fast-path, in our scenario, it would be a load instruction
> fetching the cache within an inline assembly. How are we supposed to
> turn that instruction into something else without dynamically patching
> userspace code ?
>
> One important aspect here is that we are not doing a function call to
> get to the fast-path: the fast-path is inlined within the application
> code.

That changes things, then.

> > Alternatively, use the implementation you have with a placeholder value,
> > and just use 0xFFFFFFFF as the placeholder; that seems no more or
> > less valid.
> 
> If we expect this comparison to be performed at every fast-path, it
> would appear to produce slightly more compact code to compare against
> 0 (< 0) than to compare != 0xFFFFFFFF (even though cmp and test have
> the same instruction throughput and latency based on Intel optimization
> manuals).

I had assumed an '== -1', but the same concept applies: just go ahead
and check <0 then, without changing the type.  You can always cast the
pointer you pass to the kernel if you want to make that assumption.  Or
you could perform a single 64-bit read of a number and a boolean.

It seems unfortunate to use a semantically inappropriate type in the
kernel implementation based on constraints in planned userspace code.
>From the kernel's perspective, it seems odd to use int32_t for something
the kernel will only ever write a CPU number into.

That said, I also didn't expect this suggestion to add complexity,
either.  I certainly didn't make this suggestion as future-proofing
against future systems with two billion CPUs. :)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-27 17:37           ` Thomas Gleixner
@ 2016-01-27 21:34             ` Mathieu Desnoyers
  -1 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 21:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul Turner, Andrew Hunter, Peter Zijlstra, linux-kernel,
	linux-api, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 27, 2016, at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:

> On Wed, 27 Jan 2016, Thomas Gleixner wrote:
> 
>> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
>> > ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> > Sounds fair. What is the recommended typing for "ptr" then ?
>> > uint32_t ** or uint32_t * ?
>> > 
>> > It would be expected to pass a "uint32_t *" for the set
>> > operation, but the "get" operation requires a "uint32_t **".
>> 
>> Well, you can't change the types depending on the opcode, so you need to stick
>> with **.
> 
> Alternatively you make it:
> 
>  (opcode, *newptr, **oldptr, flags);

I'm tempted to stick to (opcode, **ptr, flags), because
other syscalls that have "*newptr", "**oldptr"
typically have them because they save the current state
into oldptr, and set the new state, which is really
not the case here. To eliminate any risk of confusion,
I am tempted to keep a single "**ptr".

Unless someone has a better idea...

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 21:34             ` Mathieu Desnoyers
  0 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 21:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Jan 27, 2016, at 12:37 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:

> On Wed, 27 Jan 2016, Thomas Gleixner wrote:
> 
>> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
>> > ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
>> > Sounds fair. What is the recommended typing for "ptr" then ?
>> > uint32_t ** or uint32_t * ?
>> > 
>> > It would be expected to pass a "uint32_t *" for the set
>> > operation, but the "get" operation requires a "uint32_t **".
>> 
>> Well, you can't change the types depending on the opcode, so you need to stick
>> with **.
> 
> Alternatively you make it:
> 
>  (opcode, *newptr, **oldptr, flags);

I'm tempted to stick to (opcode, **ptr, flags), because
other syscalls that have "*newptr", "**oldptr"
typically have them because they save the current state
into oldptr, and set the new state, which is really
not the case here. To eliminate any risk of confusion,
I am tempted to keep a single "**ptr".

Unless someone has a better idea...

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 22:11               ` Josh Triplett
  0 siblings, 0 replies; 47+ messages in thread
From: Josh Triplett @ 2016-01-27 22:11 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 09:34:35PM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 27, 2016, at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:
> 
> > On Wed, 27 Jan 2016, Thomas Gleixner wrote:
> > 
> >> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
> >> > ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx@linutronix.de wrote:
> >> > Sounds fair. What is the recommended typing for "ptr" then ?
> >> > uint32_t ** or uint32_t * ?
> >> > 
> >> > It would be expected to pass a "uint32_t *" for the set
> >> > operation, but the "get" operation requires a "uint32_t **".
> >> 
> >> Well, you can't change the types depending on the opcode, so you need to stick
> >> with **.
> > 
> > Alternatively you make it:
> > 
> >  (opcode, *newptr, **oldptr, flags);
> 
> I'm tempted to stick to (opcode, **ptr, flags), because
> other syscalls that have "*newptr", "**oldptr"
> typically have them because they save the current state
> into oldptr, and set the new state, which is really
> not the case here. To eliminate any risk of confusion,
> I am tempted to keep a single "**ptr".
> 
> Unless someone has a better idea...

Either that or you could define it as "void *" and interpret it based on
flags, but that seems unfortunate; let's not imitate ioctl-style
typeless parameters.  I'd stick with the double pointer and the current
behavior.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 22:11               ` Josh Triplett
  0 siblings, 0 replies; 47+ messages in thread
From: Josh Triplett @ 2016-01-27 22:11 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 09:34:35PM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 27, 2016, at 12:37 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
> 
> > On Wed, 27 Jan 2016, Thomas Gleixner wrote:
> > 
> >> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
> >> > ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
> >> > Sounds fair. What is the recommended typing for "ptr" then ?
> >> > uint32_t ** or uint32_t * ?
> >> > 
> >> > It would be expected to pass a "uint32_t *" for the set
> >> > operation, but the "get" operation requires a "uint32_t **".
> >> 
> >> Well, you can't change the types depending on the opcode, so you need to stick
> >> with **.
> > 
> > Alternatively you make it:
> > 
> >  (opcode, *newptr, **oldptr, flags);
> 
> I'm tempted to stick to (opcode, **ptr, flags), because
> other syscalls that have "*newptr", "**oldptr"
> typically have them because they save the current state
> into oldptr, and set the new state, which is really
> not the case here. To eliminate any risk of confusion,
> I am tempted to keep a single "**ptr".
> 
> Unless someone has a better idea...

Either that or you could define it as "void *" and interpret it based on
flags, but that seems unfortunate; let's not imitate ioctl-style
typeless parameters.  I'd stick with the double pointer and the current
behavior.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-27 22:11               ` Josh Triplett
@ 2016-01-27 22:47                 ` Mathieu Desnoyers
  -1 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 22:47 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 27, 2016, at 5:11 PM, Josh Triplett josh@joshtriplett.org wrote:

> On Wed, Jan 27, 2016 at 09:34:35PM +0000, Mathieu Desnoyers wrote:
>> ----- On Jan 27, 2016, at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> 
>> > On Wed, 27 Jan 2016, Thomas Gleixner wrote:
>> > 
>> >> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
>> >> > ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> >> > Sounds fair. What is the recommended typing for "ptr" then ?
>> >> > uint32_t ** or uint32_t * ?
>> >> > 
>> >> > It would be expected to pass a "uint32_t *" for the set
>> >> > operation, but the "get" operation requires a "uint32_t **".
>> >> 
>> >> Well, you can't change the types depending on the opcode, so you need to stick
>> >> with **.
>> > 
>> > Alternatively you make it:
>> > 
>> >  (opcode, *newptr, **oldptr, flags);
>> 
>> I'm tempted to stick to (opcode, **ptr, flags), because
>> other syscalls that have "*newptr", "**oldptr"
>> typically have them because they save the current state
>> into oldptr, and set the new state, which is really
>> not the case here. To eliminate any risk of confusion,
>> I am tempted to keep a single "**ptr".
>> 
>> Unless someone has a better idea...
> 
> Either that or you could define it as "void *" and interpret it based on
> flags, but that seems unfortunate; let's not imitate ioctl-style
> typeless parameters.  I'd stick with the double pointer and the current
> behavior.

Allright, will do! Thanks for the feedback :)

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-27 22:47                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 22:47 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 27, 2016, at 5:11 PM, Josh Triplett josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org wrote:

> On Wed, Jan 27, 2016 at 09:34:35PM +0000, Mathieu Desnoyers wrote:
>> ----- On Jan 27, 2016, at 12:37 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
>> 
>> > On Wed, 27 Jan 2016, Thomas Gleixner wrote:
>> > 
>> >> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
>> >> > ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
>> >> > Sounds fair. What is the recommended typing for "ptr" then ?
>> >> > uint32_t ** or uint32_t * ?
>> >> > 
>> >> > It would be expected to pass a "uint32_t *" for the set
>> >> > operation, but the "get" operation requires a "uint32_t **".
>> >> 
>> >> Well, you can't change the types depending on the opcode, so you need to stick
>> >> with **.
>> > 
>> > Alternatively you make it:
>> > 
>> >  (opcode, *newptr, **oldptr, flags);
>> 
>> I'm tempted to stick to (opcode, **ptr, flags), because
>> other syscalls that have "*newptr", "**oldptr"
>> typically have them because they save the current state
>> into oldptr, and set the new state, which is really
>> not the case here. To eliminate any risk of confusion,
>> I am tempted to keep a single "**ptr".
>> 
>> Unless someone has a better idea...
> 
> Either that or you could define it as "void *" and interpret it based on
> flags, but that seems unfortunate; let's not imitate ioctl-style
> typeless parameters.  I'd stick with the double pointer and the current
> behavior.

Allright, will do! Thanks for the feedback :)

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 2/3] getcpu_cache: wire up ARM system call
@ 2016-01-27 23:03         ` Mathieu Desnoyers
  0 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 23:03 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Andrew Morton,
	Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 27, 2016, at 1:46 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Jan 27, 2016, at 1:19 PM, Russell King - ARM Linux
> linux@arm.linux.org.uk wrote:
> 
>> On Wed, Jan 27, 2016 at 11:54:42AM -0500, Mathieu Desnoyers wrote:
>>> Wire up the getcpu cache system call on 32-bit ARM. Call the
>> 
>> You should note that your chosen system call number of 391 is temporary
>> here and can't be relied upon as we have to wire up the new
>> copy_file_range() call on ARM (which I've just committed the patch for.)
>> 
>> I'd much prefer the wiring up to be either an entirely separate patch
>> (which must be submitted to me, and I'll merge between -rc1 and -rc2
>> along with any other syscall updates) or you leave the wiring up to me.
>> 
>> This is so that I can ensure that (a) system calls numbers are not
>> duplicated and (b) ensure that we follow the same logical order as x86
>> where possible.
> 
> Just to make sure I understand: you would prefer if I split this
> patch into two: one that adds the getcpu_cache_handle_notify_resume()
> call in the resume notifier ("getcpu_cache ARM architecture support"),
> and a separate patch for wiring up the system call ?

I just refactored my patchset as described above (for upcoming v3). Just
to make sure there is no confusion, I target Linux 4.6 for this patchset,
since the 4.5 merge window is closed.

Thanks,

Mathieu

> 
> Thanks,
> 
> Mathieu
> 
>> 
>> Thanks.
>> 
>> --
>> RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
>> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
>> according to speedtest.net.
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 2/3] getcpu_cache: wire up ARM system call
@ 2016-01-27 23:03         ` Mathieu Desnoyers
  0 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-27 23:03 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 27, 2016, at 1:46 PM, Mathieu Desnoyers mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org wrote:

> ----- On Jan 27, 2016, at 1:19 PM, Russell King - ARM Linux
> linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org wrote:
> 
>> On Wed, Jan 27, 2016 at 11:54:42AM -0500, Mathieu Desnoyers wrote:
>>> Wire up the getcpu cache system call on 32-bit ARM. Call the
>> 
>> You should note that your chosen system call number of 391 is temporary
>> here and can't be relied upon as we have to wire up the new
>> copy_file_range() call on ARM (which I've just committed the patch for.)
>> 
>> I'd much prefer the wiring up to be either an entirely separate patch
>> (which must be submitted to me, and I'll merge between -rc1 and -rc2
>> along with any other syscall updates) or you leave the wiring up to me.
>> 
>> This is so that I can ensure that (a) system calls numbers are not
>> duplicated and (b) ensure that we follow the same logical order as x86
>> where possible.
> 
> Just to make sure I understand: you would prefer if I split this
> patch into two: one that adds the getcpu_cache_handle_notify_resume()
> call in the resume notifier ("getcpu_cache ARM architecture support"),
> and a separate patch for wiring up the system call ?

I just refactored my patchset as described above (for upcoming v3). Just
to make sure there is no confusion, I target Linux 4.6 for this patchset,
since the 4.5 merge window is closed.

Thanks,

Mathieu

> 
> Thanks,
> 
> Mathieu
> 
>> 
>> Thanks.
>> 
>> --
>> RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
>> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
>> according to speedtest.net.
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-28  3:12     ` Alexei Starovoitov
  0 siblings, 0 replies; 47+ messages in thread
From: Alexei Starovoitov @ 2016-01-28  3:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Wed, Jan 27, 2016 at 11:54:41AM -0500, Mathieu Desnoyers wrote:
> Expose a new system call allowing threads to register one userspace
> memory area where to store the CPU number on which the calling thread is
> running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
> current thread. Upon return to user-space, a notify-resume handler
> updates the current CPU value within each registered user-space memory
> area. User-space can then read the current CPU number directly from
> memory.
> 
> This getcpu cache is an improvement over current mechanisms available to
> read the current CPU number, which has the following benefits:
> 
> - 44x speedup on ARM vs system call through glibc,
> - 14x speedup on x86 compared to calling glibc, which calls vdso
>   executing a "lsl" instruction,
> - 11x speedup on x86 compared to inlined "lsl" instruction,
> - Unlike vdso approaches, this cached value can be read from an inline
>   assembly, which makes it a useful building block for restartable
>   sequences.
> - The getcpu cache approach is portable (e.g. ARM), which is not the
>   case for the lsl-based x86 vdso.
> 
> On x86, yet another possible approach would be to use the gs segment
> selector to point to user-space per-cpu data. This approach performs
> similarly to the getcpu cache, but it has two disadvantages: it is
> not portable, and it is incompatible with existing applications already
> using the gs segment selector for other purposes.

Great work! The only concern is that every arch has to implement
a call to getcpu_cache_handle_notify_resume() to be able to do put_user()
from the safe place which is not pretty.
Can we do better?
Here is one crazy idea:
The kernel can allocate the memory that user space will mmap()
(ideally reusing perf ring-buffer alloc/mmap mechanism).
then the kernel can just write cpuid into it from any place.
Then user space will register the 'offset' into this space for a given
user space thread (or kernel will return it or ptr within this area)
and in finish_task_switch() the kernel will do
*task->offset_converted_to_ptr = smp_processor_id();
At init time the user space will do:
__thread int *cpuid;
cpuid = (void*)addr_from_mmap + registered_offset;
and at runtime the '*cpuid' will give userspace what it wants.
It's two loads to get cpuid vs getcpu_cache approach, but
probably still fast enough?
And this way we can have a mechanism to return much bigger
structures to userspace. Kernel can update such area from any
place and user space only needs one extra load to get the base of
such per-cpu area and another load to fetch cpuid.
Thoughts?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-28  3:12     ` Alexei Starovoitov
  0 siblings, 0 replies; 47+ messages in thread
From: Alexei Starovoitov @ 2016-01-28  3:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Wed, Jan 27, 2016 at 11:54:41AM -0500, Mathieu Desnoyers wrote:
> Expose a new system call allowing threads to register one userspace
> memory area where to store the CPU number on which the calling thread is
> running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
> current thread. Upon return to user-space, a notify-resume handler
> updates the current CPU value within each registered user-space memory
> area. User-space can then read the current CPU number directly from
> memory.
> 
> This getcpu cache is an improvement over current mechanisms available to
> read the current CPU number, which has the following benefits:
> 
> - 44x speedup on ARM vs system call through glibc,
> - 14x speedup on x86 compared to calling glibc, which calls vdso
>   executing a "lsl" instruction,
> - 11x speedup on x86 compared to inlined "lsl" instruction,
> - Unlike vdso approaches, this cached value can be read from an inline
>   assembly, which makes it a useful building block for restartable
>   sequences.
> - The getcpu cache approach is portable (e.g. ARM), which is not the
>   case for the lsl-based x86 vdso.
> 
> On x86, yet another possible approach would be to use the gs segment
> selector to point to user-space per-cpu data. This approach performs
> similarly to the getcpu cache, but it has two disadvantages: it is
> not portable, and it is incompatible with existing applications already
> using the gs segment selector for other purposes.

Great work! The only concern is that every arch has to implement
a call to getcpu_cache_handle_notify_resume() to be able to do put_user()
from the safe place which is not pretty.
Can we do better?
Here is one crazy idea:
The kernel can allocate the memory that user space will mmap()
(ideally reusing perf ring-buffer alloc/mmap mechanism).
then the kernel can just write cpuid into it from any place.
Then user space will register the 'offset' into this space for a given
user space thread (or kernel will return it or ptr within this area)
and in finish_task_switch() the kernel will do
*task->offset_converted_to_ptr = smp_processor_id();
At init time the user space will do:
__thread int *cpuid;
cpuid = (void*)addr_from_mmap + registered_offset;
and at runtime the '*cpuid' will give userspace what it wants.
It's two loads to get cpuid vs getcpu_cache approach, but
probably still fast enough?
And this way we can have a mechanism to return much bigger
structures to userspace. Kernel can update such area from any
place and user space only needs one extra load to get the base of
such per-cpu area and another load to fetch cpuid.
Thoughts?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-28 11:12                   ` Heiko Carstens
  0 siblings, 0 replies; 47+ messages in thread
From: Heiko Carstens @ 2016-01-28 11:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Josh Triplett, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra, linux-kernel, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 10:47:37PM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 27, 2016, at 5:11 PM, Josh Triplett josh@joshtriplett.org wrote:
> 
> > On Wed, Jan 27, 2016 at 09:34:35PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Jan 27, 2016, at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:
> >> 
> >> > On Wed, 27 Jan 2016, Thomas Gleixner wrote:
> >> > 
> >> >> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
> >> >> > ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx@linutronix.de wrote:
> >> >> > Sounds fair. What is the recommended typing for "ptr" then ?
> >> >> > uint32_t ** or uint32_t * ?
> >> >> > 
> >> >> > It would be expected to pass a "uint32_t *" for the set
> >> >> > operation, but the "get" operation requires a "uint32_t **".
> >> >> 
> >> >> Well, you can't change the types depending on the opcode, so you need to stick
> >> >> with **.
> >> > 
> >> > Alternatively you make it:
> >> > 
> >> >  (opcode, *newptr, **oldptr, flags);
> >> 
> >> I'm tempted to stick to (opcode, **ptr, flags), because
> >> other syscalls that have "*newptr", "**oldptr"
> >> typically have them because they save the current state
> >> into oldptr, and set the new state, which is really
> >> not the case here. To eliminate any risk of confusion,
> >> I am tempted to keep a single "**ptr".
> >> 
> >> Unless someone has a better idea...
> > 
> > Either that or you could define it as "void *" and interpret it based on
> > flags, but that seems unfortunate; let's not imitate ioctl-style
> > typeless parameters.  I'd stick with the double pointer and the current
> > behavior.
> 
> Allright, will do! Thanks for the feedback :)

Please don't forget that you also need to implement compat handling since
the size of the pointer that is being pointed to is only four bytes for
compat tasks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-28 11:12                   ` Heiko Carstens
  0 siblings, 0 replies; 47+ messages in thread
From: Heiko Carstens @ 2016-01-28 11:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Josh Triplett, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ingo Molnar, Ben Maurer, rostedt, Paul E. McKenney,
	Linus Torvalds, Andrew Morton, Russell King, Catalin Marinas,
	Will Deacon, Michael Kerrisk

On Wed, Jan 27, 2016 at 10:47:37PM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 27, 2016, at 5:11 PM, Josh Triplett josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org wrote:
> 
> > On Wed, Jan 27, 2016 at 09:34:35PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Jan 27, 2016, at 12:37 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
> >> 
> >> > On Wed, 27 Jan 2016, Thomas Gleixner wrote:
> >> > 
> >> >> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
> >> >> > ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
> >> >> > Sounds fair. What is the recommended typing for "ptr" then ?
> >> >> > uint32_t ** or uint32_t * ?
> >> >> > 
> >> >> > It would be expected to pass a "uint32_t *" for the set
> >> >> > operation, but the "get" operation requires a "uint32_t **".
> >> >> 
> >> >> Well, you can't change the types depending on the opcode, so you need to stick
> >> >> with **.
> >> > 
> >> > Alternatively you make it:
> >> > 
> >> >  (opcode, *newptr, **oldptr, flags);
> >> 
> >> I'm tempted to stick to (opcode, **ptr, flags), because
> >> other syscalls that have "*newptr", "**oldptr"
> >> typically have them because they save the current state
> >> into oldptr, and set the new state, which is really
> >> not the case here. To eliminate any risk of confusion,
> >> I am tempted to keep a single "**ptr".
> >> 
> >> Unless someone has a better idea...
> > 
> > Either that or you could define it as "void *" and interpret it based on
> > flags, but that seems unfortunate; let's not imitate ioctl-style
> > typeless parameters.  I'd stick with the double pointer and the current
> > behavior.
> 
> Allright, will do! Thanks for the feedback :)

Please don't forget that you also need to implement compat handling since
the size of the pointer that is being pointed to is only four bytes for
compat tasks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
  2016-01-28 11:12                   ` Heiko Carstens
@ 2016-01-28 13:33                     ` Mathieu Desnoyers
  -1 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-28 13:33 UTC (permalink / raw)
  To: heiko carstens
  Cc: Josh Triplett, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra, linux-kernel, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 28, 2016, at 6:12 AM, heiko carstens heiko.carstens@de.ibm.com wrote:

> On Wed, Jan 27, 2016 at 10:47:37PM +0000, Mathieu Desnoyers wrote:
>> ----- On Jan 27, 2016, at 5:11 PM, Josh Triplett josh@joshtriplett.org wrote:
>> 
>> > On Wed, Jan 27, 2016 at 09:34:35PM +0000, Mathieu Desnoyers wrote:
>> >> ----- On Jan 27, 2016, at 12:37 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> >> 
>> >> > On Wed, 27 Jan 2016, Thomas Gleixner wrote:
>> >> > 
>> >> >> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
>> >> >> > ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> >> >> > Sounds fair. What is the recommended typing for "ptr" then ?
>> >> >> > uint32_t ** or uint32_t * ?
>> >> >> > 
>> >> >> > It would be expected to pass a "uint32_t *" for the set
>> >> >> > operation, but the "get" operation requires a "uint32_t **".
>> >> >> 
>> >> >> Well, you can't change the types depending on the opcode, so you need to stick
>> >> >> with **.
>> >> > 
>> >> > Alternatively you make it:
>> >> > 
>> >> >  (opcode, *newptr, **oldptr, flags);
>> >> 
>> >> I'm tempted to stick to (opcode, **ptr, flags), because
>> >> other syscalls that have "*newptr", "**oldptr"
>> >> typically have them because they save the current state
>> >> into oldptr, and set the new state, which is really
>> >> not the case here. To eliminate any risk of confusion,
>> >> I am tempted to keep a single "**ptr".
>> >> 
>> >> Unless someone has a better idea...
>> > 
>> > Either that or you could define it as "void *" and interpret it based on
>> > flags, but that seems unfortunate; let's not imitate ioctl-style
>> > typeless parameters.  I'd stick with the double pointer and the current
>> > behavior.
>> 
>> Allright, will do! Thanks for the feedback :)
> 
> Please don't forget that you also need to implement compat handling since
> the size of the pointer that is being pointed to is only four bytes for
> compat tasks.

Oops, forgot about that. Will update my upcoming v3 to handle this properly.

Well spotted ! Thanks!

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-28 13:33                     ` Mathieu Desnoyers
  0 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-28 13:33 UTC (permalink / raw)
  To: heiko carstens
  Cc: Josh Triplett, Thomas Gleixner, Paul Turner, Andrew Hunter,
	Peter Zijlstra, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ingo Molnar, Ben Maurer, rostedt, Paul E. McKenney,
	Linus Torvalds, Andrew Morton, Russell King, Catalin Marinas,
	Will Deacon, Michael Kerrisk

----- On Jan 28, 2016, at 6:12 AM, heiko carstens heiko.carstens-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org wrote:

> On Wed, Jan 27, 2016 at 10:47:37PM +0000, Mathieu Desnoyers wrote:
>> ----- On Jan 27, 2016, at 5:11 PM, Josh Triplett josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org wrote:
>> 
>> > On Wed, Jan 27, 2016 at 09:34:35PM +0000, Mathieu Desnoyers wrote:
>> >> ----- On Jan 27, 2016, at 12:37 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
>> >> 
>> >> > On Wed, 27 Jan 2016, Thomas Gleixner wrote:
>> >> > 
>> >> >> On Wed, 27 Jan 2016, Mathieu Desnoyers wrote:
>> >> >> > ----- On Jan 27, 2016, at 12:22 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
>> >> >> > Sounds fair. What is the recommended typing for "ptr" then ?
>> >> >> > uint32_t ** or uint32_t * ?
>> >> >> > 
>> >> >> > It would be expected to pass a "uint32_t *" for the set
>> >> >> > operation, but the "get" operation requires a "uint32_t **".
>> >> >> 
>> >> >> Well, you can't change the types depending on the opcode, so you need to stick
>> >> >> with **.
>> >> > 
>> >> > Alternatively you make it:
>> >> > 
>> >> >  (opcode, *newptr, **oldptr, flags);
>> >> 
>> >> I'm tempted to stick to (opcode, **ptr, flags), because
>> >> other syscalls that have "*newptr", "**oldptr"
>> >> typically have them because they save the current state
>> >> into oldptr, and set the new state, which is really
>> >> not the case here. To eliminate any risk of confusion,
>> >> I am tempted to keep a single "**ptr".
>> >> 
>> >> Unless someone has a better idea...
>> > 
>> > Either that or you could define it as "void *" and interpret it based on
>> > flags, but that seems unfortunate; let's not imitate ioctl-style
>> > typeless parameters.  I'd stick with the double pointer and the current
>> > behavior.
>> 
>> Allright, will do! Thanks for the feedback :)
> 
> Please don't forget that you also need to implement compat handling since
> the size of the pointer that is being pointed to is only four bytes for
> compat tasks.

Oops, forgot about that. Will update my upcoming v3 to handle this properly.

Well spotted ! Thanks!

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-28 17:41       ` Mathieu Desnoyers
  0 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-28 17:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel, linux-api, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Michael Kerrisk

----- On Jan 27, 2016, at 10:12 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:

> On Wed, Jan 27, 2016 at 11:54:41AM -0500, Mathieu Desnoyers wrote:
>> Expose a new system call allowing threads to register one userspace
>> memory area where to store the CPU number on which the calling thread is
>> running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
>> current thread. Upon return to user-space, a notify-resume handler
>> updates the current CPU value within each registered user-space memory
>> area. User-space can then read the current CPU number directly from
>> memory.
>> 
>> This getcpu cache is an improvement over current mechanisms available to
>> read the current CPU number, which has the following benefits:
>> 
>> - 44x speedup on ARM vs system call through glibc,
>> - 14x speedup on x86 compared to calling glibc, which calls vdso
>>   executing a "lsl" instruction,
>> - 11x speedup on x86 compared to inlined "lsl" instruction,
>> - Unlike vdso approaches, this cached value can be read from an inline
>>   assembly, which makes it a useful building block for restartable
>>   sequences.
>> - The getcpu cache approach is portable (e.g. ARM), which is not the
>>   case for the lsl-based x86 vdso.
>> 
>> On x86, yet another possible approach would be to use the gs segment
>> selector to point to user-space per-cpu data. This approach performs
>> similarly to the getcpu cache, but it has two disadvantages: it is
>> not portable, and it is incompatible with existing applications already
>> using the gs segment selector for other purposes.
> 
> Great work!

Thanks!

> The only concern is that every arch has to implement
> a call to getcpu_cache_handle_notify_resume() to be able to do put_user()
> from the safe place which is not pretty.

Indeed, I've considered the alternatives before going down that
route. The idea you propose below was among those I eventually
rejected. Here is why:

> Can we do better?
> Here is one crazy idea:
> The kernel can allocate the memory that user space will mmap()
> (ideally reusing perf ring-buffer alloc/mmap mechanism).
> then the kernel can just write cpuid into it from any place.

This requires the memory to be "mlock'd" or equivalent, because the
kernel cannot page fault when writing to it. That memory then becomes
impossible to swap out.

Also, how large should this mmap() area be ? Since there can be a
very large amount of threads created within a process, it would
probably need to be extended at some point. Then how do you manage
memory fragmentation, e.g. if there is no room left to extend the
mapping when a thread appears ?

> Then user space will register the 'offset' into this space for a given
> user space thread (or kernel will return it or ptr within this area)

This seems to require a "thread ID allocation" mechanism which allocates
IDs for threads starting from 0, with re-use of IDs when thread go away.
So we're adding a free-list of IDs per process or something similar.

In order to eliminate false-sharing, each "entry" in this array
would need to be at least cacheline-sized, which leads to wasted
memory space and cache compared to the TLS approach.

> and in finish_task_switch() the kernel will do
> *task->offset_converted_to_ptr = smp_processor_id();
> At init time the user space will do:
> __thread int *cpuid;
> cpuid = (void*)addr_from_mmap + registered_offset;
> and at runtime the '*cpuid' will give userspace what it wants.
> It's two loads to get cpuid vs getcpu_cache approach, but
> probably still fast enough?

Those per-cpu fast paths are extremely fast (they run in a couple
of nanoseconds). Adding a pointer dereference on the fast path
will likely be measurable.

> And this way we can have a mechanism to return much bigger
> structures to userspace. Kernel can update such area from any
> place and user space only needs one extra load to get the base of
> such per-cpu area and another load to fetch cpuid.
> Thoughts?

>From my understanding, your mmap-array proposal adds more complexity
than the one call in each architecture's resume notifier it's trying
to remove. It requires unswappable memory, a thread ID allocator, and
leads to a slower fast-path due to the extra pointer dereference. I
don't see this approach as a net gain over the call in the arch
resume notifier.

Thanks for the feedback!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
@ 2016-01-28 17:41       ` Mathieu Desnoyers
  0 siblings, 0 replies; 47+ messages in thread
From: Mathieu Desnoyers @ 2016-01-28 17:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Gleixner, Paul Turner, Andrew Hunter, Peter Zijlstra,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ingo Molnar, Ben Maurer,
	rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Jan 27, 2016, at 10:12 PM, Alexei Starovoitov alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:

> On Wed, Jan 27, 2016 at 11:54:41AM -0500, Mathieu Desnoyers wrote:
>> Expose a new system call allowing threads to register one userspace
>> memory area where to store the CPU number on which the calling thread is
>> running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
>> current thread. Upon return to user-space, a notify-resume handler
>> updates the current CPU value within each registered user-space memory
>> area. User-space can then read the current CPU number directly from
>> memory.
>> 
>> This getcpu cache is an improvement over current mechanisms available to
>> read the current CPU number, which has the following benefits:
>> 
>> - 44x speedup on ARM vs system call through glibc,
>> - 14x speedup on x86 compared to calling glibc, which calls vdso
>>   executing a "lsl" instruction,
>> - 11x speedup on x86 compared to inlined "lsl" instruction,
>> - Unlike vdso approaches, this cached value can be read from an inline
>>   assembly, which makes it a useful building block for restartable
>>   sequences.
>> - The getcpu cache approach is portable (e.g. ARM), which is not the
>>   case for the lsl-based x86 vdso.
>> 
>> On x86, yet another possible approach would be to use the gs segment
>> selector to point to user-space per-cpu data. This approach performs
>> similarly to the getcpu cache, but it has two disadvantages: it is
>> not portable, and it is incompatible with existing applications already
>> using the gs segment selector for other purposes.
> 
> Great work!

Thanks!

> The only concern is that every arch has to implement
> a call to getcpu_cache_handle_notify_resume() to be able to do put_user()
> from the safe place which is not pretty.

Indeed, I've considered the alternatives before going down that
route. The idea you propose below was among those I eventually
rejected. Here is why:

> Can we do better?
> Here is one crazy idea:
> The kernel can allocate the memory that user space will mmap()
> (ideally reusing perf ring-buffer alloc/mmap mechanism).
> then the kernel can just write cpuid into it from any place.

This requires the memory to be "mlock'd" or equivalent, because the
kernel cannot page fault when writing to it. That memory then becomes
impossible to swap out.

Also, how large should this mmap() area be ? Since there can be a
very large amount of threads created within a process, it would
probably need to be extended at some point. Then how do you manage
memory fragmentation, e.g. if there is no room left to extend the
mapping when a thread appears ?

> Then user space will register the 'offset' into this space for a given
> user space thread (or kernel will return it or ptr within this area)

This seems to require a "thread ID allocation" mechanism which allocates
IDs for threads starting from 0, with re-use of IDs when thread go away.
So we're adding a free-list of IDs per process or something similar.

In order to eliminate false-sharing, each "entry" in this array
would need to be at least cacheline-sized, which leads to wasted
memory space and cache compared to the TLS approach.

> and in finish_task_switch() the kernel will do
> *task->offset_converted_to_ptr = smp_processor_id();
> At init time the user space will do:
> __thread int *cpuid;
> cpuid = (void*)addr_from_mmap + registered_offset;
> and at runtime the '*cpuid' will give userspace what it wants.
> It's two loads to get cpuid vs getcpu_cache approach, but
> probably still fast enough?

Those per-cpu fast paths are extremely fast (they run in a couple
of nanoseconds). Adding a pointer dereference on the fast path
will likely be measurable.

> And this way we can have a mechanism to return much bigger
> structures to userspace. Kernel can update such area from any
> place and user space only needs one extra load to get the base of
> such per-cpu area and another load to fetch cpuid.
> Thoughts?

>From my understanding, your mmap-array proposal adds more complexity
than the one call in each architecture's resume notifier it's trying
to remove. It requires unswappable memory, a thread ID allocator, and
leads to a slower fast-path due to the extra pointer dereference. I
don't see this approach as a net gain over the call in the arch
resume notifier.

Thanks for the feedback!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2016-01-28 17:41 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-27 16:54 [RFC PATCH v2 0/3] getcpu_cache system call Mathieu Desnoyers
2016-01-27 16:54 ` [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread Mathieu Desnoyers
2016-01-27 17:20   ` Josh Triplett
2016-01-27 17:20     ` Josh Triplett
2016-01-27 17:24     ` Thomas Gleixner
2016-01-27 17:24       ` Thomas Gleixner
2016-01-27 17:36       ` Mathieu Desnoyers
2016-01-27 17:36         ` Mathieu Desnoyers
2016-01-27 18:02         ` Andrew Hunter
2016-01-27 18:02           ` Andrew Hunter
2016-01-27 18:03         ` Josh Triplett
2016-01-27 18:03           ` Josh Triplett
2016-01-27 18:43           ` Mathieu Desnoyers
2016-01-27 19:16             ` Josh Triplett
2016-01-27 21:02               ` Mathieu Desnoyers
2016-01-27 21:30                 ` Josh Triplett
2016-01-27 21:30                   ` Josh Triplett
2016-01-27 17:22   ` Thomas Gleixner
2016-01-27 17:22     ` Thomas Gleixner
2016-01-27 17:31     ` Mathieu Desnoyers
2016-01-27 17:31       ` Mathieu Desnoyers
2016-01-27 17:34       ` Thomas Gleixner
2016-01-27 17:34         ` Thomas Gleixner
2016-01-27 17:37         ` Thomas Gleixner
2016-01-27 17:37           ` Thomas Gleixner
2016-01-27 21:34           ` Mathieu Desnoyers
2016-01-27 21:34             ` Mathieu Desnoyers
2016-01-27 22:11             ` Josh Triplett
2016-01-27 22:11               ` Josh Triplett
2016-01-27 22:47               ` Mathieu Desnoyers
2016-01-27 22:47                 ` Mathieu Desnoyers
2016-01-28 11:12                 ` Heiko Carstens
2016-01-28 11:12                   ` Heiko Carstens
2016-01-28 13:33                   ` Mathieu Desnoyers
2016-01-28 13:33                     ` Mathieu Desnoyers
2016-01-28  3:12   ` Alexei Starovoitov
2016-01-28  3:12     ` Alexei Starovoitov
2016-01-28 17:41     ` Mathieu Desnoyers
2016-01-28 17:41       ` Mathieu Desnoyers
2016-01-27 16:54 ` [RFC PATCH v2 2/3] getcpu_cache: wire up ARM system call Mathieu Desnoyers
2016-01-27 18:19   ` Russell King - ARM Linux
2016-01-27 18:19     ` Russell King - ARM Linux
2016-01-27 18:46     ` Mathieu Desnoyers
2016-01-27 18:46       ` Mathieu Desnoyers
2016-01-27 23:03       ` Mathieu Desnoyers
2016-01-27 23:03         ` Mathieu Desnoyers
2016-01-27 16:54 ` [RFC PATCH v2 3/3] getcpu_cache: wire up x86 32/64 " Mathieu Desnoyers

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.