* [PATCH v4 0/5] getcpu_cache system call for 4.6
@ 2016-02-23 23:28 Mathieu Desnoyers
From: Mathieu Desnoyers @ 2016-02-23 23:28 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Hi,

Here is a patchset implementing a cache for the CPU number of the
currently running thread in user-space.

Benchmarks comparing this approach to a getcpu performed through a
system call on ARM show a 44x speedup. On x86-64 they show a 14x speedup
compared to executing lsl from the vDSO through glibc.

I added a man page in the changelog of patch 1/5, which shows an
example usage of this new system call.

This series is based on v4.5-rc5, submitted for Linux 4.6.

Feedback is welcome,

Thanks!

Mathieu


Mathieu Desnoyers (5):
  getcpu_cache system call: cache CPU number of running thread
  getcpu_cache: ARM resume notifier
  getcpu_cache: wire up ARM system call
  getcpu_cache: x86 32/64 resume notifier
  getcpu_cache: wire up x86 32/64 system call

 MAINTAINERS                            |   7 ++
 arch/arm/include/uapi/asm/unistd.h     |   1 +
 arch/arm/kernel/calls.S                |   3 +-
 arch/arm/kernel/signal.c               |   1 +
 arch/x86/entry/common.c                |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/exec.c                              |   1 +
 include/linux/sched.h                  |  36 ++++++++
 include/uapi/linux/Kbuild              |   1 +
 include/uapi/linux/getcpu_cache.h      |  42 +++++++++
 init/Kconfig                           |  10 ++
 kernel/Makefile                        |   1 +
 kernel/fork.c                          |   4 +
 kernel/getcpu_cache.c                  | 163 +++++++++++++++++++++++++++++++++
 kernel/sched/sched.h                   |   1 +
 kernel/sys_ni.c                        |   3 +
 17 files changed, 276 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/getcpu_cache.h
 create mode 100644 kernel/getcpu_cache.c

-- 
2.1.4


* [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
From: Mathieu Desnoyers @ 2016-02-23 23:28 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Expose a new system call allowing threads to register one userspace
memory area where to store the CPU number on which the calling thread is
running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within each registered user-space memory
area. User-space can then read the current CPU number directly from
memory.

This getcpu cache is an improvement over the current mechanisms
available for reading the current CPU number, and brings the following
benefits:

- 44x speedup on ARM vs system call through glibc,
- 14x speedup on x86 compared to calling glibc, which calls vdso
  executing a "lsl" instruction,
- 11x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cached value can be read from inline
  assembly, which makes it a useful building block for restartable
  sequences (a minimal read sketch follows this list).
- The getcpu cache approach is portable (e.g. ARM), which is not the
  case for the lsl-based x86 vdso.
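
Below is a minimal sketch (not part of this patch) of such a read of the
cached value. It assumes a thread-local variable __getcpu_cache_tls has
already been registered with GETCPU_CACHE_SET, as described in the man
page further down; the x86-64 inline assembly variant is only an
illustration, and a plain volatile load works on any architecture:

    #include <stdint.h>

    extern __thread volatile int32_t __getcpu_cache_tls;

    /*
     * Read the cached CPU number with a single aligned 32-bit load,
     * so the same access can also be emitted from inline assembly,
     * e.g. as the per-cpu index of a restartable sequence.
     */
    static inline int32_t read_cached_cpu(void)
    {
    #ifdef __x86_64__
            int32_t cpu;

            asm volatile ("movl %1, %0"
                          : "=r" (cpu)
                          : "m" (__getcpu_cache_tls));
            return cpu;
    #else
            return __getcpu_cache_tls;   /* volatile 32-bit load */
    #endif
    }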

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the getcpu cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.

This approach is inspired by Paul Turner and Andrew Hunter's work
on percpu atomics, which lets the kernel handle restart of critical
sections. [1] [2]

Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 10 (v7l)
Machine model: Wandboard i.MX6 Quad Board
- Baseline (empty loop):               10.1 ns
- Read CPU from getcpu cache:          10.1 ns
- glibc 2.19-0ubuntu6.6 getcpu:       445.6 ns
- getcpu system call:                 322.2 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop):                1.0 ns
- Read CPU from getcpu cache:           1.0 ns
- Read using gs segment selector:       1.0 ns
- "lsl" inline assembly:               11.2 ns
- glibc 2.19-0ubuntu6.6 getcpu:        14.3 ns
- getcpu system call:                  51.0 ns

[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: linux-api@vger.kernel.org
---

Changes since v1:
- Return -1, errno=EINVAL if cpu_cache pointer is not aligned on
  sizeof(int32_t).
- Update man page to describe the pointer alignement requirements and
  update atomicity guarantees.
- Add MAINTAINERS file GETCPU_CACHE entry.
- Remove dynamic memory allocation: go back to having a single
  getcpu_cache entry per thread. Update documentation accordingly.
- Rebased on Linux 4.4.

Changes since v2:
- Introduce a "cmd" argument, along with an enum with GETCPU_CACHE_GET
  and GETCPU_CACHE_SET. Introduce a uapi header linux/getcpu_cache.h
  defining this enumeration.
- Split resume notifier architecture implementation from the system call
  wire up in the following arch-specific patches.
- Man pages updates.
- Handle 32-bit compat pointers.
- Simplify handling of getcpu_cache GETCPU_CACHE_SET compiler barrier:
  set the current cpu cache pointer before doing the cache update, and
  set it back to NULL if the update fails. Setting it back to NULL on
  error ensures that no resume notifier will trigger a SIGSEGV if a
  migration happened concurrently.

Changes since v3:
- Fix __user annotations in compat code,
- Update memory ordering comments.
- Rebased on kernel v4.5-rc5.

Rationale for the getcpu_cache system call rather than the thread-local
ABI system call proposed earlier:

Rather than implementing a "generic" thread-local ABI, this specializes
the system call for a cpu number cache only. The thread-local ABI
approach would have required introducing "feature" flags, which would
have ended up reimplementing multiplexing of features on top of a
single system call. It seems better to introduce one system call per
feature instead.

Man page associated:

GETCPU_CACHE(2)       Linux Programmer's Manual      GETCPU_CACHE(2)

NAME
       getcpu_cache  -  cache CPU number on which the calling thread
       is running

SYNOPSIS
       #include <linux/getcpu_cache.h>
       #include <stdint.h>

       int getcpu_cache(int cmd, int32_t **cpu_cachep, int flags);

DESCRIPTION
       The getcpu_cache() system call helps speed up reading  the
       current CPU number by ensuring that the memory location reg-
       istered by each user-space thread always contains the number
       of the CPU on which the thread is running when  that  memory
       location is read.

       The cmd argument is one of the following:

       GETCPU_CACHE_GET
              Get the pointer to the current cpu number  cache  into
              the   memory   location  targeted  by  the  cpu_cachep
              pointer.

       GETCPU_CACHE_SET
              Attempt to set the current cpu number cache  by  using
              the pointer located in the memory location targeted by
              the cpu_cachep pointer. This pointer must  be  aligned
              on 4-byte multiples (natural alignment).

       The cpu_cachep argument is a pointer to an int32_t  pointer. It
       is used as an output argument for GETCPU_CACHE_GET, and as an
       input argument for GETCPU_CACHE_SET.

       The  flags argument is currently unused and must be specified
       as 0.

       Typically, a library or application will keep the cpu  number
       cache  in  a  thread-local  storage variable, or other memory
       areas belonging to each thread. It is recommended to  perform
       a  volatile  read of the cpu number cache to prevent the com‐
       piler from doing load tearing. An alternative approach is  to
       read  the  cpu  number cache from inline assembly in a single
       instruction.

       Each thread is responsible for registering its own cpu number
       cache.   Only  one  cpu  cache  address can be registered per
       thread.

       The symbol __getcpu_cache_tls is the recommended name to use
       across  libraries  and  applications  wishing  to register a
       thread-local getcpu_cache. The  "weak"  attribute  is recom-
       mended  when  declaring this variable in libraries.  Applica-
       tions can choose to define their own version of  this  symbol
       without the weak attribute, as a performance improvement.

       In  a  typical usage scenario, the thread registering the cpu
       number cache will be performing reads from that cache. It  is
       however  also allowed to read the cpu number cache from other
       threads. The cpu number cache updates performed by the kernel
       provide single-copy atomicity semantics, which guarantee that
       other threads performing single-copy atomic reads of the  cpu
       number cache will always observe a consistent value.
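
       As an illustration (a sketch that is not part of this manual
       page), another thread can read thread A's cache  with  a
       single-copy atomic load; a_cache is assumed to point to
       thread A's registered int32_t:

           #include <stdint.h>

           int32_t
           read_other_thread_cpu(int32_t *a_cache)
           {
               /* Relaxed single-copy atomic 32-bit load. */
               return __atomic_load_n(a_cache, __ATOMIC_RELAXED);
           }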

       Memory registered as cpu number cache should never be deallo‐
       cated before the thread which registered it  exits:  specifi‐
       cally, it should not be freed, and the library containing the
       registered thread-local storage should not be dlclose'd.

       Unregistration of the associated cpu cache is performed  im-
       plicitly when a thread or process exits.

RETURN VALUE
       A  return  value  of  0  indicates  success.  On error, -1 is
       returned, and errno is set appropriately.

ERRORS
       EINVAL Either flags is non-zero,  an  invalid  cmd  has  been
              specified,  or  the  GETCPU_CACHE_SET command has been
              specified and cpu_cachep points to a location contain-
              ing a NULL address or an address which is not  aligned
              on 4-byte multiples.

       ENOSYS The  getcpu_cache()  system call is not implemented by
              this kernel.

       EFAULT cpu_cachep is an invalid address, or cpu_cachep points
              to a location containing an invalid address.

       EBUSY  The GETCPU_CACHE_SET command has been specified, and a
              cpu cache address which differs from  the  content  of
              the  memory  location  pointed  to  by  cpu_cachep  is
              already registered for this thread.

       ENOENT The GETCPU_CACHE_GET command has been  specified,  but
              no cpu cache has been registered for this thread.

VERSIONS
       The getcpu_cache() system call was added in Linux 4.X (TODO).

CONFORMING TO
       getcpu_cache() is Linux-specific.

EXAMPLE
       The  following  code  uses  the getcpu_cache() system call to
       keep a thread local storage variable up to date with the cur‐
       rent  CPU  number,  with a fallback on sched_getcpu(3) if the
       cache is not available. For simplicity, the example does this
       in main(), but a multithreaded program  would  need to invoke
       getcpu_cache() from each of its threads.

           #define _GNU_SOURCE
           #include <stdlib.h>
           #include <stdio.h>
           #include <unistd.h>
           #include <stdint.h>
           #include <sched.h>
           #include <linux/getcpu_cache.h>
           #include <sys/syscall.h>

           static inline int
           getcpu_cache(int cmd, volatile int32_t **cpu_cachep, int flags)
           {
               return syscall(__NR_getcpu_cache, cmd, cpu_cachep, flags);
           }

           /*
            * __getcpu_cache_tls is recommended as symbol name for the
            * cpu number cache. Weak attribute is recommended when
            * declaring this variable in libraries. Applications can
            * choose to define their own version of this symbol without
            * the weak attribute and access it directly as a
            * performance improvement when it matches the address
            * returned by GETCPU_CACHE_GET. The initial value "-1"
            * will be read in case the getcpu cache is not available.
            */
           __thread __attribute__((weak)) volatile int32_t
                     __getcpu_cache_tls = -1;

           int
           main(int argc, char **argv)
           {
               volatile int32_t *cpu_cache = &__getcpu_cache_tls;
               int32_t cpu;

               /* Try to register the CPU cache. */
               if (getcpu_cache(GETCPU_CACHE_SET, &cpu_cache, 0) < 0) {
                   perror("getcpu_cache set");
                   fprintf(stderr, "Using sched_getcpu() as fallback.\n");
               }

               cpu = __getcpu_cache_tls;    /* Read current CPU number. */
               if (cpu < 0) {
                   /* Fallback on sched_getcpu(). */
                   cpu = sched_getcpu();
               }
               printf("Current CPU number: %d\n", cpu);

               exit(EXIT_SUCCESS);
           }
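
       A library that may race with the application or another  li-
       brary registering its own cache could, as a sketch (not part
       of this manual page, reusing the getcpu_cache() wrapper  and
       __getcpu_cache_tls variable from the example above), fall
       back to GETCPU_CACHE_GET on EBUSY and reuse the location that
       is already registered:

           #include <errno.h>

           static volatile int32_t *
           register_or_reuse_cpu_cache(void)
           {
               volatile int32_t *cache = &__getcpu_cache_tls;

               if (getcpu_cache(GETCPU_CACHE_SET, &cache, 0) == 0)
                   return cache;
               if (errno == EBUSY &&
                       getcpu_cache(GETCPU_CACHE_GET, &cache, 0) == 0)
                   return cache;  /* reuse the registered location */
               return NULL;       /* fall back to sched_getcpu(3) */
           }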

SEE ALSO
       sched_getcpu(3)

Linux                        2016-01-27              GETCPU_CACHE(2)
---
 MAINTAINERS                       |   7 ++
 fs/exec.c                         |   1 +
 include/linux/sched.h             |  36 +++++++++
 include/uapi/linux/Kbuild         |   1 +
 include/uapi/linux/getcpu_cache.h |  42 ++++++++++
 init/Kconfig                      |  10 +++
 kernel/Makefile                   |   1 +
 kernel/fork.c                     |   4 +
 kernel/getcpu_cache.c             | 163 ++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h              |   1 +
 kernel/sys_ni.c                   |   3 +
 11 files changed, 269 insertions(+)
 create mode 100644 include/uapi/linux/getcpu_cache.h
 create mode 100644 kernel/getcpu_cache.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 4978dc1..dfef1bc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4766,6 +4766,13 @@ M:	Joe Perches <joe@perches.com>
 S:	Maintained
 F:	scripts/get_maintainer.pl
 
+GETCPU_CACHE SUPPORT
+M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	kernel/getcpu_cache.c
+F:	include/uapi/linux/getcpu_cache.h
+
 GFS2 FILE SYSTEM
 M:	Steven Whitehouse <swhiteho@redhat.com>
 M:	Bob Peterson <rpeterso@redhat.com>
diff --git a/fs/exec.c b/fs/exec.c
index dcd4ac7..f4ec02f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1594,6 +1594,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	getcpu_cache_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current);
 	free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a..18f2f79 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1830,6 +1830,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_GETCPU_CACHE
+	int32_t __user *cpu_cache;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -3207,4 +3210,37 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+#ifdef CONFIG_GETCPU_CACHE
+void getcpu_cache_fork(struct task_struct *t);
+void getcpu_cache_execve(struct task_struct *t);
+void getcpu_cache_exit(struct task_struct *t);
+void __getcpu_cache_handle_notify_resume(struct task_struct *t);
+static inline void getcpu_cache_set_notify_resume(struct task_struct *t)
+{
+	if (t->cpu_cache)
+		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+static inline void getcpu_cache_handle_notify_resume(struct task_struct *t)
+{
+	if (t->cpu_cache)
+		__getcpu_cache_handle_notify_resume(t);
+}
+#else
+static inline void getcpu_cache_fork(struct task_struct *t)
+{
+}
+static inline void getcpu_cache_execve(struct task_struct *t)
+{
+}
+static inline void getcpu_cache_exit(struct task_struct *t)
+{
+}
+static inline void getcpu_cache_set_notify_resume(struct task_struct *t)
+{
+}
+static inline void getcpu_cache_handle_notify_resume(struct task_struct *t)
+{
+}
+#endif
+
 #endif
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index ebd10e6..1d7eb4d 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -136,6 +136,7 @@ header-y += futex.h
 header-y += gameport.h
 header-y += genetlink.h
 header-y += gen_stats.h
+header-y += getcpu_cache.h
 header-y += gfs2_ondisk.h
 header-y += gigaset_dev.h
 header-y += gsmmux.h
diff --git a/include/uapi/linux/getcpu_cache.h b/include/uapi/linux/getcpu_cache.h
new file mode 100644
index 0000000..25343b9
--- /dev/null
+++ b/include/uapi/linux/getcpu_cache.h
@@ -0,0 +1,42 @@
+#ifndef _UAPI_LINUX_GETCPU_CACHE_H
+#define _UAPI_LINUX_GETCPU_CACHE_H
+
+/*
+ * linux/getcpu_cache.h
+ *
+ * getcpu_cache system call API
+ *
+ * Copyright (c) 2015, 2016 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/**
+ * enum getcpu_cache_cmd - getcpu_cache system call command
+ * @GETCPU_CACHE_GET: Get the address of the current thread CPU number
+ *                    cache.
+ * @GETCPU_CACHE_SET: Set the address of the current thread CPU number
+ *                    cache.
+ */
+enum getcpu_cache_cmd {
+	GETCPU_CACHE_GET = 0,
+	GETCPU_CACHE_SET = 1,
+};
+
+#endif /* _UAPI_LINUX_GETCPU_CACHE_H */
diff --git a/init/Kconfig b/init/Kconfig
index 2232080..e8db8db 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1589,6 +1589,16 @@ config MEMBARRIER
 
 	  If unsure, say Y.
 
+config GETCPU_CACHE
+	bool "Enable getcpu cache" if EXPERT
+	default y
+	help
+	  Enable the getcpu cache system call. It provides a user-space
+	  cache for the current CPU number value, which speeds up
+	  getting the current CPU number from user-space.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf00..b630247 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_GETCPU_CACHE) += getcpu_cache.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 2e391c7..fad76d5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -252,6 +252,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(tsk == current);
 
 	cgroup_free(tsk);
+	getcpu_cache_exit(tsk);
 	task_numa_free(tsk);
 	security_task_free(tsk);
 	exit_creds(tsk);
@@ -1552,6 +1553,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 */
 	copy_seccomp(p);
 
+	if (!(clone_flags & CLONE_THREAD))
+		getcpu_cache_fork(p);
+
 	/*
 	 * Process group and session signals need to be delivered to just the
 	 * parent before the fork or both the parent and the child after the
diff --git a/kernel/getcpu_cache.c b/kernel/getcpu_cache.c
new file mode 100644
index 0000000..b7eaed0
--- /dev/null
+++ b/kernel/getcpu_cache.c
@@ -0,0 +1,163 @@
+/*
+ * Copyright (C) 2015 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * getcpu cache system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/compat.h>
+#include <linux/getcpu_cache.h>
+
+static int getcpu_cache_update(int32_t __user *cpu_cache)
+{
+	if (put_user(raw_smp_processor_id(), cpu_cache))
+		return -1;
+	return 0;
+}
+
+/*
+ * This resume handler should always be executed between a migration
+ * triggered by preemption and return to user-space.
+ */
+void __getcpu_cache_handle_notify_resume(struct task_struct *t)
+{
+	if (unlikely(t->flags & PF_EXITING))
+		return;
+	if (getcpu_cache_update(t->cpu_cache))
+		force_sig(SIGSEGV, t);
+}
+
+/*
+ * If the parent process has a thread-local ABI, the child inherits it. Only applies
+ * when forking a process, not a thread.
+ */
+void getcpu_cache_fork(struct task_struct *t)
+{
+	t->cpu_cache = current->cpu_cache;
+}
+
+void getcpu_cache_execve(struct task_struct *t)
+{
+	t->cpu_cache = NULL;
+}
+
+void getcpu_cache_exit(struct task_struct *t)
+{
+	t->cpu_cache = NULL;
+}
+
+static int __get_cpu_cache_ptr(int32_t __user **cpu_cache,
+		int32_t __user * __user *cpu_cachep)
+{
+#ifdef CONFIG_COMPAT
+	if (is_compat_task()) {
+		compat_uptr_t __user *compat_cachep =
+			(compat_uptr_t __user *) cpu_cachep;
+		compat_uptr_t compat_cache;
+
+		if (get_user(compat_cache, compat_cachep))
+			return -EFAULT;
+		*cpu_cache = compat_ptr(compat_cache);
+		return 0;
+	}
+#endif
+	return get_user(*cpu_cache, cpu_cachep);
+}
+
+#define get_cpu_cache_ptr(cpu_cache, cpu_cachep)	\
+	__get_cpu_cache_ptr(&(cpu_cache), cpu_cachep)
+
+static int put_cpu_cache_ptr(int32_t __user *cpu_cache,
+		int32_t __user * __user *cpu_cachep)
+{
+#ifdef CONFIG_COMPAT
+	if (is_compat_task()) {
+		compat_uptr_t compat_cache = ptr_to_compat(cpu_cache);
+		compat_uptr_t __user *compat_cachep =
+			(compat_uptr_t __user *) cpu_cachep;
+
+		return put_user(compat_cache, compat_cachep);
+	}
+#endif
+	return put_user(cpu_cache, cpu_cachep);
+}
+
+/*
+ * sys_getcpu_cache - setup getcpu cache for caller thread
+ */
+SYSCALL_DEFINE3(getcpu_cache, int, cmd, int32_t __user * __user *, cpu_cachep,
+		int, flags)
+{
+	if (unlikely(flags))
+		return -EINVAL;
+	switch (cmd) {
+	case GETCPU_CACHE_GET:
+		if (!current->cpu_cache)
+			return -ENOENT;
+		if (put_cpu_cache_ptr(current->cpu_cache, cpu_cachep))
+			return -EFAULT;
+		return 0;
+	case GETCPU_CACHE_SET:
+	{
+		int32_t __user *cpu_cache;
+
+		if (get_cpu_cache_ptr(cpu_cache, cpu_cachep))
+			return -EFAULT;
+		if (unlikely(!IS_ALIGNED((unsigned long)cpu_cache,
+				sizeof(int32_t)) || !cpu_cache))
+			return -EINVAL;
+		/*
+		 * Check if cpu_cache is already registered, and whether
+		 * the address differs from *cpu_cachep.
+		 */
+		if (current->cpu_cache) {
+			if (current->cpu_cache != cpu_cache)
+				return -EBUSY;
+			return 0;
+		}
+		current->cpu_cache = cpu_cache;
+		/*
+		 * Migration reads the current->cpu_cache pointer to
+		 * decide whether the notify_resume flag should be set.
+		 * Therefore, we need to ensure that the scheduler sees
+		 * the getcpu cache pointer update before we update the
+		 * getcpu cache content with the current CPU number.
+		 * This ensures we don't return from the getcpu_cache
+		 * system call to userspace with a wrong CPU number in
+		 * the cache if preempted and migrated after the initial
+		 * successful cpu cache update (below).
+		 *
+		 * This compiler barrier enforces ordering of the
+		 * current->cpu_cache address store before update of the
+		 * *cpu_cache.
+		 */
+		barrier();
+		/*
+		 * Do an initial cpu cache update to populate the
+		 * current CPU value, and to check whether the address
+		 * is valid, thus ensuring we return -EFAULT in case of an
+		 * invalid address rather than triggering a SIGSEGV if
+		 * put_user() fails in the resume notifier.
+		 */
+		if (getcpu_cache_update(cpu_cache)) {
+			current->cpu_cache = NULL;
+			return -EFAULT;
+		}
+		return 0;
+	}
+	default:
+		return -EINVAL;
+	}
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f1637..11ae33f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -971,6 +971,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 {
 	set_task_rq(p, cpu);
 #ifdef CONFIG_SMP
+	getcpu_cache_set_notify_resume(p);
 	/*
 	 * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
 	 * successfuly executed on another CPU. We must ensure that updates of
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 2c5e3a8..7e336c0 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -250,3 +250,6 @@ cond_syscall(sys_execveat);
 
 /* membarrier */
 cond_syscall(sys_membarrier);
+
+/* thread-local ABI */
+cond_syscall(sys_getcpu_cache);
-- 
2.1.4

* [PATCH v4 2/5] getcpu_cache: ARM resume notifier
From: Mathieu Desnoyers @ 2016-02-23 23:28 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Call the getcpu_cache_handle_notify_resume() function on return to
userspace if the TIF_NOTIFY_RESUME thread flag is set.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: linux-api@vger.kernel.org
---
 arch/arm/kernel/signal.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index 7b8f214..ff5052c 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -594,6 +594,7 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 			} else {
 				clear_thread_flag(TIF_NOTIFY_RESUME);
 				tracehook_notify_resume(regs);
+				getcpu_cache_handle_notify_resume(current);
 			}
 		}
 		local_irq_disable();
-- 
2.1.4


* [PATCH v4 3/5] getcpu_cache: wire up ARM system call
From: Mathieu Desnoyers @ 2016-02-23 23:28 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Wire up the getcpu cache system call on 32-bit ARM.

This provides an ABI that improves the speed of a getcpu operation
on ARM by skipping the getcpu system call on the fast path.
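
For reference, here is a minimal user-space sketch (not part of this
patch) invoking the newly wired system call through syscall(3). The
fallback definition of __NR_getcpu_cache is only illustrative; it
matches the ARM EABI number added below (__NR_SYSCALL_BASE is 0 on
EABI):

    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_getcpu_cache
    #define __NR_getcpu_cache 392
    #endif

    static int
    sys_getcpu_cache(int cmd, volatile int32_t **cpu_cachep, int flags)
    {
            return syscall(__NR_getcpu_cache, cmd, cpu_cachep, flags);
    }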

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: linux-api@vger.kernel.org
---
 arch/arm/include/uapi/asm/unistd.h | 1 +
 arch/arm/kernel/calls.S            | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index 5dd2528..1ad1351 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -418,6 +418,7 @@
 #define __NR_membarrier			(__NR_SYSCALL_BASE+389)
 #define __NR_mlock2			(__NR_SYSCALL_BASE+390)
 #define __NR_copy_file_range		(__NR_SYSCALL_BASE+391)
+#define __NR_getcpu_cache		(__NR_SYSCALL_BASE+392)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index dfc7cd6..7e794e9 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -399,8 +399,9 @@
 		CALL(sys_execveat)
 		CALL(sys_userfaultfd)
 		CALL(sys_membarrier)
-		CALL(sys_mlock2)
+/* 390 */	CALL(sys_mlock2)
 		CALL(sys_copy_file_range)
+		CALL(sys_getcpu_cache)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v4 4/5] getcpu_cache: x86 32/64 resume notifier
  2016-02-23 23:28 [PATCH v4 0/5] getcpu_cache system call for 4.6 Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2016-02-23 23:28 ` [PATCH v4 3/5] getcpu_cache: wire up ARM system call Mathieu Desnoyers
@ 2016-02-23 23:28 ` Mathieu Desnoyers
  2016-02-23 23:28 ` [PATCH v4 5/5] getcpu_cache: wire up x86 32/64 system call Mathieu Desnoyers
  2016-02-24  1:36   ` H. Peter Anvin
  5 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-23 23:28 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Call the getcpu_cache_handle_notify_resume() function on return to
userspace if the TIF_NOTIFY_RESUME thread flag is set.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: linux-api@vger.kernel.org
---
 arch/x86/entry/common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 0366374..eb6bcae 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -249,6 +249,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_NOTIFY_RESUME) {
 			clear_thread_flag(TIF_NOTIFY_RESUME);
 			tracehook_notify_resume(regs);
+			getcpu_cache_handle_notify_resume(current);
 		}
 
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v4 5/5] getcpu_cache: wire up x86 32/64 system call
  2016-02-23 23:28 [PATCH v4 0/5] getcpu_cache system call for 4.6 Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2016-02-23 23:28 ` [PATCH v4 4/5] getcpu_cache: x86 32/64 resume notifier Mathieu Desnoyers
@ 2016-02-23 23:28 ` Mathieu Desnoyers
  2016-02-24  1:36   ` H. Peter Anvin
  5 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-23 23:28 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Wire up the getcpu_cache system call on x86 32/64.

This provides an ABI that improves the speed of a getcpu operation
on x86 by removing the need to perform a function call, "lsl"
instruction, or system call on the fast path.
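
For reference, the user-space fast path this enables boils down to a
single load from a thread-local variable (a hedged sketch; the
thread-local layout and the registration step are assumptions based on
this series, not an established ABI):

	#include <stdint.h>

	/* Kept current by the kernel via the notify-resume handler (assumed). */
	static __thread volatile int32_t cpu_cache = -1;

	static inline int32_t fast_getcpu(void)
	{
		/* One TLS load: no function call, no "lsl", no system call. */
		return cpu_cache;
	}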

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: linux-api@vger.kernel.org
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index cb713df..c2372a7 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -384,3 +384,4 @@
 375	i386	membarrier		sys_membarrier
 376	i386	mlock2			sys_mlock2
 377	i386	copy_file_range		sys_copy_file_range
+378	i386	getcpu_cache		sys_getcpu_cache
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index dc1040a..6b3ffa0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -333,6 +333,7 @@
 324	common	membarrier		sys_membarrier
 325	common	mlock2			sys_mlock2
 326	common	copy_file_range		sys_copy_file_range
+327	common	getcpu_cache		sys_getcpu_cache
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 3/5] getcpu_cache: wire up ARM system call
@ 2016-02-24  0:54     ` kbuild test robot
  0 siblings, 0 replies; 96+ messages in thread
From: kbuild test robot @ 2016-02-24  0:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: kbuild-all, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Mathieu Desnoyers

[-- Attachment #1: Type: text/plain, Size: 2986 bytes --]

Hi Mathieu,

[auto build test ERROR on tip/x86/core]
[also build test ERROR on v4.5-rc5]
[cannot apply to next-20160223]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Mathieu-Desnoyers/getcpu_cache-system-call-for-4-6/20160224-073424
config: arm-at91_dt_defconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   arch/arm/kernel/entry-common.S: Assembler messages:
>> arch/arm/kernel/entry-common.S:132: Error: __NR_syscalls is not equal to the size of the syscall table

vim +132 arch/arm/kernel/entry-common.S

9fff2fa0 Al Viro         2012-10-10  116  	movne	r0, r4
14327c66 Russell King    2015-04-21  117  	badrne	lr, 1f
6ebbf2ce Russell King    2014-06-30  118  	retne	r5
68687c84 Russell King    2012-10-15  119  1:	get_thread_info tsk
^1da177e Linus Torvalds  2005-04-16  120  	b	ret_slow_syscall
93ed3970 Catalin Marinas 2008-08-28  121  ENDPROC(ret_from_fork)
^1da177e Linus Torvalds  2005-04-16  122  
fa1b4f91 Al Viro         2006-01-19  123  	.equ NR_syscalls,0
fa1b4f91 Al Viro         2006-01-19  124  #define CALL(x) .equ NR_syscalls,NR_syscalls+1
^1da177e Linus Torvalds  2005-04-16  125  #include "calls.S"
1f66e06f Wade Farnsworth 2012-09-07  126  
1f66e06f Wade Farnsworth 2012-09-07  127  /*
1f66e06f Wade Farnsworth 2012-09-07  128   * Ensure that the system call table is equal to __NR_syscalls,
1f66e06f Wade Farnsworth 2012-09-07  129   * which is the value the rest of the system sees
1f66e06f Wade Farnsworth 2012-09-07  130   */
1f66e06f Wade Farnsworth 2012-09-07  131  .ifne NR_syscalls - __NR_syscalls
1f66e06f Wade Farnsworth 2012-09-07 @132  .error "__NR_syscalls is not equal to the size of the syscall table"
1f66e06f Wade Farnsworth 2012-09-07  133  .endif
1f66e06f Wade Farnsworth 2012-09-07  134  
fa1b4f91 Al Viro         2006-01-19  135  #undef CALL
fa1b4f91 Al Viro         2006-01-19  136  #define CALL(x) .long x
^1da177e Linus Torvalds  2005-04-16  137  
^1da177e Linus Torvalds  2005-04-16  138  /*=============================================================================
^1da177e Linus Torvalds  2005-04-16  139   * SWI handler
^1da177e Linus Torvalds  2005-04-16  140   *-----------------------------------------------------------------------------

:::::: The code at line 132 was first introduced by commit
:::::: 1f66e06fb6414732bef7bf4a071ef76a837badec ARM: 7524/1: support syscall tracing

:::::: TO: Wade Farnsworth <wade_farnsworth@mentor.com>
:::::: CC: Russell King <rmk+kernel@arm.linux.org.uk>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 20794 bytes --]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v4 (updated)] getcpu_cache: wire up ARM system call
@ 2016-02-24  1:05     ` Mathieu Desnoyers
  0 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-24  1:05 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Mathieu Desnoyers

Wire up the getcpu cache system call on 32-bit ARM.

This provides an ABI that improves the speed of a getcpu operation
on ARM by skipping the getcpu system call on the fast path.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: linux-api@vger.kernel.org
---
 arch/arm/include/asm/unistd.h      | 2 +-
 arch/arm/include/uapi/asm/unistd.h | 1 +
 arch/arm/kernel/calls.S            | 3 ++-
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 7b84657..194b699 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -19,7 +19,7 @@
  * This may need to be greater than __NR_last_syscall+1 in order to
  * account for the padding in the syscall table
  */
-#define __NR_syscalls  (392)
+#define __NR_syscalls  (396)
 
 #define __ARCH_WANT_STAT64
 #define __ARCH_WANT_SYS_GETHOSTNAME
diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index 5dd2528..1ad1351 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -418,6 +418,7 @@
 #define __NR_membarrier			(__NR_SYSCALL_BASE+389)
 #define __NR_mlock2			(__NR_SYSCALL_BASE+390)
 #define __NR_copy_file_range		(__NR_SYSCALL_BASE+391)
+#define __NR_getcpu_cache		(__NR_SYSCALL_BASE+392)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index dfc7cd6..7e794e9 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -399,8 +399,9 @@
 		CALL(sys_execveat)
 		CALL(sys_userfaultfd)
 		CALL(sys_membarrier)
-		CALL(sys_mlock2)
+/* 390 */	CALL(sys_mlock2)
 		CALL(sys_copy_file_range)
+		CALL(sys_getcpu_cache)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 0/5] getcpu_cache system call for 4.6
@ 2016-02-24  1:36   ` H. Peter Anvin
  0 siblings, 0 replies; 96+ messages in thread
From: H. Peter Anvin @ 2016-02-24  1:36 UTC (permalink / raw)
  To: Mathieu Desnoyers, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar
  Cc: linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On 02/23/2016 03:28 PM, Mathieu Desnoyers wrote:
> Hi,
> 
> Here is a patchset implementing a cache for the CPU number of the
> currently running thread in user-space.
> 
> Benchmarks comparing this approach to a getcpu based on system call on
> ARM show a 44x speedup. They show a 14x speedup on x86-64 compared to
> executing lsl from a vDSO through glibc.
> 
> I'm added a man page in the changelog of patch 1/3, which shows an
> example usage of this new system call.
> 
> This series is based on v4.5-rc5, submitted for Linux 4.6.
> 
> Feedback is welcome,
> 

What is the resulting context switch overhead?

	-hpa

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 0/5] getcpu_cache system call for 4.6
  2016-02-24  1:36   ` H. Peter Anvin
  (?)
@ 2016-02-24  4:09   ` Mathieu Desnoyers
  2016-02-24 20:07       ` H. Peter Anvin
  -1 siblings, 1 reply; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-24  4:09 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Feb 23, 2016, at 8:36 PM, H. Peter Anvin hpa@zytor.com wrote:

> On 02/23/2016 03:28 PM, Mathieu Desnoyers wrote:
>> Hi,
>> 
>> Here is a patchset implementing a cache for the CPU number of the
>> currently running thread in user-space.
>> 
>> Benchmarks comparing this approach to a getcpu based on system call on
>> ARM show a 44x speedup. They show a 14x speedup on x86-64 compared to
>> executing lsl from a vDSO through glibc.
>> 
>> I'm added a man page in the changelog of patch 1/3, which shows an
>> example usage of this new system call.
>> 
>> This series is based on v4.5-rc5, submitted for Linux 4.6.
>> 
>> Feedback is welcome,
>> 
> 
> What is the resulting context switch overhead?

The getcpu_cache only adds code to the thread migration path,
and to the resume notifier. The context switch path per se is
untouched. I would therefore expect the overhead on context
switch to be within the noise, unless something like hackbench
is so sensitive to the size of struct task_struct that
a single extra pointer added at the end of struct task_struct
throws off the benchmarks.
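
Concretely, the hook on the migration path amounts to something like
the following sketch (the helper name is illustrative, not necessarily
the one used in the series):

	static inline void getcpu_cache_queue_update(struct task_struct *t)
	{
		/* Only threads that registered a getcpu cache pay anything. */
		if (t->cpu_cache)
			set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
	}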

Is that what you are concerned about ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 (updated)] getcpu_cache: wire up ARM system call
@ 2016-02-24  5:28       ` kbuild test robot
  0 siblings, 0 replies; 96+ messages in thread
From: kbuild test robot @ 2016-02-24  5:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: kbuild-all, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Mathieu Desnoyers

[-- Attachment #1: Type: text/plain, Size: 1915 bytes --]

Hi Mathieu,

[auto build test ERROR on arm/for-next]
[also build test ERROR on v4.5-rc5 next-20160223]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Mathieu-Desnoyers/getcpu_cache-wire-up-ARM-system-call/20160224-090642
base:   http://repo.or.cz/linux-2.6/linux-2.6-arm.git for-next
config: arm-badge4_defconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   arch/arm/kernel/built-in.o: In function `__sys_trace_return_nosave':
>> arch/arm/kernel/entry-common.S:284: undefined reference to `sys_getcpu_cache'

vim +284 arch/arm/kernel/entry-common.S

^1da177e4 Linus Torvalds 2005-04-16  278  	b	ret_slow_syscall
^1da177e4 Linus Torvalds 2005-04-16  279  
3302caddf Russell King   2015-08-20  280  __sys_trace_return_nosave:
e0aa3a665 Russell King   2015-08-20  281  	enable_irq_notrace
3302caddf Russell King   2015-08-20  282  	mov	r0, sp
3302caddf Russell King   2015-08-20  283  	bl	syscall_trace_exit
3302caddf Russell King   2015-08-20 @284  	b	ret_slow_syscall
3302caddf Russell King   2015-08-20  285  
^1da177e4 Linus Torvalds 2005-04-16  286  	.align	5
^1da177e4 Linus Torvalds 2005-04-16  287  #ifdef CONFIG_ALIGNMENT_TRAP

:::::: The code at line 284 was first introduced by commit
:::::: 3302caddf10ad50710dbb7a94ccbdb3ad5bf1412 ARM: entry: efficiency cleanups

:::::: TO: Russell King <rmk+kernel@arm.linux.org.uk>
:::::: CC: Russell King <rmk+kernel@arm.linux.org.uk>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 14225 bytes --]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 (updated)] getcpu_cache: wire up ARM system call
@ 2016-02-24  6:54       ` kbuild test robot
  0 siblings, 0 replies; 96+ messages in thread
From: kbuild test robot @ 2016-02-24  6:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: kbuild-all, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-kernel, linux-api,
	Paul Turner, Andrew Hunter, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Dave Watson, Chris Lameter, Ben Maurer,
	Steven Rostedt, Paul E. McKenney, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Mathieu Desnoyers

[-- Attachment #1: Type: text/plain, Size: 2166 bytes --]

Hi Mathieu,

[auto build test ERROR on arm/for-next]
[also build test ERROR on v4.5-rc5 next-20160223]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Mathieu-Desnoyers/getcpu_cache-wire-up-ARM-system-call/20160224-090642
base:   http://repo.or.cz/linux-2.6/linux-2.6-arm.git for-next
config: arm-efm32_defconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   arch/arm/kernel/built-in.o: In function `__sys_trace_return_nosave':
>> arch/arm/kernel/entry-header.S:312: undefined reference to `sys_getcpu_cache'

vim +312 arch/arm/kernel/entry-header.S

a18f3645 Daniel Thompson  2015-01-09  306  	ldmdb	r2, {r0 - lr}^			@ get calling r0 - lr
b86040a5 Catalin Marinas  2009-07-24  307  	.endif
8e4971f2 Anders Grafström 2010-03-15  308  	mov	r0, r0				@ ARMv5T and earlier require a nop
8e4971f2 Anders Grafström 2010-03-15  309  						@ after ldm {}^
a18f3645 Daniel Thompson  2015-01-09  310  	add	sp, sp, #\offset + S_FRAME_SIZE
b86040a5 Catalin Marinas  2009-07-24  311  	movs	pc, lr				@ return & move spsr_svc into cpsr
aa06e5c1 Russell King     2015-08-26 @312  #elif defined(CONFIG_CPU_V7M)
aa06e5c1 Russell King     2015-08-26  313  	@ V7M restore.
aa06e5c1 Russell King     2015-08-26  314  	@ Note that we don't need to do clrex here as clearing the local
aa06e5c1 Russell King     2015-08-26  315  	@ monitor is part of the exception entry and exit sequence.

:::::: The code at line 312 was first introduced by commit
:::::: aa06e5c1f9c2b466712be904cc5b56a813e24cfd ARM: entry: get rid of multiple macro definitions

:::::: TO: Russell King <rmk+kernel@arm.linux.org.uk>
:::::: CC: Russell King <rmk+kernel@arm.linux.org.uk>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 10569 bytes --]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
  2016-02-23 23:28   ` Mathieu Desnoyers
  (?)
@ 2016-02-24 11:11   ` Thomas Gleixner
  2016-02-24 17:17     ` Mathieu Desnoyers
  2016-02-25 23:32       ` Rasmus Villemoes
  -1 siblings, 2 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-02-24 11:11 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Ingo Molnar, H. Peter Anvin,
	linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Tue, 23 Feb 2016, Mathieu Desnoyers wrote:
> +/*
> + * If parent process has a thread-local ABI, the child inherits. Only applies
> + * when forking a process, not a thread.
> + */
> +void getcpu_cache_fork(struct task_struct *t)
> +{
> +	t->cpu_cache = current->cpu_cache;
> +}
> +
> +void getcpu_cache_execve(struct task_struct *t)
> +{
> +	t->cpu_cache = NULL;
> +}
> +
> +void getcpu_cache_exit(struct task_struct *t)
> +{
> +	t->cpu_cache = NULL;
> +}

That's hardly worth a function call. Please inline.
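
Something along these lines would do (a sketch of the suggested
inlining, assuming the helpers move into a header; exact placement is
up to you):

	#ifdef CONFIG_GETCPU_CACHE
	/* Same bodies as above, just static inline instead of out-of-line. */
	static inline void getcpu_cache_fork(struct task_struct *t)
	{
		t->cpu_cache = current->cpu_cache;
	}

	static inline void getcpu_cache_execve(struct task_struct *t)
	{
		t->cpu_cache = NULL;
	}

	static inline void getcpu_cache_exit(struct task_struct *t)
	{
		t->cpu_cache = NULL;
	}
	#endif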

> +/*
> + * sys_getcpu_cache - setup getcpu cache for caller thread
> + */
> +SYSCALL_DEFINE3(getcpu_cache, int, cmd, int32_t __user * __user *, cpu_cachep,
> +		int, flags)
> +{
> +	if (unlikely(flags))
> +		return -EINVAL;

New line for readability's sake.

> +	switch (cmd) {
> +	case GETCPU_CACHE_GET:

Other than that: Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
  2016-02-24 11:11   ` Thomas Gleixner
@ 2016-02-24 17:17     ` Mathieu Desnoyers
  2016-02-25 23:32       ` Rasmus Villemoes
  1 sibling, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-24 17:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, Russell King, Ingo Molnar, H. Peter Anvin,
	linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Feb 24, 2016, at 6:11 AM, Thomas Gleixner tglx@linutronix.de wrote:

> On Tue, 23 Feb 2016, Mathieu Desnoyers wrote:
>> +/*
>> + * If parent process has a thread-local ABI, the child inherits. Only applies
>> + * when forking a process, not a thread.
>> + */
>> +void getcpu_cache_fork(struct task_struct *t)
>> +{
>> +	t->cpu_cache = current->cpu_cache;
>> +}
>> +
>> +void getcpu_cache_execve(struct task_struct *t)
>> +{
>> +	t->cpu_cache = NULL;
>> +}
>> +
>> +void getcpu_cache_exit(struct task_struct *t)
>> +{
>> +	t->cpu_cache = NULL;
>> +}
> 
> That's hardly worth a function call. Please inline.
> 
>> +/*
>> + * sys_getcpu_cache - setup getcpu cache for caller thread
>> + */
>> +SYSCALL_DEFINE3(getcpu_cache, int, cmd, int32_t __user * __user *, cpu_cachep,
>> +		int, flags)
>> +{
>> +	if (unlikely(flags))
>> +		return -EINVAL;
> 
> New line for readability sake.
> 
>> +	switch (cmd) {
>> +	case GETCPU_CACHE_GET:
> 
> Other than that: Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

Thanks! Will do those changes for v5 and add your Reviewed-by tag.

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 0/5] getcpu_cache system call for 4.6
@ 2016-02-24 20:07       ` H. Peter Anvin
  0 siblings, 0 replies; 96+ messages in thread
From: H. Peter Anvin @ 2016-02-24 20:07 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On February 23, 2016 8:09:23 PM PST, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>----- On Feb 23, 2016, at 8:36 PM, H. Peter Anvin hpa@zytor.com wrote:
>
>> On 02/23/2016 03:28 PM, Mathieu Desnoyers wrote:
>>> Hi,
>>> 
>>> Here is a patchset implementing a cache for the CPU number of the
>>> currently running thread in user-space.
>>> 
>>> Benchmarks comparing this approach to a getcpu based on system call
>on
>>> ARM show a 44x speedup. They show a 14x speedup on x86-64 compared
>to
>>> executing lsl from a vDSO through glibc.
>>> 
>>> I'm added a man page in the changelog of patch 1/3, which shows an
>>> example usage of this new system call.
>>> 
>>> This series is based on v4.5-rc5, submitted for Linux 4.6.
>>> 
>>> Feedback is welcome,
>>> 
>> 
>> What is the resulting context switch overhead?
>
>The getcpu_cache only adds code to the thread migration path,
>and to the resume notifier. The context switch path per se is
>untouched. I would therefore expect the overhead on context
>switch to be within the noise, unless something like hackbench
>is so sensitive to the size of struct task_struct that
>a single extra pointer added at the end of struct task_struct
>throws off the benchmarks.
>
>Is that what you are concerned about ?
>
>Thanks,
>
>Mathieu

Yes, I'd like to see numbers.  It is way too easy to handwave small changes away, but they add up over time.  Without numbers it is a bit hard to quantify the pros vs the cons.
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 0/5] getcpu_cache system call for 4.6
@ 2016-02-24 22:38         ` Mathieu Desnoyers
  0 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-24 22:38 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Peter Zijlstra, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Feb 24, 2016, at 3:07 PM, H. Peter Anvin hpa@zytor.com wrote:

> On February 23, 2016 8:09:23 PM PST, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>----- On Feb 23, 2016, at 8:36 PM, H. Peter Anvin hpa@zytor.com wrote:
>>
>>> On 02/23/2016 03:28 PM, Mathieu Desnoyers wrote:
>>>> Hi,
>>>> 
>>>> Here is a patchset implementing a cache for the CPU number of the
>>>> currently running thread in user-space.
>>>> 
>>>> Benchmarks comparing this approach to a getcpu based on system call
>>on
>>>> ARM show a 44x speedup. They show a 14x speedup on x86-64 compared
>>to
>>>> executing lsl from a vDSO through glibc.
>>>> 
>>>> I'm added a man page in the changelog of patch 1/3, which shows an
>>>> example usage of this new system call.
>>>> 
>>>> This series is based on v4.5-rc5, submitted for Linux 4.6.
>>>> 
>>>> Feedback is welcome,
>>>> 
>>> 
>>> What is the resulting context switch overhead?
>>
>>The getcpu_cache only adds code to the thread migration path,
>>and to the resume notifier. The context switch path per se is
>>untouched. I would therefore expect the overhead on context
>>switch to be within the noise, unless something like hackbench
>>is so sensitive to the size of struct task_struct that
>>a single extra pointer added at the end of struct task_struct
>>throws off the benchmarks.
>>
>>Is that what you are concerned about ?
>>
>>Thanks,
>>
>>Mathieu
> 
> Yes, I'd like to see numbers.  It is way easy to handwave small changes away,
> but they add up over time.  Without numbers it is a bit hard to quantify the
> pro vs con.

- Speed

Ten runs of hackbench -l 100000 on a 2-socket, 8-core-per-socket Intel(R) Xeon(R)
CPU E5-2630 v3 @ 2.40GHz (directly on hardware, no virtualization, hyperthreading
enabled), with a 4.5-rc5 defconfig+localyesconfig and the getcpu_cache series
applied, suggest that the sched switch impact of this new configuration option
is within the noise:

* CONFIG_GETCPU_CACHE=n

avg.:      26.63 s
std.dev.:   0.38 s

* CONFIG_GETCPU_CACHE=y

avg.:      26.52 s
std.dev.:   0.47 s


- Size

Going from CONFIG_GETCPU_CACHE=n to =y adds 704 bytes to the compressed kernel
zImage. The vmlinux text size increases by 512 bytes, and the vmlinux data size
also increases by 512 bytes.

* CONFIG_GETCPU_CACHE=n
    text     data      bss       dec      hex  filename
16802349  2745968  1564672  21112989  142289d  vmlinux

* CONFIG_GETCPU_CACHE=y
    text     data      bss       dec      hex  filename
16802861  2746480  1564672  21114013  1422c9d  vmlinux

Am I missing anything ? I plan to add this information to the
changelog for my next round (v5).

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-25  9:56     ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2016-02-25  9:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Tue, Feb 23, 2016 at 06:28:36PM -0500, Mathieu Desnoyers wrote:
> This approach is inspired by Paul Turner and Andrew Hunter's work
> on percpu atomics, which lets the kernel handle restart of critical
> sections. [1] [2]

So I'd like a few extra words on the intersection with that work.

Yes, that also needs a CPU number, but that needs a little extra as
well. Can this work be extended to provide the little extra and is the
getcpu name still sane in that case?

Alternatively, could you not, at equal speed, get the CPU number from
the restartable sequence data?

That is, do explain why we want both.

(And remind Paul to keep pushing that)

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-25 16:55       ` Mathieu Desnoyers
  0 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-25 16:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Feb 25, 2016, at 4:56 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Feb 23, 2016 at 06:28:36PM -0500, Mathieu Desnoyers wrote:
>> This approach is inspired by Paul Turner and Andrew Hunter's work
>> on percpu atomics, which lets the kernel handle restart of critical
>> sections. [1] [2]
> 
> So I'd like a few extra words on the intersection with that work.
> 
> Yes, that also needs a CPU number, but that needs a little extra as
> well. Can this work be extended to provide the little extra and is the
> getcpu name still sane in that case?
> 
> Alternatively, could you not, at equal speed, get the CPU number from
> the restartable sequence data?
> 
> That is, do explain why we want both.

Paul Turner's percpu atomics (restartable sequences) allow
turning atomic instructions (e.g. LOCK; cmpxchg on x86) meant
to update userspace per-cpu data into a sequence of instructions
that end with a single commit instruction. The primary use-case
is implementing efficient memory allocators with
per-cpu memory pools (rather than global or per-thread pools).

This is made possible through collaboration between the kernel and
user-space, where user-space marks the boundaries of the "rseq"
critical section, and the kernel moves the instruction pointer
to a restart address (also published by user-space) if it
preempts/migrates/delivers a signal over that critical section.

The benefit of those restartable sequences over atomic instructions
is that it is much faster to execute a sequence of simple non-atomic
instructions (e.g. load, test, cond. branch, store) than a single
atomic instruction.
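
For illustration, here is a minimal stand-alone sketch of that cost gap
(this is not a restartable sequence; the timing harness, iteration count
and variable names are arbitrary choices made for the example):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

static uint64_t counter;

static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    uint64_t t0, t1, t2;
    unsigned long i;

    t0 = now_ns();
    for (i = 0; i < ITERS; i++)         /* atomic read-modify-write */
        __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
    t1 = now_ns();
    for (i = 0; i < ITERS; i++) {       /* plain load/add/store */
        uint64_t v = counter;

        __asm__ __volatile__("" ::: "memory"); /* keep the loop honest */
        counter = v + 1;
    }
    t2 = now_ns();
    printf("atomic add: %.2f ns/op, plain add: %.2f ns/op\n",
           (double)(t1 - t0) / ITERS, (double)(t2 - t1) / ITERS);
    return 0;
}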

The restartable sequences are intrinsically designed to work
on per-cpu data, so they need to fetch the current CPU number
within the rseq critical section. This is where the getcpu_cache
system call becomes very useful when combined with rseq:
getcpu_cache allows reading the current CPU number in a
fraction of a cycle.

However, there are other use-cases for having a fast mechanism
for reading the current CPU number, besides restartable sequences.
For instance, it can be used by glibc to implement a faster
sched_getcpu. Therefore, implementing getcpu_cache as its own
system call makes sense: an architecture could very well just
introduce getcpu_cache even if it cannot support restartable
sequences for some reason. Also, a kernel configuration can
enable getcpu_cache (since it has no effect on the scheduler
switch time, only on migration) without enabling restartable
sequences.

The main reason why I decided to start working on getcpu_cache
is because I noticed that the restartable sequences system
call originally proposed by Paul Turner was trying to accomplish
too much at once: both handling of restartable sequences, and
quickly reading the current CPU number. My thinking is that
the issue of reading the current CPU number could be completely
taken out of the rseq picture by having rseq rely on the
address registered by getcpu_cache to read the CPU number.
This would therefore simplify the implementation of rseq,
and allow us to focus the rseq review discussions without
being side-tracked on the simpler problem of quickly reading
the current CPU number.

> 
> (And remind Paul to keep pushing that)

Indeed, I look forward to Paul's feedback on my review of his
last patchset round. Hopefully this getcpu_cache work will
allow us to better focus the discussions on rseq work.

Is the explanation above OK for you ? I'll add it to the
Changelog in v5 of the getcpu_cache series if so.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-25 17:04         ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2016-02-25 17:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Thu, Feb 25, 2016 at 04:55:26PM +0000, Mathieu Desnoyers wrote:
> ----- On Feb 25, 2016, at 4:56 AM, Peter Zijlstra peterz@infradead.org wrote:
> The restartable sequences are intrinsically designed to work
> on per-cpu data, so they need to fetch the current CPU number
> within the rseq critical section. This is where the getcpu_cache
> system call becomes very useful when combined with rseq:
> getcpu_cache allows reading the current CPU number in a
> fraction of cycle.

Yes yes, I know how restartable sequences work.

But what I worry about is that they want a cpu number and a sequence
number, and for performance it would be very good if those live in the
same cacheline.

That means either getcpu needs to grow a seq number, or restartable
sequences need to _also_ provide the cpu number.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-25 17:17           ` Mathieu Desnoyers
  0 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-25 17:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Feb 25, 2016, at 12:04 PM, Peter Zijlstra peterz@infradead.org wrote:

> On Thu, Feb 25, 2016 at 04:55:26PM +0000, Mathieu Desnoyers wrote:
>> ----- On Feb 25, 2016, at 4:56 AM, Peter Zijlstra peterz@infradead.org wrote:
>> The restartable sequences are intrinsically designed to work
>> on per-cpu data, so they need to fetch the current CPU number
>> within the rseq critical section. This is where the getcpu_cache
>> system call becomes very useful when combined with rseq:
>> getcpu_cache allows reading the current CPU number in a
>> fraction of cycle.
> 
> Yes yes, I know how restartable sequences work.
> 
> But what I worry about is that they want a cpu number and a sequence
> number, and for performance it would be very good if those live in the
> same cacheline.
> 
> That means either getcpu needs to grow a seq number, or restartable
> sequences need to _also_ provide the cpu number.

If we plan things well, we could have both the cpu number and the
seqnum in the same cache line, registered by two different system
calls. It's up to user-space to organize those two variables
to fit within the same cache-line.
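
A rough sketch of such a user-space layout (purely illustrative, not an
ABI from this series; it assumes 64-byte cache lines, and the type and
field names are made up):

#include <stdint.h>

struct percpu_user_state {
    volatile int32_t cpu_cache;   /* address handed to getcpu_cache() */
    volatile uint32_t seqnum;     /* address an rseq syscall would update */
} __attribute__((aligned(64)));

/* One TLS instance per thread; both fields share a single cache line. */
static __thread struct percpu_user_state percpu_state = {
    .cpu_cache = -1,
};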

getcpu_cache GETCPU_CACHE_SET operation takes the address where
the CPU number should live as input.

rseq system call could do the same for the seqnum address.

The question becomes: how do we introduce this to user-space,
considering that only a single address per thread is allowed
for each of getcpu_cache and rseq ?

If both CPU number and seqnum are centralized in a TLS within
e.g. glibc, that would be OK, but if we intend to allow libraries
or applications to directly register their own getcpu_cache
address and/or rseq, we may end up in situations where we have
to fall back on using two different cache lines. But how much
should we care about performance in cases where non-generic
libraries directly use those system calls ?

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
  2016-02-24 11:11   ` Thomas Gleixner
@ 2016-02-25 23:32       ` Rasmus Villemoes
  2016-02-25 23:32       ` Rasmus Villemoes
  1 sibling, 0 replies; 96+ messages in thread
From: Rasmus Villemoes @ 2016-02-25 23:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, Steven Rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk

On Wed, Feb 24 2016, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

>
>        Typically, a library or application will keep the cpu  number
>        cache  in  a  thread-local  storage variable, or other memory
>        areas belonging to each thread. It is recommended to  perform
>        a  volatile  read of the cpu number cache to prevent the com‐
>        piler from doing load tearing. An alternative approach is  to
>        read  the  cpu  number cache from inline assembly in a single
>        instruction.
>
>        Each thread is responsible for registering its own cpu number
>        cache.   Only  one  cpu  cache  address can be registered per
>        thread.
>
>        The symbol  __getcpu_cache_tls  is  recommended  to  be  used
>        across  libraries  and  applications  wishing  to  register a
>        thread-local getcpu_cache. The  attribute  "weak"  is  recom‐
>        mended  when  declaring this variable in libraries.  Applica‐
>        tions can choose to define their own version of  this  symbol
>        without the weak attribute as a performance improvement.
>
>        In  a  typical usage scenario, the thread registering the cpu
>        number cache will be performing reads from that cache. It  is
>        however  also allowed to read the cpu number cache from other
>        threads. The cpu number cache updates performed by the kernel
>        provide single-copy atomicity semantics, which guarantee that
>        other threads performing single-copy atomic reads of the  cpu
>        number cache will always observe a consistent value.
>
>        Memory registered as cpu number cache should never be deallo‐
>        cated before the thread which registered it  exits:  specifi‐
>        cally, it should not be freed, and the library containing the
>        registered thread-local storage should not be dlclose'd.

Maybe spell out the consequence if this is violated - since the SIGSEGV
only happens on migration, it may take a while to strike.

Random thoughts: The current implementation ensures that getcpu_cache is
"idempotent" from within a single thread - once set, it can never get
unset nor set to some other pointer. I think that can be useful, since
it means a library can reliably use the TLS variable itself (initialized
with some negative number) as an indicator of whether
getcpu_cache(GETCPU_CACHE_SET) has been called. So if a single test on a
fast path where the library would need to load __getcpu_cache_tls anyway
is acceptable, it can avoid requiring some library init function to be
called in each thread - which can sometimes be hard to arrange. Is this
something we want to guarantee - that is, will we never implement
GETCPU_CACHE_UNSET or a "force" flag to _SET? Either way, I think we
should spend a few words on it to avoid the current behaviour becoming
accidental ABI.

In another thread:

> However, there are other use-cases for having a fast mechanism for
> reading the current CPU number, besides restartable sequences.  For
> instance, it can be used by glibc to implement a faster sched_getcpu.

Will glibc do that? It may be a little contentious for glibc to claim a
unique resource such as task_struct::cpu_cache for itself, even if
everybody is supposed to use the same symbol. Hm, maybe one could say
that if an application does define the symbol __getcpu_cache_tls (which
is technically in the implementation namespace), that gives glibc (and
any other library) license to do getcpu_cache(SET, &&__getcpu_cache_tls)
(pseudo-code, of course). If a library initializes its own weak version
with -2 it can check whether the application defined
__getcpu_cache_tls. Ok, I'm probably overthinking this...

Rasmus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-26 11:33             ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2016-02-26 11:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Thu, Feb 25, 2016 at 05:17:51PM +0000, Mathieu Desnoyers wrote:
> ----- On Feb 25, 2016, at 12:04 PM, Peter Zijlstra peterz@infradead.org wrote:
> 
> > On Thu, Feb 25, 2016 at 04:55:26PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Feb 25, 2016, at 4:56 AM, Peter Zijlstra peterz@infradead.org wrote:
> >> The restartable sequences are intrinsically designed to work
> >> on per-cpu data, so they need to fetch the current CPU number
> >> within the rseq critical section. This is where the getcpu_cache
> >> system call becomes very useful when combined with rseq:
> >> getcpu_cache allows reading the current CPU number in a
> >> fraction of cycle.
> > 
> > Yes yes, I know how restartable sequences work.
> > 
> > But what I worry about is that they want a cpu number and a sequence
> > number, and for performance it would be very good if those live in the
> > same cacheline.
> > 
> > That means either getcpu needs to grow a seq number, or restartable
> > sequences need to _also_ provide the cpu number.
> 
> If we plan things well, we could have both the cpu number and the
> seqnum in the same cache line, registered by two different system
> calls. It's up to user-space to organize those two variables
> to fit within the same cache-line.

I feel this is more fragile than needed. Why not do a single system call
that does both?

> getcpu_cache GETCPU_CACHE_SET operation takes the address where
> the CPU number should live as input.
> 
> rseq system call could do the same for the seqnum address.

So I really don't like that: it means we have to track more kernel
state -- we have to carry two pointers instead of one, we have to have
more update functions, etc.

That just increases the total overhead of all of this.

> The question becomes: how do we introduce this to user-space,
> considering that only a single address per thread is allowed
> for each of getcpu_cache and rseq ?
> 
> If both CPU number and seqnum are centralized in a TLS within
> e.g. glibc, that would be OK, but if we intend to allow libraries
> or applications to directly register their own getcpu_cache
> address and/or rseq, we may end up in situations where we have
> to fallback on using two different cache-lines. But how much
> should we care about performance in cases where non-generic
> libraries directly use those system calls ?
> 
> Thoughts ?

Yeah, not sure, but that is a separate problem. Both your proposed code
and the rseq code have this. Having them separate system calls just
increases the number of ways you can do it wrong.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-26 16:29               ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-02-26 16:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk

On Fri, 26 Feb 2016, Peter Zijlstra wrote:
> On Thu, Feb 25, 2016 at 05:17:51PM +0000, Mathieu Desnoyers wrote:
> > ----- On Feb 25, 2016, at 12:04 PM, Peter Zijlstra peterz@infradead.org wrote:
> > 
> > > On Thu, Feb 25, 2016 at 04:55:26PM +0000, Mathieu Desnoyers wrote:
> > >> ----- On Feb 25, 2016, at 4:56 AM, Peter Zijlstra peterz@infradead.org wrote:
> > >> The restartable sequences are intrinsically designed to work
> > >> on per-cpu data, so they need to fetch the current CPU number
> > >> within the rseq critical section. This is where the getcpu_cache
> > >> system call becomes very useful when combined with rseq:
> > >> getcpu_cache allows reading the current CPU number in a
> > >> fraction of cycle.
> > > 
> > > Yes yes, I know how restartable sequences work.
> > > 
> > > But what I worry about is that they want a cpu number and a sequence
> > > number, and for performance it would be very good if those live in the
> > > same cacheline.
> > > 
> > > That means either getcpu needs to grow a seq number, or restartable
> > > sequences need to _also_ provide the cpu number.
> > 
> > If we plan things well, we could have both the cpu number and the
> > seqnum in the same cache line, registered by two different system
> > calls. It's up to user-space to organize those two variables
> > to fit within the same cache-line.
> 
> I feel this is more fragile than needed. Why not do a single systemcall
> that does both?

Right. There is no point in having two calls and two update mechanisms for a
very similar purpose.

So let userspace have one struct where cpu/seq and whatever else is required
for rseq are located, and flag at register time which parts of the struct need
to be updated.
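
One possible shape of such a struct (purely illustrative, not something
proposed in this thread; the type, field and flag names are made up):

#include <stdint.h>

/* Flags passed at (hypothetical) registration time, selecting which
 * fields the kernel keeps up to date on return to user-space. */
#define TLS_ABI_FEATURE_CPU (1U << 0)  /* kernel updates 'cpu' */
#define TLS_ABI_FEATURE_SEQ (1U << 1)  /* kernel updates 'seqnum' */

struct thread_local_abi {
    int32_t cpu;       /* CPU the thread currently runs on */
    uint32_t seqnum;   /* restart/sequence counter for rseq */
    /* later revisions would append fields here */
} __attribute__((aligned(64)));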

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
  2016-02-26 16:29               ` Thomas Gleixner
  (?)
@ 2016-02-26 17:20               ` Mathieu Desnoyers
  2016-02-26 18:01                   ` Thomas Gleixner
  -1 siblings, 1 reply; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-26 17:20 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: Andrew Morton, Russell King, Ingo Molnar, H. Peter Anvin,
	linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ben Maurer, rostedt, Paul E. McKenney, Josh Triplett,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Linus Torvalds

----- On Feb 26, 2016, at 11:29 AM, Thomas Gleixner tglx@linutronix.de wrote:

> On Fri, 26 Feb 2016, Peter Zijlstra wrote:
>> On Thu, Feb 25, 2016 at 05:17:51PM +0000, Mathieu Desnoyers wrote:
>> > ----- On Feb 25, 2016, at 12:04 PM, Peter Zijlstra peterz@infradead.org wrote:
>> > 
>> > > On Thu, Feb 25, 2016 at 04:55:26PM +0000, Mathieu Desnoyers wrote:
>> > >> ----- On Feb 25, 2016, at 4:56 AM, Peter Zijlstra peterz@infradead.org wrote:
>> > >> The restartable sequences are intrinsically designed to work
>> > >> on per-cpu data, so they need to fetch the current CPU number
>> > >> within the rseq critical section. This is where the getcpu_cache
>> > >> system call becomes very useful when combined with rseq:
>> > >> getcpu_cache allows reading the current CPU number in a
>> > >> fraction of cycle.
>> > > 
>> > > Yes yes, I know how restartable sequences work.
>> > > 
>> > > But what I worry about is that they want a cpu number and a sequence
>> > > number, and for performance it would be very good if those live in the
>> > > same cacheline.
>> > > 
>> > > That means either getcpu needs to grow a seq number, or restartable
>> > > sequences need to _also_ provide the cpu number.
>> > 
>> > If we plan things well, we could have both the cpu number and the
>> > seqnum in the same cache line, registered by two different system
>> > calls. It's up to user-space to organize those two variables
>> > to fit within the same cache-line.
>> 
>> I feel this is more fragile than needed. Why not do a single systemcall
>> that does both?
> 
> Right. There is no point in having two calls and two update mechanisms for a
> very similar purpose.
> 
> So let userspace have one struct where cpu/seq and whatever is required for
> rseq is located and flag at register time which parts of the struct need to be
> updated.

If we put both cpu/seq/other in that structure, why not plan ahead and make
it extensible then ?

That looks very much like the "Thread-local ABI" series I posted last year.
See https://lkml.org/lkml/2015/12/22/464

Here is why I ended up introducing the specialized "getcpu_cache" system call
rather than the "generic" system call (quote from the getcpu_cache changelog):

    Rationale for the getcpu_cache system call rather than the thread-local
    ABI system call proposed earlier:
    
    Rather than doing a "generic" thread-local ABI, specialize this system
    call for a cpu number cache only. Anyway, the thread-local ABI approach
    would have required that we introduce "feature" flags, which would have
    ended up reimplementing multiplexing of features on top of a system
    call. It seems better to introduce one system call per feature instead.

If everyone ends up preferring that we introduce a system call that implements
many features at once, that's indeed something we can do, but I remember
being told in the past that this is generally a bad idea.

For one thing, it would make the interface more cumbersome to deal with
from user-space in terms of feature detection: if we want to make this
interface extensible, then in addition to checking for -1, errno=ENOSYS, userspace
would have to deal with a field containing the length of the structure
as expected by user-space and kernel, and feature flags to see the common
set of features supported by kernel and user-space.

Having one system call per feature seems simpler to handle in terms of
feature availability detection from a userspace point of view.
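
To illustrate the kind of negotiation a multiplexed interface implies
(all names below are made up, this is not a proposed ABI):

#include <stdint.h>

/* Hypothetical multiplexed registration block. */
struct tls_abi_request {
    uint32_t len;      /* structure length as known to this binary */
    uint32_t wanted;   /* feature bits this binary understands */
    uint32_t granted;  /* filled in by the kernel */
};

/* Even after a successful registration, the caller still has to mask
 * what it asked for against what the kernel granted: */
static inline uint32_t tls_abi_usable(const struct tls_abi_request *req)
{
    return req->wanted & req->granted;
}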

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
  2016-02-25 23:32       ` Rasmus Villemoes
  (?)
@ 2016-02-26 17:47       ` Mathieu Desnoyers
  -1 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-26 17:47 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Thomas Gleixner, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk

----- On Feb 25, 2016, at 6:32 PM, Rasmus Villemoes linux@rasmusvillemoes.dk wrote:

> On Wed, Feb 24 2016, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>>
>>        Typically, a library or application will keep the cpu  number
>>        cache  in  a  thread-local  storage variable, or other memory
>>        areas belonging to each thread. It is recommended to  perform
>>        a  volatile  read of the cpu number cache to prevent the com‐
>>        piler from doing load tearing. An alternative approach is  to
>>        read  the  cpu  number cache from inline assembly in a single
>>        instruction.
>>
>>        Each thread is responsible for registering its own cpu number
>>        cache.   Only  one  cpu  cache  address can be registered per
>>        thread.
>>
>>        The symbol  __getcpu_cache_tls  is  recommended  to  be  used
>>        across  libraries  and  applications  wishing  to  register a
>>        thread-local getcpu_cache. The  attribute  "weak"  is  recom‐
>>        mended  when  declaring this variable in libraries.  Applica‐
>>        tions can choose to define their own version of  this  symbol
>>        without the weak attribute as a performance improvement.
>>
>>        In  a  typical usage scenario, the thread registering the cpu
>>        number cache will be performing reads from that cache. It  is
>>        however  also allowed to read the cpu number cache from other
>>        threads. The cpu number cache updates performed by the kernel
>>        provide single-copy atomicity semantics, which guarantee that
>>        other threads performing single-copy atomic reads of the  cpu
>>        number cache will always observe a consistent value.
>>
>>        Memory registered as cpu number cache should never be deallo‐
>>        cated before the thread which registered it  exits:  specifi‐
>>        cally, it should not be freed, and the library containing the
>>        registered thread-local storage should not be dlclose'd.
> 
> Maybe spell out the consequence if this is violated - since the SIGSEGV
> only happens on migration, it may take a while to strike.

Good point.

> 
> Random thoughts: The current implementation ensures that getcpu_cache is
> "idempotent" from within a single thread - once set, it can never get
> unset nor set to some other pointer. I think that can be useful, since
> it means a library can reliably use the TLS variable itself (initialized
> with some negative number) as an indicator of whether
> getcpu_cache(GETCPU_CACHE_SET) has been called. So if a single test on a
> fast path where the library would need to load __getcpu_cache_tls anyway
> is acceptable, it can avoid requiring some library init function to be
> called in each thread - which can sometimes be hard to arrange. Is this
> something we want to guarantee - that is, will we never implement
> GETCPU_CACHE_UNSET or a "force" flag to _SET? Either way, I think we
> should spend a few words on it to avoid the current behaviour becoming
> accidental ABI.

Yes, I would be tempted to state that once set, the address can neither
be unset nor changed for the lifetime of the thread.

> 
> In another thread:
> 
>> However, there are other use-cases for having a fast mechanism for
>> reading the current CPU number, besides restartable sequences.  For
>> instance, it can be used by glibc to implement a faster sched_getcpu.
> 
> Will glibc do that? It may be a little contentious for glibc to claim a
> unique resource such as task_struct::cpu_cache for itself, even if
> everybody is supposed to use the same symbol. Hm, maybe one could say
> that if an application does define the symbol __getcpu_cache_tls (which
> is techically in the implementation namespace), that gives glibc (and
> any other library) license to do getcpu_cache(SET, &&__getcpu_cache_tls)
> (pseudo-code, of course). If a library initializes its own weak version
> with -2 it can check whether the application defined
> __getcpu_cache_tls. Ok, I'm probably overthinking this...

I had the exact same thoughts a few days ago while thinking about
how lttng-ust could do a "lazy binding" of the getcpu_cache without
requiring an explicit initialization at thread start. We're reaching
very similar conclusions. We could recommend/require that userspace
does this whenever it defines a __getcpu_cache_tls:

Declare as

__thread __attribute__((weak)) volatile int32_t __getcpu_cache_tls = -1;

Then whenever it loads it, "-1" would mean "uninitialized", and "-2"
could mean "this thread tried to initialize it, but fail, so you
should directly go to a fallback". ">= 0" would mean initialized and
working.

static inline int32_t getcpu_cache_read(void)
{
    int32_t cachev = __getcpu_cache_tls;

    if (likely(cachev >= 0))
        return cachev;

    if (cachev == -1) {
        volatile int32_t *cpu_cache = &__getcpu_cache_tls;

        if (!getcpu_cache(GETCPU_CACHE_SET, &cpu_cache, 0))
            return __getcpu_cache_tls;
        __getcpu_cache_tls = -2;
    }
    /* Fallback on sched_getcpu(). */
    return sched_getcpu();
}

This could be documented in the getcpu_cache system call man page.
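
As a usage sketch only (the per-cpu pool type and the bounds handling
are hypothetical, not part of the proposal), a caller could then index
per-cpu data with the helper above:

#include <stdint.h>

struct percpu_pool {
    void *free_list;   /* stand-in for some per-cpu resource */
};

static struct percpu_pool *pick_local_pool(struct percpu_pool *pools,
                                           int32_t nr_pools)
{
    int32_t cpu = getcpu_cache_read();

    if (cpu < 0 || cpu >= nr_pools)
        cpu = 0;   /* conservative fallback slot */
    return &pools[cpu];
}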

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-26 18:01                   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-02-26 18:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Catalin Marinas, Will Deacon, Michael Kerrisk,
	Linus Torvalds

On Fri, 26 Feb 2016, Mathieu Desnoyers wrote:
> ----- On Feb 26, 2016, at 11:29 AM, Thomas Gleixner tglx@linutronix.de wrote:
> > Right. There is no point in having two calls and two update mechanisms for a
> > very similar purpose.
> > 
> > So let userspace have one struct where cpu/seq and whatever is required for
> > rseq is located and flag at register time which parts of the struct need to be
> > updated.
> 
> If we put both cpu/seq/other in that structure, why not plan ahead and make
> it extensible then ?
> 
> That looks very much like the "Thread-local ABI" series I posted last year.
> See https://lkml.org/lkml/2015/12/22/464
> 
> Here is why I ended up introducing the specialized "getcpu_cache" system call
> rather than the "generic" system call (quote from the getcpu_cache changelog):
> 
>     Rationale for the getcpu_cache system call rather than the thread-local
>     ABI system call proposed earlier:
>     
>     Rather than doing a "generic" thread-local ABI, specialize this system
>     call for a cpu number cache only. Anyway, the thread-local ABI approach
>     would have required that we introduce "feature" flags, which would have
>     ended up reimplementing multiplexing of features on top of a system
>     call. It seems better to introduce one system call per feature instead.
> 
> If everyone end up preferring that we introduce a system call that implements
> many features at once, that's indeed something we can do, but I remember
> being told in the past that this is generally a bad idea.

It's a bad idea if you mix stuff which does not belong together, but if you
have stuff which shares a substantial amount of things then it makes a lot of
sense. Especially if it adds similar stuff into hot paths.
 
> For one thing, it would make the interface more cumbersome to deal with
> from user-space in terms of feature detection: if we want to make this
> interface extensible, in addition to check -1, errno=ENOSYS, userspace
> would have to deal with a field containing the length of the structure
> as expected by user-space and kernel, and feature flags to see the common
> set of features supported by kernel and user-space.
>
> Having one system call per feature seems simpler to handle in terms of
> feature availability detection from a userspace point of view.

That might well be, but that does not justify two fast-path updates, two
separate pointers to handle, etc.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
  2016-02-26 18:01                   ` Thomas Gleixner
@ 2016-02-26 20:24                     ` Mathieu Desnoyers
  -1 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-26 20:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Andrew Morton, Russell King, Ingo Molnar,
	H. Peter Anvin, linux-kernel, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Catalin Marinas, Will Deacon, Michael Kerrisk,
	Linus Torvalds

----- On Feb 26, 2016, at 1:01 PM, Thomas Gleixner tglx@linutronix.de wrote:

> On Fri, 26 Feb 2016, Mathieu Desnoyers wrote:
>> ----- On Feb 26, 2016, at 11:29 AM, Thomas Gleixner tglx@linutronix.de wrote:
>> > Right. There is no point in having two calls and two update mechanisms for a
>> > very similar purpose.
>> > 
>> > So let userspace have one struct where cpu/seq and whatever is required for
>> > rseq is located and flag at register time which parts of the struct need to be
>> > updated.
>> 
>> If we put both cpu/seq/other in that structure, why not plan ahead and make
>> it extensible then ?
>> 
>> That looks very much like the "Thread-local ABI" series I posted last year.
>> See https://lkml.org/lkml/2015/12/22/464
>> 
>> Here is why I ended up introducing the specialized "getcpu_cache" system call
>> rather than the "generic" system call (quote from the getcpu_cache changelog):
>> 
>>     Rationale for the getcpu_cache system call rather than the thread-local
>>     ABI system call proposed earlier:
>>     
>>     Rather than doing a "generic" thread-local ABI, specialize this system
>>     call for a cpu number cache only. Anyway, the thread-local ABI approach
>>     would have required that we introduce "feature" flags, which would have
>>     ended up reimplementing multiplexing of features on top of a system
>>     call. It seems better to introduce one system call per feature instead.
>> 
>> If everyone end up preferring that we introduce a system call that implements
>> many features at once, that's indeed something we can do, but I remember
>> being told in the past that this is generally a bad idea.
> 
> It's a bad idea if you mix stuff which does not belong together, but if you
> have stuff which shares a substantial amount of things then it makes a lot of
> sense. Especially if it adds similar stuff into hotpathes.
> 
>> For one thing, it would make the interface more cumbersome to deal with
>> from user-space in terms of feature detection: if we want to make this
>> interface extensible, in addition to check -1, errno=ENOSYS, userspace
>> would have to deal with a field containing the length of the structure
>> as expected by user-space and kernel, and feature flags to see the common
>> set of features supported by kernel and user-space.
>>
>> Having one system call per feature seems simpler to handle in terms of
>> feature availability detection from a userspace point of view.
> 
> That might well be, but that does not justify two fastpath updates, two
> seperate pointers to handle, etc ....

Keeping two separate pointers in the task_struct rather than a single one
might indeed be unwelcome, but I'm not sure I fully grasp the fast path
argument in this case: getcpu_cache only sets a notifier thread flag
on thread migration, whereas AFAIU rseq adds code to context switch and signal
delivery, which are prone to have a higher impact.

Indeed both will have their own code in the resume notifier, but is it really
a fast path ?

From my point of view, making it easy for userspace to just enable getcpu_cache
without having the scheduler and signal delivery fast-path overhead of rseq seems
like a good thing. I'm not all that sure that saving an extra pointer in
task_struct justifies the added system call interface complexity.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
  2016-02-26 20:24                     ` Mathieu Desnoyers
  (?)
@ 2016-02-26 23:04                     ` H. Peter Anvin
  2016-02-27  0:40                         ` Mathieu Desnoyers
  -1 siblings, 1 reply; 96+ messages in thread
From: H. Peter Anvin @ 2016-02-26 23:04 UTC (permalink / raw)
  To: Mathieu Desnoyers, Thomas Gleixner
  Cc: Peter Zijlstra, Andrew Morton, Russell King, Ingo Molnar,
	linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ben Maurer, rostedt, Paul E. McKenney, Josh Triplett,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Linus Torvalds

On February 26, 2016 12:24:15 PM PST, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>----- On Feb 26, 2016, at 1:01 PM, Thomas Gleixner tglx@linutronix.de
>wrote:
>
>> On Fri, 26 Feb 2016, Mathieu Desnoyers wrote:
>>> ----- On Feb 26, 2016, at 11:29 AM, Thomas Gleixner
>tglx@linutronix.de wrote:
>>> > Right. There is no point in having two calls and two update
>mechanisms for a
>>> > very similar purpose.
>>> > 
>>> > So let userspace have one struct where cpu/seq and whatever is
>required for
>>> > rseq is located and flag at register time which parts of the
>struct need to be
>>> > updated.
>>> 
>>> If we put both cpu/seq/other in that structure, why not plan ahead
>and make
>>> it extensible then ?
>>> 
>>> That looks very much like the "Thread-local ABI" series I posted
>last year.
>>> See https://lkml.org/lkml/2015/12/22/464
>>> 
>>> Here is why I ended up introducing the specialized "getcpu_cache"
>system call
>>> rather than the "generic" system call (quote from the getcpu_cache
>changelog):
>>> 
>>>     Rationale for the getcpu_cache system call rather than the
>thread-local
>>>     ABI system call proposed earlier:
>>>     
>>>     Rather than doing a "generic" thread-local ABI, specialize this
>system
>>>     call for a cpu number cache only. Anyway, the thread-local ABI
>approach
>>>     would have required that we introduce "feature" flags, which
>would have
>>>     ended up reimplementing multiplexing of features on top of a
>system
>>>     call. It seems better to introduce one system call per feature
>instead.
>>> 
>>> If everyone end up preferring that we introduce a system call that
>implements
>>> many features at once, that's indeed something we can do, but I
>remember
>>> being told in the past that this is generally a bad idea.
>> 
>> It's a bad idea if you mix stuff which does not belong together, but
>if you
>> have stuff which shares a substantial amount of things then it makes
>a lot of
>> sense. Especially if it adds similar stuff into hotpathes.
>> 
>>> For one thing, it would make the interface more cumbersome to deal
>with
>>> from user-space in terms of feature detection: if we want to make
>this
>>> interface extensible, in addition to check -1, errno=ENOSYS,
>userspace
>>> would have to deal with a field containing the length of the
>structure
>>> as expected by user-space and kernel, and feature flags to see the
>common
>>> set of features supported by kernel and user-space.
>>>
>>> Having one system call per feature seems simpler to handle in terms
>of
>>> feature availability detection from a userspace point of view.
>> 
>> That might well be, but that does not justify two fastpath updates,
>two
>> seperate pointers to handle, etc ....
>
>Keeping two separate pointers in the task_struct rather than a single
>one
>might indeed be unwelcome, but I'm not sure I fully grasp the fast path
>argument in this case: getcpu_cache only sets a notifier thread flag
>on thread migration, whereas AFAIU rseq adds code to context switch and
>signal
>delivery, which are prone to have a higher impact.
>
>Indeed both will have their own code in the resume notifier, but is it
>really
>a fast path ?
>
>From my point of view, making it easy for userspace to just enable
>getcpu_cache
>without having the scheduler and signal delivery fast-path overhead of
>rseq seems
>like a good thing. I'm not all that sure that saving an extra pointer
>in
>task_struct justifies the added system call interface complexity.
>
>Thanks,
>
>Mathieu

I think it would be a good idea to make this a general pointer for the kernel to be able to write per thread state to user space, which obviously can't be done with the vDSO.

This means the libc per thread startup should query the kernel for the size of this structure and allocate thread local data accordingly.  We can then grow this structure if needed without making the ABI even more complex.

This is more than a system call: this is an entirely new way for userspace to interact with the kernel.  Therefore we should make it a general facility.
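
A rough sketch of what that per-thread startup could look like on the libc side
follows. The names and the two "kernel" helpers are illustrative stand-ins for
whatever interface gets defined, stubbed out here so the sketch stands alone.

/* Illustrative only: in a real libc the two helpers below would be thin
 * wrappers around the new system call(s). */
#include <stddef.h>
#include <stdlib.h>

static long tla_query_size(void)
{
	return 64;		/* pretend the running kernel reported 64 bytes */
}

static int tla_register(void *area, size_t len)
{
	(void)area; (void)len;
	return 0;		/* pretend registration succeeded */
}

static __thread void *thread_local_abi_area;

int thread_local_abi_init(void)
{
	long size = tla_query_size();

	if (size <= 0)
		return -1;	/* old kernel: feature not available */
	/* Allocate exactly as much as the running kernel knows how to update. */
	thread_local_abi_area = aligned_alloc(64, ((size_t)size + 63) & ~(size_t)63);
	if (!thread_local_abi_area)
		return -1;
	return tla_register(thread_local_abi_area, (size_t)size);
}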
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-27  0:40                         ` Mathieu Desnoyers
  0 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-27  0:40 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Thomas Gleixner, Peter Zijlstra, Andrew Morton, Russell King,
	Ingo Molnar, linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ben Maurer, rostedt, Paul E. McKenney, Josh Triplett,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Linus Torvalds

----- On Feb 26, 2016, at 6:04 PM, H. Peter Anvin hpa@zytor.com wrote:

> On February 26, 2016 12:24:15 PM PST, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>----- On Feb 26, 2016, at 1:01 PM, Thomas Gleixner tglx@linutronix.de
>>wrote:
>>
>>> On Fri, 26 Feb 2016, Mathieu Desnoyers wrote:
>>>> ----- On Feb 26, 2016, at 11:29 AM, Thomas Gleixner
>>tglx@linutronix.de wrote:
>>>> > Right. There is no point in having two calls and two update
>>mechanisms for a
>>>> > very similar purpose.
>>>> > 
>>>> > So let userspace have one struct where cpu/seq and whatever is
>>required for
>>>> > rseq is located and flag at register time which parts of the
>>struct need to be
>>>> > updated.
>>>> 
>>>> If we put both cpu/seq/other in that structure, why not plan ahead
>>and make
>>>> it extensible then ?
>>>> 
>>>> That looks very much like the "Thread-local ABI" series I posted
>>last year.
>>>> See https://lkml.org/lkml/2015/12/22/464
>>>> 
>>>> Here is why I ended up introducing the specialized "getcpu_cache"
>>system call
>>>> rather than the "generic" system call (quote from the getcpu_cache
>>changelog):
>>>> 
>>>>     Rationale for the getcpu_cache system call rather than the
>>thread-local
>>>>     ABI system call proposed earlier:
>>>>     
>>>>     Rather than doing a "generic" thread-local ABI, specialize this
>>system
>>>>     call for a cpu number cache only. Anyway, the thread-local ABI
>>approach
>>>>     would have required that we introduce "feature" flags, which
>>would have
>>>>     ended up reimplementing multiplexing of features on top of a
>>system
>>>>     call. It seems better to introduce one system call per feature
>>instead.
>>>> 
>>>> If everyone end up preferring that we introduce a system call that
>>implements
>>>> many features at once, that's indeed something we can do, but I
>>remember
>>>> being told in the past that this is generally a bad idea.
>>> 
>>> It's a bad idea if you mix stuff which does not belong together, but
>>if you
>>> have stuff which shares a substantial amount of things then it makes
>>a lot of
>>> sense. Especially if it adds similar stuff into hotpathes.
>>> 
>>>> For one thing, it would make the interface more cumbersome to deal
>>with
>>>> from user-space in terms of feature detection: if we want to make
>>this
>>>> interface extensible, in addition to check -1, errno=ENOSYS,
>>userspace
>>>> would have to deal with a field containing the length of the
>>structure
>>>> as expected by user-space and kernel, and feature flags to see the
>>common
>>>> set of features supported by kernel and user-space.
>>>>
>>>> Having one system call per feature seems simpler to handle in terms
>>of
>>>> feature availability detection from a userspace point of view.
>>> 
>>> That might well be, but that does not justify two fastpath updates,
>>two
>>> seperate pointers to handle, etc ....
>>
>>Keeping two separate pointers in the task_struct rather than a single
>>one
>>might indeed be unwelcome, but I'm not sure I fully grasp the fast path
>>argument in this case: getcpu_cache only sets a notifier thread flag
>>on thread migration, whereas AFAIU rseq adds code to context switch and
>>signal
>>delivery, which are prone to have a higher impact.
>>
>>Indeed both will have their own code in the resume notifier, but is it
>>really
>>a fast path ?
>>
>>From my point of view, making it easy for userspace to just enable
>>getcpu_cache
>>without having the scheduler and signal delivery fast-path overhead of
>>rseq seems
>>like a good thing. I'm not all that sure that saving an extra pointer
>>in
>>task_struct justifies the added system call interface complexity.
>>
>>Thanks,
>>
>>Mathieu
> 
> I think it would be a good idea to make this a general pointer for the kernel to
> be able to write per thread state to user space, which obviously can't be done
> with the vDSO.
> 
> This means the libc per thread startup should query the kernel for the size of
> this structure and allocate thread local data accordingly.  We can then grow
> this structure if needed without making the ABI even more complex.
> 
> This is more than a system call: this is an entirely new way for userspace to
> interact with the kernel.  Therefore we should make it a general facility.

I'm really glad to see I'm not the only one seeing potential for
genericity here. :-) This is exactly what I had in mind
last year when proposing the thread_local_abi() system call:
a generic way to register an extensible per-thread data structure
so the kernel can communicate with user-space and vice-versa.

Rather than having libc query the kernel for the size of the structure,
I would recommend that libc tell the kernel the size of the thread-local
ABI structure it supports. The idea here is that both the kernel and libc
need to know about the fields in that structure to allow a two-way
interaction. Fields known only by either the kernel or userspace
are useless for a given thread anyway. This way, libc could statically
define the structure.

I would be tempted to also add "features" flags, so both user-space
and the kernel could tell each other what they support: user-space
would announce the set of features it supports, and it could also
query the kernel for the set of supported features. One simple approach
would be to use a uint64_t as the type for those feature flags, and
reserve the last bit for extending to future flags if we ever have
more than 64.
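
Concretely, the handshake I have in mind would be something like the sketch
below (field and flag names are illustrative only):

#include <stdint.h>

/* Userspace fills 'features_wanted' before registering; the kernel clears
 * every bit it does not support, so 'features_active' ends up being the
 * intersection. The last bit is reserved to mean "an extra flags word
 * follows" if we ever need more than 63 features. */
#define TLA_FEATURE_CPU_ID	(UINT64_C(1) << 0)
#define TLA_FEATURE_RSEQ	(UINT64_C(1) << 1)
#define TLA_FEATURE_EXTENDED	(UINT64_C(1) << 63)

struct thread_local_abi_handshake {
	uint64_t features_wanted;	/* set by userspace at registration */
	uint64_t features_active;	/* written back by the kernel */
};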

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-27  6:24                           ` H. Peter Anvin
  0 siblings, 0 replies; 96+ messages in thread
From: H. Peter Anvin @ 2016-02-27  6:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Peter Zijlstra, Andrew Morton, Russell King,
	Ingo Molnar, linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ben Maurer, rostedt, Paul E. McKenney, Josh Triplett,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Linus Torvalds

On 02/26/16 16:40, Mathieu Desnoyers wrote:
>>
>> I think it would be a good idea to make this a general pointer for the kernel to
>> be able to write per thread state to user space, which obviously can't be done
>> with the vDSO.
>>
>> This means the libc per thread startup should query the kernel for the size of
>> this structure and allocate thread local data accordingly.  We can then grow
>> this structure if needed without making the ABI even more complex.
>>
>> This is more than a system call: this is an entirely new way for userspace to
>> interact with the kernel.  Therefore we should make it a general facility.
>
> I'm really glad to see I'm not the only one seeing potential for
> genericity here. :-) This is exactly what I had in mind
> last year when proposing the thread_local_abi() system call:
> a generic way to register an extensible per-thread data structure
> so the kernel can communicate with user-space and vice-versa.
>
> Rather than having the libc query the kernel for size of the structure,
> I would recommend that libc tells the kernel the size of the thread-local
> ABI structure it supports. The idea here is that both the kernel and libc
> need to know about the fields in that structure to allow a two-way
> interaction. Fields known only by either the kernel or userspace
> are useless for a given thread anyway. This way, libc could statically
> define the structure.

Big fat NOPE there.  Why?  Because it means that EVERY interaction with 
this memory, no matter how critical, needs to be conditionalized. 
Furthermore, userspace != libc.  Applications or higher-layer libraries 
might have more information than the running libc about additional 
fields, but with your proposal libc would gate them.

As far as the kernel providing the size in the structure (alone) -- I 
*really* hope you can see what is wrong with that!!  That doesn't mean 
we can't provide it in the structure as well, and that too might avoid 
the skipped libc problem.

> I would be tempted to also add "features" flags, so both user-space
> and the kernel could tell each other what they support: user-space
> would announce the set of features it supports, and it could also
> query the kernel for the set of supported features. One simple approach
> would be to use a uint64_t as type for those feature flags, and
> reserve the last bit for extending to future flags if we ever have
> more than 64.
>
> Thoughts ?

It doesn't seem like it would hurt, although the size of the flags field 
could end up being an issue.

	-hpa

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-27 14:15                             ` Mathieu Desnoyers
  0 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-27 14:15 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Thomas Gleixner, Peter Zijlstra, Andrew Morton, Russell King,
	Ingo Molnar, linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ben Maurer, rostedt, Paul E. McKenney, Josh Triplett,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Linus Torvalds

----- On Feb 27, 2016, at 1:24 AM, H. Peter Anvin hpa@zytor.com wrote:

> On 02/26/16 16:40, Mathieu Desnoyers wrote:
>>>
>>> I think it would be a good idea to make this a general pointer for the kernel to
>>> be able to write per thread state to user space, which obviously can't be done
>>> with the vDSO.
>>>
>>> This means the libc per thread startup should query the kernel for the size of
>>> this structure and allocate thread local data accordingly.  We can then grow
>>> this structure if needed without making the ABI even more complex.
>>>
>>> This is more than a system call: this is an entirely new way for userspace to
>>> interact with the kernel.  Therefore we should make it a general facility.
>>
>> I'm really glad to see I'm not the only one seeing potential for
>> genericity here. :-) This is exactly what I had in mind
>> last year when proposing the thread_local_abi() system call:
>> a generic way to register an extensible per-thread data structure
>> so the kernel can communicate with user-space and vice-versa.
>>
>> Rather than having the libc query the kernel for size of the structure,
>> I would recommend that libc tells the kernel the size of the thread-local
>> ABI structure it supports. The idea here is that both the kernel and libc
>> need to know about the fields in that structure to allow a two-way
>> interaction. Fields known only by either the kernel or userspace
>> are useless for a given thread anyway. This way, libc could statically
>> define the structure.
> 
> Big fat NOPE there.  Why?  Because it means that EVERY interaction with
> this memory, no matter how critical, needs to be conditionalized.
> Furthermore, userspace != libc.  Applications or higher-layer libraries
> might have more information than the running libc about additional
> fields, but with your proposal libc would gate them.

Good point!

> 
> As far as the kernel providing the size in the structure (alone) -- I
> *really* hope you can see what is wrong with that!!  That doesn't mean
> we can't provide it in the structure as well, and that too might avoid
> the skipped libc problem.

Indeed, libc would need to query the size before it can allocate
the structure.

> 
>> I would be tempted to also add "features" flags, so both user-space
>> and the kernel could tell each other what they support: user-space
>> would announce the set of features it supports, and it could also
>> query the kernel for the set of supported features. One simple approach
>> would be to use a uint64_t as type for those feature flags, and
>> reserve the last bit for extending to future flags if we ever have
>> more than 64.
>>
>> Thoughts ?
> 
> It doesn't seem like it would hurt, although the size of the flags field
> could end up being an issue.

I'm concerned that this thread-local ABI structure may become messy.
Let's just imagine how we would first introduce a "cpu_id" field (int32_t),
and eventually add a "seqnum" field for rseq in the future (unsigned long).

Both fields need to be read with single-copy semantics as volatile
reads, and both need to be naturally aligned. However, I'm tempted
to use the "packed" attribute on the structure since it's an ABI
between kernel and user-space. A pretty bad example of what this
could become, due to alignment constraints, looks like:

/* This structure needs to be aligned on pointer size. */
struct thread_local_abi {
        int32_t cpu_id;
        int32_t __unused1;
        unsigned long seqnum;
        /* Add new fields at the end. */
} __attribute__((packed));

And this is just a start. It may become messier as we append
new fields in the future.

The main argument I currently see in favor of having this
meta system call for all per-thread features is to only
maintain a single pointer in the kernel task_struct rather
than one per thread-local feature.

If the goal is really to keep the burden on the task struct
small, we could use kmalloc()/kfree() to allocate and free an
array of pointers to the various per-thread features, rather
than putting them directly in task_struct. We could keep a
mask of the enabled features in the task struct too (which
we will likely have to do even if we go with the thread-local
ABI meta system call).

Having this per-task allocated pointer array at kernel-level
would allow us to have one system call per feature, with clear
semantics, without evolving a messy thread-local ABI structure
due to all sorts of alignment constraints.
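
In kernel terms, this alternative would look roughly like the sketch below
(all names are illustrative; nothing here exists in any tree):

enum thread_local_feature {
	TL_FEATURE_CPU_ID,
	TL_FEATURE_RSEQ,
	TL_FEATURE_NR,			/* number of known features */
};

struct thread_local_features {
	unsigned long enabled_mask;		/* which entries are registered */
	void __user *uaddr[TL_FEATURE_NR];	/* one user pointer per feature */
};

/*
 * task_struct would then carry a single
 *	struct thread_local_features *tl_features;
 * pointer: NULL until the first registration, kmalloc()'d on demand,
 * kfree()'d at thread exit.
 */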

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-27 14:58                               ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2016-02-27 14:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: H. Peter Anvin, Thomas Gleixner, Andrew Morton, Russell King,
	Ingo Molnar, linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ben Maurer, rostedt, Paul E. McKenney, Josh Triplett,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Linus Torvalds

On Sat, Feb 27, 2016 at 02:15:01PM +0000, Mathieu Desnoyers wrote:
> I'm concerned that this thread-local ABI structure may become messy.
> Let's just imagine how we would first introduce a "cpu_id" field (int32_t),
> and eventually add a "seqnum" field for rseq in the future (unsigned long).

The rseq seq number can be uint32_t, in fact it is in Paul's patches.

(This is true because every seq increment will guarantee a userspace
exception and reload of the value; it's impossible to wrap the thing and
get a false positive.)

Paul's patches have the following structure:

struct thread_local_abi {
	union {
		struct {
			u32	cpu_id;
			u32	seq;
		};
		u64 cpu_seq;
	};
	unsigned long post_commit_ip;
};

Although he allows the post_commit_ip to be a separate field (which I
don't think makes sense).

> /* This structure needs to be aligned on pointer size. */

I would mandate the thing be cacheline aligned, and sod packed; that can
lead to horrible layouts.

> If the goal is really to keep the burden on the task struct
> small, we could use kmalloc()/kfree() to allocate and free an
> array of pointers to the various per-thread features, rather

*groan*, no that's even worse, then you get even more loads to update
the fields. The point is to reduce the total overhead of having this
stuff.

Having a single pointer with known offsets is best because then it's
guaranteed to be a single load; having the whole data structure in a
single cacheline again saves on memops, since you can only miss once.
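
In struct terms that means keeping Paul's fields but dropping the packing and
aligning the whole thing, along these lines (sketch; 64 assumed as the
cacheline size):

struct thread_local_abi {
	union {
		struct {
			u32	cpu_id;
			u32	seq;
		};
		u64	cpu_seq;	/* both values in a single 64-bit load */
	};
	unsigned long	post_commit_ip;
} __attribute__((aligned(64)));		/* one cacheline: miss at most once */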

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-27 15:04                               ` H. Peter Anvin
  0 siblings, 0 replies; 96+ messages in thread
From: H. Peter Anvin @ 2016-02-27 15:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Peter Zijlstra, Andrew Morton, Russell King,
	Ingo Molnar, linux-kernel, linux-api, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ben Maurer, rostedt, Paul E. McKenney, Josh Triplett,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Linus Torvalds

On February 27, 2016 6:15:01 AM PST, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>----- On Feb 27, 2016, at 1:24 AM, H. Peter Anvin hpa@zytor.com wrote:
>
>> On 02/26/16 16:40, Mathieu Desnoyers wrote:
>>>>
>>>> I think it would be a good idea to make this a general pointer for
>the kernel to
>>>> be able to write per thread state to user space, which obviously
>can't be done
>>>> with the vDSO.
>>>>
>>>> This means the libc per thread startup should query the kernel for
>the size of
>>>> this structure and allocate thread local data accordingly.  We can
>then grow
>>>> this structure if needed without making the ABI even more complex.
>>>>
>>>> This is more than a system call: this is an entirely new way for
>userspace to
>>>> interact with the kernel.  Therefore we should make it a general
>facility.
>>>
>>> I'm really glad to see I'm not the only one seeing potential for
>>> genericity here. :-) This is exactly what I had in mind
>>> last year when proposing the thread_local_abi() system call:
>>> a generic way to register an extensible per-thread data structure
>>> so the kernel can communicate with user-space and vice-versa.
>>>
>>> Rather than having the libc query the kernel for size of the
>structure,
>>> I would recommend that libc tells the kernel the size of the
>thread-local
>>> ABI structure it supports. The idea here is that both the kernel and
>libc
>>> need to know about the fields in that structure to allow a two-way
>>> interaction. Fields known only by either the kernel or userspace
>>> are useless for a given thread anyway. This way, libc could
>statically
>>> define the structure.
>> 
>> Big fat NOPE there.  Why?  Because it means that EVERY interaction
>with
>> this memory, no matter how critical, needs to be conditionalized.
>> Furthermore, userspace != libc.  Applications or higher-layer
>libraries
>> might have more information than the running libc about additional
>> fields, but with your proposal libc would gate them.
>
>Good point!
>
>> 
>> As far as the kernel providing the size in the structure (alone) -- I
>> *really* hope you can see what is wrong with that!!  That doesn't
>mean
>> we can't provide it in the structure as well, and that too might
>avoid
>> the skipped libc problem.
>
>Indeed, libc would need to query the size before it can allocate
>the structure.
>
>> 
>>> I would be tempted to also add "features" flags, so both user-space
>>> and the kernel could tell each other what they support: user-space
>>> would announce the set of features it supports, and it could also
>>> query the kernel for the set of supported features. One simple
>approach
>>> would be to use a uint64_t as type for those feature flags, and
>>> reserve the last bit for extending to future flags if we ever have
>>> more than 64.
>>>
>>> Thoughts ?
>> 
>> It doesn't seem like it would hurt, although the size of the flags
>field
>> could end up being an issue.
>
>I'm concerned that this thread-local ABI structure may become messy.
>Let's just imagine how we would first introduce a "cpu_id" field
>(int32_t),
>and eventually add a "seqnum" field for rseq in the future (unsigned
>long).
>
>Both fields need to be read with single-copy semantics as volatile
>reads, and both need to be naturally aligned. However, I'm tempted
>to use the "packed" attribute on the structure since it's an ABI
>between kernel and user-space. A pretty bad example of what this
>could become, due to alignment constraints, looks like:
>
>/* This structure needs to be aligned on pointer size. */
>struct thread_local_abi {
>        int32_t cpu_id;
>        int32_t __unused1;
>        unsigned long seqnum;
>        /* Add new fields at the end. */
>} __attribute__((packed));
>
>And this is just a start. It may become messier as we append
>new fields in the future.
>
>The main argument I currently see in favor of having this
>meta system call for all per-thread features is to only
>maintain a single pointer in the kernel task_struct rather
>than one per thread-local feature.
>
>If the goal is really to keep the burden on the task struct
>small, we could use kmalloc()/kfree() to allocate and free an
>array of pointers to the various per-thread features, rather
>than putting them directly in task_struct. We could keep a
>mask of the enabled features in the task struct too (which
>we will likely have to do even if we go the the thread-local
>ABI meta system call).
>
>Having this per-task allocated pointer array at kernel-level
>would allow us to have one system call per feature, with clear
>semantics, without evolving a messy thread-local ABI structure
>due to all sorts of alignment constraints.
>
>Thoughts ?
>
>Thanks,
>
>Mathieu

I think you are worried about problems which we have already solved many, many times - structures are very common in the user space ABI and we know how to deal with this.

And when you say:

> However, I'm tempted
> to use the "packed" attribute on the structure
> since it's an ABI
> between kernel and user-space.

and mention "unsigned long" in a user space ABI all I can think of that you really have not followed the issues of user space ABI design as they have evolved over the last 20 years.

Simply put: non-problem.  
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-27 18:35                                 ` Linus Torvalds
  0 siblings, 0 replies; 96+ messages in thread
From: Linus Torvalds @ 2016-02-27 18:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, H. Peter Anvin, Thomas Gleixner,
	Andrew Morton, Russell King, Ingo Molnar,
	Linux Kernel Mailing List, linux-api, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ben Maurer, rostedt, Paul E. McKenney, Josh Triplett,
	Catalin Marinas, Will Deacon, Michael Kerrisk

On Sat, Feb 27, 2016 at 6:58 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Paul's patches have the following structure:
>
> struct thread_local_abi {
>         union {
>                 struct {
>                         u32     cpu_id;
>                         u32     seq;
>                 };
>                 u64 cpu_seq;
>         };
>         unsigned long post_commit_ip;
> };

Please don't do "unsigned long" in ABI structures any more.

Make it u64, and make sure it is 64-bit aligned (which it would be in
this case). Make it so that we don't have to have separate compat
paths.
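
That is, something along these lines (sketch):

struct thread_local_abi {
	union {
		struct {
			u32	cpu_id;
			u32	seq;
		};
		u64	cpu_seq;
	};
	u64	post_commit_ip;	/* u64 even on 32-bit; 64-bit aligned here */
};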

               Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-27 19:01                                   ` H. Peter Anvin
  0 siblings, 0 replies; 96+ messages in thread
From: H. Peter Anvin @ 2016-02-27 19:01 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra
  Cc: Mathieu Desnoyers, Thomas Gleixner, Andrew Morton, Russell King,
	Ingo Molnar, Linux Kernel Mailing List, linux-api, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Andi Kleen, Dave Watson,
	Chris Lameter, Ben Maurer, rostedt, Paul E. McKenney,
	Josh Triplett, Catalin Marinas, Will Deacon, Michael Kerrisk

On February 27, 2016 10:35:28 AM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Sat, Feb 27, 2016 at 6:58 AM, Peter Zijlstra <peterz@infradead.org>
>wrote:
>>
>> Paul's patches have the following structure:
>>
>> struct thread_local_abi {
>>         union {
>>                 struct {
>>                         u32     cpu_id;
>>                         u32     seq;
>>                 };
>>                 u64 cpu_seq;
>>         };
>>         unsigned long post_commit_ip;
>> };
>
>Please don't do "unsigned long" in ABI structures any more.
>
>Make it u64, and make sure it is 64-bit aligned (which it would be in
>this case). Make it so that we don't have to have separate compat
>paths.
>
>               Linus

Yes, if we have to do compat crap for this entire new ABI path I think I'll scream.
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
  2016-02-27 18:35                                 ` Linus Torvalds
  (?)
  (?)
@ 2016-02-27 23:53                                 ` Mathieu Desnoyers
       [not found]                                   ` <CA+55aFwcgwRxvVBz5kk_3O8dESXAGJ4KHBkf=pSXjiS7Xh4NwA@mail.gmail.com>
  -1 siblings, 1 reply; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-27 23:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Andrew Morton,
	Russell King, Ingo Molnar, Linux Kernel Mailing List, linux-api,
	Paul Turner, Andrew Hunter, Andy Lutomirski, Andi Kleen,
	Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	Paul E. McKenney, Josh Triplett, Catalin Marinas, Will Deacon,
	Michael Kerrisk

----- On Feb 27, 2016, at 1:35 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Sat, Feb 27, 2016 at 6:58 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> Paul's patches have the following structure:
>>
>> struct thread_local_abi {
>>         union {
>>                 struct {
>>                         u32     cpu_id;
>>                         u32     seq;
>>                 };
>>                 u64 cpu_seq;
>>         };
>>         unsigned long post_commit_ip;
>> };
> 
> Please don't do "unsigned long" in ABI structures any more.
> 
> Make it u64, and make sure it is 64-bit aligned (which it would be in
> this case). Make it so that we don't have to have separate compat
> paths.

AFAIU, this "post_commit_ip" field is expected to be updated
with a single-copy-store by user-space. If we want to handle both
32-bit and 64-bit processes, how do you recommend doing this
without an unsigned long type ?

A 64-bit integer would not be a single-copy store for
32-bit processes, but a 32-bit integer would not be large
enough for 64-bit processes.

Would a

union {
    uint32_t val32;
    uint64_t val64;
} field;

be an acceptable option ? Then the kernel could use
one field or the other depending on the process bitness.
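
(For illustration, a hypothetical user-space helper built on that union,
assuming, as suggested above, that the kernel reads val32 or val64
depending on the bitness of the task:)

#include <stdint.h>

union post_commit_ip {
	uint32_t val32;
	uint64_t val64;
};

/* Store the post-commit IP with a single copy from user-space. */
static inline void set_post_commit_ip(union post_commit_ip *p, void *ip)
{
#ifdef __LP64__
	p->val64 = (uint64_t)(uintptr_t)ip;	/* one 64-bit store */
#else
	p->val32 = (uint32_t)(uintptr_t)ip;	/* one 32-bit store */
#endif
}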

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-28  0:57                                         ` Linus Torvalds
  0 siblings, 0 replies; 96+ messages in thread
From: Linus Torvalds @ 2016-02-28  0:57 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ben Maurer, Thomas Gleixner, Ingo Molnar, Russell King,
	linux-api, Andrew Morton, Michael Kerrisk, Dave Watson, rostedt,
	Andy Lutomirski, Will Deacon, Paul E. McKenney, Chris Lameter,
	Andi Kleen, Josh Triplett, Paul Turner,
	Linux Kernel Mailing List, Catalin Marinas, Andrew Hunter,
	H. Peter Anvin, Peter Zijlstra

On Sat, Feb 27, 2016 at 4:39 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
>
> I'm particularly interested to know what are the best practices to
> deal with an extensible bitfield (the features mask). cpu_set_t
> and sigmask each seem to do their own thing.

Quite frankly, why would the kernel ever touch anything else?

And if the kernel doesn't touch anything else, why make it part of the ABI?

I don't see why the kernel would ever want to have a more complex
interface. Explain.

           Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-28 13:07                                   ` Geert Uytterhoeven
  0 siblings, 0 replies; 96+ messages in thread
From: Geert Uytterhoeven @ 2016-02-28 13:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Mathieu Desnoyers, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton, Russell King, Ingo Molnar,
	Linux Kernel Mailing List, linux-api, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ben Maurer, rostedt, Paul E. McKenney, Josh Triplett,
	Catalin Marinas, Will Deacon, Michael Kerrisk

Hi Linus,

On Sat, Feb 27, 2016 at 7:35 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sat, Feb 27, 2016 at 6:58 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> Paul's patches have the following structure:
>>
>> struct thread_local_abi {
>>         union {
>>                 struct {
>>                         u32     cpu_id;
>>                         u32     seq;
>>                 };
>>                 u64 cpu_seq;
>>         };
>>         unsigned long post_commit_ip;
>> };
>
> Please don't do "unsigned long" in ABI structures any more.
>
> Make it u64, and make sure it is 64-bit aligned (which it would be in
> this case). Make it so that we don't have to have separate compat
> paths.

__alignof__(u64) is not 8 on all architectures.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-28 14:32                                           ` Mathieu Desnoyers
  0 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-28 14:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben Maurer, Thomas Gleixner, Ingo Molnar, Russell King,
	linux-api, Andrew Morton, Michael Kerrisk, Dave Watson, rostedt,
	Andy Lutomirski, Will Deacon, Paul E. McKenney, Chris Lameter,
	Andi Kleen, Josh Triplett, Paul Turner,
	Linux Kernel Mailing List, Catalin Marinas, Andrew Hunter,
	H. Peter Anvin, Peter Zijlstra

----- On Feb 27, 2016, at 7:57 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Sat, Feb 27, 2016 at 4:39 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>>
>> I'm particularly interested to know what are the best practices to
>> deal with an extensible bitfield (the features mask). cpu_set_t
>> and sigmask each seem to do their own thing.
> 
> Quite frankly, why would the kernel ever touch anything else?
> 
> And if the kernel doesn't touch anything else, why make it part of the ABI?
> 
> I don't see why the kernel would ever want to have a more complex
> interface. Explain.

The part of ABI I'm trying to express here is for discoverability
of available features by user-space. For instance, a kernel
could be configured with "CONFIG_RSEQ=n", and userspace should
not rely on the rseq fields of the thread-local ABI in that case.

The initial idea I had was to populate a mask of available features
(hence my question above), but now that I think about it, we could
perhaps have a "query" system call receiving a "feature number", no
mask needed then. E.g.:

enum thread_local_abi_features {
    THREAD_LOCAL_FEATURE_CPU_ID = 0,
    THREAD_LOCAL_FEATURE_RSEQ = 1,
    /* Add future features here. */
};

int thread_local_abi_feature(uint64_t feature);

Another option would be to rely on specific "uninitialized"
values for each feature in struct thread_local_abi (e.g. -1
for cpu_id). We may need to reserve extra space for
"feature enabled" booleans in cases where the uninitialized
value is also used when initialized (e.g. a sequence counter).
The advantage of using the uninitialized value and/or the
"boolean" within the struct thread_local_abi is that testing
whether the feature is active can be done by reading from
the same cache-line as when using the feature (in user-space).
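
(For illustration, a sketch of such a fast-path check, assuming the
structure discussed in this thread with a signed cpu_id where any
negative value means the field cannot be used:)

#define _GNU_SOURCE
#include <sched.h>	/* sched_getcpu() fallback */
#include <stdint.h>

struct thread_local_abi {	/* layout as discussed above */
	int32_t cpu_id;
	/* ... */
};

static inline int32_t read_cpu_id(const struct thread_local_abi *tlabi)
{
	int32_t cpu = tlabi->cpu_id;	/* same cache line as the normal fast path */

	if (cpu < 0)			/* feature inactive or unavailable */
		return sched_getcpu();	/* fall back to the syscall/vDSO */
	return cpu;
}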

Not sure what would be the best option here.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-28 16:21                                     ` Linus Torvalds
  0 siblings, 0 replies; 96+ messages in thread
From: Linus Torvalds @ 2016-02-28 16:21 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Peter Zijlstra, Mathieu Desnoyers, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton, Russell King, Ingo Molnar,
	Linux Kernel Mailing List, linux-api, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ben Maurer, rostedt, Paul E. McKenney, Josh Triplett,
	Catalin Marinas, Will Deacon, Michael Kerrisk

On Sun, Feb 28, 2016 at 5:07 AM, Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
>
> __alignof__(u64) is not 8 on all architectures.

Indeed, which is why I said "make sure it's 64-bit aligned". We do it
manually for ABI structures (although we did have some discussion
about adding an alignment directive, and then having an explicitly
unaligned type for legacy cases that we got wrong).

In the above case it was already properly aligned, because the
previous structure members added up to 64-bit boundaries.
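
(For illustration, one way to keep that manual bookkeeping honest is a
build-time check on the offset; a sketch, assuming C11 and the layout
discussed above:)

#include <stddef.h>
#include <stdint.h>

struct thread_local_abi {
	uint32_t cpu_id;
	uint32_t seq;		/* pads the struct out to an 8-byte boundary */
	uint64_t post_commit_ip;
};

_Static_assert(offsetof(struct thread_local_abi, post_commit_ip) % 8 == 0,
	       "64-bit ABI field must sit on a 64-bit boundary");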

Of course, nothing then stops user space from giving us structures
that are unaligned to begin with, but that's not our problem. As long
as the layout is correct, we're fine, and that's all we care about.

              Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-29 10:01                                   ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2016-02-29 10:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, H. Peter Anvin, Thomas Gleixner,
	Andrew Morton, Russell King, Ingo Molnar,
	Linux Kernel Mailing List, linux-api, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Andi Kleen, Dave Watson, Chris Lameter,
	Ben Maurer, rostedt, Paul E. McKenney, Josh Triplett,
	Catalin Marinas, Will Deacon, Michael Kerrisk

On Sat, Feb 27, 2016 at 10:35:28AM -0800, Linus Torvalds wrote:
> On Sat, Feb 27, 2016 at 6:58 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Paul's patches have the following structure:
> >
> > struct thread_local_abi {
> >         union {
> >                 struct {
> >                         u32     cpu_id;
> >                         u32     seq;
> >                 };
> >                 u64 cpu_seq;
> >         };
> >         unsigned long post_commit_ip;
> > };
> 
> Please don't do "unsigned long" in ABI structures any more.
> 
> Make it u64, and make sure it is 64-bit aligned (which it would be in
> this case). Make it so that we don't have to have separate compat
> paths.

Yes, for sure. I was 'only' trying to reflect the state of the last rseq
patches. But yes, I should have called that out and avoided 'confusion'.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-29 10:32                                         ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2016-02-29 10:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Ben Maurer, Thomas Gleixner, Ingo Molnar,
	Russell King, linux-api, Andrew Morton, Michael Kerrisk,
	Dave Watson, rostedt, Andy Lutomirski, Will Deacon,
	Paul E. McKenney, Chris Lameter, Andi Kleen, Josh Triplett,
	Paul Turner, Linux Kernel Mailing List, Catalin Marinas,
	Andrew Hunter, H. Peter Anvin

On Sun, Feb 28, 2016 at 12:39:54AM +0000, Mathieu Desnoyers wrote:

> /* This structure needs to be aligned cache line size. */
> struct thread_local_abi {
>   int32_t  cpu_id;
>   uint32_t rseq_seqnum;
>   uint64_t rseq_post_commit_ip;
>   /* Add new fields at the end. */ 
> } __attribute__((packed));

I would really not use packed; that can lead to horrible layout.

Suppose someone would add:

	uint32_t foo;
	uint64_t bar;

With packed, you get an unaligned uint64_t in there, which is horrible.
Without packed, you get a hole, which you can later fill.
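
(For illustration, the two layouts side by side, on an ABI where
uint64_t is naturally 8-byte aligned, with hypothetical foo/bar fields
appended:)

#include <stdint.h>

/* packed: bar lands at offset 20, i.e. an unaligned u64 */
struct tla_packed {
	uint32_t cpu_id, seq;
	uint64_t post_commit_ip;
	uint32_t foo;
	uint64_t bar;
} __attribute__((packed));

/* default layout: a 4-byte hole follows foo, bar sits aligned at
 * offset 24, and the hole can later be filled by a new u32 field */
struct tla_padded {
	uint32_t cpu_id, seq;
	uint64_t post_commit_ip;
	uint32_t foo;
	uint64_t bar;
};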

> /* Thread local ABI system calls. */ 
> 
> int thread_local_abi_len(size_t *features_mask_len, size_t *tlabi_len); 

See below; maybe we can fudge the register call to return the size when
called 'right', maybe that'll end up too ugly, dunno. But I don't think
we need the feature mask bits.

Maybe: TLA_FLAG_GETSIZE ?

> int thread_local_abi_features(uint8_t *mask); 

Not sure you need this; see below. Either you know about a
TLA_ENABLE_feat flag and you can attempt enabling it (failing if the
kernel doesn't support it), or you don't, in which case you won't
attempt to use it.

> int thread_local_abi_register(struct thread_local_abi *tlabi); 

This has the problem that the moment you register for this, we must have
all features enabled. And esp. the rseq stuff has non-trivial overhead.

I would much rather have something where we only enable the features
actually used by the program at hand.


Also, every syscall should have a flags argument, so maybe we can do
something like:

	#define TLA_ENABLE_CPU		0x01
	#define TLA_ENABLE_RSEQ		0x03 /* RSEQ must imply CPU */

	int thread_local_abi_register(struct tla *tla, unsigned int enable, unsigned int flags);

Where (g)libc would unconditionally set up the structure with
.enabled=0, .flags=0, and anybody actually wanting to make use of the
thing do:

	thread_local_abi_register(NULL, TLA_ENABLE_CPU, 0);

Obviously calling register with !NULL address twice will error (you
already registered), calling with NULL before !NULL will also error.


And if you really worry about running out of feature bits, we could of
course pass it in a mask, but I'm not sure I can see 30 other features
we would want to cram into this (yes, yes, famous last words etc.. 640kb
anyone?).

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-29 10:35                                             ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2016-02-29 10:35 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Ben Maurer, Thomas Gleixner, Ingo Molnar,
	Russell King, linux-api, Andrew Morton, Michael Kerrisk,
	Dave Watson, rostedt, Andy Lutomirski, Will Deacon,
	Paul E. McKenney, Chris Lameter, Andi Kleen, Josh Triplett,
	Paul Turner, Linux Kernel Mailing List, Catalin Marinas,
	Andrew Hunter, H. Peter Anvin

On Sun, Feb 28, 2016 at 02:32:28PM +0000, Mathieu Desnoyers wrote:
> The part of ABI I'm trying to express here is for discoverability
> of available features by user-space. For instance, a kernel
> could be configured with "CONFIG_RSEQ=n", and userspace should
> not rely on the rseq fields of the thread-local ABI in that case.

Per the just-proposed interface, discoverability would end with:

	thread_local_abi_register(NULL, TLA_ENABLE_RSEQ, 0);

failing. This would indicate that your kernel does not support it (or that
your glibc failed to register, depending on the error code, I suppose).

Then your program can either fall back to full atomics or just bail.
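
(For illustration, that check could be wrapped as in the sketch below,
which assumes the interface proposed earlier in this thread and that the
thread_local_abi area was already registered with a non-NULL pointer:)

static int have_rseq(void)
{
	/* Fails if the kernel was built without rseq support (or if
	 * registration never happened). */
	return thread_local_abi_register(NULL, TLA_ENABLE_RSEQ, 0) == 0;
}

A program would call this once at start-up and take the atomics-based
fallback path whenever it returns 0.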

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-29 10:39                                           ` Arnd Bergmann
  0 siblings, 0 replies; 96+ messages in thread
From: Arnd Bergmann @ 2016-02-29 10:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ben Maurer, Thomas Gleixner, Ingo Molnar,
	Russell King, linux-api, Andrew Morton, Michael Kerrisk,
	Dave Watson, rostedt, Andy Lutomirski, Will Deacon,
	Paul E. McKenney, Chris Lameter, Andi Kleen, Josh Triplett,
	Paul Turner, Linux Kernel Mailing List, Catalin Marinas,
	Andrew Hunter, H. Peter Anvin, Mathieu Desnoyers

On Monday 29 February 2016 11:32:21 Peter Zijlstra wrote:
> On Sun, Feb 28, 2016 at 12:39:54AM +0000, Mathieu Desnoyers wrote:
> 
> > /* This structure needs to be aligned cache line size. */
> > struct thread_local_abi {
> >   int32_t  cpu_id;
> >   uint32_t rseq_seqnum;
> >   uint64_t rseq_post_commit_ip;
> >   /* Add new fields at the end. */ 
> > } __attribute__((packed));
> 
> I would really not use packed; that can lead to horrible layout.
> 
> Suppose someone would add:
> 
> 	uint32_t foo;
> 	uint64_t bar;
> 
> With packed, you get an unaligned uint64_t in there, which is horrible.
> Without packed, you get a hole, which you can later fill.

What's making things worse is that on some architectures, adding
__packed will force access by bytes rather than just reading
32-bit or 64-bit numbers directly, so it's slow and non-atomic.

	Arnd

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
  2016-02-29 10:39                                           ` Arnd Bergmann
@ 2016-02-29 12:41                                             ` Mathieu Desnoyers
  -1 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-02-29 12:41 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Peter Zijlstra, Linus Torvalds, Ben Maurer, Thomas Gleixner,
	Ingo Molnar, Russell King, linux-api, Andrew Morton,
	Michael Kerrisk, Dave Watson, rostedt, Andy Lutomirski,
	Will Deacon, Paul E. McKenney, Chris Lameter, Andi Kleen,
	Josh Triplett, Paul Turner, Linux Kernel Mailing List,
	Catalin Marinas, Andrew Hunter, H. Peter Anvin

----- On Feb 29, 2016, at 5:39 AM, Arnd Bergmann arnd@arndb.de wrote:

> On Monday 29 February 2016 11:32:21 Peter Zijlstra wrote:
>> On Sun, Feb 28, 2016 at 12:39:54AM +0000, Mathieu Desnoyers wrote:
>> 
>> > /* This structure needs to be aligned cache line size. */
>> > struct thread_local_abi {
>> >   int32_t  cpu_id;
>> >   uint32_t rseq_seqnum;
>> >   uint64_t rseq_post_commit_ip;
>> >   /* Add new fields at the end. */
>> > } __attribute__((packed));
>> 
>> I would really not use packed; that can lead to horrible layout.
>> 
>> Suppose someone would add:
>> 
>> 	uint32_t foo;
>> 	uint64_t bar;
>> 
>> With packed, you get an unaligned uint64_t in there, which is horrible.
>> Without packed, you get a hole, which you can later fill.
> 

Actually, Peter is wrong about the hole there. On some 32-bit architectures,
64-bit integers are aligned on 32-bit, not 64-bit. So there may or may not
be a hole there, and that would lead to a mess.

> What's making things worse is that on some architectures, adding
> __packed will force access by bytes rather than just reading
> 32-bit or 64-bit numbers directly, so it's slow and non-atomic.

Agreed that many architectures issue slower instructions when reading
from packed structures, which is unwanted.

Could we require that each field be naturally aligned and require that
they are placed so _no_ padding whatsoever should ever be added by the
compiler ? If that's possible, then we could remove the packed.
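
(For illustration, such a "no padding allowed" rule can at least be
checked at build time; a C11 sketch against the structure quoted above,
without the packed attribute:)

#include <stdint.h>

struct thread_local_abi {
	int32_t  cpu_id;
	uint32_t rseq_seqnum;
	uint64_t rseq_post_commit_ip;
};

_Static_assert(sizeof(struct thread_local_abi) ==
	       sizeof(int32_t) + sizeof(uint32_t) + sizeof(uint64_t),
	       "compiler-inserted padding is not allowed in the ABI struct");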

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-29 13:08                                               ` Arnd Bergmann
  0 siblings, 0 replies; 96+ messages in thread
From: Arnd Bergmann @ 2016-02-29 13:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Linus Torvalds, Ben Maurer, Thomas Gleixner,
	Ingo Molnar, Russell King, linux-api, Andrew Morton,
	Michael Kerrisk, Dave Watson, rostedt, Andy Lutomirski,
	Will Deacon, Paul E. McKenney, Chris Lameter, Andi Kleen,
	Josh Triplett, Paul Turner, Linux Kernel Mailing List,
	Catalin Marinas, Andrew Hunter, H. Peter Anvin

On Monday 29 February 2016 12:41:49 Mathieu Desnoyers wrote:
> ----- On Feb 29, 2016, at 5:39 AM, Arnd Bergmann arnd@arndb.de wrote:

> > What's making things worse is that on some architectures, adding
> > __packed will force access by bytes rather than just reading
> > 32-bit or 64-bit numbers directly, so it's slow and non-atomic.
> 
> Agreed that many architectures issue slower instructions when reading
> from packed structures, which is unwanted.
> 
> Could we require that each field be naturally aligned and require that
> they are placed so _no_ padding whatsoever should ever be added by the
> compiler ? If that's possible, then we could remove the packed.

Yes, I think that is a reasonable requirement.

	Arnd

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-02-29 18:19                                               ` H. Peter Anvin
  0 siblings, 0 replies; 96+ messages in thread
From: H. Peter Anvin @ 2016-02-29 18:19 UTC (permalink / raw)
  To: Mathieu Desnoyers, Arnd Bergmann
  Cc: Peter Zijlstra, Linus Torvalds, Ben Maurer, Thomas Gleixner,
	Ingo Molnar, Russell King, linux-api, Andrew Morton,
	Michael Kerrisk, Dave Watson, rostedt, Andy Lutomirski,
	Will Deacon, Paul E. McKenney, Chris Lameter, Andi Kleen,
	Josh Triplett, Paul Turner, Linux Kernel Mailing List,
	Catalin Marinas, Andrew Hunter

On February 29, 2016 4:41:49 AM PST, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>Agreed that many architectures issue slower instructions when reading
>from packed structures, which is unwanted.
>

And detrimental to atomicity.

>Could we require that each field be naturally aligned and require that
>they are placed so _no_ padding whatsoever should ever be added by the
>compiler ? If that's possible, then we could remove the packed.

What people have been trying to tell you is that we *must* do this, and no compiler trick like packed will help.


-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-03-01 18:25                                         ` H. Peter Anvin
  0 siblings, 0 replies; 96+ messages in thread
From: H. Peter Anvin @ 2016-03-01 18:25 UTC (permalink / raw)
  To: Mathieu Desnoyers, Linus Torvalds
  Cc: Ben Maurer, Thomas Gleixner, Ingo Molnar, Russell King,
	linux-api, Andrew Morton, Michael Kerrisk, Dave Watson, rostedt,
	Andy Lutomirski, Will Deacon, Paul E. McKenney, Chris Lameter,
	Andi Kleen, Josh Triplett, Paul Turner,
	Linux Kernel Mailing List, Catalin Marinas, Andrew Hunter,
	Peter Zijlstra

On 02/27/16 16:39, Mathieu Desnoyers wrote:
> 
> Very good points! Would the following interfaces be acceptable ?
> 
> /* This structure needs to be aligned cache line size. */
> struct thread_local_abi {
>         int32_t cpu_id;                               /* Aligned on
> 32-bit. */
>         uint32_t rseq_seqnum;                 /* Aligned on 32-bit. */
>         uint64_t rseq_post_commit_ip;   /* Aligned on 64-bit. */
>         /* Add new fields at the end. */
> } __attribute__((packed));
> 

First of all, DO NOT use __attribute__((packed)).  For one thing, it
buggers up the alignment of the *entire structure* (the alignment of a
packed structure defaults to 1, and gcc will assume the whole structure
is misaligned, generating unaligned access instructions on architectures
which need them.)

Sadly gcc doesn't currently have an __attribute__ to express "error out
on padding" which is what you actually want here.

You may, however, want to add an explicit alignment attribute to make
sure it is cache line aligned.

Second, as far as the 32/64 bit issue is concerned, you have to order
the fields so you always access the LSB.  This is probably the best way
to do it:

#ifdef __LP64__
# define __FIELD_32_64(field,n)	uint64_t field;
#elif __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
# define __FIELD_32_64(field,n) uint32_t field, _unused ## n;
#else
# define __FIELD_32_64(field,n) uint32_t _unused ## n, field;
#endif

All these macros are intrinsic to gcc (and hopefully to gcc-compatible
compilers) so there are no header file dependencies.
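
(For illustration, a sketch of how the macro above would be used in the
structure being discussed, with an explicit alignment attribute,
assuming a 64-byte cache line here:)

struct thread_local_abi {
	int32_t  cpu_id;
	uint32_t rseq_seqnum;
	__FIELD_32_64(rseq_post_commit_ip, 0)
	/* Add new fields at the end. */
} __attribute__((aligned(64)));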

	-hpa

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-03-01 18:40                                           ` Mathieu Desnoyers
  0 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-03-01 18:40 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Ben Maurer, Thomas Gleixner, Ingo Molnar,
	Russell King, linux-api, Andrew Morton, Michael Kerrisk,
	Dave Watson, rostedt, Andy Lutomirski, Will Deacon,
	Paul E. McKenney, Chris Lameter, Andi Kleen, Josh Triplett,
	Paul Turner, Linux Kernel Mailing List, Catalin Marinas,
	Andrew Hunter, Peter Zijlstra

----- On Mar 1, 2016, at 1:25 PM, H. Peter Anvin hpa@zytor.com wrote:

> On 02/27/16 16:39, Mathieu Desnoyers wrote:
>> 
>> Very good points! Would the following interfaces be acceptable ?
>> 
>> /* This structure needs to be aligned cache line size. */
>> struct thread_local_abi {
>>         int32_t cpu_id;                               /* Aligned on
>> 32-bit. */
>>         uint32_t rseq_seqnum;                 /* Aligned on 32-bit. */
>>         uint64_t rseq_post_commit_ip;   /* Aligned on 64-bit. */
>>         /* Add new fields at the end. */
>> } __attribute__((packed));
>> 
> 
> First of all, DO NOT use __attribute__((packed)).  For one thing, it
> buggers up the alignment of the *entire structure* (the alignment of a
> packed structure defaults to 1, and gcc will assume the whole structure
> is misaligned, generating unaligned access instructions on architectures
> which need them.)
> 
> Sadly gcc doesn't currently have an __attribute__ to express "error out
> on padding" which is what you actually want here.

Good point.

> 
> You may, however, want to add an explicit alignment attribute to make
> sure it is cache line aligned.

Good idea, will do!

> 
> Second, as far as the 32/64 bit issue is concerned, you have to order
> the fields so you always access the LSB.  This is probably the best way
> to do it:
> 
> #ifdef __LP64__
> # define __FIELD_32_64(field,n)	uint64_t field;
> #elif __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> # define __FIELD_32_64(field,n) uint32_t field, _unused ## n;
> #else
> # define __FIELD_32_64(field,n) uint32_t _unused ## n, field;
> #endif
> 
> All these macros are intrinsic to gcc (and hopefully to gcc-compatible
> compilers) so there are no header file dependencies.

Thanks for the hint. I'll try it out.

Mathieu


> 
> 	-hpa

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-03-01 20:23                                               ` Mathieu Desnoyers
  0 siblings, 0 replies; 96+ messages in thread
From: Mathieu Desnoyers @ 2016-03-01 20:23 UTC (permalink / raw)
  To: Peter Zijlstra, H. Peter Anvin
  Cc: Linus Torvalds, Ben Maurer, Thomas Gleixner, Ingo Molnar,
	Russell King, linux-api, Andrew Morton, Michael Kerrisk,
	Dave Watson, rostedt, Andy Lutomirski, Will Deacon,
	Paul E. McKenney, Chris Lameter, Andi Kleen, Josh Triplett,
	Paul Turner, Linux Kernel Mailing List, Catalin Marinas,
	Andrew Hunter

----- On Feb 29, 2016, at 5:35 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Sun, Feb 28, 2016 at 02:32:28PM +0000, Mathieu Desnoyers wrote:
>> The part of ABI I'm trying to express here is for discoverability
>> of available features by user-space. For instance, a kernel
>> could be configured with "CONFIG_RSEQ=n", and userspace should
>> not rely on the rseq fields of the thread-local ABI in that case.
> 
> Per the just proposed interface; discoverability would end with:
> 
>	thread_local_abi_register(NULL, TLA_ENABLE_RSEQ, 0);
> 
> failing. This would indicate your kernel does not support (or your glibc
> failed to register, depending on error code I suppose).
> 
> Then your program can either fall back to full atomics or just bail.

I think it's important that user-space fast-paths can quickly
detect whether the feature is enabled without having to rely on
always reading a separate cache-line. I've put together an ABI
proposal that takes into account the feedback received so far.

The main trick here is to use "-1" value in cpu_id and rseq_seqnum
to mean "the feature is inactive" so user-space can call the system
call to register the feature, and the value "-2" can be set by the
kernel when it knows the feature is not available. It does mean
that seqnum would wrap from MAX_INT to 0 in the kernel, skipping
negative values.

Please let me know if I missed anything.

#ifdef __LP64__
# define TLABI_FIELD_u32_u64(field)     uint64_t field
#elif __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
# define TLABI_FIELD_u32_u64(field)     uint32_t field, _padding ## field
#else
# define TLABI_FIELD_u32_u64(field)     uint32_t _padding ## field, field
#endif

/*
 * The thread-local ABI structure needs to be aligned on at least
 * 32-byte boundaries.
 */
#define TLABI_ALIGNMENT         32

struct thread_local_abi {
        /*
         * Thread-local ABI cpu_id field.
         * Updated by the kernel, and read by user-space with
         * single-copy atomicity semantics. Aligned on 32-bit.
         * Values:
         * >= 0: CPU number of running thread.
         * -1 (initial value): means the cpu_id feature is inactive.
         * -2: cpu_id feature is not available.
         */
        int32_t cpu_id;

        /*
         * Thread-local ABI rseq_seqnum field.
         * Updated by the kernel, and read by user-space with
         * single-copy atomicity semantics. Aligned on 32-bit.
         * Values:
         * >= 0: current seqnum for this thread (feature is active).
         * -1 (initial value): means the rseq feature is inactive.
         * -2: rseq feature is not available.
         */
        int32_t rseq_seqnum;

        /*
         * Thread-local ABI rseq_post_commit_ip field.
         * Updated by user-space, and read by the kernel with
         * single-copy atomicity semantics.
         * Aligned on 64-bit.
         */
        TLABI_FIELD_u32_u64(rseq_post_commit_ip);

        /* Add new fields at the end. */
} __attribute__ ((aligned(TLABI_ALIGNMENT)));

enum thread_local_abi_feature {
        TLA_FEATURE_NONE = 0,
        TLA_FEATURE_CPU_ID = (1 << 0),
        TLA_FEATURE_RSEQ = (1 << 1),
};

/*
 * Thread local ABI system call.
 *
 * First, call with (NULL, 0, 0); it returns the size of the struct
 * thread_local_abi expected by the kernel, or -1 on error.
 *
 * Second, allocate a memory area to hold the struct thread_local_abi,
 * and call with (ptr, 0, 0). Returns 0 on success, or -1 on error.
 *
 * Third, enable specific features by passing a mask, e.g. call with
 * (NULL, TLA_FEATURE_CPU_ID | TLA_FEATURE_RSEQ, 0).
 * Returns 0 on success, -1 on error.
 *
 * Then the fields associated with the enabled features are managed by
 * the kernel.
 */
ssize_t thread_local_abi(struct thread_local_abi *tlabi,
                uint64_t feature_mask, int flags);
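
For concreteness, a minimal user-space usage sketch of the above,
reusing the definitions from this proposal (the __NR_thread_local_abi
syscall number, the helper names and the sched_getcpu() fallback are
just illustrative; error handling is kept minimal):

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

static __thread struct thread_local_abi *tlabi;

static int tlabi_init(void)
{
        ssize_t len;

        /* 1) Ask the kernel for the size of its struct thread_local_abi. */
        len = syscall(__NR_thread_local_abi, NULL, 0, 0);
        if (len < 0)
                return -1;
        /* 2) Register a suitably aligned, initialized memory area. */
        if (posix_memalign((void **)&tlabi, TLABI_ALIGNMENT, len))
                return -1;
        memset(tlabi, 0, len);
        tlabi->cpu_id = -1;
        tlabi->rseq_seqnum = -1;
        if (syscall(__NR_thread_local_abi, tlabi, 0, 0))
                return -1;
        /* 3) Enable the cpu_id feature. */
        return syscall(__NR_thread_local_abi, NULL, TLA_FEATURE_CPU_ID, 0);
}

static inline int read_cpu(void)
{
        int32_t cpu = tlabi->cpu_id;

        /* -1: inactive, -2: unavailable; fall back to the slow path. */
        if (cpu < 0)
                return sched_getcpu();
        return cpu;
}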

Thanks for your feedback!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-03-01 21:32                                                 ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2016-03-01 21:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: H. Peter Anvin, Linus Torvalds, Ben Maurer, Thomas Gleixner,
	Ingo Molnar, Russell King, linux-api, Andrew Morton,
	Michael Kerrisk, Dave Watson, rostedt, Andy Lutomirski,
	Will Deacon, Paul E. McKenney, Chris Lameter, Andi Kleen,
	Josh Triplett, Paul Turner, Linux Kernel Mailing List,
	Catalin Marinas, Andrew Hunter

On Tue, Mar 01, 2016 at 08:23:12PM +0000, Mathieu Desnoyers wrote:
> I think it's important that user-space fast-paths can quickly
> detect whether the feature is enabled without having to rely on
> always reading a separate cache-line. I've put together an ABI
> proposal that takes into account the feedback received so far.

Nah, adding detection code to fast paths is silly; it makes them less
fast. Doesn't userspace have self-modifying code? I know that at least
glibc does linker trickery to call different functions depending on
runtime context.

> struct thread_local_abi {
>         /*
>          * Thread-local ABI cpu_id field.
>          * Updated by the kernel, and read by user-space with
>          * single-copy atomicity semantics. Aligned on 32-bit.
>          * Values:
>          * >= 0: CPU number of running thread.
>          * -1 (initial value): means the cpu_id feature is inactive.
>          * -2: cpu_id feature is not available.
>          */
>         int32_t cpu_id;
> 
>         /*
>          * Thread-local ABI rseq_seqnum field.
>          * Updated by the kernel, and read by user-space with
>          * single-copy atomicity semantics. Aligned on 32-bit.
>          * Values:
>          * >= 0: current seqnum for this thread (feature is active).
>          * -1 (initial value): means the rseq feature is inactive.
>          * -2: rseq feature is not available.
>          */
>         int32_t rseq_seqnum;

So I really hate that; it means we have to check for these special
values whenever we increment the seq count, and we cannot have it wrap
naturally.
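
Concretely, instead of a plain wrapping increment, the update ends up
looking something like this (hypothetical helper, not from the patch
set):

#include <stdint.h>

/* With -1/-2 reserved, the sequence number must skip the negative
 * range instead of wrapping naturally. */
static inline int32_t rseq_seqnum_advance(int32_t cur)
{
        if (cur < 0)            /* -1: inactive, -2: unavailable */
                return cur;     /* leave the marker alone */
        if (cur == INT32_MAX)   /* must not wrap into the reserved negatives */
                return 0;
        return cur + 1;
}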

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-03-01 21:36                                                   ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2016-03-01 21:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: H. Peter Anvin, Linus Torvalds, Ben Maurer, Thomas Gleixner,
	Ingo Molnar, Russell King, linux-api, Andrew Morton,
	Michael Kerrisk, Dave Watson, rostedt, Andy Lutomirski,
	Will Deacon, Paul E. McKenney, Chris Lameter, Andi Kleen,
	Josh Triplett, Paul Turner, Linux Kernel Mailing List,
	Catalin Marinas, Andrew Hunter

On Tue, Mar 01, 2016 at 10:32:02PM +0100, Peter Zijlstra wrote:

> >         /*
> >          * Thread-local ABI rseq_seqnum field.
> >          * Updated by the kernel, and read by user-space with
> >          * single-copy atomicity semantics. Aligned on 32-bit.
> >          * Values:
> >          * >= 0: current seqnum for this thread (feature is active).
> >          * -1 (initial value): means the rseq feature is inactive.
> >          * -2: rseq feature is not available.
> >          */
> >         int32_t rseq_seqnum;
> 
> So I really hate that; it means we have to check for these special
> values whenever we increment the seq count, and we cannot have it wrap
> naturally.

Also, since it will wrap, uint32_t is more natural, since signed
overflow is undefined in C.
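
For reference, a tiny standalone illustration (standard C, nothing
specific to this series): unsigned arithmetic wraps modulo 2^32,
whereas overflowing a signed integer is undefined behaviour.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t u = UINT32_MAX;

        u++;                            /* well defined: wraps around to 0 */
        printf("%" PRIu32 "\n", u);     /* prints 0 */

        /* int32_t s = INT32_MAX; s++;     <- undefined behaviour */
        return 0;
}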

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
  2016-03-01 21:32                                                 ` Peter Zijlstra
  (?)
  (?)
@ 2016-03-01 21:47                                                 ` H. Peter Anvin
  2016-03-02 10:34                                                   ` Peter Zijlstra
  -1 siblings, 1 reply; 96+ messages in thread
From: H. Peter Anvin @ 2016-03-01 21:47 UTC (permalink / raw)
  To: Peter Zijlstra, Mathieu Desnoyers
  Cc: Linus Torvalds, Ben Maurer, Thomas Gleixner, Ingo Molnar,
	Russell King, linux-api, Andrew Morton, Michael Kerrisk,
	Dave Watson, rostedt, Andy Lutomirski, Will Deacon,
	Paul E. McKenney, Chris Lameter, Andi Kleen, Josh Triplett,
	Paul Turner, Linux Kernel Mailing List, Catalin Marinas,
	Andrew Hunter

On 03/01/16 13:32, Peter Zijlstra wrote:
> On Tue, Mar 01, 2016 at 08:23:12PM +0000, Mathieu Desnoyers wrote:
>> I think it's important that user-space fast-paths can quickly
>> detect whether the feature is enabled without having to rely on
>> always reading a separate cache-line. I've put together an ABI
>> proposal that takes into account the feedback received so far.
> 
> Nah, adding detection code to fast paths is silly; it makes them less
> fast. Doesn't userspace have self-modifying code? I know that at least
> glibc does linker trickery to call different functions depending on
> runtime context.
> 

No, userspace does not have self-modifying code.  The glibc indirect
function is done at dynamic link time; it is also worth noting that
resolving global symbols through dynamic linking often requires an
indirect call.
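
For reference, a rough sketch of the ifunc mechanism (requires a GNU
toolchain and ELF; the function names and the feature probe below are
placeholders):

/* The resolver runs once when the symbol is bound by the dynamic
 * linker and picks an implementation; the fast path itself then
 * contains no feature check. */
static int getcpu_cached(void)  { return 0; /* e.g. read the cached cpu_id */ }
static int getcpu_syscall(void) { return 0; /* e.g. fall back to getcpu() */ }

static int have_cpu_cache(void) { return 0; /* placeholder feature probe */ }

static int (*resolve_my_getcpu(void))(void)
{
        return have_cpu_cache() ? getcpu_cached : getcpu_syscall;
}

int my_getcpu(void) __attribute__((ifunc("resolve_my_getcpu")));

int main(void)
{
        return my_getcpu();
}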

	-hpa

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
  2016-03-01 21:47                                                 ` H. Peter Anvin
@ 2016-03-02 10:34                                                   ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2016-03-02 10:34 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Mathieu Desnoyers, Linus Torvalds, Ben Maurer, Thomas Gleixner,
	Ingo Molnar, Russell King, linux-api, Andrew Morton,
	Michael Kerrisk, Dave Watson, rostedt, Andy Lutomirski,
	Will Deacon, Paul E. McKenney, Chris Lameter, Andi Kleen,
	Josh Triplett, Paul Turner, Linux Kernel Mailing List,
	Catalin Marinas, Andrew Hunter

On Tue, Mar 01, 2016 at 01:47:38PM -0800, H. Peter Anvin wrote:
> On 03/01/16 13:32, Peter Zijlstra wrote:
> > On Tue, Mar 01, 2016 at 08:23:12PM +0000, Mathieu Desnoyers wrote:
> >> I think it's important that user-space fast-paths can quickly
> >> detect whether the feature is enabled without having to rely on
> >> always reading a separate cache-line. I've put together an ABI
> >> proposal that takes into account the feedback received so far.
> > 
> > Nah, adding detection code to fast paths is silly; it makes them less
> > fast. Doesn't userspace have self-modifying code? I know that at least
> > glibc does linker trickery to call different functions depending on
> > runtime context.
> > 
> 
> No, userspace does not have self-modifying code.  The glibc indirect
> function is done at dynamic link time; it is also worth noting that
> resolving global symbols through dynamic linking often requires an
> indirect call.

Boy that blows. And here I was thinking you could edit the code at
dynamic link time because nobody was running it yet :/

And I suppose JITs need an (effective) munmap()+mmap() cycle to ensure
the 'old' code is flushed from all caches etc.?

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread
@ 2016-03-02 10:44                                               ` Geert Uytterhoeven
  0 siblings, 0 replies; 96+ messages in thread
From: Geert Uytterhoeven @ 2016-03-02 10:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Arnd Bergmann, Peter Zijlstra, Linus Torvalds, Ben Maurer,
	Thomas Gleixner, Ingo Molnar, Russell King, linux-api,
	Andrew Morton, Michael Kerrisk, Dave Watson, rostedt,
	Andy Lutomirski, Will Deacon, Paul E. McKenney, Chris Lameter,
	Andi Kleen, Josh Triplett, Paul Turner,
	Linux Kernel Mailing List, Catalin Marinas, Andrew Hunter,
	H. Peter Anvin

On Mon, Feb 29, 2016 at 1:41 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> ----- On Feb 29, 2016, at 5:39 AM, Arnd Bergmann arnd@arndb.de wrote:
>
>> On Monday 29 February 2016 11:32:21 Peter Zijlstra wrote:
>>> On Sun, Feb 28, 2016 at 12:39:54AM +0000, Mathieu Desnoyers wrote:
>>>
>>> > /* This structure needs to be aligned cache line size. */
>>> > struct thread_local_abi {
>>> >   int32_t  cpu_id;
>>> >   uint32_t rseq_seqnum;
>>> >   uint64_t rseq_post_commit_ip;
>>> >   /* Add new fields at the end. */
>>> > } __attribute__((packed));
>>>
>>> I would really not use packed; that can lead to horrible layout.
>>>
>>> Suppose someone would add:
>>>
>>>      uint32_t foo;
>>>      uint64_t bar;
>>>
>>> With packed, you get an unaligned uint64_t in there, which is horrible.
>>> Without packed, you get a hole, which you can later fill.
>>
>
> Actually, Peter is wrong about the hole there. On some 32-bit architectures,
> 64-bit integers are aligned on 32-bit, not 64-bit. So there may or may not

... or even on 16-bit.

> be a hole there, and that would lead to a mess.

Indeed.
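
A quick way to see the difference on a given target, following Peter's
foo/bar example (hypothetical field names; the offsets depend on the
target ABI):

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* On x86-64 a 4-byte hole appears before 'bar' (uint64_t is 8-byte
 * aligned), while on e.g. i386 uint64_t is only 4-byte aligned, so
 * there is no hole -- the layouts differ across architectures unless
 * the padding is made explicit. */
struct tla_implicit {
        int32_t  cpu_id;
        uint32_t rseq_seqnum;
        uint32_t foo;
        uint64_t bar;
};

struct tla_explicit {
        int32_t  cpu_id;
        uint32_t rseq_seqnum;
        uint32_t foo;
        uint32_t pad_;          /* explicit padding: same layout everywhere */
        uint64_t bar;
};

int main(void)
{
        printf("implicit bar offset: %zu\n", offsetof(struct tla_implicit, bar));
        printf("explicit bar offset: %zu\n", offsetof(struct tla_explicit, bar));
        return 0;
}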

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 96+ messages in thread

end of thread, other threads:[~2016-03-02 10:44 UTC | newest]

Thread overview: 96+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-23 23:28 [PATCH v4 0/5] getcpu_cache system call for 4.6 Mathieu Desnoyers
2016-02-23 23:28 ` [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread Mathieu Desnoyers
2016-02-23 23:28   ` Mathieu Desnoyers
2016-02-24 11:11   ` Thomas Gleixner
2016-02-24 17:17     ` Mathieu Desnoyers
2016-02-25 23:32     ` Rasmus Villemoes
2016-02-25 23:32       ` Rasmus Villemoes
2016-02-26 17:47       ` Mathieu Desnoyers
2016-02-25  9:56   ` Peter Zijlstra
2016-02-25  9:56     ` Peter Zijlstra
2016-02-25 16:55     ` Mathieu Desnoyers
2016-02-25 16:55       ` Mathieu Desnoyers
2016-02-25 17:04       ` Peter Zijlstra
2016-02-25 17:04         ` Peter Zijlstra
2016-02-25 17:17         ` Mathieu Desnoyers
2016-02-25 17:17           ` Mathieu Desnoyers
2016-02-26 11:33           ` Peter Zijlstra
2016-02-26 11:33             ` Peter Zijlstra
2016-02-26 16:29             ` Thomas Gleixner
2016-02-26 16:29               ` Thomas Gleixner
2016-02-26 17:20               ` Mathieu Desnoyers
2016-02-26 18:01                 ` Thomas Gleixner
2016-02-26 18:01                   ` Thomas Gleixner
2016-02-26 20:24                   ` Mathieu Desnoyers
2016-02-26 20:24                     ` Mathieu Desnoyers
2016-02-26 23:04                     ` H. Peter Anvin
2016-02-27  0:40                       ` Mathieu Desnoyers
2016-02-27  0:40                         ` Mathieu Desnoyers
2016-02-27  6:24                         ` H. Peter Anvin
2016-02-27  6:24                           ` H. Peter Anvin
2016-02-27 14:15                           ` Mathieu Desnoyers
2016-02-27 14:15                             ` Mathieu Desnoyers
2016-02-27 14:58                             ` Peter Zijlstra
2016-02-27 14:58                               ` Peter Zijlstra
2016-02-27 18:35                               ` Linus Torvalds
2016-02-27 18:35                                 ` Linus Torvalds
2016-02-27 19:01                                 ` H. Peter Anvin
2016-02-27 19:01                                   ` H. Peter Anvin
2016-02-27 23:53                                 ` Mathieu Desnoyers
     [not found]                                   ` <CA+55aFwcgwRxvVBz5kk_3O8dESXAGJ4KHBkf=pSXjiS7Xh4NwA@mail.gmail.com>
     [not found]                                     ` <1082926946.10326.1456619994590.JavaMail.zimbra@efficios.com>
2016-02-28  0:57                                       ` Linus Torvalds
2016-02-28  0:57                                         ` Linus Torvalds
2016-02-28 14:32                                         ` Mathieu Desnoyers
2016-02-28 14:32                                           ` Mathieu Desnoyers
2016-02-29 10:35                                           ` Peter Zijlstra
2016-02-29 10:35                                             ` Peter Zijlstra
2016-03-01 20:23                                             ` Mathieu Desnoyers
2016-03-01 20:23                                               ` Mathieu Desnoyers
2016-03-01 21:32                                               ` Peter Zijlstra
2016-03-01 21:32                                                 ` Peter Zijlstra
2016-03-01 21:36                                                 ` Peter Zijlstra
2016-03-01 21:36                                                   ` Peter Zijlstra
2016-03-01 21:47                                                 ` H. Peter Anvin
2016-03-02 10:34                                                   ` Peter Zijlstra
2016-02-29 10:32                                       ` Peter Zijlstra
2016-02-29 10:32                                         ` Peter Zijlstra
2016-02-29 10:39                                         ` Arnd Bergmann
2016-02-29 10:39                                           ` Arnd Bergmann
2016-02-29 12:41                                           ` Mathieu Desnoyers
2016-02-29 12:41                                             ` Mathieu Desnoyers
2016-02-29 13:08                                             ` Arnd Bergmann
2016-02-29 13:08                                               ` Arnd Bergmann
2016-02-29 18:19                                             ` H. Peter Anvin
2016-02-29 18:19                                               ` H. Peter Anvin
2016-03-02 10:44                                             ` Geert Uytterhoeven
2016-03-02 10:44                                               ` Geert Uytterhoeven
2016-03-01 18:25                                       ` H. Peter Anvin
2016-03-01 18:25                                         ` H. Peter Anvin
2016-03-01 18:40                                         ` Mathieu Desnoyers
2016-03-01 18:40                                           ` Mathieu Desnoyers
2016-02-28 13:07                                 ` Geert Uytterhoeven
2016-02-28 13:07                                   ` Geert Uytterhoeven
2016-02-28 16:21                                   ` Linus Torvalds
2016-02-28 16:21                                     ` Linus Torvalds
2016-02-29 10:01                                 ` Peter Zijlstra
2016-02-29 10:01                                   ` Peter Zijlstra
2016-02-27 15:04                             ` H. Peter Anvin
2016-02-27 15:04                               ` H. Peter Anvin
2016-02-23 23:28 ` [PATCH v4 2/5] getcpu_cache: ARM resume notifier Mathieu Desnoyers
2016-02-23 23:28 ` [PATCH v4 3/5] getcpu_cache: wire up ARM system call Mathieu Desnoyers
2016-02-24  0:54   ` kbuild test robot
2016-02-24  0:54     ` kbuild test robot
2016-02-24  1:05   ` [PATCH v4 (updated)] " Mathieu Desnoyers
2016-02-24  1:05     ` Mathieu Desnoyers
2016-02-24  5:28     ` kbuild test robot
2016-02-24  5:28       ` kbuild test robot
2016-02-24  6:54     ` kbuild test robot
2016-02-24  6:54       ` kbuild test robot
2016-02-23 23:28 ` [PATCH v4 4/5] getcpu_cache: x86 32/64 resume notifier Mathieu Desnoyers
2016-02-23 23:28 ` [PATCH v4 5/5] getcpu_cache: wire up x86 32/64 system call Mathieu Desnoyers
2016-02-24  1:36 ` [PATCH v4 0/5] getcpu_cache system call for 4.6 H. Peter Anvin
2016-02-24  1:36   ` H. Peter Anvin
2016-02-24  4:09   ` Mathieu Desnoyers
2016-02-24 20:07     ` H. Peter Anvin
2016-02-24 20:07       ` H. Peter Anvin
2016-02-24 22:38       ` Mathieu Desnoyers
2016-02-24 22:38         ` Mathieu Desnoyers
