* [PATCH v4 0/6] Micro-optimize vclock_gettime
@ 2011-05-16 16:00 Andy Lutomirski
  2011-05-16 16:00 ` [PATCH v4 1/6] x86-64: Clean up vdso/kernel shared variables Andy Lutomirski
                   ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Andy Lutomirski @ 2011-05-16 16:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, Andi Kleen, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Andy Lutomirski

This series speeds up vclock_gettime(CLOCK_MONOTONIC) by almost
30% (tested on Sandy Bridge).  These patches are intended for
2.6.40, and if I'm feeling really ambitious I'll try to shave a few
more ns off for 2.6.41.  (There are lots more optimization
opportunities in there.)
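
For reference, the measurement is just a tight loop over the call; a
minimal userspace sketch (not the actual benchmark harness — iteration
count and output format are arbitrary, and glibc of this era needs
-lrt) looks like:

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec start, end;
	long i, iters = 10 * 1000 * 1000;
	double ns;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++)
		clock_gettime(CLOCK_MONOTONIC, &end);

	/* end holds the timestamp from the final iteration */
	ns = (end.tv_sec - start.tv_sec) * 1e9
		+ (end.tv_nsec - start.tv_nsec);
	printf("%.1f ns/iter\n", ns / iters);
	return 0;
}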

x86-64: Clean up vdso/kernel shared variables

Because vsyscall_gtod_data's address isn't known until load time,
the code contains unnecessary address calculations.  The code is
also rather complicated.  Clean it up and use addresses that are
known at compile time.
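
A condensed sketch of the resulting mechanism (it appears in full in
patch 1/6 below):

/* In the real kernel VSYSCALL_START comes from <asm/vsyscall.h>;
 * it is the fixed vsyscall base on x86-64. */
#define VSYSCALL_START 0xffffffffff600000UL
#define VSYSCALL_VARS_OFFSET (3072 + 128)

/* Each shared variable lives at a fixed, compile-time-known offset
 * in the vsyscall page; vDSO/vsyscall code reads it through a
 * constant pointer, so no load-time fixups are needed. */
#define DECLARE_VVAR(offset, type, name)				\
	static type const * const vvaraddr_ ## name =			\
		(void *)(VSYSCALL_START + VSYSCALL_VARS_OFFSET + (offset));

#define VVAR(name) (*vvaraddr_ ## name)

DECLARE_VVAR(8, int, vgetcpu_mode)
/* ...readers then just use VVAR(vgetcpu_mode). */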

x86-64: Remove unnecessary barrier in vread_tsc

A fair amount of testing on lots of machines has failed to find a
single example in which the barrier *after* rdtsc is needed. So
remove it.  (The manuals give no real justification for it, and
rdtsc has no dependencies so there's no sensible reason for a CPU to
delay it.)
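
In userspace terms, the surviving read looks roughly like this (a
hedged illustration, not the kernel code: rdtsc_barrier() actually
patches in lfence or mfence via alternatives depending on the CPU
vendor):

#include <stdint.h>

static inline uint64_t fenced_rdtsc(void)
{
	uint32_t lo, hi;

	/* One fence *before* rdtsc orders it against earlier loads;
	 * empirically no fence is needed afterwards. */
	asm volatile("lfence" ::: "memory");
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}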

x86-64: Don't generate cmov in vread_tsc

GCC likes to generate a cmov on a branch that's almost completely
predictable.  Force it to generate a real branch instead.
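
The trick (shown in patch 3/6 below) is an empty asm statement on the
unlikely path, which acts as an optimization barrier and defeats
if-conversion:

#include <stdint.h>

static uint64_t clamp_to_last(uint64_t now, uint64_t last)
{
	if (__builtin_expect(now >= last, 1))
		return now;

	asm volatile ("");	/* forces a real branch instead of cmov */
	return last;
}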

x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0

vset_normalized_timespec was more general than necessary.  Open-code
the appropriate normalization loops.  This is a big win for
CLOCK_MONOTONIC_COARSE.
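
Concretely (condensed from patch 4/6 below), the open-coded form only
has to handle the overflow direction, and the coarse clock knows
ns < 2*NSEC_PER_SEC, so a single test replaces the loop:

#define NSEC_PER_SEC 1000000000L

/* ns is guaranteed nonnegative on entry (do_monotonic variant). */
static void normalize_nonneg(long *secs, long *ns)
{
	while (*ns >= NSEC_PER_SEC) {
		*ns -= NSEC_PER_SEC;
		++*secs;
	}
}
/* do_monotonic_coarse replaces the while with a single
 * 'if (ns >= NSEC_PER_SEC)'. */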

x86-64: Move vread_tsc into a new file with sensible options

This way vread_tsc doesn't have a frame pointer, which saves about
0.3ns.  I guess that the CPU's stack frame optimizations aren't quite
as good as I thought.

x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO

We're building the vDSO with optimizations disabled that were meant
for kernel code.  Override that, except for -fno-omit-frame-pointer,
which might make userspace debugging harder.

Changes from v3:
 - Put jiffies and vgetcpu_mode into the same cacheline.  I folded it
   into the vsyscall cleanup patch because it's literally just changing
   a number.  (In theory one more cacheline could be saved by putting
   jiffies and vgetcpu_mode at the end of gtod_data, but that would be
   annoying to maintain and would, I think, have little benefit.)
 - Don't turn off frame pointers in vDSO code.

Changes from v2:
 - Just remove the second barrier instead of hacking it.  Tests
   still pass.

Changes from v1:
 - Redo the vsyscall_gtod_data address patch to make the code
   cleaner instead of uglier and to make it work for all the
   vsyscall variables.
 - Improve the comments for clarity and formatting.
 - Fix up the changelog for the nsec < 0 tweak (the normalization
   code can't be inline because the two callers are different).
 - Move vread_tsc into its own file, removing a GCC version
   dependence and making it more maintainable.


Andy Lutomirski (6):
  x86-64: Clean up vdso/kernel shared variables
  x86-64: Remove unnecessary barrier in vread_tsc
  x86-64: Don't generate cmov in vread_tsc
  x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0
  x86-64: Move vread_tsc into a new file with sensible options
  x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO

 arch/x86/include/asm/tsc.h      |    4 +++
 arch/x86/include/asm/vdso.h     |   14 ----------
 arch/x86/include/asm/vgtod.h    |    2 -
 arch/x86/include/asm/vsyscall.h |   12 +-------
 arch/x86/include/asm/vvar.h     |   52 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/Makefile        |    8 +++--
 arch/x86/kernel/time.c          |    2 +-
 arch/x86/kernel/tsc.c           |   19 --------------
 arch/x86/kernel/vmlinux.lds.S   |   34 ++++++++-----------------
 arch/x86/kernel/vread_tsc_64.c  |   36 +++++++++++++++++++++++++++
 arch/x86/kernel/vsyscall_64.c   |   46 +++++++++++++++-------------------
 arch/x86/vdso/Makefile          |   17 +++++++++++-
 arch/x86/vdso/vclock_gettime.c  |   43 +++++++++++++++++---------------
 arch/x86/vdso/vdso.lds.S        |    7 -----
 arch/x86/vdso/vextern.h         |   16 ------------
 arch/x86/vdso/vgetcpu.c         |    3 +-
 arch/x86/vdso/vma.c             |   27 --------------------
 arch/x86/vdso/vvar.c            |   12 ---------
 18 files changed, 170 insertions(+), 184 deletions(-)
 create mode 100644 arch/x86/include/asm/vvar.h
 create mode 100644 arch/x86/kernel/vread_tsc_64.c
 delete mode 100644 arch/x86/vdso/vextern.h
 delete mode 100644 arch/x86/vdso/vvar.c

-- 
1.7.5.1



* [PATCH v4 1/6] x86-64: Clean up vdso/kernel shared variables
  2011-05-16 16:00 [PATCH v4 0/6] Micro-optimize vclock_gettime Andy Lutomirski
@ 2011-05-16 16:00 ` Andy Lutomirski
  2011-05-16 17:23   ` Borislav Petkov
  2011-05-16 16:00 ` [PATCH v4 2/6] x86-64: Remove unnecessary barrier in vread_tsc Andy Lutomirski
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2011-05-16 16:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, Andi Kleen, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Andy Lutomirski

Variables that are shared between the vdso and the kernel are
currently a bit of a mess.  They are each defined with their own
magic, they are accessed differently in the kernel, the vsyscall page,
and the vdso, and one of them (vsyscall_clock) doesn't even really
exist.

This changes them all to use a common mechanism.  All of them are
declared in vvar.h with a fixed address (validated by the linker
script).  In the kernel (as before), they look like ordinary
read-write variables.  In the vsyscall page and the vdso, they are
accessed through a new macro VVAR, which gives read-only access.

The vdso is now loaded verbatim into memory without any fixups.  As a
side bonus, access from the vdso is faster because a level of
indirection is removed.

While we're at it, pack jiffies and vgetcpu_mode into the same
cacheline.
---
 arch/x86/include/asm/vdso.h     |   14 ----------
 arch/x86/include/asm/vgtod.h    |    2 -
 arch/x86/include/asm/vsyscall.h |   12 +-------
 arch/x86/include/asm/vvar.h     |   52 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/time.c          |    2 +-
 arch/x86/kernel/tsc.c           |    4 +-
 arch/x86/kernel/vmlinux.lds.S   |   34 ++++++++-----------------
 arch/x86/kernel/vsyscall_64.c   |   46 +++++++++++++++-------------------
 arch/x86/vdso/Makefile          |    2 +-
 arch/x86/vdso/vclock_gettime.c  |    3 +-
 arch/x86/vdso/vdso.lds.S        |    7 -----
 arch/x86/vdso/vextern.h         |   16 ------------
 arch/x86/vdso/vgetcpu.c         |    3 +-
 arch/x86/vdso/vma.c             |   27 --------------------
 arch/x86/vdso/vvar.c            |   12 ---------
 15 files changed, 91 insertions(+), 145 deletions(-)
 create mode 100644 arch/x86/include/asm/vvar.h
 delete mode 100644 arch/x86/vdso/vextern.h
 delete mode 100644 arch/x86/vdso/vvar.c

diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 9064052..bb05228 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -1,20 +1,6 @@
 #ifndef _ASM_X86_VDSO_H
 #define _ASM_X86_VDSO_H
 
-#ifdef CONFIG_X86_64
-extern const char VDSO64_PRELINK[];
-
-/*
- * Given a pointer to the vDSO image, find the pointer to VDSO64_name
- * as that symbol is defined in the vDSO sources or linker script.
- */
-#define VDSO64_SYMBOL(base, name)					\
-({									\
-	extern const char VDSO64_##name[];				\
-	(void *)(VDSO64_##name - VDSO64_PRELINK + (unsigned long)(base)); \
-})
-#endif
-
 #if defined CONFIG_X86_32 || defined CONFIG_COMPAT
 extern const char VDSO32_PRELINK[];
 
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index 3d61e20..646b4c1 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -23,8 +23,6 @@ struct vsyscall_gtod_data {
 	struct timespec wall_to_monotonic;
 	struct timespec wall_time_coarse;
 };
-extern struct vsyscall_gtod_data __vsyscall_gtod_data
-__section_vsyscall_gtod_data;
 extern struct vsyscall_gtod_data vsyscall_gtod_data;
 
 #endif /* _ASM_X86_VGTOD_H */
diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index d0983d2..d555973 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -16,27 +16,19 @@ enum vsyscall_num {
 #ifdef __KERNEL__
 #include <linux/seqlock.h>
 
-#define __section_vgetcpu_mode __attribute__ ((unused, __section__ (".vgetcpu_mode"), aligned(16)))
-#define __section_jiffies __attribute__ ((unused, __section__ (".jiffies"), aligned(16)))
-
 /* Definitions for CONFIG_GENERIC_TIME definitions */
-#define __section_vsyscall_gtod_data __attribute__ \
-	((unused, __section__ (".vsyscall_gtod_data"),aligned(16)))
-#define __section_vsyscall_clock __attribute__ \
-	((unused, __section__ (".vsyscall_clock"),aligned(16)))
 #define __vsyscall_fn \
 	__attribute__ ((unused, __section__(".vsyscall_fn"))) notrace
 
 #define VGETCPU_RDTSCP	1
 #define VGETCPU_LSL	2
 
-extern int __vgetcpu_mode;
-extern volatile unsigned long __jiffies;
-
 /* kernel space (writeable) */
 extern int vgetcpu_mode;
 extern struct timezone sys_tz;
 
+#include <asm/vvar.h>
+
 extern void map_vsyscall(void);
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
new file mode 100644
index 0000000..341b355
--- /dev/null
+++ b/arch/x86/include/asm/vvar.h
@@ -0,0 +1,52 @@
+/*
+ * vvar.h: Shared vDSO/kernel variable declarations
+ * Copyright (c) 2011 Andy Lutomirski
+ * Subject to the GNU General Public License, version 2
+ *
+ * A handful of variables are accessible (read-only) from userspace
+ * code in the vsyscall page and the vdso.  They are declared here.
+ * Some other file must define them with DEFINE_VVAR.
+ *
+ * In normal kernel code, they are used like any other variable.
+ * In user code, they are accessed through the VVAR macro.
+ *
+ * Each of these variables lives in the vsyscall page, and each
+ * one needs a unique offset within the little piece of the page
+ * reserved for vvars.  Specify that offset in DECLARE_VVAR.
+ * (There are 896 bytes available.  If you mess up, the linker will
+ * catch it.)
+ */
+
+/* Offset of vars within vsyscall page */
+#define VSYSCALL_VARS_OFFSET (3072 + 128)
+
+#if defined(__VVAR_KERNEL_LDS)
+
+/* The kernel linker script defines its own magic to put vvars in the
+ * right place.
+ */
+#define DECLARE_VVAR(offset, type, name) \
+	EMIT_VVAR(name, VSYSCALL_VARS_OFFSET + offset)
+
+#else
+
+#define DECLARE_VVAR(offset, type, name)				\
+	static type const * const vvaraddr_ ## name =			\
+		(void *)(VSYSCALL_START + VSYSCALL_VARS_OFFSET + (offset));
+
+#define DEFINE_VVAR(type, name)						\
+	type __vvar_ ## name						\
+	__attribute__((section(".vsyscall_var_" #name), aligned(16)))
+
+#define VVAR(name) (*vvaraddr_ ## name)
+
+#endif
+
+/* DECLARE_VVAR(offset, type, name) */
+
+DECLARE_VVAR(0, volatile unsigned long, jiffies)
+DECLARE_VVAR(8, int, vgetcpu_mode)
+DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
+
+#undef DECLARE_VVAR
+#undef VSYSCALL_VARS_OFFSET
diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
index 25a28a2..00cbb27 100644
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -23,7 +23,7 @@
 #include <asm/time.h>
 
 #ifdef CONFIG_X86_64
-volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES;
+DEFINE_VVAR(volatile unsigned long, jiffies) = INITIAL_JIFFIES;
 #endif
 
 unsigned long profile_pc(struct pt_regs *regs)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index ffe5755..bc46566 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -777,8 +777,8 @@ static cycle_t __vsyscall_fn vread_tsc(void)
 	ret = (cycle_t)vget_cycles();
 	rdtsc_barrier();
 
-	return ret >= __vsyscall_gtod_data.clock.cycle_last ?
-		ret : __vsyscall_gtod_data.clock.cycle_last;
+	return ret >= VVAR(vsyscall_gtod_data).clock.cycle_last ?
+		ret : VVAR(vsyscall_gtod_data).clock.cycle_last;
 }
 #endif
 
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index bf47007..bbe6cc2 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -160,6 +160,12 @@ SECTIONS
 
 #define VVIRT_OFFSET (VSYSCALL_ADDR - __vsyscall_0)
 #define VVIRT(x) (ADDR(x) - VVIRT_OFFSET)
+#define EMIT_VVAR(x, offset) .vsyscall_var_ ## x	\
+	ADDR(.vsyscall_0) + offset		 	\
+	: AT(VLOAD(.vsyscall_var_ ## x)) {     		\
+		*(.vsyscall_var_ ## x)			\
+	}						\
+	x = VVIRT(.vsyscall_var_ ## x);
 
 	. = ALIGN(4096);
 	__vsyscall_0 = .;
@@ -174,18 +180,6 @@ SECTIONS
 		*(.vsyscall_fn)
 	}
 
-	. = ALIGN(L1_CACHE_BYTES);
-	.vsyscall_gtod_data : AT(VLOAD(.vsyscall_gtod_data)) {
-		*(.vsyscall_gtod_data)
-	}
-
-	vsyscall_gtod_data = VVIRT(.vsyscall_gtod_data);
-	.vsyscall_clock : AT(VLOAD(.vsyscall_clock)) {
-		*(.vsyscall_clock)
-	}
-	vsyscall_clock = VVIRT(.vsyscall_clock);
-
-
 	.vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) {
 		*(.vsyscall_1)
 	}
@@ -193,21 +187,14 @@ SECTIONS
 		*(.vsyscall_2)
 	}
 
-	.vgetcpu_mode : AT(VLOAD(.vgetcpu_mode)) {
-		*(.vgetcpu_mode)
-	}
-	vgetcpu_mode = VVIRT(.vgetcpu_mode);
-
-	. = ALIGN(L1_CACHE_BYTES);
-	.jiffies : AT(VLOAD(.jiffies)) {
-		*(.jiffies)
-	}
-	jiffies = VVIRT(.jiffies);
-
 	.vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) {
 		*(.vsyscall_3)
 	}
 
+#define __VVAR_KERNEL_LDS
+#include <asm/vvar.h>
+#undef __VVAR_KERNEL_LDS
+
 	. = __vsyscall_0 + PAGE_SIZE;
 
 #undef VSYSCALL_ADDR
@@ -215,6 +202,7 @@ SECTIONS
 #undef VLOAD
 #undef VVIRT_OFFSET
 #undef VVIRT
+#undef EMIT_VVAR
 
 #endif /* CONFIG_X86_64 */
 
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index dcbb28c..5f6ad03 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -49,15 +49,8 @@
 		__attribute__ ((unused, __section__(".vsyscall_" #nr))) notrace
 #define __syscall_clobber "r11","cx","memory"
 
-/*
- * vsyscall_gtod_data contains data that is :
- * - readonly from vsyscalls
- * - written by timer interrupt or systcl (/proc/sys/kernel/vsyscall64)
- * Try to keep this structure as small as possible to avoid cache line ping pongs
- */
-int __vgetcpu_mode __section_vgetcpu_mode;
-
-struct vsyscall_gtod_data __vsyscall_gtod_data __section_vsyscall_gtod_data =
+DEFINE_VVAR(int, vgetcpu_mode);
+DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data) =
 {
 	.lock = SEQLOCK_UNLOCKED,
 	.sysctl_enabled = 1,
@@ -97,7 +90,7 @@ void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
  */
 static __always_inline void do_get_tz(struct timezone * tz)
 {
-	*tz = __vsyscall_gtod_data.sys_tz;
+	*tz = VVAR(vsyscall_gtod_data).sys_tz;
 }
 
 static __always_inline int gettimeofday(struct timeval *tv, struct timezone *tz)
@@ -126,23 +119,24 @@ static __always_inline void do_vgettimeofday(struct timeval * tv)
 	unsigned long mult, shift, nsec;
 	cycle_t (*vread)(void);
 	do {
-		seq = read_seqbegin(&__vsyscall_gtod_data.lock);
+		seq = read_seqbegin(&VVAR(vsyscall_gtod_data).lock);
 
-		vread = __vsyscall_gtod_data.clock.vread;
-		if (unlikely(!__vsyscall_gtod_data.sysctl_enabled || !vread)) {
+		vread = VVAR(vsyscall_gtod_data).clock.vread;
+		if (unlikely(!VVAR(vsyscall_gtod_data).sysctl_enabled ||
+			     !vread)) {
 			gettimeofday(tv,NULL);
 			return;
 		}
 
 		now = vread();
-		base = __vsyscall_gtod_data.clock.cycle_last;
-		mask = __vsyscall_gtod_data.clock.mask;
-		mult = __vsyscall_gtod_data.clock.mult;
-		shift = __vsyscall_gtod_data.clock.shift;
+		base = VVAR(vsyscall_gtod_data).clock.cycle_last;
+		mask = VVAR(vsyscall_gtod_data).clock.mask;
+		mult = VVAR(vsyscall_gtod_data).clock.mult;
+		shift = VVAR(vsyscall_gtod_data).clock.shift;
 
-		tv->tv_sec = __vsyscall_gtod_data.wall_time_sec;
-		nsec = __vsyscall_gtod_data.wall_time_nsec;
-	} while (read_seqretry(&__vsyscall_gtod_data.lock, seq));
+		tv->tv_sec = VVAR(vsyscall_gtod_data).wall_time_sec;
+		nsec = VVAR(vsyscall_gtod_data).wall_time_nsec;
+	} while (read_seqretry(&VVAR(vsyscall_gtod_data).lock, seq));
 
 	/* calculate interval: */
 	cycle_delta = (now - base) & mask;
@@ -171,15 +165,15 @@ time_t __vsyscall(1) vtime(time_t *t)
 {
 	unsigned seq;
 	time_t result;
-	if (unlikely(!__vsyscall_gtod_data.sysctl_enabled))
+	if (unlikely(!VVAR(vsyscall_gtod_data).sysctl_enabled))
 		return time_syscall(t);
 
 	do {
-		seq = read_seqbegin(&__vsyscall_gtod_data.lock);
+		seq = read_seqbegin(&VVAR(vsyscall_gtod_data).lock);
 
-		result = __vsyscall_gtod_data.wall_time_sec;
+		result = VVAR(vsyscall_gtod_data).wall_time_sec;
 
-	} while (read_seqretry(&__vsyscall_gtod_data.lock, seq));
+	} while (read_seqretry(&VVAR(vsyscall_gtod_data).lock, seq));
 
 	if (t)
 		*t = result;
@@ -208,9 +202,9 @@ vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
 	   We do this here because otherwise user space would do it on
 	   its own in a likely inferior way (no access to jiffies).
 	   If you don't like it pass NULL. */
-	if (tcache && tcache->blob[0] == (j = __jiffies)) {
+	if (tcache && tcache->blob[0] == (j = VVAR(jiffies))) {
 		p = tcache->blob[1];
-	} else if (__vgetcpu_mode == VGETCPU_RDTSCP) {
+	} else if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
 		/* Load per CPU data from RDTSCP */
 		native_read_tscp(&p);
 	} else {
diff --git a/arch/x86/vdso/Makefile b/arch/x86/vdso/Makefile
index b6552b1..a651861 100644
--- a/arch/x86/vdso/Makefile
+++ b/arch/x86/vdso/Makefile
@@ -11,7 +11,7 @@ vdso-install-$(VDSO32-y)	+= $(vdso32-images)
 
 
 # files to link into the vdso
-vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vvar.o
+vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o
 
 # files to link into kernel
 obj-$(VDSO64-y)			+= vma.o vdso.o
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index ee55754..0b873d4 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -22,9 +22,8 @@
 #include <asm/hpet.h>
 #include <asm/unistd.h>
 #include <asm/io.h>
-#include "vextern.h"
 
-#define gtod vdso_vsyscall_gtod_data
+#define gtod (&VVAR(vsyscall_gtod_data))
 
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
diff --git a/arch/x86/vdso/vdso.lds.S b/arch/x86/vdso/vdso.lds.S
index 4e5dd3b..81f2500 100644
--- a/arch/x86/vdso/vdso.lds.S
+++ b/arch/x86/vdso/vdso.lds.S
@@ -28,10 +28,3 @@ VERSION {
 }
 
 VDSO64_PRELINK = VDSO_PRELINK;
-
-/*
- * Define VDSO64_x for each VEXTERN(x), for use via VDSO64_SYMBOL.
- */
-#define VEXTERN(x)	VDSO64_ ## x = vdso_ ## x;
-#include "vextern.h"
-#undef	VEXTERN
diff --git a/arch/x86/vdso/vextern.h b/arch/x86/vdso/vextern.h
deleted file mode 100644
index 1683ba2..0000000
--- a/arch/x86/vdso/vextern.h
+++ /dev/null
@@ -1,16 +0,0 @@
-#ifndef VEXTERN
-#include <asm/vsyscall.h>
-#define VEXTERN(x) \
-	extern typeof(x) *vdso_ ## x __attribute__((visibility("hidden")));
-#endif
-
-#define VMAGIC 0xfeedbabeabcdefabUL
-
-/* Any kernel variables used in the vDSO must be exported in the main
-   kernel's vmlinux.lds.S/vsyscall.h/proper __section and
-   put into vextern.h and be referenced as a pointer with vdso prefix.
-   The main kernel later fills in the values.   */
-
-VEXTERN(jiffies)
-VEXTERN(vgetcpu_mode)
-VEXTERN(vsyscall_gtod_data)
diff --git a/arch/x86/vdso/vgetcpu.c b/arch/x86/vdso/vgetcpu.c
index 9fbc6b2..5463ad5 100644
--- a/arch/x86/vdso/vgetcpu.c
+++ b/arch/x86/vdso/vgetcpu.c
@@ -11,14 +11,13 @@
 #include <linux/time.h>
 #include <asm/vsyscall.h>
 #include <asm/vgtod.h>
-#include "vextern.h"
 
 notrace long
 __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
 {
 	unsigned int p;
 
-	if (*vdso_vgetcpu_mode == VGETCPU_RDTSCP) {
+	if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
 		/* Load per CPU data from RDTSCP */
 		native_read_tscp(&p);
 	} else {
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 4b5d26f..7abd2be 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -15,9 +15,6 @@
 #include <asm/proto.h>
 #include <asm/vdso.h>
 
-#include "vextern.h"		/* Just for VMAGIC.  */
-#undef VEXTERN
-
 unsigned int __read_mostly vdso_enabled = 1;
 
 extern char vdso_start[], vdso_end[];
@@ -26,20 +23,10 @@ extern unsigned short vdso_sync_cpuid;
 static struct page **vdso_pages;
 static unsigned vdso_size;
 
-static inline void *var_ref(void *p, char *name)
-{
-	if (*(void **)p != (void *)VMAGIC) {
-		printk("VDSO: variable %s broken\n", name);
-		vdso_enabled = 0;
-	}
-	return p;
-}
-
 static int __init init_vdso_vars(void)
 {
 	int npages = (vdso_end - vdso_start + PAGE_SIZE - 1) / PAGE_SIZE;
 	int i;
-	char *vbase;
 
 	vdso_size = npages << PAGE_SHIFT;
 	vdso_pages = kmalloc(sizeof(struct page *) * npages, GFP_KERNEL);
@@ -54,20 +41,6 @@ static int __init init_vdso_vars(void)
 		copy_page(page_address(p), vdso_start + i*PAGE_SIZE);
 	}
 
-	vbase = vmap(vdso_pages, npages, 0, PAGE_KERNEL);
-	if (!vbase)
-		goto oom;
-
-	if (memcmp(vbase, "\177ELF", 4)) {
-		printk("VDSO: I'm broken; not ELF\n");
-		vdso_enabled = 0;
-	}
-
-#define VEXTERN(x) \
-	*(typeof(__ ## x) **) var_ref(VDSO64_SYMBOL(vbase, x), #x) = &__ ## x;
-#include "vextern.h"
-#undef VEXTERN
-	vunmap(vbase);
 	return 0;
 
  oom:
diff --git a/arch/x86/vdso/vvar.c b/arch/x86/vdso/vvar.c
deleted file mode 100644
index 1b7e703..0000000
--- a/arch/x86/vdso/vvar.c
+++ /dev/null
@@ -1,12 +0,0 @@
-/* Define pointer to external vDSO variables.
-   These are part of the vDSO. The kernel fills in the real addresses
-   at boot time. This is done because when the vdso is linked the
-   kernel isn't yet and we don't know the final addresses. */
-#include <linux/kernel.h>
-#include <linux/time.h>
-#include <asm/vsyscall.h>
-#include <asm/timex.h>
-#include <asm/vgtod.h>
-
-#define VEXTERN(x) typeof (__ ## x) *const vdso_ ## x = (void *)VMAGIC;
-#include "vextern.h"
-- 
1.7.5.1



* [PATCH v4 2/6] x86-64: Remove unnecessary barrier in vread_tsc
  2011-05-16 16:00 [PATCH v4 0/6] Micro-optimize vclock_gettime Andy Lutomirski
  2011-05-16 16:00 ` [PATCH v4 1/6] x86-64: Clean up vdso/kernel shared variables Andy Lutomirski
@ 2011-05-16 16:00 ` Andy Lutomirski
  2011-05-16 16:01 ` [PATCH v4 3/6] x86-64: Don't generate cmov " Andy Lutomirski
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2011-05-16 16:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, Andi Kleen, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Andy Lutomirski

RDTSC is completely unordered on modern Intel and AMD CPUs.  The
Intel manual says that lfence;rdtsc causes all previous instructions
to complete before the tsc is read, and the AMD manual says to use
mfence;rdtsc to do the same thing.

From a decent amount of testing [1] this is enough to make rdtsc
be ordered with respect to subsequent loads across a wide variety
of CPUs.

On Sandy Bridge (i7-2600), this improves a loop of
clock_gettime(CLOCK_MONOTONIC) by more than 5 ns/iter.

[1] https://lkml.org/lkml/2011/4/18/350

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/kernel/tsc.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index bc46566..7cabdae 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -769,13 +769,14 @@ static cycle_t __vsyscall_fn vread_tsc(void)
 	cycle_t ret;
 
 	/*
-	 * Surround the RDTSC by barriers, to make sure it's not
-	 * speculated to outside the seqlock critical section and
-	 * does not cause time warps:
+	 * Empirically, a fence (of type that depends on the CPU)
+	 * before rdtsc is enough to ensure that rdtsc is ordered
+	 * with respect to loads.  The various CPU manuals are unclear
+	 * as to whether rdtsc can be reordered with later loads,
+	 * but no one has ever seen it happen.
 	 */
 	rdtsc_barrier();
 	ret = (cycle_t)vget_cycles();
-	rdtsc_barrier();
 
 	return ret >= VVAR(vsyscall_gtod_data).clock.cycle_last ?
 		ret : VVAR(vsyscall_gtod_data).clock.cycle_last;
-- 
1.7.5.1



* [PATCH v4 3/6] x86-64: Don't generate cmov in vread_tsc
  2011-05-16 16:00 [PATCH v4 0/6] Micro-optimize vclock_gettime Andy Lutomirski
  2011-05-16 16:00 ` [PATCH v4 1/6] x86-64: Clean up vdso/kernel shared variables Andy Lutomirski
  2011-05-16 16:00 ` [PATCH v4 2/6] x86-64: Remove unnecessary barrier in vread_tsc Andy Lutomirski
@ 2011-05-16 16:01 ` Andy Lutomirski
  2011-05-16 16:01 ` [PATCH v4 4/6] x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0 Andy Lutomirski
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2011-05-16 16:01 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, Andi Kleen, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Andy Lutomirski

vread_tsc checks whether rdtsc returns something less than
cycle_last, which is an extremely predictable branch.  GCC likes
to generate a cmov anyway, which is several cycles slower than
a predicted branch.  This saves a couple of nanoseconds.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/kernel/tsc.c |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 7cabdae..db4c6e6 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -767,6 +767,7 @@ static cycle_t read_tsc(struct clocksource *cs)
 static cycle_t __vsyscall_fn vread_tsc(void)
 {
 	cycle_t ret;
+	u64 last;
 
 	/*
 	 * Empirically, a fence (of type that depends on the CPU)
@@ -778,8 +779,21 @@ static cycle_t __vsyscall_fn vread_tsc(void)
 	rdtsc_barrier();
 	ret = (cycle_t)vget_cycles();
 
-	return ret >= VVAR(vsyscall_gtod_data).clock.cycle_last ?
-		ret : VVAR(vsyscall_gtod_data).clock.cycle_last;
+	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
+
+	if (likely(ret >= last))
+		return ret;
+
+	/*
+	 * GCC likes to generate cmov here, but this branch is extremely
+	 * predictable (it's just a function of time and the likely is
+	 * very likely) and there's a data dependence, so force GCC
+	 * to generate a branch instead.  I don't barrier() because
+	 * we don't actually need a barrier, and if this function
+	 * ever gets inlined it will generate worse code.
+	 */
+	asm volatile ("");
+	return last;
 }
 #endif
 
-- 
1.7.5.1



* [PATCH v4 4/6] x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0
  2011-05-16 16:00 [PATCH v4 0/6] Micro-optimize vclock_gettime Andy Lutomirski
                   ` (2 preceding siblings ...)
  2011-05-16 16:01 ` [PATCH v4 3/6] x86-64: Don't generate cmov " Andy Lutomirski
@ 2011-05-16 16:01 ` Andy Lutomirski
  2011-05-16 16:01 ` [PATCH v4 5/6] x86-64: Move vread_tsc into a new file with sensible options Andy Lutomirski
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2011-05-16 16:01 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, Andi Kleen, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Andy Lutomirski

vclock_gettime's do_monotonic helper can't ever generate a negative
nsec value, so it doesn't need to check whether it's negative.  In
the CLOCK_MONOTONIC_COARSE case, ns can't ever exceed 2e9-1, so we
can avoid the loop entirely.  This saves a single easily-predicted
branch.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/vdso/vclock_gettime.c |   40 ++++++++++++++++++++++------------------
 1 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 0b873d4..28b2c00 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -55,22 +55,6 @@ notrace static noinline int do_realtime(struct timespec *ts)
 	return 0;
 }
 
-/* Copy of the version in kernel/time.c which we cannot directly access */
-notrace static void
-vset_normalized_timespec(struct timespec *ts, long sec, long nsec)
-{
-	while (nsec >= NSEC_PER_SEC) {
-		nsec -= NSEC_PER_SEC;
-		++sec;
-	}
-	while (nsec < 0) {
-		nsec += NSEC_PER_SEC;
-		--sec;
-	}
-	ts->tv_sec = sec;
-	ts->tv_nsec = nsec;
-}
-
 notrace static noinline int do_monotonic(struct timespec *ts)
 {
 	unsigned long seq, ns, secs;
@@ -81,7 +65,17 @@ notrace static noinline int do_monotonic(struct timespec *ts)
 		secs += gtod->wall_to_monotonic.tv_sec;
 		ns += gtod->wall_to_monotonic.tv_nsec;
 	} while (unlikely(read_seqretry(&gtod->lock, seq)));
-	vset_normalized_timespec(ts, secs, ns);
+
+	/* wall_time_nsec, vgetns(), and wall_to_monotonic.tv_nsec
+	 * are all guaranteed to be nonnegative.
+	 */
+	while (ns >= NSEC_PER_SEC) {
+		ns -= NSEC_PER_SEC;
+		++secs;
+	}
+	ts->tv_sec = secs;
+	ts->tv_nsec = ns;
+
 	return 0;
 }
 
@@ -106,7 +100,17 @@ notrace static noinline int do_monotonic_coarse(struct timespec *ts)
 		secs += gtod->wall_to_monotonic.tv_sec;
 		ns += gtod->wall_to_monotonic.tv_nsec;
 	} while (unlikely(read_seqretry(&gtod->lock, seq)));
-	vset_normalized_timespec(ts, secs, ns);
+
+	/* wall_time_nsec and wall_to_monotonic.tv_nsec are
+	 * guaranteed to be between 0 and NSEC_PER_SEC.
+	 */
+	if (ns >= NSEC_PER_SEC) {
+		ns -= NSEC_PER_SEC;
+		++secs;
+	}
+	ts->tv_sec = secs;
+	ts->tv_nsec = ns;
+
 	return 0;
 }
 
-- 
1.7.5.1



* [PATCH v4 5/6] x86-64: Move vread_tsc into a new file with sensible options
  2011-05-16 16:00 [PATCH v4 0/6] Micro-optimize vclock_gettime Andy Lutomirski
                   ` (3 preceding siblings ...)
  2011-05-16 16:01 ` [PATCH v4 4/6] x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0 Andy Lutomirski
@ 2011-05-16 16:01 ` Andy Lutomirski
  2011-05-16 16:01 ` [PATCH v4 6/6] x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO Andy Lutomirski
  2011-05-16 16:09 ` [PATCH v4 0/6] Micro-optimize vclock_gettime Andi Kleen
  6 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2011-05-16 16:01 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, Andi Kleen, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Andy Lutomirski

vread_tsc is short and hot, and it's userspace code so the usual
reasons to enable -pg and turn off sibling calls don't apply.

(OK, turning off sibling calls has no effect.  But it might
someday...)

As an added benefit, tsc.c is profilable now.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/include/asm/tsc.h     |    4 ++++
 arch/x86/kernel/Makefile       |    8 +++++---
 arch/x86/kernel/tsc.c          |   34 ----------------------------------
 arch/x86/kernel/vread_tsc_64.c |   36 ++++++++++++++++++++++++++++++++++++
 4 files changed, 45 insertions(+), 37 deletions(-)
 create mode 100644 arch/x86/kernel/vread_tsc_64.c

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 1ca132f..8f2b3c6 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -51,6 +51,10 @@ extern int unsynchronized_tsc(void);
 extern int check_tsc_unstable(void);
 extern unsigned long native_calibrate_tsc(void);
 
+#ifdef CONFIG_X86_64
+extern cycles_t vread_tsc(void);
+#endif
+
 /*
  * Boot-time check whether the TSCs are synchronized across
  * all CPUs/cores:
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 34244b2..7ea1aa2 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -8,7 +8,6 @@ CPPFLAGS_vmlinux.lds += -U$(UTS_MACHINE)
 
 ifdef CONFIG_FUNCTION_TRACER
 # Do not profile debug and lowlevel utilities
-CFLAGS_REMOVE_tsc.o = -pg
 CFLAGS_REMOVE_rtc.o = -pg
 CFLAGS_REMOVE_paravirt-spinlocks.o = -pg
 CFLAGS_REMOVE_pvclock.o = -pg
@@ -24,13 +23,16 @@ endif
 nostackp := $(call cc-option, -fno-stack-protector)
 CFLAGS_vsyscall_64.o	:= $(PROFILING) -g0 $(nostackp)
 CFLAGS_hpet.o		:= $(nostackp)
-CFLAGS_tsc.o		:= $(nostackp)
+CFLAGS_vread_tsc_64.o	:= $(nostackp)
 CFLAGS_paravirt.o	:= $(nostackp)
 GCOV_PROFILE_vsyscall_64.o	:= n
 GCOV_PROFILE_hpet.o		:= n
 GCOV_PROFILE_tsc.o		:= n
 GCOV_PROFILE_paravirt.o		:= n
 
+# vread_tsc_64 is hot and should be fully optimized:
+CFLAGS_REMOVE_vread_tsc_64.o = -pg -fno-optimize-sibling-calls
+
 obj-y			:= process_$(BITS).o signal.o entry_$(BITS).o
 obj-y			+= traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
 obj-y			+= time.o ioport.o ldt.o dumpstack.o
@@ -39,7 +41,7 @@ obj-$(CONFIG_IRQ_WORK)  += irq_work.o
 obj-$(CONFIG_X86_32)	+= probe_roms_32.o
 obj-$(CONFIG_X86_32)	+= sys_i386_32.o i386_ksyms_32.o
 obj-$(CONFIG_X86_64)	+= sys_x86_64.o x8664_ksyms_64.o
-obj-$(CONFIG_X86_64)	+= syscall_64.o vsyscall_64.o
+obj-$(CONFIG_X86_64)	+= syscall_64.o vsyscall_64.o vread_tsc_64.o
 obj-y			+= bootflag.o e820.o
 obj-y			+= pci-dma.o quirks.o i8237.o topology.o kdebugfs.o
 obj-y			+= alternative.o i8253.o pci-nommu.o hw_breakpoint.o
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index db4c6e6..5346381 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -763,40 +763,6 @@ static cycle_t read_tsc(struct clocksource *cs)
 		ret : clocksource_tsc.cycle_last;
 }
 
-#ifdef CONFIG_X86_64
-static cycle_t __vsyscall_fn vread_tsc(void)
-{
-	cycle_t ret;
-	u64 last;
-
-	/*
-	 * Empirically, a fence (of type that depends on the CPU)
-	 * before rdtsc is enough to ensure that rdtsc is ordered
-	 * with respect to loads.  The various CPU manuals are unclear
-	 * as to whether rdtsc can be reordered with later loads,
-	 * but no one has ever seen it happen.
-	 */
-	rdtsc_barrier();
-	ret = (cycle_t)vget_cycles();
-
-	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
-
-	if (likely(ret >= last))
-		return ret;
-
-	/*
-	 * GCC likes to generate cmov here, but this branch is extremely
-	 * predictable (it's just a function of time and the likely is
-	 * very likely) and there's a data dependence, so force GCC
-	 * to generate a branch instead.  I don't barrier() because
-	 * we don't actually need a barrier, and if this function
-	 * ever gets inlined it will generate worse code.
-	 */
-	asm volatile ("");
-	return last;
-}
-#endif
-
 static void resume_tsc(struct clocksource *cs)
 {
 	clocksource_tsc.cycle_last = 0;
diff --git a/arch/x86/kernel/vread_tsc_64.c b/arch/x86/kernel/vread_tsc_64.c
new file mode 100644
index 0000000..a81aa9e
--- /dev/null
+++ b/arch/x86/kernel/vread_tsc_64.c
@@ -0,0 +1,36 @@
+/* This code runs in userspace. */
+
+#define DISABLE_BRANCH_PROFILING
+#include <asm/vgtod.h>
+
+notrace cycle_t __vsyscall_fn vread_tsc(void)
+{
+	cycle_t ret;
+	u64 last;
+
+	/*
+	 * Empirically, a fence (of type that depends on the CPU)
+	 * before rdtsc is enough to ensure that rdtsc is ordered
+	 * with respect to loads.  The various CPU manuals are unclear
+	 * as to whether rdtsc can be reordered with later loads,
+	 * but no one has ever seen it happen.
+	 */
+	rdtsc_barrier();
+	ret = (cycle_t)vget_cycles();
+
+	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
+
+	if (likely(ret >= last))
+		return ret;
+
+	/*
+	 * GCC likes to generate cmov here, but this branch is extremely
+	 * predictable (it's just a function of time and the likely is
+	 * very likely) and there's a data dependence, so force GCC
+	 * to generate a branch instead.  I don't barrier() because
+	 * we don't actually need a barrier, and if this function
+	 * ever gets inlined it will generate worse code.
+	 */
+	asm volatile ("");
+	return last;
+}
-- 
1.7.5.1



* [PATCH v4 6/6] x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO
  2011-05-16 16:00 [PATCH v4 0/6] Micro-optimize vclock_gettime Andy Lutomirski
                   ` (4 preceding siblings ...)
  2011-05-16 16:01 ` [PATCH v4 5/6] x86-64: Move vread_tsc into a new file with sensible options Andy Lutomirski
@ 2011-05-16 16:01 ` Andy Lutomirski
  2011-05-16 16:09 ` [PATCH v4 0/6] Micro-optimize vclock_gettime Andi Kleen
  6 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2011-05-16 16:01 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, Andi Kleen, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Andy Lutomirski

The vDSO isn't part of the kernel, so profiling and kernel
backtraces don't really matter.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/vdso/Makefile |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/arch/x86/vdso/Makefile b/arch/x86/vdso/Makefile
index a651861..bef0bc9 100644
--- a/arch/x86/vdso/Makefile
+++ b/arch/x86/vdso/Makefile
@@ -37,11 +37,24 @@ $(obj)/%.so: OBJCOPYFLAGS := -S
 $(obj)/%.so: $(obj)/%.so.dbg FORCE
 	$(call if_changed,objcopy)
 
+#
+# Don't omit frame pointers for ease of userspace debugging, but do
+# optimize sibling calls.
+#
 CFL := $(PROFILING) -mcmodel=small -fPIC -O2 -fasynchronous-unwind-tables -m64 \
-       $(filter -g%,$(KBUILD_CFLAGS)) $(call cc-option, -fno-stack-protector)
+       $(filter -g%,$(KBUILD_CFLAGS)) $(call cc-option, -fno-stack-protector) \
+       -fno-omit-frame-pointer -foptimize-sibling-calls
 
 $(vobjs): KBUILD_CFLAGS += $(CFL)
 
+#
+# vDSO code runs in userspace and -pg doesn't help with profiling anyway.
+#
+CFLAGS_REMOVE_vdso-note.o = -pg
+CFLAGS_REMOVE_vclock_gettime.o = -pg
+CFLAGS_REMOVE_vgetcpu.o = -pg
+CFLAGS_REMOVE_vvar.o = -pg
+
 targets += vdso-syms.lds
 obj-$(VDSO64-y)			+= vdso-syms.lds
 
-- 
1.7.5.1



* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-16 16:00 [PATCH v4 0/6] Micro-optimize vclock_gettime Andy Lutomirski
                   ` (5 preceding siblings ...)
  2011-05-16 16:01 ` [PATCH v4 6/6] x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO Andy Lutomirski
@ 2011-05-16 16:09 ` Andi Kleen
  2011-05-16 16:25   ` Thomas Gleixner
  6 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2011-05-16 16:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Ingo Molnar, Andi Kleen, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov

>  - Don't turn off frame pointers in vDSO code.

Why? Note frame pointers are costly on some CPUs, like Atom.

Also unlike the primitive state of the kernel the user space debuggers
are usually advanced enough to support dwarf2 unwinding.

-Andi


* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-16 16:09 ` [PATCH v4 0/6] Micro-optimize vclock_gettime Andi Kleen
@ 2011-05-16 16:25   ` Thomas Gleixner
  2011-05-16 16:49     ` Andi Kleen
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2011-05-16 16:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, x86, linux-kernel, Ingo Molnar, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov

On Mon, 16 May 2011, Andi Kleen wrote:

> >  - Don't turn off frame pointers in vDSO code.
> 
> Why? Note frame pointers are costly on some CPUs, like Atom.

That's the least worry on ATOM.

> Also unlike the primitive state of the kernel the user space debuggers
> are usually advanced enough to support dwarf2 unwinding.

And unless you or someone else changes the primitive state of the
kernel, framepointers are going to stay simply because removing them
breaks profiling backtraces when the hit is inside vread().

Thanks,

	tglx


* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-16 16:25   ` Thomas Gleixner
@ 2011-05-16 16:49     ` Andi Kleen
  2011-05-16 17:05       ` Andrew Lutomirski
  2011-05-17  7:56       ` Ingo Molnar
  0 siblings, 2 replies; 31+ messages in thread
From: Andi Kleen @ 2011-05-16 16:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, Andy Lutomirski, x86, linux-kernel, Ingo Molnar,
	Linus Torvalds, David S. Miller, Eric Dumazet, Peter Zijlstra,
	Borislav Petkov

> And unless you or someone else changes the primitive state of the
> kernel, framepointers are going to stay simply because removing them
> breaks profiling backtraces when the hit is inside vread().

This doesn't work anyways because the glibc stub code calling vgettimeofday
normally doesn't set up a frame pointer frame.

The only way to unwind there is dwarf2.

-Andi


* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-16 16:49     ` Andi Kleen
@ 2011-05-16 17:05       ` Andrew Lutomirski
  2011-05-16 20:22         ` Andi Kleen
  2011-05-17  7:56       ` Ingo Molnar
  1 sibling, 1 reply; 31+ messages in thread
From: Andrew Lutomirski @ 2011-05-16 17:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Thomas Gleixner, x86, linux-kernel, Ingo Molnar, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov

On Mon, May 16, 2011 at 12:49 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> And unless you or someone else changes the primitive state of the
>> kernel, framepointers are going to stay simply because removing them
>> breaks profiling backtraces when the hit is inside vread().
>
> This doesn't work anyways because the glibc stub code calling vgettimeofday
> normally doesn't set up a frame pointer frame.
>
> The only way to unwind there is dwarf2.

For code in the vsyscall page, I think using CFI data is a lost cause.
 How is any user code supposed to find the CFI data?

For the vDSO, we could be nice to userspace and install the debugging
symbols somewhere sensible.  Currently we generate a buildid but we
don't install the symbols anywhere by default.

Longer term, it would be nice to mark the vsyscall page NX.  That
involves a few things:

1. Move vread to the vDSO.  That's not very hard.
2. Get glibc to stop using the old vsyscalls.  I think it still does
for static-linked programs.
3. Have a bit of a deprecation period.
4. Come up with some way to make old programs keep working even with
NX set.  Maybe just delete the vsyscall functions and emulate them on
a trap (with a printk_once warning about the performance loss).

That way all user code has (in principle) dwarf2.  And we won't have a
syscall instruction sitting at a predictable address.

--Andy

>
> -Andi
>


* Re: [PATCH v4 1/6] x86-64: Clean up vdso/kernel shared variables
  2011-05-16 16:00 ` [PATCH v4 1/6] x86-64: Clean up vdso/kernel shared variables Andy Lutomirski
@ 2011-05-16 17:23   ` Borislav Petkov
  2011-05-16 17:34     ` Andrew Lutomirski
  0 siblings, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2011-05-16 17:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Ingo Molnar, Andi Kleen, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov

On Mon, May 16, 2011 at 12:00:58PM -0400, Andy Lutomirski wrote:
> Variables that are shared between the vdso and the kernel are
> currently a bit of a mess.  They are each defined with their own
> magic, they are accessed differently in the kernel, the vsyscall page,
> and the vdso, and one of them (vsyscall_clock) doesn't even really
> exist.
> 
> This changes them all to use a common mechanism.  All of them are
> declared in vvar.h with a fixed address (validated by the linker
> script).  In the kernel (as before), they look like ordinary
> read-write variables.  In the vsyscall page and the vdso, they are
> accessed through a new macro VVAR, which gives read-only access.
> 
> The vdso is now loaded verbatim into memory without any fixups.  As a
> side bonus, access from the vdso is faster because a level of
> indirection is removed.
> 
> While we're at it, pack jiffies and vgetcpu_mode into the same
> cacheline.

How about

Signed-off-by:...

or you don't wanna? :)

> ---
>  arch/x86/include/asm/vdso.h     |   14 ----------
>  arch/x86/include/asm/vgtod.h    |    2 -
>  arch/x86/include/asm/vsyscall.h |   12 +-------
>  arch/x86/include/asm/vvar.h     |   52 +++++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/time.c          |    2 +-
>  arch/x86/kernel/tsc.c           |    4 +-
>  arch/x86/kernel/vmlinux.lds.S   |   34 ++++++++-----------------
>  arch/x86/kernel/vsyscall_64.c   |   46 +++++++++++++++-------------------
>  arch/x86/vdso/Makefile          |    2 +-
>  arch/x86/vdso/vclock_gettime.c  |    3 +-
>  arch/x86/vdso/vdso.lds.S        |    7 -----
>  arch/x86/vdso/vextern.h         |   16 ------------
>  arch/x86/vdso/vgetcpu.c         |    3 +-
>  arch/x86/vdso/vma.c             |   27 --------------------
>  arch/x86/vdso/vvar.c            |   12 ---------
>  15 files changed, 91 insertions(+), 145 deletions(-)
>  create mode 100644 arch/x86/include/asm/vvar.h
>  delete mode 100644 arch/x86/vdso/vextern.h
>  delete mode 100644 arch/x86/vdso/vvar.c
> 

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632


* Re: [PATCH v4 1/6] x86-64: Clean up vdso/kernel shared variables
  2011-05-16 17:23   ` Borislav Petkov
@ 2011-05-16 17:34     ` Andrew Lutomirski
  0 siblings, 0 replies; 31+ messages in thread
From: Andrew Lutomirski @ 2011-05-16 17:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, Ingo Molnar, Andi Kleen, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Thomas Gleixner

On Mon, May 16, 2011 at 1:23 PM, Borislav Petkov <bp@amd64.org> wrote:
> On Mon, May 16, 2011 at 12:00:58PM -0400, Andy Lutomirski wrote:
>> Variables that are shared between the vdso and the kernel are
>> currently a bit of a mess.  They are each defined with their own
>> magic, they are accessed differently in the kernel, the vsyscall page,
>> and the vdso, and one of them (vsyscall_clock) doesn't even really
>> exist.
>>
>> This changes them all to use a common mechanism.  All of them are
>> declared in vvar.h with a fixed address (validated by the linker
>> script).  In the kernel (as before), they look like ordinary
>> read-write variables.  In the vsyscall page and the vdso, they are
>> accessed through a new macro VVAR, which gives read-only access.
>>
>> The vdso is now loaded verbatim into memory without any fixups.  As a
>> side bonus, access from the vdso is faster because a level of
>> indirection is removed.
>>
>> While we're at it, pack jiffies and vgetcpu_mode into the same
>> cacheline.
>
> How about
>
> Signed-off-by:...

Signed-off-by: Andy Lutomirski <luto@mit.edu>

If whoever wants to pick up these patches wants it (or if there's a
v5), I'll send a new patch.

>
> or you don't wanna? :)

Nope.  I probably just hit the delete button too hard or something.

--Andy


* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-16 17:05       ` Andrew Lutomirski
@ 2011-05-16 20:22         ` Andi Kleen
  2011-05-16 21:28           ` Andrew Lutomirski
  2011-05-16 21:53           ` Thomas Gleixner
  0 siblings, 2 replies; 31+ messages in thread
From: Andi Kleen @ 2011-05-16 20:22 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Thomas Gleixner, x86, linux-kernel, Ingo Molnar, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov

Andrew Lutomirski <luto@mit.edu> writes:

> On Mon, May 16, 2011 at 12:49 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>> And unless you or someone else changes the primitive state of the
>>> kernel, framepointers are going to stay simply because removing them
>>> breaks profiling backtraces when the hit is inside vread().
>>
>> This doesn't work anyways because the glibc stub code calling vgettimeofday
>> normally doesn't set up a frame pointer frame.
>>
>> The only way to unwind there is dwarf2.
>
> For code in the vsyscall page, I think using CFI data is a lost cause.
>  How is any user code supposed to find the CFI data?

It only works for the vDSO. vsyscall is legacy.

> For the vDSO, we could be nice to userspace and install the debugging
> symbols somewhere sensible.  Currently we generate a buildid but we
> don't install the symbols anywhere by default.

Actually we include the dwarf2 tables and signal unwinding works
(for the vDSO)

> Longer term, it would be nice to mark the vsyscall page NX.  That
> involves a few things:

Why NX? What would make sense is to call the VDSO from it.
The problem is that the vDSO is randomized and there's no good memory
location to store the pointer to it.

The real reason for all this dance is to have some less non randomized
code around. What I implemented back then was instead code to patch out
the SYSCALL in there if not needed to lower the attack surface (not sure
if that still works though, but that was the idea). For most cases
(TSC/HPET read) it's not needed.

Checking: someone removed the code meanwhile.

> And we won't have a
> syscall instruction sitting at a predictable address.

The easy way to fix this is to just re-add the patching.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-16 20:22         ` Andi Kleen
@ 2011-05-16 21:28           ` Andrew Lutomirski
  2011-05-16 21:53           ` Thomas Gleixner
  1 sibling, 0 replies; 31+ messages in thread
From: Andrew Lutomirski @ 2011-05-16 21:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Thomas Gleixner, x86, linux-kernel, Ingo Molnar, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov

On Mon, May 16, 2011 at 4:22 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Andrew Lutomirski <luto@mit.edu> writes:
>
>> On Mon, May 16, 2011 at 12:49 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>>> And unless you or someone else changes the primitive state of the
>>>> kernel, framepointers are going to stay simply because removing them
>>>> breaks profiling backtraces when the hit is inside vread().
>>>
>>> This doesn't work anyways because the glibc stub code calling vgettimeofday
>>> normally doesn't set up a frame pointer frame.
>>>
>>> The only way to unwind there is dwarf2.
>>
>> For code in the vsyscall page, I think using CFI data is a lost cause.
>>  How is any user code supposed to find the CFI data?
>
> It only works for the vDSO. vsyscall is legacy.

Except that vread_tsc (i.e. the code in question) is in the vsyscall
page.  (That's on my list of less low-hanging fruit to attack if I
decide to go for another 3 ns or so for 2.6.41.)

>
>> For the vDSO, we could be nice to userspace and install the debugging
>> symbols somewhere sensible.  Currently we generate a buildid but we
>> don't install the symbols anywhere by default.
>
> Actually we include the dwarf2 tables and signal unwinding works
> (for the vDSO)

That particular comment was more about debugging than stack traces.
Things like perf annotate would like the full symbols.

>
>> Longer term, it would be nice to mark the vsyscall page NX.  That
>> involves a few things:
>
> Why NX? What would make sense is to call the VDSO from it.
> The problem is that the vDSO is randomized and there's no good memory
> location to store the pointer to it.
>
> The real reason for all this dance is to have some less non randomized
> code around. What I implemented back then was instead code to patch out
> the SYSCALL in there if not needed to lower the attack surface (not sure
> if that still works though, but that was the idea). For most cases
> (TSC/HPET read) it's not needed.

Nothing sensible calls the vsyscall page.  (I'm defining code that
statically links against current glibc as insensible.  But that could
be fixed.  I think it's mainly an artifact of some odd glibc
internals.)  So if we make the vsyscall page NX then we have no
non-randomized code at all.

How do we tell that a program really wants to call the vsyscall page
and isn't being exploited to call the vsyscall page?

There are excactly four entry-points into the vsyscall page that are
part of the ABI.  One of them is venosys (maybe it used to do
something) and the other three are archaic duplicates of vDSO
functions.  It seems less complicated to emulate them than to patch
them in.

--Andy


* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-16 20:22         ` Andi Kleen
  2011-05-16 21:28           ` Andrew Lutomirski
@ 2011-05-16 21:53           ` Thomas Gleixner
  2011-05-16 22:17             ` Andrew Lutomirski
  1 sibling, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2011-05-16 21:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Lutomirski, x86, linux-kernel, Ingo Molnar,
	Linus Torvalds, David S. Miller, Eric Dumazet, Peter Zijlstra,
	Borislav Petkov

On Mon, 16 May 2011, Andi Kleen wrote:
> Andrew Lutomirski <luto@mit.edu> writes:
> > Longer term, it would be nice to mark the vsyscall page NX.  That
> > involves a few things:
> 
> Why NX? What would make sense is to call the VDSO from it.
> The problem is that the vDSO is randomized and there's no good memory
> location to store the pointer to it.
> 
> The real reason for all this dance is to have some less non randomized
> code around. What I implemented back then was instead code to patch out
> the SYSCALL in there if not needed to lower the attack surface (not sure
> if that still works though, but that was the idea). For most cases
> (TSC/HPET read) it's not needed.
> 
> Checking: someone removed the code meanwhile.

For a damned good reason.
 
> > And we won't have a
> > syscall instruction sitting at a predictable address.
> 
> The easy way to fix this is to just re-add the patching.

If you can address _all_ reasons why it was removed then we might
revisit that issue, but that's going to be an interesting patch.

Thanks,

	tglx


* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-16 21:53           ` Thomas Gleixner
@ 2011-05-16 22:17             ` Andrew Lutomirski
  2011-05-16 22:40               ` Thomas Gleixner
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Lutomirski @ 2011-05-16 22:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, x86, linux-kernel, Ingo Molnar, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov

On Mon, May 16, 2011 at 5:53 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, 16 May 2011, Andi Kleen wrote:
>> Andrew Lutomirski <luto@mit.edu> writes:
>> > Longer term, it would be nice to mark the vsyscall page NX.  That
>> > involves a few things:
>>
>> Why NX? What would make sense is to call the VDSO from it.
>> The problem is that the vDSO is randomized and there's no good memory
>> location to store the pointer to it.
>>
>> The real reason for all this dance is to have some less non randomized
>> code around. What I implemented back then was instead code to patch out
>> the SYSCALL in there if not needed to lower the attack surface (not sure
>> if that still works though, but that was the idea). For most cases
>> (TSC/HPET read) it's not needed.
>>
>> Checking: someone removed the code meanwhile.
>
> For a damned good reason.
>
>> > And we won't have a
>> > syscall instruction sitting at a predictable address.
>>
>> The easy way to fix this is to just re-add the patching.
>
> If you can address _all_ reasons why it was removed then we might
> revisit that issue, but that's going to be an interesting patch.

We're now completely offtopic for the RDTSC patches, but do you have
any moral objection to rigging the vsyscall page to generate a fault
and emulating it after a couple years of deprecation?  If not, I can
see if I can persuade Uli to accept a glibc patch to stop calling
it in future static glibc versions.

The long-term cost to the kernel will be a special case (and, compare,
unlikely branch) in whichever fault trap gets used.  The best way
might be to use a page full of 0xCC so that an int3 fault (presumably
not a performance-critical path) is taken.
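
Roughly what I have in mind, as an untested sketch (the hook name is
made up, and emulate_vsyscall() is the hypothetical helper that would
decode which vsyscall was called and do the work):

static bool maybe_emulate_vsyscall(struct pt_regs *regs)
{
	/* The page would be full of 0xCC, so any entry into it traps. */
	if (regs->ip < VSYSCALL_START || regs->ip >= VSYSCALL_END)
		return false;	/* not the vsyscall page; normal int3 path */

	emulate_vsyscall(regs);	/* decode the entry point and emulate it */
	return true;
}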

--Andy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-16 22:17             ` Andrew Lutomirski
@ 2011-05-16 22:40               ` Thomas Gleixner
  2011-05-17  8:00                 ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2011-05-16 22:40 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Andi Kleen, x86, linux-kernel, Ingo Molnar, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov

On Mon, 16 May 2011, Andrew Lutomirski wrote:
> On Mon, May 16, 2011 at 5:53 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Mon, 16 May 2011, Andi Kleen wrote:
> >> Andrew Lutomirski <luto@mit.edu> writes:
> >> > Longer term, it would be nice to mark the vsyscall page NX.  That
> >> > involves a few things:
> >>
> >> Why NX? What would make sense is to call the VDSO from it.
> >> The problem is that the vDSO is randomized and there's no good memory
> >> location to store the pointer to it.
> >>
> >> The real reason for all this dance is to have less non-randomized
> >> code around.  What I implemented back then was instead code to patch out
> >> the SYSCALL in there when it wasn't needed, to lower the attack surface
> >> (not sure if that still works though, but that was the idea).  For most
> >> cases (TSC/HPET read) it's not needed.
> >>
> >> Checking: someone removed the code in the meantime.
> >
> > For a damned good reason.
> >
> >> > And we won't have a
> >> > syscall instruction sitting at a predictable address.
> >>
> >> The easy way to fix this is to just re-add the patching.
> >
> > If you can address _all_ reasons why it was removed then we might
> > revisit that issue, but that's going to be an interesting patch.
> 
> We're now completely off-topic for the RDTSC patches, but do you have
> any moral objection to rigging the vsyscall page to generate a fault
> and emulating it after a couple of years of deprecation?  If not, I can

Moral objections are pretty irrelevant, but I have no technical
objections either. :)

> see if I can persuade Uli to accept a glibc patch to stop calling
> it in future static glibc versions.

How widespread is this in reality on 64-bit systems?

IOW, what's the damage if we take a trap and emulate it in the most
painful way we can come up with?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-16 16:49     ` Andi Kleen
  2011-05-16 17:05       ` Andrew Lutomirski
@ 2011-05-17  7:56       ` Ingo Molnar
  1 sibling, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2011-05-17  7:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Thomas Gleixner, Andy Lutomirski, x86, linux-kernel,
	Linus Torvalds, David S. Miller, Eric Dumazet, Peter Zijlstra,
	Borislav Petkov


* Andi Kleen <andi@firstfloor.org> wrote:

> > And unless you or someone else changes the primitive state of the
> > kernel, framepointers are going to stay simply because removing them
> > breaks profiling backtraces when the hit is inside vread().
> 
> This doesn't work anyway, because the glibc stub code calling vgettimeofday
> normally doesn't set up a frame-pointer frame.

But we at least get a callchain back to that point.

Also, Thomas's point remains: you can whine about frame pointers and you can
try to slowly sabotage proper backtraces (a nasty and sleazy tactic you are
using there, btw), or you can help bring sane DWARF support to the kernel.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-16 22:40               ` Thomas Gleixner
@ 2011-05-17  8:00                 ` Ingo Molnar
  2011-05-17 11:11                   ` Andrew Lutomirski
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2011-05-17  8:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Lutomirski, Andi Kleen, x86, linux-kernel, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov


* Thomas Gleixner <tglx@linutronix.de> wrote:

> > see if I can persuade Uli to accept a glibc patch to stop calling it
> > in future static glibc versions.
> 
> How widespread is this in reality on 64-bit systems?
> 
> IOW, what's the damage if we take a trap and emulate it in the most painful
> way we can come up with?

Well, how does that differ from having the real syscall instruction there? How 
are we going to filter real (old-)glibc calls from exploits?

If it can be filtered in a meaningful way then we should just do that and 
perhaps offer a (default enabled) .config COMPAT_VDSO_EMU=y switch to turn the 
emulation off.

That way we keep the ABI and also have a way out for users who *really* need 
this to work in a performant way.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-17  8:00                 ` Ingo Molnar
@ 2011-05-17 11:11                   ` Andrew Lutomirski
  2011-05-17 11:36                     ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Lutomirski @ 2011-05-17 11:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Andi Kleen, x86, linux-kernel, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov

On Tue, May 17, 2011 at 4:00 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> > see if I can persuade Uli to accept a glibc patch to stop calling it
>> > in future static glibc versions.
>>
>> How widespread is this in reality on 64-bit systems?
>>
>> IOW, what's the damage if we take a trap and emulate it in the most painful
>> way we can come up with?

I dunno.  I'll measure it.

>
> Well, how does that differ from having the real syscall instruction there? How
> are we going to filter real (old-)glibc calls from exploits?

Because there are only four vsyscalls: vgettimeofday, vtime, vgetcpu,
and venosys.  None of them have side-effects, so they only allow an
attacker to write something to user memory somewhere.  The
implementation of vgettimeofday needs a syscall instruction internally
for its fallback, which means that an attacker could jump there instead
of to the start of the vsyscall implementation.
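
For reference, decoding which entry point was called is trivial, since
each vsyscall lives in its own fixed 1024-byte slot.  A sketch (the
constants reflect the historical layout and are illustrative here):

#define VSYSCALL_START	0xffffffffff600000UL
#define VSYSCALL_SIZE	1024UL

static unsigned int vsyscall_nr(unsigned long ip)
{
	return (ip - VSYSCALL_START) / VSYSCALL_SIZE;	/* 0..3 */
}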

>
> If it can be filtered in a meaningful way then we should just do that and
> perhaps offer a (default enabled) .config COMPAT_VDSO_EMU=y switch to turn the
> emulation off.
>
> That way we keep the ABI and also have a way out for users who *really* need
> this to work in a performant way.

Yeah, that probably makes more sense.  It'll make for an uglier
diffstat, though -- there's a lot of ugly duplicate code around to
make vgettimeofday and vgetcpu work.

--Andy

>
> Thanks,
>
>        Ingo
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-17 11:11                   ` Andrew Lutomirski
@ 2011-05-17 11:36                     ` Ingo Molnar
  2011-05-17 18:31                       ` Andy Lutomirski
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2011-05-17 11:36 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Thomas Gleixner, Andi Kleen, x86, linux-kernel, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov


* Andrew Lutomirski <luto@mit.edu> wrote:

> > Well, how does that differ from having the real syscall instruction there? 
> > How are we going to filter real (old-)glibc calls from exploits?
> 
> Because there are only four vsyscalls: vgettimeofday, vtime, vgetcpu, and 
> venosys.  None of them have side-effects, so they only allow an attacker to 
> write something to user memory somewhere.  The implementation of 
> vgettimeofday needs a syscall instruction internally for its fallback, which 
> means that an attacker could jump there instead of to the start of the vsyscall 
> implementation.

So for this to work securely the emulation code would also have to filter the 
syscall numbers, to make sure that only these benign syscalls are used.

It should perhaps also warn if it notices something weird going on.

> > If it can be filtered in a meaningful way then we should just do that and 
> > perhaps offer a (default enabled) .config COMPAT_VDSO_EMU=y switch to turn 
> > the emulation off.
> >
> > That way we keep the ABI and also have a way out for users who *really* 
> > need this to work in a performant way.
> 
> Yeah, that probably makes more sense.  It'll make for an uglier diffstat, 
> though -- there's a lot of ugly duplicate code around to make vgettimeofday 
> and vgetcpu work.

Let's try the pure emulation thing first, ok?  Complications are not really
what we need in this area ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-17 11:36                     ` Ingo Molnar
@ 2011-05-17 18:31                       ` Andy Lutomirski
  2011-05-17 19:27                         ` Ingo Molnar
  2011-05-17 21:31                         ` Andi Kleen
  0 siblings, 2 replies; 31+ messages in thread
From: Andy Lutomirski @ 2011-05-17 18:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Andi Kleen, x86, linux-kernel, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov

On 05/17/2011 07:36 AM, Ingo Molnar wrote:
> 
> * Andrew Lutomirski<luto@mit.edu>  wrote:
> 
>>> Well, how does that differ from having the real syscall instruction there?
>>> How are we going to filter real (old-)glibc calls from exploits?
>>
>> Because there are only four vsyscalls: vgettimeofday, vtime, vgetcpu, and
>> venosys.  None of them have side-effects, so they only allow an attacker to
>> write something to user memory somewhere.  The implementation of
>> vgettimeofday needs a syscall instruction internally for its fallback, which
>> means that an attacker could jump there instead of to the start of the vsyscall
>> implementation.
> 
> So for this to work securely the emulation code would also have to filter the
> syscall numbers, to make sure that only these benign syscalls are used.
> 
> It should perhaps also warn if it notices something weird going on.

It's even easier than that: there are no syscall numbers involved.  There are four separate entry points, one for each vsyscall.

(It turns out that one of them has been broken and just segfaults since 2008 (a4928cff), so we only have to emulate three of them.)

On KVM on Sandy Bridge, I can emulate a vsyscall that does nothing in 400ns or so.  I'll try to make this code emulate real vsyscalls over the weekend.  This was much easier than I expected.

diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index d0983d2..52b4b49 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -39,6 +39,14 @@ extern struct timezone sys_tz;
 
 extern void map_vsyscall(void);
 
+/* Emulation */
+static inline bool is_vsyscall_addr(unsigned long addr)
+{
+	return (addr & ~(3*VSYSCALL_SIZE)) == VSYSCALL_START + 4096; /* intentionally incorrect for testing */
+}
+
+void emulate_vsyscall(struct pt_regs *regs);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_VSYSCALL_H */
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index dcbb28c..83590e8 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -32,6 +32,8 @@
 #include <linux/cpu.h>
 #include <linux/smp.h>
 #include <linux/notifier.h>
+#include <linux/sched.h>
+#include <linux/uaccess.h>
 
 #include <asm/vsyscall.h>
 #include <asm/pgtable.h>
@@ -233,6 +235,41 @@ static long __vsyscall(3) venosys_1(void)
 	return -ENOSYS;
 }
 
+void emulate_vsyscall(struct pt_regs *regs)
+{
+	long ret = 0;
+	unsigned long called_from;
+
+	unsigned vsyscall_no = (regs->ip >> 10) & 3;
+	BUILD_BUG_ON(VSYSCALL_SIZE != (1<<10));
+
+	/* pop called_from */
+	ret = get_user(called_from, (unsigned long __user *)regs->sp);
+	if (ret)
+		goto fault;
+	regs->sp += 8;
+
+	switch(vsyscall_no) {
+	case 0:		/* vgettimeofday */
+	case 1:		/* vtime */
+	case 2:		/* vgetcpu */
+		ret = -EINVAL;
+		goto out;
+
+	case 3:		/* venosys */
+		ret = -ENOSYS;
+		goto out;
+	}
+
+out:
+	regs->ip = called_from;
+	regs->ax = ret;
+	return;
+
+fault:
+	force_sig(SIGKILL, current);  /* XXX */
+}
+
 #ifdef CONFIG_SYSCTL
 static ctl_table kernel_table2[] = {
 	{ .procname = "vsyscall64",
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 20e3f87..c84df6f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -16,6 +16,7 @@
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
 #include <asm/kmemcheck.h>		/* kmemcheck_*(), ...		*/
+#include <asm/vsyscall.h>		/* vsyscall emulation		*/
 
 /*
  * Page fault error code bits:
@@ -719,6 +720,16 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 		if (is_errata100(regs, address))
 			return;
 
+		/*
+		 * Calling certain addresses has historical semantics that
+		 * we need to emulate.
+		 */
+		if (is_vsyscall_addr(regs->ip) && regs->ip == address &&
+		    (error_code & (PF_WRITE | PF_INSTR)) == PF_INSTR) {
+			emulate_vsyscall(regs);
+			return;
+		}
+
 		if (unlikely(show_unhandled_signals))
 			show_signal_msg(regs, error_code, address, tsk);
 



I don't expect to have this ready for 2.6.40.  What's the status of the RDTSC stuff -- do you want to pick it up for the 2.6.40 merge window?

--Andy

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-17 18:31                       ` Andy Lutomirski
@ 2011-05-17 19:27                         ` Ingo Molnar
  2011-05-17 21:31                         ` Andi Kleen
  1 sibling, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2011-05-17 19:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Andi Kleen, x86, linux-kernel, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov


* Andy Lutomirski <luto@MIT.EDU> wrote:

> It's even easier than that: there are no syscall numbers involved.  There are 
> four separate entry points, one for each vsyscall.
> 
> (It turns out that one of them has been broken and just segfaults since 2008 
> (a4928cff), so we only have to emulate three of them.)
> 
> On KVM on Sandy Bridge, I can emulate a vsyscall that does nothing in 400ns 
> or so.  I'll try to make this code emulate real vsyscalls over the weekend.  

Very nice!

> This was much easier than I expected.

Hey, looking back at a working patch always makes it seem easy ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-17 18:31                       ` Andy Lutomirski
  2011-05-17 19:27                         ` Ingo Molnar
@ 2011-05-17 21:31                         ` Andi Kleen
  2011-05-17 22:59                           ` Thomas Gleixner
  1 sibling, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2011-05-17 21:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Thomas Gleixner, x86, linux-kernel, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov

Andy Lutomirski <luto@MIT.EDU> writes:
>
> On KVM on Sandy Bridge, I can emulate a vsyscall that does nothing in 400ns or so.  I'll try to make this code emulate real vsyscalls over the weekend.  This was much easier than I expected.

How about the performance of all the statically linked programs?  I guess
you just declared they don't matter?  gettimeofday is quite critical,
and adding an exception into it is just a performance disaster.

Also it's always a dangerous assumption to think that all
programs on Linux use glibc ("all the world's a VAX").

In fact, more and more Linux users are using different libcs these
days (like Android, embedded systems, or languages with special runtime
systems).  Who knows whether all those other libraries use the vDSO?

And then there are of course the old glibcs. A lot of people
(including me) use new kernels with old userland.

For me this seems like a very risky move -- breaking the performance of
previously perfectly good setups for very little reason.

Granted, the old vsyscall code is somewhat ugly -- I wouldn't argue with
that -- but compatibility (including performance compatibility) has always
been important in Linux, and we have far uglier code around in the name of
compatibility...

And the "security problem" you keep talking about can be fixed
much easier and more compatible as I pointed out.

As far as I'm concerned the change is a bad idea.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-17 21:31                         ` Andi Kleen
@ 2011-05-17 22:59                           ` Thomas Gleixner
  2011-05-18  3:18                             ` Andrew Lutomirski
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2011-05-17 22:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, Ingo Molnar, x86, linux-kernel, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov

On Tue, 17 May 2011, Andi Kleen wrote:
> Andy Lutomirski <luto@MIT.EDU> writes:
> >
> > On KVM on Sandy Bridge, I can emulate a vsyscall that does nothing in 400ns or so.  I'll try to make this code emulate real vsyscalls over the weekend.  This was much easier than I expected.
> 
> How about the performance of all the statically linked programs? I guess

_ALL_ the statically linked programs? Point out a single one which
matters and _IS_ performance critical.

> you just declared they don't matter?  gettimeofday is quite critical,
> and adding an exception into it is just a performance disaster.
> 
> Also it's always a dangerous assumption to think that all
> programs on Linux use glibc ("all the world's a VAX").
> 
> In fact, more and more Linux users are using different libcs these
> days (like Android, embedded systems, or languages with special runtime
> systems).  Who knows whether all those other libraries use the vDSO?

Which is completely irrelevant to x86_64.  Point to a single relevant
x86_64 embedded system to which any of the above handwaving applies.

> And then there are of course the old glibcs. A lot of people
> (including me) use new kernels with old userland.

And how is your use case performance critical?

Furthermore, any halfway up-to-date deployment is using the vDSO for
obvious reasons, and the archaic stuff which might be affected is not
using a recent kernel at all (except for akpm on his retro laptop, but
that "performance penalty" is probably the least of his worries).

> For me this seems like a very risky move -- breaking the performance of
> previously perfectly good setups for very little reason.
> 
> Granted, the old vsyscall code is somewhat ugly -- I wouldn't argue with that --

It's not somewhat ugly. It's a design failure and more important it's
a risk - and you very well know that.

> but compatibility (including performance compatibility) has always
> been important in Linux, and we have far uglier code around in the
> name of compatibility...

We have ugly code around, but it has always been more important to fix
security risks than to keep them around for performance's sake.
 
> And the "security problem" you keep talking about can be fixed much
> easier and more compatible as I pointed out.

Unless you come forward with a patch which addresses _ALL_ the reasons
that caused me to remove your superior "Make the vsyscall less risky"
approach, just admit that it's fcked up by design and be done with it.

Stop your "as I pointed out" handwaving please. That code was broken
and even exploitable and I removed it for more than one very good
reason. Stop claiming that your "design" is in any way fixable. It's
not.

> As far as I'm concerned the change is a bad idea.

As far as I'm concerned you seem to be regaining your old habit of
defending your proven-to-be-wrong "design" decisions no matter what.

And as long as you can't come up with documented proof that a single
application is affected by such a change, STFU!

Even if you are not able to prove it, it's a total no-brainer to revert
such a change and make it CONFIG_KEEP_FCKUP=y dependent in the very
unlikely case that we get a proper and reasonable regression report.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-17 22:59                           ` Thomas Gleixner
@ 2011-05-18  3:18                             ` Andrew Lutomirski
  2011-05-18  7:30                               ` Thomas Gleixner
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Lutomirski @ 2011-05-18  3:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, Ingo Molnar, x86, linux-kernel, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov

On Tue, May 17, 2011 at 6:59 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 17 May 2011, Andi Kleen wrote:
>> Andy Lutomirski <luto@MIT.EDU> writes:
>> >
>> > On KVM on Sandy Bridge, I can emulate a vsyscall that does nothing in 400ns or so.  I'll try to make this code emulate real vsyscalls over the weekend.  This was much easier than I expected.
>>
>> How about the performance of all the statically linked programs? I guess
>
> _ALL_ the statically linked programs? Point out a single one which
> matters and _IS_ performance critical.
>
>> you just declared they don't matter?  gettimeofday is quite critical,
>> and adding an exception into it is just a performance disaster.
>>
>> Also it's always a dangerous assumption to think that all
>> programs on Linux use glibc ("all the world's a VAX").
>>
>> In fact, more and more Linux users are using different libcs these
>> days (like Android, embedded systems, or languages with special runtime
>> systems).  Who knows whether all those other libraries use the vDSO?
>
> Which is completely irrelevant to x86_64. Point to a single relevant
> x86_64 embedded system to which one of the above handwaving applies.
>
>> And then there are of course the old glibcs. A lot of people
>> (including me) use new kernels with old userland.
>
> And how is your use case performance critical ?
>
> Furthermore, any halfway up-to-date deployment is using the vDSO for
> obvious reasons, and the archaic stuff which might be affected is not
> using a recent kernel at all (except for akpm on his retro laptop, but
> that "performance penalty" is probably the least of his worries).

Sadly that's not quite true.  glibc git right now contains this:

ENTRY (__gettimeofday)
        /* Align stack.  */
        sub     $0x8, %rsp
        cfi_adjust_cfa_offset(8)
#ifdef SHARED
        movq    __vdso_gettimeofday(%rip), %rax
        PTR_DEMANGLE (%rax)
#else
        movq    $VSYSCALL_ADDR_vgettimeofday, %rax
#endif
        callq   *%rax

And time() and sched_getcpu() call the vsyscall page unconditionally.
We should either declare CLOCK_REALTIME_COARSE to be acceptable for
time() or add a new vDSO call.
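
A new vDSO call could be almost trivial.  A sketch, assuming the vDSO
can read the wall-clock seconds out of the gtod data the way the other
vDSO functions do (the field name is illustrative):

time_t __vdso_time(time_t *t)
{
	/* A single aligned word read is atomic; no seqlock loop needed. */
	time_t result = gtod->wall_time_sec;

	if (t)
		*t = result;
	return result;
}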

IMO we should put a note in feature-removal-schedule.txt, add vsyscall
emulation as a config option for 2.6.41 but leave it turned off by
default, and turn it on by default (or just remove the old code) in
2.6.43 or so.  That'll give glibc a chance to stop generating *new*
static binaries that call it.

I'm not volunteering to dig around the libdl stuff to fix it myself.

klibc doesn't seem to use vsyscalls or the vDSO.  I haven't looked at
uclibc, and I don't think that Bionic has any released version on
x86_64.

--Andy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-18  3:18                             ` Andrew Lutomirski
@ 2011-05-18  7:30                               ` Thomas Gleixner
  2011-05-18  8:31                                 ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2011-05-18  7:30 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Andi Kleen, Ingo Molnar, x86, linux-kernel, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov

On Tue, 17 May 2011, Andrew Lutomirski wrote:
> On Tue, May 17, 2011 at 6:59 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > Furthermore, any halfway up-to-date deployment is using the vDSO for
> > obvious reasons, and the archaic stuff which might be affected is not
> > using a recent kernel at all (except for akpm on his retro laptop, but
> > that "performance penalty" is probably the least of his worries).
> 
> Sadly that's not quite true.  glibc git right now contains this:
> 
> ENTRY (__gettimeofday)
>         /* Align stack.  */
>         sub     $0x8, %rsp
>         cfi_adjust_cfa_offset(8)
> #ifdef SHARED
>         movq    __vdso_gettimeofday(%rip), %rax
>         PTR_DEMANGLE (%rax)
> #else
>         movq    $VSYSCALL_ADDR_vgettimeofday, %rax
> #endif
>         callq   *%rax

I know that, but again: are statically linked binaries a real issue?
 
> And time() and sched_getcpu() call the vsyscall page unconditionally.

Dammit, time() is a real problem.  I missed that and thought it was
gettimeofday() alone for the static case.  sched_getcpu() is nothing to
worry about.

> We should either declare CLOCK_REALTIME_COARSE to be acceptable for
> time() or add a new vDSO call.

Separate call.

> IMO we should put a note in feature-removal-schedule.txt, add vsyscall
> emulation as a config option for 2.6.41 but leave it turned off by
> default, and turn it on by default (or just remove the old code) in
> 2.6.43 or so.  That'll give glibc a chance to stop generating *new*
> static binaries that call it.
> 
> I'm not volunteering to dig around the libdl stuff to fix it myself.
> 
> klibc doesn't seem to use vsyscalls or the vDSO.  I haven't looked at
> uclibc, and I don't think that Bionic has any released version on
> x86_64.

uclibc is safe as well. The VSYSCALL usage is in the futex code, but
is conditional and not used on current kernels. That code is copied
from glibc.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-18  7:30                               ` Thomas Gleixner
@ 2011-05-18  8:31                                 ` Ingo Molnar
  2011-05-18 11:30                                   ` Andrew Lutomirski
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2011-05-18  8:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Lutomirski, Andi Kleen, x86, linux-kernel, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov


* Thomas Gleixner <tglx@linutronix.de> wrote:

> > And time() and sched_getcpu() call the vsyscall page unconditionally.
> 
> Dammit, time() is a real problem.  I missed that and thought it was
> gettimeofday() alone for the static case.  sched_getcpu() is nothing to worry
> about.

There's a relatively simple solution for all this:

 - We can make the old vsyscall page contain an int $0x81 (it is a free vector)

 - We can use vector 0x81 as a wrapper around the int80 entry: it would check 
   the syscall nr and return an error if it's outside the small set of 
   permitted syscalls

 - We can put this behind a straightforward CONFIG_COMPAT_VSYSCALL=y option, 
   enabled by default for compatibility.

 - Distros that fix glibc can turn it off

Costs:

 - the performance cost of this solution is minimal: weirdly built binaries on 
   unfixed glibc will have a handful of syscalls execute via int $0x81, not the 
   syscall instruction.  The cost of that is +50 nsecs at most - not 500.

 - almost zero maintenance cost: it just wraps existing int80 logic.  It does
   not even have to use any kernel stack; it only checks register arguments, 
   so the code is truly small and trivial to keep secure.

Advantages:

 - we defang the constant-address syscall instruction this way - it cannot be 
   used for anything even remotely useful to an exploit.

 - it's very simple

 - there's a future path out of it and a future path to deprecate this
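
The filter itself would be tiny.  A sketch (whether it ends up in asm
or C is an implementation detail; treat this as illustrative only):

static bool int81_syscall_allowed(unsigned long nr)
{
	switch (nr) {
	case __NR_gettimeofday:
	case __NR_time:
	case __NR_getcpu:
		return true;	/* the syscalls the vsyscalls need */
	default:
		return false;	/* everything else gets rejected */
	}
}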

What do you think?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-18  8:31                                 ` Ingo Molnar
@ 2011-05-18 11:30                                   ` Andrew Lutomirski
  2011-05-18 12:10                                     ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Lutomirski @ 2011-05-18 11:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Andi Kleen, x86, linux-kernel, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov

On Wed, May 18, 2011 at 4:31 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> > And time() and sched_getcpu() call the vsyscall page unconditionally.
>>
>> Dammit, time() is a real problem.  I missed that and thought it was
>> gettimeofday() alone for the static case.  sched_getcpu() is nothing to worry
>> about.

I made exactly the same assumption.

>
> There's a relatively simple solution for all this:
>
>  - We can make the old vsyscall page contain an int $0x81 (it is a free vector)
>
>  - We can use vector 0x81 as a wrapper around the int80 entry: it would check
>   the syscall nr and return an error if it's outside the small set of
>   permitted syscalls
>
>  - We can put this behind a straightforward CONFIG_COMPAT_VSYSCALL=y option,
>   enabled by default for compatibility.
>
>  - Distros that fix glibc can turn it off
>
> Costs:
>
>  - the performance cost of this solution is minimal: weirdly built binaries on
>   unfixed glibc will have a handful of syscalls execute via int $0x81, not the
>   syscall instruction.  The cost of that is +50 nsecs at most - not 500.
>
>  - almost zero maintenance cost: it just wraps existing int80 logic.  It does
>   not even have to use any kernel stack; it only checks register arguments,
>   so the code is truly small and trivial to keep secure.
>
> Advantages:
>
>  - we defang the constant-address syscall instruction this way - it cannot be
>   used for anything even remotely useful to an exploit.
>
>  - it's very simple
>
>  - there's a future path out of it and a future path to deprecate this
>
> What do you think?

I was thinking of something similar but using ud2 instead of a new
interrupt vector.  I like the interrupt approach better.  We could
fill the rest of the page with 0xCC to make certain that jumping
anywhere else in there isn't useful.

As an interim measure, we could keep vtime and check it for anything
that could be used as a syscall if called with the wrong instruction
alignment.

Should we add a vDSO vtime for 2.6.40 so that people can start using it?

--Andy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/6] Micro-optimize vclock_gettime
  2011-05-18 11:30                                   ` Andrew Lutomirski
@ 2011-05-18 12:10                                     ` Ingo Molnar
  0 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2011-05-18 12:10 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Thomas Gleixner, Andi Kleen, x86, linux-kernel, Linus Torvalds,
	David S. Miller, Eric Dumazet, Peter Zijlstra, Borislav Petkov


* Andrew Lutomirski <luto@mit.edu> wrote:

> Should we add a vDSO vtime for 2.6.40 so that people can start using it?

yeah.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread

Thread overview: 31+ messages
2011-05-16 16:00 [PATCH v4 0/6] Micro-optimize vclock_gettime Andy Lutomirski
2011-05-16 16:00 ` [PATCH v4 1/6] x86-64: Clean up vdso/kernel shared variables Andy Lutomirski
2011-05-16 17:23   ` Borislav Petkov
2011-05-16 17:34     ` Andrew Lutomirski
2011-05-16 16:00 ` [PATCH v4 2/6] x86-64: Remove unnecessary barrier in vread_tsc Andy Lutomirski
2011-05-16 16:01 ` [PATCH v4 3/6] x86-64: Don't generate cmov " Andy Lutomirski
2011-05-16 16:01 ` [PATCH v4 4/6] x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0 Andy Lutomirski
2011-05-16 16:01 ` [PATCH v4 5/6] x86-64: Move vread_tsc into a new file with sensible options Andy Lutomirski
2011-05-16 16:01 ` [PATCH v4 6/6] x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO Andy Lutomirski
2011-05-16 16:09 ` [PATCH v4 0/6] Micro-optimize vclock_gettime Andi Kleen
2011-05-16 16:25   ` Thomas Gleixner
2011-05-16 16:49     ` Andi Kleen
2011-05-16 17:05       ` Andrew Lutomirski
2011-05-16 20:22         ` Andi Kleen
2011-05-16 21:28           ` Andrew Lutomirski
2011-05-16 21:53           ` Thomas Gleixner
2011-05-16 22:17             ` Andrew Lutomirski
2011-05-16 22:40               ` Thomas Gleixner
2011-05-17  8:00                 ` Ingo Molnar
2011-05-17 11:11                   ` Andrew Lutomirski
2011-05-17 11:36                     ` Ingo Molnar
2011-05-17 18:31                       ` Andy Lutomirski
2011-05-17 19:27                         ` Ingo Molnar
2011-05-17 21:31                         ` Andi Kleen
2011-05-17 22:59                           ` Thomas Gleixner
2011-05-18  3:18                             ` Andrew Lutomirski
2011-05-18  7:30                               ` Thomas Gleixner
2011-05-18  8:31                                 ` Ingo Molnar
2011-05-18 11:30                                   ` Andrew Lutomirski
2011-05-18 12:10                                     ` Ingo Molnar
2011-05-17  7:56       ` Ingo Molnar
