All of lore.kernel.org
* [PATCH v4 00/10] Remove syscall instructions at fixed addresses
@ 2011-05-31 14:13 Andy Lutomirski
  2011-05-31 14:13 ` [PATCH v4 01/10] x86-64: Fix alignment of jiffies variable Andy Lutomirski
                   ` (9 more replies)
  0 siblings, 10 replies; 28+ messages in thread
From: Andy Lutomirski @ 2011-05-31 14:13 UTC (permalink / raw)
  To: Ingo Molnar, x86
  Cc: Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen,
	Andy Lutomirski

Patch 1 is just a bugfix from the last vdso series.
The bug should be harmless but it's pretty dumb.  This is almost
certainly 3.0 material.

Patch 2 adds documentation for entry_64.S.  A lot of the magic in there is far from obvious.

Patches 3, 4, and 5 remove a bunch of syscall instructions in kernel space
at fixed addresses that user code can execute.  Several of them are just
data that isn't marked NX.  Patch 3 gives the vvars their own NX page and
patch 5 maps the HPET NX.

The time() vsyscall contains an explicit syscall fallback.  Patch 4
removes it.

At this point, there is only one explicit syscall left in the vsyscall
page: the fallback case for vgettimeofday.  The rest of the series is to
remove it and most of the remaining vsyscall code.
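For context (not part of the patches themselves): the legacy vsyscalls sit at fixed 1024-byte slots within the vsyscall page, which is why their addresses are effectively ABI.  A quick sketch of that layout, using the well-known fixed base address:

```python
# The x86-64 vsyscall page sits at a fixed, ABI-frozen virtual address,
# and each legacy vsyscall occupies a 1024-byte slot within it.
VSYSCALL_START = 0xFFFFFFFFFF600000

def vsyscall_addr(nr):
    """Return the fixed entry address of legacy vsyscall number nr."""
    return VSYSCALL_START + 1024 * nr

# Slot 0: vgettimeofday, slot 1: vtime, slot 2: vgetcpu.
assert vsyscall_addr(1) == 0xFFFFFFFFFF600400
assert vsyscall_addr(2) == 0xFFFFFFFFFF600800
```

Because those addresses can never move, the only way to get rid of the syscall instructions there is to change what the slots contain, which is what the rest of the series does.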

Patch 6 is pure cleanup.  venosys (one of the four vsyscalls) has been
broken for years, so patch 6 removes it.

Patch 7 pads the empty parts of the vsyscall page with 0xcc.  0xcc is
the single-byte int3 breakpoint instruction, so any stray jump into the
padding traps.

Patch 8 removes the code implementing the vsyscalls and replaces it with
magic int 0xcc incantations.  These incantations are specifically
designed so that jumping into them at funny offsets will either work
fine or generate some kind of fault.  This is a significant performance
penalty (~220ns here) for all vsyscall users, but there aren't many
left.  Because current glibc still uses the time vsyscall (although it's
fixed in glibc's git), the option CONFIG_UNSAFE_VSYSCALLS (default y)
will leave time() alone.

This patch is also nice because it removes a bunch of duplicated code
from vsyscall_64.c.

Patch 9/10 randomizes the int 0xcc incantation at bootup.  It is pretty
much worthless for security (there are only three choices for the random
number and it's easy to figure out which one is in use) but it prevents
overly clever userspace programs from thinking that the incantation is
ABI.  One instrumentation tool author offered to hard-code special
handling for int 0xcc; I want to discourage this approach.
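To illustrate why the randomization is worthless for security (this is a sketch of the idea, not the patch's literal code): with only three possible assignments of the magic %al values, an attacker can simply enumerate every candidate layout.

```python
import random

# Sketch (illustrative assumption, not the in-kernel scheme): a single
# boot-time random value r in {0, 1, 2} rotates which magic %al value
# each of the three vsyscalls loads before int 0xcc -- hence only three
# possible layouts exist.
r = random.randrange(3)
magic = [(nr + r) % 3 for nr in range(3)]

# Each vsyscall still gets a unique value, and an attacker can trivially
# enumerate all three candidate layouts.
candidates = [[(nr + g) % 3 for nr in range(3)] for g in range(3)]
assert sorted(magic) == [0, 1, 2]
assert magic in candidates
```

The point of the randomization is purely to stop userspace from baking the incantation into tools as if it were ABI.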

Patch 10/10 adds CONFIG_UNSAFE_VSYSCALLS to
feature-removal-schedule.txt.  Removing it won't break anything but it
will slow some older code down.


Changes from v3:
 - Rebased onto tip/master (1a0c84d)

Changes from v2:
 - Reordered the patches.
 - Removed the option to leave gettimeofday and getcpu as native code.
 - Clean up the int 0xcc handler and registration.
 - sched_getcpu works now (thanks, Borislav, for catching my blatant arithmetic error).
 - Numerous small fixes from review comments.
 - Abandon my plan to spread turbostat to the masses.

Changes from v1:
 - Patches 6-10 are new.
 - The int 0xcc code is much prettier and has lots of bugs fixed.
 - I've decided to let everyone compile turbostat on their own :)

Andy Lutomirski (10):
  x86-64: Fix alignment of jiffies variable
  x86-64: Document some of entry_64.S
  x86-64: Give vvars their own page
  x86-64: Remove kernel.vsyscall64 sysctl
  x86-64: Map the HPET NX
  x86-64: Remove vsyscall number 3 (venosys)
  x86-64: Fill unused parts of the vsyscall page with 0xcc
  x86-64: Emulate legacy vsyscalls
  x86-64: Randomize int 0xcc magic al values at boot
  x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule

 Documentation/feature-removal-schedule.txt |    9 +
 Documentation/x86/entry_64.txt             |   98 +++++++++
 arch/x86/Kconfig                           |   17 ++
 arch/x86/include/asm/fixmap.h              |    1 +
 arch/x86/include/asm/irq_vectors.h         |    6 +-
 arch/x86/include/asm/pgtable_types.h       |    6 +-
 arch/x86/include/asm/traps.h               |    4 +
 arch/x86/include/asm/vgtod.h               |    1 -
 arch/x86/include/asm/vsyscall.h            |    6 +
 arch/x86/include/asm/vvar.h                |   24 +-
 arch/x86/kernel/Makefile                   |    1 +
 arch/x86/kernel/entry_64.S                 |    4 +
 arch/x86/kernel/hpet.c                     |    2 +-
 arch/x86/kernel/traps.c                    |    4 +
 arch/x86/kernel/vmlinux.lds.S              |   47 ++--
 arch/x86/kernel/vsyscall_64.c              |  327 +++++++++++++++++-----------
 arch/x86/kernel/vsyscall_emu_64.S          |   42 ++++
 arch/x86/vdso/vclock_gettime.c             |   55 ++---
 18 files changed, 448 insertions(+), 206 deletions(-)
 create mode 100644 Documentation/x86/entry_64.txt
 create mode 100644 arch/x86/kernel/vsyscall_emu_64.S

-- 
1.7.5.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v4 01/10] x86-64: Fix alignment of jiffies variable
  2011-05-31 14:13 [PATCH v4 00/10] Remove syscall instructions at fixed addresses Andy Lutomirski
@ 2011-05-31 14:13 ` Andy Lutomirski
  2011-05-31 14:14 ` [PATCH v4 02/10] x86-64: Document some of entry_64.S Andy Lutomirski
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2011-05-31 14:13 UTC (permalink / raw)
  To: Ingo Molnar, x86
  Cc: Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen,
	Andy Lutomirski

It's declared __attribute__((aligned(16))) but it's explicitly not
aligned.  This is probably harmless but it's a bit embarrassing.
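A quick way to see the bug (offsets as in the patch below): every vvar is declared aligned(16), so each explicit offset handed to DECLARE_VVAR must be a multiple of 16, and the old offset of 8 for vgetcpu_mode wasn't.

```python
# Each vvar is declared __attribute__((aligned(16))), so the explicit
# offsets passed to DECLARE_VVAR must be multiples of 16.
old_offsets = {"jiffies": 0, "vgetcpu_mode": 8, "vsyscall_gtod_data": 128}
new_offsets = {"jiffies": 0, "vgetcpu_mode": 16, "vsyscall_gtod_data": 128}

assert any(off % 16 for off in old_offsets.values())       # old layout violates it
assert all(off % 16 == 0 for off in new_offsets.values())  # fixed layout is clean
```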

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/include/asm/vvar.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
index 341b355..a4eaca4 100644
--- a/arch/x86/include/asm/vvar.h
+++ b/arch/x86/include/asm/vvar.h
@@ -45,7 +45,7 @@
 /* DECLARE_VVAR(offset, type, name) */
 
 DECLARE_VVAR(0, volatile unsigned long, jiffies)
-DECLARE_VVAR(8, int, vgetcpu_mode)
+DECLARE_VVAR(16, int, vgetcpu_mode)
 DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
 
 #undef DECLARE_VVAR
-- 
1.7.5.1



* [PATCH v4 02/10] x86-64: Document some of entry_64.S
  2011-05-31 14:13 [PATCH v4 00/10] Remove syscall instructions at fixed addresses Andy Lutomirski
  2011-05-31 14:13 ` [PATCH v4 01/10] x86-64: Fix alignment of jiffies variable Andy Lutomirski
@ 2011-05-31 14:14 ` Andy Lutomirski
  2011-05-31 14:14 ` [PATCH v4 03/10] x86-64: Give vvars their own page Andy Lutomirski
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2011-05-31 14:14 UTC (permalink / raw)
  To: Ingo Molnar, x86
  Cc: Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen,
	Andy Lutomirski

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 Documentation/x86/entry_64.txt |   98 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/entry_64.S     |    2 +
 2 files changed, 100 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/x86/entry_64.txt

diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt
new file mode 100644
index 0000000..7869f14
--- /dev/null
+++ b/Documentation/x86/entry_64.txt
@@ -0,0 +1,98 @@
+This file documents some of the kernel entries in
+arch/x86/kernel/entry_64.S.  A lot of this explanation is adapted from
+an email from Ingo Molnar:
+
+http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu>
+
+The x86 architecture has quite a few different ways to jump into
+kernel code.  Most of these entry points are registered in
+arch/x86/kernel/traps.c and implemented in arch/x86/kernel/entry_64.S
+and arch/x86/ia32/ia32entry.S.
+
+The IDT vector assignments are listed in arch/x86/include/asm/irq_vectors.h.
+
+Some of these entries are:
+
+ - system_call: syscall instruction from 64-bit code.
+
+ - ia32_syscall: int 0x80 from 32-bit or 64-bit code; compat syscall
+   either way.
+
+ - ia32_syscall, ia32_sysenter: syscall and sysenter from 32-bit
+   code
+
+ - interrupt: An array of entries.  Every IDT vector that doesn't
+   explicitly point somewhere else gets set to the corresponding
+   value in interrupts.  These point to a whole array of
+   magically-generated functions that make their way to do_IRQ with
+   the interrupt number as a parameter.
+
+ - emulate_vsyscall: int 0xcc, a special non-ABI entry used by
+   vsyscall emulation.
+
+ - APIC interrupts: Various special-purpose interrupts for things
+   like TLB shootdown.
+
+ - Architecturally-defined exceptions like divide_error.
+
+There are a few complexities here.  The different x86-64 entries
+have different calling conventions.  The syscall and sysenter
+instructions have their own peculiar calling conventions.  Some of
+the IDT entries push an error code onto the stack; others don't.
+IDT entries using the IST alternative stack mechanism need their own
+magic to get the stack frames right.  (You can find some
+documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM,
+Volume 3, Chapter 6.)
+
+Dealing with the swapgs instruction is especially tricky.  Swapgs
+toggles whether gs is the kernel gs or the user gs.  The swapgs
+instruction is rather fragile: it must nest perfectly, only to a
+single depth, and it must be used exactly once when entering from
+user mode and exactly once when returning to user space.  If we mess
+that up even slightly, we crash.
+
+So when we have a secondary entry, already in kernel mode, we *must
+not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's
+not switched/swapped yet.
+
+Now, there's a secondary complication: there's a cheap way to test
+which mode the CPU is in and an expensive way.
+
+The cheap way is to pick this info off the entry frame on the kernel
+stack, from the CS of the ptregs area of the kernel stack:
+
+	xorl %ebx,%ebx
+	testl $3,CS+8(%rsp)
+	je error_kernelspace
+	SWAPGS
+
+The expensive (paranoid) way is to read back the MSR_GS_BASE value
+(which is what SWAPGS modifies):
+
+	movl $1,%ebx
+	movl $MSR_GS_BASE,%ecx
+	rdmsr
+	testl %edx,%edx
+	js 1f   /* negative -> in kernel */
+	SWAPGS
+	xorl %ebx,%ebx
+1:	ret
+
+and the whole paranoid non-paranoid macro complexity is about whether
+to suffer that RDMSR cost.
+
+If we are at an interrupt or user-trap/gate-alike boundary then we can
+use the faster check: the stack will be a reliable indicator of
+whether SWAPGS was already done: if we see that we are a secondary
+entry interrupting kernel mode execution, then we know that the GS
+base has already been switched. If it says that we interrupted
+user-space execution then we must do the SWAPGS.
+
+But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context,
+which might have triggered right after a normal entry wrote CS to the
+stack but before we executed SWAPGS, then the only safe way to check
+for GS is the slower method: the RDMSR.
+
+So we try only to mark those entry methods 'paranoid' that absolutely
+need the more expensive check for the GS base - and we generate all
+'normal' entry points with the regular (faster) entry macros.
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 8a445a0..72c4a77 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -9,6 +9,8 @@
 /*
  * entry.S contains the system-call and fault low-level handling routines.
  *
+ * Some of this is documented in Documentation/x86/entry_64.txt
+ *
  * NOTE: This code handles signal-recognition, which happens every time
  * after an interrupt and after each system call.
  *
-- 
1.7.5.1



* [PATCH v4 03/10] x86-64: Give vvars their own page
  2011-05-31 14:13 [PATCH v4 00/10] Remove syscall instructions at fixed addresses Andy Lutomirski
  2011-05-31 14:13 ` [PATCH v4 01/10] x86-64: Fix alignment of jiffies variable Andy Lutomirski
  2011-05-31 14:14 ` [PATCH v4 02/10] x86-64: Document some of entry_64.S Andy Lutomirski
@ 2011-05-31 14:14 ` Andy Lutomirski
  2011-05-31 17:17   ` Louis Rilling
  2011-05-31 14:14 ` [PATCH v4 04/10] x86-64: Remove kernel.vsyscall64 sysctl Andy Lutomirski
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2011-05-31 14:14 UTC (permalink / raw)
  To: Ingo Molnar, x86
  Cc: Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen,
	Andy Lutomirski

Move vvars out of the vsyscall page into their own page and mark it
NX.

Without this patch, an attacker who can force a daemon to call some
fixed address could wait until the time contains, say, 0xCD80, and
then execute the current time.
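To make the attack concrete (a sketch; the specific value is hypothetical): time_t is stored little-endian, so a seconds value whose low 16 bits are 0x80cd places the bytes cd 80 -- the encoding of int $0x80 -- at a fixed address that was previously executable.

```python
import struct

INT_0X80 = b"\xcd\x80"  # x86 encoding of the "int $0x80" instruction

# Hypothetical seconds value whose low 16 bits are 0x80cd; stored
# little-endian, its first two bytes in memory are exactly cd 80.
t = 0x80CD
raw = struct.pack("<q", t)  # time_t as a little-endian signed 64-bit value
assert raw.startswith(INT_0X80)
```

An attacker who can redirect a daemon to that address only has to wait for the clock to tick into a suitable value; mapping the vvar page NX closes that window.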

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/include/asm/fixmap.h        |    1 +
 arch/x86/include/asm/pgtable_types.h |    2 ++
 arch/x86/include/asm/vvar.h          |   22 ++++++++++------------
 arch/x86/kernel/vmlinux.lds.S        |   27 ++++++++++++++++-----------
 arch/x86/kernel/vsyscall_64.c        |    5 +++++
 5 files changed, 34 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 4729b2b..460c74e 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -78,6 +78,7 @@ enum fixed_addresses {
 	VSYSCALL_LAST_PAGE,
 	VSYSCALL_FIRST_PAGE = VSYSCALL_LAST_PAGE
 			    + ((VSYSCALL_END-VSYSCALL_START) >> PAGE_SHIFT) - 1,
+	VVAR_PAGE,
 	VSYSCALL_HPET,
 #endif
 	FIX_DBGP_BASE,
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index d56187c..6a29aed6 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -108,6 +108,7 @@
 #define __PAGE_KERNEL_UC_MINUS		(__PAGE_KERNEL | _PAGE_PCD)
 #define __PAGE_KERNEL_VSYSCALL		(__PAGE_KERNEL_RX | _PAGE_USER)
 #define __PAGE_KERNEL_VSYSCALL_NOCACHE	(__PAGE_KERNEL_VSYSCALL | _PAGE_PCD | _PAGE_PWT)
+#define __PAGE_KERNEL_VVAR		(__PAGE_KERNEL_RO | _PAGE_USER)
 #define __PAGE_KERNEL_LARGE		(__PAGE_KERNEL | _PAGE_PSE)
 #define __PAGE_KERNEL_LARGE_NOCACHE	(__PAGE_KERNEL | _PAGE_CACHE_UC | _PAGE_PSE)
 #define __PAGE_KERNEL_LARGE_EXEC	(__PAGE_KERNEL_EXEC | _PAGE_PSE)
@@ -130,6 +131,7 @@
 #define PAGE_KERNEL_LARGE_EXEC		__pgprot(__PAGE_KERNEL_LARGE_EXEC)
 #define PAGE_KERNEL_VSYSCALL		__pgprot(__PAGE_KERNEL_VSYSCALL)
 #define PAGE_KERNEL_VSYSCALL_NOCACHE	__pgprot(__PAGE_KERNEL_VSYSCALL_NOCACHE)
+#define PAGE_KERNEL_VVAR		__pgprot(__PAGE_KERNEL_VVAR)
 
 #define PAGE_KERNEL_IO			__pgprot(__PAGE_KERNEL_IO)
 #define PAGE_KERNEL_IO_NOCACHE		__pgprot(__PAGE_KERNEL_IO_NOCACHE)
diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
index a4eaca4..de656ac 100644
--- a/arch/x86/include/asm/vvar.h
+++ b/arch/x86/include/asm/vvar.h
@@ -10,15 +10,14 @@
  * In normal kernel code, they are used like any other variable.
  * In user code, they are accessed through the VVAR macro.
  *
- * Each of these variables lives in the vsyscall page, and each
- * one needs a unique offset within the little piece of the page
- * reserved for vvars.  Specify that offset in DECLARE_VVAR.
- * (There are 896 bytes available.  If you mess up, the linker will
- * catch it.)
+ * These variables live in a page of kernel data that has an extra RO
+ * mapping for userspace.  Each variable needs a unique offset within
+ * that page; specify that offset with the DECLARE_VVAR macro.  (If
+ * you mess up, the linker will catch it.)
  */
 
-/* Offset of vars within vsyscall page */
-#define VSYSCALL_VARS_OFFSET (3072 + 128)
+/* Base address of vvars.  This is not ABI. */
+#define VVAR_ADDRESS (-10*1024*1024 - 4096)
 
 #if defined(__VVAR_KERNEL_LDS)
 
@@ -26,17 +25,17 @@
  * right place.
  */
 #define DECLARE_VVAR(offset, type, name) \
-	EMIT_VVAR(name, VSYSCALL_VARS_OFFSET + offset)
+	EMIT_VVAR(name, offset)
 
 #else
 
 #define DECLARE_VVAR(offset, type, name)				\
 	static type const * const vvaraddr_ ## name =			\
-		(void *)(VSYSCALL_START + VSYSCALL_VARS_OFFSET + (offset));
+		(void *)(VVAR_ADDRESS + (offset));
 
 #define DEFINE_VVAR(type, name)						\
-	type __vvar_ ## name						\
-	__attribute__((section(".vsyscall_var_" #name), aligned(16)))
+	type name							\
+	__attribute__((section(".vvar_" #name), aligned(16)))
 
 #define VVAR(name) (*vvaraddr_ ## name)
 
@@ -49,4 +48,3 @@ DECLARE_VVAR(16, int, vgetcpu_mode)
 DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
 
 #undef DECLARE_VVAR
-#undef VSYSCALL_VARS_OFFSET
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 89aed99..3d89a00 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -161,12 +161,6 @@ SECTIONS
 
 #define VVIRT_OFFSET (VSYSCALL_ADDR - __vsyscall_0)
 #define VVIRT(x) (ADDR(x) - VVIRT_OFFSET)
-#define EMIT_VVAR(x, offset) .vsyscall_var_ ## x	\
-	ADDR(.vsyscall_0) + offset		 	\
-	: AT(VLOAD(.vsyscall_var_ ## x)) {     		\
-		*(.vsyscall_var_ ## x)			\
-	}						\
-	x = VVIRT(.vsyscall_var_ ## x);
 
 	. = ALIGN(4096);
 	__vsyscall_0 = .;
@@ -192,17 +186,28 @@ SECTIONS
 		*(.vsyscall_3)
 	}
 
-#define __VVAR_KERNEL_LDS
-#include <asm/vvar.h>
-#undef __VVAR_KERNEL_LDS
-
-	. = __vsyscall_0 + PAGE_SIZE;
+	. = ALIGN(__vsyscall_0 + PAGE_SIZE, PAGE_SIZE);
 
 #undef VSYSCALL_ADDR
 #undef VLOAD_OFFSET
 #undef VLOAD
 #undef VVIRT_OFFSET
 #undef VVIRT
+
+	__vvar_page = .;
+
+#define EMIT_VVAR(name, offset) .vvar_ ## name		\
+	(__vvar_page + offset) :			\
+	 AT(ADDR(.vvar_ ## name) - LOAD_OFFSET) {	\
+		*(.vvar_ ## name)			\
+	} :data
+
+#define __VVAR_KERNEL_LDS
+#include <asm/vvar.h>
+#undef __VVAR_KERNEL_LDS
+
+	. = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
+
 #undef EMIT_VVAR
 
 #endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 3e68218..3cf1cef 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -284,9 +284,14 @@ void __init map_vsyscall(void)
 {
 	extern char __vsyscall_0;
 	unsigned long physaddr_page0 = __pa_symbol(&__vsyscall_0);
+	extern char __vvar_page;
+	unsigned long physaddr_vvar_page = __pa_symbol(&__vvar_page);
 
 	/* Note that VSYSCALL_MAPPED_PAGES must agree with the code below. */
 	__set_fixmap(VSYSCALL_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL);
+	__set_fixmap(VVAR_PAGE, physaddr_vvar_page, PAGE_KERNEL_VVAR);
+	BUILD_BUG_ON((unsigned long)__fix_to_virt(VVAR_PAGE) !=
+		     (unsigned long)VVAR_ADDRESS);
 }
 
 static int __init vsyscall_init(void)
-- 
1.7.5.1



* [PATCH v4 04/10] x86-64: Remove kernel.vsyscall64 sysctl
  2011-05-31 14:13 [PATCH v4 00/10] Remove syscall instructions at fixed addresses Andy Lutomirski
                   ` (2 preceding siblings ...)
  2011-05-31 14:14 ` [PATCH v4 03/10] x86-64: Give vvars their own page Andy Lutomirski
@ 2011-05-31 14:14 ` Andy Lutomirski
  2011-05-31 14:14 ` [PATCH v4 05/10] x86-64: Map the HPET NX Andy Lutomirski
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2011-05-31 14:14 UTC (permalink / raw)
  To: Ingo Molnar, x86
  Cc: Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen,
	Andy Lutomirski

It's unnecessary overhead in code that's supposed to be highly
optimized.  Removing it allows us to remove one of the two syscall
instructions in the vsyscall page.

The only sensible use for it is for UML users, and it doesn't fully
address inconsistent vsyscall results on UML.  The real fix for UML
is to stop using vsyscalls entirely.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/include/asm/vgtod.h   |    1 -
 arch/x86/kernel/vsyscall_64.c  |   34 +------------------------
 arch/x86/vdso/vclock_gettime.c |   55 +++++++++++++++------------------------
 3 files changed, 22 insertions(+), 68 deletions(-)

diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index 646b4c1..aa5add8 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -11,7 +11,6 @@ struct vsyscall_gtod_data {
 	time_t		wall_time_sec;
 	u32		wall_time_nsec;
 
-	int		sysctl_enabled;
 	struct timezone sys_tz;
 	struct { /* extract of a clocksource struct */
 		cycle_t (*vread)(void);
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 3cf1cef..9b2f3f5 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -53,7 +53,6 @@ DEFINE_VVAR(int, vgetcpu_mode);
 DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data) =
 {
 	.lock = __SEQLOCK_UNLOCKED(__vsyscall_gtod_data.lock),
-	.sysctl_enabled = 1,
 };
 
 void update_vsyscall_tz(void)
@@ -103,15 +102,6 @@ static __always_inline int gettimeofday(struct timeval *tv, struct timezone *tz)
 	return ret;
 }
 
-static __always_inline long time_syscall(long *t)
-{
-	long secs;
-	asm volatile("syscall"
-		: "=a" (secs)
-		: "0" (__NR_time),"D" (t) : __syscall_clobber);
-	return secs;
-}
-
 static __always_inline void do_vgettimeofday(struct timeval * tv)
 {
 	cycle_t now, base, mask, cycle_delta;
@@ -122,8 +112,7 @@ static __always_inline void do_vgettimeofday(struct timeval * tv)
 		seq = read_seqbegin(&VVAR(vsyscall_gtod_data).lock);
 
 		vread = VVAR(vsyscall_gtod_data).clock.vread;
-		if (unlikely(!VVAR(vsyscall_gtod_data).sysctl_enabled ||
-			     !vread)) {
+		if (unlikely(!vread)) {
 			gettimeofday(tv,NULL);
 			return;
 		}
@@ -165,8 +154,6 @@ time_t __vsyscall(1) vtime(time_t *t)
 {
 	unsigned seq;
 	time_t result;
-	if (unlikely(!VVAR(vsyscall_gtod_data).sysctl_enabled))
-		return time_syscall(t);
 
 	do {
 		seq = read_seqbegin(&VVAR(vsyscall_gtod_data).lock);
@@ -227,22 +214,6 @@ static long __vsyscall(3) venosys_1(void)
 	return -ENOSYS;
 }
 
-#ifdef CONFIG_SYSCTL
-static ctl_table kernel_table2[] = {
-	{ .procname = "vsyscall64",
-	  .data = &vsyscall_gtod_data.sysctl_enabled, .maxlen = sizeof(int),
-	  .mode = 0644,
-	  .proc_handler = proc_dointvec },
-	{}
-};
-
-static ctl_table kernel_root_table2[] = {
-	{ .procname = "kernel", .mode = 0555,
-	  .child = kernel_table2 },
-	{}
-};
-#endif
-
 /* Assume __initcall executes before all user space. Hopefully kmod
    doesn't violate that. We'll find out if it does. */
 static void __cpuinit vsyscall_set_cpu(int cpu)
@@ -301,9 +272,6 @@ static int __init vsyscall_init(void)
 	BUG_ON((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime));
 	BUG_ON((VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE)));
 	BUG_ON((unsigned long) &vgetcpu != VSYSCALL_ADDR(__NR_vgetcpu));
-#ifdef CONFIG_SYSCTL
-	register_sysctl_table(kernel_root_table2);
-#endif
 	on_each_cpu(cpu_vsyscall_init, NULL, 1);
 	/* notifier priority > KVM */
 	hotcpu_notifier(cpu_vsyscall_notifier, 30);
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index a724905..cf54813 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -116,21 +116,21 @@ notrace static noinline int do_monotonic_coarse(struct timespec *ts)
 
 notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
-	if (likely(gtod->sysctl_enabled))
-		switch (clock) {
-		case CLOCK_REALTIME:
-			if (likely(gtod->clock.vread))
-				return do_realtime(ts);
-			break;
-		case CLOCK_MONOTONIC:
-			if (likely(gtod->clock.vread))
-				return do_monotonic(ts);
-			break;
-		case CLOCK_REALTIME_COARSE:
-			return do_realtime_coarse(ts);
-		case CLOCK_MONOTONIC_COARSE:
-			return do_monotonic_coarse(ts);
-		}
+	switch (clock) {
+	case CLOCK_REALTIME:
+		if (likely(gtod->clock.vread))
+			return do_realtime(ts);
+		break;
+	case CLOCK_MONOTONIC:
+		if (likely(gtod->clock.vread))
+			return do_monotonic(ts);
+		break;
+	case CLOCK_REALTIME_COARSE:
+		return do_realtime_coarse(ts);
+	case CLOCK_MONOTONIC_COARSE:
+		return do_monotonic_coarse(ts);
+	}
+
 	return vdso_fallback_gettime(clock, ts);
 }
 int clock_gettime(clockid_t, struct timespec *)
@@ -139,7 +139,7 @@ int clock_gettime(clockid_t, struct timespec *)
 notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
 {
 	long ret;
-	if (likely(gtod->sysctl_enabled && gtod->clock.vread)) {
+	if (likely(gtod->clock.vread)) {
 		if (likely(tv != NULL)) {
 			BUILD_BUG_ON(offsetof(struct timeval, tv_usec) !=
 				     offsetof(struct timespec, tv_nsec) ||
@@ -161,27 +161,14 @@ notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
 int gettimeofday(struct timeval *, struct timezone *)
 	__attribute__((weak, alias("__vdso_gettimeofday")));
 
-/* This will break when the xtime seconds get inaccurate, but that is
- * unlikely */
-
-static __always_inline long time_syscall(long *t)
-{
-	long secs;
-	asm volatile("syscall"
-		     : "=a" (secs)
-		     : "0" (__NR_time), "D" (t) : "cc", "r11", "cx", "memory");
-	return secs;
-}
-
+/*
+ * This will break when the xtime seconds get inaccurate, but that is
+ * unlikely
+ */
 notrace time_t __vdso_time(time_t *t)
 {
-	time_t result;
-
-	if (unlikely(!VVAR(vsyscall_gtod_data).sysctl_enabled))
-		return time_syscall(t);
-
 	/* This is atomic on x86_64 so we don't need any locks. */
-	result = ACCESS_ONCE(VVAR(vsyscall_gtod_data).wall_time_sec);
+	time_t result = ACCESS_ONCE(VVAR(vsyscall_gtod_data).wall_time_sec);
 
 	if (t)
 		*t = result;
-- 
1.7.5.1



* [PATCH v4 05/10] x86-64: Map the HPET NX
  2011-05-31 14:13 [PATCH v4 00/10] Remove syscall instructions at fixed addresses Andy Lutomirski
                   ` (3 preceding siblings ...)
  2011-05-31 14:14 ` [PATCH v4 04/10] x86-64: Remove kernel.vsyscall64 sysctl Andy Lutomirski
@ 2011-05-31 14:14 ` Andy Lutomirski
  2011-05-31 14:14 ` [PATCH v4 06/10] x86-64: Remove vsyscall number 3 (venosys) Andy Lutomirski
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2011-05-31 14:14 UTC (permalink / raw)
  To: Ingo Molnar, x86
  Cc: Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen,
	Andy Lutomirski

Currently the HPET mapping is a user-accessible syscall instruction
at a fixed address some of the time.  A sufficiently determined
hacker might be able to guess when.
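Concretely (a sketch; the counter value is hypothetical): the HPET mapping is just device data at a fixed address, and whenever its little-endian byte stream happens to contain 0f 05 -- the encoding of the syscall instruction -- a jump there at the right moment executes a syscall.

```python
SYSCALL = b"\x0f\x05"  # x86-64 encoding of the "syscall" instruction

# Hypothetical HPET main-counter value; stored little-endian, its low
# two bytes are 0f 05, i.e. an executable syscall instruction if the
# mapping is not NX.
counter = 0x050F
raw = counter.to_bytes(8, "little")
assert SYSCALL in raw
```

Mapping the HPET with the NX bit set makes such transient byte patterns harmless.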

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/include/asm/pgtable_types.h |    4 ++--
 arch/x86/kernel/hpet.c               |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 6a29aed6..013286a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -107,8 +107,8 @@
 #define __PAGE_KERNEL_NOCACHE		(__PAGE_KERNEL | _PAGE_PCD | _PAGE_PWT)
 #define __PAGE_KERNEL_UC_MINUS		(__PAGE_KERNEL | _PAGE_PCD)
 #define __PAGE_KERNEL_VSYSCALL		(__PAGE_KERNEL_RX | _PAGE_USER)
-#define __PAGE_KERNEL_VSYSCALL_NOCACHE	(__PAGE_KERNEL_VSYSCALL | _PAGE_PCD | _PAGE_PWT)
 #define __PAGE_KERNEL_VVAR		(__PAGE_KERNEL_RO | _PAGE_USER)
+#define __PAGE_KERNEL_VVAR_NOCACHE	(__PAGE_KERNEL_VVAR | _PAGE_PCD | _PAGE_PWT)
 #define __PAGE_KERNEL_LARGE		(__PAGE_KERNEL | _PAGE_PSE)
 #define __PAGE_KERNEL_LARGE_NOCACHE	(__PAGE_KERNEL | _PAGE_CACHE_UC | _PAGE_PSE)
 #define __PAGE_KERNEL_LARGE_EXEC	(__PAGE_KERNEL_EXEC | _PAGE_PSE)
@@ -130,8 +130,8 @@
 #define PAGE_KERNEL_LARGE_NOCACHE	__pgprot(__PAGE_KERNEL_LARGE_NOCACHE)
 #define PAGE_KERNEL_LARGE_EXEC		__pgprot(__PAGE_KERNEL_LARGE_EXEC)
 #define PAGE_KERNEL_VSYSCALL		__pgprot(__PAGE_KERNEL_VSYSCALL)
-#define PAGE_KERNEL_VSYSCALL_NOCACHE	__pgprot(__PAGE_KERNEL_VSYSCALL_NOCACHE)
 #define PAGE_KERNEL_VVAR		__pgprot(__PAGE_KERNEL_VVAR)
+#define PAGE_KERNEL_VVAR_NOCACHE	__pgprot(__PAGE_KERNEL_VVAR_NOCACHE)
 
 #define PAGE_KERNEL_IO			__pgprot(__PAGE_KERNEL_IO)
 #define PAGE_KERNEL_IO_NOCACHE		__pgprot(__PAGE_KERNEL_IO_NOCACHE)
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 6781765..e9f5605 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -71,7 +71,7 @@ static inline void hpet_set_mapping(void)
 {
 	hpet_virt_address = ioremap_nocache(hpet_address, HPET_MMAP_SIZE);
 #ifdef CONFIG_X86_64
-	__set_fixmap(VSYSCALL_HPET, hpet_address, PAGE_KERNEL_VSYSCALL_NOCACHE);
+	__set_fixmap(VSYSCALL_HPET, hpet_address, PAGE_KERNEL_VVAR_NOCACHE);
 #endif
 }
 
-- 
1.7.5.1



* [PATCH v4 06/10] x86-64: Remove vsyscall number 3 (venosys)
  2011-05-31 14:13 [PATCH v4 00/10] Remove syscall instructions at fixed addresses Andy Lutomirski
                   ` (4 preceding siblings ...)
  2011-05-31 14:14 ` [PATCH v4 05/10] x86-64: Map the HPET NX Andy Lutomirski
@ 2011-05-31 14:14 ` Andy Lutomirski
  2011-05-31 14:14 ` [PATCH v4 07/10] x86-64: Fill unused parts of the vsyscall page with 0xcc Andy Lutomirski
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2011-05-31 14:14 UTC (permalink / raw)
  To: Ingo Molnar, x86
  Cc: Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen,
	Andy Lutomirski

It has just segfaulted since April 2008 (a4928cff), so I'm pretty sure
that nothing uses it.  And having an empty section makes the linker
script a bit fragile.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/kernel/vmlinux.lds.S |    4 ----
 arch/x86/kernel/vsyscall_64.c |    4 ----
 2 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 3d89a00..dc8ac70 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -182,10 +182,6 @@ SECTIONS
 		*(.vsyscall_2)
 	}
 
-	.vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) {
-		*(.vsyscall_3)
-	}
-
 	. = ALIGN(__vsyscall_0 + PAGE_SIZE, PAGE_SIZE);
 
 #undef VSYSCALL_ADDR
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 9b2f3f5..c7fe325 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -209,9 +209,5 @@ vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
 	return 0;
 }
 
-static long __vsyscall(3) venosys_1(void)
-{
-	return -ENOSYS;
-}
 
 /* Assume __initcall executes before all user space. Hopefully kmod
-- 
1.7.5.1



* [PATCH v4 07/10] x86-64: Fill unused parts of the vsyscall page with 0xcc
  2011-05-31 14:13 [PATCH v4 00/10] Remove syscall instructions at fixed addresses Andy Lutomirski
                   ` (5 preceding siblings ...)
  2011-05-31 14:14 ` [PATCH v4 06/10] x86-64: Remove vsyscall number 3 (venosys) Andy Lutomirski
@ 2011-05-31 14:14 ` Andy Lutomirski
  2011-05-31 14:14 ` [PATCH v4 08/10] x86-64: Emulate legacy vsyscalls Andy Lutomirski
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2011-05-31 14:14 UTC (permalink / raw)
  To: Ingo Molnar, x86
  Cc: Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen,
	Andy Lutomirski

Jumping to a 0x00 byte might do almost anything, depending on the
bytes that follow it.  Jumping to a 0xcc byte always traps, because
0xcc is the one-byte int3 instruction.  So fill the unused parts of the
vsyscall page with 0xcc to make it useless for exploits to jump there.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/kernel/vmlinux.lds.S |   16 +++++++---------
 1 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index dc8ac70..c3b37d6 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -166,22 +166,20 @@ SECTIONS
 	__vsyscall_0 = .;
 
 	. = VSYSCALL_ADDR;
-	.vsyscall_0 : AT(VLOAD(.vsyscall_0)) {
+	.vsyscall : AT(VLOAD(.vsyscall)) {
 		*(.vsyscall_0)
-	} :user
 
-	. = ALIGN(L1_CACHE_BYTES);
-	.vsyscall_fn : AT(VLOAD(.vsyscall_fn)) {
+		. = ALIGN(L1_CACHE_BYTES);
 		*(.vsyscall_fn)
-	}
 
-	.vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) {
+		. = 1024;
 		*(.vsyscall_1)
-	}
-	.vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) {
+
+		. = 2048;
 		*(.vsyscall_2)
-	}
 
+		. = 4096;  /* Pad the whole page. */
+	} :user =0xcc
 	. = ALIGN(__vsyscall_0 + PAGE_SIZE, PAGE_SIZE);
 
 #undef VSYSCALL_ADDR
-- 
1.7.5.1



* [PATCH v4 08/10] x86-64: Emulate legacy vsyscalls
  2011-05-31 14:13 [PATCH v4 00/10] Remove syscall instructions at fixed addresses Andy Lutomirski
                   ` (6 preceding siblings ...)
  2011-05-31 14:14 ` [PATCH v4 07/10] x86-64: Fill unused parts of the vsyscall page with 0xcc Andy Lutomirski
@ 2011-05-31 14:14 ` Andy Lutomirski
  2011-05-31 15:35   ` Ingo Molnar
  2011-05-31 14:14 ` [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot Andy Lutomirski
  2011-05-31 14:14 ` [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule Andy Lutomirski
  9 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2011-05-31 14:14 UTC (permalink / raw)
  To: Ingo Molnar, x86
  Cc: Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen,
	Andy Lutomirski

There's a fair amount of code in the vsyscall page.  It contains a
syscall instruction (in the gettimeofday fallback) and who knows
what will happen if an exploit jumps into the middle of some other
code.

Reduce the risk by replacing the vsyscalls with short magic
incantations that cause the kernel to emulate the real vsyscalls.
These incantations are useless if entered in the middle.

This causes vsyscalls to be a little more expensive than real
syscalls.  Fortunately sensible programs don't use them.

Less fortunately, current glibc uses the vsyscall for time() even in
dynamic binaries.  So there's a CONFIG_UNSAFE_VSYSCALLS (default y)
option that leaves in the native code for time().  That should go
away in a while, once glibc gets fixed.

Some care is taken to make sure that tools like valgrind and
ThreadSpotter still work.

This patch is not perfect: the vread_tsc and vread_hpet functions
are still at a fixed address.  Fixing that might involve making
alternative patching work in the vDSO.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/Kconfig                   |   17 +++
 arch/x86/include/asm/irq_vectors.h |    6 +-
 arch/x86/include/asm/traps.h       |    4 +
 arch/x86/include/asm/vsyscall.h    |    6 +
 arch/x86/kernel/Makefile           |    1 +
 arch/x86/kernel/entry_64.S         |    2 +
 arch/x86/kernel/traps.c            |    4 +
 arch/x86/kernel/vsyscall_64.c      |  253 +++++++++++++++++++++---------------
 arch/x86/kernel/vsyscall_emu_64.S  |   42 ++++++
 9 files changed, 231 insertions(+), 104 deletions(-)
 create mode 100644 arch/x86/kernel/vsyscall_emu_64.S

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index da34972..79e5d8a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1646,6 +1646,23 @@ config COMPAT_VDSO
 
 	  If unsure, say Y.
 
+config UNSAFE_VSYSCALLS
+	def_bool y
+	prompt "Unsafe fast legacy vsyscalls"
+	depends on X86_64
+	---help---
+	  Legacy user code expects to be able to issue three syscalls
+	  by calling fixed addresses in kernel space.  If you say N,
+	  then the kernel traps and emulates these calls.  If you say
+	  Y, then there is actual executable code at a fixed address
+	  to implement time() efficiently.
+
+	  On a system with recent enough glibc (probably 2.14 or
+	  newer) and no static binaries, you can say N without a
+	  performance penalty to improve security.
+
+	  If unsure, say Y.
+
 config CMDLINE_BOOL
 	bool "Built-in kernel command line"
 	---help---
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 6e976ee..a563c50 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -17,7 +17,8 @@
  *  Vectors   0 ...  31 : system traps and exceptions - hardcoded events
  *  Vectors  32 ... 127 : device interrupts
  *  Vector  128         : legacy int80 syscall interface
- *  Vectors 129 ... INVALIDATE_TLB_VECTOR_START-1 : device interrupts
+ *  Vector  204         : legacy x86_64 vsyscall emulation
+ *  Vectors 129 ... INVALIDATE_TLB_VECTOR_START-1 except 204 : device interrupts
  *  Vectors INVALIDATE_TLB_VECTOR_START ... 255 : special interrupts
  *
  * 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table.
@@ -50,6 +51,9 @@
 #ifdef CONFIG_X86_32
 # define SYSCALL_VECTOR			0x80
 #endif
+#ifdef CONFIG_X86_64
+# define VSYSCALL_EMU_VECTOR		0xcc
+#endif
 
 /*
  * Vectors 0x30-0x3f are used for ISA interrupts.
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 0310da6..2bae0a5 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_X86_TRAPS_H
 #define _ASM_X86_TRAPS_H
 
+#include <linux/kprobes.h>
+
 #include <asm/debugreg.h>
 #include <asm/siginfo.h>			/* TRAP_TRACE, ... */
 
@@ -38,6 +40,7 @@ asmlinkage void alignment_check(void);
 asmlinkage void machine_check(void);
 #endif /* CONFIG_X86_MCE */
 asmlinkage void simd_coprocessor_error(void);
+asmlinkage void emulate_vsyscall(void);
 
 dotraplinkage void do_divide_error(struct pt_regs *, long);
 dotraplinkage void do_debug(struct pt_regs *, long);
@@ -64,6 +67,7 @@ dotraplinkage void do_alignment_check(struct pt_regs *, long);
 dotraplinkage void do_machine_check(struct pt_regs *, long);
 #endif
 dotraplinkage void do_simd_coprocessor_error(struct pt_regs *, long);
+dotraplinkage void do_emulate_vsyscall(struct pt_regs *, long);
 #ifdef CONFIG_X86_32
 dotraplinkage void do_iret_error(struct pt_regs *, long);
 #endif
diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index d555973..293ae08 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -31,6 +31,12 @@ extern struct timezone sys_tz;
 
 extern void map_vsyscall(void);
 
+/* Emulation */
+static inline bool in_vsyscall_page(unsigned long addr)
+{
+	return (addr & ~(PAGE_SIZE - 1)) == VSYSCALL_START;
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_VSYSCALL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 90b06d4..cc0469a 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -44,6 +44,7 @@ obj-y			+= probe_roms.o
 obj-$(CONFIG_X86_32)	+= sys_i386_32.o i386_ksyms_32.o
 obj-$(CONFIG_X86_64)	+= sys_x86_64.o x8664_ksyms_64.o
 obj-$(CONFIG_X86_64)	+= syscall_64.o vsyscall_64.o vread_tsc_64.o
+obj-$(CONFIG_X86_64)	+= vsyscall_emu_64.o
 obj-y			+= bootflag.o e820.o
 obj-y			+= pci-dma.o quirks.o topology.o kdebugfs.o
 obj-y			+= alternative.o i8253.o pci-nommu.o hw_breakpoint.o
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 72c4a77..e949793 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1123,6 +1123,8 @@ zeroentry spurious_interrupt_bug do_spurious_interrupt_bug
 zeroentry coprocessor_error do_coprocessor_error
 errorentry alignment_check do_alignment_check
 zeroentry simd_coprocessor_error do_simd_coprocessor_error
+zeroentry emulate_vsyscall do_emulate_vsyscall
+
 
 	/* Reload gs selector with exception handling */
 	/* edi:  new selector */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b9b6716..72f0f6a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -872,6 +872,10 @@ void __init trap_init(void)
 	set_bit(SYSCALL_VECTOR, used_vectors);
 #endif
 
+	BUG_ON(test_bit(VSYSCALL_EMU_VECTOR, used_vectors));
+	set_system_intr_gate(VSYSCALL_EMU_VECTOR, &emulate_vsyscall);
+	set_bit(VSYSCALL_EMU_VECTOR, used_vectors);
+
 	/*
 	 * Should be a barrier for any external CPU state:
 	 */
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index c7fe325..52ba392 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -32,6 +32,8 @@
 #include <linux/cpu.h>
 #include <linux/smp.h>
 #include <linux/notifier.h>
+#include <linux/syscalls.h>
+#include <linux/ratelimit.h>
 
 #include <asm/vsyscall.h>
 #include <asm/pgtable.h>
@@ -44,10 +46,7 @@
 #include <asm/desc.h>
 #include <asm/topology.h>
 #include <asm/vgtod.h>
-
-#define __vsyscall(nr) \
-		__attribute__ ((unused, __section__(".vsyscall_" #nr))) notrace
-#define __syscall_clobber "r11","cx","memory"
+#include <asm/traps.h>
 
 DEFINE_VVAR(int, vgetcpu_mode);
 DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data) =
@@ -84,73 +83,45 @@ void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
 	write_sequnlock_irqrestore(&vsyscall_gtod_data.lock, flags);
 }
 
-/* RED-PEN may want to readd seq locking, but then the variable should be
- * write-once.
- */
-static __always_inline void do_get_tz(struct timezone * tz)
+static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
+			      const char *message)
 {
-	*tz = VVAR(vsyscall_gtod_data).sys_tz;
+	struct task_struct *tsk;
+	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+
+	if (!show_unhandled_signals || !__ratelimit(&rs))
+		return;
+
+	tsk = current;
+
+	printk("%s%s[%d] %s ip:%lx sp:%lx ax:%lx si:%lx di:%lx",
+	       level, tsk->comm, task_pid_nr(tsk),
+	       message,
+	       regs->ip - 2, regs->sp, regs->ax, regs->si, regs->di);
+	if (!in_vsyscall_page(regs->ip - 2))
+		print_vma_addr(" in ", regs->ip - 2);
+	printk("\n");
 }
 
-static __always_inline int gettimeofday(struct timeval *tv, struct timezone *tz)
-{
-	int ret;
-	asm volatile("syscall"
-		: "=a" (ret)
-		: "0" (__NR_gettimeofday),"D" (tv),"S" (tz)
-		: __syscall_clobber );
-	return ret;
-}
+/* al values for each vsyscall; see vsyscall_emu_64.S for why. */
+static u8 vsyscall_nr_to_al[] = {0xcc, 0xce, 0xf0};
 
-static __always_inline void do_vgettimeofday(struct timeval * tv)
+static int al_to_vsyscall_nr(u8 al)
 {
-	cycle_t now, base, mask, cycle_delta;
-	unsigned seq;
-	unsigned long mult, shift, nsec;
-	cycle_t (*vread)(void);
-	do {
-		seq = read_seqbegin(&VVAR(vsyscall_gtod_data).lock);
-
-		vread = VVAR(vsyscall_gtod_data).clock.vread;
-		if (unlikely(!vread)) {
-			gettimeofday(tv,NULL);
-			return;
-		}
-
-		now = vread();
-		base = VVAR(vsyscall_gtod_data).clock.cycle_last;
-		mask = VVAR(vsyscall_gtod_data).clock.mask;
-		mult = VVAR(vsyscall_gtod_data).clock.mult;
-		shift = VVAR(vsyscall_gtod_data).clock.shift;
-
-		tv->tv_sec = VVAR(vsyscall_gtod_data).wall_time_sec;
-		nsec = VVAR(vsyscall_gtod_data).wall_time_nsec;
-	} while (read_seqretry(&VVAR(vsyscall_gtod_data).lock, seq));
-
-	/* calculate interval: */
-	cycle_delta = (now - base) & mask;
-	/* convert to nsecs: */
-	nsec += (cycle_delta * mult) >> shift;
-
-	while (nsec >= NSEC_PER_SEC) {
-		tv->tv_sec += 1;
-		nsec -= NSEC_PER_SEC;
-	}
-	tv->tv_usec = nsec / NSEC_PER_USEC;
+	int i;
+	for (i = 0; i < ARRAY_SIZE(vsyscall_nr_to_al); i++)
+		if (vsyscall_nr_to_al[i] == al)
+			return i;
+	return -1;
 }
 
-int __vsyscall(0) vgettimeofday(struct timeval * tv, struct timezone * tz)
-{
-	if (tv)
-		do_vgettimeofday(tv);
-	if (tz)
-		do_get_tz(tz);
-	return 0;
-}
+#ifdef CONFIG_UNSAFE_VSYSCALLS
 
 /* This will break when the xtime seconds get inaccurate, but that is
  * unlikely */
-time_t __vsyscall(1) vtime(time_t *t)
+time_t __attribute__ ((unused, __section__(".vsyscall_1"))) notrace
+vtime(time_t *t)
 {
 	unsigned seq;
 	time_t result;
@@ -167,48 +138,127 @@ time_t __vsyscall(1) vtime(time_t *t)
 	return result;
 }
 
-/* Fast way to get current CPU and node.
-   This helps to do per node and per CPU caches in user space.
-   The result is not guaranteed without CPU affinity, but usually
-   works out because the scheduler tries to keep a thread on the same
-   CPU.
+#endif /* CONFIG_UNSAFE_VSYSCALLS */
+
+/* If CONFIG_UNSAFE_VSYSCALLS=y, then this is incorrect for vsyscall_nr == 1. */
+static inline unsigned long vsyscall_intcc_addr(int vsyscall_nr)
+{
+	return VSYSCALL_START + 1024*vsyscall_nr + 2;
+}
 
-   tcache must point to a two element sized long array.
-   All arguments can be NULL. */
-long __vsyscall(2)
-vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
+void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
 {
-	unsigned int p;
-	unsigned long j = 0;
-
-	/* Fast cache - only recompute value once per jiffies and avoid
-	   relatively costly rdtscp/cpuid otherwise.
-	   This works because the scheduler usually keeps the process
-	   on the same CPU and this syscall doesn't guarantee its
-	   results anyways.
-	   We do this here because otherwise user space would do it on
-	   its own in a likely inferior way (no access to jiffies).
-	   If you don't like it pass NULL. */
-	if (tcache && tcache->blob[0] == (j = VVAR(jiffies))) {
-		p = tcache->blob[1];
-	} else if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
-		/* Load per CPU data from RDTSCP */
-		native_read_tscp(&p);
-	} else {
-		/* Load per CPU data from GDT */
-		asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
+	static DEFINE_RATELIMIT_STATE(rs, 3600 * HZ, 3);
+	struct task_struct *tsk;
+	const char *vsyscall_name;
+	int vsyscall_nr;
+	long ret;
+
+	/* Kernel code must never get here. */
+	BUG_ON(!user_mode(regs));
+
+	local_irq_enable();
+
+	vsyscall_nr = al_to_vsyscall_nr(regs->ax & 0xff);
+	if (vsyscall_nr < 0) {
+		warn_bad_vsyscall(KERN_WARNING, regs, "illegal int 0xcc "
+				  "(exploit attempt?)");
+		goto sigsegv;
 	}
-	if (tcache) {
-		tcache->blob[0] = j;
-		tcache->blob[1] = p;
+
+	if (regs->ip - 2 != vsyscall_intcc_addr(vsyscall_nr)) {
+		if (in_vsyscall_page(regs->ip - 2)) {
+			/* This should not be possible. */
+			warn_bad_vsyscall(KERN_WARNING, regs,
+					  "int 0xcc bogus magic "
+					  "(exploit attempt?)");
+			goto sigsegv;
+		} else {
+			/*
+			 * We allow the call because tools like ThreadSpotter
+			 * might copy the int 0xcc instruction to user memory.
+			 * We make it annoying, though, to try to persuade
+			 * the authors to stop doing that...
+			 */
+			warn_bad_vsyscall(KERN_WARNING, regs,
+					  "int 0xcc in user code "
+					  "(exploit attempt? legacy "
+					  "instrumented code?)");
+		}
 	}
-	if (cpu)
-		*cpu = p & 0xfff;
-	if (node)
-		*node = p >> 12;
-	return 0;
-}
 
+	tsk = current;
+	if (tsk->seccomp.mode) {
+		do_exit(SIGKILL);
+		goto out;
+	}
+
+	switch (vsyscall_nr) {
+	case 0:
+		vsyscall_name = "gettimeofday";
+		ret = sys_gettimeofday(
+			(struct timeval __user *)regs->di,
+			(struct timezone __user *)regs->si);
+		break;
+
+	case 1:
+#ifdef CONFIG_UNSAFE_VSYSCALLS
+		warn_bad_vsyscall(KERN_WARNING, regs, "bogus time() vsyscall "
+				  "emulation (exploit attempt?)");
+		goto sigsegv;
+#else
+		vsyscall_name = "time";
+		ret = sys_time((time_t __user *)regs->di);
+		break;
+#endif
+
+	case 2:
+		vsyscall_name = "getcpu";
+		ret = sys_getcpu((unsigned __user *)regs->di,
+				 (unsigned __user *)regs->si,
+				 0);
+		break;
+
+	default:
+		BUG();
+	}
+
+	if (ret == -EFAULT) {
+		/*
+		 * Bad news -- userspace fed a bad pointer to a vsyscall.
+		 *
+		 * With a real vsyscall, that would have caused SIGSEGV.
+		 * To make writing reliable exploits using the emulated
+		 * vsyscalls harder, generate SIGSEGV here as well.
+		 */
+		warn_bad_vsyscall(KERN_INFO, regs,
+				  "vsyscall fault (exploit attempt?)");
+		goto sigsegv;
+	}
+
+	regs->ax = ret;
+
+	if (__ratelimit(&rs)) {
+		unsigned long caller;
+		if (get_user(caller, (unsigned long __user *)regs->sp))
+			caller = 0;  /* no need to crash on this fault. */
+		printk(KERN_INFO "%s[%d] emulated legacy vsyscall %s(); "
+		       "upgrade your code to avoid a performance hit. "
+		       "ip:%lx sp:%lx caller:%lx",
+		       tsk->comm, task_pid_nr(tsk), vsyscall_name,
+		       regs->ip - 2, regs->sp, caller);
+		if (caller)
+			print_vma_addr(" in ", caller);
+		printk("\n");
+	}
+
+out:
+	local_irq_disable();
+	return;
+
+sigsegv:
+	regs->ip -= 2;  /* The faulting instruction should be the int 0xcc. */
+	force_sig(SIGSEGV, current);
 }
 
 /* Assume __initcall executes before all user space. Hopefully kmod
@@ -264,11 +314,8 @@ void __init map_vsyscall(void)
 
 static int __init vsyscall_init(void)
 {
-	BUG_ON(((unsigned long) &vgettimeofday !=
-			VSYSCALL_ADDR(__NR_vgettimeofday)));
-	BUG_ON((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime));
-	BUG_ON((VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE)));
-	BUG_ON((unsigned long) &vgetcpu != VSYSCALL_ADDR(__NR_vgetcpu));
+	BUG_ON(VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE));
+
 	on_each_cpu(cpu_vsyscall_init, NULL, 1);
 	/* notifier priority > KVM */
 	hotcpu_notifier(cpu_vsyscall_notifier, 30);
diff --git a/arch/x86/kernel/vsyscall_emu_64.S b/arch/x86/kernel/vsyscall_emu_64.S
new file mode 100644
index 0000000..7ebde61
--- /dev/null
+++ b/arch/x86/kernel/vsyscall_emu_64.S
@@ -0,0 +1,42 @@
+/*
+ * vsyscall_emu_64.S: Vsyscall emulation page
+ * Copyright (c) 2011 Andy Lutomirski
+ * Subject to the GNU General Public License, version 2
+*/
+
+#include <linux/linkage.h>
+#include <asm/irq_vectors.h>
+
+/*
+ * These magic incantations are chosen so that they fault if entered anywhere
+ * other than an instruction boundary.  The movb instruction is two bytes, and
+ * the int imm8 instruction is also two bytes, so the only misaligned places
+ * to enter are the immediate values for the two instructions.  0xcc is int3
+ * (always faults), 0xce is into (faults on x86-64, and 32-bit code can't get
+ * here), and 0xf0 is lock (lock int is invalid).
+ *
+ * The unused parts of the page are filled with 0xcc by the linker script.
+ */
+
+.section .vsyscall_0, "a"
+ENTRY(vsyscall_0)
+	movb $0xcc, %al
+	int $VSYSCALL_EMU_VECTOR
+	ret
+END(vsyscall_0)
+
+#ifndef CONFIG_UNSAFE_VSYSCALLS
+.section .vsyscall_1, "a"
+ENTRY(vsyscall_1)
+	movb $0xce, %al
+	int $VSYSCALL_EMU_VECTOR
+	ret
+END(vsyscall_1)
+#endif
+
+.section .vsyscall_2, "a"
+ENTRY(vsyscall_2)
+	movb $0xf0, %al
+	int $VSYSCALL_EMU_VECTOR
+	ret
+END(vsyscall_2)
-- 
1.7.5.1



* [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot
  2011-05-31 14:13 [PATCH v4 00/10] Remove syscall instructions at fixed addresses Andy Lutomirski
                   ` (7 preceding siblings ...)
  2011-05-31 14:14 ` [PATCH v4 08/10] x86-64: Emulate legacy vsyscalls Andy Lutomirski
@ 2011-05-31 14:14 ` Andy Lutomirski
  2011-05-31 15:40   ` Ingo Molnar
  2011-05-31 14:14 ` [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule Andy Lutomirski
  9 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2011-05-31 14:14 UTC (permalink / raw)
  To: Ingo Molnar, x86
  Cc: Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen,
	Andy Lutomirski

This is not a security feature.  It's to prevent the int 0xcc
sequence from becoming ABI.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/kernel/vsyscall_64.c |   54 +++++++++++++++++++++++++++++++++++++++-
 1 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 52ba392..277c47c 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -34,6 +34,8 @@
 #include <linux/notifier.h>
 #include <linux/syscalls.h>
 #include <linux/ratelimit.h>
+#include <linux/random.h>
+#include <linux/highmem.h>
 
 #include <asm/vsyscall.h>
 #include <asm/pgtable.h>
@@ -54,6 +56,8 @@ DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data) =
 	.lock = __SEQLOCK_UNLOCKED(__vsyscall_gtod_data.lock),
 };
 
+static u8 vsyscall_nr_offset;  /* Cyclic permutation of al. */
+
 void update_vsyscall_tz(void)
 {
 	unsigned long flags;
@@ -95,10 +99,11 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
 
 	tsk = current;
 
-	printk("%s%s[%d] %s ip:%lx sp:%lx ax:%lx si:%lx di:%lx",
+	printk("%s%s[%d] %s ip:%lx sp:%lx ax:%lx offset:%d si:%lx di:%lx",
 	       level, tsk->comm, task_pid_nr(tsk),
 	       message,
-	       regs->ip - 2, regs->sp, regs->ax, regs->si, regs->di);
+	       regs->ip - 2, regs->sp, regs->ax, (int)vsyscall_nr_offset,
+	       regs->si, regs->di);
 	if (!in_vsyscall_page(regs->ip - 2))
 		print_vma_addr(" in ", regs->ip - 2);
 	printk("\n");
@@ -116,6 +121,17 @@ static int al_to_vsyscall_nr(u8 al)
 	return -1;
 }
 
+static inline u8 mangle_al(u8 al)
+{
+	int nr = al_to_vsyscall_nr(al);
+	return vsyscall_nr_to_al[(nr + vsyscall_nr_offset) % 3];
+}
+
+static inline int demangle_vsyscall_nr(int nr)
+{
+	return (nr + 3 - vsyscall_nr_offset) % 3;
+}
+
 #ifdef CONFIG_UNSAFE_VSYSCALLS
 
 /* This will break when the xtime seconds get inaccurate, but that is
@@ -165,6 +181,7 @@ void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
 				  "(exploit attempt?)");
 		goto sigsegv;
 	}
+	vsyscall_nr = demangle_vsyscall_nr(vsyscall_nr);
 
 	if (regs->ip - 2 != vsyscall_intcc_addr(vsyscall_nr)) {
 		if (in_vsyscall_page(regs->ip - 2)) {
@@ -312,10 +329,43 @@ void __init map_vsyscall(void)
 		     (unsigned long)VVAR_ADDRESS);
 }
 
+static void __init mangle_vsyscall_movb(void *mapping,
+					unsigned long movb_addr, u8 initial)
+{
+	u8 *imm8;
+	BUG_ON(movb_addr >= 4095);
+
+	imm8 = (char *)(mapping) + movb_addr + 1;
+
+	BUG_ON(*imm8 != initial);
+	*imm8 = mangle_al(*imm8);
+}
+
 static int __init vsyscall_init(void)
 {
+	extern char __vsyscall_0;
+	void *mapping;
+	struct page *vsyscall_page;
+
 	BUG_ON(VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE));
 
+	/*
+	 * Randomize the magic al values for int 0xcc invocation.  This
+	 * isn't really a security feature; it's to make sure that
+	 * dynamic binary instrumentation tools don't start to think
+	 * that the int 0xcc magic incantation is ABI.
+	 */
+	vsyscall_nr_offset = get_random_int() % 3;
+	vsyscall_page = pfn_to_page(__pa_symbol(&__vsyscall_0) >> PAGE_SHIFT);
+	mapping = kmap_atomic(vsyscall_page);
+	/* It's easier to hardcode the addresses -- they're ABI. */
+	mangle_vsyscall_movb(mapping, 0, 0xcc);
+#ifndef CONFIG_UNSAFE_VSYSCALLS
+	mangle_vsyscall_movb(mapping, 1024, 0xce);
+#endif
+	mangle_vsyscall_movb(mapping, 2048, 0xf0);
+	kunmap_atomic(mapping);
+
 	on_each_cpu(cpu_vsyscall_init, NULL, 1);
 	/* notifier priority > KVM */
 	hotcpu_notifier(cpu_vsyscall_notifier, 30);
-- 
1.7.5.1



* [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule
  2011-05-31 14:13 [PATCH v4 00/10] Remove syscall instructions at fixed addresses Andy Lutomirski
                   ` (8 preceding siblings ...)
  2011-05-31 14:14 ` [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot Andy Lutomirski
@ 2011-05-31 14:14 ` Andy Lutomirski
  2011-05-31 18:34   ` Andi Kleen
  9 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2011-05-31 14:14 UTC (permalink / raw)
  To: Ingo Molnar, x86
  Cc: Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen,
	Andy Lutomirski

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 Documentation/feature-removal-schedule.txt |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 1a9446b..94b4470 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -600,3 +600,12 @@ Why:	Superseded by the UVCIOC_CTRL_QUERY ioctl.
 Who:	Laurent Pinchart <laurent.pinchart@ideasonboard.com>
 
 ----------------------------
+
+What:	CONFIG_UNSAFE_VSYSCALLS (x86_64)
+When:	When glibc 2.14 or newer is ubiquitous.  Perhaps mid-2012.
+Why:	Having user-executable code at a fixed address is a security problem.
+	Turning off CONFIG_UNSAFE_VSYSCALLS mostly removes the risk but will
+	make the time() function slower on glibc versions 2.13 and below.
+Who:	Andy Lutomirski <luto@mit.edu>
+
+----------------------------
-- 
1.7.5.1



* Re: [PATCH v4 08/10] x86-64: Emulate legacy vsyscalls
  2011-05-31 14:14 ` [PATCH v4 08/10] x86-64: Emulate legacy vsyscalls Andy Lutomirski
@ 2011-05-31 15:35   ` Ingo Molnar
  0 siblings, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2011-05-31 15:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen


* Andy Lutomirski <luto@mit.edu> wrote:

> +#ifdef CONFIG_X86_64
> +# define VSYSCALL_EMU_VECTOR		0xcc
> +#endif

> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -872,6 +872,10 @@ void __init trap_init(void)
>  	set_bit(SYSCALL_VECTOR, used_vectors);
>  #endif
>  
> +	BUG_ON(test_bit(VSYSCALL_EMU_VECTOR, used_vectors));
> +	set_system_intr_gate(VSYSCALL_EMU_VECTOR, &emulate_vsyscall);
> +	set_bit(VSYSCALL_EMU_VECTOR, used_vectors);

That is a generic section of traps.c so it won't build (and makes no 
point) on 32-bit.

	Ingo


* Re: [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot
  2011-05-31 14:14 ` [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot Andy Lutomirski
@ 2011-05-31 15:40   ` Ingo Molnar
  2011-05-31 15:56     ` Andrew Lutomirski
  0 siblings, 1 reply; 28+ messages in thread
From: Ingo Molnar @ 2011-05-31 15:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen


* Andy Lutomirski <luto@MIT.EDU> wrote:

> This is not a security feature.  It's to prevent the int 0xcc
> sequence from becoming ABI.

Hm, what kind of ABI reliance could be built on it?

> +static void __init mangle_vsyscall_movb(void *mapping,
> +					unsigned long movb_addr, u8 initial)
> +{
> +	u8 *imm8;
> +	BUG_ON(movb_addr >= 4095);

Please put newlines after local variable definitions.

>  static int __init vsyscall_init(void)
>  {
> +	extern char __vsyscall_0;

Please don't put extern definitions in the middle of a .c file - if 
anything, it should be in a .h file (even if only a single function 
uses it).

> +	/*
> +	 * Randomize the magic al values for int 0xcc invocation.  This
> +	 * isn't really a security feature; it's to make sure that
> +	 * dynamic binary instrumentation tools don't start to think
> +	 * that the int 0xcc magic incantation is ABI.
> +	 */
> +	vsyscall_nr_offset = get_random_int() % 3;
> +	vsyscall_page = pfn_to_page(__pa_symbol(&__vsyscall_0) >> PAGE_SHIFT);
> +	mapping = kmap_atomic(vsyscall_page);
> +	/* It's easier to hardcode the addresses -- they're ABI. */
> +	mangle_vsyscall_movb(mapping, 0, 0xcc);

what about filling it with zeroes?

> +#ifndef CONFIG_UNSAFE_VSYSCALLS
> +	mangle_vsyscall_movb(mapping, 1024, 0xce);
> +#endif
> +	mangle_vsyscall_movb(mapping, 2048, 0xf0);

Dunno, this all looks rather ugly.

Thanks,

	Ingo


* Re: [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot
  2011-05-31 15:40   ` Ingo Molnar
@ 2011-05-31 15:56     ` Andrew Lutomirski
  2011-05-31 16:10       ` Andrew Lutomirski
  2011-05-31 16:42       ` Ingo Molnar
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Lutomirski @ 2011-05-31 15:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen

On Tue, May 31, 2011 at 11:40 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andy Lutomirski <luto@MIT.EDU> wrote:
>
>> This is not a security feature.  It's to prevent the int 0xcc
>> sequence from becoming ABI.
>
> Hm, what kind of ABI reliance could be built on it?

See:

http://lkml.kernel.org/r/<BANLkTi=zxA3xzc0pR14rqwaOUXUAyF__MQ%40mail.gmail.com>

I was surprised, too.

>
>> +static void __init mangle_vsyscall_movb(void *mapping,
>> +                                     unsigned long movb_addr, u8 initial)
>> +{
>> +     u8 *imm8;
>> +     BUG_ON(movb_addr >= 4095);
>
> Please put newlines after local variable definitions.

Yep.

>
>>  static int __init vsyscall_init(void)
>>  {
>> +     extern char __vsyscall_0;
>
> Please don't put extern definitions in the middle of a .c file - if
> anything, it should be in a .h file (even if only a single function
> uses it).

I thought the convention (and existing practice in vsyscall_64.c) was
that if the extern reference is to a magic linker symbol then it goes
in the function that uses it.  But I can find it a header file.

>
>> +     /*
>> +      * Randomize the magic al values for int 0xcc invocation.  This
>> +      * isn't really a security feature; it's to make sure that
>> +      * dynamic binary instrumentation tools don't start to think
>> +      * that the int 0xcc magic incantation is ABI.
>> +      */
>> +     vsyscall_nr_offset = get_random_int() % 3;
>> +     vsyscall_page = pfn_to_page(__pa_symbol(&__vsyscall_0) >> PAGE_SHIFT);
>> +     mapping = kmap_atomic(vsyscall_page);
>> +     /* It's easier to hardcode the addresses -- they're ABI. */
>> +     mangle_vsyscall_movb(mapping, 0, 0xcc);
>
> what about filling it with zeroes?

Fill what with zeroes?  I'm just patching one byte here.

>
>> +#ifndef CONFIG_UNSAFE_VSYSCALLS
>> +     mangle_vsyscall_movb(mapping, 1024, 0xce);
>> +#endif
>> +     mangle_vsyscall_movb(mapping, 2048, 0xf0);
>
> Dunno, this all looks rather ugly.

Agreed.  Better ideas are welcome.

We could scrap int 0xcc entirely and emulate on page fault, but that
is slower and has other problems (like breaking anything that thinks
it can look at a call target in a binary and dereference that
address).

Here's a possibly dumb/evil idea:

Put real syscalls in the vsyscall page but mark the page NX.  Then
emulate the vsyscalls on the PF_INSTR fault when userspace jumps to
the correct address but send SIGSEGV for the wrong address.

Down side: it's even more complexity for the same silly case.

--Andy

>
> Thanks,
>
>        Ingo
>


* Re: [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot
  2011-05-31 15:56     ` Andrew Lutomirski
@ 2011-05-31 16:10       ` Andrew Lutomirski
  2011-05-31 16:43         ` Ingo Molnar
  2011-05-31 16:42       ` Ingo Molnar
  1 sibling, 1 reply; 28+ messages in thread
From: Andrew Lutomirski @ 2011-05-31 16:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen

On Tue, May 31, 2011 at 11:56 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> We could scrap int 0xcc entirely and emulate on page fault, but that
> is slower and has other problems (like breaking anything that thinks
> it can look at a call target in a binary and dereference that
> address).
>
> Here's a possibly dumb/evil idea:
>
> Put real syscalls in the vsyscall page but mark the page NX.  Then
> emulate the vsyscalls on the PF_INSTR fault when userspace jumps to
> the correct address but send SIGSEGV for the wrong address.
>
> Down side: it's even more complexity for the same silly case.

Scratch that.  It's incompatible with keeping time() fast for now.

>
> --Andy
>
>>
>> Thanks,
>>
>>        Ingo
>>
>

* Re: [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot
  2011-05-31 15:56     ` Andrew Lutomirski
  2011-05-31 16:10       ` Andrew Lutomirski
@ 2011-05-31 16:42       ` Ingo Molnar
  2011-05-31 18:08         ` Andrew Lutomirski
  1 sibling, 1 reply; 28+ messages in thread
From: Ingo Molnar @ 2011-05-31 16:42 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: x86, Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen


* Andrew Lutomirski <luto@mit.edu> wrote:

> >>  static int __init vsyscall_init(void)
> >>  {
> >> +     extern char __vsyscall_0;
> >
> > Please don't put extern definitions in the middle of a .c file - if
> > then it should be in a .h file. (even if only a single function uses
> > it)
> 
> I thought the convention (and existing practice in vsyscall_64.c) 
> was that if the extern reference is to a magic linker symbol then 
> it goes in the function that uses it.  But I can put it in a header 
> file.

i'd suggest collecting them into a vsyscall header. The problem with 
externs in .c is that the moment two .c files start using it there's 
the danger of type divergence.

> >> +     /*
> >> +      * Randomize the magic al values for int 0xcc invocation.  This
> >> +      * isn't really a security feature; it's to make sure that
> >> +      * dynamic binary instrumentation tools don't start to think
> >> +      * that the int 0xcc magic incantation is ABI.
> >> +      */
> >> +     vsyscall_nr_offset = get_random_int() % 3;
> >> +     vsyscall_page = pfn_to_page(__pa_symbol(&__vsyscall_0) >> PAGE_SHIFT);
> >> +     mapping = kmap_atomic(vsyscall_page);
> >> +     /* It's easier to hardcode the addresses -- they're ABI. */
> >> +     mangle_vsyscall_movb(mapping, 0, 0xcc);
> >
> > what about filling it with zeroes?
> 
> Fill what with zeroes?  I'm just patching one byte here.

Sigh, i suck at reading comprehension today!

> >> +#ifndef CONFIG_UNSAFE_VSYSCALLS
> >> +     mangle_vsyscall_movb(mapping, 1024, 0xce);
> >> +#endif
> >> +     mangle_vsyscall_movb(mapping, 2048, 0xf0);
> >
> > Dunno, this all looks rather ugly.
> 
> Agreed.  Better ideas are welcome.

None at the moment except "don't randomize it and see where the chips 
may fall". I'd rather live with a somewhat sticky default-off compat 
Kconfig switch than some permanently ugly randomization to make the 
transition to no-vsyscall faster.

It's not like we'll be able to remove the vsyscall altogether from 
the kernel - the best we can hope for is to be able to flip the 
default - there's binaries out there today that rely on it and 
binaries are sticky - a few months ago i saw someone test-running 
1995 binaries ;-)

Btw., we could also make the vsyscall page vanish *runtime*, via a 
sysctl. That way distros only need to update their /etc/sysctl.conf.

> We could scrap int 0xcc entirely and emulate on page fault, but 
> that is slower and has other problems (like breaking anything that 
> thinks it can look at a call target in a binary and dereference 
> that address).
> 
> Here's a possibly dumb/evil idea:
> 
> Put real syscalls in the vsyscall page but mark the page NX.  Then 
> emulate the vsyscalls on the PF_INSTR fault when userspace jumps to 
> the correct address but send SIGSEGV for the wrong address.
> 
> Down side: it's even more complexity for the same silly case.

heh, you are good at coming up with sick ideas! ;-)

I don't think we want to add another branch to #PF, but could we turn 
this into #GP or perhaps an illegal instruction fault?

Should be benchmarked:

 - The advantage of INT 0xCC is that it's completely isolated: it 
   does not slow down anything else.

 - doing this through #GP might be significantly slower cycle-wise. 
   Do we know by how much?

The advantage would be that we would not waste an extra vector, it 
would be smaller, plus it would be rather simple to make it all a 
runtime toggle via a sysctl.

Thanks,

	Ingo


* Re: [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot
  2011-05-31 16:10       ` Andrew Lutomirski
@ 2011-05-31 16:43         ` Ingo Molnar
  0 siblings, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2011-05-31 16:43 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: x86, Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen


* Andrew Lutomirski <luto@mit.edu> wrote:

> On Tue, May 31, 2011 at 11:56 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> > We could scrap int 0xcc entirely and emulate on page fault, but that
> > is slower and has other problems (like breaking anything that thinks
> > it can look at a call target in a binary and dereference that
> > address).
> >
> > Here's a possibly dumb/evil idea:
> >
> > Put real syscalls in the vsyscall page but mark the page NX.  Then
> > emulate the vsyscalls on the PF_INSTR fault when userspace jumps to
> > the correct address but send SIGSEGV for the wrong address.
> >
> > Down side: it's even more complexity for the same silly case.
> 
> Scratch that.  It's incompatible with keeping time() fast for now.

If we can find another fault than #PF then it will be similarly fast 
to an INT $0xCC so please at least investigate this route.

Thanks,

	Ingo


* Re: [PATCH v4 03/10] x86-64: Give vvars their own page
  2011-05-31 14:14 ` [PATCH v4 03/10] x86-64: Give vvars their own page Andy Lutomirski
@ 2011-05-31 17:17   ` Louis Rilling
  0 siblings, 0 replies; 28+ messages in thread
From: Louis Rilling @ 2011-05-31 17:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, x86, Thomas Gleixner, linux-kernel, Jesper Juhl,
	Borislav Petkov, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Jan Beulich, richard -rw- weinberger, Mikael Pettersson,
	Andi Kleen


On 31/05/11 10:14 -0400, Andy Lutomirski wrote:
> Move vvars out of the vsyscall page into their own page and mark it
> NX.
> 
> Without this patch, an attacker who can force a daemon to call some
> fixed address could wait until the time contains, say, 0xCD80, and
> then execute the current time.
> 
> Signed-off-by: Andy Lutomirski <luto@mit.edu>
> ---
>  arch/x86/include/asm/fixmap.h        |    1 +
>  arch/x86/include/asm/pgtable_types.h |    2 ++
>  arch/x86/include/asm/vvar.h          |   22 ++++++++++------------
>  arch/x86/kernel/vmlinux.lds.S        |   27 ++++++++++++++++-----------
>  arch/x86/kernel/vsyscall_64.c        |    5 +++++
>  5 files changed, 34 insertions(+), 23 deletions(-)
> 

[...]

> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> index 89aed99..3d89a00 100644
> --- a/arch/x86/kernel/vmlinux.lds.S
> +++ b/arch/x86/kernel/vmlinux.lds.S
> @@ -161,12 +161,6 @@ SECTIONS
>  
>  #define VVIRT_OFFSET (VSYSCALL_ADDR - __vsyscall_0)
>  #define VVIRT(x) (ADDR(x) - VVIRT_OFFSET)
> -#define EMIT_VVAR(x, offset) .vsyscall_var_ ## x	\
> -	ADDR(.vsyscall_0) + offset		 	\
> -	: AT(VLOAD(.vsyscall_var_ ## x)) {     		\
> -		*(.vsyscall_var_ ## x)			\
> -	}						\
> -	x = VVIRT(.vsyscall_var_ ## x);
>  
>  	. = ALIGN(4096);
>  	__vsyscall_0 = .;
> @@ -192,17 +186,28 @@ SECTIONS
>  		*(.vsyscall_3)
>  	}
>  
> -#define __VVAR_KERNEL_LDS
> -#include <asm/vvar.h>
> -#undef __VVAR_KERNEL_LDS
> -
> -	. = __vsyscall_0 + PAGE_SIZE;
> +	. = ALIGN(__vsyscall_0 + PAGE_SIZE, PAGE_SIZE);
>  
>  #undef VSYSCALL_ADDR
>  #undef VLOAD_OFFSET
>  #undef VLOAD
>  #undef VVIRT_OFFSET
>  #undef VVIRT
> +
> +	__vvar_page = .;
> +
> +#define EMIT_VVAR(name, offset) .vvar_ ## name		\
> +	(__vvar_page + offset) :			\
> +	 AT(ADDR(.vvar_ ## name) - LOAD_OFFSET) {	\
> +		*(.vvar_ ## x)				\
                            ^
Maybe s/x/name/ ? -----------|

Thanks,

Louis

> +	} :data
> +
> +#define __VVAR_KERNEL_LDS
> +#include <asm/vvar.h>
> +#undef __VVAR_KERNEL_LDS
> +
> +       . = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
> +
>  #undef EMIT_VVAR
>  
>  #endif /* CONFIG_X86_64 */
> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
> index 3e68218..3cf1cef 100644
> --- a/arch/x86/kernel/vsyscall_64.c
> +++ b/arch/x86/kernel/vsyscall_64.c
> @@ -284,9 +284,14 @@ void __init map_vsyscall(void)
>  {
>  	extern char __vsyscall_0;
>  	unsigned long physaddr_page0 = __pa_symbol(&__vsyscall_0);
> +	extern char __vvar_page;
> +	unsigned long physaddr_vvar_page = __pa_symbol(&__vvar_page);
>  
>  	/* Note that VSYSCALL_MAPPED_PAGES must agree with the code below. */
>  	__set_fixmap(VSYSCALL_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL);
> +	__set_fixmap(VVAR_PAGE, physaddr_vvar_page, PAGE_KERNEL_VVAR);
> +	BUILD_BUG_ON((unsigned long)__fix_to_virt(VVAR_PAGE) !=
> +		     (unsigned long)VVAR_ADDRESS);
>  }
>  
>  static int __init vsyscall_init(void)
> -- 
> 1.7.5.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes


* Re: [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot
  2011-05-31 16:42       ` Ingo Molnar
@ 2011-05-31 18:08         ` Andrew Lutomirski
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Lutomirski @ 2011-05-31 18:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andi Kleen

On Tue, May 31, 2011 at 12:42 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andrew Lutomirski <luto@mit.edu> wrote:
>
>> >>  static int __init vsyscall_init(void)
>> >>  {
>> >> +     extern char __vsyscall_0;
>> >
>> > Please don't put extern definitions in the middle of a .c file - if
>> > then it should be in a .h file. (even if only a single function uses
>> > it)
>>
>> I thought the convention (and existing practice in vsyscall_64.c)
>> was that if the extern reference is to a magic linker symbol then
>> it goes in the function that uses it.  But I can put it in a header
>> file.
>
> i'd suggest collecting them into a vsyscall header. The problem with
> externs in .c is that the moment two .c files start using it there's
> the danger of type divergence.
>
>> >> +     /*
>> >> +      * Randomize the magic al values for int 0xcc invocation.  This
>> >> +      * isn't really a security feature; it's to make sure that
>> >> +      * dynamic binary instrumentation tools don't start to think
>> >> +      * that the int 0xcc magic incantation is ABI.
>> >> +      */
>> >> +     vsyscall_nr_offset = get_random_int() % 3;
>> >> +     vsyscall_page = pfn_to_page(__pa_symbol(&__vsyscall_0) >> PAGE_SHIFT);
>> >> +     mapping = kmap_atomic(vsyscall_page);
>> >> +     /* It's easier to hardcode the addresses -- they're ABI. */
>> >> +     mangle_vsyscall_movb(mapping, 0, 0xcc);
>> >
>> > what about filling it with zeroes?
>>
>> Fill what with zeroes?  I'm just patching one byte here.
>
> Sigh, i suck at reading comprehension today!
>
>> >> +#ifndef CONFIG_UNSAFE_VSYSCALLS
>> >> +     mangle_vsyscall_movb(mapping, 1024, 0xce);
>> >> +#endif
>> >> +     mangle_vsyscall_movb(mapping, 2048, 0xf0);
>> >
>> > Dunno, this all looks rather ugly.
>>
>> Agreed.  Better ideas are welcome.
>
> None at the moment except "don't randomize it and see where the chips
> may fall". I'd rather live with a somewhat sticky default-off compat
> Kconfig switch than some permanently ugly randomization to make the
> transition to no-vsyscall faster.
>
> It's not like we'll be able to remove the vsyscall altogether from
> the kernel - the best we can hope for is to be able to flip the
> default - there's binaries out there today that rely on it and
> binaries are sticky - a few months ago i saw someone test-running
> 1995 binaries ;-)
>
> Btw., we could also make the vsyscall page vanish *runtime*, via a
> sysctl. That way distros only need to update their /etc/sysctl.conf.
>
>> We could scrap int 0xcc entirely and emulate on page fault, but
>> that is slower and has other problems (like breaking anything that
>> thinks it can look at a call target in a binary and dereference
>> that address).
>>
>> Here's a possibly dumb/evil idea:
>>
>> Put real syscalls in the vsyscall page but mark the page NX.  Then
>> emulate the vsyscalls on the PF_INSTR fault when userspace jumps to
>> the correct address but send SIGSEGV for the wrong address.
>>
>> Down side: it's even more complexity for the same silly case.
>
> heh, you are good at coming up with sick ideas! ;-)
>
> I don't think we want to add another branch to #PF, but could we turn
> this into #GP or perhaps an illegal instruction fault?

I'm not seeing one right now.  But #PF can be done without any
fast-path impact by checking for a vsyscall after the normal
page-fault logic gives up.  That takes about 400ns on my machine, I
think.

>
> Should be benchmarked:
>
>  - The advantage of INT 0xCC is that it's completely isolated: it
>   does not slow down anything else.

220ns or so.

>
>  - doing this through #GP might be significantly slower cycle-wise.
>   Do we know by how much?

Not sure how to do it.  I imagine that #UD is almost the same as int cc, though.

>
> The advantage would be that we would not waste an extra vector, it
> would be smaller, plus it would be rather simple to make it all a
> runtime toggle via a sysctl.

I think this is all moot right now, though -- whatever we do has to
keep time() fast for awhile, and I think that means that we need a
real (even if illegal) instruction in gettimeofday.

So let's just drop the randomization patch and hope that the "int 0xcc
in user code (exploit attempt? legacy instrumented code?)" is enough
to keep people from doing dumb things.

--Andy


* Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule
  2011-05-31 14:14 ` [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule Andy Lutomirski
@ 2011-05-31 18:34   ` Andi Kleen
  2011-05-31 18:57     ` Thomas Gleixner
  2011-05-31 18:59     ` Andrew Lutomirski
  0 siblings, 2 replies; 28+ messages in thread
From: Andi Kleen @ 2011-05-31 18:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, x86, Thomas Gleixner, linux-kernel, Jesper Juhl,
	Borislav Petkov, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Jan Beulich, richard -rw- weinberger, Mikael Pettersson,
	Andi Kleen

> +What:	CONFIG_UNSAFE_VSYSCALLS (x86_64)
> +When:	When glibc 2.14 or newer is ubiquitous.  Perhaps mid-2012.
> +Why:	Having user-executable code at a fixed address is a security problem.
> +	Turning off CONFIG_UNSAFE_VSYSCALLS mostly removes the risk but will
> +	make the time() function slower on glibc versions 2.13 and below.

I disagree with this description (and the whole idea really)

First it's time+gettimeofday+vgetcpu, not just time.

A more accurate description is 

"will make all x86-64 Linux programs written to the original pre
vDSO syscall ABI significantly slower"

And the assumption that the whole world is using glibc is still as bad
as it was on the first post.

And it's still a bad idea. Especially since there's a much better
alternative anyways for the "security problem" which has none of 
these drawbacks.

-Andi


* Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule
  2011-05-31 18:34   ` Andi Kleen
@ 2011-05-31 18:57     ` Thomas Gleixner
  2011-05-31 18:59     ` Andrew Lutomirski
  1 sibling, 0 replies; 28+ messages in thread
From: Thomas Gleixner @ 2011-05-31 18:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, Ingo Molnar, x86, linux-kernel, Jesper Juhl,
	Borislav Petkov, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Jan Beulich, richard -rw- weinberger, Mikael Pettersson

On Tue, 31 May 2011, Andi Kleen wrote:
> > +What:	CONFIG_UNSAFE_VSYSCALLS (x86_64)
> > +When:	When glibc 2.14 or newer is ubiquitous.  Perhaps mid-2012.
> > +Why:	Having user-executable code at a fixed address is a security problem.
> > +	Turning off CONFIG_UNSAFE_VSYSCALLS mostly removes the risk but will
> > +	make the time() function slower on glibc versions 2.13 and below.
> 
> I disagree with this description (and the whole idea really)
> 
> First it's time+gettimeofday+vgetcpu, not just time.
> 
> A more accurate description is 
> 
> "will make all x86-64 Linux programs written to the original pre
> vDSO syscall ABI significantly slower"
> 
> And the assumption that the whole world is using glibc is still as bad
> as it was on the first post.
> 
> And it's still a bad idea. Especially since there's a much better
> alternative anyways for the "security problem" which has none of 
> these drawbacks.

How about posting an alternative patch?

Thanks,

	tglx


* Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule
  2011-05-31 18:34   ` Andi Kleen
  2011-05-31 18:57     ` Thomas Gleixner
@ 2011-05-31 18:59     ` Andrew Lutomirski
  2011-05-31 19:28       ` Ingo Molnar
  2011-06-08  8:50       ` [PATCH v4 " Ingo Molnar
  1 sibling, 2 replies; 28+ messages in thread
From: Andrew Lutomirski @ 2011-05-31 18:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, x86, Thomas Gleixner, linux-kernel, Jesper Juhl,
	Borislav Petkov, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Jan Beulich, richard -rw- weinberger, Mikael Pettersson

On Tue, May 31, 2011 at 2:34 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> +What:        CONFIG_UNSAFE_VSYSCALLS (x86_64)
>> +When:        When glibc 2.14 or newer is ubiquitous.  Perhaps mid-2012.
>> +Why: Having user-executable code at a fixed address is a security problem.
>> +     Turning off CONFIG_UNSAFE_VSYSCALLS mostly removes the risk but will
>> +     make the time() function slower on glibc versions 2.13 and below.
>
> I disagree with this description (and the whole idea really)
>
> First it's time+gettimeofday+vgetcu, not just time.
>
> A more accurate description is
>
> "will make all x86-64 Linux programs written to the original pre
> vDSO syscall ABI significantly slower"

Well, if this series goes in, then gettimeofday and getcpu are already
slower.  It's just time that would get even slower later on.

>
> And the assumption that the whole world is using glibc is still as bad
> as it was on the first post.

As opposed to?

uclibc and klibc don't appear to use vsyscalls or the vdso.

dietlibc is hardcoded to use the vsyscall.  Are there any
performance-critical programs using dietlibc?

I don't think that Bionic runs on any released x86-64 systems.

>
> And it's still a bad idea. Especially since there's a much better
> alternative anyways for the "security problem" which has none of
> these drawbacks.

What's the alternative?

--Andy


* Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule
  2011-05-31 18:59     ` Andrew Lutomirski
@ 2011-05-31 19:28       ` Ingo Molnar
  2011-05-31 19:36         ` Ingo Molnar
  2011-06-08  8:50       ` [PATCH v4 " Ingo Molnar
  1 sibling, 1 reply; 28+ messages in thread
From: Ingo Molnar @ 2011-05-31 19:28 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Andi Kleen, x86, Thomas Gleixner, linux-kernel, Jesper Juhl,
	Borislav Petkov, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Jan Beulich, richard -rw- weinberger, Mikael Pettersson


* Andrew Lutomirski <luto@mit.edu> wrote:

> > And it's still a bad idea. Especially since there's a much better 
> > alternative anyways for the "security problem" which has none of 
> > these drawbacks.
> 
> What's the alternative?

Well, Andi likes to draw out such answers and likes to keep any 
answer as minimally helpful as possible (to demonstrate his 
superiority), but my guess would be that he is thinking of the 
(trivial) solution that filters the caller RIP at the generic syscall 
entry point and checks RCX against the 'expected' SYSCALL instruction 
address, which is the (per task) vdso-address + constant-offset.

That method has a big disadvantage:

 - it slows down the syscall fastpath with two or three unnecessary 
   instructions.

It has two advantages:

 - it's the simplest method of all

 - it also *only* allows the vdso address to be used for system calls,
   so if an attacker manages to find an executable SYSCALL 
   instruction somewhere in the executable space of an application,
   that entry point will not be usable.

... so this method is not completely off the table.

If everyone agrees that the generic syscall overhead is acceptable we 
could try this too.

Thoughts?

Thanks,

	Ingo


* Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule
  2011-05-31 19:28       ` Ingo Molnar
@ 2011-05-31 19:36         ` Ingo Molnar
  2011-05-31 20:05           ` Andrew Lutomirski
  2011-08-06 20:18           ` [PATCH v3 " Andrew Lutomirski
  0 siblings, 2 replies; 28+ messages in thread
From: Ingo Molnar @ 2011-05-31 19:36 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Andi Kleen, x86, Thomas Gleixner, linux-kernel, Jesper Juhl,
	Borislav Petkov, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Jan Beulich, richard -rw- weinberger, Mikael Pettersson


* Ingo Molnar <mingo@elte.hu> wrote:

> [...] solution that filters the caller RIP at the generic syscall 
> entry point and checks RCX against the 'expected' SYSCALL 
> instruction address, which is the (per task) vdso-address + 
> constant-offset.

Note that this solution would allow the vsyscall page to be 
'filtered' to the 3 allowed system calls rather efficiently, via a 
second level check.

This second check does not affect the fastpath, and it could be put 
behind a CONFIG_COMPAT_VSYSCALL deprecation define once glibc does 
not put vsyscall references anywhere - but we could even keep it 
around forever, as this way it's defanged permanently.

Thanks,

	Ingo


* Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule
  2011-05-31 19:36         ` Ingo Molnar
@ 2011-05-31 20:05           ` Andrew Lutomirski
  2011-05-31 20:24             ` Ingo Molnar
  2011-08-06 20:18           ` [PATCH v3 " Andrew Lutomirski
  1 sibling, 1 reply; 28+ messages in thread
From: Andrew Lutomirski @ 2011-05-31 20:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, x86, Thomas Gleixner, linux-kernel, Jesper Juhl,
	Borislav Petkov, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Jan Beulich, richard -rw- weinberger, Mikael Pettersson

[Sorry, possible resend.]

On 5/31/11, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>> [...] solution that filters the caller RIP at the generic syscall
>> entry point and checks RCX against the 'expected' SYSCALL
>> instruction address, which is the (per task) vdso-address +
>> constant-offset.
>
> Note that this solution would allow the vsyscall page to be
> 'filtered' to the 3 allowed system calls rather efficiently, via a
> second level check.
>
> This second check does not affect the fastpath, and it could be put
> behind a CONFIG_COMPAT_VSYSCALL deprecation define once glibc does
> not put vsyscall references anywhere - but we could even keep it
> around forever, as this way it's defanged permanently.
>

Are you thinking about the 32-bit vDSO? I think that 64-bit code puts
syscall instructions all over the place.

How is this better than v2 of my series, stopping after the int cc fallback?

--Andy

> Thanks,
>
> 	Ingo
>


* Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule
  2011-05-31 20:05           ` Andrew Lutomirski
@ 2011-05-31 20:24             ` Ingo Molnar
  0 siblings, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2011-05-31 20:24 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Andi Kleen, x86, Thomas Gleixner, linux-kernel, Jesper Juhl,
	Borislav Petkov, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Jan Beulich, richard -rw- weinberger, Mikael Pettersson


* Andrew Lutomirski <luto@mit.edu> wrote:

> [Sorry, possible resend.]
> 
> On 5/31/11, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Ingo Molnar <mingo@elte.hu> wrote:
> >
> >> [...] solution that filters the caller RIP at the generic syscall
> >> entry point and checks RCX against the 'expected' SYSCALL
> >> instruction address, which is the (per task) vdso-address +
> >> constant-offset.
> >
> > Note that this solution would allow the vsyscall page to be
> > 'filtered' to the 3 allowed system calls rather efficiently, via a
> > second level check.
> >
> > This second check does not affect the fastpath, and it could be put
> > behind a CONFIG_COMPAT_VSYSCALL deprecation define once glibc does
> > not put vsyscall references anywhere - but we could even keep it
> > around forever, as this way it's defanged permanently.
> >
> 
> Are you thinking about the 32-bit vDSO? I think that 64-bit code puts
> syscalls instructions all over the place.

Yeah, it does in a few dozen places so RCX filtering would only work 
if we 'knew' about glibc's syscall range (it's available from the 
vma) and restricted syscalls to that boundary.

... which makes this solution rather fragile so i think we can 
disregard it.

Thanks,

	Ingo


* Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule
  2011-05-31 18:59     ` Andrew Lutomirski
  2011-05-31 19:28       ` Ingo Molnar
@ 2011-06-08  8:50       ` Ingo Molnar
  1 sibling, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2011-06-08  8:50 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Thomas Gleixner, linux-kernel, Jesper Juhl, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson, Andrew Lutomirski,
	H. Peter Anvin


* Andrew Lutomirski <luto@mit.edu> wrote:

> On Tue, May 31, 2011 at 2:34 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> > And it's still a bad idea. Especially since there's a much better 
> > alternative anyways for the "security problem" which has none of 
> > these drawbacks.
> 
> What's the alternative?

Andi, you still have not replied to Andy's, Thomas's and my question 
regarding your claim: exactly what 'alternative' did you mean here?

Do you plan to reply to our questions or will you continue to ignore 
them forever, keeping your FUD in the air indefinitely?

Thanks,

	Ingo


* Re: [PATCH v3 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule
  2011-05-31 19:36         ` Ingo Molnar
  2011-05-31 20:05           ` Andrew Lutomirski
@ 2011-08-06 20:18           ` Andrew Lutomirski
  1 sibling, 0 replies; 28+ messages in thread
From: Andrew Lutomirski @ 2011-08-06 20:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Lutomirski, Andi Kleen, x86, Thomas Gleixner,
	linux-kernel, Jesper Juhl, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Jan Beulich,
	richard -rw- weinberger, Mikael Pettersson

On Tuesday, May 31, 2011, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>> [...] solution that filters the caller RIP at the generic syscall
>> entry point and checks RCX against the 'expected' SYSCALL
>> instruction address, which is the (per task) vdso-address +
>> constant-offset.
>
> Note that this solution would allow the vsyscall page to be
> 'filtered' to the 3 allowed system calls rather efficiently, via a
> second level check.
>
> This second check does not affect the fastpath, and it could be put
> behind a CONFIG_COMPAT_VSYSCALL deprecation define once glibc does
> not put vsyscall references anywhere - but we could even keep it
> around forever, as this way it's defanged permanently.

I'm confused.  Are you thinking about the 32-bit vDSO?  I think that
64-bit code puts syscall instructions all over the place.

--Andy

>
> Thanks,
>
>         Ingo
>


end of thread, other threads:[~2011-08-06 20:18 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-31 14:13 [PATCH v4 00/10] Remove syscall instructions at fixed addresses Andy Lutomirski
2011-05-31 14:13 ` [PATCH v4 01/10] x86-64: Fix alignment of jiffies variable Andy Lutomirski
2011-05-31 14:14 ` [PATCH v4 02/10] x86-64: Document some of entry_64.S Andy Lutomirski
2011-05-31 14:14 ` [PATCH v4 03/10] x86-64: Give vvars their own page Andy Lutomirski
2011-05-31 17:17   ` Louis Rilling
2011-05-31 14:14 ` [PATCH v4 04/10] x86-64: Remove kernel.vsyscall64 sysctl Andy Lutomirski
2011-05-31 14:14 ` [PATCH v4 05/10] x86-64: Map the HPET NX Andy Lutomirski
2011-05-31 14:14 ` [PATCH v4 06/10] x86-64: Remove vsyscall number 3 (venosys) Andy Lutomirski
2011-05-31 14:14 ` [PATCH v4 07/10] x86-64: Fill unused parts of the vsyscall page with 0xcc Andy Lutomirski
2011-05-31 14:14 ` [PATCH v4 08/10] x86-64: Emulate legacy vsyscalls Andy Lutomirski
2011-05-31 15:35   ` Ingo Molnar
2011-05-31 14:14 ` [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot Andy Lutomirski
2011-05-31 15:40   ` Ingo Molnar
2011-05-31 15:56     ` Andrew Lutomirski
2011-05-31 16:10       ` Andrew Lutomirski
2011-05-31 16:43         ` Ingo Molnar
2011-05-31 16:42       ` Ingo Molnar
2011-05-31 18:08         ` Andrew Lutomirski
2011-05-31 14:14 ` [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule Andy Lutomirski
2011-05-31 18:34   ` Andi Kleen
2011-05-31 18:57     ` Thomas Gleixner
2011-05-31 18:59     ` Andrew Lutomirski
2011-05-31 19:28       ` Ingo Molnar
2011-05-31 19:36         ` Ingo Molnar
2011-05-31 20:05           ` Andrew Lutomirski
2011-05-31 20:24             ` Ingo Molnar
2011-08-06 20:18           ` [PATCH v3 " Andrew Lutomirski
2011-06-08  8:50       ` [PATCH v4 " Ingo Molnar
