* [PATCH v2 0/8] x86-64 vDSO changes for 3.1
From: Andy Lutomirski @ 2011-07-10  3:22 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, John Stultz, Borislav Petkov, Andy Lutomirski

This series applies to the x86/vdso branch of the -tip tree.

The first patch cleans up the vsyscall emulation code.

After the vsyscall emulation patches, the only real executable code left
in the vsyscall page is vread_tsc and vread_hpet.  That code is only
called from the vDSO, so patches 2-6 move it into the vDSO.

vread_tsc() uses rdtsc_barrier(), which contains two alternatives.
Patches 2 and 3 make alternative patching work in the vDSO.  (This has a
slightly odd side effect that the vDSO image dumped from memory doesn't
quite match the debug version anymore, but it's hard to imagine that
causing problems.)
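
For reference, rdtsc_barrier() is more or less the following sketch
(from memory, so treat the details as approximate; the real definition
lives in the x86 headers):

	static __always_inline void rdtsc_barrier(void)
	{
		/* Patched at boot into mfence or lfence depending on
		 * the CPU; on CPUs needing neither, they stay NOPs. */
		alternative(ASM_NOP3, "mfence", X86_FEATURE_MFENCE_RDTSC);
		alternative(ASM_NOP3, "lfence", X86_FEATURE_LFENCE_RDTSC);
	}

Each alternative() emits a record into .altinstructions, which is why
moving this code into the vDSO requires patching support there.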

Patch 4 fixes an annoyance I found while writing this code.  If you
introduce an undefined symbol into the vDSO, you get an unhelpful error
message.  ld is smart enough to give a nice error if you ask it to.

Patch 5 cleans up the architecture-specific part of struct clocksource.
IA64 had its own ifdefed code in there, and the supposedly generic vread
pointer was only used by x86-64.  With the patch, each arch gets to set
up its own private part of struct clocksource.

Patch 6 is the meat.  It moves vread_tsc and vread_hpet into the vDSO
where they belong, and it's a net deletion of code because it removes a
bunch of magic needed to make regular functions accessible through the
vsyscall page.

With patches 1-6 applied, every single byte in the vsyscall
page is some sort of trap instruction.

Patches 7 and 8 are optional.  Patch 7 changes IA64 to use the new arch
gtod data.  It presumably should not go in through the x86 tree.  Patch
8 adds some vDSO documentation and a reference vDSO parser for user code
to use.  It's meant for projects that don't dynamically link to glibc
(e.g. Go) but still want to call the vDSO.  Someone who knows more about
ELF than I do should take a look.

*** Note to IA64 people: I have not even compile-tested this on IA64. ***

Changes from v1:
 - Tidy up vDSO alternative patching (thanks, Borislav).
 - Fix really dumb bugs in the IA64 stuff.
 - Add the cleanup patch and the reference vDSO parser.
 - Split the main IA-64 patch out.

Andy Lutomirski (8):
  x86-64: Improve vsyscall emulation CS and RIP handling
  x86: Make alternative instruction pointers relative
  x86-64: Allow alternative patching in the vDSO
  x86-64: Add --no-undefined to vDSO build
  clocksource: Replace vread with generic arch data
  x86-64: Move vread_tsc and vread_hpet into the vDSO
  ia64: Replace clocksource.fsys_mmio with generic arch data
  Document the vDSO and add a reference parser

 Documentation/ABI/stable/vdso          |   27 ++++
 Documentation/vDSO/parse_vdso.c        |  256 ++++++++++++++++++++++++++++++++
 Documentation/vDSO/vdso_test.c         |  112 ++++++++++++++
 arch/ia64/include/asm/clocksource.h    |   12 ++
 arch/ia64/kernel/cyclone.c             |    2 +-
 arch/ia64/kernel/time.c                |    2 +-
 arch/ia64/sn/kernel/sn2/timer.c        |    2 +-
 arch/x86/include/asm/alternative-asm.h |    4 +-
 arch/x86/include/asm/alternative.h     |    8 +-
 arch/x86/include/asm/clocksource.h     |   20 +++
 arch/x86/include/asm/cpufeature.h      |    8 +-
 arch/x86/include/asm/tsc.h             |    4 -
 arch/x86/include/asm/vgtod.h           |    2 +-
 arch/x86/include/asm/vsyscall.h        |   16 --
 arch/x86/kernel/Makefile               |    7 +-
 arch/x86/kernel/alternative.c          |   23 ++--
 arch/x86/kernel/hpet.c                 |    9 +-
 arch/x86/kernel/tsc.c                  |    2 +-
 arch/x86/kernel/vmlinux.lds.S          |    3 -
 arch/x86/kernel/vread_tsc_64.c         |   36 -----
 arch/x86/kernel/vsyscall_64.c          |   61 +++++---
 arch/x86/lib/copy_page_64.S            |    9 +-
 arch/x86/lib/memmove_64.S              |   11 +-
 arch/x86/vdso/Makefile                 |    1 +
 arch/x86/vdso/vclock_gettime.c         |   52 ++++++-
 arch/x86/vdso/vma.c                    |   30 ++++
 drivers/char/hpet.c                    |    2 +-
 include/asm-generic/clocksource.h      |    4 +
 include/linux/clocksource.h            |   13 +-
 29 files changed, 589 insertions(+), 149 deletions(-)
 create mode 100644 Documentation/ABI/stable/vdso
 create mode 100644 Documentation/vDSO/parse_vdso.c
 create mode 100644 Documentation/vDSO/vdso_test.c
 create mode 100644 arch/ia64/include/asm/clocksource.h
 create mode 100644 arch/x86/include/asm/clocksource.h
 delete mode 100644 arch/x86/kernel/vread_tsc_64.c
 create mode 100644 include/asm-generic/clocksource.h

-- 
1.7.6



* [PATCH v2 1/8] x86-64: Improve vsyscall emulation CS and RIP handling
From: Andy Lutomirski @ 2011-07-10  3:22 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, John Stultz, Borislav Petkov, Andy Lutomirski

Two fixes here:
 - Send SIGSEGV if called from compat code or with a funny CS.
 - Don't BUG on impossible addresses.

This patch also removes an unused variable.
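
For context, the patch folds the old is_vsyscall_entry() and
vsyscall_entry_nr() helpers into one checked lookup.  The vsyscall
page's three entries sit 1024 bytes apart, so the decoding boils down
to (a sketch of the logic in the diff below):

	/* entries live at VSYSCALL_START + 0x0, 0x400, 0x800 */
	if ((addr & ~0xC00UL) != VSYSCALL_START)
		return -EINVAL;			/* not in the page */
	nr = (addr & 0xC00UL) >> 10;		/* 0, 1, or 2 */
	if (nr >= 3)
		return -EINVAL;			/* hole in the page */

and a bad entry number now segfaults the caller instead of hitting
BUG().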

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/include/asm/vsyscall.h |   12 --------
 arch/x86/kernel/vsyscall_64.c   |   59 +++++++++++++++++++++++++--------------
 2 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index bb710cb..d555973 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -31,18 +31,6 @@ extern struct timezone sys_tz;
 
 extern void map_vsyscall(void);
 
-/* Emulation */
-
-static inline bool is_vsyscall_entry(unsigned long addr)
-{
-	return (addr & ~0xC00UL) == VSYSCALL_START;
-}
-
-static inline int vsyscall_entry_nr(unsigned long addr)
-{
-	return (addr & 0xC00UL) >> 10;
-}
-
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_VSYSCALL_H */
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 10cd8ac..6d14848 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -38,6 +38,7 @@
 
 #include <asm/vsyscall.h>
 #include <asm/pgtable.h>
+#include <asm/compat.h>
 #include <asm/page.h>
 #include <asm/unistd.h>
 #include <asm/fixmap.h>
@@ -97,33 +98,60 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
 
 	tsk = current;
 
-	printk("%s%s[%d] %s ip:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
+	printk("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
 	       level, tsk->comm, task_pid_nr(tsk),
-	       message, regs->ip - 2, regs->sp, regs->ax, regs->si, regs->di);
+	       message, regs->ip - 2, regs->cs,
+	       regs->sp, regs->ax, regs->si, regs->di);
+}
+
+static int addr_to_vsyscall_nr(unsigned *vsyscall_nr, unsigned long addr)
+{
+	if ((addr & ~0xC00UL) != VSYSCALL_START)
+		return -EINVAL;
+
+	*vsyscall_nr = (addr & 0xC00UL) >> 10;
+	if (*vsyscall_nr >= 3)
+		return -EINVAL;
+
+	return 0;
 }
 
 void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
 {
-	const char *vsyscall_name;
 	struct task_struct *tsk;
 	unsigned long caller;
-	int vsyscall_nr;
+	unsigned vsyscall_nr;
 	long ret;
 
-	/* Kernel code must never get here. */
-	BUG_ON(!user_mode(regs));
-
 	local_irq_enable();
 
 	/*
+	 * Real 64-bit user mode code has cs == __USER_CS.  Anything else
+	 * is bogus.
+	 */
+	if (regs->cs != __USER_CS) {
+		/*
+		 * If we trapped from kernel mode, we might as well OOPS now
+		 * instead of returning to some random address and OOPSing
+		 * then.
+		 */
+		BUG_ON(!user_mode(regs));
+
+		/* Compat mode and non-compat 32-bit CS should both segfault. */
+		warn_bad_vsyscall(KERN_WARNING, regs,
+				  "illegal int 0xcc from 32-bit mode");
+		goto sigsegv;
+	}
+
+	/*
 	 * x86-ism here: regs->ip points to the instruction after the int 0xcc,
 	 * and int 0xcc is two bytes long.
 	 */
-	if (!is_vsyscall_entry(regs->ip - 2)) {
-		warn_bad_vsyscall(KERN_WARNING, regs, "illegal int 0xcc (exploit attempt?)");
+	if (addr_to_vsyscall_nr(&vsyscall_nr, regs->ip - 2) != 0) {
+		warn_bad_vsyscall(KERN_WARNING, regs,
+				  "illegal int 0xcc (exploit attempt?)");
 		goto sigsegv;
 	}
-	vsyscall_nr = vsyscall_entry_nr(regs->ip - 2);
 
 	if (get_user(caller, (unsigned long __user *)regs->sp) != 0) {
 		warn_bad_vsyscall(KERN_WARNING, regs, "int 0xcc with bad stack (exploit attempt?)");
@@ -136,31 +164,20 @@ void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
 
 	switch (vsyscall_nr) {
 	case 0:
-		vsyscall_name = "gettimeofday";
 		ret = sys_gettimeofday(
 			(struct timeval __user *)regs->di,
 			(struct timezone __user *)regs->si);
 		break;
 
 	case 1:
-		vsyscall_name = "time";
 		ret = sys_time((time_t __user *)regs->di);
 		break;
 
 	case 2:
-		vsyscall_name = "getcpu";
 		ret = sys_getcpu((unsigned __user *)regs->di,
 				 (unsigned __user *)regs->si,
 				 0);
 		break;
-
-	default:
-		/*
-		 * If we get here, then vsyscall_nr indicates that int 0xcc
-		 * happened at an address in the vsyscall page that doesn't
-		 * contain int 0xcc.  That can't happen.
-		 */
-		BUG();
 	}
 
 	if (ret == -EFAULT) {
-- 
1.7.6



* [PATCH v2 2/8] x86: Make alternative instruction pointers relative
From: Andy Lutomirski @ 2011-07-10  3:22 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, John Stultz, Borislav Petkov, Andy Lutomirski

This saves a few bytes on x86-64 and means that future patches can
apply alternatives to unrelocated code.
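
The idea is to store each pointer in struct alt_instr as a 32-bit
offset from the address of the field itself rather than as an absolute
address.  Resolving one is then equivalent to this sketch (the patch
open-codes the same arithmetic in apply_alternatives()):

	/* self-relative pointer: target = &field + *field */
	static inline u8 *rel_ptr(s32 *field)
	{
		return (u8 *)field + *field;
	}

Because the field and its target move together, the record needs no
relocation, which is what makes it usable from the unrelocated vDSO
image later in the series.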

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/include/asm/alternative-asm.h |    4 ++--
 arch/x86/include/asm/alternative.h     |    8 ++++----
 arch/x86/include/asm/cpufeature.h      |    8 ++++----
 arch/x86/kernel/alternative.c          |   21 +++++++++++++--------
 arch/x86/lib/copy_page_64.S            |    9 +++------
 arch/x86/lib/memmove_64.S              |   11 +++++------
 6 files changed, 31 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/alternative-asm.h b/arch/x86/include/asm/alternative-asm.h
index 94d420b..4554cc6 100644
--- a/arch/x86/include/asm/alternative-asm.h
+++ b/arch/x86/include/asm/alternative-asm.h
@@ -17,8 +17,8 @@
 
 .macro altinstruction_entry orig alt feature orig_len alt_len
 	.align 8
-	.quad \orig
-	.quad \alt
+	.long \orig - .
+	.long \alt - .
 	.word \feature
 	.byte \orig_len
 	.byte \alt_len
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index bf535f9..23fb6d7 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -43,8 +43,8 @@
 #endif
 
 struct alt_instr {
-	u8 *instr;		/* original instruction */
-	u8 *replacement;
+	s32 instr_offset;	/* original instruction */
+	s32 repl_offset;	/* offset to replacement instruction */
 	u16 cpuid;		/* cpuid bit set for replacement */
 	u8  instrlen;		/* length of original instruction */
 	u8  replacementlen;	/* length of new instruction, <= instrlen */
@@ -84,8 +84,8 @@ static inline int alternatives_text_reserved(void *start, void *end)
       "661:\n\t" oldinstr "\n662:\n"					\
       ".section .altinstructions,\"a\"\n"				\
       _ASM_ALIGN "\n"							\
-      _ASM_PTR "661b\n"				/* label           */	\
-      _ASM_PTR "663f\n"				/* new instruction */	\
+      "	 .long 661b - .\n"			/* label           */	\
+      "	 .long 663f - .\n"			/* new instruction */	\
       "	 .word " __stringify(feature) "\n"	/* feature bit     */	\
       "	 .byte 662b-661b\n"			/* sourcelen       */	\
       "	 .byte 664f-663f\n"			/* replacementlen  */	\
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 71cc380..8a1920e 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -331,8 +331,8 @@ static __always_inline __pure bool __static_cpu_has(u16 bit)
 			 "2:\n"
 			 ".section .altinstructions,\"a\"\n"
 			 _ASM_ALIGN "\n"
-			 _ASM_PTR "1b\n"
-			 _ASM_PTR "0\n" 	/* no replacement */
+			 " .long 1b - .\n"
+			 " .long 0\n"	 	/* no replacement */
 			 " .word %P0\n"		/* feature bit */
 			 " .byte 2b - 1b\n"	/* source len */
 			 " .byte 0\n"		/* replacement len */
@@ -349,8 +349,8 @@ static __always_inline __pure bool __static_cpu_has(u16 bit)
 			     "2:\n"
 			     ".section .altinstructions,\"a\"\n"
 			     _ASM_ALIGN "\n"
-			     _ASM_PTR "1b\n"
-			     _ASM_PTR "3f\n"
+			     " .long 1b - .\n"
+			     " .long 3f - .\n"
 			     " .word %P1\n"		/* feature bit */
 			     " .byte 2b - 1b\n"		/* source len */
 			     " .byte 4f - 3f\n"		/* replacement len */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index a81f2d5..ddb207b 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -263,6 +263,7 @@ void __init_or_module apply_alternatives(struct alt_instr *start,
 					 struct alt_instr *end)
 {
 	struct alt_instr *a;
+	u8 *instr, *replacement;
 	u8 insnbuf[MAX_PATCH_LEN];
 
 	DPRINTK("%s: alt table %p -> %p\n", __func__, start, end);
@@ -276,25 +277,29 @@ void __init_or_module apply_alternatives(struct alt_instr *start,
 	 * order.
 	 */
 	for (a = start; a < end; a++) {
-		u8 *instr = a->instr;
+		instr = (u8 *)&a->instr_offset + a->instr_offset;
+		replacement = (u8 *)&a->repl_offset + a->repl_offset;
 		BUG_ON(a->replacementlen > a->instrlen);
 		BUG_ON(a->instrlen > sizeof(insnbuf));
 		BUG_ON(a->cpuid >= NCAPINTS*32);
 		if (!boot_cpu_has(a->cpuid))
 			continue;
+
+		memcpy(insnbuf, replacement, a->replacementlen);
+
+		/* 0xe8 is a relative jump; fix the offset. */
+		if (*insnbuf == 0xe8 && a->replacementlen == 5)
+		    *(s32 *)(insnbuf + 1) += replacement - instr;
+
+		add_nops(insnbuf + a->replacementlen,
+			 a->instrlen - a->replacementlen);
+
 #ifdef CONFIG_X86_64
 		/* vsyscall code is not mapped yet. resolve it manually. */
 		if (instr >= (u8 *)VSYSCALL_START && instr < (u8*)VSYSCALL_END) {
 			instr = __va(instr - (u8*)VSYSCALL_START + (u8*)__pa_symbol(&__vsyscall_0));
-			DPRINTK("%s: vsyscall fixup: %p => %p\n",
-				__func__, a->instr, instr);
 		}
 #endif
-		memcpy(insnbuf, a->replacement, a->replacementlen);
-		if (*insnbuf == 0xe8 && a->replacementlen == 5)
-		    *(s32 *)(insnbuf + 1) += a->replacement - a->instr;
-		add_nops(insnbuf + a->replacementlen,
-			 a->instrlen - a->replacementlen);
 		text_poke_early(instr, insnbuf, a->instrlen);
 	}
 }
diff --git a/arch/x86/lib/copy_page_64.S b/arch/x86/lib/copy_page_64.S
index 6fec2d1..01c805b 100644
--- a/arch/x86/lib/copy_page_64.S
+++ b/arch/x86/lib/copy_page_64.S
@@ -2,6 +2,7 @@
 
 #include <linux/linkage.h>
 #include <asm/dwarf2.h>
+#include <asm/alternative-asm.h>
 
 	ALIGN
 copy_page_c:
@@ -110,10 +111,6 @@ ENDPROC(copy_page)
 2:
 	.previous
 	.section .altinstructions,"a"
-	.align 8
-	.quad copy_page
-	.quad 1b
-	.word X86_FEATURE_REP_GOOD
-	.byte .Lcopy_page_end - copy_page
-	.byte 2b - 1b
+	altinstruction_entry copy_page, 1b, X86_FEATURE_REP_GOOD,	\
+		.Lcopy_page_end-copy_page, 2b-1b
 	.previous
diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
index d0ec9c2..ee16461 100644
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -9,6 +9,7 @@
 #include <linux/linkage.h>
 #include <asm/dwarf2.h>
 #include <asm/cpufeature.h>
+#include <asm/alternative-asm.h>
 
 #undef memmove
 
@@ -214,11 +215,9 @@ ENTRY(memmove)
 	.previous
 
 	.section .altinstructions,"a"
-	.align 8
-	.quad .Lmemmove_begin_forward
-	.quad .Lmemmove_begin_forward_efs
-	.word X86_FEATURE_ERMS
-	.byte .Lmemmove_end_forward-.Lmemmove_begin_forward
-	.byte .Lmemmove_end_forward_efs-.Lmemmove_begin_forward_efs
+	altinstruction_entry .Lmemmove_begin_forward,		\
+		.Lmemmove_begin_forward_efs,X86_FEATURE_ERMS,	\
+		.Lmemmove_end_forward-.Lmemmove_begin_forward,	\
+		.Lmemmove_end_forward_efs-.Lmemmove_begin_forward_efs
 	.previous
 ENDPROC(memmove)
-- 
1.7.6



* [PATCH v2 3/8] x86-64: Allow alternative patching in the vDSO
From: Andy Lutomirski @ 2011-07-10  3:22 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, John Stultz, Borislav Petkov, Andy Lutomirski

This code is short enough and different enough from the module
loader that it's not worth trying to share anything.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/vdso/vma.c |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 7abd2be..ba92244 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -23,11 +23,41 @@ extern unsigned short vdso_sync_cpuid;
 static struct page **vdso_pages;
 static unsigned vdso_size;
 
+static void patch_vdso(void *vdso, size_t len)
+{
+	Elf64_Ehdr *hdr = vdso;
+	Elf64_Shdr *sechdrs, *alt_sec = 0;
+	char *secstrings;
+	void *alt_data;
+	int i;
+
+	BUG_ON(len < sizeof(Elf64_Ehdr));
+	BUG_ON(memcmp(hdr->e_ident, ELFMAG, SELFMAG) != 0);
+
+	sechdrs = (void *)hdr + hdr->e_shoff;
+	secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
+
+	for (i = 1; i < hdr->e_shnum; i++) {
+		Elf64_Shdr *shdr = &sechdrs[i];
+		if (!strcmp(secstrings + shdr->sh_name, ".altinstructions")) {
+			alt_sec = shdr;
+			goto found;
+		}
+	}
+	return;  /* nothing to patch */
+
+found:
+	alt_data = (void *)hdr + alt_sec->sh_offset;
+	apply_alternatives(alt_data, alt_data + alt_sec->sh_size);
+}
+
 static int __init init_vdso_vars(void)
 {
 	int npages = (vdso_end - vdso_start + PAGE_SIZE - 1) / PAGE_SIZE;
 	int i;
 
+	patch_vdso(vdso_start, vdso_end - vdso_start);
+
 	vdso_size = npages << PAGE_SHIFT;
 	vdso_pages = kmalloc(sizeof(struct page *) * npages, GFP_KERNEL);
 	if (!vdso_pages)
-- 
1.7.6



* [PATCH v2 4/8] x86-64: Add --no-undefined to vDSO build
From: Andy Lutomirski @ 2011-07-10  3:22 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, John Stultz, Borislav Petkov, Andy Lutomirski

This gives much nicer diagnostics when something goes wrong.  It's
supported at least as far back as binutils 2.15.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/vdso/Makefile |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/vdso/Makefile b/arch/x86/vdso/Makefile
index bef0bc9..5d17950 100644
--- a/arch/x86/vdso/Makefile
+++ b/arch/x86/vdso/Makefile
@@ -26,6 +26,7 @@ targets += vdso.so vdso.so.dbg vdso.lds $(vobjs-y)
 export CPPFLAGS_vdso.lds += -P -C
 
 VDSO_LDFLAGS_vdso.lds = -m64 -Wl,-soname=linux-vdso.so.1 \
+			-Wl,--no-undefined \
 		      	-Wl,-z,max-page-size=4096 -Wl,-z,common-page-size=4096
 
 $(obj)/vdso.o: $(src)/vdso.S $(obj)/vdso.so
-- 
1.7.6



* [PATCH v2 5/8] clocksource: Replace vread with generic arch data
From: Andy Lutomirski @ 2011-07-10  3:22 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, John Stultz, Borislav Petkov,
	Andy Lutomirski, Clemens Ladisch, linux-ia64, Tony Luck,
	Fenghua Yu, Thomas Gleixner

The vread field was bloating struct clocksource everywhere except
x86_64, and I want to change the way this works on x86_64, so let's
split it out into per-arch data.
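
An architecture opts in by supplying its own asm/clocksource.h along
these lines (hypothetical "foo" arch shown; the real x86 version is in
the diff below and the ia64 one is in patch 7):

	/* arch/foo/include/asm/clocksource.h */
	#define __ARCH_HAS_CLOCKSOURCE_DATA

	struct arch_clocksource_data {
		/* whatever foo's userspace clock code needs */
	};

Architectures that don't define __ARCH_HAS_CLOCKSOURCE_DATA don't pay
for the field at all.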

Cc: x86@kernel.org
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: linux-ia64@vger.kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/include/asm/clocksource.h |   16 ++++++++++++++++
 arch/x86/kernel/hpet.c             |    2 +-
 arch/x86/kernel/tsc.c              |    2 +-
 arch/x86/kernel/vsyscall_64.c      |    2 +-
 include/asm-generic/clocksource.h  |    4 ++++
 include/linux/clocksource.h        |   10 ++++++++--
 6 files changed, 31 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/include/asm/clocksource.h
 create mode 100644 include/asm-generic/clocksource.h

diff --git a/arch/x86/include/asm/clocksource.h b/arch/x86/include/asm/clocksource.h
new file mode 100644
index 0000000..a5df33f
--- /dev/null
+++ b/arch/x86/include/asm/clocksource.h
@@ -0,0 +1,16 @@
+/* x86-specific clocksource additions */
+
+#ifndef _ASM_X86_CLOCKSOURCE_H
+#define _ASM_X86_CLOCKSOURCE_H
+
+#ifdef CONFIG_X86_64
+
+#define __ARCH_HAS_CLOCKSOURCE_DATA
+
+struct arch_clocksource_data {
+	cycle_t (*vread)(void);
+};
+
+#endif /* CONFIG_X86_64 */
+
+#endif /* _ASM_X86_CLOCKSOURCE_H */
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index e9f5605..0e07257 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -753,7 +753,7 @@ static struct clocksource clocksource_hpet = {
 	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
 	.resume		= hpet_resume_counter,
 #ifdef CONFIG_X86_64
-	.vread		= vread_hpet,
+	.archdata	= { .vread = vread_hpet },
 #endif
 };
 
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 6cc6922..e7a74b8 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -777,7 +777,7 @@ static struct clocksource clocksource_tsc = {
 	.flags                  = CLOCK_SOURCE_IS_CONTINUOUS |
 				  CLOCK_SOURCE_MUST_VERIFY,
 #ifdef CONFIG_X86_64
-	.vread                  = vread_tsc,
+	.archdata               = { .vread = vread_tsc },
 #endif
 };
 
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 6d14848..40a379c 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -74,7 +74,7 @@ void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
 	write_seqlock_irqsave(&vsyscall_gtod_data.lock, flags);
 
 	/* copy vsyscall data */
-	vsyscall_gtod_data.clock.vread		= clock->vread;
+	vsyscall_gtod_data.clock.vread		= clock->archdata.vread;
 	vsyscall_gtod_data.clock.cycle_last	= clock->cycle_last;
 	vsyscall_gtod_data.clock.mask		= clock->mask;
 	vsyscall_gtod_data.clock.mult		= mult;
diff --git a/include/asm-generic/clocksource.h b/include/asm-generic/clocksource.h
new file mode 100644
index 0000000..0a462d3
--- /dev/null
+++ b/include/asm-generic/clocksource.h
@@ -0,0 +1,4 @@
+/*
+ * Architectures should override this file to add private userspace
+ * clock magic if needed.
+ */
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 18a1baf..9ab6b6a 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -22,6 +22,8 @@
 typedef u64 cycle_t;
 struct clocksource;
 
+#include <asm/clocksource.h>
+
 /**
  * struct cyclecounter - hardware abstraction for a free running counter
  *	Provides completely state-free accessors to the underlying hardware.
@@ -153,7 +155,7 @@ extern u64 timecounter_cyc2time(struct timecounter *tc,
  * @shift:		cycle to nanosecond divisor (power of two)
  * @max_idle_ns:	max idle time permitted by the clocksource (nsecs)
  * @flags:		flags describing special properties
- * @vread:		vsyscall based read
+ * @archdata:		arch-specific data
  * @suspend:		suspend function for the clocksource, if necessary
  * @resume:		resume function for the clocksource, if necessary
  */
@@ -175,10 +177,14 @@ struct clocksource {
 #else
 #define CLKSRC_FSYS_MMIO_SET(mmio, addr)      do { } while (0)
 #endif
+
+#ifdef __ARCH_HAS_CLOCKSOURCE_DATA
+	struct arch_clocksource_data archdata;
+#endif
+
 	const char *name;
 	struct list_head list;
 	int rating;
-	cycle_t (*vread)(void);
 	int (*enable)(struct clocksource *cs);
 	void (*disable)(struct clocksource *cs);
 	unsigned long flags;
-- 
1.7.6



* [PATCH v2 6/8] x86-64: Move vread_tsc and vread_hpet into the vDSO
From: Andy Lutomirski @ 2011-07-10  3:22 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, John Stultz, Borislav Petkov, Andy Lutomirski

The vsyscall page now consists entirely of trap instructions.

Cc: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/include/asm/clocksource.h |    6 +++-
 arch/x86/include/asm/tsc.h         |    4 ---
 arch/x86/include/asm/vgtod.h       |    2 +-
 arch/x86/include/asm/vsyscall.h    |    4 ---
 arch/x86/kernel/Makefile           |    7 +----
 arch/x86/kernel/alternative.c      |    8 -----
 arch/x86/kernel/hpet.c             |    9 +-----
 arch/x86/kernel/tsc.c              |    2 +-
 arch/x86/kernel/vmlinux.lds.S      |    3 --
 arch/x86/kernel/vread_tsc_64.c     |   36 -------------------------
 arch/x86/kernel/vsyscall_64.c      |    2 +-
 arch/x86/vdso/vclock_gettime.c     |   52 +++++++++++++++++++++++++++++++----
 12 files changed, 56 insertions(+), 79 deletions(-)
 delete mode 100644 arch/x86/kernel/vread_tsc_64.c

diff --git a/arch/x86/include/asm/clocksource.h b/arch/x86/include/asm/clocksource.h
index a5df33f..3882c65 100644
--- a/arch/x86/include/asm/clocksource.h
+++ b/arch/x86/include/asm/clocksource.h
@@ -7,8 +7,12 @@
 
 #define __ARCH_HAS_CLOCKSOURCE_DATA
 
+#define VCLOCK_NONE 0  /* No vDSO clock available.	*/
+#define VCLOCK_TSC  1  /* vDSO should use vread_tsc.	*/
+#define VCLOCK_HPET 2  /* vDSO should use vread_hpet.	*/
+
 struct arch_clocksource_data {
-	cycle_t (*vread)(void);
+	int vclock_mode;
 };
 
 #endif /* CONFIG_X86_64 */
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 9db5583..83e2efd 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -51,10 +51,6 @@ extern int unsynchronized_tsc(void);
 extern int check_tsc_unstable(void);
 extern unsigned long native_calibrate_tsc(void);
 
-#ifdef CONFIG_X86_64
-extern cycles_t vread_tsc(void);
-#endif
-
 /*
  * Boot-time check whether the TSCs are synchronized across
  * all CPUs/cores:
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index aa5add8..815285b 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -13,7 +13,7 @@ struct vsyscall_gtod_data {
 
 	struct timezone sys_tz;
 	struct { /* extract of a clocksource struct */
-		cycle_t (*vread)(void);
+		int vclock_mode;
 		cycle_t	cycle_last;
 		cycle_t	mask;
 		u32	mult;
diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index d555973..6010707 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -16,10 +16,6 @@ enum vsyscall_num {
 #ifdef __KERNEL__
 #include <linux/seqlock.h>
 
-/* Definitions for CONFIG_GENERIC_TIME definitions */
-#define __vsyscall_fn \
-	__attribute__ ((unused, __section__(".vsyscall_fn"))) notrace
-
 #define VGETCPU_RDTSCP	1
 #define VGETCPU_LSL	2
 
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index cc0469a..2deef3d 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -24,17 +24,12 @@ endif
 nostackp := $(call cc-option, -fno-stack-protector)
 CFLAGS_vsyscall_64.o	:= $(PROFILING) -g0 $(nostackp)
 CFLAGS_hpet.o		:= $(nostackp)
-CFLAGS_vread_tsc_64.o	:= $(nostackp)
 CFLAGS_paravirt.o	:= $(nostackp)
 GCOV_PROFILE_vsyscall_64.o	:= n
 GCOV_PROFILE_hpet.o		:= n
 GCOV_PROFILE_tsc.o		:= n
-GCOV_PROFILE_vread_tsc_64.o	:= n
 GCOV_PROFILE_paravirt.o		:= n
 
-# vread_tsc_64 is hot and should be fully optimized:
-CFLAGS_REMOVE_vread_tsc_64.o = -pg -fno-optimize-sibling-calls
-
 obj-y			:= process_$(BITS).o signal.o entry_$(BITS).o
 obj-y			+= traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
 obj-y			+= time.o ioport.o ldt.o dumpstack.o
@@ -43,7 +38,7 @@ obj-$(CONFIG_IRQ_WORK)  += irq_work.o
 obj-y			+= probe_roms.o
 obj-$(CONFIG_X86_32)	+= sys_i386_32.o i386_ksyms_32.o
 obj-$(CONFIG_X86_64)	+= sys_x86_64.o x8664_ksyms_64.o
-obj-$(CONFIG_X86_64)	+= syscall_64.o vsyscall_64.o vread_tsc_64.o
+obj-$(CONFIG_X86_64)	+= syscall_64.o vsyscall_64.o
 obj-$(CONFIG_X86_64)	+= vsyscall_emu_64.o
 obj-y			+= bootflag.o e820.o
 obj-y			+= pci-dma.o quirks.o topology.o kdebugfs.o
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ddb207b..c638228 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -14,7 +14,6 @@
 #include <asm/pgtable.h>
 #include <asm/mce.h>
 #include <asm/nmi.h>
-#include <asm/vsyscall.h>
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 #include <asm/io.h>
@@ -250,7 +249,6 @@ static void __init_or_module add_nops(void *insns, unsigned int len)
 
 extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
 extern s32 __smp_locks[], __smp_locks_end[];
-extern char __vsyscall_0;
 void *text_poke_early(void *addr, const void *opcode, size_t len);
 
 /* Replace instructions with better alternatives for this CPU type.
@@ -294,12 +292,6 @@ void __init_or_module apply_alternatives(struct alt_instr *start,
 		add_nops(insnbuf + a->replacementlen,
 			 a->instrlen - a->replacementlen);
 
-#ifdef CONFIG_X86_64
-		/* vsyscall code is not mapped yet. resolve it manually. */
-		if (instr >= (u8 *)VSYSCALL_START && instr < (u8*)VSYSCALL_END) {
-			instr = __va(instr - (u8*)VSYSCALL_START + (u8*)__pa_symbol(&__vsyscall_0));
-		}
-#endif
 		text_poke_early(instr, insnbuf, a->instrlen);
 	}
 }
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 0e07257..d10cc00 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -738,13 +738,6 @@ static cycle_t read_hpet(struct clocksource *cs)
 	return (cycle_t)hpet_readl(HPET_COUNTER);
 }
 
-#ifdef CONFIG_X86_64
-static cycle_t __vsyscall_fn vread_hpet(void)
-{
-	return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
-}
-#endif
-
 static struct clocksource clocksource_hpet = {
 	.name		= "hpet",
 	.rating		= 250,
@@ -753,7 +746,7 @@ static struct clocksource clocksource_hpet = {
 	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
 	.resume		= hpet_resume_counter,
 #ifdef CONFIG_X86_64
-	.archdata	= { .vread = vread_hpet },
+	.archdata	= { .vclock_mode = VCLOCK_HPET },
 #endif
 };
 
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index e7a74b8..56c633a 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -777,7 +777,7 @@ static struct clocksource clocksource_tsc = {
 	.flags                  = CLOCK_SOURCE_IS_CONTINUOUS |
 				  CLOCK_SOURCE_MUST_VERIFY,
 #ifdef CONFIG_X86_64
-	.archdata               = { .vread = vread_tsc },
+	.archdata               = { .vclock_mode = VCLOCK_TSC },
 #endif
 };
 
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 8017471..4aa9c54 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -169,9 +169,6 @@ SECTIONS
 	.vsyscall : AT(VLOAD(.vsyscall)) {
 		*(.vsyscall_0)
 
-		. = ALIGN(L1_CACHE_BYTES);
-		*(.vsyscall_fn)
-
 		. = 1024;
 		*(.vsyscall_1)
 
diff --git a/arch/x86/kernel/vread_tsc_64.c b/arch/x86/kernel/vread_tsc_64.c
deleted file mode 100644
index a81aa9e..0000000
--- a/arch/x86/kernel/vread_tsc_64.c
+++ /dev/null
@@ -1,36 +0,0 @@
-/* This code runs in userspace. */
-
-#define DISABLE_BRANCH_PROFILING
-#include <asm/vgtod.h>
-
-notrace cycle_t __vsyscall_fn vread_tsc(void)
-{
-	cycle_t ret;
-	u64 last;
-
-	/*
-	 * Empirically, a fence (of type that depends on the CPU)
-	 * before rdtsc is enough to ensure that rdtsc is ordered
-	 * with respect to loads.  The various CPU manuals are unclear
-	 * as to whether rdtsc can be reordered with later loads,
-	 * but no one has ever seen it happen.
-	 */
-	rdtsc_barrier();
-	ret = (cycle_t)vget_cycles();
-
-	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
-
-	if (likely(ret >= last))
-		return ret;
-
-	/*
-	 * GCC likes to generate cmov here, but this branch is extremely
-	 * predictable (it's just a funciton of time and the likely is
-	 * very likely) and there's a data dependence, so force GCC
-	 * to generate a branch instead.  I don't barrier() because
-	 * we don't actually need a barrier, and if this function
-	 * ever gets inlined it will generate worse code.
-	 */
-	asm volatile ("");
-	return last;
-}
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 40a379c..db08146 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -74,7 +74,7 @@ void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
 	write_seqlock_irqsave(&vsyscall_gtod_data.lock, flags);
 
 	/* copy vsyscall data */
-	vsyscall_gtod_data.clock.vread		= clock->archdata.vread;
+	vsyscall_gtod_data.clock.vclock_mode	= clock->archdata.vclock_mode;
 	vsyscall_gtod_data.clock.cycle_last	= clock->cycle_last;
 	vsyscall_gtod_data.clock.mask		= clock->mask;
 	vsyscall_gtod_data.clock.mult		= mult;
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index cf54813..9869bac 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -25,6 +25,43 @@
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
+notrace static cycle_t vread_tsc(void)
+{
+	cycle_t ret;
+	u64 last;
+
+	/*
+	 * Empirically, a fence (of type that depends on the CPU)
+	 * before rdtsc is enough to ensure that rdtsc is ordered
+	 * with respect to loads.  The various CPU manuals are unclear
+	 * as to whether rdtsc can be reordered with later loads,
+	 * but no one has ever seen it happen.
+	 */
+	rdtsc_barrier();
+	ret = (cycle_t)vget_cycles();
+
+	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
+
+	if (likely(ret >= last))
+		return ret;
+
+	/*
+	 * GCC likes to generate cmov here, but this branch is extremely
+	 * predictable (it's just a function of time and the likely is
+	 * very likely) and there's a data dependence, so force GCC
+	 * to generate a branch instead.  I don't barrier() because
+	 * we don't actually need a barrier, and if this function
+	 * ever gets inlined it will generate worse code.
+	 */
+	asm volatile ("");
+	return last;
+}
+
+static notrace cycle_t vread_hpet(void)
+{
+	return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
+}
+
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
 	long ret;
@@ -36,9 +73,12 @@ notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 notrace static inline long vgetns(void)
 {
 	long v;
-	cycles_t (*vread)(void);
-	vread = gtod->clock.vread;
-	v = (vread() - gtod->clock.cycle_last) & gtod->clock.mask;
+	cycles_t cycles;
+	if (gtod->clock.vclock_mode == VCLOCK_TSC)
+		cycles = vread_tsc();
+	else
+		cycles = vread_hpet();
+	v = (cycles - gtod->clock.cycle_last) & gtod->clock.mask;
 	return (v * gtod->clock.mult) >> gtod->clock.shift;
 }
 
@@ -118,11 +158,11 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
 	switch (clock) {
 	case CLOCK_REALTIME:
-		if (likely(gtod->clock.vread))
+		if (likely(gtod->clock.vclock_mode != VCLOCK_NONE))
 			return do_realtime(ts);
 		break;
 	case CLOCK_MONOTONIC:
-		if (likely(gtod->clock.vread))
+		if (likely(gtod->clock.vclock_mode != VCLOCK_NONE))
 			return do_monotonic(ts);
 		break;
 	case CLOCK_REALTIME_COARSE:
@@ -139,7 +179,7 @@ int clock_gettime(clockid_t, struct timespec *)
 notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
 {
 	long ret;
-	if (likely(gtod->clock.vread)) {
+	if (likely(gtod->clock.vclock_mode != VCLOCK_NONE)) {
 		if (likely(tv != NULL)) {
 			BUILD_BUG_ON(offsetof(struct timeval, tv_usec) !=
 				     offsetof(struct timespec, tv_nsec) ||
-- 
1.7.6



* [PATCH v2 7/8] ia64: Replace clocksource.fsys_mmio with generic arch data
From: Andy Lutomirski @ 2011-07-10  3:22 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, John Stultz, Borislav Petkov,
	Andy Lutomirski, Clemens Ladisch, linux-ia64, Tony Luck,
	Fenghua Yu, Thomas Gleixner

Now that clocksource.archdata is available, use it for ia64-specific
code.

Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: linux-ia64@vger.kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/ia64/include/asm/clocksource.h |   12 ++++++++++++
 arch/ia64/kernel/cyclone.c          |    2 +-
 arch/ia64/kernel/time.c             |    2 +-
 arch/ia64/sn/kernel/sn2/timer.c     |    2 +-
 drivers/char/hpet.c                 |    2 +-
 include/linux/clocksource.h         |    7 -------
 6 files changed, 16 insertions(+), 11 deletions(-)
 create mode 100644 arch/ia64/include/asm/clocksource.h

diff --git a/arch/ia64/include/asm/clocksource.h b/arch/ia64/include/asm/clocksource.h
new file mode 100644
index 0000000..00eb549
--- /dev/null
+++ b/arch/ia64/include/asm/clocksource.h
@@ -0,0 +1,12 @@
+/* IA64-specific clocksource additions */
+
+#ifndef _ASM_IA64_CLOCKSOURCE_H
+#define _ASM_IA64_CLOCKSOURCE_H
+
+#define __ARCH_HAS_CLOCKSOURCE_DATA
+
+struct arch_clocksource_data {
+	void *fsys_mmio;        /* used by fsyscall asm code */
+};
+
+#endif /* _ASM_IA64_CLOCKSOURCE_H */
diff --git a/arch/ia64/kernel/cyclone.c b/arch/ia64/kernel/cyclone.c
index f64097b..4826ff9 100644
--- a/arch/ia64/kernel/cyclone.c
+++ b/arch/ia64/kernel/cyclone.c
@@ -115,7 +115,7 @@ int __init init_cyclone_clock(void)
 	}
 	/* initialize last tick */
 	cyclone_mc = cyclone_timer;
-	clocksource_cyclone.fsys_mmio = cyclone_timer;
+	clocksource_cyclone.archdata.fsys_mmio = cyclone_timer;
 	clocksource_register_hz(&clocksource_cyclone, CYCLONE_TIMER_FREQ);
 
 	return 0;
diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 85118df..43920de 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -468,7 +468,7 @@ void update_vsyscall(struct timespec *wall, struct timespec *wtm,
         fsyscall_gtod_data.clk_mask = c->mask;
         fsyscall_gtod_data.clk_mult = mult;
         fsyscall_gtod_data.clk_shift = c->shift;
-        fsyscall_gtod_data.clk_fsys_mmio = c->fsys_mmio;
+        fsyscall_gtod_data.clk_fsys_mmio = c->archdata.fsys_mmio;
         fsyscall_gtod_data.clk_cycle_last = c->cycle_last;
 
 	/* copy kernel time structures */
diff --git a/arch/ia64/sn/kernel/sn2/timer.c b/arch/ia64/sn/kernel/sn2/timer.c
index c34efda..0f8844e 100644
--- a/arch/ia64/sn/kernel/sn2/timer.c
+++ b/arch/ia64/sn/kernel/sn2/timer.c
@@ -54,7 +54,7 @@ ia64_sn_udelay (unsigned long usecs)
 
 void __init sn_timer_init(void)
 {
-	clocksource_sn2.fsys_mmio = RTC_COUNTER_ADDR;
+	clocksource_sn2.archdata.fsys_mmio = RTC_COUNTER_ADDR;
 	clocksource_register_hz(&clocksource_sn2, sn_rtc_cycles_per_second);
 
 	ia64_udelay = &ia64_sn_udelay;
diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
index 34d6a1c..0833896 100644
--- a/drivers/char/hpet.c
+++ b/drivers/char/hpet.c
@@ -952,7 +952,7 @@ int hpet_alloc(struct hpet_data *hdp)
 #ifdef CONFIG_IA64
 	if (!hpet_clocksource) {
 		hpet_mctr = (void __iomem *)&hpetp->hp_hpet->hpet_mc;
-		CLKSRC_FSYS_MMIO_SET(clocksource_hpet.fsys_mmio, hpet_mctr);
+		clocksource_hpet.archdata.fsys_mmio = hpet_mctr;
 		clocksource_register_hz(&clocksource_hpet, hpetp->hp_tick_freq);
 		hpetp->hp_clocksource = &clocksource_hpet;
 		hpet_clocksource = &clocksource_hpet;
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 9ab6b6a..0c79005 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -171,13 +171,6 @@ struct clocksource {
 	u32 shift;
 	u64 max_idle_ns;
 
-#ifdef CONFIG_IA64
-	void *fsys_mmio;        /* used by fsyscall asm code */
-#define CLKSRC_FSYS_MMIO_SET(mmio, addr)      ((mmio) = (addr))
-#else
-#define CLKSRC_FSYS_MMIO_SET(mmio, addr)      do { } while (0)
-#endif
-
 #ifdef __ARCH_HAS_CLOCKSOURCE_DATA
 	struct arch_clocksource_data archdata;
 #endif
-- 
1.7.6



* [PATCH v2 8/8] Document the vDSO and add a reference parser
From: Andy Lutomirski @ 2011-07-10  3:22 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ingo Molnar, John Stultz, Borislav Petkov, Andy Lutomirski

It turns out that parsing the vDSO is nontrivial if you don't already
have an ELF dynamic loader around.  So document it in Documentation/ABI
and add a reference CC0-licensed parser.

This code is dedicated to Go issue 1933:
http://code.google.com/p/go/issues/detail?id=1933
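
For illustration, a caller ends up looking something like this.  It's
only a sketch: getauxval() is an assumption here (not every libc has
it); a freestanding program would locate auxv itself and call
vdso_init_from_auxv() instead.

	#include <stdint.h>
	#include <sys/auxv.h>		/* getauxval, AT_SYSINFO_EHDR */
	#include <sys/time.h>

	/* prototypes provided by parse_vdso.c */
	extern void vdso_init_from_sysinfo_ehdr(uintptr_t base);
	extern void *vdso_sym(const char *version, const char *name);

	int main(void)
	{
		struct timeval tv;
		long (*gtod)(struct timeval *, struct timezone *);

		vdso_init_from_sysinfo_ehdr(getauxval(AT_SYSINFO_EHDR));
		gtod = vdso_sym("LINUX_2.6", "__vdso_gettimeofday");
		if (!gtod)
			return 1;	/* fall back to the real syscall */
		return gtod(&tv, 0) != 0;
	}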

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 Documentation/ABI/stable/vdso   |   27 ++++
 Documentation/vDSO/parse_vdso.c |  256 +++++++++++++++++++++++++++++++++++++++
 Documentation/vDSO/vdso_test.c  |  112 +++++++++++++++++
 3 files changed, 395 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/stable/vdso
 create mode 100644 Documentation/vDSO/parse_vdso.c
 create mode 100644 Documentation/vDSO/vdso_test.c

diff --git a/Documentation/ABI/stable/vdso b/Documentation/ABI/stable/vdso
new file mode 100644
index 0000000..8a1cbb5
--- /dev/null
+++ b/Documentation/ABI/stable/vdso
@@ -0,0 +1,27 @@
+On some architectures, when the kernel loads any userspace program it
+maps an ELF DSO into that program's address space.  This DSO is called
+the vDSO and it often contains useful and highly-optimized alternatives
+to real syscalls.
+
+These functions are called just like ordinary C functions according to
+your platform's ABI.  Call them from a sensible context.  (For example,
+if you set CS on x86 to something strange, the vDSO functions are
+within their rights to crash.)  In addition, if you pass a bad
+pointer to a vDSO function, you might get SIGSEGV instead of -EFAULT.
+
+To find the DSO, parse the auxiliary vector passed to the program's
+entry point.  The AT_SYSINFO_EHDR entry will point to the vDSO.
+
+The vDSO uses symbol versioning; whenever you request a symbol from the
+vDSO, specify the version you are expecting.
+
+Programs that dynamically link to glibc will use the vDSO automatically.
+Otherwise, you can use the reference parser in Documentation/vDSO/parse_vdso.c.
+
+Unless otherwise noted, the set of symbols with any given version and the
+ABI of those symbols are considered stable.  They may vary across architectures,
+though.
+
+(As of this writing, this ABI documentation has been confirmed for x86_64.
+ The maintainers of the other vDSO-using architectures should confirm
+ that it is correct for their architecture.)
\ No newline at end of file
diff --git a/Documentation/vDSO/parse_vdso.c b/Documentation/vDSO/parse_vdso.c
new file mode 100644
index 0000000..8587020
--- /dev/null
+++ b/Documentation/vDSO/parse_vdso.c
@@ -0,0 +1,256 @@
+/*
+ * parse_vdso.c: Linux reference vDSO parser
+ * Written by Andrew Lutomirski, 2011.
+ *
+ * This code is meant to be linked in to various programs that run on Linux.
+ * As such, it is available with as few restrictions as possible.  This file
+ * is licensed under the Creative Commons Zero License, version 1.0,
+ * available at http://creativecommons.org/publicdomain/zero/1.0/legalcode
+ *
+ * The vDSO is a regular ELF DSO that the kernel maps into user space when
+ * it starts a program.  It works equally well in statically and dynamically
+ * linked binaries.
+ *
+ * This code is tested on x86_64.  In principle it should work on any 64-bit
+ * architecture that has a vDSO.
+ */
+
+#include <stdbool.h>
+#include <stdint.h>
+#include <string.h>
+#include <elf.h>
+
+/*
+ * To use this vDSO parser, first call one of the vdso_init_* functions.
+ * If you've already parsed auxv, then pass the value of AT_SYSINFO_EHDR
+ * to vdso_init_from_sysinfo_ehdr.  Otherwise pass auxv to vdso_init_from_auxv.
+ * Then call vdso_sym for each symbol you want.  For example, to look up
+ * gettimeofday on x86_64, use:
+ *
+ *     <some pointer> = vdso_sym("LINUX_2.6", "gettimeofday");
+ * or
+ *     <some pointer> = vdso_sym("LINUX_2.6", "__vdso_gettimeofday");
+ *
+ * vdso_sym will return 0 if the symbol doesn't exist or if the init function
+ * failed or was not called.  vdso_sym is a little slow, so its return value
+ * should be cached.
+ *
+ * vdso_sym is threadsafe; the init functions are not.
+ *
+ * These are the prototypes:
+ */
+extern void vdso_init_from_auxv(void *auxv);
+extern void vdso_init_from_sysinfo_ehdr(uintptr_t base);
+extern void *vdso_sym(const char *version, const char *name);
+
+
+/* And here's the code. */
+
+#ifndef __x86_64__
+# error Not yet ported to non-x86_64 architectures
+#endif
+
+static struct vdso_info
+{
+	bool valid;
+
+	/* Load information */
+	uintptr_t load_addr;
+	uintptr_t load_offset;  /* load_addr - recorded vaddr */
+
+	/* Symbol table */
+	Elf64_Sym *symtab;
+	const char *symstrings;
+	Elf64_Word *bucket, *chain;
+	Elf64_Word nbucket, nchain;
+
+	/* Version table */
+	Elf64_Versym *versym;
+	Elf64_Verdef *verdef;
+} vdso_info;
+
+/* Straight from the ELF specification. */
+static unsigned long elf_hash(const unsigned char *name)
+{
+	unsigned long h = 0, g;
+	while (*name)
+	{
+		h = (h << 4) + *name++;
+		if ((g = h & 0xf0000000))
+			h ^= g >> 24;
+		h &= ~g;
+	}
+	return h;
+}
+
+void vdso_init_from_sysinfo_ehdr(uintptr_t base)
+{
+	size_t i;
+	bool found_vaddr = false;
+
+	vdso_info.valid = false;
+
+	vdso_info.load_addr = base;
+
+	Elf64_Ehdr *hdr = (Elf64_Ehdr*)base;
+	Elf64_Phdr *pt = (Elf64_Phdr*)(vdso_info.load_addr + hdr->e_phoff);
+	Elf64_Dyn *dyn = 0;
+
+	/*
+	 * We need two things from the segment table: the load offset
+	 * and the dynamic table.
+	 */
+	for (i = 0; i < hdr->e_phnum; i++)
+	{
+		if (pt[i].p_type == PT_LOAD && !found_vaddr) {
+			found_vaddr = true;
+			vdso_info.load_offset =	base
+				+ (uintptr_t)pt[i].p_offset
+				- (uintptr_t)pt[i].p_vaddr;
+		} else if (pt[i].p_type == PT_DYNAMIC) {
+			dyn = (Elf64_Dyn*)(base + pt[i].p_offset);
+		}
+	}
+
+	if (!found_vaddr || !dyn)
+		return;  /* Failed */
+
+	/*
+	 * Fish out the useful bits of the dynamic table.
+	 */
+	Elf64_Word *hash = 0;
+	vdso_info.symstrings = 0;
+	vdso_info.symtab = 0;
+	vdso_info.versym = 0;
+	vdso_info.verdef = 0;
+	for (i = 0; dyn[i].d_tag != DT_NULL; i++) {
+		switch (dyn[i].d_tag) {
+		case DT_STRTAB:
+			vdso_info.symstrings = (const char *)
+				((uintptr_t)dyn[i].d_un.d_ptr
+				 + vdso_info.load_offset);
+			break;
+		case DT_SYMTAB:
+			vdso_info.symtab = (Elf64_Sym *)
+				((uintptr_t)dyn[i].d_un.d_ptr
+				 + vdso_info.load_offset);
+			break;
+		case DT_HASH:
+			hash = (Elf64_Word *)
+				((uintptr_t)dyn[i].d_un.d_ptr
+				 + vdso_info.load_offset);
+			break;
+		case DT_VERSYM:
+			vdso_info.versym = (Elf64_Versym *)
+				((uintptr_t)dyn[i].d_un.d_ptr
+				 + vdso_info.load_offset);
+			break;
+		case DT_VERDEF:
+			vdso_info.verdef = (Elf64_Verdef *)
+				((uintptr_t)dyn[i].d_un.d_ptr
+				 + vdso_info.load_offset);
+			break;
+		}
+	}
+	if (!vdso_info.symstrings || !vdso_info.symtab || !hash)
+		return;  /* Failed */
+
+	if (!vdso_info.verdef)
+		vdso_info.versym = 0;
+
+	/* Parse the hash table header. */
+	vdso_info.nbucket = hash[0];
+	vdso_info.nchain = hash[1];
+	vdso_info.bucket = &hash[2];
+	vdso_info.chain = &hash[vdso_info.nbucket + 2];
+
+	/* That's all we need. */
+	vdso_info.valid = true;
+}
+
+static bool vdso_match_version(Elf64_Versym ver,
+			       const char *name, Elf64_Word hash)
+{
+	/*
+	 * This is a helper function to check if the version indexed by
+	 * ver matches name (which hashes to hash).
+	 *
+	 * The version definition table is a mess, and I don't know how
+	 * to do this in better than linear time without allocating memory
+	 * to build an index.  I also don't know why the table has
+	 * variable size entries in the first place.
+	 *
+	 * For added fun, I can't find a comprehensible specification of how
+	 * to parse all the weird flags in the table.
+	 *
+	 * So I just parse the whole table every time.
+	 */
+
+	/* First step: find the version definition */
+	ver &= 0x7fff;  /* Apparently bit 15 means "hidden" */
+	Elf64_Verdef *def = vdso_info.verdef;
+	while (true) {
+		if ((def->vd_flags & VER_FLG_BASE) == 0
+		    && (def->vd_ndx & 0x7fff) == ver)
+			break;
+
+		if (def->vd_next == 0)
+			return false;  /* No definition. */
+
+		def = (Elf64_Verdef *)((char *)def + def->vd_next);
+	}
+
+	/* Now figure out whether it matches. */
+	Elf64_Verdaux *aux = (Elf64_Verdaux*)((char *)def + def->vd_aux);
+	return def->vd_hash == hash
+		&& !strcmp(name, vdso_info.symstrings + aux->vda_name);
+}
+
+void *vdso_sym(const char *version, const char *name)
+{
+	unsigned long ver_hash;
+	if (!vdso_info.valid)
+		return 0;
+
+	ver_hash = elf_hash(version);
+	Elf64_Word chain = vdso_info.bucket[elf_hash(name) % vdso_info.nbucket];
+
+	for (; chain != STN_UNDEF; chain = vdso_info.chain[chain]) {
+		Elf64_Sym *sym = &vdso_info.symtab[chain];
+
+		/* Check for a defined global or weak function w/ right name. */
+		if (ELF64_ST_TYPE(sym->st_info) != STT_FUNC)
+			continue;
+		if (ELF64_ST_BIND(sym->st_info) != STB_GLOBAL &&
+		    ELF64_ST_BIND(sym->st_info) != STB_WEAK)
+			continue;
+		if (sym->st_shndx == SHN_UNDEF)
+			continue;
+		if (strcmp(name, vdso_info.symstrings + sym->st_name))
+			continue;
+
+		/* Check symbol version. */
+		if (vdso_info.versym
+		    && !vdso_match_version(vdso_info.versym[chain],
+					   version, ver_hash))
+			continue;
+
+		return (void *)(vdso_info.load_offset + sym->st_value);
+	}
+
+	return 0;
+}
+
+void vdso_init_from_auxv(void *auxv)
+{
+	Elf64_auxv_t *elf_auxv = auxv;
+	for (int i = 0; elf_auxv[i].a_type != AT_NULL; i++)
+	{
+		if (elf_auxv[i].a_type == AT_SYSINFO_EHDR) {
+			vdso_init_from_sysinfo_ehdr(elf_auxv[i].a_un.a_val);
+			return;
+		}
+	}
+
+	vdso_info.valid = false;
+}
diff --git a/Documentation/vDSO/vdso_test.c b/Documentation/vDSO/vdso_test.c
new file mode 100644
index 0000000..1f3a776
--- /dev/null
+++ b/Documentation/vDSO/vdso_test.c
@@ -0,0 +1,112 @@
+/*
+ * vdso_test.c: Sample code to test parse_vdso.c on x86_64
+ * Copyright (c) 2011 Andy Lutomirski
+ * Subject to the GNU General Public License, version 2
+ *
+ * You can amuse yourself by compiling with:
+ * gcc -std=gnu99 -nostdlib
+ *     -Os -fno-asynchronous-unwind-tables -flto
+ *      vdso_test.c parse_vdso.c -o vdso_test
+ * to generate a small binary with no dependencies at all.
+ */
+
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <unistd.h>
+#include <stdint.h>
+
+extern void *vdso_sym(const char *version, const char *name);
+extern void vdso_init_from_sysinfo_ehdr(uintptr_t base);
+extern void vdso_init_from_auxv(void *auxv);
+
+/* We need one libc function... */
+int strcmp(const char *a, const char *b)
+{
+	/* Minimal but correct; parse_vdso.c only tests for equality anyway. */
+	while (*a && *a == *b) {
+		a++;
+		b++;
+	}
+
+	return (unsigned char)*a - (unsigned char)*b;
+}
+
+/* ...and two syscalls.  This is x86_64-specific. */
+static inline long linux_write(int fd, const void *data, size_t len)
+{
+	long ret;
+	asm volatile ("syscall" : "=a" (ret) : "a" (__NR_write),
+		      "D" (fd), "S" (data), "d" (len) :
+		      "cc", "memory", "rcx",
+		      "r8", "r9", "r10", "r11" );
+	return ret;
+}
+
+static inline void linux_exit(int code)
+{
+	asm volatile ("syscall" : : "a" (__NR_exit), "D" (code));
+}
+
+void to_base10(char *lastdig, uint64_t n)
+{
+	while (n) {
+		*lastdig = (n % 10) + '0';
+		n /= 10;
+		lastdig--;
+	}
+}
+
+__attribute__((externally_visible)) void c_main(void **stack)
+{
+	/* Parse the stack */
+	long argc = (long)*stack;
+	stack += argc + 2;
+
+	/* Now we're pointing at the environment.  Skip it. */
+	while (*stack)
+		stack++;
+	stack++;
+
+	/* Now we're pointing at auxv.  Initialize the vDSO parser. */
+	vdso_init_from_auxv((void *)stack);
+
+	/* Find gettimeofday. */
+	typedef long (*gtod_t)(struct timeval *tv, struct timezone *tz);
+	gtod_t gtod = (gtod_t)vdso_sym("LINUX_2.6", "__vdso_gettimeofday");
+
+	if (!gtod)
+		linux_exit(1);
+
+	struct timeval tv;
+	long ret = gtod(&tv, 0);
+
+	if (ret == 0) {
+		char buf[] = "The time is                     .000000\n";
+		to_base10(buf + 31, tv.tv_sec);
+		to_base10(buf + 38, tv.tv_usec);
+		linux_write(1, buf, sizeof(buf) - 1);
+	} else {
+		linux_exit(ret);
+	}
+
+	linux_exit(0);
+}
+
+/*
+ * This is the real entry point.  It passes the initial stack into
+ * the C entry point.
+ */
+asm (
+	".text\n"
+	".global _start\n"
+        ".type _start,@function\n"
+        "_start:\n\t"
+        "mov %rsp,%rdi\n\t"
+        "jmp c_main"
+	);
+
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/8] x86-64: Allow alternative patching in the vDSO
  2011-07-10  3:22 ` [PATCH v2 3/8] x86-64: Allow alternative patching in the vDSO Andy Lutomirski
@ 2011-07-11 10:41   ` Rakib Mullick
  2011-07-11 14:21     ` Andrew Lutomirski
  0 siblings, 1 reply; 20+ messages in thread
From: Rakib Mullick @ 2011-07-11 10:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Ingo Molnar, John Stultz, Borislav Petkov

On Sun, Jul 10, 2011 at 9:22 AM, Andy Lutomirski <luto@mit.edu> wrote:
> This code is short enough and different enough from the module
> loader that it's not worth trying to share anything.
>
> Signed-off-by: Andy Lutomirski <luto@mit.edu>
> ---
>  arch/x86/vdso/vma.c |   30 ++++++++++++++++++++++++++++++
>  1 files changed, 30 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
> index 7abd2be..ba92244 100644
> --- a/arch/x86/vdso/vma.c
> +++ b/arch/x86/vdso/vma.c
> @@ -23,11 +23,41 @@ extern unsigned short vdso_sync_cpuid;
>  static struct page **vdso_pages;
>  static unsigned vdso_size;
>
> +static void patch_vdso(void *vdso, size_t len)

I think patch_vdso should be marked with __init, since it's only
called from an __init function.  We might hit a section mismatch, and
even if we don't, marking it lets the function be removed after system
boot, saving some .text size.
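
That is, presumably just changing the prototype to:

	static void __init patch_vdso(void *vdso, size_t len)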

> +{
> +       Elf64_Ehdr *hdr = vdso;
> +       Elf64_Shdr *sechdrs, *alt_sec = 0;
> +       char *secstrings;
> +       void *alt_data;
> +       int i;
> +
> +       BUG_ON(len < sizeof(Elf64_Ehdr));
> +       BUG_ON(memcmp(hdr->e_ident, ELFMAG, SELFMAG) != 0);
> +
> +       sechdrs = (void *)hdr + hdr->e_shoff;
> +       secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
> +
> +       for (i = 1; i < hdr->e_shnum; i++) {
> +               Elf64_Shdr *shdr = &sechdrs[i];
> +               if (!strcmp(secstrings + shdr->sh_name, ".altinstructions")) {
> +                       alt_sec = shdr;
> +                       goto found;
> +               }
> +       }
> +       return;  /* nothing to patch */

If there's nothing to patch, I think it would perhaps be nice to let
the user know through printk.
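
Something like this before the return, perhaps (exact level and
wording are just a suggestion):

	printk(KERN_WARNING "vdso: no .altinstructions section found\n");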

Thanks,
Rakib.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/8] x86-64: Allow alternative patching in the vDSO
  2011-07-11 10:41   ` Rakib Mullick
@ 2011-07-11 14:21     ` Andrew Lutomirski
  0 siblings, 0 replies; 20+ messages in thread
From: Andrew Lutomirski @ 2011-07-11 14:21 UTC (permalink / raw)
  To: Rakib Mullick
  Cc: x86, linux-kernel, Ingo Molnar, John Stultz, Borislav Petkov

On Mon, Jul 11, 2011 at 4:41 AM, Rakib Mullick <rakib.mullick@gmail.com> wrote:
> On Sun, Jul 10, 2011 at 9:22 AM, Andy Lutomirski <luto@mit.edu> wrote:

>> +static void patch_vdso(void *vdso, size_t len)
>
> I think patch_vdso should be marked with __init, since it's only
> called from an __init function.  We might hit a section mismatch, and
> even if we don't, marking it lets the function be removed after system
> boot, saving some .text size.

Sure.  It won't make any difference, because my compiler at least is
smart enough to inline the whole function into init_vdso_vars, but
it's the right thing to do anyway.

>> +       return;  /* nothing to patch */
>
> If there's nothing to patch, I think it would perhaps be nice to let
> the user know through printk.

Will do.  I verified it myself by dumping the vdso image from memory
and confirming that it got patched correctly, but if the code ever
bitrots it'll be nice to have the warning.

--Andy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/8] x86-64: Improve vsyscall emulation CS and RIP handling
  2011-07-10  3:22 ` [PATCH v2 1/8] x86-64: Improve vsyscall emulation CS and RIP handling Andy Lutomirski
@ 2011-07-11 22:14   ` Borislav Petkov
  2011-07-11 22:20     ` Andrew Lutomirski
  0 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2011-07-11 22:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Ingo Molnar, John Stultz, Borislav Petkov

On Sat, Jul 09, 2011 at 11:22:08PM -0400, Andy Lutomirski wrote:
> Two fixes here:
>  - Send SIGSEGV if called from compat code or with a funny CS.
>  - Don't BUG on impossible addresses.
> 
> This patch also removes an unused variable.
> 
> Signed-off-by: Andy Lutomirski <luto@mit.edu>
> ---
>  arch/x86/include/asm/vsyscall.h |   12 --------
>  arch/x86/kernel/vsyscall_64.c   |   59 +++++++++++++++++++++++++--------------
>  2 files changed, 38 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
> index bb710cb..d555973 100644
> --- a/arch/x86/include/asm/vsyscall.h
> +++ b/arch/x86/include/asm/vsyscall.h
> @@ -31,18 +31,6 @@ extern struct timezone sys_tz;
>  
>  extern void map_vsyscall(void);
>  
> -/* Emulation */
> -
> -static inline bool is_vsyscall_entry(unsigned long addr)
> -{
> -	return (addr & ~0xC00UL) == VSYSCALL_START;
> -}
> -
> -static inline int vsyscall_entry_nr(unsigned long addr)
> -{
> -	return (addr & 0xC00UL) >> 10;
> -}
> -
>  #endif /* __KERNEL__ */
>  
>  #endif /* _ASM_X86_VSYSCALL_H */
> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
> index 10cd8ac..6d14848 100644
> --- a/arch/x86/kernel/vsyscall_64.c
> +++ b/arch/x86/kernel/vsyscall_64.c
> @@ -38,6 +38,7 @@
>  
>  #include <asm/vsyscall.h>
>  #include <asm/pgtable.h>
> +#include <asm/compat.h>
>  #include <asm/page.h>
>  #include <asm/unistd.h>
>  #include <asm/fixmap.h>
> @@ -97,33 +98,60 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
>  
>  	tsk = current;
>  
> -	printk("%s%s[%d] %s ip:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
> +	printk("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
>  	       level, tsk->comm, task_pid_nr(tsk),
> -	       message, regs->ip - 2, regs->sp, regs->ax, regs->si, regs->di);
> +	       message, regs->ip - 2, regs->cs,
> +	       regs->sp, regs->ax, regs->si, regs->di);
> +}
> +
> +static int addr_to_vsyscall_nr(unsigned *vsyscall_nr, unsigned long addr)
> +{
> +	if ((addr & ~0xC00UL) != VSYSCALL_START)
> +		return -EINVAL;
> +
> +	*vsyscall_nr = (addr & 0xC00UL) >> 10;
> +	if (*vsyscall_nr >= 3)
> +		return -EINVAL;
> +
> +	return 0;
>  }

I'm wondering: why don't you make this function return negative value on
error, i.e. -EINVAL and the vsyscall number on success so that you can
get rid of returning it through the arg pointer?

Then at the callsite you can do:

	vsyscall_nr = addr_to_vsyscall_nr(addr);
	if (vsyscall_nr < 0)
		warn_bad_vsyscall(...)

?



>  
>  void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
>  {
> -	const char *vsyscall_name;
>  	struct task_struct *tsk;
>  	unsigned long caller;
> -	int vsyscall_nr;
> +	unsigned vsyscall_nr;
>  	long ret;
>  
> -	/* Kernel code must never get here. */
> -	BUG_ON(!user_mode(regs));
> -
>  	local_irq_enable();
>  
>  	/*
> +	 * Real 64-bit user mode code has cs == __USER_CS.  Anything else
> +	 * is bogus.
> +	 */
> +	if (regs->cs != __USER_CS) {
> +		/*
> +		 * If we trapped from kernel mode, we might as well OOPS now
> +		 * instead of returning to some random address and OOPSing
> +		 * then.
> +		 */
> +		BUG_ON(!user_mode(regs));
> +
> +		/* Compat mode and non-compat 32-bit CS should both segfault. */
> +		warn_bad_vsyscall(KERN_WARNING, regs,
> +				  "illegal int 0xcc from 32-bit mode");
> +		goto sigsegv;
> +	}
> +
> +	/*
>  	 * x86-ism here: regs->ip points to the instruction after the int 0xcc,
>  	 * and int 0xcc is two bytes long.
>  	 */
> -	if (!is_vsyscall_entry(regs->ip - 2)) {
> -		warn_bad_vsyscall(KERN_WARNING, regs, "illegal int 0xcc (exploit attempt?)");
> +	if (addr_to_vsyscall_nr(&vsyscall_nr, regs->ip - 2) != 0) {
> +		warn_bad_vsyscall(KERN_WARNING, regs,
> +				  "illegal int 0xcc (exploit attempt?)");
>  		goto sigsegv;
>  	}
> -	vsyscall_nr = vsyscall_entry_nr(regs->ip - 2);
>  
>  	if (get_user(caller, (unsigned long __user *)regs->sp) != 0) {
>  		warn_bad_vsyscall(KERN_WARNING, regs, "int 0xcc with bad stack (exploit attempt?)");
> @@ -136,31 +164,20 @@ void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
>  
>  	switch (vsyscall_nr) {
>  	case 0:
> -		vsyscall_name = "gettimeofday";
>  		ret = sys_gettimeofday(
>  			(struct timeval __user *)regs->di,
>  			(struct timezone __user *)regs->si);
>  		break;
>  
>  	case 1:
> -		vsyscall_name = "time";
>  		ret = sys_time((time_t __user *)regs->di);
>  		break;
>  
>  	case 2:
> -		vsyscall_name = "getcpu";
>  		ret = sys_getcpu((unsigned __user *)regs->di,
>  				 (unsigned __user *)regs->si,
>  				 0);
>  		break;
> -
> -	default:
> -		/*
> -		 * If we get here, then vsyscall_nr indicates that int 0xcc
> -		 * happened at an address in the vsyscall page that doesn't
> -		 * contain int 0xcc.  That can't happen.
> -		 */
> -		BUG();
>  	}
>  
>  	if (ret == -EFAULT) {
> -- 
> 1.7.6

Thanks.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/8] x86-64: Improve vsyscall emulation CS and RIP handling
  2011-07-11 22:14   ` Borislav Petkov
@ 2011-07-11 22:20     ` Andrew Lutomirski
  2011-07-12  7:18       ` Borislav Petkov
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Lutomirski @ 2011-07-11 22:20 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, x86, linux-kernel, Ingo Molnar,
	John Stultz, Borislav Petkov

On Mon, Jul 11, 2011 at 6:14 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Sat, Jul 09, 2011 at 11:22:08PM -0400, Andy Lutomirski wrote:
>> Two fixes here:
>>  - Send SIGSEGV if called from compat code or with a funny CS.
>>  - Don't BUG on impossible addresses.
>>
>> This patch also removes an unused variable.
>>
>> Signed-off-by: Andy Lutomirski <luto@mit.edu>
>> ---
>>  arch/x86/include/asm/vsyscall.h |   12 --------
>>  arch/x86/kernel/vsyscall_64.c   |   59 +++++++++++++++++++++++++--------------
>>  2 files changed, 38 insertions(+), 33 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
>> index bb710cb..d555973 100644
>> --- a/arch/x86/include/asm/vsyscall.h
>> +++ b/arch/x86/include/asm/vsyscall.h
>> @@ -31,18 +31,6 @@ extern struct timezone sys_tz;
>>
>>  extern void map_vsyscall(void);
>>
>> -/* Emulation */
>> -
>> -static inline bool is_vsyscall_entry(unsigned long addr)
>> -{
>> -     return (addr & ~0xC00UL) == VSYSCALL_START;
>> -}
>> -
>> -static inline int vsyscall_entry_nr(unsigned long addr)
>> -{
>> -     return (addr & 0xC00UL) >> 10;
>> -}
>> -
>>  #endif /* __KERNEL__ */
>>
>>  #endif /* _ASM_X86_VSYSCALL_H */
>> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
>> index 10cd8ac..6d14848 100644
>> --- a/arch/x86/kernel/vsyscall_64.c
>> +++ b/arch/x86/kernel/vsyscall_64.c
>> @@ -38,6 +38,7 @@
>>
>>  #include <asm/vsyscall.h>
>>  #include <asm/pgtable.h>
>> +#include <asm/compat.h>
>>  #include <asm/page.h>
>>  #include <asm/unistd.h>
>>  #include <asm/fixmap.h>
>> @@ -97,33 +98,60 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
>>
>>       tsk = current;
>>
>> -     printk("%s%s[%d] %s ip:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
>> +     printk("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
>>              level, tsk->comm, task_pid_nr(tsk),
>> -            message, regs->ip - 2, regs->sp, regs->ax, regs->si, regs->di);
>> +            message, regs->ip - 2, regs->cs,
>> +            regs->sp, regs->ax, regs->si, regs->di);
>> +}
>> +
>> +static int addr_to_vsyscall_nr(unsigned *vsyscall_nr, unsigned long addr)
>> +{
>> +     if ((addr & ~0xC00UL) != VSYSCALL_START)
>> +             return -EINVAL;
>> +
>> +     *vsyscall_nr = (addr & 0xC00UL) >> 10;
>> +     if (*vsyscall_nr >= 3)
>> +             return -EINVAL;
>> +
>> +     return 0;
>>  }
>
> I'm wondering: why don't you make this function return negative value on
> error, i.e. -EINVAL and the vsyscall number on success so that you can
> get rid of returning it through the arg pointer?
>
> Then at the callsite you can do:
>
>        vsyscall_nr = addr_to_vsyscall_nr(addr);
>        if (vsyscall_nr < 0)
>                warn_bad_vsyscall(...)

Because I don't want a warning about ret being used without being initialized.

With the code in this patch, the compiler is smart enough to figure
out that either vsyscall_nr is 0, 1, or 2 or that the EINVAL branch is
taken.  I'll see if it works the other way.

--Andy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/8] x86-64: Improve vsyscall emulation CS and RIP handling
  2011-07-11 22:20     ` Andrew Lutomirski
@ 2011-07-12  7:18       ` Borislav Petkov
  2011-07-12 12:58         ` Andrew Lutomirski
  0 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2011-07-12  7:18 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: x86, linux-kernel, Ingo Molnar, John Stultz, Borislav Petkov

On Mon, Jul 11, 2011 at 06:20:50PM -0400, Andrew Lutomirski wrote:
> > I'm wondering: why don't you make this function return negative value on
> > error, i.e. -EINVAL and the vsyscall number on success so that you can
> > get rid of returning it through the arg pointer?
> >
> > Then at the callsite you can do:
> >
> >        vsyscall_nr = addr_to_vsyscall_nr(addr);
> >        if (vsyscall_nr < 0)
> >                warn_bad_vsyscall(...)
> 
> Because I don't want a warning about ret being used without being initialized.

not if you preinit it...

> With the code in this patch, the compiler is smart enough to figure
> out that either vsyscall_nr is 0, 1, or 2 or that the EINVAL branch is
> taken.  I'll see if it works the other way.

Here's what I mean; I changed your patch a bit:
--

From 2bb6c706cd6cf49ce5baec670bd01b53d2ea6f19 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto@MIT.EDU>
Date: Sat, 9 Jul 2011 23:22:08 -0400
Subject: [PATCH] x86-64: Improve vsyscall emulation CS and RIP handling

Two fixes here:
 - Send SIGSEGV if called from compat code or with a funny CS.
 - Don't BUG on impossible addresses.

This patch also removes an unused variable.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/include/asm/vsyscall.h |   12 -------
 arch/x86/kernel/vsyscall_64.c   |   62 +++++++++++++++++++++++++-------------
 2 files changed, 41 insertions(+), 33 deletions(-)

diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index bb710cb..d555973 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -31,18 +31,6 @@ extern struct timezone sys_tz;
 
 extern void map_vsyscall(void);
 
-/* Emulation */
-
-static inline bool is_vsyscall_entry(unsigned long addr)
-{
-	return (addr & ~0xC00UL) == VSYSCALL_START;
-}
-
-static inline int vsyscall_entry_nr(unsigned long addr)
-{
-	return (addr & 0xC00UL) >> 10;
-}
-
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_VSYSCALL_H */
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 10cd8ac..2110cf2 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -38,6 +38,7 @@
 
 #include <asm/vsyscall.h>
 #include <asm/pgtable.h>
+#include <asm/compat.h>
 #include <asm/page.h>
 #include <asm/unistd.h>
 #include <asm/fixmap.h>
@@ -97,33 +98,63 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
 
 	tsk = current;
 
-	printk("%s%s[%d] %s ip:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
+	printk("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
 	       level, tsk->comm, task_pid_nr(tsk),
-	       message, regs->ip - 2, regs->sp, regs->ax, regs->si, regs->di);
+	       message, regs->ip - 2, regs->cs,
+	       regs->sp, regs->ax, regs->si, regs->di);
+}
+
+static int addr_to_vsyscall_nr(unsigned long addr)
+{
+	int vsyscall = -EINVAL;
+
+	if ((addr & ~0xC00UL) != VSYSCALL_START)
+		return -EINVAL;
+
+	vsyscall = (addr & 0xC00UL) >> 10;
+	if (vsyscall < 0 || vsyscall >= 3)
+		return -EINVAL;
+
+	return vsyscall;
 }
 
 void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
 {
-	const char *vsyscall_name;
 	struct task_struct *tsk;
 	unsigned long caller;
 	int vsyscall_nr;
-	long ret;
-
-	/* Kernel code must never get here. */
-	BUG_ON(!user_mode(regs));
+	long ret = -EFAULT;
 
 	local_irq_enable();
 
 	/*
+	 * Real 64-bit user mode code has cs == __USER_CS.  Anything else
+	 * is bogus.
+	 */
+	if (regs->cs != __USER_CS) {
+		/*
+		 * If we trapped from kernel mode, we might as well OOPS now
+		 * instead of returning to some random address and OOPSing
+		 * then.
+		 */
+		BUG_ON(!user_mode(regs));
+
+		/* Compat mode and non-compat 32-bit CS should both segfault. */
+		warn_bad_vsyscall(KERN_WARNING, regs,
+				  "illegal int 0xcc from 32-bit mode");
+		goto sigsegv;
+	}
+
+	/*
 	 * x86-ism here: regs->ip points to the instruction after the int 0xcc,
 	 * and int 0xcc is two bytes long.
 	 */
-	if (!is_vsyscall_entry(regs->ip - 2)) {
-		warn_bad_vsyscall(KERN_WARNING, regs, "illegal int 0xcc (exploit attempt?)");
+	vsyscall_nr = addr_to_vsyscall_nr(regs->ip - 2);
+	if (vsyscall_nr < 0) {
+		warn_bad_vsyscall(KERN_WARNING, regs,
+				  "illegal int 0xcc (exploit attempt?)");
 		goto sigsegv;
 	}
-	vsyscall_nr = vsyscall_entry_nr(regs->ip - 2);
 
 	if (get_user(caller, (unsigned long __user *)regs->sp) != 0) {
 		warn_bad_vsyscall(KERN_WARNING, regs, "int 0xcc with bad stack (exploit attempt?)");
@@ -136,31 +167,20 @@ void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
 
 	switch (vsyscall_nr) {
 	case 0:
-		vsyscall_name = "gettimeofday";
 		ret = sys_gettimeofday(
 			(struct timeval __user *)regs->di,
 			(struct timezone __user *)regs->si);
 		break;
 
 	case 1:
-		vsyscall_name = "time";
 		ret = sys_time((time_t __user *)regs->di);
 		break;
 
 	case 2:
-		vsyscall_name = "getcpu";
 		ret = sys_getcpu((unsigned __user *)regs->di,
 				 (unsigned __user *)regs->si,
 				 0);
 		break;
-
-	default:
-		/*
-		 * If we get here, then vsyscall_nr indicates that int 0xcc
-		 * happened at an address in the vsyscall page that doesn't
-		 * contain int 0xcc.  That can't happen.
-		 */
-		BUG();
 	}
 
 	if (ret == -EFAULT) {
-- 
1.7.5.3.401.gfb674




-- 
Regards/Gruss,
    Boris.

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/8] x86-64: Improve vsyscall emulation CS and RIP handling
  2011-07-12  7:18       ` Borislav Petkov
@ 2011-07-12 12:58         ` Andrew Lutomirski
  2011-07-12 13:24           ` Borislav Petkov
  2011-07-13  4:33           ` H. Peter Anvin
  0 siblings, 2 replies; 20+ messages in thread
From: Andrew Lutomirski @ 2011-07-12 12:58 UTC (permalink / raw)
  To: Borislav Petkov, Andrew Lutomirski, x86, linux-kernel,
	Ingo Molnar, John Stultz, Borislav Petkov

On Tue, Jul 12, 2011 at 3:18 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, Jul 11, 2011 at 06:20:50PM -0400, Andrew Lutomirski wrote:
>> > I'm wondering: why don't you make this function return negative value on
>> > error, i.e. -EINVAL and the vsyscall number on success so that you can
>> > get rid of returning it through the arg pointer?
>> >
>> > Then at the callsite you can do:
>> >
>> >        vsyscall_nr = addr_to_vsyscall_nr(addr);
>> >        if (vsyscall_nr < 0)
>> >                warn_bad_vsyscall(...)
>>
>> Because I don't want a warning about ret being used without being initialized.
>
> not if you preinit it...

I kind of like that warning as a sanity check, and preiniting it
grates against my irrational desire to over-optimize :)

>
>> With the code in this patch, the compiler is smart enough to figure
>> out that either vsyscall_nr is 0, 1, or 2 or that the EINVAL branch is
>> taken.  I'll see if it works the other way.
>
> here's what i mean, I changed your patch a bit:

How about this:

static int addr_to_vsyscall_nr(unsigned long addr)
{
	int nr;

	if ((addr & ~0xC00UL) != VSYSCALL_START)
		return -EINVAL;

	nr = (addr & 0xC00UL) >> 10;
	if (nr >= 3)
		return -EINVAL;

	return nr;
}

...

	int vsyscall_nr;

...

	vsyscall_nr = addr_to_vsyscall_nr(regs->ip - 2);
	if (vsyscall_nr < 0) {
		warn_bad_vsyscall(KERN_WARNING, regs,
				  "illegal int 0xcc (exploit attempt?)");
		goto sigsegv;
	}

gcc 4.6 at least does not warn.


Also, IRQ disabling was still mismatched in the sigsegv path.  I'll
fix that as well.
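
Concretely, I think that just means re-disabling IRQs before leaving
the sigsegv path so it matches the local_irq_enable() on entry;
something like this (a sketch, not the final patch):

	sigsegv:
		force_sig(SIGSEGV, current);
		local_irq_disable();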

--Andy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/8] x86-64: Improve vsyscall emulation CS and RIP handling
  2011-07-12 12:58         ` Andrew Lutomirski
@ 2011-07-12 13:24           ` Borislav Petkov
  2011-07-13  4:33           ` H. Peter Anvin
  1 sibling, 0 replies; 20+ messages in thread
From: Borislav Petkov @ 2011-07-12 13:24 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Borislav Petkov, x86, linux-kernel, Ingo Molnar, John Stultz,
	Borislav Petkov

On Tue, Jul 12, 2011 at 08:58:58AM -0400, Andrew Lutomirski wrote:
> On Tue, Jul 12, 2011 at 3:18 AM, Borislav Petkov <bp@alien8.de> wrote:
> > On Mon, Jul 11, 2011 at 06:20:50PM -0400, Andrew Lutomirski wrote:
> >> > I'm wondering: why don't you make this function return negative value on
> >> > error, i.e. -EINVAL and the vsyscall number on success so that you can
> >> > get rid of returning it through the arg pointer?
> >> >
> >> > Then at the callsite you can do:
> >> >
> >> >        vsyscall_nr = addr_to_vsyscall_nr(addr);
> >> >        if (vsyscall_nr < 0)
> >> >                warn_bad_vsyscall(...)
> >>
> >> Because I don't want a warning about ret being used without being initialized.
> >
> > not if you preinit it...
> 
> I kind of like that warning as a sanity check, and preiniting it
> grates against my irrational desire to over-optimize :)

:-)

> >
> >> With the code in this patch, the compiler is smart enough to figure
> >> out that either vsyscall_nr is 0, 1, or 2 or that the EINVAL branch is
> >> taken.  I'll see if it works the other way.
> >
> > here's what i mean, I changed your patch a bit:
> 
> How about this:
> 
> static int addr_to_vsyscall_nr(unsigned long addr)
> {
> 	int nr;
> 
> 	if ((addr & ~0xC00UL) != VSYSCALL_START)
> 		return -EINVAL;
> 
> 	nr = (addr & 0xC00UL) >> 10;
> 	if (nr >= 3)
> 		return -EINVAL;
> 
> 	return nr;
> }
> 
> ...
> 
> 	int vsyscall_nr;
> 
> ...
> 
> 	vsyscall_nr = addr_to_vsyscall_nr(regs->ip - 2);
> 	if (vsyscall_nr < 0) {
> 		warn_bad_vsyscall(KERN_WARNING, regs,
> 				  "illegal int 0xcc (exploit attempt?)");
> 		goto sigsegv;
> 	}
> 
> gcc 4.6 at least does not warn.

Yep, looks good.

> Also, IRQ disabling was still mismatched in the sigsegv path.  I'll
> fix that as well.

oh yeah.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/8] x86-64: Improve vsyscall emulation CS and RIP handling
  2011-07-12 12:58         ` Andrew Lutomirski
  2011-07-12 13:24           ` Borislav Petkov
@ 2011-07-13  4:33           ` H. Peter Anvin
  2011-07-13 13:17             ` Andrew Lutomirski
  1 sibling, 1 reply; 20+ messages in thread
From: H. Peter Anvin @ 2011-07-13  4:33 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Borislav Petkov, x86, linux-kernel, Ingo Molnar, John Stultz,
	Borislav Petkov

On 07/12/2011 05:58 AM, Andrew Lutomirski wrote:
> 
> Also, IRQ disabling was still mismatched in the sigsegv path.  I'll
> fix that as well.
> 

Just to make sure we're clear, we're still waiting for a new version of
this patch, right?

	-hpa

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/8] x86-64: Improve vsyscall emulation CS and RIP handling
  2011-07-13  4:33           ` H. Peter Anvin
@ 2011-07-13 13:17             ` Andrew Lutomirski
  0 siblings, 0 replies; 20+ messages in thread
From: Andrew Lutomirski @ 2011-07-13 13:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, x86, linux-kernel, Ingo Molnar, John Stultz,
	Borislav Petkov

On Wed, Jul 13, 2011 at 12:33 AM, H. Peter Anvin <hpa@kernel.org> wrote:
> On 07/12/2011 05:58 AM, Andrew Lutomirski wrote:
>>
>> Also, IRQ disabling was still mismatched in the sigsegv path.  I'll
>> fix that as well.
>>
>
> Just to make sure we're clear, we're still waiting for a new version of
> this patch, right?

Yes.  I'm testing it right now and then I'll email it out.

--Andy

>
>        -hpa
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2011-07-13 13:18 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-10  3:22 [PATCH v2 0/8] x86-64 vDSO changes for 3.1 Andy Lutomirski
2011-07-10  3:22 ` [PATCH v2 1/8] x86-64: Improve vsyscall emulation CS and RIP handling Andy Lutomirski
2011-07-11 22:14   ` Borislav Petkov
2011-07-11 22:20     ` Andrew Lutomirski
2011-07-12  7:18       ` Borislav Petkov
2011-07-12 12:58         ` Andrew Lutomirski
2011-07-12 13:24           ` Borislav Petkov
2011-07-13  4:33           ` H. Peter Anvin
2011-07-13 13:17             ` Andrew Lutomirski
2011-07-10  3:22 ` [PATCH v2 2/8] x86: Make alternative instruction pointers relative Andy Lutomirski
2011-07-10  3:22 ` [PATCH v2 3/8] x86-64: Allow alternative patching in the vDSO Andy Lutomirski
2011-07-11 10:41   ` Rakib Mullick
2011-07-11 14:21     ` Andrew Lutomirski
2011-07-10  3:22 ` [PATCH v2 4/8] x86-64: Add --no-undefined to vDSO build Andy Lutomirski
2011-07-10  3:22 ` [PATCH v2 5/8] clocksource: Replace vread with generic arch data Andy Lutomirski
2011-07-10  3:22   ` Andy Lutomirski
2011-07-10  3:22 ` [PATCH v2 6/8] x86-64: Move vread_tsc and vread_hpet into the vDSO Andy Lutomirski
2011-07-10  3:22 ` [PATCH v2 7/8] ia64: Replace clocksource.fsys_mmio with generic arch data Andy Lutomirski
2011-07-10  3:22   ` Andy Lutomirski
2011-07-10  3:22 ` [PATCH v2 8/8] Document the vDSO and add a reference parser Andy Lutomirski
