* [PATCH 0/2] i386: espfix fixes
@ 2009-06-07 12:42 Alexander van Heukelum
  2009-06-07 12:42 ` [PATCH 1/2] i386: fix return to 16-bit stack from NMI handler Alexander van Heukelum
From: Alexander van Heukelum @ 2009-06-07 12:42 UTC (permalink / raw)
  To: LKML, H. Peter Anvin
  Cc: Ingo Molnar, Thomas Gleixner, Andi Kleen, Cyrill Gorcunov,
	Alexander van Heukelum

Hi Peter,

Here are the two patches I said I would send... The first one is a fix 
to include espfix code in the NMI entry/exit code, the second one fixes 
and cleans up the espfix code itself. Test boots were done using the
tip tree plus both patches.

The first patch applies cleanly to mainline too; the second has a
trivial reject there, so I have included a mainline version of it as well.

Thanks for having a look at them!

Greetings,
	Alexander

 arch/x86/include/asm/desc.h  |   26 -----------------
 arch/x86/kernel/cpu/common.c |    2 +-
 arch/x86/kernel/entry_32.S   |   64 +++++++++++++++++++++++++++--------------
 arch/x86/kernel/head_32.S    |    1 -
 arch/x86/kernel/traps.c      |   21 --------------
 5 files changed, 43 insertions(+), 71 deletions(-)


* [PATCH 1/2] i386: fix return to 16-bit stack from NMI handler
  2009-06-07 12:42 [PATCH 0/2] i386: espfix fixes Alexander van Heukelum
@ 2009-06-07 12:42 ` Alexander van Heukelum
  2009-06-07 12:42 ` [PATCH 2/2, tip] i386: Fix/simplify espfix, move it into assembly Alexander van Heukelum
From: Alexander van Heukelum @ 2009-06-07 12:42 UTC (permalink / raw)
  To: LKML, H. Peter Anvin
  Cc: Ingo Molnar, Thomas Gleixner, Andi Kleen, Cyrill Gorcunov,
	Alexander van Heukelum, H. Peter Anvin

Returning to a task with a 16-bit stack requires special care: the iret 
instruction does not restore the high word of esp in that case. The 
espfix code fixes this, but currently is not invoked on NMIs. This means 
that a running task gets the upper word of esp clobbered due to intervening 
NMIs. To reproduce, compile and run the following program with the nmi 
watchdog enabled (nmi_watchdog=2 on the command line). Using gdb you can 
see that the high bits of esp contain garbage, while the low bits are 
still correct.

This patch puts the espfix code back into the NMI code path.

The patch is slightly complicated because the irqtrace infrastructure is 
not NMI-safe: the NMI return path cannot call TRACE_IRQS_IRET. Apart from 
that, the tail of the normal iret code is correct for the NMI code path 
too. To be able to share this code path, TRACE_IRQS_IRET was moved up a 
bit. The espfix code now runs after TRACE_IRQS_IRET, but it explicitly 
disables interrupts; this short interrupts-off section is simply no 
longer traced. The return-to-kernel path now always includes the 
preliminary test that decides whether the espfix code should be called. 
This is never the case, but doing it this way keeps the patch as simple 
as possible, and the few extra instructions should not affect timing in 
any significant way.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/ldt.h>

int modify_ldt(int func, void *ptr, unsigned long bytecount)
{
        return syscall(SYS_modify_ldt, func, ptr, bytecount);
}

/* this is assumed to be usable */
#define SEGBASEADDR 0x10000
#define SEGLIMIT 0x20000

/* 16-bit segment */
struct user_desc desc = {
        .entry_number = 0,
        .base_addr = SEGBASEADDR,
        .limit = SEGLIMIT,
        .seg_32bit = 0,
        .contents = 0, /* ??? */
        .read_exec_only = 0,
        .limit_in_pages = 0,
        .seg_not_present = 0,
        .useable = 1
};

int main(void)
{
        setvbuf(stdout, NULL, _IONBF, 0);

        /* map a 64 kb segment */
        char *pointer = mmap((void *)SEGBASEADDR, SEGLIMIT+1,
                        PROT_EXEC|PROT_READ|PROT_WRITE,
                        MAP_SHARED|MAP_ANONYMOUS, -1, 0);
        if (pointer == MAP_FAILED) {
                printf("could not map space\n");
                return 0;
        }

        /* write ldt, new mode */
        int err = modify_ldt(0x11, &desc, sizeof(desc));
        if (err) {
                printf("error modifying ldt: %i\n", err);
                return 0;
        }

        for (int i=0; i<1000; i++) {
        asm volatile (
                "pusha\n\t"
                "mov %ss, %eax\n\t" /* preserve ss:esp */
                "mov %esp, %ebp\n\t"
                "push $7\n\t" /* index 0, ldt, user mode */
                "push $65536-4096\n\t" /* esp */
                "lss (%esp), %esp\n\t" /* switch to new stack */
                "push %eax\n\t" /* save old ss:esp on new stack */
                "push %ebp\n\t"
		"add $17*65536, %esp\n\t" /* set high bits */
                "mov %esp, %edx\n\t"

                "mov $10000000, %ecx\n\t" /* wait... */
                "1: loop 1b\n\t" /* ... a bit */

                "cmp %esp, %edx\n\t"
                "je 1f\n\t"
                "ud2\n\t" /* esp changed inexplicably! */
                "1:\n\t"
		"sub $17*65536, %esp\n\t" /* restore high bits */
                "lss (%esp), %esp\n\t" /* restore old ss:esp */
                "popa\n\t");

                printf("\rx%ix", i);
        }

        return 0;
}
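
(For reference: the reproducer needs a 32-bit build and C99 mode for the
loop declaration; something along the lines of
"gcc -m32 -std=gnu99 -o espfix-test test.c" should work. The exact file
name and flags are my guess, not part of the original report.)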

The bug was introduced with 55f327fa9e876758491a82af7491104f1cc3fc4d
("lockdep: irqtrace subsystem, i386 support").

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>

---
 arch/x86/kernel/entry_32.S |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c929add..d7d1c7d 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -84,7 +84,7 @@
 #define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
 #define preempt_stop(clobbers)
-#define resume_kernel		restore_nocheck
+#define resume_kernel		restore_all
 #endif
 
 .macro TRACE_IRQS_IRET
@@ -372,7 +372,7 @@ END(ret_from_exception)
 ENTRY(resume_kernel)
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	cmpl $0,TI_preempt_count(%ebp)	# non-zero preempt_count ?
-	jnz restore_nocheck
+	jnz restore_all
 need_resched:
 	movl TI_flags(%ebp), %ecx	# need_resched set ?
 	testb $_TIF_NEED_RESCHED, %cl
@@ -540,6 +540,8 @@ syscall_exit:
 	jne syscall_exit_work
 
 restore_all:
+	TRACE_IRQS_IRET
+restore_all_notrace:
 	movl PT_EFLAGS(%esp), %eax	# mix EFLAGS, SS and CS
 	# Warning: PT_OLDSS(%esp) contains the wrong/random values if we
 	# are returning to the kernel.
@@ -551,8 +553,6 @@ restore_all:
 	CFI_REMEMBER_STATE
 	je ldt_ss			# returning to user-space with LDT SS
 restore_nocheck:
-	TRACE_IRQS_IRET
-restore_nocheck_notrace:
 	RESTORE_REGS 4			# skip orig_eax/error_code
 	CFI_ADJUST_CFA_OFFSET -4
 irq_return:
@@ -601,8 +601,10 @@ ldt_ss:
 	CFI_ADJUST_CFA_OFFSET 4
 	pushl %eax
 	CFI_ADJUST_CFA_OFFSET 4
+	/* Disable interrupts, but do not irqtrace this section: we
+	 * will soon execute iret and the tracer was already set to
+	 * the irqstate after the iret */
 	DISABLE_INTERRUPTS(CLBR_EAX)
-	TRACE_IRQS_OFF
 	lss (%esp), %esp
 	CFI_ADJUST_CFA_OFFSET -8
 	jmp restore_nocheck
@@ -1329,7 +1331,7 @@ nmi_stack_correct:
 	xorl %edx,%edx		# zero error code
 	movl %esp,%eax		# pt_regs pointer
 	call do_nmi
-	jmp restore_nocheck_notrace
+	jmp restore_all_notrace
 	CFI_ENDPROC
 
 nmi_stack_fixup:
-- 
1.6.0.4



* [PATCH 2/2, tip] i386: Fix/simplify espfix, move it into assembly
  2009-06-07 12:42 [PATCH 0/2] i386: espfix fixes Alexander van Heukelum
  2009-06-07 12:42 ` [PATCH 1/2] i386: fix return to 16-bit stack from NMI handler Alexander van Heukelum
@ 2009-06-07 12:42 ` Alexander van Heukelum
  2009-06-07 12:42 ` [PATCH 2/2, mainline] " Alexander van Heukelum
       [not found] ` <1245182594.27756.1320705931@webmail.messagingengine.com>
From: Alexander van Heukelum @ 2009-06-07 12:42 UTC (permalink / raw)
  To: LKML, H. Peter Anvin
  Cc: Ingo Molnar, Thomas Gleixner, Andi Kleen, Cyrill Gorcunov,
	Alexander van Heukelum, H. Peter Anvin

The espfix code triggers if we have a protected mode userspace 
application with a 16-bit stack. On returning to userspace, the CPU 
doesn't restore the high word of the stack pointer. This is an 
"official" bug, and the work-around used in the kernel is to temporarily 
switch to a 32-bit stack segment/pointer pair where the high word of the 
pointer is equal to the high word of the userspace stack pointer, just 
before doing the iret.
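
As a concrete illustration (a sketch with made-up names, not kernel code),
the effect on ESP of an iret to a 16-bit stack segment is roughly:

	/* only the low 16 bits of ESP are reloaded from the iret frame;
	 * the high word keeps whatever the kernel happened to leave there */
	esp_after_iret = (esp_in_kernel & 0xffff0000) | (user_sp & 0x0000ffff);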

The current implementation uses THREAD_SIZE to determine the cut-off, 
but there is no good reason not to use the natural 1<<16... However, 
implementing this by simply substituting THREAD_SIZE with 1<<16 in 
patch_espfix_desc crashed the test application. patch_espfix_desc tries 
to do what is described above, but gets it subtly wrong if the userspace 
stack pointer is just below a multiple of THREAD_SIZE: an overflow into 
bit 13 occurs. With a bit of luck, when the kernelspace stack pointer is 
just below a 64kb boundary, the overflow then ripples through to bit 16 
and the userspace stack pointer ends up changed by 65536. Changing the 
cut-off to bit 16 just made the problem occur reliably.

This patch moves all espfix code into entry_32.S. Selecting a 16-bit 
cut-off simplifies the code enormously. The game with changing the limit 
dynamically is removed too. It complicates matters and I see no value in 
it. Changing only the top 16-bit word of ESP is one instruction, and it 
also implies that only two bytes of the ESPFIX GDT entry need to be 
changed; this can be implemented in just a handful of easy-to-understand 
instructions. As a side effect, the operation to compute the 
original ESP from the ESPFIX ESP and the GDT entry simplifies a bit too, 
and the remaining three instructions have been expanded inline in 
entry_32.S.
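
As a rough guide to the new sequence at ldt_ss, here is a C model of the
computation (a sketch only; kernel_esp and userspace_esp are illustrative
names for the values the assembly in the diff below reads from %esp and
PT_OLDESP(%esp)):

	unsigned long new_kesp = (userspace_esp & 0xffff0000)
	                       | (kernel_esp   & 0x0000ffff);
	unsigned long offset   = kernel_esp - new_kesp;  /* low word is 0 */
	/* offset >> 16 is written into bits 16..31 of the ESPFIX GDT base;
	 * SS:ESP is then loaded with __ESPFIX_SS:new_kesp just before the
	 * iret, so base + new_kesp still addresses the kernel stack while
	 * the high word of ESP now matches the userspace ESP. */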

Follow-up cleanup:
 - GET_DESC_BASE macro removed
 - #include <asm/desc.h> removed from head_32.S and entry_32.S
 - arch/x86/include/asm/desc.h de-ASSEMBLY-ized
 - patch_espfix_desc() removed from arch/x86/kernel/traps.c 

impact: can now run with ESP=xxxxfffc on 16-bit stack segment

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>

---
 arch/x86/include/asm/desc.h  |   26 ---------------------
 arch/x86/kernel/cpu/common.c |    2 +-
 arch/x86/kernel/entry_32.S   |   50 ++++++++++++++++++++++++++++-------------
 arch/x86/kernel/head_32.S    |    1 -
 arch/x86/kernel/traps.c      |   21 -----------------
 5 files changed, 35 insertions(+), 65 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index c45f415..c993e9e 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -1,7 +1,6 @@
 #ifndef _ASM_X86_DESC_H
 #define _ASM_X86_DESC_H
 
-#ifndef __ASSEMBLY__
 #include <asm/desc_defs.h>
 #include <asm/ldt.h>
 #include <asm/mmu.h>
@@ -380,29 +379,4 @@ static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
 	_set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
 }
 
-#else
-/*
- * GET_DESC_BASE reads the descriptor base of the specified segment.
- *
- * Args:
- *    idx - descriptor index
- *    gdt - GDT pointer
- *    base - 32bit register to which the base will be written
- *    lo_w - lo word of the "base" register
- *    lo_b - lo byte of the "base" register
- *    hi_b - hi byte of the low word of the "base" register
- *
- * Example:
- *    GET_DESC_BASE(GDT_ENTRY_ESPFIX_SS, %ebx, %eax, %ax, %al, %ah)
- *    Will read the base address of GDT_ENTRY_ESPFIX_SS and put it into %eax.
- */
-#define GET_DESC_BASE(idx, gdt, base, lo_w, lo_b, hi_b) \
-	movb idx * 8 + 4(gdt), lo_b;			\
-	movb idx * 8 + 7(gdt), hi_b;			\
-	shll $16, base;					\
-	movw idx * 8 + 2(gdt), lo_w;
-
-
-#endif /* __ASSEMBLY__ */
-
 #endif /* _ASM_X86_DESC_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3ffdcfa..9b95c2f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -108,7 +108,7 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
 	/* data */
 	[GDT_ENTRY_APMBIOS_BASE+2]	= { { { 0x0000ffff, 0x00409200 } } },
 
-	[GDT_ENTRY_ESPFIX_SS]		= { { { 0x00000000, 0x00c09200 } } },
+	[GDT_ENTRY_ESPFIX_SS]		= { { { 0x0000ffff, 0x00cf9200 } } },
 	[GDT_ENTRY_PERCPU]		= { { { 0x0000ffff, 0x00cf9200 } } },
 	GDT_STACK_CANARY_INIT
 #endif
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index d7d1c7d..96f9942 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -48,7 +48,6 @@
 #include <asm/segment.h>
 #include <asm/smp.h>
 #include <asm/page_types.h>
-#include <asm/desc.h>
 #include <asm/percpu.h>
 #include <asm/dwarf2.h>
 #include <asm/processor-flags.h>
@@ -588,24 +587,34 @@ ldt_ss:
 	jne restore_nocheck
 #endif
 
-	/* If returning to userspace with 16bit stack,
-	 * try to fix the higher word of ESP, as the CPU
-	 * won't restore it.
-	 * This is an "official" bug of all the x86-compatible
-	 * CPUs, which we can try to work around to make
-	 * dosemu and wine happy. */
-	movl PT_OLDESP(%esp), %eax
-	movl %esp, %edx
-	call patch_espfix_desc
+/*
+ * Setup and switch to ESPFIX stack
+ *
+ * We're returning to userspace with a 16 bit stack. The CPU will not
+ * restore the high word of ESP for us on executing iret... This is an
+ * "official" bug of all the x86-compatible CPUs, which we can work
+ * around to make dosemu and wine happy. We do this by preloading the
+ * high word of ESP with the high word of the userspace ESP while
+ * compensating for the offset by changing to the ESPFIX segment with
+ * a base address that matches for the difference.
+ */
+	mov %esp, %edx			/* load kernel esp */
+	mov PT_OLDESP(%esp), %eax	/* load userspace esp */
+	mov %dx, %ax			/* eax: new kernel esp */
+	sub %eax, %edx			/* offset (low word is 0) */
+	PER_CPU(gdt_page, %ebx)
+	shr $16, %edx
+	mov %dl, GDT_ENTRY_ESPFIX_SS * 8 + 4(%ebx) /* bits 16..23 */
+	mov %dh, GDT_ENTRY_ESPFIX_SS * 8 + 7(%ebx) /* bits 24..31 */
 	pushl $__ESPFIX_SS
 	CFI_ADJUST_CFA_OFFSET 4
-	pushl %eax
+	push %eax			/* new kernel esp */
 	CFI_ADJUST_CFA_OFFSET 4
 	/* Disable interrupts, but do not irqtrace this section: we
 	 * will soon execute iret and the tracer was already set to
 	 * the irqstate after the iret */
 	DISABLE_INTERRUPTS(CLBR_EAX)
-	lss (%esp), %esp
+	lss (%esp), %esp		/* switch to espfix segment */
 	CFI_ADJUST_CFA_OFFSET -8
 	jmp restore_nocheck
 	CFI_ENDPROC
@@ -718,15 +727,24 @@ PTREGSCALL(vm86)
 PTREGSCALL(vm86old)
 
 .macro FIXUP_ESPFIX_STACK
-	/* since we are on a wrong stack, we cant make it a C code :( */
+/*
+ * Switch back from the ESPFIX stack to the normal zero-based stack
+ *
+ * We can't call C functions using the ESPFIX stack. This code reads
+ * the high word of the segment base from the GDT and switches to the
+ * normal stack and adjusts ESP with the matching offset.
+ */
+	/* fixup the stack */
 	PER_CPU(gdt_page, %ebx)
-	GET_DESC_BASE(GDT_ENTRY_ESPFIX_SS, %ebx, %eax, %ax, %al, %ah)
-	addl %esp, %eax
+	mov GDT_ENTRY_ESPFIX_SS * 8 + 4(%ebx), %al /* bits 16..23 */
+	mov GDT_ENTRY_ESPFIX_SS * 8 + 7(%ebx), %ah /* bits 24..31 */
+	shl $16, %eax
+	addl %esp, %eax			/* the adjusted stack pointer */
 	pushl $__KERNEL_DS
 	CFI_ADJUST_CFA_OFFSET 4
 	pushl %eax
 	CFI_ADJUST_CFA_OFFSET 4
-	lss (%esp), %esp
+	lss (%esp), %esp		/* switch to the normal stack segment */
 	CFI_ADJUST_CFA_OFFSET -8
 .endm
 .macro UNWIND_ESPFIX_STACK
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index dc5ed4b..8663afb 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -13,7 +13,6 @@
 #include <asm/segment.h>
 #include <asm/page_types.h>
 #include <asm/pgtable_types.h>
-#include <asm/desc.h>
 #include <asm/cache.h>
 #include <asm/thread_info.h>
 #include <asm/asm-offsets.h>
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c8a7f87..4bd1f6f 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -763,27 +763,6 @@ do_spurious_interrupt_bug(struct pt_regs *regs, long error_code)
 #endif
 }
 
-#ifdef CONFIG_X86_32
-unsigned long patch_espfix_desc(unsigned long uesp, unsigned long kesp)
-{
-	struct desc_struct *gdt = get_cpu_gdt_table(smp_processor_id());
-	unsigned long base = (kesp - uesp) & -THREAD_SIZE;
-	unsigned long new_kesp = kesp - base;
-	unsigned long lim_pages = (new_kesp | (THREAD_SIZE - 1)) >> PAGE_SHIFT;
-	__u64 desc = *(__u64 *)&gdt[GDT_ENTRY_ESPFIX_SS];
-
-	/* Set up base for espfix segment */
-	desc &= 0x00f0ff0000000000ULL;
-	desc |=	((((__u64)base) << 16) & 0x000000ffffff0000ULL) |
-		((((__u64)base) << 32) & 0xff00000000000000ULL) |
-		((((__u64)lim_pages) << 32) & 0x000f000000000000ULL) |
-		(lim_pages & 0xffff);
-	*(__u64 *)&gdt[GDT_ENTRY_ESPFIX_SS] = desc;
-
-	return new_kesp;
-}
-#endif
-
 asmlinkage void __attribute__((weak)) smp_thermal_interrupt(void)
 {
 }
-- 
1.6.0.4



* [PATCH 2/2, mainline] i386: Fix/simplify espfix, move it into assembly
  2009-06-07 12:42 [PATCH 0/2] i386: espfix fixes Alexander van Heukelum
  2009-06-07 12:42 ` [PATCH 1/2] i386: fix return to 16-bit stack from NMI handler Alexander van Heukelum
  2009-06-07 12:42 ` [PATCH 2/2, tip] i386: Fix/simplify espfix, move it into assembly Alexander van Heukelum
@ 2009-06-07 12:42 ` Alexander van Heukelum
       [not found] ` <1245182594.27756.1320705931@webmail.messagingengine.com>
From: Alexander van Heukelum @ 2009-06-07 12:42 UTC (permalink / raw)
  To: LKML, H. Peter Anvin
  Cc: Ingo Molnar, Thomas Gleixner, Andi Kleen, Cyrill Gorcunov,
	Alexander van Heukelum

The espfix code triggers if we have a protected mode userspace
application with a 16-bit stack. On returning to userspace, the CPU
doesn't restore the high word of the stack pointer. This is an
"official" bug, and the work-around used in the kernel is to temporarily
switch to a 32-bit stack segment/pointer pair where the high word of the
pointer is equal to the high word of the userspace stack pointer, just
before doing the iret.

The current implementation uses THREAD_SIZE to determine the cut-off,
but there is no good reason not to use the natural 1<<16... However,
implementing this by simply substituting THREAD_SIZE with 1<<16 in
patch_espfix_desc crashed the test application. patch_espfix_desc tries
to do what is described above, but gets it subtly wrong if the userspace
stack pointer is just below a multiple of THREAD_SIZE: an overflow into
bit 13 occurs. With a bit of luck, when the kernelspace stack pointer is
just below a 64kb boundary, the overflow then ripples through to bit 16
and the userspace stack pointer ends up changed by 65536. Changing the
cut-off to bit 16 just made the problem occur reliably.

This patch moves all espfix code into entry_32.S. Selecting a 16-bit
cut-off simplifies the code enormously. The game with changing the limit
dynamically is removed too. It complicates matters and I see no value in
it. Changing only the top 16-bit word of ESP is one instruction, and it
also implies that only two bytes of the ESPFIX GDT entry need to be
changed; this can be implemented in just a handful of easy-to-understand
instructions. As a side effect, the operation to compute the
original ESP from the ESPFIX ESP and the GDT entry simplifies a bit too,
and the remaining three instructions have been expanded inline in
entry_32.S.

Follow-up cleanup:
 - GET_DESC_BASE macro removed
 - #include <asm/desc.h> removed from head_32.S and entry_32.S
 - arch/x86/include/asm/desc.h de-ASSEMBLY-ized
 - patch_espfix_desc() removed from arch/x86/kernel/traps.c

impact: can now run with ESP=xxxxfffc on 16-bit stack segment

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>

---
 arch/x86/include/asm/desc.h  |   26 ---------------------
 arch/x86/kernel/cpu/common.c |    2 +-
 arch/x86/kernel/entry_32.S   |   50 ++++++++++++++++++++++++++++-------------
 arch/x86/kernel/head_32.S    |    1 -
 arch/x86/kernel/traps.c      |   21 +----------------
 5 files changed, 36 insertions(+), 64 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index c45f415..c993e9e 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -1,7 +1,6 @@
 #ifndef _ASM_X86_DESC_H
 #define _ASM_X86_DESC_H
 
-#ifndef __ASSEMBLY__
 #include <asm/desc_defs.h>
 #include <asm/ldt.h>
 #include <asm/mmu.h>
@@ -380,29 +379,4 @@ static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
 	_set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
 }
 
-#else
-/*
- * GET_DESC_BASE reads the descriptor base of the specified segment.
- *
- * Args:
- *    idx - descriptor index
- *    gdt - GDT pointer
- *    base - 32bit register to which the base will be written
- *    lo_w - lo word of the "base" register
- *    lo_b - lo byte of the "base" register
- *    hi_b - hi byte of the low word of the "base" register
- *
- * Example:
- *    GET_DESC_BASE(GDT_ENTRY_ESPFIX_SS, %ebx, %eax, %ax, %al, %ah)
- *    Will read the base address of GDT_ENTRY_ESPFIX_SS and put it into %eax.
- */
-#define GET_DESC_BASE(idx, gdt, base, lo_w, lo_b, hi_b) \
-	movb idx * 8 + 4(gdt), lo_b;			\
-	movb idx * 8 + 7(gdt), hi_b;			\
-	shll $16, base;					\
-	movw idx * 8 + 2(gdt), lo_w;
-
-
-#endif /* __ASSEMBLY__ */
-
 #endif /* _ASM_X86_DESC_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 77848d9..db6a863 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -107,7 +107,7 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
 	/* data */
 	[GDT_ENTRY_APMBIOS_BASE+2]	= { { { 0x0000ffff, 0x00409200 } } },
 
-	[GDT_ENTRY_ESPFIX_SS]		= { { { 0x00000000, 0x00c09200 } } },
+	[GDT_ENTRY_ESPFIX_SS]		= { { { 0x0000ffff, 0x00cf9200 } } },
 	[GDT_ENTRY_PERCPU]		= { { { 0x0000ffff, 0x00cf9200 } } },
 	GDT_STACK_CANARY_INIT
 #endif
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index d7d1c7d..9f8ce77 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -48,7 +48,6 @@
 #include <asm/segment.h>
 #include <asm/smp.h>
 #include <asm/page_types.h>
-#include <asm/desc.h>
 #include <asm/percpu.h>
 #include <asm/dwarf2.h>
 #include <asm/processor-flags.h>
@@ -588,24 +587,34 @@ ldt_ss:
 	jne restore_nocheck
 #endif
 
-	/* If returning to userspace with 16bit stack,
-	 * try to fix the higher word of ESP, as the CPU
-	 * won't restore it.
-	 * This is an "official" bug of all the x86-compatible
-	 * CPUs, which we can try to work around to make
-	 * dosemu and wine happy. */
-	movl PT_OLDESP(%esp), %eax
-	movl %esp, %edx
-	call patch_espfix_desc
+/*
+ * Setup and switch to ESPFIX stack
+ *
+ * We're returning to userspace with a 16 bit stack. The CPU will not
+ * restore the high word of ESP for us on executing iret... This is an
+ * "official" bug of all the x86-compatible CPUs, which we can work
+ * around to make dosemu and wine happy. We do this by preloading the
+ * high word of ESP with the high word of the userspace ESP while
+ * compensating for the offset by changing to the ESPFIX segment with
+ * a base address that matches for the difference.
+ */
+	mov %esp, %edx			/* load kernel esp */
+	mov PT_OLDESP(%esp), %eax	/* load userspace esp */
+	mov %dx, %ax			/* eax: new kernel esp */
+	sub %eax, %edx			/* offset (low word is 0) */
+	PER_CPU(gdt_page, %ebx)
+	shr $16, %edx
+	mov %dl, GDT_ENTRY_ESPFIX_SS * 8 + 4(%ebx) /* bits 16..23 */
+	mov %dh, GDT_ENTRY_ESPFIX_SS * 8 + 7(%ebx) /* bits 24..31 */
 	pushl $__ESPFIX_SS
 	CFI_ADJUST_CFA_OFFSET 4
-	pushl %eax
+	push %eax			/* new kernel esp */
 	CFI_ADJUST_CFA_OFFSET 4
 	/* Disable interrupts, but do not irqtrace this section: we
 	 * will soon execute iret and the tracer was already set to
 	 * the irqstate after the iret */
 	DISABLE_INTERRUPTS(CLBR_EAX)
-	lss (%esp), %esp
+	lss (%esp), %esp		/* switch to espfix segment */
 	CFI_ADJUST_CFA_OFFSET -8
 	jmp restore_nocheck
 	CFI_ENDPROC
@@ -718,15 +727,24 @@ PTREGSCALL(vm86)
 PTREGSCALL(vm86old)
 
 .macro FIXUP_ESPFIX_STACK
-	/* since we are on a wrong stack, we cant make it a C code :( */
+/*
+ * Switch back from the ESPFIX stack to the normal zero-based stack
+ *
+ * We can't call C functions using the ESPFIX stack. This code reads
+ * the high word of the segment base from the GDT and switches to the
+ * normal stack and adjusts ESP with the matching offset.
+ */
+	/* fixup the stack */
 	PER_CPU(gdt_page, %ebx)
-	GET_DESC_BASE(GDT_ENTRY_ESPFIX_SS, %ebx, %eax, %ax, %al, %ah)
-	addl %esp, %eax
+	mov GDT_ENTRY_ESPFIX_SS * 8 + 4(%ebx), %al /* bits 16..23 */
+	mov GDT_ENTRY_ESPFIX_SS * 8 + 7(%ebx), %ah /* bits 24..31 */
+	shl $16, %eax
+	addl %esp, %eax			/* the adjusted stack pointer */
 	pushl $__KERNEL_DS
 	CFI_ADJUST_CFA_OFFSET 4
 	pushl %eax
 	CFI_ADJUST_CFA_OFFSET 4
-	lss (%esp), %esp
+	lss (%esp), %esp		/* switch to the normal stack segment */
 	CFI_ADJUST_CFA_OFFSET -8
 .endm
 .macro UNWIND_ESPFIX_STACK
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 3068388..ebce401 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -13,7 +13,6 @@
 #include <asm/segment.h>
 #include <asm/page_types.h>
 #include <asm/pgtable_types.h>
-#include <asm/desc.h>
 #include <asm/cache.h>
 #include <asm/thread_info.h>
 #include <asm/asm-offsets.h>
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a1d2883..e18f24a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -779,26 +779,7 @@ do_spurious_interrupt_bug(struct pt_regs *regs, long error_code)
 #endif
 }
 
-#ifdef CONFIG_X86_32
-unsigned long patch_espfix_desc(unsigned long uesp, unsigned long kesp)
-{
-	struct desc_struct *gdt = get_cpu_gdt_table(smp_processor_id());
-	unsigned long base = (kesp - uesp) & -THREAD_SIZE;
-	unsigned long new_kesp = kesp - base;
-	unsigned long lim_pages = (new_kesp | (THREAD_SIZE - 1)) >> PAGE_SHIFT;
-	__u64 desc = *(__u64 *)&gdt[GDT_ENTRY_ESPFIX_SS];
-
-	/* Set up base for espfix segment */
-	desc &= 0x00f0ff0000000000ULL;
-	desc |=	((((__u64)base) << 16) & 0x000000ffffff0000ULL) |
-		((((__u64)base) << 32) & 0xff00000000000000ULL) |
-		((((__u64)lim_pages) << 32) & 0x000f000000000000ULL) |
-		(lim_pages & 0xffff);
-	*(__u64 *)&gdt[GDT_ENTRY_ESPFIX_SS] = desc;
-
-	return new_kesp;
-}
-#else
+#ifdef CONFIG_X86_64
 asmlinkage void __attribute__((weak)) smp_thermal_interrupt(void)
 {
 }
-- 
1.6.0.4



* Re: Ping: Re: [PATCH 0/2] i386: espfix fixes
       [not found] ` <1245182594.27756.1320705931@webmail.messagingengine.com>
@ 2009-06-16 21:19   ` Stas Sergeev
  2009-06-16 22:41     ` Alexander van Heukelum
From: Stas Sergeev @ 2009-06-16 21:19 UTC (permalink / raw)
  To: Alexander van Heukelum
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Cyrill Gorcunov,
	Tejun Heo, Zachary Amsden, Chuck Ebbert, Jeremy Fitzhardinge,
	Linux kernel

Hi.

17.06.2009 00:03, Alexander van Heukelum wrote:
> Just wanted to get a little attention for the espfix fixes...
>
> http://lkml.org/lkml/2009/6/7/106 : fix return to 16-bit stack from NMI
> http://lkml.org/lkml/2009/6/7/109 : fix/simplify espfix, move it into
> assembly
>
> Any objections against those?
Well, the dynamic limits were advocated
by Zach and Chuck, so let's ask them first
(added CCs).

For the rest:
Acked-by: Stas Sergeev <stsp@aknet.ru>

In particular, I am now asking myself why
I used THREAD_SIZE there, and can't
recall the reason...


* Re: Ping: Re: [PATCH 0/2] i386: espfix fixes
  2009-06-16 21:19   ` Ping: Re: [PATCH 0/2] i386: espfix fixes Stas Sergeev
@ 2009-06-16 22:41     ` Alexander van Heukelum
  2009-06-16 23:29       ` H. Peter Anvin
From: Alexander van Heukelum @ 2009-06-16 22:41 UTC (permalink / raw)
  To: Stas Sergeev
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Cyrill Gorcunov,
	Tejun Heo, Zachary Amsden, Chuck Ebbert, Jeremy Fitzhardinge,
	Linux kernel

On Wed, 17 Jun 2009 01:19:05 +0400, "Stas Sergeev" <stsp@aknet.ru> said:
> Hi.
> 
> 17.06.2009 00:03, Alexander van Heukelum wrote:
> > Just wanted to get a little attention for the espfix fixes...
> >
> > http://lkml.org/lkml/2009/6/7/106 : fix return to 16-bit stack from NMI
> > http://lkml.org/lkml/2009/6/7/109 : fix/simplify espfix, move it into
> > assembly
> >
> > Any objections against those?
> Well, the dynamic limits were advocated
> by Zach and Chuck, so let's ask them first
> (added CCs).

Hi!

I see...

On http://lkml.indiana.edu/hypermail/linux/kernel/0608.0/0162.html:
> > > - .quad 0x0000920000000000 /* 0xd0 - ESPFIX 16-bit SS */
> > > + .quad 0x00cf92000000ffff /* 0xd0 - ESPFIX SS */
> > 
> > Seems a bit dangerous to allow access to full 4GB through this.
> > Can you tighten the limit any? I suppose not, because the high
> > bits in %esp really could be anything. But it might be nice to
> > try setting the limit to regs->esp + THREAD_SIZE. Of course, this
> > is not strictly necessary, just an extra paranoid protection
> > mechanism. 
> Since, when calculating the base, I do &-THREAD_SIZE, I guess the
> minimal safe limit is regs->esp + THREAD_SIZE*2... Well, may just
> I not do that please? :) For what, btw? There are no such a things
> for __KERNEL_DS or anything, so I just don't see the necessity.

In the normal case user-esp would be between 0 and 65535, and in
that case the memory range in the ESPFIX stack segment would be
pretty small. But userspace can set esp to just about anything if
it really wants to, and in that case the reduction of the memory
range is pretty much worthless (kernel stacks are allocated way
above the code). As you said: may we just not do that, please?

> For the rest:
> Acked-by: Stas Sergeev <stsp@aknet.ru>

Thanks, very much appreciated!

> In particular, I am now asking myself why
> I used THREAD_SIZE there, and can't
> recall the reason...

Other parts of the complete fix were not masochistic enough?
( http://lkml.indiana.edu/hypermail/linux/kernel/0608.0/0858.html )

Greetings,
    Alexander


* Re: Ping: Re: [PATCH 0/2] i386: espfix fixes
  2009-06-16 22:41     ` Alexander van Heukelum
@ 2009-06-16 23:29       ` H. Peter Anvin
From: H. Peter Anvin @ 2009-06-16 23:29 UTC (permalink / raw)
  To: Alexander van Heukelum
  Cc: Stas Sergeev, Ingo Molnar, Thomas Gleixner, Cyrill Gorcunov,
	Tejun Heo, Zachary Amsden, Chuck Ebbert, Jeremy Fitzhardinge,
	Linux kernel

Alexander van Heukelum wrote:
> 
> On http://lkml.indiana.edu/hypermail/linux/kernel/0608.0/0162.html:
>>>> - .quad 0x0000920000000000 /* 0xd0 - ESPFIX 16-bit SS */
>>>> + .quad 0x00cf92000000ffff /* 0xd0 - ESPFIX SS */
>>> Seems a bit dangerous to allow access to full 4GB through this.
>>> Can you tighten the limit any? I suppose not, because the high
>>> bits in %esp really could be anything. But it might be nice to
>>> try setting the limit to regs->esp + THREAD_SIZE. Of course, this
>>> is not strictly necessary, just an extra paranoid protection
>>> mechanism. 
>> Since, when calculating the base, I do &-THREAD_SIZE, I guess the
>> minimal safe limit is regs->esp + THREAD_SIZE*2... Well, may just
>> I not do that please? :) For what, btw? There are no such a things
>> for __KERNEL_DS or anything, so I just don't see the necessity.
> 
> In the normal case user-esp would be between 0 and 65535, and in
> that case the memory range in the ESPFIX stack segment would be
> pretty small. But userspace can set esp to just about anything if
> it really wants to, and in that case the reduction of the memory
> range is pretty much wothless (kernel stacks are allocated way
> above the code). As you said: may we just not do that, please?
> 

Who are "we", here?  First of all, this is about a 16-bit stack segment,
right?  Why are we doing a 4 GB limit at all if this is a 16-bit stack
segment (where the stack can only be touched for the first 64K even if a
stack operation happens)?

I clearly don't quite understand at a glance what is going on.
	
	-hpa


