Subject: [RFC 0/7] Prep code for better stack switching
From: Andy Lutomirski @ 2017-11-11  4:05 UTC
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Andy Lutomirski

This isn't quite done (the TSS remap patch is busted on 32-bit, but
that's a straightforward fix), but it should be ready for at least a
conceptual review.

The idea here is to prepare us to have all kernel data needed for
user mode execution and early entry located in the fixmap.  To do
this, I hijack the GDT remap mechanism and make it more general.  I
add a struct cpu_entry_area.  This struct is never instantiated
directly.  Instead, it represents the layout of a per-cpu portion of
the fixmap.  That portion contains the GDT, the TSS (including IO
bitmap), and the entry stack (for now just a part of the TSS
region).  It should also end up containing the PEBS and BTS buffers.
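
For concreteness, a minimal sketch of what that layout might look
like (purely illustrative -- field names, sizes, and ordering are made
up here; the real patches derive everything via asm-offsets and keep
the entry stack inside the TSS region for now):

struct cpu_entry_area {
	char gdt[PAGE_SIZE];		/* per-cpu GDT remap */
	struct tss_struct tss;		/* HW TSS + IO bitmap; the entry
					 * (SYSENTER) stack lives in here
					 * for now */
	/* later: PEBS and BTS debug-store buffers */
};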

If this works, then the idea would be to add a magic *executable* page
to cpu_entry_area.  That page would contain a stub like this:

ENTRY(entry_SYSCALL_64_trampoline)
	UNWIND_HINT_EMPTY
	/*
	 * All the RIP-relative references below resolve relative to
	 * wherever this trampoline is actually mapped, so no %gs
	 * prefix is needed.
	 */
	/* Stash user RSP in the top word of the entry stack. */
	movq	%rsp, 0x1000+entry_SYSCALL_64_trampoline-1f(%rip)
1:
	/* Load the real kernel stack pointer. */
	movq	0x1008+entry_SYSCALL_64_trampoline-1f(%rip), %rsp
1:
	/* Save user RDI and RSI on the kernel stack as scratch. */
	pushq	%rdi
	pushq	%rsi
	/* Recover the stashed user RSP into RSI. */
	movq	0x1000+entry_SYSCALL_64_trampoline-1f(%rip), %rsi
1:
	/* Jump to the real entry point through RDI. */
	movq	$entry_SYSCALL_64, %rdi
	jmp	*%rdi
END(entry_SYSCALL_64_trampoline)

(Those offsets are made up.  In real life, they'd be computed using
asm-offsets so they refer to the top word of the entry stack and to
the word that contains the real kernel stack address, respectively.)
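
A rough sketch of what those asm-offsets entries could look like
(entirely hypothetical names and field paths, following the layout
sketch above):

/* arch/x86/kernel/asm-offsets.c -- hypothetical entries */
/* Offset of the top word of the entry stack within cpu_entry_area: */
DEFINE(CPU_ENTRY_AREA_entry_stack_top,
       offsetof(struct cpu_entry_area, tss.SYSENTER_stack) +
       sizeof(((struct cpu_entry_area *)0)->tss.SYSENTER_stack));
/* Offset of the word caching the real kernel stack pointer
 * (illustrated here with the hardware TSS sp0 slot): */
DEFINE(CPU_ENTRY_AREA_kernel_sp,
       offsetof(struct cpu_entry_area, tss.x86_tss.sp0));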

We'd now enter entry_SYSCALL_64 (probably renamed) on the real task
stack, with user RDI and RSI on that stack (and in need of popping)
and with user RSP in RSI.  This is weird, but it gives us some major
benefits:

 - This entire sequence works without any %gs prefixes and without
   touching the conventional percpu mappings.  This means that it
   will work without mapping any conventional percpu data.  That
   removes a considerable amount of complexity in Dave's series and
   also closes a giant kASLR hole: Dave's series, as is, leaks the
   location of all the percpu mappings.

 - We run the SYSCALL entry code in a context in which it has
   easy access to scratch space for its CR3 shenanigans.

 - I've carefully done this without needing access to the
   cpu_entry_area from the post-trampoline entry code.  Finding
   it would require awkward calculations, a percpu load from
   an otherwise unneeded cacheline, or a potentially unfortunate
   load of the value we just stored from a different VA alias.  I
   imagine that the last one is nasty from a microarchitectural
   perspective.

I'd really like to do this in a way that makes it optional so that,
if KAISER is disabled, we don't take the TLB miss overhead, which
probably outweighs the minor speedup from no longer stalling on
SWAPGS.  OTOH, it might end up benchmarking faster than the current
code: while it's harder on I$ and the TLB, it's easier on D$
(it avoids two conventional percpu accesses, instead using a cacheline
that's needed anyway for the stack).
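
One possible shape for that choice (a sketch only; kaiser_enabled()
and cea_syscall_trampoline() are made-up helpers) would be to pick
the SYSCALL entry point when programming MSR_LSTAR:

/* In syscall_init(), roughly: */
if (kaiser_enabled())		/* hypothetical predicate */
	wrmsrl(MSR_LSTAR,
	       (unsigned long)cea_syscall_trampoline(smp_processor_id()));
else
	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);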

The same exact treatment is used for SYSCALL32.

If I didn't forget some detail, this would allow KAISER to function
with only the fixmap, the entry text, and the espfix64 junk mapped.
Down the road, we could further tweak it to get rid of the entry
text too by moving all the CR3-switching code into the fixmap.

The ORC unwinder would need to learn about this special case to be
able to unwind an NMI that hits in the trampoline.  Or maybe we
don't care.  kallsyms might also want some hackery to recognize
the trampoline for perf's benefit.

Open questions:

 - Should the entry stack be anywhere near as big as I made it here?
   If I keep it very small, then inappropriate uses of it would be
   immediately detected as (properly backtraced) double faults.

 - Something should IMO complain very loudly, at least with debugging on,
   if we accidentally schedule from the entry stack.  As is, it causes
   huge corruption but doesn't immediately die.  (A possible check is
   sketched after this list.)

 - This is incompatible with the PIE effort.  We'd have to use movabs
   instead of movq, but I don't know whether the tooling can handle
   the resulting relocation.
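
On the scheduling-from-the-entry-stack point above, one possible shape
for the check (hypothetical; on_entry_stack() is a made-up helper that
would compare the current stack pointer against the cpu_entry_area
entry stack range):

/* Called from, e.g., schedule_debug(): */
static inline void check_not_on_entry_stack(void)
{
	if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
		WARN_ON_ONCE(on_entry_stack());	/* made-up helper */
}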

Andy Lutomirski (7):
  x86/asm/64: Allocate and enable the SYSENTER stack
  x86/gdt: Put per-cpu GDT remaps in ascending order
  x86/fixmap: Generalize the GDT fixmap mechanism
  x86/asm: Fix assumptions that the HW TSS is at the beginning of
    cpu_tss
  x86/asm: Rearrange struct cpu_tss to enlarge SYSENTER_stack and fix
    alignment
  x86/asm: Remap the TSS into the cpu entry area
  x86/unwind/64: Add support for the SYSENTER stack

 arch/x86/entry/entry_64_compat.S  |  2 +-
 arch/x86/include/asm/desc.h       | 11 ++--------
 arch/x86/include/asm/fixmap.h     | 43 +++++++++++++++++++++++++++++++++++--
 arch/x86/include/asm/processor.h  | 25 +++++++++++-----------
 arch/x86/include/asm/stacktrace.h |  1 +
 arch/x86/kernel/asm-offsets.c     |  5 +++++
 arch/x86/kernel/asm-offsets_32.c  |  5 -----
 arch/x86/kernel/cpu/common.c      | 45 +++++++++++++++++++++++++++++----------
 arch/x86/kernel/doublefault.c     | 36 +++++++++++++++----------------
 arch/x86/kernel/dumpstack_32.c    |  3 +++
 arch/x86/kernel/dumpstack_64.c    | 23 ++++++++++++++++++++
 arch/x86/kernel/process.c         |  2 --
 arch/x86/kernel/traps.c           |  3 +--
 arch/x86/power/cpu.c              | 16 ++++++++------
 arch/x86/xen/mmu_pv.c             |  2 +-
 15 files changed, 151 insertions(+), 71 deletions(-)

-- 
2.13.6
