From: Andy Lutomirski
To: X86 ML
Cc: Borislav Petkov, "linux-kernel@vger.kernel.org", Brian Gerst, Dave Hansen, Linus Torvalds, Andy Lutomirski
Subject: [RFC 0/7] Prep code for better stack switching
Date: Fri, 10 Nov 2017 20:05:19 -0800

This isn't quite done (the TSS remap patch is busted on 32-bit, but
that's a straightforward fix), but it should be ready for at least a
conceptual review.

The idea here is to prepare us to have all kernel data needed for
user mode execution and early entry located in the fixmap.  To do
this, I hijack the GDT remap mechanism and make it more general.

I add a struct cpu_entry_area.  This struct is never instantiated
directly.  Instead, it represents the layout of a per-cpu portion of
the fixmap.  That portion contains the GDT, the TSS (including the IO
bitmap), and the entry stack (for now just a part of the TSS region).
It should also end up containing the PEBS and BTS buffers.

If this works, then the idea would be to add a magic *executable*
page to cpu_entry_area.  That page would contain a stub like this:

ENTRY(entry_SYSCALL_64_trampoline)
        UNWIND_HINT_EMPTY
        movq    %rsp, 0x1000+entry_SYSCALL_64_trampoline-1f(%rip)
1:
        movq    0x1008+entry_SYSCALL_64_trampoline-1f(%rip), %rsp
1:
        pushq   %rdi
        pushq   %rsi
        movq    0x1000+entry_SYSCALL_64_trampoline-1f(%rip), %rsi
1:
        movq    $entry_SYSCALL_64, %rdi
        jmp     *%rdi
END(entry_SYSCALL_64_trampoline)

(Those offsets are made up.  In real life, they'd be computed using
asm-offsets so that they refer to the top word of the entry stack and
to the word that contains the real kernel stack address,
respectively.)

We'd now enter entry_SYSCALL_64 (probably renamed) on the real task
stack, with user RDI and RSI on that stack (and in need of popping)
and with user RSP in RSI.  This is weird, but it gives us some major
benefits:

 - This entire sequence works without any %gs prefixes and without
   touching the conventional percpu mappings.  This means that it
   will work without mapping any conventional percpu data.  That
   removes a considerable amount of complexity from Dave's series and
   also closes a giant kASLR hole: Dave's series, as is, leaks the
   location of all the percpu mappings.

 - We run the SYSCALL entry code in a context in which it has easy
   access to scratch space for its CR3 shenanigans.

 - I've carefully done this without needing access to the
   cpu_entry_area from the post-trampoline entry code.  Finding it
   would require awkward calculations, a percpu load from an
   otherwise unneeded cacheline, or a potentially unfortunate load of
   the value we just stored from a different VA alias.  I imagine
   that the last one is nasty from a microarchitectural perspective.
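To make the cpu_entry_area layout above a bit more concrete, here's a
rough, freestanding C sketch of the idea.  The field sizes, the
FIXMAP_TOP constant, and the cpu_entry_area_base() helper are all
made up for illustration; the real patches derive these addresses
from fixmap indices, and the actual struct contents will differ:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define ENTRY_PAGE_SIZE 4096ULL

/* One page holding a read-only alias of the per-cpu GDT. */
struct gdt_page  { uint64_t entries[ENTRY_PAGE_SIZE / 8]; };
/* A few pages covering the HW TSS, IO bitmap, and entry stack. */
struct tss_pages { uint8_t  bytes[3 * ENTRY_PAGE_SIZE]; };

/*
 * Never instantiated directly: this only describes the layout of
 * each CPU's chunk of the fixmap.  Later it would also gain the
 * PEBS/BTS buffers and an executable trampoline page.
 */
struct cpu_entry_area {
        struct gdt_page  gdt;
        struct tss_pages tss;
};

/* Made-up fixmap top, purely for illustration. */
#define FIXMAP_TOP 0xffffffffff578000ULL

static unsigned long long cpu_entry_area_base(unsigned int cpu)
{
        /* Each CPU gets its own fixed-VA slot below FIXMAP_TOP. */
        return FIXMAP_TOP -
               (unsigned long long)(cpu + 1) * sizeof(struct cpu_entry_area);
}

int main(void)
{
        /* Entry code could find CPU 2's TSS at a fixed VA, no %gs needed: */
        unsigned long long tss = cpu_entry_area_base(2) +
                                 offsetof(struct cpu_entry_area, tss);

        printf("cpu 2 entry area at %#llx, TSS at %#llx\n",
               cpu_entry_area_base(2), tss);
        return 0;
}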
I'd really like to do this in a way that makes it optional so that,
if KAISER is disabled, we don't take the TLB miss overhead, which
probably outweighs the minor speedup from no longer stalling on
SWAPGS.  OTOH, it might end up benchmarking faster than the current
code: while it's harder on I$ and the TLB, it's easier on D$ (it
avoids two conventional percpu accesses, instead using a cacheline
that's needed anyway for the stack).

The same exact treatment is used for SYSCALL32.

If I didn't forget some detail, this would allow KAISER to function
with only the fixmap, the entry text, and the espfix64 junk mapped.
Down the road, we could further tweak it to get rid of the entry text
too by moving all the CR3-switching code into the fixmap.

The ORC unwinder would need to learn about this special case to be
able to unwind an NMI that hits in the trampoline.  Or maybe we don't
care.  kallsyms might also want some hackery to recognize the
trampoline for perf's benefit.

Open questions:

 - Should the entry stack be anywhere near as big as I made it here?
   If I keep it very small, then inappropriate uses of it would be
   immediately detected as (properly backtraced) double faults.

 - Something should IMO complain very loudly, at least with debugging
   on, if we accidentally schedule from the entry stack.  As is, it
   causes huge corruption but doesn't immediately die.

 - This is incompatible with the PIE effort.  We'd have to use movabs
   instead of movq, but I don't know whether the tooling can handle
   the resulting relocation.

Andy Lutomirski (7):
  x86/asm/64: Allocate and enable the SYSENTER stack
  x86/gdt: Put per-cpu GDT remaps in ascending order
  x86/fixmap: Generalize the GDT fixmap mechanism
  x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss
  x86/asm: Rearrange struct cpu_tss to enlarge SYSENTER_stack and fix alignment
  x86/asm: Remap the TSS into the cpu entry area
  x86/unwind/64: Add support for the SYSENTER stack

 arch/x86/entry/entry_64_compat.S  |  2 +-
 arch/x86/include/asm/desc.h       | 11 ++--------
 arch/x86/include/asm/fixmap.h     | 43 +++++++++++++++++++++++++++++++++++--
 arch/x86/include/asm/processor.h  | 25 +++++++++++-----------
 arch/x86/include/asm/stacktrace.h |  1 +
 arch/x86/kernel/asm-offsets.c     |  5 +++++
 arch/x86/kernel/asm-offsets_32.c  |  5 -----
 arch/x86/kernel/cpu/common.c      | 45 +++++++++++++++++++++++++++++----------
 arch/x86/kernel/doublefault.c     | 36 +++++++++++++++----------------
 arch/x86/kernel/dumpstack_32.c    |  3 +++
 arch/x86/kernel/dumpstack_64.c    | 23 ++++++++++++++++++++
 arch/x86/kernel/process.c         |  2 --
 arch/x86/kernel/traps.c           |  3 +--
 arch/x86/power/cpu.c              | 16 ++++++++------
 arch/x86/xen/mmu_pv.c             |  2 +-
 15 files changed, 151 insertions(+), 71 deletions(-)

-- 
2.13.6