From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751415AbdKYLkx (ORCPT ); Sat, 25 Nov 2017 06:40:53 -0500 Received: from mail.skyhub.de ([5.9.137.197]:38266 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751049AbdKYLkv (ORCPT ); Sat, 25 Nov 2017 06:40:51 -0500 Date: Sat, 25 Nov 2017 12:40:43 +0100 From: Borislav Petkov To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, Dave Hansen , Andy Lutomirski , Thomas Gleixner , "H . Peter Anvin" , Peter Zijlstra , Linus Torvalds Subject: Re: [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline Message-ID: <20171125114043.f72547iiowbzyjvd@pd.tnic> References: <20171124172411.19476-1-mingo@kernel.org> <20171124172411.19476-16-mingo@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20171124172411.19476-16-mingo@kernel.org> User-Agent: NeoMutt/20170609 (1.8.3) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Nov 24, 2017 at 06:23:43PM +0100, Ingo Molnar wrote: > From: Andy Lutomirski > > Handling SYSCALL is tricky: the SYSCALL handler is entered with every > single register (except FLAGS), including RSP, live. It somehow needs > to set RSP to point to a valid stack, which means it needs to save the > user RSP somewhere and find its own stack pointer. The canonical way > to do this is with SWAPGS, which lets us access percpu data using the > %gs prefix. > > With KAISER-like pagetable switching, this is problematic. Without a > scratch register, switching CR3 is impossible, so %gs-based percpu > memory would need to be mapped in the user pagetables. Doing that > without information leaks is difficult or impossible. > > Instead, use a different sneaky trick. Map a copy of the first part > of the SYSCALL asm at a different address for each CPU. Now RIP > varies depending on the CPU, so we can use RIP-relative memory access > to access percpu memory. By putting the relevant information (one > scratch slot and the stack address) at a constant offset relative to > RIP, we can make SYSCALL work without relying on %gs. > > A nice thing about this approach is that we can easily switch it on > and off if we want pagetable switching to be configurable. > > The compat variant of SYSCALL doesn't have this problem in the first > place -- there are plenty of scratch registers, since we don't care > about preserving r8-r15. This patch therefore doesn't touch SYSCALL32 > at all. > > XXX: Whenever we settle how KAISER gets turned on and off, we should do > the same to this. > > Signed-off-by: Andy Lutomirski > Signed-off-by: Thomas Gleixner > Cc: Borislav Petkov > Cc: Brian Gerst > Cc: Dave Hansen > Cc: Josh Poimboeuf > Cc: Linus Torvalds > Cc: Peter Zijlstra > Link: https://lkml.kernel.org/r/b95ccae0a5a2f090c901e49fce7c9e8ff6acd40d.1511497875.git.luto@kernel.org > Signed-off-by: Ingo Molnar > --- > arch/x86/entry/entry_64.S | 48 +++++++++++++++++++++++++++++++++++++++++++ > arch/x86/include/asm/fixmap.h | 2 ++ > arch/x86/kernel/asm-offsets.c | 1 + > arch/x86/kernel/cpu/common.c | 12 ++++++++++- > arch/x86/kernel/vmlinux.lds.S | 10 +++++++++ > 5 files changed, 72 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S > index 426b8c669d6a..0cde243b7542 100644 > --- a/arch/x86/entry/entry_64.S > +++ b/arch/x86/entry/entry_64.S > @@ -140,6 +140,54 @@ END(native_usergs_sysret64) > * with them due to bugs in both AMD and Intel CPUs. > */ > > + .pushsection .entry_trampoline, "ax" > + > +/* > + * The code in here gets remapped into cpu_entry_area's trampoline. This means > + * that the assembler and linker have the wrong idea as to where this code > + * lives (and, in fact, it's mapped more than once, so it's not even at a > + * fixed address). So we can't reference any symbols outside the entry > + * trampoline and expect it to work. > + * > + * Instead, we carefully abuse %rip-relative addressing. > + * .Lentry_trampoline(%rip) refers to the start of the remapped) entry _entry_trampoline(%rip) I'd guess. > + * trampoline. We can thus find cpu_entry_area with this macro: Uuh, fun. :) > + */ > + > +#define CPU_ENTRY_AREA \ > + _entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip) So this generates _entry_trampoline - 16384(%rip) here. IINM, the layout looks like this [ GDT | TSS | TSS | TSS | trampoline ] where each section is a page, and we have 4 pages per CPU. Just for my own understanding... > + > +/* The top word of the SYSENTER stack is hot and is usable as scratch space. */ > +#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \ > + SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA I'm wondering if it would be easier to make SYSENTER_stack part of struct cpu_entry_area and thus simplify that wild offset math here :) Also, pls align: #define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \ SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA > + > +ENTRY(entry_SYSCALL_64_trampoline) > + UNWIND_HINT_EMPTY > + swapgs > + > + /* Stash the user RSP. */ > + movq %rsp, RSP_SCRATCH > + > + /* Load the top of the task stack into RSP */ > + movq CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp Yeah, let's put CPU_ENTRY_AREA first because then it reads easier: movq CPU_ENTRY_AREA + CPU_ENTRY_AREA_tss + TSS_sp1, %rsp i.e., pointer to cpu_entry_area, offset to tss within said cpu_entry_area and then inside that, sp1. Ditto for above. ... > @@ -1417,10 +1424,13 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks > /* May not be marked __init: used by software suspend */ > void syscall_init(void) > { > + extern char _entry_trampoline[]; > + extern char entry_SYSCALL_64_trampoline[]; > + > int cpu = smp_processor_id(); > > wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS); > - wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64); > + wrmsrl(MSR_LSTAR, (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline + (entry_SYSCALL_64_trampoline - _entry_trampoline)); Definitely a local var. > #ifdef CONFIG_IA32_EMULATION > wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat); > diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S > index a4009fb9be87..2738cfb6c8c8 100644 > --- a/arch/x86/kernel/vmlinux.lds.S > +++ b/arch/x86/kernel/vmlinux.lds.S > @@ -107,6 +107,16 @@ SECTIONS > SOFTIRQENTRY_TEXT > *(.fixup) > *(.gnu.warning) > + > +#ifdef CONFIG_X86_64 > + /* Entry trampoline */ No need for that comment - variable and section names are enough. :) > + . = ALIGN(PAGE_SIZE); > + _entry_trampoline = .; > + *(.entry_trampoline) > + . = ALIGN(PAGE_SIZE); > + ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big"); > +#endif > + > /* End of text section */ > _etext = .; > } :text = 0x9090 > -- -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.