From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752613AbdKWAgX (ORCPT );
	Wed, 22 Nov 2017 19:36:23 -0500
Received: from mga11.intel.com ([192.55.52.93]:40552 "EHLO mga11.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752546AbdKWAgT (ORCPT );
	Wed, 22 Nov 2017 19:36:19 -0500
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.44,438,1505804400"; d="scan'208";a="10898911"
Subject: [PATCH 17/23] x86, kaiser: use PCID feature to make user and kernel switches faster
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, dave.hansen@linux.intel.com,
	moritz.lipp@iaik.tugraz.at, daniel.gruss@iaik.tugraz.at,
	michael.schwarz@iaik.tugraz.at, richard.fellner@student.tugraz.at,
	luto@kernel.org, torvalds@linux-foundation.org, keescook@google.com,
	hughd@google.com, x86@kernel.org
From: Dave Hansen
Date: Wed, 22 Nov 2017 16:35:09 -0800
References: <20171123003438.48A0EEDE@viggo.jf.intel.com>
In-Reply-To: <20171123003438.48A0EEDE@viggo.jf.intel.com>
Message-Id: <20171123003509.EC42DD15@viggo.jf.intel.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org


From: Dave Hansen

Short summary: Use the x86 PCID feature to avoid flushing the TLB at
all interrupts and syscalls.  Speed them up.  Makes context switches
and TLB flushing slower.

Background:

KAISER keeps two copies of the page tables.  Switches between the
copies are performed by writing to the CR3 register.  But, CR3 was
really designed for context switches, and writes to it also flush
the entire TLB (modulo global pages).  This TLB flush increases the
cost of interrupts and context switches.  For syscall-heavy
microbenchmarks it can cut the rate of syscalls by 2/3.

The kernel recently gained support for an Intel CPU feature called
Process Context IDentifiers (PCID), thanks to Andy Lutomirski.  This
feature is intended to allow you to switch between contexts without
flushing the TLB.

Implementation:

PCIDs can be used to avoid flushing the TLB at kernel entry/exit.
This speeds up both interrupts and syscalls.

First, the kernel and userspace must be assigned different ASIDs.
On entry from userspace, move over to the kernel page tables *and*
ASID.  On exit, restore the user page tables and ASID.  Fortunately,
the ASID is programmed via CR3, which is already being used to
switch between the user and kernel page tables.  This gives us
convenient, one-stop shopping.
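
Concretely, the entry/exit switch amounts to flipping two bits in
CR3 in a single write, with bit 63 set so that neither direction
flushes.  A sketch (the helper names and the ASID bit position are
illustrative only; the patch pins down just the PGD bit 12 and the
bit-63 NOFLUSH flag, and this assumes the kernel owns the low 4k
half of the 8k PGD):

	/* Hypothetical sketch, loosely modeled on calling.h below. */
	#define PGD_HALF_BIT	12	/* kernel vs. user half of 8k PGD */
	#define ASID_USER_BIT	11	/* kernel vs. user ASID (assumed) */
	#define CR3_NOFLUSH	(1UL << 63)	/* suppress implicit flush */

	/* Entry from userspace: kernel PGD half and kernel ASID. */
	static inline unsigned long cr3_to_kernel(unsigned long cr3)
	{
		cr3 &= ~((1UL << PGD_HALF_BIT) | (1UL << ASID_USER_BIT));
		return cr3 | CR3_NOFLUSH;	/* keep kernel TLB entries */
	}

	/* Exit to userspace: user PGD half and user ASID. */
	static inline unsigned long cr3_to_user(unsigned long cr3)
	{
		return cr3 | (1UL << PGD_HALF_BIT) | (1UL << ASID_USER_BIT)
			   | CR3_NOFLUSH;	/* keep user TLB entries */
	}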

The CR3 write which is used to switch between processes provides all
the TLB flushing normally required at a context switch.  But, with
KAISER, that CR3 write only flushes the current (kernel) ASID.  An
extra TLB flush operation is now required in order to flush the user
ASID.  This new instruction (INVPCID) is probably ~100 cycles, but
this is done with the assumption that the time lost in context
switches is more than made up for by the lower cost of interrupts
and syscalls.

Support:

PCIDs are generally available on Sandybridge and newer CPUs.  However,
the accompanying INVPCID instruction did not become available until
Haswell (the ones with "v4", i.e. fourth-generation Core).  This
instruction allows non-current-PCID TLB entries to be flushed without
switching CR3, and global pages to be flushed without a double
MOV-to-CR4.

Without INVPCID, PCIDs are much harder to use.  TLB invalidation gets
much more onerous:

1. Every kernel TLB flush (even for a single page) requires an
   interrupts-off MOV-to-CR4 which is very expensive.  This is
   because there is no way to flush a kernel address that might be
   loaded in *EVERY* PCID.  Right now, there are "only" ~12 of these
   per-cpu, but flushing each of them with a MOV-to-CR3 is too
   painful.  That leaves only the MOV-to-CR4.

2. Every userspace flush (even for a single page) requires one of
   the following:
   a. A pair of flushing (bit 63 clear) CR3 writes: one for the
      kernel ASID and another for userspace.
   b. A pair of non-flushing CR3 writes (bit 63 set) with the flush
      done for each.

For instance, what is currently a single instruction without KAISER:

	invpcid_flush_one(current_pcid, addr);

becomes this with KAISER:

	invpcid_flush_one(current_kern_pcid, addr);
	invpcid_flush_one(current_user_pcid, addr);

and this without INVPCID:

	__native_flush_tlb_single(addr);
	write_cr3(mm->pgd | current_user_pcid | NOFLUSH);
	__native_flush_tlb_single(addr);
	write_cr3(mm->pgd | current_kern_pcid | NOFLUSH);
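
For reference, the invpcid_flush_one() used above is a thin wrapper
around the INVPCID instruction, which takes a flush type in a
register and a 16-byte memory descriptor: the PCID in the low 12
bits of the first quadword, the linear address in the second.  A
sketch of such a wrapper, close to what the kernel carries in
asm/tlbflush.h (type 0 is the individual-address flush; types 1-3
flush a single context, all contexts including globals, and all
non-global entries, respectively):

	static inline void __invpcid(unsigned long pcid, unsigned long addr,
				     unsigned long type)
	{
		struct { u64 d[2]; } desc = { { pcid, addr } };

		/*
		 * Raw opcode for "invpcid (%rcx), %rax"; the memory
		 * clobber keeps the compiler from reordering memory
		 * accesses around the invalidation.
		 */
		asm volatile (".byte 0x66, 0x0f, 0x38, 0x82, 0x01"
			      : : "m" (desc), "a" (type), "c" (&desc) : "memory");
	}

	static inline void invpcid_flush_one(unsigned long pcid,
					     unsigned long addr)
	{
		__invpcid(pcid, addr, 0);	/* type 0: single address */
	}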

So, for now, fully disable PCIDs with KAISER when INVPCID is not
available.  This is fixable, but it's an optimization that can be
performed later.

Hugh Dickins also points out that PCIDs really have two distinct
use-cases in the context of KAISER.  The first way they can be used
is as "TLB preservation across context-switch", which is what Andy
Lutomirski's 4.14 PCID code does.  They can also be used as a
"KAISER syscall/interrupt accelerator".  If we just use them to
speed up syscalls/interrupts (and ignore the context-switch TLB
preservation), then the deficiency of not having INVPCID becomes
much less onerous.
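
As an aside on the "MOV-to-CR4" cost that keeps coming up here:
without INVPCID, the only way to flush global TLB entries is to
toggle CR4.PGE off and back on, which is roughly what
__native_flush_tlb_global_irq_disabled() (touched in the diff below)
does.  A minimal sketch, assuming it is called with interrupts
disabled:

	static inline void flush_tlb_global_via_cr4(void)
	{
		unsigned long cr4 = native_read_cr4();

		/* Clearing PGE flushes everything, globals included. */
		native_write_cr4(cr4 & ~X86_CR4_PGE);
		/* Restore PGE; this write flushes again. */
		native_write_cr4(cr4);
	}

Each of those CR4 writes is serializing, which is why the text above
calls this "very expensive" relative to a single INVPCID.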

Signed-off-by: Dave Hansen
Cc: Moritz Lipp
Cc: Daniel Gruss
Cc: Michael Schwarz
Cc: Richard Fellner
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Kees Cook
Cc: Hugh Dickins
Cc: x86@kernel.org
---

 b/arch/x86/entry/calling.h                    |   25 +++-
 b/arch/x86/entry/entry_64.S                   |    1
 b/arch/x86/include/asm/cpufeatures.h          |    1
 b/arch/x86/include/asm/pgtable_types.h        |   11 ++
 b/arch/x86/include/asm/tlbflush.h             |  137 +++++++++++++++++++++-----
 b/arch/x86/include/uapi/asm/processor-flags.h |    3
 b/arch/x86/kvm/x86.c                          |    3
 b/arch/x86/mm/init.c                          |   75 +++++++++-----
 b/arch/x86/mm/tlb.c                           |   66 ++++++++++++

 9 files changed, 262 insertions(+), 60 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-pcid arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-pcid	2017-11-22 15:45:53.443619728 -0800
+++ b/arch/x86/entry/calling.h	2017-11-22 15:45:53.461619728 -0800
@@ -3,6 +3,7 @@
 #include
 #include
 #include
+#include

 /*
@@ -192,16 +193,20 @@ For 32-bit we have the following convent
 #ifdef CONFIG_KAISER
 /* KAISER PGDs are 8k.  Flip bit 12 to switch between the two halves: */
-#define KAISER_SWITCH_MASK (1< MAX_ASID_AVAILABLE);
+
+#ifdef CONFIG_KAISER
+	/*
+	 * Make sure that the dynamic ASID space does not conflict
+	 * with the bit we are using to switch between user and
+	 * kernel ASIDs.
+	 */
+	BUILD_BUG_ON(TLB_NR_DYN_ASIDS >= (1<mm == NULL then we borrow a mm
+		 * which may change during a task switch and
+		 * therefore we must not be preempted while we
+		 * write CR3 back:
+		 */
+		preempt_disable();
+		native_write_cr3(__native_read_cr3());
+		preempt_enable();
+		/*
+		 * Does not need tlb_flush_shared_nonglobals()
+		 * since the CR3 write without PCIDs flushes all
+		 * non-globals.
+		 */
+		return;
+	}
 	/*
-	 * If current->mm == NULL then we borrow a mm which may change during a
-	 * task switch and therefore we must not be preempted while we write CR3
-	 * back:
-	 */
-	preempt_disable();
-	native_write_cr3(__native_read_cr3());
-	preempt_enable();
-	/*
-	 * Does not need tlb_flush_shared_nonglobals() since the CR3 write
-	 * without PCIDs flushes all non-globals.
+	 * We are no longer using globals with KAISER, so a
+	 * "nonglobals" flush would work too.  But, this is more
+	 * conservative.
+	 *
+	 * Note, this works with CR4.PCIDE=0 or 1.
 	 */
+	invpcid_flush_all();
 }

 static inline void __native_flush_tlb_global_irq_disabled(void)
@@ -353,6 +417,8 @@ static inline void __native_flush_tlb_gl
 	/*
 	 * Using INVPCID is considerably faster than a pair of writes
 	 * to CR4 sandwiched inside an IRQ flag save/restore.
+	 *
+	 * Note, this works with CR4.PCIDE=0 or 1.
 	 */
 	invpcid_flush_all();
 	return;
@@ -372,7 +438,30 @@ static inline void __native_flush_tlb_gl
 static inline void __native_flush_tlb_single(unsigned long addr)
 {
-	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
+	/*
+	 * Some platforms #GP if we call invpcid(type=1/2) before
+	 * CR4.PCIDE=1.  Just use invlpg in the case we are called
+	 * early.
+	 */
+	if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE)) {
+		asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+		return;
+	}
+	/* Flush the address out of both PCIDs. */
+	/*
+	 * An optimization here might be to determine addresses
+	 * that are only kernel-mapped and only flush the kernel
+	 * ASID.  But, userspace flushes are probably much more
+	 * important performance-wise.
+	 *
+	 * Make sure to do only a single invpcid when KAISER is
+	 * disabled and we have only a single ASID.
+	 */
+	if (kern_asid(loaded_mm_asid) != user_asid(loaded_mm_asid))
+		invpcid_flush_one(user_asid(loaded_mm_asid), addr);
+	invpcid_flush_one(kern_asid(loaded_mm_asid), addr);
 }

 static inline void __flush_tlb_all(void)
diff -puN arch/x86/include/uapi/asm/processor-flags.h~kaiser-pcid arch/x86/include/uapi/asm/processor-flags.h
--- a/arch/x86/include/uapi/asm/processor-flags.h~kaiser-pcid	2017-11-22 15:45:53.452619728 -0800
+++ b/arch/x86/include/uapi/asm/processor-flags.h	2017-11-22 15:45:53.466619728 -0800
@@ -78,7 +78,8 @@
 #define X86_CR3_PWT		_BITUL(X86_CR3_PWT_BIT)
 #define X86_CR3_PCD_BIT		4 /* Page Cache Disable */
 #define X86_CR3_PCD		_BITUL(X86_CR3_PCD_BIT)
-#define X86_CR3_PCID_MASK	_AC(0x00000fff,UL) /* PCID Mask */
+#define X86_CR3_PCID_NOFLUSH_BIT 63 /* Preserve old PCID */
+#define X86_CR3_PCID_NOFLUSH	_BITULL(X86_CR3_PCID_NOFLUSH_BIT)

 /*
  * Intel CPU features in CR4
diff -puN arch/x86/kvm/x86.c~kaiser-pcid arch/x86/kvm/x86.c
--- a/arch/x86/kvm/x86.c~kaiser-pcid	2017-11-22 15:45:53.454619728 -0800
+++ b/arch/x86/kvm/x86.c	2017-11-22 15:45:53.468619728 -0800
@@ -805,7 +805,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, u
 			return 1;

 		/* PCID can not be enabled when cr3[11:0]!=000H or EFER.LMA=0 */
-		if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_MASK) || !is_long_mode(vcpu))
+		if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_ASID_MASK) ||
+		    !is_long_mode(vcpu))
 			return 1;
 	}
diff -puN arch/x86/mm/init.c~kaiser-pcid arch/x86/mm/init.c
--- a/arch/x86/mm/init.c~kaiser-pcid	2017-11-22 15:45:53.456619728 -0800
+++ b/arch/x86/mm/init.c	2017-11-22 15:45:53.468619728 -0800
@@ -196,34 +196,59 @@ static void __init probe_page_size_mask(

 static void setup_pcid(void)
 {
-#ifdef CONFIG_X86_64
-	if (boot_cpu_has(X86_FEATURE_PCID)) {
-		if (boot_cpu_has(X86_FEATURE_PGE)) {
-			/*
-			 * This can't be cr4_set_bits_and_update_boot() --
-			 * the trampoline code can't handle CR4.PCIDE and
-			 * it wouldn't do any good anyway.  Despite the name,
-			 * cr4_set_bits_and_update_boot() doesn't actually
-			 * cause the bits in question to remain set all the
-			 * way through the secondary boot asm.
-			 *
-			 * Instead, we brute-force it and set CR4.PCIDE
-			 * manually in start_secondary().
-			 */
-			cr4_set_bits(X86_CR4_PCIDE);
-		} else {
-			/*
-			 * flush_tlb_all(), as currently implemented, won't
-			 * work if PCID is on but PGE is not.  Since that
-			 * combination doesn't exist on real hardware, there's
-			 * no reason to try to fully support it, but it's
-			 * polite to avoid corrupting data if we're on
-			 * an improperly configured VM.
-			 */
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return;
+
+	if (!boot_cpu_has(X86_FEATURE_PCID))
+		return;
+
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
+		/*
+		 * KAISER uses a PCID for the kernel and another
+		 * for userspace.  Both PCIDs need to be flushed
+		 * when the TLB flush functions are called.  But,
+		 * flushing *another* PCID is insane without
+		 * INVPCID.  Just avoid using PCIDs at all if we
+		 * have KAISER and do not have INVPCID.
+		 */
+		if (!IS_ENABLED(CONFIG_X86_GLOBAL_PAGES) &&
+		    !boot_cpu_has(X86_FEATURE_INVPCID)) {
 			setup_clear_cpu_cap(X86_FEATURE_PCID);
+			return;
 		}
+		/*
+		 * This can't be cr4_set_bits_and_update_boot() --
+		 * the trampoline code can't handle CR4.PCIDE and
+		 * it wouldn't do any good anyway.  Despite the name,
+		 * cr4_set_bits_and_update_boot() doesn't actually
+		 * cause the bits in question to remain set all the
+		 * way through the secondary boot asm.
+		 *
+		 * Instead, we brute-force it and set CR4.PCIDE
+		 * manually in start_secondary().
+		 */
+		cr4_set_bits(X86_CR4_PCIDE);
+
+		/*
+		 * INVPCID's single-context modes (2/3) only work
+		 * if we set X86_CR4_PCIDE, *and* we have INVPCID
+		 * support.  It's unusable on systems that have
+		 * X86_CR4_PCIDE clear, or that have no INVPCID
+		 * support at all.
+		 */
+		if (boot_cpu_has(X86_FEATURE_INVPCID))
+			setup_force_cpu_cap(X86_FEATURE_INVPCID_SINGLE);
+	} else {
+		/*
+		 * flush_tlb_all(), as currently implemented, won't
+		 * work if PCID is on but PGE is not.  Since that
+		 * combination doesn't exist on real hardware, there's
+		 * no reason to try to fully support it, but it's
+		 * polite to avoid corrupting data if we're on
+		 * an improperly configured VM.
+		 */
+		setup_clear_cpu_cap(X86_FEATURE_PCID);
 	}
-#endif
 }

 #ifdef CONFIG_X86_32
diff -puN arch/x86/mm/tlb.c~kaiser-pcid arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~kaiser-pcid	2017-11-22 15:45:53.458619728 -0800
+++ b/arch/x86/mm/tlb.c	2017-11-22 15:45:53.469619728 -0800
@@ -100,6 +100,68 @@ static void choose_new_asid(struct mm_st
 	*need_flush = true;
 }

+/*
+ * Given a kernel ASID, flush the corresponding KAISER
+ * user ASID.
+ */
+static void flush_user_asid(pgd_t *pgd, u16 kern_asid)
+{
+	/* There is no user ASID if KAISER is off */
+	if (!IS_ENABLED(CONFIG_KAISER))
+		return;
+	/*
+	 * We only have a single ASID if PCID is off and the CR3
+	 * write will have flushed it.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_PCID))
+		return;
+	/*
+	 * With PCIDs enabled, write_cr3() only flushes TLB
+	 * entries for the current (kernel) ASID.  This leaves
+	 * old TLB entries for the user ASID in place and we must
+	 * flush that context separately.  We can theoretically
+	 * delay doing this until we actually load up the
+	 * userspace CR3, but do it here for simplicity.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_INVPCID)) {
+		invpcid_flush_single_context(user_asid(kern_asid));
+	} else {
+		/*
+		 * On systems with PCIDs, but no INVPCID, the only
+		 * way to flush a PCID is a CR3 write.  Note that
+		 * we use the kernel page tables with the *user*
+		 * ASID here.
+		 */
+		unsigned long user_asid_flush_cr3;
+		user_asid_flush_cr3 = build_cr3(pgd, user_asid(kern_asid));
+		write_cr3(user_asid_flush_cr3);
+		/*
+		 * We do not use PCIDs with KAISER unless we also
+		 * have INVPCID.  Getting here is unexpected.
+		 */
+		WARN_ON_ONCE(1);
+	}
+}
+
+static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
+{
+	unsigned long new_mm_cr3;
+
+	if (need_flush) {
+		flush_user_asid(pgdir, new_asid);
+		new_mm_cr3 = build_cr3(pgdir, new_asid);
+	} else {
+		new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
+	}
+
+	/*
+	 * Caution: many callers of this function expect
+	 * that load_cr3() is serializing and orders TLB
+	 * fills with respect to the mm_cpumask writes.
+	 */
+	write_cr3(new_mm_cr3);
+}
+
 void leave_mm(int cpu)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -230,7 +292,7 @@ void switch_mm_irqs_off(struct mm_struct
 	if (need_flush) {
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-		write_cr3(build_cr3(next->pgd, new_asid));
+		load_new_mm_cr3(next->pgd, new_asid, true);

 		/*
 		 * NB: This gets called via leave_mm() in the idle path
@@ -243,7 +305,7 @@ void switch_mm_irqs_off(struct mm_struct
 		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 	} else {
 		/* The new ASID is already up to date. */
-		write_cr3(build_cr3_noflush(next->pgd, new_asid));
+		load_new_mm_cr3(next->pgd, new_asid, false);

 		/* See above wrt _rcuidle. */
 		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
_