linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/6] Optionally flush L1D on context switch
@ 2020-04-23 14:01 Balbir Singh
  2020-04-23 14:01 ` [PATCH v4 1/6] arch/x86/kvm: Refactor l1d flush lifecycle management Balbir Singh
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Balbir Singh @ 2020-04-23 14:01 UTC (permalink / raw)
  To: tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen, Balbir Singh

Provide a mechanism to flush the L1D cache on context switch.  The goal
is to allow tasks that are paranoid due to the recent snoop assisted data
sampling vulnerabilites, to flush their L1D on being switched out.
This protects their data from being snooped or leaked via side channels
after the task has context switched out.


Changelog v4:
- Refactor the L1D flushing code even further, pages are now allocated
  once and never freed. Simplify the exported functions.
- Change the name prefixs to be more consistent (l1d_flush_*)
- Refactoring of the code done in the spirit of the comments, prctl
  still requires arch bits for get/set L1D flush and ofcourse in
  the arch switch_mm bits flushing the L1D cache.
Changelog v3:
 - Refactor the return value of what flush_l1d_cache_hw() returns
 - Refactor the code, so that the generic setup bits come first
   (patch 3 from previous posting is now patches 3 and 4)
 - Move from arch_prctl() to the prctl() interface as recommend
   in the reviews.
Changelog v2:
 - Fix a miss of mutex_unlock (caught by Borislav Petkov <bp@alien8.de>)
 - Add documentation about the changes (Josh Poimboeuf
   <jpoimboe@redhat.com>)

Changelog:
 - Refactor the code and reuse cond_ibpb() - code bits provided by tglx
 - Merge mm state tracking for ibpb and l1d flush
 - Rename TIF_L1D_FLUSH to TIF_SPEC_FLUSH_L1D

Changelog RFC:
 - Reuse existing code for allocation and flush
 - Simplify the goto logic in the actual l1d_flush function
 - Optimize the code path with jump labels/static functions

The previous version of these patches are posted at:

https://lore.kernel.org/lkml/20200408090229.16467-1-sblbir@amazon.com/

Balbir Singh (6):
  arch/x86/kvm: Refactor l1d flush lifecycle management
  arch/x86/kvm: Refactor tlbflush and l1d flush
  arch/x86/mm: Refactor cond_ibpb() to support other use cases
  arch/x86/kvm: Refactor L1D flushing
  Optionally flush L1D on context switch
  Documentation: Add L1D flushing Documentation

 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 .../admin-guide/hw-vuln/l1d_flush.rst         |  40 ++++++
 arch/x86/include/asm/cacheflush.h             |   8 ++
 arch/x86/include/asm/thread_info.h            |   7 +-
 arch/x86/include/asm/tlbflush.h               |   2 +-
 arch/x86/kernel/Makefile                      |   1 +
 arch/x86/kernel/l1d_flush.c                   | 117 ++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c                        |  62 +---------
 arch/x86/mm/tlb.c                             |  83 +++++++++----
 include/uapi/linux/prctl.h                    |   4 +
 kernel/sys.c                                  |  20 +++
 11 files changed, 263 insertions(+), 82 deletions(-)
 create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst
 create mode 100644 arch/x86/kernel/l1d_flush.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v4 1/6] arch/x86/kvm: Refactor l1d flush lifecycle management
  2020-04-23 14:01 [PATCH v4 0/6] Optionally flush L1D on context switch Balbir Singh
@ 2020-04-23 14:01 ` Balbir Singh
  2020-04-24 18:59   ` Tom Lendacky
  2020-04-23 14:01 ` [PATCH v4 2/6] arch/x86/kvm: Refactor tlbflush and l1d flush Balbir Singh
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Balbir Singh @ 2020-04-23 14:01 UTC (permalink / raw)
  To: tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen, Balbir Singh

Split out the allocation and free routines to be used in a follow
up set of patches (to reuse for L1D flushing).

Signed-off-by: Balbir Singh <sblbir@amazon.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/cacheflush.h |  3 +++
 arch/x86/kernel/Makefile          |  1 +
 arch/x86/kernel/l1d_flush.c       | 36 +++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c            | 25 +++------------------
 4 files changed, 43 insertions(+), 22 deletions(-)
 create mode 100644 arch/x86/kernel/l1d_flush.c

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index 63feaf2a5f93..bac56fcd9790 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -6,6 +6,9 @@
 #include <asm-generic/cacheflush.h>
 #include <asm/special_insns.h>
 
+#define L1D_CACHE_ORDER 4
 void clflush_cache_range(void *addr, unsigned int size);
+void *l1d_flush_alloc_pages(void);
+void l1d_flush_cleanup_pages(void *l1d_flush_pages);
 
 #endif /* _ASM_X86_CACHEFLUSH_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 92e1261ec4ec..42c11ca85f1c 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -158,3 +158,4 @@ ifeq ($(CONFIG_X86_64),y)
 endif
 
 obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT)	+= ima_arch.o
+obj-y						+= l1d_flush.o
diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
new file mode 100644
index 000000000000..d605878c8f28
--- /dev/null
+++ b/arch/x86/kernel/l1d_flush.c
@@ -0,0 +1,36 @@
+#include <linux/mm.h>
+#include <asm/cacheflush.h>
+
+void *l1d_flush_alloc_pages(void)
+{
+	struct page *page;
+	void *l1d_flush_pages = NULL;
+	int i;
+
+	/*
+	 * This allocation for l1d_flush_pages is not tied to a VM/task's
+	 * lifetime and so should not be charged to a memcg.
+	 */
+	page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
+	if (!page)
+		return NULL;
+	l1d_flush_pages = page_address(page);
+
+	/*
+	 * Initialize each page with a different pattern in
+	 * order to protect against KSM in the nested
+	 * virtualization case.
+	 */
+	for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
+		memset(l1d_flush_pages + i * PAGE_SIZE, i + 1,
+				PAGE_SIZE);
+	}
+	return l1d_flush_pages;
+}
+EXPORT_SYMBOL_GPL(l1d_flush_alloc_pages);
+
+void l1d_flush_cleanup_pages(void *l1d_flush_pages)
+{
+	free_pages((unsigned long)l1d_flush_pages, L1D_CACHE_ORDER);
+}
+EXPORT_SYMBOL_GPL(l1d_flush_cleanup_pages);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 83050977490c..225aa8219bac 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -203,14 +203,10 @@ static const struct {
 	[VMENTER_L1D_FLUSH_NOT_REQUIRED] = {"not required", false},
 };
 
-#define L1D_CACHE_ORDER 4
 static void *vmx_l1d_flush_pages;
 
 static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
 {
-	struct page *page;
-	unsigned int i;
-
 	if (!boot_cpu_has_bug(X86_BUG_L1TF)) {
 		l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED;
 		return 0;
@@ -253,24 +249,9 @@ static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
 
 	if (l1tf != VMENTER_L1D_FLUSH_NEVER && !vmx_l1d_flush_pages &&
 	    !boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {
-		/*
-		 * This allocation for vmx_l1d_flush_pages is not tied to a VM
-		 * lifetime and so should not be charged to a memcg.
-		 */
-		page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
-		if (!page)
+		vmx_l1d_flush_pages = l1d_flush_alloc_pages();
+		if (!vmx_l1d_flush_pages)
 			return -ENOMEM;
-		vmx_l1d_flush_pages = page_address(page);
-
-		/*
-		 * Initialize each page with a different pattern in
-		 * order to protect against KSM in the nested
-		 * virtualization case.
-		 */
-		for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
-			memset(vmx_l1d_flush_pages + i * PAGE_SIZE, i + 1,
-			       PAGE_SIZE);
-		}
 	}
 
 	l1tf_vmx_mitigation = l1tf;
@@ -8026,7 +8007,7 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = {
 static void vmx_cleanup_l1d_flush(void)
 {
 	if (vmx_l1d_flush_pages) {
-		free_pages((unsigned long)vmx_l1d_flush_pages, L1D_CACHE_ORDER);
+		l1d_flush_cleanup_pages(vmx_l1d_flush_pages);
 		vmx_l1d_flush_pages = NULL;
 	}
 	/* Restore state so sysfs ignores VMX */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 2/6] arch/x86/kvm: Refactor tlbflush and l1d flush
  2020-04-23 14:01 [PATCH v4 0/6] Optionally flush L1D on context switch Balbir Singh
  2020-04-23 14:01 ` [PATCH v4 1/6] arch/x86/kvm: Refactor l1d flush lifecycle management Balbir Singh
@ 2020-04-23 14:01 ` Balbir Singh
  2020-04-24 19:04   ` Tom Lendacky
  2020-04-23 14:01 ` [PATCH v4 3/6] arch/x86/mm: Refactor cond_ibpb() to support other use cases Balbir Singh
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Balbir Singh @ 2020-04-23 14:01 UTC (permalink / raw)
  To: tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen, Balbir Singh

Refactor the existing assembly bits into smaller helper functions
and also abstract L1D_FLUSH into a helper function. Use these
functions in kvm for L1D flushing.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
---
 arch/x86/include/asm/cacheflush.h |  3 ++
 arch/x86/kernel/l1d_flush.c       | 54 +++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c            | 29 ++---------------
 3 files changed, 60 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index bac56fcd9790..21cc3b28fa63 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -8,7 +8,10 @@
 
 #define L1D_CACHE_ORDER 4
 void clflush_cache_range(void *addr, unsigned int size);
+void l1d_flush_populate_tlb(void *l1d_flush_pages);
 void *l1d_flush_alloc_pages(void);
 void l1d_flush_cleanup_pages(void *l1d_flush_pages);
+void l1d_flush_sw(void *l1d_flush_pages);
+int l1d_flush_hw(void);
 
 #endif /* _ASM_X86_CACHEFLUSH_H */
diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
index d605878c8f28..5871794f890d 100644
--- a/arch/x86/kernel/l1d_flush.c
+++ b/arch/x86/kernel/l1d_flush.c
@@ -34,3 +34,57 @@ void l1d_flush_cleanup_pages(void *l1d_flush_pages)
 	free_pages((unsigned long)l1d_flush_pages, L1D_CACHE_ORDER);
 }
 EXPORT_SYMBOL_GPL(l1d_flush_cleanup_pages);
+
+/*
+ * Not all users of l1d flush would want to populate the TLB first
+ * split out the function so that callers can optionally flush the L1D
+ * cache via sw without prefetching the TLB.
+ */
+void l1d_flush_populate_tlb(void *l1d_flush_pages)
+{
+	int size = PAGE_SIZE << L1D_CACHE_ORDER;
+
+	asm volatile(
+		/* First ensure the pages are in the TLB */
+		"xorl	%%eax, %%eax\n"
+		".Lpopulate_tlb:\n\t"
+		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
+		"addl	$4096, %%eax\n\t"
+		"cmpl	%%eax, %[size]\n\t"
+		"jne	.Lpopulate_tlb\n\t"
+		"xorl	%%eax, %%eax\n\t"
+		"cpuid\n\t"
+		:: [flush_pages] "r" (l1d_flush_pages),
+		    [size] "r" (size)
+		: "eax", "ebx", "ecx", "edx");
+}
+EXPORT_SYMBOL_GPL(l1d_flush_populate_tlb);
+
+int l1d_flush_hw(void)
+{
+	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+		return 0;
+	}
+	return -ENOTSUPP;
+}
+EXPORT_SYMBOL_GPL(l1d_flush_hw);
+
+void l1d_flush_sw(void *l1d_flush_pages)
+{
+	int size = PAGE_SIZE << L1D_CACHE_ORDER;
+
+	asm volatile(
+			/* Fill the cache */
+			"xorl	%%eax, %%eax\n"
+			".Lfill_cache:\n"
+			"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
+			"addl	$64, %%eax\n\t"
+			"cmpl	%%eax, %[size]\n\t"
+			"jne	.Lfill_cache\n\t"
+			"lfence\n"
+			:: [flush_pages] "r" (l1d_flush_pages),
+			[size] "r" (size)
+			: "eax", "ecx");
+}
+EXPORT_SYMBOL_GPL(l1d_flush_sw);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 225aa8219bac..786d1615a09f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5983,8 +5983,6 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu,
  */
 static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 {
-	int size = PAGE_SIZE << L1D_CACHE_ORDER;
-
 	/*
 	 * This code is only executed when the the flush mode is 'cond' or
 	 * 'always'
@@ -6013,32 +6011,11 @@ static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 
 	vcpu->stat.l1d_flush++;
 
-	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
-		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+	if (!l1d_flush_hw())
 		return;
-	}
 
-	asm volatile(
-		/* First ensure the pages are in the TLB */
-		"xorl	%%eax, %%eax\n"
-		".Lpopulate_tlb:\n\t"
-		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
-		"addl	$4096, %%eax\n\t"
-		"cmpl	%%eax, %[size]\n\t"
-		"jne	.Lpopulate_tlb\n\t"
-		"xorl	%%eax, %%eax\n\t"
-		"cpuid\n\t"
-		/* Now fill the cache */
-		"xorl	%%eax, %%eax\n"
-		".Lfill_cache:\n"
-		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
-		"addl	$64, %%eax\n\t"
-		"cmpl	%%eax, %[size]\n\t"
-		"jne	.Lfill_cache\n\t"
-		"lfence\n"
-		:: [flush_pages] "r" (vmx_l1d_flush_pages),
-		    [size] "r" (size)
-		: "eax", "ebx", "ecx", "edx");
+	l1d_flush_populate_tlb(vmx_l1d_flush_pages);
+	l1d_flush_sw(vmx_l1d_flush_pages);
 }
 
 static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 3/6] arch/x86/mm: Refactor cond_ibpb() to support other use cases
  2020-04-23 14:01 [PATCH v4 0/6] Optionally flush L1D on context switch Balbir Singh
  2020-04-23 14:01 ` [PATCH v4 1/6] arch/x86/kvm: Refactor l1d flush lifecycle management Balbir Singh
  2020-04-23 14:01 ` [PATCH v4 2/6] arch/x86/kvm: Refactor tlbflush and l1d flush Balbir Singh
@ 2020-04-23 14:01 ` Balbir Singh
  2020-04-23 14:01 ` [PATCH v4 4/6] arch/x86/kvm: Refactor L1D flushing Balbir Singh
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Balbir Singh @ 2020-04-23 14:01 UTC (permalink / raw)
  To: tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen, Balbir Singh

cond_ibpb() has the necessary bits required to track the
previous mm in switch_mm_irqs_off(). This can be reused for
other use cases like L1D flushing (on context switch out).

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
---
 arch/x86/include/asm/tlbflush.h |  2 +-
 arch/x86/mm/tlb.c               | 43 +++++++++++++++++----------------
 2 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6f66d841262d..69e6ea20679c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -172,7 +172,7 @@ struct tlb_state {
 	/* Last user mm for optimizing IBPB */
 	union {
 		struct mm_struct	*last_user_mm;
-		unsigned long		last_user_mm_ibpb;
+		unsigned long		last_user_mm_spec;
 	};
 
 	u16 loaded_mm_asid;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 66f96f21a7b6..da5c94286c7d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -33,10 +33,11 @@
  */
 
 /*
- * Use bit 0 to mangle the TIF_SPEC_IB state into the mm pointer which is
- * stored in cpu_tlb_state.last_user_mm_ibpb.
+ * Bits to mangle the TIF_SPEC_IB state into the mm pointer which is
+ * stored in cpu_tlb_state.last_user_mm_spec.
  */
 #define LAST_USER_MM_IBPB	0x1UL
+#define LAST_USER_MM_SPEC_MASK	(LAST_USER_MM_IBPB)
 
 /*
  * We get here when we do something requiring a TLB invalidation
@@ -189,19 +190,24 @@ static void sync_current_stack_to_mm(struct mm_struct *mm)
 	}
 }
 
-static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
 {
 	unsigned long next_tif = task_thread_info(next)->flags;
-	unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
+	unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
 
-	return (unsigned long)next->mm | ibpb;
+	return (unsigned long)next->mm | spec_bits;
 }
 
-static void cond_ibpb(struct task_struct *next)
+static void cond_mitigation(struct task_struct *next)
 {
+	unsigned long prev_mm, next_mm;
+
 	if (!next || !next->mm)
 		return;
 
+	next_mm = mm_mangle_tif_spec_bits(next);
+	prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_spec);
+
 	/*
 	 * Both, the conditional and the always IBPB mode use the mm
 	 * pointer to avoid the IBPB when switching between tasks of the
@@ -212,8 +218,6 @@ static void cond_ibpb(struct task_struct *next)
 	 * exposed data is not really interesting.
 	 */
 	if (static_branch_likely(&switch_mm_cond_ibpb)) {
-		unsigned long prev_mm, next_mm;
-
 		/*
 		 * This is a bit more complex than the always mode because
 		 * it has to handle two cases:
@@ -243,20 +247,14 @@ static void cond_ibpb(struct task_struct *next)
 		 * Optimize this with reasonably small overhead for the
 		 * above cases. Mangle the TIF_SPEC_IB bit into the mm
 		 * pointer of the incoming task which is stored in
-		 * cpu_tlbstate.last_user_mm_ibpb for comparison.
-		 */
-		next_mm = mm_mangle_tif_spec_ib(next);
-		prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_ibpb);
-
-		/*
+		 * cpu_tlbstate.last_user_mm_spec for comparison.
+		 *
 		 * Issue IBPB only if the mm's are different and one or
 		 * both have the IBPB bit set.
 		 */
 		if (next_mm != prev_mm &&
 		    (next_mm | prev_mm) & LAST_USER_MM_IBPB)
 			indirect_branch_prediction_barrier();
-
-		this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm);
 	}
 
 	if (static_branch_unlikely(&switch_mm_always_ibpb)) {
@@ -265,11 +263,12 @@ static void cond_ibpb(struct task_struct *next)
 		 * different context than the user space task which ran
 		 * last on this CPU.
 		 */
-		if (this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) {
+		if ((prev_mm & ~LAST_USER_MM_SPEC_MASK) !=
+					(unsigned long)next->mm)
 			indirect_branch_prediction_barrier();
-			this_cpu_write(cpu_tlbstate.last_user_mm, next->mm);
-		}
 	}
+
+	this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
 }
 
 void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
@@ -374,8 +373,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * Avoid user/user BTB poisoning by flushing the branch
 		 * predictor when switching between processes. This stops
 		 * one process from doing Spectre-v2 attacks on another.
+		 * The hook can also be used for mitigations that rely
+		 * on switch_mm for hooks.
 		 */
-		cond_ibpb(tsk);
+		cond_mitigation(tsk);
 
 		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
 			/*
@@ -501,7 +502,7 @@ void initialize_tlbstate_and_flush(void)
 	write_cr3(build_cr3(mm->pgd, 0));
 
 	/* Reinitialize tlbstate. */
-	this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, LAST_USER_MM_IBPB);
+	this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_IBPB);
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
 	this_cpu_write(cpu_tlbstate.next_asid, 1);
 	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 4/6] arch/x86/kvm: Refactor L1D flushing
  2020-04-23 14:01 [PATCH v4 0/6] Optionally flush L1D on context switch Balbir Singh
                   ` (2 preceding siblings ...)
  2020-04-23 14:01 ` [PATCH v4 3/6] arch/x86/mm: Refactor cond_ibpb() to support other use cases Balbir Singh
@ 2020-04-23 14:01 ` Balbir Singh
  2020-04-23 14:01 ` [PATCH v4 5/6] Optionally flush L1D on context switch Balbir Singh
  2020-04-23 14:01 ` [PATCH v4 6/6] Documentation: Add L1D flushing Documentation Balbir Singh
  5 siblings, 0 replies; 16+ messages in thread
From: Balbir Singh @ 2020-04-23 14:01 UTC (permalink / raw)
  To: tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen, Balbir Singh

Move out the initialization function to l1d_flush_init_once()
so that it can be reused for subsequent patches. The side-effect
of this patch is that the memory allocated for l1d flush pages
is no longer freed up and the memory allocated once is shared
amongst callers.

l1d_flush_sw/hw() are now abstracted under arch_l1d_flush().
vmx_l1d_flush_mutex however continues to exist as it also used
from other code paths.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
---
 arch/x86/include/asm/cacheflush.h | 12 ++++---
 arch/x86/kernel/l1d_flush.c       | 53 +++++++++++++++++++++++--------
 arch/x86/kvm/vmx/vmx.c            | 20 ++----------
 3 files changed, 50 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index 21cc3b28fa63..851d8f1ab827 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -7,11 +7,13 @@
 #include <asm/special_insns.h>
 
 #define L1D_CACHE_ORDER 4
+
+enum l1d_flush_options {
+	L1D_FLUSH_POPULATE_TLB = 0x1,
+};
+
 void clflush_cache_range(void *addr, unsigned int size);
-void l1d_flush_populate_tlb(void *l1d_flush_pages);
-void *l1d_flush_alloc_pages(void);
-void l1d_flush_cleanup_pages(void *l1d_flush_pages);
-void l1d_flush_sw(void *l1d_flush_pages);
-int l1d_flush_hw(void);
+int l1d_flush_init_once(void);
+void arch_l1d_flush(enum l1d_flush_options options);
 
 #endif /* _ASM_X86_CACHEFLUSH_H */
diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
index 5871794f890d..a754b6c288a9 100644
--- a/arch/x86/kernel/l1d_flush.c
+++ b/arch/x86/kernel/l1d_flush.c
@@ -1,7 +1,7 @@
 #include <linux/mm.h>
 #include <asm/cacheflush.h>
 
-void *l1d_flush_alloc_pages(void)
+static void *l1d_flush_alloc_pages(void)
 {
 	struct page *page;
 	void *l1d_flush_pages = NULL;
@@ -27,20 +27,14 @@ void *l1d_flush_alloc_pages(void)
 	}
 	return l1d_flush_pages;
 }
-EXPORT_SYMBOL_GPL(l1d_flush_alloc_pages);
 
-void l1d_flush_cleanup_pages(void *l1d_flush_pages)
-{
-	free_pages((unsigned long)l1d_flush_pages, L1D_CACHE_ORDER);
-}
-EXPORT_SYMBOL_GPL(l1d_flush_cleanup_pages);
 
 /*
  * Not all users of l1d flush would want to populate the TLB first
  * split out the function so that callers can optionally flush the L1D
  * cache via sw without prefetching the TLB.
  */
-void l1d_flush_populate_tlb(void *l1d_flush_pages)
+static void l1d_flush_populate_tlb(void *l1d_flush_pages)
 {
 	int size = PAGE_SIZE << L1D_CACHE_ORDER;
 
@@ -58,9 +52,8 @@ void l1d_flush_populate_tlb(void *l1d_flush_pages)
 		    [size] "r" (size)
 		: "eax", "ebx", "ecx", "edx");
 }
-EXPORT_SYMBOL_GPL(l1d_flush_populate_tlb);
 
-int l1d_flush_hw(void)
+static int l1d_flush_hw(void)
 {
 	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
 		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
@@ -68,9 +61,8 @@ int l1d_flush_hw(void)
 	}
 	return -ENOTSUPP;
 }
-EXPORT_SYMBOL_GPL(l1d_flush_hw);
 
-void l1d_flush_sw(void *l1d_flush_pages)
+static void l1d_flush_sw(void *l1d_flush_pages)
 {
 	int size = PAGE_SIZE << L1D_CACHE_ORDER;
 
@@ -87,4 +79,39 @@ void l1d_flush_sw(void *l1d_flush_pages)
 			[size] "r" (size)
 			: "eax", "ecx");
 }
-EXPORT_SYMBOL_GPL(l1d_flush_sw);
+
+static void *l1d_flush_pages;
+static DEFINE_MUTEX(l1d_flush_mutex);
+
+/*
+ * Initialize and setup L1D flush once, each caller will reuse the
+ * l1d_flush_pages for flushing, no per CPU allocations or NUMA aware
+ * allocations at the moment.
+ */
+int l1d_flush_init_once(void)
+{
+	int ret = 0;
+
+	if (static_cpu_has(X86_FEATURE_FLUSH_L1D) || l1d_flush_pages)
+		return ret;
+
+	mutex_lock(&l1d_flush_mutex);
+	if (!l1d_flush_pages)
+		l1d_flush_pages = l1d_flush_alloc_pages();
+	ret = l1d_flush_pages ? 0 : -ENOMEM;
+	mutex_unlock(&l1d_flush_mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(l1d_flush_init_once);
+
+void arch_l1d_flush(enum l1d_flush_options options)
+{
+	if (!l1d_flush_hw())
+		return;
+
+	if (options & L1D_FLUSH_POPULATE_TLB)
+		l1d_flush_populate_tlb(l1d_flush_pages);
+
+	l1d_flush_sw(l1d_flush_pages);
+}
+EXPORT_SYMBOL_GPL(arch_l1d_flush);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 786d1615a09f..d489234c4d5a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -203,8 +203,6 @@ static const struct {
 	[VMENTER_L1D_FLUSH_NOT_REQUIRED] = {"not required", false},
 };
 
-static void *vmx_l1d_flush_pages;
-
 static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
 {
 	if (!boot_cpu_has_bug(X86_BUG_L1TF)) {
@@ -247,12 +245,9 @@ static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
 		l1tf = VMENTER_L1D_FLUSH_ALWAYS;
 	}
 
-	if (l1tf != VMENTER_L1D_FLUSH_NEVER && !vmx_l1d_flush_pages &&
-	    !boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {
-		vmx_l1d_flush_pages = l1d_flush_alloc_pages();
-		if (!vmx_l1d_flush_pages)
+	if (l1tf != VMENTER_L1D_FLUSH_NEVER)
+		if (l1d_flush_init_once())
 			return -ENOMEM;
-	}
 
 	l1tf_vmx_mitigation = l1tf;
 
@@ -6010,12 +6005,7 @@ static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 	}
 
 	vcpu->stat.l1d_flush++;
-
-	if (!l1d_flush_hw())
-		return;
-
-	l1d_flush_populate_tlb(vmx_l1d_flush_pages);
-	l1d_flush_sw(vmx_l1d_flush_pages);
+	arch_l1d_flush(L1D_FLUSH_POPULATE_TLB);
 }
 
 static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
@@ -7983,10 +7973,6 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = {
 
 static void vmx_cleanup_l1d_flush(void)
 {
-	if (vmx_l1d_flush_pages) {
-		l1d_flush_cleanup_pages(vmx_l1d_flush_pages);
-		vmx_l1d_flush_pages = NULL;
-	}
 	/* Restore state so sysfs ignores VMX */
 	l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 5/6] Optionally flush L1D on context switch
  2020-04-23 14:01 [PATCH v4 0/6] Optionally flush L1D on context switch Balbir Singh
                   ` (3 preceding siblings ...)
  2020-04-23 14:01 ` [PATCH v4 4/6] arch/x86/kvm: Refactor L1D flushing Balbir Singh
@ 2020-04-23 14:01 ` Balbir Singh
  2020-04-23 19:19   ` Kees Cook
  2020-04-23 14:01 ` [PATCH v4 6/6] Documentation: Add L1D flushing Documentation Balbir Singh
  5 siblings, 1 reply; 16+ messages in thread
From: Balbir Singh @ 2020-04-23 14:01 UTC (permalink / raw)
  To: tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen, Balbir Singh

Implement a mechanism to selectively flush the L1D cache. The goal is to
allow tasks that are paranoid due to the recent snoop assisted data sampling
vulnerabilites, to flush their L1D on being switched out.  This protects
their data from being snooped or leaked via side channels after the task
has context switched out.

There are two scenarios we might want to protect against, a task leaving
the CPU with data still in L1D (which is the main concern of this patch),
the second scenario is a malicious task coming in (not so well trusted)
for which we want to clean up the cache before it starts. Only the case
for the former is addressed.

A new thread_info flag TIF_SPEC_FLUSH_L1D is added to track tasks which
opt-into L1D flushing. cpu_tlbstate.last_user_mm_spec is used to convert
the TIF flags into mm state (per cpu via last_user_mm_spec) in
cond_mitigation(), which then used to do decide when to call flush_l1d().

Add prctl()'s to opt-in to the L1D cache on context switch out, the
existing mechanisms of tracking prev_mm via cpu_tlbstate is
reused to track state of the tasks and to flush the L1D cache.
The prctl interface is generic and can be ported over to other
architectures.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
---
 arch/x86/include/asm/thread_info.h |  7 ++++-
 arch/x86/mm/tlb.c                  | 44 ++++++++++++++++++++++++++++--
 include/uapi/linux/prctl.h         |  4 +++
 kernel/sys.c                       | 20 ++++++++++++++
 4 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8de8ceccb8bc..67de693d9ba1 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,7 +84,7 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
-#define TIF_SPEC_FORCE_UPDATE	10	/* Force speculation MSR update in context switch */
+#define TIF_SPEC_FLUSH_L1D	10	/* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
 #define TIF_PATCH_PENDING	13	/* pending live patching update */
@@ -96,6 +96,7 @@ struct thread_info {
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
+#define TIF_SPEC_FORCE_UPDATE	23	/* Force speculation MSR update in context switch */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
@@ -132,6 +133,7 @@ struct thread_info {
 #define _TIF_ADDR32		(1 << TIF_ADDR32)
 #define _TIF_X32		(1 << TIF_X32)
 #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
+#define _TIF_SPEC_FLUSH_L1D	(1 << TIF_SPEC_FLUSH_L1D)
 
 /* Work to do before invoking the actual syscall. */
 #define _TIF_WORK_SYSCALL_ENTRY	\
@@ -235,6 +237,9 @@ static inline int arch_within_stack_frames(const void * const stack,
 			   current_thread_info()->status & TS_COMPAT)
 #endif
 
+extern int arch_prctl_l1d_flush_set(struct task_struct *tsk, unsigned long enable);
+extern int arch_prctl_l1d_flush_get(struct task_struct *tsk);
+
 extern void arch_task_cache_init(void);
 extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
 extern void arch_release_task_struct(struct task_struct *tsk);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index da5c94286c7d..7778560760e6 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -13,6 +13,7 @@
 #include <asm/mmu_context.h>
 #include <asm/nospec-branch.h>
 #include <asm/cache.h>
+#include <asm/cacheflush.h>
 #include <asm/apic.h>
 #include <asm/uv/uv.h>
 
@@ -33,11 +34,12 @@
  */
 
 /*
- * Bits to mangle the TIF_SPEC_IB state into the mm pointer which is
+ * Bits to mangle the TIF_SPEC_* state into the mm pointer which is
  * stored in cpu_tlb_state.last_user_mm_spec.
  */
 #define LAST_USER_MM_IBPB	0x1UL
-#define LAST_USER_MM_SPEC_MASK	(LAST_USER_MM_IBPB)
+#define LAST_USER_MM_L1D_FLUSH	0x2UL
+#define LAST_USER_MM_SPEC_MASK	(LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)
 
 /*
  * We get here when we do something requiring a TLB invalidation
@@ -152,6 +154,35 @@ void leave_mm(int cpu)
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
+static int enable_l1d_flush_for_task(struct task_struct *tsk)
+{
+	int ret = l1d_flush_init_once();
+
+	if (ret < 0)
+		return ret;
+
+	set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_FLUSH_L1D);
+	return ret;
+}
+
+static int disable_l1d_flush_for_task(struct task_struct *tsk)
+{
+	clear_ti_thread_flag(&tsk->thread_info, TIF_SPEC_FLUSH_L1D);
+	return 0;
+}
+
+int arch_prctl_l1d_flush_get(struct task_struct *tsk)
+{
+	return test_ti_thread_flag(&tsk->thread_info, TIF_SPEC_FLUSH_L1D);
+}
+
+int arch_prctl_l1d_flush_set(struct task_struct *tsk, unsigned long enable)
+{
+	if (enable)
+		return enable_l1d_flush_for_task(tsk);
+	return disable_l1d_flush_for_task(tsk);
+}
+
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
 {
@@ -268,6 +299,9 @@ static void cond_mitigation(struct task_struct *next)
 			indirect_branch_prediction_barrier();
 	}
 
+	if (prev_mm & LAST_USER_MM_L1D_FLUSH)
+		arch_l1d_flush(0); /* Just flush, don't populate the TLB */
+
 	this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
 }
 
@@ -502,6 +536,12 @@ void initialize_tlbstate_and_flush(void)
 	write_cr3(build_cr3(mm->pgd, 0));
 
 	/* Reinitialize tlbstate. */
+
+	/*
+	 * Leave last_user_mm_spec at LAST_USER_MM_IBPB, we don't
+	 * want to set LAST_USER_MM_L1D_FLUSH and force a flush before
+	 * we've allocated the flush pages.
+	 */
 	this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_IBPB);
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
 	this_cpu_write(cpu_tlbstate.next_asid, 1);
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..42cb3038c81a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,8 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Flush L1D on context switch (mm) */
+#define PR_SET_L1D_FLUSH		59
+#define PR_GET_L1D_FLUSH		60
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index d325f3ab624a..578aa8b6d87e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2262,6 +2262,16 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
 	return -EINVAL;
 }
 
+int __weak arch_prctl_l1d_flush_set(struct task_struct *tsk, unsigned long enable)
+{
+	return -EINVAL;
+}
+
+int __weak arch_prctl_l1d_flush_get(struct task_struct *t)
+{
+	return -EINVAL;
+}
+
 #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE)
 
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
@@ -2514,6 +2524,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_SET_L1D_FLUSH:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = arch_prctl_l1d_flush_set(me, arg2);
+		break;
+	case PR_GET_L1D_FLUSH:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = arch_prctl_l1d_flush_get(me);
+		break;
 	default:
 		error = -EINVAL;
 		break;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 6/6] Documentation: Add L1D flushing Documentation
  2020-04-23 14:01 [PATCH v4 0/6] Optionally flush L1D on context switch Balbir Singh
                   ` (4 preceding siblings ...)
  2020-04-23 14:01 ` [PATCH v4 5/6] Optionally flush L1D on context switch Balbir Singh
@ 2020-04-23 14:01 ` Balbir Singh
  2020-04-23 19:20   ` Kees Cook
  5 siblings, 1 reply; 16+ messages in thread
From: Balbir Singh @ 2020-04-23 14:01 UTC (permalink / raw)
  To: tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen, Balbir Singh

Add documentation of l1d flushing, explain the need for the
feature and how it can be used.

Signed-off-by: Balbir Singh <sblbir@amazon.com>
---
 Documentation/admin-guide/hw-vuln/index.rst   |  1 +
 .../admin-guide/hw-vuln/l1d_flush.rst         | 40 +++++++++++++++++++
 2 files changed, 41 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst

diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index 0795e3c2643f..35633b299d45 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -14,3 +14,4 @@ are configurable at compile, boot or run time.
    mds
    tsx_async_abort
    multihit.rst
+   l1d_flush
diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
new file mode 100644
index 000000000000..7d515b8c29f1
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
@@ -0,0 +1,40 @@
+L1D Flushing for the paranoid
+=============================
+
+With an increasing number of vulnerabilities being reported around data
+leaks from L1D, a new user space mechanism to flush the L1D cache on
+context switch is added to the kernel. This should help address
+CVE-2020-0550 and for paranoid applications, keep them safe from any
+yet to be discovered vulnerabilities, related to leaks from the L1D
+cache.
+
+Tasks can opt in to this mechanism by using a prctl (implemented only
+for x86 at the moment).
+
+Related CVES
+------------
+At the present moment, the following CVEs can be addressed by this
+mechanism
+
+    =============       ========================     ==================
+    CVE-2020-0550       Improper Data Forwarding     OS related aspects
+    =============       ========================     ==================
+
+Usage Guidelines
+----------------
+Applications can call ``prctl(2)`` with one of these two arguments
+
+1. PR_SET_L1D_FLUSH - flush the L1D cache on context switch (out)
+2. PR_GET_L1D_FLUSH - get the current state of the L1D cache flush, returns 1
+   if set and 0 if not set.
+
+**NOTE**: The feature is disabled by default, applications to need to specifically
+opt into the feature to enable it.
+
+Mitigation
+----------
+When PR_SET_L1D_FLUSH is enabled for a task, on switching tasks (when
+the address space changes), a flush of the L1D cache is performed for
+the task when it leaves the CPU. If the underlying CPU supports L1D
+flushing in hardware, the hardware mechanism is used, otherwise a software
+fallback, similar to the mechanism used by L1TF is used.
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 5/6] Optionally flush L1D on context switch
  2020-04-23 14:01 ` [PATCH v4 5/6] Optionally flush L1D on context switch Balbir Singh
@ 2020-04-23 19:19   ` Kees Cook
  2020-04-24  9:56     ` Singh, Balbir
  0 siblings, 1 reply; 16+ messages in thread
From: Kees Cook @ 2020-04-23 19:19 UTC (permalink / raw)
  To: Balbir Singh
  Cc: tglx, linux-kernel, jpoimboe, tony.luck, benh, x86, dave.hansen

On Fri, Apr 24, 2020 at 12:01:24AM +1000, Balbir Singh wrote:
> Implement a mechanism to selectively flush the L1D cache. The goal is to
> allow tasks that are paranoid due to the recent snoop assisted data sampling
> vulnerabilites, to flush their L1D on being switched out.  This protects
> their data from being snooped or leaked via side channels after the task
> has context switched out.
> 
> There are two scenarios we might want to protect against, a task leaving
> the CPU with data still in L1D (which is the main concern of this patch),
> the second scenario is a malicious task coming in (not so well trusted)
> for which we want to clean up the cache before it starts. Only the case
> for the former is addressed.
> 
> A new thread_info flag TIF_SPEC_FLUSH_L1D is added to track tasks which
> opt-into L1D flushing. cpu_tlbstate.last_user_mm_spec is used to convert
> the TIF flags into mm state (per cpu via last_user_mm_spec) in
> cond_mitigation(), which then used to do decide when to call flush_l1d().
> 
> Add prctl()'s to opt-in to the L1D cache on context switch out, the
> existing mechanisms of tracking prev_mm via cpu_tlbstate is
> reused to track state of the tasks and to flush the L1D cache.
> The prctl interface is generic and can be ported over to other
> architectures.
> 
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Balbir Singh <sblbir@amazon.com>

I'm not a huge fan of __weak (I like CONFIGs better), but that's no
enough to NAK this. ;) Thanks for the prctl() change!

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  arch/x86/include/asm/thread_info.h |  7 ++++-
>  arch/x86/mm/tlb.c                  | 44 ++++++++++++++++++++++++++++--
>  include/uapi/linux/prctl.h         |  4 +++
>  kernel/sys.c                       | 20 ++++++++++++++
>  4 files changed, 72 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 8de8ceccb8bc..67de693d9ba1 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -84,7 +84,7 @@ struct thread_info {
>  #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
>  #define TIF_SECCOMP		8	/* secure computing */
>  #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
> -#define TIF_SPEC_FORCE_UPDATE	10	/* Force speculation MSR update in context switch */
> +#define TIF_SPEC_FLUSH_L1D	10	/* Flush L1D on mm switches (processes) */
>  #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
>  #define TIF_UPROBE		12	/* breakpointed or singlestepping */
>  #define TIF_PATCH_PENDING	13	/* pending live patching update */
> @@ -96,6 +96,7 @@ struct thread_info {
>  #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
>  #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
>  #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
> +#define TIF_SPEC_FORCE_UPDATE	23	/* Force speculation MSR update in context switch */
>  #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
>  #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
>  #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
> @@ -132,6 +133,7 @@ struct thread_info {
>  #define _TIF_ADDR32		(1 << TIF_ADDR32)
>  #define _TIF_X32		(1 << TIF_X32)
>  #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
> +#define _TIF_SPEC_FLUSH_L1D	(1 << TIF_SPEC_FLUSH_L1D)
>  
>  /* Work to do before invoking the actual syscall. */
>  #define _TIF_WORK_SYSCALL_ENTRY	\
> @@ -235,6 +237,9 @@ static inline int arch_within_stack_frames(const void * const stack,
>  			   current_thread_info()->status & TS_COMPAT)
>  #endif
>  
> +extern int arch_prctl_l1d_flush_set(struct task_struct *tsk, unsigned long enable);
> +extern int arch_prctl_l1d_flush_get(struct task_struct *tsk);
> +
>  extern void arch_task_cache_init(void);
>  extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
>  extern void arch_release_task_struct(struct task_struct *tsk);
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index da5c94286c7d..7778560760e6 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -13,6 +13,7 @@
>  #include <asm/mmu_context.h>
>  #include <asm/nospec-branch.h>
>  #include <asm/cache.h>
> +#include <asm/cacheflush.h>
>  #include <asm/apic.h>
>  #include <asm/uv/uv.h>
>  
> @@ -33,11 +34,12 @@
>   */
>  
>  /*
> - * Bits to mangle the TIF_SPEC_IB state into the mm pointer which is
> + * Bits to mangle the TIF_SPEC_* state into the mm pointer which is
>   * stored in cpu_tlb_state.last_user_mm_spec.
>   */
>  #define LAST_USER_MM_IBPB	0x1UL
> -#define LAST_USER_MM_SPEC_MASK	(LAST_USER_MM_IBPB)
> +#define LAST_USER_MM_L1D_FLUSH	0x2UL
> +#define LAST_USER_MM_SPEC_MASK	(LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)
>  
>  /*
>   * We get here when we do something requiring a TLB invalidation
> @@ -152,6 +154,35 @@ void leave_mm(int cpu)
>  }
>  EXPORT_SYMBOL_GPL(leave_mm);
>  
> +static int enable_l1d_flush_for_task(struct task_struct *tsk)
> +{
> +	int ret = l1d_flush_init_once();
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_FLUSH_L1D);
> +	return ret;
> +}
> +
> +static int disable_l1d_flush_for_task(struct task_struct *tsk)
> +{
> +	clear_ti_thread_flag(&tsk->thread_info, TIF_SPEC_FLUSH_L1D);
> +	return 0;
> +}
> +
> +int arch_prctl_l1d_flush_get(struct task_struct *tsk)
> +{
> +	return test_ti_thread_flag(&tsk->thread_info, TIF_SPEC_FLUSH_L1D);
> +}
> +
> +int arch_prctl_l1d_flush_set(struct task_struct *tsk, unsigned long enable)
> +{
> +	if (enable)
> +		return enable_l1d_flush_for_task(tsk);
> +	return disable_l1d_flush_for_task(tsk);
> +}
> +
>  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	       struct task_struct *tsk)
>  {
> @@ -268,6 +299,9 @@ static void cond_mitigation(struct task_struct *next)
>  			indirect_branch_prediction_barrier();
>  	}
>  
> +	if (prev_mm & LAST_USER_MM_L1D_FLUSH)
> +		arch_l1d_flush(0); /* Just flush, don't populate the TLB */
> +
>  	this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
>  }
>  
> @@ -502,6 +536,12 @@ void initialize_tlbstate_and_flush(void)
>  	write_cr3(build_cr3(mm->pgd, 0));
>  
>  	/* Reinitialize tlbstate. */
> +
> +	/*
> +	 * Leave last_user_mm_spec at LAST_USER_MM_IBPB, we don't
> +	 * want to set LAST_USER_MM_L1D_FLUSH and force a flush before
> +	 * we've allocated the flush pages.
> +	 */
>  	this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_IBPB);
>  	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
>  	this_cpu_write(cpu_tlbstate.next_asid, 1);
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 07b4f8131e36..42cb3038c81a 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -238,4 +238,8 @@ struct prctl_mm_map {
>  #define PR_SET_IO_FLUSHER		57
>  #define PR_GET_IO_FLUSHER		58
>  
> +/* Flush L1D on context switch (mm) */
> +#define PR_SET_L1D_FLUSH		59
> +#define PR_GET_L1D_FLUSH		60
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index d325f3ab624a..578aa8b6d87e 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2262,6 +2262,16 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
>  	return -EINVAL;
>  }
>  
> +int __weak arch_prctl_l1d_flush_set(struct task_struct *tsk, unsigned long enable)
> +{
> +	return -EINVAL;
> +}
> +
> +int __weak arch_prctl_l1d_flush_get(struct task_struct *t)
> +{
> +	return -EINVAL;
> +}
> +
>  #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE)
>  
>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> @@ -2514,6 +2524,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  
>  		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
>  		break;
> +	case PR_SET_L1D_FLUSH:
> +		if (arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		error = arch_prctl_l1d_flush_set(me, arg2);
> +		break;
> +	case PR_GET_L1D_FLUSH:
> +		if (arg2 || arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		error = arch_prctl_l1d_flush_get(me);
> +		break;
>  	default:
>  		error = -EINVAL;
>  		break;
> -- 
> 2.17.1
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 6/6] Documentation: Add L1D flushing Documentation
  2020-04-23 14:01 ` [PATCH v4 6/6] Documentation: Add L1D flushing Documentation Balbir Singh
@ 2020-04-23 19:20   ` Kees Cook
  0 siblings, 0 replies; 16+ messages in thread
From: Kees Cook @ 2020-04-23 19:20 UTC (permalink / raw)
  To: Balbir Singh
  Cc: tglx, linux-kernel, jpoimboe, tony.luck, benh, x86, dave.hansen

On Fri, Apr 24, 2020 at 12:01:25AM +1000, Balbir Singh wrote:
> Add documentation of l1d flushing, explain the need for the
> feature and how it can be used.
> 
> Signed-off-by: Balbir Singh <sblbir@amazon.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  Documentation/admin-guide/hw-vuln/index.rst   |  1 +
>  .../admin-guide/hw-vuln/l1d_flush.rst         | 40 +++++++++++++++++++
>  2 files changed, 41 insertions(+)
>  create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst
> 
> diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
> index 0795e3c2643f..35633b299d45 100644
> --- a/Documentation/admin-guide/hw-vuln/index.rst
> +++ b/Documentation/admin-guide/hw-vuln/index.rst
> @@ -14,3 +14,4 @@ are configurable at compile, boot or run time.
>     mds
>     tsx_async_abort
>     multihit.rst
> +   l1d_flush
> diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
> new file mode 100644
> index 000000000000..7d515b8c29f1
> --- /dev/null
> +++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
> @@ -0,0 +1,40 @@
> +L1D Flushing for the paranoid
> +=============================
> +
> +With an increasing number of vulnerabilities being reported around data
> +leaks from L1D, a new user space mechanism to flush the L1D cache on
> +context switch is added to the kernel. This should help address
> +CVE-2020-0550 and for paranoid applications, keep them safe from any
> +yet to be discovered vulnerabilities, related to leaks from the L1D
> +cache.
> +
> +Tasks can opt in to this mechanism by using a prctl (implemented only
> +for x86 at the moment).
> +
> +Related CVES
> +------------
> +At the present moment, the following CVEs can be addressed by this
> +mechanism
> +
> +    =============       ========================     ==================
> +    CVE-2020-0550       Improper Data Forwarding     OS related aspects
> +    =============       ========================     ==================
> +
> +Usage Guidelines
> +----------------
> +Applications can call ``prctl(2)`` with one of these two arguments
> +
> +1. PR_SET_L1D_FLUSH - flush the L1D cache on context switch (out)
> +2. PR_GET_L1D_FLUSH - get the current state of the L1D cache flush, returns 1
> +   if set and 0 if not set.
> +
> +**NOTE**: The feature is disabled by default, applications to need to specifically
> +opt into the feature to enable it.
> +
> +Mitigation
> +----------
> +When PR_SET_L1D_FLUSH is enabled for a task, on switching tasks (when
> +the address space changes), a flush of the L1D cache is performed for
> +the task when it leaves the CPU. If the underlying CPU supports L1D
> +flushing in hardware, the hardware mechanism is used, otherwise a software
> +fallback, similar to the mechanism used by L1TF is used.
> -- 
> 2.17.1
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 5/6] Optionally flush L1D on context switch
  2020-04-23 19:19   ` Kees Cook
@ 2020-04-24  9:56     ` Singh, Balbir
  0 siblings, 0 replies; 16+ messages in thread
From: Singh, Balbir @ 2020-04-24  9:56 UTC (permalink / raw)
  To: keescook; +Cc: tglx, linux-kernel, tony.luck, benh, jpoimboe, x86, dave.hansen

On Thu, 2020-04-23 at 12:19 -0700, Kees Cook wrote:
> CAUTION: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
> 
> 
> 
> On Fri, Apr 24, 2020 at 12:01:24AM +1000, Balbir Singh wrote:
> > Implement a mechanism to selectively flush the L1D cache. The goal is to
> > allow tasks that are paranoid due to the recent snoop assisted data
> > sampling
> > vulnerabilites, to flush their L1D on being switched out.  This protects
> > their data from being snooped or leaked via side channels after the task
> > has context switched out.
> > 
> > There are two scenarios we might want to protect against, a task leaving
> > the CPU with data still in L1D (which is the main concern of this patch),
> > the second scenario is a malicious task coming in (not so well trusted)
> > for which we want to clean up the cache before it starts. Only the case
> > for the former is addressed.
> > 
> > A new thread_info flag TIF_SPEC_FLUSH_L1D is added to track tasks which
> > opt-into L1D flushing. cpu_tlbstate.last_user_mm_spec is used to convert
> > the TIF flags into mm state (per cpu via last_user_mm_spec) in
> > cond_mitigation(), which then used to do decide when to call flush_l1d().
> > 
> > Add prctl()'s to opt-in to the L1D cache on context switch out, the
> > existing mechanisms of tracking prev_mm via cpu_tlbstate is
> > reused to track state of the tasks and to flush the L1D cache.
> > The prctl interface is generic and can be ported over to other
> > architectures.
> > 
> > Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Balbir Singh <sblbir@amazon.com>
> 
> I'm not a huge fan of __weak (I like CONFIGs better), but that's no
> enough to NAK this. ;) Thanks for the prctl() change!
> 
> Reviewed-by: Kees Cook <keescook@chromium.org>
> 

Thanks, the CONFIG_* seemed a bit much for two functions.

Balbir Singh.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 1/6] arch/x86/kvm: Refactor l1d flush lifecycle management
  2020-04-23 14:01 ` [PATCH v4 1/6] arch/x86/kvm: Refactor l1d flush lifecycle management Balbir Singh
@ 2020-04-24 18:59   ` Tom Lendacky
  2020-04-25  1:49     ` Singh, Balbir
  0 siblings, 1 reply; 16+ messages in thread
From: Tom Lendacky @ 2020-04-24 18:59 UTC (permalink / raw)
  To: Balbir Singh, tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen

On 4/23/20 9:01 AM, Balbir Singh wrote:
> Split out the allocation and free routines to be used in a follow
> up set of patches (to reuse for L1D flushing).
> 
> Signed-off-by: Balbir Singh <sblbir@amazon.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> ---
>   arch/x86/include/asm/cacheflush.h |  3 +++
>   arch/x86/kernel/Makefile          |  1 +
>   arch/x86/kernel/l1d_flush.c       | 36 +++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/vmx.c            | 25 +++------------------
>   4 files changed, 43 insertions(+), 22 deletions(-)
>   create mode 100644 arch/x86/kernel/l1d_flush.c
> 
> diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
> index 63feaf2a5f93..bac56fcd9790 100644
> --- a/arch/x86/include/asm/cacheflush.h
> +++ b/arch/x86/include/asm/cacheflush.h
> @@ -6,6 +6,9 @@
>   #include <asm-generic/cacheflush.h>
>   #include <asm/special_insns.h>
>   
> +#define L1D_CACHE_ORDER 4

Since this is becoming a generic function now, shouldn't this value be 
based on the actual L1D cache size? Is this value based on a 32KB data 
cache and the idea is to write twice the size of the cache to be sure that 
every entry has been replaced - with the second 32KB to catch the odd line 
that might not have been pulled in?

Thanks,
Tom

>   void clflush_cache_range(void *addr, unsigned int size);
> +void *l1d_flush_alloc_pages(void);
> +void l1d_flush_cleanup_pages(void *l1d_flush_pages);
>   
>   #endif /* _ASM_X86_CACHEFLUSH_H */
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 92e1261ec4ec..42c11ca85f1c 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -158,3 +158,4 @@ ifeq ($(CONFIG_X86_64),y)
>   endif
>   
>   obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT)	+= ima_arch.o
> +obj-y						+= l1d_flush.o
> diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
> new file mode 100644
> index 000000000000..d605878c8f28
> --- /dev/null
> +++ b/arch/x86/kernel/l1d_flush.c
> @@ -0,0 +1,36 @@
> +#include <linux/mm.h>
> +#include <asm/cacheflush.h>
> +
> +void *l1d_flush_alloc_pages(void)
> +{
> +	struct page *page;
> +	void *l1d_flush_pages = NULL;
> +	int i;
> +
> +	/*
> +	 * This allocation for l1d_flush_pages is not tied to a VM/task's
> +	 * lifetime and so should not be charged to a memcg.
> +	 */
> +	page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
> +	if (!page)
> +		return NULL;
> +	l1d_flush_pages = page_address(page);
> +
> +	/*
> +	 * Initialize each page with a different pattern in
> +	 * order to protect against KSM in the nested
> +	 * virtualization case.
> +	 */
> +	for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
> +		memset(l1d_flush_pages + i * PAGE_SIZE, i + 1,
> +				PAGE_SIZE);
> +	}
> +	return l1d_flush_pages;
> +}
> +EXPORT_SYMBOL_GPL(l1d_flush_alloc_pages);
> +
> +void l1d_flush_cleanup_pages(void *l1d_flush_pages)
> +{
> +	free_pages((unsigned long)l1d_flush_pages, L1D_CACHE_ORDER);
> +}
> +EXPORT_SYMBOL_GPL(l1d_flush_cleanup_pages);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 83050977490c..225aa8219bac 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -203,14 +203,10 @@ static const struct {
>   	[VMENTER_L1D_FLUSH_NOT_REQUIRED] = {"not required", false},
>   };
>   
> -#define L1D_CACHE_ORDER 4
>   static void *vmx_l1d_flush_pages;
>   
>   static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
>   {
> -	struct page *page;
> -	unsigned int i;
> -
>   	if (!boot_cpu_has_bug(X86_BUG_L1TF)) {
>   		l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED;
>   		return 0;
> @@ -253,24 +249,9 @@ static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
>   
>   	if (l1tf != VMENTER_L1D_FLUSH_NEVER && !vmx_l1d_flush_pages &&
>   	    !boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> -		/*
> -		 * This allocation for vmx_l1d_flush_pages is not tied to a VM
> -		 * lifetime and so should not be charged to a memcg.
> -		 */
> -		page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
> -		if (!page)
> +		vmx_l1d_flush_pages = l1d_flush_alloc_pages();
> +		if (!vmx_l1d_flush_pages)
>   			return -ENOMEM;
> -		vmx_l1d_flush_pages = page_address(page);
> -
> -		/*
> -		 * Initialize each page with a different pattern in
> -		 * order to protect against KSM in the nested
> -		 * virtualization case.
> -		 */
> -		for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
> -			memset(vmx_l1d_flush_pages + i * PAGE_SIZE, i + 1,
> -			       PAGE_SIZE);
> -		}
>   	}
>   
>   	l1tf_vmx_mitigation = l1tf;
> @@ -8026,7 +8007,7 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = {
>   static void vmx_cleanup_l1d_flush(void)
>   {
>   	if (vmx_l1d_flush_pages) {
> -		free_pages((unsigned long)vmx_l1d_flush_pages, L1D_CACHE_ORDER);
> +		l1d_flush_cleanup_pages(vmx_l1d_flush_pages);
>   		vmx_l1d_flush_pages = NULL;
>   	}
>   	/* Restore state so sysfs ignores VMX */
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/6] arch/x86/kvm: Refactor tlbflush and l1d flush
  2020-04-23 14:01 ` [PATCH v4 2/6] arch/x86/kvm: Refactor tlbflush and l1d flush Balbir Singh
@ 2020-04-24 19:04   ` Tom Lendacky
  2020-04-25  2:00     ` Singh, Balbir
  0 siblings, 1 reply; 16+ messages in thread
From: Tom Lendacky @ 2020-04-24 19:04 UTC (permalink / raw)
  To: Balbir Singh, tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen

On 4/23/20 9:01 AM, Balbir Singh wrote:
> Refactor the existing assembly bits into smaller helper functions
> and also abstract L1D_FLUSH into a helper function. Use these
> functions in kvm for L1D flushing.
> 
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Balbir Singh <sblbir@amazon.com>
> ---
>   arch/x86/include/asm/cacheflush.h |  3 ++
>   arch/x86/kernel/l1d_flush.c       | 54 +++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/vmx.c            | 29 ++---------------
>   3 files changed, 60 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
> index bac56fcd9790..21cc3b28fa63 100644
> --- a/arch/x86/include/asm/cacheflush.h
> +++ b/arch/x86/include/asm/cacheflush.h
> @@ -8,7 +8,10 @@
>   
>   #define L1D_CACHE_ORDER 4
>   void clflush_cache_range(void *addr, unsigned int size);
> +void l1d_flush_populate_tlb(void *l1d_flush_pages);
>   void *l1d_flush_alloc_pages(void);
>   void l1d_flush_cleanup_pages(void *l1d_flush_pages);
> +void l1d_flush_sw(void *l1d_flush_pages);
> +int l1d_flush_hw(void);
>   
>   #endif /* _ASM_X86_CACHEFLUSH_H */
> diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
> index d605878c8f28..5871794f890d 100644
> --- a/arch/x86/kernel/l1d_flush.c
> +++ b/arch/x86/kernel/l1d_flush.c
> @@ -34,3 +34,57 @@ void l1d_flush_cleanup_pages(void *l1d_flush_pages)
>   	free_pages((unsigned long)l1d_flush_pages, L1D_CACHE_ORDER);
>   }
>   EXPORT_SYMBOL_GPL(l1d_flush_cleanup_pages);
> +
> +/*
> + * Not all users of l1d flush would want to populate the TLB first
> + * split out the function so that callers can optionally flush the L1D
> + * cache via sw without prefetching the TLB.
> + */
> +void l1d_flush_populate_tlb(void *l1d_flush_pages)
> +{
> +	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> +
> +	asm volatile(
> +		/* First ensure the pages are in the TLB */
> +		"xorl	%%eax, %%eax\n"
> +		".Lpopulate_tlb:\n\t"
> +		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> +		"addl	$4096, %%eax\n\t"
> +		"cmpl	%%eax, %[size]\n\t"
> +		"jne	.Lpopulate_tlb\n\t"
> +		"xorl	%%eax, %%eax\n\t"
> +		"cpuid\n\t"
> +		:: [flush_pages] "r" (l1d_flush_pages),
> +		    [size] "r" (size)
> +		: "eax", "ebx", "ecx", "edx");
> +}
> +EXPORT_SYMBOL_GPL(l1d_flush_populate_tlb);
> +
> +int l1d_flush_hw(void)
> +{
> +	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> +		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> +		return 0;
> +	}
> +	return -ENOTSUPP;
> +}
> +EXPORT_SYMBOL_GPL(l1d_flush_hw);
> +
> +void l1d_flush_sw(void *l1d_flush_pages)
> +{
> +	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> +
> +	asm volatile(
> +			/* Fill the cache */
> +			"xorl	%%eax, %%eax\n"
> +			".Lfill_cache:\n"
> +			"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> +			"addl	$64, %%eax\n\t"
> +			"cmpl	%%eax, %[size]\n\t"
> +			"jne	.Lfill_cache\n\t"
> +			"lfence\n"
> +			:: [flush_pages] "r" (l1d_flush_pages),
> +			[size] "r" (size)
> +			: "eax", "ecx");
> +}

If the answer to my previous question in patch 1/6 about the allocation 
being twice the size is yes, then could you allocate the flush pages the 
same size as the cache and just write it twice? Wouldn't that accomplish 
the same goal and provide a performance improvement with (mostly) now 
present L1D entries of the flush pages? Also, can't you unroll this loop a 
bit and operate on multiple entries in each loop, reducing the overall 
looping compares and jumps as an optimization?

Thanks,
Tom


> +EXPORT_SYMBOL_GPL(l1d_flush_sw);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 225aa8219bac..786d1615a09f 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5983,8 +5983,6 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu,
>    */
>   static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
>   {
> -	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> -
>   	/*
>   	 * This code is only executed when the the flush mode is 'cond' or
>   	 * 'always'
> @@ -6013,32 +6011,11 @@ static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
>   
>   	vcpu->stat.l1d_flush++;
>   
> -	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> -		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> +	if (!l1d_flush_hw())
>   		return;
> -	}
>   
> -	asm volatile(
> -		/* First ensure the pages are in the TLB */
> -		"xorl	%%eax, %%eax\n"
> -		".Lpopulate_tlb:\n\t"
> -		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> -		"addl	$4096, %%eax\n\t"
> -		"cmpl	%%eax, %[size]\n\t"
> -		"jne	.Lpopulate_tlb\n\t"
> -		"xorl	%%eax, %%eax\n\t"
> -		"cpuid\n\t"
> -		/* Now fill the cache */
> -		"xorl	%%eax, %%eax\n"
> -		".Lfill_cache:\n"
> -		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> -		"addl	$64, %%eax\n\t"
> -		"cmpl	%%eax, %[size]\n\t"
> -		"jne	.Lfill_cache\n\t"
> -		"lfence\n"
> -		:: [flush_pages] "r" (vmx_l1d_flush_pages),
> -		    [size] "r" (size)
> -		: "eax", "ebx", "ecx", "edx");
> +	l1d_flush_populate_tlb(vmx_l1d_flush_pages);
> +	l1d_flush_sw(vmx_l1d_flush_pages);
>   }
>   
>   static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re:  [PATCH v4 1/6] arch/x86/kvm: Refactor l1d flush lifecycle management
  2020-04-24 18:59   ` Tom Lendacky
@ 2020-04-25  1:49     ` Singh, Balbir
  2020-05-01  3:48       ` Singh, Balbir
  0 siblings, 1 reply; 16+ messages in thread
From: Singh, Balbir @ 2020-04-25  1:49 UTC (permalink / raw)
  To: thomas.lendacky, tglx, linux-kernel
  Cc: keescook, tony.luck, benh, jpoimboe, x86, dave.hansen

On Fri, 2020-04-24 at 13:59 -0500, Tom Lendacky wrote:
> 
> On 4/23/20 9:01 AM, Balbir Singh wrote:
> > Split out the allocation and free routines to be used in a follow
> > up set of patches (to reuse for L1D flushing).
> > 
> > Signed-off-by: Balbir Singh <sblbir@amazon.com>
> > Reviewed-by: Kees Cook <keescook@chromium.org>
> > ---
> >   arch/x86/include/asm/cacheflush.h |  3 +++
> >   arch/x86/kernel/Makefile          |  1 +
> >   arch/x86/kernel/l1d_flush.c       | 36 +++++++++++++++++++++++++++++++
> >   arch/x86/kvm/vmx/vmx.c            | 25 +++------------------
> >   4 files changed, 43 insertions(+), 22 deletions(-)
> >   create mode 100644 arch/x86/kernel/l1d_flush.c
> > 
> > diff --git a/arch/x86/include/asm/cacheflush.h
> > b/arch/x86/include/asm/cacheflush.h
> > index 63feaf2a5f93..bac56fcd9790 100644
> > --- a/arch/x86/include/asm/cacheflush.h
> > +++ b/arch/x86/include/asm/cacheflush.h
> > @@ -6,6 +6,9 @@
> >   #include <asm-generic/cacheflush.h>
> >   #include <asm/special_insns.h>
> > 
> > +#define L1D_CACHE_ORDER 4
> 
> Since this is becoming a generic function now, shouldn't this value be
> based on the actual L1D cache size? Is this value based on a 32KB data
> cache and the idea is to write twice the size of the cache to be sure that
> every entry has been replaced - with the second 32KB to catch the odd line
> that might not have been pulled in?
> 

Currently the only users are VMX L1TF and optional prctl(). It should be based
on actual L1D cache size, I checked a little bit and the largest L1D cache
size across various x86 bits is 64K. so there are three options here:

1. We refactor the code, we would need to save the L1D cache size and use
cpu_dev callbacks for L1D flush
2. We can make the current code depend on L1D_FLUSH MSR and enable it only
when that feature is available. There would be no software fallback. Then
follow it up with #1
3. We keep the code as is on the assumption that all of L1D <= 64K across the
current platforms and we do #1 in a followup (since the prctl is optional and
the only other user is the VMX code).

Thanks for the review,
Balbir Singh.



> Thanks,
> Tom
> 
> >   void clflush_cache_range(void *addr, unsigned int size);
> > +void *l1d_flush_alloc_pages(void);
> > +void l1d_flush_cleanup_pages(void *l1d_flush_pages);
> > 
> >   #endif /* _ASM_X86_CACHEFLUSH_H */
> > diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> > index 92e1261ec4ec..42c11ca85f1c 100644
> > --- a/arch/x86/kernel/Makefile
> > +++ b/arch/x86/kernel/Makefile
> > @@ -158,3 +158,4 @@ ifeq ($(CONFIG_X86_64),y)
> >   endif
> > 
> >   obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT)        += ima_arch.o
> > +obj-y                                                += l1d_flush.o
> > diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
> > new file mode 100644
> > index 000000000000..d605878c8f28
> > --- /dev/null
> > +++ b/arch/x86/kernel/l1d_flush.c
> > @@ -0,0 +1,36 @@
> > +#include <linux/mm.h>
> > +#include <asm/cacheflush.h>
> > +
> > +void *l1d_flush_alloc_pages(void)
> > +{
> > +     struct page *page;
> > +     void *l1d_flush_pages = NULL;
> > +     int i;
> > +
> > +     /*
> > +      * This allocation for l1d_flush_pages is not tied to a VM/task's
> > +      * lifetime and so should not be charged to a memcg.
> > +      */
> > +     page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
> > +     if (!page)
> > +             return NULL;
> > +     l1d_flush_pages = page_address(page);
> > +
> > +     /*
> > +      * Initialize each page with a different pattern in
> > +      * order to protect against KSM in the nested
> > +      * virtualization case.
> > +      */
> > +     for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
> > +             memset(l1d_flush_pages + i * PAGE_SIZE, i + 1,
> > +                             PAGE_SIZE);
> > +     }
> > +     return l1d_flush_pages;
> > +}
> > +EXPORT_SYMBOL_GPL(l1d_flush_alloc_pages);
> > +
> > +void l1d_flush_cleanup_pages(void *l1d_flush_pages)
> > +{
> > +     free_pages((unsigned long)l1d_flush_pages, L1D_CACHE_ORDER);
> > +}
> > +EXPORT_SYMBOL_GPL(l1d_flush_cleanup_pages);
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 83050977490c..225aa8219bac 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -203,14 +203,10 @@ static const struct {
> >       [VMENTER_L1D_FLUSH_NOT_REQUIRED] = {"not required", false},
> >   };
> > 
> > -#define L1D_CACHE_ORDER 4
> >   static void *vmx_l1d_flush_pages;
> > 
> >   static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
> >   {
> > -     struct page *page;
> > -     unsigned int i;
> > -
> >       if (!boot_cpu_has_bug(X86_BUG_L1TF)) {
> >               l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED;
> >               return 0;
> > @@ -253,24 +249,9 @@ static int vmx_setup_l1d_flush(enum
> > vmx_l1d_flush_state l1tf)
> > 
> >       if (l1tf != VMENTER_L1D_FLUSH_NEVER && !vmx_l1d_flush_pages &&
> >           !boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> > -             /*
> > -              * This allocation for vmx_l1d_flush_pages is not tied to a
> > VM
> > -              * lifetime and so should not be charged to a memcg.
> > -              */
> > -             page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
> > -             if (!page)
> > +             vmx_l1d_flush_pages = l1d_flush_alloc_pages();
> > +             if (!vmx_l1d_flush_pages)
> >                       return -ENOMEM;
> > -             vmx_l1d_flush_pages = page_address(page);
> > -
> > -             /*
> > -              * Initialize each page with a different pattern in
> > -              * order to protect against KSM in the nested
> > -              * virtualization case.
> > -              */
> > -             for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
> > -                     memset(vmx_l1d_flush_pages + i * PAGE_SIZE, i + 1,
> > -                            PAGE_SIZE);
> > -             }
> >       }
> > 
> >       l1tf_vmx_mitigation = l1tf;
> > @@ -8026,7 +8007,7 @@ static struct kvm_x86_init_ops vmx_init_ops
> > __initdata = {
> >   static void vmx_cleanup_l1d_flush(void)
> >   {
> >       if (vmx_l1d_flush_pages) {
> > -             free_pages((unsigned long)vmx_l1d_flush_pages,
> > L1D_CACHE_ORDER);
> > +             l1d_flush_cleanup_pages(vmx_l1d_flush_pages);
> >               vmx_l1d_flush_pages = NULL;
> >       }
> >       /* Restore state so sysfs ignores VMX */
> > 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/6] arch/x86/kvm: Refactor tlbflush and l1d flush
  2020-04-24 19:04   ` Tom Lendacky
@ 2020-04-25  2:00     ` Singh, Balbir
  0 siblings, 0 replies; 16+ messages in thread
From: Singh, Balbir @ 2020-04-25  2:00 UTC (permalink / raw)
  To: thomas.lendacky, tglx, linux-kernel
  Cc: keescook, tony.luck, benh, jpoimboe, x86, dave.hansen

On Fri, 2020-04-24 at 14:04 -0500, Tom Lendacky wrote:
> CAUTION: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
> 
> 
> 
> On 4/23/20 9:01 AM, Balbir Singh wrote:
> > Refactor the existing assembly bits into smaller helper functions
> > and also abstract L1D_FLUSH into a helper function. Use these
> > functions in kvm for L1D flushing.
> > 
> > Reviewed-by: Kees Cook <keescook@chromium.org>
> > Signed-off-by: Balbir Singh <sblbir@amazon.com>
> > ---
> >   arch/x86/include/asm/cacheflush.h |  3 ++
> >   arch/x86/kernel/l1d_flush.c       | 54 +++++++++++++++++++++++++++++++
> >   arch/x86/kvm/vmx/vmx.c            | 29 ++---------------
> >   3 files changed, 60 insertions(+), 26 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/cacheflush.h
> > b/arch/x86/include/asm/cacheflush.h
> > index bac56fcd9790..21cc3b28fa63 100644
> > --- a/arch/x86/include/asm/cacheflush.h
> > +++ b/arch/x86/include/asm/cacheflush.h
> > @@ -8,7 +8,10 @@
> > 
> >   #define L1D_CACHE_ORDER 4
> >   void clflush_cache_range(void *addr, unsigned int size);
> > +void l1d_flush_populate_tlb(void *l1d_flush_pages);
> >   void *l1d_flush_alloc_pages(void);
> >   void l1d_flush_cleanup_pages(void *l1d_flush_pages);
> > +void l1d_flush_sw(void *l1d_flush_pages);
> > +int l1d_flush_hw(void);
> > 
> >   #endif /* _ASM_X86_CACHEFLUSH_H */
> > diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
> > index d605878c8f28..5871794f890d 100644
> > --- a/arch/x86/kernel/l1d_flush.c
> > +++ b/arch/x86/kernel/l1d_flush.c
> > @@ -34,3 +34,57 @@ void l1d_flush_cleanup_pages(void *l1d_flush_pages)
> >       free_pages((unsigned long)l1d_flush_pages, L1D_CACHE_ORDER);
> >   }
> >   EXPORT_SYMBOL_GPL(l1d_flush_cleanup_pages);
> > +
> > +/*
> > + * Not all users of l1d flush would want to populate the TLB first
> > + * split out the function so that callers can optionally flush the L1D
> > + * cache via sw without prefetching the TLB.
> > + */
> > +void l1d_flush_populate_tlb(void *l1d_flush_pages)
> > +{
> > +     int size = PAGE_SIZE << L1D_CACHE_ORDER;
> > +
> > +     asm volatile(
> > +             /* First ensure the pages are in the TLB */
> > +             "xorl   %%eax, %%eax\n"
> > +             ".Lpopulate_tlb:\n\t"
> > +             "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> > +             "addl   $4096, %%eax\n\t"
> > +             "cmpl   %%eax, %[size]\n\t"
> > +             "jne    .Lpopulate_tlb\n\t"
> > +             "xorl   %%eax, %%eax\n\t"
> > +             "cpuid\n\t"
> > +             :: [flush_pages] "r" (l1d_flush_pages),
> > +                 [size] "r" (size)
> > +             : "eax", "ebx", "ecx", "edx");
> > +}
> > +EXPORT_SYMBOL_GPL(l1d_flush_populate_tlb);
> > +
> > +int l1d_flush_hw(void)
> > +{
> > +     if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> > +             wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> > +             return 0;
> > +     }
> > +     return -ENOTSUPP;
> > +}
> > +EXPORT_SYMBOL_GPL(l1d_flush_hw);
> > +
> > +void l1d_flush_sw(void *l1d_flush_pages)
> > +{
> > +     int size = PAGE_SIZE << L1D_CACHE_ORDER;
> > +
> > +     asm volatile(
> > +                     /* Fill the cache */
> > +                     "xorl   %%eax, %%eax\n"
> > +                     ".Lfill_cache:\n"
> > +                     "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> > +                     "addl   $64, %%eax\n\t"
> > +                     "cmpl   %%eax, %[size]\n\t"
> > +                     "jne    .Lfill_cache\n\t"
> > +                     "lfence\n"
> > +                     :: [flush_pages] "r" (l1d_flush_pages),
> > +                     [size] "r" (size)
> > +                     : "eax", "ecx");
> > +}
> 
> If the answer to my previous question in patch 1/6 about the allocation
> being twice the size is yes, then could you allocate the flush pages the
> same size as the cache and just write it twice? Wouldn't that accomplish
> the same goal and provide a performance improvement with (mostly) now
> present L1D entries of the flush pages? Also, can't you unroll this loop a
> bit and operate on multiple entries in each loop, reducing the overall
> looping compares and jumps as an optimization?
> 

Yes, I have some options suggested in my answer to your question in 1/6. For
now I decided to stick with the algorithm already present, but we could
definitely optimize it

Balbir

> Thanks,
> Tom
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re:  [PATCH v4 1/6] arch/x86/kvm: Refactor l1d flush lifecycle management
  2020-04-25  1:49     ` Singh, Balbir
@ 2020-05-01  3:48       ` Singh, Balbir
  2020-05-01 14:16         ` Tom Lendacky
  0 siblings, 1 reply; 16+ messages in thread
From: Singh, Balbir @ 2020-05-01  3:48 UTC (permalink / raw)
  To: thomas.lendacky, tglx, linux-kernel
  Cc: keescook, tony.luck, benh, jpoimboe, x86, dave.hansen

On Sat, 2020-04-25 at 11:49 +1000, Balbir Singh wrote:
> On Fri, 2020-04-24 at 13:59 -0500, Tom Lendacky wrote:
> > 
> > On 4/23/20 9:01 AM, Balbir Singh wrote:
> > > Split out the allocation and free routines to be used in a follow
> > > up set of patches (to reuse for L1D flushing).
> > > 
> > > Signed-off-by: Balbir Singh <sblbir@amazon.com>
> > > Reviewed-by: Kees Cook <keescook@chromium.org>
> > > ---
> > >   arch/x86/include/asm/cacheflush.h |  3 +++
> > >   arch/x86/kernel/Makefile          |  1 +
> > >   arch/x86/kernel/l1d_flush.c       | 36 +++++++++++++++++++++++++++++++
> > >   arch/x86/kvm/vmx/vmx.c            | 25 +++------------------
> > >   4 files changed, 43 insertions(+), 22 deletions(-)
> > >   create mode 100644 arch/x86/kernel/l1d_flush.c
> > > 
> > > diff --git a/arch/x86/include/asm/cacheflush.h
> > > b/arch/x86/include/asm/cacheflush.h
> > > index 63feaf2a5f93..bac56fcd9790 100644
> > > --- a/arch/x86/include/asm/cacheflush.h
> > > +++ b/arch/x86/include/asm/cacheflush.h
> > > @@ -6,6 +6,9 @@
> > >   #include <asm-generic/cacheflush.h>
> > >   #include <asm/special_insns.h>
> > > 
> > > +#define L1D_CACHE_ORDER 4
> > 
> > Since this is becoming a generic function now, shouldn't this value be
> > based on the actual L1D cache size? Is this value based on a 32KB data
> > cache and the idea is to write twice the size of the cache to be sure that
> > every entry has been replaced - with the second 32KB to catch the odd line
> > that might not have been pulled in?
> > 
> 
> Currently the only users are VMX L1TF and optional prctl(). It should be
> based
> on actual L1D cache size, I checked a little bit and the largest L1D cache
> size across various x86 bits is 64K. so there are three options here:
> 
> 1. We refactor the code, we would need to save the L1D cache size and use
> cpu_dev callbacks for L1D flush
> 2. We can make the current code depend on L1D_FLUSH MSR and enable it only
> when that feature is available. There would be no software fallback. Then
> follow it up with #1
> 3. We keep the code as is on the assumption that all of L1D <= 64K across
> the
> current platforms and we do #1 in a followup (since the prctl is optional
> and
> the only other user is the VMX code).
> 
> Thanks for the review,
> Balbir Singh.
> 

Tom,

I have the following changes that I think will suffice for now (these are not
properly formatted, but you get the idea)

diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
index a754b6c288a9..7fec0cc8f85c 100644
--- a/arch/x86/kernel/l1d_flush.c
+++ b/arch/x86/kernel/l1d_flush.c
@@ -92,6 +92,9 @@ int l1d_flush_init_once(void)
 {
        int ret = 0;
 
+       if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+               return -ENOTSUPP;
+
        if (static_cpu_has(X86_FEATURE_FLUSH_L1D) || l1d_flush_pages)
                return ret;


Does that satisfy your comments about patch 1/6 and 2/6 in the series?

Balbir Singh.

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 1/6] arch/x86/kvm: Refactor l1d flush lifecycle management
  2020-05-01  3:48       ` Singh, Balbir
@ 2020-05-01 14:16         ` Tom Lendacky
  0 siblings, 0 replies; 16+ messages in thread
From: Tom Lendacky @ 2020-05-01 14:16 UTC (permalink / raw)
  To: Singh, Balbir, tglx, linux-kernel
  Cc: keescook, tony.luck, benh, jpoimboe, x86, dave.hansen

On 4/30/20 10:48 PM, Singh, Balbir wrote:
> On Sat, 2020-04-25 at 11:49 +1000, Balbir Singh wrote:
>> On Fri, 2020-04-24 at 13:59 -0500, Tom Lendacky wrote:
>>>
>>> On 4/23/20 9:01 AM, Balbir Singh wrote:
>>>> Split out the allocation and free routines to be used in a follow
>>>> up set of patches (to reuse for L1D flushing).
>>>>
>>>> Signed-off-by: Balbir Singh <sblbir@amazon.com>
>>>> Reviewed-by: Kees Cook <keescook@chromium.org>
>>>> ---
>>>>    arch/x86/include/asm/cacheflush.h |  3 +++
>>>>    arch/x86/kernel/Makefile          |  1 +
>>>>    arch/x86/kernel/l1d_flush.c       | 36 +++++++++++++++++++++++++++++++
>>>>    arch/x86/kvm/vmx/vmx.c            | 25 +++------------------
>>>>    4 files changed, 43 insertions(+), 22 deletions(-)
>>>>    create mode 100644 arch/x86/kernel/l1d_flush.c
>>>>
>>>> diff --git a/arch/x86/include/asm/cacheflush.h
>>>> b/arch/x86/include/asm/cacheflush.h
>>>> index 63feaf2a5f93..bac56fcd9790 100644
>>>> --- a/arch/x86/include/asm/cacheflush.h
>>>> +++ b/arch/x86/include/asm/cacheflush.h
>>>> @@ -6,6 +6,9 @@
>>>>    #include <asm-generic/cacheflush.h>
>>>>    #include <asm/special_insns.h>
>>>>
>>>> +#define L1D_CACHE_ORDER 4
>>>
>>> Since this is becoming a generic function now, shouldn't this value be
>>> based on the actual L1D cache size? Is this value based on a 32KB data
>>> cache and the idea is to write twice the size of the cache to be sure that
>>> every entry has been replaced - with the second 32KB to catch the odd line
>>> that might not have been pulled in?
>>>
>>
>> Currently the only users are VMX L1TF and optional prctl(). It should be
>> based
>> on actual L1D cache size, I checked a little bit and the largest L1D cache
>> size across various x86 bits is 64K. so there are three options here:
>>
>> 1. We refactor the code, we would need to save the L1D cache size and use
>> cpu_dev callbacks for L1D flush
>> 2. We can make the current code depend on L1D_FLUSH MSR and enable it only
>> when that feature is available. There would be no software fallback. Then
>> follow it up with #1
>> 3. We keep the code as is on the assumption that all of L1D <= 64K across
>> the
>> current platforms and we do #1 in a followup (since the prctl is optional
>> and
>> the only other user is the VMX code).
>>
>> Thanks for the review,
>> Balbir Singh.
>>
> 
> Tom,
> 
> I have the following changes that I think will suffice for now (these are not
> properly formatted, but you get the idea)
> 
> diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
> index a754b6c288a9..7fec0cc8f85c 100644
> --- a/arch/x86/kernel/l1d_flush.c
> +++ b/arch/x86/kernel/l1d_flush.c
> @@ -92,6 +92,9 @@ int l1d_flush_init_once(void)
>   {
>          int ret = 0;
>   
> +       if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> +               return -ENOTSUPP;
> +
>          if (static_cpu_has(X86_FEATURE_FLUSH_L1D) || l1d_flush_pages)
>                  return ret;
> 
> 
> Does that satisfy your comments about patch 1/6 and 2/6 in the series?

Yes, that works.

Thanks,
Tom

> 
> Balbir Singh.
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-05-01 14:16 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-23 14:01 [PATCH v4 0/6] Optionally flush L1D on context switch Balbir Singh
2020-04-23 14:01 ` [PATCH v4 1/6] arch/x86/kvm: Refactor l1d flush lifecycle management Balbir Singh
2020-04-24 18:59   ` Tom Lendacky
2020-04-25  1:49     ` Singh, Balbir
2020-05-01  3:48       ` Singh, Balbir
2020-05-01 14:16         ` Tom Lendacky
2020-04-23 14:01 ` [PATCH v4 2/6] arch/x86/kvm: Refactor tlbflush and l1d flush Balbir Singh
2020-04-24 19:04   ` Tom Lendacky
2020-04-25  2:00     ` Singh, Balbir
2020-04-23 14:01 ` [PATCH v4 3/6] arch/x86/mm: Refactor cond_ibpb() to support other use cases Balbir Singh
2020-04-23 14:01 ` [PATCH v4 4/6] arch/x86/kvm: Refactor L1D flushing Balbir Singh
2020-04-23 14:01 ` [PATCH v4 5/6] Optionally flush L1D on context switch Balbir Singh
2020-04-23 19:19   ` Kees Cook
2020-04-24  9:56     ` Singh, Balbir
2020-04-23 14:01 ` [PATCH v4 6/6] Documentation: Add L1D flushing Documentation Balbir Singh
2020-04-23 19:20   ` Kees Cook

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).