[PATCH v2 0/4] arch/x86: Optionally flush L1D on context switch


* [PATCH v2 0/4] arch/x86: Optionally flush L1D on context switch
@ 2020-04-06  3:19 Balbir Singh
  2020-04-06  3:19 ` [PATCH v2 1/4] arch/x86/kvm: Refactor l1d flush lifecycle management Balbir Singh
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Balbir Singh @ 2020-04-06  3:19 UTC (permalink / raw)
  To: tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen, Balbir Singh

Provide a mechanism to flush the L1D cache on context switch.  The goal
is to allow tasks that are paranoid due to the recent snoop assisted data
sampling vulnerabilities to flush their L1D on being switched out.
This protects their data from being snooped or leaked via side channels
after the task has context switched out.

The core of the series is patch 3; the first two patches refactor the
code so that common bits can be reused.

Changelog v2:
 - Fix a missing mutex_unlock (caught by Borislav Petkov <bp@alien8.de>)
 - Add documentation about the changes (Josh Poimboeuf
   <jpoimboe@redhat.com>)

Changelog:
 - Refactor the code and reuse cond_ibpb() - code bits provided by tglx
 - Merge mm state tracking for ibpb and l1d flush
 - Rename TIF_L1D_FLUSH to TIF_SPEC_FLUSH_L1D

Changelog RFC:
 - Reuse existing code for allocation and flush
 - Simplify the goto logic in the actual l1d_flush function
 - Optimize the code path with jump labels/static functions

The previous version of this patch was posted at:

https://lore.kernel.org/lkml/20200402062401.29856-1-sblbir@amazon.com/

Balbir Singh (4):
  arch/x86/kvm: Refactor l1d flush lifecycle management
  arch/x86: Refactor tlbflush and l1d flush
  arch/x86: Optionally flush L1D on context switch
  arch/x86: Add L1D flushing Documentation

 Documentation/admin-guide/hw-vuln/index.rst   |  1 +
 .../admin-guide/hw-vuln/l1d_flush.rst         | 40 ++++++++
 arch/x86/include/asm/cacheflush.h             |  6 ++
 arch/x86/include/asm/thread_info.h            |  6 +-
 arch/x86/include/asm/tlbflush.h               |  2 +-
 arch/x86/include/uapi/asm/prctl.h             |  3 +
 arch/x86/kernel/Makefile                      |  1 +
 arch/x86/kernel/l1d_flush.c                   | 85 +++++++++++++++++
 arch/x86/kernel/process_64.c                  | 10 +-
 arch/x86/kvm/vmx/vmx.c                        | 56 ++---------
 arch/x86/mm/tlb.c                             | 94 ++++++++++++++-----
 11 files changed, 232 insertions(+), 72 deletions(-)
 create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst
 create mode 100644 arch/x86/kernel/l1d_flush.c

-- 
2.17.1



* [PATCH v2 1/4] arch/x86/kvm: Refactor l1d flush lifecycle management
  2020-04-06  3:19 [PATCH v2 0/4] arch/x86: Optionally flush L1D on context switch Balbir Singh
@ 2020-04-06  3:19 ` Balbir Singh
  2020-04-07 18:21   ` Kees Cook
  2020-04-06  3:19 ` [PATCH v2 2/4] arch/x86: Refactor tlbflush and l1d flush Balbir Singh
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Balbir Singh @ 2020-04-06  3:19 UTC (permalink / raw)
  To: tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen, Balbir Singh

Split out the allocation and free routines to be used in a follow
up set of patches (to reuse for L1D flushing).
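
As a rough user-space illustration of the split described above (not the kernel code itself), the sketch below allocates 2^4 pages and fills each page with a different byte, which is what the new helper does to keep KSM from merging the pages. The names `sketch_alloc_flush_pages`, `sketch_cleanup_flush_pages`, and the `SKETCH_*` constants are hypothetical, a 4 KiB page size is assumed, and `aligned_alloc()`/`free()` stand in for `alloc_pages()`/`free_pages()`.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE	4096u	/* assumed page size */
#define SKETCH_L1D_CACHE_ORDER	4	/* mirrors L1D_CACHE_ORDER in the patch */

/* Allocate 2^order pages and give every page a distinct fill byte. */
static void *sketch_alloc_flush_pages(void)
{
	size_t npages = (size_t)1 << SKETCH_L1D_CACHE_ORDER;
	unsigned char *buf;
	size_t i;

	/* aligned_alloc() stands in for alloc_pages(); size is a multiple of the alignment. */
	buf = aligned_alloc(SKETCH_PAGE_SIZE, npages * SKETCH_PAGE_SIZE);
	if (!buf)
		return NULL;

	/* Page i is filled with the byte value i + 1, as in the patch. */
	for (i = 0; i < npages; i++)
		memset(buf + i * SKETCH_PAGE_SIZE, (int)(i + 1), SKETCH_PAGE_SIZE);

	return buf;
}

/* Counterpart of cleanup_l1d_flush_pages(): release the buffer. */
static void sketch_cleanup_flush_pages(void *pages)
{
	free(pages);
}
```

Keeping allocation and release in a matched pair lets a second caller, here the context-switch flush added later in the series, reuse the buffer setup without copying the KVM code.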

Signed-off-by: Balbir Singh <sblbir@amazon.com>
---
 arch/x86/include/asm/cacheflush.h |  3 +++
 arch/x86/kernel/Makefile          |  1 +
 arch/x86/kernel/l1d_flush.c       | 36 +++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c            | 25 +++------------------
 4 files changed, 43 insertions(+), 22 deletions(-)
 create mode 100644 arch/x86/kernel/l1d_flush.c

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index 63feaf2a5f93..6419a4cef0e8 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -6,6 +6,9 @@
 #include <asm-generic/cacheflush.h>
 #include <asm/special_insns.h>
 
+#define L1D_CACHE_ORDER 4
 void clflush_cache_range(void *addr, unsigned int size);
+void *alloc_l1d_flush_pages(void);
+void cleanup_l1d_flush_pages(void *l1d_flush_pages);
 
 #endif /* _ASM_X86_CACHEFLUSH_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index d6d61c4455fa..48f443e6c2de 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -160,3 +160,4 @@ ifeq ($(CONFIG_X86_64),y)
 endif
 
 obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT)	+= ima_arch.o
+obj-y						+= l1d_flush.o
diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
new file mode 100644
index 000000000000..05f375c33423
--- /dev/null
+++ b/arch/x86/kernel/l1d_flush.c
@@ -0,0 +1,36 @@
+#include <linux/mm.h>
+#include <asm/cacheflush.h>
+
+void *alloc_l1d_flush_pages(void)
+{
+	struct page *page;
+	void *l1d_flush_pages = NULL;
+	int i;
+
+	/*
+	 * This allocation for l1d_flush_pages is not tied to a VM/task's
+	 * lifetime and so should not be charged to a memcg.
+	 */
+	page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
+	if (!page)
+		return NULL;
+	l1d_flush_pages = page_address(page);
+
+	/*
+	 * Initialize each page with a different pattern in
+	 * order to protect against KSM in the nested
+	 * virtualization case.
+	 */
+	for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
+		memset(l1d_flush_pages + i * PAGE_SIZE, i + 1,
+				PAGE_SIZE);
+	}
+	return l1d_flush_pages;
+}
+EXPORT_SYMBOL_GPL(alloc_l1d_flush_pages);
+
+void cleanup_l1d_flush_pages(void *l1d_flush_pages)
+{
+	free_pages((unsigned long)l1d_flush_pages, L1D_CACHE_ORDER);
+}
+EXPORT_SYMBOL_GPL(cleanup_l1d_flush_pages);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9eaccf92d616..209e63798435 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -203,14 +203,10 @@ static const struct {
 	[VMENTER_L1D_FLUSH_NOT_REQUIRED] = {"not required", false},
 };
 
-#define L1D_CACHE_ORDER 4
 static void *vmx_l1d_flush_pages;
 
 static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
 {
-	struct page *page;
-	unsigned int i;
-
 	if (!boot_cpu_has_bug(X86_BUG_L1TF)) {
 		l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED;
 		return 0;
@@ -253,24 +249,9 @@ static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
 
 	if (l1tf != VMENTER_L1D_FLUSH_NEVER && !vmx_l1d_flush_pages &&
 	    !boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {
-		/*
-		 * This allocation for vmx_l1d_flush_pages is not tied to a VM
-		 * lifetime and so should not be charged to a memcg.
-		 */
-		page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
-		if (!page)
+		vmx_l1d_flush_pages = alloc_l1d_flush_pages();
+		if (!vmx_l1d_flush_pages)
 			return -ENOMEM;
-		vmx_l1d_flush_pages = page_address(page);
-
-		/*
-		 * Initialize each page with a different pattern in
-		 * order to protect against KSM in the nested
-		 * virtualization case.
-		 */
-		for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
-			memset(vmx_l1d_flush_pages + i * PAGE_SIZE, i + 1,
-			       PAGE_SIZE);
-		}
 	}
 
 	l1tf_vmx_mitigation = l1tf;
@@ -7992,7 +7973,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 static void vmx_cleanup_l1d_flush(void)
 {
 	if (vmx_l1d_flush_pages) {
-		free_pages((unsigned long)vmx_l1d_flush_pages, L1D_CACHE_ORDER);
+		cleanup_l1d_flush_pages(vmx_l1d_flush_pages);
 		vmx_l1d_flush_pages = NULL;
 	}
 	/* Restore state so sysfs ignores VMX */
-- 
2.17.1



* [PATCH v2 2/4] arch/x86: Refactor tlbflush and l1d flush
  2020-04-06  3:19 [PATCH v2 0/4] arch/x86: Optionally flush L1D on context switch Balbir Singh
  2020-04-06  3:19 ` [PATCH v2 1/4] arch/x86/kvm: Refactor l1d flush lifecycle management Balbir Singh
@ 2020-04-06  3:19 ` Balbir Singh
  2020-04-07 18:25   ` Kees Cook
  2020-04-06  3:19 ` [PATCH v2 3/4] arch/x86: Optionally flush L1D on context switch Balbir Singh
  2020-04-06  3:19 ` [PATCH v2 4/4] arch/x86: Add L1D flushing Documentation Balbir Singh
  3 siblings, 1 reply; 17+ messages in thread
From: Balbir Singh @ 2020-04-06  3:19 UTC (permalink / raw)
  To: tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen, Balbir Singh

Refactor the existing assembly bits into smaller helper functions
and also abstract L1D_FLUSH into a helper function. Use these
functions in kvm for L1D flushing.
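
The two helpers factored out here correspond to two passes over the flush buffer: one read per page, then one read per cache line. The portable C sketch below shows only that access pattern. It assumes a 4 KiB page and a 64-byte line, uses hypothetical `sketch_*` names, and omits the `cpuid` serialization and `lfence` that the real inline assembly relies on, so it should not be taken as an effective flush.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE	4096u	/* assumed page size */
#define SKETCH_LINE_SIZE	64u	/* assumed cache line size */

/* Pass 1: touch one byte per page so every page's translation is loaded. */
static size_t sketch_populate_tlb(const volatile unsigned char *buf, size_t size)
{
	size_t off, touched = 0;

	for (off = 0; off < size; off += SKETCH_PAGE_SIZE) {
		(void)buf[off];		/* volatile read, not optimized away */
		touched++;
	}
	return touched;
}

/* Pass 2: read one byte per cache line to displace earlier L1D contents. */
static size_t sketch_fill_cache(const volatile unsigned char *buf, size_t size)
{
	size_t off, touched = 0;

	for (off = 0; off < size; off += SKETCH_LINE_SIZE) {
		(void)buf[off];
		touched++;
	}
	return touched;
}
```

For a 64 KiB buffer (order 4), the first pass performs 16 reads and the second 1024. The return values exist only so the sketch can be checked; the kernel helpers return nothing.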

Signed-off-by: Balbir Singh <sblbir@amazon.com>
---
 arch/x86/include/asm/cacheflush.h |  3 ++
 arch/x86/kernel/l1d_flush.c       | 49 +++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c            | 31 ++++---------------
 3 files changed, 57 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index 6419a4cef0e8..66a46db7aadd 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -10,5 +10,8 @@
 void clflush_cache_range(void *addr, unsigned int size);
 void *alloc_l1d_flush_pages(void);
 void cleanup_l1d_flush_pages(void *l1d_flush_pages);
+void populate_tlb_with_flush_pages(void *l1d_flush_pages);
+void flush_l1d_cache_sw(void *l1d_flush_pages);
+int flush_l1d_cache_hw(void);
 
 #endif /* _ASM_X86_CACHEFLUSH_H */
diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
index 05f375c33423..60499f773046 100644
--- a/arch/x86/kernel/l1d_flush.c
+++ b/arch/x86/kernel/l1d_flush.c
@@ -34,3 +34,52 @@ void cleanup_l1d_flush_pages(void *l1d_flush_pages)
 	free_pages((unsigned long)l1d_flush_pages, L1D_CACHE_ORDER);
 }
 EXPORT_SYMBOL_GPL(cleanup_l1d_flush_pages);
+
+void populate_tlb_with_flush_pages(void *l1d_flush_pages)
+{
+	int size = PAGE_SIZE << L1D_CACHE_ORDER;
+
+	asm volatile(
+		/* First ensure the pages are in the TLB */
+		"xorl	%%eax, %%eax\n"
+		".Lpopulate_tlb:\n\t"
+		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
+		"addl	$4096, %%eax\n\t"
+		"cmpl	%%eax, %[size]\n\t"
+		"jne	.Lpopulate_tlb\n\t"
+		"xorl	%%eax, %%eax\n\t"
+		"cpuid\n\t"
+		:: [flush_pages] "r" (l1d_flush_pages),
+		    [size] "r" (size)
+		: "eax", "ebx", "ecx", "edx");
+}
+EXPORT_SYMBOL_GPL(populate_tlb_with_flush_pages);
+
+int flush_l1d_cache_hw(void)
+{
+	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+		return 1;
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(flush_l1d_cache_hw);
+
+void flush_l1d_cache_sw(void *l1d_flush_pages)
+{
+	int size = PAGE_SIZE << L1D_CACHE_ORDER;
+
+	asm volatile(
+			/* Fill the cache */
+			"xorl	%%eax, %%eax\n"
+			".Lfill_cache:\n"
+			"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
+			"addl	$64, %%eax\n\t"
+			"cmpl	%%eax, %[size]\n\t"
+			"jne	.Lfill_cache\n\t"
+			"lfence\n"
+			:: [flush_pages] "r" (l1d_flush_pages),
+			[size] "r" (size)
+			: "eax", "ecx");
+}
+EXPORT_SYMBOL_GPL(flush_l1d_cache_sw);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 209e63798435..29dc5a5bb6ab 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5956,8 +5956,6 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu,
  */
 static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 {
-	int size = PAGE_SIZE << L1D_CACHE_ORDER;
-
 	/*
 	 * This code is only executed when the the flush mode is 'cond' or
 	 * 'always'
@@ -5986,32 +5984,13 @@ static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 
 	vcpu->stat.l1d_flush++;
 
-	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
-		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+	if (flush_l1d_cache_hw())
 		return;
-	}
 
-	asm volatile(
-		/* First ensure the pages are in the TLB */
-		"xorl	%%eax, %%eax\n"
-		".Lpopulate_tlb:\n\t"
-		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
-		"addl	$4096, %%eax\n\t"
-		"cmpl	%%eax, %[size]\n\t"
-		"jne	.Lpopulate_tlb\n\t"
-		"xorl	%%eax, %%eax\n\t"
-		"cpuid\n\t"
-		/* Now fill the cache */
-		"xorl	%%eax, %%eax\n"
-		".Lfill_cache:\n"
-		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
-		"addl	$64, %%eax\n\t"
-		"cmpl	%%eax, %[size]\n\t"
-		"jne	.Lfill_cache\n\t"
-		"lfence\n"
-		:: [flush_pages] "r" (vmx_l1d_flush_pages),
-		    [size] "r" (size)
-		: "eax", "ebx", "ecx", "edx");
+	preempt_disable();
+	populate_tlb_with_flush_pages(vmx_l1d_flush_pages);
+	flush_l1d_cache_sw(vmx_l1d_flush_pages);
+	preempt_enable();
 }
 
 static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
-- 
2.17.1



* [PATCH v2 3/4] arch/x86: Optionally flush L1D on context switch
  2020-04-06  3:19 [PATCH v2 0/4] arch/x86: Optionally flush L1D on context switch Balbir Singh
  2020-04-06  3:19 ` [PATCH v2 1/4] arch/x86/kvm: Refactor l1d flush lifecycle management Balbir Singh
  2020-04-06  3:19 ` [PATCH v2 2/4] arch/x86: Refactor tlbflush and l1d flush Balbir Singh
@ 2020-04-06  3:19 ` Balbir Singh
  2020-04-07 18:26   ` Kees Cook
  2020-04-07 23:52   ` Thomas Gleixner
  2020-04-06  3:19 ` [PATCH v2 4/4] arch/x86: Add L1D flushing Documentation Balbir Singh
  3 siblings, 2 replies; 17+ messages in thread
From: Balbir Singh @ 2020-04-06  3:19 UTC (permalink / raw)
  To: tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen, Balbir Singh

Implement a mechanism to selectively flush the L1D cache. The goal is to
allow tasks that are paranoid due to the recent snoop assisted data sampling
vulnerabilities to flush their L1D on being switched out.  This protects
their data from being snooped or leaked via side channels after the task
has context switched out.

There are two scenarios we might want to protect against. The first is
a task leaving the CPU with data still in L1D, which is the main concern
of this patch. The second is a malicious (not so well trusted) task
coming in, for which we want to clean up the cache before it starts.
Only the former is addressed.

Add arch-specific prctl()s to opt in to flushing the L1D cache on context
switch out. The existing mechanism of tracking prev_mm via cpu_tlbstate
is reused, and cond_ibpb() is refactored and renamed to cond_mitigation().

A new thread_info flag TIF_SPEC_FLUSH_L1D is added to track tasks which
opt in to L1D flushing. cpu_tlbstate.last_user_mm_ibpb is renamed to
cpu_tlbstate.last_user_mm_spec. cond_mitigation() converts the TIF flags
into mm state (stored per CPU in last_user_mm_spec), which is then used
to decide when to call flush_l1d().
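
The pointer mangling can be sketched in plain C as follows. This is an illustration under assumptions rather than the patch's code: the `sketch_*` names are hypothetical, the mm pointer is assumed to be at least 4-byte aligned so its two low bits are free, and the flags are passed as booleans instead of being shifted out of the thread_info flags word.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same bit assignments as LAST_USER_MM_IBPB / LAST_USER_MM_FLUSH_L1D. */
#define SKETCH_MM_IBPB		0x1UL
#define SKETCH_MM_FLUSH_L1D	0x2UL
#define SKETCH_MM_SPEC_MASK	(SKETCH_MM_IBPB | SKETCH_MM_FLUSH_L1D)

/* Fold two per-task flags into the low bits of an aligned mm pointer. */
static uintptr_t sketch_mangle(const void *mm, bool spec_ib, bool flush_l1d)
{
	uintptr_t bits = 0;

	if (spec_ib)
		bits |= SKETCH_MM_IBPB;
	if (flush_l1d)
		bits |= SKETCH_MM_FLUSH_L1D;

	return (uintptr_t)mm | bits;
}

/* The value stored at the previous switch describes the outgoing task. */
static bool sketch_needs_l1d_flush(uintptr_t prev_mm_spec)
{
	return (prev_mm_spec & SKETCH_MM_FLUSH_L1D) != 0;
}

/* Strip the flag bits to recover the plain mm pointer for comparisons. */
static const void *sketch_mm_pointer(uintptr_t mm_spec)
{
	return (const void *)(mm_spec & ~(uintptr_t)SKETCH_MM_SPEC_MASK);
}
```

Because the flush decision looks only at the stored previous value, the flush happens when the opted-in task leaves the CPU, which matches the first scenario described above.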

The current version benefited from discussions with Kees and Thomas.
Thomas suggested and provided the code snippet for refactoring the
existing cond_ibpb() code.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
---
 arch/x86/include/asm/thread_info.h |  6 +-
 arch/x86/include/asm/tlbflush.h    |  2 +-
 arch/x86/include/uapi/asm/prctl.h  |  3 +
 arch/x86/kernel/process_64.c       | 10 +++-
 arch/x86/mm/tlb.c                  | 94 +++++++++++++++++++++++-------
 5 files changed, 91 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8de8ceccb8bc..5cb250872643 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,7 +84,7 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
-#define TIF_SPEC_FORCE_UPDATE	10	/* Force speculation MSR update in context switch */
+#define TIF_SPEC_FLUSH_L1D	10	/* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
 #define TIF_PATCH_PENDING	13	/* pending live patching update */
@@ -96,6 +96,7 @@ struct thread_info {
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
+#define TIF_SPEC_FORCE_UPDATE	23	/* Force speculation MSR update in context switch */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
@@ -132,6 +133,7 @@ struct thread_info {
 #define _TIF_ADDR32		(1 << TIF_ADDR32)
 #define _TIF_X32		(1 << TIF_X32)
 #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
+#define _TIF_SPEC_FLUSH_L1D	(1 << TIF_SPEC_FLUSH_L1D)
 
 /* Work to do before invoking the actual syscall. */
 #define _TIF_WORK_SYSCALL_ENTRY	\
@@ -239,6 +241,8 @@ extern void arch_task_cache_init(void);
 extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
 extern void arch_release_task_struct(struct task_struct *tsk);
 extern void arch_setup_new_exec(void);
+extern int enable_l1d_flush_for_task(struct task_struct *tsk);
+extern int disable_l1d_flush_for_task(struct task_struct *tsk);
 #define arch_setup_new_exec arch_setup_new_exec
 #endif	/* !__ASSEMBLY__ */
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6f66d841262d..69e6ea20679c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -172,7 +172,7 @@ struct tlb_state {
 	/* Last user mm for optimizing IBPB */
 	union {
 		struct mm_struct	*last_user_mm;
-		unsigned long		last_user_mm_ibpb;
+		unsigned long		last_user_mm_spec;
 	};
 
 	u16 loaded_mm_asid;
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9fa41f..1361e5e25791 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -14,4 +14,7 @@
 #define ARCH_MAP_VDSO_32	0x2002
 #define ARCH_MAP_VDSO_64	0x2003
 
+#define ARCH_SET_L1D_FLUSH	0x3001
+#define ARCH_GET_L1D_FLUSH	0x3002
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 5ef9d8f25b0e..ecf542f13572 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -699,7 +699,15 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_MAP_VDSO_64:
 		return prctl_map_vdso(&vdso_image_64, arg2);
 #endif
-
+	case ARCH_GET_L1D_FLUSH:
+		return test_ti_thread_flag(&task->thread_info, TIF_SPEC_FLUSH_L1D);
+	case ARCH_SET_L1D_FLUSH: {
+		if (arg2 >= 1)
+			return enable_l1d_flush_for_task(task);
+		else
+			return disable_l1d_flush_for_task(task);
+		break;
+	}
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 66f96f21a7b6..71ef9fb941b8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -13,6 +13,7 @@
 #include <asm/mmu_context.h>
 #include <asm/nospec-branch.h>
 #include <asm/cache.h>
+#include <asm/cacheflush.h>
 #include <asm/apic.h>
 #include <asm/uv/uv.h>
 
@@ -33,10 +34,12 @@
  */
 
 /*
- * Use bit 0 to mangle the TIF_SPEC_IB state into the mm pointer which is
- * stored in cpu_tlb_state.last_user_mm_ibpb.
+ * Bits to mangle the TIF_SPEC_* state into the mm pointer which is
+ * stored in cpu_tlb_state.last_user_mm_spec.
  */
 #define LAST_USER_MM_IBPB	0x1UL
+#define LAST_USER_MM_FLUSH_L1D	0x2UL
+#define LAST_USER_MM_SPEC_MASK	(LAST_USER_MM_IBPB | LAST_USER_MM_FLUSH_L1D)
 
 /*
  * We get here when we do something requiring a TLB invalidation
@@ -151,6 +154,52 @@ void leave_mm(int cpu)
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
+static void *l1d_flush_pages;
+static DEFINE_MUTEX(l1d_flush_mutex);
+
+int enable_l1d_flush_for_task(struct task_struct *tsk)
+{
+	struct page *page;
+	int ret = 0;
+
+	if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
+		goto done;
+
+	page = READ_ONCE(l1d_flush_pages);
+	if (unlikely(!page)) {
+		mutex_lock(&l1d_flush_mutex);
+		if (!l1d_flush_pages) {
+			l1d_flush_pages = alloc_l1d_flush_pages();
+			if (!l1d_flush_pages) {
+				mutex_unlock(&l1d_flush_mutex);
+				return -ENOMEM;
+			}
+		}
+		mutex_unlock(&l1d_flush_mutex);
+	}
+	/* I don't think we need to worry about KSM */
+done:
+	set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_FLUSH_L1D);
+	return ret;
+}
+
+int disable_l1d_flush_for_task(struct task_struct *tsk)
+{
+	clear_ti_thread_flag(&tsk->thread_info, TIF_SPEC_FLUSH_L1D);
+	return 0;
+}
+
+/*
+ * Flush the L1D cache for this CPU. We want to do this at switch mm time,
+ * this is a pessimistic security measure and an opt-in for those tasks
+ * that host sensitive information.
+ */
+static void flush_l1d(void)
+{
+	if (!flush_l1d_cache_hw())
+		flush_l1d_cache_sw(l1d_flush_pages);
+}
+
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
 {
@@ -189,19 +238,26 @@ static void sync_current_stack_to_mm(struct mm_struct *mm)
 	}
 }
 
-static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
 {
 	unsigned long next_tif = task_thread_info(next)->flags;
-	unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
+	unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
+
+	BUILD_BUG_ON(TIF_SPEC_FLUSH_L1D != TIF_SPEC_IB + 1);
 
-	return (unsigned long)next->mm | ibpb;
+	return (unsigned long)next->mm | spec_bits;
 }
 
-static void cond_ibpb(struct task_struct *next)
+static void cond_mitigation(struct task_struct *next)
 {
+	unsigned long prev_mm, next_mm;
+
 	if (!next || !next->mm)
 		return;
 
+	next_mm = mm_mangle_tif_spec_bits(next);
+	prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_spec);
+
 	/*
 	 * Both, the conditional and the always IBPB mode use the mm
 	 * pointer to avoid the IBPB when switching between tasks of the
@@ -212,8 +268,6 @@ static void cond_ibpb(struct task_struct *next)
 	 * exposed data is not really interesting.
 	 */
 	if (static_branch_likely(&switch_mm_cond_ibpb)) {
-		unsigned long prev_mm, next_mm;
-
 		/*
 		 * This is a bit more complex than the always mode because
 		 * it has to handle two cases:
@@ -243,20 +297,14 @@ static void cond_ibpb(struct task_struct *next)
 		 * Optimize this with reasonably small overhead for the
 		 * above cases. Mangle the TIF_SPEC_IB bit into the mm
 		 * pointer of the incoming task which is stored in
-		 * cpu_tlbstate.last_user_mm_ibpb for comparison.
-		 */
-		next_mm = mm_mangle_tif_spec_ib(next);
-		prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_ibpb);
-
-		/*
+		 * cpu_tlbstate.last_user_mm_spec for comparison.
+		 *
 		 * Issue IBPB only if the mm's are different and one or
 		 * both have the IBPB bit set.
 		 */
 		if (next_mm != prev_mm &&
 		    (next_mm | prev_mm) & LAST_USER_MM_IBPB)
 			indirect_branch_prediction_barrier();
-
-		this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm);
 	}
 
 	if (static_branch_unlikely(&switch_mm_always_ibpb)) {
@@ -265,11 +313,15 @@ static void cond_ibpb(struct task_struct *next)
 		 * different context than the user space task which ran
 		 * last on this CPU.
 		 */
-		if (this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) {
+		if ((prev_mm & ~LAST_USER_MM_SPEC_MASK) !=
+					(unsigned long)next->mm)
 			indirect_branch_prediction_barrier();
-			this_cpu_write(cpu_tlbstate.last_user_mm, next->mm);
-		}
 	}
+
+	if (prev_mm & LAST_USER_MM_FLUSH_L1D)
+		flush_l1d();
+
+	this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
 }
 
 void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
@@ -375,7 +427,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * predictor when switching between processes. This stops
 		 * one process from doing Spectre-v2 attacks on another.
 		 */
-		cond_ibpb(tsk);
+		cond_mitigation(tsk);
 
 		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
 			/*
@@ -501,7 +553,7 @@ void initialize_tlbstate_and_flush(void)
 	write_cr3(build_cr3(mm->pgd, 0));
 
 	/* Reinitialize tlbstate. */
-	this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, LAST_USER_MM_IBPB);
+	this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_IBPB);
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
 	this_cpu_write(cpu_tlbstate.next_asid, 1);
 	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
-- 
2.17.1



* [PATCH v2 4/4] arch/x86: Add L1D flushing Documentation
  2020-04-06  3:19 [PATCH v2 0/4] arch/x86: Optionally flush L1D on context switch Balbir Singh
                   ` (2 preceding siblings ...)
  2020-04-06  3:19 ` [PATCH v2 3/4] arch/x86: Optionally flush L1D on context switch Balbir Singh
@ 2020-04-06  3:19 ` Balbir Singh
  2020-05-19 15:39   ` Randy Dunlap
  3 siblings, 1 reply; 17+ messages in thread
From: Balbir Singh @ 2020-04-06  3:19 UTC (permalink / raw)
  To: tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen, Balbir Singh

Add documentation for L1D flushing, explaining the need for the
feature and how it can be used.

Signed-off-by: Balbir Singh <sblbir@amazon.com>
---
 Documentation/admin-guide/hw-vuln/index.rst   |  1 +
 .../admin-guide/hw-vuln/l1d_flush.rst         | 40 +++++++++++++++++++
 2 files changed, 41 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst

diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index 0795e3c2643f..35633b299d45 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -14,3 +14,4 @@ are configurable at compile, boot or run time.
    mds
    tsx_async_abort
    multihit.rst
+   l1d_flush
diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
new file mode 100644
index 000000000000..73ee9e491a74
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
@@ -0,0 +1,40 @@
+L1D Flushing for the paranoid
+=============================
+
+With an increasing number of vulnerabilities being reported around data
+leaks from L1D, a new user space mechanism to flush the L1D cache on
+context switch is added to the kernel. This should help address
+CVE-2020-0550 and for paranoid applications, keep them safe from any
+yet to be discovered vulnerabilities, related to leaks from the L1D
+cache.
+
+Tasks can opt in to this mechanism by using an architecture specific
+prctl (x86 only at the moment).
+
+Related CVEs
+------------
+At the present moment, the following CVEs can be addressed by this
+mechanism
+
+    =============       ========================     ==================
+    CVE-2020-0550       Improper Data Forwarding     OS related aspects
+    =============       ========================     ==================
+
+Usage Guidelines
+----------------
+Applications can call ``arch_prctl(2)`` with one of these two arguments
+
+1. ARCH_SET_L1D_FLUSH - flush the L1D cache on context switch (out)
+2. ARCH_GET_L1D_FLUSH - get the current state of the L1D cache flush, returns 1
+   if set and 0 if not set.
+
+**NOTE**: The feature is disabled by default; applications need to specifically
+opt into the feature to enable it.
+
+Mitigation
+----------
+When ARCH_SET_L1D_FLUSH is enabled for a task, on switching tasks (when
+the address space changes), a flush of the L1D cache is performed for
+the task when it leaves the CPU. If the underlying CPU supports L1D
+flushing in hardware, the hardware mechanism is used; otherwise a software
+fallback, similar to the mechanism used by the L1TF mitigation, is used.
-- 
2.17.1
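
From user space, the interface documented in this patch might be exercised roughly as below. This is a sketch under assumptions: the two constants are copied from this series' uapi header and are not expected to exist in installed kernel headers, the `sketch_*` wrappers are hypothetical, and on a kernel without this series the calls are expected to fail with `EINVAL`.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from this series' arch/x86/include/uapi/asm/prctl.h (assumption). */
#ifndef ARCH_SET_L1D_FLUSH
#define ARCH_SET_L1D_FLUSH	0x3001
#endif
#ifndef ARCH_GET_L1D_FLUSH
#define ARCH_GET_L1D_FLUSH	0x3002
#endif

/* Opt the calling task in (enable != 0) or out. Returns 0, or -1 with errno set. */
static int sketch_set_l1d_flush(int enable)
{
#ifdef SYS_arch_prctl
	return (int)syscall(SYS_arch_prctl, ARCH_SET_L1D_FLUSH,
			    (unsigned long)(enable ? 1 : 0));
#else
	errno = ENOSYS;		/* arch_prctl is x86-only */
	return -1;
#endif
}

/* Query the current state: 1 if set, 0 if not, or -1 with errno set. */
static int sketch_get_l1d_flush(void)
{
#ifdef SYS_arch_prctl
	return (int)syscall(SYS_arch_prctl, ARCH_GET_L1D_FLUSH, 0UL);
#else
	errno = ENOSYS;
	return -1;
#endif
}
```

A caller would typically invoke `sketch_set_l1d_flush(1)` once at startup and treat a failure as "mitigation unavailable" rather than as a fatal error, since the feature is opt-in and depends on the running kernel.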



* Re: [PATCH v2 1/4] arch/x86/kvm: Refactor l1d flush lifecycle management
  2020-04-06  3:19 ` [PATCH v2 1/4] arch/x86/kvm: Refactor l1d flush lifecycle management Balbir Singh
@ 2020-04-07 18:21   ` Kees Cook
  0 siblings, 0 replies; 17+ messages in thread
From: Kees Cook @ 2020-04-07 18:21 UTC (permalink / raw)
  To: Balbir Singh
  Cc: tglx, linux-kernel, jpoimboe, tony.luck, benh, x86, dave.hansen

On Mon, Apr 06, 2020 at 01:19:43PM +1000, Balbir Singh wrote:
> Split out the allocation and free routines to be used in a follow
> up set of patches (to reuse for L1D flushing).
> 
> Signed-off-by: Balbir Singh <sblbir@amazon.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  arch/x86/include/asm/cacheflush.h |  3 +++
>  arch/x86/kernel/Makefile          |  1 +
>  arch/x86/kernel/l1d_flush.c       | 36 +++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/vmx.c            | 25 +++------------------
>  4 files changed, 43 insertions(+), 22 deletions(-)
>  create mode 100644 arch/x86/kernel/l1d_flush.c
> 
> diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
> index 63feaf2a5f93..6419a4cef0e8 100644
> --- a/arch/x86/include/asm/cacheflush.h
> +++ b/arch/x86/include/asm/cacheflush.h
> @@ -6,6 +6,9 @@
>  #include <asm-generic/cacheflush.h>
>  #include <asm/special_insns.h>
>  
> +#define L1D_CACHE_ORDER 4
>  void clflush_cache_range(void *addr, unsigned int size);
> +void *alloc_l1d_flush_pages(void);
> +void cleanup_l1d_flush_pages(void *l1d_flush_pages);
>  
>  #endif /* _ASM_X86_CACHEFLUSH_H */
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index d6d61c4455fa..48f443e6c2de 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -160,3 +160,4 @@ ifeq ($(CONFIG_X86_64),y)
>  endif
>  
>  obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT)	+= ima_arch.o
> +obj-y						+= l1d_flush.o
> diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
> new file mode 100644
> index 000000000000..05f375c33423
> --- /dev/null
> +++ b/arch/x86/kernel/l1d_flush.c
> @@ -0,0 +1,36 @@
> +#include <linux/mm.h>
> +#include <asm/cacheflush.h>
> +
> +void *alloc_l1d_flush_pages(void)
> +{
> +	struct page *page;
> +	void *l1d_flush_pages = NULL;
> +	int i;
> +
> +	/*
> +	 * This allocation for l1d_flush_pages is not tied to a VM/task's
> +	 * lifetime and so should not be charged to a memcg.
> +	 */
> +	page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
> +	if (!page)
> +		return NULL;
> +	l1d_flush_pages = page_address(page);
> +
> +	/*
> +	 * Initialize each page with a different pattern in
> +	 * order to protect against KSM in the nested
> +	 * virtualization case.
> +	 */
> +	for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
> +		memset(l1d_flush_pages + i * PAGE_SIZE, i + 1,
> +				PAGE_SIZE);
> +	}
> +	return l1d_flush_pages;
> +}
> +EXPORT_SYMBOL_GPL(alloc_l1d_flush_pages);
> +
> +void cleanup_l1d_flush_pages(void *l1d_flush_pages)
> +{
> +	free_pages((unsigned long)l1d_flush_pages, L1D_CACHE_ORDER);
> +}
> +EXPORT_SYMBOL_GPL(cleanup_l1d_flush_pages);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 9eaccf92d616..209e63798435 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -203,14 +203,10 @@ static const struct {
>  	[VMENTER_L1D_FLUSH_NOT_REQUIRED] = {"not required", false},
>  };
>  
> -#define L1D_CACHE_ORDER 4
>  static void *vmx_l1d_flush_pages;
>  
>  static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
>  {
> -	struct page *page;
> -	unsigned int i;
> -
>  	if (!boot_cpu_has_bug(X86_BUG_L1TF)) {
>  		l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED;
>  		return 0;
> @@ -253,24 +249,9 @@ static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
>  
>  	if (l1tf != VMENTER_L1D_FLUSH_NEVER && !vmx_l1d_flush_pages &&
>  	    !boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> -		/*
> -		 * This allocation for vmx_l1d_flush_pages is not tied to a VM
> -		 * lifetime and so should not be charged to a memcg.
> -		 */
> -		page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
> -		if (!page)
> +		vmx_l1d_flush_pages = alloc_l1d_flush_pages();
> +		if (!vmx_l1d_flush_pages)
>  			return -ENOMEM;
> -		vmx_l1d_flush_pages = page_address(page);
> -
> -		/*
> -		 * Initialize each page with a different pattern in
> -		 * order to protect against KSM in the nested
> -		 * virtualization case.
> -		 */
> -		for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
> -			memset(vmx_l1d_flush_pages + i * PAGE_SIZE, i + 1,
> -			       PAGE_SIZE);
> -		}
>  	}
>  
>  	l1tf_vmx_mitigation = l1tf;
> @@ -7992,7 +7973,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>  static void vmx_cleanup_l1d_flush(void)
>  {
>  	if (vmx_l1d_flush_pages) {
> -		free_pages((unsigned long)vmx_l1d_flush_pages, L1D_CACHE_ORDER);
> +		cleanup_l1d_flush_pages(vmx_l1d_flush_pages);
>  		vmx_l1d_flush_pages = NULL;
>  	}
>  	/* Restore state so sysfs ignores VMX */
> -- 
> 2.17.1
> 

-- 
Kees Cook


* Re: [PATCH v2 2/4] arch/x86: Refactor tlbflush and l1d flush
  2020-04-06  3:19 ` [PATCH v2 2/4] arch/x86: Refactor tlbflush and l1d flush Balbir Singh
@ 2020-04-07 18:25   ` Kees Cook
  2020-04-08  0:22     ` Singh, Balbir
  0 siblings, 1 reply; 17+ messages in thread
From: Kees Cook @ 2020-04-07 18:25 UTC (permalink / raw)
  To: Balbir Singh
  Cc: tglx, linux-kernel, jpoimboe, tony.luck, benh, x86, dave.hansen

On Mon, Apr 06, 2020 at 01:19:44PM +1000, Balbir Singh wrote:
> Refactor the existing assembly bits into smaller helper functions
> and also abstract L1D_FLUSH into a helper function. Use these
> functions in kvm for L1D flushing.
> 
> Signed-off-by: Balbir Singh <sblbir@amazon.com>
> ---
>  arch/x86/include/asm/cacheflush.h |  3 ++
>  arch/x86/kernel/l1d_flush.c       | 49 +++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/vmx.c            | 31 ++++---------------
>  3 files changed, 57 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
> index 6419a4cef0e8..66a46db7aadd 100644
> --- a/arch/x86/include/asm/cacheflush.h
> +++ b/arch/x86/include/asm/cacheflush.h
> @@ -10,5 +10,8 @@
>  void clflush_cache_range(void *addr, unsigned int size);
>  void *alloc_l1d_flush_pages(void);
>  void cleanup_l1d_flush_pages(void *l1d_flush_pages);
> +void populate_tlb_with_flush_pages(void *l1d_flush_pages);
> +void flush_l1d_cache_sw(void *l1d_flush_pages);
> +int flush_l1d_cache_hw(void);
>  
>  #endif /* _ASM_X86_CACHEFLUSH_H */
> diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
> index 05f375c33423..60499f773046 100644
> --- a/arch/x86/kernel/l1d_flush.c
> +++ b/arch/x86/kernel/l1d_flush.c
> @@ -34,3 +34,52 @@ void cleanup_l1d_flush_pages(void *l1d_flush_pages)
>  	free_pages((unsigned long)l1d_flush_pages, L1D_CACHE_ORDER);
>  }
>  EXPORT_SYMBOL_GPL(cleanup_l1d_flush_pages);
> +
> +void populate_tlb_with_flush_pages(void *l1d_flush_pages)
> +{
> +	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> +
> +	asm volatile(
> +		/* First ensure the pages are in the TLB */
> +		"xorl	%%eax, %%eax\n"
> +		".Lpopulate_tlb:\n\t"
> +		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> +		"addl	$4096, %%eax\n\t"
> +		"cmpl	%%eax, %[size]\n\t"
> +		"jne	.Lpopulate_tlb\n\t"
> +		"xorl	%%eax, %%eax\n\t"
> +		"cpuid\n\t"
> +		:: [flush_pages] "r" (l1d_flush_pages),
> +		    [size] "r" (size)
> +		: "eax", "ebx", "ecx", "edx");
> +}
> +EXPORT_SYMBOL_GPL(populate_tlb_with_flush_pages);
> +
> +int flush_l1d_cache_hw(void)
> +{
> +	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> +		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> +		return 1;
> +	}
> +	return 0;
> +}

This return value is backwards from the kernel's normal use of "int". I
would expect 0 to mean "success" and non-zero to mean "failure". How
about:

int flush_l1d_cache_hw(void)
{
     if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
             wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
             return 0;
     }
     return -ENOTSUPP;
}


> +EXPORT_SYMBOL_GPL(flush_l1d_cache_hw);
> +
> +void flush_l1d_cache_sw(void *l1d_flush_pages)
> +{
> +	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> +
> +	asm volatile(
> +			/* Fill the cache */
> +			"xorl	%%eax, %%eax\n"
> +			".Lfill_cache:\n"
> +			"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> +			"addl	$64, %%eax\n\t"
> +			"cmpl	%%eax, %[size]\n\t"
> +			"jne	.Lfill_cache\n\t"
> +			"lfence\n"
> +			:: [flush_pages] "r" (l1d_flush_pages),
> +			[size] "r" (size)
> +			: "eax", "ecx");
> +}
> +EXPORT_SYMBOL_GPL(flush_l1d_cache_sw);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 209e63798435..29dc5a5bb6ab 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5956,8 +5956,6 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu,
>   */
>  static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
>  {
> -	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> -
>  	/*
>  	 * This code is only executed when the the flush mode is 'cond' or
>  	 * 'always'
> @@ -5986,32 +5984,13 @@ static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
>  
>  	vcpu->stat.l1d_flush++;
>  
> -	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> -		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> +	if (flush_l1d_cache_hw())
>  		return;
> -	}

Then this becomes:

	if (flush_l1d_cache_hw() == 0)
		return;

(Or change it to a "bool", use true/false, and leave the above
call as-is.)
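
For illustration, the "bool" option could look roughly like the sketch
below. It is a userspace model, not kernel code: the model_* variables
are hypothetical stand-ins for static_cpu_has(X86_FEATURE_FLUSH_L1D),
the MSR write, and the software fallback, so that the control flow can
be compiled and checked outside the kernel.

```c
#include <stdbool.h>

/* Hypothetical stand-ins so the sketch builds outside the kernel. */
static bool model_cpu_has_flush_l1d;	/* models static_cpu_has(X86_FEATURE_FLUSH_L1D) */
static unsigned int model_hw_flushes;	/* counts modeled MSR writes */
static unsigned int model_sw_flushes;	/* counts modeled software fallbacks */

/* bool variant: true means the hardware flush was issued. */
static bool flush_l1d_cache_hw(void)
{
	if (model_cpu_has_flush_l1d) {
		/* kernel: wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH); */
		model_hw_flushes++;
		return true;
	}
	return false;
}

/* Caller shape taken from vmx_l1d_flush(): fall back only without hardware support. */
static void l1d_flush_model(void)
{
	if (flush_l1d_cache_hw())
		return;

	/* kernel: populate_tlb_with_flush_pages() + flush_l1d_cache_sw() */
	model_sw_flushes++;
}
```

With a bool return, the existing "if (flush_l1d_cache_hw()) return;"
call site reads naturally and needs no "== 0" comparison.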

Either way:

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

>  
> -	asm volatile(
> -		/* First ensure the pages are in the TLB */
> -		"xorl	%%eax, %%eax\n"
> -		".Lpopulate_tlb:\n\t"
> -		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> -		"addl	$4096, %%eax\n\t"
> -		"cmpl	%%eax, %[size]\n\t"
> -		"jne	.Lpopulate_tlb\n\t"
> -		"xorl	%%eax, %%eax\n\t"
> -		"cpuid\n\t"
> -		/* Now fill the cache */
> -		"xorl	%%eax, %%eax\n"
> -		".Lfill_cache:\n"
> -		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> -		"addl	$64, %%eax\n\t"
> -		"cmpl	%%eax, %[size]\n\t"
> -		"jne	.Lfill_cache\n\t"
> -		"lfence\n"
> -		:: [flush_pages] "r" (vmx_l1d_flush_pages),
> -		    [size] "r" (size)
> -		: "eax", "ebx", "ecx", "edx");
> +	preempt_disable();
> +	populate_tlb_with_flush_pages(vmx_l1d_flush_pages);
> +	flush_l1d_cache_sw(vmx_l1d_flush_pages);
> +	preempt_enable();
>  }
>  
>  static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
> -- 
> 2.17.1
> 


-- 
Kees Cook


* Re: [PATCH v2 3/4] arch/x86: Optionally flush L1D on context switch
  2020-04-06  3:19 ` [PATCH v2 3/4] arch/x86: Optionally flush L1D on context switch Balbir Singh
@ 2020-04-07 18:26   ` Kees Cook
  2020-04-07 23:37     ` Benjamin Herrenschmidt
                       ` (2 more replies)
  2020-04-07 23:52   ` Thomas Gleixner
  1 sibling, 3 replies; 17+ messages in thread
From: Kees Cook @ 2020-04-07 18:26 UTC (permalink / raw)
  To: Balbir Singh
  Cc: tglx, linux-kernel, jpoimboe, tony.luck, benh, x86, dave.hansen

On Mon, Apr 06, 2020 at 01:19:45PM +1000, Balbir Singh wrote:
> Implement a mechanism to selectively flush the L1D cache. The goal is to
> allow tasks that are paranoid due to the recent snoop assisted data sampling
> vulnerabilities, to flush their L1D on being switched out.  This protects
> their data from being snooped or leaked via side channels after the task
> has context switched out.
> 
> There are two scenarios we might want to protect against, a task leaving
> the CPU with data still in L1D (which is the main concern of this patch),
> the second scenario is a malicious task coming in (not so well trusted)
> for which we want to clean up the cache before it starts. Only the case
> for the former is addressed.
> 
> Add arch specific prctl()'s to opt-in to the L1D cache on context switch
> out, the existing mechanisms of tracking prev_mm via cpu_tlbstate is
> reused. cond_ibpb() is refactored and renamed into cond_mitigation().

I still think this should be a generic prctl(). If there is a strong
reason not to do this, can it be described in the commit log here?

-Kees

> 
> A new thread_info flag TIF_SPEC_FLUSH_L1D is added to track tasks which
> opt-into L1D flushing. cpu_tlbstate.last_user_mm_ibpb is renamed to
> cpu_tlbstate.last_user_mm_spec, this is used to convert the TIF flags
> into mm state (per cpu via last_user_mm_spec) in cond_mitigation(),
> which is then used to decide when to call flush_l1d().
> 
> The current version benefited from discussions with Kees and Thomas.
> Thomas suggested and provided the code snippet for refactoring the
> existing cond_ibpb() code.
> 
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Balbir Singh <sblbir@amazon.com>
> ---
>  arch/x86/include/asm/thread_info.h |  6 +-
>  arch/x86/include/asm/tlbflush.h    |  2 +-
>  arch/x86/include/uapi/asm/prctl.h  |  3 +
>  arch/x86/kernel/process_64.c       | 10 +++-
>  arch/x86/mm/tlb.c                  | 94 +++++++++++++++++++++++-------
>  5 files changed, 91 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 8de8ceccb8bc..5cb250872643 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -84,7 +84,7 @@ struct thread_info {
>  #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
>  #define TIF_SECCOMP		8	/* secure computing */
>  #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
> -#define TIF_SPEC_FORCE_UPDATE	10	/* Force speculation MSR update in context switch */
> +#define TIF_SPEC_FLUSH_L1D	10	/* Flush L1D on mm switches (processes) */
>  #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
>  #define TIF_UPROBE		12	/* breakpointed or singlestepping */
>  #define TIF_PATCH_PENDING	13	/* pending live patching update */
> @@ -96,6 +96,7 @@ struct thread_info {
>  #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
>  #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
>  #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
> +#define TIF_SPEC_FORCE_UPDATE	23	/* Force speculation MSR update in context switch */
>  #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
>  #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
>  #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
> @@ -132,6 +133,7 @@ struct thread_info {
>  #define _TIF_ADDR32		(1 << TIF_ADDR32)
>  #define _TIF_X32		(1 << TIF_X32)
>  #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
> +#define _TIF_SPEC_FLUSH_L1D	(1 << TIF_SPEC_FLUSH_L1D)
>  
>  /* Work to do before invoking the actual syscall. */
>  #define _TIF_WORK_SYSCALL_ENTRY	\
> @@ -239,6 +241,8 @@ extern void arch_task_cache_init(void);
>  extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
>  extern void arch_release_task_struct(struct task_struct *tsk);
>  extern void arch_setup_new_exec(void);
> +extern int enable_l1d_flush_for_task(struct task_struct *tsk);
> +extern int disable_l1d_flush_for_task(struct task_struct *tsk);
>  #define arch_setup_new_exec arch_setup_new_exec
>  #endif	/* !__ASSEMBLY__ */
>  
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 6f66d841262d..69e6ea20679c 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -172,7 +172,7 @@ struct tlb_state {
>  	/* Last user mm for optimizing IBPB */
>  	union {
>  		struct mm_struct	*last_user_mm;
> -		unsigned long		last_user_mm_ibpb;
> +		unsigned long		last_user_mm_spec;
>  	};
>  
>  	u16 loaded_mm_asid;
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 5a6aac9fa41f..1361e5e25791 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -14,4 +14,7 @@
>  #define ARCH_MAP_VDSO_32	0x2002
>  #define ARCH_MAP_VDSO_64	0x2003
>  
> +#define ARCH_SET_L1D_FLUSH	0x3001
> +#define ARCH_GET_L1D_FLUSH	0x3002
> +
>  #endif /* _ASM_X86_PRCTL_H */
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 5ef9d8f25b0e..ecf542f13572 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -699,7 +699,15 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
>  	case ARCH_MAP_VDSO_64:
>  		return prctl_map_vdso(&vdso_image_64, arg2);
>  #endif
> -
> +	case ARCH_GET_L1D_FLUSH:
> +		return test_ti_thread_flag(&task->thread_info, TIF_SPEC_FLUSH_L1D);
> +	case ARCH_SET_L1D_FLUSH: {
> +		if (arg2 >= 1)
> +			return enable_l1d_flush_for_task(task);
> +		else
> +			return disable_l1d_flush_for_task(task);
> +		break;
> +	}
>  	default:
>  		ret = -EINVAL;
>  		break;
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 66f96f21a7b6..71ef9fb941b8 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -13,6 +13,7 @@
>  #include <asm/mmu_context.h>
>  #include <asm/nospec-branch.h>
>  #include <asm/cache.h>
> +#include <asm/cacheflush.h>
>  #include <asm/apic.h>
>  #include <asm/uv/uv.h>
>  
> @@ -33,10 +34,12 @@
>   */
>  
>  /*
> - * Use bit 0 to mangle the TIF_SPEC_IB state into the mm pointer which is
> - * stored in cpu_tlb_state.last_user_mm_ibpb.
> + * Bits to mangle the TIF_SPEC_* state into the mm pointer which is
> + * stored in cpu_tlb_state.last_user_mm_spec.
>   */
>  #define LAST_USER_MM_IBPB	0x1UL
> +#define LAST_USER_MM_FLUSH_L1D	0x2UL
> +#define LAST_USER_MM_SPEC_MASK	(LAST_USER_MM_IBPB | LAST_USER_MM_FLUSH_L1D)
>  
>  /*
>   * We get here when we do something requiring a TLB invalidation
> @@ -151,6 +154,52 @@ void leave_mm(int cpu)
>  }
>  EXPORT_SYMBOL_GPL(leave_mm);
>  
> +static void *l1d_flush_pages;
> +static DEFINE_MUTEX(l1d_flush_mutex);
> +
> +int enable_l1d_flush_for_task(struct task_struct *tsk)
> +{
> +	struct page *page;
> +	int ret = 0;
> +
> +	if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> +		goto done;
> +
> +	page = READ_ONCE(l1d_flush_pages);
> +	if (unlikely(!page)) {
> +		mutex_lock(&l1d_flush_mutex);
> +		if (!l1d_flush_pages) {
> +			l1d_flush_pages = alloc_l1d_flush_pages();
> +			if (!l1d_flush_pages) {
> +				mutex_unlock(&l1d_flush_mutex);
> +				return -ENOMEM;
> +			}
> +		}
> +		mutex_unlock(&l1d_flush_mutex);
> +	}
> +	/* I don't think we need to worry about KSM */
> +done:
> +	set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_FLUSH_L1D);
> +	return ret;
> +}
> +
> +int disable_l1d_flush_for_task(struct task_struct *tsk)
> +{
> +	clear_ti_thread_flag(&tsk->thread_info, TIF_SPEC_FLUSH_L1D);
> +	return 0;
> +}
> +
> +/*
> + * Flush the L1D cache for this CPU. We want to this at switch mm time,
> + * this is a pessimistic security measure and an opt-in for those tasks
> + * that host sensitive information.
> + */
> +static void flush_l1d(void)
> +{
> +	if (!flush_l1d_cache_hw())
> +		flush_l1d_cache_sw(l1d_flush_pages);
> +}
> +
>  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	       struct task_struct *tsk)
>  {
> @@ -189,19 +238,26 @@ static void sync_current_stack_to_mm(struct mm_struct *mm)
>  	}
>  }
>  
> -static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
> +static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
>  {
>  	unsigned long next_tif = task_thread_info(next)->flags;
> -	unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
> +	unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
> +
> +	BUILD_BUG_ON(TIF_SPEC_FLUSH_L1D != TIF_SPEC_IB + 1);
>  
> -	return (unsigned long)next->mm | ibpb;
> +	return (unsigned long)next->mm | spec_bits;
>  }
>  
> -static void cond_ibpb(struct task_struct *next)
> +static void cond_mitigation(struct task_struct *next)
>  {
> +	unsigned long prev_mm, next_mm;
> +
>  	if (!next || !next->mm)
>  		return;
>  
> +	next_mm = mm_mangle_tif_spec_bits(next);
> +	prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_spec);
> +
>  	/*
>  	 * Both, the conditional and the always IBPB mode use the mm
>  	 * pointer to avoid the IBPB when switching between tasks of the
> @@ -212,8 +268,6 @@ static void cond_ibpb(struct task_struct *next)
>  	 * exposed data is not really interesting.
>  	 */
>  	if (static_branch_likely(&switch_mm_cond_ibpb)) {
> -		unsigned long prev_mm, next_mm;
> -
>  		/*
>  		 * This is a bit more complex than the always mode because
>  		 * it has to handle two cases:
> @@ -243,20 +297,14 @@ static void cond_ibpb(struct task_struct *next)
>  		 * Optimize this with reasonably small overhead for the
>  		 * above cases. Mangle the TIF_SPEC_IB bit into the mm
>  		 * pointer of the incoming task which is stored in
> -		 * cpu_tlbstate.last_user_mm_ibpb for comparison.
> -		 */
> -		next_mm = mm_mangle_tif_spec_ib(next);
> -		prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_ibpb);
> -
> -		/*
> +		 * cpu_tlbstate.last_user_mm_spec for comparison.
> +		 *
>  		 * Issue IBPB only if the mm's are different and one or
>  		 * both have the IBPB bit set.
>  		 */
>  		if (next_mm != prev_mm &&
>  		    (next_mm | prev_mm) & LAST_USER_MM_IBPB)
>  			indirect_branch_prediction_barrier();
> -
> -		this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm);
>  	}
>  
>  	if (static_branch_unlikely(&switch_mm_always_ibpb)) {
> @@ -265,11 +313,15 @@ static void cond_ibpb(struct task_struct *next)
>  		 * different context than the user space task which ran
>  		 * last on this CPU.
>  		 */
> -		if (this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) {
> +		if ((prev_mm & ~LAST_USER_MM_SPEC_MASK) !=
> +					(unsigned long)next->mm)
>  			indirect_branch_prediction_barrier();
> -			this_cpu_write(cpu_tlbstate.last_user_mm, next->mm);
> -		}
>  	}
> +
> +	if (prev_mm & LAST_USER_MM_FLUSH_L1D)
> +		flush_l1d();
> +
> +	this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
>  }
>  
>  void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> @@ -375,7 +427,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  		 * predictor when switching between processes. This stops
>  		 * one process from doing Spectre-v2 attacks on another.
>  		 */
> -		cond_ibpb(tsk);
> +		cond_mitigation(tsk);
>  
>  		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
>  			/*
> @@ -501,7 +553,7 @@ void initialize_tlbstate_and_flush(void)
>  	write_cr3(build_cr3(mm->pgd, 0));
>  
>  	/* Reinitialize tlbstate. */
> -	this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, LAST_USER_MM_IBPB);
> +	this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_IBPB);
>  	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
>  	this_cpu_write(cpu_tlbstate.next_asid, 1);
>  	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
> -- 
> 2.17.1
> 

-- 
Kees Cook


* Re: [PATCH v2 3/4] arch/x86: Optionally flush L1D on context switch
  2020-04-07 18:26   ` Kees Cook
@ 2020-04-07 23:37     ` Benjamin Herrenschmidt
  2020-04-07 23:39     ` Singh, Balbir
  2020-05-19 23:41     ` Singh, Balbir
  2 siblings, 0 replies; 17+ messages in thread
From: Benjamin Herrenschmidt @ 2020-04-07 23:37 UTC (permalink / raw)
  To: Kees Cook, Balbir Singh
  Cc: tglx, linux-kernel, jpoimboe, tony.luck, x86, dave.hansen

On Tue, 2020-04-07 at 11:26 -0700, Kees Cook wrote:
> > Add arch specific prctl()'s to opt-in to the L1D cache on context
> > switch
> > out, the existing mechanisms of tracking prev_mm via cpu_tlbstate
> > is
> > reused. cond_ibpb() is refactored and renamed into
> > cond_mitigation().
> 
> I still think this should be a generic prctl(). If there is a strong
> reason not to do this, can it be described in the commit log here?

Agreed.

Cheers,
Ben.




* Re: [PATCH v2 3/4] arch/x86: Optionally flush L1D on context switch
  2020-04-07 18:26   ` Kees Cook
  2020-04-07 23:37     ` Benjamin Herrenschmidt
@ 2020-04-07 23:39     ` Singh, Balbir
  2020-04-07 23:49       ` Thomas Gleixner
  2020-05-19 23:41     ` Singh, Balbir
  2 siblings, 1 reply; 17+ messages in thread
From: Singh, Balbir @ 2020-04-07 23:39 UTC (permalink / raw)
  To: keescook; +Cc: tglx, linux-kernel, tony.luck, benh, jpoimboe, x86, dave.hansen

On Tue, 2020-04-07 at 11:26 -0700, Kees Cook wrote:
> 
> 
> On Mon, Apr 06, 2020 at 01:19:45PM +1000, Balbir Singh wrote:
> > Implement a mechanism to selectively flush the L1D cache. The goal is to
> > allow tasks that are paranoid due to the recent snoop assisted data
> > sampling
> > vulnerabilities, to flush their L1D on being switched out.  This protects
> > their data from being snooped or leaked via side channels after the task
> > has context switched out.
> > 
> > There are two scenarios we might want to protect against, a task leaving
> > the CPU with data still in L1D (which is the main concern of this patch),
> > the second scenario is a malicious task coming in (not so well trusted)
> > for which we want to clean up the cache before it starts. Only the case
> > for the former is addressed.
> > 
> > Add arch specific prctl()'s to opt-in to the L1D cache on context switch
> > out, the existing mechanisms of tracking prev_mm via cpu_tlbstate is
> > reused. cond_ibpb() is refactored and renamed into cond_mitigation().
> 
> I still think this should be a generic prctl(). If there is a strong
> reason not to do this, can it be described in the commit log here?
> 
> -Kees
> 

I can move to prctl() if that is what you prefer; the prctl() can then do
arch-specific things. When I asked earlier whether other architectures would
like to do this, I did not hear anything specific, but I am happy to convert
the interface over.

Balbir Singh.




* Re: [PATCH v2 3/4] arch/x86: Optionally flush L1D on context switch
  2020-04-07 23:39     ` Singh, Balbir
@ 2020-04-07 23:49       ` Thomas Gleixner
  0 siblings, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2020-04-07 23:49 UTC (permalink / raw)
  To: Singh, Balbir, keescook
  Cc: linux-kernel, tony.luck, benh, jpoimboe, x86, dave.hansen

"Singh, Balbir" <sblbir@amazon.com> writes:
> On Tue, 2020-04-07 at 11:26 -0700, Kees Cook wrote:
>> On Mon, Apr 06, 2020 at 01:19:45PM +1000, Balbir Singh wrote:
>> > Add arch specific prctl()'s to opt-in to the L1D cache on context switch
>> > out, the existing mechanisms of tracking prev_mm via cpu_tlbstate is
>> > reused. cond_ibpb() is refactored and renamed into cond_mitigation().
>> 
>> I still think this should be a generic prctl(). If there is a strong
>> reason not to do this, can it be described in the commit log here?
>
> I can move to prctl() if that is what you prefer; the prctl() can then do
> arch-specific things. When I asked earlier whether other architectures would
> like to do this, I did not hear anything specific, but I am happy to convert
> the interface over.

Yes, please. It's just a matter of time before other architectures find
this useful. L1D attacks are not restricted to x86 AFAICT.

Thanks,

        tglx


* Re: [PATCH v2 3/4] arch/x86: Optionally flush L1D on context switch
  2020-04-06  3:19 ` [PATCH v2 3/4] arch/x86: Optionally flush L1D on context switch Balbir Singh
  2020-04-07 18:26   ` Kees Cook
@ 2020-04-07 23:52   ` Thomas Gleixner
  2020-04-08  0:14     ` Singh, Balbir
  1 sibling, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2020-04-07 23:52 UTC (permalink / raw)
  To: Balbir Singh, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen, Balbir Singh

Balbir,

Balbir Singh <sblbir@amazon.com> writes:
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 6f66d841262d..69e6ea20679c 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -172,7 +172,7 @@ struct tlb_state {
>  	/* Last user mm for optimizing IBPB */
>  	union {
>  		struct mm_struct	*last_user_mm;
> -		unsigned long		last_user_mm_ibpb;
> +		unsigned long		last_user_mm_spec;
  
> -static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
> +static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)

> -static void cond_ibpb(struct task_struct *next)
> +static void cond_mitigation(struct task_struct *next)
>  {
> +	unsigned long prev_mm, next_mm;
> +
>  	if (!next || !next->mm)
>  		return;

can you please split out these preparatory changes into a separate
patch?
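
For reference, the logic that such a preparatory patch generalizes can
be modeled in userspace roughly as below. The constants are copied from
the quoted patch; the helper names mangle_spec_bits(), needs_ibpb_cond()
and needs_l1d_flush() are hypothetical and exist only for this sketch.
It assumes, as the patch's BUILD_BUG_ON does, that TIF_SPEC_FLUSH_L1D is
TIF_SPEC_IB + 1 and that mm pointers have their two low bits clear.

```c
#include <stdbool.h>

#define TIF_SPEC_IB		9
#define TIF_SPEC_FLUSH_L1D	10	/* must stay TIF_SPEC_IB + 1 */

#define LAST_USER_MM_IBPB	0x1UL
#define LAST_USER_MM_FLUSH_L1D	0x2UL
#define LAST_USER_MM_SPEC_MASK	(LAST_USER_MM_IBPB | LAST_USER_MM_FLUSH_L1D)

/* Fold the two adjacent TIF bits into the low bits of an aligned mm value. */
static unsigned long mangle_spec_bits(unsigned long mm, unsigned long tif_flags)
{
	return mm | ((tif_flags >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK);
}

/* Conditional IBPB: mangled values differ and either side has the IBPB bit. */
static bool needs_ibpb_cond(unsigned long prev_mm, unsigned long next_mm)
{
	return next_mm != prev_mm &&
	       ((next_mm | prev_mm) & LAST_USER_MM_IBPB) != 0;
}

/* L1D flush: only the outgoing (previous) task's opt-in bit matters. */
static bool needs_l1d_flush(unsigned long prev_mm)
{
	return (prev_mm & LAST_USER_MM_FLUSH_L1D) != 0;
}
```

The rename from last_user_mm_ibpb to last_user_mm_spec and the wider
mask are behavior-neutral for IBPB, which is what makes them suitable
for a separate patch.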

Thanks,

        tglx


* Re: [PATCH v2 3/4] arch/x86: Optionally flush L1D on context switch
  2020-04-07 23:52   ` Thomas Gleixner
@ 2020-04-08  0:14     ` Singh, Balbir
  0 siblings, 0 replies; 17+ messages in thread
From: Singh, Balbir @ 2020-04-08  0:14 UTC (permalink / raw)
  To: tglx, linux-kernel; +Cc: keescook, tony.luck, benh, jpoimboe, x86, dave.hansen

On Wed, 2020-04-08 at 01:52 +0200, Thomas Gleixner wrote:
> 
> Balbir,
> 
> Balbir Singh <sblbir@amazon.com> writes:
> > diff --git a/arch/x86/include/asm/tlbflush.h
> > b/arch/x86/include/asm/tlbflush.h
> > index 6f66d841262d..69e6ea20679c 100644
> > --- a/arch/x86/include/asm/tlbflush.h
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -172,7 +172,7 @@ struct tlb_state {
> >       /* Last user mm for optimizing IBPB */
> >       union {
> >               struct mm_struct        *last_user_mm;
> > -             unsigned long           last_user_mm_ibpb;
> > +             unsigned long           last_user_mm_spec;
> > -static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct
> > *next)
> > +static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct
> > *next)
> > -static void cond_ibpb(struct task_struct *next)
> > +static void cond_mitigation(struct task_struct *next)
> >  {
> > +     unsigned long prev_mm, next_mm;
> > +
> >       if (!next || !next->mm)
> >               return;
> 
> can you please split out these preparatory changes into a separate
> patch?
> 

Will do and repost a new iteration

Balbir Singh



* Re: [PATCH v2 2/4] arch/x86: Refactor tlbflush and l1d flush
  2020-04-07 18:25   ` Kees Cook
@ 2020-04-08  0:22     ` Singh, Balbir
  0 siblings, 0 replies; 17+ messages in thread
From: Singh, Balbir @ 2020-04-08  0:22 UTC (permalink / raw)
  To: keescook; +Cc: tglx, linux-kernel, tony.luck, benh, jpoimboe, x86, dave.hansen

On Tue, 2020-04-07 at 11:25 -0700, Kees Cook wrote:
> 
> On Mon, Apr 06, 2020 at 01:19:44PM +1000, Balbir Singh wrote:
> > Refactor the existing assembly bits into smaller helper functions
> > and also abstract L1D_FLUSH into a helper function. Use these
> > functions in kvm for L1D flushing.
> > 
> > Signed-off-by: Balbir Singh <sblbir@amazon.com>
> > ---
> >  arch/x86/include/asm/cacheflush.h |  3 ++
> >  arch/x86/kernel/l1d_flush.c       | 49 +++++++++++++++++++++++++++++++
> >  arch/x86/kvm/vmx/vmx.c            | 31 ++++---------------
> >  3 files changed, 57 insertions(+), 26 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/cacheflush.h
> > b/arch/x86/include/asm/cacheflush.h
> > index 6419a4cef0e8..66a46db7aadd 100644
> > --- a/arch/x86/include/asm/cacheflush.h
> > +++ b/arch/x86/include/asm/cacheflush.h
> > @@ -10,5 +10,8 @@
> >  void clflush_cache_range(void *addr, unsigned int size);
> >  void *alloc_l1d_flush_pages(void);
> >  void cleanup_l1d_flush_pages(void *l1d_flush_pages);
> > +void populate_tlb_with_flush_pages(void *l1d_flush_pages);
> > +void flush_l1d_cache_sw(void *l1d_flush_pages);
> > +int flush_l1d_cache_hw(void);
> > 
> >  #endif /* _ASM_X86_CACHEFLUSH_H */
> > diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
> > index 05f375c33423..60499f773046 100644
> > --- a/arch/x86/kernel/l1d_flush.c
> > +++ b/arch/x86/kernel/l1d_flush.c
> > @@ -34,3 +34,52 @@ void cleanup_l1d_flush_pages(void *l1d_flush_pages)
> >       free_pages((unsigned long)l1d_flush_pages, L1D_CACHE_ORDER);
> >  }
> >  EXPORT_SYMBOL_GPL(cleanup_l1d_flush_pages);
> > +
> > +void populate_tlb_with_flush_pages(void *l1d_flush_pages)
> > +{
> > +     int size = PAGE_SIZE << L1D_CACHE_ORDER;
> > +
> > +     asm volatile(
> > +             /* First ensure the pages are in the TLB */
> > +             "xorl   %%eax, %%eax\n"
> > +             ".Lpopulate_tlb:\n\t"
> > +             "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> > +             "addl   $4096, %%eax\n\t"
> > +             "cmpl   %%eax, %[size]\n\t"
> > +             "jne    .Lpopulate_tlb\n\t"
> > +             "xorl   %%eax, %%eax\n\t"
> > +             "cpuid\n\t"
> > +             :: [flush_pages] "r" (l1d_flush_pages),
> > +                 [size] "r" (size)
> > +             : "eax", "ebx", "ecx", "edx");
> > +}
> > +EXPORT_SYMBOL_GPL(populate_tlb_with_flush_pages);
> > +
> > +int flush_l1d_cache_hw(void)
> > +{
> > +     if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> > +             wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> > +             return 1;
> > +     }
> > +     return 0;
> > +}
> 
> This return value is backwards from the kernel's normal use of "int". I
> would expect 0 to mean "success" and non-zero to mean "failure". How
> about:
> 
> int flush_l1d_cache_hw(void)
> {
>      if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
>              wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
>              return 0;
>      }
>      return -ENOTSUPP;
> }
> 

Will do

> 
> > +EXPORT_SYMBOL_GPL(flush_l1d_cache_hw);
> > +
> > +void flush_l1d_cache_sw(void *l1d_flush_pages)
> > +{
> > +     int size = PAGE_SIZE << L1D_CACHE_ORDER;
> > +
> > +     asm volatile(
> > +                     /* Fill the cache */
> > +                     "xorl   %%eax, %%eax\n"
> > +                     ".Lfill_cache:\n"
> > +                     "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> > +                     "addl   $64, %%eax\n\t"
> > +                     "cmpl   %%eax, %[size]\n\t"
> > +                     "jne    .Lfill_cache\n\t"
> > +                     "lfence\n"
> > +                     :: [flush_pages] "r" (l1d_flush_pages),
> > +                     [size] "r" (size)
> > +                     : "eax", "ecx");
> > +}
> > +EXPORT_SYMBOL_GPL(flush_l1d_cache_sw);
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 209e63798435..29dc5a5bb6ab 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -5956,8 +5956,6 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu,
> >   */
> >  static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
> >  {
> > -     int size = PAGE_SIZE << L1D_CACHE_ORDER;
> > -
> >       /*
> >        * This code is only executed when the the flush mode is 'cond' or
> >        * 'always'
> > @@ -5986,32 +5984,13 @@ static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
> > 
> >       vcpu->stat.l1d_flush++;
> > 
> > -     if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> > -             wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> > +     if (flush_l1d_cache_hw())
> >               return;
> > -     }
> 
> Then this becomes:
> 
>         if (flush_l1d_cache_hw() == 0)
>                 return;
> 
> > (Or change it to a "bool", use true/false, and leave the above
> > call as-is.)
> 
> Either way:
> 
> Reviewed-by: Kees Cook <keescook@chromium.org>
> 

Thanks,
Balbir Singh.



* Re: [PATCH v2 4/4] arch/x86: Add L1D flushing Documentation
  2020-04-06  3:19 ` [PATCH v2 4/4] arch/x86: Add L1D flushing Documentation Balbir Singh
@ 2020-05-19 15:39   ` Randy Dunlap
  2020-05-20  0:47     ` Singh, Balbir
  0 siblings, 1 reply; 17+ messages in thread
From: Randy Dunlap @ 2020-05-19 15:39 UTC (permalink / raw)
  To: Balbir Singh, tglx, linux-kernel
  Cc: jpoimboe, tony.luck, keescook, benh, x86, dave.hansen

Hi--

Comments below. Sorry about the delay.

On 4/5/20 8:19 PM, Balbir Singh wrote:
> Add documentation of l1d flushing, explain the need for the
> feature and how it can be used.
> 
> Signed-off-by: Balbir Singh <sblbir@amazon.com>
> ---
>  Documentation/admin-guide/hw-vuln/index.rst   |  1 +
>  .../admin-guide/hw-vuln/l1d_flush.rst         | 40 +++++++++++++++++++
>  2 files changed, 41 insertions(+)
>  create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst

> diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
> new file mode 100644
> index 000000000000..73ee9e491a74
> --- /dev/null
> +++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
> @@ -0,0 +1,40 @@
> +L1D Flushing for the paranoid
> +=============================
> +
> +With an increasing number of vulnerabilities being reported around data
> +leaks from L1D, a new user space mechanism to flush the L1D cache on
> +context switch is added to the kernel. This should help address
> +CVE-2020-0550 and for paranoid applications, keep them safe from any
> +yet to be discovered vulnerabilities, related to leaks from the L1D
> +cache.
> +
> +Tasks can opt in to this mechanism by using an architecture specific
> +prctl (x86 only at the moment).
> +
> +Related CVES

           CVEs

> +------------
> +At the present moment, the following CVEs can be addressed by this
> +mechanism
> +
> +    =============       ========================     ==================
> +    CVE-2020-0550       Improper Data Forwarding     OS related aspects
> +    =============       ========================     ==================
> +
> +Usage Guidelines
> +----------------
> +Applications can call ``arch_prctl(2)`` with one of these two arguments

end above sentence with period or colon (colon might require the following
bullet items to be indented -- I'm not sure about that).

> +
> +1. ARCH_SET_L1D_FLUSH - flush the L1D cache on context switch (out)
> +2. ARCH_GET_L1D_FLUSH - get the current state of the L1D cache flush, returns 1
> +   if set and 0 if not set.
> +
> +**NOTE**: The feature is disabled by default, applications to need to specifically

                                        default; applications need to

> +opt into the feature to enable it.
> +
> +Mitigation
> +----------
> +When ARCH_SET_L1D_FLUSH is enabled for a task, on switching tasks (when
> +the address space changes), a flush of the L1D cache is performed for
> +the task when it leaves the CPU. If the underlying CPU supports L1D
> +flushing in hardware, the hardware mechanism is used, otherwise a software
> +fallback, similar to the mechanism used by L1TF is used.
> 

thanks.
-- 
~Randy
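
The behaviour described in the quoted "Mitigation" section can be modeled in
userspace roughly as follows. This is only a sketch of the decision logic as
the documentation states it: flush when an opted-in task leaves the CPU and
the address space changes, using hardware if available and software
otherwise. All struct and function names here are hypothetical and do not
come from the patch.

```c
#include <stdbool.h>

/* Hypothetical model types; the kernel uses mm_struct and task flags. */
struct mm_model {
	int id;			/* stands in for an address space */
};

struct task_model {
	struct mm_model *mm;
	bool l1d_flush;		/* set when the task has opted in */
};

static bool hw_flush_supported;	/* models hardware L1D flush support */
static int hw_flushes;		/* counts modeled hardware flushes */
static int sw_flushes;		/* counts modeled software fallbacks */

/* Prefer the hardware mechanism, otherwise use the software fallback. */
static void flush_l1d_model(void)
{
	if (hw_flush_supported)
		hw_flushes++;
	else
		sw_flushes++;
}

/* Flush only when the address space changes and prev has opted in. */
static void switch_mm_model(const struct task_model *prev,
			    const struct task_model *next)
{
	if (prev->mm == next->mm)
		return;

	if (prev->l1d_flush)
		flush_l1d_model();
}
```

In this model a switch between two tasks that share an address space never
flushes, and a task that has not opted in never triggers a flush when it is
switched out.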



* Re: [PATCH v2 3/4] arch/x86: Optionally flush L1D on context switch
  2020-04-07 18:26   ` Kees Cook
  2020-04-07 23:37     ` Benjamin Herrenschmidt
  2020-04-07 23:39     ` Singh, Balbir
@ 2020-05-19 23:41     ` Singh, Balbir
  2 siblings, 0 replies; 17+ messages in thread
From: Singh, Balbir @ 2020-05-19 23:41 UTC (permalink / raw)
  To: keescook; +Cc: tglx, linux-kernel, tony.luck, benh, jpoimboe, x86, dave.hansen

On Tue, 2020-04-07 at 11:26 -0700, Kees Cook wrote:
> 
> 
> On Mon, Apr 06, 2020 at 01:19:45PM +1000, Balbir Singh wrote:
> > Implement a mechanism to selectively flush the L1D cache. The goal is to
> > allow tasks that are paranoid due to the recent snoop assisted data sampling
> > vulnerabilities, to flush their L1D on being switched out.  This protects
> > their data from being snooped or leaked via side channels after the task
> > has context switched out.
> > 
> > There are two scenarios we might want to protect against, a task leaving
> > the CPU with data still in L1D (which is the main concern of this patch),
> > the second scenario is a malicious task coming in (not so well trusted)
> > for which we want to clean up the cache before it starts. Only the case
> > for the former is addressed.
> > 
> > Add arch specific prctl()s to opt in to flushing the L1D cache on context
> > switch out. The existing mechanism of tracking prev_mm via cpu_tlbstate is
> > reused. cond_ibpb() is refactored and renamed to cond_mitigation().
> 
> I still think this should be a generic prctl(). If there is a strong
> reason not to do this, can it be described in the commit log here?

Kees, the wording in the changelog might be misleading: the prctl is generic,
but the implementation is arch specific, as you can see from the following
patches. I can reword the changelog. Sorry for the confusion.

Balbir Singh.




* Re: [PATCH v2 4/4] arch/x86: Add L1D flushing Documentation
  2020-05-19 15:39   ` Randy Dunlap
@ 2020-05-20  0:47     ` Singh, Balbir
  0 siblings, 0 replies; 17+ messages in thread
From: Singh, Balbir @ 2020-05-20  0:47 UTC (permalink / raw)
  To: tglx, rdunlap, linux-kernel
  Cc: keescook, tony.luck, benh, jpoimboe, x86, dave.hansen

On Tue, 2020-05-19 at 08:39 -0700, Randy Dunlap wrote:
> 
> Hi--
> 
> Comments below. Sorry about the delay.
> 
> On 4/5/20 8:19 PM, Balbir Singh wrote:
> > Add documentation of l1d flushing, explain the need for the
> > feature and how it can be used.
> > 
> > Signed-off-by: Balbir Singh <sblbir@amazon.com>
> > ---
> >  Documentation/admin-guide/hw-vuln/index.rst   |  1 +
> >  .../admin-guide/hw-vuln/l1d_flush.rst         | 40 +++++++++++++++++++
> >  2 files changed, 41 insertions(+)
> >  create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst
> > diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
> > new file mode 100644
> > index 000000000000..73ee9e491a74
> > --- /dev/null
> > +++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
> > @@ -0,0 +1,40 @@
> > +L1D Flushing for the paranoid
> > +=============================
> > +
> > +With an increasing number of vulnerabilities being reported around data
> > +leaks from L1D, a new user space mechanism to flush the L1D cache on
> > +context switch is added to the kernel. This should help address
> > +CVE-2020-0550 and for paranoid applications, keep them safe from any
> > +yet to be discovered vulnerabilities, related to leaks from the L1D
> > +cache.
> > +
> > +Tasks can opt in to this mechanism by using an architecture specific
> > +prctl (x86 only at the moment).
> > +
> > +Related CVES
> 
>            CVEs
> 
> > +------------
> > +At the present moment, the following CVEs can be addressed by this
> > +mechanism
> > +
> > +    =============       ========================     ==================
> > +    CVE-2020-0550       Improper Data Forwarding     OS related aspects
> > +    =============       ========================     ==================
> > +
> > +Usage Guidelines
> > +----------------
> > +Applications can call ``arch_prctl(2)`` with one of these two arguments
> 
> end above sentence with period or colon (colon might require the following
> bullet items to be indented -- I'm not sure about that).

I'll take a look

> 
> > +
> > +1. ARCH_SET_L1D_FLUSH - flush the L1D cache on context switch (out)
> > +2. ARCH_GET_L1D_FLUSH - get the current state of the L1D cache flush, returns 1
> > +   if set and 0 if not set.
> > +
> > +**NOTE**: The feature is disabled by default, applications to need to specifically
> 
>                                         default; applications need to
> 
> > +opt into the feature to enable it.
> > +
> > +Mitigation
> > +----------
> > +When ARCH_SET_L1D_FLUSH is enabled for a task, on switching tasks (when
> > +the address space changes), a flush of the L1D cache is performed for
> > +the task when it leaves the CPU. If the underlying CPU supports L1D
> > +flushing in hardware, the hardware mechanism is used, otherwise a software
> > +fallback, similar to the mechanism used by L1TF is used.
> > 
> 

I'll work on these and update based on more feedback on the rest of the series.

Balbir Singh.




Thread overview: 17+ messages
2020-04-06  3:19 [PATCH v2 0/4] arch/x86: Optionally flush L1D on context switch Balbir Singh
2020-04-06  3:19 ` [PATCH v2 1/4] arch/x86/kvm: Refactor l1d flush lifecycle management Balbir Singh
2020-04-07 18:21   ` Kees Cook
2020-04-06  3:19 ` [PATCH v2 2/4] arch/x86: Refactor tlbflush and l1d flush Balbir Singh
2020-04-07 18:25   ` Kees Cook
2020-04-08  0:22     ` Singh, Balbir
2020-04-06  3:19 ` [PATCH v2 3/4] arch/x86: Optionally flush L1D on context switch Balbir Singh
2020-04-07 18:26   ` Kees Cook
2020-04-07 23:37     ` Benjamin Herrenschmidt
2020-04-07 23:39     ` Singh, Balbir
2020-04-07 23:49       ` Thomas Gleixner
2020-05-19 23:41     ` Singh, Balbir
2020-04-07 23:52   ` Thomas Gleixner
2020-04-08  0:14     ` Singh, Balbir
2020-04-06  3:19 ` [PATCH v2 4/4] arch/x86: Add L1D flushing Documentation Balbir Singh
2020-05-19 15:39   ` Randy Dunlap
2020-05-20  0:47     ` Singh, Balbir
