linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] arch/x86: Optionally flush L1D on context switch
@ 2020-03-13 22:04 Balbir Singh
  2020-03-18 23:14 ` Kees Cook
  2020-03-19  0:38 ` Thomas Gleixner
  0 siblings, 2 replies; 15+ messages in thread
From: Balbir Singh @ 2020-03-13 22:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, Balbir Singh, keescook, benh

This patch is an RFC/PoC to start the discussion on optionally flushing
the L1D cache.  The goal is to allow tasks that are paranoid, due to the
recent snoop-assisted data sampling vulnerabilities, to flush their L1D on
being switched out.  This protects their data from being snooped or leaked
via side channels after the task has context switched out.

There are two scenarios we might want to protect against: the first is a
task leaving the CPU with data still in L1D (which is the main concern of
this patch); the second is a malicious (less trusted) task coming in, for
which we want to clean up the cache before it starts execution. The latter
was proposed by benh and is not currently addressed by this patch, but can
easily be accommodated by the same mechanism.

This patch adds an arch-specific prctl() to flush the L1D cache on context
switch out; the existing mechanism of tracking prev_mm via cpu_tlbstate is
reused (very similar to the cond_ibpb() changes). The patch has been
lightly tested.
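
For illustration, user space would opt in roughly like this (a minimal
sketch using a raw syscall(); the ARCH_*_L1D_FLUSH values are the ones
added below, error handling is kept minimal):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_SET_L1D_FLUSH	0x3001	/* from this patch */
#define ARCH_GET_L1D_FLUSH	0x3002

int main(void)
{
	/* Ask for an L1D flush whenever this task is switched out */
	if (syscall(SYS_arch_prctl, ARCH_SET_L1D_FLUSH, 1UL))
		perror("arch_prctl(ARCH_SET_L1D_FLUSH)");

	/* ... work on sensitive data ... */

	printf("opted in: %ld\n",
	       syscall(SYS_arch_prctl, ARCH_GET_L1D_FLUSH, 0UL));
	return 0;
}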

The points of discussion/review are:

1. Discuss the use case and the right approach to address this
2. Does an arch prctl allowing for opt-in flushing make sense, would other
   arches care about something similar?
3. There is a fallback software L1D load, similar to what L1TF does, but
   we don't prefetch the TLB; is that sufficient?
4. The atomics can be improved and we could use a static key like ibpb
   does to optimize the code path
5. The code works with a special hack for 64 bit systems (TIF_L1D_FLUSH
   is bit 32), we could generalize it with some effort
6. Should we consider cleaning up the L1D on arrival of tasks?

In summary, this is an early PoC to start the discussion on the need for
conditional L1D flushing based on the security posture of the
application and the sensitivity of the data it has access to or might
have access to.

Cc: keescook@chromium.org
Cc: benh@amazon.com

Signed-off-by: Balbir Singh <sblbir@amazon.com>
---
 arch/x86/include/asm/thread_info.h |  8 +++
 arch/x86/include/asm/tlbflush.h    |  6 ++
 arch/x86/include/uapi/asm/prctl.h  |  3 +
 arch/x86/kernel/process_64.c       | 12 +++-
 arch/x86/mm/tlb.c                  | 89 ++++++++++++++++++++++++++++++
 5 files changed, 117 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8de8ceccb8bc..c48ebfa17805 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -103,6 +103,9 @@ struct thread_info {
 #define TIF_ADDR32		29	/* 32-bit address space on 64 bits */
 #define TIF_X32			30	/* 32-bit native x86-64 binary */
 #define TIF_FSCHECK		31	/* Check FS is USER_DS on return */
+#ifdef CONFIG_64BIT
+#define TIF_L1D_FLUSH           32      /* Flush L1D on mm switches (processes) */
+#endif
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
@@ -132,6 +135,9 @@ struct thread_info {
 #define _TIF_ADDR32		(1 << TIF_ADDR32)
 #define _TIF_X32		(1 << TIF_X32)
 #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
+#ifdef CONFIG_64BIT
+#define _TIF_L1D_FLUSH		(1UL << TIF_L1D_FLUSH)
+#endif
 
 /* Work to do before invoking the actual syscall. */
 #define _TIF_WORK_SYSCALL_ENTRY	\
@@ -239,6 +245,8 @@ extern void arch_task_cache_init(void);
 extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
 extern void arch_release_task_struct(struct task_struct *tsk);
 extern void arch_setup_new_exec(void);
+extern void enable_l1d_flush_for_task(struct task_struct *tsk);
+extern void disable_l1d_flush_for_task(struct task_struct *tsk);
 #define arch_setup_new_exec arch_setup_new_exec
 #endif	/* !__ASSEMBLY__ */
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6f66d841262d..1d535059b358 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -219,6 +219,12 @@ struct tlb_state {
 	 */
 	unsigned long cr4;
 
+	/*
+	 * Flush the L1D cache on switch_mm_irqs_off() for a
+	 * task getting off the CPU, if it opted in to do so
+	 */
+	bool last_user_mm_l1d_flush;
+
 	/*
 	 * This is a list of all contexts that might exist in the TLB.
 	 * There is one per ASID that we use, and the ASID (what the
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9fa41f..1361e5e25791 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -14,4 +14,7 @@
 #define ARCH_MAP_VDSO_32	0x2002
 #define ARCH_MAP_VDSO_64	0x2003
 
+#define ARCH_SET_L1D_FLUSH	0x3001
+#define ARCH_GET_L1D_FLUSH	0x3002
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ffd497804dbc..df9f8775ee94 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -700,7 +700,17 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_MAP_VDSO_64:
 		return prctl_map_vdso(&vdso_image_64, arg2);
 #endif
-
+#ifdef CONFIG_64BIT
+	case ARCH_GET_L1D_FLUSH:
+		return test_ti_thread_flag(&task->thread_info, TIF_L1D_FLUSH);
+	case ARCH_SET_L1D_FLUSH: {
+		if (arg2 >= 1)
+			enable_l1d_flush_for_task(task);
+		else
+			disable_l1d_flush_for_task(task);
+		break;
+	}
+#endif
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 66f96f21a7b6..35a3970df0ef 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -151,6 +151,92 @@ void leave_mm(int cpu)
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
+#define L1D_CACHE_ORDER 4
+static void *l1d_flush_pages;
+static DEFINE_MUTEX(l1d_flush_mutex);
+
+void enable_l1d_flush_for_task(struct task_struct *tsk)
+{
+	struct page *page;
+
+	if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
+		goto done;
+
+	mutex_lock(&l1d_flush_mutex);
+	if (l1d_flush_pages)
+		goto done;
+	/*
+	 * These pages are never freed, we use the same
+	 * set of pages across multiple processes/contexts
+	 */
+	page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
+	if (!page)
+		return;
+
+	l1d_flush_pages = page_address(page);
+	/* I don't think we need to worry about KSM */
+done:
+	set_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
+	mutex_unlock(&l1d_flush_mutex);
+}
+
+void disable_l1d_flush_for_task(struct task_struct *tsk)
+{
+	clear_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
+	smp_mb__after_atomic();
+}
+
+/*
+ * Flush the L1D cache for this CPU. We want to do this at switch-mm time,
+ * this is a pessimistic security measure and an opt-in for those tasks
+ * that host sensitive information and there are concerns about spills
+ * from fill buffers.
+ */
+static void l1d_flush(struct mm_struct *next, struct task_struct *tsk)
+{
+	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+	int size = PAGE_SIZE << L1D_CACHE_ORDER;
+
+	if (this_cpu_read(cpu_tlbstate.last_user_mm_l1d_flush) == 0)
+		goto check_next;
+
+	if (real_prev == next)
+		return;
+
+	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+		goto done;
+	}
+
+	asm volatile(
+		/* Fill the cache */
+		"xorl	%%eax, %%eax\n"
+		".Lfill_cache:\n"
+		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
+		"addl	$64, %%eax\n\t"
+		"cmpl	%%eax, %[size]\n\t"
+		"jne	.Lfill_cache\n\t"
+		"lfence\n"
+		:: [flush_pages] "r" (l1d_flush_pages),
+		    [size] "r" (size)
+		: "eax", "ecx");
+
+done:
+	this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 0);
+	/* Make sure we clear the values before we set it again */
+	barrier();
+check_next:
+	if (tsk == NULL)
+		return;
+
+	/* Match the set/clear_bit barriers */
+	smp_rmb();
+
+	/* We don't need stringent checks as we opt-in/opt-out */
+	if (test_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH))
+		this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 1);
+}
+
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
 {
@@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
 	}
 
+	l1d_flush(next, tsk);
+
 	/* Make sure we write CR3 before loaded_mm. */
 	barrier();
 
@@ -503,6 +591,7 @@ void initialize_tlbstate_and_flush(void)
 	/* Reinitialize tlbstate. */
 	this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, LAST_USER_MM_IBPB);
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
+	this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 0);
 	this_cpu_write(cpu_tlbstate.next_asid, 1);
 	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
 	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
-- 
2.17.1



* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-13 22:04 [RFC PATCH] arch/x86: Optionally flush L1D on context switch Balbir Singh
@ 2020-03-18 23:14 ` Kees Cook
  2020-03-20  1:35   ` Singh, Balbir
  2020-03-19  0:38 ` Thomas Gleixner
  1 sibling, 1 reply; 15+ messages in thread
From: Kees Cook @ 2020-03-18 23:14 UTC (permalink / raw)
  To: Balbir Singh; +Cc: linux-kernel, x86, benh

On Sat, Mar 14, 2020 at 09:04:15AM +1100, Balbir Singh wrote:
> This patch is an RFC/PoC to start the discussion on optionally flushing
> L1D cache.  The goal is to allow tasks that are paranoid due to the recent
> snoop assisted data sampling vulnerabilites, to flush their L1D on being
> switched out.  This protects their data from being snooped or leaked via
> side channels after the task has context switched out.
> 
> There are two scenarios we might want to protect against, a task leaving
> the CPU with data still in L1D (which is the main concern of this
> patch), the second scenario is a malicious task coming in (not so well
> trusted) for which we want to clean up the cache before it starts
> execution. The latter was proposed by benh and is not currently
> addressed by this patch, but can be easily accommodated by the same
> mechanism.
> 
> This patch adds an arch specific prctl() to flush L1D cache on context
> switch out, the existing mechanisms of tracking prev_mm via cpu_tlbstate
> is reused (very similar to the cond_ipbp() changes). The patch has been
> lighted tested.
> 
> The points of discussion/review are:
> 
> 1. Discuss the use case and the right approach to address this

I think this is a nice feature to have available.

> 2. Does an arch prctl allowing for opt-in flushing make sense, would other
>    arches care about something similar?

This has been talked about before (a "make the CPU safe" prctl) and it
really came down to "things are so arch-specific that there isn't a
great way to describe the feature". That said, L1D is going to be a
common feature for all CPUs, but perhaps this _can_ be a general prctl
with arch-specific support for speed-ups.

> 3. There is a fallback software L1D load, similar to what L1TF does, but
>    we don't prefetch the TLB, is that sufficient?

Isn't there a function to force a TLB flush? I don't know that area
well.

> 4. The atomics can be improved and we could use a static key like ibpb
>    does to optimize the code path

Notes below...

> 5. The code works with a special hack for 64 bit systems (TIF_L1D_FLUSH
>    is bit 32), we could generalize it with some effort

Or we could upgrade the TIF bits to u64?

> 6. Should we consider cleaning up the L1D on arrival of tasks?

I don't think so? Maybe I don't know what you mean, but it seems like
this only needs to pay attention when the TIF flag is set.

> In summary, this is an early PoC to start the discussion on the need for
> conditional L1D flushing based on the security posture of the
> application and the sensitivity of the data it has access to or might
> have access to.
> 
> Cc: keescook@chromium.org
> Cc: benh@amazon.com
> 
> Signed-off-by: Balbir Singh <sblbir@amazon.com>
> ---
>  arch/x86/include/asm/thread_info.h |  8 +++
>  arch/x86/include/asm/tlbflush.h    |  6 ++
>  arch/x86/include/uapi/asm/prctl.h  |  3 +
>  arch/x86/kernel/process_64.c       | 12 +++-
>  arch/x86/mm/tlb.c                  | 89 ++++++++++++++++++++++++++++++
>  5 files changed, 117 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 8de8ceccb8bc..c48ebfa17805 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -103,6 +103,9 @@ struct thread_info {
>  #define TIF_ADDR32		29	/* 32-bit address space on 64 bits */
>  #define TIF_X32			30	/* 32-bit native x86-64 binary */
>  #define TIF_FSCHECK		31	/* Check FS is USER_DS on return */
> +#ifdef CONFIG_64BIT
> +#define TIF_L1D_FLUSH           32      /* Flush L1D on mm switches (processes) */
> +#endif
>  
>  #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
>  #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
> @@ -132,6 +135,9 @@ struct thread_info {
>  #define _TIF_ADDR32		(1 << TIF_ADDR32)
>  #define _TIF_X32		(1 << TIF_X32)
>  #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
> +#ifdef CONFIG_64BIT
> +#define _TIF_L1D_FLUSH		(1UL << TIF_L1D_FLUSH)
> +#endif

If we're out of TIF flags we should move to u64 explicitly, maybe?

>  
>  /* Work to do before invoking the actual syscall. */
>  #define _TIF_WORK_SYSCALL_ENTRY	\
> @@ -239,6 +245,8 @@ extern void arch_task_cache_init(void);
>  extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
>  extern void arch_release_task_struct(struct task_struct *tsk);
>  extern void arch_setup_new_exec(void);
> +extern void enable_l1d_flush_for_task(struct task_struct *tsk);
> +extern void disable_l1d_flush_for_task(struct task_struct *tsk);
>  #define arch_setup_new_exec arch_setup_new_exec
>  #endif	/* !__ASSEMBLY__ */
>  
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 6f66d841262d..1d535059b358 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -219,6 +219,12 @@ struct tlb_state {
>  	 */
>  	unsigned long cr4;
>  
> +	/*
> +	 * Flush the L1D cache on switch_mm_irqs_off() for a
> +	 * task getting off the CPU, if it opted in to do so
> +	 */
> +	bool last_user_mm_l1d_flush;

Move up with other bools. Maybe we need to make these bools bit fields
soon? We're wasting a lot of space in struct tlb_state.

> +
>  	/*
>  	 * This is a list of all contexts that might exist in the TLB.
>  	 * There is one per ASID that we use, and the ASID (what the
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 5a6aac9fa41f..1361e5e25791 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -14,4 +14,7 @@
>  #define ARCH_MAP_VDSO_32	0x2002
>  #define ARCH_MAP_VDSO_64	0x2003
>  
> +#define ARCH_SET_L1D_FLUSH	0x3001
> +#define ARCH_GET_L1D_FLUSH	0x3002
> +
>  #endif /* _ASM_X86_PRCTL_H */
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index ffd497804dbc..df9f8775ee94 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -700,7 +700,17 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
>  	case ARCH_MAP_VDSO_64:
>  		return prctl_map_vdso(&vdso_image_64, arg2);
>  #endif
> -
> +#ifdef CONFIG_64BIT
> +	case ARCH_GET_L1D_FLUSH:
> +		return test_ti_thread_flag(&task->thread_info, TIF_L1D_FLUSH);
> +	case ARCH_SET_L1D_FLUSH: {
> +		if (arg2 >= 1)
> +			enable_l1d_flush_for_task(task);
> +		else
> +			disable_l1d_flush_for_task(task);
> +		break;
> +	}
> +#endif

I think this should probably be a generic prctl() with the software
fall-back run in the arch-agnostic code.
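
Rough shape of what I mean, as a sketch only (the command numbers and
helper names here are invented): a generic pair of prctl() commands in
kernel/sys.c that call into an arch hook, with weak no-op stubs for
architectures that don't implement it:

	/* include/uapi/linux/prctl.h -- hypothetical values */
	#define PR_SET_L1D_FLUSH	57
	#define PR_GET_L1D_FLUSH	58

	/* kernel/sys.c, in the prctl() switch */
	case PR_SET_L1D_FLUSH:
		if (arg2 > 1 || arg3 || arg4 || arg5)
			return -EINVAL;
		error = arch_prctl_l1d_flush_set(me, arg2);
		break;
	case PR_GET_L1D_FLUSH:
		if (arg2 || arg3 || arg4 || arg5)
			return -EINVAL;
		error = arch_prctl_l1d_flush_get(me);
		break;

x86 would override arch_prctl_l1d_flush_{set,get}() for the MSR fast
path, and the software fill fallback could then live in common code.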

>  	default:
>  		ret = -EINVAL;
>  		break;
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 66f96f21a7b6..35a3970df0ef 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -151,6 +151,92 @@ void leave_mm(int cpu)
>  }
>  EXPORT_SYMBOL_GPL(leave_mm);
>  
> +#define L1D_CACHE_ORDER 4
> +static void *l1d_flush_pages;
> +static DEFINE_MUTEX(l1d_flush_mutex);
> +
> +void enable_l1d_flush_for_task(struct task_struct *tsk)
> +{
> +	struct page *page;
> +
> +	if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> +		goto done;
> +
> +	mutex_lock(&l1d_flush_mutex);
> +	if (l1d_flush_pages)
> +		goto done;
> +	/*
> +	 * These pages are never freed, we use the same
> +	 * set of pages across multiple processes/contexts
> +	 */
> +	page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
> +	if (!page)
> +		return;
> +
> +	l1d_flush_pages = page_address(page);
> +	/* I don't think we need to worry about KSM */
> +done:
> +	set_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> +	mutex_unlock(&l1d_flush_mutex);
> +}

I have no suggestions that feel better, but this lock and dynamic
allocation seem weird. We save memory if no process ever sets the flag,
but it feels odd to have to take a global lock for a per-process TIF
flag setting.

How about READ_ONCE() on l1d_flush_pages, and use the lock for doing the
allocation if it's NULL? Something like:

if (arch_has_flush_l1d()) {
	struct page *page;

	if (unlikely(!READ_ONCE(l1d_flush_pages))) {
		mutex_lock(&l1d_flush_mutex);
		if (!l1d_flush_pages) {
			page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
					   L1D_CACHE_ORDER);
			if (page)
				l1d_flush_pages = page_address(page);
		}
		mutex_unlock(&l1d_flush_mutex);
		if (!l1d_flush_pages)
			return -ENOMEM;
	}
}

set_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
return 0;

And that gets rid of gotos.

And this needs to return errors in the ENOMEM case, yes?

> +
> +void disable_l1d_flush_for_task(struct task_struct *tsk)
> +{
> +	clear_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> +	smp_mb__after_atomic();
> +}
> +
> +/*
> + * Flush the L1D cache for this CPU. We want to this at switch mm time,
> + * this is a pessimistic security measure and an opt-in for those tasks
> + * that host sensitive information and there are concerns about spills
> + * from fill buffers.
> + */
> +static void l1d_flush(struct mm_struct *next, struct task_struct *tsk)
> +{
> +	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> +	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> +
> +	if (this_cpu_read(cpu_tlbstate.last_user_mm_l1d_flush) == 0)
> +		goto check_next;
> +
> +	if (real_prev == next)
> +		return;
> +
> +	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> +		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> +		goto done;
> +	}

This mix of gotos and returns makes this hard to read, IMO. Please just
if/else should be fine here.

> +
> +	asm volatile(
> +		/* Fill the cache */
> +		"xorl	%%eax, %%eax\n"
> +		".Lfill_cache:\n"
> +		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> +		"addl	$64, %%eax\n\t"
> +		"cmpl	%%eax, %[size]\n\t"
> +		"jne	.Lfill_cache\n\t"
> +		"lfence\n"
> +		:: [flush_pages] "r" (l1d_flush_pages),
> +		    [size] "r" (size)
> +		: "eax", "ecx");

Can the "generic" routine just be in C and in the prctl()?

> +
> +done:
> +	this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 0);
> +	/* Make sure we clear the values before we set it again */
> +	barrier();
> +check_next:
> +	if (tsk == NULL)
> +		return;
> +
> +	/* Match the set/clear_bit barriers */
> +	smp_rmb();
> +
> +	/* We don't need stringent checks as we opt-in/opt-out */
> +	if (test_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH))
> +		this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 1);
> +}

I'm less familiar with how tlb_state relates to the current thread, so
this last_user_mm_l1d_flush doesn't make sense to me. (Nor the real_prev
== next check.) I assume this is just details from mm switching.

> +
>  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	       struct task_struct *tsk)
>  {
> @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
>  	}
>  
> +	l1d_flush(next, tsk);
> +
>  	/* Make sure we write CR3 before loaded_mm. */
>  	barrier();
>  
> @@ -503,6 +591,7 @@ void initialize_tlbstate_and_flush(void)
>  	/* Reinitialize tlbstate. */
>  	this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, LAST_USER_MM_IBPB);
>  	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
> +	this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 0);
>  	this_cpu_write(cpu_tlbstate.next_asid, 1);
>  	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
>  	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
> -- 
> 2.17.1
> 

-- 
Kees Cook


* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-13 22:04 [RFC PATCH] arch/x86: Optionally flush L1D on context switch Balbir Singh
  2020-03-18 23:14 ` Kees Cook
@ 2020-03-19  0:38 ` Thomas Gleixner
  2020-03-20  1:37   ` Singh, Balbir
  2020-03-22  5:01   ` Herrenschmidt, Benjamin
  1 sibling, 2 replies; 15+ messages in thread
From: Thomas Gleixner @ 2020-03-19  0:38 UTC (permalink / raw)
  To: Balbir Singh, linux-kernel; +Cc: x86, Balbir Singh, keescook, benh

Balbir,

Balbir Singh <sblbir@amazon.com> writes:
> This patch is an RFC/PoC to start the discussion on optionally flushing
> L1D cache.  The goal is to allow tasks that are paranoid due to the recent
> snoop assisted data sampling vulnerabilites, to flush their L1D on being
> switched out.  This protects their data from being snooped or leaked via
> side channels after the task has context switched out.

What you mean is that L1D is flushed before another task which does not
belong to the same trust zone returns to user space or enters guest
mode.

> There are two scenarios we might want to protect against, a task leaving
> the CPU with data still in L1D (which is the main concern of this
> patch), the second scenario is a malicious task coming in (not so well
> trusted) for which we want to clean up the cache before it starts
> execution. The latter was proposed by benh and is not currently
> addressed by this patch, but can be easily accommodated by the same
> mechanism.

What's the point? The attack surface is the L1D content of the scheduled
out task. If the malicious task schedules out, then why would you care?

I might be missing something, but AFAICT this is beyond paranoia.

> The points of discussion/review are:
>
> 1. Discuss the use case and the right approach to address this

It might be the quick fix for the next not-yet-known variant of L1D-based
horrors, so yes, it's at least worth discussing.

> 2. Does an arch prctl allowing for opt-in flushing make sense, would other
>    arches care about something similar?

No idea, but I assume in the light of L1D based issues that might be the
case. Though that still is per architecture as the L1D flush mechanisms
are architecture specific. Aside of that I don't think that more than a
few will enable / support it.

> 3. There is a fallback software L1D load, similar to what L1TF does, but
>    we don't prefetch the TLB, is that sufficient?

If we go there, then the KVM L1D flush code wants to move into general
x86 code.

> 4. The atomics can be improved and we could use a static key like ibpb
>    does to optimize the code path

Yes to static key.

> 5. The code works with a special hack for 64 bit systems (TIF_L1D_FLUSH
>    is bit 32), we could generalize it with some effort

Why so? There are gaps in the TIF flags (18, 23, 26). Why do you want to
push that to 32?

> 6. Should we consider cleaning up the L1D on arrival of tasks?

That's the Ben idea, right? If so see above.

> +#define L1D_CACHE_ORDER 4
> +static void *l1d_flush_pages;
> +static DEFINE_MUTEX(l1d_flush_mutex);
> +
> +void enable_l1d_flush_for_task(struct task_struct *tsk)
> +{
> +	struct page *page;
> +
> +	if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> +		goto done;
> +
> +	mutex_lock(&l1d_flush_mutex);
> +	if (l1d_flush_pages)
> +		goto done;
> +	/*
> +	 * These pages are never freed, we use the same
> +	 * set of pages across multiple processes/contexts
> +	 */
> +	page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
> +	if (!page)
> +		return;
> +
> +	l1d_flush_pages = page_address(page);
> +	/* I don't think we need to worry about KSM */

Why not? Even if it wouldn't be necessary why would we care as this is a
once per boot operation in fully preemptible code.

Aside of that why do we need another pointlessly different copy of what
we have in the VMX code?

> +done:
> +	set_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> +	mutex_unlock(&l1d_flush_mutex);
> +}
> +
> +void disable_l1d_flush_for_task(struct task_struct *tsk)
> +{
> +	clear_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> +	smp_mb__after_atomic();

Lacks an explanation / comment what this smp_mb is for and where the
counterpart lives.

Aside of that, this is pointless AFAICT. Disable/enable is really not a
concern of being perfect. If you want perfect protection, simply switch
off your computer.

> +/*
> + * Flush the L1D cache for this CPU. We want to this at switch mm time,
> + * this is a pessimistic security measure and an opt-in for those tasks
> + * that host sensitive information and there are concerns about spills
> + * from fill buffers.
> + */
> +static void l1d_flush(struct mm_struct *next, struct task_struct *tsk)
> +{
> +	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> +	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> +
> +	if (this_cpu_read(cpu_tlbstate.last_user_mm_l1d_flush) == 0)
> +		goto check_next;
> +
> +	if (real_prev == next)
> +		return;
> +
> +	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> +		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> +		goto done;
> +	}
> +
> +	asm volatile(
> +		/* Fill the cache */
> +		"xorl	%%eax, %%eax\n"
> +		".Lfill_cache:\n"
> +		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> +		"addl	$64, %%eax\n\t"
> +		"cmpl	%%eax, %[size]\n\t"
> +		"jne	.Lfill_cache\n\t"
> +		"lfence\n"
> +		:: [flush_pages] "r" (l1d_flush_pages),
> +		    [size] "r" (size)
> +		: "eax", "ecx");

Yet moar copied code, slightly different, for no reason.

> +
> +done:
> +	this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 0);
> +	/* Make sure we clear the values before we set it again */
> +	barrier();
> +check_next:
> +	if (tsk == NULL)
> +		return;
> +
> +	/* Match the set/clear_bit barriers */
> +	smp_rmb();

What for again?

> +	/* We don't need stringent checks as we opt-in/opt-out */
> +	if (test_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH))
> +		this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 1);
> +}
> +
>  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	       struct task_struct *tsk)
>  {
> @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
>  	}
>  
> +	l1d_flush(next, tsk);

This is really the wrong place. You want to do that:

  1) Just before return to user space
  2) When entering a guest

and only when the previously running user space task was the one which
requested this massive protection.

While it's worth discussing, I'm not yet convinced that this is worth
the trouble.

Thanks,

        tglx


* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-18 23:14 ` Kees Cook
@ 2020-03-20  1:35   ` Singh, Balbir
  0 siblings, 0 replies; 15+ messages in thread
From: Singh, Balbir @ 2020-03-20  1:35 UTC (permalink / raw)
  To: keescook; +Cc: linux-kernel, Herrenschmidt, Benjamin, x86

On Wed, 2020-03-18 at 16:14 -0700, Kees Cook wrote:
> 
> On Sat, Mar 14, 2020 at 09:04:15AM +1100, Balbir Singh wrote:
> > This patch is an RFC/PoC to start the discussion on optionally flushing
> > L1D cache.  The goal is to allow tasks that are paranoid due to the recent
> > snoop assisted data sampling vulnerabilites, to flush their L1D on being
> > switched out.  This protects their data from being snooped or leaked via
> > side channels after the task has context switched out.
> > 
> > There are two scenarios we might want to protect against, a task leaving
> > the CPU with data still in L1D (which is the main concern of this
> > patch), the second scenario is a malicious task coming in (not so well
> > trusted) for which we want to clean up the cache before it starts
> > execution. The latter was proposed by benh and is not currently
> > addressed by this patch, but can be easily accommodated by the same
> > mechanism.
> > 
> > This patch adds an arch specific prctl() to flush L1D cache on context
> > switch out, the existing mechanisms of tracking prev_mm via cpu_tlbstate
> > is reused (very similar to the cond_ipbp() changes). The patch has been
> > lighted tested.
> > 
> > The points of discussion/review are:
> > 
> > 1. Discuss the use case and the right approach to address this
> 
> I think this is a nice feature to have available.

Thanks!

> 
> > 2. Does an arch prctl allowing for opt-in flushing make sense, would other
> >    arches care about something similar?
> 
> This has been talked about before (and "make the CPU safe" prctl) and it
> really came down to "things are so arch-specific that there isn't a
> great way to describe the feature". That said, L1D is going to be a
> common feature for all CPUs, but perhaps this _can_ be a general prctl
> with arch-specific support for speed-ups.
> 

Sounds good

> > 3. There is a fallback software L1D load, similar to what L1TF does, but
> >    we don't prefetch the TLB, is that sufficient?
> 
> Isn't there a function to force a TLB flush? I don't know that area
> well.

Yes, there is, but I suspect it is not needed in that case.

> 
> > 4. The atomics can be improved and we could use a static key like ibpb
> >    does to optimize the code path
> 
> Notes below...
> 
> > 5. The code works with a special hack for 64 bit systems (TIF_L1D_FLUSH
> >    is bit 32), we could generalize it with some effort
> 
> Or we could upgrade the TIF bits to u64?
> 

tglx recommended some free bits that I potentially missed.

> > 6. Should we consider cleaning up the L1D on arrival of tasks?
> 
> I don't think so? Maybe I don't know what you mean, but it seems like
> this only needs to pay attention when the TIF flag is set.
> 
> > In summary, this is an early PoC to start the discussion on the need for
> > conditional L1D flushing based on the security posture of the
> > application and the sensitivity of the data it has access to or might
> > have access to.
> > 
> > Cc: keescook@chromium.org
> > Cc: benh@amazon.com
> > 
> > Signed-off-by: Balbir Singh <sblbir@amazon.com>
> > ---
> >  arch/x86/include/asm/thread_info.h |  8 +++
> >  arch/x86/include/asm/tlbflush.h    |  6 ++
> >  arch/x86/include/uapi/asm/prctl.h  |  3 +
> >  arch/x86/kernel/process_64.c       | 12 +++-
> >  arch/x86/mm/tlb.c                  | 89 ++++++++++++++++++++++++++++++
> >  5 files changed, 117 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/include/asm/thread_info.h
> > b/arch/x86/include/asm/thread_info.h
> > index 8de8ceccb8bc..c48ebfa17805 100644
> > --- a/arch/x86/include/asm/thread_info.h
> > +++ b/arch/x86/include/asm/thread_info.h
> > @@ -103,6 +103,9 @@ struct thread_info {
> >  #define TIF_ADDR32           29      /* 32-bit address space on 64 bits
> > */
> >  #define TIF_X32                      30      /* 32-bit native x86-64
> > binary */
> >  #define TIF_FSCHECK          31      /* Check FS is USER_DS on return */
> > +#ifdef CONFIG_64BIT
> > +#define TIF_L1D_FLUSH           32      /* Flush L1D on mm switches
> > (processes) */
> > +#endif
> > 
> >  #define _TIF_SYSCALL_TRACE   (1 << TIF_SYSCALL_TRACE)
> >  #define _TIF_NOTIFY_RESUME   (1 << TIF_NOTIFY_RESUME)
> > @@ -132,6 +135,9 @@ struct thread_info {
> >  #define _TIF_ADDR32          (1 << TIF_ADDR32)
> >  #define _TIF_X32             (1 << TIF_X32)
> >  #define _TIF_FSCHECK         (1 << TIF_FSCHECK)
> > +#ifdef CONFIG_64BIT
> > +#define _TIF_L1D_FLUSH               (1UL << TIF_L1D_FLUSH)
> > +#endif
> 
> If we're out of TIF flags we should move to u64 explicitly, maybe?
> 
> > 
> >  /* Work to do before invoking the actual syscall. */
> >  #define _TIF_WORK_SYSCALL_ENTRY      \
> > @@ -239,6 +245,8 @@ extern void arch_task_cache_init(void);
> >  extern int arch_dup_task_struct(struct task_struct *dst, struct
> > task_struct *src);
> >  extern void arch_release_task_struct(struct task_struct *tsk);
> >  extern void arch_setup_new_exec(void);
> > +extern void enable_l1d_flush_for_task(struct task_struct *tsk);
> > +extern void disable_l1d_flush_for_task(struct task_struct *tsk);
> >  #define arch_setup_new_exec arch_setup_new_exec
> >  #endif       /* !__ASSEMBLY__ */
> > 
> > diff --git a/arch/x86/include/asm/tlbflush.h
> > b/arch/x86/include/asm/tlbflush.h
> > index 6f66d841262d..1d535059b358 100644
> > --- a/arch/x86/include/asm/tlbflush.h
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -219,6 +219,12 @@ struct tlb_state {
> >        */
> >       unsigned long cr4;
> > 
> > +     /*
> > +      * Flush the L1D cache on switch_mm_irqs_off() for a
> > +      * task getting off the CPU, if it opted in to do so
> > +      */
> > +     bool last_user_mm_l1d_flush;
> 
> Move up with other bools. Maybe we need to make these bools bit fields
> soon? We're wasting a lot of space in struct tlb_state.
> 
> > +
> >       /*
> >        * This is a list of all contexts that might exist in the TLB.
> >        * There is one per ASID that we use, and the ASID (what the
> > diff --git a/arch/x86/include/uapi/asm/prctl.h
> > b/arch/x86/include/uapi/asm/prctl.h
> > index 5a6aac9fa41f..1361e5e25791 100644
> > --- a/arch/x86/include/uapi/asm/prctl.h
> > +++ b/arch/x86/include/uapi/asm/prctl.h
> > @@ -14,4 +14,7 @@
> >  #define ARCH_MAP_VDSO_32     0x2002
> >  #define ARCH_MAP_VDSO_64     0x2003
> > 
> > +#define ARCH_SET_L1D_FLUSH   0x3001
> > +#define ARCH_GET_L1D_FLUSH   0x3002
> > +
> >  #endif /* _ASM_X86_PRCTL_H */
> > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > index ffd497804dbc..df9f8775ee94 100644
> > --- a/arch/x86/kernel/process_64.c
> > +++ b/arch/x86/kernel/process_64.c
> > @@ -700,7 +700,17 @@ long do_arch_prctl_64(struct task_struct *task, int
> > option, unsigned long arg2)
> >       case ARCH_MAP_VDSO_64:
> >               return prctl_map_vdso(&vdso_image_64, arg2);
> >  #endif
> > -
> > +#ifdef CONFIG_64BIT
> > +     case ARCH_GET_L1D_FLUSH:
> > +             return test_ti_thread_flag(&task->thread_info,
> > TIF_L1D_FLUSH);
> > +     case ARCH_SET_L1D_FLUSH: {
> > +             if (arg2 >= 1)
> > +                     enable_l1d_flush_for_task(task);
> > +             else
> > +                     disable_l1d_flush_for_task(task);
> > +             break;
> > +     }
> > +#endif
> 
> I think this should probably be a generic prctl() with the software
> fall-back run in the arch-agnostic code.
> 

Hmm.. OK, so you see that as how other arches can opt-in? The real issue is
that software fall-back is hard. In the case of x86 for example, we know that
the cache replacement is not true LRU, so we need to write 64K for a 32K
cache. I am afraid every arch has its quirkiness, so this code needs to live
in the arches.

> >       default:
> >               ret = -EINVAL;
> >               break;
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 66f96f21a7b6..35a3970df0ef 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -151,6 +151,92 @@ void leave_mm(int cpu)
> >  }
> >  EXPORT_SYMBOL_GPL(leave_mm);
> > 
> > +#define L1D_CACHE_ORDER 4
> > +static void *l1d_flush_pages;
> > +static DEFINE_MUTEX(l1d_flush_mutex);
> > +
> > +void enable_l1d_flush_for_task(struct task_struct *tsk)
> > +{
> > +     struct page *page;
> > +
> > +     if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> > +             goto done;
> > +
> > +     mutex_lock(&l1d_flush_mutex);
> > +     if (l1d_flush_pages)
> > +             goto done;
> > +     /*
> > +      * These pages are never freed, we use the same
> > +      * set of pages across multiple processes/contexts
> > +      */
> > +     page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
> > +     if (!page)
> > +             return;
> > +
> > +     l1d_flush_pages = page_address(page);
> > +     /* I don't think we need to worry about KSM */
> > +done:
> > +     set_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> > +     mutex_unlock(&l1d_flush_mutex);
> > +}
> 
> I have no suggestions that feel better, but this seems weird to have
> this lock and dynamic allocation. We save memory if no process ever sets
> the flag, but it just feels weird to have to take a global lock for a
> per-process TIF flag setting.
> 
> How about READ_ONCE() on l1d_flush_pages, and use the lock for doing the
> allocation if it's NULL? Something like:
> 
> if (arch_has_flush_l1d()) {
>         struct page *page = READ_ONCE(l1d_flush_pages);
> 
>         if (unlikely(!page)) {
>                 mutex_lock(&l1d_flush_mutex);
>                 page = READ_ONCE(l1d_flush_pages);
>                 if (!page) {
>                         l1d_flush_pages = alloc_pages(GFP_KERNEL |
> __GFP_ZERO,
>                                                       L1D_CACHE_ORDER);
>                 }
>                 page = l1d_flush_pages;
>                 mutex_unlock(&l1d_flush_mutex);
>                 if (!page)
>                         return -ENOMEM;
>         }
> }

Yep, that works too

> 
> set_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> return 0;
> 
> And that gets rid of gotos.
> 
> And this needs to return errors in the ENOMEM case, yes?
> 
> > +
> > +void disable_l1d_flush_for_task(struct task_struct *tsk)
> > +{
> > +     clear_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> > +     smp_mb__after_atomic();
> > +}
> > +
> > +/*
> > + * Flush the L1D cache for this CPU. We want to this at switch mm time,
> > + * this is a pessimistic security measure and an opt-in for those tasks
> > + * that host sensitive information and there are concerns about spills
> > + * from fill buffers.
> > + */
> > +static void l1d_flush(struct mm_struct *next, struct task_struct *tsk)
> > +{
> > +     struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> > +     int size = PAGE_SIZE << L1D_CACHE_ORDER;
> > +
> > +     if (this_cpu_read(cpu_tlbstate.last_user_mm_l1d_flush) == 0)
> > +             goto check_next;
> > +
> > +     if (real_prev == next)
> > +             return;
> > +
> > +     if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> > +             wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> > +             goto done;
> > +     }
> 
> This mix of gotos and returns makes this hard to read, IMO. Please just
> if/else should be fine here.
> 

Yes, I can refactor the code.

> > +
> > +     asm volatile(
> > +             /* Fill the cache */
> > +             "xorl   %%eax, %%eax\n"
> > +             ".Lfill_cache:\n"
> > +             "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> > +             "addl   $64, %%eax\n\t"
> > +             "cmpl   %%eax, %[size]\n\t"
> > +             "jne    .Lfill_cache\n\t"
> > +             "lfence\n"
> > +             :: [flush_pages] "r" (l1d_flush_pages),
> > +                 [size] "r" (size)
> > +             : "eax", "ecx");
> 
> Can the "generic" routine just be in C and in the prctl()?
> 

No.. see above.

> > +
> > +done:
> > +     this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 0);
> > +     /* Make sure we clear the values before we set it again */
> > +     barrier();
> > +check_next:
> > +     if (tsk == NULL)
> > +             return;
> > +
> > +     /* Match the set/clear_bit barriers */
> > +     smp_rmb();
> > +
> > +     /* We don't need stringent checks as we opt-in/opt-out */
> > +     if (test_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH))
> > +             this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 1);
> > +}
> 
> I'm less familiar with how tlb_state relates to the current thread, so
> this last_user_mm_l1d_flush doesn't make sense to me. (Nor the real_prev
> == next check.) I assume this is just details from mm switching.
> 

tlbstate stores the previous mm, and the idea is to flush the cache on
eviction. switch_mm_irqs_off() has the next and prev mm and the next task,
but what we need is the previous task (to check the TIF flag); to help with
that, we mark the tlbstate as needing an L1D flush when the current task
goes off the CPU.
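
In pseudocode, the intent is roughly this (prev_mm/next_mm/next_tsk are
shorthand, barriers omitted):

	/* l1d_flush(), called from switch_mm_irqs_off() */
	if (cpu_tlbstate.last_user_mm_l1d_flush) {
		if (prev_mm == next_mm)
			return;		/* same mm coming back, skip */
		flush_l1d();		/* the outgoing task asked for it */
		cpu_tlbstate.last_user_mm_l1d_flush = 0;
	}
	if (next_tsk && test_ti_thread_flag(next_tsk, TIF_L1D_FLUSH))
		cpu_tlbstate.last_user_mm_l1d_flush = 1;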


> > +
> >  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> >              struct task_struct *tsk)
> >  {
> > @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> > mm_struct *next,
> >               trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
> >       }
> > 
> > +     l1d_flush(next, tsk);
> > +
> >       /* Make sure we write CR3 before loaded_mm. */
> >       barrier();
> > 
> > @@ -503,6 +591,7 @@ void initialize_tlbstate_and_flush(void)
> >       /* Reinitialize tlbstate. */
> >       this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, LAST_USER_MM_IBPB);
> >       this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
> > +     this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 0);
> >       this_cpu_write(cpu_tlbstate.next_asid, 1);
> >       this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
> >       this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
> > --
> > 2.17.1
> > 
> 


Thanks for the review

Balbir



* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-19  0:38 ` Thomas Gleixner
@ 2020-03-20  1:37   ` Singh, Balbir
  2020-03-20 11:49     ` Thomas Gleixner
  2020-03-22  5:01   ` Herrenschmidt, Benjamin
  1 sibling, 1 reply; 15+ messages in thread
From: Singh, Balbir @ 2020-03-20  1:37 UTC (permalink / raw)
  To: tglx, linux-kernel; +Cc: keescook, Herrenschmidt, Benjamin, x86

On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote:
> 
> Balbir,
> 
> Balbir Singh <sblbir@amazon.com> writes:
> > This patch is an RFC/PoC to start the discussion on optionally flushing
> > L1D cache.  The goal is to allow tasks that are paranoid due to the recent
> > snoop assisted data sampling vulnerabilites, to flush their L1D on being
> > switched out.  This protects their data from being snooped or leaked via
> > side channels after the task has context switched out.
> 
> What you mean is that L1D is flushed before another task which does not
> belong to the same trust zone returns to user space or enters guest
> mode.
> 
> > There are two scenarios we might want to protect against, a task leaving
> > the CPU with data still in L1D (which is the main concern of this
> > patch), the second scenario is a malicious task coming in (not so well
> > trusted) for which we want to clean up the cache before it starts
> > execution. The latter was proposed by benh and is not currently
> > addressed by this patch, but can be easily accommodated by the same
> > mechanism.
> 
> What's the point? The attack surface is the L1D content of the scheduled
> out task. If the malicious task schedules out, then why would you care?
> 
> I might be missing something, but AFAICT this is beyond paranoia.
> 

I think there are two cases

1. Task with important data schedules out
2. Malicious task schedules in

This patch addresses #1, but only calls out case #2.


> > The points of discussion/review are:
> > 
> > 1. Discuss the use case and the right approach to address this
> 
> It might be the quick fix for the next not yet known variant of L1D
> based horrors, so yes it's at least worth to discuss it.
> 

Yes, that is the broader thinking; there might be yet-to-be-discovered
attack vectors, and having something like this provides a good workaround.

> > 2. Does an arch prctl allowing for opt-in flushing make sense, would other
> >    arches care about something similar?
> 
> No idea, but I assume in the light of L1D based issues that might be the
> case. Though that still is per architecture as the L1D flush mechanisms
> are architecture specific. Aside of that I don't think that more than a
> few will enable / support it.
> 

Fair enough

> > 3. There is a fallback software L1D load, similar to what L1TF does, but
> >    we don't prefetch the TLB, is that sufficient?
> 
> If we go there, then the KVM L1D flush code wants to move into general
> x86 code.

OK, we can definitely consider reusing code, but I think the KVM bits
require TLB prefetching before the cache flush, IIUC, to negate any bad
translations associated with an L1TF fault; the code/comments are not clear
on the need to do so.

> 
> > 4. The atomics can be improved and we could use a static key like ibpb
> >    does to optimize the code path
> 
> Yes to static key.

Sure, will do
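
Something along these lines, as a sketch (the key name is invented here,
modelled on the conditional IBPB keys):

	DEFINE_STATIC_KEY_FALSE(switch_mm_l1d_flush);

	void enable_l1d_flush_for_task(struct task_struct *tsk)
	{
		/* ... page allocation as before ... */
		set_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
		static_branch_enable(&switch_mm_l1d_flush);
	}

	/* and in switch_mm_irqs_off(): */
	if (static_branch_unlikely(&switch_mm_l1d_flush))
		l1d_flush(next, tsk);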

> 
> > 5. The code works with a special hack for 64 bit systems (TIF_L1D_FLUSH
> >    is bit 32), we could generalize it with some effort
> 
> Why so? There are gaps in the TIF flags (18, 23, 26). Why do you want to
> push that to 32?

I might have missed the holes; I can move to using one of them, thanks.

> 
> > 6. Should we consider cleaning up the L1D on arrival of tasks?
> 
> That's the Ben idea, right? If so see above.

Yes

> 
> > +#define L1D_CACHE_ORDER 4
> > +static void *l1d_flush_pages;
> > +static DEFINE_MUTEX(l1d_flush_mutex);
> > +
> > +void enable_l1d_flush_for_task(struct task_struct *tsk)
> > +{
> > +     struct page *page;
> > +
> > +     if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> > +             goto done;
> > +
> > +     mutex_lock(&l1d_flush_mutex);
> > +     if (l1d_flush_pages)
> > +             goto done;
> > +     /*
> > +      * These pages are never freed, we use the same
> > +      * set of pages across multiple processes/contexts
> > +      */
> > +     page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
> > +     if (!page)
> > +             return;
> > +
> > +     l1d_flush_pages = page_address(page);
> > +     /* I don't think we need to worry about KSM */
> 
> Why not? Even if it wouldn't be necessary why would we care as this is a
> once per boot operation in fully preemptible code.

Not sure I understand your question; I was stating that even if KSM were
running, it would not impact us (with dedup), as we'd still be writing out
0s to the cache line in the fallback case.

> 
> Aside of that why do we need another pointlessly different copy of what
> we have in the VMX code?
> 
> > +done:
> > +     set_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> > +     mutex_unlock(&l1d_flush_mutex);
> > +}
> > +
> > +void disable_l1d_flush_for_task(struct task_struct *tsk)
> > +{
> > +     clear_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> > +     smp_mb__after_atomic();
> 
> Lacks an explanation / comment what this smp_mb is for and where the
> counterpart lives.
> 
> Aside of that, this is pointless AFAICT. Disable/enable is really not a
> concern of being perfect. If you want perfect protection, simply switch
> off your computer.

Yes, the counterpart was in l1d_flush(), which did an smp_rmb(), but the
comment does point out that it can and will be lazy, so we can safely
remove this bit if needed.

> 
> > +/*
> > + * Flush the L1D cache for this CPU. We want to this at switch mm time,
> > + * this is a pessimistic security measure and an opt-in for those tasks
> > + * that host sensitive information and there are concerns about spills
> > + * from fill buffers.
> > + */
> > +static void l1d_flush(struct mm_struct *next, struct task_struct *tsk)
> > +{
> > +     struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> > +     int size = PAGE_SIZE << L1D_CACHE_ORDER;
> > +
> > +     if (this_cpu_read(cpu_tlbstate.last_user_mm_l1d_flush) == 0)
> > +             goto check_next;
> > +
> > +     if (real_prev == next)
> > +             return;
> > +
> > +     if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> > +             wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> > +             goto done;
> > +     }
> > +
> > +     asm volatile(
> > +             /* Fill the cache */
> > +             "xorl   %%eax, %%eax\n"
> > +             ".Lfill_cache:\n"
> > +             "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> > +             "addl   $64, %%eax\n\t"
> > +             "cmpl   %%eax, %[size]\n\t"
> > +             "jne    .Lfill_cache\n\t"
> > +             "lfence\n"
> > +             :: [flush_pages] "r" (l1d_flush_pages),
> > +                 [size] "r" (size)
> > +             : "eax", "ecx");
> 
> Yet moar copied code slighlty different for no reason.
> 

Different because of the need to not prefetch the TLB

> > +
> > +done:
> > +     this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 0);
> > +     /* Make sure we clear the values before we set it again */
> > +     barrier();
> > +check_next:
> > +     if (tsk == NULL)
> > +             return;
> > +
> > +     /* Match the set/clear_bit barriers */
> > +     smp_rmb();
> 
> What for again?

Yes, not needed (the comment below sort of says that), I can remove this

> 
> > +     /* We don't need stringent checks as we opt-in/opt-out */
> > +     if (test_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH))
> > +             this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 1);
> > +}
> > +
> >  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> >              struct task_struct *tsk)
> >  {
> > @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> > mm_struct *next,
> >               trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
> >       }
> > 
> > +     l1d_flush(next, tsk);
> 
> This is really the wrong place. You want to do that:
> 
>   1) Just before return to user space
>   2) When entering a guest
> 
> and only when the previously running user space task was the one which
> requested this massive protection.
> 

Cases 1 and 2 are handled via

1. SWAPGS fixes/work arounds (unless I misunderstood your suggestion)
2. L1TF fault handling

This mechanism allows for flushing not restricted to 1 or 2; the idea is to
immediately flush L1D for paranoid processes on mm switch.

> > 

> While it's worth to discuss, I'm not yet convinced that this is worth
> the trouble.
> 

Are you suggesting the mechanism is not worth building?

Balbir Singh.


* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-20  1:37   ` Singh, Balbir
@ 2020-03-20 11:49     ` Thomas Gleixner
  2020-03-21  1:42       ` Singh, Balbir
  2020-03-22  5:08       ` Herrenschmidt, Benjamin
  0 siblings, 2 replies; 15+ messages in thread
From: Thomas Gleixner @ 2020-03-20 11:49 UTC (permalink / raw)
  To: Singh, Balbir, linux-kernel; +Cc: keescook, Herrenschmidt, Benjamin, x86

Balbir,

"Singh, Balbir" <sblbir@amazon.com> writes:
> On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote:
>> What's the point? The attack surface is the L1D content of the scheduled
>> out task. If the malicious task schedules out, then why would you care?
>> 
>> I might be missing something, but AFAICT this is beyond paranoia.
>> 
>
> I think there are two cases
>
> 1. Task with important data schedules out
> 2. Malicious task schedules in
>
> These patches address 1, but call out case #2

The point is if the victim task schedules out, then there is no reason
to flush L1D immediately in context switch. If that just schedules a
kernel thread and then goes back to the task, then there is no point
unless you do not even trust the kernel thread.

>> > 3. There is a fallback software L1D load, similar to what L1TF does, but
>> >    we don't prefetch the TLB, is that sufficient?
>> 
>> If we go there, then the KVM L1D flush code wants to move into general
>> x86 code.
>
> OK.. we can definitely consider reusing code, but I think the KVM bits require
> tlb prefetching, IIUC before cache flush to negate any bad translations
> associated with an L1TF fault, but the code/comments are not clear on the need
> to do so.

I forgot the gory details by now, but having two entry points or a
conditional and sharing the rest (page allocation etc.) is definitely
better than two slightly different implementations which basically do the
same thing.

>> > +void enable_l1d_flush_for_task(struct task_struct *tsk)
>> > +{
>> > +     struct page *page;
>> > +
>> > +     if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
>> > +             goto done;
>> > +
>> > +     mutex_lock(&l1d_flush_mutex);
>> > +     if (l1d_flush_pages)
>> > +             goto done;
>> > +     /*
>> > +      * These pages are never freed, we use the same
>> > +      * set of pages across multiple processes/contexts
>> > +      */
>> > +     page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
>> > +     if (!page)
>> > +             return;
>> > +
>> > +     l1d_flush_pages = page_address(page);
>> > +     /* I don't think we need to worry about KSM */
>> 
>> Why not? Even if it wouldn't be necessary why would we care as this is a
>> once per boot operation in fully preemptible code.
>
> Not sure I understand your question, I was stating that even if KSM was
> running, it would not impact us (with dedup), as we'd still be writing out 0s
> to the cache line in the fallback case.

I probably confused myself vs. the comment in the VMX code, but that
mentions nested virt. Needs at least some consideration when we reuse
that code.

>> >  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>> >              struct task_struct *tsk)
>> >  {
>> > @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
>> > mm_struct *next,
>> >               trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
>> >       }
>> > 
>> > +     l1d_flush(next, tsk);
>> 
>> This is really the wrong place. You want to do that:
>> 
>>   1) Just before return to user space
>>   2) When entering a guest
>> 
>> and only when the previously running user space task was the one which
>> requested this massive protection.
>> 
>
> Cases 1 and 2 are handled via
>
> 1. SWAPGS fixes/work arounds (unless I misunderstood your suggestion)

How so? The SWAPGS mitigation does not flush L1D. It merely serializes SWAPGS.

> 2. L1TF fault handling
>
> This mechanism allows for flushing not restricted to 1 or 2, the idea is to
> immediately flush L1D for paranoid processes on mm switch.

Why? To protect the victim task against the malicious kernel?

The L1D content of the victim is endangered in the following case:

    victim out -> attacker in

The attacker can either run in user space or in guest mode. So the flush
is only interesting when the attacker actually goes back to user space
or reenters the guest.

The following is completely uninteresting:

    victim out -> kernel thread in/out -> victim in

Even this is uninteresting:

    victim in -> attacker in (stays in kernel, e.g. waits for data) ->
    attacker out -> victim in

So the point where you want to flush conditionally is when the attacker
leaves kernel space either to user mode or guest mode.

So if the victim schedules out it sets a per cpu request to flush L1D
on the borders.

And then you have on return to user:

    if (this_cpu_flush_l1d())
    	flush_l1d()

and in kvm:

    if (this_cpu_flush_l1d() || L1TF_flush_L1D)
    	flush_l1d()

The request does:

    if (!this_cpu_read(l1d_flush_for_task))
        this_cpu_write(l1d_flush_for_task, current)

The check does:

    p = this_cpu_read(l1d_flush_for_task);
    if (p) {
        this_cpu_write(l1d_flush_for_task, NULL);
        return p != current;
    }
    return false;

Hmm?

Thanks,

        tglx

                


* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-20 11:49     ` Thomas Gleixner
@ 2020-03-21  1:42       ` Singh, Balbir
  2020-03-21 10:05         ` Thomas Gleixner
  2020-03-22  5:08       ` Herrenschmidt, Benjamin
  1 sibling, 1 reply; 15+ messages in thread
From: Singh, Balbir @ 2020-03-21  1:42 UTC (permalink / raw)
  To: tglx, linux-kernel; +Cc: keescook, Herrenschmidt, Benjamin, x86

On Fri, 2020-03-20 at 12:49 +0100, Thomas Gleixner wrote:
> 
> Balbir,
> 
> "Singh, Balbir" <sblbir@amazon.com> writes:
> > On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote:
> > > What's the point? The attack surface is the L1D content of the scheduled
> > > out task. If the malicious task schedules out, then why would you care?
> > > 
> > > I might be missing something, but AFAICT this is beyond paranoia.
> > > 
> > 
> > I think there are two cases
> > 
> > 1. Task with important data schedules out
> > 2. Malicious task schedules in
> > 
> > These patches address 1, but call out case #2
> 
> The point is if the victim task schedules out, then there is no reason
> to flush L1D immediately in context switch. If that just schedules a
> kernel thread and then goes back to the task, then there is no point
> unless you do not even trust the kernel thread.

Yes, the implementation tries to do that by tracking switch_mm(). 

> 
> > > > 3. There is a fallback software L1D load, similar to what L1TF does,
> > > > but
> > > >    we don't prefetch the TLB, is that sufficient?
> > > 
> > > If we go there, then the KVM L1D flush code wants to move into general
> > > x86 code.
> > 
> > OK.. we can definitely consider reusing code, but I think the KVM bits
> > require
> > tlb prefetching, IIUC before cache flush to negate any bad translations
> > associated with an L1TF fault, but the code/comments are not clear on the
> > need
> > to do so.
> 
> I forgot the gory details by now, but having two entry points or a
> conditional and share the rest (page allocation etc.) is definitely
> better than two slightly different implementation which basically do the
> same thing.

OK, I can try and dedup them to the extent possible, but please do remember
that 

1. KVM is usually loaded as a module
2. KVM is optional

We can share code, by putting the common bits in the core kernel.
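
Roughly, a common helper along these lines (purely a sketch; the name is
made up and the software fallback below is simplified compared to what
the KVM code does today):

    /*
     * Sketch of a shared flush primitive callable from both the
     * context switch path and KVM. Assumes the fallback pages were
     * allocated once at opt-in time as in this RFC.
     */
    void arch_l1d_flush(void *flush_pages)
    {
            int size = PAGE_SIZE << L1D_CACHE_ORDER;
            int i;

            if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
                    wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
                    return;
            }

            /* SW fallback: read one byte per cache line to displace L1D */
            for (i = 0; i < size; i += 64)
                    (void)*((volatile char *)flush_pages + i);
    }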

> 
> > > > +void enable_l1d_flush_for_task(struct task_struct *tsk)
> > > > +{
> > > > +     struct page *page;
> > > > +
> > > > +     if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> > > > +             goto done;
> > > > +
> > > > +     mutex_lock(&l1d_flush_mutex);
> > > > +     if (l1d_flush_pages)
> > > > +             goto done;
> > > > +     /*
> > > > +      * These pages are never freed, we use the same
> > > > +      * set of pages across multiple processes/contexts
> > > > +      */
> > > > +     page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
> > > > +     if (!page)
> > > > +             return;
> > > > +
> > > > +     l1d_flush_pages = page_address(page);
> > > > +     /* I don't think we need to worry about KSM */
> > > 
> > > Why not? Even if it wouldn't be necessary why would we care as this is a
> > > once per boot operation in fully preemptible code.
> > 
> > Not sure I understand your question, I was stating that even if KSM was
> > running, it would not impact us (with dedup), as we'd still be writing out
> > 0s
> > to the cache line in the fallback case.
> 
> I probably confused myself vs. the comment in the VMX code, but that
> mentions nested virt. Needs at least some consideration when we reuse
> that code.

Yes, definitely!

> 
> > > >  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > > >              struct task_struct *tsk)
> > > >  {
> > > > @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev,
> > > > struct
> > > > mm_struct *next,
> > > >               trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
> > > >       }
> > > > 
> > > > +     l1d_flush(next, tsk);
> > > 
> > > This is really the wrong place. You want to do that:
> > > 
> > >   1) Just before return to user space
> > >   2) When entering a guest
> > > 
> > > and only when the previously running user space task was the one which
> > > requested this massive protection.
> > > 
> > 
> > Cases 1 and 2 are handled via
> > 
> > 1. SWAPGS fixes/work arounds (unless I misunderstood your suggestion)
> 
> How so? SWAPGS mitigation does not flush L1D. It merely serializes SWAPGS.

Sorry, my bad, I was thinking of MDS_CLEAR (via verw), which does flush out
things and which I suspect should be sufficient from a return to user/signal
handling perspective. Right now, reading through 
https://software.intel.com/security-software-guidance/insights/deep-dive-snoop-assisted-l1-data-sampling
, it does seem like we need this during a context switch, specifically since a
dirty cache line can cause snooped reads for the attacker to leak data. Am I
missing anything?

> 
> > 2. L1TF fault handling
> > 
> > This mechanism allows for flushing not restricted to 1 or 2, the idea is
> > to
> > immediately flush L1D for paranoid processes on mm switch.
> 
> Why? To protect the victim task against the malicious kernel?
> 
> The L1D content of the victim is endangered in the following case:
> 
>     victim out -> attacker in
> 
> The attacker can either run in user space or in guest mode. So the flush
> is only interesting when the attacker actually goes back to user space
> or reenters the guest.
> 
> The following is completely uninteresting:
> 
>     victim out -> kernel thread in/out -> victim in
> 
> Even this is uninteresting:
> 
>     victim in -> attacker in (stays in kernel, e.g. waits for data) ->
>     attacker out -> victim in
> 

Not from what I understand from the link above: the attack is a function of
what can be snooped by another core/thread, and that is a function of what
modified secrets are in the cache line/store buffer.


> So the point where you want to flush conditionally is when the attacker
> leaves kernel space either to user mode or guest mode.
> 
> So if the victim schedules out it sets a per cpu request to flush L1D
> on the borders.
> 
> And then you have on return to user:
> 
>     if (this_cpu_flush_l1d())
>         flush_l1d()
> 
> and in kvm:
> 
>     if (this_cpu_flush_l1d() || L1TF_flush_L1D)
>         flush_l1d()
> 
> The request does:
> 
>     if (!this_cpu_read(l1d_flush_for_task))
>         this_cpu_write(l1d_flush_for_task, current)
> 
> The check does:
> 
>     p = this_cpu_read(l1d_flush_for_task);
>     if (p) {
>         this_cpu_write(l1d_flush_for_task, NULL);
>         return p != current;
>     }
>     return false;
> 
> Hmm?

On return to user, we already use VERW (verw), but return-to-user protection
alone is not sufficient IMHO. Based on the link above, we need to clear
the L1D cache before it can be snooped.

Thanks for the review,
Balbir Singh


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-21  1:42       ` Singh, Balbir
@ 2020-03-21 10:05         ` Thomas Gleixner
  2020-03-22  5:10           ` Herrenschmidt, Benjamin
  2020-03-23  0:37           ` Singh, Balbir
  0 siblings, 2 replies; 15+ messages in thread
From: Thomas Gleixner @ 2020-03-21 10:05 UTC (permalink / raw)
  To: Singh, Balbir, linux-kernel; +Cc: keescook, Herrenschmidt, Benjamin, x86

Balbir,

"Singh, Balbir" <sblbir@amazon.com> writes:
> On Fri, 2020-03-20 at 12:49 +0100, Thomas Gleixner wrote:
>> I forgot the gory details by now, but having two entry points or a
>> conditional and share the rest (page allocation etc.) is definitely
>> better than two slightly different implementation which basically do the
>> same thing.
>
> OK, I can try and dedup them to the extent possible, but please do remember
> that 
>
> 1. KVM is usually loaded as a module
> 2. KVM is optional
>
> We can share code, by putting the common bits in the core kernel.

Obviously so.

>> > 1. SWAPGS fixes/work arounds (unless I misunderstood your suggestion)
>> 
>> How so? SWAPGS mitigation does not flush L1D. It merely serializes SWAPGS.
>
> Sorry, my bad, I was thinking of MDS_CLEAR (via verw), which does flush out
> things and which I suspect should be sufficient from a return to user/signal
> handling perspective.

MDS is affecting store buffers, fill buffers and load ports. Different story.

> Right now, reading through 
> https://software.intel.com/security-software-guidance/insights/deep-dive-snoop-assisted-l1-data-sampling
> , it does seem like we need this during a context switch, specifically since a
> dirty cache line can cause snooped reads for the attacker to leak data. Am I
> missing anything?

Yes. The way this goes is:

CPU0                   CPU1

victim1
 store secrit
                        victim2
attacker                  read secrit

Now if L1D is flushed on CPU0 before attacker reaches user space,
i.e. reaches the attack code, then there is nothing to see. From the
link:

  Similar to the L1TF VMM mitigations, snoop-assisted L1D sampling can be
  mitigated by flushing the L1D cache between when secrets are accessed
  and when possibly malicious software runs on the same core.

So the important point is to flush _before_ the attack code runs which
involves going back to user space or guest mode.

>> Even this is uninteresting:
>> 
>>     victim in -> attacker in (stays in kernel, e.g. waits for data) ->
>>     attacker out -> victim in
>> 
>
> Not from what I understand from the link above: the attack is a function of
> what can be snooped by another core/thread, and that is a function of what
> modified secrets are in the cache line/store buffer.

Forget HT. That's not fixable by any flushing simply because there is no
scheduling involved.

CPU0  HT0          CPU0 HT1		CPU1

victim1            attacker
 store secrit
                        		victim2
                                          read secrit

> On return to user, we already use VERW (verw), but return-to-user protection
> alone is not sufficient IMHO. Based on the link above, we need to clear
> the L1D cache before it can be snooped.

Again. Flush is required between store and attacker running attack
code. The attacker _cannot_ run attack code while it is in the kernel so
flushing L1D on context switch is just voodoo.

If you want to cure the HT case with core scheduling then the scenario
looks like this:

CPU0  HT0          CPU0 HT1		CPU1

victim1            IDLE
 store secrit
-> IDLE
                   attacker in 		victim2
                                          read secrit

And yes, there the context switch flush on HT0 prevents it. So this can
be part of a core scheduling based mitigation or handled via a per core
flush request.

But HT is attackable in so many ways ...

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-19  0:38 ` Thomas Gleixner
  2020-03-20  1:37   ` Singh, Balbir
@ 2020-03-22  5:01   ` Herrenschmidt, Benjamin
  1 sibling, 0 replies; 15+ messages in thread
From: Herrenschmidt, Benjamin @ 2020-03-22  5:01 UTC (permalink / raw)
  To: tglx, linux-kernel, Singh, Balbir; +Cc: keescook, x86

On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote:
> Balbir,
> 
> Balbir Singh <sblbir@amazon.com> writes:
> > This patch is an RFC/PoC to start the discussion on optionally flushing
> > L1D cache.  The goal is to allow tasks that are paranoid due to the recent
> > snoop assisted data sampling vulnerabilites, to flush their L1D on being
> > switched out.  This protects their data from being snooped or leaked via
> > side channels after the task has context switched out.
> 
> What you mean is that L1D is flushed before another task which does not
> belong to the same trust zone returns to user space or enters guest
> mode.

To some extent, though we haven't really got a concept of "trust zone"
at this point, so we assume an mm will do. Typically it's going to be
tasks that manipulate sensitive data (customer data for example) that
want to have extra protection against other tasks (or containers) in
the system.

One use case is language runtimes running different customer programs
in different tasks needing to be protected from snooping at each other.

> > There are two scenarios we might want to protect against, a task leaving
> > the CPU with data still in L1D (which is the main concern of this
> > patch), the second scenario is a malicious task coming in (not so well
> > trusted) for which we want to clean up the cache before it starts
> > execution. The latter was proposed by benh and is not currently
> > addressed by this patch, but can be easily accommodated by the same
> > mechanism.
> 
> What's the point? The attack surface is the L1D content of the scheduled
> out task. If the malicious task schedules out, then why would you care?
> 
> I might be missing something, but AFAICT this is beyond paranoia.

So I was just suggesting the symmetric variant. I.e. the original issue
we try to address is a "high value" task whose data we want to prevent
from being snooped by a "low value" environment (i.e. another task). Thus
flush when switching out.

I was suggesting the other way around might be useful for other
reasons: a task running untrusted code (ie, your browser javascript
interpreter) which you want to disallow any chance of snooping the L1D
of other things in your system. In this case, you want a flush on the
way into the task to protect everything else.

Both are opt-in.

> > The points of discussion/review are:
> > 
> > 1. Discuss the use case and the right approach to address this
> 
> It might be the quick fix for the next not yet known variant of L1D
> based horrors, so yes it's at least worth to discuss it.

It also addresses existing variants for which we have workarounds for VMs
but not necessarily for processes/containers. But yes, the idea here is
to have a somewhat "blanket" protection of data. It's never perfect
(branch caches etc...) but it's useful when we know we are running some
tasks with sensitive data and are willing to pay the price on context
switch for these.

> > 2. Does an arch prctl allowing for opt-in flushing make sense, would other
> >    arches care about something similar?
> 
> No idea, but I assume in the light of L1D based issues that might be the
> case. Though that still is per architecture as the L1D flush mechanisms
> are architecture specific. Aside of that I don't think that more than a
> few will enable / support it.

We should strive to define the prctl itself in as arch-agnostic a
way as possible. I suspect we'll want the same for ARM64 soon :-) This
is in line with what Kees suggests in an earlier response. The
implementation remains arch specific.
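
Just to illustrate what I mean from the userspace side (entirely made-up
name and value, not an existing prctl):

    #include <stdio.h>
    #include <sys/prctl.h>

    /* Hypothetical generic opt-in, for illustration only */
    #ifndef PR_SET_L1D_FLUSH
    #define PR_SET_L1D_FLUSH 65
    #endif

    int main(void)
    {
            if (prctl(PR_SET_L1D_FLUSH, 1, 0, 0, 0) != 0)
                    perror("prctl(PR_SET_L1D_FLUSH)");

            /* ... go on to handle the sensitive data ... */
            return 0;
    }

The same call would then work unchanged on x86 and ARM64, with each arch
supplying its own flush implementation underneath.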

> > 3. There is a fallback software L1D load, similar to what L1TF does, but
> >    we don't prefetch the TLB, is that sufficient?
> 
> If we go there, then the KVM L1D flush code wants to move into general
> x86 code.
> 
> > 4. The atomics can be improved and we could use a static key like ibpb
> >    does to optimize the code path
> 
> Yes to static key.
> 
> > 5. The code works with a special hack for 64 bit systems (TIF_L1D_FLUSH
> >    is bit 32), we could generalize it with some effort
> 
> Why so? There are gaps in the TIF flags (18, 23, 26). Why do you want to
> push that to 32?
> 
> > 6. Should we consider cleaning up the L1D on arrival of tasks?
> 
> That's the Ben idea, right? If so see above.
> 
> > +#define L1D_CACHE_ORDER 4
> > +static void *l1d_flush_pages;
> > +static DEFINE_MUTEX(l1d_flush_mutex);
> > +
> > +void enable_l1d_flush_for_task(struct task_struct *tsk)
> > +{
> > +	struct page *page;
> > +
> > +	if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> > +		goto done;
> > +
> > +	mutex_lock(&l1d_flush_mutex);
> > +	if (l1d_flush_pages)
> > +		goto done;
> > +	/*
> > +	 * These pages are never freed, we use the same
> > +	 * set of pages across multiple processes/contexts
> > +	 */
> > +	page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
> > +	if (!page)
> > +		return;
> > +
> > +	l1d_flush_pages = page_address(page);
> > +	/* I don't think we need to worry about KSM */
> 
> Why not? Even if it wouldn't be necessary why would we care as this is a
> once per boot operation in fully preemptible code.
> 
> Aside of that why do we need another pointlessly different copy of what
> we have in the VMX code?
> 
> > +done:
> > +	set_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> > +	mutex_unlock(&l1d_flush_mutex);
> > +}
> > +
> > +void disable_l1d_flush_for_task(struct task_struct *tsk)
> > +{
> > +	clear_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> > +	smp_mb__after_atomic();
> 
> Lacks an explanation / comment what this smp_mb is for and where the
> counterpart lives.
> 
> Aside of that, this is pointless AFAICT. Disable/enable is really not a
> concern of being perfect. If you want perfect protection, simply switch
> off your computer.
> 
> > +/*
> > + * Flush the L1D cache for this CPU. We want to this at switch mm time,
> > + * this is a pessimistic security measure and an opt-in for those tasks
> > + * that host sensitive information and there are concerns about spills
> > + * from fill buffers.
> > + */
> > +static void l1d_flush(struct mm_struct *next, struct task_struct *tsk)
> > +{
> > +	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> > +	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> > +
> > +	if (this_cpu_read(cpu_tlbstate.last_user_mm_l1d_flush) == 0)
> > +		goto check_next;
> > +
> > +	if (real_prev == next)
> > +		return;
> > +
> > +	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> > +		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> > +		goto done;
> > +	}
> > +
> > +	asm volatile(
> > +		/* Fill the cache */
> > +		"xorl	%%eax, %%eax\n"
> > +		".Lfill_cache:\n"
> > +		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> > +		"addl	$64, %%eax\n\t"
> > +		"cmpl	%%eax, %[size]\n\t"
> > +		"jne	.Lfill_cache\n\t"
> > +		"lfence\n"
> > +		:: [flush_pages] "r" (l1d_flush_pages),
> > +		    [size] "r" (size)
> > +		: "eax", "ecx");
> 
> Yet moar copied code slightly different for no reason.
> 
> > +
> > +done:
> > +	this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 0);
> > +	/* Make sure we clear the values before we set it again */
> > +	barrier();
> > +check_next:
> > +	if (tsk == NULL)
> > +		return;
> > +
> > +	/* Match the set/clear_bit barriers */
> > +	smp_rmb();
> 
> What for again?
> 
> > +	/* We don't need stringent checks as we opt-in/opt-out */
> > +	if (test_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH))
> > +		this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 1);
> > +}
> > +
> >  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> >  	       struct task_struct *tsk)
> >  {
> > @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> >  		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
> >  	}
> >  
> > +	l1d_flush(next, tsk);
> 
> This is really the wrong place. You want to do that:
> 
>   1) Just before return to user space
>   2) When entering a guest
> 
> and only when the previously running user space task was the one which
> requested this massive protection.

Why ? switch_mm will not do anything when switching to/from kernel
threads, right ? Or did things change since I last looked at it (granted a
while ago...). It seems like a reasonable place and definitely less hot
than the return-to-userspace path.

> While it's worth to discuss, I'm not yet convinced that this is worth
> the trouble.

It's a pretty serious request for us: people really want this when
running different processes that can today demonstrably spy on each
other's caches and belong to different customers. Think of containers for
example. VMs are a solution since we have such workarounds in KVM I
believe, but much heavier weight.

Cheers,
Ben.


> Thanks,
> 
>         tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-20 11:49     ` Thomas Gleixner
  2020-03-21  1:42       ` Singh, Balbir
@ 2020-03-22  5:08       ` Herrenschmidt, Benjamin
  2020-03-22 15:10         ` Andy Lutomirski
  1 sibling, 1 reply; 15+ messages in thread
From: Herrenschmidt, Benjamin @ 2020-03-22  5:08 UTC (permalink / raw)
  To: tglx, linux-kernel, Singh, Balbir; +Cc: keescook, x86

On Fri, 2020-03-20 at 12:49 +0100, Thomas Gleixner wrote:
> Balbir,
> 
> "Singh, Balbir" <sblbir@amazon.com> writes:
> > On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote:
> > > What's the point? The attack surface is the L1D content of the scheduled
> > > out task. If the malicious task schedules out, then why would you care?
> > > 
> > > I might be missing something, but AFAICT this is beyond paranoia.
> > > 
> > 
> > I think there are two cases
> > 
> > 1. Task with important data schedules out
> > 2. Malicious task schedules in
> > 
> > These patches address 1, but call out case #2
> 
> The point is if the victim task schedules out, then there is no reason
> to flush L1D immediately in context switch. If that just schedules a
> kernel thread and then goes back to the task, then there is no point
> unless you do not even trust the kernel thread.

A switch to a kernel thread will not call switch_mm, will it ? At least it used not to...

> > > > 3. There is a fallback software L1D load, similar to what L1TF does, but
> > > >     we don't prefetch the TLB, is that sufficient?
> > > 
> > > If we go there, then the KVM L1D flush code wants to move into general
> > > x86 code.
> > 
> > OK.. we can definitely consider reusing code, but I think the KVM bits require
> > tlb prefetching, IIUC before cache flush to negate any bad translations
> > associated with an L1TF fault, but the code/comments are not clear on the need
> > to do so.
> 
> I forgot the gory details by now, but having two entry points or a
> conditional and share the rest (page allocation etc.) is definitely
> better than two slightly different implementation which basically do the same thing.
> 
> > > > +void enable_l1d_flush_for_task(struct task_struct *tsk)
> > > > +{
> > > > +     struct page *page;
> > > > +
> > > > +     if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> > > > +             goto done;
> > > > +
> > > > +     mutex_lock(&l1d_flush_mutex);
> > > > +     if (l1d_flush_pages)
> > > > +             goto done;
> > > > +     /*
> > > > +      * These pages are never freed, we use the same
> > > > +      * set of pages across multiple processes/contexts
> > > > +      */
> > > > +     page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
> > > > +     if (!page)
> > > > +             return;
> > > > +
> > > > +     l1d_flush_pages = page_address(page);
> > > > +     /* I don't think we need to worry about KSM */
> > > 
> > > Why not? Even if it wouldn't be necessary why would we care as this is a
> > > once per boot operation in fully preemptible code.
> > 
> > Not sure I understand your question, I was stating that even if KSM was
> > running, it would not impact us (with dedup), as we'd still be writing out 0s
> > to the cache line in the fallback case.
> 
> I probably confused myself vs. the comment in the VMX code, but that
> mentions nested virt. Needs at least some consideration when we reuse
> that code.
> 
> > > >   void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > > >               struct task_struct *tsk)
> > > >   {
> > > > @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> > > > mm_struct *next,
> > > >                trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
> > > >        }
> > > > 
> > > > +     l1d_flush(next, tsk);
> > > 
> > > This is really the wrong place. You want to do that:
> > > 
> > >    1) Just before return to user space
> > >    2) When entering a guest
> > > 
> > > and only when the previously running user space task was the one which
> > > requested this massive protection.
> > > 
> > 
> > Cases 1 and 2 are handled via
> > 
> > 1. SWAPGS fixes/work arounds (unless I misunderstood your suggestion)
> 
> How so? SWAPGS mitigation does not flush L1D. It merely serializes SWAPGS.

> > 2. L1TF fault handling
> > 
> > This mechanism allows for flushing not restricted to 1 or 2, the idea is to
> > immediately flush L1D for paranoid processes on mm switch.
> 
> Why? To protect the victim task against the malicious kernel?

Mostly malicious other tasks for us. As I said, I don't think switch_mm
is called on switching to a kernel thread and is definitely a colder
path than the return to userspace, so it felt like the right place to
put this, but I don't mind if you prefer it elsewhere as long as it
does the job, which is to prevent task B from snooping task A's data.

> The L1D content of the victim is endangered in the following case:
> 
>     victim out -> attacker in
> 
> The attacker can either run in user space or in guest mode. So the flush
> is only interesting when the attacker actually goes back to user space
> or reenters the guest.
> 
> The following is completely uninteresting:
> 
>     victim out -> kernel thread in/out -> victim in

Sure but will that call switch_mm to be called ?

> Even this is uninteresting:
> 
>     victim in -> attacker in (stays in kernel, e.g. waits for data) ->
>     attacker out -> victim in

I don't get this ... how do you get attacker_in without victim_out
first ? In which case you have a victim_out -> attacker_in transition
which is what we are trying to protect against.

I still think flushing the "high value" process L1D on switch_mm out is
the way to go here...

> So the point where you want to flush conditionally is when the attacker
> leaves kernel space either to user mode or guest mode.
> 
> So if the victim schedules out it sets a per cpu request to flush L1D
> on the borders.
> 
> And then you have on return to user:
> 
>     if (this_cpu_flush_l1d())
>         flush_l1d()
> 
> and in kvm:
> 
>     if (this_cpu_flush_l1d() || L1TF_flush_L1D)
>         flush_l1d()
> 
> The request does:
> 
>     if (!this_cpu_read(l1d_flush_for_task))
>         this_cpu_write(l1d_flush_for_task, current)
> 
> The check does:
> 
>     p = this_cpu_read(l1d_flush_for_task);
>     if (p) {
>         this_cpu_write(l1d_flush_for_task, NULL);
>         return p != current;
>     }
>     return false;
> 
> Hmm?
> 
> Thanks,
> 
>         tglx
> 
>                 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-21 10:05         ` Thomas Gleixner
@ 2020-03-22  5:10           ` Herrenschmidt, Benjamin
  2020-03-23  0:37           ` Singh, Balbir
  1 sibling, 0 replies; 15+ messages in thread
From: Herrenschmidt, Benjamin @ 2020-03-22  5:10 UTC (permalink / raw)
  To: tglx, linux-kernel, Singh, Balbir; +Cc: keescook, x86

On Sat, 2020-03-21 at 11:05 +0100, Thomas Gleixner wrote:
> victim1
>  store secrit
>                         victim2
> attacker                  read secrit
> 
> Now if L1D is flushed on CPU0 before attacker reaches user space,
> i.e. reaches the attack code, then there is nothing to see. From the
> link:
> 
>   Similar to the L1TF VMM mitigations, snoop-assisted L1D sampling can be
>   mitigated by flushing the L1D cache between when secrets are accessed
>   and when possibly malicious software runs on the same core.
> 
> So the important point is to flush _before_ the attack code runs which
> involves going back to user space or guest mode.

So you mean switching from victim to attacker in the kernel, and going
back to victim before the attacker has a chance to actually execute any
userspace code ?

I can see why this doesn't require a flush, but is it a case worth
optimizing for either ?

I.e. the flush in switch_mm is rather trivial, is lower overhead than
adding code to the userspace return path, and likely avoids kernel
threads; I prefer it for its simplicity TBH...

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-22  5:08       ` Herrenschmidt, Benjamin
@ 2020-03-22 15:10         ` Andy Lutomirski
  2020-03-22 23:17           ` Herrenschmidt, Benjamin
  2020-03-23  0:12           ` Singh, Balbir
  0 siblings, 2 replies; 15+ messages in thread
From: Andy Lutomirski @ 2020-03-22 15:10 UTC (permalink / raw)
  To: Herrenschmidt, Benjamin; +Cc: tglx, linux-kernel, Singh, Balbir, keescook, x86


> On Mar 21, 2020, at 10:08 PM, Herrenschmidt, Benjamin <benh@amazon.com> wrote:
> 
> On Fri, 2020-03-20 at 12:49 +0100, Thomas Gleixner wrote:
>> Balbir,
>> 
>> "Singh, Balbir" <sblbir@amazon.com> writes:
>>>> On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote:
>>>>> What's the point? The attack surface is the L1D content of the scheduled
>>>>> out task. If the malicious task schedules out, then why would you care?
>>>>> 
>>>>> I might be missing something, but AFAICT this is beyond paranoia.
>>>>> 
>>> 
>>> I think there are two cases
>>> 
>>> 1. Task with important data schedules out
>>> 2. Malicious task schedules in
>>> 
>>> These patches address 1, but call out case #2
>> 
>> The point is if the victim task schedules out, then there is no reason
>> to flush L1D immediately in context switch. If that just schedules a
>> kernel thread and then goes back to the task, then there is no point
>> unless you do not even trust the kernel thread.
> 
> A switch to a kernel thread will not call switch_mm, will it ? At least it used not to...
> 
>>>>> 3. There is a fallback software L1D load, similar to what L1TF does, but
>>>>>    we don't prefetch the TLB, is that sufficient?
>>>> 
>>>> If we go there, then the KVM L1D flush code wants to move into general
>>>> x86 code.
>>> 
>>> OK.. we can definitely consider reusing code, but I think the KVM bits require
>>> tlb prefetching, IIUC before cache flush to negate any bad translations
>>> associated with an L1TF fault, but the code/comments are not clear on the need
>>> to do so.
>> 
>> I forgot the gory details by now, but having two entry points or a
>> conditional and share the rest (page allocation etc.) is definitely
>> better than two slightly different implementation which basically do the same thing.
>> 
>>>>> +void enable_l1d_flush_for_task(struct task_struct *tsk)
>>>>> +{
>>>>> +     struct page *page;
>>>>> +
>>>>> +     if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
>>>>> +             goto done;
>>>>> +
>>>>> +     mutex_lock(&l1d_flush_mutex);
>>>>> +     if (l1d_flush_pages)
>>>>> +             goto done;
>>>>> +     /*
>>>>> +      * These pages are never freed, we use the same
>>>>> +      * set of pages across multiple processes/contexts
>>>>> +      */
>>>>> +     page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
>>>>> +     if (!page)
>>>>> +             return;
>>>>> +
>>>>> +     l1d_flush_pages = page_address(page);
>>>>> +     /* I don't think we need to worry about KSM */
>>>> 
>>>> Why not? Even if it wouldn't be necessary why would we care as this is a
>>>> once per boot operation in fully preemptible code.
>>> 
>>> Not sure I understand your question, I was stating that even if KSM was
>>> running, it would not impact us (with dedup), as we'd still be writing out 0s
>>> to the cache line in the fallback case.
>> 
>> I probably confused myself vs. the comment in the VMX code, but that
>> mentions nested virt. Needs at least some consideration when we reuse
>> that code.
>> 
>>>>>  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>>>>>              struct task_struct *tsk)
>>>>>  {
>>>>> @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
>>>>> mm_struct *next,
>>>>>               trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
>>>>>       }
>>>>> 
>>>>> +     l1d_flush(next, tsk);
>>>> 
>>>> This is really the wrong place. You want to do that:
>>>> 
>>>>   1) Just before return to user space
>>>>   2) When entering a guest
>>>> 
>>>> and only when the previously running user space task was the one which
>>>> requested this massive protection.
>>>> 
>>> 
>>> Cases 1 and 2 are handled via
>>> 
>>> 1. SWAPGS fixes/work arounds (unless I misunderstood your suggestion)
>> 
>> How so? SWAPGS mitigation does not flush L1D. It merely serializes SWAPGS.
> 
>>> 2. L1TF fault handling
>>> 
>>> This mechanism allows for flushing not restricted to 1 or 2, the idea is to
>>> immediately flush L1D for paranoid processes on mm switch.
>> 
>> Why? To protect the victim task against the malicious kernel?
> 
> Mostly malicious other tasks for us. As I said, I don't think switch_mm
> is called on switching to a kernel thread and is definitely a colder
> path than the return to userspace, so it felt like the right place to
> put this, but I don't mind if you prefer it elsewhere as long as it
> does the job, which is to prevent task B from snooping task A's data.
> 
>> The L1D content of the victim is endangered in the following case:
>> 
>>    victim out -> attacker in
>> 
>> The attacker can either run in user space or in guest mode. So the flush
>> is only interesting when the attacker actually goes back to user space
>> or reenters the guest.
>> 
>> The following is completely uninteresting:
>> 
>>    victim out -> kernel thread in/out -> victim in
> 
> Sure but will that call switch_mm to be called ?
> 
>> Even this is uninteresting:
>> 
>>    victim in -> attacker in (stays in kernel, e.g. waits for data) ->
>>    attacker out -> victim in
> 
> I don't get this ... how do you get attacker_in without victim_out
> first ? In which case you have a victim_out -> attacker_in transition
> which is what we are trying to protect against.
> 
> I still think flushing the "high value" process L1D on switch_mm out is
> the way to go here...

Let me try to understand the issue. There is some high-value data, and that data is owned by a high-value process. At some point, the data ends up in L1D.  Later in, evil code runs and may attempt to exfiltrate  that data from L1D using a side channel. (The evil code is not necessarily in a malicious process context. It could be kernel code targeted by LVI or similar. It could be ordinary code that happens to contain a side channel gadget by accident.)

So the idea is to flush L1D after manipulating high-value data and before running evil code.

The nasty part here is that we don’t have a good handle on when L1D is filled and when the evil code runs. If the evil code is untrusted process userspace and the fill is an interrupt, then switch_mm is useless and we want to flush on kernel exit instead. If the fill and evil code are both userspace, then switch_mm is probably the right choice, but prepare_exit_from_usermode works too. If SMT is on, we lose no matter what.  If the evil code is in kernel context, then it’s not clear what to do. If the fill and the evil code are both in kernel threads (hi, io_uring), then I’m not at all sure what to do.

In summary, kernel exit seems stronger, but the right answer isn’t so clear.

We could do an optimized variant where we flush at kernel exit but we *decide* to flush in switch_mm.
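
Very roughly, something like this (invented names, not actual kernel
code; switch_mm() only records the decision and the flush happens on the
way out to user mode or into the guest):

    static DEFINE_PER_CPU(bool, l1d_flush_pending);

    /* Called from switch_mm_irqs_off() when the outgoing task opted in */
    static inline void l1d_flush_decide(struct task_struct *prev_tsk)
    {
            if (prev_tsk && test_ti_thread_flag(task_thread_info(prev_tsk),
                                                TIF_L1D_FLUSH))
                    this_cpu_write(l1d_flush_pending, true);
    }

    /* Called from prepare_exit_to_usermode() and before guest entry */
    static inline void l1d_flush_if_pending(void)
    {
            if (!this_cpu_read(l1d_flush_pending))
                    return;
            this_cpu_write(l1d_flush_pending, false);
            if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
                    wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
            /* else fall back to the software fill loop from the RFC */
    }

That keeps the hot exit path at a single percpu load in the common case.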

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-22 15:10         ` Andy Lutomirski
@ 2020-03-22 23:17           ` Herrenschmidt, Benjamin
  2020-03-23  0:12           ` Singh, Balbir
  1 sibling, 0 replies; 15+ messages in thread
From: Herrenschmidt, Benjamin @ 2020-03-22 23:17 UTC (permalink / raw)
  To: luto; +Cc: tglx, linux-kernel, keescook, Singh, Balbir, x86

On Sun, 2020-03-22 at 08:10 -0700, Andy Lutomirski wrote:
> 
> Let me try to understand the issue. There is some high-value data,
> and that data is owned by a high-value process. At some point, the
> data ends up in L1D.  Later in, evil code runs and may attempt to
> exfiltrate  that data from L1D using a side channel. (The evil code
> is not necessarily in a malicious process context. It could be kernel
> code targeted by LVI or similar. It could be ordinary code that
> happens to contain a side channel gadget by accident.)

We aren't trying to protect processes against the kernel. I think
that's beyond what can reasonably be done if the kernel is
compromised... If you are worried about that case, use VMs.

We are mostly trying to protect process vs. process. either language
runtimes potentially running different "user" code, or containers
pertaining to different "users" etc....

> So the idea is to flush L1D after manipulating high-value data and
> before running evil code.
>
> The nasty part here is that we don’t have a good handle on when L1D
> is filled and when the evil code runs. If the evil code is untrusted
> process userspace and the fill is an interrupt, then switch_mm is
> useless and we want to flush on kernel exit instead. If the fill and
> evil code are both userspace, then switch_mm is probably the right
> choice, but prepare_exit_from_usermode works too. If SMT is on, we
> lose no matter what.  If the evil code is in kernel context, then
> it’s not clear what to do. If the fill and the evil code are both in
> kernel threads (hi, io_uring), then I’m not at all sure what to do.
> 
> In summary, kernel exit seems stronger, but the right answer isn’t so
> clear.

Right. Which is why we are happy to limit the scope of this to
processes. I think if the kernel cannot be trusted in a given system,
the range of possible exploits dwarfs this one; I don't think it's what
we reasonably want to address here.

That said, I am not married to the switch_mm() solution; if there is
consensus that these things are better done in the kernel entry/exit
path, then so be it. But my gut feeling in that specific case is that
the overhead will be lower and the code potentially simpler in
switch_mm.

> We could do an optimized variant where we flush at kernel exit but we
> *decide* to flush in switch_mm.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-22 15:10         ` Andy Lutomirski
  2020-03-22 23:17           ` Herrenschmidt, Benjamin
@ 2020-03-23  0:12           ` Singh, Balbir
  1 sibling, 0 replies; 15+ messages in thread
From: Singh, Balbir @ 2020-03-23  0:12 UTC (permalink / raw)
  To: Herrenschmidt, Benjamin, luto; +Cc: tglx, linux-kernel, keescook, x86

On Sun, 2020-03-22 at 08:10 -0700, Andy Lutomirski wrote:
> 
> > I still think flushing the "high value" process L1D on switch_mm out is
> > the way to go here...
> 
> Let me try to understand the issue. There is some high-value data, and that
> data is owned by a high-value process. At some point, the data ends up in
> L1D.  Later in, evil code runs and may attempt to exfiltrate  that data from
> L1D using a side channel. (The evil code is not necessarily in a malicious
> process context. It could be kernel code targeted by LVI or similar. It
> could be ordinary code that happens to contain a side channel gadget by
> accident.)
> 
> So the idea is to flush L1D after manipulating high-value data and before
> running evil code.
> 
> The nasty part here is that we don’t have a good handle on when L1D is
> filled and when the evil code runs. If the evil code is untrusted process
> userspace and the fill is an interrupt, then switch_mm is useless and we
> want to flush on kernel exit instead. If the fill and evil code are both
> userspace, then switch_mm is probably the right choice, but
> prepare_exit_from_usermode works too. If SMT is on, we lose no matter
> what.  If the evil code is in kernel context, then it’s not clear what to
> do. If the fill and the evil code are both in kernel threads (hi, io_uring),
> then I’m not at all sure what to do.
> 
> In summary, kernel exit seems stronger, but the right answer isn’t so clear.
> 
> We could do an optimized variant where we flush at kernel exit but we
> *decide* to flush in switch_mm.

I think the key question in the LVI case would be: is it possible to do an LVI
in a kernel context? If the answer is no, switch_mm() is sufficient. For now
these patches focus on flushing L1D when the task is switched out; we could add
the use case for LVI (which is called out).

Balbir Singh.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
  2020-03-21 10:05         ` Thomas Gleixner
  2020-03-22  5:10           ` Herrenschmidt, Benjamin
@ 2020-03-23  0:37           ` Singh, Balbir
  1 sibling, 0 replies; 15+ messages in thread
From: Singh, Balbir @ 2020-03-23  0:37 UTC (permalink / raw)
  To: tglx, linux-kernel; +Cc: keescook, Herrenschmidt, Benjamin, x86

Hi, Thomas,

On Sat, 2020-03-21 at 11:05 +0100, Thomas Gleixner wrote:
> 
> 
> Balbir,
> 
> "Singh, Balbir" <sblbir@amazon.com> writes:
> > On Fri, 2020-03-20 at 12:49 +0100, Thomas Gleixner wrote:
> > > I forgot the gory details by now, but having two entry points or a
> > > conditional and share the rest (page allocation etc.) is definitely
> > > better than two slightly different implementation which basically do the
> > > same thing.
> > 
> > OK, I can try and dedup them to the extent possible, but please do
> > remember
> > that
> > 
> > 1. KVM is usually loaded as a module
> > 2. KVM is optional
> > 
> > We can share code, by putting the common bits in the core kernel.
> 
> Obviously so.
> 
> > > > 1. SWAPGS fixes/work arounds (unless I misunderstood your suggestion)
> > > 
> > > How so? SWAPGS mitigation does not flush L1D. It merily serializes
> > > SWAPGS.
> > 
> > Sorry, my bad, I was thinking of MDS_CLEAR (via verw), which does flush out
> > things and which I suspect should be sufficient from a return to user/signal
> > handling perspective.
> 
> MDS is affecting store buffers, fill buffers and load ports. Different
> story.
> 

Yes, what gets me is that as per (
https://software.intel.com/security-software-guidance/insights/deep-dive-intel-analysis-microarchitectural-data-sampling
) it says, "The VERW instruction and L1D_FLUSH command will overwrite the
store buffer value for the current logical processor on processors affected by
MSBDS". In my mind, this makes VERW the same as L1D_FLUSH and hence the
assumption, it could be that L1D_FLUSH is a superset, but it's not clear and I
can't seem to find any other form of documentation on the MSRs and microcode.

> > Right now, reading through
> > 
https://software.intel.com/security-software-guidance/insights/deep-dive-snoop-assisted-l1-data-sampling
> > , it does seem like we need this during a context switch, specifically
> > since a
> > dirty cache line can cause snooped reads for the attacker to leak data. Am
> > I
> > missing anything?
> 
> Yes. The way this goes is:
> 
> CPU0                   CPU1
> 
> victim1
>  store secrit
>                         victim2
> attacker                  read secrit
> 
> Now if L1D is flushed on CPU0 before attacker reaches user space,
> i.e. reaches the attack code, then there is nothing to see. From the
> link:
> 
>   Similar to the L1TF VMM mitigations, snoop-assisted L1D sampling can be
>   mitigated by flushing the L1D cache between when secrets are accessed
>   and when possibly malicious software runs on the same core.
> 
> So the important point is to flush _before_ the attack code runs which
> involves going back to user space or guest mode.

I think there is a more generic case with HT that you've highlighted below.

> 
> > > Even this is uninteresting:
> > > 
> > >     victim in -> attacker in (stays in kernel, e.g. waits for data) ->
> > >     attacker out -> victim in
> > > 
> > 
> > Not from what I understand from the link above: the attack is a function
> > of what can be snooped by another core/thread, and that is a function of
> > what modified secrets are in the cache line/store buffer.
> 
> Forget HT. That's not fixable by any flushing simply because there is no
> scheduling involved.
> 
> CPU0  HT0          CPU0 HT1             CPU1
> 
> victim1            attacker
>  store secrit
>                                         victim2
>                                           read secrit
> 
> > On return to user, we already use VERW (verw), but return-to-user
> > protection alone is not sufficient IMHO. Based on the link above, we need
> > to clear the L1D cache before it can be snooped.
> 
> Again. Flush is required between store and attacker running attack
> code. The attacker _cannot_ run attack code while it is in the kernel so
> flushing L1D on context switch is just voodoo.
> 
> If you want to cure the HT case with core scheduling then the scenario
> looks like this:
> 
> CPU0  HT0          CPU0 HT1             CPU1
> 
> victim1            IDLE
>  store secrit
> -> IDLE
>                    attacker in          victim2
>                                           read secrit
> 
> And yes, there the context switch flush on HT0 prevents it. So this can
> be part of a core scheduling based mitigation or handled via a per core
> flush request.
> 
> But HT is attackable in so many ways ...

I think the reason you prefer exit-to-user as opposed to switch_mm (switching
task groups/threads) is that it's lower overhead; the reasons I prefer
switch_mm are:

1. The overhead does not apply to all tasks; the selection of L1D flush is optional
2. It's more generic and does not make specific assumptions


> 
> Thanks,
> 
>         tglx


Thanks for the review,
Balbir Singh.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-03-23  0:37 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-13 22:04 [RFC PATCH] arch/x86: Optionally flush L1D on context switch Balbir Singh
2020-03-18 23:14 ` Kees Cook
2020-03-20  1:35   ` Singh, Balbir
2020-03-19  0:38 ` Thomas Gleixner
2020-03-20  1:37   ` Singh, Balbir
2020-03-20 11:49     ` Thomas Gleixner
2020-03-21  1:42       ` Singh, Balbir
2020-03-21 10:05         ` Thomas Gleixner
2020-03-22  5:10           ` Herrenschmidt, Benjamin
2020-03-23  0:37           ` Singh, Balbir
2020-03-22  5:08       ` Herrenschmidt, Benjamin
2020-03-22 15:10         ` Andy Lutomirski
2020-03-22 23:17           ` Herrenschmidt, Benjamin
2020-03-23  0:12           ` Singh, Balbir
2020-03-22  5:01   ` Herrenschmidt, Benjamin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).