linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/5] Next revision of the L1D flush patches
@ 2020-11-27  6:59 Balbir Singh
  2020-11-27  6:59 ` [PATCH v3 1/5] x86/mm: change l1d flush runtime prctl behaviour Balbir Singh
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Balbir Singh @ 2020-11-27  6:59 UTC (permalink / raw)
  To: tglx, mingo
  Cc: peterz, linux-kernel, keescook, jpoimboe, tony.luck, benh, x86,
	dave.hansen, thomas.lendacky, torvalds, Balbir Singh

Implement a mechanism that allows tasks to conditionally flush
their L1D cache (a mitigation mechanism suggested in [2]). The previous
posting of these patches was sent for inclusion (see [3]) but was not
merged because additional checks were requested. Those checks are:

1. Implement this mechanism only for CPUs affected by the L1TF bug
2. Disable the software fallback
3. Provide an override to disable this mechanism completely
4. Be SMT aware in the implementation

The patches support a use case where the entire system does not need to
run with SMT disabled; instead, a few CPUs can have their SMT siblings
turned off and processes that opt in are expected to run on those
non-SMT cores. This gives the administrator complete control over
setting up the mitigation. In addition, the administrator has a boot
time override (l1d_flush_out=off) to turn off the mechanism completely.

To implement these efficiently, a new per-CPU view of whether the core
is in SMT mode or not is introduced in patch 1. The code is refactored
in patch 2 so that the existing logic can accommodate other speculation
related checks when switching mm between tasks; this mechanism has not
changed since the last post. The ability to flush the L1D for tasks that
have the TIF_SPEC_L1D_FLUSH bit set and have context switched out of a
non-SMT core is provided by patch 3. Hooks for the user space API, so
that the feature can be invoked via prctl, are provided in patch 4, along
with the checks described above (1, 2 and 3). Documentation updates are in
patch 5, covering l1d_flush, the prctl changes and updates to the
kernel-parameters (l1d_flush_out).

The checks for opting into L1D flushing are:
	a. The CPU is affected by L1TF
	b. The hardware L1D flush mechanism is available

A task that has opted into this feature and then runs on a core with SMT
enabled will receive a SIGBUS.
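
For illustration, a minimal user space opt-in could look like the sketch
below. PR_SPEC_L1D_FLUSH_OUT is taken from this series' uapi header and is
defined locally as a fallback for older headers; error handling is
deliberately minimal:

	#include <stdio.h>
	#include <sys/prctl.h>
	#include <linux/prctl.h>

	#ifndef PR_SPEC_L1D_FLUSH_OUT
	#define PR_SPEC_L1D_FLUSH_OUT	2	/* value from this series */
	#endif

	int main(void)
	{
		/* Opt this task into L1D flushing on context switch out */
		if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT,
			  PR_SPEC_ENABLE, 0, 0))
			perror("PR_SET_SPECULATION_CTRL");

		/* Expect PR_SPEC_PRCTL | PR_SPEC_ENABLE once the opt-in took effect */
		printf("l1d_flush_out state: %d\n",
		       prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT,
			     0, 0, 0));
		return 0;
	}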

References
[1] https://software.intel.com/security-software-guidance/software-guidance/snoop-assisted-l1-data-sampling
[2] https://software.intel.com/security-software-guidance/insights/deep-dive-snoop-assisted-l1-data-sampling
[3] https://lkml.org/lkml/2020/6/2/1150
[4] https://lore.kernel.org/lkml/20200729001103.6450-1-sblbir@amazon.com/
[5] https://lore.kernel.org/lkml/20201117234934.25985-2-sblbir@amazon.com/

Changelog v3:
- Implement the SIGBUS mechanism
- Update and fix the documentation

Balbir Singh (5):
  x86/mm: change l1d flush runtime prctl behaviour
  x86/mm: Refactor cond_ibpb() to support other use cases
  x86/mm: Optionally flush L1D on context switch
  prctl: Hook L1D flushing in via prctl
  Documentation: Add L1D flushing Documentation

 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 .../admin-guide/hw-vuln/l1d_flush.rst         |  69 ++++++++++++
 .../admin-guide/kernel-parameters.txt         |  17 +++
 Documentation/userspace-api/spec_ctrl.rst     |   8 ++
 arch/Kconfig                                  |   4 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/cacheflush.h             |   8 ++
 arch/x86/include/asm/processor.h              |   2 +
 arch/x86/include/asm/thread_info.h            |   9 +-
 arch/x86/include/asm/tlbflush.h               |   2 +-
 arch/x86/kernel/cpu/bugs.c                    |  54 +++++++++
 arch/x86/kernel/smpboot.c                     |  11 +-
 arch/x86/mm/tlb.c                             | 105 ++++++++++++++----
 include/linux/sched.h                         |  10 ++
 include/uapi/linux/prctl.h                    |   1 +
 15 files changed, 274 insertions(+), 28 deletions(-)
 create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst

-- 
2.17.1



* [PATCH v3 1/5] x86/mm: change l1d flush runtime prctl behaviour
  2020-11-27  6:59 [PATCH v3 0/5] Next revision of the L1D flush patches Balbir Singh
@ 2020-11-27  6:59 ` Balbir Singh
  2020-12-04 21:07   ` Thomas Gleixner
  2020-11-27  6:59 ` [PATCH v3 2/5] x86/mm: Refactor cond_ibpb() to support other use cases Balbir Singh
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Balbir Singh @ 2020-11-27  6:59 UTC (permalink / raw)
  To: tglx, mingo
  Cc: peterz, linux-kernel, keescook, jpoimboe, tony.luck, benh, x86,
	dave.hansen, thomas.lendacky, torvalds, Balbir Singh

Detection of task affinities at API opt-in time is not the best
approach; instead, the task is killed if it runs on an SMT-enabled
core. This is better than not flushing the L1D cache when the task
switches from a non-SMT core to an SMT-enabled core.

Signed-off-by: Balbir Singh <sblbir@amazon.com>
---
 arch/x86/include/asm/processor.h |  2 ++
 arch/x86/kernel/smpboot.c        | 11 ++++++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 82a08b585818..60dbcdcb833f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -136,6 +136,8 @@ struct cpuinfo_x86 {
 	u16			logical_die_id;
 	/* Index into per_cpu list: */
 	u16			cpu_index;
+	/*  Is SMT active on this core? */
+	bool			smt_active;
 	u32			microcode;
 	/* Address space bits used by the cache internally */
 	u8			x86_cache_bits;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index de776b2e6046..9a94934fae5f 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -635,6 +635,9 @@ void set_cpu_sibling_map(int cpu)
 	threads = cpumask_weight(topology_sibling_cpumask(cpu));
 	if (threads > __max_smt_threads)
 		__max_smt_threads = threads;
+
+	for_each_cpu(i, topology_sibling_cpumask(cpu))
+		cpu_data(i).smt_active = threads > 1;
 }
 
 /* maps the cpu to the sched domain representing multi-core */
@@ -1548,10 +1551,16 @@ static void remove_siblinginfo(int cpu)
 
 	for_each_cpu(sibling, topology_die_cpumask(cpu))
 		cpumask_clear_cpu(cpu, topology_die_cpumask(sibling));
-	for_each_cpu(sibling, topology_sibling_cpumask(cpu))
+
+	for_each_cpu(sibling, topology_sibling_cpumask(cpu)) {
 		cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
+		if (cpumask_weight(topology_sibling_cpumask(sibling)) == 1)
+			cpu_data(sibling).smt_active = false;
+	}
+
 	for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
 		cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
+
 	cpumask_clear(cpu_llc_shared_mask(cpu));
 	cpumask_clear(topology_sibling_cpumask(cpu));
 	cpumask_clear(topology_core_cpumask(cpu));
-- 
2.17.1



* [PATCH v3 2/5] x86/mm: Refactor cond_ibpb() to support other use cases
  2020-11-27  6:59 [PATCH v3 0/5] Next revision of the L1D flush patches Balbir Singh
  2020-11-27  6:59 ` [PATCH v3 1/5] x86/mm: change l1d flush runtime prctl behaviour Balbir Singh
@ 2020-11-27  6:59 ` Balbir Singh
  2020-11-27  6:59 ` [PATCH v3 3/5] x86/mm: Optionally flush L1D on context switch Balbir Singh
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Balbir Singh @ 2020-11-27  6:59 UTC (permalink / raw)
  To: tglx, mingo
  Cc: peterz, linux-kernel, keescook, jpoimboe, tony.luck, benh, x86,
	dave.hansen, thomas.lendacky, torvalds, Balbir Singh

cond_ibpb() has the necessary bits required to track the previous mm in
switch_mm_irqs_off(). This can be reused for other use cases like L1D
flushing on context switch.

[ tglx: Moved comment, added a separate define for state (re)initialization ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200510014803.12190-4-sblbir@amazon.com
Link: https://lore.kernel.org/r/20200729001103.6450-3-sblbir@amazon.com
---
 arch/x86/include/asm/tlbflush.h |  2 +-
 arch/x86/mm/tlb.c               | 53 ++++++++++++++++++---------------
 2 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c87a2e0b660..a927d40664df 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -83,7 +83,7 @@ struct tlb_state {
 	/* Last user mm for optimizing IBPB */
 	union {
 		struct mm_struct	*last_user_mm;
-		unsigned long		last_user_mm_ibpb;
+		unsigned long		last_user_mm_spec;
 	};
 
 	u16 loaded_mm_asid;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 11666ba19b62..ca021e039451 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -42,10 +42,14 @@
  */
 
 /*
- * Use bit 0 to mangle the TIF_SPEC_IB state into the mm pointer which is
- * stored in cpu_tlb_state.last_user_mm_ibpb.
+ * Bits to mangle the TIF_SPEC_IB state into the mm pointer which is
+ * stored in cpu_tlb_state.last_user_mm_spec.
  */
 #define LAST_USER_MM_IBPB	0x1UL
+#define LAST_USER_MM_SPEC_MASK	(LAST_USER_MM_IBPB)
+
+/* Bits to set when tlbstate and flush is (re)initialized */
+#define LAST_USER_MM_INIT	LAST_USER_MM_IBPB
 
 /*
  * The x86 feature is called PCID (Process Context IDentifier). It is similar
@@ -316,20 +320,29 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	local_irq_restore(flags);
 }
 
-static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
 {
 	unsigned long next_tif = task_thread_info(next)->flags;
-	unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
+	unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
 
-	return (unsigned long)next->mm | ibpb;
+	return (unsigned long)next->mm | spec_bits;
 }
 
-static void cond_ibpb(struct task_struct *next)
+static void cond_mitigation(struct task_struct *next)
 {
+	unsigned long prev_mm, next_mm;
+
 	if (!next || !next->mm)
 		return;
 
+	next_mm = mm_mangle_tif_spec_bits(next);
+	prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_spec);
+
 	/*
+	 * Avoid user/user BTB poisoning by flushing the branch predictor
+	 * when switching between processes. This stops one process from
+	 * doing Spectre-v2 attacks on another.
+	 *
 	 * Both, the conditional and the always IBPB mode use the mm
 	 * pointer to avoid the IBPB when switching between tasks of the
 	 * same process. Using the mm pointer instead of mm->context.ctx_id
@@ -339,8 +352,6 @@ static void cond_ibpb(struct task_struct *next)
 	 * exposed data is not really interesting.
 	 */
 	if (static_branch_likely(&switch_mm_cond_ibpb)) {
-		unsigned long prev_mm, next_mm;
-
 		/*
 		 * This is a bit more complex than the always mode because
 		 * it has to handle two cases:
@@ -370,20 +381,14 @@ static void cond_ibpb(struct task_struct *next)
 		 * Optimize this with reasonably small overhead for the
 		 * above cases. Mangle the TIF_SPEC_IB bit into the mm
 		 * pointer of the incoming task which is stored in
-		 * cpu_tlbstate.last_user_mm_ibpb for comparison.
-		 */
-		next_mm = mm_mangle_tif_spec_ib(next);
-		prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_ibpb);
-
-		/*
+		 * cpu_tlbstate.last_user_mm_spec for comparison.
+		 *
 		 * Issue IBPB only if the mm's are different and one or
 		 * both have the IBPB bit set.
 		 */
 		if (next_mm != prev_mm &&
 		    (next_mm | prev_mm) & LAST_USER_MM_IBPB)
 			indirect_branch_prediction_barrier();
-
-		this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm);
 	}
 
 	if (static_branch_unlikely(&switch_mm_always_ibpb)) {
@@ -392,11 +397,12 @@ static void cond_ibpb(struct task_struct *next)
 		 * different context than the user space task which ran
 		 * last on this CPU.
 		 */
-		if (this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) {
+		if ((prev_mm & ~LAST_USER_MM_SPEC_MASK) !=
+					(unsigned long)next->mm)
 			indirect_branch_prediction_barrier();
-			this_cpu_write(cpu_tlbstate.last_user_mm, next->mm);
-		}
 	}
+
+	this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
 }
 
 #ifdef CONFIG_PERF_EVENTS
@@ -518,11 +524,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		need_flush = true;
 	} else {
 		/*
-		 * Avoid user/user BTB poisoning by flushing the branch
-		 * predictor when switching between processes. This stops
-		 * one process from doing Spectre-v2 attacks on another.
+		 * Apply process to process speculation vulnerability
+		 * mitigations if applicable.
 		 */
-		cond_ibpb(tsk);
+		cond_mitigation(tsk);
 
 		/*
 		 * Stop remote flushes for the previous mm.
@@ -630,7 +635,7 @@ void initialize_tlbstate_and_flush(void)
 	write_cr3(build_cr3(mm->pgd, 0));
 
 	/* Reinitialize tlbstate. */
-	this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, LAST_USER_MM_IBPB);
+	this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_INIT);
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
 	this_cpu_write(cpu_tlbstate.next_asid, 1);
 	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
-- 
2.17.1



* [PATCH v3 3/5] x86/mm: Optionally flush L1D on context switch
  2020-11-27  6:59 [PATCH v3 0/5] Next revision of the L1D flush patches Balbir Singh
  2020-11-27  6:59 ` [PATCH v3 1/5] x86/mm: change l1d flush runtime prctl behaviour Balbir Singh
  2020-11-27  6:59 ` [PATCH v3 2/5] x86/mm: Refactor cond_ibpb() to support other use cases Balbir Singh
@ 2020-11-27  6:59 ` Balbir Singh
  2020-12-04 21:21   ` Thomas Gleixner
  2020-11-27  6:59 ` [PATCH v3 4/5] prctl: Hook L1D flushing in via prctl Balbir Singh
  2020-11-27  6:59 ` [PATCH v3 5/5] Documentation: Add L1D flushing Documentation Balbir Singh
  4 siblings, 1 reply; 12+ messages in thread
From: Balbir Singh @ 2020-11-27  6:59 UTC (permalink / raw)
  To: tglx, mingo
  Cc: peterz, linux-kernel, keescook, jpoimboe, tony.luck, benh, x86,
	dave.hansen, thomas.lendacky, torvalds, Balbir Singh

Implement a mechanism to selectively flush the L1D cache. The goal is to
allow tasks that want to protect sensitive information, targeted by the
recent snoop-assisted data sampling vulnerabilities, to flush their L1D on
being switched out. This protects their data from being snooped or leaked
via side channels after the task has context switched out.

There are two scenarios we might want to protect against: a task leaving
the CPU with data still in L1D (which is the main concern of this patch),
and a malicious (less trusted) task coming in, for which we want to clean
up the cache before it starts. Only the former case is addressed here.

A new thread_info flag TIF_SPEC_L1D_FLUSH is added to track tasks which
opt into L1D flushing. cpu_tlbstate.last_user_mm_spec is used to convert
the TIF flag into per-CPU mm state in cond_mitigation(), which is then
used to decide when to flush the L1D cache.

A new inline helper function l1d_flush_hw() has been introduced.
Currently it returns an error code if hardware flushing is not
supported. The caller does not check the return value; in the context of
these patches, the routine is called only when HW assisted flushing is
available.
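
For clarity, a short worked note on the bit packing (all values are from
this patch; the alignment remark is why the low mm pointer bits are free):

	/*
	 * TIF_SPEC_IB        = 9   ->  LAST_USER_MM_IBPB      = 0x1
	 * TIF_SPEC_L1D_FLUSH = 10  ->  LAST_USER_MM_L1D_FLUSH = 0x2
	 *
	 * (ti->flags >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK packs both
	 * TIF bits into bits 0 and 1 of last_user_mm_spec. Those bits of
	 * the mm pointer are otherwise zero because struct mm_struct is
	 * at least word aligned, hence the BUILD_BUG_ON on adjacency.
	 */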

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200729001103.6450-4-sblbir@amazon.com
---
 arch/x86/include/asm/cacheflush.h  |  8 ++++++++
 arch/x86/include/asm/thread_info.h |  9 +++++++--
 arch/x86/mm/tlb.c                  | 30 +++++++++++++++++++++++++++---
 3 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index b192d917a6d0..554eaf697f3f 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -10,4 +10,12 @@
 
 void clflush_cache_range(void *addr, unsigned int size);
 
+static inline int l1d_flush_hw(void)
+{
+	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+		return 0;
+	}
+	return -EOPNOTSUPP;
+}
 #endif /* _ASM_X86_CACHEFLUSH_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 44733a4bfc42..36a11cfb1061 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,7 +84,7 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
-#define TIF_SPEC_FORCE_UPDATE	10	/* Force speculation MSR update in context switch */
+#define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
 #define TIF_PATCH_PENDING	13	/* pending live patching update */
@@ -96,6 +96,7 @@ struct thread_info {
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
+#define TIF_SPEC_FORCE_UPDATE	23	/* Force speculation MSR update in context switch */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
@@ -113,7 +114,7 @@ struct thread_info {
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
-#define _TIF_SPEC_FORCE_UPDATE	(1 << TIF_SPEC_FORCE_UPDATE)
+#define _TIF_SPEC_L1D_FLUSH	(1 << TIF_SPEC_L1D_FLUSH)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_PATCH_PENDING	(1 << TIF_PATCH_PENDING)
@@ -124,6 +125,7 @@ struct thread_info {
 #define _TIF_SLD		(1 << TIF_SLD)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
+#define _TIF_SPEC_FORCE_UPDATE	(1 << TIF_SPEC_FORCE_UPDATE)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
 #define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
 #define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
@@ -228,6 +230,9 @@ static inline int arch_within_stack_frames(const void * const stack,
 			   current_thread_info()->status & TS_COMPAT)
 #endif
 
+extern int enable_l1d_flush_for_task(struct task_struct *tsk);
+extern int disable_l1d_flush_for_task(struct task_struct *tsk);
+
 extern void arch_task_cache_init(void);
 extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
 extern void arch_release_task_struct(struct task_struct *tsk);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ca021e039451..1531d98396a0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,11 +8,13 @@
 #include <linux/export.h>
 #include <linux/cpu.h>
 #include <linux/debugfs.h>
+#include <linux/sched/smt.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 #include <asm/nospec-branch.h>
 #include <asm/cache.h>
+#include <asm/cacheflush.h>
 #include <asm/apic.h>
 
 #include "mm_internal.h"
@@ -42,14 +44,15 @@
  */
 
 /*
- * Bits to mangle the TIF_SPEC_IB state into the mm pointer which is
+ * Bits to mangle the TIF_SPEC_* state into the mm pointer which is
  * stored in cpu_tlb_state.last_user_mm_spec.
  */
 #define LAST_USER_MM_IBPB	0x1UL
-#define LAST_USER_MM_SPEC_MASK	(LAST_USER_MM_IBPB)
+#define LAST_USER_MM_L1D_FLUSH	0x2UL
+#define LAST_USER_MM_SPEC_MASK	(LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)
 
 /* Bits to set when tlbstate and flush is (re)initialized */
-#define LAST_USER_MM_INIT	LAST_USER_MM_IBPB
+#define LAST_USER_MM_INIT	(LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)
 
 /*
  * The x86 feature is called PCID (Process Context IDentifier). It is similar
@@ -310,6 +313,18 @@ void leave_mm(int cpu)
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
+int enable_l1d_flush_for_task(struct task_struct *tsk)
+{
+	set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
+	return 0;
+}
+
+int disable_l1d_flush_for_task(struct task_struct *tsk)
+{
+	clear_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
+	return 0;
+}
+
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
 {
@@ -325,6 +340,7 @@ static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
 	unsigned long next_tif = task_thread_info(next)->flags;
 	unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
 
+	BUILD_BUG_ON(TIF_SPEC_L1D_FLUSH != TIF_SPEC_IB + 1);
 	return (unsigned long)next->mm | spec_bits;
 }
 
@@ -402,6 +418,14 @@ static void cond_mitigation(struct task_struct *next)
 			indirect_branch_prediction_barrier();
 	}
 
+	/*
+	 * Flush only if SMT is disabled as per the contract, which is checked
+	 * when the feature is enabled.
+	 */
+	if (sched_smt_active() && !this_cpu_read(cpu_info.smt_active) &&
+		(prev_mm & LAST_USER_MM_L1D_FLUSH))
+		l1d_flush_hw();
+
 	this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
 }
 
-- 
2.17.1



* [PATCH v3 4/5] prctl: Hook L1D flushing in via prctl
  2020-11-27  6:59 [PATCH v3 0/5] Next revision of the L1D flush patches Balbir Singh
                   ` (2 preceding siblings ...)
  2020-11-27  6:59 ` [PATCH v3 3/5] x86/mm: Optionally flush L1D on context switch Balbir Singh
@ 2020-11-27  6:59 ` Balbir Singh
  2020-12-04 22:19   ` Thomas Gleixner
  2020-11-27  6:59 ` [PATCH v3 5/5] Documentation: Add L1D flushing Documentation Balbir Singh
  4 siblings, 1 reply; 12+ messages in thread
From: Balbir Singh @ 2020-11-27  6:59 UTC (permalink / raw)
  To: tglx, mingo
  Cc: peterz, linux-kernel, keescook, jpoimboe, tony.luck, benh, x86,
	dave.hansen, thomas.lendacky, torvalds, Balbir Singh

Use the existing PR_GET/SET_SPECULATION_CTRL API to expose the L1D
flush capability. For L1D flushing PR_SPEC_FORCE_DISABLE and
PR_SPEC_DISABLE_NOEXEC are not supported.

Enabling L1D flush does not check whether the task is running on an
SMT-enabled core; instead, the check is done at runtime (at flush time).
If the task runs on an SMT-enabled core, the task is sent a SIGBUS (this
is done prior to the task executing on the core, so no data is leaked).
This is better than the other alternatives of

a. Ensuring strict affinity of the task (hard to enforce
without further changes in the scheduler)
b. Silently skipping flush for tasks that move to SMT enabled
cores.

An arch config, ARCH_HAS_PARANOID_L1D_FLUSH, has been added, and
struct task_struct carries a callback_head for architectures that support
this config (currently only x86). This callback head is used to schedule
task work (SIGBUS delivery).

There is also no seccomp integration for the feature.
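
Since processes that opt in are expected to confine themselves to non-SMT
cores, a cooperating application could pin itself before using the prctl.
A minimal sketch, assuming the administrator supplies the list of CPUs
whose SMT siblings are offline (deriving that list is not shown here):

	#define _GNU_SOURCE
	#include <sched.h>

	/* Hypothetical helper: pin the calling task to admin-provided
	 * non-SMT CPUs before enabling the L1D flush prctl. */
	static int pin_to_non_smt_cpus(const int *cpus, int ncpus)
	{
		cpu_set_t set;
		int i;

		CPU_ZERO(&set);
		for (i = 0; i < ncpus; i++)
			CPU_SET(cpus[i], &set);

		return sched_setaffinity(0, sizeof(set), &set);
	}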

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/Kconfig               |  4 +++
 arch/x86/Kconfig           |  1 +
 arch/x86/kernel/cpu/bugs.c | 54 ++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/tlb.c          | 30 ++++++++++++++++++++-
 include/linux/sched.h      | 10 +++++++
 include/uapi/linux/prctl.h |  1 +
 6 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 56b6ccc0e32d..d4a0501ac7fc 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -311,6 +311,10 @@ config ARCH_32BIT_OFF_T
 	  still support 32-bit off_t. This option is enabled for all such
 	  architectures explicitly.
 
+config ARCH_HAS_PARANOID_L1D_FLUSH
+	bool
+	default n
+
 config HAVE_ASM_MODVERSIONS
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b81f74a..4f6caa6dae16 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -101,6 +101,7 @@ config X86
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_WANT_HUGE_PMD_SHARE
 	select ARCH_WANTS_THP_SWAP		if X86_64
+	select ARCH_HAS_PARANOID_L1D_FLUSH
 	select BUILDTIME_TABLE_SORT
 	select CLKEVT_I8253
 	select CLOCKSOURCE_VALIDATE_LAST_CYCLE
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 581fb7223ad0..dece79e4d1e9 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -296,6 +296,13 @@ enum taa_mitigations {
 	TAA_MITIGATION_TSX_DISABLED,
 };
 
+enum l1d_flush_out_mitigations {
+	L1D_FLUSH_OUT_OFF,
+	L1D_FLUSH_OUT_ON,
+};
+
+static enum l1d_flush_out_mitigations l1d_flush_out_mitigation __ro_after_init = L1D_FLUSH_OUT_ON;
+
 /* Default mitigation for TAA-affected CPUs */
 static enum taa_mitigations taa_mitigation __ro_after_init = TAA_MITIGATION_VERW;
 static bool taa_nosmt __ro_after_init;
@@ -379,6 +386,18 @@ static void __init taa_select_mitigation(void)
 	pr_info("%s\n", taa_strings[taa_mitigation]);
 }
 
+static int __init l1d_flush_out_parse_cmdline(char *str)
+{
+	if (!boot_cpu_has_bug(X86_BUG_L1TF))
+		return 0;
+
+	if (!strcmp(str, "off"))
+		l1d_flush_out_mitigation = L1D_FLUSH_OUT_OFF;
+
+	return 0;
+}
+early_param("l1d_flush_out", l1d_flush_out_parse_cmdline);
+
 static int __init tsx_async_abort_parse_cmdline(char *str)
 {
 	if (!boot_cpu_has_bug(X86_BUG_TAA))
@@ -1215,6 +1234,23 @@ static void task_update_spec_tif(struct task_struct *tsk)
 		speculation_ctrl_update_current();
 }
 
+static int l1d_flush_out_prctl_set(struct task_struct *task, unsigned long ctrl)
+{
+
+	if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
+		return -EPERM;
+
+	switch (ctrl) {
+	case PR_SPEC_ENABLE:
+		return enable_l1d_flush_for_task(task);
+	case PR_SPEC_DISABLE:
+		return disable_l1d_flush_for_task(task);
+	default:
+		return -ERANGE;
+	}
+	return 0;
+}
+
 static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl)
 {
 	if (ssb_mode != SPEC_STORE_BYPASS_PRCTL &&
@@ -1324,6 +1360,8 @@ int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which,
 		return ssb_prctl_set(task, ctrl);
 	case PR_SPEC_INDIRECT_BRANCH:
 		return ib_prctl_set(task, ctrl);
+	case PR_SPEC_L1D_FLUSH_OUT:
+		return l1d_flush_out_prctl_set(task, ctrl);
 	default:
 		return -ENODEV;
 	}
@@ -1340,6 +1378,20 @@ void arch_seccomp_spec_mitigate(struct task_struct *task)
 }
 #endif
 
+static int l1d_flush_out_prctl_get(struct task_struct *task)
+{
+	int ret;
+
+	if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
+		return PR_SPEC_FORCE_DISABLE;
+
+	ret = test_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);
+	if (ret)
+		return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
+	else
+		return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
+}
+
 static int ssb_prctl_get(struct task_struct *task)
 {
 	switch (ssb_mode) {
@@ -1390,6 +1442,8 @@ int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
 		return ssb_prctl_get(task);
 	case PR_SPEC_INDIRECT_BRANCH:
 		return ib_prctl_get(task);
+	case PR_SPEC_L1D_FLUSH_OUT:
+		return l1d_flush_out_prctl_get(task);
 	default:
 		return -ENODEV;
 	}
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1531d98396a0..bdc399b86bc7 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -315,6 +315,16 @@ EXPORT_SYMBOL_GPL(leave_mm);
 
 int enable_l1d_flush_for_task(struct task_struct *tsk)
 {
+	/*
+	 * Do not enable L1D_FLUSH_OUT if
+	 * a. The CPU is not affected by the L1TF bug
+	 * b. The CPU does not have L1D FLUSH feature support
+	 */
+
+	if (!boot_cpu_has_bug(X86_BUG_L1TF) ||
+			!boot_cpu_has(X86_FEATURE_FLUSH_L1D))
+		return -EINVAL;
+
 	set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
 	return 0;
 }
@@ -335,13 +345,31 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	local_irq_restore(flags);
 }
 
+/*
+ * Sent to a task that opts into L1D flushing via the prctl interface
+ * but ends up running on an SMT enabled core.
+ */
+static void l1d_flush_kill(struct callback_head *ch)
+{
+	force_sig(SIGBUS);
+}
+
 static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
 {
 	unsigned long next_tif = task_thread_info(next)->flags;
 	unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
+	unsigned long next_mm;
 
 	BUILD_BUG_ON(TIF_SPEC_L1D_FLUSH != TIF_SPEC_IB + 1);
-	return (unsigned long)next->mm | spec_bits;
+	next_mm = (unsigned long)next->mm | spec_bits;
+
+	if ((next_mm & LAST_USER_MM_L1D_FLUSH) && this_cpu_read(cpu_info.smt_active)) {
+		clear_ti_thread_flag(&next->thread_info, TIF_SPEC_L1D_FLUSH);
+		next->l1d_flush_kill.func = l1d_flush_kill;
+		task_work_add(next, &next->l1d_flush_kill, true);
+	}
+
+	return next_mm;
 }
 
 static void cond_mitigation(struct task_struct *next)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 76cd21fa5501..f8c5b6833f14 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1348,6 +1348,16 @@ struct task_struct {
 	struct callback_head		mce_kill_me;
 #endif
 
+#ifdef CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH
+	/*
+	 * If L1D flush is supported on mm context switch
+	 * then we use this callback head to queue kill work
+	 * to kill tasks that are not running on SMT disabled
+	 * cores
+	 */
+	struct callback_head		l1d_flush_kill;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 7f0827705c9a..c334e6a02e5f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -213,6 +213,7 @@ struct prctl_mm_map {
 /* Speculation control variants */
 # define PR_SPEC_STORE_BYPASS		0
 # define PR_SPEC_INDIRECT_BRANCH	1
+# define PR_SPEC_L1D_FLUSH_OUT		2
 /* Return and control values for PR_SET/GET_SPECULATION_CTRL */
 # define PR_SPEC_NOT_AFFECTED		0
 # define PR_SPEC_PRCTL			(1UL << 0)
-- 
2.17.1



* [PATCH v3 5/5] Documentation: Add L1D flushing Documentation
  2020-11-27  6:59 [PATCH v3 0/5] Next revision of the L1D flush patches Balbir Singh
                   ` (3 preceding siblings ...)
  2020-11-27  6:59 ` [PATCH v3 4/5] prctl: Hook L1D flushing in via prctl Balbir Singh
@ 2020-11-27  6:59 ` Balbir Singh
  4 siblings, 0 replies; 12+ messages in thread
From: Balbir Singh @ 2020-11-27  6:59 UTC (permalink / raw)
  To: tglx, mingo
  Cc: peterz, linux-kernel, keescook, jpoimboe, tony.luck, benh, x86,
	dave.hansen, thomas.lendacky, torvalds, Balbir Singh

Add documentation for L1D flushing: explain the need for the
feature and how it can be used.

Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 Documentation/admin-guide/hw-vuln/index.rst   |  1 +
 .../admin-guide/hw-vuln/l1d_flush.rst         | 69 +++++++++++++++++++
 .../admin-guide/kernel-parameters.txt         | 17 +++++
 Documentation/userspace-api/spec_ctrl.rst     |  8 +++
 4 files changed, 95 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst

diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index ca4dbdd9016d..21710f8609fe 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -15,3 +15,4 @@ are configurable at compile, boot or run time.
    tsx_async_abort
    multihit.rst
    special-register-buffer-data-sampling.rst
+   l1d_flush.rst
diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
new file mode 100644
index 000000000000..9db0f5e568cb
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
@@ -0,0 +1,69 @@
+L1D Flushing
+============
+
+With an increasing number of vulnerabilities being reported around data
+leaks from the Level 1 Data cache (L1D), the kernel provides an opt-in
+mechanism to flush the L1D cache on context switch.
+
+This mechanism can be used to address e.g. CVE-2020-0550. For applications
+that opt in, the mechanism protects sensitive data from being leaked
+(snooped) from the L1D cache.
+
+
+Related CVEs
+------------
+The following CVEs can be addressed by this
+mechanism
+
+    =============       ========================     ==================
+    CVE-2020-0550       Improper Data Forwarding     OS related aspects
+    =============       ========================     ==================
+
+Usage Guidelines
+----------------
+
+Please see document: :ref:`Documentation/userspace-api/spec_ctrl.rst
+<set_spec_ctrl>` for details.
+
+**NOTE**: The feature is disabled by default, applications need to
+specifically opt into the feature to enable it.
+
+Mitigation
+----------
+
+When L1D flushing is enabled for a task (PR_SPEC_L1D_FLUSH_OUT) a flush of
+the L1D cache is performed when the task is scheduled out and the incoming
+task belongs to a different process and therefore to a different address space.
+
+If the underlying CPU supports L1D flushing in hardware, the hardware
+mechanism is used; a software fallback for the mitigation is not supported.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows control of the L1D flush mitigation at boot
+time with the option "l1d_flush_out=". The valid arguments for this option are:
+
+  ============  =============================================================
+  off		Disables the prctl interface; applications trying to use
+                the prctl() will fail with an error
+  ============  =============================================================
+
+By default the API is enabled and applications opt in by using the prctl
+API.
+
+Limitations
+-----------
+
+The mechanism does not mitigate L1D data leaks between tasks belonging to
+different processes which are concurrently executing on sibling threads of
+a physical CPU core when SMT is enabled on the system.
+
+This can be addressed by controlled placement of processes on physical CPU
+cores or by disabling SMT. See the relevant chapter in the L1TF mitigation
+document: :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.
+
+**NOTE** : The opt-in of a task for L1D flushing will work only when the
+task's affinity is limited to cores running in non-SMT mode. Running the task
+on a CPU with SMT enabled would result in the task getting a SIGBUS when
+it is scheduled on the SMT-enabled core.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 44fde25bb221..e3ed24af91d1 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2320,6 +2320,23 @@
 			feature (tagged TLBs) on capable Intel chips.
 			Default is 1 (enabled)
 
+	l1d_flush_out=	[X86,INTEL]
+			Control mitigation for L1D based snooping vulnerability.
+
+			Certain CPUs are vulnerable to an exploit against CPU
+			internal buffers which can forward information to a
+			disclosure gadget under certain conditions.
+
+			In vulnerable processors, the speculatively
+			forwarded data can be used in a cache side channel
+			attack, to access data to which the attacker does
+			not have direct access.
+
+			This parameter controls the mitigation. The
+			options are:
+
+			off        - Unconditionally disable the mitigation
+
 	l1tf=           [X86] Control mitigation of the L1TF vulnerability on
 			      affected CPUs
 
diff --git a/Documentation/userspace-api/spec_ctrl.rst b/Documentation/userspace-api/spec_ctrl.rst
index 7ddd8f667459..f39744ef8810 100644
--- a/Documentation/userspace-api/spec_ctrl.rst
+++ b/Documentation/userspace-api/spec_ctrl.rst
@@ -106,3 +106,11 @@ Speculation misfeature controls
    * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0);
    * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0);
    * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
+
+- PR_SPEC_L1D_FLUSH_OUT: Flush L1D Cache on context switch out of the task
+                        (works only when tasks run on non SMT cores)
+
+  Invocations:
+   * prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT, 0, 0, 0);
+   * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT, PR_SPEC_ENABLE, 0, 0);
+   * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT, PR_SPEC_DISABLE, 0, 0);
-- 
2.17.1



* Re: [PATCH v3 1/5] x86/mm: change l1d flush runtime prctl behaviour
  2020-11-27  6:59 ` [PATCH v3 1/5] x86/mm: change l1d flush runtime prctl behaviour Balbir Singh
@ 2020-12-04 21:07   ` Thomas Gleixner
  2020-12-04 22:44     ` Singh, Balbir
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2020-12-04 21:07 UTC (permalink / raw)
  To: Balbir Singh, mingo
  Cc: peterz, linux-kernel, keescook, jpoimboe, tony.luck, benh, x86,
	dave.hansen, thomas.lendacky, torvalds, Balbir Singh

On Fri, Nov 27 2020 at 17:59, Balbir Singh wrote:

> Detection of task affinities at API opt-in time is not the best
> approach, the approach is to kill the task if it runs on a
> SMT enable core. This is better than not flushing the L1D cache
> when the task switches from a non-SMT core to an SMT enabled core.
>
> Signed-off-by: Balbir Singh <sblbir@amazon.com>
> ---
>  arch/x86/include/asm/processor.h |  2 ++
>  arch/x86/kernel/smpboot.c        | 11 ++++++++++-
>  2 files changed, 12 insertions(+), 1 deletion(-)

Subject, changelog match but patch content not so much :)



* Re: [PATCH v3 3/5] x86/mm: Optionally flush L1D on context switch
  2020-11-27  6:59 ` [PATCH v3 3/5] x86/mm: Optionally flush L1D on context switch Balbir Singh
@ 2020-12-04 21:21   ` Thomas Gleixner
  2020-12-04 22:41     ` Singh, Balbir
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2020-12-04 21:21 UTC (permalink / raw)
  To: Balbir Singh, mingo
  Cc: peterz, linux-kernel, keescook, jpoimboe, tony.luck, benh, x86,
	dave.hansen, thomas.lendacky, torvalds, Balbir Singh

On Fri, Nov 27 2020 at 17:59, Balbir Singh wrote:
>  
> +	/*
> +	 * Flush only if SMT is disabled as per the contract, which is checked
> +	 * when the feature is enabled.
> +	 */
> +	if (sched_smt_active() && !this_cpu_read(cpu_info.smt_active) &&
> +		(prev_mm & LAST_USER_MM_L1D_FLUSH))
> +		l1d_flush_hw();

So if SMT is completely disabled then no flush? Shouldn't the logic be:

    if ((!sched_smt_active() || !this_cpu_read(cpu_info.smt_active) &&
         (prev_mm & LAST_USER_MM_L1D_FLUSH))

Hmm?

But that's bad, because it's lots of conditions to evaluate for every
switch_mm where most of them are not interested in it at all.

Let me read through the rest of the pile.

Thanks,

        tglx




* Re: [PATCH v3 4/5] prctl: Hook L1D flushing in via prctl
  2020-11-27  6:59 ` [PATCH v3 4/5] prctl: Hook L1D flushing in via prctl Balbir Singh
@ 2020-12-04 22:19   ` Thomas Gleixner
  2020-12-05  2:56     ` Balbir Singh
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2020-12-04 22:19 UTC (permalink / raw)
  To: Balbir Singh, mingo
  Cc: peterz, linux-kernel, keescook, jpoimboe, tony.luck, benh, x86,
	dave.hansen, thomas.lendacky, torvalds, Balbir Singh

Balbir,

On Fri, Nov 27 2020 at 17:59, Balbir Singh wrote:
> +enum l1d_flush_out_mitigations {
> +	L1D_FLUSH_OUT_OFF,
> +	L1D_FLUSH_OUT_ON,
> +};
> +
> +static enum l1d_flush_out_mitigations l1d_flush_out_mitigation __ro_after_init = L1D_FLUSH_OUT_ON;

Why default on, and why does it stay on when the CPU is not affected by L1TF ...

>  /* Default mitigation for TAA-affected CPUs */
>  static enum taa_mitigations taa_mitigation __ro_after_init = TAA_MITIGATION_VERW;
>  static bool taa_nosmt __ro_after_init;
> @@ -379,6 +386,18 @@ static void __init taa_select_mitigation(void)
>  	pr_info("%s\n", taa_strings[taa_mitigation]);
>  }
>  
> +static int __init l1d_flush_out_parse_cmdline(char *str)
> +{
> +	if (!boot_cpu_has_bug(X86_BUG_L1TF))
> +		return 0;

... while here you check for L1TF.

Also shouldn't it be default off and enabled via command line?

> +static int l1d_flush_out_prctl_get(struct task_struct *task)
> +{
> +	int ret;
> +
> +	if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
> +		return PR_SPEC_FORCE_DISABLE;
> +
> +	ret = test_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);

That ret indirection is pointless. Just make it if (test_....)

> +static int l1d_flush_out_prctl_set(struct task_struct *task, unsigned long ctrl)
> +{
> +
> +	if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
> +		return -EPERM;

So here you check for off and then...

>  int enable_l1d_flush_for_task(struct task_struct *tsk)
>  {
> +	/*
> +	 * Do not enable L1D_FLUSH_OUT if
> +	 * b. The CPU is not affected by the L1TF bug
> +	 * c. The CPU does not have L1D FLUSH feature support
> +	 */
> +
> +	if (!boot_cpu_has_bug(X86_BUG_L1TF) ||
> +			!boot_cpu_has(X86_FEATURE_FLUSH_L1D))
> +		return -EINVAL;

... you check for the feature bits with a malformatted condition at some
other place. It's not supported when these conditions are not there. So
why have this check here?

> +
>  	set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
>  	return 0;
>  }

Aside of that, why is this in tlb.c and not in bugs.c? There is nothing
tlb specific in these enable/disable functions. They just fiddle with
the TIF bit.

> +/*
> + * Sent to a task that opts into L1D flushing via the prctl interface
> + * but ends up running on an SMT enabled core.
> + */
> +static void l1d_flush_kill(struct callback_head *ch)
> +{
> +	force_sig(SIGBUS);
> +}
> +
>  static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
>  {
>  	unsigned long next_tif = task_thread_info(next)->flags;
>  	unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
> +	unsigned long next_mm;
>  
>  	BUILD_BUG_ON(TIF_SPEC_L1D_FLUSH != TIF_SPEC_IB + 1);
> -	return (unsigned long)next->mm | spec_bits;
> +	next_mm = (unsigned long)next->mm | spec_bits;
> +
> +	if ((next_mm & LAST_USER_MM_L1D_FLUSH) && this_cpu_read(cpu_info.smt_active)) {

Wheeee. Yet more unconditional checks on every context switch.

> +		clear_ti_thread_flag(&next->thread_info, TIF_SPEC_L1D_FLUSH);
> +		next->l1d_flush_kill.func = l1d_flush_kill;
> +		task_work_add(next, &next->l1d_flush_kill, true);

int task_work_add(struct task_struct *task, struct callback_head *twork,
                  enum task_work_notify_mode mode);

true is truly not a valid enum constant ....

> +	}

So you really want to have:

DEFINE_STATIC_KEY_FALSE(l1dflush_enabled);
static bool l1dflush_mitigation __init_data;

and then with the command line option you set l1dflush_mitigation and in
check_bugs() you invoke l1dflush_select_mitigation() which does:

       if (!l1dflush_mitigation || !boot_cpu_has_bug(X86_BUG_L1TF) ||
       	   !boot_cpu_has(X86_FEATURE_FLUSH_L1D))
                return;

       static_branch_enable(&l1dflush_enabled);

and then in l1d_flush_out_prctl_set()

       if (!static_branch_unlikely(&l1dflush_enabled))
       		return -ENOTSUPP;

Then make the whole switch machinery do:

      if (static_branch_unlikely(&l1dflush_enabled)) {
            if (unlikely((next_mm | prev_mm) & LAST_USER_MM_L1D_FLUSH))
                 l1dflush_evaluate(next_mm, prev_mm);
      }
                    
and l1dflush_evaluate()

     if (prev_mm & LAST_USER_MM_L1D_FLUSH)
     	  l1d_flush();

     if ((next_mm & LAST_USER_MM_L1D_FLUSH) &&
         this_cpu_read(cpu_info.smt_active)) {

          clear_ti_thread_flag(&next->thread_info, TIF_SPEC_L1D_FLUSH);
          next->l1d_flush_kill.func = l1d_flush_kill;
	  task_work_add(next, &next->l1d_flush_kill, TWA_RESUME);
     }

That way the overhead is on systems where the admin decides to enable it
and if enabled the evaluation of prev_mm and next_mm is pushed out of
line.

Hmm?

Thanks,

        tglx


* Re: [PATCH v3 3/5] x86/mm: Optionally flush L1D on context switch
  2020-12-04 21:21   ` Thomas Gleixner
@ 2020-12-04 22:41     ` Singh, Balbir
  0 siblings, 0 replies; 12+ messages in thread
From: Singh, Balbir @ 2020-12-04 22:41 UTC (permalink / raw)
  To: tglx, mingo
  Cc: linux-kernel, peterz, keescook, torvalds, jpoimboe, x86,
	tony.luck, dave.hansen, thomas.lendacky, benh

On Fri, 2020-12-04 at 22:21 +0100, Thomas Gleixner wrote:
> 
> On Fri, Nov 27 2020 at 17:59, Balbir Singh wrote:
> > 
> > +     /*
> > +      * Flush only if SMT is disabled as per the contract, which is checked
> > +      * when the feature is enabled.
> > +      */
> > +     if (sched_smt_active() && !this_cpu_read(cpu_info.smt_active) &&
> > +             (prev_mm & LAST_USER_MM_L1D_FLUSH))
> > +             l1d_flush_hw();
> 
> So if SMT is completely disabled then no flush? Shouldn't the logic be:
> 
>     if ((!sched_smt_active() || !this_cpu_read(cpu_info.smt_active) &&
>          (prev_mm & LAST_USER_MM_L1D_FLUSH))
> 
> Hmm?
> 
> But that's bad, because it's lot's of conditions to evaluate for every
> switch_mm where most of them are not interested in it at all.
> 
> Let me read through the rest of the pile.
>


We don't need this any more with the new check that pre-emptively kills
the task, so it can be removed.

Balbir 


* Re: [PATCH v3 1/5] x86/mm: change l1d flush runtime prctl behaviour
  2020-12-04 21:07   ` Thomas Gleixner
@ 2020-12-04 22:44     ` Singh, Balbir
  0 siblings, 0 replies; 12+ messages in thread
From: Singh, Balbir @ 2020-12-04 22:44 UTC (permalink / raw)
  To: tglx, mingo
  Cc: linux-kernel, peterz, keescook, torvalds, jpoimboe, x86,
	tony.luck, dave.hansen, thomas.lendacky, benh

On Fri, 2020-12-04 at 22:07 +0100, Thomas Gleixner wrote:
> 
> On Fri, Nov 27 2020 at 17:59, Balbir Singh wrote:
> 
> > Detection of task affinities at API opt-in time is not the best
> > approach, the approach is to kill the task if it runs on a
> > SMT enable core. This is better than not flushing the L1D cache
> > when the task switches from a non-SMT core to an SMT enabled core.
> > 
> > Signed-off-by: Balbir Singh <sblbir@amazon.com>
> > ---
> >  arch/x86/include/asm/processor.h |  2 ++
> >  arch/x86/kernel/smpboot.c        | 11 ++++++++++-
> >  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> Subject, changelog match but patch content not so much :)
> 

The changelog jumped between 1/3 of my fixup and 1/5 of my
new post :)

The correct changelog is below, which I shall fix

x86/smp: Add a per-cpu view of SMT state

A new field, smt_active, in cpuinfo_x86 identifies whether the current
core/CPU is in SMT mode or not. This can be very helpful if the system has
some of its cores' sibling threads offlined, and can be used for cases
where action is taken based on the state of SMT. The follow-up patches use
this feature.
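
As an illustration (a sketch, not part of this patch), kernel code can
then take SMT-dependent action with a cheap per-CPU read; the pr_info()
here is only a placeholder for such an action:

	if (this_cpu_read(cpu_info.smt_active))
		pr_info("CPU %d has SMT siblings online\n",
			smp_processor_id());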

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200729001103.6450-2-sblbir@amazon.com
---


* Re: [PATCH v3 4/5] prctl: Hook L1D flushing in via prctl
  2020-12-04 22:19   ` Thomas Gleixner
@ 2020-12-05  2:56     ` Balbir Singh
  0 siblings, 0 replies; 12+ messages in thread
From: Balbir Singh @ 2020-12-05  2:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, peterz, linux-kernel, keescook, jpoimboe, tony.luck, benh,
	x86, dave.hansen, thomas.lendacky, torvalds

On Fri, Dec 04, 2020 at 11:19:17PM +0100, Thomas Gleixner wrote:
> 
> Balbir,
> 
> On Fri, Nov 27 2020 at 17:59, Balbir Singh wrote:
> > +enum l1d_flush_out_mitigations {
> > +     L1D_FLUSH_OUT_OFF,
> > +     L1D_FLUSH_OUT_ON,
> > +};
> > +
> > +static enum l1d_flush_out_mitigations l1d_flush_out_mitigation __ro_after_init = L1D_FLUSH_OUT_ON;
> 
> Why default on and why stays it on when the CPU is not affected by L1TF ...
> 

Because we don't set the PRCTL if the processor is not affected by the
bug.

> >  /* Default mitigation for TAA-affected CPUs */
> >  static enum taa_mitigations taa_mitigation __ro_after_init = TAA_MITIGATION_VERW;
> >  static bool taa_nosmt __ro_after_init;
> > @@ -379,6 +386,18 @@ static void __init taa_select_mitigation(void)
> >       pr_info("%s\n", taa_strings[taa_mitigation]);
> >  }
> >
> > +static int __init l1d_flush_out_parse_cmdline(char *str)
> > +{
> > +     if (!boot_cpu_has_bug(X86_BUG_L1TF))
> > +             return 0;
> 
> ... while here you check for L1TF.
> 
> Also shouldn't it be default off and enabled via command line?
> 

I chose the other way because the prctl is an opt-in as well

> > +static int l1d_flush_out_prctl_get(struct task_struct *task)
> > +{
> > +     int ret;
> > +
> > +     if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
> > +             return PR_SPEC_FORCE_DISABLE;
> > +
> > +     ret = test_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);
> 
> That ret indirection is pointless. Just make it if (test_....)

Sure, will do

> 
> > +static int l1d_flush_out_prctl_set(struct task_struct *task, unsigned long ctrl)
> > +{
> > +
> > +     if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
> > +             return -EPERM;
> 
> So here you check for off and then...
> 

Yes

> >  int enable_l1d_flush_for_task(struct task_struct *tsk)
> >  {
> > +     /*
> > +      * Do not enable L1D_FLUSH_OUT if
> > +      * b. The CPU is not affected by the L1TF bug
> > +      * c. The CPU does not have L1D FLUSH feature support
> > +      */
> > +
> > +     if (!boot_cpu_has_bug(X86_BUG_L1TF) ||
> > +                     !boot_cpu_has(X86_FEATURE_FLUSH_L1D))
> > +             return -EINVAL;
> 
> ... you check for the feature bits with a malformatted condition at some
> other place. It's not supported when these conditions are not there. So
> why having this check here?
> 
> > +
> >       set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
> >       return 0;
> >  }
> 
> Aside of that, why is this in tlb.c and not in bugs.c? There is nothing
> tlb specific in these enable/disable functions. They just fiddle with
> the TIF bit.
> 

I can move them over.

> > +/*
> > + * Sent to a task that opts into L1D flushing via the prctl interface
> > + * but ends up running on an SMT enabled core.
> > + */
> > +static void l1d_flush_kill(struct callback_head *ch)
> > +{
> > +     force_sig(SIGBUS);
> > +}
> > +
> >  static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
> >  {
> >       unsigned long next_tif = task_thread_info(next)->flags;
> >       unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
> > +     unsigned long next_mm;
> >
> >       BUILD_BUG_ON(TIF_SPEC_L1D_FLUSH != TIF_SPEC_IB + 1);
> > -     return (unsigned long)next->mm | spec_bits;
> > +     next_mm = (unsigned long)next->mm | spec_bits;
> > +
> > +     if ((next_mm & LAST_USER_MM_L1D_FLUSH) && this_cpu_read(cpu_info.smt_active)) {
> 
> Wheeee. Yet more unconditional checks on every context switch.

A task can only get here if it is affected by the bug (the processor has
L1TF and L1D_FLUSH support) and the task opted in. I think what you're
suggesting is that we avoid the check for all tasks (the single
next_mm & LAST_USER_MM_L1D_FLUSH check) as well?

> 
> > +             clear_ti_thread_flag(&next->thread_info, TIF_SPEC_L1D_FLUSH);
> > +             next->l1d_flush_kill.func = l1d_flush_kill;
> > +             task_work_add(next, &next->l1d_flush_kill, true);
> 
> int task_work_add(struct task_struct *task, struct callback_head *twork,
>                   enum task_work_notify_mode mode);
> 
> true is truly not a valid enum constant ....

:) I might really have added it when we were transitioning from true to
TWA_RESUME. I am surprised the compiler did not catch it; I'll move it
over.

> 
> > +     }
> 
> So you really want to have:
> 
> DEFINE_STATIC_KEY_FALSE(l1dflush_enabled);
> static bool l1dflush_mitigation __init_data;
> 
> and then with the command line option you set l1dflush_mitigation and in
> check_bugs() you invoke l1dflush_select_mitigation() which does:
> 
>        if (!l1dflush_mitigation || !boot_cpu_has_bug(X86_BUG_L1TF) ||
>            !boot_cpu_has(X86_FEATURE_FLUSH_L1D))
>                 return;
> 
>        static_branch_enable(&l1dflush_enabled);
> 
> and then in l1d_flush_out_prctl_set()
> 
>        if (!static_branch_unlikely(&l1dflush_enabled))
>                 return -ENOTSUPP;
> 
> Then make the whole switch machinery do:
> 
>       if (static_branch_unlikely(&l1dflush_enabled)) {
>             if (unlikely((next_mm | prev_mm) & LAST_USER_MM_L1D_FLUSH))
>                  l1dflush_evaluate(next_mm, prev_mm);
>       }
> 
> and l1dflush_evaluate()
> 
>      if (prev_mm & LAST_USER_MM_L1D_FLUSH)
>           l1d_flush();
> 
>      if ((next_mm & LAST_USER_MM_L1D_FLUSH) &&
>          this_cpu_read(cpu_info.smt_active)) {
> 
>           clear_ti_thread_flag(&next->thread_info, TIF_SPEC_L1D_FLUSH);
>           next->l1d_flush_kill.func = l1d_flush_kill;
>           task_work_add(next, &next->l1d_flush_kill, TWA_RESUME);
>      }
> 
> That way the overhead is on systems where the admin decides to enable it
> and if enabled the evaluation of prev_mm and next_mm is pushed out of
> line.
>

OK, I'll rewrite it and see how it looks
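
For reference, the command line wiring suggested above might look roughly
like the sketch below; the option name "l1d_flush_out" and the default-off
"on" semantics follow the review comments and are assumptions, not final:

	static bool l1dflush_mitigation __initdata;
	DEFINE_STATIC_KEY_FALSE(l1dflush_enabled);

	static int __init l1dflush_parse_cmdline(char *str)
	{
		if (!strcmp(str, "on"))
			l1dflush_mitigation = true;
		return 0;
	}
	early_param("l1d_flush_out", l1dflush_parse_cmdline);

	/* Called from check_bugs(): enable only when requested and applicable */
	static void __init l1dflush_select_mitigation(void)
	{
		if (!l1dflush_mitigation || !boot_cpu_has_bug(X86_BUG_L1TF) ||
		    !boot_cpu_has(X86_FEATURE_FLUSH_L1D))
			return;

		static_branch_enable(&l1dflush_enabled);
	}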

Thanks for the review,
Balbir Singh


Thread overview: 12+ messages
2020-11-27  6:59 [PATCH v3 0/5] Next revision of the L1D flush patches Balbir Singh
2020-11-27  6:59 ` [PATCH v3 1/5] x86/mm: change l1d flush runtime prctl behaviour Balbir Singh
2020-12-04 21:07   ` Thomas Gleixner
2020-12-04 22:44     ` Singh, Balbir
2020-11-27  6:59 ` [PATCH v3 2/5] x86/mm: Refactor cond_ibpb() to support other use cases Balbir Singh
2020-11-27  6:59 ` [PATCH v3 3/5] x86/mm: Optionally flush L1D on context switch Balbir Singh
2020-12-04 21:21   ` Thomas Gleixner
2020-12-04 22:41     ` Singh, Balbir
2020-11-27  6:59 ` [PATCH v3 4/5] prctl: Hook L1D flushing in via prctl Balbir Singh
2020-12-04 22:19   ` Thomas Gleixner
2020-12-05  2:56     ` Balbir Singh
2020-11-27  6:59 ` [PATCH v3 5/5] Documentation: Add L1D flushing Documentation Balbir Singh
