* [patch V4 00/11] MDS basics
@ 2019-02-22 22:24 Thomas Gleixner
  2019-02-22 22:24 ` [patch V4 01/11] x86/msr-index: Cleanup bit defines Thomas Gleixner
                   ` (14 more replies)
  0 siblings, 15 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
  To: speck

Hi!

Another day, another update.

Changes since V3:

  - Add the #DF mitigation and document why I can't be bothered
    to sprinkle the buffer clear into #MC

  - Add a comment about the segment selector choice. It makes sense on its
    own but it won't prevent anyone from thinking that we're crazy.

  - Addressed the review feedback vs. documentation

  - Resurrected the admin documentation patch, tidied it up and filled the
    gaps.

Delta patch without the admin documentation parts below.

Git tree WIP.mds branch is updated as well.

If any of the people new to this needs access to the git repo,
please send me a public SSH key so I can add it to the gitolite config.

There is one point left which I did not look into yet and I'm happy to
delegate that to the virtualization wizards:

  XEON PHI is not affected by L1TF, so it won't get the L1TF
  mitigations. But it is affected by MSBDS, so it needs separate
  mitigation, i.e. clearing CPU buffers on VMENTER.


Thanks,

	Thomas

8<-------------------

 Documentation/ABI/testing/sysfs-devices-system-cpu |    1 
 Documentation/admin-guide/hw-vuln/index.rst        |   13 +
 Documentation/admin-guide/hw-vuln/l1tf.rst         |    1 
 Documentation/admin-guide/hw-vuln/mds.rst          |  258 +++++++++++++++++++++
 Documentation/admin-guide/index.rst                |    6 
 Documentation/admin-guide/kernel-parameters.txt    |   27 ++
 Documentation/index.rst                            |    1 
 Documentation/x86/conf.py                          |   10 
 Documentation/x86/index.rst                        |    8 
 Documentation/x86/mds.rst                          |  205 ++++++++++++++++
 arch/x86/entry/common.c                            |   10 
 arch/x86/include/asm/cpufeatures.h                 |    2 
 arch/x86/include/asm/irqflags.h                    |    4 
 arch/x86/include/asm/msr-index.h                   |   39 +--
 arch/x86/include/asm/mwait.h                       |    7 
 arch/x86/include/asm/nospec-branch.h               |   39 +++
 arch/x86/include/asm/processor.h                   |    7 
 arch/x86/kernel/cpu/bugs.c                         |  105 ++++++++
 arch/x86/kernel/cpu/common.c                       |   13 +
 arch/x86/kernel/nmi.c                              |    6 
 arch/x86/kernel/traps.c                            |    9 
 arch/x86/kvm/cpuid.c                               |    3 
 drivers/base/cpu.c                                 |    8 
 include/linux/cpu.h                                |    2 
 24 files changed, 762 insertions(+), 22 deletions(-)

diff --git a/Documentation/x86/mds.rst b/Documentation/x86/mds.rst
index 0c0d802367e6..ce3dbddbd3b8 100644
--- a/Documentation/x86/mds.rst
+++ b/Documentation/x86/mds.rst
@@ -1,7 +1,12 @@
 Microarchitectural Data Sampling (MDS) mitigation
 =================================================
 
-Microarchitectural Data Sampling (MDS) is a class of side channel attacks
+.. _mds:
+
+Overview
+--------
+
+Microarchitectural Data Sampling (MDS) is a family of side channel attacks
 on internal buffers in Intel CPUs. The variants are:
 
  - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
@@ -33,6 +38,7 @@ faulting or assisting loads under certain conditions, which again can be
 exploited eventually. Load ports are shared between Hyper-Threads so cross
 thread leakage is possible.
 
+
 Exposure assumptions
 --------------------
 
@@ -48,7 +54,7 @@ needed for exploiting MDS requires:
  - to control the pointer through which the disclosure gadget exposes the
    data
 
-The existance of such a construct cannot be excluded with 100% certainty,
+The existence of such a construct cannot be excluded with 100% certainty,
 but the complexity involved makes it extremely unlikely.
 
 There is one exception, which is untrusted BPF. The functionality of
@@ -91,13 +97,37 @@ the invocation can be enforced or conditional.
 As a special quirk to address virtualization scenarios where the host has
 the microcode updated, but the hypervisor does not (yet) expose the
 MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
-hope that it might work. The state is reflected accordingly.
+hope that it might actually clear the buffers. The state is reflected
+accordingly.
 
 According to current knowledge additional mitigations inside the kernel
 itself are not required because the necessary gadgets to expose the leaked
 data cannot be controlled in a way which allows exploitation from malicious
 user space or VM guests.
 
+
+Kernel internal mitigation modes
+--------------------------------
+
+ ======= ===========================================================
+ off     Mitigation is disabled. Either the CPU is not affected or
+         mds=off is supplied on the kernel command line
+
+ full    Mitigation is enabled. CPU is affected and MD_CLEAR is
+         advertised in CPUID.
+
+ vmwerv  Mitigation is enabled. CPU is affected and MD_CLEAR is not
+         advertised in CPUID. That is mainly for virtualization
+         scenarios where the host has the updated microcode but the
+         hypervisor does not expose MD_CLEAR in CPUID. It's a best
+         effort approach without guarantee.
+ ======= ===========================================================
+
+If the CPU is affected and mds=off is not supplied on the kernel
+command line then the kernel selects the appropriate mitigation mode
+depending on the availability of the MD_CLEAR CPUID bit.
+
+
 Mitigation points
 -----------------
 
@@ -128,8 +158,16 @@ Mitigation points
    coverage.
 
    There is one non maskable exception which returns through paranoid exit
-   and is not mitigated: #DF. If user space is able to trigger a double
-   fault the possible MDS leakage is the least problem to worry about.
+   and is to some extent controllable from user space through
+   modify_ldt(2): #DF. So mitigation is required in the double fault
+   handler as well.
+
+   Another corner case is a #MC which hits between the buffer clear and the
+   actual return to user. As this still is in kernel space it takes the
+   paranoid exit path which does not clear the CPU buffers. So the #MC
+   handler repopulates the buffers to some extent. Machine checks are not
+   reliably controllable and the window is extremely small so mitigation
+   would just tick a checkbox that this theoretical corner case is covered.
 
 
 2. C-State transition
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 8be9158d848e..3e27ccd6d5c5 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -338,6 +338,8 @@ static inline void mds_clear_cpu_buffers(void)
 	 * Has to be the memory-operand variant because only that
 	 * guarantees the CPU buffer flush functionality according to
 	 * documentation. The register-operand variant does not.
+	 * Works with any segment selector, but a valid writable
+	 * data segment is the fastest variant.
 	 *
 	 * "cc" clobber is required because VERW modifies ZF.
 	 */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 0fb241a78de3..83b19bb54093 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -68,6 +68,7 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
 DEFINE_STATIC_KEY_FALSE(mds_user_clear);
 /* Control MDS CPU buffer clear before idling (halt, mwait) */
 DEFINE_STATIC_KEY_FALSE(mds_idle_clear);
+EXPORT_SYMBOL_GPL(mds_idle_clear);
 
 void __init check_bugs(void)
 {
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9b7c4ca8f0a7..d2779f4730f5 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -366,6 +366,15 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 		regs->ip = (unsigned long)general_protection;
 		regs->sp = (unsigned long)&gpregs->orig_ax;
 
+		/*
+		 * This situation can be triggered by userspace via
+		 * modify_ldt(2) and the return does not take the regular
+		 * user space exit, so a CPU buffer clear is required when
+		 * MDS mitigation is enabled.
+		 */
+		if (static_branch_unlikely(&mds_user_clear))
+			mds_clear_cpu_buffers();
+
 		return;
 	}
 #endif

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [patch V4 01/11] x86/msr-index: Cleanup bit defines
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
@ 2019-02-22 22:24 ` Thomas Gleixner
  2019-02-22 22:24 ` [patch V4 02/11] x86/speculation/mds: Add basic bug infrastructure for MDS Thomas Gleixner
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
  To: speck; +Cc: Greg Kroah-Hartman, Borislav Petkov

From: Thomas Gleixner <tglx@linutronix.de>

Greg pointed out that speculation related bit defines are using (1 << N)
format instead of BIT(N). Aside from that, (1 << N) is wrong as it should
use at least 1UL.

Clean it up.

Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/msr-index.h |   34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_MSR_INDEX_H
 #define _ASM_X86_MSR_INDEX_H
 
+#include <linux/bits.h>
+
 /*
  * CPU model specific register (MSR) numbers.
  *
@@ -40,14 +42,14 @@
 /* Intel MSRs. Some also available on other CPUs */
 
 #define MSR_IA32_SPEC_CTRL		0x00000048 /* Speculation Control */
-#define SPEC_CTRL_IBRS			(1 << 0)   /* Indirect Branch Restricted Speculation */
+#define SPEC_CTRL_IBRS			BIT(0)	   /* Indirect Branch Restricted Speculation */
 #define SPEC_CTRL_STIBP_SHIFT		1	   /* Single Thread Indirect Branch Predictor (STIBP) bit */
-#define SPEC_CTRL_STIBP			(1 << SPEC_CTRL_STIBP_SHIFT)	/* STIBP mask */
+#define SPEC_CTRL_STIBP			BIT(SPEC_CTRL_STIBP_SHIFT)	/* STIBP mask */
 #define SPEC_CTRL_SSBD_SHIFT		2	   /* Speculative Store Bypass Disable bit */
-#define SPEC_CTRL_SSBD			(1 << SPEC_CTRL_SSBD_SHIFT)	/* Speculative Store Bypass Disable */
+#define SPEC_CTRL_SSBD			BIT(SPEC_CTRL_SSBD_SHIFT)	/* Speculative Store Bypass Disable */
 
 #define MSR_IA32_PRED_CMD		0x00000049 /* Prediction Command */
-#define PRED_CMD_IBPB			(1 << 0)   /* Indirect Branch Prediction Barrier */
+#define PRED_CMD_IBPB			BIT(0)	   /* Indirect Branch Prediction Barrier */
 
 #define MSR_PPIN_CTL			0x0000004e
 #define MSR_PPIN			0x0000004f
@@ -69,20 +71,20 @@
 #define MSR_MTRRcap			0x000000fe
 
 #define MSR_IA32_ARCH_CAPABILITIES	0x0000010a
-#define ARCH_CAP_RDCL_NO		(1 << 0)   /* Not susceptible to Meltdown */
-#define ARCH_CAP_IBRS_ALL		(1 << 1)   /* Enhanced IBRS support */
-#define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH	(1 << 3)   /* Skip L1D flush on vmentry */
-#define ARCH_CAP_SSB_NO			(1 << 4)   /*
-						    * Not susceptible to Speculative Store Bypass
-						    * attack, so no Speculative Store Bypass
-						    * control required.
-						    */
+#define ARCH_CAP_RDCL_NO		BIT(0)	/* Not susceptible to Meltdown */
+#define ARCH_CAP_IBRS_ALL		BIT(1)	/* Enhanced IBRS support */
+#define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH	BIT(3)	/* Skip L1D flush on vmentry */
+#define ARCH_CAP_SSB_NO			BIT(4)	/*
+						 * Not susceptible to Speculative Store Bypass
+						 * attack, so no Speculative Store Bypass
+						 * control required.
+						 */
 
 #define MSR_IA32_FLUSH_CMD		0x0000010b
-#define L1D_FLUSH			(1 << 0)   /*
-						    * Writeback and invalidate the
-						    * L1 data cache.
-						    */
+#define L1D_FLUSH			BIT(0)	/*
+						 * Writeback and invalidate the
+						 * L1 data cache.
+						 */
 
 #define MSR_IA32_BBL_CR_CTL		0x00000119
 #define MSR_IA32_BBL_CR_CTL3		0x0000011e


* [patch V4 02/11] x86/speculation/mds: Add basic bug infrastructure for MDS
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
  2019-02-22 22:24 ` [patch V4 01/11] x86/msr-index: Cleanup bit defines Thomas Gleixner
@ 2019-02-22 22:24 ` Thomas Gleixner
  2019-02-23  1:28   ` [MODERATED] " Linus Torvalds
  2019-02-22 22:24 ` [patch V4 03/11] x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests Thomas Gleixner
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen, Borislav Petkov, Greg Kroah-Hartman

From: Andi Kleen <ak@linux.intel.com>

Microarchitectural Data Sampling (MDS) is a class of side channel attacks
on internal buffers in Intel CPUs. The variants are:

 - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
 - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
 - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)

MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
dependent load (store-to-load forwarding) as an optimization. The forward
can also happen to a faulting or assisting load operation for a different
memory address, which can be exploited under certain conditions. Store
buffers are partitioned between Hyper-Threads so cross thread forwarding is
not possible. But if a thread enters or exits a sleep state the store
buffer is repartitioned which can expose data from one thread to the other.

MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
L1 miss situations and to hold data which is returned or sent in response
to a memory or I/O operation. Fill buffers can forward data to a load
operation and also write data to the cache. When the fill buffer is
deallocated it can retain the stale data of the preceding operations which
can then be forwarded to a faulting or assisting load operation, which can
be exploited under certain conditions. Fill buffers are shared between
Hyper-Threads so cross thread leakage is possible.

MLPDS leaks Load Port Data. Load ports are used to perform load operations
from memory or I/O. The received data is then forwarded to the register
file or a subsequent operation. In some implementations the Load Port can
contain stale data from a previous operation which can be forwarded to
faulting or assisting loads under certain conditions, which again can be
exploited eventually. Load ports are shared between Hyper-Threads so cross
thread leakage is possible.

All variants have the same mitigation for the single CPU thread case (SMT
off), so the kernel can treat them as one MDS issue.

Add the basic infrastructure to detect if the current CPU is affected by
MDS.

[ tglx: Rewrote changelog ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
V3: Addressed Borislav's review comments
---
 arch/x86/include/asm/cpufeatures.h |    2 ++
 arch/x86/include/asm/msr-index.h   |    5 +++++
 arch/x86/kernel/cpu/common.c       |   13 +++++++++++++
 3 files changed, 20 insertions(+)

--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -344,6 +344,7 @@
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EDX), word 18 */
 #define X86_FEATURE_AVX512_4VNNIW	(18*32+ 2) /* AVX-512 Neural Network Instructions */
 #define X86_FEATURE_AVX512_4FMAPS	(18*32+ 3) /* AVX-512 Multiply Accumulation Single precision */
+#define X86_FEATURE_MD_CLEAR		(18*32+10) /* VERW clears CPU buffers */
 #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
@@ -381,5 +382,6 @@
 #define X86_BUG_SPECTRE_V2		X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
 #define X86_BUG_SPEC_STORE_BYPASS	X86_BUG(17) /* CPU is affected by speculative store bypass attack */
 #define X86_BUG_L1TF			X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
+#define X86_BUG_MDS			X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -79,6 +79,11 @@
 						 * attack, so no Speculative Store Bypass
 						 * control required.
 						 */
+#define ARCH_CAP_MDS_NO			BIT(5)   /*
+						  * Not susceptible to
+						  * Microarchitectural Data
+						  * Sampling (MDS) vulnerabilities.
+						  */
 
 #define MSR_IA32_FLUSH_CMD		0x0000010b
 #define L1D_FLUSH			BIT(0)	/*
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -998,6 +998,14 @@ static const __initconst struct x86_cpu_
 	{}
 };
 
+static const __initconst struct x86_cpu_id cpu_no_mds[] = {
+	/* in addition to cpu_no_speculation */
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT	},
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT_X	},
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT_PLUS	},
+	{}
+};
+
 static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 {
 	u64 ia32_cap = 0;
@@ -1019,6 +1027,11 @@ static void __init cpu_set_bug_bits(stru
 	if (ia32_cap & ARCH_CAP_IBRS_ALL)
 		setup_force_cpu_cap(X86_FEATURE_IBRS_ENHANCED);
 
+	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+	    !x86_match_cpu(cpu_no_mds)) &&
+	    !(ia32_cap & ARCH_CAP_MDS_NO))
+		setup_force_cpu_bug(X86_BUG_MDS);
+
 	if (x86_match_cpu(cpu_no_meltdown))
 		return;
 


* [patch V4 03/11] x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
  2019-02-22 22:24 ` [patch V4 01/11] x86/msr-index: Cleanup bit defines Thomas Gleixner
  2019-02-22 22:24 ` [patch V4 02/11] x86/speculation/mds: Add basic bug infrastructure for MDS Thomas Gleixner
@ 2019-02-22 22:24 ` Thomas Gleixner
  2019-02-22 22:24 ` [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() Thomas Gleixner
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen, Borislav Petkov, Greg Kroah-Hartman

From: Andi Kleen <ak@linux.intel.com>
Subject: [patch V4 03/11] x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests

X86_FEATURE_MD_CLEAR is a new CPUID bit which is set when microcode
provides the mechanism to invoke a flush of various exploitable CPU buffers
by invoking the VERW instruction.

Hand it through to guests so they can adjust their mitigations.

This also requires corresponding qemu changes, which are available
separately.

[ tglx: Massaged changelog ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 arch/x86/kvm/cpuid.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -409,7 +409,8 @@ static inline int __do_cpuid_ent(struct
 	/* cpuid 7.0.edx*/
 	const u32 kvm_cpuid_7_0_edx_x86_features =
 		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
-		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP);
+		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
+		F(MD_CLEAR);
 
 	/* all calls to cpuid_count() should be made on the same cpu */
 	get_cpu();


* [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
                   ` (2 preceding siblings ...)
  2019-02-22 22:24 ` [patch V4 03/11] x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests Thomas Gleixner
@ 2019-02-22 22:24 ` Thomas Gleixner
  2019-02-25 16:06   ` [MODERATED] " Frederic Weisbecker
                     ` (2 more replies)
  2019-02-22 22:24 ` [patch V4 05/11] x86/speculation/mds: Clear CPU buffers on exit to user Thomas Gleixner
                   ` (10 subsequent siblings)
  14 siblings, 3 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
  To: speck; +Cc: Borislav Petkov, Greg Kroah-Hartman

From: Thomas Gleixner <tglx@linutronix.de>

The Microarchitectural Data Sampling (MDS) vulnerabilities are mitigated by
clearing the affected CPU buffers. The mechanism for clearing the buffers
uses the unused and obsolete VERW instruction in combination with a
microcode update which triggers a CPU buffer clear when VERW is executed.

Provide an inline function with the assembly magic. The argument of the VERW
instruction must be a memory operand as documented:

  "MD_CLEAR enumerates that the memory-operand variant of VERW (for
   example, VERW m16) has been extended to also overwrite buffers affected
   by MDS. This buffer overwriting functionality is not guaranteed for the
   register operand variant of VERW."

Documentation also recommends using a writable data segment selector:

  "The buffer overwriting occurs regardless of the result of the VERW
   permission check, as well as when the selector is null or causes a
   descriptor load segment violation. However, for lowest latency we
   recommend using a selector that indicates a valid writable data
   segment."

Add x86 specific documentation about MDS and the internal workings of the
mitigation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
V3 --> V4: Document the segment selector choice as well.

V2 --> V3: Add VERW documentation and fix typos/grammar..., dropped 'i(0)'
       	   Add more details to the documentation file

V1 --> V2: Add "cc" clobber and documentation
---
 Documentation/index.rst              |    1 
 Documentation/x86/conf.py            |   10 +++
 Documentation/x86/index.rst          |    8 ++
 Documentation/x86/mds.rst            |  100 +++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/nospec-branch.h |   25 ++++++++
 5 files changed, 144 insertions(+)

--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -101,6 +101,7 @@ implementation.
    :maxdepth: 2
 
    sh/index
+   x86/index
 
 Filesystem Documentation
 ------------------------
--- /dev/null
+++ b/Documentation/x86/conf.py
@@ -0,0 +1,10 @@
+# -*- coding: utf-8; mode: python -*-
+
+project = "X86 architecture specific documentation"
+
+tags.add("subproject")
+
+latex_documents = [
+    ('index', 'x86.tex', project,
+     'The kernel development community', 'manual'),
+]
--- /dev/null
+++ b/Documentation/x86/index.rst
@@ -0,0 +1,8 @@
+==========================
+x86 architecture specifics
+==========================
+
+.. toctree::
+   :maxdepth: 1
+
+   mds
--- /dev/null
+++ b/Documentation/x86/mds.rst
@@ -0,0 +1,100 @@
+Microarchitectural Data Sampling (MDS) mitigation
+=================================================
+
+.. _mds:
+
+Overview
+--------
+
+Microarchitectural Data Sampling (MDS) is a family of side channel attacks
+on internal buffers in Intel CPUs. The variants are:
+
+ - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
+ - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
+ - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
+
+MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
+dependent load (store-to-load forwarding) as an optimization. The forward
+can also happen to a faulting or assisting load operation for a different
+memory address, which can be exploited under certain conditions. Store
+buffers are partitioned between Hyper-Threads so cross thread forwarding is
+not possible. But if a thread enters or exits a sleep state the store
+buffer is repartitioned which can expose data from one thread to the other.
+
+MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
+L1 miss situations and to hold data which is returned or sent in response
+to a memory or I/O operation. Fill buffers can forward data to a load
+operation and also write data to the cache. When the fill buffer is
+deallocated it can retain the stale data of the preceding operations which
+can then be forwarded to a faulting or assisting load operation, which can
+be exploited under certain conditions. Fill buffers are shared between
+Hyper-Threads so cross thread leakage is possible.
+
+MLPDS leaks Load Port Data. Load ports are used to perform load operations
+from memory or I/O. The received data is then forwarded to the register
+file or a subsequent operation. In some implementations the Load Port can
+contain stale data from a previous operation which can be forwarded to
+faulting or assisting loads under certain conditions, which again can be
+exploited eventually. Load ports are shared between Hyper-Threads so cross
+thread leakage is possible.
+
+
+Exposure assumptions
+--------------------
+
+It is assumed that attack code resides in user space or in a guest with one
+exception. The rationale behind this assumption is that the code construct
+needed for exploiting MDS requires:
+
+ - to control the load to trigger a fault or assist
+
+ - to have a disclosure gadget which exposes the speculatively accessed
+   data for consumption through a side channel.
+
+ - to control the pointer through which the disclosure gadget exposes the
+   data
+
+The existence of such a construct cannot be excluded with 100% certainty,
+but the complexity involved makes it extremely unlikely.
+
+There is one exception, which is untrusted BPF. The functionality of
+untrusted BPF is limited, but it needs to be thoroughly investigated
+whether it can be used to create such a construct.
+
+
+Mitigation strategy
+-------------------
+
+All variants have the same mitigation strategy at least for the single CPU
+thread case (SMT off): Force the CPU to clear the affected buffers.
+
+This is achieved by using the otherwise unused and obsolete VERW
+instruction in combination with a microcode update. The microcode clears
+the affected CPU buffers when the VERW instruction is executed.
+
+For virtualization there are two ways to achieve CPU buffer
+clearing. Either the modified VERW instruction or via the L1D Flush
+command. The latter is issued when L1TF mitigation is enabled so the extra
+VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to
+be issued.
+
+If the VERW instruction with the supplied segment selector argument is
+executed on a CPU without the microcode update there is no side effect
+other than a small number of pointlessly wasted CPU cycles.
+
+This does not protect against cross Hyper-Thread attacks except for MSBDS
+which is only exploitable cross Hyper-thread when one of the Hyper-Threads
+enters a C-state.
+
+The kernel provides a function to invoke the buffer clearing:
+
+    mds_clear_cpu_buffers()
+
+The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
+(idle) transitions. Depending on the mitigation mode and the system state
+the invocation can be enforced or conditional.
+
+According to current knowledge additional mitigations inside the kernel
+itself are not required because the necessary gadgets to expose the leaked
+data cannot be controlled in a way which allows exploitation from malicious
+user space or VM guests.
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -318,6 +318,31 @@ DECLARE_STATIC_KEY_FALSE(switch_to_cond_
 DECLARE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
 DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
 
+#include <asm/segment.h>
+
+/**
+ * mds_clear_cpu_buffers - Mitigation for MDS vulnerability
+ *
+ * This uses the otherwise unused and obsolete VERW instruction in
+ * combination with microcode which triggers a CPU buffer flush when the
+ * instruction is executed.
+ */
+static inline void mds_clear_cpu_buffers(void)
+{
+	static const u16 ds = __KERNEL_DS;
+
+	/*
+	 * Has to be the memory-operand variant because only that
+	 * guarantees the CPU buffer flush functionality according to
+	 * documentation. The register-operand variant does not.
+	 * Works with any segment selector, but a valid writable
+	 * data segment is the fastest variant.
+	 *
+	 * "cc" clobber is required because VERW modifies ZF.
+	 */
+	asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
+}
+
 #endif /* __ASSEMBLY__ */
 
 /*


* [patch V4 05/11] x86/speculation/mds: Clear CPU buffers on exit to user
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
                   ` (3 preceding siblings ...)
  2019-02-22 22:24 ` [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() Thomas Gleixner
@ 2019-02-22 22:24 ` Thomas Gleixner
  2019-02-25 21:04   ` [MODERATED] " Greg KH
  2019-02-26 15:20   ` Josh Poimboeuf
  2019-02-22 22:24 ` [patch V4 06/11] x86/speculation/mds: Conditionally clear CPU buffers on idle entry Thomas Gleixner
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
  To: speck

From: Thomas Gleixner <tglx@linutronix.de>

Add a static key which controls the invocation of the CPU buffer clear
mechanism on exit to user space and add the call into
prepare_exit_to_usermode() and do_nmi() right before actually returning.

Add documentation which kernel to user space transition this covers and
explain why some corner cases are not mitigated.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V3 --> V4: Add #DF mitigation and document that the #MC corner case
       	   is really not interesting.

V3: Add NMI conditional on user regs and update documentation accordingly.
    Use the static branch scheme suggested by Peter. Fix typos ...
---
 Documentation/x86/mds.rst            |   41 +++++++++++++++++++++++++++++++++++
 arch/x86/entry/common.c              |   10 ++++++++
 arch/x86/include/asm/nospec-branch.h |    2 +
 arch/x86/kernel/cpu/bugs.c           |    4 ++-
 arch/x86/kernel/nmi.c                |    6 +++++
 arch/x86/kernel/traps.c              |    9 +++++++
 6 files changed, 71 insertions(+), 1 deletion(-)

--- a/Documentation/x86/mds.rst
+++ b/Documentation/x86/mds.rst
@@ -94,3 +94,44 @@ According to current knowledge additiona
 itself are not required because the necessary gadgets to expose the leaked
 data cannot be controlled in a way which allows exploitation from malicious
 user space or VM guests.
+
+Mitigation points
+-----------------
+
+1. Return to user space
+^^^^^^^^^^^^^^^^^^^^^^^
+   When transitioning from kernel to user space the CPU buffers are flushed
+   on affected CPUs:
+
+   - always when the mitigation mode is full. The mitigation is enabled
+     through the static key mds_user_clear.
+
+   This covers transitions from kernel to user space through a return to
+   user space from a syscall and from an interrupt or a regular exception.
+
+   There are other kernel to user space transitions which are not covered
+   by this: NMIs and all non maskable exceptions which go through the
+   paranoid exit, which means that they are not invoking the regular
+   prepare_exit_to_usermode() which handles the CPU buffer clearing.
+
+   Access to sensitive data like keys or credentials in the NMI context
+   is mostly theoretical: the CPU can do prefetching or execute a
+   misspeculated code path and thereby fetch data which might end up
+   leaking through a buffer.
+
+   But for mounting other attacks the kernel stack address of the task is
+   already valuable information. So in full mitigation mode, the NMI is
+   mitigated on the return from do_nmi() to provide almost complete
+   coverage.
+
+   There is one non-maskable exception which returns through paranoid exit
+   and is to some extent controllable from user space through
+   modify_ldt(2): #DF. So mitigation is required in the double fault
+   handler as well.
+
+   Another corner case is a #MC which hits between the buffer clear and the
+   actual return to user. As this is still in kernel space it takes the
+   paranoid exit path which does not clear the CPU buffers. So the #MC
+   handler repopulates the buffers to some extent. Machine checks are not
+   reliably controllable and the window is extremely small so mitigation
+   would just tick a checkbox that this theoretical corner case is covered.
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -31,6 +31,7 @@
 #include <asm/vdso.h>
 #include <linux/uaccess.h>
 #include <asm/cpufeature.h>
+#include <asm/nospec-branch.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
@@ -180,6 +181,13 @@ static void exit_to_usermode_loop(struct
 	}
 }
 
+static inline void mds_user_clear_cpu_buffers(void)
+{
+	if (!static_branch_likely(&mds_user_clear))
+		return;
+	mds_clear_cpu_buffers();
+}
+
 /* Called with IRQs disabled. */
 __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 {
@@ -212,6 +220,8 @@ static void exit_to_usermode_loop(struct
 #endif
 
 	user_enter_irqoff();
+
+	mds_user_clear_cpu_buffers();
 }
 
 #define SYSCALL_EXIT_WORK_FLAGS				\
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -318,6 +318,8 @@ DECLARE_STATIC_KEY_FALSE(switch_to_cond_
 DECLARE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
 DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
 
+DECLARE_STATIC_KEY_FALSE(mds_user_clear);
+
 #include <asm/segment.h>
 
 /**
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -63,10 +63,12 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_i
 /* Control unconditional IBPB in switch_mm() */
 DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
 
+/* Control MDS CPU buffer clear before returning to user space */
+DEFINE_STATIC_KEY_FALSE(mds_user_clear);
+
 void __init check_bugs(void)
 {
 	identify_boot_cpu();
-
 	/*
 	 * identify_boot_cpu() initialized SMT support information, let the
 	 * core code know.
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -34,6 +34,7 @@
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
 #include <asm/cache.h>
+#include <asm/nospec-branch.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/nmi.h>
@@ -533,6 +534,11 @@ do_nmi(struct pt_regs *regs, long error_
 		write_cr2(this_cpu_read(nmi_cr2));
 	if (this_cpu_dec_return(nmi_state))
 		goto nmi_restart;
+
+	if (!static_branch_likely(&mds_user_clear))
+		return;
+	if (user_mode(regs))
+		mds_clear_cpu_buffers();
 }
 NOKPROBE_SYMBOL(do_nmi);
 
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -366,6 +366,15 @@ dotraplinkage void do_double_fault(struc
 		regs->ip = (unsigned long)general_protection;
 		regs->sp = (unsigned long)&gpregs->orig_ax;
 
+		/*
+		 * This situation can be triggered by userspace via
+		 * modify_ldt(2) and the return does not take the regular
+		 * user space exit, so a CPU buffer clear is required when
+		 * MDS mitigation is enabled.
+		 */
+		if (static_branch_unlikely(&mds_user_clear))
+			mds_clear_cpu_buffers();
+
 		return;
 	}
 #endif

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch V4 06/11] x86/speculation/mds: Conditionally clear CPU buffers on idle entry
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
                   ` (4 preceding siblings ...)
  2019-02-22 22:24 ` [patch V4 05/11] x86/speculation/mds: Clear CPU buffers on exit to user Thomas Gleixner
@ 2019-02-22 22:24 ` Thomas Gleixner
  2019-02-25 21:09   ` [MODERATED] " Greg KH
  2019-02-26 15:31   ` Josh Poimboeuf
  2019-02-22 22:24 ` [patch V4 07/11] x86/speculation/mds: Add mitigation control for MDS Thomas Gleixner
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
  To: speck; +Cc: Borislav Petkov

From: Thomas Gleixner <tglx@linutronix.de>

Add a static key which controls the invocation of the CPU buffer clear
mechanism on idle entry. This is independent of other MDS mitigations
because the idle entry invocation to mitigate the potential leakage due to
store buffer repartitioning is only necessary on SMT systems.

Add the actual invocations to the different halt/mwait variants which
covers all usage sites. mwaitx is not patched as it's not available on
Intel CPUs.

The buffer clear is only invoked before entering the C-State to prevent
stale data from the idling CPU from being spilled to the Hyper-Thread
sibling after the store buffer got repartitioned and all entries are
available to the non-idle sibling.

When coming out of idle the store buffer is partitioned again so each
sibling has half of it available. The CPU which returned from idle could
then be speculatively exposed to contents of the sibling, but the buffers
are flushed either on exit to user space or on VMENTER.

When later on conditional buffer clearing is implemented on top of this,
then there is no action required either because before returning to user
space the context switch will set the condition flag which causes a flush
on the return to user path.

This intentionally does not handle the case in the acpi/processor_idle
driver which uses the legacy IO port interface for C-State transitions for
two reasons:

 - The acpi/processor_idle driver was replaced by the intel_idle driver
   almost a decade ago. Anything Nehalem upwards supports it and defaults
   to that new driver.

 - The legacy IO port interface is likely to be used on older and therefore
   unaffected CPUs or on systems which do not receive microcode updates
   anymore, so there is no point in adding that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
---
V4: Export mds_idle_clear
V3: Adjust document wording
---
 Documentation/x86/mds.rst            |   35 +++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/irqflags.h      |    4 ++++
 arch/x86/include/asm/mwait.h         |    7 +++++++
 arch/x86/include/asm/nospec-branch.h |   12 ++++++++++++
 arch/x86/kernel/cpu/bugs.c           |    3 +++
 5 files changed, 61 insertions(+)

--- a/Documentation/x86/mds.rst
+++ b/Documentation/x86/mds.rst
@@ -135,3 +135,38 @@ Mitigation points
    handler repopulates the buffers to some extent. Machine checks are not
    reliably controllable and the window is extremely small so mitigation
    would just tick a checkbox that this theoretical corner case is covered.
+
+
+2. C-State transition
+^^^^^^^^^^^^^^^^^^^^^
+
+   When a CPU goes idle and enters a C-State the CPU buffers need to be
+   cleared on affected CPUs when SMT is active. This addresses the
+   repartitioning of the store buffer when one of the Hyper-Threads enters
+   a C-State.
+
+   When SMT is inactive, i.e. either the CPU does not support it or all
+   sibling threads are offline, CPU buffer clearing is not required.
+
+   The invocation is controlled by the static key mds_idle_clear which is
+   switched depending on the chosen mitigation mode and the SMT state of
+   the system.
+
+   The buffer clear is only invoked before entering the C-State to prevent
+   stale data from the idling CPU from being spilled to the Hyper-Thread
+   sibling after the store buffer got repartitioned and all entries are
+   available to the non-idle sibling.
+
+   When coming out of idle the store buffer is partitioned again so each
+   sibling has half of it available. The CPU returning from idle could
+   then be speculatively exposed to contents of the sibling. The buffers
+   are flushed either on exit to user space or on VMENTER so malicious
+   code in user space or the guest cannot speculatively access them.
+
+   The mitigation is hooked into all variants of halt()/mwait(), but does
+   not cover the legacy ACPI IO-Port mechanism because the ACPI idle driver
+   has been superseded by the intel_idle driver around 2010 and is
+   preferred on all affected CPUs which are expected to gain the MD_CLEAR
+   functionality in microcode. Aside from that the IO-Port mechanism is a
+   legacy interface which is only used on older systems which are either
+   not affected or do not receive microcode updates anymore.
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -6,6 +6,8 @@
 
 #ifndef __ASSEMBLY__
 
+#include <asm/nospec-branch.h>
+
 /* Provide __cpuidle; we can't safely include <linux/cpu.h> */
 #define __cpuidle __attribute__((__section__(".cpuidle.text")))
 
@@ -54,11 +56,13 @@ static inline void native_irq_enable(voi
 
 static inline __cpuidle void native_safe_halt(void)
 {
+	mds_idle_clear_cpu_buffers();
 	asm volatile("sti; hlt": : :"memory");
 }
 
 static inline __cpuidle void native_halt(void)
 {
+	mds_idle_clear_cpu_buffers();
 	asm volatile("hlt": : :"memory");
 }
 
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -6,6 +6,7 @@
 #include <linux/sched/idle.h>
 
 #include <asm/cpufeature.h>
+#include <asm/nospec-branch.h>
 
 #define MWAIT_SUBSTATE_MASK		0xf
 #define MWAIT_CSTATE_MASK		0xf
@@ -40,6 +41,8 @@ static inline void __monitorx(const void
 
 static inline void __mwait(unsigned long eax, unsigned long ecx)
 {
+	mds_idle_clear_cpu_buffers();
+
 	/* "mwait %eax, %ecx;" */
 	asm volatile(".byte 0x0f, 0x01, 0xc9;"
 		     :: "a" (eax), "c" (ecx));
@@ -74,6 +77,8 @@ static inline void __mwait(unsigned long
 static inline void __mwaitx(unsigned long eax, unsigned long ebx,
 			    unsigned long ecx)
 {
+	/* No MDS buffer clear as this is AMD/HYGON only */
+
 	/* "mwaitx %eax, %ebx, %ecx;" */
 	asm volatile(".byte 0x0f, 0x01, 0xfb;"
 		     :: "a" (eax), "b" (ebx), "c" (ecx));
@@ -81,6 +86,8 @@ static inline void __mwaitx(unsigned lon
 
 static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
 {
+	mds_idle_clear_cpu_buffers();
+
 	trace_hardirqs_on();
 	/* "mwait %eax, %ecx;" */
 	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -319,6 +319,7 @@ DECLARE_STATIC_KEY_FALSE(switch_mm_cond_
 DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
 
 DECLARE_STATIC_KEY_FALSE(mds_user_clear);
+DECLARE_STATIC_KEY_FALSE(mds_idle_clear);
 
 #include <asm/segment.h>
 
@@ -345,6 +346,17 @@ static inline void mds_clear_cpu_buffers
 	asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
 }
 
+/**
+ * mds_idle_clear_cpu_buffers - Mitigation for MDS vulnerability
+ *
+ * Clear CPU buffers if the corresponding static key is enabled
+ */
+static inline void mds_idle_clear_cpu_buffers(void)
+{
+	if (static_branch_likely(&mds_idle_clear))
+		mds_clear_cpu_buffers();
+}
+
 #endif /* __ASSEMBLY__ */
 
 /*
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -65,6 +65,9 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_always
 
 /* Control MDS CPU buffer clear before returning to user space */
 DEFINE_STATIC_KEY_FALSE(mds_user_clear);
+/* Control MDS CPU buffer clear before idling (halt, mwait) */
+DEFINE_STATIC_KEY_FALSE(mds_idle_clear);
+EXPORT_SYMBOL_GPL(mds_idle_clear);
 
 void __init check_bugs(void)
 {


* [patch V4 07/11] x86/speculation/mds: Add mitigation control for MDS
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
                   ` (5 preceding siblings ...)
  2019-02-22 22:24 ` [patch V4 06/11] x86/speculation/mds: Conditionally clear CPU buffers on idle entry Thomas Gleixner
@ 2019-02-22 22:24 ` Thomas Gleixner
  2019-02-25 20:17   ` [MODERATED] " mark gross
  2019-02-26 15:50   ` Josh Poimboeuf
  2019-02-22 22:24 ` [patch V4 08/11] x86/speculation/mds: Add sysfs reporting " Thomas Gleixner
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
  To: speck; +Cc: Greg Kroah-Hartman, Borislav Petkov

From: Thomas Gleixner <tglx@linutronix.de>

Now that the mitigations are in place, add a command line parameter to
control the mitigation, a mitigation selector function and a SMT update
mechanism.

This is a minimal, straightforward initial implementation which just
provides an always on/off mode. The command line parameter is:

  mds=[full|off|auto]

This is consistent with the existing mitigations for other speculative
hardware vulnerabilities.

The idle invocation is dynamically updated according to the SMT state of
the system similar to the dynamic update of the STIBP mitigation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
---
 Documentation/admin-guide/kernel-parameters.txt |   27 ++++++++
 arch/x86/include/asm/processor.h                |    6 +
 arch/x86/kernel/cpu/bugs.c                      |   76 ++++++++++++++++++++++++
 3 files changed, 109 insertions(+)

--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2356,6 +2356,33 @@
 			Format: <first>,<last>
 			Specifies range of consoles to be captured by the MDA.
 
+	mds=		[X86,INTEL]
+			Control mitigation for the Micro-architectural Data
+			Sampling (MDS) vulnerability.
+
+			Certain CPUs are vulnerable to an exploit against CPU
+			internal buffers which can forward information to a
+			disclosure gadget under certain conditions.
+
+			In vulnerable processors, the speculatively
+			forwarded data can be used in a cache side channel
+			attack, to access data to which the attacker does
+			not have direct access.
+
+			This parameter controls the MDS mitigation. The
+			options are:
+
+			full    - Unconditionally enable MDS mitigation
+			off     - Unconditionally disable MDS mitigation
+			auto    - Kernel detects whether the CPU model is
+				  vulnerable to MDS and picks the most
+				  appropriate mitigation. If the CPU is not
+				  vulnerable, "off" is selected. If the CPU
+				  is vulnerable "full" is selected.
+
+			Not specifying this option is equivalent to
+			mds=auto.
+
 	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
 			Amount of memory to be used when the kernel is not able
 			to see the whole system memory or for test.
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -992,4 +992,10 @@ enum l1tf_mitigations {
 
 extern enum l1tf_mitigations l1tf_mitigation;
 
+enum mds_mitigations {
+	MDS_MITIGATION_OFF,
+	MDS_MITIGATION_AUTO,
+	MDS_MITIGATION_FULL,
+};
+
 #endif /* _ASM_X86_PROCESSOR_H */
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -37,6 +37,7 @@
 static void __init spectre_v2_select_mitigation(void);
 static void __init ssb_select_mitigation(void);
 static void __init l1tf_select_mitigation(void);
+static void __init mds_select_mitigation(void);
 
 /* The base value of the SPEC_CTRL MSR that always has to be preserved. */
 u64 x86_spec_ctrl_base;
@@ -106,6 +107,8 @@ void __init check_bugs(void)
 
 	l1tf_select_mitigation();
 
+	mds_select_mitigation();
+
 #ifdef CONFIG_X86_32
 	/*
 	 * Check whether we are able to run this kernel safely on SMP.
@@ -212,6 +215,59 @@ static void x86_amd_ssb_disable(void)
 }
 
 #undef pr_fmt
+#define pr_fmt(fmt)	"MDS: " fmt
+
+/* Default mitigation for MDS-affected CPUs */
+static enum mds_mitigations mds_mitigation __ro_after_init = MDS_MITIGATION_AUTO;
+
+static const char * const mds_strings[] = {
+	[MDS_MITIGATION_OFF]	= "Vulnerable",
+	[MDS_MITIGATION_FULL]	= "Mitigation: Clear CPU buffers"
+};
+
+static void mds_select_mitigation(void)
+{
+	if (!boot_cpu_has_bug(X86_BUG_MDS)) {
+		mds_mitigation = MDS_MITIGATION_OFF;
+		return;
+	}
+
+	switch (mds_mitigation) {
+	case MDS_MITIGATION_OFF:
+		break;
+	case MDS_MITIGATION_AUTO:
+	case MDS_MITIGATION_FULL:
+		if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) {
+			mds_mitigation = MDS_MITIGATION_FULL;
+			static_branch_enable(&mds_user_clear);
+		} else {
+			mds_mitigation = MDS_MITIGATION_OFF;
+		}
+		break;
+	}
+	pr_info("%s\n", mds_strings[mds_mitigation]);
+}
+
+static int __init mds_cmdline(char *str)
+{
+	if (!boot_cpu_has_bug(X86_BUG_MDS))
+		return 0;
+
+	if (!str)
+		return -EINVAL;
+
+	if (!strcmp(str, "off"))
+		mds_mitigation = MDS_MITIGATION_OFF;
+	else if (!strcmp(str, "auto"))
+		mds_mitigation = MDS_MITIGATION_AUTO;
+	else if (!strcmp(str, "full"))
+		mds_mitigation = MDS_MITIGATION_FULL;
+
+	return 0;
+}
+early_param("mds", mds_cmdline);
+
+#undef pr_fmt
 #define pr_fmt(fmt)     "Spectre V2 : " fmt
 
 static enum spectre_v2_mitigation spectre_v2_enabled __ro_after_init =
@@ -615,6 +671,15 @@ static void update_indir_branch_cond(voi
 		static_branch_disable(&switch_to_cond_stibp);
 }
 
+/* Update the static key controlling the MDS CPU buffer clear in idle */
+static void update_mds_branch_idle(void)
+{
+	if (sched_smt_active())
+		static_branch_enable(&mds_idle_clear);
+	else
+		static_branch_disable(&mds_idle_clear);
+}
+
 void arch_smt_update(void)
 {
 	/* Enhanced IBRS implies STIBP. No update required. */
@@ -636,6 +701,17 @@ void arch_smt_update(void)
 		break;
 	}
 
+	switch (mds_mitigation) {
+	case MDS_MITIGATION_OFF:
+		break;
+	case MDS_MITIGATION_FULL:
+		update_mds_branch_idle();
+		break;
+	/* Keep GCC happy */
+	case MDS_MITIGATION_AUTO:
+		break;
+	}
+
 	mutex_unlock(&spec_ctrl_mutex);
 }
 


* [patch V4 08/11] x86/speculation/mds: Add sysfs reporting for MDS
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
                   ` (6 preceding siblings ...)
  2019-02-22 22:24 ` [patch V4 07/11] x86/speculation/mds: Add mitigation control for MDS Thomas Gleixner
@ 2019-02-22 22:24 ` Thomas Gleixner
  2019-02-22 22:24 ` [patch V4 09/11] x86/speculation/mds: Add mitigation mode VMWERV Thomas Gleixner
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
  To: speck; +Cc: Greg Kroah-Hartman, Borislav Petkov

From: Thomas Gleixner <tglx@linutronix.de>

Add the sysfs reporting file for MDS. It exposes the vulnerability and
mitigation state similar to the existing files for the other speculative
hardware vulnerabilities.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
---
V3: Copy & Paste done right :(
---
 Documentation/ABI/testing/sysfs-devices-system-cpu |    1 +
 arch/x86/kernel/cpu/bugs.c                         |   20 ++++++++++++++++++++
 drivers/base/cpu.c                                 |    8 ++++++++
 include/linux/cpu.h                                |    2 ++
 4 files changed, 31 insertions(+)

--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -484,6 +484,7 @@ What:		/sys/devices/system/cpu/vulnerabi
 		/sys/devices/system/cpu/vulnerabilities/spectre_v2
 		/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
 		/sys/devices/system/cpu/vulnerabilities/l1tf
+		/sys/devices/system/cpu/vulnerabilities/mds
 Date:		January 2018
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:	Information about CPU vulnerabilities
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1176,6 +1176,17 @@ static ssize_t l1tf_show_state(char *buf
 }
 #endif
 
+static ssize_t mds_show_state(char *buf)
+{
+	if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
+		return sprintf(buf, "%s; SMT Host state unknown\n",
+			       mds_strings[mds_mitigation]);
+	}
+
+	return sprintf(buf, "%s; SMT %s\n", mds_strings[mds_mitigation],
+		       sched_smt_active() ? "vulnerable" : "disabled");
+}
+
 static char *stibp_state(void)
 {
 	if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
@@ -1242,6 +1253,10 @@ static ssize_t cpu_show_common(struct de
 		if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
 			return l1tf_show_state(buf);
 		break;
+
+	case X86_BUG_MDS:
+		return mds_show_state(buf);
+
 	default:
 		break;
 	}
@@ -1273,4 +1288,9 @@ ssize_t cpu_show_l1tf(struct device *dev
 {
 	return cpu_show_common(dev, attr, buf, X86_BUG_L1TF);
 }
+
+ssize_t cpu_show_mds(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return cpu_show_common(dev, attr, buf, X86_BUG_MDS);
+}
 #endif
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -546,11 +546,18 @@ ssize_t __weak cpu_show_l1tf(struct devi
 	return sprintf(buf, "Not affected\n");
 }
 
+ssize_t __weak cpu_show_mds(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "Not affected\n");
+}
+
 static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
 static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
 static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
 static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
 static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
+static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL);
 
 static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_meltdown.attr,
@@ -558,6 +565,7 @@ static struct attribute *cpu_root_vulner
 	&dev_attr_spectre_v2.attr,
 	&dev_attr_spec_store_bypass.attr,
 	&dev_attr_l1tf.attr,
+	&dev_attr_mds.attr,
 	NULL
 };
 
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -57,6 +57,8 @@ extern ssize_t cpu_show_spec_store_bypas
 					  struct device_attribute *attr, char *buf);
 extern ssize_t cpu_show_l1tf(struct device *dev,
 			     struct device_attribute *attr, char *buf);
+extern ssize_t cpu_show_mds(struct device *dev,
+			    struct device_attribute *attr, char *buf);
 
 extern __printf(4, 5)
 struct device *cpu_device_create(struct device *parent, void *drvdata,


* [patch V4 09/11] x86/speculation/mds: Add mitigation mode VMWERV
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
                   ` (7 preceding siblings ...)
  2019-02-22 22:24 ` [patch V4 08/11] x86/speculation/mds: Add sysfs reporting " Thomas Gleixner
@ 2019-02-22 22:24 ` Thomas Gleixner
  2019-02-23  9:52   ` [MODERATED] " Greg KH
  2019-02-25 20:31   ` mark gross
  2019-02-22 22:24 ` [patch V4 10/11] Documentation: Move L1TF to separate directory Thomas Gleixner
                   ` (5 subsequent siblings)
  14 siblings, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
  To: speck

From: Thomas Gleixner <tglx@linutronix.de>

In virtualized environments it can happen that the host has the microcode
update which utilizes the VERW instruction to clear CPU buffers, but the
hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
to guests.

Introduce an internal mitigation mode VMWERV which enables the invocation
of the CPU buffer clearing even if X86_FEATURE_MD_CLEAR is not set. If the
system has no updated microcode this results in a pointless execution of
the VERW instruction wasting a few CPU cycles. If the microcode is updated,
but not exposed to a guest then the CPU buffers will be cleared.

That said: Virtual Machines Will Eventually Receive Vaccine

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2 -> V3: Rename mode.
---
 Documentation/x86/mds.rst        |   29 +++++++++++++++++++++++++++++
 arch/x86/include/asm/processor.h |    1 +
 arch/x86/kernel/cpu/bugs.c       |   14 ++++++++------
 3 files changed, 38 insertions(+), 6 deletions(-)

--- a/Documentation/x86/mds.rst
+++ b/Documentation/x86/mds.rst
@@ -90,11 +90,40 @@ The mitigation is invoked on kernel/user
 (idle) transitions. Depending on the mitigation mode and the system state
 the invocation can be enforced or conditional.
 
+As a special quirk to address virtualization scenarios where the host has
+the microcode updated, but the hypervisor does not (yet) expose the
+MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
+hope that it might actually clear the buffers. The state is reflected
+accordingly.
+
 According to current knowledge additional mitigations inside the kernel
 itself are not required because the necessary gadgets to expose the leaked
 data cannot be controlled in a way which allows exploitation from malicious
 user space or VM guests.
 
+
+Kernel internal mitigation modes
+--------------------------------
+
+ ======= ===========================================================
+ off     Mitigation is disabled. Either the CPU is not affected or
+         mds=off is supplied on the kernel command line
+
+ full    Mitigation is enabled. CPU is affected and MD_CLEAR is
+         advertised in CPUID.
+
+ vmwerv  Mitigation is enabled. CPU is affected and MD_CLEAR is not
+         advertised in CPUID. That is mainly for virtualization
+         scenarios where the host has the updated microcode but the
+         hypervisor does not expose MD_CLEAR in CPUID. It's a best
+         effort approach without guarantee.
+ ======= ===========================================================
+
+If the CPU is affected and mds=off is not supplied on the kernel
+command line then the kernel selects the appropriate mitigation mode
+depending on the availability of the MD_CLEAR CPUID bit.
+
+
 Mitigation points
 -----------------
 
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -996,6 +996,7 @@ enum mds_mitigations {
 	MDS_MITIGATION_OFF,
 	MDS_MITIGATION_AUTO,
 	MDS_MITIGATION_FULL,
+	MDS_MITIGATION_VMWERV,
 };
 
 #endif /* _ASM_X86_PROCESSOR_H */
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -222,7 +222,8 @@ static enum mds_mitigations mds_mitigati
 
 static const char * const mds_strings[] = {
 	[MDS_MITIGATION_OFF]	= "Vulnerable",
-	[MDS_MITIGATION_FULL]	= "Mitigation: Clear CPU buffers"
+	[MDS_MITIGATION_FULL]	= "Mitigation: Clear CPU buffers",
+	[MDS_MITIGATION_VMWERV]	= "Vulnerable: Clear CPU buffers attempted, no microcode",
 };
 
 static void mds_select_mitigation(void)
@@ -237,12 +238,12 @@ static void mds_select_mitigation(void)
 		break;
 	case MDS_MITIGATION_AUTO:
 	case MDS_MITIGATION_FULL:
-		if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) {
+	case MDS_MITIGATION_VMWERV:
+		if (boot_cpu_has(X86_FEATURE_MD_CLEAR))
 			mds_mitigation = MDS_MITIGATION_FULL;
-			static_branch_enable(&mds_user_clear);
-		} else {
-			mds_mitigation = MDS_MITIGATION_OFF;
-		}
+		else
+			mds_mitigation = MDS_MITIGATION_VMWERV;
+		static_branch_enable(&mds_user_clear);
 		break;
 	}
 	pr_info("%s\n", mds_strings[mds_mitigation]);
@@ -705,6 +706,7 @@ void arch_smt_update(void)
 	case MDS_MITIGATION_OFF:
 		break;
 	case MDS_MITIGATION_FULL:
+	case MDS_MITIGATION_VMWERV:
 		update_mds_branch_idle();
 		break;
 	/* Keep GCC happy */


* [patch V4 10/11] Documentation: Move L1TF to separate directory
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
                   ` (8 preceding siblings ...)
  2019-02-22 22:24 ` [patch V4 09/11] x86/speculation/mds: Add mitigation mode VMWERV Thomas Gleixner
@ 2019-02-22 22:24 ` Thomas Gleixner
  2019-02-23  8:41   ` [MODERATED] " Greg KH
  2019-02-22 22:24 ` [patch V4 11/11] Documentation: Add MDS vulnerability documentation Thomas Gleixner
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
  To: speck

From: Thomas Gleixner <tglx@linutronix.de>

Move L1TF to a separate directory so the MDS documentation can be added
alongside it. That way all hardware vulnerabilities have their own top
level entry. Should have done that right away.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 Documentation/admin-guide/hw-vuln/index.rst |   12 
 Documentation/admin-guide/hw-vuln/l1tf.rst  |  614 ++++++++++++++++++++++++++++
 Documentation/admin-guide/index.rst         |    6 
 Documentation/admin-guide/l1tf.rst          |  614 ----------------------------
 4 files changed, 628 insertions(+), 618 deletions(-)

--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -0,0 +1,12 @@
+========================
+Hardware vulnerabilities
+========================
+
+This section describes CPU vulnerabilities and provides an overview of the
+possible mitigations along with guidance for selecting mitigations if they
+are configurable at compile, boot or run time.
+
+.. toctree::
+   :maxdepth: 1
+
+   l1tf
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/l1tf.rst
@@ -0,0 +1,614 @@
+L1TF - L1 Terminal Fault
+========================
+
+L1 Terminal Fault is a hardware vulnerability which allows unprivileged
+speculative access to data which is available in the Level 1 Data Cache
+when the page table entry controlling the virtual address, which is used
+for the access, has the Present bit cleared or other reserved bits set.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors. The
+vulnerability is not present on:
+
+   - Processors from AMD, Centaur and other non Intel vendors
+
+   - Older processor models, where the CPU family is < 6
+
+   - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
+     Penwell, Pineview, Silvermont, Airmont, Merrifield)
+
+   - The Intel XEON PHI family
+
+   - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
+     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
+     by the Meltdown vulnerability either. These CPUs should become
+     available by end of 2018.
+
+Whether a processor is affected or not can be read out from the L1TF
+vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
+
+Related CVEs
+------------
+
+The following CVE entries are related to the L1TF vulnerability:
+
+   =============  =================  ==============================
+   CVE-2018-3615  L1 Terminal Fault  SGX related aspects
+   CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
+   CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
+   =============  =================  ==============================
+
+Problem
+-------
+
+If an instruction accesses a virtual address for which the relevant page
+table entry (PTE) has the Present bit cleared or other reserved bits set,
+then speculative execution ignores the invalid PTE and loads the referenced
+data if it is present in the Level 1 Data Cache, as if the page referenced
+by the address bits in the PTE was still present and accessible.
+
+While this is a purely speculative mechanism and the instruction will raise
+a page fault when it is retired eventually, the pure act of loading the
+data and making it available to other speculative instructions opens up the
+opportunity for side channel attacks to unprivileged malicious code,
+similar to the Meltdown attack.
+
+While Meltdown breaks the user space to kernel space protection, L1TF
+allows to attack any physical memory address in the system and the attack
+works across all protection domains. It allows an attack of SGX and also
+works from inside virtual machines because the speculation bypasses the
+extended page table (EPT) protection mechanism.
+
+
+Attack scenarios
+----------------
+
+1. Malicious user space
+^^^^^^^^^^^^^^^^^^^^^^^
+
+   Operating Systems store arbitrary information in the address bits of a
+   PTE which is marked non present. This allows a malicious user space
+   application to attack the physical memory to which these PTEs resolve.
+   In some cases user-space can maliciously influence the information
+   encoded in the address bits of the PTE, thus making attacks more
+   deterministic and more practical.
+
+   The Linux kernel contains a mitigation for this attack vector, PTE
+   inversion, which is permanently enabled and has no performance
+   impact. The kernel ensures that the address bits of PTEs which are not
+   marked present never point to cacheable physical memory space.
+
+   A system with an up-to-date kernel is protected against attacks from
+   malicious user space applications.
+
+2. Malicious guest in a virtual machine
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   The fact that L1TF breaks all domain protections allows malicious guest
+   OSes, which can control the PTEs directly, and malicious guest user
+   space applications, which run on an unprotected guest kernel lacking the
+   PTE inversion mitigation for L1TF, to attack physical host memory.
+
+   A special aspect of L1TF in the context of virtualization is symmetric
+   multi threading (SMT). The Intel implementation of SMT is called
+   HyperThreading. The fact that Hyperthreads on the affected processors
+   share the L1 Data Cache (L1D) is important for this. As the flaw only
+   allows attacking data which is present in the L1D, a malicious guest running
+   on one Hyperthread can attack the data which is brought into the L1D by
+   the context which runs on the sibling Hyperthread of the same physical
+   core. This context can be host OS, host user space or a different guest.
+
+   If the processor does not support Extended Page Tables, the attack is
+   only possible when the hypervisor does not sanitize the content of the
+   effective (shadow) page tables.
+
+   While solutions exist to mitigate these attack vectors fully, these
+   mitigations are not enabled by default in the Linux kernel because they
+   can affect performance significantly. The kernel provides several
+   mechanisms which can be utilized to address the problem depending on the
+   deployment scenario. The mitigations, their protection scope and impact
+   are described in the next sections.
+
+   The default mitigations and the rationale for choosing them are explained
+   at the end of this document. See :ref:`default_mitigations`.
+
+.. _l1tf_sys_info:
+
+L1TF system information
+-----------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current L1TF
+status of the system: whether the system is vulnerable, and which
+mitigations are active. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/l1tf
+
+The possible values in this file are:
+
+  ===========================   ===============================
+  'Not affected'		The processor is not vulnerable
+  'Mitigation: PTE Inversion'	The host protection is active
+  ===========================   ===============================
+
+If KVM/VMX is enabled and the processor is vulnerable then the following
+information is appended to the 'Mitigation: PTE Inversion' part:
+
+  - SMT status:
+
+    =====================  ================
+    'VMX: SMT vulnerable'  SMT is enabled
+    'VMX: SMT disabled'    SMT is disabled
+    =====================  ================
+
+  - L1D Flush mode:
+
+    ================================  ====================================
+    'L1D vulnerable'		      L1D flushing is disabled
+
+    'L1D conditional cache flushes'   L1D flush is conditionally enabled
+
+    'L1D cache flushes'		      L1D flush is unconditionally enabled
+    ================================  ====================================
+
+The resulting grade of protection is discussed in the following sections.
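For a quick check from a shell, the sysfs file listed above can be read
directly (the output depends on the CPU, the kernel version and whether
KVM/VMX is enabled):

```shell
# Query the L1TF mitigation status; the file is only present on kernels
# that expose the vulnerabilities directory in sysfs.
f=/sys/devices/system/cpu/vulnerabilities/l1tf
if [ -r "$f" ]; then
    cat "$f"    # e.g. "Mitigation: PTE Inversion" plus VMX details
else
    echo "l1tf status file not available on this system"
fi
```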
+
+
+Host mitigation mechanism
+-------------------------
+
+The kernel is unconditionally protected against L1TF attacks from malicious
+user space running on the host.
+
+
+Guest mitigation mechanisms
+---------------------------
+
+.. _l1d_flush:
+
+1. L1D flush on VMENTER
+^^^^^^^^^^^^^^^^^^^^^^^
+
+   To make sure that a guest cannot attack data which is present in the
+   L1D, the hypervisor flushes the L1D before entering the guest.
+
+   Flushing the L1D evicts not only the data which should not be accessed
+   by a potentially malicious guest, it also flushes the guest
+   data. Flushing the L1D has a performance impact as the processor has to
+   bring the flushed guest data back into the L1D. Depending on the
+   frequency of VMEXIT/VMENTER and the type of computations in the guest,
+   performance degradation in the range of 1% to 50% has been observed. For
+   scenarios where guest VMEXIT/VMENTER are rare the performance impact is
+   minimal. Virtio and mechanisms like posted interrupts are designed to
+   confine the VMEXITs to a bare minimum, but specific configurations and
+   application scenarios might still suffer from a high VMEXIT rate.
+
+   The kernel provides two L1D flush modes:
+    - conditional ('cond')
+    - unconditional ('always')
+
+   The conditional mode avoids L1D flushing after VMEXITs which execute
+   only audited code paths before the corresponding VMENTER. These code
+   paths have been verified not to expose secrets or other interesting
+   data to an attacker, but they can leak information about the address
+   space layout of the hypervisor.
+
+   Unconditional mode flushes L1D on all VMENTER invocations and provides
+   maximum protection. It has a higher overhead than the conditional
+   mode. The overhead cannot be quantified correctly as it depends on the
+   workload scenario and the resulting number of VMEXITs.
+
+   The general recommendation is to enable L1D flush on VMENTER. The kernel
+   defaults to conditional mode on affected processors.
+
+   **Note** that the L1D flush does not prevent the SMT problem because the
+   sibling thread will also bring its data back into the L1D, which makes it
+   attackable again.
+
+   L1D flush can be controlled by the administrator via the kernel command
+   line and sysfs control files. See :ref:`mitigation_control_command_line`
+   and :ref:`mitigation_control_kvm`.
+
+.. _guest_confinement:
+
+2. Guest VCPU confinement to dedicated physical cores
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   To address the SMT problem, it is possible to make a guest or a group of
+   guests affine to one or more physical cores. The proper mechanism for
+   that is to utilize exclusive cpusets to ensure that no other guest or
+   host tasks can run on these cores.
+
+   If only a single guest or related guests run on sibling SMT threads on
+   the same physical core then they can only attack their own memory and
+   restricted parts of the host memory.
+
+   Host memory is attackable when one of the sibling SMT threads runs in
+   host OS (hypervisor) context and the other in guest context. The amount
+   of valuable information exposed from the host OS context depends on
+   what the host OS executes, i.e. interrupts, soft interrupts and kernel
+   threads. The amount of valuable data from these contexts cannot be
+   declared as non-interesting for an attacker without deep inspection of
+   the code.
+
+   **Note** that assigning guests to a fixed set of physical cores affects
+   the ability of the scheduler to do load balancing and might have
+   negative effects on CPU utilization depending on the hosting
+   scenario. Disabling SMT might be a viable alternative for particular
+   scenarios.
+
+   For further information about confining guests to a single or to a group
+   of cores consult the cpusets documentation:
+
+   https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
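As a rough illustration of the confinement steps, the following sketch
creates an exclusive cgroup-v1 cpuset for one guest. The mount point,
the chosen CPUs and the guest PID are assumptions for illustration;
CPUSET defaults to a scratch directory so the commands can be tried
safely, and must point at the real cpuset mount on an actual host:

```shell
# Sketch only: pin a guest to cores 2-3 via an exclusive cpuset.
# On a real host run as root with CPUSET=/sys/fs/cgroup/cpuset.
CPUSET=${CPUSET:-$(mktemp -d)}
mkdir -p "$CPUSET/guest0"
echo 2-3 > "$CPUSET/guest0/cpuset.cpus"        # CPUs incl. SMT siblings
echo 0   > "$CPUSET/guest0/cpuset.mems"        # memory node
echo 1   > "$CPUSET/guest0/cpuset.cpu_exclusive"
# echo "$GUEST_PID" > "$CPUSET/guest0/tasks"   # move the guest's threads in
cat "$CPUSET/guest0/cpuset.cpus"
```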
+
+.. _interrupt_isolation:
+
+3. Interrupt affinity
+^^^^^^^^^^^^^^^^^^^^^
+
+   Interrupts can be made affine to logical CPUs. This is not universally
+   true because there are types of interrupts which are truly per-CPU
+   interrupts, e.g. the local timer interrupt. Aside from that, multi-queue
+   devices affine their interrupts to single CPUs or groups of CPUs per
+   queue without allowing the administrator to control the affinities.
+
+   Moving the interrupts, which can be affinity controlled, away from CPUs
+   which run untrusted guests, reduces the attack vector space.
+
+   Whether the interrupts which are affine to CPUs running untrusted
+   guests provide interesting data for an attacker depends on the system
+   configuration and the scenarios which run on the system. While for some
+   of the interrupts it can be assumed that they won't expose interesting
+   information beyond exposing hints about the host OS memory layout, there
+   is no way to make general assumptions.
+
+   Interrupt affinity can be controlled by the administrator via the
+   /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
+   available at:
+
+   https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
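A hedged sketch of steering affinity-controllable interrupts away from
guest CPUs; the HOUSEKEEPING CPU list is an assumption and must be chosen
to match the actual host partitioning:

```shell
# Sketch: move all controllable interrupts to housekeeping CPUs 0-1
# (assumption: CPUs 2 and up run untrusted guests). Per-CPU and
# kernel-managed interrupts reject the write; those failures are ignored.
HOUSEKEEPING=0-1
for irq in /proc/irq/[0-9]*; do
    echo "$HOUSEKEEPING" > "$irq/smp_affinity_list" 2>/dev/null || true
done
```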
+
+.. _smt_control:
+
+4. SMT control
+^^^^^^^^^^^^^^
+
+   To prevent the SMT issues of L1TF it might be necessary to disable SMT
+   completely. Disabling SMT can have a significant performance impact, but
+   the impact depends on the hosting scenario and the type of workloads.
+   The impact of disabling SMT also needs to be weighed against the impact
+   of other mitigation solutions like confining guests to dedicated cores.
+
+   The kernel provides a sysfs interface to retrieve the status of SMT and
+   to control it. It also provides a kernel command line interface to
+   control SMT.
+
+   The kernel command line interface consists of the following options:
+
+     =========== ==========================================================
+     nosmt	 Affects the bring up of the secondary CPUs during boot. The
+		 kernel tries to bring all present CPUs online during the
+		 boot process. "nosmt" makes sure that from each physical
+		 core only one - the so-called primary (hyper) thread - is
+		 activated. Due to a design flaw of Intel processors related
+		 to Machine Check Exceptions, the non-primary siblings have
+		 to be brought up at least partially and are then shut down
+		 again. "nosmt" can be undone via the sysfs interface.
+
+     nosmt=force Has the same effect as "nosmt" but does not allow undoing
+		 the SMT disable via the sysfs interface.
+     =========== ==========================================================
+
+   The sysfs interface provides two files:
+
+   - /sys/devices/system/cpu/smt/control
+   - /sys/devices/system/cpu/smt/active
+
+   /sys/devices/system/cpu/smt/control:
+
+     This file allows reading out the SMT control state and provides the
+     ability to disable or (re)enable SMT. The possible states are:
+
+	==============  ===================================================
+	on		SMT is supported by the CPU and enabled. All
+			logical CPUs can be onlined and offlined without
+			restrictions.
+
+	off		SMT is supported by the CPU and disabled. Only
+			the so-called primary SMT threads can be onlined
+			and offlined without restrictions. An attempt to
+			online a non-primary sibling is rejected.
+
+	forceoff	Same as 'off' but the state cannot be controlled.
+			Attempts to write to the control file are rejected.
+
+	notsupported	The processor does not support SMT. It's therefore
+			not affected by the SMT implications of L1TF.
+			Attempts to write to the control file are rejected.
+	==============  ===================================================
+
+     The possible states which can be written into this file to control SMT
+     state are:
+
+     - on
+     - off
+     - forceoff
+
+   /sys/devices/system/cpu/smt/active:
+
+     This file reports whether SMT is enabled and active, i.e. if on any
+     physical core two or more sibling threads are online.
+
+   SMT control is also possible at boot time via the l1tf kernel command
+   line parameter in combination with L1D flush control. See
+   :ref:`mitigation_control_command_line`.
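The sysfs files described above can be exercised from a shell; reads work
for any user, while disabling or re-enabling SMT requires root:

```shell
# Inspect the SMT state via the sysfs interface documented above.
ctl=/sys/devices/system/cpu/smt/control
if [ -r "$ctl" ]; then
    echo "SMT control state: $(cat "$ctl")"
    echo "SMT active:        $(cat /sys/devices/system/cpu/smt/active)"
    # echo off > "$ctl"    # as root: disable SMT (undo with: echo on)
else
    echo "SMT control files not present on this kernel"
fi
```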
+
+5. Disabling EPT
+^^^^^^^^^^^^^^^^
+
+  Disabling EPT for virtual machines provides full mitigation for L1TF even
+  with SMT enabled, because the effective page tables for guests are
+  managed and sanitized by the hypervisor. However, disabling EPT has a
+  significant performance impact, especially when the Meltdown mitigation
+  KPTI is enabled.
+
+  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
+
+There is ongoing research and development for new mitigation mechanisms to
+address the performance impact of disabling SMT or EPT.
+
+.. _mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows controlling the L1TF mitigations at boot
+time with the option "l1tf=". The valid arguments for this option are:
+
+  ============  =============================================================
+  full		Provides all available mitigations for the L1TF
+		vulnerability. Disables SMT and enables all mitigations in
+		the hypervisors, i.e. unconditional L1D flushing
+
+		SMT control and L1D flush control via the sysfs interface
+		is still possible after boot.  Hypervisors will issue a
+		warning when the first VM is started in a potentially
+		insecure configuration, i.e. SMT enabled or L1D flush
+		disabled.
+
+  full,force	Same as 'full', but disables SMT and L1D flush runtime
+		control. Implies the 'nosmt=force' command line option.
+		(i.e. sysfs control of SMT is disabled.)
+
+  flush		Leaves SMT enabled and enables the default hypervisor
+		mitigation, i.e. conditional L1D flushing
+
+		SMT control and L1D flush control via the sysfs interface
+		is still possible after boot.  Hypervisors will issue a
+		warning when the first VM is started in a potentially
+		insecure configuration, i.e. SMT enabled or L1D flush
+		disabled.
+
+  flush,nosmt	Disables SMT and enables the default hypervisor mitigation,
+		i.e. conditional L1D flushing.
+
+		SMT control and L1D flush control via the sysfs interface
+		is still possible after boot.  Hypervisors will issue a
+		warning when the first VM is started in a potentially
+		insecure configuration, i.e. SMT enabled or L1D flush
+		disabled.
+
+  flush,nowarn	Same as 'flush', but hypervisors will not warn when a VM is
+		started in a potentially insecure configuration.
+
+  off		Disables hypervisor mitigations and doesn't emit any
+		warnings.
+		It also drops the swap size and available RAM limit restrictions
+		on both hypervisor and bare metal.
+
+  ============  =============================================================
+
+The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
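To verify which option (if any) was actually passed at boot, the kernel
command line can be inspected at runtime:

```shell
# Show the l1tf= boot option, if one was given.
grep -o 'l1tf=[^ ]*' /proc/cmdline || echo "l1tf= not set (default: flush)"
```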
+
+
+.. _mitigation_control_kvm:
+
+Mitigation control for KVM - module parameter
+-------------------------------------------------------------
+
+The KVM hypervisor mitigation mechanism, flushing the L1D cache when
+entering a guest, can be controlled with a module parameter.
+
+The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
+following arguments:
+
+  ============  ==============================================================
+  always	L1D cache flush on every VMENTER.
+
+  cond		Flush L1D on VMENTER only when the code between VMEXIT and
+		VMENTER can leak host memory which is considered
+		interesting for an attacker. This can still leak host memory
+		which allows e.g. determining the host's address space layout.
+
+  never		Disables the mitigation.
+  ============  ==============================================================
+
+The parameter can be provided on the kernel command line, as a module
+parameter when loading the module, and modified at runtime via the sysfs
+file:
+
+/sys/module/kvm_intel/parameters/vmentry_l1d_flush
+
+The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
+line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
+module parameter is ignored and writes to the sysfs file are rejected.
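The runtime control via the sysfs file can be exercised as follows (reads
work for any user, writes require root, and the file only exists while the
kvm_intel module is loaded):

```shell
# Inspect, and optionally change, the KVM L1D flush mode at runtime.
p=/sys/module/kvm_intel/parameters/vmentry_l1d_flush
if [ -r "$p" ]; then
    cat "$p"                 # cond, always or never
    # echo always > "$p"     # as root: switch to unconditional flushing
else
    echo "kvm_intel module not loaded"
fi
```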
+
+
+Mitigation selection guide
+--------------------------
+
+1. No virtualization in use
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   The system is protected by the kernel unconditionally and no further
+   action is required.
+
+2. Virtualization with trusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   If the guest comes from a trusted source and the guest OS kernel is
+   guaranteed to have the L1TF mitigations in place the system is fully
+   protected against L1TF and no further action is required.
+
+   To avoid the overhead of the default L1D flushing on VMENTER the
+   administrator can disable the flushing via the kernel command line and
+   sysfs control files. See :ref:`mitigation_control_command_line` and
+   :ref:`mitigation_control_kvm`.
+
+
+3. Virtualization with untrusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+3.1. SMT not supported or disabled
+""""""""""""""""""""""""""""""""""
+
+  If SMT is not supported by the processor or disabled in the BIOS or by
+  the kernel, it's only required to enforce L1D flushing on VMENTER.
+
+  Conditional L1D flushing is the default behaviour and can be tuned. See
+  :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
+
+3.2. EPT not supported or disabled
+""""""""""""""""""""""""""""""""""
+
+  If EPT is not supported by the processor or disabled in the hypervisor,
+  the system is fully protected. SMT can stay enabled and L1D flushing on
+  VMENTER is not required.
+
+  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
+
+3.3. SMT and EPT supported and active
+"""""""""""""""""""""""""""""""""""""
+
+  If SMT and EPT are supported and active then various degrees of
+  mitigations can be employed:
+
+  - L1D flushing on VMENTER:
+
+    L1D flushing on VMENTER is the minimal protection requirement, but it
+    is only potent in combination with other mitigation methods.
+
+    Conditional L1D flushing is the default behaviour and can be tuned. See
+    :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
+
+  - Guest confinement:
+
+    Confinement of guests to a single or a group of physical cores which
+    are not running any other processes, can reduce the attack surface
+    significantly, but interrupts, soft interrupts and kernel threads can
+    still expose valuable data to a potential attacker. See
+    :ref:`guest_confinement`.
+
+  - Interrupt isolation:
+
+    Isolating the guest CPUs from interrupts can reduce the attack surface
+    further, but still allows a malicious guest to explore a limited amount
+    of host physical memory. This can at least be used to gain knowledge
+    about the host address space layout. The interrupts which have a fixed
+    affinity to the CPUs running the untrusted guests can, depending on
+    the scenario, still trigger soft interrupts and schedule kernel
+    threads which might expose valuable information. See
+    :ref:`interrupt_isolation`.
+
+The above three mitigation methods combined can provide protection to a
+certain degree, but the risk of the remaining attack surface has to be
+carefully analyzed. For full protection the following methods are
+available:
+
+  - Disabling SMT:
+
+    Disabling SMT and enforcing the L1D flushing provides the maximum
+    amount of protection. This mitigation does not depend on any of the
+    above mitigation methods.
+
+    SMT control and L1D flushing can be tuned by the command line
+    parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
+    time with the matching sysfs control files. See :ref:`smt_control`,
+    :ref:`mitigation_control_command_line` and
+    :ref:`mitigation_control_kvm`.
+
+  - Disabling EPT:
+
+    Disabling EPT provides the maximum amount of protection as well. It
+    does not depend on any of the above mitigation methods. SMT can stay
+    enabled and L1D flushing is not required, but the performance impact is
+    significant.
+
+    EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
+    parameter.
+
+3.4. Nested virtual machines
+""""""""""""""""""""""""""""
+
+When nested virtualization is in use, three operating systems are involved:
+the bare metal hypervisor, the nested hypervisor and the nested virtual
+machine.  VMENTER operations from the nested hypervisor into the nested
+guest will always be processed by the bare metal hypervisor. If KVM is the
+bare metal hypervisor it will:
+
+ - Flush the L1D cache on every switch from the nested hypervisor to the
+   nested virtual machine, so that the nested hypervisor's secrets are not
+   exposed to the nested virtual machine;
+
+ - Flush the L1D cache on every switch from the nested virtual machine to
+   the nested hypervisor; this is a complex operation, and flushing the L1D
+   cache prevents the bare metal hypervisor's secrets from being exposed
+   to the nested virtual machine;
+
+ - Instruct the nested hypervisor to not perform any L1D cache flush. This
+   is an optimization to avoid double L1D flushing.
+
+
+.. _default_mitigations:
+
+Default mitigations
+-------------------
+
+  The kernel default mitigations for vulnerable processors are:
+
+  - PTE inversion to protect against malicious user space. This is done
+    unconditionally and cannot be controlled. The swap storage is limited
+    to ~16TB.
+
+  - L1D conditional flushing on VMENTER when EPT is enabled for
+    a guest.
+
+  The kernel does not by default enforce the disabling of SMT, which leaves
+  SMT systems vulnerable when running untrusted guests with EPT enabled.
+
+  The rationale for this choice is:
+
+  - Force disabling SMT can break existing setups, especially with
+    unattended updates.
+
+  - If regular users run untrusted guests on their machine, then L1TF is
+    just an add-on to other malware which might be embedded in an untrusted
+    guest, e.g. spam-bots or attacks on the local network.
+
+    There is no technical way to prevent a user from running untrusted code
+    on their machines blindly.
+
+  - It's technically extremely unlikely and, from today's knowledge, even
+    impossible that L1TF can be exploited via the most popular attack
+    mechanisms like JavaScript, because these mechanisms have no way to
+    control PTEs. If this were possible and no other mitigation were
+    available, then the default might be different.
+
+  - The administrators of cloud and hosting setups have to carefully
+    analyze the risk for their scenarios and make the appropriate
+    mitigation choices, which might even vary across their deployed
+    machines and also result in other changes of their overall setup.
+    There is no way for the kernel to provide a sensible default for these
+    kinds of scenarios.
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -17,14 +17,12 @@ etc.
    kernel-parameters
    devices
 
-This section describes CPU vulnerabilities and provides an overview of the
-possible mitigations along with guidance for selecting mitigations if they
-are configurable at compile, boot or run time.
+This section describes CPU vulnerabilities and their mitigations.
 
 .. toctree::
    :maxdepth: 1
 
-   l1tf
+   hw-vuln/index
 
 Here is a set of documents aimed at users who are trying to track down
 problems and bugs in particular.
--- a/Documentation/admin-guide/l1tf.rst
+++ /dev/null
@@ -1,614 +0,0 @@
-L1TF - L1 Terminal Fault
-========================
-
-L1 Terminal Fault is a hardware vulnerability which allows unprivileged
-speculative access to data which is available in the Level 1 Data Cache
-when the page table entry controlling the virtual address, which is used
-for the access, has the Present bit cleared or other reserved bits set.
-
-Affected processors
--------------------
-
-This vulnerability affects a wide range of Intel processors. The
-vulnerability is not present on:
-
-   - Processors from AMD, Centaur and other non Intel vendors
-
-   - Older processor models, where the CPU family is < 6
-
-   - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
-     Penwell, Pineview, Silvermont, Airmont, Merrifield)
-
-   - The Intel XEON PHI family
-
-   - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
-     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
-     by the Meltdown vulnerability either. These CPUs should become
-     available by end of 2018.
-
-Whether a processor is affected or not can be read out from the L1TF
-vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
-
-Related CVEs
-------------
-
-The following CVE entries are related to the L1TF vulnerability:
-
-   =============  =================  ==============================
-   CVE-2018-3615  L1 Terminal Fault  SGX related aspects
-   CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
-   CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
-   =============  =================  ==============================
-
-Problem
--------
-
-If an instruction accesses a virtual address for which the relevant page
-table entry (PTE) has the Present bit cleared or other reserved bits set,
-then speculative execution ignores the invalid PTE and loads the referenced
-data if it is present in the Level 1 Data Cache, as if the page referenced
-by the address bits in the PTE was still present and accessible.
-
-While this is a purely speculative mechanism and the instruction will raise
-a page fault when it is retired eventually, the pure act of loading the
-data and making it available to other speculative instructions opens up the
-opportunity for side channel attacks to unprivileged malicious code,
-similar to the Meltdown attack.
-
-While Meltdown breaks the user space to kernel space protection, L1TF
-allows to attack any physical memory address in the system and the attack
-works across all protection domains. It allows an attack of SGX and also
-works from inside virtual machines because the speculation bypasses the
-extended page table (EPT) protection mechanism.
-
-
-Attack scenarios
-----------------
-
-1. Malicious user space
-^^^^^^^^^^^^^^^^^^^^^^^
-
-   Operating Systems store arbitrary information in the address bits of a
-   PTE which is marked non present. This allows a malicious user space
-   application to attack the physical memory to which these PTEs resolve.
-   In some cases user-space can maliciously influence the information
-   encoded in the address bits of the PTE, thus making attacks more
-   deterministic and more practical.
-
-   The Linux kernel contains a mitigation for this attack vector, PTE
-   inversion, which is permanently enabled and has no performance
-   impact. The kernel ensures that the address bits of PTEs, which are not
-   marked present, never point to cacheable physical memory space.
-
-   A system with an up to date kernel is protected against attacks from
-   malicious user space applications.
-
-2. Malicious guest in a virtual machine
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-   The fact that L1TF breaks all domain protections allows malicious guest
-   OSes, which can control the PTEs directly, and malicious guest user
-   space applications, which run on an unprotected guest kernel lacking the
-   PTE inversion mitigation for L1TF, to attack physical host memory.
-
-   A special aspect of L1TF in the context of virtualization is symmetric
-   multi threading (SMT). The Intel implementation of SMT is called
-   HyperThreading. The fact that Hyperthreads on the affected processors
-   share the L1 Data Cache (L1D) is important for this. As the flaw allows
-   only to attack data which is present in L1D, a malicious guest running
-   on one Hyperthread can attack the data which is brought into the L1D by
-   the context which runs on the sibling Hyperthread of the same physical
-   core. This context can be host OS, host user space or a different guest.
-
-   If the processor does not support Extended Page Tables, the attack is
-   only possible, when the hypervisor does not sanitize the content of the
-   effective (shadow) page tables.
-
-   While solutions exist to mitigate these attack vectors fully, these
-   mitigations are not enabled by default in the Linux kernel because they
-   can affect performance significantly. The kernel provides several
-   mechanisms which can be utilized to address the problem depending on the
-   deployment scenario. The mitigations, their protection scope and impact
-   are described in the next sections.
-
-   The default mitigations and the rationale for choosing them are explained
-   at the end of this document. See :ref:`default_mitigations`.
-
-.. _l1tf_sys_info:
-
-L1TF system information
------------------------
-
-The Linux kernel provides a sysfs interface to enumerate the current L1TF
-status of the system: whether the system is vulnerable, and which
-mitigations are active. The relevant sysfs file is:
-
-/sys/devices/system/cpu/vulnerabilities/l1tf
-
-The possible values in this file are:
-
-  ===========================   ===============================
-  'Not affected'		The processor is not vulnerable
-  'Mitigation: PTE Inversion'	The host protection is active
-  ===========================   ===============================
-
-If KVM/VMX is enabled and the processor is vulnerable then the following
-information is appended to the 'Mitigation: PTE Inversion' part:
-
-  - SMT status:
-
-    =====================  ================
-    'VMX: SMT vulnerable'  SMT is enabled
-    'VMX: SMT disabled'    SMT is disabled
-    =====================  ================
-
-  - L1D Flush mode:
-
-    ================================  ====================================
-    'L1D vulnerable'		      L1D flushing is disabled
-
-    'L1D conditional cache flushes'   L1D flush is conditionally enabled
-
-    'L1D cache flushes'		      L1D flush is unconditionally enabled
-    ================================  ====================================
-
-The resulting grade of protection is discussed in the following sections.
-
-
-Host mitigation mechanism
--------------------------
-
-The kernel is unconditionally protected against L1TF attacks from malicious
-user space running on the host.
-
-
-Guest mitigation mechanisms
----------------------------
-
-.. _l1d_flush:
-
-1. L1D flush on VMENTER
-^^^^^^^^^^^^^^^^^^^^^^^
-
-   To make sure that a guest cannot attack data which is present in the L1D
-   the hypervisor flushes the L1D before entering the guest.
-
-   Flushing the L1D evicts not only the data which should not be accessed
-   by a potentially malicious guest, it also flushes the guest
-   data. Flushing the L1D has a performance impact as the processor has to
-   bring the flushed guest data back into the L1D. Depending on the
-   frequency of VMEXIT/VMENTER and the type of computations in the guest
-   performance degradation in the range of 1% to 50% has been observed. For
-   scenarios where guest VMEXIT/VMENTER are rare the performance impact is
-   minimal. Virtio and mechanisms like posted interrupts are designed to
-   confine the VMEXITs to a bare minimum, but specific configurations and
-   application scenarios might still suffer from a high VMEXIT rate.
-
-   The kernel provides two L1D flush modes:
-    - conditional ('cond')
-    - unconditional ('always')
-
-   The conditional mode avoids L1D flushing after VMEXITs which execute
-   only audited code paths before the corresponding VMENTER. These code
-   paths have been verified that they cannot expose secrets or other
-   interesting data to an attacker, but they can leak information about the
-   address space layout of the hypervisor.
-
-   Unconditional mode flushes L1D on all VMENTER invocations and provides
-   maximum protection. It has a higher overhead than the conditional
-   mode. The overhead cannot be quantified correctly as it depends on the
-   workload scenario and the resulting number of VMEXITs.
-
-   The general recommendation is to enable L1D flush on VMENTER. The kernel
-   defaults to conditional mode on affected processors.
-
-   **Note**, that L1D flush does not prevent the SMT problem because the
-   sibling thread will also bring back its data into the L1D which makes it
-   attackable again.
-
-   L1D flush can be controlled by the administrator via the kernel command
-   line and sysfs control files. See :ref:`mitigation_control_command_line`
-   and :ref:`mitigation_control_kvm`.
-
-.. _guest_confinement:
-
-2. Guest VCPU confinement to dedicated physical cores
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-   To address the SMT problem, it is possible to make a guest or a group of
-   guests affine to one or more physical cores. The proper mechanism for
-   that is to utilize exclusive cpusets to ensure that no other guest or
-   host tasks can run on these cores.
-
-   If only a single guest or related guests run on sibling SMT threads on
-   the same physical core then they can only attack their own memory and
-   restricted parts of the host memory.
-
-   Host memory is attackable when one of the sibling SMT threads runs in
-   host OS (hypervisor) context and the other in guest context. The amount
-   of valuable information from the host OS context depends on the context
-   in which the host OS executes, i.e. interrupts, soft interrupts and kernel
-   threads. The amount of valuable data from these contexts cannot be
-   declared as non-interesting for an attacker without deep inspection of
-   the code.
-
-   **Note**, that assigning guests to a fixed set of physical cores affects
-   the ability of the scheduler to do load balancing and might have
-   negative effects on CPU utilization depending on the hosting
-   scenario. Disabling SMT might be a viable alternative for particular
-   scenarios.
-
-   For further information about confining guests to a single or to a group
-   of cores consult the cpusets documentation:
-
-   https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
-
-.. _interrupt_isolation:
-
-3. Interrupt affinity
-^^^^^^^^^^^^^^^^^^^^^
-
-   Interrupts can be made affine to logical CPUs. This is not universally
-   true because there are types of interrupts which are truly per CPU
-   interrupts, e.g. the local timer interrupt. Aside from that, multi-queue
-   devices affine their interrupts to single CPUs or groups of CPUs per
-   queue without allowing the administrator to control the affinities.
-
-   Moving the interrupts, which can be affinity controlled, away from CPUs
-   which run untrusted guests, reduces the attack vector space.
-
-   Whether the interrupts which are affine to CPUs running untrusted
-   guests provide interesting data for an attacker depends on the system
-   configuration and the scenarios which run on the system. While for some
-   of the interrupts it can be assumed that they won't expose interesting
-   information beyond exposing hints about the host OS memory layout, there
-   is no way to make general assumptions.
-
-   Interrupt affinity can be controlled by the administrator via the
-   /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
-   available at:
-
-   https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
-
-.. _smt_control:
-
-4. SMT control
-^^^^^^^^^^^^^^
-
-   To prevent the SMT issues of L1TF it might be necessary to disable SMT
-   completely. Disabling SMT can have a significant performance impact, but
-   the impact depends on the hosting scenario and the type of workloads.
-   The impact of disabling SMT also needs to be weighed against the impact
-   of other mitigation solutions like confining guests to dedicated cores.
-
-   The kernel provides a sysfs interface to retrieve the status of SMT and
-   to control it. It also provides a kernel command line interface to
-   control SMT.
-
-   The kernel command line interface consists of the following options:
-
-     =========== ==========================================================
-     nosmt	 Affects the bring up of the secondary CPUs during boot. The
-		 kernel tries to bring all present CPUs online during the
-		 boot process. "nosmt" makes sure that from each physical
-		 core only one - the so called primary (hyper) thread is
-		 activated. Due to a design flaw of Intel processors related
-		 to Machine Check Exceptions the non primary siblings have
-		 to be brought up at least partially and are then shut down
-		 again.  "nosmt" can be undone via the sysfs interface.
-
-     nosmt=force Has the same effect as "nosmt" but it does not allow
-		 undoing the SMT disable via the sysfs interface.
-     =========== ==========================================================
-
-   The sysfs interface provides two files:
-
-   - /sys/devices/system/cpu/smt/control
-   - /sys/devices/system/cpu/smt/active
-
-   /sys/devices/system/cpu/smt/control:
-
-     This file allows reading out the SMT control state and provides the
-     ability to disable or (re)enable SMT. The possible states are:
-
-	==============  ===================================================
-	on		SMT is supported by the CPU and enabled. All
-			logical CPUs can be onlined and offlined without
-			restrictions.
-
-	off		SMT is supported by the CPU and disabled. Only
-			the so called primary SMT threads can be onlined
-			and offlined without restrictions. An attempt to
-			online a non-primary sibling is rejected
-
-	forceoff	Same as 'off' but the state cannot be controlled.
-			Attempts to write to the control file are rejected.
-
-	notsupported	The processor does not support SMT. It's therefore
-			not affected by the SMT implications of L1TF.
-			Attempts to write to the control file are rejected.
-	==============  ===================================================
-
-     The possible states which can be written into this file to control SMT
-     state are:
-
-     - on
-     - off
-     - forceoff
-
-   /sys/devices/system/cpu/smt/active:
-
-     This file reports whether SMT is enabled and active, i.e. if on any
-     physical core two or more sibling threads are online.
-
-   SMT control is also possible at boot time via the l1tf kernel command
-   line parameter in combination with L1D flush control. See
-   :ref:`mitigation_control_command_line`.
-
-5. Disabling EPT
-^^^^^^^^^^^^^^^^
-
-  Disabling EPT for virtual machines provides full mitigation for L1TF even
-  with SMT enabled, because the effective page tables for guests are
-  managed and sanitized by the hypervisor. Though disabling EPT has a
-  significant performance impact especially when the Meltdown mitigation
-  KPTI is enabled.
-
-  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
-
-There is ongoing research and development for new mitigation mechanisms to
-address the performance impact of disabling SMT or EPT.
-
-.. _mitigation_control_command_line:
-
-Mitigation control on the kernel command line
----------------------------------------------
-
-The kernel command line allows controlling the L1TF mitigations at boot
-time with the option "l1tf=". The valid arguments for this option are:
-
-  ============  =============================================================
-  full		Provides all available mitigations for the L1TF
-		vulnerability. Disables SMT and enables all mitigations in
-		the hypervisors, i.e. unconditional L1D flushing
-
-		SMT control and L1D flush control via the sysfs interface
-		is still possible after boot.  Hypervisors will issue a
-		warning when the first VM is started in a potentially
-		insecure configuration, i.e. SMT enabled or L1D flush
-		disabled.
-
-  full,force	Same as 'full', but disables SMT and L1D flush runtime
-		control. Implies the 'nosmt=force' command line option.
-		(i.e. sysfs control of SMT is disabled.)
-
-  flush		Leaves SMT enabled and enables the default hypervisor
-		mitigation, i.e. conditional L1D flushing
-
-		SMT control and L1D flush control via the sysfs interface
-		is still possible after boot.  Hypervisors will issue a
-		warning when the first VM is started in a potentially
-		insecure configuration, i.e. SMT enabled or L1D flush
-		disabled.
-
-  flush,nosmt	Disables SMT and enables the default hypervisor mitigation,
-		i.e. conditional L1D flushing.
-
-		SMT control and L1D flush control via the sysfs interface
-		is still possible after boot.  Hypervisors will issue a
-		warning when the first VM is started in a potentially
-		insecure configuration, i.e. SMT enabled or L1D flush
-		disabled.
-
-  flush,nowarn	Same as 'flush', but hypervisors will not warn when a VM is
-		started in a potentially insecure configuration.
-
-  off		Disables hypervisor mitigations and doesn't emit any
-		warnings.
-		It also drops the swap size and available RAM limit restrictions
-		on both hypervisor and bare metal.
-
-  ============  =============================================================
-
-The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
-
-
-.. _mitigation_control_kvm:
-
-Mitigation control for KVM - module parameter
--------------------------------------------------------------
-
-The KVM hypervisor mitigation mechanism, flushing the L1D cache when
-entering a guest, can be controlled with a module parameter.
-
-The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
-following arguments:
-
-  ============  ==============================================================
-  always	L1D cache flush on every VMENTER.
-
-  cond		Flush L1D on VMENTER only when the code between VMEXIT and
-		VMENTER can leak host memory which is considered
-		interesting for an attacker. This still can leak host memory
-		which allows e.g. determining the host's address space layout.
-
-  never		Disables the mitigation
-  ============  ==============================================================
-
-The parameter can be provided on the kernel command line, as a module
-parameter when loading the module, and modified at runtime via the sysfs
-file:
-
-/sys/module/kvm_intel/parameters/vmentry_l1d_flush
-
-The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
-line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
-module parameter is ignored and writes to the sysfs file are rejected.
-
-
-Mitigation selection guide
---------------------------
-
-1. No virtualization in use
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-   The system is protected by the kernel unconditionally and no further
-   action is required.
-
-2. Virtualization with trusted guests
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-   If the guest comes from a trusted source and the guest OS kernel is
-   guaranteed to have the L1TF mitigations in place the system is fully
-   protected against L1TF and no further action is required.
-
-   To avoid the overhead of the default L1D flushing on VMENTER the
-   administrator can disable the flushing via the kernel command line and
-   sysfs control files. See :ref:`mitigation_control_command_line` and
-   :ref:`mitigation_control_kvm`.
-
-
-3. Virtualization with untrusted guests
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-3.1. SMT not supported or disabled
-""""""""""""""""""""""""""""""""""
-
-  If SMT is not supported by the processor or disabled in the BIOS or by
-  the kernel, it's only required to enforce L1D flushing on VMENTER.
-
-  Conditional L1D flushing is the default behaviour and can be tuned. See
-  :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
-
-3.2. EPT not supported or disabled
-""""""""""""""""""""""""""""""""""
-
-  If EPT is not supported by the processor or disabled in the hypervisor,
-  the system is fully protected. SMT can stay enabled and L1D flushing on
-  VMENTER is not required.
-
-  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
-
-3.3. SMT and EPT supported and active
-"""""""""""""""""""""""""""""""""""""
-
-  If SMT and EPT are supported and active then various degrees of
-  mitigations can be employed:
-
-  - L1D flushing on VMENTER:
-
-    L1D flushing on VMENTER is the minimal protection requirement, but it
-    is only potent in combination with other mitigation methods.
-
-    Conditional L1D flushing is the default behaviour and can be tuned. See
-    :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
-
-  - Guest confinement:
-
-    Confinement of guests to a single or a group of physical cores which
-    are not running any other processes, can reduce the attack surface
-    significantly, but interrupts, soft interrupts and kernel threads can
-    still expose valuable data to a potential attacker. See
-    :ref:`guest_confinement`.
-
-  - Interrupt isolation:
-
-    Isolating the guest CPUs from interrupts can reduce the attack surface
-    further, but still allows a malicious guest to explore a limited amount
-    of host physical memory. This can at least be used to gain knowledge
-    about the host address space layout. The interrupts which have a fixed
-    affinity to the CPUs which run the untrusted guests can depending on
-    the scenario still trigger soft interrupts and schedule kernel threads
-    which might expose valuable information. See
-    :ref:`interrupt_isolation`.
-
-The above three mitigation methods combined can provide protection to a
-certain degree, but the risk of the remaining attack surface has to be
-carefully analyzed. For full protection the following methods are
-available:
-
-  - Disabling SMT:
-
-    Disabling SMT and enforcing the L1D flushing provides the maximum
-    amount of protection. This mitigation is not depending on any of the
-    above mitigation methods.
-
-    SMT control and L1D flushing can be tuned by the command line
-    parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
-    time with the matching sysfs control files. See :ref:`smt_control`,
-    :ref:`mitigation_control_command_line` and
-    :ref:`mitigation_control_kvm`.
-
-  - Disabling EPT:
-
-    Disabling EPT provides the maximum amount of protection as well. It is
-    not depending on any of the above mitigation methods. SMT can stay
-    enabled and L1D flushing is not required, but the performance impact is
-    significant.
-
-    EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
-    parameter.
-
-3.4. Nested virtual machines
-""""""""""""""""""""""""""""
-
-When nested virtualization is in use, three operating systems are involved:
-the bare metal hypervisor, the nested hypervisor and the nested virtual
-machine.  VMENTER operations from the nested hypervisor into the nested
-guest will always be processed by the bare metal hypervisor. If KVM is the
-bare metal hypervisor it will:
-
- - Flush the L1D cache on every switch from the nested hypervisor to the
-   nested virtual machine, so that the nested hypervisor's secrets are not
-   exposed to the nested virtual machine;
-
- - Flush the L1D cache on every switch from the nested virtual machine to
-   the nested hypervisor; this is a complex operation, and flushing the L1D
-   cache prevents the bare metal hypervisor's secrets from being exposed to
-   the nested virtual machine;
-
- - Instruct the nested hypervisor to not perform any L1D cache flush. This
-   is an optimization to avoid double L1D flushing.
-
-
-.. _default_mitigations:
-
-Default mitigations
--------------------
-
-  The kernel default mitigations for vulnerable processors are:
-
-  - PTE inversion to protect against malicious user space. This is done
-    unconditionally and cannot be controlled. The swap storage is limited
-    to ~16TB.
-
-  - L1D conditional flushing on VMENTER when EPT is enabled for
-    a guest.
-
-  The kernel does not by default enforce the disabling of SMT, which leaves
-  SMT systems vulnerable when running untrusted guests with EPT enabled.
-
-  The rationale for this choice is:
-
-  - Force disabling SMT can break existing setups, especially with
-    unattended updates.
-
-  - If regular users run untrusted guests on their machine, then L1TF is
-    just an add-on to other malware which might be embedded in an untrusted
-    guest, e.g. spam-bots or attacks on the local network.
-
-    There is no technical way to prevent a user from running untrusted code
-    on their machines blindly.
-
-  - It's technically extremely unlikely and from today's knowledge even
-    impossible that L1TF can be exploited via the most popular attack
-    mechanisms like JavaScript because these mechanisms have no way to
-    control PTEs. If this were possible and no other mitigation were
-    available, then the default might be different.
-
-  - The administrators of cloud and hosting setups have to carefully
-    analyze the risk for their scenarios and make the appropriate
-    mitigation choices, which might even vary across their deployed
-    machines and also result in other changes of their overall setup.
-    There is no way for the kernel to provide a sensible default for this
-    kind of scenario.
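The sysfs and module-parameter controls documented in the removed L1TF text above can be inspected with a short shell sketch. This is an illustrative helper, not part of the patch; the paths are the ones named in the documentation, and the fallbacks cover kernels or CPUs where a given file does not exist (e.g. kvm_intel not loaded).

```shell
#!/bin/sh
# Illustrative sketch: read the L1TF-related runtime controls described
# in the documentation above. Missing files fall back to a sane default.
smt_control=$(cat /sys/devices/system/cpu/smt/control 2>/dev/null || echo notsupported)
smt_active=$(cat /sys/devices/system/cpu/smt/active 2>/dev/null || echo 0)
l1d_flush=$(cat /sys/module/kvm_intel/parameters/vmentry_l1d_flush 2>/dev/null \
    || echo "kvm_intel not loaded")
l1tf_state=$(cat /sys/devices/system/cpu/vulnerabilities/l1tf 2>/dev/null \
    || echo "file not present")

echo "SMT control:       $smt_control"
echo "SMT active:        $smt_active"
echo "vmentry_l1d_flush: $l1d_flush"
echo "L1TF state:        $l1tf_state"
```

The same files accept writes as described above, e.g. `echo off > /sys/devices/system/cpu/smt/control` to disable SMT at runtime (unless 'nosmt=force' or 'l1tf=full,force' was given on the command line).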

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch V4 11/11] Documentation: Add MDS vulnerability documentation
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
                   ` (9 preceding siblings ...)
  2019-02-22 22:24 ` [patch V4 10/11] Documentation: Move L1TF to separate directory Thomas Gleixner
@ 2019-02-22 22:24 ` Thomas Gleixner
  2019-02-23  9:58   ` [MODERATED] " Greg KH
  2019-02-25 18:02   ` [MODERATED] " Dave Hansen
  2019-02-23  0:53 ` [MODERATED] Re: [patch V4 00/11] MDS basics Andrew Cooper
                   ` (3 subsequent siblings)
  14 siblings, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
  To: speck

From: Thomas Gleixner <tglx@linutronix.de>

Add the initial MDS vulnerability documentation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V1 --> V4: Added the missing pieces
---
 Documentation/admin-guide/hw-vuln/index.rst |    1 
 Documentation/admin-guide/hw-vuln/l1tf.rst  |    1 
 Documentation/admin-guide/hw-vuln/mds.rst   |  258 ++++++++++++++++++++++++++++
 3 files changed, 260 insertions(+)

--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -10,3 +10,4 @@ are configurable at compile, boot or run
    :maxdepth: 1
 
    l1tf
+   mds
--- a/Documentation/admin-guide/hw-vuln/l1tf.rst
+++ b/Documentation/admin-guide/hw-vuln/l1tf.rst
@@ -445,6 +445,7 @@ The default is 'cond'. If 'l1tf=full,for
 line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
 module parameter is ignored and writes to the sysfs file are rejected.
 
+.. _mitigation_selection:
 
 Mitigation selection guide
 --------------------------
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/mds.rst
@@ -0,0 +1,258 @@
+MDS - Microarchitectural Data Sampling
+======================================
+
+Microarchitectural Data Sampling is a hardware vulnerability which allows
+unprivileged speculative access to data which is available in various CPU
+internal buffers.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors. The
+vulnerability is not present on:
+
+   - Processors from AMD, Centaur and other non-Intel vendors
+
+   - Older processor models, where the CPU family is < 6
+
+   - Some Atoms (Bonnell, Saltwell, Goldmont, GoldmontPlus)
+
+   - Intel processors which have the ARCH_CAP_MDS_NO bit set in the
+     IA32_ARCH_CAPABILITIES MSR.
+
+Whether a processor is affected or not can be read out from the MDS
+vulnerability file in sysfs. See :ref:`mds_sys_info`.
+
+Related CVEs
+------------
+
+The following CVE entries are related to the MDS vulnerability:
+
+   ==============  =====  ==============================================
+   CVE-2018-12126  MSBDS  Microarchitectural Store Buffer Data Sampling
+   CVE-2018-12130  MFBDS  Microarchitectural Fill Buffer Data Sampling
+   CVE-2018-12127  MLPDS  Microarchitectural Load Port Data Sampling
+   ==============  =====  ==============================================
+
+Problem
+-------
+
+When performing store, load or L1 refill operations, processors write data
+into temporary microarchitectural structures (buffers). The data in the
+buffer can be forwarded to load operations as an optimization.
+
+Under certain conditions, usually a fault/assist caused by a load
+operation, data unrelated to the load memory address can be speculatively
+forwarded from the buffers. Because the load operation causes a fault or
+assist and its result will be discarded, the forwarded data will not cause
+incorrect program execution or state changes. But a malicious operation
+may be able to forward this speculative data to a disclosure gadget which
+in turn allows inferring the value via a cache side channel attack.
+
+Because the buffers are potentially shared between Hyper-Threads, cross
+Hyper-Thread attacks may be possible.
+
+Deeper technical information is available in the MDS specific x86
+architecture section: :ref:`Documentation/x86/mds.rst <mds>`.
+
+
+Attack scenarios
+----------------
+
+Attacks against the MDS vulnerabilities can be mounted from malicious
+unprivileged user space applications running on hosts or guests. Malicious
+guest OSes can obviously mount attacks as well.
+
+Contrary to other speculation based vulnerabilities the MDS vulnerability
+does not allow the attacker to control the memory target address. As a
+consequence the attacks are purely sampling based, but as demonstrated with
+the TLBleed attack samples can be postprocessed successfully.
+
+Web-Browsers
+^^^^^^^^^^^^
+
+  It's unclear whether attacks through Web-Browsers are possible at
+  all. The exploitation through JavaScript is considered very unlikely,
+  but other widely used web technologies like WebAssembly could possibly be
+  abused.
+
+
+.. _mds_sys_info:
+
+MDS system information
+-----------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current MDS
+status of the system: whether the system is vulnerable, and which
+mitigations are active. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/mds
+
+The possible values in this file are:
+
+  =========================================   =================================
+  'Not affected'				The processor is not vulnerable
+
+  'Vulnerable'					The processor is vulnerable,
+						but no mitigation enabled
+
+  'Vulnerable: Clear CPU buffers attempted'	The processor is vulnerable but
+						microcode is not updated.
+						The mitigation is enabled on a
+						best effort basis.
+						See :ref:`vmwerv`
+
+  'Mitigation: CPU buffer clear'		The processor is vulnerable and the
+						CPU buffer clearing mitigation is
+						enabled.
+  =========================================   =================================
+
+If the processor is vulnerable then the following information is appended
+to the above information:
+
+    ========================  ============================================
+    'SMT vulnerable'          SMT is enabled
+    'SMT disabled'            SMT is disabled
+    'SMT Host state unknown'  Kernel runs in a VM, Host SMT state unknown
+    ========================  ============================================
+
+.. _vmwerv:
+
+Best effort mitigation mode
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+  If the processor is vulnerable, but the availability of the microcode based
+  mitigation mechanism is not advertised via CPUID the kernel selects a best
+  effort mitigation mode.  This mode invokes the mitigation instructions
+  without a guarantee that they clear the CPU buffers.
+
+  This is done to address virtualization scenarios where the host has the
+  microcode update applied, but the hypervisor is not yet updated to expose
+  the CPUID to the guest. If the host has updated microcode, the protection
+  takes effect; otherwise a few CPU cycles are wasted pointlessly.
+
+  The state in the mds sysfs file reflects this situation accordingly.
+
+
+Mitigation mechanism
+-------------------------
+
+The kernel detects the affected CPUs and the presence of the microcode
+which is required.
+
+If a CPU is affected and the microcode is available, then the kernel
+enables the mitigation by default. The mitigation can be controlled at boot
+time via a kernel command line option. See
+:ref:`mds_mitigation_control_command_line`.
+
+.. _cpu_buffer_clear:
+
+CPU buffer clearing
+^^^^^^^^^^^^^^^^^^^
+
+  The mitigation for MDS clears the affected CPU buffers on return to user
+  space and when entering a guest.
+
+  If SMT is enabled it also clears the buffers on idle entry, but that's not
+  a sufficient SMT protection for all MDS variants; it covers solely MSBDS.
+
+.. _virt_mechanism:
+
+Virtualization mitigation
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+  If the CPU is also affected by L1TF and the L1D flush mitigation is enabled
+  and up to date microcode is available, the L1D flush mitigation is
+  automatically protecting the guest transition. For details on L1TF and
+  virtualization see:
+  :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <mitigation_control_kvm>`.
+
+  If the L1D flush mitigation is disabled or the microcode is not available
+  the guest transition is unprotected.
+
+.. _xeon_phi:
+
+XEON PHI specific considerations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+  The XEON PHI processor family is affected by MSBDS which can be exploited
+  cross Hyper-Threads when entering idle states. Some XEON PHI variants allow
+  the use of MWAIT in user space (Ring 3), which opens a potential attack
+  vector for malicious user space. The exposure can be disabled on the kernel
+  command line with the 'ring3mwait=disable' command line option.
+
+.. _mds_smt_control:
+
+SMT control
+^^^^^^^^^^^
+
+  To prevent the SMT issues of MDS it might be necessary to disable SMT
+  completely. Disabling SMT can have a significant performance impact, but
+  the impact depends on the type of workloads.
+
+  See the relevant chapter in the L1TF mitigation documentation for details:
+  :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.
+
+
+.. _mds_mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows controlling the MDS mitigations at boot
+time with the option "mds=". The valid arguments for this option are:
+
+  ============  =============================================================
+  auto		Kernel selects the appropriate mitigation mode when the CPU
+		is affected. Defaults to full.
+
+  full		Provides all available mitigations for the MDS
+		vulnerability: unconditional CPU buffer clearing on exit to
+		userspace and when entering a VM. Idle transitions are
+		protected as well.
+
+		It does not automatically disable SMT.
+
+  off		Disables MDS mitigations completely.
+
+  ============  =============================================================
+
+
+Mitigation selection guide
+--------------------------
+
+1. Trusted userspace
+^^^^^^^^^^^^^^^^^^^^
+
+   If all userspace applications are from a trusted source and do not
+   execute untrusted code which is supplied externally, then the mitigation
+   can be disabled.
+
+
+2. Virtualization with trusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   The same considerations as above versus trusted user space apply. See
+   also: :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <mitigation_selection>`.
+
+
+3. Virtualization with untrusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   The protection depends on the state of the L1TF mitigations.
+   See :ref:`virt_mechanism`.
+
+
+.. _mds_default_mitigations:
+
+Default mitigations
+-------------------
+
+  The kernel default mitigations for vulnerable processors are:
+
+  - Enable CPU buffer clearing
+
+  The kernel does not by default enforce the disabling of SMT, which leaves
+  SMT systems vulnerable when running untrusted code. The same rationale as
+  for L1TF applies.
+  See :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <default_mitigations>`.
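The sysfs file introduced by this patch series can be queried with a short shell sketch. This is an illustration, not part of the patch; the file only exists on kernels carrying the MDS patches, hence the fallback string.

```shell
#!/bin/sh
# Illustrative sketch: query the MDS mitigation status documented above.
mds_file=/sys/devices/system/cpu/vulnerabilities/mds
if [ -r "$mds_file" ]; then
    mds_status=$(cat "$mds_file")
else
    mds_status="unknown (kernel predates the MDS patches)"
fi
echo "MDS status: $mds_status"

# Boot-time control uses the "mds=" option described in the table above,
# e.g. appending "mds=full" or "mds=off" to the kernel command line.
```

On a patched, affected system with up-to-date microcode this reports e.g. 'Mitigation: CPU buffer clear; SMT vulnerable', matching the tables in the documentation above.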

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 00/11] MDS basics
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
                   ` (10 preceding siblings ...)
  2019-02-22 22:24 ` [patch V4 11/11] Documentation: Add MDS vulnerability documentation Thomas Gleixner
@ 2019-02-23  0:53 ` Andrew Cooper
  2019-02-23 14:12   ` Peter Zijlstra
  2019-02-25 16:38 ` mark gross
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: Andrew Cooper @ 2019-02-23  0:53 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 2501 bytes --]

On 22/02/2019 22:24, speck for Thomas Gleixner wrote:
> Hi!
>
> Another day, another update.
>
> Changes since V3:
>
>   - Add the #DF mitigation and document why I can't be bothered
>     to sprinkle the buffer clear into #MC
>
>   - Add a comment about the segment selector choice. It makes sense on it's
>     own but it won't prevent anyone from thinking that we're crazy.
>
>   - Addressed the review feedback vs. documentation
>
>   - Resurrected the admin documentation patch, tidied it up and filled the
>     gaps.
>
> Delta patch without the admin documentation parts below.
>
> Git tree WIP.mds branch is updated as well.
>
> If anyone of the people new to this need access to the git repo,
> please send me a public SSH key so I can add to the gitolite config.
>
> There is one point left which I did not look into yet and I'm happy to
> delegate that to the virtualization wizards:
>
>   XEON PHI is not affected by L1TF, so it won't get the L1TF
>   mitigations. But it is affected by MSBDS, so it needs separate
>   mitigation, i.e. clearing CPU buffers on VMENTER.

I haven’t got to this in Xen yet, but you're right - it is a pain to
deal with.

For L1TF, the write to MSR_FLUSH_CMD has to be in an MSR load list if
you want to avoid all kinds of nasty race conditions with late-hitting
NMIs/etc in the path-to-vmentry.

For PHI, it would be ideal to use the same mechanism, but obviously we
can't.  That said - I've just asked Intel what the feasibility of getting
MSR_FLUSH_CMD[1] being VERW is.  I very much expect the answer is "we're
months too late for a question like that", but I don't lose anything by
asking.

In Xen, I've managed to get the VERW flushing down to a single
instruction living in an alternative, and this is actually quite easy to
sprinkle around the exit asm.  Also, because it is encoded with (%rsp),
it can be used after POPing all the GPRs on the exit path.

The closer it moves to the VMLAUNCH/VMRESUME instructions, the narrower
the window for race conditions (which is fairly large for L1TF as you
must interact with MSRs before POPing the GPRs).

An NMI happening on the instruction boundary between VERW and VMRESUME
probably falls into the category of sufficiently rare to be unconcerned
about[1].

~Andrew

[1] He says, fully appreciating the irony that he has spent the past 6
weeks chasing a TLB flushing bug which turned out to be an NMI hitting a
single INVPCID instruction.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 02/11] x86/speculation/mds: Add basic bug infrastructure for MDS
  2019-02-22 22:24 ` [patch V4 02/11] x86/speculation/mds: Add basic bug infrastructure for MDS Thomas Gleixner
@ 2019-02-23  1:28   ` Linus Torvalds
  2019-02-23  7:42     ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2019-02-23  1:28 UTC (permalink / raw)
  To: speck

Don't take this as a NAK on this patch, I just didn't react to it on
earlier versions, and I wanted to just bring it up..

On Fri, Feb 22, 2019 at 4:04 PM speck for Thomas Gleixner
<speck@linutronix.de> wrote:
>
> +static const __initconst struct x86_cpu_id cpu_no_mds[] = {
> +       /* in addition to cpu_no_speculation */
> +       { X86_VENDOR_INTEL,     6,      INTEL_FAM6_ATOM_GOLDMONT        },
...

That comment was what made me go look: we already have *four* of these
tables in this file, and this is now the fifth.

And that may be ok. Maybe we do want separate tables for separate
quirks, even if there are patterns there.

But I at least wanted to bring it up: maybe it would be more legible
to have one table of CPU quirks, and have that table say "this CPU has
/ doesn't have this quirk".

Looking at the existing tables, there are often commonalities. And the
'struct x86_cpu_id' does have that "driver_data" field that is meant
to be able to describe particular issues, and could contain flags for
"has bug X" or "doesn't have bug Y" quirks.

I dunno. I guess it depends on which way people prefer to think about
things. Do you want to have a "I want to see which CPUs have bug X",
or do you want to have a "I want to see what bugs CPU X has".

Right now it's been driven by "quirk X" having a list of CPUs
associated with that quirk. And maybe that's the right thing to do.

But looking at those tables, I do wonder if maybe we should instead
have a list of CPUs, and then associate the quirks with the CPU.

Anyway, that was my aside. I don't think this patch series needs to
worry about it,

               Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch V4 02/11] x86/speculation/mds: Add basic bug infrastructure for MDS
  2019-02-23  1:28   ` [MODERATED] " Linus Torvalds
@ 2019-02-23  7:42     ` Thomas Gleixner
  2019-02-27 13:04       ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-23  7:42 UTC (permalink / raw)
  To: speck

On Fri, 22 Feb 2019, speck for Linus Torvalds wrote:
> On Fri, Feb 22, 2019 at 4:04 PM speck for Thomas Gleixner
> <speck@linutronix.de> wrote:
> >
> > +static const __initconst struct x86_cpu_id cpu_no_mds[] = {
> > +       /* in addition to cpu_no_speculation */
> > +       { X86_VENDOR_INTEL,     6,      INTEL_FAM6_ATOM_GOLDMONT        },
> ...
> 
> That comment was what made me go look: we already have *four* of these
> tables in this file, and this is now the fifth.
> 
> And that may be ok. Maybe we do want separate tables for separate
> quirks, even if there are patterns there.
> 
> But I at least wanted to bring it up: maybe it would be more legible
> to have one table of CPU quirks, and have that table say "this CPU has
> / doesn't have this quirk".
> 
> Looking at the existing tables, there are often commonalities. And the
> 'struct x86_cpu_id' does have that "driver_data" field that is meant
> to be able to describe particular issues, and could contain flags for
> "has bug X" or "doesn't have bug Y" quirks.
> 
> I dunno. I guess it depends on which way people prefer to think about
> things. Do you want to have a "I want to see which CPUs have bug X",
> or do you want to have a "I want to see what bugs CPU X has".
> 
> Right now it's been driven by "quirk X" having a list of CPUs
> associated with that quirk. And maybe that's the right thing to do.
> 
> But looking at those tables, I do wonder if maybe we should instead
> have a list of CPUs, and then associate the quirks with the CPU.

Good point. Never thought about it. Should be trivial enough to do.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 10/11] Documentation: Move L1TF to separate directory
  2019-02-22 22:24 ` [patch V4 10/11] Documentation: Move L1TF to separate directory Thomas Gleixner
@ 2019-02-23  8:41   ` Greg KH
  0 siblings, 0 replies; 47+ messages in thread
From: Greg KH @ 2019-02-23  8:41 UTC (permalink / raw)
  To: speck

On Fri, Feb 22, 2019 at 11:24:28PM +0100, speck for Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Move L1TF to a separate directory so the MDS stuff can be added at the
> side. Otherwise all the hardware vulnerabilities would have their own top
> level entry. Should have done that right away.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  Documentation/admin-guide/hw-vuln/index.rst |   12 
>  Documentation/admin-guide/hw-vuln/l1tf.rst  |  614 ++++++++++++++++++++++++++++
>  Documentation/admin-guide/index.rst         |    6 
>  Documentation/admin-guide/l1tf.rst          |  614 ----------------------------
>  4 files changed, 628 insertions(+), 618 deletions(-)

-M on git format-patch will show this just as a move, not a delete/add
diffstat, if that really matters.
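
For reference, the rename detection can be tried in a throwaway repo
(paths and messages here are illustrative):

```shell
# build a tiny repo, move a file, and compare the format-patch output
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"
mkdir -p Documentation/admin-guide
echo "L1TF docs" > Documentation/admin-guide/l1tf.rst
git add -A && git commit -qm "add l1tf doc"
mkdir -p Documentation/admin-guide/hw-vuln
git mv Documentation/admin-guide/l1tf.rst \
       Documentation/admin-guide/hw-vuln/l1tf.rst
git commit -qm "move l1tf doc"
# with -M the patch records a rename instead of a full delete/add pair
git format-patch -M -1 --stdout | grep -E "^rename (from|to)"
```

Without -M the same commit shows up as ~600 deleted plus ~600 added
lines; with -M the diff header carries "rename from"/"rename to" lines
and a similarity index instead.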

Anyway, looks good to me:
	Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 09/11] x86/speculation/mds: Add mitigation mode VMWERV
  2019-02-22 22:24 ` [patch V4 09/11] x86/speculation/mds: Add mitigation mode VMWERV Thomas Gleixner
@ 2019-02-23  9:52   ` Greg KH
  2019-02-25 20:31   ` mark gross
  1 sibling, 0 replies; 47+ messages in thread
From: Greg KH @ 2019-02-23  9:52 UTC (permalink / raw)
  To: speck

On Fri, Feb 22, 2019 at 11:24:27PM +0100, speck for Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> In virtualized environments it can happen that the host has the microcode
> update which utilizes the VERW instruction to clear CPU buffers, but the
> hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
> to guests.
> 
> Introduce an internal mitigation mode VMWERV which enables the invocation
> of the CPU buffer clearing even if X86_FEATURE_MD_CLEAR is not set. If the
> system has no updated microcode this results in a pointless execution of
> the VERW instruction wasting a few CPU cycles. If the microcode is updated,
> but not exposed to a guest then the CPU buffers will be cleared.
> 
> That said: Virtual Machines Will Eventually Receive Vaccine
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Thanks for the documentation update here, looks good.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 11/11] Documentation: Add MDS vulnerability documentation
  2019-02-22 22:24 ` [patch V4 11/11] Documentation: Add MDS vulnerability documentation Thomas Gleixner
@ 2019-02-23  9:58   ` Greg KH
  2019-02-26 20:11     ` Thomas Gleixner
  2019-02-25 18:02   ` [MODERATED] " Dave Hansen
  1 sibling, 1 reply; 47+ messages in thread
From: Greg KH @ 2019-02-23  9:58 UTC (permalink / raw)
  To: speck

On Fri, Feb 22, 2019 at 11:24:29PM +0100, speck for Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Add the initial MDS vulnerability documentation.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V1 --> V4: Added the missing pieces
> ---
>  Documentation/admin-guide/hw-vuln/index.rst |    1 
>  Documentation/admin-guide/hw-vuln/l1tf.rst  |    1 
>  Documentation/admin-guide/hw-vuln/mds.rst   |  258 ++++++++++++++++++++++++++++
>  3 files changed, 260 insertions(+)
> 
> --- a/Documentation/admin-guide/hw-vuln/index.rst
> +++ b/Documentation/admin-guide/hw-vuln/index.rst
> @@ -10,3 +10,4 @@ are configurable at compile, boot or run
>     :maxdepth: 1
>  
>     l1tf
> +   mds
> --- a/Documentation/admin-guide/hw-vuln/l1tf.rst
> +++ b/Documentation/admin-guide/hw-vuln/l1tf.rst
> @@ -445,6 +445,7 @@ The default is 'cond'. If 'l1tf=full,for
>  line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
>  module parameter is ignored and writes to the sysfs file are rejected.
>  
> +.. _mitigation_selection:
>  
>  Mitigation selection guide
>  --------------------------
> --- /dev/null
> +++ b/Documentation/admin-guide/hw-vuln/mds.rst
> @@ -0,0 +1,258 @@
> +MDS - Microarchitectural Data Sampling
> +======================================
> +
> +Microarchitectural Data Sampling is a hardware vulnerability which allows
> +unprivileged speculative access to data which is available in various CPU
> +internal buffers.
> +
> +Affected processors
> +-------------------
> +
> +This vulnerability affects a wide range of Intel processors. The
> +vulnerability is not present on:
> +
> +   - Processors from AMD, Centaur and other non Intel vendors
> +
> +   - Older processor models, where the CPU family is < 6
> +
> +   - Some Atoms (Bonnell, Saltwell, Goldmont, GoldmontPlus)
> +
> +   - Intel processors which have the ARCH_CAP_MDS_NO bit set in the
> +     IA32_ARCH_CAPABILITIES MSR.
> +
> +Whether a processor is affected or not can be read out from the MDS
> +vulnerability file in sysfs. See :ref:`mds_sys_info`.
> +
> +Related CVEs
> +------------
> +
> +The following CVE entries are related to the MDS vulnerability:
> +
> +   ==============  =====  ==============================================
> +   CVE-2018-12126  MSBDS  Microarchitectural Store Buffer Data Sampling
> +   CVE-2018-12130  MFBDS  Microarchitectural Fill Buffer Data Sampling
> +   CVE-2018-12127  MLPDS  Microarchitectural Load Port Data Sampling
> +   ==============  =====  ==============================================
> +
> +Problem
> +-------
> +
> +When performing store, load, L1 refill operations, processors write data
> +into temporary microarchitectural structures (buffers). The data in the
> +buffer can be forwarded to load operations as an optimization.
> +
> +Under certain conditions, usually a fault/assist caused by a load
> +operation, data unrelated to the load memory address can be speculatively
> +forwarded from the buffers. Because the load operation causes a fault or
> +assist and its result will be discarded, the forwarded data will not cause
> +incorrect program execution or state changes. But a malicious operation
> +may be able to forward this speculative data to a disclosure gadget
> +which in turn allows inferring the value via a cache side channel attack.
> +
> +Because the buffers are potentially shared between Hyper-Threads cross
> +Hyper-Thread attacks may be possible.

Shouldn't this be "are possible."?

As "proof" of this, some of the Linux distros, and a few other operating
systems, told Intel last week that they were going to be disabling
hyperthreading on their systems.  Some distros/OSs were only going to do
that on a "new install", but others can't really tell the difference
between an upgrade and new install, so were going to do it by default.

Theo was right, for all the wrong reasons :)

Anyway, good documentation, even if you don't want to change that
sentence, it looks fine to me:

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 00/11] MDS basics
  2019-02-23  0:53 ` [MODERATED] Re: [patch V4 00/11] MDS basics Andrew Cooper
@ 2019-02-23 14:12   ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2019-02-23 14:12 UTC (permalink / raw)
  To: speck

On Sat, Feb 23, 2019 at 12:53:23AM +0000, speck for Andrew Cooper wrote:
> [1] He says, fully appreciating the irony that he has spent the past 6
> weeks chasing a TLB flushing bug which turned out to be an NMI hitting a
> single INVPCID instruction.

That sounds like so much fun... :-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()
  2019-02-22 22:24 ` [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() Thomas Gleixner
@ 2019-02-25 16:06   ` Frederic Weisbecker
  2019-02-26 14:19   ` Josh Poimboeuf
  2019-02-26 15:00   ` [MODERATED] Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() David Woodhouse
  2 siblings, 0 replies; 47+ messages in thread
From: Frederic Weisbecker @ 2019-02-25 16:06 UTC (permalink / raw)
  To: speck

On Fri, Feb 22, 2019 at 11:24:22PM +0100, speck for Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> The Microarchitectural Data Sampling (MDS) vulnerabilities are mitigated by
> clearing the affected CPU buffers. The mechanism for clearing the buffers
> uses the unused and obsolete VERW instruction in combination with a
> microcode update which triggers a CPU buffer clear when VERW is executed.
> 
> Provide an inline function with the assembly magic. The argument of the VERW
> instruction must be a memory operand as documented:
> 
>   "MD_CLEAR enumerates that the memory-operand variant of VERW (for
>    example, VERW m16) has been extended to also overwrite buffers affected
>    by MDS. This buffer overwriting functionality is not guaranteed for the
>    register operand variant of VERW."
> 
> Documentation also recommends to use a writable data segment selector:
> 
>   "The buffer overwriting occurs regardless of the result of the VERW
>    permission check, as well as when the selector is null or causes a
>    descriptor load segment violation. However, for lowest latency we
>    recommend using a selector that indicates a valid writable data
>    segment."
> 
> Add x86 specific documentation about MDS and the internal workings of the
> mitigation.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Borislav Petkov <bp@suse.de>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
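
[Editor's note: for readers without the patch at hand, the helper under
review boils down to something like the following kernel-side sketch.
It is not standalone-runnable userspace code; details may differ from
the final tree.]

```c
/*
 * The memory-operand form of VERW is required - only that variant is
 * documented to overwrite the affected buffers.  Any selector works,
 * but a valid writable data segment (the kernel DS) gives the lowest
 * latency.  VERW modifies ZF, hence the "cc" clobber.
 */
static inline void mds_clear_cpu_buffers(void)
{
	static const u16 ds = __KERNEL_DS;

	asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
}
```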

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 00/11] MDS basics
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
                   ` (11 preceding siblings ...)
  2019-02-23  0:53 ` [MODERATED] Re: [patch V4 00/11] MDS basics Andrew Cooper
@ 2019-02-25 16:38 ` mark gross
  2019-02-26 19:58   ` Thomas Gleixner
  2019-02-26 16:28 ` [MODERATED] " Tyler Hicks
  2019-02-26 18:58 ` [MODERATED] " Kanth Ghatraju
  14 siblings, 1 reply; 47+ messages in thread
From: mark gross @ 2019-02-25 16:38 UTC (permalink / raw)
  To: speck

On Fri, Feb 22, 2019 at 11:24:18PM +0100, speck for Thomas Gleixner wrote:
> Hi!
> 
> Another day, another update.
> 
> Changes since V3:
> 
>   - Add the #DF mitigation and document why I can't be bothered
>     to sprinkle the buffer clear into #MC
> 
>   - Add a comment about the segment selector choice. It makes sense on it's
>     own but it won't prevent anyone from thinking that we're crazy.
> 
>   - Addressed the review feedback vs. documentation
> 
>   - Resurrected the admin documentation patch, tidied it up and filled the
>     gaps.
> 
> Delta patch without the admin documentation parts below.
> 
> Git tree WIP.mds branch is updated as well.
> 
> If anyone of the people new to this need access to the git repo,
> please send me a public SSH key so I can add to the gitolite config.
>

My public SSH key:

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA3EHqCyZrI5iLrSt8ujk/MAz4V/W85IsYQ/n8dKSyCpQCrL4BDSArFLmT8PoDazKjKX8R2tS0IhvI2inAOq1ERXKbU9gj81x9EHekVfNl9jnmqrTHLmKZNwdHgkPxOastkPTMD71SS1ONqcN1Fm9t8XRsByd7Lsr22GznOjLgMl4lrj1OgOGbwXXkYGgsJNpye8au7iNWmHFvMAcjEsgVtrY+kKRDz5pPneI6XmktwWfudFKwiCyH7NOX/D4whkWp/tanHbFSjO1jmtB92ADYdU4mXMI7CxVSS8NH2petxH3IkaD+8H6AnuZnZ+jnbOD8YYhTrfnQNmNpbtbGBIcvXw== mark.gross@intel.com


BTW what is the git remote for this git repo?

--mark

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 11/11] Documentation: Add MDS vulnerability documentation
  2019-02-22 22:24 ` [patch V4 11/11] Documentation: Add MDS vulnerability documentation Thomas Gleixner
  2019-02-23  9:58   ` [MODERATED] " Greg KH
@ 2019-02-25 18:02   ` Dave Hansen
  2019-02-26 20:10     ` Thomas Gleixner
  1 sibling, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2019-02-25 18:02 UTC (permalink / raw)
  To: speck


On 2/22/19 2:24 PM, speck for Thomas Gleixner wrote:
> +Contrary to other speculation based vulnerabilities the MDS vulnerability
> +does not allow the attacker to control the memory target address. As a
> +consequence the attacks are purely sampling based, but as demonstrated with
> +the TLBleed attack samples can be postprocessed successfully.

I saw this "sampling-based" terminology in Andi's docs too.  Personally,
I find it a bit confusing.  I think it's trying to make a distinction
between attacks that pull data out of memory and attacks that pull data
out of CPU-internal state that came from somewhere else.  Maybe
something like:

	Other attacks such as Spectre and Meltdown tend to target data
	at a specific memory address.  The MDS vulnerability itself can
	not be targeted at memory and can only leak memory contents that
	have been loaded into the CPU buffers by other means.

Or, is it trying to make a *timing* argument?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 07/11] x86/speculation/mds: Add mitigation control for MDS
  2019-02-22 22:24 ` [patch V4 07/11] x86/speculation/mds: Add mitigation control for MDS Thomas Gleixner
@ 2019-02-25 20:17   ` mark gross
  2019-02-26 15:50   ` Josh Poimboeuf
  1 sibling, 0 replies; 47+ messages in thread
From: mark gross @ 2019-02-25 20:17 UTC (permalink / raw)
  To: speck

On Fri, Feb 22, 2019 at 11:24:25PM +0100, speck for Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Now that the mitigations are in place, add a command line parameter to
> control the mitigation, a mitigation selector function and a SMT update
> mechanism.
> 
> This is the minimal straight forward initial implementation which just
> provides an always on/off mode. The command line parameter is:
> 
>   mds=[full|off|auto]
do we need full and auto?

> 
> This is consistent with the existing mitigations for other speculative
> hardware vulnerabilities.
> 
> The idle invocation is dynamically updated according to the SMT state of
> the system similar to the dynamic update of the STIBP mitigation.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Reviewed-by: Borislav Petkov <bp@suse.de>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |   27 ++++++++
>  arch/x86/include/asm/processor.h                |    6 +
>  arch/x86/kernel/cpu/bugs.c                      |   76 ++++++++++++++++++++++++
>  3 files changed, 109 insertions(+)
> 
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2356,6 +2356,33 @@
>  			Format: <first>,<last>
>  			Specifies range of consoles to be captured by the MDA.
>  
> +	mds=		[X86,INTEL]
> +			Control mitigation for the Micro-architectural Data
> +			Sampling (MDS) vulnerability.
> +
> +			Certain CPUs are vulnerable to an exploit against CPU
> +			internal buffers which can forward information to a
> +			disclosure gadget under certain conditions.
> +
> +			In vulnerable processors, the speculatively
> +			forwarded data can be used in a cache side channel
> +			attack, to access data to which the attacker does
> +			not have direct access.
> +
> +			This parameter controls the MDS mitigation. The
> +			options are:
> +
> +			full    - Unconditionally enable MDS mitigation
> +			off     - Unconditionally disable MDS mitigation
> +			auto    - Kernel detects whether the CPU model is
> +				  vulnerable to MDS and picks the most
> +				  appropriate mitigation. If the CPU is not
> +				  vulnerable, "off" is selected. If the CPU
> +				  is vulnerable "full" is selected.
> +
> +			Not specifying this option is equivalent to
> +			mds=auto.
> +
>  	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
>  			Amount of memory to be used when the kernel is not able
>  			to see the whole system memory or for test.
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -992,4 +992,10 @@ enum l1tf_mitigations {
>  
>  extern enum l1tf_mitigations l1tf_mitigation;
>  
> +enum mds_mitigations {
> +	MDS_MITIGATION_OFF,
> +	MDS_MITIGATION_AUTO,
> +	MDS_MITIGATION_FULL,
> +};
> +
>  #endif /* _ASM_X86_PROCESSOR_H */
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -37,6 +37,7 @@
>  static void __init spectre_v2_select_mitigation(void);
>  static void __init ssb_select_mitigation(void);
>  static void __init l1tf_select_mitigation(void);
> +static void __init mds_select_mitigation(void);
>  
>  /* The base value of the SPEC_CTRL MSR that always has to be preserved. */
>  u64 x86_spec_ctrl_base;
> @@ -106,6 +107,8 @@ void __init check_bugs(void)
>  
>  	l1tf_select_mitigation();
>  
> +	mds_select_mitigation();
> +
>  #ifdef CONFIG_X86_32
>  	/*
>  	 * Check whether we are able to run this kernel safely on SMP.
> @@ -212,6 +215,59 @@ static void x86_amd_ssb_disable(void)
>  }
>  
>  #undef pr_fmt
> +#define pr_fmt(fmt)	"MDS: " fmt
> +
> +/* Default mitigation for MDS-affected CPUs */
> +static enum mds_mitigations mds_mitigation __ro_after_init = MDS_MITIGATION_AUTO;
> +
> +static const char * const mds_strings[] = {
> +	[MDS_MITIGATION_OFF]	= "Vulnerable",
> +	[MDS_MITIGATION_FULL]	= "Mitigation: Clear CPU buffers"
> +};
> +
> +static void mds_select_mitigation(void)
> +{
> +	if (!boot_cpu_has_bug(X86_BUG_MDS)) {
> +		mds_mitigation = MDS_MITIGATION_OFF;
> +		return;
> +	}
> +
> +	switch (mds_mitigation) {
> +	case MDS_MITIGATION_OFF:
> +		break;
> +	case MDS_MITIGATION_AUTO:
> +	case MDS_MITIGATION_FULL:
here AUTO and FULL behave identically.

> +		if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) {
> +			mds_mitigation = MDS_MITIGATION_FULL;
> +			static_branch_enable(&mds_user_clear);
> +		} else {
> +			mds_mitigation = MDS_MITIGATION_OFF;
> +		}
> +		break;
> +	}
> +	pr_info("%s\n", mds_strings[mds_mitigation]);
> +}
> +
> +static int __init mds_cmdline(char *str)
> +{
> +	if (!boot_cpu_has_bug(X86_BUG_MDS))
> +		return 0;
> +
> +	if (!str)
> +		return -EINVAL;
> +
> +	if (!strcmp(str, "off"))
> +		mds_mitigation = MDS_MITIGATION_OFF;
> +	else if (!strcmp(str, "auto"))
> +		mds_mitigation = MDS_MITIGATION_AUTO;
> +	else if (!strcmp(str, "full"))
> +		mds_mitigation = MDS_MITIGATION_FULL;
> +
> +	return 0;
> +}
> +early_param("mds", mds_cmdline);
> +
> +#undef pr_fmt
>  #define pr_fmt(fmt)     "Spectre V2 : " fmt
>  
>  static enum spectre_v2_mitigation spectre_v2_enabled __ro_after_init =
> @@ -615,6 +671,15 @@ static void update_indir_branch_cond(voi
>  		static_branch_disable(&switch_to_cond_stibp);
>  }
>  
> +/* Update the static key controlling the MDS CPU buffer clear in idle */
> +static void update_mds_branch_idle(void)
> +{
> +	if (sched_smt_active())
> +		static_branch_enable(&mds_idle_clear);
> +	else
> +		static_branch_disable(&mds_idle_clear);
> +}
> +
>  void arch_smt_update(void)
>  {
>  	/* Enhanced IBRS implies STIBP. No update required. */
> @@ -636,6 +701,17 @@ void arch_smt_update(void)
>  		break;
>  	}
>  
> +	switch (mds_mitigation) {
> +	case MDS_MITIGATION_OFF:
> +		break;
> +	case MDS_MITIGATION_FULL:
> +		update_mds_branch_idle();
> +		break;
> +	/* Keep GCC happy */
> +	case MDS_MITIGATION_AUTO:
shouldn't there be a check to see if the platform needs to set mds_idle_clear
and call update_mds_branch_idle conditionally?

I'm not sure what the value of having both auto and full is.

--mark

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 09/11] x86/speculation/mds: Add mitigation mode VMWERV
  2019-02-22 22:24 ` [patch V4 09/11] x86/speculation/mds: Add mitigation mode VMWERV Thomas Gleixner
  2019-02-23  9:52   ` [MODERATED] " Greg KH
@ 2019-02-25 20:31   ` mark gross
  2019-02-26  0:34     ` Andrew Cooper
  2019-02-26 19:29     ` Thomas Gleixner
  1 sibling, 2 replies; 47+ messages in thread
From: mark gross @ 2019-02-25 20:31 UTC (permalink / raw)
  To: speck

On Fri, Feb 22, 2019 at 11:24:27PM +0100, speck for Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> In virtualized environments it can happen that the host has the microcode
> update which utilizes the VERW instruction to clear CPU buffers, but the
> hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
> to guests.
> 
> > Introduce an internal mitigation mode VMWERV which enables the invocation
> of the CPU buffer clearing even if X86_FEATURE_MD_CLEAR is not set. If the
> system has no updated microcode this results in a pointless execution of
> the VERW instruction wasting a few CPU cycles. If the microcode is updated,
> but not exposed to a guest then the CPU buffers will be cleared.
> 
> That said: Virtual Machines Will Eventually Receive Vaccine
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V2 -> V3: Rename mode.
> ---
>  Documentation/x86/mds.rst        |   29 +++++++++++++++++++++++++++++
>  arch/x86/include/asm/processor.h |    1 +
>  arch/x86/kernel/cpu/bugs.c       |   14 ++++++++------
>  3 files changed, 38 insertions(+), 6 deletions(-)
> 
> --- a/Documentation/x86/mds.rst
> +++ b/Documentation/x86/mds.rst
> @@ -90,11 +90,40 @@ The mitigation is invoked on kernel/user
>  (idle) transitions. Depending on the mitigation mode and the system state
>  the invocation can be enforced or conditional.
>  
> +As a special quirk to address virtualization scenarios where the host has
> +the microcode updated, but the hypervisor does not (yet) expose the
> +MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
> +hope that it might actually clear the buffers. The state is reflected
> +accordingly.
> +
>  According to current knowledge additional mitigations inside the kernel
>  itself are not required because the necessary gadgets to expose the leaked
>  data cannot be controlled in a way which allows exploitation from malicious
>  user space or VM guests.
>  
> +
> +Kernel internal mitigation modes
> +--------------------------------
> +
> + ======= ===========================================================
> + off     Mitigation is disabled. Either the CPU is not affected or
> +         mds=off is supplied on the kernel command line
> +
> + full    Mitigation is enabled. CPU is affected and MD_CLEAR is
> +         advertised in CPUID.
> +
> + vmwerv	 Mitigation is enabled. CPU is affected and MD_CLEAR is not
    vmverw  <-- typo?
> +         advertised in CPUID. That is mainly for virtualization
> +	 scenarios where the host has the updated microcode but the
> +	 hypervisor does not expose MD_CLEAR in CPUID. It's a best
> +	 effort approach without guarantee.
> + ======= ===========================================================
> +
> +If the CPU is affected and mds=off is not supplied on the kernel
> +command line then the kernel selects the appropriate mitigation mode
> +depending on the availability of the MD_CLEAR CPUID bit.
> +
> +
>  Mitigation points
>  -----------------
>  
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -996,6 +996,7 @@ enum mds_mitigations {
>  	MDS_MITIGATION_OFF,
>  	MDS_MITIGATION_AUTO,
>  	MDS_MITIGATION_FULL,
> +	MDS_MITIGATION_VMWERV,
	MDS_MITIGATION_VMVERW
>  };
>  
>  #endif /* _ASM_X86_PROCESSOR_H */
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -222,7 +222,8 @@ static enum mds_mitigations mds_mitigati
>  
>  static const char * const mds_strings[] = {
>  	[MDS_MITIGATION_OFF]	= "Vulnerable",
> -	[MDS_MITIGATION_FULL]	= "Mitigation: Clear CPU buffers"
> +	[MDS_MITIGATION_FULL]	= "Mitigation: Clear CPU buffers",
> +	[MDS_MITIGATION_VMWERV]	= "Vulnerable: Clear CPU buffers attempted, no microcode",
should be [MDS_MITIGATION_VMVERW]?
>  };
>  
>  static void mds_select_mitigation(void)
> @@ -237,12 +238,12 @@ static void mds_select_mitigation(void)
>  		break;
>  	case MDS_MITIGATION_AUTO:
>  	case MDS_MITIGATION_FULL:
> -		if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) {
> +	case MDS_MITIGATION_VMWERV:
0,$s/VMWERV/VMVERW/g

--mark

> +		if (boot_cpu_has(X86_FEATURE_MD_CLEAR))
>  			mds_mitigation = MDS_MITIGATION_FULL;
> -			static_branch_enable(&mds_user_clear);
> -		} else {
> -			mds_mitigation = MDS_MITIGATION_OFF;
> -		}
> +		else
> +			mds_mitigation = MDS_MITIGATION_VMWERV;
> +		static_branch_enable(&mds_user_clear);
>  		break;
>  	}
>  	pr_info("%s\n", mds_strings[mds_mitigation]);
> @@ -705,6 +706,7 @@ void arch_smt_update(void)
>  	case MDS_MITIGATION_OFF:
>  		break;
>  	case MDS_MITIGATION_FULL:
> +	case MDS_MITIGATION_VMWERV:
>  		update_mds_branch_idle();
>  		break;
>  	/* Keep GCC happy */
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 05/11] x86/speculation/mds: Clear CPU buffers on exit to user
  2019-02-22 22:24 ` [patch V4 05/11] x86/speculation/mds: Clear CPU buffers on exit to user Thomas Gleixner
@ 2019-02-25 21:04   ` Greg KH
  2019-02-26 15:20   ` Josh Poimboeuf
  1 sibling, 0 replies; 47+ messages in thread
From: Greg KH @ 2019-02-25 21:04 UTC (permalink / raw)
  To: speck

On Fri, Feb 22, 2019 at 11:24:23PM +0100, speck for Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Add a static key which controls the invocation of the CPU buffer clear
> mechanism on exit to user space and add the call into
> prepare_exit_to_usermode() and do_nmi() right before actually returning.
> 
> Add documentation which kernel to user space transition this covers and
> explain why some corner cases are not mitigated.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 06/11] x86/speculation/mds: Conditionally clear CPU buffers on idle entry
  2019-02-22 22:24 ` [patch V4 06/11] x86/speculation/mds: Conditionally clear CPU buffers on idle entry Thomas Gleixner
@ 2019-02-25 21:09   ` Greg KH
  2019-02-26 15:31   ` Josh Poimboeuf
  1 sibling, 0 replies; 47+ messages in thread
From: Greg KH @ 2019-02-25 21:09 UTC (permalink / raw)
  To: speck

On Fri, Feb 22, 2019 at 11:24:24PM +0100, speck for Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Add a static key which controls the invocation of the CPU buffer clear
> mechanism on idle entry. This is independent of other MDS mitigations
> because the idle entry invocation to mitigate the potential leakage due to
> store buffer repartitioning is only necessary on SMT systems.
> 
> Add the actual invocations to the different halt/mwait variants which
> covers all usage sites. mwaitx is not patched as it's not available on
> Intel CPUs.
> 
> The buffer clear is only invoked before entering the C-State to prevent
> that stale data from the idling CPU is spilled to the Hyper-Thread sibling
> after the Store buffer got repartitioned and all entries are available to
> the non idle sibling.
> 
> When coming out of idle the store buffer is partitioned again so each
> sibling has half of it available. Now CPU which returned from idle could be
> speculatively exposed to contents of the sibling, but the buffers are
> flushed either on exit to user space or on VMENTER.
> 
> When later on conditional buffer clearing is implemented on top of this,
> then there is no action required either because before returning to user
> space the context switch will set the condition flag which causes a flush
> on the return to user path.
> 
> This intentionally does not handle the case in the acpi/processor_idle
> driver which uses the legacy IO port interface for C-State transitions for
> two reasons:
> 
>  - The acpi/processor_idle driver was replaced by the intel_idle driver
>    almost a decade ago. Anything Nehalem upwards supports it and defaults
>    to that new driver.
> 
>  - The legacy IO port interface is likely to be used on older and therefore
>    unaffected CPUs or on systems which do not receive microcode updates
>    anymore, so there is no point in adding that.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Borislav Petkov <bp@suse.de>

Comparing this to the Intel paper, I find this way more readable and
understandable.  Things they "hint" at are actually spelled out here,
nice work.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 09/11] x86/speculation/mds: Add mitigation mode VMWERV
  2019-02-25 20:31   ` mark gross
@ 2019-02-26  0:34     ` Andrew Cooper
  2019-02-26 18:51       ` mark gross
  2019-02-26 19:29     ` Thomas Gleixner
  1 sibling, 1 reply; 47+ messages in thread
From: Andrew Cooper @ 2019-02-26  0:34 UTC (permalink / raw)
  To: speck


On 25/02/2019 20:31, speck for mark gross wrote:
> On Fri, Feb 22, 2019 at 11:24:27PM +0100, speck for Thomas Gleixner wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>>
>> In virtualized environments it can happen that the host has the microcode
>> update which utilizes the VERW instruction to clear CPU buffers, but the
>> hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
>> to guests.
>>
>> Introduce an internal mitigation mode VMWERV which enables the invocation
>> of the CPU buffer clearing even if X86_FEATURE_MD_CLEAR is not set. If the
>> system has no updated microcode this results in a pointless execution of
>> the VERW instruction wasting a few CPU cycles. If the microcode is updated,
>> but not exposed to a guest then the CPU buffers will be cleared.
>>
>> That said: Virtual Machines Will Eventually Receive Vaccine
>>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> ---
>> V2 -> V3: Rename mode.
>> ---
>>  Documentation/x86/mds.rst        |   29 +++++++++++++++++++++++++++++
>>  arch/x86/include/asm/processor.h |    1 +
>>  arch/x86/kernel/cpu/bugs.c       |   14 ++++++++------
>>  3 files changed, 38 insertions(+), 6 deletions(-)
>>
>> --- a/Documentation/x86/mds.rst
>> +++ b/Documentation/x86/mds.rst
>> @@ -90,11 +90,40 @@ The mitigation is invoked on kernel/user
>>  (idle) transitions. Depending on the mitigation mode and the system state
>>  the invocation can be enforced or conditional.
>>  
>> +As a special quirk to address virtualization scenarios where the host has
>> +the microcode updated, but the hypervisor does not (yet) expose the
>> +MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
>> +hope that it might actually clear the buffers. The state is reflected
>> +accordingly.
>> +
>>  According to current knowledge additional mitigations inside the kernel
>>  itself are not required because the necessary gadgets to expose the leaked
>>  data cannot be controlled in a way which allows exploitation from malicious
>>  user space or VM guests.
>>  
>> +
>> +Kernel internal mitigation modes
>> +--------------------------------
>> +
>> + ======= ===========================================================
>> + off     Mitigation is disabled. Either the CPU is not affected or
>> +         mds=off is supplied on the kernel command line
>> +
>> + full    Mitigation is enabled. CPU is affected and MD_CLEAR is
>> +         advertised in CPUID.
>> +
>> + vmwerv	 Mitigation is enabled. CPU is affected and MD_CLEAR is not
>     vmverw  <-- type oh?

I recommend re-reading the commit message :)

The position of the W isn't an accident.
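
To spell out what the mode buys us, the selection logic amounts to
something like this (a sketch with invented names, not the actual
bugs.c implementation):

```c
#include <stdbool.h>

enum mds_mitigations {
	MDS_MITIGATION_OFF,	/* CPU not affected, or mds=off */
	MDS_MITIGATION_VMWERV,	/* affected, MD_CLEAR not (yet) exposed */
	MDS_MITIGATION_FULL,	/* affected, MD_CLEAR advertised */
};

static enum mds_mitigations mds_select(bool cpu_affected, bool has_md_clear,
				       bool cmdline_off)
{
	if (!cpu_affected || cmdline_off)
		return MDS_MITIGATION_OFF;

	/* VERW is issued in both remaining modes; VMWERV merely cannot
	 * guarantee that the microcode actually clears the buffers. */
	return has_md_clear ? MDS_MITIGATION_FULL : MDS_MITIGATION_VMWERV;
}
```

In the worst case (no microcode anywhere) VMWERV just burns a few cycles
per VERW, which matches the "pointless execution" note in the changelog.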

~Andrew



* [MODERATED] Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()
  2019-02-22 22:24 ` [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() Thomas Gleixner
  2019-02-25 16:06   ` [MODERATED] " Frederic Weisbecker
@ 2019-02-26 14:19   ` Josh Poimboeuf
  2019-03-01 20:58     ` [MODERATED] Encrypted Message Jon Masters
  2019-02-26 15:00   ` [MODERATED] Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() David Woodhouse
  2 siblings, 1 reply; 47+ messages in thread
From: Josh Poimboeuf @ 2019-02-26 14:19 UTC (permalink / raw)
  To: speck

On Fri, Feb 22, 2019 at 11:24:22PM +0100, speck for Thomas Gleixner wrote:
> +MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
> +L1 miss situations and to hold data which is returned or sent in response
> +to a memory or I/O operation. Fill buffers can forward data to a load
> +operation and also write data to the cache. When the fill buffer is
> +deallocated it can retain the stale data of the preceding operations which
> +can then be forwarded to a faulting or assisting load operation, which can
> +be exploited under certain conditions. Fill buffers are shared between
> +Hyper-Threads so cross thread leakage is possible.
> +
> +MLDPS leaks Load Port Data. Load ports are used to perform load operations

MLPDS

> +from memory or I/O. The received data is then forwarded to the register
> +file or a subsequent operation. In some implementations the Load Port can
> +contain stale data from a previous operation which can be forwarded to
> +faulting or assisting loads under certain conditions, which again can be
> +exploited eventually. Load ports are shared between Hyper-Threads so cross
> +thread leakage is possible.
> +
> +
> +Exposure assumptions
> +--------------------
> +
> +It is assumed that attack code resides in user space or in a guest with one
> +exception. The rationale behind this assumption is that the code construct
> +needed for exploiting MDS requires:
> +
> + - to control the load to trigger a fault or assist
> +
> + - to have a disclosure gadget which exposes the speculatively accessed
> +   data for consumption through a side channel.
> +
> + - to control the pointer through which the disclosure gadget exposes the
> +   data
> +
> +The existence of such a construct cannot be excluded with 100% certainty,
> +but the complexity involved makes it extremely unlikely.

The existence of such a construct *in the kernel* cannot be excluded...

> +There is one exception, which is untrusted BPF. The functionality of
> +untrusted BPF is limited, but it needs to be thoroughly investigated
> +whether it can be used to create such a construct.
> +
> +
> +Mitigation strategy
> +-------------------
> +
> +All variants have the same mitigation strategy at least for the single CPU
> +thread case (SMT off): Force the CPU to clear the affected buffers.
> +
> +This is achieved by using the otherwise unused and obsolete VERW
> +instruction in combination with a microcode update. The microcode clears
> +the affected CPU buffers when the VERW instruction is executed.
> +
> +For virtualization there are two ways to achieve CPU buffer
> +clearing. Either the modified VERW instruction or via the L1D Flush
> +command. The latter is issued when L1TF mitigation is enabled so the extra
> +VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to
> +be issued.
> +
> +If the VERW instruction with the supplied segment selector argument is
> +executed on a CPU without the microcode update there is no side effect
> +other than a small number of pointlessly wasted CPU cycles.
> +
> +This does not protect against cross Hyper-Thread attacks except for MSBDS
> +which is only exploitable cross Hyper-thread when one of the Hyper-Threads
> +enters a C-state.
> +
> +The kernel provides a function to invoke the buffer clearing:
> +
> +    mds_clear_cpu_buffers()
> +
> +The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
> +(idle) transitions. Depending on the mitigation mode and the system state
> +the invocation can be enforced or conditional.

The conditional bit isn't true (yet?).

What does "enforced" mean in this context?  s/enforced/unconditional ?
Maybe the last sentence can be removed entirely.
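
For illustration, one way to picture the enforced vs. (future)
conditional distinction. Everything here is hypothetical, including the
per-CPU flag scheme, which is only hinted at in the cover letter:

```c
#include <stdbool.h>

/* Toy model: "enforced" clears on every return to user space;
 * a conditional mode would have the context switch arm a per-CPU
 * flag which the exit path consumes. Names are illustrative. */
static bool mitigation_enforced = true;	/* current "full" mode */
static bool cpu_buffers_dirty;		/* armed by context switch */
static int verw_invocations;

static void mds_clear_cpu_buffers(void) { verw_invocations++; }

static void context_switch(void) { cpu_buffers_dirty = true; }

static void exit_to_usermode(void)
{
	if (mitigation_enforced || cpu_buffers_dirty) {
		mds_clear_cpu_buffers();
		cpu_buffers_dirty = false;
	}
}
```

With mitigation_enforced cleared, repeated exits without an intervening
context switch would skip the VERW, which is the whole point of the
conditional variant.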

-- 
Josh


* [MODERATED] Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()
  2019-02-22 22:24 ` [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() Thomas Gleixner
  2019-02-25 16:06   ` [MODERATED] " Frederic Weisbecker
  2019-02-26 14:19   ` Josh Poimboeuf
@ 2019-02-26 15:00   ` David Woodhouse
  2 siblings, 0 replies; 47+ messages in thread
From: David Woodhouse @ 2019-02-26 15:00 UTC (permalink / raw)
  To: speck


Two single-letter heckles...


On Fri, 2019-02-22 at 23:24 +0100, speck for Thomas Gleixner wrote:
> Subject: patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()
                                                                        ^
                                                                  bufferS()


> From: Thomas Gleixner <tglx@linutronix.de>
> 
> The Microarchitectural Data Sampling (MDS) vulnerabilities are mitigated by
> clearing the affected CPU buffers. The mechanism for clearing the buffers
> uses the unused and obsolete VERW instruction in combination with a
> microcode update which triggers a CPU buffer clear when VERW is executed.
> 
> Provide an inline function with the assembly magic. The argument of the VERW
> instruction must be a memory operand as documented:
> 
>   "MD_CLEAR enumerates that the memory-operand variant of VERW (for
>    example, VERW m16) has been extended to also overwrite buffers affected
>    by MDS. This buffer overwriting functionality is not guaranteed for the
>    register operand variant of VERW."
> 
> Documentation also recommends to use a writable data segment selector:
> 
>   "The buffer overwriting occurs regardless of the result of the VERW
>    permission check, as well as when the selector is null or causes a
>    descriptor load segment violation. However, for lowest latency we
>    recommend using a selector that indicates a valid writable data
>    segment."
> 
> Add x86 specific documentation about MDS and the internal workings of the
> mitigation.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Borislav Petkov <bp@suse.de>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
> V3 --> V4: Document the segment selector choice as well.
> 
> V2 --> V3: Add VERW documentation and fix typos/grammar..., dropped 'i(0)'
>        	   Add more details to the documentation file
> 
> V1 --> V2: Add "cc" clobber and documentation
> ---
>  Documentation/index.rst              |    1 
>  Documentation/x86/conf.py            |   10 +++
>  Documentation/x86/index.rst          |    8 ++
>  Documentation/x86/mds.rst            |  100 +++++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/nospec-branch.h |   25 ++++++++
>  5 files changed, 144 insertions(+)
> 
> --- a/Documentation/index.rst
> +++ b/Documentation/index.rst
> @@ -101,6 +101,7 @@ implementation.
>     :maxdepth: 2
>  
>     sh/index
> +   x86/index
>  
>  Filesystem Documentation
>  ------------------------
> --- /dev/null
> +++ b/Documentation/x86/conf.py
> @@ -0,0 +1,10 @@
> +# -*- coding: utf-8; mode: python -*-
> +
> +project = "X86 architecture specific documentation"
> +
> +tags.add("subproject")
> +
> +latex_documents = [
> +    ('index', 'x86.tex', project,
> +     'The kernel development community', 'manual'),
> +]
> --- /dev/null
> +++ b/Documentation/x86/index.rst
> @@ -0,0 +1,8 @@
> +==========================
> +x86 architecture specifics
> +==========================
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   mds
> --- /dev/null
> +++ b/Documentation/x86/mds.rst
> @@ -0,0 +1,100 @@
> +Microarchitecural Data Sampling (MDS) mitigation
                ^
   MicroarchitecTural 

> +================================================
> +
> +.. _mds:
> +
> +Overview
> +--------
> +
> +Microarchitectural Data Sampling (MDS) is a family of side channel attacks
> +on internal buffers in Intel CPUs. The variants are:
> +
> + - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
> + - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
> + - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
> +
> +MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
> +dependent load (store-to-load forwarding) as an optimization. The forward
> +can also happen to a faulting or assisting load operation for a different
> +memory address, which can be exploited under certain conditions. Store
> +buffers are partitioned between Hyper-Threads so cross thread forwarding is
> +not possible. But if a thread enters or exits a sleep state the store
> +buffer is repartitioned which can expose data from one thread to the other.
> +
> +MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
> +L1 miss situations and to hold data which is returned or sent in response
> +to a memory or I/O operation. Fill buffers can forward data to a load
> +operation and also write data to the cache. When the fill buffer is
> +deallocated it can retain the stale data of the preceding operations which
> +can then be forwarded to a faulting or assisting load operation, which can
> +be exploited under certain conditions. Fill buffers are shared between
> +Hyper-Threads so cross thread leakage is possible.
> +
> +MLDPS leaks Load Port Data. Load ports are used to perform load operations
> +from memory or I/O. The received data is then forwarded to the register
> +file or a subsequent operation. In some implementations the Load Port can
> +contain stale data from a previous operation which can be forwarded to
> +faulting or assisting loads under certain conditions, which again can be
> +exploited eventually. Load ports are shared between Hyper-Threads so cross
> +thread leakage is possible.
> +
> +
> +Exposure assumptions
> +--------------------
> +
> +It is assumed that attack code resides in user space or in a guest with one
> +exception. The rationale behind this assumption is that the code construct
> +needed for exploiting MDS requires:
> +
> + - to control the load to trigger a fault or assist
> +
> + - to have a disclosure gadget which exposes the speculatively accessed
> +   data for consumption through a side channel.
> +
> + - to control the pointer through which the disclosure gadget exposes the
> +   data
> +
> +The existence of such a construct cannot be excluded with 100% certainty,
> +but the complexity involved makes it extremely unlikely.
> +
> +There is one exception, which is untrusted BPF. The functionality of
> +untrusted BPF is limited, but it needs to be thoroughly investigated
> +whether it can be used to create such a construct.
> +
> +
> +Mitigation strategy
> +-------------------
> +
> +All variants have the same mitigation strategy at least for the single CPU
> +thread case (SMT off): Force the CPU to clear the affected buffers.
> +
> +This is achieved by using the otherwise unused and obsolete VERW
> +instruction in combination with a microcode update. The microcode clears
> +the affected CPU buffers when the VERW instruction is executed.
> +
> +For virtualization there are two ways to achieve CPU buffer
> +clearing. Either the modified VERW instruction or via the L1D Flush
> +command. The latter is issued when L1TF mitigation is enabled so the extra
> +VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to
> +be issued.
> +
> +If the VERW instruction with the supplied segment selector argument is
> +executed on a CPU without the microcode update there is no side effect
> +other than a small number of pointlessly wasted CPU cycles.
> +
> +This does not protect against cross Hyper-Thread attacks except for MSBDS
> +which is only exploitable cross Hyper-thread when one of the Hyper-Threads
> +enters a C-state.
> +
> +The kernel provides a function to invoke the buffer clearing:
> +
> +    mds_clear_cpu_buffers()
> +
> +The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
> +(idle) transitions. Depending on the mitigation mode and the system state
> +the invocation can be enforced or conditional.
> +
> +According to current knowledge additional mitigations inside the kernel
> +itself are not required because the necessary gadgets to expose the leaked
> +data cannot be controlled in a way which allows exploitation from malicious
> +user space or VM guests.
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -318,6 +318,31 @@ DECLARE_STATIC_KEY_FALSE(switch_to_cond_
>  DECLARE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
>  DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
>  
> +#include <asm/segment.h>
> +
> +/**
> + * mds_clear_cpu_buffers - Mitigation for MDS vulnerability
> + *
> + * This uses the otherwise unused and obsolete VERW instruction in
> + * combination with microcode which triggers a CPU buffer flush when the
> + * instruction is executed.
> + */
> +static inline void mds_clear_cpu_buffers(void)
> +{
> +	static const u16 ds = __KERNEL_DS;
> +
> +	/*
> +	 * Has to be the memory-operand variant because only that
> +	 * guarantees the CPU buffer flush functionality according to
> +	 * documentation. The register-operand variant does not.
> +	 * Works with any segment selector, but a valid writable
> +	 * data segment is the fastest variant.
> +	 *
> +	 * "cc" clobber is required because VERW modifies ZF.
> +	 */
> +	asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
> +}
> +
>  #endif /* __ASSEMBLY__ */
>  
>  /*
> 
> 


* [MODERATED] Re: [patch V4 05/11] x86/speculation/mds: Clear CPU buffers on exit to user
  2019-02-22 22:24 ` [patch V4 05/11] x86/speculation/mds: Clear CPU buffers on exit to user Thomas Gleixner
  2019-02-25 21:04   ` [MODERATED] " Greg KH
@ 2019-02-26 15:20   ` Josh Poimboeuf
  2019-02-26 20:26     ` Thomas Gleixner
  1 sibling, 1 reply; 47+ messages in thread
From: Josh Poimboeuf @ 2019-02-26 15:20 UTC (permalink / raw)
  To: speck

On Fri, Feb 22, 2019 at 11:24:23PM +0100, speck for Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Add a static key which controls the invocation of the CPU buffer clear
> mechanism on exit to user space and add the call into
> prepare_exit_to_usermode() and do_nmi() right before actually returning.
>
> Add documentation which kernel to user space transition this covers and
> explain why some corner cases are not mitigated.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V3 --> V4: Add #DF mitigation and document that the #MC corner case
>        	   is really not interesting.
> 
> V3: Add NMI conditional on user regs and update documentation accordingly.
>     Use the static branch scheme suggested by Peter. Fix typos ...
> ---
>  Documentation/x86/mds.rst            |   41 +++++++++++++++++++++++++++++++++++
>  arch/x86/entry/common.c              |   10 ++++++++
>  arch/x86/include/asm/nospec-branch.h |    2 +
>  arch/x86/kernel/cpu/bugs.c           |    4 ++-
>  arch/x86/kernel/nmi.c                |    6 +++++
>  arch/x86/kernel/traps.c              |    9 +++++++
>  6 files changed, 71 insertions(+), 1 deletion(-)
> 
> --- a/Documentation/x86/mds.rst
> +++ b/Documentation/x86/mds.rst
> @@ -94,3 +94,44 @@ According to current knowledge additiona
>  itself are not required because the necessary gadgets to expose the leaked
>  data cannot be controlled in a way which allows exploitation from malicious
>  user space or VM guests.
> +
> +Mitigation points
> +-----------------
> +
> +1. Return to user space
> +^^^^^^^^^^^^^^^^^^^^^^^
> +   When transitioning from kernel to user space the CPU buffers are flushed
> +   on affected CPUs:
> +
> +   - always when the mitigation mode is full. The mitigation is enabled

Currently the mitigation is always full.

> +     through the static key mds_user_clear.
> +
> +   This covers transitions from kernel to user space through a return to
> +   user space from a syscall and from an interrupt or a regular exception.
> +
> +   There are other kernel to user space transitions which are not covered
> +   by this: NMIs and all non maskable exceptions which go through the
> +   paranoid exit, which means that they are not invoking the regular

Actually, NMI *is* mitigated.

What is a non maskable exception?

The statement about all paranoid exits being covered isn't correct,
because #DF is mitigated.

> +   prepare_exit_to_usermode() which handles the CPU buffer clearing.
> +
> +   Access to sensible data like keys, credentials in the NMI context is
> +   mostly theoretical: The CPU can do prefetching or execute a
> +   misspeculated code path and thereby fetching data which might end up
> +   leaking through a buffer.

This paragraph can be removed, since NMI is mitigated.

> +
> +   But for mounting other attacks the kernel stack address of the task is
> +   already valuable information. So in full mitigation mode, the NMI is
> +   mitigated on the return from do_nmi() to provide almost complete
> +   coverage.

This one is correct.

> +
> +   There is one non maskable exception which returns through paranoid exit

Again the phrase "non maskable exception".  Maybe I'm missing something
but I have no idea what that means.

> +   and is to some extent controllable from user space through
> +   modify_ldt(2): #DF. So mitigation is required in the double fault
> +   handler as well.
> +
> +   Another corner case is a #MC which hits between the buffer clear and the
> +   actual return to user. As this still is in kernel space it takes the
> +   paranoid exit path which does not clear the CPU buffers. So the #MC
> +   handler repopulates the buffers to some extent. Machine checks are not
> +   reliably controllable and the window is extremely small so mitigation
> +   would just tick a checkbox that this theoretical corner case is covered.

There is no mention of #DB anywhere, shouldn't it also be mitigated?

> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -31,6 +31,7 @@
>  #include <asm/vdso.h>
>  #include <linux/uaccess.h>
>  #include <asm/cpufeature.h>
> +#include <asm/nospec-branch.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/syscalls.h>
> @@ -180,6 +181,13 @@ static void exit_to_usermode_loop(struct
>  	}
>  }
>  
> +static inline void mds_user_clear_cpu_buffers(void)
> +{
> +	if (!static_branch_likely(&mds_user_clear))
> +		return;
> +	mds_clear_cpu_buffers();
> +}
> +
>  /* Called with IRQs disabled. */
>  __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
>  {
> @@ -212,6 +220,8 @@ static void exit_to_usermode_loop(struct
>  #endif
>  
>  	user_enter_irqoff();
> +
> +	mds_user_clear_cpu_buffers();
>  }
>  
>  #define SYSCALL_EXIT_WORK_FLAGS				\
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -318,6 +318,8 @@ DECLARE_STATIC_KEY_FALSE(switch_to_cond_
>  DECLARE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
>  DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
>  
> +DECLARE_STATIC_KEY_FALSE(mds_user_clear);
> +
>  #include <asm/segment.h>
>  
>  /**
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -63,10 +63,12 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_i
>  /* Control unconditional IBPB in switch_mm() */
>  DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
>  
> +/* Control MDS CPU buffer clear before returning to user space */
> +DEFINE_STATIC_KEY_FALSE(mds_user_clear);
> +
>  void __init check_bugs(void)
>  {
>  	identify_boot_cpu();
> -
>  	/*
>  	 * identify_boot_cpu() initialized SMT support information, let the
>  	 * core code know.
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -34,6 +34,7 @@
>  #include <asm/x86_init.h>
>  #include <asm/reboot.h>
>  #include <asm/cache.h>
> +#include <asm/nospec-branch.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/nmi.h>
> @@ -533,6 +534,11 @@ do_nmi(struct pt_regs *regs, long error_
>  		write_cr2(this_cpu_read(nmi_cr2));
>  	if (this_cpu_dec_return(nmi_state))
>  		goto nmi_restart;
> +
> +	if (!static_branch_likely(&mds_user_clear))
> +		return;
> +	if (user_mode(regs))
> +		mds_clear_cpu_buffers();

This could be simplified:

	if (user_mode(regs))
		mds_user_clear_cpu_buffers();

>  }
>  NOKPROBE_SYMBOL(do_nmi);
>  
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -366,6 +366,15 @@ dotraplinkage void do_double_fault(struc
>  		regs->ip = (unsigned long)general_protection;
>  		regs->sp = (unsigned long)&gpregs->orig_ax;
>  
> +		/*
> +		 * This situation can be triggered by userspace via
> +		 * modify_ldt(2) and the return does not take the regular
> +		 * user space exit, so a CPU buffer clear is required when
> +		 * MDS mitigation is enabled.
> +		 */
> +		if (static_branch_unlikely(&mds_user_clear))
> +			mds_clear_cpu_buffers();

Shouldn't it be likely?  Anyway this can just use
mds_user_clear_cpu_buffers().
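
In full, the suggested tail of do_nmi() would read roughly like this
(a compilable model, not the kernel code; the static key is reduced to
a plain bool here):

```c
#include <stdbool.h>

static bool mds_user_clear = true;	/* models the static key */
static int clears;

static void mds_clear_cpu_buffers(void) { clears++; }

static void mds_user_clear_cpu_buffers(void)
{
	if (mds_user_clear)
		mds_clear_cpu_buffers();
}

/* Tail of do_nmi(): mitigate only when the NMI interrupted user mode,
 * since a kernel-mode return goes back through a later kernel->user
 * exit which clears the buffers anyway. */
static void nmi_exit_tail(bool interrupted_user_mode)
{
	if (interrupted_user_mode)
		mds_user_clear_cpu_buffers();
}
```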

-- 
Josh


* [MODERATED] Re: [patch V4 06/11] x86/speculation/mds: Conditionally clear CPU buffers on idle entry
  2019-02-22 22:24 ` [patch V4 06/11] x86/speculation/mds: Conditionally clear CPU buffers on idle entry Thomas Gleixner
  2019-02-25 21:09   ` [MODERATED] " Greg KH
@ 2019-02-26 15:31   ` Josh Poimboeuf
  2019-02-26 20:20     ` Thomas Gleixner
  1 sibling, 1 reply; 47+ messages in thread
From: Josh Poimboeuf @ 2019-02-26 15:31 UTC (permalink / raw)
  To: speck

On Fri, Feb 22, 2019 at 11:24:24PM +0100, speck for Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Add a static key which controls the invocation of the CPU buffer clear
> mechanism on idle entry. This is independent of other MDS mitigations
> because the idle entry invocation to mitigate the potential leakage due to
> store buffer repartitioning is only necessary on SMT systems.
> 
> Add the actual invocations to the different halt/mwait variants which
> covers all usage sites. mwaitx is not patched as it's not available on
> Intel CPUs.
> 
> The buffer clear is only invoked before entering the C-State to prevent
> stale data from the idling CPU from being spilled to the Hyper-Thread
> sibling after the store buffer got repartitioned and all entries are
> available to the non-idle sibling.
> 
> When coming out of idle the store buffer is partitioned again so each
> sibling has half of it available. Now the CPU which returned from idle could be
> speculatively exposed to contents of the sibling, but the buffers are
> flushed either on exit to user space or on VMENTER.
> 
> When later on conditional buffer clearing is implemented on top of this,
> then there is no action required either because before returning to user
> space the context switch will set the condition flag which causes a flush
> on the return to user path.
> 
> This intentionaly does not handle the case in the acpi/processor_idle

intentionally

> driver which uses the legacy IO port interface for C-State transitions for
> two reasons:
> 
>  - The acpi/processor_idle driver was replaced by the intel_idle driver
>    almost a decade ago. Anything Nehalem upwards supports it and defaults
>    to that new driver.
> 
>  - The legacy IO port interface is likely to be used on older and therefore
>    unaffected CPUs or on systems which do not receive microcode updates
>    anymore, so there is no point in adding that.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Borislav Petkov <bp@suse.de>
> ---
> V4: Export mds_idle_clear
> V3: Adjust document wording
> ---
>  Documentation/x86/mds.rst            |   35 +++++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/irqflags.h      |    4 ++++
>  arch/x86/include/asm/mwait.h         |    7 +++++++
>  arch/x86/include/asm/nospec-branch.h |   12 ++++++++++++
>  arch/x86/kernel/cpu/bugs.c           |    3 +++
>  5 files changed, 61 insertions(+)
> 
> --- a/Documentation/x86/mds.rst
> +++ b/Documentation/x86/mds.rst
> @@ -135,3 +135,38 @@ Mitigation points
>     handler repopulates the buffers to some extent. Machine checks are not
>     reliably controllable and the window is extremely small so mitigation
>     would just tick a checkbox that this theoretical corner case is covered.
> +
> +
> +2. C-State transition
> +^^^^^^^^^^^^^^^^^^^^^
> +
> +   When a CPU goes idle and enters a C-State the CPU buffers need to be
> +   cleared on affected CPUs when SMT is active. This addresses the
> +   repartitioning of the store buffer when one of the Hyper-Threads enters
> +   a C-State.
> +
> +   When SMT is inactive, i.e. either the CPU does not support it or all
> +   sibling threads are offline CPU buffer clearing is not required.
> +
> +   The invocation is controlled by the static key mds_idle_clear which is
> +   switched depending on the chosen mitigation mode and the SMT state of
> +   the system.
> +
> +   The buffer clear is only invoked before entering the C-State to prevent
> +   that stale data from the idling CPU can be spilled to the Hyper-Thread

s/can be spilled/from spilling/

> +   sibling after the store buffer got repartitioned and all entries are
> +   available to the non idle sibling.
> +
> +   When coming out of idle the store buffer is partitioned again so each
> +   sibling has half of it available. The CPU returning from idle could then
> +   speculatively exposed to contents of the sibling. The buffers are
> +   flushed either on exit to user space or on VMENTER so malicious code
> +   in user space or the guest cannot speculatively access them.
> +
> +   The mitigation is hooked into all variants of halt()/mwait(), but does
> +   not cover the legacy ACPI IO-Port mechanism because the ACPI idle driver
> +   has been superseded by the intel_idle driver around 2010 and is
> +   preferred on all affected CPUs which are expected to gain the MD_CLEAR
> +   functionality in microcode. Aside from that the IO-Port mechanism is a
> +   legacy interface which is only used on older systems which are either
> +   not affected or do not receive microcode updates anymore.
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -6,6 +6,8 @@
>  
>  #ifndef __ASSEMBLY__
>  
> +#include <asm/nospec-branch.h>
> +
>  /* Provide __cpuidle; we can't safely include <linux/cpu.h> */
>  #define __cpuidle __attribute__((__section__(".cpuidle.text")))
>  
> @@ -54,11 +56,13 @@ static inline void native_irq_enable(voi
>  
>  static inline __cpuidle void native_safe_halt(void)
>  {
> +	mds_idle_clear_cpu_buffers();
>  	asm volatile("sti; hlt": : :"memory");
>  }
>  
>  static inline __cpuidle void native_halt(void)
>  {
> +	mds_idle_clear_cpu_buffers();
>  	asm volatile("hlt": : :"memory");
>  }
>  
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -6,6 +6,7 @@
>  #include <linux/sched/idle.h>
>  
>  #include <asm/cpufeature.h>
> +#include <asm/nospec-branch.h>
>  
>  #define MWAIT_SUBSTATE_MASK		0xf
>  #define MWAIT_CSTATE_MASK		0xf
> @@ -40,6 +41,8 @@ static inline void __monitorx(const void
>  
>  static inline void __mwait(unsigned long eax, unsigned long ecx)
>  {
> +	mds_idle_clear_cpu_buffers();
> +
>  	/* "mwait %eax, %ecx;" */
>  	asm volatile(".byte 0x0f, 0x01, 0xc9;"
>  		     :: "a" (eax), "c" (ecx));
> @@ -74,6 +77,8 @@ static inline void __mwait(unsigned long
>  static inline void __mwaitx(unsigned long eax, unsigned long ebx,
>  			    unsigned long ecx)
>  {
> +	/* No MDS buffer clear as this is AMD/HYGON only */
> +
>  	/* "mwaitx %eax, %ebx, %ecx;" */
>  	asm volatile(".byte 0x0f, 0x01, 0xfb;"
>  		     :: "a" (eax), "b" (ebx), "c" (ecx));
> @@ -81,6 +86,8 @@ static inline void __mwaitx(unsigned lon
>  
>  static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
>  {
> +	mds_idle_clear_cpu_buffers();
> +
>  	trace_hardirqs_on();
>  	/* "mwait %eax, %ecx;" */
>  	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -319,6 +319,7 @@ DECLARE_STATIC_KEY_FALSE(switch_mm_cond_
>  DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
>  
>  DECLARE_STATIC_KEY_FALSE(mds_user_clear);
> +DECLARE_STATIC_KEY_FALSE(mds_idle_clear);
>  
>  #include <asm/segment.h>
>  
> @@ -345,6 +346,17 @@ static inline void mds_clear_cpu_buffers
>  	asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
>  }
>  
> +/**
> + * mds_idle_clear_cpu_buffers - Mitigation for MDS vulnerability
> + *
> + * Clear CPU buffers if the corresponding static key is enabled
> + */
> +static inline void mds_idle_clear_cpu_buffers(void)
> +{
> +	if (static_branch_likely(&mds_idle_clear))
> +		mds_clear_cpu_buffers();
> +}

This two-line construct is more readable than the three-line
mds_user_clear_cpu_buffers() version from the previous patch;
I'd suggest doing the same thing there.

-- 
Josh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [MODERATED] Re: [patch V4 07/11] x86/speculation/mds: Add mitigation control for MDS
  2019-02-22 22:24 ` [patch V4 07/11] x86/speculation/mds: Add mitigation control for MDS Thomas Gleixner
  2019-02-25 20:17   ` [MODERATED] " mark gross
@ 2019-02-26 15:50   ` Josh Poimboeuf
  2019-02-26 20:16     ` Thomas Gleixner
  1 sibling, 1 reply; 47+ messages in thread
From: Josh Poimboeuf @ 2019-02-26 15:50 UTC (permalink / raw)
  To: speck

On Fri, Feb 22, 2019 at 11:24:25PM +0100, speck for Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Now that the mitigations are in place, add a command line parameter to
> control the mitigation, a mitigation selector function and an SMT update
> mechanism.
> 
> This is the minimal, straightforward initial implementation which just
> provides an always on/off mode. The command line parameter is:
> 
>   mds=[full|off|auto]
> 
> This is consistent with the existing mitigations for other speculative
> hardware vulnerabilities.
> 
> The idle invocation is dynamically updated according to the SMT state of
> the system similar to the dynamic update of the STIBP mitigation.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Reviewed-by: Borislav Petkov <bp@suse.de>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |   27 ++++++++
>  arch/x86/include/asm/processor.h                |    6 +
>  arch/x86/kernel/cpu/bugs.c                      |   76 ++++++++++++++++++++++++
>  3 files changed, 109 insertions(+)
> 
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2356,6 +2356,33 @@
>  			Format: <first>,<last>
>  			Specifies range of consoles to be captured by the MDA.
>  
> +	mds=		[X86,INTEL]
> +			Control mitigation for the Micro-architectural Data
> +			Sampling (MDS) vulnerability.
> +
> +			Certain CPUs are vulnerable to an exploit against CPU
> +			internal buffers which can forward information to a
> +			disclosure gadget under certain conditions.
> +
> +			In vulnerable processors, the speculatively
> +			forwarded data can be used in a cache side channel
> +			attack, to access data to which the attacker does
> +			not have direct access.
> +
> +			This parameter controls the MDS mitigation. The the

https://www.youtube.com/watch?v=X43ZyUGOPyw

> +			options are:
> +
> +			full    - Unconditionally enable MDS mitigation
> +			off     - Unconditionally disable MDS mitigation
> +			auto    - Kernel detects whether the CPU model is
> +				  vulnerable to MDS and picks the most
> +				  appropriate mitigation. If the CPU is not
> +				  vulnerable, "off" is selected. If the CPU
> +				  is vulnerable "full" is selected.
> +
> +			Not specifying this option is equivalent to
> +			mds=auto.
> +
>  	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
>  			Amount of memory to be used when the kernel is not able
>  			to see the whole system memory or for test.
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -992,4 +992,10 @@ enum l1tf_mitigations {
>  
>  extern enum l1tf_mitigations l1tf_mitigation;
>  
> +enum mds_mitigations {
> +	MDS_MITIGATION_OFF,
> +	MDS_MITIGATION_AUTO,
> +	MDS_MITIGATION_FULL,
> +};
> +
>  #endif /* _ASM_X86_PROCESSOR_H */
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -37,6 +37,7 @@
>  static void __init spectre_v2_select_mitigation(void);
>  static void __init ssb_select_mitigation(void);
>  static void __init l1tf_select_mitigation(void);
> +static void __init mds_select_mitigation(void);
>  
>  /* The base value of the SPEC_CTRL MSR that always has to be preserved. */
>  u64 x86_spec_ctrl_base;
> @@ -106,6 +107,8 @@ void __init check_bugs(void)
>  
>  	l1tf_select_mitigation();
>  
> +	mds_select_mitigation();
> +
>  #ifdef CONFIG_X86_32
>  	/*
>  	 * Check whether we are able to run this kernel safely on SMP.
> @@ -212,6 +215,59 @@ static void x86_amd_ssb_disable(void)
>  }
>  
>  #undef pr_fmt
> +#define pr_fmt(fmt)	"MDS: " fmt
> +
> +/* Default mitigation for MDS-affected CPUs */
> +static enum mds_mitigations mds_mitigation __ro_after_init = MDS_MITIGATION_AUTO;
> +
> +static const char * const mds_strings[] = {
> +	[MDS_MITIGATION_OFF]	= "Vulnerable",
> +	[MDS_MITIGATION_FULL]	= "Mitigation: Clear CPU buffers"
> +};
> +
> +static void mds_select_mitigation(void)
> +{
> +	if (!boot_cpu_has_bug(X86_BUG_MDS)) {
> +		mds_mitigation = MDS_MITIGATION_OFF;
> +		return;
> +	}
> +
> +	switch (mds_mitigation) {
> +	case MDS_MITIGATION_OFF:
> +		break;
> +	case MDS_MITIGATION_AUTO:
> +	case MDS_MITIGATION_FULL:
> +		if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) {
> +			mds_mitigation = MDS_MITIGATION_FULL;
> +			static_branch_enable(&mds_user_clear);
> +		} else {
> +			mds_mitigation = MDS_MITIGATION_OFF;
> +		}
> +		break;
> +	}
> +	pr_info("%s\n", mds_strings[mds_mitigation]);
> +}
> +
> +static int __init mds_cmdline(char *str)
> +{
> +	if (!boot_cpu_has_bug(X86_BUG_MDS))
> +		return 0;
> +
> +	if (!str)
> +		return -EINVAL;
> +
> +	if (!strcmp(str, "off"))
> +		mds_mitigation = MDS_MITIGATION_OFF;
> +	else if (!strcmp(str, "auto"))
> +		mds_mitigation = MDS_MITIGATION_AUTO;
> +	else if (!strcmp(str, "full"))
> +		mds_mitigation = MDS_MITIGATION_FULL;
> +
> +	return 0;
> +}
> +early_param("mds", mds_cmdline);

I agree with Mark that mds=auto isn't needed.

Shall we also have a mds=full,nosmt?

> +
> +#undef pr_fmt
>  #define pr_fmt(fmt)     "Spectre V2 : " fmt
>  
>  static enum spectre_v2_mitigation spectre_v2_enabled __ro_after_init =
> @@ -615,6 +671,15 @@ static void update_indir_branch_cond(voi
>  		static_branch_disable(&switch_to_cond_stibp);
>  }
>  
> +/* Update the static key controlling the MDS CPU buffer clear in idle */
> +static void update_mds_branch_idle(void)
> +{
> +	if (sched_smt_active())
> +		static_branch_enable(&mds_idle_clear);
> +	else
> +		static_branch_disable(&mds_idle_clear);
> +}
> +
>  void arch_smt_update(void)
>  {
>  	/* Enhanced IBRS implies STIBP. No update required. */
> @@ -636,6 +701,17 @@ void arch_smt_update(void)
>  		break;
>  	}
>  
> +	switch (mds_mitigation) {
> +	case MDS_MITIGATION_OFF:
> +		break;
> +	case MDS_MITIGATION_FULL:
> +		update_mds_branch_idle();
> +		break;
> +	/* Keep GCC happy */
> +	case MDS_MITIGATION_AUTO:
> +		break;
> +	}
> +

Per the docs, this is a bug because full and auto should be identical.

-- 
Josh


* [MODERATED] Re: [patch V4 00/11] MDS basics
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
                   ` (12 preceding siblings ...)
  2019-02-25 16:38 ` mark gross
@ 2019-02-26 16:28 ` Tyler Hicks
  2019-02-26 19:58   ` Thomas Gleixner
  2019-02-26 18:58 ` [MODERATED] " Kanth Ghatraju
  14 siblings, 1 reply; 47+ messages in thread
From: Tyler Hicks @ 2019-02-26 16:28 UTC (permalink / raw)
  To: speck

On 2019-02-22 23:24:18, speck for Thomas Gleixner wrote:
> Git tree WIP.mds branch is updated as well.
> 
> If anyone of the people new to this need access to the git repo,
> please send me a public SSH key so I can add to the gitolite config.

I don't think that I have access to the git repo.

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDhlN1WX8KQuBV1gEka85iElgeIRnBEQjJoTghHuUrOusBpG0f3nf468fWm2gXKgItRiZp9LI7y9tkbaUV3wSlxjt7NTzixr22eYJKLiZCWQYAZZVlSSBzIreiV+nRgjkK8rojJKnjPcxMNg1JBgjSyY0R7AoPkIU9oChLVTDj6nun3yqT/ZdMJiPCtB9mxJP7krGlBWag5bQV7Cus5nJtcXqc9rGVfJ07Ur5z6ymb0DLnphRnjM8AOYjyDRdMgXo6RTN9e0VAgLPTgXBc0ejxINF6E41sUc30TMuiQ10wZbnjFzFws/PSerTEbheqMUB/tF/LFgx1J4cGbGFn86H4hp0Wn+FvmTd4jSOabmDxbBVpjtoYlzdkblsJGph9z091qY0PUj41Va3hyYfb8SbrShpf6JE9l+l5m3nXz4Dts93qEfdWo7moJLUQZ8aAL9pANspwfH7GZzFoy7h0iXtuW1DWDOluGLDbDvLtH6Ns2AK+GEgkE9DBB7pny2wOZlV1q5xSmJml+EESK8SJSjncPmkroKbhGW4G3BwVktpCfzA3nn7H75J5RLXNDulwXJWaaQhmh4jVGNI8fL/mnQFZwd9KjhcvKubDVLKCGY4rh2efFloNBZA9k1rzRDZoFYtmsB3Gni/rJ3Ctc9krcbg4n1Q2EPW/d6Ar7qEX2bASgkw== tyhicks@sec - Canonical

Thanks!

Tyler


* [MODERATED] Re: [patch V4 09/11] x86/speculation/mds: Add mitigation mode VMWERV
  2019-02-26  0:34     ` Andrew Cooper
@ 2019-02-26 18:51       ` mark gross
  0 siblings, 0 replies; 47+ messages in thread
From: mark gross @ 2019-02-26 18:51 UTC (permalink / raw)
  To: speck

On Tue, Feb 26, 2019 at 12:34:55AM +0000, speck for Andrew Cooper wrote:
> On 25/02/2019 20:31, speck for mark gross wrote:
> > On Fri, Feb 22, 2019 at 11:24:27PM +0100, speck for Thomas Gleixner wrote:
> >> From: Thomas Gleixner <tglx@linutronix.de>
> >>
> >> In virtualized environments it can happen that the host has the microcode
> >> update which utilizes the VERW instruction to clear CPU buffers, but the
> >> hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
> >> to guests.
> >>
> >> Introduce an internal mitigation mode VMWERV which enables the invocation
> >> of the CPU buffer clearing even if X86_FEATURE_MD_CLEAR is not set. If the
> >> system has no updated microcode this results in a pointless execution of
> >> the VERW instruction wasting a few CPU cycles. If the microcode is updated,
> >> but not exposed to a guest then the CPU buffers will be cleared.
> >>
> >> That said: Virtual Machines Will Eventually Receive Vaccine
> >>
> >> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> >> ---
> >> V2 -> V3: Rename mode.
> >> ---
> >>  Documentation/x86/mds.rst        |   29 +++++++++++++++++++++++++++++
> >>  arch/x86/include/asm/processor.h |    1 +
> >>  arch/x86/kernel/cpu/bugs.c       |   14 ++++++++------
> >>  3 files changed, 38 insertions(+), 6 deletions(-)
> >>
> >> --- a/Documentation/x86/mds.rst
> >> +++ b/Documentation/x86/mds.rst
> >> @@ -90,11 +90,40 @@ The mitigation is invoked on kernel/user
> >>  (idle) transitions. Depending on the mitigation mode and the system state
> >>  the invocation can be enforced or conditional.
> >>  
> >> +As a special quirk to address virtualization scenarios where the host has
> >> +the microcode updated, but the hypervisor does not (yet) expose the
> >> +MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
> >> +hope that it might actually clear the buffers. The state is reflected
> >> +accordingly.
> >> +
> >>  According to current knowledge additional mitigations inside the kernel
> >>  itself are not required because the necessary gadgets to expose the leaked
> >>  data cannot be controlled in a way which allows exploitation from malicious
> >>  user space or VM guests.
> >>  
> >> +
> >> +Kernel internal mitigation modes
> >> +--------------------------------
> >> +
> >> + ======= ===========================================================
> >> + off     Mitigation is disabled. Either the CPU is not affected or
> >> +         mds=off is supplied on the kernel command line
> >> +
> >> + full    Mitigation is enabled. CPU is affected and MD_CLEAR is
> >> +         advertised in CPUID.
> >> +
> >> + vmwerv	 Mitigation is enabled. CPU is affected and MD_CLEAR is not
> >     vmverw  <-- type oh?
> 
> I recommend re-reading the commit message :)
> 
> The position of the W isn't an accident.
Virtual Machines Will Eventually Receive Vaccine  (VMWERV)

I get it now.

meh,
--mark


* [MODERATED] Re: [patch V4 00/11] MDS basics
  2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
                   ` (13 preceding siblings ...)
  2019-02-26 16:28 ` [MODERATED] " Tyler Hicks
@ 2019-02-26 18:58 ` Kanth Ghatraju
  2019-02-26 19:59   ` Thomas Gleixner
  14 siblings, 1 reply; 47+ messages in thread
From: Kanth Ghatraju @ 2019-02-26 18:58 UTC (permalink / raw)
  To: speck


[-- Attachment #1.1: Type: text/plain, Size: 321 bytes --]



> On Feb 22, 2019, at 5:24 PM, speck for Thomas Gleixner <speck@linutronix.de> wrote:
> 
> 
> 
> If anyone of the people new to this need access to the git repo,
> please send me a public SSH key so I can add to the gitolite config.
> 

Hello Thomas,

Attached is my public access key. Thanks.

-kanth


[-- Attachment #1.2: id_rsa.pub --]
[-- Type: application/x-mspublisher, Size: 408 bytes --]


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [patch V4 09/11] x86/speculation/mds: Add mitigation mode VMWERV
  2019-02-25 20:31   ` mark gross
  2019-02-26  0:34     ` Andrew Cooper
@ 2019-02-26 19:29     ` Thomas Gleixner
  1 sibling, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-26 19:29 UTC (permalink / raw)
  To: speck

On Mon, 25 Feb 2019, speck for mark gross wrote:
> On Fri, Feb 22, 2019 at 11:24:27PM +0100, speck for Thomas Gleixner wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> > 
> > In virtualized environments it can happen that the host has the microcode
> > update which utilizes the VERW instruction to clear CPU buffers, but the
> > hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
> > to guests.
> > 
> > Introduce an internal mitigation mode VMWERV which enables the invocation
> > of the CPU buffer clearing even if X86_FEATURE_MD_CLEAR is not set. If the
> > system has no updated microcode this results in a pointless execution of
> > the VERW instruction wasting a few CPU cycles. If the microcode is updated,
> > but not exposed to a guest then the CPU buffers will be cleared.
> > 
> > That said: Virtual Machines Will Eventually Receive Vaccine

> > + vmwerv	 Mitigation is enabled. CPU is affected and MD_CLEAR is not
>     vmverw  <-- type oh?

Actually it's intentional. I was looking for something which is a subtle
hint for why this thing exists in the first place and is a proper
acronym. See above.

I probably could come up with something for what vmverw states, but the
subtle hint is then even more subtle. Not that I care much.

Thanks,

	tglx


* Re: [patch V4 00/11] MDS basics
  2019-02-25 16:38 ` mark gross
@ 2019-02-26 19:58   ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-26 19:58 UTC (permalink / raw)
  To: speck

On Mon, 25 Feb 2019, speck for mark gross wrote:
> On Fri, Feb 22, 2019 at 11:24:18PM +0100, speck for Thomas Gleixner wrote:

Added your key.

> BTW what is the git remote for this git repo?

  cvs.ou.linutronix.de:linux/speck/linux

Thanks,

	tglx


* Re: [patch V4 00/11] MDS basics
  2019-02-26 16:28 ` [MODERATED] " Tyler Hicks
@ 2019-02-26 19:58   ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-26 19:58 UTC (permalink / raw)
  To: speck

On Tue, 26 Feb 2019, speck for Tyler Hicks wrote:
> On 2019-02-22 23:24:18, speck for Thomas Gleixner wrote:
> > Git tree WIP.mds branch is updated as well.
>
> I don't think that I have access to the git repo.

Added your key.

  cvs.ou.linutronix.de:linux/speck/linux

Thanks,

        tglx


* Re: [patch V4 00/11] MDS basics
  2019-02-26 18:58 ` [MODERATED] " Kanth Ghatraju
@ 2019-02-26 19:59   ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-26 19:59 UTC (permalink / raw)
  To: speck

On Tue, 26 Feb 2019, speck for Kanth Ghatraju wrote:
> > On Feb 22, 2019, at 5:24 PM, speck for Thomas Gleixner <speck@linutronix.de> wrote:
> > If anyone of the people new to this need access to the git repo,
> > please send me a public SSH key so I can add to the gitolite config.
> > 
> Attached is my public access key. Thanks.

Added your key.

  cvs.ou.linutronix.de:linux/speck/linux

Thanks,

        tglx


* Re: [patch V4 11/11] Documentation: Add MDS vulnerability documentation
  2019-02-25 18:02   ` [MODERATED] " Dave Hansen
@ 2019-02-26 20:10     ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-26 20:10 UTC (permalink / raw)
  To: speck

On Mon, 25 Feb 2019, speck for Dave Hansen wrote:

> On 2/22/19 2:24 PM, speck for Thomas Gleixner wrote:
> > +Contrary to other speculation based vulnerabilities the MDS vulnerability
> > +does not allow the attacker to control the memory target address. As a
> > +consequence the attacks are purely sampling based, but as demonstrated with
> > +the TLBleed attack, samples can be postprocessed successfully.
> 
> I saw this "sampling-based" terminology in Andi's docs too.  Personally,
> I find it a bit confusing.  I think it's trying to make a distinction
> between attacks that pull data out of memory and attacks that pull data
> out of CPU-internal state that came from somewhere else.  Maybe
> something like:
> 
> 	Other attacks such as Spectre and Meltdown tend to target data
> 	at a specific memory address.  The MDS vulnerability itself can
> 	not be targeted at memory and can only leak memory contents that
> 	have been loaded into the CPU buffers by other means.
> 
> Or, is it trying to make a *timing* argument?

No, the point is that the other attacks target data at a memory address, so
it's more targeted in some ways; at least once the attack has found something
which looks interesting, it can be targeted pretty well.

The MDS attacks just collect the buffer leakage and then try to make sense
of the leaked data they retrieved. I think sampling describes that
pretty well. Let me think about it.

Thanks,

	tglx


* Re: [patch V4 11/11] Documentation: Add MDS vulnerability documentation
  2019-02-23  9:58   ` [MODERATED] " Greg KH
@ 2019-02-26 20:11     ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-26 20:11 UTC (permalink / raw)
  To: speck

On Sat, 23 Feb 2019, speck for Greg KH wrote:
> On Fri, Feb 22, 2019 at 11:24:29PM +0100, speck for Thomas Gleixner wrote:
> > +Because the buffers are potentially shared between Hyper-Threads cross
> > +Hyper-Thread attacks may be possible.
> 
> Shouldn't this be "are possible."?

Yes, of course.

Thanks,

	tglx


* Re: [patch V4 07/11] x86/speculation/mds: Add mitigation control for MDS
  2019-02-26 15:50   ` Josh Poimboeuf
@ 2019-02-26 20:16     ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-26 20:16 UTC (permalink / raw)
  To: speck

On Tue, 26 Feb 2019, speck for Josh Poimboeuf wrote:
> On Fri, Feb 22, 2019 at 11:24:25PM +0100, speck for Thomas Gleixner wrote:
> > +
> > +			This parameter controls the MDS mitigation. The the
> 
> https://www.youtube.com/watch?v=X43ZyUGOPyw

Hehe.

> > +	if (!strcmp(str, "off"))
> > +		mds_mitigation = MDS_MITIGATION_OFF;
> > +	else if (!strcmp(str, "auto"))
> > +		mds_mitigation = MDS_MITIGATION_AUTO;
> > +	else if (!strcmp(str, "full"))
> > +		mds_mitigation = MDS_MITIGATION_FULL;
> > +
> > +	return 0;
> > +}
> > +early_param("mds", mds_cmdline);
> 
> I agree with Mark that mds=auto isn't needed.

Yes, if we just have full/off auto is pointless. I'll drop it.

> Shall we also have a mds=full,nosmt?

Good question.

> > +	switch (mds_mitigation) {
> > +	case MDS_MITIGATION_OFF:
> > +		break;
> > +	case MDS_MITIGATION_FULL:
> > +		update_mds_branch_idle();
> > +		break;
> > +	/* Keep GCC happy */
> > +	case MDS_MITIGATION_AUTO:
> > +		break;
> > +	}
> > +
> 
> Per the docs, this is a bug because full and auto should be identical.

Per docs, yes, but not per code because auto is replaced and that case is
just there so GCC does not yell about the missed enum in the switch case. I
prefer that over default, because when extending the enum, gcc will yell
and you won't forget.

Thanks,

	tglx


* Re: [patch V4 06/11] x86/speculation/mds: Conditionally clear CPU buffers on idle entry
  2019-02-26 15:31   ` Josh Poimboeuf
@ 2019-02-26 20:20     ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-26 20:20 UTC (permalink / raw)
  To: speck

On Tue, 26 Feb 2019, speck for Josh Poimboeuf wrote:
> On Fri, Feb 22, 2019 at 11:24:24PM +0100, speck for Thomas Gleixner wrote:
> > +/**
> > + * mds_idle_clear_cpu_buffers - Mitigation for MDS vulnerability
> > + *
> > + * Clear CPU buffers if the corresponding static key is enabled
> > + */
> > +static inline void mds_idle_clear_cpu_buffers(void)
> > +{
> > +	if (static_branch_likely(&mds_idle_clear))
> > +		mds_clear_cpu_buffers();
> > +}
> 
> This two-line construct is more readable than the
> mds_user_clear_cpu_buffers() three-line version from the previous patch,
> I'd suggest doing the same thing there.

Will do.

Thanks,

	tglx


* Re: [patch V4 05/11] x86/speculation/mds: Clear CPU buffers on exit to user
  2019-02-26 15:20   ` Josh Poimboeuf
@ 2019-02-26 20:26     ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-26 20:26 UTC (permalink / raw)
  To: speck

On Tue, 26 Feb 2019, speck for Josh Poimboeuf wrote:
> On Fri, Feb 22, 2019 at 11:24:23PM +0100, speck for Thomas Gleixner wrote:
> > +1. Return to user space
> > +^^^^^^^^^^^^^^^^^^^^^^^
> > +   When transitioning from kernel to user space the CPU buffers are flushed
> > +   on affected CPUs:
> > +
> > +   - always when the mitigation mode is full. The mitigation is enabled
> 
> Currently the mitigation is always full.
> 
> > +     through the static key mds_user_clear.
> > +
> > +   This covers transitions from kernel to user space through a return to
> > +   user space from a syscall and from an interrupt or a regular exception.
> > +
> > +   There are other kernel to user space transitions which are not covered
> > +   by this: NMIs and all non maskable exceptions which go through the
> > +   paranoid exit, which means that they are not invoking the regular
> 
> Actually, NMI *is* mitigated.

But not by the above. That's a separate mitigation point due to the mess
which the x86 exception handling is.

> What is a non maskable exception?

All exceptions which are delivered despite interrupts being disabled, NMI,
MCE, DF, ....

> The statement about all paranoid exits being covered isn't correct,
> because #DF is mitigated.
> 
> > +   prepare_exit_to_usermode() which handles the CPU buffer clearing.
> > +
> > +   Access to sensitive data like keys or credentials in the NMI context is
> > +   mostly theoretical: The CPU can do prefetching or execute a
> > +   misspeculated code path and thereby fetch data which might end up
> > +   leaking through a buffer.
> 
> This paragraph can be removed, since NMI is mitigated.
> 
> > +
> > +   But for mounting other attacks the kernel stack address of the task is
> > +   already valuable information. So in full mitigation mode, the NMI is
> > +   mitigated on the return from do_nmi() to provide almost complete
> > +   coverage.
> 
> This one is correct.
> 
> > +
> > +   There is one non maskable exception which returns through paranoid exit
> 
> Again the phrase "non maskable exception".  Maybe I'm missing something
> but I have no idea what that means.
>
> > +   and is to some extent controllable from user space through
> > +   modify_ldt(2): #DF. So mitigation is required in the double fault
> > +   handler as well.
> > +
> > +   Another corner case is a #MC which hits between the buffer clear and the
> > +   actual return to user. As this still is in kernel space it takes the
> > +   paranoid exit path which does not clear the CPU buffers. So the #MC
> > +   handler repopulates the buffers to some extent. Machine checks are not
> > +   reliably controllable and the window is extremely small, so mitigation
> > +   would just tick a checkbox that this theoretical corner case is covered.
> 
> There is no mention of #DB anywhere, shouldn't it also be mitigated?

If #DB comes from a user space int1 then it will go through the regular
return to user path which is mitigated. If it happens in the kernel, it's
not relevant.

The thing about NMI and the #DF special case is that even if they come from
user space they are not returning through the regular path and therefore
need explicit mitigation.

I'll reword the whole thing so it's less confusing.

Thanks,

	tglx


* Re: [patch V4 02/11] x86/speculation/mds: Add basic bug infrastructure for MDS
  2019-02-23  7:42     ` Thomas Gleixner
@ 2019-02-27 13:04       ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-27 13:04 UTC (permalink / raw)
  To: speck

On Sat, 23 Feb 2019, speck for Thomas Gleixner wrote:
> On Fri, 22 Feb 2019, speck for Linus Torvalds wrote:
> > But looking at those tables, I do wonder if maybe we should have
> > instead a list of CPU's, and then associate the quirks with the CPU.
> 
> Good point. Never thought about it. Should be trivial enough to do.

And doing so immediately shows that the current tables are
inconsistent: AIRMONT is not affected by SSB, but AIRMONT_MID is, according
to the cpu_no_spec_store_bypass table. I noticed that when consolidating
all the bits into a single table....

Thanks,

	tglx


* [MODERATED] Encrypted Message
  2019-02-26 14:19   ` Josh Poimboeuf
@ 2019-03-01 20:58     ` Jon Masters
  2019-03-01 22:14       ` Jon Masters
  0 siblings, 1 reply; 47+ messages in thread
From: Jon Masters @ 2019-03-01 20:58 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 164 bytes --]

From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()

[-- Attachment #2: Type: text/plain, Size: 2764 bytes --]

On 2/26/19 9:19 AM, speck for Josh Poimboeuf wrote:

> On Fri, Feb 22, 2019 at 11:24:22PM +0100, speck for Thomas Gleixner wrote:
>> +MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
>> +L1 miss situations and to hold data which is returned or sent in response
>> +to a memory or I/O operation. Fill buffers can forward data to a load
>> +operation and also write data to the cache. When the fill buffer is
>> +deallocated it can retain the stale data of the preceding operations which
>> +can then be forwarded to a faulting or assisting load operation, which can
>> +be exploited under certain conditions. Fill buffers are shared between
>> +Hyper-Threads so cross thread leakage is possible.

The fill buffers sit opposite the L1D$ and participate in coherency
directly. They supply data directly to the load store units. Here's the
internal summary I wrote (feel free to use any of it that is useful):

"Intel processors utilize fill buffers to perform loads of data when a
miss occurs in the Level 1 data cache. The fill buffer allows the
processor to implement a non-blocking cache, continuing with other
operations while the necessary cache data “line” is loaded from a higher
level cache or from memory. It also allows the result of the fill to be
forwarded directly to the EU (Execution Unit) requiring the load,
without waiting for it to be written into the L1 Data Cache.

A load operation is not decoupled in the same way that a store is, but
it does involve an AGU (Address Generation Unit) operation. If the AGU
generates a fault (#PF, etc.) or an assist (A/D bits) then the classical
Intel design would block the load and later reissue it. In contemporary
designs, it instead allows subsequent speculation operations to
temporarily see a forwarded data value from the fill buffer slot prior
to the load actually taking place. Thus it is possible to read data that
was recently accessed by another thread, if the fill buffer entry is not
reused.

It is this attack that allows cross-thread SMT leakage and breaks HT
without recourse other than to disable it or to implement core
scheduling in the Linux kernel.

Variants of this include loads that cross cache or page boundaries due
to further optimizations in Intel’s implementation. For example, Intel
incorporate logic to guess at address generation prior to determining
whether it crosses such a boundary (covered in US5335333A) and will
forward this to the TLB/load logic prior to resolving the full address.
They will retry the load by re-issuing uops in the case of a cross
cacheline/page boundary but in that case will leak state as well."

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop



* [MODERATED] Encrypted Message
  2019-03-01 20:58     ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-01 22:14       ` Jon Masters
  0 siblings, 0 replies; 47+ messages in thread
From: Jon Masters @ 2019-03-01 22:14 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 161 bytes --]

From: Jon Masters <jcm@redhat.com>
To: speck for Jon Masters <speck@linutronix.de>
Subject: Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()

[-- Attachment #2: Type: text/plain, Size: 3426 bytes --]

On 3/1/19 3:58 PM, speck for Jon Masters wrote:
> On 2/26/19 9:19 AM, speck for Josh Poimboeuf wrote:
> 
>> On Fri, Feb 22, 2019 at 11:24:22PM +0100, speck for Thomas Gleixner wrote:
>>> +MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
>>> +L1 miss situations and to hold data which is returned or sent in response
>>> +to a memory or I/O operation. Fill buffers can forward data to a load
>>> +operation and also write data to the cache. When the fill buffer is
>>> +deallocated it can retain the stale data of the preceding operations which
>>> +can then be forwarded to a faulting or assisting load operation, which can
>>> +be exploited under certain conditions. Fill buffers are shared between
>>> +Hyper-Threads so cross thread leakage is possible.
> 
> The fill buffers sit opposite the L1D$ and participate in coherency
> directly. They supply data directly to the load store units. Here's the
> internal summary I wrote (feel free to use any of it that is useful):
> 
> "Intel processors utilize fill buffers to perform loads of data when a
> miss occurs in the Level 1 data cache. The fill buffer allows the
> processor to implement a non-blocking cache, continuing with other
> operations while the necessary cache data “line” is loaded from a higher
> level cache or from memory. It also allows the result of the fill to be
> forwarded directly to the EU (Execution Unit) requiring the load,
> without waiting for it to be written into the L1 Data Cache.
> 
> A load operation is not decoupled in the same way that a store is, but
> it does involve an AGU (Address Generation Unit) operation. If the AGU
> generates a fault (#PF, etc.) or an assist (A/D bits) then the classical
> Intel design would block the load and later reissue it. In contemporary
> designs, it instead allows subsequent speculation operations to
> temporarily see a forwarded data value from the fill buffer slot prior
> to the load actually taking place. Thus it is possible to read data that
> was recently accessed by another thread, if the fill buffer entry is not
> reused.
> 
> It is this attack that allows cross-thread SMT leakage and breaks HT
> without recourse other than to disable it or to implement core
> scheduling in the Linux kernel.
> 
> Variants of this include loads that cross cache or page boundaries due
> to further optimizations in Intel’s implementation. For example, Intel
> incorporates logic to guess at the generated address prior to determining
> whether it crosses such a boundary (covered in US5335333A) and will
> forward this to the TLB/load logic prior to resolving the full address.
> The load is then retried by re-issuing uops when a cross-cacheline/page
> boundary is detected, but in that case state is leaked as well."

Btw, I've got various reproducers here that I'm happy to share with the
right folks if useful. Thomas and Linus should already have my IFU one
for later testing of that; I also have e.g. an FBBF one. Currently it
just spews whatever it sees from the other threads, but in the next few
days I'll have it cleaned up to send/receive specific messages - then I
can just wrap it with a bow so it can print yes/no vulnerable.

Ping if you have a need for a repro (keybase/email) and I'll go through
our process for sharing as appropriate.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


Thread overview: 47+ messages
2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
2019-02-22 22:24 ` [patch V4 01/11] x86/msr-index: Cleanup bit defines Thomas Gleixner
2019-02-22 22:24 ` [patch V4 02/11] x86/speculation/mds: Add basic bug infrastructure for MDS Thomas Gleixner
2019-02-23  1:28   ` [MODERATED] " Linus Torvalds
2019-02-23  7:42     ` Thomas Gleixner
2019-02-27 13:04       ` Thomas Gleixner
2019-02-22 22:24 ` [patch V4 03/11] x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests Thomas Gleixner
2019-02-22 22:24 ` [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() Thomas Gleixner
2019-02-25 16:06   ` [MODERATED] " Frederic Weisbecker
2019-02-26 14:19   ` Josh Poimboeuf
2019-03-01 20:58     ` [MODERATED] Encrypted Message Jon Masters
2019-03-01 22:14       ` Jon Masters
2019-02-26 15:00   ` [MODERATED] Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() David Woodhouse
2019-02-22 22:24 ` [patch V4 05/11] x86/speculation/mds: Clear CPU buffers on exit to user Thomas Gleixner
2019-02-25 21:04   ` [MODERATED] " Greg KH
2019-02-26 15:20   ` Josh Poimboeuf
2019-02-26 20:26     ` Thomas Gleixner
2019-02-22 22:24 ` [patch V4 06/11] x86/speculation/mds: Conditionally clear CPU buffers on idle entry Thomas Gleixner
2019-02-25 21:09   ` [MODERATED] " Greg KH
2019-02-26 15:31   ` Josh Poimboeuf
2019-02-26 20:20     ` Thomas Gleixner
2019-02-22 22:24 ` [patch V4 07/11] x86/speculation/mds: Add mitigation control for MDS Thomas Gleixner
2019-02-25 20:17   ` [MODERATED] " mark gross
2019-02-26 15:50   ` Josh Poimboeuf
2019-02-26 20:16     ` Thomas Gleixner
2019-02-22 22:24 ` [patch V4 08/11] x86/speculation/mds: Add sysfs reporting " Thomas Gleixner
2019-02-22 22:24 ` [patch V4 09/11] x86/speculation/mds: Add mitigation mode VMWERV Thomas Gleixner
2019-02-23  9:52   ` [MODERATED] " Greg KH
2019-02-25 20:31   ` mark gross
2019-02-26  0:34     ` Andrew Cooper
2019-02-26 18:51       ` mark gross
2019-02-26 19:29     ` Thomas Gleixner
2019-02-22 22:24 ` [patch V4 10/11] Documentation: Move L1TF to separate directory Thomas Gleixner
2019-02-23  8:41   ` [MODERATED] " Greg KH
2019-02-22 22:24 ` [patch V4 11/11] Documentation: Add MDS vulnerability documentation Thomas Gleixner
2019-02-23  9:58   ` [MODERATED] " Greg KH
2019-02-26 20:11     ` Thomas Gleixner
2019-02-25 18:02   ` [MODERATED] " Dave Hansen
2019-02-26 20:10     ` Thomas Gleixner
2019-02-23  0:53 ` [MODERATED] Re: [patch V4 00/11] MDS basics Andrew Cooper
2019-02-23 14:12   ` Peter Zijlstra
2019-02-25 16:38 ` mark gross
2019-02-26 19:58   ` Thomas Gleixner
2019-02-26 16:28 ` [MODERATED] " Tyler Hicks
2019-02-26 19:58   ` Thomas Gleixner
2019-02-26 18:58 ` [MODERATED] " Kanth Ghatraju
2019-02-26 19:59   ` Thomas Gleixner
