* [patch V6 00/14] MDS basics 0
@ 2019-03-01 21:47 Thomas Gleixner
2019-03-01 21:47 ` [patch V6 01/14] MDS basics 1 Thomas Gleixner
` (15 more replies)
0 siblings, 16 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
Changes vs. V5:
- Fix tools/ build (Josh)
- Dropped the AIRMONT_MID change as it needs confirmation from Intel
- Made the consolidated whitelist more readable and correct
- Added the MSBDS only quirk for XEON PHI, made the idle flush
depend on it and updated the sysfs output accordingly.
- Fixed the protection matrix in the admin documentation and clarified
the SMT situation vs. MSBDS only.
- Updated the KVM/VMX changelog.
Delta patch against V5 below.
Available from git:
cvs.ou.linutronix.de:linux/speck/linux WIP.mds
The linux-4.20.y, linux-4.19.y and linux-4.14.y branches are updated as
well and contain the untested backports of the pile for reference.
I'll send git bundles of the pile as well.
Thanks,
tglx
8<---------------------------
diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst
index 73cdc390aece..1de29d28903d 100644
--- a/Documentation/admin-guide/hw-vuln/mds.rst
+++ b/Documentation/admin-guide/hw-vuln/mds.rst
@@ -23,6 +23,10 @@ vulnerability is not present on:
Whether a processor is affected or not can be read out from the MDS
vulnerability file in sysfs. See :ref:`mds_sys_info`.
+Not all processors are affected by all variants of MDS, but the mitigation
+is identical for all of them so the kernel treats them as a single
+vulnerability.
+
Related CVEs
------------
@@ -112,6 +116,7 @@ to the above information:
======================== ============================================
'SMT vulnerable' SMT is enabled
+ 'SMT mitigated' SMT is enabled and mitigated
'SMT disabled' SMT is disabled
'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown
======================== ============================================
@@ -153,8 +158,12 @@ CPU buffer clearing
The mitigation for MDS clears the affected CPU buffers on return to user
space and when entering a guest.
- If SMT is enabled it also clears the buffers on idle entry, but that's not
- a sufficient SMT protection for all MDS variants; it covers solely MSBDS.
+ If SMT is enabled it also clears the buffers on idle entry when the CPU
+ is only affected by MSBDS and not any other MDS variant, because the
+ other variants cannot be protected against cross Hyper-Thread attacks.
+
+ For CPUs which are only affected by MSBDS the user space, guest and idle
+ transition mitigations are sufficient and SMT is not affected.
.. _virt_mechanism:
@@ -168,8 +177,10 @@ Virtualization mitigation
If the L1D flush mitigation is enabled and up to date microcode is
available, the L1D flush mitigation is automatically protecting the
- guest transition. If the L1D flush mitigation is disabled the MDS
- mitigation is disabled as well.
+ guest transition.
+
+ If the L1D flush mitigation is disabled then the MDS mitigation is
+ invoked explicitly when the host MDS mitigation is enabled.
For details on L1TF and virtualization see:
:ref:`Documentation/admin-guide/hw-vuln//l1tf.rst <mitigation_control_kvm>`.
@@ -177,16 +188,18 @@ Virtualization mitigation
- CPU is not affected by L1TF:
CPU buffers are flushed before entering the guest when the host MDS
- protection is enabled.
+ mitigation is enabled.
The resulting MDS protection matrix for the host to guest transition:
============ ===== ============= ============ =================
- L1TF MDS VMX-L1FLUSH Host MDS State
+ L1TF MDS VMX-L1FLUSH Host MDS MDS-State
Don't care No Don't care N/A Not affected
- Yes Yes Disabled Don't care Vulnerable
+ Yes Yes Disabled Off Vulnerable
+
+ Yes Yes Disabled Full Mitigated
Yes Yes Enabled Don't care Mitigated
@@ -196,7 +209,7 @@ Virtualization mitigation
============ ===== ============= ============ =================
This only covers the host to guest transition, i.e. prevents leakage from
- host to guest, but does not protect the guest internally. Guest need to
+ host to guest, but does not protect the guest internally. Guests need to
have their own protections.
.. _xeon_phi:
@@ -210,14 +223,22 @@ XEON PHI specific considerations
for malicious user space. The exposure can be disabled on the kernel
command line with the 'ring3mwait=disable' command line option.
+ XEON PHI is not affected by the other MDS variants and MSBDS is mitigated
+ before the CPU enters an idle state. As XEON PHI is not affected by L1TF
+ either, disabling SMT is not required for full protection.
+
.. _mds_smt_control:
SMT control
^^^^^^^^^^^
- To prevent the SMT issues of MDS it might be necessary to disable SMT
- completely. Disabling SMT can have a significant performance impact, but
- the impact depends on the type of workloads.
+ All MDS variants except MSBDS can be exploited across Hyper-Threads. That
+ means on CPUs which are affected by MFBDS or MLPDS it is necessary to
+ disable SMT for full protection. These are most of the affected CPUs; the
+ exception is XEON PHI, see :ref:`xeon_phi`.
+
+ Disabling SMT can have a significant performance impact, but the impact
+ depends on the type of workloads.
See the relevant chapter in the L1TF mitigation documentation for details:
:ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.
@@ -260,9 +281,7 @@ Mitigation selection guide
2. Virtualization with trusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- The same considerations as above versus trusted user space apply. See
- also: :ref:`Documentation/admin-guide/hw-vuln//l1tf.rst <mitigation_selection>`.
-
+ The same considerations as above versus trusted user space apply.
3. Virtualization with untrusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -270,6 +289,8 @@ Mitigation selection guide
The protection depends on the state of the L1TF mitigations.
See :ref:`virt_mechanism`.
+ If the MDS mitigation is enabled and SMT is disabled, guest to host and
+ guest to guest attacks are prevented.
.. _mds_default_mitigations:
diff --git a/Documentation/x86/mds.rst b/Documentation/x86/mds.rst
index b050623c869c..3d6f943f1afb 100644
--- a/Documentation/x86/mds.rst
+++ b/Documentation/x86/mds.rst
@@ -107,19 +107,19 @@ user space or VM guests.
Kernel internal mitigation modes
--------------------------------
- ======= ===========================================================
- off Mitigation is disabled. Either the CPU is not affected or
- mds=off is supplied on the kernel command line
+ ======= ============================================================
+ off Mitigation is disabled. Either the CPU is not affected or
+ mds=off is supplied on the kernel command line
- full Mitigation is eanbled. CPU is affected and MD_CLEAR is
- advertised in CPUID.
+ full Mitigation is enabled. CPU is affected and MD_CLEAR is
+ advertised in CPUID.
- vmwerv Mitigation is enabled. CPU is affected and MD_CLEAR is not
- advertised in CPUID. That is mainly for virtualization
- scenarios where the host has the updated microcode but the
- hypervisor does not expose MD_CLEAR in CPUID. It's a best
- effort approach without guarantee.
- ======= ===========================================================
+ vmwerv Mitigation is enabled. CPU is affected and MD_CLEAR is not
+ advertised in CPUID. That is mainly for virtualization
+ scenarios where the host has the updated microcode but the
+ hypervisor does not expose MD_CLEAR in CPUID. It's a best
+ effort approach without guarantee.
+ ======= ============================================================
If the CPU is affected and mds=off is not supplied on the kernel command
line then the kernel selects the appropriate mitigation mode depending on
@@ -189,6 +189,13 @@ Mitigation points
When SMT is inactive, i.e. either the CPU does not support it or all
sibling threads are offline CPU buffer clearing is not required.
+ The idle clearing is enabled on CPUs which are only affected by MSBDS
+ and not by any other MDS variant. The other MDS variants cannot be
+ protected against cross Hyper-Thread attacks because the Fill Buffer and
+ the Load Ports are shared. So on CPUs affected by other variants, the
+ idle clearing would be a window dressing exercise and is therefore not
+ activated.
+
The invocation is controlled by the static key mds_idle_clear which is
switched depending on the chosen mitigation mode and the SMT state of
the system.
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ae3f987b24f1..bdcea163850a 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -383,5 +383,6 @@
#define X86_BUG_SPEC_STORE_BYPASS X86_BUG(17) /* CPU is affected by speculative store bypass attack */
#define X86_BUG_L1TF X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
#define X86_BUG_MDS X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
+#define X86_BUG_MSBDS_ONLY X86_BUG(20) /* CPU is only affected by the MSBDS variant of BUG_MDS */
#endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index aea871e69d64..e11654f93e71 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -667,6 +667,15 @@ static void update_indir_branch_cond(void)
/* Update the static key controlling the MDS CPU buffer clear in idle */
static void update_mds_branch_idle(void)
{
+ /*
+ * Enable the idle clearing on CPUs which are affected only by
+ * MSBDS and not any other MDS variant. The other variants cannot
+ * be mitigated when SMT is enabled, so clearing the buffers on
+ * idle would be a window dressing exercise.
+ */
+ if (!boot_cpu_has(X86_BUG_MSBDS_ONLY))
+ return;
+
if (sched_smt_active())
static_branch_enable(&mds_idle_clear);
else
@@ -1174,6 +1183,11 @@ static ssize_t mds_show_state(char *buf)
mds_strings[mds_mitigation]);
}
+ if (boot_cpu_has(X86_BUG_MSBDS_ONLY)) {
+ return sprintf(buf, "%s; SMT %s\n", mds_strings[mds_mitigation],
+ sched_smt_active() ? "mitigated" : "disabled");
+ }
+
return sprintf(buf, "%s; SMT %s\n", mds_strings[mds_mitigation],
sched_smt_active() ? "vulnerable" : "disabled");
}
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 389853338c2f..71d953a2c4db 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -953,38 +953,57 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
#define NO_SSB BIT(2)
#define NO_L1TF BIT(3)
#define NO_MDS BIT(4)
+#define MSBDS_ONLY BIT(5)
+
+#define VULNWL(_vendor, _family, _model, _whitelist) \
+ { X86_VENDOR_##_vendor, _family, _model, X86_FEATURE_ANY, _whitelist }
+
+#define VULNWL_INTEL(model, whitelist) \
+ VULNWL(INTEL, 6, INTEL_FAM6_##model, whitelist)
+
+#define VULNWL_AMD(family, whitelist) \
+ VULNWL(AMD, family, X86_MODEL_ANY, whitelist)
+
+#define VULNWL_HYGON(family, whitelist) \
+ VULNWL(HYGON, family, X86_MODEL_ANY, whitelist)
static const __initconst struct x86_cpu_id cpu_vuln_whitelist[] = {
- { X86_VENDOR_ANY, 4, X86_MODEL_ANY, X86_FEATURE_ANY, NO_SPECULATION },
- { X86_VENDOR_CENTAUR, 5, X86_MODEL_ANY, X86_FEATURE_ANY, NO_SPECULATION },
- { X86_VENDOR_INTEL, 5, X86_MODEL_ANY, X86_FEATURE_ANY, NO_SPECULATION },
- { X86_VENDOR_NSC, 5, X86_MODEL_ANY, X86_FEATURE_ANY, NO_SPECULATION },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SALTWELL, X86_FEATURE_ANY, NO_SPECULATION },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SALTWELL_TABLET, X86_FEATURE_ANY, NO_SPECULATION },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_BONNELL_MID, X86_FEATURE_ANY, NO_SPECULATION },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SALTWELL_MID, X86_FEATURE_ANY, NO_SPECULATION },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_BONNELL, X86_FEATURE_ANY, NO_SPECULATION },
-
- { X86_VENDOR_AMD, X86_FAMILY_ANY, X86_MODEL_ANY, X86_FEATURE_ANY, NO_MELTDOWN | NO_L1TF },
- { X86_VENDOR_HYGON, X86_FAMILY_ANY, X86_MODEL_ANY, X86_FEATURE_ANY, NO_MELTDOWN | NO_L1TF },
-
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT, X86_FEATURE_ANY, NO_SSB | NO_L1TF },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT_X, X86_FEATURE_ANY, NO_SSB | NO_L1TF },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT_MID, X86_FEATURE_ANY, NO_SSB | NO_L1TF },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT, X86_FEATURE_ANY, NO_SSB | NO_L1TF },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT_MID, X86_FEATURE_ANY, NO_SSB | NO_L1TF },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_CORE_YONAH, X86_FEATURE_ANY, NO_SSB | NO_L1TF },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNL, X86_FEATURE_ANY, NO_SSB | NO_L1TF },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNM, X86_FEATURE_ANY, NO_SSB | NO_L1TF },
-
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GOLDMONT, X86_FEATURE_ANY, NO_L1TF | NO_MDS },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GOLDMONT_X, X86_FEATURE_ANY, NO_L1TF | NO_MDS },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GOLDMONT_PLUS, X86_FEATURE_ANY, NO_L1TF | NO_MDS },
-
- { X86_VENDOR_AMD, 0x0f, X86_MODEL_ANY, X86_FEATURE_ANY, NO_SSB },
- { X86_VENDOR_AMD, 0x10, X86_MODEL_ANY, X86_FEATURE_ANY, NO_SSB },
- { X86_VENDOR_AMD, 0x11, X86_MODEL_ANY, X86_FEATURE_ANY, NO_SSB },
- { X86_VENDOR_AMD, 0x12, X86_MODEL_ANY, X86_FEATURE_ANY, NO_SSB },
+ VULNWL(ANY, 4, X86_MODEL_ANY, NO_SPECULATION),
+ VULNWL(CENTAUR, 5, X86_MODEL_ANY, NO_SPECULATION),
+ VULNWL(INTEL, 5, X86_MODEL_ANY, NO_SPECULATION),
+ VULNWL(NSC, 5, X86_MODEL_ANY, NO_SPECULATION),
+
+ /* Intel Family 6 */
+ VULNWL_INTEL(ATOM_SALTWELL, NO_SPECULATION),
+ VULNWL_INTEL(ATOM_SALTWELL_TABLET, NO_SPECULATION),
+ VULNWL_INTEL(ATOM_SALTWELL_MID, NO_SPECULATION),
+ VULNWL_INTEL(ATOM_BONNELL, NO_SPECULATION),
+ VULNWL_INTEL(ATOM_BONNELL_MID, NO_SPECULATION),
+
+ VULNWL_INTEL(ATOM_SILVERMONT, NO_SSB | NO_L1TF),
+ VULNWL_INTEL(ATOM_SILVERMONT_X, NO_SSB | NO_L1TF),
+ VULNWL_INTEL(ATOM_SILVERMONT_MID, NO_SSB | NO_L1TF),
+ VULNWL_INTEL(ATOM_AIRMONT, NO_SSB | NO_L1TF),
+ VULNWL_INTEL(XEON_PHI_KNL, NO_SSB | NO_L1TF | MSBDS_ONLY),
+ VULNWL_INTEL(XEON_PHI_KNM, NO_SSB | NO_L1TF | MSBDS_ONLY),
+
+ VULNWL_INTEL(CORE_YONAH, NO_SSB),
+
+ VULNWL_INTEL(ATOM_AIRMONT_MID, NO_L1TF),
+
+ VULNWL_INTEL(ATOM_GOLDMONT, NO_MDS | NO_L1TF),
+ VULNWL_INTEL(ATOM_GOLDMONT_X, NO_MDS | NO_L1TF),
+ VULNWL_INTEL(ATOM_GOLDMONT_PLUS, NO_MDS | NO_L1TF),
+
+ /* AMD Family 0xf - 0x12 */
+ VULNWL_AMD(0x0f, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS),
+ VULNWL_AMD(0x10, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS),
+ VULNWL_AMD(0x11, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS),
+ VULNWL_AMD(0x12, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS),
+
+ /* FAMILY_ANY must be last, otherwise 0x0f - 0x12 matches won't work */
+ VULNWL_AMD(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS),
+ VULNWL_HYGON(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS),
{}
};
@@ -1015,8 +1034,11 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
if (ia32_cap & ARCH_CAP_IBRS_ALL)
setup_force_cpu_cap(X86_FEATURE_IBRS_ENHANCED);
- if (!cpu_matches(NO_MDS) && !(ia32_cap & ARCH_CAP_MDS_NO))
+ if (!cpu_matches(NO_MDS) && !(ia32_cap & ARCH_CAP_MDS_NO)) {
setup_force_cpu_bug(X86_BUG_MDS);
+ if (cpu_matches(MSBDS_ONLY))
+ setup_force_cpu_bug(X86_BUG_MSBDS_ONLY);
+ }
if (cpu_matches(NO_MELTDOWN))
return;
diff --git a/tools/power/x86/turbostat/Makefile b/tools/power/x86/turbostat/Makefile
index 1598b4fa0b11..045f5f7d68ab 100644
--- a/tools/power/x86/turbostat/Makefile
+++ b/tools/power/x86/turbostat/Makefile
@@ -9,7 +9,7 @@ ifeq ("$(origin O)", "command line")
endif
turbostat : turbostat.c
-override CFLAGS += -Wall
+override CFLAGS += -Wall -I../../../include
override CFLAGS += -DMSRHEADER='"../../../../arch/x86/include/asm/msr-index.h"'
override CFLAGS += -DINTEL_FAMILY_HEADER='"../../../../arch/x86/include/asm/intel-family.h"'
diff --git a/tools/power/x86/x86_energy_perf_policy/Makefile b/tools/power/x86/x86_energy_perf_policy/Makefile
index ae7a0e09b722..1fdeef864e7c 100644
--- a/tools/power/x86/x86_energy_perf_policy/Makefile
+++ b/tools/power/x86/x86_energy_perf_policy/Makefile
@@ -9,7 +9,7 @@ ifeq ("$(origin O)", "command line")
endif
x86_energy_perf_policy : x86_energy_perf_policy.c
-override CFLAGS += -Wall
+override CFLAGS += -Wall -I../../../include
override CFLAGS += -DMSRHEADER='"../../../../arch/x86/include/asm/msr-index.h"'
%: %.c
* [patch V6 01/14] MDS basics 1
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-02 0:06 ` [MODERATED] " Frederic Weisbecker
2019-03-01 21:47 ` [patch V6 02/14] MDS basics 2 Thomas Gleixner
` (14 subsequent siblings)
15 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
Subject: [patch V6 01/14] x86/msr-index: Cleanup bit defines
From: Thomas Gleixner <tglx@linutronix.de>
Greg pointed out that speculation related bit defines are using (1 << N)
format instead of BIT(N). Aside of that (1 << N) is wrong as it should use
1UL at least.
Clean it up.
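As an illustration of the 1UL point: a bit 31 define built from a plain int
shift sign-extends when widened to a 64-bit variable. A minimal user-space
sketch (not part of the patch; BIT() simplified from <linux/bits.h>):

#include <stdio.h>
#include <stdint.h>

#define BIT(n)	(1UL << (n))	/* simplified <linux/bits.h> form */

int main(void)
{
	/*
	 * 1 << 31 overflows the signed int promotion; typical compilers
	 * yield INT_MIN, which sign-extends when widened to 64 bit.
	 */
	uint64_t bad  = 1 << 31;
	uint64_t good = BIT(31);

	printf("1 << 31: %#llx\n", (unsigned long long)bad);	/* 0xffffffff80000000 */
	printf("BIT(31): %#llx\n", (unsigned long long)good);	/* 0x80000000 */
	return 0;
}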
[ Josh Poimboeuf: Fix tools build ]
Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
---
V5 -> V6: Fix tools build (Josh)
---
arch/x86/include/asm/msr-index.h | 34 ++++++++++++------------
tools/power/x86/turbostat/Makefile | 2 -
tools/power/x86/x86_energy_perf_policy/Makefile | 2 -
3 files changed, 20 insertions(+), 18 deletions(-)
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -2,6 +2,8 @@
#ifndef _ASM_X86_MSR_INDEX_H
#define _ASM_X86_MSR_INDEX_H
+#include <linux/bits.h>
+
/*
* CPU model specific register (MSR) numbers.
*
@@ -40,14 +42,14 @@
/* Intel MSRs. Some also available on other CPUs */
#define MSR_IA32_SPEC_CTRL 0x00000048 /* Speculation Control */
-#define SPEC_CTRL_IBRS (1 << 0) /* Indirect Branch Restricted Speculation */
+#define SPEC_CTRL_IBRS BIT(0) /* Indirect Branch Restricted Speculation */
#define SPEC_CTRL_STIBP_SHIFT 1 /* Single Thread Indirect Branch Predictor (STIBP) bit */
-#define SPEC_CTRL_STIBP (1 << SPEC_CTRL_STIBP_SHIFT) /* STIBP mask */
+#define SPEC_CTRL_STIBP BIT(SPEC_CTRL_STIBP_SHIFT) /* STIBP mask */
#define SPEC_CTRL_SSBD_SHIFT 2 /* Speculative Store Bypass Disable bit */
-#define SPEC_CTRL_SSBD (1 << SPEC_CTRL_SSBD_SHIFT) /* Speculative Store Bypass Disable */
+#define SPEC_CTRL_SSBD BIT(SPEC_CTRL_SSBD_SHIFT) /* Speculative Store Bypass Disable */
#define MSR_IA32_PRED_CMD 0x00000049 /* Prediction Command */
-#define PRED_CMD_IBPB (1 << 0) /* Indirect Branch Prediction Barrier */
+#define PRED_CMD_IBPB BIT(0) /* Indirect Branch Prediction Barrier */
#define MSR_PPIN_CTL 0x0000004e
#define MSR_PPIN 0x0000004f
@@ -69,20 +71,20 @@
#define MSR_MTRRcap 0x000000fe
#define MSR_IA32_ARCH_CAPABILITIES 0x0000010a
-#define ARCH_CAP_RDCL_NO (1 << 0) /* Not susceptible to Meltdown */
-#define ARCH_CAP_IBRS_ALL (1 << 1) /* Enhanced IBRS support */
-#define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH (1 << 3) /* Skip L1D flush on vmentry */
-#define ARCH_CAP_SSB_NO (1 << 4) /*
- * Not susceptible to Speculative Store Bypass
- * attack, so no Speculative Store Bypass
- * control required.
- */
+#define ARCH_CAP_RDCL_NO BIT(0) /* Not susceptible to Meltdown */
+#define ARCH_CAP_IBRS_ALL BIT(1) /* Enhanced IBRS support */
+#define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH BIT(3) /* Skip L1D flush on vmentry */
+#define ARCH_CAP_SSB_NO BIT(4) /*
+ * Not susceptible to Speculative Store Bypass
+ * attack, so no Speculative Store Bypass
+ * control required.
+ */
#define MSR_IA32_FLUSH_CMD 0x0000010b
-#define L1D_FLUSH (1 << 0) /*
- * Writeback and invalidate the
- * L1 data cache.
- */
+#define L1D_FLUSH BIT(0) /*
+ * Writeback and invalidate the
+ * L1 data cache.
+ */
#define MSR_IA32_BBL_CR_CTL 0x00000119
#define MSR_IA32_BBL_CR_CTL3 0x0000011e
--- a/tools/power/x86/turbostat/Makefile
+++ b/tools/power/x86/turbostat/Makefile
@@ -9,7 +9,7 @@ ifeq ("$(origin O)", "command line")
endif
turbostat : turbostat.c
-override CFLAGS += -Wall
+override CFLAGS += -Wall -I../../../include
override CFLAGS += -DMSRHEADER='"../../../../arch/x86/include/asm/msr-index.h"'
override CFLAGS += -DINTEL_FAMILY_HEADER='"../../../../arch/x86/include/asm/intel-family.h"'
--- a/tools/power/x86/x86_energy_perf_policy/Makefile
+++ b/tools/power/x86/x86_energy_perf_policy/Makefile
@@ -9,7 +9,7 @@ ifeq ("$(origin O)", "command line")
endif
x86_energy_perf_policy : x86_energy_perf_policy.c
-override CFLAGS += -Wall
+override CFLAGS += -Wall -I../../../include
override CFLAGS += -DMSRHEADER='"../../../../arch/x86/include/asm/msr-index.h"'
%: %.c
* [MODERATED] Re: [patch V6 01/14] MDS basics 1
2019-03-01 21:47 ` [patch V6 01/14] MDS basics 1 Thomas Gleixner
@ 2019-03-02 0:06 ` Frederic Weisbecker
0 siblings, 0 replies; 89+ messages in thread
From: Frederic Weisbecker @ 2019-03-02 0:06 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:39PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V6 01/14] x86/msr-index: Cleanup bit defines
> From: Thomas Gleixner <tglx@linutronix.de>
>
> Greg pointed out that speculation related bit defines are using (1 << N)
> format instead of BIT(N). Aside from that, (1 << N) is wrong as it should use
> at least 1UL.
>
> Clean it up.
>
> [ Josh Poimboeuf: Fix tools build ]
>
> Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Reviewed-by: Borislav Petkov <bp@suse.de>
> ---
> V5 -> V6: Fix tools build (Josh)
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
* [patch V6 02/14] MDS basics 2
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
2019-03-01 21:47 ` [patch V6 01/14] MDS basics 1 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-02 0:34 ` [MODERATED] " Frederic Weisbecker
` (2 more replies)
2019-03-01 21:47 ` [patch V6 03/14] MDS basics 3 Thomas Gleixner
` (13 subsequent siblings)
15 siblings, 3 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
Subject: [patch V6 02/14] x86/speculation: Consolidate CPU whitelists
From: Thomas Gleixner <tglx@linutronix.de>
The CPU vulnerability whitelists have some overlap and there are more
whitelists coming along.
Use the driver_data field in the x86_cpu_id struct to denote the
whitelisted vulnerabilities and combine all whitelists into one.
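For readers without the tree at hand, the pattern boils down to one table
whose driver_data field carries flag bits, plus one lookup helper. A rough,
hypothetical user-space rendering (vendor numbers, flag names and entries
here are made up for illustration, not the kernel's):

#include <stdbool.h>
#include <stdio.h>

#define NO_MELTDOWN	(1UL << 0)
#define NO_L1TF		(1UL << 1)

/* Stand-in for struct x86_cpu_id with its driver_data bitmask */
struct vuln_id {
	int vendor, family;
	unsigned long driver_data;
};

static const struct vuln_id whitelist[] = {
	{ 2, 0x0f, NO_MELTDOWN | NO_L1TF },	/* hypothetical entry */
	{ 0, 0, 0 }				/* terminator */
};

/* Analogue of the cpu_matches() helper introduced below */
static bool cpu_matches(int vendor, int family, unsigned long which)
{
	const struct vuln_id *m;

	for (m = whitelist; m->driver_data; m++) {
		if (m->vendor == vendor && m->family == family)
			return m->driver_data & which;
	}
	return false;
}

int main(void)
{
	printf("%d\n", cpu_matches(2, 0x0f, NO_L1TF));	/* 1: whitelisted */
	printf("%d\n", cpu_matches(2, 0x17, NO_L1TF));	/* 0: not listed */
	return 0;
}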
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5 --> V6: Use a helper macro to make it more readable
Fix the AMD family 0xf-0x12 vs. ANY ordering
---
arch/x86/kernel/cpu/common.c | 110 +++++++++++++++++++++++--------------------
1 file changed, 60 insertions(+), 50 deletions(-)
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -948,61 +948,72 @@ static void identify_cpu_without_cpuid(s
#endif
}
-static const __initconst struct x86_cpu_id cpu_no_speculation[] = {
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SALTWELL, X86_FEATURE_ANY },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SALTWELL_TABLET, X86_FEATURE_ANY },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_BONNELL_MID, X86_FEATURE_ANY },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SALTWELL_MID, X86_FEATURE_ANY },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_BONNELL, X86_FEATURE_ANY },
- { X86_VENDOR_CENTAUR, 5 },
- { X86_VENDOR_INTEL, 5 },
- { X86_VENDOR_NSC, 5 },
- { X86_VENDOR_ANY, 4 },
+#define NO_SPECULATION BIT(0)
+#define NO_MELTDOWN BIT(1)
+#define NO_SSB BIT(2)
+#define NO_L1TF BIT(3)
+
+#define VULNWL(_vendor, _family, _model, _whitelist) \
+ { X86_VENDOR_##_vendor, _family, _model, X86_FEATURE_ANY, _whitelist }
+
+#define VULNWL_INTEL(model, whitelist) \
+ VULNWL(INTEL, 6, INTEL_FAM6_##model, whitelist)
+
+#define VULNWL_AMD(family, whitelist) \
+ VULNWL(AMD, family, X86_MODEL_ANY, whitelist)
+
+#define VULNWL_HYGON(family, whitelist) \
+ VULNWL(HYGON, family, X86_MODEL_ANY, whitelist)
+
+static const __initconst struct x86_cpu_id cpu_vuln_whitelist[] = {
+ VULNWL(ANY, 4, X86_MODEL_ANY, NO_SPECULATION),
+ VULNWL(CENTAUR, 5, X86_MODEL_ANY, NO_SPECULATION),
+ VULNWL(INTEL, 5, X86_MODEL_ANY, NO_SPECULATION),
+ VULNWL(NSC, 5, X86_MODEL_ANY, NO_SPECULATION),
+
+ VULNWL_INTEL(ATOM_SALTWELL, NO_SPECULATION),
+ VULNWL_INTEL(ATOM_SALTWELL_TABLET, NO_SPECULATION),
+ VULNWL_INTEL(ATOM_SALTWELL_MID, NO_SPECULATION),
+ VULNWL_INTEL(ATOM_BONNELL, NO_SPECULATION),
+ VULNWL_INTEL(ATOM_BONNELL_MID, NO_SPECULATION),
+
+ VULNWL_INTEL(ATOM_SILVERMONT, NO_SSB | NO_L1TF),
+ VULNWL_INTEL(ATOM_SILVERMONT_X, NO_SSB | NO_L1TF),
+ VULNWL_INTEL(ATOM_SILVERMONT_MID, NO_SSB | NO_L1TF),
+ VULNWL_INTEL(ATOM_AIRMONT, NO_SSB | NO_L1TF),
+ VULNWL_INTEL(XEON_PHI_KNL, NO_SSB | NO_L1TF),
+ VULNWL_INTEL(XEON_PHI_KNM, NO_SSB | NO_L1TF),
+
+ VULNWL_INTEL(CORE_YONAH, NO_SSB),
+
+ VULNWL_INTEL(ATOM_AIRMONT_MID, NO_L1TF),
+ VULNWL_INTEL(ATOM_GOLDMONT, NO_L1TF),
+ VULNWL_INTEL(ATOM_GOLDMONT_X, NO_L1TF),
+ VULNWL_INTEL(ATOM_GOLDMONT_PLUS, NO_L1TF),
+
+ VULNWL_AMD(0x0f, NO_MELTDOWN | NO_SSB | NO_L1TF),
+ VULNWL_AMD(0x10, NO_MELTDOWN | NO_SSB | NO_L1TF),
+ VULNWL_AMD(0x11, NO_MELTDOWN | NO_SSB | NO_L1TF),
+ VULNWL_AMD(0x12, NO_MELTDOWN | NO_SSB | NO_L1TF),
+
+ /* FAMILY_ANY must be last, otherwise 0x0f - 0x12 matches won't work */
+ VULNWL_AMD(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF),
+ VULNWL_HYGON(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF),
{}
};
-static const __initconst struct x86_cpu_id cpu_no_meltdown[] = {
- { X86_VENDOR_AMD },
- { X86_VENDOR_HYGON },
- {}
-};
-
-/* Only list CPUs which speculate but are non susceptible to SSB */
-static const __initconst struct x86_cpu_id cpu_no_spec_store_bypass[] = {
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT_X },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT_MID },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_CORE_YONAH },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNL },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNM },
- { X86_VENDOR_AMD, 0x12, },
- { X86_VENDOR_AMD, 0x11, },
- { X86_VENDOR_AMD, 0x10, },
- { X86_VENDOR_AMD, 0xf, },
- {}
-};
+static bool __init cpu_matches(unsigned long which)
+{
+ const struct x86_cpu_id *m = x86_match_cpu(cpu_vuln_whitelist);
-static const __initconst struct x86_cpu_id cpu_no_l1tf[] = {
- /* in addition to cpu_no_speculation */
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT_X },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT_MID },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT_MID },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GOLDMONT },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GOLDMONT_X },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GOLDMONT_PLUS },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNL },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNM },
- {}
-};
+ return m && !!(m->driver_data & which);
+}
static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
{
u64 ia32_cap = 0;
- if (x86_match_cpu(cpu_no_speculation))
+ if (cpu_matches(NO_SPECULATION))
return;
setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
@@ -1011,15 +1022,14 @@ static void __init cpu_set_bug_bits(stru
if (cpu_has(c, X86_FEATURE_ARCH_CAPABILITIES))
rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap);
- if (!x86_match_cpu(cpu_no_spec_store_bypass) &&
- !(ia32_cap & ARCH_CAP_SSB_NO) &&
+ if (!cpu_matches(NO_SSB) && !(ia32_cap & ARCH_CAP_SSB_NO) &&
!cpu_has(c, X86_FEATURE_AMD_SSB_NO))
setup_force_cpu_bug(X86_BUG_SPEC_STORE_BYPASS);
if (ia32_cap & ARCH_CAP_IBRS_ALL)
setup_force_cpu_cap(X86_FEATURE_IBRS_ENHANCED);
- if (x86_match_cpu(cpu_no_meltdown))
+ if (cpu_matches(NO_MELTDOWN))
return;
/* Rogue Data Cache Load? No! */
@@ -1028,7 +1038,7 @@ static void __init cpu_set_bug_bits(stru
setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
- if (x86_match_cpu(cpu_no_l1tf))
+ if (cpu_matches(NO_L1TF))
return;
setup_force_cpu_bug(X86_BUG_L1TF);
* [MODERATED] Re: [patch V6 02/14] MDS basics 2
2019-03-01 21:47 ` [patch V6 02/14] MDS basics 2 Thomas Gleixner
@ 2019-03-02 0:34 ` Frederic Weisbecker
2019-03-02 8:34 ` Greg KH
2019-03-05 17:54 ` Borislav Petkov
2 siblings, 0 replies; 89+ messages in thread
From: Frederic Weisbecker @ 2019-03-02 0:34 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:40PM +0100, speck for Thomas Gleixner wrote:
> static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
> {
> u64 ia32_cap = 0;
>
> - if (x86_match_cpu(cpu_no_speculation))
> + if (cpu_matches(NO_SPECULATION))
> return;
>
> setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
> @@ -1011,15 +1022,14 @@ static void __init cpu_set_bug_bits(stru
> if (cpu_has(c, X86_FEATURE_ARCH_CAPABILITIES))
> rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap);
>
> - if (!x86_match_cpu(cpu_no_spec_store_bypass) &&
> - !(ia32_cap & ARCH_CAP_SSB_NO) &&
> + if (!cpu_matches(NO_SSB) && !(ia32_cap & ARCH_CAP_SSB_NO) &&
Much clearer and well unified.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
* [MODERATED] Re: [patch V6 02/14] MDS basics 2
2019-03-01 21:47 ` [patch V6 02/14] MDS basics 2 Thomas Gleixner
2019-03-02 0:34 ` [MODERATED] " Frederic Weisbecker
@ 2019-03-02 8:34 ` Greg KH
2019-03-05 17:54 ` Borislav Petkov
2 siblings, 0 replies; 89+ messages in thread
From: Greg KH @ 2019-03-02 8:34 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:40PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V6 02/14] x86/speculation: Consolidate CPU whitelists
> From: Thomas Gleixner <tglx@linutronix.de>
>
> The CPU vulnerability whitelists have some overlap and there are more
> whitelists coming along.
>
> Use the driver_data field in the x86_cpu_id struct to denote the
> whitelisted vulnerabilities and combine all whitelists into one.
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* [MODERATED] Re: [patch V6 02/14] MDS basics 2
2019-03-01 21:47 ` [patch V6 02/14] MDS basics 2 Thomas Gleixner
2019-03-02 0:34 ` [MODERATED] " Frederic Weisbecker
2019-03-02 8:34 ` Greg KH
@ 2019-03-05 17:54 ` Borislav Petkov
2 siblings, 0 replies; 89+ messages in thread
From: Borislav Petkov @ 2019-03-05 17:54 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:40PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V6 02/14] x86/speculation: Consolidate CPU whitelists
> From: Thomas Gleixner <tglx@linutronix.de>
>
> The CPU vulnerability whitelists have some overlap and there are more
> whitelists coming along.
>
> Use the driver_data field in the x86_cpu_id struct to denote the
> whitelisted vulnerabilities and combine all whitelists into one.
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>
> V5 --> V6: Use a helper macro to make it more readable
> Fix the AMD family 0xf-0x12 vs. ANY ordering
> ---
> arch/x86/kernel/cpu/common.c | 110 +++++++++++++++++++++++--------------------
> 1 file changed, 60 insertions(+), 50 deletions(-)
Yap, nice and clean.
Reviewed-by: Borislav Petkov <bp@suse.de>
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
* [patch V6 03/14] MDS basics 3
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
2019-03-01 21:47 ` [patch V6 01/14] MDS basics 1 Thomas Gleixner
2019-03-01 21:47 ` [patch V6 02/14] MDS basics 2 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-02 1:12 ` [MODERATED] " Frederic Weisbecker
2019-03-01 21:47 ` [patch V6 04/14] MDS basics 4 Thomas Gleixner
` (12 subsequent siblings)
15 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
Subject: [patch V6 03/14] x86/speculation/mds: Add basic bug infrastructure for MDS
From: Andi Kleen <ak@linux.intel.com>
Microarchitectural Data Sampling (MDS) is a class of side channel attacks
on internal buffers in Intel CPUs. The variants are:
- Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
- Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
- Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
dependent load (store-to-load forwarding) as an optimization. The forward
can also happen to a faulting or assisting load operation for a different
memory address, which can be exploited under certain conditions. Store
buffers are partitioned between Hyper-Threads so cross thread forwarding is
not possible. But if a thread enters or exits a sleep state the store
buffer is repartitioned which can expose data from one thread to the other.
MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
L1 miss situations and to hold data which is returned or sent in response
to a memory or I/O operation. Fill buffers can forward data to a load
operation and also write data to the cache. When the fill buffer is
deallocated it can retain the stale data of the preceding operations which
can then be forwarded to a faulting or assisting load operation, which can
be exploited under certain conditions. Fill buffers are shared between
Hyper-Threads so cross thread leakage is possible.
MLPDS leaks Load Port Data. Load ports are used to perform load operations
from memory or I/O. The received data is then forwarded to the register
file or a subsequent operation. In some implementations the Load Port can
contain stale data from a previous operation which can be forwarded to
faulting or assisting loads under certain conditions, which again can be
exploited eventually. Load ports are shared between Hyper-Threads so cross
thread leakage is possible.
All variants have the same mitigation for single CPU thread case (SMT off),
so the kernel can treat them as one MDS issue.
Add the basic infrastructure to detect if the current CPU is affected by
MDS.
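The new ARCH_CAP_MDS_NO bit lives in MSR_IA32_ARCH_CAPABILITIES (0x10a).
A hedged user-space sketch of peeking at it through the msr driver
(requires root and a loaded msr module; the read simply fails on CPUs
without the ARCH_CAPABILITIES MSR):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_ARCH_CAPABILITIES	0x10a
#define ARCH_CAP_MDS_NO			(1ULL << 5)

int main(void)
{
	uint64_t cap = 0;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	/* The msr driver maps the MSR number to the file offset */
	if (fd < 0 || pread(fd, &cap, sizeof(cap), MSR_IA32_ARCH_CAPABILITIES) != sizeof(cap))
		return 1;

	printf("MDS_NO is %s\n", (cap & ARCH_CAP_MDS_NO) ? "set" : "clear");
	close(fd);
	return 0;
}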
[ tglx: Rewrote changelog ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
V5: Adopt to the consolidated quirk table
V3: Addressed Borislav's review comments
---
arch/x86/include/asm/cpufeatures.h | 2 ++
arch/x86/include/asm/msr-index.h | 5 +++++
arch/x86/kernel/cpu/common.c | 27 +++++++++++++++++----------
3 files changed, 24 insertions(+), 10 deletions(-)
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -344,6 +344,7 @@
/* Intel-defined CPU features, CPUID level 0x00000007:0 (EDX), word 18 */
#define X86_FEATURE_AVX512_4VNNIW (18*32+ 2) /* AVX-512 Neural Network Instructions */
#define X86_FEATURE_AVX512_4FMAPS (18*32+ 3) /* AVX-512 Multiply Accumulation Single precision */
+#define X86_FEATURE_MD_CLEAR (18*32+10) /* VERW clears CPU buffers */
#define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
#define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
#define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
@@ -381,5 +382,6 @@
#define X86_BUG_SPECTRE_V2 X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
#define X86_BUG_SPEC_STORE_BYPASS X86_BUG(17) /* CPU is affected by speculative store bypass attack */
#define X86_BUG_L1TF X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
+#define X86_BUG_MDS X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
#endif /* _ASM_X86_CPUFEATURES_H */
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -79,6 +79,11 @@
* attack, so no Speculative Store Bypass
* control required.
*/
+#define ARCH_CAP_MDS_NO BIT(5) /*
+ * Not susceptible to
+ * Microarchitectural Data
+ * Sampling (MDS) vulnerabilities.
+ */
#define MSR_IA32_FLUSH_CMD 0x0000010b
#define L1D_FLUSH BIT(0) /*
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -952,6 +952,7 @@ static void identify_cpu_without_cpuid(s
#define NO_MELTDOWN BIT(1)
#define NO_SSB BIT(2)
#define NO_L1TF BIT(3)
+#define NO_MDS BIT(4)
#define VULNWL(_vendor, _family, _model, _whitelist) \
{ X86_VENDOR_##_vendor, _family, _model, X86_FEATURE_ANY, _whitelist }
@@ -971,6 +972,7 @@ static const __initconst struct x86_cpu_
VULNWL(INTEL, 5, X86_MODEL_ANY, NO_SPECULATION),
VULNWL(NSC, 5, X86_MODEL_ANY, NO_SPECULATION),
+ /* Intel Family 6 */
VULNWL_INTEL(ATOM_SALTWELL, NO_SPECULATION),
VULNWL_INTEL(ATOM_SALTWELL_TABLET, NO_SPECULATION),
VULNWL_INTEL(ATOM_SALTWELL_MID, NO_SPECULATION),
@@ -987,18 +989,20 @@ static const __initconst struct x86_cpu_
VULNWL_INTEL(CORE_YONAH, NO_SSB),
VULNWL_INTEL(ATOM_AIRMONT_MID, NO_L1TF),
- VULNWL_INTEL(ATOM_GOLDMONT, NO_L1TF),
- VULNWL_INTEL(ATOM_GOLDMONT_X, NO_L1TF),
- VULNWL_INTEL(ATOM_GOLDMONT_PLUS, NO_L1TF),
-
- VULNWL_AMD(0x0f, NO_MELTDOWN | NO_SSB | NO_L1TF),
- VULNWL_AMD(0x10, NO_MELTDOWN | NO_SSB | NO_L1TF),
- VULNWL_AMD(0x11, NO_MELTDOWN | NO_SSB | NO_L1TF),
- VULNWL_AMD(0x12, NO_MELTDOWN | NO_SSB | NO_L1TF),
+
+ VULNWL_INTEL(ATOM_GOLDMONT, NO_MDS | NO_L1TF),
+ VULNWL_INTEL(ATOM_GOLDMONT_X, NO_MDS | NO_L1TF),
+ VULNWL_INTEL(ATOM_GOLDMONT_PLUS, NO_MDS | NO_L1TF),
+
+ /* AMD Family 0xf - 0x12 */
+ VULNWL_AMD(0x0f, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS),
+ VULNWL_AMD(0x10, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS),
+ VULNWL_AMD(0x11, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS),
+ VULNWL_AMD(0x12, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS),
/* FAMILY_ANY must be last, otherwise 0x0f - 0x12 matches won't work */
- VULNWL_AMD(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF),
- VULNWL_HYGON(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF),
+ VULNWL_AMD(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS),
+ VULNWL_HYGON(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS),
{}
};
@@ -1029,6 +1033,9 @@ static void __init cpu_set_bug_bits(stru
if (ia32_cap & ARCH_CAP_IBRS_ALL)
setup_force_cpu_cap(X86_FEATURE_IBRS_ENHANCED);
+ if (!cpu_matches(NO_MDS) && !(ia32_cap & ARCH_CAP_MDS_NO))
+ setup_force_cpu_bug(X86_BUG_MDS);
+
if (cpu_matches(NO_MELTDOWN))
return;
* [MODERATED] Re: [patch V6 03/14] MDS basics 3
2019-03-01 21:47 ` [patch V6 03/14] MDS basics 3 Thomas Gleixner
@ 2019-03-02 1:12 ` Frederic Weisbecker
0 siblings, 0 replies; 89+ messages in thread
From: Frederic Weisbecker @ 2019-03-02 1:12 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:41PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V6 03/14] x86/speculation/mds: Add basic bug infrastructure for MDS
> + /* AMD Family 0xf - 0x12 */
> + VULNWL_AMD(0x0f, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS),
> + VULNWL_AMD(0x10, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS),
> + VULNWL_AMD(0x11, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS),
> + VULNWL_AMD(0x12, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS),
Lucky guys :-)
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
* [patch V6 04/14] MDS basics 4
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (2 preceding siblings ...)
2019-03-01 21:47 ` [patch V6 03/14] MDS basics 3 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-02 1:28 ` [MODERATED] " Frederic Weisbecker
` (2 more replies)
2019-03-01 21:47 ` [patch V6 05/14] MDS basics 5 Thomas Gleixner
` (11 subsequent siblings)
15 siblings, 3 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
Subject: [patch V6 04/14] x86/speculation/mds: Add BUG_MSBDS_ONLY
From: Thomas Gleixner <tglx@linutronix.de>
This bug bit is set on CPUs which are only affected by Microarchitectural
Store Buffer Data Sampling (MSBDS) and not by any other MDS variant.
This is important because the Store Buffers are partitioned between
Hyper-Threads so cross thread forwarding is not possible. But if a thread
enters or exits a sleep state the store buffer is repartitioned which can
expose data from one thread to the other. This transition can be mitigated.
That means that for CPUs which are only affected by MSBDS, SMT can be
enabled if the CPU is not affected by other SMT sensitive vulnerabilities,
e.g. L1TF. The XEON PHI variants fall into that category.
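The distinction becomes user-visible through the MDS sysfs file once the
sysfs reporting later in this series is applied; a trivial sketch of
reading it (the exact mitigation string shown is illustrative):

#include <stdio.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/mds", "r");

	/* e.g. "Mitigation: Clear CPU buffers; SMT mitigated" on an
	 * MSBDS-only part with SMT enabled */
	if (f && fgets(line, sizeof(line), f))
		fputs(line, stdout);
	if (f)
		fclose(f);
	return 0;
}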
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/common.c | 10 +++++++---
2 files changed, 8 insertions(+), 3 deletions(-)
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -383,5 +383,6 @@
#define X86_BUG_SPEC_STORE_BYPASS X86_BUG(17) /* CPU is affected by speculative store bypass attack */
#define X86_BUG_L1TF X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
#define X86_BUG_MDS X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
+#define X86_BUG_MSBDS_ONLY X86_BUG(20) /* CPU is only affected by the MSBDS variant of BUG_MDS */
#endif /* _ASM_X86_CPUFEATURES_H */
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -953,6 +953,7 @@ static void identify_cpu_without_cpuid(s
#define NO_SSB BIT(2)
#define NO_L1TF BIT(3)
#define NO_MDS BIT(4)
+#define MSBDS_ONLY BIT(5)
#define VULNWL(_vendor, _family, _model, _whitelist) \
{ X86_VENDOR_##_vendor, _family, _model, X86_FEATURE_ANY, _whitelist }
@@ -983,8 +984,8 @@ static const __initconst struct x86_cpu_
VULNWL_INTEL(ATOM_SILVERMONT_X, NO_SSB | NO_L1TF),
VULNWL_INTEL(ATOM_SILVERMONT_MID, NO_SSB | NO_L1TF),
VULNWL_INTEL(ATOM_AIRMONT, NO_SSB | NO_L1TF),
- VULNWL_INTEL(XEON_PHI_KNL, NO_SSB | NO_L1TF),
- VULNWL_INTEL(XEON_PHI_KNM, NO_SSB | NO_L1TF),
+ VULNWL_INTEL(XEON_PHI_KNL, NO_SSB | NO_L1TF | MSBDS_ONLY),
+ VULNWL_INTEL(XEON_PHI_KNM, NO_SSB | NO_L1TF | MSBDS_ONLY),
VULNWL_INTEL(CORE_YONAH, NO_SSB),
@@ -1033,8 +1034,11 @@ static void __init cpu_set_bug_bits(stru
if (ia32_cap & ARCH_CAP_IBRS_ALL)
setup_force_cpu_cap(X86_FEATURE_IBRS_ENHANCED);
- if (!cpu_matches(NO_MDS) && !(ia32_cap & ARCH_CAP_MDS_NO))
+ if (!cpu_matches(NO_MDS) && !(ia32_cap & ARCH_CAP_MDS_NO)) {
setup_force_cpu_bug(X86_BUG_MDS);
+ if (cpu_matches(MSBDS_ONLY))
+ setup_force_cpu_bug(X86_BUG_MSBDS_ONLY);
+ }
if (cpu_matches(NO_MELTDOWN))
return;
* [MODERATED] Re: [patch V6 04/14] MDS basics 4
2019-03-01 21:47 ` [patch V6 04/14] MDS basics 4 Thomas Gleixner
@ 2019-03-02 1:28 ` Frederic Weisbecker
2019-03-05 14:52 ` Thomas Gleixner
2019-03-06 20:00 ` [MODERATED] " Andrew Cooper
2019-03-07 23:56 ` [MODERATED] " Andi Kleen
2 siblings, 1 reply; 89+ messages in thread
From: Frederic Weisbecker @ 2019-03-02 1:28 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:42PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V6 04/14] x86/speculation/mds: Add BUG_MSBDS_ONLY
> From: Thomas Gleixner <tglx@linutronix.de>
>
> This bug bit is set on CPUs which are only affected by Microarchitectural
> Store Buffer Data Sampling (MSBDS) and not by any other MDS variant.
>
> This is important because the Store Buffers are partitioned between
> Hyper-Threads so cross thread forwarding is not possible. But if a thread
> enters or exits a sleep state the store buffer is repartitioned which can
> expose data from one thread to the other. This transition can be mitigated.
>
> That means that for CPUs which are only affected by MSBDS SMT can be
> enabled, if the CPU is not affected by other SMT sensitive vulnerabilities,
> e.g. L1TF. The XEON PHI variants fall into that category.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/kernel/cpu/common.c | 10 +++++++---
> 2 files changed, 8 insertions(+), 3 deletions(-)
>
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -383,5 +383,6 @@
> #define X86_BUG_SPEC_STORE_BYPASS X86_BUG(17) /* CPU is affected by speculative store bypass attack */
> #define X86_BUG_L1TF X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
> #define X86_BUG_MDS X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
> +#define X86_BUG_MSBDS_ONLY X86_BUG(20) /* CPU is only affected by the MSBDS variant of BUG_MDS */
>
> #endif /* _ASM_X86_CPUFEATURES_H */
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -953,6 +953,7 @@ static void identify_cpu_without_cpuid(s
> #define NO_SSB BIT(2)
> #define NO_L1TF BIT(3)
> #define NO_MDS BIT(4)
> +#define MSBDS_ONLY BIT(5)
>
> #define VULNWL(_vendor, _family, _model, _whitelist) \
> { X86_VENDOR_##_vendor, _family, _model, X86_FEATURE_ANY, _whitelist }
> @@ -983,8 +984,8 @@ static const __initconst struct x86_cpu_
> VULNWL_INTEL(ATOM_SILVERMONT_X, NO_SSB | NO_L1TF),
> VULNWL_INTEL(ATOM_SILVERMONT_MID, NO_SSB | NO_L1TF),
> VULNWL_INTEL(ATOM_AIRMONT, NO_SSB | NO_L1TF),
> - VULNWL_INTEL(XEON_PHI_KNL, NO_SSB | NO_L1TF),
> - VULNWL_INTEL(XEON_PHI_KNM, NO_SSB | NO_L1TF),
> + VULNWL_INTEL(XEON_PHI_KNL, NO_SSB | NO_L1TF | MSBDS_ONLY),
> + VULNWL_INTEL(XEON_PHI_KNM, NO_SSB | NO_L1TF | MSBDS_ONLY),
>
> VULNWL_INTEL(CORE_YONAH, NO_SSB),
>
> @@ -1033,8 +1034,11 @@ static void __init cpu_set_bug_bits(stru
> if (ia32_cap & ARCH_CAP_IBRS_ALL)
> setup_force_cpu_cap(X86_FEATURE_IBRS_ENHANCED);
>
> - if (!cpu_matches(NO_MDS) && !(ia32_cap & ARCH_CAP_MDS_NO))
> + if (!cpu_matches(NO_MDS) && !(ia32_cap & ARCH_CAP_MDS_NO)) {
> setup_force_cpu_bug(X86_BUG_MDS);
> + if (cpu_matches(MSBDS_ONLY))
> + setup_force_cpu_bug(X86_BUG_MSBDS_ONLY);
> + }
>
> if (cpu_matches(NO_MELTDOWN))
> return;
>
It looks weird to have it as a separate bug flag and not as a subset of full
MDS such as:
#define NO_IDLE_SHARED_MDS BIT(4)
#define NO_SHARED_MDS BIT(5)
#define NO_MDS (NO_IDLE_SHARED_MDS | NO_SHARED_MDS)
Now that would probably make sense only if the mitigation of full MDS required
to also imply a VERW before entering idle (that's the mitigation of MSBDS_ONLY, right?).
Turning off SMT removes the need to do that so the layout seem to make sense as is.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
* Re: [patch V6 04/14] MDS basics 4
2019-03-02 1:28 ` [MODERATED] " Frederic Weisbecker
@ 2019-03-05 14:52 ` Thomas Gleixner
0 siblings, 0 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-05 14:52 UTC (permalink / raw)
To: speck
On Sat, 2 Mar 2019, speck for Frederic Weisbecker wrote:
> On Fri, Mar 01, 2019 at 10:47:42PM +0100, speck for Thomas Gleixner wrote:
> > if (cpu_matches(NO_MELTDOWN))
> > return;
> >
>
> It looks weird to have it as a separate bug flag and not as a subset of full
> MDS such as:
>
> #define NO_IDLE_SHARED_MDS BIT(4)
> #define NO_SHARED_MDS BIT(5)
> #define NO_MDS (NO_IDLE_SHARED_MDS | NO_SHARED_MDS)
>
> Now that would probably make sense only if the mitigation of full MDS required
> to also imply a VERW before entering idle (that's the mitigation of MSBDS_ONLY, right?).
> Turning off SMT removes the need to do that so the layout seem to make sense as is.
Yeah, I had several variants of the theme, but all of them sucked in one
way or the other.
Thanks,
tglx
* [MODERATED] Re: [patch V6 04/14] MDS basics 4
2019-03-01 21:47 ` [patch V6 04/14] MDS basics 4 Thomas Gleixner
2019-03-02 1:28 ` [MODERATED] " Frederic Weisbecker
@ 2019-03-06 20:00 ` Andrew Cooper
2019-03-06 20:32 ` Thomas Gleixner
2019-03-07 23:56 ` [MODERATED] " Andi Kleen
2 siblings, 1 reply; 89+ messages in thread
From: Andrew Cooper @ 2019-03-06 20:00 UTC (permalink / raw)
To: speck
On 01/03/2019 21:47, speck for Thomas Gleixner wrote:
> Subject: [patch V6 04/14] x86/speculation/mds: Add BUG_MSBDS_ONLY
> From: Thomas Gleixner <tglx@linutronix.de>
>
> This bug bit is set on CPUs which are only affected by Microarchitectural
> Store Buffer Data Sampling (MSBDS) and not by any other MDS variant.
>
> This is important because the Store Buffers are partitioned between
> Hyper-Threads so cross thread forwarding is not possible. But if a thread
> enters or exits a sleep state the store buffer is repartitioned which can
> expose data from one thread to the other. This transition can be mitigated.
>
> That means that for CPUs which are only affected by MSBDS SMT can be
> enabled, if the CPU is not affected by other SMT sensitive vulnerabilities,
> e.g. L1TF. The XEON PHI variants fall into that category.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/kernel/cpu/common.c | 10 +++++++---
> 2 files changed, 8 insertions(+), 3 deletions(-)
>
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -383,5 +383,6 @@
> #define X86_BUG_SPEC_STORE_BYPASS X86_BUG(17) /* CPU is affected by speculative store bypass attack */
> #define X86_BUG_L1TF X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
> #define X86_BUG_MDS X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
> +#define X86_BUG_MSBDS_ONLY X86_BUG(20) /* CPU is only affected by the MSBDS variant of BUG_MDS */
>
> #endif /* _ASM_X86_CPUFEATURES_H */
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -953,6 +953,7 @@ static void identify_cpu_without_cpuid(s
> #define NO_SSB BIT(2)
> #define NO_L1TF BIT(3)
> #define NO_MDS BIT(4)
> +#define MSBDS_ONLY BIT(5)
>
> #define VULNWL(_vendor, _family, _model, _whitelist) \
> { X86_VENDOR_##_vendor, _family, _model, X86_FEATURE_ANY, _whitelist }
> @@ -983,8 +984,8 @@ static const __initconst struct x86_cpu_
> VULNWL_INTEL(ATOM_SILVERMONT_X, NO_SSB | NO_L1TF),
> VULNWL_INTEL(ATOM_SILVERMONT_MID, NO_SSB | NO_L1TF),
> VULNWL_INTEL(ATOM_AIRMONT, NO_SSB | NO_L1TF),
> - VULNWL_INTEL(XEON_PHI_KNL, NO_SSB | NO_L1TF),
> - VULNWL_INTEL(XEON_PHI_KNM, NO_SSB | NO_L1TF),
> + VULNWL_INTEL(XEON_PHI_KNL, NO_SSB | NO_L1TF | MSBDS_ONLY),
> + VULNWL_INTEL(XEON_PHI_KNM, NO_SSB | NO_L1TF | MSBDS_ONLY),
Looking at the table in the magic PDF, Silvermont/Airmont are MSBDS_ONLY
as well.
The model numbers listed in the Silvermont/Airmont category are 37, 4a,
4c, 4d, 5a, 5d, 6e, 65, 75.
The first 5 of those models match up with Linux's Silvermont/Airmont
names, while the last 4 are unknown. I can't locate them anywhere and
have requested clarification.
~Andrew
* Re: [patch V6 04/14] MDS basics 4
2019-03-06 20:00 ` [MODERATED] " Andrew Cooper
@ 2019-03-06 20:32 ` Thomas Gleixner
0 siblings, 0 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-06 20:32 UTC (permalink / raw)
To: speck
On Wed, 6 Mar 2019, speck for Andrew Cooper wrote:
> On 01/03/2019 21:47, speck for Thomas Gleixner wrote:
> > #define VULNWL(_vendor, _family, _model, _whitelist) \
> > { X86_VENDOR_##_vendor, _family, _model, X86_FEATURE_ANY, _whitelist }
> > @@ -983,8 +984,8 @@ static const __initconst struct x86_cpu_
> > VULNWL_INTEL(ATOM_SILVERMONT_X, NO_SSB | NO_L1TF),
> > VULNWL_INTEL(ATOM_SILVERMONT_MID, NO_SSB | NO_L1TF),
> > VULNWL_INTEL(ATOM_AIRMONT, NO_SSB | NO_L1TF),
> > - VULNWL_INTEL(XEON_PHI_KNL, NO_SSB | NO_L1TF),
> > - VULNWL_INTEL(XEON_PHI_KNM, NO_SSB | NO_L1TF),
> > + VULNWL_INTEL(XEON_PHI_KNL, NO_SSB | NO_L1TF | MSBDS_ONLY),
> > + VULNWL_INTEL(XEON_PHI_KNM, NO_SSB | NO_L1TF | MSBDS_ONLY),
>
> Looking at the table in the magic PDF, Silvermont/Airmont are MDBDS_ONLY
> as well.
>
> The model numbers listed in the Silvermont/Airmont category are 37, 4a,
> 4c, 4d, 5a, 5d, 6e, 65, 75.
>
> The first 5 of those models match up with Linux's Silvermont/Airmont
> names, while the last 4 are unknown. I can't locate them anywhere and
> have requested clarification.
Yeah, forgot about the Silvermonts. Though the SMT problem does not exist
there as these beasts do not have HT AFAICT.
Thanks,
tglx
* [MODERATED] Re: [patch V6 04/14] MDS basics 4
2019-03-01 21:47 ` [patch V6 04/14] MDS basics 4 Thomas Gleixner
2019-03-02 1:28 ` [MODERATED] " Frederic Weisbecker
2019-03-06 20:00 ` [MODERATED] " Andrew Cooper
@ 2019-03-07 23:56 ` Andi Kleen
2019-03-08 0:36 ` Linus Torvalds
2 siblings, 1 reply; 89+ messages in thread
From: Andi Kleen @ 2019-03-07 23:56 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:42PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V6 04/14] x86/speculation/mds: Add BUG_MSBDS_ONLY
> From: Thomas Gleixner <tglx@linutronix.de>
>
> This bug bit is set on CPUs which are only affected by Microarchitectural
> Store Buffer Data Sampling (MSBDS) and not by any other MDS variant.
This patch is pointless. It won't have VERW support and we don't have mitigation
for Xeon Phi because Linus rejected software sequences.
Xeon Phi will simply not be mitigated. However Xeon PHIs are not widely
used, and those that are deployed can be handled in different ways.
-Andi
* [MODERATED] Re: [patch V6 04/14] MDS basics 4
2019-03-07 23:56 ` [MODERATED] " Andi Kleen
@ 2019-03-08 0:36 ` Linus Torvalds
0 siblings, 0 replies; 89+ messages in thread
From: Linus Torvalds @ 2019-03-08 0:36 UTC (permalink / raw)
To: speck
On Thu, Mar 7, 2019 at 3:56 PM speck for Andi Kleen <speck@linutronix.de> wrote:
>
> Xeon Phi will simply not be mitigated. However Xeon PHIs are not widely
> used,
Heh. Understatement of the year.
> and those that are deployed can be handled in different ways.
I don't think anybody uses them in situations that would care.
The main target was HPC, I think.
So I think the "handled in different ways" ends up being "ignored", I suspect.
Linus
* [patch V6 05/14] MDS basics 5
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (3 preceding siblings ...)
2019-03-01 21:47 ` [patch V6 04/14] MDS basics 4 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-02 1:37 ` [MODERATED] " Frederic Weisbecker
2019-03-07 23:59 ` Andi Kleen
2019-03-01 21:47 ` [patch V6 06/14] MDS basics 6 Thomas Gleixner
` (10 subsequent siblings)
15 siblings, 2 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
From: Andi Kleen <ak@linux.intel.com>
Subject: [patch V6 05/14] x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests
X86_FEATURE_MD_CLEAR is a new CPUID bit which is set when microcode
provides a mechanism to flush various exploitable CPU buffers by
invoking the VERW instruction.
Hand it through to guests so they can adjust their mitigations.
This also requires corresponding qemu changes, which are available
separately.
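MD_CLEAR is enumerated in CPUID.(EAX=7,ECX=0):EDX[10], matching the
(18*32+10) cpufeatures word/bit above. A guest can probe it from user
space along these lines (sketch, using the compiler's <cpuid.h>):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID leaf 7, subleaf 0: EDX bit 10 is MD_CLEAR */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 1;

	printf("MD_CLEAR %s\n", (edx & (1u << 10)) ? "present" : "absent");
	return 0;
}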
[ tglx: Massaged changelog ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
arch/x86/kvm/cpuid.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -410,7 +410,8 @@ static inline int __do_cpuid_ent(struct
/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
- F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP);
+ F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
+ F(MD_CLEAR);
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
* [MODERATED] Re: [patch V6 05/14] MDS basics 5
2019-03-01 21:47 ` [patch V6 05/14] MDS basics 5 Thomas Gleixner
@ 2019-03-02 1:37 ` Frederic Weisbecker
2019-03-07 23:59 ` Andi Kleen
1 sibling, 0 replies; 89+ messages in thread
From: Frederic Weisbecker @ 2019-03-02 1:37 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:43PM +0100, speck for Thomas Gleixner wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject: [patch V6 05/14] x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests
>
> X86_FEATURE_MD_CLEAR is a new CPUID bit which is set when microcode
> provides the mechanism to invoke a flush of various exploitable CPU buffers
> by invoking the VERW instruction.
>
> Hand it through to guests so they can adjust their mitigations.
>
> This also requires corresponding qemu changes, which are available
> separately.
>
> [ tglx: Massaged changelog ]
>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Borislav Petkov <bp@suse.de>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [patch V6 05/14] MDS basics 5
2019-03-01 21:47 ` [patch V6 05/14] MDS basics 5 Thomas Gleixner
2019-03-02 1:37 ` [MODERATED] " Frederic Weisbecker
@ 2019-03-07 23:59 ` Andi Kleen
2019-03-08 6:37 ` Thomas Gleixner
1 sibling, 1 reply; 89+ messages in thread
From: Andi Kleen @ 2019-03-07 23:59 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:43PM +0100, speck for Thomas Gleixner wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject: [patch V6 05/14] x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests
>
> X86_FEATURE_MD_CLEAR is a new CPUID bit which is set when microcode
> provides the mechanism to invoke a flush of various exploitable CPU buffers
> by invoking the VERW instruction.
>
> Hand it through to guests so they can adjust their mitigations.
>
> This also requires corresponding qemu changes, which are available
> separately.
This patch is not complete. You also need some variant of
x86/speculation/mds: Handle VMENTRY clear for CPUs without l1tf
in my patch kit.
-Andi
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [patch V6 05/14] MDS basics 5
2019-03-07 23:59 ` Andi Kleen
@ 2019-03-08 6:37 ` Thomas Gleixner
0 siblings, 0 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-08 6:37 UTC (permalink / raw)
To: speck
On Thu, 7 Mar 2019, speck for Andi Kleen wrote:
> On Fri, Mar 01, 2019 at 10:47:43PM +0100, speck for Thomas Gleixner wrote:
> > From: Andi Kleen <ak@linux.intel.com>
> > Subject: [patch V6 05/14] x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests
> >
> > X86_FEATURE_MD_CLEAR is a new CPUID bit which is set when microcode
> > provides the mechanism to invoke a flush of various exploitable CPU buffers
> > by invoking the VERW instruction.
> >
> > Hand it through to guests so they can adjust their mitigations.
> >
> > This also requires corresponding qemu changes, which are available
> > separately.
>
> This patch is not complete. You also need some variant of
>
> x86/speculation/mds: Handle VMENTRY clear for CPUs without l1tf
650b68a0622f ("x86/kvm/vmx: Add MDS protection when L1D Flush is not active")
Thanks,
tglx
^ permalink raw reply [flat|nested] 89+ messages in thread
* [patch V6 06/14] MDS basics 6
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (4 preceding siblings ...)
2019-03-01 21:47 ` [patch V6 05/14] MDS basics 5 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-04 6:28 ` [MODERATED] Encrypted Message Jon Masters
2019-03-01 21:47 ` [patch V6 07/14] MDS basics 7 Thomas Gleixner
` (9 subsequent siblings)
15 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
Subject: [patch V6 06/14] x86/speculation/mds: Add mds_clear_cpu_buffers()
From: Thomas Gleixner <tglx@linutronix.de>
The Microarchitectural Data Sampling (MDS) vulnerabilities are mitigated by
clearing the affected CPU buffers. The mechanism for clearing the buffers
uses the unused and obsolete VERW instruction in combination with a
microcode update which triggers a CPU buffer clear when VERW is executed.
Provide an inline function with the assembly magic. The argument of the VERW
instruction must be a memory operand as documented:
"MD_CLEAR enumerates that the memory-operand variant of VERW (for
example, VERW m16) has been extended to also overwrite buffers affected
by MDS. This buffer overwriting functionality is not guaranteed for the
register operand variant of VERW."
Documentation also recommends using a writable data segment selector:
"The buffer overwriting occurs regardless of the result of the VERW
permission check, as well as when the selector is null or causes a
descriptor load segment violation. However, for lowest latency we
recommend using a selector that indicates a valid writable data
segment."
Add x86 specific documentation about MDS and the internal workings of the
mitigation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
V4 --> V5: Fix typos and remove the conditional mode reference.
V3 --> V4: Document the segment selector choice as well.
V2 --> V3: Add VERW documentation and fix typos/grammar..., dropped 'i(0)'
Add more details to the documentation file
V1 --> V2: Add "cc" clobber and documentation
---
Documentation/index.rst | 1
Documentation/x86/conf.py | 10 +++
Documentation/x86/index.rst | 8 ++
Documentation/x86/mds.rst | 99 +++++++++++++++++++++++++++++++++++
arch/x86/include/asm/nospec-branch.h | 25 ++++++++
5 files changed, 143 insertions(+)
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -101,6 +101,7 @@ implementation.
:maxdepth: 2
sh/index
+ x86/index
Filesystem Documentation
------------------------
--- /dev/null
+++ b/Documentation/x86/conf.py
@@ -0,0 +1,10 @@
+# -*- coding: utf-8; mode: python -*-
+
+project = "X86 architecture specific documentation"
+
+tags.add("subproject")
+
+latex_documents = [
+ ('index', 'x86.tex', project,
+ 'The kernel development community', 'manual'),
+]
--- /dev/null
+++ b/Documentation/x86/index.rst
@@ -0,0 +1,8 @@
+==========================
+x86 architecture specifics
+==========================
+
+.. toctree::
+ :maxdepth: 1
+
+ mds
--- /dev/null
+++ b/Documentation/x86/mds.rst
@@ -0,0 +1,99 @@
+Microarchitectural Data Sampling (MDS) mitigation
+=================================================
+
+.. _mds:
+
+Overview
+--------
+
+Microarchitectural Data Sampling (MDS) is a family of side channel attacks
+on internal buffers in Intel CPUs. The variants are:
+
+ - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
+ - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
+ - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
+
+MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
+dependent load (store-to-load forwarding) as an optimization. The forward
+can also happen to a faulting or assisting load operation for a different
+memory address, which can be exploited under certain conditions. Store
+buffers are partitioned between Hyper-Threads so cross thread forwarding is
+not possible. But if a thread enters or exits a sleep state the store
+buffer is repartitioned which can expose data from one thread to the other.
+
+MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
+L1 miss situations and to hold data which is returned or sent in response
+to a memory or I/O operation. Fill buffers can forward data to a load
+operation and also write data to the cache. When the fill buffer is
+deallocated it can retain the stale data of the preceding operations which
+can then be forwarded to a faulting or assisting load operation, which can
+be exploited under certain conditions. Fill buffers are shared between
+Hyper-Threads so cross thread leakage is possible.
+
+MLPDS leaks Load Port Data. Load ports are used to perform load operations
+from memory or I/O. The received data is then forwarded to the register
+file or a subsequent operation. In some implementations the Load Port can
+contain stale data from a previous operation which can be forwarded to
+faulting or assisting loads under certain conditions, which again can be
+exploited eventually. Load ports are shared between Hyper-Threads so cross
+thread leakage is possible.
+
+
+Exposure assumptions
+--------------------
+
+It is assumed that attack code resides in user space or in a guest with one
+exception. The rationale behind this assumption is that the code construct
+needed for exploiting MDS requires:
+
+ - to control the load to trigger a fault or assist
+
+ - to have a disclosure gadget which exposes the speculatively accessed
+ data for consumption through a side channel.
+
+ - to control the pointer through which the disclosure gadget exposes the
+ data
+
+The existence of such a construct in the kernel cannot be excluded with
+100% certainty, but the complexity involved makes it extremely unlikely.
+
+There is one exception, which is untrusted BPF. The functionality of
+untrusted BPF is limited, but it needs to be thoroughly investigated
+whether it can be used to create such a construct.
+
+
+Mitigation strategy
+-------------------
+
+All variants have the same mitigation strategy at least for the single CPU
+thread case (SMT off): Force the CPU to clear the affected buffers.
+
+This is achieved by using the otherwise unused and obsolete VERW
+instruction in combination with a microcode update. The microcode clears
+the affected CPU buffers when the VERW instruction is executed.
+
+For virtualization there are two ways to achieve CPU buffer
+clearing: either via the modified VERW instruction or via the L1D Flush
+command. The latter is issued when L1TF mitigation is enabled so the extra
+VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to
+be issued.
+
+If the VERW instruction with the supplied segment selector argument is
+executed on a CPU without the microcode update there is no side effect
+other than a small number of pointlessly wasted CPU cycles.
+
+This does not protect against cross Hyper-Thread attacks except for MSBDS
+which is only exploitable cross Hyper-thread when one of the Hyper-Threads
+enters a C-state.
+
+The kernel provides a function to invoke the buffer clearing:
+
+ mds_clear_cpu_buffers()
+
+The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
+(idle) transitions.
+
+According to current knowledge additional mitigations inside the kernel
+itself are not required because the necessary gadgets to expose the leaked
+data cannot be controlled in a way which allows exploitation from malicious
+user space or VM guests.
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -318,6 +318,31 @@ DECLARE_STATIC_KEY_FALSE(switch_to_cond_
DECLARE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
+#include <asm/segment.h>
+
+/**
+ * mds_clear_cpu_buffers - Mitigation for MDS vulnerability
+ *
+ * This uses the otherwise unused and obsolete VERW instruction in
+ * combination with microcode which triggers a CPU buffer flush when the
+ * instruction is executed.
+ */
+static inline void mds_clear_cpu_buffers(void)
+{
+ static const u16 ds = __KERNEL_DS;
+
+ /*
+ * Has to be the memory-operand variant because only that
+ * guarantees the CPU buffer flush functionality according to
+ * documentation. The register-operand variant does not.
+ * Works with any segment selector, but a valid writable
+ * data segment is the fastest variant.
+ *
+ * "cc" clobber is required because VERW modifies ZF.
+ */
+ asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
+}
+
#endif /* __ASSEMBLY__ */
/*
^ permalink raw reply [flat|nested] 89+ messages in thread
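The same memory-operand VERW can be exercised from user space as well; the
buffer overwrite happens regardless of the permission-check outcome, per
the documentation quoted above. A standalone sketch, not part of the
patch, which substitutes the x86_64 Linux user data segment selector 0x2b
for the kernel-internal __KERNEL_DS:

#include <stdint.h>

static inline void clear_cpu_buffers(void)
{
        /*
         * The memory-operand variant is required for the buffer clear;
         * a valid writable data segment is merely the lowest-latency
         * choice. VERW modifies ZF, hence the "cc" clobber.
         */
        static const uint16_t ds = 0x2b;

        asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
}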
* [MODERATED] Encrypted Message
2019-03-01 21:47 ` [patch V6 06/14] MDS basics 6 Thomas Gleixner
@ 2019-03-04 6:28 ` Jon Masters
2019-03-05 14:55 ` Thomas Gleixner
0 siblings, 1 reply; 89+ messages in thread
From: Jon Masters @ 2019-03-04 6:28 UTC (permalink / raw)
To: speck
From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 06/14] MDS basics 6
On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
> Provide an inline function with the assembly magic. The argument of the VERW
> instruction must be a memory operand as documented:
>
> "MD_CLEAR enumerates that the memory-operand variant of VERW (for
> example, VERW m16) has been extended to also overwrite buffers affected
> by MDS. This buffer overwriting functionality is not guaranteed for the
> register operand variant of VERW."
>
> Documentation also recommends using a writable data segment selector:
>
> "The buffer overwriting occurs regardless of the result of the VERW
> permission check, as well as when the selector is null or causes a
> descriptor load segment violation. However, for lowest latency we
> recommend using a selector that indicates a valid writable data
> segment."
Note that we raised this again with Intel last week amid Andrew's
results and they are going to get back to us if this guidance changes as
a result of further measurements on their end. It's a few cycles
difference in the Coffeelake case, but it could always be higher.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Encrypted Message
2019-03-04 6:28 ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-05 14:55 ` Thomas Gleixner
0 siblings, 0 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-05 14:55 UTC (permalink / raw)
To: speck
On Mon, 4 Mar 2019, speck for Jon Masters wrote:
> > Documentation also recommends using a writable data segment selector:
> >
> > "The buffer overwriting occurs regardless of the result of the VERW
> > permission check, as well as when the selector is null or causes a
> > descriptor load segment violation. However, for lowest latency we
> > recommend using a selector that indicates a valid writable data
> > segment."
>
> Note that we raised this again with Intel last week amid Andrew's
> results and they are going to get back to us if this guidance changes as
> a result of further measurements on their end. It's a few cycles
> difference in the Coffeelake case, but it could always be higher.
Ok. We can fix that up on top once we have final answers.
Thanks,
tglx
^ permalink raw reply [flat|nested] 89+ messages in thread
* [patch V6 07/14] MDS basics 7
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (5 preceding siblings ...)
2019-03-01 21:47 ` [patch V6 06/14] MDS basics 6 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-02 2:22 ` [MODERATED] " Frederic Weisbecker
2019-03-06 5:21 ` Borislav Petkov
2019-03-01 21:47 ` [patch V6 08/14] MDS basics 8 Thomas Gleixner
` (8 subsequent siblings)
15 siblings, 2 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
Subject: [patch V6 07/14] x86/speculation/mds: Clear CPU buffers on exit to user
From: Thomas Gleixner <tglx@linutronix.de>
Add a static key which controls the invocation of the CPU buffer clear
mechanism on exit to user space and add the call into
prepare_exit_to_usermode() and do_nmi() right before actually returning.
Add documentation on which kernel to user space transitions this covers
and explain why some corner cases are not mitigated.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
V4 --> V5: Use an inline helper instead of open coding it.
Rework the documentation paragraph about exceptions.
V3 --> V4: Add #DF mitigation and document that the #MC corner case
is really not interesting.
V3: Add NMI conditional on user regs and update documentation accordingly.
Use the static branch scheme suggested by Peter. Fix typos ...
---
Documentation/x86/mds.rst | 52 +++++++++++++++++++++++++++++++++++
arch/x86/entry/common.c | 3 ++
arch/x86/include/asm/nospec-branch.h | 13 ++++++++
arch/x86/kernel/cpu/bugs.c | 3 ++
arch/x86/kernel/nmi.c | 4 ++
arch/x86/kernel/traps.c | 7 ++++
6 files changed, 82 insertions(+)
--- a/Documentation/x86/mds.rst
+++ b/Documentation/x86/mds.rst
@@ -97,3 +97,55 @@ According to current knowledge additiona
itself are not required because the necessary gadgets to expose the leaked
data cannot be controlled in a way which allows exploitation from malicious
user space or VM guests.
+
+Mitigation points
+-----------------
+
+1. Return to user space
+^^^^^^^^^^^^^^^^^^^^^^^
+
+ When transitioning from kernel to user space the CPU buffers are flushed
+ on affected CPUs when the mitigation is not disabled on the kernel
+ command line. The mitigation is enabled through the static key
+ mds_user_clear.
+
+ The mitigation is invoked in prepare_exit_to_usermode() which covers
+ most of the kernel to user space transitions. There are a few exceptions
+ which do not invoke prepare_exit_to_usermode() on return to user
+ space. These exceptions use the paranoid exit code.
+
+ - Non Maskable Interrupt (NMI):
+
+ Access to sensitive data like keys or credentials in the NMI context is
+ mostly theoretical: The CPU can do prefetching or execute a
+ misspeculated code path and thereby fetch data which might end up
+ leaking through a buffer.
+
+ But for mounting other attacks the kernel stack address of the task is
+ already valuable information. So in full mitigation mode, the NMI is
+ mitigated on the return from do_nmi() to provide almost complete
+ coverage.
+
+ - Double fault (#DF):
+
+ A double fault is usually fatal, but the ESPFIX workaround, which can
+ be triggered from user space through modify_ldt(2), is a recoverable
+ double fault. #DF uses the paranoid exit path, so explicit mitigation
+ in the double fault handler is required.
+
+ - Machine Check Exception (#MC):
+
+ Another corner case is a #MC which hits between the CPU buffer clear
+ invocation and the actual return to user. As this still is in kernel
+ space it takes the paranoid exit path which does not clear the CPU
+ buffers. So the #MC handler repopulates the buffers to some
+ extent. Machine checks are not reliably controllable and the window is
+ extremely small, so mitigation would just tick a checkbox that this
+ theoretical corner case is covered. To keep the amount of special
+ cases small, ignore #MC.
+
+ - Debug Exception (#DB):
+
+ This takes the paranoid exit path only when the INT1 breakpoint is in
+ kernel space. #DB on a user space address takes the regular exit path,
+ so no extra mitigation required.
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -31,6 +31,7 @@
#include <asm/vdso.h>
#include <linux/uaccess.h>
#include <asm/cpufeature.h>
+#include <asm/nospec-branch.h>
#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>
@@ -212,6 +213,8 @@ static void exit_to_usermode_loop(struct
#endif
user_enter_irqoff();
+
+ mds_user_clear_cpu_buffers();
}
#define SYSCALL_EXIT_WORK_FLAGS \
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -318,6 +318,8 @@ DECLARE_STATIC_KEY_FALSE(switch_to_cond_
DECLARE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
+DECLARE_STATIC_KEY_FALSE(mds_user_clear);
+
#include <asm/segment.h>
/**
@@ -343,6 +345,17 @@ static inline void mds_clear_cpu_buffers
asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
}
+/**
+ * mds_user_clear_cpu_buffers - Mitigation for MDS vulnerability
+ *
+ * Clear CPU buffers if the corresponding static key is enabled
+ */
+static inline void mds_user_clear_cpu_buffers(void)
+{
+ if (static_branch_likely(&mds_user_clear))
+ mds_clear_cpu_buffers();
+}
+
#endif /* __ASSEMBLY__ */
/*
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -63,6 +63,9 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_i
/* Control unconditional IBPB in switch_mm() */
DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
+/* Control MDS CPU buffer clear before returning to user space */
+DEFINE_STATIC_KEY_FALSE(mds_user_clear);
+
void __init check_bugs(void)
{
identify_boot_cpu();
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -34,6 +34,7 @@
#include <asm/x86_init.h>
#include <asm/reboot.h>
#include <asm/cache.h>
+#include <asm/nospec-branch.h>
#define CREATE_TRACE_POINTS
#include <trace/events/nmi.h>
@@ -533,6 +534,9 @@ do_nmi(struct pt_regs *regs, long error_
write_cr2(this_cpu_read(nmi_cr2));
if (this_cpu_dec_return(nmi_state))
goto nmi_restart;
+
+ if (user_mode(regs))
+ mds_user_clear_cpu_buffers();
}
NOKPROBE_SYMBOL(do_nmi);
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -366,6 +366,13 @@ dotraplinkage void do_double_fault(struc
regs->ip = (unsigned long)general_protection;
regs->sp = (unsigned long)&gpregs->orig_ax;
+ /*
+ * This situation can be triggered by userspace via
+ * modify_ldt(2) and the return does not take the regular
+ * user space exit, so a CPU buffer clear is required when
+ * MDS mitigation is enabled.
+ */
+ mds_user_clear_cpu_buffers();
return;
}
#endif
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [patch V6 07/14] MDS basics 7
2019-03-01 21:47 ` [patch V6 07/14] MDS basics 7 Thomas Gleixner
@ 2019-03-02 2:22 ` Frederic Weisbecker
2019-03-05 15:30 ` Thomas Gleixner
2019-03-06 5:21 ` Borislav Petkov
1 sibling, 1 reply; 89+ messages in thread
From: Frederic Weisbecker @ 2019-03-02 2:22 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:45PM +0100, speck for Thomas Gleixner wrote:
> +
> + - Debug Exception (#DB):
> +
> + This takes the paranoid exit path only when the INT1 breakpoint is in
> + kernel space. #DB on a user space address takes the regular exit path,
> + so no extra mitigation required.
I can't find that part in this patch, maybe it's further in the series?
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -34,6 +34,7 @@
> #include <asm/x86_init.h>
> #include <asm/reboot.h>
> #include <asm/cache.h>
> +#include <asm/nospec-branch.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/nmi.h>
> @@ -533,6 +534,9 @@ do_nmi(struct pt_regs *regs, long error_
> write_cr2(this_cpu_read(nmi_cr2));
> if (this_cpu_dec_return(nmi_state))
> goto nmi_restart;
> +
> + if (user_mode(regs))
> + mds_user_clear_cpu_buffers();
What if the NMI fires after a call to prepare_exit_to_usermode()
but before the actual return to usermode, would that be a problem?
Thanks.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [patch V6 07/14] MDS basics 7
2019-03-02 2:22 ` [MODERATED] " Frederic Weisbecker
@ 2019-03-05 15:30 ` Thomas Gleixner
2019-03-06 15:49 ` [MODERATED] " Frederic Weisbecker
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-05 15:30 UTC (permalink / raw)
To: speck
On Sat, 2 Mar 2019, speck for Frederic Weisbecker wrote:
> On Fri, Mar 01, 2019 at 10:47:45PM +0100, speck for Thomas Gleixner wrote:
> > +
> > + - Debug Exception (#DB):
> > +
> > + This takes the paranoid exit path only when the INT1 breakpoint is in
> > + kernel space. #DB on a user space address takes the regular exit path,
> > + so no extra mitigation required.
>
> I can't find that part in this patch, maybe it's further in the series?
There is no patch. #DB is not interesting as explained above.
> > --- a/arch/x86/kernel/nmi.c
> > +++ b/arch/x86/kernel/nmi.c
> > @@ -34,6 +34,7 @@
> > #include <asm/x86_init.h>
> > #include <asm/reboot.h>
> > #include <asm/cache.h>
> > +#include <asm/nospec-branch.h>
> >
> > #define CREATE_TRACE_POINTS
> > #include <trace/events/nmi.h>
> > @@ -533,6 +534,9 @@ do_nmi(struct pt_regs *regs, long error_
> > write_cr2(this_cpu_read(nmi_cr2));
> > if (this_cpu_dec_return(nmi_state))
> > goto nmi_restart;
> > +
> > + if (user_mode(regs))
> > + mds_user_clear_cpu_buffers();
>
> What if the NMI fires after a call to prepare_exit_to_usermode()
> but before the actual return to usermode, would that be a problem?
Yes, it's a hole in the protection, but you would need to be able to
orchestrate that as a user, which I doubt you can. So the thought was that
we'd rather avoid the penalty for perf when it hits kernel space, which
requires root ....
Thanks,
tglx
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [patch V6 07/14] MDS basics 7
2019-03-05 15:30 ` Thomas Gleixner
@ 2019-03-06 15:49 ` Frederic Weisbecker
0 siblings, 0 replies; 89+ messages in thread
From: Frederic Weisbecker @ 2019-03-06 15:49 UTC (permalink / raw)
To: speck
On Tue, Mar 05, 2019 at 04:30:38PM +0100, speck for Thomas Gleixner wrote:
> On Sat, 2 Mar 2019, speck for Frederic Weisbecker wrote:
>
> > On Fri, Mar 01, 2019 at 10:47:45PM +0100, speck for Thomas Gleixner wrote:
> > > +
> > > + - Debug Exception (#DB):
> > > +
> > > + This takes the paranoid exit path only when the INT1 breakpoint is in
> > > + kernel space. #DB on a user space address takes the regular exit path,
> > > + so no extra mitigation required.
> >
> > I can't find that part in this patch, maybe it's further in the series?
>
> There is no patch. #DB is not interesting as explained above.
Oh right, my brainfart...
>
> > > --- a/arch/x86/kernel/nmi.c
> > > +++ b/arch/x86/kernel/nmi.c
> > > @@ -34,6 +34,7 @@
> > > #include <asm/x86_init.h>
> > > #include <asm/reboot.h>
> > > #include <asm/cache.h>
> > > +#include <asm/nospec-branch.h>
> > >
> > > #define CREATE_TRACE_POINTS
> > > #include <trace/events/nmi.h>
> > > @@ -533,6 +534,9 @@ do_nmi(struct pt_regs *regs, long error_
> > > write_cr2(this_cpu_read(nmi_cr2));
> > > if (this_cpu_dec_return(nmi_state))
> > > goto nmi_restart;
> > > +
> > > + if (user_mode(regs))
> > > + mds_user_clear_cpu_buffers();
> >
> > What if the NMI fires after a call to prepare_exit_to_usermode()
> > but before the actual return to usermode, would that be a problem?
>
> Yes, it's a hole in the protection, but you would need to be able to
> orchestrate that as a user, which I doubt you can. So the thought was that
> we'd rather avoid the penalty for perf when it hits kernel space, which
> requires root ....
Fair enough.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [patch V6 07/14] MDS basics 7
2019-03-01 21:47 ` [patch V6 07/14] MDS basics 7 Thomas Gleixner
2019-03-02 2:22 ` [MODERATED] " Frederic Weisbecker
@ 2019-03-06 5:21 ` Borislav Petkov
1 sibling, 0 replies; 89+ messages in thread
From: Borislav Petkov @ 2019-03-06 5:21 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:45PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V6 07/14] x86/speculation/mds: Clear CPU buffers on exit to user
> From: Thomas Gleixner <tglx@linutronix.de>
>
> Add a static key which controls the invocation of the CPU buffer clear
> mechanism on exit to user space and add the call into
> prepare_exit_to_usermode() and do_nmi() right before actually returning.
>
> Add documentation on which kernel to user space transitions this covers
> and explain why some corner cases are not mitigated.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>
> ---
> V4 --> V5: Use an inline helper instead of open coding it.
> Rework the documentation paragraph about exceptions.
>
> V3 --> V4: Add #DF mitigation and document that the #MC corner case
> is really not interesting.
>
> V3: Add NMI conditional on user regs and update documentation accordingly.
> Use the static branch scheme suggested by Peter. Fix typos ...
> ---
> Documentation/x86/mds.rst | 52 +++++++++++++++++++++++++++++++++++
> arch/x86/entry/common.c | 3 ++
> arch/x86/include/asm/nospec-branch.h | 13 ++++++++
> arch/x86/kernel/cpu/bugs.c | 3 ++
> arch/x86/kernel/nmi.c | 4 ++
> arch/x86/kernel/traps.c | 7 ++++
> 6 files changed, 82 insertions(+)
...
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -366,6 +366,13 @@ dotraplinkage void do_double_fault(struc
> regs->ip = (unsigned long)general_protection;
> regs->sp = (unsigned long)&gpregs->orig_ax;
>
> + /*
> + * This situation can be triggered by userspace via
> + * modify_ldt(2) and the return does not take the regular
> + * user space exit, so a CPU buffer clear is required when
> + * MDS mitigation is enabled.
> + */
> + mds_user_clear_cpu_buffers();
> return;
> }
> #endif
Looks like the traps.c change is missing a hunk, see below. Without it:
arch/x86/kernel/traps.c: In function ‘do_double_fault’:
arch/x86/kernel/traps.c:375:3: error: implicit declaration of function ‘mds_user_clear_cpu_buffers’ [-Werror=implicit-function-declaration]
mds_user_clear_cpu_buffers();
^~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:276: arch/x86/kernel/traps.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [scripts/Makefile.build:492: arch/x86/kernel] Error 2
make: *** [Makefile:1043: arch/x86] Error 2
make: *** Waiting for unfinished jobs....
---
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 5942060dba9a..ce33f7f672d6 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
#include <asm/mpx.h>
#include <asm/vm86.h>
#include <asm/umip.h>
+#include <asm/nospec-branch.h>
#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
---
with that
Reviewed-by: Borislav Petkov <bp@suse.de>
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
^ permalink raw reply related [flat|nested] 89+ messages in thread
* [patch V6 08/14] MDS basics 8
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (6 preceding siblings ...)
2019-03-01 21:47 ` [patch V6 07/14] MDS basics 7 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-03 2:54 ` [MODERATED] " Frederic Weisbecker
` (2 more replies)
2019-03-01 21:47 ` [patch V6 09/14] MDS basics 9 Thomas Gleixner
` (7 subsequent siblings)
15 siblings, 3 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
CPUs which are affected by L1TF and MDS mitigate MDS with the L1D Flush on
VMENTER when updated microcode is installed.
If a CPU is not affected by L1TF or if the L1D Flush is not in use, then
MDS mitigation needs to be invoked explicit.
For these cases, follow the host mitigation state and invoke the MDS
mitigation before VMENTER.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
V4 --> V5: Fix changelog
---
arch/x86/kernel/cpu/bugs.c | 1 +
arch/x86/kvm/vmx/vmx.c | 2 ++
2 files changed, 3 insertions(+)
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -65,6 +65,7 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_always
/* Control MDS CPU buffer clear before returning to user space */
DEFINE_STATIC_KEY_FALSE(mds_user_clear);
+EXPORT_SYMBOL_GPL(mds_user_clear);
void __init check_bugs(void)
{
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6371,6 +6371,8 @@ static void __vmx_vcpu_run(struct kvm_vc
if (static_branch_unlikely(&vmx_l1d_should_flush))
vmx_l1d_flush(vcpu);
+ else if (static_branch_unlikely(&mds_user_clear))
+ mds_clear_cpu_buffers();
asm(
/* Store host registers */
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [patch V6 08/14] MDS basics 8
2019-03-01 21:47 ` [patch V6 08/14] MDS basics 8 Thomas Gleixner
@ 2019-03-03 2:54 ` Frederic Weisbecker
2019-03-04 6:57 ` [MODERATED] Encrypted Message Jon Masters
2019-03-06 14:11 ` [MODERATED] Re: [patch V6 08/14] MDS basics 8 Borislav Petkov
2 siblings, 0 replies; 89+ messages in thread
From: Frederic Weisbecker @ 2019-03-03 2:54 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:46PM +0100, speck for Thomas Gleixner wrote:
> CPUs which are affected by L1TF and MDS mitigate MDS with the L1D Flush on
> VMENTER when updated microcode is installed.
>
> If a CPU is not affected by L1TF or if the L1D Flush is not in use, then
> MDS mitigation needs to be invoked explicit.
>
> For these cases, follow the host mitigation state and invoke the MDS
> mitigation before VMENTER.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
> V4 --> V5: Fix changelog
> ---
> arch/x86/kernel/cpu/bugs.c | 1 +
> arch/x86/kvm/vmx/vmx.c | 2 ++
> 2 files changed, 3 insertions(+)
>
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -65,6 +65,7 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_always
>
> /* Control MDS CPU buffer clear before returning to user space */
> DEFINE_STATIC_KEY_FALSE(mds_user_clear);
> +EXPORT_SYMBOL_GPL(mds_user_clear);
>
> void __init check_bugs(void)
> {
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6371,6 +6371,8 @@ static void __vmx_vcpu_run(struct kvm_vc
We may want to add a comment below to summarize what's explained
in the changelog. git blame tends to lose the primary history after even
the most insignificant future variable rename. Something like:
+ /* l1tf mitigation, if present, spares us mds mitigation */
> if (static_branch_unlikely(&vmx_l1d_should_flush))
> vmx_l1d_flush(vcpu);
> + else if (static_branch_unlikely(&mds_user_clear))
> + mds_clear_cpu_buffers();
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Thanks.
^ permalink raw reply [flat|nested] 89+ messages in thread
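With such a comment folded in, the VMENTER-path decision would read
roughly as below (a sketch; the final comment wording is up to Thomas):

        /*
         * The L1D flush, if active, already overwrites the CPU buffers
         * on VMENTER and thus spares the explicit MDS clear.
         */
        if (static_branch_unlikely(&vmx_l1d_should_flush))
                vmx_l1d_flush(vcpu);
        else if (static_branch_unlikely(&mds_user_clear))
                mds_clear_cpu_buffers();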
* [MODERATED] Encrypted Message
2019-03-01 21:47 ` [patch V6 08/14] MDS basics 8 Thomas Gleixner
2019-03-03 2:54 ` [MODERATED] " Frederic Weisbecker
@ 2019-03-04 6:57 ` Jon Masters
2019-03-04 7:06 ` Jon Masters
2019-03-05 15:34 ` Thomas Gleixner
2019-03-06 14:11 ` [MODERATED] Re: [patch V6 08/14] MDS basics 8 Borislav Petkov
2 siblings, 2 replies; 89+ messages in thread
From: Jon Masters @ 2019-03-04 6:57 UTC (permalink / raw)
To: speck
From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 08/14] MDS basics 8
On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
> if (static_branch_unlikely(&vmx_l1d_should_flush))
> vmx_l1d_flush(vcpu);
> + else if (static_branch_unlikely(&mds_user_clear))
> + mds_clear_cpu_buffers();
Does this cover the case where we have older ucode installed that does
L1D flush but NOT the MD_CLEAR? I'm about to go check to see if there's
logic handling this but wanted to call it out.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Encrypted Message
2019-03-04 6:57 ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-04 7:06 ` Jon Masters
2019-03-04 8:12 ` Jon Masters
2019-03-05 15:34 ` Thomas Gleixner
1 sibling, 1 reply; 89+ messages in thread
From: Jon Masters @ 2019-03-04 7:06 UTC (permalink / raw)
To: speck
From: Jon Masters <jcm@redhat.com>
To: speck for Jon Masters <speck@linutronix.de>
Subject: Re: [patch V6 08/14] MDS basics 8
On 3/4/19 1:57 AM, speck for Jon Masters wrote:
> On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
>> if (static_branch_unlikely(&vmx_l1d_should_flush))
>> vmx_l1d_flush(vcpu);
>> + else if (static_branch_unlikely(&mds_user_clear))
>> + mds_clear_cpu_buffers();
>
> Does this cover the case where we have older ucode installed that does
> L1D flush but NOT the MD_CLEAR? I'm about to go check to see if there's
> logic handling this but wanted to call it out.
Aside from the above question, I've reviewed all of the patches
extensively at this point. Feel free to add a Reviewed-by or Tested-by
according to your preference. I've a bunch of further tests running,
including on AMD platforms just to check nothing broke with those
platforms that are not susceptible to MDS.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Encrypted Message
2019-03-04 6:57 ` [MODERATED] Encrypted Message Jon Masters
2019-03-04 7:06 ` Jon Masters
@ 2019-03-05 15:34 ` Thomas Gleixner
2019-03-06 16:21 ` [MODERATED] " Jon Masters
1 sibling, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-05 15:34 UTC (permalink / raw)
To: speck
On Mon, 4 Mar 2019, speck for Jon Masters wrote:
> On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
> > if (static_branch_unlikely(&vmx_l1d_should_flush))
> > vmx_l1d_flush(vcpu);
> > + else if (static_branch_unlikely(&mds_user_clear))
> > + mds_clear_cpu_buffers();
>
> Does this cover the case where we have older ucode installed that does
> L1D flush but NOT the MD_CLEAR? I'm about to go check to see if there's
> logic handling this but wanted to call it out.
If no updated microcode is available then it's pretty irrelevant which code
path you take. None of them will mitigate MDS.
Thanks,
tglx
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [patch V6 08/14] MDS basics 8
2019-03-01 21:47 ` [patch V6 08/14] MDS basics 8 Thomas Gleixner
2019-03-03 2:54 ` [MODERATED] " Frederic Weisbecker
2019-03-04 6:57 ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-06 14:11 ` Borislav Petkov
2 siblings, 0 replies; 89+ messages in thread
From: Borislav Petkov @ 2019-03-06 14:11 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:46PM +0100, speck for Thomas Gleixner wrote:
> CPUs which are affected by L1TF and MDS mitigate MDS with the L1D Flush on
> VMENTER when updated microcode is installed.
>
> If a CPU is not affected by L1TF or if the L1D Flush is not in use, then
> MDS mitigation needs to be invoked explicit.
explicitly.
>
> For these cases, follow the host mitigation state and invoke the MDS
> mitigation before VMENTER.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
> V4 --> V5: Fix changelog
> ---
> arch/x86/kernel/cpu/bugs.c | 1 +
> arch/x86/kvm/vmx/vmx.c | 2 ++
> 2 files changed, 3 insertions(+)
With that:
Reviewed-by: Borislav Petkov <bp@suse.de>
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
^ permalink raw reply [flat|nested] 89+ messages in thread
* [patch V6 09/14] MDS basics 9
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (7 preceding siblings ...)
2019-03-01 21:47 ` [patch V6 08/14] MDS basics 8 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-06 16:14 ` [MODERATED] " Frederic Weisbecker
2019-03-01 21:47 ` [patch V6 10/14] MDS basics 10 Thomas Gleixner
` (6 subsequent siblings)
15 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
Subject: [patch V6 09/14] x86/speculation/mds: Conditionally clear CPU buffers on idle entry
From: Thomas Gleixner <tglx@linutronix.de>
Add a static key which controls the invocation of the CPU buffer clear
mechanism on idle entry. This is independent of other MDS mitigations
because the idle entry invocation to mitigate the potential leakage due to
store buffer repartitioning is only necessary on SMT systems.
Add the actual invocations to the different halt/mwait variants which
covers all usage sites. mwaitx is not patched as it's not available on
Intel CPUs.
The buffer clear is only invoked before entering the C-State to prevent
stale data from the idling CPU from being spilled to the Hyper-Thread
sibling after the store buffer got repartitioned and all entries are
available to the non-idle sibling.
When coming out of idle the store buffer is partitioned again so each
sibling has half of it available. Now the CPU which returned from idle
could be speculatively exposed to contents of the sibling, but the buffers
are flushed either on exit to user space or on VMENTER.
When later on conditional buffer clearing is implemented on top of this,
then there is no action required either because before returning to user
space the context switch will set the condition flag which causes a flush
on the return to user path.
Note that the buffer clearing on idle is only sensible on CPUs which are
solely affected by MSBDS and not any other variant of MDS because the other
MDS variants cannot be mitigated when SMT is enabled, so the buffer
clearing on idle would be a window dressing exercise.
This intentionally does not handle the case in the acpi/processor_idle
driver which uses the legacy IO port interface for C-State transitions for
two reasons:
- The acpi/processor_idle driver was replaced by the intel_idle driver
almost a decade ago. Anything Nehalem upwards supports it and defaults
to that new driver.
- The legacy IO port interface is likely to be used on older and therefore
unaffected CPUs or on systems which do not receive microcode updates
anymore, so there is no point in adding that.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
V4: Export mds_idle_clear
V3: Adjust document wording
---
Documentation/x86/mds.rst | 42 +++++++++++++++++++++++++++++++++++
arch/x86/include/asm/irqflags.h | 4 +++
arch/x86/include/asm/mwait.h | 7 +++++
arch/x86/include/asm/nospec-branch.h | 12 ++++++++++
arch/x86/kernel/cpu/bugs.c | 3 ++
5 files changed, 68 insertions(+)
--- a/Documentation/x86/mds.rst
+++ b/Documentation/x86/mds.rst
@@ -149,3 +149,45 @@ Mitigation points
This takes the paranoid exit path only when the INT1 breakpoint is in
kernel space. #DB on a user space address takes the regular exit path,
so no extra mitigation required.
+
+
+2. C-State transition
+^^^^^^^^^^^^^^^^^^^^^
+
+ When a CPU goes idle and enters a C-State the CPU buffers need to be
+ cleared on affected CPUs when SMT is active. This addresses the
+ repartitioning of the store buffer when one of the Hyper-Threads enters
+ a C-State.
+
+ When SMT is inactive, i.e. either the CPU does not support it or all
+ sibling threads are offline, CPU buffer clearing is not required.
+
+ The idle clearing is enabled on CPUs which are only affected by MSBDS
+ and not by any other MDS variant. The other MDS variants cannot be
+ protected against cross Hyper-Thread attacks because the Fill Buffer and
+ the Load Ports are shared. So on CPUs affected by other variants, the
+ idle clearing would be a window dressing exercise and is therefore not
+ activated.
+
+ The invocation is controlled by the static key mds_idle_clear which is
+ switched depending on the chosen mitigation mode and the SMT state of
+ the system.
+
+ The buffer clear is only invoked before entering the C-State to prevent
+ stale data from the idling CPU from spilling to the Hyper-Thread
+ sibling after the store buffer got repartitioned and all entries are
+ available to the non idle sibling.
+
+ When coming out of idle the store buffer is partitioned again so each
+ sibling has half of it available. The CPU coming back from idle could
+ then be speculatively exposed to contents of the sibling. The buffers are
+ flushed either on exit to user space or on VMENTER so malicious code
+ in user space or the guest cannot speculatively access them.
+
+ The mitigation is hooked into all variants of halt()/mwait(), but does
+ not cover the legacy ACPI IO-Port mechanism because the ACPI idle driver
+ was superseded by the intel_idle driver around 2010 and is
+ preferred on all affected CPUs which are expected to gain the MD_CLEAR
+ functionality in microcode. Aside from that, the IO-Port mechanism is a
+ legacy interface which is only used on older systems which are either
+ not affected or do not receive microcode updates anymore.
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -6,6 +6,8 @@
#ifndef __ASSEMBLY__
+#include <asm/nospec-branch.h>
+
/* Provide __cpuidle; we can't safely include <linux/cpu.h> */
#define __cpuidle __attribute__((__section__(".cpuidle.text")))
@@ -54,11 +56,13 @@ static inline void native_irq_enable(voi
static inline __cpuidle void native_safe_halt(void)
{
+ mds_idle_clear_cpu_buffers();
asm volatile("sti; hlt": : :"memory");
}
static inline __cpuidle void native_halt(void)
{
+ mds_idle_clear_cpu_buffers();
asm volatile("hlt": : :"memory");
}
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -6,6 +6,7 @@
#include <linux/sched/idle.h>
#include <asm/cpufeature.h>
+#include <asm/nospec-branch.h>
#define MWAIT_SUBSTATE_MASK 0xf
#define MWAIT_CSTATE_MASK 0xf
@@ -40,6 +41,8 @@ static inline void __monitorx(const void
static inline void __mwait(unsigned long eax, unsigned long ecx)
{
+ mds_idle_clear_cpu_buffers();
+
/* "mwait %eax, %ecx;" */
asm volatile(".byte 0x0f, 0x01, 0xc9;"
:: "a" (eax), "c" (ecx));
@@ -74,6 +77,8 @@ static inline void __mwait(unsigned long
static inline void __mwaitx(unsigned long eax, unsigned long ebx,
unsigned long ecx)
{
+ /* No MDS buffer clear as this is AMD/HYGON only */
+
/* "mwaitx %eax, %ebx, %ecx;" */
asm volatile(".byte 0x0f, 0x01, 0xfb;"
:: "a" (eax), "b" (ebx), "c" (ecx));
@@ -81,6 +86,8 @@ static inline void __mwaitx(unsigned lon
static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
{
+ mds_idle_clear_cpu_buffers();
+
trace_hardirqs_on();
/* "mwait %eax, %ecx;" */
asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -319,6 +319,7 @@ DECLARE_STATIC_KEY_FALSE(switch_mm_cond_
DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
DECLARE_STATIC_KEY_FALSE(mds_user_clear);
+DECLARE_STATIC_KEY_FALSE(mds_idle_clear);
#include <asm/segment.h>
@@ -356,6 +357,17 @@ static inline void mds_user_clear_cpu_bu
mds_clear_cpu_buffers();
}
+/**
+ * mds_idle_clear_cpu_buffers - Mitigation for MDS vulnerability
+ *
+ * Clear CPU buffers if the corresponding static key is enabled
+ */
+static inline void mds_idle_clear_cpu_buffers(void)
+{
+ if (static_branch_likely(&mds_idle_clear))
+ mds_clear_cpu_buffers();
+}
+
#endif /* __ASSEMBLY__ */
/*
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -66,6 +66,9 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_always
/* Control MDS CPU buffer clear before returning to user space */
DEFINE_STATIC_KEY_FALSE(mds_user_clear);
EXPORT_SYMBOL_GPL(mds_user_clear);
+/* Control MDS CPU buffer clear before idling (halt, mwait) */
+DEFINE_STATIC_KEY_FALSE(mds_idle_clear);
+EXPORT_SYMBOL_GPL(mds_idle_clear);
void __init check_bugs(void)
{
^ permalink raw reply [flat|nested] 89+ messages in thread
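To make the ordering argument concrete, the patched native_safe_halt()
boils down to the following (condensed sketch; the __cpuidle annotation
is dropped for brevity):

static inline void native_safe_halt(void)
{
        /*
         * Clear the buffers before this thread's store buffer half is
         * handed over to the sibling; a no-op unless the mds_idle_clear
         * static key is enabled.
         */
        mds_idle_clear_cpu_buffers();
        asm volatile("sti; hlt" : : : "memory");
}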
* [MODERATED] Re: [patch V6 09/14] MDS basics 9
2019-03-01 21:47 ` [patch V6 09/14] MDS basics 9 Thomas Gleixner
@ 2019-03-06 16:14 ` Frederic Weisbecker
0 siblings, 0 replies; 89+ messages in thread
From: Frederic Weisbecker @ 2019-03-06 16:14 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:47PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V6 09/14] x86/speculation/mds: Conditionally clear CPU buffers on idle entry
> From: Thomas Gleixner <tglx@linutronix.de>
>
> Add a static key which controls the invocation of the CPU buffer clear
> mechanism on idle entry. This is independent of other MDS mitigations
> because the idle entry invocation to mitigate the potential leakage due to
> store buffer repartitioning is only necessary on SMT systems.
>
> Add the actual invocations to the different halt/mwait variants which
> covers all usage sites. mwaitx is not patched as it's not available on
> Intel CPUs.
>
> The buffer clear is only invoked before entering the C-State to prevent
> stale data from the idling CPU from being spilled to the Hyper-Thread
> sibling after the store buffer got repartitioned and all entries are
> available to the non-idle sibling.
>
> When coming out of idle the store buffer is partitioned again so each
> sibling has half of it available. Now the CPU which returned from idle
> could be speculatively exposed to contents of the sibling, but the buffers
> are flushed either on exit to user space or on VMENTER.
>
> When later on conditional buffer clearing is implemented on top of this,
> then there is no action required either because before returning to user
> space the context switch will set the condition flag which causes a flush
> on the return to user path.
>
> Note that the buffer clearing on idle is only sensible on CPUs which are
> solely affected by MSBDS and not any other variant of MDS because the other
> MDS variants cannot be mitigated when SMT is enabled, so the buffer
> clearing on idle would be a window dressing exercise.
>
> This intentionally does not handle the case in the acpi/processor_idle
> driver which uses the legacy IO port interface for C-State transitions for
> two reasons:
>
> - The acpi/processor_idle driver was replaced by the intel_idle driver
> almost a decade ago. Anything Nehalem upwards supports it and defaults
> to that new driver.
>
> - The legacy IO port interface is likely to be used on older and therefore
> unaffected CPUs or on systems which do not receive microcode updates
> anymore, so there is no point in adding that.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Borislav Petkov <bp@suse.de>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
^ permalink raw reply [flat|nested] 89+ messages in thread
* [patch V6 10/14] MDS basics 10
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (8 preceding siblings ...)
2019-03-01 21:47 ` [patch V6 09/14] MDS basics 9 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-04 6:45 ` [MODERATED] Encrypted Message Jon Masters
` (2 more replies)
2019-03-01 21:47 ` [patch V6 11/14] MDS basics 11 Thomas Gleixner
` (5 subsequent siblings)
15 siblings, 3 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
Subject: [patch V6 10/14] x86/speculation/mds: Add mitigation control for MDS
From: Thomas Gleixner <tglx@linutronix.de>
Now that the mitigations are in place, add a command line parameter to
control the mitigation, a mitigation selector function and an SMT update
mechanism.
This is the minimal straightforward initial implementation which just
provides an always on/off mode. The command line parameter is:
mds=[full|off]
This is consistent with the existing mitigations for other speculative
hardware vulnerabilities.
The idle invocation is dynamically updated according to the SMT state of
the system similar to the dynamic update of the STIBP mitigation. The idle
mitigation is limited to CPUs which are only affected by MSBDS and not any
other variant, because the other variants cannot be mitigated on SMT
enabled systems.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5 --> V6: Make idle clearing depend on BUG_MSBDS_ONLY
V4 --> V5: Remove 'auto'
---
Documentation/admin-guide/kernel-parameters.txt | 22 +++++++
arch/x86/include/asm/processor.h | 5 +
arch/x86/kernel/cpu/bugs.c | 68 ++++++++++++++++++++++++
3 files changed, 95 insertions(+)
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2356,6 +2356,28 @@
Format: <first>,<last>
Specifies range of consoles to be captured by the MDA.
+ mds= [X86,INTEL]
+ Control mitigation for the Micro-architectural Data
+ Sampling (MDS) vulnerability.
+
+ Certain CPUs are vulnerable to an exploit against CPU
+ internal buffers which can forward information to a
+ disclosure gadget under certain conditions.
+
+ In vulnerable processors, the speculatively
+ forwarded data can be used in a cache side channel
+ attack, to access data to which the attacker does
+ not have direct access.
+
+ This parameter controls the MDS mitigation. The
+ options are:
+
+ full - Enable MDS mitigation on vulnerable CPUs
+ off - Unconditionally disable MDS mitigation
+
+ Not specifying this option is equivalent to
+ mds=full.
+
mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory
Amount of memory to be used when the kernel is not able
to see the whole system memory or for test.
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -992,4 +992,9 @@ enum l1tf_mitigations {
extern enum l1tf_mitigations l1tf_mitigation;
+enum mds_mitigations {
+ MDS_MITIGATION_OFF,
+ MDS_MITIGATION_FULL,
+};
+
#endif /* _ASM_X86_PROCESSOR_H */
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -37,6 +37,7 @@
static void __init spectre_v2_select_mitigation(void);
static void __init ssb_select_mitigation(void);
static void __init l1tf_select_mitigation(void);
+static void __init mds_select_mitigation(void);
/* The base value of the SPEC_CTRL MSR that always has to be preserved. */
u64 x86_spec_ctrl_base;
@@ -108,6 +109,8 @@ void __init check_bugs(void)
l1tf_select_mitigation();
+ mds_select_mitigation();
+
#ifdef CONFIG_X86_32
/*
* Check whether we are able to run this kernel safely on SMP.
@@ -214,6 +217,50 @@ static void x86_amd_ssb_disable(void)
}
#undef pr_fmt
+#define pr_fmt(fmt) "MDS: " fmt
+
+/* Default mitigation for MDS-affected CPUs */
+static enum mds_mitigations mds_mitigation __ro_after_init = MDS_MITIGATION_FULL;
+
+static const char * const mds_strings[] = {
+ [MDS_MITIGATION_OFF] = "Vulnerable",
+ [MDS_MITIGATION_FULL] = "Mitigation: Clear CPU buffers"
+};
+
+static void mds_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_MDS)) {
+ mds_mitigation = MDS_MITIGATION_OFF;
+ return;
+ }
+
+ if (mds_mitigation == MDS_MITIGATION_FULL) {
+ if (boot_cpu_has(X86_FEATURE_MD_CLEAR))
+ static_branch_enable(&mds_user_clear);
+ else
+ mds_mitigation = MDS_MITIGATION_OFF;
+ }
+ pr_info("%s\n", mds_strings[mds_mitigation]);
+}
+
+static int __init mds_cmdline(char *str)
+{
+ if (!boot_cpu_has_bug(X86_BUG_MDS))
+ return 0;
+
+ if (!str)
+ return -EINVAL;
+
+ if (!strcmp(str, "off"))
+ mds_mitigation = MDS_MITIGATION_OFF;
+ else if (!strcmp(str, "full"))
+ mds_mitigation = MDS_MITIGATION_FULL;
+
+ return 0;
+}
+early_param("mds", mds_cmdline);
+
+#undef pr_fmt
#define pr_fmt(fmt) "Spectre V2 : " fmt
static enum spectre_v2_mitigation spectre_v2_enabled __ro_after_init =
@@ -617,6 +664,24 @@ static void update_indir_branch_cond(voi
static_branch_disable(&switch_to_cond_stibp);
}
+/* Update the static key controlling the MDS CPU buffer clear in idle */
+static void update_mds_branch_idle(void)
+{
+ /*
+ * Enable the idle clearing on CPUs which are affected only by
+ * MDBDS and not any other MDS variant. The other variants cannot
+ * be mitigated when SMT is enabled, so clearing the buffers on
+ * idle would be a window dressing exercise.
+ */
+ if (!boot_cpu_has(X86_BUG_MSBDS_ONLY))
+ return;
+
+ if (sched_smt_active())
+ static_branch_enable(&mds_idle_clear);
+ else
+ static_branch_disable(&mds_idle_clear);
+}
+
void arch_smt_update(void)
{
/* Enhanced IBRS implies STIBP. No update required. */
@@ -638,6 +703,9 @@ void arch_smt_update(void)
break;
}
+ if (mds_mitigation == MDS_MITIGATION_FULL)
+ update_mds_branch_idle();
+
mutex_unlock(&spec_ctrl_mutex);
}
^ permalink raw reply [flat|nested] 89+ messages in thread
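Exercising the new switch on an affected machine is straightforward; the
sysfs file shown below is added by the reporting patch later in this
series, so the path is stated here under that assumption:

        # boot with the mitigation disabled
        mds=off

        # read back the resulting state after boot
        cat /sys/devices/system/cpu/vulnerabilities/mds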
* [MODERATED] Encrypted Message
2019-03-01 21:47 ` [patch V6 10/14] MDS basics 10 Thomas Gleixner
@ 2019-03-04 6:45 ` Jon Masters
2019-03-05 18:42 ` [MODERATED] Re: [patch V6 10/14] MDS basics 10 Andrea Arcangeli
2019-03-06 14:31 ` [MODERATED] " Borislav Petkov
2 siblings, 0 replies; 89+ messages in thread
From: Jon Masters @ 2019-03-04 6:45 UTC (permalink / raw)
To: speck
From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 10/14] MDS basics 10
On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
> + /*
> + * Enable the idle clearing on CPUs which are affected only by
> + * MDBDS and not any other MDS variant. The other variants cannot
^^^^^
MSBDS
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [patch V6 10/14] MDS basics 10
2019-03-01 21:47 ` [patch V6 10/14] MDS basics 10 Thomas Gleixner
2019-03-04 6:45 ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-05 18:42 ` Andrea Arcangeli
2019-03-06 19:15 ` Thomas Gleixner
2019-03-06 14:31 ` [MODERATED] " Borislav Petkov
2 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2019-03-05 18:42 UTC (permalink / raw)
To: speck
Hi Thomas,
On Fri, Mar 01, 2019 at 10:47:48PM +0100, speck for Thomas Gleixner wrote:
> +/* Update the static key controlling the MDS CPU buffer clear in idle */
> +static void update_mds_branch_idle(void)
> +{
> + /*
> + * Enable the idle clearing on CPUs which are affected only by
> + * MDBDS and not any other MDS variant. The other variants cannot
> + * be mitigated when SMT is enabled, so clearing the buffers on
> + * idle would be a window dressing exercise.
> + */
> + if (!boot_cpu_has(X86_BUG_MSBDS_ONLY))
> + return;
> +
> + if (sched_smt_active())
> + static_branch_enable(&mds_idle_clear);
Do you think it's worth also clearing
MSR_MISC_FEATURES_ENABLES_RING3MWAIT_BIT by setting
ring3mwait_disabled when sched_smt_active() is true above?
I don't expect anybody will pass manually ring3mwait=disable to the
kernel on XEON_PHI_KNL/XEON_PHI_KNM. I'm not aware of any app using
the user mwait, which also makes this not a big deal.. but it goes
both ways, it's also not a big deal for userland to turn it off when
we report SMT is enabled and safe in sysfs.
Thanks,
Andrea
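A rough sketch of the suggested tweak, assuming the per-CPU MSR shadow and bit name used by the existing ring3mwait support (hypothetical and untested, not part of this series):

  /*
   * Hypothetical: revoke the user space MWAIT opt-in when SMT is
   * active, since the kernel cannot clear the CPU buffers when user
   * space enters MWAIT directly from ring 3.
   */
  if (sched_smt_active() && boot_cpu_has(X86_FEATURE_RING3MWAIT)) {
  	u64 msrval;

  	msrval = this_cpu_read(msr_misc_features_shadow);
  	msrval &= ~(1ULL << MSR_MISC_FEATURES_ENABLES_RING3MWAIT_BIT);
  	this_cpu_write(msr_misc_features_shadow, msrval);
  	wrmsrl(MSR_MISC_FEATURES_ENABLES, msrval);
  }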
* Re: [patch V6 10/14] MDS basics 10
2019-03-05 18:42 ` [MODERATED] Re: [patch V6 10/14] MDS basics 10 Andrea Arcangeli
@ 2019-03-06 19:15 ` Thomas Gleixner
0 siblings, 0 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-06 19:15 UTC (permalink / raw)
To: speck
Andrea,
On Tue, 5 Mar 2019, speck for Andrea Arcangeli wrote:
> Hi Thomas,
>
> On Fri, Mar 01, 2019 at 10:47:48PM +0100, speck for Thomas Gleixner wrote:
> > +/* Update the static key controlling the MDS CPU buffer clear in idle */
> > +static void update_mds_branch_idle(void)
> > +{
> > + /*
> > + * Enable the idle clearing on CPUs which are affected only by
> > + * MDBDS and not any other MDS variant. The other variants cannot
> > + * be mitigated when SMT is enabled, so clearing the buffers on
> > + * idle would be a window dressing exercise.
> > + */
> > + if (!boot_cpu_has(X86_BUG_MSBDS_ONLY))
> > + return;
> > +
> > + if (sched_smt_active())
> > + static_branch_enable(&mds_idle_clear);
>
> Do you think it's worth also clearing
> MSR_MISC_FEATURES_ENABLES_RING3MWAIT_BIT by setting
> ring3mwait_disabled when sched_smt_active() is true above?
Not sure.
> I don't expect anybody will pass manually ring3mwait=disable to the
> kernel on XEON_PHI_KNL/XEON_PHI_KNM. I'm not aware of any app using
> the user mwait, which also makes this not a big deal.. but it goes
> both ways, it's also not a big deal for userland to turn it off when
> we report SMT is enabled and safe in sysfs.
True, and as usual we don't really know what people are doing;
wrecking existing applications which rely on that would not be nice.
Thanks,
tglx
* [MODERATED] Re: [patch V6 10/14] MDS basics 10
2019-03-01 21:47 ` [patch V6 10/14] MDS basics 10 Thomas Gleixner
2019-03-04 6:45 ` [MODERATED] Encrypted Message Jon Masters
2019-03-05 18:42 ` [MODERATED] Re: [patch V6 10/14] MDS basics 10 Andrea Arcangeli
@ 2019-03-06 14:31 ` Borislav Petkov
2019-03-06 15:30 ` Thomas Gleixner
2 siblings, 1 reply; 89+ messages in thread
From: Borislav Petkov @ 2019-03-06 14:31 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:48PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V6 10/14] x86/speculation/mds: Add mitigation control for MDS
> From: Thomas Gleixner <tglx@linutronix.de>
>
> Now that the mitigations are in place, add a command line parameter to
> control the mitigation, a mitigation selector function and a SMT update
> mechanism.
>
> This is the minimal straight forward initial implementation which just
> provides an always on/off mode. The command line parameter is:
>
> mds=[full|off]
>
> This is consistent with the existing mitigations for other speculative
> hardware vulnerabilities.
>
> The idle invocation is dynamically updated according to the SMT state of
> the system similar to the dynamic update of the STIBP mitigation. The idle
> mitigation is limited to CPUs which are only affected by MSBDS and not any
> other variant, because the other variants cannot be mitigated on SMT
> enabled systems.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V5 --> V6: Make idle clearing depend on BUG_MSBDS_ONLY
> V4 --> V5: Remove 'auto'
> ---
...
> @@ -617,6 +664,24 @@ static void update_indir_branch_cond(voi
> static_branch_disable(&switch_to_cond_stibp);
> }
>
> +/* Update the static key controlling the MDS CPU buffer clear in idle */
> +static void update_mds_branch_idle(void)
> +{
> + /*
> + * Enable the idle clearing on CPUs which are affected only by
> + * MDBDS and not any other MDS variant. The other variants cannot
> + * be mitigated when SMT is enabled,
... but we're not enabling the key when SMT on those is disabled,
AFAICT. Or is that coming later?
> so clearing the buffers on
> + * idle would be a window dressing exercise.
> + */
> + if (!boot_cpu_has(X86_BUG_MSBDS_ONLY))
if (!boot_cpu_has_bug
> + return;
> +
> + if (sched_smt_active())
> + static_branch_enable(&mds_idle_clear);
> + else
> + static_branch_disable(&mds_idle_clear);
> +}
> +
> void arch_smt_update(void)
> {
> /* Enhanced IBRS implies STIBP. No update required. */
> @@ -638,6 +703,9 @@ void arch_smt_update(void)
> break;
> }
>
> + if (mds_mitigation == MDS_MITIGATION_FULL)
> + update_mds_branch_idle();
> +
> mutex_unlock(&spec_ctrl_mutex);
> }
>
>
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
* Re: [patch V6 10/14] MDS basics 10
2019-03-06 14:31 ` [MODERATED] " Borislav Petkov
@ 2019-03-06 15:30 ` Thomas Gleixner
2019-03-06 18:35 ` Thomas Gleixner
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-06 15:30 UTC (permalink / raw)
To: speck
On Wed, 6 Mar 2019, speck for Borislav Petkov wrote:
> On Fri, Mar 01, 2019 at 10:47:48PM +0100, speck for Thomas Gleixner wrote:
> > +/* Update the static key controlling the MDS CPU buffer clear in idle */
> > +static void update_mds_branch_idle(void)
> > +{
> > + /*
> > + * Enable the idle clearing on CPUs which are affected only by
> > + * MDBDS and not any other MDS variant. The other variants cannot
> > + * be mitigated when SMT is enabled,
>
> ... but we're not enabling the key when SMT on those is disabled,
> AFAICT. Or is that coming later?
Five lines down ....
> > so clearing the buffers on
> > + * idle would be a window dressing exercise.
> > + */
> > + if (!boot_cpu_has(X86_BUG_MSBDS_ONLY))
>
> if (!boot_cpu_has_bug
Fixed.
> > + return;
> > +
> > + if (sched_smt_active())
... here is the decision whether to enable or disable.
> > + static_branch_enable(&mds_idle_clear);
> > + else
> > + static_branch_disable(&mds_idle_clear);
> > +}
Thanks,
tglx
* Re: [patch V6 10/14] MDS basics 10
2019-03-06 15:30 ` Thomas Gleixner
@ 2019-03-06 18:35 ` Thomas Gleixner
2019-03-06 19:34 ` [MODERATED] Re: " Borislav Petkov
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-06 18:35 UTC (permalink / raw)
To: speck
On Wed, 6 Mar 2019, speck for Thomas Gleixner wrote:
> On Wed, 6 Mar 2019, speck for Borislav Petkov wrote:
> > On Fri, Mar 01, 2019 at 10:47:48PM +0100, speck for Thomas Gleixner wrote:
> > > +/* Update the static key controlling the MDS CPU buffer clear in idle */
> > > +static void update_mds_branch_idle(void)
> > > +{
> > > + /*
> > > + * Enable the idle clearing on CPUs which are affected only by
> > > + * MDBDS and not any other MDS variant. The other variants cannot
> > > + * be mitigated when SMT is enabled,
> >
> > ... but we're not enabling the key when SMT on those is disabled,
> > AFAICT. Or is that coming later?
>
> Five lines down ....
Following up on our conversation on IRC, I've reworded the comment:
/*
* Enable the idle clearing if SMT is active on CPUs which are
* affected only by MSBDS and not any other MDS variant.
*
* The other variants cannot be mitigated when SMT is enabled, so
* clearing the buffers on idle just to prevent the Store Buffer
* repartitioning leak would be a window dressing exercise.
*/
if (!boot_cpu_has_bug(X86_BUG_MSBDS_ONLY))
return;
Thanks,
tglx
* [MODERATED] Re: Re: [patch V6 10/14] MDS basics 10
2019-03-06 18:35 ` Thomas Gleixner
@ 2019-03-06 19:34 ` Borislav Petkov
0 siblings, 0 replies; 89+ messages in thread
From: Borislav Petkov @ 2019-03-06 19:34 UTC (permalink / raw)
To: speck
On Wed, Mar 06, 2019 at 07:35:26PM +0100, speck for Thomas Gleixner wrote:
> Following up on our conversation on IRC, I've reworded the comment:
>
> /*
> * Enable the idle clearing if SMT is active on CPUs which are
> * affected only by MSBDS and not any other MDS variant.
> *
> * The other variants cannot be mitigated when SMT is enabled, so
> * clearing the buffers on idle just to prevent the Store Buffer
> * repartitioning leak would be a window dressing exercise.
> */
> if (!boot_cpu_has_bug(X86_BUG_MSBDS_ONLY))
> return;
Yap, looks good.
With that addressed:
Reviewed-by: Borislav Petkov <bp@suse.de>
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
* [patch V6 11/14] MDS basics 11
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (9 preceding siblings ...)
2019-03-01 21:47 ` [patch V6 10/14] MDS basics 10 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-01 21:47 ` [patch V6 12/14] MDS basics 12 Thomas Gleixner
` (4 subsequent siblings)
15 siblings, 0 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
Subject: [patch V6 11/14] x86/speculation/mds: Add sysfs reporting for MDS
From: Thomas Gleixner <tglx@linutronix.de>
Add the sysfs reporting file for MDS. It exposes the vulnerability and
mitigation state similar to the existing files for the other speculative
hardware vulnerabilities.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
---
V3: Copy & Paste done right :(
---
Documentation/ABI/testing/sysfs-devices-system-cpu | 1
arch/x86/kernel/cpu/bugs.c | 25 +++++++++++++++++++++
drivers/base/cpu.c | 8 ++++++
include/linux/cpu.h | 2 +
4 files changed, 36 insertions(+)
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -484,6 +484,7 @@ What: /sys/devices/system/cpu/vulnerabi
/sys/devices/system/cpu/vulnerabilities/spectre_v2
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
/sys/devices/system/cpu/vulnerabilities/l1tf
+ /sys/devices/system/cpu/vulnerabilities/mds
Date: January 2018
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description: Information about CPU vulnerabilities
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1170,6 +1170,22 @@ static ssize_t l1tf_show_state(char *buf
}
#endif
+static ssize_t mds_show_state(char *buf)
+{
+ if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
+ return sprintf(buf, "%s; SMT Host state unknown\n",
+ mds_strings[mds_mitigation]);
+ }
+
+ if (boot_cpu_has(X86_BUG_MSBDS_ONLY)) {
+ return sprintf(buf, "%s; SMT %s\n", mds_strings[mds_mitigation],
+ sched_smt_active() ? "mitigated" : "disabled");
+ }
+
+ return sprintf(buf, "%s; SMT %s\n", mds_strings[mds_mitigation],
+ sched_smt_active() ? "vulnerable" : "disabled");
+}
+
static char *stibp_state(void)
{
if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED)
@@ -1236,6 +1252,10 @@ static ssize_t cpu_show_common(struct de
if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
return l1tf_show_state(buf);
break;
+
+ case X86_BUG_MDS:
+ return mds_show_state(buf);
+
default:
break;
}
@@ -1267,4 +1287,9 @@ ssize_t cpu_show_l1tf(struct device *dev
{
return cpu_show_common(dev, attr, buf, X86_BUG_L1TF);
}
+
+ssize_t cpu_show_mds(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_MDS);
+}
#endif
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -546,11 +546,18 @@ ssize_t __weak cpu_show_l1tf(struct devi
return sprintf(buf, "Not affected\n");
}
+ssize_t __weak cpu_show_mds(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "Not affected\n");
+}
+
static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
+static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL);
static struct attribute *cpu_root_vulnerabilities_attrs[] = {
&dev_attr_meltdown.attr,
@@ -558,6 +565,7 @@ static struct attribute *cpu_root_vulner
&dev_attr_spectre_v2.attr,
&dev_attr_spec_store_bypass.attr,
&dev_attr_l1tf.attr,
+ &dev_attr_mds.attr,
NULL
};
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -57,6 +57,8 @@ extern ssize_t cpu_show_spec_store_bypas
struct device_attribute *attr, char *buf);
extern ssize_t cpu_show_l1tf(struct device *dev,
struct device_attribute *attr, char *buf);
+extern ssize_t cpu_show_mds(struct device *dev,
+ struct device_attribute *attr, char *buf);
extern __printf(4, 5)
struct device *cpu_device_create(struct device *parent, void *drvdata,
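With this in place an affected machine reports its state via the new file; for example, with the full mitigation active and SMT enabled:

  $ cat /sys/devices/system/cpu/vulnerabilities/mds
  Mitigation: Clear CPU buffers; SMT vulnerable

An MSBDS-only part reports 'SMT mitigated' instead, and a kernel running in a VM appends 'SMT Host state unknown', matching the branches in mds_show_state() above.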
* [patch V6 12/14] MDS basics 12
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (10 preceding siblings ...)
2019-03-01 21:47 ` [patch V6 11/14] MDS basics 11 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-04 5:47 ` [MODERATED] Encrypted Message Jon Masters
` (2 more replies)
2019-03-01 21:47 ` [patch V6 13/14] MDS basics 13 Thomas Gleixner
` (3 subsequent siblings)
15 siblings, 3 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
Subject: [patch V6 12/14] x86/speculation/mds: Add mitigation mode VMWERV
From: Thomas Gleixner <tglx@linutronix.de>
In virtualized environments it can happen that the host has the microcode
update which utilizes the VERW instruction to clear CPU buffers, but the
hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
to guests.
Introduce an internal mitigation mode VWWERV which enables the invocation
of the CPU buffer clearing even if X86_FEATURE_MD_CLEAR is not set. If the
system has no updated microcode this results in a pointless execution of
the VERW instruction wasting a few CPU cycles. If the microcode is updated,
but not exposed to a guest then the CPU buffers will be cleared.
That said: Virtual Machines Will Eventually Receive Vaccine
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2 -> V3: Rename mode.
---
Documentation/x86/mds.rst | 27 +++++++++++++++++++++++++++
arch/x86/include/asm/processor.h | 1 +
arch/x86/kernel/cpu/bugs.c | 18 ++++++++++++------
3 files changed, 40 insertions(+), 6 deletions(-)
--- a/Documentation/x86/mds.rst
+++ b/Documentation/x86/mds.rst
@@ -93,11 +93,38 @@ enters a C-state.
The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
(idle) transitions.
+As a special quirk to address virtualization scenarios where the host has
+the microcode updated, but the hypervisor does not (yet) expose the
+MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
+hope that it might actually clear the buffers. The state is reflected
+accordingly.
+
According to current knowledge additional mitigations inside the kernel
itself are not required because the necessary gadgets to expose the leaked
data cannot be controlled in a way which allows exploitation from malicious
user space or VM guests.
+Kernel internal mitigation modes
+--------------------------------
+
+ ======= ============================================================
+ off Mitigation is disabled. Either the CPU is not affected or
+ mds=off is supplied on the kernel command line
+
+ full Mitigation is enabled. CPU is affected and MD_CLEAR is
+ advertised in CPUID.
+
+ vmwerv Mitigation is enabled. CPU is affected and MD_CLEAR is not
+ advertised in CPUID. That is mainly for virtualization
+ scenarios where the host has the updated microcode but the
+ hypervisor does not expose MD_CLEAR in CPUID. It's a best
+ effort approach without guarantee.
+ ======= ============================================================
+
+If the CPU is affected and mds=off is not supplied on the kernel command
+line then the kernel selects the appropriate mitigation mode depending on
+the availability of the MD_CLEAR CPUID bit.
+
Mitigation points
-----------------
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -995,6 +995,7 @@ extern enum l1tf_mitigations l1tf_mitiga
enum mds_mitigations {
MDS_MITIGATION_OFF,
MDS_MITIGATION_FULL,
+ MDS_MITIGATION_VMWERV,
};
#endif /* _ASM_X86_PROCESSOR_H */
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -224,7 +224,8 @@ static enum mds_mitigations mds_mitigati
static const char * const mds_strings[] = {
[MDS_MITIGATION_OFF] = "Vulnerable",
- [MDS_MITIGATION_FULL] = "Mitigation: Clear CPU buffers"
+ [MDS_MITIGATION_FULL] = "Mitigation: Clear CPU buffers",
+ [MDS_MITIGATION_VMWERV] = "Vulnerable: Clear CPU buffers attempted, no microcode",
};
static void mds_select_mitigation(void)
@@ -235,10 +236,9 @@ static void mds_select_mitigation(void)
}
if (mds_mitigation == MDS_MITIGATION_FULL) {
- if (boot_cpu_has(X86_FEATURE_MD_CLEAR))
- static_branch_enable(&mds_user_clear);
- else
- mds_mitigation = MDS_MITIGATION_OFF;
+ if (!boot_cpu_has(X86_FEATURE_MD_CLEAR))
+ mds_mitigation = MDS_MITIGATION_VMWERV;
+ static_branch_enable(&mds_user_clear);
}
pr_info("%s\n", mds_strings[mds_mitigation]);
}
@@ -703,8 +703,14 @@ void arch_smt_update(void)
break;
}
- if (mds_mitigation == MDS_MITIGATION_FULL)
+ switch(mds_mitigation) {
+ case MDS_MITIGATION_FULL:
+ case MDS_MITIGATION_VMWERV:
update_mds_branch_idle();
+ break;
+ case MDS_MITIGATION_OFF:
+ break;
+ }
mutex_unlock(&spec_ctrl_mutex);
}
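The net effect is that in VMWERV mode the mds_user_clear key is enabled even though MD_CLEAR is not advertised, so the same transition points attempt the clear. A minimal sketch of the user return consumer, with helper names as used earlier in this series:

  /* On the return to user space path: */
  static inline void mds_user_clear_cpu_buffers(void)
  {
  	/* VERW; just wasted cycles without updated microcode */
  	if (static_branch_likely(&mds_user_clear))
  		mds_clear_cpu_buffers();
  }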
* [MODERATED] Encrypted Message
2019-03-01 21:47 ` [patch V6 12/14] MDS basics 12 Thomas Gleixner
@ 2019-03-04 5:47 ` Jon Masters
2019-03-05 16:04 ` Thomas Gleixner
2019-03-05 16:40 ` [MODERATED] Re: [patch V6 12/14] MDS basics 12 mark gross
2019-03-06 14:42 ` Borislav Petkov
2 siblings, 1 reply; 89+ messages in thread
From: Jon Masters @ 2019-03-04 5:47 UTC (permalink / raw)
To: speck
From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 12/14] MDS basics 12
On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
> Subject: [patch V6 12/14] x86/speculation/mds: Add mitigation mode VMWERV
> From: Thomas Gleixner <tglx@linutronix.de>
>
> In virtualized environments it can happen that the host has the microcode
> update which utilizes the VERW instruction to clear CPU buffers, but the
> hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
> to guests.
>
> Introduce an internal mitigation mode VWWERV which enables the invocation
> of the CPU buffer clearing even if X86_FEATURE_MD_CLEAR is not set. If the
> system has no updated microcode this results in a pointless execution of
> the VERW instruction wasting a few CPU cycles. If the microcode is updated,
> but not exposed to a guest then the CPU buffers will be cleared.
>
> That said: Virtual Machines Will Eventually Receive Vaccine
The effect of this patch, currently, is that a (bare metal) machine
without updated ucode will print the following:
[ 1.576602] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
The intention of the patch is to say "hey, you might be on a VM, so
we'll try anyway in case we didn't get told you had MD_CLEAR". But the
effect on bare metal might be ambiguous. It's reasonable (for someone
else) to assume we might be using a software sequence to try flushing.
Perhaps the wording should convey something like:
"MDS: Vulnerable: Clear CPU buffers may not work, no microcode"
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
* Re: Encrypted Message
2019-03-04 5:47 ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-05 16:04 ` Thomas Gleixner
0 siblings, 0 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-05 16:04 UTC (permalink / raw)
To: speck
On Mon, 4 Mar 2019, speck for Jon Masters wrote:
> > That said: Virtual Machines Will Eventually Receive Vaccine
>
> The effect of this patch, currently, is that a (bare metal) machine
> without updated ucode will print the following:
>
> [ 1.576602] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
>
> The intention of the patch is to say "hey, you might be on a VM, so
> we'll try anyway in case we didn't get told you had MD_CLEAR". But the
> effect on bare metal might be ambiguous. It's reasonable (for someone
> else) to assume we might be using a software sequence to try flushing.
>
> Perhaps the wording should convey something like:
>
> "MDS: Vulnerable: Clear CPU buffers may not work, no microcode"
Yeah, we also could do something like the delta patch below:
Thanks,
tglx
8<------------------
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -228,18 +228,28 @@ static const char * const mds_strings[]
[MDS_MITIGATION_VMWERV] = "Vulnerable: Clear CPU buffers attempted, no microcode",
};
-static void mds_select_mitigation(void)
+static void __init mds_check_md_clear(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_MD_CLEAR)) {
+ if (hypervisor_is_type(X86_HYPER_NATIVE)) {
+ mds_mitigation = MDS_MITIGATION_OFF;
+ return;
+ }
+ mds_mitigation = MDS_MITIGATION_VMWERV;
+ }
+ static_branch_enable(&mds_user_clear);
+}
+
+static void __init mds_select_mitigation(void)
{
if (!boot_cpu_has_bug(X86_BUG_MDS)) {
mds_mitigation = MDS_MITIGATION_OFF;
return;
}
- if (mds_mitigation == MDS_MITIGATION_FULL) {
- if (!boot_cpu_has(X86_FEATURE_MD_CLEAR))
- mds_mitigation = MDS_MITIGATION_VMWERV;
- static_branch_enable(&mds_user_clear);
- }
+ if (mds_mitigation == MDS_MITIGATION_FULL)
+ mds_check_md_clear();
+
pr_info("%s\n", mds_strings[mds_mitigation]);
}
* [MODERATED] Re: [patch V6 12/14] MDS basics 12
2019-03-01 21:47 ` [patch V6 12/14] MDS basics 12 Thomas Gleixner
2019-03-04 5:47 ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-05 16:40 ` mark gross
2019-03-06 14:42 ` Borislav Petkov
2 siblings, 0 replies; 89+ messages in thread
From: mark gross @ 2019-03-05 16:40 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:50PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V6 12/14] x86/speculation/mds: Add mitigation mode VMWERV
> From: Thomas Gleixner <tglx@linutronix.de>
>
> In virtualized environments it can happen that the host has the microcode
> update which utilizes the VERW instruction to clear CPU buffers, but the
> hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
> to guests.
>
> Introduce an internal mitigation mode VWWERV which enables the invocation
minor type-oh. s/VWWERV/VMWERV/
--mark
* [MODERATED] Re: [patch V6 12/14] MDS basics 12
2019-03-01 21:47 ` [patch V6 12/14] MDS basics 12 Thomas Gleixner
2019-03-04 5:47 ` [MODERATED] Encrypted Message Jon Masters
2019-03-05 16:40 ` [MODERATED] Re: [patch V6 12/14] MDS basics 12 mark gross
@ 2019-03-06 14:42 ` Borislav Petkov
2 siblings, 0 replies; 89+ messages in thread
From: Borislav Petkov @ 2019-03-06 14:42 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:50PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V6 12/14] x86/speculation/mds: Add mitigation mode VMWERV
> From: Thomas Gleixner <tglx@linutronix.de>
>
> In virtualized environments it can happen that the host has the microcode
> update which utilizes the VERW instruction to clear CPU buffers, but the
> hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
> to guests.
>
> Introduce an internal mitigation mode VWWERV which enables the invocation
> of the CPU buffer clearing even if X86_FEATURE_MD_CLEAR is not set. If the
> system has no updated microcode this results in a pointless execution of
> the VERW instruction wasting a few CPU cycles. If the microcode is updated,
> but not exposed to a guest then the CPU buffers will be cleared.
>
> That said: Virtual Machines Will Eventually Receive Vaccine
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V2 -> V3: Rename mode.
> ---
> Documentation/x86/mds.rst | 27 +++++++++++++++++++++++++++
> arch/x86/include/asm/processor.h | 1 +
> arch/x86/kernel/cpu/bugs.c | 18 ++++++++++++------
> 3 files changed, 40 insertions(+), 6 deletions(-)
...
> @@ -235,10 +236,9 @@ static void mds_select_mitigation(void)
> }
>
> if (mds_mitigation == MDS_MITIGATION_FULL) {
> - if (boot_cpu_has(X86_FEATURE_MD_CLEAR))
> - static_branch_enable(&mds_user_clear);
> - else
> - mds_mitigation = MDS_MITIGATION_OFF;
> + if (!boot_cpu_has(X86_FEATURE_MD_CLEAR))
> + mds_mitigation = MDS_MITIGATION_VMWERV;
> + static_branch_enable(&mds_user_clear);
> }
> pr_info("%s\n", mds_strings[mds_mitigation]);
> }
> @@ -703,8 +703,14 @@ void arch_smt_update(void)
> break;
> }
>
> - if (mds_mitigation == MDS_MITIGATION_FULL)
> + switch(mds_mitigation) {
ERROR: space required before the open parenthesis '('
#119: FILE: arch/x86/kernel/cpu/bugs.c:706:
+ switch(mds_mitigation) {
with that addressed:
Reviewed-by: Borislav Petkov <bp@suse.de>
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
* [patch V6 13/14] MDS basics 13
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (11 preceding siblings ...)
2019-03-01 21:47 ` [patch V6 12/14] MDS basics 12 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-03 4:01 ` [MODERATED] " Josh Poimboeuf
2019-03-05 16:43 ` [MODERATED] " mark gross
2019-03-01 21:47 ` [patch V6 14/14] MDS basics 14 Thomas Gleixner
` (2 subsequent siblings)
15 siblings, 2 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
Subject: [patch V6 13/14] Documentation: Move L1TF to separate directory
From: Thomas Gleixner <tglx@linutronix.de>
Move L1TF to a separate directory so the MDS stuff can be added at the
side. Otherwise all hardware vulnerabilities would have their own top level
entry. Should have done that right away.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
Documentation/admin-guide/hw-vuln/index.rst | 12
Documentation/admin-guide/hw-vuln/l1tf.rst | 614 ++++++++++++++++++++++++++++
Documentation/admin-guide/index.rst | 6
Documentation/admin-guide/l1tf.rst | 614 ----------------------------
4 files changed, 628 insertions(+), 618 deletions(-)
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -0,0 +1,12 @@
+========================
+Hardware vulnerabilities
+========================
+
+This section describes CPU vulnerabilities and provides an overview of the
+possible mitigations along with guidance for selecting mitigations if they
+are configurable at compile, boot or run time.
+
+.. toctree::
+ :maxdepth: 1
+
+ l1tf
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/l1tf.rst
@@ -0,0 +1,614 @@
+L1TF - L1 Terminal Fault
+========================
+
+L1 Terminal Fault is a hardware vulnerability which allows unprivileged
+speculative access to data which is available in the Level 1 Data Cache
+when the page table entry controlling the virtual address, which is used
+for the access, has the Present bit cleared or other reserved bits set.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors. The
+vulnerability is not present on:
+
+ - Processors from AMD, Centaur and other non Intel vendors
+
+ - Older processor models, where the CPU family is < 6
+
+ - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
+ Penwell, Pineview, Silvermont, Airmont, Merrifield)
+
+ - The Intel XEON PHI family
+
+ - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
+ IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
+ by the Meltdown vulnerability either. These CPUs should become
+ available by end of 2018.
+
+Whether a processor is affected or not can be read out from the L1TF
+vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
+
+Related CVEs
+------------
+
+The following CVE entries are related to the L1TF vulnerability:
+
+ ============= ================= ==============================
+ CVE-2018-3615 L1 Terminal Fault SGX related aspects
+ CVE-2018-3620 L1 Terminal Fault OS, SMM related aspects
+ CVE-2018-3646 L1 Terminal Fault Virtualization related aspects
+ ============= ================= ==============================
+
+Problem
+-------
+
+If an instruction accesses a virtual address for which the relevant page
+table entry (PTE) has the Present bit cleared or other reserved bits set,
+then speculative execution ignores the invalid PTE and loads the referenced
+data if it is present in the Level 1 Data Cache, as if the page referenced
+by the address bits in the PTE was still present and accessible.
+
+While this is a purely speculative mechanism and the instruction will raise
+a page fault when it is retired eventually, the pure act of loading the
+data and making it available to other speculative instructions opens up the
+opportunity for side channel attacks to unprivileged malicious code,
+similar to the Meltdown attack.
+
+While Meltdown breaks the user space to kernel space protection, L1TF
+allows to attack any physical memory address in the system and the attack
+works across all protection domains. It allows an attack of SGX and also
+works from inside virtual machines because the speculation bypasses the
+extended page table (EPT) protection mechanism.
+
+
+Attack scenarios
+----------------
+
+1. Malicious user space
+^^^^^^^^^^^^^^^^^^^^^^^
+
+ Operating Systems store arbitrary information in the address bits of a
+ PTE which is marked non present. This allows a malicious user space
+ application to attack the physical memory to which these PTEs resolve.
+ In some cases user-space can maliciously influence the information
+ encoded in the address bits of the PTE, thus making attacks more
+ deterministic and more practical.
+
+ The Linux kernel contains a mitigation for this attack vector, PTE
+ inversion, which is permanently enabled and has no performance
+ impact. The kernel ensures that the address bits of PTEs, which are not
+ marked present, never point to cacheable physical memory space.
+
+ A system with an up to date kernel is protected against attacks from
+ malicious user space applications.
+
+2. Malicious guest in a virtual machine
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The fact that L1TF breaks all domain protections allows malicious guest
+ OSes, which can control the PTEs directly, and malicious guest user
+ space applications, which run on an unprotected guest kernel lacking the
+ PTE inversion mitigation for L1TF, to attack physical host memory.
+
+ A special aspect of L1TF in the context of virtualization is symmetric
+ multi threading (SMT). The Intel implementation of SMT is called
+ HyperThreading. The fact that Hyperthreads on the affected processors
+ share the L1 Data Cache (L1D) is important for this. As the flaw allows
+ only to attack data which is present in L1D, a malicious guest running
+ on one Hyperthread can attack the data which is brought into the L1D by
+ the context which runs on the sibling Hyperthread of the same physical
+ core. This context can be host OS, host user space or a different guest.
+
+ If the processor does not support Extended Page Tables, the attack is
+ only possible, when the hypervisor does not sanitize the content of the
+ effective (shadow) page tables.
+
+ While solutions exist to mitigate these attack vectors fully, these
+ mitigations are not enabled by default in the Linux kernel because they
+ can affect performance significantly. The kernel provides several
+ mechanisms which can be utilized to address the problem depending on the
+ deployment scenario. The mitigations, their protection scope and impact
+ are described in the next sections.
+
+ The default mitigations and the rationale for choosing them are explained
+ at the end of this document. See :ref:`default_mitigations`.
+
+.. _l1tf_sys_info:
+
+L1TF system information
+-----------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current L1TF
+status of the system: whether the system is vulnerable, and which
+mitigations are active. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/l1tf
+
+The possible values in this file are:
+
+ =========================== ===============================
+ 'Not affected' The processor is not vulnerable
+ 'Mitigation: PTE Inversion' The host protection is active
+ =========================== ===============================
+
+If KVM/VMX is enabled and the processor is vulnerable then the following
+information is appended to the 'Mitigation: PTE Inversion' part:
+
+ - SMT status:
+
+ ===================== ================
+ 'VMX: SMT vulnerable' SMT is enabled
+ 'VMX: SMT disabled' SMT is disabled
+ ===================== ================
+
+ - L1D Flush mode:
+
+ ================================ ====================================
+ 'L1D vulnerable' L1D flushing is disabled
+
+ 'L1D conditional cache flushes' L1D flush is conditionally enabled
+
+ 'L1D cache flushes' L1D flush is unconditionally enabled
+ ================================ ====================================
+
+The resulting grade of protection is discussed in the following sections.
+
+
+Host mitigation mechanism
+-------------------------
+
+The kernel is unconditionally protected against L1TF attacks from malicious
+user space running on the host.
+
+
+Guest mitigation mechanisms
+---------------------------
+
+.. _l1d_flush:
+
+1. L1D flush on VMENTER
+^^^^^^^^^^^^^^^^^^^^^^^
+
+ To make sure that a guest cannot attack data which is present in the L1D
+ the hypervisor flushes the L1D before entering the guest.
+
+ Flushing the L1D evicts not only the data which should not be accessed
+ by a potentially malicious guest, it also flushes the guest
+ data. Flushing the L1D has a performance impact as the processor has to
+ bring the flushed guest data back into the L1D. Depending on the
+ frequency of VMEXIT/VMENTER and the type of computations in the guest
+ performance degradation in the range of 1% to 50% has been observed. For
+ scenarios where guest VMEXIT/VMENTER are rare the performance impact is
+ minimal. Virtio and mechanisms like posted interrupts are designed to
+ confine the VMEXITs to a bare minimum, but specific configurations and
+ application scenarios might still suffer from a high VMEXIT rate.
+
+ The kernel provides two L1D flush modes:
+ - conditional ('cond')
+ - unconditional ('always')
+
+ The conditional mode avoids L1D flushing after VMEXITs which execute
+ only audited code paths before the corresponding VMENTER. These code
+ paths have been verified not to expose secrets or other
+ interesting data to an attacker, but they can leak information about the
+ address space layout of the hypervisor.
+
+ Unconditional mode flushes L1D on all VMENTER invocations and provides
+ maximum protection. It has a higher overhead than the conditional
+ mode. The overhead cannot be quantified correctly as it depends on the
+ workload scenario and the resulting number of VMEXITs.
+
+ The general recommendation is to enable L1D flush on VMENTER. The kernel
+ defaults to conditional mode on affected processors.
+
+ **Note** that L1D flush does not prevent the SMT problem because the
+ sibling thread will also bring back its data into the L1D which makes it
+ attackable again.
+
+ L1D flush can be controlled by the administrator via the kernel command
+ line and sysfs control files. See :ref:`mitigation_control_command_line`
+ and :ref:`mitigation_control_kvm`.
+
+.. _guest_confinement:
+
+2. Guest VCPU confinement to dedicated physical cores
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ To address the SMT problem, it is possible to make a guest or a group of
+ guests affine to one or more physical cores. The proper mechanism for
+ that is to utilize exclusive cpusets to ensure that no other guest or
+ host tasks can run on these cores.
+
+ If only a single guest or related guests run on sibling SMT threads on
+ the same physical core then they can only attack their own memory and
+ restricted parts of the host memory.
+
+ Host memory is attackable when one of the sibling SMT threads runs in
+ host OS (hypervisor) context and the other in guest context. The amount
+ of valuable information from the host OS context depends on the context
+ which the host OS executes, i.e. interrupts, soft interrupts and kernel
+ threads. The amount of valuable data from these contexts cannot be
+ declared as non-interesting for an attacker without deep inspection of
+ the code.
+
+ **Note** that assigning guests to a fixed set of physical cores affects
+ the ability of the scheduler to do load balancing and might have
+ negative effects on CPU utilization depending on the hosting
+ scenario. Disabling SMT might be a viable alternative for particular
+ scenarios.
+
+ For further information about confining guests to a single or to a group
+ of cores consult the cpusets documentation:
+
+ https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
+
+.. _interrupt_isolation:
+
+3. Interrupt affinity
+^^^^^^^^^^^^^^^^^^^^^
+
+ Interrupts can be made affine to logical CPUs. This is not universally
+ true because there are types of interrupts which are truly per CPU
+ interrupts, e.g. the local timer interrupt. Aside from that, multi queue
+ devices affine their interrupts to single CPUs or groups of CPUs per
+ queue without allowing the administrator to control the affinities.
+
+ Moving the interrupts, which can be affinity controlled, away from CPUs
+ which run untrusted guests, reduces the attack vector space.
+
+ Whether the interrupts which are affine to CPUs which run untrusted
+ guests provide interesting data for an attacker depends on the system
+ configuration and the scenarios which run on the system. While for some
+ of the interrupts it can be assumed that they won't expose interesting
+ information beyond exposing hints about the host OS memory layout, there
+ is no way to make general assumptions.
+
+ Interrupt affinity can be controlled by the administrator via the
+ /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
+ available at:
+
+ https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
+
+.. _smt_control:
+
+4. SMT control
+^^^^^^^^^^^^^^
+
+ To prevent the SMT issues of L1TF it might be necessary to disable SMT
+ completely. Disabling SMT can have a significant performance impact, but
+ the impact depends on the hosting scenario and the type of workloads.
+ The impact of disabling SMT also needs to be weighed against the impact
+ of other mitigation solutions like confining guests to dedicated cores.
+
+ The kernel provides a sysfs interface to retrieve the status of SMT and
+ to control it. It also provides a kernel command line interface to
+ control SMT.
+
+ The kernel command line interface consists of the following options:
+
+ =========== ==========================================================
+ nosmt Affects the bring up of the secondary CPUs during boot. The
+ kernel tries to bring all present CPUs online during the
+ boot process. "nosmt" makes sure that from each physical
+ core only one - the so called primary (hyper) thread is
+ activated. Due to a design flaw of Intel processors related
+ to Machine Check Exceptions the non primary siblings have
+ to be brought up at least partially and are then shut down
+ again. "nosmt" can be undone via the sysfs interface.
+
+ nosmt=force Has the same effect as "nosmt" but it does not allow to
+ undo the SMT disable via the sysfs interface.
+ =========== ==========================================================
+
+ The sysfs interface provides two files:
+
+ - /sys/devices/system/cpu/smt/control
+ - /sys/devices/system/cpu/smt/active
+
+ /sys/devices/system/cpu/smt/control:
+
+ This file allows to read out the SMT control state and provides the
+ ability to disable or (re)enable SMT. The possible states are:
+
+ ============== ===================================================
+ on SMT is supported by the CPU and enabled. All
+ logical CPUs can be onlined and offlined without
+ restrictions.
+
+ off SMT is supported by the CPU and disabled. Only
+ the so called primary SMT threads can be onlined
+ and offlined without restrictions. An attempt to
+ online a non-primary sibling is rejected
+
+ forceoff Same as 'off' but the state cannot be controlled.
+ Attempts to write to the control file are rejected.
+
+ notsupported The processor does not support SMT. It's therefore
+ not affected by the SMT implications of L1TF.
+ Attempts to write to the control file are rejected.
+ ============== ===================================================
+
+ The possible states which can be written into this file to control SMT
+ state are:
+
+ - on
+ - off
+ - forceoff
+
+ /sys/devices/system/cpu/smt/active:
+
+ This file reports whether SMT is enabled and active, i.e. if on any
+ physical core two or more sibling threads are online.
+
+ SMT control is also possible at boot time via the l1tf kernel command
+ line parameter in combination with L1D flush control. See
+ :ref:`mitigation_control_command_line`.
+
+5. Disabling EPT
+^^^^^^^^^^^^^^^^
+
+ Disabling EPT for virtual machines provides full mitigation for L1TF even
+ with SMT enabled, because the effective page tables for guests are
+ managed and sanitized by the hypervisor. Though disabling EPT has a
+ significant performance impact especially when the Meltdown mitigation
+ KPTI is enabled.
+
+ EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
+
+There is ongoing research and development for new mitigation mechanisms to
+address the performance impact of disabling SMT or EPT.
+
+.. _mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows to control the L1TF mitigations at boot
+time with the option "l1tf=". The valid arguments for this option are:
+
+ ============ =============================================================
+ full Provides all available mitigations for the L1TF
+ vulnerability. Disables SMT and enables all mitigations in
+ the hypervisors, i.e. unconditional L1D flushing
+
+ SMT control and L1D flush control via the sysfs interface
+ is still possible after boot. Hypervisors will issue a
+ warning when the first VM is started in a potentially
+ insecure configuration, i.e. SMT enabled or L1D flush
+ disabled.
+
+ full,force Same as 'full', but disables SMT and L1D flush runtime
+ control. Implies the 'nosmt=force' command line option.
+ (i.e. sysfs control of SMT is disabled.)
+
+ flush Leaves SMT enabled and enables the default hypervisor
+ mitigation, i.e. conditional L1D flushing
+
+ SMT control and L1D flush control via the sysfs interface
+ is still possible after boot. Hypervisors will issue a
+ warning when the first VM is started in a potentially
+ insecure configuration, i.e. SMT enabled or L1D flush
+ disabled.
+
+ flush,nosmt Disables SMT and enables the default hypervisor mitigation,
+ i.e. conditional L1D flushing.
+
+ SMT control and L1D flush control via the sysfs interface
+ is still possible after boot. Hypervisors will issue a
+ warning when the first VM is started in a potentially
+ insecure configuration, i.e. SMT enabled or L1D flush
+ disabled.
+
+ flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is
+ started in a potentially insecure configuration.
+
+ off Disables hypervisor mitigations and doesn't emit any
+ warnings.
+ It also drops the swap size and available RAM limit restrictions
+ on both hypervisor and bare metal.
+
+ ============ =============================================================
+
+The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
+
+
+.. _mitigation_control_kvm:
+
+Mitigation control for KVM - module parameter
+-------------------------------------------------------------
+
+The KVM hypervisor mitigation mechanism, flushing the L1D cache when
+entering a guest, can be controlled with a module parameter.
+
+The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
+following arguments:
+
+ ============ ==============================================================
+ always L1D cache flush on every VMENTER.
+
+ cond Flush L1D on VMENTER only when the code between VMEXIT and
+ VMENTER can leak host memory which is considered
+ interesting for an attacker. This still can leak host memory
+ which allows e.g. to determine the host's address space layout.
+
+ never Disables the mitigation
+ ============ ==============================================================
+
+The parameter can be provided on the kernel command line, as a module
+parameter when loading the modules and at runtime modified via the sysfs
+file:
+
+/sys/module/kvm_intel/parameters/vmentry_l1d_flush
+
+The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
+line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
+module parameter is ignored and writes to the sysfs file are rejected.
+
+
+Mitigation selection guide
+--------------------------
+
+1. No virtualization in use
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The system is protected by the kernel unconditionally and no further
+ action is required.
+
+2. Virtualization with trusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ If the guest comes from a trusted source and the guest OS kernel is
+ guaranteed to have the L1TF mitigations in place the system is fully
+ protected against L1TF and no further action is required.
+
+ To avoid the overhead of the default L1D flushing on VMENTER the
+ administrator can disable the flushing via the kernel command line and
+ sysfs control files. See :ref:`mitigation_control_command_line` and
+ :ref:`mitigation_control_kvm`.
+
+
+3. Virtualization with untrusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+3.1. SMT not supported or disabled
+""""""""""""""""""""""""""""""""""
+
+ If SMT is not supported by the processor or disabled in the BIOS or by
+ the kernel, it's only required to enforce L1D flushing on VMENTER.
+
+ Conditional L1D flushing is the default behaviour and can be tuned. See
+ :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
+
+3.2. EPT not supported or disabled
+""""""""""""""""""""""""""""""""""
+
+ If EPT is not supported by the processor or disabled in the hypervisor,
+ the system is fully protected. SMT can stay enabled and L1D flushing on
+ VMENTER is not required.
+
+ EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
+
+3.3. SMT and EPT supported and active
+"""""""""""""""""""""""""""""""""""""
+
+ If SMT and EPT are supported and active then various degrees of
+ mitigations can be employed:
+
+ - L1D flushing on VMENTER:
+
+ L1D flushing on VMENTER is the minimal protection requirement, but it
+ is only potent in combination with other mitigation methods.
+
+ Conditional L1D flushing is the default behaviour and can be tuned. See
+ :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
+
+ - Guest confinement:
+
+ Confinement of guests to a single or a group of physical cores which
+ are not running any other processes, can reduce the attack surface
+ significantly, but interrupts, soft interrupts and kernel threads can
+ still expose valuable data to a potential attacker. See
+ :ref:`guest_confinement`.
+
+ - Interrupt isolation:
+
+ Isolating the guest CPUs from interrupts can reduce the attack surface
+ further, but still allows a malicious guest to explore a limited amount
+ of host physical memory. This can at least be used to gain knowledge
+ about the host address space layout. The interrupts which have a fixed
+ affinity to the CPUs which run the untrusted guests can, depending on
+ the scenario, still trigger soft interrupts and schedule kernel threads
+ which might expose valuable information. See
+ :ref:`interrupt_isolation`.
+
+The above three mitigation methods combined can provide protection to a
+certain degree, but the risk of the remaining attack surface has to be
+carefully analyzed. For full protection the following methods are
+available:
+
+ - Disabling SMT:
+
+ Disabling SMT and enforcing the L1D flushing provides the maximum
+ amount of protection. This mitigation does not depend on any of the
+ above mitigation methods.
+
+ SMT control and L1D flushing can be tuned by the command line
+ parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
+ time with the matching sysfs control files. See :ref:`smt_control`,
+ :ref:`mitigation_control_command_line` and
+ :ref:`mitigation_control_kvm`.
+
+ - Disabling EPT:
+
+ Disabling EPT provides the maximum amount of protection as well. It
+ does not depend on any of the above mitigation methods. SMT can stay
+ enabled and L1D flushing is not required, but the performance impact is
+ significant.
+
+ EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
+ parameter.
+
+3.4. Nested virtual machines
+""""""""""""""""""""""""""""
+
+When nested virtualization is in use, three operating systems are involved:
+the bare metal hypervisor, the nested hypervisor and the nested virtual
+machine. VMENTER operations from the nested hypervisor into the nested
+guest will always be processed by the bare metal hypervisor. If KVM is the
+bare metal hypervisor it will:
+
+ - Flush the L1D cache on every switch from the nested hypervisor to the
+ nested virtual machine, so that the nested hypervisor's secrets are not
+ exposed to the nested virtual machine;
+
+ - Flush the L1D cache on every switch from the nested virtual machine to
+ the nested hypervisor; this is a complex operation, and flushing the L1D
+ cache prevents the bare metal hypervisor's secrets from being exposed to
+ the nested virtual machine;
+
+ - Instruct the nested hypervisor to not perform any L1D cache flush. This
+ is an optimization to avoid double L1D flushing.
+
+
+.. _default_mitigations:
+
+Default mitigations
+-------------------
+
+ The kernel default mitigations for vulnerable processors are:
+
+ - PTE inversion to protect against malicious user space. This is done
+ unconditionally and cannot be controlled. The swap storage is limited
+ to ~16TB.
+
+ - L1D conditional flushing on VMENTER when EPT is enabled for
+ a guest.
+
+ The kernel does not by default enforce the disabling of SMT, which leaves
+ SMT systems vulnerable when running untrusted guests with EPT enabled.
+
+ The rationale for this choice is:
+
+ - Force disabling SMT can break existing setups, especially with
+ unattended updates.
+
+ - If regular users run untrusted guests on their machine, then L1TF is
+ just an add on to other malware which might be embedded in an untrusted
+ guest, e.g. spam-bots or attacks on the local network.
+
+ There is no technical way to prevent a user from running untrusted code
+ on their machines blindly.
+
+ - It's technically extremely unlikely and from today's knowledge even
+ impossible that L1TF can be exploited via the most popular attack
+ mechanisms like JavaScript because these mechanisms have no way to
+ control PTEs. If that were possible and no other mitigation were
+ available, then the default might be different.
+
+ - The administrators of cloud and hosting setups have to carefully
+ analyze the risk for their scenarios and make the appropriate
+ mitigation choices, which might even vary across their deployed
+ machines and also result in other changes of their overall setup.
+ There is no way for the kernel to provide a sensible default for this
+ kind of scenario.
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -17,14 +17,12 @@ etc.
kernel-parameters
devices
-This section describes CPU vulnerabilities and provides an overview of the
-possible mitigations along with guidance for selecting mitigations if they
-are configurable at compile, boot or run time.
+This section describes CPU vulnerabilities and their mitigations.
.. toctree::
:maxdepth: 1
- l1tf
+ hw-vuln/index
Here is a set of documents aimed at users who are trying to track down
problems and bugs in particular.
--- a/Documentation/admin-guide/l1tf.rst
+++ /dev/null
@@ -1,614 +0,0 @@
-L1TF - L1 Terminal Fault
-========================
-
-L1 Terminal Fault is a hardware vulnerability which allows unprivileged
-speculative access to data which is available in the Level 1 Data Cache
-when the page table entry controlling the virtual address, which is used
-for the access, has the Present bit cleared or other reserved bits set.
-
-Affected processors
--------------------
-
-This vulnerability affects a wide range of Intel processors. The
-vulnerability is not present on:
-
- - Processors from AMD, Centaur and other non Intel vendors
-
- - Older processor models, where the CPU family is < 6
-
- - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
- Penwell, Pineview, Silvermont, Airmont, Merrifield)
-
- - The Intel XEON PHI family
-
- - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
- IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
- by the Meltdown vulnerability either. These CPUs should become
- available by end of 2018.
-
-Whether a processor is affected or not can be read out from the L1TF
-vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
-
-Related CVEs
-------------
-
-The following CVE entries are related to the L1TF vulnerability:
-
- ============= ================= ==============================
- CVE-2018-3615 L1 Terminal Fault SGX related aspects
- CVE-2018-3620 L1 Terminal Fault OS, SMM related aspects
- CVE-2018-3646 L1 Terminal Fault Virtualization related aspects
- ============= ================= ==============================
-
-Problem
--------
-
-If an instruction accesses a virtual address for which the relevant page
-table entry (PTE) has the Present bit cleared or other reserved bits set,
-then speculative execution ignores the invalid PTE and loads the referenced
-data if it is present in the Level 1 Data Cache, as if the page referenced
-by the address bits in the PTE was still present and accessible.
-
-While this is a purely speculative mechanism and the instruction will raise
-a page fault when it is retired eventually, the pure act of loading the
-data and making it available to other speculative instructions opens up the
-opportunity for side channel attacks to unprivileged malicious code,
-similar to the Meltdown attack.
-
-While Meltdown breaks the user space to kernel space protection, L1TF
-allows to attack any physical memory address in the system and the attack
-works across all protection domains. It allows an attack of SGX and also
-works from inside virtual machines because the speculation bypasses the
-extended page table (EPT) protection mechanism.
-
-
-Attack scenarios
-----------------
-
-1. Malicious user space
-^^^^^^^^^^^^^^^^^^^^^^^
-
- Operating Systems store arbitrary information in the address bits of a
- PTE which is marked non present. This allows a malicious user space
- application to attack the physical memory to which these PTEs resolve.
- In some cases user-space can maliciously influence the information
- encoded in the address bits of the PTE, thus making attacks more
- deterministic and more practical.
-
- The Linux kernel contains a mitigation for this attack vector, PTE
- inversion, which is permanently enabled and has no performance
- impact. The kernel ensures that the address bits of PTEs, which are not
- marked present, never point to cacheable physical memory space.
-
- A system with an up to date kernel is protected against attacks from
- malicious user space applications.
-
-2. Malicious guest in a virtual machine
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- The fact that L1TF breaks all domain protections allows malicious guest
- OSes, which can control the PTEs directly, and malicious guest user
- space applications, which run on an unprotected guest kernel lacking the
- PTE inversion mitigation for L1TF, to attack physical host memory.
-
- A special aspect of L1TF in the context of virtualization is symmetric
- multi threading (SMT). The Intel implementation of SMT is called
- HyperThreading. The fact that HyperThreads on the affected processors
- share the L1 Data Cache (L1D) is important for this. As the flaw allows
- only to attack data which is present in L1D, a malicious guest running
- on one Hyperthread can attack the data which is brought into the L1D by
- the context which runs on the sibling Hyperthread of the same physical
- core. This context can be host OS, host user space or a different guest.
-
- If the processor does not support Extended Page Tables, the attack is
- only possible when the hypervisor does not sanitize the content of the
- effective (shadow) page tables.
-
- While solutions exist to mitigate these attack vectors fully, these
- mitigations are not enabled by default in the Linux kernel because they
- can affect performance significantly. The kernel provides several
- mechanisms which can be utilized to address the problem depending on the
- deployment scenario. The mitigations, their protection scope and impact
- are described in the next sections.
-
- The default mitigations and the rationale for choosing them are explained
- at the end of this document. See :ref:`default_mitigations`.
-
-.. _l1tf_sys_info:
-
-L1TF system information
------------------------
-
-The Linux kernel provides a sysfs interface to enumerate the current L1TF
-status of the system: whether the system is vulnerable, and which
-mitigations are active. The relevant sysfs file is:
-
-/sys/devices/system/cpu/vulnerabilities/l1tf
-
-The possible values in this file are:
-
- =========================== ===============================
- 'Not affected' The processor is not vulnerable
- 'Mitigation: PTE Inversion' The host protection is active
- =========================== ===============================
-
-If KVM/VMX is enabled and the processor is vulnerable then the following
-information is appended to the 'Mitigation: PTE Inversion' part:
-
- - SMT status:
-
- ===================== ================
- 'VMX: SMT vulnerable' SMT is enabled
- 'VMX: SMT disabled' SMT is disabled
- ===================== ================
-
- - L1D Flush mode:
-
- ================================ ====================================
- 'L1D vulnerable' L1D flushing is disabled
-
- 'L1D conditional cache flushes' L1D flush is conditionally enabled
-
- 'L1D cache flushes' L1D flush is unconditionally enabled
- ================================ ====================================
-
-The resulting grade of protection is discussed in the following sections.
-
-
-Host mitigation mechanism
--------------------------
-
-The kernel is unconditionally protected against L1TF attacks from malicious
-user space running on the host.
-
-
-Guest mitigation mechanisms
----------------------------
-
-.. _l1d_flush:
-
-1. L1D flush on VMENTER
-^^^^^^^^^^^^^^^^^^^^^^^
-
- To make sure that a guest cannot attack data which is present in the L1D
- the hypervisor flushes the L1D before entering the guest.
-
- Flushing the L1D evicts not only the data which should not be accessed
- by a potentially malicious guest, it also flushes the guest
- data. Flushing the L1D has a performance impact as the processor has to
- bring the flushed guest data back into the L1D. Depending on the
- frequency of VMEXIT/VMENTER and the type of computations in the guest
- performance degradation in the range of 1% to 50% has been observed. For
- scenarios where guest VMEXIT/VMENTER are rare the performance impact is
- minimal. Virtio and mechanisms like posted interrupts are designed to
- confine the VMEXITs to a bare minimum, but specific configurations and
- application scenarios might still suffer from a high VMEXIT rate.
-
- The kernel provides two L1D flush modes:
- - conditional ('cond')
- - unconditional ('always')
-
- The conditional mode avoids L1D flushing after VMEXITs which execute
- only audited code paths before the corresponding VMENTER. These code
- paths have been verified that they cannot expose secrets or other
- interesting data to an attacker, but they can leak information about the
- address space layout of the hypervisor.
-
- Unconditional mode flushes L1D on all VMENTER invocations and provides
- maximum protection. It has a higher overhead than the conditional
- mode. The overhead cannot be quantified correctly as it depends on the
- workload scenario and the resulting number of VMEXITs.
-
- The general recommendation is to enable L1D flush on VMENTER. The kernel
- defaults to conditional mode on affected processors.
-
- **Note** that L1D flush does not prevent the SMT problem because the
- sibling thread will also bring back its data into the L1D which makes it
- attackable again.
-
- L1D flush can be controlled by the administrator via the kernel command
- line and sysfs control files. See :ref:`mitigation_control_command_line`
- and :ref:`mitigation_control_kvm`.
-
-.. _guest_confinement:
-
-2. Guest VCPU confinement to dedicated physical cores
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- To address the SMT problem, it is possible to make a guest or a group of
- guests affine to one or more physical cores. The proper mechanism for
- that is to utilize exclusive cpusets to ensure that no other guest or
- host tasks can run on these cores.
-
- If only a single guest or related guests run on sibling SMT threads on
- the same physical core then they can only attack their own memory and
- restricted parts of the host memory.
-
- Host memory is attackable, when one of the sibling SMT threads runs in
- host OS (hypervisor) context and the other in guest context. The amount
- of valuable information from the host OS context depends on the context
- which the host OS executes, i.e. interrupts, soft interrupts and kernel
- threads. The amount of valuable data from these contexts cannot be
- declared as non-interesting for an attacker without deep inspection of
- the code.
-
- **Note** that assigning guests to a fixed set of physical cores affects
- the ability of the scheduler to do load balancing and might have
- negative effects on CPU utilization depending on the hosting
- scenario. Disabling SMT might be a viable alternative for particular
- scenarios.
-
- For further information about confining guests to a single or to a group
- of cores consult the cpusets documentation:
-
- https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
-
-.. _interrupt_isolation:
-
-3. Interrupt affinity
-^^^^^^^^^^^^^^^^^^^^^
-
- Interrupts can be made affine to logical CPUs. This is not universally
- true because there are types of interrupts which are truly per CPU
- interrupts, e.g. the local timer interrupt. Aside from that, multi queue
- devices affine their interrupts to single CPUs or groups of CPUs per
- queue without allowing the administrator to control the affinities.
-
- Moving the interrupts, which can be affinity controlled, away from CPUs
- which run untrusted guests, reduces the attack vector space.
-
- Whether the interrupts which are affine to CPUs running untrusted
- guests provide interesting data for an attacker depends on the system
- configuration and the scenarios which run on the system. While for some
- of the interrupts it can be assumed that they won't expose interesting
- information beyond exposing hints about the host OS memory layout, there
- is no way to make general assumptions.
-
- Interrupt affinity can be controlled by the administrator via the
- /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
- available at:
-
- https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
-
-.. _smt_control:
-
-4. SMT control
-^^^^^^^^^^^^^^
-
- To prevent the SMT issues of L1TF it might be necessary to disable SMT
- completely. Disabling SMT can have a significant performance impact, but
- the impact depends on the hosting scenario and the type of workloads.
- The impact of disabling SMT also needs to be weighed against the impact
- of other mitigation solutions like confining guests to dedicated cores.
-
- The kernel provides a sysfs interface to retrieve the status of SMT and
- to control it. It also provides a kernel command line interface to
- control SMT.
-
- The kernel command line interface consists of the following options:
-
- =========== ==========================================================
- nosmt Affects the bring up of the secondary CPUs during boot. The
- kernel tries to bring all present CPUs online during the
- boot process. "nosmt" makes sure that from each physical
- core only one - the so-called primary (hyper) thread - is
- activated. Due to a design flaw of Intel processors related
- to Machine Check Exceptions the non-primary siblings have
- to be brought up at least partially and are then shut down
- again. "nosmt" can be undone via the sysfs interface.
-
- nosmt=force Has the same effect as "nosmt" but it does not allow
- undoing the SMT disable via the sysfs interface.
- =========== ==========================================================
-
- The sysfs interface provides two files:
-
- - /sys/devices/system/cpu/smt/control
- - /sys/devices/system/cpu/smt/active
-
- /sys/devices/system/cpu/smt/control:
-
- This file allows reading out the SMT control state and provides the
- ability to disable or (re)enable SMT. The possible states are:
-
- ============== ===================================================
- on SMT is supported by the CPU and enabled. All
- logical CPUs can be onlined and offlined without
- restrictions.
-
- off SMT is supported by the CPU and disabled. Only
- the so-called primary SMT threads can be onlined
- and offlined without restrictions. An attempt to
- online a non-primary sibling is rejected.
-
- forceoff Same as 'off' but the state cannot be controlled.
- Attempts to write to the control file are rejected.
-
- notsupported The processor does not support SMT. It's therefore
- not affected by the SMT implications of L1TF.
- Attempts to write to the control file are rejected.
- ============== ===================================================
-
- The possible states which can be written into this file to control SMT
- state are:
-
- - on
- - off
- - forceoff
-
- /sys/devices/system/cpu/smt/active:
-
- This file reports whether SMT is enabled and active, i.e. if on any
- physical core two or more sibling threads are online.
-
- SMT control is also possible at boot time via the l1tf kernel command
- line parameter in combination with L1D flush control. See
- :ref:`mitigation_control_command_line`.
-
-5. Disabling EPT
-^^^^^^^^^^^^^^^^
-
- Disabling EPT for virtual machines provides full mitigation for L1TF even
- with SMT enabled, because the effective page tables for guests are
- managed and sanitized by the hypervisor. However, disabling EPT has a
- significant performance impact, especially when the Meltdown mitigation
- KPTI is enabled.
-
- EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
-
-There is ongoing research and development for new mitigation mechanisms to
-address the performance impact of disabling SMT or EPT.
-
-.. _mitigation_control_command_line:
-
-Mitigation control on the kernel command line
----------------------------------------------
-
-The kernel command line allows controlling the L1TF mitigations at boot
-time with the option "l1tf=". The valid arguments for this option are:
-
- ============ =============================================================
- full Provides all available mitigations for the L1TF
- vulnerability. Disables SMT and enables all mitigations in
- the hypervisors, i.e. unconditional L1D flushing
-
- SMT control and L1D flush control via the sysfs interface
- is still possible after boot. Hypervisors will issue a
- warning when the first VM is started in a potentially
- insecure configuration, i.e. SMT enabled or L1D flush
- disabled.
-
- full,force Same as 'full', but disables SMT and L1D flush runtime
- control. Implies the 'nosmt=force' command line option.
- (i.e. sysfs control of SMT is disabled.)
-
- flush Leaves SMT enabled and enables the default hypervisor
- mitigation, i.e. conditional L1D flushing
-
- SMT control and L1D flush control via the sysfs interface
- is still possible after boot. Hypervisors will issue a
- warning when the first VM is started in a potentially
- insecure configuration, i.e. SMT enabled or L1D flush
- disabled.
-
- flush,nosmt Disables SMT and enables the default hypervisor mitigation,
- i.e. conditional L1D flushing.
-
- SMT control and L1D flush control via the sysfs interface
- is still possible after boot. Hypervisors will issue a
- warning when the first VM is started in a potentially
- insecure configuration, i.e. SMT enabled or L1D flush
- disabled.
-
- flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is
- started in a potentially insecure configuration.
-
- off Disables hypervisor mitigations and doesn't emit any
- warnings.
- It also drops the swap size and available RAM limit restrictions
- on both hypervisor and bare metal.
-
- ============ =============================================================
-
-The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
-
-
-.. _mitigation_control_kvm:
-
-Mitigation control for KVM - module parameter
--------------------------------------------------------------
-
-The KVM hypervisor mitigation mechanism, flushing the L1D cache when
-entering a guest, can be controlled with a module parameter.
-
-The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
-following arguments:
-
- ============ ==============================================================
- always L1D cache flush on every VMENTER.
-
- cond Flush L1D on VMENTER only when the code between VMEXIT and
- VMENTER can leak host memory which is considered
- interesting for an attacker. This still can leak host memory
- which allows e.g. determining the host's address space layout.
-
- never Disables the mitigation
- ============ ==============================================================
-
-The parameter can be provided on the kernel command line, as a module
-parameter when loading the modules and at runtime modified via the sysfs
-file:
-
-/sys/module/kvm_intel/parameters/vmentry_l1d_flush
-
-The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
-line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
-module parameter is ignored and writes to the sysfs file are rejected.
-
-
-Mitigation selection guide
---------------------------
-
-1. No virtualization in use
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- The system is protected by the kernel unconditionally and no further
- action is required.
-
-2. Virtualization with trusted guests
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- If the guest comes from a trusted source and the guest OS kernel is
- guaranteed to have the L1TF mitigations in place the system is fully
- protected against L1TF and no further action is required.
-
- To avoid the overhead of the default L1D flushing on VMENTER the
- administrator can disable the flushing via the kernel command line and
- sysfs control files. See :ref:`mitigation_control_command_line` and
- :ref:`mitigation_control_kvm`.
-
-
-3. Virtualization with untrusted guests
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-3.1. SMT not supported or disabled
-""""""""""""""""""""""""""""""""""
-
- If SMT is not supported by the processor or disabled in the BIOS or by
- the kernel, it's only required to enforce L1D flushing on VMENTER.
-
- Conditional L1D flushing is the default behaviour and can be tuned. See
- :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
-
-3.2. EPT not supported or disabled
-""""""""""""""""""""""""""""""""""
-
- If EPT is not supported by the processor or disabled in the hypervisor,
- the system is fully protected. SMT can stay enabled and L1D flushing on
- VMENTER is not required.
-
- EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
-
-3.3. SMT and EPT supported and active
-"""""""""""""""""""""""""""""""""""""
-
- If SMT and EPT are supported and active then various degrees of
- mitigations can be employed:
-
- - L1D flushing on VMENTER:
-
- L1D flushing on VMENTER is the minimal protection requirement, but it
- is only potent in combination with other mitigation methods.
-
- Conditional L1D flushing is the default behaviour and can be tuned. See
- :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
-
- - Guest confinement:
-
- Confinement of guests to a single or a group of physical cores which
- are not running any other processes, can reduce the attack surface
- significantly, but interrupts, soft interrupts and kernel threads can
- still expose valuable data to a potential attacker. See
- :ref:`guest_confinement`.
-
- - Interrupt isolation:
-
- Isolating the guest CPUs from interrupts can reduce the attack surface
- further, but still allows a malicious guest to explore a limited amount
- of host physical memory. This can at least be used to gain knowledge
- about the host address space layout. The interrupts which have a fixed
- affinity to the CPUs which run the untrusted guests can, depending on
- the scenario, still trigger soft interrupts and schedule kernel threads
- which might expose valuable information. See
- :ref:`interrupt_isolation`.
-
-The above three mitigation methods combined can provide protection to a
-certain degree, but the risk of the remaining attack surface has to be
-carefully analyzed. For full protection the following methods are
-available:
-
- - Disabling SMT:
-
- Disabling SMT and enforcing the L1D flushing provides the maximum
- amount of protection. This mitigation does not depend on any of the
- above mitigation methods.
-
- SMT control and L1D flushing can be tuned by the command line
- parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
- time with the matching sysfs control files. See :ref:`smt_control`,
- :ref:`mitigation_control_command_line` and
- :ref:`mitigation_control_kvm`.
-
- - Disabling EPT:
-
- Disabling EPT provides the maximum amount of protection as well. It
- does not depend on any of the above mitigation methods. SMT can stay
- enabled and L1D flushing is not required, but the performance impact is
- significant.
-
- EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
- parameter.
-
-3.4. Nested virtual machines
-""""""""""""""""""""""""""""
-
-When nested virtualization is in use, three operating systems are involved:
-the bare metal hypervisor, the nested hypervisor and the nested virtual
-machine. VMENTER operations from the nested hypervisor into the nested
-guest will always be processed by the bare metal hypervisor. If KVM is the
-bare metal hypervisor it will:
-
- - Flush the L1D cache on every switch from the nested hypervisor to the
- nested virtual machine, so that the nested hypervisor's secrets are not
- exposed to the nested virtual machine;
-
- - Flush the L1D cache on every switch from the nested virtual machine to
- the nested hypervisor; this is a complex operation, and flushing the L1D
- cache avoids that the bare metal hypervisor's secrets are exposed to the
- nested virtual machine;
-
- - Instruct the nested hypervisor to not perform any L1D cache flush. This
- is an optimization to avoid double L1D flushing.
-
-
-.. _default_mitigations:
-
-Default mitigations
--------------------
-
- The kernel default mitigations for vulnerable processors are:
-
- - PTE inversion to protect against malicious user space. This is done
- unconditionally and cannot be controlled. The swap storage is limited
- to ~16TB.
-
- - L1D conditional flushing on VMENTER when EPT is enabled for
- a guest.
-
- The kernel does not by default enforce the disabling of SMT, which leaves
- SMT systems vulnerable when running untrusted guests with EPT enabled.
-
- The rationale for this choice is:
-
- - Force disabling SMT can break existing setups, especially with
- unattended updates.
-
- - If regular users run untrusted guests on their machine, then L1TF is
- just an add on to other malware which might be embedded in an untrusted
- guest, e.g. spam-bots or attacks on the local network.
-
- There is no technical way to prevent a user from running untrusted code
- on their machines blindly.
-
- - It's technically extremely unlikely and from today's knowledge even
- impossible that L1TF can be exploited via the most popular attack
- mechanisms like JavaScript because these mechanisms have no way to
- control PTEs. If this were possible and no other mitigation were
- available, then the default might be different.
-
- - The administrators of cloud and hosting setups have to carefully
- analyze the risk for their scenarios and make the appropriate
- mitigation choices, which might even vary across their deployed
- machines and also result in other changes of their overall setup.
- There is no way for the kernel to provide a sensible default for this
- kind of scenario.
^ permalink raw reply [flat|nested] 89+ messages in thread
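As a side note to the (moved) document above: the l1tf sysfs state it
describes can be queried with plain file I/O. A minimal sketch, not part of
the patch set, assuming a kernel which exposes the vulnerabilities directory:

/*
 * Minimal sketch (not from the patch set): read the L1TF mitigation
 * state via sysfs, as described in the documentation above.
 */
#include <stdio.h>

int main(void)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/l1tf", "r");

	if (!f) {
		perror("l1tf sysfs file");	/* older kernels may lack it */
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		fputs(buf, stdout);	/* e.g. "Mitigation: PTE Inversion; ..." */
	fclose(f);
	return 0;
}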
* [MODERATED] Re: [patch V6 13/14] MDS basics 13
2019-03-01 21:47 ` [patch V6 13/14] MDS basics 13 Thomas Gleixner
@ 2019-03-03 4:01 ` Josh Poimboeuf
2019-03-05 16:04 ` Thomas Gleixner
2019-03-05 16:43 ` [MODERATED] " mark gross
1 sibling, 1 reply; 89+ messages in thread
From: Josh Poimboeuf @ 2019-03-03 4:01 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:51PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V6 13/14] Documentation: Move L1TF to separate directory
> From: Thomas Gleixner <tglx@linutronix.de>
>
> Move L!TF to a separate directory so the MDS stuff can be added at the
> side. Otherwise all hardware vulnerabilities have their own top level
> entry. Should have done that right away.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
$ git grep admin-guide |grep l1tf |grep -v hw-vuln
Documentation/ABI/testing/sysfs-devices-system-cpu: Documentation/admin-guide/l1tf.rst
Documentation/admin-guide/kernel-parameters.txt: For details see: Documentation/admin-guide/l1tf.rst
arch/x86/kernel/cpu/bugs.c: pr_info("Reading https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html might help you decide.\n");
arch/x86/kvm/vmx/vmx.c:#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
arch/x86/kvm/vmx/vmx.c:#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
--
Josh
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [patch V6 13/14] MDS basics 13
2019-03-03 4:01 ` [MODERATED] " Josh Poimboeuf
@ 2019-03-05 16:04 ` Thomas Gleixner
0 siblings, 0 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-05 16:04 UTC (permalink / raw)
To: speck
On Sat, 2 Mar 2019, speck for Josh Poimboeuf wrote:
> On Fri, Mar 01, 2019 at 10:47:51PM +0100, speck for Thomas Gleixner wrote:
> > Subject: [patch V6 13/14] Documentation: Move L1TF to separate directory
> > From: Thomas Gleixner <tglx@linutronix.de>
> >
> > Move L!TF to a separate directory so the MDS stuff can be added at the
> > side. Otherwise all hardware vulnerabilities have their own top level
> > entry. Should have done that right away.
> >
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>
> $ git grep admin-guide |grep l1tf |grep -v hw-vuln
> Documentation/ABI/testing/sysfs-devices-system-cpu: Documentation/admin-guide/l1tf.rst
> Documentation/admin-guide/kernel-parameters.txt: For details see: Documentation/admin-guide/l1tf.rst
> arch/x86/kernel/cpu/bugs.c: pr_info("Reading https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html might help you decide.\n");
> arch/x86/kvm/vmx/vmx.c:#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
> arch/x86/kvm/vmx/vmx.c:#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
>
Ah. Indeed....
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [patch V6 13/14] MDS basics 13
2019-03-01 21:47 ` [patch V6 13/14] MDS basics 13 Thomas Gleixner
2019-03-03 4:01 ` [MODERATED] " Josh Poimboeuf
@ 2019-03-05 16:43 ` mark gross
1 sibling, 0 replies; 89+ messages in thread
From: mark gross @ 2019-03-05 16:43 UTC (permalink / raw)
To: speck
On Fri, Mar 01, 2019 at 10:47:51PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V6 13/14] Documentation: Move L1TF to separate directory
> From: Thomas Gleixner <tglx@linutronix.de>
>
> Move L!TF to a separate directory so the MDS stuff can be added at the
s/L!TF/L1TF
--mark
^ permalink raw reply [flat|nested] 89+ messages in thread
* [patch V6 14/14] MDS basics 14
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (12 preceding siblings ...)
2019-03-01 21:47 ` [patch V6 13/14] MDS basics 13 Thomas Gleixner
@ 2019-03-01 21:47 ` Thomas Gleixner
2019-03-01 23:48 ` [patch V6 00/14] MDS basics 0 Thomas Gleixner
2019-03-04 5:30 ` [MODERATED] Encrypted Message Jon Masters
15 siblings, 0 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 21:47 UTC (permalink / raw)
To: speck
Subject: [patch V6 14/14] Documentation: Add MDS vulnerability documentation
From: Thomas Gleixner <tglx@linutronix.de>
Add the initial MDS vulnerability documentation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5 --> V6: Fix the protection matrix, minor tweaks vs. idle mitigation
and MSBDS only systems.
V4 --> V5: Remove 'auto' option. Adjust virt mitigation info.
V1 --> V4: Added the missing pieces
---
Documentation/admin-guide/hw-vuln/index.rst | 1
Documentation/admin-guide/hw-vuln/l1tf.rst | 1
Documentation/admin-guide/hw-vuln/mds.rst | 307 ++++++++++++++++++++++++++++
3 files changed, 309 insertions(+)
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -10,3 +10,4 @@ are configurable at compile, boot or run
:maxdepth: 1
l1tf
+ mds
--- a/Documentation/admin-guide/hw-vuln/l1tf.rst
+++ b/Documentation/admin-guide/hw-vuln/l1tf.rst
@@ -445,6 +445,7 @@ The default is 'cond'. If 'l1tf=full,for
line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
module parameter is ignored and writes to the sysfs file are rejected.
+.. _mitigation_selection:
Mitigation selection guide
--------------------------
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/mds.rst
@@ -0,0 +1,307 @@
+MDS - Microarchitectural Data Sampling
+======================================
+
+Microarchitectural Data Sampling is a hardware vulnerability which allows
+unprivileged speculative access to data which is available in various CPU
+internal buffers.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors. The
+vulnerability is not present on:
+
+ - Processors from AMD, Centaur and other non Intel vendors
+
+ - Older processor models, where the CPU family is < 6
+
+ - Some Atoms (Bonnell, Saltwell, Goldmont, GoldmontPlus)
+
+ - Intel processors which have the ARCH_CAP_MDS_NO bit set in the
+ IA32_ARCH_CAPABILITIES MSR.
+
+Whether a processor is affected or not can be read out from the MDS
+vulnerability file in sysfs. See :ref:`mds_sys_info`.
+
+Not all processors are affected by all variants of MDS, but the mitigation
+is identical for all of them so the kernel treats them as a single
+vulnerability.
+
+Related CVEs
+------------
+
+The following CVE entries are related to the MDS vulnerability:
+
+ ============== ===== ==============================================
+ CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling
+ CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling
+ CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling
+ ============== ===== ==============================================
+
+Problem
+-------
+
+When performing store, load or L1 refill operations, processors write data
+into temporary microarchitectural structures (buffers). The data in the
+buffer can be forwarded to load operations as an optimization.
+
+Under certain conditions, usually a fault/assist caused by a load
+operation, data unrelated to the load memory address can be speculatively
+forwarded from the buffers. Because the load operation causes a fault or
+assist and its result will be discarded, the forwarded data will not cause
+incorrect program execution or state changes. But a malicious operation
+may be able to forward this speculative data to a disclosure gadget which
+in turn allows inferring the value via a cache side channel attack.
+
+Because the buffers are potentially shared between Hyper-Threads, cross
+Hyper-Thread attacks are possible.
+
+Deeper technical information is available in the MDS specific x86
+architecture section: :ref:`Documentation/x86/mds.rst <mds>`.
+
+
+Attack scenarios
+----------------
+
+Attacks against the MDS vulnerabilities can be mounted from malicious,
+non-privileged user space applications running on hosts or guests.
+Malicious guest OSes can obviously mount attacks as well.
+
+Contrary to other speculation based vulnerabilities the MDS vulnerability
+does not allow the attacker to control the memory target address. As a
+consequence the attacks are purely sampling based, but as demonstrated
+with the TLBleed attack, samples can be postprocessed successfully.
+
+Web-Browsers
+^^^^^^^^^^^^
+
+ It's unclear whether attacks through Web-Browsers are possible at
+ all. The exploitation through JavaScript is considered very unlikely,
+ but other widely used web technologies like WebAssembly could possibly
+ be abused.
+
+
+.. _mds_sys_info:
+
+MDS system information
+-----------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current MDS
+status of the system: whether the system is vulnerable, and which
+mitigations are active. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/mds
+
+The possible values in this file are:
+
+ ========================================= =================================
+ 'Not affected' The processor is not vulnerable
+
+ 'Vulnerable' The processor is vulnerable,
+ but no mitigation enabled
+
+ 'Vulnerable: Clear CPU buffers attempted' The processor is vulnerable but
+ microcode is not updated.
+ The mitigation is enabled on a
+ best effort basis.
+ See :ref:`vmwerv`
+
+ 'Mitigation: CPU buffer clear' The processor is vulnerable and the
+ CPU buffer clearing mitigation is
+ enabled.
+ ========================================= =================================
+
+If the processor is vulnerable then the following information is appended
+to the above information:
+
+ ======================== ============================================
+ 'SMT vulnerable' SMT is enabled
+ 'SMT mitigated' SMT is enabled and mitigated
+ 'SMT disabled' SMT is disabled
+ 'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown
+ ======================== ============================================
+
+.. _vmwerv:
+
+Best effort mitigation mode
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ If the processor is vulnerable, but the availability of the microcode based
+ mitigation mechanism is not advertised via CPUID, the kernel selects a best
+ effort mitigation mode. This mode invokes the mitigation instructions
+ without a guarantee that they clear the CPU buffers.
+
+ This is done to address virtualization scenarios where the host has the
+ microcode update applied, but the hypervisor is not yet updated to expose
+ the CPUID to the guest. If the host has updated microcode the protection
+ takes effect; otherwise a few CPU cycles are wasted pointlessly.
+
+ The state in the mds sysfs file reflects this situation accordingly.
+
+
+Mitigation mechanism
+-------------------------
+
+The kernel detects the affected CPUs and the presence of the required
+microcode.
+
+If a CPU is affected and the microcode is available, then the kernel
+enables the mitigation by default. The mitigation can be controlled at boot
+time via a kernel command line option. See
+:ref:`mds_mitigation_control_command_line`.
+
+.. _cpu_buffer_clear:
+
+CPU buffer clearing
+^^^^^^^^^^^^^^^^^^^
+
+ The mitigation for MDS clears the affected CPU buffers on return to user
+ space and when entering a guest.
+
+ If SMT is enabled it also clears the buffers on idle entry when the CPU
+ is only affected by MSBDS and not any other MDS variant, because the
+ other variants cannot be protected against cross Hyper-Thread attacks.
+
+ For CPUs which are only affected by MSBDS the user space, guest and idle
+ transition mitigations are sufficient and SMT does not need to be disabled.
+
+.. _virt_mechanism:
+
+Virtualization mitigation
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The protection for host to guest transition depends on the L1TF
+ vulnerability of the CPU:
+
+ - CPU is affected by L1TF:
+
+ If the L1D flush mitigation is enabled and up to date microcode is
+ available, the L1D flush mitigation is automatically protecting the
+ guest transition.
+
+ If the L1D flush mitigation is disabled then the MDS mitigation is
+ invoked explicitly when the host MDS mitigation is enabled.
+
+ For details on L1TF and virtualization see:
+ :ref:`Documentation/admin-guide/hw-vuln//l1tf.rst <mitigation_control_kvm>`.
+
+ - CPU is not affected by L1TF:
+
+ CPU buffers are flushed before entering the guest when the host MDS
+ mitigation is enabled.
+
+ The resulting MDS protection matrix for the host to guest transition:
+
+ ============ ===== ============= ============ =================
+ L1TF MDS VMX-L1FLUSH Host MDS MDS-State
+
+ Don't care No Don't care N/A Not affected
+
+ Yes Yes Disabled Off Vulnerable
+
+ Yes Yes Disabled Full Mitigated
+
+ Yes Yes Enabled Don't care Mitigated
+
+ No Yes N/A Off Vulnerable
+
+ No Yes N/A Full Mitigated
+ ============ ===== ============= ============ =================
+
+ This only covers the host to guest transition, i.e. prevents leakage from
+ host to guest, but does not protect the guest internally. Guests need to
+ have their own protections.
+
+.. _xeon_phi:
+
+XEON PHI specific considerations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The XEON PHI processor family is affected by MSBDS which can be exploited
+ cross Hyper-Threads when entering idle states. Some XEON PHI variants allow
+ to use MWAIT in user space (Ring 3) which opens a potential attack vector
+ for malicious user space. The exposure can be disabled on the kernel
+ command line with the 'ring3mwait=disable' command line option.
+
+ XEON PHI is not affected by the other MDS variants and MSBDS is mitigated
+ before the CPU enters an idle state. As XEON PHI is not affected by L1TF
+ either, disabling SMT is not required for full protection.
+
+.. _mds_smt_control:
+
+SMT control
+^^^^^^^^^^^
+
+ All MDS variants except MSBDS can be attacked cross Hyper-Threads. That
+ means on CPUs which are affected by MFBDS or MLPDS it is necessary to
+ disable SMT for full protection. These are most of the affected CPUs; the
+ exception is XEON PHI, see :ref:`xeon_phi`.
+
+ Disabling SMT can have a significant performance impact, but the impact
+ depends on the type of workloads.
+
+ See the relevant chapter in the L1TF mitigation documentation for details:
+ :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.
+
+
+.. _mds_mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows controlling the MDS mitigations at boot
+time with the option "mds=". The valid arguments for this option are:
+
+ ============ =============================================================
+ full If the CPU is vulnerable, enable all available mitigations
+ for the MDS vulnerability, CPU buffer clearing on exit to
+ userspace and when entering a VM. Idle transitions are
+ protected as well if SMT is enabled.
+
+ It does not automatically disable SMT.
+
+ off Disables MDS mitigations completely.
+
+ ============ =============================================================
+
+Not specifying this option is equivalent to "mds=full".
+
+
+Mitigation selection guide
+--------------------------
+
+1. Trusted userspace
+^^^^^^^^^^^^^^^^^^^^
+
+ If all userspace applications are from a trusted source and do not
+ execute untrusted code which is supplied externally, then the mitigation
+ can be disabled.
+
+
+2. Virtualization with trusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The same considerations as for trusted user space above apply.
+
+3. Virtualization with untrusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The protection depends on the state of the L1TF mitigations.
+ See :ref:`virt_mechanism`.
+
+ If the MDS mitigation is enabled and SMT is disabled, guest to host and
+ guest to guest attacks are prevented.
+
+.. _mds_default_mitigations:
+
+Default mitigations
+-------------------
+
+ The kernel default mitigations for vulnerable processors are:
+
+ - Enable CPU buffer clearing
+
+ The kernel does not by default enforce the disabling of SMT, which leaves
+ SMT systems vulnerable when running untrusted code. The same rationale as
+ for L1TF applies.
+ See :ref:`Documentation/admin-guide/hw-vuln//l1tf.rst <default_mitigations>`.
^ permalink raw reply [flat|nested] 89+ messages in thread
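The "CPU buffer clearing" mechanism the document above refers to boils down
to executing the VERW instruction, whose MD_CLEAR-updated microcode semantics
flush the affected buffers. A sketch of the helper, close to what the series
adds on the x86 side (the exact name and location may differ from the final
patches):

/*
 * Sketch of the VERW based clearing helper: with MD_CLEAR capable
 * microcode, VERW with a valid selector clears the affected CPU
 * buffers as a side effect.
 */
static inline void mds_clear_cpu_buffers(void)
{
	static const u16 ds = __KERNEL_DS;

	/*
	 * Has to be the memory operand form; only that form is
	 * documented to trigger the buffer clearing side effect.
	 */
	asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
}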
* Re: [patch V6 00/14] MDS basics 0
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (13 preceding siblings ...)
2019-03-01 21:47 ` [patch V6 14/14] MDS basics 14 Thomas Gleixner
@ 2019-03-01 23:48 ` Thomas Gleixner
2019-03-04 5:30 ` [MODERATED] Encrypted Message Jon Masters
15 siblings, 0 replies; 89+ messages in thread
From: Thomas Gleixner @ 2019-03-01 23:48 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/plain, Size: 127 bytes --]
On Fri, 1 Mar 2019, speck for Thomas Gleixner wrote:
>
> I'll send git bundles of the pile as well.
Attached.
Thanks,
tglx
[-- Attachment #2: Type: application/octet-stream, Size: 36555 bytes --]
[-- Attachment #3: Type: application/octet-stream, Size: 37958 bytes --]
[-- Attachment #4: Type: application/octet-stream, Size: 44988 bytes --]
[-- Attachment #5: Type: application/octet-stream, Size: 47246 bytes --]
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Encrypted Message
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (14 preceding siblings ...)
2019-03-01 23:48 ` [patch V6 00/14] MDS basics 0 Thomas Gleixner
@ 2019-03-04 5:30 ` Jon Masters
15 siblings, 0 replies; 89+ messages in thread
From: Jon Masters @ 2019-03-04 5:30 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 130 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 00/14] MDS basics 0
[-- Attachment #2: Type: text/plain, Size: 1408 bytes --]
On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
> Changes vs. V5:
>
> - Fix tools/ build (Josh)
>
> - Dropped the AIRMONT_MID change as it needs confirmation from Intel
>
> - Made the consolidated whitelist more readable and correct
>
> - Added the MSBDS only quirk for XEON PHI, made the idle flush
> depend on it and updated the sysfs output accordingly.
>
> - Fixed the protection matrix in the admin documentation and clarified
> the SMT situation vs. MSBDS only.
>
> - Updated the KVM/VMX changelog.
>
> Delta patch against V5 below.
>
> Available from git:
>
> cvs.ou.linutronix.de:linux/speck/linux WIP.mds
>
> The linux-4.20.y, linux-4.19.y and linux-4.14.y branches are updated as
> well and contain the untested backports of the pile for reference.
>
> I'll send git bundles of the pile as well.
Tested on Coffeelake with updated ucode successfully:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
stepping : 10
microcode : 0xae
[jcm@stephen ~]$ dmesg|grep MDS
[ 1.633165] MDS: Mitigation: Clear CPU buffers
[jcm@stephen ~]$ cat /sys/devices/system/cpu/vulnerabilities/mds
Mitigation: Clear CPU buffers; SMT vulnerable
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Starting to go public?
@ 2019-03-05 16:43 Linus Torvalds
2019-03-05 17:02 ` [MODERATED] " Andrew Cooper
2019-03-05 17:10 ` Jon Masters
0 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2019-03-05 16:43 UTC (permalink / raw)
To: speck
Looks like the papers are starting to leak:
https://arxiv.org/pdf/1903.00446.pdf
yes, yes, a lot of the attack seems to be about rowhammer, but the
"spolier" part looks like MDS.
Linus
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: Starting to go public?
2019-03-05 16:43 [MODERATED] Starting to go public? Linus Torvalds
@ 2019-03-05 17:02 ` Andrew Cooper
2019-03-05 20:36 ` Jiri Kosina
2019-03-05 17:10 ` Jon Masters
1 sibling, 1 reply; 89+ messages in thread
From: Andrew Cooper @ 2019-03-05 17:02 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/plain, Size: 598 bytes --]
On 05/03/2019 16:43, speck for Linus Torvalds wrote:
> Looks like the papers are starting to leak:
>
> https://arxiv.org/pdf/1903.00446.pdf
>
> yes, yes, a lot of the attack seems to be about rowhammer, but the
> "spolier" part looks like MDS.
So Intel was aware of that paper, but wasn't expecting it to go public
today.
From their point of view, it is a traditional timing sidechannel on a
piece of the pipeline (which happens to be a component which exists for
speculative memory disambiguation).
There are no proposed changes to the MDS timeline at this point.
~Andrew
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: Starting to go public?
2019-03-05 17:02 ` [MODERATED] " Andrew Cooper
@ 2019-03-05 20:36 ` Jiri Kosina
2019-03-05 22:31 ` Andrew Cooper
0 siblings, 1 reply; 89+ messages in thread
From: Jiri Kosina @ 2019-03-05 20:36 UTC (permalink / raw)
To: speck
On Tue, 5 Mar 2019, speck for Andrew Cooper wrote:
> > Looks like the papers are starting to leak:
> >
> > https://arxiv.org/pdf/1903.00446.pdf
> >
> > yes, yes, a lot of the attack seems to be about rowhammer, but the
> > "spoiler" part looks like MDS.
>
> So Intel was aware of that paper, but wasn't expecting it to go public
> today.
>
> From their point of view, it is a traditional timing sidechannel on a
> piece of the pipeline (which happens to be a component which exists for
> speculative memory disambiguation).
>
> There are no proposed changes to the MDS timeline at this point.
So this is not the paper that caused the panic fearing that PSF might leak
earlier than the rest of the issues in mid-February (which a few days later
Intel claimed to have successfully negotiated with the researchers not to
publish before the CRD)?
Thanks,
--
Jiri Kosina
SUSE Labs
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: Starting to go public?
2019-03-05 20:36 ` Jiri Kosina
@ 2019-03-05 22:31 ` Andrew Cooper
2019-03-06 16:18 ` [MODERATED] Encrypted Message Jon Masters
0 siblings, 1 reply; 89+ messages in thread
From: Andrew Cooper @ 2019-03-05 22:31 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/plain, Size: 1109 bytes --]
On 05/03/2019 20:36, speck for Jiri Kosina wrote:
> On Tue, 5 Mar 2019, speck for Andrew Cooper wrote:
>
>>> Looks like the papers are starting to leak:
>>>
>>> https://arxiv.org/pdf/1903.00446.pdf
>>>
>>> yes, yes, a lot of the attack seems to be about rowhammer, but the
>>> "spoiler" part looks like MDS.
>> So Intel was aware of that paper, but wasn't expecting it to go public
>> today.
>>
>> From their point of view, it is a traditional timing sidechannel on a
>> piece of the pipeline (which happens to be a component which exists for
>> speculative memory disambiguation).
>>
>> There are no proposed changes to the MDS timeline at this point.
> So this is not the paper that caused the panic fearing that PSF might leak
> earlier than the rest of the issues in mid-February (which a few days later
> Intel claimed to have successfully negotiated with the researchers not to
> publish before the CRD)?
Correct.
The incident you are referring to is a researcher who definitely found
PSF, contacted Intel and was initially displeased at the proposed embargo.
~Andrew
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Encrypted Message
2019-03-05 22:31 ` Andrew Cooper
@ 2019-03-06 16:18 ` Jon Masters
0 siblings, 0 replies; 89+ messages in thread
From: Jon Masters @ 2019-03-06 16:18 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 121 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Andrew Cooper <speck@linutronix.de>
Subject: Re: Starting to go public?
[-- Attachment #2: Type: text/plain, Size: 1380 bytes --]
On 3/5/19 5:31 PM, speck for Andrew Cooper wrote:
> On 05/03/2019 20:36, speck for Jiri Kosina wrote:
>> On Tue, 5 Mar 2019, speck for Andrew Cooper wrote:
>>
>>>> Looks like the papers are starting to leak:
>>>>
>>>> https://arxiv.org/pdf/1903.00446.pdf
>>>>
>>>> yes, yes, a lot of the attack seems to be about rowhammer, but the
>>>> "spoiler" part looks like MDS.
>>> So Intel was aware of that paper, but wasn't expecting it to go public
>>> today.
>>>
>>> From their point of view, it is a traditional timing sidechannel on a
>>> piece of the pipeline (which happens to be a component which exists for
>>> speculative memory disambiguation).
>>>
>>> There are no proposed changes to the MDS timeline at this point.
>> So this is not the paper that caused the panic fearing that PSF might leak
>> earlier than the rest of the issues in mid-February (which a few days later
>> Intel claimed to have successfully negotiated with the researchers not to
>> publish before the CRD)?
>
> Correct.
>
> The incident you are referring to is a researcher who definitely found
> PSF, contacted Intel and was initially displeased at the proposed embargo.
Indeed. There are at least three different teams with papers that read
on MDS, and all of them are holding to the embargo.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Encrypted Message
2019-03-05 16:43 [MODERATED] Starting to go public? Linus Torvalds
2019-03-05 17:02 ` [MODERATED] " Andrew Cooper
@ 2019-03-05 17:10 ` Jon Masters
1 sibling, 0 replies; 89+ messages in thread
From: Jon Masters @ 2019-03-05 17:10 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 135 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Linus Torvalds <speck@linutronix.de>
Subject: NOT PUBLIC - Re: Starting to go public?
[-- Attachment #2: Type: text/plain, Size: 796 bytes --]
On 3/5/19 11:43 AM, speck for Linus Torvalds wrote:
> Looks like the papers are starting to leak:
>
> https://arxiv.org/pdf/1903.00446.pdf
>
> yes, yes, a lot of the attack seems to be about rowhammer, but the
> "spolier" part looks like MDS.
It's not, but it is close to finding PSF behavior. The thing they found
is described separately in one of the original Intel store patents. So we
are at risk but should not panic.
I've spoken with several researchers sitting on MDS papers and confirmed
that they are NOT concerned at this stage. Of course everyone is
carefully watching, and that's why we need to have a contingency plan. People
will start looking in this area (I know of three teams doing so) now.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [PATCH RFC 0/4] Proposed cmdline improvements
@ 2019-03-04 1:21 Josh Poimboeuf
2019-03-04 1:23 ` [MODERATED] [PATCH RFC 1/4] 1 Josh Poimboeuf
` (2 more replies)
0 siblings, 3 replies; 89+ messages in thread
From: Josh Poimboeuf @ 2019-03-04 1:21 UTC (permalink / raw)
To: speck
For MDS and SMT, I'd propose that we do something similar to what we did
for L1TF: a) add an mds=full,nosmt option; and b) add a printk warning
if SMT is enabled. That's the first three patches.
The last patch proposes a meta-option which is intended to make it
easier for users to choose sane mitigation defaults for all the
speculative vulnerabilities at once.
Josh Poimboeuf (4):
x86/speculation/mds: Add mds=full,nosmt cmdline option
x86/speculation: Move arch_smt_update() call to after mitigation
decisions
x86/speculation/mds: Add SMT warning message
x86/speculation: Add 'cpu_spec_mitigations=' cmdline options
Documentation/admin-guide/hw-vuln/mds.rst | 3 +
.../admin-guide/kernel-parameters.txt | 49 ++++++++++++-
arch/powerpc/kernel/security.c | 6 +-
arch/powerpc/kernel/setup_64.c | 2 +-
arch/s390/kernel/nospec-branch.c | 4 +-
arch/x86/include/asm/processor.h | 2 +
arch/x86/kernel/cpu/bugs.c | 68 ++++++++++++++++---
arch/x86/mm/pti.c | 3 +-
include/linux/cpu.h | 8 +++
kernel/cpu.c | 15 ++++
10 files changed, 144 insertions(+), 16 deletions(-)
--
2.17.2
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [PATCH RFC 1/4] 1
2019-03-04 1:21 [MODERATED] [PATCH RFC 0/4] Proposed cmdline improvements Josh Poimboeuf
@ 2019-03-04 1:23 ` Josh Poimboeuf
2019-03-04 3:55 ` [MODERATED] Encrypted Message Jon Masters
2019-03-04 7:30 ` [MODERATED] Re: [PATCH RFC 1/4] 1 Greg KH
2019-03-04 1:24 ` [MODERATED] [PATCH RFC 3/4] 3 Josh Poimboeuf
2019-03-04 1:25 ` [MODERATED] [PATCH RFC 4/4] 4 Josh Poimboeuf
2 siblings, 2 replies; 89+ messages in thread
From: Josh Poimboeuf @ 2019-03-04 1:23 UTC (permalink / raw)
To: speck
From: Josh Poimboeuf <jpoimboe@redhat.com>
Subject: [PATCH RFC 1/4] x86/speculation/mds: Add mds=full,nosmt cmdline
option
Add the mds=full,nosmt cmdline option. This is like mds=full, but with
SMT disabled if the CPU is vulnerable.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
Documentation/admin-guide/hw-vuln/mds.rst | 3 +++
Documentation/admin-guide/kernel-parameters.txt | 6 ++++--
arch/x86/kernel/cpu/bugs.c | 10 ++++++++++
3 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst
index 1de29d28903d..244ab47d1fb3 100644
--- a/Documentation/admin-guide/hw-vuln/mds.rst
+++ b/Documentation/admin-guide/hw-vuln/mds.rst
@@ -260,6 +260,9 @@ time with the option "mds=". The valid arguments for this option are:
It does not automatically disable SMT.
+ full,nosmt The same as mds=full, with SMT disabled on vulnerable
+ CPUs. This is the complete mitigation.
+
off Disables MDS mitigations completely.
============ =============================================================
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index dddb024eb523..55969f240f2e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2372,8 +2372,10 @@
This parameter controls the MDS mitigation. The
options are:
- full - Enable MDS mitigation on vulnerable CPUs
- off - Unconditionally disable MDS mitigation
+ full - Enable MDS mitigation on vulnerable CPUs
+ full,nosmt - Enable MDS mitigation and disable
+ SMT on vulnerable CPUs
+ off - Unconditionally disable MDS mitigation
Not specifying this option is equivalent to
mds=full.
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index e11654f93e71..0c71ab0d57e3 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -221,6 +221,7 @@ static void x86_amd_ssb_disable(void)
/* Default mitigation for L1TF-affected CPUs */
static enum mds_mitigations mds_mitigation __ro_after_init = MDS_MITIGATION_FULL;
+static bool mds_nosmt __ro_after_init = false;
static const char * const mds_strings[] = {
[MDS_MITIGATION_OFF] = "Vulnerable",
@@ -238,8 +239,13 @@ static void mds_select_mitigation(void)
if (mds_mitigation == MDS_MITIGATION_FULL) {
if (!boot_cpu_has(X86_FEATURE_MD_CLEAR))
mds_mitigation = MDS_MITIGATION_VMWERV;
+
static_branch_enable(&mds_user_clear);
+
+ if (mds_nosmt && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
+ cpu_smt_disable(false);
}
+
pr_info("%s\n", mds_strings[mds_mitigation]);
}
@@ -255,6 +261,10 @@ static int __init mds_cmdline(char *str)
mds_mitigation = MDS_MITIGATION_OFF;
else if (!strcmp(str, "full"))
mds_mitigation = MDS_MITIGATION_FULL;
+ else if (!strcmp(str, "full,nosmt")) {
+ mds_mitigation = MDS_MITIGATION_FULL;
+ mds_nosmt = true;
+ }
return 0;
}
--
2.17.2
^ permalink raw reply related [flat|nested] 89+ messages in thread
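For context: the mds_cmdline() handler extended by this patch is hooked into
boot command line processing with early_param(). A condensed sketch of the
resulting handler, reproduced from memory (the registration line sits outside
the hunk and details may differ):

static int __init mds_cmdline(char *str)
{
	if (!boot_cpu_has_bug(X86_BUG_MDS))
		return 0;

	if (!str)
		return -EINVAL;

	if (!strcmp(str, "off"))
		mds_mitigation = MDS_MITIGATION_OFF;
	else if (!strcmp(str, "full"))
		mds_mitigation = MDS_MITIGATION_FULL;
	else if (!strcmp(str, "full,nosmt")) {
		mds_mitigation = MDS_MITIGATION_FULL;
		mds_nosmt = true;
	}

	return 0;
}
early_param("mds", mds_cmdline);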
* [MODERATED] Encrypted Message
2019-03-04 1:23 ` [MODERATED] [PATCH RFC 1/4] 1 Josh Poimboeuf
@ 2019-03-04 3:55 ` Jon Masters
2019-03-04 7:30 ` [MODERATED] Re: [PATCH RFC 1/4] 1 Greg KH
1 sibling, 0 replies; 89+ messages in thread
From: Jon Masters @ 2019-03-04 3:55 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 117 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: [PATCH RFC 1/4] 1
[-- Attachment #2: Type: text/plain, Size: 1069 bytes --]
On 3/3/19 8:23 PM, speck for Josh Poimboeuf wrote:
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index e11654f93e71..0c71ab0d57e3 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -221,6 +221,7 @@ static void x86_amd_ssb_disable(void)
>
> /* Default mitigation for L1TF-affected CPUs */
> static enum mds_mitigations mds_mitigation __ro_after_init = MDS_MITIGATION_FULL;
> +static bool mds_nosmt __ro_after_init = false;
>
> static const char * const mds_strings[] = {
> [MDS_MITIGATION_OFF] = "Vulnerable",
> @@ -238,8 +239,13 @@ static void mds_select_mitigation(void)
> if (mds_mitigation == MDS_MITIGATION_FULL) {
> if (!boot_cpu_has(X86_FEATURE_MD_CLEAR))
> mds_mitigation = MDS_MITIGATION_VMWERV;
> +
> static_branch_enable(&mds_user_clear);
> +
> + if (mds_nosmt && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
> + cpu_smt_disable(false);
Is there some logic missing here to disable SMT?
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
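For reference, cpu_smt_disable(false) is the SMT disabling logic the patch
relies on: it flips the global SMT control state so that the CPU hotplug
code refuses to online non-primary siblings. A paraphrased sketch of that
helper from kernel/cpu.c (from memory; details may differ):

void __init cpu_smt_disable(bool force)
{
	if (cpu_smt_control == CPU_SMT_FORCE_DISABLED ||
	    cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
		return;

	if (force) {
		/* 'forceoff': cannot be undone via sysfs */
		pr_info("SMT: Force disabled\n");
		cpu_smt_control = CPU_SMT_FORCE_DISABLED;
	} else {
		pr_info("SMT: disabled\n");
		cpu_smt_control = CPU_SMT_DISABLED;
	}
}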
* [MODERATED] Re: [PATCH RFC 1/4] 1
2019-03-04 1:23 ` [MODERATED] [PATCH RFC 1/4] 1 Josh Poimboeuf
2019-03-04 3:55 ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-04 7:30 ` Greg KH
2019-03-04 7:45 ` [MODERATED] Encrypted Message Jon Masters
1 sibling, 1 reply; 89+ messages in thread
From: Greg KH @ 2019-03-04 7:30 UTC (permalink / raw)
To: speck
On Sun, Mar 03, 2019 at 07:23:22PM -0600, speck for Josh Poimboeuf wrote:
> From: Josh Poimboeuf <jpoimboe@redhat.com>
> Subject: [PATCH RFC 1/4] x86/speculation/mds: Add mds=full,nosmt cmdline
> option
>
> Add the mds=full,nosmt cmdline option. This is like mds=full, but with
> SMT disabled if the CPU is vulnerable.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> ---
> Documentation/admin-guide/hw-vuln/mds.rst | 3 +++
> Documentation/admin-guide/kernel-parameters.txt | 6 ++++--
> arch/x86/kernel/cpu/bugs.c | 10 ++++++++++
> 3 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst
> index 1de29d28903d..244ab47d1fb3 100644
> --- a/Documentation/admin-guide/hw-vuln/mds.rst
> +++ b/Documentation/admin-guide/hw-vuln/mds.rst
> @@ -260,6 +260,9 @@ time with the option "mds=". The valid arguments for this option are:
>
> It does not automatically disable SMT.
>
> + full,nosmt The same as mds=full, with SMT disabled on vulnerable
> + CPUs. This is the complete mitigation.
While I understand the intention, the number of different combinations
we are "offering" to userspace here is huge, and everyone is going to be
confused as to what to do. If we really think/say that SMT is a major
issue for this, why don't we just have "full" disable SMT?
thanks,
greg k-h
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Encrypted Message
2019-03-04 7:30 ` [MODERATED] Re: [PATCH RFC 1/4] 1 Greg KH
@ 2019-03-04 7:45 ` Jon Masters
0 siblings, 0 replies; 89+ messages in thread
From: Jon Masters @ 2019-03-04 7:45 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 110 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Greg KH <speck@linutronix.de>
Subject: Re: [PATCH RFC 1/4] 1
[-- Attachment #2: Type: text/plain, Size: 1867 bytes --]
On 3/4/19 2:30 AM, speck for Greg KH wrote:
> On Sun, Mar 03, 2019 at 07:23:22PM -0600, speck for Josh Poimboeuf wrote:
>> From: Josh Poimboeuf <jpoimboe@redhat.com>
>> Subject: [PATCH RFC 1/4] x86/speculation/mds: Add mds=full,nosmt cmdline
>> option
>>
>> Add the mds=full,nosmt cmdline option. This is like mds=full, but with
>> SMT disabled if the CPU is vulnerable.
>>
>> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
>> ---
>> Documentation/admin-guide/hw-vuln/mds.rst | 3 +++
>> Documentation/admin-guide/kernel-parameters.txt | 6 ++++--
>> arch/x86/kernel/cpu/bugs.c | 10 ++++++++++
>> 3 files changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst
>> index 1de29d28903d..244ab47d1fb3 100644
>> --- a/Documentation/admin-guide/hw-vuln/mds.rst
>> +++ b/Documentation/admin-guide/hw-vuln/mds.rst
>> @@ -260,6 +260,9 @@ time with the option "mds=". The valid arguments for this option are:
>>
>> It does not automatically disable SMT.
>>
>> + full,nosmt The same as mds=full, with SMT disabled on vulnerable
>> + CPUs. This is the complete mitigation.
>
> While I understand the intention, the number of different combinations
> we are "offering" to userspace here is huge, and everyone is going to be
> confused as to what to do. If we really think/say that SMT is a major
> issue for this, why don't we just have "full" disable SMT?
Frankly, it ought to, for safety (it can't be made safe). The reason cited
for not doing so (Thomas and Linus can speak up on this part) was
upgrades vs. new installs: the concern was not to break existing users by
halving their logical CPU count when they upgrade a kernel.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [PATCH RFC 3/4] 3
2019-03-04 1:21 [MODERATED] [PATCH RFC 0/4] Proposed cmdline improvements Josh Poimboeuf
2019-03-04 1:23 ` [MODERATED] [PATCH RFC 1/4] 1 Josh Poimboeuf
@ 2019-03-04 1:24 ` Josh Poimboeuf
2019-03-04 3:58 ` [MODERATED] Encrypted Message Jon Masters
2019-03-04 1:25 ` [MODERATED] [PATCH RFC 4/4] 4 Josh Poimboeuf
2 siblings, 1 reply; 89+ messages in thread
From: Josh Poimboeuf @ 2019-03-04 1:24 UTC (permalink / raw)
To: speck
From: Josh Poimboeuf <jpoimboe@redhat.com>
Subject: [PATCH RFC 3/4] x86/speculation/mds: Add SMT warning message
MDS remains exploitable across SMT siblings, so the mitigation is
incomplete while SMT is enabled. Make that clear with a one-time printk
whenever SMT first gets enabled.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
arch/x86/kernel/cpu/bugs.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 9e20aef01d38..346f0f05879d 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -691,6 +691,8 @@ static void update_mds_branch_idle(void)
static_branch_disable(&mds_idle_clear);
}
+#define MDS_MSG_SMT "MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.\n"
+
void arch_smt_update(void)
{
/* Enhanced IBRS implies STIBP. No update required. */
@@ -715,6 +717,8 @@ void arch_smt_update(void)
switch(mds_mitigation) {
case MDS_MITIGATION_FULL:
case MDS_MITIGATION_VMWERV:
+ if (sched_smt_active() && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
+ pr_warn_once(MDS_MSG_SMT);
update_mds_branch_idle();
break;
case MDS_MITIGATION_OFF:
--
2.17.2
^ permalink raw reply related [flat|nested] 89+ messages in thread
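The update_mds_branch_idle() helper referenced by the hunk above is part of
the base series; it flips the mds_idle_clear static key with the SMT state,
roughly like this (sketch, reconstructed from the diff context above):

static void update_mds_branch_idle(void)
{
	if (!boot_cpu_has_bug(X86_BUG_MDS))
		return;

	/* Idle clearing is only useful while sibling threads exist */
	if (sched_smt_active())
		static_branch_enable(&mds_idle_clear);
	else
		static_branch_disable(&mds_idle_clear);
}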
* [MODERATED] Encrypted Message
2019-03-04 1:24 ` [MODERATED] [PATCH RFC 3/4] 3 Josh Poimboeuf
@ 2019-03-04 3:58 ` Jon Masters
2019-03-04 17:17 ` [MODERATED] " Josh Poimboeuf
0 siblings, 1 reply; 89+ messages in thread
From: Jon Masters @ 2019-03-04 3:58 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 117 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: [PATCH RFC 3/4] 3
[-- Attachment #2: Type: text/plain, Size: 445 bytes --]
On 3/3/19 8:24 PM, speck for Josh Poimboeuf wrote:
> + if (sched_smt_active() && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
> + pr_warn_once(MDS_MSG_SMT);
It's never fully safe to use SMT. I get that if we only had MSBDS then
it's unlikely we'll hit e.g. the power state change cases needed to
exploit it, but I think it would be prudent to display something anyway?
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: Encrypted Message
2019-03-04 3:58 ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-04 17:17 ` Josh Poimboeuf
2019-03-06 16:22 ` [MODERATED] " Jon Masters
0 siblings, 1 reply; 89+ messages in thread
From: Josh Poimboeuf @ 2019-03-04 17:17 UTC (permalink / raw)
To: speck
On Sun, Mar 03, 2019 at 10:58:01PM -0500, speck for Jon Masters wrote:
> On 3/3/19 8:24 PM, speck for Josh Poimboeuf wrote:
>
> > + if (sched_smt_active() && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
> > + pr_warn_once(MDS_MSG_SMT);
>
> It's never fully safe to use SMT. I get that if we only had MSBDS then
> it's unlikely we'll hit e.g. the power state change cases needed to
> exploit it, but I think it would be prudent to display something anyway?
My understanding is that the idle state changes are mitigated elsewhere
in the MDS patches, so it should be safe in theory.
--
Josh
^ permalink raw reply [flat|nested] 89+ messages in thread
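The "elsewhere" is the idle entry hook of the base MDS series: with the
mds_idle_clear key enabled, the idle path flushes the CPU buffers via VERW
before going to sleep. Roughly (a sketch of the series' helpers; the exact
form in the tree may differ):

static inline void mds_clear_cpu_buffers(void)
{
	static const u16 ds = __KERNEL_DS;

	/*
	 * The memory-operand form of VERW is the one documented to
	 * flush the buffers; "cc" because VERW clobbers ZF.
	 */
	asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
}

static inline void mds_idle_clear_cpu_buffers(void)
{
	if (static_branch_likely(&mds_idle_clear))
		mds_clear_cpu_buffers();
}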
* [MODERATED] Encrypted Message
2019-03-04 17:17 ` [MODERATED] " Josh Poimboeuf
@ 2019-03-06 16:22 ` Jon Masters
0 siblings, 0 replies; 89+ messages in thread
From: Jon Masters @ 2019-03-06 16:22 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 117 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: Encrypted Message
[-- Attachment #2: Type: text/plain, Size: 778 bytes --]
On 3/4/19 12:17 PM, speck for Josh Poimboeuf wrote:
> On Sun, Mar 03, 2019 at 10:58:01PM -0500, speck for Jon Masters wrote:
>
>> On 3/3/19 8:24 PM, speck for Josh Poimboeuf wrote:
>>
>>> + if (sched_smt_active() && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
>>> + pr_warn_once(MDS_MSG_SMT);
>>
>> It's never fully safe to use SMT. I get that if we only had MSBDS then
>> it's unlikely we'll hit e.g. the power state change cases needed to
>> exploit it, but I think it would be prudent to display something anyway?
>
> My understanding is that the idle state changes are mitigated elsewhere
> in the MDS patches, so it should be safe in theory.
Looked at it again. Agree. Sorry about that.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [PATCH RFC 4/4] 4
2019-03-04 1:21 [MODERATED] [PATCH RFC 0/4] Proposed cmdline improvements Josh Poimboeuf
2019-03-04 1:23 ` [MODERATED] [PATCH RFC 1/4] 1 Josh Poimboeuf
2019-03-04 1:24 ` [MODERATED] [PATCH RFC 3/4] 3 Josh Poimboeuf
@ 2019-03-04 1:25 ` Josh Poimboeuf
2019-03-04 4:07 ` [MODERATED] Encrypted Message Jon Masters
2 siblings, 1 reply; 89+ messages in thread
From: Josh Poimboeuf @ 2019-03-04 1:25 UTC (permalink / raw)
To: speck
From: Josh Poimboeuf <jpoimboe@redhat.com>
Subject: [PATCH RFC 4/4] x86/speculation: Add 'cpu_spec_mitigations=' cmdline
options
Keeping track of the number of mitigations for all the CPU speculation
bugs has become overwhelming for many users. It's getting more and more
complicated to decide what mitigations are needed for a given
architecture.
Most users fall into a few basic categories:
- want all mitigations off;
- want all reasonable mitigations on, with SMT enabled even if it's
vulnerable; or
- want all reasonable mitigations on, with SMT disabled if vulnerable.
Define a set of curated, arch-independent options, each of which is an
aggregation of existing options:
- cpu_spec_mitigations=off: Disable all mitigations.
- cpu_spec_mitigations=auto: [default] Enable all the default mitigations,
but leave SMT enabled, even if it's vulnerable.
- cpu_spec_mitigations=auto,nosmt: Enable all the default mitigations,
disabling SMT if needed by a mitigation.
See the documentation for more details.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
.../admin-guide/kernel-parameters.txt | 43 ++++++++++++++++
arch/powerpc/kernel/security.c | 6 +--
arch/powerpc/kernel/setup_64.c | 2 +-
arch/s390/kernel/nospec-branch.c | 4 +-
arch/x86/include/asm/processor.h | 2 +
arch/x86/kernel/cpu/bugs.c | 51 ++++++++++++++++---
arch/x86/mm/pti.c | 3 +-
include/linux/cpu.h | 8 +++
kernel/cpu.c | 15 ++++++
9 files changed, 122 insertions(+), 12 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 55969f240f2e..c2dba60630e4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2537,6 +2537,49 @@
in the "bleeding edge" mini2440 support kernel at
http://repo.or.cz/w/linux-2.6/mini2440.git
+ cpu_spec_mitigations=
+ [KNL] Control mitigations for CPU speculation
+ vulnerabilities on affected CPUs. This is a set of
+ curated, arch-independent options, each of which is an
+ aggregation of existing options.
+
+ off
+ Disable all speculative CPU mitigations.
+ Equivalent to: nopti
+ nospectre_v1
+ nospectre_v2
+ spectre_v2_user=off
+ nobp=0
+ spec_store_bypass_disable=off
+ l1tf=off
+ mds=off
+
+ auto (default)
+ Mitigate all speculative CPU vulnerabilities,
+ but leave SMT enabled, even if it's vulnerable.
+ This is useful for users who don't want to be
+ surprised by SMT getting disabled across kernel
+ upgrades, or who have other ways of avoiding
+ SMT-based attacks.
+ Equivalent to: pti=auto
+ spectre_v2=auto
+ spectre_v2_user=auto
+ spec_store_bypass_disable=auto
+ l1tf=flush
+ mds=full
+
+ auto,nosmt
+ Mitigate all speculative CPU vulnerabilities,
+ disabling SMT if needed. This is for users who
+ always want to be fully mitigated, even if it
+ means losing SMT.
+ Equivalent to: pti=auto
+ spectre_v2=auto
+ spectre_v2_user=auto
+ spec_store_bypass_disable=auto
+ l1tf=flush,nosmt
+ mds=full,nosmt
+
mminit_loglevel=
[KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
parameter allows control of the logging verbosity for
diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index 9b8631533e02..be4266a57e54 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -57,7 +57,7 @@ void setup_barrier_nospec(void)
enable = security_ftr_enabled(SEC_FTR_FAVOUR_SECURITY) &&
security_ftr_enabled(SEC_FTR_BNDS_CHK_SPEC_BAR);
- if (!no_nospec)
+ if (!no_nospec && cpu_spec_mitigations != CPU_SPEC_MITIGATIONS_OFF)
enable_barrier_nospec(enable);
}
@@ -116,7 +116,7 @@ static int __init handle_nospectre_v2(char *p)
early_param("nospectre_v2", handle_nospectre_v2);
void setup_spectre_v2(void)
{
- if (no_spectrev2)
+ if (no_spectrev2 || cpu_spec_mitigations == CPU_SPEC_MITIGATIONS_OFF)
do_btb_flush_fixups();
else
btb_flush_enabled = true;
@@ -307,7 +307,7 @@ void setup_stf_barrier(void)
stf_enabled_flush_types = type;
- if (!no_stf_barrier)
+ if (!no_stf_barrier && cpu_spec_mitigations != CPU_SPEC_MITIGATIONS_OFF)
stf_barrier_enable(enable);
}
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 236c1151a3a7..5fe43bcde325 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -958,7 +958,7 @@ void setup_rfi_flush(enum l1d_flush_type types, bool enable)
enabled_flush_types = types;
- if (!no_rfi_flush)
+ if (!no_rfi_flush && cpu_spec_mitigations != CPU_SPEC_MITIGATIONS_OFF)
rfi_flush_enable(enable);
}
diff --git a/arch/s390/kernel/nospec-branch.c b/arch/s390/kernel/nospec-branch.c
index bdddaae96559..c40eb672b43a 100644
--- a/arch/s390/kernel/nospec-branch.c
+++ b/arch/s390/kernel/nospec-branch.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/module.h>
#include <linux/device.h>
+#include <linux/cpu.h>
#include <asm/nospec-branch.h>
static int __init nobp_setup_early(char *str)
@@ -58,7 +59,8 @@ early_param("nospectre_v2", nospectre_v2_setup_early);
void __init nospec_auto_detect(void)
{
- if (test_facility(156)) {
+ if (test_facility(156) ||
+ cpu_spec_mitigations == CPU_SPEC_MITIGATIONS_OFF) {
/*
* The machine supports etokens.
* Disable expolines and disable nobp.
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index aca1ef8cc79f..bb2ced3a491e 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -983,6 +983,7 @@ void microcode_check(void);
enum l1tf_mitigations {
L1TF_MITIGATION_OFF,
+ L1TF_MITIGATION_DEFAULT,
L1TF_MITIGATION_FLUSH_NOWARN,
L1TF_MITIGATION_FLUSH,
L1TF_MITIGATION_FLUSH_NOSMT,
@@ -994,6 +995,7 @@ extern enum l1tf_mitigations l1tf_mitigation;
enum mds_mitigations {
MDS_MITIGATION_OFF,
+ MDS_MITIGATION_DEFAULT,
MDS_MITIGATION_FULL,
MDS_MITIGATION_VMWERV,
};
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 346f0f05879d..7354daf3555f 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -222,7 +222,7 @@ static void x86_amd_ssb_disable(void)
#define pr_fmt(fmt) "MDS: " fmt
/* Default mitigation for L1TF-affected CPUs */
-static enum mds_mitigations mds_mitigation __ro_after_init = MDS_MITIGATION_FULL;
+static enum mds_mitigations mds_mitigation __ro_after_init = MDS_MITIGATION_DEFAULT;
static bool mds_nosmt __ro_after_init = false;
static const char * const mds_strings[] = {
@@ -238,6 +238,20 @@ static void mds_select_mitigation(void)
return;
}
+ if (mds_mitigation == MDS_MITIGATION_DEFAULT) {
+ switch (cpu_spec_mitigations) {
+ case CPU_SPEC_MITIGATIONS_OFF:
+ mds_mitigation = MDS_MITIGATION_OFF;
+ break;
+ case CPU_SPEC_MITIGATIONS_AUTO_NOSMT:
+ mds_nosmt = true;
+ /* fallthrough */
+ case CPU_SPEC_MITIGATIONS_AUTO:
+ mds_mitigation = MDS_MITIGATION_FULL;
+ break;
+ }
+ }
+
if (mds_mitigation == MDS_MITIGATION_FULL) {
if (!boot_cpu_has(X86_FEATURE_MD_CLEAR))
mds_mitigation = MDS_MITIGATION_VMWERV;
@@ -374,8 +388,11 @@ spectre_v2_parse_user_cmdline(enum spectre_v2_mitigation_cmd v2_cmd)
ret = cmdline_find_option(boot_command_line, "spectre_v2_user",
arg, sizeof(arg));
- if (ret < 0)
+ if (ret < 0) {
+ if (cpu_spec_mitigations == CPU_SPEC_MITIGATIONS_OFF)
+ return SPECTRE_V2_USER_CMD_NONE;
return SPECTRE_V2_USER_CMD_AUTO;
+ }
for (i = 0; i < ARRAY_SIZE(v2_user_options); i++) {
if (match_option(arg, ret, v2_user_options[i].option)) {
@@ -510,8 +527,11 @@ static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
return SPECTRE_V2_CMD_NONE;
ret = cmdline_find_option(boot_command_line, "spectre_v2", arg, sizeof(arg));
- if (ret < 0)
+ if (ret < 0) {
+ if (cpu_spec_mitigations == CPU_SPEC_MITIGATIONS_OFF)
+ return SPECTRE_V2_CMD_NONE;
return SPECTRE_V2_CMD_AUTO;
+ }
for (i = 0; i < ARRAY_SIZE(mitigation_options); i++) {
if (!match_option(arg, ret, mitigation_options[i].option))
@@ -716,9 +736,10 @@ void arch_smt_update(void)
switch(mds_mitigation) {
case MDS_MITIGATION_FULL:
+ case MDS_MITIGATION_DEFAULT:
case MDS_MITIGATION_VMWERV:
if (sched_smt_active() && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
- pr_warn_once(MDS_MSG_SMT);
+ printk_once(KERN_WARNING MDS_MSG_SMT);
update_mds_branch_idle();
break;
case MDS_MITIGATION_OFF:
@@ -771,8 +792,11 @@ static enum ssb_mitigation_cmd __init ssb_parse_cmdline(void)
} else {
ret = cmdline_find_option(boot_command_line, "spec_store_bypass_disable",
arg, sizeof(arg));
- if (ret < 0)
+ if (ret < 0) {
+ if (cpu_spec_mitigations == CPU_SPEC_MITIGATIONS_OFF)
+ return SPEC_STORE_BYPASS_CMD_NONE;
return SPEC_STORE_BYPASS_CMD_AUTO;
+ }
for (i = 0; i < ARRAY_SIZE(ssb_mitigation_options); i++) {
if (!match_option(arg, ret, ssb_mitigation_options[i].option))
@@ -1037,7 +1061,7 @@ void x86_spec_ctrl_setup_ap(void)
#define pr_fmt(fmt) "L1TF: " fmt
/* Default mitigation for L1TF-affected CPUs */
-enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_FLUSH;
+enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_DEFAULT;
#if IS_ENABLED(CONFIG_KVM_INTEL)
EXPORT_SYMBOL_GPL(l1tf_mitigation);
#endif
@@ -1092,8 +1116,23 @@ static void __init l1tf_select_mitigation(void)
override_cache_bits(&boot_cpu_data);
+ if (l1tf_mitigation == L1TF_MITIGATION_DEFAULT) {
+ switch (cpu_spec_mitigations) {
+ case CPU_SPEC_MITIGATIONS_OFF:
+ l1tf_mitigation = L1TF_MITIGATION_OFF;
+ break;
+ case CPU_SPEC_MITIGATIONS_AUTO:
+ l1tf_mitigation = L1TF_MITIGATION_FLUSH;
+ break;
+ case CPU_SPEC_MITIGATIONS_AUTO_NOSMT:
+ l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOSMT;
+ break;
+ }
+ }
+
switch (l1tf_mitigation) {
case L1TF_MITIGATION_OFF:
+ case L1TF_MITIGATION_DEFAULT:
case L1TF_MITIGATION_FLUSH_NOWARN:
case L1TF_MITIGATION_FLUSH:
break;
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 4fee5c3003ed..943b641bc003 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -115,7 +115,8 @@ void __init pti_check_boottime_disable(void)
}
}
- if (cmdline_find_option_bool(boot_command_line, "nopti")) {
+ if (cmdline_find_option_bool(boot_command_line, "nopti") ||
+ cpu_spec_mitigations == CPU_SPEC_MITIGATIONS_OFF) {
pti_mode = PTI_FORCE_OFF;
pti_print_if_insecure("disabled on command line.");
return;
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 3c87ad888ed3..6cdd3d5228d3 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -189,4 +189,12 @@ static inline void cpu_smt_disable(bool force) { }
static inline void cpu_smt_check_topology(void) { }
#endif
+enum cpu_spec_mitigations {
+ CPU_SPEC_MITIGATIONS_OFF,
+ CPU_SPEC_MITIGATIONS_AUTO,
+ CPU_SPEC_MITIGATIONS_AUTO_NOSMT,
+};
+
+extern enum cpu_spec_mitigations cpu_spec_mitigations;
+
#endif /* _LINUX_CPU_H_ */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index d1c6d152da89..136d33fb90e5 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2279,3 +2279,18 @@ void __init boot_cpu_hotplug_init(void)
#endif
this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
}
+
+enum cpu_spec_mitigations cpu_spec_mitigations __ro_after_init = CPU_SPEC_MITIGATIONS_AUTO;
+
+static int __init cpu_spec_mitigations_setup(char *arg)
+{
+ if (!strcmp(arg, "off"))
+ cpu_spec_mitigations = CPU_SPEC_MITIGATIONS_OFF;
+ else if (!strcmp(arg, "auto"))
+ cpu_spec_mitigations = CPU_SPEC_MITIGATIONS_AUTO;
+ else if (!strcmp(arg, "auto,nosmt"))
+ cpu_spec_mitigations = CPU_SPEC_MITIGATIONS_AUTO_NOSMT;
+
+ return 0;
+}
+early_param("cpu_spec_mitigations", cpu_spec_mitigations_setup);
--
2.17.2
^ permalink raw reply related [flat|nested] 89+ messages in thread
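A hypothetical sketch of how another architecture (e.g. arm64, once it
grows such knobs) would consume the new global switch; the my_arch_* names
are illustrative, only cpu_spec_mitigations, the enum values and
cpu_smt_disable() come from the patch:

void __init my_arch_select_mitigation(void)
{
	/* an explicit arch-specific cmdline knob always wins */
	if (my_arch_mitigation_off)
		return;

	if (cpu_spec_mitigations == CPU_SPEC_MITIGATIONS_OFF)
		return;

	my_arch_enable_mitigation();

	/* only tear down SMT when the user asked for auto,nosmt */
	if (cpu_spec_mitigations == CPU_SPEC_MITIGATIONS_AUTO_NOSMT)
		cpu_smt_disable(false);
}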
* [MODERATED] Encrypted Message
2019-03-04 1:25 ` [MODERATED] [PATCH RFC 4/4] 4 Josh Poimboeuf
@ 2019-03-04 4:07 ` Jon Masters
0 siblings, 0 replies; 89+ messages in thread
From: Jon Masters @ 2019-03-04 4:07 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 117 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: [PATCH RFC 4/4] 4
[-- Attachment #2: Type: text/plain, Size: 1461 bytes --]
On 3/3/19 8:25 PM, speck for Josh Poimboeuf wrote:
> From: Josh Poimboeuf <jpoimboe@redhat.com>
> Subject: [PATCH RFC 4/4] x86/speculation: Add 'cpu_spec_mitigations=' cmdline
> options
>
> Keeping track of the number of mitigations for all the CPU speculation
> bugs has become overwhelming for many users. It's getting more and more
> complicated to decide what mitigations are needed for a given
> architecture.
>
> Most users fall into a few basic categories:
>
> - want all mitigations off;
>
> - want all reasonable mitigations on, with SMT enabled even if it's
> vulnerable; or
>
> - want all reasonable mitigations on, with SMT disabled if vulnerable.
>
> Define a set of curated, arch-independent options, each of which is an
> aggregation of existing options:
>
> - cpu_spec_mitigations=off: Disable all mitigations.
>
> - cpu_spec_mitigations=auto: [default] Enable all the default mitigations,
> but leave SMT enabled, even if it's vulnerable.
>
> - cpu_spec_mitigations=auto,nosmt: Enable all the default mitigations,
> disabling SMT if needed by a mitigation.
>
> See the documentation for more details.
Looks good. There's an effort to upstream mitigation controls for the
arm64 but that's not in place yet. They'll want to wire that up later. I
actually had missed the s390x etokens work so that was fun to see here.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [PATCH v6 00/43] MDSv6
@ 2019-02-24 15:07 Andi Kleen
2019-02-24 15:07 ` [MODERATED] [PATCH v6 10/43] MDSv6 Andi Kleen
2019-02-24 15:07 ` [MODERATED] [PATCH v6 31/43] MDSv6 Andi Kleen
0 siblings, 2 replies; 89+ messages in thread
From: Andi Kleen @ 2019-02-24 15:07 UTC (permalink / raw)
To: speck; +Cc: Andi Kleen
Here's a new version of flushing CPU buffers for group 4
(single thread).
I'm mainly interested in feedback on the lazy approach,
so please focus on the later patches.
There didn't seem to be much interest in it, so I wonder whether it
still makes sense to continue with it, or whether we could
just stay with the full approach.
The lazy approach is faster, but not by that much,
and may not be worth the short- and long-term impact
all over the tree.
This version is based on my earlier base patches, with the
mds=full implementation at the beginning and a lazy
implementation building on top of it. The series can
be rebased onto the rewrite once that matures.
Even the base has some features not in Thomas' version which would
need to be ported (e.g. more complete virtualization support
and eBPF mitigation).
This series implements the "full tree audit" approach that
was suggested by several reviewers. We (Mark Gross and I)
went through most asynchronous code in the kernel and marked the
functions that touch user or IO data, so most asynchronous
interrupts etc. do not schedule a clear. However, this would
need to be continuously enforced for new code too.
It also implements various other review suggestions
and improvements; clearcpu.txt is now clarified in many ways.
Before reviewing, please read Documentation/clearcpu.txt.
Some performance data for lazy:
Kernel build: ~+1% (slightly faster, but that's within noise)
loopback apache -1% (within noise)
ebizzy -0.3% (within noise)
aim7 -5.0%
netperf rr -0.7%
netperf stream 0.0%
In comparison an older version of mds=full showed:
kernel build -2.4%
ebizzy -3.3%
apache loopback -10.0%
For networking workloads there is practically no regression now.
AIM7 is showing some regression. I assume this is due to the context
switch overhead.
mds=full is a bit slower, but not that much. The only real outlier is
apache loopback, which is probably not too realistic a workload
because it mainly does tight loops over some syscalls.
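Mechanically, the lazy approach boils down to deferring the VERW flush to
the next kernel exit behind a cheap per-cpu flag, roughly like this
(hypothetical sketch; the flag and helper names here are illustrative, the
series' real entry point is lazy_clear_cpu()):

DEFINE_PER_CPU(bool, mds_clear_pending);

/* cheap enough for hot paths: just set a flag, no VERW yet */
static inline void lazy_clear_cpu(void)
{
	this_cpu_write(mds_clear_pending, true);
}

/* run on the kernel exit / guest entry path */
static inline void mds_maybe_clear_cpu(void)
{
	if (this_cpu_read(mds_clear_pending)) {
		this_cpu_write(mds_clear_pending, false);
		mds_clear_cpu_buffers();	/* VERW-based flush */
	}
}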
No changelog against previous versions, too many changes.
Andi Kleen (42):
x86/speculation/mds: Add basic bug infrastructure for MDS
x86/speculation/mds: Clear CPU on every kernel exit
x86/speculation/mds: Clear CPU buffers on entering idle
x86/speculation/mds: Add command line options to control mds
x86/speculation/mds: Add sysfs reporting
mds: Add some administrator documentation
x86/speculation/mds: Export MD_CLEAR CPUID to KVM guests.
x86/cpufeatures: Add word 20 for additional features
x86/speculation/mds: Handle VMENTRY clear for CPUs without l1tf
mds: Add documentation for clear cpu usage
x86/speculation/mds: Introduce lazy_clear_cpu
x86/speculation/mds: Add basic implementation of mds=full
x86/speculation/mds: Check lazy clear in kernel exit
x86/speculation/mds: Add tracing for clear_cpu
x86/speculation/mds: Schedule cpu clear on context switch
mds: Force clear cpu on kernel preemption
mds: Clear cpu in memzero_explicit and kzfree
mds: Support cpu clear in interrupts
mds: Support cpu clear after tasklets
mds: Support cpu clearing in timers
mds: Clear cpu for string io/memcpy_*io in interrupts
mds: Schedule clear cpu in swiotlb
mds: Instrument skb functions to clear cpu automatically
mds: Clear cpu for kmap_atomic in interrupts
mds: Support cpu clearing for BPF
mds sweep: Schedule clear cpus in sound core
mds sweep: Make MPU401 interrupts clear cpu
mds sweep: Clear cpu on processing input layer data
mds sweep: Clear cpu for tty input
mds sweep: Clear cpu for usbmon intercepts
mds sweep: Clear cpu in some Xen drivers
mds sweep: Clear cpu in DVB software filters
mds sweep: Mark all DRM interrupts to clear cpu
mds sweep: Make all old style IDE driver interrupts clear cpu
mds sweep: Make Amazon ena driver management interrupt clear cpu
mds sweep: Make all PCMCIA interrupts clear cpu
mds sweep: Mark common functions in comedi as clear cpu
mds sweep: Make usb hcd poll clear cpu
x86/speculation/mds: Switch mds=auto to lazy
mds sweep: Mark interrupts that touch user data
mds sweep: Mark timer handlers that touch user data
mds sweep: Mark tasklets that touch user data
Mark Gross (1):
mds sweep: Clear cpu in sg_copy_from_buffer for SCSI
.../ABI/testing/sysfs-devices-system-cpu | 1 +
.../admin-guide/kernel-parameters.txt | 11 +
Documentation/admin-guide/mds.rst | 95 +++++++
Documentation/clearcpu.txt | 261 ++++++++++++++++++
arch/Kconfig | 3 +
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 8 +
arch/x86/events/intel/uncore.c | 3 +-
arch/x86/include/asm/clearbpf.h | 29 ++
arch/x86/include/asm/clearcpu.h | 83 ++++++
arch/x86/include/asm/cpufeature.h | 6 +-
arch/x86/include/asm/cpufeatures.h | 9 +-
arch/x86/include/asm/disabled-features.h | 3 +-
arch/x86/include/asm/floppy.h | 6 +-
arch/x86/include/asm/io.h | 3 +
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/required-features.h | 3 +-
arch/x86/include/asm/trace/clearcpu.h | 27 ++
arch/x86/kernel/acpi/cstate.c | 2 +
arch/x86/kernel/cpu/bugs.c | 82 ++++++
arch/x86/kernel/cpu/common.c | 25 ++
arch/x86/kernel/kvm.c | 3 +
arch/x86/kernel/nmi.c | 6 +-
arch/x86/kernel/process.c | 5 +
arch/x86/kernel/process.h | 1 +
arch/x86/kernel/smpboot.c | 3 +
arch/x86/kvm/cpuid.c | 3 +-
arch/x86/kvm/vmx/vmx.c | 20 +-
arch/x86/mm/highmem_32.c | 3 +
arch/x86/mm/tlb.c | 14 +
drivers/acpi/acpi_pad.c | 2 +
drivers/acpi/processor_idle.c | 3 +
drivers/atm/eni.c | 3 +-
drivers/atm/he.c | 3 +-
drivers/atm/lanai.c | 4 +-
drivers/atm/nicstar.c | 4 +-
drivers/auxdisplay/img-ascii-lcd.c | 2 +-
drivers/base/cpu.c | 8 +
drivers/block/xsysace.c | 5 +-
drivers/char/ipmi/ipmi_si_intf.c | 6 +-
drivers/char/sonypi.c | 3 +-
drivers/crypto/ixp4xx_crypto.c | 3 +-
drivers/crypto/qat/qat_common/adf_isr.c | 7 +-
drivers/crypto/qat/qat_common/adf_sriov.c | 6 +-
drivers/crypto/qat/qat_common/adf_vf_isr.c | 10 +-
drivers/dma/dw/core.c | 3 +-
drivers/dma/ioat/init.c | 3 +-
drivers/dma/virt-dma.c | 3 +-
drivers/firewire/core-transaction.c | 5 +-
drivers/firewire/nosy.c | 3 +-
drivers/gpu/drm/drm_irq.c | 3 +-
drivers/gpu/drm/gma500/oaktrail_hdmi_i2c.c | 3 +-
drivers/gpu/drm/i915/i915_pmu.c | 3 +-
drivers/gpu/drm/i915/intel_lrc.c | 5 +-
.../gpu/drm/nouveau/nvkm/subdev/pci/base.c | 3 +-
drivers/hv/channel_mgmt.c | 4 +-
drivers/hv/hv.c | 4 +-
drivers/i2c/busses/i2c-emev2.c | 5 +-
drivers/i2c/busses/i2c-i801.c | 2 +-
drivers/i2c/busses/i2c-pxa.c | 4 +-
drivers/i2c/busses/i2c-rk3x.c | 3 +-
drivers/ide/ide-probe.c | 5 +-
drivers/idle/intel_idle.c | 5 +
drivers/iio/trigger/iio-trig-hrtimer.c | 3 +-
drivers/infiniband/hw/bnxt_re/qplib_fp.c | 4 +-
drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 4 +-
drivers/infiniband/hw/i40iw/i40iw_main.c | 9 +-
drivers/infiniband/hw/mthca/mthca_eq.c | 14 +-
drivers/infiniband/hw/qib/qib_sdma.c | 4 +-
drivers/infiniband/sw/rxe/rxe_cq.c | 3 +-
drivers/input/ff-memless.c | 2 +-
drivers/input/input.c | 5 +-
drivers/input/misc/xen-kbdfront.c | 3 +-
drivers/input/serio/hil_mlc.c | 2 +-
drivers/input/serio/i8042.c | 13 +-
drivers/input/serio/serio.c | 3 +
drivers/ipack/carriers/tpci200.c | 6 +-
drivers/isdn/capi/capidrv.c | 2 +-
drivers/isdn/gigaset/bas-gigaset.c | 8 +-
drivers/isdn/gigaset/common.c | 4 +-
drivers/isdn/gigaset/ser-gigaset.c | 4 +-
drivers/isdn/gigaset/usb-gigaset.c | 4 +-
drivers/isdn/hardware/avm/b1isa.c | 4 +-
drivers/isdn/hardware/avm/b1pci.c | 6 +-
drivers/isdn/hardware/avm/b1pcmcia.c | 3 +-
drivers/isdn/hardware/avm/c4.c | 3 +-
drivers/isdn/hardware/avm/t1isa.c | 4 +-
drivers/isdn/hardware/avm/t1pci.c | 3 +-
drivers/isdn/hardware/mISDN/avmfritz.c | 4 +-
drivers/isdn/hardware/mISDN/hfcmulti.c | 3 +-
drivers/isdn/hardware/mISDN/hfcpci.c | 2 +-
drivers/isdn/hardware/mISDN/mISDNinfineon.c | 3 +-
drivers/isdn/hardware/mISDN/netjet.c | 2 +-
drivers/isdn/hardware/mISDN/speedfax.c | 3 +-
drivers/isdn/hardware/mISDN/w6692.c | 2 +-
drivers/isdn/hisax/config.c | 2 +-
drivers/isdn/hisax/hfc4s8s_l1.c | 3 +-
drivers/isdn/hisax/hisax_fcpcipnp.c | 13 +-
drivers/isdn/i4l/isdn_common.c | 2 +-
drivers/media/cec/cec-pin.c | 3 +-
drivers/media/common/saa7146/saa7146_core.c | 4 +-
drivers/media/dvb-core/dvb_demux.c | 3 +
drivers/media/pci/b2c2/flexcop-pci.c | 3 +-
drivers/media/pci/bt8xx/bttv-driver.c | 2 +-
drivers/media/pci/bt8xx/bttv-input.c | 5 +-
drivers/media/pci/bt8xx/dvb-bt8xx.c | 3 +-
drivers/media/pci/cobalt/cobalt-driver.c | 3 +-
drivers/media/pci/cx18/cx18-driver.c | 3 +-
drivers/media/pci/cx25821/cx25821-core.c | 2 +-
drivers/media/pci/cx88/cx88-alsa.c | 3 +-
drivers/media/pci/cx88/cx88-mpeg.c | 2 +-
drivers/media/pci/cx88/cx88-video.c | 2 +-
drivers/media/pci/dt3155/dt3155.c | 2 +-
drivers/media/pci/intel/ipu3/ipu3-cio2.c | 2 +-
drivers/media/pci/ivtv/ivtv-driver.c | 3 +-
drivers/media/pci/mantis/mantis_dvb.c | 3 +-
drivers/media/pci/meye/meye.c | 3 +-
.../pci/netup_unidvb/netup_unidvb_core.c | 3 +-
drivers/media/pci/ngene/ngene-core.c | 5 +-
drivers/media/pci/pluto2/pluto2.c | 3 +-
drivers/media/pci/saa7134/saa7134-alsa.c | 4 +-
drivers/media/pci/saa7134/saa7134-core.c | 2 +-
drivers/media/pci/saa7134/saa7134-input.c | 3 +-
drivers/media/pci/saa7134/saa7134-ts.c | 3 +-
drivers/media/pci/saa7134/saa7134-vbi.c | 3 +-
drivers/media/pci/saa7134/saa7134-video.c | 3 +-
drivers/media/pci/saa7164/saa7164-core.c | 7 +-
drivers/media/pci/smipcie/smipcie-main.c | 3 +-
drivers/media/pci/solo6x10/solo6x10-core.c | 4 +-
drivers/media/pci/sta2x11/sta2x11_vip.c | 5 +-
drivers/media/pci/ttpci/av7110.c | 12 +-
drivers/media/pci/ttpci/av7110_ir.c | 3 +-
drivers/media/pci/ttpci/budget-ci.c | 9 +-
drivers/media/pci/ttpci/budget-core.c | 3 +-
drivers/media/pci/tw5864/tw5864-video.c | 4 +-
drivers/media/pci/tw68/tw68-core.c | 2 +-
drivers/media/pci/tw686x/tw686x-core.c | 4 +-
drivers/media/platform/aspeed-video.c | 5 +-
.../media/platform/marvell-ccic/cafe-driver.c | 3 +-
.../media/platform/marvell-ccic/mcam-core.c | 4 +-
drivers/media/radio/radio-cadet.c | 2 +-
drivers/media/radio/wl128x/fmdrv_common.c | 9 +-
drivers/media/rc/fintek-cir.c | 3 +-
drivers/media/rc/gpio-ir-recv.c | 2 +-
drivers/media/rc/img-ir/img-ir-raw.c | 2 +-
drivers/media/rc/ir-hix5hd2.c | 3 +-
drivers/media/rc/ir-rx51.c | 3 +-
drivers/media/rc/ite-cir.c | 3 +-
drivers/media/rc/nuvoton-cir.c | 3 +-
drivers/media/rc/serial_ir.c | 2 +-
drivers/media/rc/sir_ir.c | 3 +-
drivers/media/rc/winbond-cir.c | 2 +-
drivers/media/usb/au0828/au0828-video.c | 6 +-
drivers/media/usb/ttusb-dec/ttusb_dec.c | 5 +-
drivers/memstick/host/jmb38x_ms.c | 6 +-
drivers/message/fusion/mptbase.c | 3 +-
drivers/mfd/ezx-pcap.c | 4 +-
drivers/misc/ibmasm/module.c | 4 +-
drivers/misc/sgi-gru/grufile.c | 7 +-
drivers/misc/sgi-xp/xpc_uv.c | 3 +-
drivers/misc/vmw_vmci/vmci_guest.c | 8 +-
drivers/mmc/host/mtk-sd.c | 3 +-
drivers/mmc/host/wbsd.c | 20 +-
drivers/net/appletalk/cops.c | 2 +-
drivers/net/arcnet/arc-rimi.c | 2 +-
drivers/net/arcnet/com20020.c | 3 +-
drivers/net/arcnet/com90io.c | 3 +-
drivers/net/arcnet/com90xx.c | 2 +-
drivers/net/caif/caif_hsi.c | 9 +-
drivers/net/can/cc770/cc770.c | 4 +-
drivers/net/can/peak_canfd/peak_pciefd_main.c | 8 +-
drivers/net/can/sja1000/ems_pcmcia.c | 4 +-
drivers/net/can/sja1000/peak_pcmcia.c | 3 +-
drivers/net/can/sja1000/sja1000.c | 5 +-
drivers/net/ethernet/3com/3c509.c | 3 +-
drivers/net/ethernet/3com/3c515.c | 6 +-
drivers/net/ethernet/8390/axnet_cs.c | 3 +-
drivers/net/ethernet/8390/ne.c | 3 +-
drivers/net/ethernet/8390/ne2k-pci.c | 3 +-
drivers/net/ethernet/8390/pcnet_cs.c | 3 +-
drivers/net/ethernet/8390/smc-ultra.c | 3 +-
drivers/net/ethernet/8390/wd.c | 3 +-
drivers/net/ethernet/agere/et131x.c | 4 +-
drivers/net/ethernet/amazon/ena/ena_netdev.c | 5 +-
drivers/net/ethernet/amazon/ena/ena_netdev.h | 1 +
drivers/net/ethernet/amd/lance.c | 2 +-
drivers/net/ethernet/amd/ni65.c | 5 +-
drivers/net/ethernet/atheros/atlx/atl1.c | 4 +-
drivers/net/ethernet/atheros/atlx/atl2.c | 4 +-
drivers/net/ethernet/broadcom/cnic.c | 8 +-
drivers/net/ethernet/cadence/macb_main.c | 4 +-
.../net/ethernet/chelsio/cxgb3/cxgb3_main.c | 16 +-
drivers/net/ethernet/micrel/ks8842.c | 7 +-
drivers/net/ethernet/micrel/ks8851_mll.c | 3 +-
drivers/net/ethernet/microchip/lan743x_main.c | 7 +-
drivers/net/ethernet/realtek/atp.c | 3 +-
drivers/net/fddi/skfp/skfddi.c | 4 +-
drivers/net/hamradio/6pack.c | 4 +-
drivers/net/hamradio/baycom_ser_fdx.c | 3 +-
drivers/net/hamradio/baycom_ser_hdx.c | 3 +-
drivers/net/hamradio/scc.c | 8 +-
drivers/net/hamradio/yam.c | 6 +-
drivers/net/hippi/rrunner.c | 2 +-
drivers/net/ieee802154/at86rf230.c | 6 +-
drivers/net/ieee802154/ca8210.c | 10 +-
drivers/net/ieee802154/mcr20a.c | 3 +-
drivers/net/ieee802154/mrf24j40.c | 3 +-
drivers/net/ppp/ppp_async.c | 3 +-
drivers/net/ppp/ppp_synctty.c | 3 +-
drivers/net/slip/slip.c | 4 +-
drivers/net/usb/cdc_ncm.c | 3 +-
drivers/net/usb/hso.c | 6 +-
drivers/net/wan/cosa.c | 2 +-
drivers/net/wan/farsync.c | 6 +-
drivers/net/wan/hostess_sv11.c | 3 +-
drivers/net/wan/sbni.c | 2 +-
drivers/net/wan/sdla.c | 4 +-
drivers/net/wan/sealevel.c | 3 +-
drivers/net/wireless/ath/ath9k/init.c | 5 +-
drivers/net/wireless/ath/carl9170/usb.c | 4 +-
.../net/wireless/broadcom/b43legacy/main.c | 8 +-
drivers/net/wireless/broadcom/b43legacy/pio.c | 4 +-
.../broadcom/brcm80211/brcmfmac/bcmsdh.c | 7 +-
.../broadcom/brcm80211/brcmsmac/mac80211_if.c | 3 +-
drivers/net/wireless/cisco/airo.c | 4 +-
drivers/net/wireless/intel/ipw2x00/ipw2100.c | 9 +-
drivers/net/wireless/intel/ipw2x00/ipw2200.c | 8 +-
.../net/wireless/intel/iwlegacy/3945-mac.c | 9 +-
.../net/wireless/intel/iwlegacy/4965-mac.c | 9 +-
.../net/wireless/intersil/hostap/hostap_ap.c | 2 +-
.../net/wireless/intersil/hostap/hostap_pci.c | 3 +-
.../net/wireless/intersil/hostap/hostap_plx.c | 3 +-
drivers/net/wireless/intersil/orinoco/main.c | 4 +-
.../intersil/orinoco/orinoco_nortel.c | 4 +-
.../wireless/intersil/orinoco/orinoco_pci.c | 4 +-
.../wireless/intersil/orinoco/orinoco_plx.c | 4 +-
.../wireless/intersil/orinoco/orinoco_tmd.c | 4 +-
drivers/net/wireless/intersil/p54/p54pci.c | 2 +-
drivers/net/wireless/intersil/p54/p54spi.c | 4 +-
.../intersil/prism54/islpci_hotplug.c | 2 +-
drivers/net/wireless/mac80211_hwsim.c | 6 +-
drivers/net/wireless/marvell/libertas/if_cs.c | 2 +-
.../net/wireless/marvell/libertas/if_spi.c | 3 +-
.../wireless/marvell/mwifiex/11n_rxreorder.c | 3 +-
drivers/net/wireless/marvell/mwifiex/main.c | 5 +-
drivers/net/wireless/marvell/mwifiex/pcie.c | 8 +-
drivers/net/wireless/marvell/mwifiex/usb.c | 2 +-
drivers/net/wireless/marvell/mwl8k.c | 10 +-
.../net/wireless/mediatek/mt76/mt76x0/pci.c | 3 +-
.../net/wireless/mediatek/mt76/mt76x2/pci.c | 3 +-
.../quantenna/qtnfmac/pcie/pearl_pcie.c | 3 +-
.../quantenna/qtnfmac/pcie/topaz_pcie.c | 3 +-
.../net/wireless/ralink/rt2x00/rt2x00mmio.c | 6 +-
.../wireless/realtek/rtl818x/rtl8180/dev.c | 6 +-
drivers/net/wireless/realtek/rtlwifi/pci.c | 16 +-
drivers/net/wireless/ti/wl1251/sdio.c | 4 +-
drivers/net/wireless/ti/wl1251/spi.c | 5 +-
drivers/ntb/hw/amd/ntb_hw_amd.c | 12 +-
drivers/ntb/hw/intel/ntb_hw_gen1.c | 12 +-
drivers/ntb/hw/mscc/ntb_hw_switchtec.c | 8 +-
drivers/parport/parport_ax88796.c | 3 +-
drivers/parport/parport_pc.c | 3 +-
drivers/pci/controller/pcie-xilinx.c | 2 +-
drivers/pci/controller/vmd.c | 11 +-
drivers/pci/hotplug/cpci_hotplug_core.c | 6 +-
drivers/pci/hotplug/cpqphp_core.c | 3 +-
drivers/pci/hotplug/shpchp_hpc.c | 5 +-
drivers/pci/pcie/pme.c | 3 +-
drivers/pci/switch/switchtec.c | 10 +-
drivers/pcmcia/i82092.c | 2 +-
drivers/pcmcia/i82365.c | 9 +-
drivers/pcmcia/pcmcia_resource.c | 9 +-
drivers/pcmcia/pd6729.c | 8 +-
drivers/pcmcia/tcic.c | 7 +-
drivers/pcmcia/yenta_socket.c | 7 +-
drivers/pinctrl/intel/pinctrl-intel.c | 2 +-
drivers/pinctrl/pinctrl-amd.c | 3 +-
drivers/pinctrl/pinctrl-single.c | 3 +-
drivers/platform/goldfish/goldfish_pipe.c | 6 +-
drivers/platform/mellanox/mlxreg-hotplug.c | 5 +-
drivers/platform/x86/fujitsu-tablet.c | 3 +-
drivers/platform/x86/intel_int0002_vgpio.c | 2 +-
drivers/platform/x86/intel_ips.c | 3 +-
drivers/platform/x86/intel_pmc_ipc.c | 8 +-
drivers/platform/x86/intel_punit_ipc.c | 4 +-
drivers/platform/x86/intel_scu_ipc.c | 5 +-
drivers/platform/x86/sony-laptop.c | 5 +-
drivers/pnp/resource.c | 3 +-
drivers/power/reset/ltc2952-poweroff.c | 6 +-
drivers/power/supply/act8945a_charger.c | 6 +-
drivers/power/supply/goldfish_battery.c | 5 +-
.../power/supply/max14656_charger_detector.c | 2 +-
drivers/power/supply/pda_power.c | 6 +-
drivers/power/supply/wm97xx_battery.c | 5 +-
drivers/pps/clients/pps-gpio.c | 3 +-
drivers/ptp/ptp_pch.c | 3 +-
drivers/rapidio/devices/tsi721.c | 25 +-
drivers/rapidio/devices/tsi721_dma.c | 8 +-
drivers/regulator/qcom_spmi-regulator.c | 5 +-
drivers/rpmsg/qcom_glink_native.c | 5 +-
drivers/rtc/rtc-cmos.c | 5 +-
drivers/rtc/rtc-ds1305.c | 3 +-
drivers/rtc/rtc-ds1374.c | 5 +-
drivers/rtc/rtc-ds1511.c | 3 +-
drivers/rtc/rtc-ds1553.c | 4 +-
drivers/rtc/rtc-ds1685.c | 3 +-
drivers/rtc/rtc-ftrtc010.c | 2 +-
drivers/rtc/rtc-m48t59.c | 5 +-
drivers/rtc/rtc-mrst.c | 3 +-
drivers/rtc/rtc-pcap.c | 10 +-
drivers/rtc/rtc-r7301.c | 3 +-
drivers/rtc/rtc-snvs.c | 3 +-
drivers/rtc/rtc-stk17ta8.c | 5 +-
drivers/rtc/rtc-zynqmp.c | 6 +-
drivers/scsi/3w-9xxx.c | 6 +-
drivers/scsi/3w-sas.c | 6 +-
drivers/scsi/3w-xxxx.c | 3 +-
drivers/scsi/BusLogic.c | 3 +-
drivers/scsi/a100u2w.c | 4 +-
drivers/scsi/aacraid/commsup.c | 8 +-
drivers/scsi/aacraid/rx.c | 3 +-
drivers/scsi/aacraid/sa.c | 3 +-
drivers/scsi/aacraid/src.c | 3 +-
drivers/scsi/advansys.c | 4 +-
drivers/scsi/aha152x.c | 4 +-
drivers/scsi/aha1542.c | 2 +-
drivers/scsi/aha1740.c | 3 +-
drivers/scsi/aic7xxx/aic7770_osm.c | 3 +-
drivers/scsi/aic7xxx/aic79xx_osm_pci.c | 2 +-
drivers/scsi/aic7xxx/aic7xxx_osm_pci.c | 2 +-
drivers/scsi/aic94xx/aic94xx_hwi.c | 4 +-
drivers/scsi/aic94xx/aic94xx_init.c | 5 +-
drivers/scsi/am53c974.c | 4 +-
drivers/scsi/arcmsr/arcmsr_hba.c | 3 +-
drivers/scsi/atp870u.c | 3 +-
drivers/scsi/be2iscsi/be_main.c | 12 +-
drivers/scsi/bfa/bfad.c | 9 +-
drivers/scsi/csiostor/csio_isr.c | 16 +-
drivers/scsi/dpt_i2o.c | 2 +-
drivers/scsi/esas2r/esas2r_init.c | 8 +-
drivers/scsi/fnic/fnic_isr.c | 13 +-
drivers/scsi/g_NCR5380.c | 7 +-
drivers/scsi/gdth.c | 8 +-
drivers/scsi/hpsa.c | 19 +-
drivers/scsi/hptiop.c | 3 +-
drivers/scsi/initio.c | 3 +-
drivers/scsi/ipr.c | 23 +-
drivers/scsi/ips.c | 4 +-
drivers/scsi/isci/init.c | 8 +-
drivers/scsi/lpfc/lpfc_init.c | 35 ++-
drivers/scsi/megaraid.c | 4 +-
drivers/scsi/megaraid/megaraid_mbox.c | 3 +-
drivers/scsi/megaraid/megaraid_sas_base.c | 8 +-
drivers/scsi/mpt3sas/mpt3sas_base.c | 2 +-
drivers/scsi/mvsas/mv_init.c | 4 +-
drivers/scsi/mvumi.c | 8 +-
drivers/scsi/myrb.c | 2 +-
drivers/scsi/myrs.c | 2 +-
drivers/scsi/nsp32.c | 3 +-
drivers/scsi/pcmcia/qlogic_stub.c | 2 +-
drivers/scsi/pcmcia/sym53c500_cs.c | 2 +-
drivers/scsi/pm8001/pm8001_init.c | 10 +-
drivers/scsi/pmcraid.c | 5 +-
drivers/scsi/qedf/qedf_main.c | 4 +-
drivers/scsi/qedi/qedi_main.c | 3 +-
drivers/scsi/qla1280.c | 3 +-
drivers/scsi/qla2xxx/qla_isr.c | 22 +-
drivers/scsi/qla4xxx/ql4_isr.c | 5 +-
drivers/scsi/qla4xxx/ql4_nx.c | 10 +-
drivers/scsi/qlogicfas.c | 2 +-
drivers/scsi/sim710.c | 2 +-
drivers/scsi/smartpqi/smartpqi_init.c | 11 +-
drivers/scsi/snic/snic_isr.c | 5 +-
drivers/scsi/stex.c | 4 +-
drivers/scsi/sym53c8xx_2/sym_glue.c | 3 +-
drivers/scsi/ufs/ufshcd.c | 7 +-
drivers/scsi/vmw_pvscsi.c | 9 +-
drivers/scsi/wd719x.c | 4 +-
drivers/slimbus/qcom-ctrl.c | 3 +-
drivers/spi/spi-altera.c | 3 +-
drivers/spi/spi-axi-spi-engine.c | 3 +-
drivers/spi/spi-cadence.c | 3 +-
drivers/spi/spi-dw.c | 4 +-
drivers/spi/spi-fsl-spi.c | 3 +-
drivers/spi/spi-oc-tiny.c | 3 +-
drivers/spi/spi-pxa2xx.c | 4 +-
drivers/spi/spi-topcliff-pch.c | 5 +-
drivers/spi/spi-xilinx.c | 5 +-
drivers/spi/spi-zynqmp-gqspi.c | 3 +-
drivers/staging/android/vsoc.c | 3 +-
drivers/staging/axis-fifo/axis-fifo.c | 3 +-
drivers/staging/comedi/comedi_buf.c | 5 +
.../staging/comedi/drivers/addi_apci_1032.c | 3 +-
.../staging/comedi/drivers/addi_apci_1500.c | 3 +-
.../staging/comedi/drivers/addi_apci_1564.c | 3 +-
.../staging/comedi/drivers/addi_apci_2032.c | 3 +-
.../staging/comedi/drivers/addi_apci_3120.c | 3 +-
.../staging/comedi/drivers/addi_apci_3xxx.c | 3 +-
drivers/staging/comedi/drivers/adl_pci9111.c | 3 +-
drivers/staging/comedi/drivers/adl_pci9118.c | 3 +-
drivers/staging/comedi/drivers/adv_pci1710.c | 3 +-
drivers/staging/comedi/drivers/aio_iiro_16.c | 3 +-
.../comedi/drivers/amplc_dio200_common.c | 3 +-
.../comedi/drivers/amplc_pc236_common.c | 3 +-
drivers/staging/comedi/drivers/amplc_pci224.c | 3 +-
drivers/staging/comedi/drivers/amplc_pci230.c | 3 +-
drivers/staging/comedi/drivers/cb_pcidas.c | 4 +-
drivers/staging/comedi/drivers/cb_pcidas64.c | 5 +-
.../staging/comedi/drivers/comedi_parport.c | 3 +-
drivers/staging/comedi/drivers/comedi_test.c | 6 +-
drivers/staging/comedi/drivers/das16m1.c | 3 +-
drivers/staging/comedi/drivers/das1800.c | 3 +-
drivers/staging/comedi/drivers/das6402.c | 3 +-
drivers/staging/comedi/drivers/das800.c | 5 +-
drivers/staging/comedi/drivers/dmm32at.c | 3 +-
drivers/staging/comedi/drivers/dt2811.c | 3 +-
drivers/staging/comedi/drivers/dt2814.c | 3 +-
drivers/staging/comedi/drivers/dt282x.c | 2 +-
drivers/staging/comedi/drivers/dt3000.c | 3 +-
drivers/staging/comedi/drivers/gsc_hpdi.c | 3 +-
drivers/staging/comedi/drivers/jr3_pci.c | 2 +-
drivers/staging/comedi/drivers/me4000.c | 3 +-
drivers/staging/comedi/drivers/ni_6527.c | 4 +-
drivers/staging/comedi/drivers/ni_65xx.c | 3 +-
drivers/staging/comedi/drivers/ni_660x.c | 4 +-
drivers/staging/comedi/drivers/ni_at_a2150.c | 2 +-
drivers/staging/comedi/drivers/ni_atmio.c | 3 +-
drivers/staging/comedi/drivers/ni_atmio16d.c | 3 +-
.../staging/comedi/drivers/ni_labpc_common.c | 5 +-
drivers/staging/comedi/drivers/ni_pcidio.c | 3 +-
drivers/staging/comedi/drivers/ni_pcimio.c | 3 +-
drivers/staging/comedi/drivers/pcl711.c | 3 +-
drivers/staging/comedi/drivers/pcl726.c | 3 +-
drivers/staging/comedi/drivers/pcl812.c | 3 +-
drivers/staging/comedi/drivers/pcl816.c | 2 +-
drivers/staging/comedi/drivers/pcl818.c | 3 +-
drivers/staging/comedi/drivers/pcmmio.c | 3 +-
drivers/staging/comedi/drivers/pcmuio.c | 6 +-
drivers/staging/comedi/drivers/rtd520.c | 3 +-
drivers/staging/comedi/drivers/s626.c | 3 +-
drivers/staging/gasket/gasket_interrupt.c | 3 +-
drivers/staging/goldfish/goldfish_audio.c | 5 +-
drivers/staging/iio/adc/ad7606.c | 3 +-
drivers/staging/ks7010/ks_hostif.c | 3 +-
drivers/staging/media/bcm2048/radio-bcm2048.c | 4 +-
drivers/staging/media/zoran/zoran_card.c | 2 +-
drivers/staging/most/dim2/dim2.c | 6 +-
drivers/staging/most/i2c/i2c.c | 3 +-
drivers/staging/olpc_dcon/olpc_dcon_xo_1.c | 2 +-
drivers/staging/olpc_dcon/olpc_dcon_xo_1_5.c | 2 +-
drivers/staging/pi433/pi433_if.c | 6 +-
drivers/staging/rtl8188eu/core/rtw_recv.c | 2 +-
.../staging/rtl8188eu/hal/rtl8188eu_recv.c | 6 +-
.../staging/rtl8188eu/hal/rtl8188eu_xmit.c | 6 +-
drivers/staging/rtl8188eu/os_dep/mlme_linux.c | 17 +-
drivers/staging/rtl8188eu/os_dep/recv_linux.c | 2 +-
drivers/staging/rtl8192e/rtl8192e/rtl_core.c | 20 +-
drivers/staging/rtl8192e/rtllib_softmac.c | 6 +-
.../rtl8192u/ieee80211/ieee80211_module.c | 2 +-
.../rtl8192u/ieee80211/ieee80211_softmac.c | 6 +-
drivers/staging/rtl8712/rtl8712_recv.c | 6 +-
drivers/staging/rtl8712/rtl871x_xmit.c | 6 +-
.../staging/rtl8723bs/hal/rtl8723bs_recv.c | 8 +-
drivers/staging/rtl8723bs/os_dep/mlme_linux.c | 22 +-
drivers/staging/rtl8723bs/os_dep/recv_linux.c | 2 +-
drivers/staging/rtlwifi/pci.c | 4 +-
drivers/staging/rts5208/rtsx.c | 4 +-
drivers/staging/speakup/main.c | 2 +-
drivers/staging/speakup/serialio.c | 5 +-
drivers/staging/vt6655/device_main.c | 2 +-
drivers/thermal/ti-soc-thermal/ti-bandgap.c | 6 +-
drivers/thunderbolt/nhi.c | 6 +-
drivers/tty/cyclades.c | 9 +-
drivers/tty/goldfish.c | 4 +-
drivers/tty/hvc/hvc_irq.c | 4 +-
drivers/tty/ipwireless/hardware.c | 3 +-
drivers/tty/isicom.c | 4 +-
drivers/tty/moxa.c | 2 +-
drivers/tty/mxser.c | 4 +-
drivers/tty/n_gsm.c | 5 +-
drivers/tty/n_r3964.c | 2 +-
drivers/tty/rocket.c | 2 +-
drivers/tty/serial/8250/8250_core.c | 6 +-
drivers/tty/serial/8250/8250_exar.c | 2 +-
drivers/tty/serial/8250/8250_port.c | 4 +-
drivers/tty/serial/altera_jtaguart.c | 4 +-
drivers/tty/serial/altera_uart.c | 5 +-
drivers/tty/serial/arc_uart.c | 2 +-
drivers/tty/serial/digicolor-usart.c | 3 +-
drivers/tty/serial/fsl_lpuart.c | 12 +-
drivers/tty/serial/ifx6x60.c | 11 +-
drivers/tty/serial/jsm/jsm_driver.c | 3 +-
drivers/tty/serial/max3100.c | 5 +-
drivers/tty/serial/men_z135_uart.c | 4 +-
drivers/tty/serial/mux.c | 2 +-
drivers/tty/serial/pch_uart.c | 4 +-
drivers/tty/serial/pnx8xxx_uart.c | 3 +-
drivers/tty/serial/rp2.c | 2 +-
drivers/tty/serial/sc16is7xx.c | 2 +-
drivers/tty/serial/sccnxp.c | 2 +-
drivers/tty/serial/sh-sci.c | 9 +-
drivers/tty/serial/timbuart.c | 4 +-
drivers/tty/serial/uartlite.c | 3 +-
drivers/tty/serial/xilinx_uartps.c | 4 +-
drivers/tty/synclink.c | 3 +-
drivers/tty/synclink_gt.c | 6 +-
drivers/tty/synclinkmp.c | 7 +-
drivers/tty/tty_buffer.c | 5 +-
drivers/tty/vcc.c | 4 +-
drivers/tty/vt/keyboard.c | 2 +-
drivers/uio/uio.c | 3 +-
drivers/usb/atm/usbatm.c | 6 +-
drivers/usb/c67x00/c67x00-drv.c | 3 +-
drivers/usb/chipidea/core.c | 5 +-
drivers/usb/chipidea/otg_fsm.c | 3 +-
drivers/usb/core/hcd.c | 8 +-
drivers/usb/dwc2/gadget.c | 3 +-
drivers/usb/dwc2/hcd_queue.c | 6 +-
drivers/usb/dwc2/platform.c | 3 +-
drivers/usb/gadget/function/f_midi.c | 3 +-
drivers/usb/gadget/function/f_ncm.c | 3 +-
drivers/usb/gadget/udc/amd5536udc_pci.c | 2 +-
drivers/usb/gadget/udc/bdc/bdc_udc.c | 3 +-
drivers/usb/gadget/udc/dummy_hcd.c | 4 +-
drivers/usb/gadget/udc/fotg210-udc.c | 4 +-
drivers/usb/gadget/udc/fusb300_udc.c | 6 +-
drivers/usb/gadget/udc/goku_udc.c | 3 +-
drivers/usb/gadget/udc/m66592-udc.c | 6 +-
drivers/usb/gadget/udc/mv_u3d_core.c | 3 +-
drivers/usb/gadget/udc/mv_udc_core.c | 3 +-
drivers/usb/gadget/udc/net2272.c | 3 +-
drivers/usb/gadget/udc/net2280.c | 3 +-
drivers/usb/gadget/udc/pch_udc.c | 8 +-
drivers/usb/gadget/udc/pxa27x_udc.c | 3 +-
drivers/usb/gadget/udc/r8a66597-udc.c | 7 +-
drivers/usb/gadget/udc/snps_udc_plat.c | 4 +-
drivers/usb/gadget/udc/udc-xilinx.c | 3 +-
drivers/usb/host/max3421-hcd.c | 3 +-
drivers/usb/host/xhci.c | 12 +-
drivers/usb/isp1760/isp1760-udc.c | 3 +-
drivers/usb/musb/musb_core.c | 2 +-
drivers/usb/phy/phy-gpio-vbus-usb.c | 4 +-
drivers/usb/serial/mos7720.c | 4 +-
drivers/usb/usbip/vudc_transfer.c | 2 +-
drivers/uwb/neh.c | 2 +-
drivers/uwb/rsv.c | 5 +-
drivers/uwb/whc-rc.c | 5 +-
drivers/vfio/pci/vfio_pci_intrs.c | 5 +-
drivers/video/fbdev/arcfb.c | 3 +-
drivers/video/fbdev/aty/atyfb_base.c | 2 +-
drivers/video/fbdev/goldfishfb.c | 4 +-
drivers/video/fbdev/matrox/matroxfb_base.c | 3 +-
drivers/video/fbdev/mb862xx/mb862xxfbdrv.c | 7 +-
drivers/video/fbdev/via/via-core.c | 3 +-
drivers/video/fbdev/xen-fbfront.c | 2 +
drivers/virt/vboxguest/vboxguest_linux.c | 5 +-
drivers/virtio/virtio_mmio.c | 4 +-
drivers/virtio/virtio_pci_common.c | 20 +-
drivers/visorbus/visorbus_main.c | 2 +-
drivers/vme/bridges/vme_ca91cx42.c | 5 +-
drivers/vme/bridges/vme_tsi148.c | 7 +-
drivers/w1/masters/ds1wm.c | 3 +-
drivers/xen/events/events_base.c | 12 +-
drivers/xen/platform-pci.c | 4 +-
drivers/xen/pvcalls-front.c | 2 +
drivers/xen/xen-pciback/pciback_ops.c | 6 +-
include/asm-generic/io.h | 3 +
include/linux/clearcpu.h | 36 +++
include/linux/filter.h | 21 +-
include/linux/highmem.h | 2 +
include/linux/hrtimer.h | 4 +
include/linux/interrupt.h | 18 +-
include/linux/skbuff.h | 2 +
include/linux/timer.h | 14 +-
include/linux/tty_flip.h | 4 +
include/linux/usb/hcd.h | 5 +-
kernel/bpf/core.c | 2 +
kernel/bpf/cpumap.c | 3 +
kernel/dma/swiotlb.c | 2 +
kernel/irq/handle.c | 4 +
kernel/irq/manage.c | 1 +
kernel/sched/core.c | 9 +
kernel/softirq.c | 25 +-
kernel/time/alarmtimer.c | 2 +-
kernel/time/hrtimer.c | 5 +
kernel/time/timer.c | 8 +
lib/random32.c | 2 +-
lib/scatterlist.c | 2 +
lib/string.c | 6 +
mm/slab_common.c | 5 +-
net/atm/pppoatm.c | 2 +-
net/core/skbuff.c | 32 +++
net/mac80211/main.c | 14 +-
net/rds/ib_cm.c | 8 +-
net/wireless/lib80211.c | 2 +-
net/xfrm/xfrm_state.c | 3 +-
samples/v4l/v4l2-pci-skeleton.c | 5 +-
security/keys/gc.c | 2 +-
sound/core/hrtimer.c | 3 +-
sound/core/pcm_lib.c | 3 +
sound/core/rawmidi.c | 3 +
sound/core/timer.c | 7 +-
sound/drivers/mpu401/mpu401_uart.c | 8 +-
sound/drivers/mtpav.c | 5 +-
sound/drivers/pcsp/pcsp.c | 3 +-
sound/drivers/serial-u16550.c | 6 +-
sound/isa/ad1816a/ad1816a_lib.c | 2 +-
sound/isa/es1688/es1688_lib.c | 4 +-
sound/isa/es18xx.c | 3 +-
sound/isa/gus/gus_main.c | 2 +-
sound/isa/gus/gusmax.c | 2 +-
sound/isa/gus/interwave.c | 3 +-
sound/isa/msnd/msnd_pinnacle.c | 3 +-
sound/isa/opl3sa2.c | 4 +-
sound/isa/opti9xx/opti92x-ad1848.c | 3 +-
sound/isa/sb/emu8000_pcm.c | 2 +-
sound/isa/sb/sb8_midi.c | 3 +-
sound/isa/sb/sb_common.c | 6 +-
sound/isa/wavefront/wavefront.c | 3 +-
sound/isa/wavefront/wavefront_midi.c | 4 +-
sound/isa/wss/wss_lib.c | 3 +-
sound/pci/ad1889.c | 3 +-
sound/pci/ali5451/ali5451.c | 3 +-
sound/pci/als300.c | 3 +-
sound/pci/asihpi/asihpi.c | 4 +-
sound/pci/asihpi/hpioctl.c | 3 +-
sound/pci/atiixp.c | 3 +-
sound/pci/atiixp_modem.c | 3 +-
sound/pci/aw2/aw2-alsa.c | 3 +-
sound/pci/azt3328.c | 3 +-
sound/pci/bt87x.c | 4 +-
sound/pci/ca0106/ca0106_main.c | 3 +-
sound/pci/cmipci.c | 3 +-
sound/pci/cs4281.c | 3 +-
sound/pci/cs46xx/cs46xx_lib.c | 3 +-
sound/pci/cs5535audio/cs5535audio.c | 3 +-
sound/pci/ctxfi/cthw20k1.c | 3 +-
sound/pci/ctxfi/cthw20k2.c | 3 +-
sound/pci/echoaudio/midi.c | 2 +-
sound/pci/emu10k1/emu10k1_main.c | 3 +-
sound/pci/emu10k1/emu10k1x.c | 3 +-
sound/pci/ens1370.c | 3 +-
sound/pci/es1938.c | 6 +-
sound/pci/es1968.c | 3 +-
sound/pci/fm801.c | 3 +-
sound/pci/hda/hda_intel.c | 4 +-
sound/pci/ice1712/ice1712.c | 3 +-
sound/pci/ice1712/ice1724.c | 3 +-
sound/pci/intel8x0.c | 6 +-
sound/pci/intel8x0m.c | 6 +-
sound/pci/korg1212/korg1212.c | 4 +-
sound/pci/lola/lola.c | 3 +-
sound/pci/maestro3.c | 3 +-
sound/pci/nm256/nm256.c | 3 +-
sound/pci/oxygen/oxygen_lib.c | 4 +-
sound/pci/riptide/riptide.c | 3 +-
sound/pci/rme32.c | 3 +-
sound/pci/rme96.c | 3 +-
sound/pci/rme9652/hdsp.c | 8 +-
sound/pci/rme9652/hdspm.c | 10 +-
sound/pci/rme9652/rme9652.c | 3 +-
sound/pci/sis7019.c | 7 +-
sound/pci/sonicvibes.c | 3 +-
sound/pci/trident/trident_main.c | 3 +-
sound/pci/via82xx.c | 6 +-
sound/pci/via82xx_modem.c | 3 +-
sound/pci/ymfpci/ymfpci_main.c | 3 +-
sound/soc/amd/acp-pcm-dma.c | 3 +-
sound/soc/amd/raven/acp3x-pcm-dma.c | 3 +-
sound/soc/codecs/rt5640.c | 4 +-
sound/soc/codecs/rt5651.c | 4 +-
sound/soc/codecs/rt5663.c | 4 +-
sound/soc/dwc/dwc-i2s.c | 5 +-
sound/soc/fsl/fsl_asrc.c | 3 +-
sound/soc/fsl/fsl_esai.c | 3 +-
sound/soc/fsl/fsl_sai.c | 3 +-
sound/soc/fsl/fsl_spdif.c | 3 +-
sound/soc/fsl/fsl_ssi.c | 3 +-
sound/usb/midi.c | 6 +-
sound/usb/misc/ua101.c | 4 +-
sound/x86/intel_hdmi_audio.c | 5 +-
681 files changed, 2457 insertions(+), 1308 deletions(-)
create mode 100644 Documentation/admin-guide/mds.rst
create mode 100644 Documentation/clearcpu.txt
create mode 100644 arch/x86/include/asm/clearbpf.h
create mode 100644 arch/x86/include/asm/clearcpu.h
create mode 100644 arch/x86/include/asm/trace/clearcpu.h
create mode 100644 include/linux/clearcpu.h
--
2.17.2
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [PATCH v6 10/43] MDSv6
2019-02-24 15:07 [MODERATED] [PATCH v6 00/43] MDSv6 Andi Kleen
@ 2019-02-24 15:07 ` Andi Kleen
2019-02-25 16:30 ` [MODERATED] " Greg KH
2019-02-24 15:07 ` [MODERATED] [PATCH v6 31/43] MDSv6 Andi Kleen
1 sibling, 1 reply; 89+ messages in thread
From: Andi Kleen @ 2019-02-24 15:07 UTC (permalink / raw)
To: speck; +Cc: Andi Kleen
From: Andi Kleen <ak@linux.intel.com>
Subject: mds: Add documentation for clear cpu usage
Including the theory, and some guidelines for subsystem/driver
maintainers.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
Documentation/clearcpu.txt | 261 +++++++++++++++++++++++++++++++++++++
1 file changed, 261 insertions(+)
create mode 100644 Documentation/clearcpu.txt
diff --git a/Documentation/clearcpu.txt b/Documentation/clearcpu.txt
new file mode 100644
index 000000000000..a45e5d82868a
--- /dev/null
+++ b/Documentation/clearcpu.txt
@@ -0,0 +1,261 @@
+
+Security model for Microarchitectural Data Sampling
+===================================================
+
+Some CPUs can leave read or written data in internal buffers,
+which might later be sampled through side effects.
+For more details see CVE-2018-12126, CVE-2018-12127 and CVE-2018-12130.
+
+This can be avoided by explicitly clearing the CPU state.
+
+We attempt to avoid leaking data between different processes,
+and also to avoid leaking sensitive data, like cryptographic data,
+to user space.
+
+We support three modes:
+
+(1) mitigation off (mds=off)
+(2) clear only when needed (default)
+(3) clear on every kernel exit, or guest entry (mds=full)
+
+(1) and (3) are trivial; the rest of this document discusses (2).
+
+In general option (3) is the most conservative choice. It does
+not make single-thread (ST) assumptions about leaking data.
+
+Basic requirements and assumptions
+----------------------------------
+
+Kernel addresses and kernel temporary data are not sensitive.
+
+User data is sensitive, but only for other processes.
+
+User data is anything in the user address space, or data buffers
+directly copied from/to the user (e.g. read/write). It does not
+include metadata or flag settings. For example, packet headers
+or file names are not sensitive in this model.
+
+Block IO data (but not meta data) is sensitive.
+
+Most data structures in the kernel are not sensitive.
+
+Kernel data is sensitive when it involves cryptographic keys.
+
+We consider data from input devices (such as key presses)
+sensitive. We also consider sound data or terminal
+data sensitive.
+
+We assume that only data actually accessed by the kernel through explicit
+instructions can be leaked. Note that this may not always be
+true; in theory prefetching or speculation may touch more. The assumption
+is that if that happens it will be very low bandwidth and hard
+to control, due to the existing Spectre and other mitigations
+such as memory randomization. Users concerned about this
+need to use mds=full.
+
+Guidance for driver/subsystem developers
+----------------------------------------
+
+[These generally need to be enforced in code review for new code now]
+
+When you touch user-supplied data of *other* processes in system call
+context, add lazy_clear_cpu().
+
+For the cases below we care only about data from other processes.
+Touching non-cryptographic data from the current process is always allowed.
+
+Touching only pointers to user data is always allowed.
+
+When your interrupt handler touches user data directly, mark it with
+IRQF_USER_DATA.
+
+When your tasklet touches user data directly, mark it with
+TASKLET_USER_DATA, using tasklet_init_flags() or DECLARE_TASKLET_USERDATA*.
+
+When your timer touches user data, mark it with TIMER_USER_DATA.
+If it is an hrtimer and touches user data, mark it with HRTIMER_MODE_USER_DATA.
+
+When your irq poll handler touches user data, add lazy_clear_cpu().
+
+For networking code, make sure to only touch user data through
+skb_push/put/copy [add more], unless it is data from the current
+process. If that cannot be ensured, add lazy_clear_cpu() or
+lazy_clear_cpu_interrupt().
+
+Any cryptographic code touching key data should use memzero_explicit
+or kzfree to free the data.
+
+If your RCU callback touches user data add lazy_clear_cpu().
+
+These steps are currently only needed for code that runs on MDS affected
+CPUs, which currently means only x86. But it might be worth being
+prepared in case other architectures become affected too.
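+
+For example, marking an interrupt handler whose driver copies device
+data into user visible buffers might look like this (a sketch only;
+IRQF_USER_DATA is the flag introduced by this series, the driver and
+variable names are made up):
+
+  err = request_irq(irq, example_uart_interrupt,
+                    IRQF_SHARED | IRQF_USER_DATA, "example-uart", port);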
+
+Implementation details/assumptions
+----------------------------------
+
+Any buffer clearing is done lazily on the next kernel exit. The
+lazy_clear* calls merely set a flag, which takes only a few fast
+instructions with no cache misses, so they can be used frequently
+even in fast paths.
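+
+As an illustration, a minimal sketch of such a flag based scheme,
+assuming a per-CPU flag checked on the kernel exit path (names are
+illustrative, not necessarily the exact implementation):
+
+  /* Cheap enough for fast paths: only sets a per-CPU flag */
+  static DEFINE_PER_CPU(bool, cpu_needs_clear);
+
+  static inline void lazy_clear_cpu(void)
+  {
+        this_cpu_write(cpu_needs_clear, true);
+  }
+
+  /* Invoked on the return-to-user and guest-entry paths */
+  static inline void clear_cpu_on_kernel_exit(void)
+  {
+        if (this_cpu_read(cpu_needs_clear)) {
+                this_cpu_write(cpu_needs_clear, false);
+                mds_clear_cpu_buffers();  /* VERW based buffer clear */
+        }
+  }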
+
+Protecting process data
+-----------------------
+
+If a system call touches data of its own process, CPU state does not
+need to be cleared, because the process already has access to that data.
+
+On context switch we clear data, unless the switch stays within the
+same process. We also clear after any context switch from a kernel
+thread.
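+
+A sketch of that rule, assuming a hook in the context switch path
+(illustrative only):
+
+  /* Kernel threads have no mm and always force a clear; switches
+   * between threads of the same process do not. */
+  if (!prev->mm || prev->mm != next->mm)
+        lazy_clear_cpu();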
+
+Cryptographic keys inside the kernel should be protected.
+We assume code handling them uses kzfree() or memzero_explicit() to
+clear state, so these functions trigger a CPU clear.
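+
+For example, the hook can sit directly in kzfree(); this sketch is
+based on the mainline kzfree() of this era, with the added call
+marked:
+
+  void kzfree(const void *p)
+  {
+        size_t ks;
+
+        if (unlikely(ZERO_OR_NULL_PTR(p)))
+                return;
+        ks = ksize(p);
+        memzero_explicit((void *)p, ks);
+        kfree(p);
+        lazy_clear_cpu();  /* added: key material was touched */
+  }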
+
+Hard Interrupts and tasklets
+----------------------------
+
+Most interrupt handlers for modern devices do not touch
+user data, because they rely on DMA and only manipulate
+pointers. They have been audited.
+
+Some handlers copy data, but often use strategic
+functions which can be marked with a lazy clear:
+for example memcpy_from/to_io and swiotlb (see below
+for a full list).
+
+Some handlers touch user data without using these strategic
+functions; those have to be marked with IRQF_USER_DATA.
+All in-tree handlers have been audited.
+
+Softirqs
+--------
+
+Softirqs are handled case by case:
+
+ TIMER: see timers below.
+ NET_*: see networking below.
+ BLOCK: do not touch user data, except
+ for a few using kmap_atomic. We have a lazy_clear_cpu_interrupt()
+ in kmap_atomic for this case (see the sketch below).
+
+ IRQ_POLL: generally do not touch user data
+ TASKLET: see tasklets below
+ SCHED: only touches scheduler metadata
+ RCU: RCU handlers generally only free memory.
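+
+The interrupt-only variant used by kmap_atomic() above can be as
+simple as this sketch:
+
+  static inline void lazy_clear_cpu_interrupt(void)
+  {
+        /* Only interrupt context touches data of unrelated processes */
+        if (in_interrupt())
+                lazy_clear_cpu();
+  }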
+
+Networking
+----------
+
+This is only about network code running in hard interrupt
+or softirq or timer context. Per process network code
+generally only touches data of the current process,
+so it does not need any changes.
+
+In principle packet data should be encrypted for the wire anyway,
+but we still try to avoid leaking it.
+
+For networking code, any skb functions that are likely to touch
+non-header packet data schedule a CPU clear at the next
+kernel exit. This includes skb_copy and related, skb_put/push, and the
+checksum functions. We assume that any networking code touching
+packet data uses these functions.
+
+NMIs / machine checks
+---------------------
+
+Assume they don't touch other processes' user data. Most NMI
+handlers are fairly simple and only concerned with
+some non-user hardware state. The machine check handlers and perf PMI
+handlers are complicated (e.g. perf can touch the user stack), but they
+never touch any data not belonging to the current process.
+
+Other interrupts
+----------------
+
+SMP function call interrupt callbacks have been audited and don't touch
+any user data.
+
+Clear points
+------------
+
+We schedule clears in some centralized functions to minimize impact
+on the overall code.
+
+Always clear:
+
+kernel preemption             undefined state, need to always clear
+context switch                protect user / kernel thread data
+VM entry                      protect host against guest
+
+Always schedule clear for next kernel exit:
+
+kzfree / memzero_explicit     keys and crypto data
+
+Only schedule clear for next exit when called in interrupts:
+
+kmap_atomic                   block drivers touching user process data
+memcpy_from/to_io             drivers copying IO data
+insw*, outs*
+input_event                   input drivers touching user IO data
+serio_interrupt
+tty_insert_*                  tty drivers touching user input IO data
+swiotlb                       bounce buffers touching IO data
+sg_copy_*                     scsi drivers touching IO data in interrupts
+skb_put, skb_copy_*           networking code touching IO data
+skb_*csum*
+snd_pcm_period_elapsed,
+snd_rawmidi_transmit/receive,
+snd_timer_interrupt           sound drivers touching IO data
+
+Sandboxes
+---------
+
+We don't do anything special for seccomp processes.
+
+If there is a sandbox inside the process the process should take care
+itself of clearing its own sensitive data before running sandbox
+code. This would include data touched by system calls.
+
+BPF
+---
+
+Assume BPF execution does not touch other users' data, so it does
+not need to schedule a clear for itself.
+
+BPF could attack the rest of the kernel if it can successfully
+measure side channel side effects.
+
+When the BPF program was loaded unprivileged, always clear the CPU
+to prevent any exploit written in BPF from using side channels to
+read data leaked from other kernel code.
+
+We only do this when running in an interrupt, or if a CPU clear is
+already scheduled (which means, for example, there was a context
+switch or a crypto operation before).
+
+In process context we assume the code only accesses data of the
+current user, and we check that the running BPF program was loaded
+by the same user, so even if data leaked it would not cross
+privilege boundaries.
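+
+A sketch of that check (the field layout is illustrative, not the
+exact bpf_prog structure):
+
+  /* Clear unless the program was loaded by the user we run as */
+  if (in_interrupt() || !uid_eq(prog->aux->user->uid, current_uid()))
+        lazy_clear_cpu();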
+
+Technically we would only need to do this if the BPF program
+contains conditional branches and loads dominated by them, but
+let's assume that nearly all do.
+
+This could be further optimized by batching clears for
+many similar eBPF executions in a row (e.g. for packet
+processing). This would need to ensure that no sensitive
+data is touched in between the eBPF executions, and also
+that all eBPF programs are set up by the same uid.
+We could add such optimizations later based on
+profile data.
+
+Virtualization
+--------------
+
+When entering a guest in KVM we clear the CPU buffers to avoid any
+leakage to the guest. Normally this is done implicitly as part of
+the L1TF mitigation, except on a few CPUs that are not vulnerable to
+L1TF and need an explicit clear; this relies on the L1TF mitigation
+being enabled. It also uses the "fast exit" optimization that only
+clears if an interrupt or context switch happened during the VMexit,
+unless mds=full is used.
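+
+A sketch of that logic, reusing the lazy flag from above;
+l1d_flush_will_run() stands in for however the L1TF mitigation state
+is queried (illustrative only):
+
+  /* Before VMENTER */
+  if (!l1d_flush_will_run() &&
+      (mds_mode == MDS_FULL || this_cpu_read(cpu_needs_clear)))
+        mds_clear_cpu_buffers();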
--
2.17.2
^ permalink raw reply related [flat|nested] 89+ messages in thread
* [MODERATED] Re: [PATCH v6 10/43] MDSv6
2019-02-24 15:07 ` [MODERATED] [PATCH v6 10/43] MDSv6 Andi Kleen
@ 2019-02-25 16:30 ` Greg KH
2019-02-25 16:41 ` [MODERATED] Encrypted Message Jon Masters
0 siblings, 1 reply; 89+ messages in thread
From: Greg KH @ 2019-02-25 16:30 UTC (permalink / raw)
To: speck
I'm sorry, I just can't stop...
On Sun, Feb 24, 2019 at 07:07:16AM -0800, speck for Andi Kleen wrote:
> +Guidance for driver/subsystem developers
> +----------------------------------------
> +
> +[These generally need to be enforced in code review for new code now]
> +
> +When you touch user supplied data of *other* processes in system call
> +context add lazy_clear_cpu().
What do you mean by "user supplied data of *other* processes"? I think
I understand the kernel, and I have no idea what this means.
> +For the cases below we care only about data from other processes.
> +Touching non cryptographic data from the current process is always allowed.
Define "non cryptographic data". Is data coming across a serial port
crypto data? From a camera?
> +
> +Touching only pointers to user data is always allowed.
But not touching the data the pointer points to?
> +
> +When your interrupt does touch user data directly mark it with IRQF_USER_DATA.
I still don't know what "user data" means. Is a serial stream coming
from a bluetooth device "user data"? Is a program that talks directly
to a USB device without a special kernel driver reading "user data"? Is a
serial port data stream "user data"?
> +
> +When your tasklet does touch user data directly, mark it TASKLET_USER_DATA
> +using tasklet_init_flags/or DECLARE_TASKLET_USERDATA*.
Same as above.
> +
> +When your timer does touch user data mark it with TIMER_USER_DATA
> +If it is a hrtimer and touches user data, mark it with HRTIMER_MODE_USER_DATA.
Mark what, the timer function?
> +When your irq poll handler does touch user data, mark it lazy_clear_cpu().
Mark what? That's a function call?
> +For networking code, make sure to only touch user data through
> +skb_push/put/copy [add more], unless it is data from the current
> +process. If that is not ensured add lazy_clear_cpu or
> +lazy_clear_cpu_interrupt.
How do you know if data coming across the network is for the "current"
process? What does "current process" even mean here?
How about UIO drivers, how does their data get classified? Lots of
networking stacks use UIO now... virtual io channels?
> +Any cryptographic code touching key data should use memzero_explicit
> +or kzfree to free the data.
We do that today, right? If not, that needs to be done regardless.
And what about password data? I _think_ we got most of that now out of
the tty layer, but we could be wrong :)
> +
> +If your RCU callback touches user data add lazy_clear_cpu().
Ugh, really?
> +
> +These steps are currently only needed for code that runs on MDS affected
> +CPUs, which is currently only x86. But might be worth being prepared
> +if other architectures become affected too.
As someone who probably reviews more new drivers than anyone else right
now, my first recommendation is going to be, "Buy a non-Intel
processor, we have no idea what they expect from driver writers now,
just give up trying to appease them." Because if I, the person that is
responsible for reviewing those drivers, has no idea what to do here, how
can some random driver author be expected to know?
Seriously, this is crazy.
And if you all are going to expect me to start auditing all new drivers
based on these new rules, I need a _big_ raise. I somehow doubt you are
asking all other OS vendors to do all of this crud.
> +Implementation details/assumptions
> +----------------------------------
> +
> +Any buffer clearing is done lazily on next kernel exit. lazy_clear*
> +is only a few fast instructions with no cache misses setting
> +a flag and can be used frequently even in fast paths.
I can not parse this paragraph at all, what are you trying to say?
> +Protecting process data
> +-----------------------
> +
> +If a system call touches data of its own process, CPU state does not
> +need to be cleared, because it has already access to it.
How do you know if it is its own process or not?
What about something like IPC data?
> +On context switching we clear data, unless the context switch is
> +inside a process. We also clear after any context switches from kernel
> +threads.
> +
> +Cryptographic keys inside the kernel should be protected.
"Protected"? How? Huh? ugh.
<big snip as I got tired>
> +Sandboxes
> +---------
> +
> +We don't do anything special for seccomp processes
> +
> +If there is a sandbox inside the process the process should take care
> +itself of clearing its own sensitive data before running sandbox
> +code. This would include data touched by system calls.
i.e. "Userspace code is hosed, sorry."?
> +BPF
> +---
> +
> +Assume BPF execution does not touch other user's data, so does
> +not need to schedule a clear for itself.
Can you assume that?
> +BPF could attack the rest of the kernel if it can successfully
> +measure side channel side effects.
Can it do such a measurement?
> +When the BPF program was loaded unprivileged, always clear the CPU
> +to prevent any exploits written in BPF using side channels to read
> +data leaked from other kernel code
> +
> +We only do this when running in an interrupt, or if an clear cpu is
> +already scheduled (which means for example there was a context
> +switch, or crypto operation before)
> +
> +In process context we assume the code only accesses data of the
> +current user and check that the BPF running was loaded by the
> +same user so even if data leaked it would not cross privilege
> +boundaries.
> +
> +Technically we would only need to do this if the BPF program
> +contains conditional branches and loads dominated by them, but
> +let's assume that nearly all do.
> +
> +This could be further optimized by batching clears for
> +many similar EBPF executions in a row (e.g. for packet
> +processing). This would need ensuring that no sensitive
> +data is touched inbetween the EBPF executions, and also
> +that all EBPF scripts are set up by the same uid.
> +We could add such optimizations later based on
> +profile data.
Please contact the BPF people before writing any of the above. The fact
that you all have not done so is scandalous.
greg k-h
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [PATCH v6 31/43] MDSv6
2019-02-24 15:07 [MODERATED] [PATCH v6 00/43] MDSv6 Andi Kleen
2019-02-24 15:07 ` [MODERATED] [PATCH v6 10/43] MDSv6 Andi Kleen
@ 2019-02-24 15:07 ` Andi Kleen
2019-02-25 15:19 ` [MODERATED] " Greg KH
1 sibling, 1 reply; 89+ messages in thread
From: Andi Kleen @ 2019-02-24 15:07 UTC (permalink / raw)
To: speck; +Cc: Andi Kleen
From: Andi Kleen <ak@linux.intel.com>
Subject: mds sweep: Clear cpu for usbmon intercepts
usbmon touches user data in interrupts that otherwise don't
touch user data. Automatically schedule a clear cpu if
usbmon is called from an interrupt.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
include/linux/usb/hcd.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/usb/hcd.h b/include/linux/usb/hcd.h
index 7dc3a411bece..7f37056fe973 100644
--- a/include/linux/usb/hcd.h
+++ b/include/linux/usb/hcd.h
@@ -25,6 +25,7 @@
#include <linux/rwsem.h>
#include <linux/interrupt.h>
#include <linux/idr.h>
+#include <linux/clearcpu.h>
#define MAX_TOPO_LEVEL 6
@@ -688,8 +689,10 @@ static inline void usbmon_urb_submit_error(struct usb_bus *bus, struct urb *urb,
static inline void usbmon_urb_complete(struct usb_bus *bus, struct urb *urb,
int status)
{
- if (bus->monitored)
+ if (bus->monitored) {
(*mon_ops->urb_complete)(bus, urb, status);
+ lazy_clear_cpu_interrupt();
+ }
}
int usb_mon_register(const struct usb_mon_operations *ops);
--
2.17.2
^ permalink raw reply related [flat|nested] 89+ messages in thread
* [MODERATED] Re: [PATCH v6 31/43] MDSv6
2019-02-24 15:07 ` [MODERATED] [PATCH v6 31/43] MDSv6 Andi Kleen
@ 2019-02-25 15:19 ` Greg KH
2019-02-25 15:34 ` Andi Kleen
0 siblings, 1 reply; 89+ messages in thread
From: Greg KH @ 2019-02-25 15:19 UTC (permalink / raw)
To: speck
On Sun, Feb 24, 2019 at 07:07:37AM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject: mds sweep: Clear cpu for usbmon intercepts
>
> usbmon touches user data in interrupts that otherwise don't
> touch user data. Automatically schedule a clear cpu if
> usbmon is called from an interrupt.
I have written a long and very satisfying rant about this patch, that I
then deleted, as it made me feel much better, but probably would not
have helped anyone else out.
In turn, I need you to properly justify this patch as these two tiny
sentences, and this small patch make no sense to me at all. Please
explain _WHY_ this is needed in this specific location. Before
responding, I would strongly recommend reading up on exactly what usbmon
is and who is allowed to use it. If after doing that, you still feel
this patch is needed (and it might be, I still can not tell for sure),
please reply with enough detail that anyone who does not know what
usbmon is, or what mds really is, can understand why this patch is
needed.
thanks,
greg k-h
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [PATCH v6 31/43] MDSv6
2019-02-25 15:19 ` [MODERATED] " Greg KH
@ 2019-02-25 15:34 ` Andi Kleen
2019-02-25 15:49 ` Greg KH
0 siblings, 1 reply; 89+ messages in thread
From: Andi Kleen @ 2019-02-25 15:34 UTC (permalink / raw)
To: speck
On Mon, Feb 25, 2019 at 04:19:35PM +0100, speck for Greg KH wrote:
> On Sun, Feb 24, 2019 at 07:07:37AM -0800, speck for Andi Kleen wrote:
> > From: Andi Kleen <ak@linux.intel.com>
> > Subject: mds sweep: Clear cpu for usbmon intercepts
> >
> > usbmon touches user data in interrupts that otherwise don't
> > touch user data. Automatically schedule a clear cpu if
> > usbmon is called from an interrupt.
>
> I have written a long and very satisfying rant about this patch, that I
> then deleted, as it made me feel much better, but probably would not
> have helped anyone else out.
>
> In turn, I need you to properly justify this patch as these two tiny
> sentences, and this small patch make no sense to me at all. Please
> explain _WHY_ this is needed in this specific location. Before
> responding, I would strongly recommend reading up on exactly what usbmon
> is and who is allowed to use it. If after doing that, you still feel
Right, it's root only.
But this is not about leaking data to the root monitoring user
(who can see the data anyway), but to unrelated processes
which are not root, but happen to be interrupted by the USB
interrupt.
> this patch is needed (and it might be, I still can not tell for sure),
Anything that touches user data in an interrupt needs to be marked
with the lazy approach.
I can write more on this instance.
However I will probably not be able to write a detailed
description for each of the interrupt handlers changed because
there are just too many.
-Andi
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [PATCH v6 31/43] MDSv6
2019-02-25 15:34 ` Andi Kleen
@ 2019-02-25 15:49 ` Greg KH
2019-02-25 15:52 ` [MODERATED] Encrypted Message Jon Masters
0 siblings, 1 reply; 89+ messages in thread
From: Greg KH @ 2019-02-25 15:49 UTC (permalink / raw)
To: speck
On Mon, Feb 25, 2019 at 07:34:11AM -0800, speck for Andi Kleen wrote:
> On Mon, Feb 25, 2019 at 04:19:35PM +0100, speck for Greg KH wrote:
> > On Sun, Feb 24, 2019 at 07:07:37AM -0800, speck for Andi Kleen wrote:
> > > From: Andi Kleen <ak@linux.intel.com>
> > > Subject: mds sweep: Clear cpu for usbmon intercepts
> > >
> > > usbmon touches user data in interrupts that otherwise don't
> > > touch user data. Automatically schedule a clear cpu if
> > > usbmon is called from an interrupt.
> >
> > I have written a long and very satisfying rant about this patch, that I
> > then deleted, as it made me feel much better, but probably would not
> > have helped anyone else out.
> >
> > In turn, I need you to properly justify this patch as these two tiny
> > sentences, and this small patch make no sense to me at all. Please
> > explain _WHY_ this is needed in this specific location. Before
> > responding, I would strongly recommend reading up on exactly what usbmon
> > is and who is allowed to use it. If after doing that, you still feel
>
> Right it's root only.
>
> But this is not about leaking data to the root monitoring user
> (who can see the data anyways), but to unrelated processes
> which are not root, but happen to be interrupted by the USB
> interrupt.
Then why are you messing around with the usbmon callback? It has
nothing to do with anything here. By hooking it here, you now have 2
calls to this function on the USB urb callback path.
The fact that a root process happens to be watching the USB data flowing
through the system, or not, should have no effect on anything here, as
the data flow is still the same (with the exception that an extra copy in
the irq could happen). Do multiple copies matter or not? I can't find
anything in the documentation we have about this, am I missing it?
> > this patch is needed (and it might be, I still can not tell for sure),
>
> Anything that touches user data in an interrupt needs to be marked
> with the lazy approach.
As I asked with the hcd change, what is "user data"?
> I can write more on this instance.
I nicely asked for that in the past but was ignored twice. Do I need to
ask for it again in a non-nice manner?
Without that information, this patchset is pretty impossible to review.
> However I will probably not be able to write a detailed
> description for each of the interrupt handlers changed because
> there are just too many.
Then how do you expect each subsystem / driver author to know if this is
an acceptable change or not? How do you expect to educate driver
authors to have them determine if they need to do this on their new
drivers or not? Are you going to hand-audit each new driver that gets
added to the kernel for forever?
Without this type of information, this seems like a futile exercise.
greg k-h
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Encrypted Message
2019-02-25 15:49 ` Greg KH
@ 2019-02-25 15:52 ` Jon Masters
2019-02-25 16:00 ` [MODERATED] " Greg KH
0 siblings, 1 reply; 89+ messages in thread
From: Jon Masters @ 2019-02-25 15:52 UTC (permalink / raw)
To: speck
From: Jon Masters <jcm@redhat.com>
To: speck for Greg KH <speck@linutronix.de>
Subject: Re: [PATCH v6 31/43] MDSv6
On 2/25/19 10:49 AM, speck for Greg KH wrote:
> On Mon, Feb 25, 2019 at 07:34:11AM -0800, speck for Andi Kleen wrote:
>> However I will probably not be able to write a detailed
>> description for each of the interrupt handlers changed because
>> there are just too many.
>
> Then how do you expect each subsystem / driver author to know if this is
> an acceptable change or not? How do you expect to educate driver
> authors to have them determine if they need to do this on their new
> drivers or not? Are you going to hand-audit each new driver that gets
> added to the kernel for forever?
>
> Without this type of information, this seems like a futile exercise.
Forgive me if I'm being too cautious here, but it seems to make most
sense to have the basic MDS infrastructure in place at unembargo. Unless
it's very clear how the auto stuff can be safe, and the audit
comprehensive, I wonder if that shouldn't just be done after.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: Encrypted Message
2019-02-25 15:52 ` [MODERATED] Encrypted Message Jon Masters
@ 2019-02-25 16:00 ` Greg KH
2019-02-25 16:19 ` [MODERATED] " Jon Masters
0 siblings, 1 reply; 89+ messages in thread
From: Greg KH @ 2019-02-25 16:00 UTC (permalink / raw)
To: speck
On Mon, Feb 25, 2019 at 10:52:30AM -0500, speck for Jon Masters wrote:
> From: Jon Masters <jcm@redhat.com>
> To: speck for Greg KH <speck@linutronix.de>
> Subject: Re: [PATCH v6 31/43] MDSv6
> On 2/25/19 10:49 AM, speck for Greg KH wrote:
> > On Mon, Feb 25, 2019 at 07:34:11AM -0800, speck for Andi Kleen wrote:
>
>
> >> However I will probably not be able to write a detailed
> >> description for each of the interrupt handlers changed because
> >> there are just too many.
> >
> > Then how do you expect each subsystem / driver author to know if this is
> > an acceptable change or not? How do you expect to educate driver
> > authors to have them determine if they need to do this on their new
> > drivers or not? Are you going to hand-audit each new driver that gets
> > added to the kernel for forever?
> >
> > Without this type of information, this seems like a futile exercise.
>
> Forgive me if I'm being too cautious here, but it seems to make most
> sense to have the basic MDS infrastructure in place at unembargo. Unless
> it's very clear how the auto stuff can be safe, and the audit
> comprehensive, I wonder if that shouldn't just be done after.
I thought that was what Thomas's patchset provided and is what was
alluded to in patch 00/43 of this series.
greg k-h
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Encrypted Message
2019-02-25 16:00 ` [MODERATED] " Greg KH
@ 2019-02-25 16:19 ` Jon Masters
0 siblings, 0 replies; 89+ messages in thread
From: Jon Masters @ 2019-02-25 16:19 UTC (permalink / raw)
To: speck
From: Jon Masters <jcm@redhat.com>
To: speck for Greg KH <speck@linutronix.de>
Subject: Re: Encrypted Message
On 2/25/19 11:00 AM, speck for Greg KH wrote:
> On Mon, Feb 25, 2019 at 10:52:30AM -0500, speck for Jon Masters wrote:
>> From: Jon Masters <jcm@redhat.com>
>> To: speck for Greg KH <speck@linutronix.de>
>> Subject: Re: [PATCH v6 31/43] MDSv6
>
>> On 2/25/19 10:49 AM, speck for Greg KH wrote:
>>> On Mon, Feb 25, 2019 at 07:34:11AM -0800, speck for Andi Kleen wrote:
>>
>>
>>>> However I will probably not be able to write a detailed
>>>> description for each of the interrupt handlers changed because
>>>> there are just too many.
>>>
>>> Then how do you expect each subsystem / driver author to know if this is
>>> an acceptable change or not? How do you expect to educate driver
>>> authors to have them determine if they need to do this on their new
>>> drivers or not? Are you going to hand-audit each new driver that gets
>>> added to the kernel for forever?
>>>
>>> Without this type of information, this seems like a futile exercise.
>>
>> Forgive me if I'm being too cautious here, but it seems to make most
>> sense to have the basic MDS infrastructure in place at unembargo. Unless
>> it's very clear how the auto stuff can be safe, and the audit
>> comprehensive, I wonder if that shouldn't just be done after.
>
> I thought that was what Thomas's patchset provided and is what was
> alluded to in patch 00/43 of this series.
Indeed. I'm asking whether we're trying to figure out the "auto" stuff
as well before unembargo, or is the other discussion just for planning?
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [patch V4 00/11] MDS basics
@ 2019-02-22 22:24 Thomas Gleixner
2019-02-22 22:24 ` [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() Thomas Gleixner
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
To: speck
Hi!
Another day, another update.
Changes since V3:
- Add the #DF mitigation and document why I can't be bothered
to sprinkle the buffer clear into #MC
- Add a comment about the segment selector choice. It makes sense on its
own but it won't prevent anyone from thinking that we're crazy.
- Addressed the review feedback vs. documentation
- Resurrected the admin documentation patch, tidied it up and filled the
gaps.
Delta patch without the admin documentation parts below.
Git tree WIP.mds branch is updated as well.
If any of the people new to this needs access to the git repo,
please send me a public SSH key so I can add to the gitolite config.
There is one point left which I did not look into yet and I'm happy to
delegate that to the virtualization wizards:
XEON PHI is not affected by L1TF, so it won't get the L1TF
mitigations. But it is affected by MSBDS, so it needs separate
mitigation, i.e. clearing CPU buffers on VMENTER.
Thanks,
Thomas
8<-------------------
Documentation/ABI/testing/sysfs-devices-system-cpu | 1
Documentation/admin-guide/hw-vuln/index.rst | 13 +
Documentation/admin-guide/hw-vuln/l1tf.rst | 1
Documentation/admin-guide/hw-vuln/mds.rst | 258 +++++++++++++++++++++
Documentation/admin-guide/index.rst | 6
Documentation/admin-guide/kernel-parameters.txt | 27 ++
Documentation/index.rst | 1
Documentation/x86/conf.py | 10
Documentation/x86/index.rst | 8
Documentation/x86/mds.rst | 205 ++++++++++++++++
arch/x86/entry/common.c | 10
arch/x86/include/asm/cpufeatures.h | 2
arch/x86/include/asm/irqflags.h | 4
arch/x86/include/asm/msr-index.h | 39 +--
arch/x86/include/asm/mwait.h | 7
arch/x86/include/asm/nospec-branch.h | 39 +++
arch/x86/include/asm/processor.h | 7
arch/x86/kernel/cpu/bugs.c | 105 ++++++++
arch/x86/kernel/cpu/common.c | 13 +
arch/x86/kernel/nmi.c | 6
arch/x86/kernel/traps.c | 9
arch/x86/kvm/cpuid.c | 3
drivers/base/cpu.c | 8
include/linux/cpu.h | 2
24 files changed, 762 insertions(+), 22 deletions(-)
diff --git a/Documentation/x86/mds.rst b/Documentation/x86/mds.rst
index 0c0d802367e6..ce3dbddbd3b8 100644
--- a/Documentation/x86/mds.rst
+++ b/Documentation/x86/mds.rst
@@ -1,7 +1,12 @@
Microarchitectural Data Sampling (MDS) mitigation
=================================================
-Microarchitectural Data Sampling (MDS) is a class of side channel attacks
+.. _mds:
+
+Overview
+--------
+
+Microarchitectural Data Sampling (MDS) is a family of side channel attacks
on internal buffers in Intel CPUs. The variants are:
- Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
@@ -33,6 +38,7 @@ faulting or assisting loads under certain conditions, which again can be
exploited eventually. Load ports are shared between Hyper-Threads so cross
thread leakage is possible.
+
Exposure assumptions
--------------------
@@ -48,7 +54,7 @@ needed for exploiting MDS requires:
- to control the pointer through which the disclosure gadget exposes the
data
-The existance of such a construct cannot be excluded with 100% certainty,
+The existence of such a construct cannot be excluded with 100% certainty,
but the complexity involved makes it extremely unlikely.
There is one exception, which is untrusted BPF. The functionality of
@@ -91,13 +97,37 @@ the invocation can be enforced or conditional.
As a special quirk to address virtualization scenarios where the host has
the microcode updated, but the hypervisor does not (yet) expose the
MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
-hope that it might work. The state is reflected accordingly.
+hope that it might actually clear the buffers. The state is reflected
+accordingly.
According to current knowledge additional mitigations inside the kernel
itself are not required because the necessary gadgets to expose the leaked
data cannot be controlled in a way which allows exploitation from malicious
user space or VM guests.
+
+Kernel internal mitigation modes
+--------------------------------
+
+ ======= ===========================================================
+ off Mitigation is disabled. Either the CPU is not affected or
+ mds=off is supplied on the kernel command line
+
+ full Mitigation is enabled. CPU is affected and MD_CLEAR is
+ advertised in CPUID.
+
+ vmwerv Mitigation is enabled. CPU is affected and MD_CLEAR is not
+ advertised in CPUID. That is mainly for virtualization
+ scenarios where the host has the updated microcode but the
+ hypervisor does not expose MD_CLEAR in CPUID. It's a best
+ effort approach without guarantee.
+ ======= ===========================================================
+
+If the CPU is affected and mds=off is not supplied on the kernel
+command line then the kernel selects the appropriate mitigation mode
+depending on the availability of the MD_CLEAR CPUID bit.
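+
+The selection boils down to this sketch; mds_off_cmdline stands in
+for the command line handling, the flag and mode names are the ones
+used by this series:
+
+  if (!boot_cpu_has_bug(X86_BUG_MDS) || mds_off_cmdline)
+        mds_mitigation = MDS_MITIGATION_OFF;
+  else if (boot_cpu_has(X86_FEATURE_MD_CLEAR))
+        mds_mitigation = MDS_MITIGATION_FULL;
+  else
+        mds_mitigation = MDS_MITIGATION_VMWERV;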
+
+
Mitigation points
-----------------
@@ -128,8 +158,16 @@ Mitigation points
coverage.
There is one non maskable exception which returns through paranoid exit
- and is not mitigated: #DF. If user space is able to trigger a double
- fault the possible MDS leakage is the least problem to worry about.
+ and is to some extent controllable from user space through
+ modify_ldt(2): #DF. So mitigation is required in the double fault
+ handler as well.
+
+ Another corner case is a #MC which hits between the buffer clear and the
+ actual return to user. As this still is in kernel space it takes the
+ paranoid exit path which does not clear the CPU buffers. So the #MC
+ handler repopulates the buffers to some extent. Machine checks are not
+ reliably controllable and the window is extremely small so mitigation
+ would just tick a checkbox that this theoretical corner case is covered.
2. C-State transition
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 8be9158d848e..3e27ccd6d5c5 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -338,6 +338,8 @@ static inline void mds_clear_cpu_buffers(void)
* Has to be the memory-operand variant because only that
* guarantees the CPU buffer flush functionality according to
* documentation. The register-operand variant does not.
+ * Works with any segment selector, but a valid writable
+ * data segment is the fastest variant.
*
* "cc" clobber is required because VERW modifies ZF.
*/
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 0fb241a78de3..83b19bb54093 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -68,6 +68,7 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
DEFINE_STATIC_KEY_FALSE(mds_user_clear);
/* Control MDS CPU buffer clear before idling (halt, mwait) */
DEFINE_STATIC_KEY_FALSE(mds_idle_clear);
+EXPORT_SYMBOL_GPL(mds_idle_clear);
void __init check_bugs(void)
{
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9b7c4ca8f0a7..d2779f4730f5 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -366,6 +366,15 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
regs->ip = (unsigned long)general_protection;
regs->sp = (unsigned long)&gpregs->orig_ax;
+ /*
+ * This situation can be triggered by userspace via
+ * modify_ldt(2) and the return does not take the regular
+ * user space exit, so a CPU buffer clear is required when
+ * MDS mitigation is enabled.
+ */
+ if (static_branch_unlikely(&mds_user_clear))
+ mds_clear_cpu_buffers();
+
return;
}
#endif
^ permalink raw reply related [flat|nested] 89+ messages in thread
* [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()
2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
@ 2019-02-22 22:24 ` Thomas Gleixner
2019-02-26 14:19 ` [MODERATED] " Josh Poimboeuf
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-02-22 22:24 UTC (permalink / raw)
To: speck; +Cc: Borislav Petkov, Greg Kroah-Hartman
From: Thomas Gleixner <tglx@linutronix.de>
The Microarchitectural Data Sampling (MDS) vulnerabilities are mitigated by
clearing the affected CPU buffers. The mechanism for clearing the buffers
uses the unused and obsolete VERW instruction in combination with a
microcode update which triggers a CPU buffer clear when VERW is executed.
Provide an inline function with the assembly magic. The argument of the VERW
instruction must be a memory operand as documented:
"MD_CLEAR enumerates that the memory-operand variant of VERW (for
example, VERW m16) has been extended to also overwrite buffers affected
by MDS. This buffer overwriting functionality is not guaranteed for the
register operand variant of VERW."
Documentation also recommends to use a writable data segment selector:
"The buffer overwriting occurs regardless of the result of the VERW
permission check, as well as when the selector is null or causes a
descriptor load segment violation. However, for lowest latency we
recommend using a selector that indicates a valid writable data
segment."
Add x86 specific documentation about MDS and the internal workings of the
mitigation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
V3 --> V4: Document the segment selector choice as well.
V2 --> V3: Add VERW documentation and fix typos/grammar..., dropped 'i(0)'
Add more details to the documentation file
V1 --> V2: Add "cc" clobber and documentation
---
Documentation/index.rst | 1
Documentation/x86/conf.py | 10 +++
Documentation/x86/index.rst | 8 ++
Documentation/x86/mds.rst | 100 +++++++++++++++++++++++++++++++++++
arch/x86/include/asm/nospec-branch.h | 25 ++++++++
5 files changed, 144 insertions(+)
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -101,6 +101,7 @@ implementation.
:maxdepth: 2
sh/index
+ x86/index
Filesystem Documentation
------------------------
--- /dev/null
+++ b/Documentation/x86/conf.py
@@ -0,0 +1,10 @@
+# -*- coding: utf-8; mode: python -*-
+
+project = "X86 architecture specific documentation"
+
+tags.add("subproject")
+
+latex_documents = [
+ ('index', 'x86.tex', project,
+ 'The kernel development community', 'manual'),
+]
--- /dev/null
+++ b/Documentation/x86/index.rst
@@ -0,0 +1,8 @@
+==========================
+x86 architecture specifics
+==========================
+
+.. toctree::
+ :maxdepth: 1
+
+ mds
--- /dev/null
+++ b/Documentation/x86/mds.rst
@@ -0,0 +1,100 @@
+Microarchitectural Data Sampling (MDS) mitigation
+=================================================
+
+.. _mds:
+
+Overview
+--------
+
+Microarchitectural Data Sampling (MDS) is a family of side channel attacks
+on internal buffers in Intel CPUs. The variants are:
+
+ - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
+ - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
+ - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
+
+MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
+dependent load (store-to-load forwarding) as an optimization. The forward
+can also happen to a faulting or assisting load operation for a different
+memory address, which can be exploited under certain conditions. Store
+buffers are partitioned between Hyper-Threads so cross thread forwarding is
+not possible. But if a thread enters or exits a sleep state the store
+buffer is repartitioned which can expose data from one thread to the other.
+
+MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
+L1 miss situations and to hold data which is returned or sent in response
+to a memory or I/O operation. Fill buffers can forward data to a load
+operation and also write data to the cache. When the fill buffer is
+deallocated it can retain the stale data of the preceding operations which
+can then be forwarded to a faulting or assisting load operation, which can
+be exploited under certain conditions. Fill buffers are shared between
+Hyper-Threads so cross thread leakage is possible.
+
+MLDPS leaks Load Port Data. Load ports are used to perform load operations
+from memory or I/O. The received data is then forwarded to the register
+file or a subsequent operation. In some implementations the Load Port can
+contain stale data from a previous operation which can be forwarded to
+faulting or assisting loads under certain conditions, which again can be
+exploited eventually. Load ports are shared between Hyper-Threads so cross
+thread leakage is possible.
+
+
+Exposure assumptions
+--------------------
+
+It is assumed that attack code resides in user space or in a guest with one
+exception. The rationale behind this assumption is that the code construct
+needed for exploiting MDS requires:
+
+ - to control the load to trigger a fault or assist
+
+ - to have a disclosure gadget which exposes the speculatively accessed
+ data for consumption through a side channel.
+
+ - to control the pointer through which the disclosure gadget exposes the
+ data
+
+The existence of such a construct cannot be excluded with 100% certainty,
+but the complexity involved makes it extremely unlikely.
+
+There is one exception, which is untrusted BPF. The functionality of
+untrusted BPF is limited, but it needs to be thoroughly investigated
+whether it can be used to create such a construct.
+
+
+Mitigation strategy
+-------------------
+
+All variants have the same mitigation strategy at least for the single CPU
+thread case (SMT off): Force the CPU to clear the affected buffers.
+
+This is achieved by using the otherwise unused and obsolete VERW
+instruction in combination with a microcode update. The microcode clears
+the affected CPU buffers when the VERW instruction is executed.
+
+For virtualization there are two ways to achieve CPU buffer
+clearing. Either the modified VERW instruction or via the L1D Flush
+command. The latter is issued when L1TF mitigation is enabled so the extra
+VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to
+be issued.
+
+If the VERW instruction with the supplied segment selector argument is
+executed on a CPU without the microcode update there is no side effect
+other than a small number of pointlessly wasted CPU cycles.
+
+This does not protect against cross Hyper-Thread attacks except for MSBDS
+which is only exploitable cross Hyper-thread when one of the Hyper-Threads
+enters a C-state.
+
+The kernel provides a function to invoke the buffer clearing:
+
+ mds_clear_cpu_buffers()
+
+The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
+(idle) transitions. Depending on the mitigation mode and the system state
+the invocation can be enforced or conditional.
+
+According to current knowledge additional mitigations inside the kernel
+itself are not required because the necessary gadgets to expose the leaked
+data cannot be controlled in a way which allows exploitation from malicious
+user space or VM guests.
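+
+A typical mitigation point then uses the pattern from the entry and
+trap code of this series:
+
+  if (static_branch_unlikely(&mds_user_clear))
+          mds_clear_cpu_buffers();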
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -318,6 +318,31 @@ DECLARE_STATIC_KEY_FALSE(switch_to_cond_
DECLARE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
+#include <asm/segment.h>
+
+/**
+ * mds_clear_cpu_buffers - Mitigation for MDS vulnerability
+ *
+ * This uses the otherwise unused and obsolete VERW instruction in
+ * combination with microcode which triggers a CPU buffer flush when the
+ * instruction is executed.
+ */
+static inline void mds_clear_cpu_buffers(void)
+{
+ static const u16 ds = __KERNEL_DS;
+
+ /*
+ * Has to be the memory-operand variant because only that
+ * guarantees the CPU buffer flush functionality according to
+ * documentation. The register-operand variant does not.
+ * Works with any segment selector, but a valid writable
+ * data segment is the fastest variant.
+ *
+ * "cc" clobber is required because VERW modifies ZF.
+ */
+ asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
+}
+
#endif /* __ASSEMBLY__ */
/*
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()
2019-02-22 22:24 ` [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() Thomas Gleixner
@ 2019-02-26 14:19 ` Josh Poimboeuf
2019-03-01 20:58 ` [MODERATED] Encrypted Message Jon Masters
0 siblings, 1 reply; 89+ messages in thread
From: Josh Poimboeuf @ 2019-02-26 14:19 UTC (permalink / raw)
To: speck
On Fri, Feb 22, 2019 at 11:24:22PM +0100, speck for Thomas Gleixner wrote:
> +MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
> +L1 miss situations and to hold data which is returned or sent in response
> +to a memory or I/O operation. Fill buffers can forward data to a load
> +operation and also write data to the cache. When the fill buffer is
> +deallocated it can retain the stale data of the preceding operations which
> +can then be forwarded to a faulting or assisting load operation, which can
> +be exploited under certain conditions. Fill buffers are shared between
> +Hyper-Threads so cross thread leakage is possible.
> +
> +MLDPS leaks Load Port Data. Load ports are used to perform load operations
MLPDS
> +from memory or I/O. The received data is then forwarded to the register
> +file or a subsequent operation. In some implementations the Load Port can
> +contain stale data from a previous operation which can be forwarded to
> +faulting or assisting loads under certain conditions, which again can be
> +exploited eventually. Load ports are shared between Hyper-Threads so cross
> +thread leakage is possible.
> +
> +
> +Exposure assumptions
> +--------------------
> +
> +It is assumed that attack code resides in user space or in a guest with one
> +exception. The rationale behind this assumption is that the code construct
> +needed for exploiting MDS requires:
> +
> + - to control the load to trigger a fault or assist
> +
> + - to have a disclosure gadget which exposes the speculatively accessed
> + data for consumption through a side channel.
> +
> + - to control the pointer through which the disclosure gadget exposes the
> + data
> +
> +The existence of such a construct cannot be excluded with 100% certainty,
> +but the complexity involved makes it extremely unlikely.
The existence of such a construct *in the kernel* cannot be excluded...
> +There is one exception, which is untrusted BPF. The functionality of
> +untrusted BPF is limited, but it needs to be thoroughly investigated
> +whether it can be used to create such a construct.
> +
> +
> +Mitigation strategy
> +-------------------
> +
> +All variants have the same mitigation strategy at least for the single CPU
> +thread case (SMT off): Force the CPU to clear the affected buffers.
> +
> +This is achieved by using the otherwise unused and obsolete VERW
> +instruction in combination with a microcode update. The microcode clears
> +the affected CPU buffers when the VERW instruction is executed.
> +
> +For virtualization there are two ways to achieve CPU buffer
> +clearing. Either the modified VERW instruction or via the L1D Flush
> +command. The latter is issued when L1TF mitigation is enabled so the extra
> +VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to
> +be issued.
> +
> +If the VERW instruction with the supplied segment selector argument is
> +executed on a CPU without the microcode update there is no side effect
> +other than a small number of pointlessly wasted CPU cycles.
> +
> +This does not protect against cross Hyper-Thread attacks except for MSBDS
> +which is only exploitable cross Hyper-thread when one of the Hyper-Threads
> +enters a C-state.
> +
> +The kernel provides a function to invoke the buffer clearing:
> +
> + mds_clear_cpu_buffers()
> +
> +The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
> +(idle) transitions. Depending on the mitigation mode and the system state
> +the invocation can be enforced or conditional.
The conditional bit isn't true (yet?).
What does "enforced" mean in this context? s/enforced/unconditional ?
Maybe the last sentence can be removed entirely.
--
Josh
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Encrypted Message
2019-02-26 14:19 ` [MODERATED] " Josh Poimboeuf
@ 2019-03-01 20:58 ` Jon Masters
2019-03-01 22:14 ` Jon Masters
0 siblings, 1 reply; 89+ messages in thread
From: Jon Masters @ 2019-03-01 20:58 UTC (permalink / raw)
To: speck
From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()
On 2/26/19 9:19 AM, speck for Josh Poimboeuf wrote:
> On Fri, Feb 22, 2019 at 11:24:22PM +0100, speck for Thomas Gleixner wrote:
>> +MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
>> +L1 miss situations and to hold data which is returned or sent in response
>> +to a memory or I/O operation. Fill buffers can forward data to a load
>> +operation and also write data to the cache. When the fill buffer is
>> +deallocated it can retain the stale data of the preceding operations which
>> +can then be forwarded to a faulting or assisting load operation, which can
>> +be exploited under certain conditions. Fill buffers are shared between
>> +Hyper-Threads so cross thread leakage is possible.
The fill buffers sit opposite the L1D$ and participate in coherency
directly. They supply data directly to the load store units. Here's the
internal summary I wrote (feel free to use any of it that is useful):
"Intel processors utilize fill buffers to perform loads of data when a
miss occurs in the Level 1 data cache. The fill buffer allows the
processor to implement a non-blocking cache, continuing with other
operations while the necessary cache data “line” is loaded from a higher
level cache or from memory. It also allows the result of the fill to be
forwarded directly to the EU (Execution Unit) requiring the load,
without waiting for it to be written into the L1 Data Cache.
A load operation is not decoupled in the same way that a store is, but
it does involve an AGU (Address Generation Unit) operation. If the AGU
generates a fault (#PF, etc.) or an assist (A/D bits) then the classical
Intel design would block the load and later reissue it. In contemporary
designs, it instead allows subsequent speculation operations to
temporarily see a forwarded data value from the fill buffer slot prior
to the load actually taking place. Thus it is possible to read data that
was recently accessed by another thread, if the fill buffer entry is not
reused.
It is this attack that allows cross-thread SMT leakage and breaks HT
without recourse other than to disable it or to implement core
scheduling in the Linux kernel.
Variants of this include loads that cross cache or page boundaries due
to further optimizations in Intel’s implementation. For example, Intel
incorporate logic to guess at address generation prior to determining
whether it crosses such a boundary (covered in US5335333A) and will
forward this to the TLB/load logic prior to resolving the full address.
They will retry the load by re-issuing uops in the case of a cross
cacheline/page boundary but in that case will leak state as well."
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Encrypted Message
2019-03-01 20:58 ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-01 22:14 ` Jon Masters
0 siblings, 0 replies; 89+ messages in thread
From: Jon Masters @ 2019-03-01 22:14 UTC (permalink / raw)
To: speck
From: Jon Masters <jcm@redhat.com>
To: speck for Jon Masters <speck@linutronix.de>
Subject: Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()
On 3/1/19 3:58 PM, speck for Jon Masters wrote:
> On 2/26/19 9:19 AM, speck for Josh Poimboeuf wrote:
>
>> On Fri, Feb 22, 2019 at 11:24:22PM +0100, speck for Thomas Gleixner wrote:
>>> +MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
>>> +L1 miss situations and to hold data which is returned or sent in response
>>> +to a memory or I/O operation. Fill buffers can forward data to a load
>>> +operation and also write data to the cache. When the fill buffer is
>>> +deallocated it can retain the stale data of the preceding operations which
>>> +can then be forwarded to a faulting or assisting load operation, which can
>>> +be exploited under certain conditions. Fill buffers are shared between
>>> +Hyper-Threads so cross thread leakage is possible.
>
> The fill buffers sit opposite the L1D$ and participate in coherency
> directly. They supply data directly to the load store units. Here's the
> internal summary I wrote (feel free to use any of it that is useful):
>
> "Intel processors utilize fill buffers to perform loads of data when a
> miss occurs in the Level 1 data cache. The fill buffer allows the
> processor to implement a non-blocking cache, continuing with other
> operations while the necessary cache data “line” is loaded from a higher
> level cache or from memory. It also allows the result of the fill to be
> forwarded directly to the EU (Execution Unit) requiring the load,
> without waiting for it to be written into the L1 Data Cache.
>
> A load operation is not decoupled in the same way that a store is, but
> it does involve an AGU (Address Generation Unit) operation. If the AGU
> generates a fault (#PF, etc.) or an assist (A/D bits) then the classical
> Intel design would block the load and later reissue it. In contemporary
> designs, it instead allows subsequent speculation operations to
> temporarily see a forwarded data value from the fill buffer slot prior
> to the load actually taking place. Thus it is possible to read data that
> was recently accessed by another thread, if the fill buffer entry is not
> reused.
>
> It is this attack that allows cross-thread SMT leakage and breaks HT
> without recourse other than to disable it or to implement core
> scheduling in the Linux kernel.
>
> Variants of this include loads that cross cache or page boundaries due
> to further optimizations in Intel’s implementation. For example, Intel
> incorporate logic to guess at address generation prior to determining
> whether it crosses such a boundary (covered in US5335333A) and will
> forward this to the TLB/load logic prior to resolving the full address.
> They will retry the load by re-issuing uops in the case of a cross
> cacheline/page boundary but in that case will leak state as well."
Btw, I've various reproducers here that I'm happy to share if useful
with the right folks. Thomas and Linus should already have my IFU one
for later testing of that; I also have e.g. an FBBF one. Currently it just
spews whatever it sees from the other threads, but in the next few days
I'll have it cleaned up to send/receive specific messages - then can
just wrap it with a bow so it can print yes/no vulnerable.
Ping if you have a need for a repro (keybase/email) and I'll go through
our process for sharing as appropriate.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [patch V3 0/9] MDS basics 0
@ 2019-02-21 23:44 Thomas Gleixner
2019-02-21 23:44 ` [patch V3 4/9] MDS basics 4 Thomas Gleixner
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-02-21 23:44 UTC (permalink / raw)
To: speck
Hi!
Thanks for the valuable feedback to everyone!
Changes since V2:
- Added the NMI mitigation and added an explanation. Thanks Andi and
Kees.
- Fixed the VERW asm magic as pointed out by Andrew and added
more explanation as requested by Borislav and Andrew.
- Adopted Peter's static branch suggestions
- Renamed the _HOPE mode to _VMWERV along with an explanation of the
acronym in the changelog. Thanks Mark for the inspiration.
- Updated documentation. The return to user section has changed a
lot. Added some explanation about assumptions and hopefully fixed all
issues mentioned by Borislav, Andrew, Greg....
- Cleaned up the bitmask issues in the speculation MSR defines as
pointed out by Greg.
- Got the Copy & Paste in the sysfs code right this time.
- Dropped the conditional mode stuff for now. Needs more thought on
all ends and I wish we just don't need it at all :)
- Collected a few Reviewed-by tags, but not for the patches which
have significant changes.
The admin documentation is still WIP, so not included.
It's also available through the git repository in the force updated
branch: WIP.mds
Thanks,
tglx
^ permalink raw reply [flat|nested] 89+ messages in thread
* [patch V3 4/9] MDS basics 4
2019-02-21 23:44 [patch V3 0/9] MDS basics 0 Thomas Gleixner
@ 2019-02-21 23:44 ` Thomas Gleixner
2019-02-22 7:45 ` [MODERATED] Encrypted Message Jon Masters
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-02-21 23:44 UTC (permalink / raw)
To: speck
Subject: [patch V3 4/9] x86/speculation/mds: Add mds_clear_cpu_buffer()
From: Thomas Gleixner <tglx@linutronix.de>
The Microarchitectural Data Sampling (MDS) vulnerabilities are mitigated by
clearing the affected CPU buffers. The mechanism for clearing the buffers
uses the unused and obsolete VERW instruction in combination with a
microcode update which triggers a CPU buffer clear when VERW is executed.
Provide an inline function with the assembly magic. The argument of the VERW
instruction must be a memory operand as documented:
"MD_CLEAR enumerates that the memory-operand variant of VERW (for
example, VERW m16) has been extended to also overwrite buffers affected
by MDS. This buffer overwriting functionality is not guaranteed for the register
operand variant of VERW."
Add x86 specific documentation about MDS and the internal workings of the
mitigation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2 --> V3: Add VERW documentation and fix typos/grammar..., dropped 'i(0)'
Add more details to the documentation file
V1 --> V2: Add "cc" clobber and documentation
---
Documentation/index.rst | 1
Documentation/x86/conf.py | 10 +++
Documentation/x86/index.rst | 8 ++
Documentation/x86/mds.rst | 94 +++++++++++++++++++++++++++++++++++
arch/x86/include/asm/nospec-branch.h | 23 ++++++++
5 files changed, 136 insertions(+)
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -101,6 +101,7 @@ implementation.
:maxdepth: 2
sh/index
+ x86/index
Filesystem Documentation
------------------------
--- /dev/null
+++ b/Documentation/x86/conf.py
@@ -0,0 +1,10 @@
+# -*- coding: utf-8; mode: python -*-
+
+project = "X86 architecture specific documentation"
+
+tags.add("subproject")
+
+latex_documents = [
+ ('index', 'x86.tex', project,
+ 'The kernel development community', 'manual'),
+]
--- /dev/null
+++ b/Documentation/x86/index.rst
@@ -0,0 +1,8 @@
+==========================
+x86 architecture specifics
+==========================
+
+.. toctree::
+ :maxdepth: 1
+
+ mds
--- /dev/null
+++ b/Documentation/x86/mds.rst
@@ -0,0 +1,94 @@
+Microarchitectural Data Sampling (MDS) mitigation
+=================================================
+
+Microarchitectural Data Sampling (MDS) is a class of side channel attacks
+on internal buffers in Intel CPUs. The variants are:
+
+ - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
+ - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
+ - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
+
+MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
+dependent load (store-to-load forwarding) as an optimization. The forward
+can also happen to a faulting or assisting load operation for a different
+memory address, which can be exploited under certain conditions. Store
+buffers are partitioned between Hyper-Threads so cross thread forwarding is
+not possible. But if a thread enters or exits a sleep state the store
+buffer is repartitioned which can expose data from one thread to the other.
+
+MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
+L1 miss situations and to hold data which is returned or sent in response
+to a memory or I/O operation. Fill buffers can forward data to a load
+operation and also write data to the cache. When the fill buffer is
+deallocated it can retain the stale data of the preceding operations which
+can then be forwarded to a faulting or assisting load operation, which can
+be exploited under certain conditions. Fill buffers are shared between
+Hyper-Threads so cross thread leakage is possible.
+
+MLPDS leaks Load Port Data. Load ports are used to perform load operations
+from memory or I/O. The received data is then forwarded to the register
+file or a subsequent operation. In some implementations the Load Port can
+contain stale data from a previous operation which can be forwarded to
+faulting or assisting loads under certain conditions, which again can be
+exploited eventually. Load ports are shared between Hyper-Threads so cross
+thread leakage is possible.
+
+Exposure assumptions
+--------------------
+
+It is assumed that attack code resides in user space or in a guest with one
+exception. The rationale behind this assumption is that the code construct
+needed for exploiting MDS requires:
+
+ - to control the load to trigger a fault or assist
+
+ - to have a disclosure gadget which exposes the speculatively accessed
+ data for consumption through a side channel.
+
+ - to control the pointer through which the disclosure gadget exposes the
+ data
+
+The existence of such a construct cannot be excluded with 100% certainty,
+but the complexity involved makes it extremely unlikely.
+
+There is one exception, which is untrusted BPF. The functionality of
+untrusted BPF is limited, but it needs to be thoroughly investigated
+whether it can be used to create such a construct.
+
+
+Mitigation strategy
+-------------------
+
+All variants have the same mitigation strategy at least for the single CPU
+thread case (SMT off): Force the CPU to clear the affected buffers.
+
+This is achieved by using the otherwise unused and obsolete VERW
+instruction in combination with a microcode update. The microcode clears
+the affected CPU buffers when the VERW instruction is executed.
+
+For virtualization there are two ways to achieve CPU buffer clearing:
+either via the modified VERW instruction or via the L1D Flush command.
+The latter is issued when the L1TF mitigation is enabled, so the extra
+VERW can be avoided. If the CPU is not affected by L1TF then VERW needs
+to be issued.
+
+If the VERW instruction with the supplied segment selector argument is
+executed on a CPU without the microcode update there is no side effect
+other than a small number of pointlessly wasted CPU cycles.
+
+This does not protect against cross Hyper-Thread attacks except for MSBDS
+which is only exploitable cross Hyper-thread when one of the Hyper-Threads
+enters a C-state.
+
+The kernel provides a function to invoke the buffer clearing:
+
+ mds_clear_cpu_buffers()
+
+The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
+(idle) transitions. Depending on the mitigation mode and the system state
+the invocation can be enforced or conditional.
+
+According to current knowledge additional mitigations inside the kernel
+itself are not required because the necessary gadgets to expose the leaked
+data cannot be controlled in a way which allows exploitation from malicious
+user space or VM guests.
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -318,6 +318,29 @@ DECLARE_STATIC_KEY_FALSE(switch_to_cond_
DECLARE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
+#include <asm/segment.h>
+
+/**
+ * mds_clear_cpu_buffers - Mitigation for MDS vulnerability
+ *
+ * This uses the otherwise unused and obsolete VERW instruction in
+ * combination with microcode which triggers a CPU buffer flush when the
+ * instruction is executed.
+ */
+static inline void mds_clear_cpu_buffers(void)
+{
+ static const u16 ds = __KERNEL_DS;
+
+ /*
+ * Has to be the memory-operand variant because only that
+ * guarantees the CPU buffer flush functionality according to
+ * documentation. The register-operand variant does not.
+ *
+ * "cc" clobber is required because VERW modifies ZF.
+ */
+ asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
+}
+
#endif /* __ASSEMBLY__ */
/*
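A minimal usage sketch (not part of this patch): the helper is meant to be
invoked on the exit-to-user path behind a static key, and a later patch in
this series adds exactly this hookup:
	static inline void mds_user_clear_cpu_buffers(void)
	{
		if (static_branch_likely(&mds_user_clear_always))
			mds_clear_cpu_buffers();
	}
The static key keeps the overhead down to a patched-out branch when the
mitigation is disabled.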
^ permalink raw reply [flat|nested] 89+ messages in thread
* [patch V2 00/10] MDS basics+ 0
@ 2019-02-20 15:07 Thomas Gleixner
2019-02-20 15:07 ` [patch V2 04/10] MDS basics+ 4 Thomas Gleixner
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-02-20 15:07 UTC (permalink / raw)
To: speck
Hi!
This is an update to yesterday's series with the following changes:
- Addressed review comments (on/off list)
- Changed the approach with static keys slightly
- Added "cc" clobber to the VERW asm magic (spotted by Peterz)
- Added x86 specific documentation which explains the mitigation methods
and details on why particular code paths are excluded.
- Added an internal 'HOPE' mitigation mode to address the VMWare wish.
- Added the basic infrastructure for conditional mode
Dropped the documentation patch for now as I'm not done with updating it
and I have to run now and attend my grandson's birthday party.
Thanks,
tglx
^ permalink raw reply [flat|nested] 89+ messages in thread
* [patch V2 04/10] MDS basics+ 4
2019-02-20 15:07 [patch V2 00/10] MDS basics+ 0 Thomas Gleixner
@ 2019-02-20 15:07 ` Thomas Gleixner
2019-02-20 17:10 ` [MODERATED] " mark gross
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-02-20 15:07 UTC (permalink / raw)
To: speck
Subject: [patch V2 04/10] x86/speculation/mds: Clear CPU buffers on exit to user
From: Thomas Gleixner <tglx@linutronix.de>
Add a static key which controls the invocation of the CPU buffer clear
mechanism on exit to user space and add the call into
prepare_exit_to_usermode() right before actually returning.
Add documentation which kernel to user space transition this covers and
explain in detail why those which are not mitigated do not need it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
Documentation/x86/mds.rst | 79 +++++++++++++++++++++++++++++++++++
arch/x86/entry/common.c | 9 +++
arch/x86/include/asm/nospec-branch.h | 2
arch/x86/kernel/cpu/bugs.c | 4 +
4 files changed, 93 insertions(+), 1 deletion(-)
--- a/Documentation/x86/mds.rst
+++ b/Documentation/x86/mds.rst
@@ -64,3 +64,82 @@ itself are not required because the nece
data cannot be controlled in a way which allows exploitation from malicious
user space or VM guests.
+Mitigation points
+-----------------
+
+1. Return to user space
+^^^^^^^^^^^^^^^^^^^^^^^
+ When transitioning from kernel to user space the CPU buffers are flushed
+ on affected CPUs:
+
+ - always when the mitigation mode is full. In this case the invocation
+ depends on the static key mds_user_clear_always.
+
+ - depending on executed functions between entering kernel space and
+ returning to user space. This is not yet implemented.
+
+ This covers transitions from kernel to user space through a return to
+ user space from a syscall and from an interrupt or a regular exception.
+
+ There are other kernel to user space transitions which are not covered
+ by this: NMIs and all non maskable exceptions which go through the
+ paranoid exit, which means that they are not going to the regular
+ prepare_exit_to_usermode() exit path which handles the CPU buffer
+ clearing.
+
+ The occasional non maskable exceptions which go through paranoid exit
+ are not controllable by user space in any way and most of these
+ exceptions cannot expose any valuable information either.
+
+ Neither can NMIs be reliably controlled by a non-privileged attacker
+ and their exposure to sensitive data is very limited. NMIs originate
+ from:
+
+ - Performance monitoring.
+
+ Performance monitoring is restricted by various mechanisms, i.e. a
+ regular user on a properly secured system can - if at all - only
+ monitor its own user space processes. The performance monitoring
+ NMI surely executes privileged kernel code and accesses kernel
+ internal data structures, which might be exploitable to break the
+ kernel's address space layout randomization, which is a non-issue
+ on affected CPUs as there are simpler ways to achieve that.
+
+ - Watchdog
+
+ The kernel uses - if enabled - a performance monitoring event to
+ trigger NMIs periodically which allow detection of hard lockups in
+ kernel space due to deadlocks or other issues.
+
+ The watchdog period is a multiple of seconds and the code path
+ executed cannot expose any secret information other than kernel
+ address space layout. Due to the low frequency and the limited
+ ability of a potential attacker to align with the watchdog period,
+ the attack surface is close to zero.
+
+ - Legacy oprofile NMI handler
+
+ Similar to performance monitoring, albeit potentially less
+ restricted, but has been widely replaced by the performance
+ monitoring interface perf. State of the art systems will not expose
+ the oprofile interface and even if exposed the potentially
+ exploitable information is accessible by other and simpler means.
+
+ - KGDB
+
+ If the kernel debugger is accessible by an unprivileged attacker,
+ then the NMI handler is the least of the problems.
+
+ - ACPI/GHES
+
+ A firmware based error reporting mechanism which uses NMIs for
+ notification. Similar to Machine Check Exceptions there is no known
+ way for an attacker to reliably control and trigger errors which
+ would cause NMIs. Even if that would be the case the potentially
+ exploitable data, e.g. kernel address space layout, would be
+ accessible by simpler means.
+
+ - IPMI, vendor specific NMIs, forced shutdown NMI
+
+ None of those are controllable by unprivileged attackers to form a
+ reliable exploit surface.
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -31,6 +31,7 @@
#include <asm/vdso.h>
#include <linux/uaccess.h>
#include <asm/cpufeature.h>
+#include <asm/nospec-branch.h>
#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>
@@ -180,6 +181,12 @@ static void exit_to_usermode_loop(struct
}
}
+static inline void mds_user_clear_cpu_buffers(void)
+{
+ if (static_branch_likely(&mds_user_clear_always))
+ mds_clear_cpu_buffers();
+}
+
/* Called with IRQs disabled. */
__visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
{
@@ -212,6 +219,8 @@ static void exit_to_usermode_loop(struct
#endif
user_enter_irqoff();
+
+ mds_user_clear_cpu_buffers();
}
#define SYSCALL_EXIT_WORK_FLAGS \
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -318,6 +318,8 @@ DECLARE_STATIC_KEY_FALSE(switch_to_cond_
DECLARE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
+DECLARE_STATIC_KEY_FALSE(mds_user_clear_always);
+
#include <asm/segment.h>
/**
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -63,10 +63,12 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_i
/* Control unconditional IBPB in switch_mm() */
DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
+/* Control MDS CPU buffer clear before returning to user space */
+DEFINE_STATIC_KEY_FALSE(mds_user_clear_always);
+
void __init check_bugs(void)
{
identify_boot_cpu();
-
/*
* identify_boot_cpu() initialized SMT support information, let the
* core code know.
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [patch V2 04/10] MDS basics+ 4
2019-02-20 15:07 ` [patch V2 04/10] MDS basics+ 4 Thomas Gleixner
@ 2019-02-20 17:10 ` mark gross
2019-02-21 19:26 ` [MODERATED] Encrypted Message Tim Chen
0 siblings, 1 reply; 89+ messages in thread
From: mark gross @ 2019-02-20 17:10 UTC (permalink / raw)
To: speck
On Wed, Feb 20, 2019 at 04:07:57PM +0100, speck for Thomas Gleixner wrote:
> Subject: [patch V2 04/10] x86/speculation/mds: Clear CPU buffers on exit to user
> From: Thomas Gleixner <tglx@linutronix.de>
>
> Add a static key which controls the invocation of the CPU buffer clear
> mechanism on exit to user space and add the call into
> prepare_exit_to_usermode() right before actually returning.
>
> Add documentation which kernel to user space transition this covers and
> explain in detail why those which are not mitigated do not need it.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> Documentation/x86/mds.rst | 79 +++++++++++++++++++++++++++++++++++
> arch/x86/entry/common.c | 9 +++
> arch/x86/include/asm/nospec-branch.h | 2
> arch/x86/kernel/cpu/bugs.c | 4 +
> 4 files changed, 93 insertions(+), 1 deletion(-)
>
> --- a/Documentation/x86/mds.rst
> +++ b/Documentation/x86/mds.rst
> @@ -64,3 +64,82 @@ itself are not required because the nece
> data cannot be controlled in a way which allows exploitation from malicious
> user space or VM guests.
>
> +Mitigation points
> +-----------------
> +
> +1. Return to user space
> +^^^^^^^^^^^^^^^^^^^^^^^
> + When transitioning from kernel to user space the CPU buffers are flushed
> + on affected CPUs:
> +
> + - always when the mitigation mode is full. In this case the invocation
> + depends on the static key mds_user_clear_always.
> +
> + - depending on executed functions between entering kernel space and
> + returning to user space. This is not yet implemented.
> +
> + This covers transitions from kernel to user space through a return to
> + user space from a syscall and from an interrupt or a regular exception.
> +
> + There are other kernel to user space transitions which are not covered
> + by this: NMIs and all non maskable exceptions which go through the
> + paranoid exit, which means that they are not going to the regular
> + prepare_exit_to_usermode() exit path which handles the CPU buffer
> + clearing.
> +
> + The occasional non maskable exceptions which go through paranoid exit
> + are not controllable by user space in any way and most of these
> + exceptions cannot expose any valuable information either.
> +
> + Neither can NMIs be reliably controlled by a non-privileged attacker
> + and their exposure to sensitive data is very limited. NMIs originate
> + from:
> +
> + - Performance monitoring.
> +
> + Performance monitoring is restricted by various mechanisms, i.e. a
> + regular user on a properly secured system can - if at all - only
> + monitor its own user space processes. The performance monitoring
> + NMI surely executes privileged kernel code and accesses kernel
> + internal data structures, which might be exploitable to break the
> + kernel's address space layout randomization, which is a non-issue
> + on affected CPUs as there are simpler ways to achieve that.
> +
> + - Watchdog
> +
> + The kernel uses - if enabled - a performance monitoring event to
> + trigger NMIs periodically which allow detection of hard lockups in
> + kernel space due to deadlocks or other issues.
> +
> + The watchdog period is a multiple of seconds and the code path
> + executed cannot expose any secret information other than kernel
> + address space layout. Due to the low frequency and the limited
> + ability of a potential attacker to align with the watchdog period,
> + the attack surface is close to zero.
> +
> + - Legacy oprofile NMI handler
> +
> + Similar to performance monitoring, albeit potentially less
> + restricted, but has been widely replaced by the performance
> + monitoring interface perf. State of the art systems will not expose
> + the oprofile interface and even if exposed the potentially
> + exploitable information is accessible by other and simpler means.
> +
> + - KGDB
> +
> + If the kernel debugger is accessible by an unprivileged attacker,
> + then the NMI handler is the least of the problems.
> +
> + - ACPI/GHES
> +
> + A firmware based error reporting mechanism which uses NMIs for
> + notification. Similar to Machine Check Exceptions there is no known
> + way for an attacker to reliably control and trigger errors which
> + would cause NMIs. Even if that would be the case the potentially
> + exploitable data, e.g. kernel address space layout, would be
> + accessible by simpler means.
> +
> + - IPMI, vendor specific NMIs, forced shutdown NMI
> +
> + None of those are controllable by unprivileged attackers to form a
> + reliable exploit surface.
I agree we need some balance between paranoia and reality.
However, if I'm being pedantic, the attacker-lacks-controllability aspect
of your argument can apply to most aspects of the MDS vulnerability. I think
that's why its name uses "data sampling". Also, I need to ask the chip heads
whether this list of NMIs is complete and can be expected to stay that way
across processor and platform generations.
--mark
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -31,6 +31,7 @@
> #include <asm/vdso.h>
> #include <linux/uaccess.h>
> #include <asm/cpufeature.h>
> +#include <asm/nospec-branch.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/syscalls.h>
> @@ -180,6 +181,12 @@ static void exit_to_usermode_loop(struct
> }
> }
>
> +static inline void mds_user_clear_cpu_buffers(void)
> +{
> + if (static_branch_likely(&mds_user_clear_always))
> + mds_clear_cpu_buffers();
> +}
> +
> /* Called with IRQs disabled. */
> __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
> {
> @@ -212,6 +219,8 @@ static void exit_to_usermode_loop(struct
> #endif
>
> user_enter_irqoff();
> +
> + mds_user_clear_cpu_buffers();
> }
>
> #define SYSCALL_EXIT_WORK_FLAGS \
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -318,6 +318,8 @@ DECLARE_STATIC_KEY_FALSE(switch_to_cond_
> DECLARE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
> DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
>
> +DECLARE_STATIC_KEY_FALSE(mds_user_clear_always);
> +
> #include <asm/segment.h>
>
> /**
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -63,10 +63,12 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_i
> /* Control unconditional IBPB in switch_mm() */
> DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
>
> +/* Control MDS CPU buffer clear before returning to user space */
> +DEFINE_STATIC_KEY_FALSE(mds_user_clear_always);
> +
> void __init check_bugs(void)
> {
> identify_boot_cpu();
> -
> /*
> * identify_boot_cpu() initialized SMT support information, let the
> * core code know.
>
^ permalink raw reply [flat|nested] 89+ messages in thread
* [patch 0/8] MDS basics 0
@ 2019-02-19 12:44 Thomas Gleixner
2019-02-21 16:14 ` [MODERATED] Encrypted Message Jon Masters
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2019-02-19 12:44 UTC (permalink / raw)
To: speck
Subject: [patch 0/8] MDS basics
From: Thomas Gleixner <tglx@linutronix.de>
Hi!
I got the following information yesterday night:
"All - FYI. There has been some chatter/ discussion on the subject.
Hopefully this note will help clarify. We received a report from a
researcher who independently identified what we formerly referred to as
PSF (aka Microarchitectural Store Buffer Data Sampling). There were
some initial indications (this week) this researcher would elect to
release a paper publicly PRIOR to the May 14 embargo was lifted.
We have been working closely with them, and it appears for now that will
NOT be the case. Were that to happen however, we DID begin prepping
materials to disclose PSF ONLY. I.e. we would disclose only that
particular issue after having consulted with this team. This includes a
modified/ reduced section of the existing whitepaper, press statement
and standard security advisory language. We are finalizing this
material and will then hold it in reserve.
As we have done in the past, we would convene a meeting of reps from
this group before activating those assets. I will keep you apprised of
any change in the situation, and can provide those assets for your use/
adaptation once finalized."
This was posted on that keybase.io chat on Friday night and of course not
made available to those who are not part of that. Even people who are
subscribed there missed the message because it scrolled away due to
other chit-chat.
Now we maybe got lucky this time, but I wouldn't hold my breath, as the
probability that other people will figure that out as well is surely way
larger than zero.
If that happens, then it makes exactly ZERO sense to expose only the
MSBDS part as everything else is lumped together with this. But why am
I still trying to make sense of all this?
So while being grumpy about this communication fail, I'm even more
grumpy about the fact that we don't have even the minimal full/off
mitigation in place in a workable form. I asked specifically for this
weeks ago just for the case that the embargo breaks early so we don't
stand there with pants down.
So being grumpy as hell made me sit down and write the basic
mitigation implementation myself (again).
It reuses a single patch from that Intel pile which is defining the
bug and MSR bits. Guess what, it took me less than 4 hours to do so
and another 2 hours in the morning to write at least the basic admin
documentation. The latter surely needs some work still, but I wanted
to get the patches out. There is also another TODO mentioned further
down.
The series comes with:
- A consistent command line interface
- A consistent sysfs interface
- Static key based control for the exit to user and idle invocations
- Dynamic update of the idle invocation key according to the actual SMT
state, similar to the STIBP update.
- Idle invocations are inside the halt/mwait inlines and not randomly
sprinkled all over the kernel tree (see the sketch below).
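A rough sketch of that idle hook; the helper and key names are
illustrative, assuming a static branch which is flipped together with the
SMT state:
	static inline void mds_idle_clear_cpu_buffers(void)
	{
		if (static_branch_likely(&mds_idle_clear))
			mds_clear_cpu_buffers();
	}
	/* Invoked from the halt/mwait inlines, e.g.: */
	static inline void native_safe_halt(void)
	{
		mds_idle_clear_cpu_buffers();
		asm volatile("sti; hlt": : :"memory");
	}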
It builds and boots and while I was able to emit the VERW instruction by
hacking the mitigation selection to omit the MD_CLEAR supported check, I
have no access to real hardware with updated microcode.
This is how it should have looked from the very beginning and the extra
bits and pieces (cond mode) can be built on top of it. Please review and
give it a testride when you have a machine with updated microcode
available.
The lot is also available from the speck git tree in the WIP.mds
branch.
Note that I moved the L1TF document to a separate folder so the hw
vulnerabilities are not showing up at the top level index of the admin
guide as separate items. Should have thought about that back then
already...
TODO:
For CPUs which are not affected by L1TF but are affected by MDS there
needs to be a CPU buffer clearing mitigation at VMENTER. That applies at
least to XEON PHI, SILVERMONT and AIRMONT and probably to some of the
newer models which have RDCL_NO set.
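The shape of the missing piece is roughly the following sketch; the key
name is illustrative and the real condition has to depend on the
mitigation selection:
	/* Sketch: clear CPU buffers right before VMENTER when the
	 * L1D flush mitigation does not already cover the transition.
	 */
	if (static_branch_unlikely(&mds_guest_clear))
		mds_clear_cpu_buffers();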
Thanks,
tglx
8<-----------------------
Documentation/ABI/testing/sysfs-devices-system-cpu | 1
Documentation/admin-guide/hw-vuln/index.rst | 13 +
Documentation/admin-guide/hw-vuln/l1tf.rst | 1
Documentation/admin-guide/hw-vuln/mds.rst | 230 +++++++++++++++++++++
Documentation/admin-guide/index.rst | 6
Documentation/admin-guide/kernel-parameters.txt | 27 ++
arch/x86/entry/common.c | 3
arch/x86/include/asm/cpufeatures.h | 2
arch/x86/include/asm/irqflags.h | 4
arch/x86/include/asm/msr-index.h | 5
arch/x86/include/asm/mwait.h | 7
arch/x86/include/asm/nospec-branch.h | 22 ++
arch/x86/include/asm/processor.h | 6
arch/x86/kernel/cpu/bugs.c | 102 +++++++++
arch/x86/kernel/cpu/common.c | 13 +
drivers/base/cpu.c | 6
include/linux/cpu.h | 2
17 files changed, 443 insertions(+), 7 deletions(-)
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [PATCH v3 0/6] PERFv3
@ 2019-02-07 23:41 Andi Kleen
2019-02-07 23:41 ` [MODERATED] [PATCH v3 2/6] PERFv3 Andi Kleen
0 siblings, 1 reply; 89+ messages in thread
From: Andi Kleen @ 2019-02-07 23:41 UTC (permalink / raw)
To: speck; +Cc: Andi Kleen
Walnut is a functional (not security) issue with TSX. The upcoming
microcode updates on Skylake may corrupt perfmon counter 3
when RTM transactions are used.
There is a new MSR that allows forcing RTM transactions to abort,
which frees up counter 3.
The following patchkit adds support to perf to avoid
using counter 3, or to disable TSX when counter 3 is needed
for perf.
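The core of the mechanism, as patch 2/6 below implements it: set the force
abort bit whenever PMC3 is in use, clear it otherwise, and keep a per-cpu
shadow of the MSR value so redundant (slow) MSR writes are avoided:
	u64 val = MSR_TFA_RTM_FORCE_ABORT * test_bit(3, cpuc->active_mask);
	if (cpuc->tfa_shadow != val) {
		cpuc->tfa_shadow = val;
		wrmsrl(MSR_TSX_FORCE_ABORT, val);
	}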
There are per perf event and global options to set the
default.
This patch sets the default to TSX enabled, but
that could be easily changed.
We can have a discussion on the trade offs of the default
setting. I suspect it's a decision that should be made by Linus,
as it may impact user programs either way.
The trade offs for setting the option default are:
Using 4 (or 8 with HT off) events in perf versus
allowing RTM usage while perf is active.
- Existing programs that use perf groups with 4 counters
may not retrieve perfmon data anymore. Perf usages
that use fewer than four (or seven with HT off) counters
are not impacted. Perf usages that don't use groups
will still work, but will see increased multiplexing.
- TSX programs should not functionally break from
forcing RTM to abort because they always need a valid
fall back path. However they will see significantly
lower performance if they rely on TSX for performance
(all RTM transactions will run and only abort at the end),
potentially slowing them down so much that it is
equivalent to functional breakage.
Patches are against tip/perf/core as of
commit ca3bb3d027f69ac3ab1dafb32bde2f5a3a44439c (tip/perf/core)
Author: Elena Reshetova <elena.reshetova@intel.com>
-Andi
v1: Initial post
v2: Minor updates in code (see individual patches)
Removed optimization to not change MSR for update. This caused missing
MSR updates in some cases.
Redid KVM code to always intercept MSR and pass correct flag
to host perf.
v3: Use Peter's scheduling patch, with some changes and cleanups.
Dropped some obsolete patches.
KVM now always forces the guest state and doesn't rely on the host state.
Andi Kleen (6):
x86/pmu/intel: Export number of counters in caps
x86/pmu/intel: Handle TSX with counter 3 on Skylake
x86/pmu/intel: Add perf event attribute to control RTM
perf stat: Make all existing groups weak
perf stat: Don't count EL for --transaction with three counters
kvm: vmx: Support TSX_FORCE_ABORT in KVM guests
arch/x86/events/core.c | 24 ++++++++
arch/x86/events/intel/core.c | 94 +++++++++++++++++++++++++++++-
arch/x86/events/perf_event.h | 13 ++++-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/asm/msr-index.h | 5 ++
arch/x86/kvm/cpuid.c | 3 +-
arch/x86/kvm/pmu.c | 19 +++---
arch/x86/kvm/pmu.h | 6 +-
arch/x86/kvm/pmu_amd.c | 2 +-
arch/x86/kvm/vmx/pmu_intel.c | 20 ++++++-
tools/perf/builtin-stat.c | 38 ++++++++----
tools/perf/util/pmu.c | 10 ++++
tools/perf/util/pmu.h | 1 +
14 files changed, 211 insertions(+), 26 deletions(-)
--
2.17.2
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [PATCH v3 2/6] PERFv3
2019-02-07 23:41 [MODERATED] [PATCH v3 0/6] PERFv3 Andi Kleen
@ 2019-02-07 23:41 ` Andi Kleen
2019-02-08 0:51 ` [MODERATED] Re: [SUSPECTED SPAM][PATCH " Andrew Cooper
0 siblings, 1 reply; 89+ messages in thread
From: Andi Kleen @ 2019-02-07 23:41 UTC (permalink / raw)
To: speck; +Cc: Andi Kleen
From: Andi Kleen <ak@linux.intel.com>
Subject: x86/pmu/intel: Handle TSX with counter 3 on Skylake
Most of the code is from Peter Zijlstra at this point,
based on earlier code from AK.
On Skylake with recent microcode updates, due to erratum XXX,
perfmon general purpose counter 3 can be corrupted when RTM transactions
are executed.
The microcode provides a new MSR to force disable RTM
(make all RTM transactions abort).
This patch adds the low level code to manage this MSR.
Depending on a global flag (/sys/devices/cpu/enable_all_counters),
events are or are not scheduled on generic counter 3.
When the flag is set and an event uses counter 3, TSX is disabled
while the event is active.
This patch assumes that the kernel is using
RETPOLINE (or IBRS), otherwise speculative execution could
still corrupt counter 3 in very unlikely cases.
The enable_all_counters flag default is set to zero in this
patch. This default could be changed.
The trade offs for setting the option default are:
Using 4 (or 8 with HT off) events in perf versus
allowing RTM usage while perf is active.
- Existing programs that use perf groups with 4 counters
may not retrieve perfmon data anymore. Perf usages
that use fewer than four (or seven with HT off) counters
are not impacted. Perf usages that don't use groups
will still work, but will see increased multiplexing.
- TSX programs should not functionally break from
forcing RTM to abort because they always need a valid
fall back path. However they will see significantly
lower performance if they rely on TSX for performance
(all RTM transactions will run and only abort at the end),
potentially slowing them down so much that it is
equivalent to functional breakage.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
v2:
Use u8 instead of bool
Rename force_rtm_abort_active.
v3:
Use correct patch version that actually compiles.
v4:
Switch to Peter's implementation with some updates by AK.
Now the TFA state is checked for in enable_all,
and the extra mask is handled by get_constraint
Use a temporary constraint instead of modifying the globals.
---
arch/x86/events/core.c | 6 ++-
arch/x86/events/intel/core.c | 64 +++++++++++++++++++++++++++++-
arch/x86/events/perf_event.h | 10 ++++-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 5 +++
5 files changed, 83 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 58e659bfc2d9..f5d1435c6071 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2252,7 +2252,11 @@ static ssize_t num_counter_show(struct device *cdev,
struct device_attribute *attr,
char *buf)
{
- return snprintf(buf, PAGE_SIZE, "%d\n", x86_pmu.num_counters);
+ int num = x86_pmu.num_counters;
+ if (boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT) &&
+ perf_enable_all_counters && num > 0)
+ num--;
+ return snprintf(buf, PAGE_SIZE, "%d\n", num);
}
static DEVICE_ATTR_RO(num_counter);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index daafb893449b..b4162b4b0899 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1999,6 +1999,30 @@ static void intel_pmu_nhm_enable_all(int added)
intel_pmu_enable_all(added);
}
+static void intel_skl_pmu_enable_all(int added)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ u64 val;
+
+ /*
+ * The perf code is not expected to execute RTM instructions
+ * (and also cannot misspeculate into them due to RETPOLINE
+ * use), so PMC3 should be 'stable'; IOW the values we
+ * just potentially programmed into it, should still be there.
+ *
+ * If we programmed PMC3, make sure to set TFA before we make
+ * things go and possibly encounter RTM instructions.
+ * Similarly, if PMC3 got unused, make sure to clear TFA.
+ */
+ val = MSR_TFA_RTM_FORCE_ABORT * test_bit(3, cpuc->active_mask);
+ if (cpuc->tfa_shadow != val) {
+ cpuc->tfa_shadow = val;
+ wrmsrl(MSR_TSX_FORCE_ABORT, val);
+ }
+
+ intel_pmu_enable_all(added);
+}
+
static void enable_counter_freeze(void)
{
update_debugctlmsr(get_debugctlmsr() |
@@ -3345,6 +3369,34 @@ glp_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
return c;
}
+bool perf_enable_all_counters __read_mostly;
+
+/*
+ * On Skylake counter 3 may get corrupted when RTM is used.
+ * Either avoid counter 3, or disable RTM when counter 3 used.
+ */
+
+static struct event_constraint *
+skl_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
+{
+ struct event_constraint *c;
+
+ c = hsw_get_event_constraints(cpuc, idx, event);
+
+ if (!perf_enable_all_counters) {
+ cpuc->counter3_constraint = *c;
+ c = &cpuc->counter3_constraint;
+
+ /*
+ * Without TFA we must not use PMC3.
+ */
+ __clear_bit(3, c->idxmsk);
+ }
+
+ return c;
+}
+
/*
* Broadwell:
*
@@ -4061,8 +4113,11 @@ static struct attribute *intel_pmu_caps_attrs[] = {
NULL
};
+DEVICE_BOOL_ATTR(enable_all_counters, 0644, perf_enable_all_counters);
+
static struct attribute *intel_pmu_attrs[] = {
&dev_attr_freeze_on_smi.attr,
+ NULL, /* May be overridden with enable_all_counters */
NULL,
};
@@ -4543,9 +4598,16 @@ __init int intel_pmu_init(void)
/* all extra regs are per-cpu when HT is on */
x86_pmu.flags |= PMU_FL_HAS_RSP_1;
x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
+ if (boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)) {
+ x86_pmu.enable_all = intel_skl_pmu_enable_all;
+ intel_pmu_attrs[1] = &dev_attr_enable_all_counters.attr.attr;
+ x86_pmu.get_event_constraints = skl_get_event_constraints;
+ /* Could add checking&warning for !RETPOLINE here */
+ } else {
+ x86_pmu.get_event_constraints = hsw_get_event_constraints;
+ }
x86_pmu.hw_config = hsw_hw_config;
- x86_pmu.get_event_constraints = hsw_get_event_constraints;
extra_attr = boot_cpu_has(X86_FEATURE_RTM) ?
hsw_format_attr : nhm_format_attr;
extra_attr = merge_attr(extra_attr, skl_format_attr);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 78d7b7031bfc..2474ebfad961 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -70,7 +70,7 @@ struct event_constraint {
#define PERF_X86_EVENT_EXCL_ACCT 0x0200 /* accounted EXCL event */
#define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto-reload */
#define PERF_X86_EVENT_LARGE_PEBS 0x0800 /* use large PEBS */
-
+#define PERF_X86_EVENT_ABORT_TSX 0x1000 /* force abort TSX */
struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -242,6 +242,12 @@ struct cpu_hw_events {
struct intel_excl_cntrs *excl_cntrs;
int excl_thread_id; /* 0 or 1 */
+ /*
+ * Manage using counter 3 on Skylake with TSX.
+ */
+ int tfa_shadow;
+ struct event_constraint counter3_constraint;
+
/*
* AMD specific bits
*/
@@ -998,6 +1004,8 @@ static inline int is_ht_workaround_enabled(void)
return !!(x86_pmu.flags & PMU_FL_EXCL_ENABLED);
}
+extern bool perf_enable_all_counters;
+
#else /* CONFIG_CPU_SUP_INTEL */
static inline void reserve_ds_buffers(void)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6d6122524711..981ff9479648 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -344,6 +344,7 @@
/* Intel-defined CPU features, CPUID level 0x00000007:0 (EDX), word 18 */
#define X86_FEATURE_AVX512_4VNNIW (18*32+ 2) /* AVX-512 Neural Network Instructions */
#define X86_FEATURE_AVX512_4FMAPS (18*32+ 3) /* AVX-512 Multiply Accumulation Single precision */
+#define X86_FEATURE_TSX_FORCE_ABORT (18*32+13) /* "" TSX_FORCE_ABORT */
#define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
#define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
#define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 8e40c2446fd1..492b18720dba 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -666,6 +666,11 @@
#define MSR_IA32_TSC_DEADLINE 0x000006E0
+#define MSR_TSX_FORCE_ABORT 0x0000010F
+
+#define MSR_TFA_RTM_FORCE_ABORT_BIT 0
+#define MSR_TFA_RTM_FORCE_ABORT BIT_ULL(MSR_TFA_RTM_FORCE_ABORT_BIT)
+
/* P4/Xeon+ specific */
#define MSR_IA32_MCG_EAX 0x00000180
#define MSR_IA32_MCG_EBX 0x00000181
--
2.17.2
^ permalink raw reply related [flat|nested] 89+ messages in thread
* [MODERATED] Re: [SUSPECTED SPAM][PATCH v3 2/6] PERFv3
2019-02-07 23:41 ` [MODERATED] [PATCH v3 2/6] PERFv3 Andi Kleen
@ 2019-02-08 0:51 ` Andrew Cooper
2019-02-08 9:01 ` Peter Zijlstra
0 siblings, 1 reply; 89+ messages in thread
From: Andrew Cooper @ 2019-02-08 0:51 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/plain, Size: 522 bytes --]
On 07/02/2019 23:41, speck for Andi Kleen wrote:
> This patch assumes that the kernel is using
> RETPOLINE (or IBRS), otherwise speculative execution could
> still corrupt counter 3 in very unlikely cases.
What has the kernel configuration got to do with it?
It is my understanding that any execution of an XBEGIN instruction, even
speculatively, even in userspace, will result in PMC3 getting modified.
A CPU either has force abort mode active, or PMC3 can be changed behind
the kernel's back.
~Andrew
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [SUSPECTED SPAM][PATCH v3 2/6] PERFv3
2019-02-08 0:51 ` [MODERATED] Re: [SUSPECTED SPAM][PATCH " Andrew Cooper
@ 2019-02-08 9:01 ` Peter Zijlstra
2019-02-08 9:39 ` Peter Zijlstra
0 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2019-02-08 9:01 UTC (permalink / raw)
To: speck
On Fri, Feb 08, 2019 at 12:51:01AM +0000, speck for Andrew Cooper wrote:
> On 07/02/2019 23:41, speck for Andi Kleen wrote:
> > This patch assumes that the kernel is using
> > RETPOLINE (or IBRS), otherwise speculative execution could
> > still corrupt counter 3 in very unlikely cases.
>
> What has the kernel configuration got to do with it?
>
> It is my understanding that any execution of an XBEGIN instruction, even
> speculatively, even in userspace, will result in PMC3 getting modified.
>
> A CPU either has force abort mode active, or PMC3 can be changed behind
> the kernel's back.
We are executing kernel code; therefore any user RTM will have aborted
and is irrelevant.
So what the kernel does is:
/*
* And as noted; userspace transactions will be aborted by
* having entered the kernel. The kernel does not use RTM
* itself.
*/
/*
* stops all counters; irrespective of ucode using PMC3 or not
*/
GLOBAL_CTRL = 0;
/*
* program PMC3
*/
CTRVAL3 = x;
EVTSEL3 = y;
/*
* Set the TFA bit to make ucode not touch PMC3; since there has
* not been an RTM instruction between GLOBAL_CTRL=0 and here,
* PMC3 will still be {x,y} as we just wrote.
*
* This is what requires RETPOLINE/IBRS; because otherwise
* speculation could see a partial kernel instruction that looks
* like RTM, which would mess things up.
*/
WRMSR(MSR_TFA, 1);
/*
* Let 'er rip.
*/
GLOBAL_CTRL = ~0ULL;
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Re: [SUSPECTED SPAM][PATCH v3 2/6] PERFv3
2019-02-08 9:01 ` Peter Zijlstra
@ 2019-02-08 9:39 ` Peter Zijlstra
2019-02-08 10:53 ` [MODERATED] [RFC][PATCH] performance walnuts Peter Zijlstra
0 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2019-02-08 9:39 UTC (permalink / raw)
To: speck
On Fri, Feb 08, 2019 at 10:01:47AM +0100, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 12:51:01AM +0000, speck for Andrew Cooper wrote:
> > On 07/02/2019 23:41, speck for Andi Kleen wrote:
> > > This patch assumes that the kernel is using
> > > RETPOLINE (or IBRS), otherwise speculative execution could
> > > still corrupt counter 3 in very unlikely cases.
> >
> > What has the kernel configuration got to do with it?
> >
> > It is my understanding that any execution of an XBEGIN instruction, even
> > speculatively, even in userspace, will result in PMC3 getting modified.
> >
> > A CPU either has force abort mode active, or PMC3 can be changed behind
> > the kernel's back.
>
> We are executing kernel code; therefore any user RTM will have aborted
> and is irrelevant.
>
> So what the kernel does is:
>
> /*
> * And as noted; userspace transactions will be aborted by
> * having entered the kernel. The kernel does not use RTM
> * itself.
> */
>
>
> /*
> * stops all counters; irrespective of ucode using PMC3 or not
> */
> GLOBAL_CTRL = 0;
>
> /*
> * program PMC3
> */
> CTRVAL3 = x;
> EVTSEL3 = y;
>
> /*
> * Set the TFA bit to make ucode not touch PMC3; since there has
> * not been an RTM instruction between GLOBAL_CTRL=0 and here,
> * PMC3 will still be {x,y} as we just wrote.
> *
> * This is what requires RETPOLINE/IBRS; because otherwise
> * speculation could see a partial kernel instruction that looks
> * like RTM, which would mess things up.
> */
> WRMSR(MSR_TFA, 1);
>
> /*
> * Let 'er rip.
> */
> GLOBAL_CTRL = ~0ULL;
Ah, I think I found a way to avoid having to rely on this. Let me try.
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [RFC][PATCH] performance walnuts
2019-02-08 9:39 ` Peter Zijlstra
@ 2019-02-08 10:53 ` Peter Zijlstra
2019-02-15 23:45 ` [MODERATED] Encrypted Message Jon Masters
0 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2019-02-08 10:53 UTC (permalink / raw)
To: speck
On Fri, Feb 08, 2019 at 10:39:50AM +0100, Peter Zijlstra wrote:
> Ah, I think I found a way to avoid having to rely on this. Let me try.
Something like so. Can someone with access to a relevant machine test
this?
If it works, I'll write a Changelog and this'll be it.
---
arch/x86/events/intel/core.c | 127 ++++++++++++++++++++++++++++++-------
arch/x86/events/perf_event.h | 6 ++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 6 ++
4 files changed, 116 insertions(+), 24 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index e0232bdb7aff..8352c2647a1b 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1999,6 +1999,39 @@ static void intel_pmu_nhm_enable_all(int added)
intel_pmu_enable_all(added);
}
+static void intel_set_tfa(struct cpu_hw_events *cpuc, bool on)
+{
+ u64 val = MSR_TFA_RTM_FORCE_ABORT * on;
+
+ if (cpuc->tfa_shadow != val) {
+ cpuc->tfa_shadow = val;
+ wrmsrl(MSR_TSX_FORCE_ABORT, val);
+ }
+}
+
+static void intel_skl_commit_scheduling(struct cpu_hw_events *cpuc, int idx, int cntr)
+{
+ /*
+ * We're going to use PMC3, make sure TFA is set before we touch it.
+ */
+ if (cntr == 3)
+ intel_set_tfa(cpuc, true);
+}
+
+static void intel_skl_pmu_enable_all(int added)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ /*
+ * If we find PMC3 is no longer used when we enable the PMU, we can
+ * clear TFA.
+ */
+ if (!test_bit(3, cpuc->active_mask))
+ intel_set_tfa(cpuc, false);
+
+ intel_pmu_enable_all(added);
+}
+
static void enable_counter_freeze(void)
{
update_debugctlmsr(get_debugctlmsr() |
@@ -2768,6 +2801,35 @@ intel_stop_scheduling(struct cpu_hw_events *cpuc)
raw_spin_unlock(&excl_cntrs->lock);
}
+static struct event_constraint *
+dyn_constraint(struct cpu_hw_events *cpuc, struct event_constraint *c, int idx)
+{
+ WARN_ON_ONCE(!cpuc->constraint_list);
+
+ if (!(c->flags & PERF_X86_EVENT_DYNAMIC)) {
+ struct event_constraint *cx;
+
+ /*
+ * grab pre-allocated constraint entry
+ */
+ cx = &cpuc->constraint_list[idx];
+
+ /*
+ * initialize dynamic constraint
+ * with static constraint
+ */
+ *cx = *c;
+
+ /*
+ * mark constraint as dynamic
+ */
+ cx->flags |= PERF_X86_EVENT_DYNAMIC;
+ c = cx;
+ }
+
+ return c;
+}
+
static struct event_constraint *
intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
int idx, struct event_constraint *c)
@@ -2798,27 +2860,7 @@ intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
* only needed when constraint has not yet
* been cloned (marked dynamic)
*/
- if (!(c->flags & PERF_X86_EVENT_DYNAMIC)) {
- struct event_constraint *cx;
-
- /*
- * grab pre-allocated constraint entry
- */
- cx = &cpuc->constraint_list[idx];
-
- /*
- * initialize dynamic constraint
- * with static constraint
- */
- *cx = *c;
-
- /*
- * mark constraint as dynamic, so we
- * can free it later on
- */
- cx->flags |= PERF_X86_EVENT_DYNAMIC;
- c = cx;
- }
+ c = dyn_constraint(cpuc, c, idx);
/*
* From here on, the constraint is dynamic.
@@ -3345,6 +3387,26 @@ glp_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
return c;
}
+static bool allow_tsx_force_abort = true;
+
+static struct event_constraint *
+skl_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
+{
+ struct event_constraint *c = hsw_get_event_constraints(cpuc, idx, event);
+
+ /*
+ * Without TFA we must not use PMC3.
+ */
+ if (!allow_tsx_force_abort && test_bit(3, c->idxmsk)) {
+ c = dyn_constraint(cpuc, c, idx);
+ c->idxmsk64 &= ~(1ULL << 3);
+ c->weight = hweight64(c->idxmsk64);
+ }
+
+ return c;
+}
+
/*
* Broadwell:
*
@@ -3440,13 +3502,15 @@ static int intel_pmu_cpu_prepare(int cpu)
goto err;
}
- if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
+ if (x86_pmu.flags & (PMU_FL_EXCL_CNTRS | PMU_FL_WALNUT)) {
size_t sz = X86_PMC_IDX_MAX * sizeof(struct event_constraint);
cpuc->constraint_list = kzalloc(sz, GFP_KERNEL);
if (!cpuc->constraint_list)
goto err_shared_regs;
+ }
+ if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
cpuc->excl_cntrs = allocate_excl_cntrs(cpu);
if (!cpuc->excl_cntrs)
goto err_constraint_list;
@@ -3552,9 +3616,10 @@ static void free_excl_cntrs(int cpu)
if (c->core_id == -1 || --c->refcnt == 0)
kfree(c);
cpuc->excl_cntrs = NULL;
- kfree(cpuc->constraint_list);
- cpuc->constraint_list = NULL;
}
+
+ kfree(cpuc->constraint_list);
+ cpuc->constraint_list = NULL;
}
static void intel_pmu_cpu_dying(int cpu)
@@ -4061,9 +4126,12 @@ static struct attribute *intel_pmu_caps_attrs[] = {
NULL
};
+DEVICE_BOOL_ATTR(allow_tsx_force_abort, 0644, allow_tsx_force_abort);
+
static struct attribute *intel_pmu_attrs[] = {
&dev_attr_freeze_on_smi.attr,
NULL,
+ NULL,
};
static __init struct attribute **
@@ -4546,6 +4614,7 @@ __init int intel_pmu_init(void)
x86_pmu.flags |= PMU_FL_HAS_RSP_1;
x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
+
x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
extra_attr = boot_cpu_has(X86_FEATURE_RTM) ?
@@ -4557,6 +4626,16 @@ __init int intel_pmu_init(void)
tsx_attr = hsw_tsx_events_attrs;
intel_pmu_pebs_data_source_skl(
boot_cpu_data.x86_model == INTEL_FAM6_SKYLAKE_X);
+
+ /* If our CPU haz a walnut */
+ if (boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)) {
+ x86_pmu.flags |= PMU_FL_WALNUT;
+ x86_pmu.get_event_constraints = skl_get_event_constraints;
+ x86_pmu.enable_all = intel_skl_pmu_enable_all;
+ x86_pmu.commit_scheduling = intel_skl_commit_scheduling;
+ intel_pmu_attrs[1] = &dev_attr_allow_tsx_force_abort.attr.attr;
+ }
+
pr_cont("Skylake events, ");
name = "skylake";
break;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 78d7b7031bfc..44b3426c618e 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -242,6 +242,11 @@ struct cpu_hw_events {
struct intel_excl_cntrs *excl_cntrs;
int excl_thread_id; /* 0 or 1 */
+ /*
+ * SKL TSX_FORCE_ABORT shadow
+ */
+ int tfa_shadow;
+
/*
* AMD specific bits
*/
@@ -676,6 +681,7 @@ do { \
#define PMU_FL_EXCL_CNTRS 0x4 /* has exclusive counter requirements */
#define PMU_FL_EXCL_ENABLED 0x8 /* exclusive counter active */
#define PMU_FL_PEBS_ALL 0x10 /* all events are valid PEBS events */
+#define PMU_FL_WALNUT 0x20 /* deal with the walnut errata */
#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6d6122524711..981ff9479648 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -344,6 +344,7 @@
/* Intel-defined CPU features, CPUID level 0x00000007:0 (EDX), word 18 */
#define X86_FEATURE_AVX512_4VNNIW (18*32+ 2) /* AVX-512 Neural Network Instructions */
#define X86_FEATURE_AVX512_4FMAPS (18*32+ 3) /* AVX-512 Multiply Accumulation Single precision */
+#define X86_FEATURE_TSX_FORCE_ABORT (18*32+13) /* "" TSX_FORCE_ABORT */
#define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
#define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
#define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 8e40c2446fd1..ca5bc0eacb95 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -666,6 +666,12 @@
#define MSR_IA32_TSC_DEADLINE 0x000006E0
+
+#define MSR_TSX_FORCE_ABORT 0x0000010F
+
+#define MSR_TFA_RTM_FORCE_ABORT_BIT 0
+#define MSR_TFA_RTM_FORCE_ABORT BIT_ULL(MSR_TFA_RTM_FORCE_ABORT_BIT)
+
/* P4/Xeon+ specific */
#define MSR_IA32_MCG_EAX 0x00000180
#define MSR_IA32_MCG_EBX 0x00000181
^ permalink raw reply related [flat|nested] 89+ messages in thread
* [MODERATED] Encrypted Message
2019-02-08 10:53 ` [MODERATED] [RFC][PATCH] performance walnuts Peter Zijlstra
@ 2019-02-15 23:45 ` Jon Masters
0 siblings, 0 replies; 89+ messages in thread
From: Jon Masters @ 2019-02-15 23:45 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 132 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Peter Zijlstra <speck@linutronix.de>
Subject: Re: [RFC][PATCH] performance walnuts
[-- Attachment #2: Type: text/plain, Size: 944 bytes --]
On 2/8/19 5:53 AM, speck for Peter Zijlstra wrote:
> +static void intel_set_tfa(struct cpu_hw_events *cpuc, bool on)
> +{
> + u64 val = MSR_TFA_RTM_FORCE_ABORT * on;
> +
> + if (cpuc->tfa_shadow != val) {
> + cpuc->tfa_shadow = val;
> + wrmsrl(MSR_TSX_FORCE_ABORT, val);
> + }
> +}
Ok let me ask a stupid question.
This MSR is exposed on a given core. What's the impact (if any) on
*other* cores that might be using TSX? For example, suppose I'm running
an application using RTM on one core while another application on
another core begins profiling. What impact does this MSR write have on
other cores? (Architecturally).
I'm assuming the implementation of HLE relies on whatever you're doing
fitting into the local core's cache and you just abort on any snoop,
etc., so it ought to be fairly self-contained, but I want to know.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [PATCH v4 00/28] MDSv4 2
@ 2019-01-12 1:29 Andi Kleen
2019-01-12 1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
2019-01-12 1:29 ` [MODERATED] [PATCH v4 10/28] MDSv4 24 Andi Kleen
0 siblings, 2 replies; 89+ messages in thread
From: Andi Kleen @ 2019-01-12 1:29 UTC (permalink / raw)
To: speck; +Cc: Andi Kleen
Here's a new version of flushing CPU buffers for group 4.
This mainly covers single thread, not SMT (except for the idle case).
I lumped all the issues together under the Microarchitectural Data
Sampling (MDS) name because they need the same mitigations,
and it doesn't seem worth duplicating the sysfs files and bug entries.
This version drops support for software sequences, and also
does VERW unconditionally unless disabled.
This version implements Linus' suggestion to only clear the CPU
buffer when needed. The patch kit is now a lot more complicated:
different subsystems determine if they might touch other users'
or sensitive data and schedule a cpu clear on the next kernel exit.
Generally process context doesn't clear (unless it is cryptographic
or does context switches), and interrupt context schedules a clear.
There are some exceptions to these rules.
For details on the security model see the Documentation/clearcpu.txt
file. In my tests the number of clears is much lower now.
For most benchmarks we tried the difference is in the noise
level now. ebizzy and loopback apache both show about 1.7%
degradation.
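A sketch of the scheme, using the helpers this series introduces
(lazy_clear_cpu(), the TIF_CLEAR_CPU flag and clear_cpu(); the exit side
wrapper name is illustrative):
	/* Code which might have touched another user's or sensitive
	 * data schedules a clear for the next kernel exit.
	 */
	static inline void lazy_clear_cpu(void)
	{
		set_thread_flag(TIF_CLEAR_CPU);
	}
	/* On the exit to user space path: */
	static inline void clear_cpu_if_scheduled(void)
	{
		if (test_and_clear_thread_flag(TIF_CLEAR_CPU))
			clear_cpu();	/* VERW based buffer clear */
	}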
It makes various assumptions on how kernel code behaves.
I did some auditing, but wasn't able to do it for everything.
Please double check the assumptions laid out in the document.
Likely a lot more interrupt and timer handlers (and tasklets
and irq poll handlers) could be whitelisted to not need a clear, but I only
did a fairly minimal set for now that I could test.
For some of the whitelisted code, especially the networking and
block softirqs, as well as the EBPF mitigation, some additional auditing that
no rules are violated would be useful.
Some notes:
- Against 5.0-rc1
Changes against previous versions:
- Remove software sequences
- Make VERW unconditional
- Improved documentation
- Some other minor changes
Changes against previous versions:
- By default now flushes only when needed
- Define security model
- New administrator document
- Added mds=verw and mds=full
- Renamed mds_disable to mds=off
- KVM virtualization much improved
- Too many others to list. Most things different now.
Andi Kleen (28):
x86/speculation/mds: Add basic bug infrastructure for MDS
x86/speculation/mds: Add mds=off
x86/speculation/mds: Support clearing CPU data on kernel exit
x86/speculation/mds: Support mds=full
x86/speculation/mds: Clear CPU buffers on entering idle
x86/speculation/mds: Add sysfs reporting
x86/speculation/mds: Support mds=full for NMIs
x86/speculation/mds: Support mds=full for 32bit NMI
x86/speculation/mds: Export MD_CLEAR CPUID to KVM guests.
mds: Add documentation for clear cpu usage
mds: Add preliminary administrator documentation
x86/speculation/mds: Introduce lazy_clear_cpu
x86/speculation/mds: Schedule cpu clear on context switch
x86/speculation/mds: Add tracing for clear_cpu
mds: Force clear cpu on kernel preemption
mds: Schedule cpu clear for memzero_explicit and kzfree
mds: Mark interrupts clear cpu, unless opted-out
mds: Clear cpu on all timers, unless the timer opts-out
mds: Clear CPU on tasklets, unless opted-out
mds: Clear CPU on irq poll, unless opted-out
mds: Clear cpu for string io/memcpy_*io in interrupts
mds: Schedule clear cpu in swiotlb
mds: Instrument skb functions to clear cpu automatically
mds: Opt out tcp tasklet to not touch user data
mds: mark kernel/* timers safe as not touching user data
mds: Mark AHCI interrupt as not needing cpu clear
mds: Mark ACPI interrupt as not needing cpu clear
mds: Mitigate BPF
.../ABI/testing/sysfs-devices-system-cpu | 1 +
.../admin-guide/kernel-parameters.txt | 8 +
Documentation/admin-guide/mds.rst | 108 +++++++++++
Documentation/clearcpu.txt | 173 ++++++++++++++++++
arch/Kconfig | 3 +
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 13 +-
arch/x86/entry/entry_32.S | 6 +
arch/x86/entry/entry_64.S | 12 ++
arch/x86/include/asm/clearbpf.h | 29 +++
arch/x86/include/asm/clearcpu.h | 92 ++++++++++
arch/x86/include/asm/cpufeatures.h | 3 +
arch/x86/include/asm/io.h | 3 +
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/thread_info.h | 2 +
arch/x86/include/asm/trace/clearcpu.h | 27 +++
arch/x86/kernel/acpi/cstate.c | 2 +
arch/x86/kernel/cpu/bugs.c | 46 +++++
arch/x86/kernel/cpu/common.c | 14 ++
arch/x86/kernel/kvm.c | 3 +
arch/x86/kernel/process.c | 5 +
arch/x86/kernel/process.h | 27 +++
arch/x86/kernel/smpboot.c | 3 +
arch/x86/kvm/cpuid.c | 3 +-
drivers/acpi/acpi_pad.c | 2 +
drivers/acpi/osl.c | 3 +-
drivers/acpi/processor_idle.c | 3 +
drivers/ata/ahci.c | 2 +-
drivers/ata/ahci.h | 2 +
drivers/ata/libahci.c | 40 ++--
drivers/base/cpu.c | 8 +
drivers/idle/intel_idle.c | 5 +
include/asm-generic/io.h | 3 +
include/linux/clearcpu.h | 36 ++++
include/linux/filter.h | 21 ++-
include/linux/hrtimer.h | 4 +
include/linux/interrupt.h | 18 +-
include/linux/irq_poll.h | 2 +
include/linux/skbuff.h | 2 +
include/linux/timer.h | 9 +-
kernel/bpf/core.c | 2 +
kernel/dma/swiotlb.c | 2 +
kernel/events/core.c | 6 +-
kernel/fork.c | 3 +-
kernel/futex.c | 6 +-
kernel/irq/handle.c | 8 +
kernel/irq/manage.c | 1 +
kernel/sched/core.c | 14 +-
kernel/sched/deadline.c | 6 +-
kernel/sched/fair.c | 7 +-
kernel/sched/idle.c | 3 +-
kernel/sched/rt.c | 3 +-
kernel/softirq.c | 25 ++-
kernel/time/alarmtimer.c | 2 +-
kernel/time/hrtimer.c | 11 +-
kernel/time/posix-timers.c | 6 +-
kernel/time/sched_clock.c | 3 +-
kernel/time/tick-sched.c | 6 +-
kernel/time/timer.c | 8 +
kernel/watchdog.c | 3 +-
lib/irq_poll.c | 18 +-
lib/string.c | 6 +
mm/slab_common.c | 5 +-
net/core/skbuff.c | 26 +++
net/ipv4/tcp_output.c | 5 +-
65 files changed, 869 insertions(+), 61 deletions(-)
create mode 100644 Documentation/admin-guide/mds.rst
create mode 100644 Documentation/clearcpu.txt
create mode 100644 arch/x86/include/asm/clearbpf.h
create mode 100644 arch/x86/include/asm/clearcpu.h
create mode 100644 arch/x86/include/asm/trace/clearcpu.h
create mode 100644 include/linux/clearcpu.h
--
2.17.2
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [PATCH v4 05/28] MDSv4 10
2019-01-12 1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
@ 2019-01-12 1:29 ` Andi Kleen
2019-01-14 19:20 ` [MODERATED] " Dave Hansen
2019-01-14 23:39 ` Tim Chen
2019-01-12 1:29 ` [MODERATED] [PATCH v4 10/28] MDSv4 24 Andi Kleen
1 sibling, 2 replies; 89+ messages in thread
From: Andi Kleen @ 2019-01-12 1:29 UTC (permalink / raw)
To: speck; +Cc: Andi Kleen
When entering idle the internal state of the current CPU might
become visible to the thread sibling because the CPU "frees" some
internal resources.
To ensure there is no MDS leakage, always clear the CPU state
before doing any idling. We only do this if SMT is enabled,
as otherwise no cross-thread leakage is possible.
This is not needed for idle poll, because a polling CPU does not
release its shared resources.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
arch/x86/include/asm/clearcpu.h | 19 +++++++++++++++++++
arch/x86/kernel/acpi/cstate.c | 2 ++
arch/x86/kernel/kvm.c | 3 +++
arch/x86/kernel/process.c | 5 +++++
arch/x86/kernel/smpboot.c | 3 +++
drivers/acpi/acpi_pad.c | 2 ++
drivers/acpi/processor_idle.c | 3 +++
drivers/idle/intel_idle.c | 5 +++++
kernel/sched/fair.c | 1 +
9 files changed, 43 insertions(+)
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 3b8ee76b9c07..b83ef1a5268f 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -20,6 +20,25 @@ static inline void clear_cpu(void)
[kernelds] "m" (kernel_ds));
}
+/*
+ * Clear CPU buffers before going idle, so that no state is leaked to SMT
+ * siblings taking over thread resources.
+ * Out of line to avoid include hell.
+ *
+ * Assumes that interrupts are disabled and only get reenabled
+ * before idle, otherwise the data from a racing interrupt might not
+ * get cleared. There are some callers who violate this,
+ * but they are only used in unattackable cases.
+ */
+
+static inline void clear_cpu_idle(void)
+{
+ if (sched_smt_active()) {
+ clear_thread_flag(TIF_CLEAR_CPU);
+ clear_cpu();
+ }
+}
+
DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
#endif
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index 158ad1483c43..48adea5afacf 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -14,6 +14,7 @@
#include <acpi/processor.h>
#include <asm/mwait.h>
#include <asm/special_insns.h>
+#include <asm/clearcpu.h>
/*
* Initialize bm_flags based on the CPU cache properties
@@ -157,6 +158,7 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
unsigned int cpu = smp_processor_id();
struct cstate_entry *percpu_entry;
+ clear_cpu_idle();
percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
mwait_idle_with_hints(percpu_entry->states[cx->index].eax,
percpu_entry->states[cx->index].ecx);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index ba4bfb7f6a36..c9206ad40a5b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -159,6 +159,7 @@ void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
/*
* We cannot reschedule. So halt.
*/
+ clear_cpu_idle();
native_safe_halt();
local_irq_disable();
}
@@ -785,6 +786,8 @@ static void kvm_wait(u8 *ptr, u8 val)
if (READ_ONCE(*ptr) != val)
goto out;
+ clear_cpu_idle();
+
/*
* halt until it's our turn and kicked. Note that we do safe halt
* for irq enabled case to avoid hang when lock info is overwritten
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 90ae0ca51083..9d9f2d2b209d 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -42,6 +42,7 @@
#include <asm/prctl.h>
#include <asm/spec-ctrl.h>
#include <asm/proto.h>
+#include <asm/clearcpu.h>
#include "process.h"
@@ -589,6 +590,8 @@ void stop_this_cpu(void *dummy)
disable_local_APIC();
mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
+ clear_cpu_idle();
+
/*
* Use wbinvd on processors that support SME. This provides support
* for performing a successful kexec when going from SME inactive
@@ -675,6 +678,8 @@ static __cpuidle void mwait_idle(void)
mb(); /* quirk */
}
+ clear_cpu_idle();
+
__monitor((void *)&current_thread_info()->flags, 0, 0);
if (!need_resched())
__sti_mwait(0, 0);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ccd1f2a8e557..c7fff6b09253 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -81,6 +81,7 @@
#include <asm/cpu_device_id.h>
#include <asm/spec-ctrl.h>
#include <asm/hw_irq.h>
+#include <asm/clearcpu.h>
/* representing HT siblings of each logical CPU */
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
@@ -1635,6 +1636,7 @@ static inline void mwait_play_dead(void)
wbinvd();
while (1) {
+ clear_cpu_idle();
/*
* The CLFLUSH is a workaround for erratum AAI65 for
* the Xeon 7400 series. It's not clear it is actually
@@ -1662,6 +1664,7 @@ void hlt_play_dead(void)
wbinvd();
while (1) {
+ clear_cpu_idle();
native_halt();
/*
* If NMI wants to wake up CPU0, start CPU0.
diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index a47676a55b84..2dcbc38d0880 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -27,6 +27,7 @@
#include <linux/slab.h>
#include <linux/acpi.h>
#include <asm/mwait.h>
+#include <asm/clearcpu.h>
#include <xen/xen.h>
#define ACPI_PROCESSOR_AGGREGATOR_CLASS "acpi_pad"
@@ -175,6 +176,7 @@ static int power_saving_thread(void *data)
tick_broadcast_enable();
tick_broadcast_enter();
stop_critical_timings();
+ clear_cpu_idle();
mwait_idle_with_hints(power_saving_mwait_eax, 1);
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index b2131c4ea124..0342daa122fe 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -33,6 +33,7 @@
#include <linux/cpuidle.h>
#include <linux/cpu.h>
#include <acpi/processor.h>
+#include <asm/clearcpu.h>
/*
* Include the apic definitions for x86 to have the APIC timer related defines
@@ -120,6 +121,7 @@ static const struct dmi_system_id processor_power_dmi_table[] = {
*/
static void __cpuidle acpi_safe_halt(void)
{
+ clear_cpu_idle();
if (!tif_need_resched()) {
safe_halt();
local_irq_disable();
@@ -681,6 +683,7 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
ACPI_FLUSH_CPU_CACHE();
+ clear_cpu_idle();
while (1) {
if (cx->entry_method == ACPI_CSTATE_HALT)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 8b5d85c91e9d..ddaa7603d53a 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -65,6 +65,7 @@
#include <asm/intel-family.h>
#include <asm/mwait.h>
#include <asm/msr.h>
+#include <asm/clearcpu.h>
#define INTEL_IDLE_VERSION "0.4.1"
@@ -933,6 +934,8 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
}
}
+ clear_cpu_idle();
+
mwait_idle_with_hints(eax, ecx);
if (!static_cpu_has(X86_FEATURE_ARAT) && tick)
@@ -953,6 +956,8 @@ static void intel_idle_s2idle(struct cpuidle_device *dev,
unsigned long ecx = 1; /* break on interrupt flag */
unsigned long eax = flg2MWAIT(drv->states[index].flags);
+ clear_cpu_idle();
+
mwait_idle_with_hints(eax, ecx);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 50aa2aba69bd..b5a1bd4a1a46 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5980,6 +5980,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
#ifdef CONFIG_SCHED_SMT
DEFINE_STATIC_KEY_FALSE(sched_smt_present);
+EXPORT_SYMBOL(sched_smt_present);
static inline void set_idle_cores(int cpu, int val)
{
--
2.17.2
^ permalink raw reply related [flat|nested] 89+ messages in thread
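For readers who don't have the earlier patches of this series at hand:
clear_cpu(), which clear_cpu_idle() wraps, is essentially a VERW with a
valid kernel data segment selector, which on microcode enumerating
MD_CLEAR overwrites the affected CPU buffers. A minimal sketch using the
series' naming (the real version is additionally gated behind feature
checks):

	static const u16 kernel_ds = __KERNEL_DS;

	static inline void clear_cpu(void)
	{
		/*
		 * VERW with a writable segment selector triggers the
		 * microcode assisted CPU buffer clear on MD_CLEAR parts.
		 */
		asm volatile("verw %[kernelds]"
			     : : [kernelds] "m" (kernel_ds) : "cc");
	}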
* [MODERATED] Re: [PATCH v4 05/28] MDSv4 10
2019-01-12 1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
@ 2019-01-14 19:20 ` Dave Hansen
2019-01-18 7:33 ` [MODERATED] Encrypted Message Jon Masters
2019-01-14 23:39 ` Tim Chen
1 sibling, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2019-01-14 19:20 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/plain, Size: 3487 bytes --]
On 1/11/19 5:29 PM, speck for Andi Kleen wrote:
> When entering idle the internal state of the current CPU might
> become visible to the thread sibling because the CPU "frees" some
> internal resources.
Is there some documentation somewhere about what "idle" means here? It
looks like MWAIT and HLT certainly count, but is there anything else?
I'm just trying to figure out how we make sure we catch all of the
call-sites for these. This sprinkles quite a few of them around, and
I'm wondering how you found these, how we know if we missed any, and how
we keep folks from reintroducing new call-sites that would make us
vulnerable again.
I did a quick "objdump | grep mwait" and this patch appears to catch all
the functions that I encountered.
> +/*
> + * Clear CPU buffers before going idle, so that no state is leaked to SMT
> + * siblings taking over thread resources.
> + * Out of line to avoid include hell.
> + *
> + * Assumes that interrupts are disabled and only get reenabled
> + * before idle, otherwise the data from a racing interrupt might not
> + * get cleared. There are some callers who violate this,
> + * but they are only used in unattackable cases.> + */
Can we please document the unattackable cases, along with the reasons
they are unattackable? This property also keeps us from being able to
annotate this site with lockdep checks for interrupts being off, which
is a bit unfortunate.
> +static inline void clear_cpu_idle(void)
> +{
> + if (sched_smt_active()) {
> + clear_thread_flag(TIF_CLEAR_CPU);
> + clear_cpu();
> + }
> +}
...
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index b2131c4ea124..0342daa122fe 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -33,6 +33,7 @@
> #include <linux/cpuidle.h>
> #include <linux/cpu.h>
> #include <acpi/processor.h>
> +#include <asm/clearcpu.h>
>
> /*
> * Include the apic definitions for x86 to have the APIC timer related defines
> @@ -120,6 +121,7 @@ static const struct dmi_system_id processor_power_dmi_table[] = {
> */
> static void __cpuidle acpi_safe_halt(void)
> {
> + clear_cpu_idle();
> if (!tif_need_resched()) {
> safe_halt();
> local_irq_disable();
Why is this one outside the if()? Seems like it could be safely inside
next to safe_halt().
> @@ -681,6 +683,7 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
>
> ACPI_FLUSH_CPU_CACHE();
>
> + clear_cpu_idle();
> while (1) {
>
> if (cx->entry_method == ACPI_CSTATE_HALT)
At the risk of bike-shedding... Why don't we just catch all these
*play_dead() sites inside play_dead() itself, or at arch_cpu_idle_dead()?
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 8b5d85c91e9d..ddaa7603d53a 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -65,6 +65,7 @@
> #include <asm/intel-family.h>
> #include <asm/mwait.h>
> #include <asm/msr.h>
> +#include <asm/clearcpu.h>
>
> #define INTEL_IDLE_VERSION "0.4.1"
>
> @@ -933,6 +934,8 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
> }
> }
>
> + clear_cpu_idle();
> +
> mwait_idle_with_hints(eax, ecx);
And my own bikeshed: It seems like this would be a much smaller patch,
and be less likely to let future code add vulnerabilities, if we just
patched mwait_idle_with_hints().
^ permalink raw reply [flat|nested] 89+ messages in thread
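To make that suggestion concrete: hoisting the clear into the common
helper would look roughly like the sketch below (hypothetical; the real
mwait_idle_with_hints() in arch/x86/include/asm/mwait.h has additional
quirk handling, and whether the clear belongs there is exactly the
question raised above):

	static inline void mwait_idle_with_hints(unsigned long eax,
						 unsigned long ecx)
	{
		/* One central place instead of per-driver call sites */
		clear_cpu_idle();

		__monitor((void *)&current_thread_info()->flags, 0, 0);
		if (!need_resched())
			__mwait(eax, ecx);
	}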
* [MODERATED] Encrypted Message
2019-01-14 19:20 ` [MODERATED] " Dave Hansen
@ 2019-01-18 7:33 ` Jon Masters
0 siblings, 0 replies; 89+ messages in thread
From: Jon Masters @ 2019-01-18 7:33 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 122 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Dave Hansen <speck@linutronix.de>
Subject: Re: [PATCH v4 05/28] MDSv4 10
[-- Attachment #2: Type: text/plain, Size: 1328 bytes --]
On 1/14/19 2:20 PM, speck for Dave Hansen wrote:
> On 1/11/19 5:29 PM, speck for Andi Kleen wrote:
>> When entering idle the internal state of the current CPU might
>> become visible to the thread sibling because the CPU "frees" some
>> internal resources.
>
> Is there some documentation somewhere about what "idle" means here? It
> looks like MWAIT and HLT certainly count, but is there anything else?
We know that power state transitions can additionally cause the peer to
dynamically sleep or wake up. MWAIT was the main example I got out of
Intel for how you'd explicitly cause a thread to be deallocated.
When Andi is talking about "frees" above he means (for example) the
dynamic allocation/deallocation of store buffer entries as threads come
and go - e.g. in Skylake there are 56 entries in a distributed store
buffer that splits into 2x28. I am not aware of fill buffer behavior
changing as threads come and go, and this isn't documented AFAICS.
I've been wondering whether we want a bit more detail in the docs. I
spent a /lot/ of time last week going through all of Intel's patents in
this area, which really help understand it. If folks feel we could do
with a bit more meaty summary, I can try to suggest something.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Encrypted Message
2019-01-12 1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
2019-01-14 19:20 ` [MODERATED] " Dave Hansen
@ 2019-01-14 23:39 ` Tim Chen
1 sibling, 0 replies; 89+ messages in thread
From: Tim Chen @ 2019-01-14 23:39 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 130 bytes --]
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Andi Kleen <speck@linutronix.de>
Subject: Re: [PATCH v4 05/28] MDSv4 10
[-- Attachment #2: Type: text/plain, Size: 526 bytes --]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 50aa2aba69bd..b5a1bd4a1a46 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5980,6 +5980,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>
> #ifdef CONFIG_SCHED_SMT
> DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> +EXPORT_SYMBOL(sched_smt_present);
This export is not needed since sched_smt_present is not used in the patch series.
Only sched_smt_active() is used.
Thanks.
Tim
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [PATCH v4 10/28] MDSv4 24
2019-01-12 1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
2019-01-12 1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
@ 2019-01-12 1:29 ` Andi Kleen
2019-01-15 1:05 ` [MODERATED] Encrypted Message Tim Chen
1 sibling, 1 reply; 89+ messages in thread
From: Andi Kleen @ 2019-01-12 1:29 UTC (permalink / raw)
To: speck; +Cc: Andi Kleen
Including the theory and some guidelines for subsystem/driver
maintainers.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
Documentation/clearcpu.txt | 173 +++++++++++++++++++++++++++++++++++++
1 file changed, 173 insertions(+)
create mode 100644 Documentation/clearcpu.txt
diff --git a/Documentation/clearcpu.txt b/Documentation/clearcpu.txt
new file mode 100644
index 000000000000..b204b1e7051c
--- /dev/null
+++ b/Documentation/clearcpu.txt
@@ -0,0 +1,173 @@
+
+Security model for Microarchitectural Data Sampling
+===================================================
+
+Some CPUs can leave read or written data in internal buffers,
+which then later might be sampled through side effects.
+For more details see CVE-2018-12126 CVE-2018-12130 CVE-2018-12127
+
+This can be avoided by explicitely clearing the CPU state.
+
+We trying to avoid leaking data between different processes,
+and also some sensitive data, like cryptographic data,
+or user data from other processes.
+
+We support three modes:
+
+(1) mitigation off (mds=off)
+(2) clear only when needed (default)
+(3) clear on every kernel exit, or guest entry (mds=full)
+
+(1) and (3) are trivial; the rest of the document discusses (2).
+
+Basic requirements and assumptions
+----------------------------------
+
+Kernel addresses and kernel temporary data are not sensitive.
+
+User data is sensitive, but only for other processes.
+
+Kernel data is sensitive when it is cryptographic keys.
+
+Guidance for driver/subsystem developers
+----------------------------------------
+
+When you touch user supplied data of *other* processes in system call
+context add lazy_clear_cpu().
+
+For the cases below we care only about data from other processes.
+Touching non cryptographic data from the current process is always allowed.
+
+Touching only pointers to user data is always allowed.
+
+When your interrupt does not touch user data directly consider marking
+it with IRQF_NO_USER.
+
+When your tasklet does not touch user data directly consider marking
+it with TASKLET_NO_USER using tasklet_init_flags/or
+DECLARE_TASKLET*_NOUSER.
+
+When your timer does not touch user data mark it with TIMER_NO_USER.
+If it is a hrtimer mark it with HRTIMER_MODE_NO_USER.
+
+When your irq poll handler does not touch user data, mark it
+with IRQ_POLL_F_NO_USER through irq_poll_init_flags.
+
+For networking code make sure to only touch user data through
+skb_push/put/copy [add more], unless it is data from the current
+process. If that is not ensured add lazy_clear_cpu or
+lazy_clear_cpu_interrupt. When the non skb data access is only in a
+hardware interrupt controlled by the driver, it can rely on not
+setting IRQF_NO_USER for that interrupt.
+
+Any cryptographic code touching key data should use memzero_explicit
+or kzfree.
+
+If your RCU callback touches user data add lazy_clear_cpu().
+
+These steps are currently only needed for code that runs on MDS affected
+CPUs, which is currently only x86. But might be worth being prepared
+if other architectures become affected too.
+
+Implementation details/assumptions
+----------------------------------
+
+If a system call touches data it is for its own process, so does not
+need to be cleared, because it has already access to it.
+
+When context switching we clear data, unless the context switch
+is inside a process, or from/to idle. We also clear after any
+context switches from kernel threads.
+
+Idle does not have sensitive data, except for in interrupts, which
+are handled separately.
+
+Cryptographic keys inside the kernel should be protected.
+We assume they use kzfree() or memzero_explicit() to clear
+state, so these functions trigger a cpu clear.
+
+Hard interrupts, tasklets, timers which can run asynchronous are
+assumed to touch random user data, unless they have been audited, and
+marked with NO_USER flags.
+
+Most interrupt handlers for modern devices should not touch
+user data because they rely on DMA and only manipulate
+pointers. This needs auditing to confirm though.
+
+For softirqs we assume that if they touch user data they use
+lazy_clear_cpu()/lazy_clear_interrupt() as needed.
+Networking is handled through skb_* below.
+Timer and Tasklets and IRQ poll are handled through opt-in.
+
+Scheduler softirq is assumed to not touch user data.
+
+Block softirq done callbacks are assumed to not touch user data.
+
+For networking code, any skb functions that are likely
+touching non header packet data schedule a clear cpu at next
+kernel exit. This includes skb_copy and related, skb_put/push,
+checksum functions. We assume that any networking code touching
+packet data uses these functions.
+
+[In principle packet data should be encrypted for the wire anyway,
+but we still try to avoid leaking it]
+
+Some IO related functions which touch data, like string PIO and
+memcpy_from/to_io, or the software PCI DMA bounce code (swiotlb),
+schedule a buffer clear.
+
+We assume NMI/machine check code does not touch other
+processes' data.
+
+Any buffer clearing is done lazily on next kernel exit, so can be
+triggered in fast paths.
+
+Sandboxes
+---------
+
+We don't do anything special for seccomp processes.
+
+If there is a sandbox inside the process the process should take care
+itself of clearing its own sensitive data before running sandbox
+code. This would include data touched by system calls.
+
+BPF
+---
+
+Assume BPF execution does not touch other user's data, so does
+not need to schedule a clear for itself.
+
+BPF could attack the rest of the kernel if it can successfully
+measure side channel side effects.
+
+When the BPF program was loaded unprivileged, always clear the CPU
+to prevent any exploits written in BPF from using side channels to read
+data leaked from other kernel code.
+
+We only do this when running in an interrupt, or if a cpu clear is
+already scheduled (which means, for example, there was a context
+switch or crypto operation before).
+
+In process context we assume the code only accesses data of the
+current user, and we check that the BPF program being run was loaded
+by the same user, so even if data leaked it would not cross privilege
+boundaries.
+
+Technically we would only need to do this if the BPF program
+contains conditional branches and loads dominated by them, but
+let's assume that near all do.
+
+This could be further optimized by allowing callers that do
+a lot of individual BPF runs and are sure they don't touch
+other user's data inbetween to do the clear only once
+at the beginning. We can add such optimizations later based on
+profile data.
+
+Virtualization
+--------------
+
+When entering a guest in KVM we clear to avoid any leakage to a guest.
+Normally this is done implicitely as part of the L1TF mitigation.
+It relies on this being enabled. It also uses the "fast exit"
+optimization that only clears if an interrupt or context switch
+happened.
--
2.17.2
^ permalink raw reply related [flat|nested] 89+ messages in thread
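To make the opt-out model above concrete, a minimal sketch of how a
driver would use the interfaces this series proposes (IRQF_NO_USER and
lazy_clear_cpu() are introduced by the series, not upstream APIs; the
foo_* names are placeholders):

	/*
	 * Interrupt handler that only rings a doorbell and moves
	 * descriptor pointers, never dereferencing user payload:
	 * safe to opt out of the CPU buffer clear.
	 */
	static irqreturn_t foo_irq(int irq, void *dev_id)
	{
		struct foo_dev *fd = dev_id;

		writel(1, fd->mmio + FOO_DOORBELL);
		return IRQ_HANDLED;
	}

	static int foo_setup_irq(struct foo_dev *fd)
	{
		return request_irq(fd->irq, foo_irq, IRQF_NO_USER,
				   "foo", fd);
	}

	/*
	 * A path that does touch another process's data schedules a
	 * lazy clear; the actual buffer clear then happens on the
	 * next kernel exit.
	 */
	static void foo_copy_remote(void *dst, const void *src, size_t len)
	{
		memcpy(dst, src, len);
		lazy_clear_cpu();
	}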
* [MODERATED] Encrypted Message
2019-01-12 1:29 ` [MODERATED] [PATCH v4 10/28] MDSv4 24 Andi Kleen
@ 2019-01-15 1:05 ` Tim Chen
0 siblings, 0 replies; 89+ messages in thread
From: Tim Chen @ 2019-01-15 1:05 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 130 bytes --]
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Andi Kleen <speck@linutronix.de>
Subject: Re: [PATCH v4 10/28] MDSv4 24
[-- Attachment #2: Type: text/plain, Size: 5059 bytes --]
On 1/11/19 5:29 PM, speck for Andi Kleen wrote:
> +Some CPUs can leave read or written data in internal buffers,
> +which then later might be sampled through side effects.
> +For more details see CVE-2018-12126 CVE-2018-12130 CVE-2018-12127
> +
> +This can be avoided by explicitely clearing the CPU state.
s/explicitely/explicitly
> +
> +We trying to avoid leaking data between different processes,
Suggest changing the above phrase to the below:
CPU state clearing prevents leaking data between different processes,
...
> +Basic requirements and assumptions
> +----------------------------------
> +
> +Kernel addresses and kernel temporary data are not sensitive.
> +
> +User data is sensitive, but only for other processes.
> +
> +Kernel data is sensitive when it is cryptographic keys.
s/when it is/when it involves/
> +
> +Guidance for driver/subsystem developers
> +----------------------------------------
> +
> +When you touch user supplied data of *other* processes in system call
> +context add lazy_clear_cpu().
> +
> +For the cases below we care only about data from other processes.
> +Touching non cryptographic data from the current process is always allowed.
> +
> +Touching only pointers to user data is always allowed.
> +
> +When your interrupt does not touch user data directly consider marking
Add a "," between "directly" and "consider"
> +it with IRQF_NO_USER.
> +
> +When your tasklet does not touch user data directly consider marking
Add a "," between "directly" and "consider"
> +it with TASKLET_NO_USER using tasklet_init_flags/or
> +DECLARE_TASKLET*_NOUSER.
> +
> +When your timer does not touch user data mark it with TIMER_NO_USER.
Add a "," between "data" and "mark"
> +If it is a hrtimer mark it with HRTIMER_MODE_NO_USER.
Add a "," between "hrtimer" and "mark"
> +
> +When your irq poll handler does not touch user data, mark it
> +with IRQ_POLL_F_NO_USER through irq_poll_init_flags.
> +
> +For networking code make sure to only touch user data through
Add a "," between "code" and "make"
> +skb_push/put/copy [add more], unless it is data from the current
> +process. If that is not ensured add lazy_clear_cpu or
Add a "," between "ensured" and "add"
> +lazy_clear_cpu_interrupt. When the non skb data access is only in a
> +hardware interrupt controlled by the driver, it can rely on not
> +setting IRQF_NO_USER for that interrupt.
> +
> +Any cryptographic code touching key data should use memzero_explicit
> +or kzfree.
> +
> +If your RCU callback touches user data add lazy_clear_cpu().
> +
> +These steps are currently only needed for code that runs on MDS affected
> +CPUs, which is currently only x86. But might be worth being prepared
> +if other architectures become affected too.
> +
> +Implementation details/assumptions
> +----------------------------------
> +
> +If a system call touches data it is for its own process, so does not
suggest rephrasing to
If a system call touches data of its own process, cpu state does not
> +need to be cleared, because it has already access to it.
> +
> +When context switching we clear data, unless the context switch
> +is inside a process, or from/to idle. We also clear after any
> +context switches from kernel threads.
> +
> +Idle does not have sensitive data, except for in interrupts, which
> +are handled separately.
> +
> +Cryptographic keys inside the kernel should be protected.
> +We assume they use kzfree() or memzero_explicit() to clear
> +state, so these functions trigger a cpu clear.
> +
> +Hard interrupts, tasklets, timers which can run asynchronous are
> +assumed to touch random user data, unless they have been audited, and
> +marked with NO_USER flags.
> +
> +Most interrupt handlers for modern devices should not touch
> +user data because they rely on DMA and only manipulate
> +pointers. This needs auditing to confirm though.
> +
> +For softirqs we assume that if they touch user data they use
Add "," between "data" and "they"
...
> +Technically we would only need to do this if the BPF program
> +contains conditional branches and loads dominated by them, but
> +let's assume that near all do.
s/near/nearly/
> +
> +This could be further optimized by allowing callers that do
> +a lot of individual BPF runs and are sure they don't touch
> +other user's data inbetween to do the clear only once
> +at the beginning.
Suggest breaking the above sentence. It is quite difficult to read.
> We can add such optimizations later based on
> +profile data.
> +
> +Virtualization
> +--------------
> +
> +When entering a guest in KVM we clear to avoid any leakage to a guest.
... we clear CPU state to avoid ....
> +Normally this is done implicitely as part of the L1TF mitigation.
s/implicitely/implicitly/
> +It relies on this being enabled. It also uses the "fast exit"
> +optimization that only clears if an interrupt or context switch
> +happened.
>
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] FYI - Reading uncached memory
@ 2018-06-12 17:29 Jon Masters
2018-06-14 16:59 ` [MODERATED] Encrypted Message Tim Chen
0 siblings, 1 reply; 89+ messages in thread
From: Jon Masters @ 2018-06-12 17:29 UTC (permalink / raw)
To: speck
FYI Graz have been able to prove that Intel processors will allow
speculative reads of /explicitly/ UC memory (e.g. marked in MTRR). I
believe they actually use the QPI SAD table to determine what memory is
speculation safe and what memory has side effects (i.e. if it's HA'able
memory then it's deemed ok to rampantly speculate from it).
Just in case anyone thought UC was safe against attacks.
Jon.
--
Computer Architect | Sent from my Fedora powered laptop
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] [PATCH 0/2] L1TF KVM 0
@ 2018-05-29 19:42 Paolo Bonzini
[not found] ` <20180529194240.7F1336110A@crypto-ml.lab.linutronix.de>
0 siblings, 1 reply; 89+ messages in thread
From: Paolo Bonzini @ 2018-05-29 19:42 UTC (permalink / raw)
To: speck
Here is the first version of the L1 terminal fault KVM mitigation patches,
adding a TLB flush on vmentry.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 89+ messages in thread
* SSB status - V18 pushed out
@ 2018-05-17 20:53 Thomas Gleixner
2018-05-18 13:54 ` [MODERATED] Is: Sleep states ?Was:Re: " Konrad Rzeszutek Wilk
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2018-05-17 20:53 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/plain, Size: 473 bytes --]
Folks,
we finally reached a stable state with the SSB patches. I've updated all 3
branches master/linux-4.16.y/linux-4.14.y in the repo and attached the
resulting git bundles. They merge cleanly on top of the current HEADs of
the relevant trees.
The lot survived light testing on my side and it would be great if everyone
involved could expose it to their test scenarios.
Thanks to everyone who participated in that effort (patches, review,
testing ...)!
Thanks,
tglx
[-- Attachment #2: Type: application/octet-stream, Size: 79102 bytes --]
[-- Attachment #3: Type: application/octet-stream, Size: 75724 bytes --]
[-- Attachment #4: Type: application/octet-stream, Size: 75835 bytes --]
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Is: Sleep states ?Was:Re: SSB status - V18 pushed out
2018-05-17 20:53 SSB status - V18 pushed out Thomas Gleixner
@ 2018-05-18 13:54 ` Konrad Rzeszutek Wilk
2018-05-18 14:29 ` Thomas Gleixner
0 siblings, 1 reply; 89+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-05-18 13:54 UTC (permalink / raw)
To: speck
On Thu, May 17, 2018 at 10:53:28PM +0200, speck for Thomas Gleixner wrote:
> Folks,
>
> we finally reached a stable state with the SSB patches. I've updated all 3
> branches master/linux-4.16.y/linux-4.14.y in the repo and attached the
> resulting git bundles. They merge cleanly on top of the current HEADs of
> the relevant trees.
>
> The lot survived light testing on my side and it would be great if everyone
> involved could expose it to their test scenarios.
>
> Thanks to everyone who participated in that effort (patches, review,
> testing ...)!
Yeey! Thank you.
I was reading the updated Intel doc today (instead of skim reading it) and it mentioned:
"Intel recommends that the SSBD MSR bit be cleared when in a sleep state on such processors."
We don't seem to be doing that?
To do that we would need to:
1) Revert 4b59bdb56945 x86/bugs: Remove x86_spec_ctrl_set()
2) Peppering
if (static_cpu_has(X86_FEATURE_SPEC_STORE_BYPASS_DISABLE))
x86_spec_ctrl_set(~SPEC_CTRL_SSBD);
[when enterring sleep state]
and:
if (static_cpu_has(X86_FEATURE_SPEC_STORE_BYPASS_DISABLE))
x86_spec_ctrl_set(SPEC_CTRL_SSBD);
[when coming out]
in mwait_idle_with_hints, mwait_idle, and native_play_dead
Or alternatively fiddle with the MSR directly.
3) And then uhuh, I am not sure how you would want to deal when the applications
are running. That is when the !static_cpu_has(X86_FEATURE_SPEC_STORE_BYPASS_DISABLE)
but still want the MSR toggled.
>
> Thanks,
>
> tglx
^ permalink raw reply [flat|nested] 89+ messages in thread
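For illustration, the kind of toggling suggested here would look
roughly like the sketch below, using the helper names from the current
series (the Intel doc calls the bit SSBD; the series still names it
SPEC_CTRL_RDS, so SPEC_CTRL_SSBD here is an assumption). Whether this
is needed at all is what the rest of the thread resolves:

	static __cpuidle void mwait_idle(void)
	{
		u64 spec_ctrl = x86_spec_ctrl_get_default();

		/* Drop SSBD so the sibling can run at full speed... */
		if (static_cpu_has(X86_FEATURE_SPEC_STORE_BYPASS_DISABLE))
			wrmsrl(MSR_IA32_SPEC_CTRL, spec_ctrl & ~SPEC_CTRL_SSBD);

		mwait_idle_with_hints(0, 0);

		/* ...and restore it on wakeup. */
		if (static_cpu_has(X86_FEATURE_SPEC_STORE_BYPASS_DISABLE))
			wrmsrl(MSR_IA32_SPEC_CTRL, spec_ctrl);
	}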
* Re: Is: Sleep states ?Was:Re: SSB status - V18 pushed out
2018-05-18 13:54 ` [MODERATED] Is: Sleep states ?Was:Re: " Konrad Rzeszutek Wilk
@ 2018-05-18 14:29 ` Thomas Gleixner
2018-05-18 19:50 ` [MODERATED] Encrypted Message Tim Chen
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2018-05-18 14:29 UTC (permalink / raw)
To: speck
On Fri, 18 May 2018, speck for Konrad Rzeszutek Wilk wrote:
> On Thu, May 17, 2018 at 10:53:28PM +0200, speck for Thomas Gleixner wrote:
> > Folks,
> >
> > we finally reached a stable state with the SSB patches. I've updated all 3
> > branches master/linux-4.16.y/linux-4.14.y in the repo and attached the
> > resulting git bundles. They merge cleanly on top of the current HEADs of
> > the relevant trees.
> >
> > The lot survived light testing on my side and it would be great if everyone
> > involved could expose it to their test scenarios.
> >
> > Thanks to everyone who participated in that effort (patches, review,
> > testing ...)!
>
> Yeey! Thank you.
>
> I was reading the updated Intel doc today (instead of skim reading it) and it mentioned:
>
> "Intel recommends that the SSBD MSR bit be cleared when in a sleep state on such processors."
Well, the same recommendation was for IBRS and the reason is that with HT
enabled the other hyperthread will not be able to go full speed because the
sleeping one vanished with IBRS set. SSBD works the same way.
" SW should clear [SSBD] when enter sleep state, just as is suggested for
IBRS and STIBP on existing implementations"
and that document says:
"Enabling IBRS on one logical processor of a core with Intel
Hyper-Threading Technology may affect branch prediction on other logical
processors of the same core. For this reason, software should disable IBRS
(by clearing IA32_SPEC_CTRL.IBRS) prior to entering a sleep state (e.g.,
by executing HLT or MWAIT) and re-enable IBRS upon wakeup and prior to
executing any indirect branch."
So it's only a performance issue and not a fundamental problem to have it
on when executing HLT/MWAIT
So we have two situations here:
1) ssbd = on, i.e X86_FEATURE_SPEC_STORE_BYPASS_DISABLE
There it is irrelevant because both threads have SSBD set permanently,
so unsetting it on HLT/MWAIT is not going to lift the restriction for
the running sibling thread. And HLT/MWAIT is not going to be faster by
unsetting it and then setting it on wakeup again....
2) SSBD via prctl/seccomp
Nothing to do there, because idle task does not have TIF_SSBD set so it
never goes with SSBD set into HLT/MWAIT.
So I think we're good, but it would be nice if Intel folks would confirm
that.
Thanks,
tglx
^ permalink raw reply [flat|nested] 89+ messages in thread
* [MODERATED] Encrypted Message
2018-05-18 14:29 ` Thomas Gleixner
@ 2018-05-18 19:50 ` Tim Chen
0 siblings, 0 replies; 89+ messages in thread
From: Tim Chen @ 2018-05-18 19:50 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 163 bytes --]
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: Is: Sleep states ?Was:Re: SSB status - V18 pushed out
[-- Attachment #2: Type: text/plain, Size: 2667 bytes --]
On 05/18/2018 07:29 AM, speck for Thomas Gleixner wrote:
> On Fri, 18 May 2018, speck for Konrad Rzeszutek Wilk wrote:
>> On Thu, May 17, 2018 at 10:53:28PM +0200, speck for Thomas Gleixner wrote:
>>> Folks,
>>>
>>> we finally reached a stable state with the SSB patches. I've updated all 3
>>> branches master/linux-4.16.y/linux-4.14.y in the repo and attached the
>>> resulting git bundles. They merge cleanly on top of the current HEADs of
>>> the relevant trees.
>>>
>>> The lot survived light testing on my side and it would be great if everyone
>>> involved could expose it to their test scenarios.
>>>
>>> Thanks to everyone who participated in that effort (patches, review,
>>> testing ...)!
>>
>> Yeey! Thank you.
>>
>> I was reading the updated Intel doc today (instead of skim reading it) and it mentioned:
>>
>> "Intel recommends that the SSBD MSR bit be cleared when in a sleep state on such processors."
>
> Well, the same recommendation was for IBRS and the reason is that with HT
> enabled the other hyperthread will not be able to go full speed because the
> sleeping one vanished with IBRS set. SSBD works the same way.
>
> " SW should clear [SSBD] when enter sleep state, just as is suggested for
> IBRS and STIBP on existing implementations"
>
> and that document says:
>
> "Enabling IBRS on one logical processor of a core with Intel
> Hyper-Threading Technology may affect branch prediction on other logical
> processors of the same core. For this reason, software should disable IBRS
> (by clearing IA32_SPEC_CTRL.IBRS) prior to entering a sleep state (e.g.,
> by executing HLT or MWAIT) and re-enable IBRS upon wakeup and prior to
> executing any indirect branch."
>
> So it's only a performance issue and not a fundamental problem to have it
> on when executing HLT/MWAIT
>
> So we have two situations here:
>
> 1) ssbd = on, i.e X86_FEATURE_SPEC_STORE_BYPASS_DISABLE
>
> There it is irrelevant because both threads have SSBD set permanently,
> so unsetting it on HLT/MWAIT is not going to lift the restriction for
> the running sibling thread. And HLT/MWAIT is not going to be faster by
> unsetting it and then setting it on wakeup again....
>
> 2) SSBD via prctl/seccomp
>
> Nothing to do there, because idle task does not have TIF_SSBD set so it
> never goes with SSBD set into HLT/MWAIT.
>
> So I think we're good, but it would be nice if Intel folks would confirm
> that.
Yes, we had thought about turning off SSBD in the mwait path earlier, but
decided that it was unnecessary for exactly the reasons Thomas mentioned.
Thanks.
Tim
^ permalink raw reply [flat|nested] 89+ messages in thread
* [patch V11 00/16] SSB 0
@ 2018-05-02 21:51 Thomas Gleixner
2018-05-03 4:27 ` [MODERATED] Encrypted Message Tim Chen
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2018-05-02 21:51 UTC (permalink / raw)
To: speck
Changes since V10:
- Addressed Ingos review feedback
- Picked up Reviewed-bys
Delta patch below. Bundle is coming in separate mail. Git repo branches are
updated as well. The master branch contains also the fix for the lost IBRS
issue Tim was seeing.
If there are no further issues and nitpicks, I'm going to make the
changes immutable and changes need to go incremental on top.
Thanks,
tglx
8<--------------------
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 29984fd3dd18..a8d2ae1e335b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4051,11 +4051,12 @@
on - Unconditionally disable Speculative Store Bypass
off - Unconditionally enable Speculative Store Bypass
- auto - Kernel detects whether the CPU model contains a
+ auto - Kernel detects whether the CPU model contains an
implementation of Speculative Store Bypass and
- picks the most appropriate mitigation
- prctl - Control Speculative Store Bypass for a thread
- via prctl. By default it is enabled. The state
+ picks the most appropriate mitigation.
+ prctl - Control Speculative Store Bypass per thread
+ via prctl. Speculative Store Bypass is enabled
+ for a process by default. The state of the control
is inherited on fork.
Not specifying this option is equivalent to
diff --git a/Documentation/userspace-api/spec_ctrl.rst b/Documentation/userspace-api/spec_ctrl.rst
index 8ff39a26a992..ddbebcd01208 100644
--- a/Documentation/userspace-api/spec_ctrl.rst
+++ b/Documentation/userspace-api/spec_ctrl.rst
@@ -10,7 +10,7 @@ The kernel provides mitigation for such vulnerabilities in various
forms. Some of these mitigations are compile time configurable and some on
the kernel command line.
-There is also a class of mitigations which is very expensive, but they can
+There is also a class of mitigations which are very expensive, but they can
be restricted to a certain set of processes or tasks in controlled
environments. The mechanism to control these mitigations is via
:manpage:`prctl(2)`.
@@ -25,7 +25,7 @@ PR_GET_SPECULATION_CTRL
-----------------------
PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature
-which is selected with arg2 of prctl(2). The return value uses bit 0-2 with
+which is selected with arg2 of prctl(2). The return value uses bits 0-2 with
the following meaning:
==== ================ ===================================================
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 5bee7a2ca4ff..810f50bb338d 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -70,7 +70,11 @@
#define MSR_IA32_ARCH_CAPABILITIES 0x0000010a
#define ARCH_CAP_RDCL_NO (1 << 0) /* Not susceptible to Meltdown */
#define ARCH_CAP_IBRS_ALL (1 << 1) /* Enhanced IBRS support */
-#define ARCH_CAP_RDS_NO (1 << 4) /* Not susceptible to speculative store bypass */
+#define ARCH_CAP_RDS_NO (1 << 4) /*
+ * Not susceptible to Speculative Store Bypass
+ * attack, so no Reduced Data Speculation control
+ * required.
+ */
#define MSR_IA32_BBL_CR_CTL 0x00000119
#define MSR_IA32_BBL_CR_CTL3 0x0000011e
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 023e2edc0f3c..71ad01422655 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -225,8 +225,8 @@ enum spectre_v2_mitigation {
* ourselves and always use this as the base for SPEC_CTRL.
* We also use this when handling guest entry/exit as below.
*/
-extern void x86_set_spec_ctrl(u64);
-extern u64 x86_get_default_spec_ctrl(void);
+extern void x86_spec_ctrl_set(u64);
+extern u64 x86_spec_ctrl_get_default(void);
/* The Speculative Store Bypass disable variants */
enum ssb_mitigation {
@@ -285,7 +285,7 @@ static inline void indirect_branch_prediction_barrier(void)
*/
#define firmware_restrict_branch_speculation_start() \
do { \
- u64 val = x86_get_default_spec_ctrl() | SPEC_CTRL_IBRS; \
+ u64 val = x86_spec_ctrl_get_default() | SPEC_CTRL_IBRS; \
\
preempt_disable(); \
alternative_msr_write(MSR_IA32_SPEC_CTRL, val, \
@@ -294,7 +294,7 @@ do { \
#define firmware_restrict_branch_speculation_end() \
do { \
- u64 val = x86_get_default_spec_ctrl(); \
+ u64 val = x86_spec_ctrl_get_default(); \
\
alternative_msr_write(MSR_IA32_SPEC_CTRL, val, \
X86_FEATURE_USE_IBRS_FW); \
diff --git a/arch/x86/include/asm/spec-ctrl.h b/arch/x86/include/asm/spec-ctrl.h
index 607236af4008..45ef00ad5105 100644
--- a/arch/x86/include/asm/spec-ctrl.h
+++ b/arch/x86/include/asm/spec-ctrl.h
@@ -12,8 +12,8 @@
* shadowable for guests but this is not (currently) the case.
* Takes the guest view of SPEC_CTRL MSR as a parameter.
*/
-extern void x86_set_guest_spec_ctrl(u64);
-extern void x86_restore_host_spec_ctrl(u64);
+extern void x86_spec_ctrl_set_guest(u64);
+extern void x86_spec_ctrl_restore_host(u64);
/* AMD specific Speculative Store Bypass MSR data */
extern u64 x86_amd_ls_cfg_base;
@@ -30,7 +30,7 @@ static inline u64 rds_tif_to_spec_ctrl(u64 tifn)
static inline u64 rds_tif_to_amd_ls_cfg(u64 tifn)
{
- return tifn & _TIF_RDS ? x86_amd_ls_cfg_rds_mask : 0ULL;
+ return (tifn & _TIF_RDS) ? x86_amd_ls_cfg_rds_mask : 0ULL;
}
extern void speculative_store_bypass_update(void);
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 50c6ba6d031b..18efc33a8d2e 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -572,7 +572,7 @@ static void bsp_init_amd(struct cpuinfo_x86 *c)
if (!rdmsrl_safe(MSR_AMD64_LS_CFG, &x86_amd_ls_cfg_base)) {
setup_force_cpu_cap(X86_FEATURE_RDS);
setup_force_cpu_cap(X86_FEATURE_AMD_RDS);
- x86_amd_ls_cfg_rds_mask = (1ULL << bit);
+ x86_amd_ls_cfg_rds_mask = 1ULL << bit;
}
}
}
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index c28856e475c8..15f77d4518c7 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -32,7 +32,7 @@ static void __init spectre_v2_select_mitigation(void);
static void __init ssb_select_mitigation(void);
/*
- * Our boot-time value of SPEC_CTRL MSR. We read it once so that any
+ * Our boot-time value of the SPEC_CTRL MSR. We read it once so that any
* writes to SPEC_CTRL contain whatever reserved bits have been set.
*/
u64 __ro_after_init x86_spec_ctrl_base;
@@ -41,11 +41,11 @@ u64 __ro_after_init x86_spec_ctrl_base;
* The vendor and possibly platform specific bits which can be modified in
* x86_spec_ctrl_base.
*/
-static u64 __ro_after_init x86_spec_ctrl_mask = ~(SPEC_CTRL_IBRS);
+static u64 __ro_after_init x86_spec_ctrl_mask = ~SPEC_CTRL_IBRS;
/*
- * AMD specific MSR info for Store Bypass control. x86_amd_ls_cfg_rds_mask
- * is initialized in identify_boot_cpu().
+ * AMD specific MSR info for Speculative Store Bypass control.
+ * x86_amd_ls_cfg_rds_mask is initialized in identify_boot_cpu().
*/
u64 __ro_after_init x86_amd_ls_cfg_base;
u64 __ro_after_init x86_amd_ls_cfg_rds_mask;
@@ -61,7 +61,7 @@ void __init check_bugs(void)
/*
* Read the SPEC_CTRL MSR to account for reserved bits which may
- * have unknown values. AMD64_LS_CFG msr is cached in the early AMD
+ * have unknown values. AMD64_LS_CFG MSR is cached in the early AMD
* init code as it is not enumerated and depends on the family.
*/
if (boot_cpu_has(X86_FEATURE_IBRS))
@@ -131,22 +131,22 @@ static const char *spectre_v2_strings[] = {
static enum spectre_v2_mitigation spectre_v2_enabled = SPECTRE_V2_NONE;
-void x86_set_spec_ctrl(u64 val)
+void x86_spec_ctrl_set(u64 val)
{
if (val & x86_spec_ctrl_mask)
WARN_ONCE(1, "SPEC_CTRL MSR value 0x%16llx is unknown.\n", val);
else
wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base | val);
}
-EXPORT_SYMBOL_GPL(x86_set_spec_ctrl);
+EXPORT_SYMBOL_GPL(x86_spec_ctrl_set);
-u64 x86_get_default_spec_ctrl(void)
+u64 x86_spec_ctrl_get_default(void)
{
return x86_spec_ctrl_base;
}
-EXPORT_SYMBOL_GPL(x86_get_default_spec_ctrl);
+EXPORT_SYMBOL_GPL(x86_spec_ctrl_get_default);
-void x86_set_guest_spec_ctrl(u64 guest_spec_ctrl)
+void x86_spec_ctrl_set_guest(u64 guest_spec_ctrl)
{
u64 host = x86_spec_ctrl_base;
@@ -159,9 +159,9 @@ void x86_set_guest_spec_ctrl(u64 guest_spec_ctrl)
if (host != guest_spec_ctrl)
wrmsrl(MSR_IA32_SPEC_CTRL, guest_spec_ctrl);
}
-EXPORT_SYMBOL_GPL(x86_set_guest_spec_ctrl);
+EXPORT_SYMBOL_GPL(x86_spec_ctrl_set_guest);
-void x86_restore_host_spec_ctrl(u64 guest_spec_ctrl)
+void x86_spec_ctrl_restore_host(u64 guest_spec_ctrl)
{
u64 host = x86_spec_ctrl_base;
@@ -174,7 +174,7 @@ void x86_restore_host_spec_ctrl(u64 guest_spec_ctrl)
if (host != guest_spec_ctrl)
wrmsrl(MSR_IA32_SPEC_CTRL, host);
}
-EXPORT_SYMBOL_GPL(x86_restore_host_spec_ctrl);
+EXPORT_SYMBOL_GPL(x86_spec_ctrl_restore_host);
static void x86_amd_rds_enable(void)
{
@@ -504,8 +504,8 @@ static enum ssb_mitigation_cmd __init __ssb_select_mitigation(void)
switch (boot_cpu_data.x86_vendor) {
case X86_VENDOR_INTEL:
x86_spec_ctrl_base |= SPEC_CTRL_RDS;
- x86_spec_ctrl_mask &= ~(SPEC_CTRL_RDS);
- x86_set_spec_ctrl(SPEC_CTRL_RDS);
+ x86_spec_ctrl_mask &= ~SPEC_CTRL_RDS;
+ x86_spec_ctrl_set(SPEC_CTRL_RDS);
break;
case X86_VENDOR_AMD:
x86_amd_rds_enable();
@@ -560,7 +560,7 @@ static int ssb_prctl_get(void)
}
}
-int arch_prctl_set_spec_ctrl(unsigned long which, unsigned long ctrl)
+int arch_prctl_spec_ctrl_set(unsigned long which, unsigned long ctrl)
{
if (ctrl != PR_SPEC_ENABLE && ctrl != PR_SPEC_DISABLE)
return -ERANGE;
@@ -573,7 +573,7 @@ int arch_prctl_set_spec_ctrl(unsigned long which, unsigned long ctrl)
}
}
-int arch_prctl_get_spec_ctrl(unsigned long which)
+int arch_prctl_spec_ctrl_get(unsigned long which)
{
switch (which) {
case PR_SPEC_STORE_BYPASS:
@@ -583,10 +583,10 @@ int arch_prctl_get_spec_ctrl(unsigned long which)
}
}
-void x86_setup_ap_spec_ctrl(void)
+void x86_spec_ctrl_setup_ap(void)
{
if (boot_cpu_has(X86_FEATURE_IBRS))
- x86_set_spec_ctrl(x86_spec_ctrl_base & ~x86_spec_ctrl_mask);
+ x86_spec_ctrl_set(x86_spec_ctrl_base & ~x86_spec_ctrl_mask);
if (ssb_mode == SPEC_STORE_BYPASS_DISABLE)
x86_amd_rds_enable();
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f3dbdde978a4..e0517bcee446 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -848,6 +848,11 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
c->x86_power = edx;
}
+ if (c->extended_cpuid_level >= 0x80000008) {
+ cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
+ c->x86_capability[CPUID_8000_0008_EBX] = ebx;
+ }
+
if (c->extended_cpuid_level >= 0x8000000a)
c->x86_capability[CPUID_8000_000A_EDX] = cpuid_edx(0x8000000a);
@@ -871,7 +876,6 @@ static void get_cpu_address_sizes(struct cpuinfo_x86 *c)
c->x86_virt_bits = (eax >> 8) & 0xff;
c->x86_phys_bits = eax & 0xff;
- c->x86_capability[CPUID_8000_0008_EBX] = ebx;
}
#ifdef CONFIG_X86_32
else if (cpu_has(c, X86_FEATURE_PAE) || cpu_has(c, X86_FEATURE_PSE36))
@@ -924,26 +928,26 @@ static const __initconst struct x86_cpu_id cpu_no_meltdown[] = {
};
static const __initconst struct x86_cpu_id cpu_no_spec_store_bypass[] = {
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_PINEVIEW },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_LINCROFT },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_PENWELL },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_CLOVERVIEW },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_CEDARVIEW },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT1 },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT2 },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MERRIFIELD },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_CORE_YONAH },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNL },
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNM },
- { X86_VENDOR_CENTAUR, 5 },
- { X86_VENDOR_INTEL, 5 },
- { X86_VENDOR_NSC, 5 },
- { X86_VENDOR_AMD, 0xf },
- { X86_VENDOR_AMD, 0x10 },
- { X86_VENDOR_AMD, 0x11 },
- { X86_VENDOR_AMD, 0x12 },
- { X86_VENDOR_ANY, 4 },
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_PINEVIEW },
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_LINCROFT },
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_PENWELL },
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_CLOVERVIEW },
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_CEDARVIEW },
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT1 },
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT },
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT2 },
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MERRIFIELD },
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_CORE_YONAH },
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNL },
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNM },
+ { X86_VENDOR_CENTAUR, 5, },
+ { X86_VENDOR_INTEL, 5, },
+ { X86_VENDOR_NSC, 5, },
+ { X86_VENDOR_AMD, 0x12, },
+ { X86_VENDOR_AMD, 0x11, },
+ { X86_VENDOR_AMD, 0x10, },
+ { X86_VENDOR_AMD, 0xf, },
+ { X86_VENDOR_ANY, 4, },
{}
};
@@ -1384,7 +1388,7 @@ void identify_secondary_cpu(struct cpuinfo_x86 *c)
#endif
mtrr_ap_init();
validate_apic_and_package_id(c);
- x86_setup_ap_spec_ctrl();
+ x86_spec_ctrl_setup_ap();
}
static __init int setup_noclflush(char *arg)
diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
index faaabc160293..37672d299e35 100644
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -50,6 +50,6 @@ extern void cpu_detect_cache_sizes(struct cpuinfo_x86 *c);
unsigned int aperfmperf_get_khz(int cpu);
-extern void x86_setup_ap_spec_ctrl(void);
+extern void x86_spec_ctrl_setup_ap(void);
#endif /* ARCH_X86_CPU_H */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ba4763e9a285..437c1b371129 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -5557,7 +5557,7 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
* is no need to worry about the conditional branch over the wrmsr
* being speculatively taken.
*/
- x86_set_guest_spec_ctrl(svm->spec_ctrl);
+ x86_spec_ctrl_set_guest(svm->spec_ctrl);
asm volatile (
"push %%" _ASM_BP "; \n\t"
@@ -5669,7 +5669,7 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
if (unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL)))
svm->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
- x86_restore_host_spec_ctrl(svm->spec_ctrl);
+ x86_spec_ctrl_restore_host(svm->spec_ctrl);
/* Eliminate branch target predictions from guest mode */
vmexit_fill_RSB();
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9744e48457d6..16a111e44691 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9722,7 +9722,7 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
* is no need to worry about the conditional branch over the wrmsr
* being speculatively taken.
*/
- x86_set_guest_spec_ctrl(vmx->spec_ctrl);
+ x86_spec_ctrl_set_guest(vmx->spec_ctrl);
vmx->__launched = vmx->loaded_vmcs->launched;
@@ -9870,7 +9870,7 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
if (unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL)))
vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
- x86_restore_host_spec_ctrl(vmx->spec_ctrl);
+ x86_spec_ctrl_restore_host(vmx->spec_ctrl);
/* Eliminate branch target predictions from guest mode */
vmexit_fill_RSB();
diff --git a/include/linux/nospec.h b/include/linux/nospec.h
index 1e63a0a90e96..700bb8a4e4ea 100644
--- a/include/linux/nospec.h
+++ b/include/linux/nospec.h
@@ -57,7 +57,7 @@ static inline unsigned long array_index_mask_nospec(unsigned long index,
})
/* Speculation control prctl */
-int arch_prctl_set_spec_ctrl(unsigned long which, unsigned long ctrl);
-int arch_prctl_get_spec_ctrl(unsigned long which);
+int arch_prctl_spec_ctrl_get(unsigned long which);
+int arch_prctl_spec_ctrl_set(unsigned long which, unsigned long ctrl);
#endif /* _LINUX_NOSPEC_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 4e7a160d3b28..ebf057ac1346 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -208,8 +208,8 @@ struct prctl_mm_map {
# define PR_SVE_VL_INHERIT (1 << 17) /* inherit across exec */
/* Per task speculation control */
-#define PR_SET_SPECULATION_CTRL 52
-#define PR_GET_SPECULATION_CTRL 53
+#define PR_GET_SPECULATION_CTRL 52
+#define PR_SET_SPECULATION_CTRL 53
/* Speculation control variants */
# define PR_SPEC_STORE_BYPASS 0
/* Return and control values for PR_SET/GET_SPECULATION_CTRL */
diff --git a/kernel/sys.c b/kernel/sys.c
index d7afe29319f1..b76dee23bdc9 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2244,12 +2244,12 @@ static int propagate_has_child_subreaper(struct task_struct *p, void *data)
return 1;
}
-int __weak arch_prctl_set_spec_ctrl(unsigned long which, unsigned long ctrl)
+int __weak arch_prctl_spec_ctrl_get(unsigned long which)
{
return -EINVAL;
}
-int __weak arch_prctl_get_spec_ctrl(unsigned long which)
+int __weak arch_prctl_spec_ctrl_set(unsigned long which, unsigned long ctrl)
{
return -EINVAL;
}
@@ -2462,15 +2462,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SVE_GET_VL:
error = SVE_GET_VL();
break;
- case PR_SET_SPECULATION_CTRL:
- if (arg4 || arg5)
- return -EINVAL;
- error = arch_prctl_set_spec_ctrl(arg2, arg3);
- break;
case PR_GET_SPECULATION_CTRL:
if (arg3 || arg4 || arg5)
return -EINVAL;
- error = arch_prctl_get_spec_ctrl(arg2);
+ error = arch_prctl_spec_ctrl_get(arg2);
+ break;
+ case PR_SET_SPECULATION_CTRL:
+ if (arg4 || arg5)
+ return -EINVAL;
+ error = arch_prctl_spec_ctrl_set(arg2, arg3);
break;
default:
error = -EINVAL;
^ permalink raw reply related [flat|nested] 89+ messages in thread
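For completeness, the userspace side of the prctl interface touched by
the PR_GET/PR_SET renumbering above is used like this; a minimal,
self-contained sketch:

	#include <stdio.h>
	#include <sys/prctl.h>
	#include <linux/prctl.h>

	int main(void)
	{
		int state;

		/*
		 * Disable speculative store bypass for this task; the
		 * state is inherited across fork().
		 */
		if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
			  PR_SPEC_DISABLE, 0, 0))
			perror("PR_SET_SPECULATION_CTRL");

		/* Returns the PR_SPEC_* state in bits 0-2 */
		state = prctl(PR_GET_SPECULATION_CTRL,
			      PR_SPEC_STORE_BYPASS, 0, 0, 0);
		printf("SSB state: 0x%x\n", state);
		return 0;
	}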
* [MODERATED] Encrypted Message
2018-05-02 21:51 [patch V11 00/16] SSB 0 Thomas Gleixner
@ 2018-05-03 4:27 ` Tim Chen
0 siblings, 0 replies; 89+ messages in thread
From: Tim Chen @ 2018-05-03 4:27 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 133 bytes --]
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V11 00/16] SSB 0
[-- Attachment #2: Type: text/plain, Size: 1580 bytes --]
On 05/02/2018 02:51 PM, speck for Thomas Gleixner wrote:
> Changes since V10:
>
> - Addressed Ingos review feedback
>
> - Picked up Reviewed-bys
>
> Delta patch below. Bundle is coming in separate mail. Git repo branches are
> updated as well. The master branch contains also the fix for the lost IBRS
> issue Tim was seeing.
>
> If there are no further issues and nitpicks, I'm going to make the
> changes immutable and changes need to go incremental on top.
>
> Thanks,
>
> tglx
>
>
I notice that this code ignores the current process's TIF_RDS setting
in the prctl case:
#define firmware_restrict_branch_speculation_end() \
do { \
u64 val = x86_get_default_spec_ctrl(); \
\
alternative_msr_write(MSR_IA32_SPEC_CTRL, val, \
X86_FEATURE_USE_IBRS_FW); \
preempt_enable(); \
} while (0)
x86_get_default_spec_ctrl() will return x86_spec_ctrl_base, which
results in x86_spec_ctrl_base being written to the MSR
in the prctl case on Intel CPUs. That incorrectly ignores the current
process's TIF_RDS setting, so the RDS bit will not be set.
Instead, the following value should have been written to the MSR
for Intel CPU:
x86_spec_ctrl_base | rds_tif_to_spec_ctrl(current_thread_info()->flags)
Thanks.
Tim
^ permalink raw reply [flat|nested] 89+ messages in thread
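The fix Tim is asking for would make the restore path honor the
per-task flag, roughly as follows (a sketch against the V11 names, not
the final patch):

	#define firmware_restrict_branch_speculation_end()		\
	do {								\
		u64 val = x86_spec_ctrl_get_default() |			\
		    rds_tif_to_spec_ctrl(current_thread_info()->flags); \
									\
		alternative_msr_write(MSR_IA32_SPEC_CTRL, val,		\
				      X86_FEATURE_USE_IBRS_FW);		\
		preempt_enable();					\
	} while (0)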
* [MODERATED] L1D-Fault KVM mitigation
@ 2018-04-24 9:06 Joerg Roedel
2018-04-24 9:35 ` [MODERATED] " Peter Zijlstra
0 siblings, 1 reply; 89+ messages in thread
From: Joerg Roedel @ 2018-04-24 9:06 UTC (permalink / raw)
To: speck
Hey,
I've been looking into the mitigation for the L1D fault issue in KVM,
and since the hardware seems to speculate with the GPA as an HPA, it
seems we have to disable SMT to be fully secure here because otherwise
two different guests running on HT siblings could spy on each other.
I'd like to discuss how we mitigate this. The big hammer would be to not
initialize the HT siblings at boot on affected machines, but that is
probably a bit too eager, as it also penalizes people not using KVM.
Another option is to just print a fat warning and/or refuse to load the
KVM modules on affected machines when HT is enabled.
So what are the opinions on how we should best mitigate this issue?
Regards,
Joerg
* [MODERATED] Re: L1D-Fault KVM mitigation
2018-04-24 9:06 [MODERATED] L1D-Fault KVM mitigation Joerg Roedel
@ 2018-04-24 9:35 ` Peter Zijlstra
2018-04-24 9:48 ` David Woodhouse
0 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2018-04-24 9:35 UTC (permalink / raw)
To: speck
On Tue, Apr 24, 2018 at 11:06:30AM +0200, speck for Joerg Roedel wrote:
> Hey,
>
> I've been looking into the mitigation for the L1D fault issue in KVM,
> and since the hardware seems to speculate with the GPA as an HPA, it
> seems we have to disable SMT to be fully secure here because otherwise
> two different guests running on HT siblings could spy on each other.
>
> I'd like to discuss how we mitigate this, the big hammer would be not
> initializing the HT siblings at boot on affected machines, but that is
> probably a bit too eager as it also penalizes people not using KVM.
>
> Another option is to just print a fat warning and/or refuse to load the
> KVM modules on affected machines when HT is enabled.
>
> So what are the opinions on how we should best mitigate this issue?
Another option, that is being explored, is to co-schedule siblings.
So ensure all siblings either run vcpus of the _same_ VM or idle.
Of course, this is all rather intrusive and ugly and brings with it
setup costs as well, because you'd have to sync up on VMENTER, VMEXIT
and interrupts (on the idle CPUs).
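To make the cost concrete, a deliberately naive sketch of the per-core
rendezvous such co-scheduling needs; all helper names are hypothetical,
and the real thing also has to deal with interrupts and idle, which this
ignores:

struct core_rendezvous {
	atomic_t	want_guest;	/* siblings ready to VMENTER */
	struct kvm	*vm;		/* VM currently owning the core */
};

static void rendezvous_before_vmenter(struct core_rendezvous *r)
{
	atomic_inc(&r->want_guest);

	/* Busy wait until the sibling also wants to enter the same VM
	 * or is known to be idle. Pure overhead, paid on every entry. */
	while (atomic_read(&r->want_guest) < 2 && !sibling_is_idle())
		cpu_relax();
}

static void rendezvous_after_vmexit(struct core_rendezvous *r)
{
	atomic_dec(&r->want_guest);

	/* Drag the sibling out of guest mode too, so no guest code runs
	 * while host data can land in the shared L1D. */
	kick_sibling_vmexit();
}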
Another complication is that on overcommitted systems the regular load
balancer will happily migrate vcpu tasks around. So it is fairly tricky
to ensure runnable vcpu threads of the same VM are in fact around to be
run on a core.
Not to mention that Linus has basically said: "No way, Jose".
I know that I worked a little with Tim on this, and I know Google did
their own thing (but have not seen patches from them -- is pjt on this
list?). I've also heard Amazon was also working on things (are they
here?). And I think RHT was also looking into something (mingo, bonzini
-- are you guys reading?)
In any case, if any of that is to go fly we need very solid numbers to
convince Linus to reconsider.
Another idea that I had was to only allow trusted guest kernels, as in
trusted computing, key-verified images etc. Of course, they too can be
compromised, but hopefully it avoids the most egregious hostile guest
scenarios.
* [MODERATED] Re: L1D-Fault KVM mitigation
2018-04-24 9:35 ` [MODERATED] " Peter Zijlstra
@ 2018-04-24 9:48 ` David Woodhouse
2018-04-24 11:04 ` Peter Zijlstra
0 siblings, 1 reply; 89+ messages in thread
From: David Woodhouse @ 2018-04-24 9:48 UTC (permalink / raw)
To: speck
On Tue, 2018-04-24 at 11:35 +0200, speck for Peter Zijlstra wrote:
>
> Another option, that is being explored, is to co-schedule siblings.
> So ensure all siblings either run vcpus of the _same_ VM or idle.
>
> Of course, this is all rather intrusive and ugly and brings with it
> setup costs as well, because you'd have to sync up on VMENTER, VMEXIT
> and interrupts (on the idle CPUs).
I hate to suggest more microcode hacks but... if there was an MSR bit
which, when set, would pause any HT sibling that was currently in VMX
non-root mode, then we could set that up to be automatically set on
vmexit and it would automatically pause the problematic siblings.
Meaning that co-ordinating vmexits with them might actually be
feasible?
The precise definition of 'pause' in the above could survive some
bikeshedding, but basically it shouldn't run any more guest
instructions, but it *should* be allowed to vmexit on interrupts, etc.
* [MODERATED] Re: L1D-Fault KVM mitigation
2018-04-24 9:48 ` David Woodhouse
@ 2018-04-24 11:04 ` Peter Zijlstra
2018-05-23 9:45 ` David Woodhouse
0 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2018-04-24 11:04 UTC (permalink / raw)
To: speck
On Tue, Apr 24, 2018 at 10:48:12AM +0100, speck for David Woodhouse wrote:
> On Tue, 2018-04-24 at 11:35 +0200, speck for Peter Zijlstra wrote:
> >
> > Another option, that is being explored, is to co-schedule siblings.
> > So ensure all siblings either run vcpus of the _same_ VM or idle.
> >
> > Of course, this is all rather intrusive and ugly and brings with it
> > setup costs as well, because you'd have to sync up on VMENTER, VMEXIT
> > and interrupts (on the idle CPUs).
>
> I hate to suggest more microcode hacks but... if there was an MSR bit
> which, when set, would pause any HT sibling that was currently in VMX
> non-root mode, then we could set that up to be automatically set on
> vmexit and it would automatically pause the problematic siblings.
> Meaning that co-ordinating vmexits with them might actually be
> feasible?
Not sure I'm following. The above assumes a sibling is running a VCPU of
another VM, right? But it could equally well run any regular old task
(including idle).
So only pausing siblings in VMX mode wouldn't help anything. The !VMX
tasks could still be loading stuff into L1.
* [MODERATED] Re: L1D-Fault KVM mitigation
2018-04-24 11:04 ` Peter Zijlstra
@ 2018-05-23 9:45 ` David Woodhouse
2018-05-24 9:45 ` Peter Zijlstra
0 siblings, 1 reply; 89+ messages in thread
From: David Woodhouse @ 2018-05-23 9:45 UTC (permalink / raw)
To: speck
On Tue, 2018-04-24 at 13:04 +0200, speck for Peter Zijlstra wrote:
> On Tue, Apr 24, 2018 at 10:48:12AM +0100, speck for David Woodhouse wrote:
> >
> > On Tue, 2018-04-24 at 11:35 +0200, speck for Peter Zijlstra wrote:
> > >
> > >
> > > Another option, that is being explored, is to co-schedule siblings.
> > > So ensure all siblings either run vcpus of the _same_ VM or idle.
> > >
> > > Of course, this is all rather intrusive and ugly and brings with it
> > > setup costs as well, because you'd have to sync up on VMENTER, VMEXIT
> > > and interrupts (on the idle CPUs).
>
> > I hate to suggest more microcode hacks but... if there was an MSR bit
> > which, when set, would pause any HT sibling that was currently in VMX
> > non-root mode, then we could set that up to be automatically set on
> > vmexit and it would automatically pause the problematic siblings.
> > Meaning that co-ordinating vmexits with them might actually be
> > feasible?
> Not sure I'm following. The above assumes a sibling is running a VCPU of
> another VM, right? But it could equally well run any regular old task
> (including idle).
>
> So only pausing siblings in VMX mode wouldn't help anything. The !VMX
> tasks could still be loading stuff into L1.
That's OK because it's only the VMX tasks which can abuse it, isn't it?
Let's assume we've fixed the problem for normal tasks, by flipping the
top bit in absent PTEs that actually contain swap pointers, etc.
The only thing we have left is VM guests. The microcode bit would say
that *if* a CPU thread is in non-root mode then *it* gets paused unless
its sibling is also in non-root mode for the same VMID.
So when both siblings are actually in the VM, they get to run. If one
sibling comes *out* of the VM to the host kernel or to run (host)
userspace, then the other one doesn't execute any guest instructions.
It can take exceptions which cause a vmexit though.
We'd also want a vCPU to be able to run if its sibling is actually in
the host but *idle* (and has flushed the L1; perhaps we actually
automatically flush the L1 when resuming a sibling that got paused).
It does still depend on gang scheduling (or at least forced sibling
idle which is a subset of that), or a singleton vCPU might *never* get
run. But we were going to have to do something along those lines
anyway. The microcode trick just makes it a lot easier because we don't
have to *explicitly* pause the sibling vCPUs and manage their state on
every vmexit/entry. And avoids potential race conditions with managing
that in software.
* [MODERATED] Re: L1D-Fault KVM mitigation
2018-05-23 9:45 ` David Woodhouse
@ 2018-05-24 9:45 ` Peter Zijlstra
2018-05-24 15:04 ` Thomas Gleixner
0 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2018-05-24 9:45 UTC (permalink / raw)
To: speck
On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
> That's OK because it's only the VMX tasks which can abuse it, isn't it?
If, like you outline below, this is an (optional) ucode assist to
co-scheduling matching VCPU threads, then yes.
> Let's assume we've fixed the problem for normal tasks, by flipping the
> top bit in absent PTEs that actually contain swap pointers, etc.
>
> The only thing we have left is VM guests. The microcode bit would say
> that *if* a CPU thread is in non-root mode then *it* gets paused unless
> its sibling is also in non-root mode for the same VMID.
>
> So when both siblings are actually in the VM, they get to run. If one
> sibling comes *out* of the VM to the host kernel or to run (host)
> userspace, then the other one doesn't execute any guest instructions.
> It can take exceptions which cause a vmexit though.
Would it make sense to time-limit being 'stuck', much like PLE?
> We'd also want a vCPU to be able to run if its sibling is actually in
> the host but *idle* (and has flushed the L1. Perhaps we actually
> automatically flush the L1 when resuming a sibling that got paused).
Right, idle is a wildcard which matches with any VCPU. We don't care
about the cache state of the sibling though. L1 is shared and since
VMENTER must flush L1, that is sufficient.
> It does still depend on gang scheduling (or at least forced sibling
> idle which is a subset of that), or a singleton vCPU might *never* get
> run. But we were going to have to do something along those lines
> anyway.
Linus has opinions on that... but yes, without that all that remains is
disabling HT afaict.
> The microcode trick just makes it a lot easier because we don't
> have to *explicitly* pause the sibling vCPUs and manage their state on
> every vmexit/entry. And avoids potential race conditions with managing
> that in software.
Yes, it would certainly help and avoid a fair bit of ugly. It would, for
instance, avoid having to modify irq_enter() / irq_exit(), which would
otherwise be required (and possibly leak all data touched up until that
point is reached).
But even with all that, adding L1-flush to every VMENTER will hurt lots.
Consider for example the PIO emulation used when booting a guest from a
disk image. That causes VMEXIT/VMENTER at stupendous rates.
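For reference, the flush itself is cheap to express; a minimal sketch,
assuming the IA32_FLUSH_CMD MSR interface plus a software fallback (buffer
size and feature plumbing simplified). The expensive part is not the write
but the refill traffic afterwards:

#define MSR_IA32_FLUSH_CMD	0x0000010b
#define L1D_FLUSH		(1ULL << 0)

static void l1d_flush_before_vmenter(void)
{
	static u8 flush_buf[64 * 1024] __aligned(4096);
	int i;

	/* Ucode-assisted path: one MSR write flushes the L1D. */
	if (boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {
		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
		return;
	}

	/* Fallback: read a buffer twice the L1D size to displace the
	 * current contents, one cache line at a time. */
	for (i = 0; i < 64 * 1024; i += 64)
		READ_ONCE(flush_buf[i]);
}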
Also, none of this readily addresses the problem of load-balancing
shredding the VCPU localities required for this.
* Re: L1D-Fault KVM mitigation
2018-05-24 9:45 ` Peter Zijlstra
@ 2018-05-24 15:04 ` Thomas Gleixner
2018-05-24 15:33 ` Thomas Gleixner
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2018-05-24 15:04 UTC (permalink / raw)
To: speck
On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
> > The microcode trick just makes it a lot easier because we don't
> > have to *explicitly* pause the sibling vCPUs and manage their state on
> > every vmexit/entry. And avoids potential race conditions with managing
> > that in software.
>
> Yes, it would certainly help and avoid a fair bit of ugly. It would, for
> instance, avoid having to modify irq_enter() / irq_exit(), which would
> otherwise be required (and possibly leak all data touched up until that
> point is reached).
>
> But even with all that, adding L1-flush to every VMENTER will hurt lots.
> Consider for example the PIO emulation used when booting a guest from a
> disk image. That causes VMEXIT/VMENTER at stupendous rates.
Just did a test on SKL Client where I have ucode. It does not have HT so
it's not suffering from any HT side effects when L1D is flushed.
Boot time from a disk image is ~1s measured from the first vcpu enter.
With L1D Flush on vmenter the boot time is about 5-10% slower. And that has
lots of PIO operations in the early boot.
For a kernel build the L1D Flush has an overhead of < 1%.
Netperf guest to host has a slight drop of the throughput in the 2%
range. Host to guest surprisingly goes up by ~3%. Fun stuff!
Now I isolated two host CPUs and pinned the two vCPUs on them to be able to
measure the overhead. Running cyclictest with a period of 25us in the guest
on an isolated guest CPU and monitoring the behaviour with perf on the host
for the corresponding host CPU gives:

            No Flush                         Flush

  1.31 insn per cycle              1.14 insn per cycle
  2e6  L1-dcache-load-misses/sec   26e6 L1-dcache-load-misses/sec
In that simple test the L1D misses go up by a factor of 13.
Now with the whole gang scheduling the numbers I heard through the
grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
disk image. 13 minutes instead of 6 seconds...
That's not surprising at all, though the magnitude is way higher than I
expected. I don't see a realistic chance for vmexit heavy workloads to work
with that synchronization thing at all, whether it's ucode assisted or not.
The only workload types which will ever benefit from that co-scheduling
stuff are CPU bound workloads which more or less never vmexit. But are
those workloads really workloads which benefit from HT? Compute workloads
tend to use floating point or vector instructions which are not really HT
friendly.
Can the virt folks who know what runs on their cloudy offerings please shed
some light on this? Has anyone made a proper analysis of cloud workloads
and their behaviour on HT and their vmexit rates?
Thanks,
tglx
* Re: L1D-Fault KVM mitigation
2018-05-24 15:04 ` Thomas Gleixner
@ 2018-05-24 15:33 ` Thomas Gleixner
2018-05-24 23:18 ` [MODERATED] Encrypted Message Tim Chen
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2018-05-24 15:33 UTC (permalink / raw)
To: speck
On Thu, 24 May 2018, speck for Thomas Gleixner wrote:
> On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
> > On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
> > > The microcode trick just makes it a lot easier because we don't
> > > have to *explicitly* pause the sibling vCPUs and manage their state on
> > > every vmexit/entry. And avoids potential race conditions with managing
> > > that in software.
> >
> > Yes, it would certainly help and avoid a fair bit of ugly. It would, for
> > instance, avoid having to modify irq_enter() / irq_exit(), which would
> > otherwise be required (and possibly leak all data touched up until that
> > point is reached).
> >
> > But even with all that, adding L1-flush to every VMENTER will hurt lots.
> > Consider for example the PIO emulation used when booting a guest from a
> > disk image. That causes VMEXIT/VMENTER at stupendous rates.
>
> Just did a test on SKL Client where I have ucode. It does not have HT so
> it's not suffering from any HT side effects when L1D is flushed.
>
> Boot time from a disk image is ~1s measured from the first vcpu enter.
>
> With L1D Flush on vmenter the boot time is about 5-10% slower. And that has
> lots of PIO operations in the early boot.
>
> For a kernel build the L1D Flush has an overhead of < 1%.
>
> Netperf guest to host has a slight drop of the throughput in the 2%
> range. Host to guest surprisingly goes up by ~3%. Fun stuff!
>
> Now I isolated two host CPUs and pinned the two vCPUs on them to be able to
> measure the overhead. Running cyclictest with a period of 25us in the guest
> on an isolated guest CPU and monitoring the behaviour with perf on the host
> for the corresponding host CPU gives
>
>             No Flush                         Flush
>
>   1.31 insn per cycle              1.14 insn per cycle
>
>   2e6  L1-dcache-load-misses/sec   26e6 L1-dcache-load-misses/sec
>
> In that simple test the L1D misses go up by a factor of 13.
>
> Now with the whole gang scheduling the numbers I heard through the
> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
> disk image. 13 minutes instead of 6 seconds...
>
> That's not surprising at all, though the magnitude is way higher than I
> expected. I don't see a realistic chance for vmexit heavy workloads to work
> with that synchronization thing at all, whether it's ucode assisted or not.
That said, I think we should stage the host side mitigations plus the L1
flush on vmenter ASAP so we are not standing there with our pants down when
the cat comes out of the bag early. That means HT off, but it's still
better than having absolutely nothing.
The gang scheduling nonsense can be added on top if it should
surprisingly turn out to be usable at all.
Thanks,
tglx
* [MODERATED] Encrypted Message
2018-05-24 15:33 ` Thomas Gleixner
@ 2018-05-24 23:18 ` Tim Chen
2018-05-25 18:22 ` Tim Chen
2018-05-26 19:14 ` L1D-Fault KVM mitigation Thomas Gleixner
0 siblings, 2 replies; 89+ messages in thread
From: Tim Chen @ 2018-05-24 23:18 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: L1D-Fault KVM mitigation
On 05/24/2018 08:33 AM, speck for Thomas Gleixner wrote:
> On Thu, 24 May 2018, speck for Thomas Gleixner wrote:
>> On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
>>> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
>>>> The microcode trick just makes it a lot easier because we don't
>>>> have to *explicitly* pause the sibling vCPUs and manage their state on
>>>> every vmexit/entry. And avoids potential race conditions with managing
>>>> that in software.
>>>
>>> Yes, it would certainly help and avoid a fair bit of ugly. It would, for
>>> instance, avoid having to modify irq_enter() / irq_exit(), which would
>>> otherwise be required (and possibly leak all data touched up until that
>>> point is reached).
>>>
>>> But even with all that, adding L1-flush to every VMENTER will hurt lots.
>>> Consider for example the PIO emulation used when booting a guest from a
>>> disk image. That causes VMEXIT/VMENTER at stupendous rates.
>>
>> Just did a test on SKL Client where I have ucode. It does not have HT so
>> it's not suffering from any HT side effects when L1D is flushed.
>>
>> Boot time from a disk image is ~1s measured from the first vcpu enter.
>>
>> With L1D Flush on vmenter the boot time is about 5-10% slower. And that has
>> lots of PIO operations in the early boot.
>>
>> For a kernel build the L1D Flush has an overhead of < 1%.
>>
>> Netperf guest to host has a slight drop of the throughput in the 2%
>> range. Host to guest surprisingly goes up by ~3%. Fun stuff!
>>
>> Now I isolated two host CPUs and pinned the two vCPUs on them to be able to
>> measure the overhead. Running cyclictest with a period of 25us in the guest
>> on an isolated guest CPU and monitoring the behaviour with perf on the host
>> for the corresponding host CPU gives
>>
>>             No Flush                         Flush
>>
>>   1.31 insn per cycle              1.14 insn per cycle
>>
>>   2e6  L1-dcache-load-misses/sec   26e6 L1-dcache-load-misses/sec
>>
>> In that simple test the L1D misses go up by a factor of 13.
>>
>> Now with the whole gang scheduling the numbers I heard through the
>> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
>> disk image. 13 minutes instead of 6 seconds...
The performance is highly dependent on how often we VM exit.
Working with Peter Z on his prototype, the performance ranges from
no regression for a network loopback, through ~20% regression for a kernel
compile, to ~100% regression on file IO. PIO brings out the worst aspect
of the synchronization overhead, as we VM exit on every dword PIO read in; the
kernel and initrd images were about 50 MB for the experiment, which led to
13 min of load time.
We may need to do the co-scheduling only when the VM exit rate is low, and
turn off SMT when the VM exit rate becomes too high.
(Note: I haven't added in the L1 flush on VM entry for my experiment; that is on
the todo list.)
Tim
>>
>> That's not surprising at all, though the magnitude is way higher than I
>> expected. I don't see a realistic chance for vmexit heavy workloads to work
>> with that synchronization thing at all, whether it's ucode assisted or not.
>
> That said, I think we should stage the host side mitigations plus the L1
> flush on vmenter ASAP so we are not standing there with our pants down when
> the cat comes out of the bag early. That means HT off, but it's still
> better than having absolutely nothing.
>
> The gang scheduling nonsense can be added on top if it should
> surprisingly turn out to be usable at all.
>
> Thanks,
>
> tglx
>
* [MODERATED] Encrypted Message
2018-05-24 23:18 ` [MODERATED] Encrypted Message Tim Chen
@ 2018-05-25 18:22 ` Tim Chen
2018-05-26 19:14 ` L1D-Fault KVM mitigation Thomas Gleixner
1 sibling, 0 replies; 89+ messages in thread
From: Tim Chen @ 2018-05-25 18:22 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Tim Chen <speck@linutronix.de>
Subject: Re: L1D-Fault KVM mitigation
On 05/24/2018 04:18 PM, speck for Tim Chen wrote:
> On 05/24/2018 08:33 AM, speck for Thomas Gleixner wrote:
>> On Thu, 24 May 2018, speck for Thomas Gleixner wrote:
>>> On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
>>>> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
>>>>> The microcode trick just makes it a lot easier because we don't
>>>>> have to *explicitly* pause the sibling vCPUs and manage their state on
>>>>> every vmexit/entry. And avoids potential race conditions with managing
>>>>> that in software.
>>>>
>>>> Yes, it would certainly help and avoid a fair bit of ugly. It would, for
>>>> instance, avoid having to modify irq_enter() / irq_exit(), which would
>>>> otherwise be required (and possibly leak all data touched up until that
>>>> point is reached).
>>>>
>>>> But even with all that, adding L1-flush to every VMENTER will hurt lots.
>>>> Consider for example the PIO emulation used when booting a guest from a
>>>> disk image. That causes VMEXIT/VMENTER at stupendous rates.
>>>
>>> Just did a test on SKL Client where I have ucode. It does not have HT so
>>> it's not suffering from any HT side effects when L1D is flushed.
>>>
>>> Boot time from a disk image is ~1s measured from the first vcpu enter.
>>>
>>> With L1D Flush on vmenter the boot time is about 5-10% slower. And that has
>>> lots of PIO operations in the early boot.
>>>
>>> For a kernel build the L1D Flush has an overhead of < 1%.
>>>
>>> Netperf guest to host has a slight drop of the throughput in the 2%
>>> range. Host to guest surprisingly goes up by ~3%. Fun stuff!
>>>
>>> Now I isolated two host CPUs and pinned the two vCPUs on them to be able to
>>> measure the overhead. Running cyclictest with a period of 25us in the guest
>>> on an isolated guest CPU and monitoring the behaviour with perf on the host
>>> for the corresponding host CPU gives
>>>
>>>             No Flush                         Flush
>>>
>>>   1.31 insn per cycle              1.14 insn per cycle
>>>
>>>   2e6  L1-dcache-load-misses/sec   26e6 L1-dcache-load-misses/sec
>>>
>>> In that simple test the L1D misses go up by a factor of 13.
>>>
>>> Now with the whole gang scheduling the numbers I heard through the
>>> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
>>> disk image. 13 minutes instead of 6 seconds...
>
> The performance is highly dependent on how often we VM exit.
> Working with Peter Z on his prototype, the performance ranges from
> no regression for a network loopback, through ~20% regression for a kernel
> compile, to ~100% regression on file IO. PIO brings out the worst aspect
> of the synchronization overhead, as we VM exit on every dword PIO read in; the
> kernel and initrd images were about 50 MB for the experiment, which led to
> 13 min of load time.
>
> We may need to do the co-scheduling only when the VM exit rate is low, and
> turn off SMT when the VM exit rate becomes too high.
>
> (Note: I haven't added in the L1 flush on VM entry for my experiment; that is on
> the todo list.)
As a postscript, I added in the L1 flush and the performance numbers
stayed pretty much the same. So the synchronization overhead is
dominant and the L1 flush overhead is secondary.
Tim
* Re: L1D-Fault KVM mitigation
2018-05-24 23:18 ` [MODERATED] Encrypted Message Tim Chen
2018-05-25 18:22 ` Tim Chen
@ 2018-05-26 19:14 ` Thomas Gleixner
2018-05-29 19:29 ` [MODERATED] Encrypted Message Tim Chen
1 sibling, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2018-05-26 19:14 UTC (permalink / raw)
To: speck
On Thu, 24 May 2018, speck for Tim Chen wrote:
>> Now with the whole gang scheduling the numbers I heard through the
>> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
>> disk image. 13 minutes instead of 6 seconds...
> The performance is highly dependent on how often we VM exit.
That's pretty obvious.
> Working with Peter Z on his prototype, the performance ranges from
> no regression for a network loopback, through ~20% regression for a kernel
> compile, to ~100% regression on file IO.
These numbers are not that interesting when you do not provide comparisons
vs. single threaded. See below.
> PIO brings out the worst aspect of the synchronization overhead, as we VM
> exit on every dword PIO read in; the kernel and initrd images were
> about 50 MB for the experiment, which led to 13 min of load time.
>
> We may need to do the co-scheduling only when the VM exit rate is low, and
> turn off SMT when the VM exit rate becomes too high.
You cannot do that during runtime. That will destroy placement schemes and
whatever. The SMT off decision needs to be done at a quiescent moment,
i.e. before starting VMs.
The PIO case _IS_ interesting because it highlights the problem with the
synchronization overhead. And it does not matter at all whether you VMEXIT
because of a PIO access or due to any other reason. So even if you optimize
it, you still have a gazillion vm_exits on boot. The simple boot
tests I did have ~250k vm_exits in 5 seconds and only half of them are PIO.
Removing the PIO access makes the boot faster because you avoid 50% of the
vmexits, but the rest of the vmexits will still get a massive overhead,
unless you have a scenario where two vCPUs of a guest are runnable and
ready to enter at the same time and vmexit at the same time. Any other
scenario will lose due to the busy waiting synchronization overhead. Just
look at traces and do the math.
I did the following test:
- Two CPUs (siblings) on the host (HSW-EX) fully isolated
- One guest with two vCPUs affine to the isolated host CPUs. idle=poll on
the guest command line to avoid the single vCPU case.
- No L1 Flush
- Running a kernel compile on the guest in the regular virtio disk backed
filesystem. Modified the build script to stop before the final linkage
because that is single threaded.
Time: 88 seconds
vmexits: vCPU0   86,218
         vCPU1   85,703
         total  171,921
That's about 2 vmexits per ms.
Running the same compile single threaded (offlining vCPU1 in the guest)
increases the time to 107 seconds.
107 / 88 = 1.22
I.e. it's 20% slower than the one using two threads. That means that it is
the same slowdown as having two threads synchronized (your number).
So if I take the above example and assume that the overhead of
synchronization is ~20% then the average vmenter/vmexit time is close to
50us.
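Spelling that out: 0.20 * 88 s ~= 17.6 s of synchronization overhead spread
over the 171,921 vmexits above, which is ~100 us per vmexit/vmenter round
trip, i.e. ~50 us for each transition.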
Next I did an experiment with synchronizing the vmenter/vmexit. It's
probably more stupid than what you have as the overhead I observe is way
higher, but then I don't know how and what you tested exactly, so it's hard
to compare.
Nevertheless it gave me very interesting insights via tracing the
synchronization mechanics. The interesting thing is that halfway
synchronous vmexits on both vCPUs are rather cheap. The slightly async ones
make the big difference and at some points in the trace the stuff starts to
ping pong in and out of guest mode without really making progress for a
while. So there is not only the overhead itself, it's timing-dependent
overhead which can accumulate rather fast. And there is absolutely nothing
you can do about that.
So I can see the usefulness for scenarios which David Woodhouse described,
where vCPU and host CPU have a fixed relationship and the guests exit once
in a while. But that should really be done with ucode assistance which
avoids all the nasty synchronization hackery more or less completely.
But if anyone believes that the gang scheduling scheme with full software
synchronization can be applied to random usecases, then he's probably
working for the marketing department and authoring the L1 terminal fuckup
press release and whitepaper.
I'm surely open to a surprisingly clever trick which makes this all work, but
I certainly won't hold my breath.
Thanks,
tglx
* [MODERATED] Encrypted Message
2018-05-26 19:14 ` L1D-Fault KVM mitigation Thomas Gleixner
@ 2018-05-29 19:29 ` Tim Chen
2018-05-29 21:14 ` L1D-Fault KVM mitigation Thomas Gleixner
0 siblings, 1 reply; 89+ messages in thread
From: Tim Chen @ 2018-05-29 19:29 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: L1D-Fault KVM mitigation
On 05/26/2018 12:14 PM, speck for Thomas Gleixner wrote:
> On Thu, 24 May 2018, speck for Tim Chen wrote:
>
>>
>> We may need to do the co-scheduling only when the VM exit rate is low, and
>> turn off SMT when the VM exit rate becomes too high.
>
> You cannot do that during runtime. That will destroy placement schemes and
> whatever. The SMT off decision needs to be done at a quiescent moment,
> i.e. before starting VMs.
Taking SMT offline is a bit much and too big a hammer. Andi and I thought about
having the scheduler force the other thread idle instead for the high
VM exit rate scenario. We wouldn't have
to bother about syncing with the other, idle thread.
But we have fairness issues, as we would be starving the
other run queue.
>
> Running the same compile single threaded (offlining vCPU1 in the guest)
> increases the time to 107 seconds.
>
> 107 / 88 = 1.22
>
> I.e. it's 20% slower than the one using two threads. That means that it is
> the same slowdown as having two threads synchronized (your number).
Yes, with the compile workload, the HT speedup was mostly eaten up by
the overhead.
>
> So if I take the above example and assume that the overhead of
> synchronization is ~20% then the average vmenter/vmexit time is close to
> 50us.
>
>
> So I can see the usefulness for scenarios which David Woodhouse described,
> where vCPU and host CPU have a fixed relationship and the guests exit once
> in a while. But that should really be done with ucode assistance which
> avoids all the nasty synchronization hackery more or less completely.
The ucode guys are looking into such possibilities. It is tough, as they
have to work within the constraints of limited ucode headroom.
Thanks.
Tim
* Re: L1D-Fault KVM mitigation
2018-05-29 19:29 ` [MODERATED] Encrypted Message Tim Chen
@ 2018-05-29 21:14 ` Thomas Gleixner
2018-05-30 16:38 ` [MODERATED] Encrypted Message Tim Chen
0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2018-05-29 21:14 UTC (permalink / raw)
To: speck
On Tue, 29 May 2018, speck for Tim Chen wrote:
> On 05/26/2018 12:14 PM, speck for Thomas Gleixner wrote:
> > On Thu, 24 May 2018, speck for Tim Chen wrote:
> >
> > > We may need to do the co-scheduling only when the VM exit rate is low, and
> > > turn off SMT when the VM exit rate becomes too high.
> >
> > You cannot do that during runtime. That will destroy placement schemes and
> > whatever. The SMT off decision needs to be done at a quiescent moment,
> > i.e. before starting VMs.
> Taking SMT offline is a bit much and too big a hammer.
Sorry, that's bullshit. It massively depends on the workload and the
scenario. I've explained it a gazillion times by now that there are enough
workloads which will massively lose with SMT on and the extra overhead. It's
trivial enough to figure that out without implementing all the bells and
whistles.
> Andi and I thought about having the scheduler force the other thread
> idle instead for the high VM exit rate scenario. We wouldn't have to bother
> about syncing with the other, idle thread.
You still have to make sure that the other idle thread _IS_ idle. It's not
the full synchronization scheme, but it's extra work in a hotpath when the
guest is exit heavy. And you still have the problem of interrupt and
softirqs being served on the 'idle' sibling. It's not that simple.
> But we have fairness issues, as we would be starving the
> other run queue.
That's more than obvious. And you will create even worse issues because
workloads which have a placement scheme, i.e. vCPU affinities, will have no
chance to migrate to another CPU. Not to mention wrecking the load
balancer completely.
> > I.e. it's 20% slower than the one using two threads. That means that it is
> > the same slowdown as having two threads synchronized (your number).
> Yes, with the compile workload, the HT speedup was mostly eaten up by
> the overhead.
So what is the point of the exercise?
You will not find a generic solution for this problem ever simply because
the workloads and guest scenarios are too different. There are clearly
scenarios which can benefit, but at the same time there are scenarios which
will be way worse off than with SMT disabled.
I completely understand that Intel wants to avoid the 'disable SMT'
solution by all means, but this cannot be done with something which is
obviously creating more problems than it solves in the first place.
At some point reality has to kick in and you have to admit that there is no
generic solution and the only solution for a lot of use cases will be to
disable SMT. Special workloads like the fully partitioned
ones David mentioned do not need the extra mess all over the place,
especially not when there is ucode assist, at least to the extent that it fits
into the patch space. Some of it really should not take a huge amount of
effort, like the forced sibling vmexit to avoid the whole IPI machinery.
Thanks,
tglx
* [MODERATED] Encrypted Message
2018-05-29 21:14 ` L1D-Fault KVM mitigation Thomas Gleixner
@ 2018-05-30 16:38 ` Tim Chen
0 siblings, 0 replies; 89+ messages in thread
From: Tim Chen @ 2018-05-30 16:38 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: L1D-Fault KVM mitigation
On 05/29/2018 02:14 PM, speck for Thomas Gleixner wrote:
>
>> Yes, with the compile workload, the HT speedup was mostly eaten up by
>> the overhead.
>
> So what is the point of the exercise?
>
> You will not find a generic solution for this problem ever simply because
> the workloads and guest scenarios are too different. There are clearly
> scenarios which can benefit, but at the same time there are scenarios which
> will be way worse off than with SMT disabled.
>
> I completely understand that Intel wants to avoid the 'disable SMT'
> solution by all means, but this cannot be done with something which is
> obviously creating more problems than it solves in the first place.
>
> At some point reality has to kick in and you have to admit that there is no
> generic solution and the only solution for a lot of use cases will be to
> disable SMT. Special workloads like the fully partitioned
> ones David mentioned do not need the extra mess all over the place,
> especially not when there is ucode assist, at least to the extent that it fits
> into the patch space. Some of it really should not take a huge amount of
> effort, like the forced sibling vmexit to avoid the whole IPI machinery.
>
Having to sync on VM entry, on VM exit, and on interrupts to the idle sibling
sucks. Hopefully the ucode guys can come up with something
to provide an option that forces the sibling to vmexit on vmexit,
and on interrupts to the idle sibling. This should cut the sync overhead in half.
Then only VM entry would need to be synced, should we still want to
do co-scheduling.
Thanks.
Tim
end of thread, other threads: [~2019-03-08 6:37 UTC | newest]
Thread overview: 89+ messages
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
2019-03-01 21:47 ` [patch V6 01/14] MDS basics 1 Thomas Gleixner
2019-03-02 0:06 ` [MODERATED] " Frederic Weisbecker
2019-03-01 21:47 ` [patch V6 02/14] MDS basics 2 Thomas Gleixner
2019-03-02 0:34 ` [MODERATED] " Frederic Weisbecker
2019-03-02 8:34 ` Greg KH
2019-03-05 17:54 ` Borislav Petkov
2019-03-01 21:47 ` [patch V6 03/14] MDS basics 3 Thomas Gleixner
2019-03-02 1:12 ` [MODERATED] " Frederic Weisbecker
2019-03-01 21:47 ` [patch V6 04/14] MDS basics 4 Thomas Gleixner
2019-03-02 1:28 ` [MODERATED] " Frederic Weisbecker
2019-03-05 14:52 ` Thomas Gleixner
2019-03-06 20:00 ` [MODERATED] " Andrew Cooper
2019-03-06 20:32 ` Thomas Gleixner
2019-03-07 23:56 ` [MODERATED] " Andi Kleen
2019-03-08 0:36 ` Linus Torvalds
2019-03-01 21:47 ` [patch V6 05/14] MDS basics 5 Thomas Gleixner
2019-03-02 1:37 ` [MODERATED] " Frederic Weisbecker
2019-03-07 23:59 ` Andi Kleen
2019-03-08 6:37 ` Thomas Gleixner
2019-03-01 21:47 ` [patch V6 06/14] MDS basics 6 Thomas Gleixner
2019-03-04 6:28 ` [MODERATED] Encrypted Message Jon Masters
2019-03-05 14:55 ` Thomas Gleixner
2019-03-01 21:47 ` [patch V6 07/14] MDS basics 7 Thomas Gleixner
2019-03-02 2:22 ` [MODERATED] " Frederic Weisbecker
2019-03-05 15:30 ` Thomas Gleixner
2019-03-06 15:49 ` [MODERATED] " Frederic Weisbecker
2019-03-06 5:21 ` Borislav Petkov
2019-03-01 21:47 ` [patch V6 08/14] MDS basics 8 Thomas Gleixner
2019-03-03 2:54 ` [MODERATED] " Frederic Weisbecker
2019-03-04 6:57 ` [MODERATED] Encrypted Message Jon Masters
2019-03-04 7:06 ` Jon Masters
2019-03-04 8:12 ` Jon Masters
2019-03-05 15:34 ` Thomas Gleixner
2019-03-06 16:21 ` [MODERATED] " Jon Masters
2019-03-06 14:11 ` [MODERATED] Re: [patch V6 08/14] MDS basics 8 Borislav Petkov
2019-03-01 21:47 ` [patch V6 09/14] MDS basics 9 Thomas Gleixner
2019-03-06 16:14 ` [MODERATED] " Frederic Weisbecker
2019-03-01 21:47 ` [patch V6 10/14] MDS basics 10 Thomas Gleixner
2019-03-04 6:45 ` [MODERATED] Encrypted Message Jon Masters
2019-03-05 18:42 ` [MODERATED] Re: [patch V6 10/14] MDS basics 10 Andrea Arcangeli
2019-03-06 19:15 ` Thomas Gleixner
2019-03-06 14:31 ` [MODERATED] " Borislav Petkov
2019-03-06 15:30 ` Thomas Gleixner
2019-03-06 18:35 ` Thomas Gleixner
2019-03-06 19:34 ` [MODERATED] Re: " Borislav Petkov
2019-03-01 21:47 ` [patch V6 11/14] MDS basics 11 Thomas Gleixner
2019-03-01 21:47 ` [patch V6 12/14] MDS basics 12 Thomas Gleixner
2019-03-04 5:47 ` [MODERATED] Encrypted Message Jon Masters
2019-03-05 16:04 ` Thomas Gleixner
2019-03-05 16:40 ` [MODERATED] Re: [patch V6 12/14] MDS basics 12 mark gross
2019-03-06 14:42 ` Borislav Petkov
2019-03-01 21:47 ` [patch V6 13/14] MDS basics 13 Thomas Gleixner
2019-03-03 4:01 ` [MODERATED] " Josh Poimboeuf
2019-03-05 16:04 ` Thomas Gleixner
2019-03-05 16:43 ` [MODERATED] " mark gross
2019-03-01 21:47 ` [patch V6 14/14] MDS basics 14 Thomas Gleixner
2019-03-01 23:48 ` [patch V6 00/14] MDS basics 0 Thomas Gleixner
2019-03-04 5:30 ` [MODERATED] Encrypted Message Jon Masters
-- strict thread matches above, loose matches on Subject: below --
2019-03-05 16:43 [MODERATED] Starting to go public? Linus Torvalds
2019-03-05 17:02 ` [MODERATED] " Andrew Cooper
2019-03-05 20:36 ` Jiri Kosina
2019-03-05 22:31 ` Andrew Cooper
2019-03-06 16:18 ` [MODERATED] Encrypted Message Jon Masters
2019-03-05 17:10 ` Jon Masters
2019-03-04 1:21 [MODERATED] [PATCH RFC 0/4] Proposed cmdline improvements Josh Poimboeuf
2019-03-04 1:23 ` [MODERATED] [PATCH RFC 1/4] 1 Josh Poimboeuf
2019-03-04 3:55 ` [MODERATED] Encrypted Message Jon Masters
2019-03-04 7:30 ` [MODERATED] Re: [PATCH RFC 1/4] 1 Greg KH
2019-03-04 7:45 ` [MODERATED] Encrypted Message Jon Masters
2019-03-04 1:24 ` [MODERATED] [PATCH RFC 3/4] 3 Josh Poimboeuf
2019-03-04 3:58 ` [MODERATED] Encrypted Message Jon Masters
2019-03-04 17:17 ` [MODERATED] " Josh Poimboeuf
2019-03-06 16:22 ` [MODERATED] " Jon Masters
2019-03-04 1:25 ` [MODERATED] [PATCH RFC 4/4] 4 Josh Poimboeuf
2019-03-04 4:07 ` [MODERATED] Encrypted Message Jon Masters
2019-02-24 15:07 [MODERATED] [PATCH v6 00/43] MDSv6 Andi Kleen
2019-02-24 15:07 ` [MODERATED] [PATCH v6 10/43] MDSv6 Andi Kleen
2019-02-25 16:30 ` [MODERATED] " Greg KH
2019-02-25 16:41 ` [MODERATED] Encrypted Message Jon Masters
2019-02-24 15:07 ` [MODERATED] [PATCH v6 31/43] MDSv6 Andi Kleen
2019-02-25 15:19 ` [MODERATED] " Greg KH
2019-02-25 15:34 ` Andi Kleen
2019-02-25 15:49 ` Greg KH
2019-02-25 15:52 ` [MODERATED] Encrypted Message Jon Masters
2019-02-25 16:00 ` [MODERATED] " Greg KH
2019-02-25 16:19 ` [MODERATED] " Jon Masters
2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
2019-02-22 22:24 ` [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() Thomas Gleixner
2019-02-26 14:19 ` [MODERATED] " Josh Poimboeuf
2019-03-01 20:58 ` [MODERATED] Encrypted Message Jon Masters
2019-03-01 22:14 ` Jon Masters
2019-02-21 23:44 [patch V3 0/9] MDS basics 0 Thomas Gleixner
2019-02-21 23:44 ` [patch V3 4/9] MDS basics 4 Thomas Gleixner
2019-02-22 7:45 ` [MODERATED] Encrypted Message Jon Masters
2019-02-20 15:07 [patch V2 00/10] MDS basics+ 0 Thomas Gleixner
2019-02-20 15:07 ` [patch V2 04/10] MDS basics+ 4 Thomas Gleixner
2019-02-20 17:10 ` [MODERATED] " mark gross
2019-02-21 19:26 ` [MODERATED] Encrypted Message Tim Chen
2019-02-19 12:44 [patch 0/8] MDS basics 0 Thomas Gleixner
2019-02-21 16:14 ` [MODERATED] Encrypted Message Jon Masters
2019-02-07 23:41 [MODERATED] [PATCH v3 0/6] PERFv3 Andi Kleen
2019-02-07 23:41 ` [MODERATED] [PATCH v3 2/6] PERFv3 Andi Kleen
2019-02-08 0:51 ` [MODERATED] Re: [SUSPECTED SPAM][PATCH " Andrew Cooper
2019-02-08 9:01 ` Peter Zijlstra
2019-02-08 9:39 ` Peter Zijlstra
2019-02-08 10:53 ` [MODERATED] [RFC][PATCH] performance walnuts Peter Zijlstra
2019-02-15 23:45 ` [MODERATED] Encrypted Message Jon Masters
2019-01-12 1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
2019-01-12 1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
2019-01-14 19:20 ` [MODERATED] " Dave Hansen
2019-01-18 7:33 ` [MODERATED] Encrypted Message Jon Masters
2019-01-14 23:39 ` Tim Chen
2019-01-12 1:29 ` [MODERATED] [PATCH v4 10/28] MDSv4 24 Andi Kleen
2019-01-15 1:05 ` [MODERATED] Encrypted Message Tim Chen
2018-06-12 17:29 [MODERATED] FYI - Reading uncached memory Jon Masters
2018-06-14 16:59 ` [MODERATED] Encrypted Message Tim Chen
2018-05-29 19:42 [MODERATED] [PATCH 0/2] L1TF KVM 0 Paolo Bonzini
[not found] ` <20180529194240.7F1336110A@crypto-ml.lab.linutronix.de>
2018-05-29 22:49 ` [PATCH 1/2] L1TF KVM 1 Thomas Gleixner
2018-05-29 23:54 ` [MODERATED] " Andrew Cooper
2018-05-30 9:01 ` Paolo Bonzini
2018-06-04 8:24 ` [MODERATED] " Martin Pohlack
2018-06-04 13:11 ` [MODERATED] Is: Tim, Q to you. Was:Re: " Konrad Rzeszutek Wilk
2018-06-04 17:59 ` [MODERATED] Encrypted Message Tim Chen
2018-06-05 23:34 ` Tim Chen
2018-06-05 23:37 ` Tim Chen
2018-06-07 19:11 ` Tim Chen
2018-05-17 20:53 SSB status - V18 pushed out Thomas Gleixner
2018-05-18 13:54 ` [MODERATED] Is: Sleep states ?Was:Re: " Konrad Rzeszutek Wilk
2018-05-18 14:29 ` Thomas Gleixner
2018-05-18 19:50 ` [MODERATED] Encrypted Message Tim Chen
2018-05-02 21:51 [patch V11 00/16] SSB 0 Thomas Gleixner
2018-05-03 4:27 ` [MODERATED] Encrypted Message Tim Chen
2018-04-24 9:06 [MODERATED] L1D-Fault KVM mitigation Joerg Roedel
2018-04-24 9:35 ` [MODERATED] " Peter Zijlstra
2018-04-24 9:48 ` David Woodhouse
2018-04-24 11:04 ` Peter Zijlstra
2018-05-23 9:45 ` David Woodhouse
2018-05-24 9:45 ` Peter Zijlstra
2018-05-24 15:04 ` Thomas Gleixner
2018-05-24 15:33 ` Thomas Gleixner
2018-05-24 23:18 ` [MODERATED] Encrypted Message Tim Chen
2018-05-25 18:22 ` Tim Chen
2018-05-26 19:14 ` L1D-Fault KVM mitigation Thomas Gleixner
2018-05-29 19:29 ` [MODERATED] Encrypted Message Tim Chen
2018-05-29 21:14 ` L1D-Fault KVM mitigation Thomas Gleixner
2018-05-30 16:38 ` [MODERATED] Encrypted Message Tim Chen