All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch 0/2] Command line and documentation 0
@ 2018-07-08 12:52 Thomas Gleixner
  2018-07-08 12:52 ` [patch 1/2] Command line and documentation 1 Thomas Gleixner
                   ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-08 12:52 UTC (permalink / raw)
  To: speck

From: Thomas Gleixner <tglx@linutronix.de>
Subject: [patch 0/2] L!TF: Command line param update plus documentation

I've updated Jiri's patch to address the issues I noticed during review and
created a RST based document which tries to give a consice picture of the
problem.

Note, that the Documentation does not build on -rc1, so you need to merge
the pile into rc2 at least. I'll send out a git bundle with these changes
as a reply.

Thanks.

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 1/2] Command line and documentation 1
  2018-07-08 12:52 [patch 0/2] Command line and documentation 0 Thomas Gleixner
@ 2018-07-08 12:52 ` Thomas Gleixner
  2018-07-08 14:00   ` [MODERATED] " Josh Poimboeuf
                     ` (3 more replies)
  2018-07-08 12:52 ` [patch 2/2] Command line and documentation 2 Thomas Gleixner
  2018-07-08 13:11 ` [patch 0/2] Command line and documentation 0 Thomas Gleixner
  2 siblings, 4 replies; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-08 12:52 UTC (permalink / raw)
  To: speck

From: Jiri Kosina <jkosina@suse.cz>
Subject: [patch 1/2] x86/bugs, kvm: introduce boot-time control of L1TF mitigations

Introduce 'l1tf=' kernel commad line option to allow for boot-time
switching of mitigation that is used on processors affected by L1TF.

The possible values are:

  full
	Provide all available mitigations for the L1TF
	vulnerability. Disable SMT and enable all mitigations
	in the hypervisors. SMT control is still possible
	after boot.

  full,force
	Same as 'full', but disables SMT control. Implies the
	'nosmt=force' command line option.

  novirt
	Leaves SMT enabled and does not enable the hypervisor
	mitigation. Hypervisors will issue a warning when the first VM is
	being started in a potentially insecure configuration, i.e. SMT
	enabled or L1D flush disabled.

 novirt,nowarn
	Same as 'novirt', but hypervisors will not warn when
	a VM is started in a potentially insecure configuration.

Default is 'novirt'.

Let KVM adhere to these semantics, which means:

  - 'lt1f=full,force'	: start performing L1D flushes
  - 'l1tf=full'		: start performing L1D flushes,
			  warn on VM start if SMT has
			  been runtime enabled
  - 'l1tf=novirt'	: warn on first VM start
  - 'l1tf=novirt,nowarn': no extra action is taken

This makes the KVM's private 'nosmt' option redundant, and as it is a bit
non-systematic anyway (this is something to control globally, not on
hypervisor level). Remove that option.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---

v6->v7:
	- Fixed the CVE number
	- Slightly reworded the parameter description
	- Take the l1tf command line parameter into account when
	  initializing the VMX L1TF mitigation and expose the
	  vmx mitigation state to the core.
	- Make the sysfs l1tf file show the VMX mitigation state
	  in detail.

v5->v6:
	- 'full' implies 'nosmt', 'full,force' implies nosmt=force;
	  print KVM warnings accordingly (one state more, and having
	  bitflags would be needed for clarity)
	- now that we have full and full,force, drop KVM's private
	  nosmt option
	- drop compile-time option to chose the default default :)
	- typo/grammar fixes

v4->v5:
	- rebase on top of KVM bundle

v3->v4:
        - unconfuse the meaning of 'off', both in the documentation and in 
          the code (spotted by Josh)

v2->v3:
        - provide l1tf=[full,novirt,off]
        - provide config option to chose the default
        - let KVM warn in novirt case

v1->v2: add forgotten dependency on X86_BUG_L1TF


 Documentation/admin-guide/kernel-parameters.txt |   38 +++++++++-
 arch/x86/include/asm/processor.h                |   10 ++
 arch/x86/include/asm/vmx.h                      |    9 ++
 arch/x86/kernel/cpu/bugs.c                      |   84 +++++++++++++++++++++++-
 arch/x86/kvm/vmx.c                              |   58 ++++++++++++----
 include/linux/cpu.h                             |    2 
 kernel/cpu.c                                    |    9 ++
 7 files changed, 183 insertions(+), 27 deletions(-)

--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1946,12 +1946,6 @@
 			[KVM,ARM] Allow use of GICv4 for direct injection of
 			LPIs.
 
-	kvm-intel.nosmt=[KVM,Intel] If the L1TF CPU bug is present (CVE-2018-3620)
-			and the system has SMT (aka Hyper-Threading) enabled then
-			don't allow guests to be created.
-
-			Default is 0 (allow guests to be created).
-
 	kvm-intel.ept=	[KVM,Intel] Disable extended page tables
 			(virtualized MMU) support on capable Intel chips.
 			Default is 1 (enabled)
@@ -1989,6 +1983,38 @@
 			feature (tagged TLBs) on capable Intel chips.
 			Default is 1 (enabled)
 
+	l1tf=           [X86] Control mitigation of the L1TF vulnerability on
+			      affected CPUs
+
+			The kernel PTE inversion protection is unconditionally
+			enabled and cannot be controlled.
+
+			full
+				Provide all available mitigations for the
+				L1TF vulnerability. Disable SMT and enable
+				all mitigations in the hypervisors. SMT
+				control is still possible after boot.
+
+			full,force
+				Same as 'full', but disables SMT
+				control. Implies the 'nosmt=force' command
+				line option.
+
+			novirt
+				Leaves SMT enabled and does not enable the
+				hypervisor mitigation. Hypervisors will
+				issue a warning when the first VM is being
+				started in a potentially insecure
+				configuration, i.e. SMT enabled or L1D
+				flush disabled.
+
+			novirt,nowarn
+				Same as 'novirt', but hypervisors will not
+				warn when a VM is started in a potentially
+				insecure configuration.
+
+			Default is 'novirt'.
+
 	l2cr=		[PPC]
 
 	l3cr=		[PPC]
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -982,4 +982,14 @@ bool xen_set_default_idle(void);
 void stop_this_cpu(void *dummy);
 void df_debug(struct pt_regs *regs, long error_code);
 void microcode_check(void);
+
+enum l1tf_mitigations {
+	L1TF_MITIGATION_NOVIRT_NOWARN,
+	L1TF_MITIGATION_NOVIRT,
+	L1TF_MITIGATION_FULL,
+	L1TF_MITIGATION_FULL_FORCE
+};
+
+extern enum l1tf_mitigations l1tf_mitigation;
+
 #endif /* _ASM_X86_PROCESSOR_H */
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -573,4 +573,13 @@ enum vm_instruction_error_number {
 	VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,
 };
 
+/* These MUST be in sync with vmentry_l1d_param order. */
+enum vmx_l1d_flush_state {
+	VMENTER_L1D_FLUSH_NEVER,
+	VMENTER_L1D_FLUSH_COND,
+	VMENTER_L1D_FLUSH_ALWAYS,
+};
+
+extern enum vmx_l1d_flush_state l1tf_vmx_mitigation;
+
 #endif
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -23,6 +23,7 @@
 #include <asm/fpu/internal.h>
 #include <asm/msr.h>
 #include <asm/paravirt.h>
+#include <asm/vmx.h>
 #include <asm/alternative.h>
 #include <asm/pgtable.h>
 #include <asm/set_memory.h>
@@ -657,6 +658,20 @@ void x86_spec_ctrl_setup_ap(void)
 
 #undef pr_fmt
 #define pr_fmt(fmt)	"L1TF: " fmt
+
+
+/* Default mitigation for L1TF-affected CPUs */
+enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_NOVIRT;
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+EXPORT_SYMBOL_GPL(l1tf_mitigation);
+
+/* Initialize it to unknown which indicates that VMX has not been initialized */
+#define L1TF_VMX_UNKNOWN	INT_MAX
+
+enum vmx_l1d_flush_state l1tf_vmx_mitigation = L1TF_VMX_UNKNOWN;
+EXPORT_SYMBOL_GPL(l1tf_vmx_mitigation);
+#endif
+
 static void __init l1tf_select_mitigation(void)
 {
 	u64 half_pa;
@@ -664,6 +679,18 @@ static void __init l1tf_select_mitigatio
 	if (!boot_cpu_has_bug(X86_BUG_L1TF))
 		return;
 
+	switch (l1tf_mitigation) {
+	case L1TF_MITIGATION_FULL_FORCE:
+		cpu_smt_disable(true);
+		break;
+	case L1TF_MITIGATION_FULL:
+		cpu_smt_disable(false);
+		break;
+	case L1TF_MITIGATION_NOVIRT:
+	case L1TF_MITIGATION_NOVIRT_NOWARN:
+		break;
+	}
+
 #if CONFIG_PGTABLE_LEVELS == 2
 	pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n");
 	return;
@@ -682,10 +709,63 @@ static void __init l1tf_select_mitigatio
 
 	setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV);
 }
+
+static int __init l1tf_cmdline(char *str)
+{
+	if (!boot_cpu_has_bug(X86_BUG_L1TF))
+		return 0;
+
+	if (!str)
+		return 0;
+
+	if (!strcmp(str, "full"))
+		l1tf_mitigation = L1TF_MITIGATION_FULL;
+	else if (!strcmp(str, "full,force"))
+		l1tf_mitigation = L1TF_MITIGATION_FULL_FORCE;
+	else if (!strcmp(str, "novirt"))
+		l1tf_mitigation = L1TF_MITIGATION_NOVIRT;
+	else if (!strcmp(str, "novirt,nowarn"))
+		l1tf_mitigation = L1TF_MITIGATION_NOVIRT_NOWARN;
+
+	return 0;
+}
+early_param("l1tf", l1tf_cmdline);
+
 #undef pr_fmt
 
 #ifdef CONFIG_SYSFS
 
+static const char *l1tf_states[] = {
+	[L1TF_MITIGATION_FULL]		= "Mitigation: Full",
+	[L1TF_MITIGATION_FULL_FORCE]	= "Mitigation: Full [force]",
+	[L1TF_MITIGATION_NOVIRT]	= "Mitigation: Page Table Inversion",
+	[L1TF_MITIGATION_NOVIRT_NOWARN]	= "Mitigation: Page Table Inversion"
+};
+
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+static const char *l1tf_vmx_states[] = {
+	[VMENTER_L1D_FLUSH_NEVER]	= "vulnerable",
+	[VMENTER_L1D_FLUSH_COND]	= "mostly protected",
+	[VMENTER_L1D_FLUSH_ALWAYS]	= "fully protected"
+};
+
+static ssize_t l1tf_show_state(char *buf)
+{
+	if (l1tf_vmx_mitigation == L1TF_VMX_UNKNOWN)
+		return sprintf(buf, "%s\n", l1tf_states[l1tf_mitigation]);
+
+	return sprintf(buf, "%s, VMX: SMT %s L1D %s\n",
+		       l1tf_states[l1tf_mitigation],
+		       cpu_smt_control == CPU_SMT_ENABLED ? "vulnerable" : "disabled",
+		       l1tf_vmx_states[l1tf_vmx_mitigation]);
+}
+#else
+static ssize_t l1tf_show_state(char *buf)
+{
+	return sprintf(buf, "%s\n", l1tf_states[l1tf_mitigation]);
+}
+#endif
+
 static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr,
 			       char *buf, unsigned int bug)
 {
@@ -712,9 +792,7 @@ static ssize_t cpu_show_common(struct de
 		return sprintf(buf, "%s\n", ssb_strings[ssb_mode]);
 
 	case X86_BUG_L1TF:
-		if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
-			return sprintf(buf, "Mitigation: Page Table Inversion\n");
-		break;
+		return l1tf_show_state(buf);
 
 	default:
 		break;
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -71,9 +71,6 @@ static const struct x86_cpu_id vmx_cpu_i
 };
 MODULE_DEVICE_TABLE(x86cpu, vmx_cpu_id);
 
-static bool __read_mostly nosmt;
-module_param(nosmt, bool, S_IRUGO);
-
 static bool __read_mostly enable_vpid = 1;
 module_param_named(vpid, enable_vpid, bool, 0444);
 
@@ -193,13 +190,6 @@ extern const ulong vmx_return;
 
 static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
 
-/* These MUST be in sync with vmentry_l1d_param order. */
-enum vmx_l1d_flush_state {
-	VMENTER_L1D_FLUSH_NEVER,
-	VMENTER_L1D_FLUSH_COND,
-	VMENTER_L1D_FLUSH_ALWAYS,
-};
-
 static enum vmx_l1d_flush_state __read_mostly vmentry_l1d_flush = VMENTER_L1D_FLUSH_COND;
 
 static const struct {
@@ -10539,19 +10529,41 @@ static struct kvm_vcpu *vmx_create_vcpu(
 	return ERR_PTR(err);
 }
 
-#define L1TF_MSG "SMT enabled with L1TF CPU bug present. Refer to CVE-2018-3620 for details.\n"
+#define L1TF_MSG_NOVIRT "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. Refer to CVE-2018-3646 for details.\n"
+#define L1TF_MSG_SMT "L1TF CPU bug present and SMT enabled, data leak possible. Refer to CVE-2018-3646 for details.\n"
 
 static int vmx_vm_init(struct kvm *kvm)
 {
 	if (!ple_gap)
 		kvm->arch.pause_in_guest = true;
 
-	if (boot_cpu_has(X86_BUG_L1TF) && cpu_smt_control == CPU_SMT_ENABLED) {
-		if (nosmt) {
-			pr_err(L1TF_MSG);
-			return -EOPNOTSUPP;
+	if (boot_cpu_has(X86_BUG_L1TF)) {
+		switch (l1tf_mitigation) {
+			case L1TF_MITIGATION_NOVIRT_NOWARN:
+				/* The 'I explicitly don't care' flag is set */
+				break;
+			case L1TF_MITIGATION_NOVIRT:
+				/*
+				 * Warn upon starting the first VM in a
+				 * potentially insecure environment.
+				 */
+				if (cpu_smt_control == CPU_SMT_ENABLED ||
+				    vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER)
+					pr_warn_once(L1TF_MSG_NOVIRT);
+				break;
+			case L1TF_MITIGATION_FULL:
+				/*
+				 * Warn if SMT has been runtime-enabled. We can't
+				 * really warn only once here, as SMT state is
+				 * not constant.
+				 */
+				if (cpu_smt_control == CPU_SMT_ENABLED)
+					pr_warn(L1TF_MSG_SMT);
+				break;
+			case L1TF_MITIGATION_FULL_FORCE:
+				/* All is good, proceed without hiccup */
+				break;
 		}
-		pr_warn(L1TF_MSG);
 	}
 	return 0;
 }
@@ -13235,6 +13247,20 @@ static int __init vmx_setup_l1d_flush(vo
 {
 	struct page *page;
 
+	/* Take the l1tf command line parameter into account */
+	switch (l1tf_mitigation) {
+	case L1TF_MITIGATION_FULL:
+	case L1TF_MITIGATION_FULL_FORCE:
+		/* Respect the FLUSH ALWAYS module param */
+		if (vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER)
+			vmentry_l1d_flush = VMENTER_L1D_FLUSH_COND;
+		break;
+	default:
+		break;
+	}
+
+	l1tf_vmx_mitigation = vmentry_l1d_flush;
+
 	if (vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER ||
 	    !boot_cpu_has_bug(X86_BUG_L1TF) ||
 	    vmx_l1d_use_msr_save_list())
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -177,8 +177,10 @@ enum cpuhp_smt_control {
 
 #if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
 extern enum cpuhp_smt_control cpu_smt_control;
+void cpu_smt_disable(bool force);
 #else
 # define cpu_smt_control		(CPU_SMT_ENABLED)
+static inline void cpu_smt_disable(bool force) { }
 #endif
 
 #endif /* _LINUX_CPU_H_ */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -347,13 +347,18 @@ EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
 enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
 EXPORT_SYMBOL_GPL(cpu_smt_control);
 
-static int __init smt_cmdline_disable(char *str)
+void __init cpu_smt_disable(bool force)
 {
 	cpu_smt_control = CPU_SMT_DISABLED;
-	if (str && !strcmp(str, "force")) {
+	if (force) {
 		pr_info("SMT: Force disabled\n");
 		cpu_smt_control = CPU_SMT_FORCE_DISABLED;
 	}
+}
+
+static int __init smt_cmdline_disable(char *str)
+{
+	cpu_smt_disable(str && !strcmp(str, "force"));
 	return 0;
 }
 early_param("nosmt", smt_cmdline_disable);

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 2/2] Command line and documentation 2
  2018-07-08 12:52 [patch 0/2] Command line and documentation 0 Thomas Gleixner
  2018-07-08 12:52 ` [patch 1/2] Command line and documentation 1 Thomas Gleixner
@ 2018-07-08 12:52 ` Thomas Gleixner
  2018-07-08 14:40   ` [MODERATED] " Andrew Cooper
                     ` (3 more replies)
  2018-07-08 13:11 ` [patch 0/2] Command line and documentation 0 Thomas Gleixner
  2 siblings, 4 replies; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-08 12:52 UTC (permalink / raw)
  To: speck

From: Thomas Gleixner <tglx@linutronix.de>
Subject: [patch 2/2] Documentation: Add section about CPU vulnerabilities

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 Documentation/admin-guide/index.rst |    9 
 Documentation/admin-guide/l1tf.rst  |  356 ++++++++++++++++++++++++++++++++++++
 2 files changed, 365 insertions(+)

--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -17,6 +17,15 @@ etc.
    kernel-parameters
    devices
 
+This section describes CPU vulnerabilities and provides an overview over
+the possible mitigations along with guidance for selecting mitigations if
+they are configurable at compile, boot or run time.
+
+.. toctree::
+   :maxdepth: 1
+
+   l1tf
+
 Here is a set of documents aimed at users who are trying to track down
 problems and bugs in particular.
 
--- /dev/null
+++ b/Documentation/admin-guide/l1tf.rst
@@ -0,0 +1,356 @@
+L1TF - L1 Terminal Fault
+========================
+
+L1 Terminal Fault is a hardware vulnerability which allows unconstrained
+speculative access to data which is available in the Level 1 Data Cache
+when the page table entry controlling the virtual access, which is used for
+the access, has the present bit cleared.
+
+Affected CPUs
+-------------
+
+This vulnerability affects a wide range of Intel processors. The
+vulnerability is not present on:
+
+   - Older models, where the CPU family is < 6
+
+   - A range of ATOM processors (Cedarview, Cloverview, Lincroft, Penwell,
+     Pineview, Slivermont, Airmont, Merrifield)
+
+   - The Core Duo Yonah variants (2006 - 2008)
+
+   - The XEON PHI family
+
+   - Processors which have the ARCH_CAP_RDCL_NO bit set in the
+     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is also not
+     affected by the Meltdown vulnerabitly. These CPUs should become
+     available end of 2018.
+
+Related CVEs
+------------
+
+The following CVE entries are related to the L1TF vulnerability::
+
+   CVE-2018-3615:  L1 Terminal Fault - SGX related aspects
+   CVE-2018-3620:  L1 Terminal Fault - OS, SMM related aspects
+   CVE-2018-3646:  L1 Terminal Fault - Virtualization related aspects
+
+Problem
+-------
+
+If an instruction accesses a virtual address for which the relevant page
+table entry (PTE) has the present bit cleared, then the speculative
+execution can load the data into the speculation flow when the data from
+the physical address which is referenced in the PTE address bits is
+available in the Level 1 Data Cache. This is a purely speculative
+mechanism and the instruction will raise a page fault when it is retired.
+
+This creates a window between the load and retirement where speculative
+execution can operate on the data and malicious code can use this to create
+a side channel to leak the speculated data which was accessed without
+permission.
+
+This flaw is very similar to the Meltdown vulnerability, which speculates
+on data which should be not accessible from user space because the
+speculation ignores the permission bits. Contrary to Meltdown L1TF can not
+be exploited without actually generating page faults. While Meltdown breaks
+the user space to kernel space protection, L1TF has a broader scope. It
+allows to attack any physical memory address in the system and the attack
+works across all protection domains. It allows to attack SGX and also
+works from inside virtual machines because the speculation bypasses the
+extended page table (EPT) protection mechanism.
+
+Attack scenarios
+----------------
+
+1. Malicious user space:
+
+   Operating Systems store arbitrary information in the address bits of a
+   PTE which is marked non present. This allows a malicious user space
+   application to attack the physical memory to which these PTEs resolve.
+
+   The Linux kernel contains a mitigation for this attack vector which is
+   permanently enabled and has no performance impact. A system with an up
+   to date kernel is not vulnerable as there is no way for a malicious
+   application to control the content of PTEs which are not marked present.
+
+   |
+
+2. Malicious guest in a virtual machine
+
+   The fact that L1TF breaks all domain protections allows malicious guest
+   OSes, which can control the PTEs directly, and malicious userspace,
+   which runs on an unprotected guest kernel, to attack physical host
+   memory.
+
+   A special aspect of L1TF in the context of virtualization is symmetric
+   multi threading (SMT). The Intel implementation of SMT is called
+   HyperThreading. The fact that Hyperthreads on the affected processors
+   share the L1 Data Cache (L1D) is important for this. As the flaw allows
+   only to attack data which is present in L1D, a malicious guest running
+   on one thread can attack the data which is brought into L1D by the
+   context which runs on the sibling thread of the same physical core. This
+   context can be host OS, host user space or a different guest.
+
+   While solutions exist to mitigate these attack vectors fully, these
+   mitigations are not enabled by default in the Linux kernel because they
+   can affect performance significantly. The kernel provides several
+   mechanisms which can be utilized to address the problem depending on the
+   deployment scenario.
+
+L1TF system information
+-----------------------
+
+The Linux kernel provides a sysfs interface to read out the information
+about L1TF. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/l1tf
+
+It provides information whether the system is affected by L1TF or not and
+in case it is affected it provides information about the active mitigation
+mechanisms.
+
+Host mitigation mechanism
+-------------------------
+
+The kernel is unconditionally protected against L1TF attacks from malicious
+user space running on the host.
+
+Guest mitigation mechanisms
+---------------------------
+
+1. L1D flush on VMENTER
+
+   To make sure that a guest cannot attack data which is present in L1D the
+   hypervisor flushes L1D before entering the guest.
+
+   Flushing L1D evicts not only the data which should not be accessed by a
+   potentially malicious guest, it also flushes the guest data. Flushing
+   L1D has a performance impact as the processor has to bring the flushed
+   guest data back into L1D. Depending on the frequency of VMEXIT/VMENTER
+   and the type of computations in the guest performance degradation in the
+   range of 1% to 50% has been observed. For scenarios where guest
+   VMEXIT/VMENTER are rare the performance impact is minimal. Virtio and
+   mechanisms like posted interrupts are designed to confine the VMEXITs to
+   a bare minimum, but specific configurations and application scenarios
+   might still suffer from a high VMEXIT rate.
+
+   The general recommendation is to enable L1D flush on VMENTER.
+
+   Note, that L1D flush does not prevent the SMT problem because the
+   sibling thread will also bring back its data into L1D which makes it
+   attackable again.
+
+   L1D flush can be controlled by the administrator.
+
+   |
+
+2. Guest VCPU confinement to dedicated physical cores
+
+   To address the SMT problem, it is possible to make a guest or a group of
+   guests affine to one or more physical cores. The proper mechanism for
+   that is to utilize cpusets and to make sure that no other guest or host
+   tasks can run on these cores.
+
+   If only a single guest or related guests run on sibling SMT threads on
+   the same physical core then they can only attack their own memory and
+   restricted parts of the host memory.
+
+   The host memory is attackable, when one of the sibling threads runs in
+   host OS (hypervisor) context and the other in guest context. The amount
+   of valuable information from the host OS context depends on the context
+   which the host OS executes, i.e. interrupts, soft interrupts and kernel
+   threads. The amount of valuable data from these contexts cannot be
+   declared as non interesting for an attacker without deep inspection of
+   the code.
+
+   Note, that assigning guests to a fixed set of physical cores affects the
+   ability of the scheduler to do load balancing and might have negative
+   effects on CPU utilization depending on the hosting scenario. Disabling
+   SMT might be a viable alternative for particular scenarios.
+
+   For further information about confining guests to a single or to a group
+   of cores consult the cpusets documentation.
+
+   |
+
+3. Interrupt affinity
+
+   Interrupts can be made affine to logical CPUs. This is not universally
+   true because there are types of interrupts which are truly per CPU
+   interrupts, e.g. the local timer interrupt. Aside of that multi queue
+   devices affine their interrupts to single CPUs or groups of CPUs per
+   queue without allowing the administrator to control the affinities.
+
+   Moving the interrupts, which can be affinity controlled, away from CPUs
+   which run untrusted guests, reduces the attack vector space.
+
+   Whether the interrupts with are affine to CPUs, which run untrusted
+   guests, provide interesting data for an attacker depends on the system
+   configuration and the scenarios which run on the system. While for some
+   of the interrupts it can be assumed that they wont expose interesting
+   information beyond exposing hints about the host OS memory layout, there
+   is no way to make general assumptions.
+
+   Interrupt affinity can be controlled by the administrator via the
+   /proc/irq/$NR/smp_affinity[_list] files.
+
+   |
+
+4. SMT control
+
+   To prevent the SMT issues of L1TF it might be necessary to disable SMT
+   completely. Disabling SMT can have a significant performance impact, but
+   the impact depends on the hosting scenario and the type of workloads.
+   The impact of disabling SMT needs also to be weighted against the impact
+   of other mitigation solutions like confining guests to dedicated cores.
+
+   The kernel provides a sysfs interface to retrieve the status of SMT and
+   to control it. It also provides a kernel command line interface to
+   control SMT.
+
+   The kernel command line interface consists of the following options::
+
+     nosmt:	Affects the bring up of the secondary CPUs during boot. The
+		kernel tries to bring all present CPUs online during the
+		boot process. "nosmt" makes sure that from each physical
+		core only one - the so called primary (hyper) thread is
+		activated. Due to a design flaw of Intel processors related
+		to Machine Check Exceptions the non primary siblings have
+		to be brought up at least partially and are then shut down
+		again.  "nosmt" can be undone via the sysfs interface.
+
+     nosmt=force: Has the same effect as "nosmt' but it does not allow to
+		  undo the SMT disable via the sysfs interface.
+
+   The sysfs interface provides two files:
+
+   - /sys/devices/system/cpu/smt/control
+   - /sys/devices/system/cpu/smt/active
+
+   /sys/devices/system/cpu/smt/control:
+
+     This file allows to read out the SMT control state and provides the
+     ability to disable or (re)enable SMT. The possible states are::
+
+	on:		SMT is supported by the CPU and enabled. All
+			logical CPUs can be onlined and offlined without
+			restrictions
+
+	off:		SMT is supported by the CPU and disabled. Only
+			the so called primary SMT threads can be onlined
+			and offlined without restrictions. An attempt to
+			online a non-primary sibling is rejected
+
+	forceoff:	Same as 'off' but the state cannot be controlled.
+			Attempts to write to the control file are rejected.
+
+	notsupported:	The processor does not support SMT. It's therefore
+			not affected by the SMT implications of L1TF.
+			Attempts to write to the control file are rejected.
+
+    The possible states which can be written into this file to control SMT
+    state are:
+
+    - on
+    - off
+    - forceoff
+
+   /sys/devices/system/cpu/smt/active:
+
+     This file reports whether SMT is enabled and active, i.e. if on any
+     physical core two or more sibling threads are online.
+
+There is ongoing research and development for other mitigation mechanisms
+to address the performance impact of disabling SMT.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows to control the L1TF mitigations at boot
+time with the option "l1tf=". The valid arguments for this option are::
+
+  full:		Provide all available mitigations for the L1TF
+		vulnerability. Disable SMT and enable all mitigations
+		in the hypervisors. SMT control is still possible
+		after boot.
+
+  full,force:	Same as 'full', but disables SMT control. Implies the
+		'nosmt=force' command line option.
+
+  novirt:	Leaves SMT enabled and does not enable the hypervisor
+		mitigation. Hypervisors will issue a warning when the first
+		VM is being started in a potentially insecure
+		configuration, i.e. SMT enabled or L1D flush disabled.
+
+  novirt,nowarn: Same as 'novirt', but hypervisors will not warn when
+		 a VM is started in a potentially insecure configuration.
+
+The default is 'novirt'.
+
+Mitigation control for KVM - command line or module parameter
+-------------------------------------------------------------
+
+The KVM hypervisor mitigation mechanism, flushing the L1D cache when
+entering a guest, can be controlled from the kernel command line or when
+the KVM-Intel hypervisor is built as a module also with a module parameter.
+
+The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
+following arguments::
+
+  always:	L1D cache flush on every VMENTER.
+
+  cond:		Flush L1D on VMENTER only when the code between VMEXIT and
+		VMENTER can leak host memory which is considered
+		interesting for an attacker. This still can leak host data
+		which allows e.g. to determine the hosts address space layout.
+
+  never:	Disables the mitigation
+
+The default is 'cond'. If 'l1tf=full' or 'l1tf=full,force' are given on the
+kernel command line, these take precedence.
+
+
+Mitigation selection guide
+--------------------------
+
+1. No virtualization in use
+
+   The system is protected by the kernel unconditionally and no further
+   action is required.
+
+   |
+
+2. Virtualization with trusted guests
+
+   If the guest comes from a trusted source and the guest OS kernel is
+   guaranteed to have the L1TF mitigations in place the system is fully
+   protected against L1TF and no further action is required.
+
+   |
+
+3. Virtualization with untrusted guests
+
+   If SMT is not supported by the processor or disabled in the BIOS or by
+   the kernel, it's only required to enforce L1D flushing on VMENTER. This
+   can be achieved with the l1tf or the kvm-intel command line or module
+   parameters.
+
+   If SMT is supported and active then various degrees of mitigations can
+   be employed.
+
+   - Confinement of guests to a single or a group of physical cores which
+     are not running any other processes, can reduce the attack surface
+     significantly, but interrupts, soft interrupts and kernel threads can
+     still expose valuable data to a potential attacker
+
+   - Isolating the guest CPUs from interrupts can reduce the attack surface
+     further, but still allows a malicious guest to explore a limited
+     amount of host physical memory. This can at least be used to gain
+     knowledge about the host address space layout. The interrupts which
+     have a fixed affinity to the CPUs which run the untrusted guests can
+     depending on the scenario still trigger soft interrupts and schedule
+     kernel threads which might expose valuable information.
+
+   - Disabling SMT and enforcing the L1D flushing provides the maximum
+     amount of protection.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/2] Command line and documentation 0
  2018-07-08 12:52 [patch 0/2] Command line and documentation 0 Thomas Gleixner
  2018-07-08 12:52 ` [patch 1/2] Command line and documentation 1 Thomas Gleixner
  2018-07-08 12:52 ` [patch 2/2] Command line and documentation 2 Thomas Gleixner
@ 2018-07-08 13:11 ` Thomas Gleixner
  2 siblings, 0 replies; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-08 13:11 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 267 bytes --]

On Sun, 8 Jul 2018, speck for Thomas Gleixner wrote:
> Note, that the Documentation does not build on -rc1, so you need to merge
> the pile into rc2 at least. I'll send out a git bundle with these changes
> as a reply.

Bundle on top of -rc1 attached.

Thanks,

	tglx

[-- Attachment #2: Type: application/octet-stream, Size: 66254 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 1/2] Command line and documentation 1
  2018-07-08 12:52 ` [patch 1/2] Command line and documentation 1 Thomas Gleixner
@ 2018-07-08 14:00   ` Josh Poimboeuf
  2018-07-08 14:13     ` Thomas Gleixner
  2018-07-08 20:32   ` Jiri Kosina
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 53+ messages in thread
From: Josh Poimboeuf @ 2018-07-08 14:00 UTC (permalink / raw)
  To: speck

On Sun, Jul 08, 2018 at 02:52:17PM +0200, speck for Thomas Gleixner wrote:
> From: Jiri Kosina <jkosina@suse.cz>
> Subject: [patch 1/2] x86/bugs, kvm: introduce boot-time control of L1TF mitigations
> 
> Introduce 'l1tf=' kernel commad line option to allow for boot-time

"command"

> @@ -1989,6 +1983,38 @@
>  			feature (tagged TLBs) on capable Intel chips.
>  			Default is 1 (enabled)
>  
> +	l1tf=           [X86] Control mitigation of the L1TF vulnerability on
> +			      affected CPUs
> +
> +			The kernel PTE inversion protection is unconditionally
> +			enabled and cannot be controlled.
> +
> +			full
> +				Provide all available mitigations for the
> +				L1TF vulnerability. Disable SMT and enable
> +				all mitigations in the hypervisors. SMT
> +				control is still possible after boot.
> +
> +			full,force
> +				Same as 'full', but disables SMT
> +				control. Implies the 'nosmt=force' command
> +				line option.

The line wrapping looks a little off here.

> @@ -10539,19 +10529,41 @@ static struct kvm_vcpu *vmx_create_vcpu(
>  	return ERR_PTR(err);
>  }
>  
> -#define L1TF_MSG "SMT enabled with L1TF CPU bug present. Refer to CVE-2018-3620 for details.\n"
> +#define L1TF_MSG_NOVIRT "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. Refer to CVE-2018-3646 for details.\n"
> +#define L1TF_MSG_SMT "L1TF CPU bug present and SMT enabled, data leak possible. Refer to CVE-2018-3646 for details.\n"
>  
>  static int vmx_vm_init(struct kvm *kvm)
>  {
>  	if (!ple_gap)
>  		kvm->arch.pause_in_guest = true;
>  
> -	if (boot_cpu_has(X86_BUG_L1TF) && cpu_smt_control == CPU_SMT_ENABLED) {
> -		if (nosmt) {
> -			pr_err(L1TF_MSG);
> -			return -EOPNOTSUPP;
> +	if (boot_cpu_has(X86_BUG_L1TF)) {
> +		switch (l1tf_mitigation) {
> +			case L1TF_MITIGATION_NOVIRT_NOWARN:
> +				/* The 'I explicitly don't care' flag is set */
> +				break;
> +			case L1TF_MITIGATION_NOVIRT:
> +				/*
> +				 * Warn upon starting the first VM in a
> +				 * potentially insecure environment.
> +				 */
> +				if (cpu_smt_control == CPU_SMT_ENABLED ||
> +				    vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER)
> +					pr_warn_once(L1TF_MSG_NOVIRT);
> +				break;
> +			case L1TF_MITIGATION_FULL:
> +				/*
> +				 * Warn if SMT has been runtime-enabled. We can't
> +				 * really warn only once here, as SMT state is
> +				 * not constant.
> +				 */
> +				if (cpu_smt_control == CPU_SMT_ENABLED)
> +					pr_warn(L1TF_MSG_SMT);

As I suggested to Jiri, this should be a one-time warning, otherwise
it's going to be an annoyance for people using cpusets.

> +	/* Take the l1tf command line parameter into account */
> +	switch (l1tf_mitigation) {
> +	case L1TF_MITIGATION_FULL:
> +	case L1TF_MITIGATION_FULL_FORCE:
> +		/* Respect the FLUSH ALWAYS module param */
> +		if (vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER)
> +			vmentry_l1d_flush = VMENTER_L1D_FLUSH_COND;
> +		break;

I'm not sure about this for the non-force case.  If somebody goes to the
trouble of setting the vmx option, it seems like we should honor that.

However, I think it *does* make sense for the force case.  And I think
it should be similarly enforced in vmentry_l1d_flush_set().

In fact, this logic should probably all be moved to
vmentry_l1d_flush_set() anyway.  I'm pretty sure that code gets called
before this during module load.

> +	default:
> +		break;
> +	}
> +
> +	l1tf_vmx_mitigation = vmentry_l1d_flush;

This needs to be done in vmentry_l1d_flush_set().

Overall this looks like a big improvement, nice work.

-- 
Josh

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/2] Command line and documentation 1
  2018-07-08 14:00   ` [MODERATED] " Josh Poimboeuf
@ 2018-07-08 14:13     ` Thomas Gleixner
  2018-07-08 15:21       ` [MODERATED] " Josh Poimboeuf
  0 siblings, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-08 14:13 UTC (permalink / raw)
  To: speck

On Sun, 8 Jul 2018, speck for Josh Poimboeuf wrote:
> On Sun, Jul 08, 2018 at 02:52:17PM +0200, speck for Thomas Gleixner wrote:
> > From: Jiri Kosina <jkosina@suse.cz>
> > Subject: [patch 1/2] x86/bugs, kvm: introduce boot-time control of L1TF mitigations
> > 
> > Introduce 'l1tf=' kernel commad line option to allow for boot-time
> 
> "command"

Darn missed that,

> > +	if (boot_cpu_has(X86_BUG_L1TF)) {
> > +		switch (l1tf_mitigation) {
> > +			case L1TF_MITIGATION_NOVIRT_NOWARN:
> > +				/* The 'I explicitly don't care' flag is set */
> > +				break;
> > +			case L1TF_MITIGATION_NOVIRT:
> > +				/*
> > +				 * Warn upon starting the first VM in a
> > +				 * potentially insecure environment.
> > +				 */
> > +				if (cpu_smt_control == CPU_SMT_ENABLED ||
> > +				    vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER)
> > +					pr_warn_once(L1TF_MSG_NOVIRT);
> > +				break;
> > +			case L1TF_MITIGATION_FULL:
> > +				/*
> > +				 * Warn if SMT has been runtime-enabled. We can't
> > +				 * really warn only once here, as SMT state is
> > +				 * not constant.
> > +				 */
> > +				if (cpu_smt_control == CPU_SMT_ENABLED)
> > +					pr_warn(L1TF_MSG_SMT);
> 
> As I suggested to Jiri, this should be a one-time warning, otherwise
> it's going to be an annoyance for people using cpusets.

Ok.

> > +	/* Take the l1tf command line parameter into account */
> > +	switch (l1tf_mitigation) {
> > +	case L1TF_MITIGATION_FULL:
> > +	case L1TF_MITIGATION_FULL_FORCE:
> > +		/* Respect the FLUSH ALWAYS module param */
> > +		if (vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER)
> > +			vmentry_l1d_flush = VMENTER_L1D_FLUSH_COND;
> > +		break;
> 
> I'm not sure about this for the non-force case.  If somebody goes to the
> trouble of setting the vmx option, it seems like we should honor that.

No strong opinion.

> However, I think it *does* make sense for the force case.  And I think
> it should be similarly enforced in vmentry_l1d_flush_set().
>
> In fact, this logic should probably all be moved to
> vmentry_l1d_flush_set() anyway.  I'm pretty sure that code gets called
> before this during module load.

Why? If vmentry_l1d_flush is set correctly during init then
vmentry_l1d_flush_set() will do the right thing. No need to make the same
decision over and over. The result won't change because neither
vmentry_l1d_fliush nor l1tf_mitigation can change. Aside of that we really
need to do this in the init function because that also controls the static
key.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-08 12:52 ` [patch 2/2] Command line and documentation 2 Thomas Gleixner
@ 2018-07-08 14:40   ` Andrew Cooper
  2018-07-09  7:05     ` Thomas Gleixner
  2018-07-08 15:40   ` [MODERATED] " Josh Poimboeuf
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2018-07-08 14:40 UTC (permalink / raw)
  To: speck

On 08/07/18 13:52, speck for Thomas Gleixner wrote:
> --- /dev/null
> +++ b/Documentation/admin-guide/l1tf.rst
> @@ -0,0 +1,356 @@
> +L1TF - L1 Terminal Fault
> +========================
> +
> +L1 Terminal Fault is a hardware vulnerability which allows unconstrained
> +speculative access to data which is available in the Level 1 Data Cache
> +when the page table entry controlling the virtual access, which is used for
> +the access, has the present bit cleared.

"or other reserved bits set" ?

An otherwise correct PTE with bit 51 set will also be vulnerable to L1TF.

> +Problem
> +-------
>
> <snip>
>
> +This flaw is very similar to the Meltdown vulnerability, which speculates
> +on data which should be not accessible from user space because the
> +speculation ignores the permission bits. Contrary to Meltdown L1TF can not
> +be exploited without actually generating page faults.

I don't understand what you're trying to say here.  Both Meltdown and
L1TF can be pulled off in speculative windows which will get erased when
the retire buffer catches up with reality.

Furthermore, both can be pulled off in TSX transaction where the
pipeline will squash the real #PF for you.

>  While Meltdown breaks
> +the user space to kernel space protection, L1TF has a broader scope. It
> +allows to attack any physical memory address in the system and the attack
> +works across all protection domains. It allows to attack SGX and also
> +works from inside virtual machines because the speculation bypasses the
> +extended page table (EPT) protection mechanism.
> +
> +Attack scenarios
> +----------------
> +
> +1. Malicious user space:
> +
> +   Operating Systems store arbitrary information in the address bits of a
> +   PTE which is marked non present. This allows a malicious user space
> +   application to attack the physical memory to which these PTEs resolve.
> +
> +   The Linux kernel contains a mitigation for this attack vector which is
> +   permanently enabled and has no performance impact. A system with an up
> +   to date kernel is not vulnerable as there is no way for a malicious
> +   application to control the content of PTEs which are not marked present.
> +
> +   |

Stray pipe?  (or perhaps RST syntax I'm not aware of?)

Otherwise, LGTM.

~Andrew

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 1/2] Command line and documentation 1
  2018-07-08 14:13     ` Thomas Gleixner
@ 2018-07-08 15:21       ` Josh Poimboeuf
  2018-07-09  7:07         ` Thomas Gleixner
  0 siblings, 1 reply; 53+ messages in thread
From: Josh Poimboeuf @ 2018-07-08 15:21 UTC (permalink / raw)
  To: speck

On Sun, Jul 08, 2018 at 04:13:02PM +0200, speck for Thomas Gleixner wrote:
> > However, I think it *does* make sense for the force case.  And I think
> > it should be similarly enforced in vmentry_l1d_flush_set().
> >
> > In fact, this logic should probably all be moved to
> > vmentry_l1d_flush_set() anyway.  I'm pretty sure that code gets called
> > before this during module load.
> 
> Why? If vmentry_l1d_flush is set correctly during init then
> vmentry_l1d_flush_set() will do the right thing. No need to make the same
> decision over and over. The result won't change because neither
> vmentry_l1d_fliush nor l1tf_mitigation can change. Aside of that we really
> need to do this in the init function because that also controls the static
> key.

Yeah, I guess I had been assuming that the sysfs entry for
'vmentry_l1d_flush' was read/write.

Jiri was talking about making the 'l1tf' option runtime changeable,
which I think is important.  In that case it might make sense to do the
same with 'vmentry_l1d_flush' by making the sysfs read/write.  Then
these checks (and the static key change) would belong in
vmentry_l1d_flush_set().

But this looks good for now.

-- 
Josh

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-08 12:52 ` [patch 2/2] Command line and documentation 2 Thomas Gleixner
  2018-07-08 14:40   ` [MODERATED] " Andrew Cooper
@ 2018-07-08 15:40   ` Josh Poimboeuf
  2018-07-09 11:04   ` Ingo Molnar
  2018-07-09 22:07   ` [MODERATED] " Andi Kleen
  3 siblings, 0 replies; 53+ messages in thread
From: Josh Poimboeuf @ 2018-07-08 15:40 UTC (permalink / raw)
  To: speck

On Sun, Jul 08, 2018 at 02:52:18PM +0200, speck for Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> Subject: [patch 2/2] Documentation: Add section about CPU vulnerabilities
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  Documentation/admin-guide/index.rst |    9 
>  Documentation/admin-guide/l1tf.rst  |  356 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 365 insertions(+)
> 
> --- a/Documentation/admin-guide/index.rst
> +++ b/Documentation/admin-guide/index.rst
> @@ -17,6 +17,15 @@ etc.
>     kernel-parameters
>     devices
>  
> +This section describes CPU vulnerabilities and provides an overview over

s/over/of/

> +the possible mitigations along with guidance for selecting mitigations if
> +they are configurable at compile, boot or run time.
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   l1tf
> +
>  Here is a set of documents aimed at users who are trying to track down
>  problems and bugs in particular.
>  
> --- /dev/null
> +++ b/Documentation/admin-guide/l1tf.rst
> @@ -0,0 +1,356 @@
> +L1TF - L1 Terminal Fault
> +========================
> +
> +L1 Terminal Fault is a hardware vulnerability which allows unconstrained
> +speculative access to data which is available in the Level 1 Data Cache
> +when the page table entry controlling the virtual access, which is used for
> +the access, has the present bit cleared.

s/present/Present/

> +
> +Affected CPUs
> +-------------
> +
> +This vulnerability affects a wide range of Intel processors. The
> +vulnerability is not present on:
> +
> +   - Older models, where the CPU family is < 6
> +
> +   - A range of ATOM processors (Cedarview, Cloverview, Lincroft, Penwell,
> +     Pineview, Slivermont, Airmont, Merrifield)
> +
> +   - The Core Duo Yonah variants (2006 - 2008)
> +
> +   - The XEON PHI family
> +
> +   - Processors which have the ARCH_CAP_RDCL_NO bit set in the
> +     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is also not
> +     affected by the Meltdown vulnerabitly. These CPUs should become
> +     available end of 2018.
> +
> +Related CVEs
> +------------
> +
> +The following CVE entries are related to the L1TF vulnerability::
> +
> +   CVE-2018-3615:  L1 Terminal Fault - SGX related aspects
> +   CVE-2018-3620:  L1 Terminal Fault - OS, SMM related aspects
> +   CVE-2018-3646:  L1 Terminal Fault - Virtualization related aspects
> +
> +Problem
> +-------
> +
> +If an instruction accesses a virtual address for which the relevant page
> +table entry (PTE) has the present bit cleared, then the speculative

s/present/Present/

> +3. Virtualization with untrusted guests
> +
> +   If SMT is not supported by the processor or disabled in the BIOS or by
> +   the kernel, it's only required to enforce L1D flushing on VMENTER. This
> +   can be achieved with the l1tf or the kvm-intel command line or module
> +   parameters.
> +
> +   If SMT is supported and active then various degrees of mitigations can
> +   be employed.
> +
> +   - Confinement of guests to a single or a group of physical cores which
> +     are not running any other processes, can reduce the attack surface
> +     significantly, but interrupts, soft interrupts and kernel threads can
> +     still expose valuable data to a potential attacker

period

-- 
Josh

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 1/2] Command line and documentation 1
  2018-07-08 12:52 ` [patch 1/2] Command line and documentation 1 Thomas Gleixner
  2018-07-08 14:00   ` [MODERATED] " Josh Poimboeuf
@ 2018-07-08 20:32   ` Jiri Kosina
  2018-07-09  0:33     ` Jon Masters
  2018-07-09 10:26   ` Ingo Molnar
  2018-07-09 21:45   ` Andi Kleen
  3 siblings, 1 reply; 53+ messages in thread
From: Jiri Kosina @ 2018-07-08 20:32 UTC (permalink / raw)
  To: speck

On Sun, 8 Jul 2018, speck for Thomas Gleixner wrote:

> From: Jiri Kosina <jkosina@suse.cz>
> Subject: [patch 1/2] x86/bugs, kvm: introduce boot-time control of L1TF mitigations

Thanks a lot for polishing it for me; looks pretty good now. Two small 
nits (apart from the pr_warn -> pr_warn_once transition and the typos Josh 
pointed out)

[ ... snip ... ]
> +static const char *l1tf_vmx_states[] = {
> +	[VMENTER_L1D_FLUSH_NEVER]	= "vulnerable",
> +	[VMENTER_L1D_FLUSH_COND]	= "mostly protected",
> +	[VMENTER_L1D_FLUSH_ALWAYS]	= "fully protected"

I somehow don't like the "mostly protected" string :) We believe that the 
list of confined VM boundaries is correctly chosen, so we should say so 
(also imagine the confused admin who specifies 'l1tf=full' and then starts 
wondering why he's only "mostly protected").

How about

	[VMENTER_L1D_FLUSH_NEVER]       = "vulnerable",
	[VMENTER_L1D_FLUSH_COND]        = "conditional flushes",
	[VMENTER_L1D_FLUSH_ALWAYS]      = "flushes"

> +static ssize_t l1tf_show_state(char *buf)
> +{
> +	if (l1tf_vmx_mitigation == L1TF_VMX_UNKNOWN)
> +		return sprintf(buf, "%s\n", l1tf_states[l1tf_mitigation]);
> +
> +	return sprintf(buf, "%s, VMX: SMT %s L1D %s\n",
> +		       l1tf_states[l1tf_mitigation],
> +		       cpu_smt_control == CPU_SMT_ENABLED ? "vulnerable" : "disabled",
> +		       l1tf_vmx_states[l1tf_vmx_mitigation]);

The formatted string looks a bit ugly

	Mitigation: Full, VMX: SMT disabled L1D mostly protected

I think doing some more separators would make it a bit nicer. How about

	"%s; VMX: SMT %s, L1D %s\n"

Otherwise, I tested it and haven't spotted any issues.

Thanks!

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 1/2] Command line and documentation 1
  2018-07-08 20:32   ` Jiri Kosina
@ 2018-07-09  0:33     ` Jon Masters
  0 siblings, 0 replies; 53+ messages in thread
From: Jon Masters @ 2018-07-09  0:33 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1157 bytes --]

On 07/08/2018 04:32 PM, speck for Jiri Kosina wrote:
> On Sun, 8 Jul 2018, speck for Thomas Gleixner wrote:
> 
>> From: Jiri Kosina <jkosina@suse.cz>
>> Subject: [patch 1/2] x86/bugs, kvm: introduce boot-time control of L1TF mitigations
> 
> Thanks a lot for polishing it for me; looks pretty good now. Two small 
> nits (apart from the pr_warn -> pr_warn_once transition and the typos Josh 
> pointed out)
> 
> [ ... snip ... ]
>> +static const char *l1tf_vmx_states[] = {
>> +	[VMENTER_L1D_FLUSH_NEVER]	= "vulnerable",
>> +	[VMENTER_L1D_FLUSH_COND]	= "mostly protected",
>> +	[VMENTER_L1D_FLUSH_ALWAYS]	= "fully protected"
> 
> I somehow don't like the "mostly protected" string :)

I thought the same thing. It's going to result in lots of bad advice in
random forums and people being confused or whatnot.

> How about
> 
> 	[VMENTER_L1D_FLUSH_NEVER]       = "vulnerable",
> 	[VMENTER_L1D_FLUSH_COND]        = "conditional flushes",
> 	[VMENTER_L1D_FLUSH_ALWAYS]      = "flushes"

Why not just say "conditional L1D cache flush" or "L1D cache flushes".

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 2/2] Command line and documentation 2
  2018-07-08 14:40   ` [MODERATED] " Andrew Cooper
@ 2018-07-09  7:05     ` Thomas Gleixner
  0 siblings, 0 replies; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-09  7:05 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1180 bytes --]

On Sun, 8 Jul 2018, speck for Andrew Cooper wrote:
> On 08/07/18 13:52, speck for Thomas Gleixner wrote:
> > +This flaw is very similar to the Meltdown vulnerability, which speculates
> > +on data which should be not accessible from user space because the
> > +speculation ignores the permission bits. Contrary to Meltdown L1TF can not
> > +be exploited without actually generating page faults.
> 
> I don't understand what you're trying to say here.  Both Meltdown and
> L1TF can be pulled off in speculative windows which will get erased when
> the retire buffer catches up with reality.
> 
> Furthermore, both can be pulled off in TSX transaction where the
> pipeline will squash the real #PF for you.

Yes, but you can hide the Meltdown attack completely in a non taken, but
speculated branch, so it will never trigger PF, abort TSX ... That trick
does not work with L1TF. That said, it's not that interesting for this
particular documentation. Will remove it.

> > +
> > +   |
> 
> Stray pipe?  (or perhaps RST syntax I'm not aware of?)

Empty new line. Workaround for a rendering issue, which I haven't had time
to analyze yet.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/2] Command line and documentation 1
  2018-07-08 15:21       ` [MODERATED] " Josh Poimboeuf
@ 2018-07-09  7:07         ` Thomas Gleixner
  2018-07-09 13:14           ` Thomas Gleixner
  0 siblings, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-09  7:07 UTC (permalink / raw)
  To: speck

On Sun, 8 Jul 2018, speck for Josh Poimboeuf wrote:
> On Sun, Jul 08, 2018 at 04:13:02PM +0200, speck for Thomas Gleixner wrote:
> > > However, I think it *does* make sense for the force case.  And I think
> > > it should be similarly enforced in vmentry_l1d_flush_set().
> > >
> > > In fact, this logic should probably all be moved to
> > > vmentry_l1d_flush_set() anyway.  I'm pretty sure that code gets called
> > > before this during module load.
> > 
> > Why? If vmentry_l1d_flush is set correctly during init then
> > vmentry_l1d_flush_set() will do the right thing. No need to make the same
> > decision over and over. The result won't change because neither
> > vmentry_l1d_fliush nor l1tf_mitigation can change. Aside of that we really
> > need to do this in the init function because that also controls the static
> > key.
> 
> Yeah, I guess I had been assuming that the sysfs entry for
> 'vmentry_l1d_flush' was read/write.
> 
> Jiri was talking about making the 'l1tf' option runtime changeable,
> which I think is important.  In that case it might make sense to do the
> same with 'vmentry_l1d_flush' by making the sysfs read/write.  Then
> these checks (and the static key change) would belong in
> vmentry_l1d_flush_set().

I see what you mean. Yes, that makes sense.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 1/2] Command line and documentation 1
  2018-07-08 12:52 ` [patch 1/2] Command line and documentation 1 Thomas Gleixner
  2018-07-08 14:00   ` [MODERATED] " Josh Poimboeuf
  2018-07-08 20:32   ` Jiri Kosina
@ 2018-07-09 10:26   ` Ingo Molnar
  2018-07-09 21:45   ` Andi Kleen
  3 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2018-07-09 10:26 UTC (permalink / raw)
  To: speck



* speck for Thomas Gleixner <speck@linutronix.de> wrote:

> From: Jiri Kosina <jkosina@suse.cz>
> Subject: [patch 1/2] x86/bugs, kvm: introduce boot-time control of L1TF mitigations

Looks good to me, I only have minor nits:

> Introduce 'l1tf=' kernel commad line option to allow for boot-time
> switching of mitigation that is used on processors affected by L1TF.


Beyond the typo there's also a missing article:

 s/Introduce 'l1tf=' kernel command line option
  /Introduce the 'l1tf=' kernel command line option


> 
> The possible values are:
> 
>   full
> 	Provide all available mitigations for the L1TF
> 	vulnerability. Disable SMT and enable all mitigations
> 	in the hypervisors. SMT control is still possible
> 	after boot.

Maybe:

  SMT control via /sys/devices/system/cpu/smt/control is still
  possible after boot.

to make the text more helpful overall?

>   full,force
> 	Same as 'full', but disables SMT control. Implies the
> 	'nosmt=force' command line option.

Maybe add:

   full,force
 	Same as 'full', but disables SMT control. Implies the
 	'nosmt=force' command line option. (I.e. sysfs control
	of SMT is disabled.)

>   novirt
> 	Leaves SMT enabled and does not enable the hypervisor
> 	mitigation. Hypervisors will issue a warning when the first VM is
> 	being started in a potentially insecure configuration, i.e. SMT
> 	enabled or L1D flush disabled.

s/being started
 /started

>  novirt,nowarn
> 	Same as 'novirt', but hypervisors will not warn when
> 	a VM is started in a potentially insecure configuration.
> 
> Default is 'novirt'.
> 
> Let KVM adhere to these semantics, which means:
> 
>   - 'lt1f=full,force'	: start performing L1D flushes
>   - 'l1tf=full'		: start performing L1D flushes,
> 			  warn on VM start if SMT has
> 			  been runtime enabled
>   - 'l1tf=novirt'	: warn on first VM start
>   - 'l1tf=novirt,nowarn': no extra action is taken
> 
> This makes the KVM's private 'nosmt' option redundant, and as it is a bit
> non-systematic anyway (this is something to control globally, not on
> hypervisor level). Remove that option.

s/the KVM's
 /KVM's

> +	l1tf=           [X86] Control mitigation of the L1TF vulnerability on
> +			      affected CPUs
> +
> +			The kernel PTE inversion protection is unconditionally
> +			enabled and cannot be controlled.

 s/controlled
  /disabled

?

> +
> +			full
> +				Provide all available mitigations for the
> +				L1TF vulnerability. Disable SMT and enable
> +				all mitigations in the hypervisors. SMT
> +				control is still possible after boot.
> +
> +			full,force
> +				Same as 'full', but disables SMT
> +				control. Implies the 'nosmt=force' command
> +				line option.
> +
> +			novirt
> +				Leaves SMT enabled and does not enable the
> +				hypervisor mitigation. Hypervisors will
> +				issue a warning when the first VM is being
> +				started in a potentially insecure
> +				configuration, i.e. SMT enabled or L1D
> +				flush disabled.
> +
> +			novirt,nowarn
> +				Same as 'novirt', but hypervisors will not
> +				warn when a VM is started in a potentially
> +				insecure configuration.
> +
> +			Default is 'novirt'.

Also, to extent you agree to the tweaks of the text I gave above, they should 
propagate into the documentation here as well.

> +/* Default mitigation for L1TF-affected CPUs */
> +enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_NOVIRT;
> +#if IS_ENABLED(CONFIG_KVM_INTEL)

Is there any difference to #ifdef CONFIG_KVM_INTEL?

> +	switch (l1tf_mitigation) {
> +	case L1TF_MITIGATION_FULL_FORCE:
> +		cpu_smt_disable(true);
> +		break;
> +	case L1TF_MITIGATION_FULL:
> +		cpu_smt_disable(false);
> +		break;
> +	case L1TF_MITIGATION_NOVIRT:
> +	case L1TF_MITIGATION_NOVIRT_NOWARN:
> +		break;
> +	}

Nit: maybe put this into the same order as the definition order, i.e. 
weaker-to-stronger options?


> +static int __init l1tf_cmdline(char *str)
> +{
> +	if (!boot_cpu_has_bug(X86_BUG_L1TF))
> +		return 0;
> +
> +	if (!str)
> +		return 0;
> +
> +	if (!strcmp(str, "full"))
> +		l1tf_mitigation = L1TF_MITIGATION_FULL;
> +	else if (!strcmp(str, "full,force"))
> +		l1tf_mitigation = L1TF_MITIGATION_FULL_FORCE;
> +	else if (!strcmp(str, "novirt"))
> +		l1tf_mitigation = L1TF_MITIGATION_NOVIRT;
> +	else if (!strcmp(str, "novirt,nowarn"))
> +		l1tf_mitigation = L1TF_MITIGATION_NOVIRT_NOWARN;

Ditto.

> +static const char *l1tf_states[] = {
> +	[L1TF_MITIGATION_FULL]		= "Mitigation: Full",
> +	[L1TF_MITIGATION_FULL_FORCE]	= "Mitigation: Full [force]",
> +	[L1TF_MITIGATION_NOVIRT]	= "Mitigation: Page Table Inversion",
> +	[L1TF_MITIGATION_NOVIRT_NOWARN]	= "Mitigation: Page Table Inversion"
> +};

Ditto.

> +
> +#if IS_ENABLED(CONFIG_KVM_INTEL)
> +static const char *l1tf_vmx_states[] = {
> +	[VMENTER_L1D_FLUSH_NEVER]	= "vulnerable",
> +	[VMENTER_L1D_FLUSH_COND]	= "mostly protected",
> +	[VMENTER_L1D_FLUSH_ALWAYS]	= "fully protected"
> +};

(This has the right order.)

> +	if (boot_cpu_has(X86_BUG_L1TF)) {
> +		switch (l1tf_mitigation) {
> +			case L1TF_MITIGATION_NOVIRT_NOWARN:
> +				/* The 'I explicitly don't care' flag is set */
> +				break;
> +			case L1TF_MITIGATION_NOVIRT:
> +				/*
> +				 * Warn upon starting the first VM in a
> +				 * potentially insecure environment.
> +				 */
> +				if (cpu_smt_control == CPU_SMT_ENABLED ||
> +				    vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER)
> +					pr_warn_once(L1TF_MSG_NOVIRT);
> +				break;
> +			case L1TF_MITIGATION_FULL:
> +				/*
> +				 * Warn if SMT has been runtime-enabled. We can't
> +				 * really warn only once here, as SMT state is
> +				 * not constant.
> +				 */
> +				if (cpu_smt_control == CPU_SMT_ENABLED)
> +					pr_warn(L1TF_MSG_SMT);
> +				break;
> +			case L1TF_MITIGATION_FULL_FORCE:
> +				/* All is good, proceed without hiccup */
> +				break;

This has the definition-order too. So I'd suggest harmonizing the order.

> +	/* Take the l1tf command line parameter into account */

Nit:

 s/the l1tf command line parameter
  /the l1tf= command line parameter

Just so that people know which exact parameter this is.

>  extern enum cpuhp_smt_control cpu_smt_control;
> +void cpu_smt_disable(bool force);

Should perhaps be 'extern' too, like the other prototypes?

LGTM otherwise:

Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-08 12:52 ` [patch 2/2] Command line and documentation 2 Thomas Gleixner
  2018-07-08 14:40   ` [MODERATED] " Andrew Cooper
  2018-07-08 15:40   ` [MODERATED] " Josh Poimboeuf
@ 2018-07-09 11:04   ` Ingo Molnar
  2018-07-09 11:08     ` Jiri Kosina
  2018-07-09 15:18     ` Thomas Gleixner
  2018-07-09 22:07   ` [MODERATED] " Andi Kleen
  3 siblings, 2 replies; 53+ messages in thread
From: Ingo Molnar @ 2018-07-09 11:04 UTC (permalink / raw)
  To: speck


* speck for Thomas Gleixner <speck@linutronix.de> wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> Subject: [patch 2/2] Documentation: Add section about CPU vulnerabilities
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  Documentation/admin-guide/index.rst |    9 
>  Documentation/admin-guide/l1tf.rst  |  356 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 365 insertions(+)
> 
> --- a/Documentation/admin-guide/index.rst
> +++ b/Documentation/admin-guide/index.rst
> @@ -17,6 +17,15 @@ etc.
>     kernel-parameters
>     devices
>  
> +This section describes CPU vulnerabilities and provides an overview over
> +the possible mitigations along with guidance for selecting mitigations if
> +they are configurable at compile, boot or run time.
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   l1tf
> +
>  Here is a set of documents aimed at users who are trying to track down
>  problems and bugs in particular.
>  
> --- /dev/null
> +++ b/Documentation/admin-guide/l1tf.rst
> @@ -0,0 +1,356 @@
> +L1TF - L1 Terminal Fault
> +========================
> +
> +L1 Terminal Fault is a hardware vulnerability which allows unconstrained
> +speculative access to data which is available in the Level 1 Data Cache
> +when the page table entry controlling the virtual access, which is used for
> +the access, has the present bit cleared.

Would it be clearer to say "unprivileged" instead of "unconstrained"? 
"Unconstrained" could mean a number of other things. (At least to me.)

> +Affected CPUs
> +-------------
> +
> +This vulnerability affects a wide range of Intel processors. The
> +vulnerability is not present on:
> +
> +   - Older models, where the CPU family is < 6
> +
> +   - A range of ATOM processors (Cedarview, Cloverview, Lincroft, Penwell,
> +     Pineview, Slivermont, Airmont, Merrifield)
> +
> +   - The Core Duo Yonah variants (2006 - 2008)
> +
> +   - The XEON PHI family
> +
> +   - Processors which have the ARCH_CAP_RDCL_NO bit set in the
> +     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is also not
> +     affected by the Meltdown vulnerabitly. These CPUs should become
> +     available end of 2018.

s/vulnerabitly
 /vulnerability

Also, maybe this is a bit clearer:

        If the bit is set then the CPU is not affected by the Meltdown 
        vulnerability either.

> +If an instruction accesses a virtual address for which the relevant page
> +table entry (PTE) has the present bit cleared, then the speculative
> +execution can load the data into the speculation flow when the data from
> +the physical address which is referenced in the PTE address bits is
> +available in the Level 1 Data Cache. This is a purely speculative
> +mechanism and the instruction will raise a page fault when it is retired.

s/then the speculative execution
 /then speculative execution

Also, maybe clarify it the following way:

> +If an instruction accesses a virtual address for which the relevant page
> +table entry (PTE) has the present bit cleared, then speculative
> +execution might ignore the cleared present bit and might load the
> +referenced data from the Level 1 Data Cache (if cached), as if the page was 
> +still present and accessible.
>
> +While this is a purely speculative mechanism and the instruction will raise a 
> +page fault when it is retired eventually - but the pure act of loading the
> +data and making it available to other speculative instructions opens up
> +the timing based side channel attacks to unprivileged malicious code, similar 
> +to the Meltdown attack.

?

> +This flaw is very similar to the Meltdown vulnerability, which speculates
> +on data which should be not accessible from user space because the
> +speculation ignores the permission bits. Contrary to Meltdown L1TF can not
> +be exploited without actually generating page faults. While Meltdown breaks
> +the user space to kernel space protection, L1TF has a broader scope. It
> +allows to attack any physical memory address in the system and the attack
> +works across all protection domains. It allows to attack SGX and also
> +works from inside virtual machines because the speculation bypasses the
> +extended page table (EPT) protection mechanism.

To the extent this paragraph survives into the final version:

 s/It allows to attack SGX
  /It allows an attack of SGX

?

> +
> +Attack scenarios
> +----------------
> +
> +1. Malicious user space:
> +
> +   Operating Systems store arbitrary information in the address bits of a
> +   PTE which is marked non present. This allows a malicious user space
> +   application to attack the physical memory to which these PTEs resolve.

Maybe also add this:

     "In some cases user-space can maliciously influence (i.e. set to a broad 
      range of arbitrary values) the information encoded in the address bits of 
      the PTE - making attacks more deterministic and more practical."

?

> +
> +   The Linux kernel contains a mitigation for this attack vector which is
> +   permanently enabled and has no performance impact. A system with an up
> +   to date kernel is not vulnerable as there is no way for a malicious
> +   application to control the content of PTEs which are not marked present.
> +
> +   |

It might make sense to mention the acronym of the mitigation here?

> +2. Malicious guest in a virtual machine
> +
> +   The fact that L1TF breaks all domain protections allows malicious guest
> +   OSes, which can control the PTEs directly, and malicious userspace,
> +   which runs on an unprotected guest kernel, to attack physical host
> +   memory.

s/malicious userspace
 /malicious guest user-space applications

?

Also, maybe mention here that 'unprotected guest kernel' here refers to the 
non-present PTE encoding protection technique, because it's a bit vague here I 
think.

> +   A special aspect of L1TF in the context of virtualization is symmetric
> +   multi threading (SMT). The Intel implementation of SMT is called
> +   HyperThreading. The fact that Hyperthreads on the affected processors
> +   share the L1 Data Cache (L1D) is important for this. As the flaw allows
> +   only to attack data which is present in L1D, a malicious guest running
> +   on one thread can attack the data which is brought into L1D by the
> +   context which runs on the sibling thread of the same physical core. This
> +   context can be host OS, host user space or a different guest.

For more clarify:

s/on one thread
  on one CPU thread

s/brought into L1D
 /brought into the L1D

s/sibling thread
 /sibling CPU thread

> +   While solutions exist to mitigate these attack vectors fully, these
> +   mitigations are not enabled by default in the Linux kernel because they
> +   can affect performance significantly. The kernel provides several
> +   mechanisms which can be utilized to address the problem depending on the
> +   deployment scenario.

Maybe provide a list of mitigations here, or a reference, to allow the reader to 
look further?


> +
> +L1TF system information
> +-----------------------
> +
> +The Linux kernel provides a sysfs interface to read out the information
> +about L1TF. The relevant sysfs file is:
> +
> +/sys/devices/system/cpu/vulnerabilities/l1tf
> +
> +It provides information whether the system is affected by L1TF or not and
> +in case it is affected it provides information about the active mitigation
> +mechanisms.

Maybe, for more clarity:

  The Linux kernel provides a sysfs interface to enumerate the current L1TF status 
  of the system: whether the system is vulnerable, and which mitigations are 
  available and active. The relevant sysfs file is:

   /sys/devices/system/cpu/vulnerabilities/l1tf

?

> +Guest mitigation mechanisms
> +---------------------------
> +
> +1. L1D flush on VMENTER
> +
> +   To make sure that a guest cannot attack data which is present in L1D the
> +   hypervisor flushes L1D before entering the guest.
>
> +   Flushing L1D evicts not only the data which should not be accessed by a
> +   potentially malicious guest, it also flushes the guest data. Flushing
> +   L1D has a performance impact as the processor has to bring the flushed
> +   guest data back into L1D. Depending on the frequency of VMEXIT/VMENTER
> +   and the type of computations in the guest performance degradation in the
> +   range of 1% to 50% has been observed. For scenarios where guest
> +   VMEXIT/VMENTER are rare the performance impact is minimal. Virtio and
> +   mechanisms like posted interrupts are designed to confine the VMEXITs to
> +   a bare minimum, but specific configurations and application scenarios
> +   might still suffer from a high VMEXIT rate.

A few articles are missing I think, here's the proposed fix:

  1. L1D flush on VMENTER

   To make sure that a guest cannot attack data which is present in the L1D the
   hypervisor flushes the L1D before entering the guest.

   Flushing the L1D evicts not only the data which should not be accessed by a
   potentially malicious guest, it also flushes the guest data. Flushing
   the L1D has a performance impact, as the processor has to bring the flushed
   guest data back into the L1D. Depending on the frequency of VMEXIT/VMENTER
   and the type of computations in the guest, performance degradation in the
   range of 1% to 50% has been observed. For scenarios where guest
   VMEXIT/VMENTER are rare the performance impact is minimal. Virtio and other
   mechanisms like posted interrupts are designed to confine the VMEXITs to
   a bare minimum, but specific configurations and application scenarios
   might still suffer from a high VMEXIT rate.


> +   The general recommendation is to enable L1D flush on VMENTER.
> +
> +   Note, that L1D flush does not prevent the SMT problem because the
> +   sibling thread will also bring back its data into L1D which makes it
> +   attackable again.

s/into L1D
 /into the L1D


> +2. Guest VCPU confinement to dedicated physical cores
> +
> +   To address the SMT problem, it is possible to make a guest or a group of
> +   guests affine to one or more physical cores. The proper mechanism for
> +   that is to utilize cpusets and to make sure that no other guest or host
> +   tasks can run on these cores.
> +
> +   If only a single guest or related guests run on sibling SMT threads on
> +   the same physical core then they can only attack their own memory and
> +   restricted parts of host memory.
> +
> +   Host memory is attackable when one of the sibling threads runs in
> +   host OS (hypervisor) context and the other in guest context. The amount
> +   of valuable information from the host OS context depends on the context
> +   which the host OS executes, i.e. interrupts, soft interrupts and kernel
> +   threads. The amount of valuable data from these contexts cannot be
> +   declared as non-interesting for an attacker without deep inspection of
> +   the code.
> +
> +   Note that assigning guests to a fixed set of physical cores affects the
> +   ability of the scheduler to perform load balancing and might have negative
> +   effects on CPU utilization depending on the hosting scenario. Disabling
> +   SMT might be a viable alternative for particular scenarios.
> +
> +   For further information about confining guests to a single or to a group
> +   of cores consult the cpusets documentation.

I've edited this one in place, just a few article problems.


> +  novirt,nowarn: Same as 'novirt', but hypervisors will not warn when
> +		 a VM is started in a potentially insecure configuration.
> +
> +The default is 'novirt'.

Isn't the default 'novirt,nowarn'?

> +  cond:		Flush L1D on VMENTER only when the code between VMEXIT and
> +		VMENTER can leak host memory which is considered
> +		interesting for an attacker. This still can leak host data
> +		which allows e.g. to determine the hosts address space layout.

s/hosts
 /host's

?

Other than these, LGTM:

Reviewed-by: Ingo Molnar <mingo@kernel.org>


Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-09 11:04   ` Ingo Molnar
@ 2018-07-09 11:08     ` Jiri Kosina
  2018-07-09 11:47       ` Ingo Molnar
  2018-07-09 15:18     ` Thomas Gleixner
  1 sibling, 1 reply; 53+ messages in thread
From: Jiri Kosina @ 2018-07-09 11:08 UTC (permalink / raw)
  To: speck

On Mon, 9 Jul 2018, speck for Ingo Molnar wrote:

> > +  novirt,nowarn: Same as 'novirt', but hypervisors will not warn when
> > +		 a VM is started in a potentially insecure configuration.
> > +
> > +The default is 'novirt'.
> 
> Isn't the default 'novirt,nowarn'?

No, the default absolutely is 'novirt'

	/* Default mitigation for L1TF-affected CPUs */
	enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_NOVIRT;

I don't think making default 'novirt,nowarn' would make any sense really. 
It's uncomfortable enough that the kernel is by default not turning the 
protection on.

If it wouldn't be even issuing a warning, that'd be rather bad.

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-09 11:08     ` Jiri Kosina
@ 2018-07-09 11:47       ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2018-07-09 11:47 UTC (permalink / raw)
  To: speck


* speck for Jiri Kosina <speck@linutronix.de> wrote:

> On Mon, 9 Jul 2018, speck for Ingo Molnar wrote:
> 
> > > +  novirt,nowarn: Same as 'novirt', but hypervisors will not warn when
> > > +		 a VM is started in a potentially insecure configuration.
> > > +
> > > +The default is 'novirt'.
> > 
> > Isn't the default 'novirt,nowarn'?
> 
> No, the default absolutely is 'novirt'

Indeed - I just mis-remembered it: "novirt,nowarn" is the last entry in the 
values, not the default.

> 
> 	/* Default mitigation for L1TF-affected CPUs */
> 	enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_NOVIRT;
> 
> I don't think making default 'novirt,nowarn' would make any sense really. 
> It's uncomfortable enough that the kernel is by default not turning the 
> protection on.
> 
> If it wouldn't be even issuing a warning, that'd be rather bad.

Yeah, agreed.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/2] Command line and documentation 1
  2018-07-09  7:07         ` Thomas Gleixner
@ 2018-07-09 13:14           ` Thomas Gleixner
  2018-07-09 13:21             ` [MODERATED] " Jiri Kosina
  2018-07-09 15:32             ` Josh Poimboeuf
  0 siblings, 2 replies; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-09 13:14 UTC (permalink / raw)
  To: speck

On Mon, 9 Jul 2018, speck for Thomas Gleixner wrote:
> On Sun, 8 Jul 2018, speck for Josh Poimboeuf wrote:
> > On Sun, Jul 08, 2018 at 04:13:02PM +0200, speck for Thomas Gleixner wrote:
> > > > However, I think it *does* make sense for the force case.  And I think
> > > > it should be similarly enforced in vmentry_l1d_flush_set().
> > > >
> > > > In fact, this logic should probably all be moved to
> > > > vmentry_l1d_flush_set() anyway.  I'm pretty sure that code gets called
> > > > before this during module load.
> > > 
> > > Why? If vmentry_l1d_flush is set correctly during init then
> > > vmentry_l1d_flush_set() will do the right thing. No need to make the same
> > > decision over and over. The result won't change because neither
> > > vmentry_l1d_fliush nor l1tf_mitigation can change. Aside of that we really
> > > need to do this in the init function because that also controls the static
> > > key.
> > 
> > Yeah, I guess I had been assuming that the sysfs entry for
> > 'vmentry_l1d_flush' was read/write.
> > 
> > Jiri was talking about making the 'l1tf' option runtime changeable,
> > which I think is important.  In that case it might make sense to do the
> > same with 'vmentry_l1d_flush' by making the sysfs read/write.  Then
> > these checks (and the static key change) would belong in
> > vmentry_l1d_flush_set().
> 
> I see what you mean. Yes, that makes sense.

But, making l1tf writeable does not make much sense to me and would require
to have notifiers if vmx is a module, eew! Let's not go there please.

So I'd rather make l1tf just a command line option and let VMX module
parameters control the mess. That means we'd have the following l1tf=
options:

   full:      	   SMT off, KVM flush cond
   
   full,force:	   Same as above, but SMT is forced off and KVM flush
   		   can only be switched between cond and always, but not
		   to never

   auto:	   Let the kernel decide. KVM will warn and have the l1d
   		   default

   off:		   set KVM flush to never and do not warn

Now the kvm module should have a kvm_intel.l1tf module parameter as well
with the following possible options:

   l1d-always
   l1d-cond
   l1d-never
   nowarn
   l1d-always,nowarn
   l1d-cond,nowarn
   l1d-never,nowarn

Default: l1d-cond or whatever we decide

Some of the 'nowarn' combos make probably no sense, but I can't make my
mind up which. 

The module parameter sysfs file can be writeable and allows run time
control. Transitioning from and to l1d-always might be tricky, because
l1d-always uses the MSR list and that is done at guest setup time, not at
runtime.

Hmm?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 1/2] Command line and documentation 1
  2018-07-09 13:14           ` Thomas Gleixner
@ 2018-07-09 13:21             ` Jiri Kosina
  2018-07-09 13:25               ` Jiri Kosina
  2018-07-09 15:32             ` Josh Poimboeuf
  1 sibling, 1 reply; 53+ messages in thread
From: Jiri Kosina @ 2018-07-09 13:21 UTC (permalink / raw)
  To: speck

On Mon, 9 Jul 2018, speck for Thomas Gleixner wrote:

> But, making l1tf writeable does not make much sense to me and would require
> to have notifiers if vmx is a module, eew! Let's not go there please.
> 
> So I'd rather make l1tf just a command line option and let VMX module
> parameters control the mess. That means we'd have the following l1tf=
> options:
> 
>    full:      	   SMT off, KVM flush cond
>    
>    full,force:	   Same as above, but SMT is forced off and KVM flush
>    		   can only be switched between cond and always, but not
> 		   to never
> 
>    auto:	   Let the kernel decide. KVM will warn and have the l1d
>    		   default
> 
>    off:		   set KVM flush to never and do not warn
> 
> Now the kvm module should have a kvm_intel.l1tf module parameter as well
> with the following possible options:
> 
>    l1d-always
>    l1d-cond
>    l1d-never
>    nowarn
>    l1d-always,nowarn
>    l1d-cond,nowarn
>    l1d-never,nowarn
> 
> Default: l1d-cond or whatever we decide
> 
> Some of the 'nowarn' combos make probably no sense, but I can't make my
> mind up which. 
> 
> The module parameter sysfs file can be writeable and allows run time
> control. Transitioning from and to l1d-always might be tricky, because
> l1d-always uses the MSR list and that is done at guest setup time, not at
> runtime.
> 
> Hmm?

Hmm, isn't this getting just too complicated? In order to maintain the 
sanity of users, I'd really prefer to make it as simple as possible.

If the tradeoff is being "clearly understandable" vs. "runtime 
controllable", I'd go with the latter (which means your latest adaptation 
my patch plus the few modifications on top).

We don't do runtime control of other CPU vulnerability mitigations either.

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 1/2] Command line and documentation 1
  2018-07-09 13:21             ` [MODERATED] " Jiri Kosina
@ 2018-07-09 13:25               ` Jiri Kosina
  0 siblings, 0 replies; 53+ messages in thread
From: Jiri Kosina @ 2018-07-09 13:25 UTC (permalink / raw)
  To: speck

On Mon, 9 Jul 2018, speck for Jiri Kosina wrote:

> If the tradeoff is being "clearly understandable" vs. "runtime 
> controllable", I'd go with the latter (which means your latest adaptation 

Bah, "the former" here of course, sorry.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 2/2] Command line and documentation 2
  2018-07-09 11:04   ` Ingo Molnar
  2018-07-09 11:08     ` Jiri Kosina
@ 2018-07-09 15:18     ` Thomas Gleixner
  1 sibling, 0 replies; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-09 15:18 UTC (permalink / raw)
  To: speck

On Mon, 9 Jul 2018, speck for Ingo Molnar wrote:
> * speck for Thomas Gleixner <speck@linutronix.de> wrote:
> > +L1 Terminal Fault is a hardware vulnerability which allows unconstrained
> > +speculative access to data which is available in the Level 1 Data Cache
> > +when the page table entry controlling the virtual access, which is used for
> > +the access, has the present bit cleared.
> 
> Would it be clearer to say "unprivileged" instead of "unconstrained"? 
> "Unconstrained" could mean a number of other things. (At least to me.)

yes.

> Also, maybe this is a bit clearer:
> 
>         If the bit is set then the CPU is not affected by the Meltdown 
>         vulnerability either.

Ok.

> Also, maybe clarify it the following way:
> 
> > +If an instruction accesses a virtual address for which the relevant page
> > +table entry (PTE) has the present bit cleared, then speculative
> > +execution might ignore the cleared present bit and might load the
> > +referenced data from the Level 1 Data Cache (if cached), as if the page was 
> > +still present and accessible.
> >
> > +While this is a purely speculative mechanism and the instruction will raise a 
> > +page fault when it is retired eventually - but the pure act of loading the
> > +data and making it available to other speculative instructions opens up
> > +the timing based side channel attacks to unprivileged malicious code, similar 
> > +to the Meltdown attack.

I took parts of it an refined it.

> > +   Operating Systems store arbitrary information in the address bits of a
> > +   PTE which is marked non present. This allows a malicious user space
> > +   application to attack the physical memory to which these PTEs resolve.
> 
> Maybe also add this:
> 
>      "In some cases user-space can maliciously influence (i.e. set to a broad 
>       range of arbitrary values) the information encoded in the address bits of 
>       the PTE - making attacks more deterministic and more practical."

Good point.

> > +   The Linux kernel contains a mitigation for this attack vector which is
> > +   permanently enabled and has no performance impact. A system with an up
> > +   to date kernel is not vulnerable as there is no way for a malicious
> > +   application to control the content of PTEs which are not marked present.
> > +
> > +   |
> 
> It might make sense to mention the acronym of the mitigation here?

Done.

> > +2. Malicious guest in a virtual machine
> > +
> > +   The fact that L1TF breaks all domain protections allows malicious guest
> > +   OSes, which can control the PTEs directly, and malicious userspace,
> > +   which runs on an unprotected guest kernel, to attack physical host
> > +   memory.
> 
> s/malicious userspace
>  /malicious guest user-space applications
> 
> ?
> 
> Also, maybe mention here that 'unprotected guest kernel' here refers to the 
> non-present PTE encoding protection technique, because it's a bit vague here I 
> think.

Ok.

> Maybe, for more clarity:
> 
>   The Linux kernel provides a sysfs interface to enumerate the current L1TF status 
>   of the system: whether the system is vulnerable, and which mitigations are 
>   available and active. The relevant sysfs file is:
> 
>    /sys/devices/system/cpu/vulnerabilities/l1tf

Ok.

> > +2. Guest VCPU confinement to dedicated physical cores
> > +
> > +   SMT might be a viable alternative for particular scenarios.
> > +
> > +   For further information about confining guests to a single or to a group
> > +   of cores consult the cpusets documentation.
> 
> I've edited this one in place, just a few article problems.

Bah, the changes were hard to spot....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 1/2] Command line and documentation 1
  2018-07-09 13:14           ` Thomas Gleixner
  2018-07-09 13:21             ` [MODERATED] " Jiri Kosina
@ 2018-07-09 15:32             ` Josh Poimboeuf
  2018-07-09 15:40               ` Thomas Gleixner
  2018-07-09 15:44               ` [MODERATED] " Jiri Kosina
  1 sibling, 2 replies; 53+ messages in thread
From: Josh Poimboeuf @ 2018-07-09 15:32 UTC (permalink / raw)
  To: speck

On Mon, Jul 09, 2018 at 03:14:50PM +0200, speck for Thomas Gleixner wrote:
> On Mon, 9 Jul 2018, speck for Thomas Gleixner wrote:
> > On Sun, 8 Jul 2018, speck for Josh Poimboeuf wrote:
> > > On Sun, Jul 08, 2018 at 04:13:02PM +0200, speck for Thomas Gleixner wrote:
> > > > > However, I think it *does* make sense for the force case.  And I think
> > > > > it should be similarly enforced in vmentry_l1d_flush_set().
> > > > >
> > > > > In fact, this logic should probably all be moved to
> > > > > vmentry_l1d_flush_set() anyway.  I'm pretty sure that code gets called
> > > > > before this during module load.
> > > > 
> > > > Why? If vmentry_l1d_flush is set correctly during init then
> > > > vmentry_l1d_flush_set() will do the right thing. No need to make the same
> > > > decision over and over. The result won't change because neither
> > > > vmentry_l1d_fliush nor l1tf_mitigation can change. Aside of that we really
> > > > need to do this in the init function because that also controls the static
> > > > key.
> > > 
> > > Yeah, I guess I had been assuming that the sysfs entry for
> > > 'vmentry_l1d_flush' was read/write.
> > > 
> > > Jiri was talking about making the 'l1tf' option runtime changeable,
> > > which I think is important.  In that case it might make sense to do the
> > > same with 'vmentry_l1d_flush' by making the sysfs read/write.  Then
> > > these checks (and the static key change) would belong in
> > > vmentry_l1d_flush_set().
> > 
> > I see what you mean. Yes, that makes sense.
> 
> But, making l1tf writeable does not make much sense to me and would require
> to have notifiers if vmx is a module, eew! Let's not go there please.

Why would notifiers be needed?  Is there a reason kvm can't just check a
kernel global variable (or call a helper) at vmentry?

> So I'd rather make l1tf just a command line option and let VMX module
> parameters control the mess. That means we'd have the following l1tf=
> options:
> 
>    full:      	   SMT off, KVM flush cond
>    
>    full,force:	   Same as above, but SMT is forced off and KVM flush
>    		   can only be switched between cond and always, but not
> 		   to never
> 
>    auto:	   Let the kernel decide. KVM will warn and have the l1d
>    		   default
> 
>    off:		   set KVM flush to never and do not warn
> 
> Now the kvm module should have a kvm_intel.l1tf module parameter as well
> with the following possible options:
> 
>    l1d-always
>    l1d-cond
>    l1d-never
>    nowarn
>    l1d-always,nowarn
>    l1d-cond,nowarn
>    l1d-never,nowarn
> 
> Default: l1d-cond or whatever we decide
> 
> Some of the 'nowarn' combos make probably no sense, but I can't make my
> mind up which. 
> 
> The module parameter sysfs file can be writeable and allows run time
> control. Transitioning from and to l1d-always might be tricky, because
> l1d-always uses the MSR list and that is done at guest setup time, not at
> runtime.
> 
> Hmm?

I think runtime control is important -- much moreso with this bug than
with previous bugs -- because the mitigation is so disruptive, and is
usage-dependent.

However I think the above would be pretty confusing.

-- 
Josh

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/2] Command line and documentation 1
  2018-07-09 15:32             ` Josh Poimboeuf
@ 2018-07-09 15:40               ` Thomas Gleixner
  2018-07-09 15:44               ` [MODERATED] " Jiri Kosina
  1 sibling, 0 replies; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-09 15:40 UTC (permalink / raw)
  To: speck

On Mon, 9 Jul 2018, speck for Josh Poimboeuf wrote:
> On Mon, Jul 09, 2018 at 03:14:50PM +0200, speck for Thomas Gleixner wrote:
> > > > Jiri was talking about making the 'l1tf' option runtime changeable,
> > > > which I think is important.  In that case it might make sense to do the
> > > > same with 'vmentry_l1d_flush' by making the sysfs read/write.  Then
> > > > these checks (and the static key change) would belong in
> > > > vmentry_l1d_flush_set().
> > > 
> > > I see what you mean. Yes, that makes sense.
> > 
> > But, making l1tf writeable does not make much sense to me and would require
> > to have notifiers if vmx is a module, eew! Let's not go there please.
> 
> Why would notifiers be needed?  Is there a reason kvm can't just check a
> kernel global variable (or call a helper) at vmentry?

You really want another conditional in that code path. Shrug.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 1/2] Command line and documentation 1
  2018-07-09 15:32             ` Josh Poimboeuf
  2018-07-09 15:40               ` Thomas Gleixner
@ 2018-07-09 15:44               ` Jiri Kosina
  1 sibling, 0 replies; 53+ messages in thread
From: Jiri Kosina @ 2018-07-09 15:44 UTC (permalink / raw)
  To: speck

On Mon, 9 Jul 2018, speck for Josh Poimboeuf wrote:

> > But, making l1tf writeable does not make much sense to me and would require
> > to have notifiers if vmx is a module, eew! Let's not go there please.
> 
> Why would notifiers be needed?  Is there a reason kvm can't just check a
> kernel global variable (or call a helper) at vmentry?

My original plan was to introduce l1tf control file for runtime toggling, 
that'd in its _store() callback perform any action necessary for given 
target state (offline SMT siblings and update the global l1tf_mitigation 
variable, etc.).

If there are concerns about added overhead to KVM (as it'd have to be 
checking l1tf_mitigation on every vmenter), I guess we could easily turn 
that into a static key as part of introducing that 'runtime toggling' 
patch.

So I'd still suggest to start sorting this out only after the boot-time 
format / semantics have been sorted out (which I believe has happened by 
now).

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 1/2] Command line and documentation 1
  2018-07-08 12:52 ` [patch 1/2] Command line and documentation 1 Thomas Gleixner
                     ` (2 preceding siblings ...)
  2018-07-09 10:26   ` Ingo Molnar
@ 2018-07-09 21:45   ` Andi Kleen
  2018-07-09 22:08     ` Andi Kleen
                       ` (2 more replies)
  3 siblings, 3 replies; 53+ messages in thread
From: Andi Kleen @ 2018-07-09 21:45 UTC (permalink / raw)
  To: speck

> +			full,force
> +				Same as 'full', but disables SMT
> +				control. Implies the 'nosmt=force' command
> +				line option.

I'm still unclear what the use case for this force thing is. Can you please
clarify? If someone controls the host kernel they can do all kinds of insecure 
things, so why go to all this effort just to prevent them from 
opening a side channel.  It seems like unnecessary complexity.
I would remove all the force options.

> +
> +			novirt
> +				Leaves SMT enabled and does not enable the
> +				hypervisor mitigation. Hypervisors will
> +				issue a warning when the first VM is being
> +				started in a potentially insecure
> +				configuration, i.e. SMT enabled or L1D
> +				flush disabled.
> +
> +			novirt,nowarn
> +				Same as 'novirt', but hypervisors will not
> +				warn when a VM is started in a potentially
> +				insecure configuration.

So why is there no option to only enable the L1D flush? 

It would seem to me that would be better default than no mitigation at all.

>  
> +enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_NOVIRT;

Can you remind me why there is no sysfs interface for this?

Jiri originally wanted boot options because apparently
they have an no ability to install a new boot up script in an update. 
So the need for an boot up option is more an hack to work
around this deficiency.

But I suspect long term for most users it's far
better to handle this dynamically at run time, and not
do unnecessary reboots, and I can't see any reason 
why that wouldn't be possible.

So should have a sysfs interface.

> +	[VMENTER_L1D_FLUSH_ALWAYS]	= "fully protected"
> +};
> +
> +static ssize_t l1tf_show_state(char *buf)
> +{
> +	if (l1tf_vmx_mitigation == L1TF_VMX_UNKNOWN)
> +		return sprintf(buf, "%s\n", l1tf_states[l1tf_mitigation]);
> +
> +	return sprintf(buf, "%s, VMX: SMT %s L1D %s\n",

I would call it Single-Thread instead of L1D

That's much more intuitive for someone who is not deep into
micro architecture.

>  
> -#define L1TF_MSG "SMT enabled with L1TF CPU bug present. Refer to CVE-2018-3620 for details.\n"
> +#define L1TF_MSG_NOVIRT "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. Refer to CVE-2018-3646 for details.\n"
> +#define L1TF_MSG_SMT "L1TF CPU bug present and SMT enabled, data leak possible. Refer to CVE-2018-3646 for details.\n"

This should refer to the Documentation file, not the CVE.
CVEs are not really useful to describe mitigations.


-Andi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-08 12:52 ` [patch 2/2] Command line and documentation 2 Thomas Gleixner
                     ` (2 preceding siblings ...)
  2018-07-09 11:04   ` Ingo Molnar
@ 2018-07-09 22:07   ` Andi Kleen
  2018-07-09 23:00     ` Josh Poimboeuf
                       ` (2 more replies)
  3 siblings, 3 replies; 53+ messages in thread
From: Andi Kleen @ 2018-07-09 22:07 UTC (permalink / raw)
  To: speck

> +   - Processors which have the ARCH_CAP_RDCL_NO bit set in the
> +     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is also not
> +     affected by the Meltdown vulnerabitly. These CPUs should become
> +     available end of 2018.

Would be better to specify the sysfs file output here.


> +Problem
> +-------
> +
> +If an instruction accesses a virtual address for which the relevant page
> +table entry (PTE) has the present bit cleared, then the speculative
> +execution can load the data into the speculation flow when the data from


"load the into the speculation flow" ?

No idea what that means. I would just drop it.

> +the physical address which is referenced in the PTE address bits is
> +available in the Level 1 Data Cache. This is a purely speculative
> +mechanism and the instruction will raise a page fault when it is retired.
> +
> +This creates a window between the load and retirement where speculative
> +execution can operate on the data and malicious code can use this to create
> +a side channel to leak the speculated data which was accessed without
> +permission.
> +
> +This flaw is very similar to the Meltdown vulnerability, which speculates
> +on data which should be not accessible from user space because the
> +speculation ignores the permission bits. Contrary to Meltdown L1TF can not
> +be exploited without actually generating page faults. While Meltdown breaks

Not correct due to TSX.

> +the user space to kernel space protection, L1TF has a broader scope. It

It's actually not broader for user space, but much more narrow.

> +allows to attack any physical memory address in the system and the attack

.. but only when the page table entry is controllable. 

> +works across all protection domains. It allows to attack SGX and also
> +works from inside virtual machines because the speculation bypasses the

which is only the case for virtual machines.

> +extended page table (EPT) protection mechanism.


... and virtual machines can control page table entries.

> +2. Malicious guest in a virtual machine
> +
> +   The fact that L1TF breaks all domain protections allows malicious guest
> +   OSes, which can control the PTEs directly, and malicious userspace,
> +   which runs on an unprotected guest kernel, to attack physical host
> +   memory.

... to attack other guests or the host

> +
> +1. L1D flush on VMENTER
> +
> +   To make sure that a guest cannot attack data which is present in L1D the
> +   hypervisor flushes L1D before entering the guest.
> +
> +   Flushing L1D evicts not only the data which should not be accessed by a
> +   potentially malicious guest, it also flushes the guest data. Flushing

I suspect this is misleading because most non trivial exits already effectively
clear the L1.

> +   L1D has a performance impact as the processor has to bring the flushed
> +   guest data back into L1D. Depending on the frequency of VMEXIT/VMENTER
> +   and the type of computations in the guest performance degradation in the
> +   range of 1% to 50% has been observed. For scenarios where guest
> +   VMEXIT/VMENTER are rare the performance impact is minimal. Virtio and
> +   mechanisms like posted interrupts are designed to confine the VMEXITs to
> +   a bare minimum, but specific configurations and application scenarios
> +   might still suffer from a high VMEXIT rate.
> +
> +   The general recommendation is to enable L1D flush on VMENTER.

... right so we should make it default.

> +2. Guest VCPU confinement to dedicated physical cores
> +
> +   To address the SMT problem, it is possible to make a guest or a group of
> +   guests affine to one or more physical cores. The proper mechanism for
> +   that is to utilize cpusets and to make sure that no other guest or host
> +   tasks can run on these cores.

Need to refer to exclusive cpusets here.  If the cpuset is not exclusive
it doesn't help.

> +   The host memory is attackable, when one of the sibling threads runs in
> +   host OS (hypervisor) context and the other in guest context. The amount
> +   of valuable information from the host OS context depends on the context
> +   which the host OS executes, i.e. interrupts, soft interrupts and kernel
> +   threads. The amount of valuable data from these contexts cannot be
> +   declared as non interesting for an attacker without deep inspection of
> +   the code.

The interrupt part seems misplaced in this section, should be in the next.

> +
> +   Note, that assigning guests to a fixed set of physical cores affects the
> +   ability of the scheduler to do load balancing and might have negative
> +   effects on CPU utilization depending on the hosting scenario. Disabling
> +   SMT might be a viable alternative for particular scenarios.
> +
> +   For further information about confining guests to a single or to a group
> +   of cores consult the cpusets documentation.

.... add a pointer to the document here ... 

> +   true because there are types of interrupts which are truly per CPU
> +   interrupts, e.g. the local timer interrupt. Aside of that multi queue
> +   devices affine their interrupts to single CPUs or groups of CPUs per
> +   queue without allowing the administrator to control the affinities.

Mention that timers run on the CPUs they were triggered on, so they
are effectively tied to the processes?

> +   /sys/devices/system/cpu/smt/active:
> +
> +     This file reports whether SMT is enabled and active, i.e. if on any
> +     physical core two or more sibling threads are online.
> +
> +There is ongoing research and development for other mitigation mechanisms
> +to address the performance impact of disabling SMT.

Really need to refer to any sysfs for L1D here.

> +		configuration, i.e. SMT enabled or L1D flush disabled.
> +
> +  novirt,nowarn: Same as 'novirt', but hypervisors will not warn when
> +		 a VM is started in a potentially insecure configuration.
> +
> +The default is 'novirt'.
> +
> +Mitigation control for KVM - command line or module parameter
> +-------------------------------------------------------------
> +
> +The KVM hypervisor mitigation mechanism, flushing the L1D cache when
> +entering a guest, can be controlled from the kernel command line or when
> +the KVM-Intel hypervisor is built as a module also with a module parameter.

That can be also changed dynamically through /sys/module/kvm/parameters/* right? 

That would be useful to document.


> +
> +The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
> +following arguments::
> +
> +  always:	L1D cache flush on every VMENTER.
> +
> +  cond:		Flush L1D on VMENTER only when the code between VMEXIT and
> +		VMENTER can leak host memory which is considered
> +		interesting for an attacker. This still can leak host data
> +		which allows e.g. to determine the hosts address space layout.
> +
> +  never:	Disables the mitigation
> +
> +The default is 'cond'. If 'l1tf=full' or 'l1tf=full,force' are given on the
> +kernel command line, these take precedence.

Ah, so the command line description was actually wrong. The default
is not novirt, but cond. That's good. But really need to fix that description
in the other patch ...

> +3. Virtualization with untrusted guests
> +
> +   If SMT is not supported by the processor or disabled in the BIOS or by
> +   the kernel, it's only required to enforce L1D flushing on VMENTER. This
> +   can be achieved with the l1tf or the kvm-intel command line or module
> +   parameters.

With cond in most cases they should be already ok?

Just need a trade off here between cond and always.

something like

With the default a guest can determine the address space layout
of the hypervisor, but no user data.

> +   - Isolating the guest CPUs from interrupts can reduce the attack surface
> +     further, but still allows a malicious guest to explore a limited
> +     amount of host physical memory. This can at least be used to gain
> +     knowledge about the host address space layout. The interrupts which
> +     have a fixed affinity to the CPUs which run the untrusted guests can
> +     depending on the scenario still trigger soft interrupts and schedule
> +     kernel threads which might expose valuable information.
> +
> +   - Disabling SMT and enforcing the L1D flushing provides the maximum

enforcing it with always

-Andi

> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 1/2] Command line and documentation 1
  2018-07-09 21:45   ` Andi Kleen
@ 2018-07-09 22:08     ` Andi Kleen
  2018-07-09 22:40     ` Jiri Kosina
  2018-07-10 11:53     ` Thomas Gleixner
  2 siblings, 0 replies; 53+ messages in thread
From: Andi Kleen @ 2018-07-09 22:08 UTC (permalink / raw)
  To: speck

On Mon, Jul 09, 2018 at 02:45:57PM -0700, speck for Andi Kleen wrote:
> > +				Leaves SMT enabled and does not enable the
> > +				hypervisor mitigation. Hypervisors will
> > +				issue a warning when the first VM is being
> > +				started in a potentially insecure
> > +				configuration, i.e. SMT enabled or L1D
> > +				flush disabled.
> > +
> > +			novirt,nowarn
> > +				Same as 'novirt', but hypervisors will not
> > +				warn when a VM is started in a potentially
> > +				insecure configuration.
> 
> So why is there no option to only enable the L1D flush? 
> 
> It would seem to me that would be better default than no mitigation at all.
> 
> >  
> > +enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_NOVIRT;
> 
> Can you remind me why there is no sysfs interface for this?

Ok it looks like with the KVM parameter it's already there.
Just need to refer to it from here.

"These options can be also switched at run time with
/sys/module/kvm/parameters/... " 

> 
> Jiri originally wanted boot options because apparently
> they have an no ability to install a new boot up script in an update. 
> So the need for an boot up option is more an hack to work
> around this deficiency.
> 
> But I suspect long term for most users it's far
> better to handle this dynamically at run time, and not
> do unnecessary reboots, and I can't see any reason 
> why that wouldn't be possible.
> 
> So should have a sysfs interface.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 1/2] Command line and documentation 1
  2018-07-09 21:45   ` Andi Kleen
  2018-07-09 22:08     ` Andi Kleen
@ 2018-07-09 22:40     ` Jiri Kosina
  2018-07-10 11:53     ` Thomas Gleixner
  2 siblings, 0 replies; 53+ messages in thread
From: Jiri Kosina @ 2018-07-09 22:40 UTC (permalink / raw)
  To: speck

On Mon, 9 Jul 2018, speck for Andi Kleen wrote:

> > +enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_NOVIRT;
> 
> Can you remind me why there is no sysfs interface for this?

Well, noone implemented it yet. Once the cmdline option is in, sysfs knob 
can be added on top as soon as the code is written by someone (we'd have 
to turn that l1tf_mitigation into a static key in some smart way so that 
KVM's performance doesn't suffer by having to check it all the time, 
etc.).

We control mitigations for all the previous vulnerabilities from this 
"class" via boot-option only, so it looks like a reasonable first thing to 
have for l1tf as well.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-09 22:07   ` [MODERATED] " Andi Kleen
@ 2018-07-09 23:00     ` Josh Poimboeuf
  2018-07-09 23:11       ` Andi Kleen
  2018-07-10 19:36     ` Thomas Gleixner
  2018-07-11 14:03     ` [MODERATED] " Jon Masters
  2 siblings, 1 reply; 53+ messages in thread
From: Josh Poimboeuf @ 2018-07-09 23:00 UTC (permalink / raw)
  To: speck

On Mon, Jul 09, 2018 at 03:07:01PM -0700, speck for Andi Kleen wrote:
> > +Problem
> > +-------
> > +
> > +If an instruction accesses a virtual address for which the relevant page
> > +table entry (PTE) has the present bit cleared, then the speculative
> > +execution can load the data into the speculation flow when the data from
> 
> 
> "load the into the speculation flow" ?
> 
> No idea what that means. I would just drop it.

It might not be precisely worded, but FWIW I thought it was helpful.

> > +Mitigation control for KVM - command line or module parameter
> > +-------------------------------------------------------------
> > +
> > +The KVM hypervisor mitigation mechanism, flushing the L1D cache when
> > +entering a guest, can be controlled from the kernel command line or when
> > +the KVM-Intel hypervisor is built as a module also with a module parameter.
> 
> That can be also changed dynamically through /sys/module/kvm/parameters/* right? 
> 
> That would be useful to document.

Not yet, it's read-only for now.

> > +
> > +The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
> > +following arguments::
> > +
> > +  always:	L1D cache flush on every VMENTER.
> > +
> > +  cond:		Flush L1D on VMENTER only when the code between VMEXIT and
> > +		VMENTER can leak host memory which is considered
> > +		interesting for an attacker. This still can leak host data
> > +		which allows e.g. to determine the hosts address space layout.
> > +
> > +  never:	Disables the mitigation
> > +
> > +The default is 'cond'. If 'l1tf=full' or 'l1tf=full,force' are given on the
> > +kernel command line, these take precedence.
> 
> Ah, so the command line description was actually wrong. The default
> is not novirt, but cond. That's good. But really need to fix that description
> in the other patch ...

This description isn't correct.  The default is 'never', unless
'l1tf=full' or 'l1tf=full,force' is used, in which case the default is
'cond'.

-- 
Josh

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-09 23:00     ` Josh Poimboeuf
@ 2018-07-09 23:11       ` Andi Kleen
  2018-07-09 23:45         ` Linus Torvalds
  2018-07-10  7:41         ` Thomas Gleixner
  0 siblings, 2 replies; 53+ messages in thread
From: Andi Kleen @ 2018-07-09 23:11 UTC (permalink / raw)
  To: speck

> > Ah, so the command line description was actually wrong. The default
> > is not novirt, but cond. That's good. But really need to fix that description
> > in the other patch ...
> 
> This description isn't correct.  The default is 'never', unless
> 'l1tf=full' or 'l1tf=full,force' is used, in which case the default is
> 'cond'.

IMHO that's wrong. We should default to cond at least.

That would give reasonable security by default for most people.

And hopefully the cond optimizations avoids some of the worst case performance
impacts for short exits, but that still remains to be seen.

-Andi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-09 23:11       ` Andi Kleen
@ 2018-07-09 23:45         ` Linus Torvalds
  2018-07-10  2:44           ` Josh Poimboeuf
  2018-07-10  7:41         ` Thomas Gleixner
  1 sibling, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2018-07-09 23:45 UTC (permalink / raw)
  To: speck



On Mon, 9 Jul 2018, speck for Andi Kleen wrote:
> 
> IMHO that's wrong. We should default to cond at least.

Right.

It may not be the sensible option for somebody who actually has HT, but 
people should remember that a lot of CPU's simply don't have HT at all 
(even from Intel, because it's a "value add" feature, and marketing people 
play with it).

So we don't want to disable HT by default, but that doesn't mean that we 
shouldn't bother with the L1 flush. Maybe people disabled HT in the BIOS, 
which is *better* than disabling it with a kernel command line?

            Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-09 23:45         ` Linus Torvalds
@ 2018-07-10  2:44           ` Josh Poimboeuf
  2018-07-10  5:57             ` Jiri Kosina
  2018-07-10 17:46             ` Linus Torvalds
  0 siblings, 2 replies; 53+ messages in thread
From: Josh Poimboeuf @ 2018-07-10  2:44 UTC (permalink / raw)
  To: speck

On Mon, Jul 09, 2018 at 04:45:22PM -0700, speck for Linus Torvalds wrote:
> 
> 
> On Mon, 9 Jul 2018, speck for Andi Kleen wrote:
> > 
> > IMHO that's wrong. We should default to cond at least.
> 
> Right.
> 
> It may not be the sensible option for somebody who actually has HT, but 
> people should remember that a lot of CPU's simply don't have HT at all 
> (even from Intel, because it's a "value add" feature, and marketing people 
> play with it).
> 
> So we don't want to disable HT by default, but that doesn't mean that we 
> shouldn't bother with the L1 flush. Maybe people disabled HT in the BIOS, 
> which is *better* than disabling it with a kernel command line?

Ha, and I thought you had said the opposite before.

I don't have a strong opinion either way.  If we go this way then Jiri's
patch will need a little more rework.

-- 
Josh

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-10  2:44           ` Josh Poimboeuf
@ 2018-07-10  5:57             ` Jiri Kosina
  2018-07-10  6:22               ` Jiri Kosina
  2018-07-10 17:46             ` Linus Torvalds
  1 sibling, 1 reply; 53+ messages in thread
From: Jiri Kosina @ 2018-07-10  5:57 UTC (permalink / raw)
  To: speck

On Mon, 9 Jul 2018, speck for Josh Poimboeuf wrote:

> > > IMHO that's wrong. We should default to cond at least.
> > 
> > Right.
> > 
> > It may not be the sensible option for somebody who actually has HT, but 
> > people should remember that a lot of CPU's simply don't have HT at all 
> > (even from Intel, because it's a "value add" feature, and marketing people 
> > play with it).
> > 
> > So we don't want to disable HT by default, but that doesn't mean that we 
> > shouldn't bother with the L1 flush. Maybe people disabled HT in the BIOS, 
> > which is *better* than disabling it with a kernel command line?
> 
> Ha, and I thought you had said the opposite before.

Yeah; I've re-read the "[PATCH 1/1] SMTDOC 0" thread, and I have to say I 
am still confused :)

So to make sure that we don't do another 10 respins of the cmdline patch; 
which of the below do you propose should be the kernel's default stance:

- cloud people are special, the default view of the world is that either
  there are no VMs at all and if there are, they don't care about 
  isolation. That would be current 'novirt' as the default (no flushes, HT 
  stays on)

- kernel has to provide VM isolation by default. That would be current 
  'full' (conditional flushes, HT off)

- something in between. How exactly do you envision that to work? 

  (1) Do the conditional flushes in case kernel detects that there is no 
      HT during boot? That seems odd, as HT systems would not get the
      L1D flush performance penalization, while non-HT systems would; I am 
      sure there are going to be a lot of questions once people measure 
      such an odd discrepancy in their performance profiles.

  (2) Do the conditional flushes always, no matter the HT state. What 
      I really don't like about that option is that (a) brings all the 
      non-negligible performance degradation coming from the L1D flushes, 
      while at the same time (b) it doesn't provide the isolation (on HT).

  (3) Something else?

> I don't have a strong opinion either way.  If we go this way then Jiri's 
> patch will need a little more rework.

I am not sending v9 before we absolutely clarify this thing. It's getting 
a bit tiring :)

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-10  5:57             ` Jiri Kosina
@ 2018-07-10  6:22               ` Jiri Kosina
  0 siblings, 0 replies; 53+ messages in thread
From: Jiri Kosina @ 2018-07-10  6:22 UTC (permalink / raw)
  To: speck

On Tue, 10 Jul 2018, speck for Jiri Kosina wrote:

> - something in between. How exactly do you envision that to work? 
> 
>   (1) Do the conditional flushes in case kernel detects that there is no 
>       HT during boot? That seems odd, as HT systems would not get the
>       L1D flush performance penalization, while non-HT systems would; I am 
>       sure there are going to be a lot of questions once people measure 
>       such an odd discrepancy in their performance profiles.

Thinking about this a little bit more, I really don't like this option. In 
addition to the argument above, it's also pretty confusing for users that 
the level of the mitigation/protection kernel provides by default is 
basically not well defined, as it's "well, it depends".

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 2/2] Command line and documentation 2
  2018-07-09 23:11       ` Andi Kleen
  2018-07-09 23:45         ` Linus Torvalds
@ 2018-07-10  7:41         ` Thomas Gleixner
  2018-07-10  8:44           ` [MODERATED] " Jiri Kosina
  1 sibling, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-10  7:41 UTC (permalink / raw)
  To: speck

On Mon, 9 Jul 2018, speck for Andi Kleen wrote:

> > > Ah, so the command line description was actually wrong. The default
> > > is not novirt, but cond. That's good. But really need to fix that description
> > > in the other patch ...
> > 
> > This description isn't correct.  The default is 'never', unless
> > 'l1tf=full' or 'l1tf=full,force' is used, in which case the default is
> > 'cond'.
> 
> IMHO that's wrong. We should default to cond at least.
> 
> That would give reasonable security by default for most people.

I agree for the SMT off case and it's the right thing to do.

But for SMT on, calling it reasonable is entirely unreasonable, because
it's just fooling people into a cozy feeling of being secure.

Granted, it slows down the attack a bit, but that does not qualify as
reasonable protection at all.

Attacking a busy SSH connection from a hacked up guest still works nicely
and efficiently. The slow down with L1D flush is just in the range of
10%. Why?  because the malicious guest rarely takes a VMEXIT.

So claiming that L1D flush is providing reasonable security is at least
actively misleading. 

Aside of that the pr_warn_once() telling that SMT is insecure is not
necessarily visible for users under all circumstances. It's just another
part of pretending to care about security honestly.

IMNSHO, the right thing to do is to:

  1) Make L1D conditional flush the default

  2) Refuse to start a guest when SMT is enabled unless the user/admin
     explicitely disabled that protection.

Everything else is unreasonable and in my book unresponsible.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-10  7:41         ` Thomas Gleixner
@ 2018-07-10  8:44           ` Jiri Kosina
  2018-07-10 10:32             ` Jiri Kosina
  2018-07-10 22:57             ` Josh Poimboeuf
  0 siblings, 2 replies; 53+ messages in thread
From: Jiri Kosina @ 2018-07-10  8:44 UTC (permalink / raw)
  To: speck

On Tue, 10 Jul 2018, speck for Thomas Gleixner wrote:

> IMNSHO, the right thing to do is to:
> 
>   1) Make L1D conditional flush the default
> 
>   2) Refuse to start a guest when SMT is enabled unless the user/admin
>      explicitely disabled that protection.
> 
> Everything else is unreasonable and in my book unresponsible.

I personally like it; it doesn't match Linus' original "in general case 
people don't care, as it's only a special group of people who really 
require proper VM isolation" though, so I'd wait a bit before that's 
clarified to avoid more ping-pong.

But generally, if we take that aproach, and in the name of keeping it as 
simple as possible, this would make sense I think:

	l1tf=full,force
		- L1D_FLUSH_ALWAYS
		- nosmt=force
	l1tf=full
		- L1D_FLUSH_COND
		- nosmt
	l1tf=conservative
		- L1D_FLUSH_COND
		- refuse to start VM with SMT on
			(perhaps with kvm module option that allows 
			 to override?)
	l1tf=conservative,warn
		- L1D_FLUSH_COND
		- warn when starting VM with SMT on
	l1tf=off
		- L1D_FLUSH_NEVER
		- allow starting VMs without any restrictions/warnings

My understanding is that Andi would like to see 'conservative,warn' as a 
default. I personally also see that unresponsible and confusing; it is too 
easy for users to fall into the trap of thinking they're secure while 
they're not, plus it provides degraded performance without the actual 
isolation (on SMT).

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-10  8:44           ` [MODERATED] " Jiri Kosina
@ 2018-07-10 10:32             ` Jiri Kosina
  2018-07-10 22:57             ` Josh Poimboeuf
  1 sibling, 0 replies; 53+ messages in thread
From: Jiri Kosina @ 2018-07-10 10:32 UTC (permalink / raw)
  To: speck

On Tue, 10 Jul 2018, speck for Jiri Kosina wrote:

> But generally, if we take that aproach, and in the name of keeping it as 
> simple as possible, this would make sense I think:
> 
> 	l1tf=full,force
> 		- L1D_FLUSH_ALWAYS
> 		- nosmt=force
> 	l1tf=full
> 		- L1D_FLUSH_COND
> 		- nosmt
> 	l1tf=conservative
> 		- L1D_FLUSH_COND
> 		- refuse to start VM with SMT on
> 			(perhaps with kvm module option that allows 
> 			 to override?)
> 	l1tf=conservative,warn
> 		- L1D_FLUSH_COND
> 		- warn when starting VM with SMT on
> 	l1tf=off
> 		- L1D_FLUSH_NEVER
> 		- allow starting VMs without any restrictions/warnings
> 
> My understanding is that Andi would like to see 'conservative,warn' as a 
> default. I personally also see that unresponsible and confusing; it is too 
> easy for users to fall into the trap of thinking they're secure while 
> they're not, plus it provides degraded performance without the actual 
> isolation (on SMT).

OTOH making l1tf=conservative the default is also horrible from user 
experience POV.

Given all that, I still slightly prefer my original "all or nothing for 
VMs" aproach.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/2] Command line and documentation 1
  2018-07-09 21:45   ` Andi Kleen
  2018-07-09 22:08     ` Andi Kleen
  2018-07-09 22:40     ` Jiri Kosina
@ 2018-07-10 11:53     ` Thomas Gleixner
  2 siblings, 0 replies; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-10 11:53 UTC (permalink / raw)
  To: speck

On Mon, 9 Jul 2018, speck for Andi Kleen wrote:
> > +			full,force
> > +				Same as 'full', but disables SMT
> > +				control. Implies the 'nosmt=force' command
> > +				line option.
> 
> I'm still unclear what the use case for this force thing is. Can you please
> clarify? If someone controls the host kernel they can do all kinds of insecure 
> things, so why go to all this effort just to prevent them from 
> opening a side channel.  It seems like unnecessary complexity.
> I would remove all the force options.

That's neither lots of effort nor a huge complexity. And this has been
discussed to death already with the nosmt option.

> > +enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_NOVIRT;
> 
> Can you remind me why there is no sysfs interface for this?

because we need to sort out the options first and then adding a sysfs entry
is just a patch on top.

> > +	[VMENTER_L1D_FLUSH_ALWAYS]	= "fully protected"
> > +};
> > +
> > +static ssize_t l1tf_show_state(char *buf)
> > +{
> > +	if (l1tf_vmx_mitigation == L1TF_VMX_UNKNOWN)
> > +		return sprintf(buf, "%s\n", l1tf_states[l1tf_mitigation]);
> > +
> > +	return sprintf(buf, "%s, VMX: SMT %s L1D %s\n",
> 
> I would call it Single-Thread instead of L1D
> 
> That's much more intuitive for someone who is not deep into
> micro architecture.

I can see how

'L1TF: Mitigation: PTE inversion, VMX: SMT disabled, Single-Thread conditional flush'

is more intuitive.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-10  2:44           ` Josh Poimboeuf
  2018-07-10  5:57             ` Jiri Kosina
@ 2018-07-10 17:46             ` Linus Torvalds
  2018-07-10 21:22               ` Thomas Gleixner
  1 sibling, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2018-07-10 17:46 UTC (permalink / raw)
  To: speck



On Mon, 9 Jul 2018, speck for Josh Poimboeuf wrote:
> 
> Ha, and I thought you had said the opposite before.

Again, let me explain the *fundamental* issue, not any particular detail 
in the defaults.

The fundamental issue is that 99% of all regular people who have 
virtualization support simply never even use it. It's there. Who cares? 
Not most people.

And if they do use it, it's very much incidental. It's running some legacy 
windows binary, it's doing some Linux kernel bootup test, or whatever. 
It's simply not a big deal.

Put another way: 99% of all regular people do not provide virtualized 
cloud services.

What does that mean?

It means that we absolutely do not turn of HT by default. Not because we 
don't care about L1TF, but because we don't care about virtualization AT 
ALL.

Comprende?

But then, in the rare case that such a person uses kvm, at that point 
there's absolutely no reason to not flush the L1 cache. Again, we don't 
actually *care* about virtualization to a big degree. It's simply not a 
primary thing. We warn if HT is enabled, but that's more as a 
notification: "if you do care, you should know you're doing something 
wrong".

But we *should* flush the L1 cache by default, because once you use kvm, 
you care at least enough to use it. So do something best-effort at that 
point. Just not enough to screw EVERYTHING ELSE up and remove all those HT 
threads!

See the argument? It really boils down to:

 (a) Disabling HT screws _others_. We don't do that unless virtualization 
     is actually a big deal.

 (b) Doing the L1 flush on entry only screws virtualization itself, not 
     everything else. Just do it. It might help, and might make it harder 
     to reliably trigger something.

Finally, the argument is that people may not have HT or may have disabled 
it in the BIOS, means that obviously if that happens, you don't warn, but 
you (again) flush the L1 before using it by default. At that point you're 
actually safe, but there's no way for the kernel to know whether you're 
safe on purpose, or just by mistake. 

                Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 2/2] Command line and documentation 2
  2018-07-09 22:07   ` [MODERATED] " Andi Kleen
  2018-07-09 23:00     ` Josh Poimboeuf
@ 2018-07-10 19:36     ` Thomas Gleixner
  2018-07-11 14:03     ` [MODERATED] " Jon Masters
  2 siblings, 0 replies; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-10 19:36 UTC (permalink / raw)
  To: speck

On Mon, 9 Jul 2018, speck for Andi Kleen wrote:
> > +   - Processors which have the ARCH_CAP_RDCL_NO bit set in the
> > +     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is also not
> > +     affected by the Meltdown vulnerabitly. These CPUs should become
> > +     available end of 2018.
> 
> Would be better to specify the sysfs file output here.

Sure.
 
> > +This flaw is very similar to the Meltdown vulnerability, which speculates
> > +on data which should be not accessible from user space because the
> > +speculation ignores the permission bits. Contrary to Meltdown L1TF can not
> > +be exploited without actually generating page faults. While Meltdown breaks
> 
> Not correct due to TSX.

What I wanted to explain, is that it can't be hidden behind a non taken
branch which works very well with Meltdown, but that's not relevant in this
context, so I dropped that already. See V2

> > +the user space to kernel space protection, L1TF has a broader scope. It
> 
> It's actually not broader for user space, but much more narrow.

The scope is still broader because it allows to break through more
protection domains than Meltdown.

> > +allows to attack any physical memory address in the system and the attack
> 
> .. but only when the page table entry is controllable. 
> 
> > +works across all protection domains. It allows to attack SGX and also
> > +works from inside virtual machines because the speculation bypasses the
> 
> which is only the case for virtual machines.

And because it's only possible in a VM we have all the PTE inversion magic
on the host?

> > +1. L1D flush on VMENTER
> > +
> > +   To make sure that a guest cannot attack data which is present in L1D the
> > +   hypervisor flushes L1D before entering the guest.
> > +
> > +   Flushing L1D evicts not only the data which should not be accessed by a
> > +   potentially malicious guest, it also flushes the guest data. Flushing
> 
> I suspect this is misleading because most non trivial exits already effectively
> clear the L1.

And why is that misleading? L1D flush does exactly that whether it's done
by vmexits or by software initiated flush on vmenter.

Care to explain which vmexits flush L1D today? The SDM doesn't tell or at
least I couldn't find it.

> > +   The host memory is attackable, when one of the sibling threads runs in
> > +   host OS (hypervisor) context and the other in guest context. The amount
> > +   of valuable information from the host OS context depends on the context
> > +   which the host OS executes, i.e. interrupts, soft interrupts and kernel
> > +   threads. The amount of valuable data from these contexts cannot be
> > +   declared as non interesting for an attacker without deep inspection of
> > +   the code.
> 
> The interrupt part seems misplaced in this section, should be in the next.

Errm?

> > +   true because there are types of interrupts which are truly per CPU
> > +   interrupts, e.g. the local timer interrupt. Aside of that multi queue
> > +   devices affine their interrupts to single CPUs or groups of CPUs per
> > +   queue without allowing the administrator to control the affinities.
> 
> Mention that timers run on the CPUs they were triggered on, so they
> are effectively tied to the processes?

I really don't understand what you try to say here.

> > +   /sys/devices/system/cpu/smt/active:
> > +
> > +     This file reports whether SMT is enabled and active, i.e. if on any
> > +     physical core two or more sibling threads are online.
> > +
> > +There is ongoing research and development for other mitigation mechanisms
> > +to address the performance impact of disabling SMT.
> 
> Really need to refer to any sysfs for L1D here.

/sys/devices/system/cpu/vulnerabilities/ongoing-l1tf-research

or what?

> > +Mitigation control for KVM - command line or module parameter
> > +-------------------------------------------------------------
> > +
> > +The KVM hypervisor mitigation mechanism, flushing the L1D cache when
> > +entering a guest, can be controlled from the kernel command line or when
> > +the KVM-Intel hypervisor is built as a module also with a module parameter.
> 
> That can be also changed dynamically through /sys/module/kvm/parameters/* right? 
> 
> That would be useful to document.

It can't at the moment and that's still being discussed which control thing
is going to be writeable.

> > +3. Virtualization with untrusted guests
> > +
> > +   If SMT is not supported by the processor or disabled in the BIOS or by
> > +   the kernel, it's only required to enforce L1D flushing on VMENTER. This
> > +   can be achieved with the l1tf or the kvm-intel command line or module
> > +   parameters.
> 
> With cond in most cases they should be already ok?
> 
> Just need a trade off here between cond and always.
> 
> something like
> 
> With the default a guest can determine the address space layout
> of the hypervisor, but no user data.

Good point.

> > +   - Isolating the guest CPUs from interrupts can reduce the attack surface
> > +     further, but still allows a malicious guest to explore a limited
> > +     amount of host physical memory. This can at least be used to gain
> > +     knowledge about the host address space layout. The interrupts which
> > +     have a fixed affinity to the CPUs which run the untrusted guests can
> > +     depending on the scenario still trigger soft interrupts and schedule
> > +     kernel threads which might expose valuable information.
> > +
> > +   - Disabling SMT and enforcing the L1D flushing provides the maximum
> 
> enforcing it with always

Indeed.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 2/2] Command line and documentation 2
  2018-07-10 17:46             ` Linus Torvalds
@ 2018-07-10 21:22               ` Thomas Gleixner
  2018-07-10 21:30                 ` [MODERATED] " Linus Torvalds
  0 siblings, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-10 21:22 UTC (permalink / raw)
  To: speck

On Tue, 10 Jul 2018, speck for Linus Torvalds wrote:
> See the argument? It really boils down to:
> 
>  (a) Disabling HT screws _others_. We don't do that unless virtualization 
>      is actually a big deal.
> 
>  (b) Doing the L1 flush on entry only screws virtualization itself, not 
>      everything else. Just do it. It might help, and might make it harder 
>      to reliably trigger something.

Not if SMT is enabled, really.

> Finally, the argument is that people may not have HT or may have disabled 
> it in the BIOS, means that obviously if that happens, you don't warn, but 
> you (again) flush the L1 before using it by default. At that point you're 
> actually safe, but there's no way for the kernel to know whether you're 
> safe on purpose, or just by mistake. 

That's inconsistent and confusing. "You're accidentaly protected or maybe
not", is really not something I want to write into a documentation. Neither
I want to explain that to people who then report that the sysfs file says
vulnerable on one machine and not on the other.

So my approach would be to keep SMT enabled by default and have L1D flush
on by default, but refuse to start a guest when SMT is enabled by default.

Its easy enough to document what to do in that case for Joe User (either
disable SMT or tell KVM to shut up), but it would be a consistent and
responsible default.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-10 21:22               ` Thomas Gleixner
@ 2018-07-10 21:30                 ` Linus Torvalds
  2018-07-10 21:53                   ` Linus Torvalds
  2018-07-10 22:20                   ` Thomas Gleixner
  0 siblings, 2 replies; 53+ messages in thread
From: Linus Torvalds @ 2018-07-10 21:30 UTC (permalink / raw)
  To: speck



On Tue, 10 Jul 2018, speck for Thomas Gleixner wrote:

> On Tue, 10 Jul 2018, speck for Linus Torvalds wrote:
> > See the argument? It really boils down to:
> > 
> >  (b) Doing the L1 flush on entry only screws virtualization itself, not 
> >      everything else. Just do it. It might help, and might make it harder 
> >      to reliably trigger something.
> 
> Not if SMT is enabled, really.

Quite honestly, possibly even with SMT enabled. It at least means that the 
code and data that is *reliably* available (ie everything that leads up to 
the actual VMX entry itself) no longer is available.

So then you have to get another thread to run something you care about to 
get information out. That's already quite a big step.

> So my approach would be to keep SMT enabled by default and have L1D flush
> on by default, but refuse to start a guest when SMT is enabled by default.

What? So now you break existing workflows? No.

Seriously. What part of "people just want to run their windows binary or 
test their kernel with lkvm" is so hard to understand?

WE DO NOT BREAK EXISTING USERS. Not for a security issue that simply 
doesn't matter to them. 

I realize that most on this list are on this list *BECAUSE* they care 
about this issue. But this whole L1TF with VMX is going to be a total 
non-issue for most non-cloud users. Really.

We don't screw up normal people just because the Amazons and the Googles 
of the world care.

For example, there is absolutely _zero_ reason for me to disable SMT just 
because I want to do a "lkvm run" on a kernel to test that it boots or run 
some testsuite in addition to the usual build tests.

SO NO. WE DO NOT BREAK THAT JUST BECAUSE SOME *OTHER* USE MAY CARE.

Really. 

            Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-10 21:30                 ` [MODERATED] " Linus Torvalds
@ 2018-07-10 21:53                   ` Linus Torvalds
  2018-07-10 22:27                     ` Thomas Gleixner
  2018-07-10 22:20                   ` Thomas Gleixner
  1 sibling, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2018-07-10 21:53 UTC (permalink / raw)
  To: speck



On Tue, 10 Jul 2018, Linus Torvalds wrote:
> 
> For example, there is absolutely _zero_ reason for me to disable SMT just 
> because I want to do a "lkvm run" on a kernel to test that it boots or run 
> some testsuite in addition to the usual build tests.

There's another issue: capabilities.

A normal user can run kvm. But a normal user can *not* hot-unplug CPU's. 

So this is a distro issue too. You can't just disable kvm if SMT is on by 
default. It would break anybody who updates a kernel and has had their 
system set up for them. All those poeple who have set up computers for 
their spouses, children or parents to auto-update and *not* have the user 
have to know, and now their system stops working.

Users come first. Always. And the _clueless_ user is actually way more 
important than the Amazon or Google one - inside Amazon you'll find people 
whose only job is to do MIS and know about these kinds of things.

Btw, this is why I suggested that "just turn off SMT automagically" model. 
It's the only model that is acceptable by default mode and that is 
actually secure. Exactly because it takes care of SMT automatically, so 
that you don't have capability issues, and so that you don't have the 
"clueless user" issue.

But you're the one who hated that.

               Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 2/2] Command line and documentation 2
  2018-07-10 21:30                 ` [MODERATED] " Linus Torvalds
  2018-07-10 21:53                   ` Linus Torvalds
@ 2018-07-10 22:20                   ` Thomas Gleixner
  2018-07-10 22:35                     ` [MODERATED] " Linus Torvalds
  1 sibling, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-10 22:20 UTC (permalink / raw)
  To: speck

On Tue, 10 Jul 2018, speck for Linus Torvalds wrote:
> On Tue, 10 Jul 2018, speck for Thomas Gleixner wrote:
> 
> > On Tue, 10 Jul 2018, speck for Linus Torvalds wrote:
> > > See the argument? It really boils down to:
> > > 
> > >  (b) Doing the L1 flush on entry only screws virtualization itself, not 
> > >      everything else. Just do it. It might help, and might make it harder 
> > >      to reliably trigger something.
> > 
> > Not if SMT is enabled, really.
> 
> Quite honestly, possibly even with SMT enabled. It at least means that the 
> code and data that is *reliably* available (ie everything that leads up to 
> the actual VMX entry itself) no longer is available.
> 
> So then you have to get another thread to run something you care about to 
> get information out. That's already quite a big step.

The big step is to brute force scan the physical address space until you
find something interesting and the few flushes for every timer tick are
just adding a bit of noise. The scheduler helps quite a bit to keep busy
tasks on the same CPUs for a while, so it's not unlikely to find something
which is worthwhile given the efficency of that shit.

> > So my approach would be to keep SMT enabled by default and have L1D flush
> > on by default, but refuse to start a guest when SMT is enabled by default.
> 
> What? So now you break existing workflows? No.
> 
> Seriously. What part of "people just want to run their windows binary or 
> test their kernel with lkvm" is so hard to understand?
> 
> WE DO NOT BREAK EXISTING USERS. Not for a security issue that simply 
> doesn't matter to them. 
> 
> I realize that most on this list are on this list *BECAUSE* they care 
> about this issue. But this whole L1TF with VMX is going to be a total 
> non-issue for most non-cloud users. Really.
> 
> We don't screw up normal people just because the Amazons and the Googles 
> of the world care.
> 
> For example, there is absolutely _zero_ reason for me to disable SMT just 
> because I want to do a "lkvm run" on a kernel to test that it boots or run 
> some testsuite in addition to the usual build tests.

No need to disable SMT. You just could tell KVM to shut the f*ck up because
you know what you are doing. Yeah, I know you want it to just keep
working. Ask AMD to send you one of the nice Zen boxen :)

That said, I got the message and I kinda agree, but I still hate the
inconsistency. I'll add a rationale for the default including the resulting
inconsitency to the documentation with clear words about the remaining
risks.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 2/2] Command line and documentation 2
  2018-07-10 21:53                   ` Linus Torvalds
@ 2018-07-10 22:27                     ` Thomas Gleixner
  2018-07-10 22:37                       ` [MODERATED] " Linus Torvalds
  2018-07-10 22:50                       ` Josh Poimboeuf
  0 siblings, 2 replies; 53+ messages in thread
From: Thomas Gleixner @ 2018-07-10 22:27 UTC (permalink / raw)
  To: speck

On Tue, 10 Jul 2018, speck for Linus Torvalds wrote:
> On Tue, 10 Jul 2018, Linus Torvalds wrote:
> > 
> > For example, there is absolutely _zero_ reason for me to disable SMT just 
> > because I want to do a "lkvm run" on a kernel to test that it boots or run 
> > some testsuite in addition to the usual build tests.
> 
> There's another issue: capabilities.
> 
> A normal user can run kvm. But a normal user can *not* hot-unplug CPU's. 
> 
> So this is a distro issue too. You can't just disable kvm if SMT is on by 
> default. It would break anybody who updates a kernel and has had their 
> system set up for them. All those poeple who have set up computers for 
> their spouses, children or parents to auto-update and *not* have the user 
> have to know, and now their system stops working.

Point taken.

> Users come first. Always. And the _clueless_ user is actually way more 
> important than the Amazon or Google one - inside Amazon you'll find people 
> whose only job is to do MIS and know about these kinds of things.
> 
> Btw, this is why I suggested that "just turn off SMT automagically" model. 
> It's the only model that is acceptable by default mode and that is 
> actually secure. Exactly because it takes care of SMT automatically, so 
> that you don't have capability issues, and so that you don't have the 
> "clueless user" issue.
> 
> But you're the one who hated that.

I'd be fine with that and I really can't remember that I hated turning SMT
off automagically when a guest is started.

I'm too tired to search for that discussion, but IIRC there was some
argument from distro/Intel? people that it breaks exisitng partitioning
setups etc... Though these setups won't be Joe Users setups...

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-10 22:20                   ` Thomas Gleixner
@ 2018-07-10 22:35                     ` Linus Torvalds
  0 siblings, 0 replies; 53+ messages in thread
From: Linus Torvalds @ 2018-07-10 22:35 UTC (permalink / raw)
  To: speck



On Wed, 11 Jul 2018, speck for Thomas Gleixner wrote:
> 
> The big step is to brute force scan the physical address space until you
> find something interesting and the few flushes for every timer tick are
> just adding a bit of noise.


.. and all of that assumes a wildly untrusted guest.

Why? People don't run those things. Amazon does, because you just pay them 
and they run anything. But if you're not a cloud provider THIS IS NOT AN 
ISSUE.

Seriously.

Why the f*ck would anybody run some image that does this?

Give me one even remotely likely scenario for running that as a normal 
user in a VM. I don't think you can.

Somebody who does pen-testing or verifies viruses, or whatever? Sure. 
Normal user? No.

This is not like the original spectre thing. The only real attack vector 
for normal users were web browsers running scripts, maybe some bad 
appstore case.

For this one? Even that doesn't exist. The native case is taken care of, 
and the virtual case requires you to run an actively untrusted guest.

> No need to disable SMT. You just could tell KVM to shut the f*ck up because
> you know what you are doing. Yeah, I know you want it to just keep
> working. 

What *I* can do is completely irrelevant.

The first rule of kernel development is "users come first".

That means that the code has to work for people who are *not* me. People 
who had the machine set up for them, potentially because they have an IQ 
of a woodchuck that was dropped on its head a few too many times. Maybe 
they have "emotional IQ" or whatever, but they can't tell a computer from 
a blender without having a few fingers nicked.

Those are the people we should care about. It's why Linux is actually 
used, and why nobody bothers with BSD any more (unless it's the Apple 
version, which cared deeply about those "emotional IQ" people).

Now, if this was some major security hole that was actually _relevant_ to 
the users, I'd have to say that "fuck it, we _have_ to fix this". But I 
seriously cannot see a single way this is relevant to a random normal user 
who uses VMX to play some random games or run some Windows-only app.

                 Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-10 22:27                     ` Thomas Gleixner
@ 2018-07-10 22:37                       ` Linus Torvalds
  2018-07-10 22:42                         ` Linus Torvalds
  2018-07-10 22:50                       ` Josh Poimboeuf
  1 sibling, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2018-07-10 22:37 UTC (permalink / raw)
  To: speck



On Wed, 11 Jul 2018, speck for Thomas Gleixner wrote:
> 
> I'm too tired to search for that discussion, but IIRC there was some 
> argument from distro/Intel? people that it breaks exisitng partitioning 
> setups etc... Though these setups won't be Joe Users setups...

Yes, that's indeed my argument. Yes, it breaks partitioning setups. But 
those wouldn't be using the automatic mode anyway, they'd be using the 
"smt=off" mode.

             Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-10 22:37                       ` [MODERATED] " Linus Torvalds
@ 2018-07-10 22:42                         ` Linus Torvalds
  0 siblings, 0 replies; 53+ messages in thread
From: Linus Torvalds @ 2018-07-10 22:42 UTC (permalink / raw)
  To: speck



On Tue, 10 Jul 2018, Linus Torvalds wrote:
> 
> Yes, that's indeed my argument. Yes, it breaks partitioning setups. But 
> those wouldn't be using the automatic mode anyway, they'd be using the 
> "smt=off" mode.

Side note: doing the auto-off is actually hard in user space. I was kind 
of thinking that maybe there would be some simple way to just teach 
qemu-kvm to do it, but there's not just the capabilty issue, there's 
simply a ref-counting issue. Right now kvm instances don't have to know 
about each other, and they'll happily just co-exist independently. 

But that also makes it really quite hard to do "turn off smt automatically 
when in kvm", because the "turn it back on" is very much not obvious.

So even if we had a solution for the capability issue, we'd still be 
screwed in user space. Somebody could incorrectly just turn SMT back on 
while VMX is active somewhere else.

(And I don't think we want to solve the capability issue anyway: we simply 
don't want normal users taking CPU's down and bringing them up randomly. 
Doing it automatically with some timeout (so that the frequency of the 
up/down events would at least be limited) is one thing, but exposing "turn 
smt on/off" to any random user sounds like a really bad idea).

                 Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-10 22:27                     ` Thomas Gleixner
  2018-07-10 22:37                       ` [MODERATED] " Linus Torvalds
@ 2018-07-10 22:50                       ` Josh Poimboeuf
  2018-07-11 13:56                         ` Jon Masters
  1 sibling, 1 reply; 53+ messages in thread
From: Josh Poimboeuf @ 2018-07-10 22:50 UTC (permalink / raw)
  To: speck

On Wed, Jul 11, 2018 at 12:27:15AM +0200, speck for Thomas Gleixner wrote:
> On Tue, 10 Jul 2018, speck for Linus Torvalds wrote:
> > On Tue, 10 Jul 2018, Linus Torvalds wrote:
> > > 
> > > For example, there is absolutely _zero_ reason for me to disable SMT just 
> > > because I want to do a "lkvm run" on a kernel to test that it boots or run 
> > > some testsuite in addition to the usual build tests.
> > 
> > There's another issue: capabilities.
> > 
> > A normal user can run kvm. But a normal user can *not* hot-unplug CPU's. 
> > 
> > So this is a distro issue too. You can't just disable kvm if SMT is on by 
> > default. It would break anybody who updates a kernel and has had their 
> > system set up for them. All those poeple who have set up computers for 
> > their spouses, children or parents to auto-update and *not* have the user 
> > have to know, and now their system stops working.
> 
> Point taken.
> 
> > Users come first. Always. And the _clueless_ user is actually way more 
> > important than the Amazon or Google one - inside Amazon you'll find people 
> > whose only job is to do MIS and know about these kinds of things.
> > 
> > Btw, this is why I suggested that "just turn off SMT automagically" model. 
> > It's the only model that is acceptable by default mode and that is 
> > actually secure. Exactly because it takes care of SMT automatically, so 
> > that you don't have capability issues, and so that you don't have the 
> > "clueless user" issue.

I don't remember anybody objecting to this idea yet... but if it's being
considered, I object.

Most clueless people (myself included) who run VMs are just running
*normal* VMs with no crazy sh*t in them.  Oftentimes I run *normal* VMs
and then leave them idle for long periods of time.  I suspect this is a
common case.  Personally I would be pissed off if my laptop decided to
automagically disable SMT during those periods.

-- 
Josh

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-10  8:44           ` [MODERATED] " Jiri Kosina
  2018-07-10 10:32             ` Jiri Kosina
@ 2018-07-10 22:57             ` Josh Poimboeuf
  1 sibling, 0 replies; 53+ messages in thread
From: Josh Poimboeuf @ 2018-07-10 22:57 UTC (permalink / raw)
  To: speck

Given the recent (entertaining!) discussions on the list, below is a
proposed modification to Jiri's most recent proposal:

> 	l1tf=full,force
> 		- L1D_FLUSH_ALWAYS
> 		- nosmt=force

It's a bit confusing that this option does 'always' instead of 'cond'.
In fact I'd propose that we drop this option altogether, in the name of
simplicity.  I don't think anybody needs it anyway.

> 	l1tf=full
> 		- L1D_FLUSH_COND
> 		- nosmt

Agreed.

> 	l1tf=conservative
> 		- L1D_FLUSH_COND
> 		- refuse to start VM with SMT on
> 			(perhaps with kvm module option that allows 
> 			 to override?)
> 	l1tf=conservative,warn
> 		- L1D_FLUSH_COND
> 		- warn when starting VM with SMT on

Let's replace the above two with:

	l1tf=flush [default]
		- L1D_FLUSH_COND
		- warn when starting VM with SMT on

	l1tf=flush,nowarn 
		- L1D_FLUSH_COND
		- no warning
		- (useful for the cpuset power user crowd)

> 	l1tf=off
> 		- L1D_FLUSH_NEVER
> 		- allow starting VMs without any restrictions/warnings

Agreed.

As far as the implementation goes, the code might become simpler if
'l1tf_mitigation' is converted to a pseudo-variable which is really
dependent on other real variables: nosmt, vmentry_l1d_flush,
warn_on_unsafe_vm.

If I have time, I may work up a patch tonight -- though I might not have
time.

-- 
Josh

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-10 22:50                       ` Josh Poimboeuf
@ 2018-07-11 13:56                         ` Jon Masters
  2018-07-11 14:48                           ` Josh Poimboeuf
  0 siblings, 1 reply; 53+ messages in thread
From: Jon Masters @ 2018-07-11 13:56 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 2434 bytes --]

On 07/10/2018 06:50 PM, speck for Josh Poimboeuf wrote:
> On Wed, Jul 11, 2018 at 12:27:15AM +0200, speck for Thomas Gleixner wrote:
>> On Tue, 10 Jul 2018, speck for Linus Torvalds wrote:
>>> On Tue, 10 Jul 2018, Linus Torvalds wrote:
>>>>
>>>> For example, there is absolutely _zero_ reason for me to disable SMT just 
>>>> because I want to do a "lkvm run" on a kernel to test that it boots or run 
>>>> some testsuite in addition to the usual build tests.
>>>
>>> There's another issue: capabilities.
>>>
>>> A normal user can run kvm. But a normal user can *not* hot-unplug CPU's. 
>>>
>>> So this is a distro issue too. You can't just disable kvm if SMT is on by 
>>> default. It would break anybody who updates a kernel and has had their 
>>> system set up for them. All those poeple who have set up computers for 
>>> their spouses, children or parents to auto-update and *not* have the user 
>>> have to know, and now their system stops working.
>>
>> Point taken.
>>
>>> Users come first. Always. And the _clueless_ user is actually way more 
>>> important than the Amazon or Google one - inside Amazon you'll find people 
>>> whose only job is to do MIS and know about these kinds of things.
>>>
>>> Btw, this is why I suggested that "just turn off SMT automagically" model. 
>>> It's the only model that is acceptable by default mode and that is 
>>> actually secure. Exactly because it takes care of SMT automatically, so 
>>> that you don't have capability issues, and so that you don't have the 
>>> "clueless user" issue.
> 
> I don't remember anybody objecting to this idea yet... but if it's being
> considered, I object.
> 
> Most clueless people (myself included) who run VMs are just running
> *normal* VMs with no crazy sh*t in them.  Oftentimes I run *normal* VMs
> and then leave them idle for long periods of time.  I suspect this is a
> common case.  Personally I would be pissed off if my laptop decided to
> automagically disable SMT during those periods.

More to the point, that will break even more users when those threads
disappear. Sadly, I think the only reasonable avenue is to leave SMT on,
and to simply warn if folks spin up a VM that might be unsafe. Doing the
flush always in VMX by default seems like a sane thing too. I thought
this was all where we had ended up already?

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-09 22:07   ` [MODERATED] " Andi Kleen
  2018-07-09 23:00     ` Josh Poimboeuf
  2018-07-10 19:36     ` Thomas Gleixner
@ 2018-07-11 14:03     ` Jon Masters
  2 siblings, 0 replies; 53+ messages in thread
From: Jon Masters @ 2018-07-11 14:03 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1523 bytes --]

On 07/09/2018 06:07 PM, speck for Andi Kleen wrote:

>> +Problem
>> +-------
>> +
>> +If an instruction accesses a virtual address for which the relevant page
>> +table entry (PTE) has the present bit cleared, then the speculative
>> +execution can load the data into the speculation flow when the data from

If an instruction accesses a virtual address (performs a load) for which
the corresponding physical address from the relevant Page Table Entry
(PTE) is allocated in the Level 1 Data Cache, then the data in the L1
Data Cache from that physical address may be forwarded to the load
instruction (and its dependents) during speculation, regardless of the
value of the "present" bit in the page table entry.

You can leave most of the rest. The EPT bit is simply a design bug that
layers on the above in how Intel bolted EPT onto the existing walks.

> 
> "load the into the speculation flow" ?
> 
> No idea what that means. I would just drop it.
> 
>> +the physical address which is referenced in the PTE address bits is
>> +available in the Level 1 Data Cache. This is a purely speculative
>> +mechanism and the instruction will raise a page fault when it is retired.
>> +
>> +This creates a window between the load and retirement where speculative
>> +execution can operate on the data and malicious code can use this to create
>> +a side channel to leak the speculated data which was accessed without
>> +permission.

-- 
Computer Architect | Sent from my Fedora powered laptop


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [MODERATED] Re: [patch 2/2] Command line and documentation 2
  2018-07-11 13:56                         ` Jon Masters
@ 2018-07-11 14:48                           ` Josh Poimboeuf
  0 siblings, 0 replies; 53+ messages in thread
From: Josh Poimboeuf @ 2018-07-11 14:48 UTC (permalink / raw)
  To: speck

On Wed, Jul 11, 2018 at 09:56:03AM -0400, speck for Jon Masters wrote:
> Doing the flush always in VMX by default seems like a sane thing too.
> I thought this was all where we had ended up already?

There was some misunderstanding and disagreement about that, because
flushing without disabling SMT is a very incomplete mitigation, but yes,
that now appears to be where we have ended up.

-- 
Josh

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2018-07-11 14:48 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-08 12:52 [patch 0/2] Command line and documentation 0 Thomas Gleixner
2018-07-08 12:52 ` [patch 1/2] Command line and documentation 1 Thomas Gleixner
2018-07-08 14:00   ` [MODERATED] " Josh Poimboeuf
2018-07-08 14:13     ` Thomas Gleixner
2018-07-08 15:21       ` [MODERATED] " Josh Poimboeuf
2018-07-09  7:07         ` Thomas Gleixner
2018-07-09 13:14           ` Thomas Gleixner
2018-07-09 13:21             ` [MODERATED] " Jiri Kosina
2018-07-09 13:25               ` Jiri Kosina
2018-07-09 15:32             ` Josh Poimboeuf
2018-07-09 15:40               ` Thomas Gleixner
2018-07-09 15:44               ` [MODERATED] " Jiri Kosina
2018-07-08 20:32   ` Jiri Kosina
2018-07-09  0:33     ` Jon Masters
2018-07-09 10:26   ` Ingo Molnar
2018-07-09 21:45   ` Andi Kleen
2018-07-09 22:08     ` Andi Kleen
2018-07-09 22:40     ` Jiri Kosina
2018-07-10 11:53     ` Thomas Gleixner
2018-07-08 12:52 ` [patch 2/2] Command line and documentation 2 Thomas Gleixner
2018-07-08 14:40   ` [MODERATED] " Andrew Cooper
2018-07-09  7:05     ` Thomas Gleixner
2018-07-08 15:40   ` [MODERATED] " Josh Poimboeuf
2018-07-09 11:04   ` Ingo Molnar
2018-07-09 11:08     ` Jiri Kosina
2018-07-09 11:47       ` Ingo Molnar
2018-07-09 15:18     ` Thomas Gleixner
2018-07-09 22:07   ` [MODERATED] " Andi Kleen
2018-07-09 23:00     ` Josh Poimboeuf
2018-07-09 23:11       ` Andi Kleen
2018-07-09 23:45         ` Linus Torvalds
2018-07-10  2:44           ` Josh Poimboeuf
2018-07-10  5:57             ` Jiri Kosina
2018-07-10  6:22               ` Jiri Kosina
2018-07-10 17:46             ` Linus Torvalds
2018-07-10 21:22               ` Thomas Gleixner
2018-07-10 21:30                 ` [MODERATED] " Linus Torvalds
2018-07-10 21:53                   ` Linus Torvalds
2018-07-10 22:27                     ` Thomas Gleixner
2018-07-10 22:37                       ` [MODERATED] " Linus Torvalds
2018-07-10 22:42                         ` Linus Torvalds
2018-07-10 22:50                       ` Josh Poimboeuf
2018-07-11 13:56                         ` Jon Masters
2018-07-11 14:48                           ` Josh Poimboeuf
2018-07-10 22:20                   ` Thomas Gleixner
2018-07-10 22:35                     ` [MODERATED] " Linus Torvalds
2018-07-10  7:41         ` Thomas Gleixner
2018-07-10  8:44           ` [MODERATED] " Jiri Kosina
2018-07-10 10:32             ` Jiri Kosina
2018-07-10 22:57             ` Josh Poimboeuf
2018-07-10 19:36     ` Thomas Gleixner
2018-07-11 14:03     ` [MODERATED] " Jon Masters
2018-07-08 13:11 ` [patch 0/2] Command line and documentation 0 Thomas Gleixner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.