* [patch V5 00/10] KVM magic # 0
@ 2018-07-02 15:44 Thomas Gleixner
  2018-07-02 15:44 ` [patch V5 01/10] KVM magic # 1 Thomas Gleixner
                   ` (11 more replies)
  0 siblings, 12 replies; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 15:44 UTC (permalink / raw)
  To: speck

Konrad ran short of cycles when addressing the review comments, so I took
over from him and finished it up.

This has the following changes since V4:

   - Integrate the static key logic (Konrad)

   - Split the big combo patch adding the flush logic apart into:

     - Add the L1D software flush algorithm. ASM code polishing
       from Borislav
       
     - Add the flush logic itself

     The flush logic has been reduced to a single flag. The flag is no
     longer fiddled with in prepare_guest_switch. It's set from the various
     places as in the V4 set, and in the flush function itself it is either
     cleared or, in the case of 'flush always', kept set.

   - Moved most of the extra magic to vmx.c

   - Polished comments and changelogs where necessary

I'm running short of cycles as well, so the whole lot is not well
tested. May I politely ask for help with that, and of course for review.
Especially the new flush logic with the single flag needs some sharp eyes.

Thanks,

	tglx


* [patch V5 01/10] KVM magic # 1
  2018-07-02 15:44 [patch V5 00/10] KVM magic # 0 Thomas Gleixner
@ 2018-07-02 15:44 ` Thomas Gleixner
  2018-07-02 16:22   ` [MODERATED] " Konrad Rzeszutek Wilk
  2018-07-02 16:25   ` [MODERATED] " Borislav Petkov
  2018-07-02 15:44 ` [patch V5 02/10] KVM magic # 2 Thomas Gleixner
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 15:44 UTC (permalink / raw)
  To: speck

From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com

commit 75e236a35266617b8c80ec3f5f98f6c81e326c5b upstream

If the L1TF CPU bug is present we allow the KVM module to be loaded
as the majority of users that use Linux and KVM have trusted guests
and do not want a broken setup.

Cloud vendors are the ones that are uncomfortable with CVE 2018-3615
and as such they are the ones that should set disallow_smt to one.

Setting disallow_smt to one means that the system administrator also needs
to disable SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command
line parameter, or via the /sys/devices/system/cpu/smt/control
(see commit 05736e4ac13c cpu/hotplug: Provide knobs to control SMT).

Other mitigations are to use task affinity, cpu sets, interrupt binding,
etc - anything to make sure that _only_ the same guests vCPUs are running
on sibling threads.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v3:-Move it to vmx_vcpu_init (could also do it in kvm_vcpu_init but it seemed
    more prudent to do it in the VMX handler).
   -Made it WARN if disallow_smt=0
   -Made it ERR if disallow_smt=1
   -Fixed the CVE number
v4: s/disallow/nosmt/

---
 Documentation/admin-guide/kernel-parameters.txt |    6 ++++++
 arch/x86/kvm/vmx.c                              |   13 +++++++++++++
 kernel/cpu.c                                    |    1 +
 3 files changed, 20 insertions(+)

--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1946,6 +1946,12 @@
 			[KVM,ARM] Allow use of GICv4 for direct injection of
 			LPIs.
 
+	kvm-intel.nosmt=[KVM,Intel] If the L1TF CPU bug is present (CVE-2018-3620)
+			and the system has SMT (aka Hyper-Threading) enabled then
+			don't allow guests to be created.
+
+			Default is 0 (allow guests to be created).
+
 	kvm-intel.ept=	[KVM,Intel] Disable extended page tables
 			(virtualized MMU) support on capable Intel chips.
 			Default is 1 (enabled)
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -71,6 +71,9 @@ static const struct x86_cpu_id vmx_cpu_i
 };
 MODULE_DEVICE_TABLE(x86cpu, vmx_cpu_id);
 
+static bool __read_mostly nosmt = false;
+module_param(nosmt, bool, S_IRUGO);
+
 static bool __read_mostly enable_vpid = 1;
 module_param_named(vpid, enable_vpid, bool, 0444);
 
@@ -10370,10 +10373,20 @@ static struct kvm_vcpu *vmx_create_vcpu(
 	return ERR_PTR(err);
 }
 
+#define L1TF_MSG "SMT enabled with L1TF CPU bug present. Refer to CVE-2018-3620 for details.\n"
+
 static int vmx_vm_init(struct kvm *kvm)
 {
 	if (!ple_gap)
 		kvm->arch.pause_in_guest = true;
+
+	if (boot_cpu_has(X86_BUG_L1TF) && (cpu_smt_control == CPU_SMT_ENABLED)) {
+		if (nosmt) {
+			pr_err(L1TF_MSG);
+			return -EOPNOTSUPP;
+		}
+		pr_warn(L1TF_MSG);
+	}
 	return 0;
 }
 
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -345,6 +345,7 @@ EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
 
 #ifdef CONFIG_HOTPLUG_SMT
 enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
+EXPORT_SYMBOL_GPL(cpu_smt_control);
 
 static int __init smt_cmdline_disable(char *str)
 {


* [patch V5 02/10] KVM magic # 2
  2018-07-02 15:44 [patch V5 00/10] KVM magic # 0 Thomas Gleixner
  2018-07-02 15:44 ` [patch V5 01/10] KVM magic # 1 Thomas Gleixner
@ 2018-07-02 15:44 ` Thomas Gleixner
  2018-07-02 15:44 ` [patch V5 03/10] KVM magic # 3 Thomas Gleixner
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 15:44 UTC (permalink / raw)
  To: speck

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: [patch V5 02/10] x86/KVM/VMX: Add module argument for L1TF mitigation

Add a mitigation mode parameter "vmentry_l1d_flush" for CVE-2018-3620, aka
L1 terminal fault. The valid arguments are:

 - "always" 	L1D cache flush on every VMENTER.
 - "cond"	Conditional L1D cache flush, explained below
 - "never"	Disable the L1D cache flush mitigation

"cond" is trying to avoid L1D cache flushes on VMENTER if the code executed
between VMEXIT and VMENTER is considered safe, i.e. is not bringing any
interesting information into L1D which might be exploited.

[ tglx: Split out from a larger patch ]

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 Documentation/admin-guide/kernel-parameters.txt |   12 ++++
 arch/x86/kvm/vmx.c                              |   59 ++++++++++++++++++++++++
 2 files changed, 71 insertions(+)

--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1973,6 +1973,18 @@
 			(virtualized real and unpaged mode) on capable
 			Intel chips. Default is 1 (enabled)
 
+	kvm-intel.vmentry_l1d_flush=[KVM,Intel] Mitigation for L1 Terminal Fault
+			CVE-2018-3620.
+
+			Valid arguments: never, cond, always
+
+			always: L1D cache flush on every VMENTER.
+			cond:	Flush L1D on VMENTER only when the code between
+				VMEXIT and VMENTER can leak host memory.
+			never:	Disables the mitigation
+
+			Default is cond (do L1 cache flush in specific instances)
+
 	kvm-intel.vpid=	[KVM,Intel] Disable Virtual Processor Identification
 			feature (tagged TLBs) on capable Intel chips.
 			Default is 1 (enabled)
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -191,6 +191,54 @@ module_param(ple_window_max, uint, 0444)
 
 extern const ulong vmx_return;
 
+static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
+
+/* These MUST be in sync with vmentry_l1d_param order. */
+enum vmx_l1d_flush_state {
+	VMENTER_L1D_FLUSH_NEVER,
+	VMENTER_L1D_FLUSH_COND,
+	VMENTER_L1D_FLUSH_ALWAYS,
+};
+
+static enum vmx_l1d_flush_state __read_mostly vmentry_l1d_flush = VMENTER_L1D_FLUSH_COND;
+
+static const struct {
+	const char *option;
+	enum vmx_l1d_flush_state cmd;
+} vmentry_l1d_param[] = {
+	{"never",	VMENTER_L1D_FLUSH_NEVER},
+	{"cond",	VMENTER_L1D_FLUSH_COND},
+	{"always",	VMENTER_L1D_FLUSH_ALWAYS},
+};
+
+static int vmentry_l1d_flush_set(const char *s, const struct kernel_param *kp)
+{
+	unsigned int i;
+
+	if (!s)
+		return -EINVAL;
+
+	for (i = 0; i < ARRAY_SIZE(vmentry_l1d_param); i++) {
+		if (!strcmp(s, vmentry_l1d_param[i].option)) {
+			vmentry_l1d_flush = vmentry_l1d_param[i].cmd;
+			return 0;
+		}
+	}
+
+	return -EINVAL;
+}
+
+static int vmentry_l1d_flush_get(char *s, const struct kernel_param *kp)
+{
+	return sprintf(s, "%s\n", vmentry_l1d_param[vmentry_l1d_flush].option);
+}
+
+static const struct kernel_param_ops vmentry_l1d_flush_ops = {
+	.set = vmentry_l1d_flush_set,
+	.get = vmentry_l1d_flush_get,
+};
+module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, &vmentry_l1d_flush, S_IRUGO);
+
 struct kvm_vmx {
 	struct kvm kvm;
 
@@ -13062,6 +13110,15 @@ static struct kvm_x86_ops vmx_x86_ops __
 	.enable_smi_window = enable_smi_window,
 };
 
+static void __init vmx_setup_l1d_flush(void)
+{
+	if (vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER ||
+	    !boot_cpu_has_bug(X86_BUG_L1TF))
+		return;
+
+	static_branch_enable(&vmx_l1d_should_flush);
+}
+
 static int __init vmx_init(void)
 {
 	int r;
@@ -13095,6 +13152,8 @@ static int __init vmx_init(void)
 	}
 #endif
 
+	vmx_setup_l1d_flush();
+
 	r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
                      __alignof__(struct vcpu_vmx), THIS_MODULE);
 	if (r)


* [patch V5 03/10] KVM magic # 3
  2018-07-02 15:44 [patch V5 00/10] KVM magic # 0 Thomas Gleixner
  2018-07-02 15:44 ` [patch V5 01/10] KVM magic # 1 Thomas Gleixner
  2018-07-02 15:44 ` [patch V5 02/10] KVM magic # 2 Thomas Gleixner
@ 2018-07-02 15:44 ` Thomas Gleixner
  2018-07-02 23:42   ` [MODERATED] " Jon Masters
  2018-07-02 15:44 ` [patch V5 04/10] KVM magic # 4 Thomas Gleixner
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 15:44 UTC (permalink / raw)
  To: speck

Subject: [patch V5 03/10] x86/KVM/VMX: Add L1D flush algorithm
From: Paolo Bonzini pbonzini@redhat.com

To mitigate the L1 Terminal Fault vulnerability it's required to flush L1D
on VMENTER to prevent rogue guests from snooping host memory.

CPUs will have a new control MSR via a microcode update to flush L1D with a
single MSR write, but in the absence of microcode a fallback to a software
based flush algorithm is required.

Add a software flush loop which is based on code from Intel.
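
For illustration only, the asm sequence below is roughly equivalent to this C
sketch (not part of the patch; the real code is hand-written asm so the
accesses cannot be optimized or reordered away by the compiler):

	/*
	 * Illustrative C equivalent of the software flush, not part of the
	 * patch. 'flush_pages' is the 64 KiB buffer set up by
	 * vmx_setup_l1d_flush(), 'size' is PAGE_SIZE << L1D_CACHE_ORDER.
	 */
	static void l1d_flush_sw_sketch(volatile char *flush_pages, int size)
	{
		int i;

		/* First touch one byte per 4K page to populate the TLB */
		for (i = 0; i < size; i += 4096)
			(void)flush_pages[i];

		/* (the asm serializes with CPUID at this point) */

		/* Then read one byte per 64 byte line to fill the L1D */
		for (i = 0; i < size; i += 64)
			(void)flush_pages[i];
	}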

[ tglx: Split out from combo patch ]
[ bpetkov: Polish the asm code ]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kvm/vmx.c |   71 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 66 insertions(+), 5 deletions(-)

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9563,6 +9563,46 @@ static int vmx_handle_exit(struct kvm_vc
 	}
 }
 
+/*
+ * Software based L1D cache flush which is used when microcode providing
+ * the cache control MSR is not loaded.
+ *
+ * The L1D cache is 32 KiB on Nehalem and later microarchitectures, but to
+ * flush it is required to read in 64 KiB because the replacement algorithm
+ * is not exactly LRU. This could be sized at runtime via topology
+ * information but as all relevant affected CPUs have 32KiB L1D cache size
+ * there is no point in doing so.
+ */
+#define L1D_CACHE_ORDER 4
+static void *vmx_l1d_flush_pages;
+
+static void __maybe_unused vmx_l1d_flush(void)
+{
+	int size = PAGE_SIZE << L1D_CACHE_ORDER;
+
+	asm volatile(
+		/* First ensure the pages are in the TLB */
+		"xorl	%%eax, %%eax\n"
+		".Lpopulate_tlb:\n\t"
+		"movzbl	(%[empty_zp], %%" _ASM_AX "), %%ecx\n\t"
+		"addl	$4096, %%eax\n\t"
+		"cmpl	%%eax, %[size]\n\t"
+		"jne	.Lpopulate_tlb\n\t"
+		"xorl	%%eax, %%eax\n\t"
+		"cpuid\n\t"
+		/* Now fill the cache */
+		"xorl	%%eax, %%eax\n"
+		".Lfill_cache:\n"
+		"movzbl	(%[empty_zp], %%" _ASM_AX "), %%ecx\n\t"
+		"addl	$64, %%eax\n\t"
+		"cmpl	%%eax, %[size]\n\t"
+		"jne	.Lfill_cache\n\t"
+		"lfence\n"
+		:: [empty_zp] "r" (vmx_l1d_flush_pages),
+		    [size] "r" (size)
+		: "eax", "ebx", "ecx", "edx");
+}
+
 static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
@@ -13110,13 +13150,29 @@ static struct kvm_x86_ops vmx_x86_ops __
 	.enable_smi_window = enable_smi_window,
 };
 
-static void __init vmx_setup_l1d_flush(void)
+static int __init vmx_setup_l1d_flush(void)
 {
+	struct page *page;
+
 	if (vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER ||
 	    !boot_cpu_has_bug(X86_BUG_L1TF))
-		return;
+		return 0;
+
+	page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
+	if (!page)
+		return -ENOMEM;
 
+	vmx_l1d_flush_pages = page_address(page);
 	static_branch_enable(&vmx_l1d_should_flush);
+	return 0;
+}
+
+static void vmx_free_l1d_flush_pages(void)
+{
+	if (vmx_l1d_flush_pages) {
+		free_pages((unsigned long)vmx_l1d_flush_pages, L1D_CACHE_ORDER);
+		vmx_l1d_flush_pages = NULL;
+	}
 }
 
 static int __init vmx_init(void)
@@ -13152,12 +13208,16 @@ static int __init vmx_init(void)
 	}
 #endif
 
-	vmx_setup_l1d_flush();
+	r = vmx_setup_l1d_flush();
+	if (r)
+		return r;
 
 	r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
-                     __alignof__(struct vcpu_vmx), THIS_MODULE);
-	if (r)
+		     __alignof__(struct vcpu_vmx), THIS_MODULE);
+	if (r) {
+		vmx_free_l1d_flush_pages();
 		return r;
+	}
 
 #ifdef CONFIG_KEXEC_CORE
 	rcu_assign_pointer(crash_vmclear_loaded_vmcss,
@@ -13199,6 +13259,7 @@ static void __exit vmx_exit(void)
 		static_branch_disable(&enable_evmcs);
 	}
 #endif
+	vmx_free_l1d_flush_pages();
 }
 
 module_init(vmx_init)


* [patch V5 04/10] KVM magic # 4
  2018-07-02 15:44 [patch V5 00/10] KVM magic # 0 Thomas Gleixner
                   ` (2 preceding siblings ...)
  2018-07-02 15:44 ` [patch V5 03/10] KVM magic # 3 Thomas Gleixner
@ 2018-07-02 15:44 ` Thomas Gleixner
  2018-07-02 15:44 ` [patch V5 05/10] KVM magic # 5 Thomas Gleixner
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 15:44 UTC (permalink / raw)
  To: speck

Subject: [patch V5 04/10] x86/KVM/VMX: Add L1D MSR based flush
From: Paolo Bonzini pbonzini@redhat.com

336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR
(IA32_FLUSH_CMD aka 0x10B) which has similar write-only semantics to other
MSRs defined in the document.

The semantics of this MSR is to allow "finer granularity invalidation of
caching structures than existing mechanisms like WBINVD. It will writeback
and invalidate the L1 data cache, including all cachelines brought in by
preceding instructions, without invalidating all caches (eg. L2 or
LLC). Some processors may also invalidate the first level instruction
cache on a L1D_FLUSH command. The L1 data and instruction caches may be
shared across the logical processors of a core."

Use it instead of the loop based L1 flush algorithm.
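
With the MSR enumerated (X86_FEATURE_FLUSH_L1D) the flush reduces to a single
write, as the hunk below does:

	wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);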

A copy of this document is available at
   https://bugzilla.kernel.org/show_bug.cgi?id=199511

[ tglx: Avoid allocating pages when the MSR is available ]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---

 arch/x86/include/asm/msr-index.h |    6 ++++++
 arch/x86/kvm/vmx.c               |   15 +++++++++++----
 2 files changed, 17 insertions(+), 4 deletions(-)

--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -76,6 +76,12 @@
 						    * control required.
 						    */
 
+#define MSR_IA32_FLUSH_CMD		0x0000010b
+#define L1D_FLUSH			(1 << 0)   /*
+						    * Writeback and invalidate the
+						    * L1 data cache.
+						    */
+
 #define MSR_IA32_BBL_CR_CTL		0x00000119
 #define MSR_IA32_BBL_CR_CTL3		0x0000011e
 
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9580,6 +9580,11 @@ static void __maybe_unused vmx_l1d_flush
 {
 	int size = PAGE_SIZE << L1D_CACHE_ORDER;
 
+	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+		return;
+	}
+
 	asm volatile(
 		/* First ensure the pages are in the TLB */
 		"xorl	%%eax, %%eax\n"
@@ -13158,11 +13163,13 @@ static int __init vmx_setup_l1d_flush(vo
 	    !boot_cpu_has_bug(X86_BUG_L1TF))
 		return 0;
 
-	page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
-	if (!page)
-		return -ENOMEM;
+	if (!boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+		page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
+		if (!page)
+			return -ENOMEM;
+		vmx_l1d_flush_pages = page_address(page);
+	}
 
-	vmx_l1d_flush_pages = page_address(page);
 	static_branch_enable(&vmx_l1d_should_flush);
 	return 0;
 }


* [patch V5 05/10] KVM magic # 5
  2018-07-02 15:44 [patch V5 00/10] KVM magic # 0 Thomas Gleixner
                   ` (3 preceding siblings ...)
  2018-07-02 15:44 ` [patch V5 04/10] KVM magic # 4 Thomas Gleixner
@ 2018-07-02 15:44 ` Thomas Gleixner
  2018-07-02 16:35   ` [MODERATED] " Konrad Rzeszutek Wilk
                     ` (2 more replies)
  2018-07-02 15:44 ` [patch V5 06/10] KVM magic # 6 Thomas Gleixner
                   ` (6 subsequent siblings)
  11 siblings, 3 replies; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 15:44 UTC (permalink / raw)
  To: speck

Subject: [patch V5 05/10] x86/KVM/VMX: Add L1D flush logic
From: Paolo Bonzini pbonzini@redhat.com

Add the logic for flushing L1D on VMENTER. The flush depends on the static
key being enabled and the new l1tf_flush_l1d flag being set.

The flag is set:
 - Always, if the flush module parameter is 'always'

 - Conditionally at:
   - Entry to vcpu_run(), i.e. after executing user space

   - From the sched_in notifier, i.e. when switching to a vCPU thread.

   - From vmexit handlers which are considered unsafe, i.e. where
     sensitive data can be brought into L1D:

     - The emulator, which could be a good target for other speculative
       execution-based threats,

     - The MMU, which can bring host page tables in the L1 cache.
     
     - External interrupts

     - Nested operations that require the MMU (see above). That is
       vmptrld, vmptrst, vmclear,vmwrite,vmread.

     - When handling invept,invvpid

[ tglx: Split out from combo patch and reduced to a single flag ]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
v4: Fix from Thomas using static key
v5: Split it out from the big combo patch
    Simplify the flags logic

 arch/x86/include/asm/kvm_host.h |    4 ++++
 arch/x86/kvm/mmu.c              |    1 +
 arch/x86/kvm/vmx.c              |   22 +++++++++++++++++++++-
 arch/x86/kvm/x86.c              |    7 +++++++
 4 files changed, 33 insertions(+), 1 deletion(-)

--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -713,6 +713,9 @@ struct kvm_vcpu_arch {
 
 	/* be preempted when it's in kernel-mode(cpl=0) */
 	bool preempted_in_kernel;
+
+	/* Flush the L1 Data cache for L1TF mitigation on VMENTER */
+	bool l1tf_flush_l1d;
 };
 
 struct kvm_lpage_info {
@@ -881,6 +884,7 @@ struct kvm_vcpu_stat {
 	u64 signal_exits;
 	u64 irq_window_exits;
 	u64 nmi_window_exits;
+	u64 l1d_flush;
 	u64 halt_exits;
 	u64 halt_successful_poll;
 	u64 halt_attempted_poll;
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3840,6 +3840,7 @@ int kvm_handle_page_fault(struct kvm_vcp
 {
 	int r = 1;
 
+	vcpu->arch.l1tf_flush_l1d = true;
 	switch (vcpu->arch.apf.host_apf_reason) {
 	default:
 		trace_kvm_page_fault(fault_address, error_code);
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9576,9 +9576,20 @@ static int vmx_handle_exit(struct kvm_vc
 #define L1D_CACHE_ORDER 4
 static void *vmx_l1d_flush_pages;
 
-static void __maybe_unused vmx_l1d_flush(void)
+static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 {
 	int size = PAGE_SIZE << L1D_CACHE_ORDER;
+	bool always;
+
+	/*
+	 * If the mitigation mode is 'flush always', keep the flush bit
+	 * set, otherwise clear it. It gets set again either from
+	 * vcpu_run() or from one of the unsafe VMEXIT handlers.
+	 */
+	always = vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
+	vcpu->arch.l1tf_flush_l1d = always;
+
+	vcpu->stat.l1d_flush++;
 
 	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
 		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
@@ -9847,6 +9858,7 @@ static void vmx_handle_external_intr(str
 			[ss]"i"(__KERNEL_DS),
 			[cs]"i"(__KERNEL_CS)
 			);
+		vcpu->arch.l1tf_flush_l1d = true;
 	}
 }
 STACK_FRAME_NON_STANDARD(vmx_handle_external_intr);
@@ -10104,6 +10116,11 @@ static void __noclone vmx_vcpu_run(struc
 	evmcs_rsp = static_branch_unlikely(&enable_evmcs) ?
 		(unsigned long)&current_evmcs->host_rsp : 0;
 
+	if (static_branch_unlikely(&vmx_l1d_should_flush)) {
+		if (vcpu->arch.l1tf_flush_l1d)
+			vmx_l1d_flush(vcpu);
+	}
+
 	asm(
 		/* Store host registers */
 		"push %%" _ASM_DX "; push %%" _ASM_BP ";"
@@ -11917,6 +11934,9 @@ static int nested_vmx_run(struct kvm_vcp
 		return ret;
 	}
 
+	/* Hide L1D cache contents from the nested guest.  */
+	vmx->vcpu.arch.l1tf_flush_l1d = true;
+
 	/*
 	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
 	 * by event injection, halt vcpu.
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -195,6 +195,7 @@ struct kvm_stats_debugfs_item debugfs_en
 	{ "irq_injections", VCPU_STAT(irq_injections) },
 	{ "nmi_injections", VCPU_STAT(nmi_injections) },
 	{ "req_event", VCPU_STAT(req_event) },
+	{ "l1d_flush", VCPU_STAT(l1d_flush) },
 	{ "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
 	{ "mmu_pte_write", VM_STAT(mmu_pte_write) },
 	{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
@@ -4799,6 +4800,8 @@ int kvm_read_guest_virt(struct kvm_vcpu
 {
 	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
 
+	/* The gva_to_pa walker can pull in tons of pages. */
+	vcpu->arch.l1tf_flush_l1d = true;
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
 					  exception);
 }
@@ -6050,6 +6053,8 @@ int x86_emulate_instruction(struct kvm_v
 	bool writeback = true;
 	bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
 
+	vcpu->arch.l1tf_flush_l1d = true;
+
 	/*
 	 * Clear write_fault_to_shadow_pgtable here to ensure it is
 	 * never reused.
@@ -7579,6 +7584,7 @@ static int vcpu_run(struct kvm_vcpu *vcp
 	struct kvm *kvm = vcpu->kvm;
 
 	vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
+	vcpu->arch.l1tf_flush_l1d = true;
 
 	for (;;) {
 		if (kvm_vcpu_running(vcpu)) {
@@ -8698,6 +8704,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcp
 
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	vcpu->arch.l1tf_flush_l1d = true;
 	kvm_x86_ops->sched_in(vcpu, cpu);
 }
 


* [patch V5 06/10] KVM magic # 6
  2018-07-02 15:44 [patch V5 00/10] KVM magic # 0 Thomas Gleixner
                   ` (4 preceding siblings ...)
  2018-07-02 15:44 ` [patch V5 05/10] KVM magic # 5 Thomas Gleixner
@ 2018-07-02 15:44 ` Thomas Gleixner
  2018-07-02 15:44 ` [patch V5 07/10] KVM magic # 7 Thomas Gleixner
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 15:44 UTC (permalink / raw)
  To: speck

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: [patch V5 06/10] x86/KVM/VMX: Split the VMX MSR LOAD structures to have separate host/guest numbers.

There is no semantic change, but this change allows an unbalanced number of
MSRs to be loaded on VMEXIT and VMENTER, i.e. the number of MSRs to save or
restore on VMEXIT or VMENTER may be different.
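
After the change the autoload bookkeeping (see the hunks below) looks like:

	struct vmx_msrs {
		unsigned nr;
		struct vmx_msr_entry val[NR_AUTOLOAD_MSRS];
	};

	struct msr_autoload {
		struct vmx_msrs guest;
		struct vmx_msrs host;
	} msr_autoload;

so the guest (VMENTER) and host (VMEXIT) lists each carry their own count.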

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/kvm/vmx.c |   65 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 35 insertions(+), 30 deletions(-)

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -808,6 +808,11 @@ static inline int pi_test_sn(struct pi_d
 			(unsigned long *)&pi_desc->control);
 }
 
+struct vmx_msrs {
+	unsigned nr;
+	struct vmx_msr_entry val[NR_AUTOLOAD_MSRS];
+};
+
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
 	unsigned long         host_rsp;
@@ -841,9 +846,8 @@ struct vcpu_vmx {
 	struct loaded_vmcs   *loaded_vmcs;
 	bool                  __launched; /* temporary, used in vmx_vcpu_run */
 	struct msr_autoload {
-		unsigned nr;
-		struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
-		struct vmx_msr_entry host[NR_AUTOLOAD_MSRS];
+		struct vmx_msrs guest;
+		struct vmx_msrs host;
 	} msr_autoload;
 	struct {
 		int           loaded;
@@ -2440,18 +2444,18 @@ static void clear_atomic_switch_msr(stru
 		}
 		break;
 	}
-
-	for (i = 0; i < m->nr; ++i)
-		if (m->guest[i].index == msr)
+	for (i = 0; i < m->guest.nr; ++i)
+		if (m->guest.val[i].index == msr)
 			break;
 
-	if (i == m->nr)
+	if (i == m->guest.nr)
 		return;
-	--m->nr;
-	m->guest[i] = m->guest[m->nr];
-	m->host[i] = m->host[m->nr];
-	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
-	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
+	--m->guest.nr;
+	--m->host.nr;
+	m->guest.val[i] = m->guest.val[m->guest.nr];
+	m->host.val[i] = m->host.val[m->host.nr];
+	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
+	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
 }
 
 static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,
@@ -2503,24 +2507,25 @@ static void add_atomic_switch_msr(struct
 		wrmsrl(MSR_IA32_PEBS_ENABLE, 0);
 	}
 
-	for (i = 0; i < m->nr; ++i)
-		if (m->guest[i].index == msr)
+	for (i = 0; i < m->guest.nr; ++i)
+		if (m->guest.val[i].index == msr)
 			break;
 
 	if (i == NR_AUTOLOAD_MSRS) {
 		printk_once(KERN_WARNING "Not enough msr switch entries. "
 				"Can't add msr %x\n", msr);
 		return;
-	} else if (i == m->nr) {
-		++m->nr;
-		vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
-		vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
+	} else if (i == m->guest.nr) {
+		++m->guest.nr;
+		++m->host.nr;
+		vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
+		vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
 	}
 
-	m->guest[i].index = msr;
-	m->guest[i].value = guest_val;
-	m->host[i].index = msr;
-	m->host[i].value = host_val;
+	m->guest.val[i].index = msr;
+	m->guest.val[i].value = guest_val;
+	m->host.val[i].index = msr;
+	m->host.val[i].value = host_val;
 }
 
 static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
@@ -6290,9 +6295,9 @@ static void vmx_vcpu_setup(struct vcpu_v
 
 	vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
 	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
-	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
+	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
 	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
-	vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
+	vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));
 
 	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
 		vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
@@ -11350,10 +11355,10 @@ static void prepare_vmcs02_full(struct k
 	 * Set the MSR load/store lists to match L0's settings.
 	 */
 	vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
-	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
-	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
-	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
-	vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
+	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
+	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
+	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
+	vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));
 
 	set_cr4_guest_host_mask(vmx);
 
@@ -12457,8 +12462,8 @@ static void nested_vmx_vmexit(struct kvm
 	vmx_segment_cache_clear(vmx);
 
 	/* Update any VMCS fields that might have changed while L2 ran */
-	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
-	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
+	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
+	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
 	vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
 	if (vmx->hv_deadline_tsc == -1)
 		vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,


* [patch V5 07/10] KVM magic # 7
  2018-07-02 15:44 [patch V5 00/10] KVM magic # 0 Thomas Gleixner
                   ` (5 preceding siblings ...)
  2018-07-02 15:44 ` [patch V5 06/10] KVM magic # 6 Thomas Gleixner
@ 2018-07-02 15:44 ` Thomas Gleixner
  2018-07-02 15:44 ` [patch V5 08/10] KVM magic # 8 Thomas Gleixner
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 15:44 UTC (permalink / raw)
  To: speck

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: [patch V5 07/10] x86/KVM/VMX: Add find_msr() helper function

.. to help find the MSR on either the guest or host MSR list.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/kvm/vmx.c |   31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2421,9 +2421,20 @@ static void clear_atomic_switch_msr_spec
 	vm_exit_controls_clearbit(vmx, exit);
 }
 
+static int find_msr(struct vmx_msrs *m, unsigned msr)
+{
+	unsigned int i;
+
+	for (i = 0; i < m->nr; ++i) {
+		if (m->val[i].index == msr)
+			return i;
+	}
+	return -ENOENT;
+}
+
 static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr)
 {
-	unsigned i;
+	int i;
 	struct msr_autoload *m = &vmx->msr_autoload;
 
 	switch (msr) {
@@ -2444,11 +2455,8 @@ static void clear_atomic_switch_msr(stru
 		}
 		break;
 	}
-	for (i = 0; i < m->guest.nr; ++i)
-		if (m->guest.val[i].index == msr)
-			break;
-
-	if (i == m->guest.nr)
+	i = find_msr(&m->guest, msr);
+	if (i < 0)
 		return;
 	--m->guest.nr;
 	--m->host.nr;
@@ -2472,7 +2480,7 @@ static void add_atomic_switch_msr_specia
 static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
 				  u64 guest_val, u64 host_val)
 {
-	unsigned i;
+	int i;
 	struct msr_autoload *m = &vmx->msr_autoload;
 
 	switch (msr) {
@@ -2507,16 +2515,13 @@ static void add_atomic_switch_msr(struct
 		wrmsrl(MSR_IA32_PEBS_ENABLE, 0);
 	}
 
-	for (i = 0; i < m->guest.nr; ++i)
-		if (m->guest.val[i].index == msr)
-			break;
-
+	i = find_msr(&m->guest, msr);
 	if (i == NR_AUTOLOAD_MSRS) {
 		printk_once(KERN_WARNING "Not enough msr switch entries. "
 				"Can't add msr %x\n", msr);
 		return;
-	} else if (i == m->guest.nr) {
-		++m->guest.nr;
+	} else if (i < 0) {
+		i = m->guest.nr++;
 		++m->host.nr;
 		vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
 		vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);


* [patch V5 08/10] KVM magic # 8
  2018-07-02 15:44 [patch V5 00/10] KVM magic # 0 Thomas Gleixner
                   ` (6 preceding siblings ...)
  2018-07-02 15:44 ` [patch V5 07/10] KVM magic # 7 Thomas Gleixner
@ 2018-07-02 15:44 ` Thomas Gleixner
  2018-07-02 15:44 ` [patch V5 09/10] KVM magic # 9 Thomas Gleixner
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 15:44 UTC (permalink / raw)
  To: speck

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: [patch V5 08/10] x86/KVM/VMX: Separate the VMX AUTOLOAD guest/host number accounting

This allows loading a different number of MSRs depending on the context:
VMEXIT or VMENTER.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/kvm/vmx.c |   29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2457,12 +2457,18 @@ static void clear_atomic_switch_msr(stru
 	}
 	i = find_msr(&m->guest, msr);
 	if (i < 0)
-		return;
+		goto skip_guest;
 	--m->guest.nr;
-	--m->host.nr;
 	m->guest.val[i] = m->guest.val[m->guest.nr];
-	m->host.val[i] = m->host.val[m->host.nr];
 	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
+
+skip_guest:
+	i = find_msr(&m->host, msr);
+	if (i < 0)
+		return;
+
+	--m->host.nr;
+	m->host.val[i] = m->host.val[m->host.nr];
 	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
 }
 
@@ -2480,7 +2486,7 @@ static void add_atomic_switch_msr_specia
 static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
 				  u64 guest_val, u64 host_val)
 {
-	int i;
+	int i, j;
 	struct msr_autoload *m = &vmx->msr_autoload;
 
 	switch (msr) {
@@ -2516,21 +2522,24 @@ static void add_atomic_switch_msr(struct
 	}
 
 	i = find_msr(&m->guest, msr);
-	if (i == NR_AUTOLOAD_MSRS) {
+	j = find_msr(&m->host, msr);
+	if (i == NR_AUTOLOAD_MSRS || j == NR_AUTOLOAD_MSRS) {
 		printk_once(KERN_WARNING "Not enough msr switch entries. "
 				"Can't add msr %x\n", msr);
 		return;
-	} else if (i < 0) {
+	}
+	if (i < 0) {
 		i = m->guest.nr++;
-		++m->host.nr;
 		vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
+	}
+	if (j < 0) {
+		j = m->host.nr++;
 		vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
 	}
-
 	m->guest.val[i].index = msr;
 	m->guest.val[i].value = guest_val;
-	m->host.val[i].index = msr;
-	m->host.val[i].value = host_val;
+	m->host.val[j].index = msr;
+	m->host.val[j].value = host_val;
 }
 
 static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)


* [patch V5 09/10] KVM magic # 9
  2018-07-02 15:44 [patch V5 00/10] KVM magic # 0 Thomas Gleixner
                   ` (7 preceding siblings ...)
  2018-07-02 15:44 ` [patch V5 08/10] KVM magic # 8 Thomas Gleixner
@ 2018-07-02 15:44 ` Thomas Gleixner
  2018-07-02 15:44 ` [patch V5 10/10] KVM magic # 10 Thomas Gleixner
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 15:44 UTC (permalink / raw)
  To: speck

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: [patch V5 09/10] x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs

The IA32_FLUSH_CMD MSR needs only to be written on VMENTER. Extend
add_atomic_switch_msr() with an entry_only parameter to allow storing the
MSR only in the guest (ENTRY) MSR array.
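
A later patch in this series then uses it as:

	add_atomic_switch_msr(vmx, MSR_IA32_FLUSH_CMD, L1D_FLUSH, 0, true);

while all existing callers pass entry_only == false and keep the current
behaviour.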

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kvm/vmx.c |   22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2484,9 +2484,9 @@ static void add_atomic_switch_msr_specia
 }
 
 static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
-				  u64 guest_val, u64 host_val)
+				  u64 guest_val, u64 host_val, bool entry_only)
 {
-	int i, j;
+	int i, j = 0;
 	struct msr_autoload *m = &vmx->msr_autoload;
 
 	switch (msr) {
@@ -2522,7 +2522,9 @@ static void add_atomic_switch_msr(struct
 	}
 
 	i = find_msr(&m->guest, msr);
-	j = find_msr(&m->host, msr);
+	if (!entry_only)
+		j = find_msr(&m->host, msr);
+
 	if (i == NR_AUTOLOAD_MSRS || j == NR_AUTOLOAD_MSRS) {
 		printk_once(KERN_WARNING "Not enough msr switch entries. "
 				"Can't add msr %x\n", msr);
@@ -2532,12 +2534,16 @@ static void add_atomic_switch_msr(struct
 		i = m->guest.nr++;
 		vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
 	}
+	m->guest.val[i].index = msr;
+	m->guest.val[i].value = guest_val;
+
+	if (entry_only)
+		return;
+
 	if (j < 0) {
 		j = m->host.nr++;
 		vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
 	}
-	m->guest.val[i].index = msr;
-	m->guest.val[i].value = guest_val;
 	m->host.val[j].index = msr;
 	m->host.val[j].value = host_val;
 }
@@ -2583,7 +2589,7 @@ static bool update_transition_efer(struc
 			guest_efer &= ~EFER_LME;
 		if (guest_efer != host_efer)
 			add_atomic_switch_msr(vmx, MSR_EFER,
-					      guest_efer, host_efer);
+					      guest_efer, host_efer, false);
 		return false;
 	} else {
 		guest_efer &= ~ignore_bits;
@@ -4037,7 +4043,7 @@ static int vmx_set_msr(struct kvm_vcpu *
 		vcpu->arch.ia32_xss = data;
 		if (vcpu->arch.ia32_xss != host_xss)
 			add_atomic_switch_msr(vmx, MSR_IA32_XSS,
-				vcpu->arch.ia32_xss, host_xss);
+				vcpu->arch.ia32_xss, host_xss, false);
 		else
 			clear_atomic_switch_msr(vmx, MSR_IA32_XSS);
 		break;
@@ -10040,7 +10046,7 @@ static void atomic_switch_perf_msrs(stru
 			clear_atomic_switch_msr(vmx, msrs[i].msr);
 		else
 			add_atomic_switch_msr(vmx, msrs[i].msr, msrs[i].guest,
-					msrs[i].host);
+					msrs[i].host, false);
 }
 
 static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)


* [patch V5 10/10] KVM magic # 10
  2018-07-02 15:44 [patch V5 00/10] KVM magic # 0 Thomas Gleixner
                   ` (8 preceding siblings ...)
  2018-07-02 15:44 ` [patch V5 09/10] KVM magic # 9 Thomas Gleixner
@ 2018-07-02 15:44 ` Thomas Gleixner
  2018-07-02 17:29   ` [MODERATED] " Paolo Bonzini
  2018-07-02 16:25 ` [patch V5 00/10] KVM magic # 0 Thomas Gleixner
  2018-07-02 22:14 ` [MODERATED] [patch v5 11/10] Linux+KVM magic # 1 Jiri Kosina
  11 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 15:44 UTC (permalink / raw)
  To: speck

From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
Subject: [patch V5 10/10] x86/KVM/VMX: Use MSR save list for IA32_FLUSH_CMD if required

If the L1D flush module parameter is set to 'always' and the IA32_FLUSH_CMD
MSR is available, optimize the VMENTER code with the MSR save list.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
v3: Actually engage the MSR save list
    Move it to function that frobs VMCS
v4: Rebase on top of Thomas's patches.
v5: Adopt it to the cleanup of the flush logic

---
 arch/x86/kvm/vmx.c |   33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6237,6 +6237,16 @@ static void ept_set_mmio_spte_mask(void)
 				   VMX_EPT_MISCONFIG_WX_VALUE);
 }
 
+static bool vmx_l1d_use_msr_save_list(void)
+{
+	if (!enable_ept || !boot_cpu_has_bug(X86_BUG_L1TF) ||
+	    static_cpu_has(X86_FEATURE_HYPERVISOR) ||
+	    !static_cpu_has(X86_FEATURE_FLUSH_L1D))
+		return false;
+
+	return vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
+}
+
 #define VMX_XSS_EXIT_BITMAP 0
 /*
  * Sets up the vmcs for emulated real mode.
@@ -6358,6 +6368,12 @@ static void vmx_vcpu_setup(struct vcpu_v
 		vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
 		vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
 	}
+	/*
+	 * If flushing the L1D cache on every VMENTER is enforced and the
+	 * MSR is available, use the MSR save list.
+	 */
+	if (vmx_l1d_use_msr_save_list())
+		add_atomic_switch_msr(vmx, MSR_IA32_FLUSH_CMD, L1D_FLUSH, 0, true);
 }
 
 static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -9604,14 +9620,18 @@ static void *vmx_l1d_flush_pages;
 static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 {
 	int size = PAGE_SIZE << L1D_CACHE_ORDER;
-	bool always;
+	bool always = false;
 
 	/*
-	 * If the mitigation mode is 'flush always', keep the flush bit
-	 * set, otherwise clear it. It gets set again either from
-	 * vcpu_run() or from one of the unsafe VMEXIT handlers.
+	 * If the CPU has the flush MSR then clear the flush bit. If not
+	 * then act depending on the mitigation mode: If 'flush always',
+	 * keep the flush bit set, otherwise clear it.
+	 *
+	 * The flush bit gets set again either from vcpu_run() or from one
+	 * of the unsafe VMEXIT handlers.
 	 */
-	always = vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
+	if (!static_cpu_has(X86_FEATURE_FLUSH_L1D))
+		always = vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
 	vcpu->arch.l1tf_flush_l1d = always;
 
 	vcpu->stat.l1d_flush++;
@@ -13205,7 +13225,8 @@ static int __init vmx_setup_l1d_flush(vo
 	struct page *page;
 
 	if (vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER ||
-	    !boot_cpu_has_bug(X86_BUG_L1TF))
+	    !boot_cpu_has_bug(X86_BUG_L1TF) ||
+	    vmx_l1d_use_msr_save_list())
 		return 0;
 
 	if (!boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {


* [MODERATED] Re: [patch V5 01/10] KVM magic # 1
  2018-07-02 15:44 ` [patch V5 01/10] KVM magic # 1 Thomas Gleixner
@ 2018-07-02 16:22   ` Konrad Rzeszutek Wilk
  2018-07-02 17:10     ` Thomas Gleixner
  2018-07-02 16:25   ` [MODERATED] " Borislav Petkov
  1 sibling, 1 reply; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-07-02 16:22 UTC (permalink / raw)
  To: speck

On Mon, Jul 02, 2018 at 05:44:27PM +0200, speck for Thomas Gleixner wrote:
> From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
> 
> commit 75e236a35266617b8c80ec3f5f98f6c81e326c5b upstream
> 
> If the L1TF CPU bug is present we allow the KVM module to be loaded
> as the majority of users that use Linux and KVM have trusted guests
> and do not want a broken setup.
> 
> Cloud vendors are the ones that are uncomfortable with CVE 2018-3615

s/3615/3620/

.. while the patch has the right number.

(3615 is for the SGX code change).


* Re: [patch V5 00/10] KVM magic # 0
  2018-07-02 15:44 [patch V5 00/10] KVM magic # 0 Thomas Gleixner
                   ` (9 preceding siblings ...)
  2018-07-02 15:44 ` [patch V5 10/10] KVM magic # 10 Thomas Gleixner
@ 2018-07-02 16:25 ` Thomas Gleixner
  2018-07-02 22:14 ` [MODERATED] [patch v5 11/10] Linux+KVM magic # 1 Jiri Kosina
  11 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 16:25 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 334 bytes --]

On Mon, 2 Jul 2018, speck for Thomas Gleixner wrote:

> Konrad ran short of cycles when addressing the review comments, so I took
> over from him and finished it up.

I failed to fix up some of the checkpatch.pl stuff, which Boris mentioned,
but have it fixed locally here now.

Git bundle on top of 4.18-rc1 attached.

Thanks,

	tglx

[-- Attachment #2: Type: application/octet-stream, Size: 55706 bytes --]


* [MODERATED] Re: [patch V5 01/10] KVM magic # 1
  2018-07-02 15:44 ` [patch V5 01/10] KVM magic # 1 Thomas Gleixner
  2018-07-02 16:22   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2018-07-02 16:25   ` Borislav Petkov
  1 sibling, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2018-07-02 16:25 UTC (permalink / raw)
  To: speck

On Mon, Jul 02, 2018 at 05:44:27PM +0200, speck for Thomas Gleixner wrote:
> From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
> 
> commit 75e236a35266617b8c80ec3f5f98f6c81e326c5b upstream

You probably don't want that here.

> If the L1TF CPU bug is present we allow the KVM module to be loaded
> as the majority of users that use Linux and KVM have trusted guests
> and do not want a broken setup.
> 
> Cloud vendors are the ones that are uncomfortable with CVE 2018-3615
> and as such they are the ones that should set disallow_smt to one.
> 
> Setting disallow_smt to one means that the system administrator also needs

In both occurrences above:

s/disallow_smt/nosmt/

too.

> to disable SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command
> line parameter, or via the /sys/devices/system/cpu/smt/control
> (see commit 05736e4ac13c cpu/hotplug: Provide knobs to control SMT).
> 
> Other mitigations are to use task affinity, cpu sets, interrupt binding,
> etc - anything to make sure that _only_ the same guests vCPUs are running
> on sibling threads.
> 
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
> v3:-Move it to vmx_vcpu_init (could also do it in kvm_vcpu_init but it seemed
>     more prudent to do it in VMX handler.
>    -Made it WARN if disallow_smt=0
>    -Made it ERR if disallow_smt=1
>    -Fixed the CVE number
> v4: s/disallow/nosmt/
> 
> ---
>  Documentation/admin-guide/kernel-parameters.txt |    6 ++++++
>  arch/x86/kvm/vmx.c                              |   13 +++++++++++++
>  kernel/cpu.c                                    |    1 +
>  3 files changed, 20 insertions(+)
> 
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1946,6 +1946,12 @@
>  			[KVM,ARM] Allow use of GICv4 for direct injection of
>  			LPIs.
>  
> +	kvm-intel.nosmt=[KVM,Intel] If the L1TF CPU bug is present (CVE-2018-3620)
> +			and the system has SMT (aka Hyper-Threading) enabled then
> +			don't allow guests to be created.
> +
> +			Default is 0 (allow guests to be created).
> +
>  	kvm-intel.ept=	[KVM,Intel] Disable extended page tables
>  			(virtualized MMU) support on capable Intel chips.
>  			Default is 1 (enabled)
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -71,6 +71,9 @@ static const struct x86_cpu_id vmx_cpu_i
>  };
>  MODULE_DEVICE_TABLE(x86cpu, vmx_cpu_id);
>  
> +static bool __read_mostly nosmt = false;

As already mentioned on IRC: quilt refresh :)

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 


* [MODERATED] Re: [patch V5 05/10] KVM magic # 5
  2018-07-02 15:44 ` [patch V5 05/10] KVM magic # 5 Thomas Gleixner
@ 2018-07-02 16:35   ` Konrad Rzeszutek Wilk
  2018-07-02 17:01     ` Thomas Gleixner
  2018-07-02 17:24   ` [MODERATED] " Paolo Bonzini
  2018-07-02 21:19   ` [MODERATED] Re: ***UNCHECKED*** " Alexander Graf
  2 siblings, 1 reply; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-07-02 16:35 UTC (permalink / raw)
  To: speck

..snip..
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -9576,9 +9576,20 @@ static int vmx_handle_exit(struct kvm_vc
>  #define L1D_CACHE_ORDER 4
>  static void *vmx_l1d_flush_pages;
>  
> -static void __maybe_unused vmx_l1d_flush(void)
> +static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
>  {
>  	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> +	bool always;
> +
> +	/*
> +	 * If the mitigation mode is 'flush always', keep the flush bit
> +	 * set, otherwise clear it. It gets set again either from
> +	 * vcpu_run() or from one of the unsafe VMEXIT handlers.
> +	 */
> +	always = vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
> +	vcpu->arch.l1tf_flush_l1d = always;

<blinks> You did the reset of arch.l1tf_flush_l1d _after_ we have done
this vmx_l1d_flush call, nice!! So obvious in retrospect.

See below one tiny comment.
..snip..
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -195,6 +195,7 @@ struct kvm_stats_debugfs_item debugfs_en
>  	{ "irq_injections", VCPU_STAT(irq_injections) },
>  	{ "nmi_injections", VCPU_STAT(nmi_injections) },
>  	{ "req_event", VCPU_STAT(req_event) },
> +	{ "l1d_flush", VCPU_STAT(l1d_flush) },
>  	{ "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
>  	{ "mmu_pte_write", VM_STAT(mmu_pte_write) },
>  	{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
> @@ -4799,6 +4800,8 @@ int kvm_read_guest_virt(struct kvm_vcpu
>  {
>  	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
>  
> +	/* The gva_to_pa walker can pull in tons of pages. */
> +	vcpu->arch.l1tf_flush_l1d = true;


I think also kvm_write_guest_virt_system? That covers vmptrst and vmread.
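
Something along these lines (illustrative only, exact placement up to you):

	/* hypothetical hunk, mirroring kvm_read_guest_virt() above */
	vcpu->arch.l1tf_flush_l1d = true;

at the top of kvm_write_guest_virt_system().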


>  	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
>  					  exception);
>  }
> @@ -6050,6 +6053,8 @@ int x86_emulate_instruction(struct kvm_v
>  	bool writeback = true;
>  	bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
>  
> +	vcpu->arch.l1tf_flush_l1d = true;
> +
>  	/*
>  	 * Clear write_fault_to_shadow_pgtable here to ensure it is
>  	 * never reused.
> @@ -7579,6 +7584,7 @@ static int vcpu_run(struct kvm_vcpu *vcp
>  	struct kvm *kvm = vcpu->kvm;
>  
>  	vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
> +	vcpu->arch.l1tf_flush_l1d = true;
>  
>  	for (;;) {
>  		if (kvm_vcpu_running(vcpu)) {
> @@ -8698,6 +8704,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcp
>  
>  void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
>  {
> +	vcpu->arch.l1tf_flush_l1d = true;
>  	kvm_x86_ops->sched_in(vcpu, cpu);
>  }
>  
> 


* Re: [patch V5 05/10] KVM magic # 5
  2018-07-02 16:35   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2018-07-02 17:01     ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 17:01 UTC (permalink / raw)
  To: speck

On Mon, 2 Jul 2018, speck for Konrad Rzeszutek Wilk wrote:
> > -static void __maybe_unused vmx_l1d_flush(void)
> > +static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
> >  {
> >  	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> > +	bool always;
> > +
> > +	/*
> > +	 * If the mitigation mode is 'flush always', keep the flush bit
> > +	 * set, otherwise clear it. It gets set again either from
> > +	 * vcpu_run() or from one of the unsafe VMEXIT handlers.
> > +	 */
> > +	always = vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
> > +	vcpu->arch.l1tf_flush_l1d = always;
> 
> <blinks> You did the reset of arch.l1tf_flush_l1d _after_ we have done
> this vmx_l1d_flush call, nice!! So obvious in retrospect.

:)

> > @@ -4799,6 +4800,8 @@ int kvm_read_guest_virt(struct kvm_vcpu
> >  {
> >  	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
> >  
> > +	/* The gva_to_pa walker can pull in tons of pages. */
> > +	vcpu->arch.l1tf_flush_l1d = true;
> 
> 
> I think also kvm_write_guest_virt_system ? That covers vmptrs and vmread.

That fell through the cracks when picking the bits from the original
patch. Fixed.

Thanks,

	tglx


* Re: [patch V5 01/10] KVM magic # 1
  2018-07-02 16:22   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2018-07-02 17:10     ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 17:10 UTC (permalink / raw)
  To: speck

On Mon, 2 Jul 2018, speck for Konrad Rzeszutek Wilk wrote:

> On Mon, Jul 02, 2018 at 05:44:27PM +0200, speck for Thomas Gleixner wrote:
> > From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
> > 
> > commit 75e236a35266617b8c80ec3f5f98f6c81e326c5b upstream
> > 
> > If the L1TF CPU bug is present we allow the KVM module to be loaded
> > as the majority of users that use Linux and KVM have trusted guests
> > and do not want a broken setup.
> > 
> > Cloud vendors are the ones that are uncomfortable with CVE 2018-3615
> 
> s/3615/3620/
> 
> .. while the patch has the right number.
> 
> (3615 is for the SGX code change).

Fixed.


* [MODERATED] Re: [patch V5 05/10] KVM magic # 5
  2018-07-02 15:44 ` [patch V5 05/10] KVM magic # 5 Thomas Gleixner
  2018-07-02 16:35   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2018-07-02 17:24   ` Paolo Bonzini
  2018-07-03 15:21     ` Thomas Gleixner
  2018-07-02 21:19   ` [MODERATED] Re: ***UNCHECKED*** " Alexander Graf
  2 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2018-07-02 17:24 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 642 bytes --]

On 02/07/2018 17:44, speck for Thomas Gleixner wrote:
>  	{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
> @@ -4799,6 +4800,8 @@ int kvm_read_guest_virt(struct kvm_vcpu
>  {
>  	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
>  
> +	/* The gva_to_pa walker can pull in tons of pages. */

It pulls in guest pages though, so does it really matter?  Also, why
read and not write too?  (Fetch is only used by the emulator which sets
the flag differently).

Paolo

> +	vcpu->arch.l1tf_flush_l1d = true;
>  	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
>  					  exception);
>  }




* [MODERATED] Re: [patch V5 10/10] KVM magic # 10
  2018-07-02 15:44 ` [patch V5 10/10] KVM magic # 10 Thomas Gleixner
@ 2018-07-02 17:29   ` Paolo Bonzini
  2018-07-02 17:41     ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2018-07-02 17:29 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 918 bytes --]

On 02/07/2018 17:44, speck for Thomas Gleixner wrote:
> +	 * If the CPU has the flush MSR then clear the flush bit. If not
> +	 * then act depending on the mitigation mode: If 'flush always',
> +	 * keep the flush bit set, otherwise clear it.
> +	 *
> +	 * The flush bit gets set again either from vcpu_run() or from one
> +	 * of the unsafe VMEXIT handlers.
>  	 */

Perhaps clearer:

	/*
	 * If the CPU has the flush MSR, then "flush always" mode
	 * is handled via MSR autoload; no need to do anything here.
	 */
	always = vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
	if (static_cpu_has(X86_FEATURE_FLUSH_L1D) && always) {
		vcpu->arch.l1tf_flush_l1d = false;
		return;
	}

	/*
	 * If not in "flush always" mode, clear the bit; vcpu_run()
	 * or one of the unsafe vmexit handlers will set it back.
	 */
	vcpu->arch.l1tf_flush_l1d = always;
 	vcpu->stat.l1d_flush++;

etc.

Paolo



* Re: [patch V5 10/10] KVM magic # 10
  2018-07-02 17:29   ` [MODERATED] " Paolo Bonzini
@ 2018-07-02 17:41     ` Thomas Gleixner
  2018-07-02 22:11       ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 17:41 UTC (permalink / raw)
  To: speck

On Mon, 2 Jul 2018, speck for Paolo Bonzini wrote:

> On 02/07/2018 17:44, speck for Thomas Gleixner wrote:
> > +	 * If the CPU has the flush MSR then clear the flush bit. If not
> > +	 * then act depending on the mitigation mode: If 'flush always',
> > +	 * keep the flush bit set, otherwise clear it.
> > +	 *
> > +	 * The flush bit gets set again either from vcpu_run() or from one
> > +	 * of the unsafe VMEXIT handlers.
> >  	 */
> 
> Perhaps clearer:
> 
> 	/*
> 	 * If the CPU has the flush MSR, then "flush always" mode
> 	 * is handled via MSR autoload; no need to do anything here.
> 	 */

That code is not reached in the automsr mode.

> 	always = vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
> 	if (static_cpu_has(X86_FEATURE_FLUSH_L1D) && always) {
> 		vcpu->arch.l1tf_flush_l1d = false;
> 		return;
> 	}
> 
> 	/*
> 	 * If not in "flush always" mode, clear the bit; vcpu_run()
> 	 * or one of the unsafe vmexit handlers will set it back.
> 	 */
> 	vcpu->arch.l1tf_flush_l1d = always;
>  	vcpu->stat.l1d_flush++;
> 
> etc.
> 
> Paolo
> 
> 


* [MODERATED] Re: ***UNCHECKED*** [patch V5 05/10] KVM magic # 5
  2018-07-02 15:44 ` [patch V5 05/10] KVM magic # 5 Thomas Gleixner
  2018-07-02 16:35   ` [MODERATED] " Konrad Rzeszutek Wilk
  2018-07-02 17:24   ` [MODERATED] " Paolo Bonzini
@ 2018-07-02 21:19   ` Alexander Graf
  2018-07-02 23:45     ` Andrew Cooper
  2018-07-03  8:23     ` Paolo Bonzini
  2 siblings, 2 replies; 37+ messages in thread
From: Alexander Graf @ 2018-07-02 21:19 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 4000 bytes --]



On 02.07.18 17:44, speck for Thomas Gleixner wrote:
> Subject: [patch V5 05/10] x86/KVM/VMX: Add L1D flush logic
> From: Paolo Bonzini pbonzini@redhat.com
> 
> Add the logic for flushing L1D on VMENTER. The flush depends on the static
> key being enabled and the new l1tf_flush_l1d flag being set.
> 
> The flag is set:
>  - Always, if the flush module parameter is 'always'
> 
>  - Conditionally at:
>    - Entry to vcpu_run(), i.e. after executing user space
> 
>    - From the sched_in notifier, i.e. when switching to a vCPU thread.
> 
>    - From vmexit handlers which are considered unsafe, i.e. where
>      sensitive data can be brought into L1D:
> 
>      - The emulator, which could be a good target for other speculative
>        execution-based threats,
> 
>      - The MMU, which can bring host page tables in the L1 cache.
>      
>      - External interrupts
> 
>      - Nested operations that require the MMU (see above). That is
>        vmptrld, vmptrst, vmclear,vmwrite,vmread.
> 
>      - When handling invept,invvpid
> 
> [ tglx: Split out from combo patch and reduced to a single flag ]
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> v4: Fix from Thomas using static key
> v5: Split it out from the big combo patch
>     Simplify the flags logic
> 
>  arch/x86/include/asm/kvm_host.h |    4 ++++
>  arch/x86/kvm/mmu.c              |    1 +
>  arch/x86/kvm/vmx.c              |   22 +++++++++++++++++++++-
>  arch/x86/kvm/x86.c              |    7 +++++++
>  4 files changed, 33 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -713,6 +713,9 @@ struct kvm_vcpu_arch {
>  
>  	/* be preempted when it's in kernel-mode(cpl=0) */
>  	bool preempted_in_kernel;
> +
> +	/* Flush the L1 Data cache for L1TF mitigation on VMENTER */
> +	bool l1tf_flush_l1d;
>  };
>  
>  struct kvm_lpage_info {
> @@ -881,6 +884,7 @@ struct kvm_vcpu_stat {
>  	u64 signal_exits;
>  	u64 irq_window_exits;
>  	u64 nmi_window_exits;
> +	u64 l1d_flush;
>  	u64 halt_exits;
>  	u64 halt_successful_poll;
>  	u64 halt_attempted_poll;
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3840,6 +3840,7 @@ int kvm_handle_page_fault(struct kvm_vcp
>  {
>  	int r = 1;
>  
> +	vcpu->arch.l1tf_flush_l1d = true;
>  	switch (vcpu->arch.apf.host_apf_reason) {
>  	default:
>  		trace_kvm_page_fault(fault_address, error_code);
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -9576,9 +9576,20 @@ static int vmx_handle_exit(struct kvm_vc
>  #define L1D_CACHE_ORDER 4
>  static void *vmx_l1d_flush_pages;
>  
> -static void __maybe_unused vmx_l1d_flush(void)
> +static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
>  {
>  	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> +	bool always;
> +
> +	/*
> +	 * If the mitigation mode is 'flush always', keep the flush bit
> +	 * set, otherwise clear it. It gets set again either from
> +	 * vcpu_run() or from one of the unsafe VMEXIT handlers.
> +	 */
> +	always = vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
> +	vcpu->arch.l1tf_flush_l1d = always;
> +
> +	vcpu->stat.l1d_flush++;
>  
>  	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
>  		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> @@ -9847,6 +9858,7 @@ static void vmx_handle_external_intr(str
>  			[ss]"i"(__KERNEL_DS),
>  			[cs]"i"(__KERNEL_CS)
>  			);
> +		vcpu->arch.l1tf_flush_l1d = true;

This means we flush when an external interrupt is the trap reason, but
we may as well get the interrupt only after interrupts are enabled again
from vcpu_enter_guest(), no? How is an interrupt executing "normally"
any different from an interrupt that causes a #vmexit?

Or in other words: Is it worth it to have anything but always/never as
an option? Has anyone done benchmarks?


Alex


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch V5 10/10] KVM magic # 10
  2018-07-02 17:41     ` Thomas Gleixner
@ 2018-07-02 22:11       ` Thomas Gleixner
  2018-07-03  9:05         ` [MODERATED] " Paolo Bonzini
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-02 22:11 UTC (permalink / raw)
  To: speck

On Mon, 2 Jul 2018, speck for Thomas Gleixner wrote:
> On Mon, 2 Jul 2018, speck for Paolo Bonzini wrote:
> 
> > On 02/07/2018 17:44, speck for Thomas Gleixner wrote:
> > > +	 * If the CPU has the flush MSR then clear the flush bit. If not
> > > +	 * then act depending on the mitigation mode: If 'flush always',
> > > +	 * keep the flush bit set, otherwise clear it.
> > > +	 *
> > > +	 * The flush bit gets set again either from vcpu_run() or from one
> > > +	 * of the unsafe VMEXIT handlers.
> > >  	 */
> > 
> > Perhaps clearer:
> > 
> > 	/*
> > 	 * If the CPU has the flush MSR, then "flush always" mode
> > 	 * is handled via MSR autoload; no need to do anything here.
> > 	 */
> 
> That code is not reached in the automsr mode.

Updated version below.

Thanks,

	tglx
8<--------------------
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6237,6 +6237,16 @@ static void ept_set_mmio_spte_mask(void)
 				   VMX_EPT_MISCONFIG_WX_VALUE);
 }
 
+static bool vmx_l1d_use_msr_save_list(void)
+{
+	if (!enable_ept || !boot_cpu_has_bug(X86_BUG_L1TF) ||
+	    static_cpu_has(X86_FEATURE_HYPERVISOR) ||
+	    !static_cpu_has(X86_FEATURE_FLUSH_L1D))
+		return false;
+
+	return vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
+}
+
 #define VMX_XSS_EXIT_BITMAP 0
 /*
  * Sets up the vmcs for emulated real mode.
@@ -6358,6 +6368,12 @@ static void vmx_vcpu_setup(struct vcpu_v
 		vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
 		vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
 	}
+	/*
+	 * If flushing the L1D cache on every VMENTER is enforced and the
+	 * MSR is available, use the MSR save list.
+	 */
+	if (vmx_l1d_use_msr_save_list())
+		add_atomic_switch_msr(vmx, MSR_IA32_FLUSH_CMD, L1D_FLUSH, 0, true);
 }
 
 static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -9607,11 +9623,26 @@ static void vmx_l1d_flush(struct kvm_vcp
 	bool always;
 
 	/*
-	 * If the mitigation mode is 'flush always', keep the flush bit
-	 * set, otherwise clear it. It gets set again either from
-	 * vcpu_run() or from one of the unsafe VMEXIT handlers.
+	 * This code is only executed when:
+	 * - the flush mode is 'cond'
+	 * - the flush mode is 'always' and the flush MSR is not
+	 *   available
+	 *
+	 * If the CPU has the flush MSR then clear the flush bit because
+	 * 'always' mode is handled via the MSR save list.
+	 *
+	 * If the MSR is not available then act depending on the mitigation
+	 * mode: If 'flush always', keep the flush bit set, otherwise clear
+	 * it.
+	 *
+	 * The flush bit gets set again either from vcpu_run() or from one
+	 * of the unsafe VMEXIT handlers.
 	 */
-	always = vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
+	if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
+		always = false;
+	else
+		always = vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
+
 	vcpu->arch.l1tf_flush_l1d = always;
 
 	vcpu->stat.l1d_flush++;
@@ -13205,7 +13236,8 @@ static int __init vmx_setup_l1d_flush(vo
 	struct page *page;
 
 	if (vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER ||
-	    !boot_cpu_has_bug(X86_BUG_L1TF))
+	    !boot_cpu_has_bug(X86_BUG_L1TF) ||
+	    vmx_l1d_use_msr_save_list())
 		return 0;
 
 	if (!boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [MODERATED] [patch v5 11/10] Linux+KVM magic # 1
  2018-07-02 15:44 [patch V5 00/10] KVM magic # 0 Thomas Gleixner
                   ` (10 preceding siblings ...)
  2018-07-02 16:25 ` [patch V5 00/10] KVM magic # 0 Thomas Gleixner
@ 2018-07-02 22:14 ` Jiri Kosina
  2018-07-05 23:56   ` [MODERATED] " Josh Poimboeuf
  11 siblings, 1 reply; 37+ messages in thread
From: Jiri Kosina @ 2018-07-02 22:14 UTC (permalink / raw)
  To: speck

Introduce 'l1tf=' boot option to allow for boot-time switching of
mitigation that is used on CPUs affected by L1TF.

The possible values are

        full    Provide all available mitigations for L1TF
                vulnerability (disable HT, perform PTE bit
                inversion, allow hypervisors to know that
                they should provide all mitigations)
        novirt  Provide all available mitigations needed
                for running on bare metal (PTE bit inversion),
                while not applying mitigations needed for
                VM isolation. Hypervisors will be issuing
                warning when first VM is being started in
                pontentially insecure configuraion
        off     Claim "I don't care at all about this issue".
                The PTE bit inversion (bare metal mitigation) will
                still be performed, but hypervisors will not be
                issuing warning when VM is being started in
                potentially insecure configuration

Let KVM adhere to this semantics, which means:

	- 'l1tf=full': start performing L1D flushes
	- 'l1tf=novirt': warn (or, depending on 'kvm.nosmt', refuse completely)
	  on VM start
	- 'l1tf=off': no extra action is taken

Also, introduce CONFIG_ option to allow vendors to chose the default
mitigation to be used on affected CPUs in case no l1tf= cmdline option is
present.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
---

As discussed previously, this goes on top of Konrad's v5 KVM patchset, and 
implements the l1tf= bootparam handling as described in the changelog, 
including the KVM changes.

 Documentation/admin-guide/kernel-parameters.txt | 18 ++++++++
 arch/x86/Kconfig                                | 18 ++++++++
 arch/x86/include/asm/processor.h                |  7 ++++
 arch/x86/kernel/cpu/bugs.c                      | 56 +++++++++++++++++++++++--
 arch/x86/kvm/vmx.c                              | 46 ++++++++++++++++----
 include/linux/cpu.h                             |  2 +
 kernel/cpu.c                                    |  9 +++-
 7 files changed, 144 insertions(+), 12 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 4f790566ad91..2f1de18f4ac0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1989,6 +1989,24 @@
 			feature (tagged TLBs) on capable Intel chips.
 			Default is 1 (enabled)
 
+	l1tf=           [X86] Control mitigation of L1TF vulnerability on the
+			      affected CPUs
+			full	Provide all available mitigations for L1TF
+				vulnerability (disable HT, perform PTE bit
+				inversion, allow hypervisors to know that
+				they should provide all mitigations)
+			novirt	Provide all available mitigations needed
+				for running on bare metal (PTE bit inversion),
+				while not applying mitigations needed for
+				VM isolation. Hypervisors will be issuing
+				warning when first VM is being started in
+				pontentially insecure configuraion
+			off	Claim "I don't care at all about this issue".
+				The PTE bit inversion (bare metal mitigation) will
+				still be performed, but hypervisors will not be
+				issuing warning when VM is being started in
+				potentially insecure configuration
+
 	l2cr=		[PPC]
 
 	l3cr=		[PPC]
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7a34fdf8daf0..7aa47721c664 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2390,6 +2390,24 @@ config MODIFY_LDT_SYSCALL
 	  surface.  Disabling it removes the modify_ldt(2) system call.
 
 	  Saying 'N' here may make sense for embedded or server kernels.
+choice
+	prompt "Default L1TF mitigation"
+	default L1TF_MITIGATION_NOVIRT
+	help
+		Define what the default behavior for selecting mitigation on
+		CPUs affected by L1TF should be. The default can be overrided
+		on the kernel command-line. Refer to
+		<file:Documentation/admin-guide/kernel-parameters.txt>
+
+config L1TF_MITIGATION_FULL
+	bool "Full available L1TF mitigation"
+config L1TF_MITIGATION_NOVIRT
+	bool "Use L1TF bare metal mitigations only"
+config L1TF_MITIGATION_OFF
+	bool "Ignore L1TF issue"
+
+endchoice
+
 
 source "kernel/livepatch/Kconfig"
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7e3ac5eedcd6..05471c590964 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -982,4 +982,11 @@ bool xen_set_default_idle(void);
 void stop_this_cpu(void *dummy);
 void df_debug(struct pt_regs *regs, long error_code);
 void microcode_check(void);
+
+enum l1tf_mitigations {
+	L1TF_MITIGATION_OFF,
+	L1TF_MITIGATION_NOVIRT,
+	L1TF_MITIGATION_FULL
+};
+enum l1tf_mitigations get_l1tf_mitigation(void);
 #endif /* _ASM_X86_PROCESSOR_H */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 50500cea6eba..9aa8b94334d5 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -657,6 +657,23 @@ void x86_spec_ctrl_setup_ap(void)
 
 #undef pr_fmt
 #define pr_fmt(fmt)	"L1TF: " fmt
+/* Default mitigation for L1TF-affected CPUs */
+static int l1tf_mitigation =
+#ifdef CONFIG_L1TF_MITIGATION_FULL
+	L1TF_MITIGATION_FULL;
+#endif
+#ifdef CONFIG_L1TF_MITIGATION_NOVIRT
+	L1TF_MITIGATION_NOVIRT;
+#endif
+#ifdef CONFIG_L1TF_MITIGATION_OFF
+	L1TF_MITIGATION_OFF;
+#endif
+enum l1tf_mitigations get_l1tf_mitigation(void)
+{
+	return l1tf_mitigation;
+}
+EXPORT_SYMBOL(get_l1tf_mitigation);
+
 static void __init l1tf_select_mitigation(void)
 {
 	u64 half_pa;
@@ -664,6 +681,15 @@ static void __init l1tf_select_mitigation(void)
 	if (!boot_cpu_has_bug(X86_BUG_L1TF))
 		return;
 
+	switch (get_l1tf_mitigation()) {
+	case L1TF_MITIGATION_FULL:
+		cpu_smt_disable(true);
+		break;
+	case L1TF_MITIGATION_OFF:
+	case L1TF_MITIGATION_NOVIRT:
+		break;
+	}
+
 #if CONFIG_PGTABLE_LEVELS == 2
 	pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n");
 	return;
@@ -682,10 +708,36 @@ static void __init l1tf_select_mitigation(void)
 
 	setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV);
 }
+
+static int __init l1tf_cmdline(char *str)
+{
+	if (!boot_cpu_has_bug(X86_BUG_L1TF))
+		return 0;
+
+	if (!str)
+		return 0;
+
+	if (!strcmp(str, "full"))
+		l1tf_mitigation = L1TF_MITIGATION_FULL;
+	else if (!strcmp(str, "novirt"))
+		l1tf_mitigation = L1TF_MITIGATION_NOVIRT;
+	else if (!strcmp(str, "off"))
+		l1tf_mitigation = L1TF_MITIGATION_OFF;
+
+	return 0;
+}
+early_param("l1tf", l1tf_cmdline);
+
 #undef pr_fmt
 
 #ifdef CONFIG_SYSFS
 
+static const char *l1tf_states[] = {
+	[L1TF_MITIGATION_FULL]		= "Mitigation: Full",
+	[L1TF_MITIGATION_NOVIRT]	= "Mitigation: Page Table Inversion",
+	[L1TF_MITIGATION_OFF]		= "Mitigation: Page Table Inversion"
+};
+
 static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr,
 			       char *buf, unsigned int bug)
 {
@@ -712,9 +764,7 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
 		return sprintf(buf, "%s\n", ssb_strings[ssb_mode]);
 
 	case X86_BUG_L1TF:
-		if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
-			return sprintf(buf, "Mitigation: Page Table Inversion\n");
-		break;
+		return sprintf(buf, "%s\n", l1tf_states[get_l1tf_mitigation()]);
 
 	default:
 		break;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0e53034061c1..a568efccd56f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10528,20 +10528,51 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 	return ERR_PTR(err);
 }
 
-#define L1TF_MSG "SMT enabled with L1TF CPU bug present. Refer to CVE-2018-3620 for details.\n"
+#define L1TF_MSG_NOVIRT "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. Refer to CVE-2018-3620 for details.\n"
+#define L1TF_MSG_NOSMT "L1TF CPU bug present and SMT enabled, refusing to start VM. Refer to CVE-2018-3620 for details.\n"
+
+/*
+ * Check whether we should be really strict about denying to start VM
+ * with respect to current SMT status
+ */
+static bool strict_nosmt(void)
+{
+	return (cpu_smt_control == CPU_SMT_ENABLED && nosmt);
+}
 
 static int vmx_vm_init(struct kvm *kvm)
 {
 	if (!ple_gap)
 		kvm->arch.pause_in_guest = true;
 
-	if (boot_cpu_has(X86_BUG_L1TF) && cpu_smt_control == CPU_SMT_ENABLED) {
-		if (nosmt) {
-			pr_err(L1TF_MSG);
-			return -EOPNOTSUPP;
+	if (boot_cpu_has(X86_BUG_L1TF)) {
+		switch (get_l1tf_mitigation()) {
+			case L1TF_MITIGATION_OFF:
+				/* The 'I explicitly don't care' flag is set */
+				break;
+			case L1TF_MITIGATION_NOVIRT:
+				/*
+				 * Warn if potentially insecure, deny to load
+				 * if insecure and 'nosmt'
+				 */
+				if (strict_nosmt()) {
+					pr_err(L1TF_MSG_NOSMT);
+					return -EOPNOTSUPP;
+				} else {
+					pr_warn(L1TF_MSG_NOVIRT);
+				}
+				break;
+			case L1TF_MITIGATION_FULL:
+				/* Deny to load if insecure */
+				if (strict_nosmt()) {
+					pr_err(L1TF_MSG_NOSMT);
+					return -EOPNOTSUPP;
+				}
+				/* All is good, proceed without hiccup */
+				break;
 		}
-		pr_warn(L1TF_MSG);
 	}
+
 	return 0;
 }
 
@@ -13226,7 +13257,8 @@ static int __init vmx_setup_l1d_flush(void)
 
 	if (vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER ||
 	    !boot_cpu_has_bug(X86_BUG_L1TF) ||
-	    vmx_l1d_use_msr_save_list())
+	    vmx_l1d_use_msr_save_list() ||
+	    get_l1tf_mitigation() != L1TF_MITIGATION_FULL)
 		return 0;
 
 	if (!boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 7532cbf27b1d..3a3b5c4b1d4a 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -177,8 +177,10 @@ enum cpuhp_smt_control {
 
 #if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
 extern enum cpuhp_smt_control cpu_smt_control;
+void cpu_smt_disable(bool force);
 #else
 # define cpu_smt_control		(CPU_SMT_ENABLED)
+static inline void cpu_smt_disable(bool force) { }
 #endif
 
 #endif /* _LINUX_CPU_H_ */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 5a00ebdf98c6..5c9ff956e8eb 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -347,13 +347,18 @@ EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
 enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
 EXPORT_SYMBOL_GPL(cpu_smt_control);
 
-static int __init smt_cmdline_disable(char *str)
+void __init cpu_smt_disable(bool force)
 {
 	cpu_smt_control = CPU_SMT_DISABLED;
-	if (str && !strcmp(str, "force")) {
+	if (force) {
 		pr_info("SMT: Force disabled\n");
 		cpu_smt_control = CPU_SMT_FORCE_DISABLED;
 	}
+}
+
+static int __init smt_cmdline_disable(char *str)
+{
+	cpu_smt_disable(str && !strcmp(str, "force"));
 	return 0;
 }
 early_param("nosmt", smt_cmdline_disable);

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [MODERATED] Re: [patch V5 03/10] KVM magic # 3
  2018-07-02 15:44 ` [patch V5 03/10] KVM magic # 3 Thomas Gleixner
@ 2018-07-02 23:42   ` Jon Masters
  0 siblings, 0 replies; 37+ messages in thread
From: Jon Masters @ 2018-07-02 23:42 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1537 bytes --]

On 07/02/2018 11:44 AM, speck for Thomas Gleixner wrote:
> +/*
> + * Software based L1D cache flush which is used when microcode providing
> + * the cache control MSR is not loaded.
> + *
> + * The L1D cache is 32 KiB on Nehalem and later microarchitectures, but to
> + * flush it is required to read in 64 KiB because the replacement algorithm
> + * is not exactly LRU. This could be sized at runtime via topology
> + * information but as all relevant affected CPUs have 32KiB L1D cache size
> + * there is no point in doing so.
> + */
> +#define L1D_CACHE_ORDER 4
> +static void *vmx_l1d_flush_pages;
> +
> +static void __maybe_unused vmx_l1d_flush(void)
> +{
> +	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> +
> +	asm volatile(
> +		/* First ensure the pages are in the TLB */
> +		"xorl	%%eax, %%eax\n"
> +		".Lpopulate_tlb:\n\t"
> +		"movzbl	(%[empty_zp], %%" _ASM_AX "), %%ecx\n\t"
> +		"addl	$4096, %%eax\n\t"
> +		"cmpl	%%eax, %[size]\n\t"
> +		"jne	.Lpopulate_tlb\n\t"
> +		"xorl	%%eax, %%eax\n\t"
> +		"cpuid\n\t"
> +		/* Now fill the cache */
> +		"xorl	%%eax, %%eax\n"
> +		".Lfill_cache:\n"
> +		"movzbl	(%[empty_zp], %%" _ASM_AX "), %%ecx\n\t"
> +		"addl	$64, %%eax\n\t"
> +		"cmpl	%%eax, %[size]\n\t"
> +		"jne	.Lfill_cache\n\t"
> +		"lfence\n"
> +		:: [empty_zp] "r" (vmx_l1d_flush_pages),
> +		    [size] "r" (size)
> +		: "eax", "ebx", "ecx", "edx");
> +}

Per your request, I explicitly reviewed the above.
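
FWIW, for anyone else eyeballing it, the loop boils down to roughly the
following C. This is an untested sketch only, the function name is made
up, and the real thing has to stay in asm so the compiler cannot reorder
or optimize the reads away:

static void vmx_l1d_flush_sw_sketch(void)
{
	int size = PAGE_SIZE << L1D_CACHE_ORDER;	/* 64 KiB */
	volatile char *p = vmx_l1d_flush_pages;
	int i;

	/* Pass 1: touch one byte per page so the pages are in the TLB */
	for (i = 0; i < size; i += PAGE_SIZE)
		(void)p[i];

	/* the asm serializes with cpuid between the two passes */

	/* Pass 2: read one byte per 64-byte cache line to fill the L1D */
	for (i = 0; i < size; i += 64)
		(void)p[i];

	/* the asm ends with lfence */
}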

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [MODERATED] Re: ***UNCHECKED*** [patch V5 05/10] KVM magic # 5
  2018-07-02 21:19   ` [MODERATED] Re: ***UNCHECKED*** " Alexander Graf
@ 2018-07-02 23:45     ` Andrew Cooper
  2018-07-03  1:33       ` Linus Torvalds
  2018-07-03  8:23     ` Paolo Bonzini
  1 sibling, 1 reply; 37+ messages in thread
From: Andrew Cooper @ 2018-07-02 23:45 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 4743 bytes --]

On 02/07/2018 22:19, speck for Alexander Graf wrote:
>
> On 02.07.18 17:44, speck for Thomas Gleixner wrote:
>> Subject: [patch V5 05/10] x86/KVM/VMX: Add L1D flush logic
>> From: Paolo Bonzini pbonzini@redhat.com
>>
>> Add the logic for flushing L1D on VMENTER. The flush depends on the static
>> key being enabled and the new l1tf_flush_l1d flag being set.
>>
>> The flag is set:
>>  - Always, if the flush module parameter is 'always'
>>
>>  - Conditionally at:
>>    - Entry to vcpu_run(), i.e. after executing user space
>>
>>    - From the sched_in notifier, i.e. when switching to a vCPU thread.
>>
>>    - From vmexit handlers which are considered unsafe, i.e. where
>>      sensitive data can be brought into L1D:
>>
>>      - The emulator, which could be a good target for other speculative
>>        execution-based threats,
>>
>>      - The MMU, which can bring host page tables in the L1 cache.
>>      
>>      - External interrupts
>>
>>      - Nested operations that require the MMU (see above). That is
>>        vmptrld, vmptrst, vmclear,vmwrite,vmread.
>>
>>      - When handling invept,invvpid
>>
>> [ tglx: Split out from combo patch and reduced to a single flag ]
>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> ---
>> v4: Fix from Thomas using static key
>> v5: Split it out from the big combo patch
>>     Simplify the flags logic
>>
>>  arch/x86/include/asm/kvm_host.h |    4 ++++
>>  arch/x86/kvm/mmu.c              |    1 +
>>  arch/x86/kvm/vmx.c              |   22 +++++++++++++++++++++-
>>  arch/x86/kvm/x86.c              |    7 +++++++
>>  4 files changed, 33 insertions(+), 1 deletion(-)
>>
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -713,6 +713,9 @@ struct kvm_vcpu_arch {
>>  
>>  	/* be preempted when it's in kernel-mode(cpl=0) */
>>  	bool preempted_in_kernel;
>> +
>> +	/* Flush the L1 Data cache for L1TF mitigation on VMENTER */
>> +	bool l1tf_flush_l1d;
>>  };
>>  
>>  struct kvm_lpage_info {
>> @@ -881,6 +884,7 @@ struct kvm_vcpu_stat {
>>  	u64 signal_exits;
>>  	u64 irq_window_exits;
>>  	u64 nmi_window_exits;
>> +	u64 l1d_flush;
>>  	u64 halt_exits;
>>  	u64 halt_successful_poll;
>>  	u64 halt_attempted_poll;
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -3840,6 +3840,7 @@ int kvm_handle_page_fault(struct kvm_vcp
>>  {
>>  	int r = 1;
>>  
>> +	vcpu->arch.l1tf_flush_l1d = true;
>>  	switch (vcpu->arch.apf.host_apf_reason) {
>>  	default:
>>  		trace_kvm_page_fault(fault_address, error_code);
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -9576,9 +9576,20 @@ static int vmx_handle_exit(struct kvm_vc
>>  #define L1D_CACHE_ORDER 4
>>  static void *vmx_l1d_flush_pages;
>>  
>> -static void __maybe_unused vmx_l1d_flush(void)
>> +static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
>>  {
>>  	int size = PAGE_SIZE << L1D_CACHE_ORDER;
>> +	bool always;
>> +
>> +	/*
>> +	 * If the mitigation mode is 'flush always', keep the flush bit
>> +	 * set, otherwise clear it. It gets set again either from
>> +	 * vcpu_run() or from one of the unsafe VMEXIT handlers.
>> +	 */
>> +	always = vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
>> +	vcpu->arch.l1tf_flush_l1d = always;
>> +
>> +	vcpu->stat.l1d_flush++;
>>  
>>  	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
>>  		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
>> @@ -9847,6 +9858,7 @@ static void vmx_handle_external_intr(str
>>  			[ss]"i"(__KERNEL_DS),
>>  			[cs]"i"(__KERNEL_CS)
>>  			);
>> +		vcpu->arch.l1tf_flush_l1d = true;
> This means we flush when an external interrupt is the trap reason, but
> we may as well get the interrupt only after interrupts are enabled again
> from vcpu_enter_guest(), no? How is an interrupt executing "normally"
> any different from an interrupt that causes a #vmexit?
>
> Or in other words: Is it worth it to have anything but always/never as
> an option? Has anyone done benchmarks?

First-pass Xen benchmarks with no-HT and L1D_FLUSH on every vmenter show
performance hits of up to 50% depending on workload.  Specifically,
aggregate small-block random access got hit the most, whereas aggregate
small-packet bandwidth only took a 20% hit IIRC.

We are still investigating, but the block side of things is very
unattractive.  OTOH, I remain to be convinced by the "we're only taking
interrupts and the prefetch isn't pulling in anything interesting"
argument, because it's a bit too woolly for my liking, especially in the
presence of next-page-prefetch on Haswell and later.

~Andrew


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [MODERATED] Re: ***UNCHECKED*** [patch V5 05/10] KVM magic # 5
  2018-07-02 23:45     ` Andrew Cooper
@ 2018-07-03  1:33       ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2018-07-03  1:33 UTC (permalink / raw)
  To: speck



On Tue, 3 Jul 2018, speck for Andrew Cooper wrote:
> 
> We are still investigating, but the block side of things is very
> unattractive.  OTOH, I remain to be convinced by the "we're only taking
> interrupts and the prefetch isn't pulling in anything interesting"
> argument, because it's a bit too woolly for my liking, especially in the
> presence of next-page-prefetch on Haswell and later.

I think the whole argument that "we're only taking an interrupt" is pure 
BS.

The interrupt code fundamentally brings in a lot of the really core kernel 
data structures into the L1 cache. Do a wakeup (and what interrupt 
doesn't?) and you bring in core task structure data and scheduler stuff 
etc. 

Maybe you can say that there is no PIO, and that a disk interrupt won't 
have any of the actual disk data in the caches. Maybe that's true, and 
maybe it isn't. 

Sure, an interrupt may not touch keys and other super-sekrit stuff 
(although is it clear that a network interrupt won't?), but what is the 
argument that this all is protecting against? Maybe somebody like Amazon 
cloud really considers the host almost entirely uninteresting, and the 
only thing that matters is whether it ran another VM?

Regardless, I think the "only interrupts" is a pure garbage argument that 
makes no sense at all.  I think it's purely wishful thinking on the parts 
of people who want lower overhead, but don't actually have a real reason 
for why the L1 doesn't need flushing.

A *real* reason would be "we don't care about random host kernel internal 
data leaking, so we only want to flush if we've scheduled to something 
else". That is still a somewhat suspect argument, but it's a hell of a lot 
more believable than some "interrupts-shminterrupts" handwaving.

                   Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [MODERATED] Re: ***UNCHECKED*** [patch V5 05/10] KVM magic # 5
  2018-07-02 21:19   ` [MODERATED] Re: ***UNCHECKED*** " Alexander Graf
  2018-07-02 23:45     ` Andrew Cooper
@ 2018-07-03  8:23     ` Paolo Bonzini
  2018-07-03 14:29       ` [MODERATED] Re: ***UNCHECKED*** " Alexander Graf
  1 sibling, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2018-07-03  8:23 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 823 bytes --]

On 02/07/2018 23:19, speck for Alexander Graf wrote:
>> +		vcpu->arch.l1tf_flush_l1d = true;
> This means we flush when an external interrupt is the trap reason, but
> we may as well get the interrupt only after interrupts are enabled again
> from vcpu_enter_guest(), no? How is an interrupt executing "normally"
> any different from an interrupt that causes a #vmexit?
> 
> Or in other words: Is it worth it to have anything but always/never as
> an option? Has anyone done benchmarks?

Yes, and "always" is terrible. :((  See my reply to Boris a couple days
ago.  A good long term plan is to switch from blacklist to whitelist,
and run the whitelisted vmexits with interrupts disabled; another
possibility is to set a flag whenever softirqs run, and I'll send an RFC
on top of the basic pieces.
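
Roughly along these lines, completely untested and with made-up names,
just to show the shape of the softirq idea:

/* bumped from __do_softirq() before the pending handlers are run */
DEFINE_PER_CPU(unsigned int, l1tf_softirq_seq);

/* called from vcpu_enter_guest() before going back into the guest;
 * l1tf_softirq_seq_seen would be a new (made-up) field in kvm_vcpu_arch */
static void l1tf_note_softirqs(struct kvm_vcpu *vcpu)
{
	unsigned int seq = __this_cpu_read(l1tf_softirq_seq);

	if (seq != vcpu->arch.l1tf_softirq_seq_seen) {
		vcpu->arch.l1tf_softirq_seq_seen = seq;
		vcpu->arch.l1tf_flush_l1d = true;
	}
}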

Paolo


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [MODERATED] Re: [patch V5 10/10] KVM magic # 10
  2018-07-02 22:11       ` Thomas Gleixner
@ 2018-07-03  9:05         ` Paolo Bonzini
  0 siblings, 0 replies; 37+ messages in thread
From: Paolo Bonzini @ 2018-07-03  9:05 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1146 bytes --]

On 03/07/2018 00:11, speck for Thomas Gleixner wrote:
> +	 * This code is only executed when:
> +	 * - the flush mode is 'cond'
> +	 * - the flush mode is 'always' and the flush MSR is not
> +	 *   available
> +	 *
> +	 * If the CPU has the flush MSR then clear the flush bit because
> +	 * 'always' mode is handled via the MSR save list.
> +	 *
> +	 * If the MSR is not available then act depending on the mitigation
> +	 * mode: If 'flush always', keep the flush bit set, otherwise clear
> +	 * it.
> +	 *
> +	 * The flush bit gets set again either from vcpu_run() or from one
> +	 * of the unsafe VMEXIT handlers.
>  	 */
> -	always = vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
> +	if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> +		always = false;
> +	else
> +		always = vmentry_l1d_flush == VMENTER_L1D_FLUSH_ALWAYS;
> +
>  	vcpu->arch.l1tf_flush_l1d = always;

Awesome, I was going to reply with a similar wording.  It wasn't
immediately clear that the static_cpu_has is an optimization (though
it's obvious once you note that the static key is never enabled for
"flush always with MSR").

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [MODERATED] Re: ***UNCHECKED*** Re: [patch V5 05/10] KVM magic # 5
  2018-07-03  8:23     ` Paolo Bonzini
@ 2018-07-03 14:29       ` Alexander Graf
  2018-07-05 19:08         ` Jon Masters
  2018-07-05 21:43         ` Paolo Bonzini
  0 siblings, 2 replies; 37+ messages in thread
From: Alexander Graf @ 2018-07-03 14:29 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1148 bytes --]



On 03.07.18 10:23, speck for Paolo Bonzini wrote:
> On 02/07/2018 23:19, speck for Alexander Graf wrote:
>>> +		vcpu->arch.l1tf_flush_l1d = true;
>> This means we flush when an external interrupt is the trap reason, but
>> we may as well get the interrupt only after interrupts are enabled again
>> from vcpu_enter_guest(), no? How is an interrupt executing "normally"
>> any different from an interrupt that causes a #vmexit?
>>
>> Or in other words: Is it worth it to have anything but always/never as
>> an option? Has anyone done benchmarks?
> 
> Yes, and "always" is terrible. :((  See my reply to Boris a couple days
> ago.  A good long term plan is to switch from blacklist to whitelist,
> and run the whitelisted vmexits with interrupts disabled; another

Doesn't that basically kill real time latency because you're now doing a
full VMEXIT/VMENTER/VMEXIT dance just to get an interrupt? But then
again I guess flushing your L1 doesn't help to reduce jitter either ...

> possibility is to set a flag whenever softirqs run, and I'll send an RFC
> on top of the basic pieces.

What about non-softirqs?


Alex


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch V5 05/10] KVM magic # 5
  2018-07-02 17:24   ` [MODERATED] " Paolo Bonzini
@ 2018-07-03 15:21     ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2018-07-03 15:21 UTC (permalink / raw)
  To: speck

On Mon, 2 Jul 2018, speck for Paolo Bonzini wrote:
> On 02/07/2018 17:44, speck for Thomas Gleixner wrote:
> >  	{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
> > @@ -4799,6 +4800,8 @@ int kvm_read_guest_virt(struct kvm_vcpu
> >  {
> >  	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
> >  
> > +	/* The gva_to_pa walker can pull in tons of pages. */
> 
> It pulls in guest pages though, so does it really matter?  Also, why
> read and not write too?  (Fetch is only used by the emulator which sets
> the flag differently).

Removed it.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [MODERATED] Re: ***UNCHECKED*** Re: [patch V5 05/10] KVM magic # 5
  2018-07-03 14:29       ` [MODERATED] Re: ***UNCHECKED*** " Alexander Graf
@ 2018-07-05 19:08         ` Jon Masters
  2018-07-05 21:43         ` Paolo Bonzini
  1 sibling, 0 replies; 37+ messages in thread
From: Jon Masters @ 2018-07-05 19:08 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 2257 bytes --]

On 07/03/2018 10:29 AM, speck for Alexander Graf wrote:
> 
> 
> On 03.07.18 10:23, speck for Paolo Bonzini wrote:
>> On 02/07/2018 23:19, speck for Alexander Graf wrote:
>>>> +		vcpu->arch.l1tf_flush_l1d = true;
>>> This means we flush when an external interrupt is the trap reason, but
>>> we may as well get the interrupt only after interrupts are enabled again
>>> from vcpu_enter_guest(), no? How is an interrupt executing "normally"
>>> any different from an interrupt that causes a #vmexit?
>>>
>>> Or in other words: Is it worth it to have anything but always/never as
>>> an option? Has anyone done benchmarks?
>>
>> Yes, and "always" is terrible. :((  See my reply to Boris a couple days
>> ago.  A good long term plan is to switch from blacklist to whitelist,
>> and run the whitelisted vmexits with interrupts disabled; another
> 
> Doesn't that basically kill real time latency because you're now doing a
> full VMEXIT/VMENTER/VMEXIT dance just to get an interrupt? But then
> again I guess flushing your L1 doesn't help to reduce jitter either ...

Yes. But that's one reason we're not specifically turning this
mitigation on globally for everyone (e.g. Telco OpenStack). They'll need
to opt in. For Real-Time kernels, we're killing HT anyway, but of course
that doesn't avoid the flush. If you don't have VT-d, haven't pinned to
cores, and you take an interrupt on that core, then you're kinda SOL.

Paolo pointed out before that the performance hit of just blanket
flushing the L1D is really bad. I am inclined to agree with Linus that
it is kinda an all-or-nothing thing, but am swayed for the moment by the
argument that maybe we can constrain this a bit. What does need to
happen concurrently with a switch to whitelisting is an analysis of what
is being pulled into the L1D with the whitelist approach. I suspect
someone like Graz can help once this is public and they know - it'll be
an interesting research project to try to prove what's safe.

Meanwhile, we can only give guidance. I would like the vendors to more
or less agree on the always vs subset of flushes so that we don't give
conflicting messages to our customers.

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [MODERATED] Re: ***UNCHECKED*** Re: [patch V5 05/10] KVM magic # 5
  2018-07-03 14:29       ` [MODERATED] Re: ***UNCHECKED*** " Alexander Graf
  2018-07-05 19:08         ` Jon Masters
@ 2018-07-05 21:43         ` Paolo Bonzini
  2018-07-05 21:50           ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2018-07-05 21:43 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1722 bytes --]

On 03/07/2018 16:29, speck for Alexander Graf wrote:
> 
> 
> On 03.07.18 10:23, speck for Paolo Bonzini wrote:
>> On 02/07/2018 23:19, speck for Alexander Graf wrote:
>>>> +		vcpu->arch.l1tf_flush_l1d = true;
>>> This means we flush when an external interrupt is the trap reason, but
>>> we may as well get the interrupt only after interrupts are enabled again
>>> from vcpu_enter_guest(), no? How is an interrupt executing "normally"
>>> any different from an interrupt that causes a #vmexit?
>>>
>>> Or in other words: Is it worth it to have anything but always/never as
>>> an option? Has anyone done benchmarks?
>>
>> Yes, and "always" is terrible. :((  See my reply to Boris a couple days
>> ago.  A good long term plan is to switch from blacklist to whitelist,
>> and run the whitelisted vmexits with interrupts disabled; another
> 
> Doesn't that basically kill real time latency because you're now doing a
> full VMEXIT/VMENTER/VMEXIT dance just to get an interrupt? But then
> again I guess flushing your L1 doesn't help to reduce jitter either ...

vmexits are rare, and most of them take 1000-2000 clock cycles, i.e.
less than a microsecond, so the likelihood of getting an interrupt
during the vmexit is relatively small.
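
(Back of the envelope, assuming a ~2 GHz part: 2000 cycles is about
1 us, so even 20,000 such exits per second would only be ~2% of a core,
before any flush cost.)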

These short vmexits (mostly things like CPUID, very simple MSRs
including LAPIC timer/tsc_deadline, probably ioeventfd too) would also
be the ones that are whitelisted, so at worst you occasionally get the
vmexit/vmenter/vmexit dance to serve an interrupt; after all, it would
always be like that for an interrupt arriving while the guest is running.

Paolo

>> possibility is to set a flag whenever softirqs run, and I'll send an RFC
>> on top of the basic pieces.
> 
> What about non-softirqs?
> 
> 
> Alex
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [MODERATED] Re: ***UNCHECKED*** Re: [patch V5 05/10] KVM magic # 5
  2018-07-05 21:43         ` Paolo Bonzini
@ 2018-07-05 21:50           ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2018-07-05 21:50 UTC (permalink / raw)
  To: speck



On Thu, 5 Jul 2018, speck for Paolo Bonzini wrote:
> 
> vmexits are rare

If that were true, people wouldn't even be discussing this. The cache 
flush simply wouldn't matter.

But vmexits are quite common for IO loads. I think the worst numbers were 
for random block IO and some ping-pong networking loads.

And ping-pong networking is pretty much the usual case, even if some 
benchmarks are all about throughput and you can mitigate interrupts (and 
vmexits) for them.

             Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [MODERATED] Re: [patch v5 11/10] Linux+KVM magic # 1
  2018-07-02 22:14 ` [MODERATED] [patch v5 11/10] Linux+KVM magic # 1 Jiri Kosina
@ 2018-07-05 23:56   ` Josh Poimboeuf
  2018-07-06 11:54     ` Jiri Kosina
  0 siblings, 1 reply; 37+ messages in thread
From: Josh Poimboeuf @ 2018-07-05 23:56 UTC (permalink / raw)
  To: speck

On Tue, Jul 03, 2018 at 12:14:30AM +0200, speck for Jiri Kosina wrote:
> From: Jiri Kosina <jkosina@suse.cz>
> Subject: [PATCH v5] x86/bugs: introduce boot-time control of L1TF mitigations
> 
> Introduce 'l1tf=' boot option to allow for boot-time switching of
> mitigation that is used on CPUs affected by L1TF.

AFAICT, this patch breaks the use case Andi was talking about, where
some users will want to do the L1 flushing with SMT enabled, and instead
play cpuset tricks.

I think that could be solved by just passing false to cpu_smt_disable()
instead of true.  That way the user can re-enable SMT if they need to.
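
I.e. just:

-		cpu_smt_disable(true);
+		cpu_smt_disable(false);

so it ends up as CPU_SMT_DISABLED rather than CPU_SMT_FORCE_DISABLED and
the admin can turn SMT back on at runtime if they really want to.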

Then the vulnerabilities file output should probably display something
different if SMT is currently enabled.

Also I think "l1tf=off" is misleading, since it's basically the same as
l1tf=novirt, except with no warnings.  Maybe it should be
"l1tf=novirt,nowarn"

[ Well, there *is* one other difference between "novirt" and "off":
"off" overrides 'kvm-intel.nosmt'.  But a) I disagree with that
approach; I think the kvm option should rule; and b) I think
'kvm-intel.nosmt' should just be removed anyway.  There's nothing
kvm-specific about it.  If it's useful, let's integrate it into this
interface instead. ]

So what do you think about the following options:

  l1tf=on		PTEINV + flushing + nosmt
  l1tf=novirt		PTEINV + VM warning [default]
  l1tf=novirt,nowarn	PTEINV

And if somebody needed the 'force' option, we could also have:

  l1tf=force		PTEINV + flushing + nosmt=force
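
Parsing-wise it could be something like this (untested, and the new enum
names are made up):

static int __init l1tf_cmdline(char *str)
{
	if (!boot_cpu_has_bug(X86_BUG_L1TF) || !str)
		return 0;

	if (!strcmp(str, "on"))
		l1tf_mitigation = L1TF_MITIGATION_ON;
	else if (!strcmp(str, "force"))
		l1tf_mitigation = L1TF_MITIGATION_FORCE;
	else if (!strcmp(str, "novirt"))
		l1tf_mitigation = L1TF_MITIGATION_NOVIRT;
	else if (!strcmp(str, "novirt,nowarn"))
		l1tf_mitigation = L1TF_MITIGATION_NOVIRT_NOWARN;

	return 0;
}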

> The possible values are
> 
>         full    Provide all available mitigations for L1TF
>                 vulnerability (disable HT, perform PTE bit
>                 inversion, allow hypervisors to know that
>                 they should provide all mitigations)

A period would be good.

>         novirt  Provide all available mitigations needed
>                 for running on bare metal (PTE bit inversion),
>                 while not applying mitigations needed for
>                 VM isolation. Hypervisors will be issuing
>                 warning when first VM is being started in
>                 pontentially insecure configuraion

"Hypervisors will issue a warning when the first VM is started in a
potentially insecure configuration."

>         off     Claim "I don't care at all about this issue".
>                 The PTE bit inversion (bare metal mitigation) will
>                 still be performed, but hypervisors will not be
>                 issuing warning when VM is being started in
>                 potentially insecure configuration

"...but hypervisors will not issue a warning when a VM is started in a
potentially insecure configuration."

> 
> Let KVM adhere to this semantics, which means:

s/this/these/

> 
> 	- 'l1tf=full': start performing L1D flushes
> 	- 'l1tf=novirt': warn (or, depending on 'kvm.nosmt', refuse completely)
> 	  on VM start
> 	- 'l1tf=off': no extra action is taken
> 
> Also, introduce CONFIG_ option to allow vendors to chose the default

"introduce *a* CONFIG_ option" .... "choose"

And as we discussed before on IRC, adding a CONFIG option is actively
harmful if you don't have a concrete plan to use it.  Let's please wait
until there's an actual user before adding it.

> mitigation to be used on affected CPUs in case no l1tf= cmdline option is
> present.
> 
> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
> ---
> 
> As discussed previously, this goes on top of Konrad's v5 KVM patchset, and 
> implements the l1tf= bootparam handling as described in the changelog, 
> including the KVM changes.
> 
>  Documentation/admin-guide/kernel-parameters.txt | 18 ++++++++
>  arch/x86/Kconfig                                | 18 ++++++++
>  arch/x86/include/asm/processor.h                |  7 ++++
>  arch/x86/kernel/cpu/bugs.c                      | 56 +++++++++++++++++++++++--
>  arch/x86/kvm/vmx.c                              | 46 ++++++++++++++++----
>  include/linux/cpu.h                             |  2 +
>  kernel/cpu.c                                    |  9 +++-
>  7 files changed, 144 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 4f790566ad91..2f1de18f4ac0 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1989,6 +1989,24 @@
>  			feature (tagged TLBs) on capable Intel chips.
>  			Default is 1 (enabled)
>  
> +	l1tf=           [X86] Control mitigation of L1TF vulnerability on the
> +			      affected CPUs

"Control mitigation of the L1TF vulnerability on affected CPUS."

> +			full	Provide all available mitigations for L1TF
> +				vulnerability (disable HT, perform PTE bit
> +				inversion, allow hypervisors to know that
> +				they should provide all mitigations)
> +			novirt	Provide all available mitigations needed
> +				for running on bare metal (PTE bit inversion),
> +				while not applying mitigations needed for
> +				VM isolation. Hypervisors will be issuing
> +				warning when first VM is being started in
> +				pontentially insecure configuraion
> +			off	Claim "I don't care at all about this issue".
> +				The PTE bit inversion (bare metal mitigation) will
> +				still be performed, but hypervisors will not be
> +				issuing warning when VM is being started in
> +				potentially insecure configuration
> +

Same grammatical comments as above.

>  	l2cr=		[PPC]
>  
>  	l3cr=		[PPC]
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7a34fdf8daf0..7aa47721c664 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2390,6 +2390,24 @@ config MODIFY_LDT_SYSCALL
>  	  surface.  Disabling it removes the modify_ldt(2) system call.
>  
>  	  Saying 'N' here may make sense for embedded or server kernels.
> +choice
> +	prompt "Default L1TF mitigation"
> +	default L1TF_MITIGATION_NOVIRT
> +	help
> +		Define what the default behavior for selecting mitigation on

"mitigations"

> +		CPUs affected by L1TF should be. The default can be overrided

"overridden"

> +		on the kernel command-line. Refer to
> +		<file:Documentation/admin-guide/kernel-parameters.txt>
> +
> +config L1TF_MITIGATION_FULL
> +	bool "Full available L1TF mitigation"
> +config L1TF_MITIGATION_NOVIRT
> +	bool "Use L1TF bare metal mitigations only"
> +config L1TF_MITIGATION_OFF
> +	bool "Ignore L1TF issue"
> +
> +endchoice
> +
>  
>  source "kernel/livepatch/Kconfig"
>  
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 7e3ac5eedcd6..05471c590964 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -982,4 +982,11 @@ bool xen_set_default_idle(void);
>  void stop_this_cpu(void *dummy);
>  void df_debug(struct pt_regs *regs, long error_code);
>  void microcode_check(void);
> +
> +enum l1tf_mitigations {
> +	L1TF_MITIGATION_OFF,
> +	L1TF_MITIGATION_NOVIRT,
> +	L1TF_MITIGATION_FULL
> +};
> +enum l1tf_mitigations get_l1tf_mitigation(void);
>  #endif /* _ASM_X86_PROCESSOR_H */
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 50500cea6eba..9aa8b94334d5 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -657,6 +657,23 @@ void x86_spec_ctrl_setup_ap(void)
>  
>  #undef pr_fmt
>  #define pr_fmt(fmt)	"L1TF: " fmt
> +/* Default mitigation for L1TF-affected CPUs */
> +static int l1tf_mitigation =
> +#ifdef CONFIG_L1TF_MITIGATION_FULL
> +	L1TF_MITIGATION_FULL;
> +#endif
> +#ifdef CONFIG_L1TF_MITIGATION_NOVIRT
> +	L1TF_MITIGATION_NOVIRT;
> +#endif
> +#ifdef CONFIG_L1TF_MITIGATION_OFF
> +	L1TF_MITIGATION_OFF;
> +#endif
> +enum l1tf_mitigations get_l1tf_mitigation(void)
> +{
> +	return l1tf_mitigation;
> +}
> +EXPORT_SYMBOL(get_l1tf_mitigation);
> +
>  static void __init l1tf_select_mitigation(void)
>  {
>  	u64 half_pa;
> @@ -664,6 +681,15 @@ static void __init l1tf_select_mitigation(void)
>  	if (!boot_cpu_has_bug(X86_BUG_L1TF))
>  		return;
>  
> +	switch (get_l1tf_mitigation()) {
> +	case L1TF_MITIGATION_FULL:
> +		cpu_smt_disable(true);

As described above, I think this should be changed to false.

> +		break;
> +	case L1TF_MITIGATION_OFF:
> +	case L1TF_MITIGATION_NOVIRT:
> +		break;
> +	}
> +
>  #if CONFIG_PGTABLE_LEVELS == 2
>  	pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n");
>  	return;
> @@ -682,10 +708,36 @@ static void __init l1tf_select_mitigation(void)
>  
>  	setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV);
>  }
> +
> +static int __init l1tf_cmdline(char *str)
> +{
> +	if (!boot_cpu_has_bug(X86_BUG_L1TF))
> +		return 0;
> +
> +	if (!str)
> +		return 0;
> +
> +	if (!strcmp(str, "full"))
> +		l1tf_mitigation = L1TF_MITIGATION_FULL;
> +	else if (!strcmp(str, "novirt"))
> +		l1tf_mitigation = L1TF_MITIGATION_NOVIRT;
> +	else if (!strcmp(str, "off"))
> +		l1tf_mitigation = L1TF_MITIGATION_OFF;
> +
> +	return 0;
> +}
> +early_param("l1tf", l1tf_cmdline);
> +
>  #undef pr_fmt
>  
>  #ifdef CONFIG_SYSFS
>  
> +static const char *l1tf_states[] = {
> +	[L1TF_MITIGATION_FULL]		= "Mitigation: Full",
> +	[L1TF_MITIGATION_NOVIRT]	= "Mitigation: Page Table Inversion",
> +	[L1TF_MITIGATION_OFF]		= "Mitigation: Page Table Inversion"
> +};
> +
>  static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr,
>  			       char *buf, unsigned int bug)
>  {
> @@ -712,9 +764,7 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
>  		return sprintf(buf, "%s\n", ssb_strings[ssb_mode]);
>  
>  	case X86_BUG_L1TF:
> -		if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
> -			return sprintf(buf, "Mitigation: Page Table Inversion\n");
> -		break;
> +		return sprintf(buf, "%s\n", l1tf_states[get_l1tf_mitigation()]);
>  
>  	default:
>  		break;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 0e53034061c1..a568efccd56f 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -10528,20 +10528,51 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
>  	return ERR_PTR(err);
>  }
>  
> -#define L1TF_MSG "SMT enabled with L1TF CPU bug present. Refer to CVE-2018-3620 for details.\n"
> +#define L1TF_MSG_NOVIRT "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. Refer to CVE-2018-3620 for details.\n"
> +#define L1TF_MSG_NOSMT "L1TF CPU bug present and SMT enabled, refusing to start VM. Refer to CVE-2018-3620 for details.\n"
> +
> +/*
> + * Check whether we should be really strict about denying to start VM
> + * with respect to current SMT status
> + */
> +static bool strict_nosmt(void)
> +{
> +	return (cpu_smt_control == CPU_SMT_ENABLED && nosmt);
> +}
>  
>  static int vmx_vm_init(struct kvm *kvm)
>  {
>  	if (!ple_gap)
>  		kvm->arch.pause_in_guest = true;
>  
> -	if (boot_cpu_has(X86_BUG_L1TF) && cpu_smt_control == CPU_SMT_ENABLED) {
> -		if (nosmt) {
> -			pr_err(L1TF_MSG);
> -			return -EOPNOTSUPP;
> +	if (boot_cpu_has(X86_BUG_L1TF)) {
> +		switch (get_l1tf_mitigation()) {
> +			case L1TF_MITIGATION_OFF:
> +				/* The 'I explicitly don't care' flag is set */
> +				break;

It seems odd to me that this overrides 'kvm-intel.nosmt'.  I would
expect the kvm option to win.  But, as I said above, I'd propose we just
get rid of the kvm option.

> +			case L1TF_MITIGATION_NOVIRT:
> +				/*
> +				 * Warn if potentially insecure, deny to load
> +				 * if insecure and 'nosmt'

"refuse to load"

> +				 */
> +				if (strict_nosmt()) {
> +					pr_err(L1TF_MSG_NOSMT);
> +					return -EOPNOTSUPP;
> +				} else {
> +					pr_warn(L1TF_MSG_NOVIRT);
> +				}
> +				break;
> +			case L1TF_MITIGATION_FULL:
> +				/* Deny to load if insecure */

"refuse to load"

> +				if (strict_nosmt()) {
> +					pr_err(L1TF_MSG_NOSMT);
> +					return -EOPNOTSUPP;
> +				}

I don't think this can ever happen.  L1TF_MITIGATION_FULL means that SMT
is permanently disabled: see l1tf_select_mitigation() where it calls
cpu_smt_disable(true).  (But I did propose to change that to false...
then this code would make sense).

-- 
Josh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [MODERATED] Re: [patch v5 11/10] Linux+KVM magic # 1
  2018-07-05 23:56   ` [MODERATED] " Josh Poimboeuf
@ 2018-07-06 11:54     ` Jiri Kosina
  2018-07-06 13:05       ` Konrad Rzeszutek Wilk
  2018-07-06 15:11       ` Jon Masters
  0 siblings, 2 replies; 37+ messages in thread
From: Jiri Kosina @ 2018-07-06 11:54 UTC (permalink / raw)
  To: speck

On Thu, 5 Jul 2018, speck for Josh Poimboeuf wrote:

> AFAICT, this patch breaks the use case Andi was talking about, where 
> some users will want to do the L1 flushing with SMT enabled, and instead 
> play cpuset tricks.
> 
> I think that could be solved by just passing false to cpu_smt_disable()
> instead of true.  That way the user can re-enable SMT if they need to.

Good point, I'll change that.

> Also I think "l1tf=off" is misleading, since it's basically the same as
> l1tf=novirt, except with no warnings.  Maybe it should be
> "l1tf=novirt,nowarn"
> 
> [ Well, there *is* one other difference between "novirt" and "off":
> "off" overrides 'kvm-intel.nosmt'.  But a) I disagree with that
> approach; I think the kvm option should rule; and b) I think
> 'kvm-intel.nosmt' should just be removed anyway.  There's nothing
> kvm-specific about it.  If it's useful, let's integrate it into this
> interface instead. ]

Agreed, having this semantically controlled at two places independently is 
just confusing. So I'll just include removal of kvm-intel.nosmt in the 
next respin of the patch (Konrad, if you feel strongly about this, please 
speak up).

The only thing where we disagree is the config option, but I don't have 
mental energy to give this high enough priority and keep arguing over it 
:), so I'll just remove it in the next version (which I expect to be 
sending out later tonight).

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [MODERATED] Re: [patch v5 11/10] Linux+KVM magic # 1
  2018-07-06 11:54     ` Jiri Kosina
@ 2018-07-06 13:05       ` Konrad Rzeszutek Wilk
  2018-07-06 15:11       ` Jon Masters
  1 sibling, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-07-06 13:05 UTC (permalink / raw)
  To: speck

> Agreed, having this semantically controlled at two places independently is 
> just confusing. So I'll just include removal of kvm-intel.nosmt in the 
> next respin of the patch (Konrad, if you feel strongly about this, please 
> speak up).

No strong feelings, please go ahead!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [MODERATED] Re: [patch v5 11/10] Linux+KVM magic # 1
  2018-07-06 11:54     ` Jiri Kosina
  2018-07-06 13:05       ` Konrad Rzeszutek Wilk
@ 2018-07-06 15:11       ` Jon Masters
  1 sibling, 0 replies; 37+ messages in thread
From: Jon Masters @ 2018-07-06 15:11 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1916 bytes --]

On 07/06/2018 07:54 AM, speck for Jiri Kosina wrote:
> On Thu, 5 Jul 2018, speck for Josh Poimboeuf wrote:
> 
>> AFAICT, this patch breaks the use case Andi was talking about, where 
>> some users will want to do the L1 flushing with SMT enabled, and instead 
>> play cpuset tricks.
>>
>> I think that could be solved by just passing false to cpu_smt_disable()
>> instead of true.  That way the user can re-enable SMT if they need to.
> 
> Good point, I'll change that.
> 
>> Also I think "l1tf=off" is misleading, since it's basically the same as
>> l1tf=novirt, except with no warnings.  Maybe it should be
>> "l1tf=novirt,nowarn"
>>
>> [ Well, there *is* one other difference between "novirt" and "off":
>> "off" overrides 'kvm-intel.nosmt'.  But a) I disagree with that
>> approach; I think the kvm option should rule; and b) I think
>> 'kvm-intel.nosmt' should just be removed anyway.  There's nothing
>> kvm-specific about it.  If it's useful, let's integrate it into this
>> interface instead. ]
> 
> Agreed, having this semantically controlled at two places independently is 
> just confusing. So I'll just include removal of kvm-intel.nosmt in the 
> next respin of the patch (Konrad, if you feel strongly about this, please 
> speak up).

All sounds good :)

> The only thing where we disagree is the config option, but I don't have 
> mental energy to give this high enough priority and keep arguing over it 
> :), so I'll just remove it in the next version (which I expect to be 
> sending out later tonight).

Thanks. It makes sense to keep discussing, but I would also like to
advocate for having a common default that we can all get behind. It will
help with the hysteria and panic if everyone has a common approach out
of the gate (and this can always be changed later to add a config).

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2018-07-06 15:11 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-02 15:44 [patch V5 00/10] KVM magic # 0 Thomas Gleixner
2018-07-02 15:44 ` [patch V5 01/10] KVM magic # 1 Thomas Gleixner
2018-07-02 16:22   ` [MODERATED] " Konrad Rzeszutek Wilk
2018-07-02 17:10     ` Thomas Gleixner
2018-07-02 16:25   ` [MODERATED] " Borislav Petkov
2018-07-02 15:44 ` [patch V5 02/10] KVM magic # 2 Thomas Gleixner
2018-07-02 15:44 ` [patch V5 03/10] KVM magic # 3 Thomas Gleixner
2018-07-02 23:42   ` [MODERATED] " Jon Masters
2018-07-02 15:44 ` [patch V5 04/10] KVM magic # 4 Thomas Gleixner
2018-07-02 15:44 ` [patch V5 05/10] KVM magic # 5 Thomas Gleixner
2018-07-02 16:35   ` [MODERATED] " Konrad Rzeszutek Wilk
2018-07-02 17:01     ` Thomas Gleixner
2018-07-02 17:24   ` [MODERATED] " Paolo Bonzini
2018-07-03 15:21     ` Thomas Gleixner
2018-07-02 21:19   ` [MODERATED] Re: ***UNCHECKED*** " Alexander Graf
2018-07-02 23:45     ` Andrew Cooper
2018-07-03  1:33       ` Linus Torvalds
2018-07-03  8:23     ` Paolo Bonzini
2018-07-03 14:29       ` [MODERATED] Re: ***UNCHECKED*** " Alexander Graf
2018-07-05 19:08         ` Jon Masters
2018-07-05 21:43         ` Paolo Bonzini
2018-07-05 21:50           ` Linus Torvalds
2018-07-02 15:44 ` [patch V5 06/10] KVM magic # 6 Thomas Gleixner
2018-07-02 15:44 ` [patch V5 07/10] KVM magic # 7 Thomas Gleixner
2018-07-02 15:44 ` [patch V5 08/10] KVM magic # 8 Thomas Gleixner
2018-07-02 15:44 ` [patch V5 09/10] KVM magic # 9 Thomas Gleixner
2018-07-02 15:44 ` [patch V5 10/10] KVM magic # 10 Thomas Gleixner
2018-07-02 17:29   ` [MODERATED] " Paolo Bonzini
2018-07-02 17:41     ` Thomas Gleixner
2018-07-02 22:11       ` Thomas Gleixner
2018-07-03  9:05         ` [MODERATED] " Paolo Bonzini
2018-07-02 16:25 ` [patch V5 00/10] KVM magic # 0 Thomas Gleixner
2018-07-02 22:14 ` [MODERATED] [patch v5 11/10] Linux+KVM magic # 1 Jiri Kosina
2018-07-05 23:56   ` [MODERATED] " Josh Poimboeuf
2018-07-06 11:54     ` Jiri Kosina
2018-07-06 13:05       ` Konrad Rzeszutek Wilk
2018-07-06 15:11       ` Jon Masters
