All of lore.kernel.org
 help / color / mirror / Atom feed
* [MODERATED] [PATCH v4 2/8] [PATCH v4 2/8] Linux Patch #2
@ 2018-06-23 13:54 konrad.wilk
  2018-06-27 10:20 ` Thomas Gleixner
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: konrad.wilk @ 2018-06-23 13:54 UTC (permalink / raw)
  To: speck

We add two mitigation modes for CVE-2018-3620, aka L1 terminal
fault.  The two modes are "vmentry_l1d_flush=1" and "vmentry_l1d_flush=2".

"vmentry_l1d_flush=2" is simply doing an L1 cache flush on every VMENTER.
"vmentry_l1d_flush=1" is trying to avoid so many L1 cache flueshes on
VMENTER and instead only does if the reason for entering the hypervisor
is based on the type of code that is executed. The idea is based on Intel's
patches, but we treat all vmexits as safe unless they execute
specific code that is considered unsafe. There is no hardcoded list of
"safe" exit reasons; but vmexits are considered safe unless:
 - They trigger the emulator, which could be a good target for
   other speculative execution-based threats,
 - or the MMU, which can bring host page tables in the L1 cache.
 - In addition, executing userspace or another process will trigger a flush.
 - external interrupts
 - nested operations that require the MMU (see above). That is
   vmptrld, vmptrst, vmclear,vmwrite,vmread.
 - Also when handling invept,invvpid

The default is "vmentry_l1d_flush=1".  The cost of "vmentry_l1d_flush=2"
is up to 2.5x more expensive vmexits on Haswell processors, and 30% on
Coffee Lake (for the latter, this is independent of whether microcode
or the generic flush code are used).

The mitigation does not in any way try to do anything about hyperthreading;
it is possible for a sibling thread to read data from the cache during a
vmexit, before the host completes the flush, or to read data from the cache
while a sibling runs.  The suggestion there is to disable hyperthreading
unless you've configured your system to dedicate each core to a specific
guest.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Add checks for X86_BUG_L1TF
   - Rework the commit description.
v3: change module parameter to S_IRUGO
    move the kvm_l1d_flush closer to VMENTER
    move the module parameter so it is in alphabetical order-ish.
    add two extra places that are used by handle_[vmptrld, vmptrst,mclear,vmwrite,vmread,invept,invvpid].
    Can be changed to be only specific handlers.
---
 Documentation/admin-guide/kernel-parameters.txt | 11 +++++
 arch/x86/include/asm/kvm_host.h                 |  8 ++++
 arch/x86/kvm/mmu.c                              |  1 +
 arch/x86/kvm/svm.c                              |  1 +
 arch/x86/kvm/vmx.c                              | 51 +++++++++++++++++++-
 arch/x86/kvm/x86.c                              | 63 ++++++++++++++++++++++++-
 6 files changed, 133 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index d59b34d4e62a..b8f7a4ab693a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1919,6 +1919,17 @@
 	kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
 				   Default is false (don't support).
 
+	kvm.vmentry_l1d_flush=[KVM] Mitigation for L1 Terminal Fault CVE.
+			Valid arguments: 0, 1, 2
+
+			2 does an L1 cache flush on every VMENTER.
+			1 tries to avoid so many L1 cache flush on VMENTERs and instead
+			do it only if the kind of code that is executed would lead to
+			leaking host memory.
+			0 disables the mitigation
+
+			Default is 1 (do L1 cache flush in specific instances)
+
 	kvm.mmu_audit=	[KVM] This is a R/W parameter which allows audit
 			KVM MMU at runtime.
 			Default is 0 (off)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c13cd28d9d1b..78748925a370 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -713,6 +713,12 @@ struct kvm_vcpu_arch {
 
 	/* be preempted when it's in kernel-mode(cpl=0) */
 	bool preempted_in_kernel;
+
+	/* for L1 terminal fault vulnerability */
+	bool vcpu_unconfined;
+
+	/* must flush the L1 Data cache */
+	bool flush_cache_req;
 };
 
 struct kvm_lpage_info {
@@ -881,6 +887,7 @@ struct kvm_vcpu_stat {
 	u64 signal_exits;
 	u64 irq_window_exits;
 	u64 nmi_window_exits;
+	u64 l1d_flush;
 	u64 halt_exits;
 	u64 halt_successful_poll;
 	u64 halt_attempted_poll;
@@ -1449,6 +1456,7 @@ bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
 
 void kvm_set_msi_irq(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
 		     struct kvm_lapic_irq *irq);
+void kvm_l1d_flush(void);
 
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d594690d8b95..4d4e3dc2494e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3840,6 +3840,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 {
 	int r = 1;
 
+	vcpu->arch.vcpu_unconfined = true;
 	switch (vcpu->arch.apf.host_apf_reason) {
 	default:
 		trace_kvm_page_fault(fault_address, error_code);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f059a73f0fd0..fffc447f5410 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -5437,6 +5437,7 @@ static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
 
 static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu)
 {
+	vcpu->arch.flush_cache_req = false;
 }
 
 static inline void sync_cr8_to_lapic(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f08e33fc28ac..a51418429165 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -117,6 +117,9 @@ static u64 __read_mostly host_xss;
 static bool __read_mostly enable_pml = 1;
 module_param_named(pml, enable_pml, bool, S_IRUGO);
 
+static int __read_mostly vmentry_l1d_flush = 1;
+module_param(vmentry_l1d_flush, int, S_IRUGO);
+
 #define MSR_TYPE_R	1
 #define MSR_TYPE_W	2
 #define MSR_TYPE_RW	3
@@ -2621,6 +2624,45 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu)
 				   vmx->guest_msrs[i].mask);
 }
 
+static void vmx_prepare_guest_switch(struct kvm_vcpu *vcpu)
+{
+	vmx_save_host_state(vcpu);
+
+	if (!enable_ept || static_cpu_has(X86_FEATURE_HYPERVISOR) ||
+	    !static_cpu_has(X86_BUG_L1TF)) {
+		vcpu->arch.flush_cache_req = false;
+		return;
+	}
+
+	switch (vmentry_l1d_flush) {
+	case 0:
+		vcpu->arch.flush_cache_req = false;
+		break;
+	case 1:
+		/*
+		 * If vmentry_l1d_flush is 1, each vmexit handler is responsible for
+		 * setting vcpu->arch.vcpu_unconfined.  Currently this happens in the
+		 * following cases:
+		 * - vmlaunch/vmresume: we do not want the cache to be cleared by a
+		 *   nested hypervisor *and* by KVM on bare metal, so we just do it
+		 *   on every nested entry.  Nested hypervisors do not bother clearing
+		 *   the cache.
+		 * - anything that runs the emulator (the slow paths for EPT misconfig
+		 *   or I/O instruction)
+		 * - anything that can cause get_user_pages (EPT violation, and again
+		 *   the slow paths for EPT misconfig or I/O instruction)
+		 * - anything that can run code outside KVM (external interrupt,
+		 *   which can run interrupt handlers or irqs; or the sched_in
+		 *   preempt notifier)
+		 */
+		break;
+	case 2:
+	default:
+		vcpu->arch.flush_cache_req = true;
+		break;
+	}
+}
+
 static void __vmx_load_host_state(struct vcpu_vmx *vmx)
 {
 	if (!vmx->host_state.loaded)
@@ -9754,6 +9796,7 @@ static void vmx_handle_external_intr(struct kvm_vcpu *vcpu)
 			[ss]"i"(__KERNEL_DS),
 			[cs]"i"(__KERNEL_CS)
 			);
+		vcpu->arch.vcpu_unconfined = true;
 	}
 }
 STACK_FRAME_NON_STANDARD(vmx_handle_external_intr);
@@ -10011,6 +10054,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	evmcs_rsp = static_branch_unlikely(&enable_evmcs) ?
 		(unsigned long)&current_evmcs->host_rsp : 0;
 
+	if (vcpu->arch.flush_cache_req)
+		kvm_l1d_flush();
+
 	asm(
 		/* Store host registers */
 		"push %%" _ASM_DX "; push %%" _ASM_BP ";"
@@ -11824,6 +11870,9 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 		return ret;
 	}
 
+	/* Hide L1D cache contents from the nested guest.  */
+	vmx->vcpu.arch.vcpu_unconfined = true;
+
 	/*
 	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
 	 * by event injection, halt vcpu.
@@ -12941,7 +12990,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.vcpu_free = vmx_free_vcpu,
 	.vcpu_reset = vmx_vcpu_reset,
 
-	.prepare_guest_switch = vmx_save_host_state,
+	.prepare_guest_switch = vmx_prepare_guest_switch,
 	.vcpu_load = vmx_vcpu_load,
 	.vcpu_put = vmx_vcpu_put,
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0046aa70205a..4d2e4975f91d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -195,6 +195,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 	{ "irq_injections", VCPU_STAT(irq_injections) },
 	{ "nmi_injections", VCPU_STAT(nmi_injections) },
 	{ "req_event", VCPU_STAT(req_event) },
+	{ "l1d_flush", VCPU_STAT(l1d_flush) },
 	{ "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
 	{ "mmu_pte_write", VM_STAT(mmu_pte_write) },
 	{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
@@ -4799,6 +4800,8 @@ int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
 {
 	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
 
+	/* The gva_to_pa walker can pull in tons of pages. */
+	vcpu->arch.vcpu_unconfined = true;
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
 					  exception);
 }
@@ -4874,6 +4877,9 @@ static int emulator_write_std(struct x86_emulate_ctxt *ctxt, gva_t addr, void *v
 int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu, gva_t addr, void *val,
 				unsigned int bytes, struct x86_exception *exception)
 {
+	/* kvm_write_guest_virt_system can pull in tons of pages. */
+	vcpu->arch.vcpu_unconfined = true;
+
 	return kvm_write_guest_virt_helper(addr, val, bytes, vcpu,
 					   PFERR_WRITE_MASK, exception);
 }
@@ -6050,6 +6056,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 	bool writeback = true;
 	bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
 
+	vcpu->arch.vcpu_unconfined = true;
+
 	/*
 	 * Clear write_fault_to_shadow_pgtable here to ensure it is
 	 * never reused.
@@ -6533,10 +6541,49 @@ static struct notifier_block pvclock_gtod_notifier = {
 };
 #endif
 
+
+/*
+ * The L1D cache is 32 KiB on Skylake, but to flush it we have to read in
+ * 64 KiB because the replacement algorithm is not exactly LRU.
+ */
+#define L1D_CACHE_ORDER 4
+static void *__read_mostly empty_zero_pages;
+
+void kvm_l1d_flush(void)
+{
+	/* FIXME: could this be boot_cpu_data.x86_cache_size * 2?  */
+	int size = PAGE_SIZE << L1D_CACHE_ORDER;
+
+	ASSERT(boot_cpu_has(X86_BUG_L1TF));
+
+	asm volatile(
+		/* First ensure the pages are in the TLB */
+		"xorl %%eax, %%eax\n\t"
+		"11: \n\t"
+		"movzbl (%0, %%" _ASM_AX "), %%ecx\n\t"
+		"addl $4096, %%eax\n\t"
+		"cmpl %%eax, %1\n\t"
+		"jne 11b\n\t"
+		"xorl %%eax, %%eax\n\t"
+		"cpuid\n\t"
+		/* Now fill the cache */
+		"xorl %%eax, %%eax\n\t"
+		"12:\n\t"
+		"movzbl (%0, %%" _ASM_AX "), %%ecx\n\t"
+		"addl $64, %%eax\n\t"
+		"cmpl %%eax, %1\n\t"
+		"jne 12b\n\t"
+		"lfence\n\t"
+		: : "r" (empty_zero_pages), "r" (size)
+		: "eax", "ebx", "ecx", "edx");
+}
+EXPORT_SYMBOL_GPL(kvm_l1d_flush);
+
 int kvm_arch_init(void *opaque)
 {
 	int r;
 	struct kvm_x86_ops *ops = opaque;
+	struct page *page;
 
 	if (kvm_x86_ops) {
 		printk(KERN_ERR "kvm: already loaded the other module\n");
@@ -6556,10 +6603,15 @@ int kvm_arch_init(void *opaque)
 	}
 
 	r = -ENOMEM;
+	page = alloc_pages(GFP_ATOMIC, L1D_CACHE_ORDER);
+	if (!page)
+		goto out;
+	empty_zero_pages = page_address(page);
+
 	shared_msrs = alloc_percpu(struct kvm_shared_msrs);
 	if (!shared_msrs) {
 		printk(KERN_ERR "kvm: failed to allocate percpu kvm_shared_msrs\n");
-		goto out;
+		goto out_free_zero_pages;
 	}
 
 	r = kvm_mmu_module_init();
@@ -6590,6 +6642,8 @@ int kvm_arch_init(void *opaque)
 
 	return 0;
 
+out_free_zero_pages:
+	free_pages((unsigned long)empty_zero_pages, L1D_CACHE_ORDER);
 out_free_percpu:
 	free_percpu(shared_msrs);
 out:
@@ -6614,6 +6668,7 @@ void kvm_arch_exit(void)
 #endif
 	kvm_x86_ops = NULL;
 	kvm_mmu_module_exit();
+	free_pages((unsigned long)empty_zero_pages, L1D_CACHE_ORDER);
 	free_percpu(shared_msrs);
 }
 
@@ -7392,7 +7447,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	preempt_disable();
 
+	vcpu->arch.flush_cache_req = vcpu->arch.vcpu_unconfined;
 	kvm_x86_ops->prepare_guest_switch(vcpu);
+	vcpu->arch.vcpu_unconfined = false;
+	if (vcpu->arch.flush_cache_req)
+		vcpu->stat.l1d_flush++;
 
 	/*
 	 * Disable IRQs before setting IN_GUEST_MODE.  Posted interrupt
@@ -7579,6 +7638,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 	struct kvm *kvm = vcpu->kvm;
 
 	vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
+	vcpu->arch.vcpu_unconfined = true;
 
 	for (;;) {
 		if (kvm_vcpu_running(vcpu)) {
@@ -8698,6 +8758,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	vcpu->arch.vcpu_unconfined = true;
 	kvm_x86_ops->sched_in(vcpu, cpu);
 }
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 2/8] [PATCH v4 2/8] Linux Patch #2
  2018-06-23 13:54 [MODERATED] [PATCH v4 2/8] [PATCH v4 2/8] Linux Patch #2 konrad.wilk
@ 2018-06-27 10:20 ` Thomas Gleixner
  2018-06-27 15:54 ` [MODERATED] " Borislav Petkov
  2018-07-01 16:14 ` Josh Poimboeuf
  2 siblings, 0 replies; 9+ messages in thread
From: Thomas Gleixner @ 2018-06-27 10:20 UTC (permalink / raw)
  To: speck

On Sat, 23 Jun 2018, speck for konrad.wilk_at_oracle.com wrote:
>  
>  	r = -ENOMEM;
> +	page = alloc_pages(GFP_ATOMIC, L1D_CACHE_ORDER);
> +	if (!page)
> +		goto out;
> +	empty_zero_pages = page_address(page);
> +
>  	shared_msrs = alloc_percpu(struct kvm_shared_msrs);
>  	if (!shared_msrs) {
>  		printk(KERN_ERR "kvm: failed to allocate percpu kvm_shared_msrs\n");
> -		goto out;
> +		goto out_free_zero_pages;
>  	}
>  
>  	r = kvm_mmu_module_init();
> @@ -6590,6 +6642,8 @@ int kvm_arch_init(void *opaque)
>  
>  	return 0;
>  
> +out_free_zero_pages:
> +	free_pages((unsigned long)empty_zero_pages, L1D_CACHE_ORDER);
>  out_free_percpu:
>  	free_percpu(shared_msrs);

That's the wrong ordering. out_free_zero_pages: goes here because otherwise
an error exit after kvm_mmu_module_init() will leak the pages. I'll fix it
up.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [MODERATED] Re: [PATCH v4 2/8] [PATCH v4 2/8] Linux Patch #2
  2018-06-23 13:54 [MODERATED] [PATCH v4 2/8] [PATCH v4 2/8] Linux Patch #2 konrad.wilk
  2018-06-27 10:20 ` Thomas Gleixner
@ 2018-06-27 15:54 ` Borislav Petkov
  2018-06-27 16:21   ` Josh Poimboeuf
                     ` (3 more replies)
  2018-07-01 16:14 ` Josh Poimboeuf
  2 siblings, 4 replies; 9+ messages in thread
From: Borislav Petkov @ 2018-06-27 15:54 UTC (permalink / raw)
  To: speck

On Sat, Jun 23, 2018 at 09:54:16AM -0400, speck for konrad.wilk_at_oracle.com wrote:
> kvm: x86: mitigation for L1 cache terminal fault vulnerabilities
> 
> We add two mitigation modes for CVE-2018-3620, aka L1 terminal
> fault.  The two modes are "vmentry_l1d_flush=1" and "vmentry_l1d_flush=2".
> 
> "vmentry_l1d_flush=2" is simply doing an L1 cache flush on every VMENTER.
> "vmentry_l1d_flush=1" is trying to avoid so many L1 cache flueshes on

It goes without saying that "1" and "2" are not as good a choice as
words with meaning.

> VMENTER and instead only does if the reason for entering the hypervisor
> is based on the type of code that is executed. The idea is based on Intel's
> patches, but we treat all vmexits as safe unless they execute
> specific code that is considered unsafe. There is no hardcoded list of
> "safe" exit reasons; but vmexits are considered safe unless:
>  - They trigger the emulator, which could be a good target for
>    other speculative execution-based threats,
>  - or the MMU, which can bring host page tables in the L1 cache.
>  - In addition, executing userspace or another process will trigger a flush.
>  - external interrupts
>  - nested operations that require the MMU (see above). That is
>    vmptrld, vmptrst, vmclear,vmwrite,vmread.
>  - Also when handling invept,invvpid
> 
> The default is "vmentry_l1d_flush=1".  The cost of "vmentry_l1d_flush=2"
> is up to 2.5x more expensive vmexits on Haswell processors, and 30% on
> Coffee Lake (for the latter, this is independent of whether microcode
> or the generic flush code are used).
> 
> The mitigation does not in any way try to do anything about hyperthreading;
> it is possible for a sibling thread to read data from the cache during a
> vmexit, before the host completes the flush, or to read data from the cache
> while a sibling runs.  The suggestion there is to disable hyperthreading
> unless you've configured your system to dedicate each core to a specific
> guest.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Looks like Paolo is the author here - this patch needs a proper From: or
authorship fixed when applying.

> ---
> v2: Add checks for X86_BUG_L1TF
>    - Rework the commit description.
> v3: change module parameter to S_IRUGO
>     move the kvm_l1d_flush closer to VMENTER
>     move the module parameter so it is in alphabetical order-ish.
>     add two extra places that are used by handle_[vmptrld, vmptrst,mclear,vmwrite,vmread,invept,invvpid].
>     Can be changed to be only specific handlers.
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 11 +++++
>  arch/x86/include/asm/kvm_host.h                 |  8 ++++
>  arch/x86/kvm/mmu.c                              |  1 +
>  arch/x86/kvm/svm.c                              |  1 +
>  arch/x86/kvm/vmx.c                              | 51 +++++++++++++++++++-
>  arch/x86/kvm/x86.c                              | 63 ++++++++++++++++++++++++-
>  6 files changed, 133 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index d59b34d4e62a..b8f7a4ab693a 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1919,6 +1919,17 @@
>  	kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
>  				   Default is false (don't support).
>  
> +	kvm.vmentry_l1d_flush=[KVM] Mitigation for L1 Terminal Fault CVE.

Let's put the CVE number here - stuff will leave our L1s with time and
having a number there will be helpful to remember in the future.

> +			Valid arguments: 0, 1, 2
> +
> +			2 does an L1 cache flush on every VMENTER.
> +			1 tries to avoid so many L1 cache flush on VMENTERs and instead
> +			do it only if the kind of code that is executed would lead to
> +			leaking host memory.
> +			0 disables the mitigation
> +
> +			Default is 1 (do L1 cache flush in specific instances)
> +
>  	kvm.mmu_audit=	[KVM] This is a R/W parameter which allows audit
>  			KVM MMU at runtime.
>  			Default is 0 (off)

...

> +static void vmx_prepare_guest_switch(struct kvm_vcpu *vcpu)
> +{
> +	vmx_save_host_state(vcpu);
> +
> +	if (!enable_ept || static_cpu_has(X86_FEATURE_HYPERVISOR) ||
> +	    !static_cpu_has(X86_BUG_L1TF)) {
> +		vcpu->arch.flush_cache_req = false;
> +		return;
> +	}
> +
> +	switch (vmentry_l1d_flush) {
> +	case 0:
> +		vcpu->arch.flush_cache_req = false;
> +		break;
> +	case 1:
> +		/*
> +		 * If vmentry_l1d_flush is 1, each vmexit handler is responsible for
> +		 * setting vcpu->arch.vcpu_unconfined.  Currently this happens in the
> +		 * following cases:
> +		 * - vmlaunch/vmresume: we do not want the cache to be cleared by a
> +		 *   nested hypervisor *and* by KVM on bare metal, so we just do it
> +		 *   on every nested entry.  Nested hypervisors do not bother clearing
> +		 *   the cache.
> +		 * - anything that runs the emulator (the slow paths for EPT misconfig
> +		 *   or I/O instruction)
> +		 * - anything that can cause get_user_pages (EPT violation, and again
> +		 *   the slow paths for EPT misconfig or I/O instruction)
> +		 * - anything that can run code outside KVM (external interrupt,
> +		 *   which can run interrupt handlers or irqs; or the sched_in
> +		 *   preempt notifier)
> +		 */

I sure hope we won't miss a case in the future... :-\

> +		break;
> +	case 2:
> +	default:
> +		vcpu->arch.flush_cache_req = true;
> +		break;
> +	}
> +}

...

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0046aa70205a..4d2e4975f91d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c

...

> @@ -6533,10 +6541,49 @@ static struct notifier_block pvclock_gtod_notifier = {
>  };
>  #endif
>  
> +
> +/*
> + * The L1D cache is 32 KiB on Skylake, but to flush it we have to read in

Skylake? I think most if not all have 32K I$...

> + * 64 KiB because the replacement algorithm is not exactly LRU.
> + */
> +#define L1D_CACHE_ORDER 4
> +static void *__read_mostly empty_zero_pages;
> +
> +void kvm_l1d_flush(void)

Shouldn't this function be in vmx.c ?

> +{
> +	/* FIXME: could this be boot_cpu_data.x86_cache_size * 2?  */

Doesn't look like it, for example here I have:

model name      : Intel(R) Xeon(R) CPU E7-8891 v4 @ 2.80GHz
...
cache size      : 61440 KB

From looking at the end of init_intel_cacheinfo(), we put there the LLC
size ... or so...

> +	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> +
> +	ASSERT(boot_cpu_has(X86_BUG_L1TF));

This needs to be boot_cpu_has_bug().

And then what's wrong with:

	if (!boot_cpu_has_bug(X86_BUG_L1TF))
		return;

> +
> +	asm volatile(
> +		/* First ensure the pages are in the TLB */
> +		"xorl %%eax, %%eax\n\t"
> +		"11: \n\t"
> +		"movzbl (%0, %%" _ASM_AX "), %%ecx\n\t"

Named asm params would make it more readable:

		"movzbl (%[empty_zp], %% ..."

> +		"addl $4096, %%eax\n\t"
> +		"cmpl %%eax, %1\n\t"
> +		"jne 11b\n\t"
> +		"xorl %%eax, %%eax\n\t"
> +		"cpuid\n\t"
> +		/* Now fill the cache */
> +		"xorl %%eax, %%eax\n\t"
> +		"12:\n\t"
> +		"movzbl (%0, %%" _ASM_AX "), %%ecx\n\t"
> +		"addl $64, %%eax\n\t"
> +		"cmpl %%eax, %1\n\t"
> +		"jne 12b\n\t"
> +		"lfence\n\t"
> +		: : "r" (empty_zero_pages), "r" (size)
> +		: "eax", "ebx", "ecx", "edx");
> +}
> +EXPORT_SYMBOL_GPL(kvm_l1d_flush);

and then you won't need the symbol export either.

>  int kvm_arch_init(void *opaque)
>  {
>  	int r;
>  	struct kvm_x86_ops *ops = opaque;
> +	struct page *page;
>  
>  	if (kvm_x86_ops) {
>  		printk(KERN_ERR "kvm: already loaded the other module\n");
> @@ -6556,10 +6603,15 @@ int kvm_arch_init(void *opaque)
>  	}
>  
>  	r = -ENOMEM;
> +	page = alloc_pages(GFP_ATOMIC, L1D_CACHE_ORDER);

GFP_ATOMIC huh? That important.

And why isn't this allocation happening in vmx.c?

> +	if (!page)
> +		goto out;
> +	empty_zero_pages = page_address(page);
> +
>  	shared_msrs = alloc_percpu(struct kvm_shared_msrs);
>  	if (!shared_msrs) {
>  		printk(KERN_ERR "kvm: failed to allocate percpu kvm_shared_msrs\n");
> -		goto out;
> +		goto out_free_zero_pages;
>  	}
>  
>  	r = kvm_mmu_module_init();
> @@ -6590,6 +6642,8 @@ int kvm_arch_init(void *opaque)
>  
>  	return 0;
>  
> +out_free_zero_pages:
> +	free_pages((unsigned long)empty_zero_pages, L1D_CACHE_ORDER);
>  out_free_percpu:
>  	free_percpu(shared_msrs);
>  out:
> @@ -6614,6 +6668,7 @@ void kvm_arch_exit(void)
>  #endif
>  	kvm_x86_ops = NULL;
>  	kvm_mmu_module_exit();
> +	free_pages((unsigned long)empty_zero_pages, L1D_CACHE_ORDER);
>  	free_percpu(shared_msrs);
>  }
>  
> @@ -7392,7 +7447,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  
>  	preempt_disable();
>  
> +	vcpu->arch.flush_cache_req = vcpu->arch.vcpu_unconfined;

Huh, why do we need them both then? ->flush_cache_req should be enough.

>  	kvm_x86_ops->prepare_guest_switch(vcpu);
> +	vcpu->arch.vcpu_unconfined = false;
> +	if (vcpu->arch.flush_cache_req)
> +		vcpu->stat.l1d_flush++;
>  
>  	/*
>  	 * Disable IRQs before setting IN_GUEST_MODE.  Posted interrupt

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [MODERATED] Re: [PATCH v4 2/8] [PATCH v4 2/8] Linux Patch #2
  2018-06-27 15:54 ` [MODERATED] " Borislav Petkov
@ 2018-06-27 16:21   ` Josh Poimboeuf
  2018-06-28  7:25   ` Thomas Gleixner
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Josh Poimboeuf @ 2018-06-27 16:21 UTC (permalink / raw)
  To: speck

On Wed, Jun 27, 2018 at 05:54:48PM +0200, speck for Borislav Petkov wrote:
> On Sat, Jun 23, 2018 at 09:54:16AM -0400, speck for konrad.wilk_at_oracle.com wrote:
> > kvm: x86: mitigation for L1 cache terminal fault vulnerabilities
> > 
> > We add two mitigation modes for CVE-2018-3620, aka L1 terminal
> > fault.  The two modes are "vmentry_l1d_flush=1" and "vmentry_l1d_flush=2".
> > 
> > "vmentry_l1d_flush=2" is simply doing an L1 cache flush on every VMENTER.
> > "vmentry_l1d_flush=1" is trying to avoid so many L1 cache flueshes on
> 
> It goes without saying that "1" and "2" are not as good a choice as
> words with meaning.

+1, I also suggested this before.

-- 
Josh

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 2/8] [PATCH v4 2/8] Linux Patch #2
  2018-06-27 15:54 ` [MODERATED] " Borislav Petkov
  2018-06-27 16:21   ` Josh Poimboeuf
@ 2018-06-28  7:25   ` Thomas Gleixner
  2018-06-28 10:44     ` Thomas Gleixner
  2018-06-28 15:13   ` Thomas Gleixner
  2018-06-29 12:31   ` [MODERATED] " Paolo Bonzini
  3 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2018-06-28  7:25 UTC (permalink / raw)
  To: speck

On Wed, 27 Jun 2018, speck for Borislav Petkov wrote:
> On Sat, Jun 23, 2018 at 09:54:16AM -0400, speck for konrad.wilk_at_oracle.com wrote:
> > 
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> 
> Looks like Paolo is the author here - this patch needs a proper From: or
> authorship fixed when applying.

The ideal solution is to have:

From: Joe Hacker <joe@hacker.com>
Subject: foo: Do frotz

in the changelog itself. That works with git and my scripts which grab the
patches from the list fixup the author and subject from that as well.

Having a

References: message-id-of-cover-letter@joehackers-box

makes the series threaded.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 2/8] [PATCH v4 2/8] Linux Patch #2
  2018-06-28  7:25   ` Thomas Gleixner
@ 2018-06-28 10:44     ` Thomas Gleixner
  0 siblings, 0 replies; 9+ messages in thread
From: Thomas Gleixner @ 2018-06-28 10:44 UTC (permalink / raw)
  To: speck

On Thu, 28 Jun 2018, speck for Thomas Gleixner wrote:
> On Wed, 27 Jun 2018, speck for Borislav Petkov wrote:
> > On Sat, Jun 23, 2018 at 09:54:16AM -0400, speck for konrad.wilk_at_oracle.com wrote:
> > > 
> > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > 
> > Looks like Paolo is the author here - this patch needs a proper From: or
> > authorship fixed when applying.
> 
> The ideal solution is to have:
> 
> From: Joe Hacker <joe@hacker.com>
> Subject: foo: Do frotz
> 
> in the changelog itself. That works with git and my scripts which grab the
> patches from the list fixup the author and subject from that as well.
> 
> Having a
> 
> References: message-id-of-cover-letter@joehackers-box
> 
> makes the series threaded.

That's not the changelog of course. That's a mail header.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 2/8] [PATCH v4 2/8] Linux Patch #2
  2018-06-27 15:54 ` [MODERATED] " Borislav Petkov
  2018-06-27 16:21   ` Josh Poimboeuf
  2018-06-28  7:25   ` Thomas Gleixner
@ 2018-06-28 15:13   ` Thomas Gleixner
  2018-06-29 12:31   ` [MODERATED] " Paolo Bonzini
  3 siblings, 0 replies; 9+ messages in thread
From: Thomas Gleixner @ 2018-06-28 15:13 UTC (permalink / raw)
  To: speck

On Wed, 27 Jun 2018, speck for Borislav Petkov wrote:
> On Sat, Jun 23, 2018 at 09:54:16AM -0400, speck for konrad.wilk_at_oracle.com wrote:
> > +static void vmx_prepare_guest_switch(struct kvm_vcpu *vcpu)
> > +{
> > +	vmx_save_host_state(vcpu);
> > +
> > +	if (!enable_ept || static_cpu_has(X86_FEATURE_HYPERVISOR) ||
> > +	    !static_cpu_has(X86_BUG_L1TF)) {
> > +		vcpu->arch.flush_cache_req = false;
> > +		return;
> > +	}
> > +
> > +	switch (vmentry_l1d_flush) {
> > +	case 0:
> > +		vcpu->arch.flush_cache_req = false;
> > +		break;
> > +	case 1:
> > +		/*
> > +		 * If vmentry_l1d_flush is 1, each vmexit handler is responsible for
> > +		 * setting vcpu->arch.vcpu_unconfined.  Currently this happens in the
> > +		 * following cases:
> > +		 * - vmlaunch/vmresume: we do not want the cache to be cleared by a
> > +		 *   nested hypervisor *and* by KVM on bare metal, so we just do it
> > +		 *   on every nested entry.  Nested hypervisors do not bother clearing
> > +		 *   the cache.
> > +		 * - anything that runs the emulator (the slow paths for EPT misconfig
> > +		 *   or I/O instruction)
> > +		 * - anything that can cause get_user_pages (EPT violation, and again
> > +		 *   the slow paths for EPT misconfig or I/O instruction)
> > +		 * - anything that can run code outside KVM (external interrupt,
> > +		 *   which can run interrupt handlers or irqs; or the sched_in
> > +		 *   preempt notifier)
> > +		 */
> 
> I sure hope we won't miss a case in the future... :-\

The switch case can go away when using a static key.

> > +/*
> > + * The L1D cache is 32 KiB on Skylake, but to flush it we have to read in
> 
> Skylake? I think most if not all have 32K I$...
> 
> > + * 64 KiB because the replacement algorithm is not exactly LRU.
> > + */

I'd suggest the folliwng wording:

/*
 * The L1D cache is 32 KiB on Nehalem and later microarchitectures, but to
 * flush it it's required to read in 64 KiB because the replacement
 * algorithm is not exactly LRU. This could be sized at runtime via
 * cache topology information, but as all relevant affected CPUs have
 * 32KiB L1D cache size there is no point in doing so.
 */

> > +#define L1D_CACHE_ORDER 4
> > +static void *__read_mostly empty_zero_pages;
> > +
> > +void kvm_l1d_flush(void)
> 
> Shouldn't this function be in vmx.c ?

Indedd. Missed that.

> > +	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> > +
> > +	ASSERT(boot_cpu_has(X86_BUG_L1TF));
> 
> This needs to be boot_cpu_has_bug().
> 
> And then what's wrong with:
> 
> 	if (!boot_cpu_has_bug(X86_BUG_L1TF))
> 		return;

You should not get there then.

> >  
> >  	r = -ENOMEM;
> > +	page = alloc_pages(GFP_ATOMIC, L1D_CACHE_ORDER);
> 
> GFP_ATOMIC huh? That important.

Oops. missed that one.

Plus the allocation should only be done when the magic MSR is not available.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [MODERATED] Re: [PATCH v4 2/8] [PATCH v4 2/8] Linux Patch #2
  2018-06-27 15:54 ` [MODERATED] " Borislav Petkov
                     ` (2 preceding siblings ...)
  2018-06-28 15:13   ` Thomas Gleixner
@ 2018-06-29 12:31   ` Paolo Bonzini
  3 siblings, 0 replies; 9+ messages in thread
From: Paolo Bonzini @ 2018-06-29 12:31 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1745 bytes --]

On 27/06/2018 17:54, speck for Borislav Petkov wrote:
>> +		 * If vmentry_l1d_flush is 1, each vmexit handler is responsible for
>> +		 * setting vcpu->arch.vcpu_unconfined.  Currently this happens in the
>> +		 * following cases:
>> +		 * - vmlaunch/vmresume: we do not want the cache to be cleared by a
>> +		 *   nested hypervisor *and* by KVM on bare metal, so we just do it
>> +		 *   on every nested entry.  Nested hypervisors do not bother clearing
>> +		 *   the cache.
>> +		 * - anything that runs the emulator (the slow paths for EPT misconfig
>> +		 *   or I/O instruction)
>> +		 * - anything that can cause get_user_pages (EPT violation, and again
>> +		 *   the slow paths for EPT misconfig or I/O instruction)
>> +		 * - anything that can run code outside KVM (external interrupt,
>> +		 *   which can run interrupt handlers or irqs; or the sched_in
>> +		 *   preempt notifier)
>> +		 */
> 
> I sure hope we won't miss a case in the future... :-\

We're already missing one; we cannot detect external interrupts that
happen between the vmexit and vmentry.  We can do two things about it:

1) run most vmexit handlers (the list would be basically the same as
above) with interrupts disabled, and change vmentry_l1d_flush=1 to
whitelist those vmexit handlers.  The interrupts would be acknowledged
after vmentry, adding about 2000 clock cycles latency to interrupts
delivered during vmexit handling.

2) handle at least the common case of softirqs by adding some kind of "#
of times softirqs were run" percpu variable.

Or (I should probably feel dirty for suggesting it) use rdtsc to find
vmexits that take suspiciously long (>2500 clock cycles?) and flush the
cache unconditionally.

Paolo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [MODERATED] Re: [PATCH v4 2/8] [PATCH v4 2/8] Linux Patch #2
  2018-06-23 13:54 [MODERATED] [PATCH v4 2/8] [PATCH v4 2/8] Linux Patch #2 konrad.wilk
  2018-06-27 10:20 ` Thomas Gleixner
  2018-06-27 15:54 ` [MODERATED] " Borislav Petkov
@ 2018-07-01 16:14 ` Josh Poimboeuf
  2 siblings, 0 replies; 9+ messages in thread
From: Josh Poimboeuf @ 2018-07-01 16:14 UTC (permalink / raw)
  To: speck

On Sat, Jun 23, 2018 at 09:54:16AM -0400, speck for konrad.wilk_at_oracle.com wrote:
> @@ -7392,7 +7447,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  
>  	preempt_disable();
>  
> +	vcpu->arch.flush_cache_req = vcpu->arch.vcpu_unconfined;
>  	kvm_x86_ops->prepare_guest_switch(vcpu);
> +	vcpu->arch.vcpu_unconfined = false;
> +	if (vcpu->arch.flush_cache_req)
> +		vcpu->stat.l1d_flush++;

As Boris said, I don't see the need for the two different variables
which basically track the same thing.  And why is the stat updated in a
separate location from the actual flush?

-- 
Josh

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-07-01 16:14 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-23 13:54 [MODERATED] [PATCH v4 2/8] [PATCH v4 2/8] Linux Patch #2 konrad.wilk
2018-06-27 10:20 ` Thomas Gleixner
2018-06-27 15:54 ` [MODERATED] " Borislav Petkov
2018-06-27 16:21   ` Josh Poimboeuf
2018-06-28  7:25   ` Thomas Gleixner
2018-06-28 10:44     ` Thomas Gleixner
2018-06-28 15:13   ` Thomas Gleixner
2018-06-29 12:31   ` [MODERATED] " Paolo Bonzini
2018-07-01 16:14 ` Josh Poimboeuf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.