Re: [PATCH 1/2] L1TF KVM 1

From: Thomas Gleixner <tglx@linutronix.de>
To: speck@linutronix.de
Subject: Re: [PATCH 1/2] L1TF KVM 1
Date: Wed, 30 May 2018 00:49:55 +0200 (CEST)
Message-ID: <alpine.DEB.2.21.1805292350200.1597@nanos.tec.linutronix.de>
In-Reply-To: <20180529194240.7F1336110A@crypto-ml.lab.linutronix.de>

On Tue, 29 May 2018, speck for Paolo Bonzini wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> Subject: [PATCH 1/2] kvm: x86: mitigation for L1 cache terminal fault vulnerabilities
> 
> This patch adds two mitigation modes for CVE-2018-3620, aka L1 terminal
> fault.  The two modes are "vmexit_l1d_flush=1" and "vmexit_l1d_flush=2".

This is confusing at best. Why is this vmexit_l1d_flush? You flush on
VMENTER not on VMEXIT.

What you are doing is to decide whether the last exit reason requires a
flush or not. But that decision happens on VMENTER.

> Notably, L1 cache flushes are performed on EPT violations (which are
> basically KVM-level page faults), vmexits that involve the emulator,
> and on every KVM_RUN invocation (so each userspace exit).  However,
> most vmexits are considered safe.  I singled out the emulator because
> it may be a good target for other speculative execution-based threats,
> and the MMU because it can bring host page tables in the L1 cache.

What about interrupts?

> @@ -2423,6 +2428,59 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu)
>  				   vmx->guest_msrs[i].mask);
>  }
>  
> +static inline bool vmx_handling_confined(int reason)
> +{
> +	switch (reason) {
> +	case EXIT_REASON_EXCEPTION_NMI:
> +	case EXIT_REASON_HLT:
> +	case EXIT_REASON_PAUSE_INSTRUCTION:
> +	case EXIT_REASON_APIC_WRITE:
> +	case EXIT_REASON_MSR_WRITE:
> +	case EXIT_REASON_VMCALL:
> +	case EXIT_REASON_CR_ACCESS:
> +	case EXIT_REASON_DR_ACCESS:
> +	case EXIT_REASON_CPUID:
> +	case EXIT_REASON_PREEMPTION_TIMER:
> +	case EXIT_REASON_MSR_READ:
> +	case EXIT_REASON_EOI_INDUCED:
> +	case EXIT_REASON_WBINVD:
> +	case EXIT_REASON_XSETBV:
> +		/*
> +		 * The next three set vcpu->arch.vcpu_unconfined themselves, so
> +		 * we consider them confined here.

What's the logic behind that?

> +		 */
> +	case EXIT_REASON_EPT_VIOLATION:
> +	case EXIT_REASON_EPT_MISCONFIG:
> +	case EXIT_REASON_IO_INSTRUCTION:
> +		return true;
> +	case EXIT_REASON_EXTERNAL_INTERRUPT: {
> +		int cpu = raw_smp_processor_id();
> +		int vector = per_cpu(last_vector, cpu);
> +		return vector == LOCAL_TIMER_VECTOR || vector == RESCHEDULE_VECTOR;

That wants a comment explaining why these two are considered safe.

The timer vector is a doubtful one. It does not necessarily cause a
reschedule and it can run arbitrary softirq code on interrupt return. I
wouldn't be that sure that it's safe.

> +	}
> +	default:
> +		return false;
> +	}
> +}
> +
> +static bool vmx_core_confined(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	return vmx_handling_confined(vmx->exit_reason);
> +}
> +
> +static void vmx_prepare_guest_switch(struct kvm_vcpu *vcpu, bool *need_l1d_flush)
> +{
> +	vmx_save_host_state(vcpu);
> +	if (vmexit_l1d_flush == 0 || !enable_ept)
> +		*need_l1d_flush = false;
> +	else if (vmexit_l1d_flush == 1)
> +		*need_l1d_flush |= !vmx_core_confined(vcpu);

This inverted logic does not make the code more readable. It's obfuscation
for no value.

> +	else
> +		*need_l1d_flush = true;
> +}
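
To illustrate the point about the inverted logic: a rough sketch of the same
decision, phrased as "does this VMENTER need a flush". The enum names, the
helper name and its parameters are made up for illustration and do not exist
in the patch; the meaning of the modes 0/1/2 and of the EPT check is taken
from the hunk above.

```c
#include <stdbool.h>

/* Hypothetical names for the three vmexit_l1d_flush values in the patch. */
enum l1d_flush_mode {
	L1D_FLUSH_NEVER  = 0,
	L1D_FLUSH_COND   = 1,
	L1D_FLUSH_ALWAYS = 2,
};

/*
 * Decide on VMENTER whether the L1D has to be flushed.
 *
 * @pending:     a flush was already requested by other code (the incoming
 *               value of *need_l1d_flush in the patch)
 * @exit_unsafe: the last exit reason is NOT on the "confined" list
 */
static bool l1d_flush_needed(enum l1d_flush_mode mode, bool ept_enabled,
			     bool pending, bool exit_unsafe)
{
	if (mode == L1D_FLUSH_NEVER || !ept_enabled)
		return false;

	if (mode == L1D_FLUSH_COND)
		return pending || exit_unsafe;

	return true;
}
```

With a predicate which says "unsafe" instead of "confined" the conditional
mode reads straight forward and the negation at the call site goes away.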

> @@ -9457,11 +9515,13 @@ static void vmx_handle_external_intr(struct kvm_vcpu *vcpu)
> 		unsigned long entry;
> 		gate_desc *desc;
>		struct vcpu_vmx *vmx = to_vmx(vcpu);
> +		int cpu = raw_smp_processor_id();
>  #ifdef CONFIG_X86_64
>  		unsigned long tmp;
>  #endif
>  
> 		vector =  exit_intr_info & INTR_INFO_VECTOR_MASK;
> +		per_cpu(last_vector, cpu) = vector;

Why aren't you evaluating the vector right here and setting the unconfined
bit, instead of having yet another indirection and probably another cache
line for that per_cpu() storage? That does not make any sense at all.

>		desc = (gate_desc *)vmx->host_idt_base + vector;
>		entry = gate_offset(desc);
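
Something along these lines is what I mean by evaluating the vector in place.
It is a standalone sketch, not a patch: the struct, the helper and the
SKETCH_* vector values are placeholders so that it compiles on its own. In
the real code this would operate on vcpu->arch.vcpu_unconfined and on the
kernel's LOCAL_TIMER_VECTOR and RESCHEDULE_VECTOR. It keeps the two vectors
from the patch as is, which does not mean that I agree that they are safe.

```c
#include <stdbool.h>

/*
 * Placeholder vector numbers so the sketch compiles standalone. The real
 * LOCAL_TIMER_VECTOR and RESCHEDULE_VECTOR come from the kernel headers.
 */
#define SKETCH_LOCAL_TIMER_VECTOR	0xec
#define SKETCH_RESCHEDULE_VECTOR	0xfd

/* Stand-in for the one piece of vcpu state which matters here. */
struct sketch_vcpu {
	bool unconfined;
};

/*
 * Called at the point where the vector is extracted from the exit
 * interrupt info. Marks the vcpu unconfined right away, so neither a
 * per CPU variable nor a later lookup is required.
 */
static void sketch_note_external_intr(struct sketch_vcpu *vcpu,
				      unsigned int vector)
{
	if (vector != SKETCH_LOCAL_TIMER_VECTOR &&
	    vector != SKETCH_RESCHEDULE_VECTOR)
		vcpu->unconfined = true;
}
```

EXIT_REASON_EXTERNAL_INTERRUPT could then be handled like the other exit
reasons which set the unconfined bit themselves.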

> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6509,10 +6512,40 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
>  };
>  #endif
>  
> +
> +#define L1D_CACHE_ORDER 3

This should use the cache size information and not a hard-coded value, I think.
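
A minimal sketch of deriving the order from a size in bytes. Where the L1D
size comes from is deliberately left open, and the helper name and the
SKETCH_PAGE_SIZE constant are made up so that the snippet stands on its own.
In the kernel get_order() does this already.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE	4096UL

/* Smallest order for which (SKETCH_PAGE_SIZE << order) covers @bytes. */
static unsigned int sketch_order_for_size(size_t bytes)
{
	unsigned int order = 0;

	while ((SKETCH_PAGE_SIZE << order) < bytes)
		order++;

	return order;
}
```

For a 32K L1D this yields 3, which matches the hard-coded value, but it
would follow the actual cache size on parts where that differs.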

> +static void *__read_mostly empty_zero_pages;
> +
> +void kvm_l1d_flush(void)
> +{
> +	asm volatile(
> +		"movq %0, %%rax\n\t"
> +		"leaq 65536(%0), %%rdx\n\t"

Why 64K?

> +		"11: \n\t"
> +		"movzbl (%%rax), %%ecx\n\t"
> +		"addq $4096, %%rax\n\t"
> +		"cmpq %%rax, %%rdx\n\t"
> +		"jne 11b\n\t"
> +		"xorl %%eax, %%eax\n\t"
> +		"cpuid\n\t"

What's the cpuid invocation for?

> +		"xorl %%eax, %%eax\n\t"
> +		"12:\n\t"
> +		"movzwl %%ax, %%edx\n\t"
> +		"addl $64, %%eax\n\t"
> +		"movzbl (%%rdx, %0), %%ecx\n\t"
> +		"cmpl $65536, %%eax\n\t"

And this whole magic should be documented.

> +		"jne 12b\n\t"
> +		"lfence\n\t"
> +		:
> +		: "r" (empty_zero_pages)
> +		: "rax", "rbx", "rcx", "rdx");

How is that supposed to compile on 32bit?

> +}
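
For the documentation part: this is how I read the asm, written down as
plain C. It is only meant to show the kind of comments I'd expect and it is
my interpretation, which might not match your intent. In particular the
"populate the TLB" remark is a guess. The function name, the SKETCH_*
constants and the returned load count are made up for illustration, and
plain C obviously does not give the ordering guarantees of the asm.

```c
#include <stddef.h>

#define SKETCH_FLUSH_SIZE	65536u	/* the 64K used by the asm */
#define SKETCH_PAGE_SIZE	4096u	/* stride of the first loop */
#define SKETCH_LINE_SIZE	64u	/* stride of the second loop */

/* Returns the number of loads issued, purely to make the sketch checkable. */
static unsigned int sketch_l1d_flush(const volatile unsigned char *buf)
{
	unsigned int loads = 0;
	size_t off;

	/*
	 * Pass 1: read one byte per page. Presumably this is there to get
	 * the translations into the TLB before the real pass runs.
	 */
	for (off = 0; off < SKETCH_FLUSH_SIZE; off += SKETCH_PAGE_SIZE) {
		(void)buf[off];
		loads++;
	}

	/*
	 * The asm executes CPUID here, which is a serializing instruction.
	 * There is no equivalent in portable C, so the sketch has nothing
	 * at this point.
	 */

	/* Pass 2: read one byte per 64 byte cache line of the buffer. */
	for (off = 0; off < SKETCH_FLUSH_SIZE; off += SKETCH_LINE_SIZE) {
		(void)buf[off];
		loads++;
	}

	/* The asm ends with LFENCE; again nothing equivalent here. */
	return loads;
}
```

That makes it 16 loads in the first pass and 1024 in the second one. Why it
has to be 64K and why CPUID is needed in between is exactly what the comments
in the real code have to explain.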

Aside from that, do we really need that manual flush thingy? Is that ucode
update going to take forever?

Thanks,

	tglx