kvm.vger.kernel.org archive mirror
* [PATCH 0/2] KVM: x86: Fixes for SEV-ES state tracking
@ 2021-05-07 16:59 Sean Christopherson
  2021-05-07 16:59 ` [PATCH 1/2] KVM: SVM: Update EFER software model on CR0 trap for SEV-ES Sean Christopherson
  2021-05-07 16:59 ` [PATCH 2/2] KVM: x86: Allow userspace to update tracked sregs for protected guests Sean Christopherson
  0 siblings, 2 replies; 11+ messages in thread
From: Sean Christopherson @ 2021-05-07 16:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Gonda, Maxim Levitsky

For SEV-ES guests, ensure KVM's model of EFER is up-to-date after a CR0
load, and let userspace set tracked state via KVM_SET_SREGS, relying on
the vendor code to not propagate the changes to hardware (VMCB in this
case).
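
For reference, the userspace flow this enables is the usual KVM_GET_SREGS /
KVM_SET_SREGS round-trip; a minimal sketch (error handling omitted, vcpu_fd
assumed to be an open KVM vCPU fd):

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	struct kvm_sregs sregs;

	/* Read the CR0/CR4/CR8/EFER values KVM has tracked via trapping. */
	ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);

	/* ... transfer/adjust sregs.cr0, sregs.cr4, sregs.cr8, sregs.efer ... */

	/*
	 * Write the values back.  For a protected guest this updates only
	 * KVM's software model; the encrypted VMSA is never touched.
	 */
	ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);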

Peter, patch 02 is different from what I sent to you off list.  I'm still
100% convinced that the load_pdptrs() call is flawed, but Maxim has an
in-progress series that's tackling the PDPTR save/restore mess.

Sean Christopherson (2):
  KVM: SVM: Update EFER software model on CR0 trap for SEV-ES
  KVM: x86: Allow userspace to update tracked sregs for protected guests

 arch/x86/kvm/svm/svm.c |  8 +++--
 arch/x86/kvm/x86.c     | 73 ++++++++++++++++++++++++------------------
 2 files changed, 47 insertions(+), 34 deletions(-)

-- 
2.31.1.607.g51e8a6a459-goog


* [PATCH 1/2] KVM: SVM: Update EFER software model on CR0 trap for SEV-ES
  2021-05-07 16:59 [PATCH 0/2] KVM: x86: Fixes for SEV-ES state tracking Sean Christopherson
@ 2021-05-07 16:59 ` Sean Christopherson
  2021-05-07 23:15   ` Tom Lendacky
  2021-05-07 16:59 ` [PATCH 2/2] KVM: x86: Allow userspace to update tracked sregs for protected guests Sean Christopherson
  1 sibling, 1 reply; 11+ messages in thread
From: Sean Christopherson @ 2021-05-07 16:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Gonda, Maxim Levitsky

For protected guests, a.k.a. SEV-ES guests, update KVM's model of EFER
when processing the side effect of the CPU entering long mode when paging
is enabled.  The whole point of intercepting CR0/CR4/EFER is to keep
KVM's software model up-to-date.

Fixes: f1c6366e3043 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Reported-by: Peter Gonda <pgonda@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/svm.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index a7271f31df47..d271fe8e58de 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1696,15 +1696,17 @@ void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	u64 hcr0 = cr0;
 
 #ifdef CONFIG_X86_64
-	if (vcpu->arch.efer & EFER_LME && !vcpu->arch.guest_state_protected) {
+	if (vcpu->arch.efer & EFER_LME) {
 		if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) {
 			vcpu->arch.efer |= EFER_LMA;
-			svm->vmcb->save.efer |= EFER_LMA | EFER_LME;
+			if (!vcpu->arch.guest_state_protected)
+				svm->vmcb->save.efer |= EFER_LMA | EFER_LME;
 		}
 
 		if (is_paging(vcpu) && !(cr0 & X86_CR0_PG)) {
 			vcpu->arch.efer &= ~EFER_LMA;
-			svm->vmcb->save.efer &= ~(EFER_LMA | EFER_LME);
+			if (!vcpu->arch.guest_state_protected)
+				svm->vmcb->save.efer &= ~(EFER_LMA | EFER_LME);
 		}
 	}
 #endif
-- 
2.31.1.607.g51e8a6a459-goog


* [PATCH 2/2] KVM: x86: Allow userspace to update tracked sregs for protected guests
  2021-05-07 16:59 [PATCH 0/2] KVM: x86: Fixes for SEV-ES state tracking Sean Christopherson
  2021-05-07 16:59 ` [PATCH 1/2] KVM: SVM: Update EFER software model on CR0 trap for SEV-ES Sean Christopherson
@ 2021-05-07 16:59 ` Sean Christopherson
  2021-05-07 23:21   ` Tom Lendacky
  1 sibling, 1 reply; 11+ messages in thread
From: Sean Christopherson @ 2021-05-07 16:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Gonda, Maxim Levitsky

Allow userspace to set CR0, CR4, CR8, and EFER via KVM_SET_SREGS for
protected guests, e.g. for SEV-ES guests with an encrypted VMSA.  KVM
tracks the aforementioned registers by trapping guest writes, and also
exposes the values to userspace via KVM_GET_SREGS.  Skipping the regs
in KVM_SET_SREGS prevents userspace from updating KVM's CPU model to
match the known hardware state.

Fixes: 5265713a0737 ("KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES")
Reported-by: Peter Gonda <pgonda@google.com>
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 73 ++++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3bf52ba5f2bb..1b7d0e97c82b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9963,21 +9963,25 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	if (kvm_set_apic_base(vcpu, &apic_base_msr))
 		goto out;
 
-	if (vcpu->arch.guest_state_protected)
-		goto skip_protected_regs;
+	if (!vcpu->arch.guest_state_protected) {
+		dt.size = sregs->idt.limit;
+		dt.address = sregs->idt.base;
+		static_call(kvm_x86_set_idt)(vcpu, &dt);
+		dt.size = sregs->gdt.limit;
+		dt.address = sregs->gdt.base;
+		static_call(kvm_x86_set_gdt)(vcpu, &dt);
 
-	dt.size = sregs->idt.limit;
-	dt.address = sregs->idt.base;
-	static_call(kvm_x86_set_idt)(vcpu, &dt);
-	dt.size = sregs->gdt.limit;
-	dt.address = sregs->gdt.base;
-	static_call(kvm_x86_set_gdt)(vcpu, &dt);
-
-	vcpu->arch.cr2 = sregs->cr2;
-	mmu_reset_needed |= kvm_read_cr3(vcpu) != sregs->cr3;
-	vcpu->arch.cr3 = sregs->cr3;
-	kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
+		vcpu->arch.cr2 = sregs->cr2;
+		mmu_reset_needed |= kvm_read_cr3(vcpu) != sregs->cr3;
+		vcpu->arch.cr3 = sregs->cr3;
+		kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
+	}
 
+	/*
+	 * Writes to CR0, CR4, CR8, and EFER are trapped (after the instruction
+	 * completes) for SEV-ES guests, thus userspace is allowed to set them
+	 * so that KVM's model can be updated to mirror hardware state.
+	 */
 	kvm_set_cr8(vcpu, sregs->cr8);
 
 	mmu_reset_needed |= vcpu->arch.efer != sregs->efer;
@@ -9990,35 +9994,42 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
 	static_call(kvm_x86_set_cr4)(vcpu, sregs->cr4);
 
-	idx = srcu_read_lock(&vcpu->kvm->srcu);
-	if (is_pae_paging(vcpu)) {
+	/*
+	 * PDPTEs, like regular PTEs, are always encrypted, thus reading them
+	 * will return garbage.  Shadow paging, including nested NPT, isn't
+	 * compatible with protected guests, so ignoring the PDPTEs is a-ok.
+	 */
+	if (!vcpu->arch.guest_state_protected && is_pae_paging(vcpu)) {
+		idx = srcu_read_lock(&vcpu->kvm->srcu);
 		load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
+		srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
 		mmu_reset_needed = 1;
 	}
-	srcu_read_unlock(&vcpu->kvm->srcu, idx);
 
 	if (mmu_reset_needed)
 		kvm_mmu_reset_context(vcpu);
 
-	kvm_set_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
-	kvm_set_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
-	kvm_set_segment(vcpu, &sregs->es, VCPU_SREG_ES);
-	kvm_set_segment(vcpu, &sregs->fs, VCPU_SREG_FS);
-	kvm_set_segment(vcpu, &sregs->gs, VCPU_SREG_GS);
-	kvm_set_segment(vcpu, &sregs->ss, VCPU_SREG_SS);
+	if (!vcpu->arch.guest_state_protected) {
+		kvm_set_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
+		kvm_set_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
+		kvm_set_segment(vcpu, &sregs->es, VCPU_SREG_ES);
+		kvm_set_segment(vcpu, &sregs->fs, VCPU_SREG_FS);
+		kvm_set_segment(vcpu, &sregs->gs, VCPU_SREG_GS);
+		kvm_set_segment(vcpu, &sregs->ss, VCPU_SREG_SS);
 
-	kvm_set_segment(vcpu, &sregs->tr, VCPU_SREG_TR);
-	kvm_set_segment(vcpu, &sregs->ldt, VCPU_SREG_LDTR);
+		kvm_set_segment(vcpu, &sregs->tr, VCPU_SREG_TR);
+		kvm_set_segment(vcpu, &sregs->ldt, VCPU_SREG_LDTR);
 
-	update_cr8_intercept(vcpu);
+		update_cr8_intercept(vcpu);
 
-	/* Older userspace won't unhalt the vcpu on reset. */
-	if (kvm_vcpu_is_bsp(vcpu) && kvm_rip_read(vcpu) == 0xfff0 &&
-	    sregs->cs.selector == 0xf000 && sregs->cs.base == 0xffff0000 &&
-	    !is_protmode(vcpu))
-		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+		/* Older userspace won't unhalt the vcpu on reset. */
+		if (kvm_vcpu_is_bsp(vcpu) && kvm_rip_read(vcpu) == 0xfff0 &&
+		    sregs->cs.selector == 0xf000 &&
+		    sregs->cs.base == 0xffff0000 && !is_protmode(vcpu))
+			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+	}
 
-skip_protected_regs:
 	max_bits = KVM_NR_INTERRUPTS;
 	pending_vec = find_first_bit(
 		(const unsigned long *)sregs->interrupt_bitmap, max_bits);
-- 
2.31.1.607.g51e8a6a459-goog


* Re: [PATCH 1/2] KVM: SVM: Update EFER software model on CR0 trap for SEV-ES
  2021-05-07 16:59 ` [PATCH 1/2] KVM: SVM: Update EFER software model on CR0 trap for SEV-ES Sean Christopherson
@ 2021-05-07 23:15   ` Tom Lendacky
  0 siblings, 0 replies; 11+ messages in thread
From: Tom Lendacky @ 2021-05-07 23:15 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Peter Gonda, Maxim Levitsky

On 5/7/21 11:59 AM, Sean Christopherson wrote:
> For protected guests, a.k.a. SEV-ES guests, update KVM's model of EFER
> when processing the side effect of the CPU entering long mode when paging
> is enabled.  The whole point of intercepting CR0/CR4/EFER is to keep
> KVM's software model up-to-date.
> 
> Fixes: f1c6366e3043 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
> Reported-by: Peter Gonda <pgonda@google.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Acked-by: Tom Lendacky <thomas.lendacky@amd.com>

> ---
>  arch/x86/kvm/svm/svm.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index a7271f31df47..d271fe8e58de 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1696,15 +1696,17 @@ void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
>  	u64 hcr0 = cr0;
>  
>  #ifdef CONFIG_X86_64
> -	if (vcpu->arch.efer & EFER_LME && !vcpu->arch.guest_state_protected) {
> +	if (vcpu->arch.efer & EFER_LME) {
>  		if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) {
>  			vcpu->arch.efer |= EFER_LMA;
> -			svm->vmcb->save.efer |= EFER_LMA | EFER_LME;
> +			if (!vcpu->arch.guest_state_protected)
> +				svm->vmcb->save.efer |= EFER_LMA | EFER_LME;
>  		}
>  
>  		if (is_paging(vcpu) && !(cr0 & X86_CR0_PG)) {
>  			vcpu->arch.efer &= ~EFER_LMA;
> -			svm->vmcb->save.efer &= ~(EFER_LMA | EFER_LME);
> +			if (!vcpu->arch.guest_state_protected)
> +				svm->vmcb->save.efer &= ~(EFER_LMA | EFER_LME);
>  		}
>  	}
>  #endif
> 

* Re: [PATCH 2/2] KVM: x86: Allow userspace to update tracked sregs for protected guests
  2021-05-07 16:59 ` [PATCH 2/2] KVM: x86: Allow userspace to update tracked sregs for protected guests Sean Christopherson
@ 2021-05-07 23:21   ` Tom Lendacky
  2021-05-10 16:10     ` Sean Christopherson
  0 siblings, 1 reply; 11+ messages in thread
From: Tom Lendacky @ 2021-05-07 23:21 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Peter Gonda, Maxim Levitsky

On 5/7/21 11:59 AM, Sean Christopherson wrote:
> Allow userspace to set CR0, CR4, CR8, and EFER via KVM_SET_SREGS for
> protected guests, e.g. for SEV-ES guests with an encrypted VMSA.  KVM
> tracks the aforementioned registers by trapping guest writes, and also
> exposes the values to userspace via KVM_GET_SREGS.  Skipping the regs
> in KVM_SET_SREGS prevents userspace from updating KVM's CPU model to
> match the known hardware state.

This is very similar to the original patch I had proposed that you were
against :)

I'm assuming it's meant to make live migration a bit easier?

> 
> Fixes: 5265713a0737 ("KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES")
> Reported-by: Peter Gonda <pgonda@google.com>
> Cc: stable@vger.kernel.org
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Acked-by: Tom Lendacky <thomas.lendacky@amd.com>

> ---
>  arch/x86/kvm/x86.c | 73 ++++++++++++++++++++++++++--------------------
>  1 file changed, 42 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3bf52ba5f2bb..1b7d0e97c82b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9963,21 +9963,25 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
>  	if (kvm_set_apic_base(vcpu, &apic_base_msr))
>  		goto out;
>  
> -	if (vcpu->arch.guest_state_protected)
> -		goto skip_protected_regs;
> +	if (!vcpu->arch.guest_state_protected) {
> +		dt.size = sregs->idt.limit;
> +		dt.address = sregs->idt.base;
> +		static_call(kvm_x86_set_idt)(vcpu, &dt);
> +		dt.size = sregs->gdt.limit;
> +		dt.address = sregs->gdt.base;
> +		static_call(kvm_x86_set_gdt)(vcpu, &dt);
>  
> -	dt.size = sregs->idt.limit;
> -	dt.address = sregs->idt.base;
> -	static_call(kvm_x86_set_idt)(vcpu, &dt);
> -	dt.size = sregs->gdt.limit;
> -	dt.address = sregs->gdt.base;
> -	static_call(kvm_x86_set_gdt)(vcpu, &dt);
> -
> -	vcpu->arch.cr2 = sregs->cr2;
> -	mmu_reset_needed |= kvm_read_cr3(vcpu) != sregs->cr3;
> -	vcpu->arch.cr3 = sregs->cr3;
> -	kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
> +		vcpu->arch.cr2 = sregs->cr2;
> +		mmu_reset_needed |= kvm_read_cr3(vcpu) != sregs->cr3;
> +		vcpu->arch.cr3 = sregs->cr3;
> +		kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
> +	}
>  
> +	/*
> +	 * Writes to CR0, CR4, CR8, and EFER are trapped (after the instruction
> +	 * completes) for SEV-ES guests, thus userspace is allowed to set them
> +	 * so that KVM's model can be updated to mirror hardware state.
> +	 */
>  	kvm_set_cr8(vcpu, sregs->cr8);
>  
>  	mmu_reset_needed |= vcpu->arch.efer != sregs->efer;
> @@ -9990,35 +9994,42 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
>  	mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
>  	static_call(kvm_x86_set_cr4)(vcpu, sregs->cr4);
>  
> -	idx = srcu_read_lock(&vcpu->kvm->srcu);
> -	if (is_pae_paging(vcpu)) {
> +	/*
> +	 * PDPTEs, like regular PTEs, are always encrypted, thus reading them
> +	 * will return garbage.  Shadow paging, including nested NPT, isn't
> +	 * compatible with protected guests, so ignoring the PDPTEs is a-ok.
> +	 */
> +	if (!vcpu->arch.guest_state_protected && is_pae_paging(vcpu)) {
> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
>  		load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +
>  		mmu_reset_needed = 1;
>  	}
> -	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>  
>  	if (mmu_reset_needed)
>  		kvm_mmu_reset_context(vcpu);
>  
> -	kvm_set_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
> -	kvm_set_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
> -	kvm_set_segment(vcpu, &sregs->es, VCPU_SREG_ES);
> -	kvm_set_segment(vcpu, &sregs->fs, VCPU_SREG_FS);
> -	kvm_set_segment(vcpu, &sregs->gs, VCPU_SREG_GS);
> -	kvm_set_segment(vcpu, &sregs->ss, VCPU_SREG_SS);
> +	if (!vcpu->arch.guest_state_protected) {
> +		kvm_set_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
> +		kvm_set_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
> +		kvm_set_segment(vcpu, &sregs->es, VCPU_SREG_ES);
> +		kvm_set_segment(vcpu, &sregs->fs, VCPU_SREG_FS);
> +		kvm_set_segment(vcpu, &sregs->gs, VCPU_SREG_GS);
> +		kvm_set_segment(vcpu, &sregs->ss, VCPU_SREG_SS);
>  
> -	kvm_set_segment(vcpu, &sregs->tr, VCPU_SREG_TR);
> -	kvm_set_segment(vcpu, &sregs->ldt, VCPU_SREG_LDTR);
> +		kvm_set_segment(vcpu, &sregs->tr, VCPU_SREG_TR);
> +		kvm_set_segment(vcpu, &sregs->ldt, VCPU_SREG_LDTR);
>  
> -	update_cr8_intercept(vcpu);
> +		update_cr8_intercept(vcpu);
>  
> -	/* Older userspace won't unhalt the vcpu on reset. */
> -	if (kvm_vcpu_is_bsp(vcpu) && kvm_rip_read(vcpu) == 0xfff0 &&
> -	    sregs->cs.selector == 0xf000 && sregs->cs.base == 0xffff0000 &&
> -	    !is_protmode(vcpu))
> -		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> +		/* Older userspace won't unhalt the vcpu on reset. */
> +		if (kvm_vcpu_is_bsp(vcpu) && kvm_rip_read(vcpu) == 0xfff0 &&
> +		    sregs->cs.selector == 0xf000 &&
> +		    sregs->cs.base == 0xffff0000 && !is_protmode(vcpu))
> +			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> +	}
>  
> -skip_protected_regs:
>  	max_bits = KVM_NR_INTERRUPTS;
>  	pending_vec = find_first_bit(
>  		(const unsigned long *)sregs->interrupt_bitmap, max_bits);
> 

* Re: [PATCH 2/2] KVM: x86: Allow userspace to update tracked sregs for protected guests
  2021-05-07 23:21   ` Tom Lendacky
@ 2021-05-10 16:10     ` Sean Christopherson
  2021-05-10 18:07       ` Tom Lendacky
  2021-05-14 14:19       ` Peter Gonda
  0 siblings, 2 replies; 11+ messages in thread
From: Sean Christopherson @ 2021-05-10 16:10 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Gonda, Maxim Levitsky

On Fri, May 07, 2021, Tom Lendacky wrote:
> On 5/7/21 11:59 AM, Sean Christopherson wrote:
> > Allow userspace to set CR0, CR4, CR8, and EFER via KVM_SET_SREGS for
> > protected guests, e.g. for SEV-ES guests with an encrypted VMSA.  KVM
> > tracks the aforementioned registers by trapping guest writes, and also
> > exposes the values to userspace via KVM_GET_SREGS.  Skipping the regs
> > in KVM_SET_SREGS prevents userspace from updating KVM's CPU model to
> > match the known hardware state.
> 
> This is very similar to the original patch I had proposed that you were
> against :)

I hope/think my position was that it should be unnecessary for KVM to need to
know the guest's CR0/4/8 and EFER values, i.e. even the trapping is unnecessary.
I was going to say I had a change of heart, as EFER.LMA in particular could
still be required to identify 64-bit mode, but that's wrong; EFER.LMA only gets
us long mode, the full is_64_bit_mode() needs access to cs.L, which AFAICT isn't
provided by #VMGEXIT or trapping.
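
For reference, is_64_bit_mode() is roughly:

	static inline bool is_64_bit_mode(struct kvm_vcpu *vcpu)
	{
		int cs_db, cs_l;

		if (!is_long_mode(vcpu))	/* EFER.LMA */
			return false;
		static_call(kvm_x86_get_cs_db_l_bits)(vcpu, &cs_db, &cs_l);
		return cs_l;	/* CS.L, which a protected VMSA never exposes */
	}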

Unless I'm missing something, that means that VMGEXIT(VMMCALL) is broken since
KVM will incorrectly crush (or preserve) bits 63:32 of GPRs.  I'm guessing no
one has reported a bug because either (a) no one has tested a hypercall that
requires bits 63:32 in a GPR or (b) the guest just happens to be in 64-bit mode
when KVM_SEV_LAUNCH_UPDATE_VMSA is invoked and so the segment registers are
frozen to make it appear as if the guest is perpetually in 64-bit mode.
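
Concretely, kvm_emulate_hypercall() truncates the GPRs based on
is_64_bit_mode(), roughly:

	op_64_bit = is_64_bit_mode(vcpu);
	if (!op_64_bit) {
		nr &= 0xFFFFFFFF;
		a0 &= 0xFFFFFFFF;
		a1 &= 0xFFFFFFFF;
		a2 &= 0xFFFFFFFF;
		a3 &= 0xFFFFFFFF;
	}
	...
	if (!op_64_bit)
		ret = (u32)ret;
	kvm_rax_write(vcpu, ret);

so getting cs.L wrong silently crushes (or preserves) bits 63:32.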

I see that sev_es_validate_vmgexit() checks ghcb_cpl_is_valid(), but isn't that
either pointless or indicative of a much, much bigger problem?  If VMGEXIT is
restricted to CPL0, then the check is pointless.  If VMGEXIT isn't restricted to
CPL0, then KVM has a big gaping hole that allows a malicious/broken guest
userspace to crash the VM simply by executing VMGEXIT.  Since valid_bitmap is
cleared during VMGEXIT handling, I don't think guest userspace can attack/corrupt
the guest kernel by doing a replay attack, but it does all but guarantee a
VMGEXIT at CPL>0 will be fatal since the required valid bits won't be set.

Sadly, the APM doesn't describe the VMGEXIT behavior, nor does any of the SEV-ES
documentation I have.  I assume VMGEXIT is recognized at CPL>0 since it morphs
to VMMCALL when SEV-ES isn't active.

I.e. either the ghcb_cpl_is_valid() check should be nuked, or more likely KVM
should do something like this (and then the guest needs to be updated to set the
CPL on every VMGEXIT):

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index a9d8d6aafdb8..bb7251e4a3e2 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2058,7 +2058,7 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
        vcpu->arch.regs[VCPU_REGS_RDX] = ghcb_get_rdx_if_valid(ghcb);
        vcpu->arch.regs[VCPU_REGS_RSI] = ghcb_get_rsi_if_valid(ghcb);

-       svm->vmcb->save.cpl = ghcb_get_cpl_if_valid(ghcb);
+       svm->vmcb->save.cpl = 0;

        if (ghcb_xcr0_is_valid(ghcb)) {
                vcpu->arch.xcr0 = ghcb_get_xcr0(ghcb);
@@ -2088,6 +2088,10 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
        if (ghcb->ghcb_usage)
                goto vmgexit_err;

+       /* Ignore VMGEXIT at CPL>0 */
+       if (!ghcb_cpl_is_valid(ghcb) || ghcb_get_cpl_if_valid(ghcb))
+               return 1;
+
        /*
         * Retrieve the exit code now even though it may not be marked valid
         * as it could help with debugging.
@@ -2142,8 +2146,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
                }
                break;
        case SVM_EXIT_VMMCALL:
-               if (!ghcb_rax_is_valid(ghcb) ||
-                   !ghcb_cpl_is_valid(ghcb))
+               if (!ghcb_rax_is_valid(ghcb))
                        goto vmgexit_err;
                break;
        case SVM_EXIT_RDTSCP:

> I'm assuming it's meant to make live migration a bit easier?

Peter, I forget, were these changes necessary for your work, or was the sole root
cause the emulated MMIO bug in our backport?

If KVM chugs along happily without these patches, I'd love to pivot and yank out
all of the CR0/4/8 and EFER trapping/tracking, and then make KVM_GET_SREGS a nop
as well.

* Re: [PATCH 2/2] KVM: x86: Allow userspace to update tracked sregs for protected guests
  2021-05-10 16:10     ` Sean Christopherson
@ 2021-05-10 18:07       ` Tom Lendacky
  2021-05-10 21:02         ` Sean Christopherson
  2021-05-14 14:19       ` Peter Gonda
  1 sibling, 1 reply; 11+ messages in thread
From: Tom Lendacky @ 2021-05-10 18:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Gonda, Maxim Levitsky

On 5/10/21 11:10 AM, Sean Christopherson wrote:
> On Fri, May 07, 2021, Tom Lendacky wrote:
>> On 5/7/21 11:59 AM, Sean Christopherson wrote:
>>> Allow userspace to set CR0, CR4, CR8, and EFER via KVM_SET_SREGS for
>>> protected guests, e.g. for SEV-ES guests with an encrypted VMSA.  KVM
>>> tracks the aforementioned registers by trapping guest writes, and also
>>> exposes the values to userspace via KVM_GET_SREGS.  Skipping the regs
>>> in KVM_SET_SREGS prevents userspace from updating KVM's CPU model to
>>> match the known hardware state.
>>
>> This is very similar to the original patch I had proposed that you were
>> against :)
> 
> I hope/think my position was that it should be unnecessary for KVM to need to
> know the guest's CR0/4/8 and EFER values, i.e. even the trapping is unnecessary.
> I was going to say I had a change of heart, as EFER.LMA in particular could
> still be required to identify 64-bit mode, but that's wrong; EFER.LMA only gets
> us long mode, the full is_64_bit_mode() needs access to cs.L, which AFAICT isn't
> provided by #VMGEXIT or trapping.

Right, that one is missing. If you take a VMGEXIT that uses the GHCB, then
I think you can assume we're in 64-bit mode.

> 
> Unless I'm missing something, that means that VMGEXIT(VMMCALL) is broken since
> KVM will incorrectly crush (or preserve) bits 63:32 of GPRs.  I'm guessing no
> one has reported a bug because either (a) no one has tested a hypercall that
> requires bits 63:32 in a GPR or (b) the guest just happens to be in 64-bit mode
> when KVM_SEV_LAUNCH_UPDATE_VMSA is invoked and so the segment registers are
> frozen to make it appear as if the guest is perpetually in 64-bit mode.

I don't think it's (b) since the LAUNCH_UPDATE_VMSA is done against reset-
state vCPUs.

> 
> I see that sev_es_validate_vmgexit() checks ghcb_cpl_is_valid(), but isn't that
> either pointless or indicative of a much, much bigger problem?  If VMGEXIT is

It is needed for the VMMCALL exit.

> restricted to CPL0, then the check is pointless.  If VMGEXIT isn't restricted to
> CPL0, then KVM has a big gaping hole that allows a malicious/broken guest
> userspace to crash the VM simply by executing VMGEXIT.  Since valid_bitmap is
> cleared during VMGEXIT handling, I don't think guest userspace can attack/corrupt
> the guest kernel by doing a replay attack, but it does all but guarantee a
> VMGEXIT at CPL>0 will be fatal since the required valid bits won't be set.

Right, so I think some cleanup is needed there, both for the guest and the
hypervisor:

- For the guest, we could just clear the valid bitmask before leaving the
  #VC handler/releasing the GHCB. Userspace can't update the GHCB, so any
  VMGEXIT from userspace would just look like a no-op with the below
  change to KVM.

- For KVM, instead of returning -EINVAL from sev_es_validate_vmgexit(), we
  return the #GP action through the GHCB and continue running the guest.

> 
> Sadly, the APM doesn't describe the VMGEXIT behavior, nor does any of the SEV-ES
> documentation I have.  I assume VMGEXIT is recognized at CPL>0 since it morphs
> to VMMCALL when SEV-ES isn't active.

Correct.

> 
> I.e. either the ghcb_cpl_is_valid() check should be nuked, or more likely KVM

The ghcb_cpl_is_valid() is still needed to see whether the VMMCALL was
from userspace or not (a VMMCALL will generate a #VC). So maybe something
like this instead (this is against the sev-es.c to sev.c rename):

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 432d937f8f1e..bf821a4eacf9 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -270,6 +270,7 @@ static __always_inline void sev_es_put_ghcb(struct ghcb_state *state)
 		data->backup_ghcb_active = false;
 		state->ghcb = NULL;
 	} else {
+		vc_ghcb_invalidate(ghcb);
 		data->ghcb_active = false;
 	}
 }
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 17adc1e79136..3b40fd9dc895 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2564,7 +2564,7 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
 	memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
 }
 
-static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
+static bool sev_es_validate_vmgexit(struct vcpu_svm *svm)
 {
 	struct kvm_vcpu *vcpu;
 	struct ghcb *ghcb;
@@ -2670,7 +2670,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 		goto vmgexit_err;
 	}
 
-	return 0;
+	return true;
 
 vmgexit_err:
 	vcpu = &svm->vcpu;
@@ -2684,13 +2684,16 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 		dump_ghcb(svm);
 	}
 
-	vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
-	vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
-	vcpu->run->internal.ndata = 2;
-	vcpu->run->internal.data[0] = exit_code;
-	vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
+	/* Clear the valid entries fields */
+	memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
 
-	return -EINVAL;
+	ghcb_set_sw_exit_info_1(ghcb, 1);
+	ghcb_set_sw_exit_info_2(ghcb,
+				X86_TRAP_GP |
+				SVM_EVTINJ_TYPE_EXEPT |
+				SVM_EVTINJ_VALID);
+
+	return false;
 }
 
 static void pre_sev_es_run(struct vcpu_svm *svm)
@@ -3360,9 +3363,8 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 
 	exit_code = ghcb_get_sw_exit_code(ghcb);
 
-	ret = sev_es_validate_vmgexit(svm);
-	if (ret)
-		return ret;
+	if (!sev_es_validate_vmgexit(svm))
+		return 1;
 
 	sev_es_sync_from_ghcb(svm);
 	ghcb_set_sw_exit_info_1(ghcb, 0);

Thoughts?

Thanks,
Tom

> should do something like this (and then the guest needs to be updated to set the
> CPL on every VMGEXIT):
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index a9d8d6aafdb8..bb7251e4a3e2 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2058,7 +2058,7 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
>         vcpu->arch.regs[VCPU_REGS_RDX] = ghcb_get_rdx_if_valid(ghcb);
>         vcpu->arch.regs[VCPU_REGS_RSI] = ghcb_get_rsi_if_valid(ghcb);
> 
> -       svm->vmcb->save.cpl = ghcb_get_cpl_if_valid(ghcb);
> +       svm->vmcb->save.cpl = 0;
> 
>         if (ghcb_xcr0_is_valid(ghcb)) {
>                 vcpu->arch.xcr0 = ghcb_get_xcr0(ghcb);
> @@ -2088,6 +2088,10 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
>         if (ghcb->ghcb_usage)
>                 goto vmgexit_err;
> 
> +       /* Ignore VMGEXIT at CPL>0 */
> +       if (!ghcb_cpl_is_valid(ghcb) || ghcb_get_cpl_if_valid(ghcb))
> +               return 1;
> +
>         /*
>          * Retrieve the exit code now even though is may not be marked valid
>          * as it could help with debugging.
> @@ -2142,8 +2146,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
>                 }
>                 break;
>         case SVM_EXIT_VMMCALL:
> -               if (!ghcb_rax_is_valid(ghcb) ||
> -                   !ghcb_cpl_is_valid(ghcb))
> +               if (!ghcb_rax_is_valid(ghcb))
>                         goto vmgexit_err;
>                 break;
>         case SVM_EXIT_RDTSCP:
> 
>> I'm assuming it's meant to make live migration a bit easier?
> 
> Peter, I forget, were these changes necessary for your work, or was the sole root
> cause the emulated MMIO bug in our backport?
> 
> If KVM chugs along happily without these patches, I'd love to pivot and yank out
> all of the CR0/4/8 and EFER trapping/tracking, and then make KVM_GET_SREGS a nop
> as well.
> 

* Re: [PATCH 2/2] KVM: x86: Allow userspace to update tracked sregs for protected guests
  2021-05-10 18:07       ` Tom Lendacky
@ 2021-05-10 21:02         ` Sean Christopherson
  2021-05-10 21:23           ` Tom Lendacky
  0 siblings, 1 reply; 11+ messages in thread
From: Sean Christopherson @ 2021-05-10 21:02 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Gonda, Maxim Levitsky

On Mon, May 10, 2021, Tom Lendacky wrote:
> On 5/10/21 11:10 AM, Sean Christopherson wrote:
> > On Fri, May 07, 2021, Tom Lendacky wrote:
> >> On 5/7/21 11:59 AM, Sean Christopherson wrote:
> >>> Allow userspace to set CR0, CR4, CR8, and EFER via KVM_SET_SREGS for
> >>> protected guests, e.g. for SEV-ES guests with an encrypted VMSA.  KVM
> >>> tracks the aforementioned registers by trapping guest writes, and also
> >>> exposes the values to userspace via KVM_GET_SREGS.  Skipping the regs
> >>> in KVM_SET_SREGS prevents userspace from updating KVM's CPU model to
> >>> match the known hardware state.
> >>
> >> This is very similar to the original patch I had proposed that you were
> >> against :)
> > 
> > I hope/think my position was that it should be unnecessary for KVM to need to
> > know the guest's CR0/4/8 and EFER values, i.e. even the trapping is unnecessary.
> > I was going to say I had a change of heart, as EFER.LMA in particular could
> > still be required to identify 64-bit mode, but that's wrong; EFER.LMA only gets
> > us long mode, the full is_64_bit_mode() needs access to cs.L, which AFAICT isn't
> > provided by #VMGEXIT or trapping.
> 
> Right, that one is missing. If you take a VMGEXIT that uses the GHCB, then
> I think you can assume we're in 64-bit mode.

But that's not technically guaranteed.  The GHCB even seems to imply that there
are scenarios where it's legal/expected to do VMGEXIT with a valid GHCB outside
of 64-bit mode:

  However, instead of issuing a HLT instruction, the AP will issue a VMGEXIT
  with SW_EXITCODE of 0x8000_0004 (this implies that the GHCB was updated prior
  to leaving 64-bit long mode).

In practice, assuming the guest is in 64-bit mode will likely work, especially
since the MSR-based protocol is extremely limited, but ideally there should be
stronger language in the GHCB to define the exact VMM assumptions/behaviors.

On the flip side, that assumption and the limited exposure through the MSR
protocol means trapping CR0, CR4, and EFER is pointless.  I don't see how KVM
can do anything useful with that information outside of VMGEXITs.  Page tables
are encrypted and GPRs are stale; what else could KVM possibly do with
identifying protected mode, paging, and/or 64-bit?

> > Unless I'm missing something, that means that VMGEXIT(VMMCALL) is broken since
> > KVM will incorrectly crush (or preserve) bits 63:32 of GPRs.  I'm guessing no
> > one has reported a bug because either (a) no one has tested a hypercall that
> > requires bits 63:32 in a GPR or (b) the guest just happens to be in 64-bit mode
> > when KVM_SEV_LAUNCH_UPDATE_VMSA is invoked and so the segment registers are
> > frozen to make it appear as if the guest is perpetually in 64-bit mode.
> 
> I don't think it's (b) since the LAUNCH_UPDATE_VMSA is done against reset-
> state vCPUs.
> 
> > 
> > I see that sev_es_validate_vmgexit() checks ghcb_cpl_is_valid(), but isn't that
> > either pointless or indicative of a much, much bigger problem?  If VMGEXIT is
> 
> It is needed for the VMMCALL exit.
> 
> > restricted to CPL0, then the check is pointless.  If VMGEXIT isn't restricted to
> > CPL0, then KVM has a big gaping hole that allows a malicious/broken guest
> > userspace to crash the VM simply by executing VMGEXIT.  Since valid_bitmap is
> > cleared during VMGEXIT handling, I don't think guest userspace can attack/corrupt
> > the guest kernel by doing a replay attack, but it does all but guarantee a
> > VMGEXIT at CPL>0 will be fatal since the required valid bits won't be set.
> 
> Right, so I think some cleanup is needed there, both for the guest and the
> hypervisor:
> 
> - For the guest, we could just clear the valid bitmask before leaving the
>   #VC handler/releasing the GHCB. Userspace can't update the GHCB, so any
>   VMGEXIT from userspace would just look like a no-op with the below
>   change to KVM.

Ah, right, the exit_code and exit infos need to be valid.

> - For KVM, instead of returning -EINVAL from sev_es_validate_vmgexit(), we
>   return the #GP action through the GHCB and continue running the guest.

Agreed, KVM should never kill the guest in response to a bad VMGEXIT.  That
should always be a guest decision.

> > Sadly, the APM doesn't describe the VMGEXIT behavior, nor does any of the SEV-ES
> > documentation I have.  I assume VMGEXIT is recognized at CPL>0 since it morphs
> > to VMMCALL when SEV-ES isn't active.
> 
> Correct.
> 
> > 
> > I.e. either the ghcb_cpl_is_valid() check should be nuked, or more likely KVM
> 
> The ghcb_cpl_is_valid() is still needed to see whether the VMMCALL was
> from userspace or not (a VMMCALL will generate a #VC).

Blech.  I get that the GHCB spec says CPL must be provided/checked for VMMCALL,
but IMO that makes no sense whatsoever.

If the guest restricts the GHCB to CPL0, then the CPL field is pointless because
the VMGEXIT will only ever come from CPL0.  Yes, technically the guest kernel
can proxy a VMMCALL from userspace to the host, but the guest kernel _must_ be
the one to enforce any desired CPL checks because the VMM is untrusted, at least
once you get to SNP.

If the guest exposes the GHCB to any CPL, then the CPL check is worthless because
guest userspace can simply lie about the CPL.  And exposing the GCHB to userspace
completely undermines guest privilege separation since hardware doesn't provide
the real CPL, i.e. the VMM, even it were trusted, can't determine the origin of
the VMGEXIT.


* Re: [PATCH 2/2] KVM: x86: Allow userspace to update tracked sregs for protected guests
  2021-05-10 21:02         ` Sean Christopherson
@ 2021-05-10 21:23           ` Tom Lendacky
  2021-05-10 22:40             ` Sean Christopherson
  0 siblings, 1 reply; 11+ messages in thread
From: Tom Lendacky @ 2021-05-10 21:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Gonda, Maxim Levitsky

On 5/10/21 4:02 PM, Sean Christopherson wrote:
> On Mon, May 10, 2021, Tom Lendacky wrote:
>> On 5/10/21 11:10 AM, Sean Christopherson wrote:
>>> On Fri, May 07, 2021, Tom Lendacky wrote:
>>>> On 5/7/21 11:59 AM, Sean Christopherson wrote:
>>>>> Allow userspace to set CR0, CR4, CR8, and EFER via KVM_SET_SREGS for
>>>>> protected guests, e.g. for SEV-ES guests with an encrypted VMSA.  KVM
>>>>> tracks the aforementioned registers by trapping guest writes, and also
>>>>> exposes the values to userspace via KVM_GET_SREGS.  Skipping the regs
>>>>> in KVM_SET_SREGS prevents userspace from updating KVM's CPU model to
>>>>> match the known hardware state.
>>>>
>>>> This is very similar to the original patch I had proposed that you were
>>>> against :)
>>>
>>> I hope/think my position was that it should be unnecessary for KVM to need to
>>> know the guest's CR0/4/8 and EFER values, i.e. even the trapping is unnecessary.
>>> I was going to say I had a change of heart, as EFER.LMA in particular could
>>> still be required to identify 64-bit mode, but that's wrong; EFER.LMA only gets
>>> us long mode, the full is_64_bit_mode() needs access to cs.L, which AFAICT isn't
>>> provided by #VMGEXIT or trapping.
>>
>> Right, that one is missing. If you take a VMGEXIT that uses the GHCB, then
>> I think you can assume we're in 64-bit mode.
> 
> But that's not technically guaranteed.  The GHCB even seems to imply that there
> are scenarios where it's legal/expected to do VMGEXIT with a valid GHCB outside
> of 64-bit mode:
> 
>   However, instead of issuing a HLT instruction, the AP will issue a VMGEXIT
>   with SW_EXITCODE of 0x8000_0004 (this implies that the GHCB was updated prior
>   to leaving 64-bit long mode).

Right, but in order to fill in the GHCB so that the hypervisor can read
it, the guest had to have been in 64-bit mode. Otherwise, whatever the
guest wrote will be seen as encrypted data and make no sense to the
hypervisor anyway.

> 
> In practice, assuming the guest is in 64-bit mode will likely work, especially
> since the MSR-based protocol is extremely limited, but ideally there should be
> stronger language in the GHCB to define the exact VMM assumptions/behaviors.
> 
> On the flip side, that assumption and the limited exposure through the MSR
> protocol means trapping CR0, CR4, and EFER is pointless.  I don't see how KVM
> can do anything useful with that information outside of VMGEXITs.  Page tables
> are encrypted and GPRs are stale; what else could KVM possibly do with
> identifying protected mode, paging, and/or 64-bit?
> 
>>> Unless I'm missing something, that means that VMGEXIT(VMMCALL) is broken since
>>> KVM will incorrectly crush (or preserve) bits 63:32 of GPRs.  I'm guessing no
>>> one has reported a bug because either (a) no one has tested a hypercall that
>>> requires bits 63:32 in a GPR or (b) the guest just happens to be in 64-bit mode
>>> when KVM_SEV_LAUNCH_UPDATE_VMSA is invoked and so the segment registers are
>>> frozen to make it appear as if the guest is perpetually in 64-bit mode.
>>
>> I don't think it's (b) since the LAUNCH_UPDATE_VMSA is done against reset-
>> state vCPUs.
>>
>>>
>>> I see that sev_es_validate_vmgexit() checks ghcb_cpl_is_valid(), but isn't that
>>> either pointless or indicative of a much, much bigger problem?  If VMGEXIT is
>>
>> It is needed for the VMMCALL exit.
>>
>>> restricted to CPL0, then the check is pointless.  If VMGEXIT isn't restricted to
>>> CPL0, then KVM has a big gaping hole that allows a malicious/broken guest
>>> userspace to crash the VM simply by executing VMGEXIT.  Since valid_bitmap is
>>> cleared during VMGEXIT handling, I don't think guest userspace can attack/corrupt
>>> the guest kernel by doing a replay attack, but it does all but guarantee a
>>> VMGEXIT at CPL>0 will be fatal since the required valid bits won't be set.
>>
>> Right, so I think some cleanup is needed there, both for the guest and the
>> hypervisor:
>>
>> - For the guest, we could just clear the valid bitmask before leaving the
>>   #VC handler/releasing the GHCB. Userspace can't update the GHCB, so any
>>   VMGEXIT from userspace would just look like a no-op with the below
>>   change to KVM.
> 
> Ah, right, the exit_code and exit infos need to be valid.
> 
>> - For KVM, instead of returning -EINVAL from sev_es_validate_vmgexit(), we
>>   return the #GP action through the GHCB and continue running the guest.
> 
> Agreed, KVM should never kill the guest in response to a bad VMGEXIT.  That
> should always be a guest decision.
> 
>>> Sadly, the APM doesn't describe the VMGEXIT behavior, nor does any of the SEV-ES
>>> documentation I have.  I assume VMGEXIT is recognized at CPL>0 since it morphs
>>> to VMMCALL when SEV-ES isn't active.
>>
>> Correct.
>>
>>>
>>> I.e. either the ghcb_cpl_is_valid() check should be nuked, or more likely KVM
>>
>> The ghcb_cpl_is_valid() is still needed to see whether the VMMCALL was
>> from userspace or not (a VMMCALL will generate a #VC).
> 
> Blech.  I get that the GHCB spec says CPL must be provided/checked for VMMCALL,
> but IMO that makes no sense whatsoever.
> 
> If the guest restricts the GHCB to CPL0, then the CPL field is pointless because
> the VMGEXIT will only ever come from CPL0.  Yes, technically the guest kernel
> can proxy a VMMCALL from userspace to the host, but the guest kernel _must_ be
> the one to enforce any desired CPL checks because the VMM is untrusted, at least
> once you get to SNP.
> 
> If the guest exposes the GHCB to any CPL, then the CPL check is worthless because

The GHCB itself is not exposed to any CPL. A VMMCALL will generate a #VC.
The guest #VC handler will extract the CPL level from the context that
generated the #VC (see vc_handle_vmmcall() in arch/x86/kernel/sev-es.c),
so that a VMMCALL from userspace will have the proper CPL value in the
GHCB when the #VC handler issues the VMGEXIT instruction.
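
i.e. roughly:

	static enum es_result vc_handle_vmmcall(struct ghcb *ghcb,
						struct es_em_ctxt *ctxt)
	{
		enum es_result ret;

		ghcb_set_rax(ghcb, ctxt->regs->ax);
		/* CPL comes from the interrupted context, not the GHCB owner. */
		ghcb_set_cpl(ghcb, user_mode(ctxt->regs) ? 3 : 0);

		if (x86_platform.hyper.sev_es_hcall_prepare)
			x86_platform.hyper.sev_es_hcall_prepare(ghcb, ctxt->regs);

		ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_VMMCALL, 0, 0);
		...
	}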

Thanks,
Tom

> guest userspace can simply lie about the CPL.  And exposing the GHCB to userspace
> completely undermines guest privilege separation since hardware doesn't provide
> the real CPL, i.e. the VMM, even if it were trusted, can't determine the origin of
> the VMGEXIT.
> 

* Re: [PATCH 2/2] KVM: x86: Allow userspace to update tracked sregs for protected guests
  2021-05-10 21:23           ` Tom Lendacky
@ 2021-05-10 22:40             ` Sean Christopherson
  0 siblings, 0 replies; 11+ messages in thread
From: Sean Christopherson @ 2021-05-10 22:40 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Gonda, Maxim Levitsky

On Mon, May 10, 2021, Tom Lendacky wrote:
> On 5/10/21 4:02 PM, Sean Christopherson wrote:
> > On Mon, May 10, 2021, Tom Lendacky wrote:
> >> On 5/10/21 11:10 AM, Sean Christopherson wrote:
> >>> On Fri, May 07, 2021, Tom Lendacky wrote:
> >>>> On 5/7/21 11:59 AM, Sean Christopherson wrote:
> >>>>> Allow userspace to set CR0, CR4, CR8, and EFER via KVM_SET_SREGS for
> >>>>> protected guests, e.g. for SEV-ES guests with an encrypted VMSA.  KVM
> >>>>> tracks the aforementioned registers by trapping guest writes, and also
> >>>>> exposes the values to userspace via KVM_GET_SREGS.  Skipping the regs
> >>>>> in KVM_SET_SREGS prevents userspace from updating KVM's CPU model to
> >>>>> match the known hardware state.
> >>>>
> >>>> This is very similar to the original patch I had proposed that you were
> >>>> against :)
> >>>
> >>> I hope/think my position was that it should be unnecessary for KVM to need to
> >>> know the guest's CR0/4/8 and EFER values, i.e. even the trapping is unnecessary.
> >>> I was going to say I had a change of heart, as EFER.LMA in particular could
> >>> still be required to identify 64-bit mode, but that's wrong; EFER.LMA only gets
> >>> us long mode, the full is_64_bit_mode() needs access to cs.L, which AFAICT isn't
> >>> provided by #VMGEXIT or trapping.
> >>
> >> Right, that one is missing. If you take a VMGEXIT that uses the GHCB, then
> >> I think you can assume we're in 64-bit mode.
> > 
> > But that's not technically guaranteed.  The GHCB even seems to imply that there
> > are scenarios where it's legal/expected to do VMGEXIT with a valid GHCB outside
> > of 64-bit mode:
> > 
> >   However, instead of issuing a HLT instruction, the AP will issue a VMGEXIT
> >   with SW_EXITCODE of 0x8000_0004 (this implies that the GHCB was updated prior
> >   to leaving 64-bit long mode).
> 
> Right, but in order to fill in the GHCB so that the hypervisor can read
> it, the guest had to have been in 64-bit mode. Otherwise, whatever the
> guest wrote will be seen as encrypted data and make no sense to the
> hypervisor anyway.

Having once been in 64-bit is not the same thing as currently being in 64-bit.
E.g. if the VMM _knows_ that the vCPU is not in 64-bit, e.g. because it traps
writes to CR0, then ignoring bits 63:32 of GPRs would be entirely reasonable.

In practice it probably won't matter since the guest will have to go out of its
way to do VMGEXIT outside of 64-bit mode with valid data in the GHCB, but ideally
KVM will conform to a spec, not implement a best guess as to what will/won't work
for the guest.

> >>> I.e. either the ghcb_cpl_is_valid() check should be nuked, or more likely KVM
> >>
> >> The ghcb_cpl_is_valid() is still needed to see whether the VMMCALL was
> >> from userspace or not (a VMMCALL will generate a #VC).
> > 
> > Blech.  I get that the GHCB spec says CPL must be provided/checked for VMMCALL,
> > but IMO that makes no sense whatsoever.
> > 
> > If the guest restricts the GHCB to CPL0, then the CPL field is pointless because
> > the VMGEXIT will only ever come from CPL0.  Yes, technically the guest kernel
> > can proxy a VMMCALL from userspace to the host, but the guest kernel _must_ be
> > the one to enforce any desired CPL checks because the VMM is untrusted, at least
> > once you get to SNP.
> > 
> > If the guest exposes the GHCB to any CPL, then the CPL check is worthless because
> 
> The GHCB itself is not exposed to any CPL.

That is a guest decision.  Nothing in the GHCB spec requires the GHCB to be
accessible only in CPL0.  I'm not arguing that any sane guest will actually map
the GHCB into userspace, but from an architectural perspective it's not a given.

> A VMMCALL will generate a #VC.

But that's irrelevant from an architectural perspective.  It is 100% legal to
do VMGEXIT(VMMCALL) without first doing VMMCALL and taking a #VC.

> The guest #VC handler will extract the CPL level from the context that
> generated the #VC (see vc_handle_vmmcall() in arch/x86/kernel/sev-es.c),
> so that a VMMCALL from userspace will have the proper CPL value in the
> GHCB when the #VC handler issues the VMGEXIT instruction.

I get how the CPL ended up as a requirement for VMMCALL in the GHCB, I'm arguing
that it's useless information.  From the guest's perspective, it must handle
privilege checks for those checks to have any meaning with respect to security.
From the VMM's perspective, the checks have zero meaning whatsoever because the
CPL field is software controlled.  The fact that the VMM got a VMGEXIT with
valid data in the GHCB is proof enough that the guest blessed the VMGEXIT.

It's kind of a moot point because the extra CPL doesn't break anything, but it
gives a false sense of security since it implies the guest can/should rely on
the VMM to do privilege enforcement.

Actually, after looking at the Linux guest code, the entire VMMCALL->#VC->VMGEXIT
concept is flawed.  E.g. kvm_sev_es_hcall_prepare() copies over all _possible_
GPRs.  That means users of kvm_hypercall{1,2,3}() are leaking random register
state to the VMM.  That can be "fixed" by zeroing unused registers or reflecting
on regs->ax, but at that point it's far easier and more performant to simply do
VMGEXIT(VMMCALL) directly and skip the #VC.
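
For reference, kvm_sev_es_hcall_prepare() is roughly:

	static void kvm_sev_es_hcall_prepare(struct ghcb *ghcb,
					     struct pt_regs *regs)
	{
		/* RAX and CPL have already been stuffed into the GHCB. */
		ghcb_set_rbx(ghcb, regs->bx);
		ghcb_set_rcx(ghcb, regs->cx);
		ghcb_set_rdx(ghcb, regs->dx);
		ghcb_set_rsi(ghcb, regs->si);
	}

i.e. rbx/rcx/rdx/rsi are marked valid and exposed on every VMMCALL, whether
or not the specific hypercall consumes them.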

As for VMMCALL from userspace, IMO the kernel's default should be to reject them
unconditionally.  KVM and Hyper-V APIs do not allow hypercalls from CPL>0.  At a
glance, neither does Xen (Xen skips the CPL check, but AFAICT only when the vCPU
is in real mode and thus the CPL check is meaningless).  No idea about VMware.
If there is a use case for allowing VMMCALL from userspace, then it absolutely
needs to be handled in a x86_hyper_runtime hook, because otherwise the kernel
has zero clue whether or not the hypercall should be allowed, and if it's
allowed, precisely which registers to expose.  I.e. the hook would have to
expose only the registers needed for that exact hypercall, otherwise the kernel
would unintentionally leak userspace register state.

* Re: [PATCH 2/2] KVM: x86: Allow userspace to update tracked sregs for protected guests
  2021-05-10 16:10     ` Sean Christopherson
  2021-05-10 18:07       ` Tom Lendacky
@ 2021-05-14 14:19       ` Peter Gonda
  1 sibling, 0 replies; 11+ messages in thread
From: Peter Gonda @ 2021-05-14 14:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Tom Lendacky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm list, linux-kernel,
	Maxim Levitsky

On Mon, May 10, 2021 at 10:10 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, May 07, 2021, Tom Lendacky wrote:
> > On 5/7/21 11:59 AM, Sean Christopherson wrote:
> > > Allow userspace to set CR0, CR4, CR8, and EFER via KVM_SET_SREGS for
> > > protected guests, e.g. for SEV-ES guests with an encrypted VMSA.  KVM
> > > tracks the aforementioned registers by trapping guest writes, and also
> > > exposes the values to userspace via KVM_GET_SREGS.  Skipping the regs
> > > in KVM_SET_SREGS prevents userspace from updating KVM's CPU model to
> > > match the known hardware state.
> >
> > This is very similar to the original patch I had proposed that you were
> > against :)
>
> I hope/think my position was that it should be unnecessary for KVM to need to
> know the guest's CR0/4/8 and EFER values, i.e. even the trapping is unnecessary.
> I was going to say I had a change of heart, as EFER.LMA in particular could
> still be required to identify 64-bit mode, but that's wrong; EFER.LMA only gets
> us long mode, the full is_64_bit_mode() needs access to cs.L, which AFAICT isn't
> provided by #VMGEXIT or trapping.
>
> Unless I'm missing something, that means that VMGEXIT(VMMCALL) is broken since
> KVM will incorrectly crush (or preserve) bits 63:32 of GPRs.  I'm guessing no
> one has reported a bug because either (a) no one has tested a hypercall that
> requires bits 63:32 in a GPR or (b) the guest just happens to be in 64-bit mode
> when KVM_SEV_LAUNCH_UPDATE_VMSA is invoked and so the segment registers are
> frozen to make it appear as if the guest is perpetually in 64-bit mode.
>
> I see that sev_es_validate_vmgexit() checks ghcb_cpl_is_valid(), but isn't that
> either pointless or indicative of a much, much bigger problem?  If VMGEXIT is
> restricted to CPL0, then the check is pointless.  If VMGEXIT isn't restricted to
> CPL0, then KVM has a big gaping hole that allows a malicious/broken guest
> userspace to crash the VM simply by executing VMGEXIT.  Since valid_bitmap is
> cleared during VMGEXIT handling, I don't think guest userspace can attack/corrupt
> the guest kernel by doing a replay attack, but it does all but guarantee a
> VMGEXIT at CPL>0 will be fatal since the required valid bits won't be set.
>
> Sadly, the APM doesn't describe the VMGEXIT behavior, nor does any of the SEV-ES
> documentation I have.  I assume VMGEXIT is recognized at CPL>0 since it morphs
> to VMMCALL when SEV-ES isn't active.
>
> I.e. either the ghcb_cpl_is_valid() check should be nuked, or more likely KVM
> should do something like this (and then the guest needs to be updated to set the
> CPL on every VMGEXIT):
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index a9d8d6aafdb8..bb7251e4a3e2 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2058,7 +2058,7 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
>         vcpu->arch.regs[VCPU_REGS_RDX] = ghcb_get_rdx_if_valid(ghcb);
>         vcpu->arch.regs[VCPU_REGS_RSI] = ghcb_get_rsi_if_valid(ghcb);
>
> -       svm->vmcb->save.cpl = ghcb_get_cpl_if_valid(ghcb);
> +       svm->vmcb->save.cpl = 0;
>
>         if (ghcb_xcr0_is_valid(ghcb)) {
>                 vcpu->arch.xcr0 = ghcb_get_xcr0(ghcb);
> @@ -2088,6 +2088,10 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
>         if (ghcb->ghcb_usage)
>                 goto vmgexit_err;
>
> +       /* Ignore VMGEXIT at CPL>0 */
> +       if (!ghcb_cpl_is_valid(ghcb) || ghcb_get_cpl_if_valid(ghcb))
> +               return 1;
> +
>         /*
>          * Retrieve the exit code now even though is may not be marked valid
>          * as it could help with debugging.
> @@ -2142,8 +2146,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
>                 }
>                 break;
>         case SVM_EXIT_VMMCALL:
> -               if (!ghcb_rax_is_valid(ghcb) ||
> -                   !ghcb_cpl_is_valid(ghcb))
> +               if (!ghcb_rax_is_valid(ghcb))
>                         goto vmgexit_err;
>                 break;
>         case SVM_EXIT_RDTSCP:
>
> > I'm assuming it's meant to make live migration a bit easier?
>
> Peter, I forget, were these changes necessary for your work, or was the sole root
> cause the emulated MMIO bug in our backport?
>
> If KVM chugs along happily without these patches, I'd love to pivot and yank out
> all of the CR0/4/8 and EFER trapping/tracking, and then make KVM_GET_SREGS a nop
> as well.

Let me look at if these changes are necessary for our SEV-ES copyless
migration. My initial thoughts are that we still need CR8 trapping and
setting/getting since it's not stored in the VMSA. But I don't think
we'll need the others.
