linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 1/3] KVM: X86: Fix operand/address-size during instruction decoding
@ 2017-11-03  0:50 Wanpeng Li
  2017-11-03  0:50 ` [PATCH v5 2/3] KVM: nVMX: Validate the IA32_BNDCFGS on nested VM-entry Wanpeng Li
  2017-11-03  0:50 ` [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure Wanpeng Li
  0 siblings, 2 replies; 17+ messages in thread
From: Wanpeng Li @ 2017-11-03  0:50 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Radim Krčmář,
	Wanpeng Li, Nadav Amit, Pedro Fonseca

From: Wanpeng Li <wanpeng.li@hotmail.com>

Pedro reported:
  During tests that we conducted on KVM, we noticed that executing a "PUSH %ES"
  instruction under KVM produces different results on both memory and the SP
  register depending on whether EPT support is enabled. With EPT the SP is
  reduced by 4 bytes (and the written value is 0-padded) but without EPT support
  it is only reduced by 2 bytes. The difference can be observed when the CS.DB
  field is 1 (32-bit) but not when it's 0 (16-bit).

The internal segment descriptor cache exist even in real/vm8096 mode. The CS.D 
also should be respected instead of just default operand/address-size/66H 
prefix/67H prefix during instruction decoding. This patch fixes it by also 
adjusting operand/address-size according to CS.D.

Reported-by: Pedro Fonseca <pfonseca@cs.washington.edu>
Tested-by: Pedro Fonseca <pfonseca@cs.washington.edu>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Pedro Fonseca <pfonseca@cs.washington.edu>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
v4 -> v5:
 * cleanup patch subject/description
v3 -> v4:
 * def_ad_bytes must be changed to 4
 * separate X86EMUL_MODE_PROT16 altogether from the others
v2 -> v3:
 * cleanup the codes 
v1 -> v2:
 * respect cs.d for real/vm8096, other modes have already 
   been considered in init_emulate_ctxt().

 arch/x86/kvm/emulate.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8079d14..b4a87de 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -5000,6 +5000,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 	bool op_prefix = false;
 	bool has_seg_override = false;
 	struct opcode opcode;
+	u16 dummy;
+	struct desc_struct desc;
 
 	ctxt->memop.type = OP_NONE;
 	ctxt->memopp = NULL;
@@ -5018,6 +5020,11 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 	switch (mode) {
 	case X86EMUL_MODE_REAL:
 	case X86EMUL_MODE_VM86:
+		def_op_bytes = def_ad_bytes = 2;
+		ctxt->ops->get_segment(ctxt, &dummy, &desc, NULL, VCPU_SREG_CS);
+		if (desc.d)
+			def_op_bytes = def_ad_bytes = 4;
+		break;
 	case X86EMUL_MODE_PROT16:
 		def_op_bytes = def_ad_bytes = 2;
 		break;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v5 2/3] KVM: nVMX: Validate the IA32_BNDCFGS on nested VM-entry
  2017-11-03  0:50 [PATCH v5 1/3] KVM: X86: Fix operand/address-size during instruction decoding Wanpeng Li
@ 2017-11-03  0:50 ` Wanpeng Li
       [not found]   ` <0b1d82f7-2fc6-9fc0-15a4-3500413814bd@oracle.com>
  2017-11-03  0:50 ` [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure Wanpeng Li
  1 sibling, 1 reply; 17+ messages in thread
From: Wanpeng Li @ 2017-11-03  0:50 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Radim Krčmář, Wanpeng Li, Jim Mattson

From: Wanpeng Li <wanpeng.li@hotmail.com>

According to the SDM, if the "load IA32_BNDCFGS" VM-entry controls is 1, the
following checks are performed on the field for the IA32_BNDCFGS MSR:
 - Bits reserved in the IA32_BNDCFGS MSR must be 0.
 - The linear address in bits 63:12 must be canonical.

Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
v3 -> v4:
 * simply condition
 * use && instead of nested "if"s

 arch/x86/kvm/vmx.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e6c8ffa..6cf3972 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10805,6 +10805,11 @@ static int check_vmentry_postreqs(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 			return 1;
 	}
 
+	if (kvm_mpx_supported() &&
+		(is_noncanonical_address(vmcs12->guest_bndcfgs & PAGE_MASK, vcpu) ||
+		(vmcs12->guest_bndcfgs & MSR_IA32_BNDCFGS_RSVD)))
+			return 1;
+
 	return 0;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  2017-11-03  0:50 [PATCH v5 1/3] KVM: X86: Fix operand/address-size during instruction decoding Wanpeng Li
  2017-11-03  0:50 ` [PATCH v5 2/3] KVM: nVMX: Validate the IA32_BNDCFGS on nested VM-entry Wanpeng Li
@ 2017-11-03  0:50 ` Wanpeng Li
  2017-11-03 16:01   ` Jim Mattson
  2017-11-04  0:07   ` Krish Sadhukhan
  1 sibling, 2 replies; 17+ messages in thread
From: Wanpeng Li @ 2017-11-03  0:50 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Radim Krčmář, Wanpeng Li, Jim Mattson

From: Wanpeng Li <wanpeng.li@hotmail.com>

Commit 4f350c6dbcb (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure 
properly) can result in L1(run kvm-unit-tests/run_tests.sh vmx_controls in L1) 
null pointer deference and also L0 calltrace when EPT=0 on both L0 and L1.

In L1:

BUG: unable to handle kernel paging request at ffffffffc015bf8f
 IP: vmx_vcpu_run+0x202/0x510 [kvm_intel]
 PGD 146e13067 P4D 146e13067 PUD 146e15067 PMD 3d2686067 PTE 3d4af9161
 Oops: 0003 [#1] PREEMPT SMP
 CPU: 2 PID: 1798 Comm: qemu-system-x86 Not tainted 4.14.0-rc4+ #6
 RIP: 0010:vmx_vcpu_run+0x202/0x510 [kvm_intel]
 Call Trace:
 WARNING: kernel stack frame pointer at ffffb86f4988bc18 in qemu-system-x86:1798 has bad value 0000000000000002

In L0:

-----------[ cut here ]------------
 WARNING: CPU: 6 PID: 4460 at /home/kernel/linux/arch/x86/kvm//vmx.c:9845 vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
 CPU: 6 PID: 4460 Comm: qemu-system-x86 Tainted: G           OE   4.14.0-rc7+ #25
 RIP: 0010:vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
 Call Trace:
  paging64_page_fault+0x500/0xde0 [kvm]
  ? paging32_gva_to_gpa_nested+0x120/0x120 [kvm]
  ? nonpaging_page_fault+0x3b0/0x3b0 [kvm]
  ? __asan_storeN+0x12/0x20
  ? paging64_gva_to_gpa+0xb0/0x120 [kvm]
  ? paging64_walk_addr_generic+0x11a0/0x11a0 [kvm]
  ? lock_acquire+0x2c0/0x2c0
  ? vmx_read_guest_seg_ar+0x97/0x100 [kvm_intel]
  ? vmx_get_segment+0x2a6/0x310 [kvm_intel]
  ? sched_clock+0x1f/0x30
  ? check_chain_key+0x137/0x1e0
  ? __lock_acquire+0x83c/0x2420
  ? kvm_multiple_exception+0xf2/0x220 [kvm]
  ? debug_check_no_locks_freed+0x240/0x240
  ? debug_smp_processor_id+0x17/0x20
  ? __lock_is_held+0x9e/0x100
  kvm_mmu_page_fault+0x90/0x180 [kvm]
  kvm_handle_page_fault+0x15c/0x310 [kvm]
  ? __lock_is_held+0x9e/0x100
  handle_exception+0x3c7/0x4d0 [kvm_intel]
  vmx_handle_exit+0x103/0x1010 [kvm_intel]
  ? kvm_arch_vcpu_ioctl_run+0x1628/0x2e20 [kvm]

The commit avoids to load host state of vmcs12 as vmcs01's guest state 
since vmcs12 is not modified (except for the VM-instruction error field)
if the checking of vmcs control area fails. However, the mmu context is 
switched to nested mmu in prepare_vmcs02() and it will not be reloaded 
since load_vmcs12_host_state() is skipped when nested VMLAUNCH/VMRESUME 
fails. This patch fixes it by reloading mmu context when nested 
VMLAUNCH/VMRESUME fails.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
v3 -> v4:
 * move it to a new function load_vmcs12_mmu_host_state

 arch/x86/kvm/vmx.c | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6cf3972..8aefb91 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -11259,6 +11259,24 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	kvm_clear_interrupt_queue(vcpu);
 }
 
+static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
+			struct vmcs12 *vmcs12)
+{
+	u32 entry_failure_code;
+
+	nested_ept_uninit_mmu_context(vcpu);
+
+	/*
+	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
+	 * couldn't have changed.
+	 */
+	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
+		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
+
+	if (!enable_ept)
+		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
+}
+
 /*
  * A part of what we need to when the nested L2 guest exits and we want to
  * run its L1 parent, is to reset L1's guest state to the host state specified
@@ -11272,7 +11290,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 				   struct vmcs12 *vmcs12)
 {
 	struct kvm_segment seg;
-	u32 entry_failure_code;
 
 	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
 		vcpu->arch.efer = vmcs12->host_ia32_efer;
@@ -11299,17 +11316,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
 	vmx_set_cr4(vcpu, vmcs12->host_cr4);
 
-	nested_ept_uninit_mmu_context(vcpu);
-
-	/*
-	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
-	 * couldn't have changed.
-	 */
-	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
-		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
-
-	if (!enable_ept)
-		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
+	load_vmcs12_mmu_host_state(vcpu, vmcs12);
 
 	if (enable_vpid) {
 		/*
@@ -11539,6 +11546,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
 	 * accordingly.
 	 */
 	nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
+
+	load_vmcs12_mmu_host_state(vcpu, vmcs12);
+
 	/*
 	 * The emulated instruction was already skipped in
 	 * nested_vmx_run, but the updated RIP was never
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 2/3] KVM: nVMX: Validate the IA32_BNDCFGS on nested VM-entry
       [not found]   ` <0b1d82f7-2fc6-9fc0-15a4-3500413814bd@oracle.com>
@ 2017-11-03  6:40     ` Wanpeng Li
  2017-11-03 17:13       ` Krish Sadhukhan
  0 siblings, 1 reply; 17+ messages in thread
From: Wanpeng Li @ 2017-11-03  6:40 UTC (permalink / raw)
  To: Krish Sadhukhan
  Cc: Paolo Bonzini, Radim Krcmar, kvm, linux-kernel, Jim Mattson

2017-11-03 14:31 GMT+08:00 Krish Sadhukhan <krish.sadhukhan@oracle.com>:
>
>
> On 11/02/2017 05:50 PM, Wanpeng Li wrote:
>>
>> From: Wanpeng Li <wanpeng.li@hotmail.com>
>>
>> According to the SDM, if the "load IA32_BNDCFGS" VM-entry controls is 1,
>> the
>> following checks are performed on the field for the IA32_BNDCFGS MSR:
>>   - Bits reserved in the IA32_BNDCFGS MSR must be 0.
>>   - The linear address in bits 63:12 must be canonical.
>>
>> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>> Cc: Jim Mattson <jmattson@google.com>
>> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
>> ---
>> v3 -> v4:
>>   * simply condition
>>   * use && instead of nested "if"s
>>
>>   arch/x86/kvm/vmx.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index e6c8ffa..6cf3972 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -10805,6 +10805,11 @@ static int check_vmentry_postreqs(struct kvm_vcpu
>> *vcpu, struct vmcs12 *vmcs12,
>>                         return 1;
>>         }
>>   +     if (kvm_mpx_supported() &&
>> +               (is_noncanonical_address(vmcs12->guest_bndcfgs &
>> PAGE_MASK, vcpu) ||
>> +               (vmcs12->guest_bndcfgs & MSR_IA32_BNDCFGS_RSVD)))
>> +                       return 1;
>> +
>>         return 0;
>>   }
>>
>
> Hi Wanpeng,
>   The SDM check is performed only when "load IA32_BNDCFGS" VM-entry control
> is 1. But vmx_mpx_supported() returns true when both "load IA32_BNDCFGS" and
> "store IA32_BNDCFGS" VM-entry controls are 1. Therefore your check is
> performed when both controls are 1. Did I miss something here ?

https://lkml.org/lkml/2017/11/2/748 Paolo hopes the simplification.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  2017-11-03  0:50 ` [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure Wanpeng Li
@ 2017-11-03 16:01   ` Jim Mattson
  2017-11-04  0:07   ` Krish Sadhukhan
  1 sibling, 0 replies; 17+ messages in thread
From: Jim Mattson @ 2017-11-03 16:01 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: LKML, kvm list, Paolo Bonzini, Radim Krčmář, Wanpeng Li

That seems reasonable to me. Thanks for the fix.

Reviewed-by: Jim Mattson <jmattson@google.com>

On Thu, Nov 2, 2017 at 5:50 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
> From: Wanpeng Li <wanpeng.li@hotmail.com>
>
> Commit 4f350c6dbcb (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure
> properly) can result in L1(run kvm-unit-tests/run_tests.sh vmx_controls in L1)
> null pointer deference and also L0 calltrace when EPT=0 on both L0 and L1.
>
> In L1:
>
> BUG: unable to handle kernel paging request at ffffffffc015bf8f
>  IP: vmx_vcpu_run+0x202/0x510 [kvm_intel]
>  PGD 146e13067 P4D 146e13067 PUD 146e15067 PMD 3d2686067 PTE 3d4af9161
>  Oops: 0003 [#1] PREEMPT SMP
>  CPU: 2 PID: 1798 Comm: qemu-system-x86 Not tainted 4.14.0-rc4+ #6
>  RIP: 0010:vmx_vcpu_run+0x202/0x510 [kvm_intel]
>  Call Trace:
>  WARNING: kernel stack frame pointer at ffffb86f4988bc18 in qemu-system-x86:1798 has bad value 0000000000000002
>
> In L0:
>
> -----------[ cut here ]------------
>  WARNING: CPU: 6 PID: 4460 at /home/kernel/linux/arch/x86/kvm//vmx.c:9845 vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
>  CPU: 6 PID: 4460 Comm: qemu-system-x86 Tainted: G           OE   4.14.0-rc7+ #25
>  RIP: 0010:vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
>  Call Trace:
>   paging64_page_fault+0x500/0xde0 [kvm]
>   ? paging32_gva_to_gpa_nested+0x120/0x120 [kvm]
>   ? nonpaging_page_fault+0x3b0/0x3b0 [kvm]
>   ? __asan_storeN+0x12/0x20
>   ? paging64_gva_to_gpa+0xb0/0x120 [kvm]
>   ? paging64_walk_addr_generic+0x11a0/0x11a0 [kvm]
>   ? lock_acquire+0x2c0/0x2c0
>   ? vmx_read_guest_seg_ar+0x97/0x100 [kvm_intel]
>   ? vmx_get_segment+0x2a6/0x310 [kvm_intel]
>   ? sched_clock+0x1f/0x30
>   ? check_chain_key+0x137/0x1e0
>   ? __lock_acquire+0x83c/0x2420
>   ? kvm_multiple_exception+0xf2/0x220 [kvm]
>   ? debug_check_no_locks_freed+0x240/0x240
>   ? debug_smp_processor_id+0x17/0x20
>   ? __lock_is_held+0x9e/0x100
>   kvm_mmu_page_fault+0x90/0x180 [kvm]
>   kvm_handle_page_fault+0x15c/0x310 [kvm]
>   ? __lock_is_held+0x9e/0x100
>   handle_exception+0x3c7/0x4d0 [kvm_intel]
>   vmx_handle_exit+0x103/0x1010 [kvm_intel]
>   ? kvm_arch_vcpu_ioctl_run+0x1628/0x2e20 [kvm]
>
> The commit avoids to load host state of vmcs12 as vmcs01's guest state
> since vmcs12 is not modified (except for the VM-instruction error field)
> if the checking of vmcs control area fails. However, the mmu context is
> switched to nested mmu in prepare_vmcs02() and it will not be reloaded
> since load_vmcs12_host_state() is skipped when nested VMLAUNCH/VMRESUME
> fails. This patch fixes it by reloading mmu context when nested
> VMLAUNCH/VMRESUME fails.
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Jim Mattson <jmattson@google.com>
> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
> ---
> v3 -> v4:
>  * move it to a new function load_vmcs12_mmu_host_state
>
>  arch/x86/kvm/vmx.c | 34 ++++++++++++++++++++++------------
>  1 file changed, 22 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 6cf3972..8aefb91 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -11259,6 +11259,24 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>         kvm_clear_interrupt_queue(vcpu);
>  }
>
> +static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
> +                       struct vmcs12 *vmcs12)
> +{
> +       u32 entry_failure_code;
> +
> +       nested_ept_uninit_mmu_context(vcpu);
> +
> +       /*
> +        * Only PDPTE load can fail as the value of cr3 was checked on entry and
> +        * couldn't have changed.
> +        */
> +       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> +               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> +
> +       if (!enable_ept)
> +               vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> +}
> +
>  /*
>   * A part of what we need to when the nested L2 guest exits and we want to
>   * run its L1 parent, is to reset L1's guest state to the host state specified
> @@ -11272,7 +11290,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>                                    struct vmcs12 *vmcs12)
>  {
>         struct kvm_segment seg;
> -       u32 entry_failure_code;
>
>         if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
>                 vcpu->arch.efer = vmcs12->host_ia32_efer;
> @@ -11299,17 +11316,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>         vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
>         vmx_set_cr4(vcpu, vmcs12->host_cr4);
>
> -       nested_ept_uninit_mmu_context(vcpu);
> -
> -       /*
> -        * Only PDPTE load can fail as the value of cr3 was checked on entry and
> -        * couldn't have changed.
> -        */
> -       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> -               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> -
> -       if (!enable_ept)
> -               vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> +       load_vmcs12_mmu_host_state(vcpu, vmcs12);
>
>         if (enable_vpid) {
>                 /*
> @@ -11539,6 +11546,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
>          * accordingly.
>          */
>         nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
> +
> +       load_vmcs12_mmu_host_state(vcpu, vmcs12);
> +
>         /*
>          * The emulated instruction was already skipped in
>          * nested_vmx_run, but the updated RIP was never
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 2/3] KVM: nVMX: Validate the IA32_BNDCFGS on nested VM-entry
  2017-11-03  6:40     ` Wanpeng Li
@ 2017-11-03 17:13       ` Krish Sadhukhan
  2017-11-03 17:54         ` Jim Mattson
  0 siblings, 1 reply; 17+ messages in thread
From: Krish Sadhukhan @ 2017-11-03 17:13 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: Paolo Bonzini, Radim Krcmar, kvm, linux-kernel, Jim Mattson



On 11/02/2017 11:40 PM, Wanpeng Li wrote:
> 2017-11-03 14:31 GMT+08:00 Krish Sadhukhan <krish.sadhukhan@oracle.com>:
>>
>> On 11/02/2017 05:50 PM, Wanpeng Li wrote:
>>> From: Wanpeng Li <wanpeng.li@hotmail.com>
>>>
>>> According to the SDM, if the "load IA32_BNDCFGS" VM-entry controls is 1,
>>> the
>>> following checks are performed on the field for the IA32_BNDCFGS MSR:
>>>    - Bits reserved in the IA32_BNDCFGS MSR must be 0.
>>>    - The linear address in bits 63:12 must be canonical.
>>>
>>> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>>> Cc: Jim Mattson <jmattson@google.com>
>>> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
>>> ---
>>> v3 -> v4:
>>>    * simply condition
>>>    * use && instead of nested "if"s
>>>
>>>    arch/x86/kvm/vmx.c | 5 +++++
>>>    1 file changed, 5 insertions(+)
>>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index e6c8ffa..6cf3972 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -10805,6 +10805,11 @@ static int check_vmentry_postreqs(struct kvm_vcpu
>>> *vcpu, struct vmcs12 *vmcs12,
>>>                          return 1;
>>>          }
>>>    +     if (kvm_mpx_supported() &&
>>> +               (is_noncanonical_address(vmcs12->guest_bndcfgs &
>>> PAGE_MASK, vcpu) ||
>>> +               (vmcs12->guest_bndcfgs & MSR_IA32_BNDCFGS_RSVD)))
>>> +                       return 1;
>>> +
>>>          return 0;
>>>    }
>>>
>> Hi Wanpeng,
>>    The SDM check is performed only when "load IA32_BNDCFGS" VM-entry control
>> is 1. But vmx_mpx_supported() returns true when both "load IA32_BNDCFGS" and
>> "store IA32_BNDCFGS" VM-entry controls are 1. Therefore your check is
>> performed when both controls are 1. Did I miss something here ?
> https://lkml.org/lkml/2017/11/2/748 Paolo hopes the simplification.
>
> Regards,
> Wanpeng Li
I got that simplification and your changes look good to me.


However, I am still curious know the reason why vmx_mpx_supported() 
returns true only when both controls are true whereas the SDM states the 
following:

    "IA32_BNDCFGS (64 bits). This field is supported only on processors 
that support either the 1-setting of the  “load IA32_BNDCFGS” VM-entry 
control or that of the “clear IA32_BNDCFGS” VM-exit control."

Thanks,
Krish

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 2/3] KVM: nVMX: Validate the IA32_BNDCFGS on nested VM-entry
  2017-11-03 17:13       ` Krish Sadhukhan
@ 2017-11-03 17:54         ` Jim Mattson
  0 siblings, 0 replies; 17+ messages in thread
From: Jim Mattson @ 2017-11-03 17:54 UTC (permalink / raw)
  To: Krish Sadhukhan
  Cc: Wanpeng Li, Paolo Bonzini, Radim Krcmar, kvm, linux-kernel

KVM chooses not to support MPX in the guest unless both of these
control bits are supported by the platform.

On Fri, Nov 3, 2017 at 10:13 AM, Krish Sadhukhan
<krish.sadhukhan@oracle.com> wrote:
>
>
> On 11/02/2017 11:40 PM, Wanpeng Li wrote:
>>
>> 2017-11-03 14:31 GMT+08:00 Krish Sadhukhan <krish.sadhukhan@oracle.com>:
>>>
>>>
>>> On 11/02/2017 05:50 PM, Wanpeng Li wrote:
>>>>
>>>> From: Wanpeng Li <wanpeng.li@hotmail.com>
>>>>
>>>> According to the SDM, if the "load IA32_BNDCFGS" VM-entry controls is 1,
>>>> the
>>>> following checks are performed on the field for the IA32_BNDCFGS MSR:
>>>>    - Bits reserved in the IA32_BNDCFGS MSR must be 0.
>>>>    - The linear address in bits 63:12 must be canonical.
>>>>
>>>> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>>>> Cc: Jim Mattson <jmattson@google.com>
>>>> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
>>>> ---
>>>> v3 -> v4:
>>>>    * simply condition
>>>>    * use && instead of nested "if"s
>>>>
>>>>    arch/x86/kvm/vmx.c | 5 +++++
>>>>    1 file changed, 5 insertions(+)
>>>>
>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>> index e6c8ffa..6cf3972 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -10805,6 +10805,11 @@ static int check_vmentry_postreqs(struct
>>>> kvm_vcpu
>>>> *vcpu, struct vmcs12 *vmcs12,
>>>>                          return 1;
>>>>          }
>>>>    +     if (kvm_mpx_supported() &&
>>>> +               (is_noncanonical_address(vmcs12->guest_bndcfgs &
>>>> PAGE_MASK, vcpu) ||
>>>> +               (vmcs12->guest_bndcfgs & MSR_IA32_BNDCFGS_RSVD)))
>>>> +                       return 1;
>>>> +
>>>>          return 0;
>>>>    }
>>>>
>>> Hi Wanpeng,
>>>    The SDM check is performed only when "load IA32_BNDCFGS" VM-entry
>>> control
>>> is 1. But vmx_mpx_supported() returns true when both "load IA32_BNDCFGS"
>>> and
>>> "store IA32_BNDCFGS" VM-entry controls are 1. Therefore your check is
>>> performed when both controls are 1. Did I miss something here ?
>>
>> https://lkml.org/lkml/2017/11/2/748 Paolo hopes the simplification.
>>
>> Regards,
>> Wanpeng Li
>
> I got that simplification and your changes look good to me.
>
>
> However, I am still curious know the reason why vmx_mpx_supported() returns
> true only when both controls are true whereas the SDM states the following:
>
>    "IA32_BNDCFGS (64 bits). This field is supported only on processors that
> support either the 1-setting of the  “load IA32_BNDCFGS” VM-entry control or
> that of the “clear IA32_BNDCFGS” VM-exit control."
>
> Thanks,
> Krish

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  2017-11-03  0:50 ` [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure Wanpeng Li
  2017-11-03 16:01   ` Jim Mattson
@ 2017-11-04  0:07   ` Krish Sadhukhan
  2017-11-08 21:47     ` Jim Mattson
  1 sibling, 1 reply; 17+ messages in thread
From: Krish Sadhukhan @ 2017-11-04  0:07 UTC (permalink / raw)
  To: Wanpeng Li, linux-kernel, kvm
  Cc: Paolo Bonzini, Radim Krčmář, Wanpeng Li, Jim Mattson

On 11/02/2017 05:50 PM, Wanpeng Li wrote:

> From: Wanpeng Li <wanpeng.li@hotmail.com>
>
> Commit 4f350c6dbcb (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure
> properly) can result in L1(run kvm-unit-tests/run_tests.sh vmx_controls in L1)
> null pointer deference and also L0 calltrace when EPT=0 on both L0 and L1.
>
> In L1:
>
> BUG: unable to handle kernel paging request at ffffffffc015bf8f
>   IP: vmx_vcpu_run+0x202/0x510 [kvm_intel]
>   PGD 146e13067 P4D 146e13067 PUD 146e15067 PMD 3d2686067 PTE 3d4af9161
>   Oops: 0003 [#1] PREEMPT SMP
>   CPU: 2 PID: 1798 Comm: qemu-system-x86 Not tainted 4.14.0-rc4+ #6
>   RIP: 0010:vmx_vcpu_run+0x202/0x510 [kvm_intel]
>   Call Trace:
>   WARNING: kernel stack frame pointer at ffffb86f4988bc18 in qemu-system-x86:1798 has bad value 0000000000000002
>
> In L0:
>
> -----------[ cut here ]------------
>   WARNING: CPU: 6 PID: 4460 at /home/kernel/linux/arch/x86/kvm//vmx.c:9845 vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
>   CPU: 6 PID: 4460 Comm: qemu-system-x86 Tainted: G           OE   4.14.0-rc7+ #25
>   RIP: 0010:vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
>   Call Trace:
>    paging64_page_fault+0x500/0xde0 [kvm]
>    ? paging32_gva_to_gpa_nested+0x120/0x120 [kvm]
>    ? nonpaging_page_fault+0x3b0/0x3b0 [kvm]
>    ? __asan_storeN+0x12/0x20
>    ? paging64_gva_to_gpa+0xb0/0x120 [kvm]
>    ? paging64_walk_addr_generic+0x11a0/0x11a0 [kvm]
>    ? lock_acquire+0x2c0/0x2c0
>    ? vmx_read_guest_seg_ar+0x97/0x100 [kvm_intel]
>    ? vmx_get_segment+0x2a6/0x310 [kvm_intel]
>    ? sched_clock+0x1f/0x30
>    ? check_chain_key+0x137/0x1e0
>    ? __lock_acquire+0x83c/0x2420
>    ? kvm_multiple_exception+0xf2/0x220 [kvm]
>    ? debug_check_no_locks_freed+0x240/0x240
>    ? debug_smp_processor_id+0x17/0x20
>    ? __lock_is_held+0x9e/0x100
>    kvm_mmu_page_fault+0x90/0x180 [kvm]
>    kvm_handle_page_fault+0x15c/0x310 [kvm]
>    ? __lock_is_held+0x9e/0x100
>    handle_exception+0x3c7/0x4d0 [kvm_intel]
>    vmx_handle_exit+0x103/0x1010 [kvm_intel]
>    ? kvm_arch_vcpu_ioctl_run+0x1628/0x2e20 [kvm]
>
> The commit avoids to load host state of vmcs12 as vmcs01's guest state
> since vmcs12 is not modified (except for the VM-instruction error field)
> if the checking of vmcs control area fails. However, the mmu context is
> switched to nested mmu in prepare_vmcs02() and it will not be reloaded
> since load_vmcs12_host_state() is skipped when nested VMLAUNCH/VMRESUME
> fails. This patch fixes it by reloading mmu context when nested
> VMLAUNCH/VMRESUME fails.
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Jim Mattson <jmattson@google.com>
> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
> ---
> v3 -> v4:
>   * move it to a new function load_vmcs12_mmu_host_state
>
>   arch/x86/kvm/vmx.c | 34 ++++++++++++++++++++++------------
>   1 file changed, 22 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 6cf3972..8aefb91 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -11259,6 +11259,24 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>   	kvm_clear_interrupt_queue(vcpu);
>   }
>   
> +static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
> +			struct vmcs12 *vmcs12)
> +{
> +	u32 entry_failure_code;
> +
> +	nested_ept_uninit_mmu_context(vcpu);
> +
> +	/*
> +	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
> +	 * couldn't have changed.
> +	 */
> +	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> +		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> +
> +	if (!enable_ept)
> +		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> +}
> +
>   /*
>    * A part of what we need to when the nested L2 guest exits and we want to
>    * run its L1 parent, is to reset L1's guest state to the host state specified
> @@ -11272,7 +11290,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>   				   struct vmcs12 *vmcs12)
>   {
>   	struct kvm_segment seg;
> -	u32 entry_failure_code;
>   
>   	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
>   		vcpu->arch.efer = vmcs12->host_ia32_efer;
> @@ -11299,17 +11316,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>   	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
>   	vmx_set_cr4(vcpu, vmcs12->host_cr4);
>   
> -	nested_ept_uninit_mmu_context(vcpu);
> -
> -	/*
> -	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
> -	 * couldn't have changed.
> -	 */
> -	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> -		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> -
> -	if (!enable_ept)
> -		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> +	load_vmcs12_mmu_host_state(vcpu, vmcs12);
>   
>   	if (enable_vpid) {
>   		/*
> @@ -11539,6 +11546,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
>   	 * accordingly.
>   	 */
>   	nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
> +
> +	load_vmcs12_mmu_host_state(vcpu, vmcs12);
> +
>   	/*
>   	 * The emulated instruction was already skipped in
>   	 * nested_vmx_run, but the updated RIP was never
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  2017-11-04  0:07   ` Krish Sadhukhan
@ 2017-11-08 21:47     ` Jim Mattson
  2017-11-09  0:37       ` Wanpeng Li
  2018-02-05 18:44       ` Jim Mattson
  0 siblings, 2 replies; 17+ messages in thread
From: Jim Mattson @ 2017-11-08 21:47 UTC (permalink / raw)
  To: Krish Sadhukhan
  Cc: Wanpeng Li, LKML, kvm list, Paolo Bonzini,
	Radim Krčmář,
	Wanpeng Li

I realize now that there are actually many other problems with
deferring some control field checks to the hardware VM-entry of
vmcs02. When there is an invalid control field, the vCPU should just
fall through to the next instruction, without any state modifiation
other than the ALU flags and the VM-instruction error field of the
current VMCS. However, in preparation for the hardware VM-entry of
vmcs02, we have already changed quite a bit of the vCPU state: the
MSRs on the VM-entry MSR-load list, DR7, IA32_DEBUGCTL, the entire
FLAGS register, etc. All of these changes should be undone, and we're
not prepared to do that. (For instance, what was the old DR7 value
that needs to be restored?)

On Fri, Nov 3, 2017 at 5:07 PM, Krish Sadhukhan
<krish.sadhukhan@oracle.com> wrote:
> On 11/02/2017 05:50 PM, Wanpeng Li wrote:
>
>> From: Wanpeng Li <wanpeng.li@hotmail.com>
>>
>> Commit 4f350c6dbcb (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME
>> failure
>> properly) can result in L1(run kvm-unit-tests/run_tests.sh vmx_controls in
>> L1)
>> null pointer deference and also L0 calltrace when EPT=0 on both L0 and L1.
>>
>> In L1:
>>
>> BUG: unable to handle kernel paging request at ffffffffc015bf8f
>>   IP: vmx_vcpu_run+0x202/0x510 [kvm_intel]
>>   PGD 146e13067 P4D 146e13067 PUD 146e15067 PMD 3d2686067 PTE 3d4af9161
>>   Oops: 0003 [#1] PREEMPT SMP
>>   CPU: 2 PID: 1798 Comm: qemu-system-x86 Not tainted 4.14.0-rc4+ #6
>>   RIP: 0010:vmx_vcpu_run+0x202/0x510 [kvm_intel]
>>   Call Trace:
>>   WARNING: kernel stack frame pointer at ffffb86f4988bc18 in
>> qemu-system-x86:1798 has bad value 0000000000000002
>>
>> In L0:
>>
>> -----------[ cut here ]------------
>>   WARNING: CPU: 6 PID: 4460 at /home/kernel/linux/arch/x86/kvm//vmx.c:9845
>> vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
>>   CPU: 6 PID: 4460 Comm: qemu-system-x86 Tainted: G           OE
>> 4.14.0-rc7+ #25
>>   RIP: 0010:vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
>>   Call Trace:
>>    paging64_page_fault+0x500/0xde0 [kvm]
>>    ? paging32_gva_to_gpa_nested+0x120/0x120 [kvm]
>>    ? nonpaging_page_fault+0x3b0/0x3b0 [kvm]
>>    ? __asan_storeN+0x12/0x20
>>    ? paging64_gva_to_gpa+0xb0/0x120 [kvm]
>>    ? paging64_walk_addr_generic+0x11a0/0x11a0 [kvm]
>>    ? lock_acquire+0x2c0/0x2c0
>>    ? vmx_read_guest_seg_ar+0x97/0x100 [kvm_intel]
>>    ? vmx_get_segment+0x2a6/0x310 [kvm_intel]
>>    ? sched_clock+0x1f/0x30
>>    ? check_chain_key+0x137/0x1e0
>>    ? __lock_acquire+0x83c/0x2420
>>    ? kvm_multiple_exception+0xf2/0x220 [kvm]
>>    ? debug_check_no_locks_freed+0x240/0x240
>>    ? debug_smp_processor_id+0x17/0x20
>>    ? __lock_is_held+0x9e/0x100
>>    kvm_mmu_page_fault+0x90/0x180 [kvm]
>>    kvm_handle_page_fault+0x15c/0x310 [kvm]
>>    ? __lock_is_held+0x9e/0x100
>>    handle_exception+0x3c7/0x4d0 [kvm_intel]
>>    vmx_handle_exit+0x103/0x1010 [kvm_intel]
>>    ? kvm_arch_vcpu_ioctl_run+0x1628/0x2e20 [kvm]
>>
>> The commit avoids to load host state of vmcs12 as vmcs01's guest state
>> since vmcs12 is not modified (except for the VM-instruction error field)
>> if the checking of vmcs control area fails. However, the mmu context is
>> switched to nested mmu in prepare_vmcs02() and it will not be reloaded
>> since load_vmcs12_host_state() is skipped when nested VMLAUNCH/VMRESUME
>> fails. This patch fixes it by reloading mmu context when nested
>> VMLAUNCH/VMRESUME fails.
>>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>> Cc: Jim Mattson <jmattson@google.com>
>> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
>> ---
>> v3 -> v4:
>>   * move it to a new function load_vmcs12_mmu_host_state
>>
>>   arch/x86/kvm/vmx.c | 34 ++++++++++++++++++++++------------
>>   1 file changed, 22 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 6cf3972..8aefb91 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -11259,6 +11259,24 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu,
>> struct vmcs12 *vmcs12,
>>         kvm_clear_interrupt_queue(vcpu);
>>   }
>>   +static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
>> +                       struct vmcs12 *vmcs12)
>> +{
>> +       u32 entry_failure_code;
>> +
>> +       nested_ept_uninit_mmu_context(vcpu);
>> +
>> +       /*
>> +        * Only PDPTE load can fail as the value of cr3 was checked on
>> entry and
>> +        * couldn't have changed.
>> +        */
>> +       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false,
>> &entry_failure_code))
>> +               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
>> +
>> +       if (!enable_ept)
>> +               vcpu->arch.walk_mmu->inject_page_fault =
>> kvm_inject_page_fault;
>> +}
>> +
>>   /*
>>    * A part of what we need to when the nested L2 guest exits and we want
>> to
>>    * run its L1 parent, is to reset L1's guest state to the host state
>> specified
>> @@ -11272,7 +11290,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu
>> *vcpu,
>>                                    struct vmcs12 *vmcs12)
>>   {
>>         struct kvm_segment seg;
>> -       u32 entry_failure_code;
>>         if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
>>                 vcpu->arch.efer = vmcs12->host_ia32_efer;
>> @@ -11299,17 +11316,7 @@ static void load_vmcs12_host_state(struct
>> kvm_vcpu *vcpu,
>>         vcpu->arch.cr4_guest_owned_bits =
>> ~vmcs_readl(CR4_GUEST_HOST_MASK);
>>         vmx_set_cr4(vcpu, vmcs12->host_cr4);
>>   -     nested_ept_uninit_mmu_context(vcpu);
>> -
>> -       /*
>> -        * Only PDPTE load can fail as the value of cr3 was checked on
>> entry and
>> -        * couldn't have changed.
>> -        */
>> -       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false,
>> &entry_failure_code))
>> -               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
>> -
>> -       if (!enable_ept)
>> -               vcpu->arch.walk_mmu->inject_page_fault =
>> kvm_inject_page_fault;
>> +       load_vmcs12_mmu_host_state(vcpu, vmcs12);
>>         if (enable_vpid) {
>>                 /*
>> @@ -11539,6 +11546,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu
>> *vcpu, u32 exit_reason,
>>          * accordingly.
>>          */
>>         nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
>> +
>> +       load_vmcs12_mmu_host_state(vcpu, vmcs12);
>> +
>>         /*
>>          * The emulated instruction was already skipped in
>>          * nested_vmx_run, but the updated RIP was never
>
> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  2017-11-08 21:47     ` Jim Mattson
@ 2017-11-09  0:37       ` Wanpeng Li
  2017-11-09 10:40         ` Paolo Bonzini
  2018-02-05 18:44       ` Jim Mattson
  1 sibling, 1 reply; 17+ messages in thread
From: Wanpeng Li @ 2017-11-09  0:37 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Krish Sadhukhan, LKML, kvm list, Paolo Bonzini,
	Radim Krčmář,
	Wanpeng Li

2017-11-09 5:47 GMT+08:00 Jim Mattson <jmattson@google.com>:
> I realize now that there are actually many other problems with
> deferring some control field checks to the hardware VM-entry of
> vmcs02. When there is an invalid control field, the vCPU should just
> fall through to the next instruction, without any state modifiation
> other than the ALU flags and the VM-instruction error field of the
> current VMCS. However, in preparation for the hardware VM-entry of
> vmcs02, we have already changed quite a bit of the vCPU state: the
> MSRs on the VM-entry MSR-load list, DR7, IA32_DEBUGCTL, the entire
> FLAGS register, etc. All of these changes should be undone, and we're
> not prepared to do that. (For instance, what was the old DR7 value
> that needs to be restored?)

I didn't observe real issue currently, and I hope this patchset can
catch the upcoming merge window. Then we can dig more into your
concern.

Regards,
Wanpeng Li

>
> On Fri, Nov 3, 2017 at 5:07 PM, Krish Sadhukhan
> <krish.sadhukhan@oracle.com> wrote:
>> On 11/02/2017 05:50 PM, Wanpeng Li wrote:
>>
>>> From: Wanpeng Li <wanpeng.li@hotmail.com>
>>>
>>> Commit 4f350c6dbcb (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME
>>> failure
>>> properly) can result in L1(run kvm-unit-tests/run_tests.sh vmx_controls in
>>> L1)
>>> null pointer deference and also L0 calltrace when EPT=0 on both L0 and L1.
>>>
>>> In L1:
>>>
>>> BUG: unable to handle kernel paging request at ffffffffc015bf8f
>>>   IP: vmx_vcpu_run+0x202/0x510 [kvm_intel]
>>>   PGD 146e13067 P4D 146e13067 PUD 146e15067 PMD 3d2686067 PTE 3d4af9161
>>>   Oops: 0003 [#1] PREEMPT SMP
>>>   CPU: 2 PID: 1798 Comm: qemu-system-x86 Not tainted 4.14.0-rc4+ #6
>>>   RIP: 0010:vmx_vcpu_run+0x202/0x510 [kvm_intel]
>>>   Call Trace:
>>>   WARNING: kernel stack frame pointer at ffffb86f4988bc18 in
>>> qemu-system-x86:1798 has bad value 0000000000000002
>>>
>>> In L0:
>>>
>>> -----------[ cut here ]------------
>>>   WARNING: CPU: 6 PID: 4460 at /home/kernel/linux/arch/x86/kvm//vmx.c:9845
>>> vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
>>>   CPU: 6 PID: 4460 Comm: qemu-system-x86 Tainted: G           OE
>>> 4.14.0-rc7+ #25
>>>   RIP: 0010:vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
>>>   Call Trace:
>>>    paging64_page_fault+0x500/0xde0 [kvm]
>>>    ? paging32_gva_to_gpa_nested+0x120/0x120 [kvm]
>>>    ? nonpaging_page_fault+0x3b0/0x3b0 [kvm]
>>>    ? __asan_storeN+0x12/0x20
>>>    ? paging64_gva_to_gpa+0xb0/0x120 [kvm]
>>>    ? paging64_walk_addr_generic+0x11a0/0x11a0 [kvm]
>>>    ? lock_acquire+0x2c0/0x2c0
>>>    ? vmx_read_guest_seg_ar+0x97/0x100 [kvm_intel]
>>>    ? vmx_get_segment+0x2a6/0x310 [kvm_intel]
>>>    ? sched_clock+0x1f/0x30
>>>    ? check_chain_key+0x137/0x1e0
>>>    ? __lock_acquire+0x83c/0x2420
>>>    ? kvm_multiple_exception+0xf2/0x220 [kvm]
>>>    ? debug_check_no_locks_freed+0x240/0x240
>>>    ? debug_smp_processor_id+0x17/0x20
>>>    ? __lock_is_held+0x9e/0x100
>>>    kvm_mmu_page_fault+0x90/0x180 [kvm]
>>>    kvm_handle_page_fault+0x15c/0x310 [kvm]
>>>    ? __lock_is_held+0x9e/0x100
>>>    handle_exception+0x3c7/0x4d0 [kvm_intel]
>>>    vmx_handle_exit+0x103/0x1010 [kvm_intel]
>>>    ? kvm_arch_vcpu_ioctl_run+0x1628/0x2e20 [kvm]
>>>
>>> The commit avoids to load host state of vmcs12 as vmcs01's guest state
>>> since vmcs12 is not modified (except for the VM-instruction error field)
>>> if the checking of vmcs control area fails. However, the mmu context is
>>> switched to nested mmu in prepare_vmcs02() and it will not be reloaded
>>> since load_vmcs12_host_state() is skipped when nested VMLAUNCH/VMRESUME
>>> fails. This patch fixes it by reloading mmu context when nested
>>> VMLAUNCH/VMRESUME fails.
>>>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>>> Cc: Jim Mattson <jmattson@google.com>
>>> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
>>> ---
>>> v3 -> v4:
>>>   * move it to a new function load_vmcs12_mmu_host_state
>>>
>>>   arch/x86/kvm/vmx.c | 34 ++++++++++++++++++++++------------
>>>   1 file changed, 22 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 6cf3972..8aefb91 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -11259,6 +11259,24 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu,
>>> struct vmcs12 *vmcs12,
>>>         kvm_clear_interrupt_queue(vcpu);
>>>   }
>>>   +static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
>>> +                       struct vmcs12 *vmcs12)
>>> +{
>>> +       u32 entry_failure_code;
>>> +
>>> +       nested_ept_uninit_mmu_context(vcpu);
>>> +
>>> +       /*
>>> +        * Only PDPTE load can fail as the value of cr3 was checked on
>>> entry and
>>> +        * couldn't have changed.
>>> +        */
>>> +       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false,
>>> &entry_failure_code))
>>> +               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
>>> +
>>> +       if (!enable_ept)
>>> +               vcpu->arch.walk_mmu->inject_page_fault =
>>> kvm_inject_page_fault;
>>> +}
>>> +
>>>   /*
>>>    * A part of what we need to when the nested L2 guest exits and we want
>>> to
>>>    * run its L1 parent, is to reset L1's guest state to the host state
>>> specified
>>> @@ -11272,7 +11290,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu
>>> *vcpu,
>>>                                    struct vmcs12 *vmcs12)
>>>   {
>>>         struct kvm_segment seg;
>>> -       u32 entry_failure_code;
>>>         if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
>>>                 vcpu->arch.efer = vmcs12->host_ia32_efer;
>>> @@ -11299,17 +11316,7 @@ static void load_vmcs12_host_state(struct
>>> kvm_vcpu *vcpu,
>>>         vcpu->arch.cr4_guest_owned_bits =
>>> ~vmcs_readl(CR4_GUEST_HOST_MASK);
>>>         vmx_set_cr4(vcpu, vmcs12->host_cr4);
>>>   -     nested_ept_uninit_mmu_context(vcpu);
>>> -
>>> -       /*
>>> -        * Only PDPTE load can fail as the value of cr3 was checked on
>>> entry and
>>> -        * couldn't have changed.
>>> -        */
>>> -       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false,
>>> &entry_failure_code))
>>> -               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
>>> -
>>> -       if (!enable_ept)
>>> -               vcpu->arch.walk_mmu->inject_page_fault =
>>> kvm_inject_page_fault;
>>> +       load_vmcs12_mmu_host_state(vcpu, vmcs12);
>>>         if (enable_vpid) {
>>>                 /*
>>> @@ -11539,6 +11546,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu
>>> *vcpu, u32 exit_reason,
>>>          * accordingly.
>>>          */
>>>         nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
>>> +
>>> +       load_vmcs12_mmu_host_state(vcpu, vmcs12);
>>> +
>>>         /*
>>>          * The emulated instruction was already skipped in
>>>          * nested_vmx_run, but the updated RIP was never
>>
>> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  2017-11-09  0:37       ` Wanpeng Li
@ 2017-11-09 10:40         ` Paolo Bonzini
  2017-11-09 10:47           ` Wanpeng Li
  2017-11-09 16:19           ` Jim Mattson
  0 siblings, 2 replies; 17+ messages in thread
From: Paolo Bonzini @ 2017-11-09 10:40 UTC (permalink / raw)
  To: Wanpeng Li, Jim Mattson
  Cc: Krish Sadhukhan, LKML, kvm list, Radim Krčmář, Wanpeng Li

On 09/11/2017 01:37, Wanpeng Li wrote:
> 2017-11-09 5:47 GMT+08:00 Jim Mattson <jmattson@google.com>:
>> I realize now that there are actually many other problems with
>> deferring some control field checks to the hardware VM-entry of
>> vmcs02. When there is an invalid control field, the vCPU should just
>> fall through to the next instruction, without any state modifiation
>> other than the ALU flags and the VM-instruction error field of the
>> current VMCS. However, in preparation for the hardware VM-entry of
>> vmcs02, we have already changed quite a bit of the vCPU state: the
>> MSRs on the VM-entry MSR-load list, DR7, IA32_DEBUGCTL, the entire
>> FLAGS register, etc. All of these changes should be undone, and we're
>> not prepared to do that. (For instance, what was the old DR7 value
>> that needs to be restored?)
> I didn't observe real issue currently, and I hope this patchset can
> catch the upcoming merge window. Then we can dig more into your
> concern.

Can any of you write a simple testcase for just one bug (e.g. DR7)?

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  2017-11-09 10:40         ` Paolo Bonzini
@ 2017-11-09 10:47           ` Wanpeng Li
  2017-11-09 11:05             ` Paolo Bonzini
  2017-11-09 16:19           ` Jim Mattson
  1 sibling, 1 reply; 17+ messages in thread
From: Wanpeng Li @ 2017-11-09 10:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jim Mattson, Krish Sadhukhan, LKML, kvm list,
	Radim Krčmář,
	Wanpeng Li

2017-11-09 18:40 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
> On 09/11/2017 01:37, Wanpeng Li wrote:
>> 2017-11-09 5:47 GMT+08:00 Jim Mattson <jmattson@google.com>:
>>> I realize now that there are actually many other problems with
>>> deferring some control field checks to the hardware VM-entry of
>>> vmcs02. When there is an invalid control field, the vCPU should just
>>> fall through to the next instruction, without any state modifiation
>>> other than the ALU flags and the VM-instruction error field of the
>>> current VMCS. However, in preparation for the hardware VM-entry of
>>> vmcs02, we have already changed quite a bit of the vCPU state: the
>>> MSRs on the VM-entry MSR-load list, DR7, IA32_DEBUGCTL, the entire
>>> FLAGS register, etc. All of these changes should be undone, and we're
>>> not prepared to do that. (For instance, what was the old DR7 value
>>> that needs to be restored?)
>> I didn't observe real issue currently, and I hope this patchset can
>> catch the upcoming merge window. Then we can dig more into your
>> concern.
>
> Can any of you write a simple testcase for just one bug (e.g. DR7)?

Jim you can have a try for your concern, I have already tried tons of
stress testing and didn't observe any issue.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  2017-11-09 10:47           ` Wanpeng Li
@ 2017-11-09 11:05             ` Paolo Bonzini
  2017-11-09 11:10               ` Wanpeng Li
  0 siblings, 1 reply; 17+ messages in thread
From: Paolo Bonzini @ 2017-11-09 11:05 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Jim Mattson, Krish Sadhukhan, LKML, kvm list,
	Radim Krčmář,
	Wanpeng Li

On 09/11/2017 11:47, Wanpeng Li wrote:
> 2017-11-09 18:40 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>> On 09/11/2017 01:37, Wanpeng Li wrote:
>>> 2017-11-09 5:47 GMT+08:00 Jim Mattson <jmattson@google.com>:
>>>> I realize now that there are actually many other problems with
>>>> deferring some control field checks to the hardware VM-entry of
>>>> vmcs02. When there is an invalid control field, the vCPU should just
>>>> fall through to the next instruction, without any state modifiation
>>>> other than the ALU flags and the VM-instruction error field of the
>>>> current VMCS. However, in preparation for the hardware VM-entry of
>>>> vmcs02, we have already changed quite a bit of the vCPU state: the
>>>> MSRs on the VM-entry MSR-load list, DR7, IA32_DEBUGCTL, the entire
>>>> FLAGS register, etc. All of these changes should be undone, and we're
>>>> not prepared to do that. (For instance, what was the old DR7 value
>>>> that needs to be restored?)
>>> I didn't observe real issue currently, and I hope this patchset can
>>> catch the upcoming merge window. Then we can dig more into your
>>> concern.
>>
>> Can any of you write a simple testcase for just one bug (e.g. DR7)?
> 
> Jim you can have a try for your concern, I have already tried tons of
> stress testing and didn't observe any issue.

You need to craft a testcase for kvm-unit-tests.  No stress testing will
find an issue.

Your patch is fine, but Jim is saying that we cannot really skip the
check for invalid control fields.  It's a more general issue that can be
fixed by adding explicit checks in KVM.

Paolo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  2017-11-09 11:05             ` Paolo Bonzini
@ 2017-11-09 11:10               ` Wanpeng Li
  0 siblings, 0 replies; 17+ messages in thread
From: Wanpeng Li @ 2017-11-09 11:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jim Mattson, Krish Sadhukhan, LKML, kvm list,
	Radim Krčmář,
	Wanpeng Li

2017-11-09 19:05 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
> On 09/11/2017 11:47, Wanpeng Li wrote:
>> 2017-11-09 18:40 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>>> On 09/11/2017 01:37, Wanpeng Li wrote:
>>>> 2017-11-09 5:47 GMT+08:00 Jim Mattson <jmattson@google.com>:
>>>>> I realize now that there are actually many other problems with
>>>>> deferring some control field checks to the hardware VM-entry of
>>>>> vmcs02. When there is an invalid control field, the vCPU should just
>>>>> fall through to the next instruction, without any state modifiation
>>>>> other than the ALU flags and the VM-instruction error field of the
>>>>> current VMCS. However, in preparation for the hardware VM-entry of
>>>>> vmcs02, we have already changed quite a bit of the vCPU state: the
>>>>> MSRs on the VM-entry MSR-load list, DR7, IA32_DEBUGCTL, the entire
>>>>> FLAGS register, etc. All of these changes should be undone, and we're
>>>>> not prepared to do that. (For instance, what was the old DR7 value
>>>>> that needs to be restored?)
>>>> I didn't observe real issue currently, and I hope this patchset can
>>>> catch the upcoming merge window. Then we can dig more into your
>>>> concern.
>>>
>>> Can any of you write a simple testcase for just one bug (e.g. DR7)?
>>
>> Jim you can have a try for your concern, I have already tried tons of
>> stress testing and didn't observe any issue.
>
> You need to craft a testcase for kvm-unit-tests.  No stress testing will
> find an issue.
>
> Your patch is fine, but Jim is saying that we cannot really skip the
> check for invalid control fields.  It's a more general issue that can be
> fixed by adding explicit checks in KVM.

Fair enough. I will find time to do this recently. I guess Radim can
apply the whole patchset today. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  2017-11-09 10:40         ` Paolo Bonzini
  2017-11-09 10:47           ` Wanpeng Li
@ 2017-11-09 16:19           ` Jim Mattson
  1 sibling, 0 replies; 17+ messages in thread
From: Jim Mattson @ 2017-11-09 16:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Wanpeng Li, Krish Sadhukhan, LKML, kvm list,
	Radim Krčmář,
	Wanpeng Li

Will do.

On Thu, Nov 9, 2017 at 2:40 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 09/11/2017 01:37, Wanpeng Li wrote:
>> 2017-11-09 5:47 GMT+08:00 Jim Mattson <jmattson@google.com>:
>>> I realize now that there are actually many other problems with
>>> deferring some control field checks to the hardware VM-entry of
>>> vmcs02. When there is an invalid control field, the vCPU should just
>>> fall through to the next instruction, without any state modifiation
>>> other than the ALU flags and the VM-instruction error field of the
>>> current VMCS. However, in preparation for the hardware VM-entry of
>>> vmcs02, we have already changed quite a bit of the vCPU state: the
>>> MSRs on the VM-entry MSR-load list, DR7, IA32_DEBUGCTL, the entire
>>> FLAGS register, etc. All of these changes should be undone, and we're
>>> not prepared to do that. (For instance, what was the old DR7 value
>>> that needs to be restored?)
>> I didn't observe real issue currently, and I hope this patchset can
>> catch the upcoming merge window. Then we can dig more into your
>> concern.
>
> Can any of you write a simple testcase for just one bug (e.g. DR7)?
>
> Thanks,
>
> Paolo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  2017-11-08 21:47     ` Jim Mattson
  2017-11-09  0:37       ` Wanpeng Li
@ 2018-02-05 18:44       ` Jim Mattson
  2018-05-07 22:56         ` Jim Mattson
  1 sibling, 1 reply; 17+ messages in thread
From: Jim Mattson @ 2018-02-05 18:44 UTC (permalink / raw)
  To: Krish Sadhukhan
  Cc: Wanpeng Li, LKML, kvm list, Paolo Bonzini,
	Radim Krčmář,
	Wanpeng Li

I realize now that this fix isn't quite right, since it loads
vmcs12->host_cr3 rather than reverting to the CR3 that was loaded at the
time of VMLAUNCH/VMRESUME. In the case of VMfailValid(VM entry with invalid
VMX-control field(s)), none of the VMCS12 host state fields should be
loaded. See the pseudocode for VMLAUNCH/VMRESUME in volume 3 of the SDM.

On Wed, Nov 8, 2017 at 1:47 PM Jim Mattson <jmattson@google.com> wrote:

> I realize now that there are actually many other problems with
> deferring some control field checks to the hardware VM-entry of
> vmcs02. When there is an invalid control field, the vCPU should just
> fall through to the next instruction, without any state modifiation
> other than the ALU flags and the VM-instruction error field of the
> current VMCS. However, in preparation for the hardware VM-entry of
> vmcs02, we have already changed quite a bit of the vCPU state: the
> MSRs on the VM-entry MSR-load list, DR7, IA32_DEBUGCTL, the entire
> FLAGS register, etc. All of these changes should be undone, and we're
> not prepared to do that. (For instance, what was the old DR7 value
> that needs to be restored?)

> On Fri, Nov 3, 2017 at 5:07 PM, Krish Sadhukhan
> <krish.sadhukhan@oracle.com> wrote:
> > On 11/02/2017 05:50 PM, Wanpeng Li wrote:
> >
> >> From: Wanpeng Li <wanpeng.li@hotmail.com>
> >>
> >> Commit 4f350c6dbcb (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME
> >> failure
> >> properly) can result in L1(run kvm-unit-tests/run_tests.sh vmx_controls
> in
> >> L1)
> >> null pointer deference and also L0 calltrace when EPT=0 on both L0 and
> L1.
> >>
> >> In L1:
> >>
> >> BUG: unable to handle kernel paging request at ffffffffc015bf8f
> >>   IP: vmx_vcpu_run+0x202/0x510 [kvm_intel]
> >>   PGD 146e13067 P4D 146e13067 PUD 146e15067 PMD 3d2686067 PTE 3d4af9161
> >>   Oops: 0003 [#1] PREEMPT SMP
> >>   CPU: 2 PID: 1798 Comm: qemu-system-x86 Not tainted 4.14.0-rc4+ #6
> >>   RIP: 0010:vmx_vcpu_run+0x202/0x510 [kvm_intel]
> >>   Call Trace:
> >>   WARNING: kernel stack frame pointer at ffffb86f4988bc18 in
> >> qemu-system-x86:1798 has bad value 0000000000000002
> >>
> >> In L0:
> >>
> >> -----------[ cut here ]------------
> >>   WARNING: CPU: 6 PID: 4460 at
> /home/kernel/linux/arch/x86/kvm//vmx.c:9845
> >> vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
> >>   CPU: 6 PID: 4460 Comm: qemu-system-x86 Tainted: G           OE
> >> 4.14.0-rc7+ #25
> >>   RIP: 0010:vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
> >>   Call Trace:
> >>    paging64_page_fault+0x500/0xde0 [kvm]
> >>    ? paging32_gva_to_gpa_nested+0x120/0x120 [kvm]
> >>    ? nonpaging_page_fault+0x3b0/0x3b0 [kvm]
> >>    ? __asan_storeN+0x12/0x20
> >>    ? paging64_gva_to_gpa+0xb0/0x120 [kvm]
> >>    ? paging64_walk_addr_generic+0x11a0/0x11a0 [kvm]
> >>    ? lock_acquire+0x2c0/0x2c0
> >>    ? vmx_read_guest_seg_ar+0x97/0x100 [kvm_intel]
> >>    ? vmx_get_segment+0x2a6/0x310 [kvm_intel]
> >>    ? sched_clock+0x1f/0x30
> >>    ? check_chain_key+0x137/0x1e0
> >>    ? __lock_acquire+0x83c/0x2420
> >>    ? kvm_multiple_exception+0xf2/0x220 [kvm]
> >>    ? debug_check_no_locks_freed+0x240/0x240
> >>    ? debug_smp_processor_id+0x17/0x20
> >>    ? __lock_is_held+0x9e/0x100
> >>    kvm_mmu_page_fault+0x90/0x180 [kvm]
> >>    kvm_handle_page_fault+0x15c/0x310 [kvm]
> >>    ? __lock_is_held+0x9e/0x100
> >>    handle_exception+0x3c7/0x4d0 [kvm_intel]
> >>    vmx_handle_exit+0x103/0x1010 [kvm_intel]
> >>    ? kvm_arch_vcpu_ioctl_run+0x1628/0x2e20 [kvm]
> >>
> >> The commit avoids to load host state of vmcs12 as vmcs01's guest state
> >> since vmcs12 is not modified (except for the VM-instruction error
> field)
> >> if the checking of vmcs control area fails. However, the mmu context is
> >> switched to nested mmu in prepare_vmcs02() and it will not be reloaded
> >> since load_vmcs12_host_state() is skipped when nested VMLAUNCH/VMRESUME
> >> fails. This patch fixes it by reloading mmu context when nested
> >> VMLAUNCH/VMRESUME fails.
> >>
> >> Cc: Paolo Bonzini <pbonzini@redhat.com>
> >> Cc: Radim Krčmář <rkrcmar@redhat.com>
> >> Cc: Jim Mattson <jmattson@google.com>
> >> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
> >> ---
> >> v3 -> v4:
> >>   * move it to a new function load_vmcs12_mmu_host_state
> >>
> >>   arch/x86/kvm/vmx.c | 34 ++++++++++++++++++++++------------
> >>   1 file changed, 22 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index 6cf3972..8aefb91 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -11259,6 +11259,24 @@ static void prepare_vmcs12(struct kvm_vcpu
> *vcpu,
> >> struct vmcs12 *vmcs12,
> >>         kvm_clear_interrupt_queue(vcpu);
> >>   }
> >>   +static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
> >> +                       struct vmcs12 *vmcs12)
> >> +{
> >> +       u32 entry_failure_code;
> >> +
> >> +       nested_ept_uninit_mmu_context(vcpu);
> >> +
> >> +       /*
> >> +        * Only PDPTE load can fail as the value of cr3 was checked on
> >> entry and
> >> +        * couldn't have changed.
> >> +        */
> >> +       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false,
> >> &entry_failure_code))
> >> +               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> >> +
> >> +       if (!enable_ept)
> >> +               vcpu->arch.walk_mmu->inject_page_fault =
> >> kvm_inject_page_fault;
> >> +}
> >> +
> >>   /*
> >>    * A part of what we need to when the nested L2 guest exits and we
> want
> >> to
> >>    * run its L1 parent, is to reset L1's guest state to the host state
> >> specified
> >> @@ -11272,7 +11290,6 @@ static void load_vmcs12_host_state(struct
> kvm_vcpu
> >> *vcpu,
> >>                                    struct vmcs12 *vmcs12)
> >>   {
> >>         struct kvm_segment seg;
> >> -       u32 entry_failure_code;
> >>         if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
> >>                 vcpu->arch.efer = vmcs12->host_ia32_efer;
> >> @@ -11299,17 +11316,7 @@ static void load_vmcs12_host_state(struct
> >> kvm_vcpu *vcpu,
> >>         vcpu->arch.cr4_guest_owned_bits =
> >> ~vmcs_readl(CR4_GUEST_HOST_MASK);
> >>         vmx_set_cr4(vcpu, vmcs12->host_cr4);
> >>   -     nested_ept_uninit_mmu_context(vcpu);
> >> -
> >> -       /*
> >> -        * Only PDPTE load can fail as the value of cr3 was checked on
> >> entry and
> >> -        * couldn't have changed.
> >> -        */
> >> -       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false,
> >> &entry_failure_code))
> >> -               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> >> -
> >> -       if (!enable_ept)
> >> -               vcpu->arch.walk_mmu->inject_page_fault =
> >> kvm_inject_page_fault;
> >> +       load_vmcs12_mmu_host_state(vcpu, vmcs12);
> >>         if (enable_vpid) {
> >>                 /*
> >> @@ -11539,6 +11546,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu
> >> *vcpu, u32 exit_reason,
> >>          * accordingly.
> >>          */
> >>         nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
> >> +
> >> +       load_vmcs12_mmu_host_state(vcpu, vmcs12);
> >> +
> >>         /*
> >>          * The emulated instruction was already skipped in
> >>          * nested_vmx_run, but the updated RIP was never
> >
> > Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  2018-02-05 18:44       ` Jim Mattson
@ 2018-05-07 22:56         ` Jim Mattson
  0 siblings, 0 replies; 17+ messages in thread
From: Jim Mattson @ 2018-05-07 22:56 UTC (permalink / raw)
  To: Krish Sadhukhan
  Cc: Wanpeng Li, LKML, kvm list, Paolo Bonzini,
	Radim Krčmář,
	Wanpeng Li

Moreover, if the VMLAUNCH/VMRESUME is in 32-bit PAE mode, then the
PDPTRs should not be reloaded.

On Mon, Feb 5, 2018 at 10:44 AM, Jim Mattson <jmattson@google.com> wrote:
> I realize now that this fix isn't quite right, since it loads
> vmcs12->host_cr3 rather than reverting to the CR3 that was loaded at the
> time of VMLAUNCH/VMRESUME. In the case of VMfailValid(VM entry with invalid
> VMX-control field(s)), none of the VMCS12 host state fields should be
> loaded. See the pseudocode for VMLAUNCH/VMRESUME in volume 3 of the SDM.
>
>
> On Wed, Nov 8, 2017 at 1:47 PM Jim Mattson <jmattson@google.com> wrote:
>
>> I realize now that there are actually many other problems with
>> deferring some control field checks to the hardware VM-entry of
>> vmcs02. When there is an invalid control field, the vCPU should just
>> fall through to the next instruction, without any state modifiation
>> other than the ALU flags and the VM-instruction error field of the
>> current VMCS. However, in preparation for the hardware VM-entry of
>> vmcs02, we have already changed quite a bit of the vCPU state: the
>> MSRs on the VM-entry MSR-load list, DR7, IA32_DEBUGCTL, the entire
>> FLAGS register, etc. All of these changes should be undone, and we're
>> not prepared to do that. (For instance, what was the old DR7 value
>> that needs to be restored?)
>
>
>> On Fri, Nov 3, 2017 at 5:07 PM, Krish Sadhukhan
>> <krish.sadhukhan@oracle.com> wrote:
>> > On 11/02/2017 05:50 PM, Wanpeng Li wrote:
>> >
>> >> From: Wanpeng Li <wanpeng.li@hotmail.com>
>> >>
>> >> Commit 4f350c6dbcb (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME
>> >> failure
>> >> properly) can result in L1(run kvm-unit-tests/run_tests.sh vmx_controls
>> in
>> >> L1)
>> >> null pointer deference and also L0 calltrace when EPT=0 on both L0 and
>> L1.
>> >>
>> >> In L1:
>> >>
>> >> BUG: unable to handle kernel paging request at ffffffffc015bf8f
>> >>   IP: vmx_vcpu_run+0x202/0x510 [kvm_intel]
>> >>   PGD 146e13067 P4D 146e13067 PUD 146e15067 PMD 3d2686067 PTE 3d4af9161
>> >>   Oops: 0003 [#1] PREEMPT SMP
>> >>   CPU: 2 PID: 1798 Comm: qemu-system-x86 Not tainted 4.14.0-rc4+ #6
>> >>   RIP: 0010:vmx_vcpu_run+0x202/0x510 [kvm_intel]
>> >>   Call Trace:
>> >>   WARNING: kernel stack frame pointer at ffffb86f4988bc18 in
>> >> qemu-system-x86:1798 has bad value 0000000000000002
>> >>
>> >> In L0:
>> >>
>> >> -----------[ cut here ]------------
>> >>   WARNING: CPU: 6 PID: 4460 at
>> /home/kernel/linux/arch/x86/kvm//vmx.c:9845
>> >> vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
>> >>   CPU: 6 PID: 4460 Comm: qemu-system-x86 Tainted: G           OE
>> >> 4.14.0-rc7+ #25
>> >>   RIP: 0010:vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
>> >>   Call Trace:
>> >>    paging64_page_fault+0x500/0xde0 [kvm]
>> >>    ? paging32_gva_to_gpa_nested+0x120/0x120 [kvm]
>> >>    ? nonpaging_page_fault+0x3b0/0x3b0 [kvm]
>> >>    ? __asan_storeN+0x12/0x20
>> >>    ? paging64_gva_to_gpa+0xb0/0x120 [kvm]
>> >>    ? paging64_walk_addr_generic+0x11a0/0x11a0 [kvm]
>> >>    ? lock_acquire+0x2c0/0x2c0
>> >>    ? vmx_read_guest_seg_ar+0x97/0x100 [kvm_intel]
>> >>    ? vmx_get_segment+0x2a6/0x310 [kvm_intel]
>> >>    ? sched_clock+0x1f/0x30
>> >>    ? check_chain_key+0x137/0x1e0
>> >>    ? __lock_acquire+0x83c/0x2420
>> >>    ? kvm_multiple_exception+0xf2/0x220 [kvm]
>> >>    ? debug_check_no_locks_freed+0x240/0x240
>> >>    ? debug_smp_processor_id+0x17/0x20
>> >>    ? __lock_is_held+0x9e/0x100
>> >>    kvm_mmu_page_fault+0x90/0x180 [kvm]
>> >>    kvm_handle_page_fault+0x15c/0x310 [kvm]
>> >>    ? __lock_is_held+0x9e/0x100
>> >>    handle_exception+0x3c7/0x4d0 [kvm_intel]
>> >>    vmx_handle_exit+0x103/0x1010 [kvm_intel]
>> >>    ? kvm_arch_vcpu_ioctl_run+0x1628/0x2e20 [kvm]
>> >>
>> >> The commit avoids to load host state of vmcs12 as vmcs01's guest state
>> >> since vmcs12 is not modified (except for the VM-instruction error
>> >> field)
>> >> if the checking of vmcs control area fails. However, the mmu context is
>> >> switched to nested mmu in prepare_vmcs02() and it will not be reloaded
>> >> since load_vmcs12_host_state() is skipped when nested VMLAUNCH/VMRESUME
>> >> fails. This patch fixes it by reloading mmu context when nested
>> >> VMLAUNCH/VMRESUME fails.
>> >>
>> >> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> >> Cc: Radim Krčmář <rkrcmar@redhat.com>
>> >> Cc: Jim Mattson <jmattson@google.com>
>> >> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
>> >> ---
>> >> v3 -> v4:
>> >>   * move it to a new function load_vmcs12_mmu_host_state
>> >>
>> >>   arch/x86/kvm/vmx.c | 34 ++++++++++++++++++++++------------
>> >>   1 file changed, 22 insertions(+), 12 deletions(-)
>> >>
>> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> >> index 6cf3972..8aefb91 100644
>> >> --- a/arch/x86/kvm/vmx.c
>> >> +++ b/arch/x86/kvm/vmx.c
>> >> @@ -11259,6 +11259,24 @@ static void prepare_vmcs12(struct kvm_vcpu
>> *vcpu,
>> >> struct vmcs12 *vmcs12,
>> >>         kvm_clear_interrupt_queue(vcpu);
>> >>   }
>> >>   +static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
>> >> +                       struct vmcs12 *vmcs12)
>> >> +{
>> >> +       u32 entry_failure_code;
>> >> +
>> >> +       nested_ept_uninit_mmu_context(vcpu);
>> >> +
>> >> +       /*
>> >> +        * Only PDPTE load can fail as the value of cr3 was checked on
>> >> entry and
>> >> +        * couldn't have changed.
>> >> +        */
>> >> +       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false,
>> >> &entry_failure_code))
>> >> +               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
>> >> +
>> >> +       if (!enable_ept)
>> >> +               vcpu->arch.walk_mmu->inject_page_fault =
>> >> kvm_inject_page_fault;
>> >> +}
>> >> +
>> >>   /*
>> >>    * A part of what we need to when the nested L2 guest exits and we
>> want
>> >> to
>> >>    * run its L1 parent, is to reset L1's guest state to the host state
>> >> specified
>> >> @@ -11272,7 +11290,6 @@ static void load_vmcs12_host_state(struct
>> kvm_vcpu
>> >> *vcpu,
>> >>                                    struct vmcs12 *vmcs12)
>> >>   {
>> >>         struct kvm_segment seg;
>> >> -       u32 entry_failure_code;
>> >>         if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
>> >>                 vcpu->arch.efer = vmcs12->host_ia32_efer;
>> >> @@ -11299,17 +11316,7 @@ static void load_vmcs12_host_state(struct
>> >> kvm_vcpu *vcpu,
>> >>         vcpu->arch.cr4_guest_owned_bits =
>> >> ~vmcs_readl(CR4_GUEST_HOST_MASK);
>> >>         vmx_set_cr4(vcpu, vmcs12->host_cr4);
>> >>   -     nested_ept_uninit_mmu_context(vcpu);
>> >> -
>> >> -       /*
>> >> -        * Only PDPTE load can fail as the value of cr3 was checked on
>> >> entry and
>> >> -        * couldn't have changed.
>> >> -        */
>> >> -       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false,
>> >> &entry_failure_code))
>> >> -               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
>> >> -
>> >> -       if (!enable_ept)
>> >> -               vcpu->arch.walk_mmu->inject_page_fault =
>> >> kvm_inject_page_fault;
>> >> +       load_vmcs12_mmu_host_state(vcpu, vmcs12);
>> >>         if (enable_vpid) {
>> >>                 /*
>> >> @@ -11539,6 +11546,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu
>> >> *vcpu, u32 exit_reason,
>> >>          * accordingly.
>> >>          */
>> >>         nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
>> >> +
>> >> +       load_vmcs12_mmu_host_state(vcpu, vmcs12);
>> >> +
>> >>         /*
>> >>          * The emulated instruction was already skipped in
>> >>          * nested_vmx_run, but the updated RIP was never
>> >
>> > Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-05-07 22:57 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-03  0:50 [PATCH v5 1/3] KVM: X86: Fix operand/address-size during instruction decoding Wanpeng Li
2017-11-03  0:50 ` [PATCH v5 2/3] KVM: nVMX: Validate the IA32_BNDCFGS on nested VM-entry Wanpeng Li
     [not found]   ` <0b1d82f7-2fc6-9fc0-15a4-3500413814bd@oracle.com>
2017-11-03  6:40     ` Wanpeng Li
2017-11-03 17:13       ` Krish Sadhukhan
2017-11-03 17:54         ` Jim Mattson
2017-11-03  0:50 ` [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure Wanpeng Li
2017-11-03 16:01   ` Jim Mattson
2017-11-04  0:07   ` Krish Sadhukhan
2017-11-08 21:47     ` Jim Mattson
2017-11-09  0:37       ` Wanpeng Li
2017-11-09 10:40         ` Paolo Bonzini
2017-11-09 10:47           ` Wanpeng Li
2017-11-09 11:05             ` Paolo Bonzini
2017-11-09 11:10               ` Wanpeng Li
2017-11-09 16:19           ` Jim Mattson
2018-02-05 18:44       ` Jim Mattson
2018-05-07 22:56         ` Jim Mattson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).