All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: "Sean Christopherson" <sean.j.christopherson@intel.com>,
	"Jim Mattson" <jmattson@google.com>,
	"Krish Sadhukhan" <krish.sadhukhan@oracle.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Radim KrÄmář" <rkrcmar@redhat.com>,
	"Wanpeng Li" <wanpeng.li@hotmail.com>,
	"Sasha Levin" <sashal@kernel.org>,
	kvm@vger.kernel.org
Subject: [PATCH AUTOSEL 4.14 14/37] KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail
Date: Fri, 29 Mar 2019 21:29:57 -0400	[thread overview]
Message-ID: <20190330013020.379-14-sashal@kernel.org> (raw)
In-Reply-To: <20190330013020.379-1-sashal@kernel.org>

From: Sean Christopherson <sean.j.christopherson@intel.com>

[ Upstream commit bd18bffca35397214ae68d85cf7203aca25c3c1d ]

A VMEnter that VMFails (as opposed to VMExits) does not touch host
state beyond registers that are explicitly noted in the VMFail path,
e.g. EFLAGS.  Host state does not need to be loaded because VMFail
is only signaled for consistency checks that occur before the CPU
starts to load guest state, i.e. there is no need to restore any
state as nothing has been modified.  But in the case where a VMFail
is detected by hardware and not by KVM (due to deferring consistency
checks to hardware), KVM has already loaded some amount of guest
state.  Luckily, "loaded" only means loaded to KVM's software model,
i.e. vmcs01 has not been modified.  So, unwind our software model to
the pre-VMEntry host state.

Not restoring host state in this VMFail path leads to a variety of
failures because we end up with stale data in vcpu->arch, e.g. CR0,
CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
significant delta in the stale data is all but guaranteed to crash
L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.

An alternative to this "soft" reload would be to load host state from
vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
wildly inconsistent with respect to the VMX architecture, e.g. an L1
VMM with separate VMExit and VMFail paths would explode.

Note that this approach does not mean KVM is 100% accurate with
respect to VMX hardware behavior, even at an architectural level
(the exact order of consistency checks is microarchitecture specific).
But 100% emulation accuracy isn't the goal (with this patch), rather
the goal is to be consistent in the information delivered to L1, e.g.
a VMExit should not fall-through VMENTER, and a VMFail should not jump
to HOST_RIP.

This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
context after VMLAUNCH/VMRESUME failure)", but retains the core
aspects of that patch, just in an open coded form due to the need to
pull state from vmcs01 instead of vmcs12.  Restoring host state
resolves a variety of issues introduced by commit "4f350c6dbcb9
(kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
which remedied the incorrect behavior of treating VMFail like VMExit
but in doing so neglected to restore arch state that had been modified
prior to attempting nested VMEnter.

A sample failure that occurs due to stale vcpu.arch state is a fault
of some form while emulating an LGDT (due to emulated UMIP) from L1
after a failed VMEntry to L3, in this case when running the KVM unit
test test_tpr_threshold_values in L1.  L0 also hits a WARN in this
case due to a stale arch.cr4.UMIP.

L1:
  BUG: unable to handle kernel paging request at ffffc90000663b9e
  PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
  Oops: 0009 [#1] SMP
  CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:native_load_gdt+0x0/0x10

  ...

  Call Trace:
   load_fixmap_gdt+0x22/0x30
   __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
   vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
   nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
   vmx_handle_exit+0x246/0x15a0 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
   kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
   do_vfs_ioctl+0x9f/0x600
   ksys_ioctl+0x66/0x70
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x4f/0x100
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

L0:
  WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
  ...
  CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
  Hardware name: Intel Corporation Kabylake Client platform/KBL S
  RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]

  ...

  Call Trace:
   kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
   kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
   do_vfs_ioctl+0x9f/0x5e0
   ksys_ioctl+0x66/0x70
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x49/0xf0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
Cc: Jim Mattson <jmattson@google.com>
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim KrÄmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/kvm/vmx.c | 173 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 153 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 229d5e39f5c0..28faf781e6fd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -11860,24 +11860,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	kvm_clear_interrupt_queue(vcpu);
 }
 
-static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
-			struct vmcs12 *vmcs12)
-{
-	u32 entry_failure_code;
-
-	nested_ept_uninit_mmu_context(vcpu);
-
-	/*
-	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
-	 * couldn't have changed.
-	 */
-	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
-		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
-
-	if (!enable_ept)
-		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
-}
-
 /*
  * A part of what we need to when the nested L2 guest exits and we want to
  * run its L1 parent, is to reset L1's guest state to the host state specified
@@ -11891,6 +11873,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 				   struct vmcs12 *vmcs12)
 {
 	struct kvm_segment seg;
+	u32 entry_failure_code;
 
 	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
 		vcpu->arch.efer = vmcs12->host_ia32_efer;
@@ -11917,7 +11900,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
 	vmx_set_cr4(vcpu, vmcs12->host_cr4);
 
-	load_vmcs12_mmu_host_state(vcpu, vmcs12);
+	nested_ept_uninit_mmu_context(vcpu);
+
+	/*
+	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
+	 * couldn't have changed.
+	 */
+	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
+		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
+
+	if (!enable_ept)
+		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
 
 	if (enable_vpid) {
 		/*
@@ -12008,6 +12001,140 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
 }
 
+static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
+{
+	struct shared_msr_entry *efer_msr;
+	unsigned int i;
+
+	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_EFER)
+		return vmcs_read64(GUEST_IA32_EFER);
+
+	if (cpu_has_load_ia32_efer)
+		return host_efer;
+
+	for (i = 0; i < vmx->msr_autoload.guest.nr; ++i) {
+		if (vmx->msr_autoload.guest.val[i].index == MSR_EFER)
+			return vmx->msr_autoload.guest.val[i].value;
+	}
+
+	efer_msr = find_msr_entry(vmx, MSR_EFER);
+	if (efer_msr)
+		return efer_msr->data;
+
+	return host_efer;
+}
+
+static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
+{
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmx_msr_entry g, h;
+	struct msr_data msr;
+	gpa_t gpa;
+	u32 i, j;
+
+	vcpu->arch.pat = vmcs_read64(GUEST_IA32_PAT);
+
+	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) {
+		/*
+		 * L1's host DR7 is lost if KVM_GUESTDBG_USE_HW_BP is set
+		 * as vmcs01.GUEST_DR7 contains a userspace defined value
+		 * and vcpu->arch.dr7 is not squirreled away before the
+		 * nested VMENTER (not worth adding a variable in nested_vmx).
+		 */
+		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
+			kvm_set_dr(vcpu, 7, DR7_FIXED_1);
+		else
+			WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7)));
+	}
+
+	/*
+	 * Note that calling vmx_set_{efer,cr0,cr4} is important as they
+	 * handle a variety of side effects to KVM's software model.
+	 */
+	vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
+
+	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
+	vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
+
+	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
+	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
+
+	nested_ept_uninit_mmu_context(vcpu);
+	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
+
+	/*
+	 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
+	 * from vmcs01 (if necessary).  The PDPTRs are not loaded on
+	 * VMFail, like everything else we just need to ensure our
+	 * software model is up-to-date.
+	 */
+	ept_save_pdptrs(vcpu);
+
+	kvm_mmu_reset_context(vcpu);
+
+	if (cpu_has_vmx_msr_bitmap())
+		vmx_update_msr_bitmap(vcpu);
+
+	/*
+	 * This nasty bit of open coding is a compromise between blindly
+	 * loading L1's MSRs using the exit load lists (incorrect emulation
+	 * of VMFail), leaving the nested VM's MSRs in the software model
+	 * (incorrect behavior) and snapshotting the modified MSRs (too
+	 * expensive since the lists are unbound by hardware).  For each
+	 * MSR that was (prematurely) loaded from the nested VMEntry load
+	 * list, reload it from the exit load list if it exists and differs
+	 * from the guest value.  The intent is to stuff host state as
+	 * silently as possible, not to fully process the exit load list.
+	 */
+	msr.host_initiated = false;
+	for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) {
+		gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g));
+		if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) {
+			pr_debug_ratelimited(
+				"%s read MSR index failed (%u, 0x%08llx)\n",
+				__func__, i, gpa);
+			goto vmabort;
+		}
+
+		for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) {
+			gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h));
+			if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) {
+				pr_debug_ratelimited(
+					"%s read MSR failed (%u, 0x%08llx)\n",
+					__func__, j, gpa);
+				goto vmabort;
+			}
+			if (h.index != g.index)
+				continue;
+			if (h.value == g.value)
+				break;
+
+			if (nested_vmx_load_msr_check(vcpu, &h)) {
+				pr_debug_ratelimited(
+					"%s check failed (%u, 0x%x, 0x%x)\n",
+					__func__, j, h.index, h.reserved);
+				goto vmabort;
+			}
+
+			msr.index = h.index;
+			msr.data = h.value;
+			if (kvm_set_msr(vcpu, &msr)) {
+				pr_debug_ratelimited(
+					"%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
+					__func__, j, h.index, h.value);
+				goto vmabort;
+			}
+		}
+	}
+
+	return;
+
+vmabort:
+	nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
+}
+
 /*
  * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
  * and modify vmcs12 to make it see what it would expect to see there if
@@ -12140,7 +12267,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
 	 */
 	nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
 
-	load_vmcs12_mmu_host_state(vcpu, vmcs12);
+	/*
+	 * Restore L1's host state to KVM's software model.  We're here
+	 * because a consistency check was caught by hardware, which
+	 * means some amount of guest state has been propagated to KVM's
+	 * model and needs to be unwound to the host's state.
+	 */
+	nested_vmx_restore_host_state(vcpu);
 
 	/*
 	 * The emulated instruction was already skipped in
-- 
2.19.1


  parent reply	other threads:[~2019-03-30  1:37 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-30  1:29 [PATCH AUTOSEL 4.14 01/37] gpio: pxa: handle corner case of unprobed device Sasha Levin
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 02/37] rsi: improve kernel thread handling to fix kernel panic Sasha Levin
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 03/37] 9p: do not trust pdu content for stat item size Sasha Levin
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 04/37] 9p locks: add mount option for lock retry interval Sasha Levin
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 05/37] f2fs: fix to do sanity check with current segment number Sasha Levin
2019-03-30  1:29   ` Sasha Levin
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 06/37] netfilter: xt_cgroup: shrink size of v2 path Sasha Levin
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 07/37] serial: uartps: console_setup() can't be placed to init section Sasha Levin
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 08/37] powerpc/pseries: Remove prrn_work workqueue Sasha Levin
2019-03-30  1:29   ` Sasha Levin
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 09/37] media: au0828: cannot kfree dev before usb disconnect Sasha Levin
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 10/37] HID: i2c-hid: override HID descriptors for certain devices Sasha Levin
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 11/37] ARM: samsung: Limit SAMSUNG_PM_CHECK config option to non-Exynos platforms Sasha Levin
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 12/37] usbip: fix vhci_hcd controller counting Sasha Levin
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 13/37] ACPI / SBS: Fix GPE storm on recent MacBookPro's Sasha Levin
2019-03-30  1:29 ` Sasha Levin [this message]
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 15/37] compiler.h: update definition of unreachable() Sasha Levin
2019-03-30  1:29 ` [PATCH AUTOSEL 4.14 16/37] cifs: fallback to older infolevels on findfirst queryinfo retry Sasha Levin
2019-03-30  1:30 ` [PATCH AUTOSEL 4.14 17/37] kernel: hung_task.c: disable on suspend Sasha Levin
2019-03-30  1:30 ` [PATCH AUTOSEL 4.14 18/37] nvme-pci: fix conflicting p2p resource adds Sasha Levin
2019-03-30  1:30   ` Sasha Levin
2019-04-01 17:37   ` Heitke, Kenneth
2019-04-01 17:37     ` Heitke, Kenneth
2019-04-03 12:57     ` Sasha Levin
2019-04-03 12:57       ` Sasha Levin
2019-03-30  1:30 ` [PATCH AUTOSEL 4.14 19/37] blk-mq: protect debugfs_create_files() from failures Sasha Levin
2019-03-30  1:30 ` [PATCH AUTOSEL 4.14 20/37] crypto: sha256/arm - fix crash bug in Thumb2 build Sasha Levin
2019-03-30  1:30 ` [PATCH AUTOSEL 4.14 21/37] crypto: sha512/arm " Sasha Levin
2019-03-30  1:30 ` [PATCH AUTOSEL 4.14 22/37] iommu/dmar: Fix buffer overflow during PCI bus notification Sasha Levin
2019-03-30  1:30 ` [PATCH AUTOSEL 4.14 23/37] kvm: properly check debugfs dentry before using it Sasha Levin
2019-03-30  1:30 ` [PATCH AUTOSEL 4.14 24/37] soc/tegra: pmc: Drop locking from tegra_powergate_is_powered() Sasha Levin
2019-03-30  1:30 ` [PATCH AUTOSEL 4.14 25/37] ext4: prohibit fstrim in norecovery mode Sasha Levin
2019-03-30  1:30 ` [PATCH AUTOSEL 4.14 26/37] lkdtm: Print real addresses Sasha Levin
2019-03-30  1:30 ` [PATCH AUTOSEL 4.14 27/37] lkdtm: Add tests for NULL pointer dereference Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190330013020.379-14-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=jmattson@google.com \
    --cc=krish.sadhukhan@oracle.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=rkrcmar@redhat.com \
    --cc=sean.j.christopherson@intel.com \
    --cc=stable@vger.kernel.org \
    --cc=wanpeng.li@hotmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.