KVM Archive on lore.kernel.org
 help / color / Atom feed
From: Paolo Bonzini <pbonzini@redhat.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: Sean Christopherson <sean.j.christopherson@intel.com>,
	vkuznets@redhat.com
Subject: [PATCH 42/43] KVM: VMX: Leave preemption timer running when it's disabled
Date: Thu, 13 Jun 2019 19:03:28 +0200
Message-ID: <1560445409-17363-43-git-send-email-pbonzini@redhat.com> (raw)
In-Reply-To: <1560445409-17363-1-git-send-email-pbonzini@redhat.com>

From: Sean Christopherson <sean.j.christopherson@intel.com>

VMWRITEs to the major VMCS controls, pin controls included, are
deceptively expensive.  CPUs with VMCS caching (Westmere and later) also
optimize away consistency checks on VM-Entry, i.e. skip consistency
checks if the relevant fields have not changed since the last successful
VM-Entry (of the cached VMCS).  Because uops are a precious commodity,
uCode's dirty VMCS field tracking isn't as precise as software would
prefer.  Notably, writing any of the major VMCS fields effectively marks
the entire VMCS dirty, i.e. causes the next VM-Entry to perform all
consistency checks, which consumes several hundred cycles.

As it pertains to KVM, toggling PIN_BASED_VMX_PREEMPTION_TIMER more than
doubles the latency of the next VM-Entry (and again when/if the flag is
toggled back).  In a non-nested scenario, running a "standard" guest
with the preemption timer enabled, toggling the timer flag is uncommon
but not rare, e.g. roughly 1 in 10 entries.  Disabling the preemption
timer can change these numbers due to its use for "immediate exits",
even when explicitly disabled by userspace.

Nested virtualization in particular is painful, as the timer flag is set
for the majority of VM-Enters, but prepare_vmcs02() initializes vmcs02's
pin controls to *clear* the flag since its the timer's final state isn't
known until vmx_vcpu_run().  I.e. the majority of nested VM-Enters end
up unnecessarily writing pin controls *twice*.

Rather than toggle the timer flag in pin controls, set the timer value
itself to the largest allowed value to put it into a "soft disabled"
state, and ignore any spurious preemption timer exits.

Sadly, the timer is a 32-bit value and so theoretically it can fire
before the head death of the universe, i.e. spurious exits are possible.
But because KVM does *not* save the timer value on VM-Exit and because
the timer runs at a slower rate than the TSC, the maximuma timer value
is still sufficiently large for KVM's purposes.  E.g. on a modern CPU
with a timer that runs at 1/32 the frequency of a 2.4ghz constant-rate
TSC, the timer will fire after ~55 seconds of *uninterrupted* guest
execution.  In other words, spurious VM-Exits are effectively only
possible if the *host* is tickless on the logical CPU, the guest is
not using the preemption timer, and the guest is not generating VM-Exits
for *any* other reason.

To be safe from bad/weird hardware, disable the preemption timer if its
maximum delay is less than ten seconds.  Ten seconds is mostly arbitrary
and was selected in no small part because it's a nice round number.
For simplicity and paranoia, fall back to __kvm_request_immediate_exit()
if the preemption timer is disabled by KVM or userspace.  Previously
KVM continued to use the preemption timer to force immediate exits even
when the timer was disabled by userspace.  Now that KVM leaves the timer
running instead of truly disabling it, allow userspace to kill it
entirely in the unlikely event the timer (or KVM) malfunctions.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/nested.c |  5 ++--
 arch/x86/kvm/vmx/vmcs.h   |  1 +
 arch/x86/kvm/vmx/vmx.c    | 60 ++++++++++++++++++++++++++++++-----------------
 3 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 33d6598709cb..cdb74106b2b5 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2012,9 +2012,8 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 	 * PIN CONTROLS
 	 */
 	exec_control = vmx_pin_based_exec_ctrl(vmx);
-	exec_control |= vmcs12->pin_based_vm_exec_control;
-	/* Preemption timer setting is computed directly in vmx_vcpu_run.  */
-	exec_control &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
+	exec_control |= (vmcs12->pin_based_vm_exec_control &
+			 ~PIN_BASED_VMX_PREEMPTION_TIMER);
 
 	/* Posted interrupts setting is only taken from vmcs12.  */
 	if (nested_cpu_has_posted_intr(vmcs12)) {
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 9a87a2482e3e..481ad879197b 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -61,6 +61,7 @@ struct loaded_vmcs {
 	int cpu;
 	bool launched;
 	bool nmi_known_unmasked;
+	bool hv_timer_soft_disabled;
 	/* Support for vnmi-less CPUs */
 	int soft_vnmi_blocked;
 	ktime_t entry_time;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 86f29ab22dec..aa08a16b301d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2465,6 +2465,7 @@ int alloc_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
 		return -ENOMEM;
 
 	loaded_vmcs->shadow_vmcs = NULL;
+	loaded_vmcs->hv_timer_soft_disabled = false;
 	loaded_vmcs_init(loaded_vmcs);
 
 	if (cpu_has_vmx_msr_bitmap()) {
@@ -3835,8 +3836,9 @@ u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx)
 	if (!enable_vnmi)
 		pin_based_exec_ctrl &= ~PIN_BASED_VIRTUAL_NMIS;
 
-	/* Enable the preemption timer dynamically */
-	pin_based_exec_ctrl &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
+	if (!enable_preemption_timer)
+		pin_based_exec_ctrl &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
+
 	return pin_based_exec_ctrl;
 }
 
@@ -5455,8 +5457,12 @@ static int handle_pml_full(struct kvm_vcpu *vcpu)
 
 static int handle_preemption_timer(struct kvm_vcpu *vcpu)
 {
-	if (!to_vmx(vcpu)->req_immediate_exit)
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!vmx->req_immediate_exit &&
+	    !unlikely(vmx->loaded_vmcs->hv_timer_soft_disabled))
 		kvm_lapic_expired_hv_timer(vcpu);
+
 	return 1;
 }
 
@@ -6356,12 +6362,6 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
 					msrs[i].host, false);
 }
 
-static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
-{
-	vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, val);
-	pin_controls_setbit(vmx, PIN_BASED_VMX_PREEMPTION_TIMER);
-}
-
 static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6369,11 +6369,9 @@ static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
 	u32 delta_tsc;
 
 	if (vmx->req_immediate_exit) {
-		vmx_arm_hv_timer(vmx, 0);
-		return;
-	}
-
-	if (vmx->hv_deadline_tsc != -1) {
+		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
+		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
+	} else if (vmx->hv_deadline_tsc != -1) {
 		tscl = rdtsc();
 		if (vmx->hv_deadline_tsc > tscl)
 			/* set_hv_timer ensures the delta fits in 32-bits */
@@ -6382,11 +6380,12 @@ static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
 		else
 			delta_tsc = 0;
 
-		vmx_arm_hv_timer(vmx, delta_tsc);
-		return;
+		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
+		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
+	} else if (!vmx->loaded_vmcs->hv_timer_soft_disabled) {
+		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, -1);
+		vmx->loaded_vmcs->hv_timer_soft_disabled = true;
 	}
-
-	pin_controls_clearbit(vmx, PIN_BASED_VMX_PREEMPTION_TIMER);
 }
 
 void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)
@@ -6458,7 +6457,8 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	atomic_switch_perf_msrs(vmx);
 
-	vmx_update_hv_timer(vcpu);
+	if (enable_preemption_timer)
+		vmx_update_hv_timer(vcpu);
 
 	if (lapic_in_kernel(vcpu) &&
 		vcpu->arch.apic->lapic_timer.timer_advance_ns)
@@ -7565,17 +7565,33 @@ static __init int hardware_setup(void)
 	}
 
 	if (!cpu_has_vmx_preemption_timer())
-		kvm_x86_ops->request_immediate_exit = __kvm_request_immediate_exit;
+		enable_preemption_timer = false;
 
-	if (cpu_has_vmx_preemption_timer() && enable_preemption_timer) {
+	if (enable_preemption_timer) {
+		u64 use_timer_freq = 5000ULL * 1000 * 1000;
 		u64 vmx_msr;
 
 		rdmsrl(MSR_IA32_VMX_MISC, vmx_msr);
 		cpu_preemption_timer_multi =
 			vmx_msr & VMX_MISC_PREEMPTION_TIMER_RATE_MASK;
-	} else {
+
+		if (tsc_khz)
+			use_timer_freq = (u64)tsc_khz * 1000;
+		use_timer_freq >>= cpu_preemption_timer_multi;
+
+		/*
+		 * KVM "disables" the preemption timer by setting it to its max
+		 * value.  Don't use the timer if it might cause spurious exits
+		 * at a rate faster than 0.1 Hz (of uninterrupted guest time).
+		 */
+		if ((0xffffffffu / use_timer_freq) < 10)
+			enable_preemption_timer = false;
+	}
+
+	if (!enable_preemption_timer) {
 		kvm_x86_ops->set_hv_timer = NULL;
 		kvm_x86_ops->cancel_hv_timer = NULL;
+		kvm_x86_ops->request_immediate_exit = __kvm_request_immediate_exit;
 	}
 
 	kvm_set_posted_intr_wakeup_handler(wakeup_handler);
-- 
1.8.3.1



  parent reply index

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-13 17:02 [PATCH 00/43] VMX optimizations Paolo Bonzini
2019-06-13 17:02 ` [PATCH 01/43] KVM: VMX: Fix handling of #MC that occurs during VM-Entry Paolo Bonzini
2019-06-13 17:24   ` Jim Mattson
2019-06-13 17:02 ` [PATCH 02/43] kvm: nVMX: small cleanup in handle_exception Paolo Bonzini
2019-06-13 17:02 ` [PATCH 03/43] KVM: VMX: Read cached VM-Exit reason to detect external interrupt Paolo Bonzini
2019-06-13 17:02 ` [PATCH 04/43] KVM: VMX: Store the host kernel's IDT base in a global variable Paolo Bonzini
2019-06-13 17:02 ` [PATCH 05/43] KVM: x86: Move kvm_{before,after}_interrupt() calls to vendor code Paolo Bonzini
2019-06-13 17:02 ` [PATCH 06/43] KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn Paolo Bonzini
2019-06-13 17:02 ` [PATCH 07/43] KVM: nVMX: Intercept VMWRITEs to read-only shadow VMCS fields Paolo Bonzini
2019-06-13 17:02 ` [PATCH 08/43] KVM: nVMX: Intercept VMWRITEs to GUEST_{CS,SS}_AR_BYTES Paolo Bonzini
2019-06-13 17:02 ` [PATCH 09/43] KVM: nVMX: Track vmcs12 offsets for shadowed VMCS fields Paolo Bonzini
2019-06-13 17:02 ` [PATCH 10/43] KVM: nVMX: Lift sync_vmcs12() out of prepare_vmcs12() Paolo Bonzini
2019-06-13 17:02 ` [PATCH 11/43] KVM: nVMX: Use descriptive names for VMCS sync functions and flags Paolo Bonzini
2019-06-13 17:02 ` [PATCH 12/43] KVM: nVMX: Add helpers to identify shadowed VMCS fields Paolo Bonzini
2019-06-14 16:10   ` Sean Christopherson
2019-06-13 17:02 ` [PATCH 13/43] KVM: nVMX: Sync rarely accessed guest fields only when needed Paolo Bonzini
2019-06-13 17:03 ` [PATCH 14/43] KVM: nVMX: Rename prepare_vmcs02_*_full to prepare_vmcs02_*_rare Paolo Bonzini
2019-06-13 17:03 ` [PATCH 15/43] KVM: VMX: Always signal #GP on WRMSR to MSR_IA32_CR_PAT with bad value Paolo Bonzini
2019-06-13 17:03 ` [PATCH 16/43] KVM: nVMX: Always sync GUEST_BNDCFGS when it comes from vmcs01 Paolo Bonzini
     [not found]   ` <20190615221602.93C5721851@mail.kernel.org>
2019-06-15 22:40     ` Liran Alon
2019-06-13 17:03 ` [PATCH 17/43] KVM: nVMX: Write ENCLS-exiting bitmap once per vmcs02 Paolo Bonzini
2019-06-13 17:03 ` [PATCH 18/43] KVM: nVMX: Don't rewrite GUEST_PML_INDEX during nested VM-Entry Paolo Bonzini
2019-06-13 17:03 ` [PATCH 19/43] KVM: VMX: simplify vmx_prepare_switch_to_{guest,host} Paolo Bonzini
2019-06-13 17:03 ` [PATCH 20/43] KVM: nVMX: Don't "put" vCPU or host state when switching VMCS Paolo Bonzini
2019-06-13 17:03 ` [PATCH 21/43] KVM: nVMX: Don't reread VMCS-agnostic " Paolo Bonzini
2019-06-14 16:25   ` Sean Christopherson
2019-06-13 17:03 ` [PATCH 22/43] KVM: nVMX: Don't dump VMCS if virtual APIC page can't be mapped Paolo Bonzini
2019-06-17 19:17   ` Radim Krčmář
2019-06-17 20:07     ` Sean Christopherson
2019-06-18  9:43       ` Paolo Bonzini
2019-06-13 17:03 ` [PATCH 23/43] KVM: nVMX: Don't speculatively write virtual-APIC page address Paolo Bonzini
2019-06-13 17:03 ` [PATCH 24/43] KVM: nVMX: Don't speculatively write APIC-access " Paolo Bonzini
2019-06-13 17:03 ` [PATCH 25/43] KVM: nVMX: Update vmcs12 for MSR_IA32_CR_PAT when it's written Paolo Bonzini
2019-06-13 17:03 ` [PATCH 26/43] KVM: nVMX: Update vmcs12 for SYSENTER MSRs when they're written Paolo Bonzini
2019-06-13 17:03 ` [PATCH 27/43] KVM: nVMX: Update vmcs12 for MSR_IA32_DEBUGCTLMSR when it's written Paolo Bonzini
2019-06-13 17:03 ` [PATCH 28/43] KVM: nVMX: Don't update GUEST_BNDCFGS if it's clean in HV eVMCS Paolo Bonzini
2019-06-13 17:03 ` [PATCH 29/43] KVM: x86: introduce is_pae_paging Paolo Bonzini
2019-06-13 17:03 ` [PATCH 30/43] KVM: nVMX: Copy PDPTRs to/from vmcs12 only when necessary Paolo Bonzini
2019-06-13 17:03 ` [PATCH 31/43] KVM: nVMX: Use adjusted pin controls for vmcs02 Paolo Bonzini
2019-06-13 17:03 ` [PATCH 32/43] KVM: VMX: Add builder macros for shadowing controls Paolo Bonzini
2019-06-13 17:03 ` [PATCH 33/43] KVM: VMX: Shadow VMCS pin controls Paolo Bonzini
2019-06-13 17:03 ` [PATCH 34/43] KVM: VMX: Shadow VMCS primary execution controls Paolo Bonzini
2019-06-13 17:03 ` [PATCH 35/43] KVM: VMX: Shadow VMCS secondary " Paolo Bonzini
2019-06-13 17:03 ` [PATCH 36/43] KVM: nVMX: Shadow VMCS controls on a per-VMCS basis Paolo Bonzini
2019-06-13 17:03 ` [PATCH 37/43] KVM: nVMX: Don't reset VMCS controls shadow on VMCS switch Paolo Bonzini
2019-06-13 17:03 ` [PATCH 38/43] KVM: VMX: Explicitly initialize controls shadow at VMCS allocation Paolo Bonzini
2019-06-13 17:03 ` [PATCH 39/43] KVM: nVMX: Preserve last USE_MSR_BITMAPS when preparing vmcs02 Paolo Bonzini
2019-06-13 17:03 ` [PATCH 40/43] KVM: nVMX: Preset *DT exiting in vmcs02 when emulating UMIP Paolo Bonzini
2019-06-13 17:03 ` [PATCH 41/43] KVM: VMX: Drop hv_timer_armed from 'struct loaded_vmcs' Paolo Bonzini
2019-06-13 17:03 ` Paolo Bonzini [this message]
2019-06-14 16:34   ` [PATCH 42/43] KVM: VMX: Leave preemption timer running when it's disabled Sean Christopherson
2019-06-13 17:03 ` [PATCH 43/43] KVM: nVMX: shadow pin based execution controls Paolo Bonzini
2019-06-14 16:34   ` Sean Christopherson

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1560445409-17363-43-git-send-email-pbonzini@redhat.com \
    --to=pbonzini@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sean.j.christopherson@intel.com \
    --cc=vkuznets@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

KVM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/kvm/0 kvm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 kvm kvm/ https://lore.kernel.org/kvm \
		kvm@vger.kernel.org kvm@archiver.kernel.org
	public-inbox-index kvm


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.kvm


AGPL code for this site: git clone https://public-inbox.org/ public-inbox