[PATCH v2 0/7] KVM: lapic: Fix a variety of timer adv issues

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/7] KVM: lapic: Fix a variety of timer adv issues
@ 2019-04-16 17:48 Sean Christopherson
  2019-04-16 17:48 ` [PATCH v2 1/7] KVM: lapic: Hard cap the auto-calculated timer advancement Sean Christopherson
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Sean Christopherson @ 2019-04-16 17:48 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář; +Cc: kvm, Liran Alon, Wanpeng Li

The recent change to automatically calculate lapic_timer_advance_ns
introduced a handful of gnarly bugs, and also exposed a latent bug by
virtue of the advancement logic being enabled by default.  Inspection
and debug revealed several other opportunities for minor improvements.

The primary issue is that the auto-tuning of lapic_timer_advance_ns is
completely unbounded, e.g. there's nothing in the logic that prevents the
advancement from creeping up to hundreds of milliseconds.  Adjusting the
timers by large amounts creates major discrepancies in the guest, e.g. a
timer that was configured to fire after multiple milliseconds may arrive
before the guest executes a single instruction.  While technically correct
from a time perspective, it breaks a reasonable assumption from the guest
that it can execute some number of instructions between timer events.

The other major issue is that the advancement is global, while TSC scaling
is done on a per-vCPU basic.  Adjusting the advancement at runtime
exacerbates this as there is no protection against multiple vCPUs and/or
multiple VMs concurrently modifying the advancement value, e.g. it can
effectively become corrupted or never stabilize due to getting bounced all
over tarnation.

As for the latent bug, when timer advancement was applied to the hv_timer,
i.e. the VMX preemption timer, the logic to trigger wait_for_lapic_timer()
was not updated.  As a result, a timer interrupt emulated via the hv_timer
can easily arrive too early from a *time* perspective, as opposed to
simply arriving early from a "number of instructions executed" perspective.

v2:
 - Add explicit param to control automatic tuning. [Liran]
 - Document the effect of per-vCPU tracking on the module params.
 - Use fancy math to convert guest clockcycles to host nanoseconds
   instead of brute forcing the delay with a for loop. [Liran]
 - Refactor start_hv_timer()'s return semantics to move the "expired
   timer" handling up a level. [Liran and Paolo]
 - Add Liran's Reviewed-by tags.

v1: https://patchwork.kernel.org/cover/10899101/

Sean Christopherson (7):
  KVM: lapic: Hard cap the auto-calculated timer advancement
  KVM: lapic: Convert guest TSC to host time domain when delaying
  KVM: lapic: Track lapic timer advance per vCPU
  KVM: lapic: Allow user to disable auto-tuning of timer advancement
  KVM: lapic: Busy wait for timer to expire when using hv_timer
  KVM: lapic: Refactor start_hv_timer()'s semantics and usage
  KVM: VMX: Skip delta_tsc shift-and-divide if the dividend is zero

 arch/x86/kvm/lapic.c   | 104 +++++++++++++++++++++++++++--------------
 arch/x86/kvm/lapic.h   |   5 +-
 arch/x86/kvm/vmx/vmx.c |   9 ++--
 arch/x86/kvm/x86.c     |  11 +++--
 arch/x86/kvm/x86.h     |   2 -
 5 files changed, 84 insertions(+), 47 deletions(-)

-- 
2.21.0

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/7] KVM: lapic: Hard cap the auto-calculated timer advancement
  2019-04-16 17:48 [PATCH v2 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
@ 2019-04-16 17:48 ` Sean Christopherson
  2019-04-16 17:48 ` [PATCH v2 2/7] KVM: lapic: Convert guest TSC to host time domain when delaying Sean Christopherson
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2019-04-16 17:48 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář; +Cc: kvm, Liran Alon, Wanpeng Li

To minimize the latency of timer interrupts as observed by the guest,
KVM adjusts the values it programs into the host timers to account for
the host's overhead of programming and handling the timer event.  Now
that the timer advancement is automatically tuned during runtime, it's
effectively unbounded by default, e.g. if KVM is running as L1 the
advancement can measure in hundreds of milliseconds.

Place a somewhat arbitrary hard cap of 5000ns on the auto-calculated
advancement, as large advancements can break reasonable assumptions of
the guest, e.g. that a timer configured to fire after 1ms won't arrive
on the next instruction.  Although KVM busy waits to mitigate the timer
event arriving too early, complications can arise when shifting the
interrupt too far, e.g. vmx.flat/interrupt in kvm-unit-tests will fail
when its "host" exits on interrupts (because the INTR is injected before
the gets executes STI+HLT).  Arguably the unit test is "broken" in the
sense that delaying the timer interrupt by 1ms doesn't technically
guarantee the interrupt will arrive after STI+HLT, but it's a reasonable
assumption that KVM should support.

Furthermore, an unbounded advancement also effectively unbounds the time
spent busy waiting, e.g. if the guest programs a timer with a very large
delay.

Arguably the advancement logic could simply be disabled when running as
L1, but KVM needs to bound the advancement time regardless, e.g. if the
TSC is unstable and the calculations get wildly out of whack.  And
allowing the advancement when running as L1 is a good stress test of
sorts for the logic.

Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9bf70cf84564..92446cba9b24 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -74,6 +74,7 @@ static bool lapic_timer_advance_adjust_done = false;
 #define LAPIC_TIMER_ADVANCE_ADJUST_DONE 100
 /* step-by-step approximation to mitigate fluctuation */
 #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
+#define LAPIC_TIMER_ADVANCE_MAX_NS	5000

 static inline int apic_test_vector(int vec, void *bitmap)
 {
@@ -1522,6 +1523,10 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 		}
 		if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
 			lapic_timer_advance_adjust_done = true;
+		if (unlikely(lapic_timer_advance_ns > LAPIC_TIMER_ADVANCE_MAX_NS)) {
+			lapic_timer_advance_ns = LAPIC_TIMER_ADVANCE_MAX_NS;
+			lapic_timer_advance_adjust_done = true;
+		}
 	}
 }

-- 
2.21.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 2/7] KVM: lapic: Convert guest TSC to host time domain when delaying
  2019-04-16 17:48 [PATCH v2 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
  2019-04-16 17:48 ` [PATCH v2 1/7] KVM: lapic: Hard cap the auto-calculated timer advancement Sean Christopherson
@ 2019-04-16 17:48 ` Sean Christopherson
  2019-04-16 18:35   ` Liran Alon
  2019-04-16 17:48 ` [PATCH v2 3/7] KVM: lapic: Track lapic timer advance per vCPU Sean Christopherson
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2019-04-16 17:48 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář; +Cc: kvm, Liran Alon, Wanpeng Li

To minimize the latency of timer interrupts as observed by the guest,
KVM adjusts the values it programs into the host timers to account for
the host's overhead of programming and handling the timer event.  In
the event that the adjustments are too aggressive, i.e. the timer fires
earlier than the guest expects, KVM busy waits immediately prior to
entering the guest.

Currently, KVM manually converts the delay from nanoseconds to clock
cycles.  But, the conversion is done in the guest's time domain, while
the delay occurs in the host's time domain, i.e. the delay may not be
accurate and could wait too little or too long.

Convert the delay from guest clock cycles to host nanoseconds and use
ndelay() instead of __delay() to provide more accurate timing.  This
also avoids the need to convert lapic_timer_advance_ns, which is used
to cap the delay, to guest clock cycles.

Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 92446cba9b24..5891c0badfa6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1502,10 +1502,12 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);

-	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
-	if (guest_tsc < tsc_deadline)
-		__delay(min(tsc_deadline - guest_tsc,
-			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
+	/* ndelay uses delay_tsc whenever the hardware has TSC, thus always. */
+	if (guest_tsc < tsc_deadline) {
+		ns = (tsc_deadline - guest_tsc) * 1000000ULL;
+		do_div(ns, vcpu->arch.virtual_tsc_khz);
+		ndelay(min_t(u32, ns, lapic_timer_advance_ns));
+	}

 	if (!lapic_timer_advance_adjust_done) {
 		/* too early */
-- 
2.21.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 3/7] KVM: lapic: Track lapic timer advance per vCPU
  2019-04-16 17:48 [PATCH v2 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
  2019-04-16 17:48 ` [PATCH v2 1/7] KVM: lapic: Hard cap the auto-calculated timer advancement Sean Christopherson
  2019-04-16 17:48 ` [PATCH v2 2/7] KVM: lapic: Convert guest TSC to host time domain when delaying Sean Christopherson
@ 2019-04-16 17:48 ` Sean Christopherson
  2019-04-16 17:48 ` [PATCH v2 4/7] KVM: lapic: Allow user to disable auto-tuning of timer advancement Sean Christopherson
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2019-04-16 17:48 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář; +Cc: kvm, Liran Alon, Wanpeng Li

Automatically adjusting the globally-shared timer advancement could
corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting
the advancement value.  That could be partially fixed by using a local
variable for the arithmetic, but it would still be susceptible to a
race when setting timer_advance_adjust_done.

And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
correct calibration for a given vCPU may not apply to all vCPUs.

Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is
effectively violated when finding a stable advancement takes an extended
amount of timer.

Opportunistically change the definition of lapic_timer_advance_ns to
a u32 so that it matches the style of struct kvm_timer.  Explicitly
pass the param to kvm_create_lapic() so that it doesn't have to be
exposed to lapic.c, thus reducing the probability of unintentionally
using the global value instead of the per-vCPU value.

Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c   | 34 ++++++++++++++++++----------------
 arch/x86/kvm/lapic.h   |  4 +++-
 arch/x86/kvm/vmx/vmx.c |  4 +++-
 arch/x86/kvm/x86.c     |  7 +++----
 arch/x86/kvm/x86.h     |  2 --
 5 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5891c0badfa6..3cd62edf8084 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -70,7 +70,6 @@
 #define APIC_BROADCAST			0xFF
 #define X2APIC_BROADCAST		0xFFFFFFFFul
 
-static bool lapic_timer_advance_adjust_done = false;
 #define LAPIC_TIMER_ADVANCE_ADJUST_DONE 100
 /* step-by-step approximation to mitigate fluctuation */
 #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
@@ -1486,6 +1485,7 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
 void wait_lapic_expire(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
+	u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns;
 	u64 guest_tsc, tsc_deadline, ns;
 
 	if (!lapic_in_kernel(vcpu))
@@ -1506,35 +1506,37 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 	if (guest_tsc < tsc_deadline) {
 		ns = (tsc_deadline - guest_tsc) * 1000000ULL;
 		do_div(ns, vcpu->arch.virtual_tsc_khz);
-		ndelay(min_t(u32, ns, lapic_timer_advance_ns));
+		ndelay(min_t(u32, ns, timer_advance_ns));
 	}
 
-	if (!lapic_timer_advance_adjust_done) {
+	if (!apic->lapic_timer.timer_advance_adjust_done) {
 		/* too early */
 		if (guest_tsc < tsc_deadline) {
 			ns = (tsc_deadline - guest_tsc) * 1000000ULL;
 			do_div(ns, vcpu->arch.virtual_tsc_khz);
-			lapic_timer_advance_ns -= min((unsigned int)ns,
-				lapic_timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+			timer_advance_ns -= min((u32)ns,
+				timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
 		} else {
 		/* too late */
 			ns = (guest_tsc - tsc_deadline) * 1000000ULL;
 			do_div(ns, vcpu->arch.virtual_tsc_khz);
-			lapic_timer_advance_ns += min((unsigned int)ns,
-				lapic_timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+			timer_advance_ns += min((u32)ns,
+				timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
 		}
 		if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
-			lapic_timer_advance_adjust_done = true;
-		if (unlikely(lapic_timer_advance_ns > LAPIC_TIMER_ADVANCE_MAX_NS)) {
-			lapic_timer_advance_ns = LAPIC_TIMER_ADVANCE_MAX_NS;
-			lapic_timer_advance_adjust_done = true;
+			apic->lapic_timer.timer_advance_adjust_done = true;
+		if (unlikely(timer_advance_ns > LAPIC_TIMER_ADVANCE_MAX_NS)) {
+			timer_advance_ns = LAPIC_TIMER_ADVANCE_MAX_NS;
+			apic->lapic_timer.timer_advance_adjust_done = true;
 		}
+		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
 	}
 }
 
 static void start_sw_tscdeadline(struct kvm_lapic *apic)
 {
-	u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
+	struct kvm_timer *ktimer = &apic->lapic_timer;
+	u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
 	u64 ns = 0;
 	ktime_t expire;
 	struct kvm_vcpu *vcpu = apic->vcpu;
@@ -1553,9 +1555,8 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
 		ns = (tscdeadline - guest_tsc) * 1000000ULL;
 		do_div(ns, this_tsc_khz);
 		expire = ktime_add_ns(now, ns);
-		expire = ktime_sub_ns(expire, lapic_timer_advance_ns);
-		hrtimer_start(&apic->lapic_timer.timer,
-				expire, HRTIMER_MODE_ABS_PINNED);
+		expire = ktime_sub_ns(expire, ktimer->timer_advance_ns);
+		hrtimer_start(&ktimer->timer, expire, HRTIMER_MODE_ABS_PINNED);
 	} else
 		apic_timer_expired(apic);
 
@@ -2262,7 +2263,7 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
 		return HRTIMER_NORESTART;
 }
 
-int kvm_create_lapic(struct kvm_vcpu *vcpu)
+int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
 {
 	struct kvm_lapic *apic;
 
@@ -2286,6 +2287,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
 	hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
 		     HRTIMER_MODE_ABS_PINNED);
 	apic->lapic_timer.timer.function = apic_timer_fn;
+	apic->lapic_timer.timer_advance_ns = timer_advance_ns;
 
 	/*
 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index ff6ef9c3d760..3e97f8a68967 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -31,8 +31,10 @@ struct kvm_timer {
 	u32 timer_mode_mask;
 	u64 tscdeadline;
 	u64 expired_tscdeadline;
+	u32 timer_advance_ns;
 	atomic_t pending;			/* accumulated triggered timers */
 	bool hv_timer_in_use;
+	bool timer_advance_adjust_done;
 };
 
 struct kvm_lapic {
@@ -62,7 +64,7 @@ struct kvm_lapic {
 
 struct dest_map;
 
-int kvm_create_lapic(struct kvm_vcpu *vcpu);
+int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns);
 void kvm_free_lapic(struct kvm_vcpu *vcpu);
 
 int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5f90616da275..a3211e3ee679 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7031,6 +7031,7 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
 {
 	struct vcpu_vmx *vmx;
 	u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
+	struct kvm_timer *ktimer = &vcpu->arch.apic->lapic_timer;
 
 	if (kvm_mwait_in_guest(vcpu->kvm))
 		return -EOPNOTSUPP;
@@ -7039,7 +7040,8 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
 	tscl = rdtsc();
 	guest_tscl = kvm_read_l1_tsc(vcpu, tscl);
 	delta_tsc = max(guest_deadline_tsc, guest_tscl) - guest_tscl;
-	lapic_timer_advance_cycles = nsec_to_cycles(vcpu, lapic_timer_advance_ns);
+	lapic_timer_advance_cycles = nsec_to_cycles(vcpu,
+						    ktimer->timer_advance_ns);
 
 	if (delta_tsc > lapic_timer_advance_cycles)
 		delta_tsc -= lapic_timer_advance_cycles;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 38440316a806..b303a21a2bc2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -137,9 +137,8 @@ static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 
 /* lapic timer advance (tscdeadline mode only) in nanoseconds */
-unsigned int __read_mostly lapic_timer_advance_ns = 1000;
+static u32 __read_mostly lapic_timer_advance_ns = 1000;
 module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
-EXPORT_SYMBOL_GPL(lapic_timer_advance_ns);
 
 static bool __read_mostly vector_hashing = true;
 module_param(vector_hashing, bool, S_IRUGO);
@@ -7891,7 +7890,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	}
 
 	trace_kvm_entry(vcpu->vcpu_id);
-	if (lapic_timer_advance_ns)
+	if (vcpu->arch.apic->lapic_timer.timer_advance_ns)
 		wait_lapic_expire(vcpu);
 	guest_enter_irqoff();
 
@@ -9079,7 +9078,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 
 	if (irqchip_in_kernel(vcpu->kvm)) {
 		vcpu->arch.apicv_active = kvm_x86_ops->get_enable_apicv(vcpu);
-		r = kvm_create_lapic(vcpu);
+		r = kvm_create_lapic(vcpu, lapic_timer_advance_ns);
 		if (r < 0)
 			goto fail_mmu_destroy;
 	} else
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index eda954441213..a470ff0868c5 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -294,8 +294,6 @@ extern u64 kvm_supported_xcr0(void);
 
 extern unsigned int min_timer_period_us;
 
-extern unsigned int lapic_timer_advance_ns;
-
 extern bool enable_vmware_backdoor;
 
 extern struct static_key kvm_no_apic_vcpu;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 4/7] KVM: lapic: Allow user to disable auto-tuning of timer advancement
  2019-04-16 17:48 [PATCH v2 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
                   ` (2 preceding siblings ...)
  2019-04-16 17:48 ` [PATCH v2 3/7] KVM: lapic: Track lapic timer advance per vCPU Sean Christopherson
@ 2019-04-16 17:48 ` Sean Christopherson
  2019-04-16 18:37   ` Liran Alon
  2019-04-16 17:48 ` [PATCH v2 5/7] KVM: lapic: Busy wait for timer to expire when using hv_timer Sean Christopherson
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2019-04-16 17:48 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář; +Cc: kvm, Liran Alon, Wanpeng Li

Add a new KVM param, lapic_timer_advance_autotune, to allow userspace
to disable auto-tuning the timer advancement, e.g. to manually tune the
delay.

Note, this does not completely restore the original behavior of
lapic_timer_advance_ns.  Prior to tracking the advancement per vCPU,
which is necessary to support autotuning, userspace could adjust
lapic_timer_advance_ns for *running* vCPU.  With per-vCPU tracking, the
module params are snapshotted at vCPU creation, i.e. applying a new
advancement effectively requires restarting a VM.

Dynamically updating a running vCPU is possible, e.g. a helper could be
added to retrieve the desired delay, choosing between the global module
param and the per-VCPU value depending on whether or not auto-tuning is
(globally) enabled, but introduces a great deal of complexity.  The
wrapper itself is not complex, but understanding and documenting the
effects of dynamically toggling auto-tuning and/or adjusting the timer
advancement is nigh impossible since the behavior would be dependent on
KVM's implementation as well as compiler optimizations.  In other words,
providing stable behavior would require extremely careful consideration
now and in the future.

Given that the expected use of a manually-tuned timer advancement is to
"tune once, run many", use the vastly simpler approach of recognizing
changes to the module params only when creating a new vCPU.

Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 4 +++-
 arch/x86/kvm/lapic.h | 3 ++-
 arch/x86/kvm/x86.c   | 6 +++++-
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 3cd62edf8084..37d3489f68c9 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2263,7 +2263,8 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
 		return HRTIMER_NORESTART;
 }
 
-int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
+int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns,
+		     bool timer_advance_autotune)
 {
 	struct kvm_lapic *apic;
 
@@ -2288,6 +2289,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
 		     HRTIMER_MODE_ABS_PINNED);
 	apic->lapic_timer.timer.function = apic_timer_fn;
 	apic->lapic_timer.timer_advance_ns = timer_advance_ns;
+	apic->lapic_timer.timer_advance_adjust_done = !timer_advance_autotune;
 
 	/*
 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 3e97f8a68967..7fa1aed02c14 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -64,7 +64,8 @@ struct kvm_lapic {
 
 struct dest_map;
 
-int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns);
+int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns,
+		     bool timer_advance_autotune);
 void kvm_free_lapic(struct kvm_vcpu *vcpu);
 
 int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b303a21a2bc2..60d80e354f62 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -140,6 +140,9 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 static u32 __read_mostly lapic_timer_advance_ns = 1000;
 module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
 
+static bool __read_mostly lapic_timer_advance_autotune = true;
+module_param(lapic_timer_advance_autotune, bool, S_IRUGO | S_IWUSR);
+
 static bool __read_mostly vector_hashing = true;
 module_param(vector_hashing, bool, S_IRUGO);
 
@@ -9078,7 +9081,8 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 
 	if (irqchip_in_kernel(vcpu->kvm)) {
 		vcpu->arch.apicv_active = kvm_x86_ops->get_enable_apicv(vcpu);
-		r = kvm_create_lapic(vcpu, lapic_timer_advance_ns);
+		r = kvm_create_lapic(vcpu, lapic_timer_advance_ns,
+				     lapic_timer_advance_autotune);
 		if (r < 0)
 			goto fail_mmu_destroy;
 	} else
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 5/7] KVM: lapic: Busy wait for timer to expire when using hv_timer
  2019-04-16 17:48 [PATCH v2 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
                   ` (3 preceding siblings ...)
  2019-04-16 17:48 ` [PATCH v2 4/7] KVM: lapic: Allow user to disable auto-tuning of timer advancement Sean Christopherson
@ 2019-04-16 17:48 ` Sean Christopherson
  2019-04-16 17:48 ` [PATCH v2 6/7] KVM: lapic: Refactor start_hv_timer()'s semantics and usage Sean Christopherson
  2019-04-16 17:48 ` [PATCH v2 7/7] KVM: VMX: Skip delta_tsc shift-and-divide if the dividend is zero Sean Christopherson
  6 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2019-04-16 17:48 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář; +Cc: kvm, Liran Alon, Wanpeng Li

...now that VMX's preemption timer, i.e. the hv_timer, also adjusts its
programmed time based on lapic_timer_advance_ns.  Without the delay, a
guest can see a timer interrupt arrive before the requested time when
KVM is using the hv_timer to emulate the guest's interrupt.

Fixes: c5ce8235cffa0 ("KVM: VMX: Optimize tscdeadline timer latency")
Cc: <stable@vger.kernel.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 37d3489f68c9..fa2dbca5ad6a 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1455,7 +1455,7 @@ static void apic_timer_expired(struct kvm_lapic *apic)
 	if (swait_active(q))
 		swake_up_one(q);
 
-	if (apic_lvtt_tscdeadline(apic))
+	if (apic_lvtt_tscdeadline(apic) || ktimer->hv_timer_in_use)
 		ktimer->expired_tscdeadline = ktimer->tscdeadline;
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 6/7] KVM: lapic: Refactor start_hv_timer()'s semantics and usage
  2019-04-16 17:48 [PATCH v2 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
                   ` (4 preceding siblings ...)
  2019-04-16 17:48 ` [PATCH v2 5/7] KVM: lapic: Busy wait for timer to expire when using hv_timer Sean Christopherson
@ 2019-04-16 17:48 ` Sean Christopherson
  2019-04-16 18:44   ` Liran Alon
  2019-04-16 17:48 ` [PATCH v2 7/7] KVM: VMX: Skip delta_tsc shift-and-divide if the dividend is zero Sean Christopherson
  6 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2019-04-16 17:48 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář; +Cc: kvm, Liran Alon, Wanpeng Li

start_hv_timer() currently returns a bool, with %false interpreted as
"failure" and %true as "success".  But the reality is far more complex,
as the %false is also returned when programming was successful, but
must be canceled because there was already a pending timer interrupt.
It can also return %false when programming was successful but the timer
was marked expired because the deadline already passed, in which case
the hv timer must also be canceled to avoid double injection.

Functionally, the existing code is correct in the sense that it doesn't
doing anything visibily wrong, e.g. generate spurious interrupts or miss
an interrupt.  But it's extremely confusing and inefficient, e.g. there
are multiple extraneous calls to apic_timer_expired() that effectively
get dropped due to @timer_pending being %true.

Refactor start_hv_timer() to return an 'int' and move the handling of
the various combinations of success vs. expiration vs. periodic up a
level to restart_apic_timer(), which eliminates the needs to overload
the meaning of %false.

Cc: Wanpeng Li <wanpengli@tencent.com>
Suggested-by: Liran Alon <liran.alon@oracle.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 59 ++++++++++++++++++++++++++++++--------------
 1 file changed, 40 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index fa2dbca5ad6a..b867569107d5 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1678,41 +1678,35 @@ static void cancel_hv_timer(struct kvm_lapic *apic)
 	apic->lapic_timer.hv_timer_in_use = false;
 }
 
-static bool start_hv_timer(struct kvm_lapic *apic)
+/*
+ * Return:
+ *  Negative errno on failure,
+ *  '1' on success and the timer's deadline has passed,
+ *  '0' on success and the timer's deadline is some time in the future.
+ */
+static int start_hv_timer(struct kvm_lapic *apic)
 {
 	struct kvm_timer *ktimer = &apic->lapic_timer;
 	int r;
 
 	WARN_ON(preemptible());
 	if (!kvm_x86_ops->set_hv_timer)
-		return false;
+		return -ENOENT;
 
 	if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
-		return false;
+		return -EBUSY;
 
 	if (!ktimer->tscdeadline)
-		return false;
+		return -EINVAL;
 
 	r = kvm_x86_ops->set_hv_timer(apic->vcpu, ktimer->tscdeadline);
 	if (r < 0)
-		return false;
+		return r;
 
 	ktimer->hv_timer_in_use = true;
 	hrtimer_cancel(&ktimer->timer);
 
-	/*
-	 * Also recheck ktimer->pending, in case the sw timer triggered in
-	 * the window.  For periodic timer, leave the hv timer running for
-	 * simplicity, and the deadline will be recomputed on the next vmexit.
-	 */
-	if (!apic_lvtt_period(apic) && (r || atomic_read(&ktimer->pending))) {
-		if (r)
-			apic_timer_expired(apic);
-		return false;
-	}
-
-	trace_kvm_hv_timer_state(apic->vcpu->vcpu_id, true);
-	return true;
+	return r;
 }
 
 static void start_sw_timer(struct kvm_lapic *apic)
@@ -1734,9 +1728,36 @@ static void start_sw_timer(struct kvm_lapic *apic)
 
 static void restart_apic_timer(struct kvm_lapic *apic)
 {
+	int r;
+
 	preempt_disable();
-	if (!start_hv_timer(apic))
+	r = start_hv_timer(apic);
+	if (r < 0) {
 		start_sw_timer(apic);
+	} else if (!apic_lvtt_period(apic)) {
+		/*
+		 * For a periodic timer, leave the timer running for simplicity
+		 * so that the deadline will be recomputed on the next VM-Exit.
+		 */
+		if (atomic_read(&apic->lapic_timer.pending)) {
+			/*
+			 * Cancel the hv timer if the sw timer fired while the
+			 * hv timer was being programmed.
+			 */
+			cancel_hv_timer(apic);
+		} else if (r) {
+			/*
+			 * start_hv_timer() returns '1' if the timer's deadline
+			 * has already passed.  Again, go through with the full
+			 * programming in the periodic case as the VM-Exit is
+			 * needed to recompute the periodic timer deadline.
+			 */
+			apic_timer_expired(apic);
+			cancel_hv_timer(apic);
+		}
+	}
+	trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
+				 apic->lapic_timer.hv_timer_in_use);
 	preempt_enable();
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 7/7] KVM: VMX: Skip delta_tsc shift-and-divide if the dividend is zero
  2019-04-16 17:48 [PATCH v2 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
                   ` (5 preceding siblings ...)
  2019-04-16 17:48 ` [PATCH v2 6/7] KVM: lapic: Refactor start_hv_timer()'s semantics and usage Sean Christopherson
@ 2019-04-16 17:48 ` Sean Christopherson
  6 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2019-04-16 17:48 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář; +Cc: kvm, Liran Alon, Wanpeng Li

Ten percent of nothin' is... let me do the math here.  Nothin' into
nothin', carry the nothin'...

Cc: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a3211e3ee679..2fec11bc0ac9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7050,10 +7050,9 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
 
 	/* Convert to host delta tsc if tsc scaling is enabled */
 	if (vcpu->arch.tsc_scaling_ratio != kvm_default_tsc_scaling_ratio &&
-			u64_shl_div_u64(delta_tsc,
+	    delta_tsc && u64_shl_div_u64(delta_tsc,
 				kvm_tsc_scaling_ratio_frac_bits,
-				vcpu->arch.tsc_scaling_ratio,
-				&delta_tsc))
+				vcpu->arch.tsc_scaling_ratio, &delta_tsc))
 		return -ERANGE;
 
 	/*
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/7] KVM: lapic: Convert guest TSC to host time domain when delaying
  2019-04-16 17:48 ` [PATCH v2 2/7] KVM: lapic: Convert guest TSC to host time domain when delaying Sean Christopherson
@ 2019-04-16 18:35   ` Liran Alon
  0 siblings, 0 replies; 12+ messages in thread
From: Liran Alon @ 2019-04-16 18:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 16 Apr 2019, at 20:48, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> To minimize the latency of timer interrupts as observed by the guest,
> KVM adjusts the values it programs into the host timers to account for
> the host's overhead of programming and handling the timer event.  In
> the event that the adjustments are too aggressive, i.e. the timer fires
> earlier than the guest expects, KVM busy waits immediately prior to
> entering the guest.
> 
> Currently, KVM manually converts the delay from nanoseconds to clock
> cycles.  But, the conversion is done in the guest's time domain, while
> the delay occurs in the host's time domain, i.e. the delay may not be
> accurate and could wait too little or too long.
> 
> Convert the delay from guest clock cycles to host nanoseconds and use
> ndelay() instead of __delay() to provide more accurate timing.  This
> also avoids the need to convert lapic_timer_advance_ns, which is used
> to cap the delay, to guest clock cycles.
> 
> Cc: Liran Alon <liran.alon@oracle.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

Reviewed-by: Liran Alon <liran.alon@oracle.com>

> ---
> arch/x86/kvm/lapic.c | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 92446cba9b24..5891c0badfa6 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1502,10 +1502,12 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> 	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
> 
> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
> -	if (guest_tsc < tsc_deadline)
> -		__delay(min(tsc_deadline - guest_tsc,
> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
> +	/* ndelay uses delay_tsc whenever the hardware has TSC, thus always. */
> +	if (guest_tsc < tsc_deadline) {
> +		ns = (tsc_deadline - guest_tsc) * 1000000ULL;
> +		do_div(ns, vcpu->arch.virtual_tsc_khz);
> +		ndelay(min_t(u32, ns, lapic_timer_advance_ns));
> +	}
> 
> 	if (!lapic_timer_advance_adjust_done) {
> 		/* too early */
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 4/7] KVM: lapic: Allow user to disable auto-tuning of timer advancement
  2019-04-16 17:48 ` [PATCH v2 4/7] KVM: lapic: Allow user to disable auto-tuning of timer advancement Sean Christopherson
@ 2019-04-16 18:37   ` Liran Alon
  0 siblings, 0 replies; 12+ messages in thread
From: Liran Alon @ 2019-04-16 18:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 16 Apr 2019, at 20:48, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> Add a new KVM param, lapic_timer_advance_autotune, to allow userspace
> to disable auto-tuning the timer advancement, e.g. to manually tune the
> delay.
> 
> Note, this does not completely restore the original behavior of
> lapic_timer_advance_ns.  Prior to tracking the advancement per vCPU,
> which is necessary to support autotuning, userspace could adjust
> lapic_timer_advance_ns for *running* vCPU.  With per-vCPU tracking, the
> module params are snapshotted at vCPU creation, i.e. applying a new
> advancement effectively requires restarting a VM.
> 
> Dynamically updating a running vCPU is possible, e.g. a helper could be
> added to retrieve the desired delay, choosing between the global module
> param and the per-VCPU value depending on whether or not auto-tuning is
> (globally) enabled, but introduces a great deal of complexity.  The
> wrapper itself is not complex, but understanding and documenting the
> effects of dynamically toggling auto-tuning and/or adjusting the timer
> advancement is nigh impossible since the behavior would be dependent on
> KVM's implementation as well as compiler optimizations.  In other words,
> providing stable behavior would require extremely careful consideration
> now and in the future.
> 
> Given that the expected use of a manually-tuned timer advancement is to
> "tune once, run many", use the vastly simpler approach of recognizing
> changes to the module params only when creating a new vCPU.
> 
> Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
> Cc: Liran Alon <liran.alon@oracle.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

Reviewed-by: Liran Alon <liran.alon@oracle.com>

> ---
> arch/x86/kvm/lapic.c | 4 +++-
> arch/x86/kvm/lapic.h | 3 ++-
> arch/x86/kvm/x86.c   | 6 +++++-
> 3 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 3cd62edf8084..37d3489f68c9 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2263,7 +2263,8 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
> 		return HRTIMER_NORESTART;
> }
> 
> -int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
> +int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns,
> +		     bool timer_advance_autotune)
> {
> 	struct kvm_lapic *apic;
> 
> @@ -2288,6 +2289,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
> 		     HRTIMER_MODE_ABS_PINNED);
> 	apic->lapic_timer.timer.function = apic_timer_fn;
> 	apic->lapic_timer.timer_advance_ns = timer_advance_ns;
> +	apic->lapic_timer.timer_advance_adjust_done = !timer_advance_autotune;
> 
> 	/*
> 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 3e97f8a68967..7fa1aed02c14 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -64,7 +64,8 @@ struct kvm_lapic {
> 
> struct dest_map;
> 
> -int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns);
> +int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns,
> +		     bool timer_advance_autotune);
> void kvm_free_lapic(struct kvm_vcpu *vcpu);
> 
> int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b303a21a2bc2..60d80e354f62 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -140,6 +140,9 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> static u32 __read_mostly lapic_timer_advance_ns = 1000;
> module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
> 
> +static bool __read_mostly lapic_timer_advance_autotune = true;
> +module_param(lapic_timer_advance_autotune, bool, S_IRUGO | S_IWUSR);
> +
> static bool __read_mostly vector_hashing = true;
> module_param(vector_hashing, bool, S_IRUGO);
> 
> @@ -9078,7 +9081,8 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
> 
> 	if (irqchip_in_kernel(vcpu->kvm)) {
> 		vcpu->arch.apicv_active = kvm_x86_ops->get_enable_apicv(vcpu);
> -		r = kvm_create_lapic(vcpu, lapic_timer_advance_ns);
> +		r = kvm_create_lapic(vcpu, lapic_timer_advance_ns,
> +				     lapic_timer_advance_autotune);
> 		if (r < 0)
> 			goto fail_mmu_destroy;
> 	} else
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 6/7] KVM: lapic: Refactor start_hv_timer()'s semantics and usage
  2019-04-16 17:48 ` [PATCH v2 6/7] KVM: lapic: Refactor start_hv_timer()'s semantics and usage Sean Christopherson
@ 2019-04-16 18:44   ` Liran Alon
  2019-04-16 18:47     ` Liran Alon
  0 siblings, 1 reply; 12+ messages in thread
From: Liran Alon @ 2019-04-16 18:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 16 Apr 2019, at 20:48, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> start_hv_timer() currently returns a bool, with %false interpreted as
> "failure" and %true as "success".  But the reality is far more complex,
> as the %false is also returned when programming was successful, but
> must be canceled because there was already a pending timer interrupt.
> It can also return %false when programming was successful but the timer
> was marked expired because the deadline already passed, in which case
> the hv timer must also be canceled to avoid double injection.
> 
> Functionally, the existing code is correct in the sense that it doesn't
> doing anything visibily wrong, e.g. generate spurious interrupts or miss
> an interrupt.  But it's extremely confusing and inefficient, e.g. there
> are multiple extraneous calls to apic_timer_expired() that effectively
> get dropped due to @timer_pending being %true.
> 
> Refactor start_hv_timer() to return an 'int' and move the handling of
> the various combinations of success vs. expiration vs. periodic up a
> level to restart_apic_timer(), which eliminates the needs to overload
> the meaning of %false.
> 
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Suggested-by: Liran Alon <liran.alon@oracle.com>
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
> arch/x86/kvm/lapic.c | 59 ++++++++++++++++++++++++++++++--------------
> 1 file changed, 40 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index fa2dbca5ad6a..b867569107d5 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1678,41 +1678,35 @@ static void cancel_hv_timer(struct kvm_lapic *apic)
> 	apic->lapic_timer.hv_timer_in_use = false;
> }
> 
> -static bool start_hv_timer(struct kvm_lapic *apic)
> +/*
> + * Return:
> + *  Negative errno on failure,
> + *  '1' on success and the timer's deadline has passed,
> + *  '0' on success and the timer's deadline is some time in the future.
> + */
> +static int start_hv_timer(struct kvm_lapic *apic)
> {
> 	struct kvm_timer *ktimer = &apic->lapic_timer;
> 	int r;
> 
> 	WARN_ON(preemptible());
> 	if (!kvm_x86_ops->set_hv_timer)
> -		return false;
> +		return -ENOENT;
> 
> 	if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
> -		return false;
> +		return -EBUSY;
> 
> 	if (!ktimer->tscdeadline)
> -		return false;
> +		return -EINVAL;

These 2 upper “if” blocks can just be moved to start of restart_apic_timer() and early-return
in case any of them is true. 

In your code, start_hv_timer() will return r<0 which will cause restart_apic_timer() to call start_sw_timer()
which will then see one of these conditions and do nothing.
This is what I tried to warn about in the v1 email thread.

> 
> 	r = kvm_x86_ops->set_hv_timer(apic->vcpu, ktimer->tscdeadline);
> 	if (r < 0)
> -		return false;
> +		return r;
> 
> 	ktimer->hv_timer_in_use = true;
> 	hrtimer_cancel(&ktimer->timer);
> 
> -	/*
> -	 * Also recheck ktimer->pending, in case the sw timer triggered in
> -	 * the window.  For periodic timer, leave the hv timer running for
> -	 * simplicity, and the deadline will be recomputed on the next vmexit.
> -	 */
> -	if (!apic_lvtt_period(apic) && (r || atomic_read(&ktimer->pending))) {
> -		if (r)
> -			apic_timer_expired(apic);
> -		return false;
> -	}
> -
> -	trace_kvm_hv_timer_state(apic->vcpu->vcpu_id, true);
> -	return true;
> +	return r;
> }
> 
> static void start_sw_timer(struct kvm_lapic *apic)
> @@ -1734,9 +1728,36 @@ static void start_sw_timer(struct kvm_lapic *apic)
> 
> static void restart_apic_timer(struct kvm_lapic *apic)
> {
> +	int r;
> +
> 	preempt_disable();
> -	if (!start_hv_timer(apic))
> +	r = start_hv_timer(apic);
> +	if (r < 0) {
> 		start_sw_timer(apic);
> +	} else if (!apic_lvtt_period(apic)) {
> +		/*
> +		 * For a periodic timer, leave the timer running for simplicity
> +		 * so that the deadline will be recomputed on the next VM-Exit.
> +		 */
> +		if (atomic_read(&apic->lapic_timer.pending)) {
> +			/*
> +			 * Cancel the hv timer if the sw timer fired while the
> +			 * hv timer was being programmed.
> +			 */
> +			cancel_hv_timer(apic);
> +		} else if (r) {
> +			/*
> +			 * start_hv_timer() returns '1' if the timer's deadline
> +			 * has already passed.  Again, go through with the full
> +			 * programming in the periodic case as the VM-Exit is
> +			 * needed to recompute the periodic timer deadline.
> +			 */
> +			apic_timer_expired(apic);
> +			cancel_hv_timer(apic);
> +		}
> +	}
> +	trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
> +				 apic->lapic_timer.hv_timer_in_use);
> 	preempt_enable();
> }
> 
> -- 
> 2.21.0
> 

I think I had a simpler patch.
I will try to send it and see what you think.

-Liran





^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 6/7] KVM: lapic: Refactor start_hv_timer()'s semantics and usage
  2019-04-16 18:44   ` Liran Alon
@ 2019-04-16 18:47     ` Liran Alon
  0 siblings, 0 replies; 12+ messages in thread
From: Liran Alon @ 2019-04-16 18:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 16 Apr 2019, at 21:44, Liran Alon <liran.alon@oracle.com> wrote:
> 
> 
> 
>> On 16 Apr 2019, at 20:48, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>> 
>> start_hv_timer() currently returns a bool, with %false interpreted as
>> "failure" and %true as "success".  But the reality is far more complex,
>> as the %false is also returned when programming was successful, but
>> must be canceled because there was already a pending timer interrupt.
>> It can also return %false when programming was successful but the timer
>> was marked expired because the deadline already passed, in which case
>> the hv timer must also be canceled to avoid double injection.
>> 
>> Functionally, the existing code is correct in the sense that it doesn't
>> doing anything visibily wrong, e.g. generate spurious interrupts or miss
>> an interrupt.  But it's extremely confusing and inefficient, e.g. there
>> are multiple extraneous calls to apic_timer_expired() that effectively
>> get dropped due to @timer_pending being %true.
>> 
>> Refactor start_hv_timer() to return an 'int' and move the handling of
>> the various combinations of success vs. expiration vs. periodic up a
>> level to restart_apic_timer(), which eliminates the needs to overload
>> the meaning of %false.
>> 
>> Cc: Wanpeng Li <wanpengli@tencent.com>
>> Suggested-by: Liran Alon <liran.alon@oracle.com>
>> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> ---
>> arch/x86/kvm/lapic.c | 59 ++++++++++++++++++++++++++++++--------------
>> 1 file changed, 40 insertions(+), 19 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index fa2dbca5ad6a..b867569107d5 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -1678,41 +1678,35 @@ static void cancel_hv_timer(struct kvm_lapic *apic)
>> 	apic->lapic_timer.hv_timer_in_use = false;
>> }
>> 
>> -static bool start_hv_timer(struct kvm_lapic *apic)
>> +/*
>> + * Return:
>> + *  Negative errno on failure,
>> + *  '1' on success and the timer's deadline has passed,
>> + *  '0' on success and the timer's deadline is some time in the future.
>> + */
>> +static int start_hv_timer(struct kvm_lapic *apic)
>> {
>> 	struct kvm_timer *ktimer = &apic->lapic_timer;
>> 	int r;
>> 
>> 	WARN_ON(preemptible());
>> 	if (!kvm_x86_ops->set_hv_timer)
>> -		return false;
>> +		return -ENOENT;
>> 
>> 	if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
>> -		return false;
>> +		return -EBUSY;
>> 
>> 	if (!ktimer->tscdeadline)
>> -		return false;
>> +		return -EINVAL;
> 
> These 2 upper “if” blocks can just be moved to start of restart_apic_timer() and early-return
> in case any of them is true. 
> 
> In your code, start_hv_timer() will return r<0 which will cause restart_apic_timer() to call start_sw_timer()
> which will then see one of these conditions and do nothing.
> This is what I tried to warn about in the v1 email thread.
> 
>> 
>> 	r = kvm_x86_ops->set_hv_timer(apic->vcpu, ktimer->tscdeadline);
>> 	if (r < 0)
>> -		return false;
>> +		return r;
>> 
>> 	ktimer->hv_timer_in_use = true;
>> 	hrtimer_cancel(&ktimer->timer);
>> 
>> -	/*
>> -	 * Also recheck ktimer->pending, in case the sw timer triggered in
>> -	 * the window.  For periodic timer, leave the hv timer running for
>> -	 * simplicity, and the deadline will be recomputed on the next vmexit.
>> -	 */
>> -	if (!apic_lvtt_period(apic) && (r || atomic_read(&ktimer->pending))) {
>> -		if (r)
>> -			apic_timer_expired(apic);
>> -		return false;
>> -	}
>> -
>> -	trace_kvm_hv_timer_state(apic->vcpu->vcpu_id, true);
>> -	return true;
>> +	return r;
>> }
>> 
>> static void start_sw_timer(struct kvm_lapic *apic)
>> @@ -1734,9 +1728,36 @@ static void start_sw_timer(struct kvm_lapic *apic)
>> 
>> static void restart_apic_timer(struct kvm_lapic *apic)
>> {
>> +	int r;
>> +
>> 	preempt_disable();
>> -	if (!start_hv_timer(apic))
>> +	r = start_hv_timer(apic);
>> +	if (r < 0) {
>> 		start_sw_timer(apic);
>> +	} else if (!apic_lvtt_period(apic)) {
>> +		/*
>> +		 * For a periodic timer, leave the timer running for simplicity
>> +		 * so that the deadline will be recomputed on the next VM-Exit.
>> +		 */
>> +		if (atomic_read(&apic->lapic_timer.pending)) {
>> +			/*
>> +			 * Cancel the hv timer if the sw timer fired while the
>> +			 * hv timer was being programmed.
>> +			 */
>> +			cancel_hv_timer(apic);
>> +		} else if (r) {
>> +			/*
>> +			 * start_hv_timer() returns '1' if the timer's deadline
>> +			 * has already passed.  Again, go through with the full
>> +			 * programming in the periodic case as the VM-Exit is
>> +			 * needed to recompute the periodic timer deadline.
>> +			 */
>> +			apic_timer_expired(apic);
>> +			cancel_hv_timer(apic);
>> +		}
>> +	}
>> +	trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
>> +				 apic->lapic_timer.hv_timer_in_use);
>> 	preempt_enable();
>> }
>> 
>> -- 
>> 2.21.0
>> 
> 
> I think I had a simpler patch.
> I will try to send it and see what you think.
> 
> -Liran

Not tested, but something like:

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index fbd37e07e90d..151d88833cfe 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1531,7 +1531,7 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
        unsigned long flags;
        ktime_t now;

-       if (unlikely(!tscdeadline || !this_tsc_khz))
+       if (unlikely(!this_tsc_khz))
                return;

        local_irq_save(flags);
@@ -1675,13 +1675,8 @@ static bool start_hv_timer(struct kvm_lapic *apic)
        int r;

        WARN_ON(preemptible());
-       if (!kvm_x86_ops->set_hv_timer)
-               return false;

-       if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
-               return false;
-
-       if (!ktimer->tscdeadline)
+       if (!kvm_x86_ops->set_hv_timer)
                return false;

        r = kvm_x86_ops->set_hv_timer(apic->vcpu, ktimer->tscdeadline);
@@ -1697,9 +1692,11 @@ static bool start_hv_timer(struct kvm_lapic *apic)
         * simplicity, and the deadline will be recomputed on the next vmexit.
         */
        if (!apic_lvtt_period(apic) && (r || atomic_read(&ktimer->pending))) {
-               if (r)
+               if (r) {
                        apic_timer_expired(apic);
-               return false;
+                       ktimer->hv_timer_in_use = false;
+               }
+               return true;
        }

        trace_kvm_hv_timer_state(apic->vcpu->vcpu_id, true);
@@ -1708,13 +1705,9 @@ static bool start_hv_timer(struct kvm_lapic *apic)

 static void start_sw_timer(struct kvm_lapic *apic)
 {
-       struct kvm_timer *ktimer = &apic->lapic_timer;
-
        WARN_ON(preemptible());
        if (apic->lapic_timer.hv_timer_in_use)
                cancel_hv_timer(apic);
-       if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
-               return;

        if (apic_lvtt_period(apic) || apic_lvtt_oneshot(apic))
                start_sw_period(apic);
@@ -1725,9 +1718,20 @@ static void start_sw_timer(struct kvm_lapic *apic)

 static void restart_apic_timer(struct kvm_lapic *apic)
 {
+       struct kvm_timer *ktimer = &apic->lapic_timer;
+
        preempt_disable();
+
+       if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
+               goto out;
+
+       if (!ktimer->tscdeadline)
+               goto out;
+
        if (!start_hv_timer(apic))
                start_sw_timer(apic);
+
+out:
        preempt_enable();
 }

What do you think?

-Liran


^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-04-16 18:48 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-16 17:48 [PATCH v2 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
2019-04-16 17:48 ` [PATCH v2 1/7] KVM: lapic: Hard cap the auto-calculated timer advancement Sean Christopherson
2019-04-16 17:48 ` [PATCH v2 2/7] KVM: lapic: Convert guest TSC to host time domain when delaying Sean Christopherson
2019-04-16 18:35   ` Liran Alon
2019-04-16 17:48 ` [PATCH v2 3/7] KVM: lapic: Track lapic timer advance per vCPU Sean Christopherson
2019-04-16 17:48 ` [PATCH v2 4/7] KVM: lapic: Allow user to disable auto-tuning of timer advancement Sean Christopherson
2019-04-16 18:37   ` Liran Alon
2019-04-16 17:48 ` [PATCH v2 5/7] KVM: lapic: Busy wait for timer to expire when using hv_timer Sean Christopherson
2019-04-16 17:48 ` [PATCH v2 6/7] KVM: lapic: Refactor start_hv_timer()'s semantics and usage Sean Christopherson
2019-04-16 18:44   ` Liran Alon
2019-04-16 18:47     ` Liran Alon
2019-04-16 17:48 ` [PATCH v2 7/7] KVM: VMX: Skip delta_tsc shift-and-divide if the dividend is zero Sean Christopherson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.