All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7] KVM: lapic: Fix a variety of timer adv issues
@ 2019-04-12 20:18 Sean Christopherson
  2019-04-12 20:18 ` [PATCH 1/7] KVM: lapic: Hard cap the auto-calculated timer advancement Sean Christopherson
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: Sean Christopherson @ 2019-04-12 20:18 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: kvm, Sean Christopherson, Liran Alon, Wanpeng Li

The recent change to automatically calculate lapic_timer_advance_ns
introduced a handful of gnarly bugs, and also exposed a latent bug by
virtue of the advancement logic being enabled by default.  Inspection
and debug revealed several other opportunities for minor improvements.

The primary issue is that the auto-tuning of lapic_timer_advance_ns is
completely unbounded, e.g. there's nothing in the logic that prevents the
advancement from creeping up to hundreds of milliseconds.  Adjusting the
timers by large amounts creates major discrepancies in the guest, e.g. a
timer that was configured to fire after multiple milliseconds may arrive
before the guest executes a single instruction.  While technically correct
from a time perspective, it breaks a reasonable assumption from the guest
that it can execute some number of instructions between timer events.

The other major issue is that the advancement is global, while TSC scaling
is done on a per-vCPU basic.  Adjusting the advancement at runtime
exacerbates this as there is no protection against multiple vCPUs and/or
multiple VMs concurrently modifying the advancement value, e.g. it can
effectively become corrupted or never stabilize due to getting bounced all
over tarnation.

As for the latent bug, when timer advancement was applied to the hv_timer,
i.e. the VMX preemption timer, the logic to trigger wait_for_lapic_timer()
was not updated.  As a result, a timer interrupt emulated via the hv_timer
can easily arrive too early from a *time* perspective, as opposed to
simply arriving early from a "number of instructions executed" perspective.

As an aside, the test stage counting in kvm-unit-tests' vmx.flat/interrupt
is off by one, which causes it to misreport what it's testing, e.g. says
it's testing "direct interrupt" when the unit test's host is intercepting
interrupts and vice versa.  That was a fun few hours of debug.

Sean Christopherson (7):
  KVM: lapic: Hard cap the auto-calculated timer advancement
  KVM: lapic: Delay 1ns at a time when waiting for timer to "expire"
  KVM: lapic: Track lapic timer advance per vCPU
  KVM: lapic: Allow user to override auto-tuning of timer advancement
  KVM: lapic: Busy wait for timer to expire when using hv_timer
  KVM: lapic: Clean up the code for handling of a pre-expired hv_timer
  KVM: VMX: Skip delta_tsc shift-and-divide if the dividend is zero

 arch/x86/kvm/lapic.c   | 56 +++++++++++++++++++++++++-----------------
 arch/x86/kvm/lapic.h   |  6 ++++-
 arch/x86/kvm/vmx/vmx.c |  9 ++++---
 arch/x86/kvm/x86.c     |  7 +++---
 arch/x86/kvm/x86.h     |  2 --
 5 files changed, 47 insertions(+), 33 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/7] KVM: lapic: Hard cap the auto-calculated timer advancement
  2019-04-12 20:18 [PATCH 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
@ 2019-04-12 20:18 ` Sean Christopherson
  2019-04-14 10:22   ` Liran Alon
  2019-04-12 20:18 ` [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire" Sean Christopherson
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2019-04-12 20:18 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: kvm, Sean Christopherson, Liran Alon, Wanpeng Li

To minimize the latency of timer interrupts as observed by the guest,
KVM adjusts the values it programs into the host timers to account for
the host's overhead of programming and handling the timer event.  Now
that the timer advancement is automatically tuned during runtime, it's
effectively unbounded by default, e.g. if KVM is running as L1 the
advancement can measure in hundreds of milliseconds.

Place a somewhat arbitrary hard cap of 5000ns on the auto-calculated
advancement, as large advancements can break reasonable assumptions of
the guest, e.g. that a timer configured to fire after 1ms won't arrive
on the next instruction.  Although KVM busy waits to mitigate the timer
event arriving too early, complications can arise when shifting the
interrupt too far, e.g. vmx.flat/interrupt in kvm-unit-tests will fail
when its "host" exits on interrupts (because the INTR is injected before
the gets executes STI+HLT).  Arguably the unit test is "broken" in the
sense that delaying the timer interrupt by 1ms doesn't technically
guarantee the interrupt will arrive after STI+HLT, but it's a reasonable
assumption that KVM should support.

Furthermore, an unbounded advancement also effectively unbounds the time
spent busy waiting, e.g. if the guest programs a timer with a very large
delay.

Arguably the advancement logic could simply be disabled when running as
L1, but KVM needs to bound the advancement time regardless, e.g. if the
TSC is unstable and the calculations get wildly out of whack.  And
allowing the advancement when running as L1 is a good stress test of
sorts for the logic.

Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9bf70cf84564..92446cba9b24 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -74,6 +74,7 @@ static bool lapic_timer_advance_adjust_done = false;
 #define LAPIC_TIMER_ADVANCE_ADJUST_DONE 100
 /* step-by-step approximation to mitigate fluctuation */
 #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
+#define LAPIC_TIMER_ADVANCE_MAX_NS	5000
 
 static inline int apic_test_vector(int vec, void *bitmap)
 {
@@ -1522,6 +1523,10 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 		}
 		if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
 			lapic_timer_advance_adjust_done = true;
+		if (unlikely(lapic_timer_advance_ns > LAPIC_TIMER_ADVANCE_MAX_NS)) {
+			lapic_timer_advance_ns = LAPIC_TIMER_ADVANCE_MAX_NS;
+			lapic_timer_advance_adjust_done = true;
+		}
 	}
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire"
  2019-04-12 20:18 [PATCH 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
  2019-04-12 20:18 ` [PATCH 1/7] KVM: lapic: Hard cap the auto-calculated timer advancement Sean Christopherson
@ 2019-04-12 20:18 ` Sean Christopherson
  2019-04-14 11:25   ` Liran Alon
  2019-04-16 11:02   ` Paolo Bonzini
  2019-04-12 20:18 ` [PATCH 3/7] KVM: lapic: Track lapic timer advance per vCPU Sean Christopherson
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 29+ messages in thread
From: Sean Christopherson @ 2019-04-12 20:18 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: kvm, Sean Christopherson, Liran Alon, Wanpeng Li

To minimize the latency of timer interrupts as observed by the guest,
KVM adjusts the values it programs into the host timers to account for
the host's overhead of programming and handling the timer event.  In
the event that the adjustments are too aggressive, i.e. the timer fires
earlier than the guest expects, KVM busy waits immediately prior to
entering the guest.

Currently, KVM manually converts the delay from nanoseconds to clock
cycles.  But, the conversion is done in the guest's time domain, while
the delay occurs in the host's time domain, i.e. the delay may not be
accurate and could wait too little or too long.  Sidestep the headache
of shifting time domains by delaying 1ns at a time and breaking the loop
when the guest's desired timer delay has been satisfied.  Because the
advancement, which caps the delay to avoid unbounded busy waiting, is
stored in nanoseconds, the current advancement time can simply be used
as a loop terminator since we're delaying 1ns at a time (plus the few
cycles of overhead for running the code).

Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 92446cba9b24..e797e3145a8b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
 void wait_lapic_expire(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
-	u64 guest_tsc, tsc_deadline, ns;
+	u32 timer_advance_ns = lapic_timer_advance_ns;
+	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
 
 	if (!lapic_in_kernel(vcpu))
 		return;
@@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 
 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
 	apic->lapic_timer.expired_tscdeadline = 0;
-	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
 
-	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
-	if (guest_tsc < tsc_deadline)
-		__delay(min(tsc_deadline - guest_tsc,
-			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
+	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
+		ndelay(1);
+		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+	}
 
 	if (!lapic_timer_advance_adjust_done) {
 		/* too early */
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/7] KVM: lapic: Track lapic timer advance per vCPU
  2019-04-12 20:18 [PATCH 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
  2019-04-12 20:18 ` [PATCH 1/7] KVM: lapic: Hard cap the auto-calculated timer advancement Sean Christopherson
  2019-04-12 20:18 ` [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire" Sean Christopherson
@ 2019-04-12 20:18 ` Sean Christopherson
  2019-04-14 11:29   ` Liran Alon
  2019-04-12 20:18 ` [PATCH 4/7] KVM: lapic: Allow user to override auto-tuning of timer advancement Sean Christopherson
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2019-04-12 20:18 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: kvm, Sean Christopherson, Liran Alon, Wanpeng Li

Automatically adjusting the globally-shared timer advancement could
corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting
the advancement value.  That could be partially fixed by using a local
variable for the arithmetic, but it would still be susceptible to a
race when setting timer_advance_adjust_done.

And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
correct calibration for a given vCPU may not apply to all vCPUs.

Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is
effectively violated when finding a stable advancement takes an extended
amount of timer.

Opportunistically change the definition of lapic_timer_advance_ns to
a u32 so that it matches the style of struct kvm_timer.  Explicitly
pass the param to kvm_create_lapic() so that it doesn't have to be
exposed to lapic.c, thus reducing the probability of unintentionally
using the global value instead of the per-vCPU value.

Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c   | 33 +++++++++++++++++----------------
 arch/x86/kvm/lapic.h   |  4 +++-
 arch/x86/kvm/vmx/vmx.c |  4 +++-
 arch/x86/kvm/x86.c     |  7 +++----
 arch/x86/kvm/x86.h     |  2 --
 5 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e797e3145a8b..70a0acd98e9e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -70,7 +70,6 @@
 #define APIC_BROADCAST			0xFF
 #define X2APIC_BROADCAST		0xFFFFFFFFul
 
-static bool lapic_timer_advance_adjust_done = false;
 #define LAPIC_TIMER_ADVANCE_ADJUST_DONE 100
 /* step-by-step approximation to mitigate fluctuation */
 #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
@@ -1486,7 +1485,7 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
 void wait_lapic_expire(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
-	u32 timer_advance_ns = lapic_timer_advance_ns;
+	u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns;
 	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
 
 	if (!lapic_in_kernel(vcpu))
@@ -1508,32 +1507,34 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
 	}
 
-	if (!lapic_timer_advance_adjust_done) {
+	if (!apic->lapic_timer.timer_advance_adjust_done) {
 		/* too early */
 		if (guest_tsc < tsc_deadline) {
 			ns = (tsc_deadline - guest_tsc) * 1000000ULL;
 			do_div(ns, vcpu->arch.virtual_tsc_khz);
-			lapic_timer_advance_ns -= min((unsigned int)ns,
-				lapic_timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+			timer_advance_ns -= min((u32)ns,
+				timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
 		} else {
 		/* too late */
 			ns = (guest_tsc - tsc_deadline) * 1000000ULL;
 			do_div(ns, vcpu->arch.virtual_tsc_khz);
-			lapic_timer_advance_ns += min((unsigned int)ns,
-				lapic_timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+			timer_advance_ns += min((u32)ns,
+				timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
 		}
 		if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
-			lapic_timer_advance_adjust_done = true;
-		if (unlikely(lapic_timer_advance_ns > LAPIC_TIMER_ADVANCE_MAX_NS)) {
-			lapic_timer_advance_ns = LAPIC_TIMER_ADVANCE_MAX_NS;
-			lapic_timer_advance_adjust_done = true;
+			apic->lapic_timer.timer_advance_adjust_done = true;
+		if (unlikely(timer_advance_ns > LAPIC_TIMER_ADVANCE_MAX_NS)) {
+			timer_advance_ns = LAPIC_TIMER_ADVANCE_MAX_NS;
+			apic->lapic_timer.timer_advance_adjust_done = true;
 		}
+		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
 	}
 }
 
 static void start_sw_tscdeadline(struct kvm_lapic *apic)
 {
-	u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
+	struct kvm_timer *ktimer = &apic->lapic_timer;
+	u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
 	u64 ns = 0;
 	ktime_t expire;
 	struct kvm_vcpu *vcpu = apic->vcpu;
@@ -1552,9 +1553,8 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
 		ns = (tscdeadline - guest_tsc) * 1000000ULL;
 		do_div(ns, this_tsc_khz);
 		expire = ktime_add_ns(now, ns);
-		expire = ktime_sub_ns(expire, lapic_timer_advance_ns);
-		hrtimer_start(&apic->lapic_timer.timer,
-				expire, HRTIMER_MODE_ABS_PINNED);
+		expire = ktime_sub_ns(expire, ktimer->timer_advance_ns);
+		hrtimer_start(&ktimer->timer, expire, HRTIMER_MODE_ABS_PINNED);
 	} else
 		apic_timer_expired(apic);
 
@@ -2261,7 +2261,7 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
 		return HRTIMER_NORESTART;
 }
 
-int kvm_create_lapic(struct kvm_vcpu *vcpu)
+int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
 {
 	struct kvm_lapic *apic;
 
@@ -2285,6 +2285,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
 	hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
 		     HRTIMER_MODE_ABS_PINNED);
 	apic->lapic_timer.timer.function = apic_timer_fn;
+	apic->lapic_timer.timer_advance_ns = timer_advance_ns;
 
 	/*
 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index ff6ef9c3d760..3e97f8a68967 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -31,8 +31,10 @@ struct kvm_timer {
 	u32 timer_mode_mask;
 	u64 tscdeadline;
 	u64 expired_tscdeadline;
+	u32 timer_advance_ns;
 	atomic_t pending;			/* accumulated triggered timers */
 	bool hv_timer_in_use;
+	bool timer_advance_adjust_done;
 };
 
 struct kvm_lapic {
@@ -62,7 +64,7 @@ struct kvm_lapic {
 
 struct dest_map;
 
-int kvm_create_lapic(struct kvm_vcpu *vcpu);
+int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns);
 void kvm_free_lapic(struct kvm_vcpu *vcpu);
 
 int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5f90616da275..a3211e3ee679 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7031,6 +7031,7 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
 {
 	struct vcpu_vmx *vmx;
 	u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
+	struct kvm_timer *ktimer = &vcpu->arch.apic->lapic_timer;
 
 	if (kvm_mwait_in_guest(vcpu->kvm))
 		return -EOPNOTSUPP;
@@ -7039,7 +7040,8 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
 	tscl = rdtsc();
 	guest_tscl = kvm_read_l1_tsc(vcpu, tscl);
 	delta_tsc = max(guest_deadline_tsc, guest_tscl) - guest_tscl;
-	lapic_timer_advance_cycles = nsec_to_cycles(vcpu, lapic_timer_advance_ns);
+	lapic_timer_advance_cycles = nsec_to_cycles(vcpu,
+						    ktimer->timer_advance_ns);
 
 	if (delta_tsc > lapic_timer_advance_cycles)
 		delta_tsc -= lapic_timer_advance_cycles;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 38440316a806..b303a21a2bc2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -137,9 +137,8 @@ static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 
 /* lapic timer advance (tscdeadline mode only) in nanoseconds */
-unsigned int __read_mostly lapic_timer_advance_ns = 1000;
+static u32 __read_mostly lapic_timer_advance_ns = 1000;
 module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
-EXPORT_SYMBOL_GPL(lapic_timer_advance_ns);
 
 static bool __read_mostly vector_hashing = true;
 module_param(vector_hashing, bool, S_IRUGO);
@@ -7891,7 +7890,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	}
 
 	trace_kvm_entry(vcpu->vcpu_id);
-	if (lapic_timer_advance_ns)
+	if (vcpu->arch.apic->lapic_timer.timer_advance_ns)
 		wait_lapic_expire(vcpu);
 	guest_enter_irqoff();
 
@@ -9079,7 +9078,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 
 	if (irqchip_in_kernel(vcpu->kvm)) {
 		vcpu->arch.apicv_active = kvm_x86_ops->get_enable_apicv(vcpu);
-		r = kvm_create_lapic(vcpu);
+		r = kvm_create_lapic(vcpu, lapic_timer_advance_ns);
 		if (r < 0)
 			goto fail_mmu_destroy;
 	} else
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index eda954441213..a470ff0868c5 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -294,8 +294,6 @@ extern u64 kvm_supported_xcr0(void);
 
 extern unsigned int min_timer_period_us;
 
-extern unsigned int lapic_timer_advance_ns;
-
 extern bool enable_vmware_backdoor;
 
 extern struct static_key kvm_no_apic_vcpu;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 4/7] KVM: lapic: Allow user to override auto-tuning of timer advancement
  2019-04-12 20:18 [PATCH 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
                   ` (2 preceding siblings ...)
  2019-04-12 20:18 ` [PATCH 3/7] KVM: lapic: Track lapic timer advance per vCPU Sean Christopherson
@ 2019-04-12 20:18 ` Sean Christopherson
  2019-04-14 11:35   ` Liran Alon
  2019-04-12 20:18 ` [PATCH 5/7] KVM: lapic: Busy wait for timer to expire when using hv_timer Sean Christopherson
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2019-04-12 20:18 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: kvm, Sean Christopherson, Liran Alon, Wanpeng Li

Disable auto-tuning the timer advancement if the user specifies an
explicit value via the module param.  Aside from the obvious override
capability, this also allows the KVM admin to set the advancement
beyond the internally-capped max of 5000ns.

Cc: Wanpeng Li <wanpengli@tencent.com>
Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 2 ++
 arch/x86/kvm/lapic.h | 2 ++
 arch/x86/kvm/x86.c   | 2 +-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 70a0acd98e9e..c43cd26f040b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2286,6 +2286,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
 		     HRTIMER_MODE_ABS_PINNED);
 	apic->lapic_timer.timer.function = apic_timer_fn;
 	apic->lapic_timer.timer_advance_ns = timer_advance_ns;
+	if (timer_advance_ns != LAPIC_TIMER_ADVANCE_DEFAULT_NS)
+		apic->lapic_timer.timer_advance_adjust_done = true;
 
 	/*
 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 3e97f8a68967..c7233629c05b 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -16,6 +16,8 @@
 #define APIC_BUS_CYCLE_NS       1
 #define APIC_BUS_FREQUENCY      (1000000000ULL / APIC_BUS_CYCLE_NS)
 
+#define LAPIC_TIMER_ADVANCE_DEFAULT_NS	1000
+
 enum lapic_mode {
 	LAPIC_MODE_DISABLED = 0,
 	LAPIC_MODE_INVALID = X2APIC_ENABLE,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b303a21a2bc2..709a8bf5ae0e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -137,7 +137,7 @@ static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 
 /* lapic timer advance (tscdeadline mode only) in nanoseconds */
-static u32 __read_mostly lapic_timer_advance_ns = 1000;
+static u32 __read_mostly lapic_timer_advance_ns = LAPIC_TIMER_ADVANCE_DEFAULT_NS;
 module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
 
 static bool __read_mostly vector_hashing = true;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 5/7] KVM: lapic: Busy wait for timer to expire when using hv_timer
  2019-04-12 20:18 [PATCH 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
                   ` (3 preceding siblings ...)
  2019-04-12 20:18 ` [PATCH 4/7] KVM: lapic: Allow user to override auto-tuning of timer advancement Sean Christopherson
@ 2019-04-12 20:18 ` Sean Christopherson
  2019-04-14 11:47   ` Liran Alon
  2019-04-12 20:18 ` [PATCH 6/7] KVM: lapic: Clean up the code for handling of a pre-expired hv_timer Sean Christopherson
  2019-04-12 20:18 ` [PATCH 7/7] KVM: VMX: Skip delta_tsc shift-and-divide if the dividend is zero Sean Christopherson
  6 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2019-04-12 20:18 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: kvm, Sean Christopherson, Liran Alon, Wanpeng Li

...now that VMX's preemption timer, i.e. the hv_timer, also adjusts its
programmed time based on lapic_timer_advance_ns.  Without the delay, a
guest can see a timer interrupt arrive before the requested time when
KVM is using the hv_timer to emulate the guest's interrupt.

Fixes: c5ce8235cffa0 ("KVM: VMX: Optimize tscdeadline timer latency")
Cc: <stable@vger.kernel.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c43cd26f040b..1d649a2af04c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1455,7 +1455,7 @@ static void apic_timer_expired(struct kvm_lapic *apic)
 	if (swait_active(q))
 		swake_up_one(q);
 
-	if (apic_lvtt_tscdeadline(apic))
+	if (apic_lvtt_tscdeadline(apic) || ktimer->hv_timer_in_use)
 		ktimer->expired_tscdeadline = ktimer->tscdeadline;
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 6/7] KVM: lapic: Clean up the code for handling of a pre-expired hv_timer
  2019-04-12 20:18 [PATCH 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
                   ` (4 preceding siblings ...)
  2019-04-12 20:18 ` [PATCH 5/7] KVM: lapic: Busy wait for timer to expire when using hv_timer Sean Christopherson
@ 2019-04-12 20:18 ` Sean Christopherson
  2019-04-14 12:15   ` Liran Alon
  2019-04-12 20:18 ` [PATCH 7/7] KVM: VMX: Skip delta_tsc shift-and-divide if the dividend is zero Sean Christopherson
  6 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2019-04-12 20:18 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: kvm, Sean Christopherson, Liran Alon, Wanpeng Li

Calling apic_timer_expired() is a nop when a timer interrupt is already
pending, i.e. there's no need to call apic_timer_expired() when there's
a pending interrupt and the hv_timer wants to pend its own interrupt.
Separate the two flows to make the code more readable and to avoid an
unnecessary function call and read to ktimer->pending.

Cc: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 1d649a2af04c..f0be6f148a47 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1703,9 +1703,12 @@ static bool start_hv_timer(struct kvm_lapic *apic)
 	 * the window.  For periodic timer, leave the hv timer running for
 	 * simplicity, and the deadline will be recomputed on the next vmexit.
 	 */
-	if (!apic_lvtt_period(apic) && (r || atomic_read(&ktimer->pending))) {
-		if (r)
-			apic_timer_expired(apic);
+	if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
+		return false;
+
+	/* set_hv_timer() returns '1' when the timer has already expired. */
+	if (r) {
+		apic_timer_expired(apic);
 		return false;
 	}
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 7/7] KVM: VMX: Skip delta_tsc shift-and-divide if the dividend is zero
  2019-04-12 20:18 [PATCH 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
                   ` (5 preceding siblings ...)
  2019-04-12 20:18 ` [PATCH 6/7] KVM: lapic: Clean up the code for handling of a pre-expired hv_timer Sean Christopherson
@ 2019-04-12 20:18 ` Sean Christopherson
  2019-04-14 12:21   ` Liran Alon
  6 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2019-04-12 20:18 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: kvm, Sean Christopherson, Liran Alon, Wanpeng Li

Ten percent of nothin' is... let me do the math here.  Nothin' into
nothin', carry the nothin'...

Cc: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a3211e3ee679..2fec11bc0ac9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7050,10 +7050,9 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
 
 	/* Convert to host delta tsc if tsc scaling is enabled */
 	if (vcpu->arch.tsc_scaling_ratio != kvm_default_tsc_scaling_ratio &&
-			u64_shl_div_u64(delta_tsc,
+	    delta_tsc && u64_shl_div_u64(delta_tsc,
 				kvm_tsc_scaling_ratio_frac_bits,
-				vcpu->arch.tsc_scaling_ratio,
-				&delta_tsc))
+				vcpu->arch.tsc_scaling_ratio, &delta_tsc))
 		return -ERANGE;
 
 	/*
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/7] KVM: lapic: Hard cap the auto-calculated timer advancement
  2019-04-12 20:18 ` [PATCH 1/7] KVM: lapic: Hard cap the auto-calculated timer advancement Sean Christopherson
@ 2019-04-14 10:22   ` Liran Alon
  0 siblings, 0 replies; 29+ messages in thread
From: Liran Alon @ 2019-04-14 10:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> To minimize the latency of timer interrupts as observed by the guest,
> KVM adjusts the values it programs into the host timers to account for
> the host's overhead of programming and handling the timer event.  Now
> that the timer advancement is automatically tuned during runtime, it's
> effectively unbounded by default, e.g. if KVM is running as L1 the
> advancement can measure in hundreds of milliseconds.
> 
> Place a somewhat arbitrary hard cap of 5000ns on the auto-calculated
> advancement, as large advancements can break reasonable assumptions of
> the guest, e.g. that a timer configured to fire after 1ms won't arrive
> on the next instruction.  Although KVM busy waits to mitigate the timer
> event arriving too early, complications can arise when shifting the
> interrupt too far, e.g. vmx.flat/interrupt in kvm-unit-tests will fail
> when its "host" exits on interrupts (because the INTR is injected before
> the gets executes STI+HLT).  Arguably the unit test is "broken" in the
> sense that delaying the timer interrupt by 1ms doesn't technically
> guarantee the interrupt will arrive after STI+HLT, but it's a reasonable
> assumption that KVM should support.
> 
> Furthermore, an unbounded advancement also effectively unbounds the time
> spent busy waiting, e.g. if the guest programs a timer with a very large
> delay.
> 
> Arguably the advancement logic could simply be disabled when running as
> L1, but KVM needs to bound the advancement time regardless, e.g. if the
> TSC is unstable and the calculations get wildly out of whack.  And
> allowing the advancement when running as L1 is a good stress test of
> sorts for the logic.
> 
> Cc: Liran Alon <liran.alon@oracle.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

I completely agree with the above. The below change also seems correct to me.
Reviewed-by: Liran Alon <liran.alon@oracle.com>

Having said that, I think we are starting to have too many hard-coded parameters for the algorithm adjusting lapic_timer_advance_ns.
(LAPIC_TIMER_ADVANCE_ADJUST_DONE, LAPIC_TIMER_ADVANCE_ADJUST_STEP and now LAPIC_TIMER_ADVANCE_MAX_NS).
We should consider changing them to KVM module parameters.

In addition, this patch led me wondering what will happen when start_sw_tscdeadline() will sub lapic_timer_advance_ns from “expire” such that (expire < now).
I think that because apic->lapic_timer.timer is init with HRTIMER_MODE_ABS_PINNED, it is not allowed to run in softirq and therefore will never expire.
Thus, I think that start_sw_tscdeadline() should check if (expire < now) after sub of lapic_timer_advance_ns and if so call apic_timer_expired() instead of hrtimer_start().
I can submit such a patch if you agree.

-Liran

> ---
> arch/x86/kvm/lapic.c | 5 +++++
> 1 file changed, 5 insertions(+)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 9bf70cf84564..92446cba9b24 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -74,6 +74,7 @@ static bool lapic_timer_advance_adjust_done = false;
> #define LAPIC_TIMER_ADVANCE_ADJUST_DONE 100
> /* step-by-step approximation to mitigate fluctuation */
> #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
> +#define LAPIC_TIMER_ADVANCE_MAX_NS	5000
> 
> static inline int apic_test_vector(int vec, void *bitmap)
> {
> @@ -1522,6 +1523,10 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> 		}
> 		if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
> 			lapic_timer_advance_adjust_done = true;
> +		if (unlikely(lapic_timer_advance_ns > LAPIC_TIMER_ADVANCE_MAX_NS)) {
> +			lapic_timer_advance_ns = LAPIC_TIMER_ADVANCE_MAX_NS;
> +			lapic_timer_advance_adjust_done = true;
> +		}
> 	}
> }
> 
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire"
  2019-04-12 20:18 ` [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire" Sean Christopherson
@ 2019-04-14 11:25   ` Liran Alon
  2019-04-15 16:11     ` Sean Christopherson
  2019-04-16 11:02   ` Paolo Bonzini
  1 sibling, 1 reply; 29+ messages in thread
From: Liran Alon @ 2019-04-14 11:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> To minimize the latency of timer interrupts as observed by the guest,
> KVM adjusts the values it programs into the host timers to account for
> the host's overhead of programming and handling the timer event.  In
> the event that the adjustments are too aggressive, i.e. the timer fires
> earlier than the guest expects, KVM busy waits immediately prior to
> entering the guest.
> 
> Currently, KVM manually converts the delay from nanoseconds to clock
> cycles.  But, the conversion is done in the guest's time domain, while
> the delay occurs in the host's time domain, i.e. the delay may not be
> accurate and could wait too little or too long.  

Just to make sure I understand the issue you see here:
The __delay() is called with a parameter which the number of guest cycles required
to get from current guest cycle (guest_tsc) to deadline cycle (tsc_deadline).
But in case guest vCPU frequency is different than physical CPU frequency,
this parameter should further be converted from guest cycles to host cycles.
Do I understand correctly? If yes, I think you should elaborate this part a bit more in commit message.
As this is quite confusing. :)

> Sidestep the headache
> of shifting time domains by delaying 1ns at a time and breaking the loop
> when the guest's desired timer delay has been satisfied.  Because the
> advancement, which caps the delay to avoid unbounded busy waiting, is
> stored in nanoseconds, the current advancement time can simply be used
> as a loop terminator since we're delaying 1ns at a time (plus the few
> cycles of overhead for running the code).

Looking at start_sw_tscdeadline(), we have the same issue where we require to convert guest cycles to host ns time in order to set hrtimer expiration time.
This is simply done by:
host_ns = guest_cycles * 1000000ULL;
do_div(host_ns, vcpu->arch.virtual_tsc_khz);

Why shouldn’t we use the same approach for constructing __delay() parameter from guest_cycles currently pass to it as parameter?
i.e. something such as:
delay_guest_cycles = min(tsc_deadline - guest_tsc, nsec_to_cycles(vcpu, lapic_timer_advance_ns));
delay_host_ns = delay_guest_cycles * 1000000ULL;
do_div(delay_host_ns, vcpu->arch.virtual_tsc_khz);
__delay(delay_host_ns);

-Liran

> 
> Cc: Liran Alon <liran.alon@oracle.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
> arch/x86/kvm/lapic.c | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 92446cba9b24..e797e3145a8b 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
> void wait_lapic_expire(struct kvm_vcpu *vcpu)
> {
> 	struct kvm_lapic *apic = vcpu->arch.apic;
> -	u64 guest_tsc, tsc_deadline, ns;
> +	u32 timer_advance_ns = lapic_timer_advance_ns;
> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;

Nit: I would rename guest_tsc and tmp_tsc to guest_tsc_on_exit and current_guest_tsc respectively.

> 
> 	if (!lapic_in_kernel(vcpu))
> 		return;
> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> 
> 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> 	apic->lapic_timer.expired_tscdeadline = 0;
> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
> 
> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
> -	if (guest_tsc < tsc_deadline)
> -		__delay(min(tsc_deadline - guest_tsc,
> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
> +		ndelay(1);
> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	}
> 
> 	if (!lapic_timer_advance_adjust_done) {
> 		/* too early */
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7] KVM: lapic: Track lapic timer advance per vCPU
  2019-04-12 20:18 ` [PATCH 3/7] KVM: lapic: Track lapic timer advance per vCPU Sean Christopherson
@ 2019-04-14 11:29   ` Liran Alon
  0 siblings, 0 replies; 29+ messages in thread
From: Liran Alon @ 2019-04-14 11:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> Automatically adjusting the globally-shared timer advancement could
> corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting
> the advancement value.  That could be partially fixed by using a local
> variable for the arithmetic, but it would still be susceptible to a
> race when setting timer_advance_adjust_done.
> 
> And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
> correct calibration for a given vCPU may not apply to all vCPUs.
> 
> Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is
> effectively violated when finding a stable advancement takes an extended
> amount of timer.
> 
> Opportunistically change the definition of lapic_timer_advance_ns to
> a u32 so that it matches the style of struct kvm_timer.  Explicitly
> pass the param to kvm_create_lapic() so that it doesn't have to be
> exposed to lapic.c, thus reducing the probability of unintentionally
> using the global value instead of the per-vCPU value.
> 
> Cc: Liran Alon <liran.alon@oracle.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

Makes total sense to me and patch looks good.
Reviewed-by: Liran Alon <liran.alon@oracle.com>

-Liran

> ---
> arch/x86/kvm/lapic.c   | 33 +++++++++++++++++----------------
> arch/x86/kvm/lapic.h   |  4 +++-
> arch/x86/kvm/vmx/vmx.c |  4 +++-
> arch/x86/kvm/x86.c     |  7 +++----
> arch/x86/kvm/x86.h     |  2 --
> 5 files changed, 26 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index e797e3145a8b..70a0acd98e9e 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -70,7 +70,6 @@
> #define APIC_BROADCAST			0xFF
> #define X2APIC_BROADCAST		0xFFFFFFFFul
> 
> -static bool lapic_timer_advance_adjust_done = false;
> #define LAPIC_TIMER_ADVANCE_ADJUST_DONE 100
> /* step-by-step approximation to mitigate fluctuation */
> #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
> @@ -1486,7 +1485,7 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
> void wait_lapic_expire(struct kvm_vcpu *vcpu)
> {
> 	struct kvm_lapic *apic = vcpu->arch.apic;
> -	u32 timer_advance_ns = lapic_timer_advance_ns;
> +	u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns;
> 	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
> 
> 	if (!lapic_in_kernel(vcpu))
> @@ -1508,32 +1507,34 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> 		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> 	}
> 
> -	if (!lapic_timer_advance_adjust_done) {
> +	if (!apic->lapic_timer.timer_advance_adjust_done) {
> 		/* too early */
> 		if (guest_tsc < tsc_deadline) {
> 			ns = (tsc_deadline - guest_tsc) * 1000000ULL;
> 			do_div(ns, vcpu->arch.virtual_tsc_khz);
> -			lapic_timer_advance_ns -= min((unsigned int)ns,
> -				lapic_timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> +			timer_advance_ns -= min((u32)ns,
> +				timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> 		} else {
> 		/* too late */
> 			ns = (guest_tsc - tsc_deadline) * 1000000ULL;
> 			do_div(ns, vcpu->arch.virtual_tsc_khz);
> -			lapic_timer_advance_ns += min((unsigned int)ns,
> -				lapic_timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> +			timer_advance_ns += min((u32)ns,
> +				timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> 		}
> 		if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
> -			lapic_timer_advance_adjust_done = true;
> -		if (unlikely(lapic_timer_advance_ns > LAPIC_TIMER_ADVANCE_MAX_NS)) {
> -			lapic_timer_advance_ns = LAPIC_TIMER_ADVANCE_MAX_NS;
> -			lapic_timer_advance_adjust_done = true;
> +			apic->lapic_timer.timer_advance_adjust_done = true;
> +		if (unlikely(timer_advance_ns > LAPIC_TIMER_ADVANCE_MAX_NS)) {
> +			timer_advance_ns = LAPIC_TIMER_ADVANCE_MAX_NS;
> +			apic->lapic_timer.timer_advance_adjust_done = true;
> 		}
> +		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
> 	}
> }
> 
> static void start_sw_tscdeadline(struct kvm_lapic *apic)
> {
> -	u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
> +	struct kvm_timer *ktimer = &apic->lapic_timer;
> +	u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
> 	u64 ns = 0;
> 	ktime_t expire;
> 	struct kvm_vcpu *vcpu = apic->vcpu;
> @@ -1552,9 +1553,8 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
> 		ns = (tscdeadline - guest_tsc) * 1000000ULL;
> 		do_div(ns, this_tsc_khz);
> 		expire = ktime_add_ns(now, ns);
> -		expire = ktime_sub_ns(expire, lapic_timer_advance_ns);
> -		hrtimer_start(&apic->lapic_timer.timer,
> -				expire, HRTIMER_MODE_ABS_PINNED);
> +		expire = ktime_sub_ns(expire, ktimer->timer_advance_ns);
> +		hrtimer_start(&ktimer->timer, expire, HRTIMER_MODE_ABS_PINNED);
> 	} else
> 		apic_timer_expired(apic);
> 
> @@ -2261,7 +2261,7 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
> 		return HRTIMER_NORESTART;
> }
> 
> -int kvm_create_lapic(struct kvm_vcpu *vcpu)
> +int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
> {
> 	struct kvm_lapic *apic;
> 
> @@ -2285,6 +2285,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
> 	hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
> 		     HRTIMER_MODE_ABS_PINNED);
> 	apic->lapic_timer.timer.function = apic_timer_fn;
> +	apic->lapic_timer.timer_advance_ns = timer_advance_ns;
> 
> 	/*
> 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index ff6ef9c3d760..3e97f8a68967 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -31,8 +31,10 @@ struct kvm_timer {
> 	u32 timer_mode_mask;
> 	u64 tscdeadline;
> 	u64 expired_tscdeadline;
> +	u32 timer_advance_ns;
> 	atomic_t pending;			/* accumulated triggered timers */
> 	bool hv_timer_in_use;
> +	bool timer_advance_adjust_done;
> };
> 
> struct kvm_lapic {
> @@ -62,7 +64,7 @@ struct kvm_lapic {
> 
> struct dest_map;
> 
> -int kvm_create_lapic(struct kvm_vcpu *vcpu);
> +int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns);
> void kvm_free_lapic(struct kvm_vcpu *vcpu);
> 
> int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 5f90616da275..a3211e3ee679 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7031,6 +7031,7 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
> {
> 	struct vcpu_vmx *vmx;
> 	u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
> +	struct kvm_timer *ktimer = &vcpu->arch.apic->lapic_timer;
> 
> 	if (kvm_mwait_in_guest(vcpu->kvm))
> 		return -EOPNOTSUPP;
> @@ -7039,7 +7040,8 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
> 	tscl = rdtsc();
> 	guest_tscl = kvm_read_l1_tsc(vcpu, tscl);
> 	delta_tsc = max(guest_deadline_tsc, guest_tscl) - guest_tscl;
> -	lapic_timer_advance_cycles = nsec_to_cycles(vcpu, lapic_timer_advance_ns);
> +	lapic_timer_advance_cycles = nsec_to_cycles(vcpu,
> +						    ktimer->timer_advance_ns);
> 
> 	if (delta_tsc > lapic_timer_advance_cycles)
> 		delta_tsc -= lapic_timer_advance_cycles;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 38440316a806..b303a21a2bc2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -137,9 +137,8 @@ static u32 __read_mostly tsc_tolerance_ppm = 250;
> module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> 
> /* lapic timer advance (tscdeadline mode only) in nanoseconds */
> -unsigned int __read_mostly lapic_timer_advance_ns = 1000;
> +static u32 __read_mostly lapic_timer_advance_ns = 1000;
> module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
> -EXPORT_SYMBOL_GPL(lapic_timer_advance_ns);
> 
> static bool __read_mostly vector_hashing = true;
> module_param(vector_hashing, bool, S_IRUGO);
> @@ -7891,7 +7890,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> 	}
> 
> 	trace_kvm_entry(vcpu->vcpu_id);
> -	if (lapic_timer_advance_ns)
> +	if (vcpu->arch.apic->lapic_timer.timer_advance_ns)
> 		wait_lapic_expire(vcpu);
> 	guest_enter_irqoff();
> 
> @@ -9079,7 +9078,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
> 
> 	if (irqchip_in_kernel(vcpu->kvm)) {
> 		vcpu->arch.apicv_active = kvm_x86_ops->get_enable_apicv(vcpu);
> -		r = kvm_create_lapic(vcpu);
> +		r = kvm_create_lapic(vcpu, lapic_timer_advance_ns);
> 		if (r < 0)
> 			goto fail_mmu_destroy;
> 	} else
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index eda954441213..a470ff0868c5 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -294,8 +294,6 @@ extern u64 kvm_supported_xcr0(void);
> 
> extern unsigned int min_timer_period_us;
> 
> -extern unsigned int lapic_timer_advance_ns;
> -
> extern bool enable_vmware_backdoor;
> 
> extern struct static_key kvm_no_apic_vcpu;
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/7] KVM: lapic: Allow user to override auto-tuning of timer advancement
  2019-04-12 20:18 ` [PATCH 4/7] KVM: lapic: Allow user to override auto-tuning of timer advancement Sean Christopherson
@ 2019-04-14 11:35   ` Liran Alon
  2019-04-15 16:23     ` Sean Christopherson
  0 siblings, 1 reply; 29+ messages in thread
From: Liran Alon @ 2019-04-14 11:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> Disable auto-tuning the timer advancement if the user specifies an
> explicit value via the module param.  Aside from the obvious override
> capability, this also allows the KVM admin to set the advancement
> beyond the internally-capped max of 5000ns.
> 
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

I agree we should allow admin the ability to disable the auto-tuning.
However, I think we should keep the semantic of lapic_timer_advance_ns value.
Whether this value is used as initial-value for auto-tuning or is auto-tuning disabled is a different knob for admin in my opinion.
Therefore, I think we should just add another module parameter which basically set the initial value for apic->lapic_timer.timer_advance_adjust_done.

One could also wonder if it makes sense that whether auto-tuning will be enabled or not varies between VMs and should not be a global variable of KVM module?

-Liran

> ---
> arch/x86/kvm/lapic.c | 2 ++
> arch/x86/kvm/lapic.h | 2 ++
> arch/x86/kvm/x86.c   | 2 +-
> 3 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 70a0acd98e9e..c43cd26f040b 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2286,6 +2286,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
> 		     HRTIMER_MODE_ABS_PINNED);
> 	apic->lapic_timer.timer.function = apic_timer_fn;
> 	apic->lapic_timer.timer_advance_ns = timer_advance_ns;
> +	if (timer_advance_ns != LAPIC_TIMER_ADVANCE_DEFAULT_NS)
> +		apic->lapic_timer.timer_advance_adjust_done = true;
> 
> 	/*
> 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 3e97f8a68967..c7233629c05b 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -16,6 +16,8 @@
> #define APIC_BUS_CYCLE_NS       1
> #define APIC_BUS_FREQUENCY      (1000000000ULL / APIC_BUS_CYCLE_NS)
> 
> +#define LAPIC_TIMER_ADVANCE_DEFAULT_NS	1000
> +
> enum lapic_mode {
> 	LAPIC_MODE_DISABLED = 0,
> 	LAPIC_MODE_INVALID = X2APIC_ENABLE,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b303a21a2bc2..709a8bf5ae0e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -137,7 +137,7 @@ static u32 __read_mostly tsc_tolerance_ppm = 250;
> module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> 
> /* lapic timer advance (tscdeadline mode only) in nanoseconds */
> -static u32 __read_mostly lapic_timer_advance_ns = 1000;
> +static u32 __read_mostly lapic_timer_advance_ns = LAPIC_TIMER_ADVANCE_DEFAULT_NS;
> module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
> 
> static bool __read_mostly vector_hashing = true;
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/7] KVM: lapic: Busy wait for timer to expire when using hv_timer
  2019-04-12 20:18 ` [PATCH 5/7] KVM: lapic: Busy wait for timer to expire when using hv_timer Sean Christopherson
@ 2019-04-14 11:47   ` Liran Alon
  0 siblings, 0 replies; 29+ messages in thread
From: Liran Alon @ 2019-04-14 11:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> ...now that VMX's preemption timer, i.e. the hv_timer, also adjusts its
> programmed time based on lapic_timer_advance_ns.  Without the delay, a
> guest can see a timer interrupt arrive before the requested time when
> KVM is using the hv_timer to emulate the guest's interrupt.
> 
> Fixes: c5ce8235cffa0 ("KVM: VMX: Optimize tscdeadline timer latency")
> Cc: <stable@vger.kernel.org>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

Nice catch.
Reviewed-by: Liran Alon <liran.alon@oracle.com>

> ---
> arch/x86/kvm/lapic.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index c43cd26f040b..1d649a2af04c 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1455,7 +1455,7 @@ static void apic_timer_expired(struct kvm_lapic *apic)
> 	if (swait_active(q))
> 		swake_up_one(q);
> 
> -	if (apic_lvtt_tscdeadline(apic))
> +	if (apic_lvtt_tscdeadline(apic) || ktimer->hv_timer_in_use)
> 		ktimer->expired_tscdeadline = ktimer->tscdeadline;
> }
> 
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/7] KVM: lapic: Clean up the code for handling of a pre-expired hv_timer
  2019-04-12 20:18 ` [PATCH 6/7] KVM: lapic: Clean up the code for handling of a pre-expired hv_timer Sean Christopherson
@ 2019-04-14 12:15   ` Liran Alon
  2019-04-15 16:32     ` Sean Christopherson
  2019-04-16 11:14     ` Paolo Bonzini
  0 siblings, 2 replies; 29+ messages in thread
From: Liran Alon @ 2019-04-14 12:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> Calling apic_timer_expired() is a nop when a timer interrupt is already
> pending, i.e. there's no need to call apic_timer_expired() when there's
> a pending interrupt and the hv_timer wants to pend its own interrupt.
> Separate the two flows to make the code more readable and to avoid an
> unnecessary function call and read to ktimer->pending.

In case timer is not periodic and r==1, atomic_read(&ktimer->pending) is not executed.

> 
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
> arch/x86/kvm/lapic.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 1d649a2af04c..f0be6f148a47 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1703,9 +1703,12 @@ static bool start_hv_timer(struct kvm_lapic *apic)
> 	 * the window.  For periodic timer, leave the hv timer running for
> 	 * simplicity, and the deadline will be recomputed on the next vmexit.
> 	 */
> -	if (!apic_lvtt_period(apic) && (r || atomic_read(&ktimer->pending))) {
> -		if (r)
> -			apic_timer_expired(apic);
> +	if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
> +		return false;
> +
> +	/* set_hv_timer() returns '1' when the timer has already expired. */
> +	if (r) {
> +		apic_timer_expired(apic);
> 		return false;
> 	}
> 
> -- 
> 2.21.0
> 

First, I think you should emphasise in commit message that you have actually fixed a rare bug here.
In case timer is periodic but given ktimer->tscdeadline has already expired on host, we should call apic_timer_expired().

In addition, when start_hv_timer() returns false, restart_apic_timer() just calls start_sw_timer() which use hrtimer instead of VMX preemption timer.
Therefore, it seems a bit ineffective to me for start_hv_timer() to return false in case ktimer->pending or when ktimer->tscdeadline already expired.
Shouldn’t we return true in these cases?

-Liran





















^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] KVM: VMX: Skip delta_tsc shift-and-divide if the dividend is zero
  2019-04-12 20:18 ` [PATCH 7/7] KVM: VMX: Skip delta_tsc shift-and-divide if the dividend is zero Sean Christopherson
@ 2019-04-14 12:21   ` Liran Alon
  0 siblings, 0 replies; 29+ messages in thread
From: Liran Alon @ 2019-04-14 12:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> Ten percent of nothin' is... let me do the math here.  Nothin' into
> nothin', carry the nothin'...
> 
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

I think you should explicitly state in commit message title that this is an optimisation. A rather small one I must say :)
Having said that:
Reviewed-by: Liran Alon <liran.alon@oracle.com>

-Liran

> ---
> arch/x86/kvm/vmx/vmx.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index a3211e3ee679..2fec11bc0ac9 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7050,10 +7050,9 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
> 
> 	/* Convert to host delta tsc if tsc scaling is enabled */
> 	if (vcpu->arch.tsc_scaling_ratio != kvm_default_tsc_scaling_ratio &&
> -			u64_shl_div_u64(delta_tsc,
> +	    delta_tsc && u64_shl_div_u64(delta_tsc,
> 				kvm_tsc_scaling_ratio_frac_bits,
> -				vcpu->arch.tsc_scaling_ratio,
> -				&delta_tsc))
> +				vcpu->arch.tsc_scaling_ratio, &delta_tsc))
> 		return -ERANGE;
> 
> 	/*
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire"
  2019-04-14 11:25   ` Liran Alon
@ 2019-04-15 16:11     ` Sean Christopherson
  2019-04-15 17:06       ` Liran Alon
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2019-04-15 16:11 UTC (permalink / raw)
  To: Liran Alon; +Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li

On Sun, Apr 14, 2019 at 02:25:44PM +0300, Liran Alon wrote:
> 
> 
> > On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > 
> > To minimize the latency of timer interrupts as observed by the guest,
> > KVM adjusts the values it programs into the host timers to account for
> > the host's overhead of programming and handling the timer event.  In
> > the event that the adjustments are too aggressive, i.e. the timer fires
> > earlier than the guest expects, KVM busy waits immediately prior to
> > entering the guest.
> > 
> > Currently, KVM manually converts the delay from nanoseconds to clock
> > cycles.  But, the conversion is done in the guest's time domain, while
> > the delay occurs in the host's time domain, i.e. the delay may not be
> > accurate and could wait too little or too long.  
> 
> Just to make sure I understand the issue you see here:
> The __delay() is called with a parameter which the number of guest cycles required
> to get from current guest cycle (guest_tsc) to deadline cycle (tsc_deadline).
> But in case guest vCPU frequency is different than physical CPU frequency,
> this parameter should further be converted from guest cycles to host cycles.
> Do I understand correctly? If yes, I think you should elaborate this part a bit more in commit message.
> As this is quite confusing. :)

Heh, I didn't elaborate precisely because it was so darn confusing.  I
didn't want to say something completely incorrect so I kept it high level.

> 
> > Sidestep the headache
> > of shifting time domains by delaying 1ns at a time and breaking the loop
> > when the guest's desired timer delay has been satisfied.  Because the
> > advancement, which caps the delay to avoid unbounded busy waiting, is
> > stored in nanoseconds, the current advancement time can simply be used
> > as a loop terminator since we're delaying 1ns at a time (plus the few
> > cycles of overhead for running the code).
> 
> Looking at start_sw_tscdeadline(), we have the same issue where we require to convert guest cycles to host ns time in order to set hrtimer expiration time.
> This is simply done by:
> host_ns = guest_cycles * 1000000ULL;
> do_div(host_ns, vcpu->arch.virtual_tsc_khz);
> 
> Why shouldn’t we use the same approach for constructing __delay() parameter from guest_cycles currently pass to it as parameter?
> i.e. something such as:
> delay_guest_cycles = min(tsc_deadline - guest_tsc, nsec_to_cycles(vcpu, lapic_timer_advance_ns));
> delay_host_ns = delay_guest_cycles * 1000000ULL;
> do_div(delay_host_ns, vcpu->arch.virtual_tsc_khz);
> __delay(delay_host_ns);

That's also an option.  I opted for the loop as it felt simpler and more
obvious.  And explicitly checking rechecking kvm_read_l1_tsc() guarantees
the guest won't see an "early" TSC unless the cap kicks in.  That being
said, I don't have a strong opinion either way.

As for the code, we'd probably want to do the conversion first and cap
second so as to avoid converting lapic_timer_advance_ns to the guest's
TSC and then back to ns.  It should use ndelay(), and maybe just call it
"delay_ns" to avoid (temporarily) assigning a guest TSC calculation to a
host variable, e.g.:

  delay_ns = (tsc_deadline - guest_tsc) * 1000000ULL;
  do_div(delay_ns, vcpu->arch.virtual_tsc_khz);
  ndelay(min(delay_ns, lapic_timer_advance_ns));

That's actually nowhere near as as bad as I thought it would be now that
it's typed out...


> 
> -Liran
> 
> > 
> > Cc: Liran Alon <liran.alon@oracle.com>
> > Cc: Wanpeng Li <wanpengli@tencent.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> > arch/x86/kvm/lapic.c | 13 +++++++------
> > 1 file changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 92446cba9b24..e797e3145a8b 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
> > void wait_lapic_expire(struct kvm_vcpu *vcpu)
> > {
> > 	struct kvm_lapic *apic = vcpu->arch.apic;
> > -	u64 guest_tsc, tsc_deadline, ns;
> > +	u32 timer_advance_ns = lapic_timer_advance_ns;
> > +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
> 
> Nit: I would rename guest_tsc and tmp_tsc to guest_tsc_on_exit and current_guest_tsc respectively.

guest_tsc_on_exit isn't correct though, as it holds the guest's TSC at the
beginning of wait_lapic_expire(), e.g. guest_tsc_start would be more
appropriate.  My main motivation for not using verbose names is keep lines
short, i.e. avoid wrapping at 80 chars.  For me, wrapping a bunch of lines
makes the code more difficult to read than less verbose names.

> 
> > 
> > 	if (!lapic_in_kernel(vcpu))
> > 		return;
> > @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> > 
> > 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> > 	apic->lapic_timer.expired_tscdeadline = 0;
> > -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
> > 
> > -	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
> > -	if (guest_tsc < tsc_deadline)
> > -		__delay(min(tsc_deadline - guest_tsc,
> > -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
> > +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
> > +		ndelay(1);
> > +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > +	}
> > 
> > 	if (!lapic_timer_advance_adjust_done) {
> > 		/* too early */
> > -- 
> > 2.21.0
> > 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/7] KVM: lapic: Allow user to override auto-tuning of timer advancement
  2019-04-14 11:35   ` Liran Alon
@ 2019-04-15 16:23     ` Sean Christopherson
  2019-04-15 17:10       ` Liran Alon
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2019-04-15 16:23 UTC (permalink / raw)
  To: Liran Alon; +Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li

On Sun, Apr 14, 2019 at 02:35:39PM +0300, Liran Alon wrote:
> 
> 
> > On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > 
> > Disable auto-tuning the timer advancement if the user specifies an
> > explicit value via the module param.  Aside from the obvious override
> > capability, this also allows the KVM admin to set the advancement
> > beyond the internally-capped max of 5000ns.
> > 
> > Cc: Wanpeng Li <wanpengli@tencent.com>
> > Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> I agree we should allow admin the ability to disable the auto-tuning.
> However, I think we should keep the semantic of lapic_timer_advance_ns value.
> Whether this value is used as initial-value for auto-tuning or is auto-tuning
> disabled is a different knob for admin in my opinion.  Therefore, I think we
> should just add another module parameter which basically set the initial
> value for apic->lapic_timer.timer_advance_adjust_done.

I waffled on whether to add a param or do the implicit override.  I opted
for the implicit behavior primarily because I think most people would
expect that setting lapic_timer_advance_ns to some specific value would
fix the advancement at said value, as opposed to acting as a hint to the
autotuning logic.  In other words, I again don't have a strong opinion :-)

> One could also wonder if it makes sense that whether auto-tuning will be
> enabled or not varies between VMs and should not be a global variable of KVM
> module?

I have no opinion on this one as I don't have any direct visibility into
how the auto-tuning will be used in real world deployments.

> 
> -Liran
> 
> > ---
> > arch/x86/kvm/lapic.c | 2 ++
> > arch/x86/kvm/lapic.h | 2 ++
> > arch/x86/kvm/x86.c   | 2 +-
> > 3 files changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 70a0acd98e9e..c43cd26f040b 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2286,6 +2286,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
> > 		     HRTIMER_MODE_ABS_PINNED);
> > 	apic->lapic_timer.timer.function = apic_timer_fn;
> > 	apic->lapic_timer.timer_advance_ns = timer_advance_ns;
> > +	if (timer_advance_ns != LAPIC_TIMER_ADVANCE_DEFAULT_NS)
> > +		apic->lapic_timer.timer_advance_adjust_done = true;
> > 
> > 	/*
> > 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index 3e97f8a68967..c7233629c05b 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -16,6 +16,8 @@
> > #define APIC_BUS_CYCLE_NS       1
> > #define APIC_BUS_FREQUENCY      (1000000000ULL / APIC_BUS_CYCLE_NS)
> > 
> > +#define LAPIC_TIMER_ADVANCE_DEFAULT_NS	1000
> > +
> > enum lapic_mode {
> > 	LAPIC_MODE_DISABLED = 0,
> > 	LAPIC_MODE_INVALID = X2APIC_ENABLE,
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index b303a21a2bc2..709a8bf5ae0e 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -137,7 +137,7 @@ static u32 __read_mostly tsc_tolerance_ppm = 250;
> > module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> > 
> > /* lapic timer advance (tscdeadline mode only) in nanoseconds */
> > -static u32 __read_mostly lapic_timer_advance_ns = 1000;
> > +static u32 __read_mostly lapic_timer_advance_ns = LAPIC_TIMER_ADVANCE_DEFAULT_NS;
> > module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
> > 
> > static bool __read_mostly vector_hashing = true;
> > -- 
> > 2.21.0
> > 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/7] KVM: lapic: Clean up the code for handling of a pre-expired hv_timer
  2019-04-14 12:15   ` Liran Alon
@ 2019-04-15 16:32     ` Sean Christopherson
  2019-04-15 17:25       ` Liran Alon
  2019-04-16 11:14     ` Paolo Bonzini
  1 sibling, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2019-04-15 16:32 UTC (permalink / raw)
  To: Liran Alon; +Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li

On Sun, Apr 14, 2019 at 03:15:41PM +0300, Liran Alon wrote:
> 
> 
> > On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > 
> > Calling apic_timer_expired() is a nop when a timer interrupt is already
> > pending, i.e. there's no need to call apic_timer_expired() when there's
> > a pending interrupt and the hv_timer wants to pend its own interrupt.
> > Separate the two flows to make the code more readable and to avoid an
> > unnecessary function call and read to ktimer->pending.
> 
> In case timer is not periodic and r==1, atomic_read(&ktimer->pending) is not executed.
> 
> > 
> > Cc: Wanpeng Li <wanpengli@tencent.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> > arch/x86/kvm/lapic.c | 9 ++++++---
> > 1 file changed, 6 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 1d649a2af04c..f0be6f148a47 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -1703,9 +1703,12 @@ static bool start_hv_timer(struct kvm_lapic *apic)
> > 	 * the window.  For periodic timer, leave the hv timer running for
> > 	 * simplicity, and the deadline will be recomputed on the next vmexit.
> > 	 */
> > -	if (!apic_lvtt_period(apic) && (r || atomic_read(&ktimer->pending))) {
> > -		if (r)
> > -			apic_timer_expired(apic);
> > +	if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
> > +		return false;
> > +
> > +	/* set_hv_timer() returns '1' when the timer has already expired. */
> > +	if (r) {
> > +		apic_timer_expired(apic);
> > 		return false;
> > 	}
> > 
> > -- 
> > 2.21.0
> > 
> 
> First, I think you should emphasise in commit message that you have actually
> fixed a rare bug here.  In case timer is periodic but given
> ktimer->tscdeadline has already expired on host, we should call
> apic_timer_expired().

Heh, I actually didn't even catch that bug, I was simply cleaning up the
code because I had a hard time following the logic.

> In addition, when start_hv_timer() returns false, restart_apic_timer() just
> calls start_sw_timer() which use hrtimer instead of VMX preemption timer.
> Therefore, it seems a bit ineffective to me for start_hv_timer() to return
> false in case ktimer->pending or when ktimer->tscdeadline already expired.
> Shouldn’t we return true in these cases?

That also seemed weird to me.  Again, I had a hell of a time following the
intended logic and didn't want to break anything.  AFAICT, the motivation
for calling start_sw_timer() is to cancel the HV timer, and possibly to
ensure start_sw_period() is called when necessary.  But the latter will be
handled by virtue of checking "r" after apic_lvtt_period(), so this?

	if (r) {
		apic_timer_expired(apic);
		ktimer->hv_timer_in_use = false;
		return true;
	}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire"
  2019-04-15 16:11     ` Sean Christopherson
@ 2019-04-15 17:06       ` Liran Alon
  0 siblings, 0 replies; 29+ messages in thread
From: Liran Alon @ 2019-04-15 17:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 15 Apr 2019, at 19:11, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Sun, Apr 14, 2019 at 02:25:44PM +0300, Liran Alon wrote:
>> 
>> 
>>> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>>> 
>>> To minimize the latency of timer interrupts as observed by the guest,
>>> KVM adjusts the values it programs into the host timers to account for
>>> the host's overhead of programming and handling the timer event.  In
>>> the event that the adjustments are too aggressive, i.e. the timer fires
>>> earlier than the guest expects, KVM busy waits immediately prior to
>>> entering the guest.
>>> 
>>> Currently, KVM manually converts the delay from nanoseconds to clock
>>> cycles.  But, the conversion is done in the guest's time domain, while
>>> the delay occurs in the host's time domain, i.e. the delay may not be
>>> accurate and could wait too little or too long.  
>> 
>> Just to make sure I understand the issue you see here:
>> The __delay() is called with a parameter which the number of guest cycles required
>> to get from current guest cycle (guest_tsc) to deadline cycle (tsc_deadline).
>> But in case guest vCPU frequency is different than physical CPU frequency,
>> this parameter should further be converted from guest cycles to host cycles.
>> Do I understand correctly? If yes, I think you should elaborate this part a bit more in commit message.
>> As this is quite confusing. :)
> 
> Heh, I didn't elaborate precisely because it was so darn confusing.  I
> didn't want to say something completely incorrect so I kept it high level.

Heh, I can agree with that :)
But we should make sure others who read commit message won’t get confuse.

> 
>> 
>>> Sidestep the headache
>>> of shifting time domains by delaying 1ns at a time and breaking the loop
>>> when the guest's desired timer delay has been satisfied.  Because the
>>> advancement, which caps the delay to avoid unbounded busy waiting, is
>>> stored in nanoseconds, the current advancement time can simply be used
>>> as a loop terminator since we're delaying 1ns at a time (plus the few
>>> cycles of overhead for running the code).
>> 
>> Looking at start_sw_tscdeadline(), we have the same issue where we require to convert guest cycles to host ns time in order to set hrtimer expiration time.
>> This is simply done by:
>> host_ns = guest_cycles * 1000000ULL;
>> do_div(host_ns, vcpu->arch.virtual_tsc_khz);
>> 
>> Why shouldn’t we use the same approach for constructing __delay() parameter from guest_cycles currently pass to it as parameter?
>> i.e. something such as:
>> delay_guest_cycles = min(tsc_deadline - guest_tsc, nsec_to_cycles(vcpu, lapic_timer_advance_ns));
>> delay_host_ns = delay_guest_cycles * 1000000ULL;
>> do_div(delay_host_ns, vcpu->arch.virtual_tsc_khz);
>> __delay(delay_host_ns);
> 
> That's also an option.  I opted for the loop as it felt simpler and more
> obvious.  And explicitly checking rechecking kvm_read_l1_tsc() guarantees
> the guest won't see an "early" TSC unless the cap kicks in.  That being
> said, I don't have a strong opinion either way.
> 
> As for the code, we'd probably want to do the conversion first and cap
> second so as to avoid converting lapic_timer_advance_ns to the guest's
> TSC and then back to ns.  It should use ndelay(), and maybe just call it
> "delay_ns" to avoid (temporarily) assigning a guest TSC calculation to a
> host variable, e.g.:
> 
>  delay_ns = (tsc_deadline - guest_tsc) * 1000000ULL;
>  do_div(delay_ns, vcpu->arch.virtual_tsc_khz);
>  ndelay(min(delay_ns, lapic_timer_advance_ns));
> 
> That's actually nowhere near as as bad as I thought it would be now that
> it's typed out…

Yes I agree this code looks nice. I think it’s better. I recommend sending a v2 with this version.

> 
> 
>> 
>> -Liran
>> 
>>> 
>>> Cc: Liran Alon <liran.alon@oracle.com>
>>> Cc: Wanpeng Li <wanpengli@tencent.com>
>>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>>> ---
>>> arch/x86/kvm/lapic.c | 13 +++++++------
>>> 1 file changed, 7 insertions(+), 6 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index 92446cba9b24..e797e3145a8b 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
>>> void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>> {
>>> 	struct kvm_lapic *apic = vcpu->arch.apic;
>>> -	u64 guest_tsc, tsc_deadline, ns;
>>> +	u32 timer_advance_ns = lapic_timer_advance_ns;
>>> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
>> 
>> Nit: I would rename guest_tsc and tmp_tsc to guest_tsc_on_exit and current_guest_tsc respectively.
> 
> guest_tsc_on_exit isn't correct though, as it holds the guest's TSC at the
> beginning of wait_lapic_expire(), e.g. guest_tsc_start would be more
> appropriate.  My main motivation for not using verbose names is keep lines
> short, i.e. avoid wrapping at 80 chars.  For me, wrapping a bunch of lines
> makes the code more difficult to read than less verbose names.

I think it doesn’t matter now if we agree with the above mentioned suggestion of avoiding this loop. :)

-Liran

> 
>> 
>>> 
>>> 	if (!lapic_in_kernel(vcpu))
>>> 		return;
>>> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>> 
>>> 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
>>> 	apic->lapic_timer.expired_tscdeadline = 0;
>>> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>> 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>>> 
>>> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
>>> -	if (guest_tsc < tsc_deadline)
>>> -		__delay(min(tsc_deadline - guest_tsc,
>>> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
>>> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
>>> +		ndelay(1);
>>> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>> +	}
>>> 
>>> 	if (!lapic_timer_advance_adjust_done) {
>>> 		/* too early */
>>> -- 
>>> 2.21.0
>>> 
>> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/7] KVM: lapic: Allow user to override auto-tuning of timer advancement
  2019-04-15 16:23     ` Sean Christopherson
@ 2019-04-15 17:10       ` Liran Alon
  0 siblings, 0 replies; 29+ messages in thread
From: Liran Alon @ 2019-04-15 17:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 15 Apr 2019, at 19:23, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Sun, Apr 14, 2019 at 02:35:39PM +0300, Liran Alon wrote:
>> 
>> 
>>> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>>> 
>>> Disable auto-tuning the timer advancement if the user specifies an
>>> explicit value via the module param.  Aside from the obvious override
>>> capability, this also allows the KVM admin to set the advancement
>>> beyond the internally-capped max of 5000ns.
>>> 
>>> Cc: Wanpeng Li <wanpengli@tencent.com>
>>> Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
>>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> 
>> I agree we should allow admin the ability to disable the auto-tuning.
>> However, I think we should keep the semantic of lapic_timer_advance_ns value.
>> Whether this value is used as initial-value for auto-tuning or is auto-tuning
>> disabled is a different knob for admin in my opinion.  Therefore, I think we
>> should just add another module parameter which basically set the initial
>> value for apic->lapic_timer.timer_advance_adjust_done.
> 
> I waffled on whether to add a param or do the implicit override.  I opted
> for the implicit behavior primarily because I think most people would
> expect that setting lapic_timer_advance_ns to some specific value would
> fix the advancement at said value, as opposed to acting as a hint to the
> autotuning logic.  In other words, I again don't have a strong opinion :-)

Let’s wait for Paolo’s opinion on this then :)

BTW, a nice side-effect of my proposal is that also the current value used by KVM module at runtime (after auto-tuning)
can be viewed by printing current value of lapic_timer_advance_ns module_param.

> 
>> One could also wonder if it makes sense that whether auto-tuning will be
>> enabled or not varies between VMs and should not be a global variable of KVM
>> module?
> 
> I have no opinion on this one as I don't have any direct visibility into
> how the auto-tuning will be used in real world deployments.

I will wait for Paolo’s & Wanpeng’s opinion on this as-well.

-Liran

> 
>> 
>> -Liran
>> 
>>> ---
>>> arch/x86/kvm/lapic.c | 2 ++
>>> arch/x86/kvm/lapic.h | 2 ++
>>> arch/x86/kvm/x86.c   | 2 +-
>>> 3 files changed, 5 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index 70a0acd98e9e..c43cd26f040b 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -2286,6 +2286,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
>>> 		     HRTIMER_MODE_ABS_PINNED);
>>> 	apic->lapic_timer.timer.function = apic_timer_fn;
>>> 	apic->lapic_timer.timer_advance_ns = timer_advance_ns;
>>> +	if (timer_advance_ns != LAPIC_TIMER_ADVANCE_DEFAULT_NS)
>>> +		apic->lapic_timer.timer_advance_adjust_done = true;
>>> 
>>> 	/*
>>> 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
>>> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
>>> index 3e97f8a68967..c7233629c05b 100644
>>> --- a/arch/x86/kvm/lapic.h
>>> +++ b/arch/x86/kvm/lapic.h
>>> @@ -16,6 +16,8 @@
>>> #define APIC_BUS_CYCLE_NS       1
>>> #define APIC_BUS_FREQUENCY      (1000000000ULL / APIC_BUS_CYCLE_NS)
>>> 
>>> +#define LAPIC_TIMER_ADVANCE_DEFAULT_NS	1000
>>> +
>>> enum lapic_mode {
>>> 	LAPIC_MODE_DISABLED = 0,
>>> 	LAPIC_MODE_INVALID = X2APIC_ENABLE,
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index b303a21a2bc2..709a8bf5ae0e 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -137,7 +137,7 @@ static u32 __read_mostly tsc_tolerance_ppm = 250;
>>> module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
>>> 
>>> /* lapic timer advance (tscdeadline mode only) in nanoseconds */
>>> -static u32 __read_mostly lapic_timer_advance_ns = 1000;
>>> +static u32 __read_mostly lapic_timer_advance_ns = LAPIC_TIMER_ADVANCE_DEFAULT_NS;
>>> module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
>>> 
>>> static bool __read_mostly vector_hashing = true;
>>> -- 
>>> 2.21.0
>>> 
>> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/7] KVM: lapic: Clean up the code for handling of a pre-expired hv_timer
  2019-04-15 16:32     ` Sean Christopherson
@ 2019-04-15 17:25       ` Liran Alon
  2019-04-16 16:39         ` Sean Christopherson
  0 siblings, 1 reply; 29+ messages in thread
From: Liran Alon @ 2019-04-15 17:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 15 Apr 2019, at 19:32, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Sun, Apr 14, 2019 at 03:15:41PM +0300, Liran Alon wrote:
>> 
>> 
>>> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>>> 
>>> Calling apic_timer_expired() is a nop when a timer interrupt is already
>>> pending, i.e. there's no need to call apic_timer_expired() when there's
>>> a pending interrupt and the hv_timer wants to pend its own interrupt.
>>> Separate the two flows to make the code more readable and to avoid an
>>> unnecessary function call and read to ktimer->pending.
>> 
>> In case timer is not periodic and r==1, atomic_read(&ktimer->pending) is not executed.
>> 
>>> 
>>> Cc: Wanpeng Li <wanpengli@tencent.com>
>>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>>> ---
>>> arch/x86/kvm/lapic.c | 9 ++++++---
>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index 1d649a2af04c..f0be6f148a47 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -1703,9 +1703,12 @@ static bool start_hv_timer(struct kvm_lapic *apic)
>>> 	 * the window.  For periodic timer, leave the hv timer running for
>>> 	 * simplicity, and the deadline will be recomputed on the next vmexit.
>>> 	 */
>>> -	if (!apic_lvtt_period(apic) && (r || atomic_read(&ktimer->pending))) {
>>> -		if (r)
>>> -			apic_timer_expired(apic);
>>> +	if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
>>> +		return false;
>>> +
>>> +	/* set_hv_timer() returns '1' when the timer has already expired. */
>>> +	if (r) {
>>> +		apic_timer_expired(apic);
>>> 		return false;
>>> 	}
>>> 
>>> -- 
>>> 2.21.0
>>> 
>> 
>> First, I think you should emphasise in commit message that you have actually
>> fixed a rare bug here.  In case timer is periodic but given
>> ktimer->tscdeadline has already expired on host, we should call
>> apic_timer_expired().
> 
> Heh, I actually didn't even catch that bug, I was simply cleaning up the
> code because I had a hard time following the logic.

LOL. So you can put me in the Reported-by tag :P

> 
>> In addition, when start_hv_timer() returns false, restart_apic_timer() just
>> calls start_sw_timer() which use hrtimer instead of VMX preemption timer.
>> Therefore, it seems a bit ineffective to me for start_hv_timer() to return
>> false in case ktimer->pending or when ktimer->tscdeadline already expired.
>> Shouldn’t we return true in these cases?
> 
> That also seemed weird to me.  Again, I had a hell of a time following the
> intended logic and didn't want to break anything.  AFAICT, the motivation
> for calling start_sw_timer() is to cancel the HV timer, and possibly to
> ensure start_sw_period() is called when necessary.

I think the motivation is that if there is any reason why hardware accelerated timer (i.e. VMX preemption timer),
can't be used to emulate the LAPIC timer, then utilise a software hrtimer based implementation instead.

This does align with why we return false when (!kvm_x86_ops->set_hv_timer) or (kvm_x86_ops->set_hv_timer() < 0).
However, this doesn’t align in case we have a (non-periodic timer and ktimer->pending) OR ktimer->tscdeadline already expired OR (!ktimer->tscdeadline).

In fact, note that start_sw_timer() early-exit when non-periodic timer and ktimer->pending…
Same is also true for start_sw_tscdeadline() early-exit when (!ktimer->tscdeadline).

> But the latter will be
> handled by virtue of checking "r" after apic_lvtt_period(), so this?
> 
> 	if (r) {
> 		apic_timer_expired(apic);
> 		ktimer->hv_timer_in_use = false;
> 		return true;
> 	}

I think I will just submit a patch to fix all the above examples I made as this just seems wrong to me.
Unless you find something I have missed. :P

-Liran



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire"
  2019-04-12 20:18 ` [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire" Sean Christopherson
  2019-04-14 11:25   ` Liran Alon
@ 2019-04-16 11:02   ` Paolo Bonzini
  2019-04-16 11:04     ` Liran Alon
  1 sibling, 1 reply; 29+ messages in thread
From: Paolo Bonzini @ 2019-04-16 11:02 UTC (permalink / raw)
  To: Sean Christopherson, Radim Krčmář
  Cc: kvm, Liran Alon, Wanpeng Li

On 12/04/19 22:18, Sean Christopherson wrote:
> To minimize the latency of timer interrupts as observed by the guest,
> KVM adjusts the values it programs into the host timers to account for
> the host's overhead of programming and handling the timer event.  In
> the event that the adjustments are too aggressive, i.e. the timer fires
> earlier than the guest expects, KVM busy waits immediately prior to
> entering the guest.
> 
> Currently, KVM manually converts the delay from nanoseconds to clock
> cycles.  But, the conversion is done in the guest's time domain, while
> the delay occurs in the host's time domain, i.e. the delay may not be
> accurate and could wait too little or too long.  Sidestep the headache
> of shifting time domains by delaying 1ns at a time and breaking the loop
> when the guest's desired timer delay has been satisfied.  Because the
> advancement, which caps the delay to avoid unbounded busy waiting, is
> stored in nanoseconds, the current advancement time can simply be used
> as a loop terminator since we're delaying 1ns at a time (plus the few
> cycles of overhead for running the code).

A 1 nanosecond advance (3-5 clock cycles) is even shorter than the time
taken to execute kvm_read_l1_tsc.  I would just replace it with
cpu_relax(); I can do the change when applying.

Paolo

> Cc: Liran Alon <liran.alon@oracle.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/lapic.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 92446cba9b24..e797e3145a8b 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
>  void wait_lapic_expire(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_lapic *apic = vcpu->arch.apic;
> -	u64 guest_tsc, tsc_deadline, ns;
> +	u32 timer_advance_ns = lapic_timer_advance_ns;
> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
>  
>  	if (!lapic_in_kernel(vcpu))
>  		return;
> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>  
>  	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
>  	apic->lapic_timer.expired_tscdeadline = 0;
> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>  	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>  
> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
> -	if (guest_tsc < tsc_deadline)
> -		__delay(min(tsc_deadline - guest_tsc,
> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
> +		ndelay(1);
> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	}
>  
>  	if (!lapic_timer_advance_adjust_done) {
>  		/* too early */
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire"
  2019-04-16 11:02   ` Paolo Bonzini
@ 2019-04-16 11:04     ` Liran Alon
  2019-04-16 11:09       ` Paolo Bonzini
  0 siblings, 1 reply; 29+ messages in thread
From: Liran Alon @ 2019-04-16 11:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Radim Krčmář, kvm, Wanpeng Li



> On 16 Apr 2019, at 14:02, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 12/04/19 22:18, Sean Christopherson wrote:
>> To minimize the latency of timer interrupts as observed by the guest,
>> KVM adjusts the values it programs into the host timers to account for
>> the host's overhead of programming and handling the timer event.  In
>> the event that the adjustments are too aggressive, i.e. the timer fires
>> earlier than the guest expects, KVM busy waits immediately prior to
>> entering the guest.
>> 
>> Currently, KVM manually converts the delay from nanoseconds to clock
>> cycles.  But, the conversion is done in the guest's time domain, while
>> the delay occurs in the host's time domain, i.e. the delay may not be
>> accurate and could wait too little or too long.  Sidestep the headache
>> of shifting time domains by delaying 1ns at a time and breaking the loop
>> when the guest's desired timer delay has been satisfied.  Because the
>> advancement, which caps the delay to avoid unbounded busy waiting, is
>> stored in nanoseconds, the current advancement time can simply be used
>> as a loop terminator since we're delaying 1ns at a time (plus the few
>> cycles of overhead for running the code).
> 
> A 1 nanosecond advance (3-5 clock cycles) is even shorter than the time
> taken to execute kvm_read_l1_tsc.  I would just replace it with
> cpu_relax(); I can do the change when applying.
> 
> Paolo

Paolo, there is also another approach Sean and I were discussing.
I think it is more elegant.
See previous comments in this thread.

-Liran

> 
>> Cc: Liran Alon <liran.alon@oracle.com>
>> Cc: Wanpeng Li <wanpengli@tencent.com>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> ---
>> arch/x86/kvm/lapic.c | 13 +++++++------
>> 1 file changed, 7 insertions(+), 6 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 92446cba9b24..e797e3145a8b 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
>> void wait_lapic_expire(struct kvm_vcpu *vcpu)
>> {
>> 	struct kvm_lapic *apic = vcpu->arch.apic;
>> -	u64 guest_tsc, tsc_deadline, ns;
>> +	u32 timer_advance_ns = lapic_timer_advance_ns;
>> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
>> 
>> 	if (!lapic_in_kernel(vcpu))
>> 		return;
>> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>> 
>> 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
>> 	apic->lapic_timer.expired_tscdeadline = 0;
>> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>> 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>> 
>> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
>> -	if (guest_tsc < tsc_deadline)
>> -		__delay(min(tsc_deadline - guest_tsc,
>> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
>> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
>> +		ndelay(1);
>> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>> +	}
>> 
>> 	if (!lapic_timer_advance_adjust_done) {
>> 		/* too early */
>> 
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire"
  2019-04-16 11:04     ` Liran Alon
@ 2019-04-16 11:09       ` Paolo Bonzini
  0 siblings, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2019-04-16 11:09 UTC (permalink / raw)
  To: Liran Alon
  Cc: Sean Christopherson, Radim Krčmář, kvm, Wanpeng Li

On 16/04/19 13:04, Liran Alon wrote:
>> A 1 nanosecond advance (3-5 clock cycles) is even shorter than the time
>> taken to execute kvm_read_l1_tsc.  I would just replace it with
>> cpu_relax(); I can do the change when applying.
>
> Paolo, there is also another approach Sean and I were discussing.
> I think it is more elegant.
> See previous comments in this thread.

Yep, I've seen it now.

Paolo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/7] KVM: lapic: Clean up the code for handling of a pre-expired hv_timer
  2019-04-14 12:15   ` Liran Alon
  2019-04-15 16:32     ` Sean Christopherson
@ 2019-04-16 11:14     ` Paolo Bonzini
  1 sibling, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2019-04-16 11:14 UTC (permalink / raw)
  To: Liran Alon, Sean Christopherson
  Cc: Radim Krčmář, kvm, Wanpeng Li

On 14/04/19 14:15, Liran Alon wrote:
> In addition, when start_hv_timer() returns false,
> restart_apic_timer() just calls start_sw_timer() which use hrtimer
> instead of VMX preemption timer. Therefore, it seems a bit
> ineffective to me for start_hv_timer() to return false in case
> ktimer->pending or when ktimer->tscdeadline already expired. 
> Shouldn’t we return true in these cases?

Since start_hv_timer is only called from restart_apic_timer, I suggest
doing that check in restart_apic_timer itself.

Paolo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/7] KVM: lapic: Clean up the code for handling of a pre-expired hv_timer
  2019-04-15 17:25       ` Liran Alon
@ 2019-04-16 16:39         ` Sean Christopherson
  2019-04-16 16:48           ` Liran Alon
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2019-04-16 16:39 UTC (permalink / raw)
  To: Liran Alon; +Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li

On Mon, Apr 15, 2019 at 08:25:48PM +0300, Liran Alon wrote:
> 
> 
> > On 15 Apr 2019, at 19:32, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > 
> > On Sun, Apr 14, 2019 at 03:15:41PM +0300, Liran Alon wrote:
> >> 
> >> 
> >>> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> >>> 
> >>> Calling apic_timer_expired() is a nop when a timer interrupt is already
> >>> pending, i.e. there's no need to call apic_timer_expired() when there's
> >>> a pending interrupt and the hv_timer wants to pend its own interrupt.
> >>> Separate the two flows to make the code more readable and to avoid an
> >>> unnecessary function call and read to ktimer->pending.
> >> 
> >> In case timer is not periodic and r==1, atomic_read(&ktimer->pending) is not executed.
> >> 
> >>> 
> >>> Cc: Wanpeng Li <wanpengli@tencent.com>
> >>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> >>> ---
> >>> arch/x86/kvm/lapic.c | 9 ++++++---
> >>> 1 file changed, 6 insertions(+), 3 deletions(-)
> >>> 
> >>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >>> index 1d649a2af04c..f0be6f148a47 100644
> >>> --- a/arch/x86/kvm/lapic.c
> >>> +++ b/arch/x86/kvm/lapic.c
> >>> @@ -1703,9 +1703,12 @@ static bool start_hv_timer(struct kvm_lapic *apic)
> >>> 	 * the window.  For periodic timer, leave the hv timer running for
> >>> 	 * simplicity, and the deadline will be recomputed on the next vmexit.
> >>> 	 */
> >>> -	if (!apic_lvtt_period(apic) && (r || atomic_read(&ktimer->pending))) {
> >>> -		if (r)
> >>> -			apic_timer_expired(apic);
> >>> +	if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
> >>> +		return false;
> >>> +
> >>> +	/* set_hv_timer() returns '1' when the timer has already expired. */
> >>> +	if (r) {
> >>> +		apic_timer_expired(apic);
> >>> 		return false;
> >>> 	}
> >>> 
> >>> -- 
> >>> 2.21.0
> >>> 
> >> 
> >> First, I think you should emphasise in commit message that you have actually
> >> fixed a rare bug here.  In case timer is periodic but given
> >> ktimer->tscdeadline has already expired on host, we should call
> >> apic_timer_expired().
> > 
> > Heh, I actually didn't even catch that bug, I was simply cleaning up the
> > code because I had a hard time following the logic.
> 
> LOL. So you can put me in the Reported-by tag :P

Actually, thinking about this more, I believe the original behavior was
correct, if poorly documented.  More info below.

> >> In addition, when start_hv_timer() returns false, restart_apic_timer() just
> >> calls start_sw_timer() which use hrtimer instead of VMX preemption timer.
> >> Therefore, it seems a bit ineffective to me for start_hv_timer() to return
> >> false in case ktimer->pending or when ktimer->tscdeadline already expired.
> >> Shouldn’t we return true in these cases?
> > 
> > That also seemed weird to me.  Again, I had a hell of a time following the
> > intended logic and didn't want to break anything.  AFAICT, the motivation
> > for calling start_sw_timer() is to cancel the HV timer, and possibly to
> > ensure start_sw_period() is called when necessary.
> 
> I think the motivation is that if there is any reason why hardware
> accelerated timer (i.e. VMX preemption timer), can't be used to emulate the
> LAPIC timer, then utilise a software hrtimer based implementation instead.

My comment was regarding why start_hv_timer() returns was when the hv_timer
as already expired.

> This does align with why we return false when (!kvm_x86_ops->set_hv_timer) or
> (kvm_x86_ops->set_hv_timer() < 0).  However, this doesn’t align in case we
> have a (non-periodic timer and ktimer->pending) OR ktimer->tscdeadline
> already expired OR (!ktimer->tscdeadline).
> 
> In fact, note that start_sw_timer() early-exit when non-periodic timer and
> ktimer->pending… Same is also true for start_sw_tscdeadline() early-exit when
> (!ktimer->tscdeadline).
> 
> > But the latter will be
> > handled by virtue of checking "r" after apic_lvtt_period(), so this?
> > 
> > 	if (r) {
> > 		apic_timer_expired(apic);
> > 		ktimer->hv_timer_in_use = false;
> > 		return true;
> > 	}
> 
> I think I will just submit a patch to fix all the above examples I made as
> this just seems wrong to me.  Unless you find something I have missed. :P

When the timer is periodic, we're relying on the timer handler to invoke
advance_periodic_target_expiration() by way of kvm_lapic_expired_hv_timer().
That's why the original code only checks @r if apic_lvtt_period()==false,
i.e. to actually trigger a VMX preemption timer VM-Exit.  Note that the
return from set_hv_timer() is essentially a hint, e.g. VMX is perfectly
fine programming a preemption timer with a value of zero.

I think Paolo's suggestion of moving the logic up into restart_apic_timer()
is the way to go as it reduces the multiplexing down on start_hv_timer()'s
return value.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/7] KVM: lapic: Clean up the code for handling of a pre-expired hv_timer
  2019-04-16 16:39         ` Sean Christopherson
@ 2019-04-16 16:48           ` Liran Alon
  2019-04-16 17:27             ` Sean Christopherson
  0 siblings, 1 reply; 29+ messages in thread
From: Liran Alon @ 2019-04-16 16:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 16 Apr 2019, at 19:39, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Mon, Apr 15, 2019 at 08:25:48PM +0300, Liran Alon wrote:
>> 
>> 
>>> On 15 Apr 2019, at 19:32, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>>> 
>>> On Sun, Apr 14, 2019 at 03:15:41PM +0300, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>>>>> 
>>>>> Calling apic_timer_expired() is a nop when a timer interrupt is already
>>>>> pending, i.e. there's no need to call apic_timer_expired() when there's
>>>>> a pending interrupt and the hv_timer wants to pend its own interrupt.
>>>>> Separate the two flows to make the code more readable and to avoid an
>>>>> unnecessary function call and read to ktimer->pending.
>>>> 
>>>> In case timer is not periodic and r==1, atomic_read(&ktimer->pending) is not executed.
>>>> 
>>>>> 
>>>>> Cc: Wanpeng Li <wanpengli@tencent.com>
>>>>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>>>>> ---
>>>>> arch/x86/kvm/lapic.c | 9 ++++++---
>>>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>>> 
>>>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>>>> index 1d649a2af04c..f0be6f148a47 100644
>>>>> --- a/arch/x86/kvm/lapic.c
>>>>> +++ b/arch/x86/kvm/lapic.c
>>>>> @@ -1703,9 +1703,12 @@ static bool start_hv_timer(struct kvm_lapic *apic)
>>>>> 	 * the window.  For periodic timer, leave the hv timer running for
>>>>> 	 * simplicity, and the deadline will be recomputed on the next vmexit.
>>>>> 	 */
>>>>> -	if (!apic_lvtt_period(apic) && (r || atomic_read(&ktimer->pending))) {
>>>>> -		if (r)
>>>>> -			apic_timer_expired(apic);
>>>>> +	if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
>>>>> +		return false;
>>>>> +
>>>>> +	/* set_hv_timer() returns '1' when the timer has already expired. */
>>>>> +	if (r) {
>>>>> +		apic_timer_expired(apic);
>>>>> 		return false;
>>>>> 	}
>>>>> 
>>>>> -- 
>>>>> 2.21.0
>>>>> 
>>>> 
>>>> First, I think you should emphasise in commit message that you have actually
>>>> fixed a rare bug here.  In case timer is periodic but given
>>>> ktimer->tscdeadline has already expired on host, we should call
>>>> apic_timer_expired().
>>> 
>>> Heh, I actually didn't even catch that bug, I was simply cleaning up the
>>> code because I had a hard time following the logic.
>> 
>> LOL. So you can put me in the Reported-by tag :P
> 
> Actually, thinking about this more, I believe the original behavior was
> correct, if poorly documented.  More info below.
> 
>>>> In addition, when start_hv_timer() returns false, restart_apic_timer() just
>>>> calls start_sw_timer() which use hrtimer instead of VMX preemption timer.
>>>> Therefore, it seems a bit ineffective to me for start_hv_timer() to return
>>>> false in case ktimer->pending or when ktimer->tscdeadline already expired.
>>>> Shouldn’t we return true in these cases?
>>> 
>>> That also seemed weird to me.  Again, I had a hell of a time following the
>>> intended logic and didn't want to break anything.  AFAICT, the motivation
>>> for calling start_sw_timer() is to cancel the HV timer, and possibly to
>>> ensure start_sw_period() is called when necessary.
>> 
>> I think the motivation is that if there is any reason why hardware
>> accelerated timer (i.e. VMX preemption timer), can't be used to emulate the
>> LAPIC timer, then utilise a software hrtimer based implementation instead.
> 
> My comment was regarding why start_hv_timer() returns was when the hv_timer
> as already expired.
> 
>> This does align with why we return false when (!kvm_x86_ops->set_hv_timer) or
>> (kvm_x86_ops->set_hv_timer() < 0).  However, this doesn’t align in case we
>> have a (non-periodic timer and ktimer->pending) OR ktimer->tscdeadline
>> already expired OR (!ktimer->tscdeadline).
>> 
>> In fact, note that start_sw_timer() early-exit when non-periodic timer and
>> ktimer->pending… Same is also true for start_sw_tscdeadline() early-exit when
>> (!ktimer->tscdeadline).
>> 
>>> But the latter will be
>>> handled by virtue of checking "r" after apic_lvtt_period(), so this?
>>> 
>>> 	if (r) {
>>> 		apic_timer_expired(apic);
>>> 		ktimer->hv_timer_in_use = false;
>>> 		return true;
>>> 	}
>> 
>> I think I will just submit a patch to fix all the above examples I made as
>> this just seems wrong to me.  Unless you find something I have missed. :P
> 
> When the timer is periodic, we're relying on the timer handler to invoke
> advance_periodic_target_expiration() by way of kvm_lapic_expired_hv_timer().
> That's why the original code only checks @r if apic_lvtt_period()==false,
> i.e. to actually trigger a VMX preemption timer VM-Exit.  Note that the
> return from set_hv_timer() is essentially a hint, e.g. VMX is perfectly
> fine programming a preemption timer with a value of zero.

Yes I understood that already.
I don’t think it contradicts the fact that the checks I mentioned above should be moved out of start_hv_timer().

> 
> I think Paolo's suggestion of moving the logic up into restart_apic_timer()
> is the way to go as it reduces the multiplexing down on start_hv_timer()'s
> return value.
> 

Yes I agree. I plan to do so.

-Liran




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/7] KVM: lapic: Clean up the code for handling of a pre-expired hv_timer
  2019-04-16 16:48           ` Liran Alon
@ 2019-04-16 17:27             ` Sean Christopherson
  2019-04-16 17:27               ` Liran Alon
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2019-04-16 17:27 UTC (permalink / raw)
  To: Liran Alon; +Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li

On Tue, Apr 16, 2019 at 07:48:42PM +0300, Liran Alon wrote:
 Yes I understood that already.
> I don’t think it contradicts the fact that the checks I mentioned above
> should be moved out of start_hv_timer().

Ah, I misread your comment.  I think.  My head is spinning trying to
work through this code.

> > I think Paolo's suggestion of moving the logic up into restart_apic_timer()
> > is the way to go as it reduces the multiplexing down on start_hv_timer()'s
> > return value.
> > 
> 
> Yes I agree. I plan to do so.

And because I misread your comment, I already created this patch.  I'll
send it out shortly.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/7] KVM: lapic: Clean up the code for handling of a pre-expired hv_timer
  2019-04-16 17:27             ` Sean Christopherson
@ 2019-04-16 17:27               ` Liran Alon
  0 siblings, 0 replies; 29+ messages in thread
From: Liran Alon @ 2019-04-16 17:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Wanpeng Li



> On 16 Apr 2019, at 20:27, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Tue, Apr 16, 2019 at 07:48:42PM +0300, Liran Alon wrote:
> Yes I understood that already.
>> I don’t think it contradicts the fact that the checks I mentioned above
>> should be moved out of start_hv_timer().
> 
> Ah, I misread your comment.  I think.  My head is spinning trying to
> work through this code.
> 
>>> I think Paolo's suggestion of moving the logic up into restart_apic_timer()
>>> is the way to go as it reduces the multiplexing down on start_hv_timer()'s
>>> return value.
>>> 
>> 
>> Yes I agree. I plan to do so.
> 
> And because I misread your comment, I already created this patch.  I'll
> send it out shortly.

LOL OK. :)
Be my guest.

-Liran

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2019-04-16 17:31 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-12 20:18 [PATCH 0/7] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
2019-04-12 20:18 ` [PATCH 1/7] KVM: lapic: Hard cap the auto-calculated timer advancement Sean Christopherson
2019-04-14 10:22   ` Liran Alon
2019-04-12 20:18 ` [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire" Sean Christopherson
2019-04-14 11:25   ` Liran Alon
2019-04-15 16:11     ` Sean Christopherson
2019-04-15 17:06       ` Liran Alon
2019-04-16 11:02   ` Paolo Bonzini
2019-04-16 11:04     ` Liran Alon
2019-04-16 11:09       ` Paolo Bonzini
2019-04-12 20:18 ` [PATCH 3/7] KVM: lapic: Track lapic timer advance per vCPU Sean Christopherson
2019-04-14 11:29   ` Liran Alon
2019-04-12 20:18 ` [PATCH 4/7] KVM: lapic: Allow user to override auto-tuning of timer advancement Sean Christopherson
2019-04-14 11:35   ` Liran Alon
2019-04-15 16:23     ` Sean Christopherson
2019-04-15 17:10       ` Liran Alon
2019-04-12 20:18 ` [PATCH 5/7] KVM: lapic: Busy wait for timer to expire when using hv_timer Sean Christopherson
2019-04-14 11:47   ` Liran Alon
2019-04-12 20:18 ` [PATCH 6/7] KVM: lapic: Clean up the code for handling of a pre-expired hv_timer Sean Christopherson
2019-04-14 12:15   ` Liran Alon
2019-04-15 16:32     ` Sean Christopherson
2019-04-15 17:25       ` Liran Alon
2019-04-16 16:39         ` Sean Christopherson
2019-04-16 16:48           ` Liran Alon
2019-04-16 17:27             ` Sean Christopherson
2019-04-16 17:27               ` Liran Alon
2019-04-16 11:14     ` Paolo Bonzini
2019-04-12 20:18 ` [PATCH 7/7] KVM: VMX: Skip delta_tsc shift-and-divide if the dividend is zero Sean Christopherson
2019-04-14 12:21   ` Liran Alon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.