[PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues
@ 2019-04-17 17:15 Sean Christopherson
  2019-04-17 17:15 ` [PATCH v4 1/4] KVM: lapic: Disable timer advancement if adaptive tuning goes haywire Sean Christopherson
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Sean Christopherson @ 2019-04-17 17:15 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář; +Cc: kvm, Liran Alon, Wanpeng Li

KVM's recently introduced adaptive tuning of lapic_timer_advance_ns has
several critical flaws:

  - The advancement is completely unbounded, e.g. there's nothing in the
    logic that prevents the advancement from creeping up to hundreds of
    milliseconds in case something goes awry with the guest's TSC.

  - TSC scaling is done on a per-vCPU basis, while the advancement value
    is global.  This issue is also present without adaptive tuning, but
    is now more pronounced.

  - Tuning the value concurrently on multiple CPUs can corrupt the
    advancement variable.

  - Userspace can't disable adaptive tuning.

Fix the above issues along with a theoretical bug found by inspenction,
where the wait_lapic_timer() delay could be inaccurate when the guest
TSC is running at a different ratio than the host's TSC.

v4:
 - Rebase to latest kvm/queue.
 - Kill the timer advancement if adaptive tuning goes above the
   abrbitrary threshold of 5000ns. [Paolo].
 - Revert back to disabling adaptive tuning via the existing param,
   except use '-1' as the "use adaptive tuning" indicator. [Paolo].
 - Do the conversion to nanoseconds when delaying iff the guest TSC
   is running at a different ratio, and move the patch to the end
   so that the other patches are not dependent on it since it's not
   100% clear (to me) that the conversion is correct/necessary.

v3:
 - https://patchwork.kernel.org/cover/10904163/
 - Split the refactoring of start_hv_timer() and ->set_hv_timer
   into three separate patches instead of attempting to do a big
   refactor in a single patch to fix three separate issues.
    - Explicitly cancel the hv timer to avoid
    - Use a param for "expired" instead of overloading the return
      value of ->set_hv_timer().
    - Check for a pending non-periodic in restart_apic_timer(). [Liran]
 - Add more Reviewed-by tags.

v2:
 - https://patchwork.kernel.org/cover/10903613/
 - Add explicit param to control automatic tuning. [Liran]
 - Document the effect of per-vCPU tracking on the module params.
 - Use fancy math to convert guest clockcycles to host nanoseconds
   instead of brute forcing the delay with a for loop. [Liran]
 - Refactor start_hv_timer()'s return semantics to move the "expired
   timer" handling up a level. [Liran and Paolo]
 - Add Liran's Reviewed-by tags.

v1: https://patchwork.kernel.org/cover/10899101/

Sean Christopherson (4):
  KVM: lapic: Disable timer advancement if adaptive tuning goes haywire
  KVM: lapic: Track lapic timer advance per vCPU
  KVM: lapic: Allow user to disable adaptive tuning of timer advancement
  KVM: lapic: Convert guest TSC to host time domain if necessary

 arch/x86/kvm/lapic.c   | 61 +++++++++++++++++++++++++++++++-----------
 arch/x86/kvm/lapic.h   |  4 ++-
 arch/x86/kvm/vmx/vmx.c |  4 ++-
 arch/x86/kvm/x86.c     | 14 ++++++----
 arch/x86/kvm/x86.h     |  2 --
 5 files changed, 61 insertions(+), 24 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v4 1/4] KVM: lapic: Disable timer advancement if adaptive tuning goes haywire
  2019-04-17 17:15 [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
@ 2019-04-17 17:15 ` Sean Christopherson
  2019-04-17 17:15 ` [PATCH v4 2/4] KVM: lapic: Track lapic timer advance per vCPU Sean Christopherson
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2019-04-17 17:15 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář; +Cc: kvm, Liran Alon, Wanpeng Li

To minimize the latency of timer interrupts as observed by the guest,
KVM adjusts the values it programs into the host timers to account for
the host's overhead of programming and handling the timer event.  Now
that the timer advancement is automatically tuned during runtime, it's
effectively unbounded by default, e.g. if KVM is running as L1 the
advancement can measure in hundreds of milliseconds.

Disable timer advancement if adaptive tuning yields an advancement of
more than 5000ns, as large advancements can break reasonable assumptions
of the guest, e.g. that a timer configured to fire after 1ms won't
arrive on the next instruction.  Although KVM busy waits to mitigate the
case of a timer event arriving too early, complications can arise when
shifting the interrupt too far, e.g. kvm-unit-test's vmx.interrupt test
will fail when its "host" exits on interrupts as KVM may inject the INTR
before the guest executes STI+HLT.   Arguably the unit test is "broken"
in the sense that delaying a timer interrupt by 1ms doesn't technically
guarantee the interrupt will arrive after STI+HLT, but it's a reasonable
assumption that KVM should support.

Furthermore, an unbounded advancement also effectively unbounds the time
spent busy waiting, e.g. if the guest programs a timer with a very large
delay.

5000ns is a somewhat arbitrary threshold.  When running on bare metal,
which is the intended use case, timer advancement is expected to be in
the general vicinity of 1000ns.  5000ns is high enough that false
positives are unlikely, while not being so high as to negatively affect
the host's performance/stability.

Note, a future patch will enable userspace to disable KVM's adaptive
tuning, which will allow priveleged userspace will to specifying an
advancement value in excess of this arbitrary threshold in order to
satisfy an abnormal use case.

Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 4e000712cb82..d19c4b955e93 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1522,6 +1522,10 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 		}
 		if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
 			lapic_timer_advance_adjust_done = true;
+		if (unlikely(lapic_timer_advance_ns > 5000)) {
+			lapic_timer_advance_ns = 0;
+			lapic_timer_advance_adjust_done = true;
+		}
 	}
 }

-- 
2.21.0

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 2/4] KVM: lapic: Track lapic timer advance per vCPU
  2019-04-17 17:15 [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
  2019-04-17 17:15 ` [PATCH v4 1/4] KVM: lapic: Disable timer advancement if adaptive tuning goes haywire Sean Christopherson
@ 2019-04-17 17:15 ` Sean Christopherson
  2019-04-17 17:15 ` [PATCH v4 3/4] KVM: lapic: Allow user to disable adaptive tuning of timer advancement Sean Christopherson
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2019-04-17 17:15 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář; +Cc: kvm, Liran Alon, Wanpeng Li

Automatically adjusting the globally-shared timer advancement could
corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting
the advancement value.  That could be partially fixed by using a local
variable for the arithmetic, but it would still be susceptible to a
race when setting timer_advance_adjust_done.

And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
correct calibration for a given vCPU may not apply to all vCPUs.

Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is
effectively violated when finding a stable advancement takes an extended
amount of timer.

Opportunistically change the definition of lapic_timer_advance_ns to
a u32 so that it matches the style of struct kvm_timer.  Explicitly
pass the param to kvm_create_lapic() so that it doesn't have to be
exposed to lapic.c, thus reducing the probability of unintentionally
using the global value instead of the per-vCPU value.

Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c   | 34 ++++++++++++++++++----------------
 arch/x86/kvm/lapic.h   |  4 +++-
 arch/x86/kvm/vmx/vmx.c |  4 +++-
 arch/x86/kvm/x86.c     |  7 +++----
 arch/x86/kvm/x86.h     |  2 --
 5 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index d19c4b955e93..860a4045eb3c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -70,7 +70,6 @@
 #define APIC_BROADCAST			0xFF
 #define X2APIC_BROADCAST		0xFFFFFFFFul
 
-static bool lapic_timer_advance_adjust_done = false;
 #define LAPIC_TIMER_ADVANCE_ADJUST_DONE 100
 /* step-by-step approximation to mitigate fluctuation */
 #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
@@ -1485,6 +1484,7 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
 void wait_lapic_expire(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
+	u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns;
 	u64 guest_tsc, tsc_deadline, ns;
 
 	if (!lapic_in_kernel(vcpu))
@@ -1504,34 +1504,36 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
 	if (guest_tsc < tsc_deadline)
 		__delay(min(tsc_deadline - guest_tsc,
-			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
+			nsec_to_cycles(vcpu, timer_advance_ns)));
 
-	if (!lapic_timer_advance_adjust_done) {
+	if (!apic->lapic_timer.timer_advance_adjust_done) {
 		/* too early */
 		if (guest_tsc < tsc_deadline) {
 			ns = (tsc_deadline - guest_tsc) * 1000000ULL;
 			do_div(ns, vcpu->arch.virtual_tsc_khz);
-			lapic_timer_advance_ns -= min((unsigned int)ns,
-				lapic_timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+			timer_advance_ns -= min((u32)ns,
+				timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
 		} else {
 		/* too late */
 			ns = (guest_tsc - tsc_deadline) * 1000000ULL;
 			do_div(ns, vcpu->arch.virtual_tsc_khz);
-			lapic_timer_advance_ns += min((unsigned int)ns,
-				lapic_timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+			timer_advance_ns += min((u32)ns,
+				timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
 		}
 		if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
-			lapic_timer_advance_adjust_done = true;
-		if (unlikely(lapic_timer_advance_ns > 5000)) {
-			lapic_timer_advance_ns = 0;
-			lapic_timer_advance_adjust_done = true;
+			apic->lapic_timer.timer_advance_adjust_done = true;
+		if (unlikely(timer_advance_ns > 5000)) {
+			timer_advance_ns = 0;
+			apic->lapic_timer.timer_advance_adjust_done = true;
 		}
+		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
 	}
 }
 
 static void start_sw_tscdeadline(struct kvm_lapic *apic)
 {
-	u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
+	struct kvm_timer *ktimer = &apic->lapic_timer;
+	u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
 	u64 ns = 0;
 	ktime_t expire;
 	struct kvm_vcpu *vcpu = apic->vcpu;
@@ -1550,9 +1552,8 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
 		ns = (tscdeadline - guest_tsc) * 1000000ULL;
 		do_div(ns, this_tsc_khz);
 		expire = ktime_add_ns(now, ns);
-		expire = ktime_sub_ns(expire, lapic_timer_advance_ns);
-		hrtimer_start(&apic->lapic_timer.timer,
-				expire, HRTIMER_MODE_ABS_PINNED);
+		expire = ktime_sub_ns(expire, ktimer->timer_advance_ns);
+		hrtimer_start(&ktimer->timer, expire, HRTIMER_MODE_ABS_PINNED);
 	} else
 		apic_timer_expired(apic);
 
@@ -2269,7 +2270,7 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
 		return HRTIMER_NORESTART;
 }
 
-int kvm_create_lapic(struct kvm_vcpu *vcpu)
+int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
 {
 	struct kvm_lapic *apic;
 
@@ -2293,6 +2294,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
 	hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
 		     HRTIMER_MODE_ABS_PINNED);
 	apic->lapic_timer.timer.function = apic_timer_fn;
+	apic->lapic_timer.timer_advance_ns = timer_advance_ns;
 
 	/*
 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index ff6ef9c3d760..3e97f8a68967 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -31,8 +31,10 @@ struct kvm_timer {
 	u32 timer_mode_mask;
 	u64 tscdeadline;
 	u64 expired_tscdeadline;
+	u32 timer_advance_ns;
 	atomic_t pending;			/* accumulated triggered timers */
 	bool hv_timer_in_use;
+	bool timer_advance_adjust_done;
 };
 
 struct kvm_lapic {
@@ -62,7 +64,7 @@ struct kvm_lapic {
 
 struct dest_map;
 
-int kvm_create_lapic(struct kvm_vcpu *vcpu);
+int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns);
 void kvm_free_lapic(struct kvm_vcpu *vcpu);
 
 int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d8f101b58ab8..4ac1f7429986 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7046,6 +7046,7 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 {
 	struct vcpu_vmx *vmx;
 	u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
+	struct kvm_timer *ktimer = &vcpu->arch.apic->lapic_timer;
 
 	if (kvm_mwait_in_guest(vcpu->kvm))
 		return -EOPNOTSUPP;
@@ -7054,7 +7055,8 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 	tscl = rdtsc();
 	guest_tscl = kvm_read_l1_tsc(vcpu, tscl);
 	delta_tsc = max(guest_deadline_tsc, guest_tscl) - guest_tscl;
-	lapic_timer_advance_cycles = nsec_to_cycles(vcpu, lapic_timer_advance_ns);
+	lapic_timer_advance_cycles = nsec_to_cycles(vcpu,
+						    ktimer->timer_advance_ns);
 
 	if (delta_tsc > lapic_timer_advance_cycles)
 		delta_tsc -= lapic_timer_advance_cycles;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c09507057743..40ca27870df2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -137,9 +137,8 @@ static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 
 /* lapic timer advance (tscdeadline mode only) in nanoseconds */
-unsigned int __read_mostly lapic_timer_advance_ns = 1000;
+static u32 __read_mostly lapic_timer_advance_ns = 1000;
 module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
-EXPORT_SYMBOL_GPL(lapic_timer_advance_ns);
 
 static bool __read_mostly vector_hashing = true;
 module_param(vector_hashing, bool, S_IRUGO);
@@ -7899,7 +7898,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	}
 
 	trace_kvm_entry(vcpu->vcpu_id);
-	if (lapic_timer_advance_ns)
+	if (vcpu->arch.apic->lapic_timer.timer_advance_ns)
 		wait_lapic_expire(vcpu);
 	guest_enter_irqoff();
 
@@ -9087,7 +9086,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 
 	if (irqchip_in_kernel(vcpu->kvm)) {
 		vcpu->arch.apicv_active = kvm_x86_ops->get_enable_apicv(vcpu);
-		r = kvm_create_lapic(vcpu);
+		r = kvm_create_lapic(vcpu, lapic_timer_advance_ns);
 		if (r < 0)
 			goto fail_mmu_destroy;
 	} else
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index eda954441213..a470ff0868c5 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -294,8 +294,6 @@ extern u64 kvm_supported_xcr0(void);
 
 extern unsigned int min_timer_period_us;
 
-extern unsigned int lapic_timer_advance_ns;
-
 extern bool enable_vmware_backdoor;
 
 extern struct static_key kvm_no_apic_vcpu;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 3/4] KVM: lapic: Allow user to disable adaptive tuning of timer advancement
  2019-04-17 17:15 [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
  2019-04-17 17:15 ` [PATCH v4 1/4] KVM: lapic: Disable timer advancement if adaptive tuning goes haywire Sean Christopherson
  2019-04-17 17:15 ` [PATCH v4 2/4] KVM: lapic: Track lapic timer advance per vCPU Sean Christopherson
@ 2019-04-17 17:15 ` Sean Christopherson
  2019-04-17 17:15 ` [PATCH v4 4/4] KVM: lapic: Convert guest TSC to host time domain if necessary Sean Christopherson
  2019-04-28  0:54 ` [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues Wanpeng Li
  4 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2019-04-17 17:15 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář; +Cc: kvm, Liran Alon, Wanpeng Li

The introduction of adaptive tuning of lapic timer advancement did not
allow for the scenario where userspace would want to disable adaptive
tuning but still employ timer advancement, e.g. for testing purposes or
to handle a use case where adaptive tuning is unable to settle on a
suitable time.  This is epecially pertinent now that KVM places a hard
threshold on the maximum advancment time.

Rework the timer semantics to accept signed values, with a value of '-1'
being interpreted as "use adaptive tuning with KVM's internal default",
and any other value being used as an explicit advancement time, e.g. a
time of '0' effectively disables advancement.

Note, this does not completely restore the original behavior of
lapic_timer_advance_ns.  Prior to tracking the advancement per vCPU,
which is necessary to support autotuning, userspace could adjust
lapic_timer_advance_ns for *running* vCPU.  With per-vCPU tracking, the
module params are snapshotted at vCPU creation, i.e. applying a new
advancement effectively requires restarting a VM.

Dynamically updating a running vCPU is possible, e.g. a helper could be
added to retrieve the desired delay, choosing between the global module
param and the per-VCPU value depending on whether or not auto-tuning is
(globally) enabled, but introduces a great deal of complexity.  The
wrapper itself is not complex, but understanding and documenting the
effects of dynamically toggling auto-tuning and/or adjusting the timer
advancement is nigh impossible since the behavior would be dependent on
KVM's implementation as well as compiler optimizations.  In other words,
providing stable behavior would require extremely careful consideration
now and in the future.

Given that the expected use of a manually-tuned timer advancement is to
"tune once, run many", use the vastly simpler approach of recognizing
changes to the module params only when creating a new vCPU.

Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 11 +++++++++--
 arch/x86/kvm/lapic.h |  2 +-
 arch/x86/kvm/x86.c   |  9 +++++++--
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 860a4045eb3c..2aec04d07617 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2270,7 +2270,7 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
 		return HRTIMER_NORESTART;
 }

-int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
+int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
 {
 	struct kvm_lapic *apic;

@@ -2294,7 +2294,14 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns)
 	hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
 		     HRTIMER_MODE_ABS_PINNED);
 	apic->lapic_timer.timer.function = apic_timer_fn;
-	apic->lapic_timer.timer_advance_ns = timer_advance_ns;
+	if (timer_advance_ns == -1) {
+		apic->lapic_timer.timer_advance_ns = 1000;
+		apic->lapic_timer.timer_advance_adjust_done = false;
+	} else {
+		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
+		apic->lapic_timer.timer_advance_adjust_done = true;
+	}
+

 	/*
 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 3e97f8a68967..d6d049ba3045 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -64,7 +64,7 @@ struct kvm_lapic {

 struct dest_map;

-int kvm_create_lapic(struct kvm_vcpu *vcpu, u32 timer_advance_ns);
+int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns);
 void kvm_free_lapic(struct kvm_vcpu *vcpu);

 int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 40ca27870df2..e2876c1f5a20 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -136,8 +136,13 @@ EXPORT_SYMBOL_GPL(kvm_default_tsc_scaling_ratio);
 static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);

-/* lapic timer advance (tscdeadline mode only) in nanoseconds */
-static u32 __read_mostly lapic_timer_advance_ns = 1000;
+/*
+ * lapic timer advance (tscdeadline mode only) in nanoseconds.  '-1' enables
+ * adaptive tuning starting from default advancment of 1000ns.  '0' disables
+ * advancement entirely.  Any other value is used as-is and disables adaptive
+ * tuning, i.e. allows priveleged userspace to set an exact advancement time.
+ */
+static int __read_mostly lapic_timer_advance_ns = -1;
 module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);

 static bool __read_mostly vector_hashing = true;
-- 
2.21.0

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 4/4] KVM: lapic: Convert guest TSC to host time domain if necessary
  2019-04-17 17:15 [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
                   ` (2 preceding siblings ...)
  2019-04-17 17:15 ` [PATCH v4 3/4] KVM: lapic: Allow user to disable adaptive tuning of timer advancement Sean Christopherson
@ 2019-04-17 17:15 ` Sean Christopherson
  2019-04-28  0:54 ` [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues Wanpeng Li
  4 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2019-04-17 17:15 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář; +Cc: kvm, Liran Alon, Wanpeng Li

To minimize the latency of timer interrupts as observed by the guest,
KVM adjusts the values it programs into the host timers to account for
the host's overhead of programming and handling the timer event.  In
the event that the adjustments are too aggressive, i.e. the timer fires
earlier than the guest expects, KVM busy waits immediately prior to
entering the guest.

Currently, KVM manually converts the delay from nanoseconds to clock
cycles.  But, the conversion is done in the guest's time domain, while
the delay occurs in the host's time domain.  This is perfectly ok when
the guest and host are using the same TSC ratio, but if the guest is
using a different ratio then the delay may not be accurate and could
wait too little or too long.

When the guest is not using the host's ratio, convert the delay from
guest clock cycles to host nanoseconds and use ndelay() instead of
__delay() to provide more accurate timing.  Because converting to
nanoseconds is relatively expensive, e.g. requires division and more
multiplication ops, continue using __delay() directly when guest and
host TSCs are running at the same ratio.

Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 2aec04d07617..270bd3cda2c1 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1481,6 +1481,26 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
 	return false;
 }

+static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
+{
+	u64 timer_advance_ns = vcpu->arch.apic->lapic_timer.timer_advance_ns;
+
+	/*
+	 * If the guest TSC is running at a different ratio than the host, then
+	 * convert the delay to nanoseconds to achieve an accurate delay.  Note
+	 * that __delay() uses delay_tsc whenever the hardware has TSC, thus
+	 * always for VMX enabled hardware.
+	 */
+	if (vcpu->arch.tsc_scaling_ratio == kvm_default_tsc_scaling_ratio) {
+		__delay(min(guest_cycles,
+			nsec_to_cycles(vcpu, timer_advance_ns)));
+	} else {
+		u64 delay_ns = guest_cycles * 1000000ULL;
+		do_div(delay_ns, vcpu->arch.virtual_tsc_khz);
+		ndelay(min_t(u32, delay_ns, timer_advance_ns));
+	}
+}
+
 void wait_lapic_expire(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
@@ -1501,10 +1521,8 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);

-	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
 	if (guest_tsc < tsc_deadline)
-		__delay(min(tsc_deadline - guest_tsc,
-			nsec_to_cycles(vcpu, timer_advance_ns)));
+		__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);

 	if (!apic->lapic_timer.timer_advance_adjust_done) {
 		/* too early */
-- 
2.21.0

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues
  2019-04-17 17:15 [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
                   ` (3 preceding siblings ...)
  2019-04-17 17:15 ` [PATCH v4 4/4] KVM: lapic: Convert guest TSC to host time domain if necessary Sean Christopherson
@ 2019-04-28  0:54 ` Wanpeng Li
  2019-04-30 19:31   ` Sean Christopherson
  4 siblings, 1 reply; 9+ messages in thread
From: Wanpeng Li @ 2019-04-28  0:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Liran Alon, Wanpeng Li

Hi Sean,
On Thu, 18 Apr 2019 at 01:18, Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> KVM's recently introduced adaptive tuning of lapic_timer_advance_ns has
> several critical flaws:
[.../...]
>
>   - TSC scaling is done on a per-vCPU basis, while the advancement value
>     is global.  This issue is also present without adaptive tuning, but
>     is now more pronounced.

Did you test this against overcommit scenario? Your per-vCPU variable
can be a large number(yeah, below your 5000ns) when neighbour VMs on
the same host consume cpu heavily, however, kvm will wast a lot of
time to wait when the neighbour VMs are idle. My original patch
evaluate the conservative hypervisor overhead when the first VM is
deployed on the host. It doesn't matter whether or not the VMs on this
host alter their workload behaviors later. Unless you tune the
per-vCPU variable always, however, I think it will introduce more
overhead. So Liran's patch "Consider LAPIC TSC-Deadline Timer expired
if deadline too short" also can't depend on this.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues
  2019-04-28  0:54 ` [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues Wanpeng Li
@ 2019-04-30 19:31   ` Sean Christopherson
  2019-05-05  0:43     ` Wanpeng Li
  0 siblings, 1 reply; 9+ messages in thread
From: Sean Christopherson @ 2019-04-30 19:31 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Liran Alon, Wanpeng Li

On Sun, Apr 28, 2019 at 08:54:30AM +0800, Wanpeng Li wrote:
> Hi Sean,
> On Thu, 18 Apr 2019 at 01:18, Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > KVM's recently introduced adaptive tuning of lapic_timer_advance_ns has
> > several critical flaws:
> [.../...]
> >
> >   - TSC scaling is done on a per-vCPU basis, while the advancement value
> >     is global.  This issue is also present without adaptive tuning, but
> >     is now more pronounced.
> 
> Did you test this against overcommit scenario? Your per-vCPU variable
> can be a large number(yeah, below your 5000ns) when neighbour VMs on
> the same host consume cpu heavily, however, kvm will wast a lot of
> time to wait when the neighbour VMs are idle. My original patch
> evaluate the conservative hypervisor overhead when the first VM is
> deployed on the host. It doesn't matter whether or not the VMs on this
> host alter their workload behaviors later. Unless you tune the
> per-vCPU variable always, however, I think it will introduce more
> overhead. So Liran's patch "Consider LAPIC TSC-Deadline Timer expired
> if deadline too short" also can't depend on this.

I didn't test it in overcommit scenarios.  I wasn't aware of how the
automatic adjustments were being used in real deployments.

The best option I can think of is to expose a vCPU's advance time to
userspace (not sure what mechanism would be best).  This would allow
userspace to run a single vCPU VM with auto-tuning enabled, snapshot
the final adjusted advancment, and then update KVM's parameter to set
an explicit advancement and effectively disable auto-tuning.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues
  2019-04-30 19:31   ` Sean Christopherson
@ 2019-05-05  0:43     ` Wanpeng Li
  2019-05-08 21:43       ` Sean Christopherson
  0 siblings, 1 reply; 9+ messages in thread
From: Wanpeng Li @ 2019-05-05  0:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Liran Alon, Wanpeng Li

On Wed, 1 May 2019 at 03:31, Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Sun, Apr 28, 2019 at 08:54:30AM +0800, Wanpeng Li wrote:
> > Hi Sean,
> > On Thu, 18 Apr 2019 at 01:18, Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > >
> > > KVM's recently introduced adaptive tuning of lapic_timer_advance_ns has
> > > several critical flaws:
> > [.../...]
> > >
> > >   - TSC scaling is done on a per-vCPU basis, while the advancement value
> > >     is global.  This issue is also present without adaptive tuning, but
> > >     is now more pronounced.
> >
> > Did you test this against overcommit scenario? Your per-vCPU variable
> > can be a large number(yeah, below your 5000ns) when neighbour VMs on
> > the same host consume cpu heavily, however, kvm will wast a lot of
> > time to wait when the neighbour VMs are idle. My original patch
> > evaluate the conservative hypervisor overhead when the first VM is
> > deployed on the host. It doesn't matter whether or not the VMs on this
> > host alter their workload behaviors later. Unless you tune the
> > per-vCPU variable always, however, I think it will introduce more
> > overhead. So Liran's patch "Consider LAPIC TSC-Deadline Timer expired
> > if deadline too short" also can't depend on this.
>
> I didn't test it in overcommit scenarios.  I wasn't aware of how the

I think it should be considered.

> automatic adjustments were being used in real deployments.
>
> The best option I can think of is to expose a vCPU's advance time to
> userspace (not sure what mechanism would be best).  This would allow
> userspace to run a single vCPU VM with auto-tuning enabled, snapshot
> the final adjusted advancment, and then update KVM's parameter to set
> an explicit advancement and effectively disable auto-tuning.

This step is too complex to deploy in real environment, the same as
w/o auto-tuning. My auto-tuning patch evaluates the conservative
hypervisor overhead when the first VM is deployed on the host, and
auto-tuning it only once for the whole machine.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues
  2019-05-05  0:43     ` Wanpeng Li
@ 2019-05-08 21:43       ` Sean Christopherson
  0 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2019-05-08 21:43 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Liran Alon, Wanpeng Li

On Sun, May 05, 2019 at 08:43:24AM +0800, Wanpeng Li wrote:
> On Wed, 1 May 2019 at 03:31, Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Sun, Apr 28, 2019 at 08:54:30AM +0800, Wanpeng Li wrote:
> > > Hi Sean,
> > > On Thu, 18 Apr 2019 at 01:18, Sean Christopherson
> > > <sean.j.christopherson@intel.com> wrote:
> > > >
> > > > KVM's recently introduced adaptive tuning of lapic_timer_advance_ns has
> > > > several critical flaws:
> > > [.../...]
> > > >
> > > >   - TSC scaling is done on a per-vCPU basis, while the advancement value
> > > >     is global.  This issue is also present without adaptive tuning, but
> > > >     is now more pronounced.
> > >
> > > Did you test this against overcommit scenario? Your per-vCPU variable
> > > can be a large number(yeah, below your 5000ns) when neighbour VMs on
> > > the same host consume cpu heavily, however, kvm will wast a lot of
> > > time to wait when the neighbour VMs are idle. My original patch
> > > evaluate the conservative hypervisor overhead when the first VM is
> > > deployed on the host. It doesn't matter whether or not the VMs on this
> > > host alter their workload behaviors later. Unless you tune the
> > > per-vCPU variable always, however, I think it will introduce more
> > > overhead. So Liran's patch "Consider LAPIC TSC-Deadline Timer expired
> > > if deadline too short" also can't depend on this.
> >
> > I didn't test it in overcommit scenarios.  I wasn't aware of how the
> 
> I think it should be considered.
> 
> > automatic adjustments were being used in real deployments.
> >
> > The best option I can think of is to expose a vCPU's advance time to
> > userspace (not sure what mechanism would be best).  This would allow
> > userspace to run a single vCPU VM with auto-tuning enabled, snapshot
> > the final adjusted advancment, and then update KVM's parameter to set
> > an explicit advancement and effectively disable auto-tuning.
> 
> This step is too complex to deploy in real environment, the same as
> w/o auto-tuning. My auto-tuning patch evaluates the conservative
> hypervisor overhead when the first VM is deployed on the host, and
> auto-tuning it only once for the whole machine.

But even then the advancement could be corrupted or wildly inaccurate
unless that first VM has a single vCPU.

I thought of an idea that will hopefully fix the overcommit scenario and
in general reduce the time spent auto-adjusting.  Patch incoming...

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-05-08 21:43 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-17 17:15 [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues Sean Christopherson
2019-04-17 17:15 ` [PATCH v4 1/4] KVM: lapic: Disable timer advancement if adaptive tuning goes haywire Sean Christopherson
2019-04-17 17:15 ` [PATCH v4 2/4] KVM: lapic: Track lapic timer advance per vCPU Sean Christopherson
2019-04-17 17:15 ` [PATCH v4 3/4] KVM: lapic: Allow user to disable adaptive tuning of timer advancement Sean Christopherson
2019-04-17 17:15 ` [PATCH v4 4/4] KVM: lapic: Convert guest TSC to host time domain if necessary Sean Christopherson
2019-04-28  0:54 ` [PATCH v4 0/4] KVM: lapic: Fix a variety of timer adv issues Wanpeng Li
2019-04-30 19:31   ` Sean Christopherson
2019-05-05  0:43     ` Wanpeng Li
2019-05-08 21:43       ` Sean Christopherson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.