kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] kvm: lapic: add module parameters for LAPIC_TIMER_ADVANCE_ADJUST_MAX/MIN
@ 2021-03-03  2:09 lihaiwei.kernel
  2021-03-03  2:39 ` Haiwei Li
  0 siblings, 1 reply; 6+ messages in thread
From: lihaiwei.kernel @ 2021-03-03  2:09 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, wanpengli, jmattson, joro, Haiwei Li

From: Haiwei Li <lihaiwei@tencent.com>

In my test environment, advance_expire_delta is frequently greater than
the fixed LAPIC_TIMER_ADVANCE_ADJUST_MAX. And this will hinder the
adjustment.

Adding module parameters for LAPIC_TIMER_ADVANCE_ADJUST_MAX/MIN, so they
can be dynamically adjusted.

Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
---
 arch/x86/kvm/lapic.c | 6 ++----
 arch/x86/kvm/x86.c   | 8 ++++++++
 arch/x86/kvm/x86.h   | 3 +++
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 45d40bf..730c657 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -61,8 +61,6 @@
 #define APIC_VECTORS_PER_REG		32
 
 static bool lapic_timer_advance_dynamic __read_mostly;
-#define LAPIC_TIMER_ADVANCE_ADJUST_MIN	100	/* clock cycles */
-#define LAPIC_TIMER_ADVANCE_ADJUST_MAX	10000	/* clock cycles */
 #define LAPIC_TIMER_ADVANCE_NS_INIT	1000
 #define LAPIC_TIMER_ADVANCE_NS_MAX     5000
 /* step-by-step approximation to mitigate fluctuation */
@@ -1563,8 +1561,8 @@ static inline void adjust_lapic_timer_advance(struct kvm_vcpu *vcpu,
 	u64 ns;
 
 	/* Do not adjust for tiny fluctuations or large random spikes. */
-	if (abs(advance_expire_delta) > LAPIC_TIMER_ADVANCE_ADJUST_MAX ||
-	    abs(advance_expire_delta) < LAPIC_TIMER_ADVANCE_ADJUST_MIN)
+	if (abs(advance_expire_delta) > lapic_timer_advance_adjust_cycles_max ||
+	    abs(advance_expire_delta) < lapic_timer_advance_adjust_cycles_min)
 		return;
 
 	/* too early */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 828de7d..3bd8d19 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -176,6 +176,14 @@
 int __read_mostly pi_inject_timer = -1;
 module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
 
+u64 __read_mostly lapic_timer_advance_adjust_cycles_max = 10000;
+module_param(lapic_timer_advance_adjust_cycles_max, ullong, S_IRUGO | S_IWUSR);
+EXPORT_SYMBOL_GPL(lapic_timer_advance_adjust_cycles_max);
+
+u64 __read_mostly lapic_timer_advance_adjust_cycles_min = 100;
+module_param(lapic_timer_advance_adjust_cycles_min, ullong, S_IRUGO | S_IWUSR);
+EXPORT_SYMBOL_GPL(lapic_timer_advance_adjust_cycles_min);
+
 /*
  * Restoring the host value for MSRs that are only consumed when running in
  * usermode, e.g. SYSCALL MSRs and TSC_AUX, can be deferred until the CPU
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index ee6e010..3f7eca0 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -305,6 +305,9 @@ static inline bool kvm_mpx_supported(void)
 
 extern int pi_inject_timer;
 
+extern u64 lapic_timer_advance_adjust_cycles_max;
+extern u64 lapic_timer_advance_adjust_cycles_min;
+
 extern struct static_key kvm_no_apic_vcpu;
 
 extern bool report_ignored_msrs;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] kvm: lapic: add module parameters for LAPIC_TIMER_ADVANCE_ADJUST_MAX/MIN
  2021-03-03  2:09 [PATCH] kvm: lapic: add module parameters for LAPIC_TIMER_ADVANCE_ADJUST_MAX/MIN lihaiwei.kernel
@ 2021-03-03  2:39 ` Haiwei Li
  2021-03-09 23:42   ` Sean Christopherson
  0 siblings, 1 reply; 6+ messages in thread
From: Haiwei Li @ 2021-03-03  2:39 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, wanpengli, jmattson, joro, Haiwei Li

On 21/3/3 10:09, lihaiwei.kernel@gmail.com wrote:
> From: Haiwei Li <lihaiwei@tencent.com>
> 
> In my test environment, advance_expire_delta is frequently greater than
> the fixed LAPIC_TIMER_ADVANCE_ADJUST_MAX. And this will hinder the
> adjustment.

Supplementary details:

I have tried to backport timer related features to our production
kernel.

After completed, i found that advance_expire_delta is frequently greater
than the fixed value. It's necessary to trun the fixed to dynamically
values.

-- 
Haiwei Li

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] kvm: lapic: add module parameters for LAPIC_TIMER_ADVANCE_ADJUST_MAX/MIN
  2021-03-03  2:39 ` Haiwei Li
@ 2021-03-09 23:42   ` Sean Christopherson
  2021-03-10  9:15     ` Haiwei Li
  0 siblings, 1 reply; 6+ messages in thread
From: Sean Christopherson @ 2021-03-09 23:42 UTC (permalink / raw)
  To: Haiwei Li
  Cc: linux-kernel, kvm, pbonzini, vkuznets, wanpengli, jmattson, joro,
	Haiwei Li

On Wed, Mar 03, 2021, Haiwei Li wrote:
> On 21/3/3 10:09, lihaiwei.kernel@gmail.com wrote:
> > From: Haiwei Li <lihaiwei@tencent.com>
> > 
> > In my test environment, advance_expire_delta is frequently greater than
> > the fixed LAPIC_TIMER_ADVANCE_ADJUST_MAX. And this will hinder the
> > adjustment.
> 
> Supplementary details:
> 
> I have tried to backport timer related features to our production
> kernel.
> 
> After completed, i found that advance_expire_delta is frequently greater
> than the fixed value. It's necessary to trun the fixed to dynamically
> values.

Does this reproduce on an upstream kernel?  If so...

  1. How much over the 10k cycle limit is the delta?
  2. Any idea what causes the large delta?  E.g. is there something that can
     and/or should be fixed elsewhere?
  3. Is it platform/CPU specific?

Ideally, KVM would play nice with "all" environments by default without forcing
the admin to hand-tune things.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] kvm: lapic: add module parameters for LAPIC_TIMER_ADVANCE_ADJUST_MAX/MIN
  2021-03-09 23:42   ` Sean Christopherson
@ 2021-03-10  9:15     ` Haiwei Li
  2021-03-13  0:58       ` Sean Christopherson
  0 siblings, 1 reply; 6+ messages in thread
From: Haiwei Li @ 2021-03-10  9:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Haiwei Li

On Wed, Mar 10, 2021 at 7:42 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Mar 03, 2021, Haiwei Li wrote:
> > On 21/3/3 10:09, lihaiwei.kernel@gmail.com wrote:
> > > From: Haiwei Li <lihaiwei@tencent.com>
> > >
> > > In my test environment, advance_expire_delta is frequently greater than
> > > the fixed LAPIC_TIMER_ADVANCE_ADJUST_MAX. And this will hinder the
> > > adjustment.
> >
> > Supplementary details:
> >
> > I have tried to backport timer related features to our production
> > kernel.
> >
> > After completed, i found that advance_expire_delta is frequently greater
> > than the fixed value. It's necessary to trun the fixed to dynamically
> > values.
>
> Does this reproduce on an upstream kernel?  If so...
>
>   1. How much over the 10k cycle limit is the delta?
>   2. Any idea what causes the large delta?  E.g. is there something that can
>      and/or should be fixed elsewhere?
>   3. Is it platform/CPU specific?

Hi, Sean

I have traced the flow on our production kernel and it frequently consumes more
than 10K cycles from sched_out to sched_in.
So two scenarios tested on Cascade lake Server(96 pcpu), v5.11 kernel.

1. only cyclictest in guest(88 vcpu and bound with isolated pcpus, w/o mwait
exposed, adaptive advance lapic timer is default -1). The ratio of occurrences:

greater_than_10k/total: 29/2060, 1.41%

2. cyclictest in guest(88 vcpu and not bound, w/o mwait exposed, adaptive
advance lapic timer is default -1) and stress in host(no isolate). The ratio of
occurrences:

greater_than_10k/total: 122381/1017363, 12.03%

-- 
Haiwei Li

>
> Ideally, KVM would play nice with "all" environments by default without forcing
> the admin to hand-tune things.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] kvm: lapic: add module parameters for LAPIC_TIMER_ADVANCE_ADJUST_MAX/MIN
  2021-03-10  9:15     ` Haiwei Li
@ 2021-03-13  0:58       ` Sean Christopherson
  2021-03-13  1:31         ` Haiwei Li
  0 siblings, 1 reply; 6+ messages in thread
From: Sean Christopherson @ 2021-03-13  0:58 UTC (permalink / raw)
  To: Haiwei Li
  Cc: linux-kernel, kvm, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Haiwei Li

On Wed, Mar 10, 2021, Haiwei Li wrote:
> On Wed, Mar 10, 2021 at 7:42 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Mar 03, 2021, Haiwei Li wrote:
> > > On 21/3/3 10:09, lihaiwei.kernel@gmail.com wrote:
> > > > From: Haiwei Li <lihaiwei@tencent.com>
> > > >
> > > > In my test environment, advance_expire_delta is frequently greater than
> > > > the fixed LAPIC_TIMER_ADVANCE_ADJUST_MAX. And this will hinder the
> > > > adjustment.
> > >
> > > Supplementary details:
> > >
> > > I have tried to backport timer related features to our production
> > > kernel.
> > >
> > > After completed, i found that advance_expire_delta is frequently greater
> > > than the fixed value. It's necessary to trun the fixed to dynamically
> > > values.
> >
> > Does this reproduce on an upstream kernel?  If so...
> >
> >   1. How much over the 10k cycle limit is the delta?
> >   2. Any idea what causes the large delta?  E.g. is there something that can
> >      and/or should be fixed elsewhere?
> >   3. Is it platform/CPU specific?
> 
> Hi, Sean
> 
> I have traced the flow on our production kernel and it frequently consumes more
> than 10K cycles from sched_out to sched_in.
> So two scenarios tested on Cascade lake Server(96 pcpu), v5.11 kernel.
> 
> 1. only cyclictest in guest(88 vcpu and bound with isolated pcpus, w/o mwait
> exposed, adaptive advance lapic timer is default -1). The ratio of occurrences:
> 
> greater_than_10k/total: 29/2060, 1.41%
> 
> 2. cyclictest in guest(88 vcpu and not bound, w/o mwait exposed, adaptive
> advance lapic timer is default -1) and stress in host(no isolate). The ratio of
> occurrences:
> 
> greater_than_10k/total: 122381/1017363, 12.03%

Hmm, I'm inclined to say this is working as intended.  If the vCPU isn't affined
and/or it's getting preempted, then large spikes are expected, and not adjusting
in reaction to those spikes is desirable.  E.g. adjusting by 20k cycles because
the timer happened to expire while a vCPU was preempted will cause KVM to busy
wait for quite a long time if the next timer runs without interference, and then
KVM will thrash the advancement.

And I don't really see the point in pushing the max adjustment beyond 10k.  The
max _advancement_ is 5000ns, which means that even with a blazing fast 5.0ghz
system, a max adjustment of 1250 (10k/ 8, the step divisor) should get KVM to
the 25000 cycle advancement limit relatively quickly.  Since KVM resets to the
initial 1000ns advancement when it would exceed the 5000ns max, I suspect that
raising the max adjustment much beyond 10k cycles would quickly push a vCPU to
the max, cause it to reset, and rinse and repeat.

Note, we definitely don't want to raise the 5000ns max, as waiting with IRQs
disabled for any longer than that will likely cause system instability.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] kvm: lapic: add module parameters for LAPIC_TIMER_ADVANCE_ADJUST_MAX/MIN
  2021-03-13  0:58       ` Sean Christopherson
@ 2021-03-13  1:31         ` Haiwei Li
  0 siblings, 0 replies; 6+ messages in thread
From: Haiwei Li @ 2021-03-13  1:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Haiwei Li

On Sat, Mar 13, 2021 at 8:58 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Mar 10, 2021, Haiwei Li wrote:
> > On Wed, Mar 10, 2021 at 7:42 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Wed, Mar 03, 2021, Haiwei Li wrote:
> > > > On 21/3/3 10:09, lihaiwei.kernel@gmail.com wrote:
> > > > > From: Haiwei Li <lihaiwei@tencent.com>
> > > > >
> > > > > In my test environment, advance_expire_delta is frequently greater than
> > > > > the fixed LAPIC_TIMER_ADVANCE_ADJUST_MAX. And this will hinder the
> > > > > adjustment.
> > > >
> > > > Supplementary details:
> > > >
> > > > I have tried to backport timer related features to our production
> > > > kernel.
> > > >
> > > > After completed, i found that advance_expire_delta is frequently greater
> > > > than the fixed value. It's necessary to trun the fixed to dynamically
> > > > values.
> > >
> > > Does this reproduce on an upstream kernel?  If so...
> > >
> > >   1. How much over the 10k cycle limit is the delta?
> > >   2. Any idea what causes the large delta?  E.g. is there something that can
> > >      and/or should be fixed elsewhere?
> > >   3. Is it platform/CPU specific?
> >
> > Hi, Sean
> >
> > I have traced the flow on our production kernel and it frequently consumes more
> > than 10K cycles from sched_out to sched_in.
> > So two scenarios tested on Cascade lake Server(96 pcpu), v5.11 kernel.
> >
> > 1. only cyclictest in guest(88 vcpu and bound with isolated pcpus, w/o mwait
> > exposed, adaptive advance lapic timer is default -1). The ratio of occurrences:
> >
> > greater_than_10k/total: 29/2060, 1.41%
> >
> > 2. cyclictest in guest(88 vcpu and not bound, w/o mwait exposed, adaptive
> > advance lapic timer is default -1) and stress in host(no isolate). The ratio of
> > occurrences:
> >
> > greater_than_10k/total: 122381/1017363, 12.03%
>
> Hmm, I'm inclined to say this is working as intended.  If the vCPU isn't affined
> and/or it's getting preempted, then large spikes are expected, and not adjusting
> in reaction to those spikes is desirable.  E.g. adjusting by 20k cycles because
> the timer happened to expire while a vCPU was preempted will cause KVM to busy
> wait for quite a long time if the next timer runs without interference, and then
> KVM will thrash the advancement.
>
> And I don't really see the point in pushing the max adjustment beyond 10k.  The
> max _advancement_ is 5000ns, which means that even with a blazing fast 5.0ghz
> system, a max adjustment of 1250 (10k/ 8, the step divisor) should get KVM to
> the 25000 cycle advancement limit relatively quickly.  Since KVM resets to the
> initial 1000ns advancement when it would exceed the 5000ns max, I suspect that
> raising the max adjustment much beyond 10k cycles would quickly push a vCPU to
> the max, cause it to reset, and rinse and repeat.
>
> Note, we definitely don't want to raise the 5000ns max, as waiting with IRQs
> disabled for any longer than that will likely cause system instability.

I see. Thanks for your explanation.

--
Haiwei Li

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-03-13  1:32 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-03  2:09 [PATCH] kvm: lapic: add module parameters for LAPIC_TIMER_ADVANCE_ADJUST_MAX/MIN lihaiwei.kernel
2021-03-03  2:39 ` Haiwei Li
2021-03-09 23:42   ` Sean Christopherson
2021-03-10  9:15     ` Haiwei Li
2021-03-13  0:58       ` Sean Christopherson
2021-03-13  1:31         ` Haiwei Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).