* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-24 22:27 ` [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer Yunhong Jiang
@ 2016-05-24 23:11 ` David Matlack
2016-05-24 23:35 ` yunhong jiang
2016-05-25 10:40 ` Paolo Bonzini
2016-05-25 11:52 ` Paolo Bonzini
` (2 subsequent siblings)
3 siblings, 2 replies; 32+ messages in thread
From: David Matlack @ 2016-05-24 23:11 UTC (permalink / raw)
To: Yunhong Jiang
Cc: kvm list, Marcelo Tosatti, Radim Krčmář,
Paolo Bonzini, Wanpeng Li
On Tue, May 24, 2016 at 3:27 PM, Yunhong Jiang
<yunhong.jiang@linux.intel.com> wrote:
> From: Yunhong Jiang <yunhong.jiang@gmail.com>
>
> Utilizing the VMX preemption timer for tsc deadline timer
> virtualization. The VMX preemption timer is armed when the vCPU is
> running, and a VMExit will happen if the virtual TSC deadline timer
> expires.
>
> When the vCPU thread is scheduled out, the tsc deadline timer
> virtualization will be switched to use the current solution, i.e. use
> the timer for it. It's switched back to VMX preemption timer when the
> vCPU thread is scheduled in.
>
> This solution avoids the complex OS's hrtimer system, and also the host
> timer interrupt handling cost, with a preemption_timer VMexit. It fits
> well for some NFV usage scenario, when the vCPU is bound to a pCPU and
> the pCPU is isolated, or some similar scenario.
>
> However, it possibly has impact if the vCPU thread is scheduled in/out
> very frequently, because it switches from/to the hrtimer emulation a lot.
What is the cost of the extra sched-in/out hooks?
>
> Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> ---
> arch/x86/kvm/lapic.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/lapic.h | 11 ++++++
> arch/x86/kvm/trace.h | 22 ++++++++++++
> arch/x86/kvm/vmx.c | 8 +++++
> arch/x86/kvm/x86.c | 4 +++
> 5 files changed, 139 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index f1cf8a5ede11..93679db7ce0f 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1313,7 +1313,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> __delay(tsc_deadline - guest_tsc);
> }
>
> -static void start_sw_tscdeadline(struct kvm_lapic *apic)
> +static void start_sw_tscdeadline(struct kvm_lapic *apic, int no_expire)
> {
> u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
> u64 ns = 0;
> @@ -1330,7 +1330,7 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
>
> now = apic->lapic_timer.timer.base->get_time();
> guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> - if (likely(tscdeadline > guest_tsc)) {
> + if (no_expire || likely(tscdeadline > guest_tsc)) {
> ns = (tscdeadline - guest_tsc) * 1000000ULL;
> do_div(ns, this_tsc_khz);
> expire = ktime_add_ns(now, ns);
> @@ -1343,6 +1343,85 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
> local_irq_restore(flags);
> }
>
> +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_lapic *apic = vcpu->arch.apic;
> + u64 tscdeadline, guest_tsc;
> +
> + if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> + return;
> +
> + tscdeadline = apic->lapic_timer.tscdeadline;
> + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +
> + if (tscdeadline >= guest_tsc)
> + kvm_x86_ops->set_hv_timer(vcpu, tscdeadline - guest_tsc);
Does this interact correctly with TSC scaling? IIUC this programs the
VMX preemption timer with a delay in guest cycles, rather than host
cycles.
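To make the concern concrete: with TSC scaling, guest cycles and host cycles differ by the scaling ratio, so a delta computed in the guest TSC domain is the wrong count to load into a timer that decrements in the host domain. A userspace sketch (hypothetical helper names, not KVM's actual API; 48 fractional bits as on Intel hosts):

```c
#include <stdint.h>

#define TSC_RATIO_FRAC_BITS 48	/* fixed-point fraction width (Intel) */

/* guest_tsc is derived from host_tsc by a fixed-point ratio plus an
 * offset, in the style of kvm_read_l1_tsc(). */
static inline uint64_t host_to_guest_tsc(uint64_t host_tsc, uint64_t ratio,
					 uint64_t offset)
{
	return (uint64_t)(((unsigned __int128)host_tsc * ratio)
			  >> TSC_RATIO_FRAC_BITS) + offset;
}

/* A deadline delta measured in guest cycles corresponds to a different
 * number of host cycles whenever ratio != 1.0; this is the conversion
 * that is missing before programming the preemption timer. */
static inline uint64_t guest_delta_to_host_delta(uint64_t guest_delta,
						 uint64_t ratio)
{
	return (uint64_t)(((unsigned __int128)guest_delta
			   << TSC_RATIO_FRAC_BITS) / ratio);
}
```

With a 2.0 ratio (guest TSC running twice as fast as the host's), a 1000-guest-cycle deadline is only 500 host cycles away; loading the raw 1000 would make the exit fire twice as late as intended.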
> + else
> + kvm_x86_ops->set_hv_timer(vcpu, 0);
> +
> + apic->lapic_timer.hv_timer_state = HV_TIMER_ARMED;
> + trace_kvm_hv_timer_state(vcpu->vcpu_id,
> + apic->lapic_timer.hv_timer_state);
> +}
> +EXPORT_SYMBOL_GPL(kvm_lapic_arm_hv_timer);
> +
> +void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_lapic *apic = vcpu->arch.apic;
> +
> + WARN_ON(apic->lapic_timer.hv_timer_state != HV_TIMER_ARMED);
> + WARN_ON(swait_active(&vcpu->wq));
> + kvm_x86_ops->cancel_hv_timer(vcpu);
> + apic->lapic_timer.hv_timer_state = HV_TIMER_NOT_USED;
> + apic_timer_expired(apic);
> +}
> +EXPORT_SYMBOL_GPL(kvm_lapic_expired_hv_timer);
> +
> +void switch_to_hv_lapic_timer(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_lapic *apic = vcpu->arch.apic;
> +
> + WARN_ON(apic->lapic_timer.hv_timer_state != HV_TIMER_NOT_USED);
> +
> + if (apic_lvtt_tscdeadline(apic) &&
> + !atomic_read(&apic->lapic_timer.pending)) {
> + hrtimer_cancel(&apic->lapic_timer.timer);
> + /* In case the timer triggered in above small window */
> + if (!atomic_read(&apic->lapic_timer.pending)) {
> + apic->lapic_timer.hv_timer_state =
> + HV_TIMER_NEEDS_ARMING;
> + trace_kvm_hv_timer_state(vcpu->vcpu_id,
> + apic->lapic_timer.hv_timer_state);
> + }
> + }
> +}
> +EXPORT_SYMBOL_GPL(switch_to_hv_lapic_timer);
> +
> +void switch_to_sw_lapic_timer(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_lapic *apic = vcpu->arch.apic;
> +
> + /* Possibly the TSC deadline timer is not enabled yet */
> + if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> + return;
> +
> + if (apic->lapic_timer.hv_timer_state == HV_TIMER_ARMED)
> + kvm_x86_ops->cancel_hv_timer(vcpu);
> + apic->lapic_timer.hv_timer_state = HV_TIMER_NOT_USED;
> +
> + if (atomic_read(&apic->lapic_timer.pending))
> + return;
> +
> + /*
> + * Don't trigger the apic_timer_expired() for deadlock,
> + * because the swake_up() from apic_timer_expired() will
> + * try to get the run queue lock, which has been held here
> + * since we are in context switch procedure already.
> + */
> + start_sw_tscdeadline(apic, 1);
> +}
> +EXPORT_SYMBOL_GPL(switch_to_sw_lapic_timer);
> +
> static void start_apic_timer(struct kvm_lapic *apic)
> {
> ktime_t now;
> @@ -1389,7 +1468,19 @@ static void start_apic_timer(struct kvm_lapic *apic)
> ktime_to_ns(ktime_add_ns(now,
> apic->lapic_timer.period)));
> } else if (apic_lvtt_tscdeadline(apic)) {
> - start_sw_tscdeadline(apic);
> + /* lapic timer in tsc deadline mode */
> + if (kvm_x86_ops->set_hv_timer) {
> + if (unlikely(!apic->lapic_timer.tscdeadline ||
> + !apic->vcpu->arch.virtual_tsc_khz))
> + return;
> +
> + /* Expired timer will be checked on vcpu_run() */
> + apic->lapic_timer.hv_timer_state =
> + HV_TIMER_NEEDS_ARMING;
> + trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
> + apic->lapic_timer.hv_timer_state);
> + } else
> + start_sw_tscdeadline(apic, 0);
> }
> }
>
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 891c6da7d4aa..dc4fd8eea04d 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -12,6 +12,12 @@
> #define KVM_APIC_SHORT_MASK 0xc0000
> #define KVM_APIC_DEST_MASK 0x800
>
> +enum {
> + HV_TIMER_NOT_USED,
> + HV_TIMER_NEEDS_ARMING,
> + HV_TIMER_ARMED,
> +};
> +
> struct kvm_timer {
> struct hrtimer timer;
> s64 period; /* unit: ns */
> @@ -20,6 +26,7 @@ struct kvm_timer {
> u64 tscdeadline;
> u64 expired_tscdeadline;
> atomic_t pending; /* accumulated triggered timers */
> + int hv_timer_state;
> };
>
> struct kvm_lapic {
> @@ -212,4 +219,8 @@ bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq,
> struct kvm_vcpu **dest_vcpu);
> int kvm_vector_to_index(u32 vector, u32 dest_vcpus,
> const unsigned long *bitmap, u32 bitmap_size);
> +void switch_to_sw_lapic_timer(struct kvm_vcpu *vcpu);
> +void switch_to_hv_lapic_timer(struct kvm_vcpu *vcpu);
> +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu);
> +void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
> #endif
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index 8de925031b5c..8e1b5e9c2e78 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -6,6 +6,7 @@
> #include <asm/svm.h>
> #include <asm/clocksource.h>
> #include <asm/pvclock-abi.h>
> +#include <lapic.h>
>
> #undef TRACE_SYSTEM
> #define TRACE_SYSTEM kvm
> @@ -1348,6 +1349,27 @@ TRACE_EVENT(kvm_avic_unaccelerated_access,
> __entry->vec)
> );
>
> +#define kvm_trace_symbol_hv_timer \
> + {HV_TIMER_NOT_USED, "no"}, \
> + {HV_TIMER_NEEDS_ARMING, "need_arming"}, \
> + {HV_TIMER_ARMED, "armed"}
> +
> +TRACE_EVENT(kvm_hv_timer_state,
> + TP_PROTO(unsigned int vcpu_id, unsigned int hv_timer_state),
> + TP_ARGS(vcpu_id, hv_timer_state),
> + TP_STRUCT__entry(
> + __field(unsigned int, vcpu_id)
> + __field(unsigned int, hv_timer_state)
> + ),
> + TP_fast_assign(
> + __entry->vcpu_id = vcpu_id;
> + __entry->hv_timer_state = hv_timer_state;
> + ),
> + TP_printk("vcpu_id %x hv_timer %s\n",
> + __entry->vcpu_id,
> + __print_symbolic(__entry->hv_timer_state,
> + kvm_trace_symbol_hv_timer))
> + );
> #endif /* _TRACE_KVM_H */
>
> #undef TRACE_INCLUDE_PATH
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 2b29afa61715..dc0b8cae02b8 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7576,6 +7576,11 @@ static int handle_pcommit(struct kvm_vcpu *vcpu)
> return 1;
> }
>
> +static int handle_preemption_timer(struct kvm_vcpu *vcpu)
> +{
> + kvm_lapic_expired_hv_timer(vcpu);
> + return 1;
> +}
> /*
> * The exit handlers return 1 if the exit was handled fully and guest execution
> * may resume. Otherwise they set the kvm_run parameter to indicate what needs
> @@ -7627,6 +7632,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
> [EXIT_REASON_XRSTORS] = handle_xrstors,
> [EXIT_REASON_PML_FULL] = handle_pml_full,
> [EXIT_REASON_PCOMMIT] = handle_pcommit,
> + [EXIT_REASON_PREEMPTION_TIMER] = handle_preemption_timer,
> };
>
> static const int kvm_vmx_max_exit_handlers =
> @@ -8678,6 +8684,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
> if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> vmx_set_interrupt_shadow(vcpu, 0);
>
> + kvm_lapic_arm_hv_timer(vcpu);
> +
> if (vmx->guest_pkru_valid)
> __write_pkru(vmx->guest_pkru);
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 269d576520ba..f4b50608568a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7724,11 +7724,15 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
>
> void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
> {
> + if (kvm_x86_ops->set_hv_timer)
> + switch_to_hv_lapic_timer(vcpu);
> kvm_x86_ops->sched_in(vcpu, cpu);
> }
>
> void kvm_arch_sched_out(struct kvm_vcpu *vcpu)
> {
> + if (kvm_x86_ops->set_hv_timer)
> + switch_to_sw_lapic_timer(vcpu);
> }
>
> int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-24 23:11 ` David Matlack
@ 2016-05-24 23:35 ` yunhong jiang
2016-05-25 11:58 ` Paolo Bonzini
2016-05-25 10:40 ` Paolo Bonzini
1 sibling, 1 reply; 32+ messages in thread
From: yunhong jiang @ 2016-05-24 23:35 UTC (permalink / raw)
To: David Matlack
Cc: kvm list, Marcelo Tosatti, Radim Krčmář,
Paolo Bonzini, Wanpeng Li
On Tue, 24 May 2016 16:11:29 -0700
David Matlack <dmatlack@google.com> wrote:
> On Tue, May 24, 2016 at 3:27 PM, Yunhong Jiang
> <yunhong.jiang@linux.intel.com> wrote:
> > From: Yunhong Jiang <yunhong.jiang@gmail.com>
> >
> > Utilizing the VMX preemption timer for tsc deadline timer
> > virtualization. The VMX preemption timer is armed when the vCPU is
> > running, and a VMExit will happen if the virtual TSC deadline timer
> > expires.
> >
> > When the vCPU thread is scheduled out, the tsc deadline timer
> > virtualization will be switched to use the current solution, i.e.
> > use the timer for it. It's switched back to VMX preemption timer
> > when the vCPU thread is scheduled in.
> >
> > This solution avoids the complex OS's hrtimer system, and also the
> > host timer interrupt handling cost, with a preemption_timer VMexit.
> > It fits well for some NFV usage scenario, when the vCPU is bound to
> > a pCPU and the pCPU is isolated, or some similar scenario.
> >
> > However, it possibly has impact if the vCPU thread is scheduled
> > in/out very frequently, because it switches from/to the hrtimer
> > emulation a lot.
>
> What is the cost of the extra sched-in/out hooks?
The cost comes from each sched-in/out: kvm_sched_in/out needs to switch
between the sw/hv timer, which includes calling hrtimer_start()/hrtimer_cancel()
to reprogram the hrtimer and updating the VMCS to arm/disarm the VMX preemption
timer. I think hrtimer_start()/hrtimer_cancel() is costly, although I have no
data on hand.
Or do you mean I should provide data for it? Do you have any suggestion on a
workload for comparing the difference? I think the schedule hooks will not
impact the guest itself; instead, they make the scheduling procedure longer and
thus impact host system performance.
> > +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_lapic *apic = vcpu->arch.apic;
> > + u64 tscdeadline, guest_tsc;
> > +
> > + if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> > + return;
> > +
> > + tscdeadline = apic->lapic_timer.tscdeadline;
> > + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > +
> > + if (tscdeadline >= guest_tsc)
> > + kvm_x86_ops->set_hv_timer(vcpu, tscdeadline -
> > guest_tsc);
>
> Does this interact correctly with TSC scaling? IIUC this programs the
> VMX preemption timer with a delay in guest cycles, rather than host
> cycles.
Aha, good point, thanks!
But I re-checked lapic.c and it seems this issue has never been considered?
Check http://lxr.free-electrons.com/source/arch/x86/kvm/lapic.c#L1335 and
lxr.free-electrons.com/source/arch/x86/kvm/lapic.c#L1335 for example. Also I
tried to find a function to translate a guest TSC to a host TSC and didn't find
such a function at all. Did I miss anything?
Thanks
--jyh
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-24 23:35 ` yunhong jiang
@ 2016-05-25 11:58 ` Paolo Bonzini
2016-05-25 22:53 ` yunhong jiang
0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2016-05-25 11:58 UTC (permalink / raw)
To: yunhong jiang, David Matlack
Cc: kvm list, Marcelo Tosatti, Radim Krčmář, Wanpeng Li
On 25/05/2016 01:35, yunhong jiang wrote:
>> > Does this interact correctly with TSC scaling? IIUC this programs the
>> > VMX preemption timer with a delay in guest cycles, rather than host
>> > cycles.
> Aha, good point, thanks!
>
> But I re-checked the lapic.c and seems this issue has never been considered?
> Check http://lxr.free-electrons.com/source/arch/x86/kvm/lapic.c#L1335
This is a bug indeed. The simplest fix would be to ignore
lapic_timer_advance_ns when TSC scaling is in use.
> and lxr.free-electrons.com/source/arch/x86/kvm/lapic.c#L1335
You copied the wrong URL, but at least lines 1383-1410 seem okay to me.
> for example. Also I
> tried to find a function to translate guest tsc to host tsc and didn't find
> such function at all.
See my other reply.
Thanks,
Paolo
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-25 11:58 ` Paolo Bonzini
@ 2016-05-25 22:53 ` yunhong jiang
2016-05-26 7:20 ` Paolo Bonzini
0 siblings, 1 reply; 32+ messages in thread
From: yunhong jiang @ 2016-05-25 22:53 UTC (permalink / raw)
To: Paolo Bonzini
Cc: David Matlack, kvm list, Marcelo Tosatti,
Radim Krčmář,
Wanpeng Li
On Wed, 25 May 2016 13:58:17 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 25/05/2016 01:35, yunhong jiang wrote:
> >> > Does this interact correctly with TSC scaling? IIUC this
> >> > programs the VMX preemption timer with a delay in guest cycles,
> >> > rather than host cycles.
> > Aha, good point, thanks!
> >
> > But I re-checked the lapic.c and seems this issue has never been
> > considered? Check
> > http://lxr.free-electrons.com/source/arch/x86/kvm/lapic.c#L1335
>
> This is a bug indeed. The simplest fix would be to ignore
> lapic_timer_advance_ns when TSC scaling is in use.
Hmm, I can fix this in my next submission, or someone can send out a separate
patch for it.
>
> > and lxr.free-electrons.com/source/arch/x86/kvm/lapic.c#L1335
>
> You copied the wrong URL, but at least lines 1383-1410 seem okay to
> me.
Yes, I meant 1383~1410. On L1402, "expire = ktime_add_ns(now, ns)", I think
"now" is the host tsc while "ns" is the guest delta tsc, which seems like a
bug to me. Or am I misunderstanding something?
Thanks
--jyh
>
> > for example. Also I
> > tried to find a function to translate guest tsc to host tsc and
> > didn't find such function at all.
>
> See my other reply.
>
> Thanks,
>
> Paolo
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-25 22:53 ` yunhong jiang
@ 2016-05-26 7:20 ` Paolo Bonzini
0 siblings, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2016-05-26 7:20 UTC (permalink / raw)
To: yunhong jiang
Cc: David Matlack, kvm list, Marcelo Tosatti,
Radim Krčmář,
Wanpeng Li
On 26/05/2016 00:53, yunhong jiang wrote:
> Hmm, I can fix this on my next submission. Or someone can send out a separate
> patch for it.
Alan Jenkins is looking at it in another thread.
>> > You copied the wrong URL, but at least lines 1383-1410 seem okay to
>> > me.
> Yes, I mean 1383~1410. On L1402, "expire = ktime_add_ns(now, ns)", I think the
> "now" is host tsc while "ns" is guest delta tsc, which seems a bug to me. Or I
> miss-understand anything?
now is nanoseconds, coming from struct hrtimer_clock_base.
ns is nanoseconds, computed from (tscdeadline - guest_tsc) * 1000000 /
vcpu->arch.virtual_tsc_khz.
Thanks,
Paolo
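For reference, the arithmetic start_sw_tscdeadline() performs can be sketched as below. This is a standalone userspace sketch, not the kernel code: do_div() is replaced by plain 64-bit division.

```c
#include <stdint.h>

/* Convert a deadline expressed in guest TSC cycles into a nanosecond
 * delay, as start_sw_tscdeadline() does.  Both this result and "now"
 * (read from the hrtimer clock base) are nanoseconds, so
 * ktime_add_ns(now, ns) mixes consistent units. */
static inline uint64_t guest_tsc_delta_to_ns(uint64_t tscdeadline,
					     uint64_t guest_tsc,
					     uint32_t virtual_tsc_khz)
{
	uint64_t delta = tscdeadline - guest_tsc;	/* guest cycles */

	/* cycles / (khz * 1000) seconds == cycles * 1e6 / khz ns.
	 * The multiply can overflow for very distant deadlines, just
	 * as in the kernel's do_div() formulation. */
	return delta * 1000000ULL / virtual_tsc_khz;
}
```

For example, a 2,400,000-cycle delta at 2.4 GHz (virtual_tsc_khz = 2400000) comes out to 1,000,000 ns, i.e. 1 ms.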
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-24 23:11 ` David Matlack
2016-05-24 23:35 ` yunhong jiang
@ 2016-05-25 10:40 ` Paolo Bonzini
2016-05-25 13:38 ` Radim Krčmář
1 sibling, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2016-05-25 10:40 UTC (permalink / raw)
To: David Matlack, Yunhong Jiang
Cc: kvm list, Marcelo Tosatti, Radim Krčmář, Wanpeng Li
On 25/05/2016 01:11, David Matlack wrote:
> > However, it possibly has impact if the vCPU thread is scheduled in/out
> > very frequently, because it switches from/to the hrtimer emulation a lot.
>
> What is the cost of the extra sched-in/out hooks?
I still have to review this version of the patches, but I think they can
be removed and moved to blocking/unblocking instead.
Thanks,
Paolo
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-25 10:40 ` Paolo Bonzini
@ 2016-05-25 13:38 ` Radim Krčmář
0 siblings, 0 replies; 32+ messages in thread
From: Radim Krčmář @ 2016-05-25 13:38 UTC (permalink / raw)
To: Paolo Bonzini
Cc: David Matlack, Yunhong Jiang, kvm list, Marcelo Tosatti, Wanpeng Li
2016-05-25 12:40+0200, Paolo Bonzini:
> On 25/05/2016 01:11, David Matlack wrote:
>> > However, it possibly has impact if the vCPU thread is scheduled in/out
>> > very frequently, because it switches from/to the hrtimer emulation a lot.
>>
>> What is the cost of the extra sched-in/out hooks?
>
> I still have to review this version of the patches, but I think they can
> be removed and moved to blocking/unblocking instead.
I think so as well. The hrtimer does not make a difference to a preempted
VCPU or a VCPU in userspace; only the blocking sleep needs to be
interrupted.
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-24 22:27 ` [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer Yunhong Jiang
2016-05-24 23:11 ` David Matlack
@ 2016-05-25 11:52 ` Paolo Bonzini
2016-05-25 22:44 ` yunhong jiang
2016-06-04 0:24 ` yunhong jiang
2016-05-25 13:27 ` Radim Krčmář
2016-05-25 13:45 ` Radim Krčmář
3 siblings, 2 replies; 32+ messages in thread
From: Paolo Bonzini @ 2016-05-25 11:52 UTC (permalink / raw)
To: Yunhong Jiang, kvm; +Cc: mtosatti, rkrcmar, kernellwp
On 25/05/2016 00:27, Yunhong Jiang wrote:
> From: Yunhong Jiang <yunhong.jiang@gmail.com>
>
> Utilizing the VMX preemption timer for tsc deadline timer
> virtualization. The VMX preemption timer is armed when the vCPU is
> running, and a VMExit will happen if the virtual TSC deadline timer
> expires.
>
> When the vCPU thread is scheduled out, the tsc deadline timer
> virtualization will be switched to use the current solution, i.e. use
> the timer for it. It's switched back to VMX preemption timer when the
> vCPU thread is scheduled in.
>
> This solution avoids the complex OS's hrtimer system, and also the host
> timer interrupt handling cost, with a preemption_timer VMexit. It fits
> well for some NFV usage scenario, when the vCPU is bound to a pCPU and
> the pCPU is isolated, or some similar scenario.
>
> However, it possibly has impact if the vCPU thread is scheduled in/out
> very frequently, because it switches from/to the hrtimer emulation a lot.
>
> Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> ---
> arch/x86/kvm/lapic.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/lapic.h | 11 ++++++
> arch/x86/kvm/trace.h | 22 ++++++++++++
> arch/x86/kvm/vmx.c | 8 +++++
> arch/x86/kvm/x86.c | 4 +++
> 5 files changed, 139 insertions(+), 3 deletions(-)
Thanks, this looks very nice.
As I mentioned in my reply to Marcelo, I have some more changes in mind
to reduce the overhead. This way we can enable it unconditionally (on
processors that do not have the erratum). In particular:
1) While a virtual machine is scheduled out due to contention, you don't
care if the timer fires. You can fire the timer immediately after the
next vmentry. You only need to set a timer during HLT.
So, rather than sched_out/sched_in, you can start and stop the hrtimer
in vmx_pre_block and vmx_post_block. You probably should rename what is
now vmx_pre_block and vmx_post_block to something like pi_pre_block and
pi_post_block, so that vmx_pre_block and vmx_post_block are like
if (pi_pre_block(vcpu))
return 1;
if (have preemption timer) {
kvm_lapic_switch_to_sw_timer(vcpu);
and
if (have preemption timer)
kvm_lapic_switch_to_hv_timer(vcpu);
pi_post_block(vcpu);
respectively.
You'll also need to check whether the deadlock in
switch_to_sw_lapic_timer is still possible. Hopefully not---if so,
simpler code! :)
2) if possible, I would like to remove kvm_lapic_arm_hv_timer. Instead,
make switch_to_hv_lapic_timer and start_apic_timer call
kvm_x86_ops->set_hv_timer, passing the desired *guest* TSC.
set_hv_timer can convert the guest TSC to a host TSC and save the
desired host TSC in struct vmx_vcpu. vcpu_run checks if there is a TSC
deadline and, if so, writes the delta into VMX_PREEMPTION_TIMER_VALUE.
set_hv_timer/cancel_hv_timer only write to PIN_BASED_VM_EXEC_CONTROL.
Besides making vcpu_run faster, I think that with this change the
distinction between HV_TIMER_NEEDS_ARMING and HV_TIMER_ARMED disappears,
and struct kvm_lapic can simply have a bool hv_timer_in_use. Again, it
might actually end up making the code simpler, especially the interface
between lapic.c and vmx.c.
The hardest part is converting guest_tsc to host_tsc. From
guest_tsc = (host_tsc * vcpu->arch.tsc_scaling_ratio >>
kvm_tsc_scaling_ratio_frac_bits) + tsc_offset
you have
host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
<< kvm_tsc_scaling_ratio_frac_bits)
/ vcpu->arch.tsc_scaling_ratio;
... except that you need a division with a 128-bit dividend and a 64-bit
divisor here, and there is no such thing in Linux. It's okay if you
restrict this to 64-bit hosts and use the divq assembly instruction
directly. (You also need to trap the divide overflow exception; if
there is an overflow just disable the preemption timer). On 32-bit
systems, it's okay to force-disable set_hv_timer/cancel_hv_timer, i.e.
set them to NULL.
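The two formulas above can be sketched in freestanding C. This is a userspace illustration, not a patch: it uses compiler-provided __int128 division (__udivti3 on GCC/clang) instead of a raw divq, and signals quotient overflow with an error return rather than trapping the divide exception.

```c
#include <stdint.h>

static const unsigned frac_bits = 48;	/* kvm_tsc_scaling_ratio_frac_bits
					 * on Intel hosts */

/* guest_tsc = (host_tsc * ratio >> frac_bits) + tsc_offset */
static uint64_t scale_host_to_guest(uint64_t host_tsc, uint64_t ratio,
				    uint64_t tsc_offset)
{
	return (uint64_t)(((unsigned __int128)host_tsc * ratio)
			  >> frac_bits) + tsc_offset;
}

/* Inverse: host_tsc = ((guest_tsc - tsc_offset) << frac_bits) / ratio.
 * The dividend is 128 bits wide; if the quotient does not fit in 64
 * bits, report failure so the caller can disable the preemption timer
 * and fall back to the hrtimer, per the overflow case above. */
static int unscale_guest_to_host(uint64_t guest_tsc, uint64_t ratio,
				 uint64_t tsc_offset, uint64_t *host_tsc)
{
	unsigned __int128 dividend =
		(unsigned __int128)(guest_tsc - tsc_offset) << frac_bits;
	unsigned __int128 q = dividend / ratio;

	if (q > UINT64_MAX)
		return -1;
	*host_tsc = (uint64_t)q;
	return 0;
}
```

As a sanity check, with a 1.5 scaling ratio (3 << 47) and an offset of 7, host TSC 1000 scales to guest TSC 1507, and unscaling 1507 recovers 1000.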
And if you do this, pre_block and post_block become:
if (pi_pre_block(vcpu))
return 1;
if (preemption timer bit set in VMCS)
kvm_lapic_start_sw_timer(vcpu);
and
if (preemption timer bit set in VMCS)
kvm_lapic_cancel_sw_timer(vcpu);
pi_post_block(vcpu);
There is no need to redo the preemption timer setup in post_block.
3) regarding the erratum, you can use the code from
https://www.virtualbox.org/svn/vbox/trunk/src/VBox/VMM/VMMR0/HMR0.cpp
(function hmR0InitIntelIsSubjectToVmxPreemptionTimerErratum... please
rename it to something else! :)).
Thanks,
Paolo
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-25 11:52 ` Paolo Bonzini
@ 2016-05-25 22:44 ` yunhong jiang
2016-05-26 14:05 ` Alan Jenkins
2016-06-04 0:24 ` yunhong jiang
1 sibling, 1 reply; 32+ messages in thread
From: yunhong jiang @ 2016-05-25 22:44 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm, mtosatti, rkrcmar, kernellwp
On Wed, 25 May 2016 13:52:56 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 25/05/2016 00:27, Yunhong Jiang wrote:
> > From: Yunhong Jiang <yunhong.jiang@gmail.com>
> >
> > Utilizing the VMX preemption timer for tsc deadline timer
> > virtualization. The VMX preemption timer is armed when the vCPU is
> > running, and a VMExit will happen if the virtual TSC deadline timer
> > expires.
> >
> > When the vCPU thread is scheduled out, the tsc deadline timer
> > virtualization will be switched to use the current solution, i.e.
> > use the timer for it. It's switched back to VMX preemption timer
> > when the vCPU thread is scheduled int.
> >
> > This solution avoids the complex OS's hrtimer system, and also the
> > host timer interrupt handling cost, with a preemption_timer VMexit.
> > It fits well for some NFV usage scenario, when the vCPU is bound to
> > a pCPU and the pCPU is isolated, or some similar scenario.
> >
> > However, it possibly has impact if the vCPU thread is scheduled
> > in/out very frequently, because it switches from/to the hrtimer
> > emulation a lot.
> >
> > Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> > ---
> > arch/x86/kvm/lapic.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > arch/x86/kvm/lapic.h | 11 ++++++
> > arch/x86/kvm/trace.h | 22 ++++++++++++
> > arch/x86/kvm/vmx.c | 8 +++++
> > arch/x86/kvm/x86.c | 4 +++
> > 5 files changed, 139 insertions(+), 3 deletions(-)
>
> Thanks, this looks very nice.
>
> As I mentioned in my reply to Marcelo, I have some more changes in
> mind to reduce the overhead. This way we can enable it
> unconditionally (on processors that do not have the erratum). In
> particular:
>
> 1) While a virtual machine is scheduled out due to contention, you
> don't care if the timer fires. You can fire the timer immediately
I'm not very sure if this statement applies to all situations, but I can't think
of any exception. So sure, I will work this way.
> after the next vmentry. You only need to set a timer during HLT.
> So, rather than sched_out/sched_in, you can start and stop the hrtimer
> in vmx_pre_block and vmx_post_block. You probably should rename what
> is now vmx_pre_block and vmx_post_block to something like
> pi_pre_block and pi_post_block, so that vmx_pre_block and
> vmx_post_block are like
>
> if (pi_pre_block(vcpu))
> return 1;
> if (have preemption timer) {
> kvm_lapic_switch_to_sw_timer(vcpu);
>
> and
>
> if (have preemption timer)
> kvm_lapic_switch_to_hv_timer(vcpu);
> pi_post_block(vcpu);
>
> respectively.
>
> You'll also need to check whether the deadlock in
> switch_to_sw_lapic_timer is still possible. Hopefully not---if so,
> simpler code! :)
Hmm, the deadlock should not happen here, but I will try and see.
>
> 2) if possible, I would like to remove kvm_lapic_arm_hv_timer.
> Instead, make switch_to_hv_lapic_timer and start_apic_timer can call
> kvm_x86_ops->set_hv_timer, passing the desired *guest* TSC.
> set_hv_timer can convert the guest TSC to a host TSC and save the
> desired host TSC in struct vmx_vcpu. vcpu_run checks if there is a
> TSC deadline and, if so, writes the delta into
> VMX_PREEMPTION_TIMER_VALUE. set_hv_timer/cancel_hv_timer only write
> to PIN_BASED_VM_EXEC_CONTROL.
>
> Besides making vcpu_run faster, I think that with this change the
> distinction between HV_TIMER_NEEDS_ARMING and HV_TIMER_ARMED
> disappears, and struct kvm_lapic can simply have a bool
> hv_timer_in_use. Again, it might actually end up making the code
> simpler, especially the interface between lapic.c and vmx.c.
Yes, this should save two VMCS accesses on vcpu_run(). Will do it this way.
>
> The hardest part is converting guest_tsc to host_tsc. From
>
> guest_tsc = (host_tsc * vcpu->arch.tsc_scaling_ratio >>
> kvm_tsc_scaling_ratio_frac_bits) + tsc_offset
>
> you have
>
> host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
> << kvm_tsc_scaling_ratio_frac_bits)
> / vcpu->arch.tsc_scaling_ratio;
>
> ... except that you need a division with a 128-bit dividend and a
> 64-bit divisor here, and there is no such thing in Linux. It's okay
> if you restrict this to 64-bit hosts and use the divq assembly
> instruction directly. (You also need to trap the divide overflow
> exception; if there is an overflow just disable the preemption
> timer). On 32-bit systems, it's okay to force-disable
> set_hv_timer/cancel_hv_timer, i.e. set them to NULL.
I'm scared by this conversion :) Sure, I will have a try. I need to first check
how scale_tsc works.
>
> And if you do this, pre_block and post_block become:
>
> if (pi_pre_block(vcpu))
> return 1;
> if (preemption timer bit set in VMCS)
> kvm_lapic_start_sw_timer(vcpu);
>
> and
>
> if (preemption timer bit set in VMCS)
> kvm_lapic_cancel_sw_timer(vcpu);
> pi_post_block(vcpu);
>
> There is no need to redo the preemption timer setup in post_block.
>
> 3) regarding the erratum, you can use the code from
> https://www.virtualbox.org/svn/vbox/trunk/src/VBox/VMM/VMMR0/HMR0.cpp
> (function hmR0InitIntelIsSubjectToVmxPreemptionTimerErratum... please
> rename it to something else! :)).
Thanks for the information.
--jyh
>
> Thanks,
>
> Paolo
>
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index f1cf8a5ede11..93679db7ce0f 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -1313,7 +1313,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> > __delay(tsc_deadline - guest_tsc);
> > }
> >
> > -static void start_sw_tscdeadline(struct kvm_lapic *apic)
> > +static void start_sw_tscdeadline(struct kvm_lapic *apic, int no_expire)
> > {
> > u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
> > u64 ns = 0;
> > @@ -1330,7 +1330,7 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
> > now = apic->lapic_timer.timer.base->get_time();
> > guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > - if (likely(tscdeadline > guest_tsc)) {
> > + if (no_expire || likely(tscdeadline > guest_tsc)) {
> > ns = (tscdeadline - guest_tsc) * 1000000ULL;
> > do_div(ns, this_tsc_khz);
> > expire = ktime_add_ns(now, ns);
> > @@ -1343,6 +1343,85 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
> > local_irq_restore(flags);
> > }
> >
> > +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_lapic *apic = vcpu->arch.apic;
> > + u64 tscdeadline, guest_tsc;
> > +
> > + if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> > + return;
> > +
> > + tscdeadline = apic->lapic_timer.tscdeadline;
> > + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > +
> > + if (tscdeadline >= guest_tsc)
> > + kvm_x86_ops->set_hv_timer(vcpu, tscdeadline - guest_tsc);
> > + else
> > + kvm_x86_ops->set_hv_timer(vcpu, 0);
> > +
> > + apic->lapic_timer.hv_timer_state = HV_TIMER_ARMED;
> > + trace_kvm_hv_timer_state(vcpu->vcpu_id,
> > + apic->lapic_timer.hv_timer_state);
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_lapic_arm_hv_timer);
> > +
> > +void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_lapic *apic = vcpu->arch.apic;
> > +
> > + WARN_ON(apic->lapic_timer.hv_timer_state != HV_TIMER_ARMED);
> > + WARN_ON(swait_active(&vcpu->wq));
> > + kvm_x86_ops->cancel_hv_timer(vcpu);
> > + apic->lapic_timer.hv_timer_state = HV_TIMER_NOT_USED;
> > + apic_timer_expired(apic);
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_lapic_expired_hv_timer);
> > +
> > +void switch_to_hv_lapic_timer(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_lapic *apic = vcpu->arch.apic;
> > +
> > + WARN_ON(apic->lapic_timer.hv_timer_state != HV_TIMER_NOT_USED);
> > +
> > + if (apic_lvtt_tscdeadline(apic) &&
> > + !atomic_read(&apic->lapic_timer.pending)) {
> > + hrtimer_cancel(&apic->lapic_timer.timer);
> > + /* In case the timer triggered in above small window */
> > + if (!atomic_read(&apic->lapic_timer.pending)) {
> > + apic->lapic_timer.hv_timer_state =
> > + HV_TIMER_NEEDS_ARMING;
> > + trace_kvm_hv_timer_state(vcpu->vcpu_id,
> > + apic->lapic_timer.hv_timer_state);
> > + }
> > + }
> > +}
> > +EXPORT_SYMBOL_GPL(switch_to_hv_lapic_timer);
> > +
> > +void switch_to_sw_lapic_timer(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_lapic *apic = vcpu->arch.apic;
> > +
> > + /* Possibly the TSC deadline timer is not enabled yet */
> > + if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> > + return;
> > +
> > + if (apic->lapic_timer.hv_timer_state == HV_TIMER_ARMED)
> > + kvm_x86_ops->cancel_hv_timer(vcpu);
> > + apic->lapic_timer.hv_timer_state = HV_TIMER_NOT_USED;
> > +
> > + if (atomic_read(&apic->lapic_timer.pending))
> > + return;
> > +
> > + /*
> > + * Don't trigger the apic_timer_expired() for deadlock,
> > + * because the swake_up() from apic_timer_expired() will
> > + * try to get the run queue lock, which has been held here
> > + * since we are in context switch procedure already.
> > + */
> > + start_sw_tscdeadline(apic, 1);
> > +}
> > +EXPORT_SYMBOL_GPL(switch_to_sw_lapic_timer);
> > +
> > static void start_apic_timer(struct kvm_lapic *apic)
> > {
> > ktime_t now;
> > @@ -1389,7 +1468,19 @@ static void start_apic_timer(struct kvm_lapic *apic)
> > ktime_to_ns(ktime_add_ns(now,
> > apic->lapic_timer.period)));
> > } else if (apic_lvtt_tscdeadline(apic)) {
> > - start_sw_tscdeadline(apic);
> > + /* lapic timer in tsc deadline mode */
> > + if (kvm_x86_ops->set_hv_timer) {
> > + if (unlikely(!apic->lapic_timer.tscdeadline ||
> > + !apic->vcpu->arch.virtual_tsc_khz))
> > + return;
> > +
> > + /* Expired timer will be checked on vcpu_run() */
> > + apic->lapic_timer.hv_timer_state =
> > + HV_TIMER_NEEDS_ARMING;
> > + trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
> > + apic->lapic_timer.hv_timer_state);
> > + } else
> > + start_sw_tscdeadline(apic, 0);
> > }
> > }
> >
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index 891c6da7d4aa..dc4fd8eea04d 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -12,6 +12,12 @@
> > #define KVM_APIC_SHORT_MASK 0xc0000
> > #define KVM_APIC_DEST_MASK 0x800
> >
> > +enum {
> > + HV_TIMER_NOT_USED,
> > + HV_TIMER_NEEDS_ARMING,
> > + HV_TIMER_ARMED,
> > +};
> > +
> > struct kvm_timer {
> > struct hrtimer timer;
> > s64 period; /* unit: ns */
> > @@ -20,6 +26,7 @@ struct kvm_timer {
> > u64 tscdeadline;
> > u64 expired_tscdeadline;
> > atomic_t pending; /* accumulated
> > triggered timers */
> > + int hv_timer_state;
> > };
> >
> > struct kvm_lapic {
> > @@ -212,4 +219,8 @@ bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq,
> > struct kvm_vcpu **dest_vcpu);
> > int kvm_vector_to_index(u32 vector, u32 dest_vcpus,
> > const unsigned long *bitmap, u32 bitmap_size);
> > +void switch_to_sw_lapic_timer(struct kvm_vcpu *vcpu);
> > +void switch_to_hv_lapic_timer(struct kvm_vcpu *vcpu);
> > +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu);
> > +void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
> > #endif
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index 8de925031b5c..8e1b5e9c2e78 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -6,6 +6,7 @@
> > #include <asm/svm.h>
> > #include <asm/clocksource.h>
> > #include <asm/pvclock-abi.h>
> > +#include <lapic.h>
> >
> > #undef TRACE_SYSTEM
> > #define TRACE_SYSTEM kvm
> > @@ -1348,6 +1349,27 @@ TRACE_EVENT(kvm_avic_unaccelerated_access,
> > __entry->vec)
> > );
> >
> > +#define kvm_trace_symbol_hv_timer \
> > + {HV_TIMER_NOT_USED, "no"}, \
> > + {HV_TIMER_NEEDS_ARMING, "need_arming"}, \
> > + {HV_TIMER_ARMED, "armed"}
> > +
> > +TRACE_EVENT(kvm_hv_timer_state,
> > + TP_PROTO(unsigned int vcpu_id, unsigned int hv_timer_state),
> > + TP_ARGS(vcpu_id, hv_timer_state),
> > + TP_STRUCT__entry(
> > + __field(unsigned int, vcpu_id)
> > + __field(unsigned int, hv_timer_state)
> > + ),
> > + TP_fast_assign(
> > + __entry->vcpu_id = vcpu_id;
> > + __entry->hv_timer_state = hv_timer_state;
> > + ),
> > + TP_printk("vcpu_id %x hv_timer %s\n",
> > + __entry->vcpu_id,
> > + __print_symbolic(__entry->hv_timer_state,
> > + kvm_trace_symbol_hv_timer))
> > + );
> > #endif /* _TRACE_KVM_H */
> >
> > #undef TRACE_INCLUDE_PATH
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 2b29afa61715..dc0b8cae02b8 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -7576,6 +7576,11 @@ static int handle_pcommit(struct kvm_vcpu *vcpu)
> > return 1;
> > }
> >
> > +static int handle_preemption_timer(struct kvm_vcpu *vcpu)
> > +{
> > + kvm_lapic_expired_hv_timer(vcpu);
> > + return 1;
> > +}
> > /*
> > * The exit handlers return 1 if the exit was handled fully and guest execution
> > * may resume. Otherwise they set the kvm_run parameter to indicate what needs
> > @@ -7627,6 +7632,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
> > [EXIT_REASON_XRSTORS] = handle_xrstors,
> > [EXIT_REASON_PML_FULL] = handle_pml_full,
> > [EXIT_REASON_PCOMMIT] = handle_pcommit,
> > + [EXIT_REASON_PREEMPTION_TIMER] = handle_preemption_timer,
> > };
> >
> > static const int kvm_vmx_max_exit_handlers =
> > @@ -8678,6 +8684,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
> > if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> > vmx_set_interrupt_shadow(vcpu, 0);
> >
> > + kvm_lapic_arm_hv_timer(vcpu);
> > +
> > if (vmx->guest_pkru_valid)
> > __write_pkru(vmx->guest_pkru);
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 269d576520ba..f4b50608568a 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -7724,11 +7724,15 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
> > void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
> > {
> > + if (kvm_x86_ops->set_hv_timer)
> > + switch_to_hv_lapic_timer(vcpu);
> > kvm_x86_ops->sched_in(vcpu, cpu);
> > }
> >
> > void kvm_arch_sched_out(struct kvm_vcpu *vcpu)
> > {
> > + if (kvm_x86_ops->set_hv_timer)
> > + switch_to_sw_lapic_timer(vcpu);
> > }
> >
> > int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> >
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-25 22:44 ` yunhong jiang
@ 2016-05-26 14:05 ` Alan Jenkins
2016-05-26 15:32 ` Paolo Bonzini
0 siblings, 1 reply; 32+ messages in thread
From: Alan Jenkins @ 2016-05-26 14:05 UTC (permalink / raw)
To: yunhong jiang, Paolo Bonzini; +Cc: kvm, mtosatti, rkrcmar, kernellwp
On 25/05/16 23:44, yunhong jiang wrote:
> On Wed, 25 May 2016 13:52:56 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Yes, this should save two VMCS access on the vcpu_run(). Will do this way.
>
>>
>> The hardest part is converting guest_tsc to host_tsc. From
>>
>> guest_tsc = (host_tsc * vcpu->arch.tsc_scaling_ratio >>
>> kvm_tsc_scaling_ratio_frac_bits) + tsc_offset
>>
>> you have
>>
>> host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
>> << kvm_tsc_scaling_ratio_frac_bits)
>> / vcpu->arch.tsc_scaling_ratio;
>>
>> ... except that you need a division with a 128-bit dividend and a
>> 64-bit divisor here, and there is no such thing in Linux. It's okay
>> if you restrict this to 64-bit hosts and use the divq assembly
>> instruction directly. (You also need to trap the divide overflow
>> exception; if there is an overflow just disable the preemption
>> timer). On 32-bit systems, it's okay to force-disable
>> set_hv_timer/cancel_hv_timer, i.e. set them to NULL.
>
> I'm scared by this conversion :) Sure will hav a try. Need firstly check
> how the scale_tsc works.
1. Against my interests: have you actually confirmed the VMX preemption
timer is affected by guest TSC scaling?
It's not explicit in the SDM. Given the way it's described, as a mechanism for
allocating timeslices to the guest, IMO it makes more sense if it is not scaled.
2. Paolo, any chance I could also get away with requiring 64-bit? I.e.
#ifdef CONFIG_X86_64
#define HAVE_LAPIC_TIMER_ADVANCE 1
#endif
Regards
Alan
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-26 14:05 ` Alan Jenkins
@ 2016-05-26 15:32 ` Paolo Bonzini
0 siblings, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2016-05-26 15:32 UTC (permalink / raw)
To: Alan Jenkins, yunhong jiang; +Cc: kvm, mtosatti, rkrcmar, kernellwp
On 26/05/2016 16:05, Alan Jenkins wrote:
> 1. Against my interests: have you actually confirmed the VMX preemption
> timer is affected by guest TSC scaling?
No, it's not, hence my formula:
host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
<< kvm_tsc_scaling_ratio_frac_bits)
/ vcpu->arch.tsc_scaling_ratio;
The idea is that once you have a host TSC deadline, you can use it to
set the preemption timer on vmentry.
> It's not explicit in the SDM. The way it's described for allocating
> timeslices to the guest, IMO it makes more sense if it is not scaled.
>
> 2. Paolo, any chance I could also get away with requiring 64-bit? I.e.
>
> #ifdef CONFIG_X86_64
> #define HAVE_LAPIC_TIMER_ADVANCE 1
> #endif
I wouldn't object.
Paolo
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-25 11:52 ` Paolo Bonzini
2016-05-25 22:44 ` yunhong jiang
@ 2016-06-04 0:24 ` yunhong jiang
2016-06-06 13:49 ` Paolo Bonzini
1 sibling, 1 reply; 32+ messages in thread
From: yunhong jiang @ 2016-06-04 0:24 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm, mtosatti, rkrcmar, kernellwp
On Wed, 25 May 2016 13:52:56 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 25/05/2016 00:27, Yunhong Jiang wrote:
> > From: Yunhong Jiang <yunhong.jiang@gmail.com>
> >
> > Utilizing the VMX preemption timer for tsc deadline timer
> > virtualization. The VMX preemption timer is armed when the vCPU is
> > running, and a VMExit will happen if the virtual TSC deadline timer
> > expires.
> >
> > When the vCPU thread is scheduled out, the tsc deadline timer
> > virtualization will be switched to use the current solution, i.e.
> > use the timer for it. It's switched back to VMX preemption timer
> > when the vCPU thread is scheduled int.
> >
> > This solution avoids the complex OS's hrtimer system, and also the
> > host timer interrupt handling cost, with a preemption_timer VMexit.
> > It fits well for some NFV usage scenario, when the vCPU is bound to
> > a pCPU and the pCPU is isolated, or some similar scenario.
> >
> > However, it possibly has impact if the vCPU thread is scheduled
> > in/out very frequently, because it switches from/to the hrtimer
> > emulation a lot.
> >
> > Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> > ---
> > arch/x86/kvm/lapic.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > arch/x86/kvm/lapic.h | 11 ++++++
> > arch/x86/kvm/trace.h | 22 ++++++++++++
> > arch/x86/kvm/vmx.c | 8 +++++
> > arch/x86/kvm/x86.c | 4 +++
> > 5 files changed, 139 insertions(+), 3 deletions(-)
>
> Thanks, this looks very nice.
>
> As I mentioned in my reply to Marcelo, I have some more changes in
> mind to reduce the overhead. This way we can enable it
> unconditionally (on processors that do not have the erratum). In
> particular:
>
> 1) While a virtual machine is scheduled out due to contention, you
> don't care if the timer fires. You can fire the timer immediately
> after the next vmentry. You only need to set a timer during HLT.
> So, rather than sched_out/sched_in, you can start and stop the hrtimer
> in vmx_pre_block and vmx_post_block. You probably should rename what
> is now vmx_pre_block and vmx_post_block to something like
> pi_pre_block and pi_post_block, so that vmx_pre_block and
> vmx_post_block are like
>
> if (pi_pre_block(vcpu))
> return 1;
> if (have preemption timer) {
> kvm_lapic_switch_to_sw_timer(vcpu);
>
> and
>
> if (have preemption timer)
> kvm_lapic_switch_to_hv_timer(vcpu);
> pi_post_block(vcpu);
>
> respectively.
>
> You'll also need to check whether the deadlock in
> switch_to_sw_lapic_timer is still possible. Hopefully not---if so,
> simpler code! :)
>
> 2) if possible, I would like to remove kvm_lapic_arm_hv_timer.
> Instead, make switch_to_hv_lapic_timer and start_apic_timer can call
> kvm_x86_ops->set_hv_timer, passing the desired *guest* TSC.
> set_hv_timer can convert the guest TSC to a host TSC and save the
> desired host TSC in struct vmx_vcpu. vcpu_run checks if there is a
Hi Paolo, I mostly finished my patch, but one thing came to my mind at the
last minute and I hope to get your input.
One issue with saving the host TSC in the vmx_vcpu is that, if the vCPU thread
is migrated to another pCPU and the TSCs are not synchronized between
the two pCPUs, the result is not accurate. In the previous patchset
this was OK because we did the calculation in vcpu_run(), so the
migration did not matter.
This also brings up a tricky situation, considering the VMX preemption timer
can only hold a 32-bit value. In a rare situation the host delta TSC is
less than 32 bits, but if the host TSC goes backwards after the
migration, the delta TSC on the new pCPU may be larger than 32 bits,
and we can't switch back to the sw timer anymore because it's too late. I
added some hacky code in kvm_arch_vcpu_load(), not sure if it's ok.
I will send an RFC patch. I'm still looking for a
platform with TSC scaling support, so TSC scaling is not tested
yet. Everything else has been tested.
> TSC deadline and, if so, writes the delta into
> VMX_PREEMPTION_TIMER_VALUE. set_hv_timer/cancel_hv_timer only write
> to PIN_BASED_VM_EXEC_CONTROL.
>
> Besides making vcpu_run faster, I think that with this change the
> distinction between HV_TIMER_NEEDS_ARMING and HV_TIMER_ARMED
> disappears, and struct kvm_lapic can simply have a bool
> hv_timer_in_use. Again, it might actually end up making the code
> simpler, especially the interface between lapic.c and vmx.c.
>
> The hardest part is converting guest_tsc to host_tsc. From
>
> guest_tsc = (host_tsc * vcpu->arch.tsc_scaling_ratio >>
> kvm_tsc_scaling_ratio_frac_bits) + tsc_offset
>
> you have
>
> host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
> << kvm_tsc_scaling_ratio_frac_bits)
> / vcpu->arch.tsc_scaling_ratio;
>
> ... except that you need a division with a 128-bit dividend and a
> 64-bit divisor here, and there is no such thing in Linux. It's okay
> if you restrict this to 64-bit hosts and use the divq assembly
> instruction directly. (You also need to trap the divide overflow
> exception; if there is an overflow just disable the preemption
I simply check if the calculation will cause overflow. A divide error
exception is too much for the target usage scenario.
Thanks
--jyh
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-06-04 0:24 ` yunhong jiang
@ 2016-06-06 13:49 ` Paolo Bonzini
2016-06-06 18:21 ` yunhong jiang
0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2016-06-06 13:49 UTC (permalink / raw)
To: yunhong jiang; +Cc: kvm, mtosatti, rkrcmar, kernellwp
On 04/06/2016 02:24, yunhong jiang wrote:
>> > 2) if possible, I would like to remove kvm_lapic_arm_hv_timer.
>> > Instead, make switch_to_hv_lapic_timer and start_apic_timer can call
>> > kvm_x86_ops->set_hv_timer, passing the desired *guest* TSC.
>> > set_hv_timer can convert the guest TSC to a host TSC and save the
>> > desired host TSC in struct vmx_vcpu. vcpu_run checks if there is a
> Hi, Paolo, I mostly finished my patch, but one thing come to my mind in
> the last minutes and hope to get your input.
> One issue to save the host TSC on the vmx_vcpu is, if the vCPU thread
> is migrated to another pCPU, and the TSC are not synchronized between
> these two pCPUs, then the result is not accurate. In previous patchset,
> it's ok because we do the calculation on the vcpu_run() thus the
> migration does not matter.
My suggestion here is to redo the vmx_set_hv_timer in vmx_vcpu_load.
It should fix the 32-bit overflow below, too.
Paolo
> This also bring a tricky situation considering the VMX preepmtion timer
> can only hold 32bit value. In a rare situation, the host delta TSC is
> less than 32 bit, however, if there is a host tsc backwards after the
> migration, then the delta TSC on the new CPU may be larger than 32 bit,
> but we can't switch back to sw timer anymore because it's too late. I
> add some hacky code on the kvm_arch_vcpu_load(), not sure if it's ok.
>
> I will send a RFC patch. I'm still looking for a
> platform with TSC scaling support, thus not test TSC scaling
> yet. Others has been tested.
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-06-06 13:49 ` Paolo Bonzini
@ 2016-06-06 18:21 ` yunhong jiang
0 siblings, 0 replies; 32+ messages in thread
From: yunhong jiang @ 2016-06-06 18:21 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm, mtosatti, rkrcmar, kernellwp
On Mon, 6 Jun 2016 15:49:12 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 04/06/2016 02:24, yunhong jiang wrote:
> >> > 2) if possible, I would like to remove kvm_lapic_arm_hv_timer.
> >> > Instead, make switch_to_hv_lapic_timer and start_apic_timer can
> >> > call kvm_x86_ops->set_hv_timer, passing the desired *guest* TSC.
> >> > set_hv_timer can convert the guest TSC to a host TSC and save the
> >> > desired host TSC in struct vmx_vcpu. vcpu_run checks if there
> >> > is a
> > Hi, Paolo, I mostly finished my patch, but one thing come to my
> > mind in the last minutes and hope to get your input.
> > One issue to save the host TSC on the vmx_vcpu is, if the vCPU
> > thread is migrated to another pCPU, and the TSC are not
> > synchronized between these two pCPUs, then the result is not
> > accurate. In previous patchset, it's ok because we do the
> > calculation on the vcpu_run() thus the migration does not matter.
>
> My suggestion here is to redo the vmx_set_hv_timer in vmx_vcpu_load.
>
> It should fix the 32-bit overflow below, too.
>
> Paolo
Aha, good point. Will do it this way. And I will remove the "RFC" from the next patchset.
Thanks
--jyh
>
> > This also bring a tricky situation considering the VMX preemption
> > timer can only hold 32bit value. In a rare situation, the host
> > delta TSC is less than 32 bit, however, if there is a host tsc
> > backwards after the migration, then the delta TSC on the new CPU
> > may be larger than 32 bit, but we can't switch back to sw timer
> > anymore because it's too late. I add some hacky code on the
> > kvm_arch_vcpu_load(), not sure if it's ok.
> >
> > I will send a RFC patch. I'm still looking for a
> > platform with TSC scaling support, thus not test TSC scaling
> > yet. Others has been tested.
> >
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-24 22:27 ` [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer Yunhong Jiang
2016-05-24 23:11 ` David Matlack
2016-05-25 11:52 ` Paolo Bonzini
@ 2016-05-25 13:27 ` Radim Krčmář
2016-05-25 13:51 ` Paolo Bonzini
2016-05-25 13:45 ` Radim Krčmář
3 siblings, 1 reply; 32+ messages in thread
From: Radim Krčmář @ 2016-05-25 13:27 UTC (permalink / raw)
To: Yunhong Jiang; +Cc: kvm, mtosatti, pbonzini, kernellwp
2016-05-24 15:27-0700, Yunhong Jiang:
> From: Yunhong Jiang <yunhong.jiang@gmail.com>
>
> Utilizing the VMX preemption timer for tsc deadline timer
> virtualization. The VMX preemption timer is armed when the vCPU is
> running, and a VMExit will happen if the virtual TSC deadline timer
> expires.
>
> When the vCPU thread is scheduled out, the tsc deadline timer
> virtualization will be switched to use the current solution, i.e. use
> the timer for it. It's switched back to VMX preemption timer when the
> vCPU thread is scheduled int.
>
> This solution avoids the complex OS's hrtimer system, and also the host
> timer interrupt handling cost, with a preemption_timer VMexit. It fits
> well for some NFV usage scenario, when the vCPU is bound to a pCPU and
> the pCPU is isolated, or some similar scenario.
>
> However, it possibly has impact if the vCPU thread is scheduled in/out
> very frequently, because it switches from/to the hrtimer emulation a lot.
>
> Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> ---
> arch/x86/kvm/lapic.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/lapic.h | 11 ++++++
> arch/x86/kvm/trace.h | 22 ++++++++++++
> arch/x86/kvm/vmx.c | 8 +++++
> arch/x86/kvm/x86.c | 4 +++
> 5 files changed, 139 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> @@ -1343,6 +1343,85 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
> +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
> +{
Preemption timer needs to be adjusted on every vmentry while hrtimer is
set only once. After how many vmentries does preemption timer surpass
the overhead of hrtimer?
> + struct kvm_lapic *apic = vcpu->arch.apic;
> + u64 tscdeadline, guest_tsc;
> +
> + if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> + return;
> +
> + tscdeadline = apic->lapic_timer.tscdeadline;
> + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +
> + if (tscdeadline >= guest_tsc)
> + kvm_x86_ops->set_hv_timer(vcpu, tscdeadline - guest_tsc);
> + else
> + kvm_x86_ops->set_hv_timer(vcpu, 0);
Calling kvm_lapic_expired_hv_timer will avoid the useless immediate
vmexit after a vmentry.
Thanks.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-25 13:27 ` Radim Krčmář
@ 2016-05-25 13:51 ` Paolo Bonzini
2016-05-25 14:31 ` Radim Krčmář
0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2016-05-25 13:51 UTC (permalink / raw)
To: Radim Krčmář, Yunhong Jiang; +Cc: kvm, mtosatti, kernellwp
On 25/05/2016 15:27, Radim Krčmář wrote:
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > @@ -1343,6 +1343,85 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
> > +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
> > +{
>
> Preemption timer needs to be adjusted on every vmentry while hrtimer is
> set only once. After how many vmentries does preemption timer surpass
> the overhead of hrtimer?
See my review. :) I hope to get the per-vmentry cost of the preemption
timer down to a dozen instructions.
> > + struct kvm_lapic *apic = vcpu->arch.apic;
> > + u64 tscdeadline, guest_tsc;
> > +
> > + if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> > + return;
> > +
> > + tscdeadline = apic->lapic_timer.tscdeadline;
> > + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > +
> > + if (tscdeadline >= guest_tsc)
> > + kvm_x86_ops->set_hv_timer(vcpu, tscdeadline - guest_tsc);
> > + else
> > + kvm_x86_ops->set_hv_timer(vcpu, 0);
>
> Calling kvm_lapic_expired_hv_timer will avoid the useless immediate
> vmexit after a vmentry.
Even if this is common enough to warrant optimizing for it (which I'm
not sure about, and surely depends on the workload), I think it should
be done later. You'd get the same "useless vmentry" even with
hrtimer-based LAPIC timer.
Thanks,
Paolo
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-25 13:51 ` Paolo Bonzini
@ 2016-05-25 14:31 ` Radim Krčmář
2016-05-25 23:13 ` yunhong jiang
2016-06-14 11:34 ` Paolo Bonzini
0 siblings, 2 replies; 32+ messages in thread
From: Radim Krčmář @ 2016-05-25 14:31 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Yunhong Jiang, kvm, mtosatti, kernellwp
2016-05-25 15:51+0200, Paolo Bonzini:
> On 25/05/2016 15:27, Radim Krčmář wrote:
>> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> > @@ -1343,6 +1343,85 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
>> > +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
>> > +{
>>
>> Preemption timer needs to be adjusted on every vmentry while hrtimer is
>> set only once. After how many vmentries does preemption timer surpass
>> the overhead of hrtimer?
>
> See my review. :) I hope to get the per-vmentry cost of the preemption
> timer down to a dozen instructions.
I secretly wished to learn what the overhead of hrtimer is. :)
Seems like we need a vmexit kvm-unit-test that enables tsc deadline
timer while benchmarking.
>> > + if (tscdeadline >= guest_tsc)
>> > + kvm_x86_ops->set_hv_timer(vcpu, tscdeadline - guest_tsc);
>> > + else
>> > + kvm_x86_ops->set_hv_timer(vcpu, 0);
>>
>> Calling kvm_lapic_expired_hv_timer will avoid the useless immediate
>> vmexit after a vmentry.
>
> Even if this is common enough to warrant optimizing for it (which I'm
> not sure about, and surely depends on the workload), I think it should
> be done later. You'd get the same "useless vmentry" even with
> hrtimer-based LAPIC timer.
Makes sense. And early return gets harder to do if we move the setup
closer to vmlaunch.
> You'd get the same "useless vmentry" even with
> hrtimer-based LAPIC timer.
(hrtimers won't do an entry/exit cycle if they are injected sufficiently
early. Maybe preemption or userspace exits would make the difference
noticeable, but it should be minor too.)
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-25 14:31 ` Radim Krčmář
@ 2016-05-25 23:13 ` yunhong jiang
2016-06-14 11:34 ` Paolo Bonzini
1 sibling, 0 replies; 32+ messages in thread
From: yunhong jiang @ 2016-05-25 23:13 UTC (permalink / raw)
To: Radim Krčmář; +Cc: Paolo Bonzini, kvm, mtosatti, kernellwp
On Wed, 25 May 2016 16:31:19 +0200
Radim Krčmář <rkrcmar@redhat.com> wrote:
> 2016-05-25 15:51+0200, Paolo Bonzini:
> > On 25/05/2016 15:27, Radim Krčmář wrote:
> >> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >> > @@ -1343,6 +1343,85 @@ static void start_sw_tscdeadline(struct
> >> > kvm_lapic *apic) +void kvm_lapic_arm_hv_timer(struct kvm_vcpu
> >> > *vcpu) +{
> >>
> >> Preemption timer needs to be adjusted on every vmentry while
> >> hrtimer is set only once. After how many vmentries does
> >> preemption timer surpass the overhead of hrtimer?
> >
> > See my review. :) I hope to get the per-vmentry cost of the
> > preemption timer down to a dozen instructions.
>
> I secretly wished to learn what the overhead of hrtimer is. :)
I think the cost includes "hrtimer cost + timer interrupt ISR +
vm-exit/entry". The cost of hrtimer_start/hrtimer_cancel depends on the
current hrtimer situation. The timer interrupt ISR costs far more than
the VMExit handling. On systems w/o posted interrupts, a pair of
vm-exit/entry can be caused by the host timer interrupt.
The initial intention of the patch is to reduce latency, so we didn't
compare the overhead of hrtimer with the cost of the extra vmentries,
and that's the reason for the kernel parameter to turn it on/off. I'd
still keep that parameter even with Paolo's new suggestion, although it
can be ON by default now.
BTW, if the VMX preemption timer used the host tsc, instead of a delta
tsc, as its parameter, there would be no extra cost at all :(
Thanks
--jyh
>
> Seems like we need a vmexit kvm-unit-test that enables tsc deadline
> timer while benchmarking.
>
> >> > + if (tscdeadline >= guest_tsc)
> >> > + kvm_x86_ops->set_hv_timer(vcpu, tscdeadline -
> >> > guest_tsc);
> >> > + else
> >> > + kvm_x86_ops->set_hv_timer(vcpu, 0);
> >>
> >> Calling kvm_lapic_expired_hv_timer will avoid the useless immediate
> >> vmexit after a vmentry.
> >
> > Even if this is common enough to warrant optimizing for it (which
> > I'm not sure about, and surely depends on the workload), I think it
> > should be done later. You'd get the same "useless vmentry" even
> > with hrtimer-based LAPIC timer.
>
> Makes sense. And early return gets harder to do if we move the setup
> closer to vmlaunch.
>
> > You'd get the same "useless vmentry" even with
> > hrtimer-based LAPIC timer.
>
> (hrtimers won't do an entry/exit cycle if they are injected
> sufficiently early. Maybe preemption or userspace exits would make
> the difference noticeable, but it should be minor too.)
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-25 14:31 ` Radim Krčmář
2016-05-25 23:13 ` yunhong jiang
@ 2016-06-14 11:34 ` Paolo Bonzini
1 sibling, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2016-06-14 11:34 UTC (permalink / raw)
To: Radim Krčmář; +Cc: Yunhong Jiang, kvm, mtosatti, kernellwp
On 25/05/2016 16:31, Radim Krčmář wrote:
>> > See my review. :) I hope to get the per-vmentry cost of the preemption
>> > timer down to a dozen instructions.
> I secretly wished to learn what the overhead of hrtimer is.
We now know. :) It's about 800 clock cycles.
Paolo
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-24 22:27 ` [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer Yunhong Jiang
` (2 preceding siblings ...)
2016-05-25 13:27 ` Radim Krčmář
@ 2016-05-25 13:45 ` Radim Krčmář
2016-05-25 22:57 ` yunhong jiang
3 siblings, 1 reply; 32+ messages in thread
From: Radim Krčmář @ 2016-05-25 13:45 UTC (permalink / raw)
To: Yunhong Jiang; +Cc: kvm, mtosatti, pbonzini, kernellwp
2016-05-24 15:27-0700, Yunhong Jiang:
> From: Yunhong Jiang <yunhong.jiang@gmail.com>
>
> Utilizing the VMX preemption timer for tsc deadline timer
> virtualization. The VMX preemption timer is armed when the vCPU is
> running, and a VMExit will happen if the virtual TSC deadline timer
> expires.
>
> When the vCPU thread is scheduled out, the tsc deadline timer
> virtualization will be switched to use the current solution, i.e. use
> the timer for it. It's switched back to VMX preemption timer when the
> vCPU thread is scheduled in.
>
> This solution avoids the complex OS's hrtimer system, and also the host
> timer interrupt handling cost, with a preemption_timer VMexit. It fits
> well for some NFV usage scenarios, where the vCPU is bound to a pCPU and
> the pCPU is isolated, or similar scenarios.
>
> However, it possibly has an impact if the vCPU thread is scheduled in/out
> very frequently, because it switches from/to the hrtimer emulation a lot.
>
> Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> ---
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> static const int kvm_vmx_max_exit_handlers =
> @@ -8678,6 +8684,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
> if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> vmx_set_interrupt_shadow(vcpu, 0);
>
> + kvm_lapic_arm_hv_timer(vcpu);
> +
Every cycle that the host spends in root mode after setting the timeout
makes the preemption timer delayed. Can't we move the setup closer to
vmlaunch, i.e. after atomic_switch_perf_msrs()?
> if (vmx->guest_pkru_valid)
> __write_pkru(vmx->guest_pkru);
>
* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
2016-05-25 13:45 ` Radim Krčmář
@ 2016-05-25 22:57 ` yunhong jiang
0 siblings, 0 replies; 32+ messages in thread
From: yunhong jiang @ 2016-05-25 22:57 UTC (permalink / raw)
To: Radim Krčmář; +Cc: kvm, mtosatti, pbonzini, kernellwp
On Wed, 25 May 2016 15:45:53 +0200
Radim Krčmář <rkrcmar@redhat.com> wrote:
> 2016-05-24 15:27-0700, Yunhong Jiang:
> > From: Yunhong Jiang <yunhong.jiang@gmail.com>
> >
> > Utilizing the VMX preemption timer for tsc deadline timer
> > virtualization. The VMX preemption timer is armed when the vCPU is
> > running, and a VMExit will happen if the virtual TSC deadline timer
> > expires.
> >
> > When the vCPU thread is scheduled out, the tsc deadline timer
> > virtualization will be switched to use the current solution, i.e.
> > use the timer for it. It's switched back to VMX preemption timer
> > when the vCPU thread is scheduled in.
> >
> > This solution avoids the complex OS's hrtimer system, and also the
> > host timer interrupt handling cost, with a preemption_timer VMexit.
> > It fits well for some NFV usage scenarios, where the vCPU is bound to
> > a pCPU and the pCPU is isolated, or similar scenarios.
> >
> > However, it possibly has impact if the vCPU thread is scheduled
> > in/out very frequently, because it switches from/to the hrtimer
> > emulation a lot.
> >
> > Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> > ---
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > static const int kvm_vmx_max_exit_handlers =
> > @@ -8678,6 +8684,8 @@ static void __noclone vmx_vcpu_run(struct
> > kvm_vcpu *vcpu) if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> > vmx_set_interrupt_shadow(vcpu, 0);
> >
> > + kvm_lapic_arm_hv_timer(vcpu);
> > +
>
> Every cycle that the host spends in root mode after setting the
> timeout makes the preemption timer delayed. Can't we move the setup
> closer to vmlaunch, i.e. after atomic_switch_perf_msrs()?
Yes, thanks for pointing this.
--jyh
>
> > if (vmx->guest_pkru_valid)
> > __write_pkru(vmx->guest_pkru);
> >