* [RFC PATCH V2 0/4] Utilizing VMX preemption for timer virtualization
From: Yunhong Jiang @ 2016-05-24 22:27 UTC
  To: kvm; +Cc: mtosatti, rkrcmar, pbonzini, kernellwp

The VMX preemption timer is a VMX feature: it counts down, from the
value loaded at VM entry, while the CPU is in VMX non-root operation.
When the timer reaches zero, it stops counting and a VM exit occurs.
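
As an informal illustration (not part of the patches): the timer
decrements once every 2^X TSC cycles, where X is reported in bits 4:0
of the IA32_VMX_MISC MSR, so converting a TSC delta into a timer value
is a single shift:

	/* Sketch only: rate = IA32_VMX_MISC[4:0], read once at setup. */
	static u32 tsc_delta_to_preemption_timer_value(u64 delta_tsc,
						       unsigned int rate)
	{
		return (u32)(delta_tsc >> rate);
	}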

This patchset utilizes the VMX preemption timer for TSC deadline timer
virtualization. The preemption timer is armed before VM entry if the
TSC deadline timer is enabled, and a VM exit occurs when the virtual
TSC deadline timer expires.

When the vCPU thread is scheduled out, TSC deadline timer
virtualization falls back to the current solution, i.e. an hrtimer. It
is switched back to the VMX preemption timer when the vCPU thread is
scheduled in.

This solution replaces the OS's complex hrtimer machinery, as well as
the host timer interrupt handling cost, with a single preemption_timer
VM exit. It fits well for NFV usage scenarios where the vCPU is bound
to a pCPU and the pCPU is isolated, and for similar scenarios.

However, it may hurt performance if the vCPU thread is scheduled
in/out very frequently, because each switch moves between the hrtimer
emulation and the preemption timer. A module parameter
(kvm_intel.enable_hv_timer) is provided to turn the feature on or off.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>

Performance Evaluation:
Host:
[nfv@otcnfv02 ~]$ cat /proc/cpuinfo
....
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

Guest:
Two vCPUs, each pinned to an isolated pCPU, with idle=poll on the
guest kernel. When the vCPUs are not pinned, the benefit is smaller
than in the pinned case.

Test tools:
cyclictest [1] running for 10 minutes with a 1ms interval, i.e. 600000
loops in total. Histogram entries are latency in microseconds followed
by sample count.

1. enable_hv_timer=Y.

# Histogram
......
000003 000000
000004 000029
000005 023017
000006 357485
000007 192723
000008 026141
000009 000106
000010 000067
......
# Min Latencies: 00004
# Avg Latencies: 00006

2. enable_hv_timer=N.

# Histogram
......
000004 000000
000005 000074
000006 001943
000007 005820
000008 164729
000009 424401
000010 001964
000011 000252
000012 000190
......
# Min Latencies: 00005
# Avg Latencies: 00010

Changes since v1 [2]:

* Remove vmx_sched_out; no kvm_x86_ops changes are needed for it.
* Remove the two expired-timer checks on each VM entry.
* Rename hwemul_timer to hv_timer.
* Clear the vmx_x86_ops members if the preemption timer is not usable.
* Cache cpu_preemption_timer_multi.
* Keep the tracepoint in the same patch as the function that uses it.
* Other minor changes based on Paolo's review.

[1] https://rt.wiki.kernel.org/index.php/Cyclictest
[2] http://www.spinics.net/lists/kvm/msg132895.html

Yunhong Jiang (4):
  Add the kvm sched_out hook
  Utilize the vmx preemption timer
  Separate the start_sw_tscdeadline
  Utilize the vmx preemption timer for tsc deadline timer

 arch/arm/include/asm/kvm_host.h     |   1 +
 arch/mips/include/asm/kvm_host.h    |   1 +
 arch/powerpc/include/asm/kvm_host.h |   1 +
 arch/s390/include/asm/kvm_host.h    |   1 +
 arch/x86/include/asm/kvm_host.h     |   4 +
 arch/x86/kvm/lapic.c                | 144 ++++++++++++++++++++++++++++++------
 arch/x86/kvm/lapic.h                |  11 +++
 arch/x86/kvm/trace.h                |  22 ++++++
 arch/x86/kvm/vmx.c                  |  51 ++++++++++++-
 arch/x86/kvm/x86.c                  |   8 ++
 include/linux/kvm_host.h            |   1 +
 virt/kvm/kvm_main.c                 |   1 +
 12 files changed, 221 insertions(+), 25 deletions(-)

TODO:
	Identify the CPUs on which the VMX preemption timer is broken.
-- 
1.8.3.1



* [RFC PATCH V2 1/4] Add the kvm sched_out hook
From: Yunhong Jiang @ 2016-05-24 22:27 UTC
  To: kvm; +Cc: mtosatti, rkrcmar, pbonzini, kernellwp

From: Yunhong Jiang <yunhong.jiang@gmail.com>

When a vCPU thread is scheduled out, the new kvm_arch_sched_out() hook
is invoked. Provide empty stubs for all architectures; x86 wires it up
to the lapic timer switching in a later patch.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
---
 arch/arm/include/asm/kvm_host.h     | 1 +
 arch/mips/include/asm/kvm_host.h    | 1 +
 arch/powerpc/include/asm/kvm_host.h | 1 +
 arch/s390/include/asm/kvm_host.h    | 1 +
 arch/x86/include/asm/kvm_host.h     | 1 +
 arch/x86/kvm/x86.c                  | 4 ++++
 include/linux/kvm_host.h            | 1 +
 virt/kvm/kvm_main.c                 | 1 +
 8 files changed, 11 insertions(+)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 0df6b1fc9655..8ea4b8a1a27b 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -292,6 +292,7 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu) {}
 
 static inline void kvm_arm_init_debug(void) {}
 static inline void kvm_arm_setup_debug(struct kvm_vcpu *vcpu) {}
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 6733ac575da4..6e432a3065e3 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -812,6 +812,7 @@ static inline void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 		struct kvm_memory_slot *slot) {}
 static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ec35af34a3fb..2a0ebb0a78eb 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -725,6 +725,7 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {}
 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_exit(void) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 37b9017c6a96..69084b874837 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -688,6 +688,7 @@ static inline void kvm_arch_check_processor_compat(void *rtn) {}
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_free_memslot(struct kvm *kvm,
 		struct kvm_memory_slot *free, struct kvm_memory_slot *dont) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {}
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e0fbe7e70dc1..5e6b3ce7748f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -959,6 +959,7 @@ struct kvm_x86_ops {
 	int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
 
 	void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
+	void (*sched_out)(struct kvm_vcpu *vcpu);
 
 	/*
 	 * Arch-specific dirty logging hooks. These hooks are only supposed to
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c805cf494154..269d576520ba 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7727,6 +7727,10 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 	kvm_x86_ops->sched_in(vcpu, cpu);
 }
 
+void kvm_arch_sched_out(struct kvm_vcpu *vcpu)
+{
+}
+
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 {
 	if (type)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b1fa8f11c95b..04c48c50faf3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -722,6 +722,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu);
 
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void kvm_arch_sched_out(struct kvm_vcpu *vcpu);
 
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dd4ac9d9e8f5..711edf7224b1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3533,6 +3533,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 
 	if (current->state == TASK_RUNNING)
 		vcpu->preempted = true;
+	kvm_arch_sched_out(vcpu);
 	kvm_arch_vcpu_put(vcpu);
 }
 
-- 
1.8.3.1



* [RFC PATCH V2 2/4] Utilize the vmx preemption timer
From: Yunhong Jiang @ 2016-05-24 22:27 UTC
  To: kvm; +Cc: mtosatti, rkrcmar, pbonzini, kernellwp

From: Yunhong Jiang <yunhong.jiang@gmail.com>

Add the basic VMX preemption timer functionality: check whether the
feature is supported, and add hooks to set up and cancel the timer.
Also add a module parameter, enable_hv_timer, that controls whether the
preemption timer is used. The timer counts down by one every 2^X TSC
cycles, where X is read from bits 4:0 of IA32_VMX_MISC and cached in
cpu_preemption_timer_multi.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/vmx.c              | 43 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5e6b3ce7748f..c0bae55a9a81 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1006,6 +1006,9 @@ struct kvm_x86_ops {
 	int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq,
 			      uint32_t guest_irq, bool set);
 	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
+
+	void (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 tsc);
+	void (*cancel_hv_timer)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e605d1ed334f..2b29afa61715 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -110,6 +110,10 @@ module_param_named(pml, enable_pml, bool, S_IRUGO);
 
 #define KVM_VMX_TSC_MULTIPLIER_MAX     0xffffffffffffffffULL
 
+static int cpu_preemption_timer_multi;
+static bool __read_mostly enable_hv_timer;
+module_param_named(enable_hv_timer, enable_hv_timer, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK (X86_CR0_NW | X86_CR0_CD)
 #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST (X86_CR0_WP | X86_CR0_NE)
 #define KVM_VM_CR0_ALWAYS_ON						\
@@ -1056,6 +1060,12 @@ static inline bool cpu_has_vmx_virtual_intr_delivery(void)
 		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
 }
 
+static inline bool cpu_has_vmx_preemption_timer(void)
+{
+	return vmcs_config.pin_based_exec_ctrl &
+		PIN_BASED_VMX_PREEMPTION_TIMER;
+}
+
 static inline bool cpu_has_vmx_posted_intr(void)
 {
 	return IS_ENABLED(CONFIG_X86_LOCAL_APIC) &&
@@ -3306,7 +3316,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 		return -EIO;
 
 	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
-	opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR;
+	opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR |
+		 PIN_BASED_VMX_PREEMPTION_TIMER;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
 				&_pin_based_exec_control) < 0)
 		return -EIO;
@@ -4779,6 +4790,8 @@ static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx)
 
 	if (!kvm_vcpu_apicv_active(&vmx->vcpu))
 		pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR;
+	/* The preemption timer is enabled dynamically, when a timer is armed */
+	pin_based_exec_ctrl &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
 	return pin_based_exec_ctrl;
 }
 
@@ -6377,6 +6390,17 @@ static __init int hardware_setup(void)
 		kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
 	}
 
+	if (cpu_has_vmx_preemption_timer() && enable_hv_timer) {
+		u64 vmx_msr;
+
+		rdmsrl(MSR_IA32_VMX_MISC, vmx_msr);
+		cpu_preemption_timer_multi =
+			 vmx_msr & VMX_MISC_PREEMPTION_TIMER_RATE_MASK;
+	} else {
+		kvm_x86_ops->set_hv_timer = NULL;
+		kvm_x86_ops->cancel_hv_timer = NULL;
+	}
+
 	kvm_set_posted_intr_wakeup_handler(wakeup_handler);
 
 	return alloc_kvm_area();
@@ -10650,6 +10674,20 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 	return X86EMUL_CONTINUE;
 }
 
+static void vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 delta_tsc)
+{
+	vmcs_write32(VMX_PREEMPTION_TIMER_VALUE,
+			delta_tsc >> cpu_preemption_timer_multi);
+	vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
+			PIN_BASED_VMX_PREEMPTION_TIMER);
+}
+
+static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
+{
+	vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
+			PIN_BASED_VMX_PREEMPTION_TIMER);
+}
+
 static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
 	if (ple_gap)
@@ -11013,6 +11051,9 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.pmu_ops = &intel_pmu_ops,
 
 	.update_pi_irte = vmx_update_pi_irte,
+
+	.set_hv_timer = vmx_set_hv_timer,
+	.cancel_hv_timer = vmx_cancel_hv_timer,
 };
 
 static int __init vmx_init(void)
-- 
1.8.3.1



* [RFC PATCH V2 3/4] Separate the start_sw_tscdeadline
From: Yunhong Jiang @ 2016-05-24 22:27 UTC
  To: kvm; +Cc: mtosatti, rkrcmar, pbonzini, kernellwp

From: Yunhong Jiang <yunhong.jiang@gmail.com>

The function that starts the software TSC deadline timer will also be
used by the sched_out path when the hv_timer is in use, so split it out
into a separate function. No functional change.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
---
 arch/x86/kvm/lapic.c | 57 ++++++++++++++++++++++++++++------------------------
 1 file changed, 31 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index bbb5b283ff63..f1cf8a5ede11 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1313,6 +1313,36 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 		__delay(tsc_deadline - guest_tsc);
 }
 
+static void start_sw_tscdeadline(struct kvm_lapic *apic)
+{
+	u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
+	u64 ns = 0;
+	ktime_t expire;
+	struct kvm_vcpu *vcpu = apic->vcpu;
+	unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
+	unsigned long flags;
+	ktime_t now;
+
+	if (unlikely(!tscdeadline || !this_tsc_khz))
+		return;
+
+	local_irq_save(flags);
+
+	now = apic->lapic_timer.timer.base->get_time();
+	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+	if (likely(tscdeadline > guest_tsc)) {
+		ns = (tscdeadline - guest_tsc) * 1000000ULL;
+		do_div(ns, this_tsc_khz);
+		expire = ktime_add_ns(now, ns);
+		expire = ktime_sub_ns(expire, lapic_timer_advance_ns);
+		hrtimer_start(&apic->lapic_timer.timer,
+				expire, HRTIMER_MODE_ABS_PINNED);
+	} else
+		apic_timer_expired(apic);
+
+	local_irq_restore(flags);
+}
+
 static void start_apic_timer(struct kvm_lapic *apic)
 {
 	ktime_t now;
@@ -1359,32 +1389,7 @@ static void start_apic_timer(struct kvm_lapic *apic)
 			   ktime_to_ns(ktime_add_ns(now,
 					apic->lapic_timer.period)));
 	} else if (apic_lvtt_tscdeadline(apic)) {
-		/* lapic timer in tsc deadline mode */
-		u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
-		u64 ns = 0;
-		ktime_t expire;
-		struct kvm_vcpu *vcpu = apic->vcpu;
-		unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
-		unsigned long flags;
-
-		if (unlikely(!tscdeadline || !this_tsc_khz))
-			return;
-
-		local_irq_save(flags);
-
-		now = apic->lapic_timer.timer.base->get_time();
-		guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
-		if (likely(tscdeadline > guest_tsc)) {
-			ns = (tscdeadline - guest_tsc) * 1000000ULL;
-			do_div(ns, this_tsc_khz);
-			expire = ktime_add_ns(now, ns);
-			expire = ktime_sub_ns(expire, lapic_timer_advance_ns);
-			hrtimer_start(&apic->lapic_timer.timer,
-				      expire, HRTIMER_MODE_ABS_PINNED);
-		} else
-			apic_timer_expired(apic);
-
-		local_irq_restore(flags);
+		start_sw_tscdeadline(apic);
 	}
 }
 
-- 
1.8.3.1



* [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
From: Yunhong Jiang @ 2016-05-24 22:27 UTC
  To: kvm; +Cc: mtosatti, rkrcmar, pbonzini, kernellwp

From: Yunhong Jiang <yunhong.jiang@gmail.com>

Utilize the VMX preemption timer for TSC deadline timer
virtualization. The preemption timer is armed while the vCPU is
running, and a VM exit occurs when the virtual TSC deadline timer
expires.

When the vCPU thread is scheduled out, TSC deadline timer
virtualization falls back to the current solution, i.e. an hrtimer. It
is switched back to the VMX preemption timer when the vCPU thread is
scheduled in.

This solution avoids the OS's complex hrtimer machinery, as well as
the host timer interrupt handling cost, by taking a preemption_timer
VM exit instead. It fits well for NFV usage scenarios where the vCPU
is bound to a pCPU and the pCPU is isolated, and for similar scenarios.

However, it may hurt performance if the vCPU thread is scheduled
in/out very frequently, because each switch moves between the hrtimer
emulation and the preemption timer.
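
Schematically, the hv_timer state introduced below moves as follows
(informal sketch of the patch's logic):

	NOT_USED      --(start_apic_timer in tscdeadline mode,
	                 or vCPU sched-in)-->                 NEEDS_ARMING
	NEEDS_ARMING  --(vcpu_run: program the VMCS)-->       ARMED
	ARMED         --(preemption-timer VM exit)-->         NOT_USED
	ARMED or
	NEEDS_ARMING  --(vCPU sched-out: fall back to
	                 the hrtimer)-->                      NOT_USED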

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
---
 arch/x86/kvm/lapic.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/lapic.h | 11 ++++++
 arch/x86/kvm/trace.h | 22 ++++++++++++
 arch/x86/kvm/vmx.c   |  8 +++++
 arch/x86/kvm/x86.c   |  4 +++
 5 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index f1cf8a5ede11..93679db7ce0f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1313,7 +1313,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 		__delay(tsc_deadline - guest_tsc);
 }
 
-static void start_sw_tscdeadline(struct kvm_lapic *apic)
+static void start_sw_tscdeadline(struct kvm_lapic *apic, int no_expire)
 {
 	u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
 	u64 ns = 0;
@@ -1330,7 +1330,7 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
 
 	now = apic->lapic_timer.timer.base->get_time();
 	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
-	if (likely(tscdeadline > guest_tsc)) {
+	if (no_expire || likely(tscdeadline > guest_tsc)) {
 		ns = (tscdeadline - guest_tsc) * 1000000ULL;
 		do_div(ns, this_tsc_khz);
 		expire = ktime_add_ns(now, ns);
@@ -1343,6 +1343,85 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
 	local_irq_restore(flags);
 }
 
+void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	u64 tscdeadline, guest_tsc;
+
+	if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
+		return;
+
+	tscdeadline = apic->lapic_timer.tscdeadline;
+	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+
+	if (tscdeadline >= guest_tsc)
+		kvm_x86_ops->set_hv_timer(vcpu, tscdeadline - guest_tsc);
+	else
+		kvm_x86_ops->set_hv_timer(vcpu, 0);
+
+	apic->lapic_timer.hv_timer_state = HV_TIMER_ARMED;
+	trace_kvm_hv_timer_state(vcpu->vcpu_id,
+			apic->lapic_timer.hv_timer_state);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_arm_hv_timer);
+
+void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	WARN_ON(apic->lapic_timer.hv_timer_state != HV_TIMER_ARMED);
+	WARN_ON(swait_active(&vcpu->wq));
+	kvm_x86_ops->cancel_hv_timer(vcpu);
+	apic->lapic_timer.hv_timer_state = HV_TIMER_NOT_USED;
+	apic_timer_expired(apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_expired_hv_timer);
+
+void switch_to_hv_lapic_timer(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	WARN_ON(apic->lapic_timer.hv_timer_state != HV_TIMER_NOT_USED);
+
+	if (apic_lvtt_tscdeadline(apic) &&
+	    !atomic_read(&apic->lapic_timer.pending)) {
+		hrtimer_cancel(&apic->lapic_timer.timer);
+		/* In case the timer triggered in above small window */
+		if (!atomic_read(&apic->lapic_timer.pending)) {
+			apic->lapic_timer.hv_timer_state =
+				HV_TIMER_NEEDS_ARMING;
+			trace_kvm_hv_timer_state(vcpu->vcpu_id,
+					apic->lapic_timer.hv_timer_state);
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(switch_to_hv_lapic_timer);
+
+void switch_to_sw_lapic_timer(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	/* Possibly the TSC deadline timer is not enabled yet */
+	if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
+		return;
+
+	if (apic->lapic_timer.hv_timer_state == HV_TIMER_ARMED)
+		kvm_x86_ops->cancel_hv_timer(vcpu);
+	apic->lapic_timer.hv_timer_state = HV_TIMER_NOT_USED;
+
+	if (atomic_read(&apic->lapic_timer.pending))
+		return;
+
+	/*
+	 * Don't trigger the apic_timer_expired() for deadlock,
+	 * because the swake_up() from apic_timer_expired() will
+	 * try to get the run queue lock, which has been held here
+	 * since we are in context switch procedure already.
+	 */
+	start_sw_tscdeadline(apic, 1);
+}
+EXPORT_SYMBOL_GPL(switch_to_sw_lapic_timer);
+
 static void start_apic_timer(struct kvm_lapic *apic)
 {
 	ktime_t now;
@@ -1389,7 +1468,19 @@ static void start_apic_timer(struct kvm_lapic *apic)
 			   ktime_to_ns(ktime_add_ns(now,
 					apic->lapic_timer.period)));
 	} else if (apic_lvtt_tscdeadline(apic)) {
-		start_sw_tscdeadline(apic);
+		/* lapic timer in tsc deadline mode */
+		if (kvm_x86_ops->set_hv_timer) {
+			if (unlikely(!apic->lapic_timer.tscdeadline ||
+					!apic->vcpu->arch.virtual_tsc_khz))
+				return;
+
+			/* Expired timer will be checked on vcpu_run() */
+			apic->lapic_timer.hv_timer_state =
+				HV_TIMER_NEEDS_ARMING;
+			trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
+					apic->lapic_timer.hv_timer_state);
+		} else
+			start_sw_tscdeadline(apic, 0);
 	}
 }
 
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 891c6da7d4aa..dc4fd8eea04d 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -12,6 +12,12 @@
 #define KVM_APIC_SHORT_MASK	0xc0000
 #define KVM_APIC_DEST_MASK	0x800
 
+enum {
+	HV_TIMER_NOT_USED,
+	HV_TIMER_NEEDS_ARMING,
+	HV_TIMER_ARMED,
+};
+
 struct kvm_timer {
 	struct hrtimer timer;
 	s64 period; 				/* unit: ns */
@@ -20,6 +26,7 @@ struct kvm_timer {
 	u64 tscdeadline;
 	u64 expired_tscdeadline;
 	atomic_t pending;			/* accumulated triggered timers */
+	int hv_timer_state;
 };
 
 struct kvm_lapic {
@@ -212,4 +219,8 @@ bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq,
 			struct kvm_vcpu **dest_vcpu);
 int kvm_vector_to_index(u32 vector, u32 dest_vcpus,
 			const unsigned long *bitmap, u32 bitmap_size);
+void switch_to_sw_lapic_timer(struct kvm_vcpu *vcpu);
+void switch_to_hv_lapic_timer(struct kvm_vcpu *vcpu);
+void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu);
+void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
 #endif
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 8de925031b5c..8e1b5e9c2e78 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -6,6 +6,7 @@
 #include <asm/svm.h>
 #include <asm/clocksource.h>
 #include <asm/pvclock-abi.h>
+#include <lapic.h>
 
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvm
@@ -1348,6 +1349,27 @@ TRACE_EVENT(kvm_avic_unaccelerated_access,
 		  __entry->vec)
 );
 
+#define kvm_trace_symbol_hv_timer		\
+	{HV_TIMER_NOT_USED, "no"},		\
+	{HV_TIMER_NEEDS_ARMING, "need_arming"},	\
+	{HV_TIMER_ARMED, "armed"}
+
+TRACE_EVENT(kvm_hv_timer_state,
+		TP_PROTO(unsigned int vcpu_id, unsigned int hv_timer_state),
+		TP_ARGS(vcpu_id, hv_timer_state),
+		TP_STRUCT__entry(
+			__field(unsigned int, vcpu_id)
+			__field(unsigned int, hv_timer_state)
+			),
+		TP_fast_assign(
+			__entry->vcpu_id = vcpu_id;
+			__entry->hv_timer_state = hv_timer_state;
+			),
+		TP_printk("vcpu_id %x hv_timer %s",
+			__entry->vcpu_id,
+			__print_symbolic(__entry->hv_timer_state,
+				kvm_trace_symbol_hv_timer))
+	   );
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2b29afa61715..dc0b8cae02b8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7576,6 +7576,11 @@ static int handle_pcommit(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int handle_preemption_timer(struct kvm_vcpu *vcpu)
+{
+	kvm_lapic_expired_hv_timer(vcpu);
+	return 1;
+}
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -7627,6 +7632,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_XRSTORS]                 = handle_xrstors,
 	[EXIT_REASON_PML_FULL]		      = handle_pml_full,
 	[EXIT_REASON_PCOMMIT]                 = handle_pcommit,
+	[EXIT_REASON_PREEMPTION_TIMER]	      = handle_preemption_timer,
 };
 
 static const int kvm_vmx_max_exit_handlers =
@@ -8678,6 +8684,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
 		vmx_set_interrupt_shadow(vcpu, 0);
 
+	kvm_lapic_arm_hv_timer(vcpu);
+
 	if (vmx->guest_pkru_valid)
 		__write_pkru(vmx->guest_pkru);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 269d576520ba..f4b50608568a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7724,11 +7724,15 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	if (kvm_x86_ops->set_hv_timer)
+		switch_to_hv_lapic_timer(vcpu);
 	kvm_x86_ops->sched_in(vcpu, cpu);
 }
 
 void kvm_arch_sched_out(struct kvm_vcpu *vcpu)
 {
+	if (kvm_x86_ops->set_hv_timer)
+		switch_to_sw_lapic_timer(vcpu);
 }
 
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
-- 
1.8.3.1



* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
From: David Matlack @ 2016-05-24 23:11 UTC
  To: Yunhong Jiang
  Cc: kvm list, Marcelo Tosatti, Radim Krčmář,
	Paolo Bonzini, Wanpeng Li

On Tue, May 24, 2016 at 3:27 PM, Yunhong Jiang
<yunhong.jiang@linux.intel.com> wrote:
> From: Yunhong Jiang <yunhong.jiang@gmail.com>
>
> Utilizing the VMX preemption timer for tsc deadline timer
> virtualization. The VMX preemption timer is armed when the vCPU is
> running, and a VMExit will happen if the virtual TSC deadline timer
> expires.
>
> When the vCPU thread is scheduled out, the tsc deadline timer
> virtualization will be switched to use the current solution, i.e. use
> the timer for it. It's switched back to VMX preemption timer when the
> vCPU thread is scheduled in.
>
> This solution avoids the complex OS's hrtimer system, and also the host
> timer interrupt handling cost, with a preemption_timer VMexit. It fits
> well for some NFV usage scenario, when the vCPU is bound to a pCPU and
> the pCPU is isolated, or some similar scenario.
>
> However, it possibly has impact if the vCPU thread is scheduled in/out
> very frequently, because it switches from/to the hrtimer emulation a lot.

What is the cost of the extra sched-in/out hooks?

>
> [...]
>
> +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
> +{
> +       struct kvm_lapic *apic = vcpu->arch.apic;
> +       u64 tscdeadline, guest_tsc;
> +
> +       if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> +               return;
> +
> +       tscdeadline = apic->lapic_timer.tscdeadline;
> +       guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +
> +       if (tscdeadline >= guest_tsc)
> +               kvm_x86_ops->set_hv_timer(vcpu, tscdeadline - guest_tsc);

Does this interact correctly with TSC scaling? IIUC this programs the
VMX preemption timer with a delay in guest cycles, rather than host
cycles.

> +       else
> +               kvm_x86_ops->set_hv_timer(vcpu, 0);
> [...]

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
From: yunhong jiang @ 2016-05-24 23:35 UTC
  To: David Matlack
  Cc: kvm list, Marcelo Tosatti, Radim Krčmář,
	Paolo Bonzini, Wanpeng Li

On Tue, 24 May 2016 16:11:29 -0700
David Matlack <dmatlack@google.com> wrote:

> On Tue, May 24, 2016 at 3:27 PM, Yunhong Jiang
> <yunhong.jiang@linux.intel.com> wrote:
> > [...]
> >
> > However, it possibly has impact if the vCPU thread is scheduled
> > in/out very frequently, because it switches from/to the hrtimer
> > emulation a lot.
> 
> What is the cost of the extra sched-in/out hooks?

The cost comes from each sched-in/out: kvm_sched_in/out has to switch
between the sw/hv timers, including calling
hrtimer_start()/hrtimer_cancel() to reprogram the hrtimer and updating
the VMCS to arm/disarm the VMX preemption timer. I think the
hrtimer_start()/hrtimer_cancel() calls are costly, although I have no
data on hand.

Or do you mean I should provide data for it? Do you have any
suggestions on a workload for comparing the difference? I think the
schedule hook will not impact the guest itself; instead, it makes the
scheduling procedure longer and thus impacts host system performance.

> > +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
> > +{
> > +       struct kvm_lapic *apic = vcpu->arch.apic;
> > +       u64 tscdeadline, guest_tsc;
> > +
> > +       if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> > +               return;
> > +
> > +       tscdeadline = apic->lapic_timer.tscdeadline;
> > +       guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > +
> > +       if (tscdeadline >= guest_tsc)
> > +               kvm_x86_ops->set_hv_timer(vcpu, tscdeadline -
> > guest_tsc);
> 
> Does this interact correctly with TSC scaling? IIUC this programs the
> VMX preemption timer with a delay in guest cycles, rather than host
> cycles.

Aha, good point, thanks!

But I re-checked lapic.c and it seems this issue has never been
considered? Check
http://lxr.free-electrons.com/source/arch/x86/kvm/lapic.c#L1335 and
lxr.free-electrons.com/source/arch/x86/kvm/lapic.c#L1335 for example.
Also, I tried to find a function to translate a guest TSC to a host
TSC and didn't find such a function at all. Did I miss anything?

Thanks
--jyh


* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
From: Paolo Bonzini @ 2016-05-25 10:40 UTC
  To: David Matlack, Yunhong Jiang
  Cc: kvm list, Marcelo Tosatti, Radim Krčmář, Wanpeng Li



On 25/05/2016 01:11, David Matlack wrote:
> > However, it possibly has impact if the vCPU thread is scheduled in/out
> > very frequently, because it switches from/to the hrtimer emulation a lot.
> 
> What is the cost of the extra sched-in/out hooks?

I still have to review this version of the patches, but I think the
hooks can be removed, with the timer switching moved to
blocking/unblocking instead.

Thanks,

Paolo


* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
From: Paolo Bonzini @ 2016-05-25 11:52 UTC
  To: Yunhong Jiang, kvm; +Cc: mtosatti, rkrcmar, kernellwp



On 25/05/2016 00:27, Yunhong Jiang wrote:
> [...]

Thanks, this looks very nice.

As I mentioned in my reply to Marcelo, I have some more changes in mind
to reduce the overhead.  This way we can enable it unconditionally (on
processors that do not have the erratum).   In particular:

1) While a virtual machine is scheduled out due to contention, you don't
care if the timer fires.  You can fire the timer immediately after the
next vmentry.  You only need to set a timer during HLT.
So, rather than sched_out/sched_in, you can start and stop the hrtimer
in vmx_pre_block and vmx_post_block.  You probably should rename what is
now vmx_pre_block and vmx_post_block to something like pi_pre_block and
pi_post_block, so that vmx_pre_block and vmx_post_block are like

	if (pi_pre_block(vcpu))
		return 1;
	if (have preemption timer) {
		kvm_lapic_switch_to_sw_timer(vcpu);

and

	if (have preemption timer)
		kvm_lapic_switch_to_hv_timer(vcpu);
	pi_post_block(vcpu);

respectively.

You'll also need to check whether the deadlock in
switch_to_sw_lapic_timer is still possible.  Hopefully not---if so,
simpler code! :)

2) if possible, I would like to remove kvm_lapic_arm_hv_timer.  Instead,
have switch_to_hv_lapic_timer and start_apic_timer call
kvm_x86_ops->set_hv_timer directly, passing the desired *guest* TSC.
set_hv_timer can convert the guest TSC to a host TSC and save the
desired host TSC in struct vmx_vcpu.  vcpu_run checks if there is a TSC
deadline and, if so, writes the delta into VMX_PREEMPTION_TIMER_VALUE.
set_hv_timer/cancel_hv_timer only write to PIN_BASED_VM_EXEC_CONTROL.

Besides making vcpu_run faster, I think that with this change the
distinction between HV_TIMER_NEEDS_ARMING and HV_TIMER_ARMED disappears,
and struct kvm_lapic can simply have a bool hv_timer_in_use.  Again, it
might actually end up making the code simpler, especially the interface
between lapic.c and vmx.c.

The hardest part is converting guest_tsc to host_tsc.  From

	guest_tsc = (host_tsc * vcpu->arch.tsc_scaling_ratio >>
			kvm_tsc_scaling_ratio_frac_bits) + tsc_offset

you have

	host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
			<< kvm_tsc_scaling_ratio_frac_bits)
			/ vcpu->arch.tsc_scaling_ratio;

... except that you need a division with a 128-bit dividend and a 64-bit
divisor here, and there is no such thing in Linux.  It's okay if you
restrict this to 64-bit hosts and use the divq assembly instruction
directly.  (You also need to trap the divide overflow exception; if
there is an overflow just disable the preemption timer).  On 32-bit
systems, it's okay to force-disable set_hv_timer/cancel_hv_timer, i.e.
set them to NULL.
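
For instance, a minimal 64-bit-only sketch of that conversion (the
function name is made up; it avoids the #DE by checking for overflow
up front) could look like:

	/* Convert a guest-TSC delta to a host-TSC delta; 64-bit hosts
	 * only, and assumes 0 < frac_bits < 64.  Returns -ERANGE when
	 * the quotient does not fit in 64 bits, in which case the
	 * caller should simply disable the preemption timer. */
	static int guest_to_host_tsc_delta(u64 guest_delta, u64 ratio,
					   unsigned int frac_bits,
					   u64 *host_delta)
	{
		u64 lo = guest_delta << frac_bits;
		u64 hi = guest_delta >> (64 - frac_bits);

		if (hi >= ratio)
			return -ERANGE;	/* divq would fault */

		/* divq divides %rdx:%rax by the operand */
		asm("divq %2" : "=a" (lo), "=d" (hi)
			      : "rm" (ratio), "0" (lo), "1" (hi));
		*host_delta = lo;
		return 0;
	}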

And if you do this, pre_block and post_block become:

	if (pi_pre_block(vcpu))
		return 1;
	if (preemption timer bit set in VMCS)
		kvm_lapic_start_sw_timer(vcpu);

and

	if (preemption timer bit set in VMCS)
		kvm_lapic_cancel_sw_timer(vcpu);
	pi_post_block(vcpu);

There is no need to redo the preemption timer setup in post_block.

3) regarding the erratum, you can use the code from
https://www.virtualbox.org/svn/vbox/trunk/src/VBox/VMM/VMMR0/HMR0.cpp
(function hmR0InitIntelIsSubjectToVmxPreemptionTimerErratum... please
rename it to something else! :)).

Thanks,

Paolo


* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-24 23:35     ` yunhong jiang
@ 2016-05-25 11:58       ` Paolo Bonzini
  2016-05-25 22:53         ` yunhong jiang
  0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2016-05-25 11:58 UTC (permalink / raw)
  To: yunhong jiang, David Matlack
  Cc: kvm list, Marcelo Tosatti, Radim Krčmář, Wanpeng Li



On 25/05/2016 01:35, yunhong jiang wrote:
>> > Does this interact correctly with TSC scaling? IIUC this programs the
>> > VMX preemption timer with a delay in guest cycles, rather than host
>> > cycles.
> Aha, good point, thanks!
> 
> But I re-checked lapic.c and it seems this issue has never been considered?
> Check http://lxr.free-electrons.com/source/arch/x86/kvm/lapic.c#L1335 

This is a bug indeed.  The simplest fix would be to ignore
lapic_timer_advance_ns when TSC scaling is in use.
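
Something along these lines (untested sketch; the check is the whole
point, the exact placement is illustrative):

	/* in wait_lapic_expire(), before the __delay() */
	if (vcpu->arch.tsc_scaling_ratio != kvm_default_tsc_scaling_ratio)
		return;	/* delta is in guest cycles, __delay() busy-waits
			 * host cycles, so the advance would be wrong */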

> and lxr.free-electrons.com/source/arch/x86/kvm/lapic.c#L1335

You copied the wrong URL, but at least lines 1383-1410 seem okay to me.

> for example. Also I
> tried to find a function to translate guest tsc to host tsc and didn't find
> such function at all.

See my other reply.

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-24 22:27 ` [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer Yunhong Jiang
  2016-05-24 23:11   ` David Matlack
  2016-05-25 11:52   ` Paolo Bonzini
@ 2016-05-25 13:27   ` Radim Krčmář
  2016-05-25 13:51     ` Paolo Bonzini
  2016-05-25 13:45   ` Radim Krčmář
  3 siblings, 1 reply; 32+ messages in thread
From: Radim Krčmář @ 2016-05-25 13:27 UTC (permalink / raw)
  To: Yunhong Jiang; +Cc: kvm, mtosatti, pbonzini, kernellwp

2016-05-24 15:27-0700, Yunhong Jiang:
> From: Yunhong Jiang <yunhong.jiang@gmail.com>
> 
> Utilizing the VMX preemption timer for tsc deadline timer
> virtualization. The VMX preemption timer is armed when the vCPU is
> running, and a VMExit will happen if the virtual TSC deadline timer
> expires.
> 
> When the vCPU thread is scheduled out, the tsc deadline timer
> virtualization will be switched to use the current solution, i.e. use
> the timer for it. It's switched back to VMX preemption timer when the
> vCPU thread is scheduled in.
> 
> This solution avoids the complex OS's hrtimer system, and also the host
> timer interrupt handling cost, with a preemption_timer VMexit. It fits
> well for some NFV usage scenario, when the vCPU is bound to a pCPU and
> the pCPU is isolated, or some similar scenario.
> 
> However, it possibly has impact if the vCPU thread is scheduled in/out
> very frequently, because it switches from/to the hrtimer emulation a lot.
> 
> Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> ---
>  arch/x86/kvm/lapic.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/lapic.h | 11 ++++++
>  arch/x86/kvm/trace.h | 22 ++++++++++++
>  arch/x86/kvm/vmx.c   |  8 +++++
>  arch/x86/kvm/x86.c   |  4 +++
>  5 files changed, 139 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> @@ -1343,6 +1343,85 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
> +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
> +{

Preemption timer needs to be adjusted on every vmentry while hrtimer is
set only once.  After how many vmentries does preemption timer surpass
the overhead of hrtimer?

> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +	u64 tscdeadline, guest_tsc;
> +
> +	if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> +		return;
> +
> +	tscdeadline = apic->lapic_timer.tscdeadline;
> +	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +
> +	if (tscdeadline >= guest_tsc)
> +		kvm_x86_ops->set_hv_timer(vcpu, tscdeadline - guest_tsc);
> +	else
> +		kvm_x86_ops->set_hv_timer(vcpu, 0);

Calling kvm_lapic_expired_hv_timer will avoid the useless immediate
vmexit after a vmentry.

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-25 10:40     ` Paolo Bonzini
@ 2016-05-25 13:38       ` Radim Krčmář
  0 siblings, 0 replies; 32+ messages in thread
From: Radim Krčmář @ 2016-05-25 13:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Matlack, Yunhong Jiang, kvm list, Marcelo Tosatti, Wanpeng Li

2016-05-25 12:40+0200, Paolo Bonzini:
> On 25/05/2016 01:11, David Matlack wrote:
>> > However, it possibly has impact if the vCPU thread is scheduled in/out
>> > very frequently, because it switches from/to the hrtimer emulation a lot.
>> 
>> What is the cost of the extra sched-in/out hooks?
> 
> I still have to review this version of the patches, but I think they can
> be removed and moved to blocking/unblocking instead.

I think so as well.  hrtimer does not make a difference to a preempted
VCPU or VCPU in userspace, only the blocking sleep needs to be
interrupted.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-24 22:27 ` [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer Yunhong Jiang
                     ` (2 preceding siblings ...)
  2016-05-25 13:27   ` Radim Krčmář
@ 2016-05-25 13:45   ` Radim Krčmář
  2016-05-25 22:57     ` yunhong jiang
  3 siblings, 1 reply; 32+ messages in thread
From: Radim Krčmář @ 2016-05-25 13:45 UTC (permalink / raw)
  To: Yunhong Jiang; +Cc: kvm, mtosatti, pbonzini, kernellwp

2016-05-24 15:27-0700, Yunhong Jiang:
> From: Yunhong Jiang <yunhong.jiang@gmail.com>
> 
> Utilizing the VMX preemption timer for tsc deadline timer
> virtualization. The VMX preemption timer is armed when the vCPU is
> running, and a VMExit will happen if the virtual TSC deadline timer
> expires.
> 
> When the vCPU thread is scheduled out, the tsc deadline timer
> virtualization will be switched to use the current solution, i.e. use
> the timer for it. It's switched back to VMX preemption timer when the
> vCPU thread is scheduled in.
> 
> This solution avoids the complex OS's hrtimer system, and also the host
> timer interrupt handling cost, with a preemption_timer VMexit. It fits
> well for some NFV usage scenario, when the vCPU is bound to a pCPU and
> the pCPU is isolated, or some similar scenario.
> 
> However, it possibly has impact if the vCPU thread is scheduled in/out
> very frequently, because it switches from/to the hrtimer emulation a lot.
> 
> Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> ---
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>  static const int kvm_vmx_max_exit_handlers =
> @@ -8678,6 +8684,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
>  		vmx_set_interrupt_shadow(vcpu, 0);
>  
> +	kvm_lapic_arm_hv_timer(vcpu);
> +

Every cycle that the host spends in root mode after setting the timeout
makes the preemption timer delayed.  Can't we move the setup closer to
vmlaunch, i.e. after atomic_switch_perf_msrs()?

>  	if (vmx->guest_pkru_valid)
>  		__write_pkru(vmx->guest_pkru);
>  

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-25 13:27   ` Radim Krčmář
@ 2016-05-25 13:51     ` Paolo Bonzini
  2016-05-25 14:31       ` Radim Krčmář
  0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2016-05-25 13:51 UTC (permalink / raw)
  To: Radim Krčmář, Yunhong Jiang; +Cc: kvm, mtosatti, kernellwp



On 25/05/2016 15:27, Radim Krčmář wrote:
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > @@ -1343,6 +1343,85 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
> > +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
> > +{
> 
> Preemption timer needs to be adjusted on every vmentry while hrtimer is
> set only once.  After how many vmentries does preemption timer surpass
> the overhead of hrtimer?

See my review. :)  I hope to get the per-vmentry cost of the preemption
timer down to a dozen instructions.

> > +	struct kvm_lapic *apic = vcpu->arch.apic;
> > +	u64 tscdeadline, guest_tsc;
> > +
> > +	if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> > +		return;
> > +
> > +	tscdeadline = apic->lapic_timer.tscdeadline;
> > +	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > +
> > +	if (tscdeadline >= guest_tsc)
> > +		kvm_x86_ops->set_hv_timer(vcpu, tscdeadline - guest_tsc);
> > +	else
> > +		kvm_x86_ops->set_hv_timer(vcpu, 0);
> 
> Calling kvm_lapic_expired_hv_timer will avoid the useless immediate
> vmexit after a vmentry.

Even if this is common enough to warrant optimizing for it (which I'm
not sure about, and surely depends on the workload), I think it should
be done later.  You'd get the same "useless vmentry" even with
hrtimer-based LAPIC timer.

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-25 13:51     ` Paolo Bonzini
@ 2016-05-25 14:31       ` Radim Krčmář
  2016-05-25 23:13         ` yunhong jiang
  2016-06-14 11:34         ` Paolo Bonzini
  0 siblings, 2 replies; 32+ messages in thread
From: Radim Krčmář @ 2016-05-25 14:31 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Yunhong Jiang, kvm, mtosatti, kernellwp

2016-05-25 15:51+0200, Paolo Bonzini:
> On 25/05/2016 15:27, Radim Krčmář wrote:
>> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> > @@ -1343,6 +1343,85 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
>> > +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
>> > +{
>> 
>> Preemption timer needs to be adjusted on every vmentry while hrtimer is
>> set only once.  After how many vmentries does preemption timer surpass
>> the overhead of hrtimer?
> 
> See my review. :)  I hope to get the per-vmentry cost of the preemption
> timer down to a dozen instructions.

I secretly wished to learn what the overhead of hrtimer is. :)

Seems like we need a vmexit kvm-unit-test that enables tsc deadline
timer while benchmarking.
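
Roughly, on the guest side (a sketch in kvm-unit-tests style;
TIMER_VECTOR and the harness wiring are assumptions of mine):

	/* put the LAPIC timer into TSC deadline mode with a deadline far
	 * in the future, so every entry in the benchmark loop pays the
	 * arming cost but the timer never actually fires */
	apic_write(APIC_LVTT, APIC_LVT_TIMER_TSCDEADLINE | TIMER_VECTOR);
	wrmsr(MSR_IA32_TSCDEADLINE, rdtsc() + (1ull << 40));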

>> > +	if (tscdeadline >= guest_tsc)
>> > +		kvm_x86_ops->set_hv_timer(vcpu, tscdeadline - guest_tsc);
>> > +	else
>> > +		kvm_x86_ops->set_hv_timer(vcpu, 0);
>> 
>> Calling kvm_lapic_expired_hv_timer will avoid the useless immediate
>> vmexit after a vmentry.
> 
> Even if this is common enough to warrant optimizing for it (which I'm
> not sure about, and surely depends on the workload), I think it should
> be done later.  You'd get the same "useless vmentry" even with
> hrtimer-based LAPIC timer.

Makes sense.  And early return gets harder to do if we move the setup
closer to vmlaunch.

>                 You'd get the same "useless vmentry" even with
> hrtimer-based LAPIC timer.

(hrtimers won't do an entry/exit cycle if they are injected sufficiently
 early.  Maybe preemption or userspace exits would make the difference
 noticeable, but it should be minor too.)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-25 11:52   ` Paolo Bonzini
@ 2016-05-25 22:44     ` yunhong jiang
  2016-05-26 14:05       ` Alan Jenkins
  2016-06-04  0:24     ` yunhong jiang
  1 sibling, 1 reply; 32+ messages in thread
From: yunhong jiang @ 2016-05-25 22:44 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, mtosatti, rkrcmar, kernellwp

On Wed, 25 May 2016 13:52:56 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 25/05/2016 00:27, Yunhong Jiang wrote:
> > From: Yunhong Jiang <yunhong.jiang@gmail.com>
> > 
> > Utilizing the VMX preemption timer for tsc deadline timer
> > virtualization. The VMX preemption timer is armed when the vCPU is
> > running, and a VMExit will happen if the virtual TSC deadline timer
> > expires.
> > 
> > When the vCPU thread is scheduled out, the tsc deadline timer
> > virtualization will be switched to use the current solution, i.e.
> > use the timer for it. It's switched back to VMX preemption timer
> > when the vCPU thread is scheduled in.
> > 
> > This solution avoids the complex OS's hrtimer system, and also the
> > host timer interrupt handling cost, with a preemption_timer VMexit.
> > It fits well for some NFV usage scenario, when the vCPU is bound to
> > a pCPU and the pCPU is isolated, or some similar scenario.
> > 
> > However, it possibly has impact if the vCPU thread is scheduled
> > in/out very frequently, because it switches from/to the hrtimer
> > emulation a lot.
> > 
> > Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> > ---
> >  arch/x86/kvm/lapic.c | 97
> > ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > arch/x86/kvm/lapic.h | 11 ++++++ arch/x86/kvm/trace.h | 22
> > ++++++++++++ arch/x86/kvm/vmx.c   |  8 +++++
> >  arch/x86/kvm/x86.c   |  4 +++
> >  5 files changed, 139 insertions(+), 3 deletions(-)
> 
> Thanks, this looks very nice.
> 
> As I mentioned in my reply to Marcelo, I have some more changes in
> mind to reduce the overhead.  This way we can enable it
> unconditionally (on processors that do not have the erratum).   In
> particular:
> 
> 1) While a virtual machine is scheduled out due to contention, you
> don't care if the timer fires.  You can fire the timer immediately

I'm not very sure if this statement applies to all situations, but I can't
think of any exception. So sure, I will work this way.

> after the next vmentry.  You only need to set a timer during HLT.
> So, rather than sched_out/sched_in, you can start and stop the hrtimer
> in vmx_pre_block and vmx_post_block.  You probably should rename what
> is now vmx_pre_block and vmx_post_block to something like
> pi_pre_block and pi_post_block, so that vmx_pre_block and
> vmx_post_block are like
> 
> 	if (pi_pre_block(vcpu))
> 		return 1;
> 	if (have preemption timer) {
> 		kvm_lapic_switch_to_sw_timer(vcpu);
> 
> and
> 
> 	if (have preemption timer)
> 		kvm_lapic_switch_to_hv_timer(vcpu);
> 	pi_post_block(vcpu);
> 
> respectively.
> 
> You'll also need to check whether the deadlock in
> switch_to_sw_lapic_timer is still possible.  Hopefully not---if so,
> simpler code! :)

Hmm, the deadlock should not happen here, but I will try and see.

> 
> 2) if possible, I would like to remove kvm_lapic_arm_hv_timer.
> Instead, make switch_to_hv_lapic_timer and start_apic_timer can call
> kvm_x86_ops->set_hv_timer, passing the desired *guest* TSC.
> set_hv_timer can convert the guest TSC to a host TSC and save the
> desired host TSC in struct vmx_vcpu.  vcpu_run checks if there is a
> TSC deadline and, if so, writes the delta into
> VMX_PREEMPTION_TIMER_VALUE. set_hv_timer/cancel_hv_timer only write
> to PIN_BASED_VM_EXEC_CONTROL.
> 
> Besides making vcpu_run faster, I think that with this change the
> distinction between HV_TIMER_NEEDS_ARMING and HV_TIMER_ARMED
> disappears, and struct kvm_lapic can simply have a bool
> hv_timer_in_use.  Again, it might actually end up making the code
> simpler, especially the interface between lapic.c and vmx.c.

Yes, this should save two VMCS accesses on vcpu_run(). Will do it this way.

> 
> The hardest part is converting guest_tsc to host_tsc.  From
> 
> 	guest_tsc = (host_tsc * vcpu->arch.tsc_scaling_ratio >>
> 			kvm_tsc_scaling_ratio_frac_bits) + tsc_offset
> 
> you have
> 
> 	host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
> 			<< kvm_tsc_scaling_ratio_frac_bits)
> 			/ vcpu->arch.tsc_scaling_ratio;
> 
> ... except that you need a division with a 128-bit dividend and a
> 64-bit divisor here, and there is no such thing in Linux.  It's okay
> if you restrict this to 64-bit hosts and use the divq assembly
> instruction directly.  (You also need to trap the divide overflow
> exception; if there is an overflow just disable the preemption
> timer).  On 32-bit systems, it's okay to force-disable
> set_hv_timer/cancel_hv_timer, i.e. set them to NULL.

I'm scared by this conversion :) Sure, will have a try. Need to first check
how scale_tsc works.
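
For my own reference, this is roughly how I understand the conversion
(64-bit host only; untested sketch, the helper name is made up and the
caller would have to pass in the vcpu's TSC offset):

	static bool guest_tsc_to_host_tsc(struct kvm_vcpu *vcpu, u64 tsc_offset,
					  u64 guest_tsc, u64 *host_tsc)
	{
		unsigned __int128 dividend =
			(unsigned __int128)(guest_tsc - tsc_offset)
				<< kvm_tsc_scaling_ratio_frac_bits;
		u64 lo = (u64)dividend, hi = (u64)(dividend >> 64);
		u64 ratio = vcpu->arch.tsc_scaling_ratio;

		/* divq raises #DE if the quotient does not fit in 64 bits */
		if (hi >= ratio)
			return false;	/* overflow, keep using the hrtimer */

		/* rdx:rax / ratio -> quotient in rax, remainder in rdx */
		asm("divq %2" : "+a"(lo), "+d"(hi) : "rm"(ratio));
		*host_tsc = lo;
		return true;
	}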

> 
> And if you do this, pre_block and post_block become:
> 
> 	if (pi_pre_block(vcpu))
> 		return 1;
> 	if (preemption timer bit set in VMCS)
> 		kvm_lapic_start_sw_timer(vcpu);
> 
> and
> 
> 	if (preemption timer bit set in VMCS)
> 		kvm_lapic_cancel_sw_timer(vcpu);
> 	pi_post_block(vcpu);
> 
> There is no need to redo the preemption timer setup in post_block.
> 
> 3) regarding the erratum, you can use the code from
> https://www.virtualbox.org/svn/vbox/trunk/src/VBox/VMM/VMMR0/HMR0.cpp
> (function hmR0InitIntelIsSubjectToVmxPreemptionTimerErratum... please
> rename it to something else! :)).

Thanks for the information.

--jyh


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-25 11:58       ` Paolo Bonzini
@ 2016-05-25 22:53         ` yunhong jiang
  2016-05-26  7:20           ` Paolo Bonzini
  0 siblings, 1 reply; 32+ messages in thread
From: yunhong jiang @ 2016-05-25 22:53 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Matlack, kvm list, Marcelo Tosatti,
	Radim Krčmář,
	Wanpeng Li

On Wed, 25 May 2016 13:58:17 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 25/05/2016 01:35, yunhong jiang wrote:
> >> > Does this interact correctly with TSC scaling? IIUC this
> >> > programs the VMX preemption timer with a delay in guest cycles,
> >> > rather than host cycles.
> > Aha, good point, thanks!
> > 
> > But I re-checked lapic.c and it seems this issue has never been
> > considered? Check
> > http://lxr.free-electrons.com/source/arch/x86/kvm/lapic.c#L1335 
> 
> This is a bug indeed.  The simplest fix would be to ignore
> lapic_timer_advance_ns when TSC scaling is in use.

Hmm, I can fix this on my next submission. Or someone can send out a separate
patch for it.

> 
> > and lxr.free-electrons.com/source/arch/x86/kvm/lapic.c#L1335
> 
> You copied the wrong URL, but at least lines 1383-1410 seem okay to
> me.

Yes, I mean 1383~1410. On L1402, "expire = ktime_add_ns(now, ns)", I think the
"now" is host tsc while "ns" is guest delta tsc, which seems like a bug to me.
Or did I misunderstand something?

Thanks
--jyh

> 
> > for example. Also I
> > tried to find a function to translate guest tsc to host tsc and
> > didn't find such function at all.
> 
> See my other reply.
> 
> Thanks,
> 
> Paolo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-25 13:45   ` Radim Krčmář
@ 2016-05-25 22:57     ` yunhong jiang
  0 siblings, 0 replies; 32+ messages in thread
From: yunhong jiang @ 2016-05-25 22:57 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: kvm, mtosatti, pbonzini, kernellwp

On Wed, 25 May 2016 15:45:53 +0200
Radim Krčmář <rkrcmar@redhat.com> wrote:

> 2016-05-24 15:27-0700, Yunhong Jiang:
> > From: Yunhong Jiang <yunhong.jiang@gmail.com>
> > 
> > Utilizing the VMX preemption timer for tsc deadline timer
> > virtualization. The VMX preemption timer is armed when the vCPU is
> > running, and a VMExit will happen if the virtual TSC deadline timer
> > expires.
> > 
> > When the vCPU thread is scheduled out, the tsc deadline timer
> > virtualization will be switched to use the current solution, i.e.
> > use the timer for it. It's switched back to VMX preemption timer
> > when the vCPU thread is scheduled in.
> > 
> > This solution avoids the complex OS's hrtimer system, and also the
> > host timer interrupt handling cost, with a preemption_timer VMexit.
> > It fits well for some NFV usage scenario, when the vCPU is bound to
> > a pCPU and the pCPU is isolated, or some similar scenario.
> > 
> > However, it possibly has impact if the vCPU thread is scheduled
> > in/out very frequently, because it switches from/to the hrtimer
> > emulation a lot.
> > 
> > Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> > ---
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >  static const int kvm_vmx_max_exit_handlers =
> > @@ -8678,6 +8684,8 @@ static void __noclone vmx_vcpu_run(struct
> > kvm_vcpu *vcpu) if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> >  		vmx_set_interrupt_shadow(vcpu, 0);
> >  
> > +	kvm_lapic_arm_hv_timer(vcpu);
> > +
> 
> Every cycle that the host spends in root mode after setting the
> timeout makes the preemption timer delayed.  Can't we move the setup
> closer to vmlaunch, i.e. after atomic_switch_perf_msrs()?

Yes, thanks for pointing this out.

--jyh

> 
> >  	if (vmx->guest_pkru_valid)
> >  		__write_pkru(vmx->guest_pkru);
> >  


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-25 14:31       ` Radim Krčmář
@ 2016-05-25 23:13         ` yunhong jiang
  2016-06-14 11:34         ` Paolo Bonzini
  1 sibling, 0 replies; 32+ messages in thread
From: yunhong jiang @ 2016-05-25 23:13 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: Paolo Bonzini, kvm, mtosatti, kernellwp

On Wed, 25 May 2016 16:31:19 +0200
Radim Krčmář <rkrcmar@redhat.com> wrote:

> 2016-05-25 15:51+0200, Paolo Bonzini:
> > On 25/05/2016 15:27, Radim Krčmář wrote:
> >> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >> > @@ -1343,6 +1343,85 @@ static void start_sw_tscdeadline(struct
> >> > kvm_lapic *apic) +void kvm_lapic_arm_hv_timer(struct kvm_vcpu
> >> > *vcpu) +{
> >> 
> >> Preemption timer needs to be adjusted on every vmentry while
> >> hrtimer is set only once.  After how many vmentries does
> >> preemption timer surpass the overhead of hrtimer?
> > 
> > See my review. :)  I hope to get the per-vmentry cost of the
> > preemption timer down to a dozen instructions.
> 
> I secretly wished to learn what the overhead of hrtimer is. :)

I think the cost includes "hrtimer cost + timer interrupt ISR +
vm-exit/entry". The cost of hrtimer_start/hrtimer_cancel depends on the
current hrtimer situation.  The timer interrupt ISR costs far more than
the VMExit handling.  In systems w/o PI, a pair of vm-exit/entry can be
caused by the host timer interrupt.

The initial intention of the patch is to reduce latency, so we
didn't compare the overhead of hrtimer with the cost of vmentries, and
that's the reason for a kernel parameter to turn it on/off. I'd still
keep that parameter even with Paolo's new suggestion, although it can be
ON by default now.

BTW, if the VMX preemption timer used the host tsc, instead of a delta
tsc, as its parameter, there would be no extra cost at all :(

Thanks
--jyh

> 
> Seems like we need a vmexit kvm-unit-test that enables tsc deadline
> timer while benchmarking.
> 
> >> > +	if (tscdeadline >= guest_tsc)
> >> > +		kvm_x86_ops->set_hv_timer(vcpu, tscdeadline -
> >> > guest_tsc);
> >> > +	else
> >> > +		kvm_x86_ops->set_hv_timer(vcpu, 0);
> >> 
> >> Calling kvm_lapic_expired_hv_timer will avoid the useless immediate
> >> vmexit after a vmentry.
> > 
> > Even if this is common enough to warrant optimizing for it (which
> > I'm not sure about, and surely depends on the workload), I think it
> > should be done later.  You'd get the same "useless vmentry" even
> > with hrtimer-based LAPIC timer.
> 
> Makes sense.  And early return gets harder to do if we move the setup
> closer to vmlaunch.
> 
> >                 You'd get the same "useless vmentry" even with
> > hrtimer-based LAPIC timer.
> 
> (hrtimers won't do an entry/exit cycle if they are injected
> sufficiently early.  Maybe preemption or userspace exits would make
> the difference noticeable, but it should be minor too.)


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-25 22:53         ` yunhong jiang
@ 2016-05-26  7:20           ` Paolo Bonzini
  0 siblings, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2016-05-26  7:20 UTC (permalink / raw)
  To: yunhong jiang
  Cc: David Matlack, kvm list, Marcelo Tosatti,
	Radim Krčmář,
	Wanpeng Li



On 26/05/2016 00:53, yunhong jiang wrote:
> Hmm, I can fix this on my next submission. Or someone can send out a separate
> patch for it.

Alan Jenkins is looking at it in another thread.

>> > You copied the wrong URL, but at least lines 1383-1410 seem okay to
>> > me.
> Yes, I mean 1383~1410. On L1402, "expire = ktime_add_ns(now, ns)", I think the
> "now" is host tsc while "ns" is guest delta tsc, which seems like a bug to me.
> Or did I misunderstand something?

now is nanoseconds, coming from struct hrtimer_clock_base.

ns is nanoseconds, computed from (tscdeadline - guest_tsc) * 1000000 /
vcpu->arch.virtual_tsc_khz.

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-25 22:44     ` yunhong jiang
@ 2016-05-26 14:05       ` Alan Jenkins
  2016-05-26 15:32         ` Paolo Bonzini
  0 siblings, 1 reply; 32+ messages in thread
From: Alan Jenkins @ 2016-05-26 14:05 UTC (permalink / raw)
  To: yunhong jiang, Paolo Bonzini; +Cc: kvm, mtosatti, rkrcmar, kernellwp

On 25/05/16 23:44, yunhong jiang wrote:
> On Wed, 25 May 2016 13:52:56 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Yes, this should save two VMCS accesses on vcpu_run(). Will do it this way.
>
>>
>> The hardest part is converting guest_tsc to host_tsc.  From
>>
>> 	guest_tsc = (host_tsc * vcpu->arch.tsc_scaling_ratio >>
>> 			kvm_tsc_scaling_ratio_frac_bits) + tsc_offset
>>
>> you have
>>
>> 	host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
>> 			<< kvm_tsc_scaling_ratio_frac_bits)
>> 			/ vcpu->arch.tsc_scaling_ratio;
>>
>> ... except that you need a division with a 128-bit dividend and a
>> 64-bit divisor here, and there is no such thing in Linux.  It's okay
>> if you restrict this to 64-bit hosts and use the divq assembly
>> instruction directly.  (You also need to trap the divide overflow
>> exception; if there is an overflow just disable the preemption
>> timer).  On 32-bit systems, it's okay to force-disable
>> set_hv_timer/cancel_hv_timer, i.e. set them to NULL.
>
> I'm scared by this conversion :) Sure, will have a try. Need to first check
> how scale_tsc works.

1. Against my interests: have you actually confirmed the VMX preemption 
timer is affected by guest TSC scaling?

    It's not explicit in the SDM.  The way it's described for allocating 
timeslices to the guest, IMO it makes more sense if it is not scaled.

2. Paolo, any chance I could also get away with requiring 64-bit?  I.e.

#ifdef CONFIG_X86_64
#define HAVE_LAPIC_TIMER_ADVANCE 1
#endif

Regards
Alan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-26 14:05       ` Alan Jenkins
@ 2016-05-26 15:32         ` Paolo Bonzini
  0 siblings, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2016-05-26 15:32 UTC (permalink / raw)
  To: Alan Jenkins, yunhong jiang; +Cc: kvm, mtosatti, rkrcmar, kernellwp



On 26/05/2016 16:05, Alan Jenkins wrote:
> 1. Against my interests: have you actually confirmed the VMX preemption
> timer is affected by guest TSC scaling?

No, it's not, hence my formula:

    host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
            << kvm_tsc_scaling_ratio_frac_bits)
            / vcpu->arch.tsc_scaling_ratio;

The idea is that once you have a host TSC deadline, you can use it to
set the preemption timer on vmentry.
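
Something like this on every vmentry (sketch; hv_deadline_tsc is a field
I am assuming gets added to struct vcpu_vmx, with -1 meaning "not
armed"):

	if (vmx->hv_deadline_tsc != -1) {
		u64 tscl = rdtsc();
		/* clamp to 0 so an already-expired deadline exits at once */
		u64 delta = vmx->hv_deadline_tsc > tscl ?
				vmx->hv_deadline_tsc - tscl : 0;
		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE,
			     (u32)(delta >> cpu_preemption_timer_multi));
	}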

>    It's not explicit in the SDM.  The way it's described for allocating
> timeslices to the guest, IMO it makes more sense if it is not scaled.
> 
> 2. Paolo, any chance I could also get away with requiring 64-bit?  I.e.
> 
> #ifdef CONFIG_X86_64
> #define HAVE_LAPIC_TIMER_ADVANCE 1
> #endif

I wouldn't object.

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-25 11:52   ` Paolo Bonzini
  2016-05-25 22:44     ` yunhong jiang
@ 2016-06-04  0:24     ` yunhong jiang
  2016-06-06 13:49       ` Paolo Bonzini
  1 sibling, 1 reply; 32+ messages in thread
From: yunhong jiang @ 2016-06-04  0:24 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, mtosatti, rkrcmar, kernellwp

On Wed, 25 May 2016 13:52:56 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 25/05/2016 00:27, Yunhong Jiang wrote:
> > From: Yunhong Jiang <yunhong.jiang@gmail.com>
> > 
> > Utilizing the VMX preemption timer for tsc deadline timer
> > virtualization. The VMX preemption timer is armed when the vCPU is
> > running, and a VMExit will happen if the virtual TSC deadline timer
> > expires.
> > 
> > When the vCPU thread is scheduled out, the tsc deadline timer
> > virtualization will be switched to use the current solution, i.e.
> > use the timer for it. It's switched back to VMX preemption timer
> > when the vCPU thread is scheduled in.
> > 
> > This solution avoids the complex OS's hrtimer system, and also the
> > host timer interrupt handling cost, with a preemption_timer VMexit.
> > It fits well for some NFV usage scenario, when the vCPU is bound to
> > a pCPU and the pCPU is isolated, or some similar scenario.
> > 
> > However, it possibly has impact if the vCPU thread is scheduled
> > in/out very frequently, because it switches from/to the hrtimer
> > emulation a lot.
> > 
> > Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
> > ---
> >  arch/x86/kvm/lapic.c | 97
> > ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > arch/x86/kvm/lapic.h | 11 ++++++ arch/x86/kvm/trace.h | 22
> > ++++++++++++ arch/x86/kvm/vmx.c   |  8 +++++
> >  arch/x86/kvm/x86.c   |  4 +++
> >  5 files changed, 139 insertions(+), 3 deletions(-)
> 
> Thanks, this looks very nice.
> 
> As I mentioned in my reply to Marcelo, I have some more changes in
> mind to reduce the overhead.  This way we can enable it
> unconditionally (on processors that do not have the erratum).   In
> particular:
> 
> 1) While a virtual machine is scheduled out due to contention, you
> don't care if the timer fires.  You can fire the timer immediately
> after the next vmentry.  You only need to set a timer during HLT.
> So, rather than sched_out/sched_in, you can start and stop the hrtimer
> in vmx_pre_block and vmx_post_block.  You probably should rename what
> is now vmx_pre_block and vmx_post_block to something like
> pi_pre_block and pi_post_block, so that vmx_pre_block and
> vmx_post_block are like
> 
> 	if (pi_pre_block(vcpu))
> 		return 1;
> 	if (have preemption timer) {
> 		kvm_lapic_switch_to_sw_timer(vcpu);
> 
> and
> 
> 	if (have preemption timer)
> 		kvm_lapic_switch_to_hv_timer(vcpu);
> 	pi_post_block(vcpu);
> 
> respectively.
> 
> You'll also need to check whether the deadlock in
> switch_to_sw_lapic_timer is still possible.  Hopefully not---if so,
> simpler code! :)
> 
> 2) if possible, I would like to remove kvm_lapic_arm_hv_timer.
> Instead, make switch_to_hv_lapic_timer and start_apic_timer can call
> kvm_x86_ops->set_hv_timer, passing the desired *guest* TSC.
> set_hv_timer can convert the guest TSC to a host TSC and save the
> desired host TSC in struct vmx_vcpu.  vcpu_run checks if there is a

Hi, Paolo, I mostly finished my patch, but one thing came to my mind at
the last minute and I hope to get your input.
One issue with saving the host TSC in the vmx_vcpu is, if the vCPU thread
is migrated to another pCPU, and the TSCs are not synchronized between
these two pCPUs, then the result is not accurate. In the previous patchset
it's ok, because we do the calculation in vcpu_run(), thus the
migration does not matter.

This also brings up a tricky situation, considering the VMX preemption timer
can only hold a 32-bit value. In a rare situation the host delta TSC is
less than 32 bits; however, if the host tsc goes backwards after the
migration, then the delta TSC on the new pCPU may be larger than 32 bits,
but we can't switch back to the sw timer anymore because it's too late. I
added some hacky code in kvm_arch_vcpu_load(), not sure if it's ok.

I will send an RFC patch. I'm still looking for a
platform with TSC scaling support, thus haven't tested TSC scaling
yet. Everything else has been tested.

> TSC deadline and, if so, writes the delta into
> VMX_PREEMPTION_TIMER_VALUE. set_hv_timer/cancel_hv_timer only write
> to PIN_BASED_VM_EXEC_CONTROL.
> 
> Besides making vcpu_run faster, I think that with this change the
> distinction between HV_TIMER_NEEDS_ARMING and HV_TIMER_ARMED
> disappears, and struct kvm_lapic can simply have a bool
> hv_timer_in_use.  Again, it might actually end up making the code
> simpler, especially the interface between lapic.c and vmx.c.
> 
> The hardest part is converting guest_tsc to host_tsc.  From
> 
> 	guest_tsc = (host_tsc * vcpu->arch.tsc_scaling_ratio >>
> 			kvm_tsc_scaling_ratio_frac_bits) + tsc_offset
> 
> you have
> 
> 	host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
> 			<< kvm_tsc_scaling_ratio_frac_bits)
> 			/ vcpu->arch.tsc_scaling_ratio;
> 
> ... except that you need a division with a 128-bit dividend and a
> 64-bit divisor here, and there is no such thing in Linux.  It's okay
> if you restrict this to 64-bit hosts and use the divq assembly
> instruction directly.  (You also need to trap the divide overflow
> exception; if there is an overflow just disable the preemption

I simply check whether the calculation will cause an overflow. Trapping a
divide error exception is too much for the target usage scenario.


Thanks
--jyh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-06-04  0:24     ` yunhong jiang
@ 2016-06-06 13:49       ` Paolo Bonzini
  2016-06-06 18:21         ` yunhong jiang
  0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2016-06-06 13:49 UTC (permalink / raw)
  To: yunhong jiang; +Cc: kvm, mtosatti, rkrcmar, kernellwp



On 04/06/2016 02:24, yunhong jiang wrote:
>> > 2) if possible, I would like to remove kvm_lapic_arm_hv_timer.
>> > Instead, make switch_to_hv_lapic_timer and start_apic_timer can call
>> > kvm_x86_ops->set_hv_timer, passing the desired *guest* TSC.
>> > set_hv_timer can convert the guest TSC to a host TSC and save the
>> > desired host TSC in struct vmx_vcpu.  vcpu_run checks if there is a
> Hi, Paolo, I mostly finished my patch, but one thing came to my mind at
> the last minute and I hope to get your input.
> One issue with saving the host TSC in the vmx_vcpu is, if the vCPU thread
> is migrated to another pCPU, and the TSCs are not synchronized between
> these two pCPUs, then the result is not accurate. In the previous patchset
> it's ok, because we do the calculation in vcpu_run(), thus the
> migration does not matter.

My suggestion here is to redo the vmx_set_hv_timer in vmx_vcpu_load.

It should fix the 32-bit overflow below, too.
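
That is, something like this (sketch; I am assuming set_hv_timer
returns an error when the shifted delta no longer fits in the 32-bit
VMCS field, and kvm_lapic_hv_timer_in_use is a small helper to be
added):

	/* in vmx_vcpu_load(), after the vcpu has moved to a new pCPU */
	if (kvm_lapic_hv_timer_in_use(vcpu) &&
	    kvm_x86_ops->set_hv_timer(vcpu,
			vcpu->arch.apic->lapic_timer.tscdeadline))
		kvm_lapic_switch_to_sw_timer(vcpu);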

Paolo

> This also brings up a tricky situation, considering the VMX preemption timer
> can only hold a 32-bit value. In a rare situation the host delta TSC is
> less than 32 bits; however, if the host tsc goes backwards after the
> migration, then the delta TSC on the new pCPU may be larger than 32 bits,
> but we can't switch back to the sw timer anymore because it's too late. I
> added some hacky code in kvm_arch_vcpu_load(), not sure if it's ok.
> 
> I will send an RFC patch. I'm still looking for a
> platform with TSC scaling support, thus haven't tested TSC scaling
> yet. Everything else has been tested.
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-06-06 13:49       ` Paolo Bonzini
@ 2016-06-06 18:21         ` yunhong jiang
  0 siblings, 0 replies; 32+ messages in thread
From: yunhong jiang @ 2016-06-06 18:21 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, mtosatti, rkrcmar, kernellwp

On Mon, 6 Jun 2016 15:49:12 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 04/06/2016 02:24, yunhong jiang wrote:
> >> > 2) if possible, I would like to remove kvm_lapic_arm_hv_timer.
> >> > Instead, make switch_to_hv_lapic_timer and start_apic_timer can
> >> > call kvm_x86_ops->set_hv_timer, passing the desired *guest* TSC.
> >> > set_hv_timer can convert the guest TSC to a host TSC and save the
> >> > desired host TSC in struct vmx_vcpu.  vcpu_run checks if there
> >> > is a
> > Hi, Paolo, I mostly finished my patch, but one thing came to my
> > mind at the last minute and I hope to get your input.
> > One issue with saving the host TSC in the vmx_vcpu is, if the vCPU
> > thread is migrated to another pCPU, and the TSCs are not
> > synchronized between these two pCPUs, then the result is not
> > accurate. In the previous patchset it's ok, because we do the
> > calculation in vcpu_run(), thus the migration does not matter.
> 
> My suggestion here is to redo the vmx_set_hv_timer in vmx_vcpu_load.
> 
> It should fix the 32-bit overflow below, too.
> 
> Paolo

Aha, good point. Will do it this way. And will remove the "RFC" on the next patchset.

Thanks
--jyh

> 
> > This also brings up a tricky situation, considering the VMX preemption
> > timer can only hold a 32-bit value. In a rare situation the host
> > delta TSC is less than 32 bits; however, if the host tsc goes
> > backwards after the migration, then the delta TSC on the new pCPU
> > may be larger than 32 bits, but we can't switch back to the sw timer
> > anymore because it's too late. I added some hacky code in
> > kvm_arch_vcpu_load(), not sure if it's ok.
> > 
> > I will send an RFC patch. I'm still looking for a
> > platform with TSC scaling support, thus haven't tested TSC scaling
> > yet. Everything else has been tested.
> > 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer
  2016-05-25 14:31       ` Radim Krčmář
  2016-05-25 23:13         ` yunhong jiang
@ 2016-06-14 11:34         ` Paolo Bonzini
  1 sibling, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2016-06-14 11:34 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: Yunhong Jiang, kvm, mtosatti, kernellwp



On 25/05/2016 16:31, Radim Krčmář wrote:
>> > See my review. :)  I hope to get the per-vmentry cost of the preemption
>> > timer down to a dozen instructions.
> I secretly wished to learn what the overhead of hrtimer is. 

We now know. :)  It's about 800 clock cycles.

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 2/4] Utilize the vmx preemption timer
  2016-05-24 22:27 ` [RFC PATCH V2 2/4] Utilize the vmx preemption timer Yunhong Jiang
@ 2016-06-14 13:23   ` Roman Kagan
  2016-06-14 13:41     ` Paolo Bonzini
  2016-06-14 16:46     ` yunhong jiang
  0 siblings, 2 replies; 32+ messages in thread
From: Roman Kagan @ 2016-06-14 13:23 UTC (permalink / raw)
  To: Yunhong Jiang; +Cc: kvm, mtosatti, rkrcmar, pbonzini, kernellwp

On Tue, May 24, 2016 at 03:27:30PM -0700, Yunhong Jiang wrote:
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -110,6 +110,10 @@ module_param_named(pml, enable_pml, bool, S_IRUGO);
>  
>  #define KVM_VMX_TSC_MULTIPLIER_MAX     0xffffffffffffffffULL
>  
> +static int cpu_preemption_timer_multi;
> +static bool __read_mostly enable_hv_timer;
> +module_param_named(enable_hv_timer, enable_hv_timer, bool, S_IRUGO);
> +

Non-technical issue: I think the naming for the parameter is
unfortunate, as it looks similar to a number of CPU properties in QEMU
that start with "hv-", which are all related to MS Hyper-V compatibility
features.

Roman.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 2/4] Utilize the vmx preemption timer
  2016-06-14 13:23   ` Roman Kagan
@ 2016-06-14 13:41     ` Paolo Bonzini
  2016-06-14 16:46       ` yunhong jiang
  2016-06-14 16:46     ` yunhong jiang
  1 sibling, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2016-06-14 13:41 UTC (permalink / raw)
  To: Roman Kagan, Yunhong Jiang, kvm, mtosatti, rkrcmar, kernellwp



On 14/06/2016 15:23, Roman Kagan wrote:
>> >  
>> > +static int cpu_preemption_timer_multi;
>> > +static bool __read_mostly enable_hv_timer;
>> > +module_param_named(enable_hv_timer, enable_hv_timer, bool, S_IRUGO);
>> > +
> Non-techincal issue: I think the naming for the parameter is
> unfortunate, as it looks similar to a number of CPU properties in QEMU
> that start with "hv-", which are all related to MS Hyper-V compatibility
> features.

Non-technical but important.  Let's change it to preemption_timer (and
the variable to enable_preemption_timer).
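
That is:

	static bool __read_mostly enable_preemption_timer;
	module_param_named(preemption_timer, enable_preemption_timer, bool,
			   S_IRUGO);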

The code can keep the "hv_timer" moniker though.

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 2/4] Utilize the vmx preemption timer
  2016-06-14 13:41     ` Paolo Bonzini
@ 2016-06-14 16:46       ` yunhong jiang
  2016-06-14 21:56         ` Paolo Bonzini
  0 siblings, 1 reply; 32+ messages in thread
From: yunhong jiang @ 2016-06-14 16:46 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Roman Kagan, kvm, mtosatti, rkrcmar, kernellwp

On Tue, 14 Jun 2016 15:41:07 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 14/06/2016 15:23, Roman Kagan wrote:
> >> >  
> >> > +static int cpu_preemption_timer_multi;
> >> > +static bool __read_mostly enable_hv_timer;
> >> > +module_param_named(enable_hv_timer, enable_hv_timer, bool,
> >> > S_IRUGO); +
> > Non-techincal issue: I think the naming for the parameter is
> > unfortunate, as it looks similar to a number of CPU properties in
> > QEMU that start with "hv-", which are all related to MS Hyper-V
> > compatibility features.
> 
> Non-technical but important.  Let's change it to preemption_timer (and
> the variable to enable_preemption_timer).
> 
> The code can keep the "hv_timer" moniker though.

Hi, Paolo, should I resend the patch for this change?

Thanks
--jyh

> 
> Paolo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 2/4] Utilize the vmx preemption timer
  2016-06-14 13:23   ` Roman Kagan
  2016-06-14 13:41     ` Paolo Bonzini
@ 2016-06-14 16:46     ` yunhong jiang
  1 sibling, 0 replies; 32+ messages in thread
From: yunhong jiang @ 2016-06-14 16:46 UTC (permalink / raw)
  To: Roman Kagan; +Cc: kvm, mtosatti, rkrcmar, pbonzini, kernellwp

On Tue, 14 Jun 2016 16:23:19 +0300
Roman Kagan <rkagan@virtuozzo.com> wrote:

> On Tue, May 24, 2016 at 03:27:30PM -0700, Yunhong Jiang wrote:
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -110,6 +110,10 @@ module_param_named(pml, enable_pml, bool,
> > S_IRUGO); 
> >  #define KVM_VMX_TSC_MULTIPLIER_MAX     0xffffffffffffffffULL
> >  
> > +static int cpu_preemption_timer_multi;
> > +static bool __read_mostly enable_hv_timer;
> > +module_param_named(enable_hv_timer, enable_hv_timer, bool,
> > S_IRUGO); +
> 
> Non-technical issue: I think the naming for the parameter is
> unfortunate, as it looks similar to a number of CPU properties in QEMU
> that start with "hv-", which are all related to MS Hyper-V
> compatibility features.

Yes, this is a good point, thanks for raising it.

--jyh

> 
> Roman.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 2/4] Utilize the vmx preemption timer
  2016-06-14 16:46       ` yunhong jiang
@ 2016-06-14 21:56         ` Paolo Bonzini
  2016-06-15 18:03           ` yunhong jiang
  0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2016-06-14 21:56 UTC (permalink / raw)
  To: yunhong jiang; +Cc: Roman Kagan, kvm, mtosatti, rkrcmar, kernellwp



On 14/06/2016 18:46, yunhong jiang wrote:
> On Tue, 14 Jun 2016 15:41:07 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>>
>>
>> On 14/06/2016 15:23, Roman Kagan wrote:
>>>>>  
>>>>> +static int cpu_preemption_timer_multi;
>>>>> +static bool __read_mostly enable_hv_timer;
>>>>> +module_param_named(enable_hv_timer, enable_hv_timer, bool,
>>>>> S_IRUGO); +
>>> Non-technical issue: I think the naming for the parameter is
>>> unfortunate, as it looks similar to a number of CPU properties in
>>> QEMU that start with "hv-", which are all related to MS Hyper-V
>>> compatibility features.
>>
>> Non-technical but important.  Let's change it to preemption_timer (and
>> the variable to enable_preemption_timer).
>>
>> The code can keep the "hv_timer" moniker though.
> 
> Hi, Paolo, should I resend the patch for this change?

No, no need to.  I've already pushed the patches to kvm/queue.  I've
also separated all the vmx bits into one patch and the x86 bits into another.

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH V2 2/4] Utilize the vmx preemption timer
  2016-06-14 21:56         ` Paolo Bonzini
@ 2016-06-15 18:03           ` yunhong jiang
  0 siblings, 0 replies; 32+ messages in thread
From: yunhong jiang @ 2016-06-15 18:03 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Roman Kagan, kvm, mtosatti, rkrcmar, kernellwp

On Tue, 14 Jun 2016 23:56:27 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 14/06/2016 18:46, yunhong jiang wrote:
> > On Tue, 14 Jun 2016 15:41:07 +0200
> > Paolo Bonzini <pbonzini@redhat.com> wrote:
> > 
> >>
> >>
> >> On 14/06/2016 15:23, Roman Kagan wrote:
> >>>>>  
> >>>>> +static int cpu_preemption_timer_multi;
> >>>>> +static bool __read_mostly enable_hv_timer;
> >>>>> +module_param_named(enable_hv_timer, enable_hv_timer, bool,
> >>>>> S_IRUGO); +
> >>> Non-technical issue: I think the naming for the parameter is
> >>> unfortunate, as it looks similar to a number of CPU properties in
> >>> QEMU that start with "hv-", which are all related to MS Hyper-V
> >>> compatibility features.
> >>
> >> Non-technical but important.  Let's change it to preemption_timer
> >> (and the variable to enable_preemption_timer).
> >>
> >> The code can keep the "hv_timer" moniker though.
> > 
> > Hi, Paolo, should I resend the patch for this change?
> 
> No, no need to.  I've already pushed the patches to kvm/queue.  I've
> also separated all the vmx bits into one patch and the x86 bits into another.

Aha, I see that. Thanks a lot for the changes.

--jyh

> 
> Paolo


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2016-06-15 18:08 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-24 22:27 [RFC PATCH V2 0/4] Utilizing VMX preemption for timer virtualization Yunhong Jiang
2016-05-24 22:27 ` [RFC PATCH V2 1/4] Add the kvm sched_out hook Yunhong Jiang
2016-05-24 22:27 ` [RFC PATCH V2 2/4] Utilize the vmx preemption timer Yunhong Jiang
2016-06-14 13:23   ` Roman Kagan
2016-06-14 13:41     ` Paolo Bonzini
2016-06-14 16:46       ` yunhong jiang
2016-06-14 21:56         ` Paolo Bonzini
2016-06-15 18:03           ` yunhong jiang
2016-06-14 16:46     ` yunhong jiang
2016-05-24 22:27 ` [RFC PATCH V2 3/4] Separate the start_sw_tscdeadline Yunhong Jiang
2016-05-24 22:27 ` [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer Yunhong Jiang
2016-05-24 23:11   ` David Matlack
2016-05-24 23:35     ` yunhong jiang
2016-05-25 11:58       ` Paolo Bonzini
2016-05-25 22:53         ` yunhong jiang
2016-05-26  7:20           ` Paolo Bonzini
2016-05-25 10:40     ` Paolo Bonzini
2016-05-25 13:38       ` Radim Krčmář
2016-05-25 11:52   ` Paolo Bonzini
2016-05-25 22:44     ` yunhong jiang
2016-05-26 14:05       ` Alan Jenkins
2016-05-26 15:32         ` Paolo Bonzini
2016-06-04  0:24     ` yunhong jiang
2016-06-06 13:49       ` Paolo Bonzini
2016-06-06 18:21         ` yunhong jiang
2016-05-25 13:27   ` Radim Krčmář
2016-05-25 13:51     ` Paolo Bonzini
2016-05-25 14:31       ` Radim Krčmář
2016-05-25 23:13         ` yunhong jiang
2016-06-14 11:34         ` Paolo Bonzini
2016-05-25 13:45   ` Radim Krčmář
2016-05-25 22:57     ` yunhong jiang
