kvm.vger.kernel.org archive mirror
* KVM timekeeping and TSC virtualization
@ 2010-08-20  8:07 Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 01/35] Drop vm_init_tsc Zachary Amsden
                   ` (36 more replies)
  0 siblings, 37 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm

This patch set implements full TSC virtualization, with both
trapping and passthrough modes, and intelligent mode switching.
As a result, the TSC will never go backwards, and we are stable
against guest re-calibration attempts, VM reset, and migration.
For guests which require it, the TSC kHz can even be preserved on
migration to a new host.

The TSC will never be trapped on UP systems unless the host TSC
actually runs faster than the guest; other conditions, including
bad hardware and changing speeds, are accommodated by using catchup
mode to keep the guest passthrough TSC in line with the host clock.
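
A rough sketch of that mode choice (the names here are hypothetical,
not identifiers from the series):

	if (host_tsc_khz > guest_tsc_khz)
		mode = TSC_MODE_TRAP;		/* passthrough would run fast */
	else
		mode = TSC_MODE_PASSTHROUGH;	/* catchup absorbs the rest */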

What is still needed on top of this is a way to force TSC
trapping, or disable it entirely, for benchmarking purposes.
I refrained from adding that last bit because it wasn't clear
whether the best thing to do is a global 'force TSC trapping' /
'force TSC passthrough' / 'intelligent choice', or if this control
should be on a per-VM level, via an ioctl(), module parameter,
or sysfs.

I have cc'd John and Thomas because this may be relevant to their
interests, and I always appreciate feedback, especially on a change
set as large and complex as this.

Enjoy.  This time, there are no howler monkeys.  I've included
all the feedback I got from previous rounds of this and more.



* [KVM timekeeping 01/35] Drop vm_init_tsc
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20 16:54   ` Glauber Costa
  2010-08-20  8:07 ` [KVM timekeeping 02/35] Convert TSC writes to TSC offset writes Zachary Amsden
                   ` (35 subsequent siblings)
  36 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

This is used only by the VMX code, and is not done properly;
if the TSC is indeed backwards, it is out of sync, and will
need proper handling in the logic at each and every CPU change.
For now, drop this test during init as misguided.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    1 -
 arch/x86/kvm/vmx.c              |   10 +++-------
 arch/x86/kvm/x86.c              |    2 --
 3 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 502e53f..960f9c9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -394,7 +394,6 @@ struct kvm_arch {
 	gpa_t ept_identity_map_addr;
 
 	unsigned long irq_sources_bitmap;
-	u64 vm_init_tsc;
 	s64 kvmclock_offset;
 
 	struct kvm_xen_hvm_config xen_hvm_config;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index cf56462..a8753e1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2518,7 +2518,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 {
 	u32 host_sysenter_cs, msr_low, msr_high;
 	u32 junk;
-	u64 host_pat, tsc_this, tsc_base;
+	u64 host_pat, tsc_this;
 	unsigned long a;
 	struct desc_ptr dt;
 	int i;
@@ -2659,12 +2659,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 		vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
 	vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits);
 
-	tsc_base = vmx->vcpu.kvm->arch.vm_init_tsc;
-	rdtscll(tsc_this);
-	if (tsc_this < vmx->vcpu.kvm->arch.vm_init_tsc)
-		tsc_base = tsc_this;
-
-	guest_write_tsc(0, tsc_base);
+	tsc_this = native_read_tsc();
+	guest_write_tsc(0, tsc_this);
 
 	return 0;
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d5ac966..1bf9227 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5495,8 +5495,6 @@ struct  kvm *kvm_arch_create_vm(void)
 	/* Reserve bit 0 of irq_sources_bitmap for userspace irq source */
 	set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);
 
-	rdtscll(kvm->arch.vm_init_tsc);
-
 	return kvm;
 }
 
-- 
1.7.1


* [KVM timekeeping 02/35] Convert TSC writes to TSC offset writes
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 01/35] Drop vm_init_tsc Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 03/35] Move TSC offset writes to common code Zachary Amsden
                   ` (34 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Change svm / vmx to be the same internally and write the TSC offset
instead of the bare TSC in their helper functions.  This is isolated
as a single patch to contain the code movement.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
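For reference, the identity both backends now rely on (a restatement,
not code added by this patch): hardware computes

	guest_tsc = host_tsc + tsc_offset

on every guest rdtsc, so a guest write of 'data' becomes:

	u64 offset = data - native_read_tsc();
	svm_write_tsc_offset(vcpu, offset);	/* or the vmx equivalent */
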
 arch/x86/kvm/svm.c |   29 ++++++++++++++++-------------
 arch/x86/kvm/vmx.c |   11 +++++------
 2 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 7a2feb9..4cb8822 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -702,6 +702,20 @@ static void init_sys_seg(struct vmcb_seg *seg, uint32_t type)
 	seg->base = 0;
 }
 
+static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	u64 g_tsc_offset = 0;
+
+	if (is_nested(svm)) {
+		g_tsc_offset = svm->vmcb->control.tsc_offset -
+			       svm->nested.hsave->control.tsc_offset;
+		svm->nested.hsave->control.tsc_offset = offset;
+	}
+
+	svm->vmcb->control.tsc_offset = offset + g_tsc_offset;
+}
+
 static void init_vmcb(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -2567,20 +2581,9 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
 	struct vcpu_svm *svm = to_svm(vcpu);
 
 	switch (ecx) {
-	case MSR_IA32_TSC: {
-		u64 tsc_offset = data - native_read_tsc();
-		u64 g_tsc_offset = 0;
-
-		if (is_nested(svm)) {
-			g_tsc_offset = svm->vmcb->control.tsc_offset -
-				       svm->nested.hsave->control.tsc_offset;
-			svm->nested.hsave->control.tsc_offset = tsc_offset;
-		}
-
-		svm->vmcb->control.tsc_offset = tsc_offset + g_tsc_offset;
-
+	case MSR_IA32_TSC:
+		svm_write_tsc_offset(vcpu, data - native_read_tsc());
 		break;
-	}
 	case MSR_K6_STAR:
 		svm->vmcb->save.star = data;
 		break;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a8753e1..1f67e94 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1155,9 +1155,9 @@ static u64 guest_read_tsc(void)
  * writes 'guest_tsc' into guest's timestamp counter "register"
  * guest_tsc = host_tsc + tsc_offset ==> tsc_offset = guest_tsc - host_tsc
  */
-static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
+static void vmx_write_tsc_offset(u64 offset)
 {
-	vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
+	vmcs_write64(TSC_OFFSET, offset);
 }
 
 /*
@@ -1261,7 +1261,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 		break;
 	case MSR_IA32_TSC:
 		rdtscll(host_tsc);
-		guest_write_tsc(data, host_tsc);
+		vmx_write_tsc_offset(data - host_tsc);
 		break;
 	case MSR_IA32_CR_PAT:
 		if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
@@ -2518,7 +2518,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 {
 	u32 host_sysenter_cs, msr_low, msr_high;
 	u32 junk;
-	u64 host_pat, tsc_this;
+	u64 host_pat;
 	unsigned long a;
 	struct desc_ptr dt;
 	int i;
@@ -2659,8 +2659,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 		vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
 	vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits);
 
-	tsc_this = native_read_tsc();
-	guest_write_tsc(0, tsc_this);
+	vmx_write_tsc_offset(0-native_read_tsc());
 
 	return 0;
 }
-- 
1.7.1



* [KVM timekeeping 03/35] Move TSC offset writes to common code
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 01/35] Drop vm_init_tsc Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 02/35] Convert TSC writes to TSC offset writes Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20 17:06   ` Glauber Costa
  2010-08-20  8:07 ` [KVM timekeeping 04/35] Fix SVM VMCB reset Zachary Amsden
                   ` (33 subsequent siblings)
  36 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Also, take a spinlock to ensure that the storing of the offset and the
reading of the TSC are never preempted.  While the lock is overkill
now, it is useful later in this patch series.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
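An illustrative sketch (not code from this patch) of the window the
lock closes: if the task is preempted for T cycles between reading
the host TSC and storing the offset, the write lands late:

	offset = data - native_read_tsc();	/* host TSC == H here */
	/* ... preempted for T cycles ... */
	kvm_x86_ops->write_tsc_offset(vcpu, offset);
	/* guest now reads (H + T) + (data - H) == data + T */
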
 arch/x86/include/asm/kvm_host.h |    3 +++
 arch/x86/kvm/svm.c              |    6 ++++--
 arch/x86/kvm/vmx.c              |   13 ++++++-------
 arch/x86/kvm/x86.c              |   18 ++++++++++++++++++
 arch/x86/kvm/x86.h              |    2 ++
 5 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 960f9c9..3b4efe2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -395,6 +395,7 @@ struct kvm_arch {
 
 	unsigned long irq_sources_bitmap;
 	s64 kvmclock_offset;
+	spinlock_t tsc_write_lock;
 
 	struct kvm_xen_hvm_config xen_hvm_config;
 
@@ -521,6 +522,8 @@ struct kvm_x86_ops {
 
 	bool (*has_wbinvd_exit)(void);
 
+	void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
+
 	const struct trace_print_flags *exit_reasons_str;
 };
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 4cb8822..8d7ae20 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2581,8 +2581,8 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
 	struct vcpu_svm *svm = to_svm(vcpu);
 
 	switch (ecx) {
-	case MSR_IA32_TSC:
-		svm_write_tsc_offset(vcpu, data - native_read_tsc());
+	case MSR_IA32_TSC:
+		kvm_write_tsc(vcpu, data);
 		break;
 	case MSR_K6_STAR:
 		svm->vmcb->save.star = data;
@@ -3547,6 +3547,8 @@ static struct kvm_x86_ops svm_x86_ops = {
 	.set_supported_cpuid = svm_set_supported_cpuid,
 
 	.has_wbinvd_exit = svm_has_wbinvd_exit,
+
+	.write_tsc_offset = svm_write_tsc_offset,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1f67e94..e3e056f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1152,10 +1152,9 @@ static u64 guest_read_tsc(void)
 }
 
 /*
- * writes 'guest_tsc' into guest's timestamp counter "register"
- * guest_tsc = host_tsc + tsc_offset ==> tsc_offset = guest_tsc - host_tsc
+ * writes 'offset' into guest's timestamp counter offset register
  */
-static void vmx_write_tsc_offset(u64 offset)
+static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
 	vmcs_write64(TSC_OFFSET, offset);
 }
@@ -1230,7 +1229,6 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct shared_msr_entry *msr;
-	u64 host_tsc;
 	int ret = 0;
 
 	switch (msr_index) {
@@ -1260,8 +1258,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 		vmcs_writel(GUEST_SYSENTER_ESP, data);
 		break;
 	case MSR_IA32_TSC:
-		rdtscll(host_tsc);
-		vmx_write_tsc_offset(data - host_tsc);
+		kvm_write_tsc(vcpu, data);
 		break;
 	case MSR_IA32_CR_PAT:
 		if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
@@ -2659,7 +2656,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 		vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
 	vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits);
 
-	vmx_write_tsc_offset(0-native_read_tsc());
+	kvm_write_tsc(&vmx->vcpu, 0);
 
 	return 0;
 }
@@ -4354,6 +4351,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.set_supported_cpuid = vmx_set_supported_cpuid,
 
 	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
+
+	.write_tsc_offset = vmx_write_tsc_offset,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1bf9227..33e8208 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -895,6 +895,22 @@ static void kvm_set_time_scale(uint32_t tsc_khz, struct pvclock_vcpu_time_info *
 
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 
+void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
+{
+	struct kvm *kvm = vcpu->kvm;
+	u64 offset;
+	unsigned long flags;
+
+	spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
+	offset = data - native_read_tsc();
+	kvm_x86_ops->write_tsc_offset(vcpu, offset);
+	spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
+
+	/* Reset of TSC must disable overshoot protection below */
+	vcpu->arch.hv_clock.tsc_timestamp = 0;
+}
+EXPORT_SYMBOL_GPL(kvm_write_tsc);
+
 static void kvm_write_guest_time(struct kvm_vcpu *v)
 {
 	struct timespec ts;
@@ -5495,6 +5511,8 @@ struct  kvm *kvm_arch_create_vm(void)
 	/* Reserve bit 0 of irq_sources_bitmap for userspace irq source */
 	set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);
 
+	spin_lock_init(&kvm->arch.tsc_write_lock);
+
 	return kvm;
 }
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index b7a4047..2d6385e 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -68,4 +68,6 @@ static inline int is_paging(struct kvm_vcpu *vcpu)
 void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 
+void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
+
 #endif
-- 
1.7.1


* [KVM timekeeping 04/35] Fix SVM VMCB reset
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (2 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 03/35] Move TSC offset writes to common code Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 05/35] Move TSC reset out of vmcb_init Zachary Amsden
                   ` (32 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

On reset, the VMCB TSC should be set to zero.  Instead, the code was
setting tsc_offset to zero, which passes through the underlying TSC.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/svm.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 8d7ae20..74e4522 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -781,7 +781,7 @@ static void init_vmcb(struct vcpu_svm *svm)
 
 	control->iopm_base_pa = iopm_base;
 	control->msrpm_base_pa = __pa(svm->msrpm);
-	control->tsc_offset = 0;
+	kvm_write_tsc(&svm->vcpu, 0);
 	control->int_ctl = V_INTR_MASKING_MASK;
 
 	init_seg(&save->es);
-- 
1.7.1


* [KVM timekeeping 05/35] Move TSC reset out of vmcb_init
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (3 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 04/35] Fix SVM VMCB reset Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20 17:08   ` Glauber Costa
  2010-08-20  8:07 ` [KVM timekeeping 06/35] TSC reset compensation Zachary Amsden
                   ` (31 subsequent siblings)
  36 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

The VMCB is reset whenever we receive a startup IPI, so Linux setting
the TSC back to zero happens very late in the boot process, destabilizing
the TSC.  Instead, just set the TSC to zero once at VCPU creation time.

Why the separate patch?  So git-bisect is your friend.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/svm.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 74e4522..e8bfe8e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -781,7 +781,6 @@ static void init_vmcb(struct vcpu_svm *svm)
 
 	control->iopm_base_pa = iopm_base;
 	control->msrpm_base_pa = __pa(svm->msrpm);
-	kvm_write_tsc(&svm->vcpu, 0);
 	control->int_ctl = V_INTR_MASKING_MASK;
 
 	init_seg(&save->es);
@@ -917,6 +916,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
 	svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT;
 	svm->asid_generation = 0;
 	init_vmcb(svm);
+	kvm_write_tsc(&svm->vcpu, 0);
 
 	err = fx_init(&svm->vcpu);
 	if (err)
-- 
1.7.1


* [KVM timekeeping 06/35] TSC reset compensation
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (4 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 05/35] Move TSC reset out of vmcb_init Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 07/35] Make cpu_tsc_khz updates use local CPU Zachary Amsden
                   ` (30 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Attempt to synchronize TSCs which are reset to the same value.  In the
case of a reliable hardware TSC, we can just re-use the same offset, but
on non-reliable hardware, we can get closer by adjusting the offset to
match the elapsed time.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
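A worked example of the unstable-TSC path below, with hypothetical
numbers (2 GHz host, second write landing 3 seconds after the first):

	tsc_delta = elapsed * cpu_tsc_khz / USEC_PER_SEC
	          = 3,000,000,000 ns * 2,000,000 kHz / 1,000,000
	          = 6,000,000,000 cycles

so the freshly computed offset is advanced by the ~6e9 cycles the
host accumulated since the first write, bringing this VCPU's guest
TSC in line with the first one's.
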
 arch/x86/include/asm/kvm_host.h |    3 +++
 arch/x86/kvm/x86.c              |   31 ++++++++++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3b4efe2..4b42893 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -396,6 +396,9 @@ struct kvm_arch {
 	unsigned long irq_sources_bitmap;
 	s64 kvmclock_offset;
 	spinlock_t tsc_write_lock;
+	u64 last_tsc_nsec;
+	u64 last_tsc_offset;
+	u64 last_tsc_write;
 
 	struct kvm_xen_hvm_config xen_hvm_config;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 33e8208..62c58f9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -898,11 +898,40 @@ static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
-	u64 offset;
+	u64 offset, ns, elapsed;
 	unsigned long flags;
+	struct timespec ts;
 
 	spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = data - native_read_tsc();
+	ktime_get_ts(&ts);
+	monotonic_to_bootbased(&ts);
+	ns = timespec_to_ns(&ts);
+	elapsed = ns - kvm->arch.last_tsc_nsec;
+
+	/*
+	 * Special case: identical write to TSC within 5 seconds of
+	 * another CPU is interpreted as an attempt to synchronize
+	 * (the 5 seconds is to accommodate host load / swapping).
+	 *
+	 * In that case, for a reliable TSC, we can match TSC offsets,
+	 * or make a best guess using kernel_ns value.
+	 */
+	if (data == kvm->arch.last_tsc_write && elapsed < 5ULL * NSEC_PER_SEC) {
+		if (!check_tsc_unstable()) {
+			offset = kvm->arch.last_tsc_offset;
+			pr_debug("kvm: matched tsc offset for %llu\n", data);
+		} else {
+			u64 tsc_delta = elapsed * __get_cpu_var(cpu_tsc_khz);
+			tsc_delta = tsc_delta / USEC_PER_SEC;
+			offset += tsc_delta;
+			pr_debug("kvm: adjusted tsc offset by %llu\n", tsc_delta);
+		}
+		ns = kvm->arch.last_tsc_nsec;
+	}
+	kvm->arch.last_tsc_nsec = ns;
+	kvm->arch.last_tsc_write = data;
+	kvm->arch.last_tsc_offset = offset;
 	kvm_x86_ops->write_tsc_offset(vcpu, offset);
 	spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 
-- 
1.7.1


* [KVM timekeeping 07/35] Make cpu_tsc_khz updates use local CPU
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (5 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 06/35] TSC reset compensation Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 08/35] Warn about unstable TSC Zachary Amsden
                   ` (29 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

This simplifies much of the init code; we can now simply always
call tsc_khz_changed, optionally passing it a new value, or letting
it figure out the existing value (while interrupts are disabled, and
thus, by the rule established here, safe from races with CPU hotplug
or frequency updates, which issue IPIs to the local CPU to perform
this very same task).

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
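In brief, the calling convention this establishes (both calls appear
in the patch below): tsc_khz_changed always runs on the CPU whose
per-cpu variable it updates, in interrupt context via IPI:

	/* cpufreq notification: the new value is known */
	smp_call_function_single(freq->cpu, tsc_khz_changed, freq, 1);

	/* hotplug / init: discover the value on the CPU itself */
	smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
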
 arch/x86/kvm/x86.c |  157 +++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 114 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 62c58f9..3420f25 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -895,6 +895,15 @@ static void kvm_set_time_scale(uint32_t tsc_khz, struct pvclock_vcpu_time_info *
 
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 
+static inline int kvm_tsc_changes_freq(void)
+{
+	int cpu = get_cpu();
+	int ret = !boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+		  cpufreq_quick_get(cpu) != 0;
+	put_cpu();
+	return ret;
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
@@ -940,7 +949,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 }
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
-static void kvm_write_guest_time(struct kvm_vcpu *v)
+static int kvm_write_guest_time(struct kvm_vcpu *v)
 {
 	struct timespec ts;
 	unsigned long flags;
@@ -949,24 +958,27 @@ static void kvm_write_guest_time(struct kvm_vcpu *v)
 	unsigned long this_tsc_khz;
 
 	if ((!vcpu->time_page))
-		return;
-
-	this_tsc_khz = get_cpu_var(cpu_tsc_khz);
-	if (unlikely(vcpu->hv_clock_tsc_khz != this_tsc_khz)) {
-		kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock);
-		vcpu->hv_clock_tsc_khz = this_tsc_khz;
-	}
-	put_cpu_var(cpu_tsc_khz);
+		return 0;
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
 	kvm_get_msr(v, MSR_IA32_TSC, &vcpu->hv_clock.tsc_timestamp);
 	ktime_get_ts(&ts);
 	monotonic_to_bootbased(&ts);
+	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 	local_irq_restore(flags);
 
-	/* With all the info we got, fill in the values */
+	if (unlikely(this_tsc_khz == 0)) {
+		kvm_make_request(KVM_REQ_KVMCLOCK_UPDATE, v);
+		return 1;
+	}
 
+	if (unlikely(vcpu->hv_clock_tsc_khz != this_tsc_khz)) {
+		kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock);
+		vcpu->hv_clock_tsc_khz = this_tsc_khz;
+	}
+
+	/* With all the info we got, fill in the values */
 	vcpu->hv_clock.system_time = ts.tv_nsec +
 				     (NSEC_PER_SEC * (u64)ts.tv_sec) + v->kvm->arch.kvmclock_offset;
 
@@ -987,6 +999,7 @@ static void kvm_write_guest_time(struct kvm_vcpu *v)
 	kunmap_atomic(shared_kaddr, KM_USER0);
 
 	mark_page_dirty(v->kvm, vcpu->time >> PAGE_SHIFT);
+	return 0;
 }
 
 static int kvm_request_guest_time_update(struct kvm_vcpu *v)
@@ -1853,12 +1866,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	}
 
 	kvm_x86_ops->vcpu_load(vcpu, cpu);
-	if (unlikely(per_cpu(cpu_tsc_khz, cpu) == 0)) {
-		unsigned long khz = cpufreq_quick_get(cpu);
-		if (!khz)
-			khz = tsc_khz;
-		per_cpu(cpu_tsc_khz, cpu) = khz;
-	}
 	kvm_request_guest_time_update(vcpu);
 }
 
@@ -4152,9 +4159,23 @@ int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port)
 }
 EXPORT_SYMBOL_GPL(kvm_fast_pio_out);
 
-static void bounce_off(void *info)
+static void tsc_bad(void *info)
+{
+	__get_cpu_var(cpu_tsc_khz) = 0;
+}
+
+static void tsc_khz_changed(void *data)
 {
-	/* nothing */
+	struct cpufreq_freqs *freq = data;
+	unsigned long khz = 0;
+
+	if (data)
+		khz = freq->new;
+	else if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
+		khz = cpufreq_quick_get(raw_smp_processor_id());
+	if (!khz)
+		khz = tsc_khz;
+	__get_cpu_var(cpu_tsc_khz) = khz;
 }
 
 static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
@@ -4165,11 +4186,51 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 	struct kvm_vcpu *vcpu;
 	int i, send_ipi = 0;
 
+	/*
+	 * We allow guests to temporarily run on slowing clocks,
+	 * provided we notify them after, or to run on accelerating
+	 * clocks, provided we notify them before.  Thus time never
+	 * goes backwards.
+	 *
+	 * However, we have a problem.  We can't atomically update
+	 * the frequency of a given CPU from this function; it is
+	 * merely a notifier, which can be called from any CPU.
+	 * Changing the TSC frequency at arbitrary points in time
+	 * requires a recomputation of local variables related to
+	 * the TSC for each VCPU.  We must flag these local variables
+	 * to be updated and be sure the update takes place with the
+	 * new frequency before any guests proceed.
+	 *
+	 * Unfortunately, the combination of hotplug CPU and frequency
+	 * change creates an intractable locking scenario; the order
+	 * of when these callouts happen is undefined with respect to
+	 * CPU hotplug, and they can race with each other.  As such,
+	 * merely setting per_cpu(cpu_tsc_khz) = X during a hotadd is
+	 * undefined; you can actually have a CPU frequency change take
+	 * place in between the computation of X and the setting of the
+	 * variable.  To protect against this problem, all updates of
+	 * the per_cpu tsc_khz variable are done in an interrupt
+	 * protected IPI, and all callers wishing to update the value
+	 * must wait for a synchronous IPI to complete (which is trivial
+	 * if the caller is on the CPU already).  This establishes the
+	 * necessary total order on variable updates.
+	 *
+	 * Note that because a guest time update may take place
+	 * anytime after the setting of the VCPU's request bit, the
+	 * correct TSC value must be set before the request.  However,
+	 * to ensure the update actually makes it to any guest which
+	 * starts running in hardware virtualization between the set
+	 * and the acquisition of the spinlock, we must also ping the
+	 * CPU after setting the request bit.
+	 *
+	 */
+
 	if (val == CPUFREQ_PRECHANGE && freq->old > freq->new)
 		return 0;
 	if (val == CPUFREQ_POSTCHANGE && freq->old < freq->new)
 		return 0;
-	per_cpu(cpu_tsc_khz, freq->cpu) = freq->new;
+
+	smp_call_function_single(freq->cpu, tsc_khz_changed, freq, 1);
 
 	spin_lock(&kvm_lock);
 	list_for_each_entry(kvm, &vm_list, vm_list) {
@@ -4179,7 +4240,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 			if (!kvm_request_guest_time_update(vcpu))
 				continue;
 			if (vcpu->cpu != smp_processor_id())
-				send_ipi++;
+				send_ipi = 1;
 		}
 	}
 	spin_unlock(&kvm_lock);
@@ -4197,32 +4258,48 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 		 * guest context is entered kvmclock will be updated,
 		 * so the guest will not see stale values.
 		 */
-		smp_call_function_single(freq->cpu, bounce_off, NULL, 1);
+		smp_call_function_single(freq->cpu, tsc_khz_changed, freq, 1);
 	}
 	return 0;
 }
 
 static struct notifier_block kvmclock_cpufreq_notifier_block = {
-        .notifier_call  = kvmclock_cpufreq_notifier
+	.notifier_call  = kvmclock_cpufreq_notifier
+};
+
+static int kvmclock_cpu_notifier(struct notifier_block *nfb,
+					unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (unsigned long)hcpu;
+
+	switch (action) {
+		case CPU_ONLINE:
+		case CPU_DOWN_FAILED:
+			smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
+			break;
+		case CPU_DOWN_PREPARE:
+			smp_call_function_single(cpu, tsc_bad, NULL, 1);
+			break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block kvmclock_cpu_notifier_block = {
+	.notifier_call  = kvmclock_cpu_notifier,
+	.priority = -INT_MAX
 };
 
 static void kvm_timer_init(void)
 {
 	int cpu;
 
+	register_hotcpu_notifier(&kvmclock_cpu_notifier_block);
 	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
 		cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block,
 					  CPUFREQ_TRANSITION_NOTIFIER);
-		for_each_online_cpu(cpu) {
-			unsigned long khz = cpufreq_get(cpu);
-			if (!khz)
-				khz = tsc_khz;
-			per_cpu(cpu_tsc_khz, cpu) = khz;
-		}
-	} else {
-		for_each_possible_cpu(cpu)
-			per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
 	}
+	for_each_online_cpu(cpu)
+		smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
 }
 
 static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
@@ -4324,6 +4401,7 @@ void kvm_arch_exit(void)
 	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
 		cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
 					    CPUFREQ_TRANSITION_NOTIFIER);
+	unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
 	kvm_x86_ops = NULL;
 	kvm_mmu_module_exit();
 }
@@ -4739,8 +4817,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_mmu_unload(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
 			__kvm_migrate_timers(vcpu);
-		if (kvm_check_request(KVM_REQ_KVMCLOCK_UPDATE, vcpu))
-			kvm_write_guest_time(vcpu);
+		if (kvm_check_request(KVM_REQ_KVMCLOCK_UPDATE, vcpu)) {
+			r = kvm_write_guest_time(vcpu);
+			if (unlikely(r))
+				goto out;
+		}
 		if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
 			kvm_mmu_sync_roots(vcpu);
 		if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
@@ -5423,17 +5504,7 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 
 int kvm_arch_hardware_enable(void *garbage)
 {
-	/*
-	 * Since this may be called from a hotplug notifcation,
-	 * we can't get the CPU frequency directly.
-	 */
-	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
-		int cpu = raw_smp_processor_id();
-		per_cpu(cpu_tsc_khz, cpu) = 0;
-	}
-
 	kvm_shared_msr_cpu_online();
-
 	return kvm_x86_ops->hardware_enable(garbage);
 }
 
-- 
1.7.1



* [KVM timekeeping 08/35] Warn about unstable TSC
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (6 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 07/35] Make cpu_tsc_khz updates use local CPU Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20 17:28   ` Glauber Costa
  2010-08-20  8:07 ` [KVM timekeeping 09/35] Unify TSC logic Zachary Amsden
                   ` (28 subsequent siblings)
  36 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

If creating an SMP guest with an unstable host TSC, issue a warning.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3420f25..5e3b10e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5457,6 +5457,10 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm,
 						unsigned int id)
 {
+	if (check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
+		printk_once(KERN_WARNING
+		"kvm: SMP vm created on host with unstable TSC; "
+		"guest TSC will not be reliable\n");
 	return kvm_x86_ops->vcpu_create(kvm, id);
 }
 
-- 
1.7.1



* [KVM timekeeping 09/35] Unify TSC logic
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (7 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 08/35] Warn about unstable TSC Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization Zachary Amsden
                   ` (27 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Move the TSC control logic from the vendor backends into x86.c
by adding adjust_tsc_offset to x86 ops.  Now all TSC decisions
can be done in one place.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
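Condensed, the unified flow now looks like this (lines drawn from the
patch below, minus the first-load guard):

	/* kvm_arch_vcpu_put: remember where the host TSC was */
	vcpu->arch.last_host_tsc = native_read_tsc();

	/* kvm_arch_vcpu_load on a new CPU: cancel any apparent jump */
	s64 tsc_delta = native_read_tsc() - vcpu->arch.last_host_tsc;
	if (tsc_delta < 0)
		mark_tsc_unstable("KVM discovered backwards TSC");
	if (check_tsc_unstable())
		kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
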
 arch/x86/include/asm/kvm_host.h |    5 +++--
 arch/x86/kvm/svm.c              |   26 ++++++++++----------------
 arch/x86/kvm/vmx.c              |   22 ++++++++--------------
 arch/x86/kvm/x86.c              |   17 ++++++++++++++---
 4 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4b42893..324e892 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -255,7 +255,6 @@ struct kvm_mmu {
 };
 
 struct kvm_vcpu_arch {
-	u64 host_tsc;
 	/*
 	 * rip and regs accesses must go through
 	 * kvm_{register,rip}_{read,write} functions.
@@ -336,9 +335,10 @@ struct kvm_vcpu_arch {
 
 	gpa_t time;
 	struct pvclock_vcpu_time_info hv_clock;
-	unsigned int hv_clock_tsc_khz;
+	unsigned int hw_tsc_khz;
 	unsigned int time_offset;
 	struct page *time_page;
+	u64 last_host_tsc;
 
 	bool nmi_pending;
 	bool nmi_injected;
@@ -520,6 +520,7 @@ struct kvm_x86_ops {
 	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 	int (*get_lpage_level)(void);
 	bool (*rdtscp_supported)(void);
+	void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);
 
 	void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index e8bfe8e..2be8338 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -716,6 +716,15 @@ static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 	svm->vmcb->control.tsc_offset = offset + g_tsc_offset;
 }
 
+static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	svm->vmcb->control.tsc_offset += adjustment;
+	if (is_nested(svm))
+		svm->nested.hsave->control.tsc_offset += adjustment;
+}
+
 static void init_vmcb(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -962,20 +971,6 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	int i;
 
 	if (unlikely(cpu != vcpu->cpu)) {
-		u64 delta;
-
-		if (check_tsc_unstable()) {
-			/*
-			 * Make sure that the guest sees a monotonically
-			 * increasing TSC.
-			 */
-			delta = vcpu->arch.host_tsc - native_read_tsc();
-			svm->vmcb->control.tsc_offset += delta;
-			if (is_nested(svm))
-				svm->nested.hsave->control.tsc_offset += delta;
-		}
-		vcpu->cpu = cpu;
-		kvm_migrate_timers(vcpu);
 		svm->asid_generation = 0;
 	}
 
@@ -991,8 +986,6 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
 	++vcpu->stat.host_state_reload;
 	for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++)
 		wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
-
-	vcpu->arch.host_tsc = native_read_tsc();
 }
 
 static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
@@ -3549,6 +3542,7 @@ static struct kvm_x86_ops svm_x86_ops = {
 	.has_wbinvd_exit = svm_has_wbinvd_exit,
 
 	.write_tsc_offset = svm_write_tsc_offset,
+	.adjust_tsc_offset = svm_adjust_tsc_offset,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e3e056f..f8b70ac 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -505,7 +505,6 @@ static void __vcpu_clear(void *arg)
 		vmcs_clear(vmx->vmcs);
 	if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
 		per_cpu(current_vmcs, cpu) = NULL;
-	rdtscll(vmx->vcpu.arch.host_tsc);
 	list_del(&vmx->local_vcpus_link);
 	vmx->vcpu.cpu = -1;
 	vmx->launched = 0;
@@ -887,7 +886,6 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
 static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	u64 tsc_this, delta, new_offset;
 	u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
 
 	if (!vmm_exclusive)
@@ -904,14 +902,12 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
 		unsigned long sysenter_esp;
 
-		kvm_migrate_timers(vcpu);
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		local_irq_disable();
 		list_add(&vmx->local_vcpus_link,
 			 &per_cpu(vcpus_on_cpu, cpu));
 		local_irq_enable();
 
-		vcpu->cpu = cpu;
 		/*
 		 * Linux uses per-cpu TSS and GDT, so set these when switching
 		 * processors.
@@ -921,16 +917,6 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 		rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
 		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
-
-		/*
-		 * Make sure the time stamp counter is monotonous.
-		 */
-		rdtscll(tsc_this);
-		if (tsc_this < vcpu->arch.host_tsc) {
-			delta = vcpu->arch.host_tsc - tsc_this;
-			new_offset = vmcs_read64(TSC_OFFSET) + delta;
-			vmcs_write64(TSC_OFFSET, new_offset);
-		}
 	}
 }
 
@@ -1159,6 +1145,12 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 	vmcs_write64(TSC_OFFSET, offset);
 }
 
+static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
+{
+	u64 offset = vmcs_read64(TSC_OFFSET);
+	vmcs_write64(TSC_OFFSET, offset + adjustment);
+}
+
 /*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
@@ -4114,6 +4106,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 
 	cpu = get_cpu();
 	vmx_vcpu_load(&vmx->vcpu, cpu);
+	vmx->vcpu.cpu = cpu;
 	err = vmx_vcpu_setup(vmx);
 	vmx_vcpu_put(&vmx->vcpu);
 	put_cpu();
@@ -4353,6 +4346,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
 
 	.write_tsc_offset = vmx_write_tsc_offset,
+	.adjust_tsc_offset = vmx_adjust_tsc_offset,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5e3b10e..7fc4a55 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -973,9 +973,9 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
 		return 1;
 	}
 
-	if (unlikely(vcpu->hv_clock_tsc_khz != this_tsc_khz)) {
+	if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
 		kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock);
-		vcpu->hv_clock_tsc_khz = this_tsc_khz;
+		vcpu->hw_tsc_khz = this_tsc_khz;
 	}
 
 	/* With all the info we got, fill in the values */
@@ -1866,13 +1866,24 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	}
 
 	kvm_x86_ops->vcpu_load(vcpu, cpu);
-	kvm_request_guest_time_update(vcpu);
+	if (unlikely(vcpu->cpu != cpu)) {
+		/* Make sure TSC doesn't go backwards */
+		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
+				native_read_tsc() - vcpu->arch.last_host_tsc;
+		if (tsc_delta < 0)
+			mark_tsc_unstable("KVM discovered backwards TSC");
+		if (check_tsc_unstable())
+			kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
+		kvm_migrate_timers(vcpu);
+		vcpu->cpu = cpu;
+	}
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
+	vcpu->arch.last_host_tsc = native_read_tsc();
 }
 
 static int is_efer_nx(void)
-- 
1.7.1



* [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (8 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 09/35] Unify TSC logic Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20 17:30   ` Glauber Costa
  2010-09-14  9:10   ` Jan Kiszka
  2010-08-20  8:07 ` [KVM timekeeping 11/35] Add helper functions for time computation Zachary Amsden
                   ` (26 subsequent siblings)
  36 siblings, 2 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

When CPUs with unstable TSCs enter deep C-state, the TSC may stop
running.  This causes us to require resynchronization.  Since we
can't tell when this may happen, we assume the worst by forcing
re-compensation for it at every point the VCPU task is descheduled.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7fc4a55..52b6c21 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	}
 
 	kvm_x86_ops->vcpu_load(vcpu, cpu);
-	if (unlikely(vcpu->cpu != cpu)) {
+	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
 		/* Make sure TSC doesn't go backwards */
 		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
 				native_read_tsc() - vcpu->arch.last_host_tsc;
-- 
1.7.1



* [KVM timekeeping 11/35] Add helper functions for time computation
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (9 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20 17:34   ` Glauber Costa
  2010-08-20  8:07 ` [KVM timekeeping 12/35] Robust TSC compensation Zachary Amsden
                   ` (25 subsequent siblings)
  36 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Add helper functions to compute the kernel time and to convert
nanoseconds back to CPU-specific cycles.  Note that these must not be
called in preemptible context, as that would mean the kernel could
enter software suspend state, which would cause non-atomic operation.

Also, convert the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls to use the kernel
time helper; these should be bootbased as well.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
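Intended usage, sketched (both helpers WARN if called preemptible):

	local_irq_save(flags);
	kernel_ns = get_kernel_ns();	  /* bootbased kernel time, in ns */
	cycles = nsec_to_cycles(elapsed); /* ns -> cycles at local TSC rate */
	local_irq_restore(flags);
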
 arch/x86/kvm/x86.c |   48 ++++++++++++++++++++++++++++--------------------
 1 files changed, 28 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 52b6c21..52680f6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -893,6 +893,16 @@ static void kvm_set_time_scale(uint32_t tsc_khz, struct pvclock_vcpu_time_info *
 		 hv_clock->tsc_to_system_mul);
 }
 
+static inline u64 get_kernel_ns(void)
+{
+	struct timespec ts;
+
+	WARN_ON(preemptible());
+	ktime_get_ts(&ts);
+	monotonic_to_bootbased(&ts);
+	return timespec_to_ns(&ts);
+}
+
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 
 static inline int kvm_tsc_changes_freq(void)
@@ -904,18 +914,24 @@ static inline int kvm_tsc_changes_freq(void)
 	return ret;
 }
 
+static inline u64 nsec_to_cycles(u64 nsec)
+{
+	WARN_ON(preemptible());
+	if (kvm_tsc_changes_freq())
+		printk_once(KERN_WARNING
+		 "kvm: unreliable cycle conversion on adjustable rate TSC\n");
+	return (nsec * __get_cpu_var(cpu_tsc_khz)) / USEC_PER_SEC;
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
-	struct timespec ts;
 
 	spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = data - native_read_tsc();
-	ktime_get_ts(&ts);
-	monotonic_to_bootbased(&ts);
-	ns = timespec_to_ns(&ts);
+	ns = get_kernel_ns();
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	/*
@@ -931,10 +947,9 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 			offset = kvm->arch.last_tsc_offset;
 			pr_debug("kvm: matched tsc offset for %llu\n", data);
 		} else {
-			u64 tsc_delta = elapsed * __get_cpu_var(cpu_tsc_khz);
-			tsc_delta = tsc_delta / USEC_PER_SEC;
-			offset += tsc_delta;
-			pr_debug("kvm: adjusted tsc offset by %llu\n", tsc_delta);
+			u64 delta = nsec_to_cycles(elapsed);
+			offset += delta;
+			pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
 		}
 		ns = kvm->arch.last_tsc_nsec;
 	}
@@ -951,11 +966,11 @@ EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
 static int kvm_write_guest_time(struct kvm_vcpu *v)
 {
-	struct timespec ts;
 	unsigned long flags;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	void *shared_kaddr;
 	unsigned long this_tsc_khz;
+	s64 kernel_ns;
 
 	if ((!vcpu->time_page))
 		return 0;
@@ -963,8 +978,7 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
 	kvm_get_msr(v, MSR_IA32_TSC, &vcpu->hv_clock.tsc_timestamp);
-	ktime_get_ts(&ts);
-	monotonic_to_bootbased(&ts);
+	kernel_ns = get_kernel_ns();
 	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 	local_irq_restore(flags);
 
@@ -979,9 +993,7 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
 	}
 
 	/* With all the info we got, fill in the values */
-	vcpu->hv_clock.system_time = ts.tv_nsec +
-				     (NSEC_PER_SEC * (u64)ts.tv_sec) + v->kvm->arch.kvmclock_offset;
-
+	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
 	vcpu->hv_clock.flags = 0;
 
 	/*
@@ -3263,7 +3275,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_CLOCK: {
-		struct timespec now;
 		struct kvm_clock_data user_ns;
 		u64 now_ns;
 		s64 delta;
@@ -3277,19 +3288,16 @@ long kvm_arch_vm_ioctl(struct file *filp,
 			goto out;
 
 		r = 0;
-		ktime_get_ts(&now);
-		now_ns = timespec_to_ns(&now);
+		now_ns = get_kernel_ns();
 		delta = user_ns.clock - now_ns;
 		kvm->arch.kvmclock_offset = delta;
 		break;
 	}
 	case KVM_GET_CLOCK: {
-		struct timespec now;
 		struct kvm_clock_data user_ns;
 		u64 now_ns;
 
-		ktime_get_ts(&now);
-		now_ns = timespec_to_ns(&now);
+		now_ns = get_kernel_ns();
 		user_ns.clock = kvm->arch.kvmclock_offset + now_ns;
 		user_ns.flags = 0;
 
-- 
1.7.1


* [KVM timekeeping 12/35] Robust TSC compensation
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (10 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 11/35] Add helper functions for time computation Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20 17:40   ` Glauber Costa
  2010-08-24 21:33   ` Daniel Verkamp
  2010-08-20  8:07 ` [KVM timekeeping 13/35] Perform hardware_enable in CPU_STARTING callback Zachary Amsden
                   ` (24 subsequent siblings)
  36 siblings, 2 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Make the TSC-write matching find writes that are close to each other
instead of perfectly identical; this allows the compensator to also
work in migration / suspend scenarios.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
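A worked example of the new match window, with hypothetical numbers:
on a 2 GHz host, nsec_to_cycles(5ULL * NSEC_PER_SEC) is 10^10, so two
writes are treated as a synchronization attempt when

	|data - last_tsc_write| < 10,000,000,000 cycles, and
	elapsed                 < 5 * NSEC_PER_SEC

which tolerates the small drift accumulated across a migration or
suspend that the previous exact-equality test rejected.
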
 arch/x86/kvm/x86.c |   14 ++++++++++----
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 52680f6..0f3e5fb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -928,21 +928,27 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
+	s64 sdiff;
 
 	spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = data - native_read_tsc();
 	ns = get_kernel_ns();
 	elapsed = ns - kvm->arch.last_tsc_nsec;
+	sdiff = data - kvm->arch.last_tsc_write;
+	if (sdiff < 0)
+		sdiff = -sdiff;
 
 	/*
-	 * Special case: identical write to TSC within 5 seconds of
+	 * Special case: close write to TSC within 5 seconds of
 	 * another CPU is interpreted as an attempt to synchronize
-	 * (the 5 seconds is to accommodate host load / swapping).
+	 * The 5 seconds is to accommodate host load / swapping as
+	 * well as any reset of TSC during the boot process.
 	 *
 	 * In that case, for a reliable TSC, we can match TSC offsets,
-	 * or make a best guess using kernel_ns value.
+	 * or make a best guess using elapsed value.
 	 */
-	if (data == kvm->arch.last_tsc_write && elapsed < 5ULL * NSEC_PER_SEC) {
+	if (sdiff < nsec_to_cycles(5ULL * NSEC_PER_SEC) &&
+	    elapsed < 5ULL * NSEC_PER_SEC) {
 		if (!check_tsc_unstable()) {
 			offset = kvm->arch.last_tsc_offset;
 			pr_debug("kvm: matched tsc offset for %llu\n", data);
-- 
1.7.1


* [KVM timekeeping 13/35] Perform hardware_enable in CPU_STARTING callback
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (11 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 12/35] Robust TSC compensation Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-27 16:32   ` Jan Kiszka
  2010-08-20  8:07 ` [KVM timekeeping 14/35] Add clock sync request to hardware enable Zachary Amsden
                   ` (23 subsequent siblings)
  36 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

The CPU_STARTING callback was added upstream with the intention
of being used for KVM, specifically for the hardware enablement
that must be done before we can run in hardware virt.  It had
bugs on the x86_64 architecture at the time, where it was called
after CPU_ONLINE.  The arches have since merged and the bug is
gone.

It might be noted that other features should probably start making
use of this callback; microcode updates in particular, which might
fix important errata, would be best applied before beginning to run
user tasks.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 virt/kvm/kvm_main.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b78b794..d4853a5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1958,10 +1958,10 @@ static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
 		       cpu);
 		hardware_disable(NULL);
 		break;
-	case CPU_ONLINE:
+	case CPU_STARTING:
 		printk(KERN_INFO "kvm: enabling virtualization on CPU%d\n",
 		       cpu);
-		smp_call_function_single(cpu, hardware_enable, NULL, 1);
+		hardware_enable(NULL);
 		break;
 	}
 	return NOTIFY_OK;
@@ -2096,7 +2096,6 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 
 static struct notifier_block kvm_cpu_notifier = {
 	.notifier_call = kvm_cpu_hotplug,
-	.priority = 20, /* must be > scheduler priority */
 };
 
 static int vm_stat_get(void *_offset, u64 *val)
-- 
1.7.1


* [KVM timekeeping 14/35] Add clock sync request to hardware enable
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (12 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 13/35] Perform hardware_enable in CPU_STARTING callback Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 15/35] Move scale_delta into common header Zachary Amsden
                   ` (22 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

If there are active VCPUs which are marked as belonging to
a particular hardware CPU, request a clock sync for them when
enabling hardware; the TSC could be desynchronized on a newly
arriving CPU, and we need to recompute guests' system time
relative to boot after a suspend event.

This covers both cases.

Note that it is acceptable to take the spinlock, as either
no other tasks will be running and no locks held (BSP after
resume), or other tasks will be guaranteed to drop the lock
relatively quickly (AP on CPU_STARTING).

Since we now get clock synchronization requests for VCPUs which are
starting up (or restarting), it is tempting to remove the
arch/x86/kvm/x86.c CPU hot-notifiers at this time; however, it is not
correct to do so.  They are required for systems with a non-constant
TSC, as the frequency may not be known immediately after the processor
has started, until the cpufreq driver has had a chance to run and
query the chipset.

Updated: implement better locking semantics for hardware_enable

Removed the hack of dropping and retaking the lock by adding the
semantic that we always hold kvm_lock when hardware_enable is
called.  The one place that doesn't need to worry about it is
resume; when resuming a frozen CPU, the spinlock won't be taken.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c  |    8 ++++++++
 virt/kvm/kvm_main.c |    6 +++++-
 2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0f3e5fb..1948c36 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5533,7 +5533,15 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 
 int kvm_arch_hardware_enable(void *garbage)
 {
+	struct kvm *kvm;
+	struct kvm_vcpu *vcpu;
+	int i;
+
 	kvm_shared_msr_cpu_online();
+	list_for_each_entry(kvm, &vm_list, vm_list)
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			if (vcpu->cpu == smp_processor_id())
+				kvm_request_guest_time_update(vcpu);
 	return kvm_x86_ops->hardware_enable(garbage);
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d4853a5..c0f4e70 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1961,7 +1961,9 @@ static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
 	case CPU_STARTING:
 		printk(KERN_INFO "kvm: enabling virtualization on CPU%d\n",
 		       cpu);
+		spin_lock(&kvm_lock);
 		hardware_enable(NULL);
+		spin_unlock(&kvm_lock);
 		break;
 	}
 	return NOTIFY_OK;
@@ -2166,8 +2168,10 @@ static int kvm_suspend(struct sys_device *dev, pm_message_t state)
 
 static int kvm_resume(struct sys_device *dev)
 {
-	if (kvm_usage_count)
+	if (kvm_usage_count) {
+		WARN_ON(spin_is_locked(&kvm_lock));
 		hardware_enable(NULL);
+	}
 	return 0;
 }
 
-- 
1.7.1



* [KVM timekeeping 15/35] Move scale_delta into common header
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (13 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 14/35] Add clock sync request to hardware enable Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 16/35] Fix a possible backwards warp of kvmclock Zachary Amsden
                   ` (21 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

The scale_delta function for shift / multiply with 31-bit
precision moves to a common header so it can be used by both the
kernel and the kvm module.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
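For reference, an equivalent of the function below written with a
128-bit integer type (illustration only; the inline asm in the patch
is what lets the same code build on 32-bit x86):

	static inline u64 scale_delta_ref(u64 delta, u32 mul_frac, int shift)
	{
		if (shift < 0)
			delta >>= -shift;
		else
			delta <<= shift;
		/* (delta * mul_frac) >> 32: multiply by a 32-bit fraction */
		return (u64)(((unsigned __int128)delta * mul_frac) >> 32);
	}
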
 arch/x86/include/asm/pvclock.h |   38 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/pvclock.c      |    3 ++-
 2 files changed, 40 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index cd02f32..7f7e577 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -12,4 +12,42 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
 			    struct pvclock_vcpu_time_info *vcpu,
 			    struct timespec *ts);
 
+/*
+ * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
+ * yielding a 64-bit result.
+ */
+static inline u64 pvclock_scale_delta(u64 delta, u32 mul_frac, int shift)
+{
+	u64 product;
+#ifdef __i386__
+	u32 tmp1, tmp2;
+#endif
+
+	if (shift < 0)
+		delta >>= -shift;
+	else
+		delta <<= shift;
+
+#ifdef __i386__
+	__asm__ (
+		"mul  %5       ; "
+		"mov  %4,%%eax ; "
+		"mov  %%edx,%4 ; "
+		"mul  %5       ; "
+		"xor  %5,%5    ; "
+		"add  %4,%%eax ; "
+		"adc  %5,%%edx ; "
+		: "=A" (product), "=r" (tmp1), "=r" (tmp2)
+		: "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
+#elif defined(__x86_64__)
+	__asm__ (
+		"mul %%rdx ; shrd $32,%%rdx,%%rax"
+		: "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
+#else
+#error implement me!
+#endif
+
+	return product;
+}
+
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 239427c..bab3b9e 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -82,7 +82,8 @@ static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
 static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
 {
 	u64 delta = native_read_tsc() - shadow->tsc_timestamp;
-	return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
+	return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
+				   shadow->tsc_shift);
 }
 
 /*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 16/35] Fix a possible backwards warp of kvmclock
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (14 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 15/35] Move scale_delta into common header Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2011-09-02 18:34   ` Philipp Hahn
  2010-08-20  8:07 ` [KVM timekeeping 17/35] Implement getnsboottime kernel API Zachary Amsden
                   ` (20 subsequent siblings)
  36 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Kernel time, which advances in discrete steps, may progress much more
slowly than the TSC.  As a result, when kvmclock is adjusted to a new
base, the apparent time to the guest, which runs at a much higher,
nsec-scaled rate based on the current TSC, may have already been
observed to have a larger value (kernel_ns + scaled tsc) than the
value to which we are setting it (kernel_ns + 0).

We must instead compute the clock as potentially observed by the guest
for kernel_ns to make sure it does not go backwards.
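
To make the aliasing concrete (numbers invented for illustration), take
a 1 GHz TSC (1 cycle = 1 ns) and HZ=250, so kernel_ns advances in
4,000,000 ns steps.  Suppose the clock was last set from the pair
(tsc=1,000,000, kernel_ns=4,000,000) and the guest then reads at
tsc=4,900,000, computing 4,000,000 + 3,900,000 = 7,900,000 ns.  If we
now rebase at tsc=5,000,000 while kernel_ns has not yet ticked, the
guest's next read returns roughly 4,000,000 ns, a backwards warp of
almost 4 ms; hence the clamp added below.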

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/x86.c              |   43 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 324e892..871800d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -339,6 +339,8 @@ struct kvm_vcpu_arch {
 	unsigned int time_offset;
 	struct page *time_page;
 	u64 last_host_tsc;
+	u64 last_guest_tsc;
+	u64 last_kernel_ns;
 
 	bool nmi_pending;
 	bool nmi_injected;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1948c36..fe74b42 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -976,14 +976,15 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	void *shared_kaddr;
 	unsigned long this_tsc_khz;
-	s64 kernel_ns;
+	s64 kernel_ns, max_kernel_ns;
+	u64 tsc_timestamp;
 
 	if ((!vcpu->time_page))
 		return 0;
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	kvm_get_msr(v, MSR_IA32_TSC, &vcpu->hv_clock.tsc_timestamp);
+	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
 	kernel_ns = get_kernel_ns();
 	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 	local_irq_restore(flags);
@@ -993,13 +994,49 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
 		return 1;
 	}
 
+	/*
+	 * Time as measured by the TSC may go backwards when resetting the base
+	 * tsc_timestamp.  The reason for this is that the TSC resolution is
+	 * higher than the resolution of the other clock scales.  Thus, many
+	 * possible measurements of the TSC correspond to one measurement of any
+	 * other clock, and so a spread of values is possible.  This is not a
+	 * problem for the computation of the nanosecond clock; with TSC rates
+	 * around 1GHz, there can only be a few cycles which correspond to one
+	 * nanosecond value, and any path through this code will inevitably
+	 * take longer than that.  However, with the kernel_ns value itself,
+	 * the precision may be much lower, down to HZ granularity.  If the
+	 * first sampling of TSC against kernel_ns ends in the low part of the
+	 * range, and the second in the high end of the range, we can get:
+	 *
+	 * (TSC - offset_low) * S + kns_old > (TSC - offset_high) * S + kns_new
+	 *
+	 * As the sampling errors potentially range in the thousands of cycles,
+	 * it is possible such a time value has already been observed by the
+	 * guest.  To protect against this, we must compute the system time as
+	 * observed by the guest and ensure the new system time is greater.
+	 */
+	max_kernel_ns = 0;
+	if (vcpu->hv_clock.tsc_timestamp && vcpu->last_guest_tsc) {
+		max_kernel_ns = vcpu->last_guest_tsc -
+				vcpu->hv_clock.tsc_timestamp;
+		max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
+				    vcpu->hv_clock.tsc_to_system_mul,
+				    vcpu->hv_clock.tsc_shift);
+		max_kernel_ns += vcpu->last_kernel_ns;
+	}
+
 	if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
 		kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock);
 		vcpu->hw_tsc_khz = this_tsc_khz;
 	}
 
+	if (max_kernel_ns > kernel_ns)
+		kernel_ns = max_kernel_ns;
+
 	/* With all the info we got, fill in the values */
+	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
 	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
+	vcpu->last_kernel_ns = kernel_ns;
 	vcpu->hv_clock.flags = 0;
 
 	/*
@@ -4931,6 +4968,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();
 
+	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+
 	atomic_set(&vcpu->guest_mode, 0);
 	smp_wmb();
 	local_irq_enable();
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 17/35] Implement getnsboottime kernel API
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (15 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 16/35] Fix a possible backwards warp of kvmclock Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20 18:39   ` john stultz
  2010-08-27 18:05   ` Jan Kiszka
  2010-08-20  8:07 ` [KVM timekeeping 18/35] Use getnsboottime in KVM Zachary Amsden
                   ` (19 subsequent siblings)
  36 siblings, 2 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Add a kernel call to get the number of nanoseconds since boot.  This
is generally useful enough to make it a generic call.
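
As a usage sketch (hypothetical caller, not part of this patch):

	s64 now_ns = getnsboottime();	/* ns since boot, including sleep */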

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 include/linux/time.h      |    1 +
 kernel/time/timekeeping.c |   27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/include/linux/time.h b/include/linux/time.h
index ea3559f..5d04108 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -145,6 +145,7 @@ extern void getnstimeofday(struct timespec *tv);
 extern void getrawmonotonic(struct timespec *ts);
 extern void getboottime(struct timespec *ts);
 extern void monotonic_to_bootbased(struct timespec *ts);
+extern s64 getnsboottime(void);
 
 extern struct timespec timespec_trunc(struct timespec t, unsigned gran);
 extern int timekeeping_valid_for_hres(void);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index caf8d4d..d250f0a 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -285,6 +285,33 @@ void ktime_get_ts(struct timespec *ts)
 }
 EXPORT_SYMBOL_GPL(ktime_get_ts);
 
+
+/**
+ * getnsboottime - get the bootbased clock in nsec format
+ *
+ * The function calculates the bootbased clock from the realtime
+ * clock, the wall_to_monotonic offset and the accumulated sleep
+ * time, and returns it as nanoseconds since boot.
+ */
+s64 getnsboottime(void)
+{
+	unsigned int seq;
+	s64 secs, nsecs;
+
+	WARN_ON(timekeeping_suspended);
+
+	do {
+		seq = read_seqbegin(&xtime_lock);
+		secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
+		secs += total_sleep_time.tv_sec;
+		nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
+		nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns();
+
+	} while (read_seqretry(&xtime_lock, seq));
+	return nsecs + (secs * NSEC_PER_SEC);
+}
+EXPORT_SYMBOL_GPL(getnsboottime);
+
 /**
  * do_gettimeofday - Returns the time of day in a timeval
  * @tv:		pointer to the timeval to be set
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 18/35] Use getnsboottime in KVM
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (16 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 17/35] Implement getnsboottime kernel API Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 19/35] Add timekeeping documentation Zachary Amsden
                   ` (18 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c |   18 ++++--------------
 1 files changed, 4 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fe74b42..ce03c2c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -893,16 +893,6 @@ static void kvm_set_time_scale(uint32_t tsc_khz, struct pvclock_vcpu_time_info *
 		 hv_clock->tsc_to_system_mul);
 }
 
-static inline u64 get_kernel_ns(void)
-{
-	struct timespec ts;
-
-	WARN_ON(preemptible());
-	ktime_get_ts(&ts);
-	monotonic_to_bootbased(&ts);
-	return timespec_to_ns(&ts);
-}
-
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 
 static inline int kvm_tsc_changes_freq(void)
@@ -932,7 +922,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 
 	spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = data - native_read_tsc();
-	ns = get_kernel_ns();
+	ns = getnsboottime();
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 	sdiff = data - kvm->arch.last_tsc_write;
 	if (sdiff < 0)
@@ -985,7 +975,7 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
 	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
-	kernel_ns = get_kernel_ns();
+	kernel_ns = getnsboottime();
 	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 	local_irq_restore(flags);
 
@@ -3331,7 +3321,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
 			goto out;
 
 		r = 0;
-		now_ns = get_kernel_ns();
+		now_ns = getnsboottime();
 		delta = user_ns.clock - now_ns;
 		kvm->arch.kvmclock_offset = delta;
 		break;
@@ -3340,7 +3330,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		struct kvm_clock_data user_ns;
 		u64 now_ns;
 
-		now_ns = get_kernel_ns();
+		now_ns = getnsboottime();
 		user_ns.clock = kvm->arch.kvmclock_offset + now_ns;
 		user_ns.flags = 0;
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 19/35] Add timekeeping documentation
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (17 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 18/35] Use getnsboottime in KVM Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20 17:50   ` Glauber Costa
  2010-08-20  8:07 ` [KVM timekeeping 20/35] Make math work for other scales Zachary Amsden
                   ` (17 subsequent siblings)
  36 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Basic informational document about x86 timekeeping and how KVM
is affected.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 Documentation/kvm/timekeeping.txt |  613 +++++++++++++++++++++++++++++++++++++
 1 files changed, 613 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/kvm/timekeeping.txt

diff --git a/Documentation/kvm/timekeeping.txt b/Documentation/kvm/timekeeping.txt
new file mode 100644
index 0000000..cbbc0d5
--- /dev/null
+++ b/Documentation/kvm/timekeeping.txt
@@ -0,0 +1,613 @@
+
+	Timekeeping Virtualization for X86-Based Architectures
+
+	Zachary Amsden <zamsden@redhat.com>
+	Copyright (c) 2010, Red Hat.  All rights reserved.
+
+1) Overview
+2) Timing Devices
+3) TSC Hardware
+4) Virtualization Problems
+
+=========================================================================
+
+1) Overview
+
+One of the most complicated parts of the X86 platform, and specifically of
+its virtualization, is the plethora of timing devices available and the
+complexity of emulating those devices.  In addition, virtualization of time
+introduces a new set of challenges, because it imposes a multiplexed
+division of time beyond the control of the guest CPU.
+
+First, we will describe the various timekeeping hardware available, then
+present some of the problems which arise and solutions available, giving
+specific recommendations for certain classes of KVM guests.
+
+The purpose of this document is to collect data and information relevant to
+timekeeping which may be difficult to find elsewhere, specifically,
+information relevant to KVM and hardware-based virtualization.
+
+=========================================================================
+
+2) Timing Devices
+
+First we discuss the basic hardware devices available.  TSC and the related
+KVM clock are special enough to warrant a full exposition and are described in
+the following section.
+
+2.1) i8254 - PIT
+
+One of the first timer devices available is the programmable interrupt timer,
+or PIT.  The PIT has a fixed frequency 1.193182 MHz base clock and three
+channels which can be programmed to deliver periodic or one-shot interrupts.
+These three channels can be configured in different modes and have individual
+counters.  Channels 1 and 2 were not available for general use in the original
+IBM PC, and historically were connected to control RAM refresh and the PC
+speaker.  Now the PIT is typically integrated as part of an emulated chipset
+and a separate physical PIT is not used.
+
+The PIT uses I/O ports 0x40 - 0x43.  Access to the 16-bit counters is done
+using single or multiple byte access to the I/O ports.  There are 6 modes
+available, but not all modes are available to all timers, as only timer 2
+has a connected gate input, required for modes 1 and 5.  The gate line is
+controlled by port 61h, bit 0, as illustrated in the following diagram.
+
+ --------------             ---------------- 
+|              |           |                |
+|  1.1932 MHz  |---------->| CLOCK      OUT | ---------> IRQ 0
+|    Clock     |   |       |                |
+ --------------    |    +->| GATE  TIMER 0  |
+                   |        ---------------- 
+                   |
+                   |        ----------------
+                   |       |                |
+                   |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
+                   |       |                |            (aka /dev/null)
+                   |    +->| GATE  TIMER 1  |
+                   |        ----------------
+                   |
+                   |        ---------------- 
+                   |       |                |
+                   |------>| CLOCK      OUT | ---------> Port 61h, bit 5
+                           |                |      |
+Port 61h, bit 0 ---------->| GATE  TIMER 2  |       \_.----   ____
+                            ----------------         _|    )--|LPF|---Speaker
+                                                    / *----   \___/
+Port 61h, bit 1 -----------------------------------/
+
+The timer modes are now described.
+
+Mode 0: Single Timeout.   This is a one-shot software timeout that counts down
+ when the gate is high (always true for timers 0 and 1).  When the count
+ reaches zero, the output goes high.
+
+Mode 1: Triggered One-shot.  The output is initially set high.  When the gate
+ line is set high, a countdown is initiated (which does not stop if the gate is
+ lowered), during which the output is set low.  When the count reaches zero,
+ the output goes high.
+
+Mode 2: Rate Generator.  The output is initially set high.  When the countdown
+ reaches 1, the output goes low for one count and then returns high.  The value
+ is reloaded and the countdown automatically resumes.  If the gate line goes
+ low, the count is halted.  If the output is low when the gate is lowered, the
+ output automatically goes high (this only affects timer 2).
+
+Mode 3: Square Wave.   This generates a high / low square wave.  The count
+ determines the length of the pulse, which alternates between high and low
+ when zero is reached.  The count only proceeds when gate is high and is
+ automatically reloaded on reaching zero.  The count is decremented twice at
+ each clock to generate a full high / low cycle at the full periodic rate.
+ If the count is even, the output remains high for N/2 counts and low for N/2
+ counts; if the count is odd, the output is high for (N+1)/2 counts and low
+ for (N-1)/2 counts.  Only even values are latched by the counter, so odd
+ values are not observed when reading.  This is the intended mode for timer 2,
+ which generates sine-like tones by low-pass filtering the square wave output.
+
+Mode 4: Software Strobe.  After programming this mode and loading the counter,
+ the output remains high until the counter reaches zero.  Then the output
+ goes low for 1 clock cycle and returns high.  The counter is not reloaded.
+ Counting only occurs when gate is high.
+
+Mode 5: Hardware Strobe.  After programming and loading the counter, the
+ output remains high.  When the gate is raised, a countdown is initiated
+ (which does not stop if the gate is lowered).  When the counter reaches zero,
+ the output goes low for 1 clock cycle and then returns high.  The counter is
+ not reloaded.
+
+In addition to normal binary counting, the PIT supports BCD counting.  The
+command port, 0x43 is used to set the counter and mode for each of the three
+timers.
+
+PIT commands are issued to port 0x43, using the following bit encoding:
+
+Bit 7-4: Command (See table below)
+Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
+Bit 0  : Binary (0) / BCD (1)
+
+Command table:
+
+0000 - Latch Timer 0 count for port 0x40
+	sample and hold the count to be read in port 0x40;
+	additional commands ignored until counter is read;
+	mode bits ignored.
+
+0001 - Set Timer 0 LSB mode for port 0x40
+	set timer to read LSB only and force MSB to zero;
+	mode bits set timer mode
+	
+0010 - Set Timer 0 MSB mode for port 0x40
+	set timer to read MSB only and force LSB to zero;
+	mode bits set timer mode
+
+0011 - Set Timer 0 16-bit mode for port 0x40
+	set timer to read / write LSB first, then MSB;
+	mode bits set timer mode
+
+0100 - Latch Timer 1 count for port 0x41 - as described above
+0101 - Set Timer 1 LSB mode for port 0x41 - as described above
+0110 - Set Timer 1 MSB mode for port 0x41 - as described above
+0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
+
+1000 - Latch Timer 2 count for port 0x42 - as described above
+1001 - Set Timer 2 LSB mode for port 0x42 - as described above
+1010 - Set Timer 2 MSB mode for port 0x42 - as described above
+1011 - Set Timer 2 16-bit mode for port 0x42 as described above
+
+1101 - General counter latch
+	Latch combination of counters into corresponding ports
+	Bit 3 = Counter 2
+	Bit 2 = Counter 1
+	Bit 1 = Counter 0
+	Bit 0 = Unused
+
+1110 - Latch timer status
+	Latch combination of counter mode into corresponding ports
+	Bit 3 = Counter 2
+	Bit 2 = Counter 1
+	Bit 1 = Counter 0
+
+	The output of ports 0x40-0x42 following this command will be:
+	
+	Bit 7 = Output pin
+	Bit 6 = Count loaded (0 if timer has expired)
+	Bit 5-4 = Read / Write mode
+	    01 = MSB only
+	    10 = LSB only
+	    11 = LSB / MSB (16-bit)
+	Bit 3-1 = Mode
+	Bit 0 = Binary (0) / BCD mode (1)
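+
+As a concrete sketch (illustrative only, not taken from any driver),
+programming timer 0 for a periodic interrupt using the encoding above:
+
+	#define PIT_HZ	1193182
+
+	static void pit_set_periodic(unsigned int hz)
+	{
+		u16 count = PIT_HZ / hz;	/* assumes 19 <= hz <= PIT_HZ */
+
+		outb(0x34, 0x43);		/* timer 0, 16-bit, mode 2, binary */
+		outb(count & 0xff, 0x40);	/* LSB first ... */
+		outb(count >> 8, 0x40);		/* ... then MSB */
+	}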
+
+2.2) RTC
+
+The second device which was available in the original PC was the MC146818 real
+time clock.  The original device is now obsolete, and usually emulated by the
+system chipset, sometimes by an HPET and some frankenstein IRQ routing.
+
+The RTC is accessed through CMOS variables, using an index register to
+control which bytes are read.  Since there is only one index register, reads
+of the CMOS and reads of the RTC require lock protection (in addition, it is
+dangerous to allow userspace utilities such as hwclock to have direct RTC
+access, as they could corrupt kernel reads and writes of CMOS memory).
+
+The RTC generates an interrupt which is usually routed to IRQ 8.  The interrupt
+can function as a periodic timer, as a once-a-day alarm, and can issue
+interrupts after an update of the CMOS registers by the MC146818 is complete.
+The type of interrupt is signalled in the RTC status registers.
+
+The RTC will update the current time fields by battery power even while the
+system is off.  The current time fields should not be read while an update is
+in progress, as indicated in the status register.
+
+The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
+programmed to a 32kHz divider if the RTC is to count seconds.
+
+This is the RAM map originally used for the RTC/CMOS:
+
+Location    Size    Description
+------------------------------------------
+00h         byte    Current second (BCD)
+01h         byte    Seconds alarm (BCD)
+02h         byte    Current minute (BCD)
+03h         byte    Minutes alarm (BCD)
+04h         byte    Current hour (BCD)
+05h         byte    Hours alarm (BCD)
+06h         byte    Current day of week (BCD)
+07h         byte    Current day of month (BCD)
+08h         byte    Current month (BCD)
+09h         byte    Current year (BCD)
+0Ah         byte    Register A
+                       bit 7   = Update in progress
+                       bit 6-4 = Divider for clock
+                                  000 = 4.194 MHz
+                                  001 = 1.049 MHz
+                                  010 = 32 kHz
+                                  10X = test modes
+                                  110 = reset / disable
+                                  111 = reset / disable
+                       bit 3-0 = Rate selection for periodic interrupt
+                                  000 = periodic timer disabled
+                                  001 = 3.90625 mS
+                                  010 = 7.8125 mS
+                                  011 = .122070 mS
+                                  100 = .244141 mS
+                                     ...
+                                 1101 = 125 mS
+                                 1110 = 250 mS
+                                 1111 = 500 mS
+0Bh         byte    Register B
+                       bit 7   = Run (0) / Halt (1)
+                       bit 6   = Periodic interrupt enable
+                       bit 5   = Alarm interrupt enable
+                       bit 4   = Update-ended interrupt enable
+                       bit 3   = Square wave interrupt enable
+                       bit 2   = BCD calendar (0) / Binary (1)
+                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
+                       bit 0   = 0 (DST off) / 1 (DST enabled)
+0Ch         byte    Register C (read only)
+                       bit 7   = interrupt request flag (IRQF)
+                       bit 6   = periodic interrupt flag (PF)
+                       bit 5   = alarm interrupt flag (AF)
+                       bit 4   = update interrupt flag (UF)
+                       bit 3-0 = reserved
+0Dh         byte    Register D (read only)
+                       bit 7   = RTC has power
+                       bit 6-0 = reserved
+32h         byte    Current century BCD (*)
+  (*) location vendor specific and now determined from ACPI global tables
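+
+For illustration (a hedged sketch; locking against concurrent CMOS users
+is omitted), enabling a 1024 Hz periodic interrupt through the index /
+data ports 0x70 / 0x71:
+
+	static void rtc_enable_periodic_1024hz(void)
+	{
+		u8 v;
+
+		outb(0x0A, 0x70);		/* select register A */
+		v = inb(0x71) & 0xf0;
+		outb(0x0A, 0x70);
+		outb(v | 0x06, 0x71);		/* rate select 0110 = 1024 Hz */
+		outb(0x0B, 0x70);		/* select register B */
+		v = inb(0x71);
+		outb(0x0B, 0x70);
+		outb(v | 0x40, 0x71);		/* periodic interrupt enable */
+	}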
+
+2.3) APIC
+
+On Pentium and later processors, an on-board timer is available to each CPU
+as part of the Advanced Programmable Interrupt Controller.  The APIC is
+accessed through memory-mapped registers and provides interrupt service to each
+CPU, used for IPIs and local timer interrupts.
+
+Although in theory the APIC is a safe and stable source for local interrupts,
+in practice, many bugs and glitches have occurred due to the special nature of
+the APIC CPU-local memory-mapped hardware.  Beware that CPU errata may affect
+the use of the APIC and that workarounds may be required.  In addition, some of
+these workarounds pose unique constraints for virtualization - requiring either
+extra overhead incurred from extra reads of memory-mapped I/O or additional
+functionality that may be more computationally expensive to implement.
+
+Since the APIC is documented quite well in the Intel and AMD manuals, we will
+avoid repetition of the detail here.  It should be pointed out that the APIC
+timer is programmed through the LVT (local vector timer) register, is capable
+of one-shot or periodic operation, and is based on the bus clock divided down
+by the programmable divider register.
+
+2.4) HPET
+
+HPET is quite complex, and was originally intended to replace the PIT / RTC
+support of the X86 PC.  It remains to be seen whether that will be the case, as
+the de facto standard of PC hardware is to emulate these older devices.  Some
+systems designated as legacy free may support only the HPET as a hardware timer
+device.
+
+The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
+but allowing implementation freedom to support many more.  It also imposes no
+fixed rate on the timer frequency, but does impose some extremal values on
+frequency, error and slew.
+
+In general, the HPET is recommended as a high precision (compared to PIT / RTC)
+time source which is independent of local variation (as there is only one HPET
+in any given system).  The HPET is also memory-mapped, and its presence is
+indicated through ACPI tables by the BIOS.
+
+Detailed specification of the HPET is beyond the current scope of this
+document, as it is also very well documented elsewhere.
+
+2.5) Offboard Timers
+
+Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
+timing chips built into the cards which may have registers which are accessible
+to kernel or user drivers.  To the author's knowledge, using these to generate
+a clocksource for a Linux or other kernel has not yet been attempted and is in
+general frowned upon as not playing by the agreed rules of the game.  Such a
+timer device would require additional support to be virtualized properly and is
+not considered important at this time as no known operating system does this.
+
+=========================================================================
+
+3) TSC Hardware
+
+The TSC or time stamp counter is relatively simple in theory; it counts
+clock cycles issued by the processor, which can be used as a measure of
+time.  In practice, due to a number of problems, it is the most complicated
+timekeeping device to use.
+
+The TSC is represented internally as a 64-bit MSR which can be read with the
+RDMSR, RDTSC, or RDTSCP (when available) instructions.  In the past, hardware
+limitations restricted writes to the TSC: generally, on old hardware it
+was only possible to write the low 32-bits of the 64-bit counter, and the upper
+32-bits of the counter were cleared.  Now, however, on Intel processors family
+0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
+has been lifted and all 64-bits are writable.  On AMD systems, the ability to
+write the TSC MSR is not an architectural guarantee.
+
+The TSC is always accessible from CPL 0, and conditionally accessible to
+CPL > 0 software by means of the CR4.TSD bit, which, when set, disables
+CPL > 0 TSC access.
+
+Some vendors have implemented an additional instruction, RDTSCP, which returns
+atomically not just the TSC, but an indicator which corresponds to the
+processor number.  This can be used to index into an array of TSC variables to
+determine offset information in SMP systems where TSCs are not synchronized.
+The presence of this instruction must be determined by consulting CPUID feature
+bits.
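+
+A minimal sketch of such a read (assuming RDTSCP support has already been
+verified via CPUID; the u32/u64 types follow kernel convention):
+
+	static inline u64 rdtscp_read(u32 *cpu_id)
+	{
+		u32 lo, hi, aux;
+
+		asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
+		*cpu_id = aux;	/* IA32_TSC_AUX, set up per-CPU by the OS */
+		return ((u64)hi << 32) | lo;
+	}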
+
+Both VMX and SVM provide extension fields in the virtualization hardware which
+allow the guest-visible TSC to be offset by a constant.  Newer implementations
+promise to allow the TSC to additionally be scaled, but this hardware is not
+yet widely available.
+
+3.1) TSC synchronization
+
+The TSC is a CPU-local clock in most implementations.  This means, on SMP
+platforms, the TSCs of different CPUs may start at different times depending
+on when the CPUs are powered on.  Generally, CPUs on the same die will share
+the same clock, however, this is not always the case.
+
+The BIOS may attempt to resynchronize the TSCs during the poweron process and
+the operating system or other system software may attempt to do this as well.
+Several hardware limitations make the problem worse - if it is not possible to
+write the full 64-bits of the TSC, it may be impossible to match the TSC in
+newly arriving CPUs to that of the rest of the system, resulting in
+unsynchronized TSCs.  This may be done by BIOS or system software, but in
+practice, getting a perfectly synchronized TSC will not be possible unless all
+values are read from the same clock, which generally only is possible on single
+socket systems or those with special hardware support.
+
+3.2) TSC and CPU hotplug
+
+As touched on already, CPUs which arrive later than the boot time of the system
+may not have a TSC value that is synchronized with the rest of the system.
+Either system software, BIOS, or SMM code may actually try to set the TSC
+to a value matching the rest of the system, but a perfect match is usually not
+a guarantee.  This can have the effect of bringing a system from a state where
+TSC is synchronized back to a state where TSC synchronization flaws, however
+small, may be exposed to the OS and any virtualization environment.
+
+3.3) TSC and multi-socket / NUMA
+
+Multi-socket systems, especially large multi-socket systems are likely to have
+individual clocksources rather than a single, universally distributed clock.
+Since these clocks are driven by different crystals, they will not have
+perfectly matched frequency, and temperature and electrical variations will
+cause the CPU clocks, and thus the TSCs to drift over time.  Depending on the
+exact clock and bus design, the drift may or may not be fixed in absolute
+error, and may accumulate over time.
+
+In addition, very large systems may deliberately slew the clocks of individual
+cores.  This technique, known as spread-spectrum clocking, reduces EMI at the
+clock frequency and harmonics of it, which may be required to pass FCC
+standards for telecommunications and computer equipment.
+
+It is recommended not to trust the TSCs to remain synchronized on NUMA or
+multiple socket systems for these reasons.
+
+3.4) TSC and C-states
+
+C-states, or idling states of the processor, especially C1E and deeper sleep
+states may be problematic for TSC as well.  The TSC may stop advancing in such
+a state, resulting in a TSC which is behind that of other CPUs when execution
+is resumed.  Such CPUs must be detected and flagged by the operating system
+based on CPU and chipset identifications.
+
+The TSC in such a case may be corrected by catching it up to a known external
+clocksource.
+
+3.5) TSC frequency change / P-states
+
+To make things slightly more interesting, some CPUs may change frequency.  They
+may or may not run the TSC at the same rate, and because the frequency change
+may be staggered or slewed, at some points in time, the TSC rate may not be
+known other than falling within a range of values.  In this case, the TSC will
+not be a stable time source, and must be calibrated against a known, stable,
+external clock to be a usable source of time.
+
+Whether the TSC runs at a constant rate or scales with the P-state is model
+dependent and must be determined by inspecting CPUID, chipset or vendor
+specific MSR fields.
+
+In addition, some vendors have known bugs where the P-state is actually
+compensated for properly during normal operation, but when the processor is
+inactive, the P-state may be raised temporarily to service cache misses from
+other processors.  In such cases, the TSC on halted CPUs could advance faster
+than that of non-halted processors.  AMD Turion processors are known to have
+this problem.
+
+3.6) TSC and STPCLK / T-states
+
+External signals given to the processor may also have the effect of stopping
+the TSC.  This is typically done for thermal emergency power control to prevent
+an overheating condition, and typically, there is no way to detect that this
+condition has happened.
+
+3.7) TSC virtualization - VMX
+
+VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
+instructions, which is enough for full virtualization of TSC in any manner.  In
+addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
+field specified in the VMCS.  Special instructions must be used to read and
+write the VMCS field.
+
+3.8) TSC virtualization - SVM
+
+SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
+instructions, which is enough for full virtualization of TSC in any manner.  In
+addition, SVM allows passing through the host TSC plus an additional offset
+field specified in the SVM control block.
+
+3.9) TSC feature bits in Linux
+
+In summary, there is no way to guarantee the TSC remains in perfect
+synchronization unless it is explicitly guaranteed by the architecture.  Even
+if so, the TSCs in multi-sockets or NUMA systems may still run independently
+despite being locally consistent.
+
+The following feature bits are used by Linux to signal various TSC attributes,
+but they can only be taken to be meaningful for UP or single node systems.
+
+X86_FEATURE_TSC 		: The TSC is available in hardware
+X86_FEATURE_RDTSCP		: The RDTSCP instruction is available
+X86_FEATURE_CONSTANT_TSC 	: The TSC rate is unchanged with P-states
+X86_FEATURE_NONSTOP_TSC		: The TSC does not stop in C-states
+X86_FEATURE_TSC_RELIABLE	: TSC sync checks are skipped (VMware)
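+
+A hedged sketch of how kernel code might consult these (only meaningful,
+as noted, for UP or single node systems):
+
+	bool tsc_invariant = boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+			     boot_cpu_has(X86_FEATURE_NONSTOP_TSC);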
+
+=========================================================================
+
+4) Virtualization Problems
+
+Timekeeping is especially problematic for virtualization because a number of
+challenges arise.  The most obvious problem is that time is now shared between
+the host and, potentially, a number of virtual machines.  Thus the virtual
+operating system does not run with 100% usage of the CPU, despite the fact that
+it may very well make that assumption.  It may expect it to remain true to very
+exacting bounds when interrupt sources are disabled, but in reality only its
+virtual interrupt sources are disabled, and the machine may still be preempted
+at any time.  This causes problems as the passage of real time, the injection
+of machine interrupts and the associated clock sources are no longer completely
+synchronized with real time.
+
+This same problem can occur on native hardware to a degree, as SMM mode may
+steal cycles from the OS; this happens naturally on X86 systems when SMM mode
+is used by the BIOS, though not in such an extreme fashion.  However, the fact
+that SMM mode may
+cause similar problems to virtualization makes it a good justification for
+solving many of these problems on bare metal.
+
+4.1) Interrupt clocking
+
+One of the most immediate problems that occurs with legacy operating systems
+is that the system timekeeping routines are often designed to keep track of
+time by counting periodic interrupts.  These interrupts may come from the PIT
+or the RTC, but the problem is the same: the host virtualization engine may not
+be able to deliver the proper number of interrupts per second, and so guest
+time may fall behind.  This is especially problematic if a high interrupt rate
+is selected, such as 1000 HZ, which is unfortunately the default for many Linux
+guests.
+
+There are three approaches to solving this problem; first, it may be possible
+to simply ignore it.  Guests which have a separate time source for tracking
+'wall clock' or 'real time' may not need any adjustment of their interrupts to
+maintain proper time.  If this is not sufficient, it may be necessary to inject
+additional interrupts into the guest in order to increase the effective
+interrupt rate.  This approach leads to complications in extreme conditions,
+where host load or guest lag is too much to compensate for, and thus another
+solution to the problem has risen: the guest may need to become aware of lost
+ticks and compensate for them internally.  Although promising in theory, the
+implementation of this policy in Linux has been extremely error prone, and a
+number of buggy variants of lost tick compensation are distributed across
+commonly used Linux systems.
+
+Windows uses periodic RTC clocking as a means of keeping time internally, and
+thus requires interrupt slewing to keep proper time.  It does, however, use a
+low enough rate (64 Hz by default) that this has not yet been a problem in
+practice.
+
+4.2) TSC sampling and serialization
+
+As the highest precision time source available, the cycle counter of the CPU
+has aroused much interest from developers.  As explained above, this timer has
+many problems unique to its nature as a local, potentially unstable and
+potentially unsynchronized source.  One issue which is not unique to the TSC,
+but is highlighted because of its very precise nature is sampling delay.  By
+definition, the counter, once read, is already old.  However, it is also
+possible for the counter to be read ahead of the actual use of the result.
+This is a consequence of the superscalar execution of the instruction stream,
+which may execute instructions out of order.  Such execution is called
+non-serialized.  Forcing serialized execution is necessary for precise
+measurement with the TSC, and requires a serializing instruction, such as CPUID
+or an MSR read.
+
+Since CPUID may actually be virtualized by a trap and emulate mechanism, this
+serialization can pose a performance issue for hardware virtualization.  An
+accurate time stamp counter reading may therefore not always be available, and
+it may be necessary for an implementation to guard against "backwards" reads of
+the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
+system.
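+
+Where RDTSCP is not available, serialization can be approximated by
+fencing the read (a sketch only; the exact fencing requirements differ
+between Intel and AMD):
+
+	static inline u64 rdtsc_fenced(void)
+	{
+		u32 lo, hi;
+
+		asm volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi));
+		return ((u64)hi << 32) | lo;
+	}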
+
+4.3) Timespec aliasing
+
+Additionally, this lack of serialization from the TSC poses another challenge
+when using results of the TSC when measured against another time source.  As
+the TSC is much higher precision, many possible values of the TSC may be read
+while another clock is still expressing the same value.
+
+That is, you may read (T,T+10) while external clock C maintains the same value.
+Due to non-serialized reads, you may actually end up with a range which
+fluctuates, e.g. from (T-1 .. T+10).  Thus, any time calculated from a TSC, but
+calibrated against an external value may have a range of valid values.
+Re-calibrating this computation may actually cause time, as computed after the
+calibration, to go backwards, compared with time computed before the
+calibration.
+
+This problem is particularly pronounced with an internal time source in Linux,
+the kernel time, which is expressed in the theoretically high resolution
+timespec - but which advances in much larger granularity intervals, sometimes
+at the rate of jiffies, and possibly in catchup modes, at a much larger step.
+
+This aliasing requires care in the computation and recalibration of kvmclock
+and any other values derived from TSC computation (such as TSC virtualization
+itself).
+
+4.4) Migration
+
+Migration of a virtual machine raises problems for timekeeping in two ways.
+First, the migration itself may take time, during which interrupts cannot be
+delivered, and after which, the guest time may need to be caught up.  NTP may
+be able to help to some degree here, as the clock correction required is
+typically small enough to fall in the NTP-correctable window.
+
+An additional concern is that timers based off the TSC (or HPET, if the raw bus
+clock is exposed) may now be running at different rates, requiring compensation
+in some way in the hypervisor by virtualizing these timers.  In addition,
+migrating to a faster machine may preclude the use of a passthrough TSC, as a
+faster clock cannot be made visible to a guest without the potential of time
+advancing faster than usual.  A slower clock is less of a problem, as it can
+always be caught up to the original rate.  KVM clock avoids these problems by
+simply storing multipliers and offsets against the TSC for the guest to convert
+back into nanosecond resolution values.
+
+4.5) Scheduling
+
+Since scheduling may be based on precise timing and firing of interrupts, the
+scheduling algorithms of an operating system may be adversely affected by
+virtualization.  In theory, the effect is random and should be universally
+distributed, but in contrived as well as real scenarios (guest device access,
+causes of virtualization exits, possible context switch), this may not always
+be the case.  The effect of this has not been well studied.
+
+In an attempt to work around this, several implementations have provided a
+paravirtualized scheduler clock, which reveals the true amount of CPU time for
+which a virtual machine has been running.
+
+4.6) Watchdogs
+
+Watchdog timers, such as the lockup detector in Linux, may fire accidentally when
+running under hardware virtualization due to timer interrupts being delayed or
+misinterpretation of the passage of real time.  Usually, these warnings are
+spurious and can be ignored, but in some circumstances it may be necessary to
+disable such detection.
+
+4.7) Delays and precision timing
+
+Precise timing and delays may not be possible in a virtualized system.  This
+can happen if the system is controlling physical hardware, or issues delays to
+compensate for slower I/O to and from devices.  The first issue is not solvable
+in general for a virtualized system; hardware control software can't be
+adequately virtualized without a full real-time operating system, which would
+require an RT aware virtualization platform.
+
+The second issue may cause performance problems, but this is unlikely to be a
+significant issue.  In many cases these delays may be eliminated through
+configuration or paravirtualization.
+
+4.8) Covert channels and leaks
+
+In addition to the above problems, time information will inevitably leak to the
+guest about the host in anything but a perfect implementation of virtualized
+time.  This may allow the guest to infer the presence of a hypervisor (as in a
+red-pill type detection), and it may allow information to leak between guests
+by using CPU utilization itself as a signalling channel.  Preventing such
+problems would require completely isolated virtual time which may not track
+real time any longer.  This may be useful in certain security or QA contexts,
+but in general isn't recommended for real-world deployment scenarios.
+
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 20/35] Make math work for other scales
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (18 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 19/35] Add timekeeping documentation Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 21/35] Track max tsc_khz Zachary Amsden
                   ` (16 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

The math in kvm_get_time_scale relies on the fact that
NSEC_PER_SEC < 2^32.  To use the same function to compute
arbitrary time scales, we must extend the first reduction
step to shrink the base rate to a 32-bit value, and
possibly reduce the scaled rate to a 32-bit value as well.

Note we must take care to avoid an arithmetic overflow
when scaling up the tps32 value (this could not happen
with the fixed scaled value of NSEC_PER_SEC, but can
happen with scaled rates above 2^31).
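
As a worked example (invented numbers): for base_khz = 2600000 (a 2.6 GHz
TSC) scaled to nanoseconds (scaled_khz = 1000000), the first loop halves
tps64 once, giving shift = -1, and the second loop exits immediately, so
the multiplier becomes div_frac(10^9, 1.3*10^9) ~= 0.769 * 2^32.
pvclock_scale_delta then computes ((cycles >> 1) * mul) >> 32, which is
approximately cycles / 2.6, as expected.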

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c |   30 ++++++++++++++++++------------
 1 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ce03c2c..718bed9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -866,31 +866,35 @@ static uint32_t div_frac(uint32_t dividend, uint32_t divisor)
 	return quotient;
 }
 
-static void kvm_set_time_scale(uint32_t tsc_khz, struct pvclock_vcpu_time_info *hv_clock)
+static void kvm_get_time_scale(uint32_t scaled_khz, uint32_t base_khz,
+			       s8 *pshift, u32 *pmultiplier)
 {
-	uint64_t nsecs = 1000000000LL;
+	uint64_t scaled64;
 	int32_t  shift = 0;
 	uint64_t tps64;
 	uint32_t tps32;
 
-	tps64 = tsc_khz * 1000LL;
-	while (tps64 > nsecs*2) {
+	tps64 = base_khz * 1000LL;
+	scaled64 = scaled_khz * 1000LL;
+	while (tps64 > scaled64*2 || tps64 & 0xffffffff00000000UL) {
 		tps64 >>= 1;
 		shift--;
 	}
 
 	tps32 = (uint32_t)tps64;
-	while (tps32 <= (uint32_t)nsecs) {
-		tps32 <<= 1;
+	while (tps32 <= scaled64 || scaled64 & 0xffffffff00000000UL) {
+		if (scaled64 & 0xffffffff00000000UL || tps32 & 0x80000000)
+			scaled64 >>= 1;
+		else
+			tps32 <<= 1;
 		shift++;
 	}
 
-	hv_clock->tsc_shift = shift;
-	hv_clock->tsc_to_system_mul = div_frac(nsecs, tps32);
+	*pshift = shift;
+	*pmultiplier = div_frac(scaled64, tps32);
 
-	pr_debug("%s: tsc_khz %u, tsc_shift %d, tsc_mul %u\n",
-		 __func__, tsc_khz, hv_clock->tsc_shift,
-		 hv_clock->tsc_to_system_mul);
+	pr_debug("%s: base_khz %u => %u, shift %d, mul %u\n",
+		 __func__, base_khz, scaled_khz, shift, *pmultiplier);
 }
 
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
@@ -1016,7 +1020,9 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
 	}
 
 	if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
-		kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock);
+		kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
+				   &vcpu->hv_clock.tsc_shift,
+				   &vcpu->hv_clock.tsc_to_system_mul);
 		vcpu->hw_tsc_khz = this_tsc_khz;
 	}
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 21/35] Track max tsc_khz
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (19 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 20/35] Make math work for other scales Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 22/35] Track tsc last write in vcpu Zachary Amsden
                   ` (15 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel
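
Record the highest frequency the host TSC may run at: when the TSC rate
varies with cpufreq, take cpuinfo.max_freq from the cpufreq policy,
otherwise use the calibrated tsc_khz.  Later patches use this as the
default rate for the virtual TSC.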

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 718bed9..517daf3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -898,6 +898,7 @@ static void kvm_get_time_scale(uint32_t scaled_khz, uint32_t base_khz,
 }
 
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
+unsigned long max_tsc_khz;
 
 static inline int kvm_tsc_changes_freq(void)
 {
@@ -4351,11 +4352,20 @@ static void kvm_timer_init(void)
 {
 	int cpu;
 
+	max_tsc_khz = tsc_khz;
 	register_hotcpu_notifier(&kvmclock_cpu_notifier_block);
 	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+#ifdef CONFIG_CPU_FREQ
+		struct cpufreq_policy policy;
+		memset(&policy, 0, sizeof(policy));
+		cpufreq_get_policy(&policy, get_cpu());
+		if (policy.cpuinfo.max_freq)
+			max_tsc_khz = policy.cpuinfo.max_freq;
+#endif
 		cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block,
 					  CPUFREQ_TRANSITION_NOTIFIER);
 	}
+	pr_debug("kvm: max_tsc_khz = %ld\n", max_tsc_khz);
 	for_each_online_cpu(cpu)
 		smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 22/35] Track tsc last write in vcpu
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (20 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 21/35] Track max tsc_khz Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 23/35] Set initial TSC rate conversion factors Zachary Amsden
                   ` (14 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

The reason becomes apparent soon; this is an easy-to-digest step.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    2 ++
 arch/x86/kvm/x86.c              |    2 ++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 871800d..638cae5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -341,6 +341,8 @@ struct kvm_vcpu_arch {
 	u64 last_host_tsc;
 	u64 last_guest_tsc;
 	u64 last_kernel_ns;
+	u64 last_tsc_nsec;
+	u64 last_tsc_write;
 
 	bool nmi_pending;
 	bool nmi_injected;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 517daf3..3691e70 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -962,6 +962,8 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 
 	/* Reset of TSC must disable overshoot protection below */
 	vcpu->arch.hv_clock.tsc_timestamp = 0;
+	vcpu->arch.last_tsc_write = data;
+	vcpu->arch.last_tsc_nsec = ns;
 }
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 23/35] Set initial TSC rate conversion factors
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (21 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 22/35] Track tsc last write in vcpu Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 24/35] Timer request function renaming Zachary Amsden
                   ` (13 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel
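
Store per-VM conversion factors (virtual_tsc_khz, virtual_tsc_mult,
virtual_tsc_shift) for converting elapsed nanoseconds into guest TSC
cycles, defaulting the virtual TSC rate to max_tsc_khz when the first
VCPU is created.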

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    3 +++
 arch/x86/kvm/x86.c              |   12 ++++++++++++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 638cae5..3a54cc1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -403,6 +403,9 @@ struct kvm_arch {
 	u64 last_tsc_nsec;
 	u64 last_tsc_offset;
 	u64 last_tsc_write;
+	u32 virtual_tsc_khz;
+	u32 virtual_tsc_mult;
+	s8 virtual_tsc_shift;
 
 	struct kvm_xen_hvm_config xen_hvm_config;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3691e70..d715ec0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -918,6 +918,15 @@ static inline u64 nsec_to_cycles(u64 nsec)
 	return (nsec * __get_cpu_var(cpu_tsc_khz)) / USEC_PER_SEC;
 }
 
+static void kvm_arch_set_tsc_khz(struct kvm *kvm, u32 this_tsc_khz)
+{
+	/* Compute a scale to convert nanoseconds in TSC cycles */
+	kvm_get_time_scale(this_tsc_khz, NSEC_PER_SEC / 1000,
+			   &kvm->arch.virtual_tsc_shift,
+			   &kvm->arch.virtual_tsc_mult);
+	kvm->arch.virtual_tsc_khz = this_tsc_khz;
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
@@ -5636,6 +5645,9 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 	}
 	vcpu->arch.pio_data = page_address(page);
 
+	if (!kvm->arch.virtual_tsc_khz)
+		kvm_arch_set_tsc_khz(kvm, max_tsc_khz);
+
 	r = kvm_mmu_create(vcpu);
 	if (r < 0)
 		goto fail_free_pio_data;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 24/35] Timer request function renaming
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (22 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 23/35] Set initial TSC rate conversion factors Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 25/35] Add clock catchup mode Zachary Amsden
                   ` (12 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

A separate step to prepare for the next change.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c       |   12 ++++++------
 include/linux/kvm_host.h |    2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d715ec0..ac0b2d9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -838,7 +838,7 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
 
 	/*
 	 * The guest calculates current wall clock time by adding
-	 * system time (updated by kvm_write_guest_time below) to the
+	 * system time (updated by kvm_guest_time_update below) to the
 	 * wall clock specified here.  guest system time equals host
 	 * system time for us, thus we must fill in host boot time here.
 	 */
@@ -976,7 +976,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 }
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
-static int kvm_write_guest_time(struct kvm_vcpu *v)
+static int kvm_guest_time_update(struct kvm_vcpu *v)
 {
 	unsigned long flags;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
@@ -996,7 +996,7 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
 	local_irq_restore(flags);
 
 	if (unlikely(this_tsc_khz == 0)) {
-		kvm_make_request(KVM_REQ_KVMCLOCK_UPDATE, v);
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
 		return 1;
 	}
 
@@ -1071,7 +1071,7 @@ static int kvm_request_guest_time_update(struct kvm_vcpu *v)
 
 	if (!vcpu->time_page)
 		return 0;
-	kvm_make_request(KVM_REQ_KVMCLOCK_UPDATE, v);
+	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
 	return 1;
 }
 
@@ -4896,8 +4896,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_mmu_unload(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
 			__kvm_migrate_timers(vcpu);
-		if (kvm_check_request(KVM_REQ_KVMCLOCK_UPDATE, vcpu)) {
-			r = kvm_write_guest_time(vcpu);
+		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
+			r = kvm_guest_time_update(vcpu);
 			if (unlikely(r))
 				goto out;
 		}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c13cc48..b3ede70 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -36,7 +36,7 @@
 #define KVM_REQ_PENDING_TIMER      5
 #define KVM_REQ_UNHALT             6
 #define KVM_REQ_MMU_SYNC           7
-#define KVM_REQ_KVMCLOCK_UPDATE    8
+#define KVM_REQ_CLOCK_UPDATE       8
 #define KVM_REQ_KICK               9
 #define KVM_REQ_DEACTIVATE_FPU    10
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 25/35] Add clock catchup mode
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (23 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 24/35] Timer request function renaming Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-25 17:27   ` Marcelo Tosatti
  2010-08-20  8:07 ` [KVM timekeeping 26/35] Catchup slower TSC to guest rate Zachary Amsden
                   ` (11 subsequent siblings)
  36 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Make the clock update handler perform generic clock synchronization,
not just kvmclock updates.  We add a catchup mode which keeps the
passthrough TSC in line with the absolute guest TSC.
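
As a sketch of the invariant the catchup path maintains (not the
literal kernel code; see compute_guest_tsc below for the real form):

    /*
     * expected = last_tsc_write
     *          + ns_to_guest_cycles(kernel_ns - last_tsc_nsec);
     * if (expected > hardware_guest_tsc)
     *         tsc_offset += expected - hardware_guest_tsc;
     *
     * ns_to_guest_cycles() stands for pvclock_scale_delta() with the
     * virtual_tsc_mult/shift pair, i.e. roughly
     * ns * virtual_tsc_khz / 10^6 in fixed point.
     */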

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/x86.c              |   55 ++++++++++++++++++++++++++------------
 2 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3a54cc1..ec1dc3a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -343,6 +343,7 @@ struct kvm_vcpu_arch {
 	u64 last_kernel_ns;
 	u64 last_tsc_nsec;
 	u64 last_tsc_write;
+	bool tsc_rebase;
 
 	bool nmi_pending;
 	bool nmi_injected;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ac0b2d9..a4215d7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -927,6 +927,15 @@ static void kvm_arch_set_tsc_khz(struct kvm *kvm, u32 this_tsc_khz)
 	kvm->arch.virtual_tsc_khz = this_tsc_khz;
 }
 
+static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
+{
+	u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.last_tsc_nsec,
+				      vcpu->kvm->arch.virtual_tsc_mult,
+				      vcpu->kvm->arch.virtual_tsc_shift);
+	tsc += vcpu->arch.last_tsc_write;
+	return tsc;
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
@@ -984,22 +993,29 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	unsigned long this_tsc_khz;
 	s64 kernel_ns, max_kernel_ns;
 	u64 tsc_timestamp;
-
-	if ((!vcpu->time_page))
-		return 0;
+	bool catchup = (!vcpu->time_page);
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
 	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
 	kernel_ns = getnsboottime();
 	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
-	local_irq_restore(flags);
 
 	if (unlikely(this_tsc_khz == 0)) {
+		local_irq_restore(flags);
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
 		return 1;
 	}
 
+	if (catchup) {
+		u64 tsc = compute_guest_tsc(v, kernel_ns);
+		if (tsc > tsc_timestamp)
+			kvm_x86_ops->adjust_tsc_offset(v, tsc-tsc_timestamp);
+	}
+	local_irq_restore(flags);
+	if (catchup)
+		return 0;
+
 	/*
 	 * Time as measured by the TSC may go backwards when resetting the base
 	 * tsc_timestamp.  The reason for this is that the TSC resolution is
@@ -1065,14 +1081,9 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	return 0;
 }
 
-static int kvm_request_guest_time_update(struct kvm_vcpu *v)
+static void kvm_request_clock_update(struct kvm_vcpu *v)
 {
-	struct kvm_vcpu_arch *vcpu = &v->arch;
-
-	if (!vcpu->time_page)
-		return 0;
 	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
-	return 1;
 }
 
 static bool msr_mtrr_valid(unsigned msr)
@@ -1398,6 +1409,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		}
 
 		vcpu->arch.time = data;
+		kvm_request_clock_update(vcpu);
 
 		/* we verify if the enable bit is set... */
 		if (!(data & 1))
@@ -1413,8 +1425,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 			kvm_release_page_clean(vcpu->arch.time_page);
 			vcpu->arch.time_page = NULL;
 		}
-
-		kvm_request_guest_time_update(vcpu);
 		break;
 	}
 	case MSR_IA32_MCG_CTL:
@@ -1929,16 +1939,20 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	}
 
 	kvm_x86_ops->vcpu_load(vcpu, cpu);
-	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
+	if (unlikely(vcpu->cpu != cpu) || vcpu->arch.tsc_rebase) {
 		/* Make sure TSC doesn't go backwards */
 		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
 				native_read_tsc() - vcpu->arch.last_host_tsc;
 		if (tsc_delta < 0)
 			mark_tsc_unstable("KVM discovered backwards TSC");
-		if (check_tsc_unstable())
+		if (check_tsc_unstable()) {
 			kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
-		kvm_migrate_timers(vcpu);
+			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		}
+		if (vcpu->cpu != cpu)
+			kvm_migrate_timers(vcpu);
 		vcpu->cpu = cpu;
+		vcpu->arch.tsc_rebase = 0;
 	}
 }
 
@@ -1947,6 +1961,12 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
 	vcpu->arch.last_host_tsc = native_read_tsc();
+
+	/* For unstable TSC, force compensation and catchup on next CPU */
+	if (check_tsc_unstable()) {
+		vcpu->arch.tsc_rebase = 1;
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+	}
 }
 
 static int is_efer_nx(void)
@@ -4307,8 +4327,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			if (vcpu->cpu != freq->cpu)
 				continue;
-			if (!kvm_request_guest_time_update(vcpu))
-				continue;
+			kvm_request_clock_update(vcpu);
 			if (vcpu->cpu != smp_processor_id())
 				send_ipi = 1;
 		}
@@ -5597,7 +5616,7 @@ int kvm_arch_hardware_enable(void *garbage)
 	list_for_each_entry(kvm, &vm_list, vm_list)
 		kvm_for_each_vcpu(i, vcpu, kvm)
 			if (vcpu->cpu == smp_processor_id())
-				kvm_request_guest_time_update(vcpu);
+				kvm_request_clock_update(vcpu);
 	return kvm_x86_ops->hardware_enable(garbage);
 }
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 26/35] Catchup slower TSC to guest rate
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (24 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 25/35] Add clock catchup mode Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-09-07  3:44   ` Dong, Eddie
  2010-08-20  8:07 ` [KVM timekeeping 27/35] Add TSC trapping Zachary Amsden
                   ` (10 subsequent siblings)
  36 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Use the catchup code to continue adjusting the TSC when the host
TSC is running at a lower rate than the guest rate.
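
The resulting control flow is a simple loop; roughly (a sketch, not
literal code):

    /*
     * exit path:   if (vcpu->arch.tsc_rebase)
     *                      request KVM_REQ_CLOCK_UPDATE;
     * next entry:  kvm_guest_time_update() advances the offset and,
     *              while this_tsc_khz < virtual_tsc_khz, re-arms
     *              tsc_rebase, so the catchup repeats continuously.
     */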

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a4215d7..086d56a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1013,8 +1013,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 			kvm_x86_ops->adjust_tsc_offset(v, tsc-tsc_timestamp);
 	}
 	local_irq_restore(flags);
-	if (catchup)
+	if (catchup) {
+		if (this_tsc_khz < v->kvm->arch.virtual_tsc_khz)
+			vcpu->tsc_rebase = 1;
 		return 0;
+	}
 
 	/*
 	 * Time as measured by the TSC may go backwards when resetting the base
@@ -5022,6 +5025,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	kvm_guest_exit();
 
+	/* Running on slower TSC without kvmclock, we must bump TSC */
+	if (vcpu->arch.tsc_rebase)
+		kvm_request_clock_update(vcpu);
+
 	preempt_enable();
 
 	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 27/35] Add TSC trapping
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (25 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 26/35] Catchup slower TSC to guest rate Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 28/35] Unstable TSC write compensation Zachary Amsden
                   ` (9 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Add TSC trapping for SVM and VMX, with the handler in common code.
Reasons to trap the TSC are numerous, but we avoid it as much
as possible for performance reasons.  We don't trap TSC when
kvmclock is in use.  We base the trapped TSC off the system
clock, which keeps it in sync on SMP virtual machines, and
we don't trap the TSC if the system TSC is "stable".
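
When the intercept fires, the common handler materializes the guest
TSC from boot-based time and returns it in EDX:EAX as RDTSC requires;
the core of kvm_read_tsc() below amounts to:

    tsc = compute_guest_tsc(vcpu, getnsboottime());
    kvm_register_write(vcpu, VCPU_REGS_RAX, (u32)tsc);   /* low half */
    kvm_register_write(vcpu, VCPU_REGS_RDX, tsc >> 32);  /* high half */

It then records last_guest_tsc and skips the emulated instruction.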

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/svm.c              |   22 +++++++++++++++
 arch/x86/kvm/vmx.c              |   21 ++++++++++++++
 arch/x86/kvm/x86.c              |   58 ++++++++++++++++++++++++++-------------
 arch/x86/kvm/x86.h              |    1 +
 5 files changed, 85 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ec1dc3a..993d13d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -344,6 +344,7 @@ struct kvm_vcpu_arch {
 	u64 last_tsc_nsec;
 	u64 last_tsc_write;
 	bool tsc_rebase;
+	bool tsc_trapping;
 
 	bool nmi_pending;
 	bool nmi_injected;
@@ -529,6 +530,7 @@ struct kvm_x86_ops {
 	int (*get_lpage_level)(void);
 	bool (*rdtscp_supported)(void);
 	void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);
+	void (*set_tsc_trap)(struct kvm_vcpu *vcpu, bool trap);
 
 	void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 2be8338..604fc0f 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -788,6 +788,9 @@ static void init_vmcb(struct vcpu_svm *svm)
 				(1ULL << INTERCEPT_MONITOR) |
 				(1ULL << INTERCEPT_MWAIT);
 
+	if (svm->vcpu.arch.tsc_trapping)
+		svm->vmcb->control.intercept |= 1ULL << INTERCEPT_RDTSC;
+
 	control->iopm_base_pa = iopm_base;
 	control->msrpm_base_pa = __pa(svm->msrpm);
 	control->int_ctl = V_INTR_MASKING_MASK;
@@ -1020,6 +1023,16 @@ static void svm_clear_vintr(struct vcpu_svm *svm)
 	svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_VINTR);
 }
 
+static void svm_set_tsc_trap(struct kvm_vcpu *vcpu, bool trap)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	vcpu->arch.tsc_trapping = trap;
+	if (trap)
+		svm->vmcb->control.intercept |= 1ULL << INTERCEPT_RDTSC;
+	else
+		svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_RDTSC);
+}
+
 static struct vmcb_seg *svm_seg(struct kvm_vcpu *vcpu, int seg)
 {
 	struct vmcb_save_area *save = &to_svm(vcpu)->vmcb->save;
@@ -2406,6 +2419,13 @@ static int task_switch_interception(struct vcpu_svm *svm)
 	return 1;
 }
 
+static int rdtsc_interception(struct vcpu_svm *svm)
+{
+	svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
+	kvm_read_tsc(&svm->vcpu);
+	return 1;
+}
+
 static int cpuid_interception(struct vcpu_svm *svm)
 {
 	svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
@@ -2724,6 +2744,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm) = {
 	[SVM_EXIT_SMI]				= nop_on_interception,
 	[SVM_EXIT_INIT]				= nop_on_interception,
 	[SVM_EXIT_VINTR]			= interrupt_window_interception,
+	[SVM_EXIT_RDTSC]			= rdtsc_interception,
 	[SVM_EXIT_CPUID]			= cpuid_interception,
 	[SVM_EXIT_IRET]                         = iret_interception,
 	[SVM_EXIT_INVD]                         = emulate_on_interception,
@@ -3543,6 +3564,7 @@ static struct kvm_x86_ops svm_x86_ops = {
 
 	.write_tsc_offset = svm_write_tsc_offset,
 	.adjust_tsc_offset = svm_adjust_tsc_offset,
+	.set_tsc_trap = svm_set_tsc_trap,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f8b70ac..45508f2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2788,6 +2788,19 @@ out:
 	return ret;
 }
 
+static void vmx_set_tsc_trap(struct kvm_vcpu *vcpu, bool trap)
+{
+	u32 cpu_based_vm_exec_control;
+
+	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	if (trap)
+		cpu_based_vm_exec_control |= CPU_BASED_RDTSC_EXITING;
+	else
+		cpu_based_vm_exec_control &= ~CPU_BASED_RDTSC_EXITING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
+	vcpu->arch.tsc_trapping = trap;
+}
+
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	u32 cpu_based_vm_exec_control;
@@ -3388,6 +3401,12 @@ static int handle_invlpg(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int handle_rdtsc(struct kvm_vcpu *vcpu)
+{
+	kvm_read_tsc(vcpu);
+	return 1;
+}
+
 static int handle_wbinvd(struct kvm_vcpu *vcpu)
 {
 	skip_emulated_instruction(vcpu);
@@ -3670,6 +3689,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_PENDING_INTERRUPT]       = handle_interrupt_window,
 	[EXIT_REASON_HLT]                     = handle_halt,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
+	[EXIT_REASON_RDTSC]		      = handle_rdtsc,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmx_insn,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
@@ -4347,6 +4367,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
 
 	.write_tsc_offset = vmx_write_tsc_offset,
 	.adjust_tsc_offset = vmx_adjust_tsc_offset,
+	.set_tsc_trap = vmx_set_tsc_trap,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 086d56a..839e3fd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -985,6 +985,19 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 }
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
+void kvm_read_tsc(struct kvm_vcpu *vcpu)
+{
+	u64 tsc;
+	s64 kernel_ns = getnsboottime();
+
+	tsc = compute_guest_tsc(vcpu, kernel_ns);
+	kvm_register_write(vcpu, VCPU_REGS_RAX, (u32)tsc);
+	kvm_register_write(vcpu, VCPU_REGS_RDX, tsc >> 32);
+	vcpu->arch.last_guest_tsc = tsc;
+	kvm_x86_ops->skip_emulated_instruction(vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_read_tsc);
+
 static int kvm_guest_time_update(struct kvm_vcpu *v)
 {
 	unsigned long flags;
@@ -1089,6 +1102,16 @@ static void kvm_request_clock_update(struct kvm_vcpu *v)
 	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
 }
 
+static void kvm_update_tsc_trapping(struct kvm *kvm)
+{
+	int trap, i;
+	struct kvm_vcpu *vcpu;
+
+	trap = check_tsc_unstable() && atomic_read(&kvm->online_vcpus) > 1;
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_x86_ops->set_tsc_trap(vcpu, trap && !vcpu->arch.time_page);
+}
+
 static bool msr_mtrr_valid(unsigned msr)
 {
 	switch (msr) {
@@ -1414,20 +1437,18 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		vcpu->arch.time = data;
 		kvm_request_clock_update(vcpu);
 
-		/* we verify if the enable bit is set... */
-		if (!(data & 1))
-			break;
-
-		/* ...but clean it before doing the actual write */
-		vcpu->arch.time_offset = data & ~(PAGE_MASK | 1);
-
-		vcpu->arch.time_page =
+		/* if the enable bit is set... */
+		if ((data & 1)) {
+			vcpu->arch.time_offset = data & ~(PAGE_MASK | 1);
+			vcpu->arch.time_page =
 				gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
 
-		if (is_error_page(vcpu->arch.time_page)) {
-			kvm_release_page_clean(vcpu->arch.time_page);
-			vcpu->arch.time_page = NULL;
+			if (is_error_page(vcpu->arch.time_page)) {
+				kvm_release_page_clean(vcpu->arch.time_page);
+				vcpu->arch.time_page = NULL;
+			}
 		}
+		kvm_update_tsc_trapping(vcpu->kvm);
 		break;
 	}
 	case MSR_IA32_MCG_CTL:
@@ -5007,7 +5028,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();
 
-	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+	if (!vcpu->arch.tsc_trapping)
+		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
 
 	atomic_set(&vcpu->guest_mode, 0);
 	smp_wmb();
@@ -5561,14 +5583,12 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->vcpu_free(vcpu);
 }
 
-struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm,
-						unsigned int id)
+struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id)
 {
-	if (check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
-		printk_once(KERN_WARNING
-		"kvm: SMP vm created on host with unstable TSC; "
-		"guest TSC will not be reliable\n");
-	return kvm_x86_ops->vcpu_create(kvm, id);
+	struct kvm_vcpu *vcpu;
+	vcpu = kvm_x86_ops->vcpu_create(kvm, id);
+	kvm_update_tsc_trapping(vcpu->kvm);
+	return vcpu;
 }
 
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 2d6385e..cb38f51 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -69,5 +69,6 @@ void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
+void kvm_read_tsc(struct kvm_vcpu *vcpu);
 
 #endif
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 28/35] Unstable TSC write compensation
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (26 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 27/35] Add TSC trapping Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 29/35] TSC overrun protection Zachary Amsden
                   ` (8 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Now that we have trapping and catchup mode, based off guest virtual
TSC khz, we can accommodate writes to unstable TSCs by doing computation
in guest HZ rather than the transient host HZ.  Instead of a large
window of approximate elapsed time, we use a narrower (1 second) window of
delta time between the guest TSC and system time.

With this change, guests no longer exhibit pathological behavior
during guest-initiated TSC recalibration.
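
In cycle terms, the new window test is roughly (a sketch; the diff
has the exact form):

    /*
     * expected = nsec_to_cycles(kvm, elapsed_ns);
     * written  = |data - last_tsc_write|;
     * if (|written - expected| < nsec_to_cycles(kvm, NSEC_PER_SEC))
     *         treat the write as a synchronization attempt (match
     *         offsets on a stable TSC, trap and match otherwise);
     * else
     *         accept the write and re-baseline last_tsc_*.
     */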

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c |   51 +++++++++++++++++++++------------------------------
 1 files changed, 21 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 839e3fd..23d1d02 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -900,22 +900,10 @@ static void kvm_get_time_scale(uint32_t scaled_khz, uint32_t base_khz,
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 unsigned long max_tsc_khz;
 
-static inline int kvm_tsc_changes_freq(void)
+static inline u64 nsec_to_cycles(struct kvm *kvm, u64 nsec)
 {
-	int cpu = get_cpu();
-	int ret = !boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
-		  cpufreq_quick_get(cpu) != 0;
-	put_cpu();
-	return ret;
-}
-
-static inline u64 nsec_to_cycles(u64 nsec)
-{
-	WARN_ON(preemptible());
-	if (kvm_tsc_changes_freq())
-		printk_once(KERN_WARNING
-		 "kvm: unreliable cycle conversion on adjustable rate TSC\n");
-	return (nsec * __get_cpu_var(cpu_tsc_khz)) / USEC_PER_SEC;
+	return pvclock_scale_delta(nsec, kvm->arch.virtual_tsc_mult,
+				   kvm->arch.virtual_tsc_shift);
 }
 
 static void kvm_arch_set_tsc_khz(struct kvm *kvm, u32 this_tsc_khz)
@@ -942,6 +930,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 	u64 offset, ns, elapsed;
 	unsigned long flags;
 	s64 sdiff;
+	u64 delta;
 
 	spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = data - native_read_tsc();
@@ -952,29 +941,31 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 		sdiff = -sdiff;
 
 	/*
-	 * Special case: close write to TSC within 5 seconds of
-	 * another CPU is interpreted as an attempt to synchronize
-	 * The 5 seconds is to accomodate host load / swapping as
-	 * well as any reset of TSC during the boot process.
+	 * Special case: TSC write with a small delta of virtual
+	 * cycle time against real time is interpreted as an attempt
+	 * to synchronize the CPU.
 	 *
-	 * In that case, for a reliable TSC, we can match TSC offsets,
-	 * or make a best guest using elapsed value.
+	 * For a reliable TSC, we can match TSC offsets, and for an
+	 * unreliable TSC, we will trap and match the last_nsec value.
+	 * In either case, we will have near perfect synchronization.
 	 */
-	if (sdiff < nsec_to_cycles(5ULL * NSEC_PER_SEC) &&
-	    elapsed < 5ULL * NSEC_PER_SEC) {
+	delta = nsec_to_cycles(kvm, elapsed);
+	sdiff -= delta;
+	if (sdiff < 0)
+		sdiff = -sdiff;
+	if (sdiff < nsec_to_cycles(kvm, NSEC_PER_SEC)) {
 		if (!check_tsc_unstable()) {
 			offset = kvm->arch.last_tsc_offset;
 			pr_debug("kvm: matched tsc offset for %llu\n", data);
 		} else {
-			u64 delta = nsec_to_cycles(elapsed);
-			offset += delta;
-			pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
+			/* Unstable write; allow offset, preserve last write */
+			pr_debug("kvm: matched write on unstable tsc\n");
 		}
-		ns = kvm->arch.last_tsc_nsec;
+	} else {
+		kvm->arch.last_tsc_nsec = ns;
+		kvm->arch.last_tsc_write = data;
+		kvm->arch.last_tsc_offset = offset;
 	}
-	kvm->arch.last_tsc_nsec = ns;
-	kvm->arch.last_tsc_write = data;
-	kvm->arch.last_tsc_offset = offset;
 	kvm_x86_ops->write_tsc_offset(vcpu, offset);
 	spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 29/35] TSC overrun protection
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (27 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 28/35] Unstable TSC write compensation Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 30/35] IOCTL for setting TSC rate Zachary Amsden
                   ` (7 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

When running on a TSC which runs at a higher rate than the guest
TSC, and not in KVM clock mode, we should not pass through the
TSC.  Add logic to detect this and switch into trap mode.

There are a few problems with this; first, the condition is not
detected at creation time.  This isn't currently an issue since the
clock will be set to the highest possible rate.

The second problem is that we don't have a way to exit this mode;
the underlying TSC will accelerate beyond our control, and so the
offset must be re-adjusted backwards if the overrun condition is
ever removed.

Even entry to this mode is problematic; some hardware errata or
other miscalibration may have exposed an accelerated TSC to the
guest, in which case we have to preserve the 'bump' of accelerated
time to avoid a backwards clock movement.

Another problem is that CPU frequency governors may be loaded
after KVM has already started, in which case our estimated CPU
frequency may turn out to be wrong.

These problems will be dealt with separately for clarity.
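
The detection itself is event driven; schematically:

    /*
     * cpufreq notifier:        if (freq->new > virtual_tsc_khz)
     *                                  vcpu->arch.tsc_overrun = 1;
     * kvm_guest_time_update(): if (tsc_overrun && !tsc_trapping)
     *                                  kvm_x86_ops->set_tsc_trap(v, 1);
     */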

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/x86.c              |   34 ++++++++++++++++++++++++++++------
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 993d13d..9b2d231 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -345,6 +345,7 @@ struct kvm_vcpu_arch {
 	u64 last_tsc_write;
 	bool tsc_rebase;
 	bool tsc_trapping;
+	bool tsc_overrun;
 
 	bool nmi_pending;
 	bool nmi_injected;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 23d1d02..887e30f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1015,13 +1015,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		u64 tsc = compute_guest_tsc(v, kernel_ns);
 		if (tsc > tsc_timestamp)
 			kvm_x86_ops->adjust_tsc_offset(v, tsc-tsc_timestamp);
-	}
-	local_irq_restore(flags);
-	if (catchup) {
-		if (this_tsc_khz < v->kvm->arch.virtual_tsc_khz)
+		local_irq_restore(flags);
+
+		/* Now, see if we need to switch into trap mode */
+		if (vcpu->tsc_overrun && !vcpu->tsc_trapping)
+			kvm_x86_ops->set_tsc_trap(v, 1);
+
+		/* If we're falling behind and not trapping, re-trigger */
+		if (!vcpu->tsc_trapping &&
+		    this_tsc_khz < v->kvm->arch.virtual_tsc_khz)
 			vcpu->tsc_rebase = 1;
 		return 0;
 	}
+	local_irq_restore(flags);
 
 	/*
 	 * Time as measured by the TSC may go backwards when resetting the base
@@ -1098,6 +1104,17 @@ static void kvm_update_tsc_trapping(struct kvm *kvm)
 	int trap, i;
 	struct kvm_vcpu *vcpu;
 
+	/*
+ 	 * Subtle point; we don't consider TSC rate here as part of
+ 	 * the decision to trap or not.  The reason for it is that
+ 	 * TSC rate changes happen asynchronously, and are thus racy.
+ 	 * The only safe place to check for this is above, in
+ 	 * kvm_guest_time_update, where we've read the HZ value and
+ 	 * the indication from the asynchronous notifier that TSC
+ 	 * is in an overrun condition.  Even that is racy, however that
+ 	 * code is guaranteed to be called again if the CPU frequency
+ 	 * changes yet another time before entering hardware virt.
+	 */
 	trap = check_tsc_unstable() && atomic_read(&kvm->online_vcpus) > 1;
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_x86_ops->set_tsc_trap(vcpu, trap && !vcpu->arch.time_page);
@@ -1977,8 +1994,11 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	kvm_put_guest_fpu(vcpu);
 	vcpu->arch.last_host_tsc = native_read_tsc();
 
-	/* For unstable TSC, force compensation and catchup on next CPU */
-	if (check_tsc_unstable()) {
+	/*
+	 * For unstable TSC, force compensation and catchup on next CPU
+	 * Don't need to do this if there is an overrun, as we'll trap.
+	 */
+	if (check_tsc_unstable() && !vcpu->arch.tsc_overrun) {
 		vcpu->arch.tsc_rebase = 1;
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 	}
@@ -4342,6 +4362,8 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			if (vcpu->cpu != freq->cpu)
 				continue;
+			if (freq->new > kvm->arch.virtual_tsc_khz)
+				vcpu->arch.tsc_overrun = 1;
 			kvm_request_clock_update(vcpu);
 			if (vcpu->cpu != smp_processor_id())
 				send_ipi = 1;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 30/35] IOCTL for setting TSC rate
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (28 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 29/35] TSC overrun protection Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20 17:56   ` Glauber Costa
  2010-08-20  8:07 ` [KVM timekeeping 31/35] Exit conditions for TSC trapping Zachary Amsden
                   ` (6 subsequent siblings)
  36 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Add an IOCTL for setting the TSC rate of a VM, intended to
be used when migrating non-kvmclock based VMs which rely on the
TSC rate staying stable across hosts.
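
A minimal userspace sketch of the intended flow (hypothetical fds and
rate value, error handling omitted; kvm_fd is the /dev/kvm fd, vm_fd
the VM fd):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* Restore the source host's TSC rate on the destination,
     * before any VCPU runs.  2200000 kHz is just an example. */
    void restore_tsc_rate(int kvm_fd, int vm_fd)
    {
            __u32 khz = 2200000;

            if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SET_TSC_RATE) > 0)
                    ioctl(vm_fd, KVM_X86_SET_TSC_RATE, &khz);
    }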

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c  |   36 ++++++++++++++++++++++++++++++++++++
 include/linux/kvm.h |    4 ++++
 2 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 887e30f..e618265 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1017,6 +1017,10 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 			kvm_x86_ops->adjust_tsc_offset(v, tsc-tsc_timestamp);
 		local_irq_restore(flags);
 
+		/* hw_tsc_khz unknown at creation time, check for overrun */
+		if (this_tsc_khz > v->kvm->arch.virtual_tsc_khz)
+			vcpu->tsc_overrun = 1;
+
 		/* Now, see if we need to switch into trap mode */
 		if (vcpu->tsc_overrun && !vcpu->tsc_trapping)
 			kvm_x86_ops->set_tsc_trap(v, 1);
@@ -1846,6 +1850,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_DEBUGREGS:
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
+	case KVM_CAP_SET_TSC_RATE:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -3413,6 +3418,37 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
+	case KVM_X86_GET_TSC_RATE: {
+		u32 rate = kvm->arch.virtual_tsc_khz;
+		r = -EFAULT;
+		if (copy_to_user(argp, &rate, sizeof(rate)))
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_X86_SET_TSC_RATE: {
+		u32 rate;
+		int i;
+		struct kvm_vcpu *vcpu;
+		r = -EFAULT;
+		if (copy_from_user(&rate, argp, sizeof rate))
+			goto out;
+		if (rate == 0 || rate > (1ULL << 40)) {
+			r = -EINVAL;
+			break;
+		}
+		/*
+		 * This is intended to be called once, during VM creation.
+		 * Calling this with running VCPUs to dynamically change
+		 * speed is risky; there is no synchronization with the
+		 * compensation loop, so computations using virtual_tsc_khz
+		 * conversions may go haywire.  Use at your own risk.
+		 */
+		kvm_arch_set_tsc_khz(kvm, rate);
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		break;
+	}
 
 	default:
 		;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 3707704..22d27f2 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -539,6 +539,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_XCRS 56
 #endif
 #define KVM_CAP_PPC_GET_PVINFO 57
+#define KVM_CAP_SET_TSC_RATE 58
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -675,6 +676,9 @@ struct kvm_clock_data {
 #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
 /* Available with KVM_CAP_PPC_GET_PVINFO */
 #define KVM_PPC_GET_PVINFO	  _IOW(KVMIO,  0xa1, struct kvm_ppc_pvinfo)
+/* Available with KVM_CAP_SET_TSC_RATE */
+#define KVM_X86_GET_TSC_RATE      _IOR(KVMIO,  0xa2, __u32)
+#define KVM_X86_SET_TSC_RATE      _IOW(KVMIO,  0xa3, __u32)
 
 /*
  * ioctls for vcpu fds
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 31/35] Exit conditions for TSC trapping
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (29 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 30/35] IOCTL for setting TSC rate Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 32/35] Entry " Zachary Amsden
                   ` (5 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Apply the conditions for exiting TSC trapping back to non-trap mode.
To simplify the logic, we use a few static decision functions
and move all the entry to and exit from trap mode directly into the
clock update handler.

We pick up a slight benefit of not having to rebase the TSC at every
possible preemption point when we are trapping, which we now know
definitively because the transition points are all in one place.
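
The monotonicity argument for the trap exit, schematically:

    /*
     * While trapped, the last value the guest saw is last_guest_tsc
     * (recorded by kvm_read_tsc).  On the switch back to passthrough,
     * an overrun first rewinds the hardware view:
     *
     *   if (tsc_overrun)
     *           tsc_offset += last_guest_tsc - tsc_timestamp;
     *
     * and catchup then advances the offset to compute_guest_tsc(now),
     * so the first untrapped RDTSC can never return less than the
     * last trapped one.
     */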

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    4 ++
 arch/x86/kvm/x86.c              |   93 ++++++++++++++++++++++++++++----------
 2 files changed, 72 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9b2d231..64569b0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -345,6 +345,7 @@ struct kvm_vcpu_arch {
 	u64 last_tsc_write;
 	bool tsc_rebase;
 	bool tsc_trapping;
+	bool tsc_mode;		/* 0 = passthrough, 1 = trap */
 	bool tsc_overrun;
 
 	bool nmi_pending;
@@ -373,6 +374,9 @@ struct kvm_vcpu_arch {
 	cpumask_var_t wbinvd_dirty_mask;
 };
 
+#define TSC_MODE_PASSTHROUGH	0
+#define TSC_MODE_TRAP		1
+
 struct kvm_arch {
 	unsigned int n_free_mmu_pages;
 	unsigned int n_requested_mmu_pages;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e618265..33cb0f0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -997,7 +997,8 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	unsigned long this_tsc_khz;
 	s64 kernel_ns, max_kernel_ns;
 	u64 tsc_timestamp;
-	bool catchup = (!vcpu->time_page);
+	bool kvmclock = (vcpu->time_page != NULL);
+	bool catchup = !kvmclock;
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
@@ -1011,18 +1012,43 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		return 1;
 	}
 
+	/*
+	 * If we are trapping and no longer need to, use catchup to
+	 * ensure passthrough TSC will not be less than trapped TSC
+	 */
+	if (vcpu->tsc_mode == TSC_MODE_PASSTHROUGH && vcpu->tsc_trapping &&
+	    ((this_tsc_khz <= v->kvm->arch.virtual_tsc_khz || kvmclock))) {
+		catchup = 1;
+
+		/*
+		 * If there was an overrun condition, we reset the TSC back to
+		 * the last possible guest visible value to avoid unnecessary
+		 * forward leaps; it will catch up to real time below.
+		 */
+		if (unlikely(vcpu->tsc_overrun)) {
+			vcpu->tsc_overrun = 0;
+			if (vcpu->last_guest_tsc)
+				kvm_x86_ops->adjust_tsc_offset(v,
+					vcpu->last_guest_tsc - tsc_timestamp);
+		}
+		kvm_x86_ops->set_tsc_trap(v, 0);
+	}
+
 	if (catchup) {
 		u64 tsc = compute_guest_tsc(v, kernel_ns);
 		if (tsc > tsc_timestamp)
 			kvm_x86_ops->adjust_tsc_offset(v, tsc-tsc_timestamp);
-		local_irq_restore(flags);
-
-		/* hw_tsc_khz unknown at creation time, check for overrun */
-		if (this_tsc_khz > v->kvm->arch.virtual_tsc_khz)
-			vcpu->tsc_overrun = 1;
+	}
+	local_irq_restore(flags);
+ 
+	/* hw_tsc_khz unknown at creation time, check for overrun */
+	if (this_tsc_khz > v->kvm->arch.virtual_tsc_khz)
+		vcpu->tsc_overrun = 1;
 
+	if (!kvmclock) {
 		/* Now, see if we need to switch into trap mode */
-		if (vcpu->tsc_overrun && !vcpu->tsc_trapping)
+		if ((vcpu->tsc_mode == TSC_MODE_TRAP || vcpu->tsc_overrun) &&
+		    !vcpu->tsc_trapping)
 			kvm_x86_ops->set_tsc_trap(v, 1);
 
 		/* If we're falling behind and not trapping, re-trigger */
@@ -1031,7 +1057,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 			vcpu->tsc_rebase = 1;
 		return 0;
 	}
-	local_irq_restore(flags);
 
 	/*
 	 * Time as measured by the TSC may go backwards when resetting the base
@@ -1103,25 +1128,42 @@ static void kvm_request_clock_update(struct kvm_vcpu *v)
 	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
 }
 
+static inline bool kvm_unstable_smp_clock(struct kvm *kvm)
+{
+	return check_tsc_unstable() && atomic_read(&kvm->online_vcpus) > 1;
+}
+
+static inline bool best_tsc_mode(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * When kvmclock is enabled (time_page is set), we should not trap;
+	 * otherwise, we trap for SMP VMs with unstable clocks.  We also
+	 * will trap for TSC overrun, but not because of this test; overrun
+	 * conditions may disappear with CPU frequency changes, and so
+	 * trapping is not the 'best' mode.  Further, they may also appear
+	 * asynchronously, and we don't want racy logic for tsc_mode, so
+	 * they only set tsc_overrun, not the tsc_mode field.
+	 */
+	return (!vcpu->arch.time_page) && kvm_unstable_smp_clock(vcpu->kvm);
+}
+
 static void kvm_update_tsc_trapping(struct kvm *kvm)
 {
-	int trap, i;
+	int i;
 	struct kvm_vcpu *vcpu;
 
 	/*
- 	 * Subtle point; we don't consider TSC rate here as part of
- 	 * the decision to trap or not.  The reason for it is that
- 	 * TSC rate changes happen asynchronously, and are thus racy.
- 	 * The only safe place to check for this is above, in
+ 	 * The only safe place to check for clock update is in
  	 * kvm_guest_time_update, where we've read the HZ value and
- 	 * the indication from the asynchronous notifier that TSC
- 	 * is in an overrun condition.  Even that is racy, however that
- 	 * code is guaranteed to be called again if the CPU frequency
+ 	 * possibly received indication from the asynchronous notifier that
+	 * the TSC is in an overrun condition.  Even that is racy, however
+	 * that code is guaranteed to be called again if the CPU frequency
  	 * changes yet another time before entering hardware virt.
 	 */
-	trap = check_tsc_unstable() && atomic_read(&kvm->online_vcpus) > 1;
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		kvm_x86_ops->set_tsc_trap(vcpu, trap && !vcpu->arch.time_page);
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		vcpu->arch.tsc_mode = best_tsc_mode(vcpu);
+		kvm_request_clock_update(vcpu);
+	}
 }
 
 static bool msr_mtrr_valid(unsigned msr)
@@ -1445,9 +1487,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 			kvm_release_page_dirty(vcpu->arch.time_page);
 			vcpu->arch.time_page = NULL;
 		}
-
 		vcpu->arch.time = data;
-		kvm_request_clock_update(vcpu);
 
 		/* if the enable bit is set... */
 		if ((data & 1)) {
@@ -1460,7 +1500,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 				vcpu->arch.time_page = NULL;
 			}
 		}
-		kvm_update_tsc_trapping(vcpu->kvm);
+
+		/* Disable / enable trapping for kvmclock */
+		vcpu->arch.tsc_mode = best_tsc_mode(vcpu);
+		kvm_request_clock_update(vcpu);
 		break;
 	}
 	case MSR_IA32_MCG_CTL:
@@ -2000,10 +2043,10 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	vcpu->arch.last_host_tsc = native_read_tsc();
 
 	/*
-	 * For unstable TSC, force compensation and catchup on next CPU
-	 * Don't need to do this if there is an overrun, as we'll trap.
+	 * For unstable TSC, force compensation and catchup on next CPU.
+	 * Don't need to do this if we are trapping.
 	 */
-	if (check_tsc_unstable() && !vcpu->arch.tsc_overrun) {
+	if (check_tsc_unstable() && !vcpu->arch.tsc_trapping) {
 		vcpu->arch.tsc_rebase = 1;
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 	}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 32/35] Entry conditions for TSC trapping
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (30 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 31/35] Exit conditions for TSC trapping Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 33/35] Indicate reliable TSC in kvmclock Zachary Amsden
                   ` (4 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

We must also handle the reverse condition; TSC can't go backwards
when trapping, and it's possible that bad hardware offsetting
makes this problem visible when entering trapping mode.

This is accommodated by adding a 'bump' field to the computed
TSC; it's not pleasant, but it works.
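
The bump then decays as real time passes; per read, under
tsc_write_lock, the new compute_guest_tsc() branch amounts to:

    /*
     * elapsed = ns_to_guest_cycles(kernel_ns - last_tsc_bump_ns);
     * bump    = bump - elapsed + 1;   keep the TSC moving by >= 1
     * if (bump < 0)
     *         bump = 0;               bump fully absorbed
     *
     * While a bump is outstanding the guest TSC advances only
     * minimally, letting real time catch up to the bumped value
     * instead of leaving a permanent forward offset.
     */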

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/x86.c              |   58 +++++++++++++++++++++++++++++++++++---
 2 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 64569b0..950537c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -413,6 +413,8 @@ struct kvm_arch {
 	u32 virtual_tsc_khz;
 	u32 virtual_tsc_mult;
 	s8 virtual_tsc_shift;
+	s64 tsc_bump;
+	s64 last_tsc_bump_ns;
 
 	struct kvm_xen_hvm_config xen_hvm_config;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 33cb0f0..86f182a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -917,13 +917,48 @@ static void kvm_arch_set_tsc_khz(struct kvm *kvm, u32 this_tsc_khz)
 
 static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 {
+	struct kvm_arch *arch = &vcpu->kvm->arch;
 	u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.last_tsc_nsec,
-				      vcpu->kvm->arch.virtual_tsc_mult,
-				      vcpu->kvm->arch.virtual_tsc_shift);
+				      arch->virtual_tsc_mult,
+				      arch->virtual_tsc_shift);
 	tsc += vcpu->arch.last_tsc_write;
+	if (unlikely(arch->tsc_bump)) {
+		s64 bump;
+
+		/*
+		 * Ugh.  There was a TSC bump.  See how much time elapsed
+		 * in cycles since last read, take it off the bump, but
+		 * ensure TSC advances by at least one.  We're serialized
+		 * by the TSC write lock until the bump is gone.
+		 */
+		spin_lock(&arch->tsc_write_lock);
+		bump = pvclock_scale_delta(kernel_ns - arch->last_tsc_bump_ns,
+					   arch->virtual_tsc_mult,
+					   arch->virtual_tsc_shift);
+		bump = arch->tsc_bump - bump + 1;
+		if (bump < 0) {
+			pr_debug("kvm: vcpu%d zeroed TSC bump\n", vcpu->vcpu_id);
+			bump = 0;
+		}
+		arch->tsc_bump = bump;
+		arch->last_tsc_bump_ns = kernel_ns;
+		spin_unlock(&arch->tsc_write_lock);
+
+		tsc += bump;
+	}
 	return tsc;
 }
 
+static void bump_guest_tsc(struct kvm_vcpu *vcpu, s64 bump, s64 kernel_ns)
+{
+	struct kvm *kvm = vcpu->kvm;
+	spin_lock(&kvm->arch.tsc_write_lock);
+	kvm->arch.tsc_bump += bump;
+	kvm->arch.last_tsc_bump_ns = kernel_ns;
+	spin_unlock(&vcpu->kvm->arch.tsc_write_lock);
+	pr_debug("kvm: vcpu%d bumped TSC by %lld\n", vcpu->vcpu_id, bump);
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
@@ -996,7 +1031,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	void *shared_kaddr;
 	unsigned long this_tsc_khz;
 	s64 kernel_ns, max_kernel_ns;
-	u64 tsc_timestamp;
+	u64 tsc_timestamp, tsc;
 	bool kvmclock = (vcpu->time_page != NULL);
 	bool catchup = !kvmclock;
 
@@ -1035,7 +1070,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	}
 
 	if (catchup) {
-		u64 tsc = compute_guest_tsc(v, kernel_ns);
+		tsc = compute_guest_tsc(v, kernel_ns);
 		if (tsc > tsc_timestamp)
 			kvm_x86_ops->adjust_tsc_offset(v, tsc-tsc_timestamp);
 	}
@@ -1048,8 +1083,21 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	if (!kvmclock) {
 		/* Now, see if we need to switch into trap mode */
 		if ((vcpu->tsc_mode == TSC_MODE_TRAP || vcpu->tsc_overrun) &&
-		    !vcpu->tsc_trapping)
+		    !vcpu->tsc_trapping) {
+			/*
+			 * Check for the (hopefully) unlikely event of the
+			 * computed virtual TSC being before the TSC we were
+			 * passing through in hardware.  This can happen if
+			 * the kernel has miscomputed tsc_khz, we missed an
+			 * overrun condition, or SMP calibration is bad.
+			 * If this is the case, we must add a bump to the
+			 * virtual TSC; this sucks.
+			 */
+			if (unlikely(tsc < vcpu->last_guest_tsc))
+				bump_guest_tsc(v, vcpu->last_guest_tsc - tsc,
+					       kernel_ns);
 			kvm_x86_ops->set_tsc_trap(v, 1);
+		}
 
 		/* If we're falling behind and not trapping, re-trigger */
 		if (!vcpu->tsc_trapping &&
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 33/35] Indicate reliable TSC in kvmclock
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (31 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 32/35] Entry " Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20 17:45   ` Glauber Costa
  2010-08-20  8:07 ` [KVM timekeeping 34/35] Remove dead code Zachary Amsden
                   ` (3 subsequent siblings)
  36 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

When no platform bugs have been detected, no TSC warps have been
detected, and the hardware guarantees to us that the TSC does not
change rate or stop with P-state or C-state changes, we can consider
it reliable.
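
On the guest side this feeds pvclock; roughly (an assumption about
how a pvclock guest consumes the flag, not part of this patch):

    /*
     * With PVCLOCK_TSC_STABLE_BIT set, the guest may treat raw
     * pvclock readings as globally monotonic and skip the cross-CPU
     * last-value fixup it otherwise needs on SMP.
     */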

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 86f182a..a7fa24e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -55,6 +55,7 @@
 #include <asm/mce.h>
 #include <asm/i387.h>
 #include <asm/xcr.h>
+#include <asm/pvclock-abi.h>
 
 #define MAX_IO_MSRS 256
 #define CR0_RESERVED_BITS						\
@@ -900,6 +901,13 @@ static void kvm_get_time_scale(uint32_t scaled_khz, uint32_t base_khz,
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 unsigned long max_tsc_khz;
 
+static inline int kvm_tsc_reliable(void)
+{
+	return (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+		boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+		!check_tsc_unstable());
+}
+
 static inline u64 nsec_to_cycles(struct kvm *kvm, u64 nsec)
 {
 	return pvclock_scale_delta(nsec, kvm->arch.virtual_tsc_mult,
@@ -1151,7 +1159,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
 	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
 	vcpu->last_kernel_ns = kernel_ns;
-	vcpu->hv_clock.flags = 0;
+	vcpu->hv_clock.flags = kvm_tsc_reliable() ? PVCLOCK_TSC_STABLE_BIT : 0;
 
 	/*
 	 * The interface expects us to write an even number signaling that the
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 34/35] Remove dead code
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (32 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 33/35] Indicate reliable TSC in kvmclock Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20  8:07 ` [KVM timekeeping 35/35] Add some debug stuff Zachary Amsden
                   ` (2 subsequent siblings)
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

This function is basically almost completely totally useless.

N.B.  The hardware_enable code is not redundant; we could have an old
VCPU thread hanging around that was not rescheduled during a
CPU hotremove / hotadd event.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c |   15 +++++----------
 1 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a7fa24e..23d5138 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1179,11 +1179,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	return 0;
 }
 
-static void kvm_request_clock_update(struct kvm_vcpu *v)
-{
-	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
-}
-
 static inline bool kvm_unstable_smp_clock(struct kvm *kvm)
 {
 	return check_tsc_unstable() && atomic_read(&kvm->online_vcpus) > 1;
@@ -1218,7 +1213,7 @@ static void kvm_update_tsc_trapping(struct kvm *kvm)
 	 */
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		vcpu->arch.tsc_mode = best_tsc_mode(vcpu);
-		kvm_request_clock_update(vcpu);
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 	}
 }
 
@@ -1559,7 +1554,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 
 		/* Disable / enable trapping for kvmclock */
 		vcpu->arch.tsc_mode = best_tsc_mode(vcpu);
-		kvm_request_clock_update(vcpu);
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		break;
 	}
 	case MSR_IA32_MCG_CTL:
@@ -4499,7 +4494,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 				continue;
 			if (freq->new > kvm->arch.virtual_tsc_khz)
 				vcpu->arch.tsc_overrun = 1;
-			kvm_request_clock_update(vcpu);
+			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 			if (vcpu->cpu != smp_processor_id())
 				send_ipi = 1;
 		}
@@ -5197,7 +5192,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	/* Running on slower TSC without kvmclock, we must bump TSC */
 	if (vcpu->arch.tsc_rebase)
-		kvm_request_clock_update(vcpu);
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 
 	preempt_enable();
 
@@ -5791,7 +5786,7 @@ int kvm_arch_hardware_enable(void *garbage)
 	list_for_each_entry(kvm, &vm_list, vm_list)
 		kvm_for_each_vcpu(i, vcpu, kvm)
 			if (vcpu->cpu == smp_processor_id())
-				kvm_request_clock_update(vcpu);
+				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 	return kvm_x86_ops->hardware_enable(garbage);
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [KVM timekeeping 35/35] Add some debug stuff
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (33 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 34/35] Remove dead code Zachary Amsden
@ 2010-08-20  8:07 ` Zachary Amsden
  2010-08-20 13:26 ` KVM timekeeping and TSC virtualization David S. Ahern
  2010-08-24 22:13 ` Marcelo Tosatti
  36 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20  8:07 UTC (permalink / raw)
  To: kvm
  Cc: Zachary Amsden, Avi Kivity, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Very useful, debug-only output

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c |   33 +++++++++++++++++++++++++++++++--
 1 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 23d5138..c74a087 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -967,6 +967,9 @@ static void bump_guest_tsc(struct kvm_vcpu *vcpu, s64 bump, s64 kernel_ns)
 	pr_debug("kvm: vcpu%d bumped TSC by %lld\n", vcpu->vcpu_id, bump);
 }
 
+static int tsc_read_log;
+static int tsc_read_cpu = -1;
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
@@ -983,6 +986,12 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 	if (sdiff < 0)
 		sdiff = -sdiff;
 
+#ifdef DEBUG
+	pr_debug("kvm: tsc%d write %llu (ofs %llu)\n", vcpu->vcpu_id, data,
+		 offset);
+	tsc_read_log += 2;
+#endif
+
 	/*
 	 * Special case: TSC write with a small delta of virtual
 	 * cycle time against real time is interpreted as an attempt
@@ -999,7 +1008,8 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 	if (sdiff < nsec_to_cycles(kvm, NSEC_PER_SEC)) {
 		if (!check_tsc_unstable()) {
 			offset = kvm->arch.last_tsc_offset;
-			pr_debug("kvm: matched tsc offset for %llu\n", data);
+			pr_debug("kvm: matched tsc%d offset for %llu\n",
+				 vcpu->vcpu_id, data);
 		} else {
 			/* Unstable write; allow offset, preserve last write */
 			pr_debug("kvm: matched write on unstable tsc\n");
@@ -1029,6 +1039,16 @@ void kvm_read_tsc(struct kvm_vcpu *vcpu)
 	kvm_register_write(vcpu, VCPU_REGS_RDX, tsc >> 32);
 	vcpu->arch.last_guest_tsc = tsc;
 	kvm_x86_ops->skip_emulated_instruction(vcpu);
+
+#ifdef DEBUG
+	if (tsc_read_log > 0 && vcpu->vcpu_id != tsc_read_cpu) {
+		--tsc_read_log;
+		tsc_read_cpu = vcpu->vcpu_id;
+		pr_debug("kvm_read_tsc: cpu%d [TRAP] %llu\n", vcpu->vcpu_id, tsc);
+		kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
+		pr_debug("kvm_read_tsc: cpu%d [PASS] %llu\n", vcpu->vcpu_id, tsc);
+	}
+#endif
 }
 EXPORT_SYMBOL_GPL(kvm_read_tsc);
 
@@ -1070,11 +1090,16 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		 */
 		if (unlikely(vcpu->tsc_overrun)) {
 			vcpu->tsc_overrun = 0;
-			if (vcpu->last_guest_tsc)
+			if (vcpu->last_guest_tsc) {
+				pr_debug("kvm: corrected TSC overrun of %llu\n",
+					vcpu->last_guest_tsc - tsc_timestamp);
 				kvm_x86_ops->adjust_tsc_offset(v,
 					vcpu->last_guest_tsc - tsc_timestamp);
+			}
 		}
 		kvm_x86_ops->set_tsc_trap(v, 0);
+		pr_debug("kvm: passing TSC vcpu%d tsc_mode: %d time_page %p\n",
+			 v->vcpu_id, vcpu->tsc_mode, vcpu->time_page);
 	}
 
 	if (catchup) {
@@ -1105,6 +1130,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 				bump_guest_tsc(v, vcpu->last_guest_tsc - tsc,
 					       kernel_ns);
 			kvm_x86_ops->set_tsc_trap(v, 1);
+			pr_debug("kvm: trapping TSC on vcpu%d\n", v->vcpu_id);
 		}
 
 		/* If we're falling behind and not trapping, re-trigger */
@@ -1214,6 +1240,9 @@ static void kvm_update_tsc_trapping(struct kvm *kvm)
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		vcpu->arch.tsc_mode = best_tsc_mode(vcpu);
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		pr_debug("kvm: vcpu%d tsc_mode: %d time_page %p\n",
+			 vcpu->vcpu_id, vcpu->arch.tsc_mode,
+			 vcpu->arch.time_page);
 	}
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* Re: KVM timekeeping and TSC virtualization
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (34 preceding siblings ...)
  2010-08-20  8:07 ` [KVM timekeeping 35/35] Add some debug stuff Zachary Amsden
@ 2010-08-20 13:26 ` David S. Ahern
  2010-08-20 23:24   ` Zachary Amsden
  2010-08-24 22:13 ` Marcelo Tosatti
  36 siblings, 1 reply; 106+ messages in thread
From: David S. Ahern @ 2010-08-20 13:26 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: kvm



On 08/20/10 02:07, Zachary Amsden wrote:
> This patch set implements full TSC virtualization, with both
> trapping and passthrough modes, and intelligent mode switching.
> As a result, TSC will never go backwards, we are stable against
> guest re-calibration attempts, VM reset, and migration.  For guests
> which require it, the TSC khz can even be preserved on migration
> to a new host.
> 
> The TSC will never be trapped on UP systems unless the host TSC
> actually runs faster than the guest; other conditions, including
> bad hardware and changing speeds are accomodated by using catchup
> mode to keep the guest passthrough TSC in line with the host clock.

What's the overhead of trapping TSC reads for Nehalem-type processors?

gettimeofday() in guests is the biggest performance problem with KVM for
me, especially for older OSes like RHEL4, which is a supported OS for
another 2 years. Even with RHEL5, 32-bit, I had to force kvmclock off to
get the VM to run reliably:

http://article.gmane.org/gmane.comp.emulators.kvm.devel/51017/match=kvmclock+rhel5.5

David


> 
> What is still needed on top of this is a way to force TSC
> trapping, or disable it entirely, for benchmarking purposes.
> I refrained from adding that last bit because it wasn't clear
> whether the best thing to do is a global 'force TSC trapping' /
> 'force TSC passthrough' / 'intelligent choice', or if this control
> should be on a per-VM level, via an ioctl(), module parameter,
> or sysfs.
> 
> John and Thomas I have cc'd on this because it may be relevant to
> their interests and I always appreciate feedback, especially on
> a change set as large and complex as this.
> 
> Enjoy.  This time, there are no howler monkeys.  I've included
> all the feedback I got from previous rounds of this and more.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 01/35] Drop vm_init_tsc
  2010-08-20  8:07 ` [KVM timekeeping 01/35] Drop vm_init_tsc Zachary Amsden
@ 2010-08-20 16:54   ` Glauber Costa
  0 siblings, 0 replies; 106+ messages in thread
From: Glauber Costa @ 2010-08-20 16:54 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On Thu, Aug 19, 2010 at 10:07:15PM -1000, Zachary Amsden wrote:
> This is used only by the VMX code, and is not done properly;
> if the TSC is indeed backwards, it is out of sync, and will
> need proper handling in the logic at each and every CPU change.
> For now, drop this test during init as misguided.
> 
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>

Ok, not a big loss, I agree.

Btw, I suggest we start merging some of these simple
and self-contained patches before the series grows too big.
That makes it easier for everyone to review the really important stuff.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 03/35] Move TSC offset writes to common code
  2010-08-20  8:07 ` [KVM timekeeping 03/35] Move TSC offset writes to common code Zachary Amsden
@ 2010-08-20 17:06   ` Glauber Costa
  2010-08-24  0:51     ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: Glauber Costa @ 2010-08-20 17:06 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On Thu, Aug 19, 2010 at 10:07:17PM -1000, Zachary Amsden wrote:
> Also, ensure that the storing of the offset and the reading of the TSC
> are never preempted by taking a spinlock.  While the lock is overkill
> now, it is useful later in this patch series.
> 
> +	spinlock_t tsc_write_lock;
Forgive my utter ignorance, especially if it is to become
obvious in a later patch: this is a vcpu-local operation,
it uses rdtscl, so it is pcpu-local too, and we don't expect
multiple writers to it at the same time.

Why do we need this lock?

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 05/35] Move TSC reset out of vmcb_init
  2010-08-20  8:07 ` [KVM timekeeping 05/35] Move TSC reset out of vmcb_init Zachary Amsden
@ 2010-08-20 17:08   ` Glauber Costa
  2010-08-24  0:52     ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: Glauber Costa @ 2010-08-20 17:08 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On Thu, Aug 19, 2010 at 10:07:19PM -1000, Zachary Amsden wrote:
> The VMCB is reset whenever we receive a startup IPI, so Linux setting
> the TSC back to zero happens very late in the boot process, destabilizing
> the TSC.  Instead, just set the TSC to zero once at VCPU creation time.
> 
> Why the separate patch?  So git-bisect is your friend.
> 
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Shouldn't we set it to whatever value the BSP already has, and then set
the BSP to zero? Since vcpus are initialized at different times, this
pretty much guarantees that the guest will have a desynchronized tsc in
all cases (not that it was better before...)

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 08/35] Warn about unstable TSC
  2010-08-20  8:07 ` [KVM timekeeping 08/35] Warn about unstable TSC Zachary Amsden
@ 2010-08-20 17:28   ` Glauber Costa
  2010-08-24  0:56     ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: Glauber Costa @ 2010-08-20 17:28 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On Thu, Aug 19, 2010 at 10:07:22PM -1000, Zachary Amsden wrote:
> If creating an SMP guest with unstable host TSC, issue a warning
> 
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Ok, I am not sure I agree 100% that this is needed.
I believe we should try to communicate this kind of thing to the guest,
not the host, and through cpuid.

Maybe passing the tsc flags through to the guest would be enough?

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-08-20  8:07 ` [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization Zachary Amsden
@ 2010-08-20 17:30   ` Glauber Costa
  2010-09-14  9:10   ` Jan Kiszka
  1 sibling, 0 replies; 106+ messages in thread
From: Glauber Costa @ 2010-08-20 17:30 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On Thu, Aug 19, 2010 at 10:07:24PM -1000, Zachary Amsden wrote:
> When CPUs with unstable TSCs enter deep C-state, TSC may stop
> running.  This causes us to require resynchronization.  Since
> we can't tell when this may potentially happen, we assume the
> worst by forcing re-compensation for it at every point the VCPU
> task is descheduled.
> 
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Fair enough.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 11/35] Add helper functions for time computation
  2010-08-20  8:07 ` [KVM timekeeping 11/35] Add helper functions for time computation Zachary Amsden
@ 2010-08-20 17:34   ` Glauber Costa
  2010-08-24  0:58     ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: Glauber Costa @ 2010-08-20 17:34 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On Thu, Aug 19, 2010 at 10:07:25PM -1000, Zachary Amsden wrote:
> Add a helper function to compute the kernel time and convert nanoseconds
> back to CPU specific cycles.  Note that these must not be called in preemptible
> context, as that would mean the kernel could enter software suspend state,
> which would cause non-atomic operation.
> 
> Also, convert the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls to use the kernel
> time helper, these should be bootbased as well.
This is one of the things I believe should be applied right now.
Maybe we want a cut-down version of this patch that exposes this API while
adjusting KVM_SET_CLOCK / KVM_GET_CLOCK, so it can get in early rather than late?
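
For reference, the helpers being added look roughly like this (a sketch
reconstructed from the discussion; the exact bodies in the patch may
differ):

	/* Bootbased kernel time in ns; caller must not be preemptible. */
	static s64 get_kernel_ns(void)
	{
		struct timespec ts;

		WARN_ON(preemptible());
		ktime_get_ts(&ts);
		monotonic_to_bootbased(&ts);
		return timespec_to_ns(&ts);
	}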

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 12/35] Robust TSC compensation
  2010-08-20  8:07 ` [KVM timekeeping 12/35] Robust TSC compensation Zachary Amsden
@ 2010-08-20 17:40   ` Glauber Costa
  2010-08-24  1:01     ` Zachary Amsden
  2010-08-24 21:33   ` Daniel Verkamp
  1 sibling, 1 reply; 106+ messages in thread
From: Glauber Costa @ 2010-08-20 17:40 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On Thu, Aug 19, 2010 at 10:07:26PM -1000, Zachary Amsden wrote:
> Make the match of TSC find TSC writes that are close to each other
> instead of perfectly identical; this allows the compensator to also
> work in migration / suspend scenarios.
> 
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> ---
>  arch/x86/kvm/x86.c |   14 ++++++++++----
>  1 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 52680f6..0f3e5fb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -928,21 +928,27 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
>  	struct kvm *kvm = vcpu->kvm;
>  	u64 offset, ns, elapsed;
>  	unsigned long flags;
> +	s64 sdiff;
>  
>  	spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>  	offset = data - native_read_tsc();
>  	ns = get_kernel_ns();
>  	elapsed = ns - kvm->arch.last_tsc_nsec;
> +	sdiff = data - kvm->arch.last_tsc_write;
> +	if (sdiff < 0)
> +		sdiff = -sdiff;
>  
>  	/*
> -	 * Special case: identical write to TSC within 5 seconds of
> +	 * Special case: close write to TSC within 5 seconds of
>  	 * another CPU is interpreted as an attempt to synchronize
> -	 * (the 5 seconds is to accomodate host load / swapping).
> +	 * The 5 seconds is to accommodate host load / swapping as
> +	 * well as any reset of TSC during the boot process.
>  	 *
>  	 * In that case, for a reliable TSC, we can match TSC offsets,
> -	 * or make a best guest using kernel_ns value.
> +	 * or make a best guess using the elapsed value.
>  	 */
> -	if (data == kvm->arch.last_tsc_write && elapsed < 5ULL * NSEC_PER_SEC) {
> +	if (sdiff < nsec_to_cycles(5ULL * NSEC_PER_SEC) &&
> +	    elapsed < 5ULL * NSEC_PER_SEC) {
>  		if (!check_tsc_unstable()) {
Isn't 5 way too long for this case?

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 33/35] Indicate reliable TSC in kvmclock
  2010-08-20  8:07 ` [KVM timekeeping 33/35] Indicate reliable TSC in kvmclock Zachary Amsden
@ 2010-08-20 17:45   ` Glauber Costa
  2010-08-24  1:14     ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: Glauber Costa @ 2010-08-20 17:45 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On Thu, Aug 19, 2010 at 10:07:47PM -1000, Zachary Amsden wrote:
> When no platform bugs have been detected, no TSC warps have been
> detected, and the hardware guarantees to us TSC does not change
> rate or stop with P-state or C-state changes, we can consider it reliable.
> 
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> ---
>  arch/x86/kvm/x86.c |   10 +++++++++-
>  1 files changed, 9 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 86f182a..a7fa24e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -55,6 +55,7 @@
>  #include <asm/mce.h>
>  #include <asm/i387.h>
>  #include <asm/xcr.h>
> +#include <asm/pvclock-abi.h>
>  
>  #define MAX_IO_MSRS 256
>  #define CR0_RESERVED_BITS						\
> @@ -900,6 +901,13 @@ static void kvm_get_time_scale(uint32_t scaled_khz, uint32_t base_khz,
>  static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
>  unsigned long max_tsc_khz;
>  
> +static inline int kvm_tsc_reliable(void)
> +{
> +	return (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> +		boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> +		!check_tsc_unstable());
> +}
> +
>  static inline u64 nsec_to_cycles(struct kvm *kvm, u64 nsec)
>  {
>  	return pvclock_scale_delta(nsec, kvm->arch.virtual_tsc_mult,
> @@ -1151,7 +1159,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
>  	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
>  	vcpu->last_kernel_ns = kernel_ns;
> -	vcpu->hv_clock.flags = 0;
> +	vcpu->hv_clock.flags = kvm_tsc_reliable() ? PVCLOCK_TSC_STABLE_BIT : 0;
This is not enough.

We can still have bugs arising from the difference in resolution between the underlying
clock and the tsc. What we're doing here is passing a reliable flag to a non-reliable
guest tsc. We can only trust the guest kvmclock to be tsc-stable if the host is using
the tsc clocksource as well.

Since the stable bit has to be read by the guest at every clock read, we can just
use it, and drop it if the host changes its clocksource.

An alternative for the reliable tsc case would be to just maintain our own parallel
tsc-based clock. But to be honest, I don't like this solution very much. It adds
complexity, and I tend to believe that if the sysadmin went to the trouble of
switching clocksources, he probably has a reason for that.
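
For context, the guest-side read described above -- checking the stable
bit on every clock read -- looks roughly like this (a sketch of the
pvclock read sequence; pvclock_get_nsec() and fixup_monotonicity() are
hypothetical stand-ins, not the real guest helpers):

	u32 version, flags;
	u64 now;

	do {
		version = hv_clock->version;
		rmb();
		flags = hv_clock->flags;
		now = pvclock_get_nsec(hv_clock);	/* hypothetical */
		rmb();
	} while ((version & 1) || version != hv_clock->version);

	if (!(flags & PVCLOCK_TSC_STABLE_BIT))
		now = fixup_monotonicity(now);		/* hypothetical */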

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 19/35] Add timekeeping documentation
  2010-08-20  8:07 ` [KVM timekeeping 19/35] Add timekeeping documentation Zachary Amsden
@ 2010-08-20 17:50   ` Glauber Costa
  0 siblings, 0 replies; 106+ messages in thread
From: Glauber Costa @ 2010-08-20 17:50 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On Thu, Aug 19, 2010 at 10:07:33PM -1000, Zachary Amsden wrote:
> Basic informational document about x86 timekeeping and how KVM
> is affected.
> 
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
This can probably be merged right now.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 30/35] IOCTL for setting TSC rate
  2010-08-20  8:07 ` [KVM timekeeping 30/35] IOCTL for setting TSC rate Zachary Amsden
@ 2010-08-20 17:56   ` Glauber Costa
  2010-08-21 16:11     ` Arnd Bergmann
  0 siblings, 1 reply; 106+ messages in thread
From: Glauber Costa @ 2010-08-20 17:56 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On Thu, Aug 19, 2010 at 10:07:44PM -1000, Zachary Amsden wrote:
> Add an IOCTL for setting the TSC rate for a VM, intended to
> be used to migrate non-kvmclock based VMs which rely on TSC
> rate staying stable across host migration.
> 
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> ---
>  arch/x86/kvm/x86.c  |   36 ++++++++++++++++++++++++++++++++++++
>  include/linux/kvm.h |    4 ++++
>  2 files changed, 40 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 887e30f..e618265 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1017,6 +1017,10 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  			kvm_x86_ops->adjust_tsc_offset(v, tsc-tsc_timestamp);
>  		local_irq_restore(flags);
>  
> +		/* hw_tsc_khz unknown at creation time, check for overrun */
> +		if (this_tsc_khz > v->kvm->arch.virtual_tsc_khz)
> +			vcpu->tsc_overrun = 1;
> +
>  		/* Now, see if we need to switch into trap mode */
>  		if (vcpu->tsc_overrun && !vcpu->tsc_trapping)
>  			kvm_x86_ops->set_tsc_trap(v, 1);
> @@ -1846,6 +1850,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>  	case KVM_CAP_DEBUGREGS:
>  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
>  	case KVM_CAP_XSAVE:
> +	case KVM_CAP_SET_TSC_RATE:
>  		r = 1;
>  		break;
>  	case KVM_CAP_COALESCED_MMIO:
> @@ -3413,6 +3418,37 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  		r = 0;
>  		break;
>  	}
> +	case KVM_X86_GET_TSC_RATE: {
> +		u32 rate = kvm->arch.virtual_tsc_khz;
> +		r = -EFAULT;
> +		if (copy_to_user(argp, &rate, sizeof(rate)))
> +			goto out;
> +		r = 0;
> +		break;
> +	}
> +	case KVM_X86_SET_TSC_RATE: {
> +		u32 rate;
> +		int i;
> +		struct kvm_vcpu *vcpu;
> +		r = -EFAULT;
> +		if (copy_from_user(&rate, argp, sizeof rate))
> +			goto out;
> +		if (rate == 0 || rate > (1ULL << 40)) {
> +			r = -EINVAL;
> +			break;
> +		}
> +		/*
> +		 * This is intended to be called once, during VM creation.
> +		 * Calling this with running VCPUs to dynamically change
> +		 * speed is risky; there is no synchronization with the
> +		 * compensation loop, so computations using virtual_tsc_khz
> +		 * conversions may go haywire.  Use at your own risk.
> +		 */
> +		kvm_arch_set_tsc_khz(kvm, rate);
> +		kvm_for_each_vcpu(i, vcpu, kvm)
> +			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +		break;
> +	}
>  
>  	default:
>  		;
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 3707704..22d27f2 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -539,6 +539,7 @@ struct kvm_ppc_pvinfo {
>  #define KVM_CAP_XCRS 56
>  #endif
>  #define KVM_CAP_PPC_GET_PVINFO 57
> +#define KVM_CAP_SET_TSC_RATE 58
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -675,6 +676,9 @@ struct kvm_clock_data {
>  #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
>  /* Available with KVM_CAP_PPC_GET_PVINFO */
>  #define KVM_PPC_GET_PVINFO	  _IOW(KVMIO,  0xa1, struct kvm_ppc_pvinfo)
> +/* Available with KVM_CAP_SET_TSC_RATE */
> +#define KVM_X86_GET_TSC_RATE      _IOR(KVMIO,  0xa2, __u32)
> +#define KVM_X86_SET_TSC_RATE      _IOW(KVMIO,  0xa3, __u32)

wrap this into a struct?


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 17/35] Implement getnsboottime kernel API
  2010-08-20  8:07 ` [KVM timekeeping 17/35] Implement getnsboottime kernel API Zachary Amsden
@ 2010-08-20 18:39   ` john stultz
  2010-08-20 23:37     ` Zachary Amsden
  2010-08-27 18:05   ` Jan Kiszka
  1 sibling, 1 reply; 106+ messages in thread
From: john stultz @ 2010-08-20 18:39 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	linux-kernel

On Thu, 2010-08-19 at 22:07 -1000, Zachary Amsden wrote:
> Add a kernel call to get the number of nanoseconds since boot.  This
> is generally useful enough to make it a generic call.

Few comments here.

> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> ---
>  include/linux/time.h      |    1 +
>  kernel/time/timekeeping.c |   27 +++++++++++++++++++++++++++
>  2 files changed, 28 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/time.h b/include/linux/time.h
> index ea3559f..5d04108 100644
> --- a/include/linux/time.h
> +++ b/include/linux/time.h
> @@ -145,6 +145,7 @@ extern void getnstimeofday(struct timespec *tv);
>  extern void getrawmonotonic(struct timespec *ts);
>  extern void getboottime(struct timespec *ts);
>  extern void monotonic_to_bootbased(struct timespec *ts);
> +extern s64 getnsboottime(void);

So instead of converting the timespec from getboottime, why did you add
a new interface? Also if not a timespec, why did you pick a s64 instead
of a ktime_t?


>  extern struct timespec timespec_trunc(struct timespec t, unsigned gran);
>  extern int timekeeping_valid_for_hres(void);
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index caf8d4d..d250f0a 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -285,6 +285,33 @@ void ktime_get_ts(struct timespec *ts)
>  }
>  EXPORT_SYMBOL_GPL(ktime_get_ts);
> 
> +
> +/**
> + * getnsboottime - get the bootbased clock in nsec format
> + *
> + * The function calculates the bootbased clock from the realtime
> + * clock, the wall_to_monotonic offset and the total_sleep_time
> + * offset, and returns the result in nanoseconds as an s64.
> + */
> +s64 getnsboottime(void)
> +{
> +	unsigned int seq;
> +	s64 secs, nsecs;
> +
> +	WARN_ON(timekeeping_suspended);
> +
> +	do {
> +		seq = read_seqbegin(&xtime_lock);
> +		secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
> +		secs += total_sleep_time.tv_sec;
> +		nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
> +		nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns();
> +
> +	} while (read_seqretry(&xtime_lock, seq));
> +	return nsecs + (secs * NSEC_PER_SEC);
> +}
> +EXPORT_SYMBOL_GPL(getnsboottime);

You forgot to include the boottime.tv_sec/nsec offset in this. Take a
look again at getboottime()
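
For reference, getboottime() currently reads approximately as follows
(quoted from memory, so details may differ):

	void getboottime(struct timespec *ts)
	{
		struct timespec boottime = {
			.tv_sec = wall_to_monotonic.tv_sec +
					total_sleep_time.tv_sec,
			.tv_nsec = wall_to_monotonic.tv_nsec +
					total_sleep_time.tv_nsec
	};

		set_normalized_timespec(ts, -boottime.tv_sec, -boottime.tv_nsec);
	}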

thanks
-john

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: KVM timekeeping and TSC virtualization
  2010-08-20 13:26 ` KVM timekeeping and TSC virtualization David S. Ahern
@ 2010-08-20 23:24   ` Zachary Amsden
  2010-08-22  1:32     ` David S. Ahern
  0 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20 23:24 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

On 08/20/2010 03:26 AM, David S. Ahern wrote:
>
> On 08/20/10 02:07, Zachary Amsden wrote:
>    
>> This patch set implements full TSC virtualization, with both
>> trapping and passthrough modes, and intelligent mode switching.
>> As a result, TSC will never go backwards, we are stable against
>> guest re-calibration attempts, VM reset, and migration.  For guests
>> which require it, the TSC khz can even be preserved on migration
>> to a new host.
>>
>> The TSC will never be trapped on UP systems unless the host TSC
>> actually runs faster than the guest; other conditions, including
>> bad hardware and changing speeds are accomodated by using catchup
>> mode to keep the guest passthrough TSC in line with the host clock.
>>      
> What's the overhead of trapping TSC reads for Nehalem-type processors?
>
> gettimeofday() in guests is the biggest performance problem with KVM for
> me, especially for older OSes like RHEL4 which is a supported OS for
> another 2 years. Even with RHEL5, 32-bit, I had to force kvmclock off to
> get the VM to run reliably:
>
> http://article.gmane.org/gmane.comp.emulators.kvm.devel/51017/match=kvmclock+rhel5.5
>    

Correctness is the biggest timekeeping problem with KVM for me.  The 
fact that you had to force kvmclock off is evidence of that.  Slightly 
slower applications are fine.  Broken ones are not acceptable.

TSC will not be trapped with kvmclock, and the bug you hit with RHEL5 
kvmclock has since been fixed.  As you can see, it is not a simple and 
straightforward matter to get all the issues sorted out.

Also, TSC will not be trapped with UP VMs, only SMP.  If you seriously 
believe RHEL4 will perform better as an SMP guest than several instances 
of coordinated UP guests, you would worry about this issue.  I don't.  
The amount of upstream scalability and performance work done since that 
timeframe is enormous, to the point that it's entirely plausible that 
KVM governed UP RHEL4 guests as a cluster are faster than a RHEL4 SMP host.

So the answer is - it depends.  Hardware is always getting faster, and 
trap / exit cost is going down.   Right now, it is anywhere from a few 
hundred to multiple thousands of cycles, depending on your hardware.  I 
don't have an exact benchmark number I can quote, although in a couple 
of hours, I probably will.  I'll guess 3,000 cycles.

I agree, gettimeofday is a huge issue, for poorly written applications.  
Not that this means we won't speed it up, in fact, I have already done 
quite a bit of work on ways to reduce the exit cost.  Let's, however, 
get things correct before trying to make them aggressively fast.

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 17/35] Implement getnsboottime kernel API
  2010-08-20 18:39   ` john stultz
@ 2010-08-20 23:37     ` Zachary Amsden
  2010-08-21  0:02       ` john stultz
  0 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-20 23:37 UTC (permalink / raw)
  To: john stultz
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	linux-kernel

On 08/20/2010 08:39 AM, john stultz wrote:
> On Thu, 2010-08-19 at 22:07 -1000, Zachary Amsden wrote:
>    
>> Add a kernel call to get the number of nanoseconds since boot.  This
>> is generally useful enough to make it a generic call.
>>      
> Few comments here.
>
>    
>> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
>> ---
>>   include/linux/time.h      |    1 +
>>   kernel/time/timekeeping.c |   27 +++++++++++++++++++++++++++
>>   2 files changed, 28 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/time.h b/include/linux/time.h
>> index ea3559f..5d04108 100644
>> --- a/include/linux/time.h
>> +++ b/include/linux/time.h
>> @@ -145,6 +145,7 @@ extern void getnstimeofday(struct timespec *tv);
>>   extern void getrawmonotonic(struct timespec *ts);
>>   extern void getboottime(struct timespec *ts);
>>   extern void monotonic_to_bootbased(struct timespec *ts);
>> +extern s64 getnsboottime(void);
>>      
> So instead of converting the timespec from getboottime, why did you add
> a new interface? Also if not a timespec, why did you pick a s64 instead
> of a ktime_t?
>    

The new interface was suggested several times, so I'm proposing it.  I'm 
indifferent to putting it in the kernel API or making it internal to KVM.  
KVM doesn't want to deal with conversions to / from ktime_t; this code 
uses a lot (too much) math, and it's easy to get wrong when splitting 
sec / nsec fields.  So s64 seems a natural type for ns values.  I 
realize it's not entirely consistent with the kernel API, but s64 
representation for ns seems to be creeping in.
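
To make that concrete, compare the split-field subtraction everyone ends
up writing against the flat-ns version (a contrived example, not code
from the series):

	struct timespec a, b, ts;
	s64 a_ns, b_ns;

	/* split sec/nsec: the borrow is easy to get wrong */
	ts.tv_sec  = a.tv_sec  - b.tv_sec;
	ts.tv_nsec = a.tv_nsec - b.tv_nsec;
	if (ts.tv_nsec < 0) {
		ts.tv_sec--;
		ts.tv_nsec += NSEC_PER_SEC;
	}

	/* flat s64 nanoseconds: one subtraction */
	s64 delta = a_ns - b_ns;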

>
>    
>>   extern struct timespec timespec_trunc(struct timespec t, unsigned gran);
>>   extern int timekeeping_valid_for_hres(void);
>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index caf8d4d..d250f0a 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -285,6 +285,33 @@ void ktime_get_ts(struct timespec *ts)
>>   }
>>   EXPORT_SYMBOL_GPL(ktime_get_ts);
>>
>> +
>> +/**
>> + * getnsboottime - get the bootbased clock in nsec format
>> + *
>> + * The function calculates the bootbased clock from the realtime
>> + * clock and the wall_to_monotonic offset and stores the result
>> + * in normalized timespec format in the variable pointed to by @ts.
>> + */
>> +s64 getnsboottime(void)
>> +{
>> +	unsigned int seq;
>> +	s64 secs, nsecs;
>> +
>> +	WARN_ON(timekeeping_suspended);
>> +
>> +	do {
>> +		seq = read_seqbegin(&xtime_lock);
>> +		secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
>> +		secs += total_sleep_time.tv_sec;
>> +		nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
>> +		nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns();
>> +
>> +	} while (read_seqretry(&xtime_lock, seq));
>> +	return nsecs + (secs * NSEC_PER_SEC);
>> +}
>> +EXPORT_SYMBOL_GPL(getnsboottime);
>>      
> You forgot to include the boottime.tv_sec/nsec offset in this. Take a
> look again at getboottime()
>    

I don't think so... boottime is internal to getboottime, and it's just 
wall_to_monotonic + total_sleep_time -- right?

Perhaps I've named the function badly.  What I want is the monotonic 
clock, adjusted for sleep time - i.e. a clock that counts elapsed real 
time without accounting for wall clock changes due to time zone, which 
never goes backwards.

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 17/35] Implement getnsboottime kernel API
  2010-08-20 23:37     ` Zachary Amsden
@ 2010-08-21  0:02       ` john stultz
  2010-08-21  0:52         ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: john stultz @ 2010-08-21  0:02 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	linux-kernel

On Fri, 2010-08-20 at 13:37 -1000, Zachary Amsden wrote:
> On 08/20/2010 08:39 AM, john stultz wrote:
> > On Thu, 2010-08-19 at 22:07 -1000, Zachary Amsden wrote:
> >    
> >> Add a kernel call to get the number of nanoseconds since boot.  This
> >> is generally useful enough to make it a generic call.
> >>      
> > Few comments here.
> >
> >    
> >> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> >> ---
> >>   include/linux/time.h      |    1 +
> >>   kernel/time/timekeeping.c |   27 +++++++++++++++++++++++++++
> >>   2 files changed, 28 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/include/linux/time.h b/include/linux/time.h
> >> index ea3559f..5d04108 100644
> >> --- a/include/linux/time.h
> >> +++ b/include/linux/time.h
> >> @@ -145,6 +145,7 @@ extern void getnstimeofday(struct timespec *tv);
> >>   extern void getrawmonotonic(struct timespec *ts);
> >>   extern void getboottime(struct timespec *ts);
> >>   extern void monotonic_to_bootbased(struct timespec *ts);
> >> +extern s64 getnsboottime(void);
> >>      
> > So instead of converting the timespec from getboottime, why did you add
> > a new interface? Also if not a timespec, why did you pick a s64 instead
> > of a ktime_t?
> >    
> 
> The new interface was suggested several times, so I'm proposing it.  I'm 
> indifferent to putting it in the kernel API or making it internal to KVM.  
> KVM doesn't want to deal with conversions to / from ktime_t; this code 
> uses a lot (too much) math, and it's easy to get wrong when splitting 
> sec / nsec fields.  So s64 seems a natural type for ns values.  I 
> realize it's not entirely consistent with the kernel API, but s64 
> representation for ns seems to be creeping in.

I can understand wanting that; way back I was pushing for s64 ns
representations for most time values, but ktime_t was considered a
reasonable compromise to avoid costly 64-bit divides to split (sec,nsec)
on 32-bit arches.

Maybe call it getboottime_ns() just to distinguish it from
getnstimeofday() which returns a timespec?


> >>   extern struct timespec timespec_trunc(struct timespec t, unsigned gran);
> >>   extern int timekeeping_valid_for_hres(void);
> >> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> >> index caf8d4d..d250f0a 100644
> >> --- a/kernel/time/timekeeping.c
> >> +++ b/kernel/time/timekeeping.c
> >> @@ -285,6 +285,33 @@ void ktime_get_ts(struct timespec *ts)
> >>   }
> >>   EXPORT_SYMBOL_GPL(ktime_get_ts);
> >>
> >> +
> >> +/**
> >> + * getnsboottime - get the bootbased clock in nsec format
> >> + *
> >> + * The function calculates the bootbased clock from the realtime
> >> + * clock, the wall_to_monotonic offset and the total_sleep_time
> >> + * offset, and returns the result in nanoseconds as an s64.
> >> + */
> >> +s64 getnsboottime(void)
> >> +{
> >> +	unsigned int seq;
> >> +	s64 secs, nsecs;
> >> +
> >> +	WARN_ON(timekeeping_suspended);
> >> +
> >> +	do {
> >> +		seq = read_seqbegin(&xtime_lock);
> >> +		secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
> >> +		secs += total_sleep_time.tv_sec;
> >> +		nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
> >> +		nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns();
> >> +
> >> +	} while (read_seqretry(&xtime_lock, seq));
> >> +	return nsecs + (secs * NSEC_PER_SEC);
> >> +}
> >> +EXPORT_SYMBOL_GPL(getnsboottime);
> >>      
> > You forgot to include the boottime.tv_sec/nsec offset in this. Take a
> > look again at getboottime()
> >    
> 
> I don't think so... boottime is internal to getboottime, and it's just 
> wall_to_monotonic + total_sleep_time -- right?

Right, sorry, some architectures refine boot time even further,
providing an offset from when the machine was actually powered on to
when the timekeeping code was initialized. But that's already adjusted
into wall_to_monotonic at startup. I thought we kept it separately.


> Perhaps I've named the function badly.  What I want is the monotonic 
> clock, adjusted for sleep time - i.e. a clock that counts elapsed real 
> time without accounting for wall clock changes due to time zone, which 
> never goes backwards.

That looks fine then.  It's a little confusing, since getboottime()
returns a timespec with the absolute time that the system booted, whereas
your interface is providing the time since boot.

Maybe gettimefromboot_ns() would be clearer?

thanks
-john

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 17/35] Implement getnsboottime kernel API
  2010-08-21  0:02       ` john stultz
@ 2010-08-21  0:52         ` Zachary Amsden
  2010-08-21  1:04           ` john stultz
  0 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-21  0:52 UTC (permalink / raw)
  To: john stultz
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	linux-kernel

On 08/20/2010 02:02 PM, john stultz wrote:
> On Fri, 2010-08-20 at 13:37 -1000, Zachary Amsden wrote:
>    
>> On 08/20/2010 08:39 AM, john stultz wrote:
>>      
>>> On Thu, 2010-08-19 at 22:07 -1000, Zachary Amsden wrote:
>>>
>>>        
>>>> Add a kernel call to get the number of nanoseconds since boot.  This
>>>> is generally useful enough to make it a generic call.
>>>>
>>>>          
>>> Few comments here.
>>>
>>>
>>>        
>>>> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
>>>> ---
>>>>    include/linux/time.h      |    1 +
>>>>    kernel/time/timekeeping.c |   27 +++++++++++++++++++++++++++
>>>>    2 files changed, 28 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/include/linux/time.h b/include/linux/time.h
>>>> index ea3559f..5d04108 100644
>>>> --- a/include/linux/time.h
>>>> +++ b/include/linux/time.h
>>>> @@ -145,6 +145,7 @@ extern void getnstimeofday(struct timespec *tv);
>>>>    extern void getrawmonotonic(struct timespec *ts);
>>>>    extern void getboottime(struct timespec *ts);
>>>>    extern void monotonic_to_bootbased(struct timespec *ts);
>>>> +extern s64 getnsboottime(void);
>>>>
>>>>          
>>> So instead of converting the timespec from getboottime, why did you add
>>> a new interface? Also if not a timespec, why did you pick a s64 instead
>>> of a ktime_t?
>>>
>>>        
>> The new interface was suggested several times, so I'm proposing it.  I'm
>> indifferent to putting it in the kernel API or making it internal to KVM.
>> KVM doesn't want to deal with conversions to / from ktime_t; this code
>> uses a lot (too much) math, and it's easy to get wrong when splitting
>> sec / nsec fields.  So s64 seems a natural type for ns values.  I
>> realize it's not entirely consistent with the kernel API, but s64
>> representation for ns seems to be creeping in.
>>      
> I can understand wanting that; way back I was pushing for s64 ns
> representations for most time values, but ktime_t was considered a
> reasonable compromise to avoid costly 64-bit divides to split (sec,nsec)
> on 32-bit arches.
>    

We want time in easily parseable formats, so we always end up with sec / 
msec, sec / usec, sec / nsec.  This is simply a convenient 
representation for humans.  Programmers always end up copying this model 
and it causes so many lovely bugs.  How many times can you race while 
reading CMOS Y/M/D/H/S?
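
The classic workaround, for reference (cmos_read() standing in for the
real CMOS accessor):

	unsigned int sec, min, hour;

	do {
		sec  = cmos_read(RTC_SECONDS);
		min  = cmos_read(RTC_MINUTES);
		hour = cmos_read(RTC_HOURS);
		/* ... day / month / year ... */
	} while (sec != cmos_read(RTC_SECONDS));

i.e. re-read the seconds register and retry if it rolled over mid-read.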

Fortunately now that 64-bit computing is nearly pervasive, we can make 
most of these problems go away.

I think gettimefromboot_ns() is a good descriptive name, but slightly 
too long - it would ruin my indentation.  Perhaps getrealtime_ns()?

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 17/35] Implement getnsboottime kernel API
  2010-08-21  0:52         ` Zachary Amsden
@ 2010-08-21  1:04           ` john stultz
  2010-08-21  1:22             ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: john stultz @ 2010-08-21  1:04 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	linux-kernel

On Fri, 2010-08-20 at 14:52 -1000, Zachary Amsden wrote:
> I think gettimefromboot_ns() is a good descriptive name, but slightly 
> too long - it would ruin my indentation.  Perhaps getrealtime_ns()?

Sigh... So getrealtime_ns would probably be confused with
CLOCK_REALTIME, which is wall time.  :P

At this point it feels too nitpicky to suggest anything else, so go
ahead and use boottime_ns and we'll refine things if anyone actually
trips up on it.

thanks
-john



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 17/35] Implement getnsboottime kernel API
  2010-08-21  1:04           ` john stultz
@ 2010-08-21  1:22             ` Zachary Amsden
  0 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-21  1:22 UTC (permalink / raw)
  To: john stultz
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	linux-kernel

On 08/20/2010 03:04 PM, john stultz wrote:
> On Fri, 2010-08-20 at 14:52 -1000, Zachary Amsden wrote:
>    
>> I think gettimefromboot_ns() is a good descriptive name, but slightly
>> too long - it would ruin my indentation.  Perhaps getrealtime_ns()?
>>      
> Sigh... So getrealtime_ns would probably be confused with
> CLOCK_REALTIME, which is wall time.  :P
>
> At this point it feels too nitpicky to suggest anything else, so go
> ahead and use boottime_ns and we'll refine things if anyone actually
> trips up on it.
>    

As long as the prototype is adequately commented, there needn't be any 
confusion ;)

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 30/35] IOCTL for setting TSC rate
  2010-08-20 17:56   ` Glauber Costa
@ 2010-08-21 16:11     ` Arnd Bergmann
  0 siblings, 0 replies; 106+ messages in thread
From: Arnd Bergmann @ 2010-08-21 16:11 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Zachary Amsden, kvm, Avi Kivity, Marcelo Tosatti,
	Thomas Gleixner, John Stultz, linux-kernel

On Friday 20 August 2010 19:56:20 Glauber Costa wrote:
> > @@ -675,6 +676,9 @@ struct kvm_clock_data {
> >  #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
> >  /* Available with KVM_CAP_PPC_GET_PVINFO */
> >  #define KVM_PPC_GET_PVINFO     _IOW(KVMIO,  0xa1, struct kvm_ppc_pvinfo)
> > +/* Available with KVM_CAP_SET_TSC_RATE */
> > +#define KVM_X86_GET_TSC_RATE      _IOR(KVMIO,  0xa2, __u32)
> > +#define KVM_X86_SET_TSC_RATE      _IOW(KVMIO,  0xa3, __u32)
> 
> wrap this into a struct?

I don't think that would improve the code. Generally, we try to *avoid* using
structs in ioctl arguments, although KVM does have a precedent of using structs
there.

In fact, the code here could be simplified by using get_user/put_user on the
simple argument, which would not be possible with a struct.
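
Something like this, say (a sketch of the simplification, reusing the
handler context from the patch; note the explicit r = 0 on success,
which the posted hunk appears to be missing):

	case KVM_X86_SET_TSC_RATE: {
		u32 rate;
		int i;
		struct kvm_vcpu *vcpu;

		r = -EFAULT;
		if (get_user(rate, (u32 __user *)argp))
			goto out;
		r = -EINVAL;
		if (rate == 0 || rate > (1ULL << 40))
			break;
		kvm_arch_set_tsc_khz(kvm, rate);
		kvm_for_each_vcpu(i, vcpu, kvm)
			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
		r = 0;
		break;
	}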

	Arnd

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: KVM timekeeping and TSC virtualization
  2010-08-20 23:24   ` Zachary Amsden
@ 2010-08-22  1:32     ` David S. Ahern
  2010-08-24  1:44       ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: David S. Ahern @ 2010-08-22  1:32 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: kvm



On 08/20/10 17:24, Zachary Amsden wrote:
> On 08/20/2010 03:26 AM, David S. Ahern wrote:
>>
>> On 08/20/10 02:07, Zachary Amsden wrote:
>>   
>>> This patch set implements full TSC virtualization, with both
>>> trapping and passthrough modes, and intelligent mode switching.
>>> As a result, TSC will never go backwards, we are stable against
>>> guest re-calibration attempts, VM reset, and migration.  For guests
>>> which require it, the TSC khz can even be preserved on migration
>>> to a new host.
>>>
>>> The TSC will never be trapped on UP systems unless the host TSC
>>> actually runs faster than the guest; other conditions, including
>>> bad hardware and changing speeds are accomodated by using catchup
>>> mode to keep the guest passthrough TSC in line with the host clock.
>>>      
>> What's the overhead of trapping TSC reads for Nehalem-type processors?
>>
>> gettimeofday() in guests is the biggest performance problem with KVM for
>> me, especially for older OSes like RHEL4 which is a supported OS for
>> another 2 years. Even with RHEL5, 32-bit, I had to force kvmclock off to
>> get the VM to run reliably:
>>
>> http://article.gmane.org/gmane.comp.emulators.kvm.devel/51017/match=kvmclock+rhel5.5
>>
>>    
> 
> Correctness is the biggest timekeeping problem with KVM for me.  The
> fact that you had to force kvmclock off is evidence of that.  Slightly
> slower applications are fine.  Broken ones are not acceptable.

I have been concerned with speed and correctness for a while:

http://www.mail-archive.com/kvm@vger.kernel.org/msg02955.html
http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html

> 
> TSC will not be trapped with kvmclock, and the bug you hit with RHEL5
> kvmclock has since been fixed.  As you can see, it is not a simple and
> straightforward issue to get all the issues sorted out.

kvmclock is for guests running RHEL5.5 plus some update, or guests
running a very recent Linux kernel. There are a lot of products running on
OSes older than that.

> 
> Also, TSC will not be trapped with UP VMs, only SMP.  If you seriously
> believe RHEL4 will perform better as an SMP guest than several instances
> of coordinated UP guests, you would worry about this issue.  I don't. 
> The amount of upstream scalability and performance work done since that
> timeframe is enormous, to the point that it's entirely plausible that
> KVM governed UP RHEL4 guests as a cluster are faster than a RHEL4 SMP host.

Products built on RHEL3, RHEL4 or earlier RHEL5 were developed in the
past, and performance expectations were set for those versions based on SMP -
be it bare metal or virtual. You can't expect a product to be redesigned
to run on KVM.

> 
> So the answer is - it depends.  Hardware is always getting faster, and
> trap / exit cost is going down.   Right now, it is anywhere from a few
> hundred to multiple thousands of cycles, depending on your hardware.  I
> don't have an exact benchmark number I can quote, although in a couple
> of hours, I probably will.  I'll guess 3,000 cycles.
> 
> I agree, gettimeofday is a huge issue, for poorly written applications. 

I understand it is not a simple problem, and "poorly written
applications" is a bit of a reach, don't you think? There are a number of
workloads that depend on time stamps; that does not make them poorly
designed.

> Not that this means we won't speed it up, in fact, I have already done
> quite a bit of work on ways to reduce the exit cost.  Let's, however,
> get things correct before trying to make them aggressively fast.
> 
> Zach

I have also looked at timekeeping and performance of gettimeofday on a
certain proprietary hypervisor. KVM lags severely here and workloads
dependent on timestamps are dramatically impacted. Evaluations and
decisions are made today based on current designs - both KVM and
product. Severe performance deltas raise a lot of flags.

David

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 03/35] Move TSC offset writes to common code
  2010-08-20 17:06   ` Glauber Costa
@ 2010-08-24  0:51     ` Zachary Amsden
  0 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-24  0:51 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On 08/20/2010 07:06 AM, Glauber Costa wrote:
> On Thu, Aug 19, 2010 at 10:07:17PM -1000, Zachary Amsden wrote:
>    
>> Also, ensure that the storing of the offset and the reading of the TSC
>> are never preempted by taking a spinlock.  While the lock is overkill
>> now, it is useful later in this patch series.
>>
>> +	spinlock_t tsc_write_lock;
>>      
> Forgive my utter ignorance, especially if it is to become
> obvious in a later patch: this is a vcpu-local operation,
> uses rdtscl, so it is pcpu-local too, and we don't expect
> multiple writers to it at the same time.
>
> Why do we need this lock?
>
>    

Synchronizing access to the variables which we use to match TSC writes 
across multiple VCPUs.
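
Concretely, the lock makes the read-TSC/record-state pair atomic across
VCPUs (a schematic, not the exact code from the series):

	spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
	offset = data - native_read_tsc();
	/*
	 * Without the lock, a second VCPU writing its TSC here could
	 * read and update last_tsc_write/last_tsc_nsec between our
	 * rdtsc and the stores below, so the "close write within 5
	 * seconds" match in the later patches would compare against
	 * inconsistent values.
	 */
	kvm->arch.last_tsc_nsec = get_kernel_ns();
	kvm->arch.last_tsc_write = data;
	spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);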

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 05/35] Move TSC reset out of vmcb_init
  2010-08-20 17:08   ` Glauber Costa
@ 2010-08-24  0:52     ` Zachary Amsden
  0 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-24  0:52 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On 08/20/2010 07:08 AM, Glauber Costa wrote:
> On Thu, Aug 19, 2010 at 10:07:19PM -1000, Zachary Amsden wrote:
>    
>> The VMCB is reset whenever we receive a startup IPI, so Linux setting
>> the TSC back to zero happens very late in the boot process, destabilizing
>> the TSC.  Instead, just set TSC to zero once at VCPU creation time.
>>
>> Why the separate patch?  So git-bisect is your friend.
>>
>> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
>>      
> Shouldn't we set it to whatever value the BSP already has, and then set
> the BSP to zero? Since vcpus are initialized at different times, this
> pretty much guarantees that the guest will have a desynchronized tsc in
> all cases (not that it was better before...)
>    

Yes, we should - but it takes a lot more machinery to do that, and so it 
happens later in the series.  You have to match the offsets for the BSP 
and other vcpus...

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 08/35] Warn about unstable TSC
  2010-08-20 17:28   ` Glauber Costa
@ 2010-08-24  0:56     ` Zachary Amsden
  0 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-24  0:56 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On 08/20/2010 07:28 AM, Glauber Costa wrote:
> On Thu, Aug 19, 2010 at 10:07:22PM -1000, Zachary Amsden wrote:
>    
>> If creating an SMP guest with unstable host TSC, issue a warning
>>
>> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
>>      
> Ok, I am not sure I agree 100% that this is needed.
> I believe we should try to communicate this kind of thing to the guest,
> not the host, and through cpuid.
>
> Maybe passing the tsc flags through to the guest would be enough?
>    

I found a better way to deal with this later in the series... the theory 
being if the full series doesn't get backported to supported releases, 
at least we'll get a warning.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 11/35] Add helper functions for time computation
  2010-08-20 17:34   ` Glauber Costa
@ 2010-08-24  0:58     ` Zachary Amsden
  0 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-24  0:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On 08/20/2010 07:34 AM, Glauber Costa wrote:
> On Thu, Aug 19, 2010 at 10:07:25PM -1000, Zachary Amsden wrote:
>    
>> Add a helper function to compute the kernel time and convert nanoseconds
>> back to CPU specific cycles.  Note that these must not be called in preemptible
>> context, as that would mean the kernel could enter software suspend state,
>> which would cause non-atomic operation.
>>
>> Also, convert the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls to use the kernel
>> time helper, these should be bootbased as well.
>>      
> This is one of the things I believe should be applied right now.
> Maybe we want a cut-down version of this patch that exposes this API while
> adjusting KVM_SET_CLOCK / KVM_GET_CLOCK, so it can get in early rather than late?
>    

The first half of the series, at least, is good to go upstream and ready 
for backport.  The trapping and later stuff obviously needs to get some 
upstream testing.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 12/35] Robust TSC compensation
  2010-08-20 17:40   ` Glauber Costa
@ 2010-08-24  1:01     ` Zachary Amsden
  0 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-24  1:01 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On 08/20/2010 07:40 AM, Glauber Costa wrote:
> On Thu, Aug 19, 2010 at 10:07:26PM -1000, Zachary Amsden wrote:
>    
>> Make the match of TSC find TSC writes that are close to each other
>> instead of perfectly identical; this allows the compensator to also
>> work in migration / suspend scenarios.
>>
>> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
>> ---
>>   arch/x86/kvm/x86.c |   14 ++++++++++----
>>   1 files changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 52680f6..0f3e5fb 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -928,21 +928,27 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
>>   	struct kvm *kvm = vcpu->kvm;
>>   	u64 offset, ns, elapsed;
>>   	unsigned long flags;
>> +	s64 sdiff;
>>
>>   	spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>>   	offset = data - native_read_tsc();
>>   	ns = get_kernel_ns();
>>   	elapsed = ns - kvm->arch.last_tsc_nsec;
>> +	sdiff = data - kvm->arch.last_tsc_write;
>> +	if (sdiff < 0)
>> +		sdiff = -sdiff;
>>
>>   	/*
>> -	 * Special case: identical write to TSC within 5 seconds of
>> +	 * Special case: close write to TSC within 5 seconds of
>>   	 * another CPU is interpreted as an attempt to synchronize
>> -	 * (the 5 seconds is to accomodate host load / swapping).
>> +	 * The 5 seconds is to accommodate host load / swapping as
>> +	 * well as any reset of TSC during the boot process.
>>   	 *
>>   	 * In that case, for a reliable TSC, we can match TSC offsets,
>> -	 * or make a best guest using kernel_ns value.
>> +	 * or make a best guess using the elapsed value.
>>   	 */
>> -	if (data == kvm->arch.last_tsc_write && elapsed < 5ULL * NSEC_PER_SEC) {
>> +	if (sdiff < nsec_to_cycles(5ULL * NSEC_PER_SEC) &&
>> +	    elapsed < 5ULL * NSEC_PER_SEC) {
>>   		if (!check_tsc_unstable()) {
>>      
> Isn't 5 way too long for this case?
>
>
>    

It was actually too short for a while, and I didn't realize why until I 
discovered that on SVM, the APs were getting the TSC reset after the startup IPI.

In any case, the value is certainly up for debate.  I chose a large 
number because who knows how badly things can get off in the case of 
host overcommit / swapping.

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 33/35] Indicate reliable TSC in kvmclock
  2010-08-20 17:45   ` Glauber Costa
@ 2010-08-24  1:14     ` Zachary Amsden
  0 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-24  1:14 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner, John Stultz,
	linux-kernel

On 08/20/2010 07:45 AM, Glauber Costa wrote:
> On Thu, Aug 19, 2010 at 10:07:47PM -1000, Zachary Amsden wrote:
>    
>> When no platform bugs have been detected, no TSC warps have been
>> detected, and the hardware guarantees to us TSC does not change
>> rate or stop with P-state or C-state changes, we can consider it reliable.
>>
>> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
>> ---
>>   arch/x86/kvm/x86.c |   10 +++++++++-
>>   1 files changed, 9 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 86f182a..a7fa24e 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -55,6 +55,7 @@
>>   #include <asm/mce.h>
>>   #include <asm/i387.h>
>>   #include <asm/xcr.h>
>> +#include <asm/pvclock-abi.h>
>>
>>   #define MAX_IO_MSRS 256
>>   #define CR0_RESERVED_BITS						\
>> @@ -900,6 +901,13 @@ static void kvm_get_time_scale(uint32_t scaled_khz, uint32_t base_khz,
>>   static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
>>   unsigned long max_tsc_khz;
>>
>> +static inline int kvm_tsc_reliable(void)
>> +{
>> +	return (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>> +		boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
>> +		!check_tsc_unstable());
>> +}
>> +
>>   static inline u64 nsec_to_cycles(struct kvm *kvm, u64 nsec)
>>   {
>>   	return pvclock_scale_delta(nsec, kvm->arch.virtual_tsc_mult,
>> @@ -1151,7 +1159,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>>   	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
>>   	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
>>   	vcpu->last_kernel_ns = kernel_ns;
>> -	vcpu->hv_clock.flags = 0;
>> +	vcpu->hv_clock.flags = kvm_tsc_reliable() ? PVCLOCK_TSC_STABLE_BIT : 0;
>>      
> This is not enough.
>
> We can still have bugs arising from the difference in resolution between the underlying
> clock and the tsc. What we're doing here is passing a reliable flag to a non-reliable
> guest tsc. We can only trust the guest kvmclock to be tsc-stable if the host is using
> the tsc clocksource as well.
>    

Is there actually an exported API to determine if clocksource is running 
on TSC and get notified when it switches?

> Since the stable bit has to be read by the guest at every clock read, we can just
> use it, and drop it if the host changes its clocksource.
>    

I know we've discussed this a bit, but with patch 16/35, "Fix a possible 
backwards warp of kvmclock", I don't think you can see the backwards 
movement in an "incorrect" way within the guest.

A backwards jump on each processor must be eliminated, which is what that 
patch does.

It still allows the possibility of SMP differences: due to the 
calibration error, you may have one CPU which is slightly advanced.  You 
may in fact get a kvmclock value which is less than the previously read 
(on another CPU) kvmclock value in such a case.  The question is - is 
this calibration error of sufficient magnitude to be significant at all?

Note that even with a perfectly calibrated TSC on a stable system, 
with no atomic lock, kvmclock already has this error built into 
it; the TSC reads of multiple processors will not be serialized with 
each other and "backwards" values can be observed globally (but not 
locally).  So the question really is, how big is the error relative to 
the TSC rate, and is it significant enough to matter.

Obviously that changes for different host clocks, and in principle I 
agree with you; it could very well be significant.  However, we have no 
clear API from clocksource to use effectively for this (indeed, in some 
cases, with jiffies clock, it isn't even clear what the API should do).

We could use more 'magic' trickery to keep kvmclock values aligned, 
matching the system_time and tsc_timestamp when setting up SMP kvmclocks 
on a host which has 'stable TSC'.

> An alternative for the reliable tsc case would be to just maintain our own parallel
> tsc-based clock. But to be honest, I don't like this solution very much. It adds
> complexity, and I tend to believe that if the sysadmin went to the trouble of
> switching clocksources, he probably has a reason for that.
>    

I originally went down that route, and it got ugly, ugly, ugly.

In any case, you are right, this patch needs to be held for further 
discussion.

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: KVM timekeeping and TSC virtualization
  2010-08-22  1:32     ` David S. Ahern
@ 2010-08-24  1:44       ` Zachary Amsden
  2010-08-24  3:04         ` David S. Ahern
  0 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-24  1:44 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

On 08/21/2010 03:32 PM, David S. Ahern wrote:
>
> On 08/20/10 17:24, Zachary Amsden wrote:
>    
>> On 08/20/2010 03:26 AM, David S. Ahern wrote:
>>      
>>> On 08/20/10 02:07, Zachary Amsden wrote:
>>>
>>>        
>>>> This patch set implements full TSC virtualization, with both
>>>> trapping and passthrough modes, and intelligent mode switching.
>>>> As a result, TSC will never go backwards, we are stable against
>>>> guest re-calibration attempts, VM reset, and migration.  For guests
>>>> which require it, the TSC khz can even be preserved on migration
>>>> to a new host.
>>>>
>>>> The TSC will never be trapped on UP systems unless the host TSC
>>>> actually runs faster than the guest; other conditions, including
>>>> bad hardware and changing speeds are accomodated by using catchup
>>>> mode to keep the guest passthrough TSC in line with the host clock.
>>>>
>>>>          
>>> What's the overhead of trapping TSC reads for Nehalem-type processors?
>>>
>>> gettimeofday() in guests is the biggest performance problem with KVM for
>>> me, especially for older OSes like RHEL4 which is a supported OS for
>>> another 2 years. Even with RHEL5, 32-bit, I had to force kvmclock off to
>>> get the VM to run reliably:
>>>
>>> http://article.gmane.org/gmane.comp.emulators.kvm.devel/51017/match=kvmclock+rhel5.5
>>>
>>>
>>>        
>> Correctness is the biggest timekeeping problem with KVM for me.  The
>> fact that you had to force kvmclock off is evidence of that.  Slightly
>> slower applications are fine.  Broken ones are not acceptable.
>>      
> I have been concerned with speed and correctness for a while:
>
> http://www.mail-archive.com/kvm@vger.kernel.org/msg02955.html
> http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html
>
>    
>> TSC will not be trapped with kvmclock, and the bug you hit with RHEL5
>> kvmclock has since been fixed.  As you can see, it is not a simple and
>> straightforward matter to get all the issues sorted out.
>>      
> kvmclock is for guests running RHEL5.5+some update and or some guest
> running a very recent linux kernel. There's a lot of products running on
> OS'es older than that.
>
>    
>> Also, TSC will not be trapped with UP VMs, only SMP.  If you seriously
>> believe RHEL4 will perform better as an SMP guest than several instances
>> of coordinated UP guests, you would worry about this issue.  I don't.
>> The amount of upstream scalability and performance work done since that
>> timeframe is enormous, to the point that it's entirely plausible that
>> KVM governed UP RHEL4 guests as a cluster are faster than a RHEL4 SMP host.
>>      
> Products built on RHEL3, RHEL4 or earlier RHEL5 were developed in the
> past, and performance expectations set for that version based on SMP -
> be it bare metal or virtual. You can't expect a product to be redesigned
> to run on KVM.
>    

You can expect people to measure and use the system appropriately.  
Products built on RHEL3, 4, etc. will not have the inherent SMP 
scalability and therefore won't benefit as hugely from an SMP VM.

>> So the answer is - it depends.  Hardware is always getting faster, and
>> trap / exit cost is going down.   Right now, it is anywhere from a few
>> hundred to multiple thousands of cycles, depending on your hardware.  I
>> don't have an exact benchmark number I can quote, although in a couple
>> of hours, I probably will.  I'll guess 3,000 cycles.
>>
>> I agree, gettimeofday is a huge issue, for poorly written applications.
>>      
> I understand it is not a simple problem, and "poorly written
> applications" is a bit of reach don't you think? There are a number of
> workloads that depend on time stamps; that does not make them poorly
> designed.
>    

The timestamp will never be unique or completely accurate.  Therefore it 
is not necessary to issue calls to get timestamps from the kernel at an 
extreme rate.

A 64-bit counter and a timestamp fetched approximately once per second 
IS unique and accurate to a 1-second value.

On any virtualized system, unless you have dedicated virtual machines 
and a real time host operating system, you can't really guarantee better 
than 1-second time resolution to any guest.  (Pick some other resolution 
than 1-second; same argument applies).

Therefore, workloads which issue kernel calls to repeatedly get less 
than useful information are, in fact, poorly designed, unless they are 
written to run on real time host environments for dedicated RT use.
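
The scheme is trivial to sketch (a userspace sketch under the stated
1-second assumption; cached_sec is refreshed by a once-per-second timer
that is not shown):

	#include <stdint.h>
	#include <time.h>

	static volatile time_t cached_sec;	/* refreshed ~1/sec */
	static uint64_t counter;

	struct stamp { time_t sec; uint64_t seq; };

	static struct stamp get_stamp(void)
	{
		struct stamp s;

		s.sec = cached_sec;				/* ~1s accurate */
		s.seq = __sync_add_and_fetch(&counter, 1);	/* unique */
		return s;
	}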

>> Not that this means we won't speed it up, in fact, I have already done
>> quite a bit of work on ways to reduce the exit cost.  Let's, however,
>> get things correct before trying to make them aggressively fast.
>>
>> Zach
>>      
> I have also looked at time keeping and performance of getimeofday on a
> certain proprietary hypervisor. KVM lags severely here and workloads
> dependent on timestamps are dramatically impacted. Evaluations and
> decisions are made today based on current designs - both KVM and
> product. Severe performance deltas raise a lot of flags.
>    

This is laughably incorrect.

Gettimeofday is faster on KVM than anything else using TSC based clock 
because it passes the TSC through directly.   VMware traps the TSC and 
is actually slower.

Can you please define your "severe performance delta" and tell us your 
benchmark methodology?  I'd like to help you figure out how it is flawed.

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: KVM timekeeping and TSC virtualization
  2010-08-24  1:44       ` Zachary Amsden
@ 2010-08-24  3:04         ` David S. Ahern
  2010-08-24  5:47           ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: David S. Ahern @ 2010-08-24  3:04 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: kvm



On 08/23/10 19:44, Zachary Amsden wrote:
>> I have also looked at timekeeping and performance of gettimeofday on a
>> certain proprietary hypervisor. KVM lags severely here and workloads
>> dependent on timestamps are dramatically impacted. Evaluations and
>> decisions are made today based on current designs - both KVM and
>> product. Severe performance deltas raise a lot of flags.
>>    
> 
> This is laughably incorrect.

Uh, right.

> 
> Gettimeofday is faster on KVM than anything else using TSC based clock
> because it passes the TSC through directly.   VMware traps the TSC and
> is actually slower.

Yes, it does trap the TSC to ensure it is increasing. My question
regarding trapping on KVM was about what to expect in terms of
overhead. Furthermore, if you add trapping on KVM, are TSC reads still
faster on KVM?

> 
> Can you please define your "severe performance delta" and tell us your
> benchmark methodology?  I'd like to help you figure out how it is flawed.

I sent you the link in the last response. Here it is again:
http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html

TSC - fast, but has horrible time drifts

PIT - horribly slow

ACPI PM - horribly slow

HPET - did not exist in Nov. 2008, and since has not been reliable in my
tests with RHEL4 and RHEL5

kvmclock - does not exist for RHEL4 and not usable on RHEL5 until the
update of 5.5 with the fix (I have not retried RHEL5 with the latest
maintenance kernel to verify it is stable in my use cases).

Take the program from the link above. Run it in a RHEL4 & RHEL5 guest
running on VMware for all the clock sources. Somewhere I have the data
for these comparisons -- KVM, VMware and bare metal. Same hardware, same
OS. The PIT and acpi-PM clock sources are faster on VMware than on bare metal.
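
For reference, the heart of such a measurement is tiny; a sketch in the
spirit of the linked program (not the program itself) looks like this,
run once per clock source and compared between guest and bare metal:

#include <stdio.h>
#include <sys/time.h>

#define CALLS 1000000L

int main(void)
{
	struct timeval start, end, tv;
	long i, us;

	gettimeofday(&start, NULL);
	for (i = 0; i < CALLS; i++)
		gettimeofday(&tv, NULL);	/* cost tracks the clocksource */
	gettimeofday(&end, NULL);

	us = (end.tv_sec - start.tv_sec) * 1000000L
	     + (end.tv_usec - start.tv_usec);
	printf("%.1f ns per gettimeofday() call\n", us * 1000.0 / CALLS);
	return 0;
}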


My point is that kvmclock is Red Hat's answer for the future -- RHEL6,
RHEL5.Y (whenever it proves reliable). What about the present?  What
about products based on other distributions newer than RHEL5 but
pre-kvmclock?

There are a lot of moving windows of what to use as a clock source, not
just per major number (RHEL4, RHEL5) but minor number (e.g., TSC
stability on RHEL4 -- e.g.,
https://bugzilla.redhat.com/show_bug.cgi?id=491154) and further
maintenance releases (kvmclock requiring RHEL5.5+). That is not very
friendly to a product making a transition to virtualization - and with
the same code base running bare metal or in a VM.

David


> 
> Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: KVM timekeeping and TSC virtualization
  2010-08-24  3:04         ` David S. Ahern
@ 2010-08-24  5:47           ` Zachary Amsden
  2010-08-24 13:32             ` David S. Ahern
  0 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-24  5:47 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

On 08/23/2010 05:04 PM, David S. Ahern wrote:
>
> On 08/23/10 19:44, Zachary Amsden wrote:
>    
>>> I have also looked at timekeeping and performance of gettimeofday on a
>>> certain proprietary hypervisor. KVM lags severely here and workloads
>>> dependent on timestamps are dramatically impacted. Evaluations and
>>> decisions are made today based on current designs - both KVM and
>>> product. Severe performance deltas raise a lot of flags.
>>>
>>>        
>> This is laughably incorrect.
>>      
> Uh, right.
>    

I've heard the rumor that TSC is orders of magnitude faster under VMware 
than under KVM from three people now, and I thought you were part of 
that camp.

Needless to say, they are either laughably incorrect, or possess some 
great secret knowledge of how to make things under virtualization go 
faster than bare metal.

I also have a magical talking unicorn, which, btw, is invisible.  
Extraordinary claims require extraordinary proof (the proof of my 
unicorn is too complex to fit in the margin of this e-mail, however, I 
assure you he is real).

>    
>> Gettimeofday is faster on KVM than anything else using TSC based clock
>> because it passes the TSC through directly.   VMware traps the TSC and
>> is actually slower.
>>      
> Yes, it does trap the TSC to ensure it is increasing. My question
> regarding trapping on KVM was about to what to expect in terms of
> overhead. Furthermore, if you add trapping on KVM are TSC reads still
> faster on KVM?
>
>    
>> Can you please define your "severe performance delta" and tell us your
>> benchmark methodology?  I'd like to help you figure out how it is flawed.
>>      
> I sent you the link in the last response. Here it is again:
> http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html
>
> TSC - fast, but has horrible time drifts
>
> PIT - horribly slow
>
> ACPI PM - horribly slow
>
> HPET - did not exist in Nov. 2008, and since has not been reliable in my
> tests with RHEL4 and RHEL5
>
> kvmclock - does not exist for RHEL4 and not usable on RHEL5 until the
> update of 5.5 with the fix (I have not retried RHEL5 with the latest
> maintenance kernel to verify it is stable in my use cases).
>
> Take the program from the link above. Run it in a RHEL4 & RHEL5 guest
> running on VMware for all the clock sources. Somewhere I have the data
> for these comparisons -- KVM, VMware and bare metal. Same hardware, same
> OS. The PIT and acpi-PM clock sources are faster on VMware than on bare metal.
>
>
> My point is that kvmclock is Red Hat's answer for the future -- RHEL6,
> RHEL5.Y (whenever it proves reliable). What about the present?  What
> about products based on other distributions newer than RHEL5 but
> pre-kvmclock?
>    

It should be obvious from this patchset... PIT or TSC.

KVM did not have an in-kernel PIT implementation circa 2008, so this 
data is quite old.  It's much faster now and will continue to get faster 
as exit cost goes down and the emulation gets further optimized.

Plus, now we have an error-free TSC.

> There are a lot of moving windows of what to use as a clock source, not
> just per major number (RHEL4, RHEL5) but minor number (e.g., TSC
> stability on RHEL4 -- e.g.,
> https://bugzilla.redhat.com/show_bug.cgi?id=491154) and further
> maintenance releases (kvmclock requiring RHEL5.5+). That is not very
> friendly to a product making a transition to virtualization - and with
> the same code base running bare metal or in a VM.
>    

If you have old software running on broken hardware you do not get 
hardware performance and error-free time virtualization.  With any 
vendor.  Period.

With this patchset, KVM now has a much stronger guarantee: If you have 
old guest software running on broken hardware, using SMP virtual 
machines, you do not get hardware performance and error-free time 
virtualization.    However, if you have new guest software, non-broken 
hardware, or can simply run UP guests instead of SMP, you can have 
hardware performance, and it is now error free.  Alternatively, you can 
sacrifice some accuracy and have hardware performance, even for SMP 
guests, if you can tolerate some minor cross-CPU TSC variation.  No 
other vendor I know of can make that guarantee.

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: KVM timekeeping and TSC virtualization
  2010-08-24  5:47           ` Zachary Amsden
@ 2010-08-24 13:32             ` David S. Ahern
  2010-08-24 23:01               ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: David S. Ahern @ 2010-08-24 13:32 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: kvm



On 08/23/10 23:47, Zachary Amsden wrote:
> I've heard the rumor that TSC is orders of magnitude faster under VMware
> than under KVM from three people now, and I thought you were part of
> that camp.
> 
> Needless to say, they are either laughably incorrect, or possess some
> great secret knowledge of how to make things under virtualization go
> faster than bare metal.
> 
> I also have a magical talking unicorn, which, btw, is invisible. 
> Extraordinary claims require extraordinary proof (the proof of my
> unicorn is too complex to fit in the margin of this e-mail, however, I
> assure you he is real).

I have put in a lot of time over the past 3 years to understand how the
'magic' of virtualization works; please don't lump me into camps until I
raise my hand as being part of one.


>> My point is that kvmclock is Red Hat's answer for the future -- RHEL6,
>> RHEL5.Y (whenever it proves reliable). What about the present?  What
>> about products based on other distributions newer than RHEL5 but
>> pre-kvmclock?
>>    
> 
> It should be obvious from this patchset... PIT or TSC.
> 
> KVM did not have an in-kernel PIT implementation circa 2008, so this
> data is quite old.  It's much faster now and will continue to get faster
> as exit cost goes down and the emulation gets further optimized.

It was in-kernel pit in early 2008 (kernel git entry):

commit 7837699fa6d7adf81f26ab73a5f6897ea1ab9d6a
Author: Sheng Yang <sheng.yang@intel.com>
Date:   Mon Jan 28 05:10:22 2008 +0800

    KVM: In kernel PIT model


> 
> Plus, now we have an error-free TSC.
> 
>> There are a lot of moving windows of what to use as a clock source, not
>> just per major number (RHEL4, RHEL5) but minor number (e.g., TSC
>> stability on RHEL4 -- e.g.,
>> https://bugzilla.redhat.com/show_bug.cgi?id=491154) and further
>> maintenance releases (kvmclock requiring RHEL5.5+). That is not very
>> friendly to a product making a transition to virtualization - and with
>> the same code base running bare metal or in a VM.
>>    
> 
> If you have old software running on broken hardware you do not get
> hardware performance and error-free time virtualization.  With any
> vendor.  Period.

Sucks to be old *and* broken. But old with fancy new wheels, er hardware
-- like commodity x86 servers running Nehalem-based processors -- is a
different story.

> 
> With this patchset, KVM now has a much stronger guarantee: If you have
> old guest software running on broken hardware, using SMP virtual
> machines, you do not get hardware performance and error-free time
> virtualization.    However, if you have new guest software, non-broken
> hardware, or can simply run UP guests instead of SMP, you can have
> hardware performance, and it is now error free.  Alternatively, you can
> sacrifice some accuracy and have hardware performance, even for SMP
> guests, if you can tolerate some minor cross-CPU TSC variation.  No
> other vendor I know of can make that guarantee.
> 
> Zach

If the processor has a stable TSC, why trap it? I realize you are trying
to cover a gamut of hardware and guests, so maybe a nerd knob is
needed to disable it.

David

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 12/35] Robust TSC compensation
  2010-08-20  8:07 ` [KVM timekeeping 12/35] Robust TSC compensation Zachary Amsden
  2010-08-20 17:40   ` Glauber Costa
@ 2010-08-24 21:33   ` Daniel Verkamp
  1 sibling, 0 replies; 106+ messages in thread
From: Daniel Verkamp @ 2010-08-24 21:33 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

On Fri, Aug 20, 2010 at 1:07 AM, Zachary Amsden <zamsden@redhat.com> wrote:
[...]
> +        * or make a best guest using elapsed value.

Perhaps s/guest/guess/ while the line is changing anyway.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: KVM timekeeping and TSC virtualization
  2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
                   ` (35 preceding siblings ...)
  2010-08-20 13:26 ` KVM timekeeping and TSC virtualization David S. Ahern
@ 2010-08-24 22:13 ` Marcelo Tosatti
  2010-08-25  4:04   ` Zachary Amsden
  36 siblings, 1 reply; 106+ messages in thread
From: Marcelo Tosatti @ 2010-08-24 22:13 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: kvm, Avi Kivity

On Thu, Aug 19, 2010 at 10:07:14PM -1000, Zachary Amsden wrote:
> This patch set implements full TSC virtualization, with both
> trapping and passthrough modes, and intelligent mode switching.
> As a result, TSC will never go backwards, we are stable against
> guest re-calibration attempts, VM reset, and migration.  For guests
> which require it, the TSC khz can even be preserved on migration
> to a new host.
> 
> The TSC will never be trapped on UP systems unless the host TSC
> actually runs faster than the guest; other conditions, including
> bad hardware and changing speeds are accommodated by using catchup
> mode to keep the guest passthrough TSC in line with the host clock.
> 
> What is still needed on top of this is a way to force TSC
> trapping, or disable it entirely, for benchmarking purposes.
> I refrained from adding that last bit because it wasn't clear
> whether the best thing to do is a global 'force TSC trapping' /
> 'force TSC passthrough' / 'intelligent choice', or if this control
> should be on a per-VM level, via an ioctl(), module parameter,
> or sysfs.
> 
> John and Thomas I have cc'd on this because it may be relevant to
> their interests and I always appreciate feedback, especially on
> a change set as large and complex as this.
> 
> Enjoy.  This time, there are no howler monkeys.  I've included
> all the feedback I got from previous rounds of this and more.

Applied 1-19, thanks.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: KVM timekeeping and TSC virtualization
  2010-08-24 13:32             ` David S. Ahern
@ 2010-08-24 23:01               ` Zachary Amsden
  2010-08-25 16:55                 ` Marcelo Tosatti
  0 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-24 23:01 UTC (permalink / raw)
  To: David S. Ahern; +Cc: kvm

On 08/24/2010 03:32 AM, David S. Ahern wrote:
>
> On 08/23/10 23:47, Zachary Amsden wrote:
>    
>> I've heard the rumor that TSC is orders of magnitude faster under VMware
>> than under KVM from three people now, and I thought you were part of
>> that camp.
>>
>> Needless to say, they are either laughably incorrect, or possess some
>> great secret knowledge of how to make things under virtualization go
>> faster than bare metal.
>>
>> I also have a magical talking unicorn, which, btw, is invisible.
>> Extraordinary claims require extraordinary proof (the proof of my
>> unicorn is too complex to fit in the margin of this e-mail, however, I
>> assure you he is real).
>>      
> I have put in a lot of time over the past 3 years to understand how the
> 'magic' of virtualization works; please don't lump me into camps until I
> raise my hand as being part of one.
>
>
>    
>>> My point is that kvmclock is Red Hat's answer for the future -- RHEL6,
>>> RHEL5.Y (whenever it proves reliable). What about the present?  What
>>> about products based on other distributions newer than RHEL5 but
>>> pre-kvmclock?
>>>
>>>        
>> It should be obvious from this patchset... PIT or TSC.
>>
>> KVM did not have an in-kernel PIT implementation circa 2008, so this
>> data is quite old.  It's much faster now and will continue to get faster
>> as exit cost goes down and the emulation gets further optimized.
>>      
> It was in-kernel pit in early 2008 (kernel git entry):
>
> commit 7837699fa6d7adf81f26ab73a5f6897ea1ab9d6a
> Author: Sheng Yang <sheng.yang@intel.com>
> Date:   Mon Jan 28 05:10:22 2008 +0800
>
>      KVM: In kernel PIT model
>
>
>    
>> Plus, now we have an error-free TSC.
>>
>>      
>>> There are a lot of moving windows of what to use as a clock source, not
>>> just per major number (RHEL4, RHEL5) but minor number (e.g., TSC
>>> stability on RHEL4 -- e.g.,
>>> https://bugzilla.redhat.com/show_bug.cgi?id=491154) and further
>>> maintenance releases (kvmclock requiring RHEL5.5+). That is not very
>>> friendly to a product making a transition to virtualization - and with
>>> the same code base running bare metal or in a VM.
>>>
>>>        
>> If you have old software running on broken hardware you do not get
>> hardware performance and error-free time virtualization.  With any
>> vendor.  Period.
>>      
> Sucks to be old *and* broken. But old with fancy new wheels, er hardware
> -- like commodity x86 servers running Nehalem-based processors -- is a
> different story.
>
>    
>> With this patchset, KVM now has a much stronger guarantee: If you have
>> old guest software running on broken hardware, using SMP virtual
>> machines, you do not get hardware performance and error-free time
>> virtualization.    However, if you have new guest software, non-broken
>> hardware, or can simply run UP guests instead of SMP, you can have
>> hardware performance, and it is now error free.  Alternatively, you can
>> sacrifice some accuracy and have hardware performance, even for SMP
>> guests, if you can tolerate some minor cross-CPU TSC variation.  No
>> other vendor I know of can make that guarantee.
>>
>> Zach
>>      
> If the processor has a stable TSC, why trap it? I realize you are trying
> to cover a gamut of hardware and guests, so maybe a nerd knob is
> needed to disable it.
>    

Exactly.  If you have a stable TSC, we don't trap it.  If you don't have 
a stable TSC, we do.  That's the point of these patches.

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: KVM timekeeping and TSC virtualization
  2010-08-24 22:13 ` Marcelo Tosatti
@ 2010-08-25  4:04   ` Zachary Amsden
  0 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-25  4:04 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, Avi Kivity

On 08/24/2010 12:13 PM, Marcelo Tosatti wrote:
> On Thu, Aug 19, 2010 at 10:07:14PM -1000, Zachary Amsden wrote:
>    
>> This patch set implements full TSC virtualization, with both
>> trapping and passthrough modes, and intelligent mode switching.
>> As a result, TSC will never go backwards, we are stable against
>> guest re-calibration attempts, VM reset, and migration.  For guests
>> which require it, the TSC khz can even be preserved on migration
>> to a new host.
>>
>> The TSC will never be trapped on UP systems unless the host TSC
>> actually runs faster than the guest; other conditions, including
>> bad hardware and changing speeds are accommodated by using catchup
>> mode to keep the guest passthrough TSC in line with the host clock.
>>
>> What is still needed on top of this is a way to force TSC
>> trapping, or disable it entirely, for benchmarking purposes.
>> I refrained from adding that last bit because it wasn't clear
>> whether the best thing to do is a global 'force TSC trapping' /
>> 'force TSC passthrough' / 'intelligent choice', or if this control
>> should be on a per-VM level, via an ioctl(), module parameter,
>> or sysfs.
>>
>> John and Thomas I have cc'd on this because it may be relevant to
>> their interests and I always appreciate feedback, especially on
>> a change set as large and complex as this.
>>
>> Enjoy.  This time, there are no howler monkeys.  I've included
>> all the feedback I got from previous rounds of this and more.
>>      
> Applied 1-19, thanks.
>
>    

Great, thanks.  I'll wait for more feedback on the rest and rebase.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: KVM timekeeping and TSC virtualization
  2010-08-24 23:01               ` Zachary Amsden
@ 2010-08-25 16:55                 ` Marcelo Tosatti
  2010-08-25 20:32                   ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: Marcelo Tosatti @ 2010-08-25 16:55 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: David S. Ahern, kvm

On Tue, Aug 24, 2010 at 01:01:38PM -1000, Zachary Amsden wrote:
> >>With this patchset, KVM now has a much stronger guarantee: If you have
> >>old guest software running on broken hardware, using SMP virtual
> >>machines, you do not get hardware performance and error-free time
> >>virtualization.    However, if you have new guest software, non-broken
> >>hardware, or can simply run UP guests instead of SMP, you can have
> >>hardware performance, and it is now error free.  Alternatively, you can
> >>sacrifice some accuracy and have hardware performance, even for SMP
> >>guests, if you can tolerate some minor cross-CPU TSC variation.  No
> >>other vendor I know of can make that guarantee.
> >>
> >>Zach
> >If the processor has a stable TSC, why trap it? I realize you are trying
> >to cover a gamut of hardware and guests, so maybe a nerd knob is
> >needed to disable it.
> 
> Exactly.  If you have a stable TSC, we don't trap it.  If you don't
> have a stable TSC, we do.  That's the point of these patches.

Wait, don't you trap if host TSC is faster than guest TSC?


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 25/35] Add clock catchup mode
  2010-08-20  8:07 ` [KVM timekeeping 25/35] Add clock catchup mode Zachary Amsden
@ 2010-08-25 17:27   ` Marcelo Tosatti
  2010-08-25 20:48     ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: Marcelo Tosatti @ 2010-08-25 17:27 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Glauber Costa, Thomas Gleixner, John Stultz,
	linux-kernel

On Thu, Aug 19, 2010 at 10:07:39PM -1000, Zachary Amsden wrote:
> Make the clock update handler handle generic clock synchronization,
> not just KVM clock.  We add a catchup mode which keeps passthrough
> TSC in line with absolute guest TSC.
> 
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    1 +
>  arch/x86/kvm/x86.c              |   55 ++++++++++++++++++++++++++------------
>  2 files changed, 38 insertions(+), 18 deletions(-)
> 

>  	kvm_x86_ops->vcpu_load(vcpu, cpu);
> -	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
> +	if (unlikely(vcpu->cpu != cpu) || vcpu->arch.tsc_rebase) {
>  		/* Make sure TSC doesn't go backwards */
>  		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>  				native_read_tsc() - vcpu->arch.last_host_tsc;
>  		if (tsc_delta < 0)
>  			mark_tsc_unstable("KVM discovered backwards TSC");
> -		if (check_tsc_unstable())
> +		if (check_tsc_unstable()) {
>  			kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
> -		kvm_migrate_timers(vcpu);
> +			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +		}
> +		if (vcpu->cpu != cpu)
> +			kvm_migrate_timers(vcpu);
>  		vcpu->cpu = cpu;
> +		vcpu->arch.tsc_rebase = 0;
>  	}
>  }
>  
> @@ -1947,6 +1961,12 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  	kvm_x86_ops->vcpu_put(vcpu);
>  	kvm_put_guest_fpu(vcpu);
>  	vcpu->arch.last_host_tsc = native_read_tsc();
> +
> +	/* For unstable TSC, force compensation and catchup on next CPU */
> +	if (check_tsc_unstable()) {
> +		vcpu->arch.tsc_rebase = 1;
> +		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +	}

The mix between catchup,trap versus stable,unstable TSC is confusing and
difficult to grasp. Can you please introduce all the infrastructure
first, then control usage of them in centralized places? Examples:

+static void kvm_update_tsc_trapping(struct kvm *kvm)
+{
+       int trap, i;
+       struct kvm_vcpu *vcpu;
+
+       trap = check_tsc_unstable() && atomic_read(&kvm->online_vcpus) > 1;
+       kvm_for_each_vcpu(i, vcpu, kvm)
+               kvm_x86_ops->set_tsc_trap(vcpu, trap && !vcpu->arch.time_page);
+}

+       /* For unstable TSC, force compensation and catchup on next CPU */
+       if (check_tsc_unstable()) {
+               vcpu->arch.tsc_rebase = 1;
+               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+       }


kvm_guest_time_update is becoming very confusing too. I understand this
is due to the many cases it's dealing with, but please make it as simple
as possible.

+       /*
+        * If we are trapping and no longer need to, use catchup to
+        * ensure passthrough TSC will not be less than trapped TSC
+        */
+       if (vcpu->tsc_mode == TSC_MODE_PASSTHROUGH && vcpu->tsc_trapping &&
+           ((this_tsc_khz <= v->kvm->arch.virtual_tsc_khz || kvmclock))) {
+               catchup = 1;

What, TSC trapping with kvmclock enabled?

For both catchup and trapping the resolution of the host clock is
important, as Glauber commented for kvmclock. Can you comment on the
problems that arise from a low res clock for both modes?

Similarly for catchup mode, the effect of exit frequency. No need for
any guarantees?


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: KVM timekeeping and TSC virtualization
  2010-08-25 16:55                 ` Marcelo Tosatti
@ 2010-08-25 20:32                   ` Zachary Amsden
  0 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-25 20:32 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: David S. Ahern, kvm

On 08/25/2010 06:55 AM, Marcelo Tosatti wrote:
> On Tue, Aug 24, 2010 at 01:01:38PM -1000, Zachary Amsden wrote:
>    
>>>> With this patchset, KVM now has a much stronger guarantee: If you have
>>>> old guest software running on broken hardware, using SMP virtual
>>>> machines, you do not get hardware performance and error-free time
>>>> virtualization.    However, if you have new guest software, non-broken
>>>> hardware, or can simply run UP guests instead of SMP, you can have
>>>> hardware performance, and it is now error free.  Alternatively, you can
>>>> sacrifice some accuracy and have hardware performance, even for SMP
>>>> guests, if you can tolerate some minor cross-CPU TSC variation.  No
>>>> other vendor I know of can make that guarantee.
>>>>
>>>> Zach
>>>>          
>>> If the processor has a stable TSC why trap it? I realize you are trying
>>> to cover a gauntlet of hardware and guests, so maybe a nerd knob is
>>> needed to disable.
>>>        
>> Exactly.  If you have a stable TSC, we don't trap it.  If you don't
>> have a stable TSC, we do.  That's the point of these patches.
>>      
> Wait, don't you trap if host TSC is faster than guest TSC?
>    

Yes, but that's not a standard scenario; it only applies after a 
migration.  There's also a way to avoid this:

set guest TSC > MAX (all host TSCs in cluster)

Then catchup mode will be used to keep the TSC up to speed.  Actually, 
it's an open question what to do for SMP in this case; I'm not sure I 
got that right.  Technically the proper thing to do is to trap, but this 
should have a user override.
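
(A hypothetical sketch of that policy on the management side; nothing
like this exists in the patchset itself:)

/* Pick a guest TSC frequency strictly above every host in the pool,
 * so passthrough plus catchup can be used wherever the VM lands. */
static unsigned long pick_guest_tsc_khz(const unsigned long *host_khz, int n)
{
	unsigned long max = 0;
	int i;

	for (i = 0; i < n; i++)
		if (host_khz[i] > max)
			max = host_khz[i];
	return max + 1;		/* guest TSC > MAX(all host TSCs) */
}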

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 25/35] Add clock catchup mode
  2010-08-25 17:27   ` Marcelo Tosatti
@ 2010-08-25 20:48     ` Zachary Amsden
  2010-08-25 22:01       ` Marcelo Tosatti
  0 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-25 20:48 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, Avi Kivity, Glauber Costa, Thomas Gleixner, John Stultz,
	linux-kernel

On 08/25/2010 07:27 AM, Marcelo Tosatti wrote:
> On Thu, Aug 19, 2010 at 10:07:39PM -1000, Zachary Amsden wrote:
>    
>> Make the clock update handler handle generic clock synchronization,
>> not just KVM clock.  We add a catchup mode which keeps passthrough
>> TSC in line with absolute guest TSC.
>>
>> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
>> ---
>>   arch/x86/include/asm/kvm_host.h |    1 +
>>   arch/x86/kvm/x86.c              |   55 ++++++++++++++++++++++++++------------
>>   2 files changed, 38 insertions(+), 18 deletions(-)
>>
>>      
>    
>>   	kvm_x86_ops->vcpu_load(vcpu, cpu);
>> -	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>> +	if (unlikely(vcpu->cpu != cpu) || vcpu->arch.tsc_rebase) {
>>   		/* Make sure TSC doesn't go backwards */
>>   		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>   				native_read_tsc() - vcpu->arch.last_host_tsc;
>>   		if (tsc_delta < 0)
>>   			mark_tsc_unstable("KVM discovered backwards TSC");
>> -		if (check_tsc_unstable())
>> +		if (check_tsc_unstable()) {
>>   			kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
>> -		kvm_migrate_timers(vcpu);
>> +			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>> +		}
>> +		if (vcpu->cpu != cpu)
>> +			kvm_migrate_timers(vcpu);
>>   		vcpu->cpu = cpu;
>> +		vcpu->arch.tsc_rebase = 0;
>>   	}
>>   }
>>
>> @@ -1947,6 +1961,12 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>   	kvm_x86_ops->vcpu_put(vcpu);
>>   	kvm_put_guest_fpu(vcpu);
>>   	vcpu->arch.last_host_tsc = native_read_tsc();
>> +
>> +	/* For unstable TSC, force compensation and catchup on next CPU */
>> +	if (check_tsc_unstable()) {
>> +		vcpu->arch.tsc_rebase = 1;
>> +		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>> +	}
>>      
> The mix between catchup,trap versus stable,unstable TSC is confusing and
> difficult to grasp. Can you please introduce all the infrastructure
> first, then control usage of them in centralized places? Examples:
>
> +static void kvm_update_tsc_trapping(struct kvm *kvm)
> +{
> +       int trap, i;
> +       struct kvm_vcpu *vcpu;
> +
> +       trap = check_tsc_unstable() && atomic_read(&kvm->online_vcpus) > 1;
> +       kvm_for_each_vcpu(i, vcpu, kvm)
> +               kvm_x86_ops->set_tsc_trap(vcpu, trap && !vcpu->arch.time_page);
> +}
>
> +       /* For unstable TSC, force compensation and catchup on next CPU */
> +       if (check_tsc_unstable()) {
> +               vcpu->arch.tsc_rebase = 1;
> +               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +       }
>
>
> kvm_guest_time_update is becoming very confusing too. I understand this
> >is due to the many cases it's dealing with, but please make it as simple
> as possible.
>    

I tried to comment as best as I could.  I think the whole 
"kvm_update_tsc_trapping" thing is probably a poor design choice.  It 
works, but it's thoroughly unintelligible right now without spending 
some days figuring out why.

I'll rework the tail series of patches to try to make them more clear.

> +       /*
> +        * If we are trapping and no longer need to, use catchup to
> +        * ensure passthrough TSC will not be less than trapped TSC
> +        */
> +       if (vcpu->tsc_mode == TSC_MODE_PASSTHROUGH && vcpu->tsc_trapping &&
> +           ((this_tsc_khz <= v->kvm->arch.virtual_tsc_khz || kvmclock))) {
> +               catchup = 1;
>
> What, TSC trapping with kvmclock enabled?
>    

Transitioning to use of kvmclock after a cold boot means we may have 
been trapping and now we will not be.

> For both catchup and trapping the resolution of the host clock is
> important, as Glauber commented for kvmclock. Can you comment on the
> problems that arrive from a low res clock for both modes?
>
> Similarly for catchup mode, the effect of exit frequency. No need for
> any guarantees?
>    

The scheduler will do something to get an IRQ at whatever resolution it 
uses for its timeslice.  That guarantees an exit per timeslice, so 
we'll never be behind by more than one slice while scheduling.  While 
not scheduling, we're dormant anyway, waiting on either an IRQ or shared 
memory variable change.  Local timers could end up behind when dormant.

We may need a hack to accelerate firing of timers in such a case, or 
perhaps bounds on when to use catchup mode and when to not.

Partly, the lack of implementation is by deliberate choice; the logic 
involved with setting such bounds and wisdom of doing so is a choice 
most likely to be done by a policy agent in userspace, in our case, 
qemu.  In the end, that is what has full control over the setting or not 
of guest TSC rate and choice of TSC mode.

What's lacking is the ability to force the use of a certain mode.  I 
think it's clear now, that needs to be a per-VM choice, not a global one.

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 25/35] Add clock catchup mode
  2010-08-25 20:48     ` Zachary Amsden
@ 2010-08-25 22:01       ` Marcelo Tosatti
  2010-08-25 23:38         ` Glauber Costa
  2010-08-26  0:17         ` Zachary Amsden
  0 siblings, 2 replies; 106+ messages in thread
From: Marcelo Tosatti @ 2010-08-25 22:01 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Glauber Costa, Thomas Gleixner, John Stultz,
	linux-kernel

On Wed, Aug 25, 2010 at 10:48:20AM -1000, Zachary Amsden wrote:
> On 08/25/2010 07:27 AM, Marcelo Tosatti wrote:
> >On Thu, Aug 19, 2010 at 10:07:39PM -1000, Zachary Amsden wrote:
> >>Make the clock update handler handle generic clock synchronization,
> >>not just KVM clock.  We add a catchup mode which keeps passthrough
> >>TSC in line with absolute guest TSC.
> >>
> >>Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> >>---
> >>  arch/x86/include/asm/kvm_host.h |    1 +
> >>  arch/x86/kvm/x86.c              |   55 ++++++++++++++++++++++++++------------
> >>  2 files changed, 38 insertions(+), 18 deletions(-)
> >>
> >>  	kvm_x86_ops->vcpu_load(vcpu, cpu);
> >>-	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
> >>+	if (unlikely(vcpu->cpu != cpu) || vcpu->arch.tsc_rebase) {
> >>  		/* Make sure TSC doesn't go backwards */
> >>  		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
> >>  				native_read_tsc() - vcpu->arch.last_host_tsc;
> >>  		if (tsc_delta < 0)
> >>  			mark_tsc_unstable("KVM discovered backwards TSC");
> >>-		if (check_tsc_unstable())
> >>+		if (check_tsc_unstable()) {
> >>  			kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
> >>-		kvm_migrate_timers(vcpu);
> >>+			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> >>+		}
> >>+		if (vcpu->cpu != cpu)
> >>+			kvm_migrate_timers(vcpu);
> >>  		vcpu->cpu = cpu;
> >>+		vcpu->arch.tsc_rebase = 0;
> >>  	}
> >>  }
> >>
> >>@@ -1947,6 +1961,12 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >>  	kvm_x86_ops->vcpu_put(vcpu);
> >>  	kvm_put_guest_fpu(vcpu);
> >>  	vcpu->arch.last_host_tsc = native_read_tsc();
> >>+
> >>+	/* For unstable TSC, force compensation and catchup on next CPU */
> >>+	if (check_tsc_unstable()) {
> >>+		vcpu->arch.tsc_rebase = 1;
> >>+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> >>+	}
> >The mix between catchup,trap versus stable,unstable TSC is confusing and
> >difficult to grasp. Can you please introduce all the infrastructure
> >first, then control usage of them in centralized places? Examples:
> >
> >+static void kvm_update_tsc_trapping(struct kvm *kvm)
> >+{
> >+       int trap, i;
> >+       struct kvm_vcpu *vcpu;
> >+
> >+       trap = check_tsc_unstable() && atomic_read(&kvm->online_vcpus) > 1;
> >+       kvm_for_each_vcpu(i, vcpu, kvm)
> >+               kvm_x86_ops->set_tsc_trap(vcpu, trap && !vcpu->arch.time_page);
> >+}
> >
> >+       /* For unstable TSC, force compensation and catchup on next CPU */
> >+       if (check_tsc_unstable()) {
> >+               vcpu->arch.tsc_rebase = 1;
> >+               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> >+       }
> >
> >
> >kvm_guest_time_update is becoming very confusing too. I understand this
> > >is due to the many cases it's dealing with, but please make it as simple
> >as possible.
> 
> I tried to comment as best as I could.  I think the whole
> "kvm_update_tsc_trapping" thing is probably a poor design choice.
> It works, but it's thoroughly unintelligible right now without
> spending some days figuring out why.
> 
> I'll rework the tail series of patches to try to make them more clear.
> 
> >+       /*
> >+        * If we are trapping and no longer need to, use catchup to
> >+        * ensure passthrough TSC will not be less than trapped TSC
> >+        */
> >+       if (vcpu->tsc_mode == TSC_MODE_PASSTHROUGH && vcpu->tsc_trapping &&
> >+           ((this_tsc_khz <= v->kvm->arch.virtual_tsc_khz || kvmclock))) {
> >+               catchup = 1;
> >
> >What, TSC trapping with kvmclock enabled?
> 
> Transitioning to use of kvmclock after a cold boot means we may have
> been trapping and now we will not be.
> 
> >For both catchup and trapping the resolution of the host clock is
> >important, as Glauber commented for kvmclock. Can you comment on the
> >problems that arise from a low res clock for both modes?
> >
> >Similarly for catchup mode, the effect of exit frequency. No need for
> >any guarantees?
> 
> The scheduler will do something to get an IRQ at whatever resolution
> it uses for its timeslice.  That guarantees an exit per timeslice,
> so we'll never be behind by more than one slice while scheduling.
> While not scheduling, we're dormant anyway, waiting on either an IRQ
> or shared memory variable change.  Local timers could end up behind
> when dormant.
> 
> We may need a hack to accelerate firing of timers in such a case, or
> perhaps bounds on when to use catchup mode and when to not.

What about emulating rdtsc with low res clock? 

"The RDTSC instruction reads the time-stamp counter and is guaranteed to
return a monotonically increasing unique value whenever executed, except
for a 64-bit counter wraparound."

> Partly, the lack of implementation is by deliberate choice; the
> logic involved with setting such bounds and wisdom of doing so is a
> choice most likely to be done by a policy agent in userspace, in our
> case, qemu.  In the end, that is what has full control over the
> setting or not of guest TSC rate and choice of TSC mode.
> 
> What's lacking is the ability to force the use of a certain mode.  I
> think it's clear now, that needs to be a per-VM choice, not a global
> one.
> 
> Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 25/35] Add clock catchup mode
  2010-08-25 22:01       ` Marcelo Tosatti
@ 2010-08-25 23:38         ` Glauber Costa
  2010-08-26  0:17         ` Zachary Amsden
  1 sibling, 0 replies; 106+ messages in thread
From: Glauber Costa @ 2010-08-25 23:38 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Zachary Amsden, kvm, Avi Kivity, Thomas Gleixner, John Stultz,
	linux-kernel

On Wed, Aug 25, 2010 at 07:01:34PM -0300, Marcelo Tosatti wrote:
> On Wed, Aug 25, 2010 at 10:48:20AM -1000, Zachary Amsden wrote:
> > On 08/25/2010 07:27 AM, Marcelo Tosatti wrote:
> > >On Thu, Aug 19, 2010 at 10:07:39PM -1000, Zachary Amsden wrote:
> > >>Make the clock update handler handle generic clock synchronization,
> > >>not just KVM clock.  We add a catchup mode which keeps passthrough
> > >>TSC in line with absolute guest TSC.
> > >>
> > >>Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> > >>---
> > >>  arch/x86/include/asm/kvm_host.h |    1 +
> > >>  arch/x86/kvm/x86.c              |   55 ++++++++++++++++++++++++++------------
> > >>  2 files changed, 38 insertions(+), 18 deletions(-)
> > >>
> > >>  	kvm_x86_ops->vcpu_load(vcpu, cpu);
> > >>-	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
> > >>+	if (unlikely(vcpu->cpu != cpu) || vcpu->arch.tsc_rebase) {
> > >>  		/* Make sure TSC doesn't go backwards */
> > >>  		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
> > >>  				native_read_tsc() - vcpu->arch.last_host_tsc;
> > >>  		if (tsc_delta < 0)
> > >>  			mark_tsc_unstable("KVM discovered backwards TSC");
> > >>-		if (check_tsc_unstable())
> > >>+		if (check_tsc_unstable()) {
> > >>  			kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
> > >>-		kvm_migrate_timers(vcpu);
> > >>+			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> > >>+		}
> > >>+		if (vcpu->cpu != cpu)
> > >>+			kvm_migrate_timers(vcpu);
> > >>  		vcpu->cpu = cpu;
> > >>+		vcpu->arch.tsc_rebase = 0;
> > >>  	}
> > >>  }
> > >>
> > >>@@ -1947,6 +1961,12 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> > >>  	kvm_x86_ops->vcpu_put(vcpu);
> > >>  	kvm_put_guest_fpu(vcpu);
> > >>  	vcpu->arch.last_host_tsc = native_read_tsc();
> > >>+
> > >>+	/* For unstable TSC, force compensation and catchup on next CPU */
> > >>+	if (check_tsc_unstable()) {
> > >>+		vcpu->arch.tsc_rebase = 1;
> > >>+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> > >>+	}
> > >The mix between catchup,trap versus stable,unstable TSC is confusing and
> > >difficult to grasp. Can you please introduce all the infrastructure
> > >first, then control usage of them in centralized places? Examples:
> > >
> > >+static void kvm_update_tsc_trapping(struct kvm *kvm)
> > >+{
> > >+       int trap, i;
> > >+       struct kvm_vcpu *vcpu;
> > >+
> > >+       trap = check_tsc_unstable() && atomic_read(&kvm->online_vcpus) > 1;
> > >+       kvm_for_each_vcpu(i, vcpu, kvm)
> > >+               kvm_x86_ops->set_tsc_trap(vcpu, trap && !vcpu->arch.time_page);
> > >+}
> > >
> > >+       /* For unstable TSC, force compensation and catchup on next CPU */
> > >+       if (check_tsc_unstable()) {
> > >+               vcpu->arch.tsc_rebase = 1;
> > >+               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> > >+       }
> > >
> > >
> > >kvm_guest_time_update is becoming very confusing too. I understand this
> > >is due to the many cases it's dealing with, but please make it as simple
> > >as possible.
> > 
> > I tried to comment as best as I could.  I think the whole
> > "kvm_update_tsc_trapping" thing is probably a poor design choice.
> > It works, but it's thoroughly unintelligible right now without
> > spending some days figuring out why.
> > 
> > I'll rework the tail series of patches to try to make them more clear.
> > 
> > >+       /*
> > >+        * If we are trapping and no longer need to, use catchup to
> > >+        * ensure passthrough TSC will not be less than trapped TSC
> > >+        */
> > >+       if (vcpu->tsc_mode == TSC_MODE_PASSTHROUGH && vcpu->tsc_trapping &&
> > >+           ((this_tsc_khz <= v->kvm->arch.virtual_tsc_khz || kvmclock))) {
> > >+               catchup = 1;
> > >
> > >What, TSC trapping with kvmclock enabled?
> > 
> > Transitioning to use of kvmclock after a cold boot means we may have
> > been trapping and now we will not be.
> > 
> > >For both catchup and trapping the resolution of the host clock is
> > >important, as Glauber commented for kvmclock. Can you comment on the
> > >problems that arise from a low res clock for both modes?
> > >
> > >Similarly for catchup mode, the effect of exit frequency. No need for
> > >any guarantees?
> > 
> > The scheduler will do something to get an IRQ at whatever resolution
> > it uses for its timeslice.  That guarantees an exit per timeslice,
> > so we'll never be behind by more than one slice while scheduling.
> > While not scheduling, we're dormant anyway, waiting on either an IRQ
> > or shared memory variable change.  Local timers could end up behind
> > when dormant.
> > 
> > We may need a hack to accelerate firing of timers in such a case, or
> > perhaps bounds on when to use catchup mode and when to not.
> 
> What about emulating rdtsc with low res clock? 
> 
> "The RDTSC instruction reads the time-stamp counter and is guaranteed to
> return a monotonically increasing unique value whenever executed, except
> for a 64-bit counter wraparound."
> 
This is bad semantics, IMHO. It is totally different behaviour from what
guest users would expect.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 25/35] Add clock catchup mode
  2010-08-25 22:01       ` Marcelo Tosatti
  2010-08-25 23:38         ` Glauber Costa
@ 2010-08-26  0:17         ` Zachary Amsden
  1 sibling, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-08-26  0:17 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, Avi Kivity, Glauber Costa, Thomas Gleixner, John Stultz,
	linux-kernel

On 08/25/2010 12:01 PM, Marcelo Tosatti wrote:
> On Wed, Aug 25, 2010 at 10:48:20AM -1000, Zachary Amsden wrote:
>    
>> On 08/25/2010 07:27 AM, Marcelo Tosatti wrote:
>>      
>>> On Thu, Aug 19, 2010 at 10:07:39PM -1000, Zachary Amsden wrote:
>>>        
>>>> Make the clock update handler handle generic clock synchronization,
>>>> not just KVM clock.  We add a catchup mode which keeps passthrough
>>>> TSC in line with absolute guest TSC.
>>>>
>>>> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
>>>> ---
>>>>   arch/x86/include/asm/kvm_host.h |    1 +
>>>>   arch/x86/kvm/x86.c              |   55 ++++++++++++++++++++++++++------------
>>>>   2 files changed, 38 insertions(+), 18 deletions(-)
>>>>
>>>>   	kvm_x86_ops->vcpu_load(vcpu, cpu);
>>>> -	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>>> +	if (unlikely(vcpu->cpu != cpu) || vcpu->arch.tsc_rebase) {
>>>>   		/* Make sure TSC doesn't go backwards */
>>>>   		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>>>   				native_read_tsc() - vcpu->arch.last_host_tsc;
>>>>   		if (tsc_delta < 0)
>>>>   			mark_tsc_unstable("KVM discovered backwards TSC");
>>>> -		if (check_tsc_unstable())
>>>> +		if (check_tsc_unstable()) {
>>>>   			kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
>>>> -		kvm_migrate_timers(vcpu);
>>>> +			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>>>> +		}
>>>> +		if (vcpu->cpu != cpu)
>>>> +			kvm_migrate_timers(vcpu);
>>>>   		vcpu->cpu = cpu;
>>>> +		vcpu->arch.tsc_rebase = 0;
>>>>   	}
>>>>   }
>>>>
>>>> @@ -1947,6 +1961,12 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>>>   	kvm_x86_ops->vcpu_put(vcpu);
>>>>   	kvm_put_guest_fpu(vcpu);
>>>>   	vcpu->arch.last_host_tsc = native_read_tsc();
>>>> +
>>>> +	/* For unstable TSC, force compensation and catchup on next CPU */
>>>> +	if (check_tsc_unstable()) {
>>>> +		vcpu->arch.tsc_rebase = 1;
>>>> +		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>>>> +	}
>>>>          
>>> The mix between catchup,trap versus stable,unstable TSC is confusing and
>>> difficult to grasp. Can you please introduce all the infrastructure
>>> first, then control usage of them in centralized places? Examples:
>>>
>>> +static void kvm_update_tsc_trapping(struct kvm *kvm)
>>> +{
>>> +       int trap, i;
>>> +       struct kvm_vcpu *vcpu;
>>> +
>>> +       trap = check_tsc_unstable() && atomic_read(&kvm->online_vcpus) > 1;
>>> +       kvm_for_each_vcpu(i, vcpu, kvm)
>>> +               kvm_x86_ops->set_tsc_trap(vcpu, trap && !vcpu->arch.time_page);
>>> +}
>>>
>>> +       /* For unstable TSC, force compensation and catchup on next CPU */
>>> +       if (check_tsc_unstable()) {
>>> +               vcpu->arch.tsc_rebase = 1;
>>> +               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>>> +       }
>>>
>>>
>>> kvm_guest_time_update is becoming very confusing too. I understand this
>>> is due to the many cases it's dealing with, but please make it as simple
>>> as possible.
>>>        
>> I tried to comment as best as I could.  I think the whole
>> "kvm_update_tsc_trapping" thing is probably a poor design choice.
>> It works, but it's thoroughly unintelligible right now without
>> spending some days figuring out why.
>>
>> I'll rework the tail series of patches to try to make them more clear.
>>
>>      
>>> +       /*
>>> +        * If we are trapping and no longer need to, use catchup to
>>> +        * ensure passthrough TSC will not be less than trapped TSC
>>> +        */
>>> +       if (vcpu->tsc_mode == TSC_MODE_PASSTHROUGH && vcpu->tsc_trapping &&
>>> +           ((this_tsc_khz <= v->kvm->arch.virtual_tsc_khz || kvmclock))) {
>>> +               catchup = 1;
>>>
>>> What, TSC trapping with kvmclock enabled?
>>>        
>> Transitioning to use of kvmclock after a cold boot means we may have
>> been trapping and now we will not be.
>>
>>      
>>> For both catchup and trapping the resolution of the host clock is
>>> important, as Glauber commented for kvmclock. Can you comment on the
>>> problems that arise from a low res clock for both modes?
>>>
>>> Similarly for catchup mode, the effect of exit frequency. No need for
>>> any guarantees?
>>>        
>> The scheduler will do something to get an IRQ at whatever resolution
>> it uses for its timeslice.  That guarantees an exit per timeslice,
>> so we'll never be behind by more than one slice while scheduling.
>> While not scheduling, we're dormant anyway, waiting on either an IRQ
>> or shared memory variable change.  Local timers could end up behind
>> when dormant.
>>
>> We may need a hack to accelerate firing of timers in such a case, or
>> perhaps bounds on when to use catchup mode and when to not.
>>      
> What about emulating rdtsc with low res clock?
>
> "The RDTSC instruction reads the time-stamp counter and is guaranteed to
> return a monotonically increasing unique value whenever executed, except
> for a 64-bit counter wraparound."
>    

Technically, that may not be quite correct.

<digression into weeds>

The RDTSC instruction will return a monotonically increasing unique 
value, but the execution and retirement of the instruction are 
unserialized.  So technically, two simultaneous RDTSC could be issued to 
multiple execution units, and they may either return the same values, or 
the earlier one may stall and complete after the latter.

rdtsc                   # first read:  %edx:%eax
mov %eax, %ebx          # save low half
mov %edx, %ecx          # save high half
rdtsc                   # second read: %edx:%eax
cmp %ecx, %edx          # compare high halves, second minus first
jb fail                 # second high < first high: went backwards
ja good                 # second high > first high: strictly increasing
cmp %ebx, %eax          # highs equal: compare low halves
jbe fail                # second low <= first low: duplicate or backwards
jmp good
fail:
int3
good:
ret

If execution of RDTSC is restricted to a single issue unit, this can 
never fail.  If it can be issued simultaneously in multiple units, it 
can fail because register renaming may end up sorting the instruction 
stream and removing dependencies so it can be executed as:

     UNIT 1                           UNIT 2
rdtsc                                 rdtsc
mov %eax, %ebx                        (store to local %edx, %eax)
mov %edx, %ecx                        cmp %ebx, local %eax
                                      (commit local %edx, %eax to
                                       global registers)
cmp %ecx, %edx
jb fail
                                      jbe fail

Both failure modes can be observed if this is indeed the case.  I'm not 
aware of anything specifically done to maintain the serialization 
internally, and since the architecture explicitly states that 
RDTSC is unserialized, I doubt anything is done to prevent this situation.

</digression into weeds>

However, that's not the pertinent issue.  If the clock is very low res, 
we don't present a higher granularity TSC to the guest.

While there are things that can be done to ensure that (add 1 for each 
read, estimate with the TSC...), they have problems of their own and in 
general will make things very messy.

Given the above digression, I'm not sure that any code written to rely 
on such guarantees is actually sound.

It is plausible, however, someone does

count of some value / (TSC2 - TSC1)

and ends up with a divide by zero.  So it may be better to bump the 
counter by at least one for each call.
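
Something like the following sketch would guard against that; the names
are made up and locking is omitted:

#include <stdint.h>

static uint64_t last_emulated_tsc;

/* Derive a guest TSC value from a low resolution boot-based clock,
 * bumping by at least 1 per read so back-to-back reads never return
 * equal values (and TSC2 - TSC1 can never be zero). */
static uint64_t emulate_rdtsc(uint64_t ns_since_boot, uint32_t guest_tsc_khz)
{
	uint64_t tsc = ns_since_boot * guest_tsc_khz / 1000000u;

	if (tsc <= last_emulated_tsc)		/* clock hasn't advanced */
		tsc = last_emulated_tsc + 1;	/* force strict increase */
	last_emulated_tsc = tsc;
	return tsc;
}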

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 13/35] Perform hardware_enable in CPU_STARTING callback
  2010-08-20  8:07 ` [KVM timekeeping 13/35] Perform hardware_enable in CPU_STARTING callback Zachary Amsden
@ 2010-08-27 16:32   ` Jan Kiszka
  2010-08-27 23:43     ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Kiszka @ 2010-08-27 16:32 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

Zachary Amsden wrote:
> The CPU_STARTING callback was added upstream with the intention
> of being used for KVM, specifically for the hardware enablement
> that must be done before we can run in hardware virt.  It had
> bugs on the x86_64 architecture at the time, where it was called
> after CPU_ONLINE.  The arches have since merged and the bug is
> gone.

What bugs are you referring to, or since which kernel version is
CPU_STARTING usable for KVM? I need to encode this into kvm-kmod.

Thanks,
Jan

> 
> It might be noted other features should probably start making
> use of this callback; microcode updates in particular which
> might be fixing important errata would be best applied before
> beginning to run user tasks.
> 
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> ---
>  virt/kvm/kvm_main.c |    5 ++---
>  1 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b78b794..d4853a5 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1958,10 +1958,10 @@ static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
>  		       cpu);
>  		hardware_disable(NULL);
>  		break;
> -	case CPU_ONLINE:
> +	case CPU_STARTING:
>  		printk(KERN_INFO "kvm: enabling virtualization on CPU%d\n",
>  		       cpu);
> -		smp_call_function_single(cpu, hardware_enable, NULL, 1);
> +		hardware_enable(NULL);
>  		break;
>  	}
>  	return NOTIFY_OK;
> @@ -2096,7 +2096,6 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
>  
>  static struct notifier_block kvm_cpu_notifier = {
>  	.notifier_call = kvm_cpu_hotplug,
> -	.priority = 20, /* must be > scheduler priority */
>  };
>  
>  static int vm_stat_get(void *_offset, u64 *val)

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 17/35] Implement getnsboottime kernel API
  2010-08-20  8:07 ` [KVM timekeeping 17/35] Implement getnsboottime kernel API Zachary Amsden
  2010-08-20 18:39   ` john stultz
@ 2010-08-27 18:05   ` Jan Kiszka
  2010-08-27 23:48     ` Zachary Amsden
  1 sibling, 1 reply; 106+ messages in thread
From: Jan Kiszka @ 2010-08-27 18:05 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

Zachary Amsden wrote:
> Add a kernel call to get the number of nanoseconds since boot.  This
> is generally useful enough to make it a generic call.
> 
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> ---
>  include/linux/time.h      |    1 +
>  kernel/time/timekeeping.c |   27 +++++++++++++++++++++++++++
>  2 files changed, 28 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/time.h b/include/linux/time.h
> index ea3559f..5d04108 100644
> --- a/include/linux/time.h
> +++ b/include/linux/time.h
> @@ -145,6 +145,7 @@ extern void getnstimeofday(struct timespec *tv);
>  extern void getrawmonotonic(struct timespec *ts);
>  extern void getboottime(struct timespec *ts);
>  extern void monotonic_to_bootbased(struct timespec *ts);
> +extern s64 getnsboottime(void);
>  
>  extern struct timespec timespec_trunc(struct timespec t, unsigned gran);
>  extern int timekeeping_valid_for_hres(void);
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index caf8d4d..d250f0a 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -285,6 +285,33 @@ void ktime_get_ts(struct timespec *ts)
>  }
>  EXPORT_SYMBOL_GPL(ktime_get_ts);
>  
> +
> +/**
> + * getnsboottime - get the bootbased clock in nsec format
> + *
> + * The function calculates the bootbased clock from the realtime
> + * clock and the wall_to_monotonic offset and stores the result
> + * in normalized timespec format in the variable pointed to by @ts.
> + */

This thing is not returning anything in some ts variable. And I also had
a hard time spotting the key difference from getboottime - the name is
really confusing.

Besides this, if you have a good suggestion for how to provide a compat
version for older kernels, I'm all ears. Please also have a careful look
at kvm-kmod's kvm_getboottime again; right now I'm a bit confused about
what it is supposed to return and what it actually does (note that
kvm-kmod cannot account for time spent in suspend state).

Thanks!
Jan

> +s64 getnsboottime(void)
> +{
> +	unsigned int seq;
> +	s64 secs, nsecs;
> +
> +	WARN_ON(timekeeping_suspended);
> +
> +	do {
> +		seq = read_seqbegin(&xtime_lock);
> +		secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
> +		secs += total_sleep_time.tv_sec;
> +		nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
> +		nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns();
> +
> +	} while (read_seqretry(&xtime_lock, seq));
> +	return nsecs + (secs * NSEC_PER_SEC);
> +}
> +EXPORT_SYMBOL_GPL(getnsboottime);
> +
>  /**
>   * do_gettimeofday - Returns the time of day in a timeval
>   * @tv:		pointer to the timeval to be set

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 13/35] Perform hardware_enable in CPU_STARTING callback
  2010-08-27 16:32   ` Jan Kiszka
@ 2010-08-27 23:43     ` Zachary Amsden
  2010-08-30  9:10       ` Jan Kiszka
  0 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-27 23:43 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

On 08/27/2010 06:32 AM, Jan Kiszka wrote:
> Zachary Amsden wrote:
>    
>> The CPU_STARTING callback was added upstream with the intention
>> of being used for KVM, specifically for the hardware enablement
>> that must be done before we can run in hardware virt.  It had
>> bugs on the x86_64 architecture at the time, where it was called
>> after CPU_ONLINE.  The arches have since merged and the bug is
>> gone.
>>      
> What bugs are you referring to, or since which kernel version is
> CPU_STARTING usable for KVM? I need to encode this into kvm-kmod.
>    

Prior to the x86_64 / i386 merge, CPU_STARTING didn't work the same way 
/ exist in the x86_64 code... most of this is historical guesswork.  At 
some point, the 32/64 versions of the code in smpboot.c got merged and 
now it does.

Binary searching around my tree shows this timeframe:

2.6.11? - 2.6.23 : silver age ; i386 and x86_64 merge underway
    |
2.6.24 : bronze age ; i386 and x86_64 deprecated
    |
2.6.26 : iron age; smpboot_32.c / smpboot_64.c merge
      \
      2.6.28 : CPU_STARTING exists and first used

/me scratches head wondering how this affects operation on older kernels....
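
If the guesswork above holds, a kvm-kmod compat gate could be as blunt
as this (untested sketch; the 2.6.28 cutoff is assumed from the
timeline above):

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,28)
	case CPU_STARTING:	/* runs on the starting CPU itself */
		hardware_enable(NULL);
		break;
#else
	/* old way: cross-call after CPU_ONLINE; this also needs the
	 * .priority = 20 notifier field that the patch removes */
	case CPU_ONLINE:
		smp_call_function_single(cpu, hardware_enable, NULL, 1);
		break;
#endif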

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 17/35] Implement getnsboottime kernel API
  2010-08-27 18:05   ` Jan Kiszka
@ 2010-08-27 23:48     ` Zachary Amsden
  2010-08-30 18:07       ` Jan Kiszka
  0 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-08-27 23:48 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

On 08/27/2010 08:05 AM, Jan Kiszka wrote:
> Zachary Amsden wrote:
>    
>> Add a kernel call to get the number of nanoseconds since boot.  This
>> is generally useful enough to make it a generic call.
>>
>> Signed-off-by: Zachary Amsden<zamsden@redhat.com>
>> ---
>>   include/linux/time.h      |    1 +
>>   kernel/time/timekeeping.c |   27 +++++++++++++++++++++++++++
>>   2 files changed, 28 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/time.h b/include/linux/time.h
>> index ea3559f..5d04108 100644
>> --- a/include/linux/time.h
>> +++ b/include/linux/time.h
>> @@ -145,6 +145,7 @@ extern void getnstimeofday(struct timespec *tv);
>>   extern void getrawmonotonic(struct timespec *ts);
>>   extern void getboottime(struct timespec *ts);
>>   extern void monotonic_to_bootbased(struct timespec *ts);
>> +extern s64 getnsboottime(void);
>>
>>   extern struct timespec timespec_trunc(struct timespec t, unsigned gran);
>>   extern int timekeeping_valid_for_hres(void);
>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index caf8d4d..d250f0a 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -285,6 +285,33 @@ void ktime_get_ts(struct timespec *ts)
>>   }
>>   EXPORT_SYMBOL_GPL(ktime_get_ts);
>>
>> +
>> +/**
>> + * getnsboottime - get the bootbased clock in nsec format
>> + *
>> + * The function calculates the bootbased clock from the realtime
>> + * clock and the wall_to_monotonic offset and stores the result
>> + * in normalized timespec format in the variable pointed to by @ts.
>> + */
>>      
> This thing is not returning anything in some ts variable. And I also had
> a hard time spotting the key difference to getboottime - the name is
> really confusing.
>
> Besides this, if you have good suggestion how to provide a compat
> version for older kernels, I'm all ears. Please also have a careful look
> at kvm-kmod's kvm_getboottime again, right now I'm a bit confused about
> what it is supposed to return and what it actually does (note that
> kvm-kmod cannot account for time spent in suspend state).
>
> Thanks!
> Jan
>
>    
>> +s64 getnsboottime(void)
>> +{
>> +	unsigned int seq;
>> +	s64 secs, nsecs;
>> +
>> +	WARN_ON(timekeeping_suspended);
>> +
>> +	do {
>> +		seq = read_seqbegin(&xtime_lock);
>> +		secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
>> +		secs += total_sleep_time.tv_sec;
>> +		nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
>> +		nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns();
>> +
>> +	} while (read_seqretry(&xtime_lock, seq));
>> +	return nsecs + (secs * NSEC_PER_SEC);
>> +}
>> +EXPORT_SYMBOL_GPL(getnsboottime);
>> +
>>   /**
>>    * do_gettimeofday - Returns the time of day in a timeval
>>    * @tv:		pointer to the timeval to be set
>>      
>    

Yes, we should probably change the name before making this an actual 
kernel API; John made some better suggestions.

For kvm-kmod, the following conversion should work:

static inline u64 get_kernel_ns(void)
{
     struct timespec ts;

     WARN_ON(preemptible());
     ktime_get_ts(&ts);              /* monotonic time */
     monotonic_to_bootbased(&ts);    /* add time spent in suspend */
     return timespec_to_ns(&ts);
}

The only real point to getnsboottime is to stop the unnecessary 
conversions, but looking at it now, it doesn't appear to actually save 
any, does it?  Maybe it is better to just drop the thing altogether.

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 13/35] Perform hardware_enable in CPU_STARTING callback
  2010-08-27 23:43     ` Zachary Amsden
@ 2010-08-30  9:10       ` Jan Kiszka
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Kiszka @ 2010-08-30  9:10 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

Zachary Amsden wrote:
> On 08/27/2010 06:32 AM, Jan Kiszka wrote:
>> Zachary Amsden wrote:
>>    
>>> The CPU_STARTING callback was added upstream with the intention
>>> of being used for KVM, specifically for the hardware enablement
>>> that must be done before we can run in hardware virt.  It had
>>> bugs on the x86_64 architecture at the time, where it was called
>>> after CPU_ONLINE.  The arches have since merged and the bug is
>>> gone.
>>>      
>> What bugs are you referring to, or since which kernel version is
>> CPU_STARTING usable for KVM? I need to encode this into kvm-kmod.
>>    
> 
> Prior to the x86_64 / i386 merge, CPU_STARTING didn't work the same way 
> / exist in the x86_64 code... most of this is historical guesswork.  At 
> some point, the 32/64 versions of the code in smpboot.c got merged and 
> now it does.
> 
> Binary searching around my tree shows this timeframe:
> 
> 2.6.11? - 2.6.23 : silver age ; i386 and x86_64 merge underway
>     |
> 2.6.24 : bronze age ; i386 and x86_64 deprecated
>     |
> 2.6.26 : iron age; smpboot_32.c / smpboot_64.c merge
>       \
>       2.6.28 : CPU_STARTING exists and first used
> 
> /me scratches head wondering how this affects operation on older kernels....

I basically need to revert your patch on host kernels without
CPU_STARTING and also on those where it might be broken. So I will set
the barrier to 2.6.28 then.

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 17/35] Implement getnsboottime kernel API
  2010-08-27 23:48     ` Zachary Amsden
@ 2010-08-30 18:07       ` Jan Kiszka
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Kiszka @ 2010-08-30 18:07 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

Zachary Amsden wrote:
> On 08/27/2010 08:05 AM, Jan Kiszka wrote:
>> Zachary Amsden wrote:
>>    
>>> Add a kernel call to get the number of nanoseconds since boot.  This
>>> is generally useful enough to make it a generic call.
>>>
>>> Signed-off-by: Zachary Amsden<zamsden@redhat.com>
>>> ---
>>>   include/linux/time.h      |    1 +
>>>   kernel/time/timekeeping.c |   27 +++++++++++++++++++++++++++
>>>   2 files changed, 28 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/include/linux/time.h b/include/linux/time.h
>>> index ea3559f..5d04108 100644
>>> --- a/include/linux/time.h
>>> +++ b/include/linux/time.h
>>> @@ -145,6 +145,7 @@ extern void getnstimeofday(struct timespec *tv);
>>>   extern void getrawmonotonic(struct timespec *ts);
>>>   extern void getboottime(struct timespec *ts);
>>>   extern void monotonic_to_bootbased(struct timespec *ts);
>>> +extern s64 getnsboottime(void);
>>>
>>>   extern struct timespec timespec_trunc(struct timespec t, unsigned gran);
>>>   extern int timekeeping_valid_for_hres(void);
>>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>>> index caf8d4d..d250f0a 100644
>>> --- a/kernel/time/timekeeping.c
>>> +++ b/kernel/time/timekeeping.c
>>> @@ -285,6 +285,33 @@ void ktime_get_ts(struct timespec *ts)
>>>   }
>>>   EXPORT_SYMBOL_GPL(ktime_get_ts);
>>>
>>> +
>>> +/**
>>> + * getnsboottime - get the bootbased clock in nsec format
>>> + *
>>> + * The function calculates the bootbased clock from the realtime
>>> + * clock and the wall_to_monotonic offset and stores the result
>>> + * in normalized timespec format in the variable pointed to by @ts.
>>> + */
>>>      
>> This thing is not returning anything in some ts variable. And I also had
>> a hard time spotting the key difference to getboottime - the name is
>> really confusing.
>>
>> Besides this, if you have good suggestion how to provide a compat
>> version for older kernels, I'm all ears. Please also have a careful look
>> at kvm-kmod's kvm_getboottime again, right now I'm a bit confused about
>> what it is supposed to return and what it actually does (note that
>> kvm-kmod cannot account for time spent in suspend state).
>>
>> Thanks!
>> Jan
>>
>>    
>>> +s64 getnsboottime(void)
>>> +{
>>> +	unsigned int seq;
>>> +	s64 secs, nsecs;
>>> +
>>> +	WARN_ON(timekeeping_suspended);
>>> +
>>> +	do {
>>> +		seq = read_seqbegin(&xtime_lock);
>>> +		secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
>>> +		secs += total_sleep_time.tv_sec;
>>> +		nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
>>> +		nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns();
>>> +
>>> +	} while (read_seqretry(&xtime_lock, seq));
>>> +	return nsecs + (secs * NSEC_PER_SEC);
>>> +}
>>> +EXPORT_SYMBOL_GPL(getnsboottime);
>>> +
>>>   /**
>>>    * do_gettimeofday - Returns the time of day in a timeval
>>>    * @tv:		pointer to the timeval to be set
>>>      
>>    
> 
> Yes, we should probably change the name before making this an actual 
> kernel API, John made some better suggestions.
> 
> For kvm-kmod, the following conversion should work:
> 
> static inline u64 get_kernel_ns(void)
> {
>      struct timespec ts;
> 
>      WARN_ON(preemptible());
>      ktime_get_ts(&ts);
>      monotonic_to_bootbased(&ts);
>      return timespec_to_ns(&ts);
> }
> 
> The only real point to getnsboottime is to stop the unnecessary 
> conversions, but looking at it now, it doesn't appear to actually save 
> any, does it?  Maybe it is better to just drop the thing all-together.

It looks like it. Will you post a removal patch to kvm.git? I'd better
wait with updating kvm-kmod then.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 106+ messages in thread

* RE: [KVM timekeeping 26/35] Catchup slower TSC to guest rate
  2010-08-20  8:07 ` [KVM timekeeping 26/35] Catchup slower TSC to guest rate Zachary Amsden
@ 2010-09-07  3:44   ` Dong, Eddie
  2010-09-07 22:14     ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: Dong, Eddie @ 2010-09-07  3:44 UTC (permalink / raw)
  To: Zachary Amsden, kvm
  Cc: Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel, Dong, Eddie

Zachary:
	Will you extend the logic to cover the situation where the host TSC runs at a higher rate than the guest rate but the PCPU is overcommitted? In that case we can likely use the time the VCPU spends scheduled out to catch up as well. Of course, if the VCPU's scheduled-out time is not enough to compensate for the cycles gained from the fast host TSC (exceeding a threshold), we will eventually have to fall back to trap-and-emulate mode.
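
	Roughly what I have in mind (a pseudo-code sketch only; the fields and helpers below are invented for illustration):

/*
 * On VCPU sched-in: the guest could not read its TSC while it was
 * scheduled out, so pulling the TSC offset back by at most the host
 * cycles elapsed in that window is invisible to the guest.
 */
void tsc_catch_down(struct kvm_vcpu *vcpu)
{
	u64 now = native_read_tsc();
	u64 invisible = now - vcpu->arch.last_sched_out_tsc;	/* invented field */
	s64 overrun = vcpu->arch.tsc_overrun;	/* cycles ahead of guest rate; invented */

	if (overrun <= 0)
		return;
	if ((u64)overrun <= invisible) {
		/* absorb the overrun into the scheduled-out window */
		kvm_x86_ops->adjust_tsc_offset(vcpu, -overrun);
		vcpu->arch.tsc_overrun = 0;
	} else {
		/* not enough scheduled-out slack: fall back to trapping */
		enable_tsc_trapping(vcpu);	/* invented helper */
	}
}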

Thx, Eddie

-----Original Message-----
From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf Of Zachary Amsden
Sent: August 20, 2010 16:08
To: kvm@vger.kernel.org
Cc: Zachary Amsden; Avi Kivity; Marcelo Tosatti; Glauber Costa; Thomas Gleixner; John Stultz; linux-kernel@vger.kernel.org
Subject: [KVM timekeeping 26/35] Catchup slower TSC to guest rate

Use the catchup code to continue adjusting the TSC when
running at lower than the guest rate

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/kvm/x86.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a4215d7..086d56a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1013,8 +1013,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 			kvm_x86_ops->adjust_tsc_offset(v, tsc-tsc_timestamp);
 	}
 	local_irq_restore(flags);
-	if (catchup)
+	if (catchup) {
+		if (this_tsc_khz < v->kvm->arch.virtual_tsc_khz)
+			vcpu->tsc_rebase = 1;
 		return 0;
+	}
 
 	/*
 	 * Time as measured by the TSC may go backwards when resetting the base
@@ -5022,6 +5025,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	kvm_guest_exit();
 
+	/* Running on slower TSC without kvmclock, we must bump TSC */
+	if (vcpu->arch.tsc_rebase)
+		kvm_request_clock_update(vcpu);
+
 	preempt_enable();
 
 	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 26/35] Catchup slower TSC to guest rate
  2010-09-07  3:44   ` Dong, Eddie
@ 2010-09-07 22:14     ` Zachary Amsden
  0 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-09-07 22:14 UTC (permalink / raw)
  To: Dong, Eddie
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

On 09/06/2010 05:44 PM, Dong, Eddie wrote:
> Zachary:
> 	Will you extend the logic to cover the situation where the host TSC runs at a higher rate than the guest rate but the PCPU is overcommitted? In that case we can likely use the time the VCPU spends scheduled out to catch up as well. Of course, if the VCPU's scheduled-out time is not enough to compensate for the cycles gained from the fast host TSC (exceeding a threshold), we will eventually have to fall back to trap-and-emulate mode.
>   

It is possible to do this, but it is rather dangerous. We can't let the
guest clock accelerate without bounds. We could put a limit on the
maximum overrun the TSC is allowed to reach, and then switch into
trapping mode, but this presupposes we will actually get an interrupt
in time. A CPU-heavy guest with little host activity could easily
overrun much further than we would like unless we have a way to
reliably trigger interrupts near the time of maximum allowed overrun.

So, first, we must have a way to get such interrupts; this is needed
anyway: for the catchup case we have a similar problem with underrun,
which must be addressed. It's quite possible to add the mode you
describe once that feature is in, but it also adds even more complexity
to an already intricate clock system (which is one of the problems with
the latter part of this patch series).

Second, this mode of operation is incompatible with SMP guests under all
circumstances. SMP guests with mismatched clock speeds must always run
in trapping mode, as it is not possible to synchronize the catchup /
trap switching without extremely heavyweight measures (e.g. IPI
wakeups). Those mechanisms will not only cost more than the trapping
overhead (on future, faster systems and on larger, more parallel
systems), but they will also damage host performance (unneeded wakeups
when other VCPUs are not scheduled). Unless, of course, we
gang-schedule... but that is a difficult change and a very different
mode of operation. Getting rid of TSC trap overhead on systems with
non-constant TSC isn't a sufficient motivation for that kind of design
change.

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-08-20  8:07 ` [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization Zachary Amsden
  2010-08-20 17:30   ` Glauber Costa
@ 2010-09-14  9:10   ` Jan Kiszka
  2010-09-14  9:27     ` Avi Kivity
  2010-09-15  4:07     ` Zachary Amsden
  1 sibling, 2 replies; 106+ messages in thread
From: Jan Kiszka @ 2010-09-14  9:10 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

Am 20.08.2010 10:07, Zachary Amsden wrote:
> When CPUs with unstable TSCs enter deep C-state, TSC may stop
> running.  This causes us to require resynchronization.  Since
> we can't tell when this may potentially happen, we assume the
> worst by forcing re-compensation for it at every point the VCPU
> task is descheduled.
> 
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> ---
>  arch/x86/kvm/x86.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7fc4a55..52b6c21 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	}
>  
>  	kvm_x86_ops->vcpu_load(vcpu, cpu);
> -	if (unlikely(vcpu->cpu != cpu)) {
> +	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>  		/* Make sure TSC doesn't go backwards */
>  		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>  				native_read_tsc() - vcpu->arch.last_host_tsc;

For yet unknown reason, this commit breaks Linux guests here if they are
started with only a single VCPU. They hang during boot, obviously no
longer receiving interrupts.

I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
effect of the wrapping, though I cannot imagine how.

Anyone any ideas?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-14  9:10   ` Jan Kiszka
@ 2010-09-14  9:27     ` Avi Kivity
  2010-09-14 10:40       ` Jan Kiszka
  2010-09-15  4:07     ` Zachary Amsden
  1 sibling, 1 reply; 106+ messages in thread
From: Avi Kivity @ 2010-09-14  9:27 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Zachary Amsden, kvm, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

  On 09/14/2010 11:10 AM, Jan Kiszka wrote:
> Am 20.08.2010 10:07, Zachary Amsden wrote:
>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>> running.  This causes us to require resynchronization.  Since
>> we can't tell when this may potentially happen, we assume the
>> worst by forcing re-compensation for it at every point the VCPU
>> task is descheduled.
>>
>> Signed-off-by: Zachary Amsden<zamsden@redhat.com>
>> ---
>>   arch/x86/kvm/x86.c |    2 +-
>>   1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 7fc4a55..52b6c21 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>   	}
>>
>>   	kvm_x86_ops->vcpu_load(vcpu, cpu);
>> -	if (unlikely(vcpu->cpu != cpu)) {
>> +	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>   		/* Make sure TSC doesn't go backwards */
>>   		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>   				native_read_tsc() - vcpu->arch.last_host_tsc;
> For yet unknown reason, this commit breaks Linux guests here if they are
> started with only a single VCPU. They hang during boot, obviously no
> longer receiving interrupts.
>
> I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
> effect of the wrapping, though I cannot imagine how.
>
> Anyone any ideas?
>
>

Most likely, time went backwards, and some 'future - past' calculation 
resulted in a negative sleep value which was then interpreted as 
unsigned and resulted in a 2342525634 year sleep.
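
A toy userland illustration of that failure mode (not actual guest code,
just the arithmetic):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t future = 1000, past = 2000;	/* the clock went backwards */
	uint64_t delta = future - past;		/* unsigned: wraps around */

	/* prints 18446744073709550616 -- a sleep of roughly 584 years,
	 * if those are nanoseconds */
	printf("sleeping for %llu ns\n", (unsigned long long)delta);
	return 0;
}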

Does your guest use kvmclock, tsc, or some other time source?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-14  9:27     ` Avi Kivity
@ 2010-09-14 10:40       ` Jan Kiszka
  2010-09-14 10:47         ` Avi Kivity
  2010-09-14 19:32         ` Zachary Amsden
  0 siblings, 2 replies; 106+ messages in thread
From: Jan Kiszka @ 2010-09-14 10:40 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Zachary Amsden, kvm, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Am 14.09.2010 11:27, Avi Kivity wrote:
>   On 09/14/2010 11:10 AM, Jan Kiszka wrote:
>> Am 20.08.2010 10:07, Zachary Amsden wrote:
>>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>>> running.  This causes us to require resynchronization.  Since
>>> we can't tell when this may potentially happen, we assume the
>>> worst by forcing re-compensation for it at every point the VCPU
>>> task is descheduled.
>>>
>>> Signed-off-by: Zachary Amsden<zamsden@redhat.com>
>>> ---
>>>   arch/x86/kvm/x86.c |    2 +-
>>>   1 files changed, 1 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 7fc4a55..52b6c21 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>   	}
>>>
>>>   	kvm_x86_ops->vcpu_load(vcpu, cpu);
>>> -	if (unlikely(vcpu->cpu != cpu)) {
>>> +	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>>   		/* Make sure TSC doesn't go backwards */
>>>   		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>>   				native_read_tsc() - vcpu->arch.last_host_tsc;
>> For yet unknown reason, this commit breaks Linux guests here if they are
>> started with only a single VCPU. They hang during boot, obviously no
>> longer receiving interrupts.
>>
>> I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
>> effect of the wrapping, though I cannot imagine how.
>>
>> Anyone any ideas?
>>
>>
> 
> Most likely, time went backwards, and some 'future - past' calculation 
> resulted in a negative sleep value which was then interpreted as 
> unsigned and resulted in a 2342525634 year sleep.

Looks like that's the case on first glance at the apic state.

> 
> Does your guest use kvmclock, tsc, or some other time source?

A kernel that has kvmclock support even hangs in SMP mode. The others
pick hpet or acpi_pm. TSC is considered unstable.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-14 10:40       ` Jan Kiszka
@ 2010-09-14 10:47         ` Avi Kivity
  2010-09-14 19:32         ` Zachary Amsden
  1 sibling, 0 replies; 106+ messages in thread
From: Avi Kivity @ 2010-09-14 10:47 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Zachary Amsden, kvm, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

  On 09/14/2010 12:40 PM, Jan Kiszka wrote:
> >>  For yet unknown reason, this commit breaks Linux guests here if they are
> >>  started with only a single VCPU. They hang during boot, obviously no
> >>  longer receiving interrupts.
> >>
> >>  I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
> >>  effect of the wrapping, though I cannot imagine how.
> >>
> >>  Anyone any ideas?
> >>
> >>
> >
> >  Most likely, time went backwards, and some 'future - past' calculation
> >  resulted in a negative sleep value which was then interpreted as
> >  unsigned and resulted in a 2342525634 year sleep.
>
> Looks like that's the case on first glance at the apic state.

Aside from this bug, it would be better for the guest to use signed 
arithmetic (or check beforehand) and not sleep if the future is already 
past.
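
E.g. (a sketch; sleep_ns() standing in for whatever the timer code
actually does):

	s64 delta = future - past;	/* signed on purpose */
	if (delta <= 0)
		return;			/* deadline already passed, don't sleep */
	sleep_ns(delta);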

> >
> >  Does your guest use kvmclock, tsc, or some other time source?
>
> A kernel that has kvmclock support even hangs in SMP mode.

Does it have the recent kvmclock patches backported? (should be in 
latest -stable).

> The others
> pick hpet or acpi_pm. TSC is considered unstable.
>
>

Weird.  But perhaps it's the scheduler clock that's misbehaving; IIRC it 
has its own local clock.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-14 10:40       ` Jan Kiszka
  2010-09-14 10:47         ` Avi Kivity
@ 2010-09-14 19:32         ` Zachary Amsden
  2010-09-14 22:26           ` Jan Kiszka
  1 sibling, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-09-14 19:32 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Avi Kivity, kvm, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

On 09/14/2010 12:40 AM, Jan Kiszka wrote:
> Am 14.09.2010 11:27, Avi Kivity wrote:
>    
>>    On 09/14/2010 11:10 AM, Jan Kiszka wrote:
>>      
>>> Am 20.08.2010 10:07, Zachary Amsden wrote:
>>>        
>>>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>>>> running.  This causes us to require resynchronization.  Since
>>>> we can't tell when this may potentially happen, we assume the
>>>> worst by forcing re-compensation for it at every point the VCPU
>>>> task is descheduled.
>>>>
>>>> Signed-off-by: Zachary Amsden<zamsden@redhat.com>
>>>> ---
>>>>    arch/x86/kvm/x86.c |    2 +-
>>>>    1 files changed, 1 insertions(+), 1 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index 7fc4a55..52b6c21 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>>    	}
>>>>
>>>>    	kvm_x86_ops->vcpu_load(vcpu, cpu);
>>>> -	if (unlikely(vcpu->cpu != cpu)) {
>>>> +	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>>>    		/* Make sure TSC doesn't go backwards */
>>>>    		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>>>    				native_read_tsc() - vcpu->arch.last_host_tsc;
>>>>          
>>> For yet unknown reason, this commit breaks Linux guests here if they are
>>> started with only a single VCPU. They hang during boot, obviously no
>>> longer receiving interrupts.
>>>
>>> I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
>>> effect of the wrapping, though I cannot imagine how.
>>>
>>> Anyone any ideas?
>>>
>>>
>>>        
>> Most likely, time went backwards, and some 'future - past' calculation
>> resulted in a negative sleep value which was then interpreted as
>> unsigned and resulted in a 2342525634 year sleep.
>>      
> Looks like that's the case on first glance at the apic state.
>    

This compensation effectively nulls the delta between current and last TSC:

         if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
                 /* Make sure TSC doesn't go backwards */
                 s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
                                 native_read_tsc() - vcpu->arch.last_host_tsc;
                 if (tsc_delta < 0)
                         mark_tsc_unstable("KVM discovered backwards TSC");
                 if (check_tsc_unstable())
                         kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
                 kvm_migrate_timers(vcpu);
                 vcpu->cpu = cpu;

If TSC has advanced quite a bit due to a TSC jump during sleep(*), it 
will adjust the offset backwards to compensate; similarly, if it has 
gone backwards, it will advance the offset.

In neither case should the visible TSC go backwards, assuming 
last_host_tsc is recorded properly, and so kvmclock should be similarly 
unaffected.
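
To put made-up numbers on it: with guest_tsc = host_tsc + offset, a host
TSC that jumped forward by 1,000,000 cycles during sleep yields
tsc_delta = +1000000, and adjust_tsc_offset(vcpu, -1000000) leaves
host_tsc + offset exactly where the guest last saw it; a backwards jump
gives a negative delta and advances the offset by the same amount.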

Perhaps the guest is more intelligent than we hope, and is comparing two 
different clocks: kvmclock or TSC with the rate of PIT interrupts.  This 
could result in negative arithmetic being interpreted as unsigned.  Are 
you using PIT interrupt reinjection on this guest or passing 
-no-kvm-pit-reinjection?

>    
>> Does your guest use kvmclock, tsc, or some other time source?
>>      
> A kernel that has kvmclock support even hangs in SMP mode. The others
> pick hpet or acpi_pm. TSC is considered unstable.
>    

SMP mode here has always been and will always be unreliable.  Are you running 
on an Intel or AMD CPU?  The origin of this code comes from a workaround 
for (*) in vendor-specific code, and perhaps it is inappropriate for both.

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-14 19:32         ` Zachary Amsden
@ 2010-09-14 22:26           ` Jan Kiszka
  2010-09-14 23:40             ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Kiszka @ 2010-09-14 22:26 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Avi Kivity, kvm, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

Am 14.09.2010 21:32, Zachary Amsden wrote:
> On 09/14/2010 12:40 AM, Jan Kiszka wrote:
>> Am 14.09.2010 11:27, Avi Kivity wrote:
>>   
>>>    On 09/14/2010 11:10 AM, Jan Kiszka wrote:
>>>     
>>>> Am 20.08.2010 10:07, Zachary Amsden wrote:
>>>>       
>>>>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>>>>> running.  This causes us to require resynchronization.  Since
>>>>> we can't tell when this may potentially happen, we assume the
>>>>> worst by forcing re-compensation for it at every point the VCPU
>>>>> task is descheduled.
>>>>>
>>>>> Signed-off-by: Zachary Amsden<zamsden@redhat.com>
>>>>> ---
>>>>>    arch/x86/kvm/x86.c |    2 +-
>>>>>    1 files changed, 1 insertions(+), 1 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>> index 7fc4a55..52b6c21 100644
>>>>> --- a/arch/x86/kvm/x86.c
>>>>> +++ b/arch/x86/kvm/x86.c
>>>>> @@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
>>>>> *vcpu, int cpu)
>>>>>        }
>>>>>
>>>>>        kvm_x86_ops->vcpu_load(vcpu, cpu);
>>>>> -    if (unlikely(vcpu->cpu != cpu)) {
>>>>> +    if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>>>>            /* Make sure TSC doesn't go backwards */
>>>>>            s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>>>>                    native_read_tsc() - vcpu->arch.last_host_tsc;
>>>>>          
>>>> For yet unknown reason, this commit breaks Linux guests here if they
>>>> are
>>>> started with only a single VCPU. They hang during boot, obviously no
>>>> longer receiving interrupts.
>>>>
>>>> I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
>>>> effect of the wrapping, though I cannot imagine how.
>>>>
>>>> Anyone any ideas?
>>>>
>>>>
>>>>        
>>> Most likely, time went backwards, and some 'future - past' calculation
>>> resulted in a negative sleep value which was then interpreted as
>>> unsigned and resulted in a 2342525634 year sleep.
>>>      
>> Looks like that's the case on first glance at the apic state.
>>    
> 
> This compensation effectively nulls the delta between current and last TSC:
> 
>         if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>                 /* Make sure TSC doesn't go backwards */
>                 s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>                                 native_read_tsc() -
> vcpu->arch.last_host_tsc;
>                 if (tsc_delta < 0)
>                         mark_tsc_unstable("KVM discovered backwards TSC");
>                 if (check_tsc_unstable())
>                         kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
>                 kvm_migrate_timers(vcpu);
>                 vcpu->cpu = cpu;
> 
> If TSC has advanced quite a bit due to a TSC jump during sleep(*), it
> will adjust the offset backwards to compensate; similarly, if it has
> gone backwards, it will advance the offset.
> 
> In neither case should the visible TSC go backwards, assuming
> last_host_tsc is recorded properly, and so kvmclock should be similarly
> unaffected.
> 
> Perhaps the guest is more intelligent than we hope, and is comparing two
> different clocks: kvmclock or TSC with the rate of PIT interrupts.  This
> could result in negative arithmetic being interpreted as unsigned.  Are
> you using PIT interrupt reinjection on this guest or passing
> -no-kvm-pit-reinjection?
> 
>>   
>>> Does your guest use kvmclock, tsc, or some other time source?
>>>      
>> A kernel that has kvmclock support even hangs in SMP mode. The others
>> pick hpet or acpi_pm. TSC is considered unstable.
>>    
> 
> SMP mode here has always been and will always be unreliable.  Are you running
> on an Intel or AMD CPU?  The origin of this code comes from a workaround
> for (*) in vendor-specific code, and perhaps it is inappropriate for both.

I'm on a fairly new Intel i7 (M 620). And I accidentally rebooted my box
a few hours ago. Well, the issue is gone now...

So I looked into the system logs and found this:

[18446744053.434939] PM: resume of devices complete after 4379.595 msecs
[18446744053.457133] PM: Finishing wakeup.
[18446744053.457135] Restarting tasks ...
[    0.000999] Marking TSC unstable due to KVM discovered backwards TSC
[270103.974668] done.

From that point on the box was on hpet, including the time I did the
failing tests this morning. The kvm-kmod version loaded at this point
was based on kvm.git df549cfc.

But my /proc/cpuinfo claims "constant_tsc", and Linux is generally happy
with using it as clock source. Does this tell you anything?

Jan


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-14 22:26           ` Jan Kiszka
@ 2010-09-14 23:40             ` Zachary Amsden
  2010-09-15  5:34               ` Jan Kiszka
  2010-09-15 12:29               ` Glauber Costa
  0 siblings, 2 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-09-14 23:40 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Avi Kivity, kvm, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

On 09/14/2010 12:26 PM, Jan Kiszka wrote:
> Am 14.09.2010 21:32, Zachary Amsden wrote:
>    
>> On 09/14/2010 12:40 AM, Jan Kiszka wrote:
>>      
>>> Am 14.09.2010 11:27, Avi Kivity wrote:
>>>
>>>        
>>>>     On 09/14/2010 11:10 AM, Jan Kiszka wrote:
>>>>
>>>>          
>>>>> Am 20.08.2010 10:07, Zachary Amsden wrote:
>>>>>
>>>>>            
>>>>>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>>>>>> running.  This causes us to require resynchronization.  Since
>>>>>> we can't tell when this may potentially happen, we assume the
>>>>>> worst by forcing re-compensation for it at every point the VCPU
>>>>>> task is descheduled.
>>>>>>
>>>>>> Signed-off-by: Zachary Amsden<zamsden@redhat.com>
>>>>>> ---
>>>>>>     arch/x86/kvm/x86.c |    2 +-
>>>>>>     1 files changed, 1 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>> index 7fc4a55..52b6c21 100644
>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>> @@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
>>>>>> *vcpu, int cpu)
>>>>>>         }
>>>>>>
>>>>>>         kvm_x86_ops->vcpu_load(vcpu, cpu);
>>>>>> -    if (unlikely(vcpu->cpu != cpu)) {
>>>>>> +    if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>>>>>             /* Make sure TSC doesn't go backwards */
>>>>>>             s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>>>>>                     native_read_tsc() - vcpu->arch.last_host_tsc;
>>>>>>
>>>>>>              
>>>>> For yet unknown reason, this commit breaks Linux guests here if they
>>>>> are
>>>>> started with only a single VCPU. They hang during boot, obviously no
>>>>> longer receiving interrupts.
>>>>>
>>>>> I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
>>>>> effect of the wrapping, though I cannot imagine how.
>>>>>
>>>>> Anyone any ideas?
>>>>>
>>>>>
>>>>>
>>>>>            
>>>> Most likely, time went backwards, and some 'future - past' calculation
>>>> resulted in a negative sleep value which was then interpreted as
>>>> unsigned and resulted in a 2342525634 year sleep.
>>>>
>>>>          
>>> Looks like that's the case on first glance at the apic state.
>>>
>>>        
>> This compensation effectively nulls the delta between current and last TSC:
>>
>>          if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>                  /* Make sure TSC doesn't go backwards */
>>                  s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>                                  native_read_tsc() -
>> vcpu->arch.last_host_tsc;
>>                  if (tsc_delta<  0)
>>                          mark_tsc_unstable("KVM discovered backwards TSC");
>>                  if (check_tsc_unstable())
>>                          kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
>>                  kvm_migrate_timers(vcpu);
>>                  vcpu->cpu = cpu;
>>
>> If TSC has advanced quite a bit due to a TSC jump during sleep(*), it
>> will adjust the offset backwards to compensate; similarly, if it has
>> gone backwards, it will advance the offset.
>>
>> In neither case should the visible TSC go backwards, assuming
>> last_host_tsc is recorded properly, and so kvmclock should be similarly
>> unaffected.
>>
>> Perhaps the guest is more intelligent than we hope, and is comparing two
>> different clocks: kvmclock or TSC with the rate of PIT interrupts.  This
>> could result in negative arithmetic being interpreted as unsigned.  Are
>> you using PIT interrupt reinjection on this guest or passing
>> -no-kvm-pit-reinjection?
>>
>>      
>>>
>>>        
>>>> Does your guest use kvmclock, tsc, or some other time source?
>>>>
>>>>          
>>> A kernel that has kvmclock support even hangs in SMP mode. The others
>>> pick hpet or acpi_pm. TSC is considered unstable.
>>>
>>>        
>> SMP mode here has always been and will always be unreliable.  Are you running
>> on an Intel or AMD CPU?  The origin of this code comes from a workaround
>> for (*) in vendor-specific code, and perhaps it is inappropriate for both.
>>      
> I'm on a fairly new Intel i7 (M 620). And I accidentally rebooted my box
> a few hours ago. Well, the issue is gone now...
>
> So I looked into the system logs and found this:
>
> [18446744053.434939] PM: resume of devices complete after 4379.595 msecs
> [18446744053.457133] PM: Finishing wakeup.
> [18446744053.457135] Restarting tasks ...
> [    0.000999] Marking TSC unstable due to KVM discovered backwards TSC
> [270103.974668] done.
>
>  From that point on the box was on hpet, including the time I did the
> failing tests this morning. The kvm-kmod version loaded at this point
> was based on kvm.git df549cfc.
>
> But my /proc/cpuinfo claims "constant_tsc", and Linux is generally happy
> with using it as clock source. Does this tell you anything?
>    

Yes, quite a bit.

It's possible that marking the TSC unstable with an actively running VM 
causes a boundary condition that I had not accounted for.  It's also 
possible that the clocksource switch triggered some bad behavior.

This suggests two debugging techniques: I can manually switch the 
clocksource, and I can also load a module which does nothing other than 
mark the TSC unstable.  Failing that, we can investigate PM suspend / 
resume for possible issues.
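
Something like this minimal module should do for the latter (a sketch;
mark_tsc_unstable() is exported via EXPORT_SYMBOL_GPL, hence the GPL
license tag):

#include <linux/module.h>
#include <asm/tsc.h>

static int __init force_tsc_unstable_init(void)
{
	/* flips the same global flag the KVM vcpu_load path does */
	mark_tsc_unstable("forced unstable for debugging");
	return 0;
}
module_init(force_tsc_unstable_init);
MODULE_LICENSE("GPL");

Switching the clocksource by hand works via
/sys/devices/system/clocksource/clocksource0/current_clocksource
(e.g. echo hpet into it).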

I'll try this on my Intel boxes to see what happens.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-14  9:10   ` Jan Kiszka
  2010-09-14  9:27     ` Avi Kivity
@ 2010-09-15  4:07     ` Zachary Amsden
  2010-09-15  8:09       ` Jan Kiszka
  1 sibling, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-09-15  4:07 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

On 09/13/2010 11:10 PM, Jan Kiszka wrote:
> Am 20.08.2010 10:07, Zachary Amsden wrote:
>    
>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>> running.  This causes us to require resynchronization.  Since
>> we can't tell when this may potentially happen, we assume the
>> worst by forcing re-compensation for it at every point the VCPU
>> task is descheduled.
>>
>> Signed-off-by: Zachary Amsden<zamsden@redhat.com>
>> ---
>>   arch/x86/kvm/x86.c |    2 +-
>>   1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 7fc4a55..52b6c21 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>   	}
>>
>>   	kvm_x86_ops->vcpu_load(vcpu, cpu);
>> -	if (unlikely(vcpu->cpu != cpu)) {
>> +	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>   		/* Make sure TSC doesn't go backwards */
>>   		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>   				native_read_tsc() - vcpu->arch.last_host_tsc;
>>      
> For yet unknown reason, this commit breaks Linux guests here if they are
> started with only a single VCPU. They hang during boot, obviously no
> longer receiving interrupts.
>
> I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
> effect of the wrapping, though I cannot imagine how.
>
> Anyone any ideas?
>    

Question: how did you come to the knowledge that this is the commit 
which breaks things?  I'm assuming you bisected, in which case a 
transition from stable -> unstable would have only happened once.  This 
also means the PM suspend event which you observed only happened once, 
so obviously if you bisected successfully, there is a bug which doesn't 
involved the PM transition or the stable -> unstable transition.

Your host TSC must have desynchronized during the PM transition, and 
this change compensates the TSC on an unstable host to effectively show 
run time, not real time.  Perhaps the lack of catchup code (to catch 
back up to real time) is triggering the bug.

In any case, I'll proceed with the forcing of unstable TSC and HPET 
clocksource and see what happens.

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-14 23:40             ` Zachary Amsden
@ 2010-09-15  5:34               ` Jan Kiszka
  2010-09-15  7:55                 ` Avi Kivity
  2010-09-15 12:29               ` Glauber Costa
  1 sibling, 1 reply; 106+ messages in thread
From: Jan Kiszka @ 2010-09-15  5:34 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Avi Kivity, kvm, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

Am 15.09.2010 01:40, Zachary Amsden wrote:
> On 09/14/2010 12:26 PM, Jan Kiszka wrote:
>> Am 14.09.2010 21:32, Zachary Amsden wrote:
>>   
>>> On 09/14/2010 12:40 AM, Jan Kiszka wrote:
>>>     
>>>> Am 14.09.2010 11:27, Avi Kivity wrote:
>>>>
>>>>       
>>>>>     On 09/14/2010 11:10 AM, Jan Kiszka wrote:
>>>>>
>>>>>         
>>>>>> Am 20.08.2010 10:07, Zachary Amsden wrote:
>>>>>>
>>>>>>           
>>>>>>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>>>>>>> running.  This causes us to require resynchronization.  Since
>>>>>>> we can't tell when this may potentially happen, we assume the
>>>>>>> worst by forcing re-compensation for it at every point the VCPU
>>>>>>> task is descheduled.
>>>>>>>
>>>>>>> Signed-off-by: Zachary Amsden<zamsden@redhat.com>
>>>>>>> ---
>>>>>>>     arch/x86/kvm/x86.c |    2 +-
>>>>>>>     1 files changed, 1 insertions(+), 1 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>>> index 7fc4a55..52b6c21 100644
>>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>>> @@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
>>>>>>> *vcpu, int cpu)
>>>>>>>         }
>>>>>>>
>>>>>>>         kvm_x86_ops->vcpu_load(vcpu, cpu);
>>>>>>> -    if (unlikely(vcpu->cpu != cpu)) {
>>>>>>> +    if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>>>>>>             /* Make sure TSC doesn't go backwards */
>>>>>>>             s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>>>>>>                     native_read_tsc() - vcpu->arch.last_host_tsc;
>>>>>>>
>>>>>>>              
>>>>>> For yet unknown reason, this commit breaks Linux guests here if they
>>>>>> are
>>>>>> started with only a single VCPU. They hang during boot, obviously no
>>>>>> longer receiving interrupts.
>>>>>>
>>>>>> I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a
>>>>>> side
>>>>>> effect of the wrapping, though I cannot imagine how.
>>>>>>
>>>>>> Anyone any ideas?
>>>>>>
>>>>>>
>>>>>>
>>>>>>            
>>>>> Most likely, time went backwards, and some 'future - past' calculation
>>>>> resulted in a negative sleep value which was then interpreted as
>>>>> unsigned and resulted in a 2342525634 year sleep.
>>>>>
>>>>>          
>>>> Looks like that's the case on first glance at the apic state.
>>>>
>>>>        
>>> This compensation effectively nulls the delta between current and
>>> last TSC:
>>>
>>>          if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>>                  /* Make sure TSC doesn't go backwards */
>>>                  s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>>                                  native_read_tsc() -
>>> vcpu->arch.last_host_tsc;
>>>                  if (tsc_delta<  0)
>>>                          mark_tsc_unstable("KVM discovered backwards
>>> TSC");
>>>                  if (check_tsc_unstable())
>>>                          kvm_x86_ops->adjust_tsc_offset(vcpu,
>>> -tsc_delta);
>>>                  kvm_migrate_timers(vcpu);
>>>                  vcpu->cpu = cpu;
>>>
>>> If TSC has advanced quite a bit due to a TSC jump during sleep(*), it
>>> will adjust the offset backwards to compensate; similarly, if it has
>>> gone backwards, it will advance the offset.
>>>
>>> In neither case should the visible TSC go backwards, assuming
>>> last_host_tsc is recorded properly, and so kvmclock should be similarly
>>> unaffected.
>>>
>>> Perhaps the guest is more intelligent than we hope, and is comparing two
>>> different clocks: kvmclock or TSC with the rate of PIT interrupts.  This
>>> could result in negative arithmetic being interpreted as unsigned.  Are
>>> you using PIT interrupt reinjection on this guest or passing
>>> -no-kvm-pit-reinjection?
>>>
>>>     
>>>>
>>>>       
>>>>> Does your guest use kvmclock, tsc, or some other time source?
>>>>>
>>>>>          
>>>> A kernel that has kvmclock support even hangs in SMP mode. The others
>>>> pick hpet or acpi_pm. TSC is considered unstable.
>>>>
>>>>        
>>> SMP mode here has always been and will always be unreliable.  Are you running
>>> on an Intel or AMD CPU?  The origin of this code comes from a workaround
>>> for (*) in vendor-specific code, and perhaps it is inappropriate for
>>> both.
>>>      
>> I'm on a fairly new Intel i7 (M 620). And I accidentally rebooted my box
>> a few hours ago. Well, the issue is gone now...
>>
>> So I looked into the system logs and found this:
>>
>> [18446744053.434939] PM: resume of devices complete after 4379.595 msecs
>> [18446744053.457133] PM: Finishing wakeup.
>> [18446744053.457135] Restarting tasks ...
>> [    0.000999] Marking TSC unstable due to KVM discovered backwards TSC
>> [270103.974668] done.
>>
>>  From that point on the box was on hpet, including the time I did the
>> failing tests this morning. The kvm-kmod version loaded at this point
>> was based on kvm.git df549cfc.
>>
>> But my /proc/cpuinfo claims "constant_tsc", and Linux is generally happy
>> with using it as clock source. Does this tell you anything?
>>    
> 
> Yes, quite a bit.
> 
> It's possible that marking the TSC unstable with an actively running VM
> causes a boundary condition that I had not accounted for.  It's also
> possible that the clocksource switch triggered some bad behavior.

Suspend/resume (to RAM) is indeed what triggers KVM marking the TSC
unstable here. This should be the first issue, as the kernel itself has
no problem recovering from suspend/resume wrt the TSC.

The next one is what happened to the guest running at that point. It was
a SUSE 11.3 32-bit image, using kvm-clock. After resume and host-side
clock switch it lost its timer ticks, likely due to some breakage of
kvm-clock.

And finally, I'm now in the original failure state again, in which every
newly started Linux guest with kvm-clock support also suffers from stuck
timers. Linux kernels that lack kvm-clock run fine, e.g. on the hpet
clocksource. Maybe this is just another symptom of whatever also causes
the second problem.

> 
> This suggests two debugging techniques: I can manually switch the
> clocksource, and I can also load a module which does nothing other than
> mark the TSC unstable.  Failing that, we can investigate PM suspend /
> resume for possible issues.
> 
> I'll try this on my Intel boxes to see what happens.

Do you think kvm-kmod could contribute to this? As I said, I'm on a 34
kernel, namely SUSE's 2.6.34.4-0.1-desktop. Is there any feature missing
in that kernel that latest KVM depends on for proper tsc/kvm-clock
handling? If you have any concerns, I could try to run kvm.git natively
later on.

Jan


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-15  5:34               ` Jan Kiszka
@ 2010-09-15  7:55                 ` Avi Kivity
  2010-09-15  8:04                   ` Jan Kiszka
  0 siblings, 1 reply; 106+ messages in thread
From: Avi Kivity @ 2010-09-15  7:55 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Zachary Amsden, kvm, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

  On 09/15/2010 07:34 AM, Jan Kiszka wrote:
> Do you think kvm-kmod could contribute to this? As I said, I'm on a 34
> kernel, namely SUSE's 2.6.34.4-0.1-desktop. Any feature missing in that
> kernel latest KVM depends for proper tsc/kvm-clock handling? If you have
> any concerns, I could try to run kvm.git natively later on.

In case you aren't aware of it, 'make localmodconfig' is an easy way to 
transition from a distro kernel to a locally built kernel.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-15  7:55                 ` Avi Kivity
@ 2010-09-15  8:04                   ` Jan Kiszka
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Kiszka @ 2010-09-15  8:04 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Zachary Amsden, kvm, Marcelo Tosatti, Glauber Costa,
	Thomas Gleixner, John Stultz, linux-kernel

Am 15.09.2010 09:55, Avi Kivity wrote:
>  On 09/15/2010 07:34 AM, Jan Kiszka wrote:
>> Do you think kvm-kmod could contribute to this? As I said, I'm on a 34
>> kernel, namely SUSE's 2.6.34.4-0.1-desktop. Any feature missing in that
>> kernel latest KVM depends for proper tsc/kvm-clock handling? If you have
>> any concerns, I could try to run kvm.git natively later on.
> 
> In case you aren't aware of it, 'make localmodconfig' is an easy way to
> transition from a distro kernel to a locally built kernel.
> 

I am aware of it. But this doesn't remove the need to shut down my
session and reboot the notebook - something I usually avoid for weeks or
even longer.

Jan


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-15  4:07     ` Zachary Amsden
@ 2010-09-15  8:09       ` Jan Kiszka
  2010-09-15 12:32         ` Glauber Costa
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Kiszka @ 2010-09-15  8:09 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz, linux-kernel

Am 15.09.2010 06:07, Zachary Amsden wrote:
> On 09/13/2010 11:10 PM, Jan Kiszka wrote:
>> Am 20.08.2010 10:07, Zachary Amsden wrote:
>>   
>>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>>> running.  This causes us to require resynchronization.  Since
>>> we can't tell when this may potentially happen, we assume the
>>> worst by forcing re-compensation for it at every point the VCPU
>>> task is descheduled.
>>>
>>> Signed-off-by: Zachary Amsden<zamsden@redhat.com>
>>> ---
>>>   arch/x86/kvm/x86.c |    2 +-
>>>   1 files changed, 1 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 7fc4a55..52b6c21 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu,
>>> int cpu)
>>>       }
>>>
>>>       kvm_x86_ops->vcpu_load(vcpu, cpu);
>>> -    if (unlikely(vcpu->cpu != cpu)) {
>>> +    if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>>           /* Make sure TSC doesn't go backwards */
>>>           s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>>                   native_read_tsc() - vcpu->arch.last_host_tsc;
>>>      
>> For yet unknown reason, this commit breaks Linux guests here if they are
>> started with only a single VCPU. They hang during boot, obviously no
>> longer receiving interrupts.
>>
>> I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
>> effect of the wrapping, though I cannot imagine how.
>>
>> Anyone any ideas?
>>    
> 
> Question: how did you come to the knowledge that this is the commit
> which breaks things?  I'm assuming you bisected, in which case a
> transition from stable -> unstable would have only happened once.

Right.

>  This
> also means the PM suspend event which you observed only happened once,
> so obviously if you bisected successfully, there is a bug which doesn't
> involve the PM transition or the stable -> unstable transition.

Right, see my other posting.

> 
> Your host TSC must have desynchronized during the PM transition, and
> this change compensates the TSC on an unstable host to effectively show
> run time, not real time.  Perhaps the lack of catchup code (to catch
> back up to real time) is triggering the bug.

I'm still unsure if KVM is right in declaring the TSC unstable. It looks
like Linux is less picky here - are the requirements different?

> 
> In any case, I'll proceed with the forcing of unstable TSC and HPET
> clocksource and see what happens.

I tried that before, but it did not trigger the issue that kvm-clock
guests no longer boot properly. This only happens if the TSC is marked
unstable.

Jan


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-14 23:40             ` Zachary Amsden
  2010-09-15  5:34               ` Jan Kiszka
@ 2010-09-15 12:29               ` Glauber Costa
  1 sibling, 0 replies; 106+ messages in thread
From: Glauber Costa @ 2010-09-15 12:29 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jan Kiszka, Avi Kivity, kvm, Marcelo Tosatti, Thomas Gleixner,
	John Stultz, linux-kernel

On Tue, Sep 14, 2010 at 01:40:34PM -1000, Zachary Amsden wrote:
> On 09/14/2010 12:26 PM, Jan Kiszka wrote:
> >Am 14.09.2010 21:32, Zachary Amsden wrote:
> >>On 09/14/2010 12:40 AM, Jan Kiszka wrote:
> >>>Am 14.09.2010 11:27, Avi Kivity wrote:
> >>>
> >>>>    On 09/14/2010 11:10 AM, Jan Kiszka wrote:
> >>>>
> >>>>>Am 20.08.2010 10:07, Zachary Amsden wrote:
> >>>>>
> >>>>>>When CPUs with unstable TSCs enter deep C-state, TSC may stop
> >>>>>>running.  This causes us to require resynchronization.  Since
> >>>>>>we can't tell when this may potentially happen, we assume the
> >>>>>>worst by forcing re-compensation for it at every point the VCPU
> >>>>>>task is descheduled.
> >>>>>>
> >>>>>>Signed-off-by: Zachary Amsden<zamsden@redhat.com>
> >>>>>>---
> >>>>>>    arch/x86/kvm/x86.c |    2 +-
> >>>>>>    1 files changed, 1 insertions(+), 1 deletions(-)
> >>>>>>
> >>>>>>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>>>>>index 7fc4a55..52b6c21 100644
> >>>>>>--- a/arch/x86/kvm/x86.c
> >>>>>>+++ b/arch/x86/kvm/x86.c
> >>>>>>@@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
> >>>>>>*vcpu, int cpu)
> >>>>>>        }
> >>>>>>
> >>>>>>        kvm_x86_ops->vcpu_load(vcpu, cpu);
> >>>>>>-    if (unlikely(vcpu->cpu != cpu)) {
> >>>>>>+    if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
> >>>>>>            /* Make sure TSC doesn't go backwards */
> >>>>>>            s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
> >>>>>>                    native_read_tsc() - vcpu->arch.last_host_tsc;
> >>>>>>
> >>>>>For yet unknown reason, this commit breaks Linux guests here if they
> >>>>>are
> >>>>>started with only a single VCPU. They hang during boot, obviously no
> >>>>>longer receiving interrupts.
> >>>>>
> >>>>>I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
> >>>>>effect of the wrapping, though I cannot imagine how.
> >>>>>
> >>>>>Anyone any ideas?
> >>>>>
> >>>>>
> >>>>>
> >>>>Most likely, time went backwards, and some 'future - past' calculation
> >>>>resulted in a negative sleep value which was then interpreted as
> >>>>unsigned and resulted in a 2342525634 year sleep.
> >>>>
> >>>Looks like that's the case on first glance at the apic state.
> >>>
> >>This compensation effectively nulls the delta between current and last TSC:
> >>
> >>         if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
> >>                 /* Make sure TSC doesn't go backwards */
> >>                 s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
> >>                                 native_read_tsc() -
> >>vcpu->arch.last_host_tsc;
> >>                 if (tsc_delta<  0)
> >>                         mark_tsc_unstable("KVM discovered backwards TSC");
> >>                 if (check_tsc_unstable())
> >>                         kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
> >>                 kvm_migrate_timers(vcpu);
> >>                 vcpu->cpu = cpu;
> >>
> >>If TSC has advanced quite a bit due to a TSC jump during sleep(*), it
> >>will adjust the offset backwards to compensate; similarly, if it has
> >>gone backwards, it will advance the offset.
> >>
> >>In neither case should the visible TSC go backwards, assuming
> >>last_host_tsc is recorded properly, and so kvmclock should be similarly
> >>unaffected.
> >>
> >>Perhaps the guest is more intelligent than we hope, and is comparing two
> >>different clocks: kvmclock or TSC with the rate of PIT interrupts.  This
> >>could result in negative arithmetic being interpreted as unsigned.  Are
> >>you using PIT interrupt reinjection on this guest or passing
> >>-no-kvm-pit-reinjection?
> >>
> >>>
> >>>>Does your guest use kvmclock, tsc, or some other time source?
> >>>>
> >>>A kernel that has kvmclock support even hangs in SMP mode. The others
> >>>pick hpet or acpi_pm. TSC is considered unstable.
> >>>
> >>SMP mode here has always been and will always be unreliable.  Are you running
> >>on an Intel or AMD CPU?  The origin of this code comes from a workaround
> >>for (*) in vendor-specific code, and perhaps it is inappropriate for both.
> >I'm on a fairly new Intel i7 (M 620). And I accidentally rebooted my box
> >a few hours ago. Well, the issue is gone now...
> >
> >So I looked into the system logs and found this:
> >
> >[18446744053.434939] PM: resume of devices complete after 4379.595 msecs
> >[18446744053.457133] PM: Finishing wakeup.
> >[18446744053.457135] Restarting tasks ...
> >[    0.000999] Marking TSC unstable due to KVM discovered backwards TSC
> >[270103.974668] done.
> >
> > From that point on the box was on hpet, including the time I did the
> >failing tests this morning. The kvm-kmod version loaded at this point
> >was based on kvm.git df549cfc.
> >
> >But my /proc/cpuinfo claims "constant_tsc", and Linux is generally happy
> >with using it as clock source. Does this tell you anything?
> 
> Yes, quite a bit.
> 
> It's possible that marking the TSC unstable with an actively running
> VM causes a boundary condition that I had not accounted for.  It's
> also possible that the clocksource switch triggered some bad
> behavior.
Changing the clocksource will change the resolution of the underlying
clock base. This can cause a big problem for anything that does
a mix of tsc + other clocksources. For the old version of kvmclock,
this should not really matter, since we have the stable bit hammer
on the guest side, which just gets flipped if we move off the tsc
clocksource to something else (actually, right now it is still always on).

But now that you mention it, changing _to_ the tsc clocksource can be problematic...
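
For reference, the guest-side read is roughly this (a sketch, not the
exact pvclock code; pvclock_scale_delta and PVCLOCK_TSC_STABLE_BIT are
the real names, the rest is simplified, with src pointing at the shared
time info and last_value being a global atomic64_t):

	u32 version;
	u64 delta, ret, last;

	do {
		version = src->version;
		rmb();		/* pairs with the host-side update */
		delta = native_read_tsc() - src->tsc_timestamp;
		ret = src->system_time + pvclock_scale_delta(delta,
				src->tsc_to_system_mul, src->tsc_shift);
		rmb();
	} while ((version & 1) || version != src->version);

	if (!(src->flags & PVCLOCK_TSC_STABLE_BIT)) {
		/* the hammer: clamp against a global last-seen value
		 * so the clock can never be observed going backwards,
		 * even across VCPUs */
		last = atomic64_read(&last_value);
		if (ret < last)
			ret = last;
		else
			atomic64_cmpxchg(&last_value, last, ret);
	}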

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-15  8:09       ` Jan Kiszka
@ 2010-09-15 12:32         ` Glauber Costa
  2010-09-15 18:27           ` Jan Kiszka
  0 siblings, 1 reply; 106+ messages in thread
From: Glauber Costa @ 2010-09-15 12:32 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Zachary Amsden, kvm, Avi Kivity, Marcelo Tosatti,
	Thomas Gleixner, John Stultz, linux-kernel

On Wed, Sep 15, 2010 at 10:09:33AM +0200, Jan Kiszka wrote:
> > In any case, I'll proceed with the forcing of unstable TSC and HPET
> > clocksource and see what happens.
> 
> I tried that before, but it did not trigger the issue that kvm-clock
> guests no longer boot properly. This only happens if the TSC is marked
> unstable.

even artificially marked unstable?


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-15 12:32         ` Glauber Costa
@ 2010-09-15 18:27           ` Jan Kiszka
  2010-09-17 22:09             ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Kiszka @ 2010-09-15 18:27 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Zachary Amsden, kvm, Avi Kivity, Marcelo Tosatti,
	Thomas Gleixner, John Stultz, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 835 bytes --]

Am 15.09.2010 14:32, Glauber Costa wrote:
> On Wed, Sep 15, 2010 at 10:09:33AM +0200, Jan Kiszka wrote:
>>> In any case, I'll proceed with the forcing of unstable TSC and HPET
>>> clocksource and see what happens.
>>
>> I tried that before, but it did not trigger the issue that kvm-clock
>> guests no longer boot properly. This only happens if the TSC is marked
>> unstable.
> 
> even artificially marked unstable?
> 

Yes. As soon as I hack tsc_unstable to 1, things go wrong. When I hack
it back to 0, guests that want kvm-clock boot again and seem to run fine.

This is issue #1, I guess. Issue #2 remains that the TSC is marked
unstable. I have the feeling that this is bogus, maybe due to lacking
suspend/resume awareness? The tsc clocksource does

	clocksource_tsc.cycle_last = 0;

on resume...
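
Roughly, the generic timekeeping accumulation does this (a sketch, not
the exact source):

	cycle_t now = cs->read(cs);		/* e.g. read_tsc() */
	cycle_t delta = (now - cs->cycle_last) & cs->mask;
	cs->cycle_last = now;
	ns += clocksource_cyc2ns(delta, cs->mult, cs->shift);

So a pre-suspend cycle_last combined with a post-suspend 'now' would
feed a garbage delta into the clock, which is presumably why the tsc
clocksource throws its cycle_last away on resume. KVM's per-vcpu
last_host_tsc gets no comparable reset.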

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-15 18:27           ` Jan Kiszka
@ 2010-09-17 22:09             ` Zachary Amsden
  2010-09-17 22:31               ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-09-17 22:09 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Glauber Costa, kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner,
	John Stultz, linux-kernel

On 09/15/2010 08:27 AM, Jan Kiszka wrote:
> Am 15.09.2010 14:32, Glauber Costa wrote:
>    
>> On Wed, Sep 15, 2010 at 10:09:33AM +0200, Jan Kiszka wrote:
>>      
>>>> In any case, I'll proceed with the forcing of unstable TSC and HPET
>>>> clocksource and see what happens.
>>>>          
>>> I tried that before, but it did not trigger the issue that kvm-clock
>>> guests no longer boot properly. This only happens if the TSC is marked
>>> unstable.
>>>        
>> even artificially marked unstable?
>>
>>      
> Yes. As soon as I hack tsc_unstable to 1, things go wrong. When I hack
> it back to 0, guests that want kvm-clock boot again and seem to run fine.
>
> This is issue #1, I guess. Issue #2 remains that the TSC is marked
> unstable. I have the feeling that this is bogus, maybe due to lacking
> suspend/resume awareness? The tsc clocksource does
>
> 	clocksource_tsc.cycle_last = 0;
>
> on resume...
>
> Jan
>
>    

I have now reproduced this exactly.  Shouldn't be long before I have a 
solution.

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-17 22:09             ` Zachary Amsden
@ 2010-09-17 22:31               ` Zachary Amsden
  2010-09-18 23:53                 ` Zachary Amsden
  0 siblings, 1 reply; 106+ messages in thread
From: Zachary Amsden @ 2010-09-17 22:31 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Glauber Costa, kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner,
	John Stultz, linux-kernel

On 09/17/2010 12:09 PM, Zachary Amsden wrote:
> On 09/15/2010 08:27 AM, Jan Kiszka wrote:
>> Am 15.09.2010 14:32, Glauber Costa wrote:
>>> On Wed, Sep 15, 2010 at 10:09:33AM +0200, Jan Kiszka wrote:
>>>>> In any case, I'll proceed with the forcing of unstable TSC and HPET
>>>>> clocksource and see what happens.
>>>> I tried that before, but it did not trigger the issue that kvm-clock
>>>> guests no longer boot properly. This only happens if the TSC is marked
>>>> unstable.
>>> even artificially marked unstable?
>>>
>> Yes. As soon as I hack tsc_unstable to 1, things go wrong. When I hack
>> it back to 0, guests that want kvm-clock boot again and seem to run
>> fine.
>>
>> This is issue #1, I guess. Issue #2 remains that the TSC is marked
>> unstable. I have the feeling that this is bogus, maybe due to lacking
>> suspend/resume awareness? The tsc clocksource does
>>
>>     clocksource_tsc.cycle_last = 0;
>>
>> on resume...
>>
>> Jan
>>
>
> I have now reproduced this exactly.  Shouldn't be long before I have a 
> solution.


Actually, here is what I am seeing: the guest proceeds - in SUPER SLOW 
MO... the effect of this patch negates time when the guest is not 
running.  If the guest is not running because it is idle, negating time 
is the wrong thing to do.  Left to sit, the boot process still proceeds, 
but it goes so slowly that you could grow a beard in that time.

Instead, we need wallclock awareness to be preserved.  Should be easy to 
work in one of my later patches which does this.
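
Something along these lines, maybe (a hypothetical sketch; last_exit_ns
and nsec_to_cycles() are illustrative names here, not claims about the
current code):

	/* today: erase all host-TSC progress accumulated while the
	 * task was descheduled */
	kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);

	/* wanted: credit the wall time that really elapsed, and only
	 * erase the drift between the TSC and the clock */
	s64 elapsed_ns = get_kernel_ns() - vcpu->arch.last_exit_ns;
	kvm_x86_ops->adjust_tsc_offset(vcpu,
			nsec_to_cycles(elapsed_ns) - tsc_delta);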

Zach

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization
  2010-09-17 22:31               ` Zachary Amsden
@ 2010-09-18 23:53                 ` Zachary Amsden
  0 siblings, 0 replies; 106+ messages in thread
From: Zachary Amsden @ 2010-09-18 23:53 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Glauber Costa, kvm, Avi Kivity, Marcelo Tosatti, Thomas Gleixner,
	John Stultz, linux-kernel

On 09/17/2010 12:31 PM, Zachary Amsden wrote:
> On 09/17/2010 12:09 PM, Zachary Amsden wrote:
>> On 09/15/2010 08:27 AM, Jan Kiszka wrote:
>>> Am 15.09.2010 14:32, Glauber Costa wrote:
>>>> On Wed, Sep 15, 2010 at 10:09:33AM +0200, Jan Kiszka wrote:
>>>>>> In any case, I'll proceed with the forcing of unstable TSC and HPET
>>>>>> clocksource and see what happens.
>>>>> I tried that before, but it did not trigger the issue that kvm-clock
>>>>> guests no longer boot properly. This only happens if the TSC is 
>>>>> marked
>>>>> unstable.
>>>> even artificially marked unstable?
>>>>
>>> Yes. As soon as I hack tsc_unstable to 1, things go wrong. When I hack
>>> it back to 0, guests that want kvm-clock boot again and seem to run
>>> fine.
>>>
>>> This is issue #1, I guess. Issue #2 remains that the TSC is marked
>>> unstable. I have the feeling that this is bogus, maybe due to lacking
>>> suspend/resume awareness? The tsc clocksource does
>>>
>>>     clocksource_tsc.cycle_last = 0;
>>>
>>> on resume...
>>>
>>> Jan
>>>
>>
>> I have now reproduced this exactly.  Shouldn't be long before I have 
>> a solution.
>

Wow, bug was subtle.  Here's what happens:

kvmclock enabled
unstable TSC compensation erases TSC gain; requests kvmclock update
we get preempted after writing new kvmclock value, so HV is never entered
unstable TSC compensation erases TSC gain again; requests kvmclock update
kvmclock overflow compensation underflows because vcpu->last_guest_tsc is stale
kvmclock advances randomly
HV entered
HV exit sets last_guest_tsc, making bug invisible

The solution is to always set vcpu->last_guest_tsc when updating
kvmclock.  Note the bug can occur independently of the unstable TSC
compensation, which makes it more likely only because it requests more 
kvmclock updates.  The fundamental issue is the lack of complete state 
update caused by preemption before HV is entered, but after kvmclock is 
written.
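
The change is tiny - a sketch of the idea (not the final patch):

	/* in kvm_write_guest_time(), when publishing the new snapshot */
	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
	vcpu->last_kernel_ns = kernel_ns;
	vcpu->last_guest_tsc = tsc_timestamp;	/* new: never leave it stale */

With that, even if we are preempted before entering the guest, the next
update computes last_guest_tsc - tsc_timestamp from a consistent pair
instead of underflowing.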

I have a patch series which fixes this now, but it has some debug gunk 
in it I need to get rid of.

I'll try to send it in the next few hours, as I'm on vacation (parents 
visiting) next week.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [KVM timekeeping 16/35] Fix a possible backwards warp of kvmclock
  2010-08-20  8:07 ` [KVM timekeeping 16/35] Fix a possible backwards warp of kvmclock Zachary Amsden
@ 2011-09-02 18:34   ` Philipp Hahn
  2011-09-05 14:06     ` [BUG, PATCH-2.6.32] " Philipp Hahn
  0 siblings, 1 reply; 106+ messages in thread
From: Philipp Hahn @ 2011-09-02 18:34 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz

[-- Attachment #1: Type: text/plain, Size: 6616 bytes --]

Hello,

there have been several reports on kvm-devel about problems with kvm-clock. I
can reproduce this 100%.
1. doesn't depend on the guest kernel being 2.6.32 or 3.0.1
2. qemu-0.14.1 is broken, qemu-0.15.0 works.
3. host-kernel 2.6.32.40 is broken, 3.0.0 works.
So currently I have found two solutions: either upgrade qemu-kvm or the
host-kernel, neither of which is an option for me.
I tracked it down to this patch: if I revert it, the VM starts fine.
My current procedure to reproduce this problem is like this:
1. boot into guest kernel
2. reboot from within guest
3. on the second boot, the VM crawls very slowly. The kernel times printed are
roughly the same numbers as on the first boot, but they don't match
wall-clock: the kernel time claims to be 10 s of uptime, but real time is
more like 42 s. If the boot process makes it as far as a command line,
running a "sleep 1" takes much more than one second.

Putting in some printk(..., max_kernel_ns, kernel_ns) I've observed that during
the first boot I get 10k calls to that function a second; on the second boot
it's down to 10-20 a second.
Also, max_kernel_ns is way larger than kernel_ns:
18:27:17 [23755.005941] 6148360456312778392 23755005912468

This patch was back-ported to 2.6.32.40, but it looks like some other
infrastructure might have changed, so it doesn't work as expected.

Any idea on how to proceed?

On Friday 20 August 2010 10:07:30 Zachary Amsden wrote:
> Kernel time, which advances in discrete steps, may progress much slower
> than TSC.  As a result, when kvmclock is adjusted to a new base, the
> apparent time to the guest, which runs at a much higher, nsec scaled
> rate based on the current TSC, may have already been observed to have
> a larger value (kernel_ns + scaled tsc) than the value to which we are
> setting it (kernel_ns + 0).
>
> We must instead compute the clock as potentially observed by the guest
> for kernel_ns to make sure it does not go backwards.
>
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 +
>  arch/x86/kvm/x86.c              |   43
> +++++++++++++++++++++++++++++++++++++- 2 files changed, 43 insertions(+), 2
> deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h
> b/arch/x86/include/asm/kvm_host.h index 324e892..871800d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -339,6 +339,8 @@ struct kvm_vcpu_arch {
>  	unsigned int time_offset;
>  	struct page *time_page;
>  	u64 last_host_tsc;
> +	u64 last_guest_tsc;
> +	u64 last_kernel_ns;
>
>  	bool nmi_pending;
>  	bool nmi_injected;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1948c36..fe74b42 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -976,14 +976,15 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
>  	struct kvm_vcpu_arch *vcpu = &v->arch;
>  	void *shared_kaddr;
>  	unsigned long this_tsc_khz;
> -	s64 kernel_ns;
> +	s64 kernel_ns, max_kernel_ns;
> +	u64 tsc_timestamp;
>
>  	if ((!vcpu->time_page))
>  		return 0;
>
>  	/* Keep irq disabled to prevent changes to the clock */
>  	local_irq_save(flags);
> -	kvm_get_msr(v, MSR_IA32_TSC, &vcpu->hv_clock.tsc_timestamp);
> +	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
>  	kernel_ns = get_kernel_ns();
>  	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
>  	local_irq_restore(flags);
> @@ -993,13 +994,49 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
>  		return 1;
>  	}
>
> +	/*
> +	 * Time as measured by the TSC may go backwards when resetting the base
> +	 * tsc_timestamp.  The reason for this is that the TSC resolution is
> +	 * higher than the resolution of the other clock scales.  Thus, many
> +	 * possible measurements of the TSC correspond to one measurement of any
> +	 * other clock, and so a spread of values is possible.  This is not a
> +	 * problem for the computation of the nanosecond clock; with TSC rates
> +	 * around 1GHZ, there can only be a few cycles which correspond to one
> +	 * nanosecond value, and any path through this code will inevitably
> +	 * take longer than that.  However, with the kernel_ns value itself,
> +	 * the precision may be much lower, down to HZ granularity.  If the
> +	 * first sampling of TSC against kernel_ns ends in the low part of the
> +	 * range, and the second in the high end of the range, we can get:
> +	 *
> +	 * (TSC - offset_low) * S + kns_old > (TSC - offset_high) * S + kns_new
> +	 *
> +	 * As the sampling errors potentially range in the thousands of cycles,
> +	 * it is possible such a time value has already been observed by the
> +	 * guest.  To protect against this, we must compute the system time as
> +	 * observed by the guest and ensure the new system time is greater.
> + 	 */
> +	max_kernel_ns = 0;
> +	if (vcpu->hv_clock.tsc_timestamp && vcpu->last_guest_tsc) {
> +		max_kernel_ns = vcpu->last_guest_tsc -
> +				vcpu->hv_clock.tsc_timestamp;
> +		max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
> +				    vcpu->hv_clock.tsc_to_system_mul,
> +				    vcpu->hv_clock.tsc_shift);
> +		max_kernel_ns += vcpu->last_kernel_ns;
> +	}
> +
>  	if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>  		kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock);
>  		vcpu->hw_tsc_khz = this_tsc_khz;
>  	}
>
> +	if (max_kernel_ns > kernel_ns)
> +		kernel_ns = max_kernel_ns;
> +
>  	/* With all the info we got, fill in the values */
> +	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
>  	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
> +	vcpu->last_kernel_ns = kernel_ns;
>  	vcpu->hv_clock.flags = 0;
>
>  	/*
> @@ -4931,6 +4968,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	if (hw_breakpoint_active())
>  		hw_breakpoint_restore();
>
> +	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +
>  	atomic_set(&vcpu->guest_mode, 0);
>  	smp_wmb();
>  	local_irq_enable();

Sincerely
Philipp
-- 
Philipp Hahn           Open Source Software Engineer      hahn@univention.de
Univention GmbH        Linux for Your Business        fon: +49 421 22 232- 0
Mary-Somerville-Str.1  D-28359 Bremen                 fax: +49 421 22 232-99
                                                   http://www.univention.de/
----------------------------------------------------------------------------
Meet Univention at IT&Business from September 20 to 22, 2011, at the joint
booth of the Open Source Business Alliance in Stuttgart, Hall 3, Stand
3D27-7.

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [BUG, PATCH-2.6.32] Fix a possible backwards warp of kvmclock
  2011-09-02 18:34   ` Philipp Hahn
@ 2011-09-05 14:06     ` Philipp Hahn
  2011-09-12 11:32       ` Marcelo Tosatti
  0 siblings, 1 reply; 106+ messages in thread
From: Philipp Hahn @ 2011-09-05 14:06 UTC (permalink / raw)
  To: Zachary Amsden, 编码人, Xiao Guangrong, Nikola Ciprich
  Cc: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa, Thomas Gleixner,
	John Stultz


[-- Attachment #1.1: Type: text/plain, Size: 1320 bytes --]

Hello,

(cc:-ing lots of people who reported similar bugs on kvm-devel)
> Changing clock in KVM host may cause VM to hang
> 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w 
qemu-kvm-0.13.0)

I found a bug regarding PV-clock in the KVM-kernel module. The attached patch 
solves the problem of the guest being very slow after a reboot. Can you 
please have a look and give it a try to see if it solves your problem as 
well.

Since the fix is only relevant for the stable 2.6.32 tree, where the code is 
quite different, please have a look and forward to stable@ as appropriate.

Sincerely
Philipp

PS: This bug is tracked in our German Bugzilla at 
<https://forge.univention.org/bugzilla/show_bug.cgi?id=23258>
-- 
Philipp Hahn           Open Source Software Engineer      hahn@univention.de
Univention GmbH        Linux for Your Business        fon: +49 421 22 232- 0
Mary-Somerville-Str.1  D-28359 Bremen                 fax: +49 421 22 232-99
                                                   http://www.univention.de/
----------------------------------------------------------------------------
Meet Univention at IT&Business from September 20 to 22, 2011, at the joint
booth of the Open Source Business Alliance in Stuttgart, Hall 3, Stand
3D27-7.

[-- Attachment #1.2: 23258_kvm-clock-reset.diff --]
[-- Type: text/x-diff, Size: 2952 bytes --]

Bug #23257: Reset tsc_timestamp on TSC writes

vcpu->last_guest_tsc is updated in vcpu_enter_guest() and kvm_arch_vcpu_put()
by getting the last value of the TSC from the guest.
On reset, SeaBIOS resets the TSC to 0, which triggers a bug on the next
call to kvm_write_guest_time(): since vcpu->hv_clock.tsc_timestamp still
contains the old value from before the reset, "max_kernel_ns = vcpu->last_guest_tsc
- vcpu->hv_clock.tsc_timestamp" gets negative. Since the variable is u64, it
gets translated to a large positive value.

[9333.197080]
vcpu->last_guest_tsc        =209_328_760_015           ←
vcpu->hv_clock.tsc_timestamp=209_328_708_109
vcpu->last_kernel_ns        =9_333_179_830_643
kernel_ns                   =9_333_197_073_429
max_kernel_ns               =9_333_179_847_943         ←

[9336.910995]
vcpu->last_guest_tsc        =9_438_510_584             ←
vcpu->hv_clock.tsc_timestamp=211_080_593_143
vcpu->last_kernel_ns        =9_333_763_732_907
kernel_ns                   =9_336_910_990_771
max_kernel_ns               =6_148_296_831_006_663_830 ←

For completeness, here are the values for my 3 GHz CPU:
vcpu->hv_clock.tsc_shift         =-1
vcpu->hv_clock.tsc_to_system_mul =2_863_019_502
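
A quick sanity check of the second sample with those constants (rounded
arithmetic, following pvclock_scale_delta semantics):

	delta = 9_438_510_584 - 211_080_593_143           (negative)
	      = 2^64 - 201_642_082_559                    (as u64)
	      = 18_446_542_431_627_469_057
	tsc_shift = -1:  delta >>= 1  ->  9_223_271_215_813_734_528
	(delta * tsc_to_system_mul) >> 32  ->  ~6.148e18 ns  ~= 195 years

For a ~3 GHz TSC the net scale is about 1/3 ns per cycle, so the
underflowed max_kernel_ns lands near 2^64/3 - right where the bogus
value above sits.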

This makes the guest kernel crawl very slowly when clocksource=kvmclock is
used: sleeps take way longer than expected and don't match wall clock any more.
The times printed with printk() don't match real time and the reboot often
stalls for long times.

In linux-git this isn't a problem, since on every MSR_IA32_TSC write
vcpu->arch.hv_clock.tsc_timestamp is reset to 0, which disables the above logic.
The code there is only in arch/x86/kvm/x86.c, since much of the kvm-clock
related code has been refactored for 2.6.37:
	99e3e30a arch/x86/kvm/x86.c (Zachary Amsden            2010-08-19 22:07:17 -1000 1084)  vcpu->arch.hv_clock.tsc_timestamp = 0;                                                      
Since 1d5f066e0b63271b67eac6d3752f8aa96adcbddb from 2.6.37 was back-ported to
2.6.32.40 as ad2088cabe0fd7f633f38ba106025d33ed9a2105, the following patch is
needed to add the needed reset logic to 2.6.32 as well.

Signed-off-by: Philipp Hahn <hahn@univention.de>
--- a/arch/x86/kvm/vmx.c	2011-09-05 14:17:54.000000000 +0200
+++ b/arch/x86/kvm/vmx.c	2011-09-05 14:18:03.000000000 +0200
@@ -1067,6 +1067,7 @@ static int vmx_set_msr(struct kvm_vcpu *
 	case MSR_IA32_TSC:
 		rdtscll(host_tsc);
 		guest_write_tsc(data, host_tsc);
+		vcpu->arch.hv_clock.tsc_timestamp = 0;
 		break;
 	case MSR_IA32_CR_PAT:
 		if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
--- a/arch/x86/kvm/svm.c	2011-09-05 14:17:57.000000000 +0200
+++ b/arch/x86/kvm/svm.c	2011-09-05 14:18:00.000000000 +0200
@@ -2256,6 +2256,7 @@ static int svm_set_msr(struct kvm_vcpu *
 		}
 
 		svm->vmcb->control.tsc_offset = tsc_offset + g_tsc_offset;
+		vcpu->arch.hv_clock.tsc_timestamp = 0;
 
 		break;
 	}

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [BUG, PATCH-2.6.32] Fix a possible backwards warp of kvmclock
  2011-09-05 14:06     ` [BUG, PATCH-2.6.32] " Philipp Hahn
@ 2011-09-12 11:32       ` Marcelo Tosatti
  0 siblings, 0 replies; 106+ messages in thread
From: Marcelo Tosatti @ 2011-09-12 11:32 UTC (permalink / raw)
  To: Philipp Hahn
  Cc: Zachary Amsden, 编码人,
	Xiao Guangrong, Nikola Ciprich, Michael Tokarev, kvm, Avi Kivity,
	Glauber Costa, Thomas Gleixner, John Stultz


Philipp,

Thanks for debugging this issue. We'll be forwarding the fix
for inclusion in the official 2.6.32 tree.

On Mon, Sep 05, 2011 at 04:06:57PM +0200, Philipp Hahn wrote:
> Hello,
> 
> (cc:-ing lots of people who reported similar bugs on kvm-devel)
> > Changing clock in KVM host may cause VM to hang
> > 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w 
> qemu-kvm-0.13.0)
> 
> I found a bug regarding PV-clock in the KVM-kernel module. The attached patch 
> solves the problem of the guest being very slow after a reboot. Can you 
> please have a look and give it a try to see if it solves your problem as 
> well.
> 
> Since the fix is only relevant for the stable 2.6.32 tree, where the code is 
> quite different, please have a look and forward to stable@ as appropriate.
> 
> Sincerely
> Philipp
> 
> PS: This bug is tracked in our German Bugzilla at 
> <https://forge.univention.org/bugzilla/show_bug.cgi?id=23258>
> -- 
> Philipp Hahn           Open Source Software Engineer      hahn@univention.de
> Univention GmbH        Linux for Your Business        fon: +49 421 22 232- 0
> Mary-Somerville-Str.1  D-28359 Bremen                 fax: +49 421 22 232-99
>                                                    http://www.univention.de/
> ----------------------------------------------------------------------------
> Meet Univention at IT&Business from September 20 to 22, 2011, at the joint
> booth of the Open Source Business Alliance in Stuttgart, Hall 3, Stand
> 3D27-7.

> Bug #23257: Reset tsc_timestamp on TSC writes
> 
> vcpu->last_guest_tsc is updated in vcpu_enter_guest() and kvm_arch_vcpu_put()
> by getting the last value of the TSC from the guest.
> On reset, SeaBIOS resets the TSC to 0, which triggers a bug on the next
> call to kvm_write_guest_time(): since vcpu->hv_clock.tsc_timestamp still
> contains the old value from before the reset, "max_kernel_ns = vcpu->last_guest_tsc
> - vcpu->hv_clock.tsc_timestamp" gets negative. Since the variable is u64, it
> gets translated to a large positive value.
> 
> [9333.197080]
> vcpu->last_guest_tsc        =209_328_760_015           ←
> vcpu->hv_clock.tsc_timestamp=209_328_708_109
> vcpu->last_kernel_ns        =9_333_179_830_643
> kernel_ns                   =9_333_197_073_429
> max_kernel_ns               =9_333_179_847_943         ←
> 
> [9336.910995]
> vcpu->last_guest_tsc        =9_438_510_584             ←
> vcpu->hv_clock.tsc_timestamp=211_080_593_143
> vcpu->last_kernel_ns        =9_333_763_732_907
> kernel_ns                   =9_336_910_990_771
> max_kernel_ns               =6_148_296_831_006_663_830 ←
> 
> For completeness, here are the values for my 3 GHz CPU:
> vcpu->hv_clock.tsc_shift         =-1
> vcpu->hv_clock.tsc_to_system_mul =2_863_019_502
> 
> This makes the guest kernel crawl very slowly when clocksource=kvmclock is
> used: sleeps take way longer than expected and don't match wall clock any more.
> The times printed with printk() don't match real time and the reboot often
> stalls for long times.
> 
> In linux-git this isn't a problem, since on every MSR_IA32_TSC write
> vcpu->arch.hv_clock.tsc_timestamp is reset to 0, which disables the above logic.
> The code there is only in arch/x86/kvm/x86.c, since much of the kvm-clock
> related code has been refactored for 2.6.37:
> 	99e3e30a arch/x86/kvm/x86.c (Zachary Amsden            2010-08-19 22:07:17 -1000 1084)  vcpu->arch.hv_clock.tsc_timestamp = 0;                                                      
> Since 1d5f066e0b63271b67eac6d3752f8aa96adcbddb from 2.6.37 was back-ported to
> 2.6.32.40 as ad2088cabe0fd7f633f38ba106025d33ed9a2105, the following patch is
> needed to add the needed reset logic to 2.6.32 as well.
> 
> Signed-off-by: Philipp Hahn <hahn@univention.de>
> --- a/arch/x86/kvm/vmx.c	2011-09-05 14:17:54.000000000 +0200
> +++ b/arch/x86/kvm/vmx.c	2011-09-05 14:18:03.000000000 +0200
> @@ -1067,6 +1067,7 @@ static int vmx_set_msr(struct kvm_vcpu *
>  	case MSR_IA32_TSC:
>  		rdtscll(host_tsc);
>  		guest_write_tsc(data, host_tsc);
> +		vcpu->arch.hv_clock.tsc_timestamp = 0;
>  		break;
>  	case MSR_IA32_CR_PAT:
>  		if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
> --- a/arch/x86/kvm/svm.c	2011-09-05 14:17:57.000000000 +0200
> +++ b/arch/x86/kvm/svm.c	2011-09-05 14:18:00.000000000 +0200
> @@ -2256,6 +2256,7 @@ static int svm_set_msr(struct kvm_vcpu *
>  		}
>  
>  		svm->vmcb->control.tsc_offset = tsc_offset + g_tsc_offset;
> +		vcpu->arch.hv_clock.tsc_timestamp = 0;
>  
>  		break;
>  	}




^ permalink raw reply	[flat|nested] 106+ messages in thread

end of thread, other threads:[~2011-09-12 12:20 UTC | newest]

Thread overview: 106+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-20  8:07 KVM timekeeping and TSC virtualization Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 01/35] Drop vm_init_tsc Zachary Amsden
2010-08-20 16:54   ` Glauber Costa
2010-08-20  8:07 ` [KVM timekeeping 02/35] Convert TSC writes to TSC offset writes Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 03/35] Move TSC offset writes to common code Zachary Amsden
2010-08-20 17:06   ` Glauber Costa
2010-08-24  0:51     ` Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 04/35] Fix SVM VMCB reset Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 05/35] Move TSC reset out of vmcb_init Zachary Amsden
2010-08-20 17:08   ` Glauber Costa
2010-08-24  0:52     ` Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 06/35] TSC reset compensation Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 07/35] Make cpu_tsc_khz updates use local CPU Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 08/35] Warn about unstable TSC Zachary Amsden
2010-08-20 17:28   ` Glauber Costa
2010-08-24  0:56     ` Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 09/35] Unify TSC logic Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization Zachary Amsden
2010-08-20 17:30   ` Glauber Costa
2010-09-14  9:10   ` Jan Kiszka
2010-09-14  9:27     ` Avi Kivity
2010-09-14 10:40       ` Jan Kiszka
2010-09-14 10:47         ` Avi Kivity
2010-09-14 19:32         ` Zachary Amsden
2010-09-14 22:26           ` Jan Kiszka
2010-09-14 23:40             ` Zachary Amsden
2010-09-15  5:34               ` Jan Kiszka
2010-09-15  7:55                 ` Avi Kivity
2010-09-15  8:04                   ` Jan Kiszka
2010-09-15 12:29               ` Glauber Costa
2010-09-15  4:07     ` Zachary Amsden
2010-09-15  8:09       ` Jan Kiszka
2010-09-15 12:32         ` Glauber Costa
2010-09-15 18:27           ` Jan Kiszka
2010-09-17 22:09             ` Zachary Amsden
2010-09-17 22:31               ` Zachary Amsden
2010-09-18 23:53                 ` Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 11/35] Add helper functions for time computation Zachary Amsden
2010-08-20 17:34   ` Glauber Costa
2010-08-24  0:58     ` Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 12/35] Robust TSC compensation Zachary Amsden
2010-08-20 17:40   ` Glauber Costa
2010-08-24  1:01     ` Zachary Amsden
2010-08-24 21:33   ` Daniel Verkamp
2010-08-20  8:07 ` [KVM timekeeping 13/35] Perform hardware_enable in CPU_STARTING callback Zachary Amsden
2010-08-27 16:32   ` Jan Kiszka
2010-08-27 23:43     ` Zachary Amsden
2010-08-30  9:10       ` Jan Kiszka
2010-08-20  8:07 ` [KVM timekeeping 14/35] Add clock sync request to hardware enable Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 15/35] Move scale_delta into common header Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 16/35] Fix a possible backwards warp of kvmclock Zachary Amsden
2011-09-02 18:34   ` Philipp Hahn
2011-09-05 14:06     ` [BUG, PATCH-2.6.32] " Philipp Hahn
2011-09-12 11:32       ` Marcelo Tosatti
2010-08-20  8:07 ` [KVM timekeeping 17/35] Implement getnsboottime kernel API Zachary Amsden
2010-08-20 18:39   ` john stultz
2010-08-20 23:37     ` Zachary Amsden
2010-08-21  0:02       ` john stultz
2010-08-21  0:52         ` Zachary Amsden
2010-08-21  1:04           ` john stultz
2010-08-21  1:22             ` Zachary Amsden
2010-08-27 18:05   ` Jan Kiszka
2010-08-27 23:48     ` Zachary Amsden
2010-08-30 18:07       ` Jan Kiszka
2010-08-20  8:07 ` [KVM timekeeping 18/35] Use getnsboottime in KVM Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 19/35] Add timekeeping documentation Zachary Amsden
2010-08-20 17:50   ` Glauber Costa
2010-08-20  8:07 ` [KVM timekeeping 20/35] Make math work for other scales Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 21/35] Track max tsc_khz Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 22/35] Track tsc last write in vcpu Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 23/35] Set initial TSC rate conversion factors Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 24/35] Timer request function renaming Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 25/35] Add clock catchup mode Zachary Amsden
2010-08-25 17:27   ` Marcelo Tosatti
2010-08-25 20:48     ` Zachary Amsden
2010-08-25 22:01       ` Marcelo Tosatti
2010-08-25 23:38         ` Glauber Costa
2010-08-26  0:17         ` Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 26/35] Catchup slower TSC to guest rate Zachary Amsden
2010-09-07  3:44   ` Dong, Eddie
2010-09-07 22:14     ` Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 27/35] Add TSC trapping Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 28/35] Unstable TSC write compensation Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 29/35] TSC overrun protection Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 30/35] IOCTL for setting TSC rate Zachary Amsden
2010-08-20 17:56   ` Glauber Costa
2010-08-21 16:11     ` Arnd Bergmann
2010-08-20  8:07 ` [KVM timekeeping 31/35] Exit conditions for TSC trapping Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 32/35] Entry " Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 33/35] Indicate reliable TSC in kvmclock Zachary Amsden
2010-08-20 17:45   ` Glauber Costa
2010-08-24  1:14     ` Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 34/35] Remove dead code Zachary Amsden
2010-08-20  8:07 ` [KVM timekeeping 35/35] Add some debug stuff Zachary Amsden
2010-08-20 13:26 ` KVM timekeeping and TSC virtualization David S. Ahern
2010-08-20 23:24   ` Zachary Amsden
2010-08-22  1:32     ` David S. Ahern
2010-08-24  1:44       ` Zachary Amsden
2010-08-24  3:04         ` David S. Ahern
2010-08-24  5:47           ` Zachary Amsden
2010-08-24 13:32             ` David S. Ahern
2010-08-24 23:01               ` Zachary Amsden
2010-08-25 16:55                 ` Marcelo Tosatti
2010-08-25 20:32                   ` Zachary Amsden
2010-08-24 22:13 ` Marcelo Tosatti
2010-08-25  4:04   ` Zachary Amsden

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).