* [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4)
@ 2012-11-15  0:08 Marcelo Tosatti
  2012-11-15  0:08 ` [patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
                   ` (18 more replies)
  0 siblings, 19 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm; +Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini


This patchset, based on earlier work by Jeremy Fitzhardinge, implements
paravirtual clock vsyscall support.

It should be possible to implement Xen support relatively easily.

It reduces clock_gettime from 500 cycles to 200 cycles
on my testbox.
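
For reference, this is the kind of userspace measurement behind those
numbers; the loop, iteration count and use of __rdtsc() below are an
assumed methodology, shown only as a sketch:

#include <stdio.h>
#include <time.h>
#include <x86intrin.h>

#define ITERS 1000000

int main(void)
{
	struct timespec ts;
	unsigned long long start, end;
	int i;

	start = __rdtsc();
	for (i = 0; i < ITERS; i++)
		clock_gettime(CLOCK_MONOTONIC, &ts);	/* vDSO path when available */
	end = __rdtsc();

	printf("~%llu cycles per call\n", (end - start) / ITERS);
	return 0;
}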

Please review.

v4:
- remove aligned_pvti structure, align directly (Glauber)
- add comments to migration notifier (Glauber)
- mark migration notifier condition as unlikely (Glauber)
- add comment about rdtsc barrier dependency on sse2 (Gleb)
- add idea to improve vdso gettime call (Gleb)
- remove new msr interface, reuse kernel copy of pvclock data (Glauber)
- move copying of timekeeping data from generic timekeeping code to kvm
  code (John)

v3:
- fix PVCLOCK_VSYSCALL_NR_PAGES definition (glommer)
- fold flags race fix into pvclock refactoring (avi)
- remove CONFIG_PARAVIRT_CLOCK_VSYSCALL (glommer)
- add reference to tsc.c from vclock_gettime.c about cycle_last rationale
(glommer)
- fix whitespace damage (glommer)


v2:
- Do not allow visibility of different <system_timestamp, tsc_timestamp>
tuples.
- Add option to disable vsyscall.





* [patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15  0:08 ` [patch 02/18] x86: kvmclock: allocate pvclock shared memory area Marcelo Tosatti
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: x86-kvm-retain-guest-stopped.patch --]
[-- Type: text/plain, Size: 1754 bytes --]

Otherwise it's possible for an unrelated KVM_REQ_CLOCK_UPDATE (such as one due
to CPU migration) to clear the bit.

Noticed by Paolo Bonzini.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1143,6 +1143,7 @@ static int kvm_guest_time_update(struct 
 	unsigned long this_tsc_khz;
 	s64 kernel_ns, max_kernel_ns;
 	u64 tsc_timestamp;
+	struct pvclock_vcpu_time_info *guest_hv_clock;
 	u8 pvclock_flags;
 
 	/* Keep irq disabled to prevent changes to the clock */
@@ -1226,13 +1227,6 @@ static int kvm_guest_time_update(struct 
 	vcpu->last_kernel_ns = kernel_ns;
 	vcpu->last_guest_tsc = tsc_timestamp;
 
-	pvclock_flags = 0;
-	if (vcpu->pvclock_set_guest_stopped_request) {
-		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
-		vcpu->pvclock_set_guest_stopped_request = false;
-	}
-
-	vcpu->hv_clock.flags = pvclock_flags;
 
 	/*
 	 * The interface expects us to write an even number signaling that the
@@ -1243,6 +1237,18 @@ static int kvm_guest_time_update(struct 
 
 	shared_kaddr = kmap_atomic(vcpu->time_page);
 
+	guest_hv_clock = shared_kaddr + vcpu->time_offset;
+
+	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
+	pvclock_flags = (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
+
+	if (vcpu->pvclock_set_guest_stopped_request) {
+		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
+		vcpu->pvclock_set_guest_stopped_request = false;
+	}
+
+	vcpu->hv_clock.flags = pvclock_flags;
+
 	memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
 	       sizeof(vcpu->hv_clock));
 




* [patch 02/18] x86: kvmclock: allocate pvclock shared memory area
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
  2012-11-15  0:08 ` [patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15 17:05   ` Glauber Costa
  2012-11-15  0:08 ` [patch 03/18] x86: pvclock: make sure rdtsc doesnt speculate out of region Marcelo Tosatti
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 00-kvmclock-alloc-area.patch --]
[-- Type: text/plain, Size: 3974 bytes --]

We want to expose the pvclock shared memory areas, which
the hypervisor periodically updates, to userspace.

For a linear mapping from userspace, it is necessary that
entire page-sized regions are used for the array of pvclock
structures.

There is no such guarantee with per-CPU areas, therefore move
to memblock_alloc-based allocation.
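
For illustration, the layout difference looks roughly like this (sketch
only, not part of the patch):

/*
 * Contiguous, page-aligned array (what this patch switches to):
 *
 *   hv_clock -> | cpu0 | cpu1 | ... | cpuN |   one allocation, PAGE_SIZE aligned
 *
 * cpu i's entry sits at the fixed offset i * sizeof(entry), so whole
 * pages of the array can later be exposed read-only to userspace.
 *
 * DEFINE_PER_CPU data, by contrast, is replicated into each CPU's
 * per-cpu chunk, with no page-sized, contiguous placement guarantee.
 */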

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/kvmclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/kvmclock.c
+++ vsyscall/arch/x86/kernel/kvmclock.c
@@ -23,6 +23,7 @@
 #include <asm/apic.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
+#include <linux/memblock.h>
 
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
@@ -39,7 +40,11 @@ static int parse_no_kvmclock(char *arg)
 early_param("no-kvmclock", parse_no_kvmclock);
 
 /* The hypervisor will put information about time periodically here */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
+struct pvclock_aligned_vcpu_time_info {
+	struct pvclock_vcpu_time_info clock;
+} __attribute__((__aligned__(SMP_CACHE_BYTES)));
+
+static struct pvclock_aligned_vcpu_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
 /*
@@ -52,15 +57,20 @@ static unsigned long kvm_get_wallclock(v
 	struct pvclock_vcpu_time_info *vcpu_time;
 	struct timespec ts;
 	int low, high;
+	int cpu;
+
+	preempt_disable();
+	cpu = smp_processor_id();
 
 	low = (int)__pa_symbol(&wall_clock);
 	high = ((u64)__pa_symbol(&wall_clock) >> 32);
 
 	native_write_msr(msr_kvm_wall_clock, low, high);
 
-	vcpu_time = &get_cpu_var(hv_clock);
+	vcpu_time = &hv_clock[cpu].clock;
 	pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
-	put_cpu_var(hv_clock);
+
+	preempt_enable();
 
 	return ts.tv_sec;
 }
@@ -74,9 +84,11 @@ static cycle_t kvm_clock_read(void)
 {
 	struct pvclock_vcpu_time_info *src;
 	cycle_t ret;
+	int cpu;
 
 	preempt_disable_notrace();
-	src = &__get_cpu_var(hv_clock);
+	cpu = smp_processor_id();
+	src = &hv_clock[cpu].clock;
 	ret = pvclock_clocksource_read(src);
 	preempt_enable_notrace();
 	return ret;
@@ -99,8 +111,15 @@ static cycle_t kvm_clock_get_cycles(stru
 static unsigned long kvm_get_tsc_khz(void)
 {
 	struct pvclock_vcpu_time_info *src;
-	src = &per_cpu(hv_clock, 0);
-	return pvclock_tsc_khz(src);
+	int cpu;
+	unsigned long tsc_khz;
+
+	preempt_disable();
+	cpu = smp_processor_id();
+	src = &hv_clock[cpu].clock;
+	tsc_khz = pvclock_tsc_khz(src);
+	preempt_enable();
+	return tsc_khz;
 }
 
 static void kvm_get_preset_lpj(void)
@@ -119,10 +138,14 @@ bool kvm_check_and_clear_guest_paused(vo
 {
 	bool ret = false;
 	struct pvclock_vcpu_time_info *src;
+	int cpu = smp_processor_id();
 
-	src = &__get_cpu_var(hv_clock);
+	if (!hv_clock)
+		return ret;
+
+	src = &hv_clock[cpu].clock;
 	if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
-		__this_cpu_and(hv_clock.flags, ~PVCLOCK_GUEST_STOPPED);
+		src->flags &= ~PVCLOCK_GUEST_STOPPED;
 		ret = true;
 	}
 
@@ -141,9 +164,10 @@ int kvm_register_clock(char *txt)
 {
 	int cpu = smp_processor_id();
 	int low, high, ret;
+	struct pvclock_vcpu_time_info *src = &hv_clock[cpu].clock;
 
-	low = (int)__pa(&per_cpu(hv_clock, cpu)) | 1;
-	high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
+	low = (int)__pa(src) | 1;
+	high = ((u64)__pa(src) >> 32);
 	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
 	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
 	       cpu, high, low, txt);
@@ -197,9 +221,17 @@ static void kvm_shutdown(void)
 
 void __init kvmclock_init(void)
 {
+	unsigned long mem;
+
 	if (!kvm_para_available())
 		return;
 
+	mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * NR_CPUS,
+			     PAGE_SIZE);
+	if (!mem)
+		return;
+	hv_clock = __va(mem);
+
 	if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
 		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
 		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;




* [patch 03/18] x86: pvclock: make sure rdtsc doesnt speculate out of region
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
  2012-11-15  0:08 ` [patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
  2012-11-15  0:08 ` [patch 02/18] x86: kvmclock: allocate pvclock shared memory area Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15  0:08 ` [patch 04/18] x86: pvclock: remove pvclock_shadow_time Marcelo Tosatti
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 01-pvclock-read-rdtsc-barrier --]
[-- Type: text/plain, Size: 745 bytes --]

Originally from Jeremy Fitzhardinge.

pvclock_get_time_values, which contains the memory barriers, will be
removed by the next patch.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
 
 	do {
 		version = pvclock_get_time_values(&shadow, src);
-		barrier();
+		rdtsc_barrier();
 		offset = pvclock_get_nsec_offset(&shadow);
 		ret = shadow.system_timestamp + offset;
-		barrier();
+		rdtsc_barrier();
 	} while (version != src->version);
 
 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&




* [patch 04/18] x86: pvclock: remove pvclock_shadow_time
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2012-11-15  0:08 ` [patch 03/18] x86: pvclock: make sure rdtsc doesnt speculate out of region Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15  0:08 ` [patch 05/18] x86: pvclock: create helper for pvclock data retrieval Marcelo Tosatti
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 02-pvclock-remove-shadow-time --]
[-- Type: text/plain, Size: 3030 bytes --]

Originally from Jeremy Fitzhardinge.

We can copy the information directly from "struct pvclock_vcpu_time_info",
so remove pvclock_shadow_time.

Reviewed-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -19,21 +19,6 @@
 #include <linux/percpu.h>
 #include <asm/pvclock.h>
 
-/*
- * These are perodically updated
- *    xen: magic shared_info page
- *    kvm: gpa registered via msr
- * and then copied here.
- */
-struct pvclock_shadow_time {
-	u64 tsc_timestamp;     /* TSC at last update of time vals.  */
-	u64 system_timestamp;  /* Time, in nanosecs, since boot.    */
-	u32 tsc_to_nsec_mul;
-	int tsc_shift;
-	u32 version;
-	u8  flags;
-};
-
 static u8 valid_flags __read_mostly = 0;
 
 void pvclock_set_flags(u8 flags)
@@ -41,32 +26,11 @@ void pvclock_set_flags(u8 flags)
 	valid_flags = flags;
 }
 
-static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
+static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
 {
-	u64 delta = native_read_tsc() - shadow->tsc_timestamp;
-	return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
-				   shadow->tsc_shift);
-}
-
-/*
- * Reads a consistent set of time-base values from hypervisor,
- * into a shadow data area.
- */
-static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
-					struct pvclock_vcpu_time_info *src)
-{
-	do {
-		dst->version = src->version;
-		rmb();		/* fetch version before data */
-		dst->tsc_timestamp     = src->tsc_timestamp;
-		dst->system_timestamp  = src->system_time;
-		dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
-		dst->tsc_shift         = src->tsc_shift;
-		dst->flags             = src->flags;
-		rmb();		/* test version after fetching data */
-	} while ((src->version & 1) || (dst->version != src->version));
-
-	return dst->version;
+	u64 delta = native_read_tsc() - src->tsc_timestamp;
+	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
+				   src->tsc_shift);
 }
 
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
@@ -90,21 +54,22 @@ void pvclock_resume(void)
 
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
-	struct pvclock_shadow_time shadow;
 	unsigned version;
 	cycle_t ret, offset;
 	u64 last;
+	u8 flags;
 
 	do {
-		version = pvclock_get_time_values(&shadow, src);
+		version = src->version;
 		rdtsc_barrier();
-		offset = pvclock_get_nsec_offset(&shadow);
-		ret = shadow.system_timestamp + offset;
+		offset = pvclock_get_nsec_offset(src);
+		ret = src->system_time + offset;
+		flags = src->flags;
 		rdtsc_barrier();
-	} while (version != src->version);
+	} while ((src->version & 1) || version != src->version);
 
 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
-		(shadow.flags & PVCLOCK_TSC_STABLE_BIT))
+		(flags & PVCLOCK_TSC_STABLE_BIT))
 		return ret;
 
 	/*




* [patch 05/18] x86: pvclock: create helper for pvclock data retrieval
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2012-11-15  0:08 ` [patch 04/18] x86: pvclock: remove pvclock_shadow_time Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15 12:27   ` Glauber Costa
  2012-11-15  0:08 ` [patch 06/18] x86: pvclock: introduce helper to read flags Marcelo Tosatti
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 03-move-pvread-to-pvheader --]
[-- Type: text/plain, Size: 2264 bytes --]

Originally from Jeremy Fitzhardinge.

Move the time reading code into an inline helper in pvclock.h so the code
can be reused.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -26,13 +26,6 @@ void pvclock_set_flags(u8 flags)
 	valid_flags = flags;
 }
 
-static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
-{
-	u64 delta = native_read_tsc() - src->tsc_timestamp;
-	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
-				   src->tsc_shift);
-}
-
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
 {
 	u64 pv_tsc_khz = 1000000ULL << 32;
@@ -55,17 +48,12 @@ void pvclock_resume(void)
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
 	unsigned version;
-	cycle_t ret, offset;
+	cycle_t ret;
 	u64 last;
 	u8 flags;
 
 	do {
-		version = src->version;
-		rdtsc_barrier();
-		offset = pvclock_get_nsec_offset(src);
-		ret = src->system_time + offset;
-		flags = src->flags;
-		rdtsc_barrier();
+		version = __pvclock_read_cycles(src, &ret, &flags);
 	} while ((src->version & 1) || version != src->version);
 
 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -56,4 +56,32 @@ static inline u64 pvclock_scale_delta(u6
 	return product;
 }
 
+static __always_inline
+u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
+{
+	u64 delta = __native_read_tsc() - src->tsc_timestamp;
+	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
+				   src->tsc_shift);
+}
+
+static __always_inline
+unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
+			       cycle_t *cycles, u8 *flags)
+{
+	unsigned version;
+	cycle_t ret, offset;
+	u8 ret_flags;
+
+	version = src->version;
+	rdtsc_barrier();
+	offset = pvclock_get_nsec_offset(src);
+	ret = src->system_time + offset;
+	ret_flags = src->flags;
+	rdtsc_barrier();
+
+	*cycles = ret;
+	*flags = ret_flags;
+	return version;
+}
+
 #endif /* _ASM_X86_PVCLOCK_H */




* [patch 06/18] x86: pvclock: introduce helper to read flags
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2012-11-15  0:08 ` [patch 05/18] x86: pvclock: create helper for pvclock data retrieval Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15 12:28   ` Glauber Costa
  2012-11-15  0:08 ` [patch 07/18] x86: pvclock: add note about rdtsc barriers Marcelo Tosatti
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 05-pvclock-add-get-flags --]
[-- Type: text/plain, Size: 1278 bytes --]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -45,6 +45,19 @@ void pvclock_resume(void)
 	atomic64_set(&last_value, 0);
 }
 
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src)
+{
+	unsigned version;
+	cycle_t ret;
+	u8 flags;
+
+	do {
+		version = __pvclock_read_cycles(src, &ret, &flags);
+	} while ((src->version & 1) || version != src->version);
+
+	return flags & valid_flags;
+}
+
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
 	unsigned version;
Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
 void pvclock_set_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
 void pvclock_read_wallclock(struct pvclock_wall_clock *wall,




* [patch 07/18] x86: pvclock: add note about rdtsc barriers
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (5 preceding siblings ...)
  2012-11-15  0:08 ` [patch 06/18] x86: pvclock: introduce helper to read flags Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15 12:30   ` Glauber Costa
  2012-11-15  0:08 ` [patch 08/18] sched: add notifier for cross-cpu migrations Marcelo Tosatti
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 05.2-pvclock-add-comment-barrier --]
[-- Type: text/plain, Size: 675 bytes --]

As noted by Gleb, not advertising SSE2 support to the guest implies that
rdtsc_barrier() expands to nothing, so kvmclock reads lack the necessary
RDTSC barriers.
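
For context, rdtsc_barrier() is, roughly, defined at this point in time as
alternatives that are only patched in when the MFENCE/LFENCE-RDTSC feature
bits are set, and the CPU setup code sets those bits only when SSE2 (XMM2)
is present (sketch of the contemporaneous definition, for reference only):

static __always_inline void rdtsc_barrier(void)
{
	/* both remain NOPs unless the respective feature bit is set */
	alternative(ASM_NOP3, "mfence", X86_FEATURE_MFENCE_RDTSC);
	alternative(ASM_NOP3, "lfence", X86_FEATURE_LFENCE_RDTSC);
}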

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -74,6 +74,9 @@ unsigned __pvclock_read_cycles(const str
 	u8 ret_flags;
 
 	version = src->version;
+	/* Note: emulated platforms which do not advertise SSE2 support
+ 	 * result in kvmclock not using the necessary RDTSC barriers.
+ 	 */
 	rdtsc_barrier();
 	offset = pvclock_get_nsec_offset(src);
 	ret = src->system_time + offset;




* [patch 08/18] sched: add notifier for cross-cpu migrations
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (6 preceding siblings ...)
  2012-11-15  0:08 ` [patch 07/18] x86: pvclock: add note about rdtsc barriers Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15  7:01   ` Gleb Natapov
  2012-11-15  0:08 ` [patch 09/18] x86: pvclock: generic pvclock vsyscall initialization Marcelo Tosatti
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 06-add-task-migration-notifier --]
[-- Type: text/plain, Size: 1732 bytes --]

Originally from Jeremy Fitzhardinge.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/include/linux/sched.h
===================================================================
--- vsyscall.orig/include/linux/sched.h
+++ vsyscall/include/linux/sched.h
@@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void)
 extern void calc_global_load(unsigned long ticks);
 extern void update_cpu_load_nohz(void);
 
+/* Notifier for when a task gets migrated to a new CPU */
+struct task_migration_notifier {
+	struct task_struct *task;
+	int from_cpu;
+	int to_cpu;
+};
+extern void register_task_migration_notifier(struct notifier_block *n);
+
 extern unsigned long get_parent_ip(unsigned long addr);
 
 struct seq_file;
Index: vsyscall/kernel/sched/core.c
===================================================================
--- vsyscall.orig/kernel/sched/core.c
+++ vsyscall/kernel/sched/core.c
@@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s
 		rq->skip_clock_update = 1;
 }
 
+static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
+
+void register_task_migration_notifier(struct notifier_block *n)
+{
+	atomic_notifier_chain_register(&task_migration_notifier, n);
+}
+
 #ifdef CONFIG_SMP
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
@@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p,
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+		struct task_migration_notifier tmn;
+
 		p->se.nr_migrations++;
 		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
+
+		tmn.task = p;
+		tmn.from_cpu = task_cpu(p);
+		tmn.to_cpu = new_cpu;
+
+		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
 	}
 
 	__set_task_cpu(p, new_cpu);




* [patch 09/18] x86: pvclock: generic pvclock vsyscall initialization
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (7 preceding siblings ...)
  2012-11-15  0:08 ` [patch 08/18] sched: add notifier for cross-cpu migrations Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15  0:08 ` [patch 10/18] x86: kvm guest: pvclock vsyscall support Marcelo Tosatti
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 07-add-pvclock-structs-and-fixmap --]
[-- Type: text/plain, Size: 4272 bytes --]

Originally from Jeremy Fitzhardinge.

Introduce generic, non-hypervisor-specific, pvclock vsyscall initialization
routines.
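
As a rough sizing illustration (assuming 4096-byte pages and 64-byte cache
lines, so each aligned entry occupies 64 bytes):

	PVTI_SIZE                 = sizeof(struct pvclock_vsyscall_time_info) = 64
	entries per page          = PAGE_SIZE / PVTI_SIZE = 4096 / 64 = 64
	PVCLOCK_VSYSCALL_NR_PAGES = ((NR_CPUS - 1) / 64) + 1
	                            e.g. NR_CPUS = 64  -> 1 page
	                                 NR_CPUS = 256 -> 4 pages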

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -17,6 +17,10 @@
 
 #include <linux/kernel.h>
 #include <linux/percpu.h>
+#include <linux/notifier.h>
+#include <linux/sched.h>
+#include <linux/gfp.h>
+#include <linux/bootmem.h>
 #include <asm/pvclock.h>
 
 static u8 valid_flags __read_mostly = 0;
@@ -122,3 +126,68 @@ void pvclock_read_wallclock(struct pvclo
 
 	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
+
+static struct pvclock_vsyscall_time_info *pvclock_vdso_info;
+
+static struct pvclock_vsyscall_time_info *
+pvclock_get_vsyscall_user_time_info(int cpu)
+{
+	if (!pvclock_vdso_info) {
+		BUG();
+		return NULL;
+	}
+
+	return &pvclock_vdso_info[cpu];
+}
+
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu)
+{
+	return &pvclock_get_vsyscall_user_time_info(cpu)->pvti;
+}
+
+int pvclock_task_migrate(struct notifier_block *nb, unsigned long l, void *v)
+{
+	struct task_migration_notifier *mn = v;
+	struct pvclock_vsyscall_time_info *pvti;
+
+	pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu);
+
+	/* this is NULL when pvclock vsyscall is not initialized */
+	if (unlikely(pvti == NULL))
+		return NOTIFY_DONE;
+
+	pvti->migrate_count++;
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block pvclock_migrate = {
+	.notifier_call = pvclock_task_migrate,
+};
+
+/*
+ * Initialize the generic pvclock vsyscall state.  This will allocate
+ * a/some page(s) for the per-vcpu pvclock information, set up a
+ * fixmap mapping for the page(s)
+ */
+
+int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
+				 int size)
+{
+	int idx;
+
+	WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
+
+	pvclock_vdso_info = i;
+
+	for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
+		__set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
+			     __pa_symbol(i) + (idx*PAGE_SIZE),
+			     PAGE_KERNEL_VVAR);
+	}
+
+
+	register_task_migration_notifier(&pvclock_migrate);
+
+	return 0;
+}
Index: vsyscall/arch/x86/include/asm/fixmap.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/fixmap.h
+++ vsyscall/arch/x86/include/asm/fixmap.h
@@ -19,6 +19,7 @@
 #include <asm/acpi.h>
 #include <asm/apicdef.h>
 #include <asm/page.h>
+#include <asm/pvclock.h>
 #ifdef CONFIG_X86_32
 #include <linux/threads.h>
 #include <asm/kmap_types.h>
@@ -81,6 +82,10 @@ enum fixed_addresses {
 	VVAR_PAGE,
 	VSYSCALL_HPET,
 #endif
+#ifdef CONFIG_PARAVIRT_CLOCK
+	PVCLOCK_FIXMAP_BEGIN,
+	PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
+#endif
 	FIX_DBGP_BASE,
 	FIX_EARLYCON_MEM_BASE,
 #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -85,4 +85,16 @@ unsigned __pvclock_read_cycles(const str
 	return version;
 }
 
+struct pvclock_vsyscall_time_info {
+	struct pvclock_vcpu_time_info pvti;
+	u32 migrate_count;
+} __attribute__((__aligned__(SMP_CACHE_BYTES)));
+
+#define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
+#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1)
+
+int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
+				 int size);
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
+
 #endif /* _ASM_X86_PVCLOCK_H */
Index: vsyscall/arch/x86/include/asm/clocksource.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/clocksource.h
+++ vsyscall/arch/x86/include/asm/clocksource.h
@@ -8,6 +8,7 @@
 #define VCLOCK_NONE 0  /* No vDSO clock available.	*/
 #define VCLOCK_TSC  1  /* vDSO should use vread_tsc.	*/
 #define VCLOCK_HPET 2  /* vDSO should use vread_hpet.	*/
+#define VCLOCK_PVCLOCK 3 /* vDSO should use vread_pvclock. */
 
 struct arch_clocksource_data {
 	int vclock_mode;




* [patch 10/18] x86: kvm guest: pvclock vsyscall support
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (8 preceding siblings ...)
  2012-11-15  0:08 ` [patch 09/18] x86: pvclock: generic pvclock vsyscall initialization Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15  0:08 ` [patch 11/18] x86: vdso: pvclock gettime support Marcelo Tosatti
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 08-add-pvclock-vsyscall-kvm-support --]
[-- Type: text/plain, Size: 4727 bytes --]

Hook into the generic pvclock vsyscall code, with the aim of allowing
userspace to have visibility into pvclock data.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/kvmclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/kvmclock.c
+++ vsyscall/arch/x86/kernel/kvmclock.c
@@ -40,11 +40,7 @@ static int parse_no_kvmclock(char *arg)
 early_param("no-kvmclock", parse_no_kvmclock);
 
 /* The hypervisor will put information about time periodically here */
-struct pvclock_aligned_vcpu_time_info {
-	struct pvclock_vcpu_time_info clock;
-} __attribute__((__aligned__(SMP_CACHE_BYTES)));
-
-static struct pvclock_aligned_vcpu_time_info *hv_clock;
+static struct pvclock_vsyscall_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
 /*
@@ -67,7 +63,7 @@ static unsigned long kvm_get_wallclock(v
 
 	native_write_msr(msr_kvm_wall_clock, low, high);
 
-	vcpu_time = &hv_clock[cpu].clock;
+	vcpu_time = &hv_clock[cpu].pvti;
 	pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
 
 	preempt_enable();
@@ -88,7 +84,7 @@ static cycle_t kvm_clock_read(void)
 
 	preempt_disable_notrace();
 	cpu = smp_processor_id();
-	src = &hv_clock[cpu].clock;
+	src = &hv_clock[cpu].pvti;
 	ret = pvclock_clocksource_read(src);
 	preempt_enable_notrace();
 	return ret;
@@ -116,7 +112,7 @@ static unsigned long kvm_get_tsc_khz(voi
 
 	preempt_disable();
 	cpu = smp_processor_id();
-	src = &hv_clock[cpu].clock;
+	src = &hv_clock[cpu].pvti;
 	tsc_khz = pvclock_tsc_khz(src);
 	preempt_enable();
 	return tsc_khz;
@@ -143,7 +139,7 @@ bool kvm_check_and_clear_guest_paused(vo
 	if (!hv_clock)
 		return ret;
 
-	src = &hv_clock[cpu].clock;
+	src = &hv_clock[cpu].pvti;
 	if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
 		src->flags &= ~PVCLOCK_GUEST_STOPPED;
 		ret = true;
@@ -164,7 +160,7 @@ int kvm_register_clock(char *txt)
 {
 	int cpu = smp_processor_id();
 	int low, high, ret;
-	struct pvclock_vcpu_time_info *src = &hv_clock[cpu].clock;
+	struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti;
 
 	low = (int)__pa(src) | 1;
 	high = ((u64)__pa(src) >> 32);
@@ -226,7 +222,7 @@ void __init kvmclock_init(void)
 	if (!kvm_para_available())
 		return;
 
-	mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * NR_CPUS,
+	mem = memblock_alloc(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS,
 			     PAGE_SIZE);
 	if (!mem)
 		return;
@@ -265,3 +261,36 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 }
+
+int kvm_setup_vsyscall_timeinfo(void)
+{
+	int cpu;
+	int ret;
+	u8 flags;
+	struct pvclock_vcpu_time_info *vcpu_time;
+	unsigned int size;
+
+	size = sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS;
+
+	preempt_disable();
+	cpu = smp_processor_id();
+
+	vcpu_time = &hv_clock[cpu].pvti;
+	flags = pvclock_read_flags(vcpu_time);
+
+	if (!(flags & PVCLOCK_TSC_STABLE_BIT)) {
+		preempt_enable();
+		return 1;
+	}
+
+	if ((ret = pvclock_init_vsyscall(hv_clock, size))) {
+		preempt_enable();
+		return ret;
+	}
+
+	preempt_enable();
+
+	kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
+	return 0;
+}
+
Index: vsyscall/arch/x86/kernel/kvm.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/kvm.c
+++ vsyscall/arch/x86/kernel/kvm.c
@@ -42,6 +42,7 @@
 #include <asm/apic.h>
 #include <asm/apicdef.h>
 #include <asm/hypervisor.h>
+#include <asm/kvm_guest.h>
 
 static int kvmapf = 1;
 
@@ -62,6 +63,15 @@ static int parse_no_stealacc(char *arg)
 
 early_param("no-steal-acc", parse_no_stealacc);
 
+static int kvmclock_vsyscall = 1;
+static int parse_no_kvmclock_vsyscall(char *arg)
+{
+        kvmclock_vsyscall = 0;
+        return 0;
+}
+
+early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
+
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
@@ -468,6 +478,9 @@ void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
 		apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
+	if (kvmclock_vsyscall)
+		kvm_setup_vsyscall_timeinfo();
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
 	register_cpu_notifier(&kvm_cpu_notifier);
Index: vsyscall/arch/x86/include/asm/kvm_guest.h
===================================================================
--- /dev/null
+++ vsyscall/arch/x86/include/asm/kvm_guest.h
@@ -0,0 +1,6 @@
+#ifndef _ASM_X86_KVM_GUEST_H
+#define _ASM_X86_KVM_GUEST_H
+
+int kvm_setup_vsyscall_timeinfo(void);
+
+#endif /* _ASM_X86_KVM_GUEST_H */




* [patch 11/18] x86: vdso: pvclock gettime support
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (9 preceding siblings ...)
  2012-11-15  0:08 ` [patch 10/18] x86: kvm guest: pvclock vsyscall support Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15  0:08 ` [patch 12/18] KVM: x86: pass host_tsc to read_l1_tsc Marcelo Tosatti
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 10-add-pvclock-vdso-code --]
[-- Type: text/plain, Size: 5202 bytes --]

Improve performance of time system calls when using Linux pvclock, by
reading the time info from the fixmap-visible copy of the pvclock data.
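
No application changes are needed; an ordinary call such as the sketch
below is serviced by vread_pvclock() in the vDSO once the clocksource mode
is VCLOCK_PVCLOCK (illustrative program, not part of the patch):

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec ts;

	/* handled in the vDSO when vclock_mode == VCLOCK_PVCLOCK and the
	 * TSC-stable flag is set; falls back to the syscall otherwise */
	clock_gettime(CLOCK_MONOTONIC, &ts);
	printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
	return 0;
}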

Originally from Jeremy Fitzhardinge.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/vdso/vclock_gettime.c
===================================================================
--- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
+++ vsyscall/arch/x86/vdso/vclock_gettime.c
@@ -22,6 +22,7 @@
 #include <asm/hpet.h>
 #include <asm/unistd.h>
 #include <asm/io.h>
+#include <asm/pvclock.h>
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
@@ -62,6 +63,76 @@ static notrace cycle_t vread_hpet(void)
 	return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
 }
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+
+static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
+{
+	const struct pvclock_vsyscall_time_info *pvti_base;
+	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
+	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
+
+	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
+
+	pvti_base = (struct pvclock_vsyscall_time_info *)
+		    __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
+
+	return &pvti_base[offset];
+}
+
+static notrace cycle_t vread_pvclock(int *mode)
+{
+	const struct pvclock_vsyscall_time_info *pvti;
+	cycle_t ret;
+	u64 last;
+	u32 version;
+	u32 migrate_count;
+	u8 flags;
+	unsigned cpu, cpu1;
+
+
+	/*
+	 * When looping to get a consistent (time-info, tsc) pair, we
+	 * also need to deal with the possibility we can switch vcpus,
+	 * so make sure we always re-fetch time-info for the current vcpu.
+	 */
+	do {
+		cpu = __getcpu() & VGETCPU_CPU_MASK;
+		/* TODO: We can put vcpu id into higher bits of pvti.version.
+		 * This will save a couple of cycles by getting rid of
+		 * __getcpu() calls (Gleb).
+		 */
+
+		pvti = get_pvti(cpu);
+
+		migrate_count = pvti->migrate_count;
+
+		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
+
+		/*
+		 * Test we're still on the cpu as well as the version.
+		 * We could have been migrated just after the first
+		 * vgetcpu but before fetching the version, so we
+		 * wouldn't notice a version change.
+		 */
+		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
+	} while (unlikely(cpu != cpu1 ||
+			  (pvti->pvti.version & 1) ||
+			  pvti->pvti.version != version ||
+			  pvti->migrate_count != migrate_count));
+
+	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+		*mode = VCLOCK_NONE;
+
+	/* refer to tsc.c read_tsc() comment for rationale */
+	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
+
+	if (likely(ret >= last))
+		return ret;
+
+	return last;
+}
+#endif
+
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
 	long ret;
@@ -80,7 +151,7 @@ notrace static long vdso_fallback_gtod(s
 }
 
 
-notrace static inline u64 vgetsns(void)
+notrace static inline u64 vgetsns(int *mode)
 {
 	long v;
 	cycles_t cycles;
@@ -88,6 +159,8 @@ notrace static inline u64 vgetsns(void)
 		cycles = vread_tsc();
 	else if (gtod->clock.vclock_mode == VCLOCK_HPET)
 		cycles = vread_hpet();
+	else if (gtod->clock.vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
 	else
 		return 0;
 	v = (cycles - gtod->clock.cycle_last) & gtod->clock.mask;
@@ -107,7 +180,7 @@ notrace static int __always_inline do_re
 		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->wall_time_sec;
 		ns = gtod->wall_time_snsec;
-		ns += vgetsns();
+		ns += vgetsns(&mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 
@@ -127,7 +200,7 @@ notrace static int do_monotonic(struct t
 		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->monotonic_time_sec;
 		ns = gtod->monotonic_time_snsec;
-		ns += vgetsns();
+		ns += vgetsns(&mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 	timespec_add_ns(ts, ns);
Index: vsyscall/arch/x86/include/asm/vsyscall.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/vsyscall.h
+++ vsyscall/arch/x86/include/asm/vsyscall.h
@@ -33,6 +33,23 @@ extern void map_vsyscall(void);
  */
 extern bool emulate_vsyscall(struct pt_regs *regs, unsigned long address);
 
+#define VGETCPU_CPU_MASK 0xfff
+
+static inline unsigned int __getcpu(void)
+{
+	unsigned int p;
+
+	if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
+		/* Load per CPU data from RDTSCP */
+		native_read_tscp(&p);
+	} else {
+		/* Load per CPU data from GDT */
+		asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
+	}
+
+	return p;
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_VSYSCALL_H */
Index: vsyscall/arch/x86/vdso/vgetcpu.c
===================================================================
--- vsyscall.orig/arch/x86/vdso/vgetcpu.c
+++ vsyscall/arch/x86/vdso/vgetcpu.c
@@ -17,15 +17,10 @@ __vdso_getcpu(unsigned *cpu, unsigned *n
 {
 	unsigned int p;
 
-	if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
-		/* Load per CPU data from RDTSCP */
-		native_read_tscp(&p);
-	} else {
-		/* Load per CPU data from GDT */
-		asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
-	}
+	p = __getcpu();
+
 	if (cpu)
-		*cpu = p & 0xfff;
+		*cpu = p & VGETCPU_CPU_MASK;
 	if (node)
 		*node = p >> 12;
 	return 0;




* [patch 12/18] KVM: x86: pass host_tsc to read_l1_tsc
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (10 preceding siblings ...)
  2012-11-15  0:08 ` [patch 11/18] x86: vdso: pvclock gettime support Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15  0:08 ` [patch 13/18] time: export time information for KVM pvclock Marcelo Tosatti
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 12-kvm-read-l1-tsc-pass-tscvalue --]
[-- Type: text/plain, Size: 3372 bytes --]

Allow the caller to pass the host TSC value to kvm_x86_ops->read_l1_tsc().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -703,7 +703,7 @@ struct kvm_x86_ops {
 	void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
 	u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
-	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
+	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
 
 	void (*get_exit_info)(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2);
 
Index: vsyscall/arch/x86/kvm/lapic.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/lapic.c
+++ vsyscall/arch/x86/kvm/lapic.c
@@ -1011,7 +1011,7 @@ static void start_apic_timer(struct kvm_
 		local_irq_save(flags);
 
 		now = apic->lapic_timer.timer.base->get_time();
-		guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+		guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu, native_read_tsc());
 		if (likely(tscdeadline > guest_tsc)) {
 			ns = (tscdeadline - guest_tsc) * 1000000ULL;
 			do_div(ns, this_tsc_khz);
Index: vsyscall/arch/x86/kvm/svm.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/svm.c
+++ vsyscall/arch/x86/kvm/svm.c
@@ -3008,11 +3008,11 @@ static int cr8_write_interception(struct
 	return 0;
 }
 
-u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
+u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
 	struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu));
 	return vmcb->control.tsc_offset +
-		svm_scale_tsc(vcpu, native_read_tsc());
+		svm_scale_tsc(vcpu, host_tsc);
 }
 
 static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
Index: vsyscall/arch/x86/kvm/vmx.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/vmx.c
+++ vsyscall/arch/x86/kvm/vmx.c
@@ -1839,11 +1839,10 @@ static u64 guest_read_tsc(void)
  * Like guest_read_tsc, but always returns L1's notion of the timestamp
  * counter, even if a nested guest (L2) is currently running.
  */
-u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
+u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
-	u64 host_tsc, tsc_offset;
+	u64 tsc_offset;
 
-	rdtscll(host_tsc);
 	tsc_offset = is_guest_mode(vcpu) ?
 		to_vmx(vcpu)->nested.vmcs01_tsc_offset :
 		vmcs_read64(TSC_OFFSET);
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1175,7 +1175,7 @@ static int kvm_guest_time_update(struct 
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
+	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, native_read_tsc());
 	kernel_ns = get_kernel_ns();
 	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 	if (unlikely(this_tsc_khz == 0)) {
@@ -5429,7 +5429,8 @@ static int vcpu_enter_guest(struct kvm_v
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();
 
-	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu,
+							   native_read_tsc());
 
 	vcpu->mode = OUTSIDE_GUEST_MODE;
 	smp_wmb();




* [patch 13/18] time: export time information for KVM pvclock
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (11 preceding siblings ...)
  2012-11-15  0:08 ` [patch 12/18] KVM: x86: pass host_tsc to read_l1_tsc Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15  1:38   ` John Stultz
  2012-11-15  0:08 ` [patch 14/18] KVM: x86: notifier for clocksource changes Marcelo Tosatti
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 13-time-add-pvclock-gtod-data --]
[-- Type: text/plain, Size: 2655 bytes --]

As suggested by John, export time data similarly to how it's done by
vsyscall support. This allows KVM to retrieve the information necessary to
implement vsyscall support in KVM guests.
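
For illustration, a minimal (hypothetical) consumer of the new interface
would look like the sketch below; the KVM notifier added later in this
series follows the same pattern:

#include <linux/kernel.h>
#include <linux/notifier.h>
#include <linux/pvclock_gtod.h>
#include <linux/timekeeper_internal.h>

/* example consumer; names are illustrative only */
static int example_gtod_notify(struct notifier_block *nb, unsigned long unused,
			       void *priv)
{
	struct timekeeper *tk = priv;

	/* snapshot whatever timekeeping data the consumer needs */
	pr_debug("pvclock gtod update: mult=%u shift=%u\n", tk->mult, tk->shift);
	return 0;
}

static struct notifier_block example_gtod_nb = {
	.notifier_call = example_gtod_notify,
};

/* at init time:  pvclock_gtod_register_notifier(&example_gtod_nb); */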

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/include/linux/pvclock_gtod.h
===================================================================
--- /dev/null
+++ vsyscall/include/linux/pvclock_gtod.h
@@ -0,0 +1,9 @@
+#ifndef _PVCLOCK_GTOD_H
+#define _PVCLOCK_GTOD_H
+
+#include <linux/notifier.h>
+
+extern int pvclock_gtod_register_notifier(struct notifier_block *nb);
+extern int pvclock_gtod_unregister_notifier(struct notifier_block *nb);
+
+#endif /* _PVCLOCK_GTOD_H */
Index: vsyscall/kernel/time/timekeeping.c
===================================================================
--- vsyscall.orig/kernel/time/timekeeping.c
+++ vsyscall/kernel/time/timekeeping.c
@@ -21,6 +21,7 @@
 #include <linux/time.h>
 #include <linux/tick.h>
 #include <linux/stop_machine.h>
+#include <linux/pvclock_gtod.h>
 
 
 static struct timekeeper timekeeper;
@@ -180,6 +181,54 @@ static inline s64 timekeeping_get_ns_raw
 	return nsec + arch_gettimeoffset();
 }
 
+static RAW_NOTIFIER_HEAD(pvclock_gtod_chain);
+
+static void update_pvclock_gtod(struct timekeeper *tk)
+{
+	raw_notifier_call_chain(&pvclock_gtod_chain, 0, tk);
+}
+
+/**
+ * pvclock_gtod_register_notifier - register a pvclock timedata update listener
+ *
+ * Must hold write on timekeeper.lock
+ */
+int pvclock_gtod_register_notifier(struct notifier_block *nb)
+{
+	struct timekeeper *tk = &timekeeper;
+	unsigned long flags;
+	int ret;
+
+	write_seqlock_irqsave(&tk->lock, flags);
+	ret = raw_notifier_chain_register(&pvclock_gtod_chain, nb);
+	/* update timekeeping data */
+	update_pvclock_gtod(tk);
+	write_sequnlock_irqrestore(&tk->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pvclock_gtod_register_notifier);
+
+/**
+ * pvclock_gtod_unregister_notifier - unregister a pvclock
+ * timedata update listener
+ *
+ * Must hold write on timekeeper.lock
+ */
+int pvclock_gtod_unregister_notifier(struct notifier_block *nb)
+{
+	struct timekeeper *tk = &timekeeper;
+	unsigned long flags;
+	int ret;
+
+	write_seqlock_irqsave(&tk->lock, flags);
+	ret = raw_notifier_chain_unregister(&pvclock_gtod_chain, nb);
+	write_sequnlock_irqrestore(&tk->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pvclock_gtod_unregister_notifier);
+
 /* must hold write on timekeeper.lock */
 static void timekeeping_update(struct timekeeper *tk, bool clearntp)
 {
@@ -188,6 +237,7 @@ static void timekeeping_update(struct ti
 		ntp_clear();
 	}
 	update_vsyscall(tk);
+	update_pvclock_gtod(tk);
 }
 
 /**




* [patch 14/18] KVM: x86: notifier for clocksource changes
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (12 preceding siblings ...)
  2012-11-15  0:08 ` [patch 13/18] time: export time information for KVM pvclock Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15  0:08 ` [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 15-add-kvm-req-pvclock-gtod-update --]
[-- Type: text/plain, Size: 3971 bytes --]

Register a notifier for the clocksource change event. In case
the host switches to a clock other than the TSC, disable master
clock usage.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -46,6 +46,8 @@
 #include <linux/uaccess.h>
 #include <linux/hash.h>
 #include <linux/pci.h>
+#include <linux/timekeeper_internal.h>
+#include <linux/pvclock_gtod.h>
 #include <trace/events/kvm.h>
 
 #define CREATE_TRACE_POINTS
@@ -899,6 +901,53 @@ static int do_set_msr(struct kvm_vcpu *v
 	return kvm_set_msr(vcpu, index, *data);
 }
 
+struct pvclock_gtod_data {
+	seqcount_t	seq;
+
+	struct { /* extract of a clocksource struct */
+		int vclock_mode;
+		cycle_t	cycle_last;
+		cycle_t	mask;
+		u32	mult;
+		u32	shift;
+	} clock;
+
+	/* open coded 'struct timespec' */
+	u64		monotonic_time_snsec;
+	time_t		monotonic_time_sec;
+};
+
+static struct pvclock_gtod_data pvclock_gtod_data;
+
+static void update_pvclock_gtod(struct timekeeper *tk)
+{
+	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
+
+	write_seqcount_begin(&vdata->seq);
+
+	/* copy pvclock gtod data */
+	vdata->clock.vclock_mode	= tk->clock->archdata.vclock_mode;
+	vdata->clock.cycle_last		= tk->clock->cycle_last;
+	vdata->clock.mask		= tk->clock->mask;
+	vdata->clock.mult		= tk->mult;
+	vdata->clock.shift		= tk->shift;
+
+	vdata->monotonic_time_sec	= tk->xtime_sec
+					+ tk->wall_to_monotonic.tv_sec;
+	vdata->monotonic_time_snsec	= tk->xtime_nsec
+					+ (tk->wall_to_monotonic.tv_nsec
+						<< tk->shift);
+	while (vdata->monotonic_time_snsec >=
+					(((u64)NSEC_PER_SEC) << tk->shift)) {
+		vdata->monotonic_time_snsec -=
+					((u64)NSEC_PER_SEC) << tk->shift;
+		vdata->monotonic_time_sec++;
+	}
+
+	write_seqcount_end(&vdata->seq);
+}
+
+
 static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
 {
 	int version;
@@ -995,6 +1044,8 @@ static inline u64 get_kernel_ns(void)
 	return timespec_to_ns(&ts);
 }
 
+static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
+
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 unsigned long max_tsc_khz;
 
@@ -1227,7 +1278,6 @@ static int kvm_guest_time_update(struct 
 	vcpu->last_kernel_ns = kernel_ns;
 	vcpu->last_guest_tsc = tsc_timestamp;
 
-
 	/*
 	 * The interface expects us to write an even number signaling that the
 	 * update is finished. Since the guest won't see the intermediate
@@ -4894,6 +4944,37 @@ static void kvm_set_mmio_spte_mask(void)
 	kvm_mmu_set_mmio_spte_mask(mask);
 }
 
+static void pvclock_gtod_update_fn(struct work_struct *work)
+{
+}
+
+static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
+
+/*
+ * Notification about pvclock gtod data update.
+ */
+static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
+			       void *priv)
+{
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+	struct timekeeper *tk = priv;
+
+	update_pvclock_gtod(tk);
+
+	/* disable master clock if host does not trust, or does not
+ 	 * use, TSC clocksource
+ 	 */
+	if (gtod->clock.vclock_mode != VCLOCK_TSC &&
+	    atomic_read(&kvm_guest_has_master_clock) != 0)
+		queue_work(system_long_wq, &pvclock_gtod_work);
+
+	return 0;
+}
+
+static struct notifier_block pvclock_gtod_notifier = {
+	.notifier_call = pvclock_gtod_notify,
+};
+
 int kvm_arch_init(void *opaque)
 {
 	int r;
@@ -4935,6 +5016,8 @@ int kvm_arch_init(void *opaque)
 		host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
 
 	kvm_lapic_init();
+	pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
+
 	return 0;
 
 out:
@@ -4949,6 +5032,7 @@ void kvm_arch_exit(void)
 		cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
 					    CPUFREQ_TRANSITION_NOTIFIER);
 	unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
+	pvclock_gtod_unregister_notifier(&pvclock_gtod_notifier);
 	kvm_x86_ops = NULL;
 	kvm_mmu_module_exit();
 }




* [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (13 preceding siblings ...)
  2012-11-15  0:08 ` [patch 14/18] KVM: x86: notifier for clocksource changes Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15  0:08 ` [patch 16/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization Marcelo Tosatti
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 14-host-pass-stable-pvclock-flag --]
[-- Type: text/plain, Size: 13136 bytes --]

KVM added a global variable to guarantee monotonicity in the guest.
One of the reasons for that is that the time between

	1. ktime_get_ts(&timespec);
	2. rdtscll(tsc);

is variable. That is, given a host with a stable TSC, suppose that
two VCPUs read the same time via ktime_get_ts() above.

The time required to execute 2. is not the same on those two instances
executing on different VCPUs (cache misses, interrupts...).

If the TSC value that is used by the host to interpolate when
calculating the monotonic time is the same value used to calculate
the tsc_timestamp value stored in the pvclock data structure, and
a single <system_timestamp, tsc_timestamp> tuple is visible to all
vcpus simultaneously, this problem disappears. See the comment on top
of pvclock_update_vm_gtod_copy for details.

Monotonicity is then guaranteed by the synchronicity of the host TSCs
and guest TSCs.

Set the TSC stable pvclock flag in that case, allowing the guest to read
the clock from userspace.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1186,21 +1186,166 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
+static cycle_t read_tsc(void)
+{
+	cycle_t ret;
+	u64 last;
+
+	/*
+	 * Empirically, a fence (of type that depends on the CPU)
+	 * before rdtsc is enough to ensure that rdtsc is ordered
+	 * with respect to loads.  The various CPU manuals are unclear
+	 * as to whether rdtsc can be reordered with later loads,
+	 * but no one has ever seen it happen.
+	 */
+	rdtsc_barrier();
+	ret = (cycle_t)vget_cycles();
+
+	last = pvclock_gtod_data.clock.cycle_last;
+
+	if (likely(ret >= last))
+		return ret;
+
+	/*
+	 * GCC likes to generate cmov here, but this branch is extremely
+	 * predictable (it's just a funciton of time and the likely is
+	 * very likely) and there's a data dependence, so force GCC
+	 * to generate a branch instead.  I don't barrier() because
+	 * we don't actually need a barrier, and if this function
+	 * ever gets inlined it will generate worse code.
+	 */
+	asm volatile ("");
+	return last;
+}
+
+static inline u64 vgettsc(cycle_t *cycle_now)
+{
+	long v;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	*cycle_now = read_tsc();
+
+	v = (*cycle_now - gtod->clock.cycle_last) & gtod->clock.mask;
+	return v * gtod->clock.mult;
+}
+
+static int do_monotonic(struct timespec *ts, cycle_t *cycle_now)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	ts->tv_nsec = 0;
+	do {
+		seq = read_seqcount_begin(&gtod->seq);
+		mode = gtod->clock.vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_sec;
+		ns = gtod->monotonic_time_snsec;
+		ns += vgettsc(cycle_now);
+		ns >>= gtod->clock.shift;
+	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+	timespec_add_ns(ts, ns);
+
+	return mode;
+}
+
+/* returns true if host is using tsc clocksource */
+static bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now)
+{
+	struct timespec ts;
+
+	/* checked again under seqlock below */
+	if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
+		return false;
+
+	if (do_monotonic(&ts, cycle_now) != VCLOCK_TSC)
+		return false;
+
+	monotonic_to_bootbased(&ts);
+	*kernel_ns = timespec_to_ns(&ts);
+
+	return true;
+}
+
+
+/*
+ *
+ * Assuming a stable TSC across physical CPUS, the following condition
+ * is possible. Each numbered line represents an event visible to both
+ * CPUs at the next numbered event.
+ *
+ * "timespecX" represents host monotonic time. "tscX" represents
+ * RDTSC value.
+ *
+ * 		VCPU0 on CPU0		|	VCPU1 on CPU1
+ *
+ * 1.  read timespec0,tsc0
+ * 2.					| timespec1 = timespec0 + N
+ * 					| tsc1 = tsc0 + M
+ * 3. transition to guest		| transition to guest
+ * 4. ret0 = timespec0 + (rdtsc - tsc0) |
+ * 5.				        | ret1 = timespec1 + (rdtsc - tsc1)
+ * 				        | ret1 = timespec0 + N + (rdtsc - (tsc0 + M))
+ *
+ * Since ret0 update is visible to VCPU1 at time 5, to obey monotonicity:
+ *
+ * 	- ret0 < ret1
+ *	- timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
+ *		...
+ *	- 0 < N - M => M < N
+ *
+ * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
+ * always the case (the difference between two distinct xtime instances
+ * might be smaller than the difference between corresponding TSC reads,
+ * when updating guest vcpus pvclock areas).
+ *
+ * To avoid that problem, do not allow visibility of distinct
+ * system_timestamp/tsc_timestamp values simultaneously: use a master
+ * copy of host monotonic time values. Update that master copy
+ * in lockstep.
+ *
+ * Rely on synchronization of host TSCs for monotonicity.
+ *
+ */
+
+static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
+{
+	struct kvm_arch *ka = &kvm->arch;
+	int vclock_mode;
+
+	/*
+ 	 * If the host uses TSC clock, then passthrough TSC as stable
+	 * to the guest.
+	 */
+	ka->use_master_clock = kvm_get_time_and_clockread(
+					&ka->master_kernel_ns,
+					&ka->master_cycle_now);
+
+	if (ka->use_master_clock)
+		atomic_set(&kvm_guest_has_master_clock, 1);
+
+	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
+	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode);
+}
+
 static int kvm_guest_time_update(struct kvm_vcpu *v)
 {
-	unsigned long flags;
+	unsigned long flags, this_tsc_khz;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
+	struct kvm_arch *ka = &v->kvm->arch;
 	void *shared_kaddr;
-	unsigned long this_tsc_khz;
 	s64 kernel_ns, max_kernel_ns;
-	u64 tsc_timestamp;
+	u64 tsc_timestamp, host_tsc;
 	struct pvclock_vcpu_time_info *guest_hv_clock;
 	u8 pvclock_flags;
+	bool use_master_clock;
+
+	kernel_ns = 0;
+	host_tsc = 0;
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, native_read_tsc());
-	kernel_ns = get_kernel_ns();
 	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 	if (unlikely(this_tsc_khz == 0)) {
 		local_irq_restore(flags);
@@ -1208,6 +1353,24 @@ static int kvm_guest_time_update(struct 
 		return 1;
 	}
 
+  	/*
+  	 * If the host uses TSC clock, then passthrough TSC as stable
+ 	 * to the guest.
+ 	 */
+ 	spin_lock(&ka->pvclock_gtod_sync_lock);
+ 	use_master_clock = ka->use_master_clock;
+ 	if (use_master_clock) {
+ 		host_tsc = ka->master_cycle_now;
+ 		kernel_ns = ka->master_kernel_ns;
+ 	}
+ 	spin_unlock(&ka->pvclock_gtod_sync_lock);
+ 	if (!use_master_clock) {
+ 		host_tsc = native_read_tsc();
+ 		kernel_ns = get_kernel_ns();
+ 	}
+
+ 	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, host_tsc);
+
 	/*
 	 * We may have to catch up the TSC to match elapsed wall clock
 	 * time for two reasons, even if kvmclock is used.
@@ -1269,9 +1432,14 @@ static int kvm_guest_time_update(struct 
 		vcpu->hw_tsc_khz = this_tsc_khz;
 	}
 
-	if (max_kernel_ns > kernel_ns)
-		kernel_ns = max_kernel_ns;
-
+	/* with a master <monotonic time, tsc value> tuple,
+	 * pvclock clock reads always increase at the (scaled) rate
+	 * of guest TSC - no need to deal with sampling errors.
+	 */
+	if (!use_master_clock) {
+		if (max_kernel_ns > kernel_ns)
+			kernel_ns = max_kernel_ns;
+	}
 	/* With all the info we got, fill in the values */
 	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
 	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
@@ -1297,6 +1465,10 @@ static int kvm_guest_time_update(struct 
 		vcpu->pvclock_set_guest_stopped_request = false;
 	}
 
+	/* If the host uses TSC clocksource, then it is stable */
+	if (use_master_clock)
+		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
+
 	vcpu->hv_clock.flags = pvclock_flags;
 
 	memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
@@ -4946,6 +5118,17 @@ static void kvm_set_mmio_spte_mask(void)
 
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
+	struct kvm *kvm;
+
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	raw_spin_lock(&kvm_lock);
+	list_for_each_entry(kvm, &vm_list, vm_list)
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			set_bit(KVM_REQ_MASTERCLOCK_UPDATE, &vcpu->requests);
+	atomic_set(&kvm_guest_has_master_clock, 0);
+	raw_spin_unlock(&kvm_lock);
 }
 
 static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
@@ -5332,6 +5515,28 @@ static void process_nmi(struct kvm_vcpu 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 
+static void kvm_gen_update_masterclock(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	struct kvm_arch *ka = &kvm->arch;
+
+	spin_lock(&ka->pvclock_gtod_sync_lock);
+	kvm_make_mclock_inprogress_request(kvm);
+	/* no guest entries from this point */
+	pvclock_update_vm_gtod_copy(kvm);
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
+
+	/* guest entries allowed */
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);
+
+	spin_unlock(&ka->pvclock_gtod_sync_lock);
+
+}
+
 static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 {
 	int r;
@@ -5344,6 +5549,8 @@ static int vcpu_enter_guest(struct kvm_v
 			kvm_mmu_unload(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
 			__kvm_migrate_timers(vcpu);
+		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
+			kvm_gen_update_masterclock(vcpu->kvm);
 		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
 			r = kvm_guest_time_update(vcpu);
 			if (unlikely(r))
@@ -6248,6 +6455,8 @@ int kvm_arch_hardware_enable(void *garba
 			kvm_for_each_vcpu(i, vcpu, kvm) {
 				vcpu->arch.tsc_offset_adjustment += delta_cyc;
 				vcpu->arch.last_host_tsc = local_tsc;
+				set_bit(KVM_REQ_MASTERCLOCK_UPDATE,
+					&vcpu->requests);
 			}
 
 			/*
@@ -6385,6 +6594,9 @@ int kvm_arch_init_vm(struct kvm *kvm, un
 
 	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
 	mutex_init(&kvm->arch.apic_map_lock);
+	spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock);
+
+	pvclock_update_vm_gtod_copy(kvm);
 
 	return 0;
 }
Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -22,6 +22,7 @@
 #include <linux/kvm_para.h>
 #include <linux/kvm_types.h>
 #include <linux/perf_event.h>
+#include <linux/pvclock_gtod.h>
 
 #include <asm/pvclock-abi.h>
 #include <asm/desc.h>
@@ -560,6 +561,11 @@ struct kvm_arch {
 	u64 cur_tsc_offset;
 	u8  cur_tsc_generation;
 
+	spinlock_t pvclock_gtod_sync_lock;
+	bool use_master_clock;
+	u64 master_kernel_ns;
+	cycle_t master_cycle_now;
+
 	struct kvm_xen_hvm_config xen_hvm_config;
 
 	/* fields used by HYPER-V emulation */
Index: vsyscall/include/linux/kvm_host.h
===================================================================
--- vsyscall.orig/include/linux/kvm_host.h
+++ vsyscall/include/linux/kvm_host.h
@@ -118,6 +118,8 @@ static inline bool is_error_page(struct 
 #define KVM_REQ_IMMEDIATE_EXIT    15
 #define KVM_REQ_PMU               16
 #define KVM_REQ_PMI               17
+#define KVM_REQ_MASTERCLOCK_UPDATE  18
+#define KVM_REQ_MCLOCK_INPROGRESS 19
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID		0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
@@ -527,6 +529,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *
 
 void kvm_flush_remote_tlbs(struct kvm *kvm);
 void kvm_reload_remote_mmus(struct kvm *kvm);
+void kvm_make_mclock_inprogress_request(struct kvm *kvm);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
Index: vsyscall/virt/kvm/kvm_main.c
===================================================================
--- vsyscall.orig/virt/kvm/kvm_main.c
+++ vsyscall/virt/kvm/kvm_main.c
@@ -212,6 +212,11 @@ void kvm_reload_remote_mmus(struct kvm *
 	make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
 }
 
+void kvm_make_mclock_inprogress_request(struct kvm *kvm)
+{
+	make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
+}
+
 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 {
 	struct page *page;
Index: vsyscall/arch/x86/kvm/trace.h
===================================================================
--- vsyscall.orig/arch/x86/kvm/trace.h
+++ vsyscall/arch/x86/kvm/trace.h
@@ -4,6 +4,7 @@
 #include <linux/tracepoint.h>
 #include <asm/vmx.h>
 #include <asm/svm.h>
+#include <asm/clocksource.h>
 
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvm
@@ -754,6 +755,31 @@ TRACE_EVENT(
 		  __entry->write ? "Write" : "Read",
 		  __entry->gpa_match ? "GPA" : "GVA")
 );
+
+#define host_clocks				\
+	{VCLOCK_NONE, "none"},			\
+	{VCLOCK_TSC,  "tsc"},			\
+	{VCLOCK_HPET, "hpet"}			\
+
+TRACE_EVENT(kvm_update_master_clock,
+	TP_PROTO(bool use_master_clock, unsigned int host_clock),
+	TP_ARGS(use_master_clock, host_clock),
+
+	TP_STRUCT__entry(
+		__field(		bool,	use_master_clock	)
+		__field(	unsigned int,	host_clock		)
+	),
+
+	TP_fast_assign(
+		__entry->use_master_clock	= use_master_clock;
+		__entry->host_clock		= host_clock;
+	),
+
+	TP_printk("masterclock %d hostclock %s",
+		  __entry->use_master_clock,
+		  __print_symbolic(__entry->host_clock, host_clocks))
+);
+
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 16/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (14 preceding siblings ...)
  2012-11-15  0:08 ` [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15  0:08 ` [patch 17/18] KVM: x86: require matched TSC offsets for master clock Marcelo Tosatti
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 16-add-kvm-add-vcpu-postcreate --]
[-- Type: text/plain, Size: 3730 bytes --]

TSC initialization will soon make use of online_vcpus.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/ia64/kvm/kvm-ia64.c
===================================================================
--- vsyscall.orig/arch/ia64/kvm/kvm-ia64.c
+++ vsyscall/arch/ia64/kvm/kvm-ia64.c
@@ -1330,6 +1330,11 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu 
 	return 0;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
 	return -EINVAL;
Index: vsyscall/arch/powerpc/kvm/powerpc.c
===================================================================
--- vsyscall.orig/arch/powerpc/kvm/powerpc.c
+++ vsyscall/arch/powerpc/kvm/powerpc.c
@@ -354,6 +354,11 @@ struct kvm_vcpu *kvm_arch_vcpu_create(st
 	return vcpu;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	/* Make sure we're not using the vcpu anymore */
Index: vsyscall/arch/s390/kvm/kvm-s390.c
===================================================================
--- vsyscall.orig/arch/s390/kvm/kvm-s390.c
+++ vsyscall/arch/s390/kvm/kvm-s390.c
@@ -355,6 +355,11 @@ static void kvm_s390_vcpu_initial_reset(
 	atomic_set_mask(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
 {
 	atomic_set(&vcpu->arch.sie_block->cpuflags, CPUSTAT_ZARCH |
Index: vsyscall/arch/x86/kvm/svm.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/svm.c
+++ vsyscall/arch/x86/kvm/svm.c
@@ -1254,7 +1254,6 @@ static struct kvm_vcpu *svm_create_vcpu(
 	svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT;
 	svm->asid_generation = 0;
 	init_vmcb(svm);
-	kvm_write_tsc(&svm->vcpu, 0);
 
 	err = fx_init(&svm->vcpu);
 	if (err)
Index: vsyscall/arch/x86/kvm/vmx.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/vmx.c
+++ vsyscall/arch/x86/kvm/vmx.c
@@ -3896,8 +3896,6 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL);
 	set_cr4_guest_host_mask(vmx);
 
-	kvm_write_tsc(&vmx->vcpu, 0);
-
 	return 0;
 }
 
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -6289,6 +6289,19 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu 
 	return r;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	int r;
+
+	r = vcpu_load(vcpu);
+	if (r)
+		return r;
+	kvm_write_tsc(vcpu, 0);
+	vcpu_put(vcpu);
+
+	return r;
+}
+
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
 	int r;
Index: vsyscall/include/linux/kvm_host.h
===================================================================
--- vsyscall.orig/include/linux/kvm_host.h
+++ vsyscall/include/linux/kvm_host.h
@@ -583,6 +583,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
 struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id);
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
 
 int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu);
Index: vsyscall/virt/kvm/kvm_main.c
===================================================================
--- vsyscall.orig/virt/kvm/kvm_main.c
+++ vsyscall/virt/kvm/kvm_main.c
@@ -1855,6 +1855,7 @@ static int kvm_vm_ioctl_create_vcpu(stru
 	atomic_inc(&kvm->online_vcpus);
 
 	mutex_unlock(&kvm->lock);
+	kvm_arch_vcpu_postcreate(vcpu);
 	return r;
 
 unlock_vcpu_destroy:



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 17/18] KVM: x86: require matched TSC offsets for master clock
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (15 preceding siblings ...)
  2012-11-15  0:08 ` [patch 16/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15  0:08 ` [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration Marcelo Tosatti
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
  18 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 17-masterclock-require-matched-tsc --]
[-- Type: text/plain, Size: 6962 bytes --]

With master clock, a pvclock clock read calculates:

ret = system_timestamp + [ (rdtsc + tsc_offset) - tsc_timestamp ]

Where 'rdtsc' is the host TSC.

system_timestamp and tsc_timestamp are unique, one tuple 
per VM: the "master clock".

Given a host with synchronized TSCs, it is obvious that
the guest TSCs must be matched for the above to guarantee monotonicity.

Allow master clock usage only if guest TSCs are synchronized.
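
An illustrative example (made-up numbers, scale of 1 ns/cycle, master
tuple system_timestamp = 0, tsc_timestamp = 0, per-vcpu offsets
tsc_offset0 = 1000 and tsc_offset1 = 0):

	ret0 = 0 + ((5000 + 1000) - 0) = 6000 ns	(VCPU0, reads first)
	ret1 = 0 + ((5100 +    0) - 0) = 5100 ns	(VCPU1, reads later)

Although VCPU1 reads later in real time, it returns a smaller value:
the clock appears to jump backwards across vcpus. With matched offsets
the two reads differ only by the host TSC delta and monotonicity holds.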

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -560,6 +560,7 @@ struct kvm_arch {
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
 	u8  cur_tsc_generation;
+	int nr_vcpus_matched_tsc;
 
 	spinlock_t pvclock_gtod_sync_lock;
 	bool use_master_clock;
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1097,12 +1097,38 @@ static u64 compute_guest_tsc(struct kvm_
 	return tsc;
 }
 
+void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
+{
+	bool vcpus_matched;
+	bool do_request = false;
+	struct kvm_arch *ka = &vcpu->kvm->arch;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
+			 atomic_read(&vcpu->kvm->online_vcpus));
+
+	if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
+		if (!ka->use_master_clock)
+			do_request = 1;
+
+	if (!vcpus_matched && ka->use_master_clock)
+			do_request = 1;
+
+	if (do_request)
+		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+
+	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
+			    atomic_read(&vcpu->kvm->online_vcpus),
+		            ka->use_master_clock, gtod->clock.vclock_mode);
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
 	s64 usdiff;
+	bool matched;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
@@ -1145,6 +1171,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 			offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
 			pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
 		}
+		matched = true;
 	} else {
 		/*
 		 * We split periods of matched TSC writes into generations.
@@ -1159,6 +1186,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 		kvm->arch.cur_tsc_nsec = ns;
 		kvm->arch.cur_tsc_write = data;
 		kvm->arch.cur_tsc_offset = offset;
+		matched = false;
 		pr_debug("kvm: new tsc generation %u, clock %llu\n",
 			 kvm->arch.cur_tsc_generation, data);
 	}
@@ -1182,6 +1210,15 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
 	kvm_x86_ops->write_tsc_offset(vcpu, offset);
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
+
+	spin_lock(&kvm->arch.pvclock_gtod_sync_lock);
+	if (matched)
+		kvm->arch.nr_vcpus_matched_tsc++;
+	else
+		kvm->arch.nr_vcpus_matched_tsc = 0;
+
+	kvm_track_tsc_matching(vcpu);
+	spin_unlock(&kvm->arch.pvclock_gtod_sync_lock);
 }
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
@@ -1271,8 +1308,9 @@ static bool kvm_get_time_and_clockread(s
 
 /*
  *
- * Assuming a stable TSC across physical CPUS, the following condition
- * is possible. Each numbered line represents an event visible to both
+ * Assuming a stable TSC across physical CPUS, and a stable TSC
+ * across virtual CPUs, the following condition is possible.
+ * Each numbered line represents an event visible to both
  * CPUs at the next numbered event.
  *
  * "timespecX" represents host monotonic time. "tscX" represents
@@ -1305,7 +1343,7 @@ static bool kvm_get_time_and_clockread(s
  * copy of host monotonic time values. Update that master copy
  * in lockstep.
  *
- * Rely on synchronization of host TSCs for monotonicity.
+ * Rely on synchronization of host TSCs and guest TSCs for monotonicity.
  *
  */
 
@@ -1313,20 +1351,27 @@ static void pvclock_update_vm_gtod_copy(
 {
 	struct kvm_arch *ka = &kvm->arch;
 	int vclock_mode;
+	bool host_tsc_clocksource, vcpus_matched;
+
+	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
+				atomic_read(&kvm->online_vcpus));
 
 	/*
  	 * If the host uses TSC clock, then passthrough TSC as stable
 	 * to the guest.
 	 */
-	ka->use_master_clock = kvm_get_time_and_clockread(
+	host_tsc_clocksource = kvm_get_time_and_clockread(
 					&ka->master_kernel_ns,
 					&ka->master_cycle_now);
 
+	ka->use_master_clock = host_tsc_clocksource & vcpus_matched;
+
 	if (ka->use_master_clock)
 		atomic_set(&kvm_guest_has_master_clock, 1);
 
 	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
-	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode);
+	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode,
+				      vcpus_matched);
 }
 
 static int kvm_guest_time_update(struct kvm_vcpu *v)
Index: vsyscall/arch/x86/kvm/trace.h
===================================================================
--- vsyscall.orig/arch/x86/kvm/trace.h
+++ vsyscall/arch/x86/kvm/trace.h
@@ -762,21 +762,54 @@ TRACE_EVENT(
 	{VCLOCK_HPET, "hpet"}			\
 
 TRACE_EVENT(kvm_update_master_clock,
-	TP_PROTO(bool use_master_clock, unsigned int host_clock),
-	TP_ARGS(use_master_clock, host_clock),
+	TP_PROTO(bool use_master_clock, unsigned int host_clock, bool offset_matched),
+	TP_ARGS(use_master_clock, host_clock, offset_matched),
 
 	TP_STRUCT__entry(
 		__field(		bool,	use_master_clock	)
 		__field(	unsigned int,	host_clock		)
+		__field(		bool,	offset_matched		)
 	),
 
 	TP_fast_assign(
 		__entry->use_master_clock	= use_master_clock;
 		__entry->host_clock		= host_clock;
+		__entry->offset_matched		= offset_matched;
 	),
 
-	TP_printk("masterclock %d hostclock %s",
+	TP_printk("masterclock %d hostclock %s offsetmatched %u",
 		  __entry->use_master_clock,
+		  __print_symbolic(__entry->host_clock, host_clocks),
+		  __entry->offset_matched)
+);
+
+TRACE_EVENT(kvm_track_tsc,
+	TP_PROTO(unsigned int vcpu_id, unsigned int nr_matched,
+		 unsigned int online_vcpus, bool use_master_clock,
+		 unsigned int host_clock),
+	TP_ARGS(vcpu_id, nr_matched, online_vcpus, use_master_clock,
+		host_clock),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	vcpu_id			)
+		__field(	unsigned int,	nr_vcpus_matched_tsc	)
+		__field(	unsigned int,	online_vcpus		)
+		__field(	bool,		use_master_clock	)
+		__field(	unsigned int,	host_clock		)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id		= vcpu_id;
+		__entry->nr_vcpus_matched_tsc	= nr_matched;
+		__entry->online_vcpus		= online_vcpus;
+		__entry->use_master_clock	= use_master_clock;
+		__entry->host_clock		= host_clock;
+	),
+
+	TP_printk("vcpu_id %u masterclock %u offsetmatched %u nr_online %u"
+		  " hostclock %s",
+		  __entry->vcpu_id, __entry->use_master_clock,
+		  __entry->nr_vcpus_matched_tsc, __entry->online_vcpus,
 		  __print_symbolic(__entry->host_clock, host_clocks))
 );
 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (16 preceding siblings ...)
  2012-11-15  0:08 ` [patch 17/18] KVM: x86: require matched TSC offsets for master clock Marcelo Tosatti
@ 2012-11-15  0:08 ` Marcelo Tosatti
  2012-11-15 12:34   ` Glauber Costa
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
  18 siblings, 1 reply; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-15  0:08 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 18-do-not-writeclock-on-cpu-migration --]
[-- Type: text/plain, Size: 886 bytes --]

As requested by Glauber, do not update kvmclock area on vcpu->pcpu 
migration, in case the host has stable TSC. 

This is to reduce cacheline bouncing.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -2615,7 +2615,12 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
 			kvm_x86_ops->write_tsc_offset(vcpu, offset);
 			vcpu->arch.tsc_catchup = 1;
 		}
-		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		/*
+ 		 * On a host with synchronized TSC, there is no need to update
+ 		 * kvmclock on vcpu->cpu migration
+ 		 */
+		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
+			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		if (vcpu->cpu != cpu)
 			kvm_migrate_timers(vcpu);
 		vcpu->cpu = cpu;



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 13/18] time: export time information for KVM pvclock
  2012-11-15  0:08 ` [patch 13/18] time: export time information for KVM pvclock Marcelo Tosatti
@ 2012-11-15  1:38   ` John Stultz
  0 siblings, 0 replies; 49+ messages in thread
From: John Stultz @ 2012-11-15  1:38 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, jeremy, glommer, zamsden, gleb, avi, pbonzini

On 11/14/2012 04:08 PM, Marcelo Tosatti wrote:
> As suggested by John, export time data similarly to how its
> done by vsyscall support. This allows KVM to retrieve necessary
> information to implement vsyscall support in KVM guests.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Thanks for the updates here.  The notifier method is interesting, and if 
it works well, we may want to extend it later to cover the vsyscall code 
too, but that can be done in a later iteration.

Acked-by: John Stultz <johnstul@us.ibm.com>

thanks
-john


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 08/18] sched: add notifier for cross-cpu migrations
  2012-11-15  0:08 ` [patch 08/18] sched: add notifier for cross-cpu migrations Marcelo Tosatti
@ 2012-11-15  7:01   ` Gleb Natapov
  0 siblings, 0 replies; 49+ messages in thread
From: Gleb Natapov @ 2012-11-15  7:01 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, johnstul, jeremy, glommer, zamsden, avi, pbonzini,
	a.p.zijlstra, mingo

CCing Peter and Ingo.

On Wed, Nov 14, 2012 at 10:08:31PM -0200, Marcelo Tosatti wrote:
> Originally from Jeremy Fitzhardinge.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: vsyscall/include/linux/sched.h
> ===================================================================
> --- vsyscall.orig/include/linux/sched.h
> +++ vsyscall/include/linux/sched.h
> @@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void)
>  extern void calc_global_load(unsigned long ticks);
>  extern void update_cpu_load_nohz(void);
>  
> +/* Notifier for when a task gets migrated to a new CPU */
> +struct task_migration_notifier {
> +	struct task_struct *task;
> +	int from_cpu;
> +	int to_cpu;
> +};
> +extern void register_task_migration_notifier(struct notifier_block *n);
> +
>  extern unsigned long get_parent_ip(unsigned long addr);
>  
>  struct seq_file;
> Index: vsyscall/kernel/sched/core.c
> ===================================================================
> --- vsyscall.orig/kernel/sched/core.c
> +++ vsyscall/kernel/sched/core.c
> @@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s
>  		rq->skip_clock_update = 1;
>  }
>  
> +static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
> +
> +void register_task_migration_notifier(struct notifier_block *n)
> +{
> +	atomic_notifier_chain_register(&task_migration_notifier, n);
> +}
> +
>  #ifdef CONFIG_SMP
>  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>  {
> @@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p,
>  	trace_sched_migrate_task(p, new_cpu);
>  
>  	if (task_cpu(p) != new_cpu) {
> +		struct task_migration_notifier tmn;
> +
>  		p->se.nr_migrations++;
>  		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
> +
> +		tmn.task = p;
> +		tmn.from_cpu = task_cpu(p);
> +		tmn.to_cpu = new_cpu;
> +
> +		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
>  	}
>  
>  	__set_task_cpu(p, new_cpu);
> 
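
[For reference, a minimal consumer of the interface added above would
look roughly like the sketch below; the callback body is purely
illustrative and not taken from this series.]

	/* needs <linux/notifier.h>, <linux/sched.h> */
	static int task_migration_cb(struct notifier_block *nb,
				     unsigned long unused, void *data)
	{
		struct task_migration_notifier *mn = data;

		/* called atomically from set_task_cpu(); keep it short */
		pr_debug("task %d migrating: cpu %d -> %d\n",
			 task_pid_nr(mn->task), mn->from_cpu, mn->to_cpu);

		return NOTIFY_DONE;
	}

	static struct notifier_block task_migration_nb = {
		.notifier_call = task_migration_cb,
	};

	/* e.g. from an __init function: */
	register_task_migration_notifier(&task_migration_nb);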

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 05/18] x86: pvclock: create helper for pvclock data retrieval
  2012-11-15  0:08 ` [patch 05/18] x86: pvclock: create helper for pvclock data retrieval Marcelo Tosatti
@ 2012-11-15 12:27   ` Glauber Costa
  0 siblings, 0 replies; 49+ messages in thread
From: Glauber Costa @ 2012-11-15 12:27 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
> Originally from Jeremy Fitzhardinge.
> 
> So code can be reused.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
I thought I had acked this one already?

But maybe I didn't...

Acked-by: Glauber Costa <glommer@parallels.com>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 06/18] x86: pvclock: introduce helper to read flags
  2012-11-15  0:08 ` [patch 06/18] x86: pvclock: introduce helper to read flags Marcelo Tosatti
@ 2012-11-15 12:28   ` Glauber Costa
  0 siblings, 0 replies; 49+ messages in thread
From: Glauber Costa @ 2012-11-15 12:28 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> 
> Index: vsyscall/arch/x86/kernel/pvclock.c
> ===================================================================
> --- vsyscall.orig/arch/x86/kernel/pvclock.c
Acked-by: Glauber Costa <glommer@parallels.com>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 07/18] x86: pvclock: add note about rdtsc barriers
  2012-11-15  0:08 ` [patch 07/18] x86: pvclock: add note about rdtsc barriers Marcelo Tosatti
@ 2012-11-15 12:30   ` Glauber Costa
  2012-11-16  2:05     ` [patch 07/18] x86: pvclock: add note about rdtsc barriers (v2) Marcelo Tosatti
  0 siblings, 1 reply; 49+ messages in thread
From: Glauber Costa @ 2012-11-15 12:30 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
> As noted by Gleb, not advertising SSE2 support implies
> no RDTSC barriers.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

And this gets a separate patch because?

> Index: vsyscall/arch/x86/include/asm/pvclock.h
> ===================================================================
> --- vsyscall.orig/arch/x86/include/asm/pvclock.h
> +++ vsyscall/arch/x86/include/asm/pvclock.h
> @@ -74,6 +74,9 @@ unsigned __pvclock_read_cycles(const str
>  	u8 ret_flags;
>  
>  	version = src->version;
> +	/* Note: emulated platforms which do not advertise SSE2 support
> + 	 * result in kvmclock not using the necessary RDTSC barriers.
> + 	 */

And the expected effects are? Will it work in that case? Any precautions
one must take? Is it safe for Xen? Is it safe for KVM?

Those are the types of things I'd expect to see in a comment for
something as subtle as this.

>  	rdtsc_barrier();
>  	offset = pvclock_get_nsec_offset(src);
>  	ret = src->system_time + offset;
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration
  2012-11-15  0:08 ` [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration Marcelo Tosatti
@ 2012-11-15 12:34   ` Glauber Costa
  0 siblings, 0 replies; 49+ messages in thread
From: Glauber Costa @ 2012-11-15 12:34 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
> As requested by Glauber, do not update kvmclock area on vcpu->pcpu 
> migration, in case the host has stable TSC. 
> 
> This is to reduce cacheline bouncing.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This looks fine, but it can always get tricky...
Assuming you tested this change in at least one stable tsc and one
unstable tsc system:

Acked-by: Glauber Costa <glommer@parallels.com>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 02/18] x86: kvmclock: allocate pvclock shared memory area
  2012-11-15  0:08 ` [patch 02/18] x86: kvmclock: allocate pvclock shared memory area Marcelo Tosatti
@ 2012-11-15 17:05   ` Glauber Costa
  2012-11-16  2:07     ` [patch 02/18] x86: kvmclock: allocate pvclock shared memory area (v2) Marcelo Tosatti
  0 siblings, 1 reply; 49+ messages in thread
From: Glauber Costa @ 2012-11-15 17:05 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
> We want to expose the pvclock shared memory areas, which 
> the hypervisor periodically updates, to userspace.
> 
> For a linear mapping from userspace, it is necessary that
> entire page sized regions are used for array of pvclock 
> structures.
> 
> There is no such guarantee with per cpu areas, therefore move
> to memblock_alloc based allocation.
Ok.

> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
For the concept:

Acked-by: Glauber Costa <glommer@parallels.com>

I do have one comment.

> 
> Index: vsyscall/arch/x86/kernel/kvmclock.c
> ===================================================================
> --- vsyscall.orig/arch/x86/kernel/kvmclock.c
> +++ vsyscall/arch/x86/kernel/kvmclock.c
> @@ -23,6 +23,7 @@
>  #include <asm/apic.h>
>  #include <linux/percpu.h>
>  #include <linux/hardirq.h>
> +#include <linux/memblock.h>
>  
>  #include <asm/x86_init.h>
>  #include <asm/reboot.h>
> @@ -39,7 +40,11 @@ static int parse_no_kvmclock(char *arg)
>  early_param("no-kvmclock", parse_no_kvmclock);
>  
>  /* The hypervisor will put information about time periodically here */
> -static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
> +struct pvclock_aligned_vcpu_time_info {
> +	struct pvclock_vcpu_time_info clock;
> +} __attribute__((__aligned__(SMP_CACHE_BYTES)));
> +
> +static struct pvclock_aligned_vcpu_time_info *hv_clock;
>  static struct pvclock_wall_clock wall_clock;
>  
>  /*
> @@ -52,15 +57,20 @@ static unsigned long kvm_get_wallclock(v
>  	struct pvclock_vcpu_time_info *vcpu_time;
>  	struct timespec ts;
>  	int low, high;
> +	int cpu;
> +
> +	preempt_disable();
> +	cpu = smp_processor_id();
>  
>  	low = (int)__pa_symbol(&wall_clock);
>  	high = ((u64)__pa_symbol(&wall_clock) >> 32);
>  
>  	native_write_msr(msr_kvm_wall_clock, low, high);
>  
> -	vcpu_time = &get_cpu_var(hv_clock);
> +	vcpu_time = &hv_clock[cpu].clock;


You are leaving preempt disabled for a lot longer than it should be. In
particular, you'll have a round trip to the hypervisor in the middle. It
doesn't matter *that* much because wallclock is mostly a one-time thing,
so I won't oppose on this basis.

But if you have the chance, either fix it, or tell us why preemption
needs to stay disabled for longer than it was before.

>  void __init kvmclock_init(void)
>  {
> +	unsigned long mem;
> +
>  	if (!kvm_para_available())
>  		return;
>  
> +	mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * NR_CPUS,
> +			     PAGE_SIZE);
> +	if (!mem)
> +		return;
> +	hv_clock = __va(mem);
> +
>  	if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
>  		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
>  		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
> 
>
If you don't have kvmclock enabled, which the next line in here would
test for, you will exit this function with allocated memory. How about
the following?

if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
    msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
    msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
    return;

mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info)
                     * NR_CPUS, PAGE_SIZE);
if (!mem)
    return;
hv_clock = __va(mem);

printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
       msr_kvm_system_time, msr_kvm_wall_clock);

if (kvm_register_clock("boot clock")) {
    memblock_free(mem, sizeof(struct pvclock_aligned_vcpu_time_info) * NR_CPUS);
    return;
}





^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 07/18] x86: pvclock: add note about rdtsc barriers (v2)
  2012-11-15 12:30   ` Glauber Costa
@ 2012-11-16  2:05     ` Marcelo Tosatti
  0 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-16  2:05 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini


As noted by Gleb, not advertising SSE2 support implies
no RDTSC barriers.
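
The SSE2 connection, roughly (simplified from the kernels of this
period; treat it as a sketch rather than the exact source):

	static __always_inline void rdtsc_barrier(void)
	{
		/* mfence/lfence are SSE2 instructions; the feature bits
		 * below are only set when the CPU advertises SSE2, so
		 * without it these alternatives remain NOPs. */
		alternative(ASM_NOP3, "mfence", X86_FEATURE_MFENCE_RDTSC);
		alternative(ASM_NOP3, "lfence", X86_FEATURE_LFENCE_RDTSC);
	}

Hence an emulated CPU model without SSE2 gets no fence at all around
RDTSC, which is what the comment added below warns about.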

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -74,6 +74,12 @@ unsigned __pvclock_read_cycles(const str
 	u8 ret_flags;
 
 	version = src->version;
+	/* Note: emulated platforms which do not advertise SSE2 support
+ 	 * result in kvmclock not using the necessary RDTSC barriers.
+ 	 * Without barriers, it is possible that RDTSC instruction reads from
+ 	 * the time stamp counter outside rdtsc_barrier protected section
+ 	 * below, resulting in violation of monotonicity.
+ 	 */
 	rdtsc_barrier();
 	offset = pvclock_get_nsec_offset(src);
 	ret = src->system_time + offset;

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 02/18] x86: kvmclock: allocate pvclock shared memory area (v2)
  2012-11-15 17:05   ` Glauber Costa
@ 2012-11-16  2:07     ` Marcelo Tosatti
  2012-11-16  7:06       ` Glauber Costa
  0 siblings, 1 reply; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-16  2:07 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini


We want to expose the pvclock shared memory areas, which 
the hypervisor periodically updates, to userspace.

For a linear mapping from userspace, it is necessary that
entire page-sized regions are used for the array of pvclock
structures.

There is no such guarantee with per-CPU areas; therefore, move
to memblock_alloc-based allocation.
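
As a worked example (illustrative numbers): pvclock_vcpu_time_info is
32 bytes on x86 and is padded here to SMP_CACHE_BYTES, so assuming
64-byte cache lines and NR_CPUS = 64 the allocation is

	64 bytes/entry * 64 cpus = 4096 bytes = exactly 1 page

and every page of the array holds a whole number of per-cpu entries,
which is what the later vsyscall patch needs in order to map the region
linearly into userspace. Per-cpu variables give no such page-granularity
guarantee.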

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/kvmclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/kvmclock.c
+++ vsyscall/arch/x86/kernel/kvmclock.c
@@ -23,6 +23,7 @@
 #include <asm/apic.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
+#include <linux/memblock.h>
 
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
@@ -39,7 +40,11 @@ static int parse_no_kvmclock(char *arg)
 early_param("no-kvmclock", parse_no_kvmclock);
 
 /* The hypervisor will put information about time periodically here */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
+struct pvclock_aligned_vcpu_time_info {
+	struct pvclock_vcpu_time_info clock;
+} __attribute__((__aligned__(SMP_CACHE_BYTES)));
+
+static struct pvclock_aligned_vcpu_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
 /*
@@ -52,15 +57,20 @@ static unsigned long kvm_get_wallclock(v
 	struct pvclock_vcpu_time_info *vcpu_time;
 	struct timespec ts;
 	int low, high;
+	int cpu;
 
 	low = (int)__pa_symbol(&wall_clock);
 	high = ((u64)__pa_symbol(&wall_clock) >> 32);
 
 	native_write_msr(msr_kvm_wall_clock, low, high);
 
-	vcpu_time = &get_cpu_var(hv_clock);
+	preempt_disable();
+	cpu = smp_processor_id();
+
+	vcpu_time = &hv_clock[cpu].clock;
 	pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
-	put_cpu_var(hv_clock);
+
+	preempt_enable();
 
 	return ts.tv_sec;
 }
@@ -74,9 +84,11 @@ static cycle_t kvm_clock_read(void)
 {
 	struct pvclock_vcpu_time_info *src;
 	cycle_t ret;
+	int cpu;
 
 	preempt_disable_notrace();
-	src = &__get_cpu_var(hv_clock);
+	cpu = smp_processor_id();
+	src = &hv_clock[cpu].clock;
 	ret = pvclock_clocksource_read(src);
 	preempt_enable_notrace();
 	return ret;
@@ -99,8 +111,15 @@ static cycle_t kvm_clock_get_cycles(stru
 static unsigned long kvm_get_tsc_khz(void)
 {
 	struct pvclock_vcpu_time_info *src;
-	src = &per_cpu(hv_clock, 0);
-	return pvclock_tsc_khz(src);
+	int cpu;
+	unsigned long tsc_khz;
+
+	preempt_disable();
+	cpu = smp_processor_id();
+	src = &hv_clock[cpu].clock;
+	tsc_khz = pvclock_tsc_khz(src);
+	preempt_enable();
+	return tsc_khz;
 }
 
 static void kvm_get_preset_lpj(void)
@@ -119,10 +138,14 @@ bool kvm_check_and_clear_guest_paused(vo
 {
 	bool ret = false;
 	struct pvclock_vcpu_time_info *src;
+	int cpu = smp_processor_id();
 
-	src = &__get_cpu_var(hv_clock);
+	if (!hv_clock)
+		return ret;
+
+	src = &hv_clock[cpu].clock;
 	if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
-		__this_cpu_and(hv_clock.flags, ~PVCLOCK_GUEST_STOPPED);
+		src->flags &= ~PVCLOCK_GUEST_STOPPED;
 		ret = true;
 	}
 
@@ -141,9 +164,10 @@ int kvm_register_clock(char *txt)
 {
 	int cpu = smp_processor_id();
 	int low, high, ret;
+	struct pvclock_vcpu_time_info *src = &hv_clock[cpu].clock;
 
-	low = (int)__pa(&per_cpu(hv_clock, cpu)) | 1;
-	high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
+	low = (int)__pa(src) | 1;
+	high = ((u64)__pa(src) >> 32);
 	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
 	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
 	       cpu, high, low, txt);
@@ -197,6 +221,8 @@ static void kvm_shutdown(void)
 
 void __init kvmclock_init(void)
 {
+	unsigned long mem;
+
 	if (!kvm_para_available())
 		return;
 
@@ -209,8 +235,18 @@ void __init kvmclock_init(void)
 	printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
 		msr_kvm_system_time, msr_kvm_wall_clock);
 
-	if (kvm_register_clock("boot clock"))
+	mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * NR_CPUS,
+			     PAGE_SIZE);
+	if (!mem)
 		return;
+	hv_clock = __va(mem);
+
+	if (kvm_register_clock("boot clock")) {
+		hv_clock = NULL;
+		memblock_free(mem,
+			sizeof(struct pvclock_aligned_vcpu_time_info)*NR_CPUS);
+		return;
+	}
 	pv_time_ops.sched_clock = kvm_clock_read;
 	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
 	x86_platform.get_wallclock = kvm_get_wallclock;

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 02/18] x86: kvmclock: allocate pvclock shared memory area (v2)
  2012-11-16  2:07     ` [patch 02/18] x86: kvmclock: allocate pvclock shared memory area (v2) Marcelo Tosatti
@ 2012-11-16  7:06       ` Glauber Costa
  0 siblings, 0 replies; 49+ messages in thread
From: Glauber Costa @ 2012-11-16  7:06 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/16/2012 06:07 AM, Marcelo Tosatti wrote:
> 
> We want to expose the pvclock shared memory areas, which 
> the hypervisor periodically updates, to userspace.
> 
> For a linear mapping from userspace, it is necessary that
> entire page sized regions are used for array of pvclock 
> structures.
> 
> There is no such guarantee with per cpu areas, therefore move
> to memblock_alloc based allocation.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
No further objections.

Acked-by: Glauber Costa <glommer@parallels.com>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5)
  2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
                   ` (17 preceding siblings ...)
  2012-11-15  0:08 ` [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration Marcelo Tosatti
@ 2012-11-19 21:57 ` Marcelo Tosatti
  2012-11-19 21:57   ` [patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
                     ` (17 more replies)
  18 siblings, 18 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:57 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer

This patchset, based on earlier work by Jeremy Fitzhardinge, implements
paravirtual clock vsyscall support.

It should be possible to implement Xen support relatively easily.

It reduces clock_gettime from 500 cycles to 200 cycles
on my testbox.

v5:
- reduce preempt disable window in kvm_get_wallclock (Glauber)
- improve comment about SSE2 (Glauber)

v4:
- remove aligned_pvti structure, align directly (Glauber)
- add comments to migration notifier (Glauber)
- mark migration notifier condition as unlikely (Glauber)
- add comment about rdtsc barrier dependency on sse2 (Gleb)
- add idea to improve vdso gettime call (Gleb)
- remove new msr interface, reuse kernel copy of pvclock
data (Glauber)
- move copying of timekeeping data from generic timekeeping
code to kvm code (John)

v3:
- fix PVCLOCK_VSYSCALL_NR_PAGES definition (glommer)
- fold flags race fix into pvclock refactoring (avi)
- remove CONFIG_PARAVIRT_CLOCK_VSYSCALL (glommer)
- add reference to tsc.c from vclock_gettime.c about cycle_last rationale
(glommer)
- fix whitespace damage (glommer)


v2:
- Do not allow visibility of different <system_timestamp, tsc_timestamp>
tuples.
- Add option to disable vsyscall.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
@ 2012-11-19 21:57   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 02/18] x86: kvmclock: allocate pvclock shared memory area Marcelo Tosatti
                     ` (16 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:57 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: x86-kvm-retain-guest-stopped.patch --]
[-- Type: text/plain, Size: 1754 bytes --]

Otherwise it's possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as due to CPU
migration) to clear the bit.

Noticed by Paolo Bonzini.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1143,6 +1143,7 @@ static int kvm_guest_time_update(struct 
 	unsigned long this_tsc_khz;
 	s64 kernel_ns, max_kernel_ns;
 	u64 tsc_timestamp;
+	struct pvclock_vcpu_time_info *guest_hv_clock;
 	u8 pvclock_flags;
 
 	/* Keep irq disabled to prevent changes to the clock */
@@ -1226,13 +1227,6 @@ static int kvm_guest_time_update(struct 
 	vcpu->last_kernel_ns = kernel_ns;
 	vcpu->last_guest_tsc = tsc_timestamp;
 
-	pvclock_flags = 0;
-	if (vcpu->pvclock_set_guest_stopped_request) {
-		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
-		vcpu->pvclock_set_guest_stopped_request = false;
-	}
-
-	vcpu->hv_clock.flags = pvclock_flags;
 
 	/*
 	 * The interface expects us to write an even number signaling that the
@@ -1243,6 +1237,18 @@ static int kvm_guest_time_update(struct 
 
 	shared_kaddr = kmap_atomic(vcpu->time_page);
 
+	guest_hv_clock = shared_kaddr + vcpu->time_offset;
+
+	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
+	pvclock_flags = (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
+
+	if (vcpu->pvclock_set_guest_stopped_request) {
+		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
+		vcpu->pvclock_set_guest_stopped_request = false;
+	}
+
+	vcpu->hv_clock.flags = pvclock_flags;
+
 	memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
 	       sizeof(vcpu->hv_clock));
 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 02/18] x86: kvmclock: allocate pvclock shared memory area
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
  2012-11-19 21:57   ` [patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 03/18] x86: pvclock: make sure rdtsc doesnt speculate out of region Marcelo Tosatti
                     ` (15 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 00-kvmclock-alloc-area.patch --]
[-- Type: text/plain, Size: 4350 bytes --]

We want to expose the pvclock shared memory areas, which 
the hypervisor periodically updates, to userspace.

For a linear mapping from userspace, it is necessary that
entire page-sized regions are used for the array of pvclock
structures.

There is no such guarantee with per-CPU areas; therefore, move
to memblock_alloc-based allocation.

Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/kvmclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/kvmclock.c
+++ vsyscall/arch/x86/kernel/kvmclock.c
@@ -23,6 +23,7 @@
 #include <asm/apic.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
+#include <linux/memblock.h>
 
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
@@ -39,7 +40,11 @@ static int parse_no_kvmclock(char *arg)
 early_param("no-kvmclock", parse_no_kvmclock);
 
 /* The hypervisor will put information about time periodically here */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
+struct pvclock_aligned_vcpu_time_info {
+	struct pvclock_vcpu_time_info clock;
+} __attribute__((__aligned__(SMP_CACHE_BYTES)));
+
+static struct pvclock_aligned_vcpu_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
 /*
@@ -52,15 +57,20 @@ static unsigned long kvm_get_wallclock(v
 	struct pvclock_vcpu_time_info *vcpu_time;
 	struct timespec ts;
 	int low, high;
+	int cpu;
 
 	low = (int)__pa_symbol(&wall_clock);
 	high = ((u64)__pa_symbol(&wall_clock) >> 32);
 
 	native_write_msr(msr_kvm_wall_clock, low, high);
 
-	vcpu_time = &get_cpu_var(hv_clock);
+	preempt_disable();
+	cpu = smp_processor_id();
+
+	vcpu_time = &hv_clock[cpu].clock;
 	pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
-	put_cpu_var(hv_clock);
+
+	preempt_enable();
 
 	return ts.tv_sec;
 }
@@ -74,9 +84,11 @@ static cycle_t kvm_clock_read(void)
 {
 	struct pvclock_vcpu_time_info *src;
 	cycle_t ret;
+	int cpu;
 
 	preempt_disable_notrace();
-	src = &__get_cpu_var(hv_clock);
+	cpu = smp_processor_id();
+	src = &hv_clock[cpu].clock;
 	ret = pvclock_clocksource_read(src);
 	preempt_enable_notrace();
 	return ret;
@@ -99,8 +111,15 @@ static cycle_t kvm_clock_get_cycles(stru
 static unsigned long kvm_get_tsc_khz(void)
 {
 	struct pvclock_vcpu_time_info *src;
-	src = &per_cpu(hv_clock, 0);
-	return pvclock_tsc_khz(src);
+	int cpu;
+	unsigned long tsc_khz;
+
+	preempt_disable();
+	cpu = smp_processor_id();
+	src = &hv_clock[cpu].clock;
+	tsc_khz = pvclock_tsc_khz(src);
+	preempt_enable();
+	return tsc_khz;
 }
 
 static void kvm_get_preset_lpj(void)
@@ -119,10 +138,14 @@ bool kvm_check_and_clear_guest_paused(vo
 {
 	bool ret = false;
 	struct pvclock_vcpu_time_info *src;
+	int cpu = smp_processor_id();
 
-	src = &__get_cpu_var(hv_clock);
+	if (!hv_clock)
+		return ret;
+
+	src = &hv_clock[cpu].clock;
 	if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
-		__this_cpu_and(hv_clock.flags, ~PVCLOCK_GUEST_STOPPED);
+		src->flags &= ~PVCLOCK_GUEST_STOPPED;
 		ret = true;
 	}
 
@@ -141,9 +164,10 @@ int kvm_register_clock(char *txt)
 {
 	int cpu = smp_processor_id();
 	int low, high, ret;
+	struct pvclock_vcpu_time_info *src = &hv_clock[cpu].clock;
 
-	low = (int)__pa(&per_cpu(hv_clock, cpu)) | 1;
-	high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
+	low = (int)__pa(src) | 1;
+	high = ((u64)__pa(src) >> 32);
 	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
 	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
 	       cpu, high, low, txt);
@@ -197,6 +221,8 @@ static void kvm_shutdown(void)
 
 void __init kvmclock_init(void)
 {
+	unsigned long mem;
+
 	if (!kvm_para_available())
 		return;
 
@@ -209,8 +235,18 @@ void __init kvmclock_init(void)
 	printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
 		msr_kvm_system_time, msr_kvm_wall_clock);
 
-	if (kvm_register_clock("boot clock"))
+	mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * NR_CPUS,
+			     PAGE_SIZE);
+	if (!mem)
 		return;
+	hv_clock = __va(mem);
+
+	if (kvm_register_clock("boot clock")) {
+		hv_clock = NULL;
+		memblock_free(mem,
+			sizeof(struct pvclock_aligned_vcpu_time_info)*NR_CPUS);
+		return;
+	}
 	pv_time_ops.sched_clock = kvm_clock_read;
 	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
 	x86_platform.get_wallclock = kvm_get_wallclock;



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 03/18] x86: pvclock: make sure rdtsc doesnt speculate out of region
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
  2012-11-19 21:57   ` [patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
  2012-11-19 21:58   ` [patch 02/18] x86: kvmclock: allocate pvclock shared memory area Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 04/18] x86: pvclock: remove pvclock_shadow_time Marcelo Tosatti
                     ` (14 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 01-pvclock-read-rdtsc-barrier --]
[-- Type: text/plain, Size: 745 bytes --]

Originally from Jeremy Fitzhardinge.

pvclock_get_time_values, which contains the memory barriers,
will be removed by the next patch.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
 
 	do {
 		version = pvclock_get_time_values(&shadow, src);
-		barrier();
+		rdtsc_barrier();
 		offset = pvclock_get_nsec_offset(&shadow);
 		ret = shadow.system_timestamp + offset;
-		barrier();
+		rdtsc_barrier();
 	} while (version != src->version);
 
 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 04/18] x86: pvclock: remove pvclock_shadow_time
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (2 preceding siblings ...)
  2012-11-19 21:58   ` [patch 03/18] x86: pvclock: make sure rdtsc doesnt speculate out of region Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 05/18] x86: pvclock: create helper for pvclock data retrieval Marcelo Tosatti
                     ` (13 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 02-pvclock-remove-shadow-time --]
[-- Type: text/plain, Size: 3030 bytes --]

Originally from Jeremy Fitzhardinge.

We can copy the information directly from "struct pvclock_vcpu_time_info"
and remove pvclock_shadow_time.

Reviewed-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -19,21 +19,6 @@
 #include <linux/percpu.h>
 #include <asm/pvclock.h>
 
-/*
- * These are perodically updated
- *    xen: magic shared_info page
- *    kvm: gpa registered via msr
- * and then copied here.
- */
-struct pvclock_shadow_time {
-	u64 tsc_timestamp;     /* TSC at last update of time vals.  */
-	u64 system_timestamp;  /* Time, in nanosecs, since boot.    */
-	u32 tsc_to_nsec_mul;
-	int tsc_shift;
-	u32 version;
-	u8  flags;
-};
-
 static u8 valid_flags __read_mostly = 0;
 
 void pvclock_set_flags(u8 flags)
@@ -41,32 +26,11 @@ void pvclock_set_flags(u8 flags)
 	valid_flags = flags;
 }
 
-static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
+static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
 {
-	u64 delta = native_read_tsc() - shadow->tsc_timestamp;
-	return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
-				   shadow->tsc_shift);
-}
-
-/*
- * Reads a consistent set of time-base values from hypervisor,
- * into a shadow data area.
- */
-static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
-					struct pvclock_vcpu_time_info *src)
-{
-	do {
-		dst->version = src->version;
-		rmb();		/* fetch version before data */
-		dst->tsc_timestamp     = src->tsc_timestamp;
-		dst->system_timestamp  = src->system_time;
-		dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
-		dst->tsc_shift         = src->tsc_shift;
-		dst->flags             = src->flags;
-		rmb();		/* test version after fetching data */
-	} while ((src->version & 1) || (dst->version != src->version));
-
-	return dst->version;
+	u64 delta = native_read_tsc() - src->tsc_timestamp;
+	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
+				   src->tsc_shift);
 }
 
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
@@ -90,21 +54,22 @@ void pvclock_resume(void)
 
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
-	struct pvclock_shadow_time shadow;
 	unsigned version;
 	cycle_t ret, offset;
 	u64 last;
+	u8 flags;
 
 	do {
-		version = pvclock_get_time_values(&shadow, src);
+		version = src->version;
 		rdtsc_barrier();
-		offset = pvclock_get_nsec_offset(&shadow);
-		ret = shadow.system_timestamp + offset;
+		offset = pvclock_get_nsec_offset(src);
+		ret = src->system_time + offset;
+		flags = src->flags;
 		rdtsc_barrier();
-	} while (version != src->version);
+	} while ((src->version & 1) || version != src->version);
 
 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
-		(shadow.flags & PVCLOCK_TSC_STABLE_BIT))
+		(flags & PVCLOCK_TSC_STABLE_BIT))
 		return ret;
 
 	/*
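
For reference, a self-contained sketch of the version protocol the reader loop
above depends on (hypothetical struct and field names; the real layout is struct
pvclock_vcpu_time_info): the producer makes the version odd while the fields are
in flux and even again once they are consistent, and the consumer retries on an
odd or changed version.

#include <stdint.h>

/* Hypothetical stand-in for struct pvclock_vcpu_time_info; the real struct
 * also carries tsc_to_system_mul, tsc_shift and flags. */
struct time_info {
	volatile uint32_t version;
	volatile uint64_t tsc_timestamp;
	volatile uint64_t system_time;
};

/* Producer (hypervisor) side: version is odd while an update is in flight. */
static void publish(struct time_info *ti, uint64_t tsc, uint64_t ns)
{
	ti->version++;			/* odd: update in progress */
	__sync_synchronize();
	ti->tsc_timestamp = tsc;
	ti->system_time = ns;
	__sync_synchronize();
	ti->version++;			/* even: fields consistent again */
}

/* Consumer side: same retry condition as pvclock_clocksource_read() above. */
static uint64_t read_consistent(const struct time_info *ti)
{
	uint32_t version;
	uint64_t ns;

	do {
		version = ti->version;
		__sync_synchronize();
		ns = ti->system_time;	/* the real code adds a scaled TSC delta */
		__sync_synchronize();
	} while ((ti->version & 1) || version != ti->version);

	return ns;
}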



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 05/18] x86: pvclock: create helper for pvclock data retrieval
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (3 preceding siblings ...)
  2012-11-19 21:58   ` [patch 04/18] x86: pvclock: remove pvclock_shadow_time Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 06/18] x86: pvclock: introduce helper to read flags Marcelo Tosatti
                     ` (12 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 03-move-pvread-to-pvheader --]
[-- Type: text/plain, Size: 2312 bytes --]

Originally from Jeremy Fitzhardinge.

So the code can be reused by other callers (the vDSO read path added later
in this series uses the same helper).

Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -26,13 +26,6 @@ void pvclock_set_flags(u8 flags)
 	valid_flags = flags;
 }
 
-static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
-{
-	u64 delta = native_read_tsc() - src->tsc_timestamp;
-	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
-				   src->tsc_shift);
-}
-
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
 {
 	u64 pv_tsc_khz = 1000000ULL << 32;
@@ -55,17 +48,12 @@ void pvclock_resume(void)
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
 	unsigned version;
-	cycle_t ret, offset;
+	cycle_t ret;
 	u64 last;
 	u8 flags;
 
 	do {
-		version = src->version;
-		rdtsc_barrier();
-		offset = pvclock_get_nsec_offset(src);
-		ret = src->system_time + offset;
-		flags = src->flags;
-		rdtsc_barrier();
+		version = __pvclock_read_cycles(src, &ret, &flags);
 	} while ((src->version & 1) || version != src->version);
 
 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -56,4 +56,32 @@ static inline u64 pvclock_scale_delta(u6
 	return product;
 }
 
+static __always_inline
+u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
+{
+	u64 delta = __native_read_tsc() - src->tsc_timestamp;
+	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
+				   src->tsc_shift);
+}
+
+static __always_inline
+unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
+			       cycle_t *cycles, u8 *flags)
+{
+	unsigned version;
+	cycle_t ret, offset;
+	u8 ret_flags;
+
+	version = src->version;
+	rdtsc_barrier();
+	offset = pvclock_get_nsec_offset(src);
+	ret = src->system_time + offset;
+	ret_flags = src->flags;
+	rdtsc_barrier();
+
+	*cycles = ret;
+	*flags = ret_flags;
+	return version;
+}
+
 #endif /* _ASM_X86_PVCLOCK_H */



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 06/18] x86: pvclock: introduce helper to read flags
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (4 preceding siblings ...)
  2012-11-19 21:58   ` [patch 05/18] x86: pvclock: create helper for pvclock data retrieval Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 07/18] x86: pvclock: add note about rdtsc barriers Marcelo Tosatti
                     ` (11 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 05-pvclock-add-get-flags --]
[-- Type: text/plain, Size: 1325 bytes --]

Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -45,6 +45,19 @@ void pvclock_resume(void)
 	atomic64_set(&last_value, 0);
 }
 
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src)
+{
+	unsigned version;
+	cycle_t ret;
+	u8 flags;
+
+	do {
+		version = __pvclock_read_cycles(src, &ret, &flags);
+	} while ((src->version & 1) || version != src->version);
+
+	return flags & valid_flags;
+}
+
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
 	unsigned version;
Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
 void pvclock_set_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
 void pvclock_read_wallclock(struct pvclock_wall_clock *wall,



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 07/18] x86: pvclock: add note about rdtsc barriers
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (5 preceding siblings ...)
  2012-11-19 21:58   ` [patch 06/18] x86: pvclock: introduce helper to read flags Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 08/18] sched: add notifier for cross-cpu migrations Marcelo Tosatti
                     ` (10 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 05.2-pvclock-add-comment-barrier --]
[-- Type: text/plain, Size: 871 bytes --]

As noted by Gleb, a guest CPU that does not advertise SSE2 support
gets no RDTSC barriers (the fences rdtsc_barrier() relies on are SSE2
instructions).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -74,6 +74,12 @@ unsigned __pvclock_read_cycles(const str
 	u8 ret_flags;
 
 	version = src->version;
+	/* Note: emulated platforms which do not advertise SSE2 support
+ 	 * result in kvmclock not using the necessary RDTSC barriers.
+ 	 * Without barriers, it is possible that RDTSC instruction reads from
+ 	 * the time stamp counter outside rdtsc_barrier protected section
+ 	 * below, resulting in violation of monotonicity.
+ 	 */
 	rdtsc_barrier();
 	offset = pvclock_get_nsec_offset(src);
 	ret = src->system_time + offset;



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 08/18] sched: add notifier for cross-cpu migrations
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (6 preceding siblings ...)
  2012-11-19 21:58   ` [patch 07/18] x86: pvclock: add note about rdtsc barriers Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 09/18] x86: pvclock: generic pvclock vsyscall initialization Marcelo Tosatti
                     ` (9 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 06-add-task-migration-notifier --]
[-- Type: text/plain, Size: 1732 bytes --]

Originally from Jeremy Fitzhardinge.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/include/linux/sched.h
===================================================================
--- vsyscall.orig/include/linux/sched.h
+++ vsyscall/include/linux/sched.h
@@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void)
 extern void calc_global_load(unsigned long ticks);
 extern void update_cpu_load_nohz(void);
 
+/* Notifier for when a task gets migrated to a new CPU */
+struct task_migration_notifier {
+	struct task_struct *task;
+	int from_cpu;
+	int to_cpu;
+};
+extern void register_task_migration_notifier(struct notifier_block *n);
+
 extern unsigned long get_parent_ip(unsigned long addr);
 
 struct seq_file;
Index: vsyscall/kernel/sched/core.c
===================================================================
--- vsyscall.orig/kernel/sched/core.c
+++ vsyscall/kernel/sched/core.c
@@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s
 		rq->skip_clock_update = 1;
 }
 
+static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
+
+void register_task_migration_notifier(struct notifier_block *n)
+{
+	atomic_notifier_chain_register(&task_migration_notifier, n);
+}
+
 #ifdef CONFIG_SMP
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
@@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p,
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+		struct task_migration_notifier tmn;
+
 		p->se.nr_migrations++;
 		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
+
+		tmn.task = p;
+		tmn.from_cpu = task_cpu(p);
+		tmn.to_cpu = new_cpu;
+
+		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
 	}
 
 	__set_task_cpu(p, new_cpu);
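
A usage sketch for the new chain (hypothetical consumer; the in-tree user is
the pvclock vsyscall code added later in this series). The callback runs from
the scheduler's migration path, so it has to be short and must not sleep:

#include <linux/notifier.h>
#include <linux/printk.h>
#include <linux/sched.h>

/* Hypothetical consumer of the task migration notifier chain. */
static int demo_migration_cb(struct notifier_block *nb,
			     unsigned long unused, void *data)
{
	struct task_migration_notifier *mn = data;

	/* Runs from the migration path: keep it short, no sleeping. */
	pr_debug("task %d migrating: cpu %d -> cpu %d\n",
		 mn->task->pid, mn->from_cpu, mn->to_cpu);
	return NOTIFY_DONE;
}

static struct notifier_block demo_migration_nb = {
	.notifier_call = demo_migration_cb,
};

/* At init time: register_task_migration_notifier(&demo_migration_nb); */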



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 09/18] x86: pvclock: generic pvclock vsyscall initialization
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (7 preceding siblings ...)
  2012-11-19 21:58   ` [patch 08/18] sched: add notifier for cross-cpu migrations Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 11/18] x86: vdso: pvclock gettime support Marcelo Tosatti
                     ` (8 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 07-add-pvclock-structs-and-fixmap --]
[-- Type: text/plain, Size: 4272 bytes --]

Originally from Jeremy Fitzhardinge.

Introduce generic, non-hypervisor-specific pvclock initialization
routines.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -17,6 +17,10 @@
 
 #include <linux/kernel.h>
 #include <linux/percpu.h>
+#include <linux/notifier.h>
+#include <linux/sched.h>
+#include <linux/gfp.h>
+#include <linux/bootmem.h>
 #include <asm/pvclock.h>
 
 static u8 valid_flags __read_mostly = 0;
@@ -122,3 +126,68 @@ void pvclock_read_wallclock(struct pvclo
 
 	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
+
+static struct pvclock_vsyscall_time_info *pvclock_vdso_info;
+
+static struct pvclock_vsyscall_time_info *
+pvclock_get_vsyscall_user_time_info(int cpu)
+{
+	if (!pvclock_vdso_info) {
+		BUG();
+		return NULL;
+	}
+
+	return &pvclock_vdso_info[cpu];
+}
+
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu)
+{
+	return &pvclock_get_vsyscall_user_time_info(cpu)->pvti;
+}
+
+int pvclock_task_migrate(struct notifier_block *nb, unsigned long l, void *v)
+{
+	struct task_migration_notifier *mn = v;
+	struct pvclock_vsyscall_time_info *pvti;
+
+	pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu);
+
+	/* this is NULL when pvclock vsyscall is not initialized */
+	if (unlikely(pvti == NULL))
+		return NOTIFY_DONE;
+
+	pvti->migrate_count++;
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block pvclock_migrate = {
+	.notifier_call = pvclock_task_migrate,
+};
+
+/*
+ * Initialize the generic pvclock vsyscall state.  This will allocate
+ * a/some page(s) for the per-vcpu pvclock information, set up a
+ * fixmap mapping for the page(s)
+ */
+
+int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
+				 int size)
+{
+	int idx;
+
+	WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
+
+	pvclock_vdso_info = i;
+
+	for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
+		__set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
+			     __pa_symbol(i) + (idx*PAGE_SIZE),
+			     PAGE_KERNEL_VVAR);
+	}
+
+
+	register_task_migration_notifier(&pvclock_migrate);
+
+	return 0;
+}
Index: vsyscall/arch/x86/include/asm/fixmap.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/fixmap.h
+++ vsyscall/arch/x86/include/asm/fixmap.h
@@ -19,6 +19,7 @@
 #include <asm/acpi.h>
 #include <asm/apicdef.h>
 #include <asm/page.h>
+#include <asm/pvclock.h>
 #ifdef CONFIG_X86_32
 #include <linux/threads.h>
 #include <asm/kmap_types.h>
@@ -81,6 +82,10 @@ enum fixed_addresses {
 	VVAR_PAGE,
 	VSYSCALL_HPET,
 #endif
+#ifdef CONFIG_PARAVIRT_CLOCK
+	PVCLOCK_FIXMAP_BEGIN,
+	PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
+#endif
 	FIX_DBGP_BASE,
 	FIX_EARLYCON_MEM_BASE,
 #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -85,4 +85,16 @@ unsigned __pvclock_read_cycles(const str
 	return version;
 }
 
+struct pvclock_vsyscall_time_info {
+	struct pvclock_vcpu_time_info pvti;
+	u32 migrate_count;
+} __attribute__((__aligned__(SMP_CACHE_BYTES)));
+
+#define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
+#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1)
+
+int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
+				 int size);
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
+
 #endif /* _ASM_X86_PVCLOCK_H */
Index: vsyscall/arch/x86/include/asm/clocksource.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/clocksource.h
+++ vsyscall/arch/x86/include/asm/clocksource.h
@@ -8,6 +8,7 @@
 #define VCLOCK_NONE 0  /* No vDSO clock available.	*/
 #define VCLOCK_TSC  1  /* vDSO should use vread_tsc.	*/
 #define VCLOCK_HPET 2  /* vDSO should use vread_hpet.	*/
+#define VCLOCK_PVCLOCK 3 /* vDSO should use vread_pvclock. */
 
 struct arch_clocksource_data {
 	int vclock_mode;
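
To make the fixmap sizing concrete, a worked example of the
PVCLOCK_VSYSCALL_NR_PAGES arithmetic above (assuming 4 KiB pages and a
64-byte entry, i.e. SMP_CACHE_BYTES == 64 so the 36-byte struct is padded
to one cache line):

#include <stdio.h>

#define PAGE_SIZE	4096
#define PVTI_SIZE	64
#define NR_PAGES(nr_cpus) ((((nr_cpus) - 1) / (PAGE_SIZE / PVTI_SIZE)) + 1)

int main(void)
{
	/* 64 entries fit in one page, so 64 CPUs need 1 page, 256 need 4. */
	printf("NR_CPUS=64  -> %d fixmap page(s)\n", NR_PAGES(64));
	printf("NR_CPUS=256 -> %d fixmap page(s)\n", NR_PAGES(256));
	return 0;
}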



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 11/18] x86: vdso: pvclock gettime support
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (8 preceding siblings ...)
  2012-11-19 21:58   ` [patch 09/18] x86: pvclock: generic pvclock vsyscall initialization Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 12/18] KVM: x86: pass host_tsc to read_l1_tsc Marcelo Tosatti
                     ` (7 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 10-add-pvclock-vdso-code --]
[-- Type: text/plain, Size: 5202 bytes --]

Improve performance of time system calls when using Linux pvclock
by reading the time info from the fixmap-visible copy of the pvclock data.

Originally from Jeremy Fitzhardinge.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/vdso/vclock_gettime.c
===================================================================
--- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
+++ vsyscall/arch/x86/vdso/vclock_gettime.c
@@ -22,6 +22,7 @@
 #include <asm/hpet.h>
 #include <asm/unistd.h>
 #include <asm/io.h>
+#include <asm/pvclock.h>
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
@@ -62,6 +63,76 @@ static notrace cycle_t vread_hpet(void)
 	return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
 }
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+
+static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
+{
+	const struct pvclock_vsyscall_time_info *pvti_base;
+	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
+	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
+
+	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
+
+	pvti_base = (struct pvclock_vsyscall_time_info *)
+		    __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
+
+	return &pvti_base[offset];
+}
+
+static notrace cycle_t vread_pvclock(int *mode)
+{
+	const struct pvclock_vsyscall_time_info *pvti;
+	cycle_t ret;
+	u64 last;
+	u32 version;
+	u32 migrate_count;
+	u8 flags;
+	unsigned cpu, cpu1;
+
+
+	/*
+	 * When looping to get a consistent (time-info, tsc) pair, we
+	 * also need to deal with the possibility we can switch vcpus,
+	 * so make sure we always re-fetch time-info for the current vcpu.
+	 */
+	do {
+		cpu = __getcpu() & VGETCPU_CPU_MASK;
+		/* TODO: We can put vcpu id into higher bits of pvti.version.
+		 * This will save a couple of cycles by getting rid of
+		 * __getcpu() calls (Gleb).
+		 */
+
+		pvti = get_pvti(cpu);
+
+		migrate_count = pvti->migrate_count;
+
+		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
+
+		/*
+		 * Test we're still on the cpu as well as the version.
+		 * We could have been migrated just after the first
+		 * vgetcpu but before fetching the version, so we
+		 * wouldn't notice a version change.
+		 */
+		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
+	} while (unlikely(cpu != cpu1 ||
+			  (pvti->pvti.version & 1) ||
+			  pvti->pvti.version != version ||
+			  pvti->migrate_count != migrate_count));
+
+	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+		*mode = VCLOCK_NONE;
+
+	/* refer to tsc.c read_tsc() comment for rationale */
+	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
+
+	if (likely(ret >= last))
+		return ret;
+
+	return last;
+}
+#endif
+
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
 	long ret;
@@ -80,7 +151,7 @@ notrace static long vdso_fallback_gtod(s
 }
 
 
-notrace static inline u64 vgetsns(void)
+notrace static inline u64 vgetsns(int *mode)
 {
 	long v;
 	cycles_t cycles;
@@ -88,6 +159,8 @@ notrace static inline u64 vgetsns(void)
 		cycles = vread_tsc();
 	else if (gtod->clock.vclock_mode == VCLOCK_HPET)
 		cycles = vread_hpet();
+	else if (gtod->clock.vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
 	else
 		return 0;
 	v = (cycles - gtod->clock.cycle_last) & gtod->clock.mask;
@@ -107,7 +180,7 @@ notrace static int __always_inline do_re
 		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->wall_time_sec;
 		ns = gtod->wall_time_snsec;
-		ns += vgetsns();
+		ns += vgetsns(&mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 
@@ -127,7 +200,7 @@ notrace static int do_monotonic(struct t
 		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->monotonic_time_sec;
 		ns = gtod->monotonic_time_snsec;
-		ns += vgetsns();
+		ns += vgetsns(&mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 	timespec_add_ns(ts, ns);
Index: vsyscall/arch/x86/include/asm/vsyscall.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/vsyscall.h
+++ vsyscall/arch/x86/include/asm/vsyscall.h
@@ -33,6 +33,23 @@ extern void map_vsyscall(void);
  */
 extern bool emulate_vsyscall(struct pt_regs *regs, unsigned long address);
 
+#define VGETCPU_CPU_MASK 0xfff
+
+static inline unsigned int __getcpu(void)
+{
+	unsigned int p;
+
+	if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
+		/* Load per CPU data from RDTSCP */
+		native_read_tscp(&p);
+	} else {
+		/* Load per CPU data from GDT */
+		asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
+	}
+
+	return p;
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_VSYSCALL_H */
Index: vsyscall/arch/x86/vdso/vgetcpu.c
===================================================================
--- vsyscall.orig/arch/x86/vdso/vgetcpu.c
+++ vsyscall/arch/x86/vdso/vgetcpu.c
@@ -17,15 +17,10 @@ __vdso_getcpu(unsigned *cpu, unsigned *n
 {
 	unsigned int p;
 
-	if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
-		/* Load per CPU data from RDTSCP */
-		native_read_tscp(&p);
-	} else {
-		/* Load per CPU data from GDT */
-		asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
-	}
+	p = __getcpu();
+
 	if (cpu)
-		*cpu = p & 0xfff;
+		*cpu = p & VGETCPU_CPU_MASK;
 	if (node)
 		*node = p >> 12;
 	return 0;
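
From guest user space the new path is transparent (example program; when
vclock_mode is VCLOCK_PVCLOCK and PVCLOCK_TSC_STABLE_BIT is set, the call
below is served entirely from the vDSO without entering the kernel):

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec ts;

	/* Fast path via vread_pvclock(); falls back to the syscall if the
	 * TSC-stable flag is clear. */
	if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0)
		printf("%lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
	return 0;
}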



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 12/18] KVM: x86: pass host_tsc to read_l1_tsc
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (9 preceding siblings ...)
  2012-11-19 21:58   ` [patch 11/18] x86: vdso: pvclock gettime support Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 13/18] time: export time information for KVM pvclock Marcelo Tosatti
                     ` (6 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 12-kvm-read-l1-tsc-pass-tscvalue --]
[-- Type: text/plain, Size: 3372 bytes --]

Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -700,7 +700,7 @@ struct kvm_x86_ops {
 	void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
 	u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
-	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
+	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
 
 	void (*get_exit_info)(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2);
 
Index: vsyscall/arch/x86/kvm/lapic.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/lapic.c
+++ vsyscall/arch/x86/kvm/lapic.c
@@ -1011,7 +1011,7 @@ static void start_apic_timer(struct kvm_
 		local_irq_save(flags);
 
 		now = apic->lapic_timer.timer.base->get_time();
-		guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+		guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu, native_read_tsc());
 		if (likely(tscdeadline > guest_tsc)) {
 			ns = (tscdeadline - guest_tsc) * 1000000ULL;
 			do_div(ns, this_tsc_khz);
Index: vsyscall/arch/x86/kvm/svm.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/svm.c
+++ vsyscall/arch/x86/kvm/svm.c
@@ -3008,11 +3008,11 @@ static int cr8_write_interception(struct
 	return 0;
 }
 
-u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
+u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
 	struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu));
 	return vmcb->control.tsc_offset +
-		svm_scale_tsc(vcpu, native_read_tsc());
+		svm_scale_tsc(vcpu, host_tsc);
 }
 
 static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
Index: vsyscall/arch/x86/kvm/vmx.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/vmx.c
+++ vsyscall/arch/x86/kvm/vmx.c
@@ -1839,11 +1839,10 @@ static u64 guest_read_tsc(void)
  * Like guest_read_tsc, but always returns L1's notion of the timestamp
  * counter, even if a nested guest (L2) is currently running.
  */
-u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
+u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
-	u64 host_tsc, tsc_offset;
+	u64 tsc_offset;
 
-	rdtscll(host_tsc);
 	tsc_offset = is_guest_mode(vcpu) ?
 		to_vmx(vcpu)->nested.vmcs01_tsc_offset :
 		vmcs_read64(TSC_OFFSET);
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1148,7 +1148,7 @@ static int kvm_guest_time_update(struct 
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
+	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, native_read_tsc());
 	kernel_ns = get_kernel_ns();
 	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 	if (unlikely(this_tsc_khz == 0)) {
@@ -5375,7 +5375,8 @@ static int vcpu_enter_guest(struct kvm_v
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();
 
-	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu,
+							   native_read_tsc());
 
 	vcpu->mode = OUTSIDE_GUEST_MODE;
 	smp_wmb();



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 13/18] time: export time information for KVM pvclock
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (10 preceding siblings ...)
  2012-11-19 21:58   ` [patch 12/18] KVM: x86: pass host_tsc to read_l1_tsc Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 14/18] KVM: x86: notifier for clocksource changes Marcelo Tosatti
                     ` (5 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 13-time-add-pvclock-gtod-data --]
[-- Type: text/plain, Size: 2699 bytes --]

As suggested by John, export time data similarly to how it is
done by vsyscall support. This allows KVM to retrieve the necessary
information to implement vsyscall support in KVM guests.

Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/include/linux/pvclock_gtod.h
===================================================================
--- /dev/null
+++ vsyscall/include/linux/pvclock_gtod.h
@@ -0,0 +1,9 @@
+#ifndef _PVCLOCK_GTOD_H
+#define _PVCLOCK_GTOD_H
+
+#include <linux/notifier.h>
+
+extern int pvclock_gtod_register_notifier(struct notifier_block *nb);
+extern int pvclock_gtod_unregister_notifier(struct notifier_block *nb);
+
+#endif /* _PVCLOCK_GTOD_H */
Index: vsyscall/kernel/time/timekeeping.c
===================================================================
--- vsyscall.orig/kernel/time/timekeeping.c
+++ vsyscall/kernel/time/timekeeping.c
@@ -21,6 +21,7 @@
 #include <linux/time.h>
 #include <linux/tick.h>
 #include <linux/stop_machine.h>
+#include <linux/pvclock_gtod.h>
 
 
 static struct timekeeper timekeeper;
@@ -180,6 +181,54 @@ static inline s64 timekeeping_get_ns_raw
 	return nsec + arch_gettimeoffset();
 }
 
+static RAW_NOTIFIER_HEAD(pvclock_gtod_chain);
+
+static void update_pvclock_gtod(struct timekeeper *tk)
+{
+	raw_notifier_call_chain(&pvclock_gtod_chain, 0, tk);
+}
+
+/**
+ * pvclock_gtod_register_notifier - register a pvclock timedata update listener
+ *
+ * Must hold write on timekeeper.lock
+ */
+int pvclock_gtod_register_notifier(struct notifier_block *nb)
+{
+	struct timekeeper *tk = &timekeeper;
+	unsigned long flags;
+	int ret;
+
+	write_seqlock_irqsave(&tk->lock, flags);
+	ret = raw_notifier_chain_register(&pvclock_gtod_chain, nb);
+	/* update timekeeping data */
+	update_pvclock_gtod(tk);
+	write_sequnlock_irqrestore(&tk->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pvclock_gtod_register_notifier);
+
+/**
+ * pvclock_gtod_unregister_notifier - unregister a pvclock
+ * timedata update listener
+ *
+ * Must hold write on timekeeper.lock
+ */
+int pvclock_gtod_unregister_notifier(struct notifier_block *nb)
+{
+	struct timekeeper *tk = &timekeeper;
+	unsigned long flags;
+	int ret;
+
+	write_seqlock_irqsave(&tk->lock, flags);
+	ret = raw_notifier_chain_unregister(&pvclock_gtod_chain, nb);
+	write_sequnlock_irqrestore(&tk->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pvclock_gtod_unregister_notifier);
+
 /* must hold write on timekeeper.lock */
 static void timekeeping_update(struct timekeeper *tk, bool clearntp)
 {
@@ -188,6 +237,7 @@ static void timekeeping_update(struct ti
 		ntp_clear();
 	}
 	update_vsyscall(tk);
+	update_pvclock_gtod(tk);
 }
 
 /**
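
Usage sketch for the new notifier (hypothetical callback; the in-tree consumer
is the KVM code added in the next patch). The chain is called with the
timekeeper as payload while the timekeeper write seqlock is held, so the
callback should only snapshot the fields it needs and return:

#include <linux/notifier.h>
#include <linux/pvclock_gtod.h>
#include <linux/timekeeper_internal.h>

/* Hypothetical consumer snapshotting timekeeping data. */
static int demo_gtod_notify(struct notifier_block *nb,
			    unsigned long unused, void *priv)
{
	struct timekeeper *tk = priv;

	/* e.g. copy tk->clock->cycle_last, tk->mult, tk->shift ... */
	(void)tk;
	return 0;
}

static struct notifier_block demo_gtod_nb = {
	.notifier_call = demo_gtod_notify,
};

/* At init time: pvclock_gtod_register_notifier(&demo_gtod_nb); */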



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 14/18] KVM: x86: notifier for clocksource changes
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (11 preceding siblings ...)
  2012-11-19 21:58   ` [patch 13/18] time: export time information for KVM pvclock Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
                     ` (4 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 15-add-kvm-req-pvclock-gtod-update --]
[-- Type: text/plain, Size: 3971 bytes --]

Register a notifier for clocksource change events. In case
the host switches to a clocksource other than the TSC, disable master
clock usage.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -46,6 +46,8 @@
 #include <linux/uaccess.h>
 #include <linux/hash.h>
 #include <linux/pci.h>
+#include <linux/timekeeper_internal.h>
+#include <linux/pvclock_gtod.h>
 #include <trace/events/kvm.h>
 
 #define CREATE_TRACE_POINTS
@@ -899,6 +901,53 @@ static int do_set_msr(struct kvm_vcpu *v
 	return kvm_set_msr(vcpu, index, *data);
 }
 
+struct pvclock_gtod_data {
+	seqcount_t	seq;
+
+	struct { /* extract of a clocksource struct */
+		int vclock_mode;
+		cycle_t	cycle_last;
+		cycle_t	mask;
+		u32	mult;
+		u32	shift;
+	} clock;
+
+	/* open coded 'struct timespec' */
+	u64		monotonic_time_snsec;
+	time_t		monotonic_time_sec;
+};
+
+static struct pvclock_gtod_data pvclock_gtod_data;
+
+static void update_pvclock_gtod(struct timekeeper *tk)
+{
+	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
+
+	write_seqcount_begin(&vdata->seq);
+
+	/* copy pvclock gtod data */
+	vdata->clock.vclock_mode	= tk->clock->archdata.vclock_mode;
+	vdata->clock.cycle_last		= tk->clock->cycle_last;
+	vdata->clock.mask		= tk->clock->mask;
+	vdata->clock.mult		= tk->mult;
+	vdata->clock.shift		= tk->shift;
+
+	vdata->monotonic_time_sec	= tk->xtime_sec
+					+ tk->wall_to_monotonic.tv_sec;
+	vdata->monotonic_time_snsec	= tk->xtime_nsec
+					+ (tk->wall_to_monotonic.tv_nsec
+						<< tk->shift);
+	while (vdata->monotonic_time_snsec >=
+					(((u64)NSEC_PER_SEC) << tk->shift)) {
+		vdata->monotonic_time_snsec -=
+					((u64)NSEC_PER_SEC) << tk->shift;
+		vdata->monotonic_time_sec++;
+	}
+
+	write_seqcount_end(&vdata->seq);
+}
+
+
 static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
 {
 	int version;
@@ -995,6 +1044,8 @@ static inline u64 get_kernel_ns(void)
 	return timespec_to_ns(&ts);
 }
 
+static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
+
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 unsigned long max_tsc_khz;
 
@@ -1227,7 +1278,6 @@ static int kvm_guest_time_update(struct 
 	vcpu->last_kernel_ns = kernel_ns;
 	vcpu->last_guest_tsc = tsc_timestamp;
 
-
 	/*
 	 * The interface expects us to write an even number signaling that the
 	 * update is finished. Since the guest won't see the intermediate
@@ -4894,6 +4944,37 @@ static void kvm_set_mmio_spte_mask(void)
 	kvm_mmu_set_mmio_spte_mask(mask);
 }
 
+static void pvclock_gtod_update_fn(struct work_struct *work)
+{
+}
+
+static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
+
+/*
+ * Notification about pvclock gtod data update.
+ */
+static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
+			       void *priv)
+{
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+	struct timekeeper *tk = priv;
+
+	update_pvclock_gtod(tk);
+
+	/* disable master clock if host does not trust, or does not
+ 	 * use, TSC clocksource
+ 	 */
+	if (gtod->clock.vclock_mode != VCLOCK_TSC &&
+	    atomic_read(&kvm_guest_has_master_clock) != 0)
+		queue_work(system_long_wq, &pvclock_gtod_work);
+
+	return 0;
+}
+
+static struct notifier_block pvclock_gtod_notifier = {
+	.notifier_call = pvclock_gtod_notify,
+};
+
 int kvm_arch_init(void *opaque)
 {
 	int r;
@@ -4935,6 +5016,8 @@ int kvm_arch_init(void *opaque)
 		host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
 
 	kvm_lapic_init();
+	pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
+
 	return 0;
 
 out:
@@ -4949,6 +5032,7 @@ void kvm_arch_exit(void)
 		cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
 					    CPUFREQ_TRANSITION_NOTIFIER);
 	unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
+	pvclock_gtod_unregister_notifier(&pvclock_gtod_notifier);
 	kvm_x86_ops = NULL;
 	kvm_mmu_module_exit();
 }
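
A worked example of the shifted-nanosecond bookkeeping in update_pvclock_gtod()
above (illustrative numbers only): xtime_nsec is kept left-shifted by tk->shift,
so wall_to_monotonic.tv_nsec is shifted before the addition and whole seconds
are carried out of the shifted sum.

#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_SEC	1000000000ULL

int main(void)
{
	/* Illustrative values, not taken from a real timekeeper. */
	unsigned int shift = 10;
	uint64_t xtime_nsec = 900000000ULL << shift;	/* 0.9 s, stored shifted */
	long wtm_nsec = 300000000L;			/* wall_to_monotonic: 0.3 s */
	long long sec = 1000;

	uint64_t snsec = xtime_nsec + ((uint64_t)wtm_nsec << shift);
	while (snsec >= (NSEC_PER_SEC << shift)) {
		snsec -= NSEC_PER_SEC << shift;
		sec++;					/* carry a whole second */
	}
	/* Prints: sec=1001, ns=200000000 */
	printf("sec=%lld, ns=%llu\n", sec, (unsigned long long)(snsec >> shift));
	return 0;
}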



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (12 preceding siblings ...)
  2012-11-19 21:58   ` [patch 14/18] KVM: x86: notifier for clocksource changes Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 16/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization Marcelo Tosatti
                     ` (3 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 14-host-pass-stable-pvclock-flag --]
[-- Type: text/plain, Size: 13136 bytes --]

KVM added a global variable to guarantee monotonicity in the guest. 
One of the reasons for that is that the time between

	1. ktime_get_ts(&timespec);
	2. rdtscll(tsc);

is variable. That is, given a host with a stable TSC, suppose that
two VCPUs read the same time via ktime_get_ts() above.

The time required to execute 2. is not the same on those two instances
executing on different VCPUs (cache misses, interrupts, ...).

If the TSC value that is used by the host to interpolate when 
calculating the monotonic time is the same value used to calculate
the tsc_timestamp value stored in the pvclock data structure, and
a single <system_timestamp, tsc_timestamp> tuple is visible to all 
vcpus simultaneously, this problem disappears. See comment on top
of pvclock_update_vm_gtod_copy for details.

Monotonicity is then guaranteed by synchronicity of the host TSCs
and guest TSCs. 

Set TSC stable pvclock flag in that case, allowing the guest to read
clock from userspace.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1186,21 +1186,166 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
+static cycle_t read_tsc(void)
+{
+	cycle_t ret;
+	u64 last;
+
+	/*
+	 * Empirically, a fence (of type that depends on the CPU)
+	 * before rdtsc is enough to ensure that rdtsc is ordered
+	 * with respect to loads.  The various CPU manuals are unclear
+	 * as to whether rdtsc can be reordered with later loads,
+	 * but no one has ever seen it happen.
+	 */
+	rdtsc_barrier();
+	ret = (cycle_t)vget_cycles();
+
+	last = pvclock_gtod_data.clock.cycle_last;
+
+	if (likely(ret >= last))
+		return ret;
+
+	/*
+	 * GCC likes to generate cmov here, but this branch is extremely
+	 * predictable (it's just a function of time and the likely is
+	 * very likely) and there's a data dependence, so force GCC
+	 * to generate a branch instead.  I don't barrier() because
+	 * we don't actually need a barrier, and if this function
+	 * ever gets inlined it will generate worse code.
+	 */
+	asm volatile ("");
+	return last;
+}
+
+static inline u64 vgettsc(cycle_t *cycle_now)
+{
+	long v;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	*cycle_now = read_tsc();
+
+	v = (*cycle_now - gtod->clock.cycle_last) & gtod->clock.mask;
+	return v * gtod->clock.mult;
+}
+
+static int do_monotonic(struct timespec *ts, cycle_t *cycle_now)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	ts->tv_nsec = 0;
+	do {
+		seq = read_seqcount_begin(&gtod->seq);
+		mode = gtod->clock.vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_sec;
+		ns = gtod->monotonic_time_snsec;
+		ns += vgettsc(cycle_now);
+		ns >>= gtod->clock.shift;
+	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+	timespec_add_ns(ts, ns);
+
+	return mode;
+}
+
+/* returns true if host is using tsc clocksource */
+static bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now)
+{
+	struct timespec ts;
+
+	/* checked again under seqlock below */
+	if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
+		return false;
+
+	if (do_monotonic(&ts, cycle_now) != VCLOCK_TSC)
+		return false;
+
+	monotonic_to_bootbased(&ts);
+	*kernel_ns = timespec_to_ns(&ts);
+
+	return true;
+}
+
+
+/*
+ *
+ * Assuming a stable TSC across physical CPUS, the following condition
+ * is possible. Each numbered line represents an event visible to both
+ * CPUs at the next numbered event.
+ *
+ * "timespecX" represents host monotonic time. "tscX" represents
+ * RDTSC value.
+ *
+ * 		VCPU0 on CPU0		|	VCPU1 on CPU1
+ *
+ * 1.  read timespec0,tsc0
+ * 2.					| timespec1 = timespec0 + N
+ * 					| tsc1 = tsc0 + M
+ * 3. transition to guest		| transition to guest
+ * 4. ret0 = timespec0 + (rdtsc - tsc0) |
+ * 5.				        | ret1 = timespec1 + (rdtsc - tsc1)
+ * 				        | ret1 = timespec0 + N + (rdtsc - (tsc0 + M))
+ *
+ * Since ret0 update is visible to VCPU1 at time 5, to obey monotonicity:
+ *
+ * 	- ret0 < ret1
+ *	- timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
+ *		...
+ *	- 0 < N - M => M < N
+ *
+ * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
+ * always the case (the difference between two distinct xtime instances
+ * might be smaller then the difference between corresponding TSC reads,
+ * when updating guest vcpus pvclock areas).
+ *
+ * To avoid that problem, do not allow visibility of distinct
+ * system_timestamp/tsc_timestamp values simultaneously: use a master
+ * copy of host monotonic time values. Update that master copy
+ * in lockstep.
+ *
+ * Rely on synchronization of host TSCs for monotonicity.
+ *
+ */
+
+static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
+{
+	struct kvm_arch *ka = &kvm->arch;
+	int vclock_mode;
+
+	/*
+ 	 * If the host uses TSC clock, then passthrough TSC as stable
+	 * to the guest.
+	 */
+	ka->use_master_clock = kvm_get_time_and_clockread(
+					&ka->master_kernel_ns,
+					&ka->master_cycle_now);
+
+	if (ka->use_master_clock)
+		atomic_set(&kvm_guest_has_master_clock, 1);
+
+	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
+	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode);
+}
+
 static int kvm_guest_time_update(struct kvm_vcpu *v)
 {
-	unsigned long flags;
+	unsigned long flags, this_tsc_khz;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
+	struct kvm_arch *ka = &v->kvm->arch;
 	void *shared_kaddr;
-	unsigned long this_tsc_khz;
 	s64 kernel_ns, max_kernel_ns;
-	u64 tsc_timestamp;
+	u64 tsc_timestamp, host_tsc;
 	struct pvclock_vcpu_time_info *guest_hv_clock;
 	u8 pvclock_flags;
+	bool use_master_clock;
+
+	kernel_ns = 0;
+	host_tsc = 0;
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, native_read_tsc());
-	kernel_ns = get_kernel_ns();
 	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 	if (unlikely(this_tsc_khz == 0)) {
 		local_irq_restore(flags);
@@ -1208,6 +1353,24 @@ static int kvm_guest_time_update(struct 
 		return 1;
 	}
 
+  	/*
+  	 * If the host uses TSC clock, then passthrough TSC as stable
+ 	 * to the guest.
+ 	 */
+ 	spin_lock(&ka->pvclock_gtod_sync_lock);
+ 	use_master_clock = ka->use_master_clock;
+ 	if (use_master_clock) {
+ 		host_tsc = ka->master_cycle_now;
+ 		kernel_ns = ka->master_kernel_ns;
+ 	}
+ 	spin_unlock(&ka->pvclock_gtod_sync_lock);
+ 	if (!use_master_clock) {
+ 		host_tsc = native_read_tsc();
+ 		kernel_ns = get_kernel_ns();
+ 	}
+
+ 	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, host_tsc);
+
 	/*
 	 * We may have to catch up the TSC to match elapsed wall clock
 	 * time for two reasons, even if kvmclock is used.
@@ -1269,9 +1432,14 @@ static int kvm_guest_time_update(struct 
 		vcpu->hw_tsc_khz = this_tsc_khz;
 	}
 
-	if (max_kernel_ns > kernel_ns)
-		kernel_ns = max_kernel_ns;
-
+	/* with a master <monotonic time, tsc value> tuple,
+	 * pvclock clock reads always increase at the (scaled) rate
+	 * of guest TSC - no need to deal with sampling errors.
+	 */
+	if (!use_master_clock) {
+		if (max_kernel_ns > kernel_ns)
+			kernel_ns = max_kernel_ns;
+	}
 	/* With all the info we got, fill in the values */
 	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
 	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
@@ -1297,6 +1465,10 @@ static int kvm_guest_time_update(struct 
 		vcpu->pvclock_set_guest_stopped_request = false;
 	}
 
+	/* If the host uses TSC clocksource, then it is stable */
+	if (use_master_clock)
+		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
+
 	vcpu->hv_clock.flags = pvclock_flags;
 
 	memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
@@ -4946,6 +5118,17 @@ static void kvm_set_mmio_spte_mask(void)
 
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
+	struct kvm *kvm;
+
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	raw_spin_lock(&kvm_lock);
+	list_for_each_entry(kvm, &vm_list, vm_list)
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			set_bit(KVM_REQ_MASTERCLOCK_UPDATE, &vcpu->requests);
+	atomic_set(&kvm_guest_has_master_clock, 0);
+	raw_spin_unlock(&kvm_lock);
 }
 
 static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
@@ -5332,6 +5515,28 @@ static void process_nmi(struct kvm_vcpu 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 
+static void kvm_gen_update_masterclock(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	struct kvm_arch *ka = &kvm->arch;
+
+	spin_lock(&ka->pvclock_gtod_sync_lock);
+	kvm_make_mclock_inprogress_request(kvm);
+	/* no guest entries from this point */
+	pvclock_update_vm_gtod_copy(kvm);
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
+
+	/* guest entries allowed */
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);
+
+	spin_unlock(&ka->pvclock_gtod_sync_lock);
+
+}
+
 static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 {
 	int r;
@@ -5344,6 +5549,8 @@ static int vcpu_enter_guest(struct kvm_v
 			kvm_mmu_unload(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
 			__kvm_migrate_timers(vcpu);
+		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
+			kvm_gen_update_masterclock(vcpu->kvm);
 		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
 			r = kvm_guest_time_update(vcpu);
 			if (unlikely(r))
@@ -6248,6 +6455,8 @@ int kvm_arch_hardware_enable(void *garba
 			kvm_for_each_vcpu(i, vcpu, kvm) {
 				vcpu->arch.tsc_offset_adjustment += delta_cyc;
 				vcpu->arch.last_host_tsc = local_tsc;
+				set_bit(KVM_REQ_MASTERCLOCK_UPDATE,
+					&vcpu->requests);
 			}
 
 			/*
@@ -6385,6 +6594,9 @@ int kvm_arch_init_vm(struct kvm *kvm, un
 
 	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
 	mutex_init(&kvm->arch.apic_map_lock);
+	spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock);
+
+	pvclock_update_vm_gtod_copy(kvm);
 
 	return 0;
 }
Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -22,6 +22,7 @@
 #include <linux/kvm_para.h>
 #include <linux/kvm_types.h>
 #include <linux/perf_event.h>
+#include <linux/pvclock_gtod.h>
 
 #include <asm/pvclock-abi.h>
 #include <asm/desc.h>
@@ -560,6 +561,11 @@ struct kvm_arch {
 	u64 cur_tsc_offset;
 	u8  cur_tsc_generation;
 
+	spinlock_t pvclock_gtod_sync_lock;
+	bool use_master_clock;
+	u64 master_kernel_ns;
+	cycle_t master_cycle_now;
+
 	struct kvm_xen_hvm_config xen_hvm_config;
 
 	/* fields used by HYPER-V emulation */
Index: vsyscall/include/linux/kvm_host.h
===================================================================
--- vsyscall.orig/include/linux/kvm_host.h
+++ vsyscall/include/linux/kvm_host.h
@@ -118,6 +118,8 @@ static inline bool is_error_page(struct 
 #define KVM_REQ_IMMEDIATE_EXIT    15
 #define KVM_REQ_PMU               16
 #define KVM_REQ_PMI               17
+#define KVM_REQ_MASTERCLOCK_UPDATE  18
+#define KVM_REQ_MCLOCK_INPROGRESS 19
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID		0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
@@ -527,6 +529,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *
 
 void kvm_flush_remote_tlbs(struct kvm *kvm);
 void kvm_reload_remote_mmus(struct kvm *kvm);
+void kvm_make_mclock_inprogress_request(struct kvm *kvm);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
Index: vsyscall/virt/kvm/kvm_main.c
===================================================================
--- vsyscall.orig/virt/kvm/kvm_main.c
+++ vsyscall/virt/kvm/kvm_main.c
@@ -212,6 +212,11 @@ void kvm_reload_remote_mmus(struct kvm *
 	make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
 }
 
+void kvm_make_mclock_inprogress_request(struct kvm *kvm)
+{
+	make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
+}
+
 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 {
 	struct page *page;
Index: vsyscall/arch/x86/kvm/trace.h
===================================================================
--- vsyscall.orig/arch/x86/kvm/trace.h
+++ vsyscall/arch/x86/kvm/trace.h
@@ -4,6 +4,7 @@
 #include <linux/tracepoint.h>
 #include <asm/vmx.h>
 #include <asm/svm.h>
+#include <asm/clocksource.h>
 
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvm
@@ -754,6 +755,31 @@ TRACE_EVENT(
 		  __entry->write ? "Write" : "Read",
 		  __entry->gpa_match ? "GPA" : "GVA")
 );
+
+#define host_clocks				\
+	{VCLOCK_NONE, "none"},			\
+	{VCLOCK_TSC,  "tsc"},			\
+	{VCLOCK_HPET, "hpet"}			\
+
+TRACE_EVENT(kvm_update_master_clock,
+	TP_PROTO(bool use_master_clock, unsigned int host_clock),
+	TP_ARGS(use_master_clock, host_clock),
+
+	TP_STRUCT__entry(
+		__field(		bool,	use_master_clock	)
+		__field(	unsigned int,	host_clock		)
+	),
+
+	TP_fast_assign(
+		__entry->use_master_clock	= use_master_clock;
+		__entry->host_clock		= host_clock;
+	),
+
+	TP_printk("masterclock %d hostclock %s",
+		  __entry->use_master_clock,
+		  __print_symbolic(__entry->host_clock, host_clocks))
+);
+
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH
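
To make the changelog's M < N condition concrete, a small numeric example
(made-up values, 1 TSC cycle == 1 ns): the two per-vcpu pvclock areas are
updated N = 3000 ns apart in system time but M = 8000 cycles apart in
tsc_timestamp, so a read on VCPU1 at the same (or slightly later) host TSC
value returns less than an earlier read on VCPU0:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t timespec0 = 100000, tsc0 = 1000;	/* VCPU0's pvclock area */
	uint64_t timespec1 = 103000, tsc1 = 9000;	/* VCPU1's: N = 3000, M = 8000 */
	uint64_t rdtsc = 10000;				/* same host TSC on both reads */

	uint64_t ret0 = timespec0 + (rdtsc - tsc0);	/* 109000 ns on VCPU0 */
	uint64_t ret1 = timespec1 + (rdtsc - tsc1);	/* 104000 ns on VCPU1: backwards */

	printf("ret0=%llu ret1=%llu\n",
	       (unsigned long long)ret0, (unsigned long long)ret1);
	return 0;
}

With a single master <system_timestamp, tsc_timestamp> copy, both vcpus compute
the same value for the same host TSC read, and monotonicity reduces to host TSC
synchronization (plus, in patch 17, matched guest TSC offsets).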



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 16/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (13 preceding siblings ...)
  2012-11-19 21:58   ` [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 17/18] KVM: x86: require matched TSC offsets for master clock Marcelo Tosatti
                     ` (2 subsequent siblings)
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 16-add-kvm-add-vcpu-postcreate --]
[-- Type: text/plain, Size: 3730 bytes --]

TSC initialization will soon make use of online_vcpus.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/ia64/kvm/kvm-ia64.c
===================================================================
--- vsyscall.orig/arch/ia64/kvm/kvm-ia64.c
+++ vsyscall/arch/ia64/kvm/kvm-ia64.c
@@ -1330,6 +1330,11 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu 
 	return 0;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
 	return -EINVAL;
Index: vsyscall/arch/powerpc/kvm/powerpc.c
===================================================================
--- vsyscall.orig/arch/powerpc/kvm/powerpc.c
+++ vsyscall/arch/powerpc/kvm/powerpc.c
@@ -354,6 +354,11 @@ struct kvm_vcpu *kvm_arch_vcpu_create(st
 	return vcpu;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	/* Make sure we're not using the vcpu anymore */
Index: vsyscall/arch/s390/kvm/kvm-s390.c
===================================================================
--- vsyscall.orig/arch/s390/kvm/kvm-s390.c
+++ vsyscall/arch/s390/kvm/kvm-s390.c
@@ -355,6 +355,11 @@ static void kvm_s390_vcpu_initial_reset(
 	atomic_set_mask(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
 {
 	atomic_set(&vcpu->arch.sie_block->cpuflags, CPUSTAT_ZARCH |
Index: vsyscall/arch/x86/kvm/svm.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/svm.c
+++ vsyscall/arch/x86/kvm/svm.c
@@ -1254,7 +1254,6 @@ static struct kvm_vcpu *svm_create_vcpu(
 	svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT;
 	svm->asid_generation = 0;
 	init_vmcb(svm);
-	kvm_write_tsc(&svm->vcpu, 0);
 
 	err = fx_init(&svm->vcpu);
 	if (err)
Index: vsyscall/arch/x86/kvm/vmx.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/vmx.c
+++ vsyscall/arch/x86/kvm/vmx.c
@@ -3896,8 +3896,6 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL);
 	set_cr4_guest_host_mask(vmx);
 
-	kvm_write_tsc(&vmx->vcpu, 0);
-
 	return 0;
 }
 
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -6289,6 +6289,19 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu 
 	return r;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	int r;
+
+	r = vcpu_load(vcpu);
+	if (r)
+		return r;
+	kvm_write_tsc(vcpu, 0);
+	vcpu_put(vcpu);
+
+	return r;
+}
+
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
 	int r;
Index: vsyscall/include/linux/kvm_host.h
===================================================================
--- vsyscall.orig/include/linux/kvm_host.h
+++ vsyscall/include/linux/kvm_host.h
@@ -583,6 +583,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
 struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id);
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
 
 int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu);
Index: vsyscall/virt/kvm/kvm_main.c
===================================================================
--- vsyscall.orig/virt/kvm/kvm_main.c
+++ vsyscall/virt/kvm/kvm_main.c
@@ -1855,6 +1855,7 @@ static int kvm_vm_ioctl_create_vcpu(stru
 	atomic_inc(&kvm->online_vcpus);
 
 	mutex_unlock(&kvm->lock);
+	kvm_arch_vcpu_postcreate(vcpu);
 	return r;
 
 unlock_vcpu_destroy:



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 17/18] KVM: x86: require matched TSC offsets for master clock
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (14 preceding siblings ...)
  2012-11-19 21:58   ` [patch 16/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-19 21:58   ` [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration Marcelo Tosatti
  2012-11-20  9:03   ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Glauber Costa
  17 siblings, 0 replies; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 17-masterclock-require-matched-tsc --]
[-- Type: text/plain, Size: 6962 bytes --]

With the master clock, a pvclock clock read calculates:

ret = system_timestamp + [ (rdtsc + tsc_offset) - tsc_timestamp ]

where 'rdtsc' is the host TSC.

system_timestamp and tsc_timestamp are unique, one tuple 
per VM: the "master clock".

Given a host with synchronized TSCs, it's obvious that
guest TSCs must be matched for the above to guarantee monotonicity.

Allow master clock usage only if guest TSCs are synchronized.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -560,6 +560,7 @@ struct kvm_arch {
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
 	u8  cur_tsc_generation;
+	int nr_vcpus_matched_tsc;
 
 	spinlock_t pvclock_gtod_sync_lock;
 	bool use_master_clock;
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1097,12 +1097,38 @@ static u64 compute_guest_tsc(struct kvm_
 	return tsc;
 }
 
+void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
+{
+	bool vcpus_matched;
+	bool do_request = false;
+	struct kvm_arch *ka = &vcpu->kvm->arch;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
+			 atomic_read(&vcpu->kvm->online_vcpus));
+
+	if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
+		if (!ka->use_master_clock)
+			do_request = 1;
+
+	if (!vcpus_matched && ka->use_master_clock)
+			do_request = 1;
+
+	if (do_request)
+		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+
+	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
+			    atomic_read(&vcpu->kvm->online_vcpus),
+		            ka->use_master_clock, gtod->clock.vclock_mode);
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
 	s64 usdiff;
+	bool matched;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
@@ -1145,6 +1171,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 			offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
 			pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
 		}
+		matched = true;
 	} else {
 		/*
 		 * We split periods of matched TSC writes into generations.
@@ -1159,6 +1186,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 		kvm->arch.cur_tsc_nsec = ns;
 		kvm->arch.cur_tsc_write = data;
 		kvm->arch.cur_tsc_offset = offset;
+		matched = false;
 		pr_debug("kvm: new tsc generation %u, clock %llu\n",
 			 kvm->arch.cur_tsc_generation, data);
 	}
@@ -1182,6 +1210,15 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
 	kvm_x86_ops->write_tsc_offset(vcpu, offset);
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
+
+	spin_lock(&kvm->arch.pvclock_gtod_sync_lock);
+	if (matched)
+		kvm->arch.nr_vcpus_matched_tsc++;
+	else
+		kvm->arch.nr_vcpus_matched_tsc = 0;
+
+	kvm_track_tsc_matching(vcpu);
+	spin_unlock(&kvm->arch.pvclock_gtod_sync_lock);
 }
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
@@ -1271,8 +1308,9 @@ static bool kvm_get_time_and_clockread(s
 
 /*
  *
- * Assuming a stable TSC across physical CPUS, the following condition
- * is possible. Each numbered line represents an event visible to both
+ * Assuming a stable TSC across physical CPUS, and a stable TSC
+ * across virtual CPUs, the following condition is possible.
+ * Each numbered line represents an event visible to both
  * CPUs at the next numbered event.
  *
  * "timespecX" represents host monotonic time. "tscX" represents
@@ -1305,7 +1343,7 @@ static bool kvm_get_time_and_clockread(s
  * copy of host monotonic time values. Update that master copy
  * in lockstep.
  *
- * Rely on synchronization of host TSCs for monotonicity.
+ * Rely on synchronization of host TSCs and guest TSCs for monotonicity.
  *
  */
 
@@ -1313,20 +1351,27 @@ static void pvclock_update_vm_gtod_copy(
 {
 	struct kvm_arch *ka = &kvm->arch;
 	int vclock_mode;
+	bool host_tsc_clocksource, vcpus_matched;
+
+	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
+				atomic_read(&kvm->online_vcpus));
 
 	/*
  	 * If the host uses TSC clock, then passthrough TSC as stable
 	 * to the guest.
 	 */
-	ka->use_master_clock = kvm_get_time_and_clockread(
+	host_tsc_clocksource = kvm_get_time_and_clockread(
 					&ka->master_kernel_ns,
 					&ka->master_cycle_now);
 
+	ka->use_master_clock = host_tsc_clocksource & vcpus_matched;
+
 	if (ka->use_master_clock)
 		atomic_set(&kvm_guest_has_master_clock, 1);
 
 	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
-	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode);
+	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode,
+				      vcpus_matched);
 }
 
 static int kvm_guest_time_update(struct kvm_vcpu *v)
Index: vsyscall/arch/x86/kvm/trace.h
===================================================================
--- vsyscall.orig/arch/x86/kvm/trace.h
+++ vsyscall/arch/x86/kvm/trace.h
@@ -762,21 +762,54 @@ TRACE_EVENT(
 	{VCLOCK_HPET, "hpet"}			\
 
 TRACE_EVENT(kvm_update_master_clock,
-	TP_PROTO(bool use_master_clock, unsigned int host_clock),
-	TP_ARGS(use_master_clock, host_clock),
+	TP_PROTO(bool use_master_clock, unsigned int host_clock, bool offset_matched),
+	TP_ARGS(use_master_clock, host_clock, offset_matched),
 
 	TP_STRUCT__entry(
 		__field(		bool,	use_master_clock	)
 		__field(	unsigned int,	host_clock		)
+		__field(		bool,	offset_matched		)
 	),
 
 	TP_fast_assign(
 		__entry->use_master_clock	= use_master_clock;
 		__entry->host_clock		= host_clock;
+		__entry->offset_matched		= offset_matched;
 	),
 
-	TP_printk("masterclock %d hostclock %s",
+	TP_printk("masterclock %d hostclock %s offsetmatched %u",
 		  __entry->use_master_clock,
+		  __print_symbolic(__entry->host_clock, host_clocks),
+		  __entry->offset_matched)
+);
+
+TRACE_EVENT(kvm_track_tsc,
+	TP_PROTO(unsigned int vcpu_id, unsigned int nr_matched,
+		 unsigned int online_vcpus, bool use_master_clock,
+		 unsigned int host_clock),
+	TP_ARGS(vcpu_id, nr_matched, online_vcpus, use_master_clock,
+		host_clock),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	vcpu_id			)
+		__field(	unsigned int,	nr_vcpus_matched_tsc	)
+		__field(	unsigned int,	online_vcpus		)
+		__field(	bool,		use_master_clock	)
+		__field(	unsigned int,	host_clock		)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id		= vcpu_id;
+		__entry->nr_vcpus_matched_tsc	= nr_matched;
+		__entry->online_vcpus		= online_vcpus;
+		__entry->use_master_clock	= use_master_clock;
+		__entry->host_clock		= host_clock;
+	),
+
+	TP_printk("vcpu_id %u masterclock %u offsetmatched %u nr_online %u"
+		  " hostclock %s",
+		  __entry->vcpu_id, __entry->use_master_clock,
+		  __entry->nr_vcpus_matched_tsc, __entry->online_vcpus,
 		  __print_symbolic(__entry->host_clock, host_clocks))
 );
 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (15 preceding siblings ...)
  2012-11-19 21:58   ` [patch 17/18] KVM: x86: require matched TSC offsets for master clock Marcelo Tosatti
@ 2012-11-19 21:58   ` Marcelo Tosatti
  2012-11-20  9:01     ` Glauber Costa
  2012-11-20  9:03   ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Glauber Costa
  17 siblings, 1 reply; 49+ messages in thread
From: Marcelo Tosatti @ 2012-11-19 21:58 UTC (permalink / raw)
  To: mingo, kvm; +Cc: johnstul, jeremy, glommer, Marcelo Tosatti

[-- Attachment #1: 18-do-not-writeclock-on-cpu-migration --]
[-- Type: text/plain, Size: 886 bytes --]

As requested by Glauber, do not update the kvmclock area on vcpu->pcpu
migration when the host has a stable, synchronized TSC (master clock in use).

This reduces cacheline bouncing.
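
For illustration only (not part of the patch), a minimal sketch of the
resulting policy in kvm_arch_vcpu_load(); the helper is hypothetical, and
vcpu->cpu == -1 is assumed to mean the vCPU has not run yet and therefore
still needs its first clock update.

/* Sketch: decide whether a pvclock update is needed when loading a vCPU. */
static bool need_clock_update_on_load(struct kvm_vcpu *vcpu)
{
	/* No master clock: every pCPU migration must refresh the pvclock area. */
	if (!vcpu->kvm->arch.use_master_clock)
		return true;

	/* First load of this vCPU (assumed: cpu == -1 means "never ran"). */
	if (vcpu->cpu == -1)
		return true;

	/* Stable, synchronized TSC: skip the update and avoid cacheline bouncing. */
	return false;
}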

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -2615,7 +2615,12 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
 			kvm_x86_ops->write_tsc_offset(vcpu, offset);
 			vcpu->arch.tsc_catchup = 1;
 		}
-		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		/*
+ 		 * On a host with synchronized TSC, there is no need to update
+ 		 * kvmclock on vcpu->cpu migration
+ 		 */
+		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
+			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		if (vcpu->cpu != cpu)
 			kvm_migrate_timers(vcpu);
 		vcpu->cpu = cpu;



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration
  2012-11-19 21:58   ` [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration Marcelo Tosatti
@ 2012-11-20  9:01     ` Glauber Costa
  0 siblings, 0 replies; 49+ messages in thread
From: Glauber Costa @ 2012-11-20  9:01 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: mingo, kvm, johnstul, jeremy

On 11/20/2012 01:58 AM, Marcelo Tosatti wrote:
> As requested by Glauber, do not update kvmclock area on vcpu->pcpu 
> migration, in case the host has stable TSC. 
> 
> This is to reduce cacheline bouncing.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: vsyscall/arch/x86/kvm/x86.c
> ===================================================================
> --- vsyscall.orig/arch/x86/kvm/x86.c
> +++ vsyscall/arch/x86/kvm/x86.c
> @@ -2615,7 +2615,12 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
>  			kvm_x86_ops->write_tsc_offset(vcpu, offset);
>  			vcpu->arch.tsc_catchup = 1;
>  		}
> -		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +		/*
> + 		 * On a host with synchronized TSC, there is no need to update
> + 		 * kvmclock on vcpu->cpu migration
> + 		 */
> +		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
> +			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>  		if (vcpu->cpu != cpu)
>  			kvm_migrate_timers(vcpu);
>  		vcpu->cpu = cpu;

Ok. Since you are only touching the one in kvm_arch_vcpu_load() and
leaving the others untouched, it looks correct.

Acked-by: Glauber Costa <glommer@parallels.com>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5)
  2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
                     ` (16 preceding siblings ...)
  2012-11-19 21:58   ` [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration Marcelo Tosatti
@ 2012-11-20  9:03   ` Glauber Costa
  17 siblings, 0 replies; 49+ messages in thread
From: Glauber Costa @ 2012-11-20  9:03 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: mingo, kvm, johnstul, jeremy

On 11/20/2012 01:57 AM, Marcelo Tosatti wrote:
> This patchset, based on earlier work by Jeremy Fitzhardinge, implements
> paravirtual clock vsyscall support.
> 
> It should be possible to implement Xen support relatively easily.
> 
> It reduces clock_gettime from 500 cycles to 200 cycles
> on my testbox.
> 

There are no more significant objections from my side.
I will still try to go through it again today just in case.


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2012-11-20  9:03 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-15  0:08 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4) Marcelo Tosatti
2012-11-15  0:08 ` [patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
2012-11-15  0:08 ` [patch 02/18] x86: kvmclock: allocate pvclock shared memory area Marcelo Tosatti
2012-11-15 17:05   ` Glauber Costa
2012-11-16  2:07     ` [patch 02/18] x86: kvmclock: allocate pvclock shared memory area (v2) Marcelo Tosatti
2012-11-16  7:06       ` Glauber Costa
2012-11-15  0:08 ` [patch 03/18] x86: pvclock: make sure rdtsc doesnt speculate out of region Marcelo Tosatti
2012-11-15  0:08 ` [patch 04/18] x86: pvclock: remove pvclock_shadow_time Marcelo Tosatti
2012-11-15  0:08 ` [patch 05/18] x86: pvclock: create helper for pvclock data retrieval Marcelo Tosatti
2012-11-15 12:27   ` Glauber Costa
2012-11-15  0:08 ` [patch 06/18] x86: pvclock: introduce helper to read flags Marcelo Tosatti
2012-11-15 12:28   ` Glauber Costa
2012-11-15  0:08 ` [patch 07/18] x86: pvclock: add note about rdtsc barriers Marcelo Tosatti
2012-11-15 12:30   ` Glauber Costa
2012-11-16  2:05     ` [patch 07/18] x86: pvclock: add note about rdtsc barriers (v2) Marcelo Tosatti
2012-11-15  0:08 ` [patch 08/18] sched: add notifier for cross-cpu migrations Marcelo Tosatti
2012-11-15  7:01   ` Gleb Natapov
2012-11-15  0:08 ` [patch 09/18] x86: pvclock: generic pvclock vsyscall initialization Marcelo Tosatti
2012-11-15  0:08 ` [patch 10/18] x86: kvm guest: pvclock vsyscall support Marcelo Tosatti
2012-11-15  0:08 ` [patch 11/18] x86: vdso: pvclock gettime support Marcelo Tosatti
2012-11-15  0:08 ` [patch 12/18] KVM: x86: pass host_tsc to read_l1_tsc Marcelo Tosatti
2012-11-15  0:08 ` [patch 13/18] time: export time information for KVM pvclock Marcelo Tosatti
2012-11-15  1:38   ` John Stultz
2012-11-15  0:08 ` [patch 14/18] KVM: x86: notifier for clocksource changes Marcelo Tosatti
2012-11-15  0:08 ` [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
2012-11-15  0:08 ` [patch 16/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization Marcelo Tosatti
2012-11-15  0:08 ` [patch 17/18] KVM: x86: require matched TSC offsets for master clock Marcelo Tosatti
2012-11-15  0:08 ` [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration Marcelo Tosatti
2012-11-15 12:34   ` Glauber Costa
2012-11-19 21:57 ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Marcelo Tosatti
2012-11-19 21:57   ` [patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
2012-11-19 21:58   ` [patch 02/18] x86: kvmclock: allocate pvclock shared memory area Marcelo Tosatti
2012-11-19 21:58   ` [patch 03/18] x86: pvclock: make sure rdtsc doesnt speculate out of region Marcelo Tosatti
2012-11-19 21:58   ` [patch 04/18] x86: pvclock: remove pvclock_shadow_time Marcelo Tosatti
2012-11-19 21:58   ` [patch 05/18] x86: pvclock: create helper for pvclock data retrieval Marcelo Tosatti
2012-11-19 21:58   ` [patch 06/18] x86: pvclock: introduce helper to read flags Marcelo Tosatti
2012-11-19 21:58   ` [patch 07/18] x86: pvclock: add note about rdtsc barriers Marcelo Tosatti
2012-11-19 21:58   ` [patch 08/18] sched: add notifier for cross-cpu migrations Marcelo Tosatti
2012-11-19 21:58   ` [patch 09/18] x86: pvclock: generic pvclock vsyscall initialization Marcelo Tosatti
2012-11-19 21:58   ` [patch 11/18] x86: vdso: pvclock gettime support Marcelo Tosatti
2012-11-19 21:58   ` [patch 12/18] KVM: x86: pass host_tsc to read_l1_tsc Marcelo Tosatti
2012-11-19 21:58   ` [patch 13/18] time: export time information for KVM pvclock Marcelo Tosatti
2012-11-19 21:58   ` [patch 14/18] KVM: x86: notifier for clocksource changes Marcelo Tosatti
2012-11-19 21:58   ` [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
2012-11-19 21:58   ` [patch 16/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization Marcelo Tosatti
2012-11-19 21:58   ` [patch 17/18] KVM: x86: require matched TSC offsets for master clock Marcelo Tosatti
2012-11-19 21:58   ` [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration Marcelo Tosatti
2012-11-20  9:01     ` Glauber Costa
2012-11-20  9:03   ` [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5) Glauber Costa
