* [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v2)
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm; +Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini

This patchset, based on earlier work by Jeremy Fitzhardinge, implements
paravirtual clock vsyscall support.

It should be possible to implement Xen support relatively easily.

It reduces the cost of clock_gettime from 500 cycles to 200 cycles
on my testbox.
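
For reference, a measurement of this kind can be reproduced with a
minimal userspace loop such as the one below (an illustrative sketch,
not part of the original posting; the iteration count and clock id
are arbitrary choices):

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	enum { ITERS = 1000000 };
	struct timespec ts;
	uint64_t start, end;
	int i;

	start = rdtsc();
	for (i = 0; i < ITERS; i++)
		clock_gettime(CLOCK_MONOTONIC, &ts);
	end = rdtsc();

	printf("%llu cycles/call\n",
	       (unsigned long long)((end - start) / ITERS));
	return 0;
}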

Please review.

From my POV, this is ready to merge.

v2: 
- Do not allow visibility of different <system_timestamp, tsc_timestamp> 
tuples.
- Add option to disable vsyscall.





* [patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: x86-kvm-retain-guest-stopped.patch --]
[-- Type: text/plain, Size: 1659 bytes --]

Otherwise it's possible for an unrelated KVM_REQ_CLOCK_UPDATE (such as one
due to CPU migration) to clear the bit.

Noticed by Paolo Bonzini.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1143,6 +1143,7 @@ static int kvm_guest_time_update(struct 
 	unsigned long this_tsc_khz;
 	s64 kernel_ns, max_kernel_ns;
 	u64 tsc_timestamp;
+	struct pvclock_vcpu_time_info *guest_hv_clock;
 	u8 pvclock_flags;
 
 	/* Keep irq disabled to prevent changes to the clock */
@@ -1226,13 +1227,6 @@ static int kvm_guest_time_update(struct 
 	vcpu->last_kernel_ns = kernel_ns;
 	vcpu->last_guest_tsc = tsc_timestamp;
 
-	pvclock_flags = 0;
-	if (vcpu->pvclock_set_guest_stopped_request) {
-		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
-		vcpu->pvclock_set_guest_stopped_request = false;
-	}
-
-	vcpu->hv_clock.flags = pvclock_flags;
 
 	/*
 	 * The interface expects us to write an even number signaling that the
@@ -1243,6 +1237,18 @@ static int kvm_guest_time_update(struct 
 
 	shared_kaddr = kmap_atomic(vcpu->time_page);
 
+	guest_hv_clock = shared_kaddr + vcpu->time_offset;
+
+	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
+	pvclock_flags = (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
+
+	if (vcpu->pvclock_set_guest_stopped_request) {
+		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
+		vcpu->pvclock_set_guest_stopped_request = false;
+	}
+
+	vcpu->hv_clock.flags = pvclock_flags;
+
 	memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
 	       sizeof(vcpu->hv_clock));
 




* [patch 02/18] x86: pvclock: make sure rdtsc doesnt speculate out of region
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 01-pvclock-read-rdtsc-barrier --]
[-- Type: text/plain, Size: 745 bytes --]

Originally from Jeremy Fitzhardinge.

pvclock_get_time_values, which contains the memory barriers, will be
removed by the next patch.
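
For context: barrier() is only a compiler barrier, while rdtsc_barrier()
also emits a fence instruction, so the CPU itself cannot speculate the
TSC read out of the protected region. Its definition in the x86 headers
of this era is roughly the following (quoted from memory, so treat the
details as an assumption):

/* Patched at boot via alternatives() to the fence type the CPU
 * requires, or left as NOPs where RDTSC is already ordered. */
static __always_inline void rdtsc_barrier(void)
{
	alternative(ASM_NOP3, "mfence", X86_FEATURE_MFENCE_RDTSC);
	alternative(ASM_NOP3, "lfence", X86_FEATURE_LFENCE_RDTSC);
}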

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
 
 	do {
 		version = pvclock_get_time_values(&shadow, src);
-		barrier();
+		rdtsc_barrier();
 		offset = pvclock_get_nsec_offset(&shadow);
 		ret = shadow.system_timestamp + offset;
-		barrier();
+		rdtsc_barrier();
 	} while (version != src->version);
 
 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&




* [patch 03/18] x86: pvclock: remove pvclock_shadow_time
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 02-pvclock-remove-shadow-time --]
[-- Type: text/plain, Size: 2949 bytes --]

Originally from Jeremy Fitzhardinge.

We can read the information directly from "struct pvclock_vcpu_time_info",
so remove pvclock_shadow_time.
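
The retry loop below relies on the usual seqlock-style publication
protocol; a standalone sketch of the reader side (the helper name is
illustrative):

/* The hypervisor makes ->version odd before touching the fields and
 * even again afterwards, so a reader must retry while an update is in
 * flight (odd version) or completed mid-read (version changed). */
static u64 pvclock_demo_read_system_time(
			const struct pvclock_vcpu_time_info *src)
{
	u32 version;
	u64 system_time;

	do {
		version = src->version;
		rmb();		/* fetch version before data */
		system_time = src->system_time;
		rmb();		/* fetch data before version re-check */
	} while ((src->version & 1) || version != src->version);

	return system_time;
}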

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -19,21 +19,6 @@
 #include <linux/percpu.h>
 #include <asm/pvclock.h>
 
-/*
- * These are perodically updated
- *    xen: magic shared_info page
- *    kvm: gpa registered via msr
- * and then copied here.
- */
-struct pvclock_shadow_time {
-	u64 tsc_timestamp;     /* TSC at last update of time vals.  */
-	u64 system_timestamp;  /* Time, in nanosecs, since boot.    */
-	u32 tsc_to_nsec_mul;
-	int tsc_shift;
-	u32 version;
-	u8  flags;
-};
-
 static u8 valid_flags __read_mostly = 0;
 
 void pvclock_set_flags(u8 flags)
@@ -41,32 +26,11 @@ void pvclock_set_flags(u8 flags)
 	valid_flags = flags;
 }
 
-static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
+static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
 {
-	u64 delta = native_read_tsc() - shadow->tsc_timestamp;
-	return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
-				   shadow->tsc_shift);
-}
-
-/*
- * Reads a consistent set of time-base values from hypervisor,
- * into a shadow data area.
- */
-static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
-					struct pvclock_vcpu_time_info *src)
-{
-	do {
-		dst->version = src->version;
-		rmb();		/* fetch version before data */
-		dst->tsc_timestamp     = src->tsc_timestamp;
-		dst->system_timestamp  = src->system_time;
-		dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
-		dst->tsc_shift         = src->tsc_shift;
-		dst->flags             = src->flags;
-		rmb();		/* test version after fetching data */
-	} while ((src->version & 1) || (dst->version != src->version));
-
-	return dst->version;
+	u64 delta = native_read_tsc() - src->tsc_timestamp;
+	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
+				   src->tsc_shift);
 }
 
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
@@ -90,21 +54,20 @@ void pvclock_resume(void)
 
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
-	struct pvclock_shadow_time shadow;
 	unsigned version;
 	cycle_t ret, offset;
 	u64 last;
 
 	do {
-		version = pvclock_get_time_values(&shadow, src);
+		version = src->version;
 		rdtsc_barrier();
-		offset = pvclock_get_nsec_offset(&shadow);
-		ret = shadow.system_timestamp + offset;
+		offset = pvclock_get_nsec_offset(src);
+		ret = src->system_time + offset;
 		rdtsc_barrier();
-	} while (version != src->version);
+	} while ((src->version & 1) || version != src->version);
 
 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
-		(shadow.flags & PVCLOCK_TSC_STABLE_BIT))
+		(src->flags & PVCLOCK_TSC_STABLE_BIT))
 		return ret;
 
 	/*




* [patch 04/18] x86: pvclock: create helper for pvclock data retrieval
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 03-move-pvread-to-pvheader --]
[-- Type: text/plain, Size: 2146 bytes --]

Originally from Jeremy Fitzhardinge.

So the code can be reused (see the sketch below for the intended caller
pattern).
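
One motivation for a header helper is that the vDSO, added later in this
series, cannot call kernel functions but can inline header code. A
sketch of the caller pattern (the wrapper name is illustrative):

/* Retry until a consistent (system_time, tsc) snapshot is read. */
static __always_inline cycle_t
pvclock_demo_read(const struct pvclock_vcpu_time_info *src)
{
	unsigned version;
	cycle_t cycles;

	do {
		version = __pvclock_read_cycles(src, &cycles);
	} while ((src->version & 1) || version != src->version);

	return cycles;
}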

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -26,13 +26,6 @@ void pvclock_set_flags(u8 flags)
 	valid_flags = flags;
 }
 
-static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
-{
-	u64 delta = native_read_tsc() - src->tsc_timestamp;
-	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
-				   src->tsc_shift);
-}
-
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
 {
 	u64 pv_tsc_khz = 1000000ULL << 32;
@@ -55,15 +48,11 @@ void pvclock_resume(void)
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
 	unsigned version;
-	cycle_t ret, offset;
+	cycle_t ret;
 	u64 last;
 
 	do {
-		version = src->version;
-		rdtsc_barrier();
-		offset = pvclock_get_nsec_offset(src);
-		ret = src->system_time + offset;
-		rdtsc_barrier();
+		version = __pvclock_read_cycles(src, &ret);
 	} while ((src->version & 1) || version != src->version);
 
 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -56,4 +56,29 @@ static inline u64 pvclock_scale_delta(u6
 	return product;
 }
 
+static __always_inline
+u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
+{
+	u64 delta = __native_read_tsc() - src->tsc_timestamp;
+	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
+				   src->tsc_shift);
+}
+
+static __always_inline
+unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
+			       cycle_t *cycles)
+{
+	unsigned version;
+	cycle_t ret, offset;
+
+	version = src->version;
+	rdtsc_barrier();
+	offset = pvclock_get_nsec_offset(src);
+	ret = src->system_time + offset;
+	rdtsc_barrier();
+
+	*cycles = ret;
+	return version;
+}
+
 #endif /* _ASM_X86_PVCLOCK_H */




* [patch 05/18] x86: pvclock: fix flags usage race
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 04-pvclock-read-cycles-return-flags --]
[-- Type: text/plain, Size: 1640 bytes --]

The validity of the values returned by pvclock, including the flags, is
guaranteed only by the version checks.

That is, a read of src->flags outside of version-check protection can
refer to a different paravirt clock update by the hypervisor than the
<system_timestamp, tsc_timestamp> pair that was returned. Read the
flags inside the protected region instead.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -66,18 +66,21 @@ u64 pvclock_get_nsec_offset(const struct
 
 static __always_inline
 unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
-			       cycle_t *cycles)
+			       cycle_t *cycles, u8 *flags)
 {
 	unsigned version;
 	cycle_t ret, offset;
+	u8 ret_flags;
 
 	version = src->version;
 	rdtsc_barrier();
 	offset = pvclock_get_nsec_offset(src);
 	ret = src->system_time + offset;
+	ret_flags = src->flags;
 	rdtsc_barrier();
 
 	*cycles = ret;
+	*flags = ret_flags;
 	return version;
 }
 
Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -50,13 +50,14 @@ cycle_t pvclock_clocksource_read(struct 
 	unsigned version;
 	cycle_t ret;
 	u64 last;
+	u8 flags;
 
 	do {
-		version = __pvclock_read_cycles(src, &ret);
+		version = __pvclock_read_cycles(src, &ret, &flags);
 	} while ((src->version & 1) || version != src->version);
 
 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
-		(src->flags & PVCLOCK_TSC_STABLE_BIT))
+		(flags & PVCLOCK_TSC_STABLE_BIT))
 		return ret;
 
 	/*




* [patch 06/18] x86: pvclock: introduce helper to read flags
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 05-pvclock-add-get-flags --]
[-- Type: text/plain, Size: 1278 bytes --]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -45,6 +45,19 @@ void pvclock_resume(void)
 	atomic64_set(&last_value, 0);
 }
 
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src)
+{
+	unsigned version;
+	cycle_t ret;
+	u8 flags;
+
+	do {
+		version = __pvclock_read_cycles(src, &ret, &flags);
+	} while ((src->version & 1) || version != src->version);
+
+	return flags & valid_flags;
+}
+
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
 	unsigned version;
Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
 void pvclock_set_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
 void pvclock_read_wallclock(struct pvclock_wall_clock *wall,




* [patch 07/18] sched: add notifier for cross-cpu migrations
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 06-add-task-migration-notifier --]
[-- Type: text/plain, Size: 1732 bytes --]

Originally from Jeremy Fitzhardinge.
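
The notifier fires on every cross-CPU task migration. A minimal sketch
of a consumer (the names are illustrative; the real user, which bumps a
per-cpu migration count for pvclock, is added in a later patch):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/notifier.h>
#include <linux/sched.h>

/* Runs on the migration path under an atomic notifier chain, so it
 * must not sleep. */
static int demo_task_migrate(struct notifier_block *nb,
			     unsigned long action, void *data)
{
	struct task_migration_notifier *mn = data;

	pr_debug("task %d: cpu %d -> cpu %d\n",
		 mn->task->pid, mn->from_cpu, mn->to_cpu);
	return NOTIFY_DONE;
}

static struct notifier_block demo_migration_nb = {
	.notifier_call = demo_task_migrate,
};

static int __init demo_init(void)
{
	register_task_migration_notifier(&demo_migration_nb);
	return 0;
}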

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/include/linux/sched.h
===================================================================
--- vsyscall.orig/include/linux/sched.h
+++ vsyscall/include/linux/sched.h
@@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void)
 extern void calc_global_load(unsigned long ticks);
 extern void update_cpu_load_nohz(void);
 
+/* Notifier for when a task gets migrated to a new CPU */
+struct task_migration_notifier {
+	struct task_struct *task;
+	int from_cpu;
+	int to_cpu;
+};
+extern void register_task_migration_notifier(struct notifier_block *n);
+
 extern unsigned long get_parent_ip(unsigned long addr);
 
 struct seq_file;
Index: vsyscall/kernel/sched/core.c
===================================================================
--- vsyscall.orig/kernel/sched/core.c
+++ vsyscall/kernel/sched/core.c
@@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s
 		rq->skip_clock_update = 1;
 }
 
+static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
+
+void register_task_migration_notifier(struct notifier_block *n)
+{
+	atomic_notifier_chain_register(&task_migration_notifier, n);
+}
+
 #ifdef CONFIG_SMP
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
@@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p,
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+		struct task_migration_notifier tmn;
+
 		p->se.nr_migrations++;
 		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
+
+		tmn.task = p;
+		tmn.from_cpu = task_cpu(p);
+		tmn.to_cpu = new_cpu;
+
+		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
 	}
 
 	__set_task_cpu(p, new_cpu);




* [patch 08/18] x86: pvclock: generic pvclock vsyscall initialization
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 07-add-pvclock-structs-and-fixmap --]
[-- Type: text/plain, Size: 5321 bytes --]

Originally from Jeremy Fitzhardinge.

Introduce generic, non-hypervisor-specific pvclock initialization
routines.
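
The sizing macros added to pvclock.h below pad one entry per vcpu to a
cache line; the page count works out as follows (a worked example,
assuming the common x86-64 values PAGE_SIZE = 4096 and
SMP_CACHE_BYTES = 64):

/* PVTI_SIZE = sizeof(aligned_pvti_t) = 64 bytes (one cache line)
 * entries per page = PAGE_SIZE / PVTI_SIZE = 4096 / 64 = 64
 *
 * PVCLOCK_VSYSCALL_NR_PAGES = (NR_CPUS - 1) / 64 + 1, so:
 *   NR_CPUS =  64  ->  63 / 64 + 1 = 1 page
 *   NR_CPUS = 256  -> 255 / 64 + 1 = 4 pages
 */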

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -17,6 +17,10 @@
 
 #include <linux/kernel.h>
 #include <linux/percpu.h>
+#include <linux/notifier.h>
+#include <linux/sched.h>
+#include <linux/gfp.h>
+#include <linux/bootmem.h>
 #include <asm/pvclock.h>
 
 static u8 valid_flags __read_mostly = 0;
@@ -122,3 +126,70 @@ void pvclock_read_wallclock(struct pvclo
 
 	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
+
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+
+static aligned_pvti_t *pvclock_vdso_info;
+
+static struct pvclock_vsyscall_time_info *pvclock_get_vsyscall_user_time_info(int cpu)
+{
+	if (pvclock_vdso_info == NULL) {
+		BUG();
+		return NULL;
+	}
+
+	return &pvclock_vdso_info[cpu].info;
+}
+
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu)
+{
+	return &pvclock_get_vsyscall_user_time_info(cpu)->pvti;
+}
+
+int pvclock_task_migrate(struct notifier_block *nb, unsigned long l, void *v)
+{
+	struct task_migration_notifier *mn = v;
+	struct pvclock_vsyscall_time_info *pvti;
+
+	pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu);
+
+	if (pvti == NULL)
+		return NOTIFY_DONE;
+
+	pvti->migrate_count++;
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block pvclock_migrate = {
+	.notifier_call = pvclock_task_migrate,
+};
+
+/*
+ * Initialize the generic pvclock vsyscall state.  This will allocate
+ * a/some page(s) for the per-vcpu pvclock information, set up a
+ * fixmap mapping for the page(s)
+ */
+int __init pvclock_init_vsyscall(void)
+{
+	int idx;
+	unsigned int size = PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE;
+
+	pvclock_vdso_info = __alloc_bootmem(size, PAGE_SIZE, 0);
+	if (!pvclock_vdso_info)
+		return -ENOMEM;
+
+	memset(pvclock_vdso_info, 0, size);
+
+	for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
+		__set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
+			     __pa_symbol(pvclock_vdso_info) + (idx*PAGE_SIZE),
+		     	     PAGE_KERNEL_VVAR);
+	}
+
+	register_task_migration_notifier(&pvclock_migrate);
+
+	return 0;
+}
+
+#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */
Index: vsyscall/arch/x86/include/asm/fixmap.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/fixmap.h
+++ vsyscall/arch/x86/include/asm/fixmap.h
@@ -19,6 +19,7 @@
 #include <asm/acpi.h>
 #include <asm/apicdef.h>
 #include <asm/page.h>
+#include <asm/pvclock.h>
 #ifdef CONFIG_X86_32
 #include <linux/threads.h>
 #include <asm/kmap_types.h>
@@ -81,6 +82,10 @@ enum fixed_addresses {
 	VVAR_PAGE,
 	VSYSCALL_HPET,
 #endif
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+	PVCLOCK_FIXMAP_BEGIN,
+	PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
+#endif
 	FIX_DBGP_BASE,
 	FIX_EARLYCON_MEM_BASE,
 #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -13,6 +13,8 @@ void pvclock_read_wallclock(struct pvclo
 			    struct pvclock_vcpu_time_info *vcpu,
 			    struct timespec *ts);
 void pvclock_resume(void);
+int __init pvclock_init_vsyscall(void);
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
 
 /*
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
@@ -85,4 +87,24 @@ unsigned __pvclock_read_cycles(const str
 	return version;
 }
 
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+
+struct pvclock_vsyscall_time_info {
+	struct pvclock_vcpu_time_info pvti;
+	u32 migrate_count;
+};
+
+typedef union {
+	struct pvclock_vsyscall_time_info info;
+	char pad[SMP_CACHE_BYTES];
+} aligned_pvti_t ____cacheline_aligned;
+
+#define PVTI_SIZE sizeof(aligned_pvti_t)
+#if NR_CPUS == 1
+#define PVCLOCK_VSYSCALL_NR_PAGES 1
+#else
+#define PVCLOCK_VSYSCALL_NR_PAGES ((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1
+#endif
+#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */
+
 #endif /* _ASM_X86_PVCLOCK_H */
Index: vsyscall/arch/x86/include/asm/clocksource.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/clocksource.h
+++ vsyscall/arch/x86/include/asm/clocksource.h
@@ -8,6 +8,7 @@
 #define VCLOCK_NONE 0  /* No vDSO clock available.	*/
 #define VCLOCK_TSC  1  /* vDSO should use vread_tsc.	*/
 #define VCLOCK_HPET 2  /* vDSO should use vread_hpet.	*/
+#define VCLOCK_PVCLOCK 3 /* vDSO should use vread_pvclock. */
 
 struct arch_clocksource_data {
 	int vclock_mode;
Index: vsyscall/arch/x86/Kconfig
===================================================================
--- vsyscall.orig/arch/x86/Kconfig
+++ vsyscall/arch/x86/Kconfig
@@ -632,6 +632,13 @@ config PARAVIRT_SPINLOCKS
 
 config PARAVIRT_CLOCK
 	bool
+config PARAVIRT_CLOCK_VSYSCALL
+	bool "Paravirt clock vsyscall support"
+	depends on PARAVIRT_CLOCK && GENERIC_TIME_VSYSCALL
+	---help---
+	  Enable performance critical clock related system calls to
+	  be executed in userspace, provided that the hypervisor
+	  supports it.
 
 endif
 




* [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 11-host-add-userspace-time-msr --]
[-- Type: text/plain, Size: 9580 bytes --]

Allow a guest to register a second location for the VCPU time info
structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
This is intended to allow the guest kernel to map this information
into a usermode-accessible page, so that usermode can efficiently
calculate system time from the TSC without having to make a syscall.
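
The MSR payload follows the same convention as MSR_KVM_SYSTEM_TIME_NEW:
the guest physical address of the structure, with bit 0 doubling as the
enable flag (hence the "data & ~(PAGE_MASK | 1)" and "data >> PAGE_SHIFT"
in the hunks below). A guest-side sketch (the real registration is added
in the next patch; the function names here are illustrative):

static void demo_register_userspace_time(struct pvclock_vcpu_time_info *info)
{
	u64 pa = __pa(info);	/* guest physical address of the copy */

	wrmsrl(MSR_KVM_USERSPACE_TIME, pa | 1);	/* bit 0 = enable */
}

static void demo_unregister_userspace_time(void)
{
	wrmsrl(MSR_KVM_USERSPACE_TIME, 0);	/* enable bit clear */
}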

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/include/asm/kvm_para.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_para.h
+++ vsyscall/arch/x86/include/asm/kvm_para.h
@@ -23,6 +23,7 @@
 #define KVM_FEATURE_ASYNC_PF		4
 #define KVM_FEATURE_STEAL_TIME		5
 #define KVM_FEATURE_PV_EOI		6
+#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -39,6 +40,7 @@
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME  0x4b564d03
 #define MSR_KVM_PV_EOI_EN      0x4b564d04
+#define MSR_KVM_USERSPACE_TIME      0x4b564d05
 
 struct kvm_steal_time {
 	__u64 steal;
Index: vsyscall/Documentation/virtual/kvm/msr.txt
===================================================================
--- vsyscall.orig/Documentation/virtual/kvm/msr.txt
+++ vsyscall/Documentation/virtual/kvm/msr.txt
@@ -125,6 +125,22 @@ MSR_KVM_SYSTEM_TIME_NEW:  0x4b564d01
 	Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
 	leaf prior to usage.
 
+MSR_KVM_USERSPACE_TIME:  0x4b564d05
+
+Allow a guest to register a second location for the VCPU time info
+structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
+This is intended to allow the guest kernel to map this information
+into a usermode accessible page, so that usermode can efficiently
+calculate system time from the TSC without having to make a syscall.
+
+Relationship with master copy (MSR_KVM_SYSTEM_TIME_NEW):
+
+- This MSR must be enabled only when the master is enabled.
+- Disabling updates to the master automatically disables
+updates for this copy.
+
+Availability of this MSR must be checked via bit 7 in 0x4000001 cpuid
+leaf prior to usage.
 
 MSR_KVM_WALL_CLOCK:  0x11
 
Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -415,10 +415,13 @@ struct kvm_vcpu_arch {
 	int (*complete_userspace_io)(struct kvm_vcpu *vcpu);
 
 	gpa_t time;
+	gpa_t uspace_time;
 	struct pvclock_vcpu_time_info hv_clock;
 	unsigned int hw_tsc_khz;
 	unsigned int time_offset;
+	unsigned int uspace_time_offset;
 	struct page *time_page;
+	struct page *uspace_time_page;
 	/* set guest stopped flag in pvclock flags field */
 	bool pvclock_set_guest_stopped_request;
 
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -809,13 +809,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	10
+#define KVM_SAVE_MSRS_BEGIN	11
 static u32 msrs_to_save[] = {
 	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
 	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
 	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
 	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
-	MSR_KVM_PV_EOI_EN,
+	MSR_KVM_PV_EOI_EN, MSR_KVM_USERSPACE_TIME,
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 	MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1135,16 +1135,43 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
+static void kvm_write_pvtime(struct kvm_vcpu *v, struct page *page,
+			     unsigned int offset_in_page, gpa_t gpa)
+{
+	struct kvm_vcpu_arch *vcpu = &v->arch;
+	void *shared_kaddr;
+	struct pvclock_vcpu_time_info *guest_hv_clock;
+	u8 pvclock_flags;
+
+	shared_kaddr = kmap_atomic(page);
+
+	guest_hv_clock = shared_kaddr + offset_in_page;
+
+	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
+	pvclock_flags = (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
+
+	if (vcpu->pvclock_set_guest_stopped_request) {
+		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
+		vcpu->pvclock_set_guest_stopped_request = false;
+	}
+
+	vcpu->hv_clock.flags = pvclock_flags;
+
+	memcpy(shared_kaddr + offset_in_page, &vcpu->hv_clock,
+	       sizeof(vcpu->hv_clock));
+
+	kunmap_atomic(shared_kaddr);
+
+	mark_page_dirty(v->kvm, gpa >> PAGE_SHIFT);
+}
+
 static int kvm_guest_time_update(struct kvm_vcpu *v)
 {
 	unsigned long flags;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
-	void *shared_kaddr;
 	unsigned long this_tsc_khz;
 	s64 kernel_ns, max_kernel_ns;
 	u64 tsc_timestamp;
-	struct pvclock_vcpu_time_info *guest_hv_clock;
-	u8 pvclock_flags;
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
@@ -1235,26 +1262,11 @@ static int kvm_guest_time_update(struct 
 	 */
 	vcpu->hv_clock.version += 2;
 
-	shared_kaddr = kmap_atomic(vcpu->time_page);
-
-	guest_hv_clock = shared_kaddr + vcpu->time_offset;
-
-	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
-	pvclock_flags = (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
+ 	kvm_write_pvtime(v, vcpu->time_page, vcpu->time_offset, vcpu->time);
+ 	if (vcpu->uspace_time_page)
+ 		kvm_write_pvtime(v, vcpu->uspace_time_page,
+ 				 vcpu->uspace_time_offset, vcpu->uspace_time);
 
-	if (vcpu->pvclock_set_guest_stopped_request) {
-		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
-		vcpu->pvclock_set_guest_stopped_request = false;
-	}
-
-	vcpu->hv_clock.flags = pvclock_flags;
-
-	memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
-	       sizeof(vcpu->hv_clock));
-
-	kunmap_atomic(shared_kaddr);
-
-	mark_page_dirty(v->kvm, vcpu->time >> PAGE_SHIFT);
 	return 0;
 }
 
@@ -1549,6 +1561,15 @@ static void kvmclock_reset(struct kvm_vc
 	}
 }
 
+static void kvmclock_uspace_reset(struct kvm_vcpu *vcpu)
+{
+	vcpu->arch.uspace_time = 0;
+	if (vcpu->arch.uspace_time_page) {
+		kvm_release_page_dirty(vcpu->arch.uspace_time_page);
+		vcpu->arch.uspace_time_page = NULL;
+	}
+}
+
 static void accumulate_steal_time(struct kvm_vcpu *vcpu)
 {
 	u64 delta;
@@ -1639,6 +1660,31 @@ int kvm_set_msr_common(struct kvm_vcpu *
 		vcpu->kvm->arch.wall_clock = data;
 		kvm_write_wall_clock(vcpu->kvm, data);
 		break;
+	case MSR_KVM_USERSPACE_TIME: {
+		kvmclock_uspace_reset(vcpu);
+
+		if (!vcpu->arch.time_page && (data & 1))
+			return 1;
+
+		vcpu->arch.uspace_time = data;
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+
+		/* we verify if the enable bit is set... */
+		if (!(data & 1))
+			break;
+
+		/* ...but clean it before doing the actual write */
+		vcpu->arch.uspace_time_offset = data & ~(PAGE_MASK | 1);
+
+		vcpu->arch.uspace_time_page = gfn_to_page(vcpu->kvm,
+							  data >> PAGE_SHIFT);
+
+		if (is_error_page(vcpu->arch.uspace_time_page)) {
+			kvm_release_page_clean(vcpu->arch.uspace_time_page);
+			vcpu->arch.uspace_time_page = NULL;
+		}
+		break;
+	}
 	case MSR_KVM_SYSTEM_TIME_NEW:
 	case MSR_KVM_SYSTEM_TIME: {
 		kvmclock_reset(vcpu);
@@ -1647,8 +1693,10 @@ int kvm_set_msr_common(struct kvm_vcpu *
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 
 		/* we verify if the enable bit is set... */
-		if (!(data & 1))
+		if (!(data & 1)) {
+			kvmclock_uspace_reset(vcpu);
 			break;
+		}
 
 		/* ...but clean it before doing the actual write */
 		vcpu->arch.time_offset = data & ~(PAGE_MASK | 1);
@@ -1656,8 +1704,10 @@ int kvm_set_msr_common(struct kvm_vcpu *
 		vcpu->arch.time_page =
 				gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
 
-		if (is_error_page(vcpu->arch.time_page))
+		if (is_error_page(vcpu->arch.time_page)) {
 			vcpu->arch.time_page = NULL;
+			kvmclock_uspace_reset(vcpu);
+		}
 
 		break;
 	}
@@ -2010,6 +2060,9 @@ int kvm_get_msr_common(struct kvm_vcpu *
 	case MSR_KVM_SYSTEM_TIME_NEW:
 		data = vcpu->arch.time;
 		break;
+	case MSR_KVM_USERSPACE_TIME:
+		data = vcpu->arch.uspace_time;
+		break;
 	case MSR_KVM_ASYNC_PF_EN:
 		data = vcpu->arch.apf.msr_val;
 		break;
@@ -2195,6 +2248,7 @@ int kvm_dev_ioctl_check_extension(long e
 	case KVM_CAP_KVMCLOCK_CTRL:
 	case KVM_CAP_READONLY_MEM:
 	case KVM_CAP_IRQFD_RESAMPLE:
+	case KVM_CAP_USERSPACE_CLOCKSOURCE:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -6017,6 +6071,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *
 
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 {
+	kvmclock_uspace_reset(vcpu);
 	kvmclock_reset(vcpu);
 
 	free_cpumask_var(vcpu->arch.wbinvd_dirty_mask);
Index: vsyscall/arch/x86/kvm/cpuid.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/cpuid.c
+++ vsyscall/arch/x86/kvm/cpuid.c
@@ -411,7 +411,9 @@ static int do_cpuid_ent(struct kvm_cpuid
 			     (1 << KVM_FEATURE_CLOCKSOURCE2) |
 			     (1 << KVM_FEATURE_ASYNC_PF) |
 			     (1 << KVM_FEATURE_PV_EOI) |
-			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
+			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+			     (1 << KVM_FEATURE_USERSPACE_CLOCKSOURCE);
+
 
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
Index: vsyscall/include/uapi/linux/kvm.h
===================================================================
--- vsyscall.orig/include/uapi/linux/kvm.h
+++ vsyscall/include/uapi/linux/kvm.h
@@ -626,6 +626,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_READONLY_MEM 81
 #endif
 #define KVM_CAP_IRQFD_RESAMPLE 82
+#define KVM_CAP_USERSPACE_CLOCKSOURCE 83
 
 #ifdef KVM_CAP_IRQ_ROUTING
 




* [patch 10/18] x86: kvm guest: pvclock vsyscall support
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 08-add-pvclock-vsyscall-kvm-support --]
[-- Type: text/plain, Size: 4791 bytes --]

Allow the hypervisor to update the userspace-visible copy of the
pvclock data.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/kvmclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/kvmclock.c
+++ vsyscall/arch/x86/kernel/kvmclock.c
@@ -31,6 +31,9 @@ static int kvmclock = 1;
 static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
 static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
 
+/* set when the generic vsyscall pvclock elements are setup */
+bool vsyscall_clock_initializable = false;
+
 static int parse_no_kvmclock(char *arg)
 {
 	kvmclock = 0;
@@ -151,6 +154,28 @@ int kvm_register_clock(char *txt)
 	return ret;
 }
 
+static int kvm_register_vsyscall_clock(char *txt)
+{
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+	int cpu = smp_processor_id();
+	int low, high, ret;
+	struct pvclock_vcpu_time_info *info;
+
+	info = pvclock_get_vsyscall_time_info(cpu);
+
+	low = (int)__pa(info) | 1;
+	high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
+	ret = native_write_msr_safe(MSR_KVM_USERSPACE_TIME, low, high);
+	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
+	       cpu, high, low, txt);
+
+	return ret;
+#else
+	return 0;
+#endif
+}
+
+
 static void kvm_save_sched_clock_state(void)
 {
 }
@@ -158,6 +183,8 @@ static void kvm_save_sched_clock_state(v
 static void kvm_restore_sched_clock_state(void)
 {
 	kvm_register_clock("primary cpu clock, resume");
+	if (vsyscall_clock_initializable)
+		kvm_register_vsyscall_clock("primary cpu vsyscall clock, resume");
 }
 
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -168,6 +195,8 @@ static void __cpuinit kvm_setup_secondar
 	 * we shouldn't fail.
 	 */
 	WARN_ON(kvm_register_clock("secondary cpu clock"));
+	if (vsyscall_clock_initializable)
+		kvm_register_vsyscall_clock("secondary cpu vsyscall clock");
 }
 #endif
 
@@ -182,6 +211,8 @@ static void __cpuinit kvm_setup_secondar
 #ifdef CONFIG_KEXEC
 static void kvm_crash_shutdown(struct pt_regs *regs)
 {
+	if (vsyscall_clock_initializable)
+		native_write_msr(MSR_KVM_USERSPACE_TIME, 0, 0);
 	native_write_msr(msr_kvm_system_time, 0, 0);
 	kvm_disable_steal_time();
 	native_machine_crash_shutdown(regs);
@@ -190,6 +221,8 @@ static void kvm_crash_shutdown(struct pt
 
 static void kvm_shutdown(void)
 {
+	if (vsyscall_clock_initializable)
+		native_write_msr(MSR_KVM_USERSPACE_TIME, 0, 0);
 	native_write_msr(msr_kvm_system_time, 0, 0);
 	kvm_disable_steal_time();
 	native_machine_shutdown();
@@ -233,3 +266,27 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 }
+
+int kvm_setup_vsyscall_timeinfo(void)
+{
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+	int ret;
+	struct pvclock_vcpu_time_info *vcpu_time;
+	u8 flags;
+
+	vcpu_time = &get_cpu_var(hv_clock);
+	flags = pvclock_read_flags(vcpu_time);
+	put_cpu_var(hv_clock);
+
+	if (!(flags & PVCLOCK_TSC_STABLE_BIT))
+		return 1;
+
+	if ((ret = pvclock_init_vsyscall()))
+		return ret;
+
+	kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
+	vsyscall_clock_initializable = true;
+#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */
+	return 0;
+}
+
Index: vsyscall/arch/x86/kernel/kvm.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/kvm.c
+++ vsyscall/arch/x86/kernel/kvm.c
@@ -42,6 +42,7 @@
 #include <asm/apic.h>
 #include <asm/apicdef.h>
 #include <asm/hypervisor.h>
+#include <asm/kvm_guest.h>
 
 static int kvmapf = 1;
 
@@ -62,6 +63,15 @@ static int parse_no_stealacc(char *arg)
 
 early_param("no-steal-acc", parse_no_stealacc);
 
+static int kvmclock_vsyscall = 1;
+static int parse_no_kvmclock_vsyscall(char *arg)
+{
+        kvmclock_vsyscall = 0;
+        return 0;
+}
+
+early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
+
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
@@ -468,6 +478,10 @@ void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
 		apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
+	if (kvm_para_has_feature(KVM_FEATURE_USERSPACE_CLOCKSOURCE)
+	    && kvmclock_vsyscall)
+		kvm_setup_vsyscall_timeinfo();
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
 	register_cpu_notifier(&kvm_cpu_notifier);
Index: vsyscall/arch/x86/include/asm/kvm_guest.h
===================================================================
--- /dev/null
+++ vsyscall/arch/x86/include/asm/kvm_guest.h
@@ -0,0 +1,8 @@
+#ifndef _ASM_X86_KVM_GUEST_H
+#define _ASM_X86_KVM_GUEST_H
+
+extern bool vsyscall_clock_initializable;
+
+int kvm_setup_vsyscall_timeinfo(void);
+
+#endif /* _ASM_X86_KVM_GUEST_H */




* [patch 11/18] x86: vsyscall: pass mode to gettime backend
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 09-vclock-gettime-return-mode --]
[-- Type: text/plain, Size: 1109 bytes --]

Required by the next patch: the pvclock read function must be able to
set the mode to VCLOCK_NONE and force a fallback to the syscall when
the TSC is not stable.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/vdso/vclock_gettime.c
===================================================================
--- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
+++ vsyscall/arch/x86/vdso/vclock_gettime.c
@@ -80,7 +80,7 @@ notrace static long vdso_fallback_gtod(s
 }
 
 
-notrace static inline u64 vgetsns(void)
+notrace static inline u64 vgetsns(int *mode)
 {
 	long v;
 	cycles_t cycles;
@@ -107,7 +107,7 @@ notrace static int __always_inline do_re
 		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->wall_time_sec;
 		ns = gtod->wall_time_snsec;
-		ns += vgetsns();
+		ns += vgetsns(&mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 
@@ -127,7 +127,7 @@ notrace static int do_monotonic(struct t
 		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->monotonic_time_sec;
 		ns = gtod->monotonic_time_snsec;
-		ns += vgetsns();
+		ns += vgetsns(&mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 	timespec_add_ns(ts, ns);




* [patch 12/18] x86: vdso: pvclock gettime support
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 10-add-pvclock-vdso-code --]
[-- Type: text/plain, Size: 4010 bytes --]

Improve the performance of time system calls when using Linux pvclock,
by reading the time info from the fixmap-visible copy of the pvclock
data.

Originally from Jeremy Fitzhardinge.
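
For reference, the value returned by the __getcpu() helper introduced
below packs the CPU and node numbers the same way getcpu(2) exposes
them, which is why vread_pvclock() masks with 0xfff (a sketch; the
constants mirror the existing vgetcpu code):

static inline void demo_getcpu(unsigned int *cpu, unsigned int *node)
{
	unsigned int p = __getcpu();

	*cpu  = p & 0xfff;	/* bits 11:0:  CPU number */
	*node = p >> 12;	/* bits 31:12: NUMA node */
}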

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/vdso/vclock_gettime.c
===================================================================
--- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
+++ vsyscall/arch/x86/vdso/vclock_gettime.c
@@ -22,6 +22,7 @@
 #include <asm/hpet.h>
 #include <asm/unistd.h>
 #include <asm/io.h>
+#include <asm/pvclock.h>
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
@@ -62,6 +63,69 @@ static notrace cycle_t vread_hpet(void)
 	return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
 }
 
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+
+static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
+{
+	const aligned_pvti_t *pvti_base;
+	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
+	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
+
+	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
+
+	pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
+
+	return &pvti_base[offset].info;
+}
+
+static notrace cycle_t vread_pvclock(int *mode)
+{
+	const struct pvclock_vsyscall_time_info *pvti;
+	cycle_t ret;
+	u64 last;
+	u32 version;
+	u32 migrate_count;
+	u8 flags;
+	unsigned cpu, cpu1;
+
+
+	/*
+	 * When looping to get a consistent (time-info, tsc) pair, we
+	 * also need to deal with the possibility we can switch vcpus,
+	 * so make sure we always re-fetch time-info for the current vcpu.
+	 */
+	do {
+		cpu = __getcpu() & 0xfff;
+		pvti = get_pvti(cpu);
+
+		migrate_count = pvti->migrate_count;
+
+		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
+
+		/*
+		 * Test we're still on the cpu as well as the version.
+		 * We could have been migrated just after the first
+		 * vgetcpu but before fetching the version, so we
+		 * wouldn't notice a version change.
+		 */
+		cpu1 = __getcpu() & 0xfff;
+	} while (unlikely(cpu != cpu1 ||
+			  (pvti->pvti.version & 1) ||
+			  pvti->pvti.version != version ||
+			  pvti->migrate_count != migrate_count));
+
+	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+		*mode = VCLOCK_NONE;
+
+	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
+
+	if (likely(ret >= last))
+		return ret;
+
+	return last;
+}
+#endif
+
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
 	long ret;
@@ -88,6 +152,8 @@ notrace static inline u64 vgetsns(int *m
 		cycles = vread_tsc();
 	else if (gtod->clock.vclock_mode == VCLOCK_HPET)
 		cycles = vread_hpet();
+	else if (gtod->clock.vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
 	else
 		return 0;
 	v = (cycles - gtod->clock.cycle_last) & gtod->clock.mask;
Index: vsyscall/arch/x86/include/asm/vsyscall.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/vsyscall.h
+++ vsyscall/arch/x86/include/asm/vsyscall.h
@@ -33,6 +33,21 @@ extern void map_vsyscall(void);
  */
 extern bool emulate_vsyscall(struct pt_regs *regs, unsigned long address);
 
+static inline unsigned int __getcpu(void)
+{
+	unsigned int p;
+
+	if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
+		/* Load per CPU data from RDTSCP */
+		native_read_tscp(&p);
+	} else {
+		/* Load per CPU data from GDT */
+		asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
+	}
+
+	return p;
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_VSYSCALL_H */
Index: vsyscall/arch/x86/vdso/vgetcpu.c
===================================================================
--- vsyscall.orig/arch/x86/vdso/vgetcpu.c
+++ vsyscall/arch/x86/vdso/vgetcpu.c
@@ -17,13 +17,8 @@ __vdso_getcpu(unsigned *cpu, unsigned *n
 {
 	unsigned int p;
 
-	if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
-		/* Load per CPU data from RDTSCP */
-		native_read_tscp(&p);
-	} else {
-		/* Load per CPU data from GDT */
-		asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
-	}
+	p = __getcpu();
+
 	if (cpu)
 		*cpu = p & 0xfff;
 	if (node)




* [patch 13/18] KVM: x86: pass host_tsc to read_l1_tsc
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 12-kvm-read-l1-tsc-pass-tscvalue --]
[-- Type: text/plain, Size: 3372 bytes --]

Allow the caller to pass the host TSC value to kvm_x86_ops->read_l1_tsc().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -703,7 +703,7 @@ struct kvm_x86_ops {
 	void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
 	u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
-	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
+	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
 
 	void (*get_exit_info)(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2);
 
Index: vsyscall/arch/x86/kvm/lapic.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/lapic.c
+++ vsyscall/arch/x86/kvm/lapic.c
@@ -1011,7 +1011,7 @@ static void start_apic_timer(struct kvm_
 		local_irq_save(flags);
 
 		now = apic->lapic_timer.timer.base->get_time();
-		guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+		guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu, native_read_tsc());
 		if (likely(tscdeadline > guest_tsc)) {
 			ns = (tscdeadline - guest_tsc) * 1000000ULL;
 			do_div(ns, this_tsc_khz);
Index: vsyscall/arch/x86/kvm/svm.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/svm.c
+++ vsyscall/arch/x86/kvm/svm.c
@@ -3008,11 +3008,11 @@ static int cr8_write_interception(struct
 	return 0;
 }
 
-u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
+u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
 	struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu));
 	return vmcb->control.tsc_offset +
-		svm_scale_tsc(vcpu, native_read_tsc());
+		svm_scale_tsc(vcpu, host_tsc);
 }
 
 static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
Index: vsyscall/arch/x86/kvm/vmx.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/vmx.c
+++ vsyscall/arch/x86/kvm/vmx.c
@@ -1839,11 +1839,10 @@ static u64 guest_read_tsc(void)
  * Like guest_read_tsc, but always returns L1's notion of the timestamp
  * counter, even if a nested guest (L2) is currently running.
  */
-u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
+u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
-	u64 host_tsc, tsc_offset;
+	u64 tsc_offset;
 
-	rdtscll(host_tsc);
 	tsc_offset = is_guest_mode(vcpu) ?
 		to_vmx(vcpu)->nested.vmcs01_tsc_offset :
 		vmcs_read64(TSC_OFFSET);
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1175,7 +1175,7 @@ static int kvm_guest_time_update(struct 
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
+	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, native_read_tsc());
 	kernel_ns = get_kernel_ns();
 	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 	if (unlikely(this_tsc_khz == 0)) {
@@ -5429,7 +5429,8 @@ static int vcpu_enter_guest(struct kvm_v
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();
 
-	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu,
+							   native_read_tsc());
 
 	vcpu->mode = OUTSIDE_GUEST_MODE;
 	smp_wmb();




* [patch 14/18] time: export time information for KVM pvclock
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 13-time-add-pvclock-gtod-data --]
[-- Type: text/plain, Size: 3841 bytes --]

As suggested by John, export time data similarly to how it's done by the
vsyscall support. This allows KVM to retrieve the information necessary
to implement vsyscall support in KVM guests.
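
A consumer-side sketch (KVM registers such a notifier later in this
series; the names here are illustrative, and a real reader would take a
consistent snapshot of pvclock_gtod_data under its seqcount):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/notifier.h>
#include <linux/pvclock_gtod.h>

/* Called through the raw chain after each timekeeping update, once
 * pvclock_gtod_data has been republished. */
static int demo_pvclock_gtod_notify(struct notifier_block *nb,
				    unsigned long unused, void *priv)
{
	pr_debug("vclock_mode is now %d\n",
		 pvclock_gtod_data.clock.vclock_mode);
	return NOTIFY_DONE;
}

static struct notifier_block demo_pvclock_gtod_nb = {
	.notifier_call = demo_pvclock_gtod_notify,
};

static int __init demo_init(void)
{
	return pvclock_gtod_register_notifier(&demo_pvclock_gtod_nb);
}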

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/include/linux/pvclock_gtod.h
===================================================================
--- /dev/null
+++ vsyscall/include/linux/pvclock_gtod.h
@@ -0,0 +1,27 @@
+#ifndef _PVCLOCK_GTOD_H
+#define _PVCLOCK_GTOD_H
+
+#include <linux/clocksource.h>
+#include <linux/notifier.h>
+
+struct pvclock_gtod_data {
+	seqcount_t	seq;
+
+	struct { /* extract of a clocksource struct */
+		int vclock_mode;
+		cycle_t	cycle_last;
+		cycle_t	mask;
+		u32	mult;
+		u32	shift;
+	} clock;
+
+	/* open coded 'struct timespec' */
+	u64		monotonic_time_snsec;
+	time_t		monotonic_time_sec;
+};
+extern struct pvclock_gtod_data pvclock_gtod_data;
+
+extern int pvclock_gtod_register_notifier(struct notifier_block *nb);
+extern int pvclock_gtod_unregister_notifier(struct notifier_block *nb);
+
+#endif /* _PVCLOCK_GTOD_H */
Index: vsyscall/kernel/time/timekeeping.c
===================================================================
--- vsyscall.orig/kernel/time/timekeeping.c
+++ vsyscall/kernel/time/timekeeping.c
@@ -21,6 +21,7 @@
 #include <linux/time.h>
 #include <linux/tick.h>
 #include <linux/stop_machine.h>
+#include <linux/pvclock_gtod.h>
 
 
 static struct timekeeper timekeeper;
@@ -180,6 +181,79 @@ static inline s64 timekeeping_get_ns_raw
 	return nsec + arch_gettimeoffset();
 }
 
+static RAW_NOTIFIER_HEAD(pvclock_gtod_chain);
+
+/**
+ * pvclock_gtod_register_notifier - register a pvclock timedata update listener
+ *
+ * Must hold write on timekeeper.lock
+ */
+int pvclock_gtod_register_notifier(struct notifier_block *nb)
+{
+	struct timekeeper *tk = &timekeeper;
+	unsigned long flags;
+	int ret;
+
+	write_seqlock_irqsave(&tk->lock, flags);
+	ret = raw_notifier_chain_register(&pvclock_gtod_chain, nb);
+	write_sequnlock_irqrestore(&tk->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pvclock_gtod_register_notifier);
+
+/**
+ * pvclock_gtod_unregister_notifier - unregister a pvclock
+ * timedata update listener
+ *
+ * Must hold write on timekeeper.lock
+ */
+int pvclock_gtod_unregister_notifier(struct notifier_block *nb)
+{
+	struct timekeeper *tk = &timekeeper;
+	unsigned long flags;
+	int ret;
+
+	write_seqlock_irqsave(&tk->lock, flags);
+	ret = raw_notifier_chain_unregister(&pvclock_gtod_chain, nb);
+	write_sequnlock_irqrestore(&tk->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pvclock_gtod_unregister_notifier);
+
+struct pvclock_gtod_data pvclock_gtod_data;
+EXPORT_SYMBOL_GPL(pvclock_gtod_data);
+
+static void update_pvclock_gtod(struct timekeeper *tk)
+{
+	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
+
+	write_seqcount_begin(&vdata->seq);
+
+	/* copy pvclock gtod data */
+	vdata->clock.vclock_mode	= tk->clock->archdata.vclock_mode;
+	vdata->clock.cycle_last		= tk->clock->cycle_last;
+	vdata->clock.mask		= tk->clock->mask;
+	vdata->clock.mult		= tk->mult;
+	vdata->clock.shift		= tk->shift;
+
+	vdata->monotonic_time_sec	= tk->xtime_sec
+					+ tk->wall_to_monotonic.tv_sec;
+	vdata->monotonic_time_snsec	= tk->xtime_nsec
+					+ (tk->wall_to_monotonic.tv_nsec
+						<< tk->shift);
+	while (vdata->monotonic_time_snsec >=
+					(((u64)NSEC_PER_SEC) << tk->shift)) {
+		vdata->monotonic_time_snsec -=
+					((u64)NSEC_PER_SEC) << tk->shift;
+		vdata->monotonic_time_sec++;
+	}
+
+	write_seqcount_end(&vdata->seq);
+	raw_notifier_call_chain(&pvclock_gtod_chain, 0, NULL);
+}
+
 /* must hold write on timekeeper.lock */
 static void timekeeping_update(struct timekeeper *tk, bool clearntp)
 {
@@ -188,6 +262,7 @@ static void timekeeping_update(struct ti
 		ntp_clear();
 	}
 	update_vsyscall(tk);
+	update_pvclock_gtod(tk);
 }
 
 /**



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
  2012-10-24 13:13 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v2) Marcelo Tosatti
                   ` (13 preceding siblings ...)
  2012-10-24 13:13 ` [patch 14/18] time: export time information for KVM pvclock Marcelo Tosatti
@ 2012-10-24 13:13 ` Marcelo Tosatti
  2012-10-30  8:34   ` Glauber Costa
  2012-10-24 13:13 ` [patch 16/18] KVM: x86: notifier for clocksource changes Marcelo Tosatti
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 14-host-pass-stable-pvclock-flag --]
[-- Type: text/plain, Size: 13339 bytes --]

KVM added a global variable to guarantee monotonicity in the guest. 
One of the reasons for that is that the time between

	1. ktime_get_ts(&timespec);
	2. rdtscll(tsc);

is variable. That is, given a host with a stable TSC, suppose that
two VCPUs read the same time via ktime_get_ts() above.

The time required to execute 2. is not the same on those two instances
executing on different VCPUs (cache misses, interrupts...).

If the TSC value that is used by the host to interpolate when 
calculating the monotonic time is the same value used to calculate
the tsc_timestamp value stored in the pvclock data structure, and
a single <system_timestamp, tsc_timestamp> tuple is visible to all 
vcpus simultaneously, this problem disappears. See comment on top
of pvclock_update_vm_gtod_copy for details.

Monotonicity is then guaranteed by synchronicity of the host TSCs
and guest TSCs. 

Set the TSC stable pvclock flag in that case, allowing the guest to
read the clock from userspace.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

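For reference, the interpolation mentioned above is the usual pvclock
read sequence (a sketch only; the real code is __pvclock_read_cycles(),
which also loops on the version field and handles a negative tsc_shift
via pvclock_scale_delta()):

	/* schematic guest-side pvclock read */
	delta = rdtsc() - pvti->tsc_timestamp;
	ns = pvti->system_time +
	     pvclock_scale_delta(delta, pvti->tsc_to_system_mul,
				 pvti->tsc_shift);
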
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -46,6 +46,7 @@
 #include <linux/uaccess.h>
 #include <linux/hash.h>
 #include <linux/pci.h>
+#include <linux/pvclock_gtod.h>
 #include <trace/events/kvm.h>
 
 #define CREATE_TRACE_POINTS
@@ -1135,8 +1136,149 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
+static cycle_t read_tsc(void)
+{
+	cycle_t ret;
+	u64 last;
+
+	/*
+	 * Empirically, a fence (of type that depends on the CPU)
+	 * before rdtsc is enough to ensure that rdtsc is ordered
+	 * with respect to loads.  The various CPU manuals are unclear
+	 * as to whether rdtsc can be reordered with later loads,
+	 * but no one has ever seen it happen.
+	 */
+	rdtsc_barrier();
+	ret = (cycle_t)vget_cycles();
+
+	last = pvclock_gtod_data.clock.cycle_last;
+
+	if (likely(ret >= last))
+		return ret;
+
+	/*
+	 * GCC likes to generate cmov here, but this branch is extremely
+	 * predictable (it's just a function of time and the likely is
+	 * very likely) and there's a data dependence, so force GCC
+	 * to generate a branch instead.  I don't barrier() because
+	 * we don't actually need a barrier, and if this function
+	 * ever gets inlined it will generate worse code.
+	 */
+	asm volatile ("");
+	return last;
+}
+
+static inline u64 vgettsc(cycle_t *cycle_now)
+{
+	long v;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	*cycle_now = read_tsc();
+
+	v = (*cycle_now - gtod->clock.cycle_last) & gtod->clock.mask;
+	return v * gtod->clock.mult;
+}
+
+static int do_monotonic(struct timespec *ts, cycle_t *cycle_now)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	ts->tv_nsec = 0;
+	do {
+		seq = read_seqcount_begin(&gtod->seq);
+		mode = gtod->clock.vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_sec;
+		ns = gtod->monotonic_time_snsec;
+		ns += vgettsc(cycle_now);
+		ns >>= gtod->clock.shift;
+	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+	timespec_add_ns(ts, ns);
+
+	return mode;
+}
+
+/* returns true if host is using tsc clocksource */
+static bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now)
+{
+	struct timespec ts;
+
+	/* checked again under seqlock below */
+	if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
+		return false;
+
+	if (do_monotonic(&ts, cycle_now) != VCLOCK_TSC)
+		return false;
+
+	monotonic_to_bootbased(&ts);
+	*kernel_ns = timespec_to_ns(&ts);
+
+	return true;
+}
+
+
+/*
+ *
+ * Assuming a stable TSC across physical CPUs, the following condition
+ * is possible. Each numbered line represents an event visible to both
+ * CPUs at the next numbered event.
+ *
+ * "timespecX" represents host monotonic time. "tscX" represents
+ * RDTSC value.
+ *
+ * 		VCPU0 on CPU0		|	VCPU1 on CPU1
+ *
+ * 1.  read timespec0,tsc0
+ * 2.					| timespec1 = timespec0 + N
+ * 					| tsc1 = tsc0 + M
+ * 3. transition to guest		| transition to guest
+ * 4. ret0 = timespec0 + (rdtsc - tsc0) |
+ * 5.				        | ret1 = timespec1 + (rdtsc - tsc1)
+ * 				        | ret1 = timespec0 + N + (rdtsc - (tsc0 + M))
+ *
+ * Since ret0 update is visible to VCPU1 at time 5, to obey monotonicity:
+ *
+ * 	- ret0 < ret1
+ *	- timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
+ *		...
+ *	- 0 < N - M => M < N
+ *
+ * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
+ * always the case (the difference between two distinct xtime instances
+ * might be smaller than the difference between corresponding TSC reads,
+ * when updating guest vcpus' pvclock areas).
+ *
+ * To avoid that problem, do not allow visibility of distinct
+ * system_timestamp/tsc_timestamp values simultaneously: use a master
+ * copy of host monotonic time values. Update that master copy
+ * in lockstep.
+ *
+ * Rely on synchronization of host TSCs for monotonicity.
+ *
+ */
+
+static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
+{
+	struct kvm_arch *ka = &kvm->arch;
+	int vclock_mode;
+
+	/*
+ 	 * If the host uses TSC clock, then pass through TSC as stable
+	 * to the guest.
+	 */
+	ka->use_master_clock = kvm_get_time_and_clockread(
+					&ka->master_kernel_ns,
+					&ka->master_cycle_now);
+
+	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
+	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode);
+}
+
 static void kvm_write_pvtime(struct kvm_vcpu *v, struct page *page,
-			     unsigned int offset_in_page, gpa_t gpa)
+			     unsigned int offset_in_page, gpa_t gpa,
+			     bool use_master_clock)
 {
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	void *shared_kaddr;
@@ -1155,6 +1297,10 @@ static void kvm_write_pvtime(struct kvm_
 		vcpu->pvclock_set_guest_stopped_request = false;
 	}
 
+	/* If the host uses TSC clocksource, then it is stable */
+	if (use_master_clock)
+		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
+
 	vcpu->hv_clock.flags = pvclock_flags;
 
 	memcpy(shared_kaddr + offset_in_page, &vcpu->hv_clock,
@@ -1169,14 +1315,18 @@ static int kvm_guest_time_update(struct 
 {
 	unsigned long flags;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
+	struct kvm_arch *ka = &v->kvm->arch;
 	unsigned long this_tsc_khz;
 	s64 kernel_ns, max_kernel_ns;
 	u64 tsc_timestamp;
+	u64 host_tsc;
+	bool use_master_clock;
+
+	kernel_ns = 0;
+	host_tsc = 0;
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, native_read_tsc());
-	kernel_ns = get_kernel_ns();
 	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 	if (unlikely(this_tsc_khz == 0)) {
 		local_irq_restore(flags);
@@ -1185,6 +1335,24 @@ static int kvm_guest_time_update(struct 
 	}
 
 	/*
+ 	 * If the host uses TSC clock, then pass through TSC as stable
+	 * to the guest.
+	 */
+	spin_lock(&ka->pvclock_gtod_sync_lock);
+	use_master_clock = ka->use_master_clock;
+	if (use_master_clock) {
+		host_tsc = ka->master_cycle_now;
+		kernel_ns = ka->master_kernel_ns;
+	}
+	spin_unlock(&ka->pvclock_gtod_sync_lock);
+	if (!use_master_clock) {
+		host_tsc = native_read_tsc();
+		kernel_ns = get_kernel_ns();
+	}
+
+	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, host_tsc);
+
+	/*
 	 * We may have to catch up the TSC to match elapsed wall clock
 	 * time for two reasons, even if kvmclock is used.
 	 *   1) CPU could have been running below the maximum TSC rate
@@ -1245,8 +1413,14 @@ static int kvm_guest_time_update(struct 
 		vcpu->hw_tsc_khz = this_tsc_khz;
 	}
 
-	if (max_kernel_ns > kernel_ns)
-		kernel_ns = max_kernel_ns;
+	/* with a master <monotonic time, tsc value> tuple,
+ 	 * pvclock clock reads always increase at the (scaled) rate
+ 	 * of guest TSC - no need to deal with sampling errors.
+ 	 */
+	if (!use_master_clock) {
+		if (max_kernel_ns > kernel_ns)
+			kernel_ns = max_kernel_ns;
+	}
 
 	/* With all the info we got, fill in the values */
 	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
@@ -1262,10 +1436,12 @@ static int kvm_guest_time_update(struct 
 	 */
 	vcpu->hv_clock.version += 2;
 
- 	kvm_write_pvtime(v, vcpu->time_page, vcpu->time_offset, vcpu->time);
+ 	kvm_write_pvtime(v, vcpu->time_page, vcpu->time_offset, vcpu->time,
+			 use_master_clock);
  	if (vcpu->uspace_time_page)
  		kvm_write_pvtime(v, vcpu->uspace_time_page,
- 				 vcpu->uspace_time_offset, vcpu->uspace_time);
+ 				 vcpu->uspace_time_offset, vcpu->uspace_time,
+				 use_master_clock);
 
 	return 0;
 }
@@ -5302,6 +5478,28 @@ static void process_nmi(struct kvm_vcpu 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 
+static void kvm_gen_update_masterclock(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	struct kvm_arch *ka = &kvm->arch;
+
+	spin_lock(&ka->pvclock_gtod_sync_lock);
+	kvm_make_mclock_inprogress_request(kvm);
+	/* no guest entries from this point */
+	pvclock_update_vm_gtod_copy(kvm);
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
+
+	/* guest entries allowed */
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);
+
+	spin_unlock(&ka->pvclock_gtod_sync_lock);
+
+}
+
 static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 {
 	int r;
@@ -5314,6 +5512,8 @@ static int vcpu_enter_guest(struct kvm_v
 			kvm_mmu_unload(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
 			__kvm_migrate_timers(vcpu);
+		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
+			kvm_gen_update_masterclock(vcpu->kvm);
 		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
 			r = kvm_guest_time_update(vcpu);
 			if (unlikely(r))
@@ -6219,6 +6419,8 @@ int kvm_arch_hardware_enable(void *garba
 			kvm_for_each_vcpu(i, vcpu, kvm) {
 				vcpu->arch.tsc_offset_adjustment += delta_cyc;
 				vcpu->arch.last_host_tsc = local_tsc;
+				set_bit(KVM_REQ_MASTERCLOCK_UPDATE,
+					&vcpu->requests);
 			}
 
 			/*
@@ -6356,6 +6558,9 @@ int kvm_arch_init_vm(struct kvm *kvm, un
 
 	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
 	mutex_init(&kvm->arch.apic_map_lock);
+	spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock);
+
+	pvclock_update_vm_gtod_copy(kvm);
 
 	return 0;
 }
Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -22,6 +22,7 @@
 #include <linux/kvm_para.h>
 #include <linux/kvm_types.h>
 #include <linux/perf_event.h>
+#include <linux/pvclock_gtod.h>
 
 #include <asm/pvclock-abi.h>
 #include <asm/desc.h>
@@ -563,6 +564,11 @@ struct kvm_arch {
 	u64 cur_tsc_offset;
 	u8  cur_tsc_generation;
 
+	spinlock_t pvclock_gtod_sync_lock;
+	bool use_master_clock;
+	u64 master_kernel_ns;
+	cycle_t master_cycle_now;
+
 	struct kvm_xen_hvm_config xen_hvm_config;
 
 	/* fields used by HYPER-V emulation */
Index: vsyscall/include/linux/kvm_host.h
===================================================================
--- vsyscall.orig/include/linux/kvm_host.h
+++ vsyscall/include/linux/kvm_host.h
@@ -118,6 +118,8 @@ static inline bool is_error_page(struct 
 #define KVM_REQ_IMMEDIATE_EXIT    15
 #define KVM_REQ_PMU               16
 #define KVM_REQ_PMI               17
+#define KVM_REQ_MASTERCLOCK_UPDATE  18
+#define KVM_REQ_MCLOCK_INPROGRESS 19
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID		0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
@@ -527,6 +529,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *
 
 void kvm_flush_remote_tlbs(struct kvm *kvm);
 void kvm_reload_remote_mmus(struct kvm *kvm);
+void kvm_make_mclock_inprogress_request(struct kvm *kvm);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
Index: vsyscall/virt/kvm/kvm_main.c
===================================================================
--- vsyscall.orig/virt/kvm/kvm_main.c
+++ vsyscall/virt/kvm/kvm_main.c
@@ -212,6 +212,11 @@ void kvm_reload_remote_mmus(struct kvm *
 	make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
 }
 
+void kvm_make_mclock_inprogress_request(struct kvm *kvm)
+{
+	make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
+}
+
 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 {
 	struct page *page;
Index: vsyscall/arch/x86/kvm/trace.h
===================================================================
--- vsyscall.orig/arch/x86/kvm/trace.h
+++ vsyscall/arch/x86/kvm/trace.h
@@ -4,6 +4,7 @@
 #include <linux/tracepoint.h>
 #include <asm/vmx.h>
 #include <asm/svm.h>
+#include <asm/clocksource.h>
 
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvm
@@ -754,6 +755,31 @@ TRACE_EVENT(
 		  __entry->write ? "Write" : "Read",
 		  __entry->gpa_match ? "GPA" : "GVA")
 );
+
+#define host_clocks				\
+	{VCLOCK_NONE, "none"},			\
+	{VCLOCK_TSC,  "tsc"},			\
+	{VCLOCK_HPET, "hpet"}			\
+
+TRACE_EVENT(kvm_update_master_clock,
+	TP_PROTO(bool use_master_clock, unsigned int host_clock),
+	TP_ARGS(use_master_clock, host_clock),
+
+	TP_STRUCT__entry(
+		__field(		bool,	use_master_clock	)
+		__field(	unsigned int,	host_clock		)
+	),
+
+	TP_fast_assign(
+		__entry->use_master_clock	= use_master_clock;
+		__entry->host_clock		= host_clock;
+	),
+
+	TP_printk("masterclock %d hostclock %s",
+		  __entry->use_master_clock,
+		  __print_symbolic(__entry->host_clock, host_clocks))
+);
+
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 16/18] KVM: x86: notifier for clocksource changes
  2012-10-24 13:13 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v2) Marcelo Tosatti
                   ` (14 preceding siblings ...)
  2012-10-24 13:13 ` [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
@ 2012-10-24 13:13 ` Marcelo Tosatti
  2012-10-24 13:13 ` [patch 17/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization Marcelo Tosatti
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 15-add-kvm-req-pvclock-gtod-update --]
[-- Type: text/plain, Size: 2592 bytes --]

Register a notifier for clocksource change events. If the
host switches to a clock other than TSC, disable master
clock usage.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

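(The resulting flow, pieced together from this and the preceding
patches: timekeeping_update() -> pvclock_gtod notifier chain ->
pvclock_gtod_notify() -> work item on system_long_wq ->
KVM_REQ_MASTERCLOCK_UPDATE set on every vcpu ->
kvm_gen_update_masterclock(), which re-evaluates use_master_clock and
finds it false once the clocksource is no longer TSC.)
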
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1218,6 +1218,7 @@ static bool kvm_get_time_and_clockread(s
 	return true;
 }
 
+static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
 
 /*
  *
@@ -1271,6 +1272,8 @@ static void pvclock_update_vm_gtod_copy(
 	ka->use_master_clock = kvm_get_time_and_clockread(
 					&ka->master_kernel_ns,
 					&ka->master_cycle_now);
+	if (ka->use_master_clock)
+		atomic_set(&kvm_guest_has_master_clock, 1);
 
 	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
 	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode);
@@ -5124,6 +5127,44 @@ static void kvm_set_mmio_spte_mask(void)
 	kvm_mmu_set_mmio_spte_mask(mask);
 }
 
+static void pvclock_gtod_update_fn(struct work_struct *work)
+{
+	struct kvm *kvm;
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	raw_spin_lock(&kvm_lock);
+	list_for_each_entry(kvm, &vm_list, vm_list)
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			set_bit(KVM_REQ_MASTERCLOCK_UPDATE, &vcpu->requests);
+	atomic_set(&kvm_guest_has_master_clock, 0);
+	raw_spin_unlock(&kvm_lock);
+}
+
+static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
+
+/*
+ * Notification about pvclock gtod data update.
+ */
+static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
+			       void *unused2)
+{
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	/* disable master clock if host does not trust, or does not
+ 	 * use, TSC clocksource
+ 	 */
+	if (gtod->clock.vclock_mode != VCLOCK_TSC &&
+	    atomic_read(&kvm_guest_has_master_clock) != 0)
+		queue_work(system_long_wq, &pvclock_gtod_work);
+
+	return 0;
+}
+
+static struct notifier_block pvclock_gtod_notifier = {
+	.notifier_call = pvclock_gtod_notify,
+};
+
 int kvm_arch_init(void *opaque)
 {
 	int r;
@@ -5165,6 +5206,8 @@ int kvm_arch_init(void *opaque)
 		host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
 
 	kvm_lapic_init();
+	pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
+
 	return 0;
 
 out:
@@ -5179,6 +5222,7 @@ void kvm_arch_exit(void)
 		cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
 					    CPUFREQ_TRANSITION_NOTIFIER);
 	unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
+	pvclock_gtod_unregister_notifier(&pvclock_gtod_notifier);
 	kvm_x86_ops = NULL;
 	kvm_mmu_module_exit();
 }



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 17/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization
  2012-10-24 13:13 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v2) Marcelo Tosatti
                   ` (15 preceding siblings ...)
  2012-10-24 13:13 ` [patch 16/18] KVM: x86: notifier for clocksource changes Marcelo Tosatti
@ 2012-10-24 13:13 ` Marcelo Tosatti
  2012-10-24 13:13 ` [patch 18/18] KVM: x86: require matched TSC offsets for master clock Marcelo Tosatti
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
  18 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 16-add-kvm-add-vcpu-postcreate --]
[-- Type: text/plain, Size: 3730 bytes --]

TSC initialization will soon make use of online_vcpus.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

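(Why a separate callback: in kvm_vm_ioctl_create_vcpu(), see the
kvm_main.c hunk below, kvm_arch_vcpu_postcreate() runs after
atomic_inc(&kvm->online_vcpus) and after kvm->lock is dropped, so a
kvm_write_tsc() issued from it observes the updated online_vcpus count.)
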
Index: vsyscall/arch/ia64/kvm/kvm-ia64.c
===================================================================
--- vsyscall.orig/arch/ia64/kvm/kvm-ia64.c
+++ vsyscall/arch/ia64/kvm/kvm-ia64.c
@@ -1330,6 +1330,11 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu 
 	return 0;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
 	return -EINVAL;
Index: vsyscall/arch/powerpc/kvm/powerpc.c
===================================================================
--- vsyscall.orig/arch/powerpc/kvm/powerpc.c
+++ vsyscall/arch/powerpc/kvm/powerpc.c
@@ -354,6 +354,11 @@ struct kvm_vcpu *kvm_arch_vcpu_create(st
 	return vcpu;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	/* Make sure we're not using the vcpu anymore */
Index: vsyscall/arch/s390/kvm/kvm-s390.c
===================================================================
--- vsyscall.orig/arch/s390/kvm/kvm-s390.c
+++ vsyscall/arch/s390/kvm/kvm-s390.c
@@ -355,6 +355,11 @@ static void kvm_s390_vcpu_initial_reset(
 	atomic_set_mask(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
 {
 	atomic_set(&vcpu->arch.sie_block->cpuflags, CPUSTAT_ZARCH |
Index: vsyscall/arch/x86/kvm/svm.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/svm.c
+++ vsyscall/arch/x86/kvm/svm.c
@@ -1254,7 +1254,6 @@ static struct kvm_vcpu *svm_create_vcpu(
 	svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT;
 	svm->asid_generation = 0;
 	init_vmcb(svm);
-	kvm_write_tsc(&svm->vcpu, 0);
 
 	err = fx_init(&svm->vcpu);
 	if (err)
Index: vsyscall/arch/x86/kvm/vmx.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/vmx.c
+++ vsyscall/arch/x86/kvm/vmx.c
@@ -3896,8 +3896,6 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL);
 	set_cr4_guest_host_mask(vmx);
 
-	kvm_write_tsc(&vmx->vcpu, 0);
-
 	return 0;
 }
 
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -6350,6 +6350,19 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu 
 	return r;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	int r;
+
+	r = vcpu_load(vcpu);
+	if (r)
+		return r;
+	kvm_write_tsc(vcpu, 0);
+	vcpu_put(vcpu);
+
+	return r;
+}
+
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
 	int r;
Index: vsyscall/include/linux/kvm_host.h
===================================================================
--- vsyscall.orig/include/linux/kvm_host.h
+++ vsyscall/include/linux/kvm_host.h
@@ -583,6 +583,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
 struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id);
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
 
 int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu);
Index: vsyscall/virt/kvm/kvm_main.c
===================================================================
--- vsyscall.orig/virt/kvm/kvm_main.c
+++ vsyscall/virt/kvm/kvm_main.c
@@ -1855,6 +1855,7 @@ static int kvm_vm_ioctl_create_vcpu(stru
 	atomic_inc(&kvm->online_vcpus);
 
 	mutex_unlock(&kvm->lock);
+	kvm_arch_vcpu_postcreate(vcpu);
 	return r;
 
 unlock_vcpu_destroy:



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 18/18] KVM: x86: require matched TSC offsets for master clock
  2012-10-24 13:13 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v2) Marcelo Tosatti
                   ` (16 preceding siblings ...)
  2012-10-24 13:13 ` [patch 17/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization Marcelo Tosatti
@ 2012-10-24 13:13 ` Marcelo Tosatti
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
  18 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-24 13:13 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 17-masterclock-require-matched-tsc --]
[-- Type: text/plain, Size: 6975 bytes --]

With master clock, a pvclock clock read calculates:

ret = system_timestamp + [ (rdtsc + tsc_offset) - tsc_timestamp ]

Where 'rdtsc' is the host TSC.

system_timestamp and tsc_timestamp are unique, one tuple 
per VM: the "master clock".

Given a host with synchronized TSCs, it's obvious that
guest TSCs must be matched for the above to guarantee monotonicity.

Allow master clock usage only if guest TSCs are synchronized.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

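To make the requirement concrete, an illustrative example (numbers
invented; assume a scale of 1 cycle = 1ns): take a shared master tuple
<system_timestamp = 0, tsc_timestamp = 1000>, VCPU0 with tsc_offset = 0
and VCPU1 with tsc_offset = -100. At host TSC = 2000:

	ret0 = 0 + ((2000 + 0)   - 1000) = 1000ns
	ret1 = 0 + ((2000 - 100) - 1000) =  900ns

A task migrating from VCPU0 to VCPU1 could observe time jumping back by
100ns. With matched offsets, the two reads agree.
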
Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -563,6 +563,7 @@ struct kvm_arch {
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
 	u8  cur_tsc_generation;
+	int nr_vcpus_matched_tsc;
 
 	spinlock_t pvclock_gtod_sync_lock;
 	bool use_master_clock;
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1047,12 +1047,38 @@ static u64 compute_guest_tsc(struct kvm_
 	return tsc;
 }
 
+void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
+{
+	bool vcpus_matched;
+	bool do_request = false;
+	struct kvm_arch *ka = &vcpu->kvm->arch;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
+			 atomic_read(&vcpu->kvm->online_vcpus));
+
+	if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
+		if (!ka->use_master_clock)
+			do_request = true;
+
+	if (!vcpus_matched && ka->use_master_clock)
+		do_request = true;
+
+	if (do_request)
+		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+
+	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
+			    atomic_read(&vcpu->kvm->online_vcpus),
+		            ka->use_master_clock, gtod->clock.vclock_mode);
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
 	s64 usdiff;
+	bool matched;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
@@ -1095,6 +1121,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 			offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
 			pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
 		}
+		matched = true;
 	} else {
 		/*
 		 * We split periods of matched TSC writes into generations.
@@ -1109,6 +1136,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 		kvm->arch.cur_tsc_nsec = ns;
 		kvm->arch.cur_tsc_write = data;
 		kvm->arch.cur_tsc_offset = offset;
+		matched = false;
 		pr_debug("kvm: new tsc generation %u, clock %llu\n",
 			 kvm->arch.cur_tsc_generation, data);
 	}
@@ -1132,6 +1160,15 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
 	kvm_x86_ops->write_tsc_offset(vcpu, offset);
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
+
+	spin_lock(&kvm->arch.pvclock_gtod_sync_lock);
+	if (matched)
+		kvm->arch.nr_vcpus_matched_tsc++;
+	else
+		kvm->arch.nr_vcpus_matched_tsc = 0;
+
+	kvm_track_tsc_matching(vcpu);
+	spin_unlock(&kvm->arch.pvclock_gtod_sync_lock);
 }
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
@@ -1222,8 +1259,9 @@ static atomic_t kvm_guest_has_master_clo
 
 /*
  *
- * Assuming a stable TSC across physical CPUs, the following condition
- * is possible. Each numbered line represents an event visible to both
+ * Assuming a stable TSC across physical CPUs, and a stable TSC
+ * across virtual CPUs, the following condition is possible.
+ * Each numbered line represents an event visible to both
  * CPUs at the next numbered event.
  *
  * "timespecX" represents host monotonic time. "tscX" represents
@@ -1256,7 +1294,7 @@ static atomic_t kvm_guest_has_master_clo
  * copy of host monotonic time values. Update that master copy
  * in lockstep.
  *
- * Rely on synchronization of host TSCs for monotonicity.
+ * Rely on synchronization of host TSCs and guest TSCs for monotonicity.
  *
  */
 
@@ -1264,19 +1302,26 @@ static void pvclock_update_vm_gtod_copy(
 {
 	struct kvm_arch *ka = &kvm->arch;
 	int vclock_mode;
+	bool host_tsc_clocksource, vcpus_matched;
 
+	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
+				atomic_read(&kvm->online_vcpus));
 	/*
 	 * If the host uses TSC clock, then pass through TSC as stable
 	 * to the guest.
 	 */
-	ka->use_master_clock = kvm_get_time_and_clockread(
+	host_tsc_clocksource = kvm_get_time_and_clockread(
 					&ka->master_kernel_ns,
 					&ka->master_cycle_now);
+
+	ka->use_master_clock = host_tsc_clocksource && vcpus_matched;
+
 	if (ka->use_master_clock)
 		atomic_set(&kvm_guest_has_master_clock, 1);
 
 	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
-	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode);
+	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode,
+				      vcpus_matched);
 }
 
 static void kvm_write_pvtime(struct kvm_vcpu *v, struct page *page,
Index: vsyscall/arch/x86/kvm/trace.h
===================================================================
--- vsyscall.orig/arch/x86/kvm/trace.h
+++ vsyscall/arch/x86/kvm/trace.h
@@ -762,21 +762,54 @@ TRACE_EVENT(
 	{VCLOCK_HPET, "hpet"}			\
 
 TRACE_EVENT(kvm_update_master_clock,
-	TP_PROTO(bool use_master_clock, unsigned int host_clock),
-	TP_ARGS(use_master_clock, host_clock),
+	TP_PROTO(bool use_master_clock, unsigned int host_clock, bool offset_matched),
+	TP_ARGS(use_master_clock, host_clock, offset_matched),
 
 	TP_STRUCT__entry(
 		__field(		bool,	use_master_clock	)
 		__field(	unsigned int,	host_clock		)
+		__field(		bool,	offset_matched		)
 	),
 
 	TP_fast_assign(
 		__entry->use_master_clock	= use_master_clock;
 		__entry->host_clock		= host_clock;
+		__entry->offset_matched		= offset_matched;
 	),
 
-	TP_printk("masterclock %d hostclock %s",
+	TP_printk("masterclock %d hostclock %s offsetmatched %u",
 		  __entry->use_master_clock,
+		  __print_symbolic(__entry->host_clock, host_clocks),
+		  __entry->offset_matched)
+);
+
+TRACE_EVENT(kvm_track_tsc,
+	TP_PROTO(unsigned int vcpu_id, unsigned int nr_matched,
+		 unsigned int online_vcpus, bool use_master_clock,
+		 unsigned int host_clock),
+	TP_ARGS(vcpu_id, nr_matched, online_vcpus, use_master_clock,
+		host_clock),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	vcpu_id			)
+		__field(	unsigned int,	nr_vcpus_matched_tsc	)
+		__field(	unsigned int,	online_vcpus		)
+		__field(	bool,		use_master_clock	)
+		__field(	unsigned int,	host_clock		)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id		= vcpu_id;
+		__entry->nr_vcpus_matched_tsc	= nr_matched;
+		__entry->online_vcpus		= online_vcpus;
+		__entry->use_master_clock	= use_master_clock;
+		__entry->host_clock		= host_clock;
+	),
+
+	TP_printk("vcpu_id %u masterclock %u offsetmatched %u nr_online %u"
+		  " hostclock %s",
+		  __entry->vcpu_id, __entry->use_master_clock,
+		  __entry->nr_vcpus_matched_tsc, __entry->online_vcpus,
 		  __print_symbolic(__entry->host_clock, host_clocks))
 );
 



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 08/18] x86: pvclock: generic pvclock vsyscall initialization
  2012-10-24 13:13 ` [patch 08/18] x86: pvclock: generic pvclock vsyscall initialization Marcelo Tosatti
@ 2012-10-29 14:18   ` Glauber Costa
  2012-10-29 14:54     ` Marcelo Tosatti
  2012-10-29 14:39   ` Glauber Costa
  1 sibling, 1 reply; 94+ messages in thread
From: Glauber Costa @ 2012-10-29 14:18 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> Index: vsyscall/arch/x86/Kconfig
> ===================================================================
> --- vsyscall.orig/arch/x86/Kconfig
> +++ vsyscall/arch/x86/Kconfig
> @@ -632,6 +632,13 @@ config PARAVIRT_SPINLOCKS
>  
>  config PARAVIRT_CLOCK
>  	bool
> +config PARAVIRT_CLOCK_VSYSCALL
> +	bool "Paravirt clock vsyscall support"
> +	depends on PARAVIRT_CLOCK && GENERIC_TIME_VSYSCALL
> +	---help---
> +	  Enable performance critical clock related system calls to
> +	  be executed in userspace, provided that the hypervisor
> +	  supports it.
>  
>  endif

Besides debugging, what is the point in having this as an
extra-selectable? Is there any case in which a virtual machine has code
for this, but may decide to run without it?

I believe all this code in vsyscall should be wrapped in PARAVIRT_CLOCK
only.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 08/18] x86: pvclock: generic pvclock vsyscall initialization
  2012-10-24 13:13 ` [patch 08/18] x86: pvclock: generic pvclock vsyscall initialization Marcelo Tosatti
  2012-10-29 14:18   ` Glauber Costa
@ 2012-10-29 14:39   ` Glauber Costa
  1 sibling, 0 replies; 94+ messages in thread
From: Glauber Costa @ 2012-10-29 14:39 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> + */
> +int __init pvclock_init_vsyscall(void)
> +{
> +	int idx;
> +	unsigned int size = PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE;
> +
> +	pvclock_vdso_info = __alloc_bootmem(size, PAGE_SIZE, 0);
> +	if (!pvclock_vdso_info)
> +		return -ENOMEM;
> +
> +	memset(pvclock_vdso_info, 0, size);
> +
> +	for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
> +		__set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
> +			     __pa_symbol(pvclock_vdso_info) + (idx*PAGE_SIZE),
> +		     	     PAGE_KERNEL_VVAR);


BTW, Previous line is whitespace damaged.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-10-24 13:13 ` [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR Marcelo Tosatti
@ 2012-10-29 14:45   ` Glauber Costa
  2012-10-29 17:44     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 94+ messages in thread
From: Glauber Costa @ 2012-10-29 14:45 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> Allow a guest to register a second location for the VCPU time info
> structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
> This is intended to allow the guest kernel to map this information
> into a usermode accessible page, so that usermode can efficiently
> calculate system time from the TSC without having to make a syscall.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Can you please be a bit more specific about why we need this? Why does
the host need to provide us with two pages with the exact same data? Why
can't we just do it with mapping tricks in the guest?



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 11/18] x86: vsyscall: pass mode to gettime backend
  2012-10-24 13:13 ` [patch 11/18] x86: vsyscall: pass mode to gettime backend Marcelo Tosatti
@ 2012-10-29 14:47   ` Glauber Costa
  2012-10-29 18:41     ` Marcelo Tosatti
  0 siblings, 1 reply; 94+ messages in thread
From: Glauber Costa @ 2012-10-29 14:47 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> Required by next patch.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
I don't see where.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 08/18] x86: pvclock: generic pvclock vsyscall initialization
  2012-10-29 14:18   ` Glauber Costa
@ 2012-10-29 14:54     ` Marcelo Tosatti
  2012-10-29 17:46       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-29 14:54 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On Mon, Oct 29, 2012 at 06:18:20PM +0400, Glauber Costa wrote:
> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> > Index: vsyscall/arch/x86/Kconfig
> > ===================================================================
> > --- vsyscall.orig/arch/x86/Kconfig
> > +++ vsyscall/arch/x86/Kconfig
> > @@ -632,6 +632,13 @@ config PARAVIRT_SPINLOCKS
> >  
> >  config PARAVIRT_CLOCK
> >  	bool
> > +config PARAVIRT_CLOCK_VSYSCALL
> > +	bool "Paravirt clock vsyscall support"
> > +	depends on PARAVIRT_CLOCK && GENERIC_TIME_VSYSCALL
> > +	---help---
> > +	  Enable performance critical clock related system calls to
> > +	  be executed in userspace, provided that the hypervisor
> > +	  supports it.
> >  
> >  endif
> 
> Besides debugging, what is the point in having this as an
> extra-selectable? Is there any case in which a virtual machine has code
> for this, but may decide to run without it?

Don't think so (it's pretty small anyway, the code).

> I believe all this code in vsyscall should be wrapped in PARAVIRT_CLOCK
> only.

Unless Jeremy has a reason, i'm fine with that.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 12/18] x86: vdso: pvclock gettime support
  2012-10-24 13:13 ` [patch 12/18] x86: vdso: pvclock gettime support Marcelo Tosatti
@ 2012-10-29 14:59   ` Glauber Costa
  2012-10-29 18:42     ` Marcelo Tosatti
  0 siblings, 1 reply; 94+ messages in thread
From: Glauber Costa @ 2012-10-29 14:59 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> Improve performance of time system calls when using Linux pvclock, 
> by reading time info from fixmap visible copy of pvclock data.
> 
> Originally from Jeremy Fitzhardinge.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: vsyscall/arch/x86/vdso/vclock_gettime.c
> ===================================================================
> --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
> +++ vsyscall/arch/x86/vdso/vclock_gettime.c
> @@ -22,6 +22,7 @@
>  #include <asm/hpet.h>
>  #include <asm/unistd.h>
>  #include <asm/io.h>
> +#include <asm/pvclock.h>
>  
>  #define gtod (&VVAR(vsyscall_gtod_data))
>  
> @@ -62,6 +63,69 @@ static notrace cycle_t vread_hpet(void)
>  	return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
>  }
>  
> +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
> +
> +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> +{
> +	const aligned_pvti_t *pvti_base;
> +	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
> +	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
> +
> +	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
> +
> +	pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
> +
> +	return &pvti_base[offset].info;
> +}
> +

Unless I am missing something, if gcc decides to not inline get_pvti,
this will break, right? I believe you need to mark that function with
__always_inline.

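E.g. (sketch):

static notrace __always_inline
const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
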
> +static notrace cycle_t vread_pvclock(int *mode)
> +{
> +	const struct pvclock_vsyscall_time_info *pvti;
> +	cycle_t ret;
> +	u64 last;
> +	u32 version;
> +	u32 migrate_count;
> +	u8 flags;
> +	unsigned cpu, cpu1;
> +
> +
> +	/*
> +	 * When looping to get a consistent (time-info, tsc) pair, we
> +	 * also need to deal with the possibility we can switch vcpus,
> +	 * so make sure we always re-fetch time-info for the current vcpu.
> +	 */
> +	do {
> +		cpu = __getcpu() & 0xfff;

Please wrap this 0xfff into something meaningful.

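For example (VGETCPU_CPU_MASK is an illustrative name, not an existing
define; the vgetcpu encoding keeps the cpu number in the low 12 bits
and the node number above them):

	#define VGETCPU_CPU_MASK 0xfff	/* cpu in bits 0-11 */

	cpu = __getcpu() & VGETCPU_CPU_MASK;
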
> +		pvti = get_pvti(cpu);
> +
> +		migrate_count = pvti->migrate_count;
> +
> +		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> +
> +		/*
> +		 * Test we're still on the cpu as well as the version.
> +		 * We could have been migrated just after the first
> +		 * vgetcpu but before fetching the version, so we
> +		 * wouldn't notice a version change.
> +		 */
> +		cpu1 = __getcpu() & 0xfff;
> +	} while (unlikely(cpu != cpu1 ||
> +			  (pvti->pvti.version & 1) ||
> +			  pvti->pvti.version != version ||
> +			  pvti->migrate_count != migrate_count));
> +
> +	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +		*mode = VCLOCK_NONE;
> +
> +	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
> +
> +	if (likely(ret >= last))
> +		return ret;
> +

Please add a comment here referring to tsc.c, where an explanation of
this test lives. This is quite non-obvious for the non initiated.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 13/18] KVM: x86: pass host_tsc to read_l1_tsc
  2012-10-24 13:13 ` [patch 13/18] KVM: x86: pass host_tsc to read_l1_tsc Marcelo Tosatti
@ 2012-10-29 15:04   ` Glauber Costa
  2012-10-29 18:45     ` Marcelo Tosatti
  0 siblings, 1 reply; 94+ messages in thread
From: Glauber Costa @ 2012-10-29 15:04 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Would you mind explaining why?

it seems to me that rdtscll() here would be perfectly safe: the only
case in which it wouldn't be is in a nested-vm environment running
paravirt-linux with a paravirt tsc. In this case, it is quite likely
that we'll want rdtscll *anyway*, instead of going to tsc directly.



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-10-29 14:45   ` Glauber Costa
@ 2012-10-29 17:44     ` Jeremy Fitzhardinge
  2012-10-29 18:40       ` Marcelo Tosatti
  2012-10-30  7:38       ` Glauber Costa
  0 siblings, 2 replies; 94+ messages in thread
From: Jeremy Fitzhardinge @ 2012-10-29 17:44 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Marcelo Tosatti, kvm, johnstul, zamsden, gleb, avi, pbonzini

On 10/29/2012 07:45 AM, Glauber Costa wrote:
> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
>> Allow a guest to register a second location for the VCPU time info
>> structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
>> This is intended to allow the guest kernel to map this information
>> into a usermode accessible page, so that usermode can efficiently
>> calculate system time from the TSC without having to make a syscall.
>>
>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> Can you please be a bit more specific about why we need this? Why does
> the host need to provide us with two pages with the exact same data? Why
> can't we just do it with mapping tricks in the guest?

In Xen the pvclock structure is embedded within a pile of other stuff
that shouldn't be mapped into guest memory, so providing for a second
location allows it to be placed whereever is convenient for the guest.
That's a restriction of the Xen ABI, but I don't know if it affects KVM.

    J

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 08/18] x86: pvclock: generic pvclock vsyscall initialization
  2012-10-29 14:54     ` Marcelo Tosatti
@ 2012-10-29 17:46       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 94+ messages in thread
From: Jeremy Fitzhardinge @ 2012-10-29 17:46 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Glauber Costa, kvm, johnstul, zamsden, gleb, avi, pbonzini

On 10/29/2012 07:54 AM, Marcelo Tosatti wrote:
> On Mon, Oct 29, 2012 at 06:18:20PM +0400, Glauber Costa wrote:
>> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
>>> Index: vsyscall/arch/x86/Kconfig
>>> ===================================================================
>>> --- vsyscall.orig/arch/x86/Kconfig
>>> +++ vsyscall/arch/x86/Kconfig
>>> @@ -632,6 +632,13 @@ config PARAVIRT_SPINLOCKS
>>>  
>>>  config PARAVIRT_CLOCK
>>>  	bool
>>> +config PARAVIRT_CLOCK_VSYSCALL
>>> +	bool "Paravirt clock vsyscall support"
>>> +	depends on PARAVIRT_CLOCK && GENERIC_TIME_VSYSCALL
>>> +	---help---
>>> +	  Enable performance critical clock related system calls to
>>> +	  be executed in userspace, provided that the hypervisor
>>> +	  supports it.
>>>  
>>>  endif
>> Besides debugging, what is the point in having this as an
>> extra-selectable? Is there any case in which a virtual machine has code
>> for this, but may decide to run without it?
> Don't think so (it's pretty small anyway, the code).
>
>> I believe all this code in vsyscall should be wrapped in PARAVIRT_CLOCK
>> only.
> Unless Jeremy has a reason, i'm fine with that.

I often set up blind config variables for dependency management; I'm
guessing the "GENERIC_TIME_VSYSCALL" dependency is important.  I think
the problem is that this exists, but that it's a user-selectable
option.  Removing the prompt should fix that.
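
I.e. something like (sketch):

config PARAVIRT_CLOCK_VSYSCALL
	bool
	depends on PARAVIRT_CLOCK && GENERIC_TIME_VSYSCALL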

    J


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-10-29 17:44     ` Jeremy Fitzhardinge
@ 2012-10-29 18:40       ` Marcelo Tosatti
  2012-10-30  7:41         ` Glauber Costa
  2012-10-30  9:39         ` Avi Kivity
  2012-10-30  7:38       ` Glauber Costa
  1 sibling, 2 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-29 18:40 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Glauber Costa
  Cc: kvm, johnstul, zamsden, gleb, avi, pbonzini

On Mon, Oct 29, 2012 at 10:44:41AM -0700, Jeremy Fitzhardinge wrote:
> On 10/29/2012 07:45 AM, Glauber Costa wrote:
> > On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> >> Allow a guest to register a second location for the VCPU time info
> >> structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
> >> This is intended to allow the guest kernel to map this information
> >> into a usermode accessible page, so that usermode can efficiently
> >> calculate system time from the TSC without having to make a syscall.
> >>
> >> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > Can you please be a bit more specific about why we need this? Why does
> > the host need to provide us with two pages with the exact same data? Why
> > can't we just do it with mapping tricks in the guest?
> 
> In Xen the pvclock structure is embedded within a pile of other stuff
> that shouldn't be mapped into guest memory, so providing for a second
> location allows it to be placed whereever is convenient for the guest.
> That's a restriction of the Xen ABI, but I don't know if it affects KVM.
> 
>     J

It is possible to share the data for KVM in theory, but:

- It is a small amount of memory. 
- It requires aligning to page size (the in-kernel percpu array 
is currently cacheline aligned).
- It is possible to modify flags separately for userspace/kernelspace,
if desired.

This justifies the duplication IMO (code is simple and clean).


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 11/18] x86: vsyscall: pass mode to gettime backend
  2012-10-29 14:47   ` Glauber Costa
@ 2012-10-29 18:41     ` Marcelo Tosatti
  2012-10-30  7:42       ` Glauber Costa
  0 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-29 18:41 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On Mon, Oct 29, 2012 at 06:47:57PM +0400, Glauber Costa wrote:
> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> > Required by next patch.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> I don't see where.

+       if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+               *mode = VCLOCK_NONE;



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 12/18] x86: vdso: pvclock gettime support
  2012-10-29 14:59   ` Glauber Costa
@ 2012-10-29 18:42     ` Marcelo Tosatti
  2012-10-30  7:49       ` Glauber Costa
  0 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-29 18:42 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On Mon, Oct 29, 2012 at 06:59:35PM +0400, Glauber Costa wrote:
> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> > Improve performance of time system calls when using Linux pvclock, 
> > by reading time info from fixmap visible copy of pvclock data.
> > 
> > Originally from Jeremy Fitzhardinge.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > Index: vsyscall/arch/x86/vdso/vclock_gettime.c
> > ===================================================================
> > --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
> > +++ vsyscall/arch/x86/vdso/vclock_gettime.c
> > @@ -22,6 +22,7 @@
> >  #include <asm/hpet.h>
> >  #include <asm/unistd.h>
> >  #include <asm/io.h>
> > +#include <asm/pvclock.h>
> >  
> >  #define gtod (&VVAR(vsyscall_gtod_data))
> >  
> > @@ -62,6 +63,69 @@ static notrace cycle_t vread_hpet(void)
> >  	return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
> >  }
> >  
> > +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
> > +
> > +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> > +{
> > +	const aligned_pvti_t *pvti_base;
> > +	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
> > +	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
> > +
> > +	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
> > +
> > +	pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
> > +
> > +	return &pvti_base[offset].info;
> > +}
> > +
> 
> Unless I am missing something, if gcc decides to not inline get_pvti,
> this will break, right? I believe you need to mark that function with
> __always_inline.

Can't see why. Please enlighten me.

> 
> > +static notrace cycle_t vread_pvclock(int *mode)
> > +{
> > +	const struct pvclock_vsyscall_time_info *pvti;
> > +	cycle_t ret;
> > +	u64 last;
> > +	u32 version;
> > +	u32 migrate_count;
> > +	u8 flags;
> > +	unsigned cpu, cpu1;
> > +
> > +
> > +	/*
> > +	 * When looping to get a consistent (time-info, tsc) pair, we
> > +	 * also need to deal with the possibility we can switch vcpus,
> > +	 * so make sure we always re-fetch time-info for the current vcpu.
> > +	 */
> > +	do {
> > +		cpu = __getcpu() & 0xfff;
> 
> Please wrap this 0xfff into something meaningful.

OK.

> > +		pvti = get_pvti(cpu);
> > +
> > +		migrate_count = pvti->migrate_count;
> > +
> > +		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> > +
> > +		/*
> > +		 * Test we're still on the cpu as well as the version.
> > +		 * We could have been migrated just after the first
> > +		 * vgetcpu but before fetching the version, so we
> > +		 * wouldn't notice a version change.
> > +		 */
> > +		cpu1 = __getcpu() & 0xfff;
> > +	} while (unlikely(cpu != cpu1 ||
> > +			  (pvti->pvti.version & 1) ||
> > +			  pvti->pvti.version != version ||
> > +			  pvti->migrate_count != migrate_count));
> > +
> > +	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> > +		*mode = VCLOCK_NONE;
> > +
> > +	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
> > +
> > +	if (likely(ret >= last))
> > +		return ret;
> > +
> 
> Please add a comment here referring to tsc.c, where an explanation of
> this test lives. This is quite non-obvious for the non initiated.

OK.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 13/18] KVM: x86: pass host_tsc to read_l1_tsc
  2012-10-29 15:04   ` Glauber Costa
@ 2012-10-29 18:45     ` Marcelo Tosatti
  2012-10-30  7:55       ` Glauber Costa
  0 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-29 18:45 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On Mon, Oct 29, 2012 at 07:04:59PM +0400, Glauber Costa wrote:
> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> > Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Would you mind explaining why?
> 
> it seems to me that rdtscll() here would be perfectly safe: the only
> case in which it wouldn't be is in a nested-vm environment running
> paravirt-linux with a paravirt tsc. In this case, it is quite likely
> that we'll want rdtscll *anyway*, instead of going to tsc directly.

Its something different (from a future patch):

"KVM added a global variable to guarantee monotonicity in the guest.
One of the reasons for that is that the time between

        1. ktime_get_ts(&timespec);
        2. rdtscll(tsc);

is variable. That is, given a host with a stable TSC, suppose that
two VCPUs read the same time via ktime_get_ts() above.

The time required to execute 2. is not the same on those two instances
executing on different VCPUs (cache misses, interrupts...)."


Think of step 1. returning the same value on both vcpus (re-read the
explanation above).
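
In code terms (sketch of the kvm_guest_time_update() flow from patch
15; one <kernel_ns, host_tsc> pair is sampled together and reused by
all vcpus):

	kernel_ns = ka->master_kernel_ns;
	host_tsc  = ka->master_cycle_now;
	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, host_tsc);

Hence the caller must be able to pass the TSC value in, rather than
have read_l1_tsc() sample it internally.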


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-10-29 17:44     ` Jeremy Fitzhardinge
  2012-10-29 18:40       ` Marcelo Tosatti
@ 2012-10-30  7:38       ` Glauber Costa
  1 sibling, 0 replies; 94+ messages in thread
From: Glauber Costa @ 2012-10-30  7:38 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Marcelo Tosatti, kvm, johnstul, zamsden, gleb, avi, pbonzini

On 10/29/2012 09:44 PM, Jeremy Fitzhardinge wrote:
> On 10/29/2012 07:45 AM, Glauber Costa wrote:
>> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
>>> Allow a guest to register a second location for the VCPU time info
>>> structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
>>> This is intended to allow the guest kernel to map this information
>>> into a usermode accessible page, so that usermode can efficiently
>>> calculate system time from the TSC without having to make a syscall.
>>>
>>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>> Can you please be a bit more specific about why we need this? Why does
>> the host need to provide us with two pages with the exact same data? Why
>> can't we just do it with mapping tricks in the guest?
> 
> In Xen the pvclock structure is embedded within a pile of other stuff
> that shouldn't be mapped into guest memory, so providing for a second
> location allows it to be placed whereever is convenient for the guest.
> That's a restriction of the Xen ABI, but I don't know if it affects KVM.
> 
>     J
> 
In kvm the exported data seems to be exactly the same. So it makes sense
to have a single page exported to the guest. The guest may have a
facility that maps the vsyscall clock and the normal clock to a specific
location given a page - that can be either a different page or the same
page.

Any reason why it wouldn't work?


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-10-29 18:40       ` Marcelo Tosatti
@ 2012-10-30  7:41         ` Glauber Costa
  2012-10-30  9:39         ` Avi Kivity
  1 sibling, 0 replies; 94+ messages in thread
From: Glauber Costa @ 2012-10-30  7:41 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Jeremy Fitzhardinge, kvm, johnstul, zamsden, gleb, avi, pbonzini

On 10/29/2012 10:40 PM, Marcelo Tosatti wrote:
> On Mon, Oct 29, 2012 at 10:44:41AM -0700, Jeremy Fitzhardinge wrote:
>> On 10/29/2012 07:45 AM, Glauber Costa wrote:
>>> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
>>>> Allow a guest to register a second location for the VCPU time info
>>>> structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
>>>> This is intended to allow the guest kernel to map this information
>>>> into a usermode accessible page, so that usermode can efficiently
>>>> calculate system time from the TSC without having to make a syscall.
>>>>
>>>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>>> Can you please be a bit more specific about why we need this? Why does
>>> the host need to provide us with two pages with the exact same data? Why
>>> can't we just do it with mapping tricks in the guest?
>>
>> In Xen the pvclock structure is embedded within a pile of other stuff
>> that shouldn't be mapped into guest memory, so providing for a second
>> location allows it to be placed whereever is convenient for the guest.
>> That's a restriction of the Xen ABI, but I don't know if it affects KVM.
>>
>>     J
> 
> It is possible to share the data for KVM in theory, but:
> 
> - It is a small amount of memory. 
> - It requires aligning to page size (the in-kernel percpu array 
> is currently cacheline aligned).
> - It is possible to modify flags separately for userspace/kernelspace,
> if desired.
> 
> This justifies the duplication IMO (code is simple and clean).
> 
Duplicating is indeed not the end of the world. But one note:

* If it is page-size aligned, it is automatically cacheline aligned.
Since we have to export the user page *anyway*, this is a non-issue.

That said, duplicating instead of integrating, which is technically
possible, is a design decision, and it needs to be documented somewhere
aside from this mail thread.




^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 11/18] x86: vsyscall: pass mode to gettime backend
  2012-10-29 18:41     ` Marcelo Tosatti
@ 2012-10-30  7:42       ` Glauber Costa
  0 siblings, 0 replies; 94+ messages in thread
From: Glauber Costa @ 2012-10-30  7:42 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 10/29/2012 10:41 PM, Marcelo Tosatti wrote:
> On Mon, Oct 29, 2012 at 06:47:57PM +0400, Glauber Costa wrote:
>> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
>>> Required by next patch.
>>>
>>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>> I don't see where.
> 
> +       if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +               *mode = VCLOCK_NONE;
> 
> 

I see. But I end up thinking this is a case of oversimplification that
makes things more complicated =)

If this were folded into the next patch, it would be obvious.
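
For context, what the mode plumbing buys is a clean fallback path in the
vDSO caller. A rough sketch of the calling pattern, with helper and field
names only approximating the vclock_gettime.c of this era (an
illustration, not the actual patch; the backend taking &mode is exactly
what this patch adds):

notrace static int do_realtime(struct timespec *ts)
{
	unsigned long seq, ns;
	int mode;

	do {
		seq = read_seqcount_begin(&gtod->seq);
		mode = gtod->clock.vclock_mode;
		ts->tv_sec  = gtod->wall_time_sec;
		ts->tv_nsec = gtod->wall_time_nsec;
		ns = vgetns(&mode);	/* backend may veto: mode = VCLOCK_NONE */
	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));

	timespec_add_ns(ts, ns);
	return mode;	/* vclock_gettime() falls back to the syscall on VCLOCK_NONE */
}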


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 12/18] x86: vdso: pvclock gettime support
  2012-10-29 18:42     ` Marcelo Tosatti
@ 2012-10-30  7:49       ` Glauber Costa
  2012-10-31  3:16         ` Marcelo Tosatti
  0 siblings, 1 reply; 94+ messages in thread
From: Glauber Costa @ 2012-10-30  7:49 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 10/29/2012 10:42 PM, Marcelo Tosatti wrote:
> On Mon, Oct 29, 2012 at 06:59:35PM +0400, Glauber Costa wrote:
>> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
>>> Improve performance of time system calls when using Linux pvclock, 
>>> by reading time info from fixmap visible copy of pvclock data.
>>>
>>> Originally from Jeremy Fitzhardinge.
>>>
>>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>>>
>>> Index: vsyscall/arch/x86/vdso/vclock_gettime.c
>>> ===================================================================
>>> --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
>>> +++ vsyscall/arch/x86/vdso/vclock_gettime.c
>>> @@ -22,6 +22,7 @@
>>>  #include <asm/hpet.h>
>>>  #include <asm/unistd.h>
>>>  #include <asm/io.h>
>>> +#include <asm/pvclock.h>
>>>  
>>>  #define gtod (&VVAR(vsyscall_gtod_data))
>>>  
>>> @@ -62,6 +63,69 @@ static notrace cycle_t vread_hpet(void)
>>>  	return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
>>>  }
>>>  
>>> +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
>>> +
>>> +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>>> +{
>>> +	const aligned_pvti_t *pvti_base;
>>> +	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
>>> +	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
>>> +
>>> +	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
>>> +
>>> +	pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
>>> +
>>> +	return &pvti_base[offset].info;
>>> +}
>>> +
>>
>> Unless I am missing something, if gcc decides to not inline get_pvti,
>> this will break, right? I believe you need to mark that function with
>> __always_inline.
> 
> Can't see why. Please enlighten me.
> 

I may be wrong, I haven't dealt with this vdso code for quite a while,
so forgive me if my memory tricked me.

But wasn't it the case that vdso functions could not call functions in
the kernel address space outside the mapped page? Or does this
restriction only apply to accessing data?


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 13/18] KVM: x86: pass host_tsc to read_l1_tsc
  2012-10-29 18:45     ` Marcelo Tosatti
@ 2012-10-30  7:55       ` Glauber Costa
  0 siblings, 0 replies; 94+ messages in thread
From: Glauber Costa @ 2012-10-30  7:55 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 10/29/2012 10:45 PM, Marcelo Tosatti wrote:
> On Mon, Oct 29, 2012 at 07:04:59PM +0400, Glauber Costa wrote:
>> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
>>> Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().
>>>
>>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>>
>> Would you mind explaining why?
>>
>> it seems to me that rdtscll() here would be perfectly safe: the only
>> case in which it wouldn't be is in a nested-vm environment running
>> paravirt-linux with a paravirt tsc. In this case, it is quite likely
>> that we'll want rdtscll *anyway*, instead of going to tsc directly.
> 
> Its something different (from a future patch):
> 
> "KVM added a global variable to guarantee monotonicity in the guest.
> One of the reasons for that is that the time between
> 
>         1. ktime_get_ts(&timespec);
>         2. rdtscll(tsc);
> 
> Is variable. That is, given a host with stable TSC, suppose that
> two VCPUs read the same time via ktime_get_ts() above.
> 
> The time required to execute 2. is not the same on those two instances
> executing in different VCPUS (cache misses, interrupts...)."
> 
> 
> Think of step 1. returning the same value on both vcpus (then re-read
> the explanation above).
> 

This still doesn't get to the core of the question. You are replacing
rdtscll with native_read_tsc(). They are equivalent most of the time for
the host (except in the case I outlined), so the logic you exposed is
valid for both. But the real problem is another one:

Although I am not very skilled in C, I rock in the Logo programming
language (http://en.wikipedia.org/wiki/Logo_(programming_language) ),
and with that knowledge, I can understand your C code - with a bit of
effort. After reading it, it becomes very clear that what you do is
"Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc",
so your changelog doesn't really add any data. It would be great if it
explained to us readers the why instead of the what!



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
  2012-10-24 13:13 ` [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
@ 2012-10-30  8:34   ` Glauber Costa
  2012-10-31  3:19     ` [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
  0 siblings, 1 reply; 94+ messages in thread
From: Glauber Costa @ 2012-10-30  8:34 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> KVM added a global variable to guarantee monotonicity in the guest. 
> One of the reasons for that is that the time between
> 
> 	1. ktime_get_ts(&timespec);
> 	2. rdtscll(tsc);
> 
> Is variable. That is, given a host with stable TSC, suppose that
> two VCPUs read the same time via ktime_get_ts() above.
> 
> The time required to execute 2. is not the same on those two instances 
> executing in different VCPUS (cache misses, interrupts...).
> 
> If the TSC value that is used by the host to interpolate when 
> calculating the monotonic time is the same value used to calculate
> the tsc_timestamp value stored in the pvclock data structure, and
> a single <system_timestamp, tsc_timestamp> tuple is visible to all 
> vcpus simultaneously, this problem disappears. See comment on top
> of pvclock_update_vm_gtod_copy for details.
> 
> Monotonicity is then guaranteed by synchronicity of the host TSCs
> and guest TSCs. 
> 
> Set TSC stable pvclock flag in that case, allowing the guest to read
> clock from userspace.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

If you are using a master copy, with a stable host-side tsc, you can get
rid of the normal REQ_CLOCK updates during vcpu load.
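
For reference, the guest-side interpolation this whole argument revolves
around is, stripped to its core (field names as in struct
pvclock_vcpu_time_info; a sketch, not a quote from the series):

	u64 delta = native_read_tsc() - pvti->tsc_timestamp;
	u64 ns = pvti->system_time +
		 pvclock_scale_delta(delta, pvti->tsc_to_system_mul,
				     pvti->tsc_shift);

If all vcpus see a single <system_time, tsc_timestamp> tuple and their
TSCs are synchronized, a later TSC read can only produce a larger ns,
which is exactly the monotonicity claim above.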

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 03/18] x86: pvclock: remove pvclock_shadow_time
  2012-10-24 13:13 ` [patch 03/18] x86: pvclock: remove pvclock_shadow_time Marcelo Tosatti
@ 2012-10-30  9:23   ` Avi Kivity
  2012-10-30  9:24     ` Avi Kivity
  0 siblings, 1 reply; 94+ messages in thread
From: Avi Kivity @ 2012-10-30  9:23 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, glommer, zamsden, gleb, pbonzini

On 10/24/2012 03:13 PM, Marcelo Tosatti wrote:
> Originally from Jeremy Fitzhardinge.
> 
> We can copy the information directly from "struct pvclock_vcpu_time_info", 
> remove pvclock_shadow_time.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
>  
>  unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
> @@ -90,21 +54,20 @@ void pvclock_resume(void)
>  
>  cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
>  {
> -	struct pvclock_shadow_time shadow;
>  	unsigned version;
>  	cycle_t ret, offset;
>  	u64 last;
>  
>  	do {
> -		version = pvclock_get_time_values(&shadow, src);
> +		version = src->version;
>  		rdtsc_barrier();
> -		offset = pvclock_get_nsec_offset(&shadow);
> -		ret = shadow.system_timestamp + offset;
> +		offset = pvclock_get_nsec_offset(src);
> +		ret = src->system_time + offset;
>  		rdtsc_barrier();
> -	} while (version != src->version);
> +	} while ((src->version & 1) || version != src->version);
>  
>  	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> -		(shadow.flags & PVCLOCK_TSC_STABLE_BIT))
> +		(src->flags & PVCLOCK_TSC_STABLE_BIT))
>  		return ret;
>  

You're now reading PVCLOCK_TSC_STABLE outside the critical section.  We
could have live migrated to a tsc-unstable host and had this flag cleared.
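
Concretely, the race-free shape (the v3 repost folds this fix into the
refactoring) samples flags inside the version-validated window:

	do {
		version = src->version;
		rdtsc_barrier();
		offset = pvclock_get_nsec_offset(src);
		ret = src->system_time + offset;
		flags = src->flags;		/* sampled inside the window */
		rdtsc_barrier();
	} while ((src->version & 1) || version != src->version);

	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
	    (flags & PVCLOCK_TSC_STABLE_BIT))
		return ret;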


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 03/18] x86: pvclock: remove pvclock_shadow_time
  2012-10-30  9:23   ` Avi Kivity
@ 2012-10-30  9:24     ` Avi Kivity
  0 siblings, 0 replies; 94+ messages in thread
From: Avi Kivity @ 2012-10-30  9:24 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, glommer, zamsden, gleb, pbonzini

On 10/30/2012 11:23 AM, Avi Kivity wrote:
> On 10/24/2012 03:13 PM, Marcelo Tosatti wrote:
>> Originally from Jeremy Fitzhardinge.
>> 
>> We can copy the information directly from "struct pvclock_vcpu_time_info", 
>> remove pvclock_shadow_time.
>> 
>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>> 
>>  
>>  unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
>> @@ -90,21 +54,20 @@ void pvclock_resume(void)
>>  
>>  cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
>>  {
>> -	struct pvclock_shadow_time shadow;
>>  	unsigned version;
>>  	cycle_t ret, offset;
>>  	u64 last;
>>  
>>  	do {
>> -		version = pvclock_get_time_values(&shadow, src);
>> +		version = src->version;
>>  		rdtsc_barrier();
>> -		offset = pvclock_get_nsec_offset(&shadow);
>> -		ret = shadow.system_timestamp + offset;
>> +		offset = pvclock_get_nsec_offset(src);
>> +		ret = src->system_time + offset;
>>  		rdtsc_barrier();
>> -	} while (version != src->version);
>> +	} while ((src->version & 1) || version != src->version);
>>  
>>  	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
>> -		(shadow.flags & PVCLOCK_TSC_STABLE_BIT))
>> +		(src->flags & PVCLOCK_TSC_STABLE_BIT))
>>  		return ret;
>>  
> 
> You're now reading PVCLOCK_TSC_STABLE outside the critical section.  We
> could have live migrated to a tsc-unstable host and had this flag cleared.
> 

I see it's fixed in patch 5.  So there's a window of two patches where
this is broken.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-10-29 18:40       ` Marcelo Tosatti
  2012-10-30  7:41         ` Glauber Costa
@ 2012-10-30  9:39         ` Avi Kivity
  2012-10-31  3:12           ` Marcelo Tosatti
  1 sibling, 1 reply; 94+ messages in thread
From: Avi Kivity @ 2012-10-30  9:39 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Jeremy Fitzhardinge, Glauber Costa, kvm, johnstul, zamsden, gleb,
	pbonzini

On 10/29/2012 08:40 PM, Marcelo Tosatti wrote:
> On Mon, Oct 29, 2012 at 10:44:41AM -0700, Jeremy Fitzhardinge wrote:
>> On 10/29/2012 07:45 AM, Glauber Costa wrote:
>> > On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
>> >> Allow a guest to register a second location for the VCPU time info
>> >> structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
>> >> This is intended to allow the guest kernel to map this information
>> >> into a usermode accessible page, so that usermode can efficiently
>> >> calculate system time from the TSC without having to make a syscall.
>> >>
>> >> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>> > Can you please be a bit more specific about why we need this? Why does
>> > the host need to provide us with two pages with the exact same data? Why
>> > can't just do it with mapping tricks in the guest?
>> 
>> In Xen the pvclock structure is embedded within a pile of other stuff
>> that shouldn't be mapped into guest memory, so providing for a second
>> location allows it to be placed wherever is convenient for the guest.
>> That's a restriction of the Xen ABI, but I don't know if it affects KVM.
>> 
>>     J
> 
> It is possible to share the data for KVM in theory, but:
> 
> - It is a small amount of memory. 
> - It requires aligning to page size (the in-kernel percpu array 
> is currently cacheline aligned).
> - It is possible to modify flags separately for userspace/kernelspace,
> if desired.
> 
> This justifies the duplication IMO (code is simple and clean).
> 

What would be the changes required to remove the duplication?  If it's
just page alignment, then it seems even smaller.  In addition we avoid
expanding the ABI again.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-10-30  9:39         ` Avi Kivity
@ 2012-10-31  3:12           ` Marcelo Tosatti
  2012-11-02 10:21             ` Glauber Costa
  0 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31  3:12 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jeremy Fitzhardinge, Glauber Costa, kvm, johnstul, zamsden, gleb,
	pbonzini

On Tue, Oct 30, 2012 at 11:39:32AM +0200, Avi Kivity wrote:
> On 10/29/2012 08:40 PM, Marcelo Tosatti wrote:
> > On Mon, Oct 29, 2012 at 10:44:41AM -0700, Jeremy Fitzhardinge wrote:
> >> On 10/29/2012 07:45 AM, Glauber Costa wrote:
> >> > On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> >> >> Allow a guest to register a second location for the VCPU time info
> >> >> structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
> >> >> This is intended to allow the guest kernel to map this information
> >> >> into a usermode accessible page, so that usermode can efficiently
> >> >> calculate system time from the TSC without having to make a syscall.
> >> >>
> >> >> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >> > Can you please be a bit more specific about why we need this? Why does
> >> > the host need to provide us with two pages with the exact same data? Why
> >> > can't just do it with mapping tricks in the guest?
> >> 
> >> In Xen the pvclock structure is embedded within a pile of other stuff
> >> that shouldn't be mapped into guest memory, so providing for a second
> >> location allows it to be placed wherever is convenient for the guest.
> >> That's a restriction of the Xen ABI, but I don't know if it affects KVM.
> >> 
> >>     J
> > 
> > It is possible to share the data for KVM in theory, but:
> > 
> > - It is a small amount of memory. 
> > - It requires aligning to page size (the in-kernel percpu array 
> > is currently cacheline aligned).
> > - It is possible to modify flags separately for userspace/kernelspace,
> > if desired.
> > 
> > This justifies the duplication IMO (code is simple and clean).
> > 
> 
> What would be the changes required to remove the duplication?  If it's
> just page alignment, then is seems even smaller.  In addition we avoid
> expanding the ABI again.

This would require changing the kernel copy from percpu data, which is
not guaranteed to be linear (necessary for the fixmap mapping), to
dynamically allocated memory (which in turn can be tricky due to the
early boot clock requirement).

Hum, no thanks.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 12/18] x86: vdso: pvclock gettime support
  2012-10-30  7:49       ` Glauber Costa
@ 2012-10-31  3:16         ` Marcelo Tosatti
  0 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31  3:16 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On Tue, Oct 30, 2012 at 11:49:39AM +0400, Glauber Costa wrote:
> On 10/29/2012 10:42 PM, Marcelo Tosatti wrote:
> > On Mon, Oct 29, 2012 at 06:59:35PM +0400, Glauber Costa wrote:
> >> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> >>> Improve performance of time system calls when using Linux pvclock, 
> >>> by reading time info from fixmap visible copy of pvclock data.
> >>>
> >>> Originally from Jeremy Fitzhardinge.
> >>>
> >>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >>>
> >>> Index: vsyscall/arch/x86/vdso/vclock_gettime.c
> >>> ===================================================================
> >>> --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
> >>> +++ vsyscall/arch/x86/vdso/vclock_gettime.c
> >>> @@ -22,6 +22,7 @@
> >>>  #include <asm/hpet.h>
> >>>  #include <asm/unistd.h>
> >>>  #include <asm/io.h>
> >>> +#include <asm/pvclock.h>
> >>>  
> >>>  #define gtod (&VVAR(vsyscall_gtod_data))
> >>>  
> >>> @@ -62,6 +63,69 @@ static notrace cycle_t vread_hpet(void)
> >>>  	return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
> >>>  }
> >>>  
> >>> +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
> >>> +
> >>> +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> >>> +{
> >>> +	const aligned_pvti_t *pvti_base;
> >>> +	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
> >>> +	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
> >>> +
> >>> +	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
> >>> +
> >>> +	pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
> >>> +
> >>> +	return &pvti_base[offset].info;
> >>> +}
> >>> +
> >>
> >> Unless I am missing something, if gcc decides to not inline get_pvti,
> >> this will break, right? I believe you need to mark that function with
> >> __always_inline.
> > 
> > Can't see why. Please enlighten me.
> > 
> 
> I may be wrong, I haven't dealt with this vdso code for quite a while,
> so forgive me if my memory tricked me.
> 
> But wasn't it the case that vdso functions could not call functions in
> the kernel address space outside the mapped page? Or does this
> restriction only apply to accessing data?

Only code in the vdso pages is executed, and this particular function is
in the vdso, so it's fine.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
  2012-10-30  8:34   ` Glauber Costa
@ 2012-10-31  3:19     ` Marcelo Tosatti
  0 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31  3:19 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On Tue, Oct 30, 2012 at 12:34:25PM +0400, Glauber Costa wrote:
> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
> > KVM added a global variable to guarantee monotonicity in the guest. 
> > One of the reasons for that is that the time between
> > 
> > 	1. ktime_get_ts(&timespec);
> > 	2. rdtscll(tsc);
> > 
> > Is variable. That is, given a host with stable TSC, suppose that
> > two VCPUs read the same time via ktime_get_ts() above.
> > 
> > The time required to execute 2. is not the same on those two instances 
> > executing in different VCPUS (cache misses, interrupts...).
> > 
> > If the TSC value that is used by the host to interpolate when 
> > calculating the monotonic time is the same value used to calculate
> > the tsc_timestamp value stored in the pvclock data structure, and
> > a single <system_timestamp, tsc_timestamp> tuple is visible to all 
> > vcpus simultaneously, this problem disappears. See comment on top
> > of pvclock_update_vm_gtod_copy for details.
> > 
> > Monotonicity is then guaranteed by synchronicity of the host TSCs
> > and guest TSCs. 
> > 
> > Set TSC stable pvclock flag in that case, allowing the guest to read
> > clock from userspace.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> If you are using a master copy, with a stable host-side tsc, you can get
> rid of the normal REQ_CLOCK updates during vcpu load.

Yes. The updates are harmless and infrequent, though, so I'd rather not
touch this.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3)
  2012-10-24 13:13 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v2) Marcelo Tosatti
                   ` (17 preceding siblings ...)
  2012-10-24 13:13 ` [patch 18/18] KVM: x86: require matched TSC offsets for master clock Marcelo Tosatti
@ 2012-10-31 22:46 ` Marcelo Tosatti
  2012-10-31 22:46   ` [patch 01/16] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
                     ` (15 more replies)
  18 siblings, 16 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:46 UTC (permalink / raw)
  To: kvm; +Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini

This patchset, based on earlier work by Jeremy Fitzhardinge, implements
paravirtual clock vsyscall support.

It should be possible to implement Xen support relatively easily.

It reduces clock_gettime from 500 cycles to 200 cycles
on my testbox.

Please review.

From my POV, this is ready to merge.

v3:
- fix PVCLOCK_VSYSCALL_NR_PAGES definition (glommer)
- fold flags race fix into pvclock refactoring (avi)
- remove CONFIG_PARAVIRT_CLOCK_VSYSCALL (glommer)
- add reference to tsc.c from vclock_gettime.c about cycle_last rationale (glommer)
- fix whitespace damage (glommer)


v2:
- Do not allow visibility of different <system_timestamp, tsc_timestamp>
tuples.
- Add option to disable vsyscall.




^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 01/16] KVM: x86: retain pvclock guest stopped bit in guest memory
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
@ 2012-10-31 22:46   ` Marcelo Tosatti
  2012-11-01 10:39     ` Gleb Natapov
  2012-11-01 13:44     ` Glauber Costa
  2012-10-31 22:46   ` [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region Marcelo Tosatti
                     ` (14 subsequent siblings)
  15 siblings, 2 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:46 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: x86-kvm-retain-guest-stopped.patch --]
[-- Type: text/plain, Size: 1659 bytes --]

Otherwise it's possible for an unrelated KVM_REQ_CLOCK_UPDATE (such as one
due to CPU migration) to clear the bit.

Noticed by Paolo Bonzini.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1143,6 +1143,7 @@ static int kvm_guest_time_update(struct 
 	unsigned long this_tsc_khz;
 	s64 kernel_ns, max_kernel_ns;
 	u64 tsc_timestamp;
+	struct pvclock_vcpu_time_info *guest_hv_clock;
 	u8 pvclock_flags;
 
 	/* Keep irq disabled to prevent changes to the clock */
@@ -1226,13 +1227,6 @@ static int kvm_guest_time_update(struct 
 	vcpu->last_kernel_ns = kernel_ns;
 	vcpu->last_guest_tsc = tsc_timestamp;
 
-	pvclock_flags = 0;
-	if (vcpu->pvclock_set_guest_stopped_request) {
-		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
-		vcpu->pvclock_set_guest_stopped_request = false;
-	}
-
-	vcpu->hv_clock.flags = pvclock_flags;
 
 	/*
 	 * The interface expects us to write an even number signaling that the
@@ -1243,6 +1237,18 @@ static int kvm_guest_time_update(struct 
 
 	shared_kaddr = kmap_atomic(vcpu->time_page);
 
+	guest_hv_clock = shared_kaddr + vcpu->time_offset;
+
+	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
+	pvclock_flags = (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
+
+	if (vcpu->pvclock_set_guest_stopped_request) {
+		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
+		vcpu->pvclock_set_guest_stopped_request = false;
+	}
+
+	vcpu->hv_clock.flags = pvclock_flags;
+
 	memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
 	       sizeof(vcpu->hv_clock));
 



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
  2012-10-31 22:46   ` [patch 01/16] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
@ 2012-10-31 22:46   ` Marcelo Tosatti
  2012-11-01 11:48     ` Gleb Natapov
  2012-10-31 22:46   ` [patch 03/16] x86: pvclock: remove pvclock_shadow_time Marcelo Tosatti
                     ` (13 subsequent siblings)
  15 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:46 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 01-pvclock-read-rdtsc-barrier --]
[-- Type: text/plain, Size: 745 bytes --]

Originally from Jeremy Fitzhardinge.

pvclock_get_time_values, which contains the memory barriers,
will be removed by the next patch.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
 
 	do {
 		version = pvclock_get_time_values(&shadow, src);
-		barrier();
+		rdtsc_barrier();
 		offset = pvclock_get_nsec_offset(&shadow);
 		ret = shadow.system_timestamp + offset;
-		barrier();
+		rdtsc_barrier();
 	} while (version != src->version);
 
 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
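
For readers unfamiliar with the distinction: barrier() only constrains the
compiler, while rdtsc_barrier() also stops the CPU from speculating RDTSC
across it. A rough illustration of what it expands to (the kernel really
selects LFENCE or MFENCE at boot via the alternatives mechanism depending
on CPU vendor; this fixed-LFENCE version is an assumption for brevity):

static __always_inline void rdtsc_barrier_sketch(void)
{
	/* LFENCE orders RDTSC on Intel; AMD historically wanted MFENCE */
	asm volatile("lfence" ::: "memory");
}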



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 03/16] x86: pvclock: remove pvclock_shadow_time
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
  2012-10-31 22:46   ` [patch 01/16] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
  2012-10-31 22:46   ` [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region Marcelo Tosatti
@ 2012-10-31 22:46   ` Marcelo Tosatti
  2012-11-01 13:52     ` Glauber Costa
  2012-10-31 22:47   ` [patch 04/16] x86: pvclock: create helper for pvclock data retrieval Marcelo Tosatti
                     ` (12 subsequent siblings)
  15 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:46 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 02-pvclock-remove-shadow-time --]
[-- Type: text/plain, Size: 2979 bytes --]

Originally from Jeremy Fitzhardinge.

We can copy the information directly from "struct pvclock_vcpu_time_info", 
remove pvclock_shadow_time.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -19,21 +19,6 @@
 #include <linux/percpu.h>
 #include <asm/pvclock.h>
 
-/*
- * These are perodically updated
- *    xen: magic shared_info page
- *    kvm: gpa registered via msr
- * and then copied here.
- */
-struct pvclock_shadow_time {
-	u64 tsc_timestamp;     /* TSC at last update of time vals.  */
-	u64 system_timestamp;  /* Time, in nanosecs, since boot.    */
-	u32 tsc_to_nsec_mul;
-	int tsc_shift;
-	u32 version;
-	u8  flags;
-};
-
 static u8 valid_flags __read_mostly = 0;
 
 void pvclock_set_flags(u8 flags)
@@ -41,32 +26,11 @@ void pvclock_set_flags(u8 flags)
 	valid_flags = flags;
 }
 
-static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
+static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
 {
-	u64 delta = native_read_tsc() - shadow->tsc_timestamp;
-	return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
-				   shadow->tsc_shift);
-}
-
-/*
- * Reads a consistent set of time-base values from hypervisor,
- * into a shadow data area.
- */
-static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
-					struct pvclock_vcpu_time_info *src)
-{
-	do {
-		dst->version = src->version;
-		rmb();		/* fetch version before data */
-		dst->tsc_timestamp     = src->tsc_timestamp;
-		dst->system_timestamp  = src->system_time;
-		dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
-		dst->tsc_shift         = src->tsc_shift;
-		dst->flags             = src->flags;
-		rmb();		/* test version after fetching data */
-	} while ((src->version & 1) || (dst->version != src->version));
-
-	return dst->version;
+	u64 delta = native_read_tsc() - src->tsc_timestamp;
+	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
+				   src->tsc_shift);
 }
 
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
@@ -90,21 +54,22 @@ void pvclock_resume(void)
 
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
-	struct pvclock_shadow_time shadow;
 	unsigned version;
 	cycle_t ret, offset;
 	u64 last;
+	u8 flags;
 
 	do {
-		version = pvclock_get_time_values(&shadow, src);
+		version = src->version;
 		rdtsc_barrier();
-		offset = pvclock_get_nsec_offset(&shadow);
-		ret = shadow.system_timestamp + offset;
+		offset = pvclock_get_nsec_offset(src);
+		ret = src->system_time + offset;
+		flags = src->flags;
 		rdtsc_barrier();
-	} while (version != src->version);
+	} while ((src->version & 1) || version != src->version);
 
 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
-		(shadow.flags & PVCLOCK_TSC_STABLE_BIT))
+		(flags & PVCLOCK_TSC_STABLE_BIT))
 		return ret;
 
 	/*
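
The new loop condition encodes the usual seqlock-style convention: the
producer makes version odd while an update is in flight and even once it
is complete, so a reader must retry both on an odd version and on any
version change. Writer side, schematically (an illustration of the
protocol only, not KVM's exact update path, which prepares a private copy
and memcpy()s it out):

	dst->version++;		/* now odd: readers will retry */
	smp_wmb();
	/* ... update tsc_timestamp, system_time, mul, shift, flags ... */
	smp_wmb();
	dst->version++;		/* even again: readers may trust the data */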



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 04/16] x86: pvclock: create helper for pvclock data retrieval
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
                     ` (2 preceding siblings ...)
  2012-10-31 22:46   ` [patch 03/16] x86: pvclock: remove pvclock_shadow_time Marcelo Tosatti
@ 2012-10-31 22:47   ` Marcelo Tosatti
  2012-11-01 14:04     ` Glauber Costa
  2012-10-31 22:47   ` [patch 05/16] x86: pvclock: introduce helper to read flags Marcelo Tosatti
                     ` (11 subsequent siblings)
  15 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:47 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 03-move-pvread-to-pvheader --]
[-- Type: text/plain, Size: 2264 bytes --]

Originally from Jeremy Fitzhardinge.

So the code can be reused.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -26,13 +26,6 @@ void pvclock_set_flags(u8 flags)
 	valid_flags = flags;
 }
 
-static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
-{
-	u64 delta = native_read_tsc() - src->tsc_timestamp;
-	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
-				   src->tsc_shift);
-}
-
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
 {
 	u64 pv_tsc_khz = 1000000ULL << 32;
@@ -55,17 +48,12 @@ void pvclock_resume(void)
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
 	unsigned version;
-	cycle_t ret, offset;
+	cycle_t ret;
 	u64 last;
 	u8 flags;
 
 	do {
-		version = src->version;
-		rdtsc_barrier();
-		offset = pvclock_get_nsec_offset(src);
-		ret = src->system_time + offset;
-		flags = src->flags;
-		rdtsc_barrier();
+		version = __pvclock_read_cycles(src, &ret, &flags);
 	} while ((src->version & 1) || version != src->version);
 
 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -56,4 +56,32 @@ static inline u64 pvclock_scale_delta(u6
 	return product;
 }
 
+static __always_inline
+u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
+{
+	u64 delta = __native_read_tsc() - src->tsc_timestamp;
+	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
+				   src->tsc_shift);
+}
+
+static __always_inline
+unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
+			       cycle_t *cycles, u8 *flags)
+{
+	unsigned version;
+	cycle_t ret, offset;
+	u8 ret_flags;
+
+	version = src->version;
+	rdtsc_barrier();
+	offset = pvclock_get_nsec_offset(src);
+	ret = src->system_time + offset;
+	ret_flags = src->flags;
+	rdtsc_barrier();
+
+	*cycles = ret;
+	*flags = ret_flags;
+	return version;
+}
+
 #endif /* _ASM_X86_PVCLOCK_H */
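
The consumer this helper is being carved out for is the vDSO reader added
later in the series. Roughly (a simplified sketch: the per-cpu lookup and
the migration re-check the real patch needs are elided, cpu is assumed to
come from a getcpu-style lookup, and get_pvti() is the fixmap lookup from
a later patch):

notrace static cycle_t vread_pvclock(int *mode)
{
	const struct pvclock_vcpu_time_info *pvti = get_pvti(cpu);
	cycle_t ret;
	unsigned version;
	u8 flags;

	do {
		version = __pvclock_read_cycles(pvti, &ret, &flags);
	} while ((pvti->version & 1) || version != pvti->version);

	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
		*mode = VCLOCK_NONE;	/* force fallback to the syscall */

	return ret;
}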



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 05/16] x86: pvclock: introduce helper to read flags
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
                     ` (3 preceding siblings ...)
  2012-10-31 22:47   ` [patch 04/16] x86: pvclock: create helper for pvclock data retrieval Marcelo Tosatti
@ 2012-10-31 22:47   ` Marcelo Tosatti
  2012-11-01 14:07     ` Glauber Costa
  2012-10-31 22:47   ` [patch 06/16] sched: add notifier for cross-cpu migrations Marcelo Tosatti
                     ` (10 subsequent siblings)
  15 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:47 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 05-pvclock-add-get-flags --]
[-- Type: text/plain, Size: 1278 bytes --]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -45,6 +45,19 @@ void pvclock_resume(void)
 	atomic64_set(&last_value, 0);
 }
 
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src)
+{
+	unsigned version;
+	cycle_t ret;
+	u8 flags;
+
+	do {
+		version = __pvclock_read_cycles(src, &ret, &flags);
+	} while ((src->version & 1) || version != src->version);
+
+	return flags & valid_flags;
+}
+
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
 	unsigned version;
Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
 void pvclock_set_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
 void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
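
Intended use, mirroring what the kvmclock guest code does at boot later
in this series (hv_clock being kvmclock's percpu copy of the time info):

	struct pvclock_vcpu_time_info *vcpu_time = &get_cpu_var(hv_clock);
	u8 flags = pvclock_read_flags(vcpu_time);
	put_cpu_var(hv_clock);

	if (!(flags & PVCLOCK_TSC_STABLE_BIT))
		return 1;	/* TSC not stable: skip vsyscall clock setup */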



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 06/16] sched: add notifier for cross-cpu migrations
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
                     ` (4 preceding siblings ...)
  2012-10-31 22:47   ` [patch 05/16] x86: pvclock: introduce helper to read flags Marcelo Tosatti
@ 2012-10-31 22:47   ` Marcelo Tosatti
  2012-11-01 14:08     ` Glauber Costa
  2012-10-31 22:47   ` [patch 07/16] x86: pvclock: generic pvclock vsyscall initialization Marcelo Tosatti
                     ` (9 subsequent siblings)
  15 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:47 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 06-add-task-migration-notifier --]
[-- Type: text/plain, Size: 1732 bytes --]

Originally from Jeremy Fitzhardinge.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/include/linux/sched.h
===================================================================
--- vsyscall.orig/include/linux/sched.h
+++ vsyscall/include/linux/sched.h
@@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void)
 extern void calc_global_load(unsigned long ticks);
 extern void update_cpu_load_nohz(void);
 
+/* Notifier for when a task gets migrated to a new CPU */
+struct task_migration_notifier {
+	struct task_struct *task;
+	int from_cpu;
+	int to_cpu;
+};
+extern void register_task_migration_notifier(struct notifier_block *n);
+
 extern unsigned long get_parent_ip(unsigned long addr);
 
 struct seq_file;
Index: vsyscall/kernel/sched/core.c
===================================================================
--- vsyscall.orig/kernel/sched/core.c
+++ vsyscall/kernel/sched/core.c
@@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s
 		rq->skip_clock_update = 1;
 }
 
+static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
+
+void register_task_migration_notifier(struct notifier_block *n)
+{
+	atomic_notifier_chain_register(&task_migration_notifier, n);
+}
+
 #ifdef CONFIG_SMP
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
@@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p,
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+		struct task_migration_notifier tmn;
+
 		p->se.nr_migrations++;
 		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
+
+		tmn.task = p;
+		tmn.from_cpu = task_cpu(p);
+		tmn.to_cpu = new_cpu;
+
+		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
 	}
 
 	__set_task_cpu(p, new_cpu);
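
For illustration, a consumer registers a notifier_block whose callback
receives the task_migration_notifier above; the pvclock code added a
patch later follows exactly this shape (the names here are hypothetical):

static int my_migrate_notify(struct notifier_block *nb, unsigned long op,
			     void *v)
{
	struct task_migration_notifier *mn = v;

	/* mn->task is moving from mn->from_cpu to mn->to_cpu */
	return NOTIFY_DONE;
}

static struct notifier_block my_migrate_nb = {
	.notifier_call = my_migrate_notify,
};

	/* at init time: */
	register_task_migration_notifier(&my_migrate_nb);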



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 07/16] x86: pvclock: generic pvclock vsyscall initialization
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
                     ` (5 preceding siblings ...)
  2012-10-31 22:47   ` [patch 06/16] sched: add notifier for cross-cpu migrations Marcelo Tosatti
@ 2012-10-31 22:47   ` Marcelo Tosatti
  2012-11-01 14:19     ` Glauber Costa
  2012-10-31 22:47   ` [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR Marcelo Tosatti
                     ` (8 subsequent siblings)
  15 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:47 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 07-add-pvclock-structs-and-fixmap --]
[-- Type: text/plain, Size: 4534 bytes --]

Originally from Jeremy Fitzhardinge.

Introduce generic, non-hypervisor-specific pvclock initialization
routines.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/pvclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -17,6 +17,10 @@
 
 #include <linux/kernel.h>
 #include <linux/percpu.h>
+#include <linux/notifier.h>
+#include <linux/sched.h>
+#include <linux/gfp.h>
+#include <linux/bootmem.h>
 #include <asm/pvclock.h>
 
 static u8 valid_flags __read_mostly = 0;
@@ -122,3 +126,67 @@ void pvclock_read_wallclock(struct pvclo
 
 	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
+
+static aligned_pvti_t *pvclock_vdso_info;
+
+static struct pvclock_vsyscall_time_info *
+pvclock_get_vsyscall_user_time_info(int cpu)
+{
+	if (pvclock_vdso_info == NULL) {
+		BUG();
+		return NULL;
+	}
+
+	return &pvclock_vdso_info[cpu].info;
+}
+
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu)
+{
+	return &pvclock_get_vsyscall_user_time_info(cpu)->pvti;
+}
+
+int pvclock_task_migrate(struct notifier_block *nb, unsigned long l, void *v)
+{
+	struct task_migration_notifier *mn = v;
+	struct pvclock_vsyscall_time_info *pvti;
+
+	pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu);
+
+	if (pvti == NULL)
+		return NOTIFY_DONE;
+
+	pvti->migrate_count++;
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block pvclock_migrate = {
+	.notifier_call = pvclock_task_migrate,
+};
+
+/*
+ * Initialize the generic pvclock vsyscall state.  This will allocate
+ * page(s) for the per-vcpu pvclock information and set up a
+ * fixmap mapping for them.
+ */
+int __init pvclock_init_vsyscall(void)
+{
+	int idx;
+	unsigned int size = PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE;
+
+	pvclock_vdso_info = __alloc_bootmem(size, PAGE_SIZE, 0);
+	if (!pvclock_vdso_info)
+		return -ENOMEM;
+
+	memset(pvclock_vdso_info, 0, size);
+
+	for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
+		__set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
+			     __pa(pvclock_vdso_info) + (idx*PAGE_SIZE),
+			     PAGE_KERNEL_VVAR);
+	}
+
+	register_task_migration_notifier(&pvclock_migrate);
+
+	return 0;
+}
Index: vsyscall/arch/x86/include/asm/fixmap.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/fixmap.h
+++ vsyscall/arch/x86/include/asm/fixmap.h
@@ -19,6 +19,7 @@
 #include <asm/acpi.h>
 #include <asm/apicdef.h>
 #include <asm/page.h>
+#include <asm/pvclock.h>
 #ifdef CONFIG_X86_32
 #include <linux/threads.h>
 #include <asm/kmap_types.h>
@@ -81,6 +82,10 @@ enum fixed_addresses {
 	VVAR_PAGE,
 	VSYSCALL_HPET,
 #endif
+#ifdef CONFIG_PARAVIRT_CLOCK
+	PVCLOCK_FIXMAP_BEGIN,
+	PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
+#endif
 	FIX_DBGP_BASE,
 	FIX_EARLYCON_MEM_BASE,
 #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -13,6 +13,8 @@ void pvclock_read_wallclock(struct pvclo
 			    struct pvclock_vcpu_time_info *vcpu,
 			    struct timespec *ts);
 void pvclock_resume(void);
+int __init pvclock_init_vsyscall(void);
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
 
 /*
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
@@ -85,4 +87,17 @@ unsigned __pvclock_read_cycles(const str
 	return version;
 }
 
+struct pvclock_vsyscall_time_info {
+	struct pvclock_vcpu_time_info pvti;
+	u32 migrate_count;
+};
+
+typedef union {
+	struct pvclock_vsyscall_time_info info;
+	char pad[SMP_CACHE_BYTES];
+} aligned_pvti_t ____cacheline_aligned;
+
+#define PVTI_SIZE sizeof(aligned_pvti_t)
+#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1)
+
 #endif /* _ASM_X86_PVCLOCK_H */
Index: vsyscall/arch/x86/include/asm/clocksource.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/clocksource.h
+++ vsyscall/arch/x86/include/asm/clocksource.h
@@ -8,6 +8,7 @@
 #define VCLOCK_NONE 0  /* No vDSO clock available.	*/
 #define VCLOCK_TSC  1  /* vDSO should use vread_tsc.	*/
 #define VCLOCK_HPET 2  /* vDSO should use vread_hpet.	*/
+#define VCLOCK_PVCLOCK 3 /* vDSO should use vread_pvclock. */
 
 struct arch_clocksource_data {
 	int vclock_mode;
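
A quick sanity check of the sizing arithmetic above, assuming 4096-byte
pages and SMP_CACHE_BYTES == 64 (so the padded union gives PVTI_SIZE == 64):

	entries per page = PAGE_SIZE / PVTI_SIZE = 4096 / 64 = 64
	NR_CPUS = 64  -> PVCLOCK_VSYSCALL_NR_PAGES = ((64  - 1) / 64) + 1 = 1
	NR_CPUS = 256 -> PVCLOCK_VSYSCALL_NR_PAGES = ((256 - 1) / 64) + 1 = 4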



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
                     ` (6 preceding siblings ...)
  2012-10-31 22:47   ` [patch 07/16] x86: pvclock: generic pvclock vsyscall initialization Marcelo Tosatti
@ 2012-10-31 22:47   ` Marcelo Tosatti
  2012-11-01 14:28     ` Glauber Costa
  2012-10-31 22:47   ` [patch 09/16] x86: kvm guest: pvclock vsyscall support Marcelo Tosatti
                     ` (7 subsequent siblings)
  15 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:47 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 11-host-add-userspace-time-msr --]
[-- Type: text/plain, Size: 9580 bytes --]

Allow a guest to register a second location for the VCPU time info
structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
This is intended to allow the guest kernel to map this information
into a usermode accessible page, so that usermode can efficiently
calculate system time from the TSC without having to make a syscall.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/include/asm/kvm_para.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_para.h
+++ vsyscall/arch/x86/include/asm/kvm_para.h
@@ -23,6 +23,7 @@
 #define KVM_FEATURE_ASYNC_PF		4
 #define KVM_FEATURE_STEAL_TIME		5
 #define KVM_FEATURE_PV_EOI		6
+#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -39,6 +40,7 @@
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME  0x4b564d03
 #define MSR_KVM_PV_EOI_EN      0x4b564d04
+#define MSR_KVM_USERSPACE_TIME      0x4b564d05
 
 struct kvm_steal_time {
 	__u64 steal;
Index: vsyscall/Documentation/virtual/kvm/msr.txt
===================================================================
--- vsyscall.orig/Documentation/virtual/kvm/msr.txt
+++ vsyscall/Documentation/virtual/kvm/msr.txt
@@ -125,6 +125,22 @@ MSR_KVM_SYSTEM_TIME_NEW:  0x4b564d01
 	Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
 	leaf prior to usage.
 
+MSR_KVM_USERSPACE_TIME:  0x4b564d05
+
+Allow a guest to register a second location for the VCPU time info
+structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
+This is intended to allow the guest kernel to map this information
+into a usermode accessible page, so that usermode can efficiently
+calculate system time from the TSC without having to make a syscall.
+
+Relationship with master copy (MSR_KVM_SYSTEM_TIME_NEW):
+
+- This MSR must be enabled only when the master is enabled.
+- Disabling updates to the master automatically disables
+updates for this copy.
+
+Availability of this MSR must be checked via bit 7 in 0x4000001 cpuid
+leaf prior to usage.
 
 MSR_KVM_WALL_CLOCK:  0x11
 
Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -415,10 +415,13 @@ struct kvm_vcpu_arch {
 	int (*complete_userspace_io)(struct kvm_vcpu *vcpu);
 
 	gpa_t time;
+	gpa_t uspace_time;
 	struct pvclock_vcpu_time_info hv_clock;
 	unsigned int hw_tsc_khz;
 	unsigned int time_offset;
+	unsigned int uspace_time_offset;
 	struct page *time_page;
+	struct page *uspace_time_page;
 	/* set guest stopped flag in pvclock flags field */
 	bool pvclock_set_guest_stopped_request;
 
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -809,13 +809,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	10
+#define KVM_SAVE_MSRS_BEGIN	11
 static u32 msrs_to_save[] = {
 	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
 	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
 	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
 	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
-	MSR_KVM_PV_EOI_EN,
+	MSR_KVM_PV_EOI_EN, MSR_KVM_USERSPACE_TIME,
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 	MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1135,16 +1135,43 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
+static void kvm_write_pvtime(struct kvm_vcpu *v, struct page *page,
+			     unsigned int offset_in_page, gpa_t gpa)
+{
+	struct kvm_vcpu_arch *vcpu = &v->arch;
+	void *shared_kaddr;
+	struct pvclock_vcpu_time_info *guest_hv_clock;
+	u8 pvclock_flags;
+
+	shared_kaddr = kmap_atomic(page);
+
+	guest_hv_clock = shared_kaddr + offset_in_page;
+
+	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
+	pvclock_flags = (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
+
+	if (vcpu->pvclock_set_guest_stopped_request) {
+		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
+		vcpu->pvclock_set_guest_stopped_request = false;
+	}
+
+	vcpu->hv_clock.flags = pvclock_flags;
+
+	memcpy(shared_kaddr + offset_in_page, &vcpu->hv_clock,
+	       sizeof(vcpu->hv_clock));
+
+	kunmap_atomic(shared_kaddr);
+
+	mark_page_dirty(v->kvm, gpa >> PAGE_SHIFT);
+}
+
 static int kvm_guest_time_update(struct kvm_vcpu *v)
 {
 	unsigned long flags;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
-	void *shared_kaddr;
 	unsigned long this_tsc_khz;
 	s64 kernel_ns, max_kernel_ns;
 	u64 tsc_timestamp;
-	struct pvclock_vcpu_time_info *guest_hv_clock;
-	u8 pvclock_flags;
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
@@ -1235,26 +1262,11 @@ static int kvm_guest_time_update(struct 
 	 */
 	vcpu->hv_clock.version += 2;
 
-	shared_kaddr = kmap_atomic(vcpu->time_page);
-
-	guest_hv_clock = shared_kaddr + vcpu->time_offset;
-
-	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
-	pvclock_flags = (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
+	kvm_write_pvtime(v, vcpu->time_page, vcpu->time_offset, vcpu->time);
+	if (vcpu->uspace_time_page)
+		kvm_write_pvtime(v, vcpu->uspace_time_page,
+				 vcpu->uspace_time_offset, vcpu->uspace_time);
 
-	if (vcpu->pvclock_set_guest_stopped_request) {
-		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
-		vcpu->pvclock_set_guest_stopped_request = false;
-	}
-
-	vcpu->hv_clock.flags = pvclock_flags;
-
-	memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
-	       sizeof(vcpu->hv_clock));
-
-	kunmap_atomic(shared_kaddr);
-
-	mark_page_dirty(v->kvm, vcpu->time >> PAGE_SHIFT);
 	return 0;
 }
 
@@ -1549,6 +1561,15 @@ static void kvmclock_reset(struct kvm_vc
 	}
 }
 
+static void kvmclock_uspace_reset(struct kvm_vcpu *vcpu)
+{
+	vcpu->arch.uspace_time = 0;
+	if (vcpu->arch.uspace_time_page) {
+		kvm_release_page_dirty(vcpu->arch.uspace_time_page);
+		vcpu->arch.uspace_time_page = NULL;
+	}
+}
+
 static void accumulate_steal_time(struct kvm_vcpu *vcpu)
 {
 	u64 delta;
@@ -1639,6 +1660,31 @@ int kvm_set_msr_common(struct kvm_vcpu *
 		vcpu->kvm->arch.wall_clock = data;
 		kvm_write_wall_clock(vcpu->kvm, data);
 		break;
+	case MSR_KVM_USERSPACE_TIME: {
+		kvmclock_uspace_reset(vcpu);
+
+		if (!vcpu->arch.time_page && (data & 1))
+			return 1;
+
+		vcpu->arch.uspace_time = data;
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+
+		/* we verify if the enable bit is set... */
+		if (!(data & 1))
+			break;
+
+		/* ...but clean it before doing the actual write */
+		vcpu->arch.uspace_time_offset = data & ~(PAGE_MASK | 1);
+
+		vcpu->arch.uspace_time_page = gfn_to_page(vcpu->kvm,
+							  data >> PAGE_SHIFT);
+
+		if (is_error_page(vcpu->arch.uspace_time_page)) {
+			kvm_release_page_clean(vcpu->arch.uspace_time_page);
+			vcpu->arch.uspace_time_page = NULL;
+		}
+		break;
+	}
 	case MSR_KVM_SYSTEM_TIME_NEW:
 	case MSR_KVM_SYSTEM_TIME: {
 		kvmclock_reset(vcpu);
@@ -1647,8 +1693,10 @@ int kvm_set_msr_common(struct kvm_vcpu *
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 
 		/* we verify if the enable bit is set... */
-		if (!(data & 1))
+		if (!(data & 1)) {
+			kvmclock_uspace_reset(vcpu);
 			break;
+		}
 
 		/* ...but clean it before doing the actual write */
 		vcpu->arch.time_offset = data & ~(PAGE_MASK | 1);
@@ -1656,8 +1704,10 @@ int kvm_set_msr_common(struct kvm_vcpu *
 		vcpu->arch.time_page =
 				gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
 
-		if (is_error_page(vcpu->arch.time_page))
+		if (is_error_page(vcpu->arch.time_page)) {
 			vcpu->arch.time_page = NULL;
+			kvmclock_uspace_reset(vcpu);
+		}
 
 		break;
 	}
@@ -2010,6 +2060,9 @@ int kvm_get_msr_common(struct kvm_vcpu *
 	case MSR_KVM_SYSTEM_TIME_NEW:
 		data = vcpu->arch.time;
 		break;
+	case MSR_KVM_USERSPACE_TIME:
+		data = vcpu->arch.uspace_time;
+		break;
 	case MSR_KVM_ASYNC_PF_EN:
 		data = vcpu->arch.apf.msr_val;
 		break;
@@ -2195,6 +2248,7 @@ int kvm_dev_ioctl_check_extension(long e
 	case KVM_CAP_KVMCLOCK_CTRL:
 	case KVM_CAP_READONLY_MEM:
 	case KVM_CAP_IRQFD_RESAMPLE:
+	case KVM_CAP_USERSPACE_CLOCKSOURCE:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -6017,6 +6071,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *
 
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 {
+	kvmclock_uspace_reset(vcpu);
 	kvmclock_reset(vcpu);
 
 	free_cpumask_var(vcpu->arch.wbinvd_dirty_mask);
Index: vsyscall/arch/x86/kvm/cpuid.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/cpuid.c
+++ vsyscall/arch/x86/kvm/cpuid.c
@@ -411,7 +411,9 @@ static int do_cpuid_ent(struct kvm_cpuid
 			     (1 << KVM_FEATURE_CLOCKSOURCE2) |
 			     (1 << KVM_FEATURE_ASYNC_PF) |
 			     (1 << KVM_FEATURE_PV_EOI) |
-			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
+			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+			     (1 << KVM_FEATURE_USERSPACE_CLOCKSOURCE);
+
 
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
Index: vsyscall/include/uapi/linux/kvm.h
===================================================================
--- vsyscall.orig/include/uapi/linux/kvm.h
+++ vsyscall/include/uapi/linux/kvm.h
@@ -626,6 +626,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_READONLY_MEM 81
 #endif
 #define KVM_CAP_IRQFD_RESAMPLE 82
+#define KVM_CAP_USERSPACE_CLOCKSOURCE 83
 
 #ifdef KVM_CAP_IRQ_ROUTING
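
The registration encoding the MSR handler above parses is: the guest
physical address of the time info structure, with bit 0 doubling as the
enable bit. Guest side, schematically (wrmsrl used for brevity; the guest
patch that follows uses native_write_msr_safe so a failed write can be
detected):

	u64 pa = __pa(pvti);			/* guest physical address */

	wrmsrl(MSR_KVM_USERSPACE_TIME, pa | 1);	/* register and enable */
	/* ... */
	wrmsrl(MSR_KVM_USERSPACE_TIME, 0);	/* disable, e.g. on shutdown */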
 



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 09/16] x86: kvm guest: pvclock vsyscall support
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
                     ` (7 preceding siblings ...)
  2012-10-31 22:47   ` [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR Marcelo Tosatti
@ 2012-10-31 22:47   ` Marcelo Tosatti
  2012-11-02  9:42     ` Glauber Costa
  2012-10-31 22:47   ` [patch 10/16] x86: vdso: pvclock gettime support Marcelo Tosatti
                     ` (6 subsequent siblings)
  15 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:47 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 08-add-pvclock-vsyscall-kvm-support --]
[-- Type: text/plain, Size: 4641 bytes --]

Allow the hypervisor to update the userspace-visible copy of
the pvclock data.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kernel/kvmclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/kvmclock.c
+++ vsyscall/arch/x86/kernel/kvmclock.c
@@ -31,6 +31,9 @@ static int kvmclock = 1;
 static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
 static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
 
+/* set when the generic vsyscall pvclock elements are set up */
+bool vsyscall_clock_initializable = false;
+
 static int parse_no_kvmclock(char *arg)
 {
 	kvmclock = 0;
@@ -151,6 +154,24 @@ int kvm_register_clock(char *txt)
 	return ret;
 }
 
+static int kvm_register_vsyscall_clock(char *txt)
+{
+	int cpu = smp_processor_id();
+	int low, high, ret;
+	struct pvclock_vcpu_time_info *info;
+
+	info = pvclock_get_vsyscall_time_info(cpu);
+
+	low = (int)__pa(info) | 1;
+	high = ((u64)__pa(info) >> 32);
+	ret = native_write_msr_safe(MSR_KVM_USERSPACE_TIME, low, high);
+	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
+	       cpu, high, low, txt);
+
+	return ret;
+}
+
 static void kvm_save_sched_clock_state(void)
 {
 }
@@ -158,6 +179,8 @@ static void kvm_save_sched_clock_state(v
 static void kvm_restore_sched_clock_state(void)
 {
 	kvm_register_clock("primary cpu clock, resume");
+	if (vsyscall_clock_initializable)
+		kvm_register_vsyscall_clock("primary cpu vsyscall clock, resume");
 }
 
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -168,6 +191,8 @@ static void __cpuinit kvm_setup_secondar
 	 * we shouldn't fail.
 	 */
 	WARN_ON(kvm_register_clock("secondary cpu clock"));
+	if (vsyscall_clock_initializable)
+		kvm_register_vsyscall_clock("secondary cpu vsyscall clock");
 }
 #endif
 
@@ -182,6 +207,8 @@ static void __cpuinit kvm_setup_secondar
 #ifdef CONFIG_KEXEC
 static void kvm_crash_shutdown(struct pt_regs *regs)
 {
+	if (vsyscall_clock_initializable)
+		native_write_msr(MSR_KVM_USERSPACE_TIME, 0, 0);
 	native_write_msr(msr_kvm_system_time, 0, 0);
 	kvm_disable_steal_time();
 	native_machine_crash_shutdown(regs);
@@ -190,6 +217,8 @@ static void kvm_crash_shutdown(struct pt
 
 static void kvm_shutdown(void)
 {
+	if (vsyscall_clock_initializable)
+		native_write_msr(MSR_KVM_USERSPACE_TIME, 0, 0);
 	native_write_msr(msr_kvm_system_time, 0, 0);
 	kvm_disable_steal_time();
 	native_machine_shutdown();
@@ -233,3 +262,25 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 }
+
+int kvm_setup_vsyscall_timeinfo(void)
+{
+	int ret;
+	struct pvclock_vcpu_time_info *vcpu_time;
+	u8 flags;
+
+	vcpu_time = &get_cpu_var(hv_clock);
+	flags = pvclock_read_flags(vcpu_time);
+	put_cpu_var(hv_clock);
+
+	if (!(flags & PVCLOCK_TSC_STABLE_BIT))
+		return 1;
+
+	if ((ret = pvclock_init_vsyscall()))
+		return ret;
+
+	kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
+	vsyscall_clock_initializable = true;
+	return 0;
+}
+
Index: vsyscall/arch/x86/kernel/kvm.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/kvm.c
+++ vsyscall/arch/x86/kernel/kvm.c
@@ -42,6 +42,7 @@
 #include <asm/apic.h>
 #include <asm/apicdef.h>
 #include <asm/hypervisor.h>
+#include <asm/kvm_guest.h>
 
 static int kvmapf = 1;
 
@@ -62,6 +63,15 @@ static int parse_no_stealacc(char *arg)
 
 early_param("no-steal-acc", parse_no_stealacc);
 
+static int kvmclock_vsyscall = 1;
+static int parse_no_kvmclock_vsyscall(char *arg)
+{
+	kvmclock_vsyscall = 0;
+	return 0;
+}
+
+early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
+
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
@@ -468,6 +478,10 @@ void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
 		apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
+	if (kvm_para_has_feature(KVM_FEATURE_USERSPACE_CLOCKSOURCE)
+	    && kvmclock_vsyscall)
+		kvm_setup_vsyscall_timeinfo();
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
 	register_cpu_notifier(&kvm_cpu_notifier);
Index: vsyscall/arch/x86/include/asm/kvm_guest.h
===================================================================
--- /dev/null
+++ vsyscall/arch/x86/include/asm/kvm_guest.h
@@ -0,0 +1,8 @@
+#ifndef _ASM_X86_KVM_GUEST_H
+#define _ASM_X86_KVM_GUEST_H
+
+extern bool vsyscall_clock_initializable;
+
+int kvm_setup_vsyscall_timeinfo(void);
+
+#endif /* _ASM_X86_KVM_GUEST_H */



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 10/16] x86: vdso: pvclock gettime support
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
                     ` (8 preceding siblings ...)
  2012-10-31 22:47   ` [patch 09/16] x86: kvm guest: pvclock vsyscall support Marcelo Tosatti
@ 2012-10-31 22:47   ` Marcelo Tosatti
  2012-11-01 14:41     ` Glauber Costa
  2012-11-14 10:42     ` Gleb Natapov
  2012-10-31 22:47   ` [patch 11/16] KVM: x86: pass host_tsc to read_l1_tsc Marcelo Tosatti
                     ` (5 subsequent siblings)
  15 siblings, 2 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:47 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 10-add-pvclock-vdso-code --]
[-- Type: text/plain, Size: 4998 bytes --]

Improve the performance of time system calls when using Linux pvclock
by reading time info from the fixmap-visible copy of pvclock data.

Originally from Jeremy Fitzhardinge.
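
The get_pvti() lookup added to the vdso below is plain index
arithmetic over the fixmap pages holding the per-cpu time info. As
an illustration, if PAGE_SIZE is 4096 and PVTI_SIZE were 64 bytes
(an assumed value, not taken from this series), each fixmap page
would hold 64 entries, and cpu 130 would resolve to page
PVCLOCK_FIXMAP_BEGIN + 2, slot 2 within that page.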

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/vdso/vclock_gettime.c
===================================================================
--- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
+++ vsyscall/arch/x86/vdso/vclock_gettime.c
@@ -22,6 +22,7 @@
 #include <asm/hpet.h>
 #include <asm/unistd.h>
 #include <asm/io.h>
+#include <asm/pvclock.h>
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
@@ -62,6 +63,70 @@ static notrace cycle_t vread_hpet(void)
 	return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
 }
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+
+static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
+{
+	const aligned_pvti_t *pvti_base;
+	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
+	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
+
+	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
+
+	pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
+
+	return &pvti_base[offset].info;
+}
+
+static notrace cycle_t vread_pvclock(int *mode)
+{
+	const struct pvclock_vsyscall_time_info *pvti;
+	cycle_t ret;
+	u64 last;
+	u32 version;
+	u32 migrate_count;
+	u8 flags;
+	unsigned cpu, cpu1;
+
+
+	/*
+	 * When looping to get a consistent (time-info, tsc) pair, we
+	 * also need to deal with the possibility we can switch vcpus,
+	 * so make sure we always re-fetch time-info for the current vcpu.
+	 */
+	do {
+		cpu = __getcpu() & VGETCPU_CPU_MASK;
+		pvti = get_pvti(cpu);
+
+		migrate_count = pvti->migrate_count;
+
+		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
+
+		/*
+		 * Test we're still on the cpu as well as the version.
+		 * We could have been migrated just after the first
+		 * vgetcpu but before fetching the version, so we
+		 * wouldn't notice a version change.
+		 */
+		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
+	} while (unlikely(cpu != cpu1 ||
+			  (pvti->pvti.version & 1) ||
+			  pvti->pvti.version != version ||
+			  pvti->migrate_count != migrate_count));
+
+	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+		*mode = VCLOCK_NONE;
+
+	/* refer to tsc.c read_tsc() comment for rationale */
+	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
+
+	if (likely(ret >= last))
+		return ret;
+
+	return last;
+}
+#endif
+
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
 	long ret;
@@ -80,7 +145,7 @@ notrace static long vdso_fallback_gtod(s
 }
 
 
-notrace static inline u64 vgetsns(void)
+notrace static inline u64 vgetsns(int *mode)
 {
 	long v;
 	cycles_t cycles;
@@ -88,6 +153,8 @@ notrace static inline u64 vgetsns(void)
 		cycles = vread_tsc();
 	else if (gtod->clock.vclock_mode == VCLOCK_HPET)
 		cycles = vread_hpet();
+	else if (gtod->clock.vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
 	else
 		return 0;
 	v = (cycles - gtod->clock.cycle_last) & gtod->clock.mask;
@@ -107,7 +174,7 @@ notrace static int __always_inline do_re
 		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->wall_time_sec;
 		ns = gtod->wall_time_snsec;
-		ns += vgetsns();
+		ns += vgetsns(&mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 
@@ -127,7 +194,7 @@ notrace static int do_monotonic(struct t
 		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->monotonic_time_sec;
 		ns = gtod->monotonic_time_snsec;
-		ns += vgetsns();
+		ns += vgetsns(&mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 	timespec_add_ns(ts, ns);
Index: vsyscall/arch/x86/include/asm/vsyscall.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/vsyscall.h
+++ vsyscall/arch/x86/include/asm/vsyscall.h
@@ -33,6 +33,23 @@ extern void map_vsyscall(void);
  */
 extern bool emulate_vsyscall(struct pt_regs *regs, unsigned long address);
 
+#define VGETCPU_CPU_MASK 0xfff
+
+static inline unsigned int __getcpu(void)
+{
+	unsigned int p;
+
+	if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
+		/* Load per CPU data from RDTSCP */
+		native_read_tscp(&p);
+	} else {
+		/* Load per CPU data from GDT */
+		asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
+	}
+
+	return p;
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_VSYSCALL_H */
Index: vsyscall/arch/x86/vdso/vgetcpu.c
===================================================================
--- vsyscall.orig/arch/x86/vdso/vgetcpu.c
+++ vsyscall/arch/x86/vdso/vgetcpu.c
@@ -17,15 +17,10 @@ __vdso_getcpu(unsigned *cpu, unsigned *n
 {
 	unsigned int p;
 
-	if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
-		/* Load per CPU data from RDTSCP */
-		native_read_tscp(&p);
-	} else {
-		/* Load per CPU data from GDT */
-		asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
-	}
+	p = __getcpu();
+
 	if (cpu)
-		*cpu = p & 0xfff;
+		*cpu = p & VGETCPU_CPU_MASK;
 	if (node)
 		*node = p >> 12;
 	return 0;



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 11/16] KVM: x86: pass host_tsc to read_l1_tsc
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
                     ` (9 preceding siblings ...)
  2012-10-31 22:47   ` [patch 10/16] x86: vdso: pvclock gettime support Marcelo Tosatti
@ 2012-10-31 22:47   ` Marcelo Tosatti
  2012-10-31 22:47   ` [patch 12/16] time: export time information for KVM pvclock Marcelo Tosatti
                     ` (4 subsequent siblings)
  15 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:47 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 12-kvm-read-l1-tsc-pass-tscvalue --]
[-- Type: text/plain, Size: 3372 bytes --]

Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().
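
This is groundwork for the master clock patches later in the series:
kvm_guest_time_update() will want to pair a kernel_ns value and a TSC
value taken from a single shared tuple, so the TSC read must be
supplied by the caller instead of being issued inside each
read_l1_tsc() implementation.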

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -703,7 +703,7 @@ struct kvm_x86_ops {
 	void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
 	u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
-	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
+	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
 
 	void (*get_exit_info)(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2);
 
Index: vsyscall/arch/x86/kvm/lapic.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/lapic.c
+++ vsyscall/arch/x86/kvm/lapic.c
@@ -1011,7 +1011,7 @@ static void start_apic_timer(struct kvm_
 		local_irq_save(flags);
 
 		now = apic->lapic_timer.timer.base->get_time();
-		guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+		guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu, native_read_tsc());
 		if (likely(tscdeadline > guest_tsc)) {
 			ns = (tscdeadline - guest_tsc) * 1000000ULL;
 			do_div(ns, this_tsc_khz);
Index: vsyscall/arch/x86/kvm/svm.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/svm.c
+++ vsyscall/arch/x86/kvm/svm.c
@@ -3008,11 +3008,11 @@ static int cr8_write_interception(struct
 	return 0;
 }
 
-u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
+u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
 	struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu));
 	return vmcb->control.tsc_offset +
-		svm_scale_tsc(vcpu, native_read_tsc());
+		svm_scale_tsc(vcpu, host_tsc);
 }
 
 static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
Index: vsyscall/arch/x86/kvm/vmx.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/vmx.c
+++ vsyscall/arch/x86/kvm/vmx.c
@@ -1839,11 +1839,10 @@ static u64 guest_read_tsc(void)
  * Like guest_read_tsc, but always returns L1's notion of the timestamp
  * counter, even if a nested guest (L2) is currently running.
  */
-u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
+u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
-	u64 host_tsc, tsc_offset;
+	u64 tsc_offset;
 
-	rdtscll(host_tsc);
 	tsc_offset = is_guest_mode(vcpu) ?
 		to_vmx(vcpu)->nested.vmcs01_tsc_offset :
 		vmcs_read64(TSC_OFFSET);
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1175,7 +1175,7 @@ static int kvm_guest_time_update(struct 
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
+	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, native_read_tsc());
 	kernel_ns = get_kernel_ns();
 	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 	if (unlikely(this_tsc_khz == 0)) {
@@ -5429,7 +5429,8 @@ static int vcpu_enter_guest(struct kvm_v
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();
 
-	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu,
+							   native_read_tsc());
 
 	vcpu->mode = OUTSIDE_GUEST_MODE;
 	smp_wmb();



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 12/16] time: export time information for KVM pvclock
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
                     ` (10 preceding siblings ...)
  2012-10-31 22:47   ` [patch 11/16] KVM: x86: pass host_tsc to read_l1_tsc Marcelo Tosatti
@ 2012-10-31 22:47   ` Marcelo Tosatti
  2012-10-31 22:47   ` [patch 13/16] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
                     ` (3 subsequent siblings)
  15 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:47 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 13-time-add-pvclock-gtod-data --]
[-- Type: text/plain, Size: 3841 bytes --]

As suggested by John, export time data similarly to how it's done
by vsyscall support. This allows KVM to retrieve the information
necessary to implement vsyscall support in KVM guests.
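
A minimal sketch of a consumer of this interface (hypothetical
names; KVM registers a real notifier in a later patch of this
series):

	#include <linux/pvclock_gtod.h>

	static int example_gtod_notify(struct notifier_block *nb,
				       unsigned long unused, void *unused2)
	{
		/* pvclock_gtod_data was just updated under its seqcount */
		return 0;
	}

	static struct notifier_block example_gtod_nb = {
		.notifier_call = example_gtod_notify,
	};

	/* at init time: */
	pvclock_gtod_register_notifier(&example_gtod_nb);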

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/include/linux/pvclock_gtod.h
===================================================================
--- /dev/null
+++ vsyscall/include/linux/pvclock_gtod.h
@@ -0,0 +1,27 @@
+#ifndef _PVCLOCK_GTOD_H
+#define _PVCLOCK_GTOD_H
+
+#include <linux/clocksource.h>
+#include <linux/notifier.h>
+
+struct pvclock_gtod_data {
+	seqcount_t	seq;
+
+	struct { /* extract of a clocksource struct */
+		int vclock_mode;
+		cycle_t	cycle_last;
+		cycle_t	mask;
+		u32	mult;
+		u32	shift;
+	} clock;
+
+	/* open coded 'struct timespec' */
+	u64		monotonic_time_snsec;
+	time_t		monotonic_time_sec;
+};
+extern struct pvclock_gtod_data pvclock_gtod_data;
+
+extern int pvclock_gtod_register_notifier(struct notifier_block *nb);
+extern int pvclock_gtod_unregister_notifier(struct notifier_block *nb);
+
+#endif /* _PVCLOCK_GTOD_H */
Index: vsyscall/kernel/time/timekeeping.c
===================================================================
--- vsyscall.orig/kernel/time/timekeeping.c
+++ vsyscall/kernel/time/timekeeping.c
@@ -21,6 +21,7 @@
 #include <linux/time.h>
 #include <linux/tick.h>
 #include <linux/stop_machine.h>
+#include <linux/pvclock_gtod.h>
 
 
 static struct timekeeper timekeeper;
@@ -180,6 +181,79 @@ static inline s64 timekeeping_get_ns_raw
 	return nsec + arch_gettimeoffset();
 }
 
+static RAW_NOTIFIER_HEAD(pvclock_gtod_chain);
+
+/**
+ * pvclock_gtod_register_notifier - register a pvclock timedata update listener
+ *
+ * Must hold write on timekeeper.lock
+ */
+int pvclock_gtod_register_notifier(struct notifier_block *nb)
+{
+	struct timekeeper *tk = &timekeeper;
+	unsigned long flags;
+	int ret;
+
+	write_seqlock_irqsave(&tk->lock, flags);
+	ret = raw_notifier_chain_register(&pvclock_gtod_chain, nb);
+	write_sequnlock_irqrestore(&tk->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pvclock_gtod_register_notifier);
+
+/**
+ * pvclock_gtod_unregister_notifier - unregister a pvclock
+ * timedata update listener
+ *
+ * Must hold write on timekeeper.lock
+ */
+int pvclock_gtod_unregister_notifier(struct notifier_block *nb)
+{
+	struct timekeeper *tk = &timekeeper;
+	unsigned long flags;
+	int ret;
+
+	write_seqlock_irqsave(&tk->lock, flags);
+	ret = raw_notifier_chain_unregister(&pvclock_gtod_chain, nb);
+	write_sequnlock_irqrestore(&tk->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pvclock_gtod_unregister_notifier);
+
+struct pvclock_gtod_data pvclock_gtod_data;
+EXPORT_SYMBOL_GPL(pvclock_gtod_data);
+
+static void update_pvclock_gtod(struct timekeeper *tk)
+{
+	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
+
+	write_seqcount_begin(&vdata->seq);
+
+	/* copy pvclock gtod data */
+	vdata->clock.vclock_mode	= tk->clock->archdata.vclock_mode;
+	vdata->clock.cycle_last		= tk->clock->cycle_last;
+	vdata->clock.mask		= tk->clock->mask;
+	vdata->clock.mult		= tk->mult;
+	vdata->clock.shift		= tk->shift;
+
+	vdata->monotonic_time_sec	= tk->xtime_sec
+					+ tk->wall_to_monotonic.tv_sec;
+	vdata->monotonic_time_snsec	= tk->xtime_nsec
+					+ (tk->wall_to_monotonic.tv_nsec
+						<< tk->shift);
+	while (vdata->monotonic_time_snsec >=
+					(((u64)NSEC_PER_SEC) << tk->shift)) {
+		vdata->monotonic_time_snsec -=
+					((u64)NSEC_PER_SEC) << tk->shift;
+		vdata->monotonic_time_sec++;
+	}
+
+	write_seqcount_end(&vdata->seq);
+	raw_notifier_call_chain(&pvclock_gtod_chain, 0, NULL);
+}
+
 /* must hold write on timekeeper.lock */
 static void timekeeping_update(struct timekeeper *tk, bool clearntp)
 {
@@ -188,6 +262,7 @@ static void timekeeping_update(struct ti
 		ntp_clear();
 	}
 	update_vsyscall(tk);
+	update_pvclock_gtod(tk);
 }
 
 /**



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 13/16] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
                     ` (11 preceding siblings ...)
  2012-10-31 22:47   ` [patch 12/16] time: export time information for KVM pvclock Marcelo Tosatti
@ 2012-10-31 22:47   ` Marcelo Tosatti
  2012-10-31 22:47   ` [patch 14/16] KVM: x86: notifier for clocksource changes Marcelo Tosatti
                     ` (2 subsequent siblings)
  15 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:47 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 14-host-pass-stable-pvclock-flag --]
[-- Type: text/plain, Size: 13339 bytes --]

KVM added a global variable to guarantee monotonicity in the guest. 
One of the reasons for that is that the time between

	1. ktime_get_ts(&timespec);
	2. rdtscll(tsc);

is variable. That is, given a host with a stable TSC, suppose that
two VCPUs read the same time via ktime_get_ts() above.

The time required to execute 2. is not the same on those two instances
executing on different VCPUs (cache misses, interrupts...).

If the TSC value that is used by the host to interpolate when 
calculating the monotonic time is the same value used to calculate
the tsc_timestamp value stored in the pvclock data structure, and
a single <system_timestamp, tsc_timestamp> tuple is visible to all 
vcpus simultaneously, this problem disappears. See comment on top
of pvclock_update_vm_gtod_copy for details.

Monotonicity is then guaranteed by synchronicity of the host TSCs
and guest TSCs. 

Set TSC stable pvclock flag in that case, allowing the guest to read
clock from userspace.
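
As a concrete illustration of the M < N condition (numbers invented
for the example, with 1 TSC cycle = 1ns): let timespec0 = 100ns at
tsc0 = 0, and suppose VCPU1 samples N = 10ns but M = 15 cycles later,
i.e. timespec1 = 110ns, tsc1 = 15. For a common later rdtsc value R:

	ret0 = 100 + (R - 0)  = R + 100
	ret1 = 110 + (R - 15) = R + 95

so a read on VCPU1 returns 5ns less than an earlier-visible read on
VCPU0. With a single shared <system_timestamp, tsc_timestamp> tuple,
both vcpus interpolate from the same base and the problem cannot
occur.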

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -46,6 +46,7 @@
 #include <linux/uaccess.h>
 #include <linux/hash.h>
 #include <linux/pci.h>
+#include <linux/pvclock_gtod.h>
 #include <trace/events/kvm.h>
 
 #define CREATE_TRACE_POINTS
@@ -1135,8 +1136,149 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
+static cycle_t read_tsc(void)
+{
+	cycle_t ret;
+	u64 last;
+
+	/*
+	 * Empirically, a fence (of type that depends on the CPU)
+	 * before rdtsc is enough to ensure that rdtsc is ordered
+	 * with respect to loads.  The various CPU manuals are unclear
+	 * as to whether rdtsc can be reordered with later loads,
+	 * but no one has ever seen it happen.
+	 */
+	rdtsc_barrier();
+	ret = (cycle_t)vget_cycles();
+
+	last = pvclock_gtod_data.clock.cycle_last;
+
+	if (likely(ret >= last))
+		return ret;
+
+	/*
+	 * GCC likes to generate cmov here, but this branch is extremely
+	 * predictable (it's just a function of time and the likely is
+	 * very likely) and there's a data dependence, so force GCC
+	 * to generate a branch instead.  I don't barrier() because
+	 * we don't actually need a barrier, and if this function
+	 * ever gets inlined it will generate worse code.
+	 */
+	asm volatile ("");
+	return last;
+}
+
+static inline u64 vgettsc(cycle_t *cycle_now)
+{
+	long v;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	*cycle_now = read_tsc();
+
+	v = (*cycle_now - gtod->clock.cycle_last) & gtod->clock.mask;
+	return v * gtod->clock.mult;
+}
+
+static int do_monotonic(struct timespec *ts, cycle_t *cycle_now)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	ts->tv_nsec = 0;
+	do {
+		seq = read_seqcount_begin(&gtod->seq);
+		mode = gtod->clock.vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_sec;
+		ns = gtod->monotonic_time_snsec;
+		ns += vgettsc(cycle_now);
+		ns >>= gtod->clock.shift;
+	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+	timespec_add_ns(ts, ns);
+
+	return mode;
+}
+
+/* returns true if host is using tsc clocksource */
+static bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now)
+{
+	struct timespec ts;
+
+	/* checked again under seqlock below */
+	if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
+		return false;
+
+	if (do_monotonic(&ts, cycle_now) != VCLOCK_TSC)
+		return false;
+
+	monotonic_to_bootbased(&ts);
+	*kernel_ns = timespec_to_ns(&ts);
+
+	return true;
+}
+
+
+/*
+ *
+ * Assuming a stable TSC across physical CPUs, the following condition
+ * is possible. Each numbered line represents an event visible to both
+ * CPUs at the next numbered event.
+ *
+ * "timespecX" represents host monotonic time. "tscX" represents
+ * RDTSC value.
+ *
+ * 		VCPU0 on CPU0		|	VCPU1 on CPU1
+ *
+ * 1.  read timespec0,tsc0
+ * 2.					| timespec1 = timespec0 + N
+ * 					| tsc1 = tsc0 + M
+ * 3. transition to guest		| transition to guest
+ * 4. ret0 = timespec0 + (rdtsc - tsc0) |
+ * 5.				        | ret1 = timespec1 + (rdtsc - tsc1)
+ * 				        | ret1 = timespec0 + N + (rdtsc - (tsc0 + M))
+ *
+ * Since ret0 update is visible to VCPU1 at time 5, to obey monotonicity:
+ *
+ * 	- ret0 < ret1
+ *	- timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
+ *		...
+ *	- 0 < N - M => M < N
+ *
+ * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
+ * always the case (the difference between two distinct xtime instances
+ * might be smaller than the difference between corresponding TSC reads,
+ * when updating guest vcpus' pvclock areas).
+ *
+ * To avoid that problem, do not allow visibility of distinct
+ * system_timestamp/tsc_timestamp values simultaneously: use a master
+ * copy of host monotonic time values. Update that master copy
+ * in lockstep.
+ *
+ * Rely on synchronization of host TSCs for monotonicity.
+ *
+ */
+
+static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
+{
+	struct kvm_arch *ka = &kvm->arch;
+	int vclock_mode;
+
+	/*
+ 	 * If the host uses TSC clock, then pass through TSC as stable
+	 * to the guest.
+	 */
+	ka->use_master_clock = kvm_get_time_and_clockread(
+					&ka->master_kernel_ns,
+					&ka->master_cycle_now);
+
+	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
+	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode);
+}
+
 static void kvm_write_pvtime(struct kvm_vcpu *v, struct page *page,
-			     unsigned int offset_in_page, gpa_t gpa)
+			     unsigned int offset_in_page, gpa_t gpa,
+			     bool use_master_clock)
 {
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	void *shared_kaddr;
@@ -1155,6 +1297,10 @@ static void kvm_write_pvtime(struct kvm_
 		vcpu->pvclock_set_guest_stopped_request = false;
 	}
 
+	/* If the host uses TSC clocksource, then it is stable */
+	if (use_master_clock)
+		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
+
 	vcpu->hv_clock.flags = pvclock_flags;
 
 	memcpy(shared_kaddr + offset_in_page, &vcpu->hv_clock,
@@ -1169,14 +1315,18 @@ static int kvm_guest_time_update(struct 
 {
 	unsigned long flags;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
+	struct kvm_arch *ka = &v->kvm->arch;
 	unsigned long this_tsc_khz;
 	s64 kernel_ns, max_kernel_ns;
 	u64 tsc_timestamp;
+	u64 host_tsc;
+	bool use_master_clock;
+
+	kernel_ns = 0;
+	host_tsc = 0;
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, native_read_tsc());
-	kernel_ns = get_kernel_ns();
 	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 	if (unlikely(this_tsc_khz == 0)) {
 		local_irq_restore(flags);
@@ -1185,6 +1335,24 @@ static int kvm_guest_time_update(struct 
 	}
 
 	/*
+ 	 * If the host uses TSC clock, then pass through TSC as stable
+	 * to the guest.
+	 */
+	spin_lock(&ka->pvclock_gtod_sync_lock);
+	use_master_clock = ka->use_master_clock;
+	if (use_master_clock) {
+		host_tsc = ka->master_cycle_now;
+		kernel_ns = ka->master_kernel_ns;
+	}
+	spin_unlock(&ka->pvclock_gtod_sync_lock);
+	if (!use_master_clock) {
+		host_tsc = native_read_tsc();
+		kernel_ns = get_kernel_ns();
+	}
+
+	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, host_tsc);
+
+	/*
 	 * We may have to catch up the TSC to match elapsed wall clock
 	 * time for two reasons, even if kvmclock is used.
 	 *   1) CPU could have been running below the maximum TSC rate
@@ -1245,8 +1413,14 @@ static int kvm_guest_time_update(struct 
 		vcpu->hw_tsc_khz = this_tsc_khz;
 	}
 
-	if (max_kernel_ns > kernel_ns)
-		kernel_ns = max_kernel_ns;
+	/* with a master <monotonic time, tsc value> tuple,
+ 	 * pvclock clock reads always increase at the (scaled) rate
+ 	 * of guest TSC - no need to deal with sampling errors.
+ 	 */
+	if (!use_master_clock) {
+		if (max_kernel_ns > kernel_ns)
+			kernel_ns = max_kernel_ns;
+	}
 
 	/* With all the info we got, fill in the values */
 	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
@@ -1262,10 +1436,12 @@ static int kvm_guest_time_update(struct 
 	 */
 	vcpu->hv_clock.version += 2;
 
- 	kvm_write_pvtime(v, vcpu->time_page, vcpu->time_offset, vcpu->time);
+ 	kvm_write_pvtime(v, vcpu->time_page, vcpu->time_offset, vcpu->time,
+			 use_master_clock);
  	if (vcpu->uspace_time_page)
  		kvm_write_pvtime(v, vcpu->uspace_time_page,
- 				 vcpu->uspace_time_offset, vcpu->uspace_time);
+ 				 vcpu->uspace_time_offset, vcpu->uspace_time,
+				 use_master_clock);
 
 	return 0;
 }
@@ -5302,6 +5478,28 @@ static void process_nmi(struct kvm_vcpu 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 
+static void kvm_gen_update_masterclock(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	struct kvm_arch *ka = &kvm->arch;
+
+	spin_lock(&ka->pvclock_gtod_sync_lock);
+	kvm_make_mclock_inprogress_request(kvm);
+	/* no guest entries from this point */
+	pvclock_update_vm_gtod_copy(kvm);
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
+
+	/* guest entries allowed */
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);
+
+	spin_unlock(&ka->pvclock_gtod_sync_lock);
+
+}
+
 static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 {
 	int r;
@@ -5314,6 +5512,8 @@ static int vcpu_enter_guest(struct kvm_v
 			kvm_mmu_unload(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
 			__kvm_migrate_timers(vcpu);
+		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
+			kvm_gen_update_masterclock(vcpu->kvm);
 		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
 			r = kvm_guest_time_update(vcpu);
 			if (unlikely(r))
@@ -6219,6 +6419,8 @@ int kvm_arch_hardware_enable(void *garba
 			kvm_for_each_vcpu(i, vcpu, kvm) {
 				vcpu->arch.tsc_offset_adjustment += delta_cyc;
 				vcpu->arch.last_host_tsc = local_tsc;
+				set_bit(KVM_REQ_MASTERCLOCK_UPDATE,
+					&vcpu->requests);
 			}
 
 			/*
@@ -6356,6 +6558,9 @@ int kvm_arch_init_vm(struct kvm *kvm, un
 
 	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
 	mutex_init(&kvm->arch.apic_map_lock);
+	spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock);
+
+	pvclock_update_vm_gtod_copy(kvm);
 
 	return 0;
 }
Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -22,6 +22,7 @@
 #include <linux/kvm_para.h>
 #include <linux/kvm_types.h>
 #include <linux/perf_event.h>
+#include <linux/pvclock_gtod.h>
 
 #include <asm/pvclock-abi.h>
 #include <asm/desc.h>
@@ -563,6 +564,11 @@ struct kvm_arch {
 	u64 cur_tsc_offset;
 	u8  cur_tsc_generation;
 
+	spinlock_t pvclock_gtod_sync_lock;
+	bool use_master_clock;
+	u64 master_kernel_ns;
+	cycle_t master_cycle_now;
+
 	struct kvm_xen_hvm_config xen_hvm_config;
 
 	/* fields used by HYPER-V emulation */
Index: vsyscall/include/linux/kvm_host.h
===================================================================
--- vsyscall.orig/include/linux/kvm_host.h
+++ vsyscall/include/linux/kvm_host.h
@@ -118,6 +118,8 @@ static inline bool is_error_page(struct 
 #define KVM_REQ_IMMEDIATE_EXIT    15
 #define KVM_REQ_PMU               16
 #define KVM_REQ_PMI               17
+#define KVM_REQ_MASTERCLOCK_UPDATE  18
+#define KVM_REQ_MCLOCK_INPROGRESS 19
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID		0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
@@ -527,6 +529,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *
 
 void kvm_flush_remote_tlbs(struct kvm *kvm);
 void kvm_reload_remote_mmus(struct kvm *kvm);
+void kvm_make_mclock_inprogress_request(struct kvm *kvm);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
Index: vsyscall/virt/kvm/kvm_main.c
===================================================================
--- vsyscall.orig/virt/kvm/kvm_main.c
+++ vsyscall/virt/kvm/kvm_main.c
@@ -212,6 +212,11 @@ void kvm_reload_remote_mmus(struct kvm *
 	make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
 }
 
+void kvm_make_mclock_inprogress_request(struct kvm *kvm)
+{
+	make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
+}
+
 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 {
 	struct page *page;
Index: vsyscall/arch/x86/kvm/trace.h
===================================================================
--- vsyscall.orig/arch/x86/kvm/trace.h
+++ vsyscall/arch/x86/kvm/trace.h
@@ -4,6 +4,7 @@
 #include <linux/tracepoint.h>
 #include <asm/vmx.h>
 #include <asm/svm.h>
+#include <asm/clocksource.h>
 
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvm
@@ -754,6 +755,31 @@ TRACE_EVENT(
 		  __entry->write ? "Write" : "Read",
 		  __entry->gpa_match ? "GPA" : "GVA")
 );
+
+#define host_clocks				\
+	{VCLOCK_NONE, "none"},			\
+	{VCLOCK_TSC,  "tsc"},			\
+	{VCLOCK_HPET, "hpet"}			\
+
+TRACE_EVENT(kvm_update_master_clock,
+	TP_PROTO(bool use_master_clock, unsigned int host_clock),
+	TP_ARGS(use_master_clock, host_clock),
+
+	TP_STRUCT__entry(
+		__field(		bool,	use_master_clock	)
+		__field(	unsigned int,	host_clock		)
+	),
+
+	TP_fast_assign(
+		__entry->use_master_clock	= use_master_clock;
+		__entry->host_clock		= host_clock;
+	),
+
+	TP_printk("masterclock %d hostclock %s",
+		  __entry->use_master_clock,
+		  __print_symbolic(__entry->host_clock, host_clocks))
+);
+
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 14/16] KVM: x86: notifier for clocksource changes
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
                     ` (12 preceding siblings ...)
  2012-10-31 22:47   ` [patch 13/16] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
@ 2012-10-31 22:47   ` Marcelo Tosatti
  2012-10-31 22:47   ` [patch 15/16] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization Marcelo Tosatti
  2012-10-31 22:47   ` [patch 16/16] KVM: x86: require matched TSC offsets for master clock Marcelo Tosatti
  15 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:47 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 15-add-kvm-req-pvclock-gtod-update --]
[-- Type: text/plain, Size: 2592 bytes --]

Register a notifier for clocksource change event. In case
the host switches to clock other than TSC, disable master
clock usage.
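
Note that the fan-out to vcpus is deferred to system_long_wq rather
than done in the notifier itself: pvclock_gtod_notify() runs from the
timekeeping core (update_pvclock_gtod() in an earlier patch), which
is presumably no place to take kvm_lock and iterate over every VM
and vcpu.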

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1218,6 +1218,7 @@ static bool kvm_get_time_and_clockread(s
 	return true;
 }
 
+static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
 
 /*
  *
@@ -1271,6 +1272,8 @@ static void pvclock_update_vm_gtod_copy(
 	ka->use_master_clock = kvm_get_time_and_clockread(
 					&ka->master_kernel_ns,
 					&ka->master_cycle_now);
+	if (ka->use_master_clock)
+		atomic_set(&kvm_guest_has_master_clock, 1);
 
 	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
 	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode);
@@ -5124,6 +5127,44 @@ static void kvm_set_mmio_spte_mask(void)
 	kvm_mmu_set_mmio_spte_mask(mask);
 }
 
+static void pvclock_gtod_update_fn(struct work_struct *work)
+{
+	struct kvm *kvm;
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	raw_spin_lock(&kvm_lock);
+	list_for_each_entry(kvm, &vm_list, vm_list)
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			set_bit(KVM_REQ_MASTERCLOCK_UPDATE, &vcpu->requests);
+	atomic_set(&kvm_guest_has_master_clock, 0);
+	raw_spin_unlock(&kvm_lock);
+}
+
+static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
+
+/*
+ * Notification about pvclock gtod data update.
+ */
+static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
+			       void *unused2)
+{
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	/* disable master clock if host does not trust, or does not
+ 	 * use, TSC clocksource
+ 	 */
+	if (gtod->clock.vclock_mode != VCLOCK_TSC &&
+	    atomic_read(&kvm_guest_has_master_clock) != 0)
+		queue_work(system_long_wq, &pvclock_gtod_work);
+
+	return 0;
+}
+
+static struct notifier_block pvclock_gtod_notifier = {
+	.notifier_call = pvclock_gtod_notify,
+};
+
 int kvm_arch_init(void *opaque)
 {
 	int r;
@@ -5165,6 +5206,8 @@ int kvm_arch_init(void *opaque)
 		host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
 
 	kvm_lapic_init();
+	pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
+
 	return 0;
 
 out:
@@ -5179,6 +5222,7 @@ void kvm_arch_exit(void)
 		cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
 					    CPUFREQ_TRANSITION_NOTIFIER);
 	unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
+	pvclock_gtod_unregister_notifier(&pvclock_gtod_notifier);
 	kvm_x86_ops = NULL;
 	kvm_mmu_module_exit();
 }



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 15/16] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
                     ` (13 preceding siblings ...)
  2012-10-31 22:47   ` [patch 14/16] KVM: x86: notifier for clocksource changes Marcelo Tosatti
@ 2012-10-31 22:47   ` Marcelo Tosatti
  2012-10-31 22:47   ` [patch 16/16] KVM: x86: require matched TSC offsets for master clock Marcelo Tosatti
  15 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:47 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 16-add-kvm-add-vcpu-postcreate --]
[-- Type: text/plain, Size: 3730 bytes --]

TSC initialization will soon make use of online_vcpus.
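
The new callback is invoked from kvm_vm_ioctl_create_vcpu() after
online_vcpus has been incremented and kvm->lock dropped, so the
kvm_write_tsc(vcpu, 0) moved here out of the SVM/VMX setup paths
already sees the new vcpu accounted for in online_vcpus - which is
what the TSC-matching logic in the next patch relies on.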

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/ia64/kvm/kvm-ia64.c
===================================================================
--- vsyscall.orig/arch/ia64/kvm/kvm-ia64.c
+++ vsyscall/arch/ia64/kvm/kvm-ia64.c
@@ -1330,6 +1330,11 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu 
 	return 0;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
 	return -EINVAL;
Index: vsyscall/arch/powerpc/kvm/powerpc.c
===================================================================
--- vsyscall.orig/arch/powerpc/kvm/powerpc.c
+++ vsyscall/arch/powerpc/kvm/powerpc.c
@@ -354,6 +354,11 @@ struct kvm_vcpu *kvm_arch_vcpu_create(st
 	return vcpu;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	/* Make sure we're not using the vcpu anymore */
Index: vsyscall/arch/s390/kvm/kvm-s390.c
===================================================================
--- vsyscall.orig/arch/s390/kvm/kvm-s390.c
+++ vsyscall/arch/s390/kvm/kvm-s390.c
@@ -355,6 +355,11 @@ static void kvm_s390_vcpu_initial_reset(
 	atomic_set_mask(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
 {
 	atomic_set(&vcpu->arch.sie_block->cpuflags, CPUSTAT_ZARCH |
Index: vsyscall/arch/x86/kvm/svm.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/svm.c
+++ vsyscall/arch/x86/kvm/svm.c
@@ -1254,7 +1254,6 @@ static struct kvm_vcpu *svm_create_vcpu(
 	svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT;
 	svm->asid_generation = 0;
 	init_vmcb(svm);
-	kvm_write_tsc(&svm->vcpu, 0);
 
 	err = fx_init(&svm->vcpu);
 	if (err)
Index: vsyscall/arch/x86/kvm/vmx.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/vmx.c
+++ vsyscall/arch/x86/kvm/vmx.c
@@ -3896,8 +3896,6 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL);
 	set_cr4_guest_host_mask(vmx);
 
-	kvm_write_tsc(&vmx->vcpu, 0);
-
 	return 0;
 }
 
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -6350,6 +6350,19 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu 
 	return r;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+	int r;
+
+	r = vcpu_load(vcpu);
+	if (r)
+		return r;
+	kvm_write_tsc(vcpu, 0);
+	vcpu_put(vcpu);
+
+	return r;
+}
+
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
 	int r;
Index: vsyscall/include/linux/kvm_host.h
===================================================================
--- vsyscall.orig/include/linux/kvm_host.h
+++ vsyscall/include/linux/kvm_host.h
@@ -583,6 +583,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
 struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id);
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
 
 int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu);
Index: vsyscall/virt/kvm/kvm_main.c
===================================================================
--- vsyscall.orig/virt/kvm/kvm_main.c
+++ vsyscall/virt/kvm/kvm_main.c
@@ -1855,6 +1855,7 @@ static int kvm_vm_ioctl_create_vcpu(stru
 	atomic_inc(&kvm->online_vcpus);
 
 	mutex_unlock(&kvm->lock);
+	kvm_arch_vcpu_postcreate(vcpu);
 	return r;
 
 unlock_vcpu_destroy:



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [patch 16/16] KVM: x86: require matched TSC offsets for master clock
  2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
                     ` (14 preceding siblings ...)
  2012-10-31 22:47   ` [patch 15/16] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization Marcelo Tosatti
@ 2012-10-31 22:47   ` Marcelo Tosatti
  15 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-10-31 22:47 UTC (permalink / raw)
  To: kvm
  Cc: johnstul, jeremy, glommer, zamsden, gleb, avi, pbonzini, Marcelo Tosatti

[-- Attachment #1: 17-masterclock-require-matched-tsc --]
[-- Type: text/plain, Size: 6975 bytes --]

With master clock, a pvclock clock read calculates:

ret = system_timestamp + [ (rdtsc + tsc_offset) - tsc_timestamp ]

Where 'rdtsc' is the host TSC.

system_timestamp and tsc_timestamp are unique, one tuple 
per VM: the "master clock".

Given a host with synchronized TSCs, it's obvious that the
guest TSCs must be matched for the above to guarantee monotonicity.

Allow master clock usage only if guest TSCs are synchronized.
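
To illustrate why, suppose two vcpus share the master
<system_timestamp, tsc_timestamp> tuple but their tsc_offsets differ
by d cycles. Then for the same host rdtsc:

	ret_vcpu0 = system_timestamp + [(rdtsc + offset)     - tsc_timestamp]
	ret_vcpu1 = system_timestamp + [(rdtsc + offset + d) - tsc_timestamp]
	          = ret_vcpu0 + d

The two vcpus' clocks are skewed by a constant d, so a guest task
migrating from the vcpu that is ahead to the other one observes time
going backwards by d (scaled to nanoseconds). Hence the master clock
is enabled only when nr_vcpus_matched_tsc covers all online vcpus.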

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: vsyscall/arch/x86/include/asm/kvm_host.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -563,6 +563,7 @@ struct kvm_arch {
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
 	u8  cur_tsc_generation;
+	int nr_vcpus_matched_tsc;
 
 	spinlock_t pvclock_gtod_sync_lock;
 	bool use_master_clock;
Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1047,12 +1047,38 @@ static u64 compute_guest_tsc(struct kvm_
 	return tsc;
 }
 
+void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
+{
+	bool vcpus_matched;
+	bool do_request = false;
+	struct kvm_arch *ka = &vcpu->kvm->arch;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
+			 atomic_read(&vcpu->kvm->online_vcpus));
+
+	if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
+		if (!ka->use_master_clock)
+			do_request = true;
+
+	if (!vcpus_matched && ka->use_master_clock)
+		do_request = true;
+
+	if (do_request)
+		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+
+	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
+			    atomic_read(&vcpu->kvm->online_vcpus),
+		            ka->use_master_clock, gtod->clock.vclock_mode);
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
 	s64 usdiff;
+	bool matched;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
@@ -1095,6 +1121,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 			offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
 			pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
 		}
+		matched = true;
 	} else {
 		/*
 		 * We split periods of matched TSC writes into generations.
@@ -1109,6 +1136,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 		kvm->arch.cur_tsc_nsec = ns;
 		kvm->arch.cur_tsc_write = data;
 		kvm->arch.cur_tsc_offset = offset;
+		matched = false;
 		pr_debug("kvm: new tsc generation %u, clock %llu\n",
 			 kvm->arch.cur_tsc_generation, data);
 	}
@@ -1132,6 +1160,15 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
 	kvm_x86_ops->write_tsc_offset(vcpu, offset);
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
+
+	spin_lock(&kvm->arch.pvclock_gtod_sync_lock);
+	if (matched)
+		kvm->arch.nr_vcpus_matched_tsc++;
+	else
+		kvm->arch.nr_vcpus_matched_tsc = 0;
+
+	kvm_track_tsc_matching(vcpu);
+	spin_unlock(&kvm->arch.pvclock_gtod_sync_lock);
 }
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
@@ -1222,8 +1259,9 @@ static atomic_t kvm_guest_has_master_clo
 
 /*
  *
- * Assuming a stable TSC across physical CPUs, the following condition
- * is possible. Each numbered line represents an event visible to both
+ * Assuming a stable TSC across physical CPUs, and a stable TSC
+ * across virtual CPUs, the following condition is possible.
+ * Each numbered line represents an event visible to both
  * CPUs at the next numbered event.
  *
  * "timespecX" represents host monotonic time. "tscX" represents
@@ -1256,7 +1294,7 @@ static atomic_t kvm_guest_has_master_clo
  * copy of host monotonic time values. Update that master copy
  * in lockstep.
  *
- * Rely on synchronization of host TSCs for monotonicity.
+ * Rely on synchronization of host TSCs and guest TSCs for monotonicity.
  *
  */
 
@@ -1264,19 +1302,26 @@ static void pvclock_update_vm_gtod_copy(
 {
 	struct kvm_arch *ka = &kvm->arch;
 	int vclock_mode;
+	bool host_tsc_clocksource, vcpus_matched;
 
+	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
+				atomic_read(&kvm->online_vcpus));
 	/*
  	 * If the host uses TSC clock, then pass through TSC as stable
 	 * to the guest.
 	 */
-	ka->use_master_clock = kvm_get_time_and_clockread(
+	host_tsc_clocksource = kvm_get_time_and_clockread(
 					&ka->master_kernel_ns,
 					&ka->master_cycle_now);
+
+	ka->use_master_clock = host_tsc_clocksource && vcpus_matched;
+
 	if (ka->use_master_clock)
 		atomic_set(&kvm_guest_has_master_clock, 1);
 
 	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
-	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode);
+	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode,
+				      vcpus_matched);
 }
 
 static void kvm_write_pvtime(struct kvm_vcpu *v, struct page *page,
Index: vsyscall/arch/x86/kvm/trace.h
===================================================================
--- vsyscall.orig/arch/x86/kvm/trace.h
+++ vsyscall/arch/x86/kvm/trace.h
@@ -762,21 +762,54 @@ TRACE_EVENT(
 	{VCLOCK_HPET, "hpet"}			\
 
 TRACE_EVENT(kvm_update_master_clock,
-	TP_PROTO(bool use_master_clock, unsigned int host_clock),
-	TP_ARGS(use_master_clock, host_clock),
+	TP_PROTO(bool use_master_clock, unsigned int host_clock, bool offset_matched),
+	TP_ARGS(use_master_clock, host_clock, offset_matched),
 
 	TP_STRUCT__entry(
 		__field(		bool,	use_master_clock	)
 		__field(	unsigned int,	host_clock		)
+		__field(		bool,	offset_matched		)
 	),
 
 	TP_fast_assign(
 		__entry->use_master_clock	= use_master_clock;
 		__entry->host_clock		= host_clock;
+		__entry->offset_matched		= offset_matched;
 	),
 
-	TP_printk("masterclock %d hostclock %s",
+	TP_printk("masterclock %d hostclock %s offsetmatched %u",
 		  __entry->use_master_clock,
+		  __print_symbolic(__entry->host_clock, host_clocks),
+		  __entry->offset_matched)
+);
+
+TRACE_EVENT(kvm_track_tsc,
+	TP_PROTO(unsigned int vcpu_id, unsigned int nr_matched,
+		 unsigned int online_vcpus, bool use_master_clock,
+		 unsigned int host_clock),
+	TP_ARGS(vcpu_id, nr_matched, online_vcpus, use_master_clock,
+		host_clock),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	vcpu_id			)
+		__field(	unsigned int,	nr_vcpus_matched_tsc	)
+		__field(	unsigned int,	online_vcpus		)
+		__field(	bool,		use_master_clock	)
+		__field(	unsigned int,	host_clock		)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id		= vcpu_id;
+		__entry->nr_vcpus_matched_tsc	= nr_matched;
+		__entry->online_vcpus		= online_vcpus;
+		__entry->use_master_clock	= use_master_clock;
+		__entry->host_clock		= host_clock;
+	),
+
+	TP_printk("vcpu_id %u masterclock %u offsetmatched %u nr_online %u"
+		  " hostclock %s",
+		  __entry->vcpu_id, __entry->use_master_clock,
+		  __entry->nr_vcpus_matched_tsc, __entry->online_vcpus,
 		  __print_symbolic(__entry->host_clock, host_clocks))
 );
 



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 01/16] KVM: x86: retain pvclock guest stopped bit in guest memory
  2012-10-31 22:46   ` [patch 01/16] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
@ 2012-11-01 10:39     ` Gleb Natapov
  2012-11-01 20:51       ` Marcelo Tosatti
  2012-11-01 13:44     ` Glauber Costa
  1 sibling, 1 reply; 94+ messages in thread
From: Gleb Natapov @ 2012-11-01 10:39 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, glommer, zamsden, avi, pbonzini

On Wed, Oct 31, 2012 at 08:46:57PM -0200, Marcelo Tosatti wrote:
> Otherwise its possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as due to CPU
> migration) to clear the bit.
> 
> Noticed by Paolo Bonzini.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
Reviewed-by: Gleb Natapov <gleb@redhat.com>

Small nitpick below.

Also, do we really need to call kvm_guest_time_update() on a guest pause?
Wouldn't a separate request bit, which only sets the flag, suffice?

> Index: vsyscall/arch/x86/kvm/x86.c
> ===================================================================
> --- vsyscall.orig/arch/x86/kvm/x86.c
> +++ vsyscall/arch/x86/kvm/x86.c
> @@ -1143,6 +1143,7 @@ static int kvm_guest_time_update(struct 
>  	unsigned long this_tsc_khz;
>  	s64 kernel_ns, max_kernel_ns;
>  	u64 tsc_timestamp;
> +	struct pvclock_vcpu_time_info *guest_hv_clock;
>  	u8 pvclock_flags;
>  
>  	/* Keep irq disabled to prevent changes to the clock */
> @@ -1226,13 +1227,6 @@ static int kvm_guest_time_update(struct 
>  	vcpu->last_kernel_ns = kernel_ns;
>  	vcpu->last_guest_tsc = tsc_timestamp;
>  
> -	pvclock_flags = 0;
> -	if (vcpu->pvclock_set_guest_stopped_request) {
> -		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
> -		vcpu->pvclock_set_guest_stopped_request = false;
> -	}
> -
> -	vcpu->hv_clock.flags = pvclock_flags;
>  
>  	/*
>  	 * The interface expects us to write an even number signaling that the
> @@ -1243,6 +1237,18 @@ static int kvm_guest_time_update(struct 
>  
>  	shared_kaddr = kmap_atomic(vcpu->time_page);
>  
> +	guest_hv_clock = shared_kaddr + vcpu->time_offset;
> +
> +	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
> +	pvclock_flags = (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
> +
> +	if (vcpu->pvclock_set_guest_stopped_request) {
> +		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
> +		vcpu->pvclock_set_guest_stopped_request = false;
> +	}
> +
> +	vcpu->hv_clock.flags = pvclock_flags;
> +
>  	memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
We can use guest_hv_clock here now.

>  	       sizeof(vcpu->hv_clock));
>  
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region
  2012-10-31 22:46   ` [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region Marcelo Tosatti
@ 2012-11-01 11:48     ` Gleb Natapov
  2012-11-01 13:49       ` Glauber Costa
  0 siblings, 1 reply; 94+ messages in thread
From: Gleb Natapov @ 2012-11-01 11:48 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, glommer, zamsden, avi, pbonzini

On Wed, Oct 31, 2012 at 08:46:58PM -0200, Marcelo Tosatti wrote:
> Originally from Jeremy Fitzhardinge.
> 
> pvclock_get_time_values, which contains the memory barriers
> will be removed by next patch.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: vsyscall/arch/x86/kernel/pvclock.c
> ===================================================================
> --- vsyscall.orig/arch/x86/kernel/pvclock.c
> +++ vsyscall/arch/x86/kernel/pvclock.c
> @@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
>  
>  	do {
>  		version = pvclock_get_time_values(&shadow, src);
> -		barrier();
> +		rdtsc_barrier();
>  		offset = pvclock_get_nsec_offset(&shadow);
>  		ret = shadow.system_timestamp + offset;
> -		barrier();
> +		rdtsc_barrier();
>  	} while (version != src->version);
>  
>  	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> 
On a guest without SSE2 rdtsc_barrier() will be nop while rmb() will
be "lock; addl $0,0(%%esp)". I doubt pvclock will work correctly either
way though.

--
			Gleb.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 01/16] KVM: x86: retain pvclock guest stopped bit in guest memory
  2012-10-31 22:46   ` [patch 01/16] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
  2012-11-01 10:39     ` Gleb Natapov
@ 2012-11-01 13:44     ` Glauber Costa
  1 sibling, 0 replies; 94+ messages in thread
From: Glauber Costa @ 2012-11-01 13:44 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/01/2012 02:46 AM, Marcelo Tosatti wrote:
> Otherwise its possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as due to CPU
> migration) to clear the bit.
> 
> Noticed by Paolo Bonzini.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: vsyscall/arch/x86/kvm/x86.c
> ===================================================================

Reviewed-by: Glauber Costa <glommer@parallels.com>


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region
  2012-11-01 11:48     ` Gleb Natapov
@ 2012-11-01 13:49       ` Glauber Costa
  2012-11-01 13:51         ` Gleb Natapov
  2012-11-01 20:56         ` Marcelo Tosatti
  0 siblings, 2 replies; 94+ messages in thread
From: Glauber Costa @ 2012-11-01 13:49 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Marcelo Tosatti, kvm, johnstul, jeremy, zamsden, avi, pbonzini

On 11/01/2012 03:48 PM, Gleb Natapov wrote:
> On Wed, Oct 31, 2012 at 08:46:58PM -0200, Marcelo Tosatti wrote:
>> Originally from Jeremy Fitzhardinge.
>>
>> pvclock_get_time_values, which contains the memory barriers
>> will be removed by next patch.
>>
>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>>
>> Index: vsyscall/arch/x86/kernel/pvclock.c
>> ===================================================================
>> --- vsyscall.orig/arch/x86/kernel/pvclock.c
>> +++ vsyscall/arch/x86/kernel/pvclock.c
>> @@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
>>  
>>  	do {
>>  		version = pvclock_get_time_values(&shadow, src);
>> -		barrier();
>> +		rdtsc_barrier();
>>  		offset = pvclock_get_nsec_offset(&shadow);
>>  		ret = shadow.system_timestamp + offset;
>> -		barrier();
>> +		rdtsc_barrier();
>>  	} while (version != src->version);
>>  
>>  	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
>>
> On a guest without SSE2 rdtsc_barrier() will be nop while rmb() will
> be "lock; addl $0,0(%%esp)". I doubt pvclock will work correctly either
> way though.
> 
> --
> 			Gleb.
> 
Actually it shouldn't matter for KVM, since the page is only updated by
the vcpu, and the guest is never running while it happens. If Jeremy is
fine with this, so should I.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region
  2012-11-01 13:49       ` Glauber Costa
@ 2012-11-01 13:51         ` Gleb Natapov
  2012-11-01 20:56         ` Marcelo Tosatti
  1 sibling, 0 replies; 94+ messages in thread
From: Gleb Natapov @ 2012-11-01 13:51 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Marcelo Tosatti, kvm, johnstul, jeremy, zamsden, avi, pbonzini

On Thu, Nov 01, 2012 at 05:49:51PM +0400, Glauber Costa wrote:
> On 11/01/2012 03:48 PM, Gleb Natapov wrote:
> > On Wed, Oct 31, 2012 at 08:46:58PM -0200, Marcelo Tosatti wrote:
> >> Originally from Jeremy Fitzhardinge.
> >>
> >> pvclock_get_time_values, which contains the memory barriers
> >> will be removed by next patch.
> >>
> >> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >>
> >> Index: vsyscall/arch/x86/kernel/pvclock.c
> >> ===================================================================
> >> --- vsyscall.orig/arch/x86/kernel/pvclock.c
> >> +++ vsyscall/arch/x86/kernel/pvclock.c
> >> @@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
> >>  
> >>  	do {
> >>  		version = pvclock_get_time_values(&shadow, src);
> >> -		barrier();
> >> +		rdtsc_barrier();
> >>  		offset = pvclock_get_nsec_offset(&shadow);
> >>  		ret = shadow.system_timestamp + offset;
> >> -		barrier();
> >> +		rdtsc_barrier();
> >>  	} while (version != src->version);
> >>  
> >>  	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> >>
> > On a guest without SSE2 rdtsc_barrier() will be nop while rmb() will
> > be "lock; addl $0,0(%%esp)". I doubt pvclock will work correctly either
> > way though.
> > 
> > --
> > 			Gleb.
> > 
> Actually it shouldn't matter for KVM, since the page is only updated by
> the vcpu, and the guest is never running while it happens. If Jeremy is
> fine with this, so should I.
If you and Jeremy are fine with this, so should I.

--
			Gleb.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 03/16] x86: pvclock: remove pvclock_shadow_time
  2012-10-31 22:46   ` [patch 03/16] x86: pvclock: remove pvclock_shadow_time Marcelo Tosatti
@ 2012-11-01 13:52     ` Glauber Costa
  0 siblings, 0 replies; 94+ messages in thread
From: Glauber Costa @ 2012-11-01 13:52 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/01/2012 02:46 AM, Marcelo Tosatti wrote:
> Originally from Jeremy Fitzhardinge.
> 
> We can copy the information directly from "struct pvclock_vcpu_time_info", 
> remove pvclock_shadow_time.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Reviewed-by: Glauber Costa <glommer@parallels.com>


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 04/16] x86: pvclock: create helper for pvclock data retrieval
  2012-10-31 22:47   ` [patch 04/16] x86: pvclock: create helper for pvclock data retrieval Marcelo Tosatti
@ 2012-11-01 14:04     ` Glauber Costa
  2012-11-01 20:57       ` Marcelo Tosatti
  0 siblings, 1 reply; 94+ messages in thread
From: Glauber Costa @ 2012-11-01 14:04 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> +static __always_inline
> +unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
> +			       cycle_t *cycles, u8 *flags)
> +{
> +	unsigned version;
> +	cycle_t ret, offset;
> +	u8 ret_flags;
> +
> +	version = src->version;
> +	rdtsc_barrier();
> +	offset = pvclock_get_nsec_offset(src);
> +	ret = src->system_time + offset;
> +	ret_flags = src->flags;
> +	rdtsc_barrier();
> +
> +	*cycles = ret;
> +	*flags = ret_flags;
> +	return version;
> +}
> +
This interface is a bit weird.
The actual value you are interested in is "cycles", so why is it
returned through the parameters? I think it would be clearer to have
this return cycles, and &version as a parameter.
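
For reference, the shape being suggested here would be roughly the
following (an illustrative sketch only, not code from the series):

	static __always_inline
	cycle_t __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
				      unsigned *version, u8 *flags)
	{
		cycle_t ret;

		*version = src->version;
		rdtsc_barrier();
		ret = src->system_time + pvclock_get_nsec_offset(src);
		*flags = src->flags;
		rdtsc_barrier();
		return ret;
	}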



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 05/16] x86: pvclock: introduce helper to read flags
  2012-10-31 22:47   ` [patch 05/16] x86: pvclock: introduce helper to read flags Marcelo Tosatti
@ 2012-11-01 14:07     ` Glauber Costa
  2012-11-01 21:08       ` Marcelo Tosatti
  0 siblings, 1 reply; 94+ messages in thread
From: Glauber Costa @ 2012-11-01 14:07 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> 
> Index: vsyscall/arch/x86/kernel/pvclock.c
If you are resending this series, I don't see a reason for this one to
be in a separate patch.

The code itself is fine.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 06/16] sched: add notifier for cross-cpu migrations
  2012-10-31 22:47   ` [patch 06/16] sched: add notifier for cross-cpu migrations Marcelo Tosatti
@ 2012-11-01 14:08     ` Glauber Costa
  0 siblings, 0 replies; 94+ messages in thread
From: Glauber Costa @ 2012-11-01 14:08 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini, Peter Zijlstra

On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> Originally from Jeremy Fitzhardinge.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 

Please collect peterz's ack for this one.

> Index: vsyscall/include/linux/sched.h
> ===================================================================
> --- vsyscall.orig/include/linux/sched.h
> +++ vsyscall/include/linux/sched.h
> @@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void)
>  extern void calc_global_load(unsigned long ticks);
>  extern void update_cpu_load_nohz(void);
>  
> +/* Notifier for when a task gets migrated to a new CPU */
> +struct task_migration_notifier {
> +	struct task_struct *task;
> +	int from_cpu;
> +	int to_cpu;
> +};
> +extern void register_task_migration_notifier(struct notifier_block *n);
> +
>  extern unsigned long get_parent_ip(unsigned long addr);
>  
>  struct seq_file;
> Index: vsyscall/kernel/sched/core.c
> ===================================================================
> --- vsyscall.orig/kernel/sched/core.c
> +++ vsyscall/kernel/sched/core.c
> @@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s
>  		rq->skip_clock_update = 1;
>  }
>  
> +static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
> +
> +void register_task_migration_notifier(struct notifier_block *n)
> +{
> +	atomic_notifier_chain_register(&task_migration_notifier, n);
> +}
> +
>  #ifdef CONFIG_SMP
>  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>  {
> @@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p,
>  	trace_sched_migrate_task(p, new_cpu);
>  
>  	if (task_cpu(p) != new_cpu) {
> +		struct task_migration_notifier tmn;
> +
>  		p->se.nr_migrations++;
>  		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
> +
> +		tmn.task = p;
> +		tmn.from_cpu = task_cpu(p);
> +		tmn.to_cpu = new_cpu;
> +
> +		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
>  	}
>  
>  	__set_task_cpu(p, new_cpu);
> 
> 
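
For reference, a consumer of this notifier looks roughly like the
following (a sketch; the callback and variable names are illustrative,
pvclock's real callback appears in patch 07):

	static int task_migrate_cb(struct notifier_block *nb,
				   unsigned long action, void *data)
	{
		struct task_migration_notifier *tmn = data;

		/* called from the scheduler's migration path; keep it cheap */
		pr_debug("task %d: cpu %d -> %d\n",
			 tmn->task->pid, tmn->from_cpu, tmn->to_cpu);
		return NOTIFY_DONE;
	}

	static struct notifier_block task_migrate_nb = {
		.notifier_call = task_migrate_cb,
	};

	/* at init time: */
	register_task_migration_notifier(&task_migrate_nb);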


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 07/16] x86: pvclock: generic pvclock vsyscall initialization
  2012-10-31 22:47   ` [patch 07/16] x86: pvclock: generic pvclock vsyscall initialization Marcelo Tosatti
@ 2012-11-01 14:19     ` Glauber Costa
  0 siblings, 0 replies; 94+ messages in thread
From: Glauber Costa @ 2012-11-01 14:19 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> Originally from Jeremy Fitzhardinge.
> 
> Introduce generic, non hypervisor specific, pvclock initialization 
> routines.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: vsyscall/arch/x86/kernel/pvclock.c
> ===================================================================
> --- vsyscall.orig/arch/x86/kernel/pvclock.c
> +++ vsyscall/arch/x86/kernel/pvclock.c
> @@ -17,6 +17,10 @@
>  
>  #include <linux/kernel.h>
>  #include <linux/percpu.h>
> +#include <linux/notifier.h>
> +#include <linux/sched.h>
> +#include <linux/gfp.h>
> +#include <linux/bootmem.h>
>  #include <asm/pvclock.h>
>  
>  static u8 valid_flags __read_mostly = 0;
> @@ -122,3 +126,67 @@ void pvclock_read_wallclock(struct pvclo
>  
>  	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
>  }
> +
> +static aligned_pvti_t *pvclock_vdso_info;
Is there a non-aligned pvti?

I would also add that "pvti" is not exactly the most descriptive name
I've seen.

> +
> +static struct pvclock_vsyscall_time_info *
> +pvclock_get_vsyscall_user_time_info(int cpu)
> +{
> +	if (pvclock_vdso_info == NULL) {
> +		BUG();
> +		return NULL;
> +	}
> +
> +	return &pvclock_vdso_info[cpu].info;
> +}
> +
if (!pvclock_vdso_info)
    BUG();
return &pvclock...

> +struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu)
> +{
> +	return &pvclock_get_vsyscall_user_time_info(cpu)->pvti;
> +}
> +
> +int pvclock_task_migrate(struct notifier_block *nb, unsigned long l, void *v)
> +{
> +	struct task_migration_notifier *mn = v;
> +	struct pvclock_vsyscall_time_info *pvti;
> +
> +	pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu);
> +
> +	if (pvti == NULL)
> +		return NOTIFY_DONE;
> +

When is it NULL? IIUC, this is when the vsyscall is disabled, right?
Would you mind adding comments for it?

Also, this is supposed to be an unlikely branch, right?

> +	pvti->migrate_count++;
> +
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block pvclock_migrate = {
> +	.notifier_call = pvclock_task_migrate,
> +};
> +
> +/*
> + * Initialize the generic pvclock vsyscall state.  This will allocate
> + * a/some page(s) for the per-vcpu pvclock information, set up a
> + * fixmap mapping for the page(s)
> + */
> +int __init pvclock_init_vsyscall(void)
> +{
> +	int idx;
> +	unsigned int size = PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE;
> +
> +	pvclock_vdso_info = __alloc_bootmem(size, PAGE_SIZE, 0);

This will be the most critical path for reading pvclock's time data. So
why can't we:
1) Make it per-node.
2) Of course, as a consequence, make sure all info structures in the
same page are in the same node?
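
A per-node variant along those lines might look like this (a sketch
which assumes pvclock_vdso_info becomes a per-node array and size is
the per-node allocation size; not code from the series):

	int nid;

	for_each_online_node(nid) {
		pvclock_vdso_info[nid] = __alloc_bootmem_node(NODE_DATA(nid),
							      size, PAGE_SIZE, 0);
		if (!pvclock_vdso_info[nid])
			return -ENOMEM;
	}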



> +	if (!pvclock_vdso_info)
> +		return -ENOMEM;
> +
> +	memset(pvclock_vdso_info, 0, size);
> +
> +	for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
> +		__set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
> +			     __pa_symbol(pvclock_vdso_info) + (idx*PAGE_SIZE),
> +			     PAGE_KERNEL_VVAR);
> +	}
> +
> +	register_task_migration_notifier(&pvclock_migrate);
> +
> +	return 0;
> +}
> Index: vsyscall/arch/x86/include/asm/fixmap.h
> ===================================================================
> --- vsyscall.orig/arch/x86/include/asm/fixmap.h
> +++ vsyscall/arch/x86/include/asm/fixmap.h
> @@ -19,6 +19,7 @@
>  #include <asm/acpi.h>
>  #include <asm/apicdef.h>
>  #include <asm/page.h>
> +#include <asm/pvclock.h>
>  #ifdef CONFIG_X86_32
>  #include <linux/threads.h>
>  #include <asm/kmap_types.h>
> @@ -81,6 +82,10 @@ enum fixed_addresses {
>  	VVAR_PAGE,
>  	VSYSCALL_HPET,
>  #endif
> +#ifdef CONFIG_PARAVIRT_CLOCK
> +	PVCLOCK_FIXMAP_BEGIN,
> +	PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
> +#endif
>  	FIX_DBGP_BASE,
>  	FIX_EARLYCON_MEM_BASE,
>  #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
> Index: vsyscall/arch/x86/include/asm/pvclock.h
> ===================================================================
> --- vsyscall.orig/arch/x86/include/asm/pvclock.h
> +++ vsyscall/arch/x86/include/asm/pvclock.h
> @@ -13,6 +13,8 @@ void pvclock_read_wallclock(struct pvclo
>  			    struct pvclock_vcpu_time_info *vcpu,
>  			    struct timespec *ts);
>  void pvclock_resume(void);
> +int __init pvclock_init_vsyscall(void);
> +struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
>  
>  /*
>   * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
> @@ -85,4 +87,17 @@ unsigned __pvclock_read_cycles(const str
>  	return version;
>  }
>  
> +struct pvclock_vsyscall_time_info {
> +	struct pvclock_vcpu_time_info pvti;
> +	u32 migrate_count;
> +};
> +
> +typedef union {
> +	struct pvclock_vsyscall_time_info info;
> +	char pad[SMP_CACHE_BYTES];
> +} aligned_pvti_t ____cacheline_aligned;

Please, just align pvclock_vsyscall_time_info. It is 10 thousand times
more descriptive.
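
That is, something like the following (sketch); the alignment attribute
on the struct itself also rounds sizeof() up to a cacheline multiple,
so the explicit pad goes away:

	struct pvclock_vsyscall_time_info {
		struct pvclock_vcpu_time_info pvti;
		u32 migrate_count;
	} ____cacheline_aligned;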


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-10-31 22:47   ` [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR Marcelo Tosatti
@ 2012-11-01 14:28     ` Glauber Costa
  2012-11-01 21:39       ` Marcelo Tosatti
  0 siblings, 1 reply; 94+ messages in thread
From: Glauber Costa @ 2012-11-01 14:28 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> Allow a guest to register a second location for the VCPU time info
> 
> structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
> This is intended to allow the guest kernel to map this information
> into a usermode accessible page, so that usermode can efficiently
> calculate system time from the TSC without having to make a syscall.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 

Changelog doesn't make a lot of sense. (especially from first line to the
second). Please add in here the reasons why we can't (or decided not to)
use the same page. The info in the last mail thread is good enough, just
put it here.


> Index: vsyscall/arch/x86/include/asm/kvm_para.h
> ===================================================================
> --- vsyscall.orig/arch/x86/include/asm/kvm_para.h
> +++ vsyscall/arch/x86/include/asm/kvm_para.h
> @@ -23,6 +23,7 @@
>  #define KVM_FEATURE_ASYNC_PF		4
>  #define KVM_FEATURE_STEAL_TIME		5
>  #define KVM_FEATURE_PV_EOI		6
> +#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7
>  
>  /* The last 8 bits are used to indicate how to interpret the flags field
>   * in pvclock structure. If no bits are set, all flags are ignored.
> @@ -39,6 +40,7 @@
>  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
>  #define MSR_KVM_STEAL_TIME  0x4b564d03
>  #define MSR_KVM_PV_EOI_EN      0x4b564d04
> +#define MSR_KVM_USERSPACE_TIME      0x4b564d05
>  

I accept that it is possible that we may be better off with the page
mapped twice.

But why do we need an extra MSR? When, and why, would you enable the
kernel-based pvclock, but disable the userspace pvclock?

I believe only the existing MSRs should be used for this. If you write
to it, and the host is capable of exporting userspace pvclock
information, then it does. If it isn't, then it doesn't.

No reason for the extra setup that is only likely to bring more headache.


> Index: vsyscall/arch/x86/kvm/cpuid.c
> ===================================================================
> --- vsyscall.orig/arch/x86/kvm/cpuid.c
> +++ vsyscall/arch/x86/kvm/cpuid.c
> @@ -411,7 +411,9 @@ static int do_cpuid_ent(struct kvm_cpuid
>  			     (1 << KVM_FEATURE_CLOCKSOURCE2) |
>  			     (1 << KVM_FEATURE_ASYNC_PF) |
>  			     (1 << KVM_FEATURE_PV_EOI) |
> -			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> +			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> +			     (1 << KVM_FEATURE_USERSPACE_CLOCKSOURCE);
> +
>  

The feature bit itself is obviously fine.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 10/16] x86: vdso: pvclock gettime support
  2012-10-31 22:47   ` [patch 10/16] x86: vdso: pvclock gettime support Marcelo Tosatti
@ 2012-11-01 14:41     ` Glauber Costa
  2012-11-01 21:42       ` Marcelo Tosatti
  2012-11-14 10:42     ` Gleb Natapov
  1 sibling, 1 reply; 94+ messages in thread
From: Glauber Costa @ 2012-11-01 14:41 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> +#ifdef CONFIG_PARAVIRT_CLOCK
> +
> +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> +{
> +	const aligned_pvti_t *pvti_base;
> +	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
> +	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
> +
> +	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
> +
> +	pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
> +
> +	return &pvti_base[offset].info;
> +}
> +
Does BUG_ON() really do what you believe it does while in userspace
context? We're not running with the kernel descriptors, so this will
probably just kill the process without any explanation.
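
For context: on x86, BUG() boils down to a ud2 (a sketch of the effect):

	asm volatile("ud2");	/* #UD: in kernel context this oopses;
				   from vdso/user context it is just a
				   SIGILL, with no descriptive message */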

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 01/16] KVM: x86: retain pvclock guest stopped bit in guest memory
  2012-11-01 10:39     ` Gleb Natapov
@ 2012-11-01 20:51       ` Marcelo Tosatti
  0 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-11-01 20:51 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, johnstul, jeremy, glommer, zamsden, avi, pbonzini

On Thu, Nov 01, 2012 at 12:39:30PM +0200, Gleb Natapov wrote:
> On Wed, Oct 31, 2012 at 08:46:57PM -0200, Marcelo Tosatti wrote:
> > Otherwise its possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as due to CPU
> > migration) to clear the bit.
> > 
> > Noticed by Paolo Bonzini.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> Reviewed-by: Gleb Natapov <gleb@redhat.com>
> 
> Small nitpick bellow.
> 
> Also do we really need to call kvm_guest_time_update() on a guest pause?
> Wouldn't separate request bit, which only sets the flag, suffice?

Management of the vcpu clock area had better be isolated in a single place.

> > Index: vsyscall/arch/x86/kvm/x86.c
> > ===================================================================
> > --- vsyscall.orig/arch/x86/kvm/x86.c
> > +++ vsyscall/arch/x86/kvm/x86.c
> > @@ -1143,6 +1143,7 @@ static int kvm_guest_time_update(struct 
> >  	unsigned long this_tsc_khz;
> >  	s64 kernel_ns, max_kernel_ns;
> >  	u64 tsc_timestamp;
> > +	struct pvclock_vcpu_time_info *guest_hv_clock;
> >  	u8 pvclock_flags;
> >  
> >  	/* Keep irq disabled to prevent changes to the clock */
> > @@ -1226,13 +1227,6 @@ static int kvm_guest_time_update(struct 
> >  	vcpu->last_kernel_ns = kernel_ns;
> >  	vcpu->last_guest_tsc = tsc_timestamp;
> >  
> > -	pvclock_flags = 0;
> > -	if (vcpu->pvclock_set_guest_stopped_request) {
> > -		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
> > -		vcpu->pvclock_set_guest_stopped_request = false;
> > -	}
> > -
> > -	vcpu->hv_clock.flags = pvclock_flags;
> >  
> >  	/*
> >  	 * The interface expects us to write an even number signaling that the
> > @@ -1243,6 +1237,18 @@ static int kvm_guest_time_update(struct 
> >  
> >  	shared_kaddr = kmap_atomic(vcpu->time_page);
> >  
> > +	guest_hv_clock = shared_kaddr + vcpu->time_offset;
> > +
> > +	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
> > +	pvclock_flags = (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
> > +
> > +	if (vcpu->pvclock_set_guest_stopped_request) {
> > +		pvclock_flags |= PVCLOCK_GUEST_STOPPED;
> > +		vcpu->pvclock_set_guest_stopped_request = false;
> > +	}
> > +
> > +	vcpu->hv_clock.flags = pvclock_flags;
> > +
> >  	memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
> We can use guest_hv_clock here now.
> 
> >  	       sizeof(vcpu->hv_clock));
> >  
> > 
> 
> --
> 			Gleb.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region
  2012-11-01 13:49       ` Glauber Costa
  2012-11-01 13:51         ` Gleb Natapov
@ 2012-11-01 20:56         ` Marcelo Tosatti
  2012-11-01 22:13           ` Gleb Natapov
  1 sibling, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-11-01 20:56 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Gleb Natapov, kvm, johnstul, jeremy, zamsden, avi, pbonzini

On Thu, Nov 01, 2012 at 05:49:51PM +0400, Glauber Costa wrote:
> On 11/01/2012 03:48 PM, Gleb Natapov wrote:
> > On Wed, Oct 31, 2012 at 08:46:58PM -0200, Marcelo Tosatti wrote:
> >> Originally from Jeremy Fitzhardinge.
> >>
> >> pvclock_get_time_values, which contains the memory barriers
> >> will be removed by next patch.
> >>
> >> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >>
> >> Index: vsyscall/arch/x86/kernel/pvclock.c
> >> ===================================================================
> >> --- vsyscall.orig/arch/x86/kernel/pvclock.c
> >> +++ vsyscall/arch/x86/kernel/pvclock.c
> >> @@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
> >>  
> >>  	do {
> >>  		version = pvclock_get_time_values(&shadow, src);
> >> -		barrier();
> >> +		rdtsc_barrier();
> >>  		offset = pvclock_get_nsec_offset(&shadow);
> >>  		ret = shadow.system_timestamp + offset;
> >> -		barrier();
> >> +		rdtsc_barrier();
> >>  	} while (version != src->version);
> >>  
> >>  	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> >>
> > On a guest without SSE2 rdtsc_barrier() will be nop while rmb() will
> > be "lock; addl $0,0(%%esp)". I doubt pvclock will work correctly either
> > way though.
> > 
> > --
> > 			Gleb.
> > 
> Actually it shouldn't matter for KVM, since the page is only updated by
> the vcpu, and the guest is never running while it happens. If Jeremy is
> fine with this, so should I.

17.13 TIME-STAMP COUNTER

"The RDTSC instruction is not serializing or ordered with other
instructions. It does not necessarily wait until all previous
instructions have been executed before reading the counter. Similarly,
subsequent instructions may begin execution before the RDTSC instruction
operation is performed."

Both instructions are TSC barriers. 



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 04/16] x86: pvclock: create helper for pvclock data retrieval
  2012-11-01 14:04     ` Glauber Costa
@ 2012-11-01 20:57       ` Marcelo Tosatti
  0 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-11-01 20:57 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On Thu, Nov 01, 2012 at 06:04:02PM +0400, Glauber Costa wrote:
> On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> > +static __always_inline
> > +unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
> > +			       cycle_t *cycles, u8 *flags)
> > +{
> > +	unsigned version;
> > +	cycle_t ret, offset;
> > +	u8 ret_flags;
> > +
> > +	version = src->version;
> > +	rdtsc_barrier();
> > +	offset = pvclock_get_nsec_offset(src);
> > +	ret = src->system_time + offset;
> > +	ret_flags = src->flags;
> > +	rdtsc_barrier();
> > +
> > +	*cycles = ret;
> > +	*flags = ret_flags;
> > +	return version;
> > +}
> > +
> This interface is a bit weird.
> The actual value you are interested in is "cycles", so why is it
> returned through the parameters? I think it would be clearer to have
> this return cycles, and &version as a parameter.

I disagree because

do {
	version = pvclock_read_cycles();
} while (version != src->version);

Looks fine.
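
Spelled out against the actual helper from patch 04, that loop reads
(an illustrative expansion, not code from the series):

	cycle_t cycles;
	unsigned version;
	u8 flags;

	do {
		version = __pvclock_read_cycles(src, &cycles, &flags);
	} while (version != src->version);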


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 05/16] x86: pvclock: introduce helper to read flags
  2012-11-01 14:07     ` Glauber Costa
@ 2012-11-01 21:08       ` Marcelo Tosatti
  0 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-11-01 21:08 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On Thu, Nov 01, 2012 at 06:07:45PM +0400, Glauber Costa wrote:
> On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > 
> > Index: vsyscall/arch/x86/kernel/pvclock.c
> If you are resending this series, I don't see a reason for this one to
> be in a separate patch.
> 
> The code itself is fine.

It's easier to review smaller patches.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-11-01 14:28     ` Glauber Costa
@ 2012-11-01 21:39       ` Marcelo Tosatti
  2012-11-02 10:23         ` Glauber Costa
  0 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-11-01 21:39 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On Thu, Nov 01, 2012 at 06:28:31PM +0400, Glauber Costa wrote:
> On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> > Allow a guest to register a second location for the VCPU time info
> > 
> > structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
> > This is intended to allow the guest kernel to map this information
> > into a usermode accessible page, so that usermode can efficiently
> > calculate system time from the TSC without having to make a syscall.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> 
> Changelog doesn't make a lot of sense. (especially from first line to the
> second). Please add in here the reasons why we can't (or decided not to)
> use the same page. The info in the last mail thread is good enough, just
> put it here.

Fixed.

> > Index: vsyscall/arch/x86/include/asm/kvm_para.h
> > ===================================================================
> > --- vsyscall.orig/arch/x86/include/asm/kvm_para.h
> > +++ vsyscall/arch/x86/include/asm/kvm_para.h
> > @@ -23,6 +23,7 @@
> >  #define KVM_FEATURE_ASYNC_PF		4
> >  #define KVM_FEATURE_STEAL_TIME		5
> >  #define KVM_FEATURE_PV_EOI		6
> > +#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7
> >  
> >  /* The last 8 bits are used to indicate how to interpret the flags field
> >   * in pvclock structure. If no bits are set, all flags are ignored.
> > @@ -39,6 +40,7 @@
> >  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
> >  #define MSR_KVM_STEAL_TIME  0x4b564d03
> >  #define MSR_KVM_PV_EOI_EN      0x4b564d04
> > +#define MSR_KVM_USERSPACE_TIME      0x4b564d05
> >  
> 
> I accept that it is possible that we may be better off with the page
> mapped twice.
> 
> But why do we need an extra MSR? When, and why, would you enable the
> kernel-based pvclock, but disable the userspace pvclock?

Because there is no stable TSC available, for example (an unstable TSC
cannot be used to measure passage of time).

> I believe only the existing MSRs should be used for this. If you write
> to it, and the host is capable of exporting userspace pvclock
> information, then it does. If it isn't, then it doesn't.
> 
> No reason for the extra setup that is only likely to bring more headache.

It is necessary.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 10/16] x86: vdso: pvclock gettime support
  2012-11-01 14:41     ` Glauber Costa
@ 2012-11-01 21:42       ` Marcelo Tosatti
  2012-11-02  0:33         ` Marcelo Tosatti
  0 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-11-01 21:42 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On Thu, Nov 01, 2012 at 06:41:46PM +0400, Glauber Costa wrote:
> On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> > +#ifdef CONFIG_PARAVIRT_CLOCK
> > +
> > +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> > +{
> > +	const aligned_pvti_t *pvti_base;
> > +	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
> > +	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
> > +
> > +	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
> > +
> > +	pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
> > +
> > +	return &pvti_base[offset].info;
> > +}
> > +
> Does BUG_ON() really do what you believe it does while in userspace
> context? We're not running with the kernel descriptors, so this will
> probably just kill the process without any explanation.

A coredump is generated which can be used to trace back to the ud2a
instruction in the vdso code.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region
  2012-11-01 20:56         ` Marcelo Tosatti
@ 2012-11-01 22:13           ` Gleb Natapov
  2012-11-01 22:21             ` Marcelo Tosatti
  0 siblings, 1 reply; 94+ messages in thread
From: Gleb Natapov @ 2012-11-01 22:13 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Glauber Costa, kvm, johnstul, jeremy, zamsden, avi, pbonzini

On Thu, Nov 01, 2012 at 06:56:11PM -0200, Marcelo Tosatti wrote:
> On Thu, Nov 01, 2012 at 05:49:51PM +0400, Glauber Costa wrote:
> > On 11/01/2012 03:48 PM, Gleb Natapov wrote:
> > > On Wed, Oct 31, 2012 at 08:46:58PM -0200, Marcelo Tosatti wrote:
> > >> Originally from Jeremy Fitzhardinge.
> > >>
> > >> pvclock_get_time_values, which contains the memory barriers
> > >> will be removed by next patch.
> > >>
> > >> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > >>
> > >> Index: vsyscall/arch/x86/kernel/pvclock.c
> > >> ===================================================================
> > >> --- vsyscall.orig/arch/x86/kernel/pvclock.c
> > >> +++ vsyscall/arch/x86/kernel/pvclock.c
> > >> @@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
> > >>  
> > >>  	do {
> > >>  		version = pvclock_get_time_values(&shadow, src);
> > >> -		barrier();
> > >> +		rdtsc_barrier();
> > >>  		offset = pvclock_get_nsec_offset(&shadow);
> > >>  		ret = shadow.system_timestamp + offset;
> > >> -		barrier();
> > >> +		rdtsc_barrier();
> > >>  	} while (version != src->version);
> > >>  
> > >>  	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> > >>
> > > On a guest without SSE2 rdtsc_barrier() will be nop while rmb() will
> > > be "lock; addl $0,0(%%esp)". I doubt pvclock will work correctly either
> > > way though.
> > > 
> > > --
> > > 			Gleb.
> > > 
> > Actually it shouldn't matter for KVM, since the page is only updated by
> > the vcpu, and the guest is never running while it happens. If Jeremy is
> > fine with this, so should I.
> 
> 17.13 TIME-STAMP COUNTER
> 
> "The RDTSC instruction is not serializing or ordered with other
> instructions. It does not necessarily wait until all previous
> instructions have been executed before reading the counter. Similarly,
> subsequent instructions may begin execution before the RDTSC instruction
> operation is performed."
> 
> Both instructions are TSC barriers. 
> 
Which both instructions?

--
			Gleb.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region
  2012-11-01 22:13           ` Gleb Natapov
@ 2012-11-01 22:21             ` Marcelo Tosatti
  2012-11-02  6:02               ` Gleb Natapov
  0 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-11-01 22:21 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Glauber Costa, kvm, johnstul, jeremy, zamsden, avi, pbonzini

On Fri, Nov 02, 2012 at 12:13:54AM +0200, Gleb Natapov wrote:
> On Thu, Nov 01, 2012 at 06:56:11PM -0200, Marcelo Tosatti wrote:
> > On Thu, Nov 01, 2012 at 05:49:51PM +0400, Glauber Costa wrote:
> > > On 11/01/2012 03:48 PM, Gleb Natapov wrote:
> > > > On Wed, Oct 31, 2012 at 08:46:58PM -0200, Marcelo Tosatti wrote:
> > > >> Originally from Jeremy Fitzhardinge.
> > > >>
> > > >> pvclock_get_time_values, which contains the memory barriers
> > > >> will be removed by next patch.
> > > >>
> > > >> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > >>
> > > >> Index: vsyscall/arch/x86/kernel/pvclock.c
> > > >> ===================================================================
> > > >> --- vsyscall.orig/arch/x86/kernel/pvclock.c
> > > >> +++ vsyscall/arch/x86/kernel/pvclock.c
> > > >> @@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
> > > >>  
> > > >>  	do {
> > > >>  		version = pvclock_get_time_values(&shadow, src);
> > > >> -		barrier();
> > > >> +		rdtsc_barrier();
> > > >>  		offset = pvclock_get_nsec_offset(&shadow);
> > > >>  		ret = shadow.system_timestamp + offset;
> > > >> -		barrier();
> > > >> +		rdtsc_barrier();
> > > >>  	} while (version != src->version);
> > > >>  
> > > >>  	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> > > >>
> > > > On a guest without SSE2 rdtsc_barrier() will be nop while rmb() will
> > > > be "lock; addl $0,0(%%esp)". I doubt pvclock will work correctly either
> > > > way though.
> > > > 
> > > > --
> > > > 			Gleb.
> > > > 
> > > Actually it shouldn't matter for KVM, since the page is only updated by
> > > the vcpu, and the guest is never running while it happens. If Jeremy is
> > > fine with this, so should I.
> > 
> > 17.13 TIME-STAMP COUNTER
> > 
> > "The RDTSC instruction is not serializing or ordered with other
> > instructions. It does not necessarily wait until all previous
> > instructions have been executed before reading the counter. Similarly,
> > subsequent instructions may begin execution before the RDTSC instruction
> > operation is performed."
> > 
> > Both instructions are TSC barriers. 
> > 
> Which both instructions?

static __always_inline void rdtsc_barrier(void)
{
        alternative(ASM_NOP3, "mfence", X86_FEATURE_MFENCE_RDTSC);
        alternative(ASM_NOP3, "lfence", X86_FEATURE_LFENCE_RDTSC);
}


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 10/16] x86: vdso: pvclock gettime support
  2012-11-01 21:42       ` Marcelo Tosatti
@ 2012-11-02  0:33         ` Marcelo Tosatti
  2012-11-02 10:25           ` Glauber Costa
  0 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-11-02  0:33 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On Thu, Nov 01, 2012 at 07:42:43PM -0200, Marcelo Tosatti wrote:
> On Thu, Nov 01, 2012 at 06:41:46PM +0400, Glauber Costa wrote:
> > On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> > > +#ifdef CONFIG_PARAVIRT_CLOCK
> > > +
> > > +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> > > +{
> > > +	const aligned_pvti_t *pvti_base;
> > > +	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
> > > +	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
> > > +
> > > +	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
> > > +
> > > +	pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
> > > +
> > > +	return &pvti_base[offset].info;
> > > +}
> > > +
> > Does BUG_ON() really do what you believe it does while in userspace
> > context? We're not running with the kernel descriptors, so this will
> > probably just kill the process without any explanation.
> 
> A coredump is generated which can be used to trace back to the ud2a
> instruction in the vdso code.

All comments have been addressed. Let me know if there is anything else
on v3 that you'd like to see done differently.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region
  2012-11-01 22:21             ` Marcelo Tosatti
@ 2012-11-02  6:02               ` Gleb Natapov
  0 siblings, 0 replies; 94+ messages in thread
From: Gleb Natapov @ 2012-11-02  6:02 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Glauber Costa, kvm, johnstul, jeremy, zamsden, avi, pbonzini

On Thu, Nov 01, 2012 at 08:21:51PM -0200, Marcelo Tosatti wrote:
> On Fri, Nov 02, 2012 at 12:13:54AM +0200, Gleb Natapov wrote:
> > On Thu, Nov 01, 2012 at 06:56:11PM -0200, Marcelo Tosatti wrote:
> > > On Thu, Nov 01, 2012 at 05:49:51PM +0400, Glauber Costa wrote:
> > > > On 11/01/2012 03:48 PM, Gleb Natapov wrote:
> > > > > On Wed, Oct 31, 2012 at 08:46:58PM -0200, Marcelo Tosatti wrote:
> > > > >> Originally from Jeremy Fitzhardinge.
> > > > >>
> > > > >> pvclock_get_time_values, which contains the memory barriers
> > > > >> will be removed by next patch.
> > > > >>
> > > > >> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > > >>
> > > > >> Index: vsyscall/arch/x86/kernel/pvclock.c
> > > > >> ===================================================================
> > > > >> --- vsyscall.orig/arch/x86/kernel/pvclock.c
> > > > >> +++ vsyscall/arch/x86/kernel/pvclock.c
> > > > >> @@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
> > > > >>  
> > > > >>  	do {
> > > > >>  		version = pvclock_get_time_values(&shadow, src);
> > > > >> -		barrier();
> > > > >> +		rdtsc_barrier();
> > > > >>  		offset = pvclock_get_nsec_offset(&shadow);
> > > > >>  		ret = shadow.system_timestamp + offset;
> > > > >> -		barrier();
> > > > >> +		rdtsc_barrier();
> > > > >>  	} while (version != src->version);
> > > > >>  
> > > > >>  	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> > > > >>
> > > > > On a guest without SSE2 rdtsc_barrier() will be nop while rmb() will
> > > > > be "lock; addl $0,0(%%esp)". I doubt pvclock will work correctly either
> > > > > way though.
> > > > > 
> > > > > --
> > > > > 			Gleb.
> > > > > 
> > > > Actually it shouldn't matter for KVM, since the page is only updated by
> > > > the vcpu, and the guest is never running while it happens. If Jeremy is
> > > > fine with this, so should I.
> > > 
> > > 17.13 TIME-STAMP COUNTER
> > > 
> > > "The RDTSC instruction is not serializing or ordered with other
> > > instructions. It does not necessarily wait until all previous
> > > instructions have been executed before reading the counter. Similarly,
> > > subsequent instructions may begin execution before the RDTSC instruction
> > > operation is performed."
> > > 
> > > Both instructions are TSC barriers. 
> > > 
> > Which both instructions?
> 
> static __always_inline void rdtsc_barrier(void)
> {
>         alternative(ASM_NOP3, "mfence", X86_FEATURE_MFENCE_RDTSC);
>         alternative(ASM_NOP3, "lfence", X86_FEATURE_LFENCE_RDTSC);
> }
Both of them will be patched to a nop if the guest does not have the SSE2 cpuid bit.
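
For readers following along: alternative(oldinstr, newinstr, feature)
patches newinstr in at boot only when the CPU advertises the feature
bit, so the two sites above resolve to (a sketch, assuming the usual
vendor mapping of the two synthetic feature bits):

	/*
	 *  X86_FEATURE_MFENCE_RDTSC (AMD, needs SSE2)  : mfence ; nop
	 *  X86_FEATURE_LFENCE_RDTSC (Intel, needs SSE2): nop    ; lfence
	 *  neither (no SSE2)                           : nop    ; nop
	 */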

--
			Gleb.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 09/16] x86: kvm guest: pvclock vsyscall support
  2012-10-31 22:47   ` [patch 09/16] x86: kvm guest: pvclock vsyscall support Marcelo Tosatti
@ 2012-11-02  9:42     ` Glauber Costa
  2012-11-05  8:35       ` Marcelo Tosatti
  0 siblings, 1 reply; 94+ messages in thread
From: Glauber Costa @ 2012-11-02  9:42 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> +	info = pvclock_get_vsyscall_time_info(cpu);
> +
> +	low = (int)__pa(info) | 1;
> +	high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
> +	ret = native_write_msr_safe(MSR_KVM_USERSPACE_TIME, low, high);
> +	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
> +	       cpu, high, low, txt);
> +

Why do you put info in the lower half, and the hv_clock in the higher half?


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-10-31  3:12           ` Marcelo Tosatti
@ 2012-11-02 10:21             ` Glauber Costa
  0 siblings, 0 replies; 94+ messages in thread
From: Glauber Costa @ 2012-11-02 10:21 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Avi Kivity, Jeremy Fitzhardinge, kvm, johnstul, zamsden, gleb, pbonzini

On 10/31/2012 07:12 AM, Marcelo Tosatti wrote:
> On Tue, Oct 30, 2012 at 11:39:32AM +0200, Avi Kivity wrote:
>> On 10/29/2012 08:40 PM, Marcelo Tosatti wrote:
>>> On Mon, Oct 29, 2012 at 10:44:41AM -0700, Jeremy Fitzhardinge wrote:
>>>> On 10/29/2012 07:45 AM, Glauber Costa wrote:
>>>>> On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
>>>>>> Allow a guest to register a second location for the VCPU time info
>>>>>>
>>>>>> structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
>>>>>> This is intended to allow the guest kernel to map this information
>>>>>> into a usermode accessible page, so that usermode can efficiently
>>>>>> calculate system time from the TSC without having to make a syscall.
>>>>>>
>>>>>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>>>>> Can you please be a bit more specific about why we need this? Why does
>>>>> the host need to provide us with two pages with the exact same data? Why
>>>>> can't we just do it with mapping tricks in the guest?
>>>>
>>>> In Xen the pvclock structure is embedded within a pile of other stuff
>>>> that shouldn't be mapped into guest memory, so providing for a second
>>>> location allows it to be placed wherever is convenient for the guest.
>>>> That's a restriction of the Xen ABI, but I don't know if it affects KVM.
>>>>
>>>>     J
>>>
>>> It is possible to share the data for KVM in theory, but:
>>>
>>> - It is a small amount of memory. 
>>> - It requires aligning to page size (the in-kernel percpu array 
>>> is currently cacheline aligned).
>>> - It is possible to modify flags separately for userspace/kernelspace,
>>> if desired.
>>>
>>> This justifies the duplication IMO (code is simple and clean).
>>>
>>
>> What would be the changes required to remove the duplication?  If it's
>> just page alignment, then is seems even smaller.  In addition we avoid
>> expanding the ABI again.
> 
> This would require changing the kernel copy from percpu data, which
> there is no guarantee is linear (necessary for fixmap mapping), to
> dynamically allocated (which in turn can be tricky due to early boot
> clock requirement).
> 
> Hum, no thanks.
> 
You allocate it from bootmem for the vsyscall anyway. If they are
strictly in the same physical location, you are not allocating anything
extra.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-11-01 21:39       ` Marcelo Tosatti
@ 2012-11-02 10:23         ` Glauber Costa
  2012-11-02 13:00           ` Marcelo Tosatti
  0 siblings, 1 reply; 94+ messages in thread
From: Glauber Costa @ 2012-11-02 10:23 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/02/2012 01:39 AM, Marcelo Tosatti wrote:
> On Thu, Nov 01, 2012 at 06:28:31PM +0400, Glauber Costa wrote:
>> On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
>>> Allow a guest to register a second location for the VCPU time info
>>>
>>> structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
>>> This is intended to allow the guest kernel to map this information
>>> into a usermode accessible page, so that usermode can efficiently
>>> calculate system time from the TSC without having to make a syscall.
>>>
>>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>>>
>>
>> Changelog doesn't make a lot of sense. (especially from first line to the
>> second). Please add in here the reasons why we can't (or decided not to)
>> use the same page. The info in the last mail thread is good enough, just
>> put it here.
> 
> Fixed.
> 
>>> Index: vsyscall/arch/x86/include/asm/kvm_para.h
>>> ===================================================================
>>> --- vsyscall.orig/arch/x86/include/asm/kvm_para.h
>>> +++ vsyscall/arch/x86/include/asm/kvm_para.h
>>> @@ -23,6 +23,7 @@
>>>  #define KVM_FEATURE_ASYNC_PF		4
>>>  #define KVM_FEATURE_STEAL_TIME		5
>>>  #define KVM_FEATURE_PV_EOI		6
>>> +#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7
>>>  
>>>  /* The last 8 bits are used to indicate how to interpret the flags field
>>>   * in pvclock structure. If no bits are set, all flags are ignored.
>>> @@ -39,6 +40,7 @@
>>>  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
>>>  #define MSR_KVM_STEAL_TIME  0x4b564d03
>>>  #define MSR_KVM_PV_EOI_EN      0x4b564d04
>>> +#define MSR_KVM_USERSPACE_TIME      0x4b564d05
>>>  
>>
>> I accept that it is possible that we may be better off with the page
>> mapped twice.
>>
>> But why do we need an extra MSR? When, and why, would you enable the
>> kernel-based pvclock, but disable the userspace pvclock?
> 
> Because there is no stable TSC available, for example (an unstable TSC
> cannot be used to measure passage of time).
> 

What you say is true, but completely unrelated. I am not talking about
situations in which userspace pvclock is available and you end up not
using it.

I am talking about situations in which it is available, you are capable
of using it, but then decide for some reason to permanently disable it -
as in not setting it up altogether.

It seems to me that if the host has code to deal with userspace pvclock,
and you already coded the guest in a way that you may or may not use it
(dependent on the value of the stable bit), you could very well only
check for the cpuid flag, and do the guest setup if available - skipping
this MSR dance altogether.
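
In code terms, the guest-side setup being floated here would shrink to
something like (a sketch of the proposal, not of the series):

	if (kvm_para_has_feature(KVM_FEATURE_USERSPACE_CLOCKSOURCE))
		pvclock_init_vsyscall();	/* data assumed to live at an
						   agreed, fixed location */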

Now, of course, there is the problem of communicating the address in
which the guest expects the page to be. Skipping the MSR setup would
require it to be more or less at a fixed location. We could in principle
lay them down together with the already existing pvclock structure. (But
granted, I am not sure it is worth it...)

I think in general, this question deserves a bit more of attention. We
are about to have just the perfect opportunity for this next week, so
let's use it.



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 10/16] x86: vdso: pvclock gettime support
  2012-11-02  0:33         ` Marcelo Tosatti
@ 2012-11-02 10:25           ` Glauber Costa
  0 siblings, 0 replies; 94+ messages in thread
From: Glauber Costa @ 2012-11-02 10:25 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/02/2012 04:33 AM, Marcelo Tosatti wrote:
> On Thu, Nov 01, 2012 at 07:42:43PM -0200, Marcelo Tosatti wrote:
>> On Thu, Nov 01, 2012 at 06:41:46PM +0400, Glauber Costa wrote:
>>> On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
>>>> +#ifdef CONFIG_PARAVIRT_CLOCK
>>>> +
>>>> +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>>>> +{
>>>> +	const aligned_pvti_t *pvti_base;
>>>> +	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
>>>> +	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
>>>> +
>>>> +	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
>>>> +
>>>> +	pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
>>>> +
>>>> +	return &pvti_base[offset].info;
>>>> +}
>>>> +
>>> Does BUG_ON() really do what you believe it does while in userspace
>>> context? We're not running with the kernel descriptors, so this will
>>> probably just kill the process without any explanation.
>>
>> A coredump is generated which can be used to trace back to the ud2a
>> instruction in the vdso code.
> 
> All comments have been addressed. Let me know if there is anything else
> on v3 that you'd like to see done differently.
> 
Mainly:

1) stick a "v3" string in the subject. You didn't do it for v2, and I
got confused at some points while looking for the correct patches.

2) The changelogs are, in general, a bit poor. I've specifically pointed
to the ones that pop out, but I would appreciate it if you would go over
them again, making them more informative.

3) Please make sure Peter is okay with the proposed notifier change.

4) Please consider allocating memory with __alloc_bootmem_node instead.



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-11-02 10:23         ` Glauber Costa
@ 2012-11-02 13:00           ` Marcelo Tosatti
  2012-11-05  8:03             ` Glauber Costa
  0 siblings, 1 reply; 94+ messages in thread
From: Marcelo Tosatti @ 2012-11-02 13:00 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On Fri, Nov 02, 2012 at 02:23:06PM +0400, Glauber Costa wrote:
> On 11/02/2012 01:39 AM, Marcelo Tosatti wrote:
> > On Thu, Nov 01, 2012 at 06:28:31PM +0400, Glauber Costa wrote:
> >> On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> >>> Allow a guest to register a second location for the VCPU time info
> >>>
> >>> structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
> >>> This is intended to allow the guest kernel to map this information
> >>> into a usermode accessible page, so that usermode can efficiently
> >>> calculate system time from the TSC without having to make a syscall.
> >>>
> >>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >>>
> >>
> >> Changelog doesn't make a lot of sense. (especially from first line to the
> >> second). Please add in here the reasons why we can't (or decided not to)
> >> use the same page. The info in the last mail thread is good enough, just
> >> put it here.
> > 
> > Fixed.
> > 
> >>> Index: vsyscall/arch/x86/include/asm/kvm_para.h
> >>> ===================================================================
> >>> --- vsyscall.orig/arch/x86/include/asm/kvm_para.h
> >>> +++ vsyscall/arch/x86/include/asm/kvm_para.h
> >>> @@ -23,6 +23,7 @@
> >>>  #define KVM_FEATURE_ASYNC_PF		4
> >>>  #define KVM_FEATURE_STEAL_TIME		5
> >>>  #define KVM_FEATURE_PV_EOI		6
> >>> +#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7
> >>>  
> >>>  /* The last 8 bits are used to indicate how to interpret the flags field
> >>>   * in pvclock structure. If no bits are set, all flags are ignored.
> >>> @@ -39,6 +40,7 @@
> >>>  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
> >>>  #define MSR_KVM_STEAL_TIME  0x4b564d03
> >>>  #define MSR_KVM_PV_EOI_EN      0x4b564d04
> >>> +#define MSR_KVM_USERSPACE_TIME      0x4b564d05
> >>>  
> >>
> >> I accept that it is possible that we may be better off with the page
> >> mapped twice.
> >>
> >> But why do we need an extra MSR? When, and why, would you enable the
> >> kernel-based pvclock, but disable the userspace pvclock?
> > 
> > Because there is no stable TSC available, for example (an unstable TSC
> > cannot be used to measure passage of time).
> > 
> 
> What you say is true, but completely unrelated. I am not talking about
> situations in which userspace pvclock is available and you end up not
> using it.
> 
> I am talking about situations in which it is available, you are capable
> of using it, but then decide for some reason to permanently disable it -
> as in not setting it up altogether.
> 
> It seems to me that if the host has code to deal with userspace pvclock,
> and you already coded the guest in a way that you may or may not use it
> (dependent on the value of the stable bit), you could very well only
> check for the cpuid flag, and do the guest setup if available - skipping
> this MSR dance altogether.
> 
> Now, of course, there is the problem of communicating the address in
> which the guest expects the page to be. Skipping the MSR setup would
> require it to be more or less at a fixed location. We could in principle
> lay them down together with the already existing pvclock structure. (But
> granted, I am not sure it is worth it...)
> 
> I think in general, this question deserves a bit more of attention. We
> are about to have just the perfect opportunity for this next week, so
> let's use it.

In essence you are proposing a different interface to communicate the
"userspace vsyscall pvclock area", other than the MSR, right?

If so:

1. What is the problem with the MSR interface?
2. What advantage does this new interface (which honestly I do not
understand) provide?



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
  2012-11-02 13:00           ` Marcelo Tosatti
@ 2012-11-05  8:03             ` Glauber Costa
  0 siblings, 0 replies; 94+ messages in thread
From: Glauber Costa @ 2012-11-05  8:03 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On 11/02/2012 05:00 PM, Marcelo Tosatti wrote:
> On Fri, Nov 02, 2012 at 02:23:06PM +0400, Glauber Costa wrote:
>> On 11/02/2012 01:39 AM, Marcelo Tosatti wrote:
>>> On Thu, Nov 01, 2012 at 06:28:31PM +0400, Glauber Costa wrote:
>>>> On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
>>>>> Allow a guest to register a second location for the VCPU time info
>>>>>
>>>>> structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
>>>>> This is intended to allow the guest kernel to map this information
>>>>> into a usermode accessible page, so that usermode can efficiently
>>>>> calculate system time from the TSC without having to make a syscall.
>>>>>
>>>>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>>>>>
>>>>
>>>> Changelog doesn't make a lot of sense. (especially from first line to the
>>>> second). Please add in here the reasons why we can't (or decided not to)
>>>> use the same page. The info in the last mail thread is good enough, just
>>>> put it here.
>>>
>>> Fixed.
>>>
>>>>> Index: vsyscall/arch/x86/include/asm/kvm_para.h
>>>>> ===================================================================
>>>>> --- vsyscall.orig/arch/x86/include/asm/kvm_para.h
>>>>> +++ vsyscall/arch/x86/include/asm/kvm_para.h
>>>>> @@ -23,6 +23,7 @@
>>>>>  #define KVM_FEATURE_ASYNC_PF		4
>>>>>  #define KVM_FEATURE_STEAL_TIME		5
>>>>>  #define KVM_FEATURE_PV_EOI		6
>>>>> +#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7
>>>>>  
>>>>>  /* The last 8 bits are used to indicate how to interpret the flags field
>>>>>   * in pvclock structure. If no bits are set, all flags are ignored.
>>>>> @@ -39,6 +40,7 @@
>>>>>  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
>>>>>  #define MSR_KVM_STEAL_TIME  0x4b564d03
>>>>>  #define MSR_KVM_PV_EOI_EN      0x4b564d04
>>>>> +#define MSR_KVM_USERSPACE_TIME      0x4b564d05
>>>>>  
>>>>
>>>> I accept that it is possible that we may be better off with the page
>>>> mapped twice.
>>>>
>>>> But why do we need an extra MSR? When, and why, would you enable the
>>>> kernel-based pvclock, but disable the userspace pvclock?
>>>
>>> Because there is no stable TSC available, for example (an unstable TSC
>>> cannot be used to measure passage of time).
>>>
>>
>> What you say is true, but completely unrelated. I am not talking about
>> situations in which userspace pvclock is available and you end up not
>> using it.
>>
>> I am talking about situations in which it is available, you are capable
>> of using it, but then decide for some reason to permanently disable it -
>> as in not setting it up altogether.
>>
>> It seems to me that if the host has code to deal with userspace pvclock,
>> and you already coded the guest in a way that you may or may not use it
>> (dependent on the value of the stable bit), you could very well only
>> check for the cpuid flag, and do the guest setup if available - skipping
>> this MSR dance altogether.
>>
>> Now, of course, there is the problem of communicating the address in
>> which the guest expects the page to be. Skipping the MSR setup would
>> require it to be more or less at a fixed location. We could in principle
>> lay them down together with the already existing pvclock structure. (But
>> granted, I am not sure it is worth it...)
>>
>> I think in general, this question deserves a bit more of attention. We
>> are about to have just the perfect opportunity for this next week, so
>> let's use it.
> 
> In essence you are proposing a different interface to communicate the
> "userspace vsyscall pvclock area", other than the MSR, right?
> 
> If so:
> 
> 1. What is the problem with the MSR interface?
> 2. What advantage does this new interface (which honestly I do not
> understand) provide?
> 
No, I am not proposing a different interface to communicate this.
I am proposing *no* interface to communicate this.

I'll grant that if it is absolutely necessary to have it at a random
address, and you need to pass it back and forth, then of course MSRs
would be the choice. But I am not yet terribly convinced that all the
pain of syncing this from userspace, migrating those values, and
bookkeeping what is on and what is not, is worth it.



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 09/16] x86: kvm guest: pvclock vsyscall support
  2012-11-02  9:42     ` Glauber Costa
@ 2012-11-05  8:35       ` Marcelo Tosatti
  0 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-11-05  8:35 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, johnstul, jeremy, zamsden, gleb, avi, pbonzini

On Fri, Nov 02, 2012 at 01:42:41PM +0400, Glauber Costa wrote:
> On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
> > +	info = pvclock_get_vsyscall_time_info(cpu);
> > +
> > +	low = (int)__pa(info) | 1;
> > +	high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
> > +	ret = native_write_msr_safe(MSR_KVM_USERSPACE_TIME, low, high);
> > +	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
> > +	       cpu, high, low, txt);
> > +
> 
> Why do you put info in the lower half, and the hv_clock in the higher half?

Copy&paste bug. Fixed, thanks.
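
Presumably the fix derives both MSR halves from the same physical
address (a sketch of the corrected computation, not the final patch):

	u64 pa = __pa(info);

	low  = (int)pa | 1;	/* bit 0 presumably the enable bit, as
				   with MSR_KVM_SYSTEM_TIME_NEW */
	high = (u32)(pa >> 32);
	ret  = native_write_msr_safe(MSR_KVM_USERSPACE_TIME, low, high);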


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 14/18] time: export time information for KVM pvclock
  2012-10-24 13:13 ` [patch 14/18] time: export time information for KVM pvclock Marcelo Tosatti
@ 2012-11-10  1:02   ` John Stultz
  2012-11-13 21:07     ` Marcelo Tosatti
  0 siblings, 1 reply; 94+ messages in thread
From: John Stultz @ 2012-11-10  1:02 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, jeremy, glommer, zamsden, gleb, avi, pbonzini

On 10/24/2012 06:13 AM, Marcelo Tosatti wrote:
> As suggested by John, export time data similarly to how it's
> done by vsyscall support. This allows KVM to retrieve necessary
> information to implement vsyscall support in KVM guests.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Thanks Marcelo, I like this much better than what you were proposing
privately earlier!

Fairly minor nit below.

> Index: vsyscall/kernel/time/timekeeping.c
> ===================================================================
> --- vsyscall.orig/kernel/time/timekeeping.c
> +++ vsyscall/kernel/time/timekeeping.c
> @@ -21,6 +21,7 @@
>   #include <linux/time.h>
>   #include <linux/tick.h>
>   #include <linux/stop_machine.h>
> +#include <linux/pvclock_gtod.h>
>
>
>   static struct timekeeper timekeeper;
> @@ -180,6 +181,79 @@ static inline s64 timekeeping_get_ns_raw
>   	return nsec + arch_gettimeoffset();
>   }
>
> +static RAW_NOTIFIER_HEAD(pvclock_gtod_chain);
> +
> +/**
> + * pvclock_gtod_register_notifier - register a pvclock timedata update listener
> + *
> + * Must hold write on timekeeper.lock
> + */
> +int pvclock_gtod_register_notifier(struct notifier_block *nb)
> +{
> +	struct timekeeper *tk = &timekeeper;
> +	unsigned long flags;
> +	int ret;
> +
> +	write_seqlock_irqsave(&tk->lock, flags);
> +	ret = raw_notifier_chain_register(&pvclock_gtod_chain, nb);
> +	write_sequnlock_irqrestore(&tk->lock, flags);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pvclock_gtod_register_notifier);
> +
> +/**
> + * pvclock_gtod_unregister_notifier - unregister a pvclock
> + * timedata update listener
> + *
> + * Must hold write on timekeeper.lock
> + */
> +int pvclock_gtod_unregister_notifier(struct notifier_block *nb)
> +{
> +	struct timekeeper *tk = &timekeeper;
> +	unsigned long flags;
> +	int ret;
> +
> +	write_seqlock_irqsave(&tk->lock, flags);
> +	ret = raw_notifier_chain_unregister(&pvclock_gtod_chain, nb);
> +	write_sequnlock_irqrestore(&tk->lock, flags);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pvclock_gtod_unregister_notifier);
> +
> +struct pvclock_gtod_data pvclock_gtod_data;
> +EXPORT_SYMBOL_GPL(pvclock_gtod_data);
> +
> +static void update_pvclock_gtod(struct timekeeper *tk)
> +{
> +	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
> +
> +	write_seqcount_begin(&vdata->seq);
> +
> +	/* copy pvclock gtod data */
> +	vdata->clock.vclock_mode	= tk->clock->archdata.vclock_mode;
> +	vdata->clock.cycle_last		= tk->clock->cycle_last;
> +	vdata->clock.mask		= tk->clock->mask;
> +	vdata->clock.mult		= tk->mult;
> +	vdata->clock.shift		= tk->shift;
> +
> +	vdata->monotonic_time_sec	= tk->xtime_sec
> +					+ tk->wall_to_monotonic.tv_sec;
> +	vdata->monotonic_time_snsec	= tk->xtime_nsec
> +					+ (tk->wall_to_monotonic.tv_nsec
> +						<< tk->shift);
> +	while (vdata->monotonic_time_snsec >=
> +					(((u64)NSEC_PER_SEC) << tk->shift)) {
> +		vdata->monotonic_time_snsec -=
> +					((u64)NSEC_PER_SEC) << tk->shift;
> +		vdata->monotonic_time_sec++;
> +	}
> +
> +	write_seqcount_end(&vdata->seq);
> +	raw_notifier_call_chain(&pvclock_gtod_chain, 0, NULL);
> +}
> +

My only request: could update_pvclock_gtod() be implemented similarly to 
update_vsyscall(), where the update function lives in the pvclock code 
(maybe using a weak symbol or something), so we don't have to carry all 
these pvclock details in the timekeeping core?
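Roughly this shape (just a sketch of what I mean; the file split and the
weak/strong pairing are assumptions, not working code from the series):

/* kernel/time/timekeeping.c: empty weak default */
void __weak update_pvclock_gtod(struct timekeeper *tk)
{
}

/* ... and call it from timekeeping_update(), next to update_vsyscall() */

/* pvclock code: the strong definition overrides the weak default */
void update_pvclock_gtod(struct timekeeper *tk)
{
	/*
	 * Copy tk into pvclock_gtod_data under the seqcount and fire
	 * the notifier chain, exactly as the hunk above does.
	 */
}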

thanks
-john



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 14/18] time: export time information for KVM pvclock
  2012-11-10  1:02   ` John Stultz
@ 2012-11-13 21:07     ` Marcelo Tosatti
  0 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-11-13 21:07 UTC (permalink / raw)
  To: John Stultz; +Cc: kvm, jeremy, glommer, zamsden, gleb, avi, pbonzini

On Fri, Nov 09, 2012 at 05:02:52PM -0800, John Stultz wrote:
> On 10/24/2012 06:13 AM, Marcelo Tosatti wrote:
> >As suggested by John, export time data similarly to how it's
> >done by vsyscall support. This allows KVM to retrieve necessary
> >information to implement vsyscall support in KVM guests.
> >
> >Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> Thanks Marcelo, I like this much better than what you were proposing
> privately earlier!
> 
> Fairly minor nit below.
> 
> >[... timekeeping.c hunk trimmed; full patch quoted upthread ...]
> >+static void update_pvclock_gtod(struct timekeeper *tk)
> >+{
> >+	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
> >+
> >+	write_seqcount_begin(&vdata->seq);
> >+
> >+	/* copy pvclock gtod data */
> >+	vdata->clock.vclock_mode	= tk->clock->archdata.vclock_mode;
> >+	vdata->clock.cycle_last		= tk->clock->cycle_last;
> >+	vdata->clock.mask		= tk->clock->mask;
> >+	vdata->clock.mult		= tk->mult;
> >+	vdata->clock.shift		= tk->shift;
> >+
> >+	vdata->monotonic_time_sec	= tk->xtime_sec
> >+					+ tk->wall_to_monotonic.tv_sec;
> >+	vdata->monotonic_time_snsec	= tk->xtime_nsec
> >+					+ (tk->wall_to_monotonic.tv_nsec
> >+						<< tk->shift);
> >+	while (vdata->monotonic_time_snsec >=
> >+					(((u64)NSEC_PER_SEC) << tk->shift)) {
> >+		vdata->monotonic_time_snsec -=
> >+					((u64)NSEC_PER_SEC) << tk->shift;
> >+		vdata->monotonic_time_sec++;
> >+	}
> >+
> >+	write_seqcount_end(&vdata->seq);
> >+	raw_notifier_call_chain(&pvclock_gtod_chain, 0, NULL);
> >+}
> >+
> 
> My only request: could update_pvclock_gtod() be implemented similarly
> to update_vsyscall(), where the update function lives in the pvclock
> code (maybe using a weak symbol or something), so we don't have to
> carry all these pvclock details in the timekeeping core?
> 
> thanks
> -john

In KVM code, yes, no problem.
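Something along these lines on the KVM side (a sketch only; the callback
body is a placeholder, not the actual implementation):

static int pvclock_gtod_notify(struct notifier_block *nb,
			       unsigned long unused, void *priv)
{
	/* host timekeeping changed; refresh KVM's copy of the data */
	return NOTIFY_DONE;
}

static struct notifier_block pvclock_gtod_notifier = {
	.notifier_call	= pvclock_gtod_notify,
};

/* during KVM init: */
pvclock_gtod_register_notifier(&pvclock_gtod_notifier);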


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 10/16] x86: vdso: pvclock gettime support
  2012-10-31 22:47   ` [patch 10/16] x86: vdso: pvclock gettime support Marcelo Tosatti
  2012-11-01 14:41     ` Glauber Costa
@ 2012-11-14 10:42     ` Gleb Natapov
  2012-11-14 22:42       ` Marcelo Tosatti
  1 sibling, 1 reply; 94+ messages in thread
From: Gleb Natapov @ 2012-11-14 10:42 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, johnstul, jeremy, glommer, zamsden, avi, pbonzini

On Wed, Oct 31, 2012 at 08:47:06PM -0200, Marcelo Tosatti wrote:
> Improve performance of time system calls when using Linux pvclock by
> reading the time info from the fixmap-visible copy of the pvclock data.
> 
> Originally from Jeremy Fitzhardinge.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: vsyscall/arch/x86/vdso/vclock_gettime.c
> ===================================================================
> --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
> +++ vsyscall/arch/x86/vdso/vclock_gettime.c
> @@ -22,6 +22,7 @@
>  #include <asm/hpet.h>
>  #include <asm/unistd.h>
>  #include <asm/io.h>
> +#include <asm/pvclock.h>
>  
>  #define gtod (&VVAR(vsyscall_gtod_data))
>  
> @@ -62,6 +63,70 @@ static notrace cycle_t vread_hpet(void)
>  	return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
>  }
>  
> +#ifdef CONFIG_PARAVIRT_CLOCK
> +
> +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> +{
> +	const aligned_pvti_t *pvti_base;
> +	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
> +	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
> +
> +	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
> +
> +	pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
> +
> +	return &pvti_base[offset].info;
> +}
> +
> +static notrace cycle_t vread_pvclock(int *mode)
> +{
> +	const struct pvclock_vsyscall_time_info *pvti;
> +	cycle_t ret;
> +	u64 last;
> +	u32 version;
> +	u32 migrate_count;
> +	u8 flags;
> +	unsigned cpu, cpu1;
> +
> +
> +	/*
> +	 * When looping to get a consistent (time-info, tsc) pair, we
> +	 * also need to deal with the possibility we can switch vcpus,
> +	 * so make sure we always re-fetch time-info for the current vcpu.
> +	 */
> +	do {
> +		cpu = __getcpu() & VGETCPU_CPU_MASK;
> +		pvti = get_pvti(cpu);
> +
> +		migrate_count = pvti->migrate_count;
> +
> +		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> +
> +		/*
> +		 * Test we're still on the cpu as well as the version.
> +		 * We could have been migrated just after the first
> +		 * vgetcpu but before fetching the version, so we
> +		 * wouldn't notice a version change.
> +		 */
> +		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> +	} while (unlikely(cpu != cpu1 ||
> +			  (pvti->pvti.version & 1) ||
> +			  pvti->pvti.version != version ||
> +			  pvti->migrate_count != migrate_count));
> +
We can put the vcpu id into the higher bits of pvti.version. That would
save a couple of cycles by getting rid of the __getcpu() calls.
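E.g. (just a sketch; the bit split and the helper names are made up, and
the low "update in progress" bit has to stay where it is):

#define PVTI_VERSION_BITS	20
#define PVTI_VERSION_MASK	((1u << PVTI_VERSION_BITS) - 1)

/* host side: fold the vcpu id in when publishing a new version */
static inline u32 pvti_make_version(u32 seq, u32 vcpu)
{
	return (vcpu << PVTI_VERSION_BITS) | (seq & PVTI_VERSION_MASK);
}

/* guest side: a single version read now yields the cpu as well */
static inline u32 pvti_version_to_cpu(u32 version)
{
	return version >> PVTI_VERSION_BITS;
}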

> +	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +		*mode = VCLOCK_NONE;
> +
> +	/* refer to tsc.c read_tsc() comment for rationale */
> +	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
> +
> +	if (likely(ret >= last))
> +		return ret;
> +
> +	return last;
> +}
> +#endif
> +
>  notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
>  {
>  	long ret;
> @@ -80,7 +145,7 @@ notrace static long vdso_fallback_gtod(s
>  }
>  
>  
> -notrace static inline u64 vgetsns(void)
> +notrace static inline u64 vgetsns(int *mode)
>  {
>  	long v;
>  	cycles_t cycles;
> @@ -88,6 +153,8 @@ notrace static inline u64 vgetsns(void)
>  		cycles = vread_tsc();
>  	else if (gtod->clock.vclock_mode == VCLOCK_HPET)
>  		cycles = vread_hpet();
> +	else if (gtod->clock.vclock_mode == VCLOCK_PVCLOCK)
> +		cycles = vread_pvclock(mode);
>  	else
>  		return 0;
>  	v = (cycles - gtod->clock.cycle_last) & gtod->clock.mask;
> @@ -107,7 +174,7 @@ notrace static int __always_inline do_re
>  		mode = gtod->clock.vclock_mode;
>  		ts->tv_sec = gtod->wall_time_sec;
>  		ns = gtod->wall_time_snsec;
> -		ns += vgetsns();
> +		ns += vgetsns(&mode);
>  		ns >>= gtod->clock.shift;
>  	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
>  
> @@ -127,7 +194,7 @@ notrace static int do_monotonic(struct t
>  		mode = gtod->clock.vclock_mode;
>  		ts->tv_sec = gtod->monotonic_time_sec;
>  		ns = gtod->monotonic_time_snsec;
> -		ns += vgetsns();
> +		ns += vgetsns(&mode);
>  		ns >>= gtod->clock.shift;
>  	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
>  	timespec_add_ns(ts, ns);
> Index: vsyscall/arch/x86/include/asm/vsyscall.h
> ===================================================================
> --- vsyscall.orig/arch/x86/include/asm/vsyscall.h
> +++ vsyscall/arch/x86/include/asm/vsyscall.h
> @@ -33,6 +33,23 @@ extern void map_vsyscall(void);
>   */
>  extern bool emulate_vsyscall(struct pt_regs *regs, unsigned long address);
>  
> +#define VGETCPU_CPU_MASK 0xfff
> +
> +static inline unsigned int __getcpu(void)
> +{
> +	unsigned int p;
> +
> +	if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
> +		/* Load per CPU data from RDTSCP */
> +		native_read_tscp(&p);
> +	} else {
> +		/* Load per CPU data from GDT */
> +		asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
> +	}
> +
> +	return p;
> +}
> +
>  #endif /* __KERNEL__ */
>  
>  #endif /* _ASM_X86_VSYSCALL_H */
> Index: vsyscall/arch/x86/vdso/vgetcpu.c
> ===================================================================
> --- vsyscall.orig/arch/x86/vdso/vgetcpu.c
> +++ vsyscall/arch/x86/vdso/vgetcpu.c
> @@ -17,15 +17,10 @@ __vdso_getcpu(unsigned *cpu, unsigned *n
>  {
>  	unsigned int p;
>  
> -	if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
> -		/* Load per CPU data from RDTSCP */
> -		native_read_tscp(&p);
> -	} else {
> -		/* Load per CPU data from GDT */
> -		asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
> -	}
> +	p = __getcpu();
> +
>  	if (cpu)
> -		*cpu = p & 0xfff;
> +		*cpu = p & VGETCPU_CPU_MASK;
>  	if (node)
>  		*node = p >> 12;
>  	return 0;
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [patch 10/16] x86: vdso: pvclock gettime support
  2012-11-14 10:42     ` Gleb Natapov
@ 2012-11-14 22:42       ` Marcelo Tosatti
  0 siblings, 0 replies; 94+ messages in thread
From: Marcelo Tosatti @ 2012-11-14 22:42 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, johnstul, jeremy, glommer, zamsden, avi, pbonzini

On Wed, Nov 14, 2012 at 12:42:48PM +0200, Gleb Natapov wrote:
> On Wed, Oct 31, 2012 at 08:47:06PM -0200, Marcelo Tosatti wrote:
> > Improve performance of time system calls when using Linux pvclock by
> > reading the time info from the fixmap-visible copy of the pvclock data.
> > 
> > Originally from Jeremy Fitzhardinge.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > [... vclock_gettime.c hunk trimmed; full patch quoted upthread ...]
> > +	/*
> > +	 * When looping to get a consistent (time-info, tsc) pair, we
> > +	 * also need to deal with the possibility we can switch vcpus,
> > +	 * so make sure we always re-fetch time-info for the current vcpu.
> > +	 */
> > +	do {
> > +		cpu = __getcpu() & VGETCPU_CPU_MASK;
> > +		pvti = get_pvti(cpu);
> > +
> > +		migrate_count = pvti->migrate_count;
> > +
> > +		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> > +
> > +		/*
> > +		 * Test we're still on the cpu as well as the version.
> > +		 * We could have been migrated just after the first
> > +		 * vgetcpu but before fetching the version, so we
> > +		 * wouldn't notice a version change.
> > +		 */
> > +		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> > +	} while (unlikely(cpu != cpu1 ||
> > +			  (pvti->pvti.version & 1) ||
> > +			  pvti->pvti.version != version ||
> > +			  pvti->migrate_count != migrate_count));
> > +
> We can put the vcpu id into the higher bits of pvti.version. That would
> save a couple of cycles by getting rid of the __getcpu() calls.

Yes. Added as a comment in the code.



^ permalink raw reply	[flat|nested] 94+ messages in thread

end of thread

Thread overview: 94+ messages
2012-10-24 13:13 [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v2) Marcelo Tosatti
2012-10-24 13:13 ` [patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
2012-10-24 13:13 ` [patch 02/18] x86: pvclock: make sure rdtsc doesnt speculate out of region Marcelo Tosatti
2012-10-24 13:13 ` [patch 03/18] x86: pvclock: remove pvclock_shadow_time Marcelo Tosatti
2012-10-30  9:23   ` Avi Kivity
2012-10-30  9:24     ` Avi Kivity
2012-10-24 13:13 ` [patch 04/18] x86: pvclock: create helper for pvclock data retrieval Marcelo Tosatti
2012-10-24 13:13 ` [patch 05/18] x86: pvclock: fix flags usage race Marcelo Tosatti
2012-10-24 13:13 ` [patch 06/18] x86: pvclock: introduce helper to read flags Marcelo Tosatti
2012-10-24 13:13 ` [patch 07/18] sched: add notifier for cross-cpu migrations Marcelo Tosatti
2012-10-24 13:13 ` [patch 08/18] x86: pvclock: generic pvclock vsyscall initialization Marcelo Tosatti
2012-10-29 14:18   ` Glauber Costa
2012-10-29 14:54     ` Marcelo Tosatti
2012-10-29 17:46       ` Jeremy Fitzhardinge
2012-10-29 14:39   ` Glauber Costa
2012-10-24 13:13 ` [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR Marcelo Tosatti
2012-10-29 14:45   ` Glauber Costa
2012-10-29 17:44     ` Jeremy Fitzhardinge
2012-10-29 18:40       ` Marcelo Tosatti
2012-10-30  7:41         ` Glauber Costa
2012-10-30  9:39         ` Avi Kivity
2012-10-31  3:12           ` Marcelo Tosatti
2012-11-02 10:21             ` Glauber Costa
2012-10-30  7:38       ` Glauber Costa
2012-10-24 13:13 ` [patch 10/18] x86: kvm guest: pvclock vsyscall support Marcelo Tosatti
2012-10-24 13:13 ` [patch 11/18] x86: vsyscall: pass mode to gettime backend Marcelo Tosatti
2012-10-29 14:47   ` Glauber Costa
2012-10-29 18:41     ` Marcelo Tosatti
2012-10-30  7:42       ` Glauber Costa
2012-10-24 13:13 ` [patch 12/18] x86: vdso: pvclock gettime support Marcelo Tosatti
2012-10-29 14:59   ` Glauber Costa
2012-10-29 18:42     ` Marcelo Tosatti
2012-10-30  7:49       ` Glauber Costa
2012-10-31  3:16         ` Marcelo Tosatti
2012-10-24 13:13 ` [patch 13/18] KVM: x86: pass host_tsc to read_l1_tsc Marcelo Tosatti
2012-10-29 15:04   ` Glauber Costa
2012-10-29 18:45     ` Marcelo Tosatti
2012-10-30  7:55       ` Glauber Costa
2012-10-24 13:13 ` [patch 14/18] time: export time information for KVM pvclock Marcelo Tosatti
2012-11-10  1:02   ` John Stultz
2012-11-13 21:07     ` Marcelo Tosatti
2012-10-24 13:13 ` [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
2012-10-30  8:34   ` Glauber Costa
2012-10-31  3:19     ` [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
2012-10-24 13:13 ` [patch 16/18] KVM: x86: notifier for clocksource changes Marcelo Tosatti
2012-10-24 13:13 ` [patch 17/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization Marcelo Tosatti
2012-10-24 13:13 ` [patch 18/18] KVM: x86: require matched TSC offsets for master clock Marcelo Tosatti
2012-10-31 22:46 ` [patch 00/16] pvclock vsyscall support + KVM hypervisor support (v3) Marcelo Tosatti
2012-10-31 22:46   ` [patch 01/16] KVM: x86: retain pvclock guest stopped bit in guest memory Marcelo Tosatti
2012-11-01 10:39     ` Gleb Natapov
2012-11-01 20:51       ` Marcelo Tosatti
2012-11-01 13:44     ` Glauber Costa
2012-10-31 22:46   ` [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region Marcelo Tosatti
2012-11-01 11:48     ` Gleb Natapov
2012-11-01 13:49       ` Glauber Costa
2012-11-01 13:51         ` Gleb Natapov
2012-11-01 20:56         ` Marcelo Tosatti
2012-11-01 22:13           ` Gleb Natapov
2012-11-01 22:21             ` Marcelo Tosatti
2012-11-02  6:02               ` Gleb Natapov
2012-10-31 22:46   ` [patch 03/16] x86: pvclock: remove pvclock_shadow_time Marcelo Tosatti
2012-11-01 13:52     ` Glauber Costa
2012-10-31 22:47   ` [patch 04/16] x86: pvclock: create helper for pvclock data retrieval Marcelo Tosatti
2012-11-01 14:04     ` Glauber Costa
2012-11-01 20:57       ` Marcelo Tosatti
2012-10-31 22:47   ` [patch 05/16] x86: pvclock: introduce helper to read flags Marcelo Tosatti
2012-11-01 14:07     ` Glauber Costa
2012-11-01 21:08       ` Marcelo Tosatti
2012-10-31 22:47   ` [patch 06/16] sched: add notifier for cross-cpu migrations Marcelo Tosatti
2012-11-01 14:08     ` Glauber Costa
2012-10-31 22:47   ` [patch 07/16] x86: pvclock: generic pvclock vsyscall initialization Marcelo Tosatti
2012-11-01 14:19     ` Glauber Costa
2012-10-31 22:47   ` [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR Marcelo Tosatti
2012-11-01 14:28     ` Glauber Costa
2012-11-01 21:39       ` Marcelo Tosatti
2012-11-02 10:23         ` Glauber Costa
2012-11-02 13:00           ` Marcelo Tosatti
2012-11-05  8:03             ` Glauber Costa
2012-10-31 22:47   ` [patch 09/16] x86: kvm guest: pvclock vsyscall support Marcelo Tosatti
2012-11-02  9:42     ` Glauber Costa
2012-11-05  8:35       ` Marcelo Tosatti
2012-10-31 22:47   ` [patch 10/16] x86: vdso: pvclock gettime support Marcelo Tosatti
2012-11-01 14:41     ` Glauber Costa
2012-11-01 21:42       ` Marcelo Tosatti
2012-11-02  0:33         ` Marcelo Tosatti
2012-11-02 10:25           ` Glauber Costa
2012-11-14 10:42     ` Gleb Natapov
2012-11-14 22:42       ` Marcelo Tosatti
2012-10-31 22:47   ` [patch 11/16] KVM: x86: pass host_tsc to read_l1_tsc Marcelo Tosatti
2012-10-31 22:47   ` [patch 12/16] time: export time information for KVM pvclock Marcelo Tosatti
2012-10-31 22:47   ` [patch 13/16] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag Marcelo Tosatti
2012-10-31 22:47   ` [patch 14/16] KVM: x86: notifier for clocksource changes Marcelo Tosatti
2012-10-31 22:47   ` [patch 15/16] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization Marcelo Tosatti
2012-10-31 22:47   ` [patch 16/16] KVM: x86: require matched TSC offsets for master clock Marcelo Tosatti
