* [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups
@ 2014-12-23  0:39 Andy Lutomirski
  2014-12-23  0:39 ` [RFC 1/2] x86, vdso: Use asm volatile in __getcpu Andy Lutomirski
                   ` (5 more replies)
  0 siblings, 6 replies; 77+ messages in thread
From: Andy Lutomirski @ 2014-12-23  0:39 UTC (permalink / raw)
  To: Paolo Bonzini, Marcelo Tosatti
  Cc: Gleb Natapov, kvm list, linux-kernel, xen-devel, Andy Lutomirski

This is a dramatic simplification and speedup of the vdso pvclock read
code.  Is it correct?

Andy Lutomirski (2):
  x86, vdso: Use asm volatile in __getcpu
  x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

 arch/x86/include/asm/vgtod.h   |  6 ++--
 arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
 2 files changed, 51 insertions(+), 37 deletions(-)

-- 
2.1.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [RFC 1/2] x86, vdso: Use asm volatile in __getcpu
  2014-12-23  0:39 [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups Andy Lutomirski
  2014-12-23  0:39 ` [RFC 1/2] x86, vdso: Use asm volatile in __getcpu Andy Lutomirski
@ 2014-12-23  0:39 ` Andy Lutomirski
  2014-12-23  0:39 ` [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader Andy Lutomirski
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 77+ messages in thread
From: Andy Lutomirski @ 2014-12-23  0:39 UTC (permalink / raw)
  To: Paolo Bonzini, Marcelo Tosatti
  Cc: Gleb Natapov, kvm list, linux-kernel, xen-devel, Andy Lutomirski

In Linux 3.18 and below, GCC hoists the lsl instructions in the
pvclock code all the way to the beginning of __vdso_clock_gettime,
slowing the non-paravirt case significantly.  For unknown reasons,
presumably related to the removal of a branch, the performance issue
is gone as of

e76b027e6408 x86,vdso: Use LSL unconditionally for vgetcpu

but I don't trust GCC enough to expect the problem to stay fixed.

There should be no correctness issue, because the __getcpu calls in
__vdso_clock_gettime were never necessary in the first place.
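
For context, the practical difference between plain asm and asm volatile
here is whether GCC may treat the statement as a pure, reorderable
expression.  A minimal standalone sketch of the two forms (illustrative
only, not taken from the patch; the segment selector is passed in rather
than using the real __PER_CPU_SEG constant):

  /* Non-volatile: the asm has no declared side effects, so GCC may CSE
   * it or treat it as loop/branch-invariant and hoist it into the
   * caller, which is exactly the behavior described above. */
  static inline unsigned int lsl_limit_hoistable(unsigned int seg)
  {
          unsigned int p;
          asm("lsl %1,%0" : "=r" (p) : "r" (seg));
          return p;
  }

  /* Volatile: every execution is assumed to have side effects, so GCC
   * will not CSE it or treat it as invariant, and it stays ordered with
   * respect to barrier() at each call site. */
  static inline unsigned int lsl_limit_pinned(unsigned int seg)
  {
          unsigned int p;
          asm volatile("lsl %1,%0" : "=r" (p) : "r" (seg));
          return p;
  }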

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/vgtod.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index e7e9682a33e9..f556c4843aa1 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -80,9 +80,11 @@ static inline unsigned int __getcpu(void)
 
 	/*
 	 * Load per CPU data from GDT.  LSL is faster than RDTSCP and
-	 * works on all CPUs.
+	 * works on all CPUs.  This is volatile so that it orders
+	 * correctly wrt barrier() and to keep gcc from cleverly
+	 * hoisting it out of the calling function.
 	 */
-	asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
+	asm volatile ("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
 
 	return p;
 }
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2014-12-23  0:39 [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups Andy Lutomirski
  2014-12-23  0:39 ` [RFC 1/2] x86, vdso: Use asm volatile in __getcpu Andy Lutomirski
  2014-12-23  0:39 ` Andy Lutomirski
@ 2014-12-23  0:39 ` Andy Lutomirski
  2014-12-23 10:28   ` [Xen-devel] " David Vrabel
                     ` (8 more replies)
  2014-12-23  0:39 ` Andy Lutomirski
                   ` (2 subsequent siblings)
  5 siblings, 9 replies; 77+ messages in thread
From: Andy Lutomirski @ 2014-12-23  0:39 UTC (permalink / raw)
  To: Paolo Bonzini, Marcelo Tosatti
  Cc: Gleb Natapov, kvm list, linux-kernel, xen-devel, Andy Lutomirski

The pvclock vdso code was too abstracted to understand easily and
excessively paranoid.  Simplify it for a huge speedup.

This opens the door for additional simplifications, as the vdso no
longer accesses the pvti for any vcpu other than vcpu 0.

Before, vclock_gettime using kvm-clock took about 64ns on my machine.
With this change, it takes 19ns, which is almost as fast as the pure TSC
implementation.
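
The quoted numbers are per-call latencies measured in a KVM guest using
kvm-clock.  A minimal userspace sketch of how such a per-call figure can
be measured (illustrative only, not the benchmark behind the numbers
above; build with "gcc -O2", plus -lrt on pre-2.17 glibc):

  /* Times clock_gettime(CLOCK_MONOTONIC), which glibc routes through the
   * vdso, so the loop below exercises the vdso clock read path whether
   * the clocksource is kvm-clock or the raw TSC. */
  #include <stdio.h>
  #include <time.h>

  int main(void)
  {
          struct timespec ts, start, end;
          long i, n = 10 * 1000 * 1000;

          clock_gettime(CLOCK_MONOTONIC, &start);
          for (i = 0; i < n; i++)
                  clock_gettime(CLOCK_MONOTONIC, &ts);
          clock_gettime(CLOCK_MONOTONIC, &end);

          printf("%.1f ns/call\n",
                 ((end.tv_sec - start.tv_sec) * 1e9 +
                  (end.tv_nsec - start.tv_nsec)) / n);
          return 0;
  }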

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
 1 file changed, 47 insertions(+), 35 deletions(-)

diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 9793322751e0..f2e0396d5629 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
 
 static notrace cycle_t vread_pvclock(int *mode)
 {
-	const struct pvclock_vsyscall_time_info *pvti;
+	const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
 	cycle_t ret;
-	u64 last;
-	u32 version;
-	u8 flags;
-	unsigned cpu, cpu1;
-
+	u64 tsc, pvti_tsc;
+	u64 last, delta, pvti_system_time;
+	u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
 
 	/*
-	 * Note: hypervisor must guarantee that:
-	 * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
-	 * 2. that per-CPU pvclock time info is updated if the
-	 *    underlying CPU changes.
-	 * 3. that version is increased whenever underlying CPU
-	 *    changes.
+	 * Note: The kernel and hypervisor must guarantee that cpu ID
+	 * number maps 1:1 to per-CPU pvclock time info.
+	 *
+	 * Because the hypervisor is entirely unaware of guest userspace
+	 * preemption, it cannot guarantee that per-CPU pvclock time
+	 * info is updated if the underlying CPU changes or that that
+	 * version is increased whenever underlying CPU changes.
+	 *
+	 * On KVM, we are guaranteed that pvti updates for any vCPU are
+	 * atomic as seen by *all* vCPUs.  This is an even stronger
+	 * guarantee than we get with a normal seqlock.
 	 *
+	 * On Xen, we don't appear to have that guarantee, but Xen still
+	 * supplies a valid seqlock using the version field.
+
+	 * We only do pvclock vdso timing at all if
+	 * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
+	 * mean that all vCPUs have matching pvti and that the TSC is
+	 * synced, so we can just look at vCPU 0's pvti.
 	 */
-	do {
-		cpu = __getcpu() & VGETCPU_CPU_MASK;
-		/* TODO: We can put vcpu id into higher bits of pvti.version.
-		 * This will save a couple of cycles by getting rid of
-		 * __getcpu() calls (Gleb).
-		 */
-
-		pvti = get_pvti(cpu);
-
-		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
-
-		/*
-		 * Test we're still on the cpu as well as the version.
-		 * We could have been migrated just after the first
-		 * vgetcpu but before fetching the version, so we
-		 * wouldn't notice a version change.
-		 */
-		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
-	} while (unlikely(cpu != cpu1 ||
-			  (pvti->pvti.version & 1) ||
-			  pvti->pvti.version != version));
-
-	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+
+	if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
 		*mode = VCLOCK_NONE;
+		return 0;
+	}
+
+	do {
+		version = pvti->version;
+
+		/* This is also a read barrier, so we'll read version first. */
+		rdtsc_barrier();
+		tsc = __native_read_tsc();
+
+		pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
+		pvti_tsc_shift = pvti->tsc_shift;
+		pvti_system_time = pvti->system_time;
+		pvti_tsc = pvti->tsc_timestamp;
+
+		/* Make sure that the version double-check is last. */
+		smp_rmb();
+	} while (unlikely((version & 1) || version != pvti->version));
+
+	delta = tsc - pvti_tsc;
+	ret = pvti_system_time +
+		pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
+				    pvti_tsc_shift);
 
 	/* refer to tsc.c read_tsc() comment for rationale */
 	last = gtod->cycle_last;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups
  2014-12-23  0:39 [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups Andy Lutomirski
                   ` (3 preceding siblings ...)
  2014-12-23  0:39 ` Andy Lutomirski
@ 2014-12-23  7:21 ` Paolo Bonzini
  2014-12-23  8:16   ` Andy Lutomirski
  2014-12-23  8:16   ` Andy Lutomirski
  2014-12-23  7:21 ` Paolo Bonzini
  5 siblings, 2 replies; 77+ messages in thread
From: Paolo Bonzini @ 2014-12-23  7:21 UTC (permalink / raw)
  To: Andy Lutomirski, Marcelo Tosatti
  Cc: Gleb Natapov, kvm list, linux-kernel, xen-devel, glommer



On 23/12/2014 01:39, Andy Lutomirski wrote:
> This is a dramatic simplification and speedup of the vdso pvclock read
> code.  Is it correct?
> 
> Andy Lutomirski (2):
>   x86, vdso: Use asm volatile in __getcpu
>   x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

Patch 1 is ok,

Acked-by: Paolo Bonzini <pbonzini@redhat.com>

For patch 2 I will defer to Marcelo and Glauber (and the Xen folks).

Paolo

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups
  2014-12-23  7:21 ` [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups Paolo Bonzini
@ 2014-12-23  8:16   ` Andy Lutomirski
  2014-12-23  8:30     ` Paolo Bonzini
  2014-12-23  8:30     ` Paolo Bonzini
  2014-12-23  8:16   ` Andy Lutomirski
  1 sibling, 2 replies; 77+ messages in thread
From: Andy Lutomirski @ 2014-12-23  8:16 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marcelo Tosatti, Gleb Natapov, kvm list, linux-kernel, xen-devel,
	glommer

On Mon, Dec 22, 2014 at 11:21 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 23/12/2014 01:39, Andy Lutomirski wrote:
>> This is a dramatic simplification and speedup of the vdso pvclock read
>> code.  Is it correct?
>>
>> Andy Lutomirski (2):
>>   x86, vdso: Use asm volatile in __getcpu
>>   x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
>
> Patch 1 is ok,
>
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>

Any thoughts as to whether it should be tagged for stable?  I haven't
looked closely enough at the old pvclock code or the generated code to
have much of an opinion there.  It'll be a big speedup for non-pvclock
users at least.

--Andy

>
> For patch 2 I will defer to Marcelo and Glauber (and the Xen folks).
>
> Paolo



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups
  2014-12-23  8:16   ` Andy Lutomirski
  2014-12-23  8:30     ` Paolo Bonzini
@ 2014-12-23  8:30     ` Paolo Bonzini
  1 sibling, 0 replies; 77+ messages in thread
From: Paolo Bonzini @ 2014-12-23  8:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Marcelo Tosatti, Gleb Natapov, kvm list, linux-kernel, xen-devel,
	glommer



On 23/12/2014 09:16, Andy Lutomirski wrote:
> Any thoughts as to whether it should be tagged for stable?  I haven't
> looked closely enough at the old pvclock code or the generated code to
> have much of an opinion there.  It'll be a big speedup for non-pvclock
> users at least.

Yes, please.

Paolo

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Xen-devel] [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2014-12-23  0:39 ` [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader Andy Lutomirski
@ 2014-12-23 10:28   ` David Vrabel
  2014-12-23 10:28   ` David Vrabel
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 77+ messages in thread
From: David Vrabel @ 2014-12-23 10:28 UTC (permalink / raw)
  To: Andy Lutomirski, Paolo Bonzini, Marcelo Tosatti
  Cc: Gleb Natapov, xen-devel, linux-kernel, kvm list

On 23/12/14 00:39, Andy Lutomirski wrote:
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid.  Simplify it for a huge speedup.
> 
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
> 
> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> With this change, it takes 19ns, which is almost as fast as the pure TSC
> implementation.

This sounds plausible but I'm not going to be able to give it a detailed
look until the new year.

David

> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>  
>  static notrace cycle_t vread_pvclock(int *mode)
>  {
> -	const struct pvclock_vsyscall_time_info *pvti;
> +	const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>  	cycle_t ret;
> -	u64 last;
> -	u32 version;
> -	u8 flags;
> -	unsigned cpu, cpu1;
> -
> +	u64 tsc, pvti_tsc;
> +	u64 last, delta, pvti_system_time;
> +	u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>  
>  	/*
> -	 * Note: hypervisor must guarantee that:
> -	 * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> -	 * 2. that per-CPU pvclock time info is updated if the
> -	 *    underlying CPU changes.
> -	 * 3. that version is increased whenever underlying CPU
> -	 *    changes.
> +	 * Note: The kernel and hypervisor must guarantee that cpu ID
> +	 * number maps 1:1 to per-CPU pvclock time info.
> +	 *
> +	 * Because the hypervisor is entirely unaware of guest userspace
> +	 * preemption, it cannot guarantee that per-CPU pvclock time
> +	 * info is updated if the underlying CPU changes or that that
> +	 * version is increased whenever underlying CPU changes.
> +	 *
> +	 * On KVM, we are guaranteed that pvti updates for any vCPU are
> +	 * atomic as seen by *all* vCPUs.  This is an even stronger
> +	 * guarantee than we get with a normal seqlock.
>  	 *
> +	 * On Xen, we don't appear to have that guarantee, but Xen still
> +	 * supplies a valid seqlock using the version field.
> +
> +	 * We only do pvclock vdso timing at all if
> +	 * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> +	 * mean that all vCPUs have matching pvti and that the TSC is
> +	 * synced, so we can just look at vCPU 0's pvti.
>  	 */
> -	do {
> -		cpu = __getcpu() & VGETCPU_CPU_MASK;
> -		/* TODO: We can put vcpu id into higher bits of pvti.version.
> -		 * This will save a couple of cycles by getting rid of
> -		 * __getcpu() calls (Gleb).
> -		 */
> -
> -		pvti = get_pvti(cpu);
> -
> -		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> -
> -		/*
> -		 * Test we're still on the cpu as well as the version.
> -		 * We could have been migrated just after the first
> -		 * vgetcpu but before fetching the version, so we
> -		 * wouldn't notice a version change.
> -		 */
> -		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> -	} while (unlikely(cpu != cpu1 ||
> -			  (pvti->pvti.version & 1) ||
> -			  pvti->pvti.version != version));
> -
> -	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +
> +	if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>  		*mode = VCLOCK_NONE;
> +		return 0;
> +	}
> +
> +	do {
> +		version = pvti->version;
> +
> +		/* This is also a read barrier, so we'll read version first. */
> +		rdtsc_barrier();
> +		tsc = __native_read_tsc();
> +
> +		pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> +		pvti_tsc_shift = pvti->tsc_shift;
> +		pvti_system_time = pvti->system_time;
> +		pvti_tsc = pvti->tsc_timestamp;
> +
> +		/* Make sure that the version double-check is last. */
> +		smp_rmb();
> +	} while (unlikely((version & 1) || version != pvti->version));
> +
> +	delta = tsc - pvti_tsc;
> +	ret = pvti_system_time +
> +		pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> +				    pvti_tsc_shift);
>  
>  	/* refer to tsc.c read_tsc() comment for rationale */
>  	last = gtod->cycle_last;
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Xen-devel] [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2014-12-23  0:39 ` [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader Andy Lutomirski
                     ` (2 preceding siblings ...)
  2014-12-23 15:14   ` Boris Ostrovsky
@ 2014-12-23 15:14   ` Boris Ostrovsky
  2014-12-23 15:14     ` Paolo Bonzini
  2014-12-23 15:14     ` [Xen-devel] " Paolo Bonzini
  2014-12-24 21:30   ` David Matlack
                     ` (4 subsequent siblings)
  8 siblings, 2 replies; 77+ messages in thread
From: Boris Ostrovsky @ 2014-12-23 15:14 UTC (permalink / raw)
  To: Andy Lutomirski, Paolo Bonzini, Marcelo Tosatti
  Cc: Gleb Natapov, xen-devel, linux-kernel, kvm list

On 12/22/2014 07:39 PM, Andy Lutomirski wrote:
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid.  Simplify it for a huge speedup.
>
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
>
> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> With this change, it takes 19ns, which is almost as fast as the pure TSC
> implementation.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>   arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
>   1 file changed, 47 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> index 9793322751e0..f2e0396d5629 100644
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>   
>   static notrace cycle_t vread_pvclock(int *mode)
>   {
> -	const struct pvclock_vsyscall_time_info *pvti;
> +	const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>   	cycle_t ret;
> -	u64 last;
> -	u32 version;
> -	u8 flags;
> -	unsigned cpu, cpu1;
> -
> +	u64 tsc, pvti_tsc;
> +	u64 last, delta, pvti_system_time;
> +	u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>   
>   	/*
> -	 * Note: hypervisor must guarantee that:
> -	 * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> -	 * 2. that per-CPU pvclock time info is updated if the
> -	 *    underlying CPU changes.
> -	 * 3. that version is increased whenever underlying CPU
> -	 *    changes.
> +	 * Note: The kernel and hypervisor must guarantee that cpu ID
> +	 * number maps 1:1 to per-CPU pvclock time info.
> +	 *
> +	 * Because the hypervisor is entirely unaware of guest userspace
> +	 * preemption, it cannot guarantee that per-CPU pvclock time
> +	 * info is updated if the underlying CPU changes or that that
> +	 * version is increased whenever underlying CPU changes.
> +	 *
> +	 * On KVM, we are guaranteed that pvti updates for any vCPU are
> +	 * atomic as seen by *all* vCPUs.  This is an even stronger
> +	 * guarantee than we get with a normal seqlock.
>   	 *
> +	 * On Xen, we don't appear to have that guarantee, but Xen still
> +	 * supplies a valid seqlock using the version field.
> +
> +	 * We only do pvclock vdso timing at all if
> +	 * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> +	 * mean that all vCPUs have matching pvti and that the TSC is
> +	 * synced, so we can just look at vCPU 0's pvti.
>   	 */
> -	do {
> -		cpu = __getcpu() & VGETCPU_CPU_MASK;
> -		/* TODO: We can put vcpu id into higher bits of pvti.version.
> -		 * This will save a couple of cycles by getting rid of
> -		 * __getcpu() calls (Gleb).
> -		 */
> -
> -		pvti = get_pvti(cpu);
> -
> -		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> -
> -		/*
> -		 * Test we're still on the cpu as well as the version.
> -		 * We could have been migrated just after the first
> -		 * vgetcpu but before fetching the version, so we
> -		 * wouldn't notice a version change.
> -		 */
> -		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> -	} while (unlikely(cpu != cpu1 ||
> -			  (pvti->pvti.version & 1) ||
> -			  pvti->pvti.version != version));
> -
> -	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +
> +	if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>   		*mode = VCLOCK_NONE;
> +		return 0;
> +	}
> +
> +	do {
> +		version = pvti->version;
> +
> +		/* This is also a read barrier, so we'll read version first. */
> +		rdtsc_barrier();
> +		tsc = __native_read_tsc();


This will cause a VMEXIT on Xen with TSC_MODE_ALWAYS_EMULATE, which is
used, for example, after a guest has migrated (unless the HW is capable
of scaling the TSC rate).

-boris


> +
> +		pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> +		pvti_tsc_shift = pvti->tsc_shift;
> +		pvti_system_time = pvti->system_time;
> +		pvti_tsc = pvti->tsc_timestamp;
> +
> +		/* Make sure that the version double-check is last. */
> +		smp_rmb();
> +	} while (unlikely((version & 1) || version != pvti->version));
> +
> +	delta = tsc - pvti_tsc;
> +	ret = pvti_system_time +
> +		pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> +				    pvti_tsc_shift);
>   
>   	/* refer to tsc.c read_tsc() comment for rationale */
>   	last = gtod->cycle_last;


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Xen-devel] [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2014-12-23 15:14   ` [Xen-devel] " Boris Ostrovsky
  2014-12-23 15:14     ` Paolo Bonzini
@ 2014-12-23 15:14     ` Paolo Bonzini
  2014-12-23 15:25       ` Boris Ostrovsky
  2014-12-23 15:25       ` [Xen-devel] " Boris Ostrovsky
  1 sibling, 2 replies; 77+ messages in thread
From: Paolo Bonzini @ 2014-12-23 15:14 UTC (permalink / raw)
  To: Boris Ostrovsky, Andy Lutomirski, Marcelo Tosatti
  Cc: Gleb Natapov, xen-devel, linux-kernel, kvm list



On 23/12/2014 16:14, Boris Ostrovsky wrote:
>> +    do {
>> +        version = pvti->version;
>> +
>> +        /* This is also a read barrier, so we'll read version first. */
>> +        rdtsc_barrier();
>> +        tsc = __native_read_tsc();
> 
> 
> This will cause VMEXIT on Xen with TSC_MODE_ALWAYS_EMULATE which is
> used, for example, after guest migrated (unless HW is capable of scaling
> TSC rate).

So does the __pvclock_read_cycles this is replacing (via
pvclock_get_nsec_offset).

Paolo

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Xen-devel] [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2014-12-23 15:14     ` [Xen-devel] " Paolo Bonzini
  2014-12-23 15:25       ` Boris Ostrovsky
@ 2014-12-23 15:25       ` Boris Ostrovsky
  1 sibling, 0 replies; 77+ messages in thread
From: Boris Ostrovsky @ 2014-12-23 15:25 UTC (permalink / raw)
  To: Paolo Bonzini, Andy Lutomirski, Marcelo Tosatti
  Cc: Gleb Natapov, xen-devel, linux-kernel, kvm list

On 12/23/2014 10:14 AM, Paolo Bonzini wrote:
>
> On 23/12/2014 16:14, Boris Ostrovsky wrote:
>>> +    do {
>>> +        version = pvti->version;
>>> +
>>> +        /* This is also a read barrier, so we'll read version first. */
>>> +        rdtsc_barrier();
>>> +        tsc = __native_read_tsc();
>>
>> This will cause VMEXIT on Xen with TSC_MODE_ALWAYS_EMULATE which is
>> used, for example, after guest migrated (unless HW is capable of scaling
>> TSC rate).
> So does the __pvclock_read_cycles this is replacing (via
> pvclock_get_nsec_offset).

Right, I didn't notice that.

-boris

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2014-12-23  0:39 ` [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader Andy Lutomirski
                     ` (3 preceding siblings ...)
  2014-12-23 15:14   ` [Xen-devel] " Boris Ostrovsky
@ 2014-12-24 21:30   ` David Matlack
  2014-12-24 21:43     ` Andy Lutomirski
  2014-12-24 21:43     ` Andy Lutomirski
  2014-12-24 21:30   ` David Matlack
                     ` (3 subsequent siblings)
  8 siblings, 2 replies; 77+ messages in thread
From: David Matlack @ 2014-12-24 21:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, Marcelo Tosatti, Gleb Natapov, kvm list,
	linux-kernel, xen-devel

On Mon, Dec 22, 2014 at 4:39 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid.  Simplify it for a huge speedup.
>
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
>
> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> With this change, it takes 19ns, which is almost as fast as the pure TSC
> implementation.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
>  1 file changed, 47 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> index 9793322751e0..f2e0396d5629 100644
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>
>  static notrace cycle_t vread_pvclock(int *mode)
>  {
> -       const struct pvclock_vsyscall_time_info *pvti;
> +       const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>         cycle_t ret;
> -       u64 last;
> -       u32 version;
> -       u8 flags;
> -       unsigned cpu, cpu1;
> -
> +       u64 tsc, pvti_tsc;
> +       u64 last, delta, pvti_system_time;
> +       u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>
>         /*
> -        * Note: hypervisor must guarantee that:
> -        * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> -        * 2. that per-CPU pvclock time info is updated if the
> -        *    underlying CPU changes.
> -        * 3. that version is increased whenever underlying CPU
> -        *    changes.
> +        * Note: The kernel and hypervisor must guarantee that cpu ID
> +        * number maps 1:1 to per-CPU pvclock time info.
> +        *
> +        * Because the hypervisor is entirely unaware of guest userspace
> +        * preemption, it cannot guarantee that per-CPU pvclock time
> +        * info is updated if the underlying CPU changes or that that
> +        * version is increased whenever underlying CPU changes.
> +        *
> +        * On KVM, we are guaranteed that pvti updates for any vCPU are
> +        * atomic as seen by *all* vCPUs.  This is an even stronger
> +        * guarantee than we get with a normal seqlock.
>          *
> +        * On Xen, we don't appear to have that guarantee, but Xen still
> +        * supplies a valid seqlock using the version field.
> +

Forgotten * here?

> +        * We only do pvclock vdso timing at all if
> +        * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> +        * mean that all vCPUs have matching pvti and that the TSC is
> +        * synced, so we can just look at vCPU 0's pvti.
>          */
> -       do {
> -               cpu = __getcpu() & VGETCPU_CPU_MASK;
> -               /* TODO: We can put vcpu id into higher bits of pvti.version.
> -                * This will save a couple of cycles by getting rid of
> -                * __getcpu() calls (Gleb).
> -                */
> -
> -               pvti = get_pvti(cpu);
> -
> -               version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> -
> -               /*
> -                * Test we're still on the cpu as well as the version.
> -                * We could have been migrated just after the first
> -                * vgetcpu but before fetching the version, so we
> -                * wouldn't notice a version change.
> -                */
> -               cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> -       } while (unlikely(cpu != cpu1 ||
> -                         (pvti->pvti.version & 1) ||
> -                         pvti->pvti.version != version));
> -
> -       if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +
> +       if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>                 *mode = VCLOCK_NONE;
> +               return 0;
> +       }
> +
> +       do {
> +               version = pvti->version;
> +
> +               /* This is also a read barrier, so we'll read version first. */
> +               rdtsc_barrier();
> +               tsc = __native_read_tsc();

Is there a reason why you read the tsc inside the loop rather than once
after the loop?

> +
> +               pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> +               pvti_tsc_shift = pvti->tsc_shift;
> +               pvti_system_time = pvti->system_time;
> +               pvti_tsc = pvti->tsc_timestamp;
> +
> +               /* Make sure that the version double-check is last. */
> +               smp_rmb();
> +       } while (unlikely((version & 1) || version != pvti->version));
> +
> +       delta = tsc - pvti_tsc;
> +       ret = pvti_system_time +
> +               pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> +                                   pvti_tsc_shift);
>
>         /* refer to tsc.c read_tsc() comment for rationale */
>         last = gtod->cycle_last;
> --
> 2.1.0
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2014-12-24 21:30   ` David Matlack
  2014-12-24 21:43     ` Andy Lutomirski
@ 2014-12-24 21:43     ` Andy Lutomirski
  1 sibling, 0 replies; 77+ messages in thread
From: Andy Lutomirski @ 2014-12-24 21:43 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marcelo Tosatti, Gleb Natapov, kvm list,
	linux-kernel, xen-devel

On Wed, Dec 24, 2014 at 1:30 PM, David Matlack <dmatlack@google.com> wrote:
> On Mon, Dec 22, 2014 at 4:39 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> The pvclock vdso code was too abstracted to understand easily and
>> excessively paranoid.  Simplify it for a huge speedup.
>>
>> This opens the door for additional simplifications, as the vdso no
>> longer accesses the pvti for any vcpu other than vcpu 0.
>>
>> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
>> With this change, it takes 19ns, which is almost as fast as the pure TSC
>> implementation.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
>>  1 file changed, 47 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>> index 9793322751e0..f2e0396d5629 100644
>> --- a/arch/x86/vdso/vclock_gettime.c
>> +++ b/arch/x86/vdso/vclock_gettime.c
>> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>>
>>  static notrace cycle_t vread_pvclock(int *mode)
>>  {
>> -       const struct pvclock_vsyscall_time_info *pvti;
>> +       const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>>         cycle_t ret;
>> -       u64 last;
>> -       u32 version;
>> -       u8 flags;
>> -       unsigned cpu, cpu1;
>> -
>> +       u64 tsc, pvti_tsc;
>> +       u64 last, delta, pvti_system_time;
>> +       u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>>
>>         /*
>> -        * Note: hypervisor must guarantee that:
>> -        * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
>> -        * 2. that per-CPU pvclock time info is updated if the
>> -        *    underlying CPU changes.
>> -        * 3. that version is increased whenever underlying CPU
>> -        *    changes.
>> +        * Note: The kernel and hypervisor must guarantee that cpu ID
>> +        * number maps 1:1 to per-CPU pvclock time info.
>> +        *
>> +        * Because the hypervisor is entirely unaware of guest userspace
>> +        * preemption, it cannot guarantee that per-CPU pvclock time
>> +        * info is updated if the underlying CPU changes or that that
>> +        * version is increased whenever underlying CPU changes.
>> +        *
>> +        * On KVM, we are guaranteed that pvti updates for any vCPU are
>> +        * atomic as seen by *all* vCPUs.  This is an even stronger
>> +        * guarantee than we get with a normal seqlock.
>>          *
>> +        * On Xen, we don't appear to have that guarantee, but Xen still
>> +        * supplies a valid seqlock using the version field.
>> +
>
> Forgotten * here?
>
>> +        * We only do pvclock vdso timing at all if
>> +        * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
>> +        * mean that all vCPUs have matching pvti and that the TSC is
>> +        * synced, so we can just look at vCPU 0's pvti.
>>          */
>> -       do {
>> -               cpu = __getcpu() & VGETCPU_CPU_MASK;
>> -               /* TODO: We can put vcpu id into higher bits of pvti.version.
>> -                * This will save a couple of cycles by getting rid of
>> -                * __getcpu() calls (Gleb).
>> -                */
>> -
>> -               pvti = get_pvti(cpu);
>> -
>> -               version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
>> -
>> -               /*
>> -                * Test we're still on the cpu as well as the version.
>> -                * We could have been migrated just after the first
>> -                * vgetcpu but before fetching the version, so we
>> -                * wouldn't notice a version change.
>> -                */
>> -               cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>> -       } while (unlikely(cpu != cpu1 ||
>> -                         (pvti->pvti.version & 1) ||
>> -                         pvti->pvti.version != version));
>> -
>> -       if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
>> +
>> +       if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>>                 *mode = VCLOCK_NONE;
>> +               return 0;
>> +       }
>> +
>> +       do {
>> +               version = pvti->version;
>> +
>> +               /* This is also a read barrier, so we'll read version first. */
>> +               rdtsc_barrier();
>> +               tsc = __native_read_tsc();
>
> Is there a reason why you read the tsc inside the loop rather than once
> after the loop?

I want to make sure that the tsc value used is consistent with the
scale and offset.  Otherwise it would be possible to read the pvti
data, then get preempted and sleep for a long time before rdtsc.  The
result could be a time value larger than an immediate subsequent call
would return.
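
Concretely, the ordering in question looks like this (condensed from the
loop in the patch; the local variable names here are illustrative only):

        u32 version, mul;
        int shift;
        u64 tsc, tsc_timestamp, system_time, ns;

        do {
                version = pvti->version;        /* odd: update in flight */

                /* Also a read barrier, so the TSC is read after version. */
                rdtsc_barrier();
                tsc = __native_read_tsc();

                mul = pvti->tsc_to_system_mul;
                shift = pvti->tsc_shift;
                system_time = pvti->system_time;
                tsc_timestamp = pvti->tsc_timestamp;

                /* Re-check version last. */
                smp_rmb();
        } while ((version & 1) || version != pvti->version);

        ns = system_time +
                pvclock_scale_delta(tsc - tsc_timestamp, mul, shift);

        /*
         * If the rdtsc were moved after the loop instead, the task could
         * be preempted between the pvti snapshot and the TSC read, and a
         * stale (system_time, tsc_timestamp, mul, shift) tuple would be
         * combined with a much newer TSC value.
         */

pvclock_scale_delta() is just the fixed-point conversion: it applies
tsc_shift to the TSC delta and multiplies by tsc_to_system_mul (a 32.32
fixed-point factor) to turn cycles into nanoseconds.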

--Andy

>
>> +
>> +               pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
>> +               pvti_tsc_shift = pvti->tsc_shift;
>> +               pvti_system_time = pvti->system_time;
>> +               pvti_tsc = pvti->tsc_timestamp;
>> +
>> +               /* Make sure that the version double-check is last. */
>> +               smp_rmb();
>> +       } while (unlikely((version & 1) || version != pvti->version));
>> +
>> +       delta = tsc - pvti_tsc;
>> +       ret = pvti_system_time +
>> +               pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
>> +                                   pvti_tsc_shift);
>>
>>         /* refer to tsc.c read_tsc() comment for rationale */
>>         last = gtod->cycle_last;
>> --
>> 2.1.0
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2014-12-24 21:30   ` David Matlack
@ 2014-12-24 21:43     ` Andy Lutomirski
  2014-12-24 21:43     ` Andy Lutomirski
  1 sibling, 0 replies; 77+ messages in thread
From: Andy Lutomirski @ 2014-12-24 21:43 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm list, Gleb Natapov, Marcelo Tosatti, linux-kernel, xen-devel,
	Paolo Bonzini

On Wed, Dec 24, 2014 at 1:30 PM, David Matlack <dmatlack@google.com> wrote:
> On Mon, Dec 22, 2014 at 4:39 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> The pvclock vdso code was too abstracted to understand easily and
>> excessively paranoid.  Simplify it for a huge speedup.
>>
>> This opens the door for additional simplifications, as the vdso no
>> longer accesses the pvti for any vcpu other than vcpu 0.
>>
>> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
>> With this change, it takes 19ns, which is almost as fast as the pure TSC
>> implementation.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
>>  1 file changed, 47 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>> index 9793322751e0..f2e0396d5629 100644
>> --- a/arch/x86/vdso/vclock_gettime.c
>> +++ b/arch/x86/vdso/vclock_gettime.c
>> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>>
>>  static notrace cycle_t vread_pvclock(int *mode)
>>  {
>> -       const struct pvclock_vsyscall_time_info *pvti;
>> +       const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>>         cycle_t ret;
>> -       u64 last;
>> -       u32 version;
>> -       u8 flags;
>> -       unsigned cpu, cpu1;
>> -
>> +       u64 tsc, pvti_tsc;
>> +       u64 last, delta, pvti_system_time;
>> +       u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>>
>>         /*
>> -        * Note: hypervisor must guarantee that:
>> -        * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
>> -        * 2. that per-CPU pvclock time info is updated if the
>> -        *    underlying CPU changes.
>> -        * 3. that version is increased whenever underlying CPU
>> -        *    changes.
>> +        * Note: The kernel and hypervisor must guarantee that cpu ID
>> +        * number maps 1:1 to per-CPU pvclock time info.
>> +        *
>> +        * Because the hypervisor is entirely unaware of guest userspace
>> +        * preemption, it cannot guarantee that per-CPU pvclock time
>> +        * info is updated if the underlying CPU changes or that that
>> +        * version is increased whenever underlying CPU changes.
>> +        *
>> +        * On KVM, we are guaranteed that pvti updates for any vCPU are
>> +        * atomic as seen by *all* vCPUs.  This is an even stronger
>> +        * guarantee than we get with a normal seqlock.
>>          *
>> +        * On Xen, we don't appear to have that guarantee, but Xen still
>> +        * supplies a valid seqlock using the version field.
>> +
>
> Forgotten * here?
>
>> +        * We only do pvclock vdso timing at all if
>> +        * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
>> +        * mean that all vCPUs have matching pvti and that the TSC is
>> +        * synced, so we can just look at vCPU 0's pvti.
>>          */
>> -       do {
>> -               cpu = __getcpu() & VGETCPU_CPU_MASK;
>> -               /* TODO: We can put vcpu id into higher bits of pvti.version.
>> -                * This will save a couple of cycles by getting rid of
>> -                * __getcpu() calls (Gleb).
>> -                */
>> -
>> -               pvti = get_pvti(cpu);
>> -
>> -               version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
>> -
>> -               /*
>> -                * Test we're still on the cpu as well as the version.
>> -                * We could have been migrated just after the first
>> -                * vgetcpu but before fetching the version, so we
>> -                * wouldn't notice a version change.
>> -                */
>> -               cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>> -       } while (unlikely(cpu != cpu1 ||
>> -                         (pvti->pvti.version & 1) ||
>> -                         pvti->pvti.version != version));
>> -
>> -       if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
>> +
>> +       if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>>                 *mode = VCLOCK_NONE;
>> +               return 0;
>> +       }
>> +
>> +       do {
>> +               version = pvti->version;
>> +
>> +               /* This is also a read barrier, so we'll read version first. */
>> +               rdtsc_barrier();
>> +               tsc = __native_read_tsc();
>
> Is there a reason why you read the tsc inside the loop rather than once
> after the loop?

I want to make sure that the tsc value used is consistent with the
scale and offset.  Otherwise it would be possible to read the pvti
data, then get preempted and sleep for a long time before rdtsc.  The
result could be a time value larger than an immediate subsequent call
would return.

--Andy

>
>> +
>> +               pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
>> +               pvti_tsc_shift = pvti->tsc_shift;
>> +               pvti_system_time = pvti->system_time;
>> +               pvti_tsc = pvti->tsc_timestamp;
>> +
>> +               /* Make sure that the version double-check is last. */
>> +               smp_rmb();
>> +       } while (unlikely((version & 1) || version != pvti->version));
>> +
>> +       delta = tsc - pvti_tsc;
>> +       ret = pvti_system_time +
>> +               pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
>> +                                   pvti_tsc_shift);
>>
>>         /* refer to tsc.c read_tsc() comment for rationale */
>>         last = gtod->cycle_last;
>> --
>> 2.1.0
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2014-12-23  0:39 ` [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader Andy Lutomirski
                     ` (5 preceding siblings ...)
  2014-12-24 21:30   ` David Matlack
@ 2015-01-05 15:25   ` Marcelo Tosatti
  2015-01-05 18:56     ` Andy Lutomirski
  2015-01-05 18:56     ` Andy Lutomirski
  2015-01-05 15:25   ` Marcelo Tosatti
  2015-01-08 12:51     ` David Vrabel
  8 siblings, 2 replies; 77+ messages in thread
From: Marcelo Tosatti @ 2015-01-05 15:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, Gleb Natapov, kvm list, linux-kernel, xen-devel

On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid.  Simplify it for a huge speedup.
> 
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
> 
> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> With this change, it takes 19ns, which is almost as fast as the pure TSC
> implementation.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
>  1 file changed, 47 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> index 9793322751e0..f2e0396d5629 100644
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>  
>  static notrace cycle_t vread_pvclock(int *mode)
>  {
> -	const struct pvclock_vsyscall_time_info *pvti;
> +	const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>  	cycle_t ret;
> -	u64 last;
> -	u32 version;
> -	u8 flags;
> -	unsigned cpu, cpu1;
> -
> +	u64 tsc, pvti_tsc;
> +	u64 last, delta, pvti_system_time;
> +	u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>  
>  	/*
> -	 * Note: hypervisor must guarantee that:
> -	 * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> -	 * 2. that per-CPU pvclock time info is updated if the
> -	 *    underlying CPU changes.
> -	 * 3. that version is increased whenever underlying CPU
> -	 *    changes.
> +	 * Note: The kernel and hypervisor must guarantee that cpu ID
> +	 * number maps 1:1 to per-CPU pvclock time info.
> +	 *
> +	 * Because the hypervisor is entirely unaware of guest userspace
> +	 * preemption, it cannot guarantee that per-CPU pvclock time
> +	 * info is updated if the underlying CPU changes or that that
> +	 * version is increased whenever underlying CPU changes.
> +	 *
> +	 * On KVM, we are guaranteed that pvti updates for any vCPU are
> +	 * atomic as seen by *all* vCPUs.  This is an even stronger
> +	 * guarantee than we get with a normal seqlock.
>  	 *
> +	 * On Xen, we don't appear to have that guarantee, but Xen still
> +	 * supplies a valid seqlock using the version field.
> +
> +	 * We only do pvclock vdso timing at all if
> +	 * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> +	 * mean that all vCPUs have matching pvti and that the TSC is
> +	 * synced, so we can just look at vCPU 0's pvti.
>  	 */

Can Xen guarantee that ?

> -	do {
> -		cpu = __getcpu() & VGETCPU_CPU_MASK;
> -		/* TODO: We can put vcpu id into higher bits of pvti.version.
> -		 * This will save a couple of cycles by getting rid of
> -		 * __getcpu() calls (Gleb).
> -		 */
> -
> -		pvti = get_pvti(cpu);
> -
> -		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> -
> -		/*
> -		 * Test we're still on the cpu as well as the version.
> -		 * We could have been migrated just after the first
> -		 * vgetcpu but before fetching the version, so we
> -		 * wouldn't notice a version change.
> -		 */
> -		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> -	} while (unlikely(cpu != cpu1 ||
> -			  (pvti->pvti.version & 1) ||
> -			  pvti->pvti.version != version));
> -
> -	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +
> +	if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>  		*mode = VCLOCK_NONE;
> +		return 0;
> +	}

This check must be performed after reading a stable pvti.

> +
> +	do {
> +		version = pvti->version;
> +
> +		/* This is also a read barrier, so we'll read version first. */
> +		rdtsc_barrier();
> +		tsc = __native_read_tsc();
> +
> +		pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> +		pvti_tsc_shift = pvti->tsc_shift;
> +		pvti_system_time = pvti->system_time;
> +		pvti_tsc = pvti->tsc_timestamp;
> +
> +		/* Make sure that the version double-check is last. */
> +		smp_rmb();
> +	} while (unlikely((version & 1) || version != pvti->version));
> +
> +	delta = tsc - pvti_tsc;
> +	ret = pvti_system_time +
> +		pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> +				    pvti_tsc_shift);

The following is possible:

1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
transition.
2) vCPU-1 updates its pvti with new values.
3) vCPU-0 still has not updated its pvti with new values.
4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.

The update is not actually atomic across all vCPUs, it's atomic in
the sense of not allowing visibility of distinct
system_timestamp/tsc_timestamp values.
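
As a timeline (with this patch the vdso in the guest always reads
vCPU 0's pvti):

        /*
         *   host updater                       guest on vCPU-1
         *   ------------                       ---------------
         *   request pvti update on all vCPUs
         *   vCPU-1's pvti rewritten
         *                                      VM-enter
         *                                      vdso reads vCPU-0's pvti:
         *                                        old parameters, stale
         *                                        PVCLOCK_TSC_STABLE_BIT
         *   vCPU-0's pvti rewritten
         *   (too late for the read above)
         */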


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2014-12-23  0:39 ` [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader Andy Lutomirski
                     ` (6 preceding siblings ...)
  2015-01-05 15:25   ` Marcelo Tosatti
@ 2015-01-05 15:25   ` Marcelo Tosatti
  2015-01-08 12:51     ` David Vrabel
  8 siblings, 0 replies; 77+ messages in thread
From: Marcelo Tosatti @ 2015-01-05 15:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gleb Natapov, Paolo Bonzini, linux-kernel, kvm list, xen-devel

On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid.  Simplify it for a huge speedup.
> 
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
> 
> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> With this change, it takes 19ns, which is almost as fast as the pure TSC
> implementation.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
>  1 file changed, 47 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> index 9793322751e0..f2e0396d5629 100644
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>  
>  static notrace cycle_t vread_pvclock(int *mode)
>  {
> -	const struct pvclock_vsyscall_time_info *pvti;
> +	const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>  	cycle_t ret;
> -	u64 last;
> -	u32 version;
> -	u8 flags;
> -	unsigned cpu, cpu1;
> -
> +	u64 tsc, pvti_tsc;
> +	u64 last, delta, pvti_system_time;
> +	u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>  
>  	/*
> -	 * Note: hypervisor must guarantee that:
> -	 * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> -	 * 2. that per-CPU pvclock time info is updated if the
> -	 *    underlying CPU changes.
> -	 * 3. that version is increased whenever underlying CPU
> -	 *    changes.
> +	 * Note: The kernel and hypervisor must guarantee that cpu ID
> +	 * number maps 1:1 to per-CPU pvclock time info.
> +	 *
> +	 * Because the hypervisor is entirely unaware of guest userspace
> +	 * preemption, it cannot guarantee that per-CPU pvclock time
> +	 * info is updated if the underlying CPU changes or that that
> +	 * version is increased whenever underlying CPU changes.
> +	 *
> +	 * On KVM, we are guaranteed that pvti updates for any vCPU are
> +	 * atomic as seen by *all* vCPUs.  This is an even stronger
> +	 * guarantee than we get with a normal seqlock.
>  	 *
> +	 * On Xen, we don't appear to have that guarantee, but Xen still
> +	 * supplies a valid seqlock using the version field.
> +
> +	 * We only do pvclock vdso timing at all if
> +	 * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> +	 * mean that all vCPUs have matching pvti and that the TSC is
> +	 * synced, so we can just look at vCPU 0's pvti.
>  	 */

Can Xen guarantee that ?

> -	do {
> -		cpu = __getcpu() & VGETCPU_CPU_MASK;
> -		/* TODO: We can put vcpu id into higher bits of pvti.version.
> -		 * This will save a couple of cycles by getting rid of
> -		 * __getcpu() calls (Gleb).
> -		 */
> -
> -		pvti = get_pvti(cpu);
> -
> -		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> -
> -		/*
> -		 * Test we're still on the cpu as well as the version.
> -		 * We could have been migrated just after the first
> -		 * vgetcpu but before fetching the version, so we
> -		 * wouldn't notice a version change.
> -		 */
> -		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> -	} while (unlikely(cpu != cpu1 ||
> -			  (pvti->pvti.version & 1) ||
> -			  pvti->pvti.version != version));
> -
> -	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +
> +	if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>  		*mode = VCLOCK_NONE;
> +		return 0;
> +	}

This check must be performed after reading a stable pvti.

> +
> +	do {
> +		version = pvti->version;
> +
> +		/* This is also a read barrier, so we'll read version first. */
> +		rdtsc_barrier();
> +		tsc = __native_read_tsc();
> +
> +		pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> +		pvti_tsc_shift = pvti->tsc_shift;
> +		pvti_system_time = pvti->system_time;
> +		pvti_tsc = pvti->tsc_timestamp;
> +
> +		/* Make sure that the version double-check is last. */
> +		smp_rmb();
> +	} while (unlikely((version & 1) || version != pvti->version));
> +
> +	delta = tsc - pvti_tsc;
> +	ret = pvti_system_time +
> +		pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> +				    pvti_tsc_shift);

The following is possible:

1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
transition.
2) vCPU-1 updates its pvti with new values.
3) vCPU-0 still has not updated its pvti with new values.
4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.

The update is not actually atomic across all vCPUs, it's atomic in
the sense of not allowing visibility of distinct
system_timestamp/tsc_timestamp values.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-05 15:25   ` Marcelo Tosatti
  2015-01-05 18:56     ` Andy Lutomirski
@ 2015-01-05 18:56     ` Andy Lutomirski
  2015-01-05 19:17       ` Marcelo Tosatti
                         ` (4 more replies)
  1 sibling, 5 replies; 77+ messages in thread
From: Andy Lutomirski @ 2015-01-05 18:56 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, Gleb Natapov, kvm list, linux-kernel, xen-devel

On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
>> The pvclock vdso code was too abstracted to understand easily and
>> excessively paranoid.  Simplify it for a huge speedup.
>>
>> This opens the door for additional simplifications, as the vdso no
>> longer accesses the pvti for any vcpu other than vcpu 0.
>>
>> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
>> With this change, it takes 19ns, which is almost as fast as the pure TSC
>> implementation.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
>>  1 file changed, 47 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>> index 9793322751e0..f2e0396d5629 100644
>> --- a/arch/x86/vdso/vclock_gettime.c
>> +++ b/arch/x86/vdso/vclock_gettime.c
>> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>>
>>  static notrace cycle_t vread_pvclock(int *mode)
>>  {
>> -     const struct pvclock_vsyscall_time_info *pvti;
>> +     const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>>       cycle_t ret;
>> -     u64 last;
>> -     u32 version;
>> -     u8 flags;
>> -     unsigned cpu, cpu1;
>> -
>> +     u64 tsc, pvti_tsc;
>> +     u64 last, delta, pvti_system_time;
>> +     u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>>
>>       /*
>> -      * Note: hypervisor must guarantee that:
>> -      * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
>> -      * 2. that per-CPU pvclock time info is updated if the
>> -      *    underlying CPU changes.
>> -      * 3. that version is increased whenever underlying CPU
>> -      *    changes.
>> +      * Note: The kernel and hypervisor must guarantee that cpu ID
>> +      * number maps 1:1 to per-CPU pvclock time info.
>> +      *
>> +      * Because the hypervisor is entirely unaware of guest userspace
>> +      * preemption, it cannot guarantee that per-CPU pvclock time
>> +      * info is updated if the underlying CPU changes or that that
>> +      * version is increased whenever underlying CPU changes.
>> +      *
>> +      * On KVM, we are guaranteed that pvti updates for any vCPU are
>> +      * atomic as seen by *all* vCPUs.  This is an even stronger
>> +      * guarantee than we get with a normal seqlock.
>>        *
>> +      * On Xen, we don't appear to have that guarantee, but Xen still
>> +      * supplies a valid seqlock using the version field.
>> +
>> +      * We only do pvclock vdso timing at all if
>> +      * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
>> +      * mean that all vCPUs have matching pvti and that the TSC is
>> +      * synced, so we can just look at vCPU 0's pvti.
>>        */
>
> Can Xen guarantee that ?

I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
at all.  I have no idea going forward, though.

Xen people?

>
>> -     do {
>> -             cpu = __getcpu() & VGETCPU_CPU_MASK;
>> -             /* TODO: We can put vcpu id into higher bits of pvti.version.
>> -              * This will save a couple of cycles by getting rid of
>> -              * __getcpu() calls (Gleb).
>> -              */
>> -
>> -             pvti = get_pvti(cpu);
>> -
>> -             version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
>> -
>> -             /*
>> -              * Test we're still on the cpu as well as the version.
>> -              * We could have been migrated just after the first
>> -              * vgetcpu but before fetching the version, so we
>> -              * wouldn't notice a version change.
>> -              */
>> -             cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>> -     } while (unlikely(cpu != cpu1 ||
>> -                       (pvti->pvti.version & 1) ||
>> -                       pvti->pvti.version != version));
>> -
>> -     if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
>> +
>> +     if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>>               *mode = VCLOCK_NONE;
>> +             return 0;
>> +     }
>
> This check must be performed after reading a stable pvti.
>

We can even read it in the middle, guarded by the version checks.
I'll do that for v2.
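
Roughly like this (a sketch of the idea, not the actual v2 patch; same
variables as in the hunk above, plus a local flags):

        u8 flags;

        do {
                version = pvti->version;

                /* Now covered by the same version check as everything else. */
                flags = pvti->flags;

                rdtsc_barrier();
                tsc = __native_read_tsc();

                pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
                pvti_tsc_shift = pvti->tsc_shift;
                pvti_system_time = pvti->system_time;
                pvti_tsc = pvti->tsc_timestamp;

                /* Make sure that the version double-check is last. */
                smp_rmb();
        } while (unlikely((version & 1) || version != pvti->version));

        if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT))) {
                *mode = VCLOCK_NONE;
                return 0;
        }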

>> +
>> +     do {
>> +             version = pvti->version;
>> +
>> +             /* This is also a read barrier, so we'll read version first. */
>> +             rdtsc_barrier();
>> +             tsc = __native_read_tsc();
>> +
>> +             pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
>> +             pvti_tsc_shift = pvti->tsc_shift;
>> +             pvti_system_time = pvti->system_time;
>> +             pvti_tsc = pvti->tsc_timestamp;
>> +
>> +             /* Make sure that the version double-check is last. */
>> +             smp_rmb();
>> +     } while (unlikely((version & 1) || version != pvti->version));
>> +
>> +     delta = tsc - pvti_tsc;
>> +     ret = pvti_system_time +
>> +             pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
>> +                                 pvti_tsc_shift);
>
> The following is possible:
>
> 1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
> 1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
> transition.
> 2) vCPU-1 updates its pvti with new values.
> 3) vCPU-0 still has not updated its pvti with new values.
> 4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
> notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.
>
> The update is not actually atomic across all vCPUs, it's atomic in
> the sense of not allowing visibility of distinct
> system_timestamp/tsc_timestamp values.
>

Hmm.  In step 4, is there a guarantee that vCPU-0 won't VM-enter until
it gets marked unstable?  Otherwise the vdso could just as
easily be called from vCPU-1, migrated to vCPU-0, read the data
complete with stale stable bit, and get migrated back to vCPU-1.

But I thought that KVM currently froze all vCPUs when updating pvti
for any of them.  How can this happen?  I admit I don't really
understand the update request code.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-05 15:25   ` Marcelo Tosatti
@ 2015-01-05 18:56     ` Andy Lutomirski
  2015-01-05 18:56     ` Andy Lutomirski
  1 sibling, 0 replies; 77+ messages in thread
From: Andy Lutomirski @ 2015-01-05 18:56 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, Paolo Bonzini, linux-kernel, kvm list, xen-devel

On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
>> The pvclock vdso code was too abstracted to understand easily and
>> excessively paranoid.  Simplify it for a huge speedup.
>>
>> This opens the door for additional simplifications, as the vdso no
>> longer accesses the pvti for any vcpu other than vcpu 0.
>>
>> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
>> With this change, it takes 19ns, which is almost as fast as the pure TSC
>> implementation.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
>>  1 file changed, 47 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>> index 9793322751e0..f2e0396d5629 100644
>> --- a/arch/x86/vdso/vclock_gettime.c
>> +++ b/arch/x86/vdso/vclock_gettime.c
>> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>>
>>  static notrace cycle_t vread_pvclock(int *mode)
>>  {
>> -     const struct pvclock_vsyscall_time_info *pvti;
>> +     const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>>       cycle_t ret;
>> -     u64 last;
>> -     u32 version;
>> -     u8 flags;
>> -     unsigned cpu, cpu1;
>> -
>> +     u64 tsc, pvti_tsc;
>> +     u64 last, delta, pvti_system_time;
>> +     u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>>
>>       /*
>> -      * Note: hypervisor must guarantee that:
>> -      * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
>> -      * 2. that per-CPU pvclock time info is updated if the
>> -      *    underlying CPU changes.
>> -      * 3. that version is increased whenever underlying CPU
>> -      *    changes.
>> +      * Note: The kernel and hypervisor must guarantee that cpu ID
>> +      * number maps 1:1 to per-CPU pvclock time info.
>> +      *
>> +      * Because the hypervisor is entirely unaware of guest userspace
>> +      * preemption, it cannot guarantee that per-CPU pvclock time
>> +      * info is updated if the underlying CPU changes or that that
>> +      * version is increased whenever underlying CPU changes.
>> +      *
>> +      * On KVM, we are guaranteed that pvti updates for any vCPU are
>> +      * atomic as seen by *all* vCPUs.  This is an even stronger
>> +      * guarantee than we get with a normal seqlock.
>>        *
>> +      * On Xen, we don't appear to have that guarantee, but Xen still
>> +      * supplies a valid seqlock using the version field.
>> +
>> +      * We only do pvclock vdso timing at all if
>> +      * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
>> +      * mean that all vCPUs have matching pvti and that the TSC is
>> +      * synced, so we can just look at vCPU 0's pvti.
>>        */
>
> Can Xen guarantee that ?

I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
at all.  I have no idea going forward, though.

Xen people?

>
>> -     do {
>> -             cpu = __getcpu() & VGETCPU_CPU_MASK;
>> -             /* TODO: We can put vcpu id into higher bits of pvti.version.
>> -              * This will save a couple of cycles by getting rid of
>> -              * __getcpu() calls (Gleb).
>> -              */
>> -
>> -             pvti = get_pvti(cpu);
>> -
>> -             version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
>> -
>> -             /*
>> -              * Test we're still on the cpu as well as the version.
>> -              * We could have been migrated just after the first
>> -              * vgetcpu but before fetching the version, so we
>> -              * wouldn't notice a version change.
>> -              */
>> -             cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>> -     } while (unlikely(cpu != cpu1 ||
>> -                       (pvti->pvti.version & 1) ||
>> -                       pvti->pvti.version != version));
>> -
>> -     if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
>> +
>> +     if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>>               *mode = VCLOCK_NONE;
>> +             return 0;
>> +     }
>
> This check must be performed after reading a stable pvti.
>

We can even read it in the middle, guarded by the version checks.
I'll do that for v2.

>> +
>> +     do {
>> +             version = pvti->version;
>> +
>> +             /* This is also a read barrier, so we'll read version first. */
>> +             rdtsc_barrier();
>> +             tsc = __native_read_tsc();
>> +
>> +             pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
>> +             pvti_tsc_shift = pvti->tsc_shift;
>> +             pvti_system_time = pvti->system_time;
>> +             pvti_tsc = pvti->tsc_timestamp;
>> +
>> +             /* Make sure that the version double-check is last. */
>> +             smp_rmb();
>> +     } while (unlikely((version & 1) || version != pvti->version));
>> +
>> +     delta = tsc - pvti_tsc;
>> +     ret = pvti_system_time +
>> +             pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
>> +                                 pvti_tsc_shift);
>
> The following is possible:
>
> 1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
> 1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
> transition.
> 2) vCPU-1 updates its pvti with new values.
> 3) vCPU-0 still has not updated its pvti with new values.
> 4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
> notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.
>
> The update is not actually atomic across all vCPUs, it's atomic in
> the sense of not allowing visibility of distinct
> system_timestamp/tsc_timestamp values.
>

Hmm.  In step 4, is there a guarantee that vCPU-0 won't VM-enter until
it gets marked unstable?  Otherwise the vdso could just as
easily be called from vCPU-1, migrated to vCPU-0, read the data
complete with stale stable bit, and get migrated back to vCPU-1.

But I thought that KVM currently froze all vCPUs when updating pvti
for any of them.  How can this happen?  I admit I don't really
understand the update request code.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-05 18:56     ` Andy Lutomirski
  2015-01-05 19:17       ` Marcelo Tosatti
@ 2015-01-05 19:17       ` Marcelo Tosatti
  2015-01-05 22:38         ` Andy Lutomirski
                           ` (3 more replies)
  2015-01-05 22:23       ` Paolo Bonzini
                         ` (2 subsequent siblings)
  4 siblings, 4 replies; 77+ messages in thread
From: Marcelo Tosatti @ 2015-01-05 19:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, Gleb Natapov, kvm list, linux-kernel, xen-devel

On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
> On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
> >> The pvclock vdso code was too abstracted to understand easily and
> >> excessively paranoid.  Simplify it for a huge speedup.
> >>
> >> This opens the door for additional simplifications, as the vdso no
> >> longer accesses the pvti for any vcpu other than vcpu 0.
> >>
> >> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> >> With this change, it takes 19ns, which is almost as fast as the pure TSC
> >> implementation.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> >> ---
> >>  arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
> >>  1 file changed, 47 insertions(+), 35 deletions(-)
> >>
> >> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> >> index 9793322751e0..f2e0396d5629 100644
> >> --- a/arch/x86/vdso/vclock_gettime.c
> >> +++ b/arch/x86/vdso/vclock_gettime.c
> >> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> >>
> >>  static notrace cycle_t vread_pvclock(int *mode)
> >>  {
> >> -     const struct pvclock_vsyscall_time_info *pvti;
> >> +     const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
> >>       cycle_t ret;
> >> -     u64 last;
> >> -     u32 version;
> >> -     u8 flags;
> >> -     unsigned cpu, cpu1;
> >> -
> >> +     u64 tsc, pvti_tsc;
> >> +     u64 last, delta, pvti_system_time;
> >> +     u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
> >>
> >>       /*
> >> -      * Note: hypervisor must guarantee that:
> >> -      * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> >> -      * 2. that per-CPU pvclock time info is updated if the
> >> -      *    underlying CPU changes.
> >> -      * 3. that version is increased whenever underlying CPU
> >> -      *    changes.
> >> +      * Note: The kernel and hypervisor must guarantee that cpu ID
> >> +      * number maps 1:1 to per-CPU pvclock time info.
> >> +      *
> >> +      * Because the hypervisor is entirely unaware of guest userspace
> >> +      * preemption, it cannot guarantee that per-CPU pvclock time
> >> +      * info is updated if the underlying CPU changes or that that
> >> +      * version is increased whenever underlying CPU changes.
> >> +      *
> >> +      * On KVM, we are guaranteed that pvti updates for any vCPU are
> >> +      * atomic as seen by *all* vCPUs.  This is an even stronger
> >> +      * guarantee than we get with a normal seqlock.
> >>        *
> >> +      * On Xen, we don't appear to have that guarantee, but Xen still
> >> +      * supplies a valid seqlock using the version field.
> >> +
> >> +      * We only do pvclock vdso timing at all if
> >> +      * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> >> +      * mean that all vCPUs have matching pvti and that the TSC is
> >> +      * synced, so we can just look at vCPU 0's pvti.
> >>        */
> >
> > Can Xen guarantee that ?
> 
> I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
> at all.  I have no idea going forward, though.
> 
> Xen people?
> 
> >
> >> -     do {
> >> -             cpu = __getcpu() & VGETCPU_CPU_MASK;
> >> -             /* TODO: We can put vcpu id into higher bits of pvti.version.
> >> -              * This will save a couple of cycles by getting rid of
> >> -              * __getcpu() calls (Gleb).
> >> -              */
> >> -
> >> -             pvti = get_pvti(cpu);
> >> -
> >> -             version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> >> -
> >> -             /*
> >> -              * Test we're still on the cpu as well as the version.
> >> -              * We could have been migrated just after the first
> >> -              * vgetcpu but before fetching the version, so we
> >> -              * wouldn't notice a version change.
> >> -              */
> >> -             cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> >> -     } while (unlikely(cpu != cpu1 ||
> >> -                       (pvti->pvti.version & 1) ||
> >> -                       pvti->pvti.version != version));
> >> -
> >> -     if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> >> +
> >> +     if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
> >>               *mode = VCLOCK_NONE;
> >> +             return 0;
> >> +     }
> >
> > This check must be performed after reading a stable pvti.
> >
> 
> We can even read it in the middle, guarded by the version checks.
> I'll do that for v2.
> 
> >> +
> >> +     do {
> >> +             version = pvti->version;
> >> +
> >> +             /* This is also a read barrier, so we'll read version first. */
> >> +             rdtsc_barrier();
> >> +             tsc = __native_read_tsc();
> >> +
> >> +             pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> >> +             pvti_tsc_shift = pvti->tsc_shift;
> >> +             pvti_system_time = pvti->system_time;
> >> +             pvti_tsc = pvti->tsc_timestamp;
> >> +
> >> +             /* Make sure that the version double-check is last. */
> >> +             smp_rmb();
> >> +     } while (unlikely((version & 1) || version != pvti->version));
> >> +
> >> +     delta = tsc - pvti_tsc;
> >> +     ret = pvti_system_time +
> >> +             pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> >> +                                 pvti_tsc_shift);
> >
> > The following is possible:
> >
> > 1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
> > 1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
> > transition.
> > 2) vCPU-1 updates its pvti with new values.
> > 3) vCPU-0 still has not updated its pvti with new values.
> > 4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
> > notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.
> >
> > The update is not actually atomic across all vCPUs, it's atomic in
> > the sense of not allowing visibility of distinct
> > system_timestamp/tsc_timestamp values.
> >
> 
> Hmm.  In step 4, is there a guarantee that vCPU-0 won't VM-enter until
> it gets marked unstable? 

Yes. It will VM-enter after pvti is updated.

> Otherwise the vdso could just as
> easily be called from vCPU-1, migrated to vCPU-0, read the data
> complete with stale stable bit, and get migrated back to vCPU-1.

Right.

> But I thought that KVM currently froze all vCPUs when updating pvti
> for any of them.  How can this happen?  I admit I don't really
> understand the update request code.

The update is performed as follows:

	- Stop guest instruction execution on every vCPU, parking them in the host.
	- Request KVMCLOCK update for every vCPU.
	- Resume guest instruction execution.

The KVMCLOCK update (==pvti update) is guaranteed to be performed before 
guest instructions are executed again.

But there is no guarantee that vCPU-N has updated its pvti when
vCPU-M resumes guest instruction execution.
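
In code the flow is roughly the following (a simplified sketch of the
host side; the real logic lives in kvm_gen_update_masterclock(), and
locking and the masterclock recomputation are left out here):

        /* 1. Park every vCPU in the host: no guest entries past this point. */
        kvm_make_mclock_inprogress_request(kvm);

        /* ... recompute the master clock / decide on PVCLOCK_TSC_STABLE_BIT ... */

        /* 2. Ask every vCPU to refresh its pvti before it runs guest code again. */
        kvm_for_each_vcpu(i, vcpu, kvm)
                kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

        /*
         * 3. Allow guest entries again.  Each vCPU rewrites its own pvti
         *    only when it services KVM_REQ_CLOCK_UPDATE on its way back
         *    in, which is why vCPU-N's pvti may still be stale while
         *    vCPU-M is already running guest code.
         */
        kvm_for_each_vcpu(i, vcpu, kvm)
                clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);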

So the cost this patch removes is mainly from __getcpu (== RDTSCP?)?
Perhaps you can use Gleb's idea to stick the vcpu id into the version field?


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-05 18:56     ` Andy Lutomirski
@ 2015-01-05 19:17       ` Marcelo Tosatti
  2015-01-05 19:17       ` Marcelo Tosatti
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 77+ messages in thread
From: Marcelo Tosatti @ 2015-01-05 19:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gleb Natapov, Paolo Bonzini, linux-kernel, kvm list, xen-devel

On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
> On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
> >> The pvclock vdso code was too abstracted to understand easily and
> >> excessively paranoid.  Simplify it for a huge speedup.
> >>
> >> This opens the door for additional simplifications, as the vdso no
> >> longer accesses the pvti for any vcpu other than vcpu 0.
> >>
> >> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> >> With this change, it takes 19ns, which is almost as fast as the pure TSC
> >> implementation.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> >> ---
> >>  arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
> >>  1 file changed, 47 insertions(+), 35 deletions(-)
> >>
> >> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> >> index 9793322751e0..f2e0396d5629 100644
> >> --- a/arch/x86/vdso/vclock_gettime.c
> >> +++ b/arch/x86/vdso/vclock_gettime.c
> >> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> >>
> >>  static notrace cycle_t vread_pvclock(int *mode)
> >>  {
> >> -     const struct pvclock_vsyscall_time_info *pvti;
> >> +     const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
> >>       cycle_t ret;
> >> -     u64 last;
> >> -     u32 version;
> >> -     u8 flags;
> >> -     unsigned cpu, cpu1;
> >> -
> >> +     u64 tsc, pvti_tsc;
> >> +     u64 last, delta, pvti_system_time;
> >> +     u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
> >>
> >>       /*
> >> -      * Note: hypervisor must guarantee that:
> >> -      * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> >> -      * 2. that per-CPU pvclock time info is updated if the
> >> -      *    underlying CPU changes.
> >> -      * 3. that version is increased whenever underlying CPU
> >> -      *    changes.
> >> +      * Note: The kernel and hypervisor must guarantee that cpu ID
> >> +      * number maps 1:1 to per-CPU pvclock time info.
> >> +      *
> >> +      * Because the hypervisor is entirely unaware of guest userspace
> >> +      * preemption, it cannot guarantee that per-CPU pvclock time
> >> +      * info is updated if the underlying CPU changes or that that
> >> +      * version is increased whenever underlying CPU changes.
> >> +      *
> >> +      * On KVM, we are guaranteed that pvti updates for any vCPU are
> >> +      * atomic as seen by *all* vCPUs.  This is an even stronger
> >> +      * guarantee than we get with a normal seqlock.
> >>        *
> >> +      * On Xen, we don't appear to have that guarantee, but Xen still
> >> +      * supplies a valid seqlock using the version field.
> >> +
> >> +      * We only do pvclock vdso timing at all if
> >> +      * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> >> +      * mean that all vCPUs have matching pvti and that the TSC is
> >> +      * synced, so we can just look at vCPU 0's pvti.
> >>        */
> >
> > Can Xen guarantee that ?
> 
> I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
> at all.  I have no idea going forward, though.
> 
> Xen people?
> 
> >
> >> -     do {
> >> -             cpu = __getcpu() & VGETCPU_CPU_MASK;
> >> -             /* TODO: We can put vcpu id into higher bits of pvti.version.
> >> -              * This will save a couple of cycles by getting rid of
> >> -              * __getcpu() calls (Gleb).
> >> -              */
> >> -
> >> -             pvti = get_pvti(cpu);
> >> -
> >> -             version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> >> -
> >> -             /*
> >> -              * Test we're still on the cpu as well as the version.
> >> -              * We could have been migrated just after the first
> >> -              * vgetcpu but before fetching the version, so we
> >> -              * wouldn't notice a version change.
> >> -              */
> >> -             cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> >> -     } while (unlikely(cpu != cpu1 ||
> >> -                       (pvti->pvti.version & 1) ||
> >> -                       pvti->pvti.version != version));
> >> -
> >> -     if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> >> +
> >> +     if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
> >>               *mode = VCLOCK_NONE;
> >> +             return 0;
> >> +     }
> >
> > This check must be performed after reading a stable pvti.
> >
> 
> We can even read it in the middle, guarded by the version checks.
> I'll do that for v2.
> 
> >> +
> >> +     do {
> >> +             version = pvti->version;
> >> +
> >> +             /* This is also a read barrier, so we'll read version first. */
> >> +             rdtsc_barrier();
> >> +             tsc = __native_read_tsc();
> >> +
> >> +             pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> >> +             pvti_tsc_shift = pvti->tsc_shift;
> >> +             pvti_system_time = pvti->system_time;
> >> +             pvti_tsc = pvti->tsc_timestamp;
> >> +
> >> +             /* Make sure that the version double-check is last. */
> >> +             smp_rmb();
> >> +     } while (unlikely((version & 1) || version != pvti->version));
> >> +
> >> +     delta = tsc - pvti_tsc;
> >> +     ret = pvti_system_time +
> >> +             pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> >> +                                 pvti_tsc_shift);
> >
> > The following is possible:
> >
> > 1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
> > 1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
> > transition.
> > 2) vCPU-1 updates its pvti with new values.
> > 3) vCPU-0 still has not updated its pvti with new values.
> > 4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
> > notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.
> >
> > The update is not actually atomic across all vCPUs, it's atomic in
> > the sense of not allowing visibility of distinct
> > system_timestamp/tsc_timestamp values.
> >
> 
> Hmm.  In step 4, is there a guarantee that vCPU-0 won't VM-enter until
> it gets marked unstable? 

Yes. It will VM-enter after pvti is updated.

> Otherwise the vdso could just as
> easily be called from vCPU-1, migrated to vCPU-0, read the data
> complete with stale stable bit, and get migrated back to vCPU-1.

Right.

> But I thought that KVM currently froze all vCPUs when updating pvti
> for any of them.  How can this happen?  I admit I don't really
> understand the update request code.

The update is performed as follows:

	- Stop guest instruction execution on every vCPU, parking them in the host.
	- Request KVMCLOCK update for every vCPU.
	- Resume guest instruction execution.

The KVMCLOCK update (==pvti update) is guaranteed to be performed before 
guest instructions are executed again.

But there is no guarantee that vCPU-N has updated its pvti when
vCPU-M resumes guest instruction execution.

So the cost this patch removes is mainly from __getcpu (== RDTSCP?)?
Perhaps you can use Gleb's idea to stick the vcpu id into the version field?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-05 18:56     ` Andy Lutomirski
  2015-01-05 19:17       ` Marcelo Tosatti
  2015-01-05 19:17       ` Marcelo Tosatti
@ 2015-01-05 22:23       ` Paolo Bonzini
  2015-01-05 22:23       ` Paolo Bonzini
  2015-01-06 14:35         ` Konrad Rzeszutek Wilk
  4 siblings, 0 replies; 77+ messages in thread
From: Paolo Bonzini @ 2015-01-05 22:23 UTC (permalink / raw)
  To: Andy Lutomirski, Marcelo Tosatti
  Cc: Gleb Natapov, kvm list, linux-kernel, xen-devel



On 05/01/2015 19:56, Andy Lutomirski wrote:
>> > 1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
>> > 1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
>> > transition.
>> > 2) vCPU-1 updates its pvti with new values.
>> > 3) vCPU-0 still has not updated its pvti with new values.
>> > 4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
>> > notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.
>> >
>> > The update is not actually atomic across all vCPUs, it's atomic in
>> > the sense of not allowing visibility of distinct
>> > system_timestamp/tsc_timestamp values.
>> >
> Hmm.  In step 4, is there a guarantee that vCPU-0 won't VM-enter until
> it gets marked unstable?  Otherwise the vdso could just as
> easily be called from vCPU-1, migrated to vCPU-0, read the data
> complete with stale stable bit, and get migrated back to vCPU-1.
> 
> But I thought that KVM currently froze all vCPUs when updating pvti
> for any of them.  How can this happen?  I admit I don't really
> understand the update request code.

That was also my understanding.  I thought this was the point of
kvm_make_mclock_inprogress_request/KVM_REQ_MCLOCK_INPROGRESS.

Disabling TSC_STABLE_BIT is triggered by pvclock_gtod_update_fn but it
happens in kvm_gen_update_masterclock, and no guest entries will happen
in the meantime.
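
On the way back into the guest, the relevant checks look roughly like
this (a condensed sketch of the vcpu_enter_guest() path of that era;
error handling and most other requests are left out):

        if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu))
                kvm_guest_time_update(vcpu);    /* rewrites this vCPU's pvti */

        /*
         * KVM_REQ_MCLOCK_INPROGRESS is never consumed here, so while the
         * updater holds it set, vcpu->requests stays non-zero and the
         * entry below is aborted.  That is what keeps every vCPU out of
         * the guest until the masterclock update has finished.
         */
        if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests ||
            need_resched() || signal_pending(current)) {
                /* cancel this entry and retry */
        }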

Paolo

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-05 18:56     ` Andy Lutomirski
                         ` (2 preceding siblings ...)
  2015-01-05 22:23       ` Paolo Bonzini
@ 2015-01-05 22:23       ` Paolo Bonzini
  2015-01-06 14:35         ` Konrad Rzeszutek Wilk
  4 siblings, 0 replies; 77+ messages in thread
From: Paolo Bonzini @ 2015-01-05 22:23 UTC (permalink / raw)
  To: Andy Lutomirski, Marcelo Tosatti
  Cc: Gleb Natapov, xen-devel, linux-kernel, kvm list



On 05/01/2015 19:56, Andy Lutomirski wrote:
>> > 1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
>> > 1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
>> > transition.
>> > 2) vCPU-1 updates its pvti with new values.
>> > 3) vCPU-0 still has not updated its pvti with new values.
>> > 4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
>> > notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.
>> >
>> > The update is not actually atomic across all vCPUs, it's atomic in
>> > the sense of not allowing visibility of distinct
>> > system_timestamp/tsc_timestamp values.
>> >
> Hmm.  In step 4, is there a guarantee that vCPU-0 won't VM-enter until
> it gets marked unstable?  Otherwise the vdso could just as
> easily be called from vCPU-1, migrated to vCPU-0, read the data
> complete with stale stable bit, and get migrated back to vCPU-1.
> 
> But I thought that KVM currently froze all vCPUs when updating pvti
> for any of them.  How can this happen?  I admit I don't really
> understand the update request code.

That was also my understanding.  I thought this was the point of
kvm_make_mclock_inprogress_request/KVM_REQ_MCLOCK_INPROGRESS.

Disabling TSC_STABLE_BIT is triggered by pvclock_gtod_update_fn but it
happens in kvm_gen_update_masterclock, and no guest entries will happen
in the meantime.

Paolo

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-05 19:17       ` Marcelo Tosatti
@ 2015-01-05 22:38         ` Andy Lutomirski
  2015-01-05 22:48           ` Marcelo Tosatti
  2015-01-05 22:48           ` Marcelo Tosatti
  2015-01-05 22:38         ` Andy Lutomirski
                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 77+ messages in thread
From: Andy Lutomirski @ 2015-01-05 22:38 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, Gleb Natapov, kvm list, linux-kernel, xen-devel

On Mon, Jan 5, 2015 at 11:17 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
>> On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
>> >> The pvclock vdso code was too abstracted to understand easily and
>> >> excessively paranoid.  Simplify it for a huge speedup.
>> >>
>> >> This opens the door for additional simplifications, as the vdso no
>> >> longer accesses the pvti for any vcpu other than vcpu 0.
>> >>
>> >> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
>> >> With this change, it takes 19ns, which is almost as fast as the pure TSC
>> >> implementation.
>> >>
>> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> >> ---
>> >>  arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
>> >>  1 file changed, 47 insertions(+), 35 deletions(-)
>> >>
>> >> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>> >> index 9793322751e0..f2e0396d5629 100644
>> >> --- a/arch/x86/vdso/vclock_gettime.c
>> >> +++ b/arch/x86/vdso/vclock_gettime.c
>> >> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>> >>
>> >>  static notrace cycle_t vread_pvclock(int *mode)
>> >>  {
>> >> -     const struct pvclock_vsyscall_time_info *pvti;
>> >> +     const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>> >>       cycle_t ret;
>> >> -     u64 last;
>> >> -     u32 version;
>> >> -     u8 flags;
>> >> -     unsigned cpu, cpu1;
>> >> -
>> >> +     u64 tsc, pvti_tsc;
>> >> +     u64 last, delta, pvti_system_time;
>> >> +     u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>> >>
>> >>       /*
>> >> -      * Note: hypervisor must guarantee that:
>> >> -      * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
>> >> -      * 2. that per-CPU pvclock time info is updated if the
>> >> -      *    underlying CPU changes.
>> >> -      * 3. that version is increased whenever underlying CPU
>> >> -      *    changes.
>> >> +      * Note: The kernel and hypervisor must guarantee that cpu ID
>> >> +      * number maps 1:1 to per-CPU pvclock time info.
>> >> +      *
>> >> +      * Because the hypervisor is entirely unaware of guest userspace
>> >> +      * preemption, it cannot guarantee that per-CPU pvclock time
>> >> +      * info is updated if the underlying CPU changes or that that
>> >> +      * version is increased whenever underlying CPU changes.
>> >> +      *
>> >> +      * On KVM, we are guaranteed that pvti updates for any vCPU are
>> >> +      * atomic as seen by *all* vCPUs.  This is an even stronger
>> >> +      * guarantee than we get with a normal seqlock.
>> >>        *
>> >> +      * On Xen, we don't appear to have that guarantee, but Xen still
>> >> +      * supplies a valid seqlock using the version field.
>> >> +
>> >> +      * We only do pvclock vdso timing at all if
>> >> +      * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
>> >> +      * mean that all vCPUs have matching pvti and that the TSC is
>> >> +      * synced, so we can just look at vCPU 0's pvti.
>> >>        */
>> >
>> > Can Xen guarantee that ?
>>
>> I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
>> at all.  I have no idea going forward, though.
>>
>> Xen people?
>>
>> >
>> >> -     do {
>> >> -             cpu = __getcpu() & VGETCPU_CPU_MASK;
>> >> -             /* TODO: We can put vcpu id into higher bits of pvti.version.
>> >> -              * This will save a couple of cycles by getting rid of
>> >> -              * __getcpu() calls (Gleb).
>> >> -              */
>> >> -
>> >> -             pvti = get_pvti(cpu);
>> >> -
>> >> -             version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
>> >> -
>> >> -             /*
>> >> -              * Test we're still on the cpu as well as the version.
>> >> -              * We could have been migrated just after the first
>> >> -              * vgetcpu but before fetching the version, so we
>> >> -              * wouldn't notice a version change.
>> >> -              */
>> >> -             cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>> >> -     } while (unlikely(cpu != cpu1 ||
>> >> -                       (pvti->pvti.version & 1) ||
>> >> -                       pvti->pvti.version != version));
>> >> -
>> >> -     if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
>> >> +
>> >> +     if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>> >>               *mode = VCLOCK_NONE;
>> >> +             return 0;
>> >> +     }
>> >
>> > This check must be performed after reading a stable pvti.
>> >
>>
>> We can even read it in the middle, guarded by the version checks.
>> I'll do that for v2.
>>
>> >> +
>> >> +     do {
>> >> +             version = pvti->version;
>> >> +
>> >> +             /* This is also a read barrier, so we'll read version first. */
>> >> +             rdtsc_barrier();
>> >> +             tsc = __native_read_tsc();
>> >> +
>> >> +             pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
>> >> +             pvti_tsc_shift = pvti->tsc_shift;
>> >> +             pvti_system_time = pvti->system_time;
>> >> +             pvti_tsc = pvti->tsc_timestamp;
>> >> +
>> >> +             /* Make sure that the version double-check is last. */
>> >> +             smp_rmb();
>> >> +     } while (unlikely((version & 1) || version != pvti->version));
>> >> +
>> >> +     delta = tsc - pvti_tsc;
>> >> +     ret = pvti_system_time +
>> >> +             pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
>> >> +                                 pvti_tsc_shift);
>> >
>> > The following is possible:
>> >
>> > 1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
>> > 1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
>> > transition.
>> > 2) vCPU-1 updates its pvti with new values.
>> > 3) vCPU-0 still has not updated its pvti with new values.
>> > 4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
>> > notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.
>> >
>> > The update is not actually atomic across all vCPUs, it's atomic in
>> > the sense of not allowing visibility of distinct
>> > system_timestamp/tsc_timestamp values.
>> >
>>
>> Hmm.  In step 4, is there a guarantee that vCPU-0 won't VM-enter until
>> it gets marked unstable?
>
> Yes. It will VM-enter after pvti is updated.
>
>> Otherwise the vdso could just as
>> easily be called from vCPU-1, migrated to vCPU-0, read the data
>> complete with stale stable bit, and get migrated back to vCPU-1.
>
> Right.
>
>> But I thought that KVM currently froze all vCPUs when updating pvti
>> for any of them.  How can this happen?  I admit I don't really
>> understand the update request code.
>
> The update is performed as follows:
>
>         - Stop guest instruction execution on every vCPU, parking them in the host.
>         - Request KVMCLOCK update for every vCPU.
>         - Resume guest instruction execution.
>
> The KVMCLOCK update (==pvti update) is guaranteed to be performed before
> guest instructions are executed again.
>
> But there is no guarantee that vCPU-N has updated its pvti when
> vCPU-M resumes guest instruction execution.

Still confused.  So we can freeze all vCPUs in the host, then update
pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
doesn't increment the version pre-update, and we can return completely
bogus results.
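
For concreteness, the kind of writer the reader side is implicitly relying
on would have to bump version to odd before touching the fields and back to
even afterwards, so a torn read is always caught.  A hypothetical sketch --
pvti_update() is not an existing function, and per the above it is not what
the KVM update path does today:

static void pvti_update(struct pvclock_vcpu_time_info *pvti,
			u64 tsc_timestamp, u64 system_time,
			u32 tsc_to_system_mul, s8 tsc_shift, u8 flags)
{
	pvti->version++;		/* now odd: readers retry */
	smp_wmb();

	pvti->tsc_timestamp = tsc_timestamp;
	pvti->system_time = system_time;
	pvti->tsc_to_system_mul = tsc_to_system_mul;
	pvti->tsc_shift = tsc_shift;
	pvti->flags = flags;

	smp_wmb();
	pvti->version++;		/* even again: fields consistent */
}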

>
> So the cost this patch removes is mainly from __getcpu (==RDTSCP?) ?

It removes a whole bunch of code, an extra barrier, and two __getcpus.

> Perhaps you can use Gleb's idea to stick vcpu id into version field ?

I don't understand how that's useful at all.  If you're reading pvti,
you clearly know the vcpu id.

--Andy

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-05 22:38         ` Andy Lutomirski
@ 2015-01-05 22:48           ` Marcelo Tosatti
  2015-01-05 22:53             ` Andy Lutomirski
                               ` (2 more replies)
  2015-01-05 22:48           ` Marcelo Tosatti
  1 sibling, 3 replies; 77+ messages in thread
From: Marcelo Tosatti @ 2015-01-05 22:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, Gleb Natapov, kvm list, linux-kernel, xen-devel

On Mon, Jan 05, 2015 at 02:38:46PM -0800, Andy Lutomirski wrote:
> On Mon, Jan 5, 2015 at 11:17 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
> >> On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> > On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
> >> >> The pvclock vdso code was too abstracted to understand easily and
> >> >> excessively paranoid.  Simplify it for a huge speedup.
> >> >>
> >> >> This opens the door for additional simplifications, as the vdso no
> >> >> longer accesses the pvti for any vcpu other than vcpu 0.
> >> >>
> >> >> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> >> >> With this change, it takes 19ns, which is almost as fast as the pure TSC
> >> >> implementation.
> >> >>
> >> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> >> >> ---
> >> >>  arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
> >> >>  1 file changed, 47 insertions(+), 35 deletions(-)
> >> >>
> >> >> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> >> >> index 9793322751e0..f2e0396d5629 100644
> >> >> --- a/arch/x86/vdso/vclock_gettime.c
> >> >> +++ b/arch/x86/vdso/vclock_gettime.c
> >> >> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> >> >>
> >> >>  static notrace cycle_t vread_pvclock(int *mode)
> >> >>  {
> >> >> -     const struct pvclock_vsyscall_time_info *pvti;
> >> >> +     const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
> >> >>       cycle_t ret;
> >> >> -     u64 last;
> >> >> -     u32 version;
> >> >> -     u8 flags;
> >> >> -     unsigned cpu, cpu1;
> >> >> -
> >> >> +     u64 tsc, pvti_tsc;
> >> >> +     u64 last, delta, pvti_system_time;
> >> >> +     u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
> >> >>
> >> >>       /*
> >> >> -      * Note: hypervisor must guarantee that:
> >> >> -      * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> >> >> -      * 2. that per-CPU pvclock time info is updated if the
> >> >> -      *    underlying CPU changes.
> >> >> -      * 3. that version is increased whenever underlying CPU
> >> >> -      *    changes.
> >> >> +      * Note: The kernel and hypervisor must guarantee that cpu ID
> >> >> +      * number maps 1:1 to per-CPU pvclock time info.
> >> >> +      *
> >> >> +      * Because the hypervisor is entirely unaware of guest userspace
> >> >> +      * preemption, it cannot guarantee that per-CPU pvclock time
> >> >> +      * info is updated if the underlying CPU changes or that that
> >> >> +      * version is increased whenever underlying CPU changes.
> >> >> +      *
> >> >> +      * On KVM, we are guaranteed that pvti updates for any vCPU are
> >> >> +      * atomic as seen by *all* vCPUs.  This is an even stronger
> >> >> +      * guarantee than we get with a normal seqlock.
> >> >>        *
> >> >> +      * On Xen, we don't appear to have that guarantee, but Xen still
> >> >> +      * supplies a valid seqlock using the version field.
> >> >> +
> >> >> +      * We only do pvclock vdso timing at all if
> >> >> +      * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> >> >> +      * mean that all vCPUs have matching pvti and that the TSC is
> >> >> +      * synced, so we can just look at vCPU 0's pvti.
> >> >>        */
> >> >
> >> > Can Xen guarantee that ?
> >>
> >> I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
> >> at all.  I have no idea going forward, though.
> >>
> >> Xen people?
> >>
> >> >
> >> >> -     do {
> >> >> -             cpu = __getcpu() & VGETCPU_CPU_MASK;
> >> >> -             /* TODO: We can put vcpu id into higher bits of pvti.version.
> >> >> -              * This will save a couple of cycles by getting rid of
> >> >> -              * __getcpu() calls (Gleb).
> >> >> -              */
> >> >> -
> >> >> -             pvti = get_pvti(cpu);
> >> >> -
> >> >> -             version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> >> >> -
> >> >> -             /*
> >> >> -              * Test we're still on the cpu as well as the version.
> >> >> -              * We could have been migrated just after the first
> >> >> -              * vgetcpu but before fetching the version, so we
> >> >> -              * wouldn't notice a version change.
> >> >> -              */
> >> >> -             cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> >> >> -     } while (unlikely(cpu != cpu1 ||
> >> >> -                       (pvti->pvti.version & 1) ||
> >> >> -                       pvti->pvti.version != version));
> >> >> -
> >> >> -     if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> >> >> +
> >> >> +     if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
> >> >>               *mode = VCLOCK_NONE;
> >> >> +             return 0;
> >> >> +     }
> >> >
> >> > This check must be performed after reading a stable pvti.
> >> >
> >>
> >> We can even read it in the middle, guarded by the version checks.
> >> I'll do that for v2.
> >>
> >> >> +
> >> >> +     do {
> >> >> +             version = pvti->version;
> >> >> +
> >> >> +             /* This is also a read barrier, so we'll read version first. */
> >> >> +             rdtsc_barrier();
> >> >> +             tsc = __native_read_tsc();
> >> >> +
> >> >> +             pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> >> >> +             pvti_tsc_shift = pvti->tsc_shift;
> >> >> +             pvti_system_time = pvti->system_time;
> >> >> +             pvti_tsc = pvti->tsc_timestamp;
> >> >> +
> >> >> +             /* Make sure that the version double-check is last. */
> >> >> +             smp_rmb();
> >> >> +     } while (unlikely((version & 1) || version != pvti->version));
> >> >> +
> >> >> +     delta = tsc - pvti_tsc;
> >> >> +     ret = pvti_system_time +
> >> >> +             pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> >> >> +                                 pvti_tsc_shift);
> >> >
> >> > The following is possible:
> >> >
> >> > 1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
> >> > 1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
> >> > transition.
> >> > 2) vCPU-1 updates its pvti with new values.
> >> > 3) vCPU-0 still has not updated its pvti with new values.
> >> > 4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
> >> > notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.
> >> >
> >> > The update is not actually atomic across all vCPUs, it's atomic in
> >> > the sense of not allowing visibility of distinct
> >> > system_timestamp/tsc_timestamp values.
> >> >
> >>
> >> Hmm.  In step 4, is there a guarantee that vCPU-0 won't VM-enter until
> >> it gets marked unstable?
> >
> > Yes. It will VM-enter after pvti is updated.
> >
> >> Otherwise the vdso could just as
> >> easily be called from vCPU-1, migrated to vCPU-0, read the data
> >> complete with stale stable bit, and get migrated back to vCPU-1.
> >
> > Right.
> >
> >> But I thought that KVM currently froze all vCPUs when updating pvti
> >> for any of them.  How can this happen?  I admit I don't really
> >> understand the update request code.
> >
> > The update is performed as follows:
> >
> >         - Stop guest instruction execution on every vCPU, parking them in the host.
> >         - Request KVMCLOCK update for every vCPU.
> >         - Resume guest instruction execution.
> >
> > The KVMCLOCK update (==pvti update) is guaranteed to be performed before
> > guest instructions are executed again.
> >
> > But there is no guarantee that vCPU-N has updated its pvti when
> > vCPU-M resumes guest instruction execution.
> 
> Still confused.  So we can freeze all vCPUs in the host, then update
> pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
> a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
> doesn't increment the version pre-update, and we can return completely
> bogus results.

Yes.

> > So the cost this patch removes is mainly from __getcpu (==RDTSCP?) ?
> 
> It removes a whole bunch of code, an extra barrier, and two __getcpus.
> 
> > Perhaps you can use Gleb's idea to stick vcpu id into version field ?
> 
> I don't understand how that's useful at all.  If you're reading pvti,
> you clearly know the vcpu id.

Replace the return value of __getcpu() with the value read from pvti.version.
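
Purely to illustrate that idea (nothing like this exists today; the layout
below is made up): reserve the high bits of pvti->version for the owning
vCPU id and keep the low bits as the seqlock counter, e.g.

#define PVTI_SEQ_MASK	0x00ffffffu	/* low bits: seqlock counter */
#define PVTI_VCPU_SHIFT	24		/* high bits: owning vCPU id */

static u32 pvti_read_version(const struct pvclock_vcpu_time_info *pvti,
			     unsigned int *vcpu)
{
	u32 raw = pvti->version;

	*vcpu = raw >> PVTI_VCPU_SHIFT;
	return raw & PVTI_SEQ_MASK;	/* use for the usual odd/changed checks */
}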


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-05 22:48           ` Marcelo Tosatti
@ 2015-01-05 22:53             ` Andy Lutomirski
  2015-01-05 22:53             ` Andy Lutomirski
  2015-01-06  8:42               ` Paolo Bonzini
  2 siblings, 0 replies; 77+ messages in thread
From: Andy Lutomirski @ 2015-01-05 22:53 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, Gleb Natapov, kvm list, linux-kernel, xen-devel

On Mon, Jan 5, 2015 at 2:48 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Mon, Jan 05, 2015 at 02:38:46PM -0800, Andy Lutomirski wrote:
>> On Mon, Jan 5, 2015 at 11:17 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
>> >> On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> >> > On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
>> >> >> The pvclock vdso code was too abstracted to understand easily and
>> >> >> excessively paranoid.  Simplify it for a huge speedup.
>> >> >>
>> >> >> This opens the door for additional simplifications, as the vdso no
>> >> >> longer accesses the pvti for any vcpu other than vcpu 0.
>> >> >>
>> >> >> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
>> >> >> With this change, it takes 19ns, which is almost as fast as the pure TSC
>> >> >> implementation.
>> >> >>
>> >> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> >> >> ---
>> >> >>  arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
>> >> >>  1 file changed, 47 insertions(+), 35 deletions(-)
>> >> >>
>> >> >> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>> >> >> index 9793322751e0..f2e0396d5629 100644
>> >> >> --- a/arch/x86/vdso/vclock_gettime.c
>> >> >> +++ b/arch/x86/vdso/vclock_gettime.c
>> >> >> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>> >> >>
>> >> >>  static notrace cycle_t vread_pvclock(int *mode)
>> >> >>  {
>> >> >> -     const struct pvclock_vsyscall_time_info *pvti;
>> >> >> +     const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>> >> >>       cycle_t ret;
>> >> >> -     u64 last;
>> >> >> -     u32 version;
>> >> >> -     u8 flags;
>> >> >> -     unsigned cpu, cpu1;
>> >> >> -
>> >> >> +     u64 tsc, pvti_tsc;
>> >> >> +     u64 last, delta, pvti_system_time;
>> >> >> +     u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>> >> >>
>> >> >>       /*
>> >> >> -      * Note: hypervisor must guarantee that:
>> >> >> -      * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
>> >> >> -      * 2. that per-CPU pvclock time info is updated if the
>> >> >> -      *    underlying CPU changes.
>> >> >> -      * 3. that version is increased whenever underlying CPU
>> >> >> -      *    changes.
>> >> >> +      * Note: The kernel and hypervisor must guarantee that cpu ID
>> >> >> +      * number maps 1:1 to per-CPU pvclock time info.
>> >> >> +      *
>> >> >> +      * Because the hypervisor is entirely unaware of guest userspace
>> >> >> +      * preemption, it cannot guarantee that per-CPU pvclock time
>> >> >> +      * info is updated if the underlying CPU changes or that that
>> >> >> +      * version is increased whenever underlying CPU changes.
>> >> >> +      *
>> >> >> +      * On KVM, we are guaranteed that pvti updates for any vCPU are
>> >> >> +      * atomic as seen by *all* vCPUs.  This is an even stronger
>> >> >> +      * guarantee than we get with a normal seqlock.
>> >> >>        *
>> >> >> +      * On Xen, we don't appear to have that guarantee, but Xen still
>> >> >> +      * supplies a valid seqlock using the version field.
>> >> >> +
>> >> >> +      * We only do pvclock vdso timing at all if
>> >> >> +      * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
>> >> >> +      * mean that all vCPUs have matching pvti and that the TSC is
>> >> >> +      * synced, so we can just look at vCPU 0's pvti.
>> >> >>        */
>> >> >
>> >> > Can Xen guarantee that ?
>> >>
>> >> I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
>> >> at all.  I have no idea going forward, though.
>> >>
>> >> Xen people?
>> >>
>> >> >
>> >> >> -     do {
>> >> >> -             cpu = __getcpu() & VGETCPU_CPU_MASK;
>> >> >> -             /* TODO: We can put vcpu id into higher bits of pvti.version.
>> >> >> -              * This will save a couple of cycles by getting rid of
>> >> >> -              * __getcpu() calls (Gleb).
>> >> >> -              */
>> >> >> -
>> >> >> -             pvti = get_pvti(cpu);
>> >> >> -
>> >> >> -             version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
>> >> >> -
>> >> >> -             /*
>> >> >> -              * Test we're still on the cpu as well as the version.
>> >> >> -              * We could have been migrated just after the first
>> >> >> -              * vgetcpu but before fetching the version, so we
>> >> >> -              * wouldn't notice a version change.
>> >> >> -              */
>> >> >> -             cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>> >> >> -     } while (unlikely(cpu != cpu1 ||
>> >> >> -                       (pvti->pvti.version & 1) ||
>> >> >> -                       pvti->pvti.version != version));
>> >> >> -
>> >> >> -     if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
>> >> >> +
>> >> >> +     if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>> >> >>               *mode = VCLOCK_NONE;
>> >> >> +             return 0;
>> >> >> +     }
>> >> >
>> >> > This check must be performed after reading a stable pvti.
>> >> >
>> >>
>> >> We can even read it in the middle, guarded by the version checks.
>> >> I'll do that for v2.
>> >>
>> >> >> +
>> >> >> +     do {
>> >> >> +             version = pvti->version;
>> >> >> +
>> >> >> +             /* This is also a read barrier, so we'll read version first. */
>> >> >> +             rdtsc_barrier();
>> >> >> +             tsc = __native_read_tsc();
>> >> >> +
>> >> >> +             pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
>> >> >> +             pvti_tsc_shift = pvti->tsc_shift;
>> >> >> +             pvti_system_time = pvti->system_time;
>> >> >> +             pvti_tsc = pvti->tsc_timestamp;
>> >> >> +
>> >> >> +             /* Make sure that the version double-check is last. */
>> >> >> +             smp_rmb();
>> >> >> +     } while (unlikely((version & 1) || version != pvti->version));
>> >> >> +
>> >> >> +     delta = tsc - pvti_tsc;
>> >> >> +     ret = pvti_system_time +
>> >> >> +             pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
>> >> >> +                                 pvti_tsc_shift);
>> >> >
>> >> > The following is possible:
>> >> >
>> >> > 1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
>> >> > 1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
>> >> > transition.
>> >> > 2) vCPU-1 updates its pvti with new values.
>> >> > 3) vCPU-0 still has not updated its pvti with new values.
>> >> > 4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
>> >> > notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.
>> >> >
>> >> > The update is not actually atomic across all vCPUs, it's atomic in
>> >> > the sense of not allowing visibility of distinct
>> >> > system_timestamp/tsc_timestamp values.
>> >> >
>> >>
>> >> Hmm.  In step 4, is there a guarantee that vCPU-0 won't VM-enter until
>> >> it gets marked unstable?
>> >
>> > Yes. It will VM-enter after pvti is updated.
>> >
>> >> Otherwise the vdso could just as
>> >> easily be called from vCPU-1, migrated to vCPU-0, read the data
>> >> complete with stale stable bit, and get migrated back to vCPU-1.
>> >
>> > Right.
>> >
>> >> But I thought that KVM currently froze all vCPUs when updating pvti
>> >> for any of them.  How can this happen?  I admit I don't really
>> >> understand the update request code.
>> >
>> > The update is performed as follows:
>> >
>> >         - Stop guest instruction execution on every vCPU, parking them in the host.
>> >         - Request KVMCLOCK update for every vCPU.
>> >         - Resume guest instruction execution.
>> >
>> > The KVMCLOCK update (==pvti update) is guaranteed to be performed before
>> > guest instructions are executed again.
>> >
>> > But there is no guarantee that vCPU-N has updated its pvti when
>> > vCPU-M resumes guest instruction execution.
>>
>> Still confused.  So we can freeze all vCPUs in the host, then update
>> pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
>> a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
>> doesn't increment the version pre-update, and we can return completely
>> bogus results.
>
> Yes.

But then both the current code and my code are broken, right?  There
is no guarantee that the vdso only ever reads the pvti corresponding
to the vcpu it's running on.

>
>> > So the cost this patch removes is mainly from __getcpu (==RDTSCP?) ?
>>
>> It removes a whole bunch of code, an extra barrier, and two __getcpus.
>>
>> > Perhaps you can use Gleb's idea to stick vcpu id into version field ?
>>
>> I don't understand how that's useful at all.  If you're reading pvti,
>> you clearly know the vcpu id.
>
> Replace the return value of __getcpu() with the value read from pvti.version.
>

Huh?  What pvti is the vdso supposed to read to get the vcpu number
out of the version field?

--Andy

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-05 19:17       ` Marcelo Tosatti
                           ` (2 preceding siblings ...)
  2015-01-06  8:39         ` Paolo Bonzini
@ 2015-01-06  8:39         ` Paolo Bonzini
  3 siblings, 0 replies; 77+ messages in thread
From: Paolo Bonzini @ 2015-01-06  8:39 UTC (permalink / raw)
  To: Marcelo Tosatti, Andy Lutomirski
  Cc: Gleb Natapov, kvm list, linux-kernel, xen-devel



On 05/01/2015 20:17, Marcelo Tosatti wrote:
> But there is no guarantee that vCPU-N has updated its pvti when
> vCPU-M resumes guest instruction execution.

You're right.

> So the cost this patch removes is mainly from __getcpu (==RDTSCP?) ?
> Perhaps you can use Gleb's idea to stick vcpu id into version field ?

Or just replace __getcpu with rdtscp.
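
A minimal sketch of that, assuming the guest kernel keeps the CPU number in
IA32_TSC_AUX (as the vgetcpu code does when RDTSCP is available) and reusing
__native_read_tscp() from <asm/msr.h> -- not a tested patch:

static notrace unsigned int __getcpu_rdtscp(u64 *tsc)
{
	unsigned int aux;

	/* RDTSCP reads the TSC and IA32_TSC_AUX in one instruction, so the
	   CPU number and the timestamp cannot come from different CPUs. */
	*tsc = __native_read_tscp(&aux);
	return aux & VGETCPU_CPU_MASK;
}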

Paolo

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-05 22:48           ` Marcelo Tosatti
@ 2015-01-06  8:42               ` Paolo Bonzini
  2015-01-05 22:53             ` Andy Lutomirski
  2015-01-06  8:42               ` Paolo Bonzini
  2 siblings, 0 replies; 77+ messages in thread
From: Paolo Bonzini @ 2015-01-06  8:42 UTC (permalink / raw)
  To: Marcelo Tosatti, Andy Lutomirski
  Cc: Gleb Natapov, kvm list, linux-kernel, xen-devel



On 05/01/2015 23:48, Marcelo Tosatti wrote:
>>> > > But there is no guarantee that vCPU-N has updated its pvti when
>>> > > vCPU-M resumes guest instruction execution.
>> > 
>> > Still confused.  So we can freeze all vCPUs in the host, then update
>> > pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
>> > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
>> > doesn't increment the version pre-update, and we can return completely
>> > bogus results.
> Yes.

But then the getcpu test would fail (1->0).  Even if you have an ABA
situation (1->0->1), it's okay because the pvti that is fetched is the
one returned by the first getcpu.

Paolo

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-06  8:42               ` Paolo Bonzini
  (?)
@ 2015-01-06 12:01               ` Paolo Bonzini
  2015-01-06 16:56                 ` Andy Lutomirski
  2015-01-06 16:56                 ` Andy Lutomirski
  -1 siblings, 2 replies; 77+ messages in thread
From: Paolo Bonzini @ 2015-01-06 12:01 UTC (permalink / raw)
  To: Marcelo Tosatti, Andy Lutomirski
  Cc: Gleb Natapov, xen-devel, linux-kernel, kvm list



On 06/01/2015 09:42, Paolo Bonzini wrote:
> > > Still confused.  So we can freeze all vCPUs in the host, then update
> > > pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
> > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
> > > doesn't increment the version pre-update, and we can return completely
> > > bogus results.
> > Yes.
> But then the getcpu test would fail (1->0).  Even if you have an ABA
> situation (1->0->1), it's okay because the pvti that is fetched is the
> one returned by the first getcpu.

... this case of partial update of pvti, which is caught by the version
field, is of course different from the other (extremely unlikely) one that
Andy pointed out.  That is when the getcpus are done on the same vCPU,
but the rdtsc is on another.

That one can be fixed by rdtscp, like

do {
    // get a consistent (pvti, v, tsc) tuple
    do {
        cpu = get_cpu();
        pvti = get_pvti(cpu);
        v = pvti->version & ~1;
        // also acts as rmb();
        rdtsc_barrier();
        tsc = rdtscp(&cpu1);
        // control dependency, no need for rdtsc_barrier?
    } while(cpu != cpu1);

    // ... compute nanoseconds from pvti and tsc ...
    rmb();
}   while(v != pvti->version);
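
A C rendering of the loop above, as a sketch only.  get_cpu(), get_pvti()
and rdtscp() are hypothetical stand-ins for the vdso plumbing rather than
real APIs; pvclock_scale_delta() is the existing pvclock helper.

/* Sketch: rdtscp-based pvclock read with the vcpu and version re-checks. */
static u64 pvclock_read_sketch(void)
{
	const struct pvclock_vcpu_time_info *pvti;
	unsigned int cpu, cpu1;
	u64 tsc, ns;
	u32 v;

	do {
		/* Get a consistent (pvti, version, tsc) tuple. */
		do {
			cpu = get_cpu();
			pvti = get_pvti(cpu);
			v = pvti->version & ~1;	/* even: no update in flight */
			rdtsc_barrier();	/* read version before the TSC */
			tsc = rdtscp(&cpu1);	/* TSC plus the cpu it ran on */
		} while (cpu != cpu1);		/* migrated between the reads */

		ns = pvti->system_time +
		     pvclock_scale_delta(tsc - pvti->tsc_timestamp,
					 pvti->tsc_to_system_mul,
					 pvti->tsc_shift);
		rmb();				/* re-check the version last */
	} while (v != pvti->version);

	return ns;
}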

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Xen-devel] [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-05 18:56     ` Andy Lutomirski
@ 2015-01-06 14:35         ` Konrad Rzeszutek Wilk
  2015-01-05 19:17       ` Marcelo Tosatti
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 77+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-01-06 14:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Marcelo Tosatti, Gleb Natapov, Paolo Bonzini, linux-kernel,
	kvm list, xen-devel

On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
> On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
> >> The pvclock vdso code was too abstracted to understand easily and
> >> excessively paranoid.  Simplify it for a huge speedup.
> >>
> >> This opens the door for additional simplifications, as the vdso no
> >> longer accesses the pvti for any vcpu other than vcpu 0.
> >>
> >> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> >> With this change, it takes 19ns, which is almost as fast as the pure TSC
> >> implementation.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> >> ---
> >>  arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
> >>  1 file changed, 47 insertions(+), 35 deletions(-)
> >>
> >> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> >> index 9793322751e0..f2e0396d5629 100644
> >> --- a/arch/x86/vdso/vclock_gettime.c
> >> +++ b/arch/x86/vdso/vclock_gettime.c
> >> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> >>
> >>  static notrace cycle_t vread_pvclock(int *mode)
> >>  {
> >> -     const struct pvclock_vsyscall_time_info *pvti;
> >> +     const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
> >>       cycle_t ret;
> >> -     u64 last;
> >> -     u32 version;
> >> -     u8 flags;
> >> -     unsigned cpu, cpu1;
> >> -
> >> +     u64 tsc, pvti_tsc;
> >> +     u64 last, delta, pvti_system_time;
> >> +     u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
> >>
> >>       /*
> >> -      * Note: hypervisor must guarantee that:
> >> -      * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> >> -      * 2. that per-CPU pvclock time info is updated if the
> >> -      *    underlying CPU changes.
> >> -      * 3. that version is increased whenever underlying CPU
> >> -      *    changes.
> >> +      * Note: The kernel and hypervisor must guarantee that cpu ID
> >> +      * number maps 1:1 to per-CPU pvclock time info.
> >> +      *
> >> +      * Because the hypervisor is entirely unaware of guest userspace
> >> +      * preemption, it cannot guarantee that per-CPU pvclock time
> >> +      * info is updated if the underlying CPU changes or that that
> >> +      * version is increased whenever underlying CPU changes.
> >> +      *
> >> +      * On KVM, we are guaranteed that pvti updates for any vCPU are
> >> +      * atomic as seen by *all* vCPUs.  This is an even stronger
> >> +      * guarantee than we get with a normal seqlock.
> >>        *
> >> +      * On Xen, we don't appear to have that guarantee, but Xen still
> >> +      * supplies a valid seqlock using the version field.
> >> +
> >> +      * We only do pvclock vdso timing at all if
> >> +      * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> >> +      * mean that all vCPUs have matching pvti and that the TSC is
> >> +      * synced, so we can just look at vCPU 0's pvti.
> >>        */
> >
> > Can Xen guarantee that ?
> 
> I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
> at all.  I have no idea going forward, though.
> 
> Xen people?

The person who would know off the top of his head is Dan Magenheimer, who
is now enjoying retirement :-(

I will have to dig into the code to answer this - that will take a bit of time,
sadly (I am sick this week).
> 
> >
> >> -     do {
> >> -             cpu = __getcpu() & VGETCPU_CPU_MASK;
> >> -             /* TODO: We can put vcpu id into higher bits of pvti.version.
> >> -              * This will save a couple of cycles by getting rid of
> >> -              * __getcpu() calls (Gleb).
> >> -              */
> >> -
> >> -             pvti = get_pvti(cpu);
> >> -
> >> -             version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> >> -
> >> -             /*
> >> -              * Test we're still on the cpu as well as the version.
> >> -              * We could have been migrated just after the first
> >> -              * vgetcpu but before fetching the version, so we
> >> -              * wouldn't notice a version change.
> >> -              */
> >> -             cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> >> -     } while (unlikely(cpu != cpu1 ||
> >> -                       (pvti->pvti.version & 1) ||
> >> -                       pvti->pvti.version != version));
> >> -
> >> -     if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> >> +
> >> +     if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
> >>               *mode = VCLOCK_NONE;
> >> +             return 0;
> >> +     }
> >
> > This check must be performed after reading a stable pvti.
> >
> 
> We can even read it in the middle, guarded by the version checks.
> I'll do that for v2.
> 
> >> +
> >> +     do {
> >> +             version = pvti->version;
> >> +
> >> +             /* This is also a read barrier, so we'll read version first. */
> >> +             rdtsc_barrier();
> >> +             tsc = __native_read_tsc();
> >> +
> >> +             pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> >> +             pvti_tsc_shift = pvti->tsc_shift;
> >> +             pvti_system_time = pvti->system_time;
> >> +             pvti_tsc = pvti->tsc_timestamp;
> >> +
> >> +             /* Make sure that the version double-check is last. */
> >> +             smp_rmb();
> >> +     } while (unlikely((version & 1) || version != pvti->version));
> >> +
> >> +     delta = tsc - pvti_tsc;
> >> +     ret = pvti_system_time +
> >> +             pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> >> +                                 pvti_tsc_shift);
> >
> > The following is possible:
> >
> > 1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
> > 1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
> > transition.
> > 2) vCPU-1 updates its pvti with new values.
> > 3) vCPU-0 still has not updated its pvti with new values.
> > 4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
> > notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.
> >
> > The update is not actually atomic across all vCPUs, its atomic in
> > the sense of not allowing visibility of distinct
> > system_timestamp/tsc_timestamp values.
> >
> 
> Hmm.  In step 4, is there a guarantee that vCPU-0 won't VM-enter until
> it gets marked unstable?  Otherwise the vdso could could just as
> easily be called from vCPU-1, migrated to vCPU-0, read the data
> complete with stale stable bit, and get migrated back to vCPU-1.
> 
> But I thought that KVM currently froze all vCPUs when updating pvti
> for any of them.  How can this happen?  I admit I don't really
> understand the update request code.
> 
> --Andy
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-06 12:01               ` Paolo Bonzini
  2015-01-06 16:56                 ` Andy Lutomirski
@ 2015-01-06 16:56                 ` Andy Lutomirski
  2015-01-06 18:13                   ` Marcelo Tosatti
                                     ` (3 more replies)
  1 sibling, 4 replies; 77+ messages in thread
From: Andy Lutomirski @ 2015-01-06 16:56 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: xen-devel, linux-kernel, kvm list, Gleb Natapov, Marcelo Tosatti

On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
>
>
>
> On 06/01/2015 09:42, Paolo Bonzini wrote:
> > > > Still confused.  So we can freeze all vCPUs in the host, then update
> > > > pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
> > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
> > > > doesn't increment the version pre-update, and we can return completely
> > > > bogus results.
> > > Yes.
> > But then the getcpu test would fail (1->0).  Even if you have an ABA
> > situation (1->0->1), it's okay because the pvti that is fetched is the
> > one returned by the first getcpu.
>
> ... this case of partial update of pvti, which is caught by the version
> field, if of course different from the other (extremely unlikely) that
> Andy pointed out.  That is when the getcpus are done on the same vCPU,
> but the rdtsc is another.
>
> That one can be fixed by rdtscp, like
>
> do {
>     // get a consistent (pvti, v, tsc) tuple
>     do {
>         cpu = get_cpu();
>         pvti = get_pvti(cpu);
>         v = pvti->version & ~1;
>         // also acts as rmb();
>         rdtsc_barrier();
>         tsc = rdtscp(&cpu1);

Off-topic note: rdtscp doesn't need a barrier at all.  AIUI AMD
specified it that way and both AMD and Intel implement it correctly.
(rdtsc, on the other hand, definitely needs the barrier beforehand.)

>         // control dependency, no need for rdtsc_barrier?
>     } while(cpu != cpu1);
>
>     // ... compute nanoseconds from pvti and tsc ...
>     rmb();
> }   while(v != pvti->version);

Still no good.  We can migrate a bunch of times so we see the same CPU
all three times and *still* don't get a consistent read, unless we
play nasty games with lots of version checks (I have a patch for that,
but I don't like it very much).  The patch is here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d

but I don't like it.

Thus far, I've been told unambiguously that a guest can't observe pvti
while it's being written, and I think you're now telling me that this
isn't true and that a guest *can* observe pvti while it's being
written while the low bit of the version field is not set.  If so,
this is rather strongly incompatible with the spec in the KVM docs.

I don't suppose that you and Marcelo could agree on what the actual
semantics that KVM provides are and could write it down in a way that
people who haven't spent a long time staring at the request code
understand?  And maybe you could even fix the implementation while
you're at it if the implementation is, indeed, broken.  I have ugly
patches to fix it here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0

but I'm not thrilled with them.

--Andy
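
A minimal sketch of the ordering note above, assuming GCC-style inline asm
rather than the kernel's real rdtsc()/rdtscp() wrappers: rdtscp orders
itself after earlier instructions, while a plain rdtsc wants an explicit
fence in front of it.

#include <stdint.h>

/* Sketch only: rdtscp waits for prior instructions by itself and also
 * returns IA32_TSC_AUX (the cpu/node id loaded by the OS) in ecx. */
static inline uint64_t rdtscp_sketch(unsigned int *cpu)
{
	unsigned int lo, hi, aux;

	asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
	*cpu = aux;
	return ((uint64_t)hi << 32) | lo;
}

/* Plain rdtsc can execute ahead of earlier loads, so a fence (lfence on
 * Intel, historically mfence on AMD) goes in front of it. */
static inline uint64_t rdtsc_ordered_sketch(void)
{
	unsigned int lo, hi;

	asm volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}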

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-06 16:56                 ` Andy Lutomirski
  2015-01-06 18:13                   ` Marcelo Tosatti
@ 2015-01-06 18:13                   ` Marcelo Tosatti
  2015-01-06 18:26                     ` Andy Lutomirski
  2015-01-06 18:26                     ` Andy Lutomirski
  2015-01-07  5:38                   ` Paolo Bonzini
  2015-01-07  5:38                   ` Paolo Bonzini
  3 siblings, 2 replies; 77+ messages in thread
From: Marcelo Tosatti @ 2015-01-06 18:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, xen-devel, linux-kernel, kvm list, Gleb Natapov

On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
> >
> >
> >
> > On 06/01/2015 09:42, Paolo Bonzini wrote:
> > > > > Still confused.  So we can freeze all vCPUs in the host, then update
> > > > > pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
> > > > > doesn't increment the version pre-update, and we can return completely
> > > > > bogus results.
> > > > Yes.
> > > But then the getcpu test would fail (1->0).  Even if you have an ABA
> > > situation (1->0->1), it's okay because the pvti that is fetched is the
> > > one returned by the first getcpu.
> >
> > ... this case of partial update of pvti, which is caught by the version
> > field, if of course different from the other (extremely unlikely) that
> > Andy pointed out.  That is when the getcpus are done on the same vCPU,
> > but the rdtsc is another.
> >
> > That one can be fixed by rdtscp, like
> >
> > do {
> >     // get a consistent (pvti, v, tsc) tuple
> >     do {
> >         cpu = get_cpu();
> >         pvti = get_pvti(cpu);
> >         v = pvti->version & ~1;
> >         // also acts as rmb();
> >         rdtsc_barrier();
> >         tsc = rdtscp(&cpu1);
> 
> Off-topic note: rdtscp doesn't need a barrier at all.  AIUI AMD
> specified it that way and both AMD and Intel implement it correctly.
> (rdtsc, on the other hand, definitely needs the barrier beforehand.)
> 
> >         // control dependency, no need for rdtsc_barrier?
> >     } while(cpu != cpu1);
> >
> >     // ... compute nanoseconds from pvti and tsc ...
> >     rmb();
> > }   while(v != pvti->version);
> 
> Still no good.  We can migrate a bunch of times so we see the same CPU
> all three times and *still* don't get a consistent read, unless we
> play nasty games with lots of version checks (I have a patch for that,
> but I don't like it very much).  The patch is here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
> 
> but I don't like it.
> 
> Thus far, I've been told unambiguously that a guest can't observe pvti
> while it's being written, and I think you're now telling me that this
> isn't true and that a guest *can* observe pvti while it's being
> written while the low bit of the version field is not set.  If so,
> this is rather strongly incompatible with the spec in the KVM docs.
> 
> I don't suppose that you and Marcelo could agree on what the actual
> semantics that KVM provides are and could write it down in a way that
> people who haven't spent a long time staring at the request code
> understand?  And maybe you could even fix the implementation while
> you're at it if the implementation is, indeed, broken.  I have ugly
> patches to fix it here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0
> 
> but I'm not thrilled with them.
> 
> --Andy

I suppose that separating the version write from the rest of the pvclock
structure is sufficient, as that would guarantee the writes are not
reordered even with fast string REP MOVS.

Thanks for catching this, Andy!


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-06 18:13                   ` Marcelo Tosatti
  2015-01-06 18:26                     ` Andy Lutomirski
@ 2015-01-06 18:26                     ` Andy Lutomirski
  2015-01-06 18:45                       ` Marcelo Tosatti
                                         ` (3 more replies)
  1 sibling, 4 replies; 77+ messages in thread
From: Andy Lutomirski @ 2015-01-06 18:26 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, xen-devel, linux-kernel, kvm list, Gleb Natapov

On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
>> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
>> >
>> >
>> >
>> > On 06/01/2015 09:42, Paolo Bonzini wrote:
>> > > > > Still confused.  So we can freeze all vCPUs in the host, then update
>> > > > > pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
>> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
>> > > > > doesn't increment the version pre-update, and we can return completely
>> > > > > bogus results.
>> > > > Yes.
>> > > But then the getcpu test would fail (1->0).  Even if you have an ABA
>> > > situation (1->0->1), it's okay because the pvti that is fetched is the
>> > > one returned by the first getcpu.
>> >
>> > ... this case of partial update of pvti, which is caught by the version
>> > field, if of course different from the other (extremely unlikely) that
>> > Andy pointed out.  That is when the getcpus are done on the same vCPU,
>> > but the rdtsc is another.
>> >
>> > That one can be fixed by rdtscp, like
>> >
>> > do {
>> >     // get a consistent (pvti, v, tsc) tuple
>> >     do {
>> >         cpu = get_cpu();
>> >         pvti = get_pvti(cpu);
>> >         v = pvti->version & ~1;
>> >         // also acts as rmb();
>> >         rdtsc_barrier();
>> >         tsc = rdtscp(&cpu1);
>>
>> Off-topic note: rdtscp doesn't need a barrier at all.  AIUI AMD
>> specified it that way and both AMD and Intel implement it correctly.
>> (rdtsc, on the other hand, definitely needs the barrier beforehand.)
>>
>> >         // control dependency, no need for rdtsc_barrier?
>> >     } while(cpu != cpu1);
>> >
>> >     // ... compute nanoseconds from pvti and tsc ...
>> >     rmb();
>> > }   while(v != pvti->version);
>>
>> Still no good.  We can migrate a bunch of times so we see the same CPU
>> all three times and *still* don't get a consistent read, unless we
>> play nasty games with lots of version checks (I have a patch for that,
>> but I don't like it very much).  The patch is here:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
>>
>> but I don't like it.
>>
>> Thus far, I've been told unambiguously that a guest can't observe pvti
>> while it's being written, and I think you're now telling me that this
>> isn't true and that a guest *can* observe pvti while it's being
>> written while the low bit of the version field is not set.  If so,
>> this is rather strongly incompatible with the spec in the KVM docs.
>>
>> I don't suppose that you and Marcelo could agree on what the actual
>> semantics that KVM provides are and could write it down in a way that
>> people who haven't spent a long time staring at the request code
>> understand?  And maybe you could even fix the implementation while
>> you're at it if the implementation is, indeed, broken.  I have ugly
>> patches to fix it here:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0
>>
>> but I'm not thrilled with them.
>>
>> --Andy
>
> I suppose that separating the version write from the rest of the pvclock
> structure is sufficient, as that would guarantee the writes are not
> reordered even with fast string REP MOVS.
>
> Thanks for catching this Andy!
>

Don't you still need:

version++;
write the rest;
version++;

with possible smp_wmb() in there to keep the compiler from messing around?

Also, if you do this, can you also make setting and clearing
STABLE_BIT properly atomic across all vCPUs?  Or at least do something
like setting it last and clearing it first on vCPU 0?

--Andy
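
A writer-side sketch of the odd/even protocol being asked for here,
assuming a hypothetical host-side helper over the guest-visible
pvclock_vcpu_time_info layout; this is not the actual KVM update path.

/* Hypothetical publisher: bump version to odd, write the payload, then
 * bump it back to even, with write barriers so the guest's version
 * checks see either the old or the new snapshot, never a mix. */
static void pvti_publish_sketch(struct pvclock_vcpu_time_info *dst,
				const struct pvclock_vcpu_time_info *src)
{
	dst->version++;			/* odd: update in progress */
	smp_wmb();

	dst->tsc_timestamp     = src->tsc_timestamp;
	dst->system_time       = src->system_time;
	dst->tsc_to_system_mul = src->tsc_to_system_mul;
	dst->tsc_shift         = src->tsc_shift;
	dst->flags             = src->flags;

	smp_wmb();
	dst->version++;			/* even again: snapshot complete */
}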

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-06 18:26                     ` Andy Lutomirski
  2015-01-06 18:45                       ` Marcelo Tosatti
@ 2015-01-06 18:45                       ` Marcelo Tosatti
  2015-01-06 19:49                         ` Andy Lutomirski
  2015-01-06 19:49                         ` Andy Lutomirski
  2015-01-07  5:41                       ` Paolo Bonzini
  2015-01-07  5:41                       ` Paolo Bonzini
  3 siblings, 2 replies; 77+ messages in thread
From: Marcelo Tosatti @ 2015-01-06 18:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, xen-devel, linux-kernel, kvm list, Gleb Natapov

On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote:
> On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
> >> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
> >> >
> >> >
> >> >
> >> > On 06/01/2015 09:42, Paolo Bonzini wrote:
> >> > > > > Still confused.  So we can freeze all vCPUs in the host, then update
> >> > > > > pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
> >> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
> >> > > > > doesn't increment the version pre-update, and we can return completely
> >> > > > > bogus results.
> >> > > > Yes.
> >> > > But then the getcpu test would fail (1->0).  Even if you have an ABA
> >> > > situation (1->0->1), it's okay because the pvti that is fetched is the
> >> > > one returned by the first getcpu.
> >> >
> >> > ... this case of partial update of pvti, which is caught by the version
> >> > field, if of course different from the other (extremely unlikely) that
> >> > Andy pointed out.  That is when the getcpus are done on the same vCPU,
> >> > but the rdtsc is another.
> >> >
> >> > That one can be fixed by rdtscp, like
> >> >
> >> > do {
> >> >     // get a consistent (pvti, v, tsc) tuple
> >> >     do {
> >> >         cpu = get_cpu();
> >> >         pvti = get_pvti(cpu);
> >> >         v = pvti->version & ~1;
> >> >         // also acts as rmb();
> >> >         rdtsc_barrier();
> >> >         tsc = rdtscp(&cpu1);
> >>
> >> Off-topic note: rdtscp doesn't need a barrier at all.  AIUI AMD
> >> specified it that way and both AMD and Intel implement it correctly.
> >> (rdtsc, on the other hand, definitely needs the barrier beforehand.)
> >>
> >> >         // control dependency, no need for rdtsc_barrier?
> >> >     } while(cpu != cpu1);
> >> >
> >> >     // ... compute nanoseconds from pvti and tsc ...
> >> >     rmb();
> >> > }   while(v != pvti->version);
> >>
> >> Still no good.  We can migrate a bunch of times so we see the same CPU
> >> all three times and *still* don't get a consistent read, unless we
> >> play nasty games with lots of version checks (I have a patch for that,
> >> but I don't like it very much).  The patch is here:
> >>
> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
> >>
> >> but I don't like it.
> >>
> >> Thus far, I've been told unambiguously that a guest can't observe pvti
> >> while it's being written, and I think you're now telling me that this
> >> isn't true and that a guest *can* observe pvti while it's being
> >> written while the low bit of the version field is not set.  If so,
> >> this is rather strongly incompatible with the spec in the KVM docs.
> >>
> >> I don't suppose that you and Marcelo could agree on what the actual
> >> semantics that KVM provides are and could write it down in a way that
> >> people who haven't spent a long time staring at the request code
> >> understand?  And maybe you could even fix the implementation while
> >> you're at it if the implementation is, indeed, broken.  I have ugly
> >> patches to fix it here:
> >>
> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0
> >>
> >> but I'm not thrilled with them.
> >>
> >> --Andy
> >
> > I suppose that separating the version write from the rest of the pvclock
> > structure is sufficient, as that would guarantee the writes are not
> > reordered even with fast string REP MOVS.
> >
> > Thanks for catching this Andy!
> >
> 
> Don't you stil need:
> 
> version++;
> write the rest;
> version++;
> 
> with possible smp_wmb() in there to keep the compiler from messing around?

Correct. Could just as well follow the protocol and use odd/even, which 
is what your patch does.

What is the point of the new flags bit, though?

> Also, if you do this, can you also make setting and clearing
> STABLE_BIT properly atomic across all vCPUs?  Or at least do something
> like setting it last and clearing it first on vPCU 0?

If the version "seqlock" works properly across vCPUs, why do you need
STABLE_BIT "properly atomic"?

Please define what you mean by "properly atomic".



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-06 18:45                       ` Marcelo Tosatti
@ 2015-01-06 19:49                         ` Andy Lutomirski
  2015-01-06 20:20                           ` Marcelo Tosatti
                                             ` (3 more replies)
  2015-01-06 19:49                         ` Andy Lutomirski
  1 sibling, 4 replies; 77+ messages in thread
From: Andy Lutomirski @ 2015-01-06 19:49 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, xen-devel, linux-kernel, kvm list, Gleb Natapov

On Tue, Jan 6, 2015 at 10:45 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
>> >> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On 06/01/2015 09:42, Paolo Bonzini wrote:
>> >> > > > > Still confused.  So we can freeze all vCPUs in the host, then update
>> >> > > > > pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
>> >> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
>> >> > > > > doesn't increment the version pre-update, and we can return completely
>> >> > > > > bogus results.
>> >> > > > Yes.
>> >> > > But then the getcpu test would fail (1->0).  Even if you have an ABA
>> >> > > situation (1->0->1), it's okay because the pvti that is fetched is the
>> >> > > one returned by the first getcpu.
>> >> >
>> >> > ... this case of partial update of pvti, which is caught by the version
>> >> > field, if of course different from the other (extremely unlikely) that
>> >> > Andy pointed out.  That is when the getcpus are done on the same vCPU,
>> >> > but the rdtsc is another.
>> >> >
>> >> > That one can be fixed by rdtscp, like
>> >> >
>> >> > do {
>> >> >     // get a consistent (pvti, v, tsc) tuple
>> >> >     do {
>> >> >         cpu = get_cpu();
>> >> >         pvti = get_pvti(cpu);
>> >> >         v = pvti->version & ~1;
>> >> >         // also acts as rmb();
>> >> >         rdtsc_barrier();
>> >> >         tsc = rdtscp(&cpu1);
>> >>
>> >> Off-topic note: rdtscp doesn't need a barrier at all.  AIUI AMD
>> >> specified it that way and both AMD and Intel implement it correctly.
>> >> (rdtsc, on the other hand, definitely needs the barrier beforehand.)
>> >>
>> >> >         // control dependency, no need for rdtsc_barrier?
>> >> >     } while(cpu != cpu1);
>> >> >
>> >> >     // ... compute nanoseconds from pvti and tsc ...
>> >> >     rmb();
>> >> > }   while(v != pvti->version);
>> >>
>> >> Still no good.  We can migrate a bunch of times so we see the same CPU
>> >> all three times and *still* don't get a consistent read, unless we
>> >> play nasty games with lots of version checks (I have a patch for that,
>> >> but I don't like it very much).  The patch is here:
>> >>
>> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
>> >>
>> >> but I don't like it.
>> >>
>> >> Thus far, I've been told unambiguously that a guest can't observe pvti
>> >> while it's being written, and I think you're now telling me that this
>> >> isn't true and that a guest *can* observe pvti while it's being
>> >> written while the low bit of the version field is not set.  If so,
>> >> this is rather strongly incompatible with the spec in the KVM docs.
>> >>
>> >> I don't suppose that you and Marcelo could agree on what the actual
>> >> semantics that KVM provides are and could write it down in a way that
>> >> people who haven't spent a long time staring at the request code
>> >> understand?  And maybe you could even fix the implementation while
>> >> you're at it if the implementation is, indeed, broken.  I have ugly
>> >> patches to fix it here:
>> >>
>> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0
>> >>
>> >> but I'm not thrilled with them.
>> >>
>> >> --Andy
>> >
>> > I suppose that separating the version write from the rest of the pvclock
>> > structure is sufficient, as that would guarantee the writes are not
>> > reordered even with fast string REP MOVS.
>> >
>> > Thanks for catching this Andy!
>> >
>>
>> Don't you stil need:
>>
>> version++;
>> write the rest;
>> version++;
>>
>> with possible smp_wmb() in there to keep the compiler from messing around?
>
> Correct. Could just as well follow the protocol and use odd/even, which
> is what your patch does.
>
> What is the point with the new flags bit though?

To try to work around the problem on old hosts.  I'm not at all
convinced that this is worthwhile or that it helps, though.

>
>> Also, if you do this, can you also make setting and clearing
>> STABLE_BIT properly atomic across all vCPUs?  Or at least do something
>> like setting it last and clearing it first on vPCU 0?
>
> If the version "seqlock" works properly across vCPUs, why do you need
> STABLE_BIT "properly atomic" ?
>
> Please define what you mean by "properly atomic".
>

I'd like to be able to rely on using vCPU 0's pvti even from other vCPUs
in the vdso if the stable bit is set.  That means that the host should
avoid doing things like migrating the guest, clearing the stable bit
for vCPU 1, resuming vCPU 1, and then waiting long enough before clearing
the stable bit for vCPU 0 that vCPU 1's vdso code could see invalid data
and return a bad timestamp.

Maybe this scenario is impossible, but getting rid of any getcpu-like
operation in the vdso has really nice benefits.  It's faster and it
lets us guarantee that the vdso's pvti data fits in a single page.
The latter means that we can easily make it work like the hpet
mapping, which gets us 32-bit support and will *finally* let us turn
off user access to the fixmap if vsyscall=none.

(We can, of course, still do this if the pvti data needs to be an
array, but it's messier.)

--Andy

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-06 19:49                         ` Andy Lutomirski
@ 2015-01-06 20:20                           ` Marcelo Tosatti
  2015-01-06 21:54                             ` Andy Lutomirski
  2015-01-06 21:54                             ` Andy Lutomirski
  2015-01-06 20:20                           ` Marcelo Tosatti
                                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 77+ messages in thread
From: Marcelo Tosatti @ 2015-01-06 20:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, xen-devel, linux-kernel, kvm list, Gleb Natapov

On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote:
> > What is the point with the new flags bit though?
> 
> To try to work around the problem on old hosts.  I'm not at all
> convinced that this is worthwhile or that it helps, though.

Don't think so. Just fix the host bug.

> >> Also, if you do this, can you also make setting and clearing
> >> STABLE_BIT properly atomic across all vCPUs?  Or at least do something
> >> like setting it last and clearing it first on vCPU 0?
> >
> > If the version "seqlock" works properly across vCPUs, why do you need
> > STABLE_BIT "properly atomic" ?
> >
> > Please define what you mean by "properly atomic".
> >
> 
> I'd like to be able to rely on using vCPU 0's pvti even from other vCPUs
> in the vdso if the stable bit is set.  That means that the host should
> avoid doing things like migrating the guest, clearing the stable bit
> for vCPU 1, resuming vCPU 1, and waiting long enough before clearing the
> stable bit for vCPU 0 that vCPU 1's vdso code could see invalid data
> and return a bad timestamp.
> 
> Maybe this scenario is impossible, but getting rid of any getcpu-like
> operation in the vdso has really nice benefits. 

You can park every vCPU in the host while updating vCPU-0's timestamp.

See kvm_gen_update_masterclock:

+	/* no guest entries from this point */
+	pvclock_update_vm_gtod_copy(kvm);

	- touch guest memory

+	/* guest entries allowed */
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);

>  It's faster and it
> lets us guarantee that the vdso's pvti data fits in a single page.
> The latter means that we can easily make it work like the hpet
> mapping, which gets us 32-bit support and will *finally* let us turn
> off user access to the fixmap if vsyscall=none.
> 
> (We can, of course, still do this if the pvti data needs to be an
> array, but it's messier.)
> 
> --Andy

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-06 20:20                           ` Marcelo Tosatti
@ 2015-01-06 21:54                             ` Andy Lutomirski
  2015-01-06 21:54                             ` Andy Lutomirski
  1 sibling, 0 replies; 77+ messages in thread
From: Andy Lutomirski @ 2015-01-06 21:54 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, xen-devel, linux-kernel, kvm list, Gleb Natapov

On Tue, Jan 6, 2015 at 12:20 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote:
>> > What is the point with the new flags bit though?
>>
>> To try to work around the problem on old hosts.  I'm not at all
>> convinced that this is worthwhile or that it helps, though.
>
> Don't think so. Just fix the host bug.
>
>> >> Also, if you do this, can you also make setting and clearing
>> >> STABLE_BIT properly atomic across all vCPUs?  Or at least do something
>> >> like setting it last and clearing it first on vCPU 0?
>> >
>> > If the version "seqlock" works properly across vCPUs, why do you need
>> > STABLE_BIT "properly atomic" ?
>> >
>> > Please define what you mean by "properly atomic".
>> >
>>
>> I'd like to be able to rely on using vCPU 0's pvti even from other vCPUs
>> in the vdso if the stable bit is set.  That means that the host should
>> avoid doing things like migrating the guest, clearing the stable bit
>> for vCPU 1, resuming vCPU 1, and waiting long enough before clearing the
>> stable bit for vCPU 0 that vCPU 1's vdso code could see invalid data
>> and return a bad timestamp.
>>
>> Maybe this scenario is impossible, but getting rid of any getcpu-like
>> operation in the vdso has really nice benefits.
>
> You can park every vCPU in host while updating vCPU-0's timestamp.
>
> See kvm_gen_update_masterclock:
>
> +       /* no guest entries from this point */
> +       pvclock_update_vm_gtod_copy(kvm);
>
>         - touch guest memory
>
> +       /* guest entries allowed */
> +       kvm_for_each_vcpu(i, vcpu, kvm)
> +               clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);
>

Can we do that easily?  It looks like we're holding a spinlock in
there.  Could we make pvclock_gtod_sync_lock into a mutex?

We could also add something to explicitly prevent any of the guests
from entering until we've updated all of them, but that would hurt
performance even more.  It would be kind of nice if we could avoid
serializing all CPUs entirely, though.  For example, if we could
increment all the versions, then write all the pvtis, then increment
all the versions again from a single function, then everything is
atomic for a correctly behaving guest, but if the guest isn't actually
reading the time, then it doesn't stall.
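
Something like this, as a sketch (kvm_for_each_vcpu is the real macro,
but vcpu->pvti and write_pvti_data() are just stand-ins for however
we'd actually reach each guest's copy):

static void update_all_pvtis(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu;
	int i;

	kvm_for_each_vcpu(i, vcpu, kvm)
		vcpu->pvti->version++;		/* all odd: updates in progress */
	smp_wmb();				/* version bumps before the data */

	kvm_for_each_vcpu(i, vcpu, kvm)
		write_pvti_data(vcpu);		/* tsc_timestamp, system_time, ... */

	smp_wmb();				/* data before the final bumps */
	kvm_for_each_vcpu(i, vcpu, kvm)
		vcpu->pvti->version++;		/* all even again: done */
}

A guest that reads in the middle sees an odd version or a version
change and retries; a guest that isn't reading the time never waits.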

(Also, we should really have a cpu_relax in all of these loops, IMO,
so that pause loop exiting can take effect.)

--Andy

>>  It's faster and it
>> lets us guarantee that the vdso's pvti data fits in a single page.
>> The latter means that we can easily make it work like the hpet
>> mapping, which gets us 32-bit support and will *finally* let us turn
>> off user access to the fixmap if vsyscall=none.
>>
>> (We can, of course, still do this if the pvti data needs to be an
>> array, but it's messier.)
>>
>> --Andy



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-06 16:56                 ` Andy Lutomirski
                                     ` (2 preceding siblings ...)
  2015-01-07  5:38                   ` Paolo Bonzini
@ 2015-01-07  5:38                   ` Paolo Bonzini
  2015-01-07  7:18                     ` Andy Lutomirski
  2015-01-07  7:18                     ` Andy Lutomirski
  3 siblings, 2 replies; 77+ messages in thread
From: Paolo Bonzini @ 2015-01-07  5:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: xen-devel, linux-kernel, kvm list, Gleb Natapov, Marcelo Tosatti



On 06/01/2015 17:56, Andy Lutomirski wrote:
> Still no good.  We can migrate a bunch of times so we see the same CPU
> all three times

There are no three times.  The CPU you see here:

>> 
>> 
>>     // ... compute nanoseconds from pvti and tsc ...
>>     rmb();
>> }   while(v != pvti->version);

is the same you read here:

>>         cpu = get_cpu();

The algorithm is:

1) get a consistent (cpu, version, tsc)

   1.a) get cpu
   1.b) get pvti[cpu]->version, ignoring low bit
   1.c) get (tsc, cpu)
   1.d) if cpu from 1.a and 1.c do not match, loop
   1.e) if pvti[cpu] was being updated, we'll loop later

2) compute nanoseconds from pvti[cpu] and tsc

3) if pvti[cpu] changed under our feet during (2), i.e. version doesn't
match, retry.

As long as the CPU is consistent between get_cpu() and rdtscp(), there
is no problem with migration, because pvti is always accessed for that
one CPU.  If there were any problem, it would be caught by the version
check.  Writing it down with two nested do...whiles makes it clearer IMHO.
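
For concreteness, a sketch of that nested form (get_cpu(), get_pvti()
and rdtscp() as in the pseudocode earlier in the thread;
pvclock_scale() stands in for the mul/shift + system_time arithmetic):

	unsigned cpu, cpu1;
	u32 version;
	u64 tsc, ns;
	const struct pvclock_vcpu_time_info *pvti;

	do {
		/* 1) get a consistent (cpu, version, tsc) tuple */
		do {
			cpu = get_cpu();
			pvti = get_pvti(cpu);
			version = pvti->version & ~1;	/* ignore the low bit */
			rmb();				/* version before tsc */
			tsc = rdtscp(&cpu1);		/* also reports the cpu */
		} while (cpu != cpu1);

		/* 2) compute nanoseconds from pvti[cpu] and tsc */
		ns = pvclock_scale(pvti, tsc);

		rmb();					/* pvti reads before re-check */
	} while (version != pvti->version);		/* 3) changed? retry */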

> and *still* don't get a consistent read, unless we
> play nasty games with lots of version checks (I have a patch for that,
> but I don't like it very much).  The patch is here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
> 
> but I don't like it.
> 
> Thus far, I've been told unambiguously that a guest can't observe pvti
> while it's being written, and I think you're now telling me that this
> isn't true and that a guest *can* observe pvti while it's being
> written while the low bit of the version field is not set.  If so,
> this is rather strongly incompatible with the spec in the KVM docs.

Where am I saying that?

Paolo

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-06 18:26                     ` Andy Lutomirski
                                         ` (2 preceding siblings ...)
  2015-01-07  5:41                       ` Paolo Bonzini
@ 2015-01-07  5:41                       ` Paolo Bonzini
  3 siblings, 0 replies; 77+ messages in thread
From: Paolo Bonzini @ 2015-01-07  5:41 UTC (permalink / raw)
  To: Andy Lutomirski, Marcelo Tosatti
  Cc: xen-devel, linux-kernel, kvm list, Gleb Natapov



On 06/01/2015 19:26, Andy Lutomirski wrote:
> Don't you still need:
> 
> version++;
> write the rest;
> version++;
> 
> with possible smp_wmb() in there to keep the compiler from messing around?

No, see my other reply.

The unseparated version write is a real bug, but separating it should
be all that's needed.

> Also, if you do this, can you also make setting and clearing
> STABLE_BIT properly atomic across all vCPUs?  Or at least do something
> like setting it last and clearing it first on vCPU 0?

That would be nice if you want to make the pvclock area fit in a single
page.  However, it would have to be a separate flag bit, or a separate
CPUID feature.

Paolo

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-07  5:38                   ` Paolo Bonzini
  2015-01-07  7:18                     ` Andy Lutomirski
@ 2015-01-07  7:18                     ` Andy Lutomirski
  2015-01-07  9:00                       ` Paolo Bonzini
                                         ` (3 more replies)
  1 sibling, 4 replies; 77+ messages in thread
From: Andy Lutomirski @ 2015-01-07  7:18 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: xen-devel, linux-kernel, kvm list, Gleb Natapov, Marcelo Tosatti

On Tue, Jan 6, 2015 at 9:38 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 06/01/2015 17:56, Andy Lutomirski wrote:
>> Still no good.  We can migrate a bunch of times so we see the same CPU
>> all three times
>
> There are no three times.  The CPU you see here:
>
>>>
>>>
>>>     // ... compute nanoseconds from pvti and tsc ...
>>>     rmb();
>>> }   while(v != pvti->version);
>
> is the same you read here:
>
>>>         cpu = get_cpu();
>
> The algorithm is:

I still don't see why this is safe, and I think that the issue is that
you left out part of the loop.

>
> 1) get a consistent (cpu, version, tsc)
>
>    1.a) get cpu

Suppose we observe cpu 0.

>    1.b) get pvti[cpu]->version, ignoring low bit

Missing step, presumably here: read pvti[cpu]->tsc_timestamp, scale,
etc.  This could all execute on vCPU 1.  We could read values that are
inconsistent with each other.

>    1.c) get (tsc, cpu)

Now we could end up back on vCPU 0.

>    1.d) if cpu from 1.a and 1.c do not match, loop
>    1.e) if pvti[cpu] was being updated, we'll loop later
>
> 2) compute nanoseconds from pvti[cpu] and tsc
>
> 3) if pvti[cpu] changed under our feet during (2), i.e. version doesn't
> match, retry.
>
> As long as the CPU is consistent between get_cpu() and rdtscp(), there
> is no problem with migration, because pvti is always accessed for that
> one CPU.  If there were any problem, it would be caught by the version
> check.  Writing it down with two nested do...whiles makes it clearer IMHO.

Why exactly would it be caught by the version check?

My ugly patch tries to make the argument that, at any point at which
we observe ourselves to be on a given vCPU, that vCPU won't be
updating pvti.  That means that, if version doesn't change between two
consecutive observations that we're on that vCPU, then we're okay.
This IMO sucks.  It's fragile, it's hard to make a coherent argument
about correctness, and it requires at least two getcpu-like operations
to read the time.  Those operations are *slow*.  One is much better
than two, and zero is much better than one.

>
>> and *still* don't get a consistent read, unless we
>> play nasty games with lots of version checks (I have a patch for that,
>> but I don't like it very much).  The patch is here:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
>>
>> but I don't like it.
>>
>> Thus far, I've been told unambiguously that a guest can't observe pvti
>> while it's being written, and I think you're now telling me that this
>> isn't true and that a guest *can* observe pvti while it's being
>> written while the low bit of the version field is not set.  If so,
>> this is rather strongly incompatible with the spec in the KVM docs.
>
> Where am I saying that?

I thought the conclusion from what you and Marcelo pointed out about
the code was that, once the first vCPU updated its pvti, it could
start running guest code while the other vCPUs are still updating
pvti, so its guest code can observe the other vCPUs mid-update.

>> Also, if you do this, can you also make setting and clearing
>> STABLE_BIT properly atomic across all vCPUs?  Or at least do something
>> like setting it last and clearing it first on vCPU 0?
>
> That would be nice if you want to make the pvclock area fit in a single
> page.  However, it would have to be a separate flag bit, or a separate
> CPUID feature.

It would be nice to have.  Although I think that fixing the host to
increment version pre-update and post-update may actually be good
enough.  Is there any case in which it would fail in practice if we
made that fix and always looked at pvti 0?
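
For reference, the read I have in mind then boils down to something
like this (sketch only, barriers written loosely; pvclock_scale()
stands in for the mul/shift + system_time arithmetic):

	const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
	u32 version;
	u64 tsc, ns;

	do {
		version = pvti->version & ~1;	/* odd means mid-update */
		rmb();				/* version before flags/data */
		if (!(pvti->flags & PVCLOCK_TSC_STABLE_BIT)) {
			*mode = VCLOCK_NONE;
			return 0;
		}
		rdtsc_barrier();
		tsc = __native_read_tsc();
		ns = pvclock_scale(pvti, tsc);
		rmb();				/* data before the re-check */
	} while (pvti->version != version);

No getcpu anywhere, which is the whole point.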

--Andy

>
> Paolo



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-07  7:18                     ` Andy Lutomirski
  2015-01-07  9:00                       ` Paolo Bonzini
@ 2015-01-07  9:00                       ` Paolo Bonzini
  2015-01-07 14:45                       ` Marcelo Tosatti
  2015-01-07 14:45                       ` Marcelo Tosatti
  3 siblings, 0 replies; 77+ messages in thread
From: Paolo Bonzini @ 2015-01-07  9:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: xen-devel, linux-kernel, kvm list, Gleb Natapov, Marcelo Tosatti



On 07/01/2015 08:18, Andy Lutomirski wrote:
>>> >> Thus far, I've been told unambiguously that a guest can't observe pvti
>>> >> while it's being written, and I think you're now telling me that this
>>> >> isn't true and that a guest *can* observe pvti while it's being
>>> >> written while the low bit of the version field is not set.  If so,
>>> >> this is rather strongly incompatible with the spec in the KVM docs.
>> >
>> > Where am I saying that?
> I thought the conclusion from what you and Marcelo pointed out about
> the code was that, once the first vCPU updated its pvti, it could
> start running guest code while the other vCPUs are still updating
> pvti, so its guest code can observe the other vCPUs mid-update.

Ah, in that sense you're right.  However, each VCPU cannot observe _its
own_ pvti entry while it's being written (no matter what's in the low
bit of the version field).

Paolo

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-07  7:18                     ` Andy Lutomirski
  2015-01-07  9:00                       ` Paolo Bonzini
  2015-01-07  9:00                       ` Paolo Bonzini
@ 2015-01-07 14:45                       ` Marcelo Tosatti
  2015-01-07 14:45                       ` Marcelo Tosatti
  3 siblings, 0 replies; 77+ messages in thread
From: Marcelo Tosatti @ 2015-01-07 14:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, xen-devel, linux-kernel, kvm list, Gleb Natapov

On Tue, Jan 06, 2015 at 11:18:21PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 6, 2015 at 9:38 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> >
> > On 06/01/2015 17:56, Andy Lutomirski wrote:
> >> Still no good.  We can migrate a bunch of times so we see the same CPU
> >> all three times
> >
> > There are no three times.  The CPU you see here:
> >
> >>>
> >>>
> >>>     // ... compute nanoseconds from pvti and tsc ...
> >>>     rmb();
> >>> }   while(v != pvti->version);
> >
> > is the same you read here:
> >
> >>>         cpu = get_cpu();
> >
> > The algorithm is:
> 
> I still don't see why this is safe, and I think that the issue is that
> you left out part of the loop.
> 
> >
> > 1) get a consistent (cpu, version, tsc)
> >
> >    1.a) get cpu
> 
> Suppose we observe cpu 0.
> 
> >    1.b) get pvti[cpu]->version, ignoring low bit
> 
> Missing step, presumably here: read pvti[cpu]->tsc_timestamp, scale,
> etc.  This could all execute on vCPU 1.  We could read values that are
> inconsistent with each other.
> 
> >    1.c) get (tsc, cpu)
> 
> Now we could end up back on vCPU 0.
> 
> >    1.d) if cpu from 1.a and 1.c do not match, loop
> >    1.e) if pvti[cpu] was being updated, we'll loop later
> >
> > 2) compute nanoseconds from pvti[cpu] and tsc
> >
> > 3) if pvti[cpu] changed under our feet during (2), i.e. version doesn't
> > match, retry.
> >
> > As long as the CPU is consistent between get_cpu() and rdtscp(), there
> > is no problem with migration, because pvti is always accessed for that
> > one CPU.  If there were any problem, it would be caught by the version
> > check.  Writing it down with two nested do...whiles makes it clearer IMHO.
> 
> Why exactly would it be caught by the version check?
> 
> My ugly patch tries to make the argument that, at any point at which
> we observe ourselves to be on a given vCPU, that vCPU won't be
> updating pvti.  That means that, if version doesn't change between two
> consecutive observations that we're on that vCPU, then we're okay.
> This IMO sucks.  It's fragile, it's hard to make a coherent argument
> about correctness, and it requires at least two getcpu-like operations
> to read the time.  Those operations are *slow*.  One is much better
> than two, and zero is much better than one.
> 
> >
> >> and *still* don't get a consistent read, unless we
> >> play nasty games with lots of version checks (I have a patch for that,
> >> but I don't like it very much).  The patch is here:
> >>
> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
> >>
> >> but I don't like it.
> >>
> >> Thus far, I've been told unambiguously that a guest can't observe pvti
> >> while it's being written, and I think you're now telling me that this
> >> isn't true and that a guest *can* observe pvti while it's being
> >> written while the low bit of the version field is not set.  If so,
> >> this is rather strongly incompatible with the spec in the KVM docs.
> >
> > Where am I saying that?
> 
> I thought the conclusion from what you and Marcelo pointed out about
> the code was that, once the first vCPU updated its pvti, it could
> start running guest code while the other vCPUs are still updating
> pvti, so its guest code can observe the other vCPUs mid-update.
> 
> >> Also, if you do this, can you also make setting and clearing
> >> STABLE_BIT properly atomic across all vCPUs?  Or at least do something
> >> like setting it last and clearing it first on vCPU 0?
> >
> > That would be nice if you want to make the pvclock area fit in a single
> > page.  However, it would have to be a separate flag bit, or a separate
> > CPUID feature.
> 
> It would be nice to have.  Although I think that fixing the host to
> increment version pre-update and post-update may actually be good
> enough.  Is there any case in which it would fail in practice if we
> made that fix and always looked at pvti 0?

TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition steps would finish but 
still allow VCPU-1 to use stale values from VCPU-0.

To fix, do one of the following:

1) Check the validity of the local TSC_STABLE_BIT in addition (slow);
see the sketch below.
2) Perform the update of VCPU-0's pvclock area before allowing
VM-entry on any other VCPU.
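
Sketch of 1) in the vdso (__getcpu(), VGETCPU_CPU_MASK and get_pvti()
as in the current vclock_gettime.c; slow because of the extra getcpu):

	unsigned cpu = __getcpu() & VGETCPU_CPU_MASK;

	if (!(get_pvti(cpu)->pvti.flags & PVCLOCK_TSC_STABLE_BIT) ||
	    !(get_pvti(0)->pvti.flags & PVCLOCK_TSC_STABLE_BIT)) {
		*mode = VCLOCK_NONE;
		return 0;
	}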



> 
> --Andy
> 
> >
> > Paolo
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Xen-devel] [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2014-12-23  0:39 ` [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader Andy Lutomirski
@ 2015-01-08 12:51     ` David Vrabel
  2014-12-23 10:28   ` David Vrabel
                       ` (7 subsequent siblings)
  8 siblings, 0 replies; 77+ messages in thread
From: David Vrabel @ 2015-01-08 12:51 UTC (permalink / raw)
  To: Andy Lutomirski, Paolo Bonzini, Marcelo Tosatti
  Cc: Gleb Natapov, xen-devel, linux-kernel, kvm list

On 23/12/2014 00:39, Andy Lutomirski wrote:
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid.  Simplify it for a huge speedup.
>
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
>
> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> With this change, it takes 19ns, which is almost as fast as the pure TSC
> implementation.

Xen guests don't use any of this at the moment, and I don't think this 
change would prevent us from using it in the future, so:

Acked-by: David Vrabel <david.vrabel@citrix.com>

But see some additional comments below.

> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>
>   static notrace cycle_t vread_pvclock(int *mode)
>   {
> -	const struct pvclock_vsyscall_time_info *pvti;
> +	const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;

Xen updates pvti when scheduling a VCPU.  Using 0 here requires that 
VCPU 0 has been recently scheduled by Xen.  Perhaps using the current 
CPU here would be better?  It doesn't matter if the task is subsequently 
moved to a different CPU before using pvti.

> +	 * Note: The kernel and hypervisor must guarantee that cpu ID
> +	 * number maps 1:1 to per-CPU pvclock time info.
> +	 *
> +	 * Because the hypervisor is entirely unaware of guest userspace
> +	 * preemption, it cannot guarantee that per-CPU pvclock time
> +	 * info is updated if the underlying CPU changes or that that
> +	 * version is increased whenever underlying CPU changes.
> +	 *
> +	 * On KVM, we are guaranteed that pvti updates for any vCPU are
> +	 * atomic as seen by *all* vCPUs.  This is an even stronger
> +	 * guarantee than we get with a normal seqlock.
>   	 *
> +	 * On Xen, we don't appear to have that guarantee, but Xen still
> +	 * supplies a valid seqlock using the version field.
> +
> +	 * We only do pvclock vdso timing at all if
> +	 * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> +	 * mean that all vCPUs have matching pvti and that the TSC is
> +	 * synced, so we can just look at vCPU 0's pvti.

I think this is a much stronger requirement than you actually need.

You only require:

- the system time (pvti->system_time) for all pvti's is synchronized; and
- TSC is synchronized; and
- the pvti has been updated sufficiently recently (so the error in the 
result is within acceptable margins).

Can you add documentation to arch/x86/include/asm/pvclock-abi.h to 
describe what properties PVCLOCK_TSC_STABLE_BIT guarantees?
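
Something along these lines, perhaps (wording is only a suggestion):

/*
 * PVCLOCK_TSC_STABLE_BIT, when set, guarantees at least that:
 *  - system_time is synchronized across all per-vCPU pvti structures;
 *  - the TSC is synchronized across vCPUs;
 *  - each pvti is refreshed often enough that a time computed from it
 *    stays within acceptable error margins.
 */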

David

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-06 19:49                         ` Andy Lutomirski
  2015-01-06 20:20                           ` Marcelo Tosatti
  2015-01-06 20:20                           ` Marcelo Tosatti
@ 2015-01-08 22:31                           ` Marcelo Tosatti
  2015-01-08 22:43                             ` Andy Lutomirski
  2015-01-08 22:43                             ` Andy Lutomirski
  2015-01-08 22:31                           ` Marcelo Tosatti
  3 siblings, 2 replies; 77+ messages in thread
From: Marcelo Tosatti @ 2015-01-08 22:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, xen-devel, linux-kernel, kvm list, Gleb Natapov

On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote:
> On Tue, Jan 6, 2015 at 10:45 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote:
> >> On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> > On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
> >> >> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
> >> >> >
> >> >> >
> >> >> >
> >> >> > On 06/01/2015 09:42, Paolo Bonzini wrote:
> >> >> > > > > Still confused.  So we can freeze all vCPUs in the host, then update
> >> >> > > > > pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
> >> >> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
> >> >> > > > > doesn't increment the version pre-update, and we can return completely
> >> >> > > > > bogus results.
> >> >> > > > Yes.
> >> >> > > But then the getcpu test would fail (1->0).  Even if you have an ABA
> >> >> > > situation (1->0->1), it's okay because the pvti that is fetched is the
> >> >> > > one returned by the first getcpu.
> >> >> >
> >> >> > ... this case of partial update of pvti, which is caught by the version
> >> >> > field, if of course different from the other (extremely unlikely) that
> >> >> > Andy pointed out.  That is when the getcpus are done on the same vCPU,
> >> >> > but the rdtsc is another.
> >> >> >
> >> >> > That one can be fixed by rdtscp, like
> >> >> >
> >> >> > do {
> >> >> >     // get a consistent (pvti, v, tsc) tuple
> >> >> >     do {
> >> >> >         cpu = get_cpu();
> >> >> >         pvti = get_pvti(cpu);
> >> >> >         v = pvti->version & ~1;
> >> >> >         // also acts as rmb();
> >> >> >         rdtsc_barrier();
> >> >> >         tsc = rdtscp(&cpu1);
> >> >>
> >> >> Off-topic note: rdtscp doesn't need a barrier at all.  AIUI AMD
> >> >> specified it that way and both AMD and Intel implement it correctly.
> >> >> (rdtsc, on the other hand, definitely needs the barrier beforehand.)
> >> >>
> >> >> >         // control dependency, no need for rdtsc_barrier?
> >> >> >     } while(cpu != cpu1);
> >> >> >
> >> >> >     // ... compute nanoseconds from pvti and tsc ...
> >> >> >     rmb();
> >> >> > }   while(v != pvti->version);
> >> >>
> >> >> Still no good.  We can migrate a bunch of times so we see the same CPU
> >> >> all three times and *still* don't get a consistent read, unless we
> >> >> play nasty games with lots of version checks (I have a patch for that,
> >> >> but I don't like it very much).  The patch is here:
> >> >>
> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
> >> >>
> >> >> but I don't like it.
> >> >>
> >> >> Thus far, I've been told unambiguously that a guest can't observe pvti
> >> >> while it's being written, and I think you're now telling me that this
> >> >> isn't true and that a guest *can* observe pvti while it's being
> >> >> written while the low bit of the version field is not set.  If so,
> >> >> this is rather strongly incompatible with the spec in the KVM docs.
> >> >>
> >> >> I don't suppose that you and Marcelo could agree on what the actual
> >> >> semantics that KVM provides are and could write it down in a way that
> >> >> people who haven't spent a long time staring at the request code
> >> >> understand?  And maybe you could even fix the implementation while
> >> >> you're at it if the implementation is, indeed, broken.  I have ugly
> >> >> patches to fix it here:
> >> >>
> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0
> >> >>
> >> >> but I'm not thrilled with them.
> >> >>
> >> >> --Andy
> >> >
> >> > I suppose that separating the version write from the rest of the pvclock
> >> > structure is sufficient, as that would guarantee the writes are not
> >> > reordered even with fast string REP MOVS.
> >> >
> >> > Thanks for catching this Andy!
> >> >
> >>
> >> Don't you still need:
> >>
> >> version++;
> >> write the rest;
> >> version++;
> >>
> >> with possible smp_wmb() in there to keep the compiler from messing around?
> >
> > Correct. Could just as well follow the protocol and use odd/even, which
> > is what your patch does.
> >
> > What is the point with the new flags bit though?
> 
> To try to work around the problem on old hosts.  I'm not at all
> convinced that this is worthwhile or that it helps, though.

Andy, 

Are you going to submit the fix or should I?
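
(For reference, the fix being discussed boils down to something like the
sketch below on the host side.  This is only an illustration of the
odd/even protocol: struct pvclock_vcpu_time_info is the real ABI struct,
but pvti_update_sketch() and fill_pvti() are made-up names, not the
actual KVM update path.)

static void pvti_update_sketch(struct pvclock_vcpu_time_info *pvti)
{
        pvti->version++;        /* version now odd: readers must retry  */
        smp_wmb();              /* order the bump before the payload    */

        fill_pvti(pvti);        /* write tsc_timestamp, system_time,
                                 * tsc_to_system_mul, tsc_shift, flags  */

        smp_wmb();              /* make payload visible before the bump */
        pvti->version++;        /* even again: readers may use the data */
}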


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-08 22:31                           ` Marcelo Tosatti
  2015-01-08 22:43                             ` Andy Lutomirski
@ 2015-01-08 22:43                             ` Andy Lutomirski
  2015-02-26 22:46                               ` Andy Lutomirski
  2015-02-26 22:46                               ` Andy Lutomirski
  1 sibling, 2 replies; 77+ messages in thread
From: Andy Lutomirski @ 2015-01-08 22:43 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, xen-devel, linux-kernel, kvm list, Gleb Natapov

On Thu, Jan 8, 2015 at 2:31 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 6, 2015 at 10:45 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote:
>> >> On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> >> > On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
>> >> >> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On 06/01/2015 09:42, Paolo Bonzini wrote:
>> >> >> > > > > Still confused.  So we can freeze all vCPUs in the host, then update
>> >> >> > > > > pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
>> >> >> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
>> >> >> > > > > doesn't increment the version pre-update, and we can return completely
>> >> >> > > > > bogus results.
>> >> >> > > > Yes.
>> >> >> > > But then the getcpu test would fail (1->0).  Even if you have an ABA
>> >> >> > > situation (1->0->1), it's okay because the pvti that is fetched is the
>> >> >> > > one returned by the first getcpu.
>> >> >> >
>> >> >> > ... this case of partial update of pvti, which is caught by the version
>> >> >> > field, is of course different from the other (extremely unlikely) one that
>> >> >> > Andy pointed out.  That is when the getcpus are done on the same vCPU,
>> >> >> > but the rdtsc is on another.
>> >> >> >
>> >> >> > That one can be fixed by rdtscp, like
>> >> >> >
>> >> >> > do {
>> >> >> >     // get a consistent (pvti, v, tsc) tuple
>> >> >> >     do {
>> >> >> >         cpu = get_cpu();
>> >> >> >         pvti = get_pvti(cpu);
>> >> >> >         v = pvti->version & ~1;
>> >> >> >         // also acts as rmb();
>> >> >> >         rdtsc_barrier();
>> >> >> >         tsc = rdtscp(&cpu1);
>> >> >>
>> >> >> Off-topic note: rdtscp doesn't need a barrier at all.  AIUI AMD
>> >> >> specified it that way and both AMD and Intel implement it correctly.
>> >> >> (rdtsc, on the other hand, definitely needs the barrier beforehand.)
>> >> >>
>> >> >> >         // control dependency, no need for rdtsc_barrier?
>> >> >> >     } while(cpu != cpu1);
>> >> >> >
>> >> >> >     // ... compute nanoseconds from pvti and tsc ...
>> >> >> >     rmb();
>> >> >> > }   while(v != pvti->version);
>> >> >>
>> >> >> Still no good.  We can migrate a bunch of times so we see the same CPU
>> >> >> all three times and *still* don't get a consistent read, unless we
>> >> >> play nasty games with lots of version checks (I have a patch for that,
>> >> >> but I don't like it very much).  The patch is here:
>> >> >>
>> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
>> >> >>
>> >> >> but I don't like it.
>> >> >>
>> >> >> Thus far, I've been told unambiguously that a guest can't observe pvti
>> >> >> while it's being written, and I think you're now telling me that this
>> >> >> isn't true and that a guest *can* observe pvti while it's being
>> >> >> written while the low bit of the version field is not set.  If so,
>> >> >> this is rather strongly incompatible with the spec in the KVM docs.
>> >> >>
>> >> >> I don't suppose that you and Marcelo could agree on what the actual
>> >> >> semantics that KVM provides are and could write it down in a way that
>> >> >> people who haven't spent a long time staring at the request code
>> >> >> understand?  And maybe you could even fix the implementation while
>> >> >> you're at it if the implementation is, indeed, broken.  I have ugly
>> >> >> patches to fix it here:
>> >> >>
>> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0
>> >> >>
>> >> >> but I'm not thrilled with them.
>> >> >>
>> >> >> --Andy
>> >> >
>> >> > I suppose that separating the version write from the rest of the pvclock
>> >> > structure is sufficient, as that would guarantee the writes are not
>> >> > reordered even with fast string REP MOVS.
>> >> >
>> >> > Thanks for catching this Andy!
>> >> >
>> >>
>> >> Don't you still need:
>> >>
>> >> version++;
>> >> write the rest;
>> >> version++;
>> >>
>> >> with possible smp_wmb() in there to keep the compiler from messing around?
>> >
>> > Correct. Could just as well follow the protocol and use odd/even, which
>> > is what your patch does.
>> >
>> > What is the point with the new flags bit though?
>>
>> To try to work around the problem on old hosts.  I'm not at all
>> convinced that this is worthwhile or that it helps, though.
>
> Andy,
>
> Are you going to submit the fix or should I?
>

I'd prefer if you did it.  I'm not familiar enough with the KVM memory
management stuff to do it confidently.  Feel free to mooch from my
patch if it's helpful.
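
(With the host following the odd/even protocol, the guest-side read can be
a plain version loop along these lines.  A sketch only: rdtsc_ordered()
and pvti_scale() stand in for whatever TSC-read and scaling helpers are
used; this is not the vdso code as merged.)

static u64 pvti_read_sketch(const struct pvclock_vcpu_time_info *pvti)
{
        u32 version;
        u64 tsc, ns;

        do {
                /* Wait for an even (stable) version. */
                while ((version = pvti->version) & 1)
                        ;
                smp_rmb();              /* read version before payload  */

                tsc = rdtsc_ordered();
                ns = pvti->system_time +
                     pvti_scale(tsc - pvti->tsc_timestamp,
                                pvti->tsc_to_system_mul, pvti->tsc_shift);

                smp_rmb();              /* read payload before re-check */
        } while (pvti->version != version);

        return ns;
}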

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  2015-01-08 22:43                             ` Andy Lutomirski
@ 2015-02-26 22:46                               ` Andy Lutomirski
  2015-02-26 22:46                               ` Andy Lutomirski
  1 sibling, 0 replies; 77+ messages in thread
From: Andy Lutomirski @ 2015-02-26 22:46 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, xen-devel, linux-kernel, kvm list, Gleb Natapov

On Thu, Jan 8, 2015 at 2:43 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jan 8, 2015 at 2:31 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote:
>>> On Tue, Jan 6, 2015 at 10:45 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>> > On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote:
>>> >> On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>> >> > On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
>>> >> >> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > On 06/01/2015 09:42, Paolo Bonzini wrote:
>>> >> >> > > > > Still confused.  So we can freeze all vCPUs in the host, then update
>>> >> >> > > > > pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
>>> >> >> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
>>> >> >> > > > > doesn't increment the version pre-update, and we can return completely
>>> >> >> > > > > bogus results.
>>> >> >> > > > Yes.
>>> >> >> > > But then the getcpu test would fail (1->0).  Even if you have an ABA
>>> >> >> > > situation (1->0->1), it's okay because the pvti that is fetched is the
>>> >> >> > > one returned by the first getcpu.
>>> >> >> >
>>> >> >> > ... this case of partial update of pvti, which is caught by the version
>>> >> >> > field, is of course different from the other (extremely unlikely) one that
>>> >> >> > Andy pointed out.  That is when the getcpus are done on the same vCPU,
>>> >> >> > but the rdtsc is on another.
>>> >> >> >
>>> >> >> > That one can be fixed by rdtscp, like
>>> >> >> >
>>> >> >> > do {
>>> >> >> >     // get a consistent (pvti, v, tsc) tuple
>>> >> >> >     do {
>>> >> >> >         cpu = get_cpu();
>>> >> >> >         pvti = get_pvti(cpu);
>>> >> >> >         v = pvti->version & ~1;
>>> >> >> >         // also acts as rmb();
>>> >> >> >         rdtsc_barrier();
>>> >> >> >         tsc = rdtscp(&cpu1);
>>> >> >>
>>> >> >> Off-topic note: rdtscp doesn't need a barrier at all.  AIUI AMD
>>> >> >> specified it that way and both AMD and Intel implement it correctly.
>>> >> >> (rdtsc, on the other hand, definitely needs the barrier beforehand.)
>>> >> >>
>>> >> >> >         // control dependency, no need for rdtsc_barrier?
>>> >> >> >     } while(cpu != cpu1);
>>> >> >> >
>>> >> >> >     // ... compute nanoseconds from pvti and tsc ...
>>> >> >> >     rmb();
>>> >> >> > }   while(v != pvti->version);
>>> >> >>
>>> >> >> Still no good.  We can migrate a bunch of times so we see the same CPU
>>> >> >> all three times and *still* don't get a consistent read, unless we
>>> >> >> play nasty games with lots of version checks (I have a patch for that,
>>> >> >> but I don't like it very much).  The patch is here:
>>> >> >>
>>> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
>>> >> >>
>>> >> >> but I don't like it.
>>> >> >>
>>> >> >> Thus far, I've been told unambiguously that a guest can't observe pvti
>>> >> >> while it's being written, and I think you're now telling me that this
>>> >> >> isn't true and that a guest *can* observe pvti while it's being
>>> >> >> written while the low bit of the version field is not set.  If so,
>>> >> >> this is rather strongly incompatible with the spec in the KVM docs.
>>> >> >>
>>> >> >> I don't suppose that you and Marcelo could agree on what the actual
>>> >> >> semantics that KVM provides are and could write it down in a way that
>>> >> >> people who haven't spent a long time staring at the request code
>>> >> >> understand?  And maybe you could even fix the implementation while
>>> >> >> you're at it if the implementation is, indeed, broken.  I have ugly
>>> >> >> patches to fix it here:
>>> >> >>
>>> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0
>>> >> >>
>>> >> >> but I'm not thrilled with them.
>>> >> >>
>>> >> >> --Andy
>>> >> >
>>> >> > I suppose that separating the version write from the rest of the pvclock
>>> >> > structure is sufficient, as that would guarantee the writes are not
>>> >> > reordered even with fast string REP MOVS.
>>> >> >
>>> >> > Thanks for catching this Andy!
>>> >> >
>>> >>
>>> >> Don't you still need:
>>> >>
>>> >> version++;
>>> >> write the rest;
>>> >> version++;
>>> >>
>>> >> with possible smp_wmb() in there to keep the compiler from messing around?
>>> >
>>> > Correct. Could just as well follow the protocol and use odd/even, which
>>> > is what your patch does.
>>> >
>>> > What is the point with the new flags bit though?
>>>
>>> To try to work around the problem on old hosts.  I'm not at all
>>> convinced that this is worthwhile or that it helps, though.
>>
>> Andy,
>>
>> Are you going to submit the fix or should I?
>>
>
> I'd prefer if you did it.  I'm not familiar enough with the KVM memory
> management stuff to do it confidently.  Feel free to mooch from my
> patch if it's helpful.

Any update here?  I can try it myself if no one else wants to do it.

--Andy

>
> --Andy
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 77+ messages in thread

end of thread, other threads:[~2015-02-26 22:46 UTC | newest]

Thread overview: 77+ messages
2014-12-23  0:39 [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups Andy Lutomirski
2014-12-23  0:39 ` [RFC 1/2] x86, vdso: Use asm volatile in __getcpu Andy Lutomirski
2014-12-23  0:39 ` Andy Lutomirski
2014-12-23  0:39 ` [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader Andy Lutomirski
2014-12-23 10:28   ` [Xen-devel] " David Vrabel
2014-12-23 10:28   ` David Vrabel
2014-12-23 15:14   ` Boris Ostrovsky
2014-12-23 15:14   ` [Xen-devel] " Boris Ostrovsky
2014-12-23 15:14     ` Paolo Bonzini
2014-12-23 15:14     ` [Xen-devel] " Paolo Bonzini
2014-12-23 15:25       ` Boris Ostrovsky
2014-12-23 15:25       ` [Xen-devel] " Boris Ostrovsky
2014-12-24 21:30   ` David Matlack
2014-12-24 21:43     ` Andy Lutomirski
2014-12-24 21:43     ` Andy Lutomirski
2014-12-24 21:30   ` David Matlack
2015-01-05 15:25   ` Marcelo Tosatti
2015-01-05 18:56     ` Andy Lutomirski
2015-01-05 18:56     ` Andy Lutomirski
2015-01-05 19:17       ` Marcelo Tosatti
2015-01-05 19:17       ` Marcelo Tosatti
2015-01-05 22:38         ` Andy Lutomirski
2015-01-05 22:48           ` Marcelo Tosatti
2015-01-05 22:53             ` Andy Lutomirski
2015-01-05 22:53             ` Andy Lutomirski
2015-01-06  8:42             ` Paolo Bonzini
2015-01-06  8:42               ` Paolo Bonzini
2015-01-06 12:01               ` Paolo Bonzini
2015-01-06 16:56                 ` Andy Lutomirski
2015-01-06 16:56                 ` Andy Lutomirski
2015-01-06 18:13                   ` Marcelo Tosatti
2015-01-06 18:13                   ` Marcelo Tosatti
2015-01-06 18:26                     ` Andy Lutomirski
2015-01-06 18:26                     ` Andy Lutomirski
2015-01-06 18:45                       ` Marcelo Tosatti
2015-01-06 18:45                       ` Marcelo Tosatti
2015-01-06 19:49                         ` Andy Lutomirski
2015-01-06 20:20                           ` Marcelo Tosatti
2015-01-06 21:54                             ` Andy Lutomirski
2015-01-06 21:54                             ` Andy Lutomirski
2015-01-06 20:20                           ` Marcelo Tosatti
2015-01-08 22:31                           ` Marcelo Tosatti
2015-01-08 22:43                             ` Andy Lutomirski
2015-01-08 22:43                             ` Andy Lutomirski
2015-02-26 22:46                               ` Andy Lutomirski
2015-02-26 22:46                               ` Andy Lutomirski
2015-01-08 22:31                           ` Marcelo Tosatti
2015-01-06 19:49                         ` Andy Lutomirski
2015-01-07  5:41                       ` Paolo Bonzini
2015-01-07  5:41                       ` Paolo Bonzini
2015-01-07  5:38                   ` Paolo Bonzini
2015-01-07  5:38                   ` Paolo Bonzini
2015-01-07  7:18                     ` Andy Lutomirski
2015-01-07  7:18                     ` Andy Lutomirski
2015-01-07  9:00                       ` Paolo Bonzini
2015-01-07  9:00                       ` Paolo Bonzini
2015-01-07 14:45                       ` Marcelo Tosatti
2015-01-07 14:45                       ` Marcelo Tosatti
2015-01-05 22:48           ` Marcelo Tosatti
2015-01-05 22:38         ` Andy Lutomirski
2015-01-06  8:39         ` Paolo Bonzini
2015-01-06  8:39         ` Paolo Bonzini
2015-01-05 22:23       ` Paolo Bonzini
2015-01-05 22:23       ` Paolo Bonzini
2015-01-06 14:35       ` [Xen-devel] " Konrad Rzeszutek Wilk
2015-01-06 14:35         ` Konrad Rzeszutek Wilk
2015-01-05 15:25   ` Marcelo Tosatti
2015-01-08 12:51   ` [Xen-devel] " David Vrabel
2015-01-08 12:51     ` David Vrabel
2014-12-23  0:39 ` Andy Lutomirski
2014-12-23  7:21 ` [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups Paolo Bonzini
2014-12-23  8:16   ` Andy Lutomirski
2014-12-23  8:30     ` Paolo Bonzini
2014-12-23  8:30     ` Paolo Bonzini
2014-12-23  8:16   ` Andy Lutomirski
2014-12-23  7:21 ` Paolo Bonzini
