kvm.vger.kernel.org archive mirror
* [PATCH 0/2] KVM: x86: do not mix raw and monotonic clocks in kvmclock
@ 2020-01-22 14:22 Paolo Bonzini
  2020-01-22 14:22 ` [PATCH 1/2] KVM: x86: reorganize pvclock_gtod_data members Paolo Bonzini
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Paolo Bonzini @ 2020-01-22 14:22 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: mtosatti

Commit 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw
clock") changed kvmclock to use tkr_raw instead of tkr_mono.  However,
the default kvmclock_offset for the VM was still based on the monotonic
clock and, if the raw clock drifted enough from the monotonic clock,
this could cause a negative system_time to be written to the guest's
struct pvclock.  RHEL5 does not like it and (if it boots fast enough to
observe a negative time value) it hangs.

This series fixes the issue by using the raw clock everywhere.
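
The failure mode described above can be illustrated with a small
user-space model.  This is only a sketch under stated assumptions:
struct vm_model, vm_init() and vm_system_time() are hypothetical names
and not KVM code.  The only things taken from the description are that
kvmclock_offset is the negated reading of one clock at VM creation, and
that system_time is later computed by adding it to a clock reading.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space model, not kernel code.  Only the names
 * kvmclock_offset and system_time come from the thread.
 */
struct vm_model {
	int64_t kvmclock_offset;	/* set once at VM creation */
};

/* VM creation: the offset is the negated clock reading at that moment. */
static void vm_init(struct vm_model *vm, int64_t base_ns_at_creation)
{
	assert(base_ns_at_creation >= 0);
	vm->kvmclock_offset = -base_ns_at_creation;
}

/* Guest time update: system_time = current clock reading + offset. */
static int64_t vm_system_time(const struct vm_model *vm, int64_t base_ns_now)
{
	return base_ns_now + vm->kvmclock_offset;
}
```

For instance, suppose the monotonic reading at VM creation is
1,000,000,000,000 ns while the raw clock lags it by 5 ms.  If the offset
is taken from the monotonic reading and the update 1 ms later uses the
raw reading (999,996,000,000 ns), vm_system_time() yields -4,000,000 ns.
Taking the offset from the raw reading (999,995,000,000 ns) instead
gives 1,000,000 ns.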

(And this, ladies and gentlemen, is why I was not applying patches to
the KVM tree.  I saw this before Christmas and could only reproduce it
today, since it requires almost 2 weeks of uptime to reproduce on my
machine.  Of course, once you have the reproducer the fix is relatively
easy to come up with).

Paolo

Paolo Bonzini (2):
  KVM: x86: reorganize pvclock_gtod_data members
  KVM: x86: use raw clock values consistently

 arch/x86/kvm/x86.c | 67 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 35 insertions(+), 32 deletions(-)

-- 
1.8.3.1



* [PATCH 1/2] KVM: x86: reorganize pvclock_gtod_data members
  2020-01-22 14:22 [PATCH 0/2] KVM: x86: do not mix raw and monotonic clocks in kvmclock Paolo Bonzini
@ 2020-01-22 14:22 ` Paolo Bonzini
  2020-01-23 11:32   ` Vitaly Kuznetsov
  2020-01-22 14:22 ` [PATCH 2/2] KVM: x86: use raw clock values consistently Paolo Bonzini
  2020-01-24 20:36 ` [PATCH 0/2] KVM: x86: do not mix raw and monotonic clocks in kvmclock Marcelo Tosatti
  2 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2020-01-22 14:22 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: mtosatti, stable

We will need a copy of tk->offs_boot in the next patch.  Store it and
clean up the struct: instead of storing tk->tkr_xxx.base with the tk->offs_boot
included, store the raw value in struct pvclock_clock and sum tk->offs_boot
in do_monotonic_raw and do_realtime.  tk->tkr_xxx.xtime_nsec also moves
to struct pvclock_clock.

While at it, fix a (usually harmless) typo in do_monotonic_raw, which
was using gtod->clock.shift instead of gtod->raw_clock.shift.

Fixes: 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw clock")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/x86.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 89621025577a..1b4273cce63c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1532,6 +1532,8 @@ struct pvclock_clock {
 	u64 mask;
 	u32 mult;
 	u32 shift;
+	u64 base_cycles;
+	u64 offset;
 };
 
 struct pvclock_gtod_data {
@@ -1540,11 +1542,8 @@ struct pvclock_gtod_data {
 	struct pvclock_clock clock; /* extract of a clocksource struct */
 	struct pvclock_clock raw_clock; /* extract of a clocksource struct */
 
-	u64		boot_ns_raw;
-	u64		boot_ns;
-	u64		nsec_base;
+	ktime_t		offs_boot;
 	u64		wall_time_sec;
-	u64		monotonic_raw_nsec;
 };
 
 static struct pvclock_gtod_data pvclock_gtod_data;
@@ -1552,10 +1551,6 @@ struct pvclock_gtod_data {
 static void update_pvclock_gtod(struct timekeeper *tk)
 {
 	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
-	u64 boot_ns, boot_ns_raw;
-
-	boot_ns = ktime_to_ns(ktime_add(tk->tkr_mono.base, tk->offs_boot));
-	boot_ns_raw = ktime_to_ns(ktime_add(tk->tkr_raw.base, tk->offs_boot));
 
 	write_seqcount_begin(&vdata->seq);
 
@@ -1565,20 +1560,20 @@ static void update_pvclock_gtod(struct timekeeper *tk)
 	vdata->clock.mask		= tk->tkr_mono.mask;
 	vdata->clock.mult		= tk->tkr_mono.mult;
 	vdata->clock.shift		= tk->tkr_mono.shift;
+	vdata->clock.base_cycles	= tk->tkr_mono.xtime_nsec;
+	vdata->clock.offset		= tk->tkr_mono.base;
 
 	vdata->raw_clock.vclock_mode	= tk->tkr_raw.clock->archdata.vclock_mode;
 	vdata->raw_clock.cycle_last	= tk->tkr_raw.cycle_last;
 	vdata->raw_clock.mask		= tk->tkr_raw.mask;
 	vdata->raw_clock.mult		= tk->tkr_raw.mult;
 	vdata->raw_clock.shift		= tk->tkr_raw.shift;
-
-	vdata->boot_ns			= boot_ns;
-	vdata->nsec_base		= tk->tkr_mono.xtime_nsec;
+	vdata->raw_clock.base_cycles	= tk->tkr_raw.xtime_nsec;
+	vdata->raw_clock.offset		= tk->tkr_raw.base;
 
 	vdata->wall_time_sec            = tk->xtime_sec;
 
-	vdata->boot_ns_raw		= boot_ns_raw;
-	vdata->monotonic_raw_nsec	= tk->tkr_raw.xtime_nsec;
+	vdata->offs_boot		= tk->offs_boot;
 
 	write_seqcount_end(&vdata->seq);
 }
@@ -2048,10 +2043,10 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
 
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
-		ns = gtod->monotonic_raw_nsec;
+		ns = gtod->raw_clock.base_cycles;
 		ns += vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
-		ns >>= gtod->clock.shift;
-		ns += gtod->boot_ns_raw;
+		ns >>= gtod->raw_clock.shift;
+		ns += ktime_to_ns(ktime_add(gtod->raw_clock.offset, gtod->offs_boot));
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 	*t = ns;
 
@@ -2068,7 +2063,7 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
 		ts->tv_sec = gtod->wall_time_sec;
-		ns = gtod->nsec_base;
+		ns = gtod->clock.base_cycles;
 		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
-- 
1.8.3.1




* [PATCH 2/2] KVM: x86: use raw clock values consistently
  2020-01-22 14:22 [PATCH 0/2] KVM: x86: do not mix raw and monotonic clocks in kvmclock Paolo Bonzini
  2020-01-22 14:22 ` [PATCH 1/2] KVM: x86: reorganize pvclock_gtod_data members Paolo Bonzini
@ 2020-01-22 14:22 ` Paolo Bonzini
  2020-01-23 13:43   ` Vitaly Kuznetsov
  2020-01-24 20:36 ` [PATCH 0/2] KVM: x86: do not mix raw and monotonic clocks in kvmclock Marcelo Tosatti
  2 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2020-01-22 14:22 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: mtosatti, stable

Commit 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw
clock") changed kvmclock to use tkr_raw instead of tkr_mono.  However,
the default kvmclock_offset for the VM was still based on the monotonic
clock and, if the raw clock drifted enough from the monotonic clock,
this could cause a negative system_time to be written to the guest's
struct pvclock.  RHEL5 does not like it and (if it boots fast enough to
observe a negative time value) it hangs.

There is another thing to be careful about: getboottime64 returns the
host boot time in tkr_mono units, and subtracting tkr_raw units will
cause the wallclock to be off if tkr_raw drifts from tkr_mono.  To
avoid this, compute the wallclock delta from the current time instead
of being clever and using getboottime64.

Fixes: 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw clock")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/x86.c | 38 +++++++++++++++++++++++---------------
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b4273cce63c..b5e0648580e1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1577,6 +1577,18 @@ static void update_pvclock_gtod(struct timekeeper *tk)
 
 	write_seqcount_end(&vdata->seq);
 }
+
+static s64 get_kvmclock_base_ns(void)
+{
+	/* Count up from boot time, but with the frequency of the raw clock.  */
+	return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
+}
+#else
+static s64 get_kvmclock_base_ns(void)
+{
+	/* Master clock not used, so we can just use CLOCK_BOOTTIME.  */
+	return ktime_get_boottime_ns();
+}
 #endif
 
 void kvm_set_pending_timer(struct kvm_vcpu *vcpu)
@@ -1590,7 +1602,7 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
 	int version;
 	int r;
 	struct pvclock_wall_clock wc;
-	struct timespec64 boot;
+	u64 wall_nsec;
 
 	if (!wall_clock)
 		return;
@@ -1610,17 +1622,12 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
 	/*
 	 * The guest calculates current wall clock time by adding
 	 * system time (updated by kvm_guest_time_update below) to the
-	 * wall clock specified here.  guest system time equals host
-	 * system time for us, thus we must fill in host boot time here.
+	 * wall clock specified here.  We do the reverse here.
 	 */
-	getboottime64(&boot);
+	wall_nsec = ktime_get_real_ns() - get_kvmclock_ns(kvm);
 
-	if (kvm->arch.kvmclock_offset) {
-		struct timespec64 ts = ns_to_timespec64(kvm->arch.kvmclock_offset);
-		boot = timespec64_sub(boot, ts);
-	}
-	wc.sec = (u32)boot.tv_sec; /* overflow in 2106 guest time */
-	wc.nsec = boot.tv_nsec;
+	wc.nsec = do_div(wall_nsec, 1000000000);
+	wc.sec = (u32)wall_nsec; /* overflow in 2106 guest time */
 	wc.version = version;
 
 	kvm_write_guest(kvm, wall_clock, &wc, sizeof(wc));
@@ -1868,7 +1875,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = kvm_compute_tsc_offset(vcpu, data);
-	ns = ktime_get_boottime_ns();
+	ns = get_kvmclock_base_ns();
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
@@ -2206,7 +2213,7 @@ u64 get_kvmclock_ns(struct kvm *kvm)
 	spin_lock(&ka->pvclock_gtod_sync_lock);
 	if (!ka->use_master_clock) {
 		spin_unlock(&ka->pvclock_gtod_sync_lock);
-		return ktime_get_boottime_ns() + ka->kvmclock_offset;
+		return get_kvmclock_base_ns() + ka->kvmclock_offset;
 	}
 
 	hv_clock.tsc_timestamp = ka->master_cycle_now;
@@ -2222,7 +2229,7 @@ u64 get_kvmclock_ns(struct kvm *kvm)
 				   &hv_clock.tsc_to_system_mul);
 		ret = __pvclock_read_cycles(&hv_clock, rdtsc());
 	} else
-		ret = ktime_get_boottime_ns() + ka->kvmclock_offset;
+		ret = get_kvmclock_base_ns() + ka->kvmclock_offset;
 
 	put_cpu();
 
@@ -2321,7 +2328,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	}
 	if (!use_master_clock) {
 		host_tsc = rdtsc();
-		kernel_ns = ktime_get_boottime_ns();
+		kernel_ns = get_kvmclock_base_ns();
 	}
 
 	tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
@@ -2361,6 +2368,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
 	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
 	vcpu->last_guest_tsc = tsc_timestamp;
+	WARN_ON(vcpu->hv_clock.system_time < 0);
 
 	/* If the host uses TSC clocksource, then it is stable */
 	pvclock_flags = 0;
@@ -9473,7 +9481,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	mutex_init(&kvm->arch.apic_map_lock);
 	spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock);
 
-	kvm->arch.kvmclock_offset = -ktime_get_boottime_ns();
+	kvm->arch.kvmclock_offset = -get_kvmclock_base_ns();
 	pvclock_update_vm_gtod_copy(kvm);
 
 	kvm->arch.guest_can_read_msr_platform_info = true;
-- 
1.8.3.1



* Re: [PATCH 1/2] KVM: x86: reorganize pvclock_gtod_data members
  2020-01-22 14:22 ` [PATCH 1/2] KVM: x86: reorganize pvclock_gtod_data members Paolo Bonzini
@ 2020-01-23 11:32   ` Vitaly Kuznetsov
  2020-01-23 11:35     ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Vitaly Kuznetsov @ 2020-01-23 11:32 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: mtosatti, stable, linux-kernel, kvm

Paolo Bonzini <pbonzini@redhat.com> writes:

> We will need a copy of tk->offs_boot in the next patch.  Store it and
> clean up the struct: instead of storing tk->tkr_xxx.base with the tk->offs_boot
> included, store the raw value in struct pvclock_clock and sum tk->offs_boot
> in do_monotonic_raw and do_realtime.  tk->tkr_xxx.xtime_nsec also moves
> to struct pvclock_clock.
>
> While at it, fix a (usually harmless) typo in do_monotonic_raw, which
> was using gtod->clock.shift instead of gtod->raw_clock.shift.
>
> Fixes: 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw clock")
> Cc: stable@vger.kernel.org
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 29 ++++++++++++-----------------
>  1 file changed, 12 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 89621025577a..1b4273cce63c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1532,6 +1532,8 @@ struct pvclock_clock {
>  	u64 mask;
>  	u32 mult;
>  	u32 shift;
> +	u64 base_cycles;
> +	u64 offset;
>  };
>  
>  struct pvclock_gtod_data {
> @@ -1540,11 +1542,8 @@ struct pvclock_gtod_data {
>  	struct pvclock_clock clock; /* extract of a clocksource struct */
>  	struct pvclock_clock raw_clock; /* extract of a clocksource struct */
>  
> -	u64		boot_ns_raw;
> -	u64		boot_ns;
> -	u64		nsec_base;
> +	ktime_t		offs_boot;
>  	u64		wall_time_sec;
> -	u64		monotonic_raw_nsec;
>  };
>  
>  static struct pvclock_gtod_data pvclock_gtod_data;
> @@ -1552,10 +1551,6 @@ struct pvclock_gtod_data {
>  static void update_pvclock_gtod(struct timekeeper *tk)
>  {
>  	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
> -	u64 boot_ns, boot_ns_raw;
> -
> -	boot_ns = ktime_to_ns(ktime_add(tk->tkr_mono.base, tk->offs_boot));
> -	boot_ns_raw = ktime_to_ns(ktime_add(tk->tkr_raw.base, tk->offs_boot));
>  
>  	write_seqcount_begin(&vdata->seq);
>  
> @@ -1565,20 +1560,20 @@ static void update_pvclock_gtod(struct timekeeper *tk)
>  	vdata->clock.mask		= tk->tkr_mono.mask;
>  	vdata->clock.mult		= tk->tkr_mono.mult;
>  	vdata->clock.shift		= tk->tkr_mono.shift;
> +	vdata->clock.base_cycles	= tk->tkr_mono.xtime_nsec;
> +	vdata->clock.offset		= tk->tkr_mono.base;
>  
>  	vdata->raw_clock.vclock_mode	= tk->tkr_raw.clock->archdata.vclock_mode;
>  	vdata->raw_clock.cycle_last	= tk->tkr_raw.cycle_last;
>  	vdata->raw_clock.mask		= tk->tkr_raw.mask;
>  	vdata->raw_clock.mult		= tk->tkr_raw.mult;
>  	vdata->raw_clock.shift		= tk->tkr_raw.shift;
> -
> -	vdata->boot_ns			= boot_ns;
> -	vdata->nsec_base		= tk->tkr_mono.xtime_nsec;
> +	vdata->raw_clock.base_cycles	= tk->tkr_raw.xtime_nsec;
> +	vdata->raw_clock.offset		= tk->tkr_raw.base;

Likely a personal preference but the suggested naming is a bit
confusing: we use 'base_cycles' to keep 'xtime_nsec' and 'offset' to
keep ... 'base'. Not that I think that 'struct timekeeper' is perfect
but at least it is documented. Should we maybe just stick to it (and
name 'struct pvclock_clock' fields accordingly?)

>  
>  	vdata->wall_time_sec            = tk->xtime_sec;
>  
> -	vdata->boot_ns_raw		= boot_ns_raw;
> -	vdata->monotonic_raw_nsec	= tk->tkr_raw.xtime_nsec;
> +	vdata->offs_boot		= tk->offs_boot;
>  
>  	write_seqcount_end(&vdata->seq);
>  }
> @@ -2048,10 +2043,10 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
>  
>  	do {
>  		seq = read_seqcount_begin(&gtod->seq);
> -		ns = gtod->monotonic_raw_nsec;
> +		ns = gtod->raw_clock.base_cycles;
>  		ns += vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
> -		ns >>= gtod->clock.shift;
> -		ns += gtod->boot_ns_raw;
> +		ns >>= gtod->raw_clock.shift;
> +		ns += ktime_to_ns(ktime_add(gtod->raw_clock.offset, gtod->offs_boot));
>  	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
>  	*t = ns;
>  
> @@ -2068,7 +2063,7 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
>  	do {
>  		seq = read_seqcount_begin(&gtod->seq);
>  		ts->tv_sec = gtod->wall_time_sec;
> -		ns = gtod->nsec_base;
> +		ns = gtod->clock.base_cycles;
>  		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
>  		ns >>= gtod->clock.shift;
>  	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));

FWIW,

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>

-- 
Vitaly



* Re: [PATCH 1/2] KVM: x86: reorganize pvclock_gtod_data members
  2020-01-23 11:32   ` Vitaly Kuznetsov
@ 2020-01-23 11:35     ` Paolo Bonzini
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Bonzini @ 2020-01-23 11:35 UTC (permalink / raw)
  To: Vitaly Kuznetsov; +Cc: mtosatti, stable, linux-kernel, kvm

On 23/01/20 12:32, Vitaly Kuznetsov wrote:
> Likely a personal preference but the suggested naming is a bit
> confusing: we use 'base_cycles' to keep 'xtime_nsec' and 'offset' to
> keep ... 'base'. Not that I think that 'struct timekeeper' is perfect
> but at least it is documented. Should we maybe just stick to it (and
> name 'struct pvclock_clock' fields accordingly?)
> 

The problem is that xtime_nsec is not nanoseconds, and I'd really not
want to have a worse name just for consistency. :(  I chose
"base_cycles" as an incremental improvement over nsec_base, even though
that meant also changing struct timekeeper's "base" to "offset".
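
To make the units concrete, here is a rough user-space sketch of the
read path that the renamed fields feed.  It is an illustration under
assumptions, not the kernel code: struct clock_model and
clock_model_read_ns() are hypothetical names, the counter mask applied
by vgettsc() is left out, and offs_boot is not added.  The point is
only that base_cycles is kept in shifted units (nanoseconds scaled by
2^shift) and becomes nanoseconds after the right shift, while offset is
already in nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields discussed; not the kernel struct. */
struct clock_model {
	uint64_t cycle_last;	/* counter value at the last update */
	uint32_t mult;		/* cycles -> (ns << shift) multiplier */
	uint32_t shift;
	uint64_t base_cycles;	/* remainder at the last update, in (ns << shift) units */
	uint64_t offset;	/* whole nanoseconds at the last update */
};

static uint64_t clock_model_read_ns(const struct clock_model *c,
				    uint64_t cycles_now)
{
	uint64_t v;

	assert(c->shift < 64);
	v = c->base_cycles;				/* shifted units */
	v += (cycles_now - c->cycle_last) * c->mult;	/* still shifted units */
	v >>= c->shift;					/* now nanoseconds */
	return v + c->offset;				/* add whole nanoseconds */
}
```

Shifting by another clock's shift value gives a wrong result whenever
the two shifts differ, which is why the do_monotonic_raw typo is only
"usually" harmless.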

Paolo



* Re: [PATCH 2/2] KVM: x86: use raw clock values consistently
  2020-01-22 14:22 ` [PATCH 2/2] KVM: x86: use raw clock values consistently Paolo Bonzini
@ 2020-01-23 13:43   ` Vitaly Kuznetsov
  2020-01-23 13:54     ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Vitaly Kuznetsov @ 2020-01-23 13:43 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: mtosatti, stable, linux-kernel, kvm

Paolo Bonzini <pbonzini@redhat.com> writes:

> Commit 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw
> clock") changed kvmclock to use tkr_raw instead of tkr_mono.  However,
> the default kvmclock_offset for the VM was still based on the monotonic
> clock and, if the raw clock drifted enough from the monotonic clock,
> this could cause a negative system_time to be written to the guest's
> struct pvclock.  RHEL5 does not like it and (if it boots fast enough to
> observe a negative time value) it hangs.
>
> There is another thing to be careful about: getboottime64 returns the
> host boot time in tkr_mono units, and subtracting tkr_raw units will
> cause the wallclock to be off if tkr_raw drifts from tkr_mono.  To
> avoid this, compute the wallclock delta from the current time instead
> of being clever and using getboottime64.
>
> Fixes: 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw clock")
> Cc: stable@vger.kernel.org
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 38 +++++++++++++++++++++++---------------
>  1 file changed, 23 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1b4273cce63c..b5e0648580e1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1577,6 +1577,18 @@ static void update_pvclock_gtod(struct timekeeper *tk)
>  
>  	write_seqcount_end(&vdata->seq);
>  }
> +
> +static s64 get_kvmclock_base_ns(void)
> +{
> +	/* Count up from boot time, but with the frequency of the raw clock.  */
> +	return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
> +}
> +#else
> +static s64 get_kvmclock_base_ns(void)
> +{
> +	/* Master clock not used, so we can just use CLOCK_BOOTTIME.  */
> +	return ktime_get_boottime_ns();
> +}
>  #endif

But we could've still used the RAW+offs_boot version, right? And this is
just to basically preserve the existing behavior on !x86.

>  
>  void kvm_set_pending_timer(struct kvm_vcpu *vcpu)
> @@ -1590,7 +1602,7 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
>  	int version;
>  	int r;
>  	struct pvclock_wall_clock wc;
> -	struct timespec64 boot;
> +	u64 wall_nsec;
>  
>  	if (!wall_clock)
>  		return;
> @@ -1610,17 +1622,12 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
>  	/*
>  	 * The guest calculates current wall clock time by adding
>  	 * system time (updated by kvm_guest_time_update below) to the
> -	 * wall clock specified here.  guest system time equals host
> -	 * system time for us, thus we must fill in host boot time here.
> +	 * wall clock specified here.  We do the reverse here.
>  	 */
> -	getboottime64(&boot);
> +	wall_nsec = ktime_get_real_ns() - get_kvmclock_ns(kvm);

There are not that many hosts with more than 50 years uptime and likely
none running Linux with live kernel patching support so I bet no one will
ever see this overflowing.  However, as wall_nsec is u64 and we're
dealing with kvmclock here, I'd suggest adding a WARN_ON().
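
For reference, the computation under discussion can be sketched in user
space roughly as follows.  This is an assumption-laden illustration, not
the patch itself: struct wall_clock_model and wall_clock_from() are
hypothetical names, plain division and modulo stand in for do_div(), and
the assert() is just one possible reading of the kind of check being
suggested.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct wall_clock_model {	/* hypothetical; mirrors only sec and nsec */
	uint32_t sec;
	uint32_t nsec;
};

static struct wall_clock_model wall_clock_from(uint64_t real_ns,
					       uint64_t kvmclock_ns)
{
	struct wall_clock_model wc;
	uint64_t wall_nsec;

	/* Unsigned subtraction would wrap if kvmclock_ns exceeded real_ns. */
	assert(real_ns >= kvmclock_ns);
	wall_nsec = real_ns - kvmclock_ns;

	/* do_div() divides in place and returns the remainder. */
	wc.nsec = (uint32_t)(wall_nsec % NSEC_PER_SEC);
	wc.sec = (uint32_t)(wall_nsec / NSEC_PER_SEC);	/* truncates to 32 bits */
	return wc;
}
```

The guest then recovers the current wall clock time by adding its
system time back to the stored sec and nsec values.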

>  
> -	if (kvm->arch.kvmclock_offset) {
> -		struct timespec64 ts = ns_to_timespec64(kvm->arch.kvmclock_offset);
> -		boot = timespec64_sub(boot, ts);
> -	}
> -	wc.sec = (u32)boot.tv_sec; /* overflow in 2106 guest time */
> -	wc.nsec = boot.tv_nsec;
> +	wc.nsec = do_div(wall_nsec, 1000000000);
> +	wc.sec = (u32)wall_nsec; /* overflow in 2106 guest time */
>  	wc.version = version;
>  
>  	kvm_write_guest(kvm, wall_clock, &wc, sizeof(wc));
> @@ -1868,7 +1875,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
>  
>  	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>  	offset = kvm_compute_tsc_offset(vcpu, data);
> -	ns = ktime_get_boottime_ns();
> +	ns = get_kvmclock_base_ns();
>  	elapsed = ns - kvm->arch.last_tsc_nsec;
>  
>  	if (vcpu->arch.virtual_tsc_khz) {
> @@ -2206,7 +2213,7 @@ u64 get_kvmclock_ns(struct kvm *kvm)
>  	spin_lock(&ka->pvclock_gtod_sync_lock);
>  	if (!ka->use_master_clock) {
>  		spin_unlock(&ka->pvclock_gtod_sync_lock);
> -		return ktime_get_boottime_ns() + ka->kvmclock_offset;
> +		return get_kvmclock_base_ns() + ka->kvmclock_offset;
>  	}
>  
>  	hv_clock.tsc_timestamp = ka->master_cycle_now;
> @@ -2222,7 +2229,7 @@ u64 get_kvmclock_ns(struct kvm *kvm)
>  				   &hv_clock.tsc_to_system_mul);
>  		ret = __pvclock_read_cycles(&hv_clock, rdtsc());
>  	} else
> -		ret = ktime_get_boottime_ns() + ka->kvmclock_offset;
> +		ret = get_kvmclock_base_ns() + ka->kvmclock_offset;
>  
>  	put_cpu();
>  
> @@ -2321,7 +2328,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  	}
>  	if (!use_master_clock) {
>  		host_tsc = rdtsc();
> -		kernel_ns = ktime_get_boottime_ns();
> +		kernel_ns = get_kvmclock_base_ns();
>  	}
>  
>  	tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
> @@ -2361,6 +2368,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
>  	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
>  	vcpu->last_guest_tsc = tsc_timestamp;
> +	WARN_ON(vcpu->hv_clock.system_time < 0);
>  
>  	/* If the host uses TSC clocksource, then it is stable */
>  	pvclock_flags = 0;
> @@ -9473,7 +9481,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	mutex_init(&kvm->arch.apic_map_lock);
>  	spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock);
>  
> -	kvm->arch.kvmclock_offset = -ktime_get_boottime_ns();
> +	kvm->arch.kvmclock_offset = -get_kvmclock_base_ns();
>  	pvclock_update_vm_gtod_copy(kvm);
>  
>  	kvm->arch.guest_can_read_msr_platform_info = true;

This looks correct to me but kvmclock is a glorious beast so take this
with a grain of salt)

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>

-- 
Vitaly



* Re: [PATCH 2/2] KVM: x86: use raw clock values consistently
  2020-01-23 13:43   ` Vitaly Kuznetsov
@ 2020-01-23 13:54     ` Paolo Bonzini
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Bonzini @ 2020-01-23 13:54 UTC (permalink / raw)
  To: Vitaly Kuznetsov; +Cc: mtosatti, stable, linux-kernel, kvm

On 23/01/20 14:43, Vitaly Kuznetsov wrote:
>> +
>> +static s64 get_kvmclock_base_ns(void)
>> +{
>> +	/* Count up from boot time, but with the frequency of the raw clock.  */
>> +	return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
>> +}
>> +#else
>> +static s64 get_kvmclock_base_ns(void)
>> +{
>> +	/* Master clock not used, so we can just use CLOCK_BOOTTIME.  */
>> +	return ktime_get_boottime_ns();
>> +}
>>  #endif
> But we could've still used the RAW+offs_boot version, right? And this is
> just to basically preserve the existing behavior on !x86.

Yes, there's no reason to restrict the pvclock_gtod notifier to x86_64.
 But this is stable material so I kept it easy.

>>
>> -	getboottime64(&boot);
>> +	wall_nsec = ktime_get_real_ns() - get_kvmclock_ns(kvm);
> 
> There are not that many hosts with more than 50 years uptime and likely
> none running Linux with live kernel patching support so I bet no one will
> ever see this overflowing.  However, as wall_nsec is u64 and we're
> dealing with kvmclock here, I'd suggest adding a WARN_ON().

You're off by a factor of 10, 2^64 nanoseconds are about 584 years
(584*365*10^9*86400). :)
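
The arithmetic is easy to check with a few lines of C.  The helper below
is a hypothetical sketch (ns_to_whole_years() is not a kernel function)
and assumes 365-day years, as in the expression above.

```c
#include <assert.h>
#include <stdint.h>

/* Whole 365-day years contained in a nanosecond count. */
static uint64_t ns_to_whole_years(uint64_t ns)
{
	const uint64_t ns_per_year = 365ULL * 86400ULL * 1000000000ULL;

	/* 365 * 86400 * 10^9 = 3.1536e16, which fits easily in 64 bits. */
	assert(ns_per_year == 31536000000000000ULL);
	return ns / ns_per_year;
}
```

ns_to_whole_years(UINT64_MAX) returns 584, and the signed limit
ns_to_whole_years(INT64_MAX) returns 292.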

Paolo



* Re: [PATCH 0/2] KVM: x86: do not mix raw and monotonic clocks in kvmclock
  2020-01-22 14:22 [PATCH 0/2] KVM: x86: do not mix raw and monotonic clocks in kvmclock Paolo Bonzini
  2020-01-22 14:22 ` [PATCH 1/2] KVM: x86: reorganize pvclock_gtod_data members Paolo Bonzini
  2020-01-22 14:22 ` [PATCH 2/2] KVM: x86: use raw clock values consistently Paolo Bonzini
@ 2020-01-24 20:36 ` Marcelo Tosatti
  2020-01-25  9:42   ` Paolo Bonzini
  2 siblings, 1 reply; 9+ messages in thread
From: Marcelo Tosatti @ 2020-01-24 20:36 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm

On Wed, Jan 22, 2020 at 03:22:31PM +0100, Paolo Bonzini wrote:
> Commit 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw
> clock") changed kvmclock to use tkr_raw instead of tkr_mono.  However,
> the default kvmclock_offset for the VM was still based on the monotonic
> clock and, if the raw clock drifted enough from the monotonic clock,
> this could cause a negative system_time to be written to the guest's
> struct pvclock.  RHEL5 does not like it and (if it boots fast enough to
> observe a negative time value) it hangs.
> 
> This series fixes the issue by using the raw clock everywhere.
> 
> (And this, ladies and gentlemen, is why I was not applying patches to
> the KVM tree.  I saw this before Christmas and could only reproduce it
> today, since it requires almost 2 weeks of uptime to reproduce on my
> machine.  Of course, once you have the reproducer the fix is relatively
> easy to come up with).
> 
> Paolo
> 
> Paolo Bonzini (2):
>   KVM: x86: reorganize pvclock_gtod_data members
>   KVM: x86: use raw clock values consistently
> 
>  arch/x86/kvm/x86.c | 67 ++++++++++++++++++++++++++++--------------------------
>  1 file changed, 35 insertions(+), 32 deletions(-)
> 
> -- 
> 1.8.3.1

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>

BTW, should switch both masterclock and non-masterclock cases
to raw clock base. Do you see any problem with that? 

Using the same reasoning as raw clock for master, ntpd in 
the guest should correct the difference.

Could probably simplify things.



* Re: [PATCH 0/2] KVM: x86: do not mix raw and monotonic clocks in kvmclock
  2020-01-24 20:36 ` [PATCH 0/2] KVM: x86: do not mix raw and monotonic clocks in kvmclock Marcelo Tosatti
@ 2020-01-25  9:42   ` Paolo Bonzini
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Bonzini @ 2020-01-25  9:42 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, kvm

On 24/01/20 21:36, Marcelo Tosatti wrote:
> On Wed, Jan 22, 2020 at 03:22:31PM +0100, Paolo Bonzini wrote:
>> Commit 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw
>> clock") changed kvmclock to use tkr_raw instead of tkr_mono.  However,
>> the default kvmclock_offset for the VM was still based on the monotonic
>> clock and, if the raw clock drifted enough from the monotonic clock,
>> this could cause a negative system_time to be written to the guest's
>> struct pvclock.  RHEL5 does not like it and (if it boots fast enough to
>> observe a negative time value) it hangs.
>>
>> This series fixes the issue by using the raw clock everywhere.
>>
>> (And this, ladies and gentlemen, is why I was not applying patches to
>> the KVM tree.  I saw this before Christmas and could only reproduce it
>> today, since it requires almost 2 weeks of uptime to reproduce on my
>> machine.  Of course, once you have the reproducer the fix is relatively
>> easy to come up with).
>>
>> Paolo
>>
>> Paolo Bonzini (2):
>>   KVM: x86: reorganize pvclock_gtod_data members
>>   KVM: x86: use raw clock values consistently
>>
>>  arch/x86/kvm/x86.c | 67 ++++++++++++++++++++++++++++--------------------------
>>  1 file changed, 35 insertions(+), 32 deletions(-)
>>
>> -- 
>> 1.8.3.1
> 
> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> BTW, should switch both masterclock and non-masterclock cases
> to raw clock base. Do you see any problem with that? 

Indeed, that makes sense as a kind of unification of the code.  Together
with adding pvclock_gtod support for 32-bit, it should be easy.

Paolo



