linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
@ 2015-09-18 15:54 Radim Krčmář
  2015-09-18 15:54 ` [PATCH 1/2] x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Radim Krčmář @ 2015-09-18 15:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Paolo Bonzini, Marcelo Tosatti, Luiz Capitulino

This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
RFC because I haven't explored many potential problems or tested it.

[1/2] uses a different algorithm in the guest to start counting from 0.
[2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.

A viable alternative would be to implement opt-in features in kvm clock.

And because we probably only broke one old user (the infamous SLES 10), a
workaround like this is also possible: (but I'd rather not do that)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a60bdbccff51..ae9049248aaf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2007,7 +2007,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
 			ka->boot_vcpu_runs_old_kvmclock = tmp;
 
-			ka->kvmclock_offset = -get_kernel_ns();
+			if (!ka->boot_vcpu_runs_old_kvmclock)
+				ka->kvmclock_offset = -get_kernel_ns();
 		}
 
 		vcpu->arch.time = data;


Radim Krčmář (2):
  x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO
  Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock
    system MSR"

 arch/x86/include/asm/pvclock-abi.h |  1 +
 arch/x86/kernel/kvmclock.c         | 46 +++++++++++++++++++++++++++++---------
 arch/x86/kvm/x86.c                 |  4 ----
 3 files changed, 36 insertions(+), 15 deletions(-)

-- 
2.5.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 1/2] x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO
  2015-09-18 15:54 [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
@ 2015-09-18 15:54 ` Radim Krčmář
  2015-09-22 19:01   ` Marcelo Tosatti
  2015-09-28 14:10   ` Paolo Bonzini
  2015-09-18 15:54 ` [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR" Radim Krčmář
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 20+ messages in thread
From: Radim Krčmář @ 2015-09-18 15:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Paolo Bonzini, Marcelo Tosatti, Luiz Capitulino

Newer KVM won't be exposing PVCLOCK_COUNTS_FROM_ZERO anymore.
The purpose of that flags was to start counting system time from 0 when
the KVM clock has been initialized.
We can achieve the same by selecting one read as the initial point.

A simple subtraction will work unless the KVM clock count overflows
earlier (has smaller width) than scheduler's cycle count.  We should be
safe till x86_128.

Because PVCLOCK_COUNTS_FROM_ZERO was enabled only on new hypervisors,
setting sched clock as stable based on PVCLOCK_TSC_STABLE_BIT might
regress on older ones.

I presume we don't need to change kvm_clock_read instead of introducing
kvm_sched_clock_read.  A problem could arise in case sched_clock is
expected to return the same value as get_cycles, but we should have
merged those clocks in that case.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 arch/x86/kernel/kvmclock.c | 46 +++++++++++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 2c7aafa70702..ef5b3d2cecce 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -32,6 +32,7 @@
 static int kvmclock = 1;
 static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
 static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+static cycle_t kvm_sched_clock_offset;
 
 static int parse_no_kvmclock(char *arg)
 {
@@ -92,6 +93,29 @@ static cycle_t kvm_clock_get_cycles(struct clocksource *cs)
 	return kvm_clock_read();
 }
 
+static cycle_t kvm_sched_clock_read(void)
+{
+	return kvm_clock_read() - kvm_sched_clock_offset;
+}
+
+static inline void kvm_sched_clock_init(bool stable)
+{
+	if (!stable) {
+		pv_time_ops.sched_clock = kvm_clock_read;
+		return;
+	}
+
+	kvm_sched_clock_offset = kvm_clock_read();
+	pv_time_ops.sched_clock = kvm_sched_clock_read;
+	set_sched_clock_stable();
+
+	printk("kvm-clock: using sched offset of %llu cycles\n",
+			kvm_sched_clock_offset);
+
+	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
+	         sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
+}
+
 /*
  * If we don't do that, there is the possibility that the guest
  * will calibrate under heavy load - thus, getting a lower lpj -
@@ -248,7 +272,17 @@ void __init kvmclock_init(void)
 		memblock_free(mem, size);
 		return;
 	}
-	pv_time_ops.sched_clock = kvm_clock_read;
+
+	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
+		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+
+	cpu = get_cpu();
+	vcpu_time = &hv_clock[cpu].pvti;
+	flags = pvclock_read_flags(vcpu_time);
+
+	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	put_cpu();
+
 	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
 	x86_platform.get_wallclock = kvm_get_wallclock;
 	x86_platform.set_wallclock = kvm_set_wallclock;
@@ -265,16 +299,6 @@ void __init kvmclock_init(void)
 	kvm_get_preset_lpj();
 	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
 	pv_info.name = "KVM";
-
-	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
-		pvclock_set_flags(~0);
-
-	cpu = get_cpu();
-	vcpu_time = &hv_clock[cpu].pvti;
-	flags = pvclock_read_flags(vcpu_time);
-	if (flags & PVCLOCK_COUNTS_FROM_ZERO)
-		set_sched_clock_stable();
-	put_cpu();
 }
 
 int __init kvm_setup_vsyscall_timeinfo(void)
-- 
2.5.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
  2015-09-18 15:54 [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
  2015-09-18 15:54 ` [PATCH 1/2] x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
@ 2015-09-18 15:54 ` Radim Krčmář
  2015-09-22 19:01   ` Marcelo Tosatti
  2015-09-20 22:57 ` [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO Marcelo Tosatti
  2015-09-28 11:05 ` Paolo Bonzini
  3 siblings, 1 reply; 20+ messages in thread
From: Radim Krčmář @ 2015-09-18 15:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Paolo Bonzini, Marcelo Tosatti, Luiz Capitulino

Shifting pvclock_vcpu_time_info.system_time on write to KVM system time
MSR is a change of ABI.  Probably only 2.6.16 based SLES 10 breaks due
to its custom enhancements to kvmclock, but KVM never declared the MSR
only for one-shot initialization.  (Doc says that only one write is
needed.)

This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5.
And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 arch/x86/include/asm/pvclock-abi.h | 1 +
 arch/x86/kvm/x86.c                 | 4 ----
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
index 655e07a48f6c..67f08230103a 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -41,6 +41,7 @@ struct pvclock_wall_clock {
 
 #define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
 #define PVCLOCK_GUEST_STOPPED	(1 << 1)
+/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. */
 #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_PVCLOCK_ABI_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 18d59b584dee..34d33f4757d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1707,8 +1707,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		vcpu->pvclock_set_guest_stopped_request = false;
 	}
 
-	pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
-
 	/* If the host uses TSC clocksource, then it is stable */
 	if (use_master_clock)
 		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
@@ -2006,8 +2004,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 					&vcpu->requests);
 
 			ka->boot_vcpu_runs_old_kvmclock = tmp;
-
-			ka->kvmclock_offset = -get_kernel_ns();
 		}
 
 		vcpu->arch.time = data;
-- 
2.5.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
  2015-09-18 15:54 [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
  2015-09-18 15:54 ` [PATCH 1/2] x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
  2015-09-18 15:54 ` [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR" Radim Krčmář
@ 2015-09-20 22:57 ` Marcelo Tosatti
  2015-09-21 15:12   ` Radim Krčmář
  2015-09-28 11:05 ` Paolo Bonzini
  3 siblings, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2015-09-20 22:57 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino

On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote:
> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
> RFC because I haven't explored many potential problems or tested it.

The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
haven't explored potential problems or tested it? Sorry can't parse it.

> 
> [1/2] uses a different algorithm in the guest to start counting from 0.
> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
> 
> A viable alternative would be to implement opt-in features in kvm clock.
> 
> And because we probably only broke one old user (the infamous SLES 10), a
> workaround like this is also possible: (but I'd rather not do that)

Please describe why SLES 10 breaks in detail: the state of the guest and
the host before the patch, the state of the guest and host after the
patch.

What does SLES10 expect?
Is it counting from zero that breaks SLES10? 

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a60bdbccff51..ae9049248aaf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2007,7 +2007,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  
>  			ka->boot_vcpu_runs_old_kvmclock = tmp;
>  
> -			ka->kvmclock_offset = -get_kernel_ns();
> +			if (!ka->boot_vcpu_runs_old_kvmclock)
> +				ka->kvmclock_offset = -get_kernel_ns();
>  		}
>  
>  		vcpu->arch.time = data;
> 
> 
> Radim Krčmář (2):
>   x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO
>   Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock
>     system MSR"
> 
>  arch/x86/include/asm/pvclock-abi.h |  1 +
>  arch/x86/kernel/kvmclock.c         | 46 +++++++++++++++++++++++++++++---------
>  arch/x86/kvm/x86.c                 |  4 ----
>  3 files changed, 36 insertions(+), 15 deletions(-)
> 
> -- 
> 2.5.2

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
  2015-09-20 22:57 ` [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO Marcelo Tosatti
@ 2015-09-21 15:12   ` Radim Krčmář
  2015-09-21 15:43     ` Radim Krčmář
  2015-09-21 15:52     ` Marcelo Tosatti
  0 siblings, 2 replies; 20+ messages in thread
From: Radim Krčmář @ 2015-09-21 15:12 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino

2015-09-20 19:57-0300, Marcelo Tosatti:
> On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote:
>> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
>> RFC because I haven't explored many potential problems or tested it.
> 
> The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
> haven't explored potential problems or tested it? Sorry can't parse it.
> 
>> 
>> [1/2] uses a different algorithm in the guest to start counting from 0.
>> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
>> 
>> A viable alternative would be to implement opt-in features in kvm clock.
>> 
>> And because we probably only broke one old user (the infamous SLES 10), a
>> workaround like this is also possible: (but I'd rather not do that)
> 
> Please describe why SLES 10 breaks in detail: the state of the guest and
> the host before the patch, the state of the guest and host after the
> patch.

1) The guest periodically receives an interrupt that is handled by
   main_timer_handler():
   a) get time using the kvm clock:
      1) write the address to MSR_KVM_SYSTEM_TIME
      2) read tsc and pvclock (tsc_offset, system_time)
      3) time = tsc - tsc_offset + system_time
   b) compute time since the last main_timer_handler()
   c) bump jiffies if enough time has elapsed
2) the guest wants to calibrate loops per jiffy [1]:
   a) read tsc
   b) loop till jiffies increase
   c) compute lpj

Because (1a1) always resets the system_time to 0, we read the same value
over and over so the condition for (1c) is never true and jiffies remain
constant.  This is the problem.  A hang happens in (2b) as it is the
first place that depends on jiffies.

> What does SLES10 expect?

That a write to MSR_KVM_SYSTEM_TIME does not reset the system time.

> Is it counting from zero that breaks SLES10?

Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
while still keeping system time;  we used to allow that, which means an
ABI breakage.  (And we can't even say that guest's behaviour is against
the spec ...)


---
1: I also did diassembly, but the reproducer is easier to paste
   (couldn't find debuginfo)
   # qemu-kvm -nographic -kernel vmlinuz-2.6.16.60-0.85.1-default \
    -serial stdio -monitor /dev/null -append 'console=ttyS0'
  
  and you can get a bit further when setting loops per jiffy manually,
    -serial stdio -monitor /dev/null -append 'console=ttyS0 lpj=12345678'

  The dmesg for failing run is
    Initializing CPU#0
    PID hash table entries: 512 (order: 9, 16384 bytes)
    kvm-clock: cpu 0, msr 0:3f6041, boot clock
    kvm_get_tsc_khz: cpu 0, msr 0:e001
    time.c: Using tsc for timekeeping HZ 250
    time.c: Using 100.000000 MHz WALL KVM GTOD KVM timer.
    time.c: Detected 2591.580 MHz processor.
    Console: colour VGA+ 80x25
    Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
    Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
    Checking aperture...
    Nosave address range: 000000000009f000 - 00000000000a0000
    Nosave address range: 00000000000a0000 - 00000000000f0000
    Nosave address range: 00000000000f0000 - 0000000000100000
    Memory: 124884k/130944k available (1856k kernel code, 5544k reserved, 812k data, 188k init)
    [Infinitely querying kvm clock here ...]

  With '-cpu kvm64,-kvmclock', the next line is
    Calibrating delay using timer specific routine.. 5199.75 BogoMIPS (lpj=10399519)

  With 'lpj=10399519',
    Calibrating delay loop (skipped)... 5199.75 BogoMIPS preset
    [Manages to get stuck later, in default_idle.]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
  2015-09-21 15:12   ` Radim Krčmář
@ 2015-09-21 15:43     ` Radim Krčmář
  2015-09-21 15:52     ` Marcelo Tosatti
  1 sibling, 0 replies; 20+ messages in thread
From: Radim Krčmář @ 2015-09-21 15:43 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino

2015-09-21 17:12+0200, Radim Krčmář:
> 2015-09-20 19:57-0300, Marcelo Tosatti:
> > On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote:
>>> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
>>> RFC because I haven't explored many potential problems or tested it.
>> 
>> The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
>> haven't explored potential problems or tested it? Sorry can't parse it.

I missed that one, sorry.
PVCLOCK_COUNTS_FROM_ZERO is disabled because it breaks ABI.
Not testing nor ruling out all problems is a justification for RFC.

(This patch series
 - will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and
 - is RFC because I haven't explored many potential problems or tested
   it [this patch series])

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
  2015-09-21 15:12   ` Radim Krčmář
  2015-09-21 15:43     ` Radim Krčmář
@ 2015-09-21 15:52     ` Marcelo Tosatti
  2015-09-21 20:00       ` Radim Krčmář
  1 sibling, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2015-09-21 15:52 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino

On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> 2015-09-20 19:57-0300, Marcelo Tosatti:
> > On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote:
> >> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
> >> RFC because I haven't explored many potential problems or tested it.
> > 
> > The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
> > haven't explored potential problems or tested it? Sorry can't parse it.
> > 
> >> 
> >> [1/2] uses a different algorithm in the guest to start counting from 0.
> >> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
> >> 
> >> A viable alternative would be to implement opt-in features in kvm clock.
> >> 
> >> And because we probably only broke one old user (the infamous SLES 10), a
> >> workaround like this is also possible: (but I'd rather not do that)
> > 
> > Please describe why SLES 10 breaks in detail: the state of the guest and
> > the host before the patch, the state of the guest and host after the
> > patch.
> 
> 1) The guest periodically receives an interrupt that is handled by
>    main_timer_handler():
>    a) get time using the kvm clock:
>       1) write the address to MSR_KVM_SYSTEM_TIME
>       2) read tsc and pvclock (tsc_offset, system_time)
>       3) time = tsc - tsc_offset + system_time
>    b) compute time since the last main_timer_handler()
>    c) bump jiffies if enough time has elapsed
> 2) the guest wants to calibrate loops per jiffy [1]:
>    a) read tsc
>    b) loop till jiffies increase
>    c) compute lpj
> 
> Because (1a1) always resets the system_time to 0, we read the same value
> over and over so the condition for (1c) is never true and jiffies remain
> constant.  This is the problem.  A hang happens in (2b) as it is the
> first place that depends on jiffies.
> 
> > What does SLES10 expect?
> 
> That a write to MSR_KVM_SYSTEM_TIME does not reset the system time.
> 
> > Is it counting from zero that breaks SLES10?
> 
> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> while still keeping system time;  we used to allow that, which means an
> ABI breakage.  (And we can't even say that guest's behaviour is against
> the spec ...)

Because this behaviour was not defined.

Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
boot_vcpu_runs_old_kvmclock == false? 
The patch would be much simpler.

The problem is, "selecting one read as the initial point" is inherently
racy: that delta is relative to one moment (kvmclock read) at one vcpu,
but must be applied to all vcpus.

Besides:

	1) Stable sched clock in guest does not depend on
	   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.
	2) You rely on monotonicity across vcpus to perform 
 	   the 'minus delta that was read on vcpu0' calculation, but 
	   monotonicity across vcpus can fail during runtime
           (say host clocksource goes tsc->hpet due to tsc instability).


> 
> 
> ---
> 1: I also did diassembly, but the reproducer is easier to paste
>    (couldn't find debuginfo)
>    # qemu-kvm -nographic -kernel vmlinuz-2.6.16.60-0.85.1-default \
>     -serial stdio -monitor /dev/null -append 'console=ttyS0'
>   
>   and you can get a bit further when setting loops per jiffy manually,
>     -serial stdio -monitor /dev/null -append 'console=ttyS0 lpj=12345678'
> 
>   The dmesg for failing run is
>     Initializing CPU#0
>     PID hash table entries: 512 (order: 9, 16384 bytes)
>     kvm-clock: cpu 0, msr 0:3f6041, boot clock
>     kvm_get_tsc_khz: cpu 0, msr 0:e001
>     time.c: Using tsc for timekeeping HZ 250
>     time.c: Using 100.000000 MHz WALL KVM GTOD KVM timer.
>     time.c: Detected 2591.580 MHz processor.
>     Console: colour VGA+ 80x25
>     Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
>     Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
>     Checking aperture...
>     Nosave address range: 000000000009f000 - 00000000000a0000
>     Nosave address range: 00000000000a0000 - 00000000000f0000
>     Nosave address range: 00000000000f0000 - 0000000000100000
>     Memory: 124884k/130944k available (1856k kernel code, 5544k reserved, 812k data, 188k init)
>     [Infinitely querying kvm clock here ...]
> 
>   With '-cpu kvm64,-kvmclock', the next line is
>     Calibrating delay using timer specific routine.. 5199.75 BogoMIPS (lpj=10399519)
> 
>   With 'lpj=10399519',
>     Calibrating delay loop (skipped)... 5199.75 BogoMIPS preset
>     [Manages to get stuck later, in default_idle.]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
  2015-09-21 15:52     ` Marcelo Tosatti
@ 2015-09-21 20:00       ` Radim Krčmář
  2015-09-21 20:53         ` Marcelo Tosatti
  0 siblings, 1 reply; 20+ messages in thread
From: Radim Krčmář @ 2015-09-21 20:00 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino

2015-09-21 12:52-0300, Marcelo Tosatti:
> On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
>> 2015-09-20 19:57-0300, Marcelo Tosatti:
>>> Is it counting from zero that breaks SLES10?
>> 
>> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
>> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
>> while still keeping system time;  we used to allow that, which means an
>> ABI breakage.  (And we can't even say that guest's behaviour is against
>> the spec ...)
> 
> Because this behaviour was not defined.

It is defined by implementation.

> Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> boot_vcpu_runs_old_kvmclock == false? 
> The patch would be much simpler.


If you mean the hunk in cover letter, I don't like it because we presume
that no other guests were broken.

I really don't like it so I thought about other problems with
PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
It doesn't work well ;)

We don't want to guess what the guest wants so I'd go for the opt-in
paravirt feature unless counting from zero can be done in guest alone.

> The problem is, "selecting one read as the initial point" is inherently
> racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> but must be applied to all vcpus.

I don't think that is a problem.

Kvmclock has a notion of a global system_time in nanoseconds (one value
that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
to propagate system_time into guest VCPUs as precisely as possible with
the help of TSC.

sched_clock uses kvmclock to get nanoseconds since the system was
brought up and [1/2] only works with this abstracted ns count.
A poorly synchronized initial read is irrelevant because all VCPUs will
be using the same constant offset.
(We can never know the precise time anyway.)

> Besides:
> 
> 	1) Stable sched clock in guest does not depend on
> 	   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.

Yes, thanks, I will remove that requirement in v1.  (We'd need to
improve a loss of PVCLOCK_TSC_STABLE_BIT otherwise anyway.)

Because the clutchy dependency on PVCLOCK_TSC_STABLE_BIT is going away,
there is now one possible unsigned overflow: in case the clock was very
imprecise and VCPU1 managed to get smaller system_time than VCPU0 at the
time of initialization.  It's very unlikely that we'll ever reach legal
overflow so I can add a condition there.

> 	2) You rely on monotonicity across vcpus to perform 
> 	   the 'minus delta that was read on vcpu0' calculation, but 
> 	   monotonicity across vcpus can fail during runtime
>            (say host clocksource goes tsc->hpet due to tsc instability).

That could be a problem, but I'm adding a VCPU independent constant to
all reads -- does the new code rely on monoticity in places where the
old one didn't?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
  2015-09-21 20:00       ` Radim Krčmář
@ 2015-09-21 20:53         ` Marcelo Tosatti
  2015-09-21 22:00           ` Radim Krčmář
  0 siblings, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2015-09-21 20:53 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino

On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote:
> 2015-09-21 12:52-0300, Marcelo Tosatti:
> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
> >>> Is it counting from zero that breaks SLES10?
> >> 
> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> >> while still keeping system time;  we used to allow that, which means an
> >> ABI breakage.  (And we can't even say that guest's behaviour is against
> >> the spec ...)
> > 
> > Because this behaviour was not defined.
> 
> It is defined by implementation.
> 
> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> > boot_vcpu_runs_old_kvmclock == false? 
> > The patch would be much simpler.
> 
> 
> If you mean the hunk in cover letter, I don't like it because we presume
> that no other guests were broken.

Yes patch 1.

Do you have an example of another guest that is broken? 

Really, assuming things about the value of the msr read when you
write to the MSR is very awkward. That SuSE code is incorrect, it fails
on other occasions as well (see
54750f2cf042c42b4223d67b1bd20138464bde0e).

> I really don't like it 

Because it changes behaviour that was previously unspecified?

> so I thought about other problems with
> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?

Can't unplug VCPU 0 i think.

> It doesn't work well ;)

Can you hot-unplug vcpu 0? 

> We don't want to guess what the guest wants so I'd go for the opt-in
> paravirt feature unless counting from zero can be done in guest alone.

The problem it is tricky (to do the counting inside the guest).

Its much cleaner and less complicated if the host starts counting from
zero.

> 
> > The problem is, "selecting one read as the initial point" is inherently
> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> > but must be applied to all vcpus.
> 
> I don't think that is a problem.
> 
> Kvmclock has a notion of a global system_time in nanoseconds (one value
> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
> to propagate system_time into guest VCPUs as precisely as possible with
> the help of TSC.

Different pairs of values (system_time, tsc reads) are fundamentally
broken. This is why 

commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
    x86, paravirt: Add a global synchronization point for pvclock

Exists.

Then to guarantee monotonicity you need to stop those updates
(or perform the pair update in lock step):

commit d828199e84447795c6669ff0e6c6d55eb9beeff6
    KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag


> sched_clock uses kvmclock to get nanoseconds since the system was
> brought up and [1/2] only works with this abstracted ns count.
> A poorly synchronized initial read is irrelevant because all VCPUs will
> be using the same constant offset.
> (We can never know the precise time anyway.)
> 
> > Besides:
> > 
> > 	1) Stable sched clock in guest does not depend on
> > 	   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.
> 
> Yes, thanks, I will remove that requirement in v1.  (We'd need to
> improve a loss of PVCLOCK_TSC_STABLE_BIT otherwise anyway.)
> 
> Because the clutchy dependency on PVCLOCK_TSC_STABLE_BIT is going away,
> there is now one possible unsigned overflow: in case the clock was very
> imprecise and VCPU1 managed to get smaller system_time than VCPU0 at the
> time of initialization.  It's very unlikely that we'll ever reach legal
> overflow so I can add a condition there.
> 
> > 	2) You rely on monotonicity across vcpus to perform 
> > 	   the 'minus delta that was read on vcpu0' calculation, but 
> > 	   monotonicity across vcpus can fail during runtime
> >            (say host clocksource goes tsc->hpet due to tsc instability).
> 
> That could be a problem, but I'm adding a VCPU independent constant to
> all reads -- does the new code rely on monoticity in places where the
> old one didn't?

The problem is overflow:
kvmclock non-monotonic across vcpus AND delta subtraction:

ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock
1	     1			   1 		        
2            2		           2			1		
3	     3			   0			2           -1
4	     4			   1			3	     0
5	     5			   2			4	     1
6	     6			   3			5	     2
7	     7			   4			6	     3

ptime is "physical time".

I prefer that the host counts from zero (there are less problems to
handle).

Again, that SuSE patch is incorrect and a huge hack.

An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
requests the feature.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
  2015-09-21 20:53         ` Marcelo Tosatti
@ 2015-09-21 22:00           ` Radim Krčmář
  2015-09-21 22:37             ` Marcelo Tosatti
  0 siblings, 1 reply; 20+ messages in thread
From: Radim Krčmář @ 2015-09-21 22:00 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino

2015-09-21 17:53-0300, Marcelo Tosatti:
> On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote:
>> 2015-09-21 12:52-0300, Marcelo Tosatti:
>> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
>> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
>> >>> Is it counting from zero that breaks SLES10?
>> >> 
>> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
>> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
>> >> while still keeping system time;  we used to allow that, which means an
>> >> ABI breakage.  (And we can't even say that guest's behaviour is against
>> >> the spec ...)
>> > 
>> > Because this behaviour was not defined.
>> 
>> It is defined by implementation.
>> 
>> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
>> > boot_vcpu_runs_old_kvmclock == false? 
>> > The patch would be much simpler.
>> 
>> If you mean the hunk in cover letter, I don't like it because we presume
>> that no other guests were broken.
> 
> Yes patch 1.

(I'd call them: a hunk in cover letter [0/2], patch 1 = [1/2], and
 patch 2 = [2/2].)

> Do you have an example of another guest that is broken? 

Yes, the hot-plug in any relatively recent Linux guest.

> Really, assuming things about the value of the msr read when you
> write to the MSR is very awkward. That SuSE code is incorrect, it fails
> on other occasions as well (see
> 54750f2cf042c42b4223d67b1bd20138464bde0e).

Yeah, I even read the SUSE implementation, sad times.

>> I really don't like it 
> 
> Because it changes behaviour that was previously unspecified?

No, because it adds another exemption to already ugly code.
(Talking about the hunk in [0/2].)

>> so I thought about other problems with
>> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
> 
> Can't unplug VCPU 0 i think.

You can.

>> It doesn't work well ;)
> 
> Can you hot-unplug vcpu 0? 

I can, I did, hot-plug bugged.

>> > The problem is, "selecting one read as the initial point" is inherently
>> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
>> > but must be applied to all vcpus.
>> 
>> I don't think that is a problem.
>> 
>> Kvmclock has a notion of a global system_time in nanoseconds (one value
>> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
>> to propagate system_time into guest VCPUs as precisely as possible with
>> the help of TSC.
>
>Different pairs of values (system_time, tsc reads) are fundamentally
>broken. This is why 
>
>commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
>    x86, paravirt: Add a global synchronization point for pvclock
>
>Exists.
>
>Then to guarantee monotonicity you need to stop those updates
>(or perform the pair update in lock step):
>
>commit d828199e84447795c6669ff0e6c6d55eb9beeff6
>    KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag

(Thanks for the commits, split values are the core design that avoids
 modifying rdtsc, so I'll be grad to read its details.)

>> > 	2) You rely on monotonicity across vcpus to perform 
>> > 	   the 'minus delta that was read on vcpu0' calculation, but 
>> > 	   monotonicity across vcpus can fail during runtime
>> >            (say host clocksource goes tsc->hpet due to tsc instability).
>> 
>> That could be a problem, but I'm adding a VCPU independent constant to
>> all reads -- does the new code rely on monoticity in places where the
>> old one didn't?
> 
> The problem is overflow:
> kvmclock non-monotonic across vcpus AND delta subtraction:
> 
> ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock
> 1          1                      1                         
> 2          2                      2                        1                

KVM sched clock not used before this point (I even moved the code to
make sure), so there is no problem with monotonicity.

> 3          3                      0                        2            -1
                                                                          ^^^
-1 is the overflow.  Very unlikely to ever happen in Linux, as we now
boot other VCPUs much later than the KVM clock configuration and it can
only happen if VCPU1 is running as the reconfiguration takes place, but
a simple

  if (vcpu[x].sched_clock <= global_sched_clock_offset)
  	return 0;

would take care of it.  The time would stand still for a while, which is
not a huge problem for boot-only scenario.  Other VCPUs couldn't be
reading KVM sched before it was configured, so only operations that
happen before vcpu1sched_clock goes positive are affected.
(We have other problems if the window can be long.)

> 4          4                      1                        3             0
> 5          5                      2                        4             1
> 6          6                      3                        5             2
> 7          7                      4                        6             3
> 
> ptime is "physical time".
> 
> I prefer that the host counts from zero (there are less problems to
> handle).

The main advantage of a hypervisor solution for me is one saved
subtraction and comparison on every sched_clock().

> Again, that SuSE patch is incorrect and a huge hack.

I'm not disputing that.  (Which doesn't justify the breakage.)

> An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
> requests the feature.

Yes, and the alternative needs new paravirt interface.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
  2015-09-21 22:00           ` Radim Krčmář
@ 2015-09-21 22:37             ` Marcelo Tosatti
  2015-09-22  0:40               ` Marcelo Tosatti
                                 ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2015-09-21 22:37 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino

On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim Krčmář wrote:
> 2015-09-21 17:53-0300, Marcelo Tosatti:
> > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote:
> >> 2015-09-21 12:52-0300, Marcelo Tosatti:
> >> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> >> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
> >> >>> Is it counting from zero that breaks SLES10?
> >> >> 
> >> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> >> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> >> >> while still keeping system time;  we used to allow that, which means an
> >> >> ABI breakage.  (And we can't even say that guest's behaviour is against
> >> >> the spec ...)
> >> > 
> >> > Because this behaviour was not defined.
> >> 
> >> It is defined by implementation.
> >> 
> >> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> >> > boot_vcpu_runs_old_kvmclock == false? 
> >> > The patch would be much simpler.
> >> 
> >> If you mean the hunk in cover letter, I don't like it because we presume
> >> that no other guests were broken.
> > 
> > Yes patch 1.
> 
> (I'd call them: a hunk in cover letter [0/2], patch 1 = [1/2], and
>  patch 2 = [2/2].)
> 
> > Do you have an example of another guest that is broken? 
> 
> Yes, the hot-plug in any relatively recent Linux guest.
> 
> > Really, assuming things about the value of the msr read when you
> > write to the MSR is very awkward. That SuSE code is incorrect, it fails
> > on other occasions as well (see
> > 54750f2cf042c42b4223d67b1bd20138464bde0e).
> 
> Yeah, I even read the SUSE implementation, sad times.
> 
> >> I really don't like it 
> > 
> > Because it changes behaviour that was previously unspecified?
> 
> No, because it adds another exemption to already ugly code.
> (Talking about the hunk in [0/2].)
> 
> >> so I thought about other problems with
> >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
> > 
> > Can't unplug VCPU 0 i think.
> 
> You can.
> 
> >> It doesn't work well ;)
> > 
> > Can you hot-unplug vcpu 0? 
> 
> I can, I did, hot-plug bugged.

Due to PVCLOCK_COUNTS_FROM_ZERO patch?  

> >> > The problem is, "selecting one read as the initial point" is inherently
> >> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> >> > but must be applied to all vcpus.
> >> 
> >> I don't think that is a problem.
> >> 
> >> Kvmclock has a notion of a global system_time in nanoseconds (one value
> >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
> >> to propagate system_time into guest VCPUs as precisely as possible with
> >> the help of TSC.
> >
> >Different pairs of values (system_time, tsc reads) are fundamentally
> >broken. This is why 
> >
> >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
> >    x86, paravirt: Add a global synchronization point for pvclock
> >
> >Exists.
> >
> >Then to guarantee monotonicity you need to stop those updates
> >(or perform the pair update in lock step):
> >
> >commit d828199e84447795c6669ff0e6c6d55eb9beeff6
> >    KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
> 
> (Thanks for the commits, split values are the core design that avoids
>  modifying rdtsc, 

Which is broken (thats the point).

> so I'll be grad to read its details.)
> 
> >> > 	2) You rely on monotonicity across vcpus to perform 
> >> > 	   the 'minus delta that was read on vcpu0' calculation, but 
> >> > 	   monotonicity across vcpus can fail during runtime
> >> >            (say host clocksource goes tsc->hpet due to tsc instability).
> >> 
> >> That could be a problem, but I'm adding a VCPU independent constant to
> >> all reads -- does the new code rely on monoticity in places where the
> >> old one didn't?
> > 
> > The problem is overflow:
> > kvmclock non-monotonic across vcpus AND delta subtraction:
> > 
> > ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock
> > 1          1                      1                         
> > 2          2                      2                        1                
> 
> KVM sched clock not used before this point (I even moved the code to
> make sure), so there is no problem with monotonicity.
> 
> > 3          3                      0                        2            -1
>                                                                           ^^^
> -1 is the overflow.  Very unlikely to ever happen in Linux, as we now
> boot other VCPUs much later than the KVM clock configuration and it can
> only happen if VCPU1 is running as the reconfiguration takes place, but
> a simple
> 
>   if (vcpu[x].sched_clock <= global_sched_clock_offset)
>   	return 0;
> 
> would take care of it.  The time would stand still for a while, which is
> not a huge problem for boot-only scenario.  

Look at 

commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
    x86, paravirt: Add a global synchronization point for pvclock

And think of what an overflow does.

Note: i tried your solution before (to add offset) and saw this issue
in practice.

> Other VCPUs couldn't be
> reading KVM sched before it was configured, so only operations that
> happen before vcpu1sched_clock goes positive are affected.
> (We have other problems if the window can be long.)

The point where vcpu1sched_clock is negative is after kvmclock is
initialized.

> 
> > 4          4                      1                        3             0
> > 5          5                      2                        4             1
> > 6          6                      3                        5             2
> > 7          7                      4                        6             3
> > 
> > ptime is "physical time".
> > 
> > I prefer that the host counts from zero (there are less problems to
> > handle).
> 
> The main advantage of a hypervisor solution for me is one saved
> subtraction and comparison on every sched_clock().

To me are such complications as handling overflows.

> > Again, that SuSE patch is incorrect and a huge hack.
> 
> I'm not disputing that.  (Which doesn't justify the breakage.)
> 
> > An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
> > requests the feature.
> 
> Yes, and the alternative needs new paravirt interface.

So either:

Proceed with guest solution:
-> Make sure the overflow can't happen (and write down why not in the
code). Don't assume a small delta between kvmclock values of vcpus.
-> Handle stable -> non-stable kvmclock transition.
-> kvmclock counts from zero should not depend on stable kvmclock
(because nohz_full should work on hpet host systems).

Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
-> Figure out whats wrong with different kvmclock values on hotplug, 
and fix it.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
  2015-09-21 22:37             ` Marcelo Tosatti
@ 2015-09-22  0:40               ` Marcelo Tosatti
  2015-09-22 14:33               ` Radim Krčmář
  2015-09-22 14:46               ` Radim Krčmář
  2 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2015-09-22  0:40 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino

> So either:
> 
> Proceed with guest solution:
> -> Make sure the overflow can't happen (and write down why not in the
> code). Don't assume a small delta between kvmclock values of vcpus.
> -> Handle stable -> non-stable kvmclock transition.
> -> kvmclock counts from zero should not depend on stable kvmclock
> (because nohz_full should work on hpet host systems).
> 
> Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
> -> Figure out whats wrong with different kvmclock values on hotplug, 
> and fix it.

Find data which allows you to differentiate between hotplug of pCPU-0
and system initialization.

Easy one: whether MSR_KVM_SYSTEM_TIME_NEW contains valid data (that is
kvmclock is enabled) for vCPUs other than vCPU-0.

This can't be the case on system initialization (otherwise the host will
be corrupting guest memory), and must be the case when hotplugging
vCPU-0.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
  2015-09-21 22:37             ` Marcelo Tosatti
  2015-09-22  0:40               ` Marcelo Tosatti
@ 2015-09-22 14:33               ` Radim Krčmář
  2015-09-22 14:46               ` Radim Krčmář
  2 siblings, 0 replies; 20+ messages in thread
From: Radim Krčmář @ 2015-09-22 14:33 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino

2015-09-21 19:37-0300, Marcelo Tosatti:
> On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim Krčmář wrote:
>> 2015-09-21 17:53-0300, Marcelo Tosatti:
>> > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote:
>> >> so I thought about other problems with
>> >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
>> >> It doesn't work well ;)
>> > 
>> > Can you hot-unplug vcpu 0? 
>> 
>> I can, I did, hot-plug bugged.
> 
> Due to PVCLOCK_COUNTS_FROM_ZERO patch?  

Yes, VCPU 0 writes MSR_KVM_SYSTEM_TIME_NEW to access the kvm clock, but
that very operation resets system_time.  The result is a long sleep,
because at least Linux doesn't handle that shift.
(Just in case, works fine after reverting the one host patch.)

With more thinking, PVCLOCK_COUNTS_FROM_ZERO also breaks suspend and
resume as we write the MSR there.  Testing shows that is isn't as bad as
the hotplug, because `sleep 1` returns in a second and `date` is fine,
but having multiple 0 points in `dmesg` isn't very nice either.

There are too many problems with PVCLOCK_COUNTS_FROM_ZERO, I'll send the
revert with cc:stable soon.  (Without any guest changes to sched_clock.)

>> >> > The problem is, "selecting one read as the initial point" is inherently
>> >> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
>> >> > but must be applied to all vcpus.
>> >> 
>> >> I don't think that is a problem.
>> >> 
>> >> Kvmclock has a notion of a global system_time in nanoseconds (one value
>> >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
>> >> to propagate system_time into guest VCPUs as precisely as possible with
>> >> the help of TSC.
>> >
>> >Different pairs of values (system_time, tsc reads) are fundamentally
>> >broken. This is why 
>> >
>> >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
>> >    x86, paravirt: Add a global synchronization point for pvclock
>> >
>> >Exists.
>> >
>> >Then to guarantee monotonicity you need to stop those updates
>> >(or perform the pair update in lock step):
>> >
>> >commit d828199e84447795c6669ff0e6c6d55eb9beeff6
>> >    KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
>> 
>> (Thanks for the commits, split values are the core design that avoids
>>  modifying rdtsc, 
> 
> Which is broken (thats the point).

Yes, but I don't think we could use tsc any better without having a
correct guest's time in the host.

We chose to provide kvmclock without a host_tsc -> host_ns function in
the host.  It is just impossible to give the guest a precise (tsc, ns)
tuple if the host is not using TSC for its clock.  (We don't have
control over the machine and using two reads of time can't be precise.)

>> > 3          3                      0                        2            -1
>>                                                                           ^^^
>> -1 is the overflow.  Very unlikely to ever happen in Linux, as we now
>> boot other VCPUs much later than the KVM clock configuration and it can
>> only happen if VCPU1 is running as the reconfiguration takes place, but
>> a simple
>> 
>>   if (vcpu[x].sched_clock <= global_sched_clock_offset)
>>   	return 0;
>> 
>> would take care of it.  The time would stand still for a while, which is
>> not a huge problem for boot-only scenario.  
> 
> Look at 
> 
> commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
>     x86, paravirt: Add a global synchronization point for pvclock

This patch shows the place where hotplug likely fails -- system_time
goes back to 0 so time is frozen until system_time reaches the original
value, again.
It also makes the result of kvm_clock_read monotonic across all vcpus,
so [1/2] doesn't need to add code to handle initial overflow.

> And think of what an overflow does.

If the clock was slightly negative (unsigned) and then overflowed,
sched_clock would freeze because it would never reach that high value
again.  Offset in [1/2] is applied later and can't reach negative.

> Note: i tried your solution before (to add offset) and saw this issue
> in practice.

Great, thanks, I will thoroughly test it to see where it fails.

>> Other VCPUs couldn't be
>> reading KVM sched before it was configured, so only operations that
>> happen before vcpu1sched_clock goes positive are affected.
>> (We have other problems if the window can be long.)
> 
> The point where vcpu1sched_clock is negative is after kvmclock is
> initialized.

Existing code takes care of that -- vcpu0 reads a value before any user
of sched_clock could have so later reads must be larger.

>> > An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
>> > requests the feature.
>> 
>> Yes, and the alternative needs new paravirt interface.
> 
> So either:
> 
> Proceed with guest solution:
> -> Make sure the overflow can't happen (and write down why not in the
> code). Don't assume a small delta between kvmclock values of vcpus.

I will comment that 489fb490dbf8 ("x86, paravirt: Add a global
synchronization point for pvclock") gives all guarantees that prevent
the overflow on secondary VCPU bringup.

> -> Handle stable -> non-stable kvmclock transition.

No need to do anything here either.  It will behave like before.

> -> kvmclock counts from zero should not depend on stable kvmclock
> (because nohz_full should work on hpet host systems).

Yes, I will do that.

> Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
> -> Figure out whats wrong with different kvmclock values on hotplug, 
> and fix it.

It's wrong to reset the time unconditionally on an operation that can be
used anytime (and sometimes must be used even if we don't want to reset
the time), revert is the best fix.

We can do guest-only or paravirtualized zeroing.
Paravirtualized zeroing has more complicated code, but simpler context.
I mainly don't like the idea of adding an interface for a feature the
guest can implement at the same level of quality.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
  2015-09-21 22:37             ` Marcelo Tosatti
  2015-09-22  0:40               ` Marcelo Tosatti
  2015-09-22 14:33               ` Radim Krčmář
@ 2015-09-22 14:46               ` Radim Krčmář
  2 siblings, 0 replies; 20+ messages in thread
From: Radim Krčmář @ 2015-09-22 14:46 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino

[My mailbox filled between yours two emails, so I have pasted the latter
 one from archives and replied to a wrong one.]

2015-09-21 21:42-0300, Marcelo Tosatti:
> 2015-09-21 19:37-0300, Marcelo Tosatti:
>> Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
>> -> Figure out whats wrong with different kvmclock values on hotplug, 
>> and fix it.
> 
> Find data which allows you to differentiate between hotplug of pCPU-0
> and system initialization.

This kind of programminng is exactly what I don't like and won't be
doing.  We should at least be removing previous mistakes when adding new
ones.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO
  2015-09-18 15:54 ` [PATCH 1/2] x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
@ 2015-09-22 19:01   ` Marcelo Tosatti
  2015-09-28 14:10   ` Paolo Bonzini
  1 sibling, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2015-09-22 19:01 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino

On Fri, Sep 18, 2015 at 05:54:29PM +0200, Radim Krčmář wrote:
> Newer KVM won't be exposing PVCLOCK_COUNTS_FROM_ZERO anymore.
> The purpose of that flags was to start counting system time from 0 when
> the KVM clock has been initialized.
> We can achieve the same by selecting one read as the initial point.
> 
> A simple subtraction will work unless the KVM clock count overflows
> earlier (has smaller width) than scheduler's cycle count.  We should be
> safe till x86_128.
> 
> Because PVCLOCK_COUNTS_FROM_ZERO was enabled only on new hypervisors,
> setting sched clock as stable based on PVCLOCK_TSC_STABLE_BIT might
> regress on older ones.
> 
> I presume we don't need to change kvm_clock_read instead of introducing
> kvm_sched_clock_read.  A problem could arise in case sched_clock is
> expected to return the same value as get_cycles, but we should have
> merged those clocks in that case.
> 
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> ---
>  arch/x86/kernel/kvmclock.c | 46 +++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 35 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 2c7aafa70702..ef5b3d2cecce 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -32,6 +32,7 @@
>  static int kvmclock = 1;
>  static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
>  static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
> +static cycle_t kvm_sched_clock_offset;
>  
>  static int parse_no_kvmclock(char *arg)
>  {
> @@ -92,6 +93,29 @@ static cycle_t kvm_clock_get_cycles(struct clocksource *cs)
>  	return kvm_clock_read();
>  }
>  
> +static cycle_t kvm_sched_clock_read(void)
> +{
> +	return kvm_clock_read() - kvm_sched_clock_offset;
> +}
> +
> +static inline void kvm_sched_clock_init(bool stable)
> +{
> +	if (!stable) {
> +		pv_time_ops.sched_clock = kvm_clock_read;
> +		return;
> +	}
> +
> +	kvm_sched_clock_offset = kvm_clock_read();
> +	pv_time_ops.sched_clock = kvm_sched_clock_read;
> +	set_sched_clock_stable();
> +
> +	printk("kvm-clock: using sched offset of %llu cycles\n",
> +			kvm_sched_clock_offset);
> +
> +	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
> +	         sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
> +}
> +
>  /*
>   * If we don't do that, there is the possibility that the guest
>   * will calibrate under heavy load - thus, getting a lower lpj -
> @@ -248,7 +272,17 @@ void __init kvmclock_init(void)
>  		memblock_free(mem, size);
>  		return;
>  	}
> -	pv_time_ops.sched_clock = kvm_clock_read;
> +
> +	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
> +		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
> +
> +	cpu = get_cpu();
> +	vcpu_time = &hv_clock[cpu].pvti;
> +	flags = pvclock_read_flags(vcpu_time);
> +
> +	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> +	put_cpu();
> +
>  	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
>  	x86_platform.get_wallclock = kvm_get_wallclock;
>  	x86_platform.set_wallclock = kvm_set_wallclock;
> @@ -265,16 +299,6 @@ void __init kvmclock_init(void)
>  	kvm_get_preset_lpj();
>  	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
>  	pv_info.name = "KVM";
> -
> -	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
> -		pvclock_set_flags(~0);
> -
> -	cpu = get_cpu();
> -	vcpu_time = &hv_clock[cpu].pvti;
> -	flags = pvclock_read_flags(vcpu_time);
> -	if (flags & PVCLOCK_COUNTS_FROM_ZERO)
> -		set_sched_clock_stable();
> -	put_cpu();
>  }
>  
>  int __init kvm_setup_vsyscall_timeinfo(void)
> -- 
> 2.5.2

ACK


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
  2015-09-18 15:54 ` [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR" Radim Krčmář
@ 2015-09-22 19:01   ` Marcelo Tosatti
  2015-09-22 19:52     ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2015-09-22 19:01 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino

On Fri, Sep 18, 2015 at 05:54:30PM +0200, Radim Krčmář wrote:
> Shifting pvclock_vcpu_time_info.system_time on write to KVM system time
> MSR is a change of ABI.  Probably only 2.6.16 based SLES 10 breaks due
> to its custom enhancements to kvmclock, but KVM never declared the MSR
> only for one-shot initialization.  (Doc says that only one write is
> needed.)
> 
> This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5.
> And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.
> 
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> ---
>  arch/x86/include/asm/pvclock-abi.h | 1 +
>  arch/x86/kvm/x86.c                 | 4 ----
>  2 files changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
> index 655e07a48f6c..67f08230103a 100644
> --- a/arch/x86/include/asm/pvclock-abi.h
> +++ b/arch/x86/include/asm/pvclock-abi.h
> @@ -41,6 +41,7 @@ struct pvclock_wall_clock {
>  
>  #define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
>  #define PVCLOCK_GUEST_STOPPED	(1 << 1)
> +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. */
>  #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_X86_PVCLOCK_ABI_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 18d59b584dee..34d33f4757d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1707,8 +1707,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  		vcpu->pvclock_set_guest_stopped_request = false;
>  	}
>  
> -	pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
> -
>  	/* If the host uses TSC clocksource, then it is stable */
>  	if (use_master_clock)
>  		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
> @@ -2006,8 +2004,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  					&vcpu->requests);
>  
>  			ka->boot_vcpu_runs_old_kvmclock = tmp;
> -
> -			ka->kvmclock_offset = -get_kernel_ns();
>  		}
>  
>  		vcpu->arch.time = data;
> -- 
> 2.5.2

ACK


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
  2015-09-22 19:01   ` Marcelo Tosatti
@ 2015-09-22 19:52     ` Paolo Bonzini
  2015-09-22 20:23       ` Marcelo Tosatti
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2015-09-22 19:52 UTC (permalink / raw)
  To: Marcelo Tosatti, Radim Krčmář
  Cc: linux-kernel, kvm, Luiz Capitulino



On 22/09/2015 21:01, Marcelo Tosatti wrote:
> On Fri, Sep 18, 2015 at 05:54:30PM +0200, Radim Krčmář wrote:
>> Shifting pvclock_vcpu_time_info.system_time on write to KVM system time
>> MSR is a change of ABI.  Probably only 2.6.16 based SLES 10 breaks due
>> to its custom enhancements to kvmclock, but KVM never declared the MSR
>> only for one-shot initialization.  (Doc says that only one write is
>> needed.)
>>
>> This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5.
>> And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.
>>
>> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
>> ---
>>  arch/x86/include/asm/pvclock-abi.h | 1 +
>>  arch/x86/kvm/x86.c                 | 4 ----
>>  2 files changed, 1 insertion(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
>> index 655e07a48f6c..67f08230103a 100644
>> --- a/arch/x86/include/asm/pvclock-abi.h
>> +++ b/arch/x86/include/asm/pvclock-abi.h
>> @@ -41,6 +41,7 @@ struct pvclock_wall_clock {
>>  
>>  #define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
>>  #define PVCLOCK_GUEST_STOPPED	(1 << 1)
>> +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. */
>>  #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
>>  #endif /* __ASSEMBLY__ */
>>  #endif /* _ASM_X86_PVCLOCK_ABI_H */
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 18d59b584dee..34d33f4757d2 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1707,8 +1707,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>>  		vcpu->pvclock_set_guest_stopped_request = false;
>>  	}
>>  
>> -	pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
>> -
>>  	/* If the host uses TSC clocksource, then it is stable */
>>  	if (use_master_clock)
>>  		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
>> @@ -2006,8 +2004,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>  					&vcpu->requests);
>>  
>>  			ka->boot_vcpu_runs_old_kvmclock = tmp;
>> -
>> -			ka->kvmclock_offset = -get_kernel_ns();
>>  		}
>>  
>>  		vcpu->arch.time = data;
>> -- 
>> 2.5.2
> 
> ACK

So I suppose you changed your mind :) but can you explain the reasoning?

Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
  2015-09-22 19:52     ` Paolo Bonzini
@ 2015-09-22 20:23       ` Marcelo Tosatti
  0 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2015-09-22 20:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář, linux-kernel, kvm, Luiz Capitulino

On Tue, Sep 22, 2015 at 09:52:49PM +0200, Paolo Bonzini wrote:
> 
> 
> On 22/09/2015 21:01, Marcelo Tosatti wrote:
> > On Fri, Sep 18, 2015 at 05:54:30PM +0200, Radim Krčmář wrote:
> >> Shifting pvclock_vcpu_time_info.system_time on write to KVM system time
> >> MSR is a change of ABI.  Probably only 2.6.16 based SLES 10 breaks due
> >> to its custom enhancements to kvmclock, but KVM never declared the MSR
> >> only for one-shot initialization.  (Doc says that only one write is
> >> needed.)
> >>
> >> This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5.
> >> And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.
> >>
> >> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> >> ---
> >>  arch/x86/include/asm/pvclock-abi.h | 1 +
> >>  arch/x86/kvm/x86.c                 | 4 ----
> >>  2 files changed, 1 insertion(+), 4 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
> >> index 655e07a48f6c..67f08230103a 100644
> >> --- a/arch/x86/include/asm/pvclock-abi.h
> >> +++ b/arch/x86/include/asm/pvclock-abi.h
> >> @@ -41,6 +41,7 @@ struct pvclock_wall_clock {
> >>  
> >>  #define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
> >>  #define PVCLOCK_GUEST_STOPPED	(1 << 1)
> >> +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. */
> >>  #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
> >>  #endif /* __ASSEMBLY__ */
> >>  #endif /* _ASM_X86_PVCLOCK_ABI_H */
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index 18d59b584dee..34d33f4757d2 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -1707,8 +1707,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> >>  		vcpu->pvclock_set_guest_stopped_request = false;
> >>  	}
> >>  
> >> -	pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
> >> -
> >>  	/* If the host uses TSC clocksource, then it is stable */
> >>  	if (use_master_clock)
> >>  		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
> >> @@ -2006,8 +2004,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >>  					&vcpu->requests);
> >>  
> >>  			ka->boot_vcpu_runs_old_kvmclock = tmp;
> >> -
> >> -			ka->kvmclock_offset = -get_kernel_ns();
> >>  		}
> >>  
> >>  		vcpu->arch.time = data;
> >> -- 
> >> 2.5.2
> > 
> > ACK
> 
> So I suppose you changed your mind :) but can you explain the reasoning?
> 
> Paolo

The patch is correct. Overflow (issue raised) is only an issue 
without PVCLOCK_TSC_STABLE_BIT.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
  2015-09-18 15:54 [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
                   ` (2 preceding siblings ...)
  2015-09-20 22:57 ` [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO Marcelo Tosatti
@ 2015-09-28 11:05 ` Paolo Bonzini
  3 siblings, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2015-09-28 11:05 UTC (permalink / raw)
  To: Radim Krčmář, linux-kernel
  Cc: kvm, Marcelo Tosatti, Luiz Capitulino



On 18/09/2015 17:54, Radim Krčmář wrote:
> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
> RFC because I haven't explored many potential problems or tested it.
> 
> [1/2] uses a different algorithm in the guest to start counting from 0.
> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
> 
> A viable alternative would be to implement opt-in features in kvm clock.
> 
> And because we probably only broke one old user (the infamous SLES 10), a
> workaround like this is also possible: (but I'd rather not do that)

Thanks,

applying 2/2 for 4.4 and 1/2 for 4.3.

Paolo

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a60bdbccff51..ae9049248aaf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2007,7 +2007,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  
>  			ka->boot_vcpu_runs_old_kvmclock = tmp;
>  
> -			ka->kvmclock_offset = -get_kernel_ns();
> +			if (!ka->boot_vcpu_runs_old_kvmclock)
> +				ka->kvmclock_offset = -get_kernel_ns();
>  		}
>  
>  		vcpu->arch.time = data;
> 
> 
> Radim Krčmář (2):
>   x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO
>   Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock
>     system MSR"
> 
>  arch/x86/include/asm/pvclock-abi.h |  1 +
>  arch/x86/kernel/kvmclock.c         | 46 +++++++++++++++++++++++++++++---------
>  arch/x86/kvm/x86.c                 |  4 ----
>  3 files changed, 36 insertions(+), 15 deletions(-)
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO
  2015-09-18 15:54 ` [PATCH 1/2] x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
  2015-09-22 19:01   ` Marcelo Tosatti
@ 2015-09-28 14:10   ` Paolo Bonzini
  1 sibling, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2015-09-28 14:10 UTC (permalink / raw)
  To: Radim Krčmář, linux-kernel
  Cc: kvm, Marcelo Tosatti, Luiz Capitulino



On 18/09/2015 17:54, Radim Krčmář wrote:
> +	kvm_sched_clock_offset = kvm_clock_read();
> +	pv_time_ops.sched_clock = kvm_sched_clock_read;
> +	set_sched_clock_stable();
> +
> +	printk("kvm-clock: using sched offset of %llu cycles\n",

Ok to change this to KERN_DEBUG or KERN_INFO?

Paolo

> +			kvm_sched_clock_offset);
> +

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2015-09-28 14:10 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-18 15:54 [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
2015-09-18 15:54 ` [PATCH 1/2] x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
2015-09-22 19:01   ` Marcelo Tosatti
2015-09-28 14:10   ` Paolo Bonzini
2015-09-18 15:54 ` [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR" Radim Krčmář
2015-09-22 19:01   ` Marcelo Tosatti
2015-09-22 19:52     ` Paolo Bonzini
2015-09-22 20:23       ` Marcelo Tosatti
2015-09-20 22:57 ` [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO Marcelo Tosatti
2015-09-21 15:12   ` Radim Krčmář
2015-09-21 15:43     ` Radim Krčmář
2015-09-21 15:52     ` Marcelo Tosatti
2015-09-21 20:00       ` Radim Krčmář
2015-09-21 20:53         ` Marcelo Tosatti
2015-09-21 22:00           ` Radim Krčmář
2015-09-21 22:37             ` Marcelo Tosatti
2015-09-22  0:40               ` Marcelo Tosatti
2015-09-22 14:33               ` Radim Krčmář
2015-09-22 14:46               ` Radim Krčmář
2015-09-28 11:05 ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).