All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] KVM: x86: hyper-v: Fix TSC page update after KVM_SET_CLOCK(0) call
@ 2021-03-29 11:47 Vitaly Kuznetsov
  2021-03-29 11:47 ` [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters() Vitaly Kuznetsov
  2021-03-29 11:48 ` [PATCH v2 2/2] selftests: kvm: Check that TSC page value is small after KVM_SET_CLOCK(0) Vitaly Kuznetsov
  0 siblings, 2 replies; 13+ messages in thread
From: Vitaly Kuznetsov @ 2021-03-29 11:47 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Marcelo Tosatti

Changes since v1:
- Fix the issue by casting 'hv_clock->system_time' to s64 in 
 compute_tsc_page_parameters() instead of clamping its value to zero in
 kvm_guest_time_update() [Paolo]

Original description:

I discovered that after KVM_SET_CLOCK(0) TSC page value in the guest can
go through the roof and apparently we have a signedness issue when the
update is performed. Fix the issue and add a selftest.

Vitaly Kuznetsov (2):
  KVM: x86: hyper-v: Properly divide maybe-negative
    'hv_clock->system_time' in compute_tsc_page_parameters()
  selftests: kvm: Check that TSC page value is small after
    KVM_SET_CLOCK(0)

 arch/x86/kvm/hyperv.c                             |  9 ++++++---
 tools/testing/selftests/kvm/x86_64/hyperv_clock.c | 13 +++++++++++--
 2 files changed, 17 insertions(+), 5 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters()
  2021-03-29 11:47 [PATCH v2 0/2] KVM: x86: hyper-v: Fix TSC page update after KVM_SET_CLOCK(0) call Vitaly Kuznetsov
@ 2021-03-29 11:47 ` Vitaly Kuznetsov
  2021-03-29 17:24   ` Sean Christopherson
  2021-03-30 13:12   ` Marcelo Tosatti
  2021-03-29 11:48 ` [PATCH v2 2/2] selftests: kvm: Check that TSC page value is small after KVM_SET_CLOCK(0) Vitaly Kuznetsov
  1 sibling, 2 replies; 13+ messages in thread
From: Vitaly Kuznetsov @ 2021-03-29 11:47 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Marcelo Tosatti

When guest time is reset with KVM_SET_CLOCK(0), it is possible for
hv_clock->system_time to become a small negative number. This happens
because in KVM_SET_CLOCK handling we set kvm->arch.kvmclock_offset based
on get_kvmclock_ns(kvm) but when KVM_REQ_CLOCK_UPDATE is handled,
kvm_guest_time_update() does

hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset;

And 'master_kernel_ns' represents the last time when masterclock
got updated, it can precede KVM_SET_CLOCK() call. Normally, this is not a
problem, the difference is very small, e.g. I'm observing
hv_clock.system_time = -70 ns. The issue comes from the fact that
'hv_clock.system_time' is stored as unsigned and 'system_time / 100' in
compute_tsc_page_parameters() becomes a very big number.

Use div_s64() to get the proper result when dividing maybe-negative
'hv_clock.system_time' by 100.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/hyperv.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index f98370a39936..0529b892f634 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1070,10 +1070,13 @@ static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock,
 				hv_clock->tsc_to_system_mul,
 				100);
 
-	tsc_ref->tsc_offset = hv_clock->system_time;
-	do_div(tsc_ref->tsc_offset, 100);
-	tsc_ref->tsc_offset -=
+	/*
+	 * Note: 'hv_clock->system_time' despite being 'u64' can hold a negative
+	 * value here, thus div_s64().
+	 */
+	tsc_ref->tsc_offset = div_s64(hv_clock->system_time, 100) -
 		mul_u64_u64_shr(hv_clock->tsc_timestamp, tsc_ref->tsc_scale, 64);
+
 	return true;
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 2/2] selftests: kvm: Check that TSC page value is small after KVM_SET_CLOCK(0)
  2021-03-29 11:47 [PATCH v2 0/2] KVM: x86: hyper-v: Fix TSC page update after KVM_SET_CLOCK(0) call Vitaly Kuznetsov
  2021-03-29 11:47 ` [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters() Vitaly Kuznetsov
@ 2021-03-29 11:48 ` Vitaly Kuznetsov
  1 sibling, 0 replies; 13+ messages in thread
From: Vitaly Kuznetsov @ 2021-03-29 11:48 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Marcelo Tosatti

Add a test for the issue when KVM_SET_CLOCK(0) call could cause
TSC page value to go very big because of a signedness issue around
hv_clock->system_time.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 tools/testing/selftests/kvm/x86_64/hyperv_clock.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86_64/hyperv_clock.c b/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
index ffbc4555c6e2..7f1d2765572c 100644
--- a/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
+++ b/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
@@ -80,19 +80,24 @@ static inline void check_tsc_msr_rdtsc(void)
 	GUEST_ASSERT(delta_ns * 100 < (t2 - t1) * 100);
 }
 
+static inline u64 get_tscpage_ts(struct ms_hyperv_tsc_page *tsc_page)
+{
+	return mul_u64_u64_shr64(rdtsc(), tsc_page->tsc_scale) + tsc_page->tsc_offset;
+}
+
 static inline void check_tsc_msr_tsc_page(struct ms_hyperv_tsc_page *tsc_page)
 {
 	u64 r1, r2, t1, t2;
 
 	/* Compare TSC page clocksource with HV_X64_MSR_TIME_REF_COUNT */
-	t1 = mul_u64_u64_shr64(rdtsc(), tsc_page->tsc_scale) + tsc_page->tsc_offset;
+	t1 = get_tscpage_ts(tsc_page);
 	r1 = rdmsr(HV_X64_MSR_TIME_REF_COUNT);
 
 	/* 10 ms tolerance */
 	GUEST_ASSERT(r1 >= t1 && r1 - t1 < 100000);
 	nop_loop();
 
-	t2 = mul_u64_u64_shr64(rdtsc(), tsc_page->tsc_scale) + tsc_page->tsc_offset;
+	t2 = get_tscpage_ts(tsc_page);
 	r2 = rdmsr(HV_X64_MSR_TIME_REF_COUNT);
 	GUEST_ASSERT(r2 >= t1 && r2 - t2 < 100000);
 }
@@ -130,7 +135,11 @@ static void guest_main(struct ms_hyperv_tsc_page *tsc_page, vm_paddr_t tsc_page_
 
 	tsc_offset = tsc_page->tsc_offset;
 	/* Call KVM_SET_CLOCK from userspace, check that TSC page was updated */
+
 	GUEST_SYNC(7);
+	/* Sanity check TSC page timestamp, it should be close to 0 */
+	GUEST_ASSERT(get_tscpage_ts(tsc_page) < 100000);
+
 	GUEST_ASSERT(tsc_page->tsc_offset != tsc_offset);
 
 	nop_loop();
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters()
  2021-03-29 11:47 ` [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters() Vitaly Kuznetsov
@ 2021-03-29 17:24   ` Sean Christopherson
  2021-03-30 10:21     ` Vitaly Kuznetsov
  2021-03-30 14:12     ` Marcelo Tosatti
  2021-03-30 13:12   ` Marcelo Tosatti
  1 sibling, 2 replies; 13+ messages in thread
From: Sean Christopherson @ 2021-03-29 17:24 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson, Marcelo Tosatti

On Mon, Mar 29, 2021, Vitaly Kuznetsov wrote:
> When guest time is reset with KVM_SET_CLOCK(0), it is possible for
> hv_clock->system_time to become a small negative number. This happens
> because in KVM_SET_CLOCK handling we set kvm->arch.kvmclock_offset based
> on get_kvmclock_ns(kvm) but when KVM_REQ_CLOCK_UPDATE is handled,
> kvm_guest_time_update() does
> 
> hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset;
> 
> And 'master_kernel_ns' represents the last time when masterclock
> got updated, it can precede KVM_SET_CLOCK() call. Normally, this is not a
> problem, the difference is very small, e.g. I'm observing
> hv_clock.system_time = -70 ns. The issue comes from the fact that
> 'hv_clock.system_time' is stored as unsigned and 'system_time / 100' in
> compute_tsc_page_parameters() becomes a very big number.
> 
> Use div_s64() to get the proper result when dividing maybe-negative
> 'hv_clock.system_time' by 100.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/hyperv.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index f98370a39936..0529b892f634 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1070,10 +1070,13 @@ static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock,
>  				hv_clock->tsc_to_system_mul,
>  				100);
>  
> -	tsc_ref->tsc_offset = hv_clock->system_time;
> -	do_div(tsc_ref->tsc_offset, 100);
> -	tsc_ref->tsc_offset -=
> +	/*
> +	 * Note: 'hv_clock->system_time' despite being 'u64' can hold a negative
> +	 * value here, thus div_s64().
> +	 */

Will anything break if hv_clock.system_time is made a s64?

> +	tsc_ref->tsc_offset = div_s64(hv_clock->system_time, 100) -
>  		mul_u64_u64_shr(hv_clock->tsc_timestamp, tsc_ref->tsc_scale, 64);
> +
>  	return true;
>  }
>  
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters()
  2021-03-29 17:24   ` Sean Christopherson
@ 2021-03-30 10:21     ` Vitaly Kuznetsov
  2021-03-30 14:12     ` Marcelo Tosatti
  1 sibling, 0 replies; 13+ messages in thread
From: Vitaly Kuznetsov @ 2021-03-30 10:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson, Marcelo Tosatti

Sean Christopherson <seanjc@google.com> writes:

> On Mon, Mar 29, 2021, Vitaly Kuznetsov wrote:
>> When guest time is reset with KVM_SET_CLOCK(0), it is possible for
>> hv_clock->system_time to become a small negative number. This happens
>> because in KVM_SET_CLOCK handling we set kvm->arch.kvmclock_offset based
>> on get_kvmclock_ns(kvm) but when KVM_REQ_CLOCK_UPDATE is handled,
>> kvm_guest_time_update() does
>> 
>> hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset;
>> 
>> And 'master_kernel_ns' represents the last time when masterclock
>> got updated, it can precede KVM_SET_CLOCK() call. Normally, this is not a
>> problem, the difference is very small, e.g. I'm observing
>> hv_clock.system_time = -70 ns. The issue comes from the fact that
>> 'hv_clock.system_time' is stored as unsigned and 'system_time / 100' in
>> compute_tsc_page_parameters() becomes a very big number.
>> 
>> Use div_s64() to get the proper result when dividing maybe-negative
>> 'hv_clock.system_time' by 100.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/kvm/hyperv.c | 9 ++++++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> index f98370a39936..0529b892f634 100644
>> --- a/arch/x86/kvm/hyperv.c
>> +++ b/arch/x86/kvm/hyperv.c
>> @@ -1070,10 +1070,13 @@ static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock,
>>  				hv_clock->tsc_to_system_mul,
>>  				100);
>>  
>> -	tsc_ref->tsc_offset = hv_clock->system_time;
>> -	do_div(tsc_ref->tsc_offset, 100);
>> -	tsc_ref->tsc_offset -=
>> +	/*
>> +	 * Note: 'hv_clock->system_time' despite being 'u64' can hold a negative
>> +	 * value here, thus div_s64().
>> +	 */
>
> Will anything break if hv_clock.system_time is made a s64?
>

I think no, but pvclock-abi.h says:

"
 * These structs MUST NOT be changed.
 * They are the ABI between hypervisor and guest OS.
 * Both Xen and KVM are using this.
"

so I was kind of scared to suggest a change...

-- 
Vitaly


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters()
  2021-03-29 11:47 ` [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters() Vitaly Kuznetsov
  2021-03-29 17:24   ` Sean Christopherson
@ 2021-03-30 13:12   ` Marcelo Tosatti
  2021-03-30 14:44     ` Vitaly Kuznetsov
  1 sibling, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2021-03-30 13:12 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson


Hi Vitaly,

On Mon, Mar 29, 2021 at 01:47:59PM +0200, Vitaly Kuznetsov wrote:
> When guest time is reset with KVM_SET_CLOCK(0), it is possible for
> hv_clock->system_time to become a small negative number. This happens
> because in KVM_SET_CLOCK handling we set kvm->arch.kvmclock_offset based
> on get_kvmclock_ns(kvm) but when KVM_REQ_CLOCK_UPDATE is handled,
> kvm_guest_time_update() does
> 
> hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset;

Hum, masterclock update should always precede KVM_SET_CLOCK() call.

So when KVM_SET_CLOCK(0) is called, we'd like the guest (or the
following):

static __always_inline
u64 __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, u64 tsc)
{
        u64 delta = tsc - src->tsc_timestamp;
        u64 offset = pvclock_scale_delta(delta, src->tsc_to_system_mul,
                                             src->tsc_shift);
        return src->system_time + offset;
}

To return 0 (which won't happen if pvclock_data->system_time is being
initialized to a negative value).

	KVM_SET_CLOCK(0)

 	kvm_gen_update_masterclock(kvm);
	now_ns = get_kvmclock_ns(kvm);
	kvm->arch.kvmclock_offset += user_ns.clock - now_ns = -now_ns; 

	hv_clock.tsc_timestamp = ka->master_cycle_now;
        hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;

Wonder if the negative value happens due to the switch from
masterclock=off -> on (get_kvmclock_ns reading kernel time directly).

Perhaps this error is being added to clock when migration is performed.

But just thinking out loud...

> And 'master_kernel_ns' represents the last time when masterclock
> got updated, it can precede KVM_SET_CLOCK() call. Normally, this is not a
> problem, the difference is very small, e.g. I'm observing
> hv_clock.system_time = -70 ns. The issue comes from the fact that
> 'hv_clock.system_time' is stored as unsigned and 'system_time / 100' in
> compute_tsc_page_parameters() becomes a very big number.

Which it is (a very big number).

> Use div_s64() to get the proper result when dividing maybe-negative
> 'hv_clock.system_time' by 100.

Well hv_clock.system_time is not negative. It is positive.

> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/hyperv.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index f98370a39936..0529b892f634 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1070,10 +1070,13 @@ static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock,
>  				hv_clock->tsc_to_system_mul,
>  				100);
>  
> -	tsc_ref->tsc_offset = hv_clock->system_time;
> -	do_div(tsc_ref->tsc_offset, 100);
> -	tsc_ref->tsc_offset -=
> +	/*
> +	 * Note: 'hv_clock->system_time' despite being 'u64' can hold a negative
> +	 * value here, thus div_s64().
> +	 */
> +	tsc_ref->tsc_offset = div_s64(hv_clock->system_time, 100) -
>  		mul_u64_u64_shr(hv_clock->tsc_timestamp, tsc_ref->tsc_scale, 64);
> +
>  	return true;
>  }

Won't the guest see:

0.1, 0.005, 0.0025, 0.0001, ..., 0, 0.0001, 0.0025, 0.005, 0.1, ...

As system_time progresses from negative value to positive value with
the above fix?

While the fix seems OK in practice, perhaps the negative system_time 
could be fixed instead.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters()
  2021-03-29 17:24   ` Sean Christopherson
  2021-03-30 10:21     ` Vitaly Kuznetsov
@ 2021-03-30 14:12     ` Marcelo Tosatti
  1 sibling, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2021-03-30 14:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson

On Mon, Mar 29, 2021 at 05:24:13PM +0000, Sean Christopherson wrote:
> On Mon, Mar 29, 2021, Vitaly Kuznetsov wrote:
> > When guest time is reset with KVM_SET_CLOCK(0), it is possible for
> > hv_clock->system_time to become a small negative number. This happens
> > because in KVM_SET_CLOCK handling we set kvm->arch.kvmclock_offset based
> > on get_kvmclock_ns(kvm) but when KVM_REQ_CLOCK_UPDATE is handled,
> > kvm_guest_time_update() does
> > 
> > hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset;
> > 
> > And 'master_kernel_ns' represents the last time when masterclock
> > got updated, it can precede KVM_SET_CLOCK() call. Normally, this is not a
> > problem, the difference is very small, e.g. I'm observing
> > hv_clock.system_time = -70 ns. The issue comes from the fact that
> > 'hv_clock.system_time' is stored as unsigned and 'system_time / 100' in
> > compute_tsc_page_parameters() becomes a very big number.
> > 
> > Use div_s64() to get the proper result when dividing maybe-negative
> > 'hv_clock.system_time' by 100.
> > 
> > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > ---
> >  arch/x86/kvm/hyperv.c | 9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > index f98370a39936..0529b892f634 100644
> > --- a/arch/x86/kvm/hyperv.c
> > +++ b/arch/x86/kvm/hyperv.c
> > @@ -1070,10 +1070,13 @@ static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock,
> >  				hv_clock->tsc_to_system_mul,
> >  				100);
> >  
> > -	tsc_ref->tsc_offset = hv_clock->system_time;
> > -	do_div(tsc_ref->tsc_offset, 100);
> > -	tsc_ref->tsc_offset -=
> > +	/*
> > +	 * Note: 'hv_clock->system_time' despite being 'u64' can hold a negative
> > +	 * value here, thus div_s64().
> > +	 */
> 
> Will anything break if hv_clock.system_time is made a s64?

IMHO hv_clock.system_time represents an unsigned value:

        system_time:
                a host notion of monotonic time, including sleep
                time at the time this structure was last updated. Unit is
                nanoseconds.


Delta between values is not transmitted through this variable, 
so unclear what negative values would mean.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters()
  2021-03-30 13:12   ` Marcelo Tosatti
@ 2021-03-30 14:44     ` Vitaly Kuznetsov
  2021-03-30 15:44       ` Paolo Bonzini
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Kuznetsov @ 2021-03-30 14:44 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson

Marcelo Tosatti <mtosatti@redhat.com> writes:

> Hi Vitaly,
>
> On Mon, Mar 29, 2021 at 01:47:59PM +0200, Vitaly Kuznetsov wrote:
>> When guest time is reset with KVM_SET_CLOCK(0), it is possible for
>> hv_clock->system_time to become a small negative number. This happens
>> because in KVM_SET_CLOCK handling we set kvm->arch.kvmclock_offset based
>> on get_kvmclock_ns(kvm) but when KVM_REQ_CLOCK_UPDATE is handled,
>> kvm_guest_time_update() does
>> 
>> hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset;
>
> Hum, masterclock update should always precede KVM_SET_CLOCK() call.
>
> So when KVM_SET_CLOCK(0) is called, we'd like the guest (or the
> following):
>
> static __always_inline
> u64 __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, u64 tsc)
> {
>         u64 delta = tsc - src->tsc_timestamp;
>         u64 offset = pvclock_scale_delta(delta, src->tsc_to_system_mul,
>                                              src->tsc_shift);
>         return src->system_time + offset;
> }
>
> To return 0 (which won't happen if pvclock_data->system_time is being
> initialized to a negative value).
>
> 	KVM_SET_CLOCK(0)
>
>  	kvm_gen_update_masterclock(kvm);
> 	now_ns = get_kvmclock_ns(kvm);
> 	kvm->arch.kvmclock_offset += user_ns.clock - now_ns = -now_ns; 
>
> 	hv_clock.tsc_timestamp = ka->master_cycle_now;
>         hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
>
> Wonder if the negative value happens due to the switch from
> masterclock=off -> on (get_kvmclock_ns reading kernel time directly).
>
> Perhaps this error is being added to clock when migration is performed.
>

The issue is very easy to reproduce if you take the selftest (PATCH2
from this series), it is just a simple KVM_SET_CLOCK(0) call which makes
Hyper-V TSC page value very very big.

The order of events is:

KVM_SET_CLOCK(0) is performed and we set 

	        now_ns = get_kvmclock_ns(kvm);
                kvm->arch.kvmclock_offset += user_ns.clock - now_ns;

user_ns.clock is zero here. Note, we use get_kvmclock_ns() which gives
the precise kvmclock value for the moment , to simplify things a bit:

 kvm->arch.kvmclock_offset = kvmclock_offset - master_kernel_ns -
 kvmclock_offset - pvclock_scale_delta() == -master_kernel_ns - pvclock_scale_delta()
 
pvclock_scale_delta() is time since the begining of the time
interval.

Now we call kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE) and
this leads to kvm_guest_time_update(). In kvm_guest_time_update() we do:

	if (use_master_clock) {
                host_tsc = ka->master_cycle_now;
                kernel_ns = ka->master_kernel_ns;
        }
        ...
        vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;

Note, we don't add the delta since the beginning of the time interval
(assuming we are still on the same interval) so combined with the
above:

vcpu->hv_clock.system_time = master_kernel_ns - master_kernel_ns -
pvclock_scale_delta() == -pvclock_scale_delta()

and this is certainly a negative number. A very small one, but it's
negative.

We could've solved the problem by reducing the precision a bit and
instead of doing 

 now_ns = get_kvmclock_ns(kvm);

in KVM_SET_CLOCK() handling we could do

 now_ns = ka->master_kernel_ns

and that would make vcpu->hv_clock.system_time == 0 after
kvm_guest_time_update() but it'd hurt 'normal' use-cases to fix the
corner case.

> But just thinking out loud...
>
>> And 'master_kernel_ns' represents the last time when masterclock
>> got updated, it can precede KVM_SET_CLOCK() call. Normally, this is not a
>> problem, the difference is very small, e.g. I'm observing
>> hv_clock.system_time = -70 ns. The issue comes from the fact that
>> 'hv_clock.system_time' is stored as unsigned and 'system_time / 100' in
>> compute_tsc_page_parameters() becomes a very big number.
>
> Which it is (a very big number).
>
>> Use div_s64() to get the proper result when dividing maybe-negative
>> 'hv_clock.system_time' by 100.
>
> Well hv_clock.system_time is not negative. It is positive.
>
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/kvm/hyperv.c | 9 ++++++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> index f98370a39936..0529b892f634 100644
>> --- a/arch/x86/kvm/hyperv.c
>> +++ b/arch/x86/kvm/hyperv.c
>> @@ -1070,10 +1070,13 @@ static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock,
>>  				hv_clock->tsc_to_system_mul,
>>  				100);
>>  
>> -	tsc_ref->tsc_offset = hv_clock->system_time;
>> -	do_div(tsc_ref->tsc_offset, 100);
>> -	tsc_ref->tsc_offset -=
>> +	/*
>> +	 * Note: 'hv_clock->system_time' despite being 'u64' can hold a negative
>> +	 * value here, thus div_s64().
>> +	 */
>> +	tsc_ref->tsc_offset = div_s64(hv_clock->system_time, 100) -
>>  		mul_u64_u64_shr(hv_clock->tsc_timestamp, tsc_ref->tsc_scale, 64);
>> +
>>  	return true;
>>  }
>
> Won't the guest see:
>
> 0.1, 0.005, 0.0025, 0.0001, ..., 0, 0.0001, 0.0025, 0.005, 0.1, ...
>
> As system_time progresses from negative value to positive value with
> the above fix?
>
> While the fix seems OK in practice, perhaps the negative system_time 
> could be fixed instead.

Well, that was v1 of the patch where I suggested to clamp system_time
value to 0 in kvm_guest_time_update() but Paolo talked me into this v2
:-)

-- 
Vitaly


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters()
  2021-03-30 14:44     ` Vitaly Kuznetsov
@ 2021-03-30 15:44       ` Paolo Bonzini
  2021-03-31  6:29         ` Vitaly Kuznetsov
  0 siblings, 1 reply; 13+ messages in thread
From: Paolo Bonzini @ 2021-03-30 15:44 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Marcelo Tosatti
  Cc: kvm, Sean Christopherson, Wanpeng Li, Jim Mattson

On 30/03/21 16:44, Vitaly Kuznetsov wrote:
> We could've solved the problem by reducing the precision a bit and
> instead of doing
> 
>   now_ns = get_kvmclock_ns(kvm);
> 
> in KVM_SET_CLOCK() handling we could do
> 
>   now_ns = ka->master_kernel_ns
> 
> and that would make vcpu->hv_clock.system_time == 0 after
> kvm_guest_time_update() but it'd hurt 'normal' use-cases to fix the
> corner case.

Marcelo is right, and I guess instead of a negative system time you 
*should* have a slightly larger tsc_timestamp that corresponds to a zero 
system_time.  E.g. instead of -70 ns in the system time you'd have 210 
more clock cycles in the tsc_timestamp and 0 in the system time.

In the end it's impossible to achieve absolute precision; does the 
KVM_SET_CLOCK value refer to the nanosecond value before entering the 
kernel, or after, or what?  The difference is much more than these 70 
ns.  So what you propose above seems to be really the best solution 
within the constraints of the KVM_SET_CLOCK API (a better API, which 
Maxim had started working on and I should revisit, would pass a 
TSC+nanosecond pair).

On the other hand you'd have to take into account what happens if the 
masterclock is not in use, which would make things a bit more complex 
than what you sketched.  If guests probably do not look at the 
system_time and just add it blindly to the result, then treating 
system_time as signed as in v2 is the easiest.

Paolo


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters()
  2021-03-30 15:44       ` Paolo Bonzini
@ 2021-03-31  6:29         ` Vitaly Kuznetsov
  2021-03-31  6:52           ` Paolo Bonzini
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Kuznetsov @ 2021-03-31  6:29 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Sean Christopherson, Wanpeng Li, Jim Mattson, Marcelo Tosatti

Paolo Bonzini <pbonzini@redhat.com> writes:

> On 30/03/21 16:44, Vitaly Kuznetsov wrote:
>> We could've solved the problem by reducing the precision a bit and
>> instead of doing
>> 
>>   now_ns = get_kvmclock_ns(kvm);
>> 
>> in KVM_SET_CLOCK() handling we could do
>> 
>>   now_ns = ka->master_kernel_ns
>> 
>> and that would make vcpu->hv_clock.system_time == 0 after
>> kvm_guest_time_update() but it'd hurt 'normal' use-cases to fix the
>> corner case.
>
> Marcelo is right, and I guess instead of a negative system time you 
> *should* have a slightly larger tsc_timestamp that corresponds to a zero 
> system_time.  E.g. instead of -70 ns in the system time you'd have 210 
> more clock cycles in the tsc_timestamp and 0 in the system time.
>
> In the end it's impossible to achieve absolute precision; does the 
> KVM_SET_CLOCK value refer to the nanosecond value before entering the 
> kernel, or after, or what?  The difference is much more than these 70 
> ns.  So what you propose above seems to be really the best solution 
> within the constraints of the KVM_SET_CLOCK API (a better API, which 
> Maxim had started working on and I should revisit, would pass a 
> TSC+nanosecond pair).

I'm leaning towards making a v3 and depending on how complex it's going
to look like we can decide which way to go.

>
> On the other hand you'd have to take into account what happens if the 
> masterclock is not in use, which would make things a bit more complex 
> than what you sketched. 

(I really wish we establish a plan to get rid of !masterclock logic some
day ... )

> If guests probably do not look at the 
> system_time and just add it blindly to the result, then treating 
> system_time as signed as in v2 is the easiest.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters()
  2021-03-31  6:29         ` Vitaly Kuznetsov
@ 2021-03-31  6:52           ` Paolo Bonzini
  2021-03-31  9:59             ` Vitaly Kuznetsov
  0 siblings, 1 reply; 13+ messages in thread
From: Paolo Bonzini @ 2021-03-31  6:52 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, Sean Christopherson, Wanpeng Li, Jim Mattson, Marcelo Tosatti

On 31/03/21 08:29, Vitaly Kuznetsov wrote:
> I'm leaning towards making a v3 and depending on how complex it's going
> to look like we can decide which way to go.

As you prefer, but I would have no problem with committing v2 for now. 
 From the point of view of system_time being a signed delta your v2 is 
not a regression.

Paolo


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters()
  2021-03-31  6:52           ` Paolo Bonzini
@ 2021-03-31  9:59             ` Vitaly Kuznetsov
  2021-03-31 10:58               ` Paolo Bonzini
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Kuznetsov @ 2021-03-31  9:59 UTC (permalink / raw)
  To: Paolo Bonzini, Marcelo Tosatti
  Cc: kvm, Sean Christopherson, Wanpeng Li, Jim Mattson

Paolo Bonzini <pbonzini@redhat.com> writes:

> On 31/03/21 08:29, Vitaly Kuznetsov wrote:
>> I'm leaning towards making a v3 and depending on how complex it's going
>> to look like we can decide which way to go.
>
> As you prefer, but I would have no problem with committing v2 for now. 
>  From the point of view of system_time being a signed delta your v2 is 
> not a regression.

Ok, I convinced myself this is correct:

@@ -5744,8 +5745,22 @@ long kvm_arch_vm_ioctl(struct file *filp,
                 * pvclock_update_vm_gtod_copy().
                 */
                kvm_gen_update_masterclock(kvm);
-               now_ns = get_kvmclock_ns(kvm);
-               kvm->arch.kvmclock_offset += user_ns.clock - now_ns;
+
+               /*
+                * This pairs with kvm_guest_time_update(): when masterclock is
+                * in use, we use master_kernel_ns + kvmclock_offset to set
+                * unsigned 'system_time' so if we use get_kvmclock_ns() (which
+                * is slightly ahead) here we risk going negative on unsigned
+                * 'system_time' when 'user_ns.clock' is very small.
+                */
+               spin_lock(&ka->pvclock_gtod_sync_lock);
+               if (kvm->arch.use_master_clock)
+                       now_ns = ka->master_kernel_ns;
+               else
+                       now_ns = get_kvmclock_base_ns();
+               ka->kvmclock_offset = user_ns.clock - now_ns;
+               spin_unlock(&ka->pvclock_gtod_sync_lock);
+
                kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
                break;
        }

In !masterclock case kvm_guest_time_update() uses get_kvmclock_base_ns()
(and get_kvmclock_ns() is just get_kvmclock_base_ns() + ka->kvmclock_offset)
so we're good. Also, it looks like a good idea to put the whole
calculation under spinlock here.

Personally, I like this a little bit more than treating 'system_time' as
signed, what do you guys think?

-- 
Vitaly


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters()
  2021-03-31  9:59             ` Vitaly Kuznetsov
@ 2021-03-31 10:58               ` Paolo Bonzini
  0 siblings, 0 replies; 13+ messages in thread
From: Paolo Bonzini @ 2021-03-31 10:58 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Marcelo Tosatti
  Cc: kvm, Sean Christopherson, Wanpeng Li, Jim Mattson

On 31/03/21 11:59, Vitaly Kuznetsov wrote:
> Paolo Bonzini <pbonzini@redhat.com> writes:
>> On 31/03/21 08:29, Vitaly Kuznetsov wrote:
>>> I'm leaning towards making a v3 and depending on how complex it's going
>>> to look like we can decide which way to go.
> 
> Ok, I convinced myself this is correct:
> 
> @@ -5744,8 +5745,22 @@ long kvm_arch_vm_ioctl(struct file *filp,
>                   * pvclock_update_vm_gtod_copy().
>                   */
>                  kvm_gen_update_masterclock(kvm);
> -               now_ns = get_kvmclock_ns(kvm);
> -               kvm->arch.kvmclock_offset += user_ns.clock - now_ns;
> +
> +               /*
> +                * This pairs with kvm_guest_time_update(): when masterclock is
> +                * in use, we use master_kernel_ns + kvmclock_offset to set
> +                * unsigned 'system_time' so if we use get_kvmclock_ns() (which
> +                * is slightly ahead) here we risk going negative on unsigned
> +                * 'system_time' when 'user_ns.clock' is very small.
> +                */
> +               spin_lock(&ka->pvclock_gtod_sync_lock);
> +               if (kvm->arch.use_master_clock)
> +                       now_ns = ka->master_kernel_ns;
> +               else
> +                       now_ns = get_kvmclock_base_ns();
> +               ka->kvmclock_offset = user_ns.clock - now_ns;
> +               spin_unlock(&ka->pvclock_gtod_sync_lock);
> +
>                  kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
>                  break;
>          }
> 
> In !masterclock case kvm_guest_time_update() uses get_kvmclock_base_ns()
> (and get_kvmclock_ns() is just get_kvmclock_base_ns() + ka->kvmclock_offset)
> so we're good. Also, it looks like a good idea to put the whole
> calculation under spinlock here.

Yes, sounds good.

Note that I posted a series that changes pvclock_gtod_sync_lock to use 
spin_lock_irqsave/spin_unlock_irqrestore, so please base your patch on 
those and switch to spin_lock_irq.

Paolo


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-03-31 10:59 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-29 11:47 [PATCH v2 0/2] KVM: x86: hyper-v: Fix TSC page update after KVM_SET_CLOCK(0) call Vitaly Kuznetsov
2021-03-29 11:47 ` [PATCH v2 1/2] KVM: x86: hyper-v: Properly divide maybe-negative 'hv_clock->system_time' in compute_tsc_page_parameters() Vitaly Kuznetsov
2021-03-29 17:24   ` Sean Christopherson
2021-03-30 10:21     ` Vitaly Kuznetsov
2021-03-30 14:12     ` Marcelo Tosatti
2021-03-30 13:12   ` Marcelo Tosatti
2021-03-30 14:44     ` Vitaly Kuznetsov
2021-03-30 15:44       ` Paolo Bonzini
2021-03-31  6:29         ` Vitaly Kuznetsov
2021-03-31  6:52           ` Paolo Bonzini
2021-03-31  9:59             ` Vitaly Kuznetsov
2021-03-31 10:58               ` Paolo Bonzini
2021-03-29 11:48 ` [PATCH v2 2/2] selftests: kvm: Check that TSC page value is small after KVM_SET_CLOCK(0) Vitaly Kuznetsov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.