* [PATCH] x86/kvm: fix condition to update kvm master clocks
@ 2016-05-26 14:49 Roman Kagan
  2016-05-26 20:19 ` Radim Krčmář
  2016-05-29 23:38 ` Marcelo Tosatti
  0 siblings, 2 replies; 20+ messages in thread
From: Roman Kagan @ 2016-05-26 14:49 UTC (permalink / raw)
  To: kvm
  Cc: Denis V. Lunev, Roman Kagan, Owen Hofmann, Paolo Bonzini,
	Marcelo Tosatti

The condition to schedule per-VM master clock updates is inverted; as a
result the master clocks are never updated, and the kvm_clock drift WRT the
host's NTP-disciplined clock (which was the motivation to introduce this
machinery in the first place) still remains.

Fix it, and reword the comment to make it more apparent what the desired
behavior is.
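
The flipped comparison can be checked in isolation.  The sketch below is a
standalone userspace model of the predicate, not kernel code; the enum
values and the guest counter are stand-ins for the kernel's VCLOCK_*
constants and kvm_guest_has_master_clock:

```c
#include <assert.h>
#include <stdbool.h>

enum { VCLOCK_NONE = 0, VCLOCK_TSC = 1 };

/* Buggy predicate: queues the update work only when the host is NOT
 * using the TSC clocksource. */
static bool should_queue_buggy(int vclock_mode, int guests)
{
	return vclock_mode != VCLOCK_TSC && guests != 0;
}

/* Fixed predicate: queues the update work when the host uses TSC and
 * at least one VM has a master clock to refresh. */
static bool should_queue_fixed(int vclock_mode, int guests)
{
	return vclock_mode == VCLOCK_TSC && guests != 0;
}
```

With the buggy check, a TSC host with live guests never queues
pvclock_gtod_work, which is exactly the drift described above.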

Cc: Owen Hofmann <osh@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
---
 arch/x86/kvm/x86.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c805cf4..d8f591c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5810,10 +5810,10 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
 
 	update_pvclock_gtod(tk);
 
-	/* disable master clock if host does not trust, or does not
-	 * use, TSC clocksource
+	/* only schedule per-VM master clock updates if the host uses TSC and
+	 * there's at least one VM in need of an update
 	 */
-	if (gtod->clock.vclock_mode != VCLOCK_TSC &&
+	if (gtod->clock.vclock_mode == VCLOCK_TSC &&
 	    atomic_read(&kvm_guest_has_master_clock) != 0)
 		queue_work(system_long_wq, &pvclock_gtod_work);
 
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-05-26 14:49 [PATCH] x86/kvm: fix condition to update kvm master clocks Roman Kagan
@ 2016-05-26 20:19 ` Radim Krčmář
  2016-05-27 17:28   ` Roman Kagan
  2016-05-29 23:38 ` Marcelo Tosatti
  1 sibling, 1 reply; 20+ messages in thread
From: Radim Krčmář @ 2016-05-26 20:19 UTC (permalink / raw)
  To: Roman Kagan
  Cc: kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini, Marcelo Tosatti

2016-05-26 17:49+0300, Roman Kagan:
> The condition to schedule per-VM master clock updates is inversed; as a
> result the master clocks are never updated and the kvm_clock drift WRT
> host's NTP-disciplined clock (which was the motivation to introduce this
> machinery in the first place) still remains.
> 
> Fix it, and reword the comment to make it more apparent what the desired
> behavior is.
> 
> Cc: Owen Hofmann <osh@google.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Marcelo Tosatti <mtosatti@redhat.com>
> Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
> ---
>  arch/x86/kvm/x86.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c805cf4..d8f591c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5810,10 +5810,10 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
>  
>  	update_pvclock_gtod(tk);
>  
> -	/* disable master clock if host does not trust, or does not
> -	 * use, TSC clocksource
> +	/* only schedule per-VM master clock updates if the host uses TSC and
> +	 * there's at least one VM in need of an update
>  	 */
> -	if (gtod->clock.vclock_mode != VCLOCK_TSC &&
> +	if (gtod->clock.vclock_mode == VCLOCK_TSC &&

I think we still want to disable master clock when vclock_mode is not
VCLOCK_TSC.

>  	    atomic_read(&kvm_guest_has_master_clock) != 0)

And I don't see why we don't want to enable master clock if the host
switches back to TSC.

>  		queue_work(system_long_wq, &pvclock_gtod_work);

Queueing unconditionally seems to be the correct thing to do.

Interaction between kvm_gen_update_masterclock(), pvclock_gtod_work(),
and NTP could be a problem:  kvm_gen_update_masterclock() only has to
run once per VM, but pvclock_gtod_work() calls it on every VCPU, so
frequent NTP updates on bigger guests could kill performance.  (Maybe
kvm_gen_update_masterclock() is too wasteful even when called once.)


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-05-26 20:19 ` Radim Krčmář
@ 2016-05-27 17:28   ` Roman Kagan
  2016-05-27 18:11     ` Radim Krčmář
  0 siblings, 1 reply; 20+ messages in thread
From: Roman Kagan @ 2016-05-27 17:28 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini, Marcelo Tosatti

On Thu, May 26, 2016 at 10:19:36PM +0200, Radim Krčmář wrote:
> 2016-05-26 17:49+0300, Roman Kagan:
> > The condition to schedule per-VM master clock updates is inversed; as a
> > result the master clocks are never updated and the kvm_clock drift WRT
> > host's NTP-disciplined clock (which was the motivation to introduce this
> > machinery in the first place) still remains.
> > 
> > Fix it, and reword the comment to make it more apparent what the desired
> > behavior is.
> > 
> > Cc: Owen Hofmann <osh@google.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Marcelo Tosatti <mtosatti@redhat.com>
> > Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
> > ---
> >  arch/x86/kvm/x86.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index c805cf4..d8f591c 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -5810,10 +5810,10 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
> >  
> >  	update_pvclock_gtod(tk);
> >  
> > -	/* disable master clock if host does not trust, or does not
> > -	 * use, TSC clocksource
> > +	/* only schedule per-VM master clock updates if the host uses TSC and
> > +	 * there's at least one VM in need of an update
> >  	 */
> > -	if (gtod->clock.vclock_mode != VCLOCK_TSC &&
> > +	if (gtod->clock.vclock_mode == VCLOCK_TSC &&
> 
> I think we still want to disable master clock when vclock_mode is not
> VCLOCK_TSC.

Hmm, indeed.  I missed that this thing is used both to update the
contents and to enable/disable it.  I'd say it's confusing...

> 
> >  	    atomic_read(&kvm_guest_has_master_clock) != 0)
> 
> And I don't see why we don't want to enable master clock if the host
> switches back to TSC.

Agreed (even though I guess it's not very likely: AFAICS, once switched
to a different clocksource, the host can switch back to TSC only if a
human manipulates /sys/devices/system/clocksource).

> >  		queue_work(system_long_wq, &pvclock_gtod_work);
> 
> Queueing unconditionally seems to be the correct thing to do.

The notifier is registered at kvm module init, so the work will be
scheduled even when there are no VMs at all.

> Interaction between kvm_gen_update_masterclock(), pvclock_gtod_work(),
> and NTP could be a problem:  kvm_gen_update_masterclock() only has to
> run once per VM, but pvclock_gtod_work() calls it on every VCPU, so
> frequent NTP updates on bigger guests could kill performance.

Unfortunately, things are worse than that: this stuff is updated on
every *tick* on the timekeeping CPU, so, as long as you keep at least
one of your CPUs busy, the update rate can reach HZ.  The frequency of
NTP updates is unimportant; it happens without NTP updates at all.

So I tend to agree that we're perhaps better off not fixing this bug and
leaving the kvm_clocks to drift until we figure out how to do it with
acceptable overhead.

Roman.


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-05-27 17:28   ` Roman Kagan
@ 2016-05-27 18:11     ` Radim Krčmář
  2016-05-27 18:46       ` Roman Kagan
  0 siblings, 1 reply; 20+ messages in thread
From: Radim Krčmář @ 2016-05-27 18:11 UTC (permalink / raw)
  To: Roman Kagan, kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini,
	Marcelo Tosatti

2016-05-27 20:28+0300, Roman Kagan:
> On Thu, May 26, 2016 at 10:19:36PM +0200, Radim Krčmář wrote:
>> >  	    atomic_read(&kvm_guest_has_master_clock) != 0)
>> 
>> And I don't see why we don't want to enable master clock if the host
>> switches back to TSC.
> 
> Agreed (even though I guess it's not very likely: AFAICS once switched
> to a different clocksource, the host can switch back to TSC only upon
> human manipulating /sys/devices/system/clocksource).

Yeah, it's a corner case.  A human would have to switch away from tsc as
well; the automatic switch happens only when tsc is no longer usable, AFAIK.

>> >  		queue_work(system_long_wq, &pvclock_gtod_work);
>> 
>> Queueing unconditionally seems to be the correct thing to do.
> 
> The notifier is registered at kvm module init, so the work will be
> scheduled even when there are no VMs at all.

Good point, we don't want to call pvclock_gtod_notify in that case
either.  Registering (unregistering) with the first (last) VM should be
good enough ... what about adding something based on this?

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 37af23052470..0779f0f01523 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -655,6 +655,8 @@ static struct kvm *kvm_create_vm(unsigned long type)
 		goto out_err;
 
 	spin_lock(&kvm_lock);
+	if (list_empty(&kvm->vm_list))
+		kvm_arch_create_first_vm(kvm);
 	list_add(&kvm->vm_list, &vm_list);
 	spin_unlock(&kvm_lock);
 
@@ -709,6 +711,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	kvm_arch_sync_events(kvm);
 	spin_lock(&kvm_lock);
 	list_del(&kvm->vm_list);
+	if (list_empty(&kvm->vm_list))
+		kvm_arch_destroy_last_vm(kvm);
 	spin_unlock(&kvm_lock);
 	kvm_free_irq_routing(kvm);
 	for (i = 0; i < KVM_NR_BUSES; i++)
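
The idea in the hunk above — arch hooks that fire when the first VM is
created and the last one destroyed — can be modeled in userspace with a
simple counter standing in for the vm_list emptiness check (a sketch; the
hook names mirror the diff, everything else is stand-in code):

```c
#include <assert.h>

/* Stand-ins for the arch hooks in the diff; count how often they fire. */
static int first_vm_hook_calls;
static int last_vm_hook_calls;
static int vm_count;	/* models list_empty(&vm_list) as vm_count == 0 */

static void kvm_arch_create_first_vm(void) { first_vm_hook_calls++; }
static void kvm_arch_destroy_last_vm(void) { last_vm_hook_calls++; }

static void create_vm(void)
{
	if (vm_count == 0)	/* list was empty: this is the first VM */
		kvm_arch_create_first_vm();
	vm_count++;
}

static void destroy_vm(void)
{
	vm_count--;
	if (vm_count == 0)	/* list is empty again: that was the last VM */
		kvm_arch_destroy_last_vm();
}
```

On x86 the create hook would register the pvclock_gtod notifier and the
destroy hook would unregister it, so no work is queued while no VMs exist.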

>> Interaction between kvm_gen_update_masterclock(), pvclock_gtod_work(),
>> and NTP could be a problem:  kvm_gen_update_masterclock() only has to
>> run once per VM, but pvclock_gtod_work() calls it on every VCPU, so
>> frequent NTP updates on bigger guests could kill performance.
> 
> Unfortunately, things are worse than that: this stuff is updated on
> every *tick* on the timekeeping CPU, so, as long as you keep at least
> one of your CPUs busy, the update rate can reach HZ.  The frequency of
> NTP updates is unimportant; it happens without NTP updates at all.
> 
> So I tend to agree that we're perhaps better off not fixing this bug and
> leaving the kvm_clocks to drift until we figure out how to do it with
> acceptable overhead.

Yuck ... the hunk below could help a bit.
I haven't checked if the timekeeping code updates gtod and therefore
sets 'was_set' even when the resulting time hasn't changed, so we might
need to do more to avoid useless situations.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a8c7ca34ee5d..37ed0a342bf1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5802,12 +5802,15 @@ static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
 /*
  * Notification about pvclock gtod data update.
  */
-static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
+static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long was_set,
 			       void *priv)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	struct timekeeper *tk = priv;
 
+	if (!was_set)
+		return 0;
+
 	update_pvclock_gtod(tk);
 
 	/* disable master clock if host does not trust, or does not


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-05-27 18:11     ` Radim Krčmář
@ 2016-05-27 18:46       ` Roman Kagan
  2016-05-27 19:29         ` Radim Krčmář
  0 siblings, 1 reply; 20+ messages in thread
From: Roman Kagan @ 2016-05-27 18:46 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini, Marcelo Tosatti

On Fri, May 27, 2016 at 08:11:40PM +0200, Radim Krčmář wrote:
> 2016-05-27 20:28+0300, Roman Kagan:
> >> Queueing unconditionally seems to be the correct thing to do.
> > 
> > The notifier is registered at kvm module init, so the work will be
> > scheduled even when there are no VMs at all.
> 
> Good point, we don't want to call pvclock_gtod_notify in that case
> either.  Registering (unregistering) with the first (last) VM should be
> good enough ... what about adding something based on this?
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 37af23052470..0779f0f01523 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -655,6 +655,8 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  		goto out_err;
>  
>  	spin_lock(&kvm_lock);
> +	if (list_empty(&kvm->vm_list))
> +		kvm_arch_create_first_vm(kvm);
>  	list_add(&kvm->vm_list, &vm_list);
>  	spin_unlock(&kvm_lock);
>  
> @@ -709,6 +711,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	kvm_arch_sync_events(kvm);
>  	spin_lock(&kvm_lock);
>  	list_del(&kvm->vm_list);
> +	if (list_empty(&kvm->vm_list))
> +		kvm_arch_destroy_last_vm(kvm);
>  	spin_unlock(&kvm_lock);
>  	kvm_free_irq_routing(kvm);
>  	for (i = 0; i < KVM_NR_BUSES; i++)

Makes perfect sense IMO.

> >> Interaction between kvm_gen_update_masterclock(), pvclock_gtod_work(),
> >> and NTP could be a problem:  kvm_gen_update_masterclock() only has to
> >> run once per VM, but pvclock_gtod_work() calls it on every VCPU, so
> >> frequent NTP updates on bigger guests could kill performance.
> > 
> > Unfortunately, things are worse than that: this stuff is updated on
> > every *tick* on the timekeeping CPU, so, as long as you keep at least
> > one of your CPUs busy, the update rate can reach HZ.  The frequency of
> > NTP updates is unimportant; it happens without NTP updates at all.
> > 
> > So I tend to agree that we're perhaps better off not fixing this bug and
> > leaving the kvm_clocks to drift until we figure out how to do it with
> > acceptable overhead.
> 
> Yuck ... the hunk below could help a bit.
> I haven't checked if the timekeeping code updates gtod and therefore
> sets 'was_set' even when the resulting time hasn't changed, so we might
> need to do more to avoid useless situations.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a8c7ca34ee5d..37ed0a342bf1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5802,12 +5802,15 @@ static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
>  /*
>   * Notification about pvclock gtod data update.
>   */
> -static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
> +static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long was_set,
>  			       void *priv)
>  {
>  	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
>  	struct timekeeper *tk = priv;
>  
> +	if (!was_set)
> +		return 0;
> +
>  	update_pvclock_gtod(tk);
>  

Nope, this parameter is only set when there's a step-like change in the
time.  The timekeeper itself is always updated.  I guess we could
mitigate the costs somewhat if we skipped updating the gtod copy until
the accumulated error reaches a certain limit; I'm not sure that's going
to help, though.
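
The threshold idea can be sketched as an update filter: keep a cached copy
of the timekeeper value and only propagate it once the accumulated
difference exceeds a bound (a hypothetical illustration — the names, the
nanosecond error metric, and the 100 us bound are all made up, not kernel
code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define GTOD_ERROR_LIMIT_NS 100000	/* hypothetical 100 us threshold */

static int64_t cached_gtod_ns;	/* last value pushed to the pvclock copy */
static int gtod_updates;	/* how many expensive updates we performed */

/* Called on every timekeeper tick; returns true if the copy was refreshed. */
static bool maybe_update_gtod(int64_t tk_ns)
{
	if (llabs(tk_ns - cached_gtod_ns) < GTOD_ERROR_LIMIT_NS)
		return false;	/* drift still small: skip the update */
	cached_gtod_ns = tk_ns;
	gtod_updates++;
	return true;
}
```

The filter turns a per-tick cost into one proportional to actual drift,
at the price of tolerating up to the threshold of error in the copy.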

Roman.


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-05-27 18:46       ` Roman Kagan
@ 2016-05-27 19:29         ` Radim Krčmář
  0 siblings, 0 replies; 20+ messages in thread
From: Radim Krčmář @ 2016-05-27 19:29 UTC (permalink / raw)
  To: Roman Kagan, kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini,
	Marcelo Tosatti

2016-05-27 21:46+0300, Roman Kagan:
> On Fri, May 27, 2016 at 08:11:40PM +0200, Radim Krčmář wrote:
> > 2016-05-27 20:28+0300, Roman Kagan:
>> >> Interaction between kvm_gen_update_masterclock(), pvclock_gtod_work(),
>> >> and NTP could be a problem:  kvm_gen_update_masterclock() only has to
>> >> run once per VM, but pvclock_gtod_work() calls it on every VCPU, so
>> >> frequent NTP updates on bigger guests could kill performance.
>> > 
>> > Unfortunately, things are worse than that: this stuff is updated on
>> > every *tick* on the timekeeping CPU, so, as long as you keep at least
>> > one of your CPUs busy, the update rate can reach HZ.  The frequency of
>> > NTP updates is unimportant; it happens without NTP updates at all.
>> > 
>> > So I tend to agree that we're perhaps better off not fixing this bug and
>> > leaving the kvm_clocks to drift until we figure out how to do it with
>> > acceptable overhead.
>> 
>> Yuck ... the hunk below could help a bit.
>> I haven't checked if the timekeeping code updates gtod and therefore
>> sets 'was_set' even when the resulting time hasn't changed, so we might
>> need to do more to avoid useless situations.
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index a8c7ca34ee5d..37ed0a342bf1 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -5802,12 +5802,15 @@ static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
>>  /*
>>   * Notification about pvclock gtod data update.
>>   */
>> -static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
>> +static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long was_set,
>>  			       void *priv)
>>  {
>>  	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
>>  	struct timekeeper *tk = priv;
>>  
>> +	if (!was_set)
>> +		return 0;
>> +
>>  	update_pvclock_gtod(tk);
>>  
> 
> Nope, this parameter is only set when there's a step-like change in the
> time.  The timekeeper itself is always updated.  I guess we could
> mitigate the costs somewhat if we skipped updating the gtod copy until
> the accumulated error reaches certain limit; not sure if that's gonna
> help though.

I see, timekeeping_adjust() isn't covered, but it should not adjust on
every tick, so we could propagate information about adjustments to
pvclock_gtod_notify (renaming 'unused' to 'has_changed'), because pvclock
only cares about changes of time.
Adding another threshold is a reasonable improvement if adjustments
happen too often, but we need to fix pvclock_gtod_update_fn() in any
case.

Am I missing anything else?

Thanks.


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-05-26 14:49 [PATCH] x86/kvm: fix condition to update kvm master clocks Roman Kagan
  2016-05-26 20:19 ` Radim Krčmář
@ 2016-05-29 23:38 ` Marcelo Tosatti
  2016-06-09  3:27   ` Marcelo Tosatti
  1 sibling, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2016-05-29 23:38 UTC (permalink / raw)
  To: Roman Kagan; +Cc: kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

On Thu, May 26, 2016 at 05:49:55PM +0300, Roman Kagan wrote:
> The condition to schedule per-VM master clock updates is inversed; as a
> result the master clocks are never updated and the kvm_clock drift WRT
> host's NTP-disciplined clock (which was the motivation to introduce this
> machinery in the first place) still remains.
> 
> Fix it, and reword the comment to make it more apparent what the desired
> behavior is.
> 
> Cc: Owen Hofmann <osh@google.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Marcelo Tosatti <mtosatti@redhat.com>
> Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
> ---
>  arch/x86/kvm/x86.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c805cf4..d8f591c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5810,10 +5810,10 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
>  
>  	update_pvclock_gtod(tk);
>  
> -	/* disable master clock if host does not trust, or does not
> -	 * use, TSC clocksource
> +	/* only schedule per-VM master clock updates if the host uses TSC and
> +	 * there's at least one VM in need of an update
>  	 */
> -	if (gtod->clock.vclock_mode != VCLOCK_TSC &&
> +	if (gtod->clock.vclock_mode == VCLOCK_TSC &&
>  	    atomic_read(&kvm_guest_has_master_clock) != 0)
>  		queue_work(system_long_wq, &pvclock_gtod_work);
>  
> -- 
> 2.5.5

NAK: as stated, this leaves the clocks out of sync between the execution
of pvclock_gtod_notify and the execution of pvclock_gtod_work.


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-05-29 23:38 ` Marcelo Tosatti
@ 2016-06-09  3:27   ` Marcelo Tosatti
  2016-06-09  3:45     ` Marcelo Tosatti
  2016-06-09 12:09     ` Roman Kagan
  0 siblings, 2 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2016-06-09  3:27 UTC (permalink / raw)
  To: Roman Kagan; +Cc: kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

On Sun, May 29, 2016 at 08:38:44PM -0300, Marcelo Tosatti wrote:
> On Thu, May 26, 2016 at 05:49:55PM +0300, Roman Kagan wrote:
> > The condition to schedule per-VM master clock updates is inversed; as a
> > result the master clocks are never updated and the kvm_clock drift WRT
> > host's NTP-disciplined clock (which was the motivation to introduce this
> > machinery in the first place) still remains.
> > 
> > Fix it, and reword the comment to make it more apparent what the desired
> > behavior is.
> > 
> > Cc: Owen Hofmann <osh@google.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Marcelo Tosatti <mtosatti@redhat.com>
> > Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
> > ---
> >  arch/x86/kvm/x86.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index c805cf4..d8f591c 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -5810,10 +5810,10 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
> >  
> >  	update_pvclock_gtod(tk);
> >  
> > -	/* disable master clock if host does not trust, or does not
> > -	 * use, TSC clocksource
> > +	/* only schedule per-VM master clock updates if the host uses TSC and
> > +	 * there's at least one VM in need of an update
> >  	 */
> > -	if (gtod->clock.vclock_mode != VCLOCK_TSC &&
> > +	if (gtod->clock.vclock_mode == VCLOCK_TSC &&
> >  	    atomic_read(&kvm_guest_has_master_clock) != 0)
> >  		queue_work(system_long_wq, &pvclock_gtod_work);
> >  
> > -- 
> > 2.5.5
> 
> NAK, as stated this leaves clocks out of sync between execution of
> pvclock_gtod_notify and pvclock_gtod_work execution.

OK, it's not feasible to keep both the REF_CLOCK (MSR) and shared memory
(REF_PAGE) clocks in sync.  Even if the frequency correction is the same
for both, the kernel updates the monotonic clock differently from the
stated frequency, that is:

    monotonic clock (advertised via vsyscall notifier to use mult/freq pair) != tsc*freq 

So the best solution IMO is to: 

    reads of guest clock = max(shared memory clock, get_kernel_ns() +
                               kvmclock_offset)

Where reads of guest clock include: 1) kvm_get_time_and_clockread
(updates to kvmclock areas), 2) rdmsr(REF_CLOCK).

Unless someone has a better idea, Roman, can you update your patch to
include such a solution?  For kvm_get_time_and_clockread, we can
fast-forward kvmclock_offset so that 

kvmclock_offset + get_kernel_ns() = shared memory clock
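
The proposed read-side rule can be sketched as a pure function (a
userspace model; the argument names stand in for get_kernel_ns(), the
per-VM offset, and the shared-memory clock value named above):

```c
#include <assert.h>
#include <stdint.h>

/* Model of "reads of guest clock = max(shared memory clock,
 * get_kernel_ns() + kvmclock_offset)".  When the kernel-derived value
 * lags the shared-memory clock, fast-forward the offset so later reads
 * stay ahead too. */
static int64_t guest_clock_read(int64_t kernel_ns, int64_t *kvmclock_offset,
				int64_t smclock)
{
	int64_t kclock = kernel_ns + *kvmclock_offset;

	if (kclock < smclock) {
		*kvmclock_offset += smclock - kclock;	/* fast-forward */
		kclock = smclock;
	}
	return kclock;
}
```

Because the offset itself is advanced, a single lagging read corrects all
subsequent kernel-derived reads as well.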




* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-06-09  3:27   ` Marcelo Tosatti
@ 2016-06-09  3:45     ` Marcelo Tosatti
  2016-06-09 12:09     ` Roman Kagan
  1 sibling, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2016-06-09  3:45 UTC (permalink / raw)
  To: Roman Kagan; +Cc: kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

On Thu, Jun 09, 2016 at 12:27:10AM -0300, Marcelo Tosatti wrote:
> On Sun, May 29, 2016 at 08:38:44PM -0300, Marcelo Tosatti wrote:
> > On Thu, May 26, 2016 at 05:49:55PM +0300, Roman Kagan wrote:
> > > The condition to schedule per-VM master clock updates is inversed; as a
> > > result the master clocks are never updated and the kvm_clock drift WRT
> > > host's NTP-disciplined clock (which was the motivation to introduce this
> > > machinery in the first place) still remains.
> > > 
> > > Fix it, and reword the comment to make it more apparent what the desired
> > > behavior is.
> > > 
> > > Cc: Owen Hofmann <osh@google.com>
> > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > Cc: Marcelo Tosatti <mtosatti@redhat.com>
> > > Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
> > > ---
> > >  arch/x86/kvm/x86.c | 6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index c805cf4..d8f591c 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -5810,10 +5810,10 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
> > >  
> > >  	update_pvclock_gtod(tk);
> > >  
> > > -	/* disable master clock if host does not trust, or does not
> > > -	 * use, TSC clocksource
> > > +	/* only schedule per-VM master clock updates if the host uses TSC and
> > > +	 * there's at least one VM in need of an update
> > >  	 */
> > > -	if (gtod->clock.vclock_mode != VCLOCK_TSC &&
> > > +	if (gtod->clock.vclock_mode == VCLOCK_TSC &&
> > >  	    atomic_read(&kvm_guest_has_master_clock) != 0)
> > >  		queue_work(system_long_wq, &pvclock_gtod_work);
> > >  
> > > -- 
> > > 2.5.5
> > 
> > NAK, as stated this leaves clocks out of sync between execution of
> > pvclock_gtod_notify and pvclock_gtod_work execution.
> 
> Ok, its not feasible to keep both REF_CLOCK (MSR) and shared memory
> (REF_PAGE) clocks in sync. Even if the frequency correction is the same
> for both, the kernel updates monotonic clock differently than the
> stated frequency that is:
> 
>     monotonic clock (advertised via vsyscall notifier to use mult/freq pair) != tsc*freq 
> 
> So the best solution IMO is to: 
> 
>     reads of guest clock = max(shared memory clock, get_kernel_ns() +
>                                kvmclock_offset)
> 
> Where reads of guest clock include: 1) kvm_get_time_and_clockread
> (updates to kvmclock areas), 2) rdmsr(REF_CLOCK).
> 
> Unless someone has a better idea, Roman, can you update your patch to
> include such solution? for kvm_get_time_and_clockread, can fast forward
> kvmclock_offset so that 
> 
> kvmclock_offset + get_kernel_ns() = shared memory clock

Also please update the kvm-unit-test to check both ways (I did here, and
what happens is this):

        Case1: with frequency correction from vsyscall: 
        
            a = read_shared_clock;
            b = rdmsr(REF_CLOCK);

            fail: a > b 

        Case2: without frequency correction from vsyscall:

            a = rdmsr(REF_CLOCK);
            b = read_shared_clock;

            fail: b < a

And the frequency correction advertised via vsyscall is the same as what
get_kernel_ns() is using.  Conclusion: update_wall_time() updates to the
monotonic clock do not match tsc*freqcorrection.

So there is nothing you can do to keep these clocks in sync
(other than checking which is the largest of them when reading 
REF_CLOCK).
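
Both failure cases above are violations of a single invariant: interleaved
reads of REF_CLOCK and the shared-memory clock, in either order, must be
non-decreasing.  A schematic check in the spirit of the kvm-unit-test
described (the sampled values are made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* An interleaved trace of clock reads passes iff it never goes backwards,
 * regardless of which clock served each sample. */
static bool reads_monotonic(const int64_t *samples, size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (samples[i] < samples[i - 1])
			return false;
	return true;
}
```

Case1 (a > b) and Case2 (b < a) are just the two orderings in which a
two-sample trace can fail this check.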





* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-06-09  3:27   ` Marcelo Tosatti
  2016-06-09  3:45     ` Marcelo Tosatti
@ 2016-06-09 12:09     ` Roman Kagan
  2016-06-09 18:25       ` Marcelo Tosatti
  1 sibling, 1 reply; 20+ messages in thread
From: Roman Kagan @ 2016-06-09 12:09 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

On Thu, Jun 09, 2016 at 12:27:10AM -0300, Marcelo Tosatti wrote:
> Ok, its not feasible to keep both REF_CLOCK (MSR) and shared memory
> (REF_PAGE) clocks in sync. Even if the frequency correction is the same
> for both, the kernel updates monotonic clock differently than the
> stated frequency that is:
> 
>     monotonic clock (advertised via vsyscall notifier to use mult/freq pair) != tsc*freq 
> 
> So the best solution IMO is to: 
> 
>     reads of guest clock = max(shared memory clock, get_kernel_ns() +
>                                kvmclock_offset)
> 
> Where reads of guest clock include: 1) kvm_get_time_and_clockread
> (updates to kvmclock areas), 2) rdmsr(REF_CLOCK).
> 
> Unless someone has a better idea, Roman, can you update your patch to
> include such solution? for kvm_get_time_and_clockread, can fast forward
> kvmclock_offset so that 
> 
> kvmclock_offset + get_kernel_ns() = shared memory clock

I'm not sure I understand what you mean.

->system_time in pvclock *is* assigned kernel_ns + kvmclock_offset, i.e.
at the time kvm_get_time_and_clockread() runs they are in sync by
definition.  They can diverge later due to the different whole-number
math applied to each.

There's also the problem that an arbitrary amount of time can pass
between assigning the return value for the guest's rdmsr and actually
entering the guest to deliver that value.

In general I'm starting to feel the shared memory clock is trying to
provide stronger guarantees than are really useful.  E.g. there's no such
thing as a synchronous TSC between vCPUs in a virtual machine, so any
guest assuming there is one is broken; in reality that means every sane
guest must tolerate certain violations of monotonicity when multiple CPUs
are used for timekeeping.  I wonder if this consideration allows for some
simplification of the paravirtual clock code...

Roman.


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-06-09 12:09     ` Roman Kagan
@ 2016-06-09 18:25       ` Marcelo Tosatti
  2016-06-09 19:19         ` Roman Kagan
                           ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2016-06-09 18:25 UTC (permalink / raw)
  To: Roman Kagan, kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

On Thu, Jun 09, 2016 at 03:09:46PM +0300, Roman Kagan wrote:
> On Thu, Jun 09, 2016 at 12:27:10AM -0300, Marcelo Tosatti wrote:
> > Ok, its not feasible to keep both REF_CLOCK (MSR) and shared memory
> > (REF_PAGE) clocks in sync. Even if the frequency correction is the same
> > for both, the kernel updates monotonic clock differently than the
> > stated frequency that is:
> > 
> >     monotonic clock (advertised via vsyscall notifier to use mult/freq pair) != tsc*freq 
> > 
> > So the best solution IMO is to: 
> > 
> >     reads of guest clock = max(shared memory clock, get_kernel_ns() +
> >                                kvmclock_offset)
> > 
> > Where reads of guest clock include: 1) kvm_get_time_and_clockread
> > (updates to kvmclock areas), 2) rdmsr(REF_CLOCK).
> > 
> > Unless someone has a better idea, Roman, can you update your patch to
> > include such solution? for kvm_get_time_and_clockread, can fast forward
> > kvmclock_offset so that 
> > 
> > kvmclock_offset + get_kernel_ns() = shared memory clock
> 
> I'm not sure I understand what you mean.
> 
> ->system_time in pvclock *is* assigned kernel_ns + kvmclock_offset, i.e.
> at the time kvm_get_time_and_clockread() runs they are in sync by
> definition.  They can diverge later due to different whole number math
> applied.

Keep kvmclock_offset + get_kernel_ns() in sync with the shared memory
clock (what you read from the guest). 


Add wrapper around get_kernel_ns + kvmclock_offset reads:

Record somewhere the last returned (last_returned_guestclock).

u64 read_guest_clock(struct kvm *kvm)
{
	u64 ret;

	mutex_lock(&guest_clock_mutex);
	ret = get_kernel_ns() + kvm->arch.kvmclock_offset;
	kvm->arch.last_returned_guestclock = ret;
	mutex_unlock(&guest_clock_mutex);
	return ret;
}


Sync code (to be executed at masterclock updates and rdmsr(REF_CLOCK)):

1. read guest shared memory = smval.
2. read guest clock = get_kernel_ns() + kvmclock_offset = kclock
3. if (kclock < smval) 
        kvmclock_offset += smval - kclock


Two possibilities for clock state:

P1. shared memory clock > get_kernel_ns() + kvmclock_offset
P2. shared memory clock < get_kernel_ns() + kvmclock_offset

Two possibilities for guest behaviour:
G1. a = shared memory clock read;
    b = get_kernel_ns() + kvmclock_offset

G1/P1:

    a = shared memory clock read (previous read, stored in memory)
    b = get_kernel_ns() + kvmclock_offset

After sync code above: Note smval > a, so b = smval > a

G1/P2:

    a = shared memory clock read (previous read, stored in memory)
    b = get_kernel_ns() + kvmclock_offset

a < b, fine.

G2 (second possibility for guest behaviour)
    a = get_kernel_ns() + kvmclock_offset
    b = shared memory clock read

G2/P1: fine, b > a.

G2/P2: 
    a = get_kernel_ns() + kvmclock_offset >
    b = shared memory clock read

So we have to either reduce get_kernel_ns() + kvmclock_offset so that 
b is larger or 'stop get_kernel_ns() + kvmclock_offset'.

Can make get_kernel_ns() + kvmclock_offset be as small as
last_returned_guestclock (otherwise users of get_kernel_ns() +
kvmclock_offset can see time backwards).

    0. mutex_lock(&guest_clock_mutex);
    01. getkernelns = get_kernel_ns();
    1. read guest shared memory = smval.
    2. kclock = getkernelns + kvmclock_offset
    3. if (kclock > smval) 
        kvmclock_offset -= min(kvmclock_offset - last_returned_guestclock,
                              kclock - smval)
    4. kclock = getkernelns + kvmclock_offset
    5. if (kclock > smval) {
        schedule_timeout(kclock - smval);
        kvmclock_offset -= min(kvmclock_offset - last_returned_guestclock,
                              kclock - smval)
       }
    6. mutex_unlock(&guest_clock_mutex);

That works, right?

I'll code that patch next week if you don't beat me to it.


> There's also a problem that there can be arbitrary amount of time
> between assigning the return value for guest's rdmsr and actually
> entering the guest to deliver that value.

Don't think that matters, see the 4 cases above.

> In general I'm starting to feel the shared memory clock is trying to
> provide stronger guarantees than really useful.  E.g. there's no such
> thing as synchronous TSC between vCPUs in a virtual machine, so every
> guest assuming it is broken; 

There is, the TSC is monotonic between pCPUs:

    pCPU1          |       pCPU2

1.  a = read tsc
2.                      b = read tsc.

> in reality that means that every sane guest
> must tolerate certain violations of monotonicity when multiple CPUs are
> used for timekeeping.  I wonder if this consideration can allow for some
> simplification of the paravirtual clock code...

I think applications can fail.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-06-09 18:25       ` Marcelo Tosatti
@ 2016-06-09 19:19         ` Roman Kagan
  2016-06-13 17:07         ` Roman Kagan
  2016-06-13 17:19         ` Roman Kagan
  2 siblings, 0 replies; 20+ messages in thread
From: Roman Kagan @ 2016-06-09 19:19 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

On Thu, Jun 09, 2016 at 03:25:02PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 09, 2016 at 03:09:46PM +0300, Roman Kagan wrote:
> > In general I'm starting to feel the shared memory clock is trying to
> > provide stronger guarantees than really useful.  E.g. there's no such
> > thing as synchronous TSC between vCPUs in a virtual machine, so every
> > guest assuming it is broken; 
> 
> There is, the TSC is monotonic between pCPUs:
> 
>     pCPU1          |       pCPU2
> 
> 1.  a = read tsc
> 2.                      b = read tsc.

Right, the processor guarantees that upon rdtsc completion the contents
of %rdx, %rax are ordered properly across CPUs.  However, what matters is
the comparison of the values between CPUs which happens in the code that
follows rdtsc, and which is subject to delays due to vCPUs being
scheduled out.  E.g.

       vCPU1          |        vCPU2
                      |
1. a = rdtsc          |
2. preemption by host |  b = rdtsc
3. enter to guest     |  return t(b)
4. return t(a)        |

                 t(a) < t(b)

In particular, in guest linux case, in the above scenario vCPU2 would
update the timekeeper and store tsc_timestamp onto it; then vCPU1 would
use the older tsc value to calculate the time and get negative tsc
delta.

> > in reality that means that every sane guest
> > must tolerate certain violations of monotonicity when multiple CPUs are
> > used for timekeeping.  I wonder if this consideration can allow for some
> > simplification of the paravirtual clock code...
> 
> I think applications can fail.

The guest kernel must take care of compensating those violations, so
that applications see consistent view of time.

[ I'll need some time to grok the rest of your message, as I'm still new
to the timekeeping code, so I'll reply to it in another mail. ]

Thanks,
Roman.


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-06-09 18:25       ` Marcelo Tosatti
  2016-06-09 19:19         ` Roman Kagan
@ 2016-06-13 17:07         ` Roman Kagan
  2016-06-14 22:11           ` Marcelo Tosatti
  2016-06-13 17:19         ` Roman Kagan
  2 siblings, 1 reply; 20+ messages in thread
From: Roman Kagan @ 2016-06-13 17:07 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

On Thu, Jun 09, 2016 at 03:25:02PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 09, 2016 at 03:09:46PM +0300, Roman Kagan wrote:
> > On Thu, Jun 09, 2016 at 12:27:10AM -0300, Marcelo Tosatti wrote:
> > > Ok, its not feasible to keep both REF_CLOCK (MSR) and shared memory
> > > (REF_PAGE) clocks in sync. Even if the frequency correction is the same
> > > for both, the kernel updates monotonic clock differently than the
> > > stated frequency that is:
> > > 
> > >     monotonic clock (advertised via vsyscall notifier to use mult/freq pair) != tsc*freq 
> > > 
> > > So the best solution IMO is to: 
> > > 
> > >     reads of guest clock = max(shared memory clock, get_kernel_ns() +
> > >                                kvmclock_offset)
> > > 
> > > Where reads of guest clock include: 1) kvm_get_time_and_clockread
> > > (updates to kvmclock areas), 2) rdmsr(REF_CLOCK).
> > > 
> > > Unless someone has a better idea, Roman, can you update your patch to
> > > include such solution? for kvm_get_time_and_clockread, can fast forward
> > > kvmclock_offset so that 
> > > 
> > > kvmclock_offset + get_kernel_ns() = shared memory clock
> > 
> > I'm not sure I understand what you mean.
> > 
> > ->system_time in pvclock *is* assigned kernel_ns + kvmclock_offset, i.e.
> > at the time kvm_get_time_and_clockread() runs they are in sync by
> > definition.  They can diverge later due to different whole number math
> > applied.
> 
> Sync kvmclock_offset + get_kernel_ns() = shared memory clock (what you
> read from the guest). 
> 
> 
> Add wrapper around get_kernel_ns + kvmclock_offset reads:
> 
> Record somewhere the last returned (last_returned_guestclock).
> 
> u64 read_guest_clock(struct kvm *kvm)
> {
>       mutex_lock(&guest_clock_mutex);
>       ret = get_kernel_ns() + kvmclock_offset;
>       kvm->arch.last_returned_guestclock = ret;
>       mutex_unlock(&guest_clock_mutex);
> }
> 
> 
> Sync code (to be executed at masterclock updates and rdmsr(REF_CLOCK)):
> 
> 1. read guest shared memory = smval.
> 2. read guest clock = get_kernel_ns() + kvmclock_offset = kclock
> 3. if (kclock < smval) 
>         kvmclock_offset += smval - kclock
> 
> 
> Two possibilites for clocks state:
> 
> P1. shared memory clock > get_kernel_ns() + kvmclock_offset
> P2. shared memory clock < get_kernel_ns() + kvmclock_offset
> 
> Two possibilites for guest behaviour:
> G1. a = shared memory clock read;
>     b = get_kernel_ns() + kvmclock_offset
> 
> G1/P1:
> 
>     a = shared memory clock read (previous read, stored in memory)
>     b = get_kernel_ns() + kvmclock_offset
> 
> After sync code above: Note smval > a, so b = smval > a
> 
> G1/P2:
> 
>     a = shared memory clock read (previous read, stored in memory)
>     b = get_kernel_ns() + kvmclock_offset
> 
> a < b, fine.
> 
> G2 (second possibility for guest behaviour)
>     a = get_kernel_ns() + kvmclock_offset
>     b = shared memory clock read
> 
> G2/P1: fine, b > a.
> 
> G2/P2: 
>     a = get_kernel_ns() + kvmclock_offset >
>     b = shared memory clock read
> 
> So we have to either reduce get_kernel_ns() + kvmclock_offset so that 
> b is larger or 'stop get_kernel_ns() + kvmclock_offset'.
> 
> Can make get_kernel_ns() + kvmclock_offset be as small as
> last_returned_guestclock (otherwise users of get_kernel_ns() +
> kvmclock_offset can see time backwards).
> 
>     0. mutex_lock(&guest_clock_mutex);
>     01. getkernelns = get_kernel_ns();
>     1. read guest shared memory = smval.
>     2. kclock = getkernelns + kvmclock_offset
>     3. if (kclock > smval) 
>         kvmclock_offset -= min(kvmclock_offset - last_returned_guestclock,
>                               kclock - smval)
>     4. kclock = getkernelns + kvmclock_offset
>     5. if (kclock > smval) {
>         schedule_timeout(kclock - smval);
>         kvmclock_offset -= min(kvmclock_offset - last_returned_guestclock,
>                               kclock - smval)
>        }
>     6. mutex_unlock(&guest_clock_mutex);
> 
> That works, right?

I wouldn't say so.

First, I don't think changing kvmclock_offset is allowed other than
through ioctl(KVM_SET_CLOCK); it violates everybody's expectation that
this is the difference between the host and the guest clocks.

Second, since masterclock updates essentially do that synchronization, I
think that instead of all that complication we can just return
host-side-calculated value of REFERENCE_TSC_PAGE clock from
rdmsr(REF_CLOCK) if masterclock is enabled.

> > There's also a problem that there can be arbitrary amount of time
> > between assigning the return value for guest's rdmsr and actually
> > entering the guest to deliver that value.
> 
> Don't think that matters, see the 4 cases above.

It does in the sense that between the point where we calculate the value
we're about to return from rdmsr(REF_CLOCK), and the time the guest will
actually see it, another vCPU can read and even update the reference tsc
page.  However, as I wrote in another message in this thread, there's
no way to guarantee clock reads to be monotonic across several vCPUs;
OTOH that doesn't violate the monotonicity on a specific vCPU.

So I'll probably just add another patch adjusting rdmsr(REF_CLOCK) to
return shared memory clock if it's enabled.

Roman.


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-06-09 18:25       ` Marcelo Tosatti
  2016-06-09 19:19         ` Roman Kagan
  2016-06-13 17:07         ` Roman Kagan
@ 2016-06-13 17:19         ` Roman Kagan
  2016-06-17 22:21           ` Marcelo Tosatti
  2 siblings, 1 reply; 20+ messages in thread
From: Roman Kagan @ 2016-06-13 17:19 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

While we're at this:

According to the comments in the code, the purpose of the masterclock
scheme is to prevent any vCPU from seeing an outdated hv_clock of
another vCPU.

However I'm missing how that is achieved.  AFAICS the guest entry is
allowed as soon as all vCPUs are kicked from guest with
KVM_REQ_CLOCK_UPDATE set; what stops one vCPU from processing it and
entering the guest before another vCPU even started updating its
hv_clock?

Thanks,
Roman.


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-06-13 17:07         ` Roman Kagan
@ 2016-06-14 22:11           ` Marcelo Tosatti
  0 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2016-06-14 22:11 UTC (permalink / raw)
  To: Roman Kagan, kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

On Mon, Jun 13, 2016 at 08:07:47PM +0300, Roman Kagan wrote:
> On Thu, Jun 09, 2016 at 03:25:02PM -0300, Marcelo Tosatti wrote:
> > On Thu, Jun 09, 2016 at 03:09:46PM +0300, Roman Kagan wrote:
> > > On Thu, Jun 09, 2016 at 12:27:10AM -0300, Marcelo Tosatti wrote:
> > > > Ok, its not feasible to keep both REF_CLOCK (MSR) and shared memory
> > > > (REF_PAGE) clocks in sync. Even if the frequency correction is the same
> > > > for both, the kernel updates monotonic clock differently than the
> > > > stated frequency that is:
> > > > 
> > > >     monotonic clock (advertised via vsyscall notifier to use mult/freq pair) != tsc*freq 
> > > > 
> > > > So the best solution IMO is to: 
> > > > 
> > > >     reads of guest clock = max(shared memory clock, get_kernel_ns() +
> > > >                                kvmclock_offset)
> > > > 
> > > > Where reads of guest clock include: 1) kvm_get_time_and_clockread
> > > > (updates to kvmclock areas), 2) rdmsr(REF_CLOCK).
> > > > 
> > > > Unless someone has a better idea, Roman, can you update your patch to
> > > > include such solution? for kvm_get_time_and_clockread, can fast forward
> > > > kvmclock_offset so that 
> > > > 
> > > > kvmclock_offset + get_kernel_ns() = shared memory clock
> > > 
> > > I'm not sure I understand what you mean.
> > > 
> > > ->system_time in pvclock *is* assigned kernel_ns + kvmclock_offset, i.e.
> > > at the time kvm_get_time_and_clockread() runs they are in sync by
> > > definition.  They can diverge later due to different whole number math
> > > applied.
> > 
> > Sync kvmclock_offset + get_kernel_ns() = shared memory clock (what you
> > read from the guest). 
> > 
> > 
> > Add wrapper around get_kernel_ns + kvmclock_offset reads:
> > 
> > Record somewhere the last returned (last_returned_guestclock).
> > 
> > u64 read_guest_clock(struct kvm *kvm)
> > {
> >       mutex_lock(&guest_clock_mutex);
> >       ret = get_kernel_ns() + kvmclock_offset;
> >       kvm->arch.last_returned_guestclock = ret;
> >       mutex_unlock(&guest_clock_mutex);
> > }
> > 
> > 
> > Sync code (to be executed at masterclock updates and rdmsr(REF_CLOCK)):
> > 
> > 1. read guest shared memory = smval.
> > 2. read guest clock = get_kernel_ns() + kvmclock_offset = kclock
> > 3. if (kclock < smval) 
> >         kvmclock_offset += smval - kclock
> > 
> > 
> > Two possibilites for clocks state:
> > 
> > P1. shared memory clock > get_kernel_ns() + kvmclock_offset
> > P2. shared memory clock < get_kernel_ns() + kvmclock_offset
> > 
> > Two possibilites for guest behaviour:
> > G1. a = shared memory clock read;
> >     b = get_kernel_ns() + kvmclock_offset
> > 
> > G1/P1:
> > 
> >     a = shared memory clock read (previous read, stored in memory)
> >     b = get_kernel_ns() + kvmclock_offset
> > 
> > After sync code above: Note smval > a, so b = smval > a
> > 
> > G1/P2:
> > 
> >     a = shared memory clock read (previous read, stored in memory)
> >     b = get_kernel_ns() + kvmclock_offset
> > 
> > a < b, fine.
> > 
> > G2 (second possibility for guest behaviour)
> >     a = get_kernel_ns() + kvmclock_offset
> >     b = shared memory clock read
> > 
> > G2/P1: fine, b > a.
> > 
> > G2/P2: 
> >     a = get_kernel_ns() + kvmclock_offset >
> >     b = shared memory clock read
> > 
> > So we have to either reduce get_kernel_ns() + kvmclock_offset so that 
> > b is larger or 'stop get_kernel_ns() + kvmclock_offset'.
> > 
> > Can make get_kernel_ns() + kvmclock_offset be as small as
> > last_returned_guestclock (otherwise users of get_kernel_ns() +
> > kvmclock_offset can see time backwards).
> > 
> >     0. mutex_lock(&guest_clock_mutex);
> >     01. getkernelns = get_kernel_ns();
> >     1. read guest shared memory = smval.
> >     2. kclock = getkernelns + kvmclock_offset
> >     3. if (kclock > smval) 
> >         kvmclock_offset -= min(kvmclock_offset - last_returned_guestclock,
> >                               kclock - smval)
> >     4. kclock = getkernelns + kvmclock_offset
> >     5. if (kclock > smval) {
> >         schedule_timeout(kclock - smval);
> >         kvmclock_offset -= min(kvmclock_offset - last_returned_guestclock,
> >                               kclock - smval)
> >        }
> >     6. mutex_unlock(&guest_clock_mutex);
> > 
> > That works, right?
> 
> I wouldn't say so.
> 
> First, I don't think changing kvmclock_offset is allowed other than
> through ioctl(KVM_SET_CLOCK); 

Says who? 

> it violates everybody's expectation that
> this is the difference between the host and the guest clocks.

Whose expectation?

userspace is supposed to, when the guest starts issue KVM_SET_CLOCK.

When the guest stops, issue KVM_GET_CLOCK (save value), and issue again KVM_SET_CLOCK
(with previously saved value).

Can't see what the problem is.

> Second, since masterclock updates essentially do that synchronization, I
> think that instead of all that compilcation we can just return
> host-side-calculated value of REFERENCE_TSC_PAGE clock from
> rdmsr(REF_CLOCK) if masterclock is enabled.

Yes, that is simpler, but I also wanted to deal with the potential
backwards event that happens at masterclock update time
(that is, if you are going to update the refclock page from
get_kernel_ns() + kvmclock_offset, you need to avoid a backwards event
in case

    get_kernel_ns() + kvmclock_offset < refclock).

This is the problem in the G2/P2 case above.

But that can be done separately, so yeah sure, returning 
host-side-calculated value of REFERENCE_TSC_PAGE clock from
rdmsr(REF_CLOCK) deals with the monotonicity issue.

> > > There's also a problem that there can be arbitrary amount of time
> > > between assigning the return value for guest's rdmsr and actually
> > > entering the guest to deliver that value.
> > 
> > Don't think that matters, see the 4 cases above.
> 
> It does in the sense that between the point where we calculate the value
> we're about to return from rdmsr(REF_CLOCK), and the time the guest will
> actually see it, another vCPU can read and even update the reference tsc
> page. 

Not sure I see, can you outline the problem in the style of the
4 cases above? (time diagrams)

>  However, as I wrote in another message in this thread, there's
> no way to guarantee clock reads to be monotonic across several vCPUs;
> OTOH that doesn't violate the monotonicity on a specific vCPU.

Yes it does and that has been a concern with Linux for a long time.
See time-warp-test.c testcase for example.

> So I'll probably just add another patch adjusting rdmsr(REF_CLOCK) to
> return shared memory clock if it's enabled.
> 
> Roman.


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-06-13 17:19         ` Roman Kagan
@ 2016-06-17 22:21           ` Marcelo Tosatti
  2016-06-20 17:22             ` Roman Kagan
  0 siblings, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2016-06-17 22:21 UTC (permalink / raw)
  To: Roman Kagan, kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

On Mon, Jun 13, 2016 at 08:19:30PM +0300, Roman Kagan wrote:
> While we're at this:
> 
> According to the comments in the code, the purpose of the masterclock
> scheme is to prevent any vCPU from seeing an outdated hv_clock of
> another vCPU.

It prevents two vcpus from using different hv_clocks:

"
 * To avoid that problem, do not allow visibility of distinct
 * system_timestamp/tsc_timestamp values simultaneously: use a master
 * copy of host monotonic time values. Update that master copy
 * in lockstep.
"

> However I'm missing how that is achieved.  AFAICS the guest entry is
> allowed as soon as all vCPUs are kicked from guest with
> KVM_REQ_CLOCK_UPDATE set; what stops one vCPU from processing it and
> entering the guest before another vCPU even started updating its
> hv_clock?

static void kvm_gen_update_masterclock(struct kvm *kvm)
{
#ifdef CONFIG_X86_64
        int i;
        struct kvm_vcpu *vcpu;
        struct kvm_arch *ka = &kvm->arch;

        spin_lock(&ka->pvclock_gtod_sync_lock);
        kvm_make_mclock_inprogress_request(kvm);
        /* no guest entries from this point */
        pvclock_update_vm_gtod_copy(kvm);

        kvm_for_each_vcpu(i, vcpu, kvm)
                kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

        /* guest entries allowed */
        kvm_for_each_vcpu(i, vcpu, kvm)
                clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);

        spin_unlock(&ka->pvclock_gtod_sync_lock);
#endif
}



* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-06-17 22:21           ` Marcelo Tosatti
@ 2016-06-20 17:22             ` Roman Kagan
  2016-06-20 21:29               ` Marcelo Tosatti
  0 siblings, 1 reply; 20+ messages in thread
From: Roman Kagan @ 2016-06-20 17:22 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

On Fri, Jun 17, 2016 at 07:21:21PM -0300, Marcelo Tosatti wrote:
> On Mon, Jun 13, 2016 at 08:19:30PM +0300, Roman Kagan wrote:
> > While we're at this:
> > 
> > According to the comments in the code, the purpose of the masterclock
> > scheme is to prevent any vCPU from seeing an outdated hv_clock of
> > another vCPU.
> 
> It prevents two vcpus from using different hv_clocks:
> 
> "
>  * To avoid that problem, do not allow visibility of distinct
>  * system_timestamp/tsc_timestamp values simultaneously: use a master
>  * copy of host monotonic time values. Update that master copy
>  * in lockstep.
> "
> 
> > However I'm missing how that is achieved.  AFAICS the guest entry is
> > allowed as soon as all vCPUs are kicked from guest with
> > KVM_REQ_CLOCK_UPDATE set; what stops one vCPU from processing it and
> > entering the guest before another vCPU even started updating its
> > hv_clock?
> 
> static void kvm_gen_update_masterclock(struct kvm *kvm)
> {
> #ifdef CONFIG_X86_64
>         int i;
>         struct kvm_vcpu *vcpu;
>         struct kvm_arch *ka = &kvm->arch;
> 
>         spin_lock(&ka->pvclock_gtod_sync_lock);
>         kvm_make_mclock_inprogress_request(kvm);
>         /* no guest entries from this point */
>         pvclock_update_vm_gtod_copy(kvm);
> 
>         kvm_for_each_vcpu(i, vcpu, kvm)
>                 kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> 
>         /* guest entries allowed */
>         kvm_for_each_vcpu(i, vcpu, kvm)
>                 clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);
> 
>         spin_unlock(&ka->pvclock_gtod_sync_lock);
> #endif
> }

Unless I'm missing something obvious again:

The per-vcpu hv_clock is updated when the vcpu processes
KVM_REQ_CLOCK_UPDATE request.

Once kvm_gen_update_masterclock() sets KVM_REQ_CLOCK_UPDATE and
clears KVM_REQ_MCLOCK_INPROGRESS for all vcpus, one vcpu can process the
requests, enter the guest, and read another vcpu's hv_clock, before that
other vcpu had a chance to process its KVM_REQ_CLOCK_UPDATE request.

Is there anything that prevents this?

Thanks,
Roman.


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-06-20 17:22             ` Roman Kagan
@ 2016-06-20 21:29               ` Marcelo Tosatti
  2016-06-21 14:40                 ` Roman Kagan
  0 siblings, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2016-06-20 21:29 UTC (permalink / raw)
  To: Roman Kagan, kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

On Mon, Jun 20, 2016 at 08:22:49PM +0300, Roman Kagan wrote:
> On Fri, Jun 17, 2016 at 07:21:21PM -0300, Marcelo Tosatti wrote:
> > On Mon, Jun 13, 2016 at 08:19:30PM +0300, Roman Kagan wrote:
> > > While we're at this:
> > > 
> > > According to the comments in the code, the purpose of the masterclock
> > > scheme is to prevent any vCPU from seeing an outdated hv_clock of
> > > another vCPU.
> > 
> > It prevents two vcpus from using different hv_clocks:
> > 
> > "
> >  * To avoid that problem, do not allow visibility of distinct
> >  * system_timestamp/tsc_timestamp values simultaneously: use a master
> >  * copy of host monotonic time values. Update that master copy
> >  * in lockstep.
> > "
> > 
> > > However I'm missing how that is achieved.  AFAICS the guest entry is
> > > allowed as soon as all vCPUs are kicked from guest with
> > > KVM_REQ_CLOCK_UPDATE set; what stops one vCPU from processing it and
> > > entering the guest before another vCPU even started updating its
> > > hv_clock?
> > 
> > static void kvm_gen_update_masterclock(struct kvm *kvm)
> > {
> > #ifdef CONFIG_X86_64
> >         int i;
> >         struct kvm_vcpu *vcpu;
> >         struct kvm_arch *ka = &kvm->arch;
> > 
> >         spin_lock(&ka->pvclock_gtod_sync_lock);
> >         kvm_make_mclock_inprogress_request(kvm);
> >         /* no guest entries from this point */
> >         pvclock_update_vm_gtod_copy(kvm);
> > 
> >         kvm_for_each_vcpu(i, vcpu, kvm)
> >                 kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> > 
> >         /* guest entries allowed */
> >         kvm_for_each_vcpu(i, vcpu, kvm)
> >                 clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);
> > 
> >         spin_unlock(&ka->pvclock_gtod_sync_lock);
> > #endif
> > }
> 
> Unless I'm missing something obvious again:
> 
> The per-vcpu hv_clock is updated when the vcpu processes
> KVM_REQ_CLOCK_UPDATE request.

Yes.

> Once kvm_gen_update_masterclock() sets KVM_REQ_CLOCK_UPDATE and
> clears KVM_REQ_MCLOCK_INPROGRESS for all vcpus, one vcpu can process the
> requests, enter the guest, and read another vcpu's hv_clock, before that
> other vcpu had a chance to process its KVM_REQ_CLOCK_UPDATE request.

Yes. But guest code should be reading its local kvmclock area:

                /*
                 * Test we're still on the cpu as well as the version.
                 * We could have been migrated just after the first
                 * vgetcpu but before fetching the version, so we
                 * wouldn't notice a version change.
                 */
                cpu1 = __getcpu() & VGETCPU_CPU_MASK;

(vclock_gettime.c)

> Is there anything that prevents this?

Guest code confirming both version and cpu do not change across 
a kvmclock read. Other than this, no.
> 
> Thanks,
> Roman.


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-06-20 21:29               ` Marcelo Tosatti
@ 2016-06-21 14:40                 ` Roman Kagan
  2016-06-21 21:28                   ` Marcelo Tosatti
  0 siblings, 1 reply; 20+ messages in thread
From: Roman Kagan @ 2016-06-21 14:40 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

On Mon, Jun 20, 2016 at 06:29:10PM -0300, Marcelo Tosatti wrote:
> On Mon, Jun 20, 2016 at 08:22:49PM +0300, Roman Kagan wrote:
> > The per-vcpu hv_clock is updated when the vcpu processes
> > KVM_REQ_CLOCK_UPDATE request.
> 
> Yes.
> 
> > Once kvm_gen_update_masterclock() sets KVM_REQ_CLOCK_UPDATE and
> > clears KVM_REQ_MCLOCK_INPROGRESS for all vcpus, one vcpu can process the
> > requests, enter the guest, and read another vcpu's hv_clock, before that
> > other vcpu had a chance to process its KVM_REQ_CLOCK_UPDATE request.
> 
> Yes. But guest code should be reading its local kvmclock area:
> 
>                 /*
>                  * Test we're still on the cpu as well as the version.
>                  * We could have been migrated just after the first
>                  * vgetcpu but before fetching the version, so we
>                  * wouldn't notice a version change.
>                  */
>                 cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> 
> (vclock_gettime.c)

This code is from an older version.  The latest always reads the clock
of the CPU #0:

        /*
         * Note: The kernel and hypervisor must guarantee that cpu ID
         * number maps 1:1 to per-CPU pvclock time info.
         *
         * Because the hypervisor is entirely unaware of guest userspace
         * preemption, it cannot guarantee that per-CPU pvclock time
         * info is updated if the underlying CPU changes or that that
         * version is increased whenever underlying CPU changes.
         *
         * On KVM, we are guaranteed that pvti updates for any vCPU are
         * atomic as seen by *all* vCPUs.  This is an even stronger
         * guarantee than we get with a normal seqlock.
         *
         * On Xen, we don't appear to have that guarantee, but Xen still
         * supplies a valid seqlock using the version field.
         *
         * We only do pvclock vdso timing at all if
         * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
         * mean that all vCPUs have matching pvti and that the TSC is
         * synced, so we can just look at vCPU 0's pvti.
         */

> > Is there anything that prevents this?
> 
> Guest code confirming both version and cpu do not change across 
> a kvmclock read. Other than this, no.

So is the code reading another vcpu's hv_clock wrong?

Roman.


* Re: [PATCH] x86/kvm: fix condition to update kvm master clocks
  2016-06-21 14:40                 ` Roman Kagan
@ 2016-06-21 21:28                   ` Marcelo Tosatti
  0 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2016-06-21 21:28 UTC (permalink / raw)
  To: Roman Kagan, kvm, Denis V. Lunev, Owen Hofmann, Paolo Bonzini

On Tue, Jun 21, 2016 at 05:40:32PM +0300, Roman Kagan wrote:
> On Mon, Jun 20, 2016 at 06:29:10PM -0300, Marcelo Tosatti wrote:
> > On Mon, Jun 20, 2016 at 08:22:49PM +0300, Roman Kagan wrote:
> > > The per-vcpu hv_clock is updated when the vcpu processes
> > > KVM_REQ_CLOCK_UPDATE request.
> > 
> > Yes.
> > 
> > > Once kvm_gen_update_masterclock() sets KVM_REQ_CLOCK_UPDATE and
> > > clears KVM_REQ_MCLOCK_INPROGRESS for all vcpus, one vcpu can process the
> > > requests, enter the guest, and read another vcpu's hv_clock, before that
> > > other vcpu had a chance to process its KVM_REQ_CLOCK_UPDATE request.
> > 
> > Yes. But guest code should be reading its local kvmclock area:
> > 
> >                 /*
> >                  * Test we're still on the cpu as well as the version.
> >                  * We could have been migrated just after the first
> >                  * vgetcpu but before fetching the version, so we
> >                  * wouldn't notice a version change.
> >                  */
> >                 cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> > 
> > (vclock_gettime.c)
> 
> This code is from an older version.  The latest always reads the clock
> of the CPU #0:
> 
>         /*
>          * Note: The kernel and hypervisor must guarantee that cpu ID
>          * number maps 1:1 to per-CPU pvclock time info.
>          *
>          * Because the hypervisor is entirely unaware of guest userspace
>          * preemption, it cannot guarantee that per-CPU pvclock time
>          * info is updated if the underlying CPU changes or that that
>          * version is increased whenever underlying CPU changes.
>          *
>          * On KVM, we are guaranteed that pvti updates for any vCPU are
>          * atomic as seen by *all* vCPUs.  This is an even stronger
>          * guarantee than we get with a normal seqlock.
>          *
>          * On Xen, we don't appear to have that guarantee, but Xen still
>          * supplies a valid seqlock using the version field.
>          *
>          * We only do pvclock vdso timing at all if
>          * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
>          * mean that all vCPUs have matching pvti and that the TSC is
>          * synced, so we can just look at vCPU 0's pvti.
>          */

In that case vCPU-N (which has its local kvmclock updated) will read
vCPU-0's area (which does not have its kvmclock updated yet).

vCPU-0 kvmclock area either contains the old copy of kvmclock values,
or the new copy.

That reading of vCPU-0 is protected by the version check, therefore it
is safe.

> > > Is there anything that prevents this?
> > 
> > Guest code confirming both version and cpu do not change across 
> > a kvmclock read. Other than this, no.
> 
> So is the code reading another vcpu's hv_clock wrong?

It's fine.

What you can't do is to:

	at vcpu-3:

	read kvmclock area of local vcpu.
	read kvmclock area of remote vcpu. 

And compare the values.

What you can do is:

	at vcpu-1:
	read kvmclock area of local vcpu.

	at vcpu-3:
	read kvmclock area of local vcpu.

And compare these reads. They should be monotonic.

Or only read from a single vcpu; that's also monotonic.



end of thread, other threads:[~2016-06-21 21:29 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-26 14:49 [PATCH] x86/kvm: fix condition to update kvm master clocks Roman Kagan
2016-05-26 20:19 ` Radim Krčmář
2016-05-27 17:28   ` Roman Kagan
2016-05-27 18:11     ` Radim Krčmář
2016-05-27 18:46       ` Roman Kagan
2016-05-27 19:29         ` Radim Krčmář
2016-05-29 23:38 ` Marcelo Tosatti
2016-06-09  3:27   ` Marcelo Tosatti
2016-06-09  3:45     ` Marcelo Tosatti
2016-06-09 12:09     ` Roman Kagan
2016-06-09 18:25       ` Marcelo Tosatti
2016-06-09 19:19         ` Roman Kagan
2016-06-13 17:07         ` Roman Kagan
2016-06-14 22:11           ` Marcelo Tosatti
2016-06-13 17:19         ` Roman Kagan
2016-06-17 22:21           ` Marcelo Tosatti
2016-06-20 17:22             ` Roman Kagan
2016-06-20 21:29               ` Marcelo Tosatti
2016-06-21 14:40                 ` Roman Kagan
2016-06-21 21:28                   ` Marcelo Tosatti
