All of lore.kernel.org
 help / color / mirror / Atom feed
* kvmclock doesn't work, help?
@ 2015-12-09 21:10 Andy Lutomirski
  2015-12-09 21:16 ` Paolo Bonzini
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-09 21:10 UTC (permalink / raw)
  To: kvm list, Marcelo Tosatti, Radim Krcmar, Paolo Bonzini, X86 ML

I'm trying to clean up kvmclock and I can't get it to work at all.  My
host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.

If I boot an SMP (2 vcpus) guest, tracing says:

 qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 0
 qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
 qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
 qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1
 qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1
 qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1


If I boot a UP guest, tracing says:

 qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1
 qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
masterclock 0 hostclock tsc offsetmatched 1

I suspect, but I haven't verified, that this is fallout from:

commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
Author: Marcelo Tosatti <mtosatti@redhat.com>
Date:   Wed May 14 12:43:24 2014 -0300

    KVM: x86: disable master clock if TSC is reset during suspend

    Updating system_time from the kernel clock once master clock
    has been enabled can result in time backwards event, in case
    kernel clock frequency is lower than TSC frequency.

    Disable master clock in case it is necessary to update it
    from the resume path.

    Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


Can we please stop making kvmclock more complex?  It's a beast right
now, and not in a good way.  It's far too tangled with the vclock
machinery on both the host and guest sides, the pvclock stuff is not
well thought out (even in principle in an ABI sense), and it's never
been clear to my what problem exactly the kvmclock stuff is supposed
to solve.


I'm somewhat tempted to suggest that we delete kvmclock entirely and
start over.  A correctly functioning KVM guest using TSC (i.e.
ignoring kvmclock entirely) seems to work rather more reliably and
considerably faster than a kvmclock guest.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-09 21:10 kvmclock doesn't work, help? Andy Lutomirski
@ 2015-12-09 21:16 ` Paolo Bonzini
  2015-12-09 21:49   ` Andy Lutomirski
  2015-12-10 21:32 ` Marcelo Tosatti
  2015-12-10 21:36 ` Marcelo Tosatti
  2 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2015-12-09 21:16 UTC (permalink / raw)
  To: Andy Lutomirski, kvm list, Marcelo Tosatti, Radim Krcmar, X86 ML



On 09/12/2015 22:10, Andy Lutomirski wrote:
> Can we please stop making kvmclock more complex?  It's a beast right
> now, and not in a good way.  It's far too tangled with the vclock
> machinery on both the host and guest sides, the pvclock stuff is not
> well thought out (even in principle in an ABI sense), and it's never
> been clear to my what problem exactly the kvmclock stuff is supposed
> to solve.

It's supposed to solve the problem that:

- not all hosts have a working TSC

- even if they all do, virtual machines can be migrated (or
saved/restored) to a host with a different TSC frequency

- any MMIO- or PIO-based mechanism to access the current time is orders
of magnitude slower than the TSC and less precise too.

> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> start over.  A correctly functioning KVM guest using TSC (i.e.
> ignoring kvmclock entirely) seems to work rather more reliably and
> considerably faster than a kvmclock guest.

If all your hosts have a working TSC and you don't do migration or
save/restore, that's a valid configuration.  It's not a good default,
however.

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-09 21:16 ` Paolo Bonzini
@ 2015-12-09 21:49   ` Andy Lutomirski
  2015-12-09 22:12     ` Paolo Bonzini
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-09 21:49 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm list, Marcelo Tosatti, Radim Krcmar, X86 ML

On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 09/12/2015 22:10, Andy Lutomirski wrote:
>> Can we please stop making kvmclock more complex?  It's a beast right
>> now, and not in a good way.  It's far too tangled with the vclock
>> machinery on both the host and guest sides, the pvclock stuff is not
>> well thought out (even in principle in an ABI sense), and it's never
>> been clear to my what problem exactly the kvmclock stuff is supposed
>> to solve.
>
> It's supposed to solve the problem that:
>
> - not all hosts have a working TSC

Fine, but we don't need any vdso integration for that.

>
> - even if they all do, virtual machines can be migrated (or
> saved/restored) to a host with a different TSC frequency

OK, I buy that.  So we want to export a linear function that the guest
applies to the TSC so the guest can apply it.

I suppose we also want ntp frequency corrections on the host to
propagate to the guest.

>
> - any MMIO- or PIO-based mechanism to access the current time is orders
> of magnitude slower than the TSC and less precise too.

Yup.  But TSC by itself gets that benefit, too.

>
>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>> start over.  A correctly functioning KVM guest using TSC (i.e.
>> ignoring kvmclock entirely) seems to work rather more reliably and
>> considerably faster than a kvmclock guest.
>
> If all your hosts have a working TSC and you don't do migration or
> save/restore, that's a valid configuration.  It's not a good default,
> however.

Er?

kvmclock is still really quite slow and buggy.  And the patch I
identified is definitely a problem here:

[  136.131241] KVM: disabling fast timing permanently due to inability
to recover from suspend

I got that on the host with this whitespace-damaged patch:

        if (backwards_tsc) {
                u64 delta_cyc = max_tsc - local_tsc;
+               if (!backwards_tsc_observed)
+                       pr_warn("KVM: disabling fast timing
permanently due to inability to recover from suspend\n");

when I suspended and resumed.

Can anyone explain what problem
16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
brief inspection, it just seems to be incorrect.  Shouldn't KVM's
normal TSC logic handle that case right?  After all, all vcpus should
be paused when we resume from suspend.  At worst, we should just need
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
shouldn't we do that regardless of which way the TSC jumped on
suspend/resume?  After all, the jTSC-to-wall-clock offset is quite
likely to change except on the very small handful of CPUs (if any)
that keep the TSC running in S3 and hibernate.

--Andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-09 21:49   ` Andy Lutomirski
@ 2015-12-09 22:12     ` Paolo Bonzini
  2015-12-09 22:27       ` Andy Lutomirski
  0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2015-12-09 22:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: kvm list, Marcelo Tosatti, Radim Krcmar, X86 ML, Alexander Graf



On 09/12/2015 22:49, Andy Lutomirski wrote:
> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 09/12/2015 22:10, Andy Lutomirski wrote:
>>> Can we please stop making kvmclock more complex?  It's a beast right
>>> now, and not in a good way.  It's far too tangled with the vclock
>>> machinery on both the host and guest sides, the pvclock stuff is not
>>> well thought out (even in principle in an ABI sense), and it's never
>>> been clear to my what problem exactly the kvmclock stuff is supposed
>>> to solve.
>>
>> It's supposed to solve the problem that:
>>
>> - not all hosts have a working TSC
> 
> Fine, but we don't need any vdso integration for that.

Well, you still want a fast time source.  That was a given. :)

>> - even if they all do, virtual machines can be migrated (or
>> saved/restored) to a host with a different TSC frequency
>> 
>> - any MMIO- or PIO-based mechanism to access the current time is orders
>> of magnitude slower than the TSC and less precise too.
> 
> Yup.  But TSC by itself gets that benefit, too.

Yes, the problem is if you want to solve all three of them.  The first
two are solved by the ACPI PM timer with a decent resolution (70
ns---much faster anyway than an I/O port access).  The third is solved
by TSC.  To solve all three, you need kvmclock.

>>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>>> start over.  A correctly functioning KVM guest using TSC (i.e.
>>> ignoring kvmclock entirely) seems to work rather more reliably and
>>> considerably faster than a kvmclock guest.
>>
>> If all your hosts have a working TSC and you don't do migration or
>> save/restore, that's a valid configuration.  It's not a good default,
>> however.
> 
> Er?
> 
> kvmclock is still really quite slow and buggy.

Unless it takes 3-4000 clock cycles for a gettimeofday, which it
shouldn't even with vdso disabled, it's definitely not slower than PIO.

> And the patch I identified is definitely a problem here:
> 
> [  136.131241] KVM: disabling fast timing permanently due to inability
> to recover from suspend
> 
> I got that on the host with this whitespace-damaged patch:
> 
>         if (backwards_tsc) {
>                 u64 delta_cyc = max_tsc - local_tsc;
> +               if (!backwards_tsc_observed)
> +                       pr_warn("KVM: disabling fast timing
> permanently due to inability to recover from suspend\n");
> 
> when I suspended and resumed.
> 
> Can anyone explain what problem
> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
> brief inspection, it just seems to be incorrect.  Shouldn't KVM's
> normal TSC logic handle that case right?  After all, all vcpus should
> be paused when we resume from suspend.  At worst, we should just need
> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
> shouldn't we do that regardless of which way the TSC jumped on
> suspend/resume?  After all, the jTSC-to-wall-clock offset is quite
> likely to change except on the very small handful of CPUs (if any)
> that keep the TSC running in S3 and hibernate.

I don't recall the details of that patch, so Marcelo will have to answer
this, or Alex too since he chimed in the original thread.  At least it
should be made conditional on the existence of a VM at suspend time (and
the master clock stuff should be made per VM, as I suggested at
https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).

It would indeed be great if the master clock could be dropped.  But I'm
definitely missing some of the subtle details. :(

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-09 22:12     ` Paolo Bonzini
@ 2015-12-09 22:27       ` Andy Lutomirski
  2015-12-09 22:42         ` Paolo Bonzini
                           ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-09 22:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm list, Marcelo Tosatti, Radim Krcmar, X86 ML, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 4072 bytes --]

On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 09/12/2015 22:49, Andy Lutomirski wrote:
>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>>
>>> On 09/12/2015 22:10, Andy Lutomirski wrote:
>>>> Can we please stop making kvmclock more complex?  It's a beast right
>>>> now, and not in a good way.  It's far too tangled with the vclock
>>>> machinery on both the host and guest sides, the pvclock stuff is not
>>>> well thought out (even in principle in an ABI sense), and it's never
>>>> been clear to my what problem exactly the kvmclock stuff is supposed
>>>> to solve.
>>>
>>> It's supposed to solve the problem that:
>>>
>>> - not all hosts have a working TSC
>>
>> Fine, but we don't need any vdso integration for that.
>
> Well, you still want a fast time source.  That was a given. :)

If the host can't figure out how to give *itself* a fast time source,
I'd be surprised if KVM can manage to give the guest a fast, reliable
time source.

>
>>> - even if they all do, virtual machines can be migrated (or
>>> saved/restored) to a host with a different TSC frequency
>>>
>>> - any MMIO- or PIO-based mechanism to access the current time is orders
>>> of magnitude slower than the TSC and less precise too.
>>
>> Yup.  But TSC by itself gets that benefit, too.
>
> Yes, the problem is if you want to solve all three of them.  The first
> two are solved by the ACPI PM timer with a decent resolution (70
> ns---much faster anyway than an I/O port access).  The third is solved
> by TSC.  To solve all three, you need kvmclock.

Still confused.  Is kvmclock really used in cases where even the host
can't pull of working TSC?

>
>>>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>>>> start over.  A correctly functioning KVM guest using TSC (i.e.
>>>> ignoring kvmclock entirely) seems to work rather more reliably and
>>>> considerably faster than a kvmclock guest.
>>>
>>> If all your hosts have a working TSC and you don't do migration or
>>> save/restore, that's a valid configuration.  It's not a good default,
>>> however.
>>
>> Er?
>>
>> kvmclock is still really quite slow and buggy.
>
> Unless it takes 3-4000 clock cycles for a gettimeofday, which it
> shouldn't even with vdso disabled, it's definitely not slower than PIO.
>
>> And the patch I identified is definitely a problem here:
>>
>> [  136.131241] KVM: disabling fast timing permanently due to inability
>> to recover from suspend
>>
>> I got that on the host with this whitespace-damaged patch:
>>
>>         if (backwards_tsc) {
>>                 u64 delta_cyc = max_tsc - local_tsc;
>> +               if (!backwards_tsc_observed)
>> +                       pr_warn("KVM: disabling fast timing
>> permanently due to inability to recover from suspend\n");
>>
>> when I suspended and resumed.
>>
>> Can anyone explain what problem
>> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
>> brief inspection, it just seems to be incorrect.  Shouldn't KVM's
>> normal TSC logic handle that case right?  After all, all vcpus should
>> be paused when we resume from suspend.  At worst, we should just need
>> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
>> shouldn't we do that regardless of which way the TSC jumped on
>> suspend/resume?  After all, the jTSC-to-wall-clock offset is quite
>> likely to change except on the very small handful of CPUs (if any)
>> that keep the TSC running in S3 and hibernate.
>
> I don't recall the details of that patch, so Marcelo will have to answer
> this, or Alex too since he chimed in the original thread.  At least it
> should be made conditional on the existence of a VM at suspend time (and
> the master clock stuff should be made per VM, as I suggested at
> https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).
>
> It would indeed be great if the master clock could be dropped.  But I'm
> definitely missing some of the subtle details. :(

Me, too.

Anyway, see the attached untested patch.  Marcelo?

--Andy

[-- Attachment #2: 0001-x86-kvm-On-KVM-re-enable-e.g.-after-suspect-update-c.patch --]
[-- Type: text/x-patch, Size: 5259 bytes --]

From e4a5e834d3fb6fc2499966e1af42cb5bd59f4410 Mon Sep 17 00:00:00 2001
Message-Id: <e4a5e834d3fb6fc2499966e1af42cb5bd59f4410.1449700024.git.luto@kernel.org>
From: Andy Lutomirski <luto@kernel.org>
Date: Wed, 9 Dec 2015 14:21:05 -0800
Subject: [PATCH] x86/kvm: On KVM re-enable (e.g. after suspect), update clocks

This gets rid of the "did TSC go backwards" logic and just updates
all clocks.  It should work better (no more disabling of fast
timing) and more reliably (all of the clocks are actually updated).

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kvm/x86.c | 75 +++---------------------------------------------------
 1 file changed, 3 insertions(+), 72 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eed32283d22c..c88f91f4b1a3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -123,8 +123,6 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 unsigned int __read_mostly lapic_timer_advance_ns = 0;
 module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
 
-static bool __read_mostly backwards_tsc_observed = false;
-
 #define KVM_NR_SHARED_MSRS 16
 
 struct kvm_shared_msrs_global {
@@ -1671,7 +1669,6 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 					&ka->master_cycle_now);
 
 	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
-				&& !backwards_tsc_observed
 				&& !ka->boot_vcpu_runs_old_kvmclock;
 
 	if (ka->use_master_clock)
@@ -7369,88 +7366,22 @@ int kvm_arch_hardware_enable(void)
 	struct kvm_vcpu *vcpu;
 	int i;
 	int ret;
-	u64 local_tsc;
-	u64 max_tsc = 0;
-	bool stable, backwards_tsc = false;
 
 	kvm_shared_msr_cpu_online();
 	ret = kvm_x86_ops->hardware_enable();
 	if (ret != 0)
 		return ret;
 
-	local_tsc = rdtsc();
-	stable = !check_tsc_unstable();
 	list_for_each_entry(kvm, &vm_list, vm_list) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!stable && vcpu->cpu == smp_processor_id())
+			if (vcpu->cpu == smp_processor_id()) {
 				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-			if (stable && vcpu->arch.last_host_tsc > local_tsc) {
-				backwards_tsc = true;
-				if (vcpu->arch.last_host_tsc > max_tsc)
-					max_tsc = vcpu->arch.last_host_tsc;
+				kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE,
+						 vcpu);
 			}
 		}
 	}
 
-	/*
-	 * Sometimes, even reliable TSCs go backwards.  This happens on
-	 * platforms that reset TSC during suspend or hibernate actions, but
-	 * maintain synchronization.  We must compensate.  Fortunately, we can
-	 * detect that condition here, which happens early in CPU bringup,
-	 * before any KVM threads can be running.  Unfortunately, we can't
-	 * bring the TSCs fully up to date with real time, as we aren't yet far
-	 * enough into CPU bringup that we know how much real time has actually
-	 * elapsed; our helper function, get_kernel_ns() will be using boot
-	 * variables that haven't been updated yet.
-	 *
-	 * So we simply find the maximum observed TSC above, then record the
-	 * adjustment to TSC in each VCPU.  When the VCPU later gets loaded,
-	 * the adjustment will be applied.  Note that we accumulate
-	 * adjustments, in case multiple suspend cycles happen before some VCPU
-	 * gets a chance to run again.  In the event that no KVM threads get a
-	 * chance to run, we will miss the entire elapsed period, as we'll have
-	 * reset last_host_tsc, so VCPUs will not have the TSC adjusted and may
-	 * loose cycle time.  This isn't too big a deal, since the loss will be
-	 * uniform across all VCPUs (not to mention the scenario is extremely
-	 * unlikely). It is possible that a second hibernate recovery happens
-	 * much faster than a first, causing the observed TSC here to be
-	 * smaller; this would require additional padding adjustment, which is
-	 * why we set last_host_tsc to the local tsc observed here.
-	 *
-	 * N.B. - this code below runs only on platforms with reliable TSC,
-	 * as that is the only way backwards_tsc is set above.  Also note
-	 * that this runs for ALL vcpus, which is not a bug; all VCPUs should
-	 * have the same delta_cyc adjustment applied if backwards_tsc
-	 * is detected.  Note further, this adjustment is only done once,
-	 * as we reset last_host_tsc on all VCPUs to stop this from being
-	 * called multiple times (one for each physical CPU bringup).
-	 *
-	 * Platforms with unreliable TSCs don't have to deal with this, they
-	 * will be compensated by the logic in vcpu_load, which sets the TSC to
-	 * catchup mode.  This will catchup all VCPUs to real time, but cannot
-	 * guarantee that they stay in perfect synchronization.
-	 */
-	if (backwards_tsc) {
-		u64 delta_cyc = max_tsc - local_tsc;
-		backwards_tsc_observed = true;
-		list_for_each_entry(kvm, &vm_list, vm_list) {
-			kvm_for_each_vcpu(i, vcpu, kvm) {
-				vcpu->arch.tsc_offset_adjustment += delta_cyc;
-				vcpu->arch.last_host_tsc = local_tsc;
-				kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
-			}
-
-			/*
-			 * We have to disable TSC offset matching.. if you were
-			 * booting a VM while issuing an S4 host suspend....
-			 * you may have some problem.  Solving this issue is
-			 * left as an exercise to the reader.
-			 */
-			kvm->arch.last_tsc_nsec = 0;
-			kvm->arch.last_tsc_write = 0;
-		}
-
-	}
 	return 0;
 }
 
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-09 22:27       ` Andy Lutomirski
@ 2015-12-09 22:42         ` Paolo Bonzini
  2015-12-09 22:43         ` Andy Lutomirski
  2015-12-10 21:33         ` Marcelo Tosatti
  2 siblings, 0 replies; 34+ messages in thread
From: Paolo Bonzini @ 2015-12-09 22:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: kvm list, Marcelo Tosatti, Radim Krcmar, X86 ML, Alexander Graf



On 09/12/2015 23:27, Andy Lutomirski wrote:
> On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 09/12/2015 22:49, Andy Lutomirski wrote:
>>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>
>>>>
>>>> On 09/12/2015 22:10, Andy Lutomirski wrote:
>>>>> Can we please stop making kvmclock more complex?  It's a beast right
>>>>> now, and not in a good way.  It's far too tangled with the vclock
>>>>> machinery on both the host and guest sides, the pvclock stuff is not
>>>>> well thought out (even in principle in an ABI sense), and it's never
>>>>> been clear to my what problem exactly the kvmclock stuff is supposed
>>>>> to solve.
>>>>
>>>> It's supposed to solve the problem that:
>>>>
>>>> - not all hosts have a working TSC
>>>
>>> Fine, but we don't need any vdso integration for that.
>>
>> Well, you still want a fast time source.  That was a given. :)
> 
> If the host can't figure out how to give *itself* a fast time source,
> I'd be surprised if KVM can manage to give the guest a fast, reliable
> time source.

There's no vdso integration unless the host has a constant, nonstop
(fully "working") TSC.  That's the meaning of PVCLOCK_TSC_STABLE_BIT.

So, correction:  if you can pull it off, you still want a fast time
source.  Otherwise, you still want one that is as fast as possible,
especially on the kernel side.

>>>> - even if they all do, virtual machines can be migrated (or
>>>> saved/restored) to a host with a different TSC frequency
>>>>
>>>> - any MMIO- or PIO-based mechanism to access the current time is orders
>>>> of magnitude slower than the TSC and less precise too.
>>
>> the problem is if you want to solve all three of them.  The first
>> two are solved by the ACPI PM timer with a decent resolution (70
>> ns---much faster anyway than an I/O port access).  The third is solved
>> by TSC.  To solve all three, you need kvmclock.
> 
> Still confused.  Is kvmclock really used in cases where even the host
> can't pull of working TSC?

You can certainly provide kvmclock even if you lack constant-rate or
nonstop TSC.  Those are only a requirement for vdso.

If the host has a constant-rate TSC, but the rate differs per physical
CPU (common on older NUMA machines), you can easily provide a working
kvmclock.  It cannot support vdso because you'll need to read the time
from a non-preemptable section, but it will work because KVM can update
the kvmclock parameters on VCPU migration, and it's still faster than
anything else.  (The purpose of the now-gone migration notifiers was to
support vdso even in this case).

If the host doesn't even have constant-rate TSC, you can still provide
kernel-only kvmclock reads through cpufreq notifiers.

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-09 22:27       ` Andy Lutomirski
  2015-12-09 22:42         ` Paolo Bonzini
@ 2015-12-09 22:43         ` Andy Lutomirski
  2015-12-10 21:33         ` Marcelo Tosatti
  2 siblings, 0 replies; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-09 22:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm list, Marcelo Tosatti, Radim Krcmar, X86 ML, Alexander Graf

On Wed, Dec 9, 2015 at 2:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 09/12/2015 22:49, Andy Lutomirski wrote:
>>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>
>>>>
>>>> On 09/12/2015 22:10, Andy Lutomirski wrote:
>>>>> Can we please stop making kvmclock more complex?  It's a beast right
>>>>> now, and not in a good way.  It's far too tangled with the vclock
>>>>> machinery on both the host and guest sides, the pvclock stuff is not
>>>>> well thought out (even in principle in an ABI sense), and it's never
>>>>> been clear to my what problem exactly the kvmclock stuff is supposed
>>>>> to solve.
>>>>
>>>> It's supposed to solve the problem that:
>>>>
>>>> - not all hosts have a working TSC
>>>
>>> Fine, but we don't need any vdso integration for that.
>>
>> Well, you still want a fast time source.  That was a given. :)
>
> If the host can't figure out how to give *itself* a fast time source,
> I'd be surprised if KVM can manage to give the guest a fast, reliable
> time source.
>
>>
>>>> - even if they all do, virtual machines can be migrated (or
>>>> saved/restored) to a host with a different TSC frequency
>>>>
>>>> - any MMIO- or PIO-based mechanism to access the current time is orders
>>>> of magnitude slower than the TSC and less precise too.
>>>
>>> Yup.  But TSC by itself gets that benefit, too.
>>
>> Yes, the problem is if you want to solve all three of them.  The first
>> two are solved by the ACPI PM timer with a decent resolution (70
>> ns---much faster anyway than an I/O port access).  The third is solved
>> by TSC.  To solve all three, you need kvmclock.
>
> Still confused.  Is kvmclock really used in cases where even the host
> can't pull of working TSC?
>
>>
>>>>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>>>>> start over.  A correctly functioning KVM guest using TSC (i.e.
>>>>> ignoring kvmclock entirely) seems to work rather more reliably and
>>>>> considerably faster than a kvmclock guest.
>>>>
>>>> If all your hosts have a working TSC and you don't do migration or
>>>> save/restore, that's a valid configuration.  It's not a good default,
>>>> however.
>>>
>>> Er?
>>>
>>> kvmclock is still really quite slow and buggy.
>>
>> Unless it takes 3-4000 clock cycles for a gettimeofday, which it
>> shouldn't even with vdso disabled, it's definitely not slower than PIO.
>>
>>> And the patch I identified is definitely a problem here:
>>>
>>> [  136.131241] KVM: disabling fast timing permanently due to inability
>>> to recover from suspend
>>>
>>> I got that on the host with this whitespace-damaged patch:
>>>
>>>         if (backwards_tsc) {
>>>                 u64 delta_cyc = max_tsc - local_tsc;
>>> +               if (!backwards_tsc_observed)
>>> +                       pr_warn("KVM: disabling fast timing
>>> permanently due to inability to recover from suspend\n");
>>>
>>> when I suspended and resumed.
>>>
>>> Can anyone explain what problem
>>> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
>>> brief inspection, it just seems to be incorrect.  Shouldn't KVM's
>>> normal TSC logic handle that case right?  After all, all vcpus should
>>> be paused when we resume from suspend.  At worst, we should just need
>>> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
>>> shouldn't we do that regardless of which way the TSC jumped on
>>> suspend/resume?  After all, the jTSC-to-wall-clock offset is quite
>>> likely to change except on the very small handful of CPUs (if any)
>>> that keep the TSC running in S3 and hibernate.
>>
>> I don't recall the details of that patch, so Marcelo will have to answer
>> this, or Alex too since he chimed in the original thread.  At least it
>> should be made conditional on the existence of a VM at suspend time (and
>> the master clock stuff should be made per VM, as I suggested at
>> https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).
>>
>> It would indeed be great if the master clock could be dropped.  But I'm
>> definitely missing some of the subtle details. :(
>
> Me, too.
>
> Anyway, see the attached untested patch.  Marcelo?

That patch seems to work.  I have valid timing before and after host
suspend.  When I suspend and resume the host with a running guest, I
get:

[   26.504071] clocksource: timekeeping watchdog: Marking clocksource
'tsc' as unstable because the skew is too large:
[   26.505253] clocksource:                       'kvm-clock' wd_now:
66744c542 wd_last: 564b09794 mask: ffffffffffffffff
[   26.506436] clocksource:                       'tsc' cs_now:
fffffee310b133c8 cs_last: cf5d0b952 mask: ffffffffffffffff

in the guest, which is arguably correct.  KVM could be further
improved to update the tsc offset after suspend/resume to get rid of
that artifact.

--Andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-09 21:10 kvmclock doesn't work, help? Andy Lutomirski
  2015-12-09 21:16 ` Paolo Bonzini
@ 2015-12-10 21:32 ` Marcelo Tosatti
  2015-12-11 21:57   ` Andy Lutomirski
  2015-12-10 21:36 ` Marcelo Tosatti
  2 siblings, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2015-12-10 21:32 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: kvm list, Radim Krcmar, Paolo Bonzini, X86 ML

On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
> I'm trying to clean up kvmclock and I can't get it to work at all.  My
> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
> 
> If I boot an SMP (2 vcpus) guest, tracing says:
> 
>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 0
>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
> 
> 
> If I boot a UP guest, tracing says:
> 
>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
> 
> I suspect, but I haven't verified, that this is fallout from:
> 
> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
> Author: Marcelo Tosatti <mtosatti@redhat.com>
> Date:   Wed May 14 12:43:24 2014 -0300
> 
>     KVM: x86: disable master clock if TSC is reset during suspend
> 
>     Updating system_time from the kernel clock once master clock
>     has been enabled can result in time backwards event, in case
>     kernel clock frequency is lower than TSC frequency.
> 
>     Disable master clock in case it is necessary to update it
>     from the resume path.
> 
>     Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> 
> Can we please stop making kvmclock more complex?  It's a beast right
> now, and not in a good way.  It's far too tangled with the vclock
> machinery on both the host and guest sides, the pvclock stuff is not
> well thought out (even in principle in an ABI sense), and it's never
> been clear to my what problem exactly the kvmclock stuff is supposed
> to solve.
>
> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> start over.  A correctly functioning KVM guest using TSC (i.e.
> ignoring kvmclock entirely) 
> seems to work rather more reliably and
> considerably faster than a kvmclock guest.
> 
> --Andy
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC

Andy,

I am all for solving practical problems rather than pleasing aesthetic
pleasure. 

>     Updating system_time from the kernel clock once master clock
>     has been enabled can result in time backwards event, in case
>     kernel clock frequency is lower than TSC frequency.
> 
>     Disable master clock in case it is necessary to update it
>     from the resume path.

> once master clock
>     has been enabled can result in time backwards event, in case
>     kernel clock frequency is lower than TSC frequency.

guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads.

If the effective frequency of the kernel clock is lower (for example
due to NTP correcting the TSC frequency of the system), and you resume 
and update the system, the following happens:

guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads=LARGE VALUE.
suspend/resume event.
guest visible clock = tsc_timestamp (updated at time N) + scaled tsc reads=0.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-09 22:27       ` Andy Lutomirski
  2015-12-09 22:42         ` Paolo Bonzini
  2015-12-09 22:43         ` Andy Lutomirski
@ 2015-12-10 21:33         ` Marcelo Tosatti
  2 siblings, 0 replies; 34+ messages in thread
From: Marcelo Tosatti @ 2015-12-10 21:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML, Alexander Graf

On Wed, Dec 09, 2015 at 02:27:36PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> >
> > On 09/12/2015 22:49, Andy Lutomirski wrote:
> >> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>>
> >>>
> >>> On 09/12/2015 22:10, Andy Lutomirski wrote:
> >>>> Can we please stop making kvmclock more complex?  It's a beast right
> >>>> now, and not in a good way.  It's far too tangled with the vclock
> >>>> machinery on both the host and guest sides, the pvclock stuff is not
> >>>> well thought out (even in principle in an ABI sense), and it's never
> >>>> been clear to my what problem exactly the kvmclock stuff is supposed
> >>>> to solve.
> >>>
> >>> It's supposed to solve the problem that:
> >>>
> >>> - not all hosts have a working TSC
> >>
> >> Fine, but we don't need any vdso integration for that.
> >
> > Well, you still want a fast time source.  That was a given. :)
> 
> If the host can't figure out how to give *itself* a fast time source,
> I'd be surprised if KVM can manage to give the guest a fast, reliable
> time source.
> 
> >
> >>> - even if they all do, virtual machines can be migrated (or
> >>> saved/restored) to a host with a different TSC frequency
> >>>
> >>> - any MMIO- or PIO-based mechanism to access the current time is orders
> >>> of magnitude slower than the TSC and less precise too.
> >>
> >> Yup.  But TSC by itself gets that benefit, too.
> >
> > Yes, the problem is if you want to solve all three of them.  The first
> > two are solved by the ACPI PM timer with a decent resolution (70
> > ns---much faster anyway than an I/O port access).  The third is solved
> > by TSC.  To solve all three, you need kvmclock.
> 
> Still confused.  Is kvmclock really used in cases where even the host
> can't pull of working TSC?
> 
> >
> >>>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> >>>> start over.  A correctly functioning KVM guest using TSC (i.e.
> >>>> ignoring kvmclock entirely) seems to work rather more reliably and
> >>>> considerably faster than a kvmclock guest.
> >>>
> >>> If all your hosts have a working TSC and you don't do migration or
> >>> save/restore, that's a valid configuration.  It's not a good default,
> >>> however.
> >>
> >> Er?
> >>
> >> kvmclock is still really quite slow and buggy.
> >
> > Unless it takes 3-4000 clock cycles for a gettimeofday, which it
> > shouldn't even with vdso disabled, it's definitely not slower than PIO.
> >
> >> And the patch I identified is definitely a problem here:
> >>
> >> [  136.131241] KVM: disabling fast timing permanently due to inability
> >> to recover from suspend
> >>
> >> I got that on the host with this whitespace-damaged patch:
> >>
> >>         if (backwards_tsc) {
> >>                 u64 delta_cyc = max_tsc - local_tsc;
> >> +               if (!backwards_tsc_observed)
> >> +                       pr_warn("KVM: disabling fast timing
> >> permanently due to inability to recover from suspend\n");
> >>
> >> when I suspended and resumed.
> >>
> >> Can anyone explain what problem
> >> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
> >> brief inspection, it just seems to be incorrect.  Shouldn't KVM's
> >> normal TSC logic handle that case right?  After all, all vcpus should
> >> be paused when we resume from suspend.  At worst, we should just need
> >> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
> >> shouldn't we do that regardless of which way the TSC jumped on
> >> suspend/resume?  After all, the jTSC-to-wall-clock offset is quite
> >> likely to change except on the very small handful of CPUs (if any)
> >> that keep the TSC running in S3 and hibernate.
> >
> > I don't recall the details of that patch, so Marcelo will have to answer
> > this, or Alex too since he chimed in the original thread.  At least it
> > should be made conditional on the existence of a VM at suspend time (and
> > the master clock stuff should be made per VM, as I suggested at
> > https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).
> >
> > It would indeed be great if the master clock could be dropped.  But I'm
> > definitely missing some of the subtle details. :(
> 
> Me, too.
> 
> Anyway, see the attached untested patch.  Marcelo?
> 
> --Andy

Read the last email, about the problem.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-09 21:10 kvmclock doesn't work, help? Andy Lutomirski
  2015-12-09 21:16 ` Paolo Bonzini
  2015-12-10 21:32 ` Marcelo Tosatti
@ 2015-12-10 21:36 ` Marcelo Tosatti
  2 siblings, 0 replies; 34+ messages in thread
From: Marcelo Tosatti @ 2015-12-10 21:36 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: kvm list, Radim Krcmar, Paolo Bonzini, X86 ML

On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
> I'm trying to clean up kvmclock and I can't get it to work at all.  My
> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
> 
> If I boot an SMP (2 vcpus) guest, tracing says:
> 
>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 0
>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
> 
> 
> If I boot a UP guest, tracing says:
> 
>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
> 
> I suspect, but I haven't verified, that this is fallout from:
> 
> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
> Author: Marcelo Tosatti <mtosatti@redhat.com>
> Date:   Wed May 14 12:43:24 2014 -0300
> 
>     KVM: x86: disable master clock if TSC is reset during suspend
> 
>     Updating system_time from the kernel clock once master clock
>     has been enabled can result in time backwards event, in case
>     kernel clock frequency is lower than TSC frequency.
> 
>     Disable master clock in case it is necessary to update it
>     from the resume path.
> 
>     Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> 
> Can we please stop making kvmclock more complex?  It's a beast right
> now, and not in a good way.  It's far too tangled with the vclock
> machinery on both the host and guest sides, the pvclock stuff is not
> well thought out (even in principle in an ABI sense), and it's never
> been clear to my what problem exactly the kvmclock stuff is supposed
> to solve.
> 
> 
> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> start over.  A correctly functioning KVM guest using TSC (i.e.
> ignoring kvmclock entirely) seems to work rather more reliably and
> considerably faster than a kvmclock guest.
> 
> --Andy

Users can do that, if they want. "clocksource=tsc" kernel option.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-10 21:32 ` Marcelo Tosatti
@ 2015-12-11 21:57   ` Andy Lutomirski
  2015-12-11 23:48     ` Marcelo Tosatti
  2015-12-14 13:44     ` Paolo Bonzini
  0 siblings, 2 replies; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-11 21:57 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm list, Radim Krcmar, Paolo Bonzini, X86 ML

On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
>> I'm trying to clean up kvmclock and I can't get it to work at all.  My
>> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
>>
>> If I boot an SMP (2 vcpus) guest, tracing says:
>>
>>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
>> masterclock 0 hostclock tsc offsetmatched 0
>>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
>> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
>>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
>> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
>> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
>> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
>> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
>> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
>> masterclock 0 hostclock tsc offsetmatched 1
>>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
>> masterclock 0 hostclock tsc offsetmatched 1
>>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
>> masterclock 0 hostclock tsc offsetmatched 1
>>
>>
>> If I boot a UP guest, tracing says:
>>
>>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
>> masterclock 0 hostclock tsc offsetmatched 1
>>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
>> masterclock 0 hostclock tsc offsetmatched 1
>>
>> I suspect, but I haven't verified, that this is fallout from:
>>
>> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
>> Author: Marcelo Tosatti <mtosatti@redhat.com>
>> Date:   Wed May 14 12:43:24 2014 -0300
>>
>>     KVM: x86: disable master clock if TSC is reset during suspend
>>
>>     Updating system_time from the kernel clock once master clock
>>     has been enabled can result in time backwards event, in case
>>     kernel clock frequency is lower than TSC frequency.
>>
>>     Disable master clock in case it is necessary to update it
>>     from the resume path.
>>
>>     Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>
>>
>> Can we please stop making kvmclock more complex?  It's a beast right
>> now, and not in a good way.  It's far too tangled with the vclock
>> machinery on both the host and guest sides, the pvclock stuff is not
>> well thought out (even in principle in an ABI sense), and it's never
>> been clear to my what problem exactly the kvmclock stuff is supposed
>> to solve.
>>
>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>> start over.  A correctly functioning KVM guest using TSC (i.e.
>> ignoring kvmclock entirely)
>> seems to work rather more reliably and
>> considerably faster than a kvmclock guest.
>>
>> --Andy
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC
>
> Andy,
>
> I am all for solving practical problems rather than pleasing aesthetic
> pleasure.
>
>>     Updating system_time from the kernel clock once master clock
>>     has been enabled can result in time backwards event, in case
>>     kernel clock frequency is lower than TSC frequency.
>>
>>     Disable master clock in case it is necessary to update it
>>     from the resume path.
>
>> once master clock
>>     has been enabled can result in time backwards event, in case
>>     kernel clock frequency is lower than TSC frequency.
>
> guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads.
>
> If the effective frequency of the kernel clock is lower (for example
> due to NTP correcting the TSC frequency of the system), and you resume
> and update the system, the following happens:
>
> guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads=LARGE VALUE.
> suspend/resume event.
> guest visible clock = tsc_timestamp (updated at time N) + scaled tsc reads=0.
>

I'm still not seeing the issue.

The formula is:

(((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
pvti->tsc_shift) + pvti->system_time

Obviously, if you reset pvti->tsc_timestamp to the current tsc value
after suspend/resume, you would also need to update system_time.

I don't see what this has to do with suspend/resume or with whether
the effective scale factor is greater than or less than one.  The only
suspend/resume interaction I can see is that, if the host allows the
guest-observed TSC value to jump (which is arguably a bug, what that's
not important here), it needs to update pvti before resuming the
guest.

Can you clarify concretely what goes wrong here?

(I'm also at a bit of a loss as to why this needs both system_time and
tsc_timestamp.  They're redundant in the sense that you could set
tsc_timestamp to zero and subtract (tsc_timestamp * tsc_to_system_mul)
>> tsc_shift to system_time without changing the result of the
calculation.)

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-11 21:57   ` Andy Lutomirski
@ 2015-12-11 23:48     ` Marcelo Tosatti
  2015-12-14 18:07       ` Andy Lutomirski
  2015-12-14 13:44     ` Paolo Bonzini
  1 sibling, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2015-12-11 23:48 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: kvm list, Radim Krcmar, Paolo Bonzini, X86 ML

On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote:
> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
> >> I'm trying to clean up kvmclock and I can't get it to work at all.  My
> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
> >>
> >> If I boot an SMP (2 vcpus) guest, tracing says:
> >>
> >>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 0
> >>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
> >>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>
> >>
> >> If I boot a UP guest, tracing says:
> >>
> >>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>
> >> I suspect, but I haven't verified, that this is fallout from:
> >>
> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
> >> Author: Marcelo Tosatti <mtosatti@redhat.com>
> >> Date:   Wed May 14 12:43:24 2014 -0300
> >>
> >>     KVM: x86: disable master clock if TSC is reset during suspend
> >>
> >>     Updating system_time from the kernel clock once master clock
> >>     has been enabled can result in time backwards event, in case
> >>     kernel clock frequency is lower than TSC frequency.
> >>
> >>     Disable master clock in case it is necessary to update it
> >>     from the resume path.
> >>
> >>     Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>
> >>
> >> Can we please stop making kvmclock more complex?  It's a beast right
> >> now, and not in a good way.  It's far too tangled with the vclock
> >> machinery on both the host and guest sides, the pvclock stuff is not
> >> well thought out (even in principle in an ABI sense), and it's never
> >> been clear to my what problem exactly the kvmclock stuff is supposed
> >> to solve.
> >>
> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> >> start over.  A correctly functioning KVM guest using TSC (i.e.
> >> ignoring kvmclock entirely)
> >> seems to work rather more reliably and
> >> considerably faster than a kvmclock guest.
> >>
> >> --Andy
> >>
> >> --
> >> Andy Lutomirski
> >> AMA Capital Management, LLC
> >
> > Andy,
> >
> > I am all for solving practical problems rather than pleasing aesthetic
> > pleasure.
> >
> >>     Updating system_time from the kernel clock once master clock
> >>     has been enabled can result in time backwards event, in case
> >>     kernel clock frequency is lower than TSC frequency.
> >>
> >>     Disable master clock in case it is necessary to update it
> >>     from the resume path.
> >
> >> once master clock
> >>     has been enabled can result in time backwards event, in case
> >>     kernel clock frequency is lower than TSC frequency.
> >
> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads.
> >
> > If the effective frequency of the kernel clock is lower (for example
> > due to NTP correcting the TSC frequency of the system), and you resume
> > and update the system, the following happens:
> >
> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads=LARGE VALUE.

    guest reads clock to memory at location A = scaled tsc read.

(note TSC is counting at frequency higher than advertised by
processor, thats why NTP has to "slow down" the kernel clock 
which is maintained by successive reads of the TSC).

> > suspend/resume event.
> > guest visible clock = tsc_timestamp (updated at time N) + scaled tsc reads=0.

Now the guest visible clock contains a tsc_timestamp that has been 
corrected by NTP, over say 5 days. So the tiny NTP correction has
been added up to something significant.

   guest reads clock to memory at location B = reads tsc_timestamp. 

Clock value in B (NTP corrected TSC) < clock value in A  (RAW TSC)

Yes?

> 
> I'm still not seeing the issue.

I'll add two items to the three snapshots above, hopefully will make it
clearer.

> 
> The formula is:
> 
> (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
> pvti->tsc_shift) + pvti->system_time
> 
> Obviously, if you reset pvti->tsc_timestamp to the current tsc value
> after suspend/resume, you would also need to update system_time.
> 
> I don't see what this has to do with suspend/resume or with whether
> the effective scale factor is greater than or less than one.  The only
> suspend/resume interaction I can see is that, if the host allows the
> guest-observed TSC value to jump (which is arguably a bug, what that's
> not important here), it needs to update pvti before resuming the
> guest.
> 
> Can you clarify concretely what goes wrong here?
> 
> (I'm also at a bit of a loss as to why this needs both system_time and
> tsc_timestamp.  They're redundant in the sense that you could set
> tsc_timestamp to zero and subtract (tsc_timestamp * tsc_to_system_mul)
> >> tsc_shift to system_time without changing the result of the
> calculation.)
> 
> --Andy
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-11 21:57   ` Andy Lutomirski
  2015-12-11 23:48     ` Marcelo Tosatti
@ 2015-12-14 13:44     ` Paolo Bonzini
  2015-12-14 22:00       ` Marcelo Tosatti
  1 sibling, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2015-12-14 13:44 UTC (permalink / raw)
  To: Andy Lutomirski, Marcelo Tosatti; +Cc: kvm list, Radim Krcmar, X86 ML



On 11/12/2015 22:57, Andy Lutomirski wrote:
> I'm still not seeing the issue.
> 
> The formula is:
> 
> (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
> pvti->tsc_shift) + pvti->system_time
> 
> Obviously, if you reset pvti->tsc_timestamp to the current tsc value
> after suspend/resume, you would also need to update system_time.
> 
> I don't see what this has to do with suspend/resume or with whether
> the effective scale factor is greater than or less than one.  The only
> suspend/resume interaction I can see is that, if the host allows the
> guest-observed TSC value to jump (which is arguably a bug, what that's
> not important here), it needs to update pvti before resuming the
> guest.

Which is not an issue, since freezing obviously gets all CPUs out of
guest mode.

Marcelo, can you provide an example with made-up values for tsc and pvti?

> Can you clarify concretely what goes wrong here?
> 
> (I'm also at a bit of a loss as to why this needs both system_time and
> tsc_timestamp.  They're redundant in the sense that you could set
> tsc_timestamp to zero and subtract (tsc_timestamp * tsc_to_system_mul) >>
> tsc_shift to system_time without changing the result of the
> calculation.)

You would have to ensure that all elements of pvti are rounded correctly
whenever the base TSC is updated.  Doable, but it does seem simpler to
keep subtract-TSC and add-nanoseconds separate.

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-11 23:48     ` Marcelo Tosatti
@ 2015-12-14 18:07       ` Andy Lutomirski
  2015-12-14 21:47         ` Marcelo Tosatti
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-14 18:07 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm list, Radim Krcmar, Paolo Bonzini, X86 ML

On Fri, Dec 11, 2015 at 3:48 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote:
>> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
>> >> I'm trying to clean up kvmclock and I can't get it to work at all.  My
>> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
>> >>
>> >> If I boot an SMP (2 vcpus) guest, tracing says:
>> >>
>> >>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 0
>> >>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
>> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
>> >>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
>> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
>> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
>> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
>> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
>> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>
>> >>
>> >> If I boot a UP guest, tracing says:
>> >>
>> >>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>
>> >> I suspect, but I haven't verified, that this is fallout from:
>> >>
>> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
>> >> Author: Marcelo Tosatti <mtosatti@redhat.com>
>> >> Date:   Wed May 14 12:43:24 2014 -0300
>> >>
>> >>     KVM: x86: disable master clock if TSC is reset during suspend
>> >>
>> >>     Updating system_time from the kernel clock once master clock
>> >>     has been enabled can result in time backwards event, in case
>> >>     kernel clock frequency is lower than TSC frequency.
>> >>
>> >>     Disable master clock in case it is necessary to update it
>> >>     from the resume path.
>> >>
>> >>     Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>> >>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> >>
>> >>
>> >> Can we please stop making kvmclock more complex?  It's a beast right
>> >> now, and not in a good way.  It's far too tangled with the vclock
>> >> machinery on both the host and guest sides, the pvclock stuff is not
>> >> well thought out (even in principle in an ABI sense), and it's never
>> >> been clear to my what problem exactly the kvmclock stuff is supposed
>> >> to solve.
>> >>
>> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>> >> start over.  A correctly functioning KVM guest using TSC (i.e.
>> >> ignoring kvmclock entirely)
>> >> seems to work rather more reliably and
>> >> considerably faster than a kvmclock guest.
>> >>
>> >> --Andy
>> >>
>> >> --
>> >> Andy Lutomirski
>> >> AMA Capital Management, LLC
>> >
>> > Andy,
>> >
>> > I am all for solving practical problems rather than pleasing aesthetic
>> > pleasure.
>> >
>> >>     Updating system_time from the kernel clock once master clock
>> >>     has been enabled can result in time backwards event, in case
>> >>     kernel clock frequency is lower than TSC frequency.
>> >>
>> >>     Disable master clock in case it is necessary to update it
>> >>     from the resume path.
>> >
>> >> once master clock
>> >>     has been enabled can result in time backwards event, in case
>> >>     kernel clock frequency is lower than TSC frequency.
>> >
>> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads.
>> >
>> > If the effective frequency of the kernel clock is lower (for example
>> > due to NTP correcting the TSC frequency of the system), and you resume
>> > and update the system, the following happens:
>> >
>> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads=LARGE VALUE.
>
>     guest reads clock to memory at location A = scaled tsc read.
>
> (note TSC is counting at frequency higher than advertised by
> processor, thats why NTP has to "slow down" the kernel clock
> which is maintained by successive reads of the TSC).
>
>> > suspend/resume event.
>> > guest visible clock = tsc_timestamp (updated at time N) + scaled tsc reads=0.
>
> Now the guest visible clock contains a tsc_timestamp that has been
> corrected by NTP, over say 5 days. So the tiny NTP correction has
> been added up to something significant.
>
>    guest reads clock to memory at location B = reads tsc_timestamp.
>
> Clock value in B (NTP corrected TSC) < clock value in A  (RAW TSC)
>
> Yes?

Sure, but I still don't see why this is a problem.  Why would the
guest compare raw TSC to NTP corrected TSC?

>
>>
>> I'm still not seeing the issue.
>
> I'll add two items to the three snapshots above, hopefully will make it
> clearer.

Maybe that'll help.

--Andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-14 18:07       ` Andy Lutomirski
@ 2015-12-14 21:47         ` Marcelo Tosatti
  0 siblings, 0 replies; 34+ messages in thread
From: Marcelo Tosatti @ 2015-12-14 21:47 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: kvm list, Radim Krcmar, Paolo Bonzini, X86 ML

On Mon, Dec 14, 2015 at 10:07:21AM -0800, Andy Lutomirski wrote:
> On Fri, Dec 11, 2015 at 3:48 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote:
> >> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
> >> >> I'm trying to clean up kvmclock and I can't get it to work at all.  My
> >> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
> >> >>
> >> >> If I boot an SMP (2 vcpus) guest, tracing says:
> >> >>
> >> >>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 0
> >> >>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
> >> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
> >> >>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
> >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
> >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
> >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
> >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
> >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>
> >> >>
> >> >> If I boot a UP guest, tracing says:
> >> >>
> >> >>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>
> >> >> I suspect, but I haven't verified, that this is fallout from:
> >> >>
> >> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
> >> >> Author: Marcelo Tosatti <mtosatti@redhat.com>
> >> >> Date:   Wed May 14 12:43:24 2014 -0300
> >> >>
> >> >>     KVM: x86: disable master clock if TSC is reset during suspend
> >> >>
> >> >>     Updating system_time from the kernel clock once master clock
> >> >>     has been enabled can result in time backwards event, in case
> >> >>     kernel clock frequency is lower than TSC frequency.
> >> >>
> >> >>     Disable master clock in case it is necessary to update it
> >> >>     from the resume path.
> >> >>
> >> >>     Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >> >>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >> >>
> >> >>
> >> >> Can we please stop making kvmclock more complex?  It's a beast right
> >> >> now, and not in a good way.  It's far too tangled with the vclock
> >> >> machinery on both the host and guest sides, the pvclock stuff is not
> >> >> well thought out (even in principle in an ABI sense), and it's never
> >> >> been clear to my what problem exactly the kvmclock stuff is supposed
> >> >> to solve.
> >> >>
> >> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> >> >> start over.  A correctly functioning KVM guest using TSC (i.e.
> >> >> ignoring kvmclock entirely)
> >> >> seems to work rather more reliably and
> >> >> considerably faster than a kvmclock guest.
> >> >>
> >> >> --Andy
> >> >>
> >> >> --
> >> >> Andy Lutomirski
> >> >> AMA Capital Management, LLC
> >> >
> >> > Andy,
> >> >
> >> > I am all for solving practical problems rather than pleasing aesthetic
> >> > pleasure.
> >> >
> >> >>     Updating system_time from the kernel clock once master clock
> >> >>     has been enabled can result in time backwards event, in case
> >> >>     kernel clock frequency is lower than TSC frequency.
> >> >>
> >> >>     Disable master clock in case it is necessary to update it
> >> >>     from the resume path.
> >> >
> >> >> once master clock
> >> >>     has been enabled can result in time backwards event, in case
> >> >>     kernel clock frequency is lower than TSC frequency.
> >> >
> >> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads.
> >> >
> >> > If the effective frequency of the kernel clock is lower (for example
> >> > due to NTP correcting the TSC frequency of the system), and you resume
> >> > and update the system, the following happens:
> >> >
> >> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads=LARGE VALUE.
> >
> >     guest reads clock to memory at location A = scaled tsc read.
> >
> > (note TSC is counting at frequency higher than advertised by
> > processor, thats why NTP has to "slow down" the kernel clock
> > which is maintained by successive reads of the TSC).
> >
> >> > suspend/resume event.
> >> > guest visible clock = tsc_timestamp (updated at time N) + scaled tsc reads=0.
			     ^^^^^^^^^^^^^

Err this was tsc_systemtime

> > Now the guest visible clock contains a tsc_timestamp that has been
> > corrected by NTP, over say 5 days. So the tiny NTP correction has
> > been added up to something significant.
> >
> >    guest reads clock to memory at location B = reads tsc_timestamp.
> >
> > Clock value in B (NTP corrected TSC) < clock value in A  (RAW TSC)
> >
> > Yes?
> 
> Sure, but I still don't see why this is a problem.  

Time as seen by the guest goes backwards.

	clock_gettime() = 1000.
	followed by
	clock_gettime() = 999.

Can't allow that.

> Why would the
> guest compare raw TSC to NTP corrected TSC?

Its "raw TSC" because thats what KVM exports to the guest, via the
tsc_timestamp field.

Its "corrected TSC" because thats what KVM exports to the guest, 
via system_time field (because the host is using TSC clocksource, and 
the host TSC clocksource is corrected by NTP).

> 
> >
> >>
> >> I'm still not seeing the issue.
> >
> > I'll add two items to the three snapshots above, hopefully will make it
> > clearer.
> 
> Maybe that'll help.
> 
> --Andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-14 13:44     ` Paolo Bonzini
@ 2015-12-14 22:00       ` Marcelo Tosatti
  2015-12-14 22:31         ` Andy Lutomirski
  0 siblings, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2015-12-14 22:00 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Andy Lutomirski, kvm list, Radim Krcmar, X86 ML

On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote:
> 
> 
> On 11/12/2015 22:57, Andy Lutomirski wrote:
> > I'm still not seeing the issue.
> > 
> > The formula is:
> > 
> > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
> > pvti->tsc_shift) + pvti->system_time
> > 
> > Obviously, if you reset pvti->tsc_timestamp to the current tsc value
> > after suspend/resume, you would also need to update system_time.
> > 
> > I don't see what this has to do with suspend/resume or with whether
> > the effective scale factor is greater than or less than one.  The only
> > suspend/resume interaction I can see is that, if the host allows the
> > guest-observed TSC value to jump (which is arguably a bug, what that's
> > not important here), it needs to update pvti before resuming the
> > guest.
> 
> Which is not an issue, since freezing obviously gets all CPUs out of
> guest mode.
> 
> Marcelo, can you provide an example with made-up values for tsc and pvti?

I meant "systemtime" at ^^^^^.

guest visible clock = systemtime (updated at time 0, guest initialization) + scaled tsc reads=LARGE VALUE.
		      ^^^^^^^^^^
guest reads clock to memory at location A = scaled tsc read.
-> suspend resume event 
guest visible clock = systemtime (updated at time AFTER SUSPEND) + scaled tsc reads=0.
guest reads clock to memory at location B.

So before the suspend/resume event, the clock is the RAW TSC values
(scaled by kvmclock, but the frequency of the RAW TSC). 

After suspend/resume event, the clock is updated from the host
via get_kernel_ns(), which reads the corrected NTP frequency TSC.

So you switch the timebase, from a clock running at a given frequency,
to a clock running at another frequency (effective frequency).

Example:

	RAW TSC			NTP corrected TSC
t0      10			10
t1	20			19.99
t2	30			29.98
t3	40			39.97
t4	50			49.96

...

if you suddenly switch from RAW TSC to NTP corrected TSC, 
you can see what will happen.

Does that make sense?

> > suspend/resume event.
> > guest visible clock = tsc_timestamp (updated at time N) + scaled tsc
> > reads=0.


> 
> > Can you clarify concretely what goes wrong here?
> > 
> > (I'm also at a bit of a loss as to why this needs both system_time and
> > tsc_timestamp.  They're redundant in the sense that you could set
> > tsc_timestamp to zero and subtract (tsc_timestamp * tsc_to_system_mul) >>
> > tsc_shift to system_time without changing the result of the
> > calculation.)
> 
> You would have to ensure that all elements of pvti are rounded correctly
> whenever the base TSC is updated.  Doable, but it does seem simpler to
> keep subtract-TSC and add-nanoseconds separate.
> 
> Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-14 22:00       ` Marcelo Tosatti
@ 2015-12-14 22:31         ` Andy Lutomirski
  2015-12-14 22:38           ` Marcelo Tosatti
  2015-12-15  8:42           ` Paolo Bonzini
  0 siblings, 2 replies; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-14 22:31 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML

On Mon, Dec 14, 2015 at 2:00 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote:
>>
>>
>> On 11/12/2015 22:57, Andy Lutomirski wrote:
>> > I'm still not seeing the issue.
>> >
>> > The formula is:
>> >
>> > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
>> > pvti->tsc_shift) + pvti->system_time
>> >
>> > Obviously, if you reset pvti->tsc_timestamp to the current tsc value
>> > after suspend/resume, you would also need to update system_time.
>> >
>> > I don't see what this has to do with suspend/resume or with whether
>> > the effective scale factor is greater than or less than one.  The only
>> > suspend/resume interaction I can see is that, if the host allows the
>> > guest-observed TSC value to jump (which is arguably a bug, what that's
>> > not important here), it needs to update pvti before resuming the
>> > guest.
>>
>> Which is not an issue, since freezing obviously gets all CPUs out of
>> guest mode.
>>
>> Marcelo, can you provide an example with made-up values for tsc and pvti?
>
> I meant "systemtime" at ^^^^^.
>
> guest visible clock = systemtime (updated at time 0, guest initialization) + scaled tsc reads=LARGE VALUE.
>                       ^^^^^^^^^^
> guest reads clock to memory at location A = scaled tsc read.
> -> suspend resume event
> guest visible clock = systemtime (updated at time AFTER SUSPEND) + scaled tsc reads=0.
> guest reads clock to memory at location B.
>
> So before the suspend/resume event, the clock is the RAW TSC values
> (scaled by kvmclock, but the frequency of the RAW TSC).
>
> After suspend/resume event, the clock is updated from the host
> via get_kernel_ns(), which reads the corrected NTP frequency TSC.
>
> So you switch the timebase, from a clock running at a given frequency,
> to a clock running at another frequency (effective frequency).
>
> Example:
>
>         RAW TSC                 NTP corrected TSC
> t0      10                      10
> t1      20                      19.99
> t2      30                      29.98
> t3      40                      39.97
> t4      50                      49.96
>
> ...
>
> if you suddenly switch from RAW TSC to NTP corrected TSC,
> you can see what will happen.

Sure, but why would you ever switch from one to the other?  I'm still
not seeing the scenario under which this discontinuity is visible to
anything other than the kvmclock code itself.

The only things that need to be monotonic are the output from
vread_pvclock and the in-kernel equivalent, I think.

--Andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-14 22:31         ` Andy Lutomirski
@ 2015-12-14 22:38           ` Marcelo Tosatti
  2015-12-15  8:42           ` Paolo Bonzini
  1 sibling, 0 replies; 34+ messages in thread
From: Marcelo Tosatti @ 2015-12-14 22:38 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML

GOn Mon, Dec 14, 2015 at 02:31:10PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 14, 2015 at 2:00 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote:
> >>
> >>
> >> On 11/12/2015 22:57, Andy Lutomirski wrote:
> >> > I'm still not seeing the issue.
> >> >
> >> > The formula is:
> >> >
> >> > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
> >> > pvti->tsc_shift) + pvti->system_time
> >> >
> >> > Obviously, if you reset pvti->tsc_timestamp to the current tsc value
> >> > after suspend/resume, you would also need to update system_time.
> >> >
> >> > I don't see what this has to do with suspend/resume or with whether
> >> > the effective scale factor is greater than or less than one.  The only
> >> > suspend/resume interaction I can see is that, if the host allows the
> >> > guest-observed TSC value to jump (which is arguably a bug, what that's
> >> > not important here), it needs to update pvti before resuming the
> >> > guest.
> >>
> >> Which is not an issue, since freezing obviously gets all CPUs out of
> >> guest mode.
> >>
> >> Marcelo, can you provide an example with made-up values for tsc and pvti?
> >
> > I meant "systemtime" at ^^^^^.
> >
> > guest visible clock = systemtime (updated at time 0, guest initialization) + scaled tsc reads=LARGE VALUE.
> >                       ^^^^^^^^^^
> > guest reads clock to memory at location A = scaled tsc read.
> > -> suspend resume event
> > guest visible clock = systemtime (updated at time AFTER SUSPEND) + scaled tsc reads=0.
> > guest reads clock to memory at location B.
> >
> > So before the suspend/resume event, the clock is the RAW TSC values
> > (scaled by kvmclock, but the frequency of the RAW TSC).
> >
> > After suspend/resume event, the clock is updated from the host
> > via get_kernel_ns(), which reads the corrected NTP frequency TSC.
> >
> > So you switch the timebase, from a clock running at a given frequency,
> > to a clock running at another frequency (effective frequency).
> >
> > Example:
> >
> >         RAW TSC                 NTP corrected TSC
> > t0      10                      10
> > t1      20                      19.99
> > t2      30                      29.98
> > t3      40                      39.97
> > t4      50                      49.96
> >
> > ...
> >
> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> > you can see what will happen.
> 
> Sure, but why would you ever switch from one to the other? 

Because thats what happens when you ask kvmclock to update from system
time (which is a reliable clock, resistant to suspend/resume issues).

>  I'm still not seeing the scenario under which this discontinuity is
> visible to anything other than the kvmclock code itself.

Host userspace can see if it uses TSC and clock_gettime()
and expects them to run hand in hand.

> The only things that need to be monotonic are the output from
> vread_pvclock and the in-kernel equivalent, I think.
> 
> --Andy

clock_gettime as well, should be monotonic.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-14 22:31         ` Andy Lutomirski
  2015-12-14 22:38           ` Marcelo Tosatti
@ 2015-12-15  8:42           ` Paolo Bonzini
  2015-12-16 17:48             ` Andy Lutomirski
  1 sibling, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2015-12-15  8:42 UTC (permalink / raw)
  To: Andy Lutomirski, Marcelo Tosatti; +Cc: kvm list, Radim Krcmar, X86 ML



On 14/12/2015 23:31, Andy Lutomirski wrote:
> >         RAW TSC                 NTP corrected TSC
> > t0      10                      10
> > t1      20                      19.99
> > t2      30                      29.98
> > t3      40                      39.97
> > t4      50                      49.96
> >
> > ...
> >
> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> > you can see what will happen.
>
> Sure, but why would you ever switch from one to the other?

The guest uses the raw TSC and systemtime = 0 until suspend.  After
resume, the TSC certainly increases at the same rate as before, but the
raw TSC restarted counting from 0 and systemtime has increased slower
than the guest kvmclock.

Paolo

> The only things that need to be monotonic are the output from
> vread_pvclock and the in-kernel equivalent, I think.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-15  8:42           ` Paolo Bonzini
@ 2015-12-16 17:48             ` Andy Lutomirski
  2015-12-16 18:17               ` Andy Lutomirski
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-16 17:48 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Marcelo Tosatti, kvm list, Radim Krcmar, X86 ML

On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >         RAW TSC                 NTP corrected TSC
>> > t0      10                      10
>> > t1      20                      19.99
>> > t2      30                      29.98
>> > t3      40                      39.97
>> > t4      50                      49.96
>> >
>> > ...
>> >
>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> > you can see what will happen.
>>
>> Sure, but why would you ever switch from one to the other?
>
> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> resume, the TSC certainly increases at the same rate as before, but the
> raw TSC restarted counting from 0 and systemtime has increased slower
> than the guest kvmclock.

Wait, are we talking about the host's NTP or the guest's NTP?

If it's the host's, then wouldn't systemtime be reset after resume to
the NTP corrected value?  If so, the guest wouldn't see time go
backwards.

If it's the guest's, then the guest's NTP correction is applied on top
of kvmclock, and this shouldn't matter.

I still feel like I'm missing something very basic here.

--Andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-16 17:48             ` Andy Lutomirski
@ 2015-12-16 18:17               ` Andy Lutomirski
  2015-12-16 21:57                 ` Marcelo Tosatti
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-16 18:17 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Marcelo Tosatti, kvm list, Radim Krcmar, X86 ML

On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 14/12/2015 23:31, Andy Lutomirski wrote:
>>> >         RAW TSC                 NTP corrected TSC
>>> > t0      10                      10
>>> > t1      20                      19.99
>>> > t2      30                      29.98
>>> > t3      40                      39.97
>>> > t4      50                      49.96
>>> >
>>> > ...
>>> >
>>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>>> > you can see what will happen.
>>>
>>> Sure, but why would you ever switch from one to the other?
>>
>> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> resume, the TSC certainly increases at the same rate as before, but the
>> raw TSC restarted counting from 0 and systemtime has increased slower
>> than the guest kvmclock.
>
> Wait, are we talking about the host's NTP or the guest's NTP?
>
> If it's the host's, then wouldn't systemtime be reset after resume to
> the NTP corrected value?  If so, the guest wouldn't see time go
> backwards.
>
> If it's the guest's, then the guest's NTP correction is applied on top
> of kvmclock, and this shouldn't matter.
>
> I still feel like I'm missing something very basic here.
>

OK, I think I get it.

Marcelo, I thought that kvmclock was supposed to propagate the host's
correction to the guest.  If it did, indeed, propagate the correction
then, after resume, the host's new system_time would match the guest's
idea of it (after accounting for the guest's long nap), and I don't
think there would be a problem.

That being said, I can't find the code in the masterclock stuff that
would actually do this.

If, on the other hand, the host's NTP correction is not supposed to
propagate to the guest, then shouldn't KVM just update system_time on
resume to whatever the guest would think it had (which I think would
be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
shifted by some per-guest constant offset).

--Andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-16 18:17               ` Andy Lutomirski
@ 2015-12-16 21:57                 ` Marcelo Tosatti
  2015-12-17 16:33                   ` Andy Lutomirski
  0 siblings, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2015-12-16 21:57 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML

On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>
> >>
> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >>> >         RAW TSC                 NTP corrected TSC
> >>> > t0      10                      10
> >>> > t1      20                      19.99
> >>> > t2      30                      29.98
> >>> > t3      40                      39.97
> >>> > t4      50                      49.96
> >>> >
> >>> > ...
> >>> >
> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >>> > you can see what will happen.
> >>>
> >>> Sure, but why would you ever switch from one to the other?
> >>
> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> >> resume, the TSC certainly increases at the same rate as before, but the
> >> raw TSC restarted counting from 0 and systemtime has increased slower
> >> than the guest kvmclock.
> >
> > Wait, are we talking about the host's NTP or the guest's NTP?
> >
> > If it's the host's, then wouldn't systemtime be reset after resume to
> > the NTP corrected value?  If so, the guest wouldn't see time go
> > backwards.
> >
> > If it's the guest's, then the guest's NTP correction is applied on top
> > of kvmclock, and this shouldn't matter.
> >
> > I still feel like I'm missing something very basic here.
> >
> 
> OK, I think I get it.
> 
> Marcelo, I thought that kvmclock was supposed to propagate the host's
> correction to the guest.  If it did, indeed, propagate the correction
> then, after resume, the host's new system_time would match the guest's
> idea of it (after accounting for the guest's long nap), and I don't
> think there would be a problem.
> That being said, I can't find the code in the masterclock stuff that
> would actually do this.

Guest clock is maintained by guest timekeeping code, which does:

timer_interrupt() 
	offset = read clocksource since last timer interrupt
	accumulate_to_systemclock(offset)

The frequency correction of NTP in the host can be applied to 
kvmclock, which will be visible to the guest 
at "read clocksource since last timer interrupt" 
(kvmclock_clocksource_read function).

This does not mean that the NTP correction in the host is propagated
to the guests system clock directly.

(For example, the guest can run NTP which is free to do further
adjustments at "accumulate_to_systemclock(offset)" time).

> If, on the other hand, the host's NTP correction is not supposed to
> propagate to the guest, 

This is optional. There is a module option to control this, in fact.

Its nice to have, because then you can execute a guest without NTP
(say without network connection), and have a kvmclock (kvmclock is a
clocksource, not a guest system clock) which is NTP corrected.

> then shouldn't KVM just update system_time on
> resume to whatever the guest would think it had (which I think would
> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
> shifted by some per-guest constant offset).
> 
> --Andy

Sure, you could add a correction to compensate and make sure 
the guest clock does not see time backwards.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-16 21:57                 ` Marcelo Tosatti
@ 2015-12-17 16:33                   ` Andy Lutomirski
  2015-12-17 19:08                     ` Marcelo Tosatti
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-17 16:33 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML

On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> >>
>> >>
>> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >>> >         RAW TSC                 NTP corrected TSC
>> >>> > t0      10                      10
>> >>> > t1      20                      19.99
>> >>> > t2      30                      29.98
>> >>> > t3      40                      39.97
>> >>> > t4      50                      49.96
>> >>> >
>> >>> > ...
>> >>> >
>> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> >>> > you can see what will happen.
>> >>>
>> >>> Sure, but why would you ever switch from one to the other?
>> >>
>> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> >> resume, the TSC certainly increases at the same rate as before, but the
>> >> raw TSC restarted counting from 0 and systemtime has increased slower
>> >> than the guest kvmclock.
>> >
>> > Wait, are we talking about the host's NTP or the guest's NTP?
>> >
>> > If it's the host's, then wouldn't systemtime be reset after resume to
>> > the NTP corrected value?  If so, the guest wouldn't see time go
>> > backwards.
>> >
>> > If it's the guest's, then the guest's NTP correction is applied on top
>> > of kvmclock, and this shouldn't matter.
>> >
>> > I still feel like I'm missing something very basic here.
>> >
>>
>> OK, I think I get it.
>>
>> Marcelo, I thought that kvmclock was supposed to propagate the host's
>> correction to the guest.  If it did, indeed, propagate the correction
>> then, after resume, the host's new system_time would match the guest's
>> idea of it (after accounting for the guest's long nap), and I don't
>> think there would be a problem.
>> That being said, I can't find the code in the masterclock stuff that
>> would actually do this.
>
> Guest clock is maintained by guest timekeeping code, which does:
>
> timer_interrupt()
>         offset = read clocksource since last timer interrupt
>         accumulate_to_systemclock(offset)
>
> The frequency correction of NTP in the host can be applied to
> kvmclock, which will be visible to the guest
> at "read clocksource since last timer interrupt"
> (kvmclock_clocksource_read function).

pvclock_clocksource_read?  That seems to do the same thing as all the
other clocksource access functions.

>
> This does not mean that the NTP correction in the host is propagated
> to the guests system clock directly.
>
> (For example, the guest can run NTP which is free to do further
> adjustments at "accumulate_to_systemclock(offset)" time).

Of course.  But I expected that, in the absence of NTP on the guest,
that the guest would track the host's *corrected* time.

>
>> If, on the other hand, the host's NTP correction is not supposed to
>> propagate to the guest,
>
> This is optional. There is a module option to control this, in fact.
>
> Its nice to have, because then you can execute a guest without NTP
> (say without network connection), and have a kvmclock (kvmclock is a
> clocksource, not a guest system clock) which is NTP corrected.

Can you point to how this works?  I found kvm_guest_time_update, whch
is called under circumstances that I haven't untangled.  I can't
really tell what it's trying to do.

In any case, this still seems much more convoluted than it has to be.
In the case in which the host has a stable TSC (tsc is selected in the
core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
the time on the last few generations of CPUs, then the core
timekeeping code is already exposing a linear function that's supposed
to be used for monotonic, cpu-local access to a corrected nanosecond
counter.  It's even in pretty much exactly the right form to pass
through to the guest via pvclock in the gtod data.  Why doesn't KVM
pass it through verbatim, updated in real time?  Is there some legacy
reason that KVM must apply its own corrections and has to jump through
hoops to pause vcpus when updating those vcpu's copies of the pvclock
data?

>
>> then shouldn't KVM just update system_time on
>> resume to whatever the guest would think it had (which I think would
>> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
>> shifted by some per-guest constant offset).
>>
>> --Andy
>
> Sure, you could add a correction to compensate and make sure
> the guest clock does not see time backwards.
>

Could you help do that?  You understand the code far better than I do.

As it stands, it simply doesn't work on any system that suspends and
resumes (unless maybe the system has the upcoming Intel ART feature,
and I have no clue when that'll show up).

--Andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-17 16:33                   ` Andy Lutomirski
@ 2015-12-17 19:08                     ` Marcelo Tosatti
  2015-12-18  1:12                       ` Andy Lutomirski
  0 siblings, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2015-12-17 19:08 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML

On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> >>
> >> >>
> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >> >>> >         RAW TSC                 NTP corrected TSC
> >> >>> > t0      10                      10
> >> >>> > t1      20                      19.99
> >> >>> > t2      30                      29.98
> >> >>> > t3      40                      39.97
> >> >>> > t4      50                      49.96

(1)

> >> >>> >
> >> >>> > ...
> >> >>> >
> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >> >>> > you can see what will happen.
> >> >>>
> >> >>> Sure, but why would you ever switch from one to the other?
> >> >>
> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> >> >> resume, the TSC certainly increases at the same rate as before, but the
> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
> >> >> than the guest kvmclock.
> >> >
> >> > Wait, are we talking about the host's NTP or the guest's NTP?
> >> >
> >> > If it's the host's, then wouldn't systemtime be reset after resume to
> >> > the NTP corrected value?  If so, the guest wouldn't see time go
> >> > backwards.
> >> >
> >> > If it's the guest's, then the guest's NTP correction is applied on top
> >> > of kvmclock, and this shouldn't matter.
> >> >
> >> > I still feel like I'm missing something very basic here.
> >> >
> >>
> >> OK, I think I get it.
> >>
> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
> >> correction to the guest.  If it did, indeed, propagate the correction
> >> then, after resume, the host's new system_time would match the guest's
> >> idea of it (after accounting for the guest's long nap), and I don't
> >> think there would be a problem.
> >> That being said, I can't find the code in the masterclock stuff that
> >> would actually do this.
> >
> > Guest clock is maintained by guest timekeeping code, which does:
> >
> > timer_interrupt()
> >         offset = read clocksource since last timer interrupt
> >         accumulate_to_systemclock(offset)
> >
> > The frequency correction of NTP in the host can be applied to
> > kvmclock, which will be visible to the guest
> > at "read clocksource since last timer interrupt"
> > (kvmclock_clocksource_read function).
> 
> pvclock_clocksource_read?  That seems to do the same thing as all the
> other clocksource access functions.
> 
> >
> > This does not mean that the NTP correction in the host is propagated
> > to the guests system clock directly.
> >
> > (For example, the guest can run NTP which is free to do further
> > adjustments at "accumulate_to_systemclock(offset)" time).
> 
> Of course.  But I expected that, in the absence of NTP on the guest,
> that the guest would track the host's *corrected* time.
> 
> >
> >> If, on the other hand, the host's NTP correction is not supposed to
> >> propagate to the guest,
> >
> > This is optional. There is a module option to control this, in fact.
> >
> > Its nice to have, because then you can execute a guest without NTP
> > (say without network connection), and have a kvmclock (kvmclock is a
> > clocksource, not a guest system clock) which is NTP corrected.
> 
> Can you point to how this works?  I found kvm_guest_time_update, whch
> is called under circumstances that I haven't untangled.  I can't
> really tell what it's trying to do.

Documentation/virtual/kvm/timekeeping.txt.

> In any case, this still seems much more convoluted than it has to be.
> In the case in which the host has a stable TSC (tsc is selected in the
> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
> the time on the last few generations of CPUs, then the core
> timekeeping code is already exposing a linear function that's supposed
> to be used for monotonic, cpu-local access to a corrected nanosecond
> counter.  It's even in pretty much exactly the right form to pass
> through to the guest via pvclock in the gtod data.  Why doesn't KVM
> pass it through verbatim, updated in real time?  Is there some legacy
> reason that KVM must apply its own corrections and has to jump through
> hoops to pause vcpus when updating those vcpu's copies of the pvclock
> data?

Read the comment on x86.c which starts with
" *
 * Assuming a stable TSC across physical CPUS, and a stable TSC
 * across virtual CPUs, the following condition is possible.
 * Each numbered line represents an event visible to both
 * CPUs at the next numbered event.
"

> >> then shouldn't KVM just update system_time on
> >> resume to whatever the guest would think it had (which I think would
> >> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
> >> shifted by some per-guest constant offset).
> >>
> >> --Andy
> >
> > Sure, you could add a correction to compensate and make sure
> > the guest clock does not see time backwards.
> >
> 
> Could you help do that?  You understand the code far better than I do.

Sure, you have to save the guests view of time (system_time + scaled tsc
read) when suspending, and add an offset to get_kernel_ns() to 
compensate the effect of (1) when resuming.

Does that make sense? 

> As it stands, it simply doesn't work on any system that suspends and
> resumes (unless maybe the system has the upcoming Intel ART feature,
> and I have no clue when that'll show up).
> 
> --Andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-17 19:08                     ` Marcelo Tosatti
@ 2015-12-18  1:12                       ` Andy Lutomirski
  2015-12-18 11:47                         ` Marcelo Tosatti
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-18  1:12 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML

On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> >> >>
>> >> >>
>> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >> >>> >         RAW TSC                 NTP corrected TSC
>> >> >>> > t0      10                      10
>> >> >>> > t1      20                      19.99
>> >> >>> > t2      30                      29.98
>> >> >>> > t3      40                      39.97
>> >> >>> > t4      50                      49.96
>
> (1)
>
>> >> >>> >
>> >> >>> > ...
>> >> >>> >
>> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> >> >>> > you can see what will happen.
>> >> >>>
>> >> >>> Sure, but why would you ever switch from one to the other?
>> >> >>
>> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> >> >> resume, the TSC certainly increases at the same rate as before, but the
>> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
>> >> >> than the guest kvmclock.
>> >> >
>> >> > Wait, are we talking about the host's NTP or the guest's NTP?
>> >> >
>> >> > If it's the host's, then wouldn't systemtime be reset after resume to
>> >> > the NTP corrected value?  If so, the guest wouldn't see time go
>> >> > backwards.
>> >> >
>> >> > If it's the guest's, then the guest's NTP correction is applied on top
>> >> > of kvmclock, and this shouldn't matter.
>> >> >
>> >> > I still feel like I'm missing something very basic here.
>> >> >
>> >>
>> >> OK, I think I get it.
>> >>
>> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
>> >> correction to the guest.  If it did, indeed, propagate the correction
>> >> then, after resume, the host's new system_time would match the guest's
>> >> idea of it (after accounting for the guest's long nap), and I don't
>> >> think there would be a problem.
>> >> That being said, I can't find the code in the masterclock stuff that
>> >> would actually do this.
>> >
>> > Guest clock is maintained by guest timekeeping code, which does:
>> >
>> > timer_interrupt()
>> >         offset = read clocksource since last timer interrupt
>> >         accumulate_to_systemclock(offset)
>> >
>> > The frequency correction of NTP in the host can be applied to
>> > kvmclock, which will be visible to the guest
>> > at "read clocksource since last timer interrupt"
>> > (kvmclock_clocksource_read function).
>>
>> pvclock_clocksource_read?  That seems to do the same thing as all the
>> other clocksource access functions.
>>
>> >
>> > This does not mean that the NTP correction in the host is propagated
>> > to the guests system clock directly.
>> >
>> > (For example, the guest can run NTP which is free to do further
>> > adjustments at "accumulate_to_systemclock(offset)" time).
>>
>> Of course.  But I expected that, in the absence of NTP on the guest,
>> that the guest would track the host's *corrected* time.
>>
>> >
>> >> If, on the other hand, the host's NTP correction is not supposed to
>> >> propagate to the guest,
>> >
>> > This is optional. There is a module option to control this, in fact.
>> >
>> > Its nice to have, because then you can execute a guest without NTP
>> > (say without network connection), and have a kvmclock (kvmclock is a
>> > clocksource, not a guest system clock) which is NTP corrected.
>>
>> Can you point to how this works?  I found kvm_guest_time_update, whch
>> is called under circumstances that I haven't untangled.  I can't
>> really tell what it's trying to do.
>
> Documentation/virtual/kvm/timekeeping.txt.
>

That document is really long.  I skimmed it and found nothing.

>> In any case, this still seems much more convoluted than it has to be.
>> In the case in which the host has a stable TSC (tsc is selected in the
>> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
>> the time on the last few generations of CPUs, then the core
>> timekeeping code is already exposing a linear function that's supposed
>> to be used for monotonic, cpu-local access to a corrected nanosecond
>> counter.  It's even in pretty much exactly the right form to pass
>> through to the guest via pvclock in the gtod data.  Why doesn't KVM
>> pass it through verbatim, updated in real time?  Is there some legacy
>> reason that KVM must apply its own corrections and has to jump through
>> hoops to pause vcpus when updating those vcpu's copies of the pvclock
>> data?
>
> Read the comment on x86.c which starts with
> " *
>  * Assuming a stable TSC across physical CPUS, and a stable TSC
>  * across virtual CPUs, the following condition is possible.
>  * Each numbered line represents an event visible to both
>  * CPUs at the next numbered event.
> "

A couple things:

1. That says: timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))

but that's wrong, I think.  rdtsc is a function, not a number.  Shouldn't it be:

timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M))

which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M <
N + (rdtsc1 - rdtsc0)?

That doesn't change the conclusion.

In any case, I'm not arguing that the concept of a master copy is
unnecessary; I'm arguing that the implementation, the calculations,
and the machinations in the code are all very, very complicated.  All
that should be needed is to keep all of the vcpu pvti copies the same
and to make sure that you can't ever have one vcpu see a new copy and
then another vcpu see an old copy.  You can do that by brute-force
freezing all vcpus on an update (what happens now), or you could do it
by just writing all of the copies at the same time from the same host
cpu *while other vcpus are still running*.

For the best outcome, you could offer a pvclock protocol v3 in which
there is literally just one pvti copy shared by all vcpus.

>
>> >> then shouldn't KVM just update system_time on
>> >> resume to whatever the guest would think it had (which I think would
>> >> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
>> >> shifted by some per-guest constant offset).
>> >>
>> >> --Andy
>> >
>> > Sure, you could add a correction to compensate and make sure
>> > the guest clock does not see time backwards.
>> >
>>
>> Could you help do that?  You understand the code far better than I do.
>
> Sure, you have to save the guests view of time (system_time + scaled tsc
> read) when suspending, and add an offset to get_kernel_ns() to
> compensate the effect of (1) when resuming.
>
> Does that make sense?

I think so.

--Andy

>
>> As it stands, it simply doesn't work on any system that suspends and
>> resumes (unless maybe the system has the upcoming Intel ART feature,
>> and I have no clue when that'll show up).
>>
>> --Andy



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-18  1:12                       ` Andy Lutomirski
@ 2015-12-18 11:47                         ` Marcelo Tosatti
  2015-12-18 19:27                           ` Andy Lutomirski
  0 siblings, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2015-12-18 11:47 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML

On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> >> >>
> >> >> >>
> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >> >> >>> >         RAW TSC                 NTP corrected TSC
> >> >> >>> > t0      10                      10
> >> >> >>> > t1      20                      19.99
> >> >> >>> > t2      30                      29.98
> >> >> >>> > t3      40                      39.97
> >> >> >>> > t4      50                      49.96
> >
> > (1)
> >
> >> >> >>> >
> >> >> >>> > ...
> >> >> >>> >
> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >> >> >>> > you can see what will happen.
> >> >> >>>
> >> >> >>> Sure, but why would you ever switch from one to the other?
> >> >> >>
> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> >> >> >> resume, the TSC certainly increases at the same rate as before, but the
> >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
> >> >> >> than the guest kvmclock.
> >> >> >
> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
> >> >> >
> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
> >> >> > backwards.
> >> >> >
> >> >> > If it's the guest's, then the guest's NTP correction is applied on top
> >> >> > of kvmclock, and this shouldn't matter.
> >> >> >
> >> >> > I still feel like I'm missing something very basic here.
> >> >> >
> >> >>
> >> >> OK, I think I get it.
> >> >>
> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
> >> >> correction to the guest.  If it did, indeed, propagate the correction
> >> >> then, after resume, the host's new system_time would match the guest's
> >> >> idea of it (after accounting for the guest's long nap), and I don't
> >> >> think there would be a problem.
> >> >> That being said, I can't find the code in the masterclock stuff that
> >> >> would actually do this.
> >> >
> >> > Guest clock is maintained by guest timekeeping code, which does:
> >> >
> >> > timer_interrupt()
> >> >         offset = read clocksource since last timer interrupt
> >> >         accumulate_to_systemclock(offset)
> >> >
> >> > The frequency correction of NTP in the host can be applied to
> >> > kvmclock, which will be visible to the guest
> >> > at "read clocksource since last timer interrupt"
> >> > (kvmclock_clocksource_read function).
> >>
> >> pvclock_clocksource_read?  That seems to do the same thing as all the
> >> other clocksource access functions.
> >>
> >> >
> >> > This does not mean that the NTP correction in the host is propagated
> >> > to the guests system clock directly.
> >> >
> >> > (For example, the guest can run NTP which is free to do further
> >> > adjustments at "accumulate_to_systemclock(offset)" time).
> >>
> >> Of course.  But I expected that, in the absence of NTP on the guest,
> >> that the guest would track the host's *corrected* time.
> >>
> >> >
> >> >> If, on the other hand, the host's NTP correction is not supposed to
> >> >> propagate to the guest,
> >> >
> >> > This is optional. There is a module option to control this, in fact.
> >> >
> >> > Its nice to have, because then you can execute a guest without NTP
> >> > (say without network connection), and have a kvmclock (kvmclock is a
> >> > clocksource, not a guest system clock) which is NTP corrected.
> >>
> >> Can you point to how this works?  I found kvm_guest_time_update, whch
> >> is called under circumstances that I haven't untangled.  I can't
> >> really tell what it's trying to do.
> >
> > Documentation/virtual/kvm/timekeeping.txt.
> >
> 
> That document is really long.  I skimmed it and found nothing.

kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set.

This happens when:
	- kvmclock is enabled or disabled by the guest.
	- periodically to propagate NTP correction to kvmclock clock.
	- guest vcpu switching between host pcpus when TSCs are out of sync.
	- after migration.
	- after savevm/loadvm.

> >> In any case, this still seems much more convoluted than it has to be.
> >> In the case in which the host has a stable TSC (tsc is selected in the
> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
> >> the time on the last few generations of CPUs, then the core
> >> timekeeping code is already exposing a linear function that's supposed
> >> to be used for monotonic, cpu-local access to a corrected nanosecond
> >> counter.  It's even in pretty much exactly the right form to pass
> >> through to the guest via pvclock in the gtod data.  Why doesn't KVM
> >> pass it through verbatim, updated in real time?  Is there some legacy
> >> reason that KVM must apply its own corrections and has to jump through
> >> hoops to pause vcpus when updating those vcpu's copies of the pvclock
> >> data?
> >
> > Read the comment on x86.c which starts with
> > " *
> >  * Assuming a stable TSC across physical CPUS, and a stable TSC
> >  * across virtual CPUs, the following condition is possible.
> >  * Each numbered line represents an event visible to both
> >  * CPUs at the next numbered event.
> > "
> 
> A couple things:
> 
> 1. That says: timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
> 
> but that's wrong, I think.  rdtsc is a function, not a number.

View it as a number, then its correct.

>  Shouldn't it be:
> 
> timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M))

Think "rdtsc" is one number (rdtsc0 = rdtsc1).

> which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M <
> N + (rdtsc1 - rdtsc0)?
> 
> That doesn't change the conclusion.
> 
> In any case, I'm not arguing that the concept of a master copy is
> unnecessary; I'm arguing that the implementation, the calculations,
> and the machinations in the code are all very, very complicated.  All
> that should be needed is to keep all of the vcpu pvti copies the same
> and to make sure that you can't ever have one vcpu see a new copy and
> then another vcpu see an old copy. 

Yes, you can't allow two vcpus to see a different copy of the pvti
structure.

>  You can do that by brute-force
> freezing all vcpus on an update (what happens now), or you could do it
> by just writing all of the copies at the same time from the same host
> cpu *while other vcpus are still running*.

Ok, can you do that and guarantee the first copy won't be seen by 
other vcpus? I don't know how.

> For the best outcome, you could offer a pvclock protocol v3 in which
> there is literally just one pvti copy shared by all vcpus.

Sure, lets write a more formal proposal?

> >> >> then shouldn't KVM just update system_time on
> >> >> resume to whatever the guest would think it had (which I think would
> >> >> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
> >> >> shifted by some per-guest constant offset).
> >> >>
> >> >> --Andy
> >> >
> >> > Sure, you could add a correction to compensate and make sure
> >> > the guest clock does not see time backwards.
> >> >
> >>
> >> Could you help do that?  You understand the code far better than I do.
> >
> > Sure, you have to save the guests view of time (system_time + scaled tsc
> > read) when suspending, and add an offset to get_kernel_ns() to
> > compensate the effect of (1) when resuming.
> >
> > Does that make sense?
> 
> I think so.
> 
> --Andy
> 
> >
> >> As it stands, it simply doesn't work on any system that suspends and
> >> resumes (unless maybe the system has the upcoming Intel ART feature,
> >> and I have no clue when that'll show up).
> >>
> >> --Andy
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-18 11:47                         ` Marcelo Tosatti
@ 2015-12-18 19:27                           ` Andy Lutomirski
  2015-12-18 19:45                             ` Marcelo Tosatti
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-18 19:27 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML

On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
>> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
>> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> >> >> >>
>> >> >> >>
>> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >> >> >>> >         RAW TSC                 NTP corrected TSC
>> >> >> >>> > t0      10                      10
>> >> >> >>> > t1      20                      19.99
>> >> >> >>> > t2      30                      29.98
>> >> >> >>> > t3      40                      39.97
>> >> >> >>> > t4      50                      49.96
>> >
>> > (1)
>> >
>> >> >> >>> >
>> >> >> >>> > ...
>> >> >> >>> >
>> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> >> >> >>> > you can see what will happen.
>> >> >> >>>
>> >> >> >>> Sure, but why would you ever switch from one to the other?
>> >> >> >>
>> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> >> >> >> resume, the TSC certainly increases at the same rate as before, but the
>> >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
>> >> >> >> than the guest kvmclock.
>> >> >> >
>> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
>> >> >> >
>> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
>> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
>> >> >> > backwards.
>> >> >> >
>> >> >> > If it's the guest's, then the guest's NTP correction is applied on top
>> >> >> > of kvmclock, and this shouldn't matter.
>> >> >> >
>> >> >> > I still feel like I'm missing something very basic here.
>> >> >> >
>> >> >>
>> >> >> OK, I think I get it.
>> >> >>
>> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
>> >> >> correction to the guest.  If it did, indeed, propagate the correction
>> >> >> then, after resume, the host's new system_time would match the guest's
>> >> >> idea of it (after accounting for the guest's long nap), and I don't
>> >> >> think there would be a problem.
>> >> >> That being said, I can't find the code in the masterclock stuff that
>> >> >> would actually do this.
>> >> >
>> >> > Guest clock is maintained by guest timekeeping code, which does:
>> >> >
>> >> > timer_interrupt()
>> >> >         offset = read clocksource since last timer interrupt
>> >> >         accumulate_to_systemclock(offset)
>> >> >
>> >> > The frequency correction of NTP in the host can be applied to
>> >> > kvmclock, which will be visible to the guest
>> >> > at "read clocksource since last timer interrupt"
>> >> > (kvmclock_clocksource_read function).
>> >>
>> >> pvclock_clocksource_read?  That seems to do the same thing as all the
>> >> other clocksource access functions.
>> >>
>> >> >
>> >> > This does not mean that the NTP correction in the host is propagated
>> >> > to the guests system clock directly.
>> >> >
>> >> > (For example, the guest can run NTP which is free to do further
>> >> > adjustments at "accumulate_to_systemclock(offset)" time).
>> >>
>> >> Of course.  But I expected that, in the absence of NTP on the guest,
>> >> that the guest would track the host's *corrected* time.
>> >>
>> >> >
>> >> >> If, on the other hand, the host's NTP correction is not supposed to
>> >> >> propagate to the guest,
>> >> >
>> >> > This is optional. There is a module option to control this, in fact.
>> >> >
>> >> > Its nice to have, because then you can execute a guest without NTP
>> >> > (say without network connection), and have a kvmclock (kvmclock is a
>> >> > clocksource, not a guest system clock) which is NTP corrected.
>> >>
>> >> Can you point to how this works?  I found kvm_guest_time_update, whch
>> >> is called under circumstances that I haven't untangled.  I can't
>> >> really tell what it's trying to do.
>> >
>> > Documentation/virtual/kvm/timekeeping.txt.
>> >
>>
>> That document is really long.  I skimmed it and found nothing.
>
> kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set.
>
> This happens when:
>         - kvmclock is enabled or disabled by the guest.
>         - periodically to propagate NTP correction to kvmclock clock.
>         - guest vcpu switching between host pcpus when TSCs are out of sync.
>         - after migration.
>         - after savevm/loadvm.
>
>> >> In any case, this still seems much more convoluted than it has to be.
>> >> In the case in which the host has a stable TSC (tsc is selected in the
>> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
>> >> the time on the last few generations of CPUs, then the core
>> >> timekeeping code is already exposing a linear function that's supposed
>> >> to be used for monotonic, cpu-local access to a corrected nanosecond
>> >> counter.  It's even in pretty much exactly the right form to pass
>> >> through to the guest via pvclock in the gtod data.  Why doesn't KVM
>> >> pass it through verbatim, updated in real time?  Is there some legacy
>> >> reason that KVM must apply its own corrections and has to jump through
>> >> hoops to pause vcpus when updating those vcpu's copies of the pvclock
>> >> data?
>> >
>> > Read the comment on x86.c which starts with
>> > " *
>> >  * Assuming a stable TSC across physical CPUS, and a stable TSC
>> >  * across virtual CPUs, the following condition is possible.
>> >  * Each numbered line represents an event visible to both
>> >  * CPUs at the next numbered event.
>> > "
>>
>> A couple things:
>>
>> 1. That says: timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
>>
>> but that's wrong, I think.  rdtsc is a function, not a number.
>
> View it as a number, then its correct.
>
>>  Shouldn't it be:
>>
>> timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M))
>
> Think "rdtsc" is one number (rdtsc0 = rdtsc1).
>
>> which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M <
>> N + (rdtsc1 - rdtsc0)?
>>
>> That doesn't change the conclusion.
>>
>> In any case, I'm not arguing that the concept of a master copy is
>> unnecessary; I'm arguing that the implementation, the calculations,
>> and the machinations in the code are all very, very complicated.  All
>> that should be needed is to keep all of the vcpu pvti copies the same
>> and to make sure that you can't ever have one vcpu see a new copy and
>> then another vcpu see an old copy.
>
> Yes, you can't allow two vcpus to see a different copy of the pvti
> structure.

Two options:

1. Pause all vcpus, then update all the pvti copies, then unpause all
vcpus.  This would work, but it's expensive.

2. Increment all the pvti version numbers, then update all of them,
then increment the version numbers again.

I think option 2 is a lot nicer than option 1.

>
>>  You can do that by brute-force
>> freezing all vcpus on an update (what happens now), or you could do it
>> by just writing all of the copies at the same time from the same host
>> cpu *while other vcpus are still running*.
>
> Ok, can you do that and guarantee the first copy won't be seen by
> other vcpus? I don't know how.
>
>> For the best outcome, you could offer a pvclock protocol v3 in which
>> there is literally just one pvti copy shared by all vcpus.
>
> Sure, lets write a more formal proposal?
>

I can try to sketch something out in the next week or two.  It would
be basically the same as the current protocol, except that there would
be a single pvti instead of an array.  In the case where the TSCs are
out of sync and the host can't synchronize them, then this mechanism
would return an error and the guest would have to fall back.

--Andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-18 19:27                           ` Andy Lutomirski
@ 2015-12-18 19:45                             ` Marcelo Tosatti
  2015-12-18 20:25                               ` Andy Lutomirski
  0 siblings, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2015-12-18 19:45 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML

On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote:
> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >> >> >> >>> >         RAW TSC                 NTP corrected TSC
> >> >> >> >>> > t0      10                      10
> >> >> >> >>> > t1      20                      19.99
> >> >> >> >>> > t2      30                      29.98
> >> >> >> >>> > t3      40                      39.97
> >> >> >> >>> > t4      50                      49.96
> >> >
> >> > (1)
> >> >
> >> >> >> >>> >
> >> >> >> >>> > ...
> >> >> >> >>> >
> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >> >> >> >>> > you can see what will happen.
> >> >> >> >>>
> >> >> >> >>> Sure, but why would you ever switch from one to the other?
> >> >> >> >>
> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> >> >> >> >> resume, the TSC certainly increases at the same rate as before, but the
> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
> >> >> >> >> than the guest kvmclock.
> >> >> >> >
> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
> >> >> >> >
> >> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
> >> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
> >> >> >> > backwards.
> >> >> >> >
> >> >> >> > If it's the guest's, then the guest's NTP correction is applied on top
> >> >> >> > of kvmclock, and this shouldn't matter.
> >> >> >> >
> >> >> >> > I still feel like I'm missing something very basic here.
> >> >> >> >
> >> >> >>
> >> >> >> OK, I think I get it.
> >> >> >>
> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
> >> >> >> correction to the guest.  If it did, indeed, propagate the correction
> >> >> >> then, after resume, the host's new system_time would match the guest's
> >> >> >> idea of it (after accounting for the guest's long nap), and I don't
> >> >> >> think there would be a problem.
> >> >> >> That being said, I can't find the code in the masterclock stuff that
> >> >> >> would actually do this.
> >> >> >
> >> >> > Guest clock is maintained by guest timekeeping code, which does:
> >> >> >
> >> >> > timer_interrupt()
> >> >> >         offset = read clocksource since last timer interrupt
> >> >> >         accumulate_to_systemclock(offset)
> >> >> >
> >> >> > The frequency correction of NTP in the host can be applied to
> >> >> > kvmclock, which will be visible to the guest
> >> >> > at "read clocksource since last timer interrupt"
> >> >> > (kvmclock_clocksource_read function).
> >> >>
> >> >> pvclock_clocksource_read?  That seems to do the same thing as all the
> >> >> other clocksource access functions.
> >> >>
> >> >> >
> >> >> > This does not mean that the NTP correction in the host is propagated
> >> >> > to the guests system clock directly.
> >> >> >
> >> >> > (For example, the guest can run NTP which is free to do further
> >> >> > adjustments at "accumulate_to_systemclock(offset)" time).
> >> >>
> >> >> Of course.  But I expected that, in the absence of NTP on the guest,
> >> >> that the guest would track the host's *corrected* time.
> >> >>
> >> >> >
> >> >> >> If, on the other hand, the host's NTP correction is not supposed to
> >> >> >> propagate to the guest,
> >> >> >
> >> >> > This is optional. There is a module option to control this, in fact.
> >> >> >
> >> >> > Its nice to have, because then you can execute a guest without NTP
> >> >> > (say without network connection), and have a kvmclock (kvmclock is a
> >> >> > clocksource, not a guest system clock) which is NTP corrected.
> >> >>
> >> >> Can you point to how this works?  I found kvm_guest_time_update, whch
> >> >> is called under circumstances that I haven't untangled.  I can't
> >> >> really tell what it's trying to do.
> >> >
> >> > Documentation/virtual/kvm/timekeeping.txt.
> >> >
> >>
> >> That document is really long.  I skimmed it and found nothing.
> >
> > kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set.
> >
> > This happens when:
> >         - kvmclock is enabled or disabled by the guest.
> >         - periodically to propagate NTP correction to kvmclock clock.
> >         - guest vcpu switching between host pcpus when TSCs are out of sync.
> >         - after migration.
> >         - after savevm/loadvm.
> >
> >> >> In any case, this still seems much more convoluted than it has to be.
> >> >> In the case in which the host has a stable TSC (tsc is selected in the
> >> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
> >> >> the time on the last few generations of CPUs, then the core
> >> >> timekeeping code is already exposing a linear function that's supposed
> >> >> to be used for monotonic, cpu-local access to a corrected nanosecond
> >> >> counter.  It's even in pretty much exactly the right form to pass
> >> >> through to the guest via pvclock in the gtod data.  Why doesn't KVM
> >> >> pass it through verbatim, updated in real time?  Is there some legacy
> >> >> reason that KVM must apply its own corrections and has to jump through
> >> >> hoops to pause vcpus when updating those vcpu's copies of the pvclock
> >> >> data?
> >> >
> >> > Read the comment on x86.c which starts with
> >> > " *
> >> >  * Assuming a stable TSC across physical CPUS, and a stable TSC
> >> >  * across virtual CPUs, the following condition is possible.
> >> >  * Each numbered line represents an event visible to both
> >> >  * CPUs at the next numbered event.
> >> > "
> >>
> >> A couple things:
> >>
> >> 1. That says: timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
> >>
> >> but that's wrong, I think.  rdtsc is a function, not a number.
> >
> > View it as a number, then its correct.
> >
> >>  Shouldn't it be:
> >>
> >> timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M))
> >
> > Think "rdtsc" is one number (rdtsc0 = rdtsc1).
> >
> >> which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M <
> >> N + (rdtsc1 - rdtsc0)?
> >>
> >> That doesn't change the conclusion.
> >>
> >> In any case, I'm not arguing that the concept of a master copy is
> >> unnecessary; I'm arguing that the implementation, the calculations,
> >> and the machinations in the code are all very, very complicated.  All
> >> that should be needed is to keep all of the vcpu pvti copies the same
> >> and to make sure that you can't ever have one vcpu see a new copy and
> >> then another vcpu see an old copy.
> >
> > Yes, you can't allow two vcpus to see a different copy of the pvti
> > structure.
> 
> Two options:
> 
> 1. Pause all vcpus, then update all the pvti copies, then unpause all
> vcpus.  This would work, but it's expensive.

This is what is done today.

> 
> 2. Increment all the pvti version numbers, then update all of them,
> then increment the version numbers again.
> 
> I think option 2 is a lot nicer than option 1.

Can you write an actual proposal (with details) that accomodates the
issue described at "Assuming a stable TSC across physical CPUS, and a
stable TSC" ?

Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
realtime guests.

> >>  You can do that by brute-force
> >> freezing all vcpus on an update (what happens now), or you could do it
> >> by just writing all of the copies at the same time from the same host
> >> cpu *while other vcpus are still running*.
> >
> > Ok, can you do that and guarantee the first copy won't be seen by
> > other vcpus? I don't know how.
> >
> >> For the best outcome, you could offer a pvclock protocol v3 in which
> >> there is literally just one pvti copy shared by all vcpus.
> >
> > Sure, lets write a more formal proposal?
> >
> 
> I can try to sketch something out in the next week or two.  It would
> be basically the same as the current protocol, except that there would
> be a single pvti instead of an array.  In the case where the TSCs are
> out of sync and the host can't synchronize them, then this mechanism
> would return an error and the guest would have to fall back.
> 
> --Andy

I'm fine with a single pvti.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-18 19:45                             ` Marcelo Tosatti
@ 2015-12-18 20:25                               ` Andy Lutomirski
  2015-12-18 21:49                                 ` Marcelo Tosatti
  2015-12-19  1:16                                 ` John Stultz
  0 siblings, 2 replies; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-18 20:25 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML, John Stultz

[cc: John Stultz -- maybe you have ideas on how this should best
integrate with the core code]

On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote:
>> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
>> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
>> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >> >> >> >>> >         RAW TSC                 NTP corrected TSC
>> >> >> >> >>> > t0      10                      10
>> >> >> >> >>> > t1      20                      19.99
>> >> >> >> >>> > t2      30                      29.98
>> >> >> >> >>> > t3      40                      39.97
>> >> >> >> >>> > t4      50                      49.96
>> >> >
>> >> > (1)
>> >> >
>> >> >> >> >>> >
>> >> >> >> >>> > ...
>> >> >> >> >>> >
>> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> >> >> >> >>> > you can see what will happen.
>> >> >> >> >>>
>> >> >> >> >>> Sure, but why would you ever switch from one to the other?
>> >> >> >> >>
>> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> >> >> >> >> resume, the TSC certainly increases at the same rate as before, but the
>> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
>> >> >> >> >> than the guest kvmclock.
>> >> >> >> >
>> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
>> >> >> >> >
>> >> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
>> >> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
>> >> >> >> > backwards.
>> >> >> >> >
>> >> >> >> > If it's the guest's, then the guest's NTP correction is applied on top
>> >> >> >> > of kvmclock, and this shouldn't matter.
>> >> >> >> >
>> >> >> >> > I still feel like I'm missing something very basic here.
>> >> >> >> >
>> >> >> >>
>> >> >> >> OK, I think I get it.
>> >> >> >>
>> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
>> >> >> >> correction to the guest.  If it did, indeed, propagate the correction
>> >> >> >> then, after resume, the host's new system_time would match the guest's
>> >> >> >> idea of it (after accounting for the guest's long nap), and I don't
>> >> >> >> think there would be a problem.
>> >> >> >> That being said, I can't find the code in the masterclock stuff that
>> >> >> >> would actually do this.
>> >> >> >
>> >> >> > Guest clock is maintained by guest timekeeping code, which does:
>> >> >> >
>> >> >> > timer_interrupt()
>> >> >> >         offset = read clocksource since last timer interrupt
>> >> >> >         accumulate_to_systemclock(offset)
>> >> >> >
>> >> >> > The frequency correction of NTP in the host can be applied to
>> >> >> > kvmclock, which will be visible to the guest
>> >> >> > at "read clocksource since last timer interrupt"
>> >> >> > (kvmclock_clocksource_read function).
>> >> >>
>> >> >> pvclock_clocksource_read?  That seems to do the same thing as all the
>> >> >> other clocksource access functions.
>> >> >>
>> >> >> >
>> >> >> > This does not mean that the NTP correction in the host is propagated
>> >> >> > to the guests system clock directly.
>> >> >> >
>> >> >> > (For example, the guest can run NTP which is free to do further
>> >> >> > adjustments at "accumulate_to_systemclock(offset)" time).
>> >> >>
>> >> >> Of course.  But I expected that, in the absence of NTP on the guest,
>> >> >> that the guest would track the host's *corrected* time.
>> >> >>
>> >> >> >
>> >> >> >> If, on the other hand, the host's NTP correction is not supposed to
>> >> >> >> propagate to the guest,
>> >> >> >
>> >> >> > This is optional. There is a module option to control this, in fact.
>> >> >> >
>> >> >> > Its nice to have, because then you can execute a guest without NTP
>> >> >> > (say without network connection), and have a kvmclock (kvmclock is a
>> >> >> > clocksource, not a guest system clock) which is NTP corrected.
>> >> >>
>> >> >> Can you point to how this works?  I found kvm_guest_time_update, whch
>> >> >> is called under circumstances that I haven't untangled.  I can't
>> >> >> really tell what it's trying to do.
>> >> >
>> >> > Documentation/virtual/kvm/timekeeping.txt.
>> >> >
>> >>
>> >> That document is really long.  I skimmed it and found nothing.
>> >
>> > kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set.
>> >
>> > This happens when:
>> >         - kvmclock is enabled or disabled by the guest.
>> >         - periodically to propagate NTP correction to kvmclock clock.
>> >         - guest vcpu switching between host pcpus when TSCs are out of sync.
>> >         - after migration.
>> >         - after savevm/loadvm.
>> >
>> >> >> In any case, this still seems much more convoluted than it has to be.
>> >> >> In the case in which the host has a stable TSC (tsc is selected in the
>> >> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
>> >> >> the time on the last few generations of CPUs, then the core
>> >> >> timekeeping code is already exposing a linear function that's supposed
>> >> >> to be used for monotonic, cpu-local access to a corrected nanosecond
>> >> >> counter.  It's even in pretty much exactly the right form to pass
>> >> >> through to the guest via pvclock in the gtod data.  Why doesn't KVM
>> >> >> pass it through verbatim, updated in real time?  Is there some legacy
>> >> >> reason that KVM must apply its own corrections and has to jump through
>> >> >> hoops to pause vcpus when updating those vcpu's copies of the pvclock
>> >> >> data?
>> >> >
>> >> > Read the comment on x86.c which starts with
>> >> > " *
>> >> >  * Assuming a stable TSC across physical CPUS, and a stable TSC
>> >> >  * across virtual CPUs, the following condition is possible.
>> >> >  * Each numbered line represents an event visible to both
>> >> >  * CPUs at the next numbered event.
>> >> > "
>> >>
>> >> A couple things:
>> >>
>> >> 1. That says: timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
>> >>
>> >> but that's wrong, I think.  rdtsc is a function, not a number.
>> >
>> > View it as a number, then its correct.
>> >
>> >>  Shouldn't it be:
>> >>
>> >> timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M))
>> >
>> > Think "rdtsc" is one number (rdtsc0 = rdtsc1).
>> >
>> >> which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M <
>> >> N + (rdtsc1 - rdtsc0)?
>> >>
>> >> That doesn't change the conclusion.
>> >>
>> >> In any case, I'm not arguing that the concept of a master copy is
>> >> unnecessary; I'm arguing that the implementation, the calculations,
>> >> and the machinations in the code are all very, very complicated.  All
>> >> that should be needed is to keep all of the vcpu pvti copies the same
>> >> and to make sure that you can't ever have one vcpu see a new copy and
>> >> then another vcpu see an old copy.
>> >
>> > Yes, you can't allow two vcpus to see a different copy of the pvti
>> > structure.
>>
>> Two options:
>>
>> 1. Pause all vcpus, then update all the pvti copies, then unpause all
>> vcpus.  This would work, but it's expensive.
>
> This is what is done today.
>
>>
>> 2. Increment all the pvti version numbers, then update all of them,
>> then increment the version numbers again.
>>
>> I think option 2 is a lot nicer than option 1.
>
> Can you write an actual proposal (with details) that accomodates the
> issue described at "Assuming a stable TSC across physical CPUS, and a
> stable TSC" ?
>
> Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
> realtime guests.

This shouldn't require many details, and I don't think there's an ABI
change.  The rules are:

When the overall system timebase changes (e.g. when the selected
clocksource changes or when update_pvclock_gtod is called), the KVM
host would:

optionally: preempt_disable();  /* for performance */

for all vms {

  for all registered pvti structures {
    pvti->version++;  /* should be odd now */
  }

  /* Note: right now, any vcpu that tries to access pvti will start
infinite looping.  We should add cpu_relax() to the guests. */

  for all registered pvti structures {
    update everything except pvti->version;
  }

  for all registered pvti structures {
    pvti->version++;  /* should be even now */
  }

  cond_resched();
}

Is this enough detail?  This should work with all existing guests,
too, unless there's a buggy guest out there that actually fails to
double-check version.

The only reason for an ABI change that I can see would be to arrange
to have only one pvti in the loop.  (And, if we changed the ABI, let's
also get rid of one of the redundant addend fields by promising to set
it to zero while we're at it.)  Actually, if we changed the ABI, we
might want to do it more like traditional IO.  The host would set up a
single page shared by all guests that has all PVTI data, and all of
the guests would map it in.  This would save memory and avoid a loop,
at least if no legacy guests are around.

--Andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-18 20:25                               ` Andy Lutomirski
@ 2015-12-18 21:49                                 ` Marcelo Tosatti
  2015-12-21 22:49                                   ` Andy Lutomirski
  2015-12-19  1:16                                 ` John Stultz
  1 sibling, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2015-12-18 21:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML, John Stultz

On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote:
> [cc: John Stultz -- maybe you have ideas on how this should best
> integrate with the core code]
> 
> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote:
> >> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
> >> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
> >> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> >> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >> >> >> >> >>> >         RAW TSC                 NTP corrected TSC
> >> >> >> >> >>> > t0      10                      10
> >> >> >> >> >>> > t1      20                      19.99
> >> >> >> >> >>> > t2      30                      29.98
> >> >> >> >> >>> > t3      40                      39.97
> >> >> >> >> >>> > t4      50                      49.96
> >> >> >
> >> >> > (1)
> >> >> >
> >> >> >> >> >>> >
> >> >> >> >> >>> > ...
> >> >> >> >> >>> >
> >> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >> >> >> >> >>> > you can see what will happen.
> >> >> >> >> >>>
> >> >> >> >> >>> Sure, but why would you ever switch from one to the other?
> >> >> >> >> >>
> >> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> >> >> >> >> >> resume, the TSC certainly increases at the same rate as before, but the
> >> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
> >> >> >> >> >> than the guest kvmclock.
> >> >> >> >> >
> >> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
> >> >> >> >> >
> >> >> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
> >> >> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
> >> >> >> >> > backwards.
> >> >> >> >> >
> >> >> >> >> > If it's the guest's, then the guest's NTP correction is applied on top
> >> >> >> >> > of kvmclock, and this shouldn't matter.
> >> >> >> >> >
> >> >> >> >> > I still feel like I'm missing something very basic here.
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> OK, I think I get it.
> >> >> >> >>
> >> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
> >> >> >> >> correction to the guest.  If it did, indeed, propagate the correction
> >> >> >> >> then, after resume, the host's new system_time would match the guest's
> >> >> >> >> idea of it (after accounting for the guest's long nap), and I don't
> >> >> >> >> think there would be a problem.
> >> >> >> >> That being said, I can't find the code in the masterclock stuff that
> >> >> >> >> would actually do this.
> >> >> >> >
> >> >> >> > Guest clock is maintained by guest timekeeping code, which does:
> >> >> >> >
> >> >> >> > timer_interrupt()
> >> >> >> >         offset = read clocksource since last timer interrupt
> >> >> >> >         accumulate_to_systemclock(offset)
> >> >> >> >
> >> >> >> > The frequency correction of NTP in the host can be applied to
> >> >> >> > kvmclock, which will be visible to the guest
> >> >> >> > at "read clocksource since last timer interrupt"
> >> >> >> > (kvmclock_clocksource_read function).
> >> >> >>
> >> >> >> pvclock_clocksource_read?  That seems to do the same thing as all the
> >> >> >> other clocksource access functions.
> >> >> >>
> >> >> >> >
> >> >> >> > This does not mean that the NTP correction in the host is propagated
> >> >> >> > to the guests system clock directly.
> >> >> >> >
> >> >> >> > (For example, the guest can run NTP which is free to do further
> >> >> >> > adjustments at "accumulate_to_systemclock(offset)" time).
> >> >> >>
> >> >> >> Of course.  But I expected that, in the absence of NTP on the guest,
> >> >> >> that the guest would track the host's *corrected* time.
> >> >> >>
> >> >> >> >
> >> >> >> >> If, on the other hand, the host's NTP correction is not supposed to
> >> >> >> >> propagate to the guest,
> >> >> >> >
> >> >> >> > This is optional. There is a module option to control this, in fact.
> >> >> >> >
> >> >> >> > Its nice to have, because then you can execute a guest without NTP
> >> >> >> > (say without network connection), and have a kvmclock (kvmclock is a
> >> >> >> > clocksource, not a guest system clock) which is NTP corrected.
> >> >> >>
> >> >> >> Can you point to how this works?  I found kvm_guest_time_update, whch
> >> >> >> is called under circumstances that I haven't untangled.  I can't
> >> >> >> really tell what it's trying to do.
> >> >> >
> >> >> > Documentation/virtual/kvm/timekeeping.txt.
> >> >> >
> >> >>
> >> >> That document is really long.  I skimmed it and found nothing.
> >> >
> >> > kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set.
> >> >
> >> > This happens when:
> >> >         - kvmclock is enabled or disabled by the guest.
> >> >         - periodically to propagate NTP correction to kvmclock clock.
> >> >         - guest vcpu switching between host pcpus when TSCs are out of sync.
> >> >         - after migration.
> >> >         - after savevm/loadvm.
> >> >
> >> >> >> In any case, this still seems much more convoluted than it has to be.
> >> >> >> In the case in which the host has a stable TSC (tsc is selected in the
> >> >> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
> >> >> >> the time on the last few generations of CPUs, then the core
> >> >> >> timekeeping code is already exposing a linear function that's supposed
> >> >> >> to be used for monotonic, cpu-local access to a corrected nanosecond
> >> >> >> counter.  It's even in pretty much exactly the right form to pass
> >> >> >> through to the guest via pvclock in the gtod data.  Why doesn't KVM
> >> >> >> pass it through verbatim, updated in real time?  Is there some legacy
> >> >> >> reason that KVM must apply its own corrections and has to jump through
> >> >> >> hoops to pause vcpus when updating those vcpu's copies of the pvclock
> >> >> >> data?
> >> >> >
> >> >> > Read the comment on x86.c which starts with
> >> >> > " *
> >> >> >  * Assuming a stable TSC across physical CPUS, and a stable TSC
> >> >> >  * across virtual CPUs, the following condition is possible.
> >> >> >  * Each numbered line represents an event visible to both
> >> >> >  * CPUs at the next numbered event.
> >> >> > "
> >> >>
> >> >> A couple things:
> >> >>
> >> >> 1. That says: timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
> >> >>
> >> >> but that's wrong, I think.  rdtsc is a function, not a number.
> >> >
> >> > View it as a number, then its correct.
> >> >
> >> >>  Shouldn't it be:
> >> >>
> >> >> timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M))
> >> >
> >> > Think "rdtsc" is one number (rdtsc0 = rdtsc1).
> >> >
> >> >> which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M <
> >> >> N + (rdtsc1 - rdtsc0)?
> >> >>
> >> >> That doesn't change the conclusion.
> >> >>
> >> >> In any case, I'm not arguing that the concept of a master copy is
> >> >> unnecessary; I'm arguing that the implementation, the calculations,
> >> >> and the machinations in the code are all very, very complicated.  All
> >> >> that should be needed is to keep all of the vcpu pvti copies the same
> >> >> and to make sure that you can't ever have one vcpu see a new copy and
> >> >> then another vcpu see an old copy.
> >> >
> >> > Yes, you can't allow two vcpus to see a different copy of the pvti
> >> > structure.
> >>
> >> Two options:
> >>
> >> 1. Pause all vcpus, then update all the pvti copies, then unpause all
> >> vcpus.  This would work, but it's expensive.
> >
> > This is what is done today.
> >
> >>
> >> 2. Increment all the pvti version numbers, then update all of them,
> >> then increment the version numbers again.
> >>
> >> I think option 2 is a lot nicer than option 1.
> >
> > Can you write an actual proposal (with details) that accomodates the
> > issue described at "Assuming a stable TSC across physical CPUS, and a
> > stable TSC" ?
> >
> > Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
> > realtime guests.
> 
> This shouldn't require many details, and I don't think there's an ABI
> change.  The rules are:
> 
> When the overall system timebase changes (e.g. when the selected
> clocksource changes or when update_pvclock_gtod is called), the KVM
> host would:
> 
> optionally: preempt_disable();  /* for performance */
> 
> for all vms {
> 
>   for all registered pvti structures {
>     pvti->version++;  /* should be odd now */
>   }

pvti is userspace data, so you have to pin it before?

>   /* Note: right now, any vcpu that tries to access pvti will start
> infinite looping.  We should add cpu_relax() to the guests. */
> 
>   for all registered pvti structures {
>     update everything except pvti->version;
>   }
> 
>   for all registered pvti structures {
>     pvti->version++;  /* should be even now */
>   }
> 
>   cond_resched();
> }
> 
> Is this enough detail?  This should work with all existing guests,
> too, unless there's a buggy guest out there that actually fails to
> double-check version.

What is the advantage of this over the brute force method, given 
that guests will busy spin?

(busy spin is equally problematic as IPI for realtime guests).

> The only reason for an ABI change that I can see would be to arrange
> to have only one pvti in the loop.  (And, if we changed the ABI, let's
> also get rid of one of the redundant addend fields by promising to set
> it to zero while we're at it.)  Actually, if we changed the ABI, we
> might want to do it more like traditional IO.  The host would set up a
> single page shared by all guests that has all PVTI data, and all of
> the guests would map it in.  This would save memory and avoid a loop,
> at least if no legacy guests are around.
> 
> --Andy




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-18 20:25                               ` Andy Lutomirski
  2015-12-18 21:49                                 ` Marcelo Tosatti
@ 2015-12-19  1:16                                 ` John Stultz
  1 sibling, 0 replies; 34+ messages in thread
From: John Stultz @ 2015-12-19  1:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Marcelo Tosatti, Paolo Bonzini, kvm list, Radim Krcmar, X86 ML

On Fri, Dec 18, 2015 at 12:25 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> [cc: John Stultz -- maybe you have ideas on how this should best
> integrate with the core code]
>
> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote:
>>> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
>>> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
>>> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>>> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>>> >> >> >> >>> >         RAW TSC                 NTP corrected TSC
>>> >> >> >> >>> > t0      10                      10
>>> >> >> >> >>> > t1      20                      19.99
>>> >> >> >> >>> > t2      30                      29.98
>>> >> >> >> >>> > t3      40                      39.97
>>> >> >> >> >>> > t4      50                      49.96
>>> >> >
>>> >> > (1)
>>> >> >
>>> >> >> >> >>> >
>>> >> >> >> >>> > ...
>>> >> >> >> >>> >
>>> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>>> >> >> >> >>> > you can see what will happen.
>>> >> >> >> >>>
>>> >> >> >> >>> Sure, but why would you ever switch from one to the other?
>>> >> >> >> >>
>>> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>>> >> >> >> >> resume, the TSC certainly increases at the same rate as before, but the
>>> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
>>> >> >> >> >> than the guest kvmclock.
>>> >> >> >> >
>>> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
>>> >> >> >> >
>>> >> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
>>> >> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
>>> >> >> >> > backwards.
>>> >> >> >> >
>>> >> >> >> > If it's the guest's, then the guest's NTP correction is applied on top
>>> >> >> >> > of kvmclock, and this shouldn't matter.
>>> >> >> >> >
>>> >> >> >> > I still feel like I'm missing something very basic here.
>>> >> >> >> >
>>> >> >> >>
>>> >> >> >> OK, I think I get it.
>>> >> >> >>
>>> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
>>> >> >> >> correction to the guest.  If it did, indeed, propagate the correction
>>> >> >> >> then, after resume, the host's new system_time would match the guest's
>>> >> >> >> idea of it (after accounting for the guest's long nap), and I don't
>>> >> >> >> think there would be a problem.
>>> >> >> >> That being said, I can't find the code in the masterclock stuff that
>>> >> >> >> would actually do this.
>>> >> >> >
>>> >> >> > Guest clock is maintained by guest timekeeping code, which does:
>>> >> >> >
>>> >> >> > timer_interrupt()
>>> >> >> >         offset = read clocksource since last timer interrupt
>>> >> >> >         accumulate_to_systemclock(offset)
>>> >> >> >
>>> >> >> > The frequency correction of NTP in the host can be applied to
>>> >> >> > kvmclock, which will be visible to the guest
>>> >> >> > at "read clocksource since last timer interrupt"
>>> >> >> > (kvmclock_clocksource_read function).
>>> >> >>
>>> >> >> pvclock_clocksource_read?  That seems to do the same thing as all the
>>> >> >> other clocksource access functions.
>>> >> >>
>>> >> >> >
>>> >> >> > This does not mean that the NTP correction in the host is propagated
>>> >> >> > to the guests system clock directly.
>>> >> >> >
>>> >> >> > (For example, the guest can run NTP which is free to do further
>>> >> >> > adjustments at "accumulate_to_systemclock(offset)" time).
>>> >> >>
>>> >> >> Of course.  But I expected that, in the absence of NTP on the guest,
>>> >> >> that the guest would track the host's *corrected* time.
>>> >> >>
>>> >> >> >
>>> >> >> >> If, on the other hand, the host's NTP correction is not supposed to
>>> >> >> >> propagate to the guest,
>>> >> >> >
>>> >> >> > This is optional. There is a module option to control this, in fact.
>>> >> >> >
>>> >> >> > Its nice to have, because then you can execute a guest without NTP
>>> >> >> > (say without network connection), and have a kvmclock (kvmclock is a
>>> >> >> > clocksource, not a guest system clock) which is NTP corrected.
>>> >> >>
>>> >> >> Can you point to how this works?  I found kvm_guest_time_update, whch
>>> >> >> is called under circumstances that I haven't untangled.  I can't
>>> >> >> really tell what it's trying to do.
>>> >> >
>>> >> > Documentation/virtual/kvm/timekeeping.txt.
>>> >> >
>>> >>
>>> >> That document is really long.  I skimmed it and found nothing.
>>> >
>>> > kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set.
>>> >
>>> > This happens when:
>>> >         - kvmclock is enabled or disabled by the guest.
>>> >         - periodically to propagate NTP correction to kvmclock clock.
>>> >         - guest vcpu switching between host pcpus when TSCs are out of sync.
>>> >         - after migration.
>>> >         - after savevm/loadvm.
>>> >
>>> >> >> In any case, this still seems much more convoluted than it has to be.
>>> >> >> In the case in which the host has a stable TSC (tsc is selected in the
>>> >> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
>>> >> >> the time on the last few generations of CPUs, then the core
>>> >> >> timekeeping code is already exposing a linear function that's supposed
>>> >> >> to be used for monotonic, cpu-local access to a corrected nanosecond
>>> >> >> counter.  It's even in pretty much exactly the right form to pass
>>> >> >> through to the guest via pvclock in the gtod data.  Why doesn't KVM
>>> >> >> pass it through verbatim, updated in real time?  Is there some legacy
>>> >> >> reason that KVM must apply its own corrections and has to jump through
>>> >> >> hoops to pause vcpus when updating those vcpu's copies of the pvclock
>>> >> >> data?
>>> >> >
>>> >> > Read the comment on x86.c which starts with
>>> >> > " *
>>> >> >  * Assuming a stable TSC across physical CPUS, and a stable TSC
>>> >> >  * across virtual CPUs, the following condition is possible.
>>> >> >  * Each numbered line represents an event visible to both
>>> >> >  * CPUs at the next numbered event.
>>> >> > "
>>> >>
>>> >> A couple things:
>>> >>
>>> >> 1. That says: timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
>>> >>
>>> >> but that's wrong, I think.  rdtsc is a function, not a number.
>>> >
>>> > View it as a number, then its correct.
>>> >
>>> >>  Shouldn't it be:
>>> >>
>>> >> timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M))
>>> >
>>> > Think "rdtsc" is one number (rdtsc0 = rdtsc1).
>>> >
>>> >> which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M <
>>> >> N + (rdtsc1 - rdtsc0)?
>>> >>
>>> >> That doesn't change the conclusion.
>>> >>
>>> >> In any case, I'm not arguing that the concept of a master copy is
>>> >> unnecessary; I'm arguing that the implementation, the calculations,
>>> >> and the machinations in the code are all very, very complicated.  All
>>> >> that should be needed is to keep all of the vcpu pvti copies the same
>>> >> and to make sure that you can't ever have one vcpu see a new copy and
>>> >> then another vcpu see an old copy.
>>> >
>>> > Yes, you can't allow two vcpus to see a different copy of the pvti
>>> > structure.
>>>
>>> Two options:
>>>
>>> 1. Pause all vcpus, then update all the pvti copies, then unpause all
>>> vcpus.  This would work, but it's expensive.
>>
>> This is what is done today.
>>
>>>
>>> 2. Increment all the pvti version numbers, then update all of them,
>>> then increment the version numbers again.
>>>
>>> I think option 2 is a lot nicer than option 1.
>>
>> Can you write an actual proposal (with details) that accomodates the
>> issue described at "Assuming a stable TSC across physical CPUS, and a
>> stable TSC" ?
>>
>> Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
>> realtime guests.
>
> This shouldn't require many details, and I don't think there's an ABI
> change.  The rules are:
>
> When the overall system timebase changes (e.g. when the selected
> clocksource changes or when update_pvclock_gtod is called), the KVM
> host would:
>
> optionally: preempt_disable();  /* for performance */
>
> for all vms {
>
>   for all registered pvti structures {
>     pvti->version++;  /* should be odd now */
>   }
>
>   /* Note: right now, any vcpu that tries to access pvti will start
> infinite looping.  We should add cpu_relax() to the guests. */
>
>   for all registered pvti structures {
>     update everything except pvti->version;
>   }
>
>   for all registered pvti structures {
>     pvti->version++;  /* should be even now */
>   }
>
>   cond_resched();
> }
>
> Is this enough detail?  This should work with all existing guests,
> too, unless there's a buggy guest out there that actually fails to
> double-check version.
>
> The only reason for an ABI change that I can see would be to arrange
> to have only one pvti in the loop.  (And, if we changed the ABI, let's
> also get rid of one of the redundant addend fields by promising to set
> it to zero while we're at it.)  Actually, if we changed the ABI, we
> might want to do it more like traditional IO.  The host would set up a
> single page shared by all guests that has all PVTI data, and all of
> the guests would map it in.  This would save memory and avoid a loop,
> at least if no legacy guests are around.

So honestly, I'm a little bit lost here. The various discussion about
RAW TSC and NTP adjusted TSC doesn't seem sane to me (I suspect
there's some terminology confusion). That said, I've never followed
the kvm-clock logic very closely (most of the logic is kept behind the
clocksource abstraction), and it has always seemed quite complex to
me, so I may just be missing something.

So sorry, I'm not sure if I've got much to add here. :(

thanks
-john

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-18 21:49                                 ` Marcelo Tosatti
@ 2015-12-21 22:49                                   ` Andy Lutomirski
  2015-12-23 19:27                                     ` Marcelo Tosatti
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-21 22:49 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML, John Stultz

On Fri, Dec 18, 2015 at 1:49 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote:
>> [cc: John Stultz -- maybe you have ideas on how this should best
>> integrate with the core code]
>>
>> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

>> > Can you write an actual proposal (with details) that accomodates the
>> > issue described at "Assuming a stable TSC across physical CPUS, and a
>> > stable TSC" ?
>> >
>> > Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
>> > realtime guests.
>>
>> This shouldn't require many details, and I don't think there's an ABI
>> change.  The rules are:
>>
>> When the overall system timebase changes (e.g. when the selected
>> clocksource changes or when update_pvclock_gtod is called), the KVM
>> host would:
>>
>> optionally: preempt_disable();  /* for performance */
>>
>> for all vms {
>>
>>   for all registered pvti structures {
>>     pvti->version++;  /* should be odd now */
>>   }
>
> pvti is userspace data, so you have to pin it before?

Yes.

Fortunately, most systems probably only have one page of pvti
structures, I think (unless there are a ton of vcpus), so the
performance impact should be negligible.

>
>>   /* Note: right now, any vcpu that tries to access pvti will start
>> infinite looping.  We should add cpu_relax() to the guests. */
>>
>>   for all registered pvti structures {
>>     update everything except pvti->version;
>>   }
>>
>>   for all registered pvti structures {
>>     pvti->version++;  /* should be even now */
>>   }
>>
>>   cond_resched();
>> }
>>
>> Is this enough detail?  This should work with all existing guests,
>> too, unless there's a buggy guest out there that actually fails to
>> double-check version.
>
> What is the advantage of this over the brute force method, given
> that guests will busy spin?
>
> (busy spin is equally problematic as IPI for realtime guests).

I disagree.  It's never been safe to call clock_gettime from an RT
task and expect a guarantee of real-time performance.  We could fix
that, but it's not even safe on non-KVM.

Sending an IPI *always* stalls the task.  Taking a lock (which is
effectively what this is doing) only stalls the tasks that contend for
the lock, which, most of the time, means that nothing stalls.

Also, if the host disables preemption or otherwise boosts its priority
while version is odd, then the actual stall will be very short, in
contrast to an IPI-induced stall, which will be much, much longer.

--Andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-21 22:49                                   ` Andy Lutomirski
@ 2015-12-23 19:27                                     ` Marcelo Tosatti
  2015-12-23 23:09                                       ` Andy Lutomirski
  0 siblings, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2015-12-23 19:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML, John Stultz

On Mon, Dec 21, 2015 at 02:49:25PM -0800, Andy Lutomirski wrote:
> On Fri, Dec 18, 2015 at 1:49 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote:
> >> [cc: John Stultz -- maybe you have ideas on how this should best
> >> integrate with the core code]
> >>
> >> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> >> > Can you write an actual proposal (with details) that accomodates the
> >> > issue described at "Assuming a stable TSC across physical CPUS, and a
> >> > stable TSC" ?
> >> >
> >> > Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
> >> > realtime guests.
> >>
> >> This shouldn't require many details, and I don't think there's an ABI
> >> change.  The rules are:
> >>
> >> When the overall system timebase changes (e.g. when the selected
> >> clocksource changes or when update_pvclock_gtod is called), the KVM
> >> host would:
> >>
> >> optionally: preempt_disable();  /* for performance */
> >>
> >> for all vms {
> >>
> >>   for all registered pvti structures {
> >>     pvti->version++;  /* should be odd now */
> >>   }
> >
> > pvti is userspace data, so you have to pin it before?
> 
> Yes.
> 
> Fortunately, most systems probably only have one page of pvti
> structures, I think (unless there are a ton of vcpus), so the
> performance impact should be negligible.
> 
> >
> >>   /* Note: right now, any vcpu that tries to access pvti will start
> >> infinite looping.  We should add cpu_relax() to the guests. */
> >>
> >>   for all registered pvti structures {
> >>     update everything except pvti->version;
> >>   }
> >>
> >>   for all registered pvti structures {
> >>     pvti->version++;  /* should be even now */
> >>   }
> >>
> >>   cond_resched();
> >> }
> >>
> >> Is this enough detail?  This should work with all existing guests,
> >> too, unless there's a buggy guest out there that actually fails to
> >> double-check version.
> >
> > What is the advantage of this over the brute force method, given
> > that guests will busy spin?
> >
> > (busy spin is equally problematic as IPI for realtime guests).
> 
> I disagree.  It's never been safe to call clock_gettime from an RT
> task and expect a guarantee of real-time performance.  We could fix
> that, but it's not even safe on non-KVM.

The problem is how long the IPI (or busy spinning in case of version
above) interrupts the vcpu.

> Sending an IPI *always* stalls the task.  Taking a lock (which is
> effectively what this is doing) only stalls the tasks that contend for
> the lock, which, most of the time, means that nothing stalls.
> 
> Also, if the host disables preemption or otherwise boosts its priority
> while version is odd, then the actual stall will be very short, in
> contrast to an IPI-induced stall, which will be much, much longer.
> 
> --Andy

1) The updates are rare.
2) There are no user complaints about the IPI mechanism.

Don't see a reason to change this.

For the suspend issue, though, there are complaints (guests on 
laptops which fail to use masterclock). 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: kvmclock doesn't work, help?
  2015-12-23 19:27                                     ` Marcelo Tosatti
@ 2015-12-23 23:09                                       ` Andy Lutomirski
  0 siblings, 0 replies; 34+ messages in thread
From: Andy Lutomirski @ 2015-12-23 23:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, kvm list, Radim Krcmar, X86 ML, John Stultz

On Wed, Dec 23, 2015 at 11:27 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Mon, Dec 21, 2015 at 02:49:25PM -0800, Andy Lutomirski wrote:
>> On Fri, Dec 18, 2015 at 1:49 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > (busy spin is equally problematic as IPI for realtime guests).
>>
>> I disagree.  It's never been safe to call clock_gettime from an RT
>> task and expect a guarantee of real-time performance.  We could fix
>> that, but it's not even safe on non-KVM.
>
> The problem is how long the IPI (or busy spinning in case of version
> above) interrupts the vcpu.

The busy spin should be a few hundred cycles in the very worst case (a
couple of remote cache misses timed such that the guest is spinning
the whole time).  The IPI is always thousands of cycles no matter what
the guest is doing.

>
>> Sending an IPI *always* stalls the task.  Taking a lock (which is
>> effectively what this is doing) only stalls the tasks that contend for
>> the lock, which, most of the time, means that nothing stalls.
>>
>> Also, if the host disables preemption or otherwise boosts its priority
>> while version is odd, then the actual stall will be very short, in
>> contrast to an IPI-induced stall, which will be much, much longer.
>>
>> --Andy
>
> 1) The updates are rare.
> 2) There are no user complaints about the IPI mechanism.

If KVM ever starts directly propagating corrected time
(CLOCK_MONOTONIC, for example), then the updates won't be rare.

Maybe I'll try to instrument this.

--Andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2015-12-23 23:09 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-09 21:10 kvmclock doesn't work, help? Andy Lutomirski
2015-12-09 21:16 ` Paolo Bonzini
2015-12-09 21:49   ` Andy Lutomirski
2015-12-09 22:12     ` Paolo Bonzini
2015-12-09 22:27       ` Andy Lutomirski
2015-12-09 22:42         ` Paolo Bonzini
2015-12-09 22:43         ` Andy Lutomirski
2015-12-10 21:33         ` Marcelo Tosatti
2015-12-10 21:32 ` Marcelo Tosatti
2015-12-11 21:57   ` Andy Lutomirski
2015-12-11 23:48     ` Marcelo Tosatti
2015-12-14 18:07       ` Andy Lutomirski
2015-12-14 21:47         ` Marcelo Tosatti
2015-12-14 13:44     ` Paolo Bonzini
2015-12-14 22:00       ` Marcelo Tosatti
2015-12-14 22:31         ` Andy Lutomirski
2015-12-14 22:38           ` Marcelo Tosatti
2015-12-15  8:42           ` Paolo Bonzini
2015-12-16 17:48             ` Andy Lutomirski
2015-12-16 18:17               ` Andy Lutomirski
2015-12-16 21:57                 ` Marcelo Tosatti
2015-12-17 16:33                   ` Andy Lutomirski
2015-12-17 19:08                     ` Marcelo Tosatti
2015-12-18  1:12                       ` Andy Lutomirski
2015-12-18 11:47                         ` Marcelo Tosatti
2015-12-18 19:27                           ` Andy Lutomirski
2015-12-18 19:45                             ` Marcelo Tosatti
2015-12-18 20:25                               ` Andy Lutomirski
2015-12-18 21:49                                 ` Marcelo Tosatti
2015-12-21 22:49                                   ` Andy Lutomirski
2015-12-23 19:27                                     ` Marcelo Tosatti
2015-12-23 23:09                                       ` Andy Lutomirski
2015-12-19  1:16                                 ` John Stultz
2015-12-10 21:36 ` Marcelo Tosatti

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.