Re: kvmclock doesn't work, help?

From: Andy Lutomirski <luto@amacapital.net>
To: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm list <kvm@vger.kernel.org>, Radim Krcmar <rkrcmar@redhat.com>,
	X86 ML <x86@kernel.org>, John Stultz <john.stultz@linaro.org>
Subject: Re: kvmclock doesn't work, help?
Date: Fri, 18 Dec 2015 12:25:11 -0800	[thread overview]
Message-ID: <CALCETrUV-BV0HzifaSsD17qhXRVLQ7zTu2N_8pc69ODbB6bsEQ@mail.gmail.com> (raw)
In-Reply-To: <20151218194538.GC28381@amt.cnet>

[cc: John Stultz -- maybe you have ideas on how this should best
integrate with the core code]

On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote:
>> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
>> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
>> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >> >> >> >>> >         RAW TSC                 NTP corrected TSC
>> >> >> >> >>> > t0      10                      10
>> >> >> >> >>> > t1      20                      19.99
>> >> >> >> >>> > t2      30                      29.98
>> >> >> >> >>> > t3      40                      39.97
>> >> >> >> >>> > t4      50                      49.96
>> >> >
>> >> > (1)
>> >> >
>> >> >> >> >>> >
>> >> >> >> >>> > ...
>> >> >> >> >>> >
>> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> >> >> >> >>> > you can see what will happen.
>> >> >> >> >>>
>> >> >> >> >>> Sure, but why would you ever switch from one to the other?
>> >> >> >> >>
>> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> >> >> >> >> resume, the TSC certainly increases at the same rate as before, but the
>> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
>> >> >> >> >> than the guest kvmclock.
>> >> >> >> >
>> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
>> >> >> >> >
>> >> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
>> >> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
>> >> >> >> > backwards.
>> >> >> >> >
>> >> >> >> > If it's the guest's, then the guest's NTP correction is applied on top
>> >> >> >> > of kvmclock, and this shouldn't matter.
>> >> >> >> >
>> >> >> >> > I still feel like I'm missing something very basic here.
>> >> >> >> >
>> >> >> >>
>> >> >> >> OK, I think I get it.
>> >> >> >>
>> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
>> >> >> >> correction to the guest.  If it did, indeed, propagate the correction
>> >> >> >> then, after resume, the host's new system_time would match the guest's
>> >> >> >> idea of it (after accounting for the guest's long nap), and I don't
>> >> >> >> think there would be a problem.
>> >> >> >> That being said, I can't find the code in the masterclock stuff that
>> >> >> >> would actually do this.
>> >> >> >
>> >> >> > Guest clock is maintained by guest timekeeping code, which does:
>> >> >> >
>> >> >> > timer_interrupt()
>> >> >> >         offset = read clocksource since last timer interrupt
>> >> >> >         accumulate_to_systemclock(offset)
>> >> >> >
>> >> >> > The frequency correction of NTP in the host can be applied to
>> >> >> > kvmclock, which will be visible to the guest
>> >> >> > at "read clocksource since last timer interrupt"
>> >> >> > (kvmclock_clocksource_read function).
>> >> >>
>> >> >> pvclock_clocksource_read?  That seems to do the same thing as all the
>> >> >> other clocksource access functions.
>> >> >>
>> >> >> >
>> >> >> > This does not mean that the NTP correction in the host is propagated
>> >> >> > to the guests system clock directly.
>> >> >> >
>> >> >> > (For example, the guest can run NTP which is free to do further
>> >> >> > adjustments at "accumulate_to_systemclock(offset)" time).
>> >> >>
>> >> >> Of course.  But I expected that, in the absence of NTP on the guest,
>> >> >> that the guest would track the host's *corrected* time.
>> >> >>
>> >> >> >
>> >> >> >> If, on the other hand, the host's NTP correction is not supposed to
>> >> >> >> propagate to the guest,
>> >> >> >
>> >> >> > This is optional. There is a module option to control this, in fact.
>> >> >> >
>> >> >> > Its nice to have, because then you can execute a guest without NTP
>> >> >> > (say without network connection), and have a kvmclock (kvmclock is a
>> >> >> > clocksource, not a guest system clock) which is NTP corrected.
>> >> >>
>> >> >> Can you point to how this works?  I found kvm_guest_time_update, whch
>> >> >> is called under circumstances that I haven't untangled.  I can't
>> >> >> really tell what it's trying to do.
>> >> >
>> >> > Documentation/virtual/kvm/timekeeping.txt.
>> >> >
>> >>
>> >> That document is really long.  I skimmed it and found nothing.
>> >
>> > kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set.
>> >
>> > This happens when:
>> >         - kvmclock is enabled or disabled by the guest.
>> >         - periodically to propagate NTP correction to kvmclock clock.
>> >         - guest vcpu switching between host pcpus when TSCs are out of sync.
>> >         - after migration.
>> >         - after savevm/loadvm.
>> >
>> >> >> In any case, this still seems much more convoluted than it has to be.
>> >> >> In the case in which the host has a stable TSC (tsc is selected in the
>> >> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
>> >> >> the time on the last few generations of CPUs, then the core
>> >> >> timekeeping code is already exposing a linear function that's supposed
>> >> >> to be used for monotonic, cpu-local access to a corrected nanosecond
>> >> >> counter.  It's even in pretty much exactly the right form to pass
>> >> >> through to the guest via pvclock in the gtod data.  Why doesn't KVM
>> >> >> pass it through verbatim, updated in real time?  Is there some legacy
>> >> >> reason that KVM must apply its own corrections and has to jump through
>> >> >> hoops to pause vcpus when updating those vcpu's copies of the pvclock
>> >> >> data?
>> >> >
>> >> > Read the comment on x86.c which starts with
>> >> > " *
>> >> >  * Assuming a stable TSC across physical CPUS, and a stable TSC
>> >> >  * across virtual CPUs, the following condition is possible.
>> >> >  * Each numbered line represents an event visible to both
>> >> >  * CPUs at the next numbered event.
>> >> > "
>> >>
>> >> A couple things:
>> >>
>> >> 1. That says: timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
>> >>
>> >> but that's wrong, I think.  rdtsc is a function, not a number.
>> >
>> > View it as a number, then its correct.
>> >
>> >>  Shouldn't it be:
>> >>
>> >> timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M))
>> >
>> > Think "rdtsc" is one number (rdtsc0 = rdtsc1).
>> >
>> >> which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M <
>> >> N + (rdtsc1 - rdtsc0)?
>> >>
>> >> That doesn't change the conclusion.
>> >>
>> >> In any case, I'm not arguing that the concept of a master copy is
>> >> unnecessary; I'm arguing that the implementation, the calculations,
>> >> and the machinations in the code are all very, very complicated.  All
>> >> that should be needed is to keep all of the vcpu pvti copies the same
>> >> and to make sure that you can't ever have one vcpu see a new copy and
>> >> then another vcpu see an old copy.
>> >
>> > Yes, you can't allow two vcpus to see a different copy of the pvti
>> > structure.
>>
>> Two options:
>>
>> 1. Pause all vcpus, then update all the pvti copies, then unpause all
>> vcpus.  This would work, but it's expensive.
>
> This is what is done today.
>
>>
>> 2. Increment all the pvti version numbers, then update all of them,
>> then increment the version numbers again.
>>
>> I think option 2 is a lot nicer than option 1.
>
> Can you write an actual proposal (with details) that accomodates the
> issue described at "Assuming a stable TSC across physical CPUS, and a
> stable TSC" ?
>
> Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
> realtime guests.

This shouldn't require many details, and I don't think there's an ABI
change.  The rules are:

When the overall system timebase changes (e.g. when the selected
clocksource changes or when update_pvclock_gtod is called), the KVM
host would:

optionally: preempt_disable();  /* for performance */

for all vms {

  for all registered pvti structures {
    pvti->version++;  /* should be odd now */
  }

  /* Note: right now, any vcpu that tries to access pvti will start
infinite looping.  We should add cpu_relax() to the guests. */

  for all registered pvti structures {
    update everything except pvti->version;
  }

  for all registered pvti structures {
    pvti->version++;  /* should be even now */
  }

  cond_resched();
}

Is this enough detail?  This should work with all existing guests,
too, unless there's a buggy guest out there that actually fails to
double-check version.

The only reason for an ABI change that I can see would be to arrange
to have only one pvti in the loop.  (And, if we changed the ABI, let's
also get rid of one of the redundant addend fields by promising to set
it to zero while we're at it.)  Actually, if we changed the ABI, we
might want to do it more like traditional IO.  The host would set up a
single page shared by all guests that has all PVTI data, and all of
the guests would map it in.  This would save memory and avoid a loop,
at least if no legacy guests are around.

--Andy