From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marcelo Tosatti <mtosatti@redhat.com>
Subject: Re: kvmclock doesn't work, help?
Date: Fri, 18 Dec 2015 19:49:30 -0200
Message-ID: <20151218214928.GA32177@amt.cnet>
References: <CALCETrXr5FB-C-NyTXkf5vetWTKdpd16tYSnBx1nBOLJ8Kjnhw@mail.gmail.com>
 <CALCETrUDDX3upsbbUdFxtdtme86Qmgg3JfB_CEKd3h3R5dwE9A@mail.gmail.com>
 <20151216215731.GA9950@amt.cnet>
 <CALCETrVfG5YtfP+8G-NLqkyZByndT11MwkmgZ1X9u-h_syw9PQ@mail.gmail.com>
 <20151217190850.GA13981@amt.cnet>
 <CALCETrViR0iTA4SgqQRQRokKAstQnMdW+cf1FHF7ZJ2vqW58sA@mail.gmail.com>
 <20151218114734.GA28306@amt.cnet>
 <CALCETrXQDAGCkngy15Q7yrxshuu+akZPzTCvfLQK1k63kMgoOA@mail.gmail.com>
 <20151218194538.GC28381@amt.cnet>
 <CALCETrUV-BV0HzifaSsD17qhXRVLQ7zTu2N_8pc69ODbB6bsEQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm list <kvm@vger.kernel.org>,
	Radim Krcmar <rkrcmar@redhat.com>, X86 ML <x86@kernel.org>,
	John Stultz <john.stultz@linaro.org>
To: Andy Lutomirski <luto@amacapital.net>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:35242 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751995AbbLULyA (ORCPT <rfc822;kvm@vger.kernel.org>);
	Mon, 21 Dec 2015 06:54:00 -0500
Content-Disposition: inline
In-Reply-To: <CALCETrUV-BV0HzifaSsD17qhXRVLQ7zTu2N_8pc69ODbB6bsEQ@mail.gmail.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote:
> [cc: John Stultz -- maybe you have ideas on how this should best
> integrate with the core code]
> 
> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote:
> >> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
> >> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
> >> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> >> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >> >> >> >> >>> >         RAW TSC                 NTP corrected TSC
> >> >> >> >> >>> > t0      10                      10
> >> >> >> >> >>> > t1      20                      19.99
> >> >> >> >> >>> > t2      30                      29.98
> >> >> >> >> >>> > t3      40                      39.97
> >> >> >> >> >>> > t4      50                      49.96
> >> >> >
> >> >> > (1)
> >> >> >
> >> >> >> >> >>> >
> >> >> >> >> >>> > ...
> >> >> >> >> >>> >
> >> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >> >> >> >> >>> > you can see what will happen.
> >> >> >> >> >>>
> >> >> >> >> >>> Sure, but why would you ever switch from one to the other?
> >> >> >> >> >>
> >> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> >> >> >> >> >> resume, the TSC certainly increases at the same rate as before, but the
> >> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
> >> >> >> >> >> than the guest kvmclock.
> >> >> >> >> >
> >> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
> >> >> >> >> >
> >> >> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
> >> >> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
> >> >> >> >> > backwards.
> >> >> >> >> >
> >> >> >> >> > If it's the guest's, then the guest's NTP correction is applied on top
> >> >> >> >> > of kvmclock, and this shouldn't matter.
> >> >> >> >> >
> >> >> >> >> > I still feel like I'm missing something very basic here.
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> OK, I think I get it.
> >> >> >> >>
> >> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
> >> >> >> >> correction to the guest.  If it did, indeed, propagate the correction
> >> >> >> >> then, after resume, the host's new system_time would match the guest's
> >> >> >> >> idea of it (after accounting for the guest's long nap), and I don't
> >> >> >> >> think there would be a problem.
> >> >> >> >> That being said, I can't find the code in the masterclock stuff that
> >> >> >> >> would actually do this.
> >> >> >> >
> >> >> >> > Guest clock is maintained by guest timekeeping code, which does:
> >> >> >> >
> >> >> >> > timer_interrupt()
> >> >> >> >         offset = read clocksource since last timer interrupt
> >> >> >> >         accumulate_to_systemclock(offset)
> >> >> >> >
> >> >> >> > The frequency correction of NTP in the host can be applied to
> >> >> >> > kvmclock, which will be visible to the guest
> >> >> >> > at "read clocksource since last timer interrupt"
> >> >> >> > (kvmclock_clocksource_read function).
> >> >> >>
> >> >> >> pvclock_clocksource_read?  That seems to do the same thing as all the
> >> >> >> other clocksource access functions.
> >> >> >>
> >> >> >> >
> >> >> >> > This does not mean that the NTP correction in the host is propagated
> >> >> >> > to the guests system clock directly.
> >> >> >> >
> >> >> >> > (For example, the guest can run NTP which is free to do further
> >> >> >> > adjustments at "accumulate_to_systemclock(offset)" time).
> >> >> >>
> >> >> >> Of course.  But I expected that, in the absence of NTP on the guest,
> >> >> >> that the guest would track the host's *corrected* time.
> >> >> >>
> >> >> >> >
> >> >> >> >> If, on the other hand, the host's NTP correction is not supposed to
> >> >> >> >> propagate to the guest,
> >> >> >> >
> >> >> >> > This is optional. There is a module option to control this, in fact.
> >> >> >> >
> >> >> >> > Its nice to have, because then you can execute a guest without NTP
> >> >> >> > (say without network connection), and have a kvmclock (kvmclock is a
> >> >> >> > clocksource, not a guest system clock) which is NTP corrected.
> >> >> >>
> >> >> >> Can you point to how this works?  I found kvm_guest_time_update, whch
> >> >> >> is called under circumstances that I haven't untangled.  I can't
> >> >> >> really tell what it's trying to do.
> >> >> >
> >> >> > Documentation/virtual/kvm/timekeeping.txt.
> >> >> >
> >> >>
> >> >> That document is really long.  I skimmed it and found nothing.
> >> >
> >> > kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set.
> >> >
> >> > This happens when:
> >> >         - kvmclock is enabled or disabled by the guest.
> >> >         - periodically to propagate NTP correction to kvmclock clock.
> >> >         - guest vcpu switching between host pcpus when TSCs are out of sync.
> >> >         - after migration.
> >> >         - after savevm/loadvm.
> >> >
> >> >> >> In any case, this still seems much more convoluted than it has to be.
> >> >> >> In the case in which the host has a stable TSC (tsc is selected in the
> >> >> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
> >> >> >> the time on the last few generations of CPUs, then the core
> >> >> >> timekeeping code is already exposing a linear function that's supposed
> >> >> >> to be used for monotonic, cpu-local access to a corrected nanosecond
> >> >> >> counter.  It's even in pretty much exactly the right form to pass
> >> >> >> through to the guest via pvclock in the gtod data.  Why doesn't KVM
> >> >> >> pass it through verbatim, updated in real time?  Is there some legacy
> >> >> >> reason that KVM must apply its own corrections and has to jump through
> >> >> >> hoops to pause vcpus when updating those vcpu's copies of the pvclock
> >> >> >> data?
> >> >> >
> >> >> > Read the comment on x86.c which starts with
> >> >> > " *
> >> >> >  * Assuming a stable TSC across physical CPUS, and a stable TSC
> >> >> >  * across virtual CPUs, the following condition is possible.
> >> >> >  * Each numbered line represents an event visible to both
> >> >> >  * CPUs at the next numbered event.
> >> >> > "
> >> >>
> >> >> A couple things:
> >> >>
> >> >> 1. That says: timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
> >> >>
> >> >> but that's wrong, I think.  rdtsc is a function, not a number.
> >> >
> >> > View it as a number, then its correct.
> >> >
> >> >>  Shouldn't it be:
> >> >>
> >> >> timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M))
> >> >
> >> > Think "rdtsc" is one number (rdtsc0 = rdtsc1).
> >> >
> >> >> which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M <
> >> >> N + (rdtsc1 - rdtsc0)?
> >> >>
> >> >> That doesn't change the conclusion.
> >> >>
> >> >> In any case, I'm not arguing that the concept of a master copy is
> >> >> unnecessary; I'm arguing that the implementation, the calculations,
> >> >> and the machinations in the code are all very, very complicated.  All
> >> >> that should be needed is to keep all of the vcpu pvti copies the same
> >> >> and to make sure that you can't ever have one vcpu see a new copy and
> >> >> then another vcpu see an old copy.
> >> >
> >> > Yes, you can't allow two vcpus to see a different copy of the pvti
> >> > structure.
> >>
> >> Two options:
> >>
> >> 1. Pause all vcpus, then update all the pvti copies, then unpause all
> >> vcpus.  This would work, but it's expensive.
> >
> > This is what is done today.
> >
> >>
> >> 2. Increment all the pvti version numbers, then update all of them,
> >> then increment the version numbers again.
> >>
> >> I think option 2 is a lot nicer than option 1.
> >
> > Can you write an actual proposal (with details) that accomodates the
> > issue described at "Assuming a stable TSC across physical CPUS, and a
> > stable TSC" ?
> >
> > Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
> > realtime guests.
> 
> This shouldn't require many details, and I don't think there's an ABI
> change.  The rules are:
> 
> When the overall system timebase changes (e.g. when the selected
> clocksource changes or when update_pvclock_gtod is called), the KVM
> host would:
> 
> optionally: preempt_disable();  /* for performance */
> 
> for all vms {
> 
>   for all registered pvti structures {
>     pvti->version++;  /* should be odd now */
>   }

pvti is userspace data, so you have to pin it before?

>   /* Note: right now, any vcpu that tries to access pvti will start
> infinite looping.  We should add cpu_relax() to the guests. */
> 
>   for all registered pvti structures {
>     update everything except pvti->version;
>   }
> 
>   for all registered pvti structures {
>     pvti->version++;  /* should be even now */
>   }
> 
>   cond_resched();
> }
> 
> Is this enough detail?  This should work with all existing guests,
> too, unless there's a buggy guest out there that actually fails to
> double-check version.

What is the advantage of this over the brute force method, given 
that guests will busy spin?

(busy spin is equally problematic as IPI for realtime guests).

> The only reason for an ABI change that I can see would be to arrange
> to have only one pvti in the loop.  (And, if we changed the ABI, let's
> also get rid of one of the redundant addend fields by promising to set
> it to zero while we're at it.)  Actually, if we changed the ABI, we
> might want to do it more like traditional IO.  The host would set up a
> single page shared by all guests that has all PVTI data, and all of
> the guests would map it in.  This would save memory and avoid a loop,
> at least if no legacy guests are around.
> 
> --Andy