From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andy Lutomirski <luto@amacapital.net>
Subject: Re: kvmclock doesn't work, help?
Date: Fri, 18 Dec 2015 11:27:13 -0800
Message-ID: <CALCETrXQDAGCkngy15Q7yrxshuu+akZPzTCvfLQK1k63kMgoOA@mail.gmail.com>
References: <566EC7AF.3090508@redhat.com> <20151214220027.GA24973@amt.cnet>
 <CALCETrULJW9BpB+VQOFvLYOYrA0xBWwgzim3kRB+FzZe6Voa+g@mail.gmail.com>
 <566FD25C.5040806@redhat.com> <CALCETrXr5FB-C-NyTXkf5vetWTKdpd16tYSnBx1nBOLJ8Kjnhw@mail.gmail.com>
 <CALCETrUDDX3upsbbUdFxtdtme86Qmgg3JfB_CEKd3h3R5dwE9A@mail.gmail.com>
 <20151216215731.GA9950@amt.cnet> <CALCETrVfG5YtfP+8G-NLqkyZByndT11MwkmgZ1X9u-h_syw9PQ@mail.gmail.com>
 <20151217190850.GA13981@amt.cnet> <CALCETrViR0iTA4SgqQRQRokKAstQnMdW+cf1FHF7ZJ2vqW58sA@mail.gmail.com>
 <20151218114734.GA28306@amt.cnet>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm list <kvm@vger.kernel.org>,
	Radim Krcmar <rkrcmar@redhat.com>, X86 ML <x86@kernel.org>
To: Marcelo Tosatti <mtosatti@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mail-ob0-f173.google.com ([209.85.214.173]:33793 "EHLO
	mail-ob0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S932554AbbLRT1e (ORCPT <rfc822;kvm@vger.kernel.org>);
	Fri, 18 Dec 2015 14:27:34 -0500
Received: by mail-ob0-f173.google.com with SMTP id iw8so86134405obc.1
        for <kvm@vger.kernel.org>; Fri, 18 Dec 2015 11:27:33 -0800 (PST)
In-Reply-To: <20151218114734.GA28306@amt.cnet>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
>> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
>> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> >> >> >>
>> >> >> >>
>> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >> >> >>> >         RAW TSC                 NTP corrected TSC
>> >> >> >>> > t0      10                      10
>> >> >> >>> > t1      20                      19.99
>> >> >> >>> > t2      30                      29.98
>> >> >> >>> > t3      40                      39.97
>> >> >> >>> > t4      50                      49.96
>> >
>> > (1)
>> >
>> >> >> >>> >
>> >> >> >>> > ...
>> >> >> >>> >
>> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> >> >> >>> > you can see what will happen.
>> >> >> >>>
>> >> >> >>> Sure, but why would you ever switch from one to the other?
>> >> >> >>
>> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> >> >> >> resume, the TSC certainly increases at the same rate as before, but the
>> >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
>> >> >> >> than the guest kvmclock.
>> >> >> >
>> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
>> >> >> >
>> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
>> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
>> >> >> > backwards.
>> >> >> >
>> >> >> > If it's the guest's, then the guest's NTP correction is applied on top
>> >> >> > of kvmclock, and this shouldn't matter.
>> >> >> >
>> >> >> > I still feel like I'm missing something very basic here.
>> >> >> >
>> >> >>
>> >> >> OK, I think I get it.
>> >> >>
>> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
>> >> >> correction to the guest.  If it did, indeed, propagate the correction
>> >> >> then, after resume, the host's new system_time would match the guest's
>> >> >> idea of it (after accounting for the guest's long nap), and I don't
>> >> >> think there would be a problem.
>> >> >> That being said, I can't find the code in the masterclock stuff that
>> >> >> would actually do this.
>> >> >
>> >> > Guest clock is maintained by guest timekeeping code, which does:
>> >> >
>> >> > timer_interrupt()
>> >> >         offset = read clocksource since last timer interrupt
>> >> >         accumulate_to_systemclock(offset)
>> >> >
>> >> > The frequency correction of NTP in the host can be applied to
>> >> > kvmclock, which will be visible to the guest
>> >> > at "read clocksource since last timer interrupt"
>> >> > (kvmclock_clocksource_read function).
>> >>
>> >> pvclock_clocksource_read?  That seems to do the same thing as all the
>> >> other clocksource access functions.
>> >>
>> >> >
>> >> > This does not mean that the NTP correction in the host is propagated
>> >> > to the guest's system clock directly.
>> >> >
>> >> > (For example, the guest can run NTP which is free to do further
>> >> > adjustments at "accumulate_to_systemclock(offset)" time).
>> >>
>> >> Of course.  But I expected that, in the absence of NTP on the guest,
>> >> the guest would track the host's *corrected* time.
>> >>
>> >> >
>> >> >> If, on the other hand, the host's NTP correction is not supposed to
>> >> >> propagate to the guest,
>> >> >
>> >> > This is optional. There is a module option to control this, in fact.
>> >> >
>> >> > It's nice to have, because then you can execute a guest without NTP
>> >> > (say without network connection), and have a kvmclock (kvmclock is a
>> >> > clocksource, not a guest system clock) which is NTP corrected.
>> >>
>> >> Can you point to how this works?  I found kvm_guest_time_update, which
>> >> is called under circumstances that I haven't untangled.  I can't
>> >> really tell what it's trying to do.
>> >
>> > Documentation/virtual/kvm/timekeeping.txt.
>> >
>>
>> That document is really long.  I skimmed it and found nothing.
>
> kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set.
>
> This happens when:
>         - kvmclock is enabled or disabled by the guest.
>         - periodically to propagate NTP correction to kvmclock clock.
>         - guest vcpu switching between host pcpus when TSCs are out of sync.
>         - after migration.
>         - after savevm/loadvm.
>
>> >> In any case, this still seems much more convoluted than it has to be.
>> >> In the case in which the host has a stable TSC (tsc is selected in the
>> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
>> >> the time on the last few generations of CPUs, then the core
>> >> timekeeping code is already exposing a linear function that's supposed
>> >> to be used for monotonic, cpu-local access to a corrected nanosecond
>> >> counter.  It's even in pretty much exactly the right form to pass
>> >> through to the guest via pvclock in the gtod data.  Why doesn't KVM
>> >> pass it through verbatim, updated in real time?  Is there some legacy
>> >> reason that KVM must apply its own corrections and has to jump through
>> >> hoops to pause vcpus when updating those vcpu's copies of the pvclock
>> >> data?
>> >
>> > Read the comment on x86.c which starts with
>> > " *
>> >  * Assuming a stable TSC across physical CPUS, and a stable TSC
>> >  * across virtual CPUs, the following condition is possible.
>> >  * Each numbered line represents an event visible to both
>> >  * CPUs at the next numbered event.
>> > "
>>
>> A couple things:
>>
>> 1. That says: timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
>>
>> but that's wrong, I think.  rdtsc is a function, not a number.
>
> View it as a number, then it's correct.
>
>>  Shouldn't it be:
>>
>> timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M))
>
> Think "rdtsc" is one number (rdtsc0 = rdtsc1).
>
>> which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M <
>> N + (rdtsc1 - rdtsc0)?
>>
>> That doesn't change the conclusion.
>>
>> In any case, I'm not arguing that the concept of a master copy is
>> unnecessary; I'm arguing that the implementation, the calculations,
>> and the machinations in the code are all very, very complicated.  All
>> that should be needed is to keep all of the vcpu pvti copies the same
>> and to make sure that you can't ever have one vcpu see a new copy and
>> then another vcpu see an old copy.
>
> Yes, you can't allow two vcpus to see a different copy of the pvti
> structure.

Two options:

1. Pause all vcpus, then update all the pvti copies, then unpause all
vcpus.  This would work, but it's expensive.

2. Increment all the pvti version numbers, then update all of them,
then increment the version numbers again.

I think option 2 is a lot nicer than option 1.
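
A rough sketch of option 2, to make the ordering concrete.  This is not
the KVM code: struct pvti_sketch, pvti_copies, pvti_update_all and
pvti_read_ns are made-up names, the record is cut down to three fields,
and the TSC-to-ns scaling is assumed to be 1:1.  It relies on the usual
pvclock convention that a reader retries while the version is odd or
has changed under it.

```c
/* Sketch only: simplified stand-in for the per-vcpu pvti record. */
#include <stdatomic.h>
#include <stdint.h>

#define NR_VCPUS 4	/* hypothetical vcpu count */

struct pvti_sketch {
	_Atomic uint32_t version;	/* odd while an update is in flight */
	_Atomic uint64_t tsc_timestamp;	/* TSC value at the time of update */
	_Atomic uint64_t system_time;	/* nanoseconds at tsc_timestamp */
};

/* One copy per vcpu; static storage starts out zeroed (version 0). */
static struct pvti_sketch pvti_copies[NR_VCPUS];

/* Host side: runs on one host cpu while the vcpus keep running. */
static void pvti_update_all(uint64_t tsc, uint64_t ns)
{
	int i;

	/* Phase 1: make every version odd so no read can complete. */
	for (i = 0; i < NR_VCPUS; i++)
		atomic_fetch_add(&pvti_copies[i].version, 1);

	/* Phase 2: write the same payload into every copy. */
	for (i = 0; i < NR_VCPUS; i++) {
		atomic_store(&pvti_copies[i].tsc_timestamp, tsc);
		atomic_store(&pvti_copies[i].system_time, ns);
	}

	/* Phase 3: make every version even again, publishing the data. */
	for (i = 0; i < NR_VCPUS; i++)
		atomic_fetch_add(&pvti_copies[i].version, 1);
}

/* Guest side: the usual retry loop on a single copy. */
static uint64_t pvti_read_ns(struct pvti_sketch *p, uint64_t tsc_now)
{
	uint32_t v;
	uint64_t ts, st;

	do {
		v = atomic_load(&p->version);
		ts = atomic_load(&p->tsc_timestamp);
		st = atomic_load(&p->system_time);
	} while ((v & 1) || v != atomic_load(&p->version));

	/* Simplification: one TSC tick == 1 ns, no mul/shift scaling. */
	return st + (tsc_now - ts);
}
```

The point of the three phases: new data only becomes readable in phase
3, and by then every copy is either still odd or already holds the new
data, so no vcpu can complete a read of the old copy after another vcpu
has seen the new one.  The sketch uses sequentially consistent atomics
throughout to keep the reasoning simple; real code would want weaker
barriers.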

>
>>  You can do that by brute-force
>> freezing all vcpus on an update (what happens now), or you could do it
>> by just writing all of the copies at the same time from the same host
>> cpu *while other vcpus are still running*.
>
> Ok, can you do that and guarantee the first copy won't be seen by
> other vcpus? I don't know how.
>
>> For the best outcome, you could offer a pvclock protocol v3 in which
>> there is literally just one pvti copy shared by all vcpus.
>
> Sure, let's write a more formal proposal?
>

I can try to sketch something out in the next week or two.  It would
be basically the same as the current protocol, except that there would
be a single pvti instead of an array.  If the TSCs are out of sync and
the host can't synchronize them, this mechanism would return an error
and the guest would have to fall back.
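
To illustrate the guest-side flow only (nothing here is a real
interface; host_register_shared_pvti, guest_pick_clock and the enum are
hypothetical names, and the "host" is just a stub taking a flag):

```c
#include <stdbool.h>

/* Which time source the guest ends up using in this sketch. */
enum clock_mode {
	CLOCK_SHARED_PVTI,	/* one pvti shared by all vcpus */
	CLOCK_FALLBACK,		/* whatever the guest falls back to */
};

/*
 * Stub for a hypothetical "register the single shared pvti" request.
 * Returns 0 on success, -1 if the host cannot keep the TSCs in sync.
 */
static int host_register_shared_pvti(bool host_tscs_synchronized)
{
	return host_tscs_synchronized ? 0 : -1;
}

/* Guest setup: try the shared pvti first, fall back on error. */
static enum clock_mode guest_pick_clock(bool host_tscs_synchronized)
{
	if (host_register_shared_pvti(host_tscs_synchronized) == 0)
		return CLOCK_SHARED_PVTI;

	return CLOCK_FALLBACK;
}
```

What the fallback actually is, and how the error is reported, would be
for the proposal to define.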

--Andy