From mboxrd@z Thu Jan  1 00:00:00 1970
From: Owen Hofmann <osh@google.com>
Subject: Re: What time is it kvm-clock?
Date: Wed, 24 Feb 2016 11:55:23 -0800
Message-ID: <CANqFzA6TgdsoUyjKiwNa++et_wWnQgFRhEUZ6NE3_CShh0wc6g@mail.gmail.com>
References: <CANqFzA5VCQYZ6dYBjz=hbBotwe0S_4cKgxGiK9YU8Ei9G7DYng@mail.gmail.com>
	<56CDBAB1.6090405@redhat.com>
	<CALCETrUJ15DCxhFiMrcpk9B7oW+fHAGTThFWweUcLYZ5SGUzSQ@mail.gmail.com>
	<20160224173821.GA9364@amt.cnet>
	<CALCETrUzcANkDt2w_4pQDjyaSxUVBY6nyHEFSXgF2M7_hybrxQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Cc: Marcelo Tosatti <mtosatti@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	KVM General <kvm@vger.kernel.org>,
	Peter Hornyack <peterhornyack@google.com>,
	Joao Martins <joao.m.martins@oracle.com>
To: Andy Lutomirski <luto@amacapital.net>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mail-io0-f169.google.com ([209.85.223.169]:34132 "EHLO
	mail-io0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S932366AbcBXTz0 (ORCPT <rfc822;kvm@vger.kernel.org>);
	Wed, 24 Feb 2016 14:55:26 -0500
Received: by mail-io0-f169.google.com with SMTP id 9so57566388iom.1
        for <kvm@vger.kernel.org>; Wed, 24 Feb 2016 11:55:25 -0800 (PST)
In-Reply-To: <CALCETrUzcANkDt2w_4pQDjyaSxUVBY6nyHEFSXgF2M7_hybrxQ@mail.gmail.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

>>> not-really-well-defined hybrid?
>>>
>>> --Andy
>>
>> 1. What is not well defined? I fail to spot anything
>> specific in Owen's e-mail.
>
> If I start a guest and query kvm-clock, I get a nanosecond count.
> AFAIK it is, in fact, ill-defined or at least ill-documented what that
> nanosecond count means.

To try to put the thoughts into specific questions:
- What is the value returned by KVM_GET_CLOCK? How should it be used?
- What is the value returned by a guest read of the kvm-clock
structure? (This is also Andy's question.)

To me, there are two possible answers to the second question:
1) kvm-clock is better than the host TSC: it propagates updates to
frequency from the host (== CLOCK_MONOTONIC)
2) kvm-clock is a paravirtual source of truth on the guest TSC:
whether it is stable and its approximate frequency. If the guest needs
to synchronize to an external source of time, it runs NTP. (==
CLOCK_MONOTONIC_RAW)

To me, (1) sounds hard, (2) sounds easy, and it's not clear how much
additional value (1) provides. The recent patches Paolo sent move
kvm-clock in the direction of (1), and it sounds like Andy and I might
have slightly different opinions as well. But mostly I would like some
clarity as to which is the stated goal for kvm-clock, and to have the
implementation pick only one of those options.
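
To make the gap between the two options concrete, here is a rough model of
what a guest read of the pvclock structure computes, and how far it can drift
from an NTP-corrected host clock. The field names follow the pvclock ABI
(tsc_timestamp, system_time, tsc_to_system_mul, tsc_shift); everything else
(PvclockInfo, scale_delta, params_for_hz, the 2.5 GHz nominal frequency, the
50 ppm error) is made up for illustration. It is a sketch, not the kernel
code, and it leaves out the version/retry protocol entirely.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


@dataclass
class PvclockInfo:
    """Illustrative subset of the per-vcpu pvclock structure."""
    tsc_timestamp: int      # guest TSC value at the last host update
    system_time: int        # nanoseconds at that same instant
    tsc_to_system_mul: int  # 32.32 fixed-point multiplier
    tsc_shift: int          # signed pre-shift applied to the TSC delta


def scale_delta(delta, mul, shift):
    """Convert a TSC delta to nanoseconds: pre-shift, then multiply by mul / 2^32."""
    if shift < 0:
        delta >>= -shift
    else:
        delta = (delta << shift) & MASK64
    return (delta * mul) >> 32


def pvclock_read(info, tsc):
    """What the guest computes from the structure and its current TSC."""
    delta = (tsc - info.tsc_timestamp) & MASK64
    return info.system_time + scale_delta(
        delta, info.tsc_to_system_mul, info.tsc_shift)


def params_for_hz(tsc_hz):
    """Hypothetical helper: pick (mul, shift) with mul in [2^31, 2^32)."""
    num = 10**9 << 32  # nanoseconds per second, 32.32 fixed point
    shift = 0
    while num // tsc_hz >= 1 << 32:  # mul too large: shift the delta left
        num >>= 1
        shift += 1
    while num // tsc_hz < 1 << 31:   # spare precision: shift the delta right
        num <<= 1
        shift -= 1
    return num // tsc_hz, shift


# Assumption: the host programs the structure once, at a nominal 2.5 GHz,
# while the TSC really runs 50 ppm fast relative to NTP-corrected time.
NOMINAL_HZ = 2_500_000_000
ACTUAL_HZ = 2_500_125_000

mul, shift = params_for_hz(NOMINAL_HZ)
info = PvclockInfo(tsc_timestamp=0, system_time=0,
                   tsc_to_system_mul=mul, tsc_shift=shift)

elapsed_s = 3600
guest_ns = pvclock_read(info, ACTUAL_HZ * elapsed_s)  # option (2): pure TSC
host_ns = elapsed_s * 10**9                           # option (1): corrected time
drift_ns = guest_ns - host_ns

print(f"guest pvclock is ahead of corrected host time by {drift_ns / 1e9:.3f} s")
```

With these made-up numbers the two clocks end up about 0.18 s apart after an
hour, which is the kind of gap between a guest pvclock read and a host
KVM_GET_CLOCK that I am asking about.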

>>> > Since we cannot change the past, having kvmclock synchronize with the
>>> > host TSC frequency is the only choice we can make.

I'm not sure I understand what previous decision locks kvm-clock into
the current path. Can you clarify?

On Wed, Feb 24, 2016 at 11:38 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Feb 24, 2016 at 9:38 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> On Wed, Feb 24, 2016 at 08:44:40AM -0800, Andy Lutomirski wrote:
>>> On Wed, Feb 24, 2016 at 6:14 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> >
>>> >
>>> > On 24/02/2016 03:31, Owen Hofmann wrote:
>>> >> Specifically, what underlying source of time should be exposed through
>>> >> kvm-clock and other paravirtual ABIs like the HyperV reference tsc
>>> >> page?  Recently a couple of threads on kvm-list, along with attempts
>>> >> to produce reliable behavior from kvm-clock on our systems have
>>> >> highlighted a tension between the current implementation of kvm-clock
>>> >> and potentially diverging goals for paravirt time. Here are a few:
>>> >>
>>> >> 1) kvmclock doesn't work, help?: http://www.spinics.net/lists/kvm/msg125039.html
>>> >> 2) kvmclock: improve accuracy: http://www.spinics.net/lists/kvm/msg127215.html
>>> >> 3) KVM-clock: http://www.spinics.net/lists/kvm/msg127774.html
>>> >>
>>> >> This question is mostly in regards to kvm-clock in masterclock mode
>>> >> (with PVCLOCK_TSC_STABLE set). In this mode, is kvm-clock intended to
>>> >> expose a source of time that is more 'true' than the underlying TSC?
>>> >> For example, by passing through NTP correction from the host. For the
>>> >> current implementation, the answer seems to be... why not both? Once
>>> >> programmed, kvm-clock or the HyperV TSC page will advance with the TSC
>>> >> multiplied by the frequency specified by kvm. On the other hand,
>>> >> KVM_GET_CLOCK, KVM_SET_CLOCK, and the Windows reference counter MSR
>>> >> are measured against corrected time from the host. A guest reading its
>>> >> pvclock gets a very different result from a host KVM_GET_CLOCK if the
>>> >> guest has run long enough for the TSC to diverge from NTP time.
>>> >
>>> > Right, in fact that's why QEMU is not really using KVM_GET_CLOCK
>>> > anymore.  In retrospect, the "fix" in QEMU was probably a bad idea.  It
>>> > would have been better to fix KVM_GET_CLOCK.
>>> >
>>> >> To me, kvm-clock and the HyperV TSC page are extremely effective as
>>> >> simply a more enlightened path to the host TSC. Maintaining a
>>> >> high-performance path to the TSC in the face of updates is tricky -
>>> >> see the extended comment in pvclock_update_vm_gtod_copy, or the
>>> >> discussion on the patchset in (2). Is the cost of auditing that the
>>> >> path from host gettimeofday update -> kvm -> guest pvclock -> guest
>>> >> gettimeofday both tracks host time correctly and does not produce any
>>> >> backwards warps worth the added value, if it exists? As an
>>> >> alternative, implementing KVM_GET_CLOCK or the reference time MSR as a
>>> >> function of the last update to kvm-clock or the reference TSC page,
>>> >> respectively, sounds very straightforward.
>>> >
>>> > Yes, we could do that too.
>>> >
>>> > I think that vgettsc and do_monotonic_boot also would have to use the
>>> > TSC frequency instead of the NTP-adjusted host clock.
>>> >
>>> >> (Outside of masterclock mode, the requirement that the client
>>> >> synchronizes across cpus for monotonicity smooths over a lot of
>>> >> complexity - periodically updating kvm-clock to the current time is
>>> >> simple and works.)
>>> >>
>>> >> Regardless of my opinion, I think that a clear statement of the design
>>> >> goals for kvm-clock (and kvm's implementation of the reference TSC
>>> >> page) would be valuable.
>>> >
>>> > Since we cannot change the past, having kvmclock synchronize with the
>>> > host TSC frequency is the only choice we can make.
>>> >
>>>
>>> Could we introduce a new kvm-clock or perhaps opt-in mode that:
>>>
>>> a) uses hypervisor-supplied IO pages and,
>>>
>>> b) synchronizes to host CLOCK_MONOTONIC instead of some bizarre
>>> non-suspend-resume-safe
>>
>> Please be accurate. It is suspend safe.
>>
>
> I'm being accurate enough, I think.  Master clock mode is not suspend
> safe.  When I suspend and resume my laptop, the master clock code
> determines that it messed up and disables itself.  Unloading and
> reloading the kvm modules turns it back on until the next suspend.
>
> I *think* that the underlying issue is that kvm-clock's master clock
> tracks something ill-defined instead of exposing a well-defined host
> clock.  If the master clock accurately exposed CLOCK_MONOTONIC_RAW or
> CLOCK_MONOTONIC (I much prefer the latter), then it would be fine
> across suspend/resume.
>
> I think that part of the reason that it doesn't accurately export a
> host clock is that the worst-case performance of atomic updates to the
> pvclock data structures is abysmal due to having the data structures
> living in guest memory.  To be able to access and update all relevant
> structures during host clock refreshes, the host would need to pin
> all pvclock pages for all running guests.  This could be partially
> mitigated by only updating pvclock data for running vcpus and for vcpu
> 0 for all running guests synchronously and deferring the rest (8k
> pinned per host cpu, max), but it would still be a mess.
>
> If someone redefined the interface so that the *host* could allocate
> it, then the pages could be shared across all guests and this would be
> vastly simpler and faster.
>
> Also, kvm-clock should really coordinate with the core timekeeping
> code to handle this sort of time base export rather than hooking into
> the host vdso support code.
>
>>> not-really-well-defined hybrid?
>>>
>>> --Andy
>>
>> 1. What is not well defined? I fail to spot anything
>> specific in Owen's e-mail.
>
> If I start a guest and query kvm-clock, I get a nanosecond count.
> AFAIK it is, in fact, ill-defined or at least ill-documented what that
> nanosecond count means.
>
> [cc: Joao.  Xen may want to take this stuff into consideration.]
>
> --Andy