From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marcelo Tosatti <mtosatti@redhat.com>
Subject: Re: kvmclock doesn't work, help?
Date: Fri, 11 Dec 2015 21:48:11 -0200
Message-ID: <20151211234810.GA26859@amt.cnet>
References: <CALCETrVZwDddGcW8axAb4PP+YZyfz5TGR9xYwZXv3d_aghLBtA@mail.gmail.com>
 <20151210213212.GA4836@amt.cnet>
 <CALCETrVhHP6p-XRKhzUQX4QY3ymupriarr3joUCgjQgYa-49Bg@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: kvm list <kvm@vger.kernel.org>, Radim Krcmar <rkrcmar@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>, X86 ML <x86@kernel.org>
To: Andy Lutomirski <luto@amacapital.net>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:34409 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751019AbbLNOUd (ORCPT <rfc822;kvm@vger.kernel.org>);
	Mon, 14 Dec 2015 09:20:33 -0500
Content-Disposition: inline
In-Reply-To: <CALCETrVhHP6p-XRKhzUQX4QY3ymupriarr3joUCgjQgYa-49Bg@mail.gmail.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote:
> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
> >> I'm trying to clean up kvmclock and I can't get it to work at all.  My
> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
> >>
> >> If I boot an SMP (2 vcpus) guest, tracing says:
> >>
> >>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 0
> >>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
> >>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>
> >>
> >> If I boot a UP guest, tracing says:
> >>
> >>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>
> >> I suspect, but I haven't verified, that this is fallout from:
> >>
> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
> >> Author: Marcelo Tosatti <mtosatti@redhat.com>
> >> Date:   Wed May 14 12:43:24 2014 -0300
> >>
> >>     KVM: x86: disable master clock if TSC is reset during suspend
> >>
> >>     Updating system_time from the kernel clock once master clock
> >>     has been enabled can result in time backwards event, in case
> >>     kernel clock frequency is lower than TSC frequency.
> >>
> >>     Disable master clock in case it is necessary to update it
> >>     from the resume path.
> >>
> >>     Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>
> >>
> >> Can we please stop making kvmclock more complex?  It's a beast right
> >> now, and not in a good way.  It's far too tangled with the vclock
> >> machinery on both the host and guest sides, the pvclock stuff is not
> >> well thought out (even in principle in an ABI sense), and it's never
> >> been clear to me what problem exactly the kvmclock stuff is supposed
> >> to solve.
> >>
> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> >> start over.  A correctly functioning KVM guest using TSC (i.e.
> >> ignoring kvmclock entirely) seems to work rather more reliably and
> >> considerably faster than a kvmclock guest.
> >>
> >> --Andy
> >>
> >> --
> >> Andy Lutomirski
> >> AMA Capital Management, LLC
> >
> > Andy,
> >
> > I am all for solving practical problems rather than pleasing aesthetic
> > pleasure.
> >
> >>     Updating system_time from the kernel clock once master clock
> >>     has been enabled can result in time backwards event, in case
> >>     kernel clock frequency is lower than TSC frequency.
> >>
> >>     Disable master clock in case it is necessary to update it
> >>     from the resume path.
> >
> >> once master clock
> >>     has been enabled can result in time backwards event, in case
> >>     kernel clock frequency is lower than TSC frequency.
> >
> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads.
> >
> > If the effective frequency of the kernel clock is lower (for example
> > due to NTP correcting the TSC frequency of the system), and you resume
> > and update the system, the following happens:
> >
> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads=LARGE VALUE.

    guest reads clock to memory at location A = scaled tsc read.

(Note that the TSC is counting at a frequency higher than the one
advertised by the processor; that's why NTP has to "slow down" the
kernel clock, which is maintained by successive reads of the TSC.)

> > suspend/resume event.
> > guest visible clock = tsc_timestamp (updated at time N) + scaled tsc reads=0.

Now the guest visible clock contains a tsc_timestamp that has been
corrected by NTP, over say 5 days. So the tiny NTP correction has
added up to something significant.

   guest reads clock to memory at location B = reads tsc_timestamp. 

Clock value in B (NTP corrected TSC) < clock value in A  (RAW TSC)

Yes?
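
To put rough numbers on the snapshots above, here is a minimal Python
sketch. It assumes a 1:1 TSC-to-nanosecond scale and a host kernel clock
that NTP slows down by 50 ppm. Both values, and the helper names
guest_clock() and host_kernel_clock(), are made up for illustration and
are not taken from the KVM code.

```python
# Hypothetical numbers, for illustration only.
NTP_SLOWDOWN_PPM = 50            # assumed NTP correction of the host clock
NS_PER_DAY = 24 * 3600 * 10**9


def guest_clock(system_time, tsc_timestamp, tsc):
    """Guest-visible time in ns, assuming a 1:1 TSC-to-ns scale."""
    return system_time + (tsc - tsc_timestamp)


def host_kernel_clock(tsc):
    """Host kernel clock in ns: raw TSC time, slowed down by NTP."""
    return tsc - tsc * NTP_SLOWDOWN_PPM // 10**6


tsc_now = 5 * NS_PER_DAY         # raw TSC ticks elapsed since time 0

# Snapshot taken at time 0: system_time = 0, tsc_timestamp = 0.
# The guest stores this value at location A (raw, uncorrected TSC time).
clock_a = guest_clock(0, 0, tsc_now)

# Suspend/resume: the snapshot is refreshed from the host kernel clock.
# The guest stores this value at location B (NTP-corrected time).
clock_b = guest_clock(host_kernel_clock(tsc_now), tsc_now, tsc_now)

print(clock_a - clock_b)         # ns by which the guest clock steps back
```

With these assumed numbers the value stored at B is 21.6 seconds behind
the value stored at A, with no TSC ticks elapsed between the two reads.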

> 
> I'm still not seeing the issue.

I've added two items to the three snapshots above, which will hopefully
make it clearer.

> 
> The formula is:
> 
> (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
> pvti->tsc_shift) + pvti->system_time
> 
> Obviously, if you reset pvti->tsc_timestamp to the current tsc value
> after suspend/resume, you would also need to update system_time.
> 
> I don't see what this has to do with suspend/resume or with whether
> the effective scale factor is greater than or less than one.  The only
> suspend/resume interaction I can see is that, if the host allows the
> guest-observed TSC value to jump (which is arguably a bug, but that's
> not important here), it needs to update pvti before resuming the
> guest.
> 
> Can you clarify concretely what goes wrong here?
> 
> (I'm also at a bit of a loss as to why this needs both system_time and
> tsc_timestamp.  They're redundant in the sense that you could set
> tsc_timestamp to zero and subtract (tsc_timestamp * tsc_to_system_mul)
> >> tsc_shift to system_time without changing the result of the
> calculation.)
> 
> --Andy
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
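
For reference, the formula quoted above and the remark that
tsc_timestamp and system_time are redundant can be tried out with a
small Python sketch. It follows the formula exactly as written in the
quoted mail, which is not necessarily the exact arithmetic the kernel
uses. The function name pvclock_read() and all the numbers are
assumptions chosen for illustration.

```python
def pvclock_read(tsc, tsc_timestamp, tsc_to_system_mul, tsc_shift, system_time):
    """Guest time, computed with the formula as quoted in the mail."""
    delta = tsc - tsc_timestamp
    return ((delta * tsc_to_system_mul) >> tsc_shift) + system_time


# Hypothetical pvti contents: mul / 2**shift = 0.5 ns per TSC tick.
MUL, SHIFT = 1 << 31, 32
TSC_TIMESTAMP = 1_000_001
SYSTEM_TIME = 5_000
tsc = 3_000_000

original = pvclock_read(tsc, TSC_TIMESTAMP, MUL, SHIFT, SYSTEM_TIME)

# Fold tsc_timestamp into system_time and use tsc_timestamp = 0 instead.
# The folded value can go negative here; a real u64 field would wrap.
folded_system_time = SYSTEM_TIME - ((TSC_TIMESTAMP * MUL) >> SHIFT)
folded = pvclock_read(tsc, 0, MUL, SHIFT, folded_system_time)

print(original, folded)
```

With these numbers the two results are 1004999 and 1005000. The folded
form matches the original up to one unit of rounding, because the right
shift truncates once per product rather than once on the difference.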