From mboxrd@z Thu Jan 1 00:00:00 1970 From: Marcelo Tosatti Subject: Re: kvmclock doesn't work, help? Date: Fri, 18 Dec 2015 19:49:30 -0200 Message-ID: <20151218214928.GA32177@amt.cnet> References: <20151216215731.GA9950@amt.cnet> <20151217190850.GA13981@amt.cnet> <20151218114734.GA28306@amt.cnet> <20151218194538.GC28381@amt.cnet> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Paolo Bonzini , kvm list , Radim Krcmar , X86 ML , John Stultz To: Andy Lutomirski Return-path: Received: from mx1.redhat.com ([209.132.183.28]:35242 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751995AbbLULyA (ORCPT ); Mon, 21 Dec 2015 06:54:00 -0500 Content-Disposition: inline In-Reply-To: Sender: kvm-owner@vger.kernel.org List-ID: On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote: > [cc: John Stultz -- maybe you have ideas on how this should best > integrate with the core code] > > On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti wrote: > > On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote: > >> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti wrote: > >> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote: > >> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti wrote: > >> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: > >> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti wrote: > >> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: > >> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski wrote: > >> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini wrote: > >> >> >> >> >> > >> >> >> >> >> > >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: > >> >> >> >> >>> > RAW TSC NTP corrected TSC > >> >> >> >> >>> > t0 10 10 > >> >> >> >> >>> > t1 20 19.99 > >> >> >> >> >>> > t2 30 29.98 > >> >> >> >> >>> > t3 40 39.97 > >> >> >> >> >>> > t4 50 49.96 > >> >> > > >> >> > (1) > >> >> > > >> >> >> >> >>> > > >> >> >> >> >>> > ... > >> >> >> >> >>> > > >> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, > >> >> >> >> >>> > you can see what will happen. > >> >> >> >> >>> > >> >> >> >> >>> Sure, but why would you ever switch from one to the other? > >> >> >> >> >> > >> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. After > >> >> >> >> >> resume, the TSC certainly increases at the same rate as before, but the > >> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower > >> >> >> >> >> than the guest kvmclock. > >> >> >> >> > > >> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP? > >> >> >> >> > > >> >> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to > >> >> >> >> > the NTP corrected value? If so, the guest wouldn't see time go > >> >> >> >> > backwards. > >> >> >> >> > > >> >> >> >> > If it's the guest's, then the guest's NTP correction is applied on top > >> >> >> >> > of kvmclock, and this shouldn't matter. > >> >> >> >> > > >> >> >> >> > I still feel like I'm missing something very basic here. > >> >> >> >> > > >> >> >> >> > >> >> >> >> OK, I think I get it. > >> >> >> >> > >> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's > >> >> >> >> correction to the guest. If it did, indeed, propagate the correction > >> >> >> >> then, after resume, the host's new system_time would match the guest's > >> >> >> >> idea of it (after accounting for the guest's long nap), and I don't > >> >> >> >> think there would be a problem. > >> >> >> >> That being said, I can't find the code in the masterclock stuff that > >> >> >> >> would actually do this. > >> >> >> > > >> >> >> > Guest clock is maintained by guest timekeeping code, which does: > >> >> >> > > >> >> >> > timer_interrupt() > >> >> >> > offset = read clocksource since last timer interrupt > >> >> >> > accumulate_to_systemclock(offset) > >> >> >> > > >> >> >> > The frequency correction of NTP in the host can be applied to > >> >> >> > kvmclock, which will be visible to the guest > >> >> >> > at "read clocksource since last timer interrupt" > >> >> >> > (kvmclock_clocksource_read function). > >> >> >> > >> >> >> pvclock_clocksource_read? That seems to do the same thing as all the > >> >> >> other clocksource access functions. > >> >> >> > >> >> >> > > >> >> >> > This does not mean that the NTP correction in the host is propagated > >> >> >> > to the guests system clock directly. > >> >> >> > > >> >> >> > (For example, the guest can run NTP which is free to do further > >> >> >> > adjustments at "accumulate_to_systemclock(offset)" time). > >> >> >> > >> >> >> Of course. But I expected that, in the absence of NTP on the guest, > >> >> >> that the guest would track the host's *corrected* time. > >> >> >> > >> >> >> > > >> >> >> >> If, on the other hand, the host's NTP correction is not supposed to > >> >> >> >> propagate to the guest, > >> >> >> > > >> >> >> > This is optional. There is a module option to control this, in fact. > >> >> >> > > >> >> >> > Its nice to have, because then you can execute a guest without NTP > >> >> >> > (say without network connection), and have a kvmclock (kvmclock is a > >> >> >> > clocksource, not a guest system clock) which is NTP corrected. > >> >> >> > >> >> >> Can you point to how this works? I found kvm_guest_time_update, whch > >> >> >> is called under circumstances that I haven't untangled. I can't > >> >> >> really tell what it's trying to do. > >> >> > > >> >> > Documentation/virtual/kvm/timekeeping.txt. > >> >> > > >> >> > >> >> That document is really long. I skimmed it and found nothing. > >> > > >> > kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set. > >> > > >> > This happens when: > >> > - kvmclock is enabled or disabled by the guest. > >> > - periodically to propagate NTP correction to kvmclock clock. > >> > - guest vcpu switching between host pcpus when TSCs are out of sync. > >> > - after migration. > >> > - after savevm/loadvm. > >> > > >> >> >> In any case, this still seems much more convoluted than it has to be. > >> >> >> In the case in which the host has a stable TSC (tsc is selected in the > >> >> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all > >> >> >> the time on the last few generations of CPUs, then the core > >> >> >> timekeeping code is already exposing a linear function that's supposed > >> >> >> to be used for monotonic, cpu-local access to a corrected nanosecond > >> >> >> counter. It's even in pretty much exactly the right form to pass > >> >> >> through to the guest via pvclock in the gtod data. Why doesn't KVM > >> >> >> pass it through verbatim, updated in real time? Is there some legacy > >> >> >> reason that KVM must apply its own corrections and has to jump through > >> >> >> hoops to pause vcpus when updating those vcpu's copies of the pvclock > >> >> >> data? > >> >> > > >> >> > Read the comment on x86.c which starts with > >> >> > " * > >> >> > * Assuming a stable TSC across physical CPUS, and a stable TSC > >> >> > * across virtual CPUs, the following condition is possible. > >> >> > * Each numbered line represents an event visible to both > >> >> > * CPUs at the next numbered event. > >> >> > " > >> >> > >> >> A couple things: > >> >> > >> >> 1. That says: timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M)) > >> >> > >> >> but that's wrong, I think. rdtsc is a function, not a number. > >> > > >> > View it as a number, then its correct. > >> > > >> >> Shouldn't it be: > >> >> > >> >> timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M)) > >> > > >> > Think "rdtsc" is one number (rdtsc0 = rdtsc1). > >> > > >> >> which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M < > >> >> N + (rdtsc1 - rdtsc0)? > >> >> > >> >> That doesn't change the conclusion. > >> >> > >> >> In any case, I'm not arguing that the concept of a master copy is > >> >> unnecessary; I'm arguing that the implementation, the calculations, > >> >> and the machinations in the code are all very, very complicated. All > >> >> that should be needed is to keep all of the vcpu pvti copies the same > >> >> and to make sure that you can't ever have one vcpu see a new copy and > >> >> then another vcpu see an old copy. > >> > > >> > Yes, you can't allow two vcpus to see a different copy of the pvti > >> > structure. > >> > >> Two options: > >> > >> 1. Pause all vcpus, then update all the pvti copies, then unpause all > >> vcpus. This would work, but it's expensive. > > > > This is what is done today. > > > >> > >> 2. Increment all the pvti version numbers, then update all of them, > >> then increment the version numbers again. > >> > >> I think option 2 is a lot nicer than option 1. > > > > Can you write an actual proposal (with details) that accomodates the > > issue described at "Assuming a stable TSC across physical CPUS, and a > > stable TSC" ? > > > > Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for > > realtime guests. > > This shouldn't require many details, and I don't think there's an ABI > change. The rules are: > > When the overall system timebase changes (e.g. when the selected > clocksource changes or when update_pvclock_gtod is called), the KVM > host would: > > optionally: preempt_disable(); /* for performance */ > > for all vms { > > for all registered pvti structures { > pvti->version++; /* should be odd now */ > } pvti is userspace data, so you have to pin it before? > /* Note: right now, any vcpu that tries to access pvti will start > infinite looping. We should add cpu_relax() to the guests. */ > > for all registered pvti structures { > update everything except pvti->version; > } > > for all registered pvti structures { > pvti->version++; /* should be even now */ > } > > cond_resched(); > } > > Is this enough detail? This should work with all existing guests, > too, unless there's a buggy guest out there that actually fails to > double-check version. What is the advantage of this over the brute force method, given that guests will busy spin? (busy spin is equally problematic as IPI for realtime guests). > The only reason for an ABI change that I can see would be to arrange > to have only one pvti in the loop. (And, if we changed the ABI, let's > also get rid of one of the redundant addend fields by promising to set > it to zero while we're at it.) Actually, if we changed the ABI, we > might want to do it more like traditional IO. The host would set up a > single page shared by all guests that has all PVTI data, and all of > the guests would map it in. This would save memory and avoid a loop, > at least if no legacy guests are around. > > --Andy