Re: x86: kvm: Revert "remove sched notifier for cross-cpu migrations"

From: Marcelo Tosatti <mtosatti@redhat.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Radim Krcmar <rkrcmar@redhat.com>, kvm list <kvm@vger.kernel.org>,
	stable <stable@vger.kernel.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Rik van Riel <riel@redhat.com>
Subject: Re: x86: kvm: Revert "remove sched notifier for cross-cpu migrations"
Date: Thu, 26 Mar 2015 20:38:04 -0300	[thread overview]
Message-ID: <20150326233804.GA31408@amt.cnet> (raw)
In-Reply-To: <CALCETrXmwMjv3bkO2k21sWFyPqQDvZzd2bX-v2_Z0EJ9=igsZg@mail.gmail.com>

On Thu, Mar 26, 2015 at 04:28:37PM -0700, Andy Lutomirski wrote:
> On Thu, Mar 26, 2015 at 4:22 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Thu, Mar 26, 2015 at 04:09:53PM -0700, Andy Lutomirski wrote:
> >> On Thu, Mar 26, 2015 at 3:56 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> > On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote:
> >> >> On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar <rkrcmar@redhat.com> wrote:
> >> >> > 2015-03-26 11:51-0700, Andy Lutomirski:
> >> >> >> On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> >> >> > On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
> >> >> >> >> Suppose we start out with all vcpus agreeing on their pvti and perfect
> >> >> >> >> invariant TSCs.  Now the host updates its frequency (due to NTP or
> >> >> >> >> whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
> >> >> >> >> pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
> >> >> >> >> They'll disagree on the time, and one of them will be ahead until vcpu
> >> >> >> >> 1's pvti gets updated.
> >> >> >> >
> >> >> >> > The masterclock scheme enforces the same system_timestamp/tsc_timestamp pairs
> >> >> >> > to be visible at one time, for all vcpus.
> >> >> >> >
> >> >> >> >
> >> >> >> >  * That is, when timespec0 != timespec1, M < N. Unfortunately that is
> >> >> >> >  * not
> >> >> >> >  * always the case (the difference between two distinct xtime instances
> >> >> >> >  * might be smaller then the difference between corresponding TSC reads,
> >> >> >> >  * when updating guest vcpus pvclock areas).
> >> >> >> >  *
> >> >> >> >  * To avoid that problem, do not allow visibility of distinct
> >> >> >> >  * system_timestamp/tsc_timestamp values simultaneously: use a master
> >> >> >> >  * copy of host monotonic time values. Update that master copy
> >> >> >> >  * in lockstep.
> >> >> >>
> >> >> >> Yuck.  So we have per cpu timing data, but the protocol is only usable
> >> >> >> for monotonic timing because we forcibly freeze all vcpus when we
> >> >> >> update the nominally per cpu data.
> >> >> >>
> >> >> >> The obvious guest implementations are still unnecessarily slow,
> >> >> >> though.  It would be nice if the guest could get away without using
> >> >> >> any getcpu operation at all.
> >> >> >>
> >> >> >> Even if we fixed the host to increment version as advertised, I think
> >> >> >> we can't avoid two getcpu ops.  We need one before rdtsc to figure out
> >> >> >> which pvti to look at,
> >> >> >
> >> >> > Yes.
> >> >> >
> >> >> >>                        and we need another to make sure that we were
> >> >> >> actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
> >> >> >> -- we need to check version before rdtsc, and we don't know what
> >> >> >> version to check until we do a getcpu.).
> >> >> >
> >> >> > Exactly, reading cpuid after rdtsc doesn't do that though, we could have
> >> >> > migrated back between those reads.
> >> >> > rtdscp would allow us to check that we read tsc of pvti's cpu.
> >> >> > (It doesn't get rid of that first read.)
> >> >> >
> >> >> >>                                          The migration hook has the
> >> >> >> same issue -- we need to check the migration count, then confirm we're
> >> >> >> on that cpu, then check the migration count again, and we can't do
> >> >> >> that until we know what cpu we're on.
> >> >> >
> >> >> > True;  the revert has a bug -- we need to check cpuid for the second
> >> >> > time before rdtsc.  (Migration hook is there just because we don't know
> >> >> > which cpu executed rdtsc.)
> >> >>
> >> >> One way or another, I'm planning on completely rewriting the vdso
> >> >> code.  An early draft is here:
> >> >>
> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso&id=57ace6e6e032afc4faf7b9ec52f78a8e6642c980
> >> >>
> >> >> but I can't finish it until the KVM side shakes out.
> >> >>
> >> >> I think there are at least two ways that would work:
> >> >>
> >> >> a) If KVM incremented version as advertised:
> >> >
> >> > All for it.
> >> >
> >> >> cpu = getcpu();
> >> >> pvti = pvti for cpu;
> >> >>
> >> >> ver1 = pvti->version;
> >> >> check stable bit;
> >> >> rdtsc_barrier, rdtsc, read scale, shift, etc.
> >> >> if (getcpu() != cpu) retry;
> >> >> if (pvti->version != ver1) retry;
> >> >>
> >> >> I think this is safe because, we're guaranteed that there was an
> >> >> interval (between the two version reads) in which the vcpu we think
> >> >> we're on was running and the kvmclock data was valid and marked
> >> >> stable, and we know that the tsc we read came from that interval.
> >> >>
> >> >> Note: rdtscp isn't needed. If we're stable, is makes no difference
> >> >> which cpu's tsc we actually read.
> >> >
> >> > Yes, can't see a problem with that.
> >> >
> >> >> b) If version remains buggy but we use this migrations_from hack:
> >> >
> >> > There is no reason for version to remain buggy.
> >> >
> >> >> cpu = getcpu();
> >> >> pvti = pvti for cpu;
> >> >> m1 = pvti->migrations_from;
> >> >> barrier();
> >> >>
> >> >> ver1 = pvti->version;
> >> >> check stable bit;
> >> >> rdtsc_barrier, rdtsc, read scale, shift, etc.
> >> >> if (getcpu() != cpu) retry;
> >> >> if (pvti->version != ver1) retry;  /* probably not really needed */
> >> >>
> >> >> barrier();
> >> >> if (pvti->migrations_from != m1) retry;
> >> >>
> >> >> This is just like (a), except that we're using a guest kernel hack to
> >> >> ensure that no one migrated off the vcpu during the version-protected
> >> >> critical section and that we were, in fact, on that vcpu at some point
> >> >> during that critical section.  Once we've ensured that we were on
> >> >> pvti's associated vcpu for the entire time we were reading it, then we
> >> >> are protected by the existing versioning in the host.
> >> >>
> >> >> >
> >> >> >> If, on the other hand, we could rely on having all of these things in
> >> >> >> sync, then this complication goes away, and we go down from two getcpu
> >> >> >> ops to zero.
> >> >> >
> >> >> > (Yeah, we should look what are the drawbacks of doing it differently.)
> >> >>
> >> >> If the versioning were fixed, I think we could almost get away with:
> >> >>
> >> >> pvti = pvti for vcpu 0;
> >> >>
> >> >> ver1 = pvti->version;
> >> >> check stable bit;
> >> >> rdtsc_barrier, rdtsc, read scale, shift, etc.
> >> >> if (pvti->version != ver1) retry;
> >> >>
> >> >> This guarantees that the tsc came from an interval in which vcpu0's
> >> >> kvmclock was *marked* stable.  If vcpu0's kvmclock were genuinely
> >> >> stable in that interval, then we'd be fine, but there's a race window
> >> >> in which the kvmclock is *not* stable and vcpu 0 wasn't running.
> >> >
> >> > What is that window again ? Have no objections against using vcpu0's
> >> > pvti (cacheline should be read-only 99.9% of time).
> >>
> >> This is based on my (mis?)understanding of the code.  Here goes.
> >>
> >> Suppose we transition from stable to unstable.  The host freezes all
> >> vcpus and set a flag for each vcpu that the kvmclock data needs
> >> updating.  There could then be a window in which vcpu 1 runs vdso code
> >> and vcpu 0 hasn't updated its kvmclock data yet.
> >>
> >> I don't know whether this is actually possible.  Rik suspects it isn't.
> >
> > I don't see why its not possible. We can force vcpu0's kvmclock to be
> > updated.
> >
> > Do you have an estimation of the performance gain of using vcpu0 pvti?
> 
> rdtscp is approximately 3 cycles worse than rdtsc_barrier(); rdtsc()
> on my machine, but that doesn't help much, since the current scheme
> requires that we get the cpu number before reading the time.  The
> fastest way I know of to get the cpu number is 25-ish cycles, so this
> ends up being a large fraction of the overall cost.

It should be possible to force vcpu0 kvmclock area to be updated before
vm-entry, on the update of any other vcpuN kvmclock area.

        kvm_make_mclock_inprogress_request(kvm);
        /* no guest entries from this point */
        pvclock_update_vm_gtod_copy(kvm);

	<---- HERE ------->

        kvm_for_each_vcpu(i, vcpu, kvm)
                kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

        /* guest entries allowed */
        kvm_for_each_vcpu(i, vcpu, kvm)
                clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);

> >> >> Why doesn't KVM just update all of the kvmclock data at once?
> >> >
> >> > Because it has not been necessary -- updating kvmclock data on vcpu
> >> > entry was the previous method, so that was reused.
> >> >
> >> >> (For
> >> >> that matter, why is the pvti in guest memory at all?  Wouldn't this
> >> >> all be simpler if the kvmclock data were host-allocated so the host
> >> >> could write it directly and maybe even share it between guests?)
> >> >
> >> > And use a 4K TLB entry for that kvmclock area rather than
> >> > sharing one of kernel's 2MB (or 1GB) TLB entry?
> >>
> >> Exactly.  I'd also move it into the vvar area instead of using the
> >> fixmap so 32-bit userspace could use it.
> >
> > Thats an obvious downside (an additional entry in the TLB occupied just
> > for the kvmclock area?).
> 
> True.  But TLB misses are actually pretty cheap on new hardware, and
> it's still only one TLB entry per process.
> >> I'm more than happy to handle the vdso side of all of this, but I'd
> >> like the host code to settle down first.
> >> I'm also not quite sure whether it's okay to cause the vdso timing
> >> code to regress on old hosts with new gusts.
> >
> > Must be backwards compatible.
> >
> 
> Could we add a new feature bit that indicates that the host is
> updated?  Then we could use the new code on new hosts and the old code
> on old hosts and eventually deprecate accelerated vdso timing on old
> hosts.

Makes sense.