From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932336AbbAFS0p (ORCPT ); Tue, 6 Jan 2015 13:26:45 -0500 Received: from mail-lb0-f174.google.com ([209.85.217.174]:51766 "EHLO mail-lb0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932189AbbAFS0n (ORCPT ); Tue, 6 Jan 2015 13:26:43 -0500 MIME-Version: 1.0 In-Reply-To: <20150106181331.GA24590@amt.cnet> References: <8d09c16eb39cbe264417cc66c4aca730af10b70b.1419295081.git.luto@amacapital.net> <20150105152511.GA9172@amt.cnet> <20150105191756.GA31201@amt.cnet> <20150105224858.GA6846@amt.cnet> <54AB9FFD.6070309@redhat.com> <54ABCE85.6070004@redhat.com> <20150106181331.GA24590@amt.cnet> From: Andy Lutomirski Date: Tue, 6 Jan 2015 10:26:22 -0800 Message-ID: Subject: Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader To: Marcelo Tosatti Cc: Paolo Bonzini , "xen-devel@lists.xenproject.org" , "linux-kernel@vger.kernel.org" , kvm list , Gleb Natapov Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti wrote: > On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote: >> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" wrote: >> > >> > >> > >> > On 06/01/2015 09:42, Paolo Bonzini wrote: >> > > > > Still confused. So we can freeze all vCPUs in the host, then update >> > > > > pvti 1, then resume vCPU 1, then update pvti 0? In that case, we have >> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM >> > > > > doesn't increment the version pre-update, and we can return completely >> > > > > bogus results. >> > > > Yes. >> > > But then the getcpu test would fail (1->0). Even if you have an ABA >> > > situation (1->0->1), it's okay because the pvti that is fetched is the >> > > one returned by the first getcpu. >> > >> > ... this case of partial update of pvti, which is caught by the version >> > field, if of course different from the other (extremely unlikely) that >> > Andy pointed out. That is when the getcpus are done on the same vCPU, >> > but the rdtsc is another. >> > >> > That one can be fixed by rdtscp, like >> > >> > do { >> > // get a consistent (pvti, v, tsc) tuple >> > do { >> > cpu = get_cpu(); >> > pvti = get_pvti(cpu); >> > v = pvti->version & ~1; >> > // also acts as rmb(); >> > rdtsc_barrier(); >> > tsc = rdtscp(&cpu1); >> >> Off-topic note: rdtscp doesn't need a barrier at all. AIUI AMD >> specified it that way and both AMD and Intel implement it correctly. >> (rdtsc, on the other hand, definitely needs the barrier beforehand.) >> >> > // control dependency, no need for rdtsc_barrier? >> > } while(cpu != cpu1); >> > >> > // ... compute nanoseconds from pvti and tsc ... >> > rmb(); >> > } while(v != pvti->version); >> >> Still no good. We can migrate a bunch of times so we see the same CPU >> all three times and *still* don't get a consistent read, unless we >> play nasty games with lots of version checks (I have a patch for that, >> but I don't like it very much). The patch is here: >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d >> >> but I don't like it. >> >> Thus far, I've been told unambiguously that a guest can't observe pvti >> while it's being written, and I think you're now telling me that this >> isn't true and that a guest *can* observe pvti while it's being >> written while the low bit of the version field is not set. If so, >> this is rather strongly incompatible with the spec in the KVM docs. >> >> I don't suppose that you and Marcelo could agree on what the actual >> semantics that KVM provides are and could write it down in a way that >> people who haven't spent a long time staring at the request code >> understand? And maybe you could even fix the implementation while >> you're at it if the implementation is, indeed, broken. I have ugly >> patches to fix it here: >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0 >> >> but I'm not thrilled with them. >> >> --Andy > > I suppose that separating the version write from the rest of the pvclock > structure is sufficient, as that would guarantee the writes are not > reordered even with fast string REP MOVS. > > Thanks for catching this Andy! > Don't you stil need: version++; write the rest; version++; with possible smp_wmb() in there to keep the compiler from messing around? Also, if you do this, can you also make setting and clearing STABLE_BIT properly atomic across all vCPUs? Or at least do something like setting it last and clearing it first on vPCU 0? --Andy