From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:52568) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a8q8n-0001ow-HV for qemu-devel@nongnu.org; Tue, 15 Dec 2015 08:59:46 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1a8q8i-0006iT-DO for qemu-devel@nongnu.org; Tue, 15 Dec 2015 08:59:45 -0500 Received: from mail-io0-x22e.google.com ([2607:f8b0:4001:c06::22e]:33509) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a8q8i-0006iO-5F for qemu-devel@nongnu.org; Tue, 15 Dec 2015 08:59:40 -0500 Received: by mail-io0-x22e.google.com with SMTP id 186so20256902iow.0 for ; Tue, 15 Dec 2015 05:59:39 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <566E9751.3020703@redhat.com> References: <1450082498-27109-1-git-send-email-a.rigo@virtualopensystems.com> <566E8CE3.2040104@redhat.com> <566E9751.3020703@redhat.com> Date: Tue, 15 Dec 2015 14:59:39 +0100 Message-ID: From: alvise rigo Content-Type: text/plain; charset=UTF-8 Subject: Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Paolo Bonzini Cc: mttcg@greensocs.com, Claudio Fontana , QEMU Developers , "Emilio G. Cota" , Jani Kokkonen , VirtualOpenSystems Technical Team , =?UTF-8?B?QWxleCBCZW5uw6ll?= , Richard Henderson Hi Paolo, On Mon, Dec 14, 2015 at 11:17 AM, Paolo Bonzini wrote: > > > On 14/12/2015 11:04, alvise rigo wrote: >> In any case, what I proposed in the mttcg based v5 was: >> - A LL ensures that the TLB_EXCL flag is set on all the CPU's TLB. >> This is done by querying a TLB flush to all (not exactly all...) the >> CPUs. To be 100% safe, probably we should also wait that the flush is >> actually performed >> - A TLB_EXCL flag set always forces the slow-path, allowing the CPUs >> to check for possible collision with a "exclusive memory region" >> >> Now, why the fact of querying the flush (and possibly ensuring that >> the flush has been actually done) should not be enough? > > There will always be a race where the normal store fails. While I > haven't studied your code enough to do a constructive proof, it's enough > to prove the impossibility of what you're trying to do. Mind, I also > believed for a long time that it was possible to do it! > > If we have two CPUs, with CPU 0 executing LL and the CPU 1 executing a > store, you can model this as a consensus problem. For example, CPU 0 > could propose that the subsequent SC succeeds, while CPU 1 proposes that > it fails. The outcome of the SC instruction depends on who wins. I see your point. This, as you wrote, holds only when we attempt to make the fast path wait-free. However, the implementation I proposed is not wait-free and somehow serializes the accesses made to the shared resources (that will determine if the access was successful or not) by means of a mutex. The assumption I made - and somehow verified - is that the "colliding fast accesses" are rare. I guess you also agree on this, otherwise how could a wait-free implementation possibly work without being coupled with primitives with appropriate consensus number? Thank you, alvise > > Therefore, implementing LL/SC problem requires---on both CPU 0 and CPU > 1, and hence for both LL/SC and normal store---an atomic primitive with > consensus number >= 2. Other than LL/SC itself, the commonly-available > operations satisfying this requirement are test-and-set (consensus > number 2) and compare-and-swap (infinite consensus number). Normal > memory reads and writes (called "atomic registers" in multi-processing > research lingo) have consensus number 1; it's not enough. > > If the host had LL/SC, CPU 1 could in principle delegate its side of the > consensus problem to the processor; but even that is not a solution > because processors constrain the instructions that can appear between > the load and the store, and this could cause an infinite sequence of > spurious failed SCs. Another option is transactional memory, but it's > also too slow for normal stores. > > The simplest solution is not to implement full LL/SC semantics; instead, > similar to linux-user, a SC operation can perform a cmpxchg from the > value fetched by LL to the argument of SC. This bypasses the issue > because stores do not have to be instrumented at all, but it does mean > that the emulation suffers from the ABA problem. > > TLB_EXCL is also a middle-ground, a little bit stronger than cmpxchg. > It's more complex and more accurate, but also not perfect. Which is > okay, but has to be documented. > > Paolo