From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:52568)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <a.rigo@virtualopensystems.com>) id 1a8q8n-0001ow-HV
	for qemu-devel@nongnu.org; Tue, 15 Dec 2015 08:59:46 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <a.rigo@virtualopensystems.com>) id 1a8q8i-0006iT-DO
	for qemu-devel@nongnu.org; Tue, 15 Dec 2015 08:59:45 -0500
Received: from mail-io0-x22e.google.com ([2607:f8b0:4001:c06::22e]:33509)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <a.rigo@virtualopensystems.com>) id 1a8q8i-0006iO-5F
	for qemu-devel@nongnu.org; Tue, 15 Dec 2015 08:59:40 -0500
Received: by mail-io0-x22e.google.com with SMTP id 186so20256902iow.0
	for <qemu-devel@nongnu.org>; Tue, 15 Dec 2015 05:59:39 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <566E9751.3020703@redhat.com>
References: <1450082498-27109-1-git-send-email-a.rigo@virtualopensystems.com>
	<566E8CE3.2040104@redhat.com>
	<CAH47eN2BG0mgmqEV2ZnNQ=+WMK+or-xcAxoCNwkpmY3io=0PjQ@mail.gmail.com>
	<566E9751.3020703@redhat.com>
Date: Tue, 15 Dec 2015 14:59:39 +0100
Message-ID: <CAH47eN0dfyet8muxf7msNHj9CG8FaQNWU10a4FE1gpJRZ5Y58w@mail.gmail.com>
From: alvise rigo <a.rigo@virtualopensystems.com>
Content-Type: text/plain; charset=UTF-8
Subject: Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction
	translation
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: mttcg@greensocs.com, Claudio Fontana <claudio.fontana@huawei.com>, QEMU Developers <qemu-devel@nongnu.org>, "Emilio G. Cota" <cota@braap.org>, Jani Kokkonen <jani.kokkonen@huawei.com>, VirtualOpenSystems Technical Team <tech@virtualopensystems.com>, =?UTF-8?B?QWxleCBCZW5uw6ll?= <alex.bennee@linaro.org>, Richard Henderson <rth@twiddle.net>

Hi Paolo,

On Mon, Dec 14, 2015 at 11:17 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 14/12/2015 11:04, alvise rigo wrote:
>> In any case, what I proposed in the mttcg based v5 was:
>> - A LL ensures that the TLB_EXCL flag is set on all the CPU's TLB.
>> This is done by querying a TLB flush to all (not exactly all...) the
>> CPUs. To be 100% safe, probably we should also wait that the flush is
>> actually performed
>> - A TLB_EXCL flag set always forces the slow-path, allowing the CPUs
>> to check for possible collision with a "exclusive memory region"
>>
>> Now, why the fact of querying the flush (and possibly ensuring that
>> the flush has been actually done) should not be enough?
>
> There will always be a race where the normal store fails.  While I
> haven't studied your code enough to do a constructive proof, it's enough
> to prove the impossibility of what you're trying to do.  Mind, I also
> believed for a long time that it was possible to do it!
>
> If we have two CPUs, with CPU 0 executing LL and the CPU 1 executing a
> store, you can model this as a consensus problem.  For example, CPU 0
> could propose that the subsequent SC succeeds, while CPU 1 proposes that
> it fails.  The outcome of the SC instruction depends on who wins.

I see your point. This, as you wrote, holds only when we attempt to
make the fast path wait-free.
However, the implementation I proposed is not wait-free and somehow
serializes the accesses made to the shared resources (that will
determine if the access was successful or not) by means of a mutex.
The assumption I made - and somehow verified - is that the "colliding
fast accesses" are rare. I guess you also agree on this, otherwise how
could a wait-free implementation possibly work without being coupled
with primitives with appropriate consensus number?

Thank you,
alvise

>
> Therefore, implementing LL/SC problem requires---on both CPU 0 and CPU
> 1, and hence for both LL/SC and normal store---an atomic primitive with
> consensus number >= 2.  Other than LL/SC itself, the commonly-available
> operations satisfying this requirement are test-and-set (consensus
> number 2) and compare-and-swap (infinite consensus number).  Normal
> memory reads and writes (called "atomic registers" in multi-processing
> research lingo) have consensus number 1; it's not enough.
>
> If the host had LL/SC, CPU 1 could in principle delegate its side of the
> consensus problem to the processor; but even that is not a solution
> because processors constrain the instructions that can appear between
> the load and the store, and this could cause an infinite sequence of
> spurious failed SCs.  Another option is transactional memory, but it's
> also too slow for normal stores.
>
> The simplest solution is not to implement full LL/SC semantics; instead,
> similar to linux-user, a SC operation can perform a cmpxchg from the
> value fetched by LL to the argument of SC.  This bypasses the issue
> because stores do not have to be instrumented at all, but it does mean
> that the emulation suffers from the ABA problem.
>
> TLB_EXCL is also a middle-ground, a little bit stronger than cmpxchg.
> It's more complex and more accurate, but also not perfect.  Which is
> okay, but has to be documented.
>
> Paolo