Re: LSE atomic op ordering is weaker than intended?

From: Will Deacon <will@kernel.org>
To: Hector Martin <marcan@marcan.st>
Cc: Linux ARM <linux-arm-kernel@lists.infradead.org>,
	Marc Zyngier <maz@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	Catalin Marinas <catalin.marinas@arm.com>
Subject: Re: LSE atomic op ordering is weaker than intended?
Date: Wed, 3 Mar 2021 15:36:19 +0000	[thread overview]
Message-ID: <20210303153619.GA19247@willie-the-truck> (raw)
In-Reply-To: <90ea3e27-b2ef-41ac-75c7-5f0686603b2a@marcan.st>

Hi Hector,

On Wed, Mar 03, 2021 at 10:05:19PM +0900, Hector Martin wrote:
> While yak shaving the AIC driver ordering minutiae, I came across this.
> 
> atomic_t.txt describes "fully ordered" atomic ops as follows:
> 
> > Fully ordered primitives are ordered against everything prior and
> > everything subsequent. Therefore a fully ordered primitive is like
> > having an smp_mb() before and an smp_mb() after the primitive.
> 
> And among those ops are the atomic_fetch_* ops. These are implemented as
> e.g. LDSETAL, with acquire-release semantics.
> 
> However, the *AL LSE ops have acquire semantics *for the read* and release
> semantics *for the write*. As independent components of the same atomic op,
> I cannot find anything in the ARM ARM that would imply ordering between the
> Load-Acquire and *prior* memory operations, nor ordering between the
> Store-Release and *subsequent* memory operations.
> 
> So it would seem these ops are not in fact fully ordered, but rather, only
> order the read component against prior ops, and the write component against
> subsequent ops.
> 
> Put another way: the current implementation means that unqualified ops are
> equal to _acquire + _release semantics as they are described in
> atomic_t.txt, but that is weaker than "fully ordered".
> 
> Throwing this litmus test at herd7 seems to confirm this theory:
> 
> AArch64 lse-atomic-al-ops-are-not-fully-ordered
> ""
> {
> 0:X1=x; 0:X3=y;
> 1:X1=x; 1:X3=y;
> }
>  P0                   | P1                  ;
>  MOV X0, #1           | MOV X0, #1          ;
>  LDSETAL X0, X2, [X1] | LDSETAL X0, X2, [X3];
>  LDR X4, [X3]         | LDR X4, [X1]        ;
> exists (0:X4=0 /\ 1:X4=0)
> 
> The positive result goes away adding a DMB ISH (i.e. smp_mb()) after the
> atomic ops, which contradicts the atomic_t.txt claim.
> 
> Did I miss something, or is this in fact an issue?

Both. The -AL atomics are actually special-cased in the
"barrier-ordered-before" relation in the Arm ARM:

  [RW1 is barrier-ordered-before if]
  * RW1 appears in program order before an atomic instruction with both
    Acquire and Release semantics that appears in program order before
    RW2.

However, that isn't sufficient to order prior accesses with the "load part"
of the RmW and later accesses with the "store part" of the RmW, as you have
observed in your test. I'm aware of some pending proposals in this area of
the architecture, so I'm reluctant to make any changes until that's
bottomed-out, but I'll make a note to chase that up.

> (And while I'm talking to the right people: this issue aside, do atomic ops
> on Normal memory create ordering with Device memory ops, or are there no
> guarantees there due to the fact that Normal memory is mapped
> inner-shareable and the ordering guarantees thus do not extend to
> outer-shareable Device accesses? My currenty understanding is the latter,
> but I find the ARM ARM wording hard to conclusively grok here.)

Outer-shareable is a superset of inner-shareable, but I think this would be
easier with a specific example. I'll go and look at the AIC patch, since
this is all a lot easier to talk about in the context of some real code.

Which is the latest version I should look at?

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel