Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

From: Will Deacon <will.deacon@arm.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Michal Hocko <mhocko@kernel.org>,
	David Howells <dhowells@redhat.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@samba.org>
Subject: Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
Date: Thu, 12 Nov 2015 22:09:13 +0000	[thread overview]
Message-ID: <20151112220913.GF23979@arm.com> (raw)
In-Reply-To: <CA+55aFxr4JX8XEF12qCOjbj1r92w8yuHZ0vbUm9jgvY167Svsw@mail.gmail.com>

On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> On Wed, Nov 11, 2015 at 11:14 PM, Boqun Feng <boqun.feng@gmail.com> wrote:
> >
> > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> > only guarantees that the memory operations following spin_lock() can't
> > be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> > i.e. the case below can happen(assuming the spin_lock() is implemented
> > as ll/sc loop)
> >
> >         spin_lock(&lock):
> >           r1 = *lock; // LL, r1 == 0
> >         o = READ_ONCE(object); // could be reordered here.
> >           *lock = 1; // SC
> 
> It may be worth noting that at least in theory, not only reads may
> pass the store. If the spin-lock is done as
> 
>         r1 = *lock   // read-acquire, r1 == 0
>         *lock = 1    // SC
> 
> then even *writes* inside the locked region might pass up through the
> "*lock = 1".

Right, but only if we didn't have the control dependency to branch back
in the case that the SC failed. In that case, the lock would be broken
because you'd dive into the critical section even if you failed to take
the lock.

> So other CPU's - that haven't taken the spinlock - could see the
> modifications inside the critical region before they actually see the
> lock itself change.
> 
> Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> generally be that you have some external ordering guarantee that
> guarantees that the lock has been taken. For example, for the IPC
> semaphores, we do either one of:
> 
>  (a) get large lock, then - once you hold that lock - wait for each small lock
> 
> or
> 
>  (b) get small lock, then - once you hold that lock - check that the
> largo lock is unlocked
> 
> and that's the case we should really worry about.  The other uses of
> spin_unlock_wait() should have similar "I have other reasons to know
> I've seen that the lock was taken, or will never be taken after this
> because XYZ".
> 
> This is why powerpc has a memory barrier in "arch_spin_is_locked()".
> Exactly so that the "check that the other lock is unlocked" is
> guaranteed to be ordered wrt the store that gets the first lock.
> 
> It looks like ARM64 gets this wrong and is fundamentally buggy wrt
> "spin_is_locked()" (and, as a result, "spin_unlock_wait()").

I don't see how a memory barrier would help us here, and Boqun's example
had smp_mb() either sides of the spin_unlock_wait(). What we actually need
is to make spin_unlock_wait more like a LOCK operation, so that it forces
parallel lockers to replay their LL/SC sequences.

I'll write a patch once I've heard more back from Paul about
smp_mb__after_unlock_lock().

> BUT! And this is a bug BUT:
> 
> It should be noted that that is purely an ARM64 bug. Not a bug in our
> users. If you have a spinlock where the "get lock write" part of the
> lock can be delayed, then you have to have a "arch_spin_is_locked()"
> that has the proper memory barriers.
> 
> Of course, ARM still hides their architecture manuals in odd places,
> so I can't double-check. But afaik, ARM64 store-conditional is "store
> exclusive with release", and it has only release semantics, and ARM64
> really does have the above bug.

The store-conditional in our spin_lock routine has no barrier semantics;
they are enforced by the load-exclusive-acquire that pairs with it.

> On that note: can anybody point me to the latest ARM64 8.1
> architecture manual in pdf form, without the "you have to register"
> crap? I thought ARM released it, but all my googling just points to
> the idiotic ARM service center that wants me to sign away something
> just to see the docs. Which I don't do.

We haven't yet released a document for the 8.1 instructions, but I can
certainly send you a copy when it's made available.

Will