linux-kernel.vger.kernel.org archive mirror
* spin_lock implicit/explicit memory barrier
@ 2016-08-09 18:52 Manfred Spraul
  2016-08-10  0:05 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 20+ messages in thread
From: Manfred Spraul @ 2016-08-09 18:52 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt
  Cc: Andrew Morton, Davidlohr Bueso, Linux Kernel Mailing List,
	Susi Sonnenschein

Hi Benjamin, Hi Michael,

regarding commit 51d7d5205d33 ("powerpc: Add smp_mb() to 
arch_spin_is_locked()"):

For the ipc/sem code, I would like to replace the spin_is_locked() with 
a smp_load_acquire(), see:

http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n367

http://www.ozlabs.org/~akpm/mmots/broken-out/ipc-semc-fix-complex_count-vs-simple-op-race.patch

To my understanding, I must now add a smp_mb(), otherwise it would be 
broken on PowerPC:

The approach that the memory barrier is added into spin_is_locked() 
doesn't work because the code doesn't use spin_is_locked().

Correct?
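
For reference, a condensed sketch of the two sides in the patch above
(paraphrased and simplified; the complex_mode naming and barrier placement
follow the mmots tree):

    /* complex op entry: force simple ops onto the slow path and drain them */
    smp_store_mb(sma->complex_mode, true);
    for (i = 0; i < sma->sem_nsems; i++)
            spin_unlock_wait(&sma->sem_base[i].lock);

    /* simple op entry: per-semaphore fast path */
    spin_lock(&sem->lock);
    smp_mb();                       /* <-- the barrier this mail is asking about */
    if (!smp_load_acquire(&sma->complex_mode)) {
            /* fast path: only sem->lock is held */
    }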

--

     Manfred


* Re: spin_lock implicit/explicit memory barrier
  2016-08-09 18:52 spin_lock implicit/explicit memory barrier Manfred Spraul
@ 2016-08-10  0:05 ` Benjamin Herrenschmidt
  2016-08-10 18:21   ` Manfred Spraul
  2016-08-10 18:33   ` Paul E. McKenney
  0 siblings, 2 replies; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2016-08-10  0:05 UTC (permalink / raw)
  To: Manfred Spraul, Michael Ellerman
  Cc: Andrew Morton, Davidlohr Bueso, Linux Kernel Mailing List,
	Susi Sonnenschein, Paul E. McKenney

On Tue, 2016-08-09 at 20:52 +0200, Manfred Spraul wrote:
> Hi Benjamin, Hi Michael,
> 
> regarding commit 51d7d5205d33 ("powerpc: Add smp_mb() to 
> arch_spin_is_locked()"):
> 
> For the ipc/sem code, I would like to replace the spin_is_locked() with 
> a smp_load_acquire(), see:
> 
> http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n367
> 
> http://www.ozlabs.org/~akpm/mmots/broken-out/ipc-semc-fix-complex_count-vs-simple-op-race.patch
> 
> To my understanding, I must now add a smp_mb(), otherwise it would be 
> broken on PowerPC:
> 
> The approach that the memory barrier is added into spin_is_locked() 
> doesn't work because the code doesn't use spin_is_locked().
> 
> Correct?

Right, otherwise you aren't properly ordered. The current powerpc locks provide
good protection between what's inside vs. what's outside the lock, but not vs.
the lock *value* itself, so if, as you do in the sem code, you use the lock
value as something that is relevant in terms of ordering, you probably need
an explicit full barrier.

Adding Paul McKenney.

Cheers,
Ben.


* Re: spin_lock implicit/explicit memory barrier
  2016-08-10  0:05 ` Benjamin Herrenschmidt
@ 2016-08-10 18:21   ` Manfred Spraul
  2016-08-10 19:17     ` Davidlohr Bueso
  2016-08-10 20:52     ` Paul E. McKenney
  2016-08-10 18:33   ` Paul E. McKenney
  1 sibling, 2 replies; 20+ messages in thread
From: Manfred Spraul @ 2016-08-10 18:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Michael Ellerman
  Cc: Andrew Morton, Davidlohr Bueso, Linux Kernel Mailing List,
	Susanne Spraul, Paul E. McKenney, Peter Zijlstra

Hi,

[adding Peter, correcting Davidlohr's mail address]

On 08/10/2016 02:05 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2016-08-09 at 20:52 +0200, Manfred Spraul wrote:
>> Hi Benjamin, Hi Michael,
>>
>> regarding commit 51d7d5205d33 ("powerpc: Add smp_mb() to
>> arch_spin_is_locked()"):
>>
>> For the ipc/sem code, I would like to replace the spin_is_locked() with
>> a smp_load_acquire(), see:
>>
>> http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n367
>>
>> http://www.ozlabs.org/~akpm/mmots/broken-out/ipc-semc-fix-complex_count-vs-simple-op-race.patch
>>
>> To my understanding, I must now add a smp_mb(), otherwise it would be
>> broken on PowerPC:
>>
>> The approach that the memory barrier is added into spin_is_locked()
>> doesn't work because the code doesn't use spin_is_locked().
>>
>> Correct?
> Right, otherwise you aren't properly ordered. The current powerpc locks provide
> good protection between what's inside vs. what's outside the lock but not vs.
> the lock *value* itself, so if, like you do in the sem code, use the lock
> value as something that is relevant in term of ordering, you probably need
> an explicit full barrier.
>
> Adding Paul McKenney.
Just to be safe, let's write down all barrier pairs:
entry and exit, simple and complex, and switching simple to complex and 
vice versa.

(@Davidlohr: Could you crosscheck, did I overlook a pair?)

1)
spin_lock/spin_unlock pair.

2)

smp_load_acquire(&sma->complex_mode) and
smp_store_release(sma->complex_mode, true) pair.
http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n374
http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n321

The store_release guarantees that all data written by the complex_op
syscall is - after the load_acquire - visible to the simple_op syscall.

3)
smp_mb() [after spin_lock()] and smp_store_mb(sma->complex_mode, true) pair.
http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n287
http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n372

These are actually two pairs:
- Writing the lock variable must be observed by the task that does
  spin_unlock_wait()
- complex_mode must be observed by the task that does the
  smp_load_acquire()

4)
spin_unlock_wait() and spin_unlock() pair.
http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n291
http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n409

The data from the simple op must be observed by the following complex op.

Right now, there is still an smp_rmb() in line 300: the control barrier
from the loop inside spin_unlock_wait() is upgraded to an acquire barrier
by an additional smp_rmb().

Is this smp_rmb() required? If I understand commit 2c6100227116
("locking/qspinlock: Fix spin_unlock_wait() some more") right, with this
commit qspinlocks handle this case without the smp_rmb(). What I don't
know is whether powerpc is using qspinlock already, or whether powerpc
works without the smp_rmb().

--

     Manfred


* Re: spin_lock implicit/explicit memory barrier
  2016-08-10  0:05 ` Benjamin Herrenschmidt
  2016-08-10 18:21   ` Manfred Spraul
@ 2016-08-10 18:33   ` Paul E. McKenney
  1 sibling, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2016-08-10 18:33 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Manfred Spraul, Michael Ellerman, Andrew Morton, Davidlohr Bueso,
	Linux Kernel Mailing List, Susi Sonnenschein

On Wed, Aug 10, 2016 at 10:05:37AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2016-08-09 at 20:52 +0200, Manfred Spraul wrote:
> > Hi Benjamin, Hi Michael,
> > 
> > regarding commit 51d7d5205d33 ("powerpc: Add smp_mb() to 
> > arch_spin_is_locked()"):
> > 
> > For the ipc/sem code, I would like to replace the spin_is_locked() with 
> > a smp_load_acquire(), see:
> > 
> > http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n367
> > 
> > http://www.ozlabs.org/~akpm/mmots/broken-out/ipc-semc-fix-complex_count-vs-simple-op-race.patch
> > 
> > To my understanding, I must now add a smp_mb(), otherwise it would be 
> > broken on PowerPC:
> > 
> > The approach that the memory barrier is added into spin_is_locked() 
> > doesn't work because the code doesn't use spin_is_locked().
> > 
> > Correct?
> 
> Right, otherwise you aren't properly ordered. The current powerpc locks provide
> good protection between what's inside vs. what's outside the lock but not vs.
> the lock *value* itself, so if, like you do in the sem code, use the lock
> value as something that is relevant in term of ordering, you probably need
> an explicit full barrier.
> 
> Adding Paul McKenney.

To amplify what Ben said...

Any CPU holding a given lock will see any previous accesses made under
the protection of that lock.

A CPU -not- holding the lock can see misordering.  As Ben noted, to
that non-lock-holding CPU it might appear that a write made under the
protection of that lock was made after the lock was released.  Similarly,
to that CPU it might appear that a load done under the protection of that
lock completed before the lock was acquired.  Finally, a CPU not holding
the lock might see a store by one CPU holding the lock as happening
after a load (from some other variable) by the next CPU holding that lock.
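
As a diagram of that last case (x and y are placeholder variables, not
from the sem code):

	CPU A (holds the lock)		CPU B (next holder)
	spin_lock(&L);
	WRITE_ONCE(x, 1);
	spin_unlock(&L);
					spin_lock(&L);
					r1 = READ_ONCE(y);
					spin_unlock(&L);

	/* A third CPU that never takes L may observe CPU A's store to x
	 * as coming after CPU B's load from y. */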

							Thanx, Paul


* Re: spin_lock implicit/explicit memory barrier
  2016-08-10 18:21   ` Manfred Spraul
@ 2016-08-10 19:17     ` Davidlohr Bueso
  2016-08-10 21:00       ` Paul E. McKenney
  2016-08-12  2:47       ` Boqun Feng
  2016-08-10 20:52     ` Paul E. McKenney
  1 sibling, 2 replies; 20+ messages in thread
From: Davidlohr Bueso @ 2016-08-10 19:17 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Benjamin Herrenschmidt, Michael Ellerman, Andrew Morton,
	Linux Kernel Mailing List, Susanne Spraul, Paul E. McKenney,
	Peter Zijlstra

On Wed, 10 Aug 2016, Manfred Spraul wrote:

>On 08/10/2016 02:05 AM, Benjamin Herrenschmidt wrote:
>>On Tue, 2016-08-09 at 20:52 +0200, Manfred Spraul wrote:
>>>Hi Benjamin, Hi Michael,
>>>
>>>regarding commit 51d7d5205d33 ("powerpc: Add smp_mb() to
>>>arch_spin_is_locked()"):
>>>
>>>For the ipc/sem code, I would like to replace the spin_is_locked() with
>>>a smp_load_acquire(), see:
>>>
>>>http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n367
>>>
>>>http://www.ozlabs.org/~akpm/mmots/broken-out/ipc-semc-fix-complex_count-vs-simple-op-race.patch
>>>
>>>To my understanding, I must now add a smp_mb(), otherwise it would be
>>>broken on PowerPC:
>>>
>>>The approach that the memory barrier is added into spin_is_locked()
>>>doesn't work because the code doesn't use spin_is_locked().
>>>
>>>Correct?
>>Right, otherwise you aren't properly ordered. The current powerpc locks provide
>>good protection between what's inside vs. what's outside the lock but not vs.
>>the lock *value* itself, so if, like you do in the sem code, use the lock
>>value as something that is relevant in term of ordering, you probably need
>>an explicit full barrier.

But the problem here is that spin_unlock_wait() (for ll/sc spin_lock) may not see the
store that makes the lock visibly taken, so both threads end up exiting out of sem_lock();
a similar scenario to the spin_is_locked commit mentioned above, which is a crossing of
locks.

Now that spin_unlock_wait() always implies at least a load-acquire barrier (for both
ticket locks and qspinlocks, the latter still x86 only), we wait on the full critical region.

So this patch takes this locking scheme:

   CPU0			      CPU1
   spin_lock(l)		      spin_lock(L)
   spin_unlock_wait(L)	      if (spin_is_locked(l))
   foo()			 foo()

... and converts it now to:

   CPU0			      CPU1
   complex_mode = true	      spin_lock(l)
   smp_mb()				  <--- do we want a smp_mb() here?
   spin_unlock_wait(l)	      if (!smp_load_acquire(complex_mode))
   foo()			 foo()

We should not be doing an smp_mb() right after a spin_lock(); it makes no sense. The
spinlock machinery should guarantee us the barriers in the unorthodox locking cases,
such as this.

Thanks,
Davidlohr


* Re: spin_lock implicit/explicit memory barrier
  2016-08-10 18:21   ` Manfred Spraul
  2016-08-10 19:17     ` Davidlohr Bueso
@ 2016-08-10 20:52     ` Paul E. McKenney
  2016-08-10 22:23       ` Davidlohr Bueso
  1 sibling, 1 reply; 20+ messages in thread
From: Paul E. McKenney @ 2016-08-10 20:52 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Benjamin Herrenschmidt, Michael Ellerman, Andrew Morton,
	Davidlohr Bueso, Linux Kernel Mailing List, Susanne Spraul,
	Peter Zijlstra, parri.andrea

On Wed, Aug 10, 2016 at 08:21:22PM +0200, Manfred Spraul wrote:
> Hi,
> 
> [adding Peter, correcting Davidlohr's mail address]
> 
> On 08/10/2016 02:05 AM, Benjamin Herrenschmidt wrote:
> >On Tue, 2016-08-09 at 20:52 +0200, Manfred Spraul wrote:
> >>Hi Benjamin, Hi Michael,
> >>
> >>regarding commit 51d7d5205d33 ("powerpc: Add smp_mb() to
> >>arch_spin_is_locked()"):
> >>
> >>For the ipc/sem code, I would like to replace the spin_is_locked() with
> >>a smp_load_acquire(), see:
> >>
> >>http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n367
> >>
> >>http://www.ozlabs.org/~akpm/mmots/broken-out/ipc-semc-fix-complex_count-vs-simple-op-race.patch
> >>
> >>To my understanding, I must now add a smp_mb(), otherwise it would be
> >>broken on PowerPC:
> >>
> >>The approach that the memory barrier is added into spin_is_locked()
> >>doesn't work because the code doesn't use spin_is_locked().
> >>
> >>Correct?
> >Right, otherwise you aren't properly ordered. The current powerpc locks provide
> >good protection between what's inside vs. what's outside the lock but not vs.
> >the lock *value* itself, so if, like you do in the sem code, use the lock
> >value as something that is relevant in term of ordering, you probably need
> >an explicit full barrier.
> >
> >Adding Paul McKenney.
> Just to be safe, let's write down all barrier pairs:
> entry and exit, simple and complex, and switching simple to complex
> and vice versa.
> 
> (@Davidlohr: Could you crosscheck, did I overlook a pair?)
> 
> 1)
> spin_lock/spin_unlock pair.

If CPU A does spin_unlock(&l1) and CPU B later does spin_lock(&l1),
then CPU B will see all of CPU A's l1-protected accesses.  In other
words, locks really do work like they need to.  ;-)

However, some other CPU C -not- holding l1 might see one of CPU A's
writes as happening after one of CPU B's reads, but only if that write
and that read are to two different variables.  You can prevent this sort
of misordering (as RCU does) by placing smp_mb__after_unlock_lock()
after CPU B's spin_lock().  But the need to prevent this misordering
appears to be rare.
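
In code form, that placement is simply (illustrative only; at the time of
this thread the helper is defined in RCU code):

	spin_lock(&l1);
	smp_mb__after_unlock_lock();	/* combines with the earlier spin_unlock(&l1)   */
					/* to give full ordering even for CPUs that do  */
					/* not hold l1; a no-op on architectures whose  */
					/* unlock+lock already acts as a full barrier   */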

> 2)

I am having some difficulty parsing this, but...

> smp_load_acquire(&sma->complex_mode) and
> smp_store_release(sma->complex_mode, true) pair.
> http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n374
> http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n321
> The store_release guarantees that all data written by the complex_op
> syscall is - after the load_acquire - visible by the simple_op
> syscall.

And smp_store_release() and smp_load_acquire() are quite similar to
spin_unlock() and spin_lock().

If CPU A does smp_store_release(&x) and CPU B does smp_load_acquire(&x),
and CPU B loads the value that CPU A stored to x, or some later value,
then any of CPU B accesses following its smp_load_acquire(&x) will see
all of CPU A's accesses preceding its smp_store_release(&x).
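
A minimal sketch of that guarantee (x and flag are placeholders):

	/* CPU A */			/* CPU B */
	WRITE_ONCE(x, 1);		r1 = smp_load_acquire(&flag);
	smp_store_release(&flag, 1);	if (r1)
						r2 = READ_ONCE(x);	/* r2 == 1 is guaranteed */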

However, some other CPU C that never accessed x might see one of CPU A's
pre-release writes as happening after one of CPU B's post-acquire reads,
but only if that write and that read are to two different variables.

> 3) smp_mb() [after spin_lock()] and
> smp_store_mb(sma->complex_mode, true) pair.
> http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n287
> http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n372
> This are actually two pairs: - Writing the lock variable must
> observed by the task that does spin_unlock_wait() - complex_mode
> must be observed by the task that does the smp_load_acquire()

So this is the smp_store_mb() and the following spin_unlock_wait()
on the one hand and the spin_lock(), the smp_mb(), and the
smp_load_acquire() on the other?

Let's see...  If the smp_load_acquire() doesn't see the value stored by
smp_store_mb(), then the spin_unlock_wait() is guaranteed to see
the fact that the other CPU holds the lock.  The ->slock field is
volatile, so the compiler shouldn't be able to mess us up too badly.
The control dependency is a spinloop, so no ability for the compiler
to move stuff after the end of an "if" to before that "if", which
is good as well.
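
For reference, the wait side being discussed looks roughly like this
(a sketch, not the exact sem.c code):

	spin_unlock_wait(&sem->lock);	/* the spin loop: a control dependency only  */
	smp_rmb();			/* Manfred's line-300 upgrade to an acquire  */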

> 4)
> spin_unlock_wait() and spin_unlock() pair
> http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n291
> http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n409
> The data from the simple op must be observed by the following
> complex op. Right now, there is still an smp_rmb() in line 300: The
> control barrier from the loop inside spin_unlock_wait() is upgraded
> to an acquire barrier by an additional smp_rmb(). Is this smp_rmb()
> required? If I understand commit 2c6100227116 ("locking/qspinlock:
> Fix spin_unlock_wait() some more") right, with this commit qspinlock
> handle this case without the smp_rmb(). What I don't know if powerpc
> is using qspinlock already, or if powerpc works without the
> smp_rmb(). -- Manfred

And I was taking this into account as well.  I believe that this does
what you want it to, and checked it against the current prototype
Linux-kernel memory model with the litmus test shown below, and the
memory model agreed with my assessment.  Which might or might not
be worth anything.  ;-)

						Thanx, Paul

------------------------------------------------------------------------

C C-ManfredSpraul-Sem
(*
 * Result: Never
 *)

{
}

P0(int *lck, int *complex_mode, int *x, int *y)
{
	WRITE_ONCE(*lck, 1); /* only one lock acquirer, so can cheat. */
	smp_mb();
	r1 = smp_load_acquire(complex_mode);
	if (r1 == 0) {
		r2 = READ_ONCE(*x);
		WRITE_ONCE(*y, 1);
	}
	smp_store_release(lck, 0);
}

P1(int *lck, int *complex_mode, int *x, int *y)
{
	WRITE_ONCE(*complex_mode, 1);
	smp_mb();
	r3 = READ_ONCE(*lck);
	if (r3 == 0) {
		smp_rmb();
		WRITE_ONCE(*x, 1);
		r4 = READ_ONCE(*y);
	}
}

exists
(0:r1=0 /\ 1:r3=0 /\ (0:r2=1 /\ 1:r4=0))


* Re: spin_lock implicit/explicit memory barrier
  2016-08-10 19:17     ` Davidlohr Bueso
@ 2016-08-10 21:00       ` Paul E. McKenney
  2016-08-15 20:06         ` Manfred Spraul
  2016-08-12  2:47       ` Boqun Feng
  1 sibling, 1 reply; 20+ messages in thread
From: Paul E. McKenney @ 2016-08-10 21:00 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Manfred Spraul, Benjamin Herrenschmidt, Michael Ellerman,
	Andrew Morton, Linux Kernel Mailing List, Susanne Spraul,
	Peter Zijlstra

On Wed, Aug 10, 2016 at 12:17:57PM -0700, Davidlohr Bueso wrote:
> On Wed, 10 Aug 2016, Manfred Spraul wrote:
> 
> >On 08/10/2016 02:05 AM, Benjamin Herrenschmidt wrote:
> >>On Tue, 2016-08-09 at 20:52 +0200, Manfred Spraul wrote:
> >>>Hi Benjamin, Hi Michael,
> >>>
> >>>regarding commit 51d7d5205d33 ("powerpc: Add smp_mb() to
> >>>arch_spin_is_locked()"):
> >>>
> >>>For the ipc/sem code, I would like to replace the spin_is_locked() with
> >>>a smp_load_acquire(), see:
> >>>
> >>>http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n367
> >>>
> >>>http://www.ozlabs.org/~akpm/mmots/broken-out/ipc-semc-fix-complex_count-vs-simple-op-race.patch
> >>>
> >>>To my understanding, I must now add a smp_mb(), otherwise it would be
> >>>broken on PowerPC:
> >>>
> >>>The approach that the memory barrier is added into spin_is_locked()
> >>>doesn't work because the code doesn't use spin_is_locked().
> >>>
> >>>Correct?
> >>Right, otherwise you aren't properly ordered. The current powerpc locks provide
> >>good protection between what's inside vs. what's outside the lock but not vs.
> >>the lock *value* itself, so if, like you do in the sem code, use the lock
> >>value as something that is relevant in term of ordering, you probably need
> >>an explicit full barrier.
> 
> But the problem here is with spin_unlock_wait() (for ll/sc spin_lock) not seeing the
> store that makes the lock visibly taken and both threads end up exiting out of sem_lock();
> similar scenario to the spin_is_locked commit mentioned above, which is crossing of
> locks.
> 
> Now that spin_unlock_wait() always implies at least an load-acquire barrier (for both
> ticket and qspinlocks, which is still x86 only), we wait on the full critical region.
> 
> So this patch takes this locking scheme:
> 
>   CPU0			      CPU1
>   spin_lock(l)		      spin_lock(L)
>   spin_unlock_wait(L)	      if (spin_is_locked(l))
>   foo()			 foo()
> 
> ... and converts it now to:
> 
>   CPU0			      CPU1
>   complex_mode = true	      spin_lock(l)
>   smp_mb()				  <--- do we want a smp_mb() here?
>   spin_unlock_wait(l)	      if (!smp_load_acquire(complex_mode))
>   foo()			 foo()
> 
> We should not be doing an smp_mb() right after a spin_lock(), makes no sense. The
> spinlock machinery should guarantee us the barriers in the unorthodox locking cases,
> such as this.

In this case, from what I can see, we do need a store-load fence.
That said, yes, it really should be smp_mb__after_unlock_lock() rather
than smp_mb().  So if this code pattern is both desired and legitimate,
the smp_mb__after_unlock_lock() definitions probably need to move out
of kernel/rcu/tree.h to barrier.h or some such.

Now, I agree that if everyone was acquiring and releasing the lock in
standard fashion, there would be no need for memory barriers other than
those in the locking primitives.  But that is not the case here: A task
is looking at some lock-protected state without actually holding the lock.

						Thanx, Paul


* Re: spin_lock implicit/explicit memory barrier
  2016-08-10 20:52     ` Paul E. McKenney
@ 2016-08-10 22:23       ` Davidlohr Bueso
  2016-08-10 22:58         ` Paul E. McKenney
  2016-08-10 23:59         ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 20+ messages in thread
From: Davidlohr Bueso @ 2016-08-10 22:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Manfred Spraul, Benjamin Herrenschmidt, Michael Ellerman,
	Andrew Morton, Linux Kernel Mailing List, Susanne Spraul,
	Peter Zijlstra, parri.andrea

On Wed, 10 Aug 2016, Paul E. McKenney wrote:

>On Wed, Aug 10, 2016 at 08:21:22PM +0200, Manfred Spraul wrote:

>>                                                               4)
>> spin_unlock_wait() and spin_unlock() pair
>> http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n291
>> http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n409
>> The data from the simple op must be observed by the following
>> complex op. Right now, there is still an smp_rmb() in line 300: The
>> control barrier from the loop inside spin_unlock_wait() is upgraded
>> to an acquire barrier by an additional smp_rmb(). Is this smp_rmb()
>> required? If I understand commit 2c6100227116 ("locking/qspinlock:
>> Fix spin_unlock_wait() some more") right, with this commit qspinlock
>> handle this case without the smp_rmb(). What I don't know if powerpc
>> is using qspinlock already, or if powerpc works without the
>> smp_rmb(). -- Manfred|

No, ppc doesn't use qspinlocks, but as mentioned, spin_unlock_wait() for
ticket locks is now at least an acquire (ppc is stronger), which matches that
unlock store-release you are concerned about; this is as of 726328d92a4
(locking/spinlock, arch: Update and fix spin_unlock_wait() implementations).

This is exactly what you are doing by upgrading the ctrl dependency to 
the acquire barrier in http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n291
and therefore we don't need it explicitly -- it also makes the comment
wrt spin_unlock_wait obsolete. Or am I misunderstanding you?

Thanks,
Davidlohr


* Re: spin_lock implicit/explicit memory barrier
  2016-08-10 22:23       ` Davidlohr Bueso
@ 2016-08-10 22:58         ` Paul E. McKenney
  2016-08-10 23:29           ` Davidlohr Bueso
  2016-08-10 23:59         ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 20+ messages in thread
From: Paul E. McKenney @ 2016-08-10 22:58 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Manfred Spraul, Benjamin Herrenschmidt, Michael Ellerman,
	Andrew Morton, Linux Kernel Mailing List, Susanne Spraul,
	Peter Zijlstra, parri.andrea

On Wed, Aug 10, 2016 at 03:23:16PM -0700, Davidlohr Bueso wrote:
> On Wed, 10 Aug 2016, Paul E. McKenney wrote:
> 
> >On Wed, Aug 10, 2016 at 08:21:22PM +0200, Manfred Spraul wrote:
> 
> >>                                                              4)
> >>spin_unlock_wait() and spin_unlock() pair
> >>http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n291
> >>http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n409
> >>The data from the simple op must be observed by the following
> >>complex op. Right now, there is still an smp_rmb() in line 300: The
> >>control barrier from the loop inside spin_unlock_wait() is upgraded
> >>to an acquire barrier by an additional smp_rmb(). Is this smp_rmb()
> >>required? If I understand commit 2c6100227116 ("locking/qspinlock:
> >>Fix spin_unlock_wait() some more") right, with this commit qspinlock
> >>handle this case without the smp_rmb(). What I don't know if powerpc
> >>is using qspinlock already, or if powerpc works without the
> >>smp_rmb(). -- Manfred|
> 
> No, ppc doesn't use qspinlocks, but as mentioned, spin_unlock_wait for
> tickets are now at least an acquire (ppc is stronger), which match that
> unlock store-release you are concerned about, this is as of 726328d92a4
> (locking/spinlock, arch: Update and fix spin_unlock_wait() implementations).
> 
> This is exactly what you are doing by upgrading the ctrl dependency
> to the acquire barrier in
> http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n291
> and therefore we don't need it explicitly -- it also makes the comment
> wrt spin_unlock_wait obsolete. Or am I'm misunderstanding you?

Ah, I was looking at 4.7 rather than current mainline.  Perhaps Manfred
was doing the same.

							Thanx, Paul


* Re: spin_lock implicit/explicit memory barrier
  2016-08-10 22:58         ` Paul E. McKenney
@ 2016-08-10 23:29           ` Davidlohr Bueso
  2016-08-11  8:11             ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Davidlohr Bueso @ 2016-08-10 23:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Manfred Spraul, Benjamin Herrenschmidt, Michael Ellerman,
	Andrew Morton, Linux Kernel Mailing List, Susanne Spraul,
	Peter Zijlstra, parri.andrea

On Wed, 10 Aug 2016, Paul E. McKenney wrote:

>On Wed, Aug 10, 2016 at 03:23:16PM -0700, Davidlohr Bueso wrote:
>> On Wed, 10 Aug 2016, Paul E. McKenney wrote:
>>
>> >On Wed, Aug 10, 2016 at 08:21:22PM +0200, Manfred Spraul wrote:
>>
>> >>                                                              4)
>> >>spin_unlock_wait() and spin_unlock() pair
>> >>http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n291
>> >>http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n409
>> >>The data from the simple op must be observed by the following
>> >>complex op. Right now, there is still an smp_rmb() in line 300: The
>> >>control barrier from the loop inside spin_unlock_wait() is upgraded
>> >>to an acquire barrier by an additional smp_rmb(). Is this smp_rmb()
>> >>required? If I understand commit 2c6100227116 ("locking/qspinlock:
>> >>Fix spin_unlock_wait() some more") right, with this commit qspinlock
>> >>handle this case without the smp_rmb(). What I don't know if powerpc
>> >>is using qspinlock already, or if powerpc works without the
>> >>smp_rmb(). -- Manfred|
>>
>> No, ppc doesn't use qspinlocks, but as mentioned, spin_unlock_wait for
>> tickets are now at least an acquire (ppc is stronger), which match that
>> unlock store-release you are concerned about, this is as of 726328d92a4
>> (locking/spinlock, arch: Update and fix spin_unlock_wait() implementations).
>>
>> This is exactly what you are doing by upgrading the ctrl dependency
>> to the acquire barrier in
>> http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n291
>> and therefore we don't need it explicitly -- it also makes the comment
>> wrt spin_unlock_wait obsolete. Or am I'm misunderstanding you?
>
>Ah, I was looking at 4.7 rather than current mainline.  Perhaps Manfred
>was doing the same.

Right, and therefore backporting gets icky as any version < 4.8 will
require this explicit smp_rmb :-(  Given that this complex vs simple
ops race goes way back to 3.12, I see these options:

(1) As Manfred suggested, have a patch 1 that fixes the race against mainline
with the redundant smp_rmb, then apply a second patch that gets rid of it
for mainline, but only backport the original patch 1 down to 3.12.

(2) Backport 726328d92a4 all the way down to 3.12.

(3) Have two patches, one for upstream and one for backporting (not sure how
that would fly though).

I'm in favor of (1) as it seems the least error prone, as long as we do get
rid of the redundant barrier. For the case of any smp_mb__after_unlock_lock()
calls we end up needing for ppc, this would probably need backporting as is,
afaict.

Thanks,
Davidlohr


* Re: spin_lock implicit/explicit memory barrier
  2016-08-10 22:23       ` Davidlohr Bueso
  2016-08-10 22:58         ` Paul E. McKenney
@ 2016-08-10 23:59         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2016-08-10 23:59 UTC (permalink / raw)
  To: Davidlohr Bueso, Paul E. McKenney
  Cc: Manfred Spraul, Michael Ellerman, Andrew Morton,
	Linux Kernel Mailing List, Susanne Spraul, Peter Zijlstra,
	parri.andrea

On Wed, 2016-08-10 at 15:23 -0700, Davidlohr Bueso wrote:
> On Wed, 10 Aug 2016, Paul E. McKenney wrote:
> 
> > 
> > On Wed, Aug 10, 2016 at 08:21:22PM +0200, Manfred Spraul wrote:
>                                                               4)
> > > spin_unlock_wait() and spin_unlock() pair
> > > http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n291
> > > http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n409
> > > The data from the simple op must be observed by the following
> > > complex op. Right now, there is still an smp_rmb() in line 300: The
> > > control barrier from the loop inside spin_unlock_wait() is upgraded
> > > to an acquire barrier by an additional smp_rmb(). Is this smp_rmb()
> > > required? If I understand commit 2c6100227116 ("locking/qspinlock:
> > > Fix spin_unlock_wait() some more") right, with this commit qspinlock
> > > handle this case without the smp_rmb(). What I don't know if powerpc
> > > is using qspinlock already, or if powerpc works without the
> > > smp_rmb(). -- Manfred|
> 
> > No, ppc doesn't use qspinlocks, 

 ... yet. There are patches pending to add support for them

> but as mentioned, spin_unlock_wait for
> > tickets are now at least an acquire (ppc is stronger), 

The unlock path for qspinlock for us will be a release.

> which match that
> unlock store-release you are concerned about, this is as of 726328d92a4
> (locking/spinlock, arch: Update and fix spin_unlock_wait() implementations).
> 
> This is exactly what you are doing by upgrading the ctrl dependency to 
> the acquire barrier in http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n291
> and therefore we don't need it explicitly -- it also makes the comment
> wrt spin_unlock_wait obsolete. Or am I'm misunderstanding you?

Ben.


* Re: spin_lock implicit/explicit memory barrier
  2016-08-10 23:29           ` Davidlohr Bueso
@ 2016-08-11  8:11             ` Peter Zijlstra
  2016-08-11 18:31               ` Davidlohr Bueso
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2016-08-11  8:11 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Paul E. McKenney, Manfred Spraul, Benjamin Herrenschmidt,
	Michael Ellerman, Andrew Morton, Linux Kernel Mailing List,
	Susanne Spraul, parri.andrea

On Wed, Aug 10, 2016 at 04:29:22PM -0700, Davidlohr Bueso wrote:

> (1) As Manfred suggested, have a patch 1 that fixes the race against mainline
> with the redundant smp_rmb, then apply a second patch that gets rid of it
> for mainline, but only backport the original patch 1 down to 3.12.

I have not followed the thread closely, but this seems like the best
option. Esp. since 726328d92a42 ("locking/spinlock, arch: Update and fix
spin_unlock_wait() implementations") is incomplete, it relies on at
least 6262db7c088b ("powerpc/spinlock: Fix spin_unlock_wait()") to sort
PPC.


* Re: spin_lock implicit/explicit memory barrier
  2016-08-11  8:11             ` Peter Zijlstra
@ 2016-08-11 18:31               ` Davidlohr Bueso
  2016-08-12  2:59                 ` Boqun Feng
  0 siblings, 1 reply; 20+ messages in thread
From: Davidlohr Bueso @ 2016-08-11 18:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Manfred Spraul, Benjamin Herrenschmidt,
	Michael Ellerman, Andrew Morton, Linux Kernel Mailing List,
	Susanne Spraul, parri.andrea

On Thu, 11 Aug 2016, Peter Zijlstra wrote:

>On Wed, Aug 10, 2016 at 04:29:22PM -0700, Davidlohr Bueso wrote:
>
>> (1) As Manfred suggested, have a patch 1 that fixes the race against mainline
>> with the redundant smp_rmb, then apply a second patch that gets rid of it
>> for mainline, but only backport the original patch 1 down to 3.12.
>
>I have not followed the thread closely, but this seems like the best
>option. Esp. since 726328d92a42 ("locking/spinlock, arch: Update and fix
>spin_unlock_wait() implementations") is incomplete, it relies on at
>least 6262db7c088b ("powerpc/spinlock: Fix spin_unlock_wait()") to sort
>PPC.

Yeah, and we'd also need the arm bits; which reminds me, aren't alpha
ldl_l/stl_c sequences also exposed to this delayed publishing of the lock
store when a non-owner peeks at the lock? Right now sysv sems would be
busted when doing either is_locked or unlock_wait; shouldn't these be
pimped up to full smp_mb()s?

Thanks,
Davidlohr


* Re: spin_lock implicit/explicit memory barrier
  2016-08-10 19:17     ` Davidlohr Bueso
  2016-08-10 21:00       ` Paul E. McKenney
@ 2016-08-12  2:47       ` Boqun Feng
  2016-08-12 18:43         ` Manfred Spraul
  1 sibling, 1 reply; 20+ messages in thread
From: Boqun Feng @ 2016-08-12  2:47 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Manfred Spraul, Benjamin Herrenschmidt, Michael Ellerman,
	Andrew Morton, Linux Kernel Mailing List, Susanne Spraul,
	Paul E. McKenney, Peter Zijlstra


On Wed, Aug 10, 2016 at 12:17:57PM -0700, Davidlohr Bueso wrote:
> On Wed, 10 Aug 2016, Manfred Spraul wrote:
> 
> > On 08/10/2016 02:05 AM, Benjamin Herrenschmidt wrote:
> > > On Tue, 2016-08-09 at 20:52 +0200, Manfred Spraul wrote:
> > > > Hi Benjamin, Hi Michael,
> > > > 
> > > > regarding commit 51d7d5205d33 ("powerpc: Add smp_mb() to
> > > > arch_spin_is_locked()"):
> > > > 
> > > > For the ipc/sem code, I would like to replace the spin_is_locked() with
> > > > a smp_load_acquire(), see:
> > > > 
> > > > http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/tree/ipc/sem.c#n367
> > > > 
> > > > http://www.ozlabs.org/~akpm/mmots/broken-out/ipc-semc-fix-complex_count-vs-simple-op-race.patch
> > > > 
> > > > To my understanding, I must now add a smp_mb(), otherwise it would be
> > > > broken on PowerPC:
> > > > 
> > > > The approach that the memory barrier is added into spin_is_locked()
> > > > doesn't work because the code doesn't use spin_is_locked().
> > > > 
> > > > Correct?
> > > Right, otherwise you aren't properly ordered. The current powerpc locks provide
> > > good protection between what's inside vs. what's outside the lock but not vs.
> > > the lock *value* itself, so if, like you do in the sem code, use the lock
> > > value as something that is relevant in term of ordering, you probably need
> > > an explicit full barrier.
> 
> But the problem here is with spin_unlock_wait() (for ll/sc spin_lock) not seeing the
> store that makes the lock visibly taken and both threads end up exiting out of sem_lock();
> similar scenario to the spin_is_locked commit mentioned above, which is crossing of
> locks.
> 
> Now that spin_unlock_wait() always implies at least an load-acquire barrier (for both
> ticket and qspinlocks, which is still x86 only), we wait on the full critical region.
> 
> So this patch takes this locking scheme:
> 
>   CPU0			      CPU1
>   spin_lock(l)		      spin_lock(L)
>   spin_unlock_wait(L)	      if (spin_is_locked(l))
>   foo()			 foo()
> 
> ... and converts it now to:
> 
>   CPU0			      CPU1
>   complex_mode = true	      spin_lock(l)
>   smp_mb()				  <--- do we want a smp_mb() here?
>   spin_unlock_wait(l)	      if (!smp_load_acquire(complex_mode))
>   foo()			 foo()
> 
> We should not be doing an smp_mb() right after a spin_lock(), makes no sense. The
> spinlock machinery should guarantee us the barriers in the unorthodox locking cases,
> such as this.
> 

Right.

If you have:

6262db7c088b ("powerpc/spinlock: Fix spin_unlock_wait()")

you don't need smp_mb() after spin_lock() on PPC.

And, IIUC, if you have:

3a5facd09da8 ("arm64: spinlock: fix spin_unlock_wait for LSE atomics")
d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against
concurrent lockers")

you don't need smp_mb() after spin_lock() on ARM64.

And, IIUC, if you have:

2c6100227116 ("locking/qspinlock: Fix spin_unlock_wait() some more")

you don't need smp_mb() after spin_lock() on x86 with qspinlock.

Regards,
Boqun

> Thanks,

> Davidlohr



* Re: spin_lock implicit/explicit memory barrier
  2016-08-11 18:31               ` Davidlohr Bueso
@ 2016-08-12  2:59                 ` Boqun Feng
  2016-08-19 14:01                   ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Boqun Feng @ 2016-08-12  2:59 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Paul E. McKenney, Manfred Spraul,
	Benjamin Herrenschmidt, Michael Ellerman, Andrew Morton,
	Linux Kernel Mailing List, Susanne Spraul, parri.andrea


On Thu, Aug 11, 2016 at 11:31:06AM -0700, Davidlohr Bueso wrote:
> On Thu, 11 Aug 2016, Peter Zijlstra wrote:
> 
> > On Wed, Aug 10, 2016 at 04:29:22PM -0700, Davidlohr Bueso wrote:
> > 
> > > (1) As Manfred suggested, have a patch 1 that fixes the race against mainline
> > > with the redundant smp_rmb, then apply a second patch that gets rid of it
> > > for mainline, but only backport the original patch 1 down to 3.12.
> > 
> > I have not followed the thread closely, but this seems like the best
> > option. Esp. since 726328d92a42 ("locking/spinlock, arch: Update and fix
> > spin_unlock_wait() implementations") is incomplete, it relies on at
> > least 6262db7c088b ("powerpc/spinlock: Fix spin_unlock_wait()") to sort
> > PPC.
> 
> Yeah, and we'd also need the arm bits; which reminds me, aren't alpha
> ldl_l/stl_c sequences also exposed to this delaying of the publishing
> when a non-owner peeks at the lock? Right now sysv sem's would be busted
> when doing either is_locked or unlock_wait, shouldn't these be pimped up
> to full smp_mb()s?
> 

You are talking about a similar problem as this one:

http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1018307.html

right?

The crux of this problem is whether the barrier or operation in
spin_lock() can order the STORE part of the lock acquisition with memory
operations in the critical section.

On PPC, we use lwsync, which doesn't order STORE->LOAD, so there is a
problem. On ARM64 and for qspinlock on x86, there are similar reasons.

But if an arch implements its spin_lock() with a full barrier, then even
though the atomic is implemented by ll/sc, its STORE part can't be
reordered with memory operations in the critical sections. I think
maybe that's the case for alpha (and also for ARM32).

Regards,
Boqun

> Thanks,
> Davidlohr
> 



* Re: spin_lock implicit/explicit memory barrier
  2016-08-12  2:47       ` Boqun Feng
@ 2016-08-12 18:43         ` Manfred Spraul
  2016-08-22  9:15           ` Boqun Feng
  0 siblings, 1 reply; 20+ messages in thread
From: Manfred Spraul @ 2016-08-12 18:43 UTC (permalink / raw)
  To: Boqun Feng, Davidlohr Bueso
  Cc: Benjamin Herrenschmidt, Michael Ellerman, Andrew Morton,
	Linux Kernel Mailing List, Susanne Spraul, Paul E. McKenney,
	Peter Zijlstra

Hi Boqun,

On 08/12/2016 04:47 AM, Boqun Feng wrote:
>> We should not be doing an smp_mb() right after a spin_lock(), makes no sense. The
>> spinlock machinery should guarantee us the barriers in the unorthodox locking cases,
>> such as this.
>>
Do we really want to go there?
Trying to handle all unorthodox cases will end up as an endless list of
patches, and is guaranteed to leave stale architectures behind.

> Right.
>
> If you have:
>
> 6262db7c088b ("powerpc/spinlock: Fix spin_unlock_wait()")
>
> you don't need smp_mb() after spin_lock() on PPC.
>
> And, IIUC, if you have:
>
> 3a5facd09da8 ("arm64: spinlock: fix spin_unlock_wait for LSE atomics")
> d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against
> concurrent lockers")
>
> you don't need smp_mb() after spin_lock() on ARM64.
>
> And, IIUC, if you have:
>
> 2c6100227116 ("locking/qspinlock: Fix spin_unlock_wait() some more")
>
> you don't need smp_mb() after spin_lock() on x86 with qspinlock.

I would really prefer the other approach:
- spin_lock() is an acquire, that's it. No further guarantees, e.g. 
ordering of writing the lock.
- spin_unlock() is a release, that's it.
- generic smp_mb__after_before_whatever(). And architectures can 
override the helpers.
E.g. if qspinlocks on x86 can implement the smp_mb__after_spin_lock() 
for free, then the helper can be a nop.
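
Something like this (hypothetical sketch; the name just follows the
proposal above, it is not existing mainline code):

/* include/linux/spinlock.h */
#ifndef smp_mb__after_spin_lock
#define smp_mb__after_spin_lock()	smp_mb()	/* safe default */
#endif

/*
 * An architecture whose spin_lock() already orders the store to the lock
 * word against the critical section would override this in its
 * asm/spinlock.h, e.g. with barrier() or a no-op.
 */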

Right now, we start to hardcode something into the architectures - for 
some callers.
Other callers use solutions such as smp_mb__after_unlock_lock(), i.e. 
arch dependent workarounds in arch independent code.

And: We unnecessarily add overhead.
Both ipc/sem and netfilter do loops over many spinlocks:
>        for (i = 0; i < CONNTRACK_LOCKS; i++) {
>                 spin_unlock_wait(&nf_conntrack_locks[i]);
>         }
One memory barrier would be sufficient, but due to embedding we end up 
with CONNTRACK_LOCKS barriers.
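
Hypothetical sketch (made-up *_relaxed name, just to illustrate the point):

	smp_mb();	/* one barrier for the whole batch */
	for (i = 0; i < CONNTRACK_LOCKS; i++)
		spin_unlock_wait_relaxed(&nf_conntrack_locks[i]);	/* no embedded barrier */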

Should I create a patch?
(i.e. documentation and generic helpers)

--
     Manfred


* Re: spin_lock implicit/explicit memory barrier
  2016-08-10 21:00       ` Paul E. McKenney
@ 2016-08-15 20:06         ` Manfred Spraul
  2016-08-15 20:28           ` Paul E. McKenney
  0 siblings, 1 reply; 20+ messages in thread
From: Manfred Spraul @ 2016-08-15 20:06 UTC (permalink / raw)
  To: paulmck, Davidlohr Bueso
  Cc: Benjamin Herrenschmidt, Michael Ellerman, Andrew Morton,
	Linux Kernel Mailing List, Susanne Spraul, Peter Zijlstra

Hi Paul,

On 08/10/2016 11:00 PM, Paul E. McKenney wrote:
> On Wed, Aug 10, 2016 at 12:17:57PM -0700, Davidlohr Bueso wrote:
>> [...]
>>    CPU0			      CPU1
>>    complex_mode = true	      spin_lock(l)
>>    smp_mb()				  <--- do we want a smp_mb() here?
>>    spin_unlock_wait(l)	      if (!smp_load_acquire(complex_mode))
>>    foo()			 foo()
>>
>> We should not be doing an smp_mb() right after a spin_lock(), makes no sense. The
>> spinlock machinery should guarantee us the barriers in the unorthodox locking cases,
>> such as this.
> In this case, from what I can see, we do need a store-load fence.
> That said, yes, it really should be smp_mb__after_unlock_lock() rather
> than smp_mb().  So if this code pattern is both desired and legitimate,
> the smp_mb__after_unlock_lock() definitions probably need to move out
> of kernel/rcu/tree.h to barrier.h or some such.
Can you explain the function name, why smp_mb__after_unlock_lock()?

I would have called it smp_mb__after_spin_lock().

For ipc/sem.c, the use case is:
[sorry, I only now notice that the mailer ate the formatting]:

 cpu 1: complex_mode_enter():
     smp_store_mb(sma->complex_mode, true);

     for (i = 0; i < sma->sem_nsems; i++) {
         sem = sma->sem_base + i;
         spin_unlock_wait(&sem->lock);
     }

 cpu 2: sem_lock():
     spin_lock(&sem->lock);
     smp_mb();
     if (!smp_load_acquire(&sma->complex_mode)) {

What is forbidden is that both cpu1 and cpu2 proceed.

--
     Manfred


* Re: spin_lock implicit/explicit memory barrier
  2016-08-15 20:06         ` Manfred Spraul
@ 2016-08-15 20:28           ` Paul E. McKenney
  0 siblings, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2016-08-15 20:28 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Davidlohr Bueso, Benjamin Herrenschmidt, Michael Ellerman,
	Andrew Morton, Linux Kernel Mailing List, Susanne Spraul,
	Peter Zijlstra

On Mon, Aug 15, 2016 at 10:06:39PM +0200, Manfred Spraul wrote:
> Hi Paul,
> 
> On 08/10/2016 11:00 PM, Paul E. McKenney wrote:
> >On Wed, Aug 10, 2016 at 12:17:57PM -0700, Davidlohr Bueso wrote:
> >>[...]
> >>   CPU0			      CPU1
> >>   complex_mode = true	      spin_lock(l)
> >>   smp_mb()				  <--- do we want a smp_mb() here?
> >>   spin_unlock_wait(l)	      if (!smp_load_acquire(complex_mode))
> >>   foo()			 foo()
> >>
> >>We should not be doing an smp_mb() right after a spin_lock(), makes no sense. The
> >>spinlock machinery should guarantee us the barriers in the unorthodox locking cases,
> >>such as this.
> >In this case, from what I can see, we do need a store-load fence.
> >That said, yes, it really should be smp_mb__after_unlock_lock() rather
> >than smp_mb().  So if this code pattern is both desired and legitimate,
> >the smp_mb__after_unlock_lock() definitions probably need to move out
> >of kernel/rcu/tree.h to barrier.h or some such.
> Can you explain the function name, why smp_mb__after_unlock_lock()?

When placed after a locking function, it provides full ordering for
all the accesses within this critical section against all the accesses
in the previous critical section for this lock.  In addition, it
provides full ordering for all accesses within this critical section
against all previous critical sections for all locks acquired by this
task/CPU.

In short, it acts on the prior lock in combination with some earlier
unlock, hence the name.

> I would have called it smp_mb__after_spin_lock().

It works on mutexes as well as spinlocks, for whatever that is worth.

> For ipc/sem.c, the use case is:
> [sorry, I only now notice that the mailer ate the formatting]:
> 
>  cpu 1: complex_mode_enter():
>     smp_store_mb(sma->complex_mode, true);
> 
>    for (i = 0; i < sma->sem_nsems; i++) {
>         sem = sma->sem_base + i;
>         spin_unlock_wait(&sem->lock);
>     }
> 
> cpu 2: sem_lock():
>         spin_lock(&sem->lock);
>         smp_mb();
>         if (!smp_load_acquire(&sma->complex_mode)) {
> 
> 
> What is forbidden is that both cpu1 and cpu2 proceed.

It looks to me that CPU 2's smp_mb() could be an
smp_mb__after_unlock_lock() in this case, although that does mean
defining its relationship to spin_unlock_wait() in general.  Use of
smp_mb__after_unlock_lock() would get rid of a memory barrier on many
architectures, while still guaranteeing full ordering.  Probably not
measurable at the system level, though.
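
Applied to the snippet above, that would be (sketch only):

 cpu 2: sem_lock():
     spin_lock(&sem->lock);
     smp_mb__after_unlock_lock();	/* instead of smp_mb(); a no-op on many architectures */
     if (!smp_load_acquire(&sma->complex_mode)) {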

							Thanx, Paul


* Re: spin_lock implicit/explicit memory barrier
  2016-08-12  2:59                 ` Boqun Feng
@ 2016-08-19 14:01                   ` Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2016-08-19 14:01 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Davidlohr Bueso, Paul E. McKenney, Manfred Spraul,
	Benjamin Herrenschmidt, Michael Ellerman, Andrew Morton,
	Linux Kernel Mailing List, Susanne Spraul, parri.andrea

On Fri, Aug 12, 2016 at 10:59:46AM +0800, Boqun Feng wrote:
> But if an arch implements its spin_lock() with a full barrier, even
> though the atomic is implemented by ll/sc, the STORE part of which can't
> be reordered with memory operations in the critcal sections. I think
> maybe that's the case for alpha(and also for ARM32).

Correct, Alpha only has a full fence and uses that after the ll/sc to
provide acquire semantics; ARM has other barriers but also uses a full
barrier here.


* Re: spin_lock implicit/explicit memory barrier
  2016-08-12 18:43         ` Manfred Spraul
@ 2016-08-22  9:15           ` Boqun Feng
  0 siblings, 0 replies; 20+ messages in thread
From: Boqun Feng @ 2016-08-22  9:15 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Davidlohr Bueso, Benjamin Herrenschmidt, Michael Ellerman,
	Andrew Morton, Linux Kernel Mailing List, Susanne Spraul,
	Paul E. McKenney, Peter Zijlstra


On Fri, Aug 12, 2016 at 08:43:55PM +0200, Manfred Spraul wrote:
> Hi Boqun,
> 
> On 08/12/2016 04:47 AM, Boqun Feng wrote:
> > > We should not be doing an smp_mb() right after a spin_lock(), makes no sense. The
> > > spinlock machinery should guarantee us the barriers in the unorthodox locking cases,
> > > such as this.
> > > 
> Do we really want to go there?
> Trying to handle all unorthodox cases will end up as an endless list of
> patches, and guaranteed to be stale architectures.
> 

To be honest, the only unorthodox case here is that we try to use
spin_unlock_wait() to achieve some kind of ordering with a lock
critical section, like we do in the sem code and in do_exit(). Spinlocks
are mostly used for exclusive lock critical sections, which are what we
care about most and what we should make as fast as possible.

> > Right.
> > 
> > If you have:
> > 
> > 6262db7c088b ("powerpc/spinlock: Fix spin_unlock_wait()")
> > 
> > you don't need smp_mb() after spin_lock() on PPC.
> > 
> > And, IIUC, if you have:
> > 
> > 3a5facd09da8 ("arm64: spinlock: fix spin_unlock_wait for LSE atomics")
> > d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against
> > concurrent lockers")
> > 
> > you don't need smp_mb() after spin_lock() on ARM64.
> > 
> > And, IIUC, if you have:
> > 
> > 2c6100227116 ("locking/qspinlock: Fix spin_unlock_wait() some more")
> > 
> > you don't need smp_mb() after spin_lock() on x86 with qspinlock.
> 
> I would really prefer the other approach:
> - spin_lock() is an acquire, that's it. No further guarantees, e.g. ordering
> of writing the lock.
> - spin_unlock() is a release, that's it.

No objection to these. That's how we implement and document locks now.

> - generic smp_mb__after_before_whatever(). And architectures can override
> the helpers.
> E.g. if qspinlocks on x86 can implement the smp_mb__after_spin_lock() for
> free, then the helper can be a nop.
> 

If you are going to use smp_mb__after_before_whatever() to fix the
spin_unlock_wait() problem, the helper for qspinlocks on x86 can not be
free, because qspinlocks on x86 use separate READs and WRITEs in
spin_lock(), and we need a full barrier to order the WRITEs of the lock
acquisition with READs in critical sections to let spin_unlock_wait()
work.

> Right now, we start to hardcode something into the architectures - for some
> callers.
> Other callers use solutions such as smp_mb__after_unlock_lock(), i.e. arch
> dependent workarounds in arch independent code.
> 

Please note that the fixes above in spin_unlock_wait() and
smp_mb__after_unlock_lock() are for two different problems:
smp_mb__after_unlock_lock() is to upgrade an unlock+lock pair into a full
barrier, which is useful on RCpc archs, e.g. PowerPC.

> And: We unnecessarily add overhead.
> Both ipc/sem and netfilter do loops over many spinlocks:
> >        for (i = 0; i < CONNTRACK_LOCKS; i++) {
> >                 spin_unlock_wait(&nf_conntrack_locks[i]);
> >         }
> One memory barrier would be sufficient, but due to embedding we end up with
> CONNTRACK_LOCKS barriers.
> 

We can move the smp_mb() out of the spin_unlock_wait(), if we make sure
all the users are using the proper barriers, so this overhead is easily
fixed.

> Should I create a patch?
> (i.e. documentation and generic helpers)
> 

There have been some discussions in this thread:

http://marc.info/?l=linux-arm-kernel&m=144862480822027

, which may be helpful ;-)

Regards,
Boqun

> --
>     Manfred


